A1-Mini: 500M Parameter Language Model

A1-Mini is a decoder-only transformer language model with approximately 500 million parameters, designed for efficient training and inference.

Features

Model Architecture

  • Decoder-only Transformer: GPT-style architecture optimized for text generation
  • Rotary Positional Embeddings (RoPE): Encodes relative positions by rotating query and key vectors, improving handling of long-range dependencies
  • SwiGLU Activation: Gated (SiLU) feed-forward layers in place of the standard MLP for improved performance (see the sketch after this list)
  • Multi-head Attention: Efficient attention mechanism with configurable heads
  • Layer Normalization: Pre-norm architecture for stable training
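
For reference, a SwiGLU feed-forward block looks roughly like the sketch below. The class name and intermediate size are illustrative, not necessarily what model.py uses:

import torch.nn as nn
import torch.nn.functional as F

# Illustrative SwiGLU feed-forward block; the names and intermediate size
# are assumptions, not necessarily those used in model.py
class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=1024, intermediate_size=2816):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit, then projection back to the hidden size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))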

Training Features

  • Mixed Precision Training: Automatic mixed precision with gradient scaling (see the training-step sketch after this list)
  • Distributed Training: Multi-GPU support with DistributedDataParallel
  • Gradient Checkpointing: Memory-efficient training for large models
  • Advanced Schedulers: Learning rate warmup followed by cosine or linear decay
  • Comprehensive Logging: Detailed metrics tracking and visualization
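
As a rough illustration of how mixed precision and gradient scaling fit into one training step (the real loop lives in train.py; all names below are placeholders):

import torch

def train_step(model, batch, optimizer, loss_fn, scaler):
    """One mixed-precision training step. Illustrative only; the real loop is in train.py.
    scaler is a torch.cuda.amp.GradScaler created once, outside the loop."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(batch["input_ids"])
        loss = loss_fn(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients, skips the step on inf/NaN
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()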

Data Processing

  • Flexible Tokenization: BPE tokenizer with character-level fallback
  • Smart Preprocessing: Text cleaning, normalization, and format conversion
  • Streaming Support: Memory-efficient data loading for large datasets (sketched after this list)
  • Multiple Formats: Support for TXT, JSON, and JSONL data formats
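
As an example of streaming, a JSONL file can be read line by line without loading it all into memory. This is only a sketch; the project's own loaders live in the dataset/ package and may differ:

import json
from torch.utils.data import IterableDataset

class JSONLTextStream(IterableDataset):
    """Yield the text field of each JSONL line lazily (illustrative sketch)."""

    def __init__(self, path, text_field="text"):
        self.path = path
        self.text_field = text_field

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)[self.text_field]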

Quick Start

1. Installation

# Clone or download the project
# Navigate to project directory

# Install dependencies
pip install -r requirements.txt

2. Validate Setup

python run_check.py

This will verify that all components are working correctly.

3. Prepare Data

Place your training data in data/raw/:

  • .txt files: Plain text
  • .json files: JSON with text fields
  • .jsonl files: JSON Lines format

The model will automatically preprocess your data and train a tokenizer.
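
For example, a minimal JSONL file can be created like this. The "text" field name is an assumption based on common convention; check dataset/data_preprocess.py for the exact schema expected:

import json
import os

# Write a tiny example dataset to data/raw/ (the "text" field name is an
# assumption; see dataset/data_preprocess.py for the exact schema)
os.makedirs("data/raw", exist_ok=True)
samples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Language models learn statistical patterns from text."},
]
with open("data/raw/sample.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")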

4. Start Training

python train.py

5. Monitor Training

Check logs in logs/training.log and model checkpoints in models/saved/.

Configuration

Model Size (500M parameters)

  • Hidden Size: 1024
  • Layers: 16
  • Attention Heads: 16
  • Vocabulary: 32,000 tokens
  • Context Length: 2048 tokens

Training Settings

  • Batch Size: 8 (with gradient accumulation)
  • Learning Rate: 3e-4 with a cosine schedule and warmup
  • Optimizer: AdamW with weight decay
  • Mixed Precision: Enabled for efficiency
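
In code, these settings might map onto config.py roughly as follows. The attribute names are assumptions for illustration; check A1MiniConfig for the real ones:

from config import A1MiniConfig

# Attribute names below are assumptions; only batch_size and learning_rate
# are confirmed elsewhere in this README
config = A1MiniConfig()
config.hidden_size = 1024
config.num_layers = 16
config.num_attention_heads = 16
config.vocab_size = 32000
config.max_seq_length = 2048
config.batch_size = 8
config.learning_rate = 3e-4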

Customization

Changing Model Size

Edit config.py to modify model parameters:

# For smaller model (~125M params)
config = A1MiniConfigs.get_125m_config()

# For larger model (~1B params)  
config = A1MiniConfigs.get_1b_config()

Training Configuration

Modify training settings in config.py:

# Adjust batch size for your GPU memory
config.batch_size = 4  # Reduce for smaller GPUs
config.gradient_accumulation_steps = 8  # Maintain effective batch size

# Modify learning rate and schedule
config.learning_rate = 1e-4
config.warmup_steps = 1000
config.max_steps = 50000

Data Processing

Customize preprocessing in dataset/data_preprocess.py:

preprocess_config = {
    'normalize_unicode': True,
    'remove_extra_whitespace': True,
    'min_length': 50,        # Minimum text length
    'max_length': 4096,      # Maximum text length
}

Advanced Usage

Distributed Training

# Multi-GPU training
torchrun --nproc_per_node=4 train.py

# Multi-node training
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT train.py
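
Under the hood, torchrun sets environment variables (RANK, LOCAL_RANK, WORLD_SIZE) that the training script consumes. train.py already handles this; a generic sketch looks like:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model):
    """Generic DDP setup driven by torchrun's environment variables (sketch only)."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])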

Resume Training

python train.py --resume_from_checkpoint models/saved/latest_checkpoint.pt

Inference

from model import A1MiniModel
from config import A1MiniConfig
from dataset.tokenizer import get_tokenizer
import torch

# Load configuration and model
config = A1MiniConfig()
model = A1MiniModel(config)
tokenizer = get_tokenizer(config)

# Load trained weights
checkpoint = torch.load('models/saved/best_checkpoint.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Generate text
input_text = "The future of artificial intelligence"
input_ids = torch.tensor([tokenizer.encode(input_text)]).long()

with torch.no_grad():
    generated = model.generate(
        input_ids,
        max_length=100,
        temperature=0.8,
        top_p=0.9
    )

output = tokenizer.decode(generated[0].tolist())
print(output)
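
To see what temperature and top_p control, a single sampling step looks roughly like the sketch below (model.generate's internals may differ):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """One temperature + nucleus (top-p) sampling step; illustrative only."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative probability before them exceeds top_p
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, next_sorted)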

Performance Tips

  1. GPU Memory: Use gradient checkpointing for larger models
  2. Data Loading: Increase num_workers in DataLoader for faster I/O
  3. Batch Size: Find the largest batch size that fits in GPU memory
  4. Mixed Precision: Keep it enabled on modern GPUs (Volta and newer)
  5. Compilation: Use torch.compile() on PyTorch 2.0+ for speed improvements (one-line example below)
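
For example, compilation is a one-line change on PyTorch 2.0+ (the model and config classes are this project's own; the speedup you see will vary):

import torch
from config import A1MiniConfig
from model import A1MiniModel

model = A1MiniModel(A1MiniConfig())
# PyTorch 2.0+ only: JIT-compile the model's forward pass for extra speed
model = torch.compile(model)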

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA 11.7+ (for GPU training)
  • 16GB+ RAM recommended
  • 8GB+ GPU memory for training

License

This project is provided as-is for educational and research purposes.

Contributing

Feel free to submit issues and enhancement requests!


A1-Mini - A complete, end-to-end language model implementation for students and researchers.
