A1-Mini is a decoder-only transformer language model with approximately 500 million parameters, designed for efficient training and inference.
- Decoder-only Transformer: GPT-style architecture optimized for text generation
- Rotary Positional Embeddings (RoPE): Advanced position encoding for better long-range dependencies
- SwiGLU Activation: Gated feed-forward networks for improved performance (see the sketch after this feature list)
- Multi-head Attention: Efficient attention mechanism with configurable heads
- Layer Normalization: Pre-norm architecture for stable training
- Mixed Precision Training: Automatic mixed precision with gradient scaling
- Distributed Training: Multi-GPU support with DistributedDataParallel
- Gradient Checkpointing: Memory-efficient training for large models
- Advanced Schedulers: Cosine warmup and linear decay learning rate schedules
- Comprehensive Logging: Detailed metrics tracking and visualization
- Flexible Tokenization: BPE tokenizer with character-level fallback
- Smart Preprocessing: Text cleaning, normalization, and format conversion
- Streaming Support: Memory-efficient data loading for large datasets
- Multiple Formats: Support for TXT, JSON, and JSONL data formats
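For readers who want to see how the architecture bullets fit together, below is a minimal sketch of a pre-norm decoder block with a SwiGLU feed-forward layer. The class and parameter names are illustrative rather than this repository's actual modules, and RoPE and the causal attention mask are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward: down_proj(SiLU(gate(x)) * up(x))."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormDecoderBlock(nn.Module):
    """Pre-norm block: normalize before attention/feed-forward, add residuals after."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.ffn = SwiGLUFeedForward(hidden_size, 4 * hidden_size)  # expansion factor is illustrative

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                    # residual connection around attention
        x = x + self.ffn(self.ffn_norm(x))  # residual connection around feed-forward
        return x
```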
```bash
# Clone or download the project
# Navigate to project directory
# Install dependencies
pip install -r requirements.txt
```

Then run the setup check:

```bash
python run_check.py
```

This will verify that all components are working correctly.
Place your training data in `data/raw/`:

- `.txt` files: Plain text
- `.json` files: JSON with text fields
- `.jsonl` files: JSON Lines format
The model will automatically preprocess your data and train a tokenizer.
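As an illustration, the snippet below writes a tiny `.jsonl` file into `data/raw/`. The `text` field name is an assumption here, so check `dataset/data_preprocess.py` for the exact keys the preprocessor expects.

```python
import json
from pathlib import Path

# Each line of a .jsonl file is one standalone JSON object.
# The "text" key is an assumption -- verify it against dataset/data_preprocess.py.
out_path = Path("data/raw/example.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8") as f:
    for doc in ["First training document.", "Second training document."]:
        f.write(json.dumps({"text": doc}) + "\n")
```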
```bash
python train.py
```

Check logs in `logs/training.log` and model checkpoints in `models/saved/`.
- Hidden Size: 1024
- Layers: 16
- Attention Heads: 16
- Vocabulary: 32,000 tokens
- Context Length: 2048 tokens
- Batch Size: 8 (with gradient accumulation)
- Learning Rate: 3e-4 with cosine warmup
- Optimizer: AdamW with weight decay
- Mixed Precision: Enabled for efficiency
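The last few defaults map onto standard PyTorch building blocks. The sketch below shows how AdamW, a cosine schedule with linear warmup, and mixed precision with gradient scaling typically fit together; it is not the actual `train.py` loop, and `model`, `train_loader`, the weight-decay value, and the loss-returning call are assumptions.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.cuda.amp import GradScaler, autocast

def cosine_warmup(warmup_steps: int, max_steps: int):
    """Linear warmup to the peak learning rate, then cosine decay."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight-decay value is illustrative
scheduler = LambdaLR(optimizer, lr_lambda=cosine_warmup(1000, 50_000))
scaler = GradScaler()

for input_ids, labels in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with autocast():                            # run the forward pass in mixed precision
        loss = model(input_ids, labels=labels)  # assumes the model returns a loss when given labels
    scaler.scale(loss).backward()               # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```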
Edit `config.py` to modify model parameters:

```python
# For smaller model (~125M params)
config = A1MiniConfigs.get_125m_config()

# For larger model (~1B params)
config = A1MiniConfigs.get_1b_config()
```

Modify training settings in `config.py`:
```python
# Adjust batch size for your GPU memory
config.batch_size = 4                   # Reduce for smaller GPUs
config.gradient_accumulation_steps = 8  # Maintain effective batch size

# Modify learning rate and schedule
config.learning_rate = 1e-4
config.warmup_steps = 1000
config.max_steps = 50000
```

Customize preprocessing in `dataset/data_preprocess.py`:
```python
preprocess_config = {
    'normalize_unicode': True,
    'remove_extra_whitespace': True,
    'min_length': 50,    # Minimum text length
    'max_length': 4096,  # Maximum text length
}
```

To train across multiple GPUs or nodes, launch with `torchrun`:

```bash
# Multi-GPU training
torchrun --nproc_per_node=4 train.py
# Multi-node training
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT train.py
```

To resume training from a checkpoint:

```bash
python train.py --resume_from_checkpoint models/saved/latest_checkpoint.pt
```

To generate text with a trained model:

```python
from model import A1MiniModel
from config import A1MiniConfig
from dataset.tokenizer import get_tokenizer
import torch
# Load configuration and model
config = A1MiniConfig()
model = A1MiniModel(config)
tokenizer = get_tokenizer(config)
# Load trained weights
checkpoint = torch.load('models/saved/best_checkpoint.pt', map_location='cpu')  # works on CPU-only machines too
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Generate text
input_text = "The future of artificial intelligence"
input_ids = torch.tensor([tokenizer.encode(input_text)]).long()
with torch.no_grad():
    generated = model.generate(
        input_ids,
        max_length=100,
        temperature=0.8,
        top_p=0.9
    )

output = tokenizer.decode(generated[0].tolist())
print(output)
```

- GPU Memory: Use gradient checkpointing for larger models
- Data Loading: Increase `num_workers` in DataLoader for faster I/O
- Batch Size: Find the largest batch size that fits in GPU memory
- Mixed Precision: Always enabled for modern GPUs (Volta+)
- Compilation: Use `torch.compile()` for PyTorch 2.0+ speed improvements
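The compilation and data-loading tips amount to a few lines; in this sketch `model` and `train_dataset` stand in for the project's actual objects, and `train.py` may already wire these up.

```python
import torch
from torch.utils.data import DataLoader

# PyTorch 2.0+: compile the model graph for faster training and inference
model = torch.compile(model)

# Use worker processes so data loading overlaps with GPU compute
train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=4,      # increase until data loading is no longer the bottleneck
    pin_memory=True,    # speeds up host-to-GPU transfers
    shuffle=True,
)
```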
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.7+ (for GPU training)
- 16GB+ RAM recommended
- 8GB+ GPU memory for training
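A quick way to confirm the PyTorch and CUDA side of these requirements from a Python shell (`run_check.py` likely performs similar checks):

```python
import torch

print(torch.__version__)          # expect 2.0 or newer
print(torch.cuda.is_available())  # True if CUDA training is possible
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```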
This project is provided as-is for educational and research purposes.
Feel free to submit issues and enhancement requests!
A1-Mini - A complete, production-ready language model implementation for students and researchers.