# Phase 2.1: Base Pre-Training for Mathematical Reasoning

This notebook walks through the full pre-training pipeline for a decoder-only transformer on a mix of mathematical and general text corpora.

## 🚀 Quick Start

**For GPU Training (Recommended):**
1. Open this notebook in [Google Colab](https://colab.research.google.com)
2. Go to Runtime → Change runtime type → Select GPU (T4 or better)
3. Run all cells

**For CPU Testing (Local):**
- Run all cells; the notebook falls back to a smaller model and fewer training steps

## 📦 What's Included

- Streaming dataset for large-scale corpora
- Mixed-domain sampling (ArXiv + General text)
- Distributed training support (DDP)
- Mixed precision (fp16/bf16)
- Gradient accumulation
- Learning rate scheduling
- Automatic checkpointing
- TensorBoard logging

## 1. Setup and Installation

In [None]:
# Check if running on Colab
try:
    import google.colab
    IN_COLAB = True
    print("✓ Running on Google Colab")
except ImportError:
    IN_COLAB = False
    print("✓ Running locally")

# Clone repository if on Colab
if IN_COLAB:
    print("\nCloning repository...")
    !git clone https://github.com/Alpyaman/AI-Mathematical-Olympiad.git
    %cd AI-Mathematical-Olympiad
    !git checkout claude/setup-decoder-transformer-tpZw9
    print("✓ Repository cloned")

In [None]:
# Install dependencies
print("Installing dependencies...")
!pip install -q torch numpy tqdm

# Optional: Install TensorBoard for logging
!pip install -q tensorboard

print("✓ Dependencies installed")

In [None]:
# Import required libraries
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import os
import json
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Import Phase 2.1 Components

In [None]:
# Import model and tokenizer
from src import (
    get_small_config,
    get_base_config,
    MathTransformerDecoder,
    MathTokenizer,
)

# Import training infrastructure
from src.training import PreTrainer, PreTrainingConfig

# Import data utilities
from src.data.pretraining_dataset import (
    create_sample_pretraining_data,
    prepare_pretraining_data,
    PreTrainingDataCollator,
)

print("✓ All Phase 2.1 components imported successfully")

## 3. Configure Training

We'll detect the available hardware and pick the batch size, step count, and precision to match.

In [None]:
# Detect hardware and configure accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"
USE_GPU = device == "cuda"

if USE_GPU:
    # GPU Configuration - Faster training
    print("🚀 GPU Training Configuration")
    MODEL_SIZE = "small"  # Can use "base" for larger GPUs
    BATCH_SIZE = 4
    GRAD_ACCUM_STEPS = 8
    MAX_STEPS = 500  # Increase to 10000+ for real training
    MIXED_PRECISION = "bf16" if torch.cuda.is_bf16_supported() else "fp16"
    NUM_WORKERS = 2
else:
    # CPU Configuration - Slower but still works
    print("💻 CPU Training Configuration (Demo Mode)")
    MODEL_SIZE = "small"
    BATCH_SIZE = 1
    GRAD_ACCUM_STEPS = 2
    MAX_STEPS = 20  # Very short for CPU demo
    MIXED_PRECISION = "fp32"
    NUM_WORKERS = 0

print(f"\nTraining Configuration:")
print(f"  Device: {device}")
print(f"  Model size: {MODEL_SIZE}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Gradient accumulation: {GRAD_ACCUM_STEPS}")
print(f"  Effective batch size: {BATCH_SIZE * GRAD_ACCUM_STEPS}")
print(f"  Max steps: {MAX_STEPS}")
print(f"  Mixed precision: {MIXED_PRECISION}")
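The effective batch size printed above counts sequences, not tokens. As a rough worked example for the GPU settings, assuming every sequence is filled to the 512-token `max_seq_length` used later in this notebook, a single GPU, and that `MAX_STEPS` counts optimizer steps (none of which is confirmed by the `PreTrainer` code shown here), the token budget works out as follows:

```python
# Worked arithmetic for the GPU configuration above.
# Assumptions: sequences are filled to max_seq_length, one data-parallel
# worker, and max_steps counts optimizer steps. Padded batches carry
# fewer useful tokens than this upper bound.
batch_size = 4          # BATCH_SIZE
grad_accum_steps = 8    # GRAD_ACCUM_STEPS
max_seq_length = 512
max_steps = 500         # MAX_STEPS

sequences_per_step = batch_size * grad_accum_steps
tokens_per_step = sequences_per_step * max_seq_length
total_tokens = tokens_per_step * max_steps

print(f"Sequences per optimizer step: {sequences_per_step}")   # 32
print(f"Tokens per optimizer step:    {tokens_per_step:,}")    # 16,384
print(f"Tokens over the whole run:    {total_tokens:,}")       # 8,192,000
```

About 8M tokens is enough to confirm the pipeline runs, which is why the comment next to `MAX_STEPS` suggests 10000+ steps for real training.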

## 4. Prepare Pre-Training Data

We'll create sample mathematical and general text data for demonstration.

In [None]:
# Create sample data
print("Creating sample pre-training data...")
print("-" * 70)

data_dir = "./data/pretraining_demo"
create_sample_pretraining_data(data_dir)

print("\n✓ Sample data created!")
print(f"  Location: {data_dir}")
print(f"  ArXiv samples: 5 mathematical texts")
print(f"  General samples: 5 general texts")

In [None]:
# Preview the data
print("Sample ArXiv text:")
print("=" * 70)
with open(f"{data_dir}/arxiv/sample.jsonl", 'r') as f:
    sample = json.loads(f.readline())
    print(sample['text'])

print("\n" + "=" * 70)
print("Sample General text:")
print("=" * 70)
with open(f"{data_dir}/general/sample.jsonl", 'r') as f:
    sample = json.loads(f.readline())
    print(sample['text'])

## 5. Initialize Tokenizer

Next we initialize the mathematical tokenizer, whose vocabulary includes 200+ mathematical symbols.

In [None]:
# Initialize tokenizer
print("Initializing mathematical tokenizer...")
tokenizer = MathTokenizer()

print(f"✓ Tokenizer initialized")
print(f"  Vocabulary size: {len(tokenizer):,}")
print(f"  Special tokens: {tokenizer.special_tokens}")
print(f"  Mathematical symbols: 200+")

# Test tokenization
test_text = "Let f: ℝ → ℝ be continuous. Then ∫₀¹ f(x)dx exists."
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded['input_ids'])

print(f"\nTokenization test:")
print(f"  Original: {test_text}")
print(f"  Decoded:  {decoded}")
print(f"  Tokens: {len(encoded['input_ids'])}")

## 6. Prepare Streaming Dataset

Create a mixed-domain dataset that samples 30% from ArXiv and 70% from general text.

In [None]:
# Prepare streaming dataset
print("Preparing mixed-domain streaming dataset...")
print("-" * 70)

train_dataset = prepare_pretraining_data(
    data_dir=data_dir,
    sources=["arxiv", "general"],
    tokenizer=tokenizer,
    max_seq_length=512,  # Shorter for demo
    mix_weights=[0.3, 0.7],  # 30% math, 70% general
)

print("✓ Streaming dataset created")
print(f"  Data sources: ArXiv (30%), General (70%)")
print(f"  Streaming mode: Yes (memory efficient)")
print(f"  Max sequence length: 512 tokens")
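To make the 30/70 mix concrete, here is a minimal sketch of weighted interleaving over streams. It is not the implementation behind `prepare_pretraining_data`; the helper `mix_streams` and the toy streams are hypothetical, and the real dataset may mix at a different granularity (for example per token rather than per document).

```python
import itertools
import random
from typing import Dict, Iterator, List, Tuple

def mix_streams(
    streams: Dict[str, Iterator[str]],
    weights: List[float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, item) pairs, choosing the source by weight each time."""
    rng = random.Random(seed)
    active = dict(zip(streams, weights))  # source name -> sampling weight
    while active:
        name = rng.choices(list(active), weights=list(active.values()), k=1)[0]
        try:
            yield name, next(streams[name])
        except StopIteration:
            del active[name]  # source exhausted: keep sampling from the rest

# Toy endless streams standing in for the ArXiv and general corpora.
streams = {
    "arxiv": itertools.cycle(["math doc"]),
    "general": itertools.cycle(["general doc"]),
}
sample = list(itertools.islice(mix_streams(streams, [0.3, 0.7]), 10_000))
arxiv_share = sum(1 for name, _ in sample if name == "arxiv") / len(sample)
print(f"ArXiv share over {len(sample):,} draws: {arxiv_share:.3f}")
```

Because each draw is independent, the observed share only approaches 30% over many samples; short runs such as the CPU demo can deviate noticeably.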

In [None]:
# Create data loader
collator = PreTrainingDataCollator(pad_token_id=tokenizer.pad_token_id)
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    collate_fn=collator,
)

print("✓ Data loader created")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Workers: {NUM_WORKERS}")

# Test loading a batch
sample_batch = next(iter(train_loader))
print(f"\nSample batch:")
print(f"  Input shape: {sample_batch['input_ids'].shape}")
print(f"  Attention mask shape: {sample_batch['attention_mask'].shape}")
print(f"  Labels shape: {sample_batch['labels'].shape}")
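The three keys in the sample batch are the usual inputs for causal language modeling. The sketch below shows one common way a collator can build them, assuming right-padding and padded label positions masked with `-100` (the default `ignore_index` of `torch.nn.CrossEntropyLoss`). Whether `PreTrainingDataCollator` does exactly this, and whether the one-token label shift happens in the collator or inside the model, is not shown in this notebook; `collate_causal_lm` is a hypothetical name.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def collate_causal_lm(
    sequences: List[List[int]], pad_token_id: int
) -> Dict[str, List[List[int]]]:
    """Right-pad token sequences and mask padded positions in the labels."""
    max_len = max(len(seq) for seq in sequences)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [], "attention_mask": [], "labels": []
    }
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [8, 9, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
print(batch["labels"])          # [[5, 6, 7], [8, 9, -100]]
```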

## 7. Initialize Model

Create the decoder-only transformer from Phase 1.1.

In [None]:
# Get model configuration
if MODEL_SIZE == "small":
    config = get_small_config()
    # Further reduce for demo if on CPU
    if not USE_GPU:
        config.hidden_size = 256
        config.num_hidden_layers = 4
        config.num_attention_heads = 4
        config.num_key_value_heads = 4
        config.intermediate_size = 1024
elif MODEL_SIZE == "base":
    config = get_base_config()
else:
    raise ValueError(f"Unknown MODEL_SIZE: {MODEL_SIZE!r} (expected 'small' or 'base')")

# Update vocab size to match tokenizer
config.vocab_size = len(tokenizer)
config.max_position_embeddings = 512

# Initialize model
print(f"Initializing {MODEL_SIZE} model...")
model = MathTransformerDecoder(config)

# Count parameters
num_params = sum(p.numel() for p in model.parameters())
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✓ Model initialized")
print(f"  Architecture: Decoder-only (Llama-style)")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Layers: {config.num_hidden_layers}")
print(f"  Attention heads: {config.num_attention_heads}")
print(f"  Parameters: {num_params:,} ({num_trainable:,} trainable)")
print(f"  Positional encoding: RoPE (dynamic scaling)")
print(f"  Activation: SwiGLU")

# Show model size in MB
param_size_mb = num_params * 4 / (1024 ** 2)  # 4 bytes per float32 param
print(f"  Model size: {param_size_mb:.2f} MB")
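The model size above covers the weights only. During training, memory also holds gradients and optimizer state. As a rough estimate, assuming full fp32 training with an Adam-style optimizer (two moment buffers per parameter; the optimizer used by `PreTrainer` is not shown here) and ignoring activations, the static footprint is about four times the weight size. The function name and the 10M-parameter figure are hypothetical.

```python
def training_memory_mib(num_params: int, bytes_per_value: int = 4) -> dict:
    """Static training memory in MiB, excluding activations.

    Assumes weights, gradients, and two Adam moment buffers all use
    the same number of bytes per value.
    """
    mib = 1024 ** 2
    weights = num_params * bytes_per_value / mib
    return {
        "weights": weights,
        "gradients": weights,
        "adam_moments": 2 * weights,
        "total": 4 * weights,
    }

# Hypothetical 10M-parameter model, not the notebook's actual count.
estimate = training_memory_mib(num_params=10_000_000)
for name, value in estimate.items():
    print(f"{name:>12}: {value:8.2f} MiB")
```

Activations usually add to this and grow with batch size and sequence length, which is where gradient checkpointing helps.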

## 8. Configure Pre-Training

In [None]:
# Create training configuration
training_config = PreTrainingConfig(
    model_config_name=MODEL_SIZE,
    vocab_size=config.vocab_size,
    max_seq_length=512,
    data_dir=data_dir,
    
    # Training hyperparameters
    micro_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    max_steps=MAX_STEPS,
    warmup_steps=min(50, MAX_STEPS // 10),
    learning_rate=3e-4,
    
    # Optimization
    mixed_precision=MIXED_PRECISION,
    gradient_checkpointing=USE_GPU,  # Only on GPU
    
    # Checkpointing
    checkpoint_dir="./checkpoints/pretraining_notebook",
    save_interval=max(100, MAX_STEPS // 2),
    
    # Logging
    log_interval=5 if not USE_GPU else 10,
    use_wandb=False,
    use_tensorboard=False,  # Disabled for cleaner output
    
    # System
    num_workers=NUM_WORKERS,
    seed=42,
)

print("✓ Training configuration created")
print(f"\nConfiguration summary:")
print(f"  Effective batch size: {training_config.effective_batch_size}")
print(f"  Total steps: {training_config.max_steps}")
print(f"  Warmup steps: {training_config.warmup_steps}")
print(f"  Peak learning rate: {training_config.learning_rate}")
print(f"  Mixed precision: {training_config.mixed_precision}")
print(f"  Gradient checkpointing: {training_config.gradient_checkpointing}")
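The configuration names a warmup length and a peak learning rate but not the shape of the schedule. A common choice, shown here purely as an assumption, is linear warmup followed by cosine decay. The function `lr_at_step` and the `min_lr` parameter are hypothetical and not part of `PreTrainingConfig`.

```python
import math

def lr_at_step(
    step: int,
    peak_lr: float = 3e-4,
    warmup_steps: int = 50,
    max_steps: int = 500,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        # Ramp linearly so the final warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 25, 50, 275, 500):
    print(f"step {step:>3}: lr = {lr_at_step(step):.2e}")
```

The defaults mirror the GPU settings above: `warmup_steps=min(50, 500 // 10)` is 50, and the peak is `3e-4`.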

## 9. Initialize Pre-Trainer

In [None]:
# Initialize pre-trainer
print("Initializing Pre-Trainer...")
print("=" * 70)

trainer = PreTrainer(
    model=model,
    config=training_config,
    train_dataloader=train_loader,
    val_dataloader=None,
)

print("\n✓ Pre-Trainer ready!")

## 10. Run Pre-Training

This is where the magic happens! 🎯

In [None]:
# Start training
print("\n" + "=" * 70)
print("STARTING BASE PRE-TRAINING")
print("=" * 70)

if not USE_GPU:
    print("⚠️  Running on CPU - this will be slow!")
    print("    For faster training, use Google Colab with GPU")
    print()

trainer.train()

print("\n" + "=" * 70)
print("✅ PRE-TRAINING COMPLETE!")
print("=" * 70)
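The training loop itself lives inside `trainer.train()` and is not shown in this notebook. To illustrate the gradient-accumulation idea it is configured with, here is a small pure-Python sketch (not the `PreTrainer` code): for a mean loss and equally sized micro-batches, summing micro-batch gradients scaled by `1 / accum_steps` reproduces the gradient of the full batch. The toy model `y = w * x` and all names are hypothetical.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
accum_steps = 2
micro_size = len(xs) // accum_steps

# Gradient computed on the full batch in one pass.
full_grad = grad_mse(w, xs, ys)

# Same gradient built up from micro-batches, as gradient accumulation does.
accumulated = 0.0
for i in range(accum_steps):
    lo, hi = i * micro_size, (i + 1) * micro_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(f"Full-batch gradient:  {full_grad}")    # -22.5
print(f"Accumulated gradient: {accumulated}")  # -22.5
```

This is why a micro-batch of 4 with 8 accumulation steps behaves like a batch of 32 for the optimizer while needing only the memory of 4 sequences at a time.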

## 11. Test the Trained Model

Let's generate some text to see what the model learned!

In [None]:
# Function to generate text
@torch.no_grad()
def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8, device="cuda"):
    """Generate text continuation from a prompt."""
    model.eval()
    
    # Encode prompt
    encoded = tokenizer.encode(prompt)
    input_ids = torch.tensor([encoded['input_ids']], dtype=torch.long).to(device)
    
    # Generate
    for _ in range(max_length):
        # Forward pass
        outputs = model(input_ids)
        
        # Get next token logits
        next_token_logits = outputs[0, -1, :] / temperature
        
        # Sample next token
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        
        # Append to sequence
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        
        # Stop if EOS token
        if next_token.item() == tokenizer.eos_token_id:
            break
    
    # Decode
    generated_ids = input_ids[0].tolist()
    return tokenizer.decode(generated_ids)

print("Testing text generation...")
print("=" * 70)
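The `temperature` argument in `generate_text` divides the logits before the softmax. The standalone sketch below, using only the standard library and made-up logits, shows the effect: values below 1 concentrate probability on the top token, values above 1 flatten the distribution. The helper name is hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
for t in (0.5, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

# Sampling one token id from the T=0.8 distribution, as generate_text does
# with torch.multinomial.
rng = random.Random(0)
next_token = rng.choices(
    range(len(logits)), weights=softmax_with_temperature(logits, 0.8), k=1
)[0]
print(f"Sampled token id: {next_token}")
```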

In [None]:
# Test 1: Mathematical prompt
math_prompt = "Let f: ℝ → ℝ be a continuous function. Then"
generated = generate_text(
    model=trainer.raw_model,
    tokenizer=tokenizer,
    prompt=math_prompt,
    max_length=30,
    device=device
)

print("Mathematical generation:")
print(f"Prompt: {math_prompt}")
print(f"Generated: {generated}")
print()

In [None]:
# Test 2: Theorem prompt
theorem_prompt = "Theorem: For any prime number p, we have"
generated = generate_text(
    model=trainer.raw_model,
    tokenizer=tokenizer,
    prompt=theorem_prompt,
    max_length=30,
    device=device
)

print("Theorem generation:")
print(f"Prompt: {theorem_prompt}")
print(f"Generated: {generated}")
print()

In [None]:
# Test 3: General text prompt
general_prompt = "The history of mathematics began in"
generated = generate_text(
    model=trainer.raw_model,
    tokenizer=tokenizer,
    prompt=general_prompt,
    max_length=30,
    device=device
)

print("General text generation:")
print(f"Prompt: {general_prompt}")
print(f"Generated: {generated}")

## 12. Save and Load Checkpoints

In [None]:
# Save final checkpoint
checkpoint_path = Path(training_config.checkpoint_dir) / "final_notebook.pt"
trainer.save_checkpoint("final_notebook.pt")

print(f"✓ Checkpoint saved to: {checkpoint_path}")
print(f"\nCheckpoint contains:")
print(f"  - Model weights")
print(f"  - Optimizer state")
print(f"  - Training step: {trainer.global_step}")
print(f"  - Tokens seen: {trainer.tokens_seen:,}")

In [None]:
# Example: Load checkpoint
if checkpoint_path.exists():
    print("Loading checkpoint...")
    checkpoint = torch.load(checkpoint_path, map_location=device)
    
    print(f"\n✓ Checkpoint loaded")
    print(f"  Training step: {checkpoint['global_step']}")
    print(f"  Tokens seen: {checkpoint['tokens_seen']:,}")
    print(f"  Config: {checkpoint['config']['model_config_name']}")

## 13. Summary and Next Steps

In [None]:
print("=" * 70)
print("PHASE 2.1 DEMONSTRATION COMPLETE!")
print("=" * 70)
print("\n✓ Successfully demonstrated:")
print("  1. Streaming dataset for large-scale corpora")
print("  2. Mixed-domain data sampling (ArXiv + General)")
print("  3. Decoder-only transformer architecture")
print("  4. Pre-training with causal language modeling")
print("  5. Mixed precision training" if USE_GPU else "  5. CPU training (demo mode)")
print("  6. Automatic checkpointing")
print("  7. Text generation from trained model")
print()
print("Training Statistics:")
print(f"  Steps completed: {trainer.global_step}")
print(f"  Tokens processed: {trainer.tokens_seen:,}")
print(f"  Model parameters: {num_params:,}")
print()
print("Next steps for FULL pre-training:")
print("  1. Prepare large-scale datasets:")
print("     - ArXiv papers (LaTeX extraction): ~2M papers")
print("     - C4 corpus: 750GB of web text")
print("     - Wikipedia: ~6M articles")
print("     - Books corpus")
print()
print("  2. Scale up training:")
print("     - Use base or large model")
print("     - Train for 100K-1M steps")
print("     - Use multiple GPUs with DDP:")
print("       torchrun --nproc_per_node=4 pretrain.py")
print()
print("  3. Monitor with wandb:")
print("     python pretrain.py --use-wandb --wandb-project my-project")
print()
print("  4. Proceed to Phase 2.2: Mathematical Fine-tuning")
print("     - Fine-tune on MATH dataset")
print("     - Add reinforcement learning")
print("     - Outcome supervision")
print()
print("Checkpoints saved to:", training_config.checkpoint_dir)
print("=" * 70)
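When sizing a full run, a commonly cited rule of thumb from the Chinchilla paper listed under Key Papers is roughly 20 training tokens per model parameter. The sketch below turns that into a step count. Every number in it is a hypothetical illustration: a 125M-parameter model, 4 GPUs, and the demo's micro-batch, accumulation, and sequence-length settings carried over unchanged.

```python
import math

# Hypothetical planning numbers, not measurements from this notebook.
num_params = 125_000_000
tokens_per_param = 20        # approximate Chinchilla rule of thumb
num_gpus = 4
micro_batch_size = 4
grad_accum_steps = 8
max_seq_length = 512

token_budget = num_params * tokens_per_param
tokens_per_step = num_gpus * micro_batch_size * grad_accum_steps * max_seq_length
steps_needed = math.ceil(token_budget / tokens_per_step)

print(f"Token budget:    {token_budget:,}")     # 2,500,000,000
print(f"Tokens per step: {tokens_per_step:,}")  # 65,536
print(f"Steps needed:    {steps_needed:,}")     # 38,147
```

Larger models or longer sequences shift these numbers substantially, so treat the result as an order-of-magnitude guide rather than a target.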

## 14. (Optional) Download Checkpoint

If running on Colab, you can download the checkpoint to your local machine.

In [None]:
if IN_COLAB:
    from google.colab import files
    
    # Zip checkpoints
    !zip -r checkpoints.zip checkpoints/
    
    print("Downloading checkpoint...")
    files.download('checkpoints.zip')
    print("✓ Download started")
else:
    print("Checkpoints are saved locally at:", training_config.checkpoint_dir)

---

## 📚 Additional Resources

**Documentation:**
- See `PHASE_2_1_README.md` for comprehensive documentation
- Run `python pretrain.py --help` for CLI options

**Scaling Up:**
```bash
# Multi-GPU training (4 GPUs)
torchrun --nproc_per_node=4 pretrain.py \
    --model-size base \
    --batch-size 4 \
    --gradient-accumulation-steps 8 \
    --max-steps 500000 \
    --mixed-precision bf16 \
    --use-wandb
```

**Key Papers:**
- [Chinchilla: Training Compute-Optimal LLMs](https://arxiv.org/abs/2203.15556)
- [LLaMA: Open Foundation LLMs](https://arxiv.org/abs/2302.13971)
- [Minerva: Mathematical Reasoning](https://arxiv.org/abs/2206.14858)

---

**Happy Pre-Training! 🚀**