# TinyGPT Training on Google Colab

This notebook trains TinyGPT from scratch on a Colab GPU, samples text from the trained model, and exports it for use with llama.cpp.

**Prerequisites:**
- Google account with Drive access
- Colab Pro recommended for longer training

**Runtime:** In the Colab menu, open Runtime > Change runtime type and select a GPU before running any cells.

## 1. Setup

In [None]:
# Mount Google Drive (for saving checkpoints)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Clone the repository (replace YOUR_USERNAME with your GitHub username)
!git clone https://github.com/YOUR_USERNAME/LLM_From_Scratch.git /content/LLM_From_Scratch
%cd /content/LLM_From_Scratch

In [None]:
# Install dependencies
!pip install -q torch numpy tqdm safetensors

In [None]:
# Check GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Download and Prepare Data

In [None]:
# Download dataset (choose one)
# Tiny (~5MB) - for quick testing
!./scripts/download_data.sh --tiny

# Small (~50MB) - for toy model
# !./scripts/download_data.sh --small

# Medium (~200MB) - for serious training
# !./scripts/download_data.sh --medium

In [None]:
# Prepare data (train tokenizer + encode tokens)
!python scripts/prepare_data.py --vocab_size 4096

## 3. Train the Model

In [None]:
# Configuration
PRESET = "toy"  # or "small"
MAX_STEPS = 2000
BATCH_SIZE = 64
GRAD_ACCUM = 2
CHECKPOINT_DIR = "/content/drive/MyDrive/tinygpt_checkpoints"

# Create checkpoint directory
import os
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
print(f"Checkpoints will be saved to: {CHECKPOINT_DIR}")
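
The two batch settings combine. Assuming `src.train` follows the usual gradient-accumulation convention (gradients from `GRAD_ACCUM` micro-batches of `BATCH_SIZE` sequences are accumulated before each optimizer step, and `MAX_STEPS` counts optimizer steps), the effective batch size can be worked out as in this minimal sketch. The helper names `effective_batch` and `sequences_seen` are hypothetical and not part of the repository.

```python
def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Sequences contributing to one optimizer step (assumed convention)."""
    if batch_size < 1 or grad_accum < 1:
        raise ValueError("batch_size and grad_accum must be positive")
    return batch_size * grad_accum


def sequences_seen(max_steps: int, batch_size: int, grad_accum: int) -> int:
    """Total training sequences processed over the whole run."""
    return max_steps * effective_batch(batch_size, grad_accum)


# Values from the configuration cell above
print(effective_batch(64, 2))        # 128
print(sequences_seen(2000, 64, 2))   # 256000
```

Under the same assumption, halving `BATCH_SIZE` and doubling `GRAD_ACCUM` keeps the effective batch unchanged, which may help if a run hits an out-of-memory error.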

In [None]:
# Train!
!python -m src.train \
    --preset {PRESET} \
    --max_steps {MAX_STEPS} \
    --batch_size {BATCH_SIZE} \
    --grad_accum {GRAD_ACCUM} \
    --checkpoint_dir {CHECKPOINT_DIR} \
    --eval_interval 250 \
    --save_interval 500
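
Once training finishes, or after a runtime disconnect, it can help to confirm what actually landed in the checkpoint directory before sampling. This is a minimal sketch using only the standard library. `list_checkpoints` is a hypothetical helper, and the `.pt` extension is an assumption based on the `best.pt` file name used in the sampling cells.

```python
from pathlib import Path


def list_checkpoints(checkpoint_dir: str) -> list[tuple[str, float]]:
    """Return (file name, size in MB) for every .pt file, sorted by name."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        # Drive not mounted or training has not written anything yet
        return []
    return [
        (path.name, path.stat().st_size / 1e6)
        for path in sorted(directory.glob("*.pt"))
    ]


# Path taken from CHECKPOINT_DIR in the configuration cell
for name, size_mb in list_checkpoints("/content/drive/MyDrive/tinygpt_checkpoints"):
    print(f"{name}: {size_mb:.1f} MB")
```

An empty listing usually means Drive is not mounted or no save interval has been reached yet.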

## 4. Generate Text

In [None]:
# Generate from trained model
!python -m src.sample \
    --checkpoint {CHECKPOINT_DIR}/best.pt \
    --tokenizer data/tokenizer \
    --prompt "Once upon a time" \
    --max_tokens 100 \
    --temperature 0.8

In [None]:
# Try different prompts
prompts = [
    "The quick brown fox",
    "It was a dark and stormy night",
    "In the beginning",
    "Hello, my name is"
]

for prompt in prompts:
    print(f"\n{'='*50}")
    print(f"Prompt: {prompt}")
    print(f"{'='*50}")
    !python -m src.sample \
        --checkpoint {CHECKPOINT_DIR}/best.pt \
        --tokenizer data/tokenizer \
        --prompt "{prompt}" \
        --max_tokens 50 \
        --temperature 0.7
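
The `"{prompt}"` interpolation in the loop passes each prompt through the shell, so a prompt that itself contains a double quote, a backtick, or `$` may be split or rewritten before `src.sample` sees it. One way around that, sketched below, is to build the argument list in Python and hand it to `subprocess.run`, which bypasses the shell. `build_sample_cmd` and `run_sample` are hypothetical helpers. The flags are the ones already used in this notebook.

```python
import subprocess
import sys


def build_sample_cmd(checkpoint, prompt, max_tokens=50, temperature=0.7,
                     tokenizer="data/tokenizer"):
    """Build an argument list for src.sample; no shell quoting is involved."""
    return [
        sys.executable, "-m", "src.sample",
        "--checkpoint", checkpoint,
        "--tokenizer", tokenizer,
        "--prompt", prompt,
        "--max_tokens", str(max_tokens),
        "--temperature", str(temperature),
    ]


def run_sample(cmd):
    """Run the command from the repository root and raise on failure."""
    subprocess.run(cmd, check=True)


cmd = build_sample_cmd(
    "/content/drive/MyDrive/tinygpt_checkpoints/best.pt",
    'She said "hello" and',
)
print(cmd)
# run_sample(cmd)  # uncomment once best.pt exists
```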

## 5. Export to Hugging Face Format (for GGUF Conversion)

In [None]:
# Export model
!python -m src.export_hf \
    --checkpoint {CHECKPOINT_DIR}/best.pt \
    --tokenizer data/tokenizer \
    --output exports/tinygpt

In [None]:
# Copy to Google Drive for local use
!cp -r exports/tinygpt /content/drive/MyDrive/
print("Exported model copied to Google Drive!")

## 6. Convert to GGUF (Optional)

You can run this step on Colab or on your local machine. It converts the model exported in step 5.

In [None]:
# Clone llama.cpp and install the gguf Python package
!git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
!pip install -q gguf

In [None]:
# Convert to GGUF
!python scripts/convert_to_gguf.py \
    --input exports/tinygpt \
    --output exports/tinygpt/model-f16.gguf

In [None]:
# Copy GGUF to Drive
!cp exports/tinygpt/*.gguf /content/drive/MyDrive/
print("GGUF model copied to Google Drive!")
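
Before relying on the copied file, a quick sanity check is to read its header: a GGUF file begins with the four magic bytes `GGUF`, followed by a 32-bit format version. The sketch below assumes a little-endian file, which is the common case. `read_gguf_version` is a hypothetical helper, not part of the repository or of llama.cpp.

```python
import os
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Assumes little-endian encoding of the version field
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Output path from the conversion cell above
gguf_path = "exports/tinygpt/model-f16.gguf"
if os.path.exists(gguf_path):
    print(f"GGUF version: {read_gguf_version(gguf_path)}")
```

A `ValueError` here suggests the conversion was interrupted or wrote a different format.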

## Done!

Your model files are saved to Google Drive:
- Checkpoints: `/content/drive/MyDrive/tinygpt_checkpoints/`
- Exported model: `/content/drive/MyDrive/tinygpt/`
- GGUF model (if you ran step 6): `/content/drive/MyDrive/model-f16.gguf`

To run locally with llama.cpp:
1. Download `model-f16.gguf` from Drive
2. Quantize it: `./scripts/gguf_quantize.sh model-f16.gguf`
3. Run the quantized model: `./scripts/run_llamacpp.sh model-q4_k_m.gguf`