# Nanochat Full Training Pipeline on Google Colab

Run the complete nanochat "speedrun" pipeline end-to-end on Colab: environment bootstrapping, tokenizer training, pretraining, midtraining, supervised finetuning, optional RL, evaluation, and export.

## Pipeline Overview
- Prepare the runtime, GPU, and optional Google Drive storage.
- Clone the repo and install everything with `uv`.
- Build the Rust tokenizer, download datasets, and fetch evaluation bundles.
- Train through base → mid → SFT stages (optional RL) with periodic evaluations.
- Chat with the model and export checkpoints for downstream use.

## Before You Start
- In Colab, select `Runtime → Change runtime type → GPU` before running any cells.
- A free-tier T4 works with the `t4_quick` preset (expect a few hours). The `speedrun_full` preset sets `nproc` to 8, so switch to it only on a machine with at least 8 GPUs.
- Long sessions should sync checkpoints to Drive to guard against disconnects.

## Step 0: Verify GPU availability

In [None]:
!nvidia-smi
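
If you would rather branch on the result in Python than read the `nvidia-smi` table by eye, a minimal sketch along these lines may help. It assumes `nvidia-smi` is on the `PATH` whenever a GPU is attached; the helper name `list_gpus` is made up for this example and is not part of nanochat.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible GPU, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # tool present but no usable device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if gpus:
    print(f"{len(gpus)} GPU(s) visible: {gpus}")
else:
    print("No GPU visible; change the runtime type before continuing.")
```

Comparing `len(gpus)` with the `nproc` value of the preset you pick in Step 3 is a cheap way to catch a mismatch before `torchrun` does.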


## Step 1 (Optional): Mount Google Drive for checkpoints
Skip this step if you prefer manual downloads or are not running inside Colab.

In [None]:
import os

try:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')
    checkpoint_dir = '/content/drive/MyDrive/nanochat_checkpoints'
    os.makedirs(checkpoint_dir, exist_ok=True)
    print(f'Drive mounted. Checkpoints will sync to {checkpoint_dir}')
except ImportError:
    print('google.colab not available; skipping Drive mount.')


## Step 2: Clone (or update) the nanochat repository

In [None]:
%%bash
set -euo pipefail
cd /content
if [ ! -d 'nanochat' ]; then
  git clone https://github.com/HarleyCoops/nanochat.git
else
  cd nanochat
  git pull --ff-only
fi


In [None]:
%cd /content/nanochat


## Step 3: Configure experiment presets and environment
Choose the preset that matches your hardware. Update any values before running downstream cells.

In [None]:
import json
import os
from pathlib import Path

PROJECT_ROOT = '/content/nanochat'
os.environ['PROJECT_ROOT'] = PROJECT_ROOT

CONFIG_PRESETS = {
    't4_quick': {
        'dataset_shards': 40,
        'tokenizer_max_chars': 200_000_000,
        'model_depth': 12,
        'device_batch_size': 8,
        'max_seq_len': 2048,
        'num_iterations': 6000,
        'nproc': 1,
        'run_name': 'colab_t4',
    },
    'speedrun_full': {
        'dataset_shards': 240,
        'tokenizer_max_chars': 2_000_000_000,
        'model_depth': 20,
        'device_batch_size': 32,
        'max_seq_len': 2048,
        'num_iterations': 21400,
        'nproc': 8,
        'run_name': 'speedrun_d20',
    },
}

ACTIVE_PRESET = 't4_quick'  # change to 'speedrun_full' when you have >=8 GPUs
CONFIG = CONFIG_PRESETS[ACTIVE_PRESET].copy()
CONFIG.update({
    'dataset_cache_dir': str(Path.home() / '.cache' / 'nanochat'),
    'eval_bundle_dir': str(Path.home() / '.cache' / 'nanochat' / 'eval_bundle'),
    'eval_bundle_url': 'https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip',
    'drive_checkpoints_dir': '/content/drive/MyDrive/nanochat_checkpoints',
    'wandb_project': 'nanochat-colab',
    'wandb_login': False,
    'chat_prompt': 'Hello! Summarize what Nanochat is.',
})

extra_paths = ['/root/.local/bin', str(Path.home() / '.cargo' / 'bin')]
for path in extra_paths:
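    # Compare against PATH entries, not substrings, so '/root/.local/bin' is not
    # mistaken for a match inside a longer entry.
    if os.path.isdir(path) and path not in os.environ.get('PATH', '').split(os.pathsep):
        os.environ['PATH'] = f"{path}{os.pathsep}{os.environ.get('PATH', '')}"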

for key, value in CONFIG.items():
    os.environ[f'NANOCHAT_{key.upper()}'] = str(value)

print(f'Active preset: {ACTIVE_PRESET}')
print(json.dumps(CONFIG, indent=2))
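
To get a feel for how the two presets differ in scale, the arithmetic below multiplies out the tokens processed in one forward/backward pass across all processes. The numbers are copied from the presets above. Whether the training script also accumulates gradients toward a larger effective batch is not shown in this notebook, so treat the result as a per-pass figure only.

```python
# Values copied from the two presets in the configuration cell.
presets = {
    "t4_quick": {"device_batch_size": 8, "max_seq_len": 2048, "nproc": 1},
    "speedrun_full": {"device_batch_size": 32, "max_seq_len": 2048, "nproc": 8},
}


def tokens_per_pass(cfg):
    """Tokens seen in one forward/backward pass, summed over all processes."""
    return cfg["device_batch_size"] * cfg["max_seq_len"] * cfg["nproc"]


for name, cfg in presets.items():
    print(f"{name}: {tokens_per_pass(cfg):,} tokens per pass")
# t4_quick: 16,384 tokens per pass
# speedrun_full: 524,288 tokens per pass
```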


## Step 4: Install uv and project dependencies
Mirrors the `speedrun.sh` bootstrap using uv-managed virtual environments.

In [None]:
%%bash --env PROJECT_ROOT
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if ! command -v uv >/dev/null 2>&1; then
  curl -LsSf https://astral.sh/uv/install.sh | sh
  export PATH="$HOME/.local/bin:$PATH"
fi
cd "$PROJECT_ROOT"
if [ ! -d '.venv' ]; then
  uv venv
fi
uv sync
uv pip install maturin --upgrade


## Step 5: Install Rust and build the tokenizer extension

In [None]:
%%bash --env PROJECT_ROOT
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ ! -f "$HOME/.cargo/env" ]; then
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
fi
source "$HOME/.cargo/env"
cd "$PROJECT_ROOT"
uv run maturin develop --release --manifest-path rustbpe/Cargo.toml
rustc --version


## Step 6: Download pretraining shards
Adjust `CONFIG['dataset_shards']` if storage or bandwidth is limited.

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_DATASET_SHARDS

set -euo pipefail
cd "$PROJECT_ROOT"
echo "Downloading ${NANOCHAT_DATASET_SHARDS} shards into ~/.cache/nanochat"
uv run python -m nanochat.dataset -n "${NANOCHAT_DATASET_SHARDS}"
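
Before starting a large download it can be worth checking free disk space. The sketch below does that with the standard library. The per-shard size (`ASSUMED_SHARD_MB`) is an assumption chosen for illustration, not a figure from this notebook; download one shard and measure it to get a real number. The helper name `has_room_for` is likewise hypothetical.

```python
import shutil
from pathlib import Path

ASSUMED_SHARD_MB = 100  # assumption for illustration; measure a real shard to refine


def has_room_for(num_shards, target_dir, shard_mb=ASSUMED_SHARD_MB):
    """Return (fits, needed_bytes, free_bytes) for a planned shard download."""
    target = Path(target_dir)
    # Walk up to the nearest existing directory so disk_usage gets a valid path.
    while not target.exists():
        target = target.parent
    free_bytes = shutil.disk_usage(target).free
    needed_bytes = num_shards * shard_mb * 1024 * 1024
    return free_bytes >= needed_bytes, needed_bytes, free_bytes


ok, needed, free = has_room_for(40, Path.home() / ".cache" / "nanochat")
print(f"need ~{needed / 1e9:.1f} GB, free {free / 1e9:.1f} GB, fits: {ok}")
```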


## Step 7: Train and evaluate the tokenizer

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_TOKENIZER_MAX_CHARS
set -euo pipefail
cd "$PROJECT_ROOT"
uv run python -m scripts.tok_train --max_chars="${NANOCHAT_TOKENIZER_MAX_CHARS}"
uv run python -m scripts.tok_eval
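
For readers new to byte-pair encoding, here is a toy illustration of a single BPE merge step in plain Python. It is a teaching sketch only and is not the `rustbpe` implementation: it counts adjacent byte pairs, picks the most frequent one, and replaces it with a new token id.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair in a token sequence, or None."""
    counts = Counter(zip(ids, ids[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("aaabdaaabac".encode("utf-8"))  # 11 raw bytes
pair = most_frequent_pair(ids)             # (97, 97), i.e. "aa"
merged = merge(ids, pair, 256)             # 256 is the first id past the byte range
print(pair, len(ids), "->", len(merged))
# (97, 97) 11 -> 9
```

Training a tokenizer repeats this step until the vocabulary reaches its target size; a shorter sequence for the same text means better compression.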


## Step 8: Fetch the evaluation bundle (CORE metrics)

In [None]:
%%bash --env NANOCHAT_EVAL_BUNDLE_URL --env NANOCHAT_EVAL_BUNDLE_DIR

set -euo pipefail
if [ -f "$NANOCHAT_EVAL_BUNDLE_DIR/metadata.json" ]; then
  echo "Eval bundle already present at $NANOCHAT_EVAL_BUNDLE_DIR"
  exit 0
fi
tmp_dir=$(mktemp -d)
cleanup() { rm -rf "$tmp_dir"; }
trap cleanup EXIT
curl -L -o "$tmp_dir/eval_bundle.zip" "$NANOCHAT_EVAL_BUNDLE_URL"
unzip -q "$tmp_dir/eval_bundle.zip" -d "$tmp_dir"
mkdir -p "$(dirname "$NANOCHAT_EVAL_BUNDLE_DIR")"
rm -rf "$NANOCHAT_EVAL_BUNDLE_DIR"
mv "$tmp_dir/eval_bundle" "$NANOCHAT_EVAL_BUNDLE_DIR"
echo "Eval bundle ready at $NANOCHAT_EVAL_BUNDLE_DIR"


## Step 9 (Optional): Log in to Weights & Biases
Set `CONFIG['wandb_login'] = True` in the configuration cell if you want live dashboards.

In [None]:
import os

flag = os.environ.get('NANOCHAT_WANDB_LOGIN', 'False').lower() in {'1', 'true', 'yes'}
if flag:
    import wandb
    wandb.login()
else:
    print('Skipping wandb login. Enable CONFIG["wandb_login"] to opt in.')


## Step 10: Pretrain the base model
This runs `scripts/base_train.py` under `torchrun`. Expect several hours depending on the preset and GPU.

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_MODEL_DEPTH --env NANOCHAT_DEVICE_BATCH_SIZE --env NANOCHAT_MAX_SEQ_LEN --env NANOCHAT_NUM_ITERATIONS --env NANOCHAT_RUN_NAME --env NANOCHAT_NPROC
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/base_train.py \
  --depth="${NANOCHAT_MODEL_DEPTH}" \
  --device_batch_size="${NANOCHAT_DEVICE_BATCH_SIZE}" \
  --max_seq_len="${NANOCHAT_MAX_SEQ_LEN}" \
  --num_iterations="${NANOCHAT_NUM_ITERATIONS}" \
  --run="${NANOCHAT_RUN_NAME}"
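
To turn "several hours" into a number for your own run, time a few iterations from the training log and extrapolate. The sketch below is plain arithmetic. The `measured_seconds` value is a placeholder, not a benchmark, and the helper name `eta_hours` is made up for this example.

```python
def eta_hours(num_iterations, seconds_per_iteration):
    """Estimate wall-clock hours, assuming a constant time per iteration."""
    return num_iterations * seconds_per_iteration / 3600


measured_seconds = 2.0  # placeholder; replace with the step time your own log shows
print(f"t4_quick at {measured_seconds} s/iter: {eta_hours(6000, measured_seconds):.1f} h")
# t4_quick at 2.0 s/iter: 3.3 h
```

If the estimate exceeds what your Colab session is likely to survive, lower `num_iterations` or make sure the Drive sync in Step 18 runs between stages.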


## Step 11: Evaluate the base model (loss + CORE)

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_DEVICE_BATCH_SIZE --env NANOCHAT_NPROC
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/base_loss.py \
  --device_batch_size="${NANOCHAT_DEVICE_BATCH_SIZE}"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/base_eval.py


## Step 12: Midtrain on conversational data + tools

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_DEVICE_BATCH_SIZE --env NANOCHAT_NPROC --env NANOCHAT_RUN_NAME
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/mid_train.py \
  --device_batch_size="${NANOCHAT_DEVICE_BATCH_SIZE}" \
  --run="${NANOCHAT_RUN_NAME}_mid"


## Step 13: Evaluate the midtrained chat model

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_NPROC
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/chat_eval.py -i mid


## Step 14: Supervised finetuning (SFT)

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_DEVICE_BATCH_SIZE --env NANOCHAT_NPROC --env NANOCHAT_RUN_NAME
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/chat_sft.py \
  --device_batch_size="${NANOCHAT_DEVICE_BATCH_SIZE}" \
  --run="${NANOCHAT_RUN_NAME}_sft"


## Step 15: Evaluate the SFT checkpoint

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_NPROC
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/chat_eval.py -i sft


## Step 16 (Optional): Reinforcement learning on GSM8K
The RL stage is optional and slow, but it can improve math accuracy. Copy the command below into a code cell and run it when you have spare budget.

```bash
%%bash --env PROJECT_ROOT --env NANOCHAT_DEVICE_BATCH_SIZE --env NANOCHAT_NPROC --env NANOCHAT_RUN_NAME
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/chat_rl.py \
  --device_batch_size="${NANOCHAT_DEVICE_BATCH_SIZE}" \
  --run="${NANOCHAT_RUN_NAME}_rl"
uv run torchrun --standalone --nproc_per_node="${NANOCHAT_NPROC}" scripts/chat_eval.py -i rl -a GSM8K
```
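
To make the idea of RL on GSM8K concrete: the task has a checkable final answer, so a reward can be computed automatically. The sketch below shows one simple way to score a completion, assuming answers end with the dataset's `#### <number>` marker. It is an illustration only and not the reward used by `scripts/chat_rl.py`; the names `extract_answer` and `reward` are hypothetical.

```python
import re


def extract_answer(text):
    """Pull the number after a '####' marker, with thousands separators removed."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def reward(completion, reference):
    """Return 1.0 when the completion's final answer matches the reference, else 0.0."""
    predicted = extract_answer(completion)
    gold = extract_answer(reference)
    return 1.0 if predicted is not None and predicted == gold else 0.0


print(reward("48 / 2 = 24, and 48 + 24 = 72. #### 72", "#### 72"))  # 1.0
print(reward("I think the answer is 70", "#### 72"))                # 0.0
```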

## Step 17: Chat with the model (single prompt demo)

In [None]:
%%bash --env PROJECT_ROOT --env NANOCHAT_CHAT_PROMPT
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run python -m scripts.chat_cli -i sft -p "${NANOCHAT_CHAT_PROMPT}"


## Step 18: Sync checkpoints to Google Drive (if mounted)

In [None]:
import os
import shutil
from pathlib import Path

project_root = os.environ.get('PROJECT_ROOT', '/content/nanochat')
source_dir = Path(project_root) / 'checkpoints'
drive_dir = os.environ.get('NANOCHAT_DRIVE_CHECKPOINTS_DIR')

if not source_dir.exists():
    print(f'No checkpoints found at {source_dir}. Run training first.')
elif drive_dir and os.path.isdir(drive_dir):
    target = Path(drive_dir) / 'latest'
    target.mkdir(parents=True, exist_ok=True)
    print(f'Syncing checkpoints to {target}...')
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    print('Sync complete.')
else:
    print('Drive not mounted or destination unavailable; skipping sync.')
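
The cell above recopies every file on each run. If you sync between stages, a variant that skips unchanged files can save time. The sketch below is one way to do that, assuming that matching size and modification time mean a file is already synced; whether Drive's mounted filesystem preserves modification times reliably is not something this notebook verifies. The helper name `sync_changed` and the file name `model.pt` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def sync_changed(src, dst):
    """Copy files from src to dst, skipping those with matching size and mtime."""
    src, dst = Path(src), Path(dst)
    copied = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        if target.exists():
            s, t = path.stat(), target.stat()
            if s.st_size == t.st_size and int(s.st_mtime) <= int(t.st_mtime):
                continue  # destination copy looks current
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also carries over the modification time
        copied += 1
    return copied


# Small local demonstration using a temporary directory instead of Drive.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "checkpoints"
    (source / "base").mkdir(parents=True)
    (source / "base" / "model.pt").write_bytes(b"weights")
    first = sync_changed(source, Path(tmp) / "drive")
    second = sync_changed(source, Path(tmp) / "drive")

print(first, second)
# 1 0
```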


## Step 19: Export the SFT model to Hugging Face format

In [None]:
%%bash --env PROJECT_ROOT
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
if [ -f "$HOME/.cargo/env" ]; then
  source "$HOME/.cargo/env"
fi
cd "$PROJECT_ROOT"
uv run python scripts/export_to_huggingface.py --source sft -o exports/hf_model


## Next Steps
- Review the generated `HYPERBOLIC_DEPLOYMENT_SUMMARY.md` and report artifacts for run metadata.
- Sweep different depths or batch sizes by editing the configuration cell and re-running the relevant stages.
- Share the exported `exports/hf_model` directory or upload it directly to the Hugging Face Hub.
- Tweak datasets or add new mid/SFT mixtures to go beyond the baseline speedrun recipe.
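
For the Hub upload mentioned above, a minimal sketch using the `huggingface_hub` client might look like the following. It assumes the package is installed, that you have already authenticated (for example with `huggingface-cli login`), and that `exports/hf_model` contains everything the Hub needs. The helper name `upload_export` and the repo id are placeholders.

```python
from pathlib import Path


def upload_export(folder, repo_id):
    """Create the model repo if needed and upload the exported folder to it."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Export folder not found: {folder}")
    # Imported lazily so the folder check works even without the package installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="model")


# Example call (placeholder repo id; uncomment after logging in):
# upload_export("exports/hf_model", "your-username/nanochat-colab")
```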