# a-lm colab training

This notebook runs a full from-scratch pretraining run on Colab using the larger `nano` config and the Colab corpus preset.


## Optional drive mount
Use this if your repo or outputs live on Google Drive.


In [1]:
# from google.colab import drive

# drive.mount("/content/drive")

## Locate or clone the repo
If the repo already exists at `/content/a-lm` or `/content/drive/MyDrive/a-lm`, this cell uses it. Otherwise it clones into `/content/a-lm`.


In [2]:
import os

repo_candidates = ["/content/a-lm", "/content/drive/MyDrive/a-lm"]
repo_path = None
for candidate in repo_candidates:
    if os.path.isdir(candidate):
        repo_path = candidate
        break

if repo_path is None:
    repo_path = "/content/a-lm"
    !git clone https://github.com/ammaar-alam/a-lm.git {repo_path}

%cd {repo_path}

Cloning into '/content/a-lm'...
remote: Enumerating objects: 635, done.
remote: Counting objects: 100% (635/635), done.
remote: Compressing objects: 100% (381/381), done.
remote: Total 635 (delta 383), reused 473 (delta 222), pack-reused 0 (from 0)
Receiving objects: 100% (635/635), 148.38 KiB | 18.55 MiB/s, done.
Resolving deltas: 100% (383/383), done.
/content/a-lm


In [3]:
print(
    "Next: run the 'Install pinned dependencies' cell below."
    " Restart only if Colab warns about imports."
    " Then continue to 'Hugging Face login' and 'Start pretraining'."
)

Next: run the 'Install pinned dependencies' cell below. Restart only if Colab warns about imports. Then continue to 'Hugging Face login' and 'Start pretraining'.


## Install pinned dependencies
These versions avoid Colab crashes and keep `transformers` compatibility.
This cell intentionally does **not** downgrade `numpy` (downgrades force restarts and conflict with Colab preinstalls).


In [4]:
%pip install -U "huggingface_hub<1.0" "datasets>=2.19,<3" "pyarrow>=15.0.2,<19" \
  "gcsfs" "tokenizers>=0.22.0,<=0.23.0"
%pip install -e . --no-deps

Collecting datasets<3,>=2.19
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting gcsfs
  Downloading gcsfs-2026.1.0-py3-none-any.whl.metadata (2.1 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub<1.0)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
INFO: pip is looking at multiple versions of gcsfs to determine which version is compatible with other requirements. This could take a while.
Collecting gcsfs
  Downloading gcsfs-2025.12.0-py3-none-any.whl.metadata (2.1 kB)
  Downloading gcsfs-2025.10.0-py2.py3-none-any.whl.metadata (2.1 kB)
  Downloading gcsfs-2025.9.0-py2.py3-none-any.whl.metadata (2.1 kB)
  Downloading gcsfs-2025.7.0-py2.py3-none-any.whl.metadata (2.1 kB)
  Downloading gcsfs-2025.5.1-py2.py3-none-any.whl.metadata (1.9 kB)
  Downloading gcsfs-2025.5.0.post1-py2.py3-none-any.whl.metadata (1.9 kB)
  Downloading gcsfs-2025.5.0-py2.py3-none-any.whl.metadata (1.9 kB)
INFO: pip is still looking at multiple versions of gcsfs to determin
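
After the install finishes, it can help to confirm which versions actually ended up in the environment, since pip may backtrack across several candidates. The following is a minimal sketch using only the standard library; the `PINNED` list and the `installed_versions` helper are illustrative names, not part of the a-lm repo. `numpy` is included only to show the version that Colab preinstalled, since the install cell deliberately leaves it alone.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages touched by the install cell, plus numpy (left at Colab's version).
# This list is an assumption for illustration; adjust it to match the cell.
PINNED = ["huggingface_hub", "datasets", "pyarrow", "gcsfs", "tokenizers", "numpy"]


def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PINNED).items():
    print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

If a package prints as `NOT INSTALLED`, or a version falls outside the range given in the install cell, re-run the install cell before continuing.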

## Optional: enable verbose training logs
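By default the progress bar updates live. If you want per-step log lines, run this cell. The pretraining cell uses `configs/train_colab_verbose.yaml` automatically whenever that file exists.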


In [5]:
from pathlib import Path

Path("configs/train_colab_verbose.yaml").write_text("""
optim:
  name: adamw
  lr: 3e-4
  betas: [0.9, 0.95]
  weight_decay: 0.1
  eps: 1e-8

scheduler:
  name: cosine
  warmup_steps: 1000
  max_steps: 20000

training:
  micro_batch_size: 4
  gradient_accumulation: 8
  max_steps: 20000
  checkpoint_interval: 500
  gradient_clip_norm: 0.5
  mixed_precision: fp16
  grad_checkpointing: false
  seed: 1337
  dataloader_workers: 2

logging:
  log_interval: 1
  rich_progress: false
""")
print("Wrote configs/train_colab_verbose.yaml")

Wrote configs/train_colab_verbose.yaml
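
To get a feel for what the `optim`, `scheduler`, and `training` values above imply, here is a rough sketch of a cosine schedule with linear warmup and of the number of sequences per optimizer step. It is an assumption about how these settings are typically interpreted, not a copy of a-lm's scheduler: the `lr_at` function, the decay floor of 0, and the single-GPU batch arithmetic are all illustrative.

```python
import math

# Values copied from the config written above.
LR = 3e-4
WARMUP_STEPS = 1000
MAX_STEPS = 20000
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 8


def lr_at(step, peak=LR, warmup=WARMUP_STEPS, total=MAX_STEPS, floor=0.0):
    """Linear warmup to `peak`, then cosine decay to `floor` (assumed to be 0)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# On a single GPU, each optimizer step sees micro_batch_size * gradient_accumulation
# sequences (assumption: no data parallelism).
sequences_per_update = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION
print("sequences per optimizer step:", sequences_per_update)

for step in (0, 999, 1000, 10500, 20000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate reaches its peak at the end of warmup, is at half the peak midway through the decay phase, and reaches the floor at `max_steps`.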


## Hugging Face login
This cell reads your token from the Colab secret named `HF_TOKEN`, so add that secret to the notebook before running it.


In [6]:
from huggingface_hub import login
from google.colab import userdata

login(userdata.get('HF_TOKEN'))

## GPU check
Make sure CUDA is available before training.


In [7]:
import torch

!nvidia-smi
print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)

Thu Jan 29 21:37:00 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:04:00.0 Off |                    0 |
| N/A   31C    P0             69W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                
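
The check above only prints information, so a CPU-only runtime would not stop the notebook until training fails later. The following is a small sketch of a fail-fast guard; `cuda_summary` and `require_cuda` are hypothetical helper names, and the calls used are standard `torch.cuda` functions.

```python
def cuda_summary():
    """Describe the first visible CUDA device, or report why none is usable."""
    try:
        import torch
    except ImportError:
        return {"available": False, "reason": "torch is not installed"}
    if not torch.cuda.is_available():
        return {"available": False, "reason": "no CUDA device visible to torch"}
    props = torch.cuda.get_device_properties(0)
    return {
        "available": True,
        "name": props.name,
        "memory_gib": round(props.total_memory / 2**30, 1),
        "bf16": torch.cuda.is_bf16_supported(),
    }


def require_cuda(summary):
    """Raise a clear error if the summary says CUDA is unavailable."""
    if not summary["available"]:
        raise RuntimeError(f"CUDA check failed: {summary['reason']}")
    return summary


summary = cuda_summary()
print(summary)
```

Calling `require_cuda(cuda_summary())` right before the pretraining cell would stop the notebook early on a runtime without a GPU.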

## Start pretraining
Stores the run id in `LAST_RUN.txt` so later cells can resume or chat.


In [8]:
import subprocess
import time
from pathlib import Path

run_id = time.strftime("%Y%m%d-%H%M%S")
Path("LAST_RUN.txt").write_text(run_id)
print("run:", run_id)

train_cfg = "configs/train_colab.yaml"
if Path("configs/train_colab_verbose.yaml").exists():
    train_cfg = "configs/train_colab_verbose.yaml"
    print("using verbose logging config")

cmd = ["make", "colab-pretrain", f"RUN={run_id}", f"TRAIN_CFG={train_cfg}"]
print("command:", " ".join(cmd))
result = subprocess.run(cmd)
if result.returncode != 0:
    raise RuntimeError(f"make failed with exit code {result.returncode}")

run: 20260129-213701
using verbose logging config
command: make colab-pretrain RUN=20260129-213701 TRAIN_CFG=configs/train_colab_verbose.yaml


RuntimeError: make failed with exit code 2
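
The saved output above shows only the exit code, which makes the failure hard to diagnose. One option is to stream the command's output while keeping the last few lines for the error report. The sketch below uses only the standard library; `run_with_tail` is a hypothetical helper, and the demo command is a stand-in so the example runs outside the repo.

```python
import subprocess
import sys
from collections import deque


def run_with_tail(cmd, tail_lines=40):
    """Run cmd, echo its output live, and return (exit code, last output lines)."""
    tail = deque(maxlen=tail_lines)
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        print(line, end="")  # keep live output visible in the notebook
        tail.append(line.rstrip("\n"))
    return proc.wait(), list(tail)


# Stand-in command that prints two lines and exits with code 2.
# In the notebook, pass the `cmd` list built in the cell above instead.
demo = [sys.executable, "-c", "print('step 1'); print('boom'); raise SystemExit(2)"]
code, tail = run_with_tail(demo)
if code != 0:
    print(f"exit code {code}; last output lines:")
    print("\n".join(tail))
```

Merging stderr into stdout keeps the lines in the order they were produced, which is usually what you want when reading a `make` failure.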

## Chat with the latest checkpoint


In [None]:
from pathlib import Path

run_id = Path("LAST_RUN.txt").read_text().strip()
print("using run", run_id)
!make chat RUN={run_id}
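
Because later cells depend on `LAST_RUN.txt`, a missing or stale file produces confusing errors from `make`. Here is a sketch of a more defensive reader. The `read_run_id` helper is hypothetical, and the assumption that each run writes into `runs/<run_id>` is inferred from the checkpoint path used in the RLVR cell, not confirmed by the repo.

```python
from pathlib import Path


def read_run_id(marker="LAST_RUN.txt", runs_dir="runs"):
    """Return the stored run id, with clear errors if it cannot be used."""
    marker_path = Path(marker)
    if not marker_path.is_file():
        raise FileNotFoundError(f"{marker} not found; run the pretraining cell first")
    run_id = marker_path.read_text().strip()
    if not run_id:
        raise ValueError(f"{marker} is empty")
    # Assumption: pretraining output for a run lives under runs/<run_id>.
    run_dir = Path(runs_dir) / run_id
    if not run_dir.is_dir():
        raise FileNotFoundError(f"no run directory at {run_dir}")
    return run_id
```

In the cells here, `run_id = read_run_id()` could replace the direct `Path("LAST_RUN.txt").read_text().strip()` call.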

## RLVR post-training


In [None]:
from pathlib import Path

run_id = Path("LAST_RUN.txt").read_text().strip()
print("using run", run_id)

!make rlvr-data
!make rlvr-train RUN={run_id}
!make chat RUN={run_id} CHECKPOINT=runs/{run_id}/rlvr/ckpt-last.pt