## Mini Transformer - BPE Tokenization + Cosine LR Warmup/Decay (Hands-on Notebook)
    - **Preparing a corpus** (Your own text/code/chats/docs)
    - **Training a BPE Tokenizer** (Hugging Face tokenizers) - with a SentencePiece alternative.
    - **Wiring the tokenizer into a Tiny GPT-style Transformer**
    - **Training with cosine LR + warmup, AMP, and gradient accumulation.
    - **Sample text**


0) Setup:
    - Install deps
        - !pip -q install torch tqdm tokenizers sentencepiece

In [1]:
# 0) Setup
!pip -q install torch tqdm tokenizers
import os, math, glob, json
from typing import List, Optional
import torch
from torch import nn
from torch.nn import functional as F
from torch.cuda.amp import autocast, GradScaler
from tqdm import trange
print('Torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Torch: 2.9.0 | CUDA available: False


1) Bring/Build/Load your Corpus
You can: (a) download tiny Shakespeare, (b) point to a folder of .txt, .md, .code files, or (c) paste text.
### Tip:
    - For code include extensions like .py, .js, .ts, .java, .go.
    - For chats, export them to .txt and drop into a folder.

In [2]:
# # Option A: tiny Shakespear(quick start)
# import urllib.request
# url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
# os.makedirs('data', exist_ok=True)
# urllib.request.urlretrieve(url, 'data/input.txt')
# print ('Saved data/input.txt')
#
# # Option B: Point to a folder of your own files (set This to your path)
# USER_CORPUS_DIR = None      # e.g., '/path/to/my/corpus'
#
# def load_corpus(user_dir: str | None, min_len = 100):
#     texts: List[str] = []
#     if user_dir and os.path.isdir(user_dir):
#         exts = ['*.txt', '*.md', '*.py', '*.js', '*.ts', '*.java', '*.go', '*.rs', '*.c', '*.cpp', '*.h']
#         for ext in exts:
#             for p in glob.glob(os.path.join(user_dir, '**', ext), recursive=True):
#                 try:
#                     s = open(p, 'r', encoding = 'utf-8', errors='ignore').read()
#                     if len(s) >= min_len:
#                         texts.append(s)
#                 except Exception as e:
#                     print('skip', p, e)
#     else:
#         # fallback to tiny shakespear
#         texts.append(open('data/input.txt', 'r', encoding = 'utf-8').read())
#     return texts
#
# texts = load_corpus(USER_CORPUS_DIR)
# print('Loaded files: ', len(texts), 'Total chars:', sum(len(t) for t in texts))

DATA_FILE = None  # e.g., 'data/input.txt'
DATA_DIR  = None  # e.g., '/path/to/my/corpus'
MIN_LEN_PER_FILE = 100  # skip very tiny files

def load_texts(data_dir: Optional[str], data_file: Optional[str], min_len: int = 100) -> List[str]:
    texts: List[str] = []
    if data_file:
        s = open(data_file, 'r', encoding='utf-8', errors='ignore').read()
        if len(s) >= min_len:
            texts.append(s)
    elif data_dir:
        exts = ['*.txt','*.md','*.py','*.js','*.ts','*.java','*.go','*.rs','*.c','*.cpp']
        for ext in exts:
            for p in glob.glob(os.path.join(data_dir, '**', ext), recursive=True):
                try:
                    s = open(p, 'r', encoding='utf-8', errors='ignore').read()
                    if len(s) >= min_len:
                        texts.append(s)
                except Exception as e:
                    print('skip', p, e)
    else:
        os.makedirs('data', exist_ok=True)
        url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
        import urllib.request
        urllib.request.urlretrieve(url, 'data/input.txt')
        texts.append(open('data/input.txt','r',encoding='utf-8').read())
    if not texts:
        raise RuntimeError('No usable text found. Set DATA_FILE or DATA_DIR.')
    print('Loaded files:', len(texts), '| Total chars:', sum(len(t) for t in texts))
    return texts

texts = load_texts(DATA_DIR, DATA_FILE, MIN_LEN_PER_FILE)



Loaded files: 1 | Total chars: 1115394


2) Train a BPE Tokenizer (Hugging Face **Tokenizers**)
    - we build a BPE vocabulary with special tokens. For small experiments, **vocab_size=2000-8000** is good.
    - Alternative: SentencePiece is below if you prefer that library
    - If bpe/tokenizer.json exists, we'll reuse it; otherwise we train a new one.
        -  Guidance:
                - Use `vocab_size=4k-8k` for small pure-text corpora
                - Use `8k-16k` for mixed code + prose

In [5]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

TOKENIZER_PATH = 'bpe/tokenizer.json'
VOCAB_SIZE = 8000

def maybe_train_tokenizer(texts: List[str], tok_path: str, vocab_size: int = 8000):
    if os.path.exists(tok_path):
        print(f'[tokenizer] using existing {tok_path}')
        return
    print(f'[tokenizer] training new BPE tokenizer → {tok_path} (vocab_size={vocab_size})')
    os.makedirs(os.path.dirname(tok_path) or '.', exist_ok=True)
    tok = Tokenizer(BPE(unk_token='<unk>'))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=['<pad>','<unk>','<bos>','<eos>'])
    tok.train_from_iterator(texts, trainer)
    tok.save(tok_path)

maybe_train_tokenizer(texts, TOKENIZER_PATH, VOCAB_SIZE)
tok = Tokenizer.from_file(TOKENIZER_PATH)
print('Vocab size:', tok.get_vocab_size())
print('Test decode:', tok.decode(tok.encode('Hello, mini Transformer!').ids))

[tokenizer] training new BPE tokenizer → bpe/tokenizer.json (vocab_size=8000)



Vocab size: 8000
Test decode: He llo , min i Tr ans former !


- (Optional) SentencePiece Alternative
    - This does the same idea with the SentencePiece library.

In [None]:
import sentencepiece as spm
corpus_path = 'bpe/corpus.txt'
open(corpus_path, 'w', encoding='utf-8').write('\n'.join(texts))
spm.SentencePieceTrainer.Train(
    input=corpus_path,
    model_prefix='bpe/spm',
    vocab_size=4000,
    model_type='bpe',
    character_coverage=1.0,
    pad_id =0, unk_id=1, bos_id=2, eos_id=3,
)

sp = spm.SentencePieceProcessor(model_file='bpe/spm.model')
print('SPM Test:', sp.decode(sp.encode('Hello Transformer!', out_type=int)))


3) Dataset using BPE
    - we encode all texts to one long stream, add `<eos>` separators, split 90/10 train/val, and create batches by slicing fixed window
    - This class will auto-clamp `block_size` if the train split is smaller than the requested context


In [6]:
class BPEDataset:
    def __init__(self, texts: List[str], tokenizer_path: str, block_size: int = 256, device: Optional[str] = None):
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        self.tok = Tokenizer.from_file(tokenizer_path)
        ids = []
        eos_id = self.tok.token_to_id('<eos>')
        for t in texts:
            enc = self.tok.encode(t).ids
            if enc:
                ids.extend(enc)
                if eos_id is not None:
                    ids.append(eos_id)
        data = torch.tensor(ids, dtype=torch.long)
        total_len = len(data)
        if total_len < 4:
            raise RuntimeError('Encoded corpus too small. Add more text or reduce block_size a lot.')
        n = int(0.9 * total_len)
        self.train_data, self.val_data = data[:n], data[n:]
        if block_size >= len(self.train_data) - 1:
            new_bs = max(2, len(self.train_data) - 2)
            print(f'[BPEDataset] block_size {block_size} > train len; clamping to {new_bs}.')
            block_size = new_bs
        self.block_size = block_size
        self.vocab_size = self.tok.get_vocab_size()
        if len(self.val_data) <= self.block_size + 1:
            print(f'[BPEDataset] Warning: tiny val split for block_size={self.block_size} (len={len(self.val_data)}).')

    def get_batch(self, split: str, batch_size: int):
        src = self.train_data if split == 'train' else self.val_data
        hi = len(src) - self.block_size - 1
        if hi <= 0:
            raise RuntimeError(
                f'Not enough {split} tokens for block_size={self.block_size}. len(src)={len(src)}. '
                'Add more data or reduce block_size.'
            )
        idx = torch.randint(0, hi, (batch_size,))
        x = torch.stack([src[i:i+self.block_size] for i in idx])
        y = torch.stack([src[i+1:i+1+self.block_size] for i in idx])
        return x.to(self.device), y.to(self.device)

BLOCK_SIZE = 256  # lower to 64/128 if your corpus is tiny
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
bpe_ds = BPEDataset(texts, tokenizer_path=TOKENIZER_PATH, block_size=BLOCK_SIZE, device=DEVICE)
print('vocab_size=', bpe_ds.vocab_size, '| block_size=', bpe_ds.block_size,
      '| train_len=', len(bpe_ds.train_data), '| val_len=', len(bpe_ds.val_data))

xb, yb = bpe_ds.get_batch('train', batch_size=4)
print('batch shapes:', xb.shape, yb.shape)

vocab_size= 8000 | block_size= 256 | train_len= 248230 | val_len= 27582
batch shapes: torch.Size([4, 256]) torch.Size([4, 256])


4) Mini GPT Model (Same as before)
    - Decoder-only blocks with masked self-attention and LM head.

In [7]:
class CausalSelfAttention(nn.Module):
    def __init__(self, n_embed, n_head, block_size, dropout=0.1):
        super().__init__()
        assert n_embed % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embed, 3 * n_embed, bias=False)
        self.proj = nn.Linear(n_embed, n_embed, bias=False)
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer('mask', mask.view(1,1,block_size,block_size))
    def forward(self, x):
        B,T,C = x.shape
        qkv = self.qkv(x)
        q,k,v = qkv.chunk(3, dim=-1)
        nh = self.n_head
        q = q.view(B,T,nh,-1).transpose(1,2)
        k = k.view(B,T,nh,-1).transpose(1,2)
        v = v.view(B,T,nh,-1).transpose(1,2)
        att = (q @ k.transpose(-2,-1)) / math.sqrt(k.size(-1))
        att = self.mask[:,:,:T,:T].to(dtype=att.dtype, device=att.device) * att + \
              (1 - self.mask[:,:,:T,:T].to(dtype=att.dtype, device=att.device)) * float('-inf')
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v
        y = y.transpose(1,2).contiguous().view(B,T,-1)
        y = self.resid_drop(self.proj(y))
        return y

class Block(nn.Module):
    def __init__(self, n_embed, n_head, block_size, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embed)
        self.attn = CausalSelfAttention(n_embed, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embed)
        self.mlp = nn.Sequential(
            nn.Linear(n_embed, 4*n_embed), nn.GELU(), nn.Linear(4*n_embed, n_embed), nn.Dropout(dropout)
        )
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, n_embed=384, n_head=6, n_layer=6, block_size=256, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embed)
        self.pos_emb = nn.Embedding(block_size, n_embed)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([Block(n_embed, n_head, block_size, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.head = nn.Linear(n_embed, vocab_size, bias=False)
        self.apply(self._init)
    def _init(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None: nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Embedding):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
    def forward(self, idx, targets=None):
        B,T = idx.shape
        pos = torch.arange(0, T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)[None,:,:]
        x = self.drop(x)
        for blk in self.blocks: x = blk(x)
        x = self.ln_f(x)
        logits = self.head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B*T, -1), targets.view(B*T))
        return logits, loss
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / max(temperature, 1e-8)
            if top_k and top_k > 0:
                k = min(int(top_k), logits.size(-1))
                v, _ = torch.topk(logits, k)
                logits[logits < v[:, [-1]]] = -float('inf')
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_id], dim=1)
        return idx

5) Training Loop - Cosine LR + Warmup, AMP, Grad Accum
    - This mirros the improved train.py (L4_training_loop.py) now using BPE dataset

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MiniGPT(
    vocab_size=bpe_ds.vocab_size,
    n_embed=384, n_head=6, n_layer=6,
    block_size=bpe_ds.block_size, dropout=0.1
).to(device)

LR = 3e-4
MAX_STEPS = 3000
WARMUP_STEPS = 200
USE_COSINE = True
BATCH_SIZE = 128 if device.type == 'cuda' else 64
GRAD_ACCUM_STEPS = 2 if device.type == 'cuda' else 1

opt = torch.optim.AdamW(model.parameters(), lr=LR)
scaler = GradScaler(enabled=(device.type == 'cuda'))

def lr_factor(step: int) -> float:
    if step < WARMUP_STEPS:
        return max(1e-8, (step+1)/max(1, WARMUP_STEPS))
    if not USE_COSINE:
        return 1.0
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    min_factor = 0.1
    return min_factor + 0.5*(1-min_factor)*(1 + math.cos(math.pi * progress))

@torch.no_grad()
def eval_loss(iters=20) -> float:
    model.eval()
    s = 0.0
    for _ in range(iters):
        x, y = bpe_ds.get_batch('val', BATCH_SIZE)
        _, loss = model(x, y)
        s += float(loss.item())
    model.train()
    return s / max(1, iters)

best = float('inf')
for step in trange(MAX_STEPS, desc='training'):
    fac = lr_factor(step)
    for pg in opt.param_groups: pg['lr'] = LR * fac

    opt.zero_grad(set_to_none=True)
    micro_bs = max(1, BATCH_SIZE // max(1, GRAD_ACCUM_STEPS))
    for _ in range(GRAD_ACCUM_STEPS):
        x, y = bpe_ds.get_batch('train', micro_bs)
        with autocast(enabled=(device.type == 'cuda')):
            _, loss = model(x, y)
            loss = loss / max(1, GRAD_ACCUM_STEPS)
        scaler.scale(loss).backward()

    scaler.unscale_(opt)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt); scaler.update()

    if (step + 1) % 200 == 0 or (step + 1) == MAX_STEPS:
        vl = eval_loss(20)
        ppl = math.exp(vl) if vl < 20 else float('inf')
        print(f"\nstep {step+1} | val_loss {vl:.4f} | ppl ~ {ppl:.2f}")
        if vl < best:
            best = vl
            torch.save({
                'model': model.state_dict(),
                'config': {
                    'vocab_size': bpe_ds.vocab_size,
                    'n_embed': 384, 'n_head': 6, 'n_layer': 6,
                    'block_size': bpe_ds.block_size, 'dropout': 0.1
                },
                'tokenizer_path': TOKENIZER_PATH
            }, 'bpe_best.pt')
            print('new best saved: bpe_best.pt')

  scaler = GradScaler(enabled=(device.type == 'cuda'))
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  with autocast(enabled=(device.type == 'cuda')):
training:   7%|▋         | 200/3000 [12:38<7:06:33,  9.14s/it]


step 200 | val_loss nan | ppl ~ inf


training:  13%|█▎        | 400/3000 [25:41<6:19:12,  8.75s/it] 


step 400 | val_loss nan | ppl ~ inf


training:  20%|██        | 600/3000 [44:50<5:46:31,  8.66s/it]  


step 600 | val_loss nan | ppl ~ inf


training:  27%|██▋       | 800/3000 [1:30:55<5:17:14,  8.65s/it]   


step 800 | val_loss nan | ppl ~ inf


training:  33%|███▎      | 1000/3000 [1:45:54<4:47:41,  8.63s/it]


step 1000 | val_loss nan | ppl ~ inf


training:  40%|████      | 1200/3000 [1:58:00<4:20:12,  8.67s/it]


step 1200 | val_loss nan | ppl ~ inf


training:  47%|████▋     | 1400/3000 [2:10:06<3:51:53,  8.70s/it]


step 1400 | val_loss nan | ppl ~ inf


training:  53%|█████▎    | 1600/3000 [2:22:12<3:21:45,  8.65s/it]


step 1600 | val_loss nan | ppl ~ inf


training:  60%|██████    | 1800/3000 [2:34:25<2:53:36,  8.68s/it]


step 1800 | val_loss nan | ppl ~ inf


training:  67%|██████▋   | 2000/3000 [3:10:14<2:24:26,  8.67s/it]  


step 2000 | val_loss nan | ppl ~ inf


training:  73%|███████▎  | 2200/3000 [3:22:58<1:58:35,  8.89s/it]


step 2200 | val_loss nan | ppl ~ inf


training:  80%|████████  | 2400/3000 [3:35:08<1:27:52,  8.79s/it]


step 2400 | val_loss nan | ppl ~ inf


training:  87%|████████▋ | 2600/3000 [3:47:21<58:23,  8.76s/it]  


step 2600 | val_loss nan | ppl ~ inf


training:  93%|█████████▎| 2800/3000 [4:00:35<29:59,  9.00s/it]  


step 2800 | val_loss nan | ppl ~ inf


training: 100%|██████████| 3000/3000 [4:19:29<00:00,  5.19s/it] 


step 3000 | val_loss nan | ppl ~ inf





6) Sample with a prompt
    - Use the same tokenizer and the best checkpoint to generate text.

In [10]:
ck = torch.load('bpe_best.pt', map_location=device)
conf = ck['config']
model = MiniGPT(
    vocab_size=conf['vocab_size'],
    n_embed=conf['n_embed'], n_head=conf['n_head'], n_layer=conf['n_layer'],
    block_size=conf['block_size'], dropout=conf['dropout']
).to(device)
model.load_state_dict(ck['model']); model.eval()
tok = Tokenizer.from_file(ck.get('tokenizer_path', TOKENIZER_PATH))

def generate_text(prompt: str, max_new_tokens=300, temperature=0.9, top_k=80):
    ctx = torch.tensor([tok.encode(prompt).ids], dtype=torch.long, device=device)
    out = model.generate(ctx, max_new_tokens=max_new_tokens, temperature=temperature, top_k=top_k)
    return tok.decode(out[0].tolist())

print(generate_text('ROMEO:', max_new_tokens=300, temperature=0.8, top_k=80))

FileNotFoundError: [Errno 2] No such file or directory: 'bpe_best.pt'

### Tips
    - Increase VOCAB_SIZE to 8k-16k for mixed code + prose
    - Increase BLOCK_SIZE (if you have enough data + memory) for longer context.
    - Scale n_layer / n_embed if you have more compute
    - Try different sampling settings: temperature 0.7-1.0, top_k 50-(vocab_size-1)