
# Bigram Language Model (PyTorch)

This notebook demonstrates a character-level bigram language model implemented in PyTorch. It trains on a corpus of text and generates new character sequences based on learned probabilities.



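## Setup and Hyperparameters

We begin by importing PyTorch, setting the hyperparameters, selecting the device (GPU if available, otherwise CPU), and fixing the random seed so that runs are reproducible. The corpus itself, the Shakespeare dataset (`miniSpeare.txt`), is loaded in the next cell.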


In [1]:

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

torch.manual_seed(1337)




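## Data Loading and Vocabulary

We load the Shakespeare dataset (`miniSpeare.txt`) and build the vocabulary from its unique characters. This includes creating mappings from characters to integers and back. The encoded text is then split into a training set (the first 90%) and a validation set (the remaining 10%).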


In [2]:

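# read the whole corpus into a single string
with open('miniSpeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# the vocabulary is the sorted set of unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# character <-> integer lookup tables, plus helpers that convert whole strings
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # list of integers -> string

# encode the full text and hold out the last 10% for validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]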
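
To make the character-to-integer mapping concrete, here is a minimal sketch of the same encode/decode round trip on a toy string. The string `sample_text` is an assumption chosen for illustration; the notebook builds its vocabulary from `miniSpeare.txt` instead, so the actual indices will differ.

```python
# Toy corpus (an assumption for illustration; the notebook reads miniSpeare.txt)
sample_text = "hello world"

# Unique characters in sorted order give a stable index for each character
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
print(chars)        # the toy vocabulary
print(ids)          # integer ids for "hello"
print(decode(ids))  # round trip back to the original string
```

Encoding followed by decoding returns the original string as long as every character is in the vocabulary; a character that never occurred in the corpus would raise a `KeyError`.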



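## Batch Sampling

`get_batch` creates mini-batches by sampling `batch_size` random start positions from the chosen split. Each input row is a sequence of `block_size` tokens, and its target row is the same window shifted one character ahead, so every position is paired with the character that follows it.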


In [3]:

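def get_batch(split):
    # pick the split, then sample batch_size random start offsets
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # x holds block_size tokens per row; y is the same window shifted one token ahead
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)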
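
The relationship between the inputs and targets returned by `get_batch` is easier to see on a tiny example. The sketch below uses a made-up tensor `toy_data`, a smaller `block_size`, and a fixed start offset in place of a random one; all three are assumptions for illustration only.

```python
import torch

block_size = 4                 # toy value; the notebook uses 8
toy_data = torch.arange(10)    # stand-in for the encoded corpus: 0, 1, ..., 9

i = 2                          # one fixed start offset instead of a random one
x = toy_data[i : i + block_size]          # inputs
y = toy_data[i + 1 : i + block_size + 1]  # targets: same window, one step ahead

# Each input position is paired with the token that follows it
for t in range(block_size):
    print(f"input {x[t].item()} -> target {y[t].item()}")
```

Because a bigram model looks only at the current token, each of these pairs is an independent training example.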



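## Loss Estimation

`estimate_loss` evaluates the model on both the training and validation sets by averaging the loss over `eval_iters` random mini-batches, which is far less noisy than the loss of a single batch. It uses the global `model`, which is created in the training cell before this function is first called.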


In [4]:

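@torch.no_grad() # no gradients are needed for evaluation
def estimate_loss():
    out = {}
    model.eval() # switch to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean() # average over eval_iters random batches
    model.train() # switch back to training mode
    return out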
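
As a small illustration of what the `@torch.no_grad()` decorator on `estimate_loss` does, the sketch below compares a value computed normally with one computed inside a decorated function. The names `w` and `evaluate` are hypothetical and exist only for this example.

```python
import torch

w = torch.ones(3, requires_grad=True)   # hypothetical parameter


@torch.no_grad()
def evaluate(x):
    # Inside this function autograd does not record operations
    return (w * x).sum()


x = torch.arange(3.0)
tracked = (w * x).sum()     # computed normally, part of the autograd graph
untracked = evaluate(x)     # same value, but detached from the graph

print(tracked.requires_grad, untracked.requires_grad)
```

Skipping the graph saves memory and time during evaluation, where `backward()` is never called.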



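## Bigram Model Definition

The model consists of a single embedding layer of shape `(vocab_size, vocab_size)` that acts as a bigram table: the row for each token holds the logits for the character that follows it. `forward` returns these logits and, when targets are given, the cross-entropy loss. `generate` repeatedly samples the next character from the logits at the last position and appends it to the context.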


In [5]:

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the logits for the token that follows token i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # (B, T, C)
        if targets is None:
            loss = None
        else:
            # cross_entropy expects (N, C) logits and (N,) targets
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor holding the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :] # keep only the last time step, (B, C)
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
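
The two central ideas in `BigramLanguageModel`, namely that the embedding is a plain table lookup and that the logits are flattened before the loss, can be checked on toy tensors. This is a sketch under assumed values: the vocabulary size of 5, the token ids, and the targets are all made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5                                   # toy value, not the notebook's vocabulary
table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([[0, 3, 1], [2, 2, 4]])       # (B=2, T=3) made-up token ids
targets = torch.tensor([[3, 1, 0], [2, 4, 1]])   # made-up next tokens

logits = table(idx)                              # (B, T, C) = (2, 3, 5)

# The embedding layer simply selects rows of its weight matrix
same = torch.equal(logits, table.weight[idx])

# F.cross_entropy expects (N, C) logits and (N,) class indices
B, T, C = logits.shape
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

print(tuple(logits.shape), same, loss.item())
```

Since the logits for a position depend only on the token at that position, the model cannot use any earlier context, which is what makes it a bigram model.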



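## Training and Text Generation

We train the model with the `AdamW` optimizer and print the training and validation loss every `eval_interval` iterations. Finally, we use the trained model to generate 500 new characters, starting from a context that holds the single token with index 0. The model samples each next character from the learned probabilities and appends it to the context.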
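
A single sampling step can be sketched in isolation. The logits below are made-up scores over a three-character vocabulary, an assumption for illustration; in the notebook they come from the model's output at the last position.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Made-up logits for the last position: (B=1, C=3)
logits = torch.tensor([[2.0, 0.0, -1.0]])

probs = F.softmax(logits, dim=-1)                    # turn scores into probabilities
idx_next = torch.multinomial(probs, num_samples=1)   # draw one token id, shape (1, 1)

# Append the sampled token to the running context
context = torch.zeros((1, 1), dtype=torch.long)
context = torch.cat((context, idx_next), dim=1)      # shape (1, 2)

print(probs, idx_next, tuple(context.shape))
```

Sampling from the distribution, instead of always taking the most likely character, is what lets repeated runs produce different text.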


In [7]:

model = BigramLanguageModel(vocab_size)
m = model.to(device) # move the parameters to the selected device

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters): # `step` avoids shadowing the built-in `iter`
    # every eval_interval steps, report the averaged train and val loss
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch, compute the loss, and take one optimizer step
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model, starting from a single token with index 0
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.6506, val loss 4.6615
step 300: train loss 2.8190, val loss 2.8228
step 600: train loss 2.5420, val loss 2.5593
step 900: train loss 2.4982, val loss 2.5179
step 1200: train loss 2.4814, val loss 2.5149
step 1500: train loss 2.4672, val loss 2.4981
step 1800: train loss 2.4646, val loss 2.4879
step 2100: train loss 2.4648, val loss 2.4919
step 2400: train loss 2.4724, val loss 2.4907
step 2700: train loss 2.4629, val loss 2.4862

Buer me,
RUSllle i&elat toUNuee theas Plichee.
LERio me &ZEnks bloutwlue henarthian th fie ber!
LABuppulien?
Berod ther imy d.
HAlip
Thow ngathe I t ondy y:
Hitis al he IN:
I, ces3Kal ue; t tsos in sellen, omoon
HANORure, thes ar fos o s bu in, veamerris m e.
Myeran'tl'd Yomeeisuprmiuspinougninks n at: m:
Tore anorer f harbro spurshat! is bung julesth k atroove se westy thishme tothe IUSS:

IO:
O:
LAnas bsut s me he har?
Hodia y'dghar urs
Go lillobre nt fay Non s
Ans, sy gstwovesan thimecve, oul
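
One rough way to read the loss values in the training log above is to convert them to perplexity. `F.cross_entropy` reports the mean negative log-likelihood in nats, so its exponential can be interpreted as the effective number of equally likely characters the model is choosing between. The sketch below only re-uses two validation losses copied from the log.

```python
import math

# Validation losses copied from the training log (mean cross-entropy, in nats)
val_loss_start = 4.6615   # step 0
val_loss_end = 2.4862     # step 2700

# Perplexity is the exponential of the mean cross-entropy
ppl_start = math.exp(val_loss_start)
ppl_end = math.exp(val_loss_end)

print(f"perplexity at step 0:    {ppl_start:.1f}")
print(f"perplexity at step 2700: {ppl_end:.1f}")
```

By this measure the trained bigram model is about as uncertain as a uniform choice among roughly 12 characters, which is consistent with the generated sample: it has plausible letter pairs but few real words.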
