# Simple Language Model (LLM) Implementation

A simple transformer-based language model that learns to predict the next word in a sequence. This notebook demonstrates, at toy scale, the core concepts behind modern LLMs such as the models that power ChatGPT.

## Overview
- **Tokenization**: Converting text to numbers
- **Embeddings**: Representing words as vectors
- **Self-Attention**: Learning relationships between words
- **Transformer Blocks**: Stacking attention layers for deeper understanding
- **Text Generation**: Creating new text based on learned patterns

In [2]:
!pip install torch numpy


## Step 1: Install Dependencies

This cell installs required libraries:
- **PyTorch (torch)**: Deep learning framework for building neural networks and performing tensor operations
- **NumPy**: Library for numerical computing with arrays and mathematical operations

In [3]:
import torch
import torch.nn as nn

# Tokenizer
class SimpleTokenizer:
    def __init__(self):
        # Plain dict: a defaultdict(int) would silently return 0 (<PAD>) on a missing-key lookup
        self.vocab = {}
        self.vocab["<PAD>"] = 0  # Padding token
        self.vocab["<UNK>"] = 1  # Unknown token

    def fit(self, texts):
        for text in texts:
            for word in text.split():
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)

    def encode(self, text):
        return [self.vocab.get(word, self.vocab["<UNK>"]) for word in text.split()]

# Embedding Layer
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        return self.embedding(x)

# Self-Attention Mechanism
class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (x.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, V)
        return output

# Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attention = SelfAttention(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.ReLU(),
            nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = x + self.attention(self.norm1(x))  # Residual connection
        x = x + self.feed_forward(self.norm2(x))  # Residual connection
        return x

# Simple LLM
class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_blocks):
        super().__init__()
        self.embedding = EmbeddingLayer(vocab_size, embed_dim)
        self.blocks = nn.ModuleList([TransformerBlock(embed_dim) for _ in range(num_blocks)])
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        for block in self.blocks:
            x = block(x)
        logits = self.fc(x)
        return logits

## Step 2: Define Model Architecture

This cell defines all the core neural network components that make up the LLM:

### **SimpleTokenizer** - Converts text to numbers
- `vocab`: Dictionary mapping words to unique integer IDs
- `fit(texts)`: Learns vocabulary from training texts
- `encode(text)`: Converts sentence strings to lists of token IDs
- `<PAD>` token (ID=0): Used for padding sequences to same length
- `<UNK>` token (ID=1): Represents unknown/out-of-vocabulary words
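
A minimal sketch of how this whitespace tokenizer treats unseen, differently cased, or differently punctuated words. The class is restated with a plain `dict` so the snippet runs on its own, and the two-sentence corpus and the `demo_tokenizer` name are chosen only for illustration.

```python
class SimpleTokenizer:
    """Whitespace tokenizer restated from the notebook (plain dict variant)."""

    def __init__(self):
        self.vocab = {"<PAD>": 0, "<UNK>": 1}

    def fit(self, texts):
        # Assign the next free ID to every word not seen before.
        for text in texts:
            for word in text.split():
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)

    def encode(self, text):
        # Words missing from the vocabulary fall back to the <UNK> ID.
        return [self.vocab.get(word, self.vocab["<UNK>"]) for word in text.split()]


demo_tokenizer = SimpleTokenizer()
demo_tokenizer.fit(["The cat sat on the mat.", "The dog barked at the cat."])

ids_known = demo_tokenizer.encode("The cat sat")   # all three words were seen
ids_case = demo_tokenizer.encode("the Cat")        # "Cat" differs in case from "cat"
ids_punct = demo_tokenizer.encode("mat mat.")      # "mat" and "mat." are different tokens

print(ids_known, ids_case, ids_punct)
```

Because splitting is done on whitespace only, `"mat."` and `"mat"` get different IDs, and capitalization matters. Lowercasing and separating punctuation before `split()` would be one way to shrink the vocabulary.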

### **EmbeddingLayer** - Creates word vectors
- Converts each token ID into a dense vector (embedding)
- Example: word "cat" becomes [0.1, -0.3, 0.5, ...] (8 numbers when `embed_dim = 8`)
- Helps model understand semantic relationships between words
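
Conceptually, an embedding layer is a lookup table with one trainable row per token ID. The plain-Python sketch below illustrates only the lookup; the random values stand in for learned weights, and `make_embedding_table` and `embed` are hypothetical helper names, not part of the notebook.

```python
import random


def make_embedding_table(vocab_size, embed_dim, seed=0):
    """One row of embed_dim random numbers per token ID (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in range(vocab_size)]


def embed(table, token_ids):
    """Look up the row for each token ID: seq_len IDs -> seq_len vectors."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=5, embed_dim=8)
vectors = embed(table, [2, 3, 4, 2])

print(len(vectors), len(vectors[0]))  # sequence length and embedding size
```

The same token ID always maps to the same row, so the first and last vectors here are identical. During training, the rows themselves are what gets updated.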

### **SelfAttention** - Learns word relationships
- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What information do I have?"
- **Value (V)**: "What information should I share?"
- **Attention scores**: Dot-product similarity between Q and K, divided by √embed_dim (higher = more related)
- **Softmax weights**: Scores normalized so that each word's weights over the sequence sum to 1
- Allows model to focus on relevant words when processing each word
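
The `SelfAttention` class in this notebook applies softmax over every position, so each word can also attend to the words that come after it. Decoder-style language models usually add a causal mask so that position `i` only sees positions `0..i`; without it, the word to be predicted is visible during training. The plain-Python sketch below shows masked attention weights under that assumption; the function name and the score matrix are illustrative and not taken from the notebook.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # drop scores for future positions
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]  # subtract max for numerical stability
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)     # future positions get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.0, 0.0, 0.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

In PyTorch the same effect is commonly obtained by filling the upper triangle of `scores` with `-inf` (for example with `torch.tril` and `masked_fill`) before calling `torch.softmax`.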

### **TransformerBlock** - Core processing unit
- Combines attention mechanism with feed-forward neural network
- **Residual connections** (x + output): Helps information flow and training
- **LayerNorm**: Normalizes values to stabilize learning
- **Feed-forward**: Two linear layers with ReLU activation for non-linearity
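
As a rough illustration of the two stabilizing pieces, the sketch below normalizes one vector the way LayerNorm does (without the learnable scale and shift that `nn.LayerNorm` also has) and applies the `x + sublayer(norm(x))` pattern used in `TransformerBlock.forward`. The helper names are hypothetical.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Shift to mean 0 and scale to (approximately) variance 1."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def pre_norm_residual(x, sublayer):
    """x + sublayer(LayerNorm(x)): the input is added back to the sublayer output."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)

# A sublayer that outputs zeros leaves x unchanged, which is why residual
# connections make it easy for information to pass through a block.
unchanged = pre_norm_residual(x, lambda v: [0.0] * len(v))

print([round(v, 3) for v in normed], unchanged)
```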

### **SimpleLLM** - Complete model
- Stacks `num_blocks` transformer blocks on top of the embedding layer
- Final linear layer outputs logits: one raw score per vocabulary word at each position (softmax turns them into a probability distribution)
- This is the main model we'll train and use for text generation
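
The final linear layer produces one raw score per vocabulary word, and a softmax turns those scores into probabilities. A small worked sketch, assuming a made-up three-word vocabulary and made-up scores:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-word vocabulary
probs = softmax(logits)

print([round(p, 3) for p in probs])
```

Softmax preserves the ordering of the scores, so the word with the largest logit is also the most probable word.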

In [4]:
# Example dataset
texts = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "A bird flew over the house.",
    "The sun rises in the east.",
    "Apples and bananas are fruits.",
    "She drinks coffee every morning.",
    "He reads a book before bed.",
    "They are playing soccer in the park.",
    "We watched a movie last night.",
    "I am learning to play the guitar.",
    "The baby is sleeping in the crib.",
    "She walks her dog every evening.",
    "He is writing a letter to his friend.",
    "They visited the museum on Sunday.",
    "We are planning a trip to the mountains.",
    "I enjoy cooking Italian food.",
    "The flowers are blooming in the garden.",
    "She bought a new dress for the party.",
    "He is fixing the broken chair.",
    "They are studying for the final exam.",
    "We had dinner at a new restaurant.",
    "I like listening to classical music.",
    "The children are playing with their toys.",
    "She is painting a beautiful landscape.",
    "He goes jogging every morning.",
    "They adopted a kitten from the shelter.",
    "We are redecorating our living room.",
    "I am practicing yoga to relax.",
    "The chef is preparing a delicious meal.",
    "She is learning to speak French.",
    "He fixed the leaky faucet.",
    "They are building a treehouse.",
    "We visited the zoo last weekend.",
    "I am reading a fascinating novel.",
    "The dog chased the ball.",
    "She is knitting a scarf for winter.",
    "He plays the piano beautifully.",
    "They are organizing a charity event.",
    "We enjoyed the concert last night.",
    "I am taking a photography class.",
    "The kids are drawing pictures.",
    "She baked cookies for her neighbors.",
    "He is learning to swim.",
    "They went hiking in the forest.",
    "We are attending a wedding tomorrow.",
    "I love watching the sunset.",
    "The teacher explained the lesson.",
    "She is planting flowers in the yard.",
    "He repaired his bicycle.",
    "They are hosting a dinner party.",
    "We went shopping for groceries.",
    "I am writing in my journal.",
    "The cat is napping on the sofa.",
    "She is sewing a dress.",
    "He enjoys fishing on weekends.",
    "They are playing board games.",
    "We visited a historic site.",
    "I am learning to dance.",
    "The baby is crawling.",
    "She is organizing her closet.",
    "He is practicing the violin.",
    "They are watching a documentary.",
    "We had a picnic in the park.",
    "I am trying a new recipe.",
    "The dog is digging in the yard.",
    "She is attending a yoga class.",
    "He is building a model airplane.",
    "They went to the beach.",
    "We are learning about astronomy.",
    "I enjoy cycling.",
    "The kids are playing hide and seek.",
    "She is writing a poem.",
    "He is cooking dinner.",
    "They are exploring the city.",
    "We watched a play.",
    "I am studying history.",
    "The cat climbed the tree.",
    "She is practicing meditation.",
    "He is painting the fence.",
    "They are volunteering at the shelter.",
    "We went to a music festival.",
    "I am learning to code.",
    "The dog is wagging its tail.",
    "She is decorating the cake.",
    "He is reading the newspaper.",
    "They are planting a vegetable garden.",
    "We are planning a surprise party.",
    "I enjoy hiking.",
    "The children are building a sandcastle.",
    "She is taking a pottery class.",
    "He is mowing the lawn.",
    "They are attending a workshop.",
    "We visited an art gallery.",
    "I am practicing calligraphy.",
    "The cat is chasing a butterfly.",
    "She is learning to play chess.",
    "He is assembling furniture.",
    "They are going on a road trip.",
    "We are baking a cake.",
    "I am writing a short story.",
    "The dog is fetching the stick.",
    "She is practicing the flute.",
    "He is organizing his workspace.",
    "They are watching a comedy show.",
    "We went to a science museum.",
    "I am learning about photography.",
    "The kids are flying kites.",
    "She is knitting a sweater.",
    "He is playing basketball.",
    "They are cooking together.",
    "We are attending a lecture.",
    "I enjoy birdwatching.",
    "The cat is playing with yarn.",
    "She is gardening.",
    "He is learning martial arts.",
    "They are visiting relatives.",
    "We went to a book fair.",
    "I am studying geography.",
    "The dog is running in the field.",
    "She is practicing singing.",
    "He is repairing the car.",
    "They are going camping.",
    "We are making homemade pizza.",
    "I am writing a blog post.",
    "The children are drawing with chalk.",
    "She is learning to knit.",
    "He is practicing archery.",
    "They are attending a dance class.",
    "We visited a botanical garden.",
    "I am learning sign language.",
    "The cat is lounging in the sun.",
    "She is sewing a quilt.",
    "He is playing the drums.",
    "They are exploring a cave.",
    "We are watching a magic show.",
    "I enjoy stargazing.",
    "The dog is rolling over.",
    "She is practicing ballet.",
    "He is building a birdhouse.",
    "They are going to a fair.",
    "We are learning to surf.",
    "I am studying chemistry.",
    "The kids are making crafts.",
    "She is writing a song.",
    "He is cooking breakfast.",
    "They are visiting a farm.",
    "We went to a sports game.",
    "I am learning to skateboard.",
    "The cat is scratching the post.",
    "She is practicing yoga poses.",
    "He is fixing the roof.",
    "They are organizing a fundraiser.",
    "We are attending a poetry reading.",
    "I enjoy playing chess.",
    "The dog is sniffing around.",
    "She is painting a portrait.",
    "He is learning to juggle.",
    "They are going to a concert.",
    "We are baking cookies.",
    "I am writing a letter.",
    "The children are playing tag.",
    "She is knitting a hat.",
    "He is practicing the saxophone.",
    "They are exploring a new neighborhood.",
    "We visited a historical monument.",
    "I am learning to play the ukulele.",
    "The cat is chasing its tail.",
    "She is sewing curtains.",
    "He is playing soccer.",
    "They are cooking a family recipe.",
    "We are attending a cooking class.",
    "I enjoy painting landscapes.",
    "The dog is chewing a bone.",
]

# Initialize and fit tokenizer
tokenizer = SimpleTokenizer()
tokenizer.fit(texts)

# Encode a sample sentence
encoded_sentence = tokenizer.encode("The cat sat")
print("Encoded sentence:", encoded_sentence)

Encoded sentence: [2, 3, 4]


## Step 3: Load and Tokenize Training Data

This cell creates the dataset and converts text to numbers:
- **texts list**: 173 short example sentences covering various daily activities
- **tokenizer.fit(texts)**: Learns vocabulary (all unique words) from dataset
- **encoded_sentence**: Demonstrates text-to-numbers conversion

### Key Terms:
- **Tokenization**: Breaking text into tokens (words) and converting to IDs
- **Vocabulary (vocab_size)**: Total number of unique words the model knows
- **Encoding**: Mapping words → their ID numbers (deterministic; reversible only for words in the vocabulary, since every unknown word maps to `<UNK>`)
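
A minimal sketch of an encode/decode round trip, using a small hand-written vocabulary. `decode` is a hypothetical helper; the tokenizer class in this notebook does not define one.

```python
# Hand-written vocabulary for illustration only.
vocab = {"<PAD>": 0, "<UNK>": 1, "The": 2, "cat": 3, "sat": 4}
inv_vocab = {token_id: word for word, token_id in vocab.items()}


def encode(text):
    """Map each whitespace-separated word to its ID, or to <UNK> if unseen."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]


def decode(token_ids):
    """Map IDs back to words using the inverted vocabulary."""
    return " ".join(inv_vocab.get(token_id, "<UNK>") for token_id in token_ids)


round_trip_known = decode(encode("The cat sat"))
round_trip_unknown = decode(encode("The dog sat"))

print(round_trip_known)
print(round_trip_unknown)
```

The second round trip loses the word "dog": once a word has been replaced by `<UNK>`, the original text cannot be recovered from the IDs.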

In [5]:
tokenizer = SimpleTokenizer()
tokenizer.fit(texts)


## Step 4: Reinitialize and Fit Tokenizer

This cell creates a fresh tokenizer instance and learns the vocabulary again. It repeats the setup from the previous cell, so the resulting vocabulary is identical:
- Creates a new tokenizer object
- Calls `fit()` to learn all words from the training texts
- Prepares the tokenizer for encoding sentences into token IDs

In [6]:
training_pairs = []

for text in texts:
    tokens = tokenizer.encode(text)
    if len(tokens) > 1:
        input_seq = tokens[:-1]  # All tokens except last
        target_seq = tokens[1:]  # All tokens except first (i.e., next word)
        training_pairs.append((input_seq, target_seq))


## Step 5: Create Training Pairs

This cell prepares data in input-output pairs for supervised learning:

**How it works:**
- Encodes each sentence into token IDs
- Creates pairs where input = all tokens except last, target = all tokens except first
- This teaches the model: "Given these words, predict the next word"

**Example:** Sentence "The cat sat"
- Tokens: [2, 3, 4] (assuming these are the IDs)
- Input sequence: [2, 3] (The, cat)
- Target sequence: [3, 4] (cat, sat)
- Model learns: position 0→predict token 3, position 1→predict token 4

### Key Terms:
- **Training pair**: Input-target pair for supervised learning
- **Sequence shifting**: Creating labels by shifting sequence one position
- **Next word prediction**: The fundamental task that teaches the LLM
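
The shifting step can be sketched on words instead of IDs, which makes the input/target alignment easier to read. `make_training_pair` is a hypothetical helper that mirrors the slicing used in the cell above.

```python
def make_training_pair(tokens):
    """Return (input_seq, target_seq) shifted by one position, or None if too short."""
    if len(tokens) < 2:
        return None
    return tokens[:-1], tokens[1:]


words = ["The", "cat", "sat", "on", "the", "mat."]
inputs, targets = make_training_pair(words)

# Each input position is paired with the word that follows it.
for position, (context_word, next_word) in enumerate(zip(inputs, targets)):
    print(f"position {position}: after {context_word!r} predict {next_word!r}")
```

A sentence of `n` tokens therefore yields `n - 1` prediction targets, and single-token sentences are skipped.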

In [7]:
# Hyperparameters
vocab_size = len(tokenizer.vocab)
embed_dim = 8
num_blocks = 2
learning_rate = 0.001
epochs = 100

# Model
model = SimpleLLM(vocab_size, embed_dim, num_blocks)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    total_loss = 0
    for input_seq, target_seq in training_pairs:
        input_tensor = torch.tensor(input_seq).unsqueeze(0)  # [1, seq_len]
        target_tensor = torch.tensor(target_seq).unsqueeze(0)  # [1, seq_len]

        optimizer.zero_grad()
        logits = model(input_tensor)
        loss = criterion(logits.view(-1, vocab_size), target_tensor.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")


Epoch 1, Loss: 1009.1206
Epoch 2, Loss: 878.2931
Epoch 3, Loss: 794.0573
Epoch 4, Loss: 746.1861
Epoch 5, Loss: 706.4180
Epoch 6, Loss: 670.6301
Epoch 7, Loss: 636.2277
Epoch 8, Loss: 602.6550
Epoch 9, Loss: 570.8095
Epoch 10, Loss: 540.5299
Epoch 11, Loss: 512.0668
Epoch 12, Loss: 485.1356
Epoch 13, Loss: 459.7001
Epoch 14, Loss: 434.9153
Epoch 15, Loss: 410.9473
Epoch 16, Loss: 387.7252
Epoch 17, Loss: 365.0768
Epoch 18, Loss: 343.1710
Epoch 19, Loss: 321.7801
Epoch 20, Loss: 301.2607
Epoch 21, Loss: 281.7299
Epoch 22, Loss: 262.8966
Epoch 23, Loss: 245.5423
Epoch 24, Loss: 228.8443
Epoch 25, Loss: 213.9525
Epoch 26, Loss: 201.1660
Epoch 27, Loss: 187.1572
Epoch 28, Loss: 173.3091
Epoch 29, Loss: 163.2567
Epoch 30, Loss: 150.2955
Epoch 31, Loss: 141.4850
Epoch 32, Loss: 131.4057
Epoch 33, Loss: 123.1869
Epoch 34, Loss: 115.4446
Epoch 35, Loss: 108.2559
Epoch 36, Loss: 101.0252
Epoch 37, Loss: 100.4540
Epoch 38, Loss: 89.1398
Epoch 39, Loss: 86.3826
Epoch 40, Loss: 83.6505
Epoch 41, Loss: 76.7268
Epoch 42, Loss: 68.6956
Epoch 43, Loss: 64.0549
Epoch 44, Loss: 59.4529
Epoch 45, Loss: 55.3419
Epoch 46, Lo

## Step 6: Train the Model

This cell configures hyperparameters and trains the neural network:

### Hyperparameters (learning configuration):
- **vocab_size**: Total unique words the model knows
- **embed_dim = 8**: Each word represented as 8-number vector
- **num_blocks = 2**: Model has 2 transformer layers for depth
- **learning_rate = 0.001**: How fast weights update (smaller = more cautious)
- **epochs = 100**: Train through entire dataset 100 times

### Model Components:
- **SimpleLLM**: Main neural network model
- **CrossEntropyLoss**: Measures prediction error (lower is better)
- **Adam Optimizer**: Algorithm that adjusts model weights to minimize loss

### Training Loop (what happens each epoch):
1. For each training pair (input, target):
   - `input_tensor`, `target_tensor`: Convert the ID lists to PyTorch tensors of shape `[1, seq_len]` (a batch containing one sentence)
   - `optimizer.zero_grad()`: Reset gradients from previous step
   - `logits`: Model predictions (raw scores before probabilities)
   - `loss`: Measure how wrong predictions are
   - `loss.backward()`: Calculate gradients (how to improve)
   - `optimizer.step()`: Update weights using gradients
2. Print the loss summed over all training pairs for the epoch (shows progress)
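
The forward / loss / gradient / update cycle can be sketched on a one-weight model. This is an illustration with plain gradient descent rather than Adam, and the function name, data point, and learning rate are all made up for the example.

```python
def train_one_weight(x, y, learning_rate=0.1, steps=20):
    """Fit prediction = w * x to a single target y with plain gradient descent."""
    w = 0.0
    losses = []
    for _ in range(steps):
        prediction = w * x                 # forward pass
        loss = (prediction - y) ** 2       # squared error: how wrong the prediction is
        grad = 2 * (prediction - y) * x    # d(loss)/dw, the quantity backward() computes
        w -= learning_rate * grad          # the update optimizer.step() applies for plain SGD
        losses.append(loss)
    return w, losses


w, losses = train_one_weight(x=2.0, y=6.0)
print(round(w, 6), losses[0], losses[-1])
```

The weight moves toward 3.0 (since 3.0 * 2.0 = 6.0) and the loss shrinks along the way, which is the same pattern the epoch losses in this notebook show at a larger scale.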

### Key Terms:
- **Gradient descent**: Optimization by following negative gradient
- **Backpropagation**: Computing gradients using chain rule
- **Epoch**: One complete pass through all training data
- **Logits**: Unnormalized prediction scores (before softmax)
- **Loss**: Error metric (CrossEntropyLoss combines softmax + cross-entropy)
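
For a single prediction, cross-entropy is the negative log of the probability the model assigns to the correct word. The plain-Python sketch below computes it directly from logits; the four-word vocabulary and the scores are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """-log(softmax(logits)[target_index]), computed stably via log-sum-exp."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target_index]


uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)    # no preference
confident_right = cross_entropy([0.0, 0.0, 10.0, 0.0], target_index=2)
confident_wrong = cross_entropy([10.0, 0.0, 0.0, 0.0], target_index=2)

print(round(uniform_loss, 4), round(confident_right, 4), round(confident_wrong, 4))
```

A model that spreads probability evenly over `V` words has a loss of `ln(V)` per prediction, which gives a rough baseline for an untrained model. The value printed by the training loop in this notebook is a sum over training pairs, so it grows with the size of the dataset.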

In [9]:
def generate_text(model, tokenizer, prompt, max_length=10):
    model.eval()
    tokens = tokenizer.encode(prompt)
    generated = tokens.copy()
    
    with torch.no_grad():
        for _ in range(max_length):
            input_tensor = torch.tensor(generated).unsqueeze(0)
            logits = model(input_tensor)
            next_token_logits = logits[0, -1, :]
            next_token = torch.argmax(next_token_logits).item()
            generated.append(next_token)
    
    # Decode back to text
    inv_vocab = {v: k for k, v in tokenizer.vocab.items()}
    generated_words = [inv_vocab.get(token, "<UNK>") for token in generated]
    return " ".join(generated_words)

generated_text = generate_text(model, tokenizer, "The sun", max_length=10)
print("Generated text:", generated_text)


Generated text: The sun over exam. kites. neighborhood. an the groceries. playing going the


## Step 7: Generate Text Using Trained Model

This cell defines a text generation function and uses it to produce new text:

### generate_text() Function:
- **model.eval()**: Switches to evaluation mode, which changes the behavior of layers such as dropout (this model has none, so the call is a safeguard rather than a necessity)
- **torch.no_grad()**: Disables gradient calculation to save memory (we don't train here)
- **Initialization**: Start with prompt tokens
- **Generation loop** (max_length times):
  - Pass generated sequence through model
  - Get `logits` (prediction scores) for all positions
  - Extract last position's logits: `logits[0, -1, :]`
  - **torch.argmax()**: Find the token ID with the highest score (softmax preserves order, so this is also the highest-probability token)
  - Append to generated sequence
- **Decoding**: Create inverse vocabulary to map IDs back to words
- **Return**: Join words with spaces to create readable text

### Key Terms:
- **Inference**: Using trained model to make predictions (no training)
- **Greedy decoding**: Always selecting the highest-scoring token (simple and deterministic, but prone to repetitive or low-quality output)
- **Logits**: Raw prediction scores (before softmax normalization)
- **Model evaluation mode**: Turns dropout off and makes batch normalization use its running statistics instead of per-batch statistics
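
Greedy decoding is one of several ways to pick the next token. The sketch below contrasts it with temperature sampling, a common alternative that is not used in this notebook. The helper names and the three scores are hypothetical.

```python
import math
import random


def greedy_pick(logits):
    """Index of the largest score (the same choice argmax makes when there are no ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Sample an index with probability proportional to softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    weights = [math.exp(z - top) for z in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.5, 2.0, 1.9]  # hypothetical scores for a 3-word vocabulary
greedy_choice = greedy_pick(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = [sample_pick(logits, temperature=1.0, rng=rng) for _ in range(200)]

print(greedy_choice, sorted(set(sampled)))
```

Greedy decoding returns index 1 every time, even though index 2 scores almost as high. Sampling occasionally picks the runner-up, which tends to reduce repetition at the cost of determinism.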