# **5: From Pixels to Prose - Your First Language Model**

In the previous sections, we explored the fascinating world of `computer vision`, where machines learn to see and interpret images. Now, we are about to embark on a new journey into the realm of `natural language processing (NLP)`, where machines learn to understand and generate human language. In this section, we will build our first language model, a powerful tool that can generate coherent and meaningful text based on the patterns it has learned from a large corpus of data.

Deep learning isn't limited to images and vision; it has also transformed how we process language. Language models are at the heart of many applications, from chatbots and virtual assistants to machine translation and content generation.

In this section, we will explore the architecture of language models, starting with the simplest possible one: the `Bigram Model`. This model predicts the next token in a sequence from the current token alone, allowing us to generate text one token at a time. It has no memory of anything earlier in the sequence, which makes it a weak text generator but an excellent tool for understanding the basics of language modeling. Despite its simplicity, the `Bigram Model` will teach us the fundamental concepts of language modeling, such as tokenization, probability estimation, batch creation, and autoregressive text generation.



## **The Data and the Tokenizer**

To build our language model, we need a dataset to train on. For this example, we will use the Tiny Shakespeare dataset, a small corpus of Shakespeare's works that is often used for character-level language modeling. The first step in processing this data is to tokenize it, which means breaking down the text into smaller units called tokens. Tokens can be words, subwords, or even characters, depending on the level of granularity we want to achieve.

Before we can train our model, we need to convert our text data into a format that the model can understand. This involves creating a vocabulary of unique tokens and mapping each token to a unique integer index. This process is known as `tokenization`.

A bigram model can be built over any kind of token, but we will use a `character-level tokenizer`, which breaks the text down into individual characters. This keeps the vocabulary tiny and lets us model the structure of the language at a very fine-grained level.

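For example, given the sentence "Hello world", our character-level tokenizer would produce the token sequence `H`, `e`, `l`, `l`, `o`, ` ` (space), `w`, `o`, `r`, `l`, `d`. The vocabulary holds only the unique characters: `H`, `e`, `l`, `o`, ` ` (space), `w`, `r`, `d`. Each unique character is then assigned an integer index.

As an illustration (the actual indices depend on which characters appear in the dataset):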

- `a` -> 0
- `b` -> 1
- `c` -> 2
- `d` -> 3
- `e` -> 4
- `f` -> 5
- And so on...
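
The "Hello world" example can be sketched in a few lines of plain Python. This is only an illustration on a toy sentence; the `toy_` names are made up here so they do not clash with the notebook's own variables.

```python
# Toy example: character-level tokenization of a single sentence.
toy_text = "Hello world"

toy_tokens = list(toy_text)        # one token per character, repeats included
toy_vocab = sorted(set(toy_text))  # unique characters, sorted by code point

# Map each character to an index and back.
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

toy_ids = [toy_stoi[ch] for ch in toy_tokens]

print(toy_tokens)  # 11 tokens
print(toy_vocab)   # 8 unique characters
print(toy_ids)
```

Note that the indices follow the sorted order of the characters that actually occur, so the space comes first and uppercase letters precede lowercase ones.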


## **The Components of our Tokenizer**

1. `Vocabulary Creation`: We will create a vocabulary of unique tokens from our dataset. This involves iterating through the text and collecting all the unique characters (or words) that appear in the dataset.

2. `stoi (string to index)`: We will create a mapping from each token in our vocabulary to a unique integer index. This allows us to convert our text data into numerical format that can be fed into the model.

3. `itos (index to string)`: We will also create a mapping from integer indices back to their corresponding tokens. This is useful for converting the model's output back into human-readable text.

4. `Encoding and Decoding Functions`: We will implement functions to encode text into sequences of integer indices and to decode sequences of indices back into text. This will allow us to easily convert between the raw text and the numerical format required for training our model.

With these components in place, we will be able to preprocess our text data and prepare it for training our `Bigram Model`. The tokenizer will play a crucial role in ensuring that our model can effectively learn the patterns and structure of the language from the dataset.

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
import urllib.request as request

# Download the tiny shakespeare dataset
# This is a small dataset of Shakespeare's works, which is often used for character-level language modeling.
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

print("Downloading dataset...")
response = request.urlopen(url)
data = response.read().decode('utf-8')
print("Dataset downloaded.")

Downloading dataset...
Dataset downloaded.


In [7]:
print(f"Dataset length: {len(data)} characters")
print(f"First 250 characters:\n{data[:250]}")

Dataset length: 1115394 characters
First 250 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [9]:
# Create the vocabulary of unique characters in the dataset
chars = sorted(list(set(data)))
vocab_size = len(chars)
print(f"Characters: {''.join(chars)}")
print(f"Vocabulary size: {vocab_size} unique characters")

Characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65 unique characters


In [20]:
# Create the mappings from characters to integers and vice versa
stoi = { ch:i for i,ch in enumerate(chars) } # string to integer
itos = { i:ch for i,ch in enumerate(chars) } # integer to string

print(f"stoi mapping: {list(stoi.items())[:10]}")  # Print first 10 mappings
print(f"itos mapping: {list(itos.items())[:10]}\n")  # Print first 10 mappings
print(f"Example itos mapping: 1 -> '{itos[1]}', 2 -> '{itos[2]}', 3 -> '{itos[3]}'")
print(f"Example stoi mapping: ' ' -> {stoi[' ']}, 'a' -> {stoi['a']}, 'b' -> {stoi['b']}\n")
print(f"Total unique characters: {len(stoi)}")
print(f"Total unique characters (itos): {len(itos)}")

stoi mapping: [('\n', 0), (' ', 1), ('!', 2), ('$', 3), ('&', 4), ("'", 5), (',', 6), ('-', 7), ('.', 8), ('3', 9)]
itos mapping: [(0, '\n'), (1, ' '), (2, '!'), (3, '$'), (4, '&'), (5, "'"), (6, ','), (7, '-'), (8, '.'), (9, '3')]

Example itos mapping: 1 -> ' ', 2 -> '!', 3 -> '$'
Example stoi mapping: ' ' -> 1, 'a' -> 39, 'b' -> 40

Total unique characters: 65
Total unique characters (itos): 65


In [21]:
print(f"Example stoi mapping: 'a' -> {stoi.get('a', 'N/A')}")
print(f"Example stoi mapping: 'H' -> {stoi.get('H', 'N/A')}")

Example stoi mapping: 'a' -> 39
Example stoi mapping: 'H' -> 20


In [24]:
# Define encode and decode functions
def encode(s):
    "encoder takes a string and outputs a list of integers"
    return [stoi[c] for c in s]

def decode(l):
    "decoder takes a list of integers and outputs a string"
    return ''.join([itos[i] for i in l])

# Test the encode and decode functions
test_string = "Hello, World!"
encoded = encode(test_string)
decoded = decode(encoded)

print(f"Test string: {test_string}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
assert test_string == decoded, "Decoded string does not match original"
print(f"Round-trip successful: '{test_string}' -> {encoded} -> '{decoded}'")

Test string: Hello, World!
Encoded: [20, 43, 50, 50, 53, 6, 1, 35, 53, 56, 50, 42, 2]
Decoded: Hello, World!
Round-trip successful: 'Hello, World!' -> [20, 43, 50, 50, 53, 6, 1, 35, 53, 56, 50, 42, 2] -> 'Hello, World!'


## **Creating Batches of Data for Language Models**

Unlike in image classification, where each data point is an independent image (one image corresponds to one label), language modeling involves sequences of tokens that are interdependent. The model needs to learn the relationships and patterns in the text, which means that we need to create batches of data that capture these dependencies.


### **The Input-Output Relationship in Language Models**

For a language model, the input is a sequence of tokens (e.g., characters or words), and the output is the next token in the sequence. For example, if we have the input sequence "H", "e", "l", "l", "o", the model should learn to predict the next token, which is " " (space). This means that our training data will consist of pairs of input sequences and their corresponding target tokens.

For example, from the text "Hello world" we can create training pairs like this:

- Input: "H" -> Target: "e"
- Input: "He" -> Target: "l"
- Input: "Hel" -> Target: "l"
- Input: "Hell" -> Target: "o"
- Input: "Hello" -> Target: " " (space)
- Input: "Hello " -> Target: "w"
- And so on.

The targets are the same chunk shifted by one character to the right, so every position yields a new input-target pair.

For every character in the input sequence, we will have a corresponding target character that the model needs to learn to predict. This creates a rich dataset of input-target pairs that the model can use to learn the patterns and structure of the language.
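
The pairs listed above can be enumerated directly. The following is a small sketch on the toy sentence, using hypothetical names (`pairs`, `shifted_inputs`, `shifted_targets`) that are not part of the notebook.

```python
toy_text = "Hello world"

# Each prefix of the text is an input; the character that follows it is the target.
pairs = [(toy_text[:t], toy_text[t]) for t in range(1, len(toy_text))]

for context, target in pairs[:6]:
    print(f"Input: {context!r} -> Target: {target!r}")

# Equivalent view: the targets are the inputs shifted by one position.
shifted_inputs = toy_text[:-1]
shifted_targets = toy_text[1:]
print(repr(shifted_inputs))
print(repr(shifted_targets))
```

The shifted view is the one used for batching: a single chunk and its one-step-shifted copy encode every input-target pair at once.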

### **Block Size and Context Window**

The `block size` (or `context window`) is a crucial hyperparameter in language modeling. It determines how many previous tokens the model can see when making a prediction. For example, if we set a block size of 5, the model will only be able to see the last 5 tokens when predicting the next token. This means that the model will learn to capture dependencies and patterns within that context window.

For a `Bigram Model`, the block size is effectively 1, since the model only looks at the previous token to predict the next one. However, as we move to more complex models like `Trigram Models` or `Transformer-based Models`, we can increase the block size to capture longer-range dependencies in the text.
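
To make the effect of the block size concrete, here is a minimal sketch of what a model "sees" for different context lengths. The helper `visible_context` is hypothetical and assumes `block_size >= 1`.

```python
def visible_context(tokens, block_size):
    """Return the most recent `block_size` tokens (assumes block_size >= 1)."""
    return tokens[-block_size:]

history = list("Hello")

print(visible_context(history, 1))   # bigram: only the last token
print(visible_context(history, 3))   # a wider window
print(visible_context(history, 10))  # window larger than the history: everything
```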


### **Creating Batches of Data**

To create batches of data for training our language model, we will implement a function that takes in the raw text data and generates batches of input-target pairs based on the specified block size. This function will randomly sample sequences from the dataset and create corresponding target tokens for each input sequence. The batch shape will be `(batch_size, block_size)`, where

- `batch_size` is the number of sequences in each batch and 
- `block_size` is the length of each input sequence.

In [25]:
# Set device to GPU if available or MPS if on Apple Silicon, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

Using device: mps


In [26]:
# Convert the entire text dataset into a tensor of integers
data_tensor = torch.tensor(encode(data), dtype=torch.long)
print(f"Data tensor shape: {data_tensor.shape}")
print(f"First 10 integers in data tensor: {data_tensor[:10].tolist()}")
print(f"Data type of data tensor: {data_tensor.dtype}")
print(f"Total tokens in dataset: {len(data_tensor)}")

Data tensor shape: torch.Size([1115394])
First 10 integers in data tensor: [18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
Data type of data tensor: torch.int64
Total tokens in dataset: 1115394


In [27]:
# Split into train and validation sets (90% train, 10% validation)
train_size = int(0.9 * len(data_tensor))
train_data = data_tensor[:train_size]
val_data = data_tensor[train_size:] 
print(f"Train data shape: {train_data.shape}")
print(f"Validation data shape: {val_data.shape}")

Train data shape: torch.Size([1003854])
Validation data shape: torch.Size([111540])


In [41]:
# Define batch creation function
def get_batch(split, batch_size=4, block_size=8):
    """
    Generate a batch of data for training or validation.
    
    Args:
        split: 'train' or 'val' to specify which dataset to use
        batch_size: Number of sequences in the batch
        block_size: Length of each sequence (context length)
    
    Returns:
        x: Tensor of shape (batch_size, block_size) containing input sequences
        y: Tensor of shape (batch_size, block_size) containing target sequences
    """
    data = train_data if split == 'train' else val_data
    
    # Randomly select starting indices for the batch
    # We subtract block_size to ensure we have enough characters for the target sequence
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # Create input (x) and target (y) tensors
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    
    return x, y


# Test the get_batch function
x_batch, y_batch = get_batch('train', batch_size=4, block_size=8)
print(f"\nBatch example:")
print(f"x_batch shape: {x_batch.shape}")
print(f"y_batch shape: {y_batch.shape}")

print(f"\nFirst example:")
print(f"Input (x[0]): {x_batch[0].tolist()}")
print(f"Target (y[0]): {y_batch[0].tolist()}")

print(f"\nDecoded first example:")
print(f"Input (x[0]): '{decode(x_batch[0].tolist())}'")
print(f"Target (y[0]): '{decode(y_batch[0].tolist())}'")

print(f"\nBatch details:")
print(f"x_batch:\n{x_batch}")
print(f"y_batch:\n{y_batch}")


Batch example:
x_batch shape: torch.Size([4, 8])
y_batch shape: torch.Size([4, 8])

First example:
Input (x[0]): [47, 52, 58, 43, 56, 54, 56, 43]
Target (y[0]): [52, 58, 43, 56, 54, 56, 43, 58]

Decoded first example:
Input (x[0]): 'interpre'
Target (y[0]): 'nterpret'

Batch details:
x_batch:
tensor([[47, 52, 58, 43, 56, 54, 56, 43],
        [52,  1, 58, 46, 47, 57,  1, 49],
        [43, 39, 60, 43, 52,  1, 44, 53],
        [13, 63,  6,  1, 46, 39, 52, 42]])
y_batch:
tensor([[52, 58, 43, 56, 54, 56, 43, 58],
        [ 1, 58, 46, 47, 57,  1, 49, 47],
        [39, 60, 43, 52,  1, 44, 53, 56],
        [63,  6,  1, 46, 39, 52, 42,  1]])


## **The Bigram Model Architecture**

The `Bigram Model` is a simple language model that predicts the next token in a sequence based on the previous token. It is called a "bigram" model because it considers pairs of tokens (bigrams) when making predictions.
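
Before turning to the neural version, it may help to see the same idea as plain counting. The sketch below estimates the probability of the next character given the current one from adjacent pairs in a toy string. This counting approach is an illustration only, not the method this notebook trains, and `bigram_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next | current) from adjacent character pairs by counting."""
    counts = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

toy_probs = bigram_probabilities("hello hello help")
print(toy_probs["l"])  # 'l' is followed by 'l' twice, 'o' twice, 'p' once
print(toy_probs["h"])  # 'h' is always followed by 'e'
```

The neural model learns a table with the same role through gradient descent instead of counting.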

Our implementation is built on a single `nn.Embedding` layer, which you can think of as a lookup table that maps each token index to a vector. The input is the index of a character (e.g., the index of `H` in the vocabulary), and the output is the row of the table stored for that character. In most models that row is a dense representation passed on to further layers. In the `Bigram Model` the table has shape `(vocab_size, vocab_size)`, so each row is used directly as the scores for every possible next character. Training adjusts these rows so that characters likely to follow the current one receive higher scores.

The `input` to the embedding layer is a batch of token indices, and the output is a batch of corresponding embeddings. For example, if we have a batch of input token indices of shape `(batch_size, block_size)`, the output from the embedding layer will be of shape `(batch_size, block_size, embedding_dim)`, where `embedding_dim` is the size of the vector stored for each token. In our model, `embedding_dim` equals `vocab_size`.


### **How it works**

1. `Input`: The input to the model is a batch of token indices of shape `(batch_size, block_size)`.

2. `Embedding Lookup`: The embedding layer looks up the row stored for each token index in the input. Each row is a vector of size `vocab_size` (the number of unique tokens) containing logits (raw scores) for each possible next token.

3. `Output`: These logits represent the model's prediction for which character (or token) is most likely to come next in the sequence. The embedding table is the model's only set of parameters, and training adjusts it to improve the predictions over time.
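
The lookup step can be mimicked with nested Python lists to make the shapes visible. This is a sketch under simplifying assumptions: a toy vocabulary of 5 tokens, random numbers standing in for learned values, and hypothetical `toy_` names. `nn.Embedding` performs the same indexing on a learnable tensor.

```python
import random

rng = random.Random(0)
toy_vocab_size = 5

# A (vocab_size x vocab_size) table: row i holds the logits for the token that follows token i.
toy_table = [[rng.gauss(0.0, 1.0) for _ in range(toy_vocab_size)]
             for _ in range(toy_vocab_size)]

# A batch of token indices with shape (batch_size=2, block_size=3).
toy_batch = [[0, 3, 1],
             [4, 4, 2]]

# The "forward pass" is pure indexing: every index is replaced by its row.
toy_logits = [[toy_table[token] for token in sequence] for sequence in toy_batch]

# Shape: (batch_size, block_size, vocab_size)
print(len(toy_logits), len(toy_logits[0]), len(toy_logits[0][0]))
```

Because the output depends only on the index at each position, identical tokens always produce identical logits, no matter what precedes them.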


### **Autoregressive Generation**

Once the model is trained, we can use it to generate text by feeding in an initial token and repeatedly predicting the next token until we reach a desired length of generated text. This process is known as `autoregressive generation`, where the model generates one token at a time based on the previously generated tokens.

To generate new text, we use an `autoregressive loop`:

- Start with a seed character (or prompt)
- Feed the seed character into the model and get the predicted next character
- Sample from the predicted probabilities to get the next character (using the logits output from the model)
- Append the sampled character to the input sequence
- Use that new sequence as the input for the next prediction
- Repeat this process until we have generated the desired length of text

For the `Bigram Model`, only the last token of the sequence influences each prediction, so the generated text follows character-pair statistics but shows no longer-range coherence.
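
The loop described above can be sketched without any deep learning library. The example below assumes a hand-written 3-token logit table (`toy_table`) in place of a trained model; `softmax` and `toy_generate` are hypothetical helpers written for this illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_generate(table, start, max_new_tokens, rng):
    """Sample tokens one at a time; each step looks only at the last token."""
    sequence = [start]
    for _ in range(max_new_tokens):
        probs = softmax(table[sequence[-1]])
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(next_token)
    return sequence

# Row i holds the logits for the token that follows token i.
toy_table = [[0.0, 2.0, 0.0],
             [0.0, 0.0, 2.0],
             [2.0, 0.0, 0.0]]

sample = toy_generate(toy_table, start=0, max_new_tokens=10, rng=random.Random(0))
print(sample)
```

Sampling from the distribution, rather than always taking the highest-scoring token, is what lets repeated runs produce different text.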

In [43]:
class BigramLanguageModel(nn.Module):
    """
    A simple Bigram Language Model that predicts the next token based on the current token.
    
    This model predicts the next character based solely on the previous character.
    Despite its simplicity, it captures the statistical relationships between
    adjacent characters in the training data.
    """
    
    def __init__(self, vocab_size):
        super().__init__()
        # Each token (character) directly reads off the logits for the 
        # next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        """
        Forward pass of the model.
        
        Args:
            idx: Input tensor of shape (batch_size, block_size) containing token indices
            targets: Optional target tensor of shape (batch_size, block_size)
                    If provided, loss is computed
        
        Returns:
            logits: Tensor of shape (batch_size, block_size, vocab_size) if targets is None;
                    flattened to (batch_size * block_size, vocab_size) if targets are provided
            loss: Scalar cross-entropy loss, or None if targets is None
        """
        # idx and targets are both (batch_size, block_size) tensors of integers
        
        # Get logits for each position in the input
        # Shape: (batch_size, block_size, vocab_size)
        logits = self.token_embedding_table(idx)
        
        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy: (batch_size * block_size, vocab_size) and (batch_size * block_size,)
            B, T, C = logits.shape  # Batch size, Time steps, Vocabulary size
            logits = logits.view(B*T, C)  # Reshape to (B*T, C)
            targets = targets.view(B*T)    # Reshape to (B*T)
            
            # compute cross-entropy loss
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """
        Generate new tokens given a starting sequence.
        
        Args:
            idx: Input tensor of shape (batch_size, block_size) containing token indices
            max_new_tokens: Number of new tokens to generate
        
        Returns:
            idx: Tensor of shape (batch_size, block_size + max_new_tokens) containing the original and generated token indices
        """
        for _ in range(max_new_tokens):
            # Get the logits for every position in the current sequence
            logits, _ = self(idx)
            
            # Focus only on the last time step's logits
            logits = logits[:, -1, :]  # Shape: (batch_size, vocab_size)
            
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # Shape: (batch_size, vocab_size)
            
            # Sample from the distribution to get the next token index
            next_token = torch.multinomial(probs, num_samples=1)  # Shape: (batch_size, 1)
            
            # Append the sampled token to the input sequence
            idx = torch.cat((idx, next_token), dim=1)  # Shape: (batch_size, current_length + 1)
        
        return idx
    

# Instantiate the model
model = BigramLanguageModel(vocab_size)
print(f"Model:\n{model}\n")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters())}")

# test forward pass
x, y = get_batch('train', batch_size=4, block_size=8)
logits, loss = model(x, y)
print(f"\nForward pass test:")
print(f"Input shape: {x.shape}")
print(f"Output shape: {logits.shape}")
print(f"Loss: {loss.item():.4f}")

Model:
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)

Number of parameters: 4225

Forward pass test:
Input shape: torch.Size([4, 8])
Output shape: torch.Size([32, 65])
Loss: 4.4530


### **Training and Generation**

To train the `Bigram Model`, we will use the input-target pairs we created earlier. We will feed the input sequences into the model and compute the loss based on the predicted next token and the actual target token. We will then backpropagate the loss and update the model's parameters using an optimizer.

- `Forward pass`: We will pass the input sequences through the model to get the predicted logits for the next token.

- `Loss computation`: We will compute the loss using a suitable loss function (e.g., cross-entropy loss) that compares the predicted logits with the actual target tokens.

- `Backpropagation`: We will backpropagate the loss to compute the gradients of the model's parameters.

- `Parameter update`: We will use an optimizer (e.g., Adam) to update the model's parameters based on the computed gradients.

- `Repeat` this process for many iterations, each on a freshly sampled random batch, until the loss stops improving.
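
As a reference point for the loss values, it helps to know what cross-entropy gives for a model that has learned nothing. The sketch below computes cross-entropy by hand (the helper `cross_entropy` is written for this illustration, not taken from PyTorch) and evaluates it for all-equal logits over the 65 characters of this notebook's vocabulary.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

# With all-equal logits, every one of the 65 characters has probability 1/65.
uniform_loss = cross_entropy([0.0] * 65, target=10)
print(f"{uniform_loss:.4f}")  # ln(65), about 4.1744
```

A randomly initialized model typically starts somewhat above this value, as the 4.4530 in the forward pass test shows, and training should push the loss well below it.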

After training, we can use the model to generate new text by providing a seed character and using the autoregressive generation process described above. The output will mimic the character-level statistics of the training data without forming real sentences, which is the expected ceiling for a model that sees only one character of context.

In [46]:
# Create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
print(f"Optimizer:\n{optimizer}")

# Training loop
batch_size = 32
block_size = 8
max_iters = 5000
eval_interval = 500
eval_iters = 200

print("\nStarting training...\n")

for step in range(max_iters):  # `step` avoids shadowing the built-in `iter`
    # Every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = {}
        model.eval() # Set model to evaluation mode
        with torch.no_grad(): # No gradients are needed for evaluation
            for split in ['train', 'val']:
                split_losses = []
                for _ in range(eval_iters):
                    xb, yb = get_batch(split, batch_size, block_size)
                    _, loss = model(xb, yb)
                    split_losses.append(loss.item())
                # Average the loss over eval_iters random batches
                losses[split] = sum(split_losses) / len(split_losses)
        print(f"Step {step}: Train loss = {losses['train']:.4f}, Val loss = {losses['val']:.4f}\n")
        model.train() # Set model back to training mode
        
    # Sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    
    # Evaluate the loss
    logits, loss = model(xb, yb)
    
    # Clear gradients before backward pass
    optimizer.zero_grad(set_to_none=True)
    
    # backpropagate to compute gradients
    loss.backward()
    
    # update the parameters
    optimizer.step()

print("Training completed.")

# Generate some text using the trained model
print("\n" + "="*40 + "\n")
print("Generating text...\n")
print("="*40 + "\n")

# Start with a newline character (or any character you like)
context = torch.zeros(1, 1, dtype=torch.long) # Starting with a single token (e.g., newline)
generated = model.generate(context, max_new_tokens=500)[0].tolist() # Generate 500 new tokens
print(decode(generated))
print("\n" + "="*40 + "\n")
print("Note: The output may seem random or nonsensical.")
print("This is expected for such a simple model!")
print("The model is learning basic character-level statistics, but it does not have the capacity to learn complex language patterns or long-range dependencies.")
print("To generate more coherent text, we would need to use a more sophisticated model architecture (e.g., RNNs, Transformers) and train on a larger dataset for more iterations.")
print("This example serves as a starting point to understand how language models work at a fundamental level.")
print("Feel free to experiment with the model, increase the training iterations, or try different starting contexts to see how it affects the generated text!")
print("\n" + "="*40 + "\n")

Optimizer:
AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0.01
)

Starting training...

Step 0: Train loss = 2.4677, Val loss = 2.4931

Step 500: Train loss = 2.4532, Val loss = 2.4900

Step 1000: Train loss = 2.4518, Val loss = 2.4844

Step 1500: Train loss = 2.4589, Val loss = 2.4792

Step 2000: Train loss = 2.4618, Val loss = 2.4850

Step 2500: Train loss = 2.4528, Val loss = 2.4828

Step 3000: Train loss = 2.4531, Val loss = 2.4974

Step 3500: Train loss = 2.4585, Val loss = 2.4805

Step 4000: Train loss = 2.4503, Val loss = 2.4827

Step 4500: Train loss = 2.4645, Val loss = 2.4822

Step 4999: Train loss = 2.4642, Val loss = 2.4917

Training completed.


Generating text...



ABoryon'sish ice broe n,
T:
Asthest AMathe hes s.
ILENENDWNomowifre WA:
IZANCEO hery bleld h ho ir'sthatrs 