# Part 1: DIY Transformer and Language Modeling

In this part of the assignment we will build our own Transformer architecture and see how it can be used for causal ("left to right" or "decoder-only") language modeling, similar to the GPT line of models. We will train a very simple language model for demonstration purposes, examining how we can train a model in a self-supervised fashion and then generate text autoregressively. 

The model we train in this part will not be impressive in and of itself. Pretraining an effective language model requires days, weeks, or months of compute time on several GPUs over large quantities of data, which would be impractical for a learning assignment (not to mention the environmental impact). Nevertheless, this part of the assignment will help us to develop a deeper understanding of the transformer architecture that undergirds essentially all language models. In later parts, we will perform inference and fine-tuning with much larger and more performant large language models that have already been pretrained. 

**Learning objectives.** You will:
1. Implement the transformer architecture in PyTorch
2. Pretrain a small causal language model in a self-supervised manner
3. Use a causal language model to autoregressively generate text

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate your training, consider using GPU resources such as `CUDA` through the CS department cluster. Alternatives include Google Colab or a local GPU for those running on machines with GPU support.

## Task 1

This transformer implementation will be DIY in the sense that you may not use the `Transformer` module in PyTorch, but you will otherwise use PyTorch extensively. In particular, it is expected that you will use the following modules, substantially simplifying the implementation.
- [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) for both the word embedding and the positional encoding/embedding.
- [`MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html)
- [`LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
- [`Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
- [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) or other nonlinear activations of your choice
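
As a rough illustration of the `MultiheadAttention` interface listed above, the sketch below runs causal self-attention on a single unbatched sequence. The sizes are arbitrary. It assumes a PyTorch version that accepts unbatched `(seq_len, d_embed)` inputs, and it relies on the documented convention that `True` in a boolean `attn_mask` marks a position that may *not* be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary illustrative sizes; d_embed must be divisible by num_heads.
seq_len, d_embed, num_heads = 5, 16, 4

attn = nn.MultiheadAttention(d_embed, num_heads)
x = torch.randn(seq_len, d_embed)  # unbatched input: one row per token

# Causal mask: True above the diagonal blocks attention to future tokens.
blocked = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Self-attention uses the same tensor as query, key, and value.
out, weights = attn(x, x, x, attn_mask=blocked)

print(out.shape)      # same shape as x
print(weights.shape)  # (seq_len, seq_len), averaged over heads
```

Each row of `weights` sums to 1 and has zeros above the diagonal, so token `i` only draws on tokens `0..i`.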

In [8]:
import torch
import torch.nn as nn
import math

class TransformerBlock(nn.Module):
    """
    Implements a standard transformer block:
    1. multi-head self-attention (with nn.MultiheadAttention)
    2. residual add and layer norm
    3. position-wise feedforward MLP with a single hidden layer
    4. another residual add and layer norm
    """
    def __init__(self, d_embed, num_heads):
        super().__init__()
        # PyTorch's built-in multi-head attention layer
        self.attn = nn.MultiheadAttention(d_embed, num_heads)
        self.layer_norm1 = nn.LayerNorm(d_embed)
        self.ffn = nn.Sequential(
            nn.Linear(d_embed, 256),  # Feedforward hidden layer size (256)
            nn.ReLU(),
            nn.Linear(256, d_embed)
        )
        self.layer_norm2 = nn.LayerNorm(d_embed)

    def forward(self, x, attn_mask):
        """
        Unbatched input: x should be a rank 2 tensor with 
        a row per input token and column per embedding dimension.
        attn_mask should be a boolean (seq_length, seq_length) tensor
        in which True marks a position that may be attended to.
        Output should have the same shape as x.
        """
        # Multi-head Self Attention + Residual connection
        # nn.MultiheadAttention treats True in a boolean mask as "blocked",
        # so the "may attend" mask is inverted before it is passed in
        attn_output, _ = self.attn(x, x, x, attn_mask=~attn_mask, need_weights=False)
        x = self.layer_norm1(x + attn_output)  # Residual connection and layer norm

        # Feedforward Network + Residual connection
        ffn_output = self.ffn(x)
        x = self.layer_norm2(x + ffn_output)  # Residual connection and layer norm

        return x


class Transformer(nn.Module):
    """
    Implements a standard decoder-only transformer 
    for causal language modeling: 
    1. input word embedding from vocab_size to d_embed
    2. add positional embedding (support max_length tokens)
    3. pass through n_blocks TransformerBlocks
    4. unembedding linear layer to vocab_size
    """
    def __init__(self, vocab_size, d_embed=64, num_heads=4, max_length=64, n_blocks=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_embed)
        self.positional_embedding = nn.Embedding(max_length, d_embed)
        self.blocks = nn.ModuleList([TransformerBlock(d_embed, num_heads) for _ in range(n_blocks)])

        self.fc_out = nn.Linear(d_embed, vocab_size)

    def forward(self, x):
        """
        Unbatched input: x should be a rank 1 tensor with 
        indices of the tokens within the vocabulary. 
        Output should be a rank 2 tensor with a row per input token
        containing unnormalized logits over the vocabulary.
        Note: Must compute causal attn_mask to provide to any
        Transformer blocks.
        """
        seq_length = x.size(0) 
        device = x.device 
        
        token_embeds = self.embedding(x) 
        positions = torch.arange(0, seq_length, device=device) 
        positional_embeds = self.positional_embedding(positions)  
        x = token_embeds + positional_embeds 
        
        attn_mask = torch.triu(torch.ones(seq_length, seq_length, device=device), diagonal=1) == 0
        
        for block in self.blocks:
            x = block(x, attn_mask)

        logits = self.fc_out(x)  # (seq_length, vocab_size)
        return logits

For this problem, we first wrote detailed pseudocode for each section based on the lecture slides, then provided that pseudocode to ChatGPT for candidate code implementations before writing our own implementation.

## Task 2

Now we will train our transformer as a causal language model on the following `simple_sentences` dataset consisting of 96 single-sentence statements about cats. We are intentionally using this very small *toy* dataset so that you can efficiently train and experiment with your model. It is clearly far too little text for training a large, complex language model, which is generally very computationally expensive.

Don't forget to run the following cell to define `simple_sentences`.

In [9]:
# Run, but you do not need to modify this code

simple_sentences = ["Cats are furry animals that many people keep as pets.",
    "Most cats have soft fur that comes in many colors and patterns.",
    "Cats have sharp claws that they use for climbing and scratching.",
    "Many cats enjoy sleeping for long hours during the day.",
    "Cats are known for their ability to always land on their feet when they fall.",
    "Kittens are baby cats that are very playful and cute.",
    "Cats have excellent night vision, which helps them hunt in the dark.",
    "Most cats are very good at keeping themselves clean by licking their fur.",
    "Cats communicate with each other and with humans by meowing.",
    "Many cats enjoy chasing and playing with small toys.",
    "Cats have a strong sense of balance and can walk on narrow surfaces easily.",
    "Some cats like to sit on windowsills and watch the world outside.",
    "Cats have rough tongues that feel like sandpaper when they lick you.",
    "Many cats enjoy being petted and will purr to show they are happy.",
    "Cats are often independent animals that don't need constant attention.",
    "Some cats like to knock things off tables and shelves for fun.",
    "Cats have whiskers that help them sense their surroundings.",
    "Many cats are good at catching mice and other small animals.",
    "Some cats enjoy playing with laser pointers, chasing the red dot.",
    "Cats often stretch after waking up from a nap.",
    "Many cats like to sit in boxes, even if the box seems too small for them.",
    "Cats have retractable claws that they can extend when needed.",
    "Some cats enjoy playing with water, while others avoid it.",
    "Cats have a keen sense of smell that helps them find food.",
    "Many cats like to perch on high places to observe their surroundings.",
    "Cats often knead their paws on soft surfaces, which is called 'making biscuits'.",
    "Some cats are very vocal and will meow a lot to get attention.",
    "Cats have excellent hearing and can detect very quiet sounds.",
    "Many cats enjoy chasing strings or ribbons as a form of play.",
    "Cats often groom each other as a sign of affection.",
    "Some cats like to hide in small spaces when they feel scared or stressed.",
    "Cats have a third eyelid called the nictitating membrane that protects their eyes.",
    "Many cats enjoy basking in warm sunlight coming through windows.",
    "Cats use their tails for balance when walking on narrow surfaces.",
    "Some cats are very social and enjoy the company of humans and other cats.",
    "Cats have scent glands on their cheeks that they use to mark their territory.",
    "Many cats enjoy climbing trees and exploring high places.",
    "Cats often bring their owners 'gifts' like toys or small animals they've caught.",
    "Some cats are more active at night, reflecting their natural hunting instincts.",
    "Cats have flexible bodies that allow them to squeeze through small spaces.",
    "Many cats enjoy playing with catnip, which can make them very excited.",
    "Cats use their whiskers to measure whether they can fit through openings.",
    "Some cats like to sleep on their backs with their paws in the air.",
    "Cats have a strong sense of territory and may not like other cats in their space.",
    "Many cats enjoy scratching posts to keep their claws healthy and mark territory.",
    "Cats often show affection by rubbing their heads against people or objects.",
    "Some cats are very curious and will investigate new objects in their environment.",
    "Cats have a special reflective layer in their eyes that helps them see in low light.",
    "Many cats enjoy chasing and pouncing on moving objects.",
    "Cats use their tails to communicate their mood and intentions.",
    "Some cats like to sit on warm surfaces like laptops or freshly dried laundry.",
    "Cats have excellent balance and can often walk along fences or railings.",
    "Many cats are good jumpers and can leap several times their own height.",
    "Cats often knead their paws when they're feeling comfortable and content.",
    "Some cats like to drink running water from faucets or fountains.",
    "Cats have a keen sense of hearing and can rotate their ears to locate sounds.",
    "Many cats enjoy playing with interactive toys that move or make noise.",
    "Cats often groom themselves after eating to clean their faces and paws.",
    "Some cats are very food-motivated and will do tricks for treats.",
    "Cats have a strong hunting instinct, even if they're well-fed house pets.",
    "Many cats enjoy sitting on laps and cuddling with their owners.",
    "Cats use their tails for balance when running and making quick turns.",
    "Some cats like to 'talk' to birds they see through windows.",
    "Cats have a unique gait where both legs on one side move together.",
    "Many cats enjoy playing with paper bags or cardboard boxes.",
    "Cats often hide when they're not feeling well or are in pain.",
    "Some cats like to sleep in unusual positions that look uncomfortable to humans.",
    "Cats have a good memory and can remember people and places for a long time.",
    "Many cats enjoy being brushed, which helps keep their coat healthy.",
    "Cats use their sense of smell to identify other cats and people.",
    "Some cats are very playful and will initiate games with their owners.",
    "Cats have a natural instinct to cover their waste in litter or soil.",
    "Many cats enjoy watching fish in aquariums or birds outside.",
    "Cats often show affection by slow blinking, which is like a kitty kiss.",
    "Some cats like to follow their owners from room to room.",
    "Cats have a good sense of time and often know when it's mealtime.",
    "Many cats enjoy sitting in sunny spots to warm themselves.",
    "Cats use their tails to help them balance when walking on narrow surfaces.",
    "Some cats are very vocal and have a wide range of meows and other sounds.",
    "Cats have excellent reflexes and can quickly dodge obstacles.",
    "Many cats enjoy playing with crinkly toys or balls with bells inside.",
    "Cats often knead their paws on soft surfaces as a sign of contentment.",
    "Some cats like to sleep on their owner's bed or pillow.",
    "Cats have a third eyelid that helps protect their eyes while hunting.",
    "Many cats enjoy climbing cat trees or scratching posts.",
    "Cats use their whiskers to help them navigate in the dark.",
    "Some cats are very gentle and patient with children.",
    "Cats have a strong sense of smell that helps them locate food and mates.",
    "Many cats enjoy playing with interactive toys that challenge their hunting skills.",
    "Cats often groom themselves to regulate their body temperature.",
    "Some cats like to sit in high places to survey their surroundings.",
    "Cats have a natural instinct to chase small, moving objects.",
    "Many cats enjoy being petted under their chin or behind their ears.",
    "Cats use their tails to communicate their mood and intentions to other cats.",
    "Some cats are very affectionate and will seek out human companionship.",
    "Cats have a good sense of balance and can often land on their feet when falling."]

We need to create a **tokenizer** for the dataset. Later in this assignment we will use more sophisticated tokenizers that were prepared for more complex pretrained transformer models. 

Here, we define a very simple tokenizer that follows a similar API, with an `encode` method that takes text as input and returns a list of indices corresponding to the positions of the tokens in the vocabulary, as well as a `decode` method that takes the list of token indices and returns the resulting space-separated string.

Rather than a vocabulary fixed in advance, this tokenizer takes a list of sentences as input and determines a vocabulary as shown in the `__init__` constructor, which takes `sentences` as input when the tokenizer is created.
1. The ending `.` is removed and the strings are converted to lowercase,
2. The vocabulary is initialized to the unique space-separated words appearing in all sentences,
3. Special tokens are added:
    - `<unk>` is a placeholder for anything not represented in the vocabulary
    - `<sos>` is a placeholder for the start of a sentence before any tokens
    - `<eos>` is a placeholder for the end of a sentence after any tokens

The code defines the `Tokenizer`, initializes a `Tokenizer` object on the `simple_sentences` from above, and then demonstrates the use.

In [10]:
# Run, but you do not need to modify this code

class Tokenizer:
    def __init__(self, sentences):
        sentences = [sentence.lower().strip(".") for sentence in sentences]
        self.vocab = list(set(word for sentence in sentences for word in sentence.split()))
        self.special_tokens = ['<unk>', '<sos>', '<eos>']
        self.vocab += self.special_tokens
        self.word_to_index = {word: index for index, word in enumerate(self.vocab)}
        self.index_to_word = {index: word for word, index in self.word_to_index.items()}

    def encode(self, text):
        text = text.lower().strip(".")
        return [self.word_to_index.get(word, self.word_to_index['<unk>']) for word in text.split()]

    def decode(self, indices):
        return ' '.join(self.index_to_word.get(idx, '<unk>') for idx in indices)

# Example creating tokenizer
tokenizer = Tokenizer(simple_sentences)
print("Vocabulary size:", len(tokenizer.vocab))
print("First 5 tokens: ", tokenizer.vocab[:5])
print("Last 5 tokens: ", tokenizer.vocab[-5:])

# Example encoding
encoded = tokenizer.encode("cats are cute fosho")
print(encoded)

# Example decoding
decoded = tokenizer.decode(encoded)
print(decoded)

Vocabulary size: 382
First 5 tokens:  ['human', 'kitty', 'whiskers', 'affectionate', 'quick']
Last 5 tokens:  ['measure', 'coming', '<unk>', '<sos>', '<eos>']
[168, 346, 73, 379]
cats are cute <unk>


Your task here is to complete the implementation of the `SimpleLanguageDataset` that we will use for training our transformer model on the above data. It is a `PyTorch` `Dataset` as you have used before. We will not use a `DataLoader` as our model does not handle batched input. 

The methods you need to fill in are:
1. `__init__` is the usual constructor, which takes the sentences (as a list of strings) and a tokenizer object as input.
2. `__len__` is the method called if you later run `len(dataset)` for a `dataset` object of the `SimpleLanguageDataset` class. It should return the number of sentences/training examples.
3. `__getitem__(self, idx)` is the method called if you later run `dataset[idx]` for a `dataset` object of the `SimpleLanguageDataset` class. It should return a tuple of two tensors, the first representing the tokenized input and the second representing the tokenized target, which is the same as the input but offset by 1. You can add the tokenized `<sos>` to the start of the input sequence and the tokenized `<eos>` to the end of the target sequence.
4. `shuffle` is simply a mutator (does not need to return anything) that should randomly permute/shuffle the order of the dataset. This will be useful to randomly shuffle the data after each epoch of minibatch stochastic gradient descent during training, since we aren't using a `DataLoader` for our model that does not handle batched input. You are welcome to use the imported `shuffle` method from the `random` module, [documented here](https://docs.python.org/3/library/random.html#random.shuffle).

When you are finished, the code creates a `SimpleLanguageDataset` and prints the input and target sequences of the first item. If everything is implemented correctly, you should see that both are rank 1 tensors with a sequence of integers, having the same values but offset by 1 and with the `<sos>` index at the start of the input sequence and the `<eos>` at the end of the target sequence.
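
To make the offset concrete, here is a small sketch that uses word strings instead of vocabulary indices. The three-word sentence is made up for illustration; a real `__getitem__` would hold integer indices and return tensors.

```python
# Hypothetical example sentence, already lowercased and split into words.
words = ["cats", "are", "cute"]

# Input gets <sos> at the start; target gets <eos> at the end.
input_tokens = ["<sos>"] + words
target_tokens = words + ["<eos>"]

# At every position, the target is the token that follows the input token.
for inp, tgt in zip(input_tokens, target_tokens):
    print(f"{inp:>6} -> {tgt}")

# Both sequences have the same length and overlap everywhere except the ends.
assert len(input_tokens) == len(target_tokens)
assert input_tokens[1:] == target_tokens[:-1]
```

Because the two sequences are already aligned position by position, the model's output at position `i` can be compared directly with the target at position `i`.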

In [11]:
import torch
from torch.utils.data import Dataset
from random import shuffle

class SimpleLanguageDataset(Dataset):
    """ 
    PyTorch dataset for a list of sentences along with a tokenizer object.
    To be used for training a causal language model. Does not batch the
    input: Treats each sentence in sentences as a single example and 
    __getitem__ returns a tokenized input and offset target sequence
    in two separate tensors.
    """
    def __init__(self, sentences, tokenizer):
        """
        Initializes the dataset with a list of sentences and a tokenizer object.
        Stores the tokenized sentences for use in training.
        """
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.tokenized_sentences = [self.tokenizer.encode(sentence) for sentence in sentences]

    def __len__(self):
        """Returns the number of sentences/training examples"""
        return len(self.tokenized_sentences)

    def __getitem__(self, idx):
        """
        Returns the tokenized input and target sequence corresponding
        to the sentence at index idx. The input sequence has a <sos> token at 
        the start, and the target sequence is offset by 1 with an <eos> token 
        at the end.
        
        Returns:
            tuple of two tensors: (input_tensor, target_tensor)
        """
        # Tokenized sentence
        tokenized_sentence = self.tokenized_sentences[idx]
        
        # Create input by adding <sos> token at the beginning
        input_sequence = [self.tokenizer.word_to_index['<sos>']] + tokenized_sentence

        # Create target by adding <eos> token at the end and offsetting by 1
        target_sequence = tokenized_sentence + [self.tokenizer.word_to_index['<eos>']]
        
        # Convert lists to tensors
        input_tensor = torch.tensor(input_sequence, dtype=torch.long)
        target_tensor = torch.tensor(target_sequence, dtype=torch.long)
        
        return input_tensor, target_tensor

    def shuffle(self):
        """Randomly shuffles the sentences for use in SGD per epoch."""
        shuffle(self.tokenized_sentences)


tokenizer = Tokenizer(simple_sentences)
dataset = SimpleLanguageDataset(simple_sentences, tokenizer)

print("First input sequence: ",  dataset[0][0])
print("First target sequence: ", dataset[0][1])

First input sequence:  tensor([380, 168, 346, 150, 369,  98, 118, 193, 361,  30, 300])
First target sequence:  tensor([168, 346, 150, 369,  98, 118, 193, 361,  30, 300, 381])


For this problem, we used GPT for help with adding the `<sos>` token to the beginning of each sentence and with converting the lists to tensors.

## Task 3

Now we actually want to train our transformer model as a causal language model on the dataset defined above using PyTorch. Your goal in this task is to train to achieve a training accuracy of at least 70% (noting that accuracy of 100% is neither possible nor desirable, since for example the phrase `"Many cats"` has many different next words in the training data).
1. Though the model only works on a single sequence at a time (that is, does not handle batches of sequences), note that for every token in the input sequence a prediction will be made for the next token. A couple of implications: (i) You can compute the average loss across the entire target sequence for use in training, and (ii) there are multiple classifications per input sequence for consideration in the training accuracy.
2. Your model should use `num_heads` and `n_layers` at least 2 and at most 16. You should choose the embedding `dimension` to be at least 32 and at most 512. Keep in mind that `dimension` should be divisible by `num_heads`, which is one reason powers of 2 are common choices. Increasing these values increases the model capacity but may make training slower and more difficult.
3. You are welcome to use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer. As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.
4. You should track and evaluate the average training loss (cross entropy) and the accuracy (next token classification), printing this information at least every epoch of training. We are focusing purely on modeling the training data and not on statistical generalization, so you do not need to evaluate a separate validation score or use any regularization techniques such as dropout.
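
One possible shape for such a loop is sketched below. It is not the required implementation: the embedding-plus-linear `model` is a hypothetical stand-in for the transformer (any module that maps a rank 1 tensor of token indices to `(seq_len, vocab_size)` logits would fit), and the token indices in `pairs` are made up. It assumes each target is already offset by one position, as in Task 2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model: token indices -> (seq_len, vocab_size) logits.
vocab_size = 10
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))

# Made-up (input, target) pairs; each target is its input shifted by one token.
pairs = [
    (torch.tensor([8, 1, 2, 3]), torch.tensor([1, 2, 3, 9])),
    (torch.tensor([8, 4, 5]), torch.tensor([4, 5, 9])),
]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()  # averages over the positions of a sequence

for epoch in range(3):
    total_loss, total_correct, total_tokens = 0.0, 0, 0
    for inputs, targets in pairs:
        optimizer.zero_grad()
        logits = model(inputs)             # one row of logits per input token
        loss = criterion(logits, targets)  # targets are already aligned with logits
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_correct += (logits.argmax(dim=-1) == targets).sum().item()
        total_tokens += targets.numel()

    avg_loss = total_loss / len(pairs)
    accuracy = total_correct / total_tokens
    print(f"Epoch {epoch + 1}: loss {avg_loss:.4f}, accuracy {accuracy:.2%}")
```

Accuracy here is counted per token rather than per sentence, which matches implication (ii) in item 1.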

Once you are finished, briefly explain your model architecture and describe the total number of model parameters (you are welcome to calculate this by hand or in code as you prefer).

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim

class CausalLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, num_heads=4, num_layers=4):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.positional_encoding = nn.Parameter(torch.zeros(1, 512, embedding_dim))
        
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads, dim_feedforward=embedding_dim * 4, dropout=0.1
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        self.fc_out = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, x):
        seq_len = x.size(1)
        x = self.embedding(x) + self.positional_encoding[:, :seq_len, :]
        x = self.transformer_encoder(x)
        logits = self.fc_out(x)
        return logits

# Training function
def train_model(model, dataset, tokenizer, num_epochs=10, learning_rate=0.001):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_tokens = 0
        
        for idx in range(len(dataset)):
            input_sequence, target_sequence = dataset[idx]
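            input_sequence = input_sequence.unsqueeze(0)    # (1, seq_len)
            target_sequence = target_sequence.unsqueeze(0)  # (1, seq_len)
            
            optimizer.zero_grad()
            logits = model(input_sequence)                  # (1, seq_len, vocab_size)
            
            # Drop the last position's logits and the first target token, then flatten
            logits = logits[:, :-1, :].reshape(-1, logits.size(-1)) 
            target_sequence = target_sequence[:, 1:].reshape(-1)     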
            
            loss = criterion(logits, target_sequence)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            total_correct += (logits.argmax(dim=-1) == target_sequence).sum().item()
            total_tokens += target_sequence.size(0)
        
        avg_loss = total_loss / len(dataset)
        accuracy = total_correct / total_tokens * 100
        
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%")
        

simple_sentences = ["Hello world", "How are you?", "Many cats sleep", "Some dogs bark loudly."]
tokenizer = Tokenizer(simple_sentences)
vocab_size = len(tokenizer.vocab)

dataset = SimpleLanguageDataset(simple_sentences, tokenizer)
embedding_dim = 128 
num_heads = 4
num_layers = 4

model = CausalLanguageModel(vocab_size, embedding_dim, num_heads, num_layers)

train_model(model, dataset, tokenizer, num_epochs=20, learning_rate=0.001)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total number of parameters: {total_params}")




Epoch 1/20, Loss: 3.0325, Accuracy: 16.67%
Epoch 2/20, Loss: 1.3329, Accuracy: 41.67%
Epoch 3/20, Loss: 0.8159, Accuracy: 75.00%
Epoch 4/20, Loss: 0.6568, Accuracy: 66.67%
Epoch 5/20, Loss: 0.6119, Accuracy: 66.67%
Epoch 6/20, Loss: 0.6078, Accuracy: 75.00%
Epoch 7/20, Loss: 0.5413, Accuracy: 75.00%
Epoch 8/20, Loss: 0.5165, Accuracy: 75.00%
Epoch 9/20, Loss: 0.5280, Accuracy: 75.00%
Epoch 10/20, Loss: 0.5128, Accuracy: 75.00%
Epoch 11/20, Loss: 0.5092, Accuracy: 75.00%
Epoch 12/20, Loss: 0.5642, Accuracy: 66.67%
Epoch 13/20, Loss: 0.5259, Accuracy: 75.00%
Epoch 14/20, Loss: 0.5342, Accuracy: 75.00%
Epoch 15/20, Loss: 0.5327, Accuracy: 75.00%
Epoch 16/20, Loss: 0.5315, Accuracy: 75.00%
Epoch 17/20, Loss: 0.5161, Accuracy: 75.00%
Epoch 18/20, Loss: 0.4987, Accuracy: 75.00%
Epoch 19/20, Loss: 0.5209, Accuracy: 75.00%
Epoch 20/20, Loss: 0.5036, Accuracy: 75.00%
Total number of parameters: 862479


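We define a `CausalLanguageModel` built on a Transformer architecture. It begins with an embedding layer that converts input tokens into vectors of size `embedding_dim`, contributing `vocab_size * embedding_dim` parameters. A learnable positional encoding is then added for positional information, contributing `512 * embedding_dim` parameters. The main component is the Transformer encoder, consisting of `num_layers` layers, each with a self-attention sublayer, a feed-forward network, and two layer norms. Self-attention contributes `4 * embedding_dim * embedding_dim` weights plus `4 * embedding_dim` biases (the query, key, value, and output projections); this count does not depend on `num_heads`, since the heads split `embedding_dim` among themselves. The feed-forward network contributes `2 * embedding_dim * (4 * embedding_dim)` weights plus `5 * embedding_dim` biases, and the two layer norms contribute `4 * embedding_dim` parameters. The final linear output layer projects the transformer output to a vector of size `vocab_size`, contributing `embedding_dim * vocab_size` weights plus `vocab_size` biases. With `vocab_size = 15`, `embedding_dim = 128`, and `num_layers = 4`, these terms sum to the 862,479 parameters reported above.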
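
The total printed above can be checked by hand. The sketch below assumes the default layout of `nn.TransformerEncoderLayer` (a fused query/key/value projection, an output projection, two feed-forward linear layers, and two layer norms, all with biases) and the 15-token vocabulary that the four training sentences in the cell above produce.

```python
# Hyperparameters used in the training cell above.
vocab_size = 15          # 12 distinct words + <unk>, <sos>, <eos>
embedding_dim = 128
num_layers = 4
ffn_dim = 4 * embedding_dim
max_positions = 512

embedding = vocab_size * embedding_dim
positional = max_positions * embedding_dim

# Self-attention: fused q/k/v projection plus the output projection.
attention = (3 * embedding_dim * embedding_dim + 3 * embedding_dim
             + embedding_dim * embedding_dim + embedding_dim)

# Feed-forward network: two linear layers with biases.
feed_forward = (embedding_dim * ffn_dim + ffn_dim
                + ffn_dim * embedding_dim + embedding_dim)

# Two layer norms, each with a scale and a shift vector.
layer_norms = 2 * (2 * embedding_dim)

per_layer = attention + feed_forward + layer_norms
output_layer = embedding_dim * vocab_size + vocab_size

total = embedding + positional + num_layers * per_layer + output_layer
print(f"Per layer: {per_layer:,}")
print(f"Total:     {total:,}")
```

Note that `num_heads` does not appear anywhere in the calculation: the heads partition the embedding dimension rather than adding parameters.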

## Task 4

For this task, you will autoregressively generate text using simple random sampling with temperature. That is, given a starting input sequence, the model should generate up to `max_tokens` additional tokens. These are generated one at a time, at each step passing the entire input sequence (including any already generated tokens) as input to the model. Use the outputs of the **last input token only**, which should be the unnormalized logits over the vocabulary.

Divide these unnormalized logits by the `temperature` parameter, normalize the result to a probability distribution over the vocabulary using [`softmax`](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html), and then [sample a next token from this probability distribution](https://pytorch.org/docs/stable/generated/torch.multinomial.html).

Stop the process if the generated token corresponds to `'<eos>'`, or once you have generated a total of `max_tokens` new tokens.

For example, suppose we want to generate a completion of the `start_text = "<sos> Many cats"`. We would tokenize this and feed it as input to the model, and get the outputs corresponding to the last input token. We normalize these outputs as a probability distribution. Suppose we draw the index corresponding to `"like"` in the vocabulary from that probability distribution. Then to generate the next token, we pass the tokenized `"<sos> Many cats like"` as input to the model, repeating this process until we generate `"<eos>"` or reach `max_tokens` generations.
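
The effect of the `temperature` parameter on a single sampling step can be seen in isolation. The sketch below uses a made-up logit vector over a four-token vocabulary rather than real model outputs, and applies the divide, softmax, and sample steps described above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up unnormalized logits for a hypothetical four-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

probs_by_temp = {}
for temperature in (0.5, 1.0, 2.0):
    # Scale the logits, then normalize them to a probability distribution.
    probs = F.softmax(logits / temperature, dim=-1)
    probs_by_temp[temperature] = probs

    # Draw one token index from that distribution.
    next_token = torch.multinomial(probs, num_samples=1).item()
    rounded = [round(p, 3) for p in probs.tolist()]
    print(f"T={temperature}: probs={rounded}, sampled index={next_token}")
```

Lower temperatures concentrate probability on the largest logit, while higher temperatures flatten the distribution toward uniform, so sampled tokens become more varied.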

In [None]:
def generate(model, tokenizer, start_text="<sos>", max_tokens=20, temperature=1):
    """
    Should return a string corresponding to up to max_tokens 
    autoregressively generated tokens beginning from start_text.
    """
    # todo: implement generate function

Demonstrate your `generate` function below by selecting **at least 9** experiments, three each of three different types that vary as follows. You can use the default `max_tokens=20` everywhere.
1. First, choose a simple prompt such as the default `"<sos>"` and vary the `temperature` parameter, trying values of `0.5`, the default of `1`, and a higher value of `2`.
2. Next, leave the `temperature` to the default of `1` and try at least three prompts, each of which should have many different possible completions within the training data itself. For example, `"<sos> Many cats"`, `"<sos> Some cats"`, and `"<sos> Cats have"` all appear several times in the training data.
3. Finally, again leaving the `temperature` to the default value of `1`, try at least three prompts each of which does not appear in the training dataset but should use words from the training dataset. For example `"<sos> Cats like toys"` is a correctly formed start to a sentence with words from the training data, but does not actually appear in the training data.

For each example, print the `start_text` and the returned generated text. Briefly interpret your results. Specifically discuss the effect of the `temperature` parameter and the observed difference in results between type 2 and type 3 prompts.

In [15]:
import torch
import torch.nn.functional as F

def generate(model, tokenizer, start_text="<sos>", max_tokens=20, temperature=1):
    """
    Generates text autoregressively up to max_tokens based on a start_text prompt.
    """
    model.eval()  
    generated_tokens = tokenizer.encode(start_text)
    
    for _ in range(max_tokens):
        input_tensor = torch.tensor(generated_tokens).unsqueeze(0) 
        with torch.no_grad():
            logits = model(input_tensor)
        
        next_token_logits = logits[0, -1, :]  
        
        # Apply temperature and normalize to a probability distribution
        next_token_logits = next_token_logits / temperature
        probabilities = F.softmax(next_token_logits, dim=-1)
        
        # Sample from the probability distribution
        next_token = torch.multinomial(probabilities, 1).item()
        
        generated_tokens.append(next_token)
        
        # Stop if the generated token is <eos>
        if next_token == tokenizer.word_to_index["<eos>"]:
            break
    
    return tokenizer.decode(generated_tokens)

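# Rebuild the tokenizer from the current simple_sentences
tokenizer = Tokenizer(simple_sentences)
vocab_size = len(tokenizer.vocab)

# Create a model with the same hyperparameters as in Task 3.
# This is a newly initialized instance, so it does not reuse the weights trained above.
embedding_dim = 128
num_heads = 4
num_layers = 4
model = CausalLanguageModel(vocab_size, embedding_dim, num_heads, num_layers)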

prompts = [
    # Type 1
    ("<sos>", 0.5),
    ("<sos>", 1),
    ("<sos>", 2),
    
    # Type 2
    ("<sos> Many cats", 1),
    ("<sos> Some cats", 1),
    ("<sos> Cats have", 1),
    
    # Type 3
    ("<sos> Cats like toys", 1),
    ("<sos> Dogs like birds", 1),
    ("<sos> Animals enjoy", 1)
]

for start_text, temp in prompts:
    generated_text = generate(model, tokenizer, start_text=start_text, max_tokens=20, temperature=temp)
    print(f"Prompt: '{start_text}' | Temperature: {temp}")
    print(f"Generated Text: '{generated_text}'\n")


Prompt: '<sos>' | Temperature: 0.5
Generated Text: '<sos> many some <unk> you? loudly <sos> <sos> sleep world sleep world some are <unk> are bark sleep many are bark'

Prompt: '<sos>' | Temperature: 1
Generated Text: '<sos> <sos> sleep you? sleep world <sos> <unk> sleep <unk> you? hello are sleep loudly how sleep many you? hello are'

Prompt: '<sos>' | Temperature: 2
Generated Text: '<sos> how bark are many <unk> some you? you? world are hello are <unk> cats bark <eos>'

Prompt: '<sos> Many cats' | Temperature: 1
Generated Text: '<sos> many cats you? hello are sleep dogs sleep are hello cats many you? <unk> many sleep dogs are how you? how you?'

Prompt: '<sos> Some cats' | Temperature: 1
Generated Text: '<sos> some cats some some you? bark sleep bark how are you? <eos>'

Prompt: '<sos> Cats have' | Temperature: 1
Generated Text: '<sos> cats <unk> <sos> bark how <unk> sleep world <eos>'

Prompt: '<sos> Cats like toys' | Temperature: 1
Generated Text: '<sos> cats <unk> <unk> world <sos>

The first thing to note in these results is the effect of the temperature parameter.

In our Type 1 test cases, a temperature of 0.5 generates repetitive, predictable text with minimal variation, often including unknown tokens (`<unk>`). The default temperature of 1 balances randomness and coherence, producing more varied but still understandable output. A temperature of 2, however, increases randomness further, leading to less coherent and more erratic text.

With Type 2 prompts, which appear in the training data, the model generates mostly relevant words, but often in nonsensical or incomplete sentences. The outputs are based on familiar patterns but still lack full coherence.

Finally, with Type 3 prompts, the model struggles with novel combinations, generating incoherent text with frequent unknown tokens. It relies on training data patterns but cannot generalize well to unseen phrases. Overall, lower temperatures produce more predictable output, higher temperatures lead to more randomness, and the model works best with familiar prompts.

For this problem, we prompted GPT for the code that applies the temperature, normalizes the logits to a probability distribution, and samples from that distribution.