# Part 2: DIY Transformer and Language Modeling

In this part of the assignment we will build our own Transformer architecture and see how it can be used for causal ("left to right" or "decoder-only") language modeling, similar to the GPT line of models. We will train a very simple language model for demonstration purposes, examining how we can train a model in a self-supervised fashion and then generate text autoregressively. 

The model we train in this part will not be impressive in and of itself. Pretraining an effective language model requires days, weeks, or months of compute time on several GPUs over large quantities of data, which would be impractical for a learning assignment (not to mention the environmental impact). Nevertheless, this part of the assignment will help us to develop a deeper understanding of the transformer architecture that undergirds essentially all language models. In later parts, we will perform inference and fine-tuning with much larger and more performant large language models that have already been pretrained. 

**Learning objectives.** You will:
1. Implement the transformer architecture in PyTorch
2. Pretrain a causal small language model in a self-supervised manner
3. Use a causal language model to autoregressively generate text

While it is possible to complete this assignment using only CPU compute, training may be slow. To accelerate your training, consider using GPU resources such as `CUDA` devices on the CS department cluster. Alternatives include Google Colab or a local GPU if your machine has one.

## Preparing Training Data and Tokenizer

We will use the following `simple_sentences` dataset consisting of 96 single-sentence statements about cats. We are intentionally using this very small *toy* dataset so that you can efficiently train and experiment with your model. It is clearly far too little text to train a large, capable language model, and training such a model is generally very computationally expensive.

Run the following cell to define `simple_sentences`.

In [3]:
# Run, but you do not need to modify this code

simple_sentences = ["Cats are furry animals that many people keep as pets.",
    "Most cats have soft fur that comes in many colors and patterns.",
    "Cats have sharp claws that they use for climbing and scratching.",
    "Many cats enjoy sleeping for long hours during the day.",
    "Cats are known for their ability to always land on their feet when they fall.",
    "Kittens are baby cats that are very playful and cute.",
    "Cats have excellent night vision, which helps them hunt in the dark.",
    "Most cats are very good at keeping themselves clean by licking their fur.",
    "Cats communicate with each other and with humans by meowing.",
    "Many cats enjoy chasing and playing with small toys.",
    "Cats have a strong sense of balance and can walk on narrow surfaces easily.",
    "Some cats like to sit on windowsills and watch the world outside.",
    "Cats have rough tongues that feel like sandpaper when they lick you.",
    "Many cats enjoy being petted and will purr to show they are happy.",
    "Cats are often independent animals that don't need constant attention.",
    "Some cats like to knock things off tables and shelves for fun.",
    "Cats have whiskers that help them sense their surroundings.",
    "Many cats are good at catching mice and other small animals.",
    "Some cats enjoy playing with laser pointers, chasing the red dot.",
    "Cats often stretch after waking up from a nap.",
    "Many cats like to sit in boxes, even if the box seems too small for them.",
    "Cats have retractable claws that they can extend when needed.",
    "Some cats enjoy playing with water, while others avoid it.",
    "Cats have a keen sense of smell that helps them find food.",
    "Many cats like to perch on high places to observe their surroundings.",
    "Cats often knead their paws on soft surfaces, which is called 'making biscuits'.",
    "Some cats are very vocal and will meow a lot to get attention.",
    "Cats have excellent hearing and can detect very quiet sounds.",
    "Many cats enjoy chasing strings or ribbons as a form of play.",
    "Cats often groom each other as a sign of affection.",
    "Some cats like to hide in small spaces when they feel scared or stressed.",
    "Cats have a third eyelid called the nictitating membrane that protects their eyes.",
    "Many cats enjoy basking in warm sunlight coming through windows.",
    "Cats use their tails for balance when walking on narrow surfaces.",
    "Some cats are very social and enjoy the company of humans and other cats.",
    "Cats have scent glands on their cheeks that they use to mark their territory.",
    "Many cats enjoy climbing trees and exploring high places.",
    "Cats often bring their owners 'gifts' like toys or small animals they've caught.",
    "Some cats are more active at night, reflecting their natural hunting instincts.",
    "Cats have flexible bodies that allow them to squeeze through small spaces.",
    "Many cats enjoy playing with catnip, which can make them very excited.",
    "Cats use their whiskers to measure whether they can fit through openings.",
    "Some cats like to sleep on their backs with their paws in the air.",
    "Cats have a strong sense of territory and may not like other cats in their space.",
    "Many cats enjoy scratching posts to keep their claws healthy and mark territory.",
    "Cats often show affection by rubbing their heads against people or objects.",
    "Some cats are very curious and will investigate new objects in their environment.",
    "Cats have a special reflective layer in their eyes that helps them see in low light.",
    "Many cats enjoy chasing and pouncing on moving objects.",
    "Cats use their tails to communicate their mood and intentions.",
    "Some cats like to sit on warm surfaces like laptops or freshly dried laundry.",
    "Cats have excellent balance and can often walk along fences or railings.",
    "Many cats are good jumpers and can leap several times their own height.",
    "Cats often knead their paws when they're feeling comfortable and content.",
    "Some cats like to drink running water from faucets or fountains.",
    "Cats have a keen sense of hearing and can rotate their ears to locate sounds.",
    "Many cats enjoy playing with interactive toys that move or make noise.",
    "Cats often groom themselves after eating to clean their faces and paws.",
    "Some cats are very food-motivated and will do tricks for treats.",
    "Cats have a strong hunting instinct, even if they're well-fed house pets.",
    "Many cats enjoy sitting on laps and cuddling with their owners.",
    "Cats use their tails for balance when running and making quick turns.",
    "Some cats like to 'talk' to birds they see through windows.",
    "Cats have a unique gait where both legs on one side move together.",
    "Many cats enjoy playing with paper bags or cardboard boxes.",
    "Cats often hide when they're not feeling well or are in pain.",
    "Some cats like to sleep in unusual positions that look uncomfortable to humans.",
    "Cats have a good memory and can remember people and places for a long time.",
    "Many cats enjoy being brushed, which helps keep their coat healthy.",
    "Cats use their sense of smell to identify other cats and people.",
    "Some cats are very playful and will initiate games with their owners.",
    "Cats have a natural instinct to cover their waste in litter or soil.",
    "Many cats enjoy watching fish in aquariums or birds outside.",
    "Cats often show affection by slow blinking, which is like a kitty kiss.",
    "Some cats like to follow their owners from room to room.",
    "Cats have a good sense of time and often know when it's mealtime.",
    "Many cats enjoy sitting in sunny spots to warm themselves.",
    "Cats use their tails to help them balance when walking on narrow surfaces.",
    "Some cats are very vocal and have a wide range of meows and other sounds.",
    "Cats have excellent reflexes and can quickly dodge obstacles.",
    "Many cats enjoy playing with crinkly toys or balls with bells inside.",
    "Cats often knead their paws on soft surfaces as a sign of contentment.",
    "Some cats like to sleep on their owner's bed or pillow.",
    "Cats have a third eyelid that helps protect their eyes while hunting.",
    "Many cats enjoy climbing cat trees or scratching posts.",
    "Cats use their whiskers to help them navigate in the dark.",
    "Some cats are very gentle and patient with children.",
    "Cats have a strong sense of smell that helps them locate food and mates.",
    "Many cats enjoy playing with interactive toys that challenge their hunting skills.",
    "Cats often groom themselves to regulate their body temperature.",
    "Some cats like to sit in high places to survey their surroundings.",
    "Cats have a natural instinct to chase small, moving objects.",
    "Many cats enjoy being petted under their chin or behind their ears.",
    "Cats use their tails to communicate their mood and intentions to other cats.",
    "Some cats are very affectionate and will seek out human companionship.",
    "Cats have a good sense of balance and can often land on their feet when falling."]

We need to create a **tokenizer** for the dataset. Here, we define a very simple tokenizer with two methods: `encode` takes text as input and returns a list of indices corresponding to the positions of the tokens in the vocabulary, and `decode` takes a list of token indices and returns the resulting space-separated string.

Rather than a vocabulary fixed in advance, this tokenizer takes a list of sentences as input and determines a vocabulary as shown in the `__init__` constructor, which takes `sentences` as input when the tokenizer is created.
1. The ending `.` is removed and the strings are converted to lowercase,
2. The vocabulary is initialized to the unique space-separated words appearing in all sentences,
3. Special tokens are added:
    - `<unk>` is a placeholder for anything not represented in the vocabulary
    - `<sos>` is a placeholder for the start of a sentence before any tokens
    - `<eos>` is a placeholder for the end of a sentence after any tokens

The code defines the `Tokenizer`, initializes a `Tokenizer` object on the `simple_sentences` from above, and then demonstrates the use.

In [5]:
# Run, but you do not need to modify this code

class Tokenizer:
    def __init__(self, sentences):
        sentences = [sentence.lower().strip(".") for sentence in sentences]
        self.vocab = list(set(word for sentence in sentences for word in sentence.split()))
        self.special_tokens = ['<unk>', '<sos>', '<eos>']
        self.vocab += self.special_tokens
        self.word_to_index = {word: index for index, word in enumerate(self.vocab)}
        self.index_to_word = {index: word for word, index in self.word_to_index.items()}

    def encode(self, text):
        text = text.lower().strip(".")
        return [self.word_to_index.get(word, self.word_to_index['<unk>']) for word in text.split()]

    def decode(self, indices):
        return ' '.join(self.index_to_word.get(idx, '<unk>') for idx in indices)

# Example creating tokenizer
tokenizer = Tokenizer(simple_sentences)
print("Vocabulary size:", len(tokenizer.vocab))
print("First 5 tokens: ", tokenizer.vocab[:5])
print("Last 5 tokens: ", tokenizer.vocab[-5:])

# Example encoding
encoded = tokenizer.encode("cats are cute fosho")
print(encoded)

# Example decoding
decoded = tokenizer.decode(encoded)
print(decoded)

Vocabulary size: 382
First 5 tokens:  ['well', 'their', 'content', "biscuits'", 'like']
Last 5 tokens:  ['whether', 'gentle', '<unk>', '<sos>', '<eos>']
[85, 265, 90, 379]
cats are cute <unk>


Now we use the `Tokenizer` to define a PyTorch `Dataset` called `SimpleLanguageDataset` that we will use for training our transformer model on the above data. We will not use a `DataLoader` as our model will not handle batched input. 

Some notes for your reference about the methods. You do not need to implement anything but you will use these later.

1. `__init__` is the usual constructor, which takes the sentences (as a list of strings) and a tokenizer object as input.

2. `__len__` is the method called if you later run `len(dataset)` for a `dataset` object of the `SimpleLanguageDataset` class. It returns the number of sentences/training examples.

3. `__getitem__(self, idx)` is the method called if you later run `dataset[idx]` for a `dataset` object of the `SimpleLanguageDataset` class. It returns a tuple of two tensors, the first representing the tokenized input and the second representing the tokenized target, which is the same as the input but offset by 1. The tokenized `<sos>` is added to the start of the input sequence and the tokenized `<eos>` to the end of the target sequence.

4. `shuffle` is simply a mutator (it does not return anything) that randomly permutes the order of the dataset in place. This is useful for reshuffling the data after each epoch of stochastic gradient descent during training, since our model does not handle batched input and we aren't using a `DataLoader`.

The code creates a `SimpleLanguageDataset` and prints the input and target sequences of the first item. You should see that both are rank 1 tensors with a sequence of integers, having the same values but offset by 1 and with the `<sos>` index at the start of the input sequence and the `<eos>` at the end of the target sequence.

In [7]:
import torch
from torch.utils.data import Dataset
from random import shuffle

class SimpleLanguageDataset(Dataset):
    def __init__(self, sentences, tokenizer):
        self.sentences = sentences
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        tokens = self.tokenizer.encode(sentence)
        
        full_sequence = [self.tokenizer.word_to_index['<sos>']] + tokens + [self.tokenizer.word_to_index['<eos>']]
        
        input_seq = torch.tensor(full_sequence[:-1])   # [<sos>, word1, ..., wordN]
        target_seq = torch.tensor(full_sequence[1:])   # [word1, word2, ..., <eos>]
        
        return input_seq, target_seq

    def shuffle(self):
        shuffle(self.sentences)


# Create tokenizer and dataset
tokenizer = Tokenizer(simple_sentences)
dataset = SimpleLanguageDataset(simple_sentences, tokenizer)

print("First input sequence: ",  dataset[0][0])
print("First target sequence: ", dataset[0][1])

First input sequence:  tensor([380,  85, 265, 358, 234, 221,  79, 297, 372,  91, 139])
First target sequence:  tensor([ 85, 265, 358, 234, 221,  79, 297, 372,  91, 139, 381])
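
To make the one-token offset concrete, here is a minimal sketch using plain Python lists and a made-up five-entry vocabulary. The `toy_vocab` mapping and the `make_pair` helper are hypothetical names used only for illustration; they are not part of the assignment code, which builds tensors inside `__getitem__`.

```python
# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"cats": 0, "are": 1, "cute": 2, "<sos>": 3, "<eos>": 4}

def make_pair(words, vocab):
    """Build (input, target) index lists that are offset by one position."""
    full = [vocab["<sos>"]] + [vocab[w] for w in words] + [vocab["<eos>"]]
    # Input drops the final <eos>; target drops the leading <sos>.
    return full[:-1], full[1:]

inputs, targets = make_pair(["cats", "are", "cute"], toy_vocab)
print(inputs)   # [3, 0, 1, 2]
print(targets)  # [0, 1, 2, 4]
```

At every position, the target is the token that follows the corresponding input token, so the model is trained to predict the next word (or `<eos>` after the last word).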


## Task 1

This transformer implementation will be DIY in the sense that you may not use the `Transformer` module in PyTorch, but you will otherwise use PyTorch extensively. In particular, it is expected that you will use the following modules, substantially simplifying the implementation.
- [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) for both the word embedding and the positional encoding/embedding.
- [`MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html)
- [`LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
- [`Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
- [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) or other nonlinear activations of your choice

In [9]:
import torch
from torch import nn

class TransformerBlock(nn.Module):
    """
    Implements a standard transformer block: 
        1. multi-head self-attention
        2. residual add and layer norm
        3. position-wise feedforward MLP with a single hidden layer
        4. another residual add and layer norm
    """
    def __init__(self, d_embed, num_heads):
        super().__init__()
        # TODO: complete constructor implementation
        self.MHAttention = nn.MultiheadAttention(embed_dim = d_embed, num_heads = num_heads)
        self.LNorm1 = nn.LayerNorm(d_embed)
        self.LNorm2 = nn.LayerNorm(d_embed)
        self.feedforward = nn.Sequential(nn.Linear(d_embed, 4 * d_embed), nn.ReLU(), nn.Linear(4 * d_embed, d_embed))

    def forward(self, x, attn_mask):
        """    
        Args:
            x: Tensor of shape (seq_len, d_embed)
            attn_mask: Tensor of shape (seq_len, seq_len) 
        Returns:
            Tensor of shape (seq_len, d_embed)

        Unbatched input: x should be a rank 2 tensor with 
        a row per input token and column per embedding dimension.
        attn_mask should have a shape of (seq_len, seq_len) and be passed
        to the MultiheadAttention module's forward method.
        """
        # TODO: complete forward implementation
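        # nn.MultiheadAttention (batch_first=False) takes (seq_len, batch, d_embed),
        # so add a singleton batch dimension for the unbatched input.
        query = x.unsqueeze(1)
        key = query
        value = query
        attn_output, attn_output_weights = self.MHAttention(query, key, value, attn_mask = attn_mask)
        attn_out = attn_output.squeeze(1)   # back to (seq_len, d_embed)
        # Residual add and layer norm around attention, then around the feedforward MLP.
        x = self.LNorm1(x + attn_out)
        feedforward_output = self.feedforward(x)
        x = self.LNorm2(x + feedforward_output)
        return x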

class Transformer(nn.Module):
    """
    Implements a standard decoder-only transformer 
    for causal language modeling: 
        1. input word embedding from vocab_size to d_embed
        2. add positional embedding (support max_length tokens)
        3. pass through n_blocks TransformerBlocks
        4. unembedding linear layer to vocab_size
    """
    def __init__(self, vocab_size, d_embed=64, num_heads=4, max_length=64, n_blocks=4):
        super().__init__()
        # TODO: complete constructor implementation
        self.vocab_size = vocab_size
        self.d_embed = d_embed
        self.num_heads = num_heads
        self.max_length = max_length
        self.n_blocks = n_blocks
        self.word_embedding = nn.Embedding(vocab_size, d_embed)
        self.positional_embedding = nn.Embedding(max_length, d_embed)
        self.blocks = nn.ModuleList([TransformerBlock(d_embed, num_heads) for b in range(n_blocks)])
        self.final_layer_norm = nn.LayerNorm(d_embed)
        self.linear_projection = nn.Linear(d_embed, vocab_size)

    def forward(self, x):
        """
        Unbatched input: x should be a rank 1 tensor with 
        indices of the tokens within the vocabulary. 
        
        Output should be a rank 2 tensor with a row per input token
        containing unnormalized logits over the vocabulary.
        
        Note: Must compute causal attn_mask to provide to any
        Transformer blocks.
        Hint: For causal attention mask, use torch.triu() to 
        create upper triangular matrix
        """
        # TODO: complete forward implementation
        seq_len = x.size(0)
        # Sum token embeddings and learned positional embeddings for positions 0..seq_len-1.
        pos = torch.arange(seq_len, device = x.device)
        hidden = self.word_embedding(x) + self.positional_embedding(pos)
        # Causal mask: -inf strictly above the diagonal (future positions), 0 elsewhere.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device = x.device), diagonal = 1)
        for block in self.blocks:
            hidden = block(hidden, causal_mask)
        # Final layer norm, then project to unnormalized logits over the vocabulary.
        logits = self.linear_projection(self.final_layer_norm(hidden))
        return logits
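
The causal mask mentioned in the `Transformer.forward` docstring can be inspected in isolation. The following is a minimal sketch, assuming a float mask that is added to the attention scores before the softmax (the convention `nn.MultiheadAttention` uses for floating-point `attn_mask` values). The sequence length of 4 and the all-zero scores are made up for illustration.

```python
import torch

seq_len = 4
# Entries strictly above the diagonal are -inf (blocked); the rest are 0 (allowed).
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)

# Adding the mask to (here, all-zero) attention scores before the softmax gives
# each row uniform weight over the current and earlier positions only.
scores = torch.zeros(seq_len, seq_len)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Row `i` of `weights` has nonzero entries only in columns `0..i`, so the output for a token cannot depend on tokens that come after it.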

## Task 2

Now we actually want to train our transformer model as a causal language model on the dataset defined above using PyTorch. Your goal in this task is to train to achieve a training accuracy of at least 70% (noting that accuracy of 100% is neither possible nor desirable, since for example the phrase `"Many cats"` has many different next words in the training data).

1. Though the model only works on a single sequence at a time (that is, does not handle batches of sequences), note that for every token in the input sequence a prediction will be made for the next token. A couple of implications: (i) you can compute the average loss across the entire target sequence for use in training, and (ii) there are multiple classifications per input sequence for consideration in the training accuracy.

2. Your model should use `num_heads` and `n_blocks` of at least 2 and at most 16. You should choose the embedding dimension `d_embed` to be at least 32 and at most 512. Keep in mind that `d_embed` should be divisible by `num_heads`, which is one reason why powers of 2 are common choices. Increasing these values increases the model capacity but may make training slower and more difficult.

3. You are welcome to use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer. As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.

4. You should track and evaluate the average training loss (cross entropy) and the accuracy (next token classification), both evaluated simply on the training data (we will not worry about statistical generalization for this part). **Print both every epoch of training.** 

5. You can stop training once you achieve the required 70% training accuracy (which may take many epochs given the small size of the dataset). We are focusing purely on modeling the training data and not on statistical generalization, so you do not need to evaluate a separate validation or test score or use any regularization techniques such as dropout.

6. Once you are finished, report the **total number of model parameters**. Show your work: either the calculations you do by hand or the code you used to count the number of model parameters.
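
The per-sequence loss and accuracy described in item 1 can be sketched as follows. This is a minimal illustration, not the required solution: the logits and targets are made up (a 3-token sequence over a 4-word vocabulary), and `torch.nn.functional.cross_entropy` is used with its default mean reduction.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 3-token sequence over a 4-word vocabulary.
logits = torch.tensor([[2.0, 0.0, 0.0, 0.0],
                       [0.0, 3.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0]])
targets = torch.tensor([0, 1, 2])

# Mean cross-entropy over the positions in this sequence (reduction="mean" is the default).
loss = F.cross_entropy(logits, targets)

# Each position is its own next-token classification.
predictions = logits.argmax(dim=-1)
num_correct = (predictions == targets).sum().item()
accuracy = num_correct / targets.numel()

print(f"loss={loss.item():.4f}, correct={num_correct}/{targets.numel()}, accuracy={accuracy:.3f}")
```

To report dataset-level averages for an epoch, one option is to accumulate summed losses, correct counts, and token counts across sequences and divide once at the end of the epoch.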

In [11]:
# TODO: write code for task 2 here
torch.manual_seed(2025)
device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu'))
model = Transformer(vocab_size = len(tokenizer.vocab), d_embed = 128, num_heads = 4, max_length = 64, n_blocks = 4).to(device)
CEL = nn.CrossEntropyLoss(reduction = "sum")
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)
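# NOTE: these counters are initialized once, before the epoch loop, and never reset,
# so the loss and accuracy printed below are running averages over all epochs so far
# rather than per-epoch values. Reset them at the top of each epoch for per-epoch metrics.
t_tokens = 0
t_correct = 0
t_loss = 0.0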
max_epochs = 30
for epoch in range(1, max_epochs + 1):
    model.train()
    dataset.shuffle()
    for index in range(len(dataset)):
        i_input, i_target = dataset[index]
        i_input = i_input.to(device)
        i_target = i_target.to(device)
        optimizer.zero_grad()
        logits = model(i_input)
        loss = CEL(logits, i_target)
        loss.backward()
        optimizer.step()
        t_loss += loss.item()
        t_tokens += i_target.numel()
        with torch.no_grad():
            t_correct += (logits.argmax(1) == i_target).sum().item()
    average_loss = t_loss / t_tokens
    accuracy = t_correct / t_tokens
    print(device, "Epoch", epoch, ": Average Loss", average_loss, "| Accuracy", accuracy)
    if accuracy >= 0.7:
        break
total_parameters = sum(param.numel() for param in model.parameters())
print("The total number of model parameters is", total_parameters)

mps Epoch 1 : Average Loss 4.711638938261853 | Accuracy 0.22572815533980584
mps Epoch 2 : Average Loss 4.144784561638693 | Accuracy 0.27224919093851135
mps Epoch 3 : Average Loss 3.6535255952688996 | Accuracy 0.3238942826321467
mps Epoch 4 : Average Loss 3.2164638351082417 | Accuracy 0.38612459546925565
mps Epoch 5 : Average Loss 2.842477780561231 | Accuracy 0.4451456310679612
mps Epoch 6 : Average Loss 2.5326272822120814 | Accuracy 0.49730312837108953
mps Epoch 7 : Average Loss 2.28292670224587 | Accuracy 0.5371012482662968
mps Epoch 8 : Average Loss 2.078500267226719 | Accuracy 0.5708940129449838
mps Epoch 9 : Average Loss 1.9144436161709277 | Accuracy 0.5961884214311399
mps Epoch 10 : Average Loss 1.7792399933037248 | Accuracy 0.6169093851132686
mps Epoch 11 : Average Loss 1.6671711970652225 | Accuracy 0.6344513092085907
mps Epoch 12 : Average Loss 1.5719861447361039 | Accuracy 0.6492044228694714
mps Epoch 13 : Average Loss 1.4901883981997674 | Accuracy 0.661999004232014
mps Epoch 1

We need to take into consideration that the vocabulary size is 382, the embedding size is 128, the maximum number of positions is 64, and the number of transformer blocks is 4 in this case.

The first thing to calculate is the number of parameters in the embedding layers. The word embedding has 48,896 parameters (382 × 128 = 48,896). The positional embedding has 8,192 parameters (64 × 128 = 8,192). So together, the embedding layers have 57,088 parameters (48,896 + 8,192 = 57,088).

The second thing to calculate is the number of parameters in the transformer blocks. The multi-head self-attention has 66,048 parameters ((3 × 128 × 128) + (3 × 128) + (128 × 128) + 128 = 49,152 + 384 + 16,384 + 128 = 66,048). The two layer normalizations have 512 parameters (2 × (2 × 128) = 2 × 256 = 512). In the feedforward network, the first linear layer has 66,048 parameters ((128 × 4 × 128) + (4 × 128) = 65,536 + 512 = 66,048) and the second linear layer has 65,664 parameters ((4 × 128 × 128) + 128 = 65,536 + 128 = 65,664), so the feedforward network has 131,712 parameters in total (66,048 + 65,664 = 131,712). One transformer block therefore has 198,272 parameters (66,048 + 512 + 131,712 = 198,272). Since there are 4 transformer blocks, the total number of parameters in the transformer blocks is 793,088 (4 × 198,272 = 793,088).

The third thing to calculate is the number of parameters in the final layers. The final layer normalization has 256 parameters (2 × 128 = 256). The output projection has 49,278 parameters ((128 × 382) + 382 = 48,896 + 382 = 49,278). So the final layers have 49,534 parameters in total (256 + 49,278 = 49,534).

Therefore, the total number of model parameters is 899,710 (57,088 + 793,088 + 49,534 = 899,710).
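
The hand calculation above can be double-checked with a short closed-form function. This is a sketch under the same assumptions as the tally: a feedforward hidden width of `4 * d_embed`, bias terms on every linear layer and attention projection, and a final layer norm before the output projection. The name `count_parameters` is hypothetical and not part of the assignment code.

```python
def count_parameters(vocab_size, d_embed, max_length, n_blocks, d_hidden=None):
    """Closed-form parameter count for the architecture tallied above."""
    if d_hidden is None:
        d_hidden = 4 * d_embed  # assumed feedforward hidden width
    # Word embedding plus positional embedding.
    embeddings = vocab_size * d_embed + max_length * d_embed
    # Attention: packed Q/K/V projection with bias, then output projection with bias.
    attention = (3 * d_embed * d_embed + 3 * d_embed) + (d_embed * d_embed + d_embed)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_embed)
    # Two linear layers with biases.
    feedforward = (d_embed * d_hidden + d_hidden) + (d_hidden * d_embed + d_embed)
    block = attention + layer_norms + feedforward
    # Final layer norm plus output projection to the vocabulary.
    final = 2 * d_embed + (d_embed * vocab_size + vocab_size)
    return embeddings + n_blocks * block + final

print(count_parameters(vocab_size=382, d_embed=128, max_length=64, n_blocks=4))  # 899710
```

Note that `num_heads` does not appear in the formula: splitting the projections across heads changes how they are used, not how many weights they contain.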

## Task 3

For this task, you will autoregressively generate text using simple random sampling with temperature. That is, given a starting input sequence, the model should generate up to `max_tokens` additional tokens. These are generated one at a time, at each step passing the entire input sequence (including any already generated tokens) as input to the model. 

1. Use the outputs of the **last input token only**, which should be the unnormalized logits over the vocabulary. 

2. Divide these unnormalized logits by the `temperature` parameter, normalize the result to a probability distribution over the vocabulary using [`softmax`](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html), and then [sample a next token from this probability distribution](https://pytorch.org/docs/stable/generated/torch.multinomial.html). Recall that a higher temperature pushes the distribution toward uniform randomness, while a lower temperature concentrates probability mass around the top-scoring tokens.

3. Stop the process if the generated token corresponds to `'<eos>'` or once you have generated a total of `max_tokens` new tokens.

For example, suppose we want to generate a completion of the `start_text = "<sos> Many cats"`. 

1. We would tokenize this and feed it as input to the model, and get the outputs corresponding to the last input token. 

2. We normalize these outputs as a probability distribution. Suppose we draw the index corresponding to `"like"` in the vocabulary from that probability distribution. 

3. Then to generate the next token, we pass the tokenized `"<sos> Many cats like"` as input to the model, repeating this process until we generate `"<eos>"` or reach `max_tokens` generations.
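
To build intuition for the temperature step, here is a minimal sketch in plain Python (standard library only) rather than PyTorch. The three logits are made up, and `softmax_with_temperature` is a hypothetical helper name; in the assignment itself the same idea is expressed with `softmax` and `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a three-word vocabulary.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sample one index from the temperature-scaled distribution.
random.seed(0)
probs = softmax_with_temperature(logits, 1.0)
next_index = random.choices(range(len(logits)), weights=probs, k=1)[0]
print(next_index)
```

The probability of the top-scoring entry shrinks as the temperature grows, which is why high temperatures produce more varied (and more error-prone) samples.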

In [15]:
def generate(model, tokenizer, start_text="<sos>", max_tokens=20, temperature=1, device=None):
    """
    Should return a string corresponding to up to max_tokens 
    autoregressively generated tokens beginning from start_text.
    """
    # TODO: implement generate function
    model.eval()
    sequence = tokenizer.encode(start_text)
    sequence_tensor = torch.tensor(sequence, device = device)
    eos_flag = tokenizer.word_to_index['<eos>']
    with torch.no_grad():
        for t in range(max_tokens):
            logits = model(sequence_tensor)
            temp_scaling = logits[-1] / temperature
            probabilities = torch.softmax(temp_scaling, dim = -1)
            next_token = torch.multinomial(probabilities, num_samples = 1).item()
            if next_token == eos_flag:
                break
            sequence.append(next_token)
            sequence_tensor = torch.tensor(sequence, device = device)
    return tokenizer.decode(sequence)

Demonstrate your `generate` function below by selecting **at least 9** experiments, three each of three different types that vary as follows. You can use the default `max_tokens=20` everywhere.
1. First, choose a simple prompt such as the default `"<sos>"` and vary the `temperature` parameter, trying values of `0.5`, the default of `1`, and a higher value of `2`.

2. Next, leave the `temperature` at the default of `1` and try at least three prompts, each of which should have many different possible completions within the training data itself. For example, `"<sos> Many cats"`, `"<sos> Some cats"`, and `"<sos> Cats have"` all appear several times in the training data.

3. Finally, again leaving the `temperature` at the default value of `1`, try at least three prompts, each of which does not appear in the training dataset but uses words from the training dataset. For example, `"<sos> Cats like toys"` is a correctly formed start to a sentence with words from the training data, but does not actually appear in the training data.

For each example, print the `start_text` and the returned generated text. Briefly interpret your results. Specifically discuss the effect of the `temperature` parameter and the observed difference in results between type 2 and type 3 prompts.

In [17]:
# TODO: write code for task 3 here
torch.manual_seed(2025)
print("Conducting 3 temperature experiments with same prompt")
for temp in [0.5, 1, 2]:
    output_generate = generate(model, tokenizer, start_text = "<sos>", max_tokens = 20, temperature = temp, device = device)
    print("With temperature", temp, ": start = <sos> ->", output_generate)
print()
print("Conducting 3 training-data prompts experiment")
training_prompts = ["<sos> Many cats", "<sos> Some cats", "<sos> Cats have"]
for prompt in training_prompts:
    output_generate = generate(model, tokenizer, start_text = prompt, max_tokens = 20, temperature = 1, device = device)
    print("start =", prompt, "->", output_generate)
print()
print("Conducting 3 novel prompts experiment")
novel_prompts = ["<sos> Cats like toys", "<sos> Cats can play", "<sos> Many kittens"]
for p in novel_prompts:
    output_generate = generate(model, tokenizer, start_text = p, max_tokens = 20, temperature = 1, device = device)
    print("start =", p, "->", output_generate)

Conducting 3 temperature experiments with same prompt
With temperature 0.5 : start = <sos> -> <sos> cats have a natural instinct to cover their waste in litter or soil
With temperature 1 : start = <sos> -> <sos> cats have a good sense of time and often know when it's mealtime
With temperature 2 : start = <sos> -> <sos> boxes chasing knead their paws on challenge chin cardboard feel

Conducting 3 training-data prompts experiment
start = <sos> Many cats -> <sos> many cats enjoy playing with interactive toys that challenge their hunting skills
start = <sos> Some cats -> <sos> some cats are very playful and will initiate games with their owners
start = <sos> Cats have -> <sos> cats have a third eyelid that helps protect their eyes while hunting

Conducting 3 novel prompts experiment
start = <sos> Cats like toys -> <sos> cats like toys or remember measure spots to warm themselves
start = <sos> Cats can play -> <sos> cats can play smell to help them navigate in the dark
start = <sos> Many ki

Looking at the effect of temperature, I notice that the model with a temperature of 0.5 tends to pick high-probability words: the text is fluent and on topic, but I find it a bit formulaic, as the sentence reads as formal and as if a language model wrote it. With a temperature of 1, the output is still coherent but less formulaic than at 0.5, so it is almost a sweet spot. With a temperature of 2, the word choices are noticeably more random, probably because the model selects low-probability words much more often, which makes the sentence nonsensical and incoherent. So it appears that higher temperatures give more diversity but also more mistakes, while lower temperatures favor quality and more coherent sentences.

As for the training-style prompts, I notice that the completions flow well, are factual, and closely match patterns in the cat dataset. With the novel prompts, on the other hand, the model mostly stays on topic, but I find the grammar and phrasing worse than in the sentences generated from training-style prompts, since the completions include some odd word choices that do not flow well together. The quality difference is understandable: the model memorizes common sentence frames and handles similar starts very well, but it generalizes less cleanly to new phrasings that it has not seen. Fluency drops a bit as a result, although the model still does a decent job of staying on the topic of cats.