<a href="https://colab.research.google.com/github/KarthikGowdaRamakrishna/Adv-Techniques-With-LLM-INFO7374-Spring-2025/blob/main/Assignment_2_Karthik.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between this implementation and the models used in practice are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we use a single consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [None]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e) (10 points). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train BigramEmbeddingMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish


In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-02-28 05:08:38--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-02-28 05:08:38 (29.5 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

## 1a) Collect all unique characters

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("Unique characters found:", chars)
print("Vocab size:", vocab_size)

Unique characters found: ['\n', ' ', '!', "'", ',', '.', ':', ';', '?', 'A', 'B', 'C', 'F', 'I', 'L', 'M', 'N', 'O', 'R', 'S', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']
Vocab size: 46


## 1b) Implement encode(s: str) -> list[int]

In [None]:

stoi = { ch: i for i, ch in enumerate(chars) }  # string to integer
itos = { i: ch for i, ch in enumerate(chars) }  # integer to string

def encode(s: str) -> list[int]:
    """
    Convert a string into a list of integer indices.
    """
    return [stoi[ch] for ch in s]


## 1c) Implement decode(ids: list[int]) -> str

In [None]:
def decode(ids: list[int]) -> str:
    """
    Convert a list of integer indices back to the original string.
    """
    return ''.join(itos[i] for i in ids)


## 1d) Create two tensors, inputs_one_hot and outputs_one_hot

In [None]:
import torch

# Encode the text
encoded_text = encode(text)

x_list = []
y_list = []

# consecutive pair (for 'hello', we get 'h' -> 'e', 'e' -> 'l', ...)
for i in range(len(encoded_text) - 1):
    x_list.append(encoded_text[i])
    y_list.append(encoded_text[i + 1])

# Convert to torch tensors
x = torch.tensor(x_list, dtype=torch.long)  # shape [N]
y = torch.tensor(y_list, dtype=torch.long)  # shape [N]

# Create one-hot encodings
N = x.shape[0]
inputs_one_hot = torch.zeros(N, vocab_size)
inputs_one_hot[torch.arange(N), x] = 1.0

outputs_one_hot = torch.zeros(N, vocab_size)
outputs_one_hot[torch.arange(N), y] = 1.0

print("inputs_one_hot shape:", inputs_one_hot.shape)
print("outputs_one_hot shape:", outputs_one_hot.shape)


inputs_one_hot shape: torch.Size([999, 46])
outputs_one_hot shape: torch.Size([999, 46])
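
As a cross-check on the pairing logic, here is a minimal sketch that builds the same kind of input-output pairs for the word 'hello' from the task description, using `torch.nn.functional.one_hot` instead of index assignment. The names `word`, `vocab`, `stoi_demo`, `x_demo`, and `y_demo` are illustrative and not part of the assignment.

```python
import torch
import torch.nn.functional as F

word = "hello"
vocab = sorted(set(word))                       # ['e', 'h', 'l', 'o']
stoi_demo = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi_demo[ch] for ch in word])

# Inputs are every character but the last; targets are the same ids shifted by one.
x_demo, y_demo = ids[:-1], ids[1:]
pairs = [word[i] + word[i + 1] for i in range(len(word) - 1)]

# F.one_hot returns integer tensors, so cast to float before feeding a Linear layer.
inputs_demo = F.one_hot(x_demo, num_classes=len(vocab)).float()
outputs_demo = F.one_hot(y_demo, num_classes=len(vocab)).float()

print(pairs)
print(inputs_demo.shape, outputs_demo.shape)
```

Each row of `inputs_demo` has exactly one non-zero entry, at the position of the input character's id.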


## 1e) Implement BigramOneHotMLP

In [None]:
import torch
from torch import nn

class BigramOneHotMLP(nn.Module):
    def __init__(self, vocab_size, hidden_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Linear(hidden_dim, vocab_size)
        )

    def forward(self, x_one_hot):

        logits = self.net(x_one_hot)
        return logits

    @torch.no_grad()
    def generate(self, start_idx, max_new_tokens=50):

        generated = [start_idx]
        # Use the model's own input width instead of the global vocab_size
        current_one_hot = torch.zeros(self.net[0].in_features)
        current_one_hot[start_idx] = 1.0
        current_one_hot = current_one_hot.unsqueeze(0)  # shape [1, vocab_size]

        for _ in range(max_new_tokens):
            logits = self.forward(current_one_hot)  # [1, vocab_size]
            # Convert logits to probabilities
            probs = nn.functional.softmax(logits, dim=-1)
            # probability distribution
            next_idx = torch.multinomial(probs, num_samples=1).item()
            generated.append(next_idx)

            current_one_hot = torch.zeros_like(current_one_hot)
            current_one_hot[0, next_idx] = 1.0

        return generated


## 1f) Train the BigramOneHotMLP for 1000 steps

In [None]:
import torch.optim as optim

model = BigramOneHotMLP(vocab_size=vocab_size, hidden_dim=8)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

inputs_one_hot_float = inputs_one_hot.float()

num_steps = 1000
for step in range(num_steps):
    # Forward pass
    logits = model(inputs_one_hot_float)  # shape: [N, vocab_size]

    # raw logits and integer targets.
    loss = criterion(logits, y)


    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        print(f"Step {step} / {num_steps}, Loss: {loss.item():.4f}")


Step 0 / 1000, Loss: 3.8279
Step 100 / 1000, Loss: 2.3991
Step 200 / 1000, Loss: 2.2108
Step 300 / 1000, Loss: 2.1592
Step 400 / 1000, Loss: 2.1394
Step 500 / 1000, Loss: 2.1265
Step 600 / 1000, Loss: 2.1174
Step 700 / 1000, Loss: 2.1106
Step 800 / 1000, Loss: 2.1055
Step 900 / 1000, Loss: 2.1015
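
The step 0 loss above is close to ln(46) ≈ 3.83, which is what cross entropy gives when every one of the 46 characters is predicted with equal probability. The sketch below checks that, and also reproduces the cross entropy by hand from one-hot targets and `log_softmax` to show why the network should output raw logits. The target ids and random logits are made up for illustration.

```python
import math
import torch
import torch.nn.functional as F

V = 46  # vocab size of the 1000-character subset
targets = torch.tensor([0, 5, 45])  # arbitrary example class ids

# All-zero logits give a uniform distribution, so the loss equals ln(V).
uniform_logits = torch.zeros(len(targets), V)
uniform_loss = F.cross_entropy(uniform_logits, targets)
print(uniform_loss.item(), math.log(V))

# cross_entropy applies log_softmax internally; this reproduces it by hand.
torch.manual_seed(0)
logits = torch.randn(len(targets), V)
one_hot = F.one_hot(targets, num_classes=V).float()
manual_loss = -(one_hot * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
builtin_loss = F.cross_entropy(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

Applying a softmax in the model before this loss would normalize the outputs twice, which is why the last layer has no activation.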


## 1g) Create input_ids and outputs_one_hot

In [None]:


input_ids = x  # integer IDs for the input
outputs_one_hot = torch.zeros_like(inputs_one_hot)  # or reconstruct similarly
outputs_one_hot[torch.arange(len(y)), y] = 1.0

print("input_ids shape:", input_ids.shape)        # Should be [N]
print("outputs_one_hot shape:", outputs_one_hot.shape)  # Should be [N, vocab_size]


input_ids shape: torch.Size([999])
outputs_one_hot shape: torch.Size([999, 46])
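
The reason integer ids are enough for an embedding layer can be sketched as follows: looking up row `i` of the embedding table gives the same result as multiplying a one-hot vector for `i` by that table. The sizes and ids below are chosen for illustration only.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
V, d = 46, 8                      # vocab size and embedding width (illustrative)
emb = nn.Embedding(V, d)
ids = torch.tensor([3, 17, 3, 45])

looked_up = emb(ids)                                              # (4, 8)
via_one_hot = F.one_hot(ids, num_classes=V).float() @ emb.weight  # (4, 8)

print(torch.allclose(looked_up, via_one_hot))
```

The lookup avoids materializing the `[N, vocab_size]` one-hot matrix, which matters more as the vocabulary grows.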


## 1h) Implement and train BigramEmbeddingMLP

In [None]:
class BigramEmbeddingMLP(nn.Module):
    def __init__(self, vocab_size, embed_dim=8, hidden_dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)


        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Linear(hidden_dim, vocab_size)
        )

    def forward(self, x_ids):
        """
        x_ids: [batch_size] of integer token IDs
        Returns logits: [batch_size, vocab_size]
        """
        x_embed = self.embedding(x_ids)        # shape: [batch_size, embed_dim]
        logits = self.net(x_embed)             # shape: [batch_size, vocab_size]
        return logits

    @torch.no_grad()
    def generate(self, start_idx, max_new_tokens=50):
        """
        start_idx: single integer index representing the initial character
        max_new_tokens: how many characters to generate
        """
        generated_indices = [start_idx]
        current_idx = torch.tensor([start_idx], dtype=torch.long)

        for _ in range(max_new_tokens):
            logits = self.forward(current_idx)     # [1, vocab_size]
            probs = nn.functional.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1).item()
            generated_indices.append(next_idx)

            current_idx = torch.tensor([next_idx], dtype=torch.long)

        return generated_indices

# Now let's train the model
model_embed = BigramEmbeddingMLP(vocab_size=vocab_size, embed_dim=8, hidden_dim=8)
optimizer_embed = torch.optim.Adam(model_embed.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

num_steps = 1000
for step in range(num_steps):
    logits = model_embed(input_ids)  # shape: [N, vocab_size]

    loss = criterion(logits, y)

    # Backprop
    optimizer_embed.zero_grad()
    loss.backward()
    optimizer_embed.step()

    if step % 100 == 0:
        print(f"Step {step}/{num_steps}, loss: {loss.item():.4f}")


Step 0/1000, loss: 3.8591
Step 100/1000, loss: 2.4054
Step 200/1000, loss: 2.2370
Step 300/1000, loss: 2.1792
Step 400/1000, loss: 2.1550
Step 500/1000, loss: 2.1396
Step 600/1000, loss: 2.1291
Step 700/1000, loss: 2.1228
Step 800/1000, loss: 2.1181
Step 900/1000, loss: 2.1142



## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a GPU. In the menu at the top, go to Runtime -> Change runtime type and select T4 GPU.

In [None]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Fri Feb 28 05:08:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   56C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 point)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)


In [None]:

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Number of unique characters:", vocab_size)
print("Unique characters:", chars)

# mappings for encode/decode
stoi = { ch: i for i, ch in enumerate(chars)}  # string to integer
itos = { i: ch for i, ch in enumerate(chars)}  # integer to string

# 2) Implement encode(s: str) -> list[int]
def encode(s: str) -> list[int]:
    return [stoi[ch] for ch in s]

# 3) Implement decode(ids: list[int]) -> str
def decode(ids: list[int]) -> str:
    return ''.join(itos[i] for i in ids)

# Quick test
sample = "Hello"
encoded_sample = encode(sample)
decoded_sample = decode(encoded_sample)
print(f"Sample: '{sample}' -> {encoded_sample} -> '{decoded_sample}'")


Number of unique characters: 65
Unique characters: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Sample: 'Hello' -> [20, 43, 50, 50, 53] -> 'Hello'


In [None]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [None]:
block_size = 16
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [None]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,

In [None]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):

    def __init__(self, embed_dim, block_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, embed_dim, bias=False)
        self.query = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)

        # Causal mask: we only attend to positions <= current index
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        """
        x: Tensor of shape (B, T, C)
           B = batch size
           T = block_size (sequence length)
           C = embed_dim (embedding dimension)
        """
        B, T, C = x.shape

        # Compute key, query, value
        k = self.key(x)   # (B, T, C)
        q = self.query(x) # (B, T, C)
        v = self.value(x) # (B, T, C)

        # Compute attention scores: (B, T, C) x (B, T, C)^T -> (B, T, T)
        # We'll transpose the last two dimensions of k for the matrix multiply
        att = q @ k.transpose(-2, -1) / (C**0.5)

        # Apply the causal mask to prevent attending to future tokens
        att = att.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        att = self.dropout(att)

        # Weighted sum of the values
        out = att @ v  # (B, T, T) x (B, T, C) -> (B, T, C)
        return out
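
To see what the causal mask does to the attention weights, here is a small sketch with `T = 4` positions where every query-key score is assumed to be equal (all zeros). It is an illustration only and uses no trained parameters.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                  # pretend every q·k score is equal
tril = torch.tril(torch.ones(T, T))         # 1 on and below the diagonal

# Future positions get -inf, so softmax assigns them zero weight.
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
```

Row `t` of `weights` spreads its probability mass over positions `0..t` only, so the output at position `t` cannot depend on later tokens.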


### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [None]:
import torch
import torch.nn as nn

class MinimalAttentionBlock(nn.Module):

    def __init__(self, embed_dim, block_size):
        super().__init__()
        self.ln = nn.LayerNorm(embed_dim)
        self.attn = SingleHeadSelfAttention(embed_dim, block_size)

    def forward(self, x):

        # Pre-norm approach: layer norm first, then attention
        x_normed = self.ln(x)
        attn_out = self.attn(x_normed)

        # Residual connection: add the original x
        out = x + attn_out
        return out
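
The cell above wraps a single attention head in a pre-norm residual block. The multi-head layout described in the text (several heads, concatenation along the feature dimension, then a linear layer) could be sketched roughly as below. The class names `Head` and `MultiHeadSelfAttention`, the `head_size = n_embd // n_head` split, and the default sizes are assumptions made for this sketch, not the notebook's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head projecting n_embd -> head_size (hypothetical helper)."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)    # (B, T, head_size)
        att = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)   # (B, T, T)
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        return att @ v                                          # (B, T, head_size)

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, n_embd=64, n_head=4, block_size=32, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size, dropout) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        # Concatenating n_head outputs of width head_size restores width n_embd.
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

mha = MultiHeadSelfAttention().eval()   # eval() disables dropout for the demo
x_demo = torch.randn(2, 16, 64)
print(mha(x_demo).shape)
```

Splitting the width across heads keeps the concatenated output at `n_embd`, so it matches the `n_embd` to `n_embd` linear layer the task asks for.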


## MLP (2 points)
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [None]:
import torch
import torch.nn as nn

class MLP(nn.Module):

    def __init__(self, embed_dim=64, hidden_dim=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),  # (B, T, 64) → (B, T, 256)
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),  # (B, T, 256) → (B, T, 64)
            nn.Dropout(dropout)  # Final output remains (B, T, 64)
        )

    def forward(self, x):
        return self.net(x)


In [None]:
batch_size, seq_len, embed_dim = 8, 32, 64
mlp = MLP(embed_dim=embed_dim)
x = torch.randn(batch_size, seq_len, embed_dim)  # Random input
output = mlp(x)
print("Output shape:", output.shape)  # Should be (8, 32, 64)


Output shape: torch.Size([8, 32, 64])


## Transformer block (20 points)

Layer normalization helps training stability by normalizing the activations of a layer across all features for each individual data point, rather than across the batch or for one feature at a time.

Dropout is a form of regularization to prevent overfitting.
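
A minimal sketch of the layer normalization behavior described above: each token's feature vector is normalized on its own, regardless of the other tokens or batch entries. The input shape and scale are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
x_demo = torch.randn(2, 3, 64) * 5 + 10    # (batch, tokens, features), far from zero mean
ln = nn.LayerNorm(64)                       # weight is 1 and bias is 0 at initialization
y_demo = ln(x_demo)

# Statistics are taken over the feature axis, separately for every token.
print(y_demo.mean(dim=-1))                  # every entry close to 0
print(y_demo.var(dim=-1, unbiased=False))   # every entry close to 1
```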

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [None]:
class TransformerBlock(nn.Module):

    def __init__(self, embed_dim, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)  # Norm before attention
        self.attn = SingleHeadSelfAttention(embed_dim, block_size)
        self.ln2 = nn.LayerNorm(embed_dim)  # Norm before MLP
        self.mlp = MLP(embed_dim)  # Uses the MLP we just implemented

    def forward(self, x):
        """
        x: (batch_size, seq_len, embed_dim)
        """
        # Self-attention with residual connection
        x = x + self.attn(self.ln1(x))

        # Feed-forward MLP with residual connection
        x = x + self.mlp(self.ln2(x))

        return x  # Output shape remains (B, T, C)


## GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids of size (B, T) as input and returns the logits and the loss. If targets is None, return the logits and None.
1. get the token embeddings using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layer norm layer, and the final linear layer
5. compute the loss if targets is not None

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top_p, top_k, and temperature for sampling
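
One way the three sampling controls could fit together is sketched below as a standalone function. The name `filter_logits` and the toy distribution are hypothetical; the function assumes logits of shape `(B, vocab_size)` and returns a filtered copy that can be passed to `softmax` and `torch.multinomial`.

```python
import torch
import torch.nn.functional as F

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a filtered copy of `logits` with shape (B, vocab_size)."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    logits = logits / temperature

    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[:, -1:]        # (B, 1)
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove_sorted = cum_probs > top_p
        # Shift right so the first token that crosses top_p is still kept.
        remove_sorted[:, 1:] = remove_sorted[:, :-1].clone()
        remove_sorted[:, 0] = False
        # Map the mask from sorted order back to vocabulary order.
        remove = torch.zeros_like(remove_sorted).scatter(1, sorted_idx, remove_sorted)
        logits = logits.masked_fill(remove, float("-inf"))

    return logits

# Toy distribution over 4 tokens with probabilities 0.5, 0.3, 0.15, 0.05.
demo_logits = torch.log(torch.tensor([[0.5, 0.3, 0.15, 0.05]]))
kept_k = filter_logits(demo_logits, top_k=2)
kept_p = filter_logits(demo_logits, top_p=0.7)
next_token = torch.multinomial(F.softmax(kept_p, dim=-1), num_samples=1)
print(kept_k, kept_p, next_token)
```

In this toy case both filters keep only the two most likely tokens: top_k by count, and top_p because the cumulative probability first exceeds 0.7 at the second token.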



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, vocab_size, n_embd=64, n_head=4, n_layers=4, block_size=32):

        super().__init__()
        self.block_size = block_size

        # Token and Position Embeddings
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)

        # Stacking 4 Transformer Blocks
        self.blocks = nn.Sequential(
            *[TransformerBlock(n_embd, block_size) for _ in range(n_layers)]
        )

        # Layer Norm & Linear Head
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        B, T = idx.shape  # Batch size, Sequence length

        # token embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, n_embd)

        # position embeddings
        pos_ids = torch.arange(T, device=idx.device)
        pos_emb = self.position_embedding(pos_ids)  # (T, n_embd)
        pos_emb = pos_emb.unsqueeze(0)  # (1, T, n_embd)

        x = tok_emb + pos_emb  # (B, T, n_embd)

        x = self.blocks(x)

        # LayerNorm and linear projection
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)

        loss = None
        if targets is not None:
            logits = logits.view(B * T, -1)
            targets = targets.view(B * T)
            loss = nn.CrossEntropyLoss()(logits, targets)

        return logits, loss

    @torch.no_grad()
    def generate(self, start_char, max_new_tokens=50, top_p=0.9, top_k=10, temperature=1.0):

        self.eval()  # Set model to evaluation mode (disables dropout)

        # Create the context on the same device as the model's parameters
        device = self.head.weight.device
        idx = torch.tensor([[stoi[start_char]]], dtype=torch.long, device=device)

        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]

            # Get logits for next token
            logits, _ = self.forward(idx_cond)

            logits = logits[:, -1, :]  # (1, vocab_size)

            logits = logits / temperature

            if top_k > 0:
                # Keep only the top_k largest logits
                values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits = logits.masked_fill(logits < values[:, [-1]], -float("Inf"))

            if top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                # Drop tokens past the top_p mass, keeping the first one that crosses it
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
                sorted_indices_to_remove[:, 0] = False

                # Map the mask from sorted order back to vocabulary order
                indices_to_remove = torch.zeros_like(sorted_indices_to_remove).scatter(
                    1, sorted_indices, sorted_indices_to_remove
                )
                logits = logits.masked_fill(indices_to_remove, -float("Inf"))

            # Convert logits to probabilities
            probs = F.softmax(logits, dim=-1)

            # Sample from probability distribution
            next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)

            # Append new token to sequence
            idx = torch.cat((idx, next_token), dim=1)

        # Convert generated token indices back to text
        return decode(idx[0].tolist())


### Training loop (15 points)

Implement the training loop.

In [None]:
def build_dataset(data, block_size):

    X, Y = [], []
    for i in range(len(data) - block_size):
        X.append(data[i : i + block_size])
        Y.append(data[i + 1 : i + block_size + 1])  # Shifted by 1 token

    return torch.stack(X), torch.stack(Y)

block_size = 32
data = torch.tensor(encode(text), dtype=torch.long)  # Convert text to tokenized tensor
X, Y = build_dataset(data, block_size)

print("Dataset shapes:", X.shape, Y.shape)  # Should be (N, 32) each


Dataset shapes: torch.Size([1115362, 32]) torch.Size([1115362, 32])
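
The Python loop in `build_dataset` creates over a million small tensors before stacking them. A possible alternative, sketched here under the assumption that `data` is a 1-D tensor, is `Tensor.unfold`, which returns the overlapping windows as views. The function name `build_dataset_unfold` and the toy input are illustrative.

```python
import torch

def build_dataset_unfold(data, block_size):
    # unfold(dim, size, step) returns overlapping windows as views of `data`.
    X = data[:-1].unfold(0, block_size, 1)   # (len(data) - block_size, block_size)
    Y = data[1:].unfold(0, block_size, 1)    # same windows shifted by one token
    return X, Y

toy = torch.arange(10)
X_demo, Y_demo = build_dataset_unfold(toy, block_size=4)
print(X_demo.shape, X_demo[0], Y_demo[0])
```

Indexing these views with a tensor of random row ids, as `get_batch` does, copies only the selected rows.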


In [None]:
import torch
import torch.optim as optim

# Ensure the model runs on GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
batch_size = 32         # How many sequences per batch
max_iters = 5000        # Number of training iterations
learning_rate = 3e-4    # Learning rate for AdamW

# Initialize the GPT model and move it to GPU
model = GPT(vocab_size=vocab_size, n_embd=64, n_head=4, block_size=32).to(device)

# Optimizer (AdamW is commonly used for transformers)
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

# Function to sample a batch of training data
def get_batch(X, Y, batch_size):
    ix = torch.randint(0, len(X), (batch_size,))
    xb = X[ix].to(device)  # Move batch to GPU
    yb = Y[ix].to(device)
    return xb, yb

# Training loop
for step in range(max_iters):  # `step` avoids shadowing the built-in iter()
    # Get a random batch
    xb, yb = get_batch(X, Y, batch_size)

    # Forward pass: compute logits and loss
    logits, loss = model(xb, targets=yb)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss every 100 iterations
    if step % 100 == 0:
        print(f"Iteration {step}/{max_iters}, Loss: {loss.item():.4f}")

print("Training complete!")


Iteration 0/5000, Loss: 4.3477
Iteration 100/5000, Loss: 3.1248
Iteration 200/5000, Loss: 2.7105
Iteration 300/5000, Loss: 2.6458
Iteration 400/5000, Loss: 2.4971
Iteration 500/5000, Loss: 2.4554
Iteration 600/5000, Loss: 2.3747
Iteration 700/5000, Loss: 2.3771
Iteration 800/5000, Loss: 2.3790
Iteration 900/5000, Loss: 2.3144
Iteration 1000/5000, Loss: 2.3413
Iteration 1100/5000, Loss: 2.2722
Iteration 1200/5000, Loss: 2.1911
Iteration 1300/5000, Loss: 2.1950
Iteration 1400/5000, Loss: 2.2010
Iteration 1500/5000, Loss: 2.1022
Iteration 1600/5000, Loss: 2.1532
Iteration 1700/5000, Loss: 2.1230
Iteration 1800/5000, Loss: 2.0947
Iteration 1900/5000, Loss: 2.0825
Iteration 2000/5000, Loss: 2.0701
Iteration 2100/5000, Loss: 2.0781
Iteration 2200/5000, Loss: 2.0484
Iteration 2300/5000, Loss: 2.0762
Iteration 2400/5000, Loss: 2.0352
Iteration 2500/5000, Loss: 2.0111
Iteration 2600/5000, Loss: 2.0277
Iteration 2700/5000, Loss: 2.0335
Iteration 2800/5000, Loss: 1.9911
Iteration 2900/5000, Loss:

### Generate text


Print some text that your model generates.