<a href="https://colab.research.google.com/github/Sweta716/LLM/blob/main/Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between this implementation and production LLMs are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we are using one consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [48]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn
import torch.nn.functional as F  # used later as F.softmax / F.cross_entropy

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e) (10 points). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train BigramEmbeddingMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish
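
To illustrate the note on logits above, here is a minimal sketch on toy tensors (the values and variable names are made up for this example, not taken from the assignment data). It checks that `torch.nn.functional.cross_entropy` applies log-softmax internally, which is why the last layer should output raw logits, and that class-index targets and one-hot targets give the same loss.

```python
import torch
import torch.nn.functional as F

# Toy batch: 3 examples, 4 classes. Values are arbitrary.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0],
                       [0.1, 0.2, 0.3, 0.4],
                       [-2.0, 3.0, 0.0, 1.0]])
targets = torch.tensor([0, 3, 1])  # class indices

# cross_entropy takes raw logits (no softmax applied beforehand)
loss = F.cross_entropy(logits, targets)

# Equivalent manual computation: mean negative log-softmax at the target index
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(3), targets].mean()

# One-hot float targets are treated as class probabilities (PyTorch >= 1.10)
one_hot = F.one_hot(targets, num_classes=4).float()
loss_from_one_hot = F.cross_entropy(logits, one_hot)
```

Because both target formats are accepted, converting one-hot targets back to indices with `argmax` is optional in recent PyTorch versions.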


In [49]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-06-14 22:29:10--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-06-14 22:29:10 (113 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [50]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [51]:
chars = sorted(list(set(text)))
print(f"Unique characters: {chars}")


Unique characters: ['\n', ' ', '!', "'", ',', '.', ':', ';', '?', 'A', 'B', 'C', 'F', 'I', 'L', 'M', 'N', 'O', 'R', 'S', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']


In [52]:
stoi = { ch:i for i,ch in enumerate(chars) }

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

# Example usage
encoded = encode("hello")
print(f"Encoded 'hello': {encoded}")


Encoded 'hello': [29, 26, 33, 33, 36]


In [53]:
itos = { i:ch for i,ch in enumerate(chars) }

def decode(ids: list[int]) -> str:
    return ''.join([itos[i] for i in ids])

# Example usage
decoded = decode(encoded)
print(f"Decoded back to string: {decoded}")


Decoded back to string: hello


In [54]:
import torch

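# Toy example: from here on `text` is overwritten with "hello", so the shapes and
# samples printed below come from this 5-character string, not from TinyShakespeare
text = "hello"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

def create_one_hot_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]: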
    inputs = []
    outputs = []
    for i in range(len(text) - 1):
        input_char = text[i]
        output_char = text[i + 1]
        input_id = stoi[input_char]
        output_id = stoi[output_char]

        input_one_hot = torch.zeros(len(chars))
        output_one_hot = torch.zeros(len(chars))

        input_one_hot[input_id] = 1.0
        output_one_hot[output_id] = 1.0

        inputs.append(input_one_hot)
        outputs.append(output_one_hot)

    return torch.stack(inputs), torch.stack(outputs)

inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()
print(f"Inputs one-hot shape: {inputs_one_hot.shape}")
print(f"Outputs one-hot shape: {outputs_one_hot.shape}")


Inputs one-hot shape: torch.Size([4, 4])
Outputs one-hot shape: torch.Size([4, 4])
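
The same consecutive-pair construction can be written without an explicit Python loop. The sketch below is one possible alternative, assuming a hypothetical helper name `bigram_one_hot_pairs`; it relies on `torch.nn.functional.one_hot` and on slicing the id sequence so that inputs drop the last character and outputs drop the first.

```python
import torch
import torch.nn.functional as F

def bigram_one_hot_pairs(text: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Return one-hot (inputs, outputs) for every consecutive character pair."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    ids = torch.tensor([stoi[c] for c in text])
    # inputs are all characters but the last, outputs are all but the first
    inputs = F.one_hot(ids[:-1], num_classes=len(chars)).float()
    outputs = F.one_hot(ids[1:], num_classes=len(chars)).float()
    return inputs, outputs

inputs, outputs = bigram_one_hot_pairs("hello")
print(inputs.shape, outputs.shape)
```

For "hello" this yields four pairs over a four-character vocabulary, matching the shapes printed by the loop-based version.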


In [55]:
import torch
import torch.nn as nn

# Same toy setup as the previous cell: vocabulary built from the string "hello"
text = "hello"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

class BigramOneHotMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(len(chars), 8),
            nn.LeakyReLU(),
            nn.Linear(8, len(chars))
        )

    def forward(self, x):
        return self.net(x)

    def generate(self, start='a', max_new_tokens=100) -> str:
        generated_text = start
        input_one_hot = torch.zeros(1, len(chars))
        input_one_hot[0, stoi[start]] = 1.0

        for _ in range(max_new_tokens):
            logits = self.forward(input_one_hot)
            probs = torch.softmax(logits, dim=1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            next_char = itos[next_id]
            generated_text += next_char

            input_one_hot = torch.zeros(1, len(chars))
            input_one_hot[0, next_id] = 1.0

        return generated_text

bigram_one_hot_mlp = BigramOneHotMLP()
print("BigramOneHotMLP model created")


BigramOneHotMLP model created


In [56]:

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(bigram_one_hot_mlp.parameters(), lr=0.001)

# Reuse create_one_hot_inputs_and_outputs() from the earlier cell to build the training pairs
inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    logits = bigram_one_hot_mlp(inputs_one_hot)
    loss = criterion(logits, torch.argmax(outputs_one_hot, dim=1))
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Generate text after training
print("Generated text after training BigramOneHotMLP:")
print(bigram_one_hot_mlp.generate(start='h'))

Epoch 0, Loss: 1.3691418170928955
Epoch 100, Loss: 1.1970409154891968
Epoch 200, Loss: 1.002708911895752
Epoch 300, Loss: 0.7542867064476013
Epoch 400, Loss: 0.5484490990638733
Epoch 500, Loss: 0.4524416923522949
Epoch 600, Loss: 0.4006079435348511
Epoch 700, Loss: 0.3783514201641083
Epoch 800, Loss: 0.3673589527606964
Epoch 900, Loss: 0.3612023890018463
Generated text after training BigramOneHotMLP:
heloloelololllololoolllohelooellloooelllolloloellollloooooloellololoeoooloelloellolloeloollllolollloo


In [57]:
def create_embedding_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]:
    inputs = []
    outputs = []
    for i in range(len(text) - 1):
        input_char = text[i]
        output_char = text[i + 1]
        input_id = stoi[input_char]
        output_id = stoi[output_char]

        inputs.append(input_id)
        outputs.append(output_id)

    return torch.tensor(inputs), torch.stack([torch.nn.functional.one_hot(torch.tensor(output), num_classes=len(chars)).float() for output in outputs])

input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()
print(f"Input IDs shape: {input_ids.shape}")
print(f"Outputs one-hot shape: {outputs_one_hot.shape}")


Input IDs shape: torch.Size([4])
Outputs one-hot shape: torch.Size([4, 4])


In [58]:
class BigramEmbeddingMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(len(chars), 8)
        self.net = nn.Sequential(
            nn.Linear(8, 8),
            nn.LeakyReLU(),
            nn.Linear(8, len(chars))
        )

    def forward(self, x):
        embedded = self.embedding(x)
        return self.net(embedded)

    def generate(self, start='a', max_new_tokens=100) -> str:
        if start not in stoi:
            start = chars[0]  # Default to the first character if start is not in the dataset

        generated_text = start
        input_id = torch.tensor([stoi[start]])

        for _ in range(max_new_tokens):
            logits = self.forward(input_id)
            probs = torch.softmax(logits, dim=1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            next_char = itos[next_id]
            generated_text += next_char

            input_id = torch.tensor([next_id])

        return generated_text

bigram_embedding_mlp = BigramEmbeddingMLP()
print("BigramEmbeddingMLP model created")

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(bigram_embedding_mlp.parameters(), lr=0.001)

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    logits = bigram_embedding_mlp(input_ids)
    loss = criterion(logits, torch.argmax(outputs_one_hot, dim=1))
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Generate text after training
print("Generated text after training BigramEmbeddingMLP:")
print(bigram_embedding_mlp.generate())


BigramEmbeddingMLP model created
Epoch 0, Loss: 1.2666984796524048
Epoch 100, Loss: 0.7982078194618225
Epoch 200, Loss: 0.47917354106903076
Epoch 300, Loss: 0.3833381235599518
Epoch 400, Loss: 0.36133113503456116
Epoch 500, Loss: 0.3545904755592346
Epoch 600, Loss: 0.3516312539577484
Epoch 700, Loss: 0.35006436705589294
Epoch 800, Loss: 0.3491272032260895
Epoch 900, Loss: 0.3485206067562103
Generated text after training BigramEmbeddingMLP:
elooollooooolllloololoooollllollooooooololllooheloloooooooooollloolollloollllollllllooooollooooollooo


## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a GPU. In the settings at the top, go to Runtime -> Change Runtime Type and select T4 GPU.

In [59]:
# Run nvidia-smi to check GPU usage
!nvidia-smi



Fri Jun 14 22:29:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P0              32W /  70W |    445MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [60]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 point)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)


In [61]:
chars = []

def encode(s: str) -> list[int]:
    pass

def decode(ids: list[int]) -> str:
    pass

In [62]:
chars = sorted(list(set(text)))
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return ''.join([itos[i] for i in ids])

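# Encode the entire dataset and keep it on the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
data = torch.tensor(encode(text), dtype=torch.long).to(device)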



In [63]:
block_size = 16

# Create batches
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [64]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,



### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

In [66]:
class SelfAttentionHead(nn.Module):
    def __init__(self, n_embd, head_size):
        super().__init__()
        # Each projection maps the n_embd-dimensional input down to head_size
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        # Scaled dot-product scores, scaled by the key dimension (head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Causal mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        wei = wei.masked_fill(~mask, float('-inf'))
        wei = torch.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)


### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [91]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # Block passes head_size = n_embd // num_heads, so the heads together span n_embd features
        n_embd = num_heads * head_size
        self.heads = nn.ModuleList([SelfAttentionHead(n_embd, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        # Concatenate the per-head outputs along the feature dimension: (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out
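
One way to sanity-check the causal masking used in an attention head is to compute the attention weights on a tiny random input and inspect them. The sketch below uses stand-in query and key tensors (random values, toy sizes chosen for this example) rather than the module above.

```python
import torch

torch.manual_seed(0)
T, head_size = 5, 8
q = torch.randn(1, T, head_size)  # (B, T, head_size), random stand-ins for queries
k = torch.randn(1, T, head_size)  # random stand-ins for keys

# Scaled dot-product scores between every pair of positions
scores = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T)

# Lower-triangular mask: row t keeps columns 0..t and blocks the future
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
```

Each row of `weights` sums to 1, every entry above the diagonal is exactly 0, and the first position can only attend to itself.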


## MLP (2 points)
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [92]:
class MLP(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


## Transformer block (20 points)

Layer normalization helps training stability by normalizing the outputs of neurons within a single layer across all features for each individual data point, not across a full batch or a specific feature.

Dropout is a form of regularization to prevent overfitting.
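
As a small check of the layer normalization description above, the sketch below compares `nn.LayerNorm` with a manual computation that takes the mean and variance over the feature dimension only, separately for each token. The tensor sizes are arbitrary choices for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8)  # (batch, time, features); sizes are arbitrary

ln = nn.LayerNorm(8)  # default initialization: weight = 1, bias = 0
out = ln(x)

# Manual version: statistics over the last (feature) dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + ln.eps)
```

Since the statistics never mix different positions or batch elements, the result for one token does not depend on what else is in the batch.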

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [93]:
class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = MLP(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        # Pre-norm residual connections: x + sublayer(LayerNorm(x))
        x = x + self.dropout(self.sa(self.ln1(x)))
        x = x + self.dropout(self.ffwd(self.ln2(x)))
        return x


## GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss, if targets is not None. If targets is None, return the logits and None.
1. get the token embeddings by using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layernorm layer, and the final linear layer
5. compute the loss
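
Steps 1 to 3 above can be sketched on toy sizes as follows. The table sizes and variable names here are assumptions for illustration only. The point is that the position embeddings have shape (T, n_embd) and broadcast across the batch when added to the token embeddings.

```python
import torch
from torch import nn

vocab_size, max_len, n_embd = 10, 16, 4  # toy sizes, not the assignment's
tok_table = nn.Embedding(vocab_size, n_embd)
pos_table = nn.Embedding(max_len, n_embd)

idx = torch.randint(vocab_size, (2, 5))  # (B, T) batch of token ids
B, T = idx.shape
tok_emb = tok_table(idx)                 # (B, T, n_embd)
pos_emb = pos_table(torch.arange(T))     # (T, n_embd), shared by every sequence
x = tok_emb + pos_emb                    # broadcasts to (B, T, n_embd)
```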

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top_p, top_k, and temperature for sampling
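
A minimal sketch of how temperature, top-k, and top-p filtering can be combined before sampling, assuming a hypothetical helper `filter_probs` that works on a 1-D logits vector. It is one reasonable formulation, not the required solution. Note that after sorting for top-p, the kept probabilities are scattered back to their original token ids, so a sampled index is a vocabulary id rather than a rank in the sorted order.

```python
from typing import Optional

import torch

def filter_probs(logits: torch.Tensor, temperature: float = 1.0,
                 top_k: Optional[int] = None, top_p: float = 1.0) -> torch.Tensor:
    """Turn 1-D logits into a sampling distribution indexed by token id."""
    logits = logits / temperature  # < 1 sharpens, > 1 flattens the distribution
    if top_k is not None:
        # top-k: drop everything below the k-th largest logit
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    # top-p: keep the smallest set of most likely tokens whose mass reaches top_p
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_probs = sorted_probs.masked_fill(mass_before >= top_p, 0.0)
    # Undo the sort so that position i again refers to token id i
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

logits = torch.log(torch.tensor([0.5, 0.3, 0.15, 0.05]))
p_k = filter_probs(logits, top_k=2)
p_p = filter_probs(logits, top_p=0.7)
next_id = torch.multinomial(p_p, num_samples=1).item()
```

Masking on the probability mass before each token, rather than on the cumulative sum itself, guarantees that the most likely token is always kept, even when its probability alone exceeds `top_p`.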



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [94]:
import torch.nn.functional as F  # functional API: softmax and cross_entropy


class GPT(nn.Module):
    def __init__(self, n_embd, n_head, vocab_size, num_layers, max_len=512):
        super().__init__()
        self.max_len = max_len  # size of the learned position table
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(max_len, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)
        x = self.blocks(x)  # (B, T, n_embd)
        x = self.ln_f(x)  # (B, T, n_embd)
        logits = self.head(x)  # (B, T, vocab_size)

        if targets is None:
            return logits, None
        # cross_entropy expects (N, num_classes) logits and (N,) class ids
        loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, start_char, max_new_tokens, top_p=1.0, top_k=None, temperature=1.0):
        idx = torch.tensor([encode(start_char)], dtype=torch.long, device=self.head.weight.device)
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.max_len:]  # crop to the position table size
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # (1, vocab_size)
            if top_k is not None:
                # top-k: drop everything below the k-th largest logit
                kth = torch.topk(logits, min(top_k, logits.size(-1))).values[:, [-1]]
                logits = logits.masked_fill(logits < kth, float('-inf'))
            probs = F.softmax(logits, dim=-1)
            # top-p: keep the smallest set of most likely tokens whose mass reaches top_p
            sorted_probs, sorted_indices = torch.sort(probs, descending=True)
            cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
            remove = cumulative_probs - sorted_probs >= top_p  # mass before this token
            sorted_probs = sorted_probs.masked_fill(remove, 0.0)
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
            choice = torch.multinomial(sorted_probs, 1)  # position in the sorted order
            next_token = sorted_indices.gather(-1, choice)  # map back to a vocabulary id
            idx = torch.cat([idx, next_token], dim=1)
        return decode(idx[0].tolist())


### Training loop (15 points)

Implement the training loop.

In [95]:
# Hyperparameters
vocab_size = len(chars)
n_embd = 512
n_head = 8
num_layers = 6
learning_rate = 3e-4
num_epochs = 10
batch_size = 64
log_every = 1000  # print the loss every `log_every` steps instead of on every step

# Initialize model and optimizer (the loss is computed inside GPT.forward)
model = GPT(n_embd, n_head, vocab_size, num_layers).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

steps_per_epoch = len(data) // batch_size
for epoch in range(num_epochs):
    model.train()
    for step in range(steps_per_epoch):
        x, y = get_batch()  # x, y: (batch_size, block_size)
        optimizer.zero_grad()
        logits, loss = model(x, y)
        loss.backward()
        optimizer.step()

        if step % log_every == 0:
            print(f"Epoch {epoch+1}, Step {step}, Loss: {loss.item():.4f}")
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

# Generate text after training
model.eval()  # disable dropout while sampling
start_char = "A"
generated_text = model.generate(start_char, max_new_tokens=100, top_p=0.95, top_k=10, temperature=1.0)
print("Generated text:", generated_text)


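Recorded output from a debugging run of this cell, which printed tensor shapes on every forward pass:

Input x shape: torch.Size([64, 16])
Input y shape: torch.Size([64, 16])
Token Embedding Shape: torch.Size([64, 16, 512])
Position Embedding Shape: torch.Size([1, 16, 512])
Combined Embedding Shape: torch.Size([64, 16, 512])
Block: torch.Size([64, 16, 512])


RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x512 and 64x64)

The run stops in the first attention head. `mat1` is the block input flattened to 64 * 16 = 1024 rows of 512 features, and `mat2` is a 64x64 weight, that is, a key/query/value projection built as `nn.Linear(head_size, head_size)`. The projections need `n_embd` (512) input features and `head_size` (64) output features.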

### Generate text


Print some text that your model generates.