# **Miniproject 2**
## **~Large~ Small Language Model**

### **Objective**
Implement a transformer-based, character-level language model (GPT-like) and train it on the Shakespeare dataset. By the end of this project, you should be able to generate Shakespearean-like text given a seed string.

You will probably want to train the model on a GPU. You can use free GPUs on [Google Colab](https://colab.research.google.com/?utm_source=scs-index).

### **Dataset**:

The Shakespeare dataset contains the complete works of William Shakespeare, including his plays, poems, and sonnets.

[**Download link**](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)

In a character-level language model, each character in the input data is mapped to its respective index from a dictionary. The input to the model is in the form (B, N), where B is the batch size and N is the number of tokens for each sequence. The model was tested with B=N=128, but feel free to explore different values.
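
The character-to-index mapping described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the sample string and the helper names `encode` and `decode` are assumptions introduced here.

```python
# Hypothetical stand-in for the contents of input.txt
text = "To be, or not to be"

# The sorted set of distinct characters defines the vocabulary
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
itos = {i: ch for ch, i in stoi.items()}      # index -> character

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(len(chars), ids, decode(ids))
```

Stacking B such encoded sequences of length N gives the (B, N) integer input the model expects.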

An interface for the dataset class that takes care of tokenization is provided below.



```python
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """
    Emits batches of characters.

    Adapted from "https://github.com/karpathy/minGPT".
    """

    def __init__(self, config, data):

        chars = ... # get characters from the input data
        self.stoi = { ch:i for i,ch in enumerate(chars) } # map characters to integer indices

        ...

    def get_vocab_size(self):
        raise NotImplementedError()

    def __len__(self):
        raise NotImplementedError()

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        # encode every character to an integer
        # return the chunk and the shifted version as tensors
        pass
```




### **Requirements**

#### **Architecture**

Implement the Transformer's decoder-only structure.
This includes:

* input token embeddings
* the causal multi-head self-attention mechanism
* feed-forward neural networks
* positional encodings, residual connections, layer normalizations.

The project was tested with $12$ layers, $8$ attention heads, and $768$ embedding dimensions, on a single GPU.
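
To get a feeling for the size of this configuration, the weights in the transformer blocks can be estimated with simple arithmetic. The sketch below assumes an MLP hidden size of four times the embedding dimension (a common choice, not stated in this handout) and ignores biases, layer normalizations, and embeddings; `approx_block_params` is a hypothetical helper.

```python
def approx_block_params(d_model, n_layers, mlp_ratio=4):
    """Rough weight count for the transformer blocks (biases, LayerNorms and embeddings ignored)."""
    attn = 4 * d_model * d_model             # query, key, value and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # two linear layers with hidden size mlp_ratio * d_model
    return n_layers * (attn + mlp)

n_params = approx_block_params(d_model=768, n_layers=12)
print(f"{n_params / 1e6:.1f}M parameters in the blocks")
print("dimension per attention head:", 768 // 8)
```

Under these assumptions the blocks hold roughly 85 million weights, which explains why a GPU is recommended.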

The `forward` method for the entire model has the following form:

```
tok_emb = WTE(idx) # token embeddings
pos_emb = WPE(pos) # position embeddings
x = Dropout(tok_emb + pos_emb)
for Block in Blocks:
    x = Block(x)
x = Final_LayerNorm(x)
logits = LM_Head(x)
```

The `forward` method for the transformer block has the following form:



```
x = x + self.CausalSelfAttn(self.LayerNorm_1(x))
out = x + self.MLP(self.LayerNorm_2(x))
```

---

#### **Training**

In a character-level transformer language model, the goal is to predict the next character in a sequence given the previous characters. To train such a model effectively, we use two versions of our data: the input sequence and a shifted version of this sequence, which serves as the target for our predictions.
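
The relation between the input sequence and its shifted target is easiest to see on a tiny example. The snippet below is only a sketch: the text, the window length, and the helper `make_example` are assumptions chosen for illustration.

```python
text = "First Citizen"  # hypothetical snippet of training text
block_size = 5          # assumed window length, kept small for readability

def make_example(text, start, block_size):
    """Return (input, target), where target is the input shifted left by one character."""
    chunk = text[start:start + block_size + 1]
    return chunk[:-1], chunk[1:]

x, y = make_example(text, 0, block_size)
for t in range(block_size):
    # after seeing x[:t + 1], the model should predict y[t]
    print(repr(x[:t + 1]), "->", repr(y[t]))
```

Each window therefore yields `block_size` training signals at once, one per position.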

* Preprocess the dataset to a character-level representation.
* Use a sliding window approach for sequence chunks (e.g., window size of $128$ characters).
* Implement causal masking for the self-attention mechanism.
* Use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer and the cross-entropy loss.
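
As an illustration of the causal masking step, here is a minimal single-head sketch. It is one possible way to write it, not the required implementation; the function name `causal_attention` and the tensor sizes are assumptions.

```python
import math
import torch

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (B, T, D).
    """
    T = q.size(1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T)
    # True above the diagonal marks future positions that must stay hidden
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 4, 8)
out, weights = causal_attention(x, x, x)
print(weights[0])  # entries above the diagonal are exactly zero
```

If you use `nn.MultiheadAttention` instead, note that a boolean `attn_mask` follows the same convention (True means "not allowed to attend"), whereas a float mask is added to the attention scores.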

**Optional**:

* Implement a learning rate decay strategy
* Implement gradient clipping
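
Both optional items fit into the training step with two extra lines. The sketch below uses cosine annealing as one possible decay strategy; the tiny linear model, the number of steps, and the clipping threshold are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)  # hypothetical stand-in for the language model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
total_steps = 100         # assumed training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
max_grad_norm = 1.0       # assumed clipping threshold

for step in range(total_steps):
    x = torch.randn(8, 16)
    y = torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # rescale the gradients so that their global L2 norm is at most max_grad_norm
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()      # decay the learning rate once per optimizer step

print(optimizer.param_groups[0]["lr"])
```

Clipping must happen after `loss.backward()` and before `optimizer.step()`, otherwise it has no effect on the update.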

---


#### **Evaluation and Inference**

* Monitor the cross-entropy loss. Use a seed string to initialize the model and generate Shakespearean-like text.

* In order to generate the characters, at each generation step you can either select the character with the highest probability, or you can sample according to the output distribution.
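
The two decoding options can be compared on a toy distribution. This is a plain-Python sketch under stated assumptions: the logits are made up, and the `temperature` parameter is an addition not required by this handout.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Pick the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Draw an index at random according to the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical scores for a 4-character vocabulary
rng = random.Random(0)
print("greedy: ", greedy(logits))
print("sampled:", [sample(logits, rng=rng) for _ in range(10)])
```

Greedy decoding is deterministic and tends to fall into repetitive loops, while sampling trades some accuracy for variety.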

The high-level pseudocode for generation is:

```python
model.eval()
with torch.no_grad():
    context = "O God, O God!"
    tokenized_context = tokenize(context)
    # the model should implement a method to generate tokens given a prompt
    y = model.generate(tokenized_context, ...)
    completion = tokens_to_string(y)
```

**Optional**:
* Compute the [perplexity](https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72#:~:text=Intuitively%2C%20perplexity%20means%20to%20be,loss%20obtained%20from%20the%20model.) metric for quantitative evaluation.
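
For reference, perplexity over $N$ predicted characters is the exponential of the average negative log-likelihood, which is the mean cross-entropy loss when it is measured in nats. The symbols $p_\theta$, $x_i$ and $\mathcal{L}_{\mathrm{CE}}$ are notation introduced here:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right) = \exp\left(\mathcal{L}_{\mathrm{CE}}\right)
$$

A model that always assigns probability $1$ to the correct character reaches the minimum value $\mathrm{PPL} = 1$.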

### **Example Outputs**

The following are my outputs after $6000$ steps of training, with the seed string "O God, O God!"



```
O God, O God! neither? unto the base very ears,
As damned with it.

DUKE OF YORK:
Away! Once more, one word.

RICHARD:
Clove, dear so; and therein my son will be
false of woe: if ye seems to be the mother
Of gracious order this time when R going kinsperse eyes,
What dost bewreck her fairer drying tears.

NORTHUMBERLAND:
Have you forgot the Duke of Norfolk, get him to
again; and and agilic: there is my spirit
So maly did must such a marble perfection.

ELBOW:
Come, bring them with oaths, and so deliver
```


### Resources:

* Vaswani et al., "Attention is All You Need": [link](https://arxiv.org/abs/1706.03762)

* Illustrated Transformer by Jay Alammar: [link](https://jalammar.github.io/illustrated-transformer/)

* OpenAI GPT-2 Paper: [link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

* Deep Learning Course slides on transformers: [link](https://fleuret.org/dlc/materials/dlc-handout-13-3-transformers.pdf)

## Proposed Solution

## Step 1: Dataset

We implement the `CharDataset` class, which handles tokenization and serves (input, target) pairs; batching is left to the `DataLoader`.

In [None]:
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(set(data))  # vocabulary: every distinct character, in a stable order
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
        self.itos = {i: ch for i, ch in enumerate(chars)}  # index -> character
        self.data = [self.stoi[ch] for ch in data]  # the whole text encoded as integers
        self.block_size = block_size
        self.vocab_size = len(chars)

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx+self.block_size+1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y

    def get_vocab_size(self):
        return self.vocab_size


## Step 2: Transformer Model

The model is built on a causal self-attention mechanism, which we define first, together with the transformer block that wraps it.

In [None]:
import torch.nn as nn
import torch

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_size, num_heads, block_size):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=embed_size, num_heads=num_heads)
        self.proj = nn.Linear(embed_size, embed_size)
        self.dropout = nn.Dropout(0.1)
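        # Boolean mask: True above the diagonal marks future positions that may not be attended to.
        # nn.MultiheadAttention adds a float mask to the scores instead, so a 0/1 float mask hides nothing.
        self.register_buffer("mask", torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1))

    def forward(self, x):
        B, T, C = x.shape
        attn_mask = self.mask[:T, :T]  # registered buffers already live on the module's device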
        x = x.transpose(0, 1)  # (T, B, C)
        x, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = x.transpose(0, 1)  # (B, T, C)
        return self.dropout(self.proj(x))


class TransformerBlock(nn.Module):
    def __init__(self, embed_size, num_heads, ff_hidden_size, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.ln2 = nn.LayerNorm(embed_size)
        self.attn = CausalSelfAttention(embed_size, num_heads, block_size)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, ff_hidden_size),
            nn.ReLU(),
            nn.Linear(ff_hidden_size, embed_size),
            nn.Dropout(0.1),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

Let's create a Transformer model with a GPT structure.

In [None]:
class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, embed_size, num_heads, num_layers, ff_hidden_size):
        super().__init__()
        self.embed_size = embed_size
        self.tok_emb = nn.Embedding(vocab_size, embed_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, embed_size))
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_size, num_heads, ff_hidden_size, block_size)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(embed_size)
        self.head = nn.Linear(embed_size, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        token_embeddings = self.tok_emb(idx)
        position_embeddings = self.pos_emb[:, :T, :]
        x = token_embeddings + position_embeddings
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits

    @torch.no_grad()  # no gradients are needed while sampling
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.pos_emb.size(1):]
            logits = self(idx_cond)
            logits = logits[:, -1, :]
            probs = torch.nn.functional.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx

## Step 3: Training

Let's define the training loop.

In [None]:
from torch.utils.data import DataLoader
import torch.optim as optim

# Hyperparameters
block_size = 64
embed_size = 256
num_heads = 8
num_layers = 6
ff_hidden_size = 1024
batch_size = 64
learning_rate = 3e-4
epochs = 50

# Data Loading
with open("sonnets.txt", "r") as f:
    data = f.read()

dataset = CharDataset(data, block_size)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Model
model = GPT(
    vocab_size=dataset.get_vocab_size(),
    block_size=block_size,
    embed_size=embed_size,
    num_heads=num_heads,
    num_layers=num_layers,
    ff_hidden_size=ff_hidden_size,
).to("cuda")

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    model.train()
    for x, y in dataloader:
        x, y = x.to("cuda"), y.to("cuda")
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
    # Save a checkpoint every few epochs
    if (epoch + 1) % 10 == 0:
        torch.save(model.state_dict(), f"shakespeare_gpt_epoch{epoch+1}.pth")


# Save the final model
torch.save(model.state_dict(), "shakespeare_gpt.pth")


Epoch 1, Loss: 0.0483
Epoch 2, Loss: 0.0264
Epoch 3, Loss: 0.0216
Epoch 4, Loss: 0.0274
Epoch 5, Loss: 0.0243
Epoch 6, Loss: 0.0142
Epoch 7, Loss: 0.0088
Epoch 8, Loss: 0.0418
Epoch 9, Loss: 0.0659
Epoch 10, Loss: 0.0112
Epoch 11, Loss: 0.0228
Epoch 12, Loss: 0.0061
Epoch 13, Loss: 0.0022
Epoch 14, Loss: 0.0538
Epoch 15, Loss: 0.0027
Epoch 16, Loss: 0.0571
Epoch 17, Loss: 0.0010
Epoch 18, Loss: 0.0012
Epoch 19, Loss: 0.0466
Epoch 20, Loss: 0.0052
Epoch 21, Loss: 0.0002
Epoch 22, Loss: 0.0676
Epoch 23, Loss: 0.0009
Epoch 24, Loss: 0.0552
Epoch 25, Loss: 0.0151
Epoch 26, Loss: 0.0307
Epoch 27, Loss: 0.0508
Epoch 28, Loss: 0.0260
Epoch 29, Loss: 0.0242
Epoch 30, Loss: 0.0093
Epoch 31, Loss: 0.0456
Epoch 32, Loss: 0.0003
Epoch 33, Loss: 0.0433
Epoch 34, Loss: 0.0072
Epoch 35, Loss: 0.0040
Epoch 36, Loss: 0.0310
Epoch 37, Loss: 0.0144
Epoch 38, Loss: 0.0012
Epoch 39, Loss: 0.0319
Epoch 40, Loss: 0.0182
Epoch 41, Loss: 0.0123
Epoch 42, Loss: 0.0024
Epoch 43, Loss: 0.0441
Epoch 44, Loss: 0.0585
Epoch 45, Loss: 0.0552
Epoch 46, Loss: 0.0502
Epoch 47, Loss: 0.0652
Epoch 48, Loss: 0.0548
Epoch 49, Loss: 0.0444
Epoch 50, Loss: 0.0402

## Step 4: Generate Text

We use the trained model to generate 600 new characters from the seed string, sampling each character from the output distribution.

In [None]:
model.eval()
context = "O God, O God!"
idx = torch.tensor([dataset.stoi[ch] for ch in context], dtype=torch.long).unsqueeze(0).to("cuda")
generated = model.generate(idx, max_new_tokens=600)
print("".join([dataset.itos[i] for i in generated[0].tolist()]))


O God, O God! OOm O O
Om OOm OOm OOm OOm OOm OOm OOm OOm Om Om Onloud that his rarfl rank,
From find looks my appeachered shakes still.
Therefore my love, so precious his lie,
And almost tashion the face as shall showers,
Who alter than put'st doth sible doth stay?
And how I whose grief in goodlen of Praise,
Even pain on happy at all thy smight would bar,
Whose hadow my prise, and all wirest spect,
That then famiend beauty, and watery the pach
Our height him habiet his more will be rhymn,
Of me of his leases is just-be blest,
Shilives mauding his misuse antique char,
And he barth your war night beside hide

In [None]:
generated_text = "".join([dataset.itos[i] for i in generated[0].tolist()])

In [None]:
# Save the generated text to a file
output_file = "shakespeare_generated.txt"
with open(output_file, "w") as f:
    f.write(generated_text)

print(f"Generated text saved to {output_file}")

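## Step 5: Perplexity

Perplexity is the exponential of the average per-character cross-entropy, so it can be read as the effective number of characters the model hesitates between at each step. The ranges below are rough rules of thumb rather than fixed thresholds: they depend on the level of granularity (character-level, word-level) and on the sequence length. A character-level model that assigns uniform probabilities over a vocabulary of $V$ characters has perplexity $V$, so the upper ranges mostly apply to word-level models.

Low (1–10):
Indicates the model is highly accurate in predicting the next character.
A perplexity close to 1 means the model assigns very high probabilities to the correct sequence. This shows good generalization only when it is measured on held-out text; on the training data it can also reflect memorization.

Intermediate (10–100):
This is a typical range for word-level models on a complex dataset like Shakespeare's works, where values around 20–50 are often considered good for natural text generation tasks.
For a character-level model, a value in this range suggests there is still room for improvement.

High (>100):
Indicates the model struggles to capture dependencies in the dataset.
This could be due to undertraining, insufficient capacity (e.g., too few layers or small embedding dimensions), or an especially complex dataset.

Very High (>>1000):
Extremely high perplexity generally suggests the model is failing to learn the structure of the text, assigning nearly uniform probabilities to different options.
This is common in the early stages of training or when there is an issue with the data or model implementation.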
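
A small worked example makes these ranges more concrete. The sketch below computes perplexity directly from the probabilities assigned to the correct characters; the helper `perplexity`, the vocabulary size, and the probability values are all hypothetical.

```python
import math

def perplexity(probs_of_correct_char):
    """Perplexity from the probabilities the model assigned to each correct character."""
    nll = [-math.log(p) for p in probs_of_correct_char]
    return math.exp(sum(nll) / len(nll))

V = 65                                    # hypothetical vocabulary size
print(perplexity([1 / V] * 100))          # uniform guessing gives a perplexity of V
print(perplexity([0.9, 0.8, 0.95, 0.7]))  # confident, mostly correct predictions
```

Uniform guessing therefore sets a natural reference point: a trained character-level model should end up well below its vocabulary size.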

In [None]:
import math

def compute_perplexity(model, dataloader, device="cuda"):
    model.eval()
    total_loss = 0
    total_count = 0
    criterion = nn.CrossEntropyLoss(reduction="sum")  # Sum to compute total log-likelihood

    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            logits = model(x)  # (B, T, vocab_size)
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item()
            total_count += y.numel()

    avg_loss = total_loss / total_count
    perplexity = math.exp(avg_loss)
    return perplexity

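# Compute perplexity. Note: this loader reuses the training dataset, so the score is not a
# held-out estimate; build the loader from a separate validation split for that.
val_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)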
perplexity = compute_perplexity(model, val_dataloader, device="cuda")
print(f"Validation Perplexity: {perplexity:.2f}")


Validation Perplexity: 10.01
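
The loader above is built from the same `dataset` that was used for training, so a held-out estimate needs a separate split. One simple option, sketched here under assumptions, is to reserve the tail of the text; `split_text`, the 10% fraction, and the sample corpus are hypothetical.

```python
def split_text(text, val_fraction=0.1):
    """Split raw text into a training part and a held-out tail."""
    n_val = int(len(text) * val_fraction)
    return text[:len(text) - n_val], text[len(text) - n_val:]

text = "Shall I compare thee to a summer's day? " * 50  # hypothetical stand-in corpus
train_text, val_text = split_text(text)

# Build the vocabulary from the full text so every validation character has an index
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
train_ids = [stoi[ch] for ch in train_text]
val_ids = [stoi[ch] for ch in val_text]
print(len(train_ids), len(val_ids))
```

Because `CharDataset` builds its vocabulary from the text it receives, it would need to accept a shared `stoi` (or be given the full text plus index bounds) before the two splits can use the same token ids.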