# NLP Lab: Language Models

In this lab, we will build the main components of the GPT-2 model and train a small model on poems by Victor Hugo.

The questions are included in this notebook. To run the training, you will need to modify the `gpt_single_head.py` file, which is also available in the Git repository.

## Data

The training data consists of a collection of poems by Victor Hugo, sourced from [gutenberg.org](https://www.gutenberg.org/). The dataset is available in the `data` directory.

To reduce model complexity, we will model the text at the character level. Typically, language models process sequences of subwords using [tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) such as BPE, SentencePiece, or WordPiece.

#### Questions:
- Using [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter), display the number of unique characters in the text and the frequency of each character.

In [2]:
import collections

with open('hugo_contemplations.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f'Number of characters in the file: {len(text)}')
##  YOUR CODE HERE
counter = collections.Counter(text)

chars = list(counter.keys())
###

print(f'Number of characters in counter: {sum(counter.values())}')
print(f'{len(chars)} different characters')
print(counter)


Number of characters in the file: 285222
Number of characters in counter: 285222
101 different characters
Counter({' ': 49127, 'e': 30253, 's': 17987, 'u': 14254, 'r': 14223, 't': 14071, 'a': 14048, 'n': 13725, 'i': 12828, 'o': 12653, 'l': 11638, '\n': 8102, 'm': 6495, 'd': 6375, ',': 6077, 'c': 5074, 'p': 4206, "'": 3820, 'v': 3492, 'é': 2943, 'b': 2783, 'f': 2772, 'h': 2221, 'q': 1956, 'g': 1790, '.': 1420, 'x': 1154, 'L': 1147, '!': 1121, 'E': 1074, ';': 1043, '-': 1020, 'j': 890, 'D': 764, 'è': 725, 'à': 706, 'y': 660, 'I': 627, 'ê': 605, 'C': 593, 'S': 545, 'A': 530, 'Q': 503, 'z': 482, 'J': 471, 'O': 450, 'T': 441, 'P': 435, '?': 388, 'V': 383, 'â': 381, 'N': 362, 'M': 344, 'ù': 298, ':': 294, 'R': 240, 'î': 214, 'U': 208, 'ô': 159, 'X': 150, '1': 146, 'H': 116, 'F': 114, '5': 111, '8': 93, 'B': 78, '«': 74, 'É': 70, '»': 69, 'G': 67, '4': 64, 'û': 62, '3': 47, 'ç': 34, 'À': 33, 'ë': 32, 'ï': 31, '2': 30, '·': 26, 'Ê': 24, '6': 23, '7': 23, 'Ô': 19, '9': 19, 'È': 11, 'k': 10, '0':

### Encoding / Decoding  

To transform the text into a vector for the neural network, each character must be encoded as an integer.  

The following functions perform the encoding and decoding of characters:

In [7]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: transform a string into a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: transform a list of integers into a string


# test that your encoder/decoder is coherent
testString = "\nDemain, dès l'aube"
assert decode(encode (testString)) ==  testString

### Train/Validation Split  

Since the goal is to predict poems, the lines should not be shuffled randomly. Instead, we must preserve the order of the lines in the text and take only the first 90% for training, while using the remaining 10% to monitor learning.  

#### Questions:  
- Split the data into `train_data` (90%) and `val_data` (10%) using slicing on the dataset.

In [4]:
!pip install torch


In [8]:
import torch
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
## YOUR CODE HERE
split_idx = int(0.9 * len(data))

train_data = data[:split_idx]  
val_data = data[split_idx:]    

print(f'Training data size: {len(train_data)}')
print(f'Validation data size: {len(val_data)}')

###

Training data size: 256699
Validation data size: 28523


### Context  

The language model has a parameter that defines the maximum context size to consider when predicting the next character. This context is called `block_size`. The training data consists of sequences of consecutive characters, randomly sampled from the training set, with a length of `block_size`.  

If the sequence starts at index `i`, then the context sequence is:  
```python
x = data[i:i+block_size]
```
And the target value to predict at each position in the context is the next character:  
```python
y = data[i+1:i+block_size+1]
```



In [9]:
block_size = 8

# sample the start index within train_data so that x and y are always full-length
i = torch.randint(len(train_data) - block_size, (1,))
print(i)
x = train_data[i:i+block_size]
y = train_data[i+1:i+1+block_size]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print (f'context is >{decode(context.tolist())}< target is >{decode([target.tolist()])}<')

tensor([255286])
context is >s< target is > <
context is >s < target is >p<
context is >s p< target is >e<
context is >s pe< target is >n<
context is >s pen< target is >d<
context is >s pend< target is >r<
context is >s pendr< target is >e<
context is >s pendre< target is >,<


### Defining Batches  

The training batches consist of multiple character sequences randomly sampled from `train_data`. To randomly select a sequence for the batch, we need to randomly pick a starting point in `train_data` and extract the following `block_size` characters. When selecting the starting point, ensure that there are enough characters remaining after it to form a full sequence of `block_size` characters.  

#### Questions:  
- Create the batch `x` by selecting `batch_size` sequences of length `block_size`, each starting at a randomly chosen index `i`. Stack the examples using `torch.stack`.  
- Create the batch `y` of targets: for each sequence in `x`, take the same sequence shifted by one character (`data[i+1:i+block_size+1]`). Stack the examples using `torch.stack`.


In [131]:
batch_size = 4
torch.manual_seed(2023)
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ## YOUR CODE HERE
    # select batch_size starting points in the data, store them in a list called starting_points
    starting_points = torch.randint(len(data) - block_size, (batch_size,))
    # x holds the sequences of integers of length block_size beginning at each starting point
    x = []
    y = []
    # Generate sequences and targets
    for i in starting_points:
        x.append(data[i:i+block_size])
        y.append(data[i+1:i+block_size+1])

    # Stack the sequences and targets into tensors
    x = torch.stack(x)
    y = torch.stack(y)
    # send data and target to device
    x, y = x.to(device), y.to(device)
    return x, y

### First Model: A Bigram Model  

The first model we will implement is a bigram model. It predicts the next character based only on the current character. This model can be stored in a simple matrix: for each character (row), we store a score (logit) for every possible next character (column), and applying a softmax to a row gives the probability distribution over the next character. This can be implemented using a simple [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer in PyTorch.  
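To see why a single table is enough for a bigram model, here is a minimal sketch that fills such a table by counting character pairs in a toy string, rather than learning it by gradient descent as the lab does. The string `toy_text` and the helper `next_char_distribution` are made up for this illustration.

```python
from collections import Counter, defaultdict

toy_text = "abracadabra"  # hypothetical toy corpus, not the Hugo dataset

# counts[current][nxt] = number of times `nxt` follows `current`
counts = defaultdict(Counter)
for current, nxt in zip(toy_text, toy_text[1:]):
    counts[current][nxt] += 1


def next_char_distribution(current):
    """Normalize one row of the count table into probabilities."""
    total = sum(counts[current].values())
    return {ch: n / total for ch, n in counts[current].items()}


print(next_char_distribution("a"))
```

Each row of the table depends only on the current character, which is exactly the limitation of the bigram model: no earlier context can influence the prediction.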

#### Questions:  
- In the constructor, define an Embedding layer of size `vocab_size × vocab_size`.  
- In the `forward` method, apply the embedding layer to the batch of indices (`x`).  
- In the `forward` method, define the loss as `cross_entropy` between the predictions and the target (`y`).


In [132]:
import torch.nn as nn

# use a gpu if we have one
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # we use a simple vocab_size times vocab_size tensor to store the probabilities 
        # of each token given a single token as context in nn.Embedding
        # YOUR CODE HERE
        self.embeddings = nn.Embedding(vocab_size, vocab_size)
        ## 
        
    def forward(self, idx, targets=None):

        # idx and targets are both (Batch,Time) tensor of integers
        # YOUR CODE HERE
        logits = self.embeddings(idx)  # Shape: (Batch, Time, Vocab_size)
        ## 
   
        # don't compute loss if we don't have targets
        if targets is None:
            loss = None
        else:
            # change the shape of the logits and target to match what is needed for CrossEntropyLoss
            # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
            Batch, Time, Channels = logits.shape
            logits = logits.view(Batch*Time, Channels)
            targets = targets.view(Batch*Time)
            
            # negative log likelihood between prediction and target
            # YOUR CODE HERE
            loss = nn.functional.cross_entropy(logits, targets)
            ## 

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

vocab_size = len(chars)  # 101 for this dataset
model = BigramLanguageModel(vocab_size)
# send the model to device
m = model.to(device)

### Model Before Training  

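At this stage, the model has not yet been trained; it has only been initialized. However, we can already compute the loss on a random batch. If the model predicted a uniform distribution over the characters (maximal entropy), the loss would be exactly `-ln(1/vocab_size)`. Since the embedding weights are initialized with a normal distribution \( N(0,1) \) for each dimension, the predicted distribution is only roughly uniform, so the loss after initialization should be of the same order as `-ln(1/vocab_size)` but somewhat higher.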
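The following sketch compares the two situations numerically in plain Python, without PyTorch: cross-entropy with perfectly uniform logits, and the average cross-entropy when the logits are drawn from a standard normal distribution. The helper `cross_entropy`, the seed, and the number of samples are choices made for this illustration only.

```python
import math
import random

vocab_size = 101
random.seed(0)  # arbitrary seed, only to make this sketch reproducible


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Perfectly uniform prediction: all logits are equal.
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)

# Logits drawn from N(0, 1), mimicking a freshly initialized embedding table.
n_samples = 2000
random_losses = [
    cross_entropy(
        [random.gauss(0.0, 1.0) for _ in range(vocab_size)],
        target=random.randrange(vocab_size),
    )
    for _ in range(n_samples)
]
mean_random_loss = sum(random_losses) / n_samples

print(f"uniform logits: {uniform_loss:.4f}")
print(f"N(0,1) logits : {mean_random_loss:.4f}")
```

The first value equals `ln(vocab_size)`; the second is larger on average, which is consistent with a computed loss slightly above the uniform reference right after initialization.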

In [148]:
import math
xb, yb = get_batch('train')
logits, loss = m(xb, yb)
print (logits.shape)
print (f'Expected loss {-math.log(1.0/vocab_size)}')
print (f'Computed loss {loss}')

torch.Size([32, 101])
Expected loss 4.61512051684126
Computed loss 4.9704203605651855


### Using the Model for Prediction  

To use the model for prediction, we need to provide an initial character to start the sequence—this is called the prompt. In our case, we can initialize the generation with the newline character (`\n`) to start a new sentence.  

#### Questions:  
- Create a prompt as a tensor of size `(1,1)` containing the integer corresponding to the character `\n`.  
- Generate a sequence of 100 characters from this prompt using the functions `m.generate` and `decode`.  
- How does the generated sentence look?

In [149]:
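print(encode('\n'))
## YOUR CODE HERE
# prompt of shape (1, 1) holding the index of '\n', on the same device as the model
prompt = torch.tensor(encode('\n'), dtype=torch.long, device=device).reshape(1, 1)
# generate 100 new characters and decode each sequence of the batch
generated_text = [decode(seq.tolist()) for seq in m.generate(prompt, 100)]
print(generated_text)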
###

[3]
['\n?(»»TZUàyÉ,nFÎHàâçPAË8B3d-IïfP:3daèP)·4,CWOpÈS1ûFwÀaô.7Kkè;axlÆêëmr8JQla-QnÈF,mé4çM T8ISu!gj(ê-èTrSï']


### Training  

For training, we use the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer with a learning rate of `1e-3`. Each training iteration consists of the following steps:  

- Generate a batch  
- Apply the neural network (forward pass) and compute the loss: `model(xb, yb)`  
- Reset the accumulated gradients, then compute the new gradients: `optimizer.zero_grad()` followed by `loss.backward()`  
- Update the parameters: `optimizer.step()`  

In [156]:
max_iters = 100
batch_size = 4
eval_interval = 10
learning_rate = 1e-3
eval_iters = 20

@torch.no_grad() # no gradient is computed here
def estimate_loss():
    """ Estimate the loss on eval_iters batch of train and val sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)  # already of shape (Batch, Time)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# re-create the model
model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


step 0: train loss 4.8549, val loss 4.7953
step 10: train loss 4.7633, val loss 4.8356
step 20: train loss 4.8049, val loss 4.9163
step 30: train loss 4.7362, val loss 4.8654
step 40: train loss 4.8355, val loss 4.8563
step 50: train loss 4.7630, val loss 4.7572
step 60: train loss 4.7536, val loss 4.8830
step 70: train loss 4.7597, val loss 4.8103
step 80: train loss 4.7719, val loss 4.8172
step 90: train loss 4.7716, val loss 4.7750


Once the network has been trained for 100 iterations, we can generate a sequence of characters.  

#### Questions:  
- What is the effect of training?  
- Increase the number of iterations to 1,000 and then to 10,000. Note the obtained loss and the generated sentence. What do you observe?

100 iterations : train loss 4.7716, val loss 4.7750 \
generated sequence : A5_3éSXGtfrùBÂPA[ëâQ2MRdÎéaùXGseïeu9?A(6,??xhMWw eVc6s0ï0ScUeÂf»8Ê38gCEDGÔ LD7·bG»9]TÎBÊfk«_QXxzÎ1T] \
1000 iterations : train loss 4.3266, val loss 4.1426 \
generated sequence : W8ëgrDZE4f
U»dvÈcD0pÎI!G-T4»cAbOxâ!bE ï95E4HuYÈ·eÎkap9Fsy'(WûG2onQôk
fï[·eëxzYI3J8jam-TÆV3J«52[AêvPî \
10000 iterations : train loss 2.4387, val loss 2.5172 \
generated sequence : St-'hez joeu  mTetre deur ber, têt die  ns s toucancerileujN5
ESÆwVRx.»Urieuncet l'itte ce s d,
4zo

Training for 10,000 iterations improves performance: the generated sequences look more like words, with alternating vowels and consonants, and contain fewer digits and fewer letters with unusual accents. However, the generated text remains meaningless.

In [157]:
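# prompt of shape (1, 1) holding the index of '\n' (3 here), on the same device as the model
idx = torch.full((1, 1), stoi['\n'], dtype=torch.long, device=device)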
print (decode(m.generate(idx, max_new_tokens=100)[0].tolist()))


A5_3éSXGtfrùBÂPA[ëâQ2MRdÎéaùXGseïeu9?A(6,??xhMWw eVc6s0ï0ScUeÂf»8Ê38gCEDGÔ LD7·bG»9]TÎBÊfk«_QXxzÎ1T]


## Single Head Attention  

We will now implement the basic attention mechanism. For each pair of positions (here, characters) in the sequence, this mechanism combines:  
- **Q** (*query*): what the current position is looking for,  
- **K** (*key*): what each position offers, which is compared with the queries to compute the attention weights,  
- **V** (*value*): the content of each position, which is averaged using the attention weights to produce the output.  

![single head attention](images/single_head_attention.png)  

### Masking  

However, since we are using the model to generate sequences, we must not use characters that come after the current character—these are precisely the characters we aim to predict during training. *The future should not be used to predict the future.*  

To enforce this constraint, we integrate a **masking matrix** into the process. This matrix ensures that:  
- For the first character in the sequence, only that character is available for prediction (no context).  
- For the second character, only the first and second characters can be used.  
- For the third character, only the first three characters are accessible, and so on.  

This results in a **lower triangular matrix**, where each row is normalized (rows sum to 1).

In [158]:
T = 8

# first version of the constraints with matrix multiplication
# create a lower triangular matrix
weights0 = torch.tril(torch.ones(T,T))
# normalize each row
weights0 = weights0 / weights0.sum(1, keepdim=True) 
print (weights0)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


The [`softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html) function provides another way to achieve normalization.  

#### Question:  
- Verify that applying `softmax` results in the same lower triangular matrix.

In [159]:
tril = torch.tril(torch.ones(T,T))
weights = torch.tril(torch.ones(T,T))
weights = weights.masked_fill(tril== 0, float('-inf'))
weights = nn.functional.softmax(weights, dim=-1)
print (weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


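We get the same matrix: `exp(-inf) = 0` removes the masked positions, and the remaining equal scores receive equal weights.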
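To illustrate what this matrix does once it is applied to data, the sketch below rebuilds the masked softmax in plain Python and multiplies it with a toy sequence. The values in `sequence` and the helper `softmax` are made up for this example; the lab itself uses `torch.tril`, `masked_fill` and `nn.functional.softmax`.

```python
import math

T = 4
sequence = [2.0, 4.0, 6.0, 8.0]  # hypothetical toy values, one per position


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Scores are equal on and below the diagonal, and -inf above it (the future).
scores = [[1.0 if j <= i else float("-inf") for j in range(T)] for i in range(T)]
weights = [softmax(row) for row in scores]

# Each output position is the weighted sum of the positions it is allowed to see.
averaged = [sum(w * v for w, v in zip(row, sequence)) for row in weights]
print(averaged)  # running means of the sequence: 2, 3, 4, 5 (up to rounding)
```

With uniform scores, the mask turns attention into a running average of the past. In the real attention layer the scores come from the queries and keys, so the average becomes a learned, data-dependent weighting.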

### Implementation  

We can now implement the attention layer based on the following formula:  

![attention_formula](images/attention_formula.png)  

This involves computing the **queries (Q)**, **keys (K)**, and **values (V)**, applying the **masking mechanism**, and using the **softmax function** to normalize the attention scores before computing the weighted sum of values.

#### Questions:  

- Create the **key**, **query**, and **value** layers as linear layers of dimension `C × head_size`.  
- Apply these layers to `x`.  
- Compute the attention weights:  
  ```python
  weights = query @ key.transpose(-2, -1)
  ```
  (Transpose the second and third dimensions of `key` to enable matrix multiplication).  
- Apply the **normalization factor** (typically, divide by `sqrt(head_size)`).  
- Apply the **triangular mask** and the **softmax** function to `weights`.  
- Apply the **value** layer to `x`.  
- Compute the final output:  
  ```python
  out = weights @ value(x)
  ```
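The normalization factor in the list above can be motivated numerically. The sketch below assumes that the components of the queries and keys are independent with unit variance (an idealization, not something guaranteed in the trained model) and estimates the variance of their dot product with and without the division by `sqrt(head_size)`. The seed and number of trials are arbitrary.

```python
import math
import random

random.seed(0)  # arbitrary seed for reproducibility
head_size = 16
n_trials = 5000


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


raw_scores, scaled_scores = [], []
for _ in range(n_trials):
    q = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    k = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    dot = sum(qi * ki for qi, ki in zip(q, k))
    raw_scores.append(dot)
    scaled_scores.append(dot / math.sqrt(head_size))

print(f"variance of q.k                  : {variance(raw_scores):.2f}")
print(f"variance of q.k / sqrt(head_size): {variance(scaled_scores):.2f}")
```

Under these assumptions the unscaled scores have a variance close to `head_size`, while the scaled ones stay close to 1. Keeping the scores in that range prevents the softmax from concentrating almost all the weight on a single position at initialization.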

In [170]:
head_size = 16
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)
## YOUR CODE HERE
# define the Key layer
key = nn.Linear(in_features=C, out_features=head_size)
# define the Query layer
query = nn.Linear(in_features=C, out_features=head_size)
# define the Value layer
value = nn.Linear(in_features=C, out_features=head_size)
# apply each layer to the input
k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
v = value(x)  # (B, T, head_size)
# compute the scaled product between Q and K
# (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
weights = q @ k.transpose(-2, -1) / head_size ** 0.5
# apply the mask (lower triangular matrix) to hide the future positions
tril = torch.tril(torch.ones(T, T))
weights = weights.masked_fill(tril == 0, float('-inf'))
# apply the softmax over the last dimension so that each row sums to 1
weights = nn.functional.softmax(weights, dim=-1)
###
out = weights @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

# print the result
print(weights[0])
print(weights[0].sum(dim=-1))  # every row should sum to 1
print(out[0])

### Questions:  

- Copy your code into `gpt_single_head.py`:  
  - Define the **key**, **query**, and **value** layers in the **constructor** of the `Head` class.  
  - Implement the **computations** in the `forward` function.  
- Train the model.  
- What are the **training** and **validation** losses?  
- Does the generated text appear **better** compared to the previous model?

In [7]:
%run gpt_single_head.py

0.009893 M parameters
step 0: train loss 4.6812, val loss 4.6833
step 500: train loss 2.7358, val loss 2.8455
step 1000: train loss 2.4925, val loss 2.5880
step 1500: train loss 2.4397, val loss 2.5376
step 2000: train loss 2.3942, val loss 2.5404
step 2500: train loss 2.3766, val loss 2.5541
step 3000: train loss 2.3624, val loss 2.5050
step 3500: train loss 2.3424, val loss 2.4783
step 4000: train loss 2.3457, val loss 2.3882
step 4500: train loss 2.3352, val loss 2.4472
step 4999: train loss 2.3324, val loss 2.4422

L'ouesages honan n l laiver me! he jene, ces as à-êle ves les cèrat-t s me  l'anntomgiesumes mye is che.
.
 hens me cha cer dountre, cavis, sachaintarcorile reumait les, et decoru eques mes daient chaîvou ouoisey: ffoudeut,  le lai ge Uu darse te, dansomdaint en;
IENe  s'heuèr d'ont, strêteursoun jauitosonernonnsedite veuxtaivit fan cesceux;

Ohuveaît mans hants st mbre Pa lais st ouvre e fource ce Rêterivand'oruL'ame maurspel qu'rpoutét, brsonus apauries oriembetteursi 

The loss is lower than with the previous model, and the generated text is better (no unusual characters or digits), even though it is still meaningless.

## Multi-Head Attention  

Multi-head attention is simply the parallel computation of multiple **single-head attention** mechanisms. The outputs of the **single-head attention** modules are concatenated to form the output of the **multi-head attention** module. In the illustration from the original Transformer paper (*Attention Is All You Need*), the number of heads in the **multi-head attention** is denoted as `h`.  

To allow for **weighted combinations** of each single-head attention output, a **linear transformation layer** is added after concatenation.  

![multi head attention](images/multi_head_attention.png)  

#### Questions:  

- In the **constructor**, create a list containing `num_heads` instances of the `Head` module using PyTorch’s [`ModuleList`](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html).  
- In the `forward` function:  
  - Apply each **single-head attention** to the input.  
  - Concatenate the results using PyTorch’s [`cat`](https://pytorch.org/docs/stable/generated/torch.cat.html) function.

In [9]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        ## YOUR CODE HERE
        ## list of num_heads modules of type Head
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(in_features=num_heads * head_size, out_features=n_embd)
        ###
        
    def forward(self, x):
        ## YOUR CODE HERE
        ## apply each head in self.heads to x and concat the results 
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)  # has shape: (B, T, n_embd) 
        return out


#### Questions:  

1. **Copy** the file `gpt_single_head.py` and rename it as `gpt_multi_head.py`.  
2. **Add** the `MultiHeadAttention` module in `gpt_multi_head.py`.  
3. At the **beginning of the file**, add a parameter:  
   ```python
   n_head = 4
   ```
4. In the `BigramLanguageModel` module, **replace** the `Head` module with `MultiHeadAttention`, using the parameters:  
   ```python
   num_heads = n_head
   head_size = n_embd // n_head
   ```
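   This keeps the number of parameters in the attention heads **the same**. The projection layer of `MultiHeadAttention` adds `n_embd * n_embd + n_embd` parameters on top of that, which is why the run in this notebook reports 0.010949 M parameters.  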
5. **Retrain the model** and note:  
   - The total number of **parameters**  
   - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.009893 M parameters  
step 4999: train loss 2.1570, val loss 2.1802  
```
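The parameter counts can be checked by hand. The sketch below assumes a particular layout for the small model: token and position embeddings, key/query/value layers without bias, and a linear output layer with bias. The file `gpt_single_head.py` is not shown here, so this layout is an assumption; it is used because it reproduces the 0.009893 M parameters reported for the single-head model and the 0.010949 M reported by the multi-head run in this notebook. The helper names are made up for the illustration.

```python
# Hypothetical layout of the small model; the real gpt_single_head.py may differ.
vocab_size, block_size, n_embd, n_head = 101, 8, 32, 4


def linear(n_in, n_out, bias=True):
    """Number of parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Parts shared by both variants: token/position embeddings and the output layer.
shared = vocab_size * n_embd + block_size * n_embd + linear(n_embd, vocab_size)


def head(head_size):
    # key, query and value layers, assumed to have no bias
    return 3 * linear(n_embd, head_size, bias=False)


single_head = shared + head(n_embd)
heads_only = shared + n_head * head(n_embd // n_head)
multi_head_with_proj = heads_only + linear(n_embd, n_embd)

print(single_head, heads_only, multi_head_with_proj)
```

Splitting one head of size `n_embd` into `n_head` heads of size `n_embd // n_head` leaves the count unchanged; only the projection after the concatenation adds parameters.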

In [10]:
%run gpt_multi_head.py

0.010949 M parameters
step 0: train loss 4.5778, val loss 4.5397
step 500: train loss 2.5540, val loss 2.5864
step 1000: train loss 2.3802, val loss 2.3900
step 1500: train loss 2.3149, val loss 2.3378
step 2000: train loss 2.2636, val loss 2.2716
step 2500: train loss 2.2354, val loss 2.2548
step 3000: train loss 2.2079, val loss 2.2419
step 3500: train loss 2.1835, val loss 2.2246
step 4000: train loss 2.1798, val loss 2.1832
step 4500: train loss 2.1558, val loss 2.1662
step 4999: train loss 2.1561, val loss 2.1703

   
     Le lèrans; empandu den sans lEt l'ombes aflle nes mas,
Lansirite dantest jeux, sazongez joux,
       D'orge aile vouxt lus ue  pasr combre enins es'our nui des cromme un le qu'uné,
L'enue pror, ma voits de tages,
Le
Sont l'irangires de suriendss-t ces fêmes ala desclome l'anbrat l'ende rutommen tes se gre, su plis allau:-M-e chotrê void, e  gre de denouce fla pourièrie lessaproil;
Lu leiss,
Caêtre?
Jaisontore,
Jeux part;
Ent rait farêbraîle;
Es l'une;
Su, me  Na

## Adding a FeedForward Computation Layer  

After the **attention layers**, which collect information from the sequence, a **computation layer** is added to combine all the gathered information.  

This layer is a simple **Multi-Layer Perceptron (MLP)** with:  
- One **hidden layer**,  
- A **ReLU non-linearity** using [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html).  

### Architecture:  

<img src="images/multi_ff.png" alt="multi feedforward" width="200">


In [11]:
class FeedForward(nn.Module):
    """ a simple MLP with RELU """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

#### Questions:  

1. **Add** the `FeedForward` module to your `gpt_multi_head.py` file.  
2. **Integrate** this `FeedForward` layer **after** the **multi-head attention** module.  
3. **Retrain the model** and note:  
   - The total **number of parameters**  
   - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.010949 M parameters  
step 4999: train loss 2.1290, val loss 2.1216  
```

In [13]:
%run gpt_multi_head.py

0.012005 M parameters
step 0: train loss 4.6450, val loss 4.6300
step 500: train loss 2.5354, val loss 2.5274
step 1000: train loss 2.3648, val loss 2.3618
step 1500: train loss 2.2954, val loss 2.2953
step 2000: train loss 2.2709, val loss 2.2565
step 2500: train loss 2.2336, val loss 2.2737
step 3000: train loss 2.2109, val loss 2.2342
step 3500: train loss 2.1778, val loss 2.1923
step 4000: train loss 2.1647, val loss 2.1576
step 4500: train loss 2.1413, val loss 2.1632
step 4999: train loss 2.1229, val loss 2.1394

         Vous as dortière autié deux deven-je el boncore quau du sarces sontres, dait la rit l'armens yhétô là jachère;
S'hous l'ute,
Nuie bait la porgeux, learte, sra quut vongez leulempe sour te etténonte,
         Qua bri,



En sarve
Sen querdeine
Pe pachouffIilges, juréla bens fun torme où fommaclit cola pil flaien gaveveille ditanfre,
Et ne saitre!
XVQuans la eule mibla de sombandis du ruit sora vi, et me, mavosseurqu'echangella fle et la cre
       Sa ginouyon le 

## Stacking Blocks  

The network we have built so far represents just **one block** of the final model. Now, we can **stack multiple blocks** of **multi-head attention** to create a **deeper** network.  

### Architecture:  
![multi feedforward](images/multi_bloc.png)  

The following code defines a **block**:  


In [14]:
class Block(nn.Module):
    """ A single block: multi-head attention followed by a feed-forward layer """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

#### Questions:  

- Add the `Block` module to `gpt_multi_head.py`.  
- Modify the `BigramLanguageModel` code to include **three** instances of `Block(n_embd, n_head=4)`, using a [`Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) container **instead of** `MultiHeadAttention` and `FeedForward`.  
- Retrain the model and note:  
  - The **number of parameters**  
  - The **training** and **validation** losses obtained  

**Expected Output Example:**  
```
0.019205 M parameters  
step 4999: train loss 2.2080, val loss 2.2213  
```

In [15]:
%run gpt_multi_head.py

0.025541 M parameters
step 0: train loss 4.6601, val loss 4.6732
step 500: train loss 3.1763, val loss 3.1895
step 1000: train loss 2.9343, val loss 2.8924
step 1500: train loss 2.8885, val loss 2.8657
step 2000: train loss 2.7241, val loss 2.7018
step 2500: train loss 2.6437, val loss 2.6216
step 3000: train loss 2.5328, val loss 2.5382
step 3500: train loss 2.4693, val loss 2.4731
step 4000: train loss 2.4189, val loss 2.4316
step 4500: train loss 2.3884, val loss 2.4140
step 4999: train loss 2.3552, val loss 2.3640



»Lur:s
Je noun le pourt lu sun ettuaur bErhen carbos lecabConserers v'houmréindens i froncoir ver dedt vorres costte! sre pans sens gagme: panits quu de darme elboure domppre êrecrrhh'el cette-utd, jerets bopt nourse, carmabeu, Mi! soirrixe fonsres élaur embhèpome onule cee soiromamen.
E taiad en l''aucêtteu nous ses quu taui le manse
êrr t'er s canuunsé en daix pen floute;
L''ensmeuve cogômbense non quu virme m'lâllan nlus vaiucée silré fis quu réous son damte prignét

## Improving Training  

If we want to continue increasing the **network size**, we need to incorporate layers that **enhance training stability** and **improve generalization** (reducing overfitting). These layers include:  

- **Skip connections** (or **residual connections**)  
- **Normalization layers**  
- **Dropout**  

### Updated Architecture:  

<img src="images/multi_skip_norm.png" alt="multi feedforward" width="200">

---

#### Questions:  

1. In the `Block` module, **add a skip connection** around each sub-layer by adding the sub-layer's input to its output:  
   ```python
   x = x + self.sa(self.ln1(x))
   x = x + self.ffwd(self.ln2(x))
   ```  
   
2. In the `Block` module, **add two** [`LayerNorm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) layers of size `n_embd`:  
   - **Before** the `Multi-Head Attention` layer.  
   - **Before** the `FeedForward` layer.  

3. **After the sequence of 3 blocks**, add a **LayerNorm** layer of size `n_embd`.  

4. Define a variable at the **beginning of the file**:  
   ```python
   dropout = 0.2
   ```
   Then add a [`Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) layer:  
   - **After** the `ReLU` activation in `FeedForward`.  
   - **After** the `Multi-Head Attention` layer in `MultiHeadAttention`.  
   - **After** the `softmax` layer in the single-head attention `Head`.  

5. **Retrain the model** and note:  
   - The **number of parameters**  
   - The **training** and **validation** losses  
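As a rough illustration of steps 1 and 2, the sketch below applies layer normalization and a skip connection to a single feature vector in plain Python. It leaves out the learned scale and shift of `nn.LayerNorm`, and `toy_sublayer` is a made-up stand-in for `self.sa` or `self.ffwd`; the input values are arbitrary.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    # Hypothetical stand-in for self.sa or self.ffwd: halves every component.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]  # features of one token (toy values)
normed = layer_norm(x)
out = [xi + si for xi, si in zip(x, toy_sublayer(normed))]  # x + f(LN(x))

print(normed)
print(out)
```

The sub-layer only sees the normalized input, while the original `x` is carried through unchanged by the addition. This gives gradients a direct path through the stack of blocks.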

---

**Expected Output Example:**  
```
0.019653 M parameters  
```

In [16]:
%run gpt_multi_head.py

0.025989 M parameters
step 0: train loss 4.7928, val loss 4.8084
step 500: train loss 2.4509, val loss 2.4351
step 1000: train loss 2.3046, val loss 2.3366
step 1500: train loss 2.2310, val loss 2.2343
step 2000: train loss 2.1857, val loss 2.2111
step 2500: train loss 2.1638, val loss 2.1709
step 3000: train loss 2.1453, val loss 2.1205
step 3500: train loss 2.1199, val loss 2.1238
step 4000: train loss 2.1086, val loss 2.1078
step 4500: train loss 2.0827, val loss 2.0781
step 4999: train loss 2.0777, val loss 2.0804

          ge rabe chanceraien ec ensur dèce à au dous luis cetéjousée   quiest la coutet nles con en de re cur léons sessouffout maierempl'échazant manstre ill, je te l'oalte 14 rant, cour hes pourabrit roil! metéss, latéthirlee loufret;
          Nous respâmer sandechapraituer pourflent, le li annend âme, usénoue dastre et-mosse cénrouté,
«De,
Le ces poustondu dinchurs d'oufiet que dièseone aileitesvoy,
Oordu mes ve,
Aurdurre Sanne Qua ne  immeu  vouffrance,
Tousroi, d'

## Conclusion  

The key components of **GPT-2** are now in place. The next step is to **scale up** the model and train it on a **much larger** dataset. For comparison, the parameters of [GPT-2](https://huggingface.co/transformers/v2.11.0/model_doc/gpt2.html) are:  

- **`vocab_size = 50257`** → GPT-2 models **subword tokens**, while we model **characters**. For us, `vocab_size = 101`.  
- **`n_positions = 1024`** → The maximum **context size**. For us, it's `block_size = 8`.  
- **`n_embd = 768`** → The **embedding dimension**. For us, it's `n_embd = 32`.  
- **`n_layer = 12`** → The number of **blocks**. For us, it's `3`.  
- **`n_head = 12`** → The number of **attention heads** in each multi-head attention layer. For us, it's `4`.  

These values correspond to the smallest **GPT-2** model (about 124 million parameters). The largest version consists of **1.5 billion parameters**. GPT-2 was trained on **8 million web pages**, totaling **40 GB of text**.  
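As a cross-check of the hyperparameters listed above, the sketch below counts the parameters they imply. It assumes the standard GPT-2 block layout, which is not detailed in this lab: a fused query/key/value projection with bias, an MLP that expands to `4 * n_embd`, two LayerNorms per block plus a final one, and an output layer that shares its weights with the token embeddings. Note that this MLP is wider than the `FeedForward` module used in this lab.

```python
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12


def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias


layer_norm = 2 * n_embd  # learned scale and shift

block = (
    layer_norm                      # LayerNorm before attention
    + linear(n_embd, 3 * n_embd)    # fused query/key/value projection
    + linear(n_embd, n_embd)        # attention output projection
    + layer_norm                    # LayerNorm before the MLP
    + linear(n_embd, 4 * n_embd)    # MLP expansion
    + linear(4 * n_embd, n_embd)    # MLP contraction
)

total = (
    vocab_size * n_embd             # token embeddings (shared with the output layer)
    + n_positions * n_embd          # position embeddings
    + n_layer * block
    + layer_norm                    # final LayerNorm
)
print(f"{total:,} parameters")
```

Under these assumptions the count comes to roughly 124 million parameters. `n_head` does not appear in the computation: it only determines how the `n_embd` dimensions are split between heads.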

---

### **Training Results**  
```text
10.816613 M parameters  
step 0: train loss 4.7847, val loss 4.7701  
step 4999: train loss 0.2683, val loss 2.1161  
time: 31m47.910s   
```

### **Generated Text Sample:**  

```text
Le pêcheur où l'homme en peu de Carevante  
Sa conter des chosses qu'en ses yoitn!  

Ils sont là-hauts parler à leurs ténèbres  
A ceux qu'on rêve aux oiseaux des cheveux,  
Et celus qu'on tourna jamais sous le front;  
Ils se disent tu mêle aux univers.  
J'ai vu Jean vu France, potte; petits contempler,  
Et petié calme au milibre et versait,  
M'éblouissant, emportant, écoute, ingorancessible,  
On meurt s'efferayait.....--Pas cont âme parle en Apparia!  
```