<h1 align="center" style="color:green;font-size: 3em;">Homework 1:
Implementing Transformers From Scratch Using PyTorch</h1>


# Part 1: Introduction

In this homework, you will implement the transformer architecture as described in the "Attention is All You Need" paper from scratch using PyTorch.

**Instructions:**
- Follow the notebook sections to implement various components of the transformer.
- Code cells marked with `TODO` are parts that you need to complete.
- Ensure your code runs correctly by the end of the notebook.


# Part 2: Import Libraries

In [1]:
# importing required libraries
import torch.nn as nn
import torch
import torch.nn.functional as F
import math,copy,re
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random
warnings.simplefilter("ignore")
print(torch.__version__)


# Set the seed value
seed_value = 0

# For CPU
torch.manual_seed(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

# For GPU (if using CUDA)
print("I cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True

2.5.1+cu124
I cuda available: True


Before diving into the decoder, let us discuss some of its components.


# Part 3: Basic Transformer Class

Neural networks operate over numerical weights and biases but natural language does not naturally take this form. Thus first, we need to convert an input sequence into an embedding vector. Embedding vectors create a more semantic representation of each token.


## 3.1: Embeddings Matrix


**3.1: Understanding Embeddings**

Transformers require numerical input, but natural language consists of words. To convert words into a numerical format, we use embeddings, which map each word to a high-dimensional vector. These vectors capture semantic information about the words.

**Task:**

- Instantiate an embedding layer using the `nn.Embedding` class in PyTorch.
- Investigate the properties of the embedding matrix.
- Embed multiple tokens and analyze the results.

**Step-by-Step Instructions:**

1. **Create an Embedding Layer:**
    - Assume a vocabulary size of 100 and an embedding dimension of 512.
    - Create an embedding layer using `nn.Embedding`.

2. **Analyze the Embedding Matrix:**
    - Print the shape of the embedding matrix.
    - Extract and print the first 3 rows of the embedding matrix (corresponding to tokens 0, 1, and 2).
  
3. **Embed Multiple Tokens:**
    - Create an input tensor representing a sequence of tokens. For example, use the tokens `[0, 1, 2]`.
    - Pass this input through the embedding layer.
    - Print and compare the embedding vectors for tokens 0, 1, and 2 with the first 3 rows of the embedding matrix.


In [2]:
# TODO: Create an Embedding Layer
vocabulary_size = 100
embedding_dimension = 512
embedding_layer = nn.Embedding(vocabulary_size, embedding_dimension)

print("Embedding Layer:", embedding_layer)

# TODO: Print the shape of the embedding matrix
# TODO: Analyze the Embedding Matrix
embedding_matrix = embedding_layer.weight
print(f"Shape of Embedding Matrix:{embedding_matrix.shape}")

# TODO: Print first 3 rows of embedding matrix
first_three_rows = embedding_matrix[:3]
print("First 3 Rows of Embedding Matrix:")
print(first_three_rows)

# TODO: Embed Multiple Tokens
input_tokens = torch.tensor([0, 1, 2])  # Example input tokens
embedded_tokens = embedding_layer(input_tokens)

print("\n Print embedding Vectors for Tokens 0, 1, 2:")
print(embedded_tokens)

# Compare with the first 3 rows of the embedding matrix
print("\nCompare with the first 3 rows of the embedding matrix:")
print(torch.allclose(embedded_tokens, first_three_rows, atol=1e-6))

Embedding Layer: Embedding(100, 512)
Shape of Embedding Matrix:torch.Size([100, 512])
First 3 Rows of Embedding Matrix:
tensor([[-1.1258, -1.1524, -0.2506,  ..., -1.6989,  1.3094, -1.6613],
        [-0.5461, -0.6302, -0.6347,  ...,  0.5374,  1.0826, -1.7105],
        [-1.0841, -0.1287, -0.6811,  ..., -0.0363,  0.0981,  0.9636]],
       grad_fn=<SliceBackward0>)

 Print embedding Vectors for Tokens 0, 1, 2:
tensor([[-1.1258, -1.1524, -0.2506,  ..., -1.6989,  1.3094, -1.6613],
        [-0.5461, -0.6302, -0.6347,  ...,  0.5374,  1.0826, -1.7105],
        [-1.0841, -0.1287, -0.6811,  ..., -0.0363,  0.0981,  0.9636]],
       grad_fn=<EmbeddingBackward0>)

Compare with the first 3 rows of the embedding matrix:
True


## 3.2: Positional Encoding

The next step is to generate positional encoding. For the model to understand a sentence, it helps to know two things about each token:
- What does the token mean semantically?
- What is the position of the token in the sentence?

In the "Attention is All You Need" paper, the authors used the following functions to create positional encoding. A cosine function is used for odd time steps, and a sine function is used for even time steps.

<img src="https://miro.medium.com/max/524/1*yWGV9ck-0ltfV2wscUeo7Q.png">

<img src="https://miro.medium.com/max/564/1*SgNlyFaHH8ljBbpCupDhSQ.png">

```
pos -> refers to order in the sentence
i -> refers to position along embedding vector dimension`
```

Positional encoding will generate a matrix similar to the embedding matrix. It will create a matrix of dimension sequence length x embedding dimension. For each token (word) in the sequence, we will find the embedding vector, which is of dimension (1, 512), and add it with the corresponding positional vector, which is also of dimension (1, 512), to get a (1, 512) dimension output for each word/token.

For example, if we have a batch size of 32 and a sequence length of 10 with an embedding dimension of 512, we will have an embedding vector of dimension (32, 10, 512). Similarly, we will have a positional encoding vector of dimension (32, 10, 512). Then we add both.

<img src="https://miro.medium.com/max/906/1*B-VR6R5vJl3Y7jbMNf5Fpw.png">

<hr>
<h3>Task:</h3>
Implement the `PositionalEmbedding` class. Complete the `__init__` method to initialize the positional encoding matrix and the `forward` method to add positional encoding to the input embeddings.

```
Code Hint:
Use math.sin and math.cos functions to create the positional encoding matrix.
```
```
Code Hint:
Use the nn.Parameter to store the positional encoding matrix.
```
**Note:** Ensure that the positional encoding matrix is not trained by the optimizer:
```
Code Hint:
This means that the positional encoding matrix do not require gradients. Look at pytorch nn.Parameter for more information.
```

In [18]:
import torch
import torch.nn as nn
import math

class PositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len, embed_model_dim):
        """
        Args:
            max_seq_len: maximum length of input sequence
            embed_model_dim: dimension of embeddings
        """
        super(PositionalEmbedding, self).__init__()
        self.embed_dim = embed_model_dim
        print("Max_seq_len:", max_seq_len)

        # TODO: Initialize the positional encoding matrix using the above
        # Shape: (max_seq_len, embed_model_dim)
        # Create a positional encoding matrix
        position_encoding = torch.zeros(max_seq_len, embed_model_dim)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_model_dim, 2).float() * (-math.log(10000.0) / embed_model_dim))

        # Compute sine and cosine for each position
        position_encoding[:, 0::2] = torch.sin(position * div_term)
        position_encoding[:, 1::2] = torch.cos(position * div_term)
        # Store the positional encoding matrix as an nn.Parameter but set requires_grad=False
        self.positional_encoding = nn.Parameter(position_encoding, requires_grad=False)

    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            x: output vector with positional encoding added
        """

        # TODO: Add positional encoding to the input embeddings
        x = x + self.positional_encoding[: x.size(1), :].unsqueeze(0)
        return x

In [4]:
# Initialize the PositionalEmbedding
max_seq_len = 10  # Maximum sequence length
embed_model_dim = 16  # Dimension of embeddings
positional_embedding = PositionalEmbedding(max_seq_len, embed_model_dim)
positional_embedding.eval()
with torch.no_grad():
  batch_size = 2
  sequence_length = 10
  sample_data = torch.zeros(batch_size, sequence_length, embed_model_dim)

  # Pass the sample data through the PositionalEmbedding layer
  output = positional_embedding(sample_data)

  print("Sample Input Data:")
  print(sample_data)
  print("Output Data with Positional Encoding:")
  print(output)

Max_seq_len: 10
Sample Input Data:
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0

## 3.3 What is Self-Attention?
In this section, we will explore Self-Attention and Multi-Head Attention mechanisms.

Consider the sentence, "The cat slept because it was tired." Here, "it" refers to "the cat." While this is intuitive for humans, this may not be clear at all to a machine.

Self-Attention allows the model to consider other positions in the input sequence while processing each word, enabling it to generate a vector that captures dependencies between words.

Let's break down the self-attention mechanism step by step:

1. **Input Projections:** For each word in the input, create three vectors: Query (Q), Key (K), and Value (V). Each vector has a dimension of 1x512.

   In multi-head attention, we have multiple self-attention heads (e.g., 8 heads). Each head corresponds contains smaller vectors which are reshaped from the 1x512 vector.

   **How to create Key, Query, and Value vectors?**

   Use matrices (key, query, and value matrices) to generate these vectors. These matrices are learned during training.
```
Code Hint:
If batch_size=32, sequence_length=10, and embedding_dimension=512, the output after embedding and positional encoding will be 32x10x512.
First, project it to get K,Q,V matrices of shape 32x10x512. Next, create heads of shape 32x10x8x64. (8 is the number of heads in multi-head attention).
```


* **Step 2:** **Calculate Attention Scores:** Multiply the query matrix with the key matrix transpose: [Q x K.t]
```
Code Hint:
If the dimensions of key, query, and value are 32x10x8x64, transpose them to 32x8x10x64.
Then, multiply the query matrix with the key matrix transpose: (32x8x10x64) x (32x8x64x10) -> (32x8x10x10).
```


* **Step 3:**  **Scale Scores:** Divide the output matrix by the square root of the key matrix dimension and apply Softmax.
```
Code Hint:
Divide the 32x8x10x10 vector by 8 (the square root of 64, the key matrix dimension).
```


* **Step 4:** **Weighted Sum:** Multiply the scores with the value matrix.
```
Code Hint:
After step 3, the output will be 32x8x10x10. Multiply it with the value matrix (32x8x10x64) to get the output (32x8x10x64).
```

* **Step 5:** **Output Transformation:** Pass the result through a linear layer to form the final output of the multi-head attention.
```
Code Hint:
Transpose the (32x8x10x64) vector to (32x10x8x64) and reshape it to (32x10x512).
Then, pass it through a linear layer to get the output of (32x10x512).
```


Now that you have an overview of how multi-head attention works, let's implement it. You will gain a deeper understanding through the following code exercise.

<hr>
<h3>Task:</h3>
Implement the `MultiHeadAttention` class. Complete the `__init__` and `forward` methods to perform the multi-head self-attention operation.

```
Code Hint:
Ensure you properly reshape and transpose the tensors to match the required dimensions for matrix multiplication.
```

**Note:** Masking can be used in the decoder to prevent attending to future tokens, but more on this later


In [17]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, n_heads=8):
        """
        Args:
            embed_dim: dimension of embedding vector output
            n_heads: number of self-attention heads
        """
        super(MultiHeadAttention, self).__init__()

        self.embed_dim = embed_dim    # 512 dim
        self.n_heads = n_heads   # 8 heads
        self.single_head_dim = embed_dim // n_heads   # 512 / 8 = 64, each key, query, and value head will be 64d

        # TODO: Initialize key, query, value, and output projection matrices/layers.
        # -- Note: Use biases only for the output projection layer. Not for the key, query, and value layers.
        self.query_matrix = nn.Linear(embed_dim, embed_dim, bias=False)
        self.key_matrix = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value_matrix = nn.Linear(embed_dim, embed_dim, bias=False)
        self.output_projection = nn.Linear(embed_dim, embed_dim)

    def forward(self, key, query, value, mask=None):
        """
        Args:
            key: key vector
            query: query vector
            value: value vector
            mask: mask for decoder

        Returns:
            output: vector from multi-head attention
        """
        batch_size = key.size(0)

        # TODO: Apply linear transformations for computing the key, query and value elements
        query = self.query_matrix(query)
        key = self.key_matrix(key)
        value = self.value_matrix(value)

        # TODO: Reshape key, query, and value
        query = query.view(batch_size, -1, self.n_heads, self.single_head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, self.n_heads, self.single_head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, self.n_heads, self.single_head_dim).transpose(1, 2)


        # TODO: Compute attention scores
        attention_scores = torch.matmul(query, key.transpose(-2, -1))

        # TODO: scale the dot products
        scores = attention_scores / math.sqrt(self.single_head_dim)

        # Apply masking, if mask in not None.
        # Assume that product is the tensor with the scaled dot products
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-1e20'))

        # TODO: Apply softmax
        attention_weights = torch.softmax(scores, dim=-1)

        # TODO: Compute weighted sum of value vectors and run it through the last layer
        weighted_sum = torch.matmul(attention_weights, value)

        weighted_sum = weighted_sum.transpose(1,2).contiguous().view(batch_size, -1, self.embed_dim)

        output = self.output_projection(weighted_sum)

        return output

In [6]:
torch.manual_seed(0)
mha = MultiHeadAttention(512, 8)
mha.eval()
with torch.no_grad():
    key = torch.zeros(2, 10, 512)
    query = torch.ones(2, 10, 512)
    value = torch.zeros(2, 10, 512)
    # Perform forward pass twice with the same inputs
    output1 = mha(key, query, value)
    print(mha.query_matrix.weight[0][:3])
    print(output1[0][0][:3])

tensor([-0.0003,  0.0237, -0.0364], requires_grad=True)
tensor([ 0.0141, -0.0128,  0.0270])


# Part 4: Decoder-only architecture

In this section, we will fully implement the decoder-only architecture


## 4.1 Decoder Class

<h2>Steps for the Decoder:</h2>

**Step 1:**

   The input (padded tokens of the sentence) is passed through the embedding layer and positional encoding layer.
```
Code Hint:
If the input is of size 64 (batch size=32 and sequence length=64), after passing through the embedding layer, it becomes 32x10x512.
This output is added to the corresponding positional encoding vector, producing a 32x10x512 output that is passed to the multi-head attention.
```

**Step 2:**
  At each decoder block the processed input is passed through the multi-head attention layer to create a useful representational matrix.
```
Code Hint:
The input to the multi-head attention is 32x10x512. Key, query, and value vectors are generated, ultimately producing a 32x10x512 output.
```

In the decoder the self-attention is masked. In other words, a causal mask is used with multi-head attention.

**Why mask?**

A mask is used to prevent a word from attending to future words in the sequence. For example, in the sentence "I am a student," we do not want the word "a" to attend to the word "student."
```
Code Hint:
To create the attention mask, we use a triangular matrix with 1s and 0s. For example, a triangular matrix for a sequence length of 5 looks like this:

1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1

After the key is multiplied by the query, you should fill all zero positions with a very small number (e.g., -1e20) to avoid division errors.
```

**Step 3:**
  The output from the multi-head attention is added to its input and then normalized.
```
Code Hint:
The output from the multi-head attention (32x10x512) is added to the input (32x10x512) and then normalized.
```
  Before the residual connection and norm layers, the multi-head attention output should be forwared through a dropout layer.

**Step 4:**
  
  The normalized output passes through a feed-forward layer and another normalization layer with a residual connection from the input of the feed-forward layer.
```
Code Hint:
The normalized output (32x10x512) is passed through two linear layers: 32x10x512 -> 32x10x2048 -> 32x10x512.
Finally, a residual connection is added, and the layer is normalized.
This produces a 32x10x512 dimensional vector as the encoder's output.
```
Again, before the residual connection and norm layers, the feed-forward layer output should be forwared through a dropout layer.

**Step 5:**

Finally, we create a linear layer with a size equal to the vocabulary size of the target corpus. Do not use softmax after the linear layer. Before the final linear layer, there is another layer norm layer.

<hr>
<h3>Task:</h3>

1. Implement the `__init__` and `forward` methods for `TransformerBlock`, making sure to apply masked attention with the given mask.
2. Implement make_mask function in `TransformerDecoderOnly` to generate masks for the self-attention layers
3. Implement the forward pass for `TransformerDecoderOnly`, including embedding, positional encoding, and the final linear layer with softmax. The `generate` function is provided as a helper function for inference / sampling


In [7]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(TransformerBlock, self).__init__()
        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor determining output dimension of the linear layer
           n_heads: number of attention heads
        """
        self.attention = MultiHeadAttention(embed_dim, n_heads)

        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, expansion_factor * embed_dim),
            nn.ReLU(),
            nn.Linear(expansion_factor * embed_dim, embed_dim)
        )

        self.dropout1 = nn.Dropout(0.2)
        self.dropout2 = nn.Dropout(0.2)

    def forward(self, x, mask=None):
        """
        Args:
           x: embeddings

        Returns:
           x_out: output of transformer block
        """

        # TODO: Calculate attention output using masked self.
        attention_output = self.attention(x, x, x, mask)

        # TODO: Apply droppout on the attention outputs.
        attention_output = self.dropout1(attention_output)

        # TODO: Add residual connection and normalize
        x = self.norm1(x + attention_output)

        # TODO: Pass through feed forward layer
        feed_forward_output = self.feed_forward(x)

        # TODO: Apply dropout on the outputs of the feed forward layer
        feed_forward_output = self.dropout2(feed_forward_output)

        # TODO: Add residual connection and normalize
        x_out = self.norm2(x + feed_forward_output)

        return x_out


class TransformerDecoderOnly(nn.Module):
    def __init__(self, vocab_size, embed_dim, seq_len, num_layers=6, expansion_factor=4, n_heads=8):
        super(TransformerDecoderOnly, self).__init__()
        """
        Args:
           vocab_size: vocabulary size of the target
           embed_dim: dimension of embedding
           seq_len: length of input sequence
           num_layers: number of decoder layers
           expansion_factor: factor determining the number of linear layers in the feed-forward layer
           n_heads: number of heads in multi-head attention
        """
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList(
            [TransformerBlock(embed_dim, expansion_factor=expansion_factor, n_heads=n_heads) for _ in range(num_layers)]
        )
        self.norm_out = nn.LayerNorm(embed_dim)
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def make_mask(self, seq):
        """
        Args:
            seq: sequence of indices. The shape should be [batch_size, seq_len]

        Returns:
            mask: causal mask. The shape should be: [batch_size, 1, seq_len, seq_len]
        """
        # TODO: Implement the mask for the sequence
        batch_size, seq_len = seq.size()
        mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(1)
        return mask.to(seq.device)

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

    def forward(self, x):
        """
        Args:
            x: input vector with token

        Returns:
            out: output vector
        """
        mask = self.make_mask(x)
        mask = mask.to(x.device)

        # TODO: Apply the embedding layer
        x = self.word_embedding(x).to(x.device)

        # TODO: Apply positional encoding
        x = self.position_embedding(x)

        # TODO: Pass through each decoder transformer block with mask
        for layer in self.layers:
            x = layer(x, mask)
        # TODO: Apply the final layer norm layer (self.norm_out)
        x = self.norm_out(x)

        # TODO: Apply final linear layer
        out = self.fc_out(x)

        return out


<h3>Test the coded transformer blocks:</h3>

In [8]:
def test_transformer_block():
    embed_dim = 512
    n_heads = 8
    expansion_factor = 4

    # Define input shapes: batch_size x seq_length x embed_dim
    batch_size = 32
    seq_length = 4

    # Create random input tensor
    x = torch.rand(batch_size, seq_length, embed_dim)

    # Create the TransformerBlock
    transformer_block = TransformerBlock(embed_dim, expansion_factor, n_heads)

    # Pass the inputs through the transformer block
    output = transformer_block(x)

    # Check the output shape: should be [batch_size, seq_length, embed_dim]
    assert output.shape == (batch_size, seq_length, embed_dim), \
        f"Expected output shape {(batch_size, seq_length, embed_dim)}, but got {output.shape}"

    print("TransformerBlock test passed!")

test_transformer_block()

TransformerBlock test passed!


In [9]:
from torch.testing import assert_close

batch_size = 32
seq_len = 10
embed_dim = 512
vocab_size = 10000
n_heads = 8
num_layers = 2
expansion_factor = 4

# Test case 1: Full Transformer Decoder functionality
def test_transformer_decoder_full_pass():
    """
    Tests the forward pass of TransformerDecoderOnly and check if output has expected shape and behavior.
    """
    # Initialize a TransformerDecoder
    decoder = TransformerDecoderOnly(
        vocab_size=vocab_size,
        embed_dim=embed_dim,
        seq_len=seq_len,
        num_layers=num_layers,
        expansion_factor=expansion_factor,
        n_heads=n_heads
    )

    x = torch.randint(0, vocab_size, (batch_size, seq_len))

    # Forward pass through the transformer decoder
    output = decoder(x)

    # Assert the output has the correct shape (batch_size, seq_len, vocab_size)
    assert output.shape == (batch_size, seq_len, vocab_size), \
        f"Expected {(batch_size, seq_len, vocab_size)}, but got {output.shape}"
    output = F.softmax(output, dim=-1)
    # Check that the output is a valid probability distribution (i.e., each row sums to 1 after softmax)
    assert_close(output.sum(dim=-1), torch.ones(batch_size, seq_len), rtol=1e-2, atol=1e-2)

    print("TransformerDecoderOnly passed the full forward pass test!")


test_transformer_decoder_full_pass()

Max_seq_len: 10
TransformerDecoderOnly passed the full forward pass test!


# Part 5: Train and test our Decoder-only architecture

Now, we're going to the train the modle on the TinyStories dataset with the GPT-2 tokenizer. The goal is to try to learn a small language model that can generate creative and coherent text in English.

Let's install some dependencies

In [10]:
!pip install datasets tiktoken

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m26.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)


## 5.1 Model hyperparameters
Now we are going to, define the model hyperparameters

No code is needed here.

In [11]:
import tiktoken

# Model  hyperparameters
batch_size = 128  # how many independent sequences will we process in parallel?
model_max_seq_len = block_size = 64  # what is the maximum context length for predictions.
max_iters = 5000  # 5000
eval_interval = 500 # 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 100
n_embd = 512
n_head = 8
n_layer = 6 # 12
dropout = 0.1

# Loading the tiktoken tokenizer used in GPT2
enc = tiktoken.get_encoding("gpt2")  # tokenizer - GPT2
vocab_size = enc.n_vocab
# ------------

## 5.2 Building dataset and dataloader

Now we are going to download the dataset (TinyStories dataset), tokenize it using the tiktoken tokenizer from GPT-2, divide it to train and test splits.

No code is needed here

In [12]:
from datasets import load_dataset

dataset = load_dataset('roneneldan/TinyStories')
# loading first 12000 stories from the dataset
text = '\n'.join(dataset['train']['text'][:12000])

print("Print the first 1000 characters from the TinyStories:")
print(text[:1000])

# Tokenize the data:
data = torch.tensor(enc.encode(text), dtype=torch.long)
print("\n\n\n")
print("Data after the the tokenization process (fist 100 indices):")
print(data[:100])

# Make the train and test splits
n = int(0.9*len(data))  # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

print("\n\n\n")
print(f"Train data shape {train_data.shape}, validation data shape {val_data.shape}")
# data loading

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

README.md:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

(…)-00000-of-00004-2d5a1467fff1081b.parquet:   0%|          | 0.00/249M [00:00<?, ?B/s]

(…)-00001-of-00004-5852b56a2bd28fd9.parquet:   0%|          | 0.00/248M [00:00<?, ?B/s]

(…)-00002-of-00004-a26307300439e943.parquet:   0%|          | 0.00/246M [00:00<?, ?B/s]

(…)-00003-of-00004-d243063613e5a057.parquet:   0%|          | 0.00/248M [00:00<?, ?B/s]

(…)-00000-of-00001-869c898b519ad725.parquet:   0%|          | 0.00/9.99M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2119719 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/21990 [00:00<?, ? examples/s]

Print the first 1000 characters from the TinyStories:
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.

One day, Beep was driving in the park when he saw a b

## 5.3 Create GPT Model And Its Optimizer

Your tasks are:
1. Define / create the decoder only model using the hyperparameters set above
2. Define an AdamW optimizer using the learning rate (hyperparameter set above)
3. Print from model the following stuff:
  - The number of trainable parameters in the input embedding layer
  - The number of trainable parameters in all the transformer blocks
  - The number of trainable parameters in the final linear layer of the model
  - The total number of trainable parameters.

In [13]:
# TODO Define / create the model (Accut)
model = TransformerDecoderOnly(vocab_size, n_embd, model_max_seq_len, num_layers=n_layer, expansion_factor=4, n_heads=n_head)

# print(model)
print(model)

# Set the model to the device using, e.g., m = model.to(device)
model = model.to(device)

# TODO create a AdamW optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# TODO print the asked number of trainable parameters
# Counting function
def count_parameters(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# The number of trainable parameters in the input embedding layer
input_embedding_params = count_parameters(model.word_embedding)
# The number of trainable parameters in all the transformer blocks
transformer_blocks_params = count_parameters(model.layers)
# The number of trainable parameters in the final linear layer of the model
final_linear_layer_params = count_parameters(model.fc_out)
# The total number of trainable parameters
total_params = count_parameters(model)

# Print the trainable parameters
print(f"The number of trainable parameters in the input embedding layer: {input_embedding_params}")
print(f"The number of trainable parameters in all the transformer blocks: {transformer_blocks_params}")
print(f"The number of trainable parameters in the final linear layer of the model: {final_linear_layer_params}")
print(f"The total number of trainable parameters: {total_params}")

Max_seq_len: 64
TransformerDecoderOnly(
  (word_embedding): Embedding(50257, 512)
  (position_embedding): PositionalEmbedding()
  (layers): ModuleList(
    (0-5): 6 x TransformerBlock(
      (attention): MultiHeadAttention(
        (query_matrix): Linear(in_features=512, out_features=512, bias=False)
        (key_matrix): Linear(in_features=512, out_features=512, bias=False)
        (value_matrix): Linear(in_features=512, out_features=512, bias=False)
        (output_projection): Linear(in_features=512, out_features=512, bias=True)
      )
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (feed_forward): Sequential(
        (0): Linear(in_features=512, out_features=2048, bias=True)
        (1): ReLU()
        (2): Linear(in_features=2048, out_features=512, bias=True)
      )
      (dropout1): Dropout(p=0.2, inplace=False)
      (dropout2): Dropout(p=0.2, inplace=False)
    )
  )
  (norm_out): 

## 5.4 Define the Loss Function

In [14]:
def loss_function(logits, targets):
    """
      Args:
          logits: predicted logits (pre-softmax activations) from the model. Shape: [batch_size, seq_len, vocab_size]
          targets: target vocabulary indices. Shape [batch_size, seq_len]

      Returns:
          loss: scalar value with the average cross-entropy loss
    """
    # TODO implement the cross_entropy loss
    loss_f = nn.CrossEntropyLoss()
    loss = loss_f(logits.view(-1, logits.size(-1)), targets.view(-1))
    return loss

## 5.5 Train the GPT Model And Its Optimizer

Your tasks are to write the training and evaluation code that does the following things:
1. Train the model for `max_iters` using the training set.
2. Every `eval_interval` eveluates the model (computes the loss) on both the test set and the train set, using `eval_iters` iterations for each split. Report the average loss on the train and test set from this.

In [15]:
from tqdm import tqdm
from torch.optim.lr_scheduler import CosineAnnealingLR


# Cosine Annealing LR Scheduler (for tuning the learning rate)
scheduler = CosineAnnealingLR(optimizer, T_max=max_iters, eta_min=1e-5)

early_stopping_threshold = 5  # Stop training if no improvement in val loss for 5 evaluations
best_val_loss = float('inf')
no_improvement = 0

model.train()
for iter in tqdm(range(max_iters)):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        model.eval()
        losses = {'train': 0, 'val': 0,}
        for split in ['train', 'val']:
            total_loss = 0
            for k in range(eval_iters):
              X, Y = get_batch(split)
              # TODO: fill the code. It must compute the average loss on the 'train' or 'val' split using eval_iters
              X, Y = X.to(device), Y.to(device)

              logits = model(X)
              loss = loss_function(logits, Y)
              total_loss += loss.item()
            # Compute the average loss over eval_iters
            losses[split] = total_loss / eval_iters
        # for each.
        scheduler.step(losses['val'])

        # Check for early stopping
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            no_improvement = 0
        else:
            no_improvement += 1
        # Stop training if no improvement in val loss for early_stopping_threshold evaluations
        if no_improvement >= early_stopping_threshold:
            print(f"Stopping early at iteration {iter}, best validation loss: {best_val_loss:.4f}")
            break

        model.train()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)

    # TODO: implement a training iteration (forward, loss computation, backward, optimizer step, etc).
    # Forward pass: compute logits
    logits = model(xb)

    # Compute loss
    loss = loss_function(logits, yb)

    # Backward pass: compute gradients
    optimizer.zero_grad()
    loss.backward()

    # Gradient update
    optimizer.step()

    # evaluate the loss

  0%|          | 0/5000 [00:00<?, ?it/s]

step 0: train loss 11.0375, val loss 11.0335


 10%|█         | 502/5000 [01:38<2:58:27,  2.38s/it]

step 500: train loss 2.9695, val loss 3.0569


 20%|██        | 1002/5000 [03:04<2:38:57,  2.39s/it]

step 1000: train loss 2.5653, val loss 2.7616


 30%|███       | 1502/5000 [04:31<2:18:04,  2.37s/it]

step 1500: train loss 2.3334, val loss 2.6233


 40%|████      | 2002/5000 [05:57<1:58:41,  2.38s/it]

step 2000: train loss 2.1637, val loss 2.5527


 50%|█████     | 2502/5000 [07:23<1:38:52,  2.37s/it]

step 2500: train loss 2.0143, val loss 2.5310


 60%|██████    | 3002/5000 [08:50<1:19:19,  2.38s/it]

step 3000: train loss 1.8986, val loss 2.5216


 70%|███████   | 3502/5000 [10:17<59:30,  2.38s/it]  

step 3500: train loss 1.7835, val loss 2.5251


 80%|████████  | 4002/5000 [11:43<39:25,  2.37s/it]

step 4000: train loss 1.6933, val loss 2.5465


 90%|█████████ | 4502/5000 [13:10<19:45,  2.38s/it]

step 4500: train loss 1.6029, val loss 2.5688


100%|██████████| 5000/5000 [14:36<00:00,  5.70it/s]

step 4999: train loss 1.5247, val loss 2.6006





To achieve good results (low validation loss), you will probably need to play (tune) the `learning_rate` and max training duration (`max_iters`).

## 5.6 Generate stories using the trained model.

No code is needed here

In [16]:
# generate from the model
# context = torch.zeros((1, 1), dtype=torch.long, device=device)
context = torch.tensor(enc.encode('\n'), dtype=torch.long, device=device).unsqueeze(0)
print(enc.decode(model.generate(context, max_new_tokens=200)[0].tolist()))


And so, one day, the fairy waved her wand and waved her wand. The fairy felt very proud of herself for becoming a hero!

The fairy smiled and twirled around with pride, stars jumped out together. With the chirp, the fairy would remind her of all the wonderful gift of his magical life!
Once upon a time, there was a swimming in the water. It was so beautiful that it almost fell from the ocean. The family took it home and together they carried it outside. Soon, they were so happy that it was back to shore. 

The happy sea knew it was time for a dive to sail. The seagull was happy that he got to please and skillfully. He was a loyal friend because he could say they got care.

Later that day, the seahorse played catch their own water in the bathtub. The fish had a big stick that touched the wipe to snuck and bubbles with all the colors
