# Lab1 - Transformers, ViT and VLM

The first two sections are an introduction to Git/GitHub and PyTorch. The last part focuses on Transformers, ViT and VLM.

## I: Introduction to Git and GitHub

This notebook will guide you through the essential commands and operations needed to use Git and GitHub effectively. By the end of this notebook, you will be able to:

1. Install and configure Git.
2. Create a local repository.
3. Make commits to your repository.
4. Push your repository to GitHub.
5. Collaborate with others using GitHub.

### Step 1: Install and Configure Git

First, you need to install Git on your computer. You can download it from [git-scm.com](https://git-scm.com/).

Once installed, open your terminal (Command Prompt, PowerShell, or Git Bash) and configure Git with your username and email:

In [None]:
git config --global user.name "Your Name"
git config --global user.email "your.email@email.com"

### Step 2: Create a Local Repository

1. Create a new directory for your project and navigate into it:

In [None]:
mkdir my_project
cd my_project

2. Initialize a new Git repository:

In [None]:
git init

### Step 3: Make Commits to Your Repository

1. Create a new file (e.g., README.md) and add some content to it:

In [None]:
echo "# My Project" > README.md

2. Check the status of your repository:

In [None]:
git status


3. Add the file to the staging area:

In [None]:
git add README.md

4. Commit the changes:

In [None]:
git commit -m "Initial commit"

### Step 4: Push Your Repository to GitHub

1. Create a new repository on GitHub. Do not initialize the repository with a README, .gitignore, or license.

2. Add the remote repository URL to your local repository:

In [None]:
git remote add origin https://github.com/yourusername/my_project.git

3. Push your changes to the remote repository:

In [None]:
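# Recent Git versions and GitHub often name the default branch "main"; run `git branch` to check and adjust the name below
git push -u origin master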

### Step 5: Collaborate with Others Using GitHub

To collaborate with others, you can add collaborators to your GitHub repository. Go to the repository settings and add collaborators.

Collaborators can clone the repository to their local machine:

In [None]:
git clone https://github.com/yourusername/my_project.git

Collaborators can create a new branch to work on a feature (a simple rule is "one new feature, one branch, one pull request"):


In [None]:
git checkout -b feature-branch

They can make changes and commit them:

In [None]:
git add .
git commit -m "Add new feature"

Push the branch to the remote repository:

In [None]:
git push origin feature-branch

Create a pull request on GitHub to merge the branch into the main branch.

1. Click on "Contribute" and "Open pull request".

2. Select the Base branch (the branch you want to merge your changes into, typically `main` or `master`) and the Compare branch (the branch that contains your changes).

3. Review the Changes: GitHub will show you the changes between the two branches. Review these changes to ensure everything looks correct.

4. Fill out the pull request form and create the pull request.

5. Review and Merge:

The pull request will now be visible to collaborators, who can review your changes, leave comments, and request modifications. If reviewers request changes, make the necessary modifications in your branch, commit them, and push the updates to the same branch. The pull request will automatically update with the new changes. Once the pull request is approved, you or a collaborator with merge permissions can merge the pull request into the base branch.



### Additional Git Commands

- Check the status of your repository:

In [None]:
git status

- View the commit history:

In [None]:
git log

- Switch to a different branch:

In [None]:
git checkout branch-name

- Merge a branch into the current branch:

In [None]:
git merge branch-name

- Fetch updates from the remote repository:

In [None]:
git fetch

- Pull updates from the remote repository and merge them into the current branch:

In [None]:
git pull

## II: Introduction to PyTorch

PyTorch is an open-source machine learning library based on the Torch library. It is used for applications such as computer vision and natural language processing, primarily as a research platform, and provides dynamic computation graphs and a rich ecosystem of tools and libraries.

In this section, we will:
1. Introduce PyTorch and its basic concepts.
2. Learn how to manipulate tensors.
3. Define a simple Multi-Layer Perceptron (MLP).
4. Train the MLP on a simple dataset.

First, let's install and import PyTorch and check its version.

In [None]:
!pip install torch torchvision torchaudio numpy

In [None]:
import torch
print(f"PyTorch version: {torch.__version__}")

### 1. Tensors

Tensors are the core data structure in PyTorch. They are similar to NumPy arrays, with added support for GPU acceleration and automatic differentiation (we will see more about that later in the section on differentiation and gradients).

A. Creating Tensors

Tensors can be created:
- directly from data. The data type is automatically inferred. 
- from NumPy arrays
- from another tensor
- with random or constant values

In [None]:
# Creating a tensor from a list
x_from_list = torch.tensor([1, 2, 3, 4, 5])
print(f"Tensor from list: \n {x_from_list}")

# Creating a tensor from a numpy array
import numpy as np
numpy_array = np.array([1, 2, 3, 4, 5])
x_from_numpy = torch.tensor(numpy_array)
print(f"Tensor from NumPy array: \n {x_from_numpy} \n")

# Creating a tensor from another tensor (same shape and dtype unless overridden)
x_ones = torch.ones_like(x_from_list)
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_from_list, dtype=torch.float)
print(f"Random Tensor: \n {x_rand} \n")


# Creating a tensor with random values
x_random = torch.rand((3, 3))
print(f"Tensor with random values: \n {x_random} \n")


# Creating a tensor with constant values
x_zeros = torch.zeros((3, 3))
print(f"Tensor with zeros:\n {x_zeros} \n")

x_ones = torch.ones((3, 3))
print(f"Tensor with ones: \n {x_ones} \n")

Tensor attributes describe their shape, datatype, and the device on which they are stored.

In [None]:
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

We notice that the variable `tensor` has 'cpu' as device attribute. PyTorch allows you to store tensors and perform computations on different devices, such as the CPU and GPU. Using a GPU can significantly speed up training and inference for large models and datasets.

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the `.to()` method. First, let's check if a GPU is available.

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

If you’re using Colab, you can allocate a GPU by going to Runtime > Change runtime type > GPU. 
You can move tensors to a specific device using the .to(device) method.

In [None]:
# Create a tensor (by default on the cpu)
tensor_cpu = torch.tensor([1.0, 2.0, 3.0])
print(f"Tensor on device: {tensor_cpu.device}")

# Move the tensor to the selected device
if torch.cuda.is_available():
    device = torch.device("cuda")
    tensor_device = tensor_cpu.to(device)
    print(f"Tensor on device: {tensor_device.device}")

### 2. Tensor operations

Many tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), sampling and more, are already implemented in PyTorch. The next cells show some basic operations. You can find a comprehensive list of available operations [here](https://pytorch.org/docs/stable/torch.html).

Arithmetic operations

In [None]:
# Adding two tensors
tensor_a = torch.tensor([1, 2, 3])
tensor_b = torch.tensor([4, 5, 6])
tensor_sum = tensor_a + tensor_b
print(f"Sum of tensors: \n {tensor_sum}\n")


# Multiplying two tensors element-wise. All three result tensors will have the same value
tensor_product_1 = tensor_a * tensor_b
tensor_product_2 = tensor_a.mul(tensor_b)
tensor_product_3 = torch.mul(tensor_a, tensor_b)
print(f"Product of tensors: \n {tensor_product_1} \n")

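# Matrix multiplication with the 3x4 `tensor` created earlier.
# y1, y2 and y3 will have the same value
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)

# Matrix multiplication on explicit 2x2 matrices. All three result tensors will have the same value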
tensor_c = torch.tensor([[1, 2], [3, 4]])
tensor_d = torch.tensor([[5, 6], [7, 8]])
tensor_matmul_1 = tensor_c.matmul(tensor_d)
tensor_matmul_2 = tensor_c @ tensor_d
tensor_matmul_3 = torch.matmul(tensor_c, tensor_d)
print(f"Matrix multiplication: \n {tensor_matmul_1} \n")

Indexing and Slicing

In [None]:
# Creating a 2D tensor
tensor_2d = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"2D Tensor: \n {tensor_2d} \n")

# Indexing
print(f"Element at (0, 0): \n {tensor_2d[0, 0]} \n")

# Slicing
print(f"First row: \n {tensor_2d[0, :]} \n")
print(f"First column: \n {tensor_2d[:, 0]} \n")

Single tensor operations

In [None]:
tensor = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean of all elements
mean_value = tensor.mean()
print(f"Mean of tensor: \n {mean_value} - {type(mean_value)} \n")
mean_value_item = mean_value.item()
print(f"Mean of tensor (item): \n {mean_value_item} - {type(mean_value_item)}\n")

# Sum of all elements
sum_value = tensor.sum()
print(f"Sum of tensor: \n {sum_value} \n")

# Maximum value
max_value = tensor.max()
print(f"Maximum value in tensor: \n {max_value} \n")

# Minimum value
min_value = tensor.min()
print(f"Minimum value in tensor: \n {min_value} \n")

# Standard deviation
std_value = tensor.std()
print(f"Standard deviation of tensor: \n {std_value} \n")


Concatenation and dimension operations

In [None]:
tensor = torch.ones(4, 4)
print(f"Initial tensor: \n {tensor} \n")

# Concatenation of tensors
t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(f"Concatenated tensor (dim=1): \n {t1} \n")

t2 = torch.cat([tensor, tensor, tensor], dim=0)
print(f"Concatenated tensor (dim=0): \n {t2} \n")

# Squeeze removes dimensions of size 1; unsqueeze adds one
tensor_with_ones = torch.tensor([[1.0], [2.0], [3.0]])  # shape [3, 1]
print(f"Shape before squeeze: \n {tensor_with_ones.shape} \n")
squeezed_tensor = tensor_with_ones.squeeze()
print(f"Squeezed tensor shape: \n {squeezed_tensor.shape} \n")

# Add a dimension of size 1 at position 0 (the 4x4 `tensor` becomes 1x4x4)
unsqueezed_tensor = tensor.unsqueeze(0)
print(f"Unsqueezed tensor at dimension 0 (shape): \n {unsqueezed_tensor.shape} \n")

# Add a dimension of size 1 at position 1
unsqueezed_tensor_1 = tensor.unsqueeze(1)
print(f"Unsqueezed tensor at dimension 1 (shape): \n {unsqueezed_tensor_1.shape} \n")

### 2. Neural networks

Neural networks are composed of layers (modules) that perform operations on data. In PyTorch, every neural network subclasses `nn.Module`. A neural network is itself a module that consists of other modules (layers).

To create a neural network in PyTorch, follow these steps:
1. Define a class that inherits from `nn.Module`.
2. Implement the `__init__` method to initialize the layers of the network.
3. Implement the `forward` method to define the forward pass of the network.


The `__init__` method initializes the layers of the neural network. In this example, we will define an MLP with one hidden layer.

The `forward` method takes the input data `x` and passes it through the first fully connected layer (`fc1`), which applies a linear transformation to the input. The output of this first layer is then passed through the ReLU activation function. The output of the ReLU activation is finally passed through the second fully connected layer (`fc2`).

In [11]:
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # First fully connected layer
        self.fc2 = nn.Linear(hidden_size, output_size)  # Second fully connected layer

    def forward(self, x):
        x = F.relu(self.fc1(x))  # Apply ReLU activation to the first layer
        x = self.fc2(x)  # Apply the second layer
        return x

Our MLP class is now defined. We can create an instance of this class and move it to the `device`:

In [None]:
# Define the MLP
input_size = 784
hidden_size = 256
output_size = 10
model = SimpleMLP(input_size, hidden_size, output_size)
model.to(device)

# Print the model architecture
print(model)

To use the model, we pass it the input data. This executes the model’s `forward` method, along with some background operations. Do not call `model.forward()` directly!

In [None]:
X = torch.rand(1, 784, device=device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred.item()}")

Let’s break down the layers in the model.
We consider an input tensor `x` of shape `[1, 784]` (1 for the batch size, 784 for the number of input features).

- nn.Linear

The linear layer is a module that applies a linear transformation on the input using its stored weights and biases.


In [None]:
x = torch.rand(1, 784, device=device)

layer1 = nn.Linear(in_features=784, out_features=20, device=device)
hidden1 = layer1(x)
print(hidden1.size())

- nn.ReLU

Non-linear activations are what create the complex mappings between the model’s inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.

In this model, we use nn.ReLU between our linear layers. 

In [None]:
print(f"Before ReLU: {hidden1}\n\n")
hidden1 = nn.ReLU()(hidden1)
print(f"After ReLU: {hidden1}")

- nn.Sequential

nn.Sequential is an ordered container of modules. The data is passed through all the modules in the same order as defined. You can use sequential containers to put together a quick network like seq_modules.

In [None]:
seq_modules = nn.Sequential(
    layer1,
    nn.ReLU(),
    nn.Linear(20, 10, device=device)
)
logits = seq_modules(x)

- nn.Softmax

The last linear layer of the neural network returns logits, raw values in $(-\infty, \infty)$, which are passed to the `nn.Softmax` module. The logits are scaled to values in $[0, 1]$ representing the model’s predicted probabilities for each class. The `dim` parameter indicates the dimension along which the values must sum to 1.

In [None]:
softmax = nn.Softmax(dim=1)
pred_probab = softmax(logits)
print(f"Predicted probabilities: \n {pred_probab} \n")
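
To make the softmax arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch needed). The function name `softmax` and the three example logits are illustrative assumptions, not values produced by the model above.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, -1.0]  # hypothetical logits for a 3-class problem
probs = softmax(example_logits)

print("probabilities:", [round(p, 4) for p in probs])
print("sum:", sum(probs))
```

The largest logit receives the largest probability, and the ordering of the logits is preserved.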

Many layers inside a neural network are parameterized, i.e. they have associated weights and biases that are optimized during training. You can access the parameters of a model using the `parameters()` or `named_parameters()` methods.

In [None]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} \n")
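
The sizes printed above can be checked by hand. The sketch below assumes the same layer sizes as the `SimpleMLP` instance (784, 256, 10) and uses a hypothetical helper, `linear_param_count`, that counts one weight per input/output pair plus one bias per output, which is how `nn.Linear` stores its parameters.

```python
def linear_param_count(in_features, out_features):
    # Weight matrix of shape [out_features, in_features] plus one bias per output.
    return out_features * in_features + out_features

input_size, hidden_size, output_size = 784, 256, 10

fc1_params = linear_param_count(input_size, hidden_size)
fc2_params = linear_param_count(hidden_size, output_size)
total_params = fc1_params + fc2_params

print(f"fc1: {fc1_params}, fc2: {fc2_params}, total: {total_params}")
```

In the notebook, the total can be compared with `sum(p.numel() for p in model.parameters())`.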

### 3. Differentiation and gradients in PyTorch

When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to each parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph.

To compute the derivative of a scalar (a variable, a loss, etc.) with respect to its parameters, we use the `.backward()` method:

In [None]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x ** 2
z = y.sum()

# Compute the gradient
z.backward()

print("Gradient of z with respect to x:", x.grad)
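
One way to convince yourself that the value in `x.grad` is right is to compare the analytic derivative of $z = \sum_i x_i^2$, which is $2x_i$, with a finite-difference estimate. The sketch below does this in plain Python; the helper names `z_of` and `numerical_gradient` are illustrative assumptions.

```python
def z_of(x):
    # Same function as in the cell above: z = sum(x_i ** 2)
    return sum(v ** 2 for v in x)

def numerical_gradient(f, x, eps=1e-6):
    """Central finite differences: (f(x + eps) - f(x - eps)) / (2 * eps) per coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

x = [1.0, 2.0, 3.0]
analytic = [2 * v for v in x]          # dz/dx_i = 2 * x_i
numeric = numerical_gradient(z_of, x)  # should agree up to rounding error

print("analytic:", analytic)
print("numeric: ", [round(g, 5) for g in numeric])
```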

Consider the simplest one-layer neural network `z = w x + b`, with input `x`, parameters `w` and `b`, and some loss function.

In this network, `w` and `b` are the parameters we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables.

In [None]:
x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

loss.backward()
print(f"Gradient of the loss with respect to w: \n {w.grad} \n")
print(f"Gradient of the loss with respect to b: \n {b.grad} \n")
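
For this particular loss, the gradients can also be written in closed form. With the default mean reduction, $\partial L / \partial z_j = (\sigma(z_j) - y_j) / 3$, so $\partial L / \partial b_j$ has the same value and $\partial L / \partial w_{ij} = x_i \, \partial L / \partial z_j$. The plain-Python sketch below checks this on fixed, hypothetical values of `w` and `b` (the cell above draws them at random, so the numbers will differ).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def bce_with_logits(z, y):
    # Mean over the outputs of -[y * log(s) + (1 - y) * log(1 - s)], with s = sigmoid(z)
    terms = [-(t * math.log(sigmoid(v)) + (1 - t) * math.log(1 - sigmoid(v)))
             for v, t in zip(z, y)]
    return sum(terms) / len(terms)

def forward(x, w, b):
    # z_j = sum_i x_i * w[i][j] + b_j
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

x = [1.0] * 5                                               # input, all ones as above
y = [0.0] * 3                                               # expected output, all zeros as above
w = [[0.1 * (i - j) for j in range(3)] for i in range(5)]   # hypothetical fixed weights
b = [0.5, -0.5, 0.0]                                        # hypothetical fixed biases

z = forward(x, w, b)
grad_z = [(sigmoid(v) - t) / len(z) for v, t in zip(z, y)]
grad_b = grad_z                                             # dL/db_j = dL/dz_j
grad_w = [[x[i] * grad_z[j] for j in range(3)] for i in range(5)]

print("grad_b:", [round(g, 4) for g in grad_b])
print("first row of grad_w:", [round(g, 4) for g in grad_w[0]])
```

Because `x` is all ones, every row of the gradient with respect to `w` equals the gradient with respect to `b`, which you can also observe in the output of the PyTorch cell.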

A function that we apply to tensors to construct a computational graph is in fact an object of class `Function`. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor.

In [None]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

### 4. Training a neural network

Training a neural network in PyTorch involves several key steps: defining the model, preparing the data, specifying the loss function and optimizer, and iterating through the training loop. Let's go through each step in detail.

1. Define the model 

We will use the Multi-Layer Perceptron defined in the previous section.

2. Prepare the data

Next, we need to prepare our training and evaluation data. PyTorch provides `Dataset` and `DataLoader` classes to handle data loading and preprocessing efficiently. The `Dataset` class is an abstract class representing a dataset, and the `DataLoader` class provides an iterable over the given dataset. The `torchvision.datasets` module contains `Dataset` objects for many real-world vision datasets such as CIFAR or COCO.

In this example, we will use the [MNIST](https://yann.lecun.com/exdb/mnist/) dataset. 

In [None]:
from torchvision import datasets
import torchvision.transforms as T

training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=T.Compose([T.ToTensor(), T.Lambda(lambda x: torch.flatten(x))]),
)

test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=T.Compose([T.ToTensor(), T.Lambda(lambda x: torch.flatten(x))]),
)

We then pass the Dataset as an argument to DataLoader. The DataLoader class supports batching, shuffling, and loading data in parallel using multiprocessing workers. Here we define a batch size of 64, i.e. each element in the dataloader iterable will return a batch of 64 features and labels.

In [None]:
from torch.utils.data import DataLoader

batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)  # reshuffle the training set every epoch
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, features]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
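
The number of batches per epoch follows directly from the dataset size and the batch size. The sketch below works it out in plain Python for the 60,000 MNIST training images and the 10,000 test images, assuming the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`). The helper `batch_sizes` is hypothetical.

```python
def batch_sizes(num_samples, batch_size):
    """List the size of every batch when the last partial batch is kept."""
    full_batches, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])

train_batches = batch_sizes(60000, 64)
test_batches = batch_sizes(10000, 64)

print(f"train: {len(train_batches)} batches, last one has {train_batches[-1]} samples")
print(f"test:  {len(test_batches)} batches, last one has {test_batches[-1]} samples")
```

In the notebook, `len(train_dataloader)` and `len(test_dataloader)` should report the same batch counts.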

3. Specify the Loss Function and Optimizer

The loss function measures the difference between the predicted outputs and the true labels. The optimizer updates the model parameters to minimize the loss.

In [26]:
import torch.optim as optim

# Define the loss function
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
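
`nn.CrossEntropyLoss` expects raw logits and integer class labels, and applies the log-softmax internally. For a single sample, the loss is the negative log-probability of the true class. The plain-Python sketch below illustrates this on hypothetical logits; the function name `cross_entropy` is an assumption for illustration.

```python
import math

def cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed with a stable log-sum-exp."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum_exp - logits[label]

confident_right = cross_entropy([5.0, 0.0, 0.0], label=0)  # high score on the true class
uniform = cross_entropy([0.0, 0.0, 0.0], label=0)          # no preference: log(3)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], label=1)  # high score on a wrong class

print(f"confident and right: {confident_right:.4f}")
print(f"uniform:             {uniform:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```

A useful sanity check follows from the uniform case: with 10 classes, a model that has not learned anything yet should have a loss close to $\log(10) \approx 2.30$.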

4. Training Loop

The training loop involves iterating over the dataset, performing forward and backward passes, and updating the model parameters. The main steps are: 
- Zero the Parameter Gradients: `optimizer.zero_grad()` clears the old gradients from the last step. This is important because PyTorch accumulates gradients by default.
- Forward Pass: The input data is passed through the model to get the outputs. The loss is then computed using the criterion.
- Backward Pass: `loss.backward()` computes the gradient of the loss with respect to the model parameters. 
- Optimization: `optimizer.step()` updates the model parameters using the computed gradients.

In [None]:
# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    running_loss = 0.0
    for i, data in enumerate(train_dataloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 100 == 99:  # Print every 100 mini-batches
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_dataloader)}], Loss: {running_loss/100:.4f}')
            running_loss = 0.0

print("Training complete.")
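
The four steps of the loop can also be sketched on a one-parameter problem in plain Python. The example minimizes the hypothetical loss $(p - 3)^2$ and uses the update rule that SGD with momentum applies (momentum buffer $v \leftarrow \mu v + g$, then $p \leftarrow p - \text{lr} \cdot v$), with the same `lr` and `momentum` values as above. It is an illustration of the mechanics, not a replacement for `torch.optim.SGD`.

```python
lr, momentum = 0.01, 0.9  # same values as the optimizer defined earlier
p = 0.0                   # a single hypothetical parameter
grad = 0.0                # plays the role of p.grad
velocity = 0.0            # momentum buffer kept by the optimizer

for step in range(500):
    grad = 0.0                               # 1. zero the gradient (otherwise it would accumulate)
    loss = (p - 3.0) ** 2                    # 2. forward pass
    grad += 2.0 * (p - 3.0)                  # 3. backward pass: gradients are *added* to grad
    velocity = momentum * velocity + grad    # 4. optimizer step with momentum
    p -= lr * velocity

print(f"p after training: {p:.6f}, final loss: {loss:.2e}")
```

If step 1 is removed, `grad` keeps growing from one iteration to the next, which is why `optimizer.zero_grad()` is called at every iteration.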


5. Evaluation

After training, you can evaluate the model on the test dataset to see how well it performs. We need to:
- Set the Model to Evaluation Mode: `model.eval()` sets the model to evaluation mode, which changes the behavior of layers such as dropout and batch normalization.
- Disable Gradient Calculation: `torch.no_grad()` disables gradient calculation, which reduces memory consumption and speeds up computation.
- Compute Accuracy: The model's predictions are compared to the true labels to compute the accuracy.


In [None]:
# Evaluation loop
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0
with torch.no_grad():  # Disable gradient calculation
    for data in test_dataloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # index of the highest logit = predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct / total:.2f}%')


## III: Attention and Transformers

In this part, we will implement the Transformer architecture, which has been a key architecture in deep learning since its introduction in 2017.

It was first applied to NLP, then to audio and, since 2020, to computer vision.
We will implement every block that makes up a transformer from scratch, with the goal of building a deep understanding of what is happening.
Here is a figure for the transformer architecture:


<img src="https://www.researchgate.net/profile/Miruna-Gheata/publication/355339249/figure/fig1/AS:1079476452622337@1634378650979/Encoder-decoder-architecture-of-the-Transformer-developed-by-Vaswani-et-al-28.ppm" width=768>

In [None]:
!pip install einops
!pip install timm

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
import math
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torchvision import transforms
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
from einops import rearrange, repeat
from PIL import Image
import requests

torch.manual_seed(3407)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### III-1. The attention mechanism

### Implementing the scaled dot-product attention mechanism

The transformer architecture is built around one key block: attention.
The idea behind attention is the following. Imagine you want to retrieve information from a dictionary. The dictionary is indexed by keys, each of which maps to a particular value. You have a query that is matched against the keys of the dictionary and, if there is a match, you retrieve the associated value.
Attention is very similar to this simple retrieval example. Real data does not come with this structure, so we are going to learn to create it.

We have 2 sets of vectors (also called tokens). One is $X_{to}$, the destination set. We want to map this set of tokens to queries. We achieve this with a linear projection of $X_{to}$ to obtain: $Q = W_QX_{to}$

The other set is $X_{from}$ the set from which we want to retrieve information. We will need to extract both keys and values from this set. We therefore do 2 linear projections of $X_{from}$ to obtain:  $K = W_KX_{from}$ and $V = W_VX_{from}$.

Now, contrary to the dictionary, where a query matches a key exactly, we have no exact matches here. Therefore, we perform a softer match by computing the similarity matrix between $Q$ and $K$. Then, for each query in $Q$, we want to output the values whose keys have the highest similarity. We therefore output the weighted sum of the values, weighted by the softmax of the similarity (also called the attention matrix).

Finally, the attention operation (written here in its general cross-attention form) is given by:

$$
A(Q,K,V) = \text{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

We divide the similarity by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys, for stability: it prevents the similarity from exploding with large vectors, which would lead to very sharp attention coefficients.
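
The effect of the scaling can be seen on a small numeric example. If the components of a query and a key are independent with unit variance, their dot product has variance $d_k$, so raw scores grow with the dimension. The plain-Python sketch below uses hypothetical scores for three keys with $d_k = 64$ and compares the attention weights with and without the division by $\sqrt{d_k}$.

```python
import math

def softmax(scores):
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0]  # hypothetical q.k values for three keys
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("without scaling:", [round(w, 4) for w in unscaled_weights])
print("with scaling:   ", [round(w, 4) for w in scaled_weights])
```

Without scaling, almost all of the weight goes to a single key, and the gradient through the softmax becomes very small. With scaling, the other keys still contribute.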

#### Question 1.
Implement the attention operation. Use `torch.einsum` to easily compute the similarity matrix.

In [None]:
class Attention(nn.Module):
    def __init__(self, x_to_dim, x_from_dim, hidden_dim):
        # To complete
        pass
        
    def forward(self, x_to, x_from):
        # x_to = [batch size, x_to_len, x_to_dim]
        # x_from = [batch size, x_from_len, x_from_dim]

        # To complete
        return x_to

### Multi-head attention

We improve the above attention implementation by introducing multi-head attention. The idea here is that we compute the attention on subspaces of the $Q,K,V$ triplets. 
We split each vector into $n$ chunks (one per head) and compute the attention for each chunk. At the end, we concatenate the attention outputs of all heads and project the result with an output projection.
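
The split and the final concatenation are pure reshaping operations. The plain-Python sketch below shows them on a single hypothetical vector of dimension 8 split into 4 heads. The helper names `split_heads` and `merge_heads` are assumptions for illustration, and the attention computation itself is deliberately left out.

```python
def split_heads(vector, n_heads):
    """Cut one vector into n_heads contiguous chunks of equal size."""
    if len(vector) % n_heads != 0:
        raise ValueError("the hidden dimension must be divisible by the number of heads")
    head_dim = len(vector) // n_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate the per-head chunks back into one vector."""
    return [value for head in heads for value in head]

q = [float(i) for i in range(8)]  # hypothetical query vector, hidden_dim = 8
heads = split_heads(q, n_heads=4)

print("per-head chunks:", heads)
print("merged back:    ", merge_heads(heads))
```

On a batched tensor of shape `[batch, len, hidden]`, the same split could be written with `einops` as `rearrange(q, "b l (h d) -> b h l d", h=n_heads)`, which is one possible layout among others.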

#### Question 2. 
Implement Multihead attention.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, x_to_dim, x_from_dim, hidden_dim, n_heads):
        # To complete
        pass

    def forward(self, x_to, x_from):
        # x_to = [batch size, x_to_len, x_to_dim]
        # x_from = [batch size, x_from_len, x_from_dim]

        # To complete
        return x_to

Multi-head attention is the form of attention used in transformers in practice. It comes in 2 flavors:
- Self Attention: $X_{to}$ attends to itself ($X_{to}=X_{from}$)
- Cross Attention: $X_{to}$ attends to a different set ($X_{to}\neq X_{from}$)


#### Question 3. 
Implement `MultiHeadSelfAttention` and `MultiHeadCrossAttention`.

In [None]:
class MultiHeadSelfAttention(MultiHeadAttention):
    def __init__(self, x_dim, hidden_dim, n_heads):
        # To complete
        pass

    def forward(self, x):
        # x = [batch size, x_len, x_dim]
        # To complete
        pass
    
class MultiHeadCrossAttention(MultiHeadAttention):
    def __init__(self, x_to_dim, x_from_dim, hidden_dim, n_heads):
        # To complete
        pass
    
    def forward(self, x_to, x_from):
        # x_to = [batch size, x_to_len, x_to_dim]
        # x_from = [batch size, x_from_len, x_from_dim]
        
        # To complete
        pass

### LayerNorm
Another key component of the transformer is LayerNorm. As we have previously seen, normalizing the output of a deep learning layer helps a lot with convergence and stability. 
Before Transformers, the most widely used normalization was BatchNorm, which normalizes the data along the batch dimension. However, this has a few problems:
- The normalization depends on the other samples in the batch.
- When using multiple GPUs, BatchNorm needs to synchronize the batch statistics across GPUs, which blocks the forward pass and slows down training.

The last point is the most important one. The Transformer aims to be an architecture that is easy to parallelize and cannot afford to use BatchNorm.

Instead, Transformers use LayerNorm. LayerNorm computes its statistics independently for each sample, which removes the synchronization issue. We normalize over the channel dimension instead of the batch dimension.

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*gat8a-TUnopoYN_veGEi0w.png">

To account for the loss of capacity, we then apply an element-wise affine transformation to the output, with a learned scale and bias.
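
As a worked example of the arithmetic, the plain-Python sketch below normalizes a single hypothetical token of dimension 4 using $y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the mean and (biased) variance over the channels of that token. The names `gamma`, `beta` and `eps` are assumptions for illustration. In a module, `gamma` and `beta` would be learned parameters.

```python
import math

def layer_norm(token, gamma, beta, eps=1e-5):
    """Normalize one token over its channels, then apply a learned scale and bias."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)  # biased variance
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(token, gamma, beta)]

token = [1.0, 2.0, 3.0, 4.0]  # hypothetical token with 4 channels
normalized = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)
shifted = layer_norm(token, gamma=[2.0] * 4, beta=[1.0] * 4)

print("normalized:", [round(v, 4) for v in normalized])
print("with scale 2 and bias 1:", [round(v, 4) for v in shifted])
```

Only the values of this one token are used, so the result does not depend on any other sample in the batch.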

#### Question 4.
Implement the LayerNorm

In [None]:
class LayerNorm(nn.Module):
    # To complete
    pass

### Feed-Forward Network

Finally, the last block is a feed-forward network with one hidden layer. In this lab, the hidden layer has a size of $2 \times input\_dim$ (the original Transformer uses a factor of 4). The block also includes a dropout layer and an activation function. Here, we will use leaky ReLU, with a leak parameter of 0.1.
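
For reference, leaky ReLU leaves positive inputs unchanged and multiplies negative inputs by the leak parameter. A minimal plain-Python sketch on a few hypothetical inputs:

```python
def leaky_relu(x, negative_slope=0.1):
    """Identity for positive inputs, a small linear slope for the others."""
    return x if x > 0 else negative_slope * x

inputs = [-2.0, -0.5, 0.0, 1.5]  # hypothetical pre-activation values
outputs = [leaky_relu(v) for v in inputs]

print(outputs)
```

In PyTorch, the corresponding module is `nn.LeakyReLU(negative_slope=0.1)`.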

#### Question 5.
Implement the FFN layer

In [None]:
class FFN(nn.Sequential):
    def __init__(self, hidden_dim, dropout_rate=0.1, expansion_factor=2):
        # To complete
        pass

### The Transformer block

The last things we are missing are the skip connections. As in ResNet, the transformer architecture uses skip connections, which allow for better gradient flow and help avoid vanishing gradients.
There is a skip connection around the attention and another around the feed-forward network.

#### Question 6.
Looking at the transformer figure at the top, implement the Transformer Encoder Block.

In [None]:
class TransformerEncoderBlock(nn.Module):
    def __init__(self, data_dim, hidden_dim, n_heads, dropout_rate=0.1):
        # To complete
        pass

    def forward(self, x):
        # x = [batch size, x_len, hidden dim]
        # To complete
        pass

### Positional embedding
The transformer architecture is permutation equivariant: if we swap two input tokens, the outputs are exactly the same, only swapped in the same way. In other words, the model has no notion of token order. However, the position of a token can be very important information. In an image, for example, we want the transformer to know when two pixels are close to each other, which is not the case for now.
That's why we introduce positional encodings. For each token, we add the positional encoding to the original token:

$$
X_i = X_i + PE(i)
$$

with $X_i$ the token at position $i$ in the sequence.

The most commonly used positional encodings are sinusoidal encodings. They are defined as follows:

$$
\begin{aligned}
PE(i, 2j) &= \sin\left(i / 10000^{\frac{2j}{d}}\right) \\
PE(i, 2j + 1) &= \cos\left(i / 10000^{\frac{2j}{d}}\right)
\end{aligned}
$$


Here $d$ is the dimension of the tokens, $i$ is the position of the token in the sequence, and $2j$ (resp. $2j + 1$) is the index of the dimension within the vector.
The idea is that we add sinusoids of different frequencies that encode the position across the dimensions of the token.

Another common choice is the learned positional encoding. We simply let the network learn a tensor $PE$ that matches the sequence length and the dimension of the tokens.
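
To see what these formulas produce, the plain-Python sketch below evaluates them for a tiny hypothetical dimension $d = 4$. The function name `sinusoidal_encoding` is an assumption for illustration, and $d$ is assumed to be even.

```python
import math

def sinusoidal_encoding(position, d):
    """Return the d-dimensional sinusoidal encoding of one position (d assumed even)."""
    encoding = [0.0] * d
    for j in range(d // 2):
        angle = position / (10000 ** (2 * j / d))
        encoding[2 * j] = math.sin(angle)      # even index: sine
        encoding[2 * j + 1] = math.cos(angle)  # odd index: cosine
    return encoding

pe_0 = sinusoidal_encoding(0, d=4)
pe_1 = sinusoidal_encoding(1, d=4)

print("position 0:", [round(v, 4) for v in pe_0])
print("position 1:", [round(v, 4) for v in pe_1])
```

The first pair of dimensions oscillates quickly with the position, while later pairs change more and more slowly. Every position therefore gets a distinct pattern.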

#### Question 7. 

Implement both Sinusoidal and Learned positional embeddings

In [None]:
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, hidden_dim):
        # To complete
        pass


    def forward(self, x):
        # x = [batch size, seq len, hidden dim]
        # To complete
        pass

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, hidden_dim, max_len):
        # To complete
        pass
    
    def forward(self, x):
        # x = [batch size, seq len, hidden dim]
        # To complete
        pass

### The transformer encoder
Now you have everything you need to implement the transformer encoder. You add the positional encoding to the tokens and then stack N transformer encoder layers.

#### Question 8. 
Implement the transformer encoder with `n_layers` layers and the ability to choose either positional embedding.

Tip: Look into `ModuleList`

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, data_dim,  hidden_dim, n_heads, n_layers, dropout_rate=0.1, positional_encoding="sinusoidal", max_len=1000):
        # To complete
        pass
    
    def forward(self, x):
        # x = [batch size, seq len, hidden dim]
        # To complete
        pass

## IV: Vision Transformers (ViT)

The above architecture was introduced in 2017 to process sequences of text tokens. However, it can also be leveraged for computer vision.

Unlike convolutional networks, a transformer builds in few image-specific inductive biases. If the model learns these biases from the data instead, we can hope for better performance. This however requires a lot of data and compute.

To apply the transformer to images, one key question remains to be answered: how do we transform an image into tokens? The approach introduced in Vision Transformers is to cut the image into patches, each of which is then transformed into a token through a linear projection.

We also add an extra token, known as the classification token, which is the token used for prediction. After going through the N transformer layers, this is the token that goes through a multi-layer perceptron.
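
As a quick sanity check on the shapes involved, the sketch below computes the length of the token sequence and the number of raw values per patch. The helper `vit_sequence_shape` is a hypothetical name, and it assumes square images whose side is divisible by the patch size.

```python
def vit_sequence_shape(img_size, patch_size, channels=3):
    """Return (sequence length including the classification token, values per patch)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    patches_per_side = img_size // patch_size
    n_patches = patches_per_side ** 2
    values_per_patch = channels * patch_size ** 2  # flattened patch fed to the linear projection
    return n_patches + 1, values_per_patch         # +1 for the classification token

# 96x96 RGB image cut into 16x16 patches
seq_len, values_per_patch = vit_sequence_shape(img_size=96, patch_size=16)
print(seq_len, values_per_patch)  # 37 768

# 32x32 CIFAR10 image with an (assumed) patch size of 4
print(vit_sequence_shape(img_size=32, patch_size=4))  # (65, 48)
```

Smaller patches give a longer sequence, which increases the cost of self-attention but keeps finer spatial detail.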


<img src= "https://1.bp.blogspot.com/-_mnVfmzvJWc/X8gMzhZ7SkI/AAAAAAAAG24/8gW2AHEoqUQrBwOqjhYB37A7OOjNyKuNgCLcBGAsYHQ/s1600/image1.gif" width="512">


#### Question 9

Implement the `PatchEmbeddings` class, which transforms an image into a sequence of patch tokens. 

Hint: Use an `nn.Conv2d` with the right kernel size and stride to do the linear projection of non-overlapping patches.

In [None]:
class PatchEmbeddings(nn.Module):
    def __init__(self, img_size=96, patch_size=16, hidden_dim=512):
        super().__init__()
        # To complete
        pass

    def forward(self, x):
        # To complete
        pass

#### Question 10

Implement the vision transformer

Hint: Reuse `PatchEmbeddings` (an `nn.Conv2d` with the right kernel size and stride) to do the linear projection of non-overlapping patches.

In [None]:
class ViT(nn.Module):
    def __init__(self, img_size, patch_size, hidden_dim, n_heads, n_layers, n_classes, dropout_rate):
        super(ViT, self).__init__()
        # To complete
        pass

    def forward(self, x, apply_mlp_head=True):
        # x: (batch size, 3, image height, image width)
        # To complete
        pass

#### Question 11 
Train a ViT on CIFAR10 for 100 epochs (if compute is limited, you can use only 20 epochs) and log both train and test loss and accuracy. 
We provide a data augmentation strategy called AutoAugment to avoid overfitting on the training data.
Hyperparameters are to be chosen at your discretion.


Tips for hyperparameters:
- Keep the transformer hidden dimension small (below 256)
- Use a small patch size
- Use AdamW with some weight decay to avoid overfitting
- Use between 2 and 6 transformer layers.
- Use between 2 and 4 transformer heads

In [None]:
batch_size = 128
train_set = CIFAR10(root='./data', train=True, download=True, transform=transforms.Compose([
    transforms.autoaugment.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
]))

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)

test_set = CIFAR10(root='./data', train=False, download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
]))

test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=4)

In [None]:
# To complete: train the model, (don't forget to test it)

In [None]:
# Plot the training and test loss

# V: Vision Language Models (VLM)

In this section, we will implement a Vision Language Model and finetune one for a Visual Question Answering task.

Here is a diagram for the VLM architecture:

<img src="https://raw.githubusercontent.com/AviSoori1x/seemore/refs/heads/main/images/vlm.png" width=512>

We have already implemented the vision encoder. We will now implement the text decoder (also based on the transformer architecture) and a multimodal projector, aiming to build a clear understanding of what each component does.

## V-1. The Multimodal Projector 

The ViT class (implemented in the last section) takes an input image and returns the embedding corresponding to the class token (CLS), which is then used to condition the text generation in the language decoder. 

However, we cannot directly concatenate this embedding to the text embeddings. We need to project it from the dimensionality of the image embeddings produced by the vision transformer to the dimensionality of the text embeddings. This is done by the multimodal projector.

This projector is usually a single learnable layer followed by a non-linearity, or an MLP. Here we will implement an MLP with one hidden layer, an expansion factor of $4$ and a GELU activation function. 

#### Question 12
Implement the Multimodal Projector

In [None]:
class MultiModalProjector(nn.Module):
    # To complete
    pass

## V-2. The Language Transformer Decoder

The final component we need to look at is the decoder language model. 
The (text) Transformer Decoder is similar to the (Vision) Transformer Encoder defined before. The main differences are: 
- A text token embedding replaces the patch embedding. 
- Causal Self Attention replaces Self Attention in the Transformer Block. 
- A language modeling head is added on top of the last Transformer Block. 


### Causal Attention

In Causal Attention, masking is applied in each attention head to hide any information that follows the current token's position, so that the model attends only to the preceding parts of the sequence. A token cannot attend to the "future" (following) tokens in the sentence. 

In practice, a mask $M$ is added to the similarity matrix between $Q$ and $K$: $M$ is $0$ on and below the diagonal (the lower triangular part) and $-\infty$ strictly above it, so that the softmax assigns zero weight to future tokens. 

The causal attention operation is then given by: 
$$
A_{\text{causal}}(Q,K,V) = \text{SoftMax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V
$$
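
The effect of the mask can be checked on a toy example. The sketch below uses plain Python lists rather than tensors; the helper names `causal_mask` and `masked_softmax_rows` and the score values are made up for illustration. It builds a mask that is $0$ on and below the diagonal and $-\infty$ above it, then applies the row-wise softmax from the formula above.

```python
import math

def causal_mask(seq_len):
    """0 where key position j <= query position i, -inf where j is in the future."""
    return [[0.0 if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax_rows(scores, mask, d_k):
    """Row-wise SoftMax((scores + mask) / sqrt(d_k))."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        scaled = [(s + m) / math.sqrt(d_k) for s, m in zip(score_row, mask_row)]
        top = max(scaled)                           # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in scaled]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3x3 similarity matrix standing in for Q K^T (values are arbitrary)
scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 1.0],
          [0.5, 0.0, 2.0]]
weights = masked_softmax_rows(scores, causal_mask(3), d_k=4)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, the second splits its attention between the first two tokens, and only the last token sees the whole sequence.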

<img src="https://blog.sailor.plus/deep-learning/images/1613723693323.png" width=512> 


#### Question 13
Implement the Causal Multi Head Self Attention 

Hint: You can start from the class implemented before.

In [None]:
class CausalMultiHeadSelfAttention(nn.Module):
    def __init__(self, x_dim, hidden_dim, n_heads):
        # To complete
        pass

    def forward(self, x):
        # x = [batch size, x_len, x_dim]
        # To complete
        return x

### Transformer Decoder Block

Now we implement a Transformer Decoder Block using the Causal Attention. 

#### Question 14

Implement the Transformer Decoder Block

In [None]:
class TransformerDecoderBlock(nn.Module):
    def __init__(self, data_dim, hidden_dim, n_heads, dropout_rate=0.1):
        # To complete
        pass
    
    def forward(self, x):
        # To complete
        pass

## V-3. Building the Language Transformer Decoder

#### Question 15

Implement the Language Transformer Decoder

In [None]:
class LanguageTransformerDecoder(nn.Module):
    def __init__(self, n_embd, image_embed_dim, vocab_size, n_heads, n_layers):
        # To complete
        pass

    def forward(self, idx, image_embeds):
        # To complete
        pass
    
    def generate(self, idx, image_embeds, max_new_tokens):
        ### Do not complete this method ### 
        # With the logits outputted by the forward method 
        # and using sampling methods (e.g., greedy sampling, top-k sampling, nucleus sampling)
        # we can generate some text tokens!! 
        pass

With the logits outputted by the forward method and using sampling methods (e.g., greedy sampling, top-k sampling, nucleus sampling), we can generate some text tokens!
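
To illustrate two of these sampling methods, here is a minimal sketch that picks the next token from a list of logits using only the Python standard library. The function names and the logit values are hypothetical; a real `generate` method would apply the same idea to the logits of the last position at every step.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_sample(logits):
    """Greedy sampling: always pick the token with the largest logit."""
    return max(range(len(logits)), key=lambda t: logits[t])

def top_k_sample(logits, k, rng):
    """Top-k sampling: keep the k largest logits, renormalise, then draw one token."""
    top_tokens = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
    probs = softmax([logits[t] for t in top_tokens])
    return rng.choices(top_tokens, weights=probs, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0, 0.0]   # toy logits over a vocabulary of 5 tokens
rng = random.Random(0)                 # seeded generator for reproducibility
next_greedy = greedy_sample(logits)
next_top_k = top_k_sample(logits, k=2, rng=rng)
print(next_greedy, next_top_k)
```

Greedy sampling is deterministic, whereas top-k sampling trades some likelihood for diversity by drawing among the k most likely tokens.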

## V-4. Bringing everything together to implement the Vision Language Model

Now, we have all the elements to build our Vision Language Model. 

#### Question 16

Implement the Vision Language Model class

In [None]:
class VisionLanguageModel(nn.Module):
    def __init__(
            self,
            n_embd,
            image_embed_dim,
            vocab_size,
            n_enc_layers,
            img_size, patch_size,
            n_heads,
            n_dec_layers,
            dropout_rate
        ):
        # To complete
        pass

    def forward(self, img_array, idx):
        # To complete
        pass

    def generate(self, img_array, idx, max_new_tokens):
        # To complete
        pass

## VI. Finetuning a VLM for Visual Question Answering task

Training a Vision Language Model from scratch requires a lot of training data and GPU resources. Instead, we are going to fine-tune an already pretrained VLM on a specific task: Visual Question Answering.  

We will fine-tune the recent VLM `Florence-2` from Microsoft on the DocVQA dataset.

Let's start by downloading the model and the dataset.

In [None]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

data = load_dataset("HuggingFaceM4/DocumentVQA", split=["train[:10%]", "validation[:10%]", "test[:10%]"])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True, revision='refs/pr/6').to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True, revision='refs/pr/6')

torch.cuda.empty_cache()

Let's do inference with our dataset first to see how the model performs before fine-tuning.

In [None]:
# Function to run the model on an example
def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

In [None]:
for idx in range(3):
    print(run_example("DocVQA", 'What is written on top of the document?', data[0][idx]['image']))
    display(data[0][idx]['image'].resize([350, 350]))

We need to construct our dataset. Note how we are adding a new task prefix `<DocVQA>` before the question when constructing the prompt.

In [None]:
from torch.utils.data import Dataset

class DocVQADataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        question = "<DocVQA>" + example['question']
        first_answer = example['answers'][0]
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return question, first_answer, image

Let's get to fine-tuning. We will instantiate our dataset, the data collator, and start training.

In [None]:
import os
from torch.utils.data import DataLoader, Subset
from tqdm import tqdm
from transformers import AutoProcessor, get_scheduler
from bitsandbytes.optim import AdamW

def collate_fn(batch):
    questions, answers, images = zip(*batch)
    inputs = processor(text=list(questions), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, answers

# Create datasets
train_dataset = DocVQADataset(data[0])
val_dataset = DocVQADataset(data[1])

Fine-tuning the model on the entire dataset would take too long for us (around 2.5 hours per epoch) and could be too heavy for the VRAM of the GPU we are using, so we reduce the size of the dataset using `Subset`.

You can adapt the size of the subsets depending on your GPU. 

If you have time, run the fine-tuning on the entire dataset: the results will be even better!

In [None]:
# Use a subset of the dataset for training
train_dataset = Subset(train_dataset, list(range(0, 1000)))
val_dataset = Subset(val_dataset, list(range(0, 200)))

# Create DataLoader
batch_size = 2
num_workers = 0

train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers)

#### Question 17 (Bonus)

Complete the training loop in the `train_model` function.

In [None]:
def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    for epoch in range(epochs):
        model.train()
        train_loss = 0
        i = -1
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            i += 1
            inputs, answers = batch

            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)

            ### Complete here:  ###
            # Get the outputs (logits and loss) from the model
            # Calculate the loss
            # Backpropagate, then step the optimizer and the LR scheduler and reset the gradients


            ### End completion ###

            train_loss += loss.item()

        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")

        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, answers = batch

                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)

                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss

                val_loss += loss.item()

        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")

        # Save model checkpoint
        output_dir = f"./model_checkpoints/epoch_{epoch+1}"
        os.makedirs(output_dir, exist_ok=True)
        model.save_pretrained(output_dir)
        processor.save_pretrained(output_dir)

We will freeze the image encoder for this lab. The authors reported an improvement when unfreezing the image encoder, but note that this results in higher resource usage.

In [None]:
for param in model.vision_tower.parameters():
    param.requires_grad = False  # excludes the parameter from gradient computation
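
In PyTorch, a parameter is excluded from gradient computation when its `requires_grad` flag is `False`. One way to verify that freezing worked is to count trainable and frozen parameters. The sketch below does this on a tiny stand-in module; the `toy` model and its layer sizes are hypothetical and only mimic the idea of a model with a `vision_tower` submodule.

```python
import torch.nn as nn

# Tiny stand-in for a model with a vision encoder and a language part (hypothetical sizes)
toy = nn.ModuleDict({
    "vision_tower": nn.Linear(8, 4),
    "language_model": nn.Linear(4, 2),
})

# Freeze the vision part only
for param in toy["vision_tower"].parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}")
```

The same two sums can be run on the real model to check how many parameters will actually be updated during fine-tuning.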

Note: if the following cell crashes with the error `OutOfMemoryError: CUDA out of memory.`, try reducing the batch size and/or the number of examples in the train/validation sets.

In [None]:
train_model(train_loader, val_loader, model, processor, epochs=2)

Let's do inference with our finetuned model:

In [None]:
for idx in range(3):
    # `data` is a list of splits (train, validation, test), so the train split is data[0]
    print(run_example("DocVQA", 'What is written on top of the document?', data[0][idx]['image']))
    display(data[0][idx]['image'].resize([350, 350]))