# Assignment 11.1 - Transformer

Please submit your solution to this notebook on the Whiteboard under the corresponding assignment entry, both as an .ipynb file and as a .pdf.

#### Please state both names of your group members here:
Farah Ahmed Atef Abdelhameed Hafez

## Task 11.1.1: Self-Attention

Implement the attention mechanism yourself. You are free to use torch and numpy to speed up the matrix multiplications, but please do not simply use their built-in transformer implementation.
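As a point of reference, single-head scaled dot-product attention can be sketched as a plain function. This is a minimal illustration, not the required solution: the function name `scaled_dot_product_attention` and the tensor shapes are assumptions chosen for the example, and the learned Q/K/V projections are left out.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for single-head attention.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    """
    d_k = q.size(-1)
    # Similarity of every query with every key: (batch, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalise over the key axis so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values: (batch, seq_len, d_k)
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 5, 5])
```

Dividing by the square root of `d_k` keeps the scores in a range where the softmax does not saturate as the embedding size grows.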

The image below shows the design of one Encoder Block. We want you to set up this block. Please use your own implementation of Self-Attention (it doesn't have to be multi-head) and build the Add & Norm and Feed Forward layers on top of it. The Add & Norm and Feed Forward layers may use implementations from PyTorch or another library; only the Self-Attention function has to be your own.
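Written as equations, one such block could look as follows. This is a sketch under assumptions: normalisation is applied after each residual addition, the feed-forward part is a two-layer ReLU network, and the symbols $W_Q, W_K, W_V, W_1, W_2, b_1, b_2$ and $d_k$ are notation introduced here (biases of the Q, K and V projections are omitted for brevity).

$$
\begin{aligned}
Q &= X W_Q, \quad K = X W_K, \quad V = X W_V \\
A &= \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \\
H &= \operatorname{LayerNorm}(X + A) \\
Y &= \operatorname{LayerNorm}\bigl(H + \operatorname{FFN}(H)\bigr), \qquad \operatorname{FFN}(H) = \max(0,\, H W_1 + b_1)\, W_2 + b_2
\end{aligned}
$$

Here $X$ is the input sequence, $A$ the attention output, $H$ the result of the first Add & Norm, and $Y$ the block output, which has the same shape as $X$.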

* Show that your model block works by forwarding a randomly initialized tensor through it once. Print the values of the random input tensor, the output tensor, and the Q, K and V matrices. **(RESULT)**

In [2]:
from IPython.display import Image
Image(url="https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure.ppm", height=300)

In [1]:
import torch.nn as nn
import torch
import torchvision
import torchvision.transforms as T

# Build self-attention

In [3]:
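class self_attention(nn.Module):
  """Single-head scaled dot-product self-attention."""
  def __init__(self, embed_size):
    super().__init__()
    self.embed_size = embed_size
    # Learned projections that map the input to queries, keys and values
    self.q = nn.Linear(embed_size, embed_size)
    self.k = nn.Linear(embed_size, embed_size)
    self.v = nn.Linear(embed_size, embed_size)

  def forward(self, x):
    # x: (batch, seq_len, embed_size)
    query = self.q(x)
    key = self.k(x)
    value = self.v(x)

    # Attention scores, scaled by sqrt(d_k) with d_k = embed_size
    S = query.matmul(key.transpose(-2, -1))
    S = S / (self.embed_size ** 0.5)
    # Normalise over the key axis so each row of weights sums to 1
    S = torch.softmax(S, dim=-1)
    return S.matmul(value), query, key, value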


# Build Encoder

In [4]:
class encoder(nn.Module):
  def __init__(self, embed_size, lndim):
    super().__init__()
    self.self_attention = self_attention(embed_size)
    self.norm1 = nn.LayerNorm(embed_size)
    self.norm2 = nn.LayerNorm(embed_size)
    # Position-wise feed-forward network: embed_size -> lndim -> embed_size
    self.ln1 = nn.Linear(embed_size, lndim)
    self.ln2 = nn.Linear(lndim, embed_size)
    self.act = nn.ReLU()

  def forward(self, x):
    output, Q, K, V = self.self_attention(x)
    # Add & Norm around the attention sub-layer
    x = self.norm1(x + output)
    # Add & Norm around the feed-forward sub-layer
    x = self.norm2(x + self.ln2(self.act(self.ln1(x))))
    return x, Q, K, V


# Test

In [5]:
x = torch.randn(3, 28, 28)

block = encoder(28, 56)

out, Q, K, V = block(x)

print("Input:\n", x)
print("\nQ:\n", Q)
print("\nK:\n", K)
print("\nV:\n", V)
print("\nOutput:\n", out)

Input:
 tensor([[[ 0.3407, -0.7186,  0.2485,  ...,  0.6806, -0.4684,  0.0272],
         [-0.3070,  0.9536,  0.5065,  ..., -0.0046, -1.4808, -0.4607],
         [ 0.2990,  0.8637, -0.2279,  ..., -1.8814, -0.1344,  0.1014],
         ...,
         [-2.2575,  1.3466, -0.7011,  ...,  1.2885,  0.8074, -0.7119],
         [ 0.2386,  0.5361,  0.1436,  ...,  0.6470, -0.5483,  1.6475],
         [ 0.5114, -0.8427,  0.8347,  ...,  0.4807,  0.4986, -0.6574]],

        [[ 1.3771,  0.4581,  1.1626,  ...,  0.6905, -1.7329,  0.2714],
         [-0.9778, -0.5344, -1.7767,  ..., -0.7071, -0.0103,  0.5246],
         [-0.2135, -0.9023,  0.3432,  ...,  0.9843, -0.7998,  0.8360],
         ...,
         [-0.6046, -1.9588, -0.6948,  ..., -0.0039,  0.5153, -1.1242],
         [ 0.5512, -0.6040,  0.6258,  ...,  0.1885,  0.1005,  0.9686],
         [-0.5089,  2.0722, -0.9525,  ...,  0.7120,  2.1778, -1.0791]],

        [[ 1.1564,  0.7803, -0.5001,  ...,  0.3236, -1.5500, -2.1761],
         [-0.7173, -0.2090,  0.9867, 

### Task 11.2 Use your own Transformer Block

* Chain 3 of your transformer blocks to set up a model. Put one fully connected layer on top as the classification head. **(RESULT)**
* Train your model on the MNIST dataset for image classification. **(RESULT)**
* Report the test accuracy after training. **(RESULT)**

Can you make your own attention work? :)
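One possible way to feed MNIST images to such a model is to treat each of the 28 image rows as one token with 28 features, then average over the tokens before the classification head. The shape walk-through below is only a sketch of that assumption; the random tensor stands in for a real batch, and `head` is a hypothetical name for the fully connected layer.

```python
import torch

# Stand-in for one MNIST batch: (batch, channels, height, width)
batch = torch.randn(32, 1, 28, 28)

# Drop the single channel axis: 28 row-tokens with 28 features each
tokens = batch.squeeze(1)            # (32, 28, 28)

# Encoder blocks would be applied here; they preserve this shape.

# Average over the token axis to get one vector per image
pooled = tokens.mean(dim=1)          # (32, 28)

# Fully connected head producing one logit per digit class
head = torch.nn.Linear(28, 10)
logits = head(pooled)                # (32, 10)

print(tokens.shape, pooled.shape, logits.shape)
```

Mean pooling is one simple choice for collapsing the sequence; a dedicated classification token would be an alternative.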

# Build simple transformer

In [6]:
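class transformer(nn.Module):
  def __init__(self, embed_size, lndim):
    super().__init__()
    self.encoder1 = encoder(embed_size, lndim)
    self.encoder2 = encoder(embed_size, lndim)
    self.encoder3 = encoder(embed_size, lndim)
    # Classification head: one logit per MNIST digit class
    self.fc = nn.Linear(embed_size, 10)

  def forward(self, x):
    # (batch, 1, 28, 28) -> (batch, 28, 28): each image row becomes one token
    x = x.squeeze(1)
    # Q, K and V are returned by each block but are not needed here
    x, _, _, _ = self.encoder1(x)
    x, _, _, _ = self.encoder2(x)
    x, _, _, _ = self.encoder3(x)
    # Mean-pool over the token axis before the classification head
    m = torch.mean(x, dim=1)
    return self.fc(m)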

# Load Data

In [7]:
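# MNIST images are already 28x28, so Resize(28) leaves them unchanged;
# ToTensor converts each image to a float tensor of shape (1, 28, 28) in [0, 1]
transform = T.Compose([
  T.Resize(28),
  T.ToTensor()
])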

train_set = torchvision.datasets.MNIST(
  root="./../datasets", train=True, download=True, transform=transform
)
test_set = torchvision.datasets.MNIST(
  root="./../datasets", train=False, download=True, transform=transform
)

train_loader = torch.utils.data.DataLoader(train_set, shuffle=True, batch_size=32)
test_loader = torch.utils.data.DataLoader(test_set, shuffle=False, batch_size=32)




# Build Training logic

In [8]:
def train(model, train_loader, optimizer, criterion, epoch, device):
    # `epoch` is the number of passes over the training set
    model.train()
    for i in range(epoch):
        totalloss = 0
        for batch_x, batch_y in train_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            totalloss += loss.item()
            loss.backward()
            optimizer.step()
        # Mean of the per-batch losses for this epoch
        print("Epoch: ", i, "Average Training Loss: ", totalloss/len(train_loader))




# Build Test logic

In [9]:
def test(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    # No gradients are needed for evaluation
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            # Predicted class = index of the largest logit
            predicted = output.argmax(dim=1)
            total += batch_y.size(0)
            correct += (predicted == batch_y).sum().item()

    print("Accuracy of the network:", correct / total)




# Train and Test on MNIST

In [10]:
# embed_size=28 (pixels per image row), feed-forward width=56
model = transformer(28, 56)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Train for 10 epochs, then report accuracy on the held-out test set
train(model, train_loader, optimizer, criterion, 10, device)
test(model, test_loader, device)

Epoch:  0 Average Training Loss:  0.887586820936203
Epoch:  1 Average Training Loss:  0.5223461686929067
Epoch:  2 Average Training Loss:  0.42644229337771733
Epoch:  3 Average Training Loss:  0.37761253734032313
Epoch:  4 Average Training Loss:  0.3448488276839256
Epoch:  5 Average Training Loss:  0.32116650668382646
Epoch:  6 Average Training Loss:  0.30450626935958863
Epoch:  7 Average Training Loss:  0.2878247905880213
Epoch:  8 Average Training Loss:  0.27475180576841035
Epoch:  9 Average Training Loss:  0.2656788719167312
Accuracy of the network: 0.9076


## Congratz, you made it! :)