## Week 6 : Large Language Models
```
- Generative Artificial Intelligence (Fall semester 2023)
- Professor: Muhammad Fahim
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>

## Contents
```
1. Transformers (Implementing a transformer)
2. Self-Attention
3. Multi-headed attention
4. Positional Encoding

```

<hr>


# Transformers

* [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) -- the paper that introduced the Transformer

![](http://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png)


In [None]:
import torch
from torch import nn
import torch.optim as optim
import pandas as pd
import numpy as np

from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Transformer Encoder with PyTorch

In [None]:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=32)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)



In [None]:
encoder_layer

TransformerEncoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
  )
  (linear1): Linear(in_features=512, out_features=2048, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=2048, out_features=512, bias=True)
  (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
)
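
A quick way to check the built-in encoder is to push a random batch through it. This is a minimal sketch: the sequence length (10) and batch size (2) are arbitrary values chosen for illustration. Since `batch_first` defaults to `False`, the input layout is `(seq_len, batch, d_model)`.

```python
import torch
from torch import nn

torch.manual_seed(0)

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=32)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
transformer_encoder.eval()  # switch off dropout for a deterministic pass

# Illustrative input: 10 tokens, batch of 2, embedding size 512
src = torch.rand(10, 2, 512)
with torch.no_grad():
    out = transformer_encoder(src)

# The encoder keeps the input shape: one 512-dimensional vector per token
print(out.shape)
```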

## Encoder

The encoder contains a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. <br>
**The main goal is to efficiently encode the data**

![](http://jalammar.github.io/images/t/encoder_with_tensors.png)

## Self-Attention

**Keep in mind: the main goal is to encode the data efficiently.** In other words, the goal is to create meaningful embeddings.<br>
- As the model processes each word (each position in the input sequence), self-attention lets it look at other positions in the input sequence for clues that lead to a better encoding of that word.


**How does Self-Attention work?**

Steps:
1. For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.
  - What are the **`Query`**, **`Key`**, and **`Value`** vectors? They are abstractions that are useful for calculating attention. Each one is a learned linear projection of the word embedding.
2. Calculate the self-attention scores as dot products between the **`Query`** and **`Key`** vectors.
3. Divide the scores by 8, the square root of the key dimension $d_k = 64$ used in the paper (this leads to more stable gradients)
4. Pass the result through a softmax operation (the softmax score determines how much each word will be expressed at this position)
5. Multiply each value vector by its softmax score
6. Sum up the weighted value vectors

### Step 1

For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.

![](http://jalammar.github.io/images/t/transformer_self_attention_vectors.png)

In [None]:
# simple sequence = I am here today
simple_sequence_embedding = torch.rand(( 4, 512))

# Create weight matrices
W_k, W_v, W_q = torch.normal(0, 0.1, (3, 512, 7))

# Create key, query and value for each word in the sentence
keys = simple_sequence_embedding @ W_k
values = simple_sequence_embedding @ W_v
queries = simple_sequence_embedding @ W_q

In [None]:
queries

tensor([[ 0.2276, -1.4653,  0.7111,  1.1006, -0.6690,  3.5245,  0.6591],
        [-1.8327, -0.3299, -0.0965,  1.1251,  0.3615,  1.6877,  0.2623],
        [-1.5825, -0.2237,  0.8040,  1.7383, -1.0828,  1.7392,  0.3550],
        [-0.1077, -0.4872,  0.7023,  0.8926,  0.2044,  0.8951,  0.1703]])

In [None]:
simple_sequence_embedding

tensor([[0.3371, 0.4011, 0.6598,  ..., 0.8202, 0.7766, 0.7869],
        [0.4986, 0.4135, 0.9754,  ..., 0.9567, 0.1487, 0.7978],
        [0.3638, 0.9830, 0.6659,  ..., 0.7042, 0.2075, 0.9717],
        [0.6925, 0.3755, 0.2811,  ..., 0.9970, 0.9085, 0.6525]])

### Step 2

Calculate the self-attention scores from the **`Query`** and **`Key`** vectors

In [None]:
scores = queries @ keys.T
scores

tensor([[-5.0016, -2.4316,  0.3968, -0.2446],
        [ 3.0838,  3.2170,  6.6662,  4.4395],
        [-4.4791, -1.1210,  0.4245,  0.2488],
        [ 0.5970,  2.1744,  2.3622,  1.9029]])

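### Step 3
Divide the scores by 8 (this leads to more stable gradients). In general the divisor is $\sqrt{d_k}$: 8 is the value for the paper's $d_k = 64$, while for the 7-dimensional keys created in Step 1 it would be $\sqrt{7} \approx 2.65$.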

In [None]:
scores = scores / 8
scores

tensor([[-0.6252, -0.3040,  0.0496, -0.0306],
        [ 0.3855,  0.4021,  0.8333,  0.5549],
        [-0.5599, -0.1401,  0.0531,  0.0311],
        [ 0.0746,  0.2718,  0.2953,  0.2379]])
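
To see why the scaling helps, here is a small sketch. It assumes query and key components are independent with unit variance, which is an idealisation. Under that assumption, the dot product of two $d_k$-dimensional vectors has a standard deviation of about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1, so the softmax is less likely to saturate.

```python
import torch

torch.manual_seed(0)

d_k = 64
# 10,000 random query/key pairs with unit-variance components (an assumption)
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)     # one dot product per pair
scaled = raw / d_k ** 0.5     # divide by sqrt(d_k), which is 8 for d_k = 64

# raw spreads out roughly like sqrt(d_k); scaled stays close to 1
print(raw.std().item(), scaled.std().item())
```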

### Step 4

Pass the result through a softmax operation

In [None]:
scores = torch.softmax(scores, dim = 1)
scores

tensor([[0.1625, 0.2240, 0.3190, 0.2945],
        [0.2098, 0.2133, 0.3283, 0.2486],
        [0.1620, 0.2465, 0.2990, 0.2925],
        [0.2154, 0.2624, 0.2686, 0.2536]])

### Steps 5 & 6

* Multiply each value vector by the softmax score
* Sum up the weighted value vectors



In [None]:
scores.shape, values.shape

(torch.Size([4, 4]), torch.Size([4, 7]))

In [None]:
z = scores @ values
z

tensor([[-1.0859,  2.4785, -0.8251,  0.2181, -0.3534, -1.0101,  0.9283],
        [-1.1282,  2.5314, -0.7973,  0.2624, -0.2797, -1.0180,  0.9657],
        [-1.0696,  2.4955, -0.8386,  0.2124, -0.3556, -1.0174,  0.9108],
        [-1.0750,  2.5436, -0.8420,  0.2492, -0.2894, -1.0248,  0.9192]])
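
As a recap, the six steps can be collected into one function. This is a sketch rather than a reference implementation. The name `self_attention` is made up for illustration, and it scales by $\sqrt{d_k}$ computed from the key size instead of the fixed 8 used in the cells above.

```python
import torch


def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over one sequence x of shape (seq_len, dim)."""
    queries, keys, values = x @ W_q, x @ W_k, x @ W_v  # step 1
    scores = queries @ keys.T                          # step 2
    scores = scores / keys.size(-1) ** 0.5             # step 3: divide by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # step 4
    return weights @ values, weights                   # steps 5 and 6


torch.manual_seed(0)
x = torch.rand(4, 512)                             # 4 words, 512-dim embeddings
W_q, W_k, W_v = torch.normal(0, 0.1, (3, 512, 7))  # same shapes as in Step 1
z, weights = self_attention(x, W_q, W_k, W_v)

print(z.shape, weights.sum(dim=-1))  # every row of weights sums to 1
```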

# Multi-headed attention

**GOAL**:
1. Expand the model’s ability to focus on different positions
2. Provide the attention layer multiple “representation subspaces”

**Attention with $N$ heads just means repeating the self-attention algorithm $N$ times and joining the results**


![](https://data-science-blog.com/wp-content/uploads/2022/01/mha_img_original.png)

**Multi-headed attention steps:**
1. Same as the self-attention calculation, just $N$ different times with different weight matrices
2. Condense the $N$ z matrices down into a single matrix by concatenating them, then multiply the result by an additional weight matrix `WO`

The output z matrix is then fed to the FFNN

In [None]:
from torch import Tensor
import torch.nn.functional as f


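def scaled_dot_product_attention(query, key, value):
  # query, key, value: (batch, seq_len, dim); bmm multiplies per batch element
  temp = query.bmm(key.transpose(1, 2))      # raw scores: (batch, seq_len, seq_len)
  scale = query.size(-1) ** 0.5              # sqrt(d_k)
  softmax = f.softmax(temp / scale, dim=-1)  # attention weights, each row sums to 1
  return softmax.bmm(value)                  # weighted sum of the value vectors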

## Now let's make an attention head

In [None]:
class AttentionHead(nn.Module):
  def __init__(self, dim_in, dim_q, dim_k):
    super().__init__()
    self.q = nn.Linear(dim_in, dim_q)
    self.k = nn.Linear(dim_in, dim_k)
    self.v = nn.Linear(dim_in, dim_k)

  def forward(self, query, key, value):
    return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

## Multi Head Attention

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, number_of_heads, dim_in, dim_q, dim_k):
    super().__init__()
    self.heads = nn.ModuleList([AttentionHead(dim_in, dim_q, dim_k) for _ in range(number_of_heads)])
    self.linear = nn.Linear(number_of_heads * dim_k, dim_in)

  def forward(self, query: Tensor, key: Tensor, value: Tensor):
    z = self.linear(torch.cat([h(query, key, value) for h in self.heads], dim=-1))
    return z
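
The class above keeps one `AttentionHead` module per head, which is easy to read. A common alternative, sketched here under the assumption that `dim_model` is divisible by `num_heads`, computes all heads with one fused projection and a reshape. The name `FusedMultiHeadAttention` is hypothetical, and this version covers self-attention only (a single input `x`), unlike the query/key/value interface above.

```python
import torch
from torch import nn


class FusedMultiHeadAttention(nn.Module):
    """Self-attention with all heads computed in one batched matmul (illustrative)."""

    def __init__(self, dim_model=512, num_heads=8):
        super().__init__()
        assert dim_model % num_heads == 0, "dim_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.dim_head = dim_model // num_heads
        self.qkv = nn.Linear(dim_model, 3 * dim_model)  # Q, K, V for every head at once
        self.out = nn.Linear(dim_model, dim_model)      # plays the role of WO

    def forward(self, x):
        batch, seq_len, dim_model = x.shape
        qkv = self.qkv(x).view(batch, seq_len, 3, self.num_heads, self.dim_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (batch, heads, seq, dim_head)
        scores = q @ k.transpose(-2, -1) / self.dim_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        z = weights @ v                                 # (batch, heads, seq, dim_head)
        z = z.transpose(1, 2).reshape(batch, seq_len, dim_model)  # concatenate the heads
        return self.out(z)


torch.manual_seed(0)
mha = FusedMultiHeadAttention(dim_model=64, num_heads=4)
x = torch.rand(2, 5, 64)  # (batch, seq_len, dim_model)
out = mha(x)
print(out.shape)
```

The per-head `ModuleList` version and this fused version compute the same kind of operation. The fused one just avoids a Python loop over heads.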

## Positional Encoding

A way to account for the order of the words in the input sequence. A transformer adds a vector to each input embedding which helps it determine the position of each word. <br>
**Goal** : preserving information about the order of tokens  

Positional encodings can either be learned or fixed a priori.

The original paper proposes a simple scheme for fixed positional encodings based on sine and cosine functions.

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*C3a9RL6-SFC6fW8NGpJg5A.png)

In [None]:
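def position_encoding(seq_len, dim_model, device=None):
  # device defaults to None (PyTorch's default device), so calls that omit it still work
  pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)    # token positions
  dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)  # embedding indices
  phase = pos / (1e4 ** (dim / dim_model))

  # Sine on even embedding indices, cosine on odd ones; result shape (1, seq_len, dim_model)
  return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))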
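
For comparison, the paper defines the encoding pairwise: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, so each sine/cosine pair shares one frequency. The function above uses the raw embedding index in the exponent instead, which gives odd indices a slightly different frequency. Below is a sketch of the pairwise form. The name `sinusoidal_encoding` is made up for illustration, and `dim_model` is assumed to be even.

```python
import torch


def sinusoidal_encoding(seq_len, dim_model):
    """Fixed positional encoding with one shared frequency per sin/cos pair."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, dim_model, 2, dtype=torch.float)     # 0, 2, 4, ...
    phase = pos / (1e4 ** (two_i / dim_model))                   # (seq_len, dim_model / 2)

    pe = torch.zeros(seq_len, dim_model)
    pe[:, 0::2] = torch.sin(phase)  # even embedding indices
    pe[:, 1::2] = torch.cos(phase)  # odd embedding indices
    return pe


pe = sinusoidal_encoding(seq_len=6, dim_model=8)
print(pe.shape)
print(pe[0])  # position 0: sin(0) = 0 on even indices, cos(0) = 1 on odd ones
```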

## Encoder Feed Forward

In [None]:
def feed_forward(dim_input = 512, dim_feedforward = 2048):
  return nn.Sequential(nn.Linear(dim_input, dim_feedforward),
                       nn.ReLU(),
                       nn.Linear(dim_feedforward, dim_input)
                       )

## Encoder Residual

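Following the original paper, each sublayer is wrapped in a residual connection followed by layer normalization, with dropout applied to the sublayer output: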

In [None]:
class Residual(nn.Module):
  def __init__(self, sublayer, dimension, dropout = 0.1):
    super().__init__()
    self.sublayer = sublayer
    self.norm = nn.LayerNorm(dimension)
    self.dropout = nn.Dropout(dropout)

  def forward(self, *tensors):
    # Assumption : query tensor is given first
    return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))

## Putting it all together on the encoder side

![](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)

## Putting the Encoder layer together

In [None]:
class TransformerEncoderLayer(nn.Module):
  def __init__(self, dim_model = 512, num_heads = 6, dim_feedforward = 2048, dropout = 0.1):
    super().__init__()
    dim_q = dim_k = max(dim_model // num_heads, 1)
    self.attention = Residual(MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
                              dimension=dim_model, dropout=dropout)
    self.feed_forward = Residual(
        feed_forward(dim_model, dim_feedforward),
        dimension=dim_model, dropout=dropout)

  def forward(self, src):
    src = self.attention(src, src, src)
    return self.feed_forward(src)

## Putting together the Transformer encoder

In [None]:
class TransformerEncoder(nn.Module):
  def __init__(self, num_layers = 12, dim_model = 512, num_heads = 4, dim_feedforward = 2048,
               dropout: float = 0.1):
    super().__init__()
    self.layers = nn.ModuleList([TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers) ])

  def forward(self, src):
    seq_len, dimension = src.size(1), src.size(2)
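    # Out-of-place add keeps the caller's tensor intact; the encoding is built on src's device
    src = src + position_encoding(seq_len, dimension, device=src.device)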
    for layer in self.layers:
      src = layer(src)

    return src

# The Decoder Side

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder layer in its encoder-decoder attention.


![](https://media.arxiv-vanity.com/render-output/6494154/Figures/ModalNet-21.png)
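
The two attention patterns a decoder relies on can be sketched without any learned weights. This is an illustrative simplification, not a decoder layer. The helper `attention` and the tensor names `tgt` and `memory` are assumptions, and the Q/K/V projections are left out. The first call masks future positions, so each target token only sees itself and earlier tokens. The second call takes queries from the decoder and keys and values from the encoder output.

```python
import torch


def attention(query, key, value, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = query @ key.transpose(-2, -1) / query.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # blocked positions get weight 0
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights


torch.manual_seed(0)
memory = torch.rand(1, 6, 16)  # encoder output: 6 source tokens
tgt = torch.rand(1, 4, 16)     # decoder input: 4 target tokens

# Masked self-attention: lower-triangular mask hides future target tokens
causal = torch.tril(torch.ones(4, 4, dtype=torch.bool))
self_out, self_w = attention(tgt, tgt, tgt, mask=causal)

# Encoder-decoder attention: queries from the decoder, K and V from the encoder
cross_out, cross_w = attention(self_out, memory, memory)

print(self_w[0])        # upper triangle is all zeros
print(cross_out.shape)  # one vector per target token
```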

## Decoder layer

**Task**: implement the decoder layer

In [None]:
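class TransformerDecoderLayer(nn.Module):
  # Parameters mirror how TransformerDecoder constructs this layer below
  def __init__(self, dim_model, num_heads, dim_feedforward, dropout):
    super().__init__()
    pass  # TODO: build the attention and feed-forward sublayers

  def forward(self):
    pass  # TODO: implement the forward pass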

## Full Transformer Decoder

**Task**: implement the Transformer decoder class

In [None]:
class TransformerDecoder(nn.Module):
  def __init__(self, num_layers = 12, dim_model = 512, num_heads = 4, dim_feedforward = 2048,
               dropout: float = 0.1):
    super().__init__()
    self.layers = nn.ModuleList([TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers) ])

  def forward(self, src):
    seq_len, dimension = src.size(1), src.size(2)
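    # Out-of-place add keeps the caller's tensor intact; the encoding is built on src's device
    src = src + position_encoding(seq_len, dimension, device=src.device)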
    for layer in self.layers:
      src = layer(src)

    return src

## Full Transformer model

**Task**: Assemble a full Transformer (Encoder + Decoder)

In [None]:
class Transfomer(nn.Module):
  def __init__(self, ... ):
    super().__init__()
    pass

  def forward(self):
    pass