## Attention Mechanism

Attention is a technique that lets a model dynamically weight the tokens in a sequence by how relevant they are to the token being processed. It computes attention scores by comparing queries against keys, then uses those scores to mix the values:
<hr />

`Query`: What am I looking for? <br />
`Key`: A label used for matching, like a book title. <br />
`Value`: The actual content that the title refers to. <br />
<hr />
Here we will look at the multi-headed attention mechanism.
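
Before splitting anything into heads, it may help to see the query/key/value comparison for a single head. The following is a minimal sketch, not the notebook's final code: the sizes `T` and `C` and the random tensors are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)

T, C = 5, 8                    # 5 tokens, 8 features each (illustrative sizes)
query = torch.randn(T, C)      # what each token is looking for
key = torch.randn(T, C)        # what each token offers for matching
value = torch.randn(T, C)      # the content each token carries

# Compare every query with every key, scaled by sqrt(C).
scores = query @ key.T / C ** 0.5          # (T, T)
# Turn each row of scores into weights that sum to 1.
weights = torch.softmax(scores, dim=-1)    # (T, T)
# Each output row is a weighted mix of the value rows.
out = weights @ value                      # (T, C)

print(weights.shape, out.shape)
```

Row `i` of `weights` says how much token `i` draws from every other token.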

In [2]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))

The query, key, and value play different roles, so each gets its own `nn.Linear()` projection. A linear layer has trainable weights and biases, which lets the model learn a separate mapping from the input embedding to each of the three.

In [3]:
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        print(self.key)
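
To make the "weights and biases" concrete, here is a small sketch of what one such projection holds and how it acts on a `(B, T, C)` tensor. The variable names `proj`, `x`, and `manual` are illustrative and not part of the notebook.

```python
import torch
import torch.nn as nn

dim = 16                       # same embedding size used in this notebook
proj = nn.Linear(dim, dim)     # stands in for self.key / self.query / self.value

print(proj.weight.shape)       # torch.Size([16, 16])
print(proj.bias.shape)         # torch.Size([16])

x = torch.randn(1, 5, dim)     # (B, T, C)
y = proj(x)                    # the layer acts on the last axis only
print(y.shape)                 # torch.Size([1, 5, 16])

# The layer computes x @ W^T + b; check that against a manual version.
manual = x @ proj.weight.T + proj.bias
print(torch.allclose(y, manual, atol=1e-5))
```

Because the layer only touches the last axis, the batch and token axes pass through unchanged.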


## Dimension in each head
Here `num_heads` is set to `2` and `dim` is `16`, so each head gets `16 // 2 = 8` dimensions, and the two heads together cover `2 × 8 = 16`.

In [4]:
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 
        print(self.head_dim)


multi_head = MultiHeadedAttention(16)

8
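
Integer division silently drops a remainder, so `dim` should be a multiple of `num_heads`. A sketch of a guard for that follows; `head_dim_for` is a hypothetical helper, not something the notebook defines.

```python
def head_dim_for(dim: int, num_heads: int) -> int:
    """Hypothetical helper: per-head width, rejecting uneven splits."""
    if dim % num_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by num_heads={num_heads}")
    return dim // num_heads


print(head_dim_for(16, 2))   # 8, and 2 * 8 == 16
print(head_dim_for(16, 4))   # 4, and 4 * 4 == 16
```

With `dim=16` and `num_heads=3` this raises instead of quietly using 5 channels per head and losing one.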


## Understanding B, T and C

`B`: Batch size <br />
`T`: Number of tokens in an input sequence <br />
`C`: Embedding size <br />
<hr />

Something I should have covered in `embedding.ipynb`: we add the `positional` and `token` embeddings to get the final embeddings. The result carries both position and meaning. For word tokens that meaning would be semantic; here the tokens are characters, so things are a little different.

In [5]:
import torch
import torch.nn as nn
from src.tokenizer import Tokenizer
import string

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 

        self.output_proj = nn.Linear(dim, dim) # we will talk about this later
    
    def forward(self, x):
        B, T, C = x.shape
        print(B, T, C)
        # B = batch size
        # T = number of tokens in the input sequence
        # C = embedding size
        # we will talk about things that come up here later..




class Transformer(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.block_size = 25
        self.d_model = 16
        self.vocab_size = len(string.printable) # in this case that's how we are tokenizing

        self.p_e = nn.Embedding(self.block_size, self.d_model)
        self.t_e = nn.Embedding(self.vocab_size, self.d_model)

        self.multi_head = MultiHeadedAttention(16)

    def embedding(self, x):
        B, T = x.shape
        # like in the original paper we add those embeddings.
        te = self.t_e(x)
        position = torch.arange(T)
        pe = self.p_e(position).unsqueeze(0)
        return te + pe

    def forward(self, x):
        B, T = x.shape
        emb = self.embedding(x)
        self.multi_head(emb)


transformer = Transformer()

x = Tokenizer(string.printable)
transformer.forward(
    torch.tensor(x.encode("Hello"), dtype=torch.long).unsqueeze(0)
)


1 5 16
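
The `1 5 16` above is `(B, T, C)`. The sketch below traces where each axis comes from, using random token ids as a stand-in for the notebook's `Tokenizer` (an assumption, since `src.tokenizer` is not shown here) and a batch of 2 to make the broadcast visible.

```python
import string

import torch
import torch.nn as nn

block_size, d_model = 25, 16
vocab_size = len(string.printable)            # 100 printable characters

t_e = nn.Embedding(vocab_size, d_model)       # token embedding table
p_e = nn.Embedding(block_size, d_model)       # positional embedding table

B, T = 2, 5
ids = torch.randint(0, vocab_size, (B, T))    # stand-in for tokenizer output

te = t_e(ids)                                 # (B, T, C)
pe = p_e(torch.arange(T)).unsqueeze(0)        # (1, T, C), broadcast over the batch
emb = te + pe                                 # (B, T, C)

print(te.shape, pe.shape, emb.shape)
```

Every sequence in the batch receives the same positional vectors, which is why `pe` only needs a batch axis of size 1.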


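### Distributing the dimension to each head

After the `nn.Linear()` projections, we reshape the query, key, and value so that the embedding dimension is split across the heads; the attention scores are then computed per head. `nn.Linear()` is a single layer with its own weights and biases and no hidden layer.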
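
As a small illustration of what the reshape does, the sketch below fills a tensor with consecutive numbers and checks which channels land in which head. The sizes match this notebook; the variable names are illustrative.

```python
import torch

B, T, C, num_heads = 1, 5, 16, 2
head_dim = C // num_heads

# Consecutive values make it easy to see where each channel ends up.
x = torch.arange(B * T * C, dtype=torch.float32).view(B, T, C)
heads = x.view(B, T, num_heads, head_dim)     # (1, 5, 2, 8)

# Head 0 holds channels 0..7 of each token, head 1 holds channels 8..15.
print(torch.equal(heads[:, :, 0, :], x[:, :, :head_dim]))   # True
print(torch.equal(heads[:, :, 1, :], x[:, :, head_dim:]))   # True
```

So `view` does not mix anything: it only relabels the last axis as `(num_heads, head_dim)`.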

In [6]:
import torch
import torch.nn as nn
from src.tokenizer import Tokenizer
import string

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 

        self.output_proj = nn.Linear(dim, dim) # we will talk about this later
    
    def forward(self, x):
        B, T, C = x.shape

        query = self.query(x)
        key = self.key(x)
        value = self.value(x)  # use the value projection, not the key projection

        query = query.view(B, T, self.num_heads, self.head_dim)
        key = key.view(B, T, self.num_heads, self.head_dim)
        value = value.view(B, T, self.num_heads, self.head_dim)
        
        s = torch.matmul(query, key.transpose(-2, -1))

        print(f"Shape of query {query.shape}")

        print(f"Shape of original key {key.shape}")
        print(f"Shape of transposed key {key.transpose(-2, -1).shape}")


class Transformer(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.block_size = 25
        self.d_model = 16
        self.vocab_size = len(string.printable) # in this case that's how we are tokenizing

        self.p_e = nn.Embedding(self.block_size, self.d_model)
        self.t_e = nn.Embedding(self.vocab_size, self.d_model)

        self.multi_head = MultiHeadedAttention(16)

    def embedding(self, x):
        B, T = x.shape
        # like in the original paper we add those embeddings.
        te = self.t_e(x)
        position = torch.arange(T)
        pe = self.p_e(position).unsqueeze(0)
        return te + pe

    def forward(self, x):
        B, T = x.shape
        emb = self.embedding(x)
        self.multi_head(emb)


transformer = Transformer()

x = Tokenizer(string.printable)
transformer.forward(
    torch.tensor(x.encode("Hello"), dtype=torch.long).unsqueeze(0)
)

Shape of query torch.Size([1, 5, 2, 8])
Shape of original key torch.Size([1, 5, 2, 8])
Shape of transposed key torch.Size([1, 5, 8, 2])


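In `torch.Size([1, 5, 2, 8])` the axes are batch size, sequence length, number of heads, and head dimension. Transposing the last two axes of the key is what makes the matrix product defined: `(..., 2, 8) @ (..., 8, 2)`. In this layout, however, the result has shape `(1, 5, 2, 2)`, which compares heads within each token rather than tokens with each other. To get token-to-token scores, the head axis has to come before the sequence axis, which is what the next section does.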
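
The effect of the axis order on the score matrix can be checked directly. This is a sketch with random tensors; `H` and `D` are shorthand for `num_heads` and `head_dim`.

```python
import torch

B, T, H, D = 1, 5, 2, 8
q = torch.randn(B, T, H, D)
k = torch.randn(B, T, H, D)

# Layout (B, T, H, D): matmul works on the last two axes, i.e. (H, D) @ (D, H).
per_token = q @ k.transpose(-2, -1)
print(per_token.shape)       # torch.Size([1, 5, 2, 2])

# Layout (B, H, T, D): now the last two axes are (T, D) @ (D, T).
q_h = q.transpose(1, 2)
k_h = k.transpose(1, 2)
per_head = q_h @ k_h.transpose(-2, -1)
print(per_head.shape)        # torch.Size([1, 2, 5, 5])
```

Only the second result has one `(T, T)` score matrix per head, which is the shape attention needs.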

### Calculating the `Attention score` and the concept of masking

In [12]:
import torch
import torch.nn as nn
from src.tokenizer import Tokenizer
import string

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 

        self.output_proj = nn.Linear(dim, dim) # we will talk about this later
    
    def forward(self, x):
        B, T, C = x.shape

        query = self.query(x)
        key = self.key(x)
        value = self.value(x)  # use the value projection, not the key projection

        query = query.view(B, T, self.num_heads, self.head_dim)
        key = key.view(B, T, self.num_heads, self.head_dim)
        value = value.view(B, T, self.num_heads, self.head_dim)
        print("Dimension before transpose!")
        print(query.shape)
        print(key.shape)
        print(value.shape)

        print("Dimension after transpose")
        query = query.transpose(1, 2)
        key = key.transpose(1, 2)
        value = value.transpose(1, 2)

        print(query.shape)
        print(key.shape)
        print(value.shape)



        # Before the transpose: (1, 5, 2, 8) = (batch, tokens, heads, head_dim)
        # After the transpose:  (1, 2, 5, 8) = (batch, heads, tokens, head_dim)
        
        s = torch.matmul(query, key.transpose(-2, -1))
        s = s / (self.head_dim ** 0.5) # scale by the square root of the head dimension

        # causal masking, then softmax
        mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
        # print(s.shape)
        s = s.masked_fill(mask==0, -1e9) # large negative value, acts like -Inf in the softmax

        attn_wts = torch.softmax(s, dim=-1) # each row becomes weights that sum to 1
        out = torch.matmul(attn_wts, value)


class Transformer(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.block_size = 25
        self.d_model = 16
        self.vocab_size = len(string.printable) 

        self.p_e = nn.Embedding(self.block_size, self.d_model)
        self.t_e = nn.Embedding(self.vocab_size, self.d_model)

        self.multi_head = MultiHeadedAttention(16)

    def embedding(self, x):
        B, T = x.shape
        # like in the original paper we add those embeddings.
        te = self.t_e(x)
        position = torch.arange(T)
        pe = self.p_e(position).unsqueeze(0)
        return te + pe

    def forward(self, x):
        B, T = x.shape
        emb = self.embedding(x)
        self.multi_head(emb)


transformer = Transformer()

x = Tokenizer(string.printable)
transformer.forward(
    torch.tensor(x.encode("Hello"), dtype=torch.long).unsqueeze(0)
)


Dimension before transpose!
torch.Size([1, 5, 2, 8])
torch.Size([1, 5, 2, 8])
torch.Size([1, 5, 2, 8])
Dimension after transpose
torch.Size([1, 2, 5, 8])
torch.Size([1, 2, 5, 8])
torch.Size([1, 2, 5, 8])


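The transpose in the code above changes the layout as follows: <br />
Before the transpose: `(B, T, num_heads, head_dim)` <br />
After the transpose: `(B, num_heads, T, head_dim)`

Is each head still independent? Yes. With the head axis placed before the token axis, a single batched matrix multiplication computes a separate `(T, T)` score matrix for every head in parallel.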
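
One way to convince yourself that the heads stay independent is to compare the batched product with an explicit loop over heads. This is a sketch with random tensors, not part of the notebook's model.

```python
import torch

torch.manual_seed(0)
B, H, T, D = 1, 2, 5, 8
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)

# One batched call covers all heads at once.
batched = q @ k.transpose(-2, -1)                        # (B, H, T, T)

# The same computation, one head at a time.
looped = torch.stack(
    [q[:, h] @ k[:, h].transpose(-2, -1) for h in range(H)],
    dim=1,
)                                                        # (B, H, T, T)

print(torch.allclose(batched, looped, atol=1e-5))
```

The batched form gives the same numbers as the loop while leaving the scheduling across heads to PyTorch.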



### Understanding the masking 

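In matrix form, the mask is a lower-triangular matrix of ones. Wherever the mask is 0 (a future token), the score is set to -Inf (approximated by `-1e9` in the code above) before the softmax, so that position ends up with zero attention weight.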



| Token | Can Attend To | Cannot Attend To |
|-------|--------------|------------------|
| 0 | [0] | [1, 2, 3, 4] |
| 1 | [0, 1] | [2, 3, 4] |
| 2 | [0, 1, 2] | [3, 4] |
| 3 | [0, 1, 2, 3] | [4] |
| 4 | [0, 1, 2, 3, 4] | [] |
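
The table can be reproduced numerically. The sketch below assumes all raw scores are equal (zeros), so the only thing shaping the weights is the mask. It uses `float("-inf")` rather than the `-1e9` from the code above; for a causal mask both give zero weight to future tokens.

```python
import torch

T = 5
scores = torch.zeros(T, T)                     # equal raw scores, for illustration
mask = torch.tril(torch.ones(T, T))            # 1 on and below the diagonal

# Future positions (mask == 0) are pushed to -inf before the softmax.
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights)
```

Row `i` spreads its weight evenly over tokens `0..i` and gives exactly zero to the rest: token 0 attends only to itself, while token 4 gives 0.2 to each of the five tokens.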
