## Attention Mechanism

Attention is a technique that lets a model dynamically weight the tokens in a sequence by how relevant they are to one another. It computes attention scores by comparing queries against keys, then uses those scores to combine the values:
<hr />

`Query`: What am I looking for? <br />
`Key`: A label describing what each token offers, like a book title. <br />
`Value`: The actual content that the title refers to. <br />
<hr />
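
As a rough illustration of how the three pieces interact, here is a minimal single-head sketch. It assumes the standard scaled dot-product form (scores divided by the square root of the dimension, then a softmax), which this section has not spelled out; the function name `scaled_dot_product_attention` and the random inputs are made up for the example.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V (illustrative helper)."""
    d = query.size(-1)
    # Compare every query with every key: (B, T, d) @ (B, d, T) -> (B, T, T)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d)
    # Turn each row of scores into weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted average of the values: (B, T, T) @ (B, T, d) -> (B, T, d)
    return weights @ value, weights

torch.manual_seed(0)
q = torch.randn(1, 5, 8)  # made-up queries: 1 sequence, 5 tokens, 8 features
k = torch.randn(1, 5, 8)  # made-up keys
v = torch.randn(1, 5, 8)  # made-up values

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([1, 5, 8])
print(weights.shape)  # torch.Size([1, 5, 5])
```

Each row of `weights` says how much one token draws from every other token, so the output for a token is a blend of all the values.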
Here we will look at the multi-headed attention mechanism.

In [7]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))

The Query, Key, and Value play different roles, so each one gets its own `nn.Linear()` layer. Every layer holds a weight matrix and a bias that are adjusted during training, which is how the model learns the three projections.
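
To make the "weights and biases" concrete, the short sketch below inspects a single `nn.Linear` layer. The size `16` matches the embedding size used later in this notebook; the variable names and the random input are only for illustration.

```python
import torch
import torch.nn as nn

dim = 16  # embedding size, matching the value used later in the notebook
key = nn.Linear(dim, dim)

# The layer owns a weight matrix and a bias vector, both trainable
print(key.weight.shape)          # torch.Size([16, 16])
print(key.bias.shape)            # torch.Size([16])
print(key.weight.requires_grad)  # True

# Applying the layer projects the last dimension and keeps the others
x = torch.randn(1, 5, dim)       # made-up input: 1 sequence of 5 tokens
projected = key(x)
print(projected.shape)           # torch.Size([1, 5, 16])
```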

In [8]:
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        print(self.key)


## Dimension in each head
Here `num_heads` is set to `2` and `dim` is `16`, so each head has dimension `16 // 2 = 8`. The two heads together cover the full embedding: 2 × 8 = 16.
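
One common way to picture the split, sketched here under the assumption that heads are formed by reshaping the last dimension (this notebook has not yet shown how it splits them), is to view a `16`-wide tensor as `2` heads of `8`. The variable names are illustrative.

```python
import torch

batch, tokens, dim = 1, 5, 16
num_heads = 2
head_dim = dim // num_heads  # 16 // 2 = 8

x = torch.randn(batch, tokens, dim)  # made-up activations

# Split the last dimension into (num_heads, head_dim), then move heads forward
heads = x.view(batch, tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 2, 5, 8])

# Undoing the split recovers the original tensor exactly
merged = heads.transpose(1, 2).reshape(batch, tokens, dim)
print(torch.equal(merged, x))  # True
```

This is why `dim` should be divisible by `num_heads`: otherwise `dim // num_heads` drops the remainder and the heads no longer add up to the full embedding.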

In [9]:
import torch
import torch.nn as nn

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 
        print(self.head_dim)


multi_head = MultiHeadedAttention(16)

8


## Understanding B, T, and C

`B`: Batch size <br />
`T`: Number of tokens in an input sequence <br />
`C`: Embedding size <br />
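
A small sketch of where the three numbers come from: a batch of token ids with shape `(B, T)` goes through an embedding table and comes out as `(B, T, C)`. It assumes a character vocabulary of `string.printable` and an embedding size of `16`, as this notebook uses; the token ids themselves are random placeholders.

```python
import string
import torch
import torch.nn as nn

vocab_size = len(string.printable)  # 100 printable characters
d_model = 16                        # embedding size (C)
token_embedding = nn.Embedding(vocab_size, d_model)

# Made-up ids: 2 sequences (B) of 5 tokens each (T)
token_ids = torch.randint(0, vocab_size, (2, 5))

x = token_embedding(token_ids)
B, T, C = x.shape
print(B, T, C)  # 2 5 16
```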

In [None]:
import torch
import torch.nn as nn
from src.tokenizer import Tokenizer
import string

class MultiHeadedAttention(nn.Module):

    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

        self.num_heads = num_heads

        self.head_dim = dim // num_heads 

        self.output_proj = nn.Linear(dim, dim) # we will talk about this later
    
    def forward(self, x):
        B, T, C = x.shape
        # B = batch size
        # T = number of tokens in an input sequence
        # C = embedding size
        print(B, T, C)
        # we will talk about the rest of the forward pass later




class Transformer(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.block_size = 25
        self.d_model = 16
        self.vocab_size = len(string.printable) # in this case that's how we are tokenizing

        self.p_e = nn.Embedding(self.block_size, self.d_model)
        self.t_e = nn.Embedding(self.vocab_size, self.d_model)

        self.multi_head = MultiHeadedAttention(self.d_model)

    def embedding(self, x):
        B, T = x.shape
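        # as in the original paper, token and position embeddings are summed
        te = self.t_e(x)
        # build the positions on the same device as the input
        position = torch.arange(T, device=x.device)
        # add a batch dimension so (1, T, C) broadcasts against (B, T, C)
        pe = self.p_e(position).unsqueeze(0)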
        return te + pe

    def forward(self, x):
        B, T = x.shape
        emb = self.embedding(x)
        print(emb.shape, "Size of embeddings")
        forward_pass = self.multi_head(emb)
        return forward_pass


transformer = Transformer()

tokenizer = Tokenizer(string.printable)
# calling the module runs forward(); unsqueeze(0) adds the batch dimension
transformer(
    torch.tensor(tokenizer.encode("Hello"), dtype=torch.long).unsqueeze(0)
)


torch.Size([1, 5, 16]) Size of embeddings
1 5 16
