## **ATTENTION LEARNING:**

This part covers attention learning. The classic example of it is the **Transformer**.

In attention learning, the input is first converted into an input embedding, which is passed to the encoder and then to the decoder, which generates the output.


There are two main blocks:

-    <font color='green'>**Encoder :** </font> 

What goes inside the encoder layer? It starts with Multi-Head Attention, which takes 3 inputs: key, query and value. The attention output is added to its input through a skip connection and normalized. It is then sent through a Feed Forward network and normalized again. This whole block, marked Nx in the diagram, is also called a transformer block.



-    <font color='green'>**Decoder :** </font> 

In the decoder block, the values and keys are passed from the encoder, but the query comes from **another multi-head attention** layer applied to the decoder's own input. All three are then passed to a second multi-head attention layer. The rest of the block is very similar to the encoder block, with one more layer of Multi-Head Attention and Normalization. The target output is passed through the output embedding and then into the decoder block on the right-hand side of the diagram. Lastly, Linear and Softmax layers produce the output probabilities.

The `encoder` and `decoder` blocks are each repeated several times. The input passes through all the encoder blocks before the result is sent to the decoder.
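
The encoder block described above (attention, add and normalize, feed forward, add and normalize, repeated Nx) can be sketched with PyTorch's built-in `nn.MultiheadAttention`. This is only a rough sketch under stated assumptions: the class name `TinyEncoderBlock`, the sizes, and the choice of 2 stacked blocks are illustrative, and it is not the from-scratch implementation used later in this notebook.

```python
import torch
import torch.nn as nn

class TinyEncoderBlock(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, emb_size=256, heads=8, forward_expansion=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_size, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(emb_size)
        self.ff = nn.Sequential(
            nn.Linear(emb_size, forward_expansion * emb_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * emb_size, emb_size),
        )
        self.norm2 = nn.LayerNorm(emb_size)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # query, key and value are all x
        x = self.norm1(x + attn_out)       # skip connection, then normalization
        return self.norm2(x + self.ff(x))  # feed forward, skip connection, normalization

# "Nx": the same kind of block is stacked several times (2 here, an arbitrary choice)
blocks = nn.ModuleList([TinyEncoderBlock() for _ in range(2)])
x = torch.randn(2, 9, 256)                 # (batch, sequence length, embedding size)
for block in blocks:
    x = block(x)
print(x.shape)
```

Each block keeps the shape `(batch, sequence length, embedding size)`, which is what allows the blocks to be stacked.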

Positional encoding encodes the position of each word (changing the position of a word changes the context and meaning of the sentence).


**Link:** https://drive.google.com/file/d/1Tqlud4Iw375egpDSIdElvbQCf4Qk4v4g/view?usp=share_link
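
The positional encoding idea above can be sketched as follows, assuming learned position embeddings added to the word embeddings. The names `word_emb` and `pos_emb` and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, max_length, emb_size = 10, 100, 16    # illustrative sizes
word_emb = nn.Embedding(vocab_size, emb_size)     # one vector per word
pos_emb = nn.Embedding(max_length, emb_size)      # one vector per position

tokens = torch.tensor([[3, 7, 3]])                # word 3 appears at positions 0 and 2
N, seq_len = tokens.shape
positions = torch.arange(seq_len).expand(N, seq_len)

x = word_emb(tokens) + pos_emb(positions)         # shape (N, seq_len, emb_size)

# the word vectors are identical, but the final inputs differ because of the position
same_word = torch.equal(word_emb(tokens)[0, 0], word_emb(tokens)[0, 2])
same_input = torch.allclose(x[0, 0], x[0, 2])
print(x.shape, same_word, same_input)
```

The same word at two different positions therefore reaches the attention layers as two different vectors.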


All operations are done in parallel. We provide a mask so that each element attends only to its allowed targets: the first element can attend only to the first, whereas the second can attend to both the first and the second.

**Link:** https://drive.google.com/file/d/1D5Nw00kTNrOgDmqi5LIc3u6iYCtKitZT/view?usp=share_link
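
The masking described above can be sketched with a lower triangular matrix. The all-zero scores are dummy values, chosen only so that the effect of the mask on the softmax is easy to read.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))        # 1 = may attend, 0 = blocked

scores = torch.zeros(seq_len, seq_len)                 # dummy scores, all equal
scores = scores.masked_fill(mask == 0, float("-inf"))  # blocked positions get -inf
weights = torch.softmax(scores, dim=-1)                # each row sums to 1

print(mask)
print(weights)
```

Row 0 of `weights` puts all its weight on element 0, row 1 splits it equally between elements 0 and 1, and so on.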

Digging deeper, the input embedding of dim=256 is split into several parts (8 parts of 32 dims each), and each part is passed through linear layers. The split inputs are then sent to the **Scaled Dot Product Attention**. Concatenation followed by a Linear layer completes the multi-head attention (dim 256).

**Link:** https://drive.google.com/file/d/1J1ZVixqTttVoRJHls_1aweK7SyCPGw8J/view?usp=share_link
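
The split into heads and the later concatenation are both plain reshapes. A minimal sketch, with the batch size and sequence length chosen arbitrarily:

```python
import torch

N, seq_len, emb_size, heads = 2, 5, 256, 8
head_dim = emb_size // heads                          # 256 // 8 = 32

x = torch.randn(N, seq_len, emb_size)
split = x.reshape(N, seq_len, heads, head_dim)        # (2, 5, 8, 32): 8 heads of 32 dims
merged = split.reshape(N, seq_len, heads * head_dim)  # concatenate heads: (2, 5, 256)

print(split.shape, merged.shape)
```

Head 1 of a token is simply dims 32 to 63 of its embedding, so nothing is lost or reordered by the split.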


In [1]:
import torch
import torch.nn as nn

In [2]:
class Self_attention(nn.Module): #nn.Module is the base class for all neural networks; it helps to define, initialize and manipulate the parameters
  def __init__(self,emb_size,heads): #the embedding is split into several parts (usually 8); the number of parts is called heads
    super(Self_attention,self).__init__()
    self.emb_size=emb_size
    self.heads=heads
    self.head_dim=emb_size//heads  #integer division

    #head_dim must be an integer: we cannot split 256 into 7 equal parts
    assert (self.head_dim*heads==emb_size), "Embedding size needs to be divisible by heads"

    #linear transformations that the values, keys and queries are sent through
    self.keys=nn.Linear(self.head_dim,self.head_dim,bias=False)
    self.values=nn.Linear(self.head_dim,self.head_dim,bias=False)
    self.queries=nn.Linear(self.head_dim,self.head_dim,bias=False)
    self.fc_out=nn.Linear(heads*self.head_dim,emb_size) #fully connected output; heads*self.head_dim is equal to emb_size

  def forward(self,values,keys,queries,mask):
    N=queries.shape[0]  #batch size: how many examples are processed at the same time
    #these lengths correspond to the source or target length; kept abstract because the module is used in both encoder and decoder
    values_len,keys_len,queries_len= values.shape[1],keys.shape[1],queries.shape[1]

    #split the embedding into self.heads pieces
    #each head can capture a different relation in the input; each head goes through the attention mechanism
    #and the results are concatenated afterwards
    values=values.reshape(N,values_len,self.heads,self.head_dim)
    queries=queries.reshape(N,queries_len,self.heads,self.head_dim)
    keys=keys.reshape(N,keys_len,self.heads,self.head_dim)

    #apply the linear layers; without this step the attention block has no trainable projections
    values=self.values(values)
    keys=self.keys(keys)
    queries=self.queries(queries)

    #score function
    energy=torch.einsum("nqhd,nkhd->nhqk",queries,keys) #einsum uses the Einstein summation convention to express tensor operations compactly
    #queries shape: N,queries_len,self.heads,self.head_dim
    #keys shape: N,keys_len,self.heads,self.head_dim
    #energy shape: N,heads,queries_len,keys_len
    #queries_len is the target sentence length and keys_len the source sentence length: for each target word, how much attention to pay to each input word

    if mask is not None:
      energy=energy.masked_fill(mask==0,float("-1e20")) #where the mask is zero the score is pushed to a very large negative value so that it has no impact
      #the mask for the target is a triangular matrix

    #softmax over the key dimension gives the attention weights
    #the division is for numerical stability; note that the original paper divides by sqrt(head_dim) rather than sqrt(emb_size)
    attention=torch.softmax(energy/(self.emb_size**(1/2)),dim=3)
    out=torch.einsum("nhql,nlhd->nqhd",attention,values).reshape(  #reshape concatenates the heads
        N,queries_len, self.heads*self.head_dim
    )
    #attention shape: N,heads,queries_len,keys_len
    #values shape: N,values_len,self.heads,self.head_dim
    #out shape after einsum: N,queries_len,heads,head_dim (keys_len=values_len); then the last two dimensions are flattened
    out=self.fc_out(out) #final fully connected layer
    return out
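
The score pattern `nqhd,nkhd->nhqk` used above can be checked against an ordinary batched matrix multiplication. This is a small sanity check with arbitrary sizes, not part of the model itself:

```python
import torch

torch.manual_seed(0)
N, q_len, k_len, heads, head_dim = 2, 3, 4, 8, 32
queries = torch.randn(N, q_len, heads, head_dim)
keys = torch.randn(N, k_len, heads, head_dim)

# sum over the head_dim axis d, for every batch n, head h, query q and key k
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# the same computation written as Q @ K^T per head
q = queries.permute(0, 2, 1, 3)      # (N, heads, q_len, head_dim)
k = keys.permute(0, 2, 3, 1)         # (N, heads, head_dim, k_len)
energy_matmul = torch.matmul(q, k)   # (N, heads, q_len, k_len)

print(energy.shape, torch.allclose(energy, energy_matmul, atol=1e-4))
```

Writing it with `einsum` avoids the explicit permutes, at the cost of a less familiar notation.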

For the attention formula follow the link:
[Attention](https://drive.google.com/file/d/1SIG7SfJDG9HnmNp_yZmeP8x6_MnA9wlE/view?usp=share_link)
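
For reference, the scaled dot-product attention formula as it is usually stated is written out below. Here $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the key dimension of a single head. The code above divides by the square root of `emb_size` instead, which is a different scaling choice.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$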

In [3]:
# Now creating the transformer block

class TransformerBlock(nn.Module): #nn.Module is the base class for all neural networks
  def __init__(self,emb_size,heads,dropout,forward_expansion): #forward_expansion sets how much wider the hidden feed forward layer is than emb_size
    super(TransformerBlock,self).__init__()
    self.attention=Self_attention(emb_size,heads)

    #two normalizations: the 1st after the attention block, the 2nd after the feed forward network
    self.norm1=nn.LayerNorm(emb_size)
    self.norm2=nn.LayerNorm(emb_size)
    self.feed_forward=nn.Sequential(
        nn.Linear(emb_size,forward_expansion*emb_size),           #mapping to the wider hidden size
        nn.ReLU(),
        nn.Linear(forward_expansion*emb_size,emb_size)            #mapping back to emb_size
    )
    self.dropout=nn.Dropout(dropout)

  def forward(self,values,keys,queries,mask):
    attention=self.attention(values,keys,queries,mask)
    x=self.dropout(self.norm1(attention+queries))                 #skip connection around the attention
    forward=self.feed_forward(x)
    out=self.dropout(self.norm2(forward+x))                       #skip connection around the feed forward network
    return out

In [4]:
#For the encoder

class Encoder(nn.Module):
  def __init__(self,
               emb_size,
               heads,
               dropout,
               forward_expansion,
               scr_vocab_size,
               device,
               num_layers,
               max_length):    #max_length is needed for the positional embedding
    #the arguments above are the hyperparameters
    super(Encoder,self).__init__()
    self.emb_size=emb_size
    self.device=device
    self.word_emb=nn.Embedding(scr_vocab_size,emb_size)
    self.position_emb=nn.Embedding(max_length,emb_size)
    self.layers=nn.ModuleList(                          #holds the stacked transformer blocks
        [
            TransformerBlock(emb_size,heads,dropout,forward_expansion)
            for _ in range(num_layers)                  #one block per layer
        ]
    )
    self.dropout=nn.Dropout(dropout)

  def forward(self,x,mask):    #takes a single input: the source token indices
    N,seq_len=x.shape
    positions=torch.arange(0,seq_len).expand(N,seq_len).to(self.device)   #0 ... seq_len-1 for every example
    #sending x through the embeddings
    out=self.dropout(self.word_emb(x)+self.position_emb(positions))
    for layer in self.layers:
      out=layer(out,out,out,mask) #in the encoder the values, keys and queries are all the same
    return out                    #return only after all layers have been applied

In [5]:
#For the decoder

class DecoderBlock(nn.Module):
  def __init__(self,
               emb_size,
               heads,
               dropout,
               forward_expansion,
               device):
    super(DecoderBlock,self).__init__()
    self.attention=Self_attention(emb_size,heads)
    self.norm=nn.LayerNorm(emb_size)
    self.TransformerBlock=TransformerBlock(emb_size,heads,dropout,forward_expansion)
    self.dropout=nn.Dropout(dropout)

  def forward(self,x,value,key,scr_mask,trg_mask):
    #x is the input from the target; value and key come from the encoder
    attention=self.attention(x,x,x,trg_mask)          #masked self-attention over the target
    query=self.dropout(self.norm(attention+x))        #skip connection, then normalization
    out=self.TransformerBlock(value,key,query,scr_mask)
    return out


In [6]:
class Decoder(nn.Module):
  def __init__(self,
               tar_vocab_size,
               emb_size,
               heads,
               dropout,
               forward_expansion,
               device,
               max_length,
               num_layer):
    super(Decoder,self).__init__()
    self.emb_size=emb_size
    self.device=device
    self.word_emb=nn.Embedding(tar_vocab_size,emb_size)
    self.position_emb=nn.Embedding(max_length,emb_size)
    self.layers=nn.ModuleList(                          #holds the stacked decoder blocks
        [
            DecoderBlock(emb_size,heads,dropout,forward_expansion,device)
            for _ in range(num_layer)                   #one block per layer
        ]
    )
    self.fc_out=nn.Linear(emb_size, tar_vocab_size) #this is the last linear layer in the diagram
    self.dropout=nn.Dropout(dropout)

  def forward(self,x,scr_mask,trg_mask,enc_out):
    N,seq_len=x.shape
    positions=torch.arange(0,seq_len).expand(N,seq_len).to(self.device)
    x=self.dropout(self.word_emb(x)+self.position_emb(positions))
    for layer in self.layers:
      x=layer(x,enc_out,enc_out,scr_mask,trg_mask) #x is the decoder input; enc_out supplies both the values and the keys
    out=self.fc_out(x) #prediction of which word is next
    return out

Making a triangular-matrix mask: [Triangular Matrix](https://drive.google.com/file/d/1GBjvR9KAdc13UwUiAJa2dBfJhJVypPAD/view?usp=share_link)
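
The two masks have different shapes, and both rely on broadcasting against the score tensor of shape `(N, heads, queries_len, keys_len)`. A small sketch of the shapes, assuming `0` is the padding index and using made-up token ids:

```python
import torch

scr = torch.tensor([[1, 5, 6, 0, 0]])   # last two tokens are padding (index 0 assumed)
tar = torch.tensor([[1, 7, 4]])
N, tar_len = tar.shape
heads = 8

# padding mask: True for real tokens, shape (N, 1, 1, scr_len)
scr_mask = (scr != 0).unsqueeze(1).unsqueeze(2)
# lower triangular target mask, shape (N, 1, tar_len, tar_len)
tar_mask = torch.tril(torch.ones(tar_len, tar_len)).expand(N, 1, tar_len, tar_len)

# dummy scores between target queries and source keys
energy = torch.zeros(N, heads, tar_len, scr.shape[1])
masked = energy.masked_fill(scr_mask == 0, float("-1e20"))  # broadcast over heads and queries

print(scr_mask.shape, tar_mask.shape, masked.shape)
print(tar_mask[0, 0])
```

The singleton dimensions are what let one mask be shared by every head, and in the padding case by every query position as well.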

In [7]:
#Putting them together

class Transformer(nn.Module):
  def __init__(self,
               tar_vocab_size,
               scr_vocab_size,
               tar_pad_index,     #for the mask
               scr_pad_index,
               emb_size=256,
               num_layer=6,
               forward_expansion=4,
               heads=8,
               dropout=0,
               device="cuda",
               max_length=100
               ):
    super(Transformer,self).__init__()
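    #arguments are passed in the order expected by Encoder.__init__
    self.encoder=Encoder(
        emb_size,
        heads,
        dropout,
        forward_expansion,
        scr_vocab_size,
        device,
        num_layer,
        max_length
    )
    #arguments are passed in the order expected by Decoder.__init__
    self.decoder=Decoder(
        tar_vocab_size,
        emb_size,
        heads,
        dropout,
        forward_expansion,
        device,
        max_length,
        num_layer
    )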
    self.tar_pad_index=tar_pad_index
    self.scr_pad_index=scr_pad_index
    self.device=device

  def make_scr_mask(self,scr):
    #True where the token is a real word, False where it is the source padding index
    scr_mask=(scr!=self.scr_pad_index).unsqueeze(1).unsqueeze(2)
    #resulting shape: (N,1,1,scr_len)
    return scr_mask.to(self.device)

  def make_tar_mask(self,tar):
    N,tar_len=tar.shape
    #lower triangular matrix, expanded so that there is one mask per example in the batch
    tar_mask=torch.tril(torch.ones((tar_len,tar_len))).expand(
        N,1,tar_len,tar_len
    )
    return tar_mask.to(self.device)

  def forward(self,scr,tar):
    scr_mask=self.make_scr_mask(scr)
    tar_mask=self.make_tar_mask(tar)
    enc_scr=self.encoder(scr,scr_mask)
    out=self.decoder(tar,scr_mask,tar_mask,enc_scr) #the decoder, not the encoder, consumes the target and the encoder output
    return out

In [8]:
if __name__=="__main__":
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  x = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0], [1, 8, 7, 3, 4, 5, 6, 7, 2]]).to(device)
  tar= torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]]).to(device)
  scr_pad_index = 0
  tar_pad_index = 0
  scr_vocab_size = 10
  tar_vocab_size = 10
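  model = Transformer(tar_vocab_size, scr_vocab_size, tar_pad_index, scr_pad_index, device=device).to(device)  # argument order follows Transformer.__init__; device is passed so the model also runs on CPU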
  out = model(x, tar[:, :-1]) 
  print(out.shape)
