## Self Attention

Self-attention is a mechanism that lets a neural network weigh different parts of the input when producing each element of the output.

The Transformer architecture, which is built on self-attention, is mainly used for tasks such as:

* Machine Translation
* Sentiment Analysis
* Text summarization

One library used here goes beyond the usual PyTorch stack:

*`Levenshtein:`* Used to calculate the `Levenshtein distance`, which is useful for evaluating model performance in tasks such as text generation or translation.
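
As a rough illustration of what the metric measures, the sketch below computes the Levenshtein distance with a small dynamic-programming routine. The helper name `levenshtein` is hypothetical and written only for this example; the notebook itself relies on `Levenshtein.distance`.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("tables", "table"))    # 1: drop the trailing "s"
print(levenshtein("kitten", "sitting"))  # 3
```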

## Installing Required Libraries

In [3]:
! pip install Levenshtein

! pip install torch==2.3.0 torchtext==0.18.0

Collecting torch==2.3.0
  Downloading torch-2.3.0-cp312-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting torchtext==0.18.0
  Using cached torchtext-0.18.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.9 kB)
Downloading torch-2.3.0-cp312-none-macosx_11_0_arm64.whl (61.0 MB)
Using cached torchtext-0.18.0-cp312-cp312-macosx_11_0_arm64.whl (2.1 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 2.2.2
    Uninstalling torch-2.2.2:
      Successfully uninstalled torch-2.2.2
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.17.2
    Uninstalling torchtext-0.17.2:
      Successfully uninstalled torchtext-0.17.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source o

## Importing Required Libraries

In [3]:
import os
import sys
import time
import warnings
from pathlib import Path
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

from Levenshtein import distance
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [2]:
def warn(*args,**kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

## Training Parameters

`learning_rate:` The step size taken at each iteration while moving toward a minimum of the loss function.

`batch_size:` The number of samples propagated through the network in one forward/backward pass.

`max_iters:` The total number of training iterations we plan to run. It's set to 5000 to give the model ample opportunity to learn from the data.

`eval_interval and eval_iters:` How often the model is evaluated (every `eval_interval` iterations) and how many batches are averaged to approximate the loss at each evaluation (`eval_iters`).
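
One way to picture how `eval_interval` and `eval_iters` interact is the sketch below. It assumes a hypothetical `sample_batch_loss` function standing in for a forward pass on one random batch; the real training loop is not shown in this section, so the structure here is an assumption rather than the notebook's code.

```python
import random

max_iters = 5000
eval_interval = 200
eval_iters = 100


def sample_batch_loss(rng: random.Random) -> float:
    """Hypothetical stand-in for the loss of one randomly sampled batch."""
    return 2.0 + rng.uniform(-0.5, 0.5)


def estimate_loss(rng: random.Random, n_batches: int) -> float:
    """Approximate the loss by averaging over n_batches batches."""
    return sum(sample_batch_loss(rng) for _ in range(n_batches)) / n_batches


rng = random.Random(0)
eval_steps = []
for it in range(max_iters):
    if it % eval_interval == 0:
        # Averaging over eval_iters batches gives a less noisy estimate than a single batch
        eval_steps.append((it, estimate_loss(rng, eval_iters)))
    # ... one optimizer step on a training batch would go here ...

print(len(eval_steps))      # 25 evaluations: iterations 0, 200, ..., 4800
print(eval_steps[-1][0])    # 4800
```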


## Architecture Parameters

* max_vocab_size: The maximum number of tokens in our vocabulary. It's set to 256, meaning that we only consider the 256 most frequent tokens.

* vocab_size: The actual number of tokens in the vocabulary, which may be less than the maximum because subword tokenization schemes such as BPE (Byte Pair Encoding) yield a variable-size vocabulary.

* block_size: The length of the input sequence (the context) that the model is designed to handle. Here it's 16.

* n_embed: The size of each embedding vector, set to 32.

* num_heads: The number of heads in the multi-head self-attention mechanism, 2 in this case, which allows the model to jointly attend to information from different representation subspaces.


In [10]:
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
device

'mps'

In [11]:
## Training Parameters
learning_rate = 3e-4
batch_size = 64
max_iters = 5000 ## Maximum training iterations
eval_interval = 200 ## Evaluate model every 'eval_interval' iterations in the training loop
eval_iters = 100 ## When evaluating, approximate loss using 'eval_iters' batches


## Architecture Parameters
max_vocab_size = 256  # Maximum vocabulary size
vocab_size = max_vocab_size # Real vocabulary size (e.g. BPE has a variable length, so it can be less than the max vocab size)
block_size = 16  # Context length for predictions
n_embed = 32 # Embedding Size
num_heads = 2 # Number of heads in multi-head attention
n_layer = 2 # Number of blocks
ff_scale_factor = 4
dropout = 0.0


head_size = n_embed//num_heads
assert (num_heads*head_size) ==n_embed

After the parameter setup, define a function `plot_embeddings` that visualizes the learned embeddings in 3D space using matplotlib.

In [None]:
def plot_embeddings(my_embeddings,name,vocab):

    fig = plt.figure()
    ax = fig.add_subplot(111,projection = '3d')

    # plot the data points
    ax.scatter(my_embeddings[:,0], my_embeddings[:,1], my_embeddings[:,2])


    # label the points
    for j,label in en

## Program for Literal Translation


In [30]:
dictionary = {
    'le': 'the',
    'chat': 'cat',
    'est': "is",
    'sous': "under",
    'la': 'the',
    'table': 'table'
}

In [13]:
def tokenize(text):
    return text.split()

def translate(sentence):

    out = '' # Initialize the output string
    for token in tokenize(sentence):
        out += dictionary[token]+ " "

    return out.strip() # Return the translated sentence, stripping any extra whitespace

In [15]:
translate("le chat est sous la table")

'the cat is under the table'

In [20]:
def find_closest_key(query):

    """Compute the Levenshtein distance between the query and each key in the dictionary and return the closest key.
    The Levenshtein distance is the minimum number of single-character edits required to change one word into the other."""

    closest_key,min_dist = None,float('inf') # Start with no closest key and an infinite minimum distance

    for key in dictionary.keys():

        dist = distance(query,key)

        if dist<min_dist:
            min_dist,closest_key = dist,key

    return closest_key

def translet(sentence):

    """This function tokenizes the input sentence into words and finds the closest translation for each word.
    """

    out = "" # Initialize the output string
    for query in tokenize(sentence):
        key =  find_closest_key(query)
        out +=dictionary[key] +' '
    return out.strip()

                        

In [21]:
translet('tables')

'table'

## Define Vocabularies

In [23]:
# Create and sort the input vocabulary from the dictionary's keys
vocabulary_in = sorted(list((set(dictionary.keys()))))

# Display the size and the sorted vocabulary for the input language
print(f"Vocabulary input {len(vocabulary_in)}: {vocabulary_in}")

# Create and sort the output vocabulary from the dictionary's values
vocabulary_out  = sorted(list(set(dictionary.values())))

# Display the size and the sorted vocabulary for the output language
print(f"Vocabulary output {len(vocabulary_out)}: {vocabulary_out}")


Vocabulary input 6: ['chat', 'est', 'la', 'le', 'sous', 'table']
Vocabulary output 5: ['cat', 'is', 'table', 'the', 'under']


In [24]:
len(vocabulary_in)

6

## Encode Tokens using 'one hot' encoding

In [28]:
def encode_one_hot(vocabulary):
    """Map each token in the vocabulary to a one-hot tensor of length len(vocabulary)."""

    vocabulary_size = len(vocabulary)

    one_hot = dict()

    for i, key in enumerate(vocabulary):

        # All zeros except for a 1 at the token's index
        one_hot_encod = torch.zeros(vocabulary_size)

        one_hot_encod[i] =1
        one_hot[key] = one_hot_encod

        print(f"{key}\t: {one_hot[key]}")

    return one_hot

In [31]:
one_hot = encode_one_hot(vocabulary_in)
one_hot

chat	: tensor([1., 0., 0., 0., 0., 0.])
est	: tensor([0., 1., 0., 0., 0., 0.])
la	: tensor([0., 0., 1., 0., 0., 0.])
le	: tensor([0., 0., 0., 1., 0., 0.])
sous	: tensor([0., 0., 0., 0., 1., 0.])
table	: tensor([0., 0., 0., 0., 0., 1.])


{'chat': tensor([1., 0., 0., 0., 0., 0.]),
 'est': tensor([0., 1., 0., 0., 0., 0.]),
 'la': tensor([0., 0., 1., 0., 0., 0.]),
 'le': tensor([0., 0., 0., 1., 0., 0.]),
 'sous': tensor([0., 0., 0., 0., 1., 0.]),
 'table': tensor([0., 0., 0., 0., 0., 1.])}

In [32]:
for k,v in one_hot.items():
    print(f"E_{{ {k} }} = {v}")

E_{ chat } = tensor([1., 0., 0., 0., 0., 0.])
E_{ est } = tensor([0., 1., 0., 0., 0., 0.])
E_{ la } = tensor([0., 0., 1., 0., 0., 0.])
E_{ le } = tensor([0., 0., 0., 1., 0., 0.])
E_{ sous } = tensor([0., 0., 0., 0., 1., 0.])
E_{ table } = tensor([0., 0., 0., 0., 0., 1.])


In [36]:
## Stacking the one-hot encoded vector for input vocabulary to form a tensor

k = torch.stack([one_hot[k] for k in one_hot.keys()])
print(k)

tensor([[1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1.]])


In [33]:
type(one_hot)

dict

In [35]:
one_hot.keys()

dict_keys(['chat', 'est', 'la', 'le', 'sous', 'table'])

In [25]:
one_hot_vector = torch.zeros(len(vocabulary_in)) # torch.zeros requires a size: one slot per input token

In [27]:
one_hot_vector[1] =1
one_hot_vector

tensor([0., 1., 0., 0., 0., 0.])
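
Because one-hot vectors are orthogonal, a dot product between a query and the stacked keys selects exactly one row, which turns the dictionary lookup into matrix multiplication. The sketch below illustrates this under that assumption; the names `K`, `V`, and `lookup` are introduced here and don't appear in the notebook.

```python
import torch

dictionary = {"le": "the", "chat": "cat", "est": "is",
              "sous": "under", "la": "the", "table": "table"}

vocabulary_in = sorted(dictionary.keys())
vocabulary_out = sorted(set(dictionary.values()))

# One-hot rows: identity matrices indexed by the sorted vocabularies
one_hot_in = torch.eye(len(vocabulary_in))
one_hot_out = torch.eye(len(vocabulary_out))

# K stacks the input-word vectors; V stacks the matching translations in the same row order
K = one_hot_in
V = torch.stack([one_hot_out[vocabulary_out.index(dictionary[w])] for w in vocabulary_in])


def lookup(word: str) -> str:
    q = one_hot_in[vocabulary_in.index(word)]  # query vector for the word
    scores = q @ K.T                           # 1 where query and key match, 0 elsewhere
    value = scores @ V                         # picks the row of V for the matching key
    return vocabulary_out[int(value.argmax())]


print(" ".join(lookup(w) for w in "le chat est sous la table".split()))
# the cat is under the table
```

Replacing the hard 0/1 scores with a softmax over learned query-key dot products gives the attention head defined next.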

In [5]:
class Head(nn.Module):

    def __init__(self):

        super().__init__()

        # Embedding layer to convert input token indices to vectors of fixed size

        self.embedding = nn.Embedding(vocab_size,n_embed)

        # Linear layers to compute the queries, keys, and values from the embeddings
        self.key = nn.Linear(n_embed,n_embed,bias = False)
        self.query = nn.Linear(n_embed,n_embed,bias = False)
        self.value = nn.Linear(n_embed,n_embed,bias = False)


    def attention(self,x):

        embedded_x = self.embedding(x)

        k = self.key(embedded_x)
        q = self.query(embedded_x)
        v = self.value(embedded_x)

        # Attention scores
        # key shape: [batch_size, seq_len, embed_dim]
        # query shape: [batch_size, seq_len, embed_dim]

        # w holds the scaled dot product of every query with every key
        # w shape: [batch_size, seq_len, seq_len]
        # i.e. the attention score of each token against every other token, including itself.
        
        w = q @ k.transpose(-2,-1) * k.shape[-1] ** -0.5 # transpose(-2,-1) swaps the last two dimensions
        w = torch.nn.functional.softmax(w,dim = -1) # softmax across the last dimension, so each query's weights sum to 1
        return embedded_x,k,q,v,w

    def forward(self,x):

        embedded_x = self.embedding(x)

        k = self.key(embedded_x)
        q = self.query(embedded_x)
        v = self.value(embedded_x)

        # Attention score
        w = q @ k.transpose(-2,-1) * k.shape[-1] ** -0.5
        w = nn.functional.softmax(w,dim = -1) # normalize over the last (key) dimension

        # add weighted values
        out = w@v

        return out
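
The computation inside `forward` can be isolated as a small function. This is a minimal sketch rather than the notebook's code: `scaled_dot_product` is a hypothetical helper, the inputs are random, and the softmax is taken over the last (key) dimension so that it also works on batched input.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """q, k, v: [..., seq_len, dim]. Returns (output, attention weights)."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # [..., seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)                     # each query's weights sum to 1
    return weights @ v, weights


torch.manual_seed(0)
q, k, v = (torch.randn(2, 5, 3) for _ in range(3))  # batch of 2, 5 tokens, dim 3
out, w = scaled_dot_product(q, k, v)

print(out.shape)       # torch.Size([2, 5, 3])
print(w.shape)         # torch.Size([2, 5, 5])
print(w.sum(dim=-1))   # every entry is 1 up to rounding
```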

## Dataset Definition

In [77]:
dataset = [
    (1,"Introduction to NLP"),
    (2,"Basics of PyTorch"),
    (1,"NLP Techniques for Text Classification"),
    (3,"Named Entity Recognition with PyTorch"),
    (3,"Sentiment Analysis using PyTorch"),
    (3,"Machine Translation with PyTorch"),
    (1," NLP Named Entity,Sentiment Analysis,Machine Translation "),
    (1," Machine Translation with NLP "),
    (1," Named Entity vs Sentiment Analysis  NLP "),
    (3,"he painted the car red"),
    (1,"he painted the red car")
    ]

## Tokenization Setup

In [78]:
tokenizer = get_tokenizer('basic_english')

tokens = [tokenizer(lines) for _,lines in dataset]

In [79]:
tokens

[['introduction', 'to', 'nlp'],
 ['basics', 'of', 'pytorch'],
 ['nlp', 'techniques', 'for', 'text', 'classification'],
 ['named', 'entity', 'recognition', 'with', 'pytorch'],
 ['sentiment', 'analysis', 'using', 'pytorch'],
 ['machine', 'translation', 'with', 'pytorch'],
 ['nlp',
  'named',
  'entity',
  ',',
  'sentiment',
  'analysis',
  ',',
  'machine',
  'translation'],
 ['machine', 'translation', 'with', 'nlp'],
 ['named', 'entity', 'vs', 'sentiment', 'analysis', 'nlp'],
 ['he', 'painted', 'the', 'car', 'red'],
 ['he', 'painted', 'the', 'red', 'car']]

In [80]:
def yield_tokens(vocabulary):
    for _,lines in vocabulary:
        yield tokenizer(lines)

vocabulary = iter(dataset)

tokens_data = yield_tokens(vocabulary)

In [81]:
tokens_func = [token for token in tokens_data]
tokens_func

[['introduction', 'to', 'nlp'],
 ['basics', 'of', 'pytorch'],
 ['nlp', 'techniques', 'for', 'text', 'classification'],
 ['named', 'entity', 'recognition', 'with', 'pytorch'],
 ['sentiment', 'analysis', 'using', 'pytorch'],
 ['machine', 'translation', 'with', 'pytorch'],
 ['nlp',
  'named',
  'entity',
  ',',
  'sentiment',
  'analysis',
  ',',
  'machine',
  'translation'],
 ['machine', 'translation', 'with', 'nlp'],
 ['named', 'entity', 'vs', 'sentiment', 'analysis', 'nlp'],
 ['he', 'painted', 'the', 'car', 'red'],
 ['he', 'painted', 'the', 'red', 'car']]

In [82]:
## Build the Vocabulary

In [83]:
vocab = build_vocab_from_iterator(tokens_func,specials = ["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [84]:
vocab.get_itos()

['<unk>',
 'nlp',
 'pytorch',
 'analysis',
 'entity',
 'machine',
 'named',
 'sentiment',
 'translation',
 'with',
 ',',
 'car',
 'he',
 'painted',
 'red',
 'the',
 'basics',
 'classification',
 'for',
 'introduction',
 'of',
 'recognition',
 'techniques',
 'text',
 'to',
 'using',
 'vs']

## Text Processing Pipeline

In [86]:
def text_pipeline(x):
    """Converts a text string to a list of token indices"""

    return vocab(tokenizer(x))
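
For readers without `torchtext`, the same idea can be sketched in plain Python: map each token to an index and fall back to the `<unk>` index for unseen tokens. The whitespace tokenizer and the helpers `build_vocab` and `simple_pipeline` are simplifications introduced here, not the `torchtext` implementations.

```python
from collections import Counter


def simple_tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (much cruder than 'basic_english')."""
    return text.lower().split()


def build_vocab(sentences: list[str], unk: str = "<unk>") -> dict[str, int]:
    counts = Counter(tok for s in sentences for tok in simple_tokenize(s))
    # Most frequent tokens first, ties broken alphabetically; index 0 is reserved for <unk>
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([unk] + ordered)}


def simple_pipeline(text: str, stoi: dict[str, int], unk: str = "<unk>") -> list[int]:
    """Convert a string into token indices, using the <unk> index for unknown tokens."""
    return [stoi.get(tok, stoi[unk]) for tok in simple_tokenize(text)]


stoi = build_vocab(["he painted the car red", "he painted the red car"])
print(stoi)
print(simple_pipeline("he painted the bike red", stoi))  # "bike" maps to index 0
```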

## Hyperparameter definition

In [85]:
vocab_size = len(vocab)
n_embed = 3

# Create the attention head with the integrated embedding layer
attention_head = Head()


## Dummy data for testing

In [22]:
my_tokens = "he painted the car red"

# Apply the text pipeline to the sentence to get token indices
input_data = torch.tensor(text_pipeline(my_tokens),dtype = torch.long)

# print out the shape and the token indices tensor
print(input_data.shape)
print(input_data)

torch.Size([5])
tensor([12, 13, 15, 11, 14])


In [27]:
embedded_x,k,q,v,w = attention_head.attention(input_data)

# print the size of the resulting embedded vector for verification
print(embedded_x.shape)
print(f"embedded_x:\n {embedded_x}")
print(k.shape)
print(f"key:\n {k}")
print(q.shape)
print(f"Query:\n {q}")
print(v.shape)
print(f"Value:\n{v}")
print(w.shape)
print(f"Attention weight:\n{w}")

torch.Size([5, 3])
embedded_x:
 tensor([[-0.3683, -0.6410, -0.6404],
        [-0.1384,  0.1150, -0.0478],
        [ 0.5578, -1.0002, -0.1047],
        [ 0.3659, -0.0492,  0.6612],
        [ 0.0702,  1.8551,  0.4558]], grad_fn=<EmbeddingBackward0>)
torch.Size([5, 3])
key:
 tensor([[ 0.2685,  0.2143, -0.2498],
        [ 0.0078, -0.0442, -0.0450],
        [ 0.1122,  0.3683,  0.1077],
        [-0.2823, -0.1134,  0.1527],
        [-0.2070, -0.4332,  0.2876]], grad_fn=<MmBackward0>)
torch.Size([5, 3])
Query:
 tensor([[-0.4027,  0.2417,  0.3353],
        [ 0.1054, -0.0782,  0.1150],
        [-0.8143,  0.5693, -0.4430],
        [ 0.0641,  0.1180, -0.5265],
        [ 1.0843, -0.9183,  0.2100]], grad_fn=<MmBackward0>)
torch.Size([5, 3])
Value:
tensor([[-0.1004,  0.0797,  0.1339],
        [-0.0796, -0.0511,  0.0192],
        [ 0.3836,  0.3278, -0.0294],
        [ 0.1975,  0.1075, -0.1316],
        [-0.2107, -0.4564, -0.0877]], grad_fn=<MmBackward0>)
torch.Size([5, 5])
Attention weight:
tensor([[0

## Positional Encoding

In [29]:
position = torch.arange(0,vocab_size,dtype = torch.float).unsqueeze(1)
position,position.shape

(tensor([[ 0.],
         [ 1.],
         [ 2.],
         [ 3.],
         [ 4.],
         [ 5.],
         [ 6.],
         [ 7.],
         [ 8.],
         [ 9.],
         [10.],
         [11.],
         [12.],
         [13.],
         [14.],
         [15.],
         [16.],
         [17.],
         [18.],
         [19.],
         [20.],
         [21.],
         [22.],
         [23.],
         [24.],
         [25.],
         [26.]]),
 torch.Size([27, 1]))

In [30]:
# Retrieve the list of words from the vocabulary object
vocab_list = list(vocab.get_itos())

In [31]:
vocab_list

['<unk>',
 'nlp',
 'pytorch',
 'analysis',
 'entity',
 'machine',
 'named',
 'sentiment',
 'translation',
 'with',
 ',',
 'car',
 'he',
 'painted',
 'red',
 'the',
 'basics',
 'classification',
 'for',
 'introduction',
 'of',
 'recognition',
 'techniques',
 'text',
 'to',
 'using',
 'vs']

In [32]:
len(vocab_list)

27

In [33]:
for idx in range(vocab_size):
    word = vocab_list[idx] # get the word from the vocabulary list at the current index
    pos = position[idx][0].item() # Extract the numerical value of the position index from the tensor
    print(f"Word: {word}, position index: {pos}")
    

Word: <unk>, position index: 0.0
Word: nlp, position index: 1.0
Word: pytorch, position index: 2.0
Word: analysis, position index: 3.0
Word: entity, position index: 4.0
Word: machine, position index: 5.0
Word: named, position index: 6.0
Word: sentiment, position index: 7.0
Word: translation, position index: 8.0
Word: with, position index: 9.0
Word: ,, position index: 10.0
Word: car, position index: 11.0
Word: he, position index: 12.0
Word: painted, position index: 13.0
Word: red, position index: 14.0
Word: the, position index: 15.0
Word: basics, position index: 16.0
Word: classification, position index: 17.0
Word: for, position index: 18.0
Word: introduction, position index: 19.0
Word: of, position index: 20.0
Word: recognition, position index: 21.0
Word: techniques, position index: 22.0
Word: text, position index: 23.0
Word: to, position index: 24.0
Word: using, position index: 25.0
Word: vs, position index: 26.0


In [None]:
# initialize a matrix of zeros with dimensions [vocab_size, n_embed]
# this will be used to hold the positional encodings for each word in the vocabulary

pe = torch.zeros(vocab_size,n_embed)


In [38]:
class Head(nn.Module):

    def __init__(self):

        super().__init__()

        # vocab_size and n_embed are the module-level hyperparameters set earlier
        self.embedding = nn.Embedding(vocab_size,n_embed)
    
        self.key = nn.Linear(n_embed, n_embed, bias = False)
        self.query = nn.Linear(n_embed ,n_embed, bias = False)
        self.value = nn.Linear(n_embed , n_embed ,bias = False)
    


    def attention(self,x):
        embed_x = self.embedding(x)
        k = self.key(embed_x)
        q = self.query(embed_x)
        v = self.value(embed_x)
        

        ## attention score
        # k_Shape : [batch_size, seq_len, embed_dim]
        w = q @ k.transpose(-2,-1) * k.shape[-1] ** -0.5
        w = torch.nn.functional.softmax(w, dim = -1) # normalize over the last (key) dimension
    
        return embed_x,k,q,v,w


    def forward(self,x):
        embed_x = self.embedding(x)
        k = self.key(embed_x)
        q = self.query(embed_x)
        v = self.value(embed_x)
        

        ## attention score
        # k_Shape : [batch_size, seq_len, embed_dim]
        w = q @ k.transpose(-2,-1) * k.shape[-1] ** -0.5
        w = torch.nn.functional.softmax(w, dim = -1) # normalize over the last (key) dimension

        # add weight values
        out = w @ v
        return out
    

    

In [34]:
a = torch.rand(size = (2,3,5))
a

tensor([[[0.5865, 0.8069, 0.6474, 0.9974, 0.4893],
         [0.0725, 0.8308, 0.6384, 0.0457, 0.5676],
         [0.4795, 0.1701, 0.9509, 0.5975, 0.7179]],

        [[0.6809, 0.2962, 0.5539, 0.7785, 0.8854],
         [0.7682, 0.0020, 0.0525, 0.7021, 0.0929],
         [0.8760, 0.6301, 0.4544, 0.3794, 0.1651]]])

In [35]:
a.shape

torch.Size([2, 3, 5])

In [36]:
b = a.transpose(-2,-1)
b.shape

torch.Size([2, 5, 3])

In [37]:
b

tensor([[[0.5865, 0.0725, 0.4795],
         [0.8069, 0.8308, 0.1701],
         [0.6474, 0.6384, 0.9509],
         [0.9974, 0.0457, 0.5975],
         [0.4893, 0.5676, 0.7179]],

        [[0.6809, 0.7682, 0.8760],
         [0.2962, 0.0020, 0.6301],
         [0.5539, 0.0525, 0.4544],
         [0.7785, 0.7021, 0.3794],
         [0.8854, 0.0929, 0.1651]]])

## Positional Encoding

In [72]:
class PositionalEncoding(nn.Module):

    """Positional encoding module injects some information about the relative or absolute position of the tokens in the sequence
    """

    def __init__(self,n_embed,vocab_size, dropout = 0.1):

        super().__init__()

        # One row per position. The table is sized with vocab_size, which here only acts as
        # an upper bound on the sequence length; n_embed and dropout are not used below.

        position = torch.arange(0, vocab_size, dtype = torch.float).unsqueeze(dim = 1)

        ## Build three fixed sinusoidal features per position (periods 25, 25 and 5; 3.14 approximates pi),
        ## so this encoding only matches embeddings of size 3
        pe = torch.cat((torch.cos(2 * 3.14 * position/25),torch.sin(2*3.14*position/25),torch.sin(2*3.14*position/5)),dim =1)

        # Register pe as a buffer (not a parameter, so it's not updated during training)
        self.register_buffer('pe',pe)


    def forward(self,x):

        # add the positional encoding to each embedding vector; x is [seq_len, embed_dim] in this notebook
        # pe is a registered buffer, and doesn't require gradients

        pos = x + self.pe[:x.size(0),:]
        
        
        return pos

class Head(nn.Module):

    def __init__(self,embed_dim, vocab_size):

        super().__init__()

        # embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # positional encoding layer
        self.pos_encoder = PositionalEncoding(embed_dim,vocab_size)

        ## Layers to transform the position encoded embeddings into queries, keys and values

        self.key =nn.Linear(embed_dim, embed_dim, bias = False)
        self.query =nn.Linear(embed_dim, embed_dim, bias = False)
        self.value =nn.Linear(embed_dim, embed_dim, bias = False)


    def forward(self,x):
        
        """Self Attention Head"""

        embedded_x = self.embedding(x)

        # Add the positional embedding
        p_embedded = self.pos_encoder(embedded_x)

        q = self.query(p_embedded)
        k = self.key(p_embedded) ## key: [batch_size, seq_len, embed_dim]
        v = self.value(p_embedded)

        w = q @ k.transpose(-2,-1) * k.shape[-1] ** -0.5 # Queries * Keys / normalization

        # apply the softmax function over the last (key) dimension to turn the attention scores into probabilities
        w = torch.nn.functional.softmax(w,dim =-1)

        # Multiply the attention weights with values to get the output
        out = w @ v
        return out
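
The `PositionalEncoding` class above hard-codes three sinusoids, so it only fits an embedding size of 3. For comparison, here is a sketch of the widely used sinusoidal scheme from the original Transformer paper, which works for any even embedding size. It's a different technique from the one in this notebook; `sinusoidal_encoding` and `max_len` are names chosen for this example.

```python
import math
import torch


def sinusoidal_encoding(max_len: int, n_embed: int) -> torch.Tensor:
    """Return a [max_len, n_embed] table; n_embed is assumed to be even."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
    # Frequencies decay geometrically across the embedding dimensions
    div_term = torch.exp(torch.arange(0, n_embed, 2, dtype=torch.float)
                         * (-math.log(10000.0) / n_embed))             # [n_embed / 2]
    pe = torch.zeros(max_len, n_embed)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe


pe = sinusoidal_encoding(max_len=16, n_embed=4)
x = torch.zeros(5, 4)                  # stand-in for 5 token embeddings
print((x + pe[: x.size(0)]).shape)     # torch.Size([5, 4])
print(pe[0])                           # position 0: sin terms are 0, cos terms are 1
```

Unlike raw position indices, every value stays in [-1, 1], so the encoding doesn't swamp the embeddings as positions grow.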
        
        
        
        

    

In [92]:
# instantiate the head class with embedding dimension and vocabulary size as parameters
transformer = Head(n_embed,vocab_size)

# Pass the input data through the transformer model to obtain the output data

out = transformer(input_data)

# Print the shape of the output tensor
# The shape will provide insight into how the data has been transformed through the model
print(f"output shape:,{out.shape}")

print(f"Output: {out}")


output shape:,torch.Size([5, 3])
Output: tensor([[-0.4442, -0.0455, -0.7745],
        [-0.4178, -0.0963, -0.7719],
        [-0.4369, -0.0753, -0.8054],
        [-0.4862,  0.0094, -0.8236],
        [-0.5000,  0.0320, -0.8273]], grad_fn=<MmBackward0>)


In [89]:
n_embed,input_data


(3, tensor([12, 13, 15, 11, 14]))

In [88]:
vocab_size

27

In [75]:
vocab.get_itos()

['<unk>',
 '.',
 'a',
 'cars',
 'different',
 'driving',
 'give',
 'i',
 'level',
 'love',
 'me',
 'of',
 'pleasure',
 'them']

## **Transformer in PyTorch**

In [93]:
transformer_model = nn.Transformer(nhead = 16, num_encoder_layers=12)

In [94]:
src = torch.rand((10,32,512)) # seq_len, batch_size, emb_dim (nn.Transformer defaults to batch_first=False)
trg = torch.rand((20,32,512)) # target: seq_len 20, same batch_size and emb_dim

In [95]:
out = transformer_model(src,trg)

In [98]:
out.argmax(dim = 1)

tensor([[28,  7, 21,  ...,  7, 17, 24],
        [ 4,  2,  6,  ..., 26, 28, 31],
        [ 6, 31,  7,  ..., 23, 11,  2],
        ...,
        [ 0, 11, 14,  ..., 13, 18,  2],
        [14, 28,  8,  ..., 16, 17, 24],
        [ 2, 21, 14,  ..., 13,  4, 11]])
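
With the default `batch_first=False`, `nn.Transformer` expects tensors shaped `(seq_len, batch_size, d_model)`, and the output follows the target's shape. The sketch below checks this on a deliberately small model; the sizes are arbitrary choices for illustration, not the notebook's settings.

```python
import torch
import torch.nn as nn

small_model = nn.Transformer(d_model=16, nhead=2,
                             num_encoder_layers=1, num_decoder_layers=1,
                             dim_feedforward=32)

src = torch.rand(10, 4, 16)   # source: 10 positions, batch of 4, d_model 16
tgt = torch.rand(20, 4, 16)   # target: 20 positions, same batch size and d_model

out = small_model(src, tgt)
print(out.shape)              # torch.Size([20, 4, 16])
```

By the same layout, `argmax(dim = 1)` in the cell above reduces over the batch axis, which is why its values stay below 32; `dim = -1` would reduce over the feature axis instead.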

## MultiHead Attention

In [100]:
# embedding_dim
embed_dim = 4

# Number of attention heads
num_heads = 2

print(f"should be zero:{embed_dim%num_heads}")

# Initialize Multihead Attention
multihead_attention = nn.MultiheadAttention(embed_dim = embed_dim , num_heads=num_heads)

should be zero:0


In [101]:
seq_len = 10 # 10 words per sentence
batch_size = 5 # 5 sentences in a batch
query = torch.rand((seq_len,batch_size, embed_dim))
key = torch.rand((seq_len,batch_size, embed_dim))
value = torch.rand((seq_len,batch_size,embed_dim))

# Perform multi head attention
attn_output, _ = multihead_attention(query,key,value)

print(f"Attention Output Shape: {attn_output.shape}")

Attention Output Shape: torch.Size([10, 5, 4])
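
The second value returned by `nn.MultiheadAttention`, discarded above as `_`, holds the attention weights. A short sketch, assuming the same sizes as in the cell above and PyTorch's default of averaging the weights over heads:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 4, 2
seq_len, batch_size = 10, 5

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
x = torch.rand(seq_len, batch_size, embed_dim)

# Self-attention: the same tensor serves as query, key, and value
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)          # torch.Size([10, 5, 4])
print(attn_weights.shape)         # torch.Size([5, 10, 10]): (batch, query position, key position)
print(attn_weights.sum(dim=-1))   # each row sums to 1 up to rounding
```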


In [102]:
# embedding dimension
embed_dim = 4

# Number of attention head
num_heads = 2

# Checking if the embed_dim is divisible by num_heads or not
print(f"Should be zero: {embed_dim%num_heads}")

# Number of encoder layers
num_layers = 6

# Initialize the encoder layer with specified embedding dimension and number of heads
encoder_layer = nn.TransformerEncoderLayer(d_model = embed_dim,
                                          nhead = num_heads)

## Build the transformer encoder by stacking the encoder layer num_layers (6) times
transformer_encoder = nn.TransformerEncoder(encoder_layer,num_layers=num_layers)


Should be zero: 0


Let's now test it with a random input

In [103]:
# Define sequence length as 10 and batch_size as 5 for the input data
seq_len = 10
batch_size = 5

# Generate random input tensor to simulate input embeddings for the transformer encoder
x = torch.rand((seq_len,batch_size,embed_dim))

# Apply the transformer encoder to the input
encoded = transformer_encoder(x)

print(f"Encoded Tensor Shape: {encoded.shape}")

Encoded Tensor Shape: torch.Size([10, 5, 4])


Next, a larger encoder with the following configuration:

* embedding size = 240
* number of layers = 12
* number of attention heads = 12

In [104]:
embed_dim = 240
num_layers = 12
num_heads = 12
encoder_layer = nn.TransformerEncoderLayer(d_model = embed_dim,
                                          nhead=num_heads)

encoder = nn.TransformerEncoder(encoder_layer,num_layers=num_layers)

In [106]:
seq_len = 20
batch_size = 1

src = torch.rand((seq_len,batch_size,embed_dim))

output = encoder(src)

output.shape

torch.Size([20, 1, 240])

In [59]:
text = 'I love cars. Driving them give me a different level of pleasure'
tokenizer = get_tokenizer('basic_english')
tokens = tokenizer(text)

vocab = build_vocab_from_iterator([tokens], specials = ["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [60]:
len(list(vocab.get_itos()))

14

In [87]:
vocab.get_stoi()

{'using': 25,
 'introduction': 19,
 'classification': 17,
 'for': 18,
 'basics': 16,
 'the': 15,
 'text': 23,
 'recognition': 21,
 'to': 24,
 'machine': 5,
 'he': 12,
 'painted': 13,
 'car': 11,
 'with': 9,
 'translation': 8,
 'entity': 4,
 'sentiment': 7,
 'vs': 26,
 'red': 14,
 'nlp': 1,
 ',': 10,
 'named': 6,
 'techniques': 22,
 '<unk>': 0,
 'of': 20,
 'analysis': 3,
 'pytorch': 2}

In [62]:
embed_dim = 3
embed_vocab = nn.Embedding(len(vocab),embed_dim)

In [63]:
embed_vocab

Embedding(14, 3)

In [65]:
embed_vocab.weight

Parameter containing:
tensor([[ 0.4229, -1.6996,  0.1708],
        [ 0.0846, -0.6866,  1.8081],
        [-0.3060, -0.4981, -0.2505],
        [-0.2914,  1.7543,  1.2794],
        [-0.8538,  0.1899, -1.2747],
        [ 0.9335, -1.0490, -0.3024],
        [-0.2193,  0.3524, -0.4378],
        [-0.5833, -0.5402,  1.1046],
        [ 1.2563,  0.7371,  1.4163],
        [ 1.4141, -0.5095,  1.2078],
        [-0.7427,  1.2219,  0.2621],
        [ 0.5857,  1.1733, -2.3439],
        [-1.4810,  0.8754, -0.2977],
        [-0.8680, -0.1756, -1.2657]], requires_grad=True)

In [66]:
vocab_size = len(vocab)

pe = torch.zeros(vocab_size, embed_dim)

In [67]:
pe

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

In [70]:
position = torch.arange(0,vocab_size, dtype = torch.float).unsqueeze(dim =1)
position

tensor([[ 0.],
        [ 1.],
        [ 2.],
        [ 3.],
        [ 4.],
        [ 5.],
        [ 6.],
        [ 7.],
        [ 8.],
        [ 9.],
        [10.],
        [11.],
        [12.],
        [13.]])

In [71]:
pe = torch.cat((position, position, position),dim = 1)
pe

tensor([[ 0.,  0.,  0.],
        [ 1.,  1.,  1.],
        [ 2.,  2.,  2.],
        [ 3.,  3.,  3.],
        [ 4.,  4.,  4.],
        [ 5.,  5.,  5.],
        [ 6.,  6.,  6.],
        [ 7.,  7.,  7.],
        [ 8.,  8.,  8.],
        [ 9.,  9.,  9.],
        [10., 10., 10.],
        [11., 11., 11.],
        [12., 12., 12.],
        [13., 13., 13.]])