### Practical Example of Transformers

In [79]:
# libraries
import math
import torch 
import tokenizers
import pandas as pd
import torch.nn.functional as F
from transformers import BertTokenizer, AutoTokenizer, AutoModel

In [66]:
example = "I think the Transformers movie is great and it reminds me of my childhood."

##### Brief lesson on Tokenizers

It is the process of breaking down texts into smaller units (tokens). These tokens can be words, subwords, and even characters.

In [67]:
# tokenizer, assign unique id to each word, without using libraries
vocab = {w: i for i, w in enumerate(example.lower().split())}
print("vocab: ", vocab)
inverse_vocab = {i: w for w, i in vocab.items()}
print("inverse_vocab: ", inverse_vocab)
tok = torch.tensor([vocab[w] for w in example.lower().split()])
print("tokens: ", tok)

vocab:  {'i': 0, 'think': 1, 'the': 2, 'transformers': 3, 'movie': 4, 'is': 5, 'great': 6, 'and': 7, 'it': 8, 'reminds': 9, 'me': 10, 'of': 11, 'my': 12, 'childhood.': 13}
inverse_vocab:  {0: 'i', 1: 'think', 2: 'the', 3: 'transformers', 4: 'movie', 5: 'is', 6: 'great', 7: 'and', 8: 'it', 9: 'reminds', 10: 'me', 11: 'of', 12: 'my', 13: 'childhood.'}
tokens:  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])


In [68]:
# tokenizer, using library
lib_token = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
lib_tokenized = lib_token.encode(example)
print("tokens ", lib_tokenized.tokens)
print("Libray token ids ", lib_tokenized.ids)

tokens  ['[CLS]', 'i', 'think', 'the', 'transformers', 'movie', 'is', 'great', 'and', 'it', 'reminds', 'me', 'of', 'my', 'childhood', '.', '[SEP]']
Libray token ids  [101, 1045, 2228, 1996, 19081, 3185, 2003, 2307, 1998, 2009, 15537, 2033, 1997, 2026, 5593, 1012, 102]


#### Slight Deviation, but very relevant (What is [CLS] and [SEP]? )

[CLS] and [SEP] are special tokens that are found in BERT (reads L-R and vice verssa), they mean classification and separator. The [CLS] token is used for the context-aware representation of a word within a sentence, rather than its isolated meaning. 

[CLS] - Beginning and summary positiion of the sentence

[SEP] - End of a single sentence, or to seperate 2 sentences in a pair

Recall fromm the slides that BERT is a good example of an ENCODER-ONLY model as explained earlier, this ENCODER-ONLY layers can be fed directly into a neural-network and it would perform well in task like sentiment analysis (which is a classification tasks). 

Once the sentence ("I think the transformers movie are great") is passed to the BERT Transformer layers, the layers use Attention mechanisms to allow each token to attend to every other token in the sequence. This makes each token vector to capture richer and more complex relationships.

It is at the first position in the sequence, because it captures a summary representation of the entire sentence after passing through all the transformers layers. By the end, the final vector respresentation of the CLS would have seen and know the context of the whole sentence.

There are practical examples of this in the projects

#### Brief lesson on Embeddings

In [69]:
# tokenizer, using library and pretrained embeddings
lib_token = AutoTokenizer.from_pretrained("bert-base-uncased")
lib_model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputt = lib_token(example, return_tensors="pt")
print("library input ids: ", inputt['input_ids'])
out = lib_model(**inputt)
print("library output last hidden state: ", out)

library input ids:  tensor([[  101,  1045,  2228,  1996, 19081,  3185,  2003,  2307,  1998,  2009,
         15537,  2033,  1997,  2026,  5593,  1012,   102]])
library output last hidden state:  BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0213, -0.1722, -0.0661,  ..., -0.1890,  0.2644,  0.3204],
         [ 0.2207, -0.0398, -0.4240,  ...,  0.0933,  0.9360,  0.4135],
         [ 0.5625,  0.4415, -0.3280,  ..., -0.6412,  0.3912, -0.4904],
         ...,
         [-0.5079,  0.0307, -0.7063,  ..., -0.1720, -0.0915,  0.0143],
         [ 0.6181, -0.0887, -0.0525,  ...,  0.2632, -0.4169, -0.3356],
         [ 0.5937,  0.0210,  0.1422,  ...,  0.2542, -0.4829, -0.3218]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.8914, -0.3729, -0.8648,  0.6573,  0.7791, -0.2078,  0.6093,  0.1718,
         -0.8668, -0.9999, -0.5305,  0.9725,  0.9711,  0.2216,  0.8622, -0.6204,
         -0.3324, -0.5542,  0.2032,  0.0593,  0.5623,  1.0000, -0.1269,  0.2175,


In [70]:
tokens = lib_token.convert_ids_to_tokens(inputt["input_ids"][0])
print("library tokens: ", tokens)
attention = out.attentions[-1].mean(dim=1)[0]

library tokens:  ['[CLS]', 'i', 'think', 'the', 'transformers', 'movie', 'is', 'great', 'and', 'it', 'reminds', 'me', 'of', 'my', 'childhood', '.', '[SEP]']


In [71]:
df = pd.DataFrame(attention.detach().numpy(), index=tokens, columns=tokens)
print(df.round(5))

                [CLS]        i    think      the  transformers    movie  \
[CLS]         0.06400  0.02882  0.03441  0.07551       0.06283  0.04804   
i             0.01872  0.01916  0.01837  0.02005       0.01023  0.00992   
think         0.02251  0.01419  0.05490  0.01747       0.00470  0.00725   
the           0.02309  0.00729  0.01147  0.05147       0.01790  0.02902   
transformers  0.02872  0.01053  0.01125  0.02388       0.11720  0.03879   
movie         0.01925  0.00570  0.00730  0.02872       0.04509  0.10236   
is            0.01733  0.00704  0.01370  0.02586       0.00943  0.01584   
great         0.02111  0.00637  0.01341  0.02193       0.00921  0.01373   
and           0.03192  0.01238  0.01556  0.01084       0.01127  0.00950   
it            0.01728  0.00620  0.01001  0.02268       0.01444  0.01851   
reminds       0.02171  0.00735  0.01175  0.01176       0.00650  0.00639   
me            0.03091  0.01649  0.02268  0.01070       0.01073  0.00531   
of            0.02552  0.

In [72]:
torch.manual_seed(0)
vocab_size = len(vocab)
print("Vocabulary size: ", vocab_size)
# demension of the model
d_model = 8 

Vocabulary size:  14


In [73]:
# Random embeddings initialization
embed = torch.randn(vocab_size, d_model)
print("Embeddings: ", embed)    
# Get embeddings for the tokens
token_embed = embed[tok]
print("Token Embeddings: ", token_embed)
print("Token embeddings shape: ", token_embed.shape)

Embeddings:  tensor([[-1.1258e+00, -1.1524e+00, -2.5058e-01, -4.3388e-01,  8.4871e-01,
          6.9201e-01, -3.1601e-01, -2.1152e+00],
        [ 3.2227e-01, -1.2633e+00,  3.4998e-01,  3.0813e-01,  1.1984e-01,
          1.2377e+00,  1.1168e+00, -2.4728e-01],
        [-1.3527e+00, -1.6959e+00,  5.6665e-01,  7.9351e-01,  5.9884e-01,
         -1.5551e+00, -3.4136e-01,  1.8530e+00],
        [ 7.5019e-01, -5.8550e-01, -1.7340e-01,  1.8348e-01,  1.3894e+00,
          1.5863e+00,  9.4630e-01, -8.4368e-01],
        [-6.1358e-01,  3.1593e-02, -4.9268e-01,  2.4841e-01,  4.3970e-01,
          1.1241e-01,  6.4079e-01,  4.4116e-01],
        [-1.0231e-01,  7.9244e-01, -2.8967e-01,  5.2507e-02,  5.2286e-01,
          2.3022e+00, -1.4689e+00, -1.5867e+00],
        [-6.7309e-01,  8.7283e-01,  1.0554e+00,  1.7784e-01, -2.3034e-01,
         -3.9175e-01,  5.4329e-01, -3.9516e-01],
        [-4.4622e-01,  7.4402e-01,  1.5210e+00,  3.4105e+00, -1.5312e+00,
         -1.2341e+00,  1.8197e+00, -5.5153e-01],
   

#### Brief lesson on Simple Self Attention

In [74]:
# Initialize weights for Q, K, V
d_k = d_model
WQ = torch.randn(d_model, d_k)
print("WQ: ", WQ)
WK = torch.randn(d_model, d_k)
print("WK: ", WK)
WV = torch.randn(d_model, d_k)
print("WV: ", WV)

WQ:  tensor([[-0.2188, -2.4351, -0.0729, -0.0340,  0.9625,  0.3492, -0.9215, -0.0562],
        [-0.6227, -0.4637,  1.9218, -0.4025,  0.1239,  1.1648,  0.9234,  1.3873],
        [-0.8834, -0.4189, -0.8048,  0.5656,  0.6104,  0.4669,  1.9507, -1.0631],
        [-0.0773,  0.1164, -0.5940, -1.2439, -0.1021, -1.0335, -0.3126,  0.2458],
        [-0.2596,  0.1183,  0.2440,  1.1646,  0.2886,  0.3866, -0.2011, -0.1179],
        [ 0.1922, -0.7722, -1.9003,  0.1307, -0.7043,  0.3147,  0.1574,  0.3854],
        [ 0.9671, -0.9911,  0.3016, -0.1073,  0.9985, -0.4987,  0.7611,  0.6183],
        [ 0.3140,  0.2133, -0.1201,  0.3605, -0.3140, -1.0787,  0.2408, -1.3962]])
WK:  tensor([[-0.0661, -0.3584, -1.5616, -0.3546,  1.0811,  0.1315,  1.5735,  0.7814],
        [-1.0787, -0.7209,  1.4708,  0.2756,  0.6668, -0.9944, -1.1894, -1.1959],
        [-0.5596,  0.5335,  0.4069,  0.3946,  0.1715,  0.8760, -0.2871,  1.0216],
        [-0.0744, -1.0922,  0.3920,  0.5945,  0.6623, -1.2063,  0.6074, -0.5472],
     

In [75]:
# Dot Product Attentioin 
Q = token_embed @ WQ
print("Q: ", Q)
K = token_embed @ WK
print("K: ", K)
V = token_embed @ WV
print("V: ", V)

Q:  tensor([[ 0.1616,  2.7584, -2.6225,  1.2504, -1.2288,  1.5813, -1.1914,  1.5489],
        [ 1.5923, -2.4107, -4.8723,  0.4046,  0.6914, -1.3684,  0.0840, -0.5685],
        [ 0.5876,  5.9405, -1.3121,  1.2609, -0.9019, -5.0898,  0.3589, -6.1523],
        [ 0.9338, -3.6396, -3.4383,  1.3036,  1.0184,  0.7839, -1.1401,  1.5856],
        [ 1.1963,  1.1390,  0.3884,  0.0375, -0.3638, -1.2543,  0.0791,  0.4348],
        [-1.8314, -0.5894, -2.7677, -0.0492, -2.6213,  4.0686, -0.9983,  3.5586],
        [-0.9564,  0.4654,  1.6710, -0.4730,  0.9623,  1.0335,  3.7322,  0.9343],
        [-0.2259, -0.6480,  0.7992, -6.0051,  2.6603, -3.3969,  4.3647,  1.8787],
        [-1.0043, -0.9724, -5.2729, -2.5415, -2.7000,  0.1688,  3.5596,  0.8470],
        [ 1.7697,  3.6894, -2.7729, -0.1645, -3.3649, -1.7774,  0.6910, -0.9403],
        [ 1.1554, -0.2853,  3.2374,  0.1971,  0.9336, -0.3298, -0.2700,  1.7390],
        [ 0.6291,  1.0469,  7.4745, -3.0106, -0.9123,  0.2812, -1.0461,  3.9802],
        [ 0.

In [76]:
# Attention Scores, wee used the formula for the Attention as discussed in the slides
scores = (Q @ K.T) / (d_k ** 0.5)
attention_weights = F.softmax(scores, dim=-1)
output = attention_weights @ V
print("Ooutput: ", output)

Ooutput:  tensor([[ 4.5973e-01,  6.4362e-01, -4.0198e-02, -1.4335e+00, -1.7924e+00,
         -1.4687e+00, -1.0978e+00, -2.1365e+00],
        [-2.1062e+00, -7.6363e+00, -3.2941e+00, -1.3382e+00,  1.1331e+01,
         -2.2750e+00, -7.3850e-01,  4.3888e+00],
        [ 2.6113e-01,  2.5553e+00,  8.9947e-01, -2.2560e-01, -2.3838e+00,
          7.4905e-02,  7.0550e-01, -9.4588e-01],
        [-2.1039e+00, -7.6271e+00, -3.2885e+00, -1.3382e+00,  1.1315e+01,
         -2.2722e+00, -7.3533e-01,  4.3843e+00],
        [ 1.5734e+00,  9.6790e-01,  1.3899e+00, -8.3567e-01, -5.0847e+00,
          1.3496e+00, -1.7171e+00, -1.3990e+00],
        [-2.0462e+00, -7.4694e+00, -3.2554e+00, -1.2749e+00,  1.1102e+01,
         -2.2494e+00, -7.7797e-01,  4.2973e+00],
        [-2.1019e+00, -7.6255e+00, -3.2914e+00, -1.3395e+00,  1.1322e+01,
         -2.2810e+00, -7.3379e-01,  4.3944e+00],
        [-2.1072e+00, -7.6405e+00, -3.2958e+00, -1.3386e+00,  1.1336e+01,
         -2.2755e+00, -7.3998e-01,  4.3907e+00],
      

In [77]:
df = pd.DataFrame(attention_weights.detach().numpy(), columns=[inverse_vocab[i] for i in range(len(vocab))], index=[inverse_vocab[i] for i in range(len(vocab))])
# pd.set_option("display.float_format", "{:.2f}".format)
print(df.round(5))

                    i    think      the  transformers    movie       is  \
i             0.08135  0.69940  0.00006       0.20771  0.00013  0.00071   
think         0.00000  0.00008  0.00001       0.00003  0.00000  0.00007   
the           0.00001  0.01221  0.00000       0.20576  0.14167  0.00005   
transformers  0.00001  0.00005  0.00004       0.00002  0.00000  0.00081   
movie         0.43147  0.06218  0.00298       0.12072  0.01643  0.25209   
is            0.00407  0.00622  0.00941       0.00017  0.00000  0.00000   
great         0.00000  0.00001  0.00034       0.00000  0.00000  0.00000   
and           0.00000  0.00000  0.00000       0.00000  0.00000  0.00000   
it            0.00000  0.00000  0.00000       0.00000  0.00000  0.00000   
reminds       0.02035  0.49522  0.00160       0.42160  0.00289  0.00013   
me            0.29727  0.00038  0.00035       0.00171  0.00123  0.46527   
of            0.99135  0.00000  0.00000       0.00000  0.00000  0.00305   
my            0.00000  0.

#### Brief lesson on Simple Postional 

Transformers process all words in parallel, (part of the reason for the rise in GPUs). They do not know whta word comes first nor the position each word is in. Adding position to  each token embedding tells the model where the word sits in the sequence

In [81]:
def pose_encoding(T, d_model):
    pe = torch.zeros(T, d_model)
    for pos in range (T):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** ((2 * i) / d_model)))
            pe[pos, i+1] = math.cos(pos / (10000 ** ((2 * (i + 1)) / d_model)))

    return pe

token_embed = token_embed + pose_encoding(token_embed.size(0), d_model)
print("Token embeddings with Position encoding: ", token_embed)

Token embeddings with Position encoding:  tensor([[-1.1258,  0.8476, -0.2506,  1.5661,  0.8487,  2.6920, -0.3160, -0.1152],
        [ 2.0052,  0.7267,  0.3700,  2.3081,  0.1200,  3.2377,  1.1168,  1.7527],
        [ 0.4659,  0.2642,  0.6066,  2.7935,  0.5992,  0.4449, -0.3414,  3.8530],
        [ 1.0324,  1.3252, -0.1134,  2.1835,  1.3900,  3.5863,  0.9463,  1.1563],
        [-2.1272,  1.8737, -0.4127,  2.2484,  0.4405,  2.1124,  0.6408,  2.4412],
        [-2.0202,  2.5476, -0.1897,  2.0525,  0.5239,  4.3022, -1.4689,  0.4133],
        [-1.2319,  2.5235,  1.1753,  2.1778, -0.2291,  1.6082,  0.5433,  1.6048],
        [ 0.8678,  2.2737,  1.6609,  5.4105, -1.5298,  0.7659,  1.8197,  1.4485],
        [ 1.4095,  2.3134,  1.2706,  3.2898, -1.4766,  4.5672, -0.4731,  2.3356],
        [-0.8051,  0.6935, -0.3001,  1.5002, -1.0652,  3.1149, -0.1407,  2.8058],
        [-1.1814,  1.7677, -0.6386,  2.0008,  0.8439,  1.6000,  1.0395,  2.3582],
        [-2.2460,  3.2097, -1.6621,  1.9502, -1.0428,  1