# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>1 |</span></b> <b>BACKGROUND</b></div>


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

In this notebook, we'll look at the following components of the Transformer encoder:

<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    
<ul>
<li>Simple Attention</li>
<li>Multi-Head Self Attention</li>
<li>Feed Forward Layer</li>
<li>Normalisation</li>
<li>Skip Connection</li>
<li>Position Embeddings</li>
<li>Transformer Encoder</li>
<li>Classifier Head</li>
</ul> 
</div> 

<br>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

Encoder simply put:
- Converts a **series of tokens** into a **series of embedding vectors** (hidden state)
- The encoder (neural network) consists of **multiple layers** (**blocks**) stacked together

The encoder structure:
- Composed of multiple encoder layers (blocks) stacked next to each other (similar to CNN layer stacks)
- Each encoder block contains **multi-head self attention** & **fully connected feed forward layer** (for each input embedding)

Purpose of the Encoder
- Input tokens are encoded & modified into a form that **stores some contextual information** in the sequence

The example we'll use:

> the bark of a palm tree is very rough


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>CLASSIFICATION HEAD</span></b></p></div>

- Transformers can be utilised for various applications, so they are created in a base form
- If we want to utilise them for a specific task, we add an extra component **head** to the transformer
- In this example, we'll utilise it for **classification** purposes, and look at how we can combine the base with the **head**


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>2 |</span></b> <b>SIMPLE SELF ATTENTION</b></div>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>TYPES OF ATTENTION</span></b></p></div>

**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention</mark>**
 
- Mechanism which allows networks to assign **different weight distributions to each element** in a sequence 
- Elements in sequence - `token embeddings` (each token mapped to a vector of fixed dimension) (eg. BERT model - 768 dimensions)
 
 
**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>**

- Instead of using fixed embeddings for each token, we can use the whole sequence to **compute a weighted average** of the `embeddings`
- One can think of self-attention as a form of averaging
- A common form of `self-attention` is **scaled dot-product attention**


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>FOUR MAIN STEPS</span></b></p></div>


- Project each `token embedding` into three vectors **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>**
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** (nxn)

    - We determine how much the query & key vectors relate to each other using a similarity function
    - Similarity function for scaled dot-product attention - dot product
    - Queries & keys that are similar will have a large dot product & vice versa
    - Outputs from this step - attention scores
    
    
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weight</mark>** (wij)

    - dot products can produce arbitrarily large numbers, which would saturate the softmax
    - attention scores are first multiplied by a scaling factor to normalise their variance
    - Then normalised with softmax so that the weights for each query (each row of the matrix) sum to 1
    
    
- Update the token embeddings (hidden state)

    - multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">weights</mark>** by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vector
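
Written compactly, the four steps correspond to the usual scaled dot-product attention formula. The symbols are assumptions introduced here for illustration: $Q$, $K$ and $V$ are the matrices of query, key and value vectors, and $d_k$ is the dimension of the key vectors.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here $Q K^{\top}$ gives the attention scores, the division by $\sqrt{d_k}$ is the scaling factor, the softmax (taken over the keys for each query) gives the attention weights $w_{ij}$, and the final product with $V$ updates the token embeddings.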

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>SIMPLE ATTENTION FORMULATION</span></b></p></div>


- We'll look at a simple example, and summarise the attention mechanism in one function
- The `bert-base-uncased` model will be used to extract different model settings (eg. number of attention heads), so we will be building a similar model

<br>

##### **1. DOCUMENT TOKENISATION**

- Each token in the sentence has been mapped to a **unique identifier** from a **vocabulary** or **dictionary**
- We start off by using the `bert-base-uncased` pretrained tokeniser

In [79]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel

# load tokeniser and model
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# document we'll be using as an example
text = "the bark of a palm tree is very rough"

In [80]:
# Tokenise input (text)
inputs = tokenizer(text, 
                   return_tensors="pt",      # PyTorch tensors
                   add_special_tokens=False) # don't add [CLS], [SEP] tokens

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])


torch.Size([1, 9])

In [81]:
# Decode sequence
tokenizer.decode(inputs['input_ids'].tolist()[0])

'the bark of a palm tree is very rough'

At this point:

- `inputs.input_ids`: a tensor of token ids (each token mapped to its vocabulary id)
- Token embeddings are **independent of their context**
- **Homonyms** (same spelling, but different meaning) have the same representation

Role of subsequent attention layers:

- Mix the **token embeddings** to disambiguate & inform the representation of each token with the content of its context
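
To make the context-independence point concrete, here is a minimal sketch with a toy embedding table. The sizes and token ids are made up for illustration (id 3 stands in for "bark"); they are not taken from the BERT vocabulary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy embedding table: 10 token ids, 4 dimensions (illustrative sizes)
toy_emb = nn.Embedding(10, 4)

# Hypothetical token ids: id 3 stands in for "bark" in two different contexts
tree_sentence = torch.tensor([[1, 3, 5]])   # e.g. "the bark of ..."
dog_sentence = torch.tensor([[7, 2, 3]])    # e.g. "... loud dog bark"

vec_in_tree = toy_emb(tree_sentence)[0, 1]
vec_in_dog = toy_emb(dog_sentence)[0, 2]

# The lookup depends only on the token id, not on its neighbours
same_vector = torch.equal(vec_in_tree, vec_in_dog)
print(same_vector)
```

Both occurrences receive the same vector, which is why a later mixing step (attention) is needed to tell the two meanings apart.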

In [82]:
'''

Create an embedding layer

'''

from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)

print(config.hidden_size,"hidden size")
print(config.vocab_size,"vocabulary size")

# load sample embedding layer of size (30522, 768) -> same as bert-base
token_emb = nn.Embedding(config.vocab_size,
                         config.hidden_size)
token_emb

768 hidden size
30522 vocabulary size


Embedding(30522, 768)

##### **2. EMBEDDING VECTORS**


- Convert tokenised data into embedding vectors (768 dimensions) using a vocabulary of 30522 tokens
- Each entry of `input_ids` is **mapped to one of 30522 embedding vectors** stored in `nn.Embedding`, each with a size of 768
- Our output will be of shape [batch_size, seq_len, hidden_dim] after calling the embedding layer on `input_ids`

In [83]:
'''

Convert Tokens to Embedding Vectors
using the randomly initialised embedding layer created above

'''

inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 9, 768])

In [84]:
# 9 embedding vectors of 768 dimensions
inputs_embeds

tensor([[[ 0.6953,  0.1889,  0.1066,  ..., -0.3936, -0.7037,  0.0353],
         [-1.7725, -0.7538, -0.4660,  ..., -1.3029, -0.6142, -0.3187],
         [ 0.4180, -0.1114, -1.8086,  ...,  1.7038, -2.0903,  0.0084],
         ...,
         [-0.3142, -0.1601, -0.7460,  ...,  0.0048,  1.6216,  1.0151],
         [ 0.2993,  0.1934, -0.8340,  ..., -1.2851, -0.8688,  0.5341],
         [-1.8588,  0.5864, -0.1779,  ...,  0.3559,  1.8418, -0.1985]]],
       grad_fn=<EmbeddingBackward0>)

##### **3. QUERY, KEY, VALUE VECTORS**

- As the most simplistic case of attention, **we set them equal to one another**
- Attention mechanism with equal query and key vectors will assign a **very large score to identical words in the context** (diagonal component of matrix)

In [7]:
import torch
from math import sqrt

# setting them equal to one another
print("query and key components\n")
query = key = value = inputs_embeds
print('query size:',query.size())
dim_k = key.size(-1)   # hidden dimension 
print('key size:',key.transpose(1,2).size())

query and key components

query size: torch.Size([1, 9, 768])
key size: torch.Size([1, 768, 9])


##### **4. COMPUTE ATTENTION SCORES**

- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** using the **dot product as the similarity function**
- `torch.bmm` - batch matrix matrix product (as we work in batches during training)
- If we need to transpose a vector `vector.transpose(1,2)`
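
As a small illustration of what `torch.bmm` does with the transposed keys, the sketch below uses toy shapes chosen for readability and compares the batched product with an explicit loop over the batch.

```python
import torch

torch.manual_seed(0)

# Toy batch: 2 sequences, 3 tokens each, 4-dimensional vectors
q = torch.randn(2, 3, 4)
k = torch.randn(2, 3, 4)

# bmm multiplies matching batch entries: [2, 3, 4] x [2, 4, 3] -> [2, 3, 3]
scores_bmm = torch.bmm(q, k.transpose(1, 2))

# The same result computed one sample at a time
scores_loop = torch.stack([q[b] @ k[b].T for b in range(q.size(0))])

print(scores_bmm.shape)
print(torch.allclose(scores_bmm, scores_loop, atol=1e-6))
```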

In [8]:
# dot product & apply scaling
print("\ndot product (attention scores)")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()


dot product (attention scores)


torch.Size([1, 9, 9])

In [9]:
# attention scores
scores

tensor([[[28.2778, -0.0560, -0.5845,  0.8930, -0.1730, -0.6903,  1.4119,
           2.5140, -1.2024],
         [-0.0560, 27.7726,  0.7603,  0.1857, -0.3888, -0.9593,  0.7742,
          -2.4892, -0.0420],
         [-0.5845,  0.7603, 26.8805, -0.1206,  0.2067,  1.2373, -0.3023,
          -2.0496, -0.1639],
         [ 0.8930,  0.1857, -0.1206, 27.7590,  0.4918,  1.7533,  0.6470,
           0.0552, -0.0448],
         [-0.1730, -0.3888,  0.2067,  0.4918, 26.3990, -1.1320, -0.2163,
           0.4334,  0.4951],
         [-0.6903, -0.9593,  1.2373,  1.7533, -1.1320, 29.1548,  0.0852,
          -0.6562,  0.2810],
         [ 1.4119,  0.7742, -0.3023,  0.6470, -0.2163,  0.0852, 28.4623,
          -0.1245, -0.7585],
         [ 2.5140, -2.4892, -2.0496,  0.0552,  0.4334, -0.6562, -0.1245,
          28.0784, -0.5796],
         [-1.2024, -0.0420, -0.1639, -0.0448,  0.4951,  0.2810, -0.7585,
          -0.5796, 27.5657]]], grad_fn=<DivBackward0>)
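
The diagonal values of roughly 26 to 29 have a simple explanation. `nn.Embedding` initialises its weights from a standard normal distribution, so the dot product of a vector with itself is close to its dimension (768), and after dividing by sqrt(768) the score is close to sqrt(768), about 27.7. Dot products between different random vectors have zero mean and, after scaling, roughly unit variance. The sketch below checks this on random vectors, which are a stand-in for the embeddings and not the notebook's actual values.

```python
import torch
from math import sqrt

torch.manual_seed(0)

dim = 768
x = torch.randn(9, dim)            # stand-in for 9 freshly initialised embeddings

toy_scores = x @ x.T / sqrt(dim)   # scaled dot-product scores with query = key

diag_mean = toy_scores.diagonal().mean().item()
off_diag = toy_scores[~torch.eye(9, dtype=torch.bool)]

print(f"expected diagonal ~ {sqrt(dim):.2f}, observed mean {diag_mean:.2f}")
print(f"largest off-diagonal magnitude {off_diag.abs().max().item():.2f}")
```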

##### **5. COMPUTE ATTENTION WEIGHTS (SOFTMAX FUNCTION)**



- Created a 9x9 matrix of **attention scores** per sample in the batch (one row & one column per token)
- Apply the softmax for normalisation to get the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

In [10]:
import torch.nn.functional as F

print("softmax applied, attention weights :\n")
weights = F.softmax(scores, dim=-1)
print(weights.size())
print(weights)

print("\nsum of row values:\n")
weights.sum(dim=-1)

softmax applied, attention weights :

torch.Size([1, 9, 9])
tensor([[[1.0000e+00, 4.9519e-13, 2.9192e-13, 1.2792e-12, 4.4051e-13,
          2.6260e-13, 2.1492e-12, 6.4703e-12, 1.5736e-13],
         [8.2070e-13, 1.0000e+00, 1.8565e-12, 1.0451e-12, 5.8837e-13,
          3.3259e-13, 1.8825e-12, 7.2022e-14, 8.3224e-13],
         [1.1807e-12, 4.5303e-12, 1.0000e+00, 1.8776e-12, 2.6044e-12,
          7.2997e-12, 1.5656e-12, 2.7277e-13, 1.7979e-12],
         [2.1492e-12, 1.0595e-12, 7.7997e-13, 1.0000e+00, 1.4389e-12,
          5.0805e-12, 1.6804e-12, 9.2980e-13, 8.4132e-13],
         [2.8834e-12, 2.3237e-12, 4.2151e-12, 5.6058e-12, 1.0000e+00,
          1.1052e-12, 2.7612e-12, 5.2877e-12, 5.6243e-12],
         [1.0925e-13, 8.3488e-14, 7.5089e-13, 1.2580e-12, 7.0245e-14,
          1.0000e+00, 2.3725e-13, 1.1305e-13, 2.8856e-13],
         [1.7871e-12, 9.4449e-13, 3.2189e-13, 8.3169e-13, 3.5078e-13,
          4.7422e-13, 1.0000e+00, 3.8450e-13, 2.0397e-13],
         [7.8983e-12, 5.3048e-14, 8.2

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)
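
Which axis sums to 1 is set by the `dim` argument. A small sketch on a toy score matrix (values chosen for illustration) suggests that `dim=-1` normalises each row, that is, each query's weights over all keys, while the column sums are in general not 1.

```python
import torch
import torch.nn.functional as F

# Toy 3x3 score matrix for a single sample (illustrative values)
toy_scores = torch.tensor([[[2.0, 0.0, -1.0],
                            [0.5, 3.0, 0.0],
                            [0.0, 1.0, 1.0]]])

toy_weights = F.softmax(toy_scores, dim=-1)

row_sums = toy_weights.sum(dim=-1)   # sum over keys for each query
col_sums = toy_weights.sum(dim=-2)   # sum over queries for each key

print(row_sums)
print(col_sums)
```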

##### **6. UPDATE VALUES**

Multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>** matrix by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">values</mark>** vector

In [11]:
attn_outputs = torch.bmm(weights, value)
print(attn_outputs)
print(attn_outputs.shape)

tensor([[[ 1.1372, -0.1640, -0.5709,  ..., -0.3382, -0.5156, -0.9368],
         [ 1.6893,  0.7558,  0.0126,  ..., -1.8282, -0.0124, -0.5235],
         [-0.2811,  0.3767, -0.2336,  ...,  1.5930, -1.5876,  1.9867],
         ...,
         [-0.4095, -0.1513,  0.4900,  ..., -0.2493,  0.7508,  0.0295],
         [-1.8162,  0.8497,  1.4001,  ..., -0.5877, -0.4474,  0.9935],
         [-0.6127, -0.1215,  2.2074,  ...,  1.5790,  0.3022, -0.4228]]],
       grad_fn=<BmmBackward0>)
torch.Size([1, 9, 768])


Now we have a general function:
- Which takes the `query`, `key` & `value` tensors as inputs
- Calculates the scaled dot-product attention

In [12]:
'''

Scaled Dot Product Attention
scores = query*key.T / sqrt(dims)
weights = softmax(scores)
output = weights*value

'''

def sdp_attention(query, key, value):
    dim_k = query.size(-1) # dimension component
    sfact = sqrt(dim_k)     
    scores = torch.bmm(query, key.transpose(1,2)) / sfact
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
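
As a sanity check, the function can be compared against PyTorch's built-in `F.scaled_dot_product_attention`. This assumes PyTorch 2.0 or later, where that function is available. The function is repeated so the snippet runs on its own, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def sdp_attention(query, key, value):
    # same computation as the function defined above
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

torch.manual_seed(0)
q, k, v = (torch.randn(2, 5, 16) for _ in range(3))  # [batch, seq_len, dim]

ours = sdp_attention(q, k, v)
builtin = F.scaled_dot_product_attention(q, k, v)  # no mask, no dropout by default

matches = torch.allclose(ours, builtin, atol=1e-5)
print(matches)
```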

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>3 |</span></b> <b>MULTIHEAD SELF ATTENTION</b></div>


- The meaning of the word will be better informed by **complementary words in the context** than by **identical words** (which gives 1)

##### **SIMPLISTIC APPROACH**

- We only used the embeddings "as is" (no linear transformation) to compute the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

##### **BETTER APPROACH**

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>** layer applies **three independent linear transformations (`nn.Linear`) to each embedding** to generate **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** 
- These transformations project the embeddings and **each projection carries its own set of learnable parameters** (**Weights**)
- This **allows the self-attention layer to focus on different semantic aspects of the sequence**



It's beneficial to have **multiple sets of linear projections** (each one represents an **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**)

Why do we need more than one **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**?
- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">softmax</mark>** of one head tends to focus on mostly **one aspect of similarity**


**Several heads** allow the model to **focus on several aspects at once**
- Eg. one head can focus on subject-verb interaction, another finds nearby adjectives
- **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">CV analogy</mark>**: filters; one filter responsible for detecting the head, another for facial features 

In [13]:
'''

Attention Class

# nn.Linear : apply linear transformation to incoming data
#             y = x * A^T + b
# where x is the input, A is the weight matrix, b is the bias & y is the output

# calculate scaled dot product attention matrix
# Requires embedding dimension & head dimension
# Each attention head has its own q,k,v projections

'''

class Attention(nn.Module):
    
    # initialisation
    def __init__(self, embed_dim, head_dim):
        super().__init__()

        # Define the three linear projections (query, key, value)
        # input - embed_dim, output - head_dim
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    # main class operation
    def forward(self, hidden_state):
        
        # calculate scaled dot product attention from the projected query, key & value
        attn_outputs = sdp_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state))
        
        return attn_outputs

`Attention` will be used in the construction of a model

- We’ve **initialised three independent linear layers** that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim]
- Where head_dim is the number of dimensions we are projecting into


In [14]:
# 
print(config.num_attention_heads,'heads')
print(config.hidden_size,'hidden state embedding dimension')

12 heads
768 hidden state embedding dimension


In [15]:
''' Sample Initialisation '''

# Initialise just one head; the forward pass requires the token embedding vectors
# Note: num_heads (12) is passed as head_dim here, so this head projects into 12 dimensions

embed_dim = config.hidden_size
num_heads = config.num_attention_heads

attention = Attention(embed_dim,num_heads)
attention

Attention(
  (q): Linear(in_features=768, out_features=12, bias=True)
  (k): Linear(in_features=768, out_features=12, bias=True)
  (v): Linear(in_features=768, out_features=12, bias=True)
)

In [16]:
# Weights are always initialised randomly, attention_outputs varies
attention_outputs = attention(inputs_embeds)
attention_outputs

tensor([[[-0.3785, -0.1225,  0.1665, -0.2492,  0.0298, -0.0010,  0.4590,
           0.2504,  0.3419, -0.1287, -0.1394, -0.0081],
         [-0.4377, -0.0131, -0.0385, -0.4269, -0.0143, -0.2199,  0.3435,
           0.2661,  0.2000, -0.0322, -0.1178, -0.0684],
         [-0.4217, -0.0546,  0.0053, -0.1888,  0.0211, -0.2245,  0.3599,
           0.2941,  0.2816,  0.0566, -0.1054,  0.0277],
         [-0.1829, -0.1457,  0.1940, -0.1023,  0.0994,  0.0569,  0.1223,
           0.3326,  0.2057,  0.1363, -0.0909,  0.0537],
         [-0.5145, -0.1693,  0.0426, -0.2721, -0.0756, -0.1849,  0.4172,
           0.2842,  0.2850, -0.0334, -0.1068,  0.1154],
         [-0.3539, -0.0502,  0.0768, -0.2742,  0.0889, -0.1206,  0.3454,
           0.2391,  0.2898, -0.0085, -0.0863, -0.0209],
         [-0.5372, -0.2052,  0.1084, -0.2576, -0.0576, -0.1253,  0.5218,
           0.2485,  0.3603, -0.1158, -0.1089,  0.1241],
         [-0.3806, -0.0407,  0.0601, -0.3340,  0.0367, -0.1459,  0.3445,
           0.2424,  0.25

In [17]:
'''

Multihead attention class

'''


class multiHeadAttention(nn.Module):
    
    # Config during initialisation
    def __init__(self, config):
        super().__init__()
        
        # model params, read from config file
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        
        # attention heads (only defined here, no hidden state needed yet)
        # each attention head is initialised with head dimension embed_dim // num_heads
        self.heads = nn.ModuleList(
            [Attention(embed_dim, head_dim) for _ in range(num_heads)])
        
        # output uses whole embedding dimension for output
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    # Given a hidden state (embeddings)
    # Apply operation for multihead attention
        
    def forward(self, hidden_state):
        
        # for each head (of size embed_dim // num_heads), calculate attention
        heads = [head(hidden_state) for head in self.heads] 
        x = torch.cat(heads, dim=-1) # merge/concat head data together
    
        # apply linear transformation to the concatenated multi-head attention output
        x = self.out_linear(x)
        return x

In [18]:
'''

Sample Usage: Multi-Head Attention

'''

# Every time will be different due to randomised weights
multihead_attn = multiHeadAttention(config) # initialisation with config
attn_output = multihead_attn(inputs_embeds) # forward by inputting embedding vectors (one for each token)

# Attention output (concatenated head outputs passed through the output linear layer)
print(attn_output)
print(attn_output.size())

tensor([[[-0.2038,  0.0492, -0.3806,  ...,  0.1021,  0.0149, -0.1455],
         [-0.1592,  0.0654, -0.3598,  ...,  0.0183,  0.0195, -0.2444],
         [-0.2921,  0.0971, -0.3893,  ..., -0.0356,  0.0129, -0.1618],
         ...,
         [-0.2515,  0.1114, -0.3912,  ...,  0.1206, -0.0399, -0.2125],
         [-0.1645,  0.1249, -0.4061,  ...,  0.0065,  0.0102, -0.2034],
         [-0.2588,  0.1445, -0.3431,  ...,  0.0014, -0.1113, -0.1206]]],
       grad_fn=<ViewBackward0>)
torch.Size([1, 9, 768])
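
A design note: keeping one `nn.Linear` per head is easy to read, but many implementations use a single `embed_dim -> embed_dim` projection and split its output into heads. For the projection step the two are mathematically equivalent, as the following sketch with small, illustrative sizes suggests. The names `per_head` and `fused` are hypothetical and only the query projection is shown.

```python
import torch
from torch import nn

torch.manual_seed(0)

embed_dim, num_heads = 8, 2          # illustrative sizes
head_dim = embed_dim // num_heads

# One query projection per head, as in the Attention class
per_head = nn.ModuleList(
    [nn.Linear(embed_dim, head_dim) for _ in range(num_heads)])

# A single fused projection carrying the same weights, stacked head by head
fused = nn.Linear(embed_dim, embed_dim)
with torch.no_grad():
    fused.weight.copy_(torch.cat([h.weight for h in per_head], dim=0))
    fused.bias.copy_(torch.cat([h.bias for h in per_head], dim=0))

x = torch.randn(1, 3, embed_dim)     # [batch, seq_len, embed_dim]

separate = torch.cat([h(x) for h in per_head], dim=-1)
together = fused(x)

print(torch.allclose(separate, together, atol=1e-6))
```

The per-head version used in this notebook trades some speed for clarity, which suits a walkthrough.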


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>4 |</span></b> <b>FEED FORWARD LAYER</b></div>

**position-wise feed-forward layer**

The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">feed-forward</mark>** sublayer in the encoder & decoder
- **two layer fully connected neural network**


However, instead of processing the whole sequence of embeddings as a single vector, 
- it **processes each embedding** independently
- You may also see it referred to as a Conv1D with a kernel size of 1 (by people with a CV background)


The **hidden size** of the **1st layer = 4x size of the embeddings**, followed by a **GELU activation function**
- Place where most of the capacity & memorization is hypothesized to happen
- It is the part most often scaled when scaling up the models
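
The "position-wise" and "Conv1D with kernel size 1" descriptions can be checked with a short sketch. The sizes are illustrative, and the convolution is given the same parameters as the linear layer purely for the comparison.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden, inner = 8, 32                      # illustrative sizes (inner = 4 x hidden)
linear = nn.Linear(hidden, inner)
conv = nn.Conv1d(hidden, inner, kernel_size=1)

# Give the convolution the same parameters as the linear layer
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))   # [inner, hidden, 1]
    conv.bias.copy_(linear.bias)

x = torch.randn(2, 5, hidden)              # [batch, seq_len, hidden]

out_linear = linear(x)
# Conv1d expects [batch, channels, seq_len], so transpose in and out
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)

# Position-wise: applying the layer to one token alone gives the same row
out_single = linear(x[:, 2])

print(torch.allclose(out_linear, out_conv, atol=1e-5))
print(torch.allclose(out_linear[:, 2], out_single, atol=1e-6))
```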

In [19]:
class feedForward(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    # define layer operations input x
        
    def forward(self, x):    # note must be forward
        x = self.gelu(self.linear1(x))
        x = self.linear2(x)
        x = self.dropout(x)
        return x

In [20]:
# initialise feedforward layer
feed_forward = feedForward(config)              # initialise 
print(feed_forward,'\n')

# forward pass takes the multi-head attention outputs
ff_outputs = feed_forward(attn_output) # forward operation
ff_outputs

feedForward(
  (linear1): Linear(in_features=768, out_features=3072, bias=True)
  (linear2): Linear(in_features=3072, out_features=768, bias=True)
  (gelu): GELU(approximate='none')
  (dropout): Dropout(p=0.1, inplace=False)
) 



tensor([[[ 0.0071, -0.0288, -0.0209,  ..., -0.0377,  0.0024,  0.0379],
         [-0.0073, -0.0401, -0.0253,  ..., -0.0378,  0.0008,  0.0478],
         [-0.0096, -0.0413, -0.0249,  ..., -0.0399,  0.0034,  0.0415],
         ...,
         [-0.0012, -0.0281, -0.0072,  ..., -0.0367,  0.0085,  0.0444],
         [ 0.0059, -0.0000, -0.0209,  ..., -0.0402,  0.0091,  0.0407],
         [ 0.0013, -0.0234, -0.0095,  ..., -0.0388,  0.0076,  0.0386]]],
       grad_fn=<MulBackward0>)

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>5 |</span></b> <b>NORMALISATION LAYERS</b></div>

Transformer architecture also uses **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">layer normalisation</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">skip connections</mark>**
- **normalisation** - normalises each input in the batch to have zero mean & unit variance (computed across the hidden dimension)
- **skip connections** - pass a tensor to the next level of the model w/o processing & add it to the processed tensor

Two main approaches, when it comes to normalisation layer placement in decoder, encoder:
- **post layer** normalisation (transformer paper, layer normalisation b/w skip connections)
- **pre layer** normalisation 

<br>

| `post-layer` normalisation | `pre-layer` normalisation |
| - | - |
| Arrangement is tricky to train from scratch, as the gradients can diverge | Most often found arrangement in the literature |
| Used with LR warm up (learning rate gradually increased from a small value to some maximum value during training) | Places layer normalisation within the span of the skip connection |
|  | Tends to be much more stable during training, and it does not usually require any learning rate warm-up |
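
The difference between the two arrangements comes down to where the normalisation sits relative to the skip connection. A minimal sketch, using an `nn.Linear` as a stand-in for the attention or feed-forward sublayer and illustrative sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden = 8                                   # illustrative size
norm = nn.LayerNorm(hidden)
sublayer = nn.Linear(hidden, hidden)         # stand-in for attention or feed-forward

x = torch.randn(1, 3, hidden)                # [batch, seq_len, hidden]

# Post-layer normalisation: normalise after adding the skip connection
post_ln = norm(x + sublayer(x))

# Pre-layer normalisation: normalise inside the skip connection
pre_ln = x + sublayer(norm(x))

# LayerNorm normalises each token's vector over the hidden dimension
print(post_ln.mean(dim=-1))                  # close to 0 for every token
print(post_ln.var(dim=-1, unbiased=False))   # close to 1 for every token
```

In the pre-layer form the skip path carries `x` through unchanged, which is one common explanation for its more stable training.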

In [21]:
class encoderLayer(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.norm2 = nn.LayerNorm(config.hidden_size)
        self.attention = multiHeadAttention(config)    # multihead attention layer 
        self.feed_forward = feedForward(config)        # feed forward layer

    def forward(self, x):
        
        # Apply layer norm. to hidden state, copy input into query, key, value
        # Apply attention with a skip connection
        x = x + self.attention(self.norm1(x))
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.norm2(x))
        
        return x

In [22]:
# Transformer layer output
encoder_layer = encoderLayer(config) # initialise encoder layer
print(encoder_layer,'\n')

print('input',inputs_embeds.shape) 
print('output',encoder_layer(inputs_embeds).size())

encoderLayer(
  (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attention): multiHeadAttention(
    (heads): ModuleList(
      (0-11): 12 x Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
    )
    (out_linear): Linear(in_features=768, out_features=768, bias=True)
  )
  (feed_forward): feedForward(
    (linear1): Linear(in_features=768, out_features=3072, bias=True)
    (linear2): Linear(in_features=3072, out_features=768, bias=True)
    (gelu): GELU(approximate='none')
    (dropout): Dropout(p=0.1, inplace=False)
  )
) 

input torch.Size([1, 9, 768])
output torch.Size([1, 9, 768])


There is an issue with the way we set up the **encoder layers** (which use just embedding inputs)
- they are **blind to the position of the tokens**: reordering the input tokens simply reorders the outputs in the same way
- Multi-head attention layer is effectively a weighted sum, so the **information on token position is lost**
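
This can be checked directly: shuffling the input tokens only shuffles the attention outputs in the same way, so nothing in the result reflects word order. A minimal sketch, using a single head with random projections and illustrative sizes (the helper `self_attention` is defined here only for the demonstration):

```python
import torch
import torch.nn.functional as F
from torch import nn
from math import sqrt

torch.manual_seed(0)

dim = 8                                        # illustrative size
q_proj, k_proj, v_proj = (nn.Linear(dim, dim) for _ in range(3))

def self_attention(x):
    # single-head scaled dot-product self-attention, no positional information
    q, k, v = q_proj(x), k_proj(x), v_proj(x)
    weights = F.softmax(torch.bmm(q, k.transpose(1, 2)) / sqrt(dim), dim=-1)
    return torch.bmm(weights, v)

x = torch.randn(1, 5, dim)                     # [batch, seq_len, dim]
perm = torch.tensor([4, 2, 0, 3, 1])           # an arbitrary reordering of tokens

out_then_shuffle = self_attention(x)[:, perm]
shuffle_then_out = self_attention(x[:, perm])

print(torch.allclose(out_then_shuffle, shuffle_then_out, atol=1e-5))
```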


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>6 |</span></b> <b>POSITIONAL EMBEDDINGS</b></div>

Let's incorporate positional information using **positional embeddings**


**positional embeddings** are based on a simple idea:
  - Modify the **token embeddings** with a **position-dependent pattern** of values arranged in a vector
  
  
If the pattern is characteristic for each position
- the **attention heads** and **feed-forward layers** in each stack can learn to incorporate positional information into their transformations



- There are several ways to achieve this, and one of the most popular approaches is to use a `learnable pattern`
- This works exactly the same way as the token embeddings, but using the **position index** instead of the **token identifier** (from vocabulary dictionary) as input
- An efficient way of encoding the positions of tokens is learned during pretraining

Creating a custom `Embedding` class

Let’s create a custom Embeddings module (**token embeddings + positional embeddings**)
 - That combines a token embedding layer that projects the input_ids to a dense hidden state 
 - Together with the positional embedding that does the same for position_ids
 - The resulting embedding is simply the **sum of both embeddings**

In [23]:
'''

Token + Position Embedding 


'''

class tpEmbedding(nn.Module):
    
    def __init__(self, config):        
        super().__init__()
        
        # token embedding layer
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        
        # positional embedding layer
        # config.max_position_embeddings -> max number of positions in text 512 (tokens)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        
        self.norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()  # note: default p=0.5; config.hidden_dropout_prob (0.1) is the BERT setting

    def forward(self, input_ids):
        
        # Create position IDs for input sequence
        seq_length = input_ids.size(1) # number of tokens
        position_ids = torch.arange(seq_length, dtype=torch.long)[None,:] # range(0,9)
        
        # tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])
        # tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
        
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        
        # Add normalisation & dropout layers
        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [24]:
# Token and Position Embeddings
embedding_layer = tpEmbedding(config)
embedding_layer(inputs.input_ids)

tensor([[[ 3.0403, -0.0000,  1.1722,  ..., -1.6743, -3.0759, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0780, -1.9838],
         [ 2.4112, -0.0000,  0.9230,  ...,  0.0000, -3.2554, -3.6100],
         ...,
         [ 0.0000, -0.0000, -1.9303,  ...,  1.7901, -2.4660, -0.0000],
         [-0.0000,  0.0000, -0.0000,  ..., -0.0000,  2.1861, -0.9021],
         [-0.0000,  0.0000, -5.8991,  ...,  0.0209,  0.0000, -0.0000]]],
       grad_fn=<MulBackward0>)

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>7 |</span></b> <b>PUTTING IT ALL TOGETHER</b></div>

- Constructing the Transformer **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">encoder</mark>**, combining the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Embedding</mark>** and **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Encoder</mark>**  layers
- We utilise both **token** & **positional** embeddings using `tpEmbedding`
- We stack `config.num_hidden_layers` copies of `encoderLayer`, each of which contains the **attention** & **feed-forward** layers

In [25]:
# full transformer encoder combining the `tpEmbedding` with the `encoderLayer` layers

class TransformerEncoder(nn.Module):
    
    def __init__(self, config):       
        super().__init__()
        
        # token & positional embedding layer
        self.embeddings = tpEmbedding(config)
        
        # attention & feed-forward layers (one encoderLayer per hidden layer)
        self.layers = nn.ModuleList([encoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        
        # embeddings layer output
        x = self.embeddings(x)
        
        # pass the hidden states through each encoder layer in turn
        for layer in self.layers:
            x = layer(x)
        return x

In [26]:
# Encoder initialisation & output
encoder = TransformerEncoder(config)
encoder_output = encoder(inputs.input_ids)
encoder_output

tensor([[[ 0.6233, -0.7978, -2.4776,  ..., -0.4364, -0.0118, -1.8506],
         [-3.6226, -1.5029,  0.7667,  ..., -1.8732,  3.5305, -2.8063],
         [ 0.4178,  1.5879, -0.8017,  ..., -3.4680, -0.4882, -3.1883],
         ...,
         [ 0.2196, -0.0766,  2.8025,  ...,  3.6958,  0.7566, -4.5964],
         [ 0.4364,  1.1692, -0.8789,  ...,  1.1547,  4.6363, -1.0828],
         [ 1.8008, -1.3734, -2.4738,  ...,  0.4895, -0.9201, -3.4636]]],
       grad_fn=<AddBackward0>)

In [27]:
# hidden state for each token in a batch
encoder_output.size()

torch.Size([1, 9, 768])

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>8 |</span></b> <b>CLASSIFICATION HEAD</b></div>

Quite often, transformers are divided into:
- Task independent body (`TransformerEncoder`)
- Task dependent head (`TransformerClassifier`)

Select one of the token outputs:
- The hidden state of the first token (the **[CLS] token** in BERT-style models) is often used for the prediction
- We can attach a `dropout` and a `linear` transformation layer to it to make a classification prediction
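
The head described above can be sketched in isolation. This is a minimal illustration rather than the notebook's model: a random tensor stands in for the encoder output, and the hidden size of 768, dropout rate of 0.1 and 3 labels are assumed values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the encoder output: (batch, seq_len, hidden_size)
hidden_states = torch.randn(2, 9, 768)

# Assumed head: dropout followed by a linear projection to 3 classes
dropout = nn.Dropout(0.1)
classifier = nn.Linear(768, 3)

first_token_state = hidden_states[:, 0, :]        # (2, 768), first token of each sequence
logits = classifier(dropout(first_token_state))   # (2, 3), unnormalized scores
probs = torch.softmax(logits, dim=-1)             # (2, 3), each row sums to 1
```

The logits are what a loss such as `nn.CrossEntropyLoss` would consume directly; the softmax is only needed when probabilities are wanted for inspection.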

In [86]:
class TransformerClassifier(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        
        # Transformer Encoder
        self.encoder = TransformerEncoder(config)
        
        # Classification Head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
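        x = self.encoder(x)[:, 0, :] # select hidden state of the first token ([CLS] when special tokens are added)
        x = self.dropout(x)
        # Linear head (768 -> num_labels) is left commented out, so forward() returns the 768-d hidden state
        # x = self.classifier(x)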

        return x

In [132]:
import pandas as pd

class Embeddings():
    def getTensor(self, text):
        inputs = tokenizer(text,
                           return_tensors="pt",       # PyTorch tensors
                           add_special_tokens=False,  # don't add [CLS], [SEP] tokens
                           padding=True)              # pad to the longest text in the batch
        return inputs.input_ids

data = list(pd.read_csv("SEC-CompanyTicker.csv", index_col=0).companyName[:100])
inputs = Embeddings().getTensor(data)

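- With the linear layer enabled, the model returns the **unnormalized logits** for each class for every sample in the batch, as in a BERT-style classifier; with that layer commented out as above, it returns the 768-dimensional hidden state of the first token instead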

In [133]:
config.num_labels = 3
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs)
output

tensor([[-4.3533,  0.8686,  1.3328,  ..., -0.1595, -1.0860,  0.0706],
        [-0.0000, -0.0561, -1.5787,  ...,  0.0000,  0.0471, -1.3811],
        [ 0.0000,  0.9671, -1.0112,  ...,  2.8010,  1.5144, -1.0039],
        ...,
        [-0.7198,  0.9421, -0.2193,  ...,  0.8811, -0.0000, -3.2116],
        [-1.5848,  0.0000, -0.4742,  ...,  1.1263, -0.0000, -0.2135],
        [-0.7677,  0.8380, -2.6805,  ...,  0.7196, -3.7764, -0.5905]],
       grad_fn=<MulBackward0>)

In [134]:
for i in range(100):
    output = encoder_classifier(inputs)
output

tensor([[-0.1550, -0.8370,  0.1463,  ...,  2.9710, -2.1226,  0.9835],
        [-0.9236, -0.4818, -0.3254,  ...,  3.0201, -2.3562, -0.7856],
        [-0.4130, -0.0390,  2.5468,  ...,  0.4699, -5.0520, -1.2381],
        ...,
        [ 1.7425,  2.1272, -0.5952,  ..., -0.3944, -2.8264, -2.4440],
        [-3.3747,  1.8521, -0.0000,  ...,  1.3196,  0.1693, -0.8030],
        [-0.0000,  0.0000, -0.4076,  ...,  2.8518, -3.9287, -0.2514]],
       grad_fn=<MulBackward0>)

In [91]:
for i in range(100):
    output = encoder_classifier(inputs)
output

tensor([[ 0.4437,  0.3313, -0.6277,  ...,  0.6206,  0.0000,  1.5203],
        [-5.6815,  6.1352,  5.9350,  ...,  0.8245,  1.0475, -1.5590],
        [-0.8428,  5.7313,  0.1390,  ...,  1.5924,  1.5136, -4.6436],
        ...,
        [ 2.4312,  0.1644,  0.2303,  ...,  1.7255,  0.4047, -0.1814],
        [-5.8681,  0.4185,  4.6201,  ...,  6.0037,  0.3678, -1.6436],
        [-1.9681,  0.0000,  5.1553,  ...,  0.7944,  1.8766, -3.0019]],
       grad_fn=<MulBackward0>)
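
The repeated runs above return different values for the same input, which is what one would expect if dropout is still active: a freshly constructed `nn.Module` starts in training mode. The sketch below illustrates the effect on a small stand-in network (the layer sizes and dropout rate are assumptions, not the notebook's configuration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in model with a dropout layer
layer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5))
x = torch.ones(4, 8)

# Training mode (the default): a new random dropout mask is drawn on every call,
# so these two outputs will almost certainly differ
train_a, train_b = layer(x), layer(x)

# Evaluation mode: dropout becomes a no-op and the output is deterministic
layer.eval()
with torch.no_grad():
    eval_a, eval_b = layer(x), layer(x)
```

When the outputs are used as embeddings for similarity search, calling `.eval()` first keeps the vectors for a given text stable between calls.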

In [138]:
data

['Apple Inc.',
 'Microsoft Corp',
 'Alphabet Inc.',
 'Amazon Com Inc',
 'Nvidia Corp',
 'Tesla, Inc.',
 'Berkshire Hathaway Inc',
 'Meta Platforms, Inc.',
 'Eli Lilly & Co',
 'Visa Inc.',
 'Taiwan Semiconductor Manufacturing Co Ltd',
 'Exxon Mobil Corp',
 'Unitedhealth Group Inc',
 'Walmart Inc.',
 'Novo Nordisk A S',
 'Jpmorgan Chase & Co',
 'Spdr S&P 500 Etf Trust',
 'Johnson & Johnson',
 'Mastercard Inc',
 'Lvmh Moet Hennessy Louis Vuitton',
 'Procter & Gamble Co',
 'Broadcom Inc.',
 'Latam Airlines Group S.A.',
 'Home Depot, Inc.',
 'Chevron Corp',
 'Oracle Corp',
 'Merck & Co., Inc.',
 'Abbvie Inc.',
 'Coca Cola Co',
 'Adobe Inc.',
 'Toyota Motor Corp/',
 'Pepsico Inc',
 'Costco Wholesale Corp /New',
 'Asml Holding Nv',
 'Bank Of America Corp /De/',
 'Cisco Systems, Inc.',
 'Alibaba Group Holding Ltd',
 'Salesforce, Inc.',
 'Shell Plc',
 'Astrazeneca Plc',
 'Novartis Ag',
 'Mcdonalds Corp',
 'Accenture Plc',
 'Thermo Fisher Scientific Inc.',
 'Mexican Economic Development Inc',
 '

In [150]:
with torch.no_grad():
    q = Embeddings().getTensor(["Rtx"])                           # query text as token ids
    xb = encoder_classifier(inputs).cpu().detach().numpy()        # database vectors, one per company name
    xq = encoder_classifier(q).cpu().detach().numpy()             # query vector
    xq

In [151]:
import faiss

class Faiss:
    def __init__(self):
        pass

    def faiss(self, xb):
        d = xb[0].size                         # embedding dimension
        M = 32                                 # number of graph neighbours per node
        index = faiss.IndexHNSWFlat(d, M)
        index.hnsw.efConstruction = 40         # candidate list size while building the graph
        index.hnsw.efSearch = 16               # candidate list size while searching
        index.add(xb)
        return index

    def query(self, index, xq, k=3):
        D, I = index.search(xq, k)             # distances and indices of the k nearest neighbours
        return D, I


index = Faiss().faiss(xb)
D, I = Faiss().query(index, xq)
I = I[0]

In [152]:
I

array([99,  7, 15])

In [153]:
print(data[I[0]], data[I[1]], data[I[2]])

Rtx Corp Meta Platforms, Inc. Jpmorgan Chase & Co
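
`IndexHNSWFlat` performs an approximate search, so an exact brute-force search can serve as a sanity check on small collections. The sketch below assumes the default L2 metric and uses random vectors as a stand-in for `xb` and `xq`; `brute_force_l2` is a hypothetical helper written for illustration, not part of Faiss.

```python
import numpy as np

def brute_force_l2(xb, xq, k=3):
    """Exact k-nearest-neighbour search using squared L2 distance."""
    # Pairwise differences: (n_queries, n_database, dim)
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=-1)                 # (n_queries, n_database)
    idx = np.argsort(dist, axis=1)[:, :k]           # indices of the k smallest distances
    return np.take_along_axis(dist, idx, axis=1), idx

# Random stand-in data: 100 database vectors of dimension 16
rng = np.random.default_rng(0)
xb_demo = rng.standard_normal((100, 16)).astype("float32")
xq_demo = xb_demo[[42]] + 0.01                      # a query very close to row 42

D_demo, I_demo = brute_force_l2(xb_demo, xq_demo)
```

For the 100 company names used here the exact search is cheap; the approximate index only pays off as the collection grows.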


In [100]:
model = TransformerClassifier(config)

# Save the entire model
torch.save(model, '../model/transformer_classifier_model.pth')


In [101]:
model = torch.load('../model/transformer_classifier_model.pth')
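
Saving the entire module pickles a reference to its class, so loading it requires the same class definition to be available, and recent PyTorch releases may need extra arguments to `torch.load` for fully pickled modules. A commonly recommended alternative is to save only the `state_dict`. The sketch below uses a small `nn.Linear` as a stand-in for `TransformerClassifier(config)` and a temporary file path, both of which are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for TransformerClassifier(config)
model_demo = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.pth")

    # Save only the parameters and buffers
    torch.save(model_demo.state_dict(), path)

    # Rebuild the architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))
    restored.eval()  # switch off dropout-style layers before inference
```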


In [115]:
xq = model(q).cpu().detach().numpy()
xq

array([[-1.3505496e+00, -0.0000000e+00,  1.0903112e+00,  2.2117975e+00,
         8.6265182e-01,  2.8462455e+00, -1.8359869e+00, -1.2350719e+00,
         6.0446796e+00, -0.0000000e+00,  8.6977065e-01, -2.1661969e-02,
        -6.1760110e-01,  2.6821241e+00, -9.9749941e-01, -1.6994071e+00,
        -1.2452182e+00,  1.7675917e+00, -3.3013339e+00,  2.3750708e+00,
        -0.0000000e+00, -1.3645170e+00,  1.9900595e+00,  4.4813614e+00,
         2.5450987e-01,  1.8310524e+00,  6.3948996e-02,  3.5722333e-01,
        -4.7792560e-01,  6.1968966e+00, -8.1963480e-01,  1.4825616e+00,
         3.2396233e+00, -0.0000000e+00,  7.6081905e+00,  4.7303388e-01,
         9.6134752e-01, -1.8100398e+00,  2.3863897e+00,  3.1016536e+00,
        -4.0779333e+00, -1.3829366e+00, -5.4978627e-01, -1.4381657e+00,
        -1.9393079e+00,  7.0081747e-01,  3.2967257e+00,  5.9542090e-01,
        -0.0000000e+00, -1.4870367e+00,  1.5988714e+00, -4.2289000e+00,
        -2.6311826e-02, -6.3320732e-01,  2.5867221e-01,  3.78620

In [112]:
for param in model.parameters():
    param.requires_grad = False

In [159]:
model.eval()  # inference: disable dropout so the embedding is deterministic
with torch.no_grad():
    output = model(q)
output

<function Tensor.size>

In [164]:
output.detach().numpy().size

768