# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>1 |</span></b> <b>BACKGROUND</b></div>


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

In this notebook, we'll look at the following components of the Transformer encoder:

<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    
<ul>
<li>Simple Attention</li>
<li>Multi-Head Self Attention</li>
<li>Feed Forward Layer</li>
<li>Normalisation</li>
<li>Skip Connection</li>
<li>Position Embeddings</li>
<li>Transformer Encoder</li>
<li>Classifier Head</li>
</ul> 
</div> 

<br>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

Put simply, the encoder:
- Converts a **series of tokens** into a **series of embedding vectors** (the hidden state)
- Is a neural network made up of **multiple layers** (**blocks**) stacked together

The encoder structure:
- Composed of multiple encoder layers (blocks) stacked one after another (similar to CNN layer stacks)
- Each encoder block contains **multi-head self-attention** & a **fully connected feed-forward layer** (applied to each input embedding)

Purpose of the encoder:
- Input tokens are encoded & modified into a form that **stores contextual information** from the sequence

The example we'll use:

> the bark of a palm tree is very rough


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>CLASSIFICATION HEAD</span></b></p></div>

- Transformers can be utilised for various applications, so they are created in a base form
- If we want to utilise them for a specific task, we add an extra component, a **head**, to the transformer
- In this example, we'll utilise it for **classification** purposes, and look at how we can combine the base with the **head**
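
As a rough sketch of what such a head could look like: the class name `ClassifierHead`, `num_labels=3` and the random stand-in for the encoder output are all assumptions made for illustration, not the notebook's final implementation. One common option is to take the hidden state of the first token and map it to class logits:

```python
import torch
from torch import nn

class ClassifierHead(nn.Module):
    """Hypothetical head: first-token hidden state -> class logits."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: [batch_size, seq_len, hidden_size]
        first_token = hidden_states[:, 0, :]
        return self.classifier(self.dropout(first_token))

# stand-in for an encoder output: 1 sentence, 9 tokens, 768 dimensions
hidden_states = torch.randn(1, 9, 768)
head = ClassifierHead(hidden_size=768, num_labels=3)  # num_labels is an assumed value
logits = head(hidden_states)
print(logits.size())
```

The base stays unchanged; only the small head depends on the task.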


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>2 |</span></b> <b>SIMPLE SELF ATTENTION</b></div>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>TYPES OF ATTENTION</span></b></p></div>

**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention</mark>**
 
- Mechanism which allows networks to assign **different weight distributions to each element** in a sequence 
- The elements of the sequence are `token embeddings` (each token is mapped to a vector of fixed dimension, eg. 768 dimensions for the BERT base model)
 
 
**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>**

- Instead of using fixed embeddings for each token, we can use the whole sequence to **compute a weighted average** of the `embeddings` for each token
- One can think of self-attention as a form of averaging
- A common form of `self-attention` is **scaled dot-product attention** 


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>FOUR MAIN STEPS</span></b></p></div>


- Project each `token embedding` into three vectors **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>**
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** (nxn)

    - We determine how much the query & key vectors relate to each other using a similarity function
    - Similarity function for scaled dot-product attention - dot product
    - Queries & keys that are similar will have a large dot product & vice versa
    - Outputs from this step - attention scores
    
    
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weight</mark>** (wij)

    - Dot products can produce arbitrarily large numbers 
    - Attention scores are first multiplied by a scaling factor to normalise their variance
    - They are then normalised with a softmax so that the weights for each query (each row) sum to 1
    
    
- Update the token embeddings (hidden state)

    - multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">weights</mark>** by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vector
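
Taken together, the steps above correspond to the usual compact form of scaled dot-product attention. The symbols are introduced here for illustration: $Q$, $K$, $V$ are the matrices of query, key and value vectors, $d_k$ is the key dimension, $v_j$ is the value vector of token $j$ and $x_i'$ is the updated embedding of token $i$:

$$
\text{scores} = \frac{QK^{\top}}{\sqrt{d_k}}, \qquad
w_{ij} = \frac{\exp(\text{scores}_{ij})}{\sum_{j'} \exp(\text{scores}_{ij'})}, \qquad
x_i' = \sum_{j} w_{ij}\, v_j
$$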

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>SIMPLE ATTENTION FORMULATION</span></b></p></div>


- We'll look at a simple example, and summarise the attention mechanism in one function
- The `bert-base-uncased` configuration will be used to read off different model settings (eg. number of attention heads), so we will be building a similar model 

<br>

##### **1. DOCUMENT TOKENISATION**

- Each token in the sentence has been mapped to a **unique identifier** from a **vocabulary** or **dictionary**
- We start off by using the `bert-base-uncased` pretrained tokeniser
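
To make the idea of a vocabulary lookup concrete, here is a toy sketch. It is not how the pretrained tokeniser works internally (which also handles sub-words); `toy_vocab`, `toy_encode` and `toy_decode` are hypothetical names, and the ids are the ones the example sentence receives later in this notebook:

```python
# Toy vocabulary: token -> unique identifier (ids taken from the notebook's example sentence)
toy_vocab = {"the": 1996, "bark": 11286, "of": 1997, "a": 1037, "palm": 5340,
             "tree": 3392, "is": 2003, "very": 2200, "rough": 5931}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_encode(sentence):
    # whitespace splitting only; a real tokeniser also handles sub-words
    return [toy_vocab[token] for token in sentence.lower().split()]

def toy_decode(ids):
    # map each id back to its token and join with spaces
    return " ".join(id_to_token[i] for i in ids)

ids = toy_encode("the bark of a palm tree is very rough")
print(ids)
print(toy_decode(ids))
```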

In [173]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel

# load tokeniser and model
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# document we'll be using as an example
text = "the bark of a palm tree is very rough"

In [174]:
# Tokenise input (text)
inputs = tokenizer(text, 
                   return_tensors="pt",      # PyTorch tensor
                   add_special_tokens=False) # don't add special tokens ([CLS], [SEP])

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])
torch.Size([1, 9])

In [175]:
inputs.input_ids.is_nested

False

In [4]:
# Decode sequence
tokenizer.decode(inputs['input_ids'].tolist()[0])

'the bark of a palm tree is very rough'

At this point:

- `inputs.input_ids` is a tensor of tokens mapped to their ids
- Token embeddings are **independent of their context**
- **Homonyms** (same spelling, but different meaning) have the same representation

Role of subsequent attention layers:

- Mix the **token embeddings** to disambiguate & inform the representation of each token with its surrounding context
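
A small sketch of what "independent of their context" means in practice. The toy sizes and the two made-up id sequences are assumptions for illustration; the point is only that an embedding lookup returns the same vector for the same id, whatever its neighbours are:

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes

# token id 3 appears in two different "sentences"
sentence_a = torch.tensor([[3, 1, 2]])
sentence_b = torch.tensor([[5, 7, 3]])

vec_a = toy_emb(sentence_a)[0, 0]  # id 3 at position 0
vec_b = toy_emb(sentence_b)[0, 2]  # id 3 at position 2

# the lookup ignores the neighbouring tokens, so both vectors are identical
print(torch.equal(vec_a, vec_b))
```

This is why a homonym such as "bark" gets the same representation in every sentence until the attention layers mix in context.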

In [5]:
'''

Create an embedding layer

'''

from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)

print(config.hidden_size,"hidden size")
print(config.vocab_size,"vocabulary size")

# create a sample embedding layer of size (30522,768) -> same dimensions as bert-base (weights are randomly initialised)
token_emb = nn.Embedding(config.vocab_size,
                         config.hidden_size)
token_emb

768 hidden size
30522 vocabulary size


Embedding(30522, 768)

##### **2. EMBEDDING VECTORS**


- Convert the tokenised data into embedding vectors (768 dimensions) using a vocabulary of 30522 tokens
- Each entry of `input_ids` is **mapped to one of 30522 embedding vectors** stored in `nn.Embedding`, each with a size of 768 
- Calling the embedding layer on the ids, `token_emb(inputs.input_ids)`, gives an output of shape [batch_size,seq_len,hidden_dim]

In [6]:
'''

Convert Tokens to Embedding Vectors
using the randomly initialised embedding layer defined above

'''

inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 9, 768])

In [7]:
# 9 embedding vectors of 768 dimensions
inputs_embeds

tensor([[[ 0.4677, -0.4390, -0.2647,  ...,  0.7357,  0.2608,  0.7767],
         [ 0.0355, -2.2300,  1.9929,  ..., -0.1363,  0.4968,  0.4577],
         [ 0.1232,  0.4308,  0.3070,  ..., -0.3304,  0.8665, -0.5848],
         ...,
         [ 0.0763,  0.9183,  0.6317,  ..., -1.3229, -1.8031,  0.5473],
         [ 1.4331, -2.0097,  2.0351,  ..., -0.7298,  0.6222,  0.1742],
         [-0.1593,  1.3323, -2.2000,  ...,  0.5990, -0.2788,  0.0168]]],
       grad_fn=<EmbeddingBackward0>)

##### **3. QUERY, KEY, VALUE VECTORS**

- In the simplest case of attention, **we set them equal to one another**
- An attention mechanism with equal query and key vectors will assign a **very large score to identical words in the context** (diagonal components of the matrix)

In [8]:
import torch
from math import sqrt

# setting them equal to one another
print("query and key components\n")
query = key = value = inputs_embeds
print('query size:',query.size())
dim_k = key.size(-1)   # hidden dimension 
print('key size:',key.transpose(1,2).size())

query and key components

query size: torch.Size([1, 9, 768])
key size: torch.Size([1, 768, 9])


##### **4. COMPUTE ATTENTION SCORES**

- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** using the **dot product as the similarity function**
- `torch.bmm` - batch matrix matrix product (as we work in batches during training)
- To transpose the last two dimensions of a tensor, use `tensor.transpose(1,2)`

In [9]:
# dot product & apply scaling by sqrt(dim_k)
print("\ndot product (attention scores)")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()


dot product (attention scores)


torch.Size([1, 9, 9])

In [10]:
# attention scores
scores

tensor([[[26.5574,  1.9637, -1.6792,  0.6971,  0.6886, -0.4483, -0.4606,
           0.3681,  0.2506],
         [ 1.9637, 25.3258, -0.8987, -0.2280, -1.6313,  0.8909, -0.7721,
           0.6595, -1.0236],
         [-1.6792, -0.8987, 27.3665, -0.8606, -0.0365,  0.0332, -1.8328,
          -0.6056,  0.1560],
         [ 0.6971, -0.2280, -0.8606, 27.1989, -1.6901, -0.7582, -0.3339,
           2.4001, -0.2466],
         [ 0.6886, -1.6313, -0.0365, -1.6901, 26.4779,  1.2638,  0.4562,
          -0.6473, -0.0399],
         [-0.4483,  0.8909,  0.0332, -0.7582,  1.2638, 27.5632, -1.2946,
          -0.0516, -0.5878],
         [-0.4606, -0.7721, -1.8328, -0.3339,  0.4562, -1.2946, 26.2665,
           1.1146,  0.7085],
         [ 0.3681,  0.6595, -0.6056,  2.4001, -0.6473, -0.0516,  1.1146,
          29.8054,  0.6989],
         [ 0.2506, -1.0236,  0.1560, -0.2466, -0.0399, -0.5878,  0.7085,
           0.6989, 27.8790]]], grad_fn=<DivBackward0>)
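
The diagonal values of roughly 27 in the cell above are close to sqrt(768) ≈ 27.7, which is what we expect when a vector with unit-variance components is dotted with itself and divided by `sqrt(dim_k)`. The sketch below, using random stand-in vectors rather than the notebook's embeddings, illustrates why the scaling factor is used: the spread of raw dot products grows with the dimension, and dividing by `sqrt(dim_k)` brings it back to about 1.

```python
import torch
from math import sqrt

torch.manual_seed(0)
dim_k = 768
n_pairs = 10_000

# independent random queries and keys with unit-variance components
q = torch.randn(n_pairs, dim_k)
k = torch.randn(n_pairs, dim_k)

raw = (q * k).sum(dim=-1)      # unscaled dot products
scaled = raw / sqrt(dim_k)     # scaled as in the cell above

print("std of raw dot products   :", raw.std().item())
print("std of scaled dot products:", scaled.std().item())
```

Without the scaling, the softmax would receive very large inputs and saturate even more strongly.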

##### **5. COMPUTE ATTENTION WEIGHTS (SOFTMAX FUNCTION)**



- We've created a 9x9 matrix of **attention scores** per sample in the batch
- Apply the softmax for normalisation to get the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

In [11]:
import torch.nn.functional as F

print("softmax applied, attention weights :\n")
weights = F.softmax(scores, dim=-1)
print(weights.size())
print(weights)

print("\nsum of each row:\n")
weights.sum(dim=-1)

softmax applied, attention weights :

torch.Size([1, 9, 9])
tensor([[[1.0000e+00, 2.0851e-11, 5.4578e-13, 5.8754e-12, 5.8255e-12,
          1.8690e-12, 1.8461e-12, 4.2283e-12, 3.7595e-12],
         [7.1446e-11, 1.0000e+00, 4.0815e-12, 7.9819e-12, 1.9618e-12,
          2.4436e-11, 4.6325e-12, 1.9389e-11, 3.6024e-12],
         [2.4299e-13, 5.3033e-13, 1.0000e+00, 5.5093e-13, 1.2560e-12,
          1.3468e-12, 2.0838e-13, 7.1097e-13, 1.5227e-12],
         [3.0934e-12, 1.2264e-12, 6.5150e-13, 1.0000e+00, 2.8424e-13,
          7.2177e-13, 1.1033e-12, 1.6983e-11, 1.2039e-12],
         [6.3071e-12, 6.1987e-13, 3.0543e-12, 5.8452e-13, 1.0000e+00,
          1.1211e-11, 4.9993e-12, 1.6584e-12, 3.0442e-12],
         [6.8354e-13, 2.6082e-12, 1.1063e-12, 5.0137e-13, 3.7870e-12,
          1.0000e+00, 2.9322e-13, 1.0163e-12, 5.9453e-13],
         [2.4694e-12, 1.8084e-12, 6.2607e-13, 2.8030e-12, 6.1766e-12,
          1.0724e-12, 1.0000e+00, 1.1931e-11, 7.9493e-12],
         [1.6427e-13, 2.1983e-13, 6.2

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

##### **6. UPDATE VALUES**

Multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>** matrix by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">values</mark>** vector

In [12]:
attn_outputs = torch.bmm(weights, value)
print(attn_outputs)
print(attn_outputs.shape)

tensor([[[ 0.4677, -0.4390, -0.2647,  ...,  0.7357,  0.2608,  0.7767],
         [ 0.0355, -2.2300,  1.9929,  ..., -0.1363,  0.4968,  0.4577],
         [ 0.1232,  0.4308,  0.3070,  ..., -0.3304,  0.8665, -0.5848],
         ...,
         [ 0.0763,  0.9183,  0.6317,  ..., -1.3229, -1.8031,  0.5473],
         [ 1.4331, -2.0097,  2.0351,  ..., -0.7298,  0.6222,  0.1742],
         [-0.1593,  1.3323, -2.2000,  ...,  0.5990, -0.2788,  0.0168]]],
       grad_fn=<BmmBackward0>)
torch.Size([1, 9, 768])


Now we can wrap these steps in a general function which:
- Takes the tensors `query`, `key` & `value` as inputs
- Calculates the scaled dot-product attention 

In [13]:
'''

Scaled Dot-Product Attention
scores = query @ key.T / sqrt(dim_k)
weights = softmax(scores) 
output = weights @ value

'''

def sdp_attention(query, key, value):
    dim_k = query.size(-1) # dimension component
    sfact = sqrt(dim_k)     
    scores = torch.bmm(query, key.transpose(1,2)) / sfact
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
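
A short usage sketch of the function, with a random tensor standing in for `inputs_embeds` (an assumption, so that the example runs on its own). The function is repeated here only to keep the block self-contained:

```python
import torch
import torch.nn.functional as F
from math import sqrt

def sdp_attention(query, key, value):
    # same function as above, repeated so the sketch runs on its own
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

torch.manual_seed(0)
x = torch.randn(1, 9, 768)          # stand-in for inputs_embeds
out = sdp_attention(x, x, x)        # query = key = value

print(out.size())
# with q = k = v the weights are close to the identity matrix,
# so the output is almost a copy of the input
print(torch.allclose(out, x, atol=1e-4))
```

This mirrors what we saw earlier: the attention output tensor printed the same values as the input embeddings.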

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>3 |</span></b> <b>MULTIHEAD SELF ATTENTION</b></div>


- The meaning of a word will be better informed by **complementary words in the context** than by **identical words** (which get a weight of ~1)

##### **SIMPLISTIC APPROACH**

- We only used the embeddings "as is" (no linear transformation) to compute the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

##### **BETTER APPROACH**

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>** layer applies **three independent linear transformations (`nn.Linear`) to each embedding** to generate **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** 
- These transformations project the embeddings and **each projection carries its own set of learnable parameters** (**Weights**)
- This **allows the self-attention layer to focus on different semantic aspects of the sequence**



It's beneficial to have **multiple sets of linear projections** (each one represents an **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**)

Why do we need more than one **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**?
- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">softmax</mark>** of one head tends to focus on mostly **one aspect of similarity**


**Several heads** allow the model to **focus on several aspects at once**
- Eg. one head can focus on subject-verb interaction, another finds nearby adjectives
- **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">CV analogy</mark>**: filters; one filter responsible for detecting the head, another for facial features 

In [14]:
'''

Attention Class

# nn.Linear : applies a linear transformation to the incoming data
#             y = x A^T + b
# where x is the input, A the weight matrix, b the bias & y the output

# calculate scaled dot product attention
# Requires the embedding dimension & the head dimension
# Each attention head has its own q,k,v projections

'''

class Attention(nn.Module):
    
    # initialisation 
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        
        # Define the three vectors
        # input - embed_dim, output - head_dim
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    # main class operation
    def forward(self, hidden_state):
        
        # calculate scaled dot product attention from the projected hidden state
        attn_outputs = sdp_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state))
        
        return attn_outputs

`Attention` will be used in the construction of a model

- We’ve **initialised three independent linear layers** that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim]
- Where head_dim is the number of dimensions we are projecting into


In [15]:
# model settings read from the config
print(config.num_attention_heads,'heads')
print(config.hidden_size,'hidden state embedding dimension')

12 heads
768 hidden state embedding dimension


In [16]:
''' Sample Initialisation '''

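# Initialise just one head; the forward pass requires the token embedding vectors
# Note: the second argument is head_dim; num_heads (12) is passed here, so this
# sample head projects into 12 dimensions (the multi-head class below uses 768 // 12 = 64)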

embed_dim = config.hidden_size
num_heads = config.num_attention_heads

attention = Attention(embed_dim,num_heads)
attention

Attention(
  (q): Linear(in_features=768, out_features=12, bias=True)
  (k): Linear(in_features=768, out_features=12, bias=True)
  (v): Linear(in_features=768, out_features=12, bias=True)
)

In [17]:
# Weights are always initialised randomly, attention_outputs varies
attention_outputs = attention(inputs_embeds)
attention_outputs

tensor([[[ 0.0466, -0.2470,  0.1237,  0.2201,  0.3026,  0.1393, -0.2491,
          -0.3497, -0.1996,  0.0671,  0.2526,  0.2091],
         [ 0.0140, -0.3049,  0.1017,  0.2228,  0.3752,  0.1384, -0.1850,
          -0.3570, -0.1759,  0.0630,  0.1793,  0.1408],
         [ 0.0551, -0.2393,  0.0875,  0.2590,  0.3480,  0.1496, -0.2562,
          -0.3072, -0.2018,  0.0312,  0.2683,  0.2274],
         [-0.0058, -0.2037,  0.1226,  0.3035,  0.2113,  0.0648, -0.2598,
          -0.3115, -0.1923,  0.0440,  0.1671,  0.2235],
         [ 0.0202, -0.2561,  0.1774,  0.2478,  0.2812,  0.0958, -0.2397,
          -0.3498, -0.1908,  0.0311,  0.2039,  0.1867],
         [-0.0334, -0.2440,  0.1358,  0.3071,  0.3303,  0.0442, -0.2407,
          -0.2779, -0.2602, -0.0438,  0.3093,  0.1323],
         [ 0.0623, -0.1823,  0.1767,  0.2959,  0.3191,  0.2003, -0.2686,
          -0.3018, -0.1096,  0.1130,  0.3252,  0.3756],
         [ 0.0675, -0.2814,  0.0038,  0.1822,  0.4222,  0.1896, -0.2177,
          -0.3258, -0.25

In [18]:
'''

Multihead attention class

'''


class multiHeadAttention(nn.Module):
    
    # Config during initialisation
    def __init__(self, config):
        super().__init__()
        
        # model params, read from config file
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        
        # attention heads (only defined here, the hidden state is passed in forward)
        # each attention head projects into embed_dim // num_heads dimensions
        self.heads = nn.ModuleList(
            [Attention(embed_dim, head_dim) for _ in range(num_heads)])
        
        # output uses whole embedding dimension for output
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    # Given a hidden state (embeddings)
    # Apply operation for multihead attention
        
    def forward(self, hidden_state):
        
        # calculate attention for each head (each outputs head_dim features)
        heads = [head(hidden_state) for head in self.heads] 
        x = torch.cat(heads, dim=-1) # merge/concat head outputs together
    
        # apply linear transformation to the concatenated multi-head attention output
        x = self.out_linear(x)
        return x

In [19]:
'''

Sample Usage: Multi-Head Attention

'''

# Every time will be different due to randomised weights
multihead_attn = multiHeadAttention(config) # initialisation with config
attn_output = multihead_attn(inputs_embeds) # forward by inputting embedding vectors (one for each token)

# Attention output (concatenated head outputs after the final linear layer)
print(attn_output)
print(attn_output.size())

tensor([[[-0.0017, -0.1517, -0.1362,  ..., -0.0517, -0.0008,  0.0395],
         [-0.0278, -0.0883, -0.1181,  ..., -0.0496,  0.0156,  0.0461],
         [-0.0047, -0.1191, -0.0997,  ..., -0.0956,  0.0191, -0.0170],
         ...,
         [ 0.0495, -0.1500, -0.1984,  ..., -0.0973,  0.0834, -0.0297],
         [ 0.0523, -0.0707, -0.1575,  ..., -0.0809,  0.0127,  0.0498],
         [ 0.0189, -0.1377, -0.0465,  ..., -0.0361,  0.0369,  0.0354]]],
       grad_fn=<ViewBackward0>)
torch.Size([1, 9, 768])
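
The shape arithmetic behind the multi-head output can be sketched as follows. Random tensors stand in for the outputs of the individual heads (an assumption for illustration); the sizes are the ones printed from the config above:

```python
import torch

embed_dim, num_heads = 768, 12          # values printed from the config above
head_dim = embed_dim // num_heads       # dimensions per head

# stand-ins for the outputs of the 12 attention heads
head_outputs = [torch.randn(1, 9, head_dim) for _ in range(num_heads)]

# concatenating along the last dimension restores the embedding size
merged = torch.cat(head_outputs, dim=-1)
print(head_dim, merged.size())
```

Because each head works in a smaller subspace, the concatenated result has the same size as the input embedding.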


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>4 |</span></b> <b>FEED FORWARD LAYER</b></div>

**position-wise feed-forward layer**

The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">feed-forward</mark>** sublayer in the encoder & decoder
- **two layer fully connected neural network**


However, instead of processing the whole sequence of embeddings as a single vector, 
- it **processes each embedding** independently
- You may also see it referred to as a Conv1D with a kernel size of 1 (typically by people with a CV background)


The **hidden size** of the **1st layer is 4x the size of the embeddings**, and a **GELU activation function** is most commonly used
- This is where most of the capacity & memorization is hypothesized to happen
- It is the part that is most often scaled when scaling up models
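
A minimal sketch of the "position-wise" property, using toy sizes and a single linear layer as assumptions: applying the layer to the whole sequence gives the same result as applying it to one token at a time, and the same weights can be expressed as a 1D convolution with a kernel size of 1.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden, seq_len = 8, 5                      # toy sizes
linear = nn.Linear(hidden, hidden)
x = torch.randn(1, seq_len, hidden)

with torch.no_grad():
    # whole sequence at once vs. one token at a time
    all_at_once = linear(x)
    per_token = torch.stack([linear(x[:, i]) for i in range(seq_len)], dim=1)

    # the same weights used as a 1D convolution with kernel size 1
    conv = nn.Conv1d(hidden, hidden, kernel_size=1)
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)
    # Conv1d expects [batch, channels, length], so transpose in and out
    conv_out = conv(x.transpose(1, 2)).transpose(1, 2)

print(torch.allclose(all_at_once, per_token, atol=1e-5))
print(torch.allclose(all_at_once, conv_out, atol=1e-5))
```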

In [20]:
class feedForward(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    # define layer operations input x
        
    def forward(self, x):    # note must be forward
        x = self.gelu(self.linear1(x))
        x = self.linear2(x)
        x = self.dropout(x)
        return x

In [21]:
# initialise feed-forward layer
feed_forward = feedForward(config)              # initialise 
print(feed_forward,'\n')

# apply to the multi-head attention output (attn_output)
ff_outputs = feed_forward(attn_output) # forward operation
ff_outputs

feedForward(
  (linear1): Linear(in_features=768, out_features=3072, bias=True)
  (linear2): Linear(in_features=3072, out_features=768, bias=True)
  (gelu): GELU(approximate='none')
  (dropout): Dropout(p=0.1, inplace=False)
) 



tensor([[[ 0.0221, -0.0033,  0.0176,  ..., -0.0263,  0.0332,  0.0323],
         [ 0.0146,  0.0081,  0.0309,  ..., -0.0244,  0.0290,  0.0177],
         [ 0.0168, -0.0034,  0.0257,  ..., -0.0133,  0.0470, -0.0018],
         ...,
         [ 0.0153, -0.0022,  0.0000,  ..., -0.0272,  0.0446,  0.0177],
         [ 0.0140,  0.0049,  0.0292,  ..., -0.0321,  0.0360,  0.0117],
         [ 0.0139,  0.0007,  0.0179,  ..., -0.0282,  0.0300,  0.0180]]],
       grad_fn=<MulBackward0>)

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>5 |</span></b> <b>NORMALISATION LAYERS</b></div>

Transformer architecture also uses **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">layer normalisation</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">skip connections</mark>**
- **normalisation** - normalises each input in the batch to have zero mean & unit variance (across its features)
- **skip connections** - pass a tensor to the next level of the model w/o processing & add it to the processed tensor
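
A quick sketch of what layer normalisation does, using a random, deliberately shifted and scaled stand-in for the hidden state (an assumption for illustration). The statistics are computed per token, over the 768 features:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(1, 9, 768) * 3.0 + 5.0    # stand-in hidden state, shifted and scaled
norm = nn.LayerNorm(768)

with torch.no_grad():
    y = norm(x)

# one mean and one variance per token, each computed over the 768 features
print(y.mean(dim=-1))
print(y.var(dim=-1, unbiased=False))
```

Each of the 9 tokens ends up with a mean close to 0 and a variance close to 1, regardless of the scale of the input.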

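Two main approaches when it comes to placing the normalisation layer in the encoder or decoder:
- **post layer** normalisation (used in the original Transformer paper, layer normalisation is placed between the skip connections)
- **pre layer** normalisation (layer normalisation is placed within the span of the skip connections)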

<br>

| `post-layer` normalisation | `pre-layer` normalisation |
| - | - |
| Arrangement is tricky to train from scratch, as the gradients can diverge | Most often found arrangement in the literature |
| Used with LR warm-up (learning rate gradually increased from a small value to some maximum value during training) | Places layer normalisation within the span of the skip connection |
|  | Tends to be much more stable during training, and it does not usually require any learning rate warm-up |
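
The difference between the two arrangements comes down to where the normalisation sits relative to the skip connection. The sketch below uses a single `nn.Linear` as a hypothetical stand-in for the attention or feed-forward sublayer, with toy sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 16                                  # toy size
sublayer = nn.Linear(hidden, hidden)         # hypothetical stand-in for attention / feed-forward
norm = nn.LayerNorm(hidden)
x = torch.randn(1, 9, hidden)

with torch.no_grad():
    # post layer normalisation: normalise after adding the skip connection
    post_ln = norm(x + sublayer(x))
    # pre layer normalisation: normalise inside the skip connection branch
    pre_ln = x + sublayer(norm(x))

print(post_ln.size(), pre_ln.size())
```

In the pre-layer form the input `x` reaches the output untouched through the skip connection, which is the arrangement used in the `encoderLayer` class in this notebook.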

In [22]:
class encoderLayer(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.norm2 = nn.LayerNorm(config.hidden_size)
        self.attention = multiHeadAttention(config)    # multihead attention layer 
        self.feed_forward = feedForward(config)        # feed forward layer

    def forward(self, x):
        
        # Apply layer norm. to hidden state, copy input into query, key, value
        # Apply attention with a skip connection
        x = x + self.attention(self.norm1(x))
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.norm2(x))
        
        return x

In [126]:
# Transformer layer output
encoder_layer = encoderLayer(config) # initialise encoder layer
print(encoder_layer,'\n')

print('input',inputs_embeds.shape) 
print('output',encoder_layer(inputs_embeds).size())

encoderLayer(
  (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attention): multiHeadAttention(
    (heads): ModuleList(
      (0-11): 12 x Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
    )
    (out_linear): Linear(in_features=768, out_features=768, bias=True)
  )
  (feed_forward): feedForward(
    (linear1): Linear(in_features=768, out_features=3072, bias=True)
    (linear2): Linear(in_features=3072, out_features=768, bias=True)
    (gelu): GELU(approximate='none')
    (dropout): Dropout(p=0.1, inplace=False)
  )
) 

input torch.Size([1, 9, 768])
output torch.Size([1, 9, 768])


There is an issue with the way we set up the **encoder layers** (which use just the token embeddings as input)
- they are totally **invariant to the position of the tokens**
- The multi-head attention layer is effectively a weighted sum, so the **information on token position is lost**
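
One way to see this is with plain scaled dot-product self-attention on toy embeddings (the sizes and the helper name `self_attention` are assumptions for illustration). Without positional information, shuffling the input tokens just shuffles the outputs in the same way, so the layer cannot tell the two orderings apart:

```python
import torch
import torch.nn.functional as F
from math import sqrt

def self_attention(x):
    # plain scaled dot-product self-attention with query = key = value = x
    scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(x.size(-1))
    return torch.bmm(F.softmax(scores, dim=-1), x)

torch.manual_seed(0)
x = torch.randn(1, 9, 16)                 # toy embeddings, no positional information
perm = torch.randperm(9)                  # shuffle the token order

out_then_shuffle = self_attention(x)[:, perm]
shuffle_then_out = self_attention(x[:, perm])

# shuffling the tokens only shuffles the outputs in the same way
print(torch.allclose(out_then_shuffle, shuffle_then_out, atol=1e-4))
```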


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>6 |</span></b> <b>POSITIONAL EMBEDDINGS</b></div>

Let's incorporate positional information using **positional embeddings**


**positional embeddings** are based on a simple idea:
  - Modify the **token embeddings** with a **position-dependent pattern** of values arranged in a vector
  
  
If the pattern is characteristic for each position
- the **attention heads** and **feed-forward layers** in each stack can learn to incorporate positional information into their transformations



- There are several ways to achieve this, and one of the most popular approaches is to use a `learnable pattern`
- This works exactly the same way as the token embeddings, but using the **position index** instead of the **token identifier** (from vocabulary dictionary) as input
- An efficient way of encoding the positions of tokens is learned during pretraining

Creating a custom `Embedding` class

Let’s create a custom Embeddings module (**token embeddings + positional embeddings**)
 - That combines a token embedding layer that projects the input_ids to a dense hidden state 
 - Together with the positional embedding that does the same for position_ids
 - The resulting embedding is simply the **sum of both embeddings**
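
A toy sketch of the effect of summing the two embeddings (the small sizes and the repeated token id are assumptions for illustration): the same token repeated three times has identical token embeddings, but after adding the positional embeddings each position gets its own vector.

```python
import torch
from torch import nn

torch.manual_seed(0)
tok_emb = nn.Embedding(10, 4)   # toy vocabulary of 10 tokens
pos_emb = nn.Embedding(5, 4)    # toy maximum of 5 positions

input_ids = torch.tensor([[3, 3, 3]])                          # same token three times
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)    # tensor([[0, 1, 2]])

token_only = tok_emb(input_ids)
combined = token_only + pos_emb(position_ids)

print(torch.equal(token_only[0, 0], token_only[0, 1]))  # same token, same vector
print(torch.equal(combined[0, 0], combined[0, 1]))      # position now distinguishes them
```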

In [24]:
'''

Token + Position Embedding 


'''

class tpEmbedding(nn.Module):
    
    def __init__(self, config):        
        super().__init__()
        
        # token embedding layer
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        
        # positional embedding layer
        # config.max_position_embeddings -> max number of positions in text 512 (tokens)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        
        self.norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout() # default p=0.5; config.hidden_dropout_prob (0.1) is the value used in feedForward

    def forward(self, input_ids):
        
        # Create position IDs for input sequence
        seq_length = input_ids.size(1) # number of tokens
        position_ids = torch.arange(seq_length, dtype=torch.long)[None,:] # range(0,9)
        
        # tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])
        # tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8]])
        
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        
        # Add normalisation & dropout layers
        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [157]:
# Token and Position Embeddings
embedding_layer = tpEmbedding(config)
embedding_layer(inputs.input_ids)

tensor([[[ 1.3445,  0.0000, -0.0000, -0.0000,  0.0000, -0.1042, -0.2192,
          -0.0000, -0.0000, -3.5667,  0.0000,  4.2198,  0.4537,  0.0000,
          -4.2740, -1.7250, -0.7161, -0.0000,  1.1875, -1.1810, -0.0000,
          -0.0000,  0.0000, -1.5592, -0.0000,  0.0000, -2.1069,  2.0811,
          -1.1085, -0.6592, -0.0000, -2.1012,  3.2155,  0.0000,  0.3414,
           0.0000,  0.0000,  2.3591,  0.0000,  3.7387, -1.7707, -1.3633,
           1.9601,  0.0000,  4.1266,  3.6932, -2.3514, -3.0475, -1.9932,
           2.1455,  0.0000, -0.0000,  0.7239, -0.0000,  0.0000, -1.0955,
           2.1274, -1.5079,  0.0000,  2.9827,  0.0000, -0.0000, -6.0318,
          -0.0000,  1.6531, -0.0000,  4.1636, -0.0000,  0.9650,  0.0000,
           0.0000, -0.0000, -2.0467,  1.0195,  0.0000,  0.0000,  0.0000,
           0.0000, -1.1940,  0.7735,  0.0000,  0.0000, -0.0000,  0.2993,
           0.0000, -1.7147, -3.9098, -0.5094, -0.0000, -0.0000, -1.2400,
          -0.0000, -0.0000,  0.0000, -2.0943, -1.17


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>7 |</span></b> <b>PUTTING IT ALL TOGETHER</b></div>

- Constructing the Transformer **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">encoder</mark>**, combining the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Embedding</mark>** and **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Encoder</mark>**  layers
- We utilise both **token** & **positional** embeddings using `tpEmbedding`
- We stack `config.num_hidden_layers` copies of `encoderLayer`, each of which contains the **attention** & **feed-forward** layers

In [26]:
# full transformer encoder combining the `tpEmbedding` layer with the stacked `encoderLayer` blocks

class TransformerEncoder(nn.Module):
    
    def __init__(self, config):       
        super().__init__()
        
        # token & positional embedding layer
        self.embeddings = tpEmbedding(config)
        
        # stack of encoder layers (attention & feed-forward)
        self.layers = nn.ModuleList([encoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        
        # embeddings layer output
        x = self.embeddings(x)
        
        # pass the hidden state through each encoder layer in turn
        for layer in self.layers:
            x = layer(x)
        return x

In [166]:
# Encoder initialisation & output
encoder = TransformerEncoder(config)

# freeze all weights (this does not switch off dropout)
for param in encoder.parameters():
    param.requires_grad = False

# run the same input twice to compare the outputs
for _ in range(2):
    encoder_output = encoder(inputs.input_ids)
    print(encoder_output)

tensor([[[ 2.0581e+00, -1.6299e-01, -2.9520e+00, -5.2393e-01,  8.3359e-01,
          -3.3352e+00, -1.2720e-01, -7.9162e-01, -2.2214e+00, -2.7648e+00,
          -6.4970e-01, -3.2357e+00,  5.3065e-01, -2.7389e+00, -1.3367e-01,
          -2.3654e+00, -2.3908e+00,  2.9580e+00, -2.0250e+00, -5.0651e-01,
           3.8579e-01, -1.0198e+00,  2.0577e+00, -2.8406e-02,  2.6890e+00,
           2.6196e+00, -1.1550e+00,  5.7243e-01, -1.1750e+00,  2.3551e+00,
           1.6403e+00,  3.5082e+00, -1.0916e-01, -1.5080e+00,  1.8043e+00,
          -2.9783e+00,  4.0652e-01, -2.6620e+00, -1.6786e+00, -1.0035e+00,
           6.6816e-01,  1.7701e+00,  4.6573e-01,  1.1439e+00, -2.3352e+00,
           4.0620e-01, -1.3216e+00,  2.9990e+00,  1.0780e+00,  1.8285e+00,
          -4.4860e-01,  4.1073e+00, -6.6948e-01,  9.8133e-01,  2.5430e+00,
           9.5153e-01,  1.0964e-02,  1.4591e+00, -1.7102e+00, -1.1592e+00,
          -2.8592e+00, -1.5909e+00, -7.0151e-01,  1.3612e+00,  2.3833e+00,
          -1.9930e+00, -5

In [161]:
# Now, check if the parameters are frozen
for name, param in encoder.named_parameters():
    print(f'{name}: {param.requires_grad}')

# Use the encoder for inference
encoder_output = encoder(inputs.input_ids)

embeddings.token_embeddings.weight: False
embeddings.position_embeddings.weight: False
embeddings.norm.weight: False
embeddings.norm.bias: False
layers.0.norm1.weight: False
layers.0.norm1.bias: False
layers.0.norm2.weight: False
layers.0.norm2.bias: False
layers.0.attention.heads.0.q.weight: False
layers.0.attention.heads.0.q.bias: False
layers.0.attention.heads.0.k.weight: False
layers.0.attention.heads.0.k.bias: False
layers.0.attention.heads.0.v.weight: False
layers.0.attention.heads.0.v.bias: False
layers.0.attention.heads.1.q.weight: False
layers.0.attention.heads.1.q.bias: False
layers.0.attention.heads.1.k.weight: False
layers.0.attention.heads.1.k.bias: False
layers.0.attention.heads.1.v.weight: False
layers.0.attention.heads.1.v.bias: False
layers.0.attention.heads.2.q.weight: False
layers.0.attention.heads.2.q.bias: False
layers.0.attention.heads.2.k.weight: False
layers.0.attention.heads.2.k.bias: False
layers.0.attention.heads.2.v.weight: False
layers.0.attention.heads.2.v
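
Printing every parameter name gets long for a full encoder. As a minimal sketch, the same check can be summarised by counting trainable and frozen parameters; `toy_model` below is a hypothetical stand-in used only to keep the example self-contained, and the same two sums should work on `encoder`.

```python
import torch
from torch import nn

# Hypothetical stand-in for the encoder: any nn.Module behaves the same way
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Freeze every parameter, as done for `encoder` above
for param in toy_model.parameters():
    param.requires_grad = False

# Count trainable vs. frozen parameters instead of printing each name
n_trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
n_frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)
print(f"trainable: {n_trainable}, frozen: {n_frozen}")
```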

In [145]:


# one hidden state per token: [batch_size, seq_len, hidden_size]
encoder_output.size()

torch.Size([1, 1, 768])

In [148]:
inputs.input_ids

tensor([[6423]])

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>8 |</span></b> <b>CLASSIFICATION HEAD</b></div>

Quite often, transformers are divided into:
- Task independent body (`TransformerEncoder`)
- Task dependent head (`TransformerClassifier`)

Select one of the token outputs:
- The hidden state of the first token (the **[CLS] token** in BERT-style models) is often used for the prediction
- We can attach a `dropout` and a `linear` layer to it to make a classification prediction

In [29]:
class TransformerClassifier(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        
        # Transformer Encoder
        self.encoder = TransformerEncoder(config)
        
        # Classification Head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # hidden state of the first token ([CLS] when special tokens are added)
        print(x.size())              # debug: [batch_size, hidden_size]
        x = self.dropout(x)
        x = self.classifier(x)       # hidden_size (768) -> num_labels
        return x

- For each sample in the batch we get the **unnormalized logits** for each class, which matches the output format of a BERT classification model
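
Because the head returns unnormalized logits, a softmax is needed to read them as class probabilities. The sketch below uses illustrative five-class logits (hypothetical values, not tied to a trained model) and the names `probs` and `predicted_class` are assumptions for this example.

```python
import torch

# Illustrative logits for one sample and five classes (hypothetical values)
logits = torch.tensor([[-1.7248, -1.0299, -0.2984, -0.8205, 1.5645]])

# Softmax over the class dimension turns logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)

# The predicted class is the index of the largest probability (or logit)
predicted_class = probs.argmax(dim=-1)
print(probs, predicted_class)
```

Since the model here is randomly initialised, such probabilities only become meaningful after training.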

In [143]:
config.num_labels = 1
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

torch.Size([1, 768])


tensor([[-0.3187]], grad_fn=<AddmmBackward0>)

In [31]:
inputs.input_ids

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])

In [32]:
output

tensor([[-1.7248, -1.0299, -0.2984, -0.8205,  1.5645]],
       grad_fn=<AddmmBackward0>)

In [46]:
# Tokenise input (text)
inputs = tokenizer(["julia is nice","julia is kind","julia is mean","julia is happy"], 
                   return_tensors="pt",      # PyTorch tensors
                   add_special_tokens=False) # don't add [CLS] / [SEP] tokens

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[6423, 2003, 3835],
        [6423, 2003, 2785],
        [6423, 2003, 2812],
        [6423, 2003, 3407]])


torch.Size([4, 3])

In [34]:
config.num_labels = 5
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

torch.Size([4, 768])


tensor([[ 0.2858,  2.2910,  1.1027, -1.3655,  0.5057],
        [-0.4398,  1.7071, -0.3332, -1.3267,  0.0404],
        [ 1.2801, -0.3843, -0.8903,  0.2645,  0.0164],
        [ 0.7872,  0.4755, -0.2488,  0.9355,  0.0259]],
       grad_fn=<AddmmBackward0>)

In [36]:
import numpy as np
output.detach().numpy()

array([[ 0.2858461 ,  2.291011  ,  1.1027378 , -1.3654897 ,  0.50570506],
       [-0.43984768,  1.707121  , -0.33323628, -1.3266757 ,  0.0403869 ],
       [ 1.2801098 , -0.38431922, -0.89034826,  0.26452443,  0.01639184],
       [ 0.787185  ,  0.47547337, -0.2487884 ,  0.93552494,  0.02594745]],
      dtype=float32)

In [96]:
inputs = tokenizer(x, 
                   return_tensors="pt",       # PyTorch tensors
                   add_special_tokens=False,  # don't add [CLS] / [SEP] tokens
                   padding=True)              # pad to the longest text in the batch
                    
output = encoder_classifier(inputs.input_ids)
output.detach().numpy()


torch.Size([3, 768])


array([[ 0.30067572, -0.19365877,  1.5731525 ,  0.34049684,  0.61987585],
       [ 1.599041  ,  0.65223604, -0.30029094, -0.67572856,  1.787407  ],
       [ 1.168354  ,  0.27019048,  0.68573517, -0.11251289,  0.5283011 ]],
      dtype=float32)

In [129]:
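# note: this rebinds the name `encoder`, which earlier referred to the TransformerEncoder instance
def encoder(x):
    # tokenise a list of texts and return the classifier logits as a NumPy array
    inputs = tokenizer(x,
                       return_tensors="pt",       # PyTorch tensors
                       add_special_tokens=False,  # don't add [CLS] / [SEP] tokens
                       padding=True)              # pad to the longest text in the batch
    output = encoder_classifier(inputs.input_ids)
    return output.detach().numpy()
encoder(["julia"])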


torch.Size([1, 768])


array([[ 1.1925466 , -2.004026  ,  0.37853348, -0.53329134,  3.518763  ]],
      dtype=float32)

In [130]:
encoder(["julia"])

torch.Size([1, 768])


array([[-1.4516523 , -0.29492986, -1.4100991 , -0.5428195 ,  1.6403229 ]],
      dtype=float32)

In [138]:
for i in range(100):
    print(encoder(["julia"]))

torch.Size([1, 768])
[[ 1.8389405  -1.0258632  -1.9472094   0.24997222  2.9709525 ]]
torch.Size([1, 768])
[[-1.2937596  -0.55749947 -1.5676594  -0.10609275  2.590736  ]]
torch.Size([1, 768])
[[ 0.7001611  -0.7492791  -1.4172636   0.41050863  1.5200357 ]]
torch.Size([1, 768])
[[ 0.38916868 -2.6472218  -1.4488543   0.6265552   2.8951118 ]]
torch.Size([1, 768])
[[-0.47231916 -0.53634554 -0.5050098  -0.97858167  2.0714037 ]]
torch.Size([1, 768])
[[ 1.1571622  -0.14157459 -0.44913638  0.04301509  2.4572744 ]]
torch.Size([1, 768])
[[ 1.1925933  -2.0172298  -0.03272817  0.05695865  4.1540723 ]]
torch.Size([1, 768])
[[ 0.26143798 -0.39048213 -0.5843675  -0.12257874  1.6031576 ]]
torch.Size([1, 768])
[[-0.00853661 -1.8508751  -0.90174556  0.15717727  0.42443526]]
torch.Size([1, 768])
[[-0.10971469 -0.96365035 -1.2132721   0.33689216  0.30131382]]
torch.Size([1, 768])
[[ 0.90288585 -1.5594695  -0.31239155  0.93199253  0.9063376 ]]
torch.Size([1, 768])
[[-0.575763   -1.474622   -0.01893973 -1.609
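
The outputs above change from call to call even though the input is identical. A likely cause is that the model is still in training mode, so its `dropout` layers randomly zero activations on every forward pass; setting `requires_grad = False` does not change this. The sketch below illustrates the effect on a hypothetical `toy_model` (an assumption for this example, not the notebook's classifier): after `.eval()` repeated calls agree.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for `encoder_classifier`: linear layers with dropout in between
toy_model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 5))
x = torch.randn(1, 8)

# Training mode (the default): a new dropout mask is sampled on every call
toy_model.train()
train_a, train_b = toy_model(x), toy_model(x)

# Evaluation mode: dropout is disabled, so repeated calls give the same result
toy_model.eval()
with torch.no_grad():
    eval_a, eval_b = toy_model(x), toy_model(x)

print("train outputs equal:", torch.equal(train_a, train_b))
print("eval outputs equal: ", torch.equal(eval_a, eval_b))
```

Under that assumption, calling `encoder_classifier.eval()` before inference should make the repeated calls return the same logits.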

In [98]:
import faiss
class Embeddings:
    # [CLS] pooling: use the last hidden state of the first ([CLS]) token as the text embedding
    def cls_pooling(self, model_output):
        return model_output.last_hidden_state[:, 0]

    # tokenise the input texts and return their [CLS] embeddings as a NumPy array
    # (`tokenizer`, `model` and `device` are expected to be defined elsewhere in the notebook)
    def get_embeddings(self, text_list):
        encoded_input = tokenizer(
            text_list, padding=True, truncation=True, return_tensors="pt"
        )
        encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
        model_output = model(**encoded_input)
        return self.cls_pooling(model_output).cpu().detach().numpy()
    
    
    # embed each item of the dataset, giving one vector per row for FAISS
    def makeEmbeddings(self,dataset):
        embeddings = []
        for data in dataset:
            embeddings.append(self.get_embeddings(data)[0])
        return np.array(embeddings)
    
    def getQueryEmbedding(self, query):
        return self.get_embeddings([query])
    
class Faiss:
    def __init__(self):
        pass

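    # build an HNSW index over the database vectors `xb` (float32, shape [n, d])
    def faiss(self,xb):
        d = xb.shape[1]                        # dimensionality of the vectors
        M = 32                                 # number of graph neighbours per node
        index = faiss.IndexHNSWFlat(d, M)            
        index.hnsw.efConstruction = 40         # candidate list size while building the graph
        index.hnsw.efSearch = 16               # candidate list size while searching
        index.add(xb)
        return index
    
    # return distances D and indices I of the k nearest neighbours of each query in `xq`
    def query(self,index,xq,k=3):
        D, I = index.search(xq, k)   
        return D, I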
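
HNSW is an approximate nearest-neighbour index, so on small datasets it can help to compare against an exact search. The following is a minimal NumPy sketch of brute-force k-nearest-neighbour search with squared L2 distance (to my knowledge the default metric of `IndexHNSWFlat`); `brute_force_search` and the toy vectors are hypothetical names and values introduced only for illustration.

```python
import numpy as np

def brute_force_search(xb, xq, k=3):
    """Exact k-nearest-neighbour search using squared L2 distance."""
    # pairwise squared distances, shape [n_queries, n_database]
    diffs = xq[:, None, :] - xb[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # indices of the k smallest distances for each query
    I = np.argsort(dists, axis=1)[:, :k]
    # the matching distances, in the same order
    D = np.take_along_axis(dists, I, axis=1)
    return D, I

# Hypothetical toy vectors: three database points and one query
xb_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
xq_toy = np.array([[0.9, 0.1]], dtype=np.float32)

D_toy, I_toy = brute_force_search(xb_toy, xq_toy, k=2)
print(I_toy)
```

For a few dozen vectors the exact search is cheap, which makes it a convenient reference when judging the indices returned by the HNSW index.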

In [99]:
xb = encoder(["julia is nice", "isabelle is planning", "julia plans too"])
xq = encoder(["isabelle plans it"])
index = Faiss().faiss(xb)
D,I = Faiss().query(index,xq)
I

torch.Size([3, 768])
torch.Size([1, 768])


array([[1, 0, 2]])

In [100]:
import pandas as pd
df = pd.read_csv("SEC-CompanyTicker.csv", index_col=0)

In [109]:
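x = list(df.head(20).companyName)
q = ["Apple Inc."]
xb, xq = encoder(x), encoder(q)   # `encoder` takes a single list of texts
index = Faiss().faiss(xb)
# note: this searches with the database vectors `xb` (not `xq`),
# so each row's nearest neighbour is the company itself
D,I = Faiss().query(index,xb)
I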

torch.Size([20, 768])
torch.Size([1, 768])


array([[ 0,  9, 17],
       [ 1, 10,  6],
       [ 2,  7,  8],
       [ 3, 13, 10],
       [ 4,  6,  0],
       [ 5,  0,  9],
       [ 6,  1,  4],
       [ 7,  2,  8],
       [ 8,  7,  0],
       [ 9, 17,  0],
       [10,  1,  3],
       [11, 15,  6],
       [12,  1,  9],
       [13,  3, 10],
       [14, 16, 17],
       [15, 10, 11],
       [16, 14, 17],
       [17,  9,  0],
       [18,  4,  2],
       [19,  7,  3]])

In [104]:
x

['Apple Inc.',
 'Microsoft Corp',
 'Alphabet Inc.',
 'Amazon Com Inc',
 'Nvidia Corp',
 'Tesla, Inc.',
 'Berkshire Hathaway Inc',
 'Meta Platforms, Inc.',
 'Eli Lilly & Co',
 'Visa Inc.',
 'Taiwan Semiconductor Manufacturing Co Ltd',
 'Exxon Mobil Corp',
 'Unitedhealth Group Inc',
 'Walmart Inc.',
 'Novo Nordisk A S',
 'Jpmorgan Chase & Co',
 'Spdr S&P 500 Etf Trust',
 'Johnson & Johnson',
 'Mastercard Inc',
 'Lvmh Moet Hennessy Louis Vuitton']

In [94]:
x = "am"
encoder(x)

torch.Size([1, 768])


array([[-0.6796455 , -0.26035303,  0.939491  , -0.8765463 ,  1.2847595 ]],
      dtype=float32)