# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>1 |</span></b> <b>BACKGROUND</b></div>


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

In this notebook, we'll look at the following components of the Transformer encoder structure:

<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    
<ul>
<li>Simple Attention</li>
<li>Multi-Head Self Attention</li>
<li>Feed Forward Layer</li>
<li>Normalisation</li>
<li>Skip Connection</li>
<li>Position Embeddings</li>
<li>Transformer Encoder</li>
<li>Classifier Head</li>
</ul> 
</div> 

<br>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>ENCODER BASE</span></b></p></div>

Encoder simply put:
- Converts a **series of tokens** into a **series of embedding vectors** (hidden state)
- The encoder (neural network) consists of **multiple layers** (**blocks**) stacked together

The encoder structure:
- Composed of multiple encoder layers (blocks) stacked one after another (similar to CNN layer stacks)
- Each encoder block contains **multi-head self attention** & **fully connected feed forward layer** (for each input embedding)

Purpose of the Encoder
- Input tokens are encoded & modified into a form that **stores some contextual information** in the sequence

The example we'll use:

> the bark of a palm tree is very rough


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>CLASSIFICATION HEAD</span></b></p></div>

- Transformers can be utilised for various applications, so they are created in a base form
- If we want to utilise them for a specific task, we add an extra component **head** to the transformer
- In this example, we'll utilise it for **classification** purposes, and look at how we can combine the base with the **head**


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>2 |</span></b> <b>SIMPLE SELF ATTENTION</b></div>

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>TYPES OF ATTENTION</span></b></p></div>

**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention</mark>**
 
- Mechanism which allows networks to assign **different weight distributions to each element** in a sequence 
- Elements in sequence - `token embeddings` (each token mapped to a vector of fixed dimension) (eg. BERT model - 768 dimensions)
 
 
**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>**

- Instead of using fixed embeddings for each token, we can use the whole sequence to **compute a weighted average** of each `embedding`
- One can think of self-attention as a form of averaging
- A common form of `self-attention` is **scaled dot-product attention**


<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>FOUR MAIN STEPS</span></b></p></div>


- Project each `token embedding` into three vectors **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>**
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** (nxn)

    - (we determine how much the query & key vectors relate to each other using a similarity function)
    - Similarity function for scaled dot-product attention - dot product
    - queries & keys that are similar will have a large dot product & vice versa
    - Outputs from this step - attention scores
    
    
- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weight</mark>** (wij)

    - dot products can produce arbitrarily large numbers, which can destabilise training
    - attention scores are first multiplied by a scaling factor to normalise their variance
    - Then normalised with softmax to ensure the weights in each row sum to 1
    
    
- Update the token embeddings (hidden state)

    - multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">weights</mark>** by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** vector
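
Written compactly, the four steps above amount to the following. The notation is an assumption introduced here for illustration: $Q$, $K$ and $V$ are matrices whose rows are the query, key and value vectors ($q_i$, $k_j$, $v_j$), $d_k$ is their dimension, and $x_i'$ is the updated embedding of token $i$.

$$
S_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad
w_{ij} = \frac{\exp(S_{ij})}{\sum_{j'} \exp(S_{ij'})}, \qquad
x_i' = \sum_{j} w_{ij}\, v_j
$$

In matrix form this is $\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$, with the softmax taken over each row.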

<div style="color:white;display:fill;border-radius:8px;font-size:100%; letter-spacing:1.0px;"><p style="padding: 5px;color:white;text-align:left;"><b><span style='color:#F1A424;text-align:center'>SIMPLE ATTENTION FORMULATION</span></b></p></div>


- We'll look at a simple example, and summarise the attention mechanism in one function
- The `bert-base-uncased` model configuration will be used to read different model settings (eg. number of attention heads), so we will be building a similar model 

<br>

##### **1. DOCUMENT TOKENISATION**

- Each token in the sentence has been mapped to a **unique identifier** from a **vocabulary** or **dictionary**
- We start off by using the `bert-base-uncased` pretrained tokeniser

In [1]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel

# load tokeniser and model
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# document we'll be using as an example
text = ["the bark of a palm tree is very rough of"]



In [2]:
# Tokenise input (text)
inputs = tokenizer(text, 
                   return_tensors="pt",      # PyTorch tensors
                   add_special_tokens=False) # omit special tokens ([CLS], [SEP])

print(inputs.input_ids)


tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931,  1997]])


In [3]:
inputs.input_ids.is_nested

False

In [4]:
# Decode sequence
tokenizer.decode(inputs['input_ids'].tolist()[0])

'the bark of a palm tree is very rough of'

At this point:

- `inputs.input_ids` is a tensor of token ids (each token mapped to its id in the vocabulary)
- Token embeddings are **independent of their context**
- **Homonyms** (same spelling, but different meaning) have the same representation

Role of subsequent attention layers:

- Mix the **token embeddings** to disambiguate & inform the representation of each token with its context

In [5]:
'''

Create an embedding layer

'''

from torch import nn
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_ckpt)

print(config.hidden_size,"hidden size")
print(config.vocab_size,"vocabulary size")

# create a sample embedding layer of size (30522, 768) -> same dimensions as bert-base (randomly initialised)
token_emb = nn.Embedding(config.vocab_size,
                         config.hidden_size)
token_emb

768 hidden size
30522 vocabulary size


Embedding(30522, 768)

##### **2. EMBEDDING VECTORS**


- Convert the tokenised data into embedding vectors (768 dimensions) using a vocabulary of 30522 tokens
- Each id in `input_ids` is **mapped to one of 30522 embedding vectors** stored in `nn.Embedding`, each with a size of 768
- Calling the embedding layer on the ids, `token_emb(inputs.input_ids)`, returns a tensor of shape [batch_size, seq_len, hidden_dim]
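
To make the lookup behaviour concrete, here is a minimal sketch with hypothetical toy sizes (a vocabulary of 20 and 4 dimensions rather than 30522 and 768). It illustrates that an embedding layer simply indexes rows of its weight matrix, so a repeated token id always yields the same vector.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy sizes (the notebook uses 30522 x 768)
vocab_size, hidden_size = 20, 4
toy_emb = nn.Embedding(vocab_size, hidden_size)

# One sentence of four token ids; id 7 appears twice
ids = torch.tensor([[3, 7, 1, 7]])
vectors = toy_emb(ids)

print(vectors.shape)                                     # [batch_size, seq_len, hidden_dim]
print(torch.equal(vectors[0], toy_emb.weight[ids[0]]))   # lookup is the same as row indexing
print(torch.equal(vectors[0, 1], vectors[0, 3]))         # repeated id gives an identical vector
```

This is why token embeddings on their own cannot distinguish two occurrences of the same word.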

In [6]:
'''

Convert Tokens to Embedding Vectors
using the randomly initialised embedding layer created above

'''

inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 10, 768])

In [7]:
# 10 embedding vectors of 768 dimensions
inputs_embeds

tensor([[[-0.2747, -1.8235,  0.8008,  ...,  0.7047,  1.3227, -0.5303],
         [ 1.4124,  0.7308, -0.4173,  ...,  1.8594, -1.0689,  0.1558],
         [ 0.9272,  0.1791,  0.3571,  ...,  0.3408,  0.8663, -0.5704],
         ...,
         [-0.4634, -0.4977, -0.3177,  ...,  1.3512, -0.7524, -0.1523],
         [ 1.7729,  1.0229,  1.6567,  ..., -0.3762,  0.5209,  3.1483],
         [ 0.9272,  0.1791,  0.3571,  ...,  0.3408,  0.8663, -0.5704]]],
       grad_fn=<EmbeddingBackward0>)

##### **3. QUERY, KEY, VALUE VECTORS**

- As the most simplistic case of attention, **we set them equal to one another**
- Attention mechanism with equal query and key vectors will assign a **very large score to identical words in the context** (diagonal component of matrix)
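
A rough sketch of why the diagonal dominates, assuming the embeddings are drawn from a standard normal distribution (which is how `nn.Embedding` initialises its weights by default). A token scored against itself gives its squared norm divided by `sqrt(dim)`, which is about `dim / sqrt(dim) = sqrt(dim)`, whereas two unrelated random vectors give a score of order 1.

```python
import torch
from math import sqrt

torch.manual_seed(0)

seq_len, dim = 10, 768
x = torch.randn(1, seq_len, dim)    # stand-in for freshly initialised embeddings

# query = key = x, as in the simplistic case
scores = torch.bmm(x, x.transpose(1, 2)) / sqrt(dim)

diag = torch.diagonal(scores[0])                               # each token against itself
off_diag = scores[0][~torch.eye(seq_len, dtype=torch.bool)]    # all other pairs

print("sqrt(dim)           :", round(sqrt(dim), 2))
print("mean diagonal       :", round(diag.mean().item(), 2))
print("mean |off-diagonal| :", round(off_diag.abs().mean().item(), 2))
```

For `dim = 768`, `sqrt(dim)` is roughly 27.7, which is in line with the diagonal values of the score matrix computed in the notebook.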

In [8]:
import torch
from math import sqrt

# setting them equal to one another
print("query and key components\n")
query = key = value = inputs_embeds
print('query size:',query.size())
dim_k = key.size(-1)   # hidden dimension 
print('key size:',key.transpose(1,2).size())

query and key components

query size: torch.Size([1, 10, 768])
key size: torch.Size([1, 768, 10])


##### **4. COMPUTE ATTENTION SCORES**

- Compute **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** using the **dot product as the similarity function**
- `torch.bmm` - batch matrix-matrix product (as we work in batches during training)
- To swap the last two dimensions of a batched tensor (i.e. transpose each matrix in the batch), use `tensor.transpose(1,2)`
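
The scores are also divided by `sqrt(dim_k)`. A small sketch of the motivation, using plain Python and hypothetical sizes: if the components of the query and key vectors are independent with zero mean and unit variance, their dot product has a variance of about `dim_k`, and dividing by `sqrt(dim_k)` brings it back to about 1.

```python
import random
from math import sqrt

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

dim_k = 64        # hypothetical vector dimension
n_pairs = 2000    # number of random query/key pairs

raw = []
for _ in range(n_pairs):
    q = [random.gauss(0, 1) for _ in range(dim_k)]
    k = [random.gauss(0, 1) for _ in range(dim_k)]
    raw.append(dot(q, k))

scaled = [s / sqrt(dim_k) for s in raw]

print("variance of raw dot products   :", round(variance(raw), 1))      # close to dim_k
print("variance of scaled dot products:", round(variance(scaled), 2))   # close to 1
```

Keeping the scores at unit scale stops the softmax from saturating as the dimension grows.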

In [9]:
# dot product, scaled by 1/sqrt(dim_k)
print("\ndot product (attention scores)")
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()


dot product (attention scores)


torch.Size([1, 10, 10])

In [10]:
# attention scores
scores

tensor([[[ 2.5829e+01, -1.0454e+00,  3.1074e-01, -2.6252e+00, -1.3836e+00,
          -1.2813e+00, -8.3912e-01, -8.1960e-01,  1.0559e+00,  3.1074e-01],
         [-1.0454e+00,  2.7458e+01,  7.7848e-01,  1.1003e+00,  1.4799e+00,
           9.2834e-02,  2.7509e-02,  1.4460e+00, -8.3715e-01,  7.7848e-01],
         [ 3.1074e-01,  7.7848e-01,  2.8841e+01,  1.1517e-01,  1.9508e-01,
           1.3662e-01,  4.5663e-02,  4.8583e-01, -3.8592e-01,  2.8841e+01],
         [-2.6252e+00,  1.1003e+00,  1.1517e-01,  2.9882e+01,  2.1040e-01,
          -2.5550e+00,  5.2171e-02,  4.1555e-01, -1.6756e+00,  1.1517e-01],
         [-1.3836e+00,  1.4799e+00,  1.9508e-01,  2.1040e-01,  2.8372e+01,
           8.4505e-02, -3.4399e-01,  3.4558e-02,  2.5317e-01,  1.9508e-01],
         [-1.2813e+00,  9.2834e-02,  1.3662e-01, -2.5550e+00,  8.4505e-02,
           2.8661e+01, -1.2571e+00, -7.4733e-01,  1.0242e-01,  1.3662e-01],
         [-8.3912e-01,  2.7509e-02,  4.5663e-02,  5.2171e-02, -3.4399e-01,
          -1.2571e+

##### **5. COMPUTE ATTENTION WEIGHTS (SOFTMAX FUNCTION)**



- We have created a 10x10 matrix of **attention scores** per sample in the batch (one row & one column per token)
- Apply the softmax for normalisation to get the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

In [11]:
import torch.nn.functional as F

print("softmax applied, attention weights:\n")
weights = F.softmax(scores, dim=-1)
print(weights.size())
print(weights)

print("\nsum of each row:\n")
weights.sum(dim=-1)

softmax applied, attention weights:

torch.Size([1, 10, 10])
tensor([[[1.0000e+00, 2.1312e-12, 8.2716e-12, 4.3904e-13, 1.5197e-12,
          1.6834e-12, 2.6195e-12, 2.6711e-12, 1.7427e-11, 8.2716e-12],
         [4.1812e-13, 1.0000e+00, 2.5905e-12, 3.5741e-12, 5.2240e-12,
          1.3050e-12, 1.2225e-12, 5.0501e-12, 5.1491e-13, 2.5905e-12],
         [2.0336e-13, 3.2464e-13, 5.0000e-01, 1.6724e-13, 1.8115e-13,
          1.7086e-13, 1.5601e-13, 2.4227e-13, 1.0132e-13, 5.0000e-01],
         [7.6282e-15, 3.1653e-13, 1.1819e-13, 1.0000e+00, 1.2999e-13,
          8.1837e-15, 1.1097e-13, 1.5960e-13, 1.9717e-14, 1.1819e-13],
         [1.1944e-13, 2.0928e-12, 5.7909e-13, 5.8803e-13, 1.0000e+00,
          5.1847e-13, 3.3778e-13, 4.9321e-13, 6.1373e-13, 5.7909e-13],
         [9.9100e-14, 3.9159e-13, 4.0912e-13, 2.7728e-14, 3.8834e-13,
          1.0000e+00, 1.0152e-13, 1.6903e-13, 3.9536e-13, 4.0912e-13],
         [1.0422e-13, 2.4792e-13, 2.5246e-13, 2.5411e-13, 1.7099e-13,
          6.8616e-14, 

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)
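
The third row of the weights above is worth a closer look: the token "of" appears twice in the sentence, so both occurrences tie for the highest score and the softmax splits the weight evenly between them. A minimal sketch in plain Python, using the scores of that row rounded to two decimals:

```python
from math import exp

def softmax(row):
    # subtract the max for numerical stability before exponentiating
    m = max(row)
    exps = [exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Third row of the score matrix (rounded): the two "of" tokens tie at ~28.84,
# every other token scores close to zero
row = [0.31, 0.78, 28.84, 0.12, 0.20, 0.14, 0.05, 0.49, -0.39, 28.84]
weights = softmax(row)

print([round(w, 3) for w in weights])
print("row sum:", round(sum(weights), 6))
```

The weights of positions 2 and 9 each come out at about 0.5, all others are negligible, and the row sums to 1.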

##### **6. UPDATE VALUES**

Multiply the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>** matrix by the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">values</mark>** vector

In [12]:
attn_outputs = torch.bmm(weights, value)
print(attn_outputs)
print(attn_outputs.shape)

tensor([[[-0.2747, -1.8235,  0.8008,  ...,  0.7047,  1.3227, -0.5303],
         [ 1.4124,  0.7308, -0.4173,  ...,  1.8594, -1.0689,  0.1558],
         [ 0.9272,  0.1791,  0.3571,  ...,  0.3408,  0.8663, -0.5704],
         ...,
         [-0.4634, -0.4977, -0.3177,  ...,  1.3512, -0.7524, -0.1523],
         [ 1.7729,  1.0229,  1.6567,  ..., -0.3762,  0.5209,  3.1483],
         [ 0.9272,  0.1791,  0.3571,  ...,  0.3408,  0.8663, -0.5704]]],
       grad_fn=<BmmBackward0>)
torch.Size([1, 10, 768])


Now we have a general function which:
- Takes the vectors `query`, `key` & `value` as inputs
- Calculates the scaled dot-product attention 

In [13]:
'''

Scaled Dot Product Attention
scores = query*key.T / sqrt(dims)
weights = softmax(scores) 
output = weights*value

'''

def sdp_attention(query, key, value):
    dim_k = query.size(-1) # dimension component
    sfact = sqrt(dim_k)     
    scores = torch.bmm(query, key.transpose(1,2)) / sfact
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
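
A quick sanity check for the function, sketched with hypothetical toy sizes. Because each row of the attention weights sums to 1, every output vector is a weighted average of the value vectors; so if all value vectors are filled with ones, the output must be all ones as well, whatever the queries and keys are.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def sdp_attention(query, key, value):
    # same computation as the function above, repeated so the snippet runs on its own
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

torch.manual_seed(0)
batch, seq_len, dim = 2, 5, 8            # hypothetical toy sizes
q = torch.randn(batch, seq_len, dim)
k = torch.randn(batch, seq_len, dim)

# If every value vector is all ones, any weighted average of them is all ones too
v = torch.ones(batch, seq_len, dim)
out = sdp_attention(q, k, v)

print(out.shape)
print(torch.allclose(out, torch.ones_like(out)))
```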

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>3 |</span></b> <b>MULTIHEAD SELF ATTENTION</b></div>


- The meaning of a word will be better informed by **complementary words in the context** than by **identical words** (which dominate the attention weights in the simple approach above)

##### **SIMPLISTIC APPROACH**

- We only used the embeddings "as is" (no linear transformation) to compute the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention scores</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention weights</mark>**

##### **BETTER APPROACH**

- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">self-attention</mark>** layer applies **three independent linear transformations (`nn.Linear`) to each embedding** to generate **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">query</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">key</mark>**,**<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">value</mark>** 
- These transformations project the embeddings and **each projection carries its own set of learnable parameters** (**Weights**)
- This **allows the self-attention layer to focus on different semantic aspects of the sequence**



It's beneficial to have **multiple sets of linear projections** (each one represents an **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**)

Why do we need more than one **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">attention head</mark>**?
- The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">softmax</mark>** of one head tends to focus on mostly **one aspect of similarity**


**Several heads** allow the model to **focus on several aspects at once**
- Eg. one head can focus on subject-verb interaction, another finds nearby adjectives
- **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">CV analogy</mark>**: filters; one filter responsible for detecting the head, another for facial features 

In [14]:
'''

Attention Class

# nn.Linear : apply linear transformation to incoming data
#             y = x * A^T + b
# where x is the input, A is the weight matrix, b is the bias & y is the output

# calculate scaled dot product attention matrix
# Requires embedding dimension 
# Each attention head has its own q,k,v projections

'''

class Attention(nn.Module):
    
    # initialisation 
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        
        # Define the three linear projections (query, key, value)
        # input - embed_dim, output - head_dim
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    # main class operation
    def forward(self, hidden_state):
        
        # calculate scaled dot-product attention given the projected query, key & value
        attn_outputs = sdp_attention(
            self.q(hidden_state), 
            self.k(hidden_state), 
            self.v(hidden_state))
        
        return attn_outputs

`Attention` will be used in the construction of a model

- We’ve **initialised three independent linear layers** that apply matrix multiplication to the embedding vectors to produce tensors of shape [batch_size, seq_len, head_dim]
- Where head_dim is the number of dimensions we are projecting into
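
A short shape walk-through, assuming the `bert-base-uncased` settings (768 hidden dimensions, 12 heads). In practice `head_dim` is usually chosen as `embed_dim // num_heads`, so that concatenating the outputs of all heads restores the original embedding dimension. Only a single projection per head is shown here to keep the sketch short.

```python
import torch
from torch import nn

embed_dim, num_heads = 768, 12          # bert-base-uncased settings
head_dim = embed_dim // num_heads       # 64

hidden_state = torch.randn(1, 10, embed_dim)    # [batch_size, seq_len, embed_dim]

# One projection per head (only one of q/k/v is shown, for brevity)
projections = [nn.Linear(embed_dim, head_dim) for _ in range(num_heads)]
per_head = [proj(hidden_state) for proj in projections]
merged = torch.cat(per_head, dim=-1)

print(per_head[0].shape)    # [batch_size, seq_len, head_dim]
print(merged.shape)         # back to [batch_size, seq_len, embed_dim]
```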


In [15]:
# model settings read from the BERT config
print(config.num_attention_heads,'heads')
print(config.hidden_size,'hidden state embedding dimension')

12 heads
768 hidden state embedding dimension


In [16]:
''' Sample Initialisation '''

# Initialise just one head; the forward pass requires the token embedding vectors
# Note: num_heads (12) is passed as head_dim here, so this head projects into 12 dimensions

embed_dim = config.hidden_size
num_heads = config.num_attention_heads

attention = Attention(embed_dim,num_heads)
attention

Attention(
  (q): Linear(in_features=768, out_features=12, bias=True)
  (k): Linear(in_features=768, out_features=12, bias=True)
  (v): Linear(in_features=768, out_features=12, bias=True)
)

In [17]:
# Weights are always initialised randomly, attention_outputs varies
attention_outputs = attention(inputs_embeds)
attention_outputs

tensor([[[-3.2667e-04, -7.7737e-02, -5.5776e-02,  3.0149e-02, -1.9336e-01,
          -3.2226e-01,  1.1496e-01,  3.2659e-02, -1.3306e-01, -1.6859e-01,
           9.3852e-02, -4.2003e-01],
         [-9.9959e-02, -1.0941e-01, -1.8432e-01,  3.6811e-02, -2.3326e-01,
          -3.2482e-01,  8.3666e-02,  1.0681e-01, -1.1075e-01, -3.4479e-01,
          -8.4057e-03, -4.5280e-01],
         [-1.2049e-02, -1.1459e-01, -1.6258e-01,  3.2581e-02, -2.4308e-01,
          -4.3216e-01,  9.4243e-02,  3.7738e-02, -1.4588e-01, -2.2938e-01,
           2.0252e-02, -4.3899e-01],
         [ 6.0900e-02,  4.9530e-02, -3.2478e-02,  3.4811e-02, -3.6170e-01,
          -1.8953e-01, -4.4606e-02,  5.6850e-02, -2.7326e-01, -1.6369e-01,
           1.8536e-01, -3.4686e-01],
         [-2.7913e-02, -4.9991e-02, -4.2541e-02,  4.2598e-02, -2.3129e-01,
          -2.1277e-01,  7.6612e-02,  1.0055e-01, -2.0214e-01, -2.0287e-01,
           1.5863e-01, -4.0711e-01],
         [ 3.6021e-02, -4.2606e-02, -1.3185e-01,  2.3449e-02, -3.

In [18]:
'''

Multihead attention class

'''


class multiHeadAttention(nn.Module):
    
    # Config during initialisation
    def __init__(self, config):
        super().__init__()
        
        # model params, read from config file
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        
        # attention heads (only defined here; the hidden state is passed in forward)
        # each attention head is initialised with head dimension embed_dim // num_heads
        self.heads = nn.ModuleList(
            [Attention(embed_dim, head_dim) for _ in range(num_heads)])
        
        # output uses whole embedding dimension for output
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    # Given a hidden state (embeddings)
    # Apply operation for multihead attention
        
    def forward(self, hidden_state):
        
        # for each head (of size embed_dim // num_heads), calculate attention
        heads = [head(hidden_state) for head in self.heads] 
        x = torch.cat(heads, dim=-1) # merge/concat head data together
    
        # apply linear transformation to the concatenated multi-head attention output
        x = self.out_linear(x)
        return x

In [19]:
'''

Sample Usage: Multi-Head Attention

'''

# The output differs on every run because the weights are randomly initialised
multihead_attn = multiHeadAttention(config) # initialisation with config
attn_output = multihead_attn(inputs_embeds) # forward by inputting embedding vectors (one for each token)

# Attention output (per-head weighted values, concatenated & linearly transformed)
print(attn_output)
print(attn_output.size())

tensor([[[-0.0662, -0.0831,  0.0020,  ..., -0.0755, -0.1310, -0.0358],
         [-0.0500,  0.0506, -0.0379,  ..., -0.1019, -0.1485, -0.0841],
         [-0.0568,  0.0297, -0.0643,  ..., -0.0936, -0.1314,  0.0077],
         ...,
         [-0.0688,  0.0733, -0.0078,  ..., -0.0790, -0.0736, -0.0299],
         [-0.0882, -0.0383, -0.0683,  ..., -0.0849, -0.1354,  0.0153],
         [-0.0568,  0.0297, -0.0643,  ..., -0.0936, -0.1314,  0.0077]]],
       grad_fn=<ViewBackward0>)
torch.Size([1, 10, 768])
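
As a rough check on the size of this block, the number of learnable parameters can be worked out by hand from the layer shapes used above (12 heads, each with three `nn.Linear(768, 64)` projections, plus one `nn.Linear(768, 768)` output layer). The helper name below is hypothetical.

```python
embed_dim, num_heads = 768, 12
head_dim = embed_dim // num_heads

def linear_params(n_in, n_out):
    # weight matrix plus bias vector of an nn.Linear(n_in, n_out)
    return n_in * n_out + n_out

per_head = 3 * linear_params(embed_dim, head_dim)    # q, k, v projections of one head
all_heads = num_heads * per_head
out_proj = linear_params(embed_dim, embed_dim)       # out_linear

total = all_heads + out_proj
print(per_head, all_heads, out_proj, total)
```

The same total should be returned by `sum(p.numel() for p in multihead_attn.parameters())` in the notebook.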


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>4 |</span></b> <b>FEED FORWARD LAYER</b></div>

**position-wise feed-forward layer**

The **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">feed-forward</mark>** sublayer in the encoder & decoder
- **two layer fully connected neural network**


However, instead of processing the whole sequence of embeddings as a single vector, 
- it **processes each embedding** independently
- You may also see it referred to as a Conv1D with a kernel size of 1 (typically by people with a CV background)


The **hidden size** of the **1st layer = 4x size of the embeddings**, followed by a **GELU activation function**
- This is where most of the capacity & memorization is hypothesized to happen
- It is the part most often scaled when scaling up the models
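
The Conv1D remark can be checked directly. The sketch below uses hypothetical toy sizes: it copies the weights of an `nn.Linear` into an `nn.Conv1d` with `kernel_size=1` and confirms that both give the same output. It also shows that changing one token leaves the outputs of all other positions untouched, which is what "position-wise" means.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, intermediate_size = 8, 32     # hypothetical toy sizes (BERT: 768, 3072)
x = torch.randn(2, 5, hidden_size)          # [batch_size, seq_len, hidden_size]

linear = nn.Linear(hidden_size, intermediate_size)
conv = nn.Conv1d(hidden_size, intermediate_size, kernel_size=1)

# Copy the linear weights into the convolution so both compute the same map
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))   # [out, in] -> [out, in, 1]
    conv.bias.copy_(linear.bias)

out_linear = linear(x)
# Conv1d expects [batch, channels, length], so swap the last two axes around the call
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_linear, out_conv, atol=1e-5))

# Perturb only the first token: the outputs at the other positions do not change
x_changed = x.clone()
x_changed[:, 0] += 1.0
print(torch.allclose(linear(x_changed)[:, 1:], out_linear[:, 1:]))
```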

In [20]:
class feedForward(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    # define the layer operations applied to input x
        
    def forward(self, x):    # nn.Module expects this method to be named forward
        x = self.gelu(self.linear1(x))
        x = self.linear2(x)
        x = self.dropout(x)
        return x

In [21]:
# initialise feedforward layer
feed_forward = feedForward(config)              # initialise 
print(feed_forward,'\n')

# the forward pass takes the multi-head attention output (attn_output)
ff_outputs = feed_forward(attn_output) # forward operation
ff_outputs

feedForward(
  (linear1): Linear(in_features=768, out_features=3072, bias=True)
  (linear2): Linear(in_features=3072, out_features=768, bias=True)
  (gelu): GELU(approximate='none')
  (dropout): Dropout(p=0.1, inplace=False)
) 



tensor([[[ 0.0489,  0.0626,  0.0695,  ..., -0.0066, -0.0000, -0.0140],
         [ 0.0520,  0.0506,  0.0638,  ..., -0.0022, -0.0000, -0.0243],
         [ 0.0455,  0.0482,  0.0706,  ..., -0.0121, -0.0302, -0.0320],
         ...,
         [ 0.0393,  0.0534,  0.0690,  ..., -0.0027, -0.0427, -0.0256],
         [ 0.0459,  0.0553,  0.0588,  ..., -0.0135, -0.0292, -0.0136],
         [ 0.0455,  0.0482,  0.0706,  ..., -0.0000, -0.0000, -0.0320]]],
       grad_fn=<MulBackward0>)

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>5 |</span></b> <b>NORMALISATION LAYERS</b></div>

Transformer architecture also uses **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">layer normalisation</mark>** & **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">skip connections</mark>**
- **layer normalisation** - normalises each input in the batch (each token's feature vector) to have zero mean & unit variance
- **skip connections** - pass a tensor to the next layer of the model without processing it & add it to the processed tensor
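
Both ideas fit in a few lines. The sketch below uses hypothetical toy sizes, reproduces `nn.LayerNorm` by hand (statistics are taken over the feature axis of each token, not over the batch) and then applies a skip connection around a stand-in sublayer.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 6                                  # hypothetical toy size
x = torch.randn(2, 3, hidden_size) * 5 + 2       # [batch_size, seq_len, hidden_size]

layer_norm = nn.LayerNorm(hidden_size)           # learnable scale is 1 and shift is 0 at init

# Manual version: mean and variance are computed over the last (feature) axis only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(layer_norm(x), manual, atol=1e-5))
print(manual.mean(dim=-1))       # approximately 0 for every token

# Skip connection: the sublayer output is added to the untouched input
sublayer = nn.Linear(hidden_size, hidden_size)   # stand-in for attention / feed-forward
y = x + sublayer(x)
print(y.shape)
```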

There are two main approaches when it comes to normalisation layer placement in the encoder & decoder:
- **post layer** normalisation (used in the Transformer paper; layer normalisation is placed between the skip connections)
- **pre layer** normalisation 

<br>

| `post-layer` normalisation | `pre-layer` normalisation |
| - | - |
| Arrangement used in the Transformer paper; layer normalisation is placed between the skip connections | Arrangement most often found in the literature; layer normalisation is placed within the span of the skip connection |
| Tricky to train from scratch, as the gradients can diverge | Tends to be much more stable during training |
| Used with LR warm-up (learning rate gradually increased from a small value to some maximum value during training) | Does not usually require any learning rate warm-up |
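
The difference between the two arrangements comes down to where the normalisation sits relative to the skip connection. A minimal sketch, with hypothetical class names and a single linear layer standing in for the attention or feed-forward sublayer:

```python
import torch
from torch import nn

class PostLNBlock(nn.Module):
    """Post-layer normalisation: normalise after adding the skip connection."""
    def __init__(self, hidden_size):
        super().__init__()
        self.sublayer = nn.Linear(hidden_size, hidden_size)   # stand-in sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-layer normalisation: normalise inside the span of the skip connection."""
    def __init__(self, hidden_size):
        super().__init__()
        self.sublayer = nn.Linear(hidden_size, hidden_size)   # stand-in sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

torch.manual_seed(0)
x = torch.randn(1, 10, 16)          # hypothetical toy input
post_out = PostLNBlock(16)(x)
pre_out = PreLNBlock(16)(x)
print(post_out.shape, pre_out.shape)
```

In the pre-layer variant the input `x` reaches the output without passing through any normalisation, which is the arrangement used by the encoder layer in this notebook.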

In [22]:
class encoderLayer(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.norm2 = nn.LayerNorm(config.hidden_size)
        self.attention = multiHeadAttention(config)    # multihead attention layer 
        self.feed_forward = feedForward(config)        # feed forward layer

    def forward(self, x):
        
        # Apply layer norm. to hidden state, copy input into query, key, value
        # Apply attention with a skip connection
        x = x + self.attention(self.norm1(x))
        
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.norm2(x))
        
        return x

In [23]:
# Transformer layer output
encoder_layer = encoderLayer(config) # initialise encoder layer
print(encoder_layer,'\n')

print('input',inputs_embeds.shape) 
print('output',encoder_layer(inputs_embeds).size())

encoderLayer(
  (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attention): multiHeadAttention(
    (heads): ModuleList(
      (0-11): 12 x Attention(
        (q): Linear(in_features=768, out_features=64, bias=True)
        (k): Linear(in_features=768, out_features=64, bias=True)
        (v): Linear(in_features=768, out_features=64, bias=True)
      )
    )
    (out_linear): Linear(in_features=768, out_features=768, bias=True)
  )
  (feed_forward): feedForward(
    (linear1): Linear(in_features=768, out_features=3072, bias=True)
    (linear2): Linear(in_features=3072, out_features=768, bias=True)
    (gelu): GELU(approximate='none')
    (dropout): Dropout(p=0.1, inplace=False)
  )
) 

input torch.Size([1, 10, 768])
output torch.Size([1, 10, 768])


There is an issue with the way we set up the **encoder layers** (which use just the embedding inputs)
- they are totally **invariant to the position of the tokens**: shuffling the input tokens simply shuffles the outputs in the same way
- The multi-head attention layer is effectively a weighted sum, so the **information on token position is lost**
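
This can be verified with the scaled dot-product attention function from earlier (repeated here so the snippet runs on its own). The sketch uses hypothetical toy sizes and an arbitrary permutation: shuffling the input tokens and then applying attention gives the same result as applying attention and then shuffling the outputs.

```python
import torch
import torch.nn.functional as F
from math import sqrt

def sdp_attention(query, key, value):
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(query.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

torch.manual_seed(0)
x = torch.randn(1, 6, 16)                     # [batch_size, seq_len, hidden_dim]
perm = torch.tensor([3, 0, 5, 1, 4, 2])       # an arbitrary reordering of the 6 tokens

out = sdp_attention(x, x, x)
x_shuffled = x[:, perm]
out_shuffled = sdp_attention(x_shuffled, x_shuffled, x_shuffled)

# Attention has no notion of order: the outputs are just reordered the same way
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))
```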


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>6 |</span></b> <b>POSITIONAL EMBEDDINGS</b></div>

Let's incorporate positional information using **positional embeddings**


**positional embeddings** are based on a simple idea:
  - Modify the **token embeddings** with a **position-dependent pattern** of values arranged in a vector
  
  
If the pattern is characteristic for each position
- the **attention heads** and **feed-forward layers** in each stack can learn to incorporate positional information into their transformations



- There are several ways to achieve this, and one of the most popular approaches is to use a `learnable pattern`
- This works exactly the same way as the token embeddings, but using the **position index** instead of the **token identifier** (from vocabulary dictionary) as input
- An efficient way of encoding the positions of tokens is learned during pretraining
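
For comparison, one of the other ways is a fixed (non-learned) pattern. The sketch below shows the sinusoidal encoding from the original Transformer paper in plain Python; it is not what this notebook uses, and the function name is hypothetical.

```python
from math import sin, cos

def sinusoidal_positions(seq_len, dim):
    # dim is assumed to be even; even columns use sine, odd columns use cosine
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([sin(angle), cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=4, dim=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Each position gets a distinct vector that would be added to the token embedding, in the same way as the learned position embeddings described above.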

Creating Custom `Embedding` class

Let’s create a custom Embeddings module (**token embeddings + positional embeddings**)
 - That combines a token embedding layer that projects the input_ids to a dense hidden state 
 - Together with the positional embedding that does the same for position_ids
 - The resulting embedding is simply the **sum of both embeddings**

In [24]:
'''

Token + Position Embedding 


'''

class tpEmbedding(nn.Module):
    
    def __init__(self, config):        
        super().__init__()
        
        # token embedding layer
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        
        # positional embedding layer
        # config.max_position_embeddings -> max number of positions in text 512 (tokens)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        
        self.norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()  # note: defaults to p=0.5 (BERT's hidden_dropout_prob is 0.1)

    def forward(self, input_ids):
        
        # Create position IDs for input sequence
        seq_length = input_ids.size(1) # number of tokens
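        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)[None,:] # positions 0..seq_length-1
        
        # e.g. for our 10-token example:
        # input_ids    -> tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931,  1997]])
        # position_ids -> tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])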
        
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        
        # Add normalisation & dropout layers
        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [25]:
# Token and Position Embeddings
embedding_layer = tpEmbedding(config)
embedding_layer(inputs.input_ids)

tensor([[[ 0.0000, -0.0000, -0.0000,  ...,  1.3367, -1.0344,  0.6113],
         [ 0.0000,  0.9846,  1.7784,  ..., -1.6795, -1.3171, -0.0000],
         [ 4.6805, -0.0000,  0.7242,  ...,  0.8793,  0.1183,  0.0000],
         ...,
         [ 2.4357,  1.7012, -0.0000,  ..., -0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.9958,  0.0000,  ..., -0.3156,  1.0391, -1.5924],
         [ 1.5081, -2.0370,  1.8315,  ..., -0.4919,  0.0000, -2.9875]]],
       grad_fn=<MulBackward0>)

In [26]:
inputs.input_ids

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931,  1997]])
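
Note that id 1997 ("of") appears twice in the input, at positions 2 and 9. A minimal sketch with hypothetical toy sizes (and without the normalisation and dropout steps) shows what the positional term changes: the token embeddings of a repeated id are identical, but once the position embeddings are added the two occurrences differ.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, max_positions, hidden_size = 50, 16, 8    # hypothetical toy sizes
token_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_positions, hidden_size)

input_ids = torch.tensor([[4, 9, 4]])                  # id 4 appears at positions 0 and 2
position_ids = torch.arange(input_ids.size(1))[None, :]

tok = token_embeddings(input_ids)
combined = tok + position_embeddings(position_ids)

print(torch.equal(tok[0, 0], tok[0, 2]))               # token embeddings alone: identical
print(torch.equal(combined[0, 0], combined[0, 2]))     # with positions added: different
```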

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>7 |</span></b> <b>PUTTING IT ALL TOGETHER</b></div>

- We construct the full Transformer **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">encoder</mark>** by combining the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Embedding</mark>** and **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Encoder</mark>** layers
- `tpEmbedding` provides both the **token** & **positional** embeddings
- We stack `config.num_hidden_layers` copies of `encoderLayer` in an `nn.ModuleList`; each copy contains the **attention** & **feed-forward** layers

In [27]:
# full transformer encoder combining the `Embedding` with the `Encoder` layers

class TransformerEncoder(nn.Module):
    
    def __init__(self, config):       
        super().__init__()
        
        # token & positional embedding layer
        self.embeddings = tpEmbedding(config)
        
        # stack of encoder layers (attention & feed-forward)
        self.layers = nn.ModuleList([encoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        
        # embeddings layer output
        x = self.embeddings(x)
        
        # pass the hidden state through each encoder layer in turn
        for layer in self.layers:
            x = layer(x)
        return x

In [28]:
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
import faiss

from transformers import AutoTokenizer, AutoModel

config.num_labels = 5
encoder_classifier = TransformerClassifier(config)

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data = list(pd.read_csv("SEC-CompanyTicker.csv",index_col=0).companyName[:10])

# Tokenise input (text)
bert_embeddings = tokenizer(data, 
                   return_tensors="pt",      # return PyTorch tensors
                   padding=True,             # pad to the longest name in the batch
                   add_special_tokens=False) # don't add special tokens



count = 0
# Run the forward pass 1000 times, printing progress every 100 iterations
for _ in range(1000):
    if count % 100 == 0: 
        print(count)
    # Use the transformer to generate output
    output = encoder_classifier(bert_embeddings.input_ids).cpu().detach().numpy()
    count += 1
output

NameError: name 'TransformerClassifier' is not defined

In [231]:
# Tokenise input (text)
q = tokenizer(["Apple"], 
                   return_tensors="pt",      # return PyTorch tensors
                   padding=True,             # pad to the longest text in the batch
                   add_special_tokens=False) # don't add special tokens


xq = encoder_classifier(q.input_ids).cpu().detach().numpy()
xq

array([[-1.125596  , -0.36863506, -2.3599467 ,  0.84576607,  1.0254472 ]],
      dtype=float32)

In [232]:

    
class Faiss:
    def __init__(self):
        pass

    def faiss(self,xb):
        d = xb[0].size
        M = 32
        index = faiss.IndexHNSWFlat(d, M)            
        index.hnsw.efConstruction = 40         # candidate list size while building the graph
        index.hnsw.efSearch = 16               # candidate list size at query time
        index.add(xb)
        return index
    
    def query(self,index,xq,k=3):
        D, I = index.search(xq, k)   
        return D, I

xb = output
index = Faiss().faiss(xb)
D,I = Faiss().query(index,xq)
I

array([[5, 9, 8]])

In [161]:
# Now, check if the parameters are frozen
for name, param in encoder.named_parameters():
    print(f'{name}: {param.requires_grad}')

# Use the encoder for inference
encoder_output = encoder(inputs.input_ids)

embeddings.token_embeddings.weight: False
embeddings.position_embeddings.weight: False
embeddings.norm.weight: False
embeddings.norm.bias: False
layers.0.norm1.weight: False
layers.0.norm1.bias: False
layers.0.norm2.weight: False
layers.0.norm2.bias: False
layers.0.attention.heads.0.q.weight: False
layers.0.attention.heads.0.q.bias: False
layers.0.attention.heads.0.k.weight: False
layers.0.attention.heads.0.k.bias: False
layers.0.attention.heads.0.v.weight: False
layers.0.attention.heads.0.v.bias: False
layers.0.attention.heads.1.q.weight: False
layers.0.attention.heads.1.q.bias: False
layers.0.attention.heads.1.k.weight: False
layers.0.attention.heads.1.k.bias: False
layers.0.attention.heads.1.v.weight: False
layers.0.attention.heads.1.v.bias: False
layers.0.attention.heads.2.q.weight: False
layers.0.attention.heads.2.q.bias: False
layers.0.attention.heads.2.k.weight: False
layers.0.attention.heads.2.k.bias: False
layers.0.attention.heads.2.v.weight: False
layers.0.attention.heads.2.v
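
The cell that freezes the parameters is not shown in this section. As a minimal sketch, assuming the usual PyTorch approach of switching off `requires_grad`, it could look like the following. `toy_encoder` is a hypothetical stand-in for the `TransformerEncoder` instance, not the notebook's actual model.

```python
import torch.nn as nn

# Hypothetical stand-in; in the notebook this would be the TransformerEncoder instance
toy_encoder = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))

# Disable gradient tracking for every parameter
for param in toy_encoder.parameters():
    param.requires_grad_(False)

# Switch off dropout and other train-only behaviour before inference
toy_encoder.eval()

# Same check as in the cell above, collapsed into a single flag
frozen = all(not p.requires_grad for p in toy_encoder.parameters())
print(frozen)
```

Frozen parameters are skipped by the optimiser, which is useful when only a task head on top of the encoder should be trained.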

In [145]:


# hidden state for each token in a batch
encoder_output.size()

torch.Size([1, 1, 768])

In [148]:
inputs.input_ids

tensor([[6423]])

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#F1A424"><b><span style='color:#FFFFFF'>8 |</span></b> <b>CLASSIFICATION HEAD</b></div>

Quite often, transformers are divided into:
- Task independent body (`TransformerEncoder`)
- Task dependent head (`TransformerClassifier`)

Select one of the token outputs:
- The hidden state of the first token, the **[CLS] token**, is often used for the prediction
- We can attach a `dropout` and a `linear` layer to it to make a classification prediction

In [223]:
class TransformerClassifier(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        
        # Transformer Encoder
        self.encoder = TransformerEncoder(config)
        
        # Classification Head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of the first ([CLS]) token
        x = self.dropout(x)
        x = self.classifier(x) # hidden_size -> num_labels
        return x

- For each sample in the batch, the output holds the **unnormalized logits** for each class, the same output format as a BERT-style sequence classifier
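
To read the logits as class probabilities, a softmax over the last dimension is the usual step. The sketch below uses made-up logits for two samples and five classes rather than real model output. Since the classifier here is untrained, the resulting probabilities would carry no meaning beyond illustrating the shapes.

```python
import torch
import torch.nn.functional as F

# Made-up logits: batch of 2 samples, 5 classes
logits = torch.tensor([[-1.7, -1.0, -0.3, -0.8, 1.6],
                       [ 0.3,  2.3,  1.1, -1.4, 0.5]])

# Normalise each row into a probability distribution over the classes
probs = F.softmax(logits, dim=-1)

# Index of the most probable class for each sample
predicted = probs.argmax(dim=-1)

print(probs.shape, predicted)
```

During training, `nn.CrossEntropyLoss` expects the raw logits, so the softmax is only needed when inspecting predictions.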

In [187]:
config.num_labels = 100
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

torch.Size([1, 768])


tensor([[-0.1077, -1.1667, -1.6423, -0.6971, -1.2382, -0.1741, -1.3646, -1.4697,
         -0.2726,  2.0157,  1.1196, -0.8590,  0.5565,  1.2117, -0.5707, -0.7668,
         -0.7323, -2.0161,  0.0433,  0.3031, -1.2364,  0.1676, -0.1053, -0.3740,
         -0.2358,  0.6749,  1.2393,  1.2737, -1.4780,  0.2356,  1.2692,  0.9412,
          0.1535,  0.3772, -0.4796,  0.2924,  0.5760,  1.2281, -0.7525, -1.7116,
         -0.4864,  0.9515, -1.4321, -0.1793,  0.4792, -0.5470,  0.3365,  0.8325,
          1.2968,  0.1978, -1.2673,  1.9813, -1.1665, -1.8995,  0.4980,  1.6997,
         -2.3371,  0.0271, -1.0663, -1.0742,  0.2720,  0.9621,  0.0054, -0.9917,
          0.5763, -1.1573, -0.3858, -0.3850,  0.9824, -0.3361, -0.5372,  0.9757,
          0.9338, -0.6942,  0.8172,  0.5721, -1.2012, -0.1277, -0.0718, -0.5320,
         -1.0751,  0.3630,  0.0348, -0.1394, -0.9022,  0.1989, -1.3688,  0.3751,
         -0.5361,  3.0035,  1.3790, -0.1013,  0.8584, -1.4457,  0.1728,  0.8734,
          0.5113,  0.4873, -

In [31]:
inputs.input_ids

tensor([[ 1996, 11286,  1997,  1037,  5340,  3392,  2003,  2200,  5931]])

In [32]:
output

tensor([[-1.7248, -1.0299, -0.2984, -0.8205,  1.5645]],
       grad_fn=<AddmmBackward0>)

In [46]:
# Tokenise input (text)
inputs = tokenizer(["julia is nice","julia is kind","julia is mean","julia is happy"], 
                   return_tensors="pt",      # return PyTorch tensors
                   add_special_tokens=False) # don't add special tokens

print(inputs.input_ids)
inputs.input_ids.size()

tensor([[6423, 2003, 3835],
        [6423, 2003, 2785],
        [6423, 2003, 2812],
        [6423, 2003, 3407]])


torch.Size([4, 3])

In [213]:
config.num_labels = 5
encoder_classifier = TransformerClassifier(config)
output = encoder_classifier(inputs.input_ids)
output

torch.Size([200, 768])


tensor([[-2.9110e-01,  5.7940e-01,  3.9706e-01, -1.8225e-01,  6.0488e-01],
        [-6.3441e-01, -1.2129e+00,  2.4899e-01, -8.7813e-01, -2.3223e+00],
        [ 9.6574e-01, -2.9699e-01,  6.2808e-01, -5.4478e-01, -4.3743e-01],
        [ 3.2243e-01, -8.6369e-01, -4.8803e-01,  6.2011e-01, -2.7719e+00],
        [ 1.7241e-01, -2.4628e-01,  2.8055e-01, -1.6542e-02,  2.9267e-01],
        [ 5.2029e-01, -1.7808e-01, -1.1362e+00, -2.1143e+00, -1.1951e+00],
        [ 9.5677e-02, -2.1166e+00, -8.8535e-01, -1.2726e+00, -1.1311e+00],
        [ 2.0610e-01, -9.4082e-02,  9.8888e-01, -1.0614e-01,  1.4996e-01],
        [-1.2756e+00, -1.2334e+00, -1.0847e-01, -8.5860e-01, -5.6302e-01],
        [ 1.1313e+00,  1.0202e+00,  1.4249e+00, -1.8494e-01, -1.2267e+00],
        [ 1.0440e+00,  1.4323e+00,  3.5243e-01,  2.9605e-02, -1.8420e+00],
        [-4.1269e-01, -3.1108e-01, -8.8196e-02, -7.1487e-01, -1.9682e-01],
        [-1.1787e+00, -8.7122e-01, -8.6970e-01, -6.9106e-01, -1.1400e+00],
        [ 2.5004e-01,  3.

In [36]:
import numpy as np
output.detach().numpy()

array([[ 0.2858461 ,  2.291011  ,  1.1027378 , -1.3654897 ,  0.50570506],
       [-0.43984768,  1.707121  , -0.33323628, -1.3266757 ,  0.0403869 ],
       [ 1.2801098 , -0.38431922, -0.89034826,  0.26452443,  0.01639184],
       [ 0.787185  ,  0.47547337, -0.2487884 ,  0.93552494,  0.02594745]],
      dtype=float32)

In [96]:
inputs = tokenizer(x, 
                   return_tensors="pt",      # return PyTorch tensors
                   add_special_tokens=False, # don't add special tokens
                   padding=True)             # pad to the longest text in the batch
                    
output = encoder_classifier(inputs.input_ids)
output.detach().numpy()


torch.Size([3, 768])


array([[ 0.30067572, -0.19365877,  1.5731525 ,  0.34049684,  0.61987585],
       [ 1.599041  ,  0.65223604, -0.30029094, -0.67572856,  1.787407  ],
       [ 1.168354  ,  0.27019048,  0.68573517, -0.11251289,  0.5283011 ]],
      dtype=float32)

In [129]:
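def encoder(x):
    # Tokenise a list of strings and return the classifier output as a NumPy array
    inputs = tokenizer(x, 
                   return_tensors="pt",      # return PyTorch tensors
                   padding=True,             # pad so texts of different lengths can be batched
                   add_special_tokens=False) # don't add special tokens
    output = encoder_classifier(inputs.input_ids)
    output = output.detach().numpy()
    return output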
encoder(["julia"])


torch.Size([1, 768])


array([[ 1.1925466 , -2.004026  ,  0.37853348, -0.53329134,  3.518763  ]],
      dtype=float32)

In [130]:
encoder(["julia"])

torch.Size([1, 768])


array([[-1.4516523 , -0.29492986, -1.4100991 , -0.5428195 ,  1.6403229 ]],
      dtype=float32)

In [138]:
for i in range(100):
    print(encoder(["julia"]))

torch.Size([1, 768])
[[ 1.8389405  -1.0258632  -1.9472094   0.24997222  2.9709525 ]]
torch.Size([1, 768])
[[-1.2937596  -0.55749947 -1.5676594  -0.10609275  2.590736  ]]
torch.Size([1, 768])
[[ 0.7001611  -0.7492791  -1.4172636   0.41050863  1.5200357 ]]
torch.Size([1, 768])
[[ 0.38916868 -2.6472218  -1.4488543   0.6265552   2.8951118 ]]
torch.Size([1, 768])
[[-0.47231916 -0.53634554 -0.5050098  -0.97858167  2.0714037 ]]
torch.Size([1, 768])
[[ 1.1571622  -0.14157459 -0.44913638  0.04301509  2.4572744 ]]
torch.Size([1, 768])
[[ 1.1925933  -2.0172298  -0.03272817  0.05695865  4.1540723 ]]
torch.Size([1, 768])
[[ 0.26143798 -0.39048213 -0.5843675  -0.12257874  1.6031576 ]]
torch.Size([1, 768])
[[-0.00853661 -1.8508751  -0.90174556  0.15717727  0.42443526]]
torch.Size([1, 768])
[[-0.10971469 -0.96365035 -1.2132721   0.33689216  0.30131382]]
torch.Size([1, 768])
[[ 0.90288585 -1.5594695  -0.31239155  0.93199253  0.9063376 ]]
torch.Size([1, 768])
[[-0.575763   -1.474622   -0.01893973 -1.609
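
The same input gives a different output on every call above. The most likely cause is that a freshly constructed `nn.Module` is in training mode, so the dropout layers sample a new mask on each forward pass. The sketch below illustrates this with a hypothetical toy head (`toy_head` is not part of the notebook) and shows that switching to evaluation mode makes repeated calls agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy head: dropout followed by a linear layer
toy_head = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(16, 5))
x = torch.randn(1, 16)

# Training mode (the default): a new dropout mask is sampled on each call
toy_head.train()
train_a, train_b = toy_head(x), toy_head(x)

# Evaluation mode: dropout is a no-op, so repeated calls give the same result
toy_head.eval()
with torch.no_grad():
    eval_a, eval_b = toy_head(x), toy_head(x)

print(torch.equal(eval_a, eval_b))
```

Applied to the notebook, this would mean calling `encoder_classifier.eval()` once before using the outputs as embeddings.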

In [98]:
import faiss
class Embeddings:
    # CLS pooling: use the last hidden state of the first ([CLS]) token as the sentence embedding
    def cls_pooling(self, model_output):
        return model_output.last_hidden_state[:, 0]

    # Tokenise the input text and return its CLS-pooled embedding
    def get_embeddings(self, text_list):
        encoded_input = tokenizer(
            text_list, padding=True, truncation=True, return_tensors="pt"
        )
        encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
        model_output = model(**encoded_input)
        return self.cls_pooling(model_output).cpu().detach().numpy()
    
    
    # Convert a dataset of texts into an array of embeddings for FAISS
    def makeEmbeddings(self,dataset):
        embeddings = []
        for data in dataset:
            embeddings.append(self.get_embeddings(data)[0])
        return np.array(embeddings)
    
    def getQueryEmbedding(self, query):
        return self.get_embeddings([query])
    
class Faiss:
    def __init__(self):
        pass

    def faiss(self,xb):
        d = xb[0].size
        M = 32
        index = faiss.IndexHNSWFlat(d, M)            
        index.hnsw.efConstruction = 40         # candidate list size while building the graph
        index.hnsw.efSearch = 16               # candidate list size at query time
        index.add(xb)
        return index
    
    def query(self,index,xq,k=3):
        D, I = index.search(xq, k)   
        return D, I

In [99]:
xb = encoder(["julia is nice", "isabelle is planning", "julia plans too"])
xq = encoder(["isabelle plans it"])
index = Faiss().faiss(xb)
D,I = Faiss().query(index,xq)
I

torch.Size([3, 768])
torch.Size([1, 768])


array([[1, 0, 2]])
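
`IndexHNSWFlat` performs an approximate search and, by default, ranks neighbours by squared L2 distance. As a sanity check on what `D` and `I` contain, here is a brute-force NumPy sketch of the exact version of the same search. The function name and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def brute_force_search(xb, xq, k=3):
    """Exact k-nearest-neighbour search using squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_database)
    diffs = xq[:, None, :] - xb[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k smallest distances per query, nearest first
    I = np.argsort(dists, axis=1)[:, :k]
    # Distances that correspond to those indices
    D = np.take_along_axis(dists, I, axis=1)
    return D, I

# Toy database of three 2-d vectors and a single query
xb_toy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype=np.float32)
xq_toy = np.array([[0.9, 0.1]], dtype=np.float32)

D_toy, I_toy = brute_force_search(xb_toy, xq_toy, k=2)
print(I_toy)
```

On a database as small as the ones used here, the exact and approximate results would normally coincide. The HNSW index pays off only once the database is large.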

In [100]:
import pandas as pd
df = pd.read_csv("SEC-CompanyTicker.csv", index_col=0)

In [233]:
x = list(df.head(20).companyName)
q = ["Apple Inc."]
# `encoder` accepts a single list of strings, so encode the corpus and the query separately
xb = encoder(x)
xq = encoder(q)
index = Faiss().faiss(xb)
D,I = Faiss().query(index,xq)   # search with the query vector, not the corpus
I

In [104]:
x

['Apple Inc.',
 'Microsoft Corp',
 'Alphabet Inc.',
 'Amazon Com Inc',
 'Nvidia Corp',
 'Tesla, Inc.',
 'Berkshire Hathaway Inc',
 'Meta Platforms, Inc.',
 'Eli Lilly & Co',
 'Visa Inc.',
 'Taiwan Semiconductor Manufacturing Co Ltd',
 'Exxon Mobil Corp',
 'Unitedhealth Group Inc',
 'Walmart Inc.',
 'Novo Nordisk A S',
 'Jpmorgan Chase & Co',
 'Spdr S&P 500 Etf Trust',
 'Johnson & Johnson',
 'Mastercard Inc',
 'Lvmh Moet Hennessy Louis Vuitton']

In [94]:
x = "am"
encoder(x)

torch.Size([1, 768])


array([[-0.6796455 , -0.26035303,  0.939491  , -0.8765463 ,  1.2847595 ]],
      dtype=float32)