**Initialization**
- I place these three lines at the top of each notebook. The first two enable autoreload, so edited modules are re-imported automatically and stale code does not cause problems when the project is rerun. The third line renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Importing Libraries and Dependencies**
- I import all the libraries and dependencies required for the project in a single cell.

In [2]:
#@ INITIALIZING LIBRARIES AND DEPENDENCIES: 
import math
import numpy as np
from scipy.special import softmax

**Input Embedding**
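- In the original **Transformer** model, the input embedding sub-layer converts the input tokens to vectors of dimension 512 using learned embeddings. The similarity between two embedding vectors can be measured with cosine similarity, which divides their dot product by the product of their Euclidean (L2) norms; this is equivalent to comparing the vectors after scaling them onto the unit sphere.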
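
As a minimal sketch of the cosine similarity mentioned above, the following compares two small vectors. The vectors `emb_a` and `emb_b` and the helper `cosine_similarity` are hypothetical and only illustrate the formula; real embeddings in the original model have 512 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings, used only for illustration.
emb_a = np.array([1.0, 0.0, 1.0, 0.0])
emb_b = np.array([1.0, 1.0, 1.0, 1.0])

similarity = cosine_similarity(emb_a, emb_b)
print(similarity)
```

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.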

**Positional Encoding**
- The idea is to add a positional encoding value to the input embedding, rather than carrying additional vectors, to describe the position of each token in a sequence. 

In [3]:
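#@ IMPLEMENTATION OF POSITIONAL ENCODING: EXAMPLE FUNCTION:
d_model = 512
def positional_encoding(pos, y):
    # y: word embedding of shape (1, d_model) for the token at position pos.
    pe = np.zeros((1, d_model))                     # Positional encoding vector. 
    pc = np.zeros((1, d_model))                     # Scaled embedding plus positional encoding. 
    for i in range(0, d_model, 2):                  # i is the even dimension index, so the exponent is i / d_model. 
        pe[0][i] = math.sin(pos / (10000 ** (i / d_model)))
        pc[0][i] = (y[0][i]*math.sqrt(d_model)) + pe[0][i]
        pe[0][i+1] = math.cos(pos / (10000 ** (i / d_model)))
        pc[0][i+1] = (y[0][i+1]*math.sqrt(d_model)) + pe[0][i+1]
    return pe, pc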
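
The sinusoidal encoding can also be computed for every position at once. The sketch below is one possible vectorized version, not part of the original notebook. The function name `sinusoidal_positional_encoding` and the parameter `max_len` are assumptions for illustration, and it assumes `d_model` is even and that even dimensions use sine while odd dimensions use cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]            # Shape (max_len, 1).
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]      # Shape (1, d_model / 2).
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # Even dimensions use sine.
    pe[:, 1::2] = np.cos(angles)                             # Odd dimensions use cosine.
    return pe

pe_table = sinusoidal_positional_encoding(max_len=10, d_model=512)
print(pe_table.shape)
```

Each row of `pe_table` is the encoding for one position, so it can be added to a matrix of scaled embeddings with the same shape.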

**Architecture of Multi-Head Attention**

In [4]:
#@ STEP 1: REPRESENT THE INPUT:
print("Step 1: Input:3 inputs, d_model=4")
x = np.array([[1.0, 0.0, 1.0, 0.0],             # Input 1. 
              [0.0, 2.0, 0.0, 2.0],             # Input 2. 
              [1.0, 1.0, 1.0, 1.0]])            # Input 3. 
print(x)                                        # Inspection. 

Step 1: Input:3 inputs, d_model=4
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


In [5]:
#@ STEP 2: INITIALIZING WEIGHT MATRICES:
print("Step 2: weights 3 dimensions x d_model=4")
print("w_query")
w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])                     # Initializing query weight matrix. 
print(w_query)                                      # Inspection. 

Step 2: weights 3 dimensions x d_model=4
w_query
[[1 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 1]]


In [6]:
#@ STEP 2: INITIALIZING WEIGHT MATRICES:
print("w_key")
w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])                   # Initializing key weight matrix. 
print(w_key)                                    # Inspection. 

w_key
[[0 0 1]
 [1 1 0]
 [0 1 0]
 [1 1 0]]


In [7]:
#@ STEP 2: INITIALIZING WEIGHT MATRICES:
print("w_value")
w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])                 # Initializing value weight matrix. 
print(w_value)                                  # Inspection. 

w_value
[[0 2 0]
 [0 3 0]
 [1 0 3]
 [1 1 0]]


In [8]:
#@ STEP 3: INITIALIZING MATRIX MULTIPLICATION: 
print("Step 3: Matrix multiplication to obtain Q,K,V")
print("Query: x * w_query")
Q = np.matmul(x, w_query)                               # Initializing matrix multiplication. 
print(Q)                                                # Inspection. 

Step 3: Matrix multiplication to obtain Q,K,V
Query: x * w_query
[[1. 0. 2.]
 [2. 2. 2.]
 [2. 1. 3.]]


In [9]:
#@ STEP 3: INITIALIZING MATRIX MULTIPLICATION: 
print("Key: x * w_key")
K = np.matmul(x, w_key)                                 # Computing the key matrix. 
print(K)                                                # Inspection. 

Key: x * w_key
[[0. 1. 1.]
 [4. 4. 0.]
 [2. 3. 1.]]


In [10]:
#@ STEP 3: INITIALIZING MATRIX MULTIPLICATION: 
print("Value: x * w_value")
V = np.matmul(x, w_value)                               # Computing the value matrix. 
print(V)                                                # Inspection. 

Value: x * w_value
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]]


In [11]:
#@ STEP 4: INITIALIZING ATTENTION SCORES: 
print("Step 4: Scaled Attention Scores")
k_d = 1                                                 # Scaling factor: sqrt(d_k) = sqrt(3) ≈ 1.73, rounded down to 1 for this example. 
attention_scores = (Q @ K.transpose()) / k_d            # Getting attention scores. 
print(attention_scores)                                 # Inspection. 

Step 4: Scaled Attention Scores
[[ 2.  4.  4.]
 [ 4. 16. 12.]
 [ 4. 12. 10.]]
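
The cell above rounds the scaling factor down to 1. As a hedged sketch of the unrounded version, the following divides by the exact square root of the key dimension. The names `d_k` and `scaled_scores` are introduced here for illustration, and `Q` and `K` are copied from the outputs of Step 3.

```python
import numpy as np

# Q and K as printed in Step 3.
Q = np.array([[1.0, 0.0, 2.0],
              [2.0, 2.0, 2.0],
              [2.0, 1.0, 3.0]])
K = np.array([[0.0, 1.0, 1.0],
              [4.0, 4.0, 0.0],
              [2.0, 3.0, 1.0]])

d_k = K.shape[-1]                              # Dimension of each key vector (3 here).
scaled_scores = (Q @ K.T) / np.sqrt(d_k)       # Scaled dot-product scores.
print(scaled_scores)
```

Dividing by the exact square root shrinks every score by the same factor, which makes the subsequent softmax less peaked than with `k_d = 1`.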


In [12]:
#@ STEP 5: SCALED SOFTMAX ATTENTION SCORES: 
print("Step 5: Scaled softmax attention scores")
attention_scores[0] = softmax(attention_scores[0])      # Softmax over the scores of input 1. 
attention_scores[1] = softmax(attention_scores[1])      # Softmax over the scores of input 2. 
attention_scores[2] = softmax(attention_scores[2])      # Softmax over the scores of input 3. 
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])                              # Inspection. 

Step 5: Scaled softmax attention scores
[0.06337894 0.46831053 0.46831053]
[6.03366485e-06 9.82007865e-01 1.79861014e-02]
[2.95387223e-04 8.80536902e-01 1.19167711e-01]
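
The three per-row calls can likely be collapsed into a single call, since `scipy.special.softmax` accepts an `axis` argument. This is a sketch rather than the notebook's own cell. The names `scores` and `attention_weights` are introduced here, and the score matrix is copied from the output of Step 4.

```python
import numpy as np
from scipy.special import softmax

# Attention scores as printed in Step 4.
scores = np.array([[2.0, 4.0, 4.0],
                   [4.0, 16.0, 12.0],
                   [4.0, 12.0, 10.0]])

attention_weights = softmax(scores, axis=1)    # Normalize each row independently.
print(attention_weights)
print(attention_weights.sum(axis=1))           # Each row sums to 1.
```

Writing the result to a new array also leaves the raw scores intact, which helps when a cell is rerun.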


In [14]:
#@ STEP 6: FINAL ATTENTION REPRESENTATIONS: 
print("Step 6: attention value")
print(V[0])
print(V[1])
print(V[2])
print("Attention 1")
attention1 = attention_scores[0][0]*V[0]               # Weighting V[0] by its softmax score for input 1. 
print(attention1)                                      # Inspection. 
print("Attention 2")
attention2 = attention_scores[0][1]*V[1]               # Weighting V[1] by its softmax score for input 1. 
print(attention2)                                      # Inspection. 
print("Attention 3")
attention3 = attention_scores[0][2]*V[2]               # Weighting V[2] by its softmax score for input 1. 
print(attention3)                                      # Inspection. 

Step 6: attention value
[1. 2. 3.]
[2. 8. 0.]
[2. 6. 3.]
Attention 1
[0.06337894 0.12675788 0.19013681]
Attention 2
[0.93662106 3.74648425 0.        ]
Attention 3
[0.93662106 2.80986319 1.40493159]


In [15]:
#@ STEP 7: SUMMING THE RESULTS: 
print("Step 7: Summing the results")
attention_input1 = attention1 + attention2 + attention3     # Summing the results. 
print(attention_input1)                                     # Inspection. 

Step 7: Summing the results
[1.93662106 6.68310531 1.59506841]
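
Steps 1 to 7 compute the attention output for the first input only. As a minimal sketch, the same computation can be written in matrix form to obtain the outputs for all three inputs at once. The names `attention_weights` and `attention_output` are introduced here for illustration, and `k_d = 1` follows the rounding used in Step 4.

```python
import numpy as np
from scipy.special import softmax

x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

Q, K, V = x @ w_query, x @ w_key, x @ w_value                # Step 3.
k_d = 1                                                      # Rounded scaling factor from Step 4.
attention_weights = softmax((Q @ K.T) / k_d, axis=1)         # Steps 4 and 5.
attention_output = attention_weights @ V                     # Steps 6 and 7 for every input.
print(attention_output[0])                                   # Matches the Step 7 result for input 1.
```

Row `i` of `attention_output` is the weighted sum of the value vectors for input `i + 1`, so the first row reproduces the vector obtained in Step 7.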


In [16]:
#@ STEP 8: INITIALIZING ATTENTION REPRESENTATIONS: EXAMPLE: 
attention_head1 = np.random.random((3, 64))                 # Initialization. 
print(attention_head1)                                      # Inspection. 

[[0.17599858 0.35796186 0.09740592 0.40705524 0.12187824 0.39475167
  0.58226689 0.01675513 0.07660266 0.85460523 0.77730928 0.78501463
  0.26676054 0.60916484 0.14232877 0.15225525 0.25331885 0.1194398
  0.25211881 0.5437024  0.17673746 0.48625728 0.92390308 0.79930546
  0.77348391 0.83721065 0.5155331  0.22352432 0.03883027 0.77928937
  0.16412865 0.22877619 0.76726344 0.57586296 0.20120711 0.7193665
  0.14831954 0.03804788 0.37581549 0.39252896 0.79651159 0.264765
  0.86412066 0.93821995 0.29480446 0.98413339 0.43236798 0.79724319
  0.46304075 0.56416182 0.76577606 0.83092159 0.18197875 0.44794159
  0.18095532 0.37264965 0.47574602 0.7214265  0.64125906 0.33467985
  0.38870496 0.4799408  0.1853437  0.46524622]
 [0.50360554 0.96032658 0.49206977 0.56752657 0.89448614 0.98268664
  0.02031872 0.13868322 0.18985929 0.30104888 0.90118555 0.57541545
  0.40293367 0.11509733 0.42609426 0.03867191 0.41853406 0.98024246
  0.55611437 0.43230776 0.05788317 0.35874955 0.63015156 0.4084252
  0.38