# Self attention

So let's try to recreate self attention, cuz why not.

---

First let's take a sentence, a beginner-level one at that: "My name is Baktho" <br>
Here let L be the length of the input = 4 (one per word)
<br>
Let d_k be the length of each q and k vector, and d_v the length of each v vector
<br>
>Every single input for a transformer will have these 3 vectors:
>* q - a query vector ("What I'm looking for")
>* k - a key vector ("What I can offer / What I have")
>* v - a value vector ("What I actually offer")

In [31]:
import numpy as np
import math

L,d_k,d_v = 4,8,8
q = np.random.randn(L,d_k)
k = np.random.randn(L,d_k)
v = np.random.randn(L,d_v)

In [5]:
print("Q\n",q,"\nK\n",k,"\nV\n",v)

Q
 [[-0.91650365 -0.43938416 -0.98441261 -0.65332904  0.268756   -0.29742282
   0.0374929  -1.89440144]
 [-0.06346317 -0.1609114   1.73088298  0.30103895  0.89140338  0.75865651
   0.41346719  0.95797565]
 [-0.2178364  -1.48619427  0.02730158 -1.16561411 -0.17503116 -0.65109247
   1.42877914  2.02763922]
 [-0.76685037  0.67228886  0.55178884 -0.77246581  0.45459953  1.11097801
  -1.19093102 -0.96136145]] 
K
 [[ 0.25498692 -1.1890085  -1.47267718  1.42860365 -1.36587921  0.57772352
  -0.52634499 -1.24090347]
 [-1.66286252  0.65208971 -0.4453618  -0.30204426  0.42928219  1.2096245
   2.33289997 -0.57796015]
 [-1.67234527  0.099901   -1.83971522  0.14641315 -0.13421936 -1.01376185
  -0.49231473  1.38570551]
 [ 1.48226781 -1.10782742  1.45953526 -0.74570896  0.40941078 -0.16764686
  -0.62440186 -0.16653004]] 
V
 [[ 0.42672627  0.67092111 -0.45006872 -1.00261084  1.11385623  0.48990995
   1.0427403   0.2461847 ]
 [ 1.81759508  2.83380172 -1.50073764  0.1541898  -0.3610862   0.94638871
   0.

# Self attention

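$$\text{self attention} = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right)V$$

Here M is the mask: 0 where a word is allowed to look, -inf where it is not.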





In [9]:
mm = np.matmul(q,k.T)
mm

array([[ 2.59722752,  2.81121371,  0.82610135, -1.36937592],
       [-4.12946109,  0.85006455, -2.81502306,  2.20605231],
       [-3.39907008,  1.03159696,  2.78478409,  1.04031016],
       [-1.0703409 ,  1.01745595, -1.71176672,  0.40351338]])

In [10]:
# Why divide by sqrt(d_k)? To bring the variance of Q.K^T back down to roughly that of q and k
q.var(), k.var(), mm.var()

(0.8643864049699872, 1.1253028897510355, 4.6518239646636275)

In the above output, the variance of the matrix multiplication (about 4.65) is much higher than the variance of q and k (about 0.86 and 1.13)

In [11]:
scaled = mm/math.sqrt(d_k)
q.var(), k.var(), scaled.var()

(0.8643864049699872, 1.1253028897510355, 0.5814779955829534)

Here the variances are more or less in the same range
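
With only 4 words the variances above are noisy, so here is a rough sketch of the same effect on bigger random matrices. It assumes q and k entries are independent with unit variance (like `np.random.randn` gives); the helper name `dot_product_variances`, the seed and the sample size are made up for this example. Each score is a sum of d_k products, so its variance should grow to about d_k, and dividing by sqrt(d_k) should bring it back to about 1.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def dot_product_variances(d_k, n=2000):
    """Return (raw, scaled) variance of q.k^T scores for random unit-variance q and k."""
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    scores = q @ k.T                     # each entry is a sum of d_k products
    scaled = scores / math.sqrt(d_k)     # dividing by sqrt(d_k) divides the variance by d_k
    return scores.var(), scaled.var()

for d_k in (8, 64, 512):
    raw, scaled = dot_product_variances(d_k)
    print(f"d_k={d_k:4d}  raw variance={raw:8.2f}  scaled variance={scaled:.2f}")
```

The raw variance should track d_k while the scaled one stays close to 1, which is why the scaling matters more and more as the vectors get longer.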

In [12]:
scaled

array([[ 0.91825859,  0.99391414,  0.29207093, -0.4841475 ],
       [-1.45998497,  0.3005432 , -0.99526095,  0.77995727],
       [-1.20175275,  0.3647246 ,  0.98456986,  0.36780519],
       [-0.37842265,  0.359725  , -0.60520093,  0.14266352]])

In the above output the scores themselves now stay roughly between -1.5 and 1, so no single score dominates before the softmax

# Masking
*   This is to ensure words don't get context from words that come later in the sequence (words that haven't been generated yet).
*   Not required in the encoder, only in the decoder




In [13]:
mask = np.tril(np.ones( (L,L) )) # triangular matrix
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

Here the first word (first row) can only look at itself, the second word (second row) can only look at itself and the one before it (cuz the rest are 0 -> masked), and so on

In [14]:
mask[mask==0] = -np.inf  # np.Infinity was removed in NumPy 2.0, np.inf works everywhere
mask[mask==1] = 0

In [16]:
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

In [19]:
masked = scaled+mask
masked

array([[ 0.91825859,        -inf,        -inf,        -inf],
       [-1.45998497,  0.3005432 ,        -inf,        -inf],
       [-1.20175275,  0.3647246 ,  0.98456986,        -inf],
       [-0.37842265,  0.359725  , -0.60520093,  0.14266352]])

Why use 0 and -inf? -> Adding 0 leaves a score unchanged, and adding -inf turns it into -inf, which the softmax maps to exactly 0 (since e^-inf = 0)
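
A small sketch of that idea, under a couple of assumptions: the mask is built in one step with `np.triu` instead of the two assignments above, and the `scores` row is made up purely for illustration.

```python
import numpy as np

L = 4  # sequence length, same as in the notebook

# Build the additive mask in one step: -inf strictly above the diagonal, 0 elsewhere.
additive_mask = np.triu(np.full((L, L), -np.inf), k=1)

scores = np.array([0.5, -1.0, 2.0, 0.3])   # made-up scores for one row, illustration only
masked_row = scores + additive_mask[1]      # row 1 may see positions 0 and 1 only

# Softmax on that single row: exp(-inf) is 0, so blocked positions get zero weight.
weights = np.exp(masked_row) / np.exp(masked_row).sum()

print(additive_mask)
print(weights)   # the last two weights are exactly 0, the first two sum to 1
```

Adding the mask rather than multiplying by it matters: a score multiplied by 0 would still contribute e^0 = 1 to the softmax, whereas a score pushed to -inf contributes nothing.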

# Softmax

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

> Converts each row of scores into a probability distribution, so every row adds up to one

In [18]:
def softmax(x):
  return (np.exp(x).T/np.sum(np.exp(x), axis=-1)).T

In [20]:
attention = softmax(masked)

In [21]:
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.1467242 , 0.8532758 , 0.        , 0.        ],
       [0.06806351, 0.3260069 , 0.60592959, 0.        ],
       [0.17943625, 0.37539082, 0.14302819, 0.30214474]])
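
The `softmax` above exponentiates the raw scores directly, which is fine for these small numbers but can overflow for large ones. A common workaround is to subtract each row's maximum first, which leaves the result unchanged. The sketch below is one way to do that; `stable_softmax` is a name made up here, and it assumes every row has at least one finite entry (true for the mask above, since the diagonal is never blocked).

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    shifted = x - np.max(x, axis=-1, keepdims=True)   # largest entry per row becomes 0
    exp = np.exp(shifted)                              # exp(-inf) is 0, so masked entries vanish
    return exp / np.sum(exp, axis=-1, keepdims=True)

# The masked scores printed earlier in the notebook
masked = np.array([
    [ 0.91825859, -np.inf,     -np.inf,     -np.inf],
    [-1.45998497,  0.3005432,  -np.inf,     -np.inf],
    [-1.20175275,  0.3647246,   0.98456986, -np.inf],
    [-0.37842265,  0.359725,   -0.60520093,  0.14266352],
])

attention = stable_softmax(masked)
print(attention)
print(attention.sum(axis=-1))   # every row sums to 1
```

On these scores it should give the same attention matrix as the simpler version, just without the risk of `np.exp` blowing up on big inputs.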

In [22]:
new_v = np.matmul(attention,v)
new_v

array([[ 0.42672627,  0.67092111, -0.45006872, -1.00261084,  1.11385623,
         0.48990995,  1.0427403 ,  0.2461847 ],
       [ 1.61352097,  2.51645479, -1.34657908, -0.01554085, -0.14467645,
         0.87941223,  0.74891533, -0.01720641],
       [ 1.67595736,  0.0111591 , -0.3607314 ,  0.03534027,  0.23825479,
         0.98037904,  0.68228347, -0.04371556],
       [ 1.11578216,  1.06139643, -0.81349669, -0.75494672,  0.35841123,
         0.88184481,  0.27369333,  0.07726451]])

In [23]:
v

array([[ 0.42672627,  0.67092111, -0.45006872, -1.00261084,  1.11385623,
         0.48990995,  1.0427403 ,  0.2461847 ],
       [ 1.81759508,  2.83380172, -1.50073764,  0.1541898 , -0.3610862 ,
         0.94638871,  0.69839095, -0.06249757],
       [ 1.74007731, -1.58161123,  0.26265871,  0.08798802,  0.46236134,
         1.05376072,  0.63312742, -0.06617454],
       [ 0.35752199,  0.34235483, -0.68491185, -2.1364204 ,  0.75448204,
         0.95303359, -0.88082517,  0.21849094]])

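The first row of the Value and new Value is exactly the same, since the first word can only attend to itself (attention weight 1). The rest are different because each row is now a weighted mix of the value vectors that word is allowed to see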
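
To make the "weighted mix" concrete, the second row of new_v can be rebuilt by hand from the numbers printed above: it is just attention[1, 0] times the first value vector plus attention[1, 1] times the second. The variable names below are made up for this check, and the numbers are copied (rounded) from the outputs.

```python
import numpy as np

# Values copied from the outputs above
weights = np.array([0.1467242, 0.8532758])          # attention[1, :2]
v0 = np.array([0.42672627, 0.67092111, -0.45006872, -1.00261084,
               1.11385623, 0.48990995, 1.0427403, 0.2461847])
v1 = np.array([1.81759508, 2.83380172, -1.50073764, 0.1541898,
               -0.3610862, 0.94638871, 0.69839095, -0.06249757])

# Second word = 14.7% of the first value vector + 85.3% of the second
row1 = weights[0] * v0 + weights[1] * v1
print(row1)
```

Up to rounding, this should reproduce the second row of new_v.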

# Summing it up

In [24]:
def softmax(x):
  return (np.exp(x).T/np.sum(np.exp(x), axis=-1)).T

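def scaled_dot_product_attention(q,k,v, mask = None):
  d_k = q.shape[-1]                # length of each query/key vector
  mm = np.matmul(q,k.T)            # raw scores, shape (L, L)
  scaled = mm/math.sqrt(d_k)       # undo the variance growth from summing d_k products
  if mask is not None:
    scaled = scaled+mask           # additive mask: 0 keeps a score, -inf blocks it
  attention = softmax(scaled)      # each row becomes a probability distribution
  output = np.matmul(attention,v)  # weighted mix of the value vectors
  return output, attention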

In [30]:
values, attention = scaled_dot_product_attention(q,k,v, mask=mask)
print("Q\n",q,"\nK\n",k,"\nV\n",v)
print("Values\n",values,"\nAttention\n",attention)

Q
 [[-0.91650365 -0.43938416 -0.98441261 -0.65332904  0.268756   -0.29742282
   0.0374929  -1.89440144]
 [-0.06346317 -0.1609114   1.73088298  0.30103895  0.89140338  0.75865651
   0.41346719  0.95797565]
 [-0.2178364  -1.48619427  0.02730158 -1.16561411 -0.17503116 -0.65109247
   1.42877914  2.02763922]
 [-0.76685037  0.67228886  0.55178884 -0.77246581  0.45459953  1.11097801
  -1.19093102 -0.96136145]] 
K
 [[ 0.25498692 -1.1890085  -1.47267718  1.42860365 -1.36587921  0.57772352
  -0.52634499 -1.24090347]
 [-1.66286252  0.65208971 -0.4453618  -0.30204426  0.42928219  1.2096245
   2.33289997 -0.57796015]
 [-1.67234527  0.099901   -1.83971522  0.14641315 -0.13421936 -1.01376185
  -0.49231473  1.38570551]
 [ 1.48226781 -1.10782742  1.45953526 -0.74570896  0.40941078 -0.16764686
  -0.62440186 -0.16653004]] 
V
 [[ 0.42672627  0.67092111 -0.45006872 -1.00261084  1.11385623  0.48990995
   1.0427403   0.2461847 ]
 [ 1.81759508  2.83380172 -1.50073764  0.1541898  -0.3610862   0.94638871
   0.

This is a single, simple attention head. Transformers run several of these heads in parallel and concatenate their outputs to get multi head attention
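
A very rough sketch of that last step, with loud assumptions: the sizes (`d_model`, `num_heads`) are picked for illustration, random matrices stand in for the projections a real model would learn, and the final output projection that transformers usually apply after concatenation is left out. The helper names are made up for this example.

```python
import math
import numpy as np

def softmax(x):
    """Row-wise softmax (max subtracted first for numerical safety)."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def attention_head(q, k, v, mask=None):
    """One scaled dot-product attention head, returning only the new values."""
    scores = q @ k.T / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    return softmax(scores) @ v

rng = np.random.default_rng(0)          # fixed seed, only for reproducibility
L, d_model, num_heads = 4, 8, 2         # illustrative sizes, not from the notebook
d_head = d_model // num_heads

x = rng.standard_normal((L, d_model))             # stand-in for the word embeddings
mask = np.triu(np.full((L, L), -np.inf), k=1)     # same causal mask as before

head_outputs = []
for _ in range(num_heads):
    # Random matrices stand in for the learned q/k/v projections of a trained model.
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    head_outputs.append(attention_head(x @ w_q, x @ w_k, x @ w_v, mask))

# Put the heads side by side along the feature axis: (L, num_heads * d_head)
multi_head = np.concatenate(head_outputs, axis=-1)
print(multi_head.shape)
```

Each head gets its own projections, so each one can focus on a different kind of relationship between the words before the results are joined back together.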