# Self-Attention

To build the initial attention matrix, every word needs to look at every other word so that we can compare the affinities between them.

$$
\begin{aligned}
\text { Attention }(Q, K, V) &=\operatorname{softmax} \left(\frac{Q K^{T}} {\sqrt{d_{k}}} + M \right) V \\
\text { MultiHead }(Q, K, V) &=\text { Concat }\left(\text {head}_{1}, \ldots, \text { head }_{h}\right) W^{O} \\
\text { where head }_{i} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right)
\end{aligned}
$$

- Q (Query) is what we are looking for (a matrix of the words we want to find affinities for).
- K (Key) is what we have. It acts as a descriptor or identifier for each element in the sequence (a matrix of the words we compare the query against).
- V (Value) is the answer: the actual content we want to return.

    For example, in the sentence "The animal didn't cross the street because it was too tired", the pronoun "it" refers to the word "animal". Therefore, the word "animal" is the value.
    E.g.: "cat" is the query, "feline" is the key, and "cat" is the value.
- M represents the mask (it hides future words to avoid cheating, i.e. data leakage).
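The two formulas above can be sketched end to end in NumPy. This is only a minimal illustration, not a reference implementation: the function names (`softmax`, `attention`, `multi_head`), the number of heads `h = 2`, and the random projection matrices `W_q`, `W_k`, `W_v`, `W_o` are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Shift by the row-wise max for numerical stability; exp(-inf) evaluates to 0
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, M=None):
    # softmax(Q K^T / sqrt(d_k) + M) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if M is not None:
        scores = scores + M
    return softmax(scores) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o, M=None):
    # One attention call per head, then concatenate and project with W_o
    heads = [
        attention(Q @ wq, K @ wk, V @ wv, M)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_length, d_model, h = 4, 8, 2      # hypothetical sizes
d_head = d_model // h

X = rng.standard_normal((seq_length, d_model))

# Causal mask: 0 on and below the diagonal, -inf above it
M = np.triu(np.full((seq_length, seq_length), -np.inf), k=1)

# Random (untrained) projection matrices, one set per head
W_q = rng.standard_normal((h, d_model, d_head))
W_k = rng.standard_normal((h, d_model, d_head))
W_v = rng.standard_normal((h, d_model, d_head))
W_o = rng.standard_normal((h * d_head, d_model))

out = multi_head(X, X, X, W_q, W_k, W_v, W_o, M)
print(out.shape)
```

Passing the same matrix `X` as query, key, and value is what makes this *self*-attention. The rest of the notebook walks through the single-head case one step at a time.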

In [1]:
import numpy as np 

In [18]:
# Length of the input sequence
# The input is a sequence of 4 words
seq_length = 4

# Dimension of the embedding space
# Each word will be represented as a vector of size 8
embedding_dim = 8

Q = np.random.randn(seq_length, embedding_dim)
K = np.random.randn(seq_length, embedding_dim)
V = np.random.randn(seq_length, embedding_dim)

print("Q: ", Q)
print("\nK: ", K)
print("\nV: ", V)

Q:  [[ 0.07232786 -0.59195325 -0.1419609   0.77552694  0.44124696 -1.97112574
  -0.38431624 -1.34772469]
 [-0.75082145  0.05107283  1.08104267 -0.10403621  0.22744124  0.99602557
   1.4203686  -0.48494464]
 [ 1.1651399   1.03615217 -0.41530289 -2.17987454 -1.08842478  0.78854814
  -0.4572014  -2.04668457]
 [ 0.10619677 -0.41690603 -0.63991237  0.3152003  -0.53981453  1.12299832
   0.17310038  1.02869691]]

K:  [[-0.95490556 -0.24524066  2.4711064  -0.50403096 -0.29284632  0.16942138
   0.77620133 -0.10588287]
 [-0.4312431  -1.0853785  -0.32632753 -0.60694835 -0.05643407 -0.06701821
  -0.60725339  0.32265753]
 [-0.31492119 -1.05385367 -0.68237752 -1.78838182 -0.37308738  1.01633119
  -0.51998175 -0.69869263]
 [ 1.45830128  0.21022749 -0.20899192 -0.10520092 -0.10807048 -0.43582661
  -0.86607773 -1.02268841]]

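V:  [[ 0.24345222  0.18000543 -0.70356586 -0.58213488  0.1887324   0.18413888
  -0.72599888 -1.00914211]
 [ 0.7336737  -0.7366185  -0.79201003 -1.66652835  1.09357679  1.2101871
   1.19436486 -0.1395625 ]
 [ 0.27284825  1.40833987  0.25383996  2.32870208  1.02916032  0.74427341
  -0.87619586 -0.19458784]
 [ 1.12859828 -0.37376461 -0.46840821 -0.71206725 -2.0852268   0.92654678
  -0.85635522  1.66687045]]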

Dividing the product `Q @ K.T` by the square root of the embedding dimension reduces its variance by a factor equal to that dimension (here, 8).

In [23]:
(Q @ K.T).var(), ((Q @ K.T) / np.sqrt(embedding_dim)).var()

(5.87233951474035, 0.7340424393425435)
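The choice of `sqrt(d_k)` as the scaling factor can be checked empirically. If the components of the query and key vectors are independent with zero mean and unit variance (as with `np.random.randn`), their dot product has a variance of about `d_k`, so dividing by `sqrt(d_k)` brings it back to about 1. The sketch below assumes exactly that setting; the sample size, the seed, and the tested dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000

results = {}
for d_k in (8, 64, 256):
    # n_samples independent query/key pairs of dimension d_k
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))

    dots = np.sum(q * k, axis=-1)      # one dot product per pair
    scaled = dots / np.sqrt(d_k)       # scaled dot product

    results[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():8.2f}  var(q.k/sqrt(d_k))={scaled.var():.3f}")
```

Without the scaling, larger embedding dimensions would produce larger scores, which pushes the softmax toward near one-hot outputs.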

## Scaled dot-product attention

We will now compute the scaled dot-product attention step by step.

![Alt text](imgs/multi-head-attention.png)

### Matmul: Q @ K.T

In [95]:
matmul = Q @ K.T
matmul

array([[-1.28435957,  0.09264717, -1.7154703 ,  2.45164693],
       [ 4.68422941, -1.11985814,  0.15870145, -2.49204791],
       [-0.98007198, -0.54264375,  5.59820479,  4.49612129],
       [-1.36554585,  0.60621855,  0.81286187, -1.46525039]])

### Scale

In [96]:
scale = matmul / np.sqrt(embedding_dim)

### Masking

The mask hides the next words, as we will see below.

In the decoder, we aren't supposed to know the next word at generation time. Looking at the next words while building the context of the current word would therefore be cheating.

In [49]:
# Creating the mask
mask = np.tril(np.ones((seq_length, seq_length)))
mask[mask == 0] = -np.inf
mask[mask == 1] = 0
print(mask,  "\n")

# To get a more intuitive understanding 
# of the mask, let's fill it with words
mask_words = np.full_like(mask, fill_value='', dtype=object)
mask_words[..., 0] = "My"
mask_words[1:, 1] = "Name"
mask_words[2:, 2] = "is"
mask_words[3:, 3] = "Becaye"
print(mask_words)

[[  0. -inf -inf -inf]
 [  0.   0. -inf -inf]
 [  0.   0.   0. -inf]
 [  0.   0.   0.   0.]] 

[['My' '' '' '']
 ['My' 'Name' '' '']
 ['My' 'Name' 'is' '']
 ['My' 'Name' 'is' 'Becaye']]
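The same mask can also be built in one step with `np.triu`. The helper name `causal_mask` is made up for this sketch; it is simply an alternative to the `np.tril` construction used above.

```python
import numpy as np

def causal_mask(n):
    # Start from a matrix full of -inf and keep only the part strictly
    # above the diagonal (k=1); everything else is set to 0.
    return np.triu(np.full((n, n), -np.inf), k=1)

mask_alt = causal_mask(4)
print(mask_alt)
```

Row `i` contains `0` for positions `0..i` (visible) and `-inf` for positions after `i` (hidden).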


**Applying the mask to the attention matrix**

Adding `-inf` to a score makes the softmax ignore it, because `exp(-inf) = 0`.

In [55]:
print('\n------- Scale -------')
print(scale)

# Applying the mask will only hide the future words
print('\n------- Scale + Mask -------')
print(scale + mask)


------- Scale -------
[[-0.45408968  0.03275572 -0.60651034  0.86678808]
 [ 1.65612519 -0.39592964  0.05610943 -0.88107199]
 [-0.34650777 -0.19185354  1.97926428  1.58961893]
 [-0.48279337  0.21433062  0.28739007 -0.51804424]]

------- Scale + Mask -------
[[-0.45408968        -inf        -inf        -inf]
 [ 1.65612519 -0.39592964        -inf        -inf]
 [-0.34650777 -0.19185354  1.97926428        -inf]
 [-0.48279337  0.21433062  0.28739007 -0.51804424]]


### Softmax

$$
\begin{aligned}
\text { Softmax }(x)_{i} &=\frac{\exp \left(x_{i}\right)}{\sum_{j} \exp \left(x_{j}\right)} \\
\end{aligned}
$$

In [97]:
def softmax_fn(x):
    # Compute the exponential of each element in x
    exp_x = np.exp(x)
    
    # Sum the exponentials along the last dimension (axis=-1)
    sum_exp_x = np.sum(exp_x, axis=-1, keepdims=True)  # Keepdims to maintain shape
    
    # Compute the softmax values by dividing each element of exp_x by sum_exp_x
    softmax_x = exp_x / sum_exp_x
    
    return softmax_x
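One caveat: `np.exp` overflows for large inputs, so the direct formula can return `nan`. A common remedy is to subtract the row-wise maximum before exponentiating, which leaves the result unchanged mathematically. The function name `stable_softmax` and the test inputs below are chosen for illustration only.

```python
import numpy as np

def stable_softmax(x):
    # Subtracting the max of each row does not change the softmax,
    # but keeps every exponent <= 0 so np.exp cannot overflow
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

small = np.array([[0.0, 1.0, 2.0]])
big = small + 1000.0   # same differences between entries, much larger values

print(stable_softmax(small))
print(stable_softmax(big))
```

Both calls print the same probabilities, since softmax only depends on the differences between the entries of a row.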

Without the mask, attention is also paid to the next words. With the mask, each word attends only to itself and to the previous words.

In [98]:
# Without mask
softmax = softmax_fn(scale)
print("------- Without Mask -------")
print(softmax)


# With mask
softmax = softmax_fn(scale + mask)
print("\n------- With Mask -------")
print(softmax)

------- Without Mask -------
[[0.13826457 0.22498066 0.1187177  0.51803707]
 [0.70949574 0.09114938 0.14324246 0.05611243]
 [0.0517232  0.06037413 0.52936519 0.35853747]
 [0.16303918 0.32737769 0.35219111 0.15739202]]

------- With Mask -------
[[1.         0.         0.         0.        ]
 [0.88615508 0.11384492 0.         0.        ]
 [0.08063324 0.0941195  0.82524726 0.        ]
 [0.16303918 0.32737769 0.35219111 0.15739202]]
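To see where the masked weights come from, the second row can be recomputed by hand. The sketch below copies the second row of `scale + mask` from the output above and applies the softmax formula directly.

```python
import numpy as np

# Second row of `scale + mask`, copied from the printed output
row = np.array([1.65612519, -0.39592964, -np.inf, -np.inf])

exp_row = np.exp(row)              # the two masked entries become exp(-inf) = 0
weights = exp_row / exp_row.sum()  # normalise over the two visible words only

print(weights)
```

The two masked positions contribute nothing to the sum, so the probability mass is shared between the first two words only. This is why the masked weights of the visible words are larger than the unmasked ones.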


### Matmul

In [102]:
# `softmax` holds the masked weights from the previous cell,
# i.e. softmax_fn(scale + mask)
attention_V = softmax @ V

print("\n------- Before Attention -------")
print(V)

print("\n------- After Attention -------")
print(attention_V)

# Applying the attention has modified the values of the vectors
# to better encapsulate the context of each word.
# Notice how the first row has remained intact:
# the first word can only attend to itself.


------- Before Attention -------
[[ 0.24345222  0.18000543 -0.70356586 -0.58213488  0.1887324   0.18413888
  -0.72599888 -1.00914211]
 [ 0.7336737  -0.7366185  -0.79201003 -1.66652835  1.09357679  1.2101871
   1.19436486 -0.1395625 ]
 [ 0.27284825  1.40833987  0.25383996  2.32870208  1.02916032  0.74427341
  -0.87619586 -0.19458784]
 [ 1.12859828 -0.37376461 -0.46840821 -0.71206725 -2.0852268   0.92654678
  -0.85635522  1.66687045]]

------- After Attention -------
[[ 0.24345222  0.18000543 -0.70356586 -0.58213488  0.1887324   0.18413888
  -0.72599888 -1.00914211]
 [ 0.29926144  0.07565246 -0.71363477 -0.70558756  0.29174433  0.30094926
  -0.50737523 -0.91014489]
 [ 0.31385061  1.10741288  0.07820635  1.71796277  0.96745674  0.74295951
  -0.66920485 -0.25508903]
 [ 0.55360774  0.22537269 -0.35831875  0.06757948  0.42304647  0.83416766
  -0.17072973 -0.01639935]]
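Each output row is a weighted average of the rows of `V`, with the weights taken from the masked softmax. As a sanity check, the sketch below rebuilds the second output row from values copied out of the printed results above (the variable names are only for this example).

```python
import numpy as np

# First two rows of V, copied from the "Before Attention" output
V0 = np.array([0.24345222, 0.18000543, -0.70356586, -0.58213488,
               0.1887324, 0.18413888, -0.72599888, -1.00914211])
V1 = np.array([0.7336737, -0.7366185, -0.79201003, -1.66652835,
               1.09357679, 1.2101871, 1.19436486, -0.1395625])

# Masked attention weights of the second word (the other two are 0)
w = np.array([0.88615508, 0.11384492])

# Weighted average of the visible value vectors
second_row = w[0] * V0 + w[1] * V1
print(second_row)
```

The result matches the second row of the "After Attention" output up to rounding. The first word has a single weight of 1 on itself, which is why its vector is unchanged.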
