In [1]:
import numpy as np
import math

In [2]:
'''
    L : Length of input sequence (eg: "My name is Naveen")
    d_k : dimension of key vector
'''
L,d_k=4,8

'''
Query (Q): Represents the token for which attention scores are computed.
Key (K): Represents tokens against which similarity is measured relative to the query.
Value (V): Represents the content the model should focus on based on the attention scores.
'''
Q=np.random.randn(L,d_k)
K=np.random.randn(L,d_k)
V=np.random.randn(L,d_k)

In [3]:
print('Q',Q)
print('K',K)
print('V',V)

Q [[-0.12952816  0.4968856  -1.11360572  0.64376398  0.68393544  1.71629992
  -0.46210517 -2.06693499]
 [ 1.62719989  0.12665391  1.33433868  0.60299945 -2.09292818  1.35875407
  -1.02141623 -0.69122584]
 [ 0.47803292 -0.87641853  0.69799344 -0.55551601  0.55171491 -0.43617022
  -1.82168364 -0.079966  ]
 [-1.47421489  0.56718449 -1.0868938  -0.48990292  0.64663416 -0.55543494
  -0.21241774  2.364312  ]]
K [[ 0.14817456  1.03362618 -0.32496023 -0.44147382 -0.15858248  1.39706859
   0.82240463 -0.06290707]
 [-0.78722209 -0.39043776 -0.99504845  0.6633271  -0.17409011 -0.62540127
   0.34685703  0.33057644]
 [-1.56385772 -0.17226573 -1.46686369  1.15585957  0.68717778  0.95540222
   0.50328574 -2.53532054]
 [ 0.25841468  0.58202733  0.17330716  0.84707466 -0.08903169 -1.97877716
  -1.35320251  0.32104006]]
V [[ 0.67190478 -1.85280841 -0.06877835  1.03676013 -1.22125968 -0.27264151
  -1.14697016  0.4664111 ]
 [ 0.70596376  0.61012559  0.26463287  0.4541557   0.10243701  0.43204941
   1.4055

### Formula

$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \text{Mask} \right) \mathbf{V}$

In [10]:
# Raw similarity scores Q·K^T; despite the name, they are not yet divided by sqrt(d_k)
scaled=np.matmul(Q,K.T)

In [11]:
scaled

array([[ 2.61138973, -0.59292471,  9.61208958, -2.88726473],
       [ 1.10584688, -3.32636072, -2.72847974, -0.10581839],
       [-3.00661381, -1.57872265, -3.01423804,  2.51723308],
       [-0.26465864,  2.63833476, -2.95168475,  1.43380138]])

### 1] Why do we need sqrt(d_k) in the denominator?

In [5]:
print("Variance of Q: ",Q.var())
print("Variance of K: ",K.var())
print("Variance of QK_T: ",np.matmul(Q,K.T).var())

Variance of Q:  1.194145607051731
Variance of K:  0.8649907289769572
Variance of QK_T:  10.773626616673676


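#### NOTE: The variance of QK_T (about 10.8) is far larger than that of Q or K (about 1). For unit-variance entries it grows roughly in proportion to d_k, so dividing by sqrt(d_k) brings it back to about 1 and keeps softmax from saturating.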
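
To see where the factor comes from, the sketch below samples many query/key pairs with unit-variance entries and measures the variance of their dot products for several values of `d_k`. The sample count `n`, the seed, and the helper name `dot_variance` are arbitrary choices made for this illustration; they are not part of the cells above.

```python
import numpy as np

def dot_variance(d_k, n=20000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per sampled pair
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (8, 64, 512):
    raw, scaled_var = dot_variance(d_k)
    print(f"d_k={d_k:4d}  raw variance={raw:8.2f}  scaled variance={scaled_var:.2f}")
```

Under these assumptions the raw variance should come out close to `d_k` and the scaled variance close to 1, which is the behaviour the note above relies on.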

### 2] Scaled scores

In [6]:
np.matmul(Q,K.T)/np.sqrt(d_k)

array([[ 0.92326569, -0.20963054,  3.39838686, -1.02080224],
       [ 0.39097592, -1.17604611, -0.96466326, -0.03741245],
       [-1.06299851, -0.55816275, -1.06569408,  0.88997629],
       [-0.09357096,  0.9327922 , -1.04357815,  0.50692534]])

### 3] Masking

### Masking in Transformer Architecture

In transformers, **masking is crucial for ensuring that attention mechanisms operate correctly by controlling which parts of the input sequences can be attended to during computation**. 

#### Types of Masks

1. **Padding Mask (Both Encoder and Decoder):**
   - **Purpose:** Handles variable-length sequences by masking padding tokens.
   - **Implementation:** Assigns a mask value of 0 to padding tokens and 1 to actual tokens.

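2. **Self-Attention Mask (Encoder - Optional for bidirectional models, Decoder):**
   - **Purpose:** Restricts which positions of the same sequence a token may attend to. Bidirectional encoders let every token see the whole sequence, so they usually need only the padding mask; decoders additionally hide future tokens (see the causal mask below).
   - **Implementation:** For a bidirectional encoder, the padding mask is reused for every query position. An upper triangular matrix of 1s would not hide future tokens: it lets each token attend to itself and to later tokens, which is the opposite of the causal mask.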

3. **Causal Mask (Decoder):**
   - **Purpose:** Maintains the autoregressive property during training.
   - **Implementation:** Creates a lower triangular matrix of 1s (allowing attention only to itself and previous tokens) and 0s above the diagonal.

4. **Cross-Attention Mask (Decoder):**
   - **Purpose:** Controls how decoder tokens attend to encoder outputs.
   - **Implementation:** Typically allows full attention to all encoder outputs, but can be customized based on specific requirements.

#### Combined Masking Example

Let's consider an example where we have a batch of sequences with variable lengths and we want to apply masking in a transformer model. Suppose we have two sequences:

- Sequence 1: [5, 6, 0, 0, 0]
- Sequence 2: [3, 4, 7, 0, 0]

##### Step-by-Step Masking

1. **Padding Mask Creation:**

   Create a padding mask for each sequence:

   - Sequence 1: [1, 1, 0, 0, 0]
   - Sequence 2: [1, 1, 1, 0, 0]

   Here, 1 denotes actual tokens, and 0 denotes padding tokens.

2. **Encoder Self-Attention Mask (Optional for bidirectional models):**

   A bidirectional encoder lets every token attend to every real token, so no triangular matrix is needed. The padding mask is repeated for each query row (considering maximum sequence length in batch):

   - Sequence 1:
     ```
     [1, 1, 0, 0, 0]
     [1, 1, 0, 0, 0]
     [1, 1, 0, 0, 0]
     [1, 1, 0, 0, 0]
     [1, 1, 0, 0, 0]
     ```

   - Sequence 2:
     ```
     [1, 1, 1, 0, 0]
     [1, 1, 1, 0, 0]
     [1, 1, 1, 0, 0]
     [1, 1, 1, 0, 0]
     [1, 1, 1, 0, 0]
     ```

   Here, each row is a query position and each column is a key position; the 0 columns keep every token from attending to padding.

3. **Decoder Causal Mask:**

   Create a lower triangular matrix for each sequence (considering maximum sequence length in batch):

   - Sequence 1:
     ```
     [1, 0, 0, 0, 0]
     [1, 1, 0, 0, 0]
     [1, 1, 1, 0, 0]
     [1, 1, 1, 1, 0]
     [1, 1, 1, 1, 1]
     ```

   - Sequence 2:
     ```
     [1, 0, 0, 0, 0]
     [1, 1, 0, 0, 0]
     [1, 1, 1, 0, 0]
     [1, 1, 1, 1, 0]
     [1, 1, 1, 1, 1]
     ```

   Here, the lower triangular matrix ensures that each token in the decoder attends only to itself and previous tokens in the target sequence.

4. **Cross-Attention Mask (Decoder):**

   Typically, this mask allows full attention to all encoder outputs, so it might look like this (assuming full attention):

   - Sequence 1 to Encoder: [1, 1, 1, 1, 1]
   - Sequence 2 to Encoder: [1, 1, 1, 1, 1]

   Here, each token in the decoder can attend to all tokens in the encoder outputs.
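
The padding and causal steps above can be sketched in NumPy for the two example sequences. This is a minimal illustration that assumes token id 0 is the padding id (`PAD_ID` is a name introduced here) and that a 1 in a mask means "may attend".

```python
import numpy as np

PAD_ID = 0  # assumption: id 0 marks padding, as in the example sequences
tokens = np.array([[5, 6, 0, 0, 0],
                   [3, 4, 7, 0, 0]])
batch, max_len = tokens.shape

# Padding mask: 1 for real tokens, 0 for padding -> shape (batch, max_len)
padding_mask = (tokens != PAD_ID).astype(int)

# Causal mask: lower triangular -> shape (max_len, max_len)
causal_mask = np.tril(np.ones((max_len, max_len), dtype=int))

# Combined decoder self-attention mask: a query may attend to a key only if
# the key is a real token AND is not in the future -> (batch, max_len, max_len)
decoder_mask = causal_mask[None, :, :] * padding_mask[:, None, :]

print(padding_mask)
print(decoder_mask[0])
```

For Sequence 1, the first row of `decoder_mask[0]` is `[1, 0, 0, 0, 0]` and every later row is `[1, 1, 0, 0, 0]`: the causal part hides the future and the padding part hides the last three columns. Rows that belong to padding positions are still computed; they are usually ignored downstream, for example by masking the loss.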




### 4] Causal mask (Lower Triangular mask)

In [7]:
mask=np.tril(np.ones((L,L)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [8]:
mask[mask==0]=-np.inf
mask[mask==1]=0

In [9]:
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
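
The same additive mask can also be built in one step, without overwriting a 0/1 array in place. The sketch below uses `np.triu` with `k=1` to select the entries strictly above the diagonal; the function name `causal_additive_mask` is introduced here for illustration only.

```python
import numpy as np

def causal_additive_mask(L):
    """Return an (L, L) array: 0 on and below the diagonal, -inf above it."""
    future = np.triu(np.ones((L, L), dtype=bool), k=1)  # True where key index > query index
    return np.where(future, -np.inf, 0.0)

print(causal_additive_mask(4))
```

In the two-step version above, the order of the assignments matters: setting the 1s to 0 first would leave every entry 0, and the second assignment would then turn the whole matrix into `-inf`.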

### 5] Softmax

Softmax converts a vector of scores into a probability distribution:

$\text{softmax}(\mathbf{Z})_i = \frac{e^{Z_i}}{\sum_j e^{Z_j}}$


In [12]:
scaled+mask

array([[ 2.61138973,        -inf,        -inf,        -inf],
       [ 1.10584688, -3.32636072,        -inf,        -inf],
       [-3.00661381, -1.57872265, -3.01423804,        -inf],
       [-0.26465864,  2.63833476, -2.95168475,  1.43380138]])

In [29]:
def softmax(x):
    return (np.exp(x).T/np.sum(np.exp(x),axis=-1)).T
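
The `softmax` above exponentiates the scores directly, which can overflow when a score is large (`np.exp(1000)` is `inf`). A common variant, sketched here under the name `stable_softmax` (not part of the original notebook), subtracts each row's maximum first. Mathematically the result is unchanged, because softmax is invariant to adding the same constant to every entry of a row.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax over the last axis, shifted by the row max for stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # largest entry in each row becomes 0
    e = np.exp(shifted)                              # exp(-inf) is 0, so masked entries vanish
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([[1000.0, 999.0, -np.inf],
                   [0.5, -np.inf, -np.inf]])
print(stable_softmax(scores))
```

A row made up entirely of `-inf` would still produce `nan`; a causal mask never creates such a row because the diagonal is always kept.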

### Result with mask

In [30]:
softmax(scaled+mask)

array([[1.        , 0.        , 0.        , 0.        ],
       [0.98825145, 0.01174855, 0.        , 0.        ],
       [0.16227704, 0.67667844, 0.16104451, 0.        ],
       [0.04038407, 0.73614632, 0.00274947, 0.22072013]])

Note: the mask hides future words. Every entry above the diagonal is 0, so each token attends only to itself and the tokens before it.
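
One way to check this is to change only the last token's key and value and confirm that the outputs for the earlier tokens do not move. The sketch below is self-contained and uses its own seeded random data, so its numbers differ from the cells above; the names `causal_attention`, `K2` and `V2` are illustrative, and the scores here are divided by sqrt(d_k).

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower triangular) mask."""
    L, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Keep entries on and below the diagonal, hide the rest with -inf
    scores = np.where(np.tril(np.ones((L, L), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
L, d_k = 4, 8
Q, K, V = (rng.standard_normal((L, d_k)) for _ in range(3))

# Perturb only the last token's key and value
K2, V2 = K.copy(), V.copy()
K2[-1] += 10.0
V2[-1] -= 10.0

before = causal_attention(Q, K, V)
after = causal_attention(Q, K2, V2)
print(np.allclose(before[:-1], after[:-1]))  # earlier rows never see the last token
```

Only the last row of the output may change, since it is the only query position allowed to attend to the last token.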

### Result without mask

In [31]:
softmax(scaled)

array([[9.10377374e-04, 3.69492285e-05, 9.99048948e-01, 3.72551369e-06],
       [7.51198978e-01, 8.93041583e-03, 1.62378339e-02, 2.23632772e-01],
       [3.89469512e-03, 1.62404747e-02, 3.86511398e-03, 9.75999716e-01],
       [4.03840741e-02, 7.36146321e-01, 2.74947337e-03, 2.20720131e-01]])

In [32]:
attention=softmax(scaled+mask)

### New value

In [33]:
new_V=np.matmul(attention,V)
new_V

array([[ 0.67190478, -1.85280841, -0.06877835,  1.03676013, -1.22125968,
        -0.27264151, -1.14697016,  0.4664111 ],
       [ 0.67230492, -1.82387251, -0.06486125,  1.02991537, -1.20570817,
        -0.26436242, -1.1169821 ,  0.47716346],
       [ 0.67054954,  0.29088811, -0.06869083,  0.64834487, -0.15264286,
         0.29118993,  0.4266856 ,  0.84357297],
       [ 0.21306872,  0.31205755,  0.0796928 ,  0.24113729, -0.07229919,
         0.36807758,  0.67379595,  1.06461431]])

# Combined code

In [35]:
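def softmax(x):
    # Row-wise softmax over the last axis
    return (np.exp(x).T/np.sum(np.exp(x),axis=-1)).T

def scaled_dot_product_attention(q,k,v,mask=None):
    d_k=q.shape[-1]
    # Similarity scores, divided by sqrt(d_k) to keep their variance near 1
    scaled=np.matmul(q,k.T)/np.sqrt(d_k)
    if mask is not None:
        # Additive mask: 0 keeps a position, -inf hides it
        scaled=scaled+mask
    attention=softmax(scaled)
    # Weighted sum of the value vectors
    out=np.matmul(attention,v)
    return out, attention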
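
The function above assumes 2-D inputs, because `k.T` reverses all axes of a higher-dimensional array. If a batch dimension is needed, one possible adaptation swaps only the last two axes of `k`. This is a sketch rather than the notebook's own method; the name `batched_attention` and the shapes used below are introduced here for illustration.

```python
import numpy as np

def batched_attention(q, k, v, mask=None):
    """Scaled dot-product attention for inputs shaped (..., L, d_k)."""
    d_k = q.shape[-1]
    # Swap only the last two axes so leading batch axes stay in place
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # additive mask: 0 keeps a position, -inf hides it
    weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = weights / np.sum(weights, axis=-1, keepdims=True)
    return np.matmul(weights, v), weights

rng = np.random.default_rng(0)
B, L, d_k = 2, 4, 8
q, k, v = (rng.standard_normal((B, L, d_k)) for _ in range(3))
causal = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

out, weights = batched_attention(q, k, v, causal)  # mask broadcasts over the batch axis
print(out.shape, weights.shape)
```

The `(L, L)` mask is broadcast across the batch axis, so the output has shape `(B, L, d_k)` and the attention weights have shape `(B, L, L)`.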

### At decoder with mask

In [38]:
output,attention=scaled_dot_product_attention(Q,K,V,mask)
print("At decoder with mask")
print('Attention')
print(attention)
print('output')
print(output)

At decoder with mask
Attention
[[1.         0.         0.         0.        ]
 [0.82735866 0.17264134 0.         0.        ]
 [0.27367108 0.45339455 0.27293437 0.        ]
 [0.16664837 0.46509848 0.0644493  0.30380386]]
output
[[ 0.67190478 -1.85280841 -0.06877835  1.03676013 -1.22125968 -0.27264151
  -1.14697016  0.4664111 ]
 [ 0.67778476 -1.42760418 -0.01121779  0.93617852 -0.99273491 -0.15098273
  -0.70630465  0.62441373]
 [ 0.64599065  0.07241986 -0.29982531  0.78247507 -0.32807639  0.194277
  -0.24993441  0.47098936]
 [ 0.01249149 -0.04338059 -0.13213281  0.26319459 -0.30025833  0.25574815
  -0.0978199   0.69690778]]


### At encoder: the mask is optional and is mostly not used

In [39]:
output,attention=scaled_dot_product_attention(Q,K,V)
print("At encoder without mask")
print('Attention')
print(attention)
print('output')
print(output)

At encoder without mask
Attention
[[0.07491553 0.02413022 0.89023229 0.01072195]
 [0.47214209 0.09851984 0.12170996 0.30762811]
 [0.09342608 0.15478024 0.09317458 0.65861909]
 [0.16664837 0.46509848 0.0644493  0.30380386]]
output
[[ 0.51434784  0.86056161 -1.31192635  1.03705718 -0.22521728  0.23104164
  -1.93693892 -0.85348564]
 [-0.0170473  -0.7706552  -0.33615472  0.47247852 -0.72104862  0.0304382
  -1.08907182  0.27408086]
 [-0.77966004 -0.17016258 -0.42551327 -0.14468256 -0.40437427  0.24625786
  -1.00670459  0.25495132]
 [ 0.01249149 -0.04338059 -0.13213281  0.26319459 -0.30025833  0.25574815
  -0.0978199   0.69690778]]
