# **Self-Attention Mechanism**
*`The self-attention mechanism is a key component of Transformers, enabling each element in an input sequence to attend to all other elements. This generates a weighted representation of the input, allowing the model to capture dependencies between tokens, regardless of their positions.`*

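`Before working through the self-attention mechanism, you should be familiar with the basics of embeddings.`\
**Input Representation:**\
Each token in the input sequence `(e.g., a word or subword)` is converted into a vector `(embedding)` that represents the token's meaning. At this stage the vector does not yet depend on the surrounding tokens; self-attention is the step that mixes in that context.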
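
A minimal sketch of this lookup, assuming a randomly initialized table in place of a trained embedding layer. The names `vocab_size`, `embedding_table`, and `token_ids` and their values are hypothetical and chosen only for illustration; the resulting shapes match the ones this notebook works with.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 50   # hypothetical vocabulary size
embed_dim = 16    # embedding size used throughout this notebook
seq_length = 10   # number of tokens in the example sequence

# Hypothetical embedding table: one row per vocabulary entry.
embedding_table = rng.standard_normal((vocab_size, embed_dim))

# A batch containing a single sequence of made-up token ids.
token_ids = rng.integers(0, vocab_size, size=(1, seq_length))

# Lookup by integer indexing: (1, 10) ids -> (1, 10, 16) vectors.
batched_embeddings = embedding_table[token_ids]

# Drop the batch dimension to get a (10, 16) matrix, one row per token.
embeddings = batched_embeddings.reshape(seq_length, embed_dim)

print(batched_embeddings.shape)  # (1, 10, 16)
print(embeddings.shape)          # (10, 16)
```

Each row of `embeddings` is simply the table row for that token id, so two occurrences of the same token start out with identical vectors.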

In [None]:
import numpy as np
import tensorflow as tf

### **Defining the Weight Matrices**
`Assume embeddings is already defined with shape (10, 16): 10 tokens, each a 16-dimensional vector. It initially had shape (1, 10, 16) and was reshaped to drop the batch dimension.`

**Create Queries, Keys, and Values:** each embedding is transformed into three different vectors, `Query (Q)`, `Key (K)`, and `Value (V)`, using learned weight matrices.\
**Query (Q):** Represents "what this token wants to focus on."\
**Key (K):** Represents "what each token is compared against."\
**Value (V):** Represents "the actual information this token carries."

**In general, Q, K, and V have shape** `(batch_size, seq_length, attention_dim)`


> In this example batch_size is 1 (we only have one sequence) and the batch dimension has been dropped, so Q, K, and V have shape (seq_length, attention_dim).



In [None]:
embed_dim = embeddings.shape[1] # embeddings dim
attention_dim = embed_dim // 1 # 1 head

w_q = np.random.randn(attention_dim, embed_dim)
w_k = np.random.randn(attention_dim, embed_dim)
w_v = np.random.randn(attention_dim, embed_dim)

print("---------------------------")
print(f"embed_size: {embed_dim}")
print("---------------------------")
print(f"w_q Shape : {w_q.shape}")
print("---------------------------")
print(f"w_k Shape : {w_k.shape}")
print("---------------------------")
print(f"w_v Shape : {w_v.shape}")
print("---------------------------")

---------------------------
embed_size: 16
---------------------------
w_q Shape : (16, 16)
---------------------------
w_k Shape : (16, 16)
---------------------------
w_v Shape : (16, 16)
---------------------------


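Each input embedding 𝑋 (one per token) is multiplied by the weight matrices **(w_q, w_k, w_v: `Shape = (attention_dim, embed_dim)`)** to produce its query, key, and value vectors. In a trained model these matrices are learned; here they are randomly initialized for illustration.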


> attention_dim = embed_dim // numberOfHeads



In [None]:
# Calculate "queries (Q)", "keys (K)", and "values (V)" matrices
queries = np.matmul(embeddings, w_q.T)
keys = np.matmul(embeddings, w_k.T)
values = np.matmul(embeddings, w_v.T)

print("-----------------------------")
print(f"Q Shape : {queries.shape}")
print("-----------------------------")
print(f"K Shape : {keys.shape}")
print("-----------------------------")
print(f"V Shape : {values.shape}")
print("-----------------------------")

-----------------------------
Q Shape : (10, 16)
-----------------------------
K Shape : (10, 16)
-----------------------------
V Shape : (10, 16)
-----------------------------
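
The shapes above come from the single-head case with the batch dimension dropped. As a rough sketch of how they change otherwise, the snippet below keeps the batch dimension and assumes a hypothetical head count of 4 (`num_heads` is not part of the notebook), so that `attention_dim = embed_dim // num_heads` becomes 4.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_length, embed_dim = 1, 10, 16
num_heads = 4                           # hypothetical head count
attention_dim = embed_dim // num_heads  # 16 // 4 = 4 dimensions per head

# Batched embeddings keep the leading batch dimension.
x = rng.standard_normal((batch_size, seq_length, embed_dim))

# One projection matrix, stored as (attention_dim, embed_dim) like w_q.
w_q = rng.standard_normal((attention_dim, embed_dim))

# matmul broadcasts over the batch dimension:
# (1, 10, 16) @ (16, 4) -> (1, 10, 4)
queries = np.matmul(x, w_q.T)

print(attention_dim)  # 4
print(queries.shape)  # (1, 10, 4)
```

The result has the general `(batch_size, seq_length, attention_dim)` layout; only the last dimension shrinks as the number of heads grows.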


### **Compute Attention Scores**
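
The scores in this section are dot products between queries and keys, divided by `sqrt(attention_dim)`. A small sketch of why the division helps, assuming queries and keys with independent, unit-variance entries (an idealized setting, not the notebook's actual data): the spread of the raw dot products grows with the vector size, while the scaled version stays near 1. The helper `score_std` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(dim, n_pairs=5000):
    """Empirical std of q.k for random unit-variance vectors, raw and scaled."""
    q = rng.standard_normal((n_pairs, dim))
    k = rng.standard_normal((n_pairs, dim))
    dots = np.sum(q * k, axis=1)  # one dot product per (q, k) pair
    return dots.std(), (dots / np.sqrt(dim)).std()

for dim in (16, 64, 256):
    raw, scaled = score_std(dim)
    print(f"dim={dim:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

Under these assumptions the raw standard deviation is roughly `sqrt(dim)`. Without scaling, larger dimensions would push softmax toward putting almost all weight on a single key.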

In [None]:
# Calculate attention scores
attention_scores = np.matmul(queries, keys.T)
# scale the scores
attention_scores = attention_scores / np.sqrt(attention_dim)
print(f"Attention Scores:\n {attention_scores} \n Shape : {attention_scores.shape}")

Attention Scores:
 [[ 2.70287240e-03  2.70287240e-03  1.78910538e-02  9.55023619e-03
   1.88574373e-02 -1.62622976e-02 -1.98783640e-02  4.45656019e-04
  -3.13456869e-03 -4.76539751e-03]
 [ 2.70287240e-03  2.70287240e-03  1.78910538e-02  9.55023619e-03
   1.88574373e-02 -1.62622976e-02 -1.98783640e-02  4.45656019e-04
  -3.13456869e-03 -4.76539751e-03]
 [-8.57778044e-03 -8.57778044e-03  2.13849131e-02 -2.00194034e-02
   2.57808482e-02 -1.66025541e-03 -1.47558982e-02 -6.66359595e-03
   7.32537604e-03  4.55469433e-03]
 [ 2.69862993e-02  2.69862993e-02 -3.60754410e-03 -2.43195157e-03
  -1.86257856e-03 -3.38294556e-03  5.16355997e-03 -5.76117284e-03
   2.22295662e-03 -7.59801258e-03]
 [-1.53231871e-02 -1.53231871e-02  1.66089440e-02 -1.81660526e-02
   1.90844939e-02 -5.67780452e-03 -1.52599985e-02 -1.19689781e-02
   1.54026757e-02  1.07075743e-02]
 [-5.40705112e-03 -5.40705112e-03  6.90545702e-03 -9.45911361e-03
   1.62412810e-02  3.04873908e-04 -1.08263818e-02 -6.13013830e-03
   1.12018576e

### **Compute Attention Weights**
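
The notebook relies on `tf.nn.softmax` for this step. As a rough sketch of what that call computes, here is a NumPy version applied to a few made-up scores. The function below is an illustration written for this example, not TensorFlow's implementation.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=axis, keepdims=True)

# Made-up scores for 2 query tokens over 3 key tokens.
example_scores = np.array([[0.0, 0.0, 0.0],
                           [1.0, 2.0, 3.0]])
example_weights = softmax(example_scores)

print(example_weights)
print(example_weights.sum(axis=-1))  # each row sums to 1
```

Equal scores give uniform weights, as in the first row. The weights printed in this notebook are all close to 0.1 for the same reason: the scores are close to zero, so softmax over 10 keys is nearly uniform.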

In [None]:
# Softmax over the last axis turns each row of scores into weights that sum to 1
attention_weights = tf.nn.softmax(attention_scores, axis=-1)
print(f"Attention Weights:\n {attention_weights} \n Shape : {attention_weights.shape}")

Attention Weights:
 [[0.10018203 0.10018203 0.10171522 0.10087036 0.10181357 0.09829996
  0.09794514 0.09995615 0.09959892 0.09943663]
 [0.10018203 0.10018203 0.10171522 0.10087036 0.10181357 0.09829996
  0.09794514 0.09995615 0.09959892 0.09943663]
 [0.09914789 0.09914789 0.10216358 0.09801994 0.10261368 0.09983613
  0.09853723 0.09933786 0.10073726 0.10045854]
 [0.10235127 0.10235127 0.09926737 0.09938413 0.09944074 0.09928966
  0.10014188 0.09905381 0.09984784 0.09887203]
 [0.09866501 0.09866501 0.10186643 0.09838492 0.10211892 0.09962127
  0.09867124 0.09899651 0.10174363 0.10126705]
 [0.09944609 0.09944609 0.10067809 0.09904394 0.1016224  0.10001574
  0.09890862 0.09937421 0.10111157 0.10035324]
 [0.09811947 0.09811947 0.10167408 0.09924622 0.10244858 0.09955095
  0.09969227 0.10083584 0.1009405  0.09937262]
 [0.10084396 0.10084396 0.1026202  0.09882481 0.10198718 0.09924824
  0.0978655  0.1005724  0.09958613 0.09760763]
 [0.09943131 0.09943131 0.10089736 0.10080526 0.0993579  0.0

### **Compute Weighted Sum**

In [None]:
weighted_sum = np.matmul(attention_weights, values)
print(f"Weighted Sum:\n {weighted_sum} \n Shape : {weighted_sum.shape}")

Weighted Sum:
 [[-0.07191973  0.05369578  0.01552018 -0.01301887 -0.06879538  0.05129807
   0.02357675 -0.03675518  0.10005823  0.06798076  0.06032971  0.03579166
   0.00610018 -0.02639282  0.06365587 -0.05795981]
 [-0.07191973  0.05369578  0.01552018 -0.01301887 -0.06879538  0.05129807
   0.02357675 -0.03675518  0.10005823  0.06798076  0.06032971  0.03579166
   0.00610018 -0.02639282  0.06365587 -0.05795981]
 [-0.0712428   0.05377908  0.01586796 -0.01391785 -0.06824096  0.05154462
   0.02331869 -0.03595441  0.09951587  0.06805199  0.06093533  0.03547583
   0.00571188 -0.02641789  0.06426359 -0.05854774]
 [-0.07198051  0.05478058  0.01604369 -0.01293537 -0.06847798  0.04998116
   0.02426112 -0.03690346  0.10076066  0.06731393  0.05979927  0.03576031
   0.00597596 -0.02498828  0.06373311 -0.05658847]
 [-0.07147503  0.05334159  0.0157559  -0.01353773 -0.06797707  0.0516849
   0.02328957 -0.03570395  0.09971888  0.06814356  0.06105552  0.03568642
   0.00592407 -0.02674121  0.06426525 -0.0

In [None]:
# Get the attention output (weighted sum of values) for the second token (index 1)
weighted_sum[1]

array([-0.07191973,  0.05369578,  0.01552018, -0.01301887, -0.06879538,
        0.05129807,  0.02357675, -0.03675518,  0.10005823,  0.06798076,
        0.06032971,  0.03579166,  0.00610018, -0.02639282,  0.06365587,
       -0.05795981])
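
The steps above can be collected into one function. This is a minimal NumPy-only sketch, assuming random stand-ins for `embeddings` and the weight matrices; the function name `self_attention` and the local `softmax` helper are introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_length, embed_dim); w_q, w_k, w_v: (attention_dim, embed_dim).
    Returns the (seq_length, attention_dim) output and the attention weights.
    """
    q = x @ w_q.T
    k = x @ w_k.T
    v = x @ w_v.T
    scores = q @ k.T / np.sqrt(w_q.shape[0])  # scale by sqrt(attention_dim)
    weights = softmax(scores, axis=-1)        # one row of weights per query
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 16))  # stand-in for `embeddings`
x[1] = x[0]                        # make the first two tokens identical
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)       # (10, 16) (10, 10)
print(np.allclose(output[0], output[1]))
```

Nothing in this computation encodes position, so two identical input rows produce identical output rows.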

# **Multi-Head Attention Mechanism**
*`In multi-head attention in Transformers, there are multiple attention heads, each with its own set of query, key, and value matrices. Each head focuses on different parts of the input data and learns different relationships between words (or tokens) in the sequence. By doing so, the model can capture different aspects of the sequence simultaneously.`*

In [None]:
# Check embeddings shape
embeddings.shape

(10, 16)

In [None]:
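h = 3 # number of attention heads

# Initialize one (attention_dim, embed_dim) weight matrix per head.
# Note: attention_dim is reused from the single-head section (16), so each head
# projects to 16 dimensions here rather than embed_dim // h.
multiHead_W_q = np.random.randn(h, attention_dim, embed_dim)
multiHead_W_k = np.random.randn(h, attention_dim, embed_dim)
multiHead_W_v = np.random.randn(h, attention_dim, embed_dim)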

In [None]:
# Calculate queries, keys and values matrices for all heads at once:
# (10, 16) @ (h, 16, 16) broadcasts to (h, 10, 16)
multiHead_queries = np.matmul(embeddings, multiHead_W_q.transpose(0, 2, 1))
multiHead_keys = np.matmul(embeddings, multiHead_W_k.transpose(0, 2, 1))
multiHead_values = np.matmul(embeddings, multiHead_W_v.transpose(0, 2, 1))

print("----------------------------------------")
print(f"Queries:\n {multiHead_queries} \n Shape : {multiHead_queries.shape}")
print("----------------------------------------")
print(f"Keys:\n {multiHead_keys} \n Shape : {multiHead_keys.shape}")
print("----------------------------------------")
print(f"Values:\n {multiHead_values} \n Shape : {multiHead_values.shape}")
print("----------------------------------------")

----------------------------------------
Queries:
 [[[ 0.18487642 -0.07525517  0.01157229  0.04350997 -0.0645356
    0.21331945  0.08393987 -0.12669617 -0.13951258  0.00223234
   -0.05053065 -0.09170585 -0.05476004 -0.06682444 -0.15111652
    0.02846103]
  [ 0.18487642 -0.07525517  0.01157229  0.04350997 -0.0645356
    0.21331945  0.08393987 -0.12669617 -0.13951258  0.00223234
   -0.05053065 -0.09170585 -0.05476004 -0.06682444 -0.15111652
    0.02846103]
  [-0.11771118  0.0400669  -0.04372065 -0.10702574  0.07184724
    0.04635327 -0.01170662 -0.07947526  0.07866789 -0.11991283
    0.05006804  0.06260486 -0.20345729 -0.06922961 -0.04266434
   -0.02847235]
  [ 0.08979838 -0.04733125  0.18040464  0.1979478  -0.18076144
    0.00212834  0.05924504  0.42180878 -0.07092024 -0.01344724
    0.02770039  0.14239369  0.08854688  0.13258597  0.0444258
   -0.11311032]
  [-0.19124372  0.16710342 -0.02560563 -0.23831425  0.11031772
   -0.03766105  0.04390058 -0.08066733  0.2568171  -0.13622504
    0.
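
The cell above stops at the per-head projections. A sketch of how the remaining steps could look is given below, assuming random stand-ins for `embeddings` and the weights. The names `head_dim` and `W_o` are hypothetical: `head_dim` mirrors the notebook's reuse of `attention_dim = 16` for every head, and `W_o` is the output projection that standard multi-head attention applies after concatenating the heads, which this notebook does not define.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_length, embed_dim, h = 10, 16, 3
head_dim = 16  # per-head size (assumption: same as the notebook's attention_dim)

x = rng.standard_normal((seq_length, embed_dim))  # stand-in for `embeddings`
W_q = rng.standard_normal((h, head_dim, embed_dim))
W_k = rng.standard_normal((h, head_dim, embed_dim))
W_v = rng.standard_normal((h, head_dim, embed_dim))

# Per-head projections: (10, 16) @ (3, 16, 16) -> (3, 10, 16)
q = np.matmul(x, W_q.transpose(0, 2, 1))
k = np.matmul(x, W_k.transpose(0, 2, 1))
v = np.matmul(x, W_v.transpose(0, 2, 1))

# Per-head scaled scores: (3, 10, 16) @ (3, 16, 10) -> (3, 10, 10)
scores = np.matmul(q, k.transpose(0, 2, 1)) / np.sqrt(head_dim)
weights = softmax(scores, axis=-1)
head_outputs = np.matmul(weights, v)  # (3, 10, 16)

# Concatenate the heads along the feature axis: (10, 3 * 16)
concat = head_outputs.transpose(1, 0, 2).reshape(seq_length, h * head_dim)

# Hypothetical output projection back to embed_dim: (10, 48) @ (48, 16)
W_o = rng.standard_normal((embed_dim, h * head_dim))
output = concat @ W_o.T

print(head_outputs.shape, concat.shape, output.shape)  # (3, 10, 16) (10, 48) (10, 16)
```

Each head computes its own attention weights independently; the heads only interact in the final projection.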

# **Cross-Attention**

**What is cross-attention, and how does it differ from self-attention?**

While self-attention focuses on relationships within a single sequence (e.g., words in a sentence), cross-attention lets the tokens of one sequence attend to the tokens of a different sequence. This is essential in tasks where two different sequences interact, like in sequence-to-sequence models (e.g., translation) or multi-modal tasks (e.g., linking text and images).

In [None]:
import numpy as np
import tensorflow as tf

# Example sequences
seq_1_embed = np.random.randn(5, 16)  # 5 tokens, 16-dimensional embeddings (Query source)
seq_2_embed = np.random.randn(10, 16)  # 10 tokens, 16-dimensional embeddings (Key/Value source)

# Dimensions
embed_dim = seq_1_embed.shape[1]  # Embedding size
attention_dim = embed_dim // 1 # Single-head attention

# Initialize weight matrices
w_q = np.random.randn(attention_dim, embed_dim)
w_k = np.random.randn(attention_dim, embed_dim)
w_v = np.random.randn(attention_dim, embed_dim)

* **The Query (Q) comes from one sequence (e.g., a target sentence).**
* **The Key (K) and Value (V) come from another sequence (e.g., a source sentence or image features).**

In [None]:
# Compute Query, Key, and Value
queries = np.matmul(seq_1_embed, w_q.T)  # Q from Sequence 1
keys = np.matmul(seq_2_embed, w_k.T)    # K from Sequence 2
values = np.matmul(seq_2_embed, w_v.T)  # V from Sequence 2

In [None]:
# Compute attention scores
attention_scores = np.matmul(queries, keys.T) / np.sqrt(attention_dim)

In [None]:
# Apply softmax
attention_weights = tf.nn.softmax(attention_scores, axis=-1)

# Compute weighted sum
weighted_sum = np.matmul(attention_weights, values)

print(f"Attention Weights Shape: {attention_weights.shape}")
print(f"Weighted Sum Shape: {weighted_sum.shape}")

Attention Weights Shape: (5, 10)
Weighted Sum Shape: (5, 16)
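
One way to see the difference from self-attention is to write a single function that takes the query source and the key/value source separately. This is a minimal NumPy-only sketch with random stand-in data; the names `attention`, `query_source`, and `context` are introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(query_source, context, w_q, w_k, w_v):
    """Scaled dot-product attention with Q from one input and K, V from another."""
    q = query_source @ w_q.T  # (n_queries, attention_dim)
    k = context @ w_k.T       # (n_keys, attention_dim)
    v = context @ w_v.T       # (n_keys, attention_dim)
    scores = q @ k.T / np.sqrt(w_q.shape[0])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_1 = rng.standard_normal((5, 16))   # query source
seq_2 = rng.standard_normal((10, 16))  # key/value source
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))

# Cross-attention: queries from seq_1, keys and values from seq_2.
cross_out, cross_w = attention(seq_1, seq_2, w_q, w_k, w_v)
# Self-attention is the special case where both inputs are the same sequence.
self_out, self_w = attention(seq_1, seq_1, w_q, w_k, w_v)

print(cross_w.shape, cross_out.shape)  # (5, 10) (5, 16)
print(self_w.shape, self_out.shape)    # (5, 5) (5, 16)
```

The output always has one row per query token; only the width of the weight matrix changes with the length of the sequence being attended to.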
