
**Load Input Tensor (Word Embeddings):**

Start with numerical representations of words (embeddings) because neural networks process numbers. This is the input data our self-attention mechanism will work on.

In [1]:
import torch

# Sample input embeddings (6 tokens, each with 3-dimensional embedding)
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts (our query)
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55]   # step
])

**1.1 Computing Attention Weights for Inputs[2]:**

**1.1.1 Attention Score:**

The dot product measures how similar two vectors are. Higher scores indicate greater similarity. We’re finding how relevant each word is to our “query” word.

In [2]:
# Select query vector (focus word)
query = inputs[2]  # "starts"

# 1.1.1 Attention Scores (dot product between query and all input vectors)
attn_scores_2 = torch.zeros(inputs.size(0))
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print("Attention Scores for 'starts':", attn_scores_2)


Attention Scores for 'starts': tensor([0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605])


The first value (0.9422) corresponds to how related the word "your" is to "starts".

The second value (1.4754) corresponds to the word "journey" and so on, down to "step".
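The loop above is equivalent to a single matrix-vector product. A minimal sketch, assuming the same `inputs` tensor as above; the names `scores_loop` and `scores_vectorized` are introduced here for illustration only:

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts (our query)
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])
query = inputs[2]

# Loop version: one dot product per token
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Vectorized version: (6, 3) @ (3,) -> (6,)
scores_vectorized = inputs @ query

print(torch.allclose(scores_loop, scores_vectorized))
```

The vectorized form avoids the Python loop and is the one that generalizes to the full score matrix later on.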

**1.1.2 Attention Weights:**

Softmax transforms the scores into probabilities (attention weights). These weights represent how much “attention” each word should receive when we create the context vector.

In [3]:
import torch.nn.functional as F

# 1.1.2 Convert scores to Attention Weights using softmax
attn_weights_2 = F.softmax(attn_scores_2, dim=0)
print("Attention Weights:", attn_weights_2)

Attention Weights: tensor([0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565])


**Explanation:**

These values represent the normalized attention scores, where each score has been transformed into a probability that reflects how much each word should contribute to the context vector for the word "starts".

The values sum to 1, as softmax ensures that we have a valid probability distribution.

**Word-wise Breakdown:**

The word "journey" (index 1) has the highest attention weight of 0.2369, meaning it's the most relevant word to "starts".

The word "starts" itself (index 2) has an attention weight of 0.2326, indicating it’s very relevant to itself (which is expected).

The word "one" (index 4) has the lowest attention weight of 0.1108, meaning it is the least relevant word to "starts". The word "your" (index 0) receives 0.1390, which is also below the weights of "journey" and "starts".
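To make the softmax step concrete, here is a hand-written version compared against PyTorch's built-in. This is a sketch: `softmax_naive` is a hypothetical helper name, and the scores are copied from the printed output above.

```python
import torch

# Attention scores for 'starts', copied from the output above
attn_scores_2 = torch.tensor([0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605])

def softmax_naive(x):
    # Subtracting the maximum does not change the result
    # but keeps exp() from overflowing on large scores.
    shifted = x - x.max()
    exp_x = torch.exp(shifted)
    return exp_x / exp_x.sum()

weights = softmax_naive(attn_scores_2)

print(weights)
print(weights.sum())                                               # total weight
print(torch.allclose(weights, torch.softmax(attn_scores_2, dim=0)))
```

Because the exponential is monotonic, the ranking of the words by weight is the same as their ranking by raw score.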

**1.1.3 Context Vector:**

The context vector is a weighted sum of the input vectors. It represents a refined version of the query, incorporating information from other relevant words.

In [4]:
# 1.1.3 Compute Context Vector (weighted sum)
context_vector_2 = torch.sum(attn_weights_2.unsqueeze(1) * inputs, dim=0)
print("Context Vector for 'starts':", context_vector_2)

Context Vector for 'starts': tensor([0.4431, 0.6496, 0.5671])


**Explanation:**

This context vector is a weighted sum of all the word embeddings in the sequence, where the weights are given by the attention weights.

Taken together, the values of this vector form a "refined" representation of the word "starts" that accounts for its similarity to the other words in the sequence.

**Interpretation:**

The context vector essentially "represents" the word "starts" in the context of the entire sentence, incorporating information about how each other word in the sentence is related to it.

Since "journey" and "starts" are highly related (as seen from the attention weights), the context vector will reflect a strong influence from these words.
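The broadcast-and-sum used for the context vector can also be written as a vector-matrix product. A minimal sketch, assuming the same `inputs` tensor; `context_broadcast` and `context_matmul` are illustrative names:

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts (our query)
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

# Attention weights for the query 'starts'
attn_weights_2 = torch.softmax(inputs @ inputs[2], dim=0)

# Form used in the notebook: scale each input row by its weight, then sum the rows
context_broadcast = torch.sum(attn_weights_2.unsqueeze(1) * inputs, dim=0)

# Equivalent form: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights_2 @ inputs

print(torch.allclose(context_broadcast, context_matmul))
```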

**1.2 Computing Attention Weights for All Inputs:**

**1.2.1 Attention Score:**

Extend the process to compute attention scores for every word against every other word in the sequence. This creates a matrix of relationships.


In [5]:
# 1.2.1 Compute attention scores matrix (dot product for each pair)
attn_scores_all = torch.matmul(inputs, inputs.T)  # shape (6, 6)
print("All Attention Scores:\n", attn_scores_all)

All Attention Scores:
 tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


This is a symmetric matrix where each element $\text{score}_{ij}$ represents the attention score between the word at position $i$ (row) and the word at position $j$ (column).

For example, the attention score between "your" (index 0) and "starts" (index 2) is 0.9422.

The diagonal elements represent the self-attention scores, i.e., how much attention each word gives to itself. For instance, the self-attention score for "starts" (index 2) is 1.4570, meaning it's highly relevant to itself.
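Both properties of the score matrix can be checked directly. A short sketch, assuming the same `inputs` tensor; it verifies the symmetry and shows that each diagonal entry is the squared length of the corresponding embedding:

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

scores = inputs @ inputs.T  # shape (6, 6)

# Symmetry: the dot product of x_i with x_j equals that of x_j with x_i
is_symmetric = torch.allclose(scores, scores.T)

# Diagonal: the dot product of a vector with itself is its squared norm
squared_norms = (inputs ** 2).sum(dim=1)
diag_matches = torch.allclose(torch.diagonal(scores), squared_norms)

print(is_symmetric, diag_matches)
```

The symmetry holds only because queries and keys are the same raw embeddings here; it is not a general property of attention scores.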

**1.2.2 Attention Weights:**

Apply softmax across rows to get attention weights for each word, showing its relationship to all others.

In [6]:
# 1.2.2 Apply softmax row-wise to get attention weights
attn_weights_all = F.softmax(attn_scores_all, dim=1)
print("All Attention Weights:\n", attn_weights_all)

All Attention Weights:
 tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


These attention weights are the result of applying softmax to the attention scores. The values in this matrix represent the probability that a word (row) attends to another word (column).

For example, for the word "starts" (index 2), it pays the most attention to "journey" (index 1) with an attention weight of 0.2369, which reflects how related these words are in the context.
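The choice of `dim=1` is what makes each row a probability distribution. A small sketch, assuming the same `inputs` tensor, that compares row sums with column sums:

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

scores = inputs @ inputs.T
weights = torch.softmax(scores, dim=1)  # normalize over the columns of each row

row_sums = weights.sum(dim=1)  # one value per query word
col_sums = weights.sum(dim=0)  # one value per attended-to word

print("Row sums:", row_sums)
print("Column sums:", col_sums)
```

Only the row sums are constrained to equal 1. The column sums indicate how much total attention a word receives from all queries and can be above or below 1.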

**1.2.3 All Context Vectors:**

Generate a context vector for each word, capturing its meaning in the context of the entire sequence.

In [7]:
# 1.2.3 Generate context vectors (weighted sum for each word)
context_vectors_all = torch.matmul(attn_weights_all, inputs)
print("All Context Vectors:\n", context_vectors_all)

All Context Vectors:
 tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


Each context vector is a weighted sum of the input embeddings, where the weights are the attention weights calculated previously.

For example, the context vector for "starts" (index 2) is tensor([0.4431, 0.6496, 0.5671]), which represents a refined version of the word "starts" in the context of the entire sequence.
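The three steps of section 1.2 (scores, weights, context vectors) can be collected into one function. A minimal sketch; `simple_self_attention` is a hypothetical name, and the function has no trainable parameters, matching this section:

```python
import torch

def simple_self_attention(x):
    """Parameter-free self-attention over a (num_tokens, dim) tensor."""
    scores = x @ x.T                          # pairwise dot products
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ x                        # weighted sums of the inputs

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

context_vectors = simple_self_attention(inputs)
print(context_vectors[2])  # context vector for 'starts'
```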

**2. The ‘Self’ in Self-Attention**

In self-attention, the ‘self’ refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence.

**2.1 Weight Parameters vs Attention Weights:**

In the weight matrices W, the term ‘weight’ is short for ‘weight parameters’, the values of a neural network that are optimized during training. This is not to be confused with the attention weights.

As we saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.

In summary, weight parameters are the fundamental, learned coefficients that define the network’s connections, while attention weights are dynamic, context-specific values.

Distinguish between learned parameters (weights of the network) and dynamically computed attention weights. This clarifies the different roles they play.

**Weight Parameters (Wq, Wk, Wv):**

Weight parameters are the learnable values of the neural network. They are optimized during training and define the connections between neurons in the network.

In the case of self-attention, Wq (query weight), Wk (key weight), and Wv (value weight) are the weight matrices used to transform the input word embeddings into queries, keys, and values, respectively.

**Attention Weights:**

Attention weights are computed dynamically from the input sequence on every forward pass, during training as well as inference. These weights indicate how much attention each word should pay to the other words in the sequence when constructing the context vector for each word.

The attention weights are context-specific and depend on the similarity (via dot-product) between the query and key vectors.
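The distinction can be illustrated in a few lines. In this sketch, `W_query` and `W_key` are fixed parameters, while the attention weights change as soon as the input changes. The names, the seed, and the helper `attention_weights` are assumptions made for illustration:

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for repeatability only

# Weight parameters: stored in the model and updated by training
W_query = torch.nn.Parameter(torch.rand(3, 3))
W_key = torch.nn.Parameter(torch.rand(3, 3))

def attention_weights(x):
    # Attention weights: recomputed from the input on every call
    queries = x @ W_query
    keys = x @ W_key
    return torch.softmax(queries @ keys.T, dim=-1)

x_a = torch.tensor([[0.43, 0.15, 0.89], [0.55, 0.87, 0.66]])
x_b = torch.tensor([[0.22, 0.58, 0.33], [0.77, 0.25, 0.10]])

w_a = attention_weights(x_a)
w_b = attention_weights(x_b)

print(W_query.requires_grad)  # True: tracked for gradient updates
print(w_a)
print(w_b)
```

The same two parameter matrices produce a different weight matrix for each input, which is what "dynamic, context-specific" means here.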

**2.2 Computing Weight Parameters for Inputs[1]:**

**2.2.1 Initialize the three weight matrices Wq, Wk, Wv:**

Introduce learnable weight matrices (Wq, Wk, Wv) to transform input vectors into queries, keys, and values. This adds flexibility and allows the model to learn complex relationships. Here the matrices are initialized randomly; in a real model they would be optimized during training.

In [8]:
# Initialize the weight matrices (random for simplicity)
Wq = torch.rand(3, 3)  # Query weight matrix (3x3)
Wk = torch.rand(3, 3)  # Key weight matrix (3x3)
Wv = torch.rand(3, 3)  # Value weight matrix (3x3)
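
Because `torch.rand` draws fresh values on every run, the numbers printed in the following cells change between runs. A minimal sketch of one way to make the initialization repeatable, assuming an arbitrary seed of 123 (this is not the seed behind the outputs shown in this notebook, which is unknown):

```python
import torch

torch.manual_seed(123)  # arbitrary seed chosen for illustration
Wq = torch.rand(3, 3)   # Query weight matrix (3x3)
Wk = torch.rand(3, 3)   # Key weight matrix (3x3)
Wv = torch.rand(3, 3)   # Value weight matrix (3x3)

# Re-seeding and drawing again reproduces the first matrix exactly
torch.manual_seed(123)
Wq_again = torch.rand(3, 3)

print(torch.equal(Wq, Wq_again))
```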

**2.2.2 Compute the query, key, and value vectors for inputs[1]:**

These transformations project the input into different “spaces” that emphasize different aspects of the word’s meaning.

In [9]:
# Select input index 1 ("journey")
input_1 = inputs[1]

# Compute the query, key, and value vectors using the weight matrices
query_1 = torch.matmul(input_1, Wq)
key_1 = torch.matmul(input_1, Wk)
value_1 = torch.matmul(input_1, Wv)

print(f"Query vector for 'journey': {query_1}")
print(f"Key vector for 'journey': {key_1}")
print(f"Value vector for 'journey': {value_1}")

Query vector for 'journey': tensor([1.0983, 1.7410, 1.5428])
Key vector for 'journey': tensor([1.4116, 0.7436, 0.9073])
Value vector for 'journey': tensor([1.2868, 1.5327, 0.7423])


**2.2.3 Compute the Attention Score inputs[1][1] or ω11:**

Calculate the similarity between the transformed query and key.

In [10]:
# Compute the attention score between query_1 and key_1
attn_score_1_1 = torch.dot(query_1, key_1)

print(f"Attention score (ω11) between 'journey' query and key: {attn_score_1_1}")

Attention score (ω11) between 'journey' query and key: 4.244719982147217


**2.2.4 Compute all the Attention Scores for inputs[1]:**

Calculate all the similarity scores against the query vector.

In [11]:
# Compute attention scores between query_1 and every key (dot products)
keys = torch.matmul(inputs, Wk)                  # key vectors for all inputs, shape (6, 3)
attn_scores_all_1 = torch.matmul(keys, query_1)  # one score per input token, shape (6,)
print("All Attention Scores for 'journey':\n", attn_scores_all_1)

All Attention Scores for 'journey':
 tensor([2.3522, 4.2447, 4.1886, 2.4529, 1.9990, 3.1440])


**2.2.5 Attention weights for inputs[1]:**

Normalize the attention scores.

In [12]:
# Apply softmax to the attention scores to get attention weights
attn_weights_1 = F.softmax(attn_scores_all_1, dim=0)
print("Attention Weights for 'journey':\n", attn_weights_1)

Attention Weights for 'journey':
 tensor([0.0558, 0.3702, 0.3500, 0.0617, 0.0392, 0.1231])


**2.2.6 Calculate Context vector for inputs[1]:**

Generate the context vector.

In [13]:
# Compute context vector for 'journey' as the weighted sum of the value vectors
values = torch.matmul(inputs, Wv)  # value vectors for all inputs, shape (6, 3)
context_vector_1 = torch.matmul(attn_weights_1, values)
print("Context Vector for 'journey':", context_vector_1)

Context Vector for 'journey': tensor([1.1505, 1.3786, 0.6587])
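
The steps of this subsection (project to query, keys, and values; score; normalize; take the weighted sum of the values) can be collected into one helper. This is a sketch under stated assumptions: `context_for_token` is a hypothetical name, the seed is arbitrary, and the scores are left unscaled as in the rest of this notebook.

```python
import torch

def context_for_token(x, idx, Wq, Wk, Wv):
    """Context vector for the token at position idx (unscaled dot-product attention)."""
    query = x[idx] @ Wq                      # (d_out,)
    keys = x @ Wk                            # (num_tokens, d_out)
    values = x @ Wv                          # (num_tokens, d_out)
    scores = keys @ query                    # one score per token
    weights = torch.softmax(scores, dim=0)   # sums to 1 over the tokens
    return weights @ values                  # weighted sum of the value vectors

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

torch.manual_seed(123)  # arbitrary seed, for repeatability only
Wq, Wk, Wv = torch.rand(3, 3), torch.rand(3, 3), torch.rand(3, 3)

context_journey = context_for_token(inputs, 1, Wq, Wk, Wv)
print(context_journey.shape)
```

The result is a single vector with as many entries as the weight matrices have columns.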


**2.3 Computing Weight Parameters for All Inputs:**

**2.3.2 Compute the query, key, and value vectors:**

Compute the transformed vectors for all input words.


In [14]:
# Compute the query, key, and value vectors for all inputs
queries = torch.matmul(inputs, Wq)
keys = torch.matmul(inputs, Wk)
values = torch.matmul(inputs, Wv)

print("All Queries:\n", queries)
print("All Keys:\n", keys)
print("All Values:\n", values)

All Queries:
 tensor([[0.6376, 1.1715, 1.1689],
        [1.0983, 1.7410, 1.5428],
        [1.1020, 1.7286, 1.5322],
        [0.5779, 0.9429, 0.8136],
        [0.8592, 1.0181, 0.9099],
        [0.5825, 1.1272, 0.9741]])
All Keys:
 tensor([[0.8316, 0.5522, 0.3095],
        [1.4116, 0.7436, 0.9073],
        [1.3995, 0.7317, 0.8929],
        [0.7891, 0.4133, 0.5618],
        [0.7879, 0.3115, 0.3833],
        [0.9565, 0.5554, 0.7301]])
All Values:
 tensor([[0.8957, 1.0611, 0.4683],
        [1.2868, 1.5327, 0.7423],
        [1.2731, 1.5030, 0.7403],
        [0.7049, 0.8791, 0.3975],
        [0.6687, 0.5438, 0.4965],
        [0.8842, 1.2215, 0.4439]])


**2.3.3 Compute the Attention Score for all inputs:**

Compute all attention scores between all words.

In [15]:
# Compute all attention scores (dot product between queries and keys)
attn_scores_all = torch.matmul(queries, keys.T)
print("All Attention Scores:\n", attn_scores_all)

All Attention Scores:
 tensor([[1.5389, 2.8317, 2.7933, 1.6440, 1.3154, 2.1140],
        [2.3522, 4.2447, 4.1886, 2.4529, 1.9990, 3.1440],
        [2.3451, 4.2311, 4.1753, 2.4447, 1.9941, 3.1329],
        [1.2531, 2.2551, 2.2252, 1.3028, 1.0609, 1.6705],
        [1.5584, 2.7955, 2.7600, 1.6100, 1.3429, 2.0517],
        [1.4083, 2.5443, 2.5099, 1.4728, 1.1835, 1.8945]])


**2.3.4 Attention weights for all inputs:**

Normalize the attention scores.

In [16]:
# Normalize the attention scores with softmax
attn_weights_all = F.softmax(attn_scores_all, dim=1)
print("All Attention Weights:\n", attn_weights_all)

All Attention Weights:
 tensor([[0.0845, 0.3078, 0.2962, 0.0938, 0.0676, 0.1502],
        [0.0558, 0.3702, 0.3500, 0.0617, 0.0392, 0.1231],
        [0.0561, 0.3697, 0.3496, 0.0619, 0.0395, 0.1233],
        [0.1024, 0.2790, 0.2708, 0.1077, 0.0845, 0.1555],
        [0.0887, 0.3058, 0.2951, 0.0934, 0.0715, 0.1453],
        [0.0942, 0.2934, 0.2835, 0.1005, 0.0752, 0.1532]])


**2.3.5 Calculate Context vector for all inputs:**

Generate all context vectors.

In [17]:
# Compute all context vectors by taking the weighted sum of the value vectors
context_vectors_all = torch.matmul(attn_weights_all, values)
print("All Context Vectors:\n", context_vectors_all)

All Context Vectors:
 tensor([[1.0929, 1.3092, 0.6248],
        [1.1505, 1.3786, 0.6587],
        [1.1500, 1.3780, 0.6584],
        [1.0655, 1.2740, 0.6094],
        [1.0910, 1.3050, 0.6242],
        [1.0794, 1.2921, 0.6171]])
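
The whole of section 2.3 fits into a small `torch.nn.Module`, which is how the weight matrices would typically be registered for training. A minimal sketch, not a definitive implementation: the class name `SelfAttentionSketch` and the attribute names are assumptions, the initialization mirrors the `torch.rand` used above, and the scores are left unscaled as in this notebook.

```python
import torch
import torch.nn as nn

class SelfAttentionSketch(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # Weight parameters: registered with the module so an optimizer can update them
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys = x @ self.W_key
        values = x @ self.W_value
        attn_scores = queries @ keys.T                      # (num_tokens, num_tokens)
        attn_weights = torch.softmax(attn_scores, dim=-1)   # row-wise normalization
        return attn_weights @ values                        # (num_tokens, d_out)

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
])

torch.manual_seed(123)  # arbitrary seed, for repeatability only
attention = SelfAttentionSketch(d_in=3, d_out=3)
context_vectors = attention(inputs)
print(context_vectors.shape)
```

Calling the module on `inputs` returns one context vector per token, and `attention.parameters()` exposes the three weight matrices to an optimizer.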
