### Using NumPy to Create Self-Attention

Think of Self-Attention Like a Classroom Discussion
Imagine you are in a classroom discussing the sentence:

 **"The cat sat on the mat."**

You are the teacher, and you want to understand what each word in the sentence means in context by looking at all the other words.

Step 1: Understanding the Need for Attention
Each word in a sentence has a meaning, but its meaning depends on other words too.

Example:

The word "sat" means "to be in a sitting position."
But who is sitting? "cat" is the subject.
Where? "on the mat" tells us the location.

Instead of looking at words one by one, self-attention allows each word to "ask" other words for relevant information.

Step 2: Assigning Roles – Queries, Keys, and Values
To help each word decide what to focus on, we give each word three roles:

1. Query (Q) – The Question:

Each word "asks" about the meaning of the sentence from its perspective.
Example: "What is important for me to understand?"

2. Key (K) – The Information Holder:

Each word "offers" its meaning to others.
Example: "This is what I mean, take it if needed."

3. Value (V) – The Final Meaning:

Each word carries useful information that will be passed to others.

Think of this like students asking and answering questions in a classroom.

If you are the student asking, your question is the Query (Q).
Each other student signals what they know about; that signal is their Key (K).
What they actually tell you is their Value (V), and you weigh each answer by how well their Key matched your question.
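
As a rough sketch of where the three roles come from, each word's embedding can be multiplied by three separate weight matrices. The matrices below (`W_Q`, `W_K`, `W_V`) are random placeholders rather than learned weights, and the sizes (`d_model = 4`, `d_k = 2`) are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Assumed toy sizes: 6 words, embedding width 4, projection width 2
n_tokens, d_model, d_k = 6, 4, 2
X = rng.random((n_tokens, d_model))  # one row per word in the sentence

# Random stand-ins for the learned projection matrices
W_Q = rng.random((d_model, d_k))
W_K = rng.random((d_model, d_k))
W_V = rng.random((d_model, d_k))

Q = X @ W_Q  # what each word is looking for
K = X @ W_K  # what each word offers for matching
V = X @ W_V  # the content each word passes along

print(Q.shape, K.shape, V.shape)
```

Every word gets its own query, key, and value row, all derived from the same input embedding.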


Step 3: Scoring the Importance of Words
Now, each word "talks" to every other word and decides how important they are.

How? By comparing Queries (Q) and Keys (K)!

If a word's Query (Q) matches well with another word's Key (K), measured by their dot product, that other word is important to it.
The higher the match, the more attention it gets!
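
To make "matching" concrete, here is a small sketch that uses the dot product as the match score. The query and key vectors are made-up numbers, not values from a trained model.

```python
import numpy as np

# Hypothetical 2-dimensional query for one word
query = np.array([1.0, 0.0])

# Hypothetical keys for three words, one per row
keys = np.array([
    [0.9, 0.1],    # points almost the same way as the query
    [0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0],   # points the opposite way
])

# One dot product per key: a larger value means a better match
scores = keys @ query
print(scores)  # scores for the three keys: 0.9, 0.0, -1.0
```

The first key lines up with the query and gets the highest score, so that word would receive the most attention.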

Step 4: Making the Attention Weights
Now that we have scores, we want to convert them into a proper weight (like a probability).

How? By applying softmax():

It exponentiates each score and divides by the total, so higher scores get a larger share of the attention and lower scores a smaller one.
For each word, the attention weights over all words sum to 1 (100%).
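
A minimal sketch of softmax on a single row of scores follows. The scores are invented for illustration, and `softmax_row` is a helper name introduced here.

```python
import numpy as np

def softmax_row(scores):
    """Softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up scores for three words
weights = softmax_row(scores)

print(weights)        # all positive; the largest score gets the largest weight
print(weights.sum())  # sums to 1, up to floating-point rounding
```

Subtracting the maximum does not change the result, because the same factor cancels in the numerator and the denominator.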

Step 5: Updating the Meaning Using Values (V)
Each word now mixes information from other words using the attention weights.

Each word’s final meaning is a blend of the words it pays attention to!

Example:

"sat" borrows information from "cat" and "on" (because they got high attention scores).
This helps "sat" understand that it refers to the cat sitting on something.
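
The blending step can be sketched as a weighted sum of value vectors. The weights and values below are made up so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up attention weights for one word over three words (they sum to 1)
weights = np.array([0.5, 0.3, 0.2])

# Made-up value vectors, one row per word
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Weighted sum of the rows of V:
# 0.5*[1, 0] + 0.3*[0, 1] + 0.2*[1, 1] = [0.7, 0.5]
blended = weights @ V
print(blended)
```

The word with the largest weight contributes most to the blended vector.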

```
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

Attention_Scores = Q @ K.T
Scaled_Scores = Attention_Scores / sqrt(d_k)
Attention_Weights = softmax(Scaled_Scores)
Output = Attention_Weights @ V
```

```
Self_Attention(Q, K, V) = softmax((Q @ K.T) / sqrt(d_k)) @ V
```
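
The formula can be sketched as one NumPy function. This is a minimal single-head version under a few assumptions: the weight matrices are random placeholders, the function and argument names are chosen here for illustration, and `d_k` is taken from the width of the key vectors.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = K.shape[-1]  # width of the key vectors
    scaled_scores = (Q @ K.T) / np.sqrt(d_k)
    attention_weights = softmax(scaled_scores)
    return attention_weights @ V, attention_weights

rng = np.random.default_rng(0)  # fixed seed for repeatability
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)
W_Q, W_K, W_V = (rng.random((3, 3)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)         # (3, 3)
print(weights.sum(axis=1))  # each row sums to 1
```

Returning the weights alongside the output makes it easy to inspect which words each word attended to.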

In [41]:
import numpy as np

In [42]:
# Define the input (Word embeddings)
x = np.array([[1, 0, 1], # word 1
              [0, 1, 1], # word 2
              [1, 1, 0] # word 3
])

print("Input Word Embeddings:", x)

Input Word Embeddings: [[1 0 1]
 [0 1 1]
 [1 1 0]]


In [43]:
# Initialize the weight matrices W_Q, W_K, W_V (named q, k, v here) with random numbers
# No seed is set, so the values change on every run
q = np.random.rand(3,3)
k = np.random.rand(3,3)
v = np.random.rand(3,3)

In [44]:
print(q)

[[0.18347741 0.41284234 0.78488908]
 [0.49234389 0.69207146 0.17371277]
 [0.68670259 0.70446771 0.06658257]]


In [45]:
print(q[0]) # displays the first row
print(q[1]) # displays the second row
print(q[2]) # displays the third row


[0.18347741 0.41284234 0.78488908]
[0.49234389 0.69207146 0.17371277]
[0.68670259 0.70446771 0.06658257]


In [46]:
# Display the number of rows in q
print("row:", q.shape[0])

row: 3


In [47]:
# Display the number of columns in q
print("Column:", q.shape[1])

Column: 3


In [48]:
# Project the input onto the weight matrices to get Q, K, V
# (for 2-D arrays, np.dot gives the same result as the @ operator)
q_x = np.dot(x, q)
k_x = np.dot(x, k)
v_x = np.dot(x, v)

In [49]:
print("Query:\n",q_x, "\n")
print("Key:\n",k_x, "\n")
print("Value:\n",v_x)

Query:
 [[0.87018    1.11731005 0.85147165]
 [1.17904648 1.39653917 0.24029534]
 [0.67582129 1.10491381 0.95860185]] 

Key:
 [[0.91529893 1.54382904 1.54427484]
 [1.25577347 1.45645986 0.77585783]
 [0.55500505 1.77899187 1.08786789]] 

Value:
 [[1.96157533 0.52258213 1.37471803]
 [1.3899051  0.72213917 0.85464171]
 [1.36134791 0.27962817 0.90030526]]


In [50]:
# Compute the raw attention scores as Q @ K.T
# (this walkthrough skips the division by sqrt(d_k) shown in the formula)
attention_scores = np.dot(q_x, k_x.T)

In [51]:
print(attention_scores)

[[3.83631675 3.38068715 3.39692845]
 [3.60627974 3.70105356 3.40021816]
 [3.80472123 3.20167982 3.38354908]]
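
The formula earlier divides the scores by `sqrt(d_k)` before softmax, while the scores printed above are the raw products. As a sketch of what the scaling changes, the snippet below copies those printed scores and compares softmax with and without the division. It assumes `d_k = 3`, the width of the key vectors in this notebook; `softmax_rows` is a helper name introduced here.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Scores copied from the printed output above
attention_scores = np.array([
    [3.83631675, 3.38068715, 3.39692845],
    [3.60627974, 3.70105356, 3.40021816],
    [3.80472123, 3.20167982, 3.38354908],
])
d_k = 3  # assumed: width of the key vectors in this notebook

unscaled = softmax_rows(attention_scores)
scaled = softmax_rows(attention_scores / np.sqrt(d_k))

print("unscaled max per row:", unscaled.max(axis=1))
print("scaled max per row:  ", scaled.max(axis=1))
```

Dividing by `sqrt(d_k)` shrinks the gaps between scores, so the scaled weights are spread more evenly across the words.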


In [52]:
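# Define a row-wise softmax function
def softmax(scores):
    # Subtract each row's max before exponentiating for numerical stability
    exps = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)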

In [53]:
# Let's apply the softmax to get the attention weights
attention_weights = softmax(attention_scores)

In [54]:
attention_weights

array([[0.43888925, 0.27827713, 0.28283362],
       [0.34326595, 0.37739007, 0.27934398],
       [0.4538395 , 0.24831602, 0.29784448]])

In [55]:
# Compute the final output (weighted sum of values)
output = attention_weights @ v_x
print("Final Output after Self-attention:\n\n", output)

Final Output after Self-attention:

 [[1.63272808 0.50939874 1.0958128 ]
 [1.57816274 0.53002525 1.04592204]
 [1.64084603 0.49977284 1.10427352]]
