# Attention Mechanism
As we've learned, the attention mechanism is a method that provides a model with context from the surrounding text. Let's first explore how single-headed attention is calculated by hand. Next, we'll cover the implementation of the multi-headed attention mechanism in both the PyTorch and TensorFlow libraries in Python. Finally, we'll apply our understanding of the attention mechanism to building a quick Weighted Averaging Neural Network (WANN).

In [1]:
!pip install tensorflow
!pip install keras



In [2]:
%pip install gensim --quiet
%pip install tensorflow-datasets --quiet
%pip install -U tensorflow-text==2.8.2 --quiet
%pip install pydot --quiet
%pip install nltk --quiet

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.4 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboardx 2.6.2.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.
tensorflow-datasets 4.9.6 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.
tensorflow-metadata 1.15.0 requires protobuf<4.21,>=3.20.3; python_version < "3.11", but you have protobuf 3.19.6 which is incompatible.
Note: you may need to restart the kernel to

In [3]:
# import python libraries
import einops
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# tensorflow libraries
import tensorflow as tf
from tensorflow import keras

# pytorch libraries
import torch
import torch.nn as nn


# natural language processing libraries
import nltk
from nltk.data import find
import gensim


sns.set()

## Calculating the Attention Mechanism
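
As a compact reference, the three steps computed in the cells below can be written as follows, using the same names as the code ($q$ for the query, $k_i$ and $v_i$ for the keys and values). This is the unscaled dot-product form; the scaled variant divides each score $s_i$ by $\sqrt{d_k}$, where $d_k$ is the key dimension.

$$
s_i = q \cdot k_i, \qquad
\alpha_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}, \qquad
c = \sum_i \alpha_i \, v_i
$$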

In [4]:
# given
q = [1, 2., 1]

k1 = v1 = [-1, -1, 3.]
k2 = v2 = [1, 2, -5.]

In [30]:
### Attention Mechanism - Implement the three steps of the attention mechanism

# step 1 - attention score using dot product
s1 = np.dot(q, k1)
s2 = np.dot(q, k2)

# step 2 - weights using softmax
alpha_1 = np.exp(s1)/sum(np.exp([s1, s2]))
alpha_2 = np.exp(s2)/sum(np.exp([s1, s2]))


# step 3 - context vector c: weighted sum of the value vectors (not the keys)
c = alpha_1 * np.array(v1) + alpha_2 * np.array(v2)

print(f"Attention weights: {alpha_1}, {alpha_2}")
print(f"Computed Context Vector (Attention Output): {c}")

Attention weights: 0.5, 0.5
Computed Context Vector (Attention Output): [ 0.   0.5 -1. ]
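
The three steps above generalise to matrices, which is the form library implementations tend to use. The following is a minimal NumPy sketch, not the notebook's own code: the function name `dot_product_attention` and its `scale` flag are made up for illustration, and the max-subtraction inside the softmax is an added numerical-stability step that does not change the result.

```python
import numpy as np

def dot_product_attention(Q, K, V, scale=False):
    """Attention for queries Q (n_q, d), keys K (n_k, d), values V (n_k, d_v)."""
    Q, K, V = (np.asarray(x, dtype=float) for x in (Q, K, V))

    # step 1 - one score per (query, key) pair
    scores = Q @ K.T
    if scale:
        # optional scaling by sqrt(d_k), as in scaled dot-product attention
        scores = scores / np.sqrt(K.shape[-1])

    # step 2 - softmax over the keys (subtract the row max for stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # step 3 - weighted sum of the value vectors
    return weights @ V, weights

# same inputs as the hand calculation
q = [1, 2., 1]
k1 = v1 = [-1, -1, 3.]
k2 = v2 = [1, 2, -5.]

context, weights = dot_product_attention([q], [k1, k2], [v1, v2])
print(f"Attention weights: {weights}")
print(f"Context vector: {context}")
```

With these inputs both scores are 0, so the sketch should reproduce the weights of 0.5 each and the context vector `[0, 0.5, -1]` from the cell above, with or without `scale=True`.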


## Attention Implementation in TensorFlow/PyTorch

In [39]:
# Trying with Keras API
test_query = np.array([q])
test_keys_values = np.array([k1, k2])

print(f"Query Vector: {test_query}")
print(f"Key Vector: \n{test_keys_values}")
print(f"Value Vector: \n{test_keys_values}")

Query Vector: [[1. 2. 1.]]
Key Vector: 
[[-1. -1.  3.]
 [ 1.  2. -5.]]
Value Vector: 
[[-1. -1.  3.]
 [ 1.  2. -5.]]


In [40]:
# convert the arrays to tensors
test_query_tf = tf.convert_to_tensor(test_query)
test_keys_values_tf = tf.convert_to_tensor(test_keys_values)

# Apply dot-product (Luong-style) attention; with the default use_scale=False the scores are not scaled
attention_output, attention_weights = tf.keras.layers.Attention()([test_query_tf, test_keys_values_tf], return_attention_scores=True)

print("Attention Outputs: ", attention_output)
print("Attention Weights: ", attention_weights)

Attention Outputs:  tf.Tensor([[ 0.   0.5 -1. ]], shape=(1, 3), dtype=float32)
Attention Weights:  tf.Tensor([[0.5 0.5]], shape=(1, 2), dtype=float32)


In [41]:
# Convert the arrays to PyTorch tensors
test_query_ptorch = torch.tensor(test_query, dtype=torch.float32) # shape (1, 3): (num_queries, embed_dim)
test_keys_values_ptorch = torch.tensor(test_keys_values, dtype=torch.float32) # shape (2, 3): (num_keys, embed_dim)

# Apply the scaled dot-product attention
attention_output = nn.functional.scaled_dot_product_attention(test_query_ptorch, test_keys_values_ptorch, test_keys_values_ptorch, dropout_p=0.0, is_causal=False)

# scaled_dot_product_attention returns only the output, so recompute the scaled scores by hand to get the weights
d_k = test_query_ptorch.size(-1)  # Embedding dimension
scores = torch.matmul(test_query_ptorch, test_keys_values_ptorch.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

# Apply the softmax to get the attention weights
attention_weights = nn.functional.softmax(scores, dim=-1) 

print("Attention Outputs: ", attention_output)
print("Attention Weights: ", attention_weights)

Attention Outputs:  tensor([[ 5.9605e-08,  5.0000e-01, -1.0000e+00]])
Attention Weights:  tensor([[0.5000, 0.5000]])
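
The Keras and PyTorch results agree here because both scores are 0, so dividing by $\sqrt{d_k}$ changes nothing. In general the two defaults differ: PyTorch's `scaled_dot_product_attention` divides the scores by the square root of the embedding dimension, while Keras's `Attention` layer leaves them unscaled unless `use_scale=True` (which, as far as I know, adds a learnable scale rather than a fixed $1/\sqrt{d_k}$). The NumPy sketch below illustrates the effect of scaling with a hypothetical query `q_alt` that is not part of the notebook; it only mimics the two conventions and does not call either library.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

# keys from the notebook; q_alt is a made-up query chosen so the scores differ
keys = np.array([[-1., -1., 3.],
                 [1., 2., -5.]])
q_alt = np.array([1., 0., 0.])

scores = keys @ q_alt            # dot products: [-1., 1.]
d_k = keys.shape[-1]             # key dimension (3)

w_unscaled = softmax(scores)                 # unscaled convention
w_scaled = softmax(scores / np.sqrt(d_k))    # scaled convention

print("Unscaled weights:", w_unscaled)
print("Scaled weights:  ", w_scaled)
```

Scaling shrinks the gap between the scores, so the scaled weights are closer to uniform than the unscaled ones. This is worth keeping in mind when comparing outputs across the two libraries on inputs whose scores are not all equal.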


## The "WANN" Model
For our next part, we'll 