# Attention Mask

In the article [A Deep Dive into Transformers Architecture](https://medium.com/@krupck/a-deep-dive-into-transformers-architecture-58fed326b08d), I explained the theory behind the architecture and mechanisms of Transformers, from the mathematical foundations to the details that make this model so powerful and versatile. Now it’s time to turn theory into practice! Let’s roll up our sleeves and implement the attention layer, looking at how each component interacts, to solidify that knowledge and gain a deeper understanding of the inner workings of these models. This hands-on step reinforces the concepts learned and paves the way for building real-world applications based on Transformers.

We’ll create a small model that predicts one token for each position of a 5-token input sequence (the `seq_length` set in the hyperparameters below).

A Transformer model consists of several key components:
1. **Embedding Layer**: Transforms words into fixed-size numerical vectors.  
2. **Attention Mechanism**: Allows the model to focus on different parts of the input.  
3. **Encoder and Decoder Layers**: Stack attention and feed-forward sublayers that transform the token representations.  
4. **Linear and Softmax Layers**: Perform the final predictions.  

For this project, the main objective is to implement item 2. However, to make the example functional, I will also implement items 1 and 4.  

### Initial Imports

In [1]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Embedding Layer  

The embedding function is used to convert sequential inputs into dense, fixed-size vectors. These vectors, known as embeddings, are a fundamental part of natural language processing (NLP) models.  

Embeddings are crucial for deep learning models in NLP as they provide a rich and dense representation of words or tokens, capturing contextual and semantic information essential for tasks like machine translation, text classification, and more.

In [2]:
# Define the function to create an embedding matrix
def embedding(input, vocab_size, dim_model):

    # Creates an embedding matrix where each row represents a token in the vocabulary. 
    # The matrix is initialized with randomly distributed values.
    embed = np.random.randn(vocab_size, dim_model)

    # For each token index in the input, the corresponding embedding is selected from the matrix.
    # Returns an array of embeddings corresponding to the input sequence.
    return np.array([embed[i] for i in input])
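
As a minimal sketch of the same lookup, NumPy's integer-array indexing can select all rows in one step instead of a Python loop. The names `rng`, `embedding_table`, and `token_ids` are illustrative and not part of the article's code, and the seeded generator is an assumption added so the table is reproducible and built only once.

```python
import numpy as np

# Assumption: a seeded generator so the lookup table is reproducible.
rng = np.random.default_rng(0)

vocab_size, dim_model = 100, 4

# One row per token in the vocabulary, created once and reused.
embedding_table = rng.standard_normal((vocab_size, dim_model))

token_ids = np.array([28, 59, 35, 80, 14])

# Integer-array indexing selects one row per token id in a single call.
vectors = embedding_table[token_ids]

print(vectors.shape)  # (5, 4)
```

In a trained model the table would be learned rather than random; here it only demonstrates the shape of the lookup.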

### Attention Mechanism  

In the Transformer, **Q** (Query), **K** (Key), and **V** (Value) are all derived from the same input in self-attention layers, which appear in both the encoder and the decoder. In the decoder’s encoder-decoder attention layer they come from different inputs: Q comes from the decoder’s previous layer output, while K and V come from the encoder’s output.  

The attention mechanism computes a set of scores (using the dot product between Q and K, hence the name "scaled dot-product attention"), applies a softmax function to obtain attention weights, and uses these weights to scale the values (V). This produces an output that is a weighted combination of the relevant input information.  

This process allows the model to "pay attention" to the most relevant parts of the input for each part of the output, which is especially useful for tasks like translation, where the relevance of different input words can vary depending on the part of the sentence being translated.
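
Written as a formula, the steps described above correspond to the standard scaled dot-product attention expression, where $d_k$ is the dimension of the key vectors (called `depth` in the code later in this article):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so each query gets its own set of weights over the keys, and those weights sum to 1.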

### Softmax Activation Function

The softmax function is a widely used activation function in neural networks, especially in classification scenarios where it is important to transform raw output values (logits) into probabilities that sum to 1. Below is the code for the softmax function, with comments on each line explaining its functionality.

In [3]:
# Softmax Activation Function
def softmax(x):
    # Subtract the maximum along the last axis before exponentiating
    # to prevent numerical overflow; this does not change the result
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    
    # Divide each exponential by the sum of exponentials along the last axis (axis=-1)
    # keepdims=True keeps that axis so the division broadcasts for any input shape
    return e_x / e_x.sum(axis=-1, keepdims=True)
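
To see why the maximum is subtracted before exponentiating, the sketch below compares a naive softmax with a stabilized one on deliberately large logits. The function names `naive_softmax` and `stable_softmax` and the example values are illustrative assumptions, not part of the article's code.

```python
import numpy as np

def naive_softmax(x):
    # No max subtraction: np.exp overflows to inf for large logits.
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    # Subtract the per-row maximum so the largest exponent is exp(0) = 1.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow/invalid warnings the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)  # inf / inf produces nan

stable = stable_softmax(logits)

print(np.isnan(naive).any())   # True
print(stable.sum(axis=-1))     # [1.]
```

Shifting every logit in a row by the same constant leaves the softmax unchanged, which is why the stabilized version returns valid probabilities for the same input.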

### Scaled Dot Product

The `scaled_dot_product_attention()` function is a component of the attention mechanism in Transformer models. It calculates the attention between sets of queries (Q), keys (K), and values (V).  

Essentially, this function enables the model to assign different levels of importance to various parts of the input, a key aspect that makes Transformer models particularly effective for NLP tasks and other sequential problems.

In [4]:
# Define the function to calculate scaled dot-product attention
def scaled_dot_product_attention(Q, K, V):
    # Compute the dot product between Q and the transpose of K
    matmul_qk = np.dot(Q, K.T)
    
    # Get the dimension of the key vectors
    depth = K.shape[-1]
    
    # Scale the logits by dividing by the square root of the depth
    logits = matmul_qk / np.sqrt(depth)
    
    # Apply the softmax function to get the attention weights
    attention_weights = softmax(logits)
    
    # Multiply the attention weights by the values V to get the final output
    output = np.dot(attention_weights, V)
    
    # Return the weighted output
    return output
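
The division by the square root of `depth` keeps the scores in a range where softmax does not saturate. The sketch below illustrates this under the assumption that query and key components are independent standard normal values, as with the random embeddings used here; `rng`, `n_samples`, and the choice of `depth = 64` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 64          # illustrative key dimension
n_samples = 10_000

# Components drawn from a standard normal distribution.
q = rng.standard_normal((n_samples, depth))
k = rng.standard_normal((n_samples, depth))

# One dot product per (q, k) pair.
raw_scores = np.sum(q * k, axis=-1)
scaled_scores = raw_scores / np.sqrt(depth)

print(raw_scores.var())     # close to depth
print(scaled_scores.var())  # close to 1
```

Without scaling, the variance of the scores grows with the vector dimension, which pushes softmax toward near one-hot outputs; dividing by the square root of the dimension brings the variance back to about 1.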

### Model Output with Linear Operation and Softmax

The `linear_and_softmax()` function combines a linear layer followed by a softmax function, commonly used in deep learning models, especially for classification tasks.

In [5]:
# Define the function that applies a linear transformation followed by softmax
def linear_and_softmax(input):
    # Initialize a weight matrix with randomly distributed values
    # This matrix connects each model dimension (dim_model) to each vocabulary word (vocab_size)
    # Note: dim_model and vocab_size are global hyperparameters, defined before the function is called
    weights = np.random.randn(dim_model, vocab_size)
    
    # Perform the linear operation (dot product) between the input and the weight matrix
    # The result, logits, holds one unnormalized score per vocabulary token for each position
    logits = np.dot(input, weights)
    
    # Apply the softmax function to the logits
    # This transforms the logits into a probability vector, where the elements sum to 1
    return softmax(logits)
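
Because `linear_and_softmax()` draws new random weights on every call, two calls with the same input give different probabilities. One possible variation, sketched below, creates the projection once and reuses it. The helper `make_output_layer`, its `seed` parameter, and the inner `project` function are hypothetical names introduced for illustration only.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with max subtraction for numerical stability.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def make_output_layer(dim_model, vocab_size, seed=0):
    """Hypothetical helper: create the projection weights once and reuse them."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((dim_model, vocab_size))

    def project(x):
        # Linear map to vocabulary-sized logits, then softmax per position.
        return softmax(x @ weights)

    return project

project = make_output_layer(dim_model=4, vocab_size=100)

x = np.ones((5, 4))      # stand-in for an attention output of shape (5, 4)
probs_a = project(x)
probs_b = project(x)

print(probs_a.shape)                     # (5, 100)
print(np.array_equal(probs_a, probs_b))  # True
```

Passing the dimensions as arguments also removes the dependency on global variables, at the cost of a slightly longer setup.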

### Building the Final Model

In [6]:
# Final model function
def transformer_model(input):
    # Embedding
    embedded_input = embedding(input, vocab_size, dim_model)
    
    # Attention Mechanism
    attention_output = scaled_dot_product_attention(embedded_input, embedded_input, embedded_input)
    
    # Linear layer and softmax
    output_probabilities = linear_and_softmax(attention_output)
    
    # Choose the indices with the highest probability
    output_indices = np.argmax(output_probabilities, axis=-1)
    
    return output_indices

---

### Initial Hyperparameters 

In [7]:
# Model dimension
dim_model = 4

# Sequence length
seq_length = 5

# Vocabulary size
vocab_size = 100

---

### Using the Model for Predictions

In [8]:
# Generating random data for the model input
input_sequence = np.random.randint(0, vocab_size, seq_length)
print("Input Sequence:", input_sequence)

# Making predictions with the model
output = transformer_model(input_sequence)
print("Model Output:", output)

Input Sequence: [28 59 35 80 14]
Model Output: [29 29 81 29 29]


---

### Step-by-Step Execution 

In [9]:
# Generating random data for the model input
input_sequence = np.random.randint(0, vocab_size, seq_length)
input_sequence

array([61, 27, 89, 75, 13])

In [10]:
# Embedding
embedded_input = embedding(input_sequence, vocab_size, dim_model)
embedded_input[0:1]

array([[ 1.45603545,  1.01156137,  1.2830995 , -0.90838249]])

In [11]:
# Attention Mechanism
attention_output = scaled_dot_product_attention(embedded_input, embedded_input, embedded_input)
attention_output

array([[ 1.28904947,  0.89450866,  1.00997079, -0.42816149],
       [ 0.6436809 , -1.64215604, -0.72219075,  0.07433874],
       [ 0.87272191,  0.63963007,  0.23511223,  1.33028821],
       [ 0.92727302,  0.54858306,  0.34207077,  0.78671392],
       [ 0.59935589,  0.99040885,  0.22946714,  1.15830619]])

In [12]:
# Linear layer and softmax
output_probabilities = linear_and_softmax(attention_output)
output_probabilities

array([[9.38738600e-04, 1.69069844e-02, 7.51233476e-04, 3.64708815e-04,
        9.43103740e-04, 5.62525975e-04, 6.81753218e-04, 8.96685434e-04,
        2.45220271e-03, 4.20546520e-04, 1.55358292e-04, 2.00917670e-03,
        5.32383013e-05, 1.20201745e-04, 9.77862415e-03, 2.94847080e-04,
        3.04520184e-03, 9.03622368e-04, 7.11965073e-04, 1.00718321e-03,
        7.94099593e-02, 1.50927796e-03, 1.48418340e-02, 1.81402415e-03,
        4.56680418e-04, 2.53146550e-03, 1.78300070e-03, 1.06983862e-02,
        1.38337568e-05, 1.30624459e-04, 1.12358496e-03, 1.36139662e-03,
        1.70383111e-03, 5.68688783e-03, 6.40813610e-03, 1.76048073e-03,
        8.53477153e-03, 2.27582194e-05, 1.80740720e-01, 1.58274873e-02,
        1.17145903e-04, 3.73107067e-04, 2.62458863e-03, 2.73960809e-04,
        6.05706288e-05, 1.10463185e-03, 2.70273205e-03, 4.27700223e-04,
        2.02133877e-04, 5.54596546e-04, 6.38692649e-04, 3.38231589e-04,
        1.25578118e-04, 1.42738402e-04, 4.16319225e-04, 8.564406...

(Output truncated; the full array has shape (5, 100).)

In [13]:
# Selecting the indices with the highest probabilities
output_indices = np.argmax(output_probabilities, axis=-1)
output_indices

array([84, 29, 95, 95, 95], dtype=int64)

### Conclusion  

In this article, we explored the fundamentals of a Transformer model in a practical way, implementing key components such as the scaled dot-product attention mechanism, the embedding layer, and the combination of linear and softmax layers. This hands-on approach complements the previously discussed theory, helping to solidify understanding of the components that make Transformers so powerful and versatile.  

Throughout the process, we built a functional pipeline capable of processing input sequences, applying attention to focus on relevant parts of the data, and generating predictive outputs based on probabilities. Because every weight here is random and nothing is trained, the predicted tokens are arbitrary; the value of the exercise is in seeing how the data flows through each step. This implementation serves as a starting point for understanding how Transformers operate internally and how they can be adapted for more complex tasks, such as machine translation, sentiment analysis, and text generation.  

Transformers remain one of the most impactful architectures in modern artificial intelligence, and understanding them deeply is essential for anyone looking to work with state-of-the-art models. Now that you have a solid foundation, I recommend exploring more advanced implementations, such as multi-head attention layers, masking mechanisms for context control, and training on real-world datasets.  

Continuous learning and practice are key to mastering this technology. I hope this article has been helpful and inspiring for your journey! 🚀  