## Building a Transformer Model Without Using a Framework

Let's create a model that predicts sequences of 10 tokens.

A Transformer model consists of several main parts:

1. Embedding Layer: Transforms tokens into numerical vectors of fixed size.
2. Attention Mechanism: Allows the model to focus on different parts of the input.
3. Encoder and Decoder Layers: Stacks of attention and feed-forward sublayers that transform the embedded sequence layer by layer.
4. Linear and Softmax Layer: Produces the final predictions.

For this project the main objective is to implement item 2, but to make the example functional we will also implement items 1 and 4.

### Hyperparameters

In [16]:
# Imports
import numpy as np

# Model dimension
dim_model = 64

# Sequence length
seq_length = 10

# Vocabulary size
vocab_size = 100

### Embedding Layer

The embedding function converts a sequence of token indices into dense vectors of fixed size. These vectors are known as embeddings and are a fundamental part of deep learning models, especially in NLP.

Embeddings provide a rich and dense representation of words or tokens. Once trained, they capture contextual and semantic information that is essential for tasks such as machine translation and text classification. In this example the embeddings are randomly initialized and never trained, so they carry no such information yet.

In [17]:
# Define the function to create an embedding matrix
def embedding(input, vocab_size, dim_model):

    # Create an embedding matrix where each row represents a vocabulary token
    # The array is initialized with normally distributed random values
    embed = np.random.randn(vocab_size, dim_model)
    
    # For each token index in the input, select the corresponding embedding from the array
    # Returns an array of embeddings corresponding to the input sequence
    return np.array([embed[i] for i in input])
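
To make the lookup concrete, here is a minimal sketch of what the function does for one input sequence. The seeded generator `rng` and the names `tokens`, `looked_up` and `looped` are assumptions introduced for illustration. It also shows that NumPy integer-array indexing gives the same result as the list comprehension above.

```python
import numpy as np

# Values mirroring the hyperparameters defined earlier in this section
vocab_size, dim_model, seq_length = 100, 64, 10

# Seeded generator so the sketch is reproducible (an assumption, not in the original)
rng = np.random.default_rng(0)

# Embedding matrix: one row of size dim_model per vocabulary token
embed = rng.standard_normal((vocab_size, dim_model))

# A hypothetical input sequence of token indices
tokens = rng.integers(0, vocab_size, size=seq_length)

# Integer-array indexing selects one row per token index
looked_up = embed[tokens]

# Same lookup written as an explicit loop over the token indices
looped = np.array([embed[i] for i in tokens])

print(looked_up.shape)                    # (10, 64)
print(np.array_equal(looked_up, looped))  # True
```

Each of the 10 input tokens is replaced by its 64-dimensional row, so the output has shape `(seq_length, dim_model)`.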

### Attention Mechanism

In this example, let's keep it simple and use just one attention layer.

In a Transformer, Q, K and V are derived from the same input in self-attention layers, but from different inputs in the decoder's encoder-decoder attention layers, where Q comes from the output of the previous decoder layer while K and V come from the encoder output. The attention mechanism calculates a set of scores using the dot product between Q and K, scales them by the square root of the key dimension (hence the name "scaled dot-product attention"), applies a softmax function to obtain attention weights, and uses these weights to combine the values, so the output is a weighted combination of the relevant input information.

This process allows the model to pay "attention" to the most relevant parts of the input for each part of the output, which is especially useful in tasks such as translation, where the relevance of different words in the input can vary depending on the part of the sentence being translated.
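
The computation described above is commonly written as a single formula. The symbol $d_k$ for the dimension of the key vectors does not appear in the text and is introduced here for illustration; in this example it equals `dim_model`.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so each query position gets its own set of weights over all key positions.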

### Softmax Activation Function

The softmax function is a widely used activation function in neural networks, especially in classification scenarios, where it is important to transform raw output values (logits) into probabilities that sum to 1. Below is the softmax function code with comments on each line explaining how it works:

In [18]:
# Softmax Activation Function
def softmax(x):
    
    # Subtract the maximum of each row before exponentiating to avoid numeric overflow
    # (this shift does not change the result of the softmax)
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    
    # Divide each exponential by the sum of the exponentials along the last axis (axis=-1)
    # keepdims=True keeps the summed axis, so the division broadcasts for 1-D and 2-D inputs
    return e_x / e_x.sum(axis=-1, keepdims=True)
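
As a quick check of the two properties described above (rows that sum to 1, and no overflow for large logits), here is a small self-contained sketch. The helper name `softmax_rows` and the sample `logits` are assumptions made for this example; the helper normalizes along the last axis.

```python
import numpy as np

def softmax_rows(x):
    # Hypothetical helper for this sketch: softmax along the last axis
    # Shifting by the row maximum keeps np.exp from overflowing
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Two rows of logits; the second row would overflow np.exp without the shift
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])

probs = softmax_rows(logits)

# Each row is a probability distribution
print(probs.sum(axis=-1))

# Equal logits yield equal probabilities, and the values stay finite
print(probs[1])
```

Without the subtraction, `np.exp(1000.0)` evaluates to `inf` and the division would produce `nan` values.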

### Scaled Dot-Product Attention

The scaled_dot_product_attention() function is a component of the attention mechanism in Transformer models. It calculates attention between sets of queries (Q), keys (K) and values (V).

Essentially, this function allows the model to give different importance to different parts of the input, a key aspect that makes Transformer models particularly effective for NLP and other sequential tasks.

In [19]:
# Define the function that computes scaled dot-product attention
def scaled_dot_product_attention(Q, K, V):
    
    # Calculate the dot product between Q and the transpose of K
    matmul_qk = np.dot(Q, K.T)
    
    # Gets the dimension of the key vectors
    depth = K.shape[-1]
    
    # Scale the logits by dividing them by the square root of the depth
    logits = matmul_qk / np.sqrt(depth)
    
    # Apply the softmax function to obtain the attention weights
    attention_weights = softmax(logits)
    
    # Multiply the attention weights by the V values to get the final output
    output = np.dot(attention_weights, V)
    
    # Returns the weighted output
    return output
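
The function above returns only the weighted output. To inspect the intermediate attention weights, a variant along the following lines could be used. This is a sketch under assumptions: `attention_with_weights` and the random input `x` are hypothetical names introduced here, and the softmax is inlined so the block runs on its own.

```python
import numpy as np

def attention_with_weights(Q, K, V):
    # Hypothetical variant that also returns the attention weights
    # Scores: one row per query, one column per key, scaled by sqrt of the key dimension
    scores = Q @ K.T / np.sqrt(K.shape[-1])

    # Row-wise softmax, shifted by the row maximum for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Weighted combination of the value vectors
    return weights @ V, weights

# Stand-in for an embedded sequence: 10 tokens, 64 dimensions each
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))

# Self-attention: Q, K and V all come from the same input
out, w = attention_with_weights(x, x, x)

print(out.shape)  # (10, 64)
print(w.shape)    # (10, 10)
```

Row `i` of `w` holds the weights that position `i` assigns to every position in the sequence, and each row sums to 1.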

### Model Output with Linear and Softmax Operation

The linear_and_softmax() function is a combination of a linear layer followed by a softmax function, commonly used in deep learning models, especially in classification tasks.

In [20]:
# Defines the function that applies a linear transformation followed by softmax
def linear_and_softmax(input):
    
    # Initialize a weight matrix with normally distributed random values
    # This matrix connects each model dimension (dim_model) to each vocabulary word (vocab_size)
    weights = np.random.randn(dim_model, vocab_size)
    
    # Performs the linear operation (matrix product) between the input and the weight matrix
    # The result, logits, holds one raw score per vocabulary token for each position in the sequence
    logits = np.dot(input, weights)
    
    # Apply the softmax function to the logits
    # This transforms each row of logits into a probability distribution that sums to 1
    return softmax(logits)
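
The shapes involved in this step can be traced with a short sketch. The array `hidden` is a hypothetical stand-in for the attention output, and the softmax is inlined so the block is self-contained.

```python
import numpy as np

# Values mirroring the hyperparameters defined earlier in this section
dim_model, vocab_size, seq_length = 64, 100, 10

# Seeded generator for reproducibility (an assumption, not in the original)
rng = np.random.default_rng(0)

# Hypothetical stand-in for the attention output: one vector per position
hidden = rng.standard_normal((seq_length, dim_model))

# Projection matrix from model dimension to vocabulary size
weights = rng.standard_normal((dim_model, vocab_size))

# Linear step: (10, 64) @ (64, 100) -> (10, 100)
logits = hidden @ weights

# Row-wise softmax, shifted by the row maximum for numerical stability
shifted = logits - logits.max(axis=-1, keepdims=True)
exp_shifted = np.exp(shifted)
probs = exp_shifted / exp_shifted.sum(axis=-1, keepdims=True)

# Most probable vocabulary index at each position
predicted = probs.argmax(axis=-1)

print(logits.shape)     # (10, 100)
print(predicted.shape)  # (10,)
```

Each of the 10 positions ends up with a distribution over the 100 vocabulary entries, and `argmax` picks one token index per position.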

### Building the Final Model

In [21]:
# Final model function
def transformer_model(input):
    
    # Embedding
    embedded_input = embedding(input, vocab_size, dim_model)
    
    # Attention Mechanism
    attention_output = scaled_dot_product_attention(embedded_input, embedded_input, embedded_input)
    
    # Linear layer and softmax
    output_probabilities = linear_and_softmax(attention_output)
    
    # Choosing the indices with the highest probability
    output_indices = np.argmax(output_probabilities, axis=-1)
    
    return output_indices

## Using the Model for Predictions

In [None]:
# Generating random data for model input
input_sequence = np.random.randint(0, vocab_size, seq_length)

print("Input Sequence:", input_sequence)

# Making predictions with the model
output = transformer_model(input_sequence)

print("Model Output:", output)
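
Because `embedding()` and `linear_and_softmax()` draw new random matrices on every call, running the model twice on the same input will generally give different outputs. One possible way to make the result repeatable is to create the parameters once and pass them in. The sketch below assumes this design; `init_params`, `softmax_rows`, `predict` and the seed values are hypothetical names and choices, not part of the original code.

```python
import numpy as np

# Values mirroring the hyperparameters defined earlier in this section
dim_model, seq_length, vocab_size = 64, 10, 100

def init_params(seed):
    # Hypothetical helper: create the random matrices once, from a fixed seed
    rng = np.random.default_rng(seed)
    return {
        "embed": rng.standard_normal((vocab_size, dim_model)),
        "proj": rng.standard_normal((dim_model, vocab_size)),
    }

def softmax_rows(x):
    # Softmax along the last axis, shifted for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def predict(tokens, params):
    # Embedding lookup: one row per token index
    x = params["embed"][tokens]

    # Single self-attention step with Q = K = V = x
    weights = softmax_rows(x @ x.T / np.sqrt(x.shape[-1]))
    attended = weights @ x

    # Linear projection to the vocabulary, softmax, then argmax per position
    probs = softmax_rows(attended @ params["proj"])
    return probs.argmax(axis=-1)

params = init_params(seed=42)
tokens = np.random.default_rng(0).integers(0, vocab_size, size=seq_length)

first = predict(tokens, params)
second = predict(tokens, params)

print(np.array_equal(first, second))  # True
```

With the parameters held fixed, the same input always maps to the same output, which is also the arrangement needed before the weights could be trained.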