# Transformers

## Introduction

Transformers were introduced in 2017 by a team of researchers, most of them at Google, and have since reshaped machine learning, particularly for sequential data such as text. Their architecture processes an entire sequence in parallel, a major shift from earlier models that processed data one element at a time. This makes transformers better at capturing context in tasks like language processing and faster to train on parallel hardware. Their versatility has also led to significant advances beyond language, in areas such as computer vision and audio processing, making them a popular choice across artificial intelligence.

## Evolution of Text Processing Approaches Prior to Transformers

### Bag-of-Words model

The Bag-of-Words model was one of the earliest and simplest techniques. It represents a text as a bag of words: a multiset that records how often each word occurs and ignores word order. While easy to understand and implement, this approach has significant drawbacks. It cannot capture the context or the ordering of words, which limits it in complex language tasks. For example, it cannot differentiate between "dog bites man" and "man bites dog," since both sentences contain the same words with the same frequency.
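The limitation can be checked directly. The sketch below is a minimal illustration that uses Python's `collections.Counter` as the bag; the helper name `bag_of_words` is an assumption made for this example and is not taken from any library.

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, split on whitespace, and count each word; word order is discarded.
    return Counter(sentence.lower().split())

bow_a = bag_of_words("dog bites man")
bow_b = bag_of_words("man bites dog")

# Both sentences produce the same counts, so the model cannot tell them apart.
print(bow_a == bow_b)  # True
```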

### Recurrent Neural Networks (RNNs)

RNNs marked a significant advancement. They processed text sequences word by word and could carry information forward from previous words. This sequential processing made them better at understanding context than Bag-of-Words. However, RNNs struggled with long sequences. They often failed to retain information from the beginning of the text by the time they reached the end, a limitation often described as "short-term memory" and caused by the vanishing gradient problem: the training signal from early words shrinks as it is propagated back through many time steps. This made them less effective for tasks involving longer texts.
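The fading influence of early words can be seen even in a toy recurrence. The sketch below assumes a scalar hidden state and hand-picked weights (`w_hidden`, `w_input`), which are illustrative values rather than trained parameters. It compares how much the first input changes the final state for a short and for a long sequence.

```python
import math

def run_rnn(inputs, w_hidden=0.5, w_input=1.0):
    # Toy RNN with a scalar hidden state: h_t = tanh(w_hidden * h_{t-1} + w_input * x_t)
    h = 0.0
    for x in inputs:
        h = math.tanh(w_hidden * h + w_input * x)
    return h

def first_input_effect(length):
    # Difference in the final state when only the first input changes from 0 to 1
    with_signal = run_rnn([1.0] + [0.0] * (length - 1))
    without_signal = run_rnn([0.0] * length)
    return abs(with_signal - without_signal)

short_effect = first_input_effect(2)
long_effect = first_input_effect(21)

print(short_effect > long_effect)  # True: the first input matters far less after many steps
```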

### Long Short-Term Memory networks (LSTMs)

LSTMs were developed to address the short-term memory issue of RNNs. They could retain information over longer spans, making them more suitable for complex language tasks involving longer sequences. However, they still processed data sequentially, which made training time-consuming and computationally expensive, especially on large datasets. Because each step depends on the result of the previous one, the computation cannot be parallelized across the sequence, and parallelization is a key factor in speeding up the training of machine learning models.

## "Attention is All You Need"

Then, in 2017, the groundbreaking paper "Attention Is All You Need" introduced the transformer model. This paper marked a significant shift in how machine learning models handle sequential data, like text. The researchers proposed a new architecture that, unlike its predecessors, did not rely on the recurrent processing used by RNNs and LSTMs. Attention mechanisms had already been used alongside recurrent networks; the new architecture dropped recurrence entirely and relied on attention alone, which lets the model focus on different parts of the input and improves its ability to capture context and relationships within the text.

<center><img src="assets/attention.png" width="500" height="600"/></center>

This image shows the transformer architecture as presented in the original paper "Attention Is All You Need." Using this diagram as a guide, we will follow how a transformer processes information, tracing the data from input to output and looking at the roles of embeddings, attention mechanisms, and neural network layers along the way.

Our running example is translating a sentence from English to Russian: the sentence "I love Machine Learning" becomes "Я люблю Машинное Обучение." To produce this translation, the model performs a series of operations, each of which helps carry the meaning and context of the English sentence over to its Russian counterpart.

> **NOTE:** After each step, a dummy code example will be used to illustrate the idea. This allows us to demonstrate the concept without introducing the complexity of actual mechanisms used in transformers. In the original transformer architecture, the mechanisms behind the scenes are more complicated.

### Step 1: Tokenization and Embedding

The first step is to prepare the input sentence for the model. The sentence "I love Machine Learning" is broken down into tokens. In this case, tokens could correspond to words or subwords. This process is known as tokenization. Each token is then converted into a numerical representation called an embedding. These embeddings are vectors that capture the meaning of the tokens in a high-dimensional space.

In [75]:
# The sentence is split into tokens (words in this case).
tokens = "I love Machine Learning".split()

# Assuming we have a predefined vocabulary with associated simple embeddings.
vocabulary = {
    "I": [1, 0, 0],
    "love": [0, 1, 0],
    "Machine": [0, 0, 1],
    "Learning": [1, 1, 0]
}

# Convert each token into its corresponding embedding from the vocabulary.
embeddings = [vocabulary[token] for token in tokens]

print(embeddings)

[[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]



In a real-world scenario, embeddings have hundreds of dimensions (512 in the original transformer) and are learned during training. Subword tokenization methods like Byte Pair Encoding (BPE) are used instead of splitting on whitespace. BPE starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols, which yields a vocabulary of frequent subwords. This lets the model handle a wide range of words, including rare and compound words, by composing them from known pieces. The approach captures subword-level information and keeps the vocabulary size manageable, which makes the model more robust.
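To make the merging idea concrete, here is a toy sketch of how BPE learns merges. It is not a production tokenizer: the corpus, its word frequencies, and the helper names `most_frequent_pair` and `merge_pair` are all assumptions made for this illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to how often that word occurs in the corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word starts as a sequence of single characters
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for _ in range(2):  # learn two merges
    vocab = merge_pair(vocab, most_frequent_pair(vocab))

print(list(vocab))
```

After two merges the frequent ending "est" has become a single symbol shared by "newest" and "widest", which is the kind of reusable subword the paragraph above describes.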

Furthermore, learned embeddings tend to place words with similar meanings close together in the vector space. This proximity is crucial, as it allows the model to recognize and use the semantic relationships between words, improving its ability to understand and generate coherent and contextually relevant language.
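Closeness in vector space is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors for "cat", "dog", and "car"; they are assumptions chosen for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means they point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked so that "cat" and "dog" are close
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

print(round(sim_cat_dog, 2), round(sim_cat_car, 2))
```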

### Step 2: Positional Encoding

After tokenization and embedding, each embedding receives a positional encoding. This step is essential because, unlike RNNs, the transformer architecture doesn't inherently process the sequence order. Positional encodings add information to each token's embedding to provide the model with the sequence order of the words.

In [78]:
dimensionality = len(embeddings[0])  # Ensure positional encodings match the dimensionality of the embeddings

# Generate simplified positional encodings, scaled to not overpower the embeddings
positional_encodings = [[(i+1) / 10] * dimensionality for i in range(len(embeddings))]

# Combine the positional encodings with the original embeddings
encoded_embeddings = [[e + p for e, p in zip(embedding, pos_encoding)] 
                      for embedding, pos_encoding in zip(embeddings, positional_encodings)]

print(encoded_embeddings)

[[1.1, 0.1, 0.1], [0.2, 1.2, 0.2], [0.3, 0.3, 1.3], [1.4, 1.4, 0.4]]


In [79]:
# Each vector now contains both the semantic meaning from the original embeddings and the positional information.
[
 [1.1, 0.1, 0.1],  # Encoded embedding for "I"
 [0.2, 1.2, 0.2],  # Encoded embedding for "love"
 [0.3, 0.3, 1.3],  # Encoded embedding for "Machine"
 [1.4, 1.4, 0.4]   # Encoded embedding for "Learning"
]

[[1.1, 0.1, 0.1], [0.2, 1.2, 0.2], [0.3, 0.3, 1.3], [1.4, 1.4, 0.4]]

In real-world transformers such as the original one, positional encodings are computed with sine and cosine functions of different frequencies and added to the token embeddings. Each position receives a distinct pattern of values, so the model can tell positions apart. The authors chose these functions for two reasons: the encoding of a position at a fixed offset can be expressed as a linear function of the current position's encoding, which should make it easy to attend by relative position, and the pattern may let the model extrapolate to sequences longer than those seen in training.
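A minimal sketch of the sinusoidal scheme follows. It uses the formulas from the original paper, where even dimensions hold sin(pos / 10000^(2i/d_model)) and odd dimensions hold the matching cosine. The function name and the tiny sizes (4 positions, 4 dimensions) are assumptions for this illustration.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    encodings = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency; even uses sin, odd uses cos
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encodings.append(row)
    return encodings

pe = sinusoidal_positional_encoding(num_positions=4, d_model=4)

for row in pe:
    print([round(value, 3) for value in row])
```

Position 0 always encodes to alternating 0 and 1 (sin(0) and cos(0)), and every later position gets a different row, which is what allows the model to distinguish positions.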

### Step 3: Self-Attention Mechanism

After adding positional encodings to the embeddings, the next step in the transformer architecture is the self-attention mechanism. This is where the transformer starts to analyze and interpret the sentence, focusing on different parts of it to understand the context and relationships between words.

In the self-attention process, the model calculates an attention score for each word in relation to every word in the sentence, itself included. These scores determine how much focus each of the other words receives when a specific word is processed. For example, in our sentence "I love Machine Learning," the model might learn to pay more attention to "love" when processing the words "Machine" and "Learning," since these words are closely related in this context.

In [82]:
def simplified_self_attention(encoded_embeddings):
    attention_scores = []

    # Iterate over each word embedding in the sentence
    for word_embedding in encoded_embeddings:
        scores = []

        # For each word, calculate its attention score with every word in the sentence, itself included
        for other_embedding in encoded_embeddings:
            # Simplified calculation: dot product of the two embeddings
            score = sum(e1 * e2 for e1, e2 in zip(word_embedding, other_embedding))
            scores.append(score)

        # Collect the scores for this word against all other words
        attention_scores.append(scores)

    # Normalize the scores for each word to create attention weights that sum to 1
    # (dividing by the sum only works here because every score is positive;
    # real transformers apply a softmax instead)
    attention_weights = [[s / sum(scores) for s in scores] for scores in attention_scores]

    # Apply the calculated attention weights to the embeddings
    attended_embeddings = []
    for i, weights in enumerate(attention_weights):
        # Initialize a zero vector for the attended embedding
        attended_embedding = [0] * len(encoded_embeddings[0])

        # Apply each weight to the corresponding word embedding
        for j, weight in enumerate(weights):
            # Update the attended embedding by adding weighted contributions from all embeddings
            attended_embedding = [e + weight * encoded_embeddings[j][k] for k, e in enumerate(attended_embedding)]

        # Store the attended embedding for this word
        attended_embeddings.append(attended_embedding)

    return attended_embeddings

# Applying the simplified self-attention to our encoded embeddings
attention_output = simplified_self_attention(encoded_embeddings)

print(attention_output)

[[1.0473684210526315, 0.8184210526315789, 0.4], [0.8173913043478261, 1.0695652173913044, 0.4434782608695653], [0.7136363636363635, 0.7568181818181816, 0.7181818181818181], [0.9152173913043478, 0.95, 0.4326086956521739]]


This code demonstrates a very basic version of self-attention. In reality, the process involves three matrices called 'queries', 'keys', and 'values', which are obtained from the embeddings through learned linear projections. Attention scores are dot products of queries and keys, scaled by the square root of the key dimension and passed through a softmax; the resulting weights are then applied to the values. This mechanism allows the transformer to capture and use the contextual relationships within the sentence, which is essential for understanding and generating human-like language.
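The following sketch shows scaled dot-product attention with queries, keys, and values in plain Python. The projection matrices would normally be learned; here they are set to the identity matrix purely as a placeholder assumption, and the helper names (`matmul`, `transpose`, `softmax`) are defined only for this example.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of a (n x m) and b (m x p)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    # Project the inputs into queries, keys, and values
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Score every query against every key, then scale by sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors
    return matmul(weights, v), weights

x = [[1.1, 0.1, 0.1], [0.2, 1.2, 0.2], [0.3, 0.3, 1.3], [1.4, 1.4, 0.4]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # placeholder for learned weights

attended, weights = scaled_dot_product_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Unlike dividing by the sum of the scores, the softmax yields valid weights even when some scores are negative.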

Transformers also use "multi-head" attention. The term refers to the model's ability to attend to different parts of the sentence in multiple, distinct ways at the same time. Imagine it as having multiple 'heads', each looking at the sentence from a different perspective or focusing on a different type of relationship between words. Each head runs its own attention computation on a lower-dimensional projection of the embeddings, and the results of all heads are concatenated. This gives the transformer a richer, more diverse interpretation of the sentence than a single attention computation would.
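As a rough sketch of the multi-head idea, the code below splits each embedding into equal slices, runs attention on each slice independently, and concatenates the results. This is a simplification under stated assumptions: real transformers give each head its own learned projections and apply a final output projection, both of which are omitted here, and the input values are made up.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention over lists of vectors
    d_k = len(k[0])
    output = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        output.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                       for j in range(len(v[0]))])
    return output

def multi_head_attention(x, num_heads):
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Each head sees only its own slice of every embedding
        chunk = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attention(chunk, chunk, chunk))
    # Concatenate the per-head outputs back to d_model dimensions per token
    return [[value for head in head_outputs for value in head[t]] for t in range(len(x))]

x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
multi_head_output = multi_head_attention(x, num_heads=2)

print(len(multi_head_output), len(multi_head_output[0]))  # 3 4
```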

### Step 4: Feed-Forward Neural Network

Once the self-attention mechanism has processed the sentence "I love Machine Learning" and given us new, contextually rich embeddings, the next step in a transformer is a feed-forward neural network. This network is applied to each word's embedding separately, using the same weights for every word.

Think of it as a kind of filter that further refines the meaning of each word. For our translation task, this step is like fine-tuning the understanding of each word ("I", "love", "Machine", "Learning") in the context of the whole sentence.

The feed-forward network in the transformer consists of two layers. It takes the output from the self-attention mechanism, processes it through these layers, and then outputs new embeddings. These new embeddings are still about the same words, but now they've been adjusted even more based on the context of the whole sentence.

In the context of translating from English to Russian, this step helps to ensure that each word's meaning is as accurate and contextually appropriate as possible before the final translation is generated. It's like making sure each piece of the puzzle is in its best form before putting the whole puzzle together.

In [84]:
def simplified_feed_forward_network(attention_output):
    # Dummy parameters for the feed-forward network
    hidden_layer_size = 5  # Typically much larger in real transformers
    output_size = len(attention_output[0])  # Same as the input size

    # Build a new list instead of overwriting attention_output in place,
    # so running this cell more than once always gives the same result
    transformed = []

    # Simple feed-forward network with one hidden layer
    for word_embedding in attention_output:
        # Simulate a hidden layer with arbitrary transformation
        hidden_layer = [sum(word_embedding) * 0.1] * hidden_layer_size

        # Simulate an output layer that brings the dimensions back to the original embedding size
        output_embedding = [sum(hidden_layer) / hidden_layer_size] * output_size

        # Collect the output of the feed-forward network for this word
        transformed.append(output_embedding)

    return transformed

# Applying the simplified feed-forward network to the output of the self-attention mechanism
transformed_output = simplified_feed_forward_network(attention_output)

# Round the values for display
print([[round(value, 4) for value in embedding] for embedding in transformed_output])

[[0.2266, 0.2266, 0.2266], [0.233, 0.233, 0.233], [0.2189, 0.2189, 0.2189], [0.2298, 0.2298, 0.2298]]


In the provided code, we're running a simple feed-forward network on each word's embedding from the self-attention output. This network, with a basic transformation in the hidden layer followed by an output layer, alters each embedding. The purpose is to refine the understanding of each word in the sentence context. The result is a new set of embeddings, modified by this process, representing a more nuanced understanding of each word.

In actual transformers, the feed-forward network has learned weights rather than fixed arithmetic. It consists of two linear transformations with a non-linear activation (ReLU in the original paper) between them, and its hidden layer is much wider than the embedding: the original model uses 2048 hidden units for 512-dimensional embeddings. This processing captures finer details in the data, which is crucial for tasks like translation, where nuances in meaning and context greatly influence output quality.
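A closer, though still tiny, sketch of that structure is shown below: a linear layer, a ReLU, and a second linear layer, applied to each embedding with the same weights. The weight and bias values are hand-picked assumptions for illustration; in a real model they are learned.

```python
def relu(vector):
    # Non-linearity: negative values become zero
    return [max(0.0, v) for v in vector]

def linear(vector, weights, bias):
    # weights has shape (len(vector), out_dim); returns vector @ weights + bias
    return [sum(v * w for v, w in zip(vector, column)) + b
            for column, b in zip(zip(*weights), bias)]

def position_wise_ffn(embeddings, w1, b1, w2, b2):
    # The same two-layer network is applied to every position independently
    return [linear(relu(linear(e, w1, b1)), w2, b2) for e in embeddings]

# Hypothetical parameters: expand from 2 to 4 dimensions, then project back to 2
w1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.0, -1.0]]
b2 = [0.0, 0.0]

ffn_out = position_wise_ffn([[1.0, 2.0], [2.0, 1.0]], w1, b1, w2, b2)

print(ffn_out)  # [[1.0, 0.0], [2.5, 0.5]]
```

Because of the ReLU, the two dimensions of each output differ, unlike the dummy network above, which collapses every embedding to a single repeated value.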

### Step 5: Decoder

The first four steps take place in the Encoder part of the transformer's architecture. The Encoder's result is then passed to the Decoder. Each of these components has its own role.

Encoder's Role:

1. Understanding the Sentence: 
   - The Encoder processes the English sentence "I love Machine Learning" through tokenization, embedding, adding positional encodings, self-attention, and a feed-forward neural network.
  
2. Creating Contextual Representations: 
   - It transforms the sentence into a set of vectors that represent not just the words, but also their context and relationship within the sentence.



Decoder's Role:

1. Receiving Encoder's Output: 
   - The Decoder receives these contextual vectors from the Encoder.

2. Translating Step by Step: 
   - It generates the Russian translation one token at a time, using self-attention over the tokens produced so far, attention over the Encoder's output, and feed-forward layers.

3. Considering Context and Previous Translation: 
   - The Decoder checks what it has already translated, ensuring each new word aligns with the previous ones for contextual accuracy.

4. Final Output: 
   - This results in the Russian translation "Я люблю Машинное Обучение".

The Encoder processes and understands the original sentence, while the Decoder uses this information to generate an accurate and coherent translation.
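The step-by-step generation can be sketched as a simple loop. In the code below, `toy_decoder_step` is a hypothetical stand-in for the real Decoder: it looks up the next token from a fixed list instead of computing it with attention, and the `<eos>` end-of-sequence marker is likewise an assumption of this example. Only the loop structure reflects how decoding proceeds.

```python
def toy_decoder_step(encoder_output, generated):
    # Hypothetical stand-in for the Decoder: a real one would attend to
    # encoder_output and to the tokens in `generated` to predict the next token.
    target = ["Я", "люблю", "Машинное", "Обучение", "<eos>"]
    return target[len(generated)]

def greedy_decode(encoder_output, max_len=10):
    generated = []
    while len(generated) < max_len:
        # Each step sees everything generated so far
        token = toy_decoder_step(encoder_output, generated)
        if token == "<eos>":  # stop once the end-of-sequence marker appears
            break
        generated.append(token)
    return generated

# Placeholder for the Encoder's contextual vectors (unused by the toy step)
encoder_output = [[0.0, 0.0, 0.0]] * 4

translation = greedy_decode(encoder_output)
print(" ".join(translation))  # Я люблю Машинное Обучение
```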

In [91]:
from jupyterquiz import display_quiz
display_quiz("assets/transformers_questions.json")
