# BERT Architecture Explained
## MI201

## **Group 4**:
- Diego FLEURY CORRÊA DE MORAES
- Hazael SOLEDADE DE ARAUJO JUMONJI
- Lucas DE OLIVEIRA MARTIM

### Project 3 : **Sentiment Analysis Using LLMs**

### Introduction

In this notebook, we aim to answer questions 5 and 6 in more detail, providing the necessary context without overwhelming the main experiments notebook with too many cells. Transformer-based architectures have revolutionized NLP, so it's important to explain them properly. To keep the focus on theory while minimizing code, we've opted for this separate, less code-intensive notebook.

In addition to reading the [paper](https://arxiv.org/pdf/1810.04805), we relied on a [visual intuition](https://www.youtube.com/watch?v=wjZofJX0v4M&pp=ygUJM2IxYiBMTG1z) of the general picture of LLMs and, lastly, an excellent [series of blog posts](https://jalammar.github.io/).

**NOTE**: The images from this notebook are taken from a series of Jay Alammar's blog posts. They're often used in media for explanation purposes, and are licensed under [Creative Commons](http://creativecommons.org/licenses/by-nc-sa/4.0/). Here are the references:


- *Alammar, J* (2018). **The Illustrated Transformer** [Blog post].
  * Retrieved from https://jalammar.github.io/illustrated-transformer/

- *Alammar, J* (2018). **Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)** [Blog post].
  * Retrieved from https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

- *Alammar, J* (2018). **The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)** [Blog post].
  * Retrieved from https://jalammar.github.io/illustrated-bert/

- *Alammar, J* (2019). **A Visual Guide to Using BERT for the First Time** [Blog post].
  * Retrieved from https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/


In [16]:
from IPython.display import HTML # Needed for media

# Attention Is All You Need

The **"Attention Is All You Need"** [paper](https://arxiv.org/pdf/1706.03762) introduced the Transformer, a model that revolutionized Natural Language Processing (NLP). Unlike previous architectures that relied on recurrence (Recurrent Neural Networks), the Transformer captures context entirely through self-attention, allowing for faster training and better handling of long-range dependencies. This innovation laid the foundation for modern models like BERT and GPT, making attention-based architectures the new standard in deep learning for NLP.

## Natural Language Processing

**But what is NLP ?**

NLP stands for Natural Language Processing, and is a field of artificial intelligence that enables machines to understand, interpret, and generate human language. It combines linguistics and machine learning to power many real-world applications.



### Embeddings (Classical vs Contextual)

However, a fundamental challenge in NLP is that computers **cannot** process *raw* text **directly** - they operate on **numerical** data. This means that any text input must first be transformed into a numerical representation. Early approaches relied on simple techniques like **one-hot encoding** or **TF-IDF**, but these methods failed to capture the more nuanced **semantic** relationships between words. To address this, more advanced techniques called **embeddings** were developed, mapping words to **dense** (not one-hot-encoded) vector representations that encode semantic meaning. Directions in the resulting vector space can therefore correspond to aspects of meaning.
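The contrast between one-hot and dense representations can be sketched in a few lines. This is only an illustration: the three-word vocabulary and the dense vectors are hand-picked toy values (not learned embeddings), and `cosine` is a helper defined here for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary

# One-hot: one dimension per word, so any two distinct words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense: hand-picked toy values (NOT learned), chosen so that "cat" and "dog"
# point in a similar direction while "car" does not.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot vectors every pair of words is equally unrelated, whereas dense vectors leave room for "cat" to sit closer to "dog" than to "car".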

Even so, static embeddings like **Word2Vec** or **GloVe** assign a **fixed** vector to each word, ignoring context (for example, "stick" has a different meaning in the sentence "stick to it" than in "my dog found a nice stick in the park"). This limitation led to the development of **contextual** embeddings, where the representation of a word **dynamically** changes based on its surrounding words - an approach central to Transformer-based models like BERT.

In [None]:
contextEmbeddingURL = "https://jalammar.github.io/images/context.png"
HTML(f"<img src='{contextEmbeddingURL}' width='800' height='300' controls></img>")

### What are sequence to sequence and language models

One of the most important distinctions in the field concerns the nature of the task, whether in its **structure** or its **objective**.

Practical applications may involve generating a **single** output from a **sequence** of text (*sequence-to-one*, e.g., sentiment classification), a **sequence** from a **single** input (*one-to-sequence*, e.g., text generation from a prompt) or a new **sequence** of text from an input **sequence** (*sequence-to-sequence*, e.g., machine translation).

In [22]:
seq2seqURL= "https://jalammar.github.io/images/seq2seq_1.mp4"
HTML(f"<video src='{seq2seqURL}' autoplay loop width='800' height='600' controls></video>")

**Language models** are a specific type of NLP model concerned with predicting the next word in a sequence of text, given the preceding words (which provide the context). This objective implies that the model must learn the probability distribution of words and phrases, which captures implicit **semantic** relationships. However, while this process helps model linguistic structure, it does not equate to true comprehension - this falls under the domain of **Natural Language Understanding** (NLU).
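To make "predicting the next word" concrete, here is a deliberately simple sketch: a count-based bigram model, which conditions only on the single previous word. It is far cruder than the neural models discussed in this notebook, and the tiny corpus and the function name `next_word_distribution` are made up for the illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already split into tokens.
corpus = "the movie was great . the movie was boring . the acting was great .".split()

# Count how often each word follows each previous word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Estimate P(next word | previous word) from raw counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

dist = next_word_distribution("was")
print(dist)  # "great" gets probability 2/3, "boring" 1/3
```

A neural language model plays the same role, but conditions on a much longer context and generalizes to word combinations it has never seen.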


The "*Attention is All You Need*" paper introduced the transformer architecture as a **language model** for a **sequence-to-sequence** task: **neural machine translation**, and we'll briefly go into detail here, as it's crucial for the comprehension of the BERT paper.

In [102]:
seq2seq_enc_nmt_decURL = "https://jalammar.github.io/images/seq2seq_4.mp4"
HTML(f"<video src='{seq2seq_enc_nmt_decURL}' autoplay loop width='800' height='600' controls></video>")

## Pre-BERT : Recurrent Neural Networks

Let's first see what the previous solutions to the problem were, prior to **BERT**.

Contextual embeddings using neural networks were steadily advancing, with **Recurrent Neural Networks** (RNNs) being the primary tool of choice. Their inherently **sequential** nature made them a natural choice for modelling the nuances of text processing, as they process input word by word, maintaining a notion of order and context. This solved, in theory, the problem of conditioning the meaning of each word on its context.


The core mechanism of RNNs is the **hidden state** $ h_t $, which acts as a memory that carries **information** from previous words in a sequence. At each time step $t$, the RNN updates its hidden state as follows:

$h_t = f(W_h h_{t-1} + W_x x_t)$

where:
- $x_t$ is the current input word,
- $h_{t-1}$ is the hidden state from the previous time step,
- $W_h$ and $W_x$ are weight matrices,
- $f$ is a non-linear activation function (typically tanh or ReLU).

This allows information to be passed from **word to word**, enabling the model to capture sequential dependencies. Training happens through **Backpropagation Through Time** (BPTT), where gradients are propagated backward across all time steps to adjust the model's parameters.
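The update rule above can be sketched with NumPy as follows. The dimensions, the random weights and the function name `rnn_forward` are assumptions made for the example; a real model would learn `W_h` and `W_x` and usually add a bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 4, 3, 5   # toy sizes, chosen arbitrarily

W_x = rng.normal(scale=0.5, size=(d_hidden, d_in))      # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
xs = rng.normal(size=(seq_len, d_in))                   # one random vector per "word"

def rnn_forward(xs, W_h, W_x):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t) over the sequence, with h_0 = 0."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:                       # inherently sequential: step t needs h_{t-1}
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)

hidden_states = rnn_forward(xs, W_h, W_x)
print(hidden_states.shape)  # (5, 3)
```

The `for` loop is the important part: each hidden state depends on the previous one, so the time steps cannot be computed in parallel.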

While in theory this mechanism enables learning long-range dependencies, in practice, RNNs struggle with it due to the vanishing gradient problem - gradients become too small during backpropagation, preventing effective learning of distant relationships. Even architectures designed to mitigate this, such as **Long Short-Term Memory** (LSTM) and **Gated Recurrent Units** (GRU), still face limitations when dealing with very long sequences.
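The vanishing gradient effect can be seen even in a scalar RNN. The sketch below tracks how strongly the hidden state at step $t$ still depends on the initial state; the weights, the inputs and the function name `gradient_through_time` are arbitrary choices made for illustration.

```python
import math

def gradient_through_time(w_h, w_x, xs):
    """Return |d h_t / d h_0| for each step t of the scalar RNN
    h_t = tanh(w_h * h_{t-1} + w_x * x_t), starting from h_0 = 0."""
    h, grad, history = 0.0, 1.0, []
    for x_t in xs:
        h = math.tanh(w_h * h + w_x * x_t)
        # Chain rule: d h_t / d h_{t-1} = (1 - h_t^2) * w_h
        grad *= (1.0 - h * h) * w_h
        history.append(abs(grad))
    return history

xs = [0.5, -1.0, 0.3, 0.8, -0.2] * 6   # 30 arbitrary toy inputs
history = gradient_through_time(w_h=0.5, w_x=1.0, xs=xs)
print(history[0], history[-1])
```

Because each step multiplies the gradient by a factor of magnitude at most 0.5 here, the influence of the first word on the last hidden state shrinks geometrically with the distance between them.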

These challenges paved the way for **Transformers**, which fully replace *recurrence* with **attention**, enabling **parallel** processing and removing the constraints of sequential models. This led to architectures like BERT, which efficiently leverage computational power to capture context.

In [106]:
RNN_explanationURL = "https://jalammar.github.io/images/RNN_1.mp4"
HTML(f"<video src='{RNN_explanationURL}' autoplay loop width='800' height='600' controls></video>")

Notice, in the model below, how long it takes to process a sequence. In this neural machine translation example, the input has to flow through many RNN cells before reaching the decoder, and the bottleneck vector (the one between the encoder and decoder) needs to capture **ALL** of the original phrase's semantic meaning, regardless of its length. This is a very difficult task, given that the vector always has the same dimensions.

In [81]:
RNN_neural_machine_translationURL = "https://jalammar.github.io/images/seq2seq_6.mp4"
HTML(f"<video src='{RNN_neural_machine_translationURL}' autoplay loop width='800' height='600' controls></video>")

## The Attention Mechanism

Enough teasing! What, then, is the attention mechanism?

The attention mechanism comes from two principles:

- **Neural Networks benefit greatly from scale**, and large-scale performance cannot be achieved within reasonable time and cost constraints without **parallelization**. Powerful NLP models should, therefore, be **parallelizable**, not sequential.

- Whatever mechanism these networks apply should have **constant-time access between each pair of words** in the internal data structure, and optimization should take that into consideration. This allows the network to capture context independently of the length of the sequence.


**The Core Idea of Attention**

Instead of processing words one at a time like RNNs, the attention mechanism allows a model to **weigh** the importance of all words in a sentence **simultaneously** when interpreting a given word. This is done by computing attention **scores** between each word pair, creating a weighted representation of **context**.

Mathematically, attention computes an output vector for each word as a weighted sum over all the words in the sentence (the word itself included). Each weight reflects how **relevant** the other word is to the current one, and is computed from the input using parameters that are learned during training.

In [29]:
attentionURL = "https://jalammar.github.io/images/t/transformer_self-attention_visualization.png"
HTML(f"<img src='{attentionURL}' width='600' height='600' controls></img>")

The objective is to model a transformation in the embedding space: from the input vectors to a **richer contextual representation** at the end. All of this follows a nice [visual intuition](https://www.youtube.com/watch?v=wjZofJX0v4M&pp=ygUJM2IxYiBMTG1z).

The idea is to let the words "interact" and the way this is implemented is by making a strong analogy with database concepts:

 - Each word **broadcasts** its useful semantic meaning (*key*).
 - Each word **requests** relevant information from the rest of the words (*query*).
 - The way these interactions affect the word's representation is captured through the **weighted** update of a vector (*value*) that modifies the position of the original embedding in the embedding space, such that it "soaks in" context from its surroundings and moves incrementally towards a more precise representation.

All of this is constructed via the multiplication of the original vector by weight matrices for each component (**Q,K,V**).

In [None]:
attention_computationURL = "https://jalammar.github.io/images/t/self-attention-matrix-calculation.png"
HTML(f"<img src='{attention_computationURL}' width='600' height='700' controls></img>")

This happens for all the words in the input.

In [34]:
attention_weightsURL = "https://jalammar.github.io/images/t/transformer_self_attention_vectors.png"
HTML(f"<img src='{attention_weightsURL}' width='900' height='600' controls></img>")

The query-key scores are turned into weights with a softmax operation, so that the weights for each word are positive and sum to 1.

In [46]:
softmaxedURL = "https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png"
HTML(f"<img src='{softmaxedURL}' width='900' height='400' controls></img>")

The **K, Q, V representation** enables **parallelization** because all words are transformed into these vectors simultaneously, avoiding sequential dependencies. It also ensures **constant-time access** between words, as attention scores are computed in a single matrix operation, allowing each word to directly interact with all others regardless of sequence length.
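Putting the pieces together, self-attention for a whole sentence reduces to a few matrix operations. The sketch below follows the scaled dot-product formulation of the Transformer paper, $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$; the sizes and random matrices are toy assumptions, and `self_attention` is a name chosen for this example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over all rows of X at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # one projection per role
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every query against every key
    weights = softmax(scores)                    # each row sums to 1
    return weights @ V, weights                  # weighted sum of value vectors

rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 3                  # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_words, d_model))          # stand-in for word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_q, W_k, W_v)
print(Z.shape, weights.shape)  # (4, 3) (4, 4)
```

There is no loop over positions: the `(4, 4)` weight matrix holds the interaction of every word with every other word, computed in one shot.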

## The Transformer Architecture

The Transformer architecture was originally designed for **machine translation**, requiring two main components: an **encoder** (processing the source language) and a **decoder** (generating the target language). While both use attention mechanisms, they apply them differently based on their roles.  

The **encoder** relies on **self-attention**, where each word attends to all others in the input sequence, capturing full contextual meaning. The **decoder**, however, performs **masked self-attention**, which means each word can only attend to **previous** words in the sequence. This is because the decoder is functioning as a **language model**, generating text **autoregressively** - predicting words one at a time, and feeding them back into its input, without access to future tokens, ensuring valid sequential output.

Additionally, the decoder includes **cross-attention** (represented by the "Encoder-Decoder Attention" block in the image below), which allows it to attend to the encoder's output, effectively learning how to map the input sentence to its translated version.
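A common way to implement the decoder's masking, sketched here under toy assumptions, is to set the scores of future positions to $-\infty$ before the softmax, so that they receive a weight of exactly zero. The function name `causal_attention_weights` and the all-zero scores are made up for the illustration.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores after hiding every position to the right of each query."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                                 # exp(-inf) = 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))            # toy case: all query-key scores equal
weights = causal_attention_weights(scores)
print(weights)
```

With equal scores, the first word attends only to itself, the second splits its attention evenly over the first two words, and so on; nothing above the diagonal ever receives weight.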

In [107]:
transformerHoleURL = "https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png"
HTML(f"<img src='{transformerHoleURL}' width='1000' height='600' controls></img>")

### Multi-Head Self-Attention

One important sidenote: the transformer doesn't employ just a single set of attention weights for each layer, but several of them, each called an "attention head". The idea is to allow each head to specialize in different types of word semantics.

In [47]:
multiheadURL = "https://jalammar.github.io/images/t/transformer_attention_heads_qkv.png"
HTML(f"<img src='{multiheadURL}' width='900' height='600' controls></img>")

In the end, all the outputs from the attention heads are concatenated.

In [49]:
multiheadConcatURL = "https://jalammar.github.io/images/t/transformer_attention_heads_z.png"
HTML(f"<img src='{multiheadConcatURL}' width='1000' height='600' controls></img>")

And a (learned) output-weight matrix returns the vectors to their original size. Each head is another opportunity for parallelization.
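The split, concatenate and project steps can be sketched as follows. The number of heads, the dimensions and the random weights are toy assumptions, and `multi_head_attention` is a name chosen for this example. The heads are looped over for readability, whereas real implementations usually batch them into a single tensor operation.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:                       # heads are independent of each other
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                   # (n_words, d_head)
    concat = np.concatenate(outputs, axis=-1)         # (n_words, n_heads * d_head)
    return concat @ W_o                               # back to (n_words, d_model)

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 4, 8, 2                   # toy sizes
d_head = d_model // n_heads
X = rng.normal(size=(n_words, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))    # output-weight matrix

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (4, 8)
```

The output has the same shape as the input, which is what allows these layers to be stacked.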

In [50]:
transformerURL = "https://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png"
HTML(f"<img src='{transformerURL}' width='1000' height='600' controls></img>")

In the end, the encoder's output is passed to the decoder side of things, where it provides the keys and values for cross-attention with the queries generated in the decoder.

In [None]:
neural_machine_translationpt1URL = "https://jalammar.github.io/images/t/transformer_decoding_1.gif"
HTML(f"<img src='{neural_machine_translationpt1URL}' width='1000' height='600' controls></img>")

In [None]:
neural_machine_translationpt2URL = "https://jalammar.github.io/images/t/transformer_decoding_2.gif"
HTML(f"<img src='{neural_machine_translationpt2URL}' width='1000' height='600' controls></img>")

# BERT

## What is BERT

In [None]:
bert_motivationURL = "https://jalammar.github.io/images/bert-transfer-learning.png"
HTML(f"<img src='{bert_motivationURL}' width='800' height='400' controls></img>")

In [94]:
BERT_tasks = "https://jalammar.github.io/images/bert-tasks.png"
HTML(f"<img src='{BERT_tasks}' width='800' height='500' controls></img>")

## Internal Architecture

In [63]:
encoder_archURL = "https://jalammar.github.io/images/t/transformer_resideual_layer_norm.png"
HTML(f"<img src='{encoder_archURL}' width='800' height='600' controls></img>")

### Input (Tokens, embeddings, segment embeddings, positional encodings)

In [56]:
positional_encodingURL = "https://jalammar.github.io/images/t/transformer_positional_encoding_example.png"
HTML(f"<img src='{positional_encodingURL}' width='1000' height='300' controls></img>")


In [60]:
positional_encoding_plotURL = "https://jalammar.github.io/images/t/attention-is-all-you-need-positional-encoding.png"
HTML(f"<img src='{positional_encoding_plotURL}' width='900' height='500' controls></img>")

### Multi-Head Self-Attention

### Normalization, Skip Connections and Feed-Forward

Layer normalization is used instead of batch normalization because of:

- Better adaptation to sequence data
- Ease of parallelization
- Same computation at train & test time

Skip connections:

- Facilitate propagation of gradients to earlier layers
- Numerically stabilize gradients to be closer to 1
- Let attention learn features of its own, without worrying about preserving the original representation (it's actually part of the semantics of the network)
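The "Add & Normalize" step combines both ideas as `LayerNorm(x + Sublayer(x))`. The sketch below is simplified: it omits the learned gain and bias of layer normalization, and a random `tanh` layer stands in for the attention or feed-forward sublayer. The names `layer_norm` and `add_and_norm` are chosen for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token) to zero mean and unit variance over its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Skip connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # 4 tokens, 8 features (toy sizes)
W = rng.normal(size=(8, 8))
out = add_and_norm(x, lambda h: np.tanh(h @ W))   # tanh layer stands in for a sublayer

print(out.mean(axis=-1).round(6))
```

Since the statistics are computed per token over its own features, the result does not depend on the other sequences in the batch, which is why the computation is identical at train and test time.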

In [70]:
skipConnectionsURL = "https://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png"
HTML(f"<img src='{skipConnectionsURL}' width='700' height='600' controls></img>")

## BERT Pretraining

### **Masked** Language Modeling

In [91]:
MLM_URL = "https://jalammar.github.io/images/BERT-language-modeling-masked-lm.png"
HTML(f"<img src='{MLM_URL}' width='900' height='600' controls></img>")

### NSP ([CLS] token)

In [92]:
NSP_URL = "https://jalammar.github.io/images/bert-next-sentence-prediction.png"
HTML(f"<img src='{NSP_URL}' width='900' height='600' controls></img>")

In [88]:
bert_classificationURL = "https://jalammar.github.io/images/bert-classifier.png"
HTML(f"<img src='{bert_classificationURL}' width='700' height='400' controls></img>")

# The Hugging Face Library

In [97]:
tokenization_URL = "https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png"
HTML(f"<img src='{tokenization_URL}' width='1000' height='600' controls></img>")

In [99]:
classification_process_full_URL = "https://jalammar.github.io/images/distilBERT/bert-model-calssification-output-vector-cls.png"
HTML(f"<img src='{classification_process_full_URL}' width='1000' height='600' controls></img>")

