### What is BERT

BERT stands for Bidirectional Encoder Representations from Transformers.

- Encoder Representations: a language-modeling system pre-trained on unlabeled data and then fine-tuned for a specific task.
- from Transformers: built on the Transformer, a powerful NLP architecture. This defines the architecture of BERT.
- Bidirectional: uses both the left and the right context of a word when processing it. This defines the training process.

- ELMo = bidirectional, built on LSTMs
- OpenAI GPT = Transformer, left to right only
- BERT = both (Transformer + bidirectional)
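
The difference between left-to-right and bidirectional context can be pictured as a visibility mask over token positions. The following is a minimal sketch, assuming a hypothetical helper `visibility_mask` that is not part of any library. It illustrates only the GPT versus BERT contrast (ELMo gets its two directions from separate forward and backward LSTMs, not from a mask), and it is not how either model is actually implemented.

```python
def visibility_mask(seq_len, bidirectional):
    """Return mask[i][j] == True when position i may look at position j.

    Hypothetical helper for illustration only.
    """
    return [
        [bidirectional or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["the", "cat", "sat"]

# Left-to-right (GPT-style): each token sees itself and earlier tokens only.
left_to_right = visibility_mask(len(tokens), bidirectional=False)

# Bidirectional (BERT-style): each token sees the whole sequence.
both_sides = visibility_mask(len(tokens), bidirectional=True)

for token, row in zip(tokens, left_to_right):
    visible = [t for t, ok in zip(tokens, row) if ok]
    print(f"{token!r} can see {visible}")
```

In the left-to-right mask, "cat" sees only "the" and "cat"; in the bidirectional mask it also sees "sat".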

Source: https://medium.com/@wwydmanski/whats-the-difference-between-self-attention-and-attention-in-transformer-architecture-3780404382f3

Self-attention and attention are both mechanisms that allow transformer models to attend to different parts of the input or output sequences when making predictions. These mechanisms are crucial for the performance of transformer models in tasks such as language translation, text summarization, and sentiment analysis, where the model needs to understand the relationships between different words or phrases in the input and output sequences.

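Attention (often called cross-attention) refers to the ability of a transformer model to attend to different parts of another sequence when making predictions. This is often used in encoder-decoder architectures, where the encoder vectorizes the input sequence, and the decoder attends to the encoded representation of the whole input when making predictions. Self-attention applies the same mechanism within a single sequence: each position attends to the other positions of that same sequence.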



# Understanding Query (Q), Key (K), and Value (V) in Transformers

## Overview
Transformers use a self-attention mechanism to determine how much focus each token in a sequence should have on others. The three main components of self-attention are:

### Query (Q)
- The **query** represents the token that is searching for relevant information.
- It determines **what the model is currently attending to**.
- Each token in a sequence generates its own query vector.

### Key (K)
- The **key** represents the potential "match" or "index" for a given query.
- It determines **how relevant other tokens are** to the current query.
- Each token has a key vector, and the model computes the **similarity** between queries and keys.

### Value (V)
- The **value** holds the actual content or information of each token.
- It determines **what information is retrieved** when attention is applied.
- The value vector is used to compute the final output.
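
To make the three roles concrete, here is a minimal sketch of how one token embedding can yield a query, a key, and a value vector through three separate linear projections. The helper `project`, the toy embeddings, and the matrices `W_q`, `W_k`, `W_v` are all illustrative assumptions; in a trained model the projection matrices are learned and much larger.

```python
def project(x, W):
    """Multiply a row vector x (length d_model) by a matrix W (d_model x d_k)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Toy embeddings for three tokens, d_model = 2 (made-up values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices; a real model learns these during training.
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity: query equals the embedding
W_k = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two coordinates
W_v = [[2.0, 0.0], [0.0, 2.0]]  # doubles the embedding

# Every token gets its own query, key, and value vector.
Q = [project(x, W_q) for x in embeddings]
K = [project(x, W_k) for x in embeddings]
V = [project(x, W_v) for x in embeddings]
```

The same embedding produces three different vectors because each projection matrix is different, which is what lets a token play a different role as a query than it does as a key or a value.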

## How It Works
1. Each token in the sequence is mapped to its respective **Q, K, and V** vectors through learned linear projections.
2. The similarity between **Q and K** is computed using a dot product, giving one raw score per query-key pair.
3. The scores are scaled and passed through a **softmax function** to produce attention weights that sum to 1 for each query.
4. These weights are applied to the **V (Value)** vectors.
5. The final output is a weighted sum of the value vectors, emphasizing important tokens.
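
The steps above can be sketched in a few lines. This is a minimal, dependency-free illustration that uses plain Python lists; the function names `softmax` and `attention` and the toy inputs are assumptions made for this example, and a practical implementation would use a tensor library and batched matrix operations instead.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 2-3: dot product with every key, scale, then softmax.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Steps 4-5: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: two tokens with d_k = 2 (made-up values, not from a real model).
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention(Q, K, V)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each query here matches its own key best, so every token puts the larger share of its weight on itself, and its output is pulled toward its own value vector.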

## Mathematical Representation
The attention mechanism is defined as:

\[\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V\]

where:
- \(QK^T\) computes similarity scores between queries and keys.
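- \(\sqrt{d_k}\), where \(d_k\) is the dimension of the key vectors, is a scaling factor that keeps the dot products from growing large and saturating the softmax.
- **Softmax** normalizes each query's scores into weights that sum to 1.
- The resulting weights are multiplied with **V** to obtain the attended output.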
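
The effect of the scaling factor is easiest to see with numbers. The sketch below uses made-up vectors with \(d_k = 64\), one key that matches the query and one that does not, and compares the softmax output with and without dividing by \(\sqrt{d_k}\). All names and values are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_k = 64
q = [1.0] * d_k        # toy query
k_match = [1.0] * d_k  # key aligned with the query
k_other = [0.0] * d_k  # key with no overlap with the query

raw_scores = [dot(q, k_match), dot(q, k_other)]  # [64.0, 0.0]

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print("weight on the non-matching key, unscaled:", unscaled[1])
print("weight on the non-matching key, scaled:  ", scaled[1])
```

Without scaling, the raw score of 64 pushes nearly all the weight onto one key, leaving the other with a weight that is effectively zero. With scaling, the score becomes 8 and the softmax output is still peaked but far less extreme, which keeps gradients usable during training.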

## Why It Matters
Self-attention allows transformers to capture long-range dependencies in text, making them highly effective for NLP tasks like machine translation, text summarization, and language modeling.


