## General Overview of Transformers ##

In GPT, the input is broken into little pieces called **Tokens**. Tokens can be words or pieces of words. For images, tokens can be parts of an image called 'patches', and for audio they can be short chunks of the sound. The general idea is that the input is always split into smaller parts.

Each token is associated with a vector, meaning that each token has a list of numbers attached to it. The vector gives 'meaning' to that token. If we imagine these vectors as coordinates in a high-dimensional space, we can imagine that the tokens 'jump' and 'leap' sit very close to each other because they are similar in meaning. These vectors are also called **Token Embeddings**.
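To make the 'close together' idea concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are made up by hand purely for illustration (real embeddings are learned and far larger), and `cosine_similarity` is a helper defined here, not something from the notes above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real models learn vectors with hundreds or thousands of dimensions.
embeddings = {
    "jump": np.array([0.9, 0.8, 0.1]),
    "leap": np.array([0.85, 0.75, 0.2]),
    "concrete": np.array([-0.3, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(embeddings["jump"], embeddings["leap"])
sim_far = cosine_similarity(embeddings["jump"], embeddings["concrete"])

print(f"jump vs leap:     {sim_close:.3f}")
print(f"jump vs concrete: {sim_far:.3f}")
```

With these toy numbers, 'jump' and 'leap' score much higher than 'jump' and 'concrete', which is the behaviour we would hope a trained embedding shows.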

This sequence of vectors is then passed on to the attention block. The attention block connects the 'meaning' of related words together to better understand context.

So the process is akin to:

**Input -> Tokens (Token-Embeddings) -> Attention Mechanism**

The attention mechanism is what lets the model work out the meaning of words in context. This is important because, depending on the context, words can have different meanings. So a machine learning model would be different from a fashion model. The attention mechanism is the part that picks up the context and then adjusts the meaning of each word depending on the words surrounding it.

Once the vectors move through the attention mechanism, they are then processed by the **Multilayer Perceptron** / **Feed Forward Layer**. In this layer, the vectors don't interact with each other as they did in the attention mechanism. Instead, it is more akin to the layer asking a series of questions of each vector.

One important thing to understand about token embeddings is that their values act like 'linguistic features' of some sort. So one value in the vector might correspond to the question of whether the token is a noun or a verb, whether it is a quote, or whether it refers to a person, place, object, etc. We can't really know exactly what these 'linguistic features' are, because learning them is the job of the model, but this brief description is akin to what the layer works with.

So again, the layer goes through each vector, takes its linguistic features into consideration, and then updates the vector depending on its responses to the questions it is given.

This layer is much akin to a regular neural network with linear layers and non-linear activations (such as ReLU).
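As a rough sketch of what such a layer could look like, the snippet below applies Linear -> ReLU -> Linear to each token vector on its own. The sizes (`d_model`, `d_hidden`), the random weights, and the `feed_forward` name are all assumptions made for illustration, not values from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 4, 16  # hypothetical sizes; the hidden layer is usually wider
seq_len = 3

# Randomly initialised weights stand in for trained parameters.
W1 = rng.normal(size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    """Linear -> ReLU -> Linear, applied to each token vector independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2

tokens = rng.normal(size=(seq_len, d_model))  # one row per token
out = feed_forward(tokens)
print(out.shape)  # (3, 4)
```

Because every row is processed separately, running a single token through `feed_forward` gives the same result as running it as part of the full sequence. That is the sense in which the vectors don't talk to each other in this layer.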

The process then repeats again and again. 

**Input -> Tokens (Token-Embeddings) -> Attention Mechanism -> Multilayer Perceptron / Feed Forward Layer -> Attention Mechanism -> Multilayer Perceptron / Feed Forward Layer -> ...**

Notice how both of these steps are critical in gathering important information. The attention mechanism grabs context from the surrounding words by adding values from the vectors of related words, thus forming a 'contextual connection' between words. The Multilayer Perceptron then asks its linguistic-feature questions again, this time with the newly added 'context' from the attention mechanism. This repeats over and over until the values hold enough information for us to predict the best word to come next.

Again, just like any other neural network, the end result of this entire process is a single vector. Remember that Transformer blocks function much like a regular neural network. **Intermediate Transformer** blocks output token embeddings of the same size as their input, and the output of the **Final Transformer Block** gives us the **Final Vector Representation**, typically the vector at the last position, which is used to predict the next token.

Once we have the final vector representation, it is mapped to one score per word in the vocabulary; these scores are the logits. We then apply the softmax function to turn them into a probability distribution. Basically the same concept as with neural networks: Logits -> Prediction Probabilities.
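The Logits -> Prediction Probabilities step can be sketched as follows. The four-word vocabulary and the logit values are invented for the example, and `softmax` is a small helper written here.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical vocabulary and logits, one score per word.
vocab = ["car", "is", "moving", "the"]
logits = np.array([2.0, 0.5, 1.0, -1.0])

probs = softmax(logits)
next_word = vocab[int(np.argmax(probs))]

for word, p in zip(vocab, probs):
    print(f"{word:>7}: {p:.3f}")
print("most likely next word:", next_word)
```

Picking the highest-probability word is only one way to choose; sampling from `probs` is another common option.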

Models have a predefined library of tokens (the vocabulary), and each entry has a vector stored in the **Embedding Matrix**. This embedding matrix determines which vector each token turns into. These vectors usually start off with random values and are updated as training progresses.
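A minimal sketch of the lookup, assuming a tiny four-word vocabulary and a randomly initialized embedding matrix (both invented for illustration): each token is mapped to an integer ID, and that ID selects one row of the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary: each token gets an integer ID.
vocab = {"the": 0, "car": 1, "is": 2, "moving": 3}
d_model = 8  # assumed embedding size

# One row per vocabulary entry, randomly initialised (as before training).
embedding_matrix = rng.normal(size=(len(vocab), d_model))

# Turn a sentence into IDs, then into vectors by picking rows.
token_ids = [vocab[word] for word in "the car is moving".split()]
token_embeddings = embedding_matrix[token_ids]

print(token_ids)               # [0, 1, 2, 3]
print(token_embeddings.shape)  # (4, 8)
```

During training, the rows of `embedding_matrix` would be updated along with the rest of the model's weights.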

## Deeper Into The Attention Layer ##

The attention layer has three important parts: **The Query, Key, and Value Matrices**. These components make up the center of the attention layer, and these three are the ones that grab 'context' from the inputs.

Let's start off with the **Query Matrix**. There is actually more than one matrix involved. You start off with the **Query Weight Matrix**, whose number of rows matches the size of each **Token Embedding** so that the two can be multiplied. The query weight matrix is randomly initialized at the start, or it can be loaded from pretrained weights. Either way, these weights are learnable through training. The token embeddings are then multiplied with the query weight matrix to get the **Query Matrix**.

The **Key Matrix** is formed the same way as the **Query Matrix**. You have a **Key Weight Matrix** whose number of rows also matches the size of each **Token Embedding**. It can be randomly initialized and trained, or loaded from pretrained weights. The token embeddings are then multiplied with the key weight matrix to get the **Key Matrix**.
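Here is a small sketch of how the Query and Key matrices could be produced. The names `X`, `W_Q`, `W_K` and the sizes are assumptions chosen for the example. In particular, the weight matrices are given shape `(d_model, d_k)`, which is a common layout but not something the notes above specify.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 2  # hypothetical sizes

X = rng.normal(size=(seq_len, d_model))  # token embeddings, one row per token

# Weight matrices, randomly initialised here; in practice they are learned.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

Q = X @ W_Q  # Query Matrix: one query vector per token
K = X @ W_K  # Key Matrix: one key vector per token

print(Q.shape, K.shape)  # (4, 2) (4, 2)
```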

Conceptually, you can imagine the Query as a question and the Key as a candidate answer. Now that we have a question and an answer, we naturally want a score for how well the answer matches the question. This is where the dot product comes into play.

**IMPORTANT:** Token Embedding, Word Embedding, and Input Embedding usually refer to the same concept, and Model Embedding is closely related. But if you want to be a bit more pedantic with the terminology, here's a more in-depth explanation:

**Token Embedding**
* Definition: The vector representation of tokens (which can be words, subwords, characters, etc.) used in the model.
* Usage: Commonly used in modern NLP models that use subword tokenization.
    
**Word Embedding**
* Definition: The vector representation of whole words.
* Usage: Often used interchangeably with token embedding, especially in models where words are the basic units of representation.

**Input Embedding**
* Definition: The vector representation of input tokens fed into the model.
* Usage: This term emphasizes that these embeddings are the initial representations used as inputs to the model.

**Model Embedding**
* Definition: The size of the embeddings used throughout the model.
* Usage: Refers to the dimensionality of the embeddings used consistently across the model.

The dot product is related to, but not the same as, matrix multiplication. The dot product measures the similarity of two vectors, so it outputs a single scalar value. Matrix multiplication is built out of many dot products: multiplying the Query matrix by the transpose of the Key matrix takes the dot product of every query row with every key row, which results in a matrix of similarity values, or **Attention Scores**.

So just remember: a dot product between two vectors gives one scalar, while the matrix multiplication of the Query matrix with the transposed Key matrix packs all of those scalars, one per query-key pair, into a single matrix.

Now, going back to our Query and Key values, performing the dot product between them will yield a matrix of similarity values or **Attention Scores**!
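A tiny sketch can show how the attention scores relate to individual dot products. The `Q` and `K` values below are arbitrary numbers picked so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hand-picked query and key vectors (one row per token).
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Dot product of one query with one key: a single scalar.
single = np.dot(Q[2], K[1])  # 1*3 + 1*4 = 7.0

# All query-key dot products at once: entry [i, j] is Q[i] . K[j].
scores = Q @ K.T

print(single)
print(scores)
```

Entry `scores[2, 1]` is exactly the scalar `single`, so the score matrix is just every pairwise dot product laid out in a grid.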

Going deeper into this ... how does that actually happen? How do we know the similarities between two words? Visualize the words being placed in an X-dimensional space with many different directions. The word embeddings are spread all across this space. Assuming that we already have pretrained weights for the word embeddings, each word embedding points in a certain direction.

The word 'man' would be in a similar direction to 'king' because they both refer to a 'male' person. So the values for these two would be much closer together than, say, those for 'caterpillar' and 'concrete'. So the dot product for 'man' and 'king' would be a higher value.

That is how we are able to identify the similarity between two words. The way 'attention' works is by taking the context of words that are closely related to each other.

Alright, now going back to our attention scores. These values can vary a whole lot and can become very large in magnitude. We don't want that. So first, we divide the results of the dot product by the square root of the key-query dimension (d_k). Doing this gives us a matrix of **Scaled Scores**. This allows us to rein in the values, so to speak, so that we won't have to work with super large numbers.

Once we're done, we want to turn the values into a probability distribution soooo... you probably already know the answer to this one - it's softmax time! We apply softmax to normalize the values. Once we've done that, we've essentially converted our values into something that can work as 'weights' because they are now in the range of 0 to 1 and add up to 1 for each query.

Since we've applied softmax, our attention scores now turn into **Attention Weights**. Together these weights form a matrix, and the distribution of the attention weights across that matrix is called the **Attention Pattern**.

Summing it all up - **Attention Scores** -> **Scaled Scores** -> **Attention Weights**

**Attention Scores:** Computed by taking the dot product between the Query and Key matrices.

**Scaled Scores:** The attention scores are divided by the square root of d_k, the key-query dimension.

**Attention Weights:** The scaled scores are passed through the softmax function to produce normalized weights.

**Attention Pattern:** The matrix of attention weights, which shows the distribution of attention across the sequence.

**Attention Matrix:** The name of the matrix of attention weights.
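The Attention Scores -> Scaled Scores -> Attention Weights chain can be sketched in a few lines. The random `Q` and `K`, the sizes, and the `softmax_rows` helper are assumptions for illustration. Softmax is applied along each row here, on the assumption that each row holds the scores for one query.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row."""
    shifted = x - x.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 16  # hypothetical sizes

Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

attention_scores = Q @ K.T                       # raw query-key similarities
scaled_scores = attention_scores / np.sqrt(d_k)  # rein in the magnitudes
attention_weights = softmax_rows(scaled_scores)  # each row now sums to 1

print(attention_weights.round(3))
print(attention_weights.sum(axis=-1))
```

The printed matrix is one example of an attention pattern: each row shows how one token spreads its attention across the sequence.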

## Masking ##

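This part will involve some recalling. First, it's important to remember that masking matters most during training, because that is when the whole sequence, including the tokens the model is supposed to predict, is fed in at once. (The same mask is normally kept in place when generating text, but at that point the tokens still to come simply don't exist yet.)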

Here's how training works with transformers:

1. **Given Sequence:** The transformer takes a certain 'sequence' as input, for example a piece of text from the training data.

2. **Labeling:** The transformer uses that same sequence as its label, shifted by one position. So if the given sequence is 'the car is moving', the target at each position is the token that comes next in that same sequence.

3. **Step-by-Step Prediction:** It starts with the first token of that sequence and then tries to predict the next word. For example, starting with 'the', it tries to predict the next word 'car', then 'is', and so on.
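The three steps above can be sketched with plain Python lists. Treating whole words as tokens is a simplification made for this example, and the variable names are illustrative.

```python
# Whole words stand in for tokens to keep the example readable.
tokens = ["the", "car", "is", "moving"]

# The same sequence supplies both inputs and labels, offset by one position.
inputs = tokens[:-1]  # ['the', 'car', 'is']
labels = tokens[1:]   # ['car', 'is', 'moving']

# Each training example: everything seen so far -> the next token.
pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

for context, target in pairs:
    print(context, "->", target)
```

Each printed line is one prediction the model is asked to make, using only the tokens to the left of the target.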

Since we use the same 'sequence' as both the input and the label, we need to ensure the model doesn’t cheat by knowing the next words. If the model knows what the next words are, it won't learn to predict effectively. This is where masking comes into play.

**Masking and the Attention Mechanism**

* **Purpose of Masking:** Masking ensures the model cannot see future tokens. This prevents it from using information it shouldn’t have access to when predicting the next token in the sequence.

* **Attention Mechanism:** Attention is about capturing the context of surrounding words. However, during training, we don’t want future words to influence the prediction of the current token. For example, when predicting the word following 'the', the model shouldn't consider 'car is moving'.

By applying masking, we ensure that each token in the sequence only attends to previous tokens and itself, not future tokens. This way, the model learns to predict the next word based only on the current and previous words, similar to how it would have to operate during actual usage when generating text.

Summary - 

1. Masking: Prevents the model from 'cheating' by looking at future tokens during training.
2. Attention: Captures context but is restricted by masking to only consider current and past tokens.
3. Training: Uses the given sequence as both input and label, with masking ensuring proper learning of token predictions.

**How Masking is Applied**

When we calculate the dot product between the key and query matrices, masking ensures that later tokens do not influence earlier ones. Here’s how it works:

**Dot Product Calculation:** The attention mechanism involves computing the dot product between the query and key matrices. This results in a matrix of similarity scores (attention scores).

**Masking Future Tokens:** During training, we mask future tokens to prevent the model from seeing them when predicting the current token. This is done by setting the similarity scores (resulting from the dot product) for these future tokens to zero.

Let’s consider an example sequence: "the car is moving". We have queries (Q) and keys (K) for each token in the sequence.

1. *For the first token Q1 ("the"):*

    * K1 ("the"): Allowed (has a value)
    * K2 ("car"): Masked (set to zero)
    * K3 ("is"): Masked (set to zero)
    * K4 ("moving"): Masked (set to zero)

----

2. *For the second token Q2 ("car"):*

    * K1 ("the"): Allowed (has a value)
    * K2 ("car"): Allowed (has a value)
    * K3 ("is"): Masked (set to zero)
    * K4 ("moving"): Masked (set to zero)

----

3. *For the third token Q3 ("is"):*

    * K1 ("the"): Allowed (has a value)
    * K2 ("car"): Allowed (has a value)
    * K3 ("is"): Allowed (has a value)
    * K4 ("moving"): Masked (set to zero)

----

4. *For the fourth token Q4 ("moving"):*

    * K1 ("the"): Allowed (has a value)
    * K2 ("car"): Allowed (has a value)
    * K3 ("is"): Allowed (has a value)
    * K4 ("moving"): Allowed (has a value)

This masking ensures that each query (Q) can only attend to the keys (K) that are from the preceding and current tokens, effectively preventing future information from influencing the current token’s prediction.
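The allowed/masked table above has a simple shape: a lower-triangular matrix. Here is a minimal sketch that builds it with `np.tril`, where the variable name `allowed` is chosen just for this example.

```python
import numpy as np

tokens = ["the", "car", "is", "moving"]
n = len(tokens)

# allowed[i, j] is True when query i may attend to key j (j <= i).
allowed = np.tril(np.ones((n, n), dtype=bool))

for i, query in enumerate(tokens):
    visible = [tokens[j] for j in range(n) if allowed[i, j]]
    print(f"Q{i + 1} ({query!r}) can attend to: {visible}")
```

Row 1 only sees 'the', while row 4 sees the whole sequence, matching the four cases listed above.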

**Key and Query Influence:**

* *Keys (K)* - can influence multiple queries (Q) because each key is used in the context of every token that follows or matches it.
* *Queries (Q)* - are influenced only by the keys that correspond to preceding and current tokens due to the applied mask.


**When Masking is Applied**

Masking is applied before softmax. If we applied softmax first and then masked, the weights for each query would no longer add up to 1, so they couldn't be considered normalized anymore. So technically speaking, we don't set the masked scores to zero; instead we set them to a very large negative number (effectively negative infinity). Once we apply softmax, the masked positions come out as zero and the remaining weights are properly normalized.
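The difference between the two orderings can be checked with a short sketch. The random scores and the `softmax_rows` helper are assumptions for illustration. Each row is treated as one query's scores.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) evaluates to 0.0
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 4
scaled_scores = rng.normal(size=(n, n))

# True above the diagonal: positions that correspond to future tokens.
future = np.triu(np.ones((n, n), dtype=bool), k=1)

# Mask first (with -inf), then softmax: rows stay normalised.
masked_scores = np.where(future, -np.inf, scaled_scores)
weights = softmax_rows(masked_scores)

# Softmax first, then zero out: rows no longer sum to 1.
wrong = np.where(future, 0.0, softmax_rows(scaled_scores))

print("mask then softmax, row sums:", weights.sum(axis=-1))
print("softmax then mask, row sums:", wrong.sum(axis=-1))
```

In the first case every row sums to 1 and the future positions are exactly zero. In the second, every row except the last loses some of its probability mass.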

Going back to the **Attention Matrix** - its size grows with the square of the context size. This is why context size is one of the critical factors for large language models: it isn't easy to scale up, because the attention pattern grows so quickly.
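A quick bit of arithmetic shows how fast this grows. The context sizes below are arbitrary examples, and `attention_entries` is a helper defined just for this sketch. It counts entries for a single attention matrix only.

```python
def attention_entries(context_size):
    """Number of entries in one context_size x context_size attention matrix."""
    return context_size ** 2

for context_size in (1_000, 2_000, 4_000, 8_000):
    entries = attention_entries(context_size)
    print(f"{context_size:>6} tokens -> {entries:>12,} attention weights")
```

Doubling the context size quadruples the number of attention weights, which is why longer contexts get expensive so quickly.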

There's a good reason why a bigger context size is better: more words can be taken into context. A context size that spans paragraphs can do wonders compared to one that is limited to a single sentence, because for the most part context lives at the level of paragraphs, not just individual sentences.

Because of the importance of context windows/sizes, many techniques have been tried on the attention layer just for the sake of increasing this number. Reformer, Linformer, Blockwise Attention, Ring Attention, and Longformer are just some examples, but for now we'll stick with the basics.

## Value Matrix ##

We talked about keys and queries, but we left out one important part of this entire process - the value matrix. The **Value Matrix** is formed the same way as the key and query matrices. It has a **Value Weight Matrix**, whose number of rows matches the size of each token embedding, and which can be randomly initialized and trained or loaded from pretrained weights.

The input embeddings are multiplied with the **Value Weight Matrix**, which outputs the **Value Matrix**.

Once we have the **Value Matrix**, we can move on to the next step, which is pretty straightforward. The attention matrix is multiplied by the value matrix. Note the order! **The Attention Matrix Is Multiplied By The Value Matrix** and not the other way around.
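To see what this multiplication does, here is a small sketch with hand-picked numbers. The attention weights are already masked and normalized, and both matrices are invented for the example.

```python
import numpy as np

# Hand-picked attention weights: each row sums to 1, future positions are 0.
attention_weights = np.array([[1.0, 0.0, 0.0],
                              [0.5, 0.5, 0.0],
                              [0.2, 0.3, 0.5]])

# Hand-picked value vectors, one row per token.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# Each output row is a weighted average of the value rows.
output = attention_weights @ V

print(output)
```

The second row of `output` is an even blend of the first two value vectors, because its weights are 0.5 and 0.5. With these shapes the reverse order, `V @ attention_weights`, would not even be a valid multiplication.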

So far we have been talking about a single head of attention. In reality, a full attention block comprises many attention heads that run in parallel with each other. Each attention head has its own key, query, and value matrices.
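Putting the pieces together, here is a rough sketch of several heads, each with its own weight matrices. The sizes, the random weights, and the `attention_head` helper are assumptions for illustration. Joining the head outputs by concatenation is a common approach but is not described in these notes, and the loop runs the heads one after another purely for readability.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One masked attention head: scores -> scale -> mask -> softmax -> values."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    weights = softmax_rows(np.where(future, -np.inf, scores))
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2  # hypothetical sizes
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))  # token embeddings

heads = []
for _ in range(n_heads):
    # Every head gets its own query, key, and value weight matrices.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_Q, W_K, W_V))

# Assumption: head outputs are joined side by side.
combined = np.concatenate(heads, axis=-1)
print(combined.shape)  # (4, 8)
```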