## Visual Guide To Transformer Neural Networks - Video Notes ##

Transformer Architecture - Encoder / Decoder

1. Input Data Processing

- If the training data encompasses the entirety of Wikipedia, the vocabulary will cover practically every word in the English language. Each word in the vocabulary is assigned a numeric index.
- What is fed into the Transformer is not the words themselves but their corresponding indices.
- Imagine each text index as variable X.

Example Text Input -  When | you | play | the | game | of | thrones 

Example Text Index - [2458,  5670, 234,   987,  398,   607, 1230]

![Display](notes/vocab-indices.png)
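Here's a minimal sketch of the word-to-index step. The `vocab` dictionary and the `text_to_indices` helper are made up for illustration, and the index values are simply copied from the example above. A real vocabulary would be built from the training data.

```python
# Hypothetical toy vocabulary; the index values are copied from the example above.
vocab = {
    "when": 2458, "you": 5670, "play": 234, "the": 987,
    "game": 398, "of": 607, "thrones": 1230,
}

def text_to_indices(text, vocab):
    """Lowercase the text, split it on whitespace, and look up each word's index."""
    return [vocab[word] for word in text.lower().split()]

indices = text_to_indices("When you play the game of thrones", vocab)
print(indices)  # [2458, 5670, 234, 987, 398, 607, 1230]
```

Only the list of integers moves on to the next layer, not the words.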

The inputs are then processed into the next layer - **Embedding Layer**

The embedding layer also has an index for every word in the vocabulary, and a vector is attached to each index. Initially, these vectors are filled with random numbers. The model updates these values during the training phase.

The size of the vector is a hyperparameter. In the original paper the vector size is 512. To make it easier, we will downscale the vector size to 5 so that the vectors are easier to visualize.

![Display](notes/vocab-vector.png)
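A rough sketch of what the embedding layer does: it is basically a lookup table with one random vector per index. The names `embedding_table` and `embed`, the vocabulary size of 6000, and the random range are all assumptions made for this illustration. Training (updating the vectors) is not shown.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

VOCAB_SIZE = 6000  # hypothetical; just large enough to hold the example indices
EMBED_DIM = 5      # downscaled from 512, as in these notes

# One randomly initialised vector per vocabulary index.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(indices):
    """Look up the vector that belongs to each input index."""
    return [embedding_table[i] for i in indices]

word_embeddings = embed([2458, 5670, 234, 987, 398, 607, 1230])
print(len(word_embeddings), len(word_embeddings[0]))  # 7 5
```

Seven indices go in and seven vectors of size 5 come out.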

**What is a Word Embedding?**

A word embedding is a vector representation of a given word. Each dimension in the vector represents some kind of linguistic feature. For example, one dimension could track whether the word is a verb, another whether it is a noun, and so on. If the size of the vector is 512, the word embedding is tracking 512 linguistic features. In practice we can't really identify what information each dimension tracks, because the neural network learns that on its own.

**Graphic Visualization**

![Display](notes/word-embedding-graph.png)

In the graph, we can see how 'close' two words are based on the values that they have on each dimension (denoted as d). Once training starts, words can shift around and get closer to or farther from each other. Essentially, words move around the graph, and how close they are reflects how strongly two words are associated with each other. For example, if we add another word, 'caterpillar', to the graph, it would sit quite far from 'play' and 'game' because you won't often find 'caterpillar' used in the same contexts as the other two.

**REMEMBER!** word embeddings are different from input indices.

The INPUT EMBEDDINGS LAYER takes in INPUT INDICES and then converts them into WORD EMBEDDINGS.

INPUT INDICES -> INPUT EMBEDDINGS LAYER -> WORD EMBEDDINGS. 

**Position Embeddings**

Position embeddings are important because they track the sequence of the words. Consider the following: 

**For an LSTM** - word embeddings are taken in sequentially, meaning that each word embedding goes in one at a time. This preserves the sequence of the words, so the model knows which word comes first, second, ... until the last, BUT it is slow.

**For a Transformer** - all the words are taken in at once. That's considerably faster than an LSTM, but it doesn't preserve the order of the words, so the Transformer doesn't know which word goes first, second, ... etc. This poses a problem because word order is essential to the meaning of a sentence.

So how do we make sure that the Transformer can recognize the sequence of the words without making it as slow as an LSTM? That's where *Position Embeddings* come into play. We introduce a "position embedding" vector that has the same number of dimensions as the "word embeddings" vector.

We need a way to make these two vectors into one so we just add the position embeddings to the word embeddings. Just like this:

![Display](notes/position-embeddings.png)

So pretty simple. We create a new vector for the position embeddings and then add its values to the word embeddings vector. But that leaves out a very important question: what are the values for the position embeddings? There might seem to be a straightforward solution to that too: just assign 1, 2, 3, 4, ... to the position embeddings.

Sadly, it's not that simple. If we use regular integers, adding the embeddings together would drastically distort the values, especially once we get to higher positions such as position 30. Adding 0.02 (word embedding) to 30 (position embedding) means the position value drowns out the word's own value.

How about fractions? Nope! Won't work either. If we use fractions, the values depend on the length of the sentence. We should have a consistent value for each position regardless of the sentence length.

![Display](notes/position-values-fraction.png)

So for example, in the image provided above, we can see that "P1" for Sentence 1 (0.33) is different from "P1" for Sentence 2 (0.20). That's because Sentence 1 only extends to 3 while Sentence 2 extends to 4, so the values are different. That's not a good thing. The value for a given position should stay the same regardless of sentence length.

**Wave Frequencies**

In the original transformer paper, they managed to solve this 'position' problem by using wave frequencies. Let's take a look at how this is actually done. Use the image below as a reference. 


* *pos* - stands for the position of the word in the sequence.
* *d* - stands for the number of dimensions in the vector. This should be the same as the number of dimensions in the word embedding vector.
* *i* - selects the frequency used for a pair of dimensions in the vector (more on this below).

It's easy to get confused with dimensions, indices, and elements. Here's a quick rundown of the definitions:

* Indices - the means of accessing the values of a vector.
* Elements - the values of the vector.
* Dimensions - the structure of the vector.

Consider the vector:

[0.1,0.2,0.3,0.4]

* Elements: The values 0.1, 0.2, 0.3, and 0.4.
* Indices: The positions 0, 1, 2, and 3 used to access the elements.
* Dimensions: The vector has 4 dimensions, each representing a different feature or aspect of the data.

![Display](notes/wave-frequencies.png)

To keep things simple, imagine using the regular way of integer-based positioning. The position values would be [1, 2, 3, 4, ...], but instead of applying the position value directly to each dimension, we first turn it into a wave value using the formula above. It isn't shown in the picture, but we alternate between the sin and cos functions.

Why sin and cos? Because they are periodic waves and they are stable. A given position always produces the same values, so you won't have the problem of values changing with different sentence lengths. You also avoid shifting the word embedding values too much because sin and cos always stay between -1 and 1. Lastly, you have many different frequencies available.

There are many other different ways of getting position values, but in the original paper this is the main one being used, so let's focus on that. Let's talk more about the equation itself.

One important thing to note is that *i* does not directly mean the position dimension. *i* is derived from the dimension index and whether it is odd or even: dimensions 2i and 2i + 1 share the same value of *i*. More specifically, *i* is what determines the frequency. It does not refer to the index itself.

One area of confusion is the term *Dimension* that is involved in calculating *i*. *Dimension* is the index of a specific value in the positional embedding. It is different from *dmodel*, which is used in the PE equation.

*dmodel* refers to the hyperparameter that sets the general dimension size of the matrices in the Transformer model.

It can be confusing, but it is crucial to remember what these different terms mean.

![Display](notes/solving-for-i.png)

![Display](notes/positional-odd-even.png)

So what is happening is that i = 0 means frequency 0 is assigned to the positional dimensions of 0 and 1. They have the same frequency value but that is alright because 0 uses sine while 1 uses cos so even if they use the same frequency value they still have different values. 

You can see this being applied all throughout. So the next two positional dimensions would be 2 and 3 and these would also use the same frequency value but still have different values because one would use sine and another would be using cos.
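To make the odd/even pairing concrete, here's a small sketch of the sinusoidal formula from the original paper: PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)). The function name `positional_encoding`, the toy sizes, and starting positions at 0 are choices made for this illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build one positional vector per position using alternating sin and cos."""
    pe = []
    for pos in range(seq_len):          # positions start at 0 in this sketch
        row = []
        for dim in range(d_model):      # dim is the index inside the vector
            i = dim // 2                # dims 0,1 -> i=0; dims 2,3 -> i=1; ...
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=7, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 gives sin(0) = 0 on the even dimensions and cos(0) = 1 on the odd ones. Every value stays between -1 and 1, and the vector for a position never depends on how long the sentence is.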

Now that we have our positional embeddings, we just add that to our word embeddings then we're done. That's it. 

![Display](notes/positional-and-token.png)

Obviously, we need a name for this word + position embedding. No worries, there's a term for this - **Input Embeddings**.

Let's recap the process then:

INPUT INDICES -> INPUT EMBEDDINGS LAYER -> WORD EMBEDDINGS -> POSITIONAL EMBEDDINGS -> WORD + POS -> INPUT EMBEDDINGS

We're pretty much done with the necessary preparations for the inputs, and now we can start going through the parts of the Transformer architecture that do a lot of the heavy lifting. We'll now get started on the **Multi-Head Attention** layer. This part is the big monster that we're gonna have to deal with. You can say that we're getting to the part where the "magic" happens.

But before we get to explaining the layer - let's ask an important question first. Why is this layer important? What makes it unique? 

It's because it is able to 'grasp' context. The entire purpose of **Attention** is to emphasize the importance of certain words that drastically impact the meaning of the sentence. For example, 'He managed to fight off a huge beast'. Attention would be able to distinguish the important parts of this sentence such as He, fight, and beast. Attention makes a model understand the important things. 

However, Transformers aren't innovative merely because they use attention. What makes them unique is another ace up their sleeve called **Self-Attention**.

Self-attention brings to the table the ability to understand that a word might mean something entirely different depending on the surrounding words, i.e. the 'context' of the word.

For example, the word 'model' means one thing when talking about a fashion model, something entirely different when talking about a 3D model, and something else again in the world of machine learning. Self-attention is able to understand that this single word could be used for entirely different things based on the 'context', by paying attention to the surrounding words.

So what's the key difference between **Simple-Attention** and **Self-Attention**? Let's take a look through some diagrams.

![Display](notes/simple-vs-self-attention.png)

**Simple-Attention:** This highlights the keywords that are most relevant to a specific query. It selectively chooses words and does not take into account other words in the sentence that don't directly answer that query. Putting all the emphasis on words that answer the query poses a challenge, because the words that are left behind are often just as critical to the context as the focus words.

**Self-Attention:** On the other hand, self-attention works with all the words in the input. It takes into consideration the relationships between different words, so the values of each word can change depending on the words surrounding it, essentially capturing the 'context'. What's great about self-attention is that it takes all words into account and doesn't fixate on specific words only.

## Multi-Head Attention ##

Now, let's take a look at where this self-attention mechanism works. We'll be deep diving into the next layer - the multi-head attention layer. Let's start by getting familiar with the contents of the layer and its inner workings: how it is structured and the different parts that it contains. Then we'll pick these parts off one by one and examine each one. Once we're done, we can assemble them back together so that we can see how the whole thing works.

![Display](notes/multihead.png)

The multi-head attention layer is primarily made up of linear layers, each having its own separate weights. The linear layers have different hyperparameters that you can change, but for the most part they downscale the input embeddings to save on computation costs. Aside from that they function as regular linear layers of a neural network.

Now let's start introducing these three different linear layers: **Key Layer, Query Layer, Value Layer**.

The name pretty much explains how these linear layers function. The Query is akin to asking a question - the Key is the answer to the Query - and the Value is what is being returned from the result of the Query-Key.

The **Key**, **Query**, and **Value** Linear Layers are composed of a **Weights Matrix** that is randomly initialized which can be updated to be better during training or can be pretrained. 

So we have the **Key, Query, Value** and inside each is the **Key Weights Matrix, Query Weights Matrix, Value Weights Matrix**.

![Display](notes/raw-dimension-inputs.png)

Now what do we actually feed these layers? The **Token/Input Embeddings**! Since we have three of these layers, we make three copies of the embeddings, so each layer gets fed the same **Token Embeddings**. Of course, we're going to need to transpose the token embeddings first.

Once these layers are fed the token embeddings (via matrix multiplication), they output the actual **Key, Query, Value** matrices.

These are just called **Key, Query, Value**.

![Display](notes/key-query-value.png)
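A minimal sketch of the three linear layers, assuming each layer is just a matrix multiplication with its own randomly initialised weights matrix (no bias). The helpers `matmul` and `random_matrix` and the toy sizes are made up for this example. In this sketch each token is stored as a row, so no transpose is needed. The figures may lay the matrices out the other way around.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

SEQ_LEN, D_MODEL, D_HEAD = 3, 4, 2  # hypothetical toy sizes

input_embeddings = random_matrix(SEQ_LEN, D_MODEL)  # one row per token

# Each linear layer has its own weights matrix.
w_query = random_matrix(D_MODEL, D_HEAD)
w_key = random_matrix(D_MODEL, D_HEAD)
w_value = random_matrix(D_MODEL, D_HEAD)

# The same embeddings are fed to all three layers.
query = matmul(input_embeddings, w_query)
key = matmul(input_embeddings, w_key)
value = matmul(input_embeddings, w_value)

print(len(query), len(query[0]))  # 3 2
```

Note how the output has fewer columns than the input (2 instead of 4). That is the downscaling mentioned earlier.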

We'll be focusing first on the **Query-Key** matrices. We have to do a dot product operation between these two. What's a dot product? It measures how similar the values are between the Query and Key matrices. This is where we are doing 'self-attention'. One way to visualize this is by imagining an X-dimensional space where words are placed all over. Certain directions mean something.

The word embeddings are then scattered all over this space. For example, the word 'king' can be close to 'man' because they both carry the meaning 'male'. This is what direction means. There is a pretty simple way of explaining the values that result from the dot product:

* The values are positive if the words are pointing in similar directions.
* The values are zero if the words are pointing in perpendicular (unrelated) directions.
* The values are negative if the words are pointing in opposite directions.
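The three cases can be checked with a few made-up 2D vectors. The `dot` helper is written here just for illustration.

```python
def dot(u, v):
    """Dot product: multiply matching elements and add the results up."""
    return sum(a * b for a, b in zip(u, v))

print(dot([1.0, 2.0], [2.0, 3.0]))    # 8.0  -> similar directions
print(dot([1.0, 0.0], [0.0, 1.0]))    # 0.0  -> perpendicular directions
print(dot([1.0, 2.0], [-1.0, -2.0]))  # -5.0 -> opposite directions
```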

Once we calculate the dot product we get a new matrix of similarity values - or what we call **Attention Scores**. We're not done yet. These attention scores can grow very large, which we don't want, so we have to scale them down. We can do that by simply dividing the results of the dot product by the square root of the number of dimensions in the Key (and Query) vectors. This process is called **Scaling**. Once we've done this we now have a matrix of **Scaled Scores**.

![Display](notes/scaling-values.png)

Once we have scaled the values down, we squash them even further. We need all the values to range between 0 and 1 and to add up to 1, and we do that by using our old trusty softmax function. We're getting into familiar territory now, as this is no different from regular neural networks.

![Display](notes/softmax-values.png)

The final output would be something akin to this:

![Display](notes/attention-filter.png)

The final output is called the **Attention Filter** / **Attention Weights**. The distribution of the weights in the matrix is called the **Attention Pattern**. Notice how the values in each column add up to 1? That's softmax in action.

Let's make a pipeline so that we can get a better grasp on the concepts and terms that we've just used: 

1. INPUT EMBEDDINGS
2. KEY & QUERY WEIGHT MATRICES -> KEY & QUERY MATRICES
3. DOT PRODUCT OF QUERY AND KEY MATRICES -> ATTENTION SCORE MATRIX
4. SCALING -> SCALED SCORES
5. APPLY SOFTMAX -> ATTENTION WEIGHTS / ATTENTION FILTER
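Steps 3 to 5 of the pipeline can be sketched like this. The Query and Key values are made up, and the helpers `matmul`, `transpose`, and `softmax` are written here for illustration. One assumption to flag: in this sketch each row belongs to one query word, so the softmax is applied per row and each row adds up to 1. Whether that shows up as rows or columns in a picture depends on how the matrix is drawn.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Turn a list of scores into values between 0 and 1 that add up to 1."""
    biggest = max(row)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - biggest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up Query and Key matrices: 3 tokens, 2 dimensions each.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

d_k = len(key[0])                                # dimensions of each key vector
scores = matmul(query, transpose(key))           # step 3: attention scores
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 4: scaling
attention_weights = [softmax(row) for row in scaled]            # step 5: softmax

print(scores[0])  # [1.0, 0.0, 1.0]
```

The first token scores 1.0 against itself and against the third token, and 0.0 against the second, which matches the dot product rules from earlier.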

What about the value matrix? The value matrix is pretty simple: it goes straight to the next step and stays a value matrix because it's going to be used for something later (we'll get to that). Meanwhile the Query and Key matrices combine to turn into an attention filter.

That's quite a lot so quick recap:

Our input embeddings go through each linear layer separately. You can think of it as making three copies of the input embeddings. We pass one copy to each layer - the Key, Query, and Value layers. The value layer outputs a value matrix.

The Key and Query layers output a Key and Query matrix where we use the dot product operation to get an attention score matrix. We scale these attention scores to get scaled scores and then further apply the softmax function to finally get the attention weights / attention filter.

What are we left with at the end? This: 

![Display](notes/filter-value.png)

One thing to keep in mind when we've been discussing this entire process is that there isn't just 'one' attention filter being created. If we were just using one attention filter then this wouldn't be a multi-headed attention layer but instead become a single-headed attention layer. 

In the original Transformer paper, there were 8 attention heads in total, each creating its own attention filter. GPT-3 has 96 attention heads. Here, we're just going to visualize 3. Scaling up the number of attention heads isn't easy because each one involves a lot of parameters and weights.

Adding even one attention head costs a lot of computation power, so it isn't something that is taken lightly. In addition, there is also the subject of the **Context Size**. The size of the **Attention Matrix** grows with the square of the context size.

Confused as to what context size is? That's just another term for the number of tokens that the model takes in per input.

This is what a multi-headed attention layer looks like:

![Display](notes/multi-head.png)

As you can see, each attention filter works by grabbing certain details from the input. In this case, we're dealing with an image so the attention filters are taking different details from the input. One might be taking the details of Azula, the clouds behind her, or even the mountains. This is why each head of attention is important because they are able to grab more information from the input at the cost of additional computation requirements.

![Display](notes/concat-values.png)

We have all these different attention heads so now what? Each head produces its own attention filter, which is applied to (multiplied with) that head's value matrix to give the head's output. So what do we do with these outputs? Pretty easy, we just concatenate them into one big matrix. In our specific example, we were using three heads of attention so we'll concatenate three outputs.

Of course, we want to return the dimensions back to their original size, which is done by passing the concatenated matrix through one more linear layer. The reason is that we want to keep the dimensions consistent throughout the model. The final output is often just called the **Multi-head Attention Output**.
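Putting the pieces together, here's a rough sketch of three heads being computed, concatenated, and projected back to the original size. All weights are random and untrained. The helper names, the toy sizes, and the use of one final weights matrix `w_output` for the projection are assumptions for this illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    biggest = max(row)
    exps = [math.exp(x - biggest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """One head: build Q, K, V, compute the attention filter, apply it to V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)  # attention filter multiplied with the value matrix

SEQ_LEN, D_MODEL, NUM_HEADS = 4, 6, 3  # hypothetical toy sizes
D_HEAD = D_MODEL // NUM_HEADS          # each head works in a smaller space

x = random_matrix(SEQ_LEN, D_MODEL)    # stand-in for the input embeddings

heads = [
    attention_head(
        x,
        random_matrix(D_MODEL, D_HEAD),
        random_matrix(D_MODEL, D_HEAD),
        random_matrix(D_MODEL, D_HEAD),
    )
    for _ in range(NUM_HEADS)
]

# Concatenate the head outputs side by side, one row per token.
concatenated = [sum((head[t] for head in heads), []) for t in range(SEQ_LEN)]

# A final linear layer brings the matrix back to the model dimension.
w_output = random_matrix(NUM_HEADS * D_HEAD, D_MODEL)
multi_head_output = matmul(concatenated, w_output)

print(len(multi_head_output), len(multi_head_output[0]))  # 4 6
```

The output has the same shape as the input (4 tokens by 6 dimensions), which is what lets the next layers treat it like the original embeddings.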

## Add & Normalize ##

![Display](notes/residiual-connections.png)

Time to zoom out a bit and talk about **Residual Connections** - another feature of the Transformer architecture. We know that the multi-head attention layer is important in the sense that it is able to grab important details from the input. In addition, it also allows the tokens in the input to affect each other.

But in this process, we're going through a lot of linear layers which will inevitably overwrite some critical information. This is where residual connections come into play. Residual connections act as a 'highway' through which the input can directly skip past certain operations.

You can think of this as something that 'reminds' the weights of the previous information. So the token embeddings go through the multi-head attention layer AND also skip ahead to the **Add & Norm** layer via a residual connection.

Starting off with the simple part of **Adding**. The token embeddings that go through the residual connection are simply added to the output of the multi-head attention layer. The resulting output would then proceed to **Normalization**.

![Display](notes/add.png)

Proceeding with normalization, you just get the mean and the standard deviation from each row. Once you have these values then you start going through each neuron/value in the row. The formula to normalize the values is in the image shown below. You can check out the video from the series (episode 3) to see this in action.

![Display](notes/norm.png)
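Here's a minimal sketch of Add & Norm, assuming the usual layer normalization per row: subtract the row mean and divide by the row standard deviation. The function name `add_and_norm` and the small `eps` constant (there to avoid dividing by zero) are assumptions, and the learnable scale and shift that real implementations usually include are left out.

```python
import math

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Add the residual input to the sublayer output, then normalize each row."""
    normalized = []
    for x_row, s_row in zip(x, sublayer_out):
        added = [a + b for a, b in zip(x_row, s_row)]          # the "Add" part
        mean = sum(added) / len(added)
        variance = sum((v - mean) ** 2 for v in added) / len(added)
        std = math.sqrt(variance + eps)
        normalized.append([(v - mean) / std for v in added])   # the "Norm" part
    return normalized

token_embeddings = [[1.0, 2.0, 3.0]]   # made-up residual input
attention_output = [[0.5, 0.5, 0.5]]   # made-up multi-head attention output

result = add_and_norm(token_embeddings, attention_output)
print(result[0][1])  # 0.0
```

The middle value is exactly the row mean after adding (2.5), so it normalizes to 0.0, while the other two land at equal distances on either side of it.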

Then you're done! The values in the resulting matrix are now all normalized and can proceed to the next layer - the **Feed Forward** layer. This step is pretty simple. The Feed Forward layer is composed of linear layers with activation layers in between. No need to explain much in this case because we've already discussed this before and you should already know this by now.
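As a quick refresher, here's a sketch of a feed-forward block with two linear layers and a ReLU in between, which is the setup the original paper uses. The weights are random, and the helper names and toy sizes (including the wider hidden size `D_FF`) are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def add_bias(m, bias):
    return [[v + b for v, b in zip(row, bias)] for row in m]

def relu(m):
    """Activation: negative values become 0, positive values pass through."""
    return [[max(0.0, v) for v in row] for row in m]

def feed_forward(x, w1, b1, w2, b2):
    hidden = relu(add_bias(matmul(x, w1), b1))   # linear layer + activation
    return add_bias(matmul(hidden, w2), b2)      # second linear layer

SEQ_LEN, D_MODEL, D_FF = 3, 4, 8  # hypothetical toy sizes

x = random_matrix(SEQ_LEN, D_MODEL)              # stand-in for the Add & Norm output
w1, b1 = random_matrix(D_MODEL, D_FF), [0.0] * D_FF
w2, b2 = random_matrix(D_FF, D_MODEL), [0.0] * D_MODEL

ffn_output = feed_forward(x, w1, b1, w2, b2)
print(len(ffn_output), len(ffn_output[0]))  # 3 4
```

Each token (row) is processed on its own here. The tokens only interact with each other inside the attention layer.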

Once you're through with the layer, you go through another Add & Norm layer as the final step. This sums up the entire process of the **Encoder** of a Transformer and, for the most part, explains the majority of the critical parts of what a Transformer is.

Let's start moving on towards the **Decoder** which retains the same concept as the encoder with some slight differences.

## Decoder ##

Compared to the encoder, which takes just one input (the base input text), the decoder takes in two: the output of the encoder and the text generated thus far. There is also a slight difference in where these inputs enter the decoder.

![Display](notes/decoder.png)

There are a couple of things to dissect in the image of how this Transformer works. First, we duplicate the output of the encoder to create two instances that act as **Key and Value** inputs for the Multi-Head Attention layer of the decoder. The **Query Matrix**, however, comes from the decoder's output embeddings. But what do we input to the output embeddings layer?

You might ask, "But this is the first pass, so the decoder hasn't generated any text yet. What goes through the output embeddings?"

Great question! The initial input for the output embeddings is a special token called < Start >. It goes through the embedding process: it gets turned into a token embedding, receives a positional embedding, and then goes through the decoder's masked multi-head attention layer (ignore the masking part for now; we'll get back to that). It then passes through an Add & Norm layer, and the result becomes the Query matrix for the encoder-decoder multi-head attention layer. This Query matrix goes quite a long way before converging with the Key and Value inputs from the encoder.

The special multi-head attention layer where the encoder and decoder converge is called the **Encoder-Decoder Multiattention Layer**.

So, summing it all up: the multi-head attention layer of the decoder takes in the Key and Value matrices from the encoder. The decoder produces the Query matrix from the embeddings of the generated text. Initially, if it hasn't yet generated any text, the first thing that goes through the output embeddings of the decoder is the special token <Start>.

Now, the Encoder-Decoder Multiattention Layer is pretty much the same as before so there's no need to explain much aside from the fact that it gets the Key-Value matrices from the Encoder and the Query matrix from the Output Embeddings of the Decoder. 

Aside from that, the process is pretty straightforward. The layer outputs a matrix that is fed to a Feed Forward and Add & Norm layer. Finally, it goes through a Linear layer. We'll focus on this one because it's the last layer, but expect it to operate the same way as a regular Linear layer.

This last linear layer serves as a 'classifier' because this is the part where the model predicts the next word to come out. The number of outputs of the linear layer depends on the number of classes that we have. For example, if we were classifying between a dog and a cat, there would be two outputs for the layer. In our case, the classes are the entire vocabulary list that we have.

Now what is exactly being fed to this classifier? As it stands now, what we have is the output matrix from the previous Add & Norm layer. If we just send this directly into the linear layer, it'll output vectors. We don't want that. What we want are scalar values for every word. That way we can pick the word which has the maximum score. 

How do we do that? First, we flatten the entire matrix into one single row. We concatenate these and then start passing them onto the linear layer. Because of this, the output of the classifier will now be one score for each word in the vocabulary. These are **Logits**. Of course, we don't want to work with just logits. We convert them with softmax so that we have a probability distribution.

With that done, we're ready to pick the next word to generate: the one with the highest probability.

![Display](notes/linear.png)
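The last two steps (logits to probabilities, then picking a word) can be sketched as follows. The four-word vocabulary and the logit values are made up, and always picking the highest probability (greedy picking) is just the simplest choice.

```python
import math

def softmax(row):
    """Turn a list of logits into probabilities that add up to 1."""
    biggest = max(row)
    exps = [math.exp(x - biggest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<end>", "make", "lemonade", "fight"]  # hypothetical tiny vocabulary
logits = [0.1, 2.0, 0.5, -1.0]                  # made-up scores, one per word

probabilities = softmax(logits)

# Pick the index with the highest probability and map it back to a word.
next_index = max(range(len(probabilities)), key=probabilities.__getitem__)
next_word = vocab[next_index]
print(next_word)  # make
```

Softmax never changes which score is the biggest, so the word with the largest logit is always the one that gets picked.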

!!!!! **THIS ENTIRE SEGMENT IS NOT YET VERIFIABLE** !!!!!!

--------------------

**SEGUE:** Normally flattening isn't applied during inference. It is only used in training because it facilitates the loss calculation. Some loss functions expect the data to be in a specific shape, which is why we need to flatten the data.

Recall - each token position gets a vector whose length is the vocabulary size. The matrix that the Add & Norm layer outputs has a shape like [batch_size, seq_length, hidden_dim]. When it is fed through the linear layer, it becomes [batch_size, seq_length, vocab_size], where vocab_size is the length of the vector at each position in seq_length. Remember that each position in seq_length holds a vector of logits. The batch_size is the number of sequences, so each entry in the batch is a different sequence of seq_length tokens.

In flattening, we merge the seq_length dimension into the batch_size dimension. After flattening, the shape becomes [batch_size * seq_length, vocab_size]. This is now a matrix with one row of logits for every token across all sequences in the batch. This is similar to how we multiply the Height & Width of pixels in a CNN. In our case, we are working with tokens (words) and the number of sequences in the batch.

**NOTE:** vocab_size is literally just a number that indicates how many unique tokens (words, subwords, or characters) the model can predict. It is not like Width in CNNs, where each position holds a pixel value. The batch_size, on the other hand, counts the different sequences, each of length seq_length.

**NOTE:** Make sure to remember that the * here only describes the new shape (batch_size times seq_length rows); no values are multiplied. The rows of every sequence are simply stacked one after another. With flattening, we are NOT changing values, just RESHAPING the way they are presented. WE NEVER TOUCH THE VALUES NOR CHANGE THEM IN FLATTENING.
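A tiny sketch of that reshape, with the same "not yet verified" caveat as the rest of this segment. The `flatten_batch` helper and the logit values are made up, and the shapes here are batch_size = 2, seq_length = 2, vocab_size = 3.

```python
def flatten_batch(logits):
    """Reshape [batch_size, seq_length, vocab_size] into [batch_size * seq_length, vocab_size]."""
    # Stack the rows of every sequence one after another; no value is changed.
    return [token_logits for sequence in logits for token_logits in sequence]

logits = [
    [[0.1, 0.9, 0.0], [0.3, 0.3, 0.4]],   # sequence 1: two tokens
    [[0.7, 0.2, 0.1], [0.5, 0.1, 0.4]],   # sequence 2: two tokens
]

flat = flatten_batch(logits)
print(len(flat), len(flat[0]))  # 4 3
```

The four rows are exactly the four original logit vectors in the same order, just without the grouping by sequence.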

--------------------

!!!!! **THIS ENTIRE SEGMENT IS NOT YET VERIFIABLE** !!!!!!

--------------------

**FLATTENING IS NOT FOR INFERENCE:**

Flattening makes inference much more complicated because it loses the positional structure of the input. When working with [batch_size, seq_length, vocab_size], we can identify the last token easily because it is the last element in seq_length. So we just choose the highest value in the last vector of seq_length as the next word.

For example - we are working with [1,3,100]. We narrow down to the last element of seq_length and just choose from the 100 elements inside.

If we do flattening then it'll be much harder, because seq_length is concatenated with vocab_size and we need to know exactly where the borders are between each token. So if we do flattening we will instead have [1,300]. This layout is easier when training because we have truth values to compare against and can calculate the loss directly 1-1. That's not the case with inference.

We just made it harder for ourselves to predict the next word because we lost the positional structure of where the token begins and ends.

--------------------

Once the decoder has predicted the first word, the cycle restarts. The newly generated word alongside the previous words (in this case < start >) is passed onto the output embeddings of the decoder. It gets a positional embedding and goes through all the layers in the decoder until the next word is generated. This process repeats until the decoder generates a special < end > token that indicates that everything is done.
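The generation cycle can be sketched as a simple loop. Everything here is a stand-in: `toy_decoder` just replays a canned answer instead of running a real decoder, and the token strings, the `generate` function, and the `max_len` safety limit are assumptions for illustration.

```python
def generate(next_token_fn, max_len=10, start="<start>", end="<end>"):
    """Feed the words generated so far back in until the end token appears."""
    generated = [start]
    while len(generated) < max_len:          # safety limit in case <end> never comes
        token = next_token_fn(generated)     # one full pass through the decoder
        generated.append(token)
        if token == end:
            break
    return generated

# Hypothetical stand-in for the decoder: it replays a fixed answer word by word.
canned_answer = ["make", "lemonade", "<end>"]

def toy_decoder(tokens_so_far):
    return canned_answer[len(tokens_so_far) - 1]

print(generate(toy_decoder))  # ['<start>', 'make', 'lemonade', '<end>']
```

The important part is that the whole list of tokens so far goes back into the decoder on every step, not just the newest word.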

## Masking in Multi-Head Attention ##

Masking is applied during the training phase of a transformer model. There are key differences between inference and training. During training, we are able to directly calculate the loss function and improve our model because we have truth labels to compare to. However, for inference we're creating words on our own thus we're not directly sure if what we're predicting is correct. 

We just hope that it's correct based on the loss and accuracy metrics that we have during training. There is no guarantee that during inference, what we're predicting is correct. During training, we have a set of inputs such as *"when life gives you lemons"* and a truth label that says what the model should predict: *"make lemonades"*. Simply put, we have questions and we have answers. This allows the model to learn from its mistakes.

Now how does this play into masking? Technically speaking, we already have the answers. But we have to make sure that the model doesn't cheat, so we need to find a way to hide the answer from the model. That is where masking comes into play.

Let's assume that the decoder has already generated its first word - "fight". Rather than sending the newly generated word into the decoder, the next input would be the first word of the masked truth labels. We unmask "make" and send it to the decoder. This way the decoder can calculate the loss between its prediction and the truth, which can be used to improve the weights and make the model better.

![Display](notes/mask-overview.png)

**Deeper Into The Attention Layer**

Recall how the attention layer works. Before we proceed to the softmax operation on the scaled scores of the attention matrix, we apply masking first. Masking works by setting the scores of all future tokens to negative infinity. What this usually looks like is that it splits the attention matrix in half diagonally: the upper right half is set to negative infinity while the bottom left half keeps its values.

![Display](notes/masking.png)

Notice how in the image shown that future words are set to negative infinity. Starting at "I" - it can only see itself but anything further is set to negative infinity. That repeats again til the < end > mark. 

Once you've applied the softmax function it would look much better with the attention values only working with the previous words that have been already generated. 

Again, you can't take into context words that still haven't been generated. Just akin to human speech. You can't consider words that haven't yet been spoken.

![Display](notes/masking-filter.png)
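A small sketch of masking followed by softmax. The scaled scores are made up, and the helper `apply_causal_mask` plus the layout (row = the word doing the looking, column = the word being looked at) are assumptions for illustration.

```python
import math

def softmax(row):
    """Turn a list of scores into values between 0 and 1 that add up to 1."""
    biggest = max(row)
    exps = [math.exp(x - biggest) for x in row]   # exp(-inf) becomes 0.0
    total = sum(exps)
    return [e / total for e in exps]

def apply_causal_mask(scores):
    """Set every score for a future position (column > row) to negative infinity."""
    size = len(scores)
    return [
        [scores[i][j] if j <= i else -math.inf for j in range(size)]
        for i in range(size)
    ]

# Made-up scaled scores for 3 tokens.
scaled_scores = [
    [0.5, 1.0, 0.2],
    [0.3, 0.8, 0.1],
    [0.9, 0.4, 0.7],
]

masked = apply_causal_mask(scaled_scores)
attention_weights = [softmax(row) for row in masked]

print(attention_weights[0])  # [1.0, 0.0, 0.0]
```

The first word can only see itself, so all of its attention (1.0) goes to itself. After softmax, every negative infinity turns into a weight of exactly 0.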

Let's take another look at this in a different perspective. You can see here the process of how attention works during training with masking. At the start, it has access to the full input from the encoder so it includes them in attention. Also notice that the < start> of the decoder is the < end > of the encoder. The "I" would be just unmasked from the decoder then this will be included in the next iteration. This process keeps repeating over and over until < end > is generated.

![Display](notes/attention-start.png)

![Display](notes/attention-end.png)

So that pretty much sums up how Transformers work! This leaves out certain specifics such as batching, but those will become much more apparent when working with code. Right now, we're focused on explaining the big parts of the Transformer architecture. Keep in mind that there are many variations on how this works across the many iterations of the architecture. But for the most part, this is how the original paper works.