**NOTE:** The following information is based on the book "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I am just trying to take notes, explain some things further for myself when needed, and do some coding.

The following figure from the book is really nice.

![end-to-end-LLM](./images/end-to-end-LLM.png)

At the end of the day, what we are looking for is to get something like this:

> user input --> MODEL --> output


How to get that MODEL?

* Stage 1: Prepare the training data and design (create) the LLM's architecture.
* Stage 1 --> Stage 2: Pre-train the LLM, i.e. use general data to get a fitted model (known as a foundation model) for general-purpose tasks.
* Stage 2: Evaluate the model. Also, instead of running the pre-training in `Stage 1 --> Stage 2` ourselves, we can load publicly available pre-trained weights.
* Stage 2 --> Stage 3: Fine-tune the model using task/domain-specific data. Such models can perform better on their target domain (e.g. BloombergGPT).
* Stage 3: We now have a classifier (fine-tuned with class labels) OR a personal assistant (fine-tuned with an instruction dataset).

As mentioned above, a custom-built LLM can perform better when it is used for a particular task or domain. Furthermore, a smaller custom model can sometimes match the performance of a larger general-purpose one, which makes it feasible to run it on the user's own system.

A GPT-like model is, at its core, based on the transformer. In contrast to the original transformer, which contains both an encoder and a decoder, the GPT architecture contains only the decoder part. The model is mainly trained to predict the next word (token). Therefore, in Stage 1, the training data does NOT need labels, because we can get them from the text itself. For instance, given the sentence `The sky is blue`, I can take the last word of the sentence and use it as the "label" for the sample `The sky is`.
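
To make the "labels come from the text itself" idea concrete, here is a minimal sketch. It assumes plain whitespace splitting as a stand-in for a real tokenizer, and `next_word_pairs` is just a name I made up for illustration.

```python
def next_word_pairs(text):
    """Turn raw text into (context, target) pairs for next-word prediction."""
    # Assumption: splitting on whitespace stands in for a real tokenizer.
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]  # everything seen so far
        target = words[i]    # the word that comes next acts as the "label"
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("The sky is blue")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

So one sentence already gives several training samples, and the last one is exactly the `The sky is` / `blue` pair from above.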

**NOTE: fine-tuning** <br>
The two common ways of fine-tuning are: (i) instruction fine-tuning, (ii) classification fine-tuning

**NOTE: self-attention** <br>
One of the key components of transformers that plays an important role in LLMs is `self-attention`. It assigns a weight to each word (token) in the input that reflects how relevant it is to each of the other words. This lets the model handle longer inputs, since it can pay more attention to the parts that matter for predicting the next word in the sequence.
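
A rough sketch of the weighting idea, under heavy simplifications: there are no trainable query/key/value weights, no scaling, and no causal mask, and the 2-dimensional embedding values are made up. Real models learn all of these; the function names are my own.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_self_attention(embeddings):
    """For each token, weight every token by similarity and mix them."""
    contexts, all_weights = [], []
    for query in embeddings:
        # Similarity of this token to every token (including itself).
        scores = [dot(query, key) for key in embeddings]
        weights = softmax(scores)
        # Context vector = weighted sum of all token embeddings.
        context = [
            sum(w * vec[d] for w, vec in zip(weights, embeddings))
            for d in range(len(query))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Made-up embeddings for the tokens "The", "sky", "blue".
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
contexts, all_weights = simple_self_attention(embeddings)
for row in all_weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token "attends" to every token in the input; each row sums to 1.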

**NOTE: GPT vs BERT** <br>
`GPT` is mainly designed to perform "text completion" by predicting the next word in the sequence. `BERT`, however, is better at "masked word prediction", i.e. filling in words that are hidden inside a sentence.

**NOTE: autoregressive** <br>
As mentioned earlier, the transformer in GPT has the decoder part only, and it predicts the next word. The cool thing is that each predicted word is appended to the input and used for predicting the word after that, and so on! Hence, the GPT model is known as an `autoregressive` model.
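
The feedback loop can be sketched as follows. The "model" here is a made-up lookup table (`TOY_NEXT_WORD`) and `predict_next` is a hypothetical stand-in for a real LLM, so only the loop itself is the point.

```python
# Assumption: a tiny lookup table stands in for a trained model.
TOY_NEXT_WORD = {"The": "sky", "sky": "is", "is": "blue", "blue": "<end>"}


def predict_next(words):
    """Hypothetical stand-in for an LLM: guess the next word from the last one."""
    return TOY_NEXT_WORD.get(words[-1], "<end>")


def generate(prompt, max_new_words=10):
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = predict_next(words)
        if next_word == "<end>":
            break
        # The output is appended to the input for the next step.
        words.append(next_word)
    return " ".join(words)


result = generate("The")
print(result)
```

Starting from `The`, every new word is produced from the sequence that already includes the previously generated words.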

**NOTE: When to use GPT?** <br>
A GPT-like model is good at predicting the next word, and at text generation in general. So, the following tasks can benefit from it:
* Machine Translation
* Text Summarization
* Writing new articles / code

**NOTE: Different ways to leverage GPT** <br>
* As is: text completion!
* Zero-shot: Perform a task WITHOUT any domain/task-specific examples (Example: `Translate to Farsi: Sky -->`)
* Few-shot: Perform a task after seeing a few examples in the input (Example: `woc --> cow, sky --> yks, lie --> eil, nima -->`)
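
The difference between the zero-shot and few-shot cases is only in how the input text is built. A small sketch, where `zero_shot_prompt` and `few_shot_prompt` are helper names I made up (they only build strings, nothing is sent to a model):

```python
def zero_shot_prompt(instruction, query):
    """Only the task description and the query, no solved examples."""
    return f"{instruction}: {query} -->"


def few_shot_prompt(examples, query):
    """A few solved examples first, then the query left open."""
    shots = [f"{source} --> {target}" for source, target in examples]
    shots.append(f"{query} -->")
    return ", ".join(shots)


zs = zero_shot_prompt("Translate to Farsi", "Sky")
fs = few_shot_prompt([("woc", "cow"), ("sky", "yks"), ("lie", "eil")], "nima")
print(zs)
print(fs)
```

In both cases the model just continues the text; the examples in the few-shot prompt are what hint at the pattern to follow.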

And, of course, some other ways to use it are:
* fine-tuning it via instruction
* fine-tuning it via classification labels.

**Question:** But... how does the model recognize the "instruction" in the input, and separate it from the part of the input that the instruction should be applied to?
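
One possible answer, as far as I understand it: instruction fine-tuning data is usually formatted with a fixed prompt template, so the model learns from the markers where the instruction ends and the input begins. The sketch below follows the Alpaca-style template, which is one common convention and an assumption on my side; `format_instruction` is a made-up helper name.

```python
def format_instruction(instruction, input_text=""):
    """Build an Alpaca-style prompt with explicit section markers."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    # The input section is optional; some tasks only have an instruction.
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    # The model is expected to continue the text after this marker.
    prompt += "\n\n### Response:\n"
    return prompt


prompt = format_instruction("Translate to Farsi", "Sky")
print(prompt)
```

Because every training sample uses the same markers, the separation is learned from the data rather than hard-coded in the architecture.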

One of the cool things the book points out is that a GPT-like model can perform well at translation, even though it is decoder-only and is mainly trained to predict the next word (and NOT to translate). This is called "emergent behaviour".