# **Introduction to LLMs**

## **What's Covered**
1. A little bit about Transformers
2. Attention is all you need
3. What is Language Modeling?
4. What are LLMs?
5. Pre-Training, Transfer Learning and Fine-Tuning
6. Popular Modern LLMs
    - BERT
    - GPT
    - T5
    - Domain Specific LLMs
7. Prompt Engineering
8. Applications
9. Quick Summary

## **A little bit about Transformers:**
<img style="float: right;" width="400" height="400" src="data/images/transformer.jpeg">

1. Sequence to Sequence Model.
2. Has two main components: Encoder and Decoder
3. The **encoder** takes in raw text, splits it into its core components (tokens), converts them into vectors, and uses **self-attention** to understand the context of the text.
4. The **decoder** excels at generating text: it uses masked self-attention over the tokens generated so far, plus **cross-attention** over the encoder's output, to predict the next best token.
5. Transformers are **trained** to solve a specific NLP task called **Language Modeling**.
6. **Why not RNNs? -** The Transformer's self-attention mechanism allows each word to "attend to" all other words in the sequence, which enables it to capture long-term dependencies and contextual relationships between words. RNNs, by contrast, process tokens one at a time and tend to lose information over long sequences. The goal is to understand each word as it relates to the other tokens in the input text.
7. **Limitation:** Transformers are still limited to an input context window (i.e. the maximum length of text they can process at any given moment).
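
One common way to work around the context window is to split a long token sequence into overlapping windows and process each window separately. The sketch below is only an illustration of that idea: the function name `split_into_windows` and the `max_len` and `stride` parameters are assumptions made for this example, and whitespace splitting stands in for a real tokenizer.

```python
def split_into_windows(tokens, max_len, stride):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len plays the role of the model's context window; stride is how far
    the window moves each step (stride < max_len gives overlapping windows).
    Both names are hypothetical and chosen for this sketch.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


# Whitespace splitting is a stand-in for a real subword tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
windows = split_into_windows(tokens, max_len=4, stride=3)
for window in windows:
    print(window)
```

The overlap keeps a little shared context between neighbouring windows, at the cost of processing some tokens twice.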

## **Attention is all you need**
1. **Attention** is a mechanism that assigns different weights to different parts of the input allowing the model to prioritize and emphasize the most important information while performing tasks like translation or summarization.
2. Attention allows a model to focus on different parts of the input dynamically, leading to improved performance.
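
The weighting described above can be sketched as scaled dot-product attention, the form used in the original Transformer. This is a minimal pure-Python illustration on toy 2-dimensional vectors, not an efficient implementation; the vectors and the helper names `softmax` and `attention` are assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    For every query: score each key by dot product, scale by sqrt(d_k),
    normalise with softmax, then take the weighted average of the values.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy self-attention: the same three token vectors act as queries and keys.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(token_vectors, token_vectors, token_values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the input; the rows sum to 1, and tokens whose key points in the same direction as the query receive the larger weights.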

## **What is Language Modeling?**
1. Language Modeling involves creating statistical or deep learning models that predict the likelihood of a sequence of tokens drawn from a specified vocabulary.
2. Two types of Language Modeling Tasks are:  
    a. Autoencoding Task  
    b. Autoregressive Task  
3. **Autoregressive Language Models** are trained to predict the next token in a sentence, based on the previous tokens in the phrase. These models correspond to the **decoder** part of the transformer model. A mask is applied on the full sentence so that the attention head can only see the tokens that came before. These models are ideal for text generation. Example: **GPT**
4. **Autoencoding Language Models** are trained to reconstruct the original sentence from a corrupted version of the input. These models correspond to the **encoder** part of the transformer model. The full input is passed and no attention mask hides later tokens, so autoencoding models create a bidirectional representation of the whole sentence. They can be fine-tuned for a variety of tasks, but their main application is sentence classification or token classification. Example: **BERT**
5. **Combinations of autoregressive and autoencoding language models** (encoder-decoder models) are more versatile and flexible in generating text. Such combined models can generate more diverse and creative text in different contexts than pure decoder-based autoregressive models, because the encoder lets them capture additional context. Example: **T5**
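
The difference between the two training setups can be sketched with attention masks. In the toy code below, a 1 at row `i`, column `j` means token `i` is allowed to attend to token `j`. The helper names and the `[MASK]` placeholder are assumptions made for this illustration (BERT uses a token of that name; other autoencoding models may corrupt the input differently).

```python
def causal_mask(n):
    """Autoregressive setup: token i may only see tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """Autoencoding setup: every token may see every other token."""
    return [[1] * n for _ in range(n)]


def corrupt(tokens, positions, mask_token="[MASK]"):
    """Corrupt the input by hiding the tokens at the given positions.

    An autoencoding model is trained to reconstruct the hidden tokens
    from the surrounding (left and right) context.
    """
    return [mask_token if i in positions else t for i, t in enumerate(tokens)]


tokens = ["the", "cat", "sat", "down"]

print("causal (autoregressive):")
for row in causal_mask(len(tokens)):
    print(row)

print("full (autoencoding):")
for row in full_mask(len(tokens)):
    print(row)

print(corrupt(tokens, positions={1}))
```

The causal mask is lower-triangular, which is what prevents a decoder from looking at future tokens during training.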

## **What are LLMs?**
1. Usually (but not necessarily) derived from the Transformer architecture by training on large amounts of text data.
2. Designed to understand and generate human language, code, and much more.
3. Highly parallelized and scalable.
4. Examples: BERT, GPT, and T5
5. Techniques like stop word removal, stemming, and truncation are neither used nor necessary for LLMs. LLMs are designed to handle the inherent complexity and variability of human language, including the use of stop words and variations in word forms like tenses and misspellings.
6. Every LLM on the market has been **pre-trained** on a large corpus of text data and on specific language-modeling-related tasks.
7. **Remember:** How an LLM is **pre-trained** and **fine-tuned** makes all the difference.
8. **How to decide whether to train our own embeddings or use pre-trained embeddings?** - A good rule of thumb is to compute the vocabulary overlap. If the overlap between the vocabulary of our custom domain and that of the pre-trained word embeddings is significant, pre-trained word embeddings tend to give good results.
9. **One more important factor to consider while deploying models with embeddings-based feature extraction approach:** - Remember that learned or pre-trained embedding models have to be stored and loaded into memory while using these approaches. If the model itself is bulky, we need to factor this into our deployment needs.
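
The vocabulary-overlap rule of thumb can be computed in a few lines. The sketch below assumes overlap is defined as the fraction of distinct domain tokens that also appear in the pre-trained vocabulary; the function name, the toy vocabularies, and any threshold you might apply to the result are assumptions, not something the notes prescribe.

```python
def vocabulary_overlap(domain_tokens, pretrained_vocab):
    """Fraction of distinct domain tokens found in the pre-trained vocabulary."""
    domain_vocab = set(domain_tokens)
    if not domain_vocab:
        return 0.0
    shared = domain_vocab & set(pretrained_vocab)
    return len(shared) / len(domain_vocab)


# Toy data: tokens from a hypothetical medical corpus versus a tiny
# hypothetical pre-trained vocabulary.
domain_tokens = ["patient", "dose", "mg", "tachycardia", "dose"]
pretrained_vocab = {"patient", "dose", "the", "mg"}

overlap = vocabulary_overlap(domain_tokens, pretrained_vocab)
print(f"overlap: {overlap:.2f}")
```

Here three of the four distinct domain tokens are covered, so the overlap is 0.75; "tachycardia" is the kind of domain-specific term that pre-trained embeddings may not represent well.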

## **Pre-Training, Transfer Learning and Fine-Tuning**
<img style="float: right;" width="400" height="400" src="data/images/transfer_learning.jpeg">

1. **Pre-training** of an LLM happens on a large corpus of text data and on specific language-modeling-related tasks. During this phase the LLM learns general language and the relationships between words.
2. **Transfer Learning** is a technique used in machine learning to leverage the knowledge gained from one task to improve performance on another related task. A pre-trained model has already learned a lot about the language and the relationships between words, and this information can be used as a starting point to improve performance on a new task.  
    **a.** Transfer Learning for LLMs involves taking an LLM that has been pre-trained on one corpus of text data and then fine-tuning it for a specific downstream task, such as text classification or text generation, by updating the model's parameters with task-specific data.  
    **b.** Transfer Learning allows LLMs to be **fine-tuned** for specific tasks with much smaller amounts of task-specific data than would be required if the model were trained from scratch. This greatly reduces the amount of time and resources required to train LLMs.  
<img style="float: right;" width="400" height="400" src="data/images/fine_tuning_loop.jpeg">
3. **Fine-tuning** involves training the LLM on a smaller, task-specific dataset to adjust its parameters for the specific task at hand. The basic fine-tuning loop is more or less the same every time:  
    **a.** Define the model you want to fine-tune as well as the fine-tuning parameters (e.g. learning rate).  
    **b.** Aggregate some training data.  
    **c.** Compute the loss and its gradients (via backpropagation).  
    **d.** Update the model's parameters using those gradients.  
4. The Transformers package from Hugging Face provides a neat and clean interface for training and fine-tuning LLMs.
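
The fine-tuning loop above can be sketched without any deep learning library. The toy example below is deliberately not an LLM: a two-parameter linear model stands in for the network, its starting weights stand in for "pre-trained" parameters, and the gradients are written out by hand instead of being computed by backpropagation. All names and numbers are assumptions made for illustration.

```python
def fine_tune(w, b, data, learning_rate=0.05, epochs=500):
    """Gradient-descent loop for the toy model y = w * x + b.

    w, b          : starting ("pre-trained") parameters
    data          : small task-specific dataset of (x, y) pairs
    learning_rate : the fine-tuning hyperparameter from step (a)
    """
    history = []
    n = len(data)
    for _ in range(epochs):
        # Step (c): compute the mean squared error loss and its gradients.
        loss = sum((w * x + b - y) ** 2 for x, y in data) / n
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Step (d): update the parameters against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(loss)
    return w, b, history


# Step (a): start from existing parameters instead of random ones.
pretrained_w, pretrained_b = 1.0, 0.0
# Step (b): a small dataset for the new task, generated from y = 2x + 1.
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b, history = fine_tune(pretrained_w, pretrained_b, task_data)
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.3f}, last loss={history[-1]:.6f}")
```

With a real LLM the structure is the same, but the model, loss, gradient computation, and optimizer step would come from a framework rather than being written by hand.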

## **Popular Modern LLMs**

#### **1. BERT (Bidirectional Encoder Representations from Transformers)**
<img style="float: right;" width="300" height="300" src="data/images/bert_oov.jpeg">

1. By Google - Autoencoding Language Model
2. Individual NLP tasks have traditionally been solved by individual models created for each specific task. That is, until BERT!
3. Tasks - BERT can solve 11+ NLP tasks such as sentiment analysis, named entity recognition, etc.
4. Pretrained on:  
    **a.** English Wikipedia - At the time 2.5 billion words  
    **b.** Book Corpus - 800 million words  
5. BERT's tokenizer handles OOV tokens (out of vocabulary / previously unknown) by breaking them up into smaller chunks of known tokens.
6. Trained on two language modeling specific tasks:  
    **a.** **Masked Language Modeling (MLM) aka Autoencoding Task** - Helps BERT recognize token interaction within the sentence.    
    **b.** **Next Sentence Prediction (NSP) Task** - Helps BERT to understand how tokens interact with each other between sentences.  
<img style="float: right;" width="300" height="300" src="data/images/bert_language_model_task.jpeg">
7. BERT uses three types of embeddings for a given piece of text, which are summed for each token: Token Embedding, Segment Embedding and Position Embedding.
8. BERT uses the Transformer's encoder and ignores the decoder, which makes it exceedingly good at processing and understanding large amounts of text very quickly relative to slower LLMs that focus on generating text one token at a time.
9. BERT itself doesn't classify text or summarize documents but it is often used as a pre-trained model for downstream NLP tasks. 
<img style="float: right;" width="300" height="300" src="data/images/bert_classification.jpeg">
10. A year later, RoBERTa from Facebook AI showed that the NSP task is not required: it matched and even beat the original BERT model's performance in many areas.
11. Reference: [Click here to read more](https://huggingface.co/blog/bert-101)
12. BERT Implementation: [Click here to learn how to use BERT](https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)
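
The way BERT's tokenizer breaks an out-of-vocabulary word into known pieces can be sketched as a greedy longest-match-first search, the strategy used by WordPiece. The tiny vocabulary below is invented for the example, and the function is a simplified illustration rather than the library's actual implementation; the `##` prefix marking a piece that continues a word and the `[UNK]` fallback follow BERT's conventions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split a word into the longest sub-pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece matches: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has roughly 30,000 entries.
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because unknown words are rebuilt from known pieces, the model rarely has to fall back to `[UNK]`, which only happens here when no piece of the word is in the vocabulary.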

#### **2. GPT (Generative Pre-Trained Transformer)**

1. By OpenAI - Autoregressive Language Model.
2. Pretrained on: Proprietary data (data for which the rights of ownership are restricted, so that the ability to freely distribute it is limited).
3. Autoregressive Language Model that uses attention to predict the next token in a sequence based on the previous tokens.
4. GPT relies on the decoder portion of the Transformer and ignores the encoder to become exceptionally good at generating text one token at a time.
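
Generating text one token at a time can be illustrated with a much simpler autoregressive model. The sketch below uses a toy bigram model that conditions only on the single previous token, whereas GPT uses attention over all previous tokens; the corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, max_tokens=10):
    """Greedy decoding: repeatedly append the most likely next token."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break
        next_token = candidates.most_common(1)[0][0]
        if next_token == "</s>":  # end-of-sequence marker
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


toy_corpus = ["the cat sat", "the cat sat", "the dog ran"]
bigram_counts = train_bigram(toy_corpus)
generated = generate(bigram_counts)
print(" ".join(generated))
```

The loop has the same shape as GPT-style decoding: predict a distribution over the next token, pick one, append it, and repeat until an end marker or a length limit is reached.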

#### **3. T5 (Text to Text Transfer Transformer)**
<img style="float: right;" width="400" height="400" src="data/images/t5.jpeg">

1. By Google - Combination of Autoencoder and Autoregressor Language Model.
2. Tasks: T5 can solve tasks such as summarization, translation, Q&A, and text classification
3. T5 uses both encoder and decoder of the Transformer to become highly versatile in both processing and generating text.
4. T5-based models can handle a wide range of NLP tasks, from text classification to text generation.
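
T5 treats every task as text in, text out, and the original T5 setup tells the model which task to perform by prepending a short text prefix to the input. The sketch below only shows that input formatting; the dictionary keys and the helper name are assumptions made for this example, while the two prefix strings are ones used by the original T5 models.

```python
# Keys are hypothetical labels for this sketch; values are T5 task prefixes.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}


def to_text_to_text(task, text):
    """Format an input the text-to-text way: task prefix plus raw text."""
    return TASK_PREFIXES[task] + text


summary_input = to_text_to_text("summarization", "LLMs are pre-trained on large corpora.")
translation_input = to_text_to_text("translation_en_to_de", "The house is wonderful.")
print(summary_input)
print(translation_input)
```

Since the task is encoded in the input text itself, a single model with a single set of weights can serve all of these tasks.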

#### **4. Domain Specific LLMs**

1. BioGPT - A GPT-style model developed by Microsoft Research and pre-trained on large-scale biomedical literature (about 15 million PubMed abstracts).
2. SciBERT
3. BlueBERT

## **Prompt Engineering**
1. Popular LLMs: GPT-3, GPT-4, ChatGPT, Coral, GPT-J, FLAN-T5, etc...
2. If you are wondering what is the best way to talk to ChatGPT and GPT-4 to get optimal results, we will cover that under **Prompt Engineering**.
3. **Prompt Engineering** involves crafting prompts that effectively communicate the task at hand to the LLM, leading to accurate and useful outputs.
4. A few language models that have been specifically designed and trained to be aligned with instructional prompts are GPT-3, GPT-4, ChatGPT (closed-source models from OpenAI), FLAN-T5 (an open-source model from Google) and Cohere's Command series (closed-source).
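
A prompt that communicates the task clearly often combines an instruction, a few worked examples, and the new input. The helper below is one possible way to assemble such a prompt as plain text; the `Input:`/`Output:` layout and the function name are assumptions made for this sketch, not a format required by any particular model.

```python
def build_prompt(instruction, examples, query):
    """Assemble an instruction, few-shot examples, and a new query."""
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # End with an open "Output:" so the model continues with the answer.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Classify the sentiment of each review as Positive or Negative.",
    examples=[
        ("I loved this movie.", "Positive"),
        ("The plot was dull.", "Negative"),
    ],
    query="The acting was superb.",
)
print(prompt)
```

The resulting string would be sent to an instruction-aligned model as its input; passing an empty `examples` list turns this into a zero-shot prompt.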

## **Applications:**
#### **1. Medical Domain**
1. Electronic Medical Record (EMR) Processing
2. Clinical Trial Matching
3. Drug Discovery

#### **2. Finance**
1. Fraud Detection
2. Sentiment Analysis of Financial News
3. Trading Strategies
4. Customer Service Automation via Chatbots and Virtual Assistants

#### **3. And many more**
1. Text Classification
2. Text Summarization
3. Chatbots
4. Information Retrieval

## **Quick Summary**
1. What really sets Transformers apart from other deep learning architectures is their ability to capture long-term dependencies and relationships between tokens using the attention mechanism.
2. Attention is the crucial component of the Transformer.
3. Another factor behind the Transformer's effectiveness as a language model is that it is highly parallelizable, allowing for faster training and efficient processing of text.
4. LLMs are usually (but not necessarily) derived from the Transformer architecture by training on large amounts of text data.
5. Designed to understand and generate human language, code, and much more.
6. LLMs are pre-trained on large corpus and fine-tuned on smaller datasets for specific tasks.
7. Popular LLMs: GPT-3, GPT-4, ChatGPT, Coral, GPT-J, FLAN-T5, etc...
8. If you are wondering what is the best way to talk to ChatGPT and GPT-4 to get optimal results, we will cover that under **Prompt Engineering**.

## **What Next? How to use LLMs?**

Given a business problem, ask yourself:
1. What NLP task does it map to?
    - Text Classification
    - Token Classification
    - Text Generation
    - Fill-Mask
    - Conversational
    - Sentence Similarity
    - Question Answering
    - Summarization
    - Table Q&A
    - Translation
    - Zero-Shot Classification
2. Given the task, what model(s) work for that task?

**Example:**
> **Business Problem:** Generate a news feed for an app that users can scroll through.  
> **Mapping to an NLP task:** Given a news article, a standard NLP task is summarization.
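
Once the task is known, it can be turned into the task identifier string that the Hugging Face Transformers `pipeline` function accepts. The lookup below is a small sketch of that step: the dictionary keys are normalized versions of the task names listed above (an assumption of this example), and "Conversational" and "Sentence Similarity" are left out on purpose because they do not map as cleanly to a single pipeline identifier.

```python
# Task names from the list above (lower-cased) mapped to pipeline
# task identifiers used by Hugging Face Transformers.
PIPELINE_TASKS = {
    "text classification": "text-classification",
    "token classification": "token-classification",
    "text generation": "text-generation",
    "fill-mask": "fill-mask",
    "question answering": "question-answering",
    "summarization": "summarization",
    "table q&a": "table-question-answering",
    "translation": "translation",
    "zero-shot classification": "zero-shot-classification",
}


def pipeline_task_for(task_name):
    """Return the pipeline identifier for a human-readable task name."""
    key = task_name.strip().lower()
    if key not in PIPELINE_TASKS:
        raise KeyError(f"no pipeline identifier recorded for {task_name!r}")
    return PIPELINE_TASKS[key]


# The news-feed example maps to summarization.
news_feed_task = pipeline_task_for("Summarization")
print(news_feed_task)
```

The returned string would typically be passed as the first argument to `pipeline` from the `transformers` package, which then loads a model suited to that task.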

Before we get into how to solve problems like the one above, a quick note on the NLP ecosystem:

| Popular Tools | Utility |
| :---: | :---: |
| **Hugging Face Transformers** | Pre-trained models and Pipelines |
| **NLTK** | Classical NLP + corpora |
| **SpaCy** | Production grade NLP, especially NER |
| **Gensim** | Classical NLP + Word2Vec |
| **OpenAI** | ChatGPT, Whisper |
| **Spark NLP** | Scale-out, production-grade NLP |
| **LangChain** | LLM Workflows |