# An Introduction to Large Language Models


To convert this notebook to slides and show:

`jupyter nbconvert LLMs.ipynb --to slides --TemplateExporter.exclude_input=True --post serve`



* Language modelling
* The training and use of ChatGPT on paper
* Using LLMs
* Prompt engineering and fine-tuning


## Language Models





<img src="img/whatislm.jpg" alt="What is Language Model" width="700"/>


### Language modelling

Language modelling is the task of assigning a probability to sentences in a language. 
    
* Given a sequence of words $(w_1, w_2, w_3, \ldots ,w_{T})$ of length $T$, a language model assigns a probability 
    $P(w_{1}, w_{2}, \ldots ,w_{T})$ to the whole sequence. 
$
\ \ \ \ \ \ \ \ \ P(w_{1}, w_{2}, \ldots ,w_{T}) = ?
$
    
* This is equivalent to assigning a probability to a word following a sequence of words:

$
\ \ \ \ \ \ \ \ \ \ \ \ \ 
P(w_1,w_2,\ldots,w_{T}) = P(w_1,w_2,\ldots,w_{T-1}) \times P(w_T|w_1,w_2,\ldots,w_{T-1})
$

$
\ \ \ \ \ \ \ 
\Rightarrow P(w_T|w_1,w_2,\ldots,w_{T-1}) = \dfrac{P(w_1,w_2,\ldots,w_{T})}{P(w_1,w_2,\ldots,w_{T-1})}  
$
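
As a small illustration of the chain rule, the sketch below multiplies conditional probabilities to score a whole sentence. Dividing the probability of the full sequence by that of its prefix then gives the conditional probability of the last word. The table `COND_PROB` and its numbers are made up for this example.

```python
from math import prod

# Hypothetical conditional probabilities P(next word | history) for one sentence.
# Keys are (history, next_word); the numbers are invented for illustration.
COND_PROB = {
    ((), "the"): 0.20,
    (("the",), "cat"): 0.05,
    (("the", "cat"), "sat"): 0.10,
}


def sequence_probability(words):
    """Multiply P(w_t | w_1..w_{t-1}) over all positions t (the chain rule)."""
    return prod(COND_PROB[(tuple(words[:t]), words[t])] for t in range(len(words)))


p_full = sequence_probability(["the", "cat", "sat"])
p_prefix = sequence_probability(["the", "cat"])
# Joint probability of the sentence divided by that of its prefix: P(sat | the, cat).
p_next = p_full / p_prefix
```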


**A language model is a probability distribution over sequences of words.**   

**Language modelling is essentially a classification problem**

**The Generative part: Decoding**

Taking the cat sentence as an example:

* Input: The cat sat on the ...
* Output: [mat(0.21), rug(0.17), chair(0.08), stairs(0.02), ... floor(0.005)]

Which word should the model return?

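* the most likely one: *mat* has the highest probability
* top-k or top-p: sample only from the most likely candidates, e.g. *[mat(0.21), rug(0.17), chair(0.08)]*
* randomly sample over the whole distribution, so even unlikely words can be returned: *[rug(0.17), floor(0.005)]*

**Temperature** controls the randomness of the output: low values favour the most likely words, high values flatten the distribution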
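
The decoding options above can be sketched in a few lines. This is a minimal illustration using the probabilities from the cat example, not how any particular model implements decoding. The function names are chosen for this sketch, and the distribution is renormalised over the listed words only.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution as p ** (1 / T), renormalised. T < 1 sharpens, T > 1 flattens."""
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


def top_k(probs, k):
    """Keep the k most likely words and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {w: p / total for w, p in kept}


def top_p(probs, p_threshold):
    """Keep the smallest set of most likely words whose total mass reaches p_threshold."""
    kept, mass = [], 0.0
    for w, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((w, p))
        mass += p
        if mass >= p_threshold:
            break
    return {w: p / mass for w, p in kept}


def sample(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Next-word probabilities from the slide; the remaining mass belongs to words not listed.
next_word = {"mat": 0.21, "rug": 0.17, "chair": 0.08, "stairs": 0.02, "floor": 0.005}

greedy = max(next_word, key=next_word.get)        # most likely word
shortlist = top_k(next_word, k=3)                 # mat, rug, chair
nucleus = top_p(next_word, p_threshold=0.35)      # mat, rug (0.21 + 0.17 = 0.38)
sharper = apply_temperature(next_word, temperature=0.5)
choice = sample(shortlist, random.Random(0))
```

Greedy decoding is deterministic, while the sampling variants trade some likelihood for variety.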


### Examples of using language models

**word suggestions** when typing on your phone or Google search

**speech recognition**: given a sequence of acoustic symbols $A$, find the most probable string of words $W$

$\ \ \ \ \hat{W} = \underset{W}{argmax}\ P(W|A)$ 

Apply Bayes' rule $P(W|A) = \dfrac{P(W) \times P(A|W)}{P(A)}$ and drop $P(A)$, which does not depend on $W$:

$
\ \ \ \ \hat{W} = \underset{W}{argmax}\ P(W) \times P(A|W)
$
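
A toy version of this decision rule is sketched below. The candidate transcriptions and all scores are invented. `language_model` stands for $P(W)$ and `acoustic_model` for $P(A|W)$.

```python
# Hypothetical scores for three candidate transcriptions of the same audio A.
# language_model[W] stands for P(W); acoustic_model[W] stands for P(A | W).
language_model = {
    "recognise speech": 0.0010,
    "wreck a nice beach": 0.0001,
    "recognise peach": 0.00005,
}
acoustic_model = {
    "recognise speech": 0.30,
    "wreck a nice beach": 0.35,
    "recognise peach": 0.20,
}


def decode(candidates):
    """Return argmax_W P(W) * P(A | W); P(A) is the same for every W, so it is left out."""
    return max(candidates, key=lambda w: language_model[w] * acoustic_model[w])


best = decode(language_model)
```

Here the acoustically best candidate loses because the language model considers it far less likely.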

**Translations**

The speech-to-text recogniser should decide in favor of the word string $\hat{W}$ satisfying

$\ \ \ \ \hat{W} = \underset{W}{argmax}\ P(W|A)$

where $P(W|A)$ is the probability that the words $W$ were spoken, given that the evidence $A$ was observed.

$\ \ \ \ A=a_1, a_2, \ldots, a_m \ \ \ \ \ \ a_i\in{\mathcal{A}}$
denotes a sequence of acoustic symbols from audio signals

$\ \ \ \ W = w_1,w_2,\ldots, w_{n} \ \ \ \ \ \ w_i \in{\mathcal{W}}$
denotes a string of $n$ words, each belonging to a fixed vocabulary $\mathcal{W}$

The same holds for the machine translation problem: 
the output is a probability distribution over the target vocabulary, and   
the model has computational access to the history.

This language model part is also called the **decoder**, which we will touch on later.

### Traditional approach to language modelling

<img src="img/prehistory_dinosaur_02.png" alt="history" width=100>

* Counting in a large corpus of text ~ Law of large numbers!

* An n-gram is a chunk of n consecutive words.
* 4-gram example (condition only on the three previous words):  

$
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(w_T=m|w_1,w_2,\ldots,w_{T-3}, w_{T-2},w_{T-1}) \\ 
\ \ \ \ \ \ \ \ \ \approx P(w_T=m|w_{T-3},w_{T-2},w_{T-1})
$
* Collect statistics on different n-grams' frequency to estimate: 

$
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{p}(w_T=m|w_{T-3}, w_{T-2},w_{T-1}) =\dfrac{\#(w_{T-3},w_{T-2},w_{T-1}, w_{T})}{\#(w_{T-3},w_{T-2},w_{T-1})} 
$

* Smoothing techniques and other tricks to deal with sparsity problems
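
The counting estimate can be sketched as follows, on a made-up two-sentence corpus. The helper names are chosen for this example. Note how an unseen continuation gets probability zero, which is the sparsity problem that smoothing addresses.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    numerator = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    denominator = ngram_counts(tokens, len(context))[context]
    return numerator / denominator if denominator else 0.0


corpus = "the cat sat on the mat . the cat sat on the rug .".split()
p_mat = conditional_probability(corpus, ["sat", "on", "the"], "mat")
p_sofa = conditional_probability(corpus, ["sat", "on", "the"], "sofa")  # never seen
```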




### Neural Networks for language modelling

**MLP**: n-gram of words as input (fixed window) and the probability distribution over the next word as output

<img src="img/multiclass_softmax.webp" alt="Last layer of classification with softmax" width=700>

[Ref: A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

**RNNs, LSTM, GRU**
<img src="img/rnn_01.png" alt="RNN" width=600>

---
**Sequence to sequence model**
|seq2seq model  | seq2seq unrolled over time|
|:---:|:---:|
|<img src="img/seq2seq_simple.webp" alt="seq2seq model" width="300"/>|<img src="img/seq2seq.webp" alt="seq2seq model" width="450"/>|

[Ref: Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)


**RNN Advantages**:
* Can process any length input
* Computation for step t can (in theory) use information from many steps back
* Model size doesn’t increase for longer input context
* Same weights applied on every timestep, so there is symmetry in how inputs are processed.

**RNN Disadvantages**:
* Recurrent computation is slow
* In practice, difficult to access information from many steps back
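
A minimal sketch of a vanilla RNN forward pass makes these points concrete: the same weights are reused at every step, the hidden state has a fixed size whatever the input length, and the loop cannot be parallelised across time. The weights and sizes are made up, and a bias term is omitted for brevity.

```python
import math


def rnn_step(hidden, x, w_hh, w_xh):
    """One recurrent update: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    return [
        math.tanh(
            sum(w * h for w, h in zip(w_hh[i], hidden))
            + sum(w * v for w, v in zip(w_xh[i], x))
        )
        for i in range(len(hidden))
    ]


def run_rnn(inputs, w_hh, w_xh):
    """Apply the same weights at every timestep; each step needs the previous one."""
    hidden = [0.0] * len(w_hh)
    for x in inputs:
        hidden = rnn_step(hidden, x, w_hh, w_xh)
    return hidden


# Made-up weights: 2 hidden units, 3-dimensional inputs.
W_HH = [[0.5, -0.1], [0.2, 0.3]]
W_XH = [[0.1, 0.0, -0.2], [0.0, 0.4, 0.1]]

short = run_rnn([[1.0, 0.0, 0.0]] * 2, W_HH, W_XH)
long = run_rnn([[1.0, 0.0, 0.0]] * 50, W_HH, W_XH)
```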

### Transformer Architecture


<img src="img/transformer.png" alt="Simplified Transformer" width="400"/>

[Ref: Attention Is All You Need](https://arxiv.org/abs/1706.03762)


Transformer overview

|  |  |
|:---:|:---:|
|<img src="img/tf1.png" alt="Transformer step by step" width="400"/>|<img src="img/tf2.png" alt="Transformer step by step" width="400"/>|
|<img src="img/tf4.png" alt="Transformer step by step" width="400"/>|<img src="img/tf3.png" alt="Transformer step by step" width="400"/>|


**The Transformer Architecture**

* Removes the RNN completely
* Keeps the encoder-decoder architecture, and uses attention within both parts and between them
* Uses ResNet-style skip connections to train deeper networks
* Uses positional encoding to represent each token's position
* Uses an input embedding layer to learn a vector representation of each word  
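
The attention operation at the heart of the architecture can be sketched in plain Python. This is a simplified, single-head illustration of scaled dot-product attention from the referenced paper. The token vectors are made up, and the learned projection matrices of a real Transformer are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up token vectors. In self-attention Q, K and V all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Every token attends to every other token in one step, with no recurrence, which is what allows the computation to run in parallel.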

---

* More details on training of the RNN and Transformer models will be covered in the future
* Previous talks on these topics from [**Deep learning Guild**](https://www.google.com) and [**NLP Guild**](https://www.google.com)

### Large Language Models (LLMs)

* Take advantage of parallel computing, which the transformer architecture makes possible
* Trained on massive datasets of text (**trillions of tokens**, Internet scale)
* Transformers with **billions of parameters**
* Thousands of the latest GPUs, months of training, and costs in the millions

___
* Implicitly learn the syntax and semantics of human language, along with general knowledge and an "understanding" of the world
* Become general purpose models that excel at a wide range of tasks


### Types of LLMs

<img src="img/TypeOfLLMs.png" alt="Type of LLMs" width=800>

* Discriminative LM: predicts the next word in a sequence based on the previous words
* Generative LM: generates text by sampling from the probability distribution over sequences of words 



## The Training of ChatGPT on paper

* Pre-training
* Supervised Fine Tuning
* Reinforcement Learning from Human Feedback (RLHF)

<img src="img/chatgpt-training_1.png" alt="Training of chatGPT" width=700>

[Ref: Chip Huyen's Blog](https://huyenchip.com/2023/05/02/rlhf.html)

### Pre-training

![Alt_text](img/chatGPT_training04.png "Generative Pretraining")

* Self-supervised learning on unlabeled data: the next token in the text serves as the label
* Trained on a vast amount of data (1 trillion tokens are equivalent to 15 million books)
* On the current trend of data consumption, Internet data will run out within the next few years
* Many companies have changed their data terms to prevent others from scraping their data for LLMs
* 99% of the computing time and FLOPs are used in this pre-training step
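
How raw text can supply its own training signal is sketched below: each token becomes the target for the tokens before it, and the loss is the negative log-probability given to the true next token. The `predicted` distribution is a made-up model output, and the function names are chosen for this example.

```python
import math


def next_token_pairs(tokens):
    """Turn raw text into (context, target) training pairs; no human labels are needed."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted_probs, target):
    """Loss for one prediction: negative log of the probability given to the true next token."""
    return -math.log(predicted_probs[target])


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

# A hypothetical model output for the context ["the", "cat", "sat", "on", "the"].
predicted = {"mat": 0.21, "rug": 0.17, "chair": 0.08}
loss = cross_entropy(predicted, target="mat")
```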
  

[Ref: LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/pdf/2302.13971.pdf)

**Shoggoth with a smiley face analogy**
* The pretrained model is an untamed monster
* This monster was then finetuned on higher quality data
* Then the finetuned model was further polished using RLHF to make it customer-appropriate

<img src="img/shoggoth.jpg" alt="Shoggoth with smiley face" width=400>


**Alignment** - Steering LLMs to intended goals and interests

* The pretrained model is an untamed monster because 
  * it learned the powerful general representations
  * it's not trained on specific useful tasks
  * it was trained on indiscriminate data scraped from the Internet: misinformation, propaganda, conspiracy theories, or attacks against certain demographics.
* This monster needs to be finetuned on higher quality data (StackOverflow, Quora and human annotations), which makes it somewhat socially acceptable.
* Then the finetuned model is further polished using RLHF to make it customer-appropriate. Finally, we get a smiley face.

### Supervised Fine Tuning

* To optimize the pretrained model to generate useful responses
* To show the language model examples of how to appropriately respond
* To relieve users of the burden of designing their own prompts ([GPT3 is a few-shot learner](https://arxiv.org/abs/2005.14165))
* OpenAI hired 40+ high-quality labelers to create around 13,000 **(prompt, response) pairs** for InstructGPT.
* Prompts are designed for different use cases (e.g. question answering, summarization, translation)

Example training dataset: [Training language models to follow instructions with human feedback, page 26~33](https://arxiv.org/pdf/2203.02155.pdf)


<img src="img/ChatGPT_training01.png" alt="SFV" width=400>

First part of the alignment is the Supervised Fine Tuning:
* High quality labelers provide demonstrations of the desired  
behaviour on the input prompt distribution

* Fine-tune a pretrained GPT-3 model on this data using supervised learning.

<img src="img/ChatGPT_SFT.png" alt="Supervised Fine Tunning and RLHF" width=700>

[Ref: Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

### Reinforcement Learning from Human Feedback (RLHF)

**The goal of Alignment**

Given a particular history, the objective is to maximise the probability 
the model assigns to the sequence of tokens in the corresponding response.

<img src="img/imitation_game.jpeg" alt="The goal of Alignment" width="700"/>


**The goal of Alignment** is kind of **Brainwashing**.

**RLHF** was originally developed for training simple robots in simulated environments and Atari
games; it has recently been applied to fine-tuning language models. 

Teach the model by having it mimic how humans respond in conversations.  
The model then creates an **expert policy**, which acts like a rule book for how the model should respond to requests. 

* Step 1: Collect comparison data.
  * For a given input, use the existing model to generate several outputs 
  * Labelers then indicate which outputs they prefer by ranking them
* Step 2: Train a reward model to predict the human-preferred output.
* Step 3: Optimize a policy against the reward model using PPO. 

The output of the reward model is a scalar reward. The supervised policy is fine-tuned to optimize this reward using the PPO algorithm.



#### Reward model

<img src="img/reward_model.jpeg" alt="Reward model" width="500"/>

**Logistic Regression for Scorecard**
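
One way to read the logistic regression remark: the reward model can be trained on pairs of outputs with a logistic loss on the difference between their scalar rewards, which is the comparison loss described in the InstructGPT paper. The sketch below shows that loss for a single pair. The function name and the reward values are made up.

```python
import math


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the preferred output scores higher."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


# The reward model agrees with the labeler: low loss.
loss_agrees = pairwise_loss(reward_preferred=2.0, reward_rejected=-1.0)
# The reward model ranks the pair the wrong way round: high loss.
loss_disagrees = pairwise_loss(reward_preferred=-1.0, reward_rejected=2.0)
```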

#### Proximal Policy Optimization
* A policy is a mapping from state space to action space
* PPO is a policy gradient algorithm that updates an existing policy to gain more reward

<img src="img/policy_model.png" alt="Policy model" width="600"/>

[ref: How ChatGPT is Trained](https://www.youtube.com/watch?v=VPRSBzXzavo)

**PPO**

* Policy: A policy, in Reinforcement Learning terminology, is a mapping from state space to action space. It can be thought of as instructions for the RL agent: which action to take given the current state of the environment.

* PPO is a policy gradient optimization algorithm: at each step it updates the existing policy to improve the expected reward. It ensures that the update is not too large, that is, the new policy does not differ too much from the old one.
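
The "not too large" constraint is commonly implemented with PPO's clipped surrogate objective, sketched here for a single sample. The value `epsilon=0.2` is an assumed, commonly used default rather than something stated in these slides, and the ratios and advantages are made up.

```python
def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(action | state) / pi_old(action | state)
inside = clipped_objective(ratio=1.1, advantage=2.0)      # within the clip range
capped = clipped_objective(ratio=1.5, advantage=2.0)      # gain is capped at 1.2 * 2.0
penalised = clipped_objective(ratio=1.5, advantage=-2.0)  # a worse outcome is not clipped
```

Once the ratio leaves the range, the objective stops rewarding further movement in that direction, so the new policy has no incentive to drift far from the old one.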

### Performance

<img src="img/PPO_performance_02.png" alt="Human evaluations of various models' performance" width=700>

[Ref: Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

Recap

<img src="img/chat_GPT_training_pipeline.png" alt="Training of chatGPT" width=700>

[Ref: Andrej Karpathy's Microsoft BUILD Talk: State of GPT](https://www.youtube.com/watch?v=bZQun8Y4L2A)

## The Process of using ChatGPT

<img src="img/usingChatGPT.png" alt="Using ChatGPT" width="500"/>

**Content Moderation!**

OpenAI's **Moderation** API

The moderations endpoint can check whether content complies with OpenAI's usage policies. 

The model classifies content into the following categories:
* hate
* hate/threatening
* harassment
* self-harm, self-harm/intent, self-harm/instructions
* sexual, sexual services, sexual/minors
* violence, violence/graphic



## Using LLMs 

### Tasks
* Prompting: Generate text or even images directly
* Embedding: Extract semantic information from unstructured data for building applications

### Applications
* Search
* Generative
* Summarise, Rewrite, Extract
* Classify, Cluster

### Ways of using LLMs

#### Model-as-a-Service via API

* OpenAI API examples:
  * Chat: given a conversation, return a response
  * Completions: Given a prompt, return multiple predicted completions, with probabilities of alternative tokens at each position
  * Edits: given a prompt and an instruction, the model will return an edited version of the prompt
  * Image: create, edit, modify images
  * Embeddings: get a vector representation of a given input that can be easily consumed by machine learning models and algorithms
* Google has similar APIs: PaLM, Imagen, Codey, Chirp, and Embeddings

#### Fine-tune an existing third-party Model in a managed environment via API

#### Open-source model in a managed environment


**Model-as-a-Service via API** 

The Embeddings API turns an input into a vector that captures its meaning.
These vectors support tasks such as search, clustering, recommendations, and anomaly detection.
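
As a rough sketch of embedding-based search, the example below ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are invented stand-ins, since a real Embeddings API returns much longer vectors. No API call is made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vector, index, top_n=2):
    """Rank documents by cosine similarity between their embedding and the query embedding."""
    ranked = sorted(
        index, key=lambda doc: cosine_similarity(query_vector, index[doc]), reverse=True
    )
    return ranked[:top_n]


# Made-up embeddings keyed by document title.
index = {
    "How to train a puppy": [0.9, 0.1, 0.0],
    "Kitten care basics": [0.7, 0.3, 0.1],
    "Quarterly tax filing guide": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]  # stands in for the embedding of "pet training tips"
results = search(query, index)
```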


**Fine-tune an existing third-party Model in a managed environment via API**

**Advantages for both above approaches**
* Low barrier to entry and convenient to implement 
* Access to the latest, the largest and most sophisticated LLMs

**Limitations**
* Data residency and privacy
* Potentially higher cost
* Dependency on third party


**Open-source model in a managed environment**

**Advantages**
* Wide range of choice
* Potentially lower cost
* Independence

**Tradeoffs**
* Complexity: Setting up and maintaining an LLM requires data science and engineering expertise. 
* Smaller scale and a narrower range of capabilities 


## Prompt Engineering

Steering an LLM’s behavior towards a particular outcome without updating the model’s weights/parameters.


### Prompting Design
Effectively communicating with LLMs to get desired results


**Prompts**: The text fed to the model

**Prompt engineering**, also known as in-context prompting, is a method for steering an LLM’s behavior towards a particular outcome without updating the model’s weights/parameters. It’s the process of effectively communicating with LLMs to get desired results. Prompt engineering is used on a variety of tasks from question answering to arithmetic reasoning.

Prompts are a set of text instructions that LLMs receive to generate a response or complete a task. There are several types of prompt, such as summarizing, inferring or transforming. Prompt engineering aims to craft these prompts so that the model's outputs are accurate and relevant.

The two most common types of prompting are zero-shot and few-shot prompting.

### Zero-shot Prompting

### Few-shot Prompting (In-context Learning)

### Chain-of-Thought Prompting 

**Zero-shot Prompting**

Zero-shot learning involves feeding the task to LLMs without any examples that indicate the desired output,   
hence the name zero-shot. 

For example, one could just feed a model a sentence and expect it to output the sentiment of that sentence.


**Few-shot Prompting**

Few-shot learning, on the other hand, involves providing the model with a small  
number of high-quality examples that include both input and desired output for  
the target task.  
By seeing these good examples, the model can better understand the user's intention  
and criteria for generating accurate outputs.  

As a result, few-shot learning often leads to better performance compared to  
zero-shot learning.  
However, this approach can consume more tokens and may encounter context length  
limitations when dealing with long input and output text.

This kind of *in-context learning* guides the LLM by offering  
demonstrations in the prompt. In other words,  
conditioning the model on a selection of task-specific examples helps   
improve the model’s performance.


**Chain-of-Thought (CoT) prompting**

Chain-of-Thought prompting asks the model to generate a sequence of short sentences known as a  
**reasoning chain**. These describe the step-by-step reasoning leading to   
the final answer; the benefits are greatest for complex reasoning tasks. 
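
The three prompting styles can be compared as plain strings. The sketch below uses the sentiment task mentioned earlier. The wording of the templates and the sample reviews are made up, while the "Let's think step by step" trigger follows Kojima et al. (2022).

```python
def zero_shot(text):
    """Task description only, no worked examples."""
    return (
        "Classify the sentiment of this review as positive or negative.\n"
        f"Review: {text}\nSentiment:"
    )


def few_shot(text, examples):
    """Put (input, desired output) demonstrations in front of the new input."""
    demos = "".join(f"Review: {review}\nSentiment: {label}\n\n" for review, label in examples)
    return f"{demos}Review: {text}\nSentiment:"


def zero_shot_cot(question):
    """Append a reasoning trigger so the model writes out its steps before answering."""
    return f"Q: {question}\nA: Let's think step by step."


examples = [("Loved every minute.", "positive"), ("A waste of money.", "negative")]
p_zero = zero_shot("The plot dragged on forever.")
p_few = few_shot("The plot dragged on forever.", examples)
p_cot = zero_shot_cot("I had 5 apples, ate 2 and bought 4 more. How many do I have?")
```

The few-shot prompt is the longest of the three, which is the token cost noted above.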


In summary:

![prompt summary](img/zero-cot.webp)

Ref and image source: [Kojima et al. (2022)](https://arxiv.org/abs/2205.11916)


### A variety of fancy prompting ideas

* ReAct: Combines **Re**asoning and **Act**ing with LLMs.  
    *"What's the age of the universe?" -> "I need to find more information on the universe" -> "[search on Wikipedia]"*
* Code as Reasoning: When given a question, try to write code that solves this question. Then send the code to a programmatic runtime to get the result. 
* Automatic Prompt Design: Automating the generation and selection of prompts.


### Formalising Prompts

There are a few parts of a prompt that are quite common:

<img src="img/PromptParts.png" alt="Formal prompt" width=150>

[Ref and image source: Learning Prompting](https://learnprompting.org/docs/basics/formalizing)


* A role
* An instruction / task
* A question
* Context
* Examples (few shot)

Not all of these occur in every prompt, and there is no standard order for them. The following is another example: 
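
As one illustration, the parts listed above could be assembled with a small helper like the one below. The `Prompt` class, its field names, the ordering and the sample text are all hypothetical choices for this sketch, since there is no standard layout.

```python
from dataclasses import dataclass, field


@dataclass
class Prompt:
    question: str
    role: str = ""
    instruction: str = ""
    context: str = ""
    examples: list = field(default_factory=list)  # (question, answer) pairs

    def render(self):
        """Join whichever parts are present; the order here is one choice among many."""
        parts = [self.role, self.instruction]
        if self.context:
            parts.append(f"Context: {self.context}")
        parts.extend(f"Q: {q}\nA: {a}" for q, a in self.examples)
        parts.append(f"Q: {self.question}\nA:")
        return "\n\n".join(part for part in parts if part)


prompt = Prompt(
    role="You are a doctor.",
    instruction="Answer in one sentence.",
    question="Why is sleep important?",
).render()
```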

## Fine-Tuning

* Task specific tuning can make LLMs more suitable for domain problems and more reliable
* Further train the model on new data

### Parameter-Efficient Fine-tuning (PEFT)

<img src="img/model_tuning.jpg" alt="Model tuning" width="400"/>


### Prompt Tuning

Tune a vector that is prepended to the input text

<img src="img/prompt_tuning.jpg" alt="Prompt tuning" width="400"/>


**Parameter-efficient fine-tuning** is a method of fine-tuning that  
trains only a small subset of the pre-trained model’s parameters, or a small  
number of newly added ones, while the rest stay frozen.  
By doing so, PEFT can significantly reduce the computation required for fine-tuning.

**Prompt-tuning** is an efficient, low-cost way of adapting an LLM to new   
downstream tasks without retraining the model or updating its weights.
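
A back-of-the-envelope sketch shows why prompt tuning is cheap: only the prepended vectors are trained, while the model's own weights stay frozen. All sizes below are hypothetical, and the helper names are chosen for this example.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Place the trainable prompt vectors in front of the (frozen) token embeddings."""
    return soft_prompt + token_embeddings


def trainable_fraction(num_prompt_vectors, embedding_dim, frozen_model_params):
    """Share of all parameters that prompt tuning actually updates."""
    tuned = num_prompt_vectors * embedding_dim
    return tuned / (tuned + frozen_model_params)


# Hypothetical sizes: 20 prompt vectors of width 4096 in front of a 7-billion-parameter model.
fraction = trainable_fraction(20, 4096, 7_000_000_000)

# Toy embeddings of width 2: two prompt vectors followed by three token vectors.
soft_prompt = [[0.1, -0.2], [0.05, 0.3]]
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
```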


## Debatable Topics and Pitfalls of LLMs

LLMs are extremely powerful, but what they are actually doing is open to debate.

* Stochastic parrot
* Emergent abilities
* Citing sources on generations
* Hallucinations
* Bias
* Prompt Hacking


* **Stochastic parrot or not?**: Although large language models are good at generating convincing language, many people believe LLMs do not actually understand the meaning of the language they are processing.

* **Emergent abilities**: Emergent abilities are skills that suddenly and unpredictably show up (emerge) in AI systems. These abilities are not present in smaller models; scaling up language models seems to bring qualitative changes. There is great debate on how to define and how to measure them.

* **Hallucinations**: LLMs will frequently generate falsehoods when asked a question that they do not know the answer to. Sometimes they will state that they do not know the answer, but much of the time they will confidently give a wrong answer.

* **Bias**: LLMs are often biased towards generating stereotypical responses. Even with safeguards in place, they will sometimes say sexist/racist/homophobic things. Be careful when using LLMs in consumer-facing applications, and also be careful when using them in research (they can generate biased results).

* **Citing Sources**: LLMs for the most part cannot accurately cite sources. This is because they do not have access to the Internet, and do not exactly remember where their information came from. They will frequently generate sources that look good, but are entirely inaccurate.  
(Strategies like search augmented LLMs can often fix this problem)

* **Prompt Hacking**:  Users can often trick LLMs into generating any content they want.


**Recap**

* Language modelling is to learn the probability distribution of word sequences
* Transformers are currently the most effective way to learn this distribution
* LLMs become general purpose models that excel at a wide range of tasks
* Proper prompts / tuning can condition the generative model before a query to extract useful information



![alt text](img/message_from_Bard.png "message from Bard")

## Questions and Discussion

<img src="img/questions_discussion.png" alt="Questions and Discussions" width=300>
