# An Introduction to Large Language Models


To convert this notebook to slides and show:

`jupyter nbconvert LLMs.ipynb --to slides --TemplateExporter.exclude_input=True --post serve`



## Language Models


### Language modelling

Language modelling is the task of assigning a probability to sentences in a language. 
    
* Given a sequence of words $(w_1, w_2, w_3, \ldots ,w_{T})$ of length $T$, a language model assigns a probability 
    $P(w_{1}, w_{2}, \ldots ,w_{T})$ to the whole sequence. 
$
\ \ \ \ \ \ \ \ \ P(w_{1}, w_{2}, \ldots ,w_{T}) = ?
$
    
* This is equivalent to assigning a probability to a word given the sequence of words that precedes it:

$
\ \ \ \ \ \ \ \ \ \ \ \ \ 
P(w_1,w_2,\ldots,w_{T}) = P(w_1,w_2,\ldots,w_{T-1}) \times P(w_T|w_1,w_2,\ldots,w_{T-1})
$

$
\ \ \ \ \ \ \ 
\Rightarrow P(w_T|w_1,w_2,\ldots,w_{T-1}) = \dfrac{P(w_1,w_2,\ldots,w_{T})}{P(w_1,w_2,\ldots,w_{T-1})}  
$

* A language model is a probability distribution over sequences of words.
* Language modelling is essentially a classification problem: at each step the model picks the next word from a fixed vocabulary.
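
As a minimal sketch of the chain rule above, the snippet below multiplies conditional next-word probabilities to obtain the probability of a whole sequence. The table `COND_PROBS` and all of its numbers are invented for illustration; a real language model would compute these conditionals rather than look them up.

```python
import math

# Hypothetical conditional probabilities P(next word | context); numbers are invented.
COND_PROBS = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def sequence_probability(words):
    """Chain rule: P(w_1..w_T) is the product of P(w_t | w_1..w_{t-1}) over t."""
    prob = 1.0
    for t, word in enumerate(words):
        context = tuple(words[:t])          # all words before position t
        prob *= COND_PROBS[context][word]
    return prob

p_full = sequence_probability(["the", "cat", "sat"])   # 0.5 * 0.4 * 0.7
p_prefix = sequence_probability(["the", "cat"])        # 0.5 * 0.4

# The conditional for the last word is the ratio of the two joint probabilities.
p_last_given_prefix = p_full / p_prefix
assert math.isclose(p_last_given_prefix, COND_PROBS[("the", "cat")]["sat"])
```

Dividing the probability of the full sequence by the probability of its prefix recovers the conditional probability of the last word, which is the relationship the equations above express.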


### Examples of using language models
#### Speech recognition

$\ \ \ \ A=a_1, a_2, \ldots, a_m \ \ \ \ \ \ a_i\in{\mathcal{A}}$
denotes a sequence of acoustic symbols extracted from the audio signal.

$\ \ \ \ W = w_1,w_2,\ldots, w_{n} \ \ \ \ \ \ w_i \in{\mathcal{W}}$
denotes a string of $n$ words, each belonging to a fixed vocabulary $\mathcal{W}$.

The speech-to-text recogniser should decide in favor of the word string $\hat{W}$ satisfying

$\ \ \ \ \hat{W} = \underset{W}{argmax}\ P(W|A)$

where $P(W|A)$ is the probability that the words $W$ were spoken, given the evidence $A$ was observed.

Applying Bayes' rule $P(W|A) = \dfrac{P(W) \times P(A|W)}{P(A)}$ and noting that $P(A)$ does not depend on $W$:

$
\ \ \ \ \hat{W} = \underset{W}{argmax}\ P(W) \times P(A|W)
$
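
A toy sketch of this decision rule is shown below. The two candidate transcriptions and every score are invented; `LANGUAGE_MODEL` stands in for $P(W)$ and `ACOUSTIC_MODEL` for $P(A|W)$ for one fixed audio input.

```python
# Invented scores for two candidate transcriptions of the same audio A.
LANGUAGE_MODEL = {"recognise speech": 0.010, "wreck a nice beach": 0.0001}  # P(W)
ACOUSTIC_MODEL = {"recognise speech": 0.30, "wreck a nice beach": 0.40}     # P(A|W)

def decode(candidates):
    """Return argmax over W of P(W) * P(A|W); P(A) is the same for every W, so it is dropped."""
    return max(candidates, key=lambda w: LANGUAGE_MODEL[w] * ACOUSTIC_MODEL[w])

best = decode(list(LANGUAGE_MODEL))
print(best)
```

In this made-up example the acoustic score slightly prefers the second candidate, but the language model prior makes the first one win overall, which is the role $P(W)$ plays in the formula.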

#### Translations

### Traditional approach to language modelling

* Law of large numbers: estimate probabilities by counting in a large corpus of text

* An n-gram is a chunk of n consecutive words.
* n-th order Markov property assumption: the next word depends only on the previous n words (3rd-order example):
$
\ \ \ \ \ \ \ 
P(w_T=m|w_1,w_2,\ldots,w_{T-3}, w_{T-2},w_{T-1}) \approx P(w_T=m|w_{T-3},w_{T-2},w_{T-1})
$
* Collect statistics about how frequent different n-grams are and use these to
predict the next word.
$
\ \ \ \ \ \ \ 
\hat{p}(w_T=m|w_{T-3}, w_{T-2},w_{T-1}) =\dfrac{\#(w_{T-3},w_{T-2},w_{T-1}, w_{T}=m)}{\#(w_{T-3},w_{T-2},w_{T-1})}
$

* Smoothing techniques and other tricks deal with sparsity problems (n-grams that never occur in the corpus)
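
The counting estimate above can be sketched in a few lines. The corpus is a made-up sentence, the example uses a 1st-order context to keep the counts readable, and add-alpha (Laplace) smoothing is just one simple smoothing technique chosen here for illustration.

```python
from collections import Counter

def train_ngram_counts(tokens, order):
    """Count each context of `order` words and each (context, next word) n-gram."""
    context_counts, ngram_counts = Counter(), Counter()
    for i in range(order, len(tokens)):
        context = tuple(tokens[i - order:i])
        context_counts[context] += 1
        ngram_counts[context + (tokens[i],)] += 1
    return context_counts, ngram_counts

def next_word_prob(word, context, context_counts, ngram_counts, vocab_size, alpha=0.0):
    """Relative-frequency estimate; alpha > 0 applies add-alpha smoothing."""
    context = tuple(context)
    numerator = ngram_counts[context + (word,)] + alpha
    denominator = context_counts[context] + alpha * vocab_size
    return numerator / denominator if denominator else 0.0

corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = set(corpus)
ctx_counts, ng_counts = train_ngram_counts(corpus, order=1)

# "the" is followed by a word 4 times, and by "cat" in 2 of them.
p_cat = next_word_prob("cat", ["the"], ctx_counts, ng_counts, len(vocab))
# "the sat" never occurs: the raw estimate is 0, smoothing gives it a small probability.
p_sat_raw = next_word_prob("sat", ["the"], ctx_counts, ng_counts, len(vocab))
p_sat_smoothed = next_word_prob("sat", ["the"], ctx_counts, ng_counts, len(vocab), alpha=1.0)
```

Passing `order=3` reproduces the 3rd-order setting used in the formulas, at the cost of needing far more text before most contexts have been seen at all.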



### Neural Networks for language modelling

#### MLP: n-gram of words as input (fixed window) and the probability distribution over the next word as output
#### RNNs, LSTM, GRU
![Alt text](img/rnn_01.png "RNN")


**RNN Advantages**:
* Can process any length input
* Computation for step t can (in theory) use information from many steps back
* Model size doesn’t increase for longer input context
* Same weights applied on every timestep, so there is symmetry in how inputs are processed.

**RNN Disadvantages**:
* Recurrent computation is slow
* In practice, difficult to access information from many steps back
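
The points above can be seen in a minimal sketch of a recurrent cell. It assumes a vanilla (Elman-style) tanh cell, which may differ from the architecture in the figure, and the weights are invented numbers rather than trained values.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h_prev))
            + b_h[j]
        )
        for j in range(len(b_h))
    ]

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Apply the same weights at every timestep, for an input of any length."""
    h = [0.0] * len(b_h)
    for x in inputs:  # sequential loop: step t needs the hidden state from step t-1
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Invented weights: 2-dimensional inputs, 3-dimensional hidden state.
W_xh = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W_hh = [[0.2, 0.0, -0.1], [0.1, 0.1, 0.0], [0.0, -0.3, 0.2]]
b_h = [0.0, 0.1, -0.1]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq * 50            # 100 timesteps, same parameters
h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)
```

The parameter count is the same for the 2-step and the 100-step input, but the loop in `run_rnn` cannot be parallelised across timesteps, which is the source of the slowness noted above.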

#### Seq2seq
![Alt text](img/seq_to_seq.png "Sequence to Sequence architecture")



#### Transformer
![Alt text](img/transformer.png "Simplified Transformer")


![Alt text](img/transformer_simple.png "Transformer")


### Large Language Models (LLMs)

* LLMs are language models with billions of parameters
* Trained on trillions of tokens
* Take advantage of parallel computing enabled by the transformer architecture
* Become general-purpose models that excel at a wide range of tasks
* Implicitly learn the syntax and semantics of human language, as well as general "knowledge" about the world


![Alt text](img/TypeOfLLMs.png "Type of LLMs")


## The Training of ChatGPT on paper

* Pre-training
* Supervised Fine Tuning
* Reinforcement Learning from Human Feedback (RLHF)


![Alt text](img/chatgpt-training_1.png "Training of chatGPT")


### Pre-training

![Alt_text](img/chatGPT_training04.png "Generative Pretraining")

* Self-supervised learning on unlabeled data: the next token in the text serves as the label
* Trained on a vast amount of data (1 trillion tokens are roughly equivalent to 15 million books)
* At the current rate of data consumption, we may run out of Internet data in the next few years
* Many companies have changed their data terms to prevent others from scraping their data for LLMs
* The Internet is being rapidly populated with LLM-generated data
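
As a quick sanity check on the tokens-to-books comparison above, the arithmetic can be written out as follows. The two headline figures come from the slide; the words-per-token ratio is a rough rule of thumb for English text and is an assumption, not something stated in the source.

```python
TOKENS = 1_000_000_000_000   # 1 trillion tokens (figure from the slide)
BOOKS = 15_000_000           # 15 million books (figure from the slide)

tokens_per_book = TOKENS / BOOKS

# Assumption (rule of thumb, not from the slide): about 0.75 English words per token.
WORDS_PER_TOKEN = 0.75
words_per_book = tokens_per_book * WORDS_PER_TOKEN

print(f"{tokens_per_book:,.0f} tokens per book, about {words_per_book:,.0f} words")
```

Under that assumption the comparison implies books of roughly 50,000 words each, which is a plausible length for a short book.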


**Shoggoth with a smiley face analogy**

![Alt text](img/shoggoth.jpg "Shoggoth with a smiley face")


* The pretrained model is an untamed monster because it was trained on indiscriminate data scraped from the Internet, including misinformation, propaganda, conspiracy theories, and attacks against certain demographics.
* This monster was then finetuned on higher quality data (StackOverflow, Quora and human annotations), which makes it somewhat socially acceptable.
* The finetuned model was then further polished using RLHF to make it customer-appropriate, i.e. it was given a smiley face.

### Supervised Fine Tuning

* To optimize the pretrained model to generate useful responses
* To show the language model examples of how to appropriately respond
* To relieve users of the burden of designing their own few-shot prompts ([GPT3 is a few-shot learner](https://arxiv.org/abs/2005.14165))
* OpenAI hired 40 high-quality labelers to create around 13,000 (prompt, response) pairs for InstructGPT.
* Prompts are designed for different use cases (e.g. question answering, summarization, translation)

Example training dataset: [Training language models to follow instructions with human feedback, pages 26-33](https://arxiv.org/pdf/2203.02155.pdf)


![Alt_text](img/ChatGPT_training01.png "Generative Pretraining")

![Alt_text](img/ChatGPT_SFT.png "Supervised Fine Tuning and RLHF")



### Reinforcement Learning from Human Feedback (RLHF)

#### Reward model

#### Proximal Policy Optimization
* Policy
* Policy gradient optimization algorithm for updating an existing policy to gain reward


**PPO**

* Policy: In reinforcement learning terminology, a policy is a mapping from the state space to the action space. It can be thought of as the instructions that tell the RL agent which action to take in each state of the environment.

* PPO is a policy gradient optimization algorithm: at each step it updates the parameters of the existing policy to seek improvement, while ensuring that the update is not too large, i.e. that the new policy is not too different from the old policy.
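
One common way PPO keeps the update small is the clipped surrogate objective, sketched below in plain Python. The function name, the example probabilities, and the clipping range `clip_eps=0.2` are illustrative assumptions; the slides do not say which PPO variant or hyperparameters were used for ChatGPT.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)   # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Action probability raised from 0.2 to 0.3 (ratio 1.5) with a positive advantage:
# the objective is capped at (1 + clip_eps) * advantage.
obj_up = ppo_clip_objective([math.log(0.3)], [math.log(0.2)], [1.0])

# Action probability halved (ratio 0.5) with a negative advantage:
# the objective is capped at (1 - clip_eps) * advantage.
obj_down = ppo_clip_objective([math.log(0.1)], [math.log(0.2)], [-1.0])
```

In both cases the objective stops improving once the probability ratio leaves the interval from 0.8 to 1.2, so a gradient step has nothing to gain from pushing the new policy far from the old one.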

### Performance

![Alt_text](img/PPO_performance.png "Human evaluations of various models' performance")


## The Process of using ChatGPT




## How to Use LLMs in the Enterprise

### Model-as-a-Service via API
**Advantages**

* Low barrier to entry and convenient to implement
* Access to the latest, largest and most sophisticated LLMs

**Limitations**

* Data residency and privacy
* Potentially higher cost
* Dependency on a third party

### Open-source model in a managed environment
**Advantages**

* Wide range of choice
* Potentially lower cost
* Independence

**Tradeoffs**

* Complexity: Setting up and maintaining an LLM requires data science and engineering expertise.
* Open-source models are typically smaller and perform well on a narrower range of tasks

### Fine-tune an existing third-party Model in a managed environment via API



## Debatable Topics and Pitfalls of LLMs

LLMs are extremely powerful, but there is debate about what they are actually doing.

* **Stochastic parrot or not?**: Although large language models are good at generating convincing language, many people believe that LLMs do not actually understand the meaning of the language they are processing.

* **Emergent abilities**: Emergent abilities are skills that suddenly and unpredictably show up (emerge) in AI systems. These abilities are not present in smaller models; they appear to be qualitative changes that come from scaling up language models. There is great debate about how to define and measure them.

* **Hallucinations**: LLMs will frequently generate falsehoods when asked a question that they do not know the answer to. Sometimes they will state that they do not know the answer, but much of the time they will confidently give a wrong answer.

* **Bias**: LLMs are often biased towards generating stereotypical responses. Even with safeguards in place, they will sometimes say sexist/racist/homophobic things. Be careful when using LLMs in consumer-facing applications, and also be careful when using them in research (they can generate biased results).

* **Citing Sources**: For the most part, LLMs cannot accurately cite sources. This is because they do not have access to the Internet and do not remember exactly where their information came from. They will frequently generate sources that look plausible but are entirely inaccurate.  
(Strategies like search-augmented LLMs can often mitigate this problem.)

* **Prompt Hacking**:  Users can often trick LLMs into generating any content they want.
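
To illustrate the search-augmented strategy mentioned under **Citing Sources**, here is a toy sketch of the retrieval step. The document store, the word-overlap ranking, and the prompt wording are all hypothetical stand-ins; a real system would use a proper search engine or vector index and would send the resulting prompt to an LLM.

```python
import re

# Hypothetical mini document store; titles and texts are invented for illustration.
DOCUMENTS = [
    {"title": "Doc A", "text": "PPO is a policy gradient algorithm used in RLHF."},
    {"title": "Doc B", "text": "An n-gram is a chunk of n consecutive words."},
    {"title": "Doc C", "text": "Transformers process tokens in parallel."},
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in for real search)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d["text"])), reverse=True)
    return ranked[:top_k]

def build_prompt(question, sources):
    """List numbered sources in the prompt so the model can cite [1], [2], ... ."""
    lines = ["Answer using only the sources below and cite them by number.", ""]
    for i, doc in enumerate(sources, start=1):
        lines.append(f"[{i}] {doc['title']}: {doc['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

question = "What is an n-gram?"
sources = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, sources)
print(prompt)
```

Because the sources are supplied in the prompt, any citation in the answer can be checked against a real document instead of relying on what the model remembers.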


