# An Overview of ChatGPT

## Introduction

> ChatGPT is an AI system that can hold back-and-forth conversations through a chatbot-style interface. It can write code, and it can correct or adjust its responses based on feedback from the user.

ChatGPT is a complex system built on top of a large language model, such as GPT-3. Together, these pieces have made AI far more accessible by putting it behind an intuitive interface that anyone can use.

ChatGPT is certainly capable of making mistakes, including:
- Providing factually incorrect information
- Producing biased responses
- Sometimes (although more and more rarely) producing inappropriate or harmful responses

## How ChatGPT Works

So how does it work under the hood?

ChatGPT is trained in 3 steps, as shown below:

![](./images/How%20chatGPT%20is%20trained.png)


Your first win, when it comes to understanding how ChatGPT works, is to understand these 3 steps:

1. Supervised Fine-Tuning (SFT): Fine-tune a pre-trained language model (GPT-3.5) to act like a chatbot
2. The Reward Model (RM): Train a new _reward model_ to identify which responses generated by the chatbot are better than others
3. Reinforcement Learning from Human Feedback (RLHF): Use the reward model to score generated responses, and update the language model to prefer responses with a higher score

> The point of SFT and RLHF is to make the language model better and more aligned with human intention. The RM is a necessary component of RLHF.

## InstructGPT

Before ChatGPT was developed and released, this method was used to produce InstructGPT, a language model based on GPT-3 that interprets prompts as instructions rather than as text to be continued.
This makes the models easier to interact with, because you can simply give them commands instead of relying on careful prompt engineering.
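
To make the difference concrete, here is a minimal sketch of the two prompting styles. The helper names and the translation task are made up for illustration; they are not part of any OpenAI API.

```python
def few_shot_prompt(examples, query):
    """Build a prompt for a plain language model, which only continues text.

    The task is implied by worked examples, and the model is expected to
    continue the pattern after the final "French:".
    """
    blocks = [f"English: {en}\nFrench: {fr}\n" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n".join(blocks)


def instruction_prompt(query):
    """Build a prompt for an instruction-following model: just state the task."""
    return f"Translate this sentence into French: {query}"


examples = [("Good morning", "Bonjour"), ("Thank you", "Merci")]
print(few_shot_prompt(examples, "See you soon"))
print("---")
print(instruction_prompt("See you soon"))
```

The first prompt has to be engineered so that the desired answer is the most natural continuation, while the second simply states what is wanted.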

OpenAI subsequently made the InstructGPT models the default language models on its API.

### InstructGPT Results

As reported in the [paper on InstructGPT](https://arxiv.org/pdf/2203.02155.pdf):
- Labellers significantly prefer InstructGPT outputs over outputs from GPT-3
- InstructGPT models show improvements in truthfulness over GPT-3
- InstructGPT shows small improvements in toxicity over GPT-3, but not bias
- InstructGPT still makes simple mistakes
- And more

Given these results, it's easy to see how ChatGPT, which follows the same training approach, has become so useful across many use cases.

## Data Collection

> For all steps of training, data is required

As described in the [InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf), to collect data for fine-tuning the very first InstructGPT models, OpenAI had human labellers write prompts. The three requested prompt types were:
- Plain: Labellers came up with an arbitrary task, while ensuring the tasks had sufficient diversity.
- Few-shot: Labellers came up with an instruction and multiple query/response pairs for that instruction.
- User-based: Labellers came up with prompts corresponding to use cases stated in waitlist applications to the OpenAI API.

These manually created prompts led to three datasets, used for the three stages of training:
- The supervised fine-tuning (SFT) dataset
    - Features: Prompts
    - Labels: Ideal responses
- The reward model (RM) dataset
    - Features: Prompts & responses
    - Labels: Rankings of each response 
- The PPO dataset
    - Features: Prompts (responses are generated by the model during training)
    - No labels
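
As a rough sketch, one record from each dataset could be represented like this. The class and field names are hypothetical, chosen for illustration rather than taken from the paper, and the PPO record assumes that only the prompt is stored because responses are generated during training.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SFTExample:
    prompt: str
    ideal_response: str  # label: written by a human labeller


@dataclass
class RMExample:
    prompt: str
    responses: List[str]  # several model outputs for the same prompt
    ranking: List[int]    # label: indices into `responses`, best first


@dataclass
class PPOExample:
    prompt: str  # no label; the reward model scores responses during training


def best_response(example: RMExample) -> str:
    """Return the response the labeller ranked first."""
    return example.responses[example.ranking[0]]


rm_example = RMExample(
    prompt="Explain what a reward model is.",
    responses=["No.", "It scores how good a response is."],
    ranking=[1, 0],
)
print(best_response(rm_example))
```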

To create the datasets for the original InstructGPT models, OpenAI hired a team of about 40 contractors through [Upwork](https://www.upwork.com/) and [Scale AI](https://scale.com/rlhf).

This means that a team of humans literally wrote out acceptable responses to a range of prompts. These responses were saved, and make up the raw data for a dataset.

## Step 1: Supervised Fine-Tuning (SFT)

The first step is to _train a supervised policy_.

> A policy is something that defines how you act in a certain context. In the case of ChatGPT, the context is the instruction written by the user (or the conversation so far), and the policy defines what response ChatGPT should produce.

The policy is determined by the parameters of the language model. Initially, this policy is defined by the parameters of the model used as a starting point (the backbone).

> The starting point for ChatGPT was to use [GPT-3.5](https://platform.openai.com/docs/model-index-for-researchers/models-referred-to-as-gpt-3-5) as a backbone

The backbone GPT-3.5 model is a large language model (LLM) that is already highly competent at language generation. It can generate novel text in a variety of forms, including:
- Syntactically perfect English
- Working code
- Answers to trivia and uncommon knowledge
- Consistent short stories and text
- Stylised writing
- And more

However, the original language model has still learnt some undesirable behaviours due to the data it was trained on (a large portion of the internet):
- Biased and inappropriate answers
- Lack of consistent, professional, or helpful tone

> The point of supervised fine-tuning (step 1) and RLHF (step 3) is to straighten out the undesired behaviours of a general-purpose language model.

> SFT is used as a way to get a good model parameter initialisation for further fine-tuning with RLHF later

To do this, the SFT stage continues training the pre-trained language model, updating its parameters with the usual language modelling objective (predicting the next token) on the SFT dataset described earlier, which contains prompts and their corresponding ideal responses.
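
The language modelling objective can be sketched as an average next-token cross-entropy. This is a toy illustration, not OpenAI's implementation: the function name and inputs are hypothetical, and it assumes that only the response tokens contribute to the loss, which is an implementation choice the source doesn't specify.

```python
import math


def sft_loss(token_probs, target_ids, response_start):
    """Average next-token cross-entropy over the response tokens.

    token_probs[t] is the model's predicted distribution (a list of
    probabilities over the vocabulary) for the token at position t.
    target_ids[t] is the token that actually appears at position t.
    Positions before `response_start` belong to the prompt and are skipped.
    """
    losses = [
        -math.log(token_probs[t][target_ids[t]])
        for t in range(response_start, len(target_ids))
    ]
    return sum(losses) / len(losses)


# Toy sequence over a 3-token vocabulary: 2 prompt tokens, then 2 response tokens.
target_ids = [0, 1, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in target_ids]
confident = [
    [0.05, 0.05, 0.9],
    [0.05, 0.9, 0.05],
    [0.05, 0.05, 0.9],  # puts 0.9 on the correct response token (id 2)
    [0.9, 0.05, 0.05],  # puts 0.9 on the correct response token (id 0)
]

print(sft_loss(uniform, target_ids, response_start=2))
print(sft_loss(confident, target_ids, response_start=2))
```

Training lowers this loss, which pushes the model to assign higher probability to the labeller-written responses.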

> SFT requires a labelled dataset (with ideal responses rather than just prompts)

By seeing many of these examples, the language model should learn:
- The style of a chatbot
- To avoid some biases and inappropriate responses
- To be helpful to the user

> Like RLHF (step 3), the point of supervised fine-tuning is to make the language model better at acting like a helpful chatbot and more aligned with human intention.


## Step 2: Train a Reward Model

> The second step is to _train a reward model_, which will be used in step 3 to score the quality of responses generated by the language model.

> The reward model is trained to output a scalar value, the reward, that tells the model how good its response was.

Each example in the dataset used to train the reward model consists of several outputs for the same prompt, ranked in order of preference by a human labeller.
The aim of the model is to predict a higher score for the higher-ranked response.
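
One way to train on rankings, and the one described in the InstructGPT paper, is to split each ranking into pairs and apply a pairwise loss to the two predicted rewards. The sketch below is a minimal illustration with hypothetical function names; the rewards are plain numbers standing in for the reward model's outputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """Compute -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the preferred response receives the higher
    reward, and large when the order is the wrong way round.
    """
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))


pairs = ranking_to_pairs(["best", "middle", "worst"])
print(pairs)

print(pairwise_loss(2.0, 0.0))  # rewards agree with the labeller
print(pairwise_loss(0.0, 2.0))  # rewards disagree with the labeller
```

Because only the difference between the two rewards matters, the labellers never have to assign absolute scores; they only need to say which response is better.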


## Step 3: Reinforcement Learning from Human Feedback

### Recap: What is Reinforcement Learning?

> Reinforcement learning is where an agent (in our case, the AI system) interacts with an environment (in our case, the chat interface, by responding to prompts), and tries to maximise a reward which it receives for doing well (or to avoid a punishment for not doing well).

### No Labels, No Problem

The fine-tuned model can use the reward model (RM) trained in step 2 to evaluate newly generated text, without requiring labelled ideal responses.

> RLHF does not require the prompts in its dataset to be labelled with ideal responses, but it does require the reward model (RM)
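
The idea of updating a policy from reward scores alone can be sketched with a toy example. This is deliberately simplified and is not PPO, the algorithm named in the dataset list above: the "policy" here is just a softmax over three fixed candidate responses, the reward model is a hard-coded stand-in, and the update uses the exact policy gradient of the expected reward rather than an estimate from sampled responses. All names and scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward_model(response):
    """Stand-in for the trained reward model: returns a scalar score."""
    scores = {"rude reply": 0.1, "helpful reply": 0.9, "vague reply": 0.5}
    return scores[response]


def policy_gradient_step(logits, responses, lr=1.0):
    """Take one gradient ascent step on the expected reward.

    For a softmax policy, the gradient of the expected reward with respect
    to logit i is p_i * (r_i - expected_reward). A real system would
    estimate this from sampled responses instead of enumerating them.
    """
    probs = softmax(logits)
    rewards = [toy_reward_model(r) for r in responses]
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]


responses = ["rude reply", "helpful reply", "vague reply"]
logits = [0.0, 0.0, 0.0]  # start with no preference between responses
for _ in range(200):
    logits = policy_gradient_step(logits, responses)

probs = softmax(logits)
print({r: round(p, 3) for r, p in zip(responses, probs)})
```

No ideal response is ever shown to the policy. It shifts probability towards the response the reward model scores highest purely because that response earns more reward.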

