<br>

# GPT-3

## Building Innovative NLP Products using Large Language Models

Notes to follow along with this really cool book by Sandra Kublik & Shubham Saboo

<br>

## 1 The Era of Large Language Models

OpenAI's paper: [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf)

Some Definitions
* NLP - combines computational linguistics and machine learning to create intelligent machines capable of identifying the context and understanding the intent of natural language
* Language Modeling - the task of assigning a probability to a sequence of words in a text in a specific language
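
To make "assigning a probability to a sequence of words" concrete, here is a minimal sketch using a toy bigram model. The corpus, the `<s>`/`</s>` boundary tokens, and the function names are all made up for illustration; GPT-style models use a neural network rather than counts, but the quantity being estimated is the same.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one acts as a context for the next token.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sequence_probability(sentence, context_counts, bigram_counts):
    """Approximate P(w1..wn) as the product of P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # unseen context: this toy model has no smoothing
        probability *= bigram_counts[(previous, current)] / context_counts[previous]
    return probability


corpus = ["the cat sat", "the cat ran", "the dog sat"]
contexts, bigrams = train_bigram_model(corpus)

p_seen = sequence_probability("the cat sat", contexts, bigrams)
p_scrambled = sequence_probability("cat the sat", contexts, bigrams)
print(p_seen, p_scrambled)
```

A word order that appears in the corpus gets a non-zero probability, while the scrambled sentence gets zero because this sketch does no smoothing.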

Generative Pre-Trained Transformers
* Generative Models - generate text
* Pre-trained Models - trained on a more general task and then available to be fine-tuned for different tasks.
* Transformer Models - a machine learning model that processes a sequence of text all at once (instead of a word at a time), and that has a powerful 'attention' mechanism to understand the connection between words.
* Sequence-to-Sequence (Seq2Seq) - transforms a given sequence of elements (e.g. a sentence) into another sequence (e.g. a translation into another language). Consists of two parts: an encoder and a decoder

Transformer Attention Mechanisms
* attention mechanism - a technique that mimics cognitive attention: it looks at an input sequence, piece by piece and, on the basis of probabilities, decides at each step which other parts of the sequence are important
* self-attention - connections of words within a sentence
* encoder-decoder attention - connection between words from the source sentence to words from the target sentence
* GPT is just the decoder part of the transformer
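
The self-attention idea above can be sketched in a few lines of plain Python. This is a simplified illustration, not the full transformer: it assumes the queries, keys, and values are the input vectors themselves (a real model applies learned projections first), and it uses a causal mask so each position only attends to itself and earlier positions, as a decoder does. The toy embeddings are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def causal_self_attention(vectors):
    """Scaled dot-product self-attention with a causal (decoder-style) mask.

    Assumption: queries, keys and values are the raw input vectors.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # A decoder position may only look at itself and earlier positions.
        visible = vectors[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the visible vectors.
        output = [
            sum(w * vec[d] for w, vec in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = causal_self_attention(toy_embeddings)
for position, weights in enumerate(attention_weights):
    print(position, [round(w, 3) for w in weights])
```

Each row of weights is the probability distribution that decides which other parts of the sequence matter for that position.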

A brief history of GPT-3
* GPT-1 - trained on the BookCorpus dataset and uses the decoder component of the original transformer. It was able to perform decent **zero-shot learning** (performing a task without having seen an example of that kind in the past). **zero-shot task transfer** - the model is presented with few to no examples and asked to understand the task based on the examples and the instruction.
* GPT-2 - bigger, and with multitasking capabilities. Trained on a larger collection of data and with more parameters (10x GPT-1). The training data was WebText (web pages linked from Reddit). Handles reading comprehension, summarization, etc.
* GPT-3 - both the number of parameters and the size of the training data are about two orders of magnitude larger than GPT-2. Accessible to the public via an API
* API - application programming interface - a software intermediary that sends information back and forth between a website or app and a user
* Model-as-a-Service - MaaS, developers can pay per API call.

<br>

## 2 Using the OpenAI API

**Playground** - a "text in, text out" interface  
Use it to design robust prompts that produce favorable responses for applications  

### Navigating the OpenAI Playground

#### Prompt Engineering and Design
* There is a direct relation between the training prompt you provide and the quality of the completion you get  
* The user's job is to get the model to use the information it already has to generate useful results: give GPT-3 just enough context (in the form of a training prompt) to figure out patterns and perform a given task
* The standard flow for designing a training prompt is to try zero-shot first, then few-shot, then go for corpus-based fine-tuning
* Steps for designing a training prompt:
    1. Define the task (e.g. classification, text generation, etc.)
    2. Is there a way to get a zero-shot solution?
    3. Formulate the problem in a textual fashion: text-in, text-out
    4. If you do end up using existing examples, use as few as possible and try to incorporate diversity, capturing all the representations to avoid overfitting the model or skewing the predictions
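
The steps above can be sketched as a small prompt builder. The `build_prompt` helper and the `Text:`/`Label:` layout are illustrative assumptions, not a format required by the API; the point is that the same text-in, text-out template serves as a zero-shot prompt with no examples and as a few-shot prompt with a handful.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a text-in, text-out prompt.

    With no examples this is a zero-shot prompt; with a few it is few-shot.
    The "Text:"/"Label:" layout is an illustrative choice only.
    """
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Text: {text}", f"Label: {label}", ""]
    # End on an open "Label:" so the model completes the answer.
    parts += [f"Text: {query}", "Label:"]
    return "\n".join(parts)


instruction = "Classify the sentiment of each text as Positive or Negative."
query = "The battery died after a day."

# Step 2: try zero-shot first.
zero_shot = build_prompt(instruction, query)

# Step 4: if needed, add as few examples as possible, covering each label.
few_shot = build_prompt(
    instruction,
    query,
    examples=[
        ("I love this phone.", "Positive"),
        ("The screen cracked in a week.", "Negative"),
    ],
)
print(few_shot)
```

Ending the prompt on an open `Label:` nudges the model to complete just the answer, which also pairs well with a newline stop sequence.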
    
### How the OpenAI API Works

#### API Components
* Model - choose an execution engine (e.g. Davinci, Curie, Babbage, Ada)
* Response length - how much text the engine should return
* Temperature - scope of randomness
    - low temp: most 'correct' but perhaps most boring with little variation. 
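    - high temp: more diverse text, but higher probability of grammar mistakes and nonsense
* Top-P - how many candidate results the model should consider: sampling is limited to the most likely tokens whose cumulative probability reaches P
    - low Top-P: a deterministic response with limited creativity
    - high Top-P: a more varied, creative response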
    - **Tip**: change either Top-P or Temperature while keeping the dial for the other set at 1
* Frequency penalty - reduce likelihood that model will repeat lines
* Presence penalty - increase the likelihood that the model will incorporate new topics/sources
* Best of - the number of completions to generate on the server side; the API evaluates them and returns the best one
* Stop Sequence - a set of characters that signal the API to stop generating completions
* Inject start/restart text - allows you to insert text at the beginning/end of the completion
    - inject start example: Once upon a time...
    - restart text example: ...and they lived happily ever after
* Show probabilities - show probability of tokens
    - helpful for 'debugging' the text prompt
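
To see why Temperature and Top-P both control randomness, here is a minimal sketch of how they reshape a next-token distribution. The vocabulary and scores are made up, and the exact procedure inside the API is not described in these notes; this follows the common definitions (softmax with temperature, and nucleus sampling for Top-P).

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probabilities, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probabilities[index]
        if mass >= top_p:
            break
    return {index: probabilities[index] / mass for index in kept}


vocabulary = ["the", "a", "cat", "zebra"]
logits = [3.0, 2.0, 1.0, 0.0]  # made-up scores for the next token

cold = apply_temperature(logits, 0.2)  # sharp: the top token dominates
hot = apply_temperature(logits, 2.0)   # flat: unlikely tokens gain probability
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.5)

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print({vocabulary[i]: round(p, 3) for i, p in nucleus.items()})
```

Because both dials narrow or widen the same distribution, adjusting one at a time (as the tip above suggests) makes the effect easier to reason about.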
    
#### Execution Engines
* Davinci 
    - Pros: Largest, most performant, most generalizable, better at complex tasks
    - Cons: Most expensive and slowest
* Curie
    - Tries to optimize between power and speed
    - performant and fast for classification and chat-bot style responses
* Babbage 
    - Faster than Curie, but best for relatively simple tasks
    - less expensive than Davinci and Curie
* Ada
    - Fastest and cheapest of the GPT-3 architectures
    - Best for use with simple tasks, but given the right context, can do more complicated jobs
* InstructGPT Models - produce better results than their base counterparts and are now the default models of the API (`text-davinci-002` vs `davinci`)
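
As a rough sketch of how an engine choice and the Playground settings come together, the helper below only builds a dictionary of request parameters and sends nothing over the network. `build_completion_request` is a hypothetical helper, the default values are arbitrary, and the parameter names are assumed to follow the Completions endpoint of that era; check the current API reference before relying on them, since the model names in these notes may no longer be available.

```python
def build_completion_request(prompt, model="text-davinci-002", **overrides):
    """Return request parameters as a dict; nothing is sent over the network.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    request = {
        "model": model,            # Model / execution engine
        "prompt": prompt,          # the training prompt
        "max_tokens": 64,          # Response length
        "temperature": 0.7,        # Temperature
        "top_p": 1.0,              # Top-P, left at 1 while tuning temperature
        "frequency_penalty": 0.0,  # Frequency penalty
        "presence_penalty": 0.0,   # Presence penalty
        "best_of": 1,              # Best of
        "stop": ["\n"],            # Stop sequence
    }
    unknown = set(overrides) - set(request)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    request.update(overrides)
    return request


# A simple classification task: the cheapest engine and no randomness.
simple_task = build_completion_request(
    "Text: I love this phone.\nLabel:", model="ada", temperature=0
)

# A more complex generation task: the default InstructGPT model from the notes.
complex_task = build_completion_request(
    "Write a short product description for a solar lamp.",
    max_tokens=200,
    stop=None,
)
print(simple_task["model"], complex_task["model"])
```

Keeping the settings in one place makes it easy to try the same prompt against a cheaper engine first and move up to Davinci only if the results fall short.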


[OpenAI's Comparison Tool](https://gpttools.com/comparisontool)