# Introduction to LLMs

## Table of contents

**1. What is an LLM**

    1.1 LLMs history
    
    1.2 Key characteristics 
    
    1.3 LLM types
    
**2. LLMs nowadays**
    
**3. Building an LLM in Python**

**4. Guidelines for prompting**
    
    4.1 Write clear and specific instructions
    
    4.2 Give the model time to think!
    
    4.3 Avoid model hallucinations

**5. Iterative prompt development**

**6. Using an OpenAI API**

    6.1 What is an API?
    
    6.2 How does it work?
    
    6.3 API implementation

**7. Creating a Chatbot**

### 1. What is an LLM?

In the realm of artificial intelligence and natural language processing, an LLM, or **Large Language Model**, is a type of deep learning model designed to understand and generate human language. These models are trained on vast amounts of text data, enabling them to perform a wide range of language-related tasks with high accuracy and fluency.

### 1.1 LLMs history

In the early days of natural language processing (NLP), from the 1950s to the 1980s, most systems were rule-based, relying on handcrafted rules and simple statistical models. These early NLP systems, such as the famous **ELIZA program created by Joseph Weizenbaum in 1966**, demonstrated the potential of conversational agents by simulating conversation using pattern matching and substitution methods. However, **ELIZA and similar systems lacked true understanding of language and were limited in scope**.

![image-3.png](attachment:image-3.png)

The 1990s saw the rise of statistical methods in NLP, significantly improving tasks like speech recognition and machine translation. During this period, **the introduction of statistical models, such as Hidden Markov Models (HMMs) and n-grams**, marked a shift towards more **data-driven approaches**. This period laid the groundwork for future advancements in NLP by emphasizing the **importance of large datasets and statistical inference**.


The emergence of **deep learning in the 2010s** revolutionized NLP. One of the significant milestones was the development of word embeddings, such as **Word2Vec in 2013** by researchers at Google. Word2Vec enabled the capture of semantic relationships between words in a continuous vector space, which greatly enhanced the ability of models to understand and process language.

![image-4.png](attachment:image-4.png)

In 2014, the introduction of the **sequence-to-sequence (seq2seq) model** by Google marked another leap forward. Seq2seq models, particularly when combined with attention mechanisms, improved the performance of tasks like machine translation and text summarization. This period also saw the development of models like GloVe (Global Vectors for Word Representation) by Stanford, which further improved word representation techniques.

The real breakthrough in LLMs came with the advent of the **Transformer architecture**, introduced by Vaswani et al. in the paper "Attention is All You Need" in 2017. Transformers, with their self-attention mechanisms, allowed for **better handling of long-range dependencies in text and parallelization of training**, leading to significant performance improvements over previous models.

Following the introduction of Transformers, **OpenAI launched GPT** (Generative Pre-trained Transformer) models, starting with GPT-1 in 2018. GPT-1 demonstrated the power of pre-training on large corpora followed by fine-tuning for specific tasks. This was followed by GPT-2 in 2019, which was notable for its impressive language generation capabilities and raised important discussions about the ethical implications of powerful AI systems.

![image-5.png](attachment:image-5.png)

In June 2020, **OpenAI released GPT-3**, a model with **175 billion parameters**, which further pushed the boundaries of what LLMs could achieve. GPT-3's capabilities included not only advanced text generation but also strong performance in tasks requiring reasoning, translation, and question answering. Its ability to perform zero-shot, one-shot, and few-shot learning made it a versatile tool for a wide range of applications.

In 2021, **Google introduced LaMDA (Language Model for Dialogue Applications)**, focusing on open-domain conversation, and in 2023 **Meta (formerly Facebook) released the first of its LLaMA (Large Language Model Meta AI)** series. These models aimed to push the boundaries of **conversational AI** and more general NLP tasks.

![image-6.png](attachment:image-6.png)

As the field progressed, efforts to make LLMs more accessible and **ethically aligned with human values** became increasingly important. Models like Anthropic's Claude and Cohere's Command R series, which emerged in the early 2020s, emphasized safety and reliability in AI interactions.

The development of LLMs continues to evolve rapidly, with ongoing research focusing on improving model efficiency, interpretability, and alignment with human values. The journey from early rule-based systems to advanced transformers has fundamentally transformed the landscape of NLP, making LLMs indispensable tools in both research and practical applications.

![image-2.png](attachment:image-2.png)

### 1.2 Key characteristics

#### Scale and Training:

LLMs are trained on massive datasets containing diverse text sources, including books, articles, websites, and more. This extensive training helps them understand nuances, context, and various language patterns.
The size of an LLM refers to the number of parameters (weights) it has. For example, GPT-3 has 175 billion parameters, which made it one of the largest models available at the time of its release.

#### Transformer Architecture:

LLMs typically use the transformer architecture, which allows them to process and generate text efficiently. The transformer model uses attention mechanisms to weigh the importance of different words in a sentence, enabling it to capture long-range dependencies and context more effectively than previous models like RNNs (Recurrent Neural Networks).
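
To make the idea of "weighing the importance of different words" more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The three tokens and their 2-dimensional vectors are made up for illustration, and the sketch reuses the same vectors as queries, keys, and values; real transformers instead derive them through learned projection matrices, which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-token sentence.
toy_tokens = ["the", "cat", "sat"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attn_outputs, attn_weights = attention(toy_vectors, toy_vectors, toy_vectors)
for token, row in zip(toy_tokens, attn_weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence; the weights in a row always sum to 1.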

#### Pre-training and Fine-tuning:

LLMs undergo a two-step process: pre-training and fine-tuning. During pre-training, the model learns general language patterns from a large corpus of text. Fine-tuning involves adjusting the model on a smaller, task-specific dataset to enhance its performance on particular tasks.

#### Versatility:

LLMs can perform a wide range of tasks, including text generation, translation, summarization, question answering, sentiment analysis, and more. Their versatility makes them highly valuable in various applications, from chatbots to content creation and beyond.

### 1.3 LLM types

Large Language Models (LLMs) come in various forms, primarily categorized into base LLMs and instruction-tuned LLMs. Understanding these types helps in selecting the right model for specific applications and tasks.

#### Base LLM

A base LLM is a foundational language model trained on a vast corpus of text data without any specific task fine-tuning. It learns general language patterns, structures, and knowledge.

Characteristics:

- General Understanding: Capable of generating coherent and contextually relevant text based on its extensive training data.
- Versatility: Can be adapted to various tasks, but might require additional fine-tuning for optimal performance in specific applications.
- Example Models: GPT-2, GPT-3 (before any fine-tuning for specific tasks).

Example:

- Prompt: "Explain the theory of relativity."
- Base LLM Output: "The theory of relativity, developed by Albert Einstein, consists of two parts: special relativity and general relativity. It revolutionized our understanding of space, time, and gravity..."


#### Instruction-Tuned LLM

An instruction-tuned LLM is a base model that has undergone additional fine-tuning to follow specific instructions more effectively. This tuning is typically done using datasets that pair instructions with desired responses.

Characteristics:

- Enhanced Task Performance: Better at following user instructions and producing task-specific outputs.
- Higher Accuracy: Provides more accurate and relevant responses to prompts designed as instructions.
- Example Models: GPT-3.5, GPT-4 (fine-tuned versions designed for instruction-following).

Example:

- Prompt: "Summarize the theory of relativity in simple terms."
- Instruction-Tuned LLM Output: "The theory of relativity, proposed by Einstein, includes special relativity, which explains how time and space are linked, and general relativity, which describes how gravity affects them. It's key to understanding how the universe works."


To summarize, Base LLMs offer broad language understanding and versatility, while instruction-tuned LLMs excel at instruction-based tasks. Choose the model type based on the specific needs of your application.

***But remember that an LLM is like a smart person who doesn't know the specifics***

### 2. LLMs nowadays

In recent years, artificial intelligence (AI) has undergone a remarkable evolution, becoming **more advanced and accessible** than ever before. This accessibility has democratized AI, allowing a wide range of organizations, from tech giants to startups, to develop their own large language models (LLMs). These models, which excel in understanding and generating human language, are now at the forefront of technological innovation.

The rapid advancement of AI technologies has led to a competitive landscape where **companies strive to create the most powerful and efficient LLMs**. This competition drives constant improvements, resulting in more capable and versatile models that can perform a diverse array of tasks. From aiding in customer service and generating creative content to assisting in scientific research and enhancing productivity tools, LLMs have become integral to various industries.

Let's see some of the prominent LLMs leading the way:


- **OpenAI GPT-4**: Developed by OpenAI, GPT-4 is the successor to GPT-3 and continues to be one of the most advanced and widely used LLMs. It is known for its versatility and ability to perform a wide range of natural language processing (NLP) tasks.

- **Google Gemini**: Google's Gemini is another leading LLM known for its strong performance in various NLP tasks. It is integrated into several Google products and services.

- **Anthropic's Claude**: Developed by Anthropic, Claude is designed with a focus on safety and alignment, aiming to be a reliable and ethical AI assistant.

- **Meta's LLaMA 3**: Meta (formerly Facebook) has developed the LLaMA series, which is tailored for research and practical applications in NLP.

- **Grok by xAI**: Developed by xAI, Grok is another noteworthy LLM designed to deliver advanced language understanding and generation capabilities. With access to data from X, it contributes to the competitive and ever-evolving landscape of AI technology.

- **Cohere's Command R**: Cohere offers the Command R series of language models, which are optimized for retrieval-augmented generation tasks and provide robust NLP capabilities.

- **Bloom**: Bloom is an open-source, multilingual LLM developed by the BigScience research initiative. It aims to provide accessible and transparent AI research tools.

- **Huawei's PanGu**: PanGu is a large-scale LLM developed by Huawei, aimed at advancing AI capabilities and applications in various domains.

#### Comparison

![image-3.png](attachment:image-3.png)


## 3. Building an LLM in Python

### Notes

In this section, we will develop a simple **n-gram** language model. N-gram models are relatively straightforward to implement and can provide a decent starting point for generating or understanding language patterns. We will use Python to create our model.

We will cover everything from data preparation and preprocessing to building and using the n-gram model to generate text. This example will use a **bigram model**. Now let's dive into those concepts.


#### 3.1 N-gram models

An n-gram model is a type of probabilistic language model used to **predict the next item in a sequence of words or symbols**. The "n" in "n-gram" refers to the number of items in a given sequence that the model uses to make predictions. Essentially, an n-gram model uses the assumption that the probability of a word appearing in a text depends only on the preceding 𝑛−1 words. This model is widely used in various applications of natural language processing (NLP) such as speech recognition, spelling correction, and text generation.
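
In symbols, the n-gram assumption and the usual count-based estimate can be sketched as follows. The notation is introduced here for illustration and is not part of the original text: $w_t$ is the $t$-th token and $C(\cdot)$ is the number of times a sequence occurs in the training data.

$$
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
$$

$$
P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) \approx \frac{C(w_{t-n+1}, \dots, w_{t-1}, w_t)}{C(w_{t-n+1}, \dots, w_{t-1})}
$$

For 𝑛=2 the second expression reduces to $C(w_{t-1}, w_t) / C(w_{t-1})$: the number of times the pair occurs divided by the number of times the first word occurs.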

#### In an n-gram model

- **Tokenization**: The text is divided into tokens (words or characters).

- **N-gram Formation**: The tokens are used to form n-grams, which are sequences of 'n' consecutive tokens. These sequences form the basis for building the model.

- **Probability Calculation**: The model calculates the probability of each n-gram within the training data, for example the probability of a word given the preceding words.

- **Prediction**: To predict the next token in a sequence, the model uses the probabilities learned during training to select the most likely next token based on the previous 𝑛−1 tokens.

![image.png](attachment:image.png)


#### Types of n-grams

- **Unigram**: An n-gram model where 𝑛=1. Each word is modeled as an independent entity. 
    For example, in the sentence "The quick brown fox", the unigrams would be "The", "quick", "brown", "fox".

- **Bigram**: An n-gram model where 𝑛=2. Here, the model considers pairs of consecutive words. 
    For example, in the sentence "The quick brown fox", the bigrams would be "The quick", "quick brown", "brown fox".

- **Trigram**: An n-gram model where 𝑛=3. This model considers triplets of consecutive words. 
    For example, in the sentence "The quick brown fox", the trigrams would be "The quick brown", "quick brown fox".
    
![image-2.png](attachment:image-2.png)
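
The three cases above can be reproduced with a few lines of Python. This is only a sketch: the helper name `extract_ngrams` is made up for this example, and it splits on whitespace rather than using a proper tokenizer.

```python
def extract_ngrams(sentence, n):
    """Return every sequence of n consecutive words in the sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

example_sentence = "The quick brown fox"
for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
    print(name, extract_ngrams(example_sentence, n))
```

If `n` is larger than the number of words, the function simply returns an empty list.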


#### Limitations of the n-gram model

Despite their utility, n-gram models have limitations:

- **Sparsity**: As 'n' increases, the number of possible n-grams increases exponentially, many of which may never occur in the training data. This can make the model sparse and unreliable for these unseen n-grams.

- **Context Sensitivity**: The context captured by n-grams is limited to 𝑛−1 tokens, which may not be sufficient for understanding full sentence or paragraph contexts.

- **Performance**: They generally require significant memory and computational resources.
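
The sparsity problem can be illustrated on a toy corpus by comparing how many distinct n-grams actually occur with how many are possible for the same vocabulary. The sentence and the helper name `ngram_coverage` below are made up for this sketch.

```python
def ngram_coverage(tokens, n):
    """Return (distinct n-grams observed, n-grams possible for this vocabulary)."""
    vocab_size = len(set(tokens))
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    possible = vocab_size ** n  # grows exponentially with n
    return len(observed), possible

sparsity_tokens = "the cat sat on the mat and the dog sat on the rug".split()
for n in (1, 2, 3):
    observed, possible = ngram_coverage(sparsity_tokens, n)
    print(f"n={n}: {observed} distinct n-grams observed out of {possible} possible")
```

Even with only eight distinct words, the share of possible n-grams that the corpus covers shrinks quickly as `n` grows.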



#### 3.2 Bigram model

A bigram model is a specific type of n-gram model where 𝑛=2. It predicts the occurrence of a word based on the preceding word. 

For example, if your sentence starts with the word "New", a bigram model might predict that the next word is likely to be "York" if in the training data, "New York" appears frequently.

The bigram model is simple but powerful enough to capture some context, unlike the unigram model which treats each word independently. However, it is less sophisticated than higher order n-grams (like trigrams), which can capture more context and thus often have higher accuracy, at the expense of requiring more computational resources and data.
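
The "New York" intuition can be checked with simple counts. The toy text below is invented for this sketch; with it, the estimated probability of a word following "new" is just its share of all the words that follow "new".

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on spaces.
ny_tokens = "new york is big . new york is busy . new jersey is near new york".split()

# Count how often each word follows "new".
new_followers = Counter(
    nxt for prev, nxt in zip(ny_tokens, ny_tokens[1:]) if prev == "new"
)
new_total = sum(new_followers.values())

for word, count in new_followers.most_common():
    print(f"P({word} | new) = {count}/{new_total} = {count / new_total:.2f}")
```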

### 3.3 Steps to Create a Simple n-gram Language Model

**1. Data Collection**: Gather the data that the model will train on.

**2. Data Preprocessing**: Clean and tokenize the text data.

**3. Model Development**: Create the n-gram model.

**4. Model Training**: Train the model using your data.

**5. Prediction**: Use the model to predict the next word(s) in a sequence.

#### Let's build it!

In [None]:
#Step 1: Import Required Libraries

import re
import random
from collections import defaultdict, Counter

*For simplicity, we'll use a small text example. You should replace this with more comprehensive data for better results.*

**Context**

The text we have is a short, speculative review of the iPhone 15. It covers the phone's design and display, camera system, performance and battery life, iOS version, and connectivity, and ends with a brief conclusion about whether the upgrade is worthwhile.

In [None]:
# Prepare the text

text = """iPhone 15 Review: A Leap into the Future or Just a Step?
The iPhone 15 has finally landed, boasting several improvements and new features that aim to set it apart in the fiercely competitive smartphone market. Let’s dive into what the iPhone 15 offers and see if it’s worth the upgrade.
Design and Display
Apple continues to push the boundaries of design, and the iPhone 15 is no exception. The new model likely features a sleek, robust design with even thinner bezels and a more advanced display technology that offers brighter, more vivid colors and higher efficiency. We could expect an enhancement in the display refresh rate, making animations and transitions smoother than ever.
Camera System
The camera has always been a strong selling point for iPhones, and the iPhone 15 probably takes this to a new level. Enhanced sensor capabilities, combined with sophisticated software enhancements, likely allow for better low-light performance and more detailed images. The introduction of new photography modes and an advanced video recording features would be typical upgrades aimed at content creators.
Performance and Battery Life
With the introduction of a new chip, possibly the A17, the iPhone 15 is expected to deliver exceptional performance. This increase in power usually comes with efficiency improvements, contributing to longer battery life despite higher performance demands.
iOS Version
The iPhone 15 would ship with the latest version of iOS, offering improved user experiences, enhanced privacy features, and better integration with the wider Apple ecosystem. New iOS features often include updates to existing apps, more customization options, and improvements in Siri’s capabilities.
Connectivity
Enhancements in connectivity, such as advanced Wi-Fi and Bluetooth technology or even more robust 5G capabilities, are likely. These improvements help streamline interaction with other devices and enable faster data transfer speeds.
Conclusion
The iPhone 15 is expected to be a top contender in the market, pushing forward with incremental but impactful improvements that enhance the user experience. Whether it’s a must-buy will depend on personal preference and the importance of the latest tech in one’s daily life. For those with older iPhone models, this could be a compelling upgrade, but those with more recent versions might find the improvements less persuasive."""


*We'll clean and tokenize the text.*

In [None]:
#Data Preprocessing

def tokenize(text):
    text = text.lower()  # Convert to lower case
    tokens = re.findall(r'\b\w+\b', text)  # Extract words only
    return tokens

tokens = tokenize(text)
print(tokens)
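
It is worth checking what this tokenizer does to punctuation, contractions, and numbers. The sketch below repeats the same regular expression so that it runs on its own; the sample sentence is adapted from the review text.

```python
import re

def tokenize(text):
    # Same logic as the tokenize function above, repeated so this snippet is self-contained.
    return re.findall(r'\b\w+\b', text.lower())

sample = "Let’s dive into what the iPhone 15 offers!"
sample_tokens = tokenize(sample)
print(sample_tokens)
```

Punctuation is dropped, digits are kept, and the apostrophe splits "Let’s" into two tokens. That is acceptable for a toy model, but it is a design choice you may want to revisit for your own text.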

*We'll build a bigram model.*

In [None]:
#Create the n-gram Model

# Define a function to build a bigram model from a list of tokens.
def build_bigram_model(tokens):
    # The outer dictionary keys are words, and the values are Counters which will count occurrences of following words.
    model = defaultdict(Counter)
    
    # Iterate through the tokens with index i up to the second-to-last token
    # because we will be looking at pairs of consecutive tokens.
    for i in range(len(tokens) - 1):
        # Capture the current word and the next word in the sequence.
        prev, curr = tokens[i], tokens[i + 1]
        
        # Increment the count of the current word following the previous word.
        # model[prev][curr] is the count of how often 'curr' follows 'prev'.
        model[prev][curr] += 1
    
    # Return the completed model, which now represents the bigram frequency distribution.
    return model

# Generate the bigram model using the tokens from our text.
bigram_model = build_bigram_model(tokens)
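
The model above stores raw counts. As a sketch of how those counts relate to probabilities, the snippet below normalizes each word's follower counts so that they sum to 1. The names `count_bigrams` and `to_probabilities` and the toy sentence are made up for this example; `count_bigrams` mirrors the counting logic of `build_bigram_model` so that the snippet runs on its own.

```python
from collections import defaultdict, Counter

def count_bigrams(tokens):
    # Mirrors build_bigram_model: count how often each word follows another.
    model = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        model[prev][curr] += 1
    return model

def to_probabilities(model):
    """Normalize each word's follower counts so that they sum to 1."""
    probs = {}
    for prev, followers in model.items():
        total = sum(followers.values())
        probs[prev] = {word: count / total for word, count in followers.items()}
    return probs

demo_tokens = "the iphone 15 is here and the iphone is fast".split()
demo_probs = to_probabilities(count_bigrams(demo_tokens))
print(demo_probs["iphone"])
print(demo_probs["the"])
```

Sampling with `random.choices` and the raw counts as weights gives the same distribution as sampling with these normalized probabilities, which is why the notebook does not need this extra step.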

*This function uses the model to predict the next word given a starting word.*

In [None]:
#Generate Text Using the Model

# Define a function to generate text based on a bigram model, starting from a given first word.
def generate_text(model, first_word, num_words=50):
    # Initialize 'word' with the lowercased 'first_word', because the training tokens were lowercased.
    word = first_word.lower()

    # Print the starting word so that the output contains up to 'num_words' words in total.
    print(word, end=' ')

    # Generate the remaining num_words - 1 words.
    for _ in range(num_words - 1):

        # Stop early if no word ever followed the current word in the training data
        # (for example, the last word of the text): random.choices fails on an empty list.
        if not model.get(word):
            break

        # Retrieve a list of possible next words that could follow the current word according to the bigram model.
        next_words = list(model[word].keys())

        # Retrieve a list of the weights for each word in 'next_words'. These weights are based on the frequency of occurrence following the current word.
        word_weights = list(model[word].values())

        # Randomly choose one of the next words based on the weights, which are proportional to how frequently each word followed the current word in the training data.
        word = random.choices(next_words, weights=word_weights, k=1)[0]

        # Print the chosen word, followed by a space, without moving to a new line.
        print(word, end=' ')

# Call the generate_text function, starting with 'Design' as the first word, to generate up to 50 words.
generate_text(bigram_model, 'Design')

This example is **very basic and primarily educational**. Real-world language models like those used in commercial applications (e.g., GPT models by OpenAI) are trained on vast amounts of data and use more complex architectures like transformers. However, starting with an n-gram model provides a good foundation in understanding how language models work.



### 3.4: Now it is your turn

Test it out with your own text and experiment with different configurations!

In [None]:
#Your code

## 3.1 Using a pretrained model

In [None]:
#%pip install --disable-pip-version-check \
#    torch==1.13.1 \
#    torchdata==0.5.1 --quiet

#%pip install \
#    transformers==4.27.2 \
#    datasets==2.11.0  --quiet

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

### 3.1.2 - Summarize Dialogue without Prompt Engineering

In this use case, you will be generating a summary of a dialogue with the pre-trained Large Language Model (LLM) FLAN-T5, available through Hugging Face. The list of available models in the Hugging Face `transformers` package can be found [here](https://huggingface.co/docs/transformers/index). 

Let's load some simple dialogues from the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. This dataset contains 10,000+ dialogues with the corresponding manually labeled summaries and topics. 

In [None]:
from datasets import load_dataset

huggingface_dataset_name = "knkarthick/dialogsum"
# Check if you need to specify additional parameters like `name`
dataset = load_dataset(huggingface_dataset_name)

Print a couple of dialogues with their baseline summaries.

In [None]:
example_indices = [40, 200]
dash_line = '-'.join('' for x in range(100))

for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()

Load the [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5), creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method. 

In [None]:
model_name='google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

To perform encoding and decoding, you need to work with text in a tokenized form. **Tokenization** is the process of splitting texts into smaller units that can be processed by LLMs. 

Download the tokenizer for the FLAN-T5 model using the `AutoTokenizer.from_pretrained()` method. The `use_fast` parameter switches on the fast tokenizer. At this stage, there is no need to go into the details, but you can find the tokenizer parameters in the [documentation](https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/auto#transformers.AutoTokenizer).

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Test the tokenizer encoding and decoding a simple sentence:

In [None]:
sentence = "What time is it, Tom?"

sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"][0], 
        skip_special_tokens=True
    )

print('ENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

Now it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. **Prompt engineering** is the practice of changing the **prompt** (input) to improve the response for a given task.

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    
    inputs = tokenizer(dialogue, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{dialogue}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)
    print(f'MODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

You can see that the guesses of the model make some sense, but it doesn't seem to be sure what task it is supposed to accomplish. It seems to just make up the next sentence in the dialogue. Prompt engineering can help here.

## 3.2 - Summarize Dialogue with an Instruction Prompt

Prompt engineering is an important concept in using foundation models for text generation. You can check out [this blog](https://www.amazon.science/blog/emnlp-prompt-engineering-is-the-new-feature-engineering) from Amazon Science for a quick introduction to prompt engineering.

### 3.2.1 - Zero Shot Inference with an Instruction Prompt

In order to instruct the model to perform a task - summarize a dialogue - you can take the dialogue and convert it into an instruction prompt. This is often called **zero shot inference**.  You can check out [this blog from AWS](https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/) for a quick description of what zero shot learning is and why it is an important concept for LLMs.

Wrap the dialogue in a descriptive instruction and see how the generated text will change:

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"""
Summarize the following conversation into 3 short sentences.

{dialogue}

Summary below:
    """

    # Input constructed prompt instead of the dialogue.
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )
    
    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}')
    print(dash_line)    
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

This is much better! But the model still does not pick up on the nuance of the conversations.

**Exercise:**

- Experiment with the `prompt` text and see how the inferences change. Will the inferences change if you end the prompt with just an empty string vs. `Summary: `?
- Try to rephrase the beginning of the `prompt` text from `Summarize the following conversation into 3 short sentences.` to something different - and see how it will influence the generated output.

### 3.2.2 - Zero Shot Inference with the Prompt Template from FLAN-T5

Let's use a slightly different prompt. FLAN-T5 has many prompt templates that are published for certain tasks [here](https://github.com/google-research/FLAN/tree/main/flan/v2). In the following code, you will use one of the [pre-built FLAN-T5 prompts](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py):

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
        
    prompt = f"""
Dialogue:

{dialogue}

What was going on in that conversation?
"""

    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"], 
            max_new_tokens=50,
        )[0], 
        skip_special_tokens=True
    )

    print(dash_line)
    print('Example ', i + 1)
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}\n')

Notice that this prompt from FLAN-T5 did help a bit, but the model still struggles to pick up on the nuance of the conversation. This is what you will try to solve with few shot inference.

## 3.3 - Summarize Dialogue with One Shot and Few Shot Inference

**One shot and few shot inference** are the practices of providing an LLM with either one or more full examples of prompt-response pairs that match your task - before your actual prompt that you want completed. This is called "in-context learning" and puts your model into a state that understands your specific task.  You can read more about it in [this blog from HuggingFace](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api).

### 3.3.1 - One Shot Inference

Let's build a function that takes a list of `example_indices_full`, generates a prompt with full examples, then at the end appends the prompt which you want the model to complete (`example_index_to_summarize`).  You will use the same FLAN-T5 prompt template from section [3.2](#3.2). 

In [None]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']
        
        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""
    
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']
    
    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""
        
    return prompt

Construct the prompt to perform one shot inference:

In [None]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)

Now pass this prompt to perform the one shot inference:

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

### 3.3.2 - Few Shot Inference

Let's explore few shot inference by adding two more full dialogue-summary pairs to your prompt.

In [None]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)

Now pass this prompt to perform a few shot inference:

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

In this case, few shot did not provide much of an improvement over one shot inference. Anything above 5 or 6 shots typically does not help much either. You also need to make sure that you do not exceed the model's input context length, which in our case is 512 tokens. Anything beyond the context length will be ignored.

However, you can see that feeding in at least one full example (one shot) provides the model with more information and qualitatively improves the summary overall.
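One way to stay within the context window is to count tokens before calling the model. The sketch below is only an illustration: `count_tokens` is a hypothetical stand-in (a whitespace split) for a real tokenizer call such as `len(tokenizer(prompt)["input_ids"])`, and `max_shots_that_fit` is a helper name introduced here, not part of any library.

```python
def count_tokens(text):
    # Hypothetical stand-in: a real tokenizer usually yields more tokens than a whitespace split.
    return len(text.split())


def max_shots_that_fit(examples, query, limit=512):
    """Return how many leading examples fit, together with the query, within `limit` tokens."""
    used = count_tokens(query)
    shots = 0
    for example in examples:
        cost = count_tokens(example)
        if used + cost > limit:
            break
        used += cost
        shots += 1
    return shots


# Made-up example texts of 200 "tokens" each and a 50-"token" query
examples = ["word " * 200, "word " * 200, "word " * 200]
query = "word " * 50
print(max_shots_that_fit(examples, query))
```

With a real tokenizer, the same check can be run on the output of `make_prompt` before passing it to `model.generate`.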

**Exercise:**

Experiment with the few shot inferencing.
- Choose different dialogues - change the indices in the `example_indices_full` list and `example_index_to_summarize` value.
- Change the number of shots. Be sure to stay within the model's 512-token context length, however.

How well does few shot inferencing work with other examples?

### 3.5 - Generative Configuration Parameters for Inference

**Exercise:**

Change the configuration parameters to investigate their influence on the output. 

Setting `do_sample=True` activates sampling-based decoding, in which the next token is drawn from the probability distribution over the entire vocabulary rather than always being the most likely token. You can then adjust the outputs by changing `temperature` and other parameters (such as `top_k` and `top_p`).

Uncomment the lines in the cell below and rerun the code. Try to analyze the results. You can read some comments below.

In [None]:
#generation_config = GenerationConfig(max_new_tokens=50)
#generation_config = GenerationConfig(max_new_tokens=10)
#generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.1)
#generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.5)
generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, temperature=1.0)

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        generation_config=generation_config,
    )[0], 
    skip_special_tokens=True
)

print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')

Comments related to the choice of the parameters in the code cell above:
- Choosing `max_new_tokens=10` makes the output too short, so the dialogue summary is cut off.
- Setting `do_sample=True` and changing the temperature value gives you more variety in the output: low values stay close to the most likely tokens, while higher values are more random.
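To make the effect of temperature more concrete, here is a minimal sketch of temperature-scaled sampling in plain Python. It is not the Hugging Face implementation; the function names and the three-token logits are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_token(logits, 1.0, random.Random(0)))
```

At a low temperature almost all of the probability mass sits on the highest-scoring token, which is why the output looks nearly deterministic.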

As you can see, prompt engineering can take you a long way for this use case, but there are some limitations. Next, you will start to explore how you can use fine-tuning to help your LLM to understand a particular use case in better depth!

## 4. LLM APIs

### 4.0 Intro Notes

The OpenAI API provides access to a powerful suite of natural language processing (NLP) tools developed by OpenAI, including the well-known GPT (Generative Pre-trained Transformer) models. Here are key aspects and features of the OpenAI API:

- Text Generation

- Content Creation

- Coding and Scripting

- Customization


#### 4.1 What is an API?

An API, or Application Programming Interface, is a set of rules and protocols for building and interacting with software applications. It defines the methods and data formats that applications can use to communicate with each other. 

APIs are crucial for modern software development because they let different applications and systems exchange data and integrate each other's functionality. By using various types of APIs, such as Web APIs, SOAP APIs, and GraphQL APIs, developers can efficiently retrieve data, automate tasks, and integrate third-party services. By enabling interoperability and scalability, APIs play a fundamental role in creating complex, integrated, and efficient software solutions.


#### 6.2 How does it work?

The OpenAI API provides access to OpenAI's powerful language models, such as GPT-3 and GPT-4, enabling a wide range of applications like text generation, translation, summarization, and more. Here's an in-depth look at how it works, pricing, and key details:

**1. Access and Authentication:**

To use the OpenAI API, you need to **create an account on the OpenAI platform** and obtain an API key. This key is used to authenticate your requests.

**2. Endpoints:**

The API offers various endpoints for different tasks, such as text completion, chat, edits, and more. Each endpoint is accessed via specific URLs.

**3. Making Requests:**

Requests are made using standard HTTP methods (GET, POST, etc.). A typical request includes the API key for authentication, a JSON payload specifying the task, and any necessary parameters (like the prompt for text generation).

**4. Response Handling:**

The API responds with a JSON object containing the results of the request. For example, a text generation request might return the generated text along with metadata about the request.

**5. Error Handling:**

The API uses standard HTTP status codes to indicate success or failure. Detailed error messages help diagnose and fix issues.
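The request flow described above can be sketched with the standard library alone. The example only builds the HTTP request and does not send it; `build_chat_request` is a helper name introduced here for illustration, and `sk-placeholder` is a dummy value, not a real key.

```python
import json
import urllib.request


def build_chat_request(api_key, messages, model="gpt-3.5-turbo"):
    """Build (but do not send) a POST request for the Chat Completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url="https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),  # JSON payload describing the task
        headers={
            "Authorization": f"Bearer {api_key}",  # the API key authenticates the request
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("sk-placeholder", [{"role": "user", "content": "Hello"}])
print(request.get_method(), request.full_url)

# Sending it would look like this (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     body = json.loads(response.read())
```

In practice the official `openai` Python package wraps these details, so you rarely need to build requests by hand.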

**Prices**

*You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is roughly 30 tokens.*
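The rule of thumb above can be turned into a rough estimator. This is only an approximation, not a tokenizer, and the price used in the example is a hypothetical placeholder rather than an actual OpenAI price.

```python
def estimate_tokens(word_count):
    """Rough estimate from the rule of thumb: 1,000 tokens is about 750 words."""
    return round(word_count * 1000 / 750)


def estimate_cost(word_count, price_per_1k_tokens):
    """Estimated cost for a text of `word_count` words at a given price per 1,000 tokens."""
    return estimate_tokens(word_count) / 1000 * price_per_1k_tokens


print(estimate_tokens(750))
print(estimate_cost(1500, price_per_1k_tokens=0.01))  # 0.01 is a placeholder price
```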

![image.png](attachment:image.png)


**Keep in mind that ChatGPT Pro is a subscription plan specifically for using ChatGPT in a web-based chat interface, which provides enhanced features and faster response times within that platform. The OpenAI API, on the other hand, allows developers to integrate these language models into their own applications, websites, and services. API usage is billed separately, so you will have to pay to use the API even if you already have a subscription.**

### 4.2 API implementation

*Keep in mind that this is strictly for informational purposes. To actually use it, you can take advantage of the $5 trial OpenAI provides or find a website that allows you to use it for learning purposes.*

#### Getting an API KEY

The API gives you greater control and versatility when working with the GPT model (the model behind ChatGPT). It also allows seamless integration with other applications.

To access the model through the API, **you will need an API key**.

*To obtain your own API key, you'll need to create an account and set up billing. You can create your account at https://platform.openai.com/.*

In this introduction, we will read the API key from a text file. Create a `key.txt` file and paste into it the key your lead teacher gave you.

Alternatively, save it as an environment variable and read it as follows:

```python
import os

api_key = os.getenv("OPENAI_API_KEY")
```

In [None]:
# You can save your key in a text file and read it
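A minimal sketch of reading the key from `key.txt`, assuming the file sits in the notebook's working directory. `read_api_key` and `key_from_file` are names introduced here for illustration.

```python
from pathlib import Path


def read_api_key(path="key.txt"):
    """Return the key stored in `path`, stripped of surrounding whitespace, or None if missing."""
    key_file = Path(path)
    if not key_file.exists():
        return None
    return key_file.read_text(encoding="utf-8").strip()


key_from_file = read_api_key()
print("Key loaded" if key_from_file else "key.txt not found or empty")
```

Stripping whitespace matters because a trailing newline in the file would otherwise become part of the key. Keep `key.txt` out of version control.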

#### OpenAI Python Package Installation

To utilize the GPT API, you'll need to have the OpenAI Python package installed.

You can easily install it by running the command `pip install --upgrade openai`. *Adding the `--upgrade` flag ensures that you have the most up-to-date version, in case you installed openai before, as the GPT API is a recently introduced feature.*

In [None]:
#!pip install --upgrade openai

In [None]:
import openai

# Set our key: paste it here, or load it from key.txt or the environment variable
openai_api_key = ''

#### Chat Completions API

To use a GPT model via the OpenAI API, **you’ll send a request containing the inputs and your API key, and receive a response containing the model’s output**.

As of June 2024 there are two main API endpoints for working with GPT models:
- Completions API endpoint: only for the older legacy models
- Chat Completions API endpoint: to access the latest models, such as gpt-4 and gpt-4o.

Chat models in the Chat Completions API take two mandatory parameters:
- **List of messages as input**
- **Model**: we will use gpt-3.5-turbo (one of the older models)

They return a **model-generated message as output**.

An example API call looks as follows:

In [None]:
from openai import OpenAI

client = OpenAI(
    # If api_key is omitted, the client reads the OPENAI_API_KEY environment variable instead
    api_key=openai_api_key,
)

In [None]:
messages=[ # messages parameter must be a list of dictionaries
    # can be as short as one message or many back and forth turns.
    {"role": "system", "content": "You are a famous chef. Share your best cooking tips and tricks."}, # each dictionary has a role and content
    {"role": "user", "content": "What are the ingredients for making pancakes?"},
  ]

chat_completion = client.chat.completions.create(
    messages = messages,
    model = "gpt-3.5-turbo",
)

answer = chat_completion.choices[0].message.content
print(answer)

In conversations using the Chat Completions API, each message has a **role ("system," "user," or "assistant").** Typically, a conversation starts with a system message to set the assistant's behavior, followed by alternating user and assistant messages.
- The system message is optional and can be used to customize the assistant's personality or provide specific instructions.
- User messages contain requests or comments (prompts).
- Assistant messages store previous assistant responses or serve as examples of desired behavior.

#### Chat completions response format

An example chat completions API response looks as follows:

In [None]:
print(chat_completion)

In Python, the assistant's reply can be extracted with `chat_completion.choices[0].message.content`.



Every response includes a finish_reason. The possible values for finish_reason are:

- **stop**: API returned complete model output.
- **length**: Incomplete model output due to max_tokens parameter or token limit.
- **content_filter**: Omitted content due to a flag from OpenAI's content filters.
- **null**: API response still in progress or incomplete.
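A small sketch of how the `finish_reason` values listed above might be checked in code. To stay runnable without an API key, it uses a `SimpleNamespace` stand-in for `chat_completion.choices[0]`; `describe_finish` is a helper name introduced here for illustration.

```python
from types import SimpleNamespace


def describe_finish(choice):
    """Map a choice's finish_reason to a short human-readable note."""
    notes = {
        "stop": "complete output",
        "length": "truncated by max_tokens or the token limit",
        "content_filter": "content omitted by a content filter",
        None: "response still in progress or incomplete",
    }
    return notes.get(choice.finish_reason, f"unrecognized reason: {choice.finish_reason}")


# Stand-in for chat_completion.choices[0]; a real response exposes the same attribute.
fake_choice = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content="Mix the flour and"),
)
print(describe_finish(fake_choice))
```

Checking for `length` is especially useful when you set a small `max_tokens`, since the reply may stop mid-sentence.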



### Conversation history

Since the answer recommended baking powder, let's ask how much in a separate request:

In [None]:
messages=[{"role": "user", "content": "How much Baking Powder do you recommend for this recipe?"}]

chat_completion = client.chat.completions.create(
    messages = messages,
    model = "gpt-3.5-turbo")

answer = chat_completion.choices[0].message.content
print(answer)

We can see that the history was not saved.

**Including conversation history is crucial** when user instructions refer to prior messages. We can do that by sending all the previous messages, each with its role, in the *messages* parameter.

In [None]:
import textwrap

messages=[
    {"role": "system", "content": "You are a famous chef. Share your best cooking tips and tricks."},
    {"role": "user", "content": "What are the ingredients for making pancakes?"},
    {"role": "assistant", "content": "To make pancakes, you'll need flour, eggs, milk, baking powder, and a pinch of salt."}, # we shortened it for our demo purpose
    {"role": "user", "content": "Can I use almond milk instead of regular milk?"}
  ]

chat_completion = client.chat.completions.create(
    messages = messages,
    model = "gpt-3.5-turbo")

answer = chat_completion.choices[0].message.content

# Wrap the text to a specific width
wrapped_text = textwrap.fill(answer, width=80)
print(wrapped_text)

***Each user instruction relies on the prior messages in the conversation history to make sense.***

Since the language models don't have inherent memory of past requests, it's important to include the relevant conversation history in each API request. If the conversation exceeds the model's token limit, it may need to be truncated or shortened while ensuring the essential context and instructions are retained.
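One simple truncation strategy is to keep the system message and drop the oldest turns first. The sketch below assumes a rough word-based token estimate; `rough_token_count` and `trim_history` are hypothetical helpers, and a real application would count tokens with the model's tokenizer.

```python
def rough_token_count(text):
    # Hypothetical estimate based on the rule of thumb of about 750 words per 1,000 tokens.
    return round(len(text.split()) * 1000 / 750)


def trim_history(messages, max_tokens):
    """Keep the system message (if first) plus the most recent messages that fit the budget."""
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = messages[len(system):]
    budget = max_tokens - sum(rough_token_count(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = rough_token_count(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are a famous chef."},
    {"role": "user", "content": "word " * 300},
    {"role": "assistant", "content": "word " * 300},
    {"role": "user", "content": "Can I use almond milk?"},
]
trimmed = trim_history(history, max_tokens=450)
print([m["role"] for m in trimmed])
```

Dropping whole messages keeps the transcript well-formed; summarizing older turns is another option when early context still matters.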

#### Creating a basic conversation loop

This example demonstrates a conversation loop that performs the following tasks:

1. Takes console input continuously and formats it as the user's role content within the messages array.
2. Prints the model's response to the console and formats it as the assistant's role content within the messages array.

This approach ensures that each time a new question is asked, the ongoing conversation transcript is sent along with the latest question. Since the **model lacks memory**, it's crucial to include an updated transcript with each new question. Otherwise, the model may lose context from previous questions and answers.

In [None]:
# Restart the kernel or reset the variables
message_history=[{"role": "system", "content": "You are a helpful assistant."}]

def gpt_response(inp, message_history):
    # We save the user's input
    message_history.append({"role": "user", "content": f"{inp}"})

    chat_completion = client.chat.completions.create(
        messages = message_history,
        model = "gpt-3.5-turbo")

    # We save the assistant response
    message_history.append({"role": "assistant", "content": f"{chat_completion.choices[0].message.content}"})

    return message_history

In [None]:
while True:  # interrupt the kernel to stop the loop
    message_history = gpt_response(input("> "), message_history) # Let's ask different prompts via input
    print(message_history[-1]["content"]) # Print the last response

Let's check the message history to see if it has all the conversation.

In [None]:
message_history

#### Request parameters

Let's take a look at some request parameters.

- **Model**: Model type (e.g., gpt-3.5-turbo, gpt-4, gpt-4o)
- **Messages**: expects a list of messages in a chat-based format
- **Temperature (default 1)**: sampling temperature. Between 0 and 2. Higher value means more diverse and random output, while a lower value makes it more focused and deterministic.
- **Max tokens**: limits the length of the generated response (max length). The default of 16 applies to the legacy Completions API; in the Chat Completions API the response is, by default, limited only by the model's context length.

In [None]:
# Let's rewrite the gpt_response function to accept optional request parameters

def gpt_response(inp, message_history, **params):
    # We save the user's input
    message_history.append({"role": "user", "content": f"{inp}"})

    # Generate a response from the chatbot model
    completion_params = {
        "model": "gpt-3.5-turbo",
        "messages": message_history,
        **params  # Include additional parameters
    }

    chat_completion = client.chat.completions.create(**completion_params)

    # We save the assistant response
    message_history.append({"role": "assistant", "content": f"{chat_completion.choices[0].message.content}"})

    return message_history

Let's explore different settings by using a `max_tokens` value of 100 and testing three temperature levels (0, 1, and 2) to generate responses from the model, completing the prompt "My favourite animal is".

In [None]:
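for temperature in [0, 1, 2]:
    # Start from a fresh history so earlier answers do not influence later ones
    message_history = [{"role": "system", "content": "Complete the prompt."}]
    message_history = gpt_response("My favourite animal is ",
                                   message_history,
                                   max_tokens=100,
                                   temperature=temperature)
    print(f"temperature={temperature}: " + message_history[-1]["content"] + "\n") # Print the last response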

- **top_p** (Defaults to 1)
An alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens that make up the top `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

A higher value gives access to more tokens (and more diversity) and a lower value is more deterministic.

OpenAI generally recommends altering this or temperature but not both.
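Nucleus sampling can be illustrated with a few lines of plain Python. This is a conceptual sketch, not OpenAI's implementation; `nucleus_filter` and the next-token probabilities are made up for the example.

```python
def nucleus_filter(probs, top_p):
    """Return the renormalized distribution over the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


probs = {"dog": 0.5, "cat": 0.3, "owl": 0.15, "eel": 0.05}  # made-up next-token probabilities
print(nucleus_filter(probs, top_p=0.1))
print(nucleus_filter(probs, top_p=0.9))
```

With a small `top_p` only the most likely token survives, so the output is effectively deterministic; with a larger value, less likely tokens stay in the pool.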



Let's explore different settings by using a `max_tokens` value of 100 and testing two `top_p` levels (0 and 1) to generate responses from the model, completing the prompt "My favourite animal is".

In [None]:
for top_p in [0, 1]:
    # Start from a fresh history so earlier answers do not influence later ones
    message_history = [{"role": "system", "content": "Complete the prompt."}]
    message_history = gpt_response("My favourite animal is ",
                                   message_history,
                                   max_tokens=100,
                                   top_p=top_p)
    print(f"top_p={top_p}: " + message_history[-1]["content"] + "\n") # Print the last response

- **presence_penalty** (Defaults to 0)
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. **Higher values promote new topics by penalising any token that has already been used at least once.**

- **frequency_penalty** (Defaults to 0)
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. **Higher values penalise the model for repetition and reward variety.**
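The difference between the two penalties is easier to see as arithmetic on token scores. The sketch below follows the adjustment described in OpenAI's parameter documentation (a flat deduction for presence, a per-occurrence deduction for frequency); `apply_penalties` and the scores are illustrative assumptions, not the API's internals.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the score of tokens that already appeared in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        seen = 1.0 if count > 0 else 0.0
        # Frequency penalty grows with each repetition; presence penalty is applied once.
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted


logits = {"broke": 2.0, "poor": 1.5, "skint": 1.0}  # made-up scores
generated = ["broke", "broke", "poor"]
print(apply_penalties(logits, generated, presence_penalty=1.0, frequency_penalty=0.5))
```

Negative penalty values reverse the effect and make already-used tokens more likely, which tends to produce repetitive text.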


Let's explore different settings by using a `max_tokens` value of 100 and testing two levels (-2 and 2) for both the presence penalty and the frequency penalty, using the prompt "generate 20 ways to say you can't buy that because you're broke".

In [None]:
for penalty in [-2, 2]:
    # Start from an empty history so each setting is tested independently
    message_history = []
    message_history = gpt_response("generate 20 ways to say you can't buy that because you're broke",
                                   message_history,
                                   max_tokens=100,
                                   presence_penalty=penalty,
                                   frequency_penalty=penalty)
    print(f"penalty={penalty}: " + message_history[-1]["content"] + "\n") # Print the last response

### BONUS: Using Chat Completion for non-chat scenarios
The Chat Completion API is designed to work with multi-turn conversations, but it also works well for non-chat scenarios.

### Sentiment Analysis

Let's set up a sentiment analysis scenario where a user inputs a tweet, and the program generates responses using GPT, providing sentiment predictions until interrupted.

We will give the role *system* the task to decide whether a Tweet's sentiment is positive, neutral, or negative.
We include as a user pre-prompt the example "I loved the new Batman movie! Sentiment:", and an example assistant answer "Positive".

In [None]:
# Let's give an example of sentiment analysis
message_history = [
    {"role": "system", "content": "Decide whether a Tweet's sentiment is positive, neutral, or negative."},
    {"role": "user", "content": "Tweet: \"I loved the new Batman movie!\"\nSentiment:"},
    {"role": "assistant", "content": "Positive"},
]
print("Write a tweet: ")

while True:  # interrupt the kernel to stop the loop
    message_history = gpt_response(input("> "),
                                   message_history,
                                   temperature=0,
                                   max_tokens=60,
                                   frequency_penalty=0.5)
    print(message_history[-1]["content"])

### Language Translation

Let's set up a translation scenario where a user inputs a phrase to be translated into Spanish, Portuguese, and Italian. The program generates responses using GPT, providing translations until interrupted.

We will give as a system pre-prompt *Translate this into Spanish, Portuguese, Italian*.

In [None]:
message_history = [{"role": "system", "content": "Translate this into Spanish, Portuguese, Italian"}]
print("Write a phrase to translate to Spanish, Portuguese and Italian: ")

while True:  # interrupt the kernel to stop the loop
    message_history = gpt_response(input("> "), message_history, temperature=0.3, max_tokens=60)
    print(message_history[-1]["content"])

## 5. Creating a Chatbot

### Intro notes

Now you will see how you can utilize the chat format to have extended conversations with chatbots personalized or specialized for specific tasks or behaviors.

![image.png](attachment:image.png)

### Code

In [None]:
# Same setup as before, using the client interface of openai>=1.0
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

In [None]:
# Helper functions: one for a single prompt, one for a full list of messages
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, # this is the degree of randomness of the model's output
    )
#     print(str(response.choices[0].message))
    return response.choices[0].message.content

In [None]:
messages =  [  
{'role':'system', 'content':'You are an assistant that speaks like Shakespeare.'},    
{'role':'user', 'content':'tell me a joke'},   
{'role':'assistant', 'content':'Why did the chicken cross the road'},   
{'role':'user', 'content':'I don\'t know'}  ]

In [None]:
response = get_completion_from_messages(messages, temperature=1)
print(response)

In [None]:
messages =  [  
{'role':'system', 'content':'You are a friendly chatbot.'},    
{'role':'user', 'content':'Hi, my name is Isa'}  ]
response = get_completion_from_messages(messages, temperature=1)
print(response)

In [None]:
messages =  [  
{'role':'system', 'content':'You are a friendly chatbot.'},    
{'role':'user', 'content':'Yes, can you remind me, what is my name?'}  ]
response = get_completion_from_messages(messages, temperature=1)
print(response)