<center><img src="https://drive.google.com/uc?export=view&id=1-UrUldqGx2Ezp8fqw0J5fOU2YXswfjsb" width=640 /></center>

<br/>

<font size=6>
<center>Lesson 4: ChatGPT
</center>
</font>

<font size=4>
<center><i>Discovering the many things that one model can do</i>
</center>
</font>




# 1. Introduction


## Lesson Overview

This time, we’ll go deep into a machine learning model that’s made a lot of noise recently: ChatGPT. Fortunately, just like BERT from the previous lesson, it’s based on the **transformers** architecture, so it will be easier to understand what’s happening. In this lesson, you’ll:

* Understand the structure of the ChatGPT model and how it’s **trained**
* Learn about **in-context learning** and how it relates to other types of learning
* Practice solving real-life problems using the **OpenAI API** and **LangChain** library

<center><img src="https://drive.google.com/uc?export=view&id=1djg_MFGfQwg68uPN4cKW5pt3YNMsGlmX" width=1000 /></center>

# 2. Understanding ChatGPT


## The Base GPT model

First, a quick review of the **transformers** architecture from the previous lesson:


<center><img src="https://drive.google.com/uc?export=view&id=19D0hMWMKa2-wqt-6GN-RtoDwxvmb2AD-" width=700 /></center>

There are two major parts here:

* An **encoder**, which takes a sequence of input tokens and generates a sequence of embedding vectors
* A **decoder**, which takes both the embedding vectors and the previous output tokens and generates next output tokens

Note that there are three different types of embeddings: input token embeddings, encoder outputs, and output token embeddings. Token embeddings are connected to natural language, but encoder outputs are just vectors that only the model understands.

In the previous lesson, we worked with a BERT model, which only has the encoder part; it builds embedding vectors that can later be used to classify the whole sequence (for example, to detect the sentiment of reviews).

But what if instead, we build a model with only the decoder?

This may sound a little strange – because how would we decode something without encoding it first? However, this way we actually arrive at a **generative model**. Rather than transforming an input sequence into an output sequence (like in a machine translation task), a generative model just continues its input sequence; it can even create a text from scratch. (We talked a little bit about models like this in lesson 1.)

The following picture illustrates how it works:

<center>
<img src="https://drive.google.com/uc?export=view&id=1x3dtwbvP1tbL7Jqys8QX_NW6AehJekPw" width=600 />
</center>




A decoder-only model predicts the continuation of a text token by token, adding each new token to its input.
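The loop described above can be sketched with a toy stand-in for the model. Everything here is made up for illustration: `TOY_MODEL` is a hypothetical lookup table that only looks at the last token, whereas a real LLM conditions on the whole sequence and predicts probabilities rather than a single fixed token.

```python
# Toy stand-in for a trained model: maps the last token to the "most likely" next one.
# (Assumption for illustration only; a real LLM looks at the entire input sequence.)
TOY_MODEL = {"the": "cat", "cat": "sat", "sat": "down", "down": "<end>"}


def generate(prompt_tokens: list[str], max_new_tokens: int = 10) -> list[str]:
    """Continue the prompt token by token until <end> or the token limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens


print(generate(["the"]))
```

The important part is the `tokens.append(...)` line: each generated token is fed back in as input for the next step.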

Recent decoder-only models have billions of trainable parameters, and so they’re referred to as **Large Language Models** (**LLMs**).

The emergence of LLMs was a huge breakthrough in machine learning, and these days, several Large Language Models appear every week. Let’s just mention a few of them:



1. The **GPT** family, created by OpenAI, consists of decoder-only models (GPT means **Generative Pre-trained Transformer**). It includes very famous models like GPT-3.5, GPT-4 and GPT-4o.
2. The **Claude** family by Anthropic.
3. The **Gemini** family by Google.
4. **Mistral** and **Mixtral**, which are open source and can be downloaded from Hugging Face.

Let’s discuss the typical architecture for these models:



* The input data is transformed into vectors using **token embeddings** and **positional embeddings**.
* These vectors go through a lot of big **transformer layers**. (Like, really huge layers with a ton of parameters.)
* The final vector embeddings are translated back into token probabilities using **softmax**.

The training process for these models often consists of several steps (more details ahead!), but the first step is always the same: **pre-training**. During this step, the model browses through huge volumes of real texts and learns to reproduce them. Roughly speaking, for each text and for each token position of this text the model learns to solve a classification task: predicting the next token.
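To make the "predict the next token at every position" idea concrete, here is a toy sketch of how one text turns into many training examples. Splitting on whitespace is a simplifying assumption (real models use subword tokenizers), and `next_token_examples` is a hypothetical helper name.

```python
def next_token_examples(text: str) -> list[tuple[list[str], str]]:
    """Turn a text into (context, next token) pairs, one per token position."""
    # Assumption: whitespace tokens instead of a real subword tokenizer.
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in next_token_examples("the cat sat on the mat"):
    print(context, "->", target)
```

Each pair is one small classification problem: given the context, pick the correct next token out of the whole vocabulary.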

After the pre-training stage we get the **base GPT model**.

## From GPT to ChatGPT

Now, from the base GPT model, there are two more training steps to get the ChatGPT model.

**1. Supervised Fine-Tuning**

The ability to continue a sentence is not enough – ChatGPT should complete tasks given by users and have the ability to keep up a conversation.

To do that, OpenAI hired people to create dedicated fine-tuning data: prompts paired with completions, where the prompt states a task and the completion solves it. For example:

<br/>

| Prompt | Completion |
| -------- | -------- |
| How can I train an LLM?   | You need to start with pre-training and then...   |
| I feel depressed :(   | What's wrong? How can I help you?   |

<br/>

This step is just another supervised training step where the value comes from specific, high quality data.

**2. Alignment training**

We also expect an LLM not only to produce likely texts, but to be helpful, honest, and, well, “harmless”.

Helpfulness can be achieved after Supervised Fine-Tuning, but after that the model can still produce harmful or toxic answers. (And we wouldn’t want ChatGPT to tell us how to assemble a bomb or to write an 18+ movie script, would we?)

In other words, we need to ensure that the model aligns with human preferences. There are several ways of doing this; OpenAI used **RLHF** (**Reinforcement Learning from Human Feedback**).

We won't go too far into the details of this type of learning; let’s just cover the basics.

First, a **reward model** is trained. This is a model that is able to rank different completions of a prompt by assigning each a numerical score indicating how acceptable it is. So, a really toxic completion will receive a very low score, while a helpful, harmless one will get a much larger score.

The reward model is trained on triplets (prompt, better completion, worse completion) labeled by human contractors.

Next, RLHF trains the model to generate completions, maximizing the reward model score.
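One common way to train a reward model on such triplets is a pairwise ranking loss, which is also the form described in the InstructGPT paper: the loss is small when the better completion gets a higher score than the worse one. The sketch below only computes this loss for given scores; the scores are made-up numbers, and `pairwise_loss` is a hypothetical helper name.

```python
import math


def pairwise_loss(score_better: float, score_worse: float) -> float:
    """Ranking loss: -log(sigmoid(score_better - score_worse))."""
    diff = score_better - score_worse
    # -log(sigmoid(d)) is the same as log(1 + exp(-d))
    return math.log1p(math.exp(-diff))


# Made-up reward scores for two completions of the same prompt
print(pairwise_loss(score_better=2.0, score_worse=-1.0))  # ranking is right: small loss
print(pairwise_loss(score_better=-1.0, score_worse=2.0))  # ranking is wrong: large loss
```

Minimising this loss over many labeled pairs pushes the reward model to score preferred completions higher.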

<center>
<img src="https://drive.google.com/uc?export=view&id=1oO3mzSFK3UrrCmNiqOSsRdfqfPKtJ7xt" width=800 />

*Image from [InstructGPT paper](https://arxiv.org/abs/2203.02155)*
</center>

Taken together, these two steps fine-tune a general GPT model into a ChatGPT model with more specific goals. We’ll come back to this terminology a little later in the lesson.

Let's now pause and get a zoomed-out view of ChatGPT, because it’s easy to get lost in all the complex details. In a nutshell, this model is similar to the sentiment prediction linear model from lesson 2: it is a machine learning model with parameters that are trained on some data with a specific goal. ChatGPT obviously seems much more intelligent than a simple linear model, and the magic happens because ChatGPT has more parameters and a bigger training dataset, both larger by several orders of magnitude. The numbers on the right are for ChatGPT, based on GPT-3.5.

<center><img src="https://drive.google.com/uc?export=view&id=1Ax621tRQzJsT44tqRL_EQ6QyxpS0Q4sW" width=700 /></center>

Note that because ChatGPT is still a machine learning model, it can behave in unexpected ways. For example:

* The model does not actually give the best possible answer to a prompt; instead, it tries to give the answer that was most suitable during training. The difference is subtle, but there are some known cases of “weird” behaviour. For example, you can add things like "take a deep breath before answering" or "I'll tip you $100 if you answer well" to the prompt, and the results may improve. It's not entirely clear why the model does this, but it could be the result of seeing in the training data that humans answer better after such requests.
* Even after Supervised Fine-Tuning and RLHF, the model can still make mistakes or **hallucinate**, in other words, confidently make up information that sounds plausible but is false. (For example, if you ask it to name several research papers about elven artifacts of Gondolin, it will probably make up some just to be helpful and nice.)

## OpenAI API

You’ve probably already tried out ChatGPT in action (if not, we highly recommend doing so by visiting [https://chat.openai.com/](https://chat.openai.com/)). Now, we want to show you another way to use it: through the OpenAI API, which is a more flexible interface to the same models. Let's try it out, and we’ll also compare how the answers from the base GPT model and the ChatGPT model differ.

First, we need to install and import the `openai` package, which is the Python wrapper for the API.

In [None]:
!pip install -qq openai


In [None]:
import openai

In order to make requests to this API, we need to give it a valid API key so that it knows which account the requests should be attributed to and charged to. (If you’ve never used the API platform before, you’ll get some free credits to start with, which should be sufficient for our lessons.)

In [None]:
OPENAI_API_KEY = "<insert your key here>"

client = openai.OpenAI(api_key=OPENAI_API_KEY)
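A key pasted directly into a notebook is easy to leak by accident when the notebook is shared. One common alternative is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; `load_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Usage (assumes the variable was set before starting the notebook):
# client = openai.OpenAI(api_key=load_api_key())
```

Failing with a clear error at this point is usually easier to debug than an authentication error on the first request.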


Now, let’s give ChatGPT a prompt and check out its answer. With the API, you need to explicitly state the model you want to use; we’ll be using GPT-3.5 Turbo because it’s much cheaper than the more recent GPT-4o, yet it still provides good answers. Here’s the code for that (see the comments for explanations):

In [None]:
response = client.chat.completions.create(   # this function is for chat models
    model="gpt-3.5-turbo",   # this is the name of the model in the system
    messages=[   # prompts are given in a structured format of messages
        {
            "role": "user",   # this means that the message is from user
            "content": "Write a haiku",   # the actual text of the prompt
        }
    ]
)
response

ChatCompletion(id='chatcmpl-9lH1lxA4BZkBoAfOsTckluW735wOt', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Silent moonlight falls\nWhispers in the midnight air\nPeaceful dreams take flight', role='assistant', function_call=None, tool_calls=None))], created=1721053897, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=18, prompt_tokens=11, total_tokens=29))


The response object contains lots of meta-information, but for our purposes just the text is enough:

In [None]:
print(response.choices[0].message.content)

Silent moonlight falls
Whispers in the midnight air
Peaceful dreams take flight


As previously discussed, the ChatGPT model is built on top of a more general GPT model. Base models are rarely useful on their own, so the only one available through the API is from the GPT-3 generation. Still, we can access it and analyse its responses for a better understanding of ChatGPT:

In [None]:
# this model uses a different, simpler legacy API
response = client.completions.create(
    model="davinci-002",    # the name of the base GPT-3 model
    prompt="Write a haiku")
print(response.choices[0].text)

 about something green No matter how hard you try, or want, to do one


As you can see, the answer is very strange. This is because this model never learned to follow instructions; it just tries to continue the text from the given start. From this point of view, the answer makes more sense.

Note that if we phrase the prompt as if we are only waiting for completion, both models will do well with it:

In [None]:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "From the moment I walked in, I"}
    ],
    max_tokens=200,   # let's limit the number of tokens to prevent generation of very long answers
)
print(response.choices[0].message.content)

 could feel the energy in the room buzzing with excitement. The decor was vibrant and colorful, the music was upbeat and lively, and the smell of delicious food filled the air. I knew right away that this was going to be a fun and lively experience.

As I made my way through the room, I was greeted by friendly staff members who were eager to assist me and make sure I had everything I needed. I could tell that they genuinely cared about providing excellent service and creating a memorable experience for their customers.

I found a seat at a table near the dance floor and watched as people of all ages and backgrounds came together to enjoy the festivities. The dance floor was packed with people moving to the rhythm of the music, and the energy was infectious.

I ordered a drink and some appetizers, and as I enjoyed my meal, I couldn't help but smile at the joy and laughter that surrounded me. This was truly a special place, where people could come together to celebrate and enjoy life.

A

In [None]:
response = client.completions.create(
    model="davinci-002",
    prompt="From the moment I walked in, I",
    max_tokens=200,
)
print(response.choices[0].text)

 knew we had arrived at the right place to have an authentic Justice of the Peace wedding at Cherry Hill,.”

Who knew that so much love could grow from a group dance class, but I don’t think either Todd or Samantha would have it any other way. “People thought my name was Samantha for so long, that now I answer to it,” Todd told me. “We’ll be together forever.” “No one has ever made me laugh like Todd,” Samantha said. “We must have been separated at birth or something.”

Related




Let's do a small exercise to get more familiar with the API.

There is a parameter in the API called `temperature` that controls the randomness of the answers. Do you recall that the model generates one output token at a time, and each time predicts the probability of each possible next token? Temperature controls how the actual generated token is chosen:



* If the temperature is 0, the model just takes the token with the highest probability
* With low temperatures (below 1), the model samples a random token, but with a very strong preference for high-probability tokens (it amplifies their already high probabilities)
* With a temperature of 1, the model samples tokens according to the predicted probabilities as they are
* With higher temperatures (above 1), the distribution flattens: lower-probability tokens are boosted and become more even with the others, so they get generated more often

In the OpenAI API, this is described as a balance between consistency (low temperatures) and diversity or creativity (high temperatures).
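To see the effect on actual numbers, here is a minimal sketch of the textbook formulation: divide the model's raw scores (logits) by the temperature before applying softmax. The logits below are made up, `softmax_with_temperature` is a hypothetical helper, and the API's exact sampling internals are not documented here, so treat this as an illustration rather than a description of the real implementation.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy choice: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in [0, 0.5, 1.0, 2.0]:
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running it shows the top token's share shrinking as the temperature grows, while the least likely token's share increases.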

**Exercise.** Try using different temperature values (it has a range from 0 to 2) with a single prompt and analyse how the results change. Use the parameter _n_ to generate several answers with one request.

In [None]:
# <your code here>

In [None]:
# <SOLUTION>

import textwrap

prompt = "Pigeon is a bird that"
n_answers = 3
temperatures = [0., 0.5, 1.0, 2.0]

for temp in temperatures:
    response = client.completions.create(
        model="davinci-002",
        prompt=prompt,
        max_tokens=100,
        n=n_answers,
        temperature=temp,
    )
    print("== Temperature:", temp)
    print()
    for r in response.choices:
        print(textwrap.fill(r.text, 80))
        print()

== Temperature: 0.0

 is very common in the world. It is a bird that is very easy to find in the
world. Pigeon is a bird that is very easy to find in the world. Pigeon is a bird
that is very easy to find in the world. Pigeon is a bird that is very easy to
find in the world. Pigeon is a bird that is very easy to find in the world.
Pigeon is a bird that is very easy to find in the world

 is very common in the world. It is a bird that is very easy to find in the
world. Pigeon is a bird that is very easy to find in the world. Pigeon is a bird
that is very easy to find in the world. Pigeon is a bird that is very easy to
find in the world. Pigeon is a bird that is very easy to find in the world.
Pigeon is a bird that is very easy to find in the world

 is very common in the world. It is a bird that is very easy to find in the
world. Pigeon is a bird that is very easy to find in the world. Pigeon is a bird
that is very easy to find in the world. Pigeon is a bird that is very easy to
find in 

Well, it seems the results at the maximum temperature are totally out of control (creative), but this is expected behaviour.

<center><img src="https://drive.google.com/uc?export=view&id=1Qtbbu3FicxKTVXERSFIiV5-PkjxRk4kN" width=256 /></center>

If you want to play more with this sampling aspect, there’s another parameter called `top_p` that can cut off the lowest-probability tokens (you can read about it in the [API documentation](https://platform.openai.com/docs/api-reference/chat/create)).
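As a rough illustration of the idea behind `top_p` (often called nucleus sampling), the sketch below keeps the smallest set of most likely tokens whose total probability reaches `top_p` and renormalises the rest away. The probabilities are made up and `top_p_filter` is a hypothetical helper; this is the general technique, not necessarily how the API implements it.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their total probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    # Go through tokens from most to least likely
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalise to sum to 1


# Made-up next-token probabilities
probs = {"bird": 0.5, "pigeon": 0.3, "dove": 0.15, "zebra": 0.05}
print(top_p_filter(probs, top_p=0.7))
```

With `top_p=0.7`, only the two most likely tokens survive, so the unlikely tail can never be sampled regardless of the temperature.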

In the following sections, we’ll explore the capabilities of this model and see what applications can be built on top of it.

# 3. In-context Learning

Before starting to build applications on top of ChatGPT, let's understand one important concept relevant to it: **in-context learning** (or **few-shot learning**). This is the ability to temporarily learn new things from the data provided in the prompt. Let's see some examples; below is a typical question that draws on the base knowledge of GPT (with the addition of the nice phrasing trained into ChatGPT):

**👤 Prompt:**

<blockquote><p>
What company created the IPhone?
</p></blockquote>

**🤖 ChatGPT:**

<blockquote><p>
The iPhone was created by Apple Inc. Apple is a technology company that designs, manufactures, and markets consumer electronics, computer software, and other products, including the iPhone. The first iPhone was introduced by Apple's co-founder, Steve Jobs, and it was initially released on June 29, 2007. Since then, Apple has continued to release new models of the iPhone, making it one of the most popular and widely used smartphones globally.
</p></blockquote>




Now, let's make it harder and make the question a bit cryptic.

**👤 Prompt:**

<blockquote><p>
In this text hgieugheda means IPhone. What company created hgieugheda?
</p></blockquote>

**🤖 ChatGPT:**

<blockquote><p>
The term "hgieugheda" does not correspond to any recognizable product or brand name, and as of my last knowledge update in January 2022, it is not associated with the iPhone or any known technology product. If "hgieugheda" is a fictional term or code you've created to represent the iPhone in a specific context, it's important to note that in reality, the iPhone is a product of Apple Inc. If there have been developments or changes after my last update, I would not be aware of them.
</p></blockquote>

Despite adding an unnecessary critique of our prompt, the model still managed to give the correct answer to our question. This is a primitive example of in-context learning: the model learned new "rules" of the prompt and answered accordingly.

We can make it even more explicit by giving training examples and asking for predictions. Let's ask it to classify some foods as fruits or vegetables, but without telling the model what the rule is, just assigning the values 0 (vegetable) and 1 (fruit) to them:

**👤 Prompt:**

<blockquote><p>
tomato - 0

orange - 1

apple - 1

cucumber - 0

bellpepper - 0

Now following the same rule, answer this:

pear - ?
</p></blockquote>

**🤖 ChatGPT:**

<blockquote><p>
The pattern in the provided list appears to be assigning a value of 1 to fruits and a value of 0 to vegetables. Based on this pattern:

pear - 1

So, according to the established rule, the value for "pear" would be 1.
</p></blockquote>

This is a simple example, but we can still treat it as learning and predicting.
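When you have more than a handful of examples, it's convenient to build such a prompt programmatically. Here is a minimal sketch that reproduces the layout of the prompt above; `build_few_shot_prompt` is a hypothetical helper name, and the resulting string could be sent as the `content` of a user message.

```python
def build_few_shot_prompt(examples: list[tuple[str, int]], query: str) -> str:
    """Format labeled examples plus one unlabeled query as a single prompt."""
    lines = [f"{item} - {label}" for item, label in examples]
    lines.append("Now following the same rule, answer this:")
    lines.append(f"{query} - ?")
    return "\n\n".join(lines)


examples = [("tomato", 0), ("orange", 1), ("apple", 1), ("cucumber", 0), ("bellpepper", 0)]
few_shot_prompt = build_few_shot_prompt(examples, "pear")
print(few_shot_prompt)
```

Note that the "training set" here lives entirely inside the prompt; nothing about the model itself changes.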

However, there is a limit to how far we can take this:

**👤 Prompt:**

<blockquote><p>
In this text, iPhone means Microsoft Windows. What company created the iPhone?
</p></blockquote>

**🤖 ChatGPT:**

<blockquote><p>
The iPhone was created by Apple Inc., not Microsoft Windows. Apple is the company responsible for designing and manufacturing the iPhone, which is a line of smartphones that run on the iOS operating system developed by Apple. Microsoft Windows, on the other hand, is an operating system developed by Microsoft for personal computers, not mobile devices like the iPhone. If your text is using "iPhone" to refer to Microsoft Windows, it may be a misrepresentation or a mistake, as these are products of different companies.
</p></blockquote>

We can see that the model tried really hard to give the correct answer, but the "memory" from the training data came into conflict with the context, and the "memory" prevailed. This once again shows us the limits of what ChatGPT is capable of: it can be hard for it to generate an unlikely answer like "Microsoft created the iPhone", even though in the given context this is the correct answer.


<center><img src="https://drive.google.com/uc?export=view&id=114PQPyQ7fUSmG8n3aUwb1m8APdTz-wTS" width=320 /></center>

**Exercise.** Can you engineer a prompt that will help the model answer this question correctly? Try not to change the question, just add instructions about how to approach it.

**\<SOLUTION\>**

**👤 Prompt:**

<blockquote><p>
In this text "iPhone" means Microsoft Windows. What company created the "iPhone"? Try to solve this as a puzzle, do not refer too much to common knowledge.
</p></blockquote>

**🤖 ChatGPT:**

<blockquote><p>
Understanding the puzzle presented, let's approach it with lateral thinking:

If "IPhone" is interpreted to mean Microsoft Windows in this context, and we're looking for the company that created "IPhone" (interpreted as Microsoft Windows), we need to think about the company associated with the development and distribution of Microsoft Windows.

In this context, the company that "created" Microsoft Windows would be Microsoft Corporation. Therefore, the answer to the puzzle, considering the given interpretation, would be Microsoft Corporation.
</p></blockquote>

Now, let's take a step back and review the different types of learning for NLP models:



1. **Regular ML model training.** We collect a training set, then train the model from scratch on this training set. In the context of GPT, this is the **Pre-Training** stage, where an LLM is trained on a huge text dataset.
2. **Fine-tuning of a pre-trained model.** This is a model that was already trained on a general-purpose dataset, which is then tuned on smaller, task-specific data. ChatGPT is tuned to follow instructions during the **Supervised Fine-Tuning** stage and to align with human preferences during the **RLHF** stage.  \
(The fine-tuning of the BERT model for sentiment analysis from the previous lesson falls into the same category.)
3. **In-context learning.** This refers to a situation where we give the model all the training data (usually this is very small) inside the prompt and want the model to extract patterns from the prompt. This type of learning doesn't involve any model training in the technical sense because the model does not change during this process.

At the beginning of the ML revolution, only the first type of training was used; nowadays, all three of them are well established. Going from regular training to fine-tuning to in-context learning, we make shallower and shallower changes to the model's behaviour, but at the same time each step requires less and less data and computing power. This is very important: training a large model on terabytes of data for half a year is something only a few companies can afford!

<center>
<img src="https://drive.google.com/uc?export=view&id=18Qzaoo5MKMZX_-wWCYuE4BhyRUgwO9d-" width="600" />
</center>

# 4. ChatGPT as a General-Purpose AI



While it's easy to see that ChatGPT can give helpful answers to the most random questions, let's pose a different question: can it solve any of the practical tasks that other ML applications can? It turns out that, because the core GPT model was trained on a very diverse dataset of texts, the ChatGPT model can indeed solve a lot of text-based tasks.

Let's return to the task of understanding sentiment in user reviews of various businesses from the Yelp dataset that we previously worked with. As a refresher, let's load the dataset and take a look:

In [None]:
!pip install -qq datasets


In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [None]:
import pandas as pd
import textwrap

some_samples = [0, 72, 167, 213, 316]

for ind in some_samples:
    print("Review:", textwrap.fill(dataset['train'][ind]['text']))
    print("Rating:", int(dataset['train'][ind]['label']) + 1)
    print()

Review: dr. goldberg offers everything i look for in a general practitioner.
he's nice and easy to talk to without being patronizing; he's always
on time in seeing his patients; he's affiliated with a top-notch
hospital (nyu) which my parents have explained to me is very important
in case something happens and you need surgery; and you can get
referrals to see specialists without having to see him first.  really,
what more do you need?  i'm sitting here trying to think of any
complaints i have about him, but i'm really drawing a blank.
Rating: 5

Review: Yay, I'm a fan but sometimes service is a little slow, it was very
good for us this visit.  Go to Papa j's every once in a while but
mostly for the White Pizza.  It is the best white pizza I have ever
had.  Order the white pizza on our visit this  weekend... it has
garlic, spinach, feta cheese and we usually add some veggie on top.
It was delicious!  Order fried calamari and it was OK...note to self
next time try the calamari roman sty

Now, in order to solve our sentiment analysis task, we’ll just ask ChatGPT to do it!

In [None]:
# remove unnecessary whitespaces using dedent function
prompt = textwrap.dedent("""
    Here is a review of some company:

    "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

    Predict how many stars from 1 to 5 this user rated this company.
""")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature = 0.0)

# use fill function for better readability
print(textwrap.fill(response.choices[0].message.content))

Based on the positive feedback provided in the review, it is likely
that this user rated the company with 5 stars.



And the answer is correct! Let's try another one with a rating of 1:

In [None]:
# remove unnecessary whitespaces using dedent function
prompt = textwrap.dedent("""
    Here is a review of some company:

    "The worse "Chinese food", even the white rice which came with the ogle sweet and sour chicken is really bad. Please don't eat this food."

    Predict how many stars from 1 to 5 this user rated this company.
""")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature = 0.0)

# use fill function for better readability
print(textwrap.fill(response.choices[0].message.content))

Based on the review, it seems like the user had a very negative
experience with the company. They specifically mentioned that the food
was bad and advised others not to eat it. Therefore, it is likely that
this user rated the company with 1 star out of 5.


Also correct!

Naturally, the next step will be to run an evaluation on the entire test dataset. However, we have a problem that needs to be addressed first: there is no uniform structure for the answers, meaning that it will be hard to extract and evaluate them with metrics. So, let’s try asking ChatGPT to format this properly:

In [None]:
# remove unnecessary whitespaces using dedent function
prompt = textwrap.dedent("""
    Here is a review of some company:

    "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."

    Predict how many stars from 1 to 5 this user rated this company.
    At the end of your answer, give your final decision about the rating in the following format: "Final Prediction: X stars"
""")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature = 0.0)

# use fill function for better readability
print(textwrap.fill(response.choices[0].message.content))

Based on the positive feedback provided in the review, it is likely
that the user rated this company 5 stars. Final Prediction: 5 stars


This seems to be working, though we cannot be sure that the model will be consistent with it.

**Exercise.** Write a function that takes the text of a review as input and returns the star rating as an integer. If the answer is not in any recognizable format, the function should raise an exception.

In [None]:
# Exception class for failed parsing
class ReviewParseError(Exception):
    pass

# passing also the OpenAI client as parameter for better usability
def predict_review_rating(text: str, client) -> int:

    # <your code here>

    pass

In [None]:
# <SOLUTION>

import re

class ReviewParseError(Exception):
    pass

def predict_review_rating(text: str, client) -> int:
    prompt = textwrap.dedent("""
            Here is a review of some company:
            "{text}"
            Predict how many stars from 1 to 5 this user rated this company.
            At the end of your answer, give your final decision about the
            rating in the following format: "Final Prediction: X stars".
        """).format(text=text)

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt,}],
        temperature=0.0)

    # strip trailing whitespace so a final newline doesn't leave an empty last line
    last_line = response.choices[0].message.content.strip().split('\n')[-1]
    m = re.match(r"[\w\s]*[ :]+(?P<rating>[1-5])\s?stars?[\s.!,;]*", last_line)

    if m is None:
        raise ReviewParseError("Failed to parse the prediction.")
    else:
        return int(m.group('rating'))

Let's test out how it works:

In [None]:
ind = 0
print("Review:", textwrap.fill(dataset['train'][ind]['text']))
print("Rating:", int(dataset['train'][ind]['label']) + 1)
print()
ans = predict_review_rating(dataset['train'][ind]['text'], client)
print("Answer:", ans)

Review: dr. goldberg offers everything i look for in a general practitioner.
he's nice and easy to talk to without being patronizing; he's always
on time in seeing his patients; he's affiliated with a top-notch
hospital (nyu) which my parents have explained to me is very important
in case something happens and you need surgery; and you can get
referrals to see specialists without having to see him first.  really,
what more do you need?  i'm sitting here trying to think of any
complaints i have about him, but i'm really drawing a blank.
Rating: 5

Answer: 5


**Exercise.** Evaluate this prediction function on the dataset. Determine both the accuracy and the percentage of parsing errors.

_Note: This function will probably be very slow, but we don't want to bother too much with speed for now, so you can use around 100 examples for simplicity’s sake._

<details>
    <summary><font color="404090">Hint (click to expand).</font></summary>

Setting the temperature to 0 may help a lot.

</details>

In [None]:
# <your code here>

In [None]:
# <SOLUTION>

import random
from tqdm import tqdm

def evaluate_chatgpt_predictions(dataset, client, subset_size=1, seed=12345):
    indexes = list(range(len(dataset)))
    r = random.Random(seed)
    r.shuffle(indexes)
    indexes = indexes[:subset_size]

    correct = 0
    failed = 0
    for ind in tqdm(indexes):
        text = dataset[ind]['text']
        label = int(dataset[ind]['label']) + 1
        try:
            pred = predict_review_rating(text, client)
            if label == pred:
                correct += 1
        except ReviewParseError:
            failed += 1

    return correct / len(indexes), failed / len(indexes)

acc, fail = evaluate_chatgpt_predictions(dataset['test'], client, subset_size=100)
print()
print("Accuracy: ", acc)
print("Failed: ", fail)

100%|██████████| 100/100 [01:59<00:00,  1.20s/it]


Accuracy:  0.61
Failed:  0.0





In our solution, we ended up with 61% accuracy and a 0% failure rate. However, since ChatGPT is updated regularly, your results may vary. In any case, they are probably similar, and it’s quite impressive that this model is more accurate than the linear one, considering that no actual training was performed, just a little bit of in-context learning.

<center><img src="https://drive.google.com/uc?export=view&id=18ltp1vYx6PYcohdOOZ2zbtiplXQcxfGP" width=320 /></center>

There is one thing worth mentioning, however: this experiment, making predictions with out-of-the-box ChatGPT, may give a misleading result. That is because the dataset could have **leaked** into the GPT training data. We can't know this for certain, but given that the model was trained on a huge dataset of texts sourced from the Internet (and given that the Yelp dataset is not very novel), there’s a good chance that it ended up in the training data. The model hasn't been trained to predict the sentiment of these texts, but it may have seen the texts and their labels somewhere in close context.

Potential test data leakage is actually a huge problem for benchmarking contemporary LLMs. Most of the state-of-the-art models are regularly updated with new data downloaded from the Internet so that they can answer prompts with more up-to-date information (and also just to get even more training data). So, if someone creates an open benchmark for testing LLMs, it will leak into the training data soon enough and thus invalidate all the testing. Unfortunately, as of now, there is no established protocol for doing this correctly.

# 5. LangChain + OpenAI API


## LangChain Basics

Now that we've tried writing applications on top of ChatGPT, let's learn how to do this using some even more powerful tools. **LangChain** is one of those tools, and it helps us build applications on top of language models. First, let's install it and set it up.

In [None]:
!pip install -qq langchain langchain-openai

Using this library, you can build what are called **chains**, that is, sequences of steps for processing data with LLMs. Let's build a simple chain for our Yelp review prediction. With the LangChain library, our chain will contain three parts:

1. A prompt template for converting input data (review) into the LLM prompt
2. An LLM interface for processing the prompt
3. An output parser for converting the LLM's output into a proper data type

As you can see, this is very similar to what we've done before, just split into more standardized blocks.
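
To make the three parts above concrete, here is a minimal sketch in plain Python, with no LangChain involved. The names `render_prompt`, `fake_llm`, `parse_json`, and `make_chain` are hypothetical and exist only for this illustration; in particular, `fake_llm` is a stub that always returns the same JSON string instead of calling a real model. This is not how LangChain implements chains, only the same idea of feeding each step's output into the next one.

```python
import json
from functools import reduce

def render_prompt(review: str) -> str:
    """Step 1: turn the input data (a review) into an LLM prompt."""
    return (
        f'Here is a review of some company:\n\n"{review}"\n\n'
        "Return a JSON object with a `rating` key that stores a prediction "
        "of how many stars from 1 to 5 this user rated this company."
    )

def fake_llm(prompt: str) -> str:
    """Step 2: hypothetical stand-in for the model; always answers 5 stars."""
    return '{"rating": 5}'

def parse_json(llm_output: str) -> dict:
    """Step 3: convert the raw model text into a Python object."""
    return json.loads(llm_output)

def make_chain(*steps):
    """Compose steps so that each one receives the previous one's output."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

toy_chain = make_chain(render_prompt, fake_llm, parse_json)
print(toy_chain("great doctor, always on time"))
```

Swapping `fake_llm` for a function that calls a real model would leave the other two steps untouched, which is the main benefit of splitting the pipeline into standardized blocks.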

So, first, let's build a prompt template. And for better parsing quality, let's ask ChatGPT to output the results in JSON format, instead of plain text:

In [None]:
# just a general template for prompts
from langchain_core.prompts import ChatPromptTemplate

# adding all our prompt text, plus a {review} template variable
json_prompt = ChatPromptTemplate.from_template(
    # start the text on a new line so that dedent can strip the common indentation
    textwrap.dedent("""
            Here is a review of some company:

            "{review}"

            Return a JSON object with a `rating` key that stores a
            prediction of how many stars from 1 to 5 this user rated this company.
        """)
)

Next, let's create an LLM, which is basically just ChatGPT wrapped in a LangChain class:

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo",
                 temperature=0.0,
                 openai_api_key=OPENAI_API_KEY)

Then, we’ll import an output parser, which is just an out-of-the-box feature in LangChain for parsing JSON:

In [None]:
from langchain.output_parsers.json import SimpleJsonOutputParser

json_parser = SimpleJsonOutputParser()

Now, the last step – connecting all the elements into one chain using this neat syntax:

In [None]:
chain = json_prompt | llm | json_parser

Let's test it! To run a chain, we call its `invoke` method and pass it a dictionary with the named variables from the prompt:

In [None]:
chain.invoke({"review": dataset['train'][0]['text']})

{'rating': 5}

**Exercise.** Evaluate this new function and find out its accuracy and failure rate. Again, to save time, you can just use 100 examples.

In [None]:
# <your code here>

In [None]:
# <SOLUTION>

import random
from tqdm import tqdm
from langchain_core.exceptions import OutputParserException

def evaluate_chatgpt_predictions_2(dataset, chain, subset_size=1, seed=12345):
    indexes = list(range(len(dataset)))
    r = random.Random(seed)
    r.shuffle(indexes)
    indexes = indexes[:subset_size]

    correct = 0
    failed = 0
    for ind in tqdm(indexes):
        text = dataset[ind]['text']
        label = int(dataset[ind]['label']) + 1
        try:
            # invoke is inside the try block: the JSON parser raises on malformed output
            ans = chain.invoke({"review": text})
            pred = int(ans['rating'])
            if label == pred:
                correct += 1
        except (KeyError, ValueError, TypeError, OutputParserException):
            failed += 1

    return correct / len(indexes), failed / len(indexes)

In [None]:
acc, fail = evaluate_chatgpt_predictions_2(dataset['test'], chain, subset_size=100)
print()
print("Accuracy: ", acc)
print("Failed: ", fail)

100%|██████████| 100/100 [00:56<00:00,  1.77it/s]


Accuracy:  0.68
Failed:  0.0





The accuracy is probably very similar to what we saw before, but with much less manual parsing needed.

There is also a way to improve LLM prediction results by using a technique called **Chain of Thought** (CoT): we add some text to the prompt that encourages the model to reason through the problem before arriving at the final answer. In this task with Yelp, it doesn't seem to help much, but we’ll demonstrate how to do it with LangChain anyway because it’s a useful tool.

We’ll use the simplest version of CoT by adding the phrase "Let's think step by step" to our prompt. This will encourage the model to elaborate more, but it will also ruin the JSON format. So, we’ll need to pass the output again to the LLM and ask it to build the answer in JSON format; this is done by creating two chains and passing one to the other:

In [None]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

# first prompt with elaboration and prediction
prompt_1 = ChatPromptTemplate.from_template(
    # start the text on a new line so that dedent can strip the common indentation
    textwrap.dedent("""
            Here is a review of some company:

            "{review}"

            You need to analyse the sentiment of this text. Then make a
            prediction of how many stars from 1 to 5 this user rated this company.
            Let's think step by step.
        """)
)

# chain with simple string output
chain_1 = prompt_1 | llm | StrOutputParser()

# second prompt just for JSON extraction
prompt_2 = ChatPromptTemplate.from_template(
    textwrap.dedent("""
            Here is the analysis of the sentiment of a user review:

            "{cot}"

            Return a JSON object with a `rating` key that stores a
            prediction of how many stars from 1 to 5 this user rated this company.
        """)
)

# second chain using syntax for passing results from one chain to another
chain_2 = {"cot": chain_1} | prompt_2 | llm | json_parser

Let's test how it works:

In [None]:
chain_2.invoke({"review": dataset['train'][0]['text']})

{'rating': 5}

In [None]:
acc, fail = evaluate_chatgpt_predictions_2(dataset['test'], chain_2, subset_size=100)
print()
print("Accuracy: ", acc)
print("Failed: ", fail)

100%|██████████| 100/100 [04:40<00:00,  2.81s/it]


Accuracy:  0.68
Failed:  0.0





**Bonus exercise.** Can you make use of this technique to actually improve the quality? Here’s [a guide on CoT](https://deepgram.com/learn/chain-of-thought-prompting-guide) that you can peruse if needed.

_Note: the smaller the change in quality you are trying to detect, the more examples you’ll need for testing._
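
To get a feel for how many examples are enough, it helps to estimate the uncertainty of an accuracy measurement. The sketch below uses the standard normal approximation for a proportion, which is an assumption that works reasonably well when the accuracy is not too close to 0 or 1. The function name `accuracy_confidence_interval` is hypothetical, and 0.68 is simply the accuracy from the run above.

```python
import math

def accuracy_confidence_interval(accuracy: float, n_examples: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy estimate.

    z = 1.96 corresponds to a 95% interval.
    """
    std_error = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    margin = z * std_error
    # Clip to the valid range of a proportion.
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

for n in (100, 1000, 10000):
    low, high = accuracy_confidence_interval(0.68, n)
    print(f"n={n:>5}: 95% interval = [{low:.3f}, {high:.3f}]")
```

With 100 examples the interval is roughly ±0.09 around the measured accuracy, so a difference of a few percentage points between two prompts cannot be distinguished from noise; with 1,000 examples it shrinks to roughly ±0.03.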

<center><img src="https://drive.google.com/uc?export=view&id=1wkxdhtNHUm3S3aAbq4nK586mor8YIv5l" width=320 /></center>

## RAG with LangChain

Just to see how powerful LangChain is, let's implement another thing using it: **Retrieval Augmented Generation**, or **RAG**. The idea behind this approach is to improve the LLM's question-answering abilities by adding search results from a search engine or a database to the prompt as additional information. This way, we make use of the LLM's ability to generate intelligent answers while also making sure that the information it uses is as up-to-date as possible.
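
The core idea can be sketched without any library at all. In the toy example below, the "search engine" is just a word-overlap ranking over a small list of made-up documents, and the result is a prompt that could be sent to an LLM. The names `retrieve` and `build_rag_prompt`, the documents, and the prompt wording are all hypothetical; real systems typically use a web search API or a vector database for the retrieval step.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Augment the question with the retrieved documents."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Use the following search results to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {query}"
    )

# Made-up documents, used only to illustrate the ranking.
docs = [
    "The library opens at 9 am on weekdays.",
    "Bananas are rich in potassium.",
    "On weekends the library opens at 11 am.",
]
print(build_rag_prompt("When does the library open on weekends?", docs))
```

The model never has to "know" the opening hours: the relevant text is placed directly into its context, which is what keeps the answers current.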

A simple RAG in LangChain will work almost out-of-the-box, so let's make one, and get a general overview of how it works.

We’ll need to know about two new concepts for this task: **agents** and **tools**. Essentially, an agent is a high-level system that uses an LLM together with prompt templates and various tools to solve complex tasks. Tools are things like web search, Python code execution, and various APIs (known as "plugins" in the ChatGPT web UI).

So, in our case, we’ll just need to create an agent and provide it with a web search tool and the appropriate prompts. We’ll use DuckDuckGo search and a pre-made prompt from LangChain Hub designed for these kinds of agents:

In [None]:
!pip install -qq duckduckgo-search langchainhub langchain-community

In [None]:
from langchain.agents import AgentExecutor, create_react_agent
from langchain import hub
# the search tool lives in the langchain-community package installed above
from langchain_community.tools import DuckDuckGoSearchRun

# just loading a prompt for ReAct agents (Reasoning + Acting)
prompt = hub.pull("hwchase17/react")

# using only one tool: web search
tools = [DuckDuckGoSearchRun()]

# pass everything to an agent (llm was already defined before)
agent = create_react_agent(llm, tools, prompt)
# just another wrapper, but we are setting verbose to true to see the details
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

Now, let's run the agent with a query that involves looking up some up-to-date information:

In [None]:
agent_executor.invoke({
    "input": "What is the most interesting scientific discovery of 2024?"})


> Entering new AgentExecutor chain...
I should use the duckduckgo_search tool to find the most recent scientific discoveries.
Action: duckduckgo_search
Action Input: "most interesting scientific discovery 2024"
12 recent scientific breakthroughs. From cell reparation to reef restoration. 1. Restoring brain cells. Scientists in many fields received little recognition for the last couple of years, as the ... At CAS, we have a unique view of recent scientific breakthroughs, the historical discoveries they were built upon, and the expertise to navigate the opportunities ahead. In 2023, we identified the top scientific breakthroughs, and 2024 has even more to offer. New trends to watch include the accelerated expansion of green chemistry ... At CAS, we have a unique view of recent scientific breakthroughs, the historical discoveries they were built upon, and the expertise to navigate the opportunities ahead. In 2023, we identified the top scientific br

{'input': 'What is the most interesting scientific discovery of 2024?',
 'output': "The most interesting scientific discovery of 2024 is the groundbreaking body of work led by Monash University physicists that has opened a new pathway for understanding the universe's fundamental physics."}

From the output you can see that the agent:



1. Understood what it needs to search for, and created a search query
2. Got the search results from the search engine
3. Analyzed the results and generated the answer
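
The three steps above follow a loop that can be sketched in a few lines of plain Python. Everything here is a toy under stated assumptions: `fake_llm` and `fake_search` are hypothetical stubs standing in for the model and the search tool, and the text format (`Action:`, `Action Input:`, `Observation:`, `Final Answer:`) only imitates the style of the log shown above. It is not LangChain's actual implementation.

```python
def fake_search(query: str) -> str:
    """Hypothetical stand-in for a web search tool."""
    return f"Top result for '{query}': (search snippet would appear here)"

def fake_llm(transcript: str) -> str:
    """Hypothetical stand-in for the model: search first, then answer."""
    if "Observation:" not in transcript:
        return (
            "Thought: I should look this up.\n"
            "Action: search\n"
            "Action Input: scientific discovery 2024"
        )
    return (
        "Thought: I now know the final answer.\n"
        "Final Answer: (answer based on the observation)"
    )

TOOLS = {"search": fake_search}

def run_agent(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        transcript += reply + "\n"
        # Step 3: the model has analyzed the results and produced an answer.
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Step 1: the model decided which tool to call and with what query.
        action = reply.split("Action:", 1)[1].split("\n", 1)[0].strip()
        action_input = reply.split("Action Input:", 1)[1].split("\n", 1)[0].strip()
        # Step 2: run the tool and feed its output back to the model.
        observation = TOOLS[action](action_input)
        transcript += f"Observation: {observation}\n"
    raise RuntimeError("Agent did not finish within max_steps")

print(run_agent("What is the most interesting scientific discovery of 2024?"))
```

The key design point is that the growing transcript is passed back to the model on every iteration, so each tool result becomes part of the context for the next decision.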

So, that’s a simple way to upgrade a search engine with an LLM, and more complex RAG systems are widely used in many applications right now!

# 6. Conclusion

In this lesson, we talked a lot about ChatGPT and its applications, so let's wrap up what we've learned:



* We understood how the GPT model is built using the **transformer** architecture and how ChatGPT is built on top of it
* We learned about different kinds of model learning, and most interestingly for LLMs, **in-context learning**
* We practised writing applications on top of LLMs, using both a basic **API** and the **LangChain** library

Up until this point, we’ve mostly been talking about NLP tasks, but in the [**next lesson**](https://colab.research.google.com/github/Nebius-Academy/ML-Starter-Pack/blob/main/notebooks/lesson_5.ipynb), we’ll switch to a new kind of task: **image classification**.

By the way, if you’re interested, here are some additional resources on the things we talked about in this lesson:

* This [blog post from OpenAI](https://openai.com/blog/chatgpt) describes the ChatGPT model
* See a [GPT-2 implementation](https://jaykmody.com/blog/gpt-from-scratch/) from scratch in NumPy
* A [guide](https://deepgram.com/learn/chain-of-thought-prompting-guide) on the Chain-of-Thought technique
* A [simple guide](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain) on building RAGs for more specific applications than web search