# Using `gpt-3.5-turbo` to answer questions posed in natural language, using a custom dataset and Retrieval-Augmented Generation



---
**Credit**: Adapted from this [OpenAI notebook](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)

---

Many use cases require us to respond to user questions with relevant and accurate answers. For example, a customer support chatbot may need to provide answers to common support questions.

The GPT models have picked up a lot of general knowledge in training - remember, GPT-3's training data amounted to roughly 500 billion tokens! - but we often would like to have the model *use our own dataset or library* of more specific information to answer the questions (e.g., we would like our customer service chatbot to consult a library of service manuals when it answers a user question). We'd expect those tailored responses to be more helpful and accurate than generic responses uninformed by our specific data.

In this notebook we will demonstrate a method for enabling `gpt-3.5-turbo` to answer questions using a library of text as a reference. We'll be using a dataset of Wikipedia articles about the 2022 Winter Olympic Games, but the same approach can be used with a library of books, articles, documentation, service manuals, or much much more.

## Setup

Let's get started by installing the `openai` Python package and `tiktoken`, a package that tokenizes inputs using byte-pair encoding (BPE) in a manner compatible with the OpenAI models.


In [None]:
!pip install --upgrade openai
!pip install tiktoken python-dotenv

In [2]:
import pandas as pd  # for storing text and embeddings data
import dotenv
import os


We will use `gpt-3.5-turbo` and then `gpt-4o` in this example. `gpt-3.5-turbo` is the GPT variant that powered the initial release of ChatGPT and remains a potential backend for that service.




---
We will be using **pre-trained contextual embeddings** as well. For that, we will
use the `text-embedding-ada-002` model ([link](https://openai.com/blog/new-and-improved-embedding-model)).


---

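Finally, let's set the OpenAI API key. You can get yours [here](https://platform.openai.com/account/api-keys), and then make it available as the `OPENAI_API_KEY` environment variable, for example in a `.env` file that `dotenv.load_dotenv()` picks up. (If you are in Colab, you can keep it under `OPENAI_API_KEY` in your Colab secrets, but you then need to copy it into the environment yourself.) We will create an OpenAI API client using this key.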





In [3]:
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

dotenv.load_dotenv()

# client for OpenAI API
from openai import OpenAI # for calling the OpenAI API
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
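A small check can make a missing key show up with an explicit message. This is a minimal sketch, assuming the key lives in an environment variable; `require_api_key` is a hypothetical helper and not part of the `openai` package.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key found under `name`, or raise a clear error."""
    # `env` defaults to the process environment; passing a dict makes testing easy.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return key
```

It could be called right after `dotenv.load_dotenv()`, e.g. `client = OpenAI(api_key=require_api_key())`.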

## Prompting without custom data

Before we try anything fancy, let's simply ask `gpt-3.5-turbo` a question on the 2020 Summer Olympics and see how it responds.

First, we prepare the prompt.

In [4]:
query = 'Which athlete won the gold medal in the high jump at the 2020 Summer Olympics?'

Next, we make the request to the model, using the OpenAI API. [Documentation](https://platform.openai.com/docs/api-reference/completions/create?lang=python).


In [5]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Mutaz Essa Barshim of Qatar and Gianmarco Tamberi of Italy both won the gold medal in the men's high jump at the 2020 Summer Olympics. They decided to share the gold medal rather than participate in a jump-off.


We can check that this answer is in fact correct [here](https://en.wikipedia.org/wiki/Athletics_at_the_2020_Summer_Olympics_%E2%80%93_Men%27s_high_jump). Impressive. But now let's change the query around and ask something about the 2022 Winter Olympics.

In [12]:
query = 'Which athletes won the gold medal in curling at the 2022 Winter Olympics?'

response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The gold medal in curling at the 2022 Winter Olympics was won by the Swedish men's team and the South Korean women's team.


If we fact-check this, it turns out that ....
<br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br>













... Sweden did win the men's gold and the South Korean team did participate, but **Great Britain won the women's gold**.


<br>

<br>



Sounds like `gpt-3.5-turbo` could use some help. 😆


### "Engineering" the prompt to reduce hallucinations



One simple thing we can try right off the bat is to tell `gpt-3.5-turbo` to say "I don't know" when it doesn't know, rather than make stuff up (i.e., "hallucinate").


How? By asking nicely? 😀 Well, almost.



By asking **explicitly!**

Let's modify our prompt as follows.


In [36]:
query = "Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"

Note the explicit extra instruction in the system message of the request below: *as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"*

In [37]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': '''You answer questions about the 2022 Winter Olympics. Answer the question as truthfully as possible, \
and if you're unsure of the answer, say "Sorry, I don't know".'''},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Sorry, I don't know.


Wow, it worked. The model is being humble and honest 👀.

It is an interesting question why `gpt-3.5-turbo` knew the high jump answer but not this one. Let's check the [cutoff date](https://platform.openai.com/docs/models/gpt-3-5-turbo) for the training data.
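The cells above repeat the same request pattern with a different system message and question each time. Here is a minimal sketch of a helper that wraps it. `ask` is a hypothetical name, and `client` is assumed to be the `OpenAI` client created during setup.

```python
def ask(client, question, system_message, model="gpt-3.5-turbo", temperature=0):
    """Send one system + user message pair and return the reply text."""
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model=model,
        temperature=temperature,  # 0 keeps answers as repeatable as possible
    )
    return response.choices[0].message.content
```

With it, the curling request becomes `print(ask(client, query, 'You answer questions about the Winter Olympics.'))`.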

## Using custom data

To help the model answer a question, we can provide relevant custom data **in the prompt itself**. This extra information we provide in the prompt is referred to as **context**.
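As a rough illustration of what "providing context in the prompt" can look like in code, the sketch below places the context in a delimited block ahead of the question. `build_prompt` is a hypothetical helper, and the instruction wording is only one reasonable choice.

```python
FENCE = "`" * 3  # delimiter that visually separates the context from the question


def build_prompt(context: str, question: str) -> str:
    """Return a prompt with the context in a fenced block, followed by the question."""
    return (
        "Use the below article to answer the subsequent question.\n\n"
        "Article:\n"
        f"{FENCE}\n"
        f"{context.strip()}\n"
        f"{FENCE}\n\n"
        f"Question: {question}"
    )


print(build_prompt("Sweden won the men's event.", "Who won the men's event?"))
```

Keeping the context inside clear delimiters makes it easier for the model (and for us) to tell the reference text apart from the question.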



### Manually enriching the prompt with custom data

We will first show how to do this by ***manually*** finding and adding information (that's relevant to the question) to the prompt.

First, we will use the Wikipedia article for the 2022 Winter Olympics curling event as context.

Second, we will **explicitly tell the model to make use of the provided context**.

There's a deeper lesson here: **telling LLMs explicitly what you want them to do often helps.**

In [6]:
# text copied and pasted from: https://en.wikipedia.org/wiki/Curling_at_the_2022_Winter_Olympics
# Only the portion of the article up until the medalists is included.

wikipedia_article_on_curling = """Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
2022 Winter Olympics
Curling pictogram.svg
Qualification
Statistics
Tournament
Men
Women
Mixed doubles
vte
The curling competitions of the 2022 Winter Olympics were held at the \
Beijing National Aquatics Centre, one of the Olympic Green venues. Curling \
competitions were scheduled for every day of the games, from February 2 to \
February 20.[1] This was the eighth time that curling was part of the Olympic \
program.

In each of the men's, women's, and mixed doubles competitions, 10 nations \
competed. The mixed doubles competition was expanded for its second appearance \
in the Olympics.[2] A total of 120 quota spots (60 per sex) were distributed to \
the sport of curling, an increase of four from the 2018 Winter Olympics.[3] A \
total of 3 events were contested, one for men, one for women, and one mixed.[4]

Qualification
Main article: Curling at the 2022 Winter Olympics – Qualification
Qualification to the Men's and Women's curling tournaments at the Winter \
Olympics was determined through two methods (in addition to the host nation).\
 Nations qualified teams by placing in the top six at the 2021 World Curling \
 Championships. Teams could also qualify through Olympic qualification events \
 which were held in 2021. Six nations qualified via World Championship \
 qualification placement, while three nations qualified through qualification \
 events. In men's and women's play, a host will be selected for the Olympic \
 Qualification Event (OQE). They would be joined by the teams which competed \
 at the 2021 World Championships but did not qualify for the Olympics, and \
 two qualifiers from the Pre-Olympic Qualification Event (Pre-OQE). The \
 Pre-OQE was open to all member associations.[5]

For the mixed doubles competition in 2022, the tournament field was expanded \
from eight competitor nations to ten.[2] The top seven ranked teams at the \
2021 World Mixed Doubles Curling Championship qualified, along with two teams \
from the Olympic Qualification Event (OQE) – Mixed Doubles. This OQE was open \
to a nominated host and the fifteen nations with the highest qualification \
points not already qualified to the Olympics. As the host nation, China \
qualified teams automatically, thus making a total of ten teams per event \
in the curling tournaments.[6]

Summary
Nations	Men	Women	Mixed doubles	Athletes
 Australia			Yes	2
 Canada	Yes	Yes	Yes	12
 China	Yes	Yes	Yes	12
 Czech Republic			Yes	2
 Denmark	Yes	Yes		10
 Great Britain	Yes	Yes	Yes	10
 Italy	Yes		Yes	6
 Japan		Yes		5
 Norway	Yes		Yes	6
 ROC	Yes	Yes		10
 South Korea		Yes		5
 Sweden	Yes	Yes	Yes	11
 Switzerland	Yes	Yes	Yes	12
 United States	Yes	Yes	Yes	11
Total: 14 NOCs	10	10	10	114
Competition schedule

The Beijing National Aquatics Centre served as the venue of the curling \
competitions.
Curling competitions started two days before the Opening Ceremony and finished \
on the last day of the games, meaning the sport was the only one to have had a \
competition every day of the games. The following was the competition schedule \
for the curling competitions:

RR	Round robin	SF	Semifinals	B	3rd place play-off	F	Final
Date
Event
Wed 2	Thu 3	Fri 4	Sat 5	Sun 6	Mon 7	Tue 8	Wed 9	Thu 10	Fri 11	Sat 12	Sun 13	\
Mon 14	Tue 15	Wed 16	Thu 17	Fri 18	Sat 19	Sun 20
Men's tournament								RR	RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Women's tournament									RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Mixed doubles	RR	RR	RR	RR	RR	RR	SF	B	F
Medal summary
Medal table
Rank	Nation	Gold	Silver	Bronze	Total
1	 Great Britain	1	1	0	2
2	 Sweden	1	0	2	3
3	 Italy	1	0	0	1
4	 Japan	0	1	0	1
 Norway	0	1	0	1
6	 Canada	0	0	1	1
Totals (6 entries)	3	3	3	9
Medalists
Event	Gold	Silver	Bronze
Men
details	 Sweden
Niklas Edin
Oskar Eriksson
Rasmus Wranå
Christoffer Sundgren
Daniel Magnusson	 Great Britain
Bruce Mouat
Grant Hardie
Bobby Lammie
Hammy McMillan Jr.
Ross Whyte	 Canada
Brad Gushue
Mark Nichols
Brett Gallant
Geoff Walker
Marc Kennedy
Women
details	 Great Britain
Eve Muirhead
Vicky Wright
Jennifer Dodds
Hailey Duff
Mili Smith	 Japan
Satsuki Fujisawa
Chinami Yoshida
Yumi Suzuki
Yurika Yoshida
Kotomi Ishizaki	 Sweden
Anna Hasselborg
Sara McManus
Agnes Knochenhauer
Sofia Mabergs
Johanna Heldin
Mixed doubles
details	 Italy
Stefania Constantini
Amos Mosaner	 Norway
Kristin Skaslien
Magnus Nedregotten	 Sweden
Almida de Val
Oskar Eriksson
"""

In [7]:
query = f"""Use the below article on the 2022 Winter Olympics to answer the \
subsequent question.

Article:
```
{wikipedia_article_on_curling}
```

Question: Which teams won the gold medal in curling at the 2022 Winter Olympics?"""

print(query)

Use the below article on the 2022 Winter Olympics to answer the subsequent question.

Article:
```
Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
202

Take a moment to notice what the prompt has grown to.
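One way to quantify that growth is to count tokens with `tiktoken`, which we installed during setup. This is a minimal sketch, assuming the `cl100k_base` encoding that `tiktoken` associates with `gpt-3.5-turbo`. `count_tokens` is a hypothetical helper, and its fallback of roughly four characters per token is only a rule of thumb for English text.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken, or estimate them if tiktoken cannot be used."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Assumption: about 4 characters per token, used only when tiktoken
        # is not installed or its encoding files cannot be loaded.
        return (len(text) + 3) // 4


question = "Which teams won the gold medal in curling at the 2022 Winter Olympics?"
print(count_tokens(question))
```

Running `count_tokens(query)` on the full prompt above shows how much of the context window the pasted article consumes compared with the bare question.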


OK, let's run it.

In [8]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': '''You answer questions about the Olympics. Answer the question as truthfully as possible, and if \
you're unsure of the answer, say "Sorry, I don't know".'''},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The teams that won the gold medal in curling at the 2022 Winter Olympics were:

- Men's Curling: Sweden
- Women's Curling: Great Britain
- Mixed Doubles Curling: Italy


Nicely done, `gpt-3.5-turbo`!

---

But maybe it wasn't super hard since the answer is literally in the context we provided.


Let's make it a bit harder.


I noticed that Oskar Eriksson actually won two medals in the event...which tempts me to ask whether any athlete won multiple medals.

Let's try it.

In [42]:
query = f"""Use the below article on the 2022 Winter Olympics to answer \
the subsequent question. Answer the question as truthfully as possible, \
and if you're unsure of the answer, say "Sorry, I don't know"

Article:
```
{wikipedia_article_on_curling}
```

Question: Did any athlete win multiple medals in curling at the 2022 Winter \
Olympics?"""

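Notice that the question has changed, and that the "say you don't know" instruction now sits in the user prompt instead of the system message. The article is unchanged.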

In [43]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Yes, Oskar Eriksson from Sweden won multiple medals in curling at the 2022 Winter Olympics. He won a gold medal in the men's event and a bronze medal in the mixed doubles event.


WHOAH!!!!

👏 👍

Google cannot do this. In fact, poor Oskar doesn't show up anywhere on the results page summary.



---




### RETRIEVAL AUGMENTED GENERATION: *Automatically* enriching the prompt with custom data





**Manually** adding extra information into the prompt obviously doesn't scale. So, we will now show how to **automatically** enrich the prompt with custom relevant data.

First thing to note: we typically can't just include **all** the custom data in the prompt, for a couple of reasons.

Every model has a limit (called the **context window**) on how many tokens you can send in and get out. For `gpt-3.5-turbo`, the context window is 16,385 tokens ([link](https://platform.openai.com/docs/models/gpt-3-5-turbo)).

Note that the context window includes both the prompt and the response - **together**, they can't exceed 16,385 tokens. We will get deeper into this a bit later but for now, understand this is one key reason we can't include ALL data in the prompt. Another reason is expense. OpenAI charges by the token and these charges can easily add up.

(BTW, GPT-4's context window is way bigger - it ranges up to 128K tokens, depending on the particular GPT-4 model)

If we can't include all the custom data, the logical thing to do is to only include data that's **relevant** to the question.

How can we measure the relevance between a question and a piece of (our custom) data?

Using pretrained contextual embeddings!



---




This is our overall process.



**RETRIEVAL AUGMENTED GENERATION**

**One-time setup**
* Preprocess the custom dataset by splitting it into 'sections'
* We calculate an embedding vector for each section using the `text-embedding-ada-002` model and store it somewhere handy


**Each time we receive a question, we do this:**
* We calculate an embedding vector for the question (again using the same `text-embedding-ada-002` model)
* For each section in our custom dataset, we calculate the *cosine similarity* (the dot product of the two vectors after each is scaled to unit length) between that section's embedding vector and the question's embedding vector
* We rank the sections from most-cosine-similar to the question to least-cosine-similar
* Starting from the most-cosine-similar section, include as many sections into the prompt as can fit into the context window
* Send the prompt into `gpt-3.5-turbo`.
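The per-question steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: the two-dimensional vectors stand in for real embeddings, the word-count lambda stands in for a real token counter, and `cosine_similarity`, `rank_sections`, and `pack_sections` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sections(question_embedding, sections):
    """sections is a list of (text, embedding); return (text, score), most similar first."""
    scored = [(text, cosine_similarity(question_embedding, emb)) for text, emb in sections]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def pack_sections(ranked, token_budget, count_tokens):
    """Take sections in ranked order until the next one would exceed the budget."""
    chosen, used = [], 0
    for text, _score in ranked:
        cost = count_tokens(text)
        if used + cost > token_budget:
            break
        chosen.append(text)
        used += cost
    return chosen


# Toy data: made-up 2-D "embeddings" for three sections and one question.
sections = [
    ("curling results", [1.0, 0.0]),
    ("ski jumping", [0.0, 1.0]),
    ("curling teams", [0.8, 0.6]),
]
ranked = rank_sections([1.0, 0.1], sections)
selected = pack_sections(ranked, token_budget=4, count_tokens=lambda t: len(t.split()))
print(selected)
```

In a real run, `token_budget` would be the context window minus whatever is reserved for the system message, the question, and the response.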

![RAG](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4063347e-8920-40c6-86b3-c520084b303c_1272x998.jpeg)

Credit for 👆image: https://magazine.sebastianraschka.com/p/finetuning-large-language-models

#### One-time setup

We first need to break up the custom dataset into "sections".

Sections should be large enough to contain enough information to answer a question, but small enough to fit one or several into the `gpt-3.5-turbo` prompt.

Approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already organized under section headers, so we will use these to define our sections. This preprocessing (for a related dataset) has already been done in [this notebook](https://github.com/openai/openai-cookbook/blob/main/examples/fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them.
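To make header-based splitting concrete, here is a minimal sketch that cuts wiki markup at `==Heading==` lines. It is not the method used in the linked notebook; `split_into_sections`, the `HEADING` pattern, and the `"Summary"` label for the lead text are assumptions for illustration.

```python
import re

# A heading line: two or more '=' signs, the title, then the same run of '=' signs.
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_into_sections(wikitext):
    """Return a list of (heading, body) pairs, skipping headings with no body text."""
    sections = []
    heading = "Summary"  # hypothetical label for text before the first heading
    start = 0
    for match in HEADING.finditer(wikitext):
        body = wikitext[start:match.start()].strip()
        if body:
            sections.append((heading, body))
        heading = match.group(2)
        start = match.end()
    tail = wikitext[start:].strip()
    if tail:
        sections.append((heading, tail))
    return sections


sample = (
    "Lead paragraph.\n\n"
    "==History==\nBid details.\n\n"
    "==Venues==\n\n===City zone===\nOlympic Park.\n"
)
for heading, body in split_into_sections(sample):
    print(heading, "|", body)
```

A production splitter would likely also cap section length and drop sections such as reference lists, which this sketch does not attempt.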

### Creating embeddings with the `text-embedding-3-small` model

In [23]:
from openai import OpenAI
import numpy as np

client = OpenAI()

response = client.embeddings.create(
    input="Which teams won the gold medal in curling at the 2022 Winter Olympics?",
    model="text-embedding-3-small"
)
embedding_1 = response.data[0].embedding

response = client.embeddings.create(
    input="In Winter Olympics 2022 who won curling across all formats?",
    model="text-embedding-3-small"
)
embedding_2 = response.data[0].embedding

# Converting embeddings to numpy arrays
embedding_1 = np.array(embedding_1)
embedding_2 = np.array(embedding_2)

# Testing Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(embedding_1.reshape(1,-1), embedding_2.reshape(1,-1))

array([[0.84427697]])

In [24]:
# OpenAI has hosted the processed dataset, so we can download it directly without having to recreate it.
# This dataset has already been split into 'chunks' (apparently one row for each section of the Wikipedia page)
# and a contextual embedding for each chunk has been computed.
# This file is ~200 MB, so may take a minute depending on your connection speed

embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

df = pd.read_csv(embeddings_path)

import ast  # for converting embeddings saved as strings to arrays
df['embedding'] = df['embedding'].apply(ast.literal_eval)
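The CSV stores each embedding as text, which is why the cell above converts the column. Here is a minimal illustration with a made-up three-element vector (the real embeddings in this dataset have 1,536 elements).

```python
import ast

raw_cell = "[0.1, -0.2, 0.3]"  # made-up value, shaped like one cell of the CSV

# literal_eval parses Python literals only, so it cannot execute arbitrary code.
vector = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__)
print(vector)
```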

In [None]:
df.shape

In [44]:
df.head()

Unnamed: 0,text,embedding
0,"Lviv bid for the 2022 Winter Olympics\n\n{{Olympic bid|2022|Winter|\n| Paralympics = yes\n| logo = Lviv 2022 Winter Olympics bid.svg\n| logo-size = 220px\n| fullname = [[Lviv]], [[Ukraine]]\n| chair = \n| committee = [[National Olympic Committee of Ukraine]] (UKR)\n| history = None\n}}\n\n'''Lviv 2022''' ({{lang-uk|Львів 2022}}; {{lang-pl|Lwów 2022}}; {{lang-ru|Львов 2022}}; {{lang-de|Lemberg 2022}}) was a bid by the city of [[Lviv]] and the [[National Olympic Committee of Ukraine]] for the ...","[-0.005021067801862955, 0.00026050032465718687, -0.0046091326512396336, 0.016684994101524353, -0.029633380472660065, 0.03277317062020302, -0.016217919066548347, -0.01712612248957157, -0.0022461817134171724, -0.02706446312367916, 0.030411841347813606, 0.01922796480357647, -0.00896526500582695, -0.027972666546702385, -0.02758343517780304, -0.027479641139507294, 0.013921461068093777, 0.00766783207654953, 0.00015092799731064588, -0.01711314730346203, -0.003931223414838314, -0.006850448902696371,..."
1,"Lviv bid for the 2022 Winter Olympics\n\n==History==\n\n[[Image:Lwów - Rynek 01.JPG|thumb|right|200px|View of Rynok Square in Lviv]]\n\nOn 27 May 2010, [[President of Ukraine]] [[Viktor Yanukovych]] stated during a visit to [[Lviv]] that Ukraine ""will start working on the official nomination of our country as the holder of the Winter Olympic Games in [[Carpathian Mountains|Carpathians]]"".\n\nIn September 2012, [[government of Ukraine]] approved a document about the technical-economic substan...","[0.0033927420154213905, -0.007447326090186834, 0.008918799459934235, 0.005885893478989601, -0.03870810195803642, 0.016732387244701385, -0.019855251535773277, -0.022168485447764397, 1.1226058632018976e-05, -0.033387668430805206, 0.027887312695384026, 0.0068176123313605785, -0.008905948139727116, -0.031742699444293976, -0.025831105187535286, -0.02021508850157261, 0.010801514610648155, -0.001778298057615757, 0.0004293136007618159, -0.023839153349399567, -0.011206329800188541, -0.003370252437889..."
2,"Lviv bid for the 2022 Winter Olympics\n\n==Venues==\n\n{{Location map+\n|Ukraine\n|border =\n|caption = Venue areas\n|float = left\n|width = 350\n|places =\n{{location map~ |Ukraine |lat=49.85 |long=24.01 |label='''[[Lviv]]''' |position=top}}\n{{location map~ |Ukraine |lat=48.9828 |long=23.2774 |label='''[[Tysovets, Skole Raion|Tysovets]]''' |position=right}}\n{{location map~ |Ukraine |lat=48.711111 |long=23.188333 |label='''[[Volovets|Borzhava]]''' |position=bottom}} }}\nThe German companie...","[-0.00915789045393467, -0.008366798982024193, -0.004565536975860596, 3.061394454562105e-05, -0.028935180976986885, 0.036282945424318314, -0.015044148080050945, -0.017202889546751976, -0.00545719126239419, -0.03960821405053139, 0.024510430172085762, 0.006466167978942394, -0.00027947992202825844, -0.011149028316140175, -0.011102098971605301, -0.02180194854736328, -0.00016446156951133162, -0.0009561816696077585, -0.00996239110827446, -0.02874746359884739, -0.015553665347397327, -0.0019559403881..."
3,"Lviv bid for the 2022 Winter Olympics\n\n==Venues==\n\n===City zone===\n\nThe main Olympic Park would be centered around the [[Arena Lviv]], hosting the opening and closing ceremonies. The Olympic Park would have two ice rinks (ice hockey, short track speed skating and figure skating), a temporary speed skating oval and a temporary curling rink. The Olympic Park would also host the Olympic Village and International Broadcast Centre. A second ice rink for hockey competitions would be locat...","[0.0030951891094446182, -0.006064314860850573, 0.013017709366977215, 0.016268819570541382, -0.03091871738433838, 0.02446957677602768, -0.017648881301283836, -0.02371319569647312, -0.005088982172310352, -0.016083041206002235, 0.034766968339681625, 0.013973137363791466, 0.0015517413849011064, -0.004133553709834814, -0.025743480771780014, -0.011723900213837624, -0.006827330682426691, 0.003271014429628849, -0.008585583418607712, -0.03213954344391823, -0.02036919817328453, -0.012553264386951923, ..."
4,"Lviv bid for the 2022 Winter Olympics\n\n==Venues==\n\n===Mountain zone===\n\n====Venue cluster Tysovets-Panasivka====\n\nAn existing military ski training facility in [[Tysovets, Skole Raion|Tysovets]], 139&nbsp;km south of Lviv, along with two disused ski jumps, are proposed to be rebuilt to host all Nordic events. Additionally, a ski hill would be developed to host all of the snowboard and freestyle skiing events.<ref name=concept_study />\n*Tysovets Nordic Arena - biathlon, cross countr...","[-0.002936174161732197, -0.006185177247971296, 0.005705732852220535, 0.003036483423784375, -0.023625405505299568, 0.03688664361834526, -0.0189057644456625, -0.034465618431568146, -0.004981465172022581, -0.035145681351423264, 0.024890324100852013, 0.012860001064836979, 0.001692507998086512, -0.01909618265926838, -0.012030323036015034, -0.021558012813329697, 0.00392396654933691, -0.004321803338825703, -0.009337271563708782, -0.02959636226296425, -0.009622897952795029, -0.012757991440594196, 0...."


Let's print out 5 randomly chosen chunks.

In [22]:
pd.set_option('display.max_colwidth', 500)
df[['text']].sample(5)

Unnamed: 0,text
1903,"Ronja Savolainen\n\n==Career statistics==\n\n===International===\n\n{| border=""0"" cellpadding=""1"" cellspacing=""0"" style=""text-align:center; width:40em""\n|- ALIGN=""centre"" bgcolor=""#e0e0e0""\n! Year\n! Team\n! Event\n! Result\n! rowspan=""99"" bgcolor=""#ffffff"" | &nbsp;\n! GP\n! G\n! A\n! Pts\n! PIM\n|-\n| [[2014 IIHF World Women's U18 Championship|2014]]\n| [[Finland women's national under-18 ice hockey team|Finland U18]]\n| [[IIHF World Women's U18 Championship|WW18]]\n| 5th\n| 5\n| 0\n| 1\n| ..."
1846,"Artyom Minulin\n\n==International play==\n\n{{MedalTableTop|name = }}\n{{MedalSport|[[Ice hockey]]}}\n{{MedalCountry | {{flagIOC|ROC|2022 Winter}} }}\n{{MedalOlympic}}\n{{MedalSilver|[[Ice hockey at the 2022 Winter Olympics – Men's tournament|2022 Beijing]]|}}\n{{MedalBottom}}\nMinulin represented [[Russia men's national junior ice hockey team|Russia]] at the [[2018 World Junior Ice Hockey Championships]].\n\nOn 23 January 2022, Minulin was named to the roster to represent [[Russian Olympic ..."
2729,"Anna Hasselborg\n\n==Grand Slam record==\n\nHasselborg and her rink became the first women's team to win a career ""Grand Slam"" (winning all four 'majors') when she won the [[2022 Players' Championship]]. \n\n{{Curling GS key}}\n{{clear}}\n{| class=""wikitable"" border=""1""\n|-\n! Event\n! [[2010–11 curling season|2010–11]]\n! [[2011–12 curling season|2011–12]]\n! [[2012–13 curling season|2012–13]]\n! [[2013–14 curling season|2013–14]]\n! [[2014–15 curling season|2014–15]]\n! [[2015–16 curling s..."
6022,"Israel at the 2022 Winter Olympics\n\n==Competitors==\n\nThe Israeli delegation included 6 athletes, competing in 3 disciplines.\n\n{|class=""wikitable sortable"" style=""text-align:center;""\n|-\n! width=180|Sport\n! width=55|Men\n! width=55|Women\n! width=55|Total\n|-\n|align=left|[[Alpine skiing at the 2022 Winter Olympics|Alpine skiing]]\n|1 ||1 ||2\n|-\n|align=left|[[Figure skating at the 2022 Winter Olympics|Figure skating]]\n|2 ||1 ||3\n|-\n|align=left|[[Short track speed skating at the 2..."
1020,"Vetle Sjåstad Christiansen\n\n==Career==\n\n===Leaving the national team (2015–16 World Cup season)===\n\nDue to his illness, he left the national biathlon team ahead of the [[2015–16 Biathlon World Cup|2015–16 season]] and joined a private biathlon team called Team Norgesbakeriet. The team was composed of [[Vetle Ravnsborg Gurigard]], [[Kristoffer Langøien Skjelvik]], [[Håkon Svaland]], [[Martin Eng]], [[Tore Leren]] and Christiansen, with [[Knut Tore Berland]] as coach. He did not particip..."


Next, we define a function to calculate the embedding using the `text-embedding-ada-002` model, given a piece of text. The API call is simple (see below). [Link](https://openai.com/blog/new-and-improved-embedding-model).

In [27]:
def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    result = client.embeddings.create(
        model=model,  # which embedding model we want to use
        input=text,   # the text for which we want to calculate the embedding
    )
    return result.data[0].embedding

Let's try it on a sample sentence. 😃

In [51]:
e = get_embedding("Narendra Modi is addressing the media")

Let's look at the embedding vector.

In [52]:
e

[-0.0163012333214283,
 -0.013816753402352333,
 0.00012527374201454222,
 0.011338611133396626,
 -0.006170004140585661,
 0.02956024743616581,
 -0.012511133216321468,
 -0.008670330047607422,
 -0.011731564067304134,
 -0.00823301076889038,
 0.058258529752492905,
 0.008816102519631386,
 -0.015794197097420692,
 0.0019774436950683594,
 -0.011947055347263813,
 -0.017961779609322548,
 0.025174377486109734,
 -0.011909027583897114,
 0.04436572268605232,
 -0.018963176757097244,
 0.014323790557682514,
 0.016985733062028885,
 0.014589984901249409,
 0.028064487501978874,
 -0.011864661239087582,
 -0.008530894294381142,
 0.011325934901833534,
 -0.00913300085812807,
 -0.01859557442367077,
 -0.010559041984379292,
 -0.006914714816957712,
 -0.009462574496865273,
 0.012555499561131,
 -0.005659798625856638,
 -0.008505542762577534,
 -0.016884326934814453,
 -0.01873501017689705,
 -0.00501966429874301,
 5.0357073632767424e-05,
 -0.0009158352622762322,
 0.030041931197047234,
 -0.008030195720493793,
 0.00540944887

In [53]:
f = get_embedding("The Prime Minister is speaking to the press")

In [54]:
len(f), len(e)

(1536, 1536)

Let's calculate the cosine similarity. The `scipy.spatial.distance.cosine` function is handy here: it returns the cosine *distance*, so we subtract it from 1 to get the similarity.

In [29]:
from scipy import spatial  # for calculating cosine similarities for search

1-spatial.distance.cosine(e, f)

Given a dataframe like `df` with a column of text chunks, we can use the `get_embedding` function to calculate the embeddings for all the text chunks in the column.

In [None]:
def compute_doc_embeddings(df: pd.DataFrame) -> dict[int, list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps each row index to the embedding vector of that row's text.
    """
    return {
        idx: get_embedding(r["text"]) for idx, r in df.iterrows()
    }
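Calling the API once per row is the slow part of this function. The embeddings endpoint also accepts a list of strings, so one option is to send several texts per request. Below is a minimal sketch of that idea. `batched` and `embed_in_batches` are hypothetical helpers, `client` is assumed to be the `OpenAI` client from setup, and the batch size of 100 is an arbitrary illustrative value.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(client, texts, model="text-embedding-ada-002", batch_size=100):
    """Return one embedding per text, in input order, using one request per batch."""
    embeddings = []
    for batch in batched(texts, batch_size):
        result = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input; sort to be safe.
        ordered = sorted(result.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

For the dataframe above this would be called as `embed_in_batches(client, df["text"].tolist())`, with the batch size chosen so that each request stays within the API's input limits.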

To calculate the embeddings from scratch, uncomment the below line and run. Warning - it will take some time!

In [None]:
#document_embeddings = compute_doc_embeddings(df)

But happily for us, OpenAI has calculated the embeddings for us so we don't have to; in fact, these are already available in the embedding column of the dataframe `df` we downloaded.

In [56]:
df.sample(5)

Unnamed: 0,text,embedding
154,"Johan Clarey\n\n==World Cup results==\n\n===Season standings===\n\n{|class=""wikitable"" style=""text-align:center; font-size:100%;""\n!Season !! Age !! Overall !! &nbsp;Slalom&nbsp; !! Giant<br>&nbsp;Slalom&nbsp; !! Super-G !! Downhill !!Combined\n|-\n| [[2003–04 FIS Alpine Ski World Cup|2004]] ||''23''|| 140 || — || — || — || 54 || —\n|-\n| [[2004–05 FIS Alpine Ski World Cup|2005]] ||''24''|| colspan=6 rowspan=2 |\n|-\n| [[2005–06 FIS Alpine Ski World Cup|2006]] ||''25''\n|-\n| [[2006–07 FIS A...","[-0.0020646711345762014, 0.010536017827689648, 0.03841996192932129, -0.005110605154186487, -0.031614750623703, 0.023938797414302826, -0.019317150115966797, -0.0016468807589262724, -0.023831628262996674, -0.05066398158669472, 0.011889021843671799, 0.029417794197797775, -0.02112562023103237, 0.01813829503953457, -0.013208536431193352, 0.001173831638880074, 0.014226638711988926, -0.023376161232590675, -0.0031229613814502954, -0.005381875671446323, 9.487146598985419e-05, 0.015860959887504578, -0..."
1186,Karl Geiger\n\n==World Cup==\n\n===Individual starts===\n\n| –\n| –\n|20\n|q\n|31\n|36\n| –\n| –\n| –\n| –\n!\n!\n!\n|-\n|colspan=33|\n|-align=center\n| rowspan=2 width=66px|[[2014–15 FIS Ski Jumping World Cup|2014/15]]\n! [[File:Flag of Germany.svg|15px|link=Germany|Klingenthal]]\n! [[File:Flag of Finland (bordered).svg|15px|link=Finland|Kuusamo]]\n! [[File:Flag of Finland (bordered).svg|15px|link=Finland|Kuusamo]]\n! [[File:Flag of Norway.svg|15px|link=Norway|Lillehammer]]\n! [[File:Flag o...,"[-0.0034302170388400555, 0.021217146888375282, 0.026759039610624313, -0.030226068571209908, -0.022328203544020653, 0.02525978349149227, -0.014805152080953121, -0.010655425488948822, -0.0012608139077201486, -0.05097470059990883, 0.001322725205682218, 0.022930582985281944, -0.013647244311869144, 0.00175192067399621, 0.0021401208359748125, -0.01175309531390667, 0.01780366338789463, -0.003955625928938389, 0.007449427619576454, -0.01801784336566925, -0.0006065627676434815, 0.03552700951695442, 0...."
3470,Fabien Claude\n\n==Career==\n\nHis first year in biathlon was 2006. He made his international debut in 2011. He won the Junior Men's pursuit and placed 2nd with the French relay team at the [[Biathlon Junior World Championships 2014]]. He was part of the French relay team that achieved 3rd place in the relay in [[2019-20 Biathlon World Cup - Stage 2|Hochfilzen]] during the 2019–20 season. His first individual podium position was 3rd place in the individual discipline in [[2019-20 Biathlon Wo...,"[-0.012693449854850769, 0.00015936909767333418, 0.013696258887648582, -0.01316186785697937, -0.011618069373071194, 0.046050041913986206, -0.005317526403814554, 0.0035988965537399054, -0.026152202859520912, -0.023090995848178864, 0.014738652855157852, 0.029107850044965744, -0.0069602858275175095, -0.03760533407330513, 0.020016593858599663, -0.01790541782975197, 0.020755507051944733, -0.023790322244167328, -0.010905547067523003, -0.016058137640357018, -0.022550007328391075, 0.01405252050608396..."
543,"Andorra at the 2022 Winter Olympics\n\n==Alpine skiing==\n\n{{main article|Alpine skiing at the 2022 Winter Olympics|Alpine skiing at the 2022 Winter Olympics – Qualification}}\nBy meeting the basic qualification standards, Andorra qualified one male and one female alpine skier.\n\n{|class=wikitable style=font-size:90%;text-align:center\n!rowspan=2|Athlete\n!rowspan=2|Event\n!colspan=2|Run 1\n!colspan=2|Run 2\n!colspan=2|Total\n|-style=""font-size: 95%""\n!Time\n!Rank\n!Time\n!Rank\n!Time\n!Ra...","[-0.009727440774440765, 0.006078820675611496, 0.008545540273189545, -0.0011013919720426202, -0.02865777164697647, 0.05123339965939522, -0.028471853584051132, -0.010823022574186325, -0.007582755759358406, -0.03840513154864311, 0.04310617595911026, 0.013598497025668621, 0.014780398458242416, -0.020517263561487198, -0.029507676139473915, -0.008751376532018185, 0.018299540504813194, 0.005325193051248789, -0.0031655682250857353, -0.0020019272342324257, -0.0007976169581525028, 0.03936127573251724,..."
2145,"2021 Olympic Qualification Event – Curling\n\n==Mixed doubles==\n\n===Playoffs===\n\n====Semifinal 2====\n\n''Thursday, December 9, 9:00''\n{{Curlingbox8\n| sheet = B\n| team1 = {{flagIOC|ROC|2022 Winter}} {{X}}\n|2|0|0|2|1|0|5|X| |10\n| team2 = {{flagIOC|FIN|2022 Winter}}\n|0|3|1|0|0|1|0|X| |5\n}}\n{| class=""wikitable""\n!colspan=4 width=400|Player percentages\n|-\n!colspan=2 width=200 style=""white-space:nowrap;""| {{flagIOC|ROC|2022 Winter}}\n!colspan=2 width=200 style=""white-space:nowrap;""|...","[-0.01521895918995142, -0.005543465260416269, -0.0024529662914574146, -0.016759946942329407, -0.036956433206796646, 0.039902038872241974, -0.010261887684464455, -0.003920654766261578, -0.031501609832048416, -0.02591041475534439, 0.03774738311767578, 0.014864396303892136, -0.012723377905786037, -0.01138012669980526, -0.013841615989804268, -0.0009562988416291773, 0.006300321780145168, -0.010493718087673187, 0.02015557512640953, -0.008720899932086468, -0.004401361104100943, 0.011271030642092228..."


So we have a custom dataset split into sections, with an embedding vector calculated for each section. We also have a function that can calculate the embedding for any question.

Next we will use these embeddings to answer our users' questions.



#### Each time we receive a question

* We calculate an embedding vector for the question with the `get_embedding` function we defined above.
* For each chunk in our custom dataset, we calculate the cosine similarity between that chunk's embedding vector and the question's embedding vector.
* We rank the chunks from most to least cosine-similar to the question.
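
To make the ranking step concrete, here is a minimal, self-contained sketch of cosine similarity on toy vectors. The three-dimensional "embeddings" and the chunk names are made up for illustration; real `text-embedding-ada-002` vectors have many more dimensions, and the notebook itself relies on a library routine rather than this hand-written helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example only.
toy_chunks = {
    "curling results": [0.9, 0.1, 0.0],
    "ski jumping results": [0.1, 0.9, 0.0],
    "biathlon career": [0.0, 0.2, 0.9],
}
toy_question = [0.8, 0.2, 0.1]

# Rank chunk names from most to least similar to the question vector.
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(toy_question, toy_chunks[name]),
    reverse=True,
)
print(ranked)
```

Vectors that point in nearly the same direction score close to 1, and unrelated (orthogonal) vectors score close to 0, which is why sorting by this score in descending order puts the most relevant chunks first.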

We first define a couple of helper functions.

In [25]:
from scipy import spatial  # cosine distance, used by the default relatedness_fn

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:

    """Returns a list of strings and relatednesses, sorted from most related to least."""

    query_embedding = get_embedding(query)

    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

Let's try this function and see which documents it ranks as most similar to the query string "curling gold medal".

In [30]:
strings, relatednesses = strings_ranked_by_relatedness("curling gold medal", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.879


'Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medal table===\n\n{{Medals table\n | caption        = \n | host           = \n | flag_template  = flagIOC\n | event          = 2022 Winter\n | team           = \n | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1\n | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0\n | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0\n | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2\n | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0\n | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0\n}}'

relatedness=0.872


"Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Women's tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n''Sunday, 20 February, 9:05''\n{{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}}\n{{Player percentages\n| team1 = {{flagIOC|JPN|2022 Winter}}\n| [[Yurika Yoshida]] | 97%\n| [[Yumi Suzuki]] | 82%\n| [[Chinami Yoshida]] | 64%\n| [[Satsuki Fujisawa]] | 69%\n| teampct1 = 78%\n| team2 = {{flagIOC|GBR|2022 Winter}}\n| [[Hailey Duff]] | 90%\n| [[Jennifer Dodds]] | 89%\n| [[Vicky Wright]] | 89%\n| [[Eve Muirhead]] | 88%\n| teampct2 = 89%\n}}"

relatedness=0.869


'Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Mixed doubles tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n\'\'Tuesday, 8 February, 20:05\'\'\n{{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}}\n{| class="wikitable"\n!colspan=4 width=400|Player percentages\n|-\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|ITA|2022 Winter}}\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|NOR|2022 Winter}}\n|-\n| [[Stefania Constantini]] || 83%\n| [[Kristin Skaslien]] || 70%\n|-\n| [[Amos Mosaner]] || 90%\n| [[Magnus Nedregotten]] || 69%\n|-\n| \'\'\'Total\'\'\' || 87%\n| \'\'\'Total\'\'\' || 69%\n|}'

relatedness=0.868


"Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medalists===\n\n{| {{MedalistTable|type=Event|columns=1}}\n|-\n|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}\n|{{flagIOC|SWE|2022 Winter}}<br>[[Niklas Edin]]<br>[[Oskar Eriksson]]<br>[[Rasmus Wranå]]<br>[[Christoffer Sundgren]]<br>[[Daniel Magnusson (curler)|Daniel Magnusson]]\n|{{flagIOC|GBR|2022 Winter}}<br>[[Bruce Mouat]]<br>[[Grant Hardie]]<br>[[Bobby Lammie]]<br>[[Hammy McMillan Jr.]]<br>[[Ross Whyte]]\n|{{flagIOC|CAN|2022 Winter}}<br>[[Brad Gushue]]<br>[[Mark Nichols (curler)|Mark Nichols]]<br>[[Brett Gallant]]<br>[[Geoff Walker (curler)|Geoff Walker]]<br>[[Marc Kennedy]]\n|-\n|Women<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}}\n|{{flagIOC|GBR|2022 Winter}}<br>[[Eve Muirhead]]<br>[[Vicky Wright]]<br>[[Jennifer Dodds]]<br>[[Hailey Duff]]<br>[[Mili Smith]]\n|{{flagIOC|JPN|2022 Winter}}<br>[[Satsuki Fujisawa]]<br>[[Chinami Yoshida]]<br>[[Yumi Suzuki]]<br>

relatedness=0.867


"Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Men's tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n''Saturday, 19 February, 14:50''\n{{#lst:Curling at the 2022 Winter Olympics – Men's tournament|GM}}\n{{Player percentages\n| team1 = {{flagIOC|GBR|2022 Winter}}\n| [[Hammy McMillan Jr.]] | 95%\n| [[Bobby Lammie]] | 80%\n| [[Grant Hardie]] | 94%\n| [[Bruce Mouat]] | 89%\n| teampct1 = 90%\n| team2 = {{flagIOC|SWE|2022 Winter}}\n| [[Christoffer Sundgren]] | 99%\n| [[Rasmus Wranå]] | 95%\n| [[Oskar Eriksson]] | 93%\n| [[Niklas Edin]] | 87%\n| teampct2 = 94%\n}}"

We can see that what was pulled up were several sections of the Wikipedia page for curling at the 2022 Winter Olympics. Cool.

#### Starting from the most-cosine-similar section, include as many sections into the prompt as can fit into the context window


Once we've found the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. We write a few helper functions to do just this.

In [31]:
HEADER = """
Use the below articles to answer the subsequent question. \
Answer the question as truthfully as possible, and if you're unsure \
of the answer, say "Sorry, I don't know"
"""

Since we don't want to exceed the context window, we will need to count the tokens in our prompt. We use the `tiktoken` package for this.

In [32]:
import tiktoken  # for counting tokens

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

In [33]:
num_tokens("Curling gold medal", GPT_MODEL)

5

In [34]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, top_n=5)
    introduction = HEADER
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        # mark the start of each potentially relevant article
        # with the header 'Wikipedia article section:'
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

`query_message` begins with the `HEADER`, then takes the related sections (at most five, since `top_n=5`) in descending order of similarity to the query and appends them one by one until the next section would exceed the token budget. The question itself goes last. Below we pass in our query about 2022 curling with a token budget of 3700 tokens. We could go higher (how much higher?) but OpenAI charges by the token 😰
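
The packing logic can be tried in isolation, without calling any API. The sketch below mirrors the loop in `query_message` but is only an illustration: `count_words` is a stand-in that counts whitespace-separated words rather than BPE tokens (the notebook uses `num_tokens` for that), and `pack_sections` and the toy strings are hypothetical names made up for this example.

```python
def count_words(text: str) -> int:
    """Stand-in for num_tokens: counts whitespace-separated words, not BPE tokens."""
    return len(text.split())


def pack_sections(header, sections, question, budget, counter=count_words):
    """Greedily append sections (assumed sorted, most related first) within a budget."""
    message = header
    for section in sections:
        candidate = message + "\n\n" + section
        if counter(candidate + question) > budget:
            break  # stop at the first section that does not fit
        message = candidate
    return message + question


toy_sections = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
packed = pack_sections(
    "Use the articles.", toy_sections, "\n\nQuestion: who won?", budget=12
)
print(packed)
```

With a budget of 12 "tokens", the first two toy sections fit and the third is dropped. Note that the loop stops at the first section that overflows, so a shorter section further down the ranking is never considered; that keeps the prompt in strict relevance order at the cost of possibly leaving some budget unused.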



In [35]:
query = query_message("Which athletes won the gold medal in curling at \
the 2022 Winter Olympics?", df, GPT_MODEL, 3700)

print(query)


Use the below articles to answer the subsequent question. Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"


Wikipedia article section:
"""
List of 2022 Winter Olympics medal winners

==Curling==

{{main|Curling at the 2022 Winter Olympics}}
{|{{MedalistTable|type=Event|columns=1|width=225|labelwidth=200}}
|-valign="top"
|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}
|{{flagIOC|SWE|2022 Winter}}<br/>[[Niklas Edin]]<br/>[[Oskar Eriksson]]<br/>[[Rasmus Wranå]]<br/>[[Christoffer Sundgren]]<br/>[[Daniel Magnusson (curler)|Daniel Magnusson]]
|{{flagIOC|GBR|2022 Winter}}<br/>[[Bruce Mouat]]<br/>[[Grant Hardie]]<br/>[[Bobby Lammie]]<br/>[[Hammy McMillan Jr.]]<br/>[[Ross Whyte]]
|{{flagIOC|CAN|2022 Winter}}<br/>[[Brad Gushue]]<br/>[[Mark Nichols (curler)|Mark Nichols]]<br/>[[Brett Gallant]]<br/>[[Geoff Walker (curler)|Geoff Walker]]<br/>[[Marc Kennedy]]
|-valign="top"
|Women<br/>{{DetailsLink|Curling 

We have now obtained the sections that are most relevant to the question and crafted a query. As a final step, let's put it all together to get an answer to the question.


In [36]:
def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)

    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message

#### Send the query into `gpt-3.5-turbo`!

Now that we've retrieved the relevant sections and constructed our prompt, we can finally answer the user's query.

In [40]:
print(ask('Which men athletes won the gold medal in curling at the 2022 Winter Olympics specifically for GBR ?'))

The men athletes who won the gold medal in curling at the 2022 Winter Olympics specifically for Great Britain were Hammy McMillan Jr., Bobby Lammie, Grant Hardie, Bruce Mouat, and Ross Whyte.


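Nice, it names the GBR men's team! Note, though, that the question carried a false premise: the medalists table retrieved earlier lists Sweden as the men's gold medalists and Great Britain as the silver medalists. The model went along with the premise instead of correcting it, so even grounded answers deserve a check against the retrieved context.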

Let's ask a question about an Olympic event that never happened (there were no Winter Olympics in 2016)!

In [69]:
print(ask('Which athletes won the gold medal in curling at the 2016 Winter Olympics?'))

Sorry, I don't know.


Good, it is trying to be humble and say "I don't know".

Let's change the header to "allow it to lie" if it wants 👀 and see if it takes the bait.

In [70]:
HEADER = """
Answer the question."\n\nContext:\n
"""

In [71]:
print(ask('Which athletes won the gold medal in curling at the 2016 Winter Olympics?'))

The athletes who won the gold medal in curling at the 2022 Winter Olympics were:

- Men's tournament: Team from Great Britain consisting of Hammy McMillan Jr., Bobby Lammie, Grant Hardie, and Bruce Mouat.
- Women's tournament: Team from Japan consisting of Yurika Yoshida, Yumi Suzuki, Chinami Yoshida, and Satsuki Fujisawa.


Hmm ... it answered about the 2022 Games instead of the 2016 question we asked. Removing the instructions from the header, namely to answer `as truthfully as possible` and to say "Sorry, I don't know" when unsure, changed its behavior!



## Conclusion
By combining pretrained contextual embeddings and `gpt-3.5-turbo`, we have created a question-answering model using Retrieval-Augmented Generation that can answer questions in natural language using a custom dataset. It also **tries** not to make stuff up and says "I don't know" when it doesn't know the answer! **But this is not guaranteed.**

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more.





---

How you can use this approach to "understand" a dense 56-page legal document:
A fun [example](https://www.youtube.com/watch?v=ih9PBGVVOO4)

---


