# Retrieval Enhanced Generative Question Answering with OpenAI

#### Fixing LLMs that Hallucinate

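In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model, so that it generates an answer backed by real data sources. The code uses the legacy interfaces of both client libraries: `openai.Completion` and `openai.Embedding` from `openai` before version 1.0, and `pinecone.init` from `pinecone-client` before version 3.0. Required installs for this notebook are: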

In [None]:
#!pip install -qU openai pinecone-client datasets


In [1]:
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = "*****************"

For many questions, *state-of-the-art (SOTA)* LLMs are more than capable of answering correctly.

In [3]:
query = "who was the 12th person on the moon and when did they land?"

# now query text-davinci-003 WITHOUT context
res = openai.Completion.create(
    engine='text-davinci-003',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

In [6]:
res['choices'][0]['text']

'\n\nThe 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

However, that isn't always the case. Let's first rewrite the above into a simple function so we're not rewriting this every time.

In [2]:
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

Now let's ask a more specific question about training a specific type of transformer model called a *sentence-transformer*. The ideal answer we'd be looking for is _"Multiple Negatives Ranking (MNR) loss"_.

Don't worry if this is a new term to you. You don't need to know it to follow what we're demonstrating here.

In [None]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)

'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'

One of the common answers I get to this is:

```
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
```

This answer seems pretty convincing, right? Yet it's wrong. MLM is typically used in the pretraining step of a transformer model, but it *cannot* be used to fine-tune a sentence-transformer and has nothing to do with having _"pairs of related sentences"_.

An alternative answer I receive is that a `supervised learning approach` is the most suitable. This is true, but it's not specific and doesn't answer the question.

We have two options to enable our LLM to understand and correctly answer this question:

1. We fine-tune the LLM on text data covering the topic mentioned, likely on articles and papers talking about sentence transformers, semantic search training methods, etc.

2. We use **R**etrieval **A**ugmented **G**eneration (RAG), a technique that adds an information retrieval component to the generation process. This allows us to retrieve relevant information and feed it into the generation model as a *secondary* source of information.

We will demonstrate option **2**.
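As a rough sketch of option 2, the flow is: retrieve contexts for the query, place them in the prompt, then generate. The names `rag_answer`, `retrieve_fn`, and `generate_fn` below are hypothetical placeholders rather than part of any library, and the toy stand-ins exist only so the sketch runs without API calls. The versions built later in this notebook call Pinecone and OpenAI instead.

```python
from typing import Callable, List


def rag_answer(
    query: str,
    retrieve_fn: Callable[[str], List[str]],
    generate_fn: Callable[[str], str],
) -> str:
    """Retrieve contexts for `query`, build a prompt, and generate an answer."""
    contexts = retrieve_fn(query)
    prompt = (
        "Answer the question based on the context below.\n\n"
        "Context:\n"
        + "\n\n---\n\n".join(contexts)
        + f"\n\nQuestion: {query}\nAnswer:"
    )
    return generate_fn(prompt)


# toy stand-ins (assumptions for illustration, no network access needed)
def fake_retrieve(query: str) -> List[str]:
    return ["placeholder context 1", "placeholder context 2"]


def fake_generate(prompt: str) -> str:
    # report the prompt size instead of calling a model
    return f"(model output for a {len(prompt)}-character prompt)"


print(rag_answer("Which loss should I use?", fake_retrieve, fake_generate))
```

Passing the retriever and generator in as arguments keeps the flow itself independent of any particular vector database or model.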

---

## Building a Knowledge Base

With option **2**, the retrieval of relevant information requires an external _"Knowledge Base"_, a place where we can store information and efficiently retrieve it. We can think of this as the external _long-term memory_ of our LLM.

We will need to retrieve information that is semantically related to our queries. To do this we use _"dense vector embeddings"_, which can be thought of as numerical representations of the *meaning* behind our sentences.
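To make "semantically related" concrete, here is a minimal sketch of cosine similarity, the same metric the index is created with later in this notebook. The three-dimensional vectors are made up for illustration. Real embeddings have far more dimensions and come from a model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up toy "embeddings" (assumptions, not model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# vectors for related meanings should point in similar directions
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```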

There are many options for creating these dense vectors, like open source [sentence transformers](https://pinecone.io/learn/nlp/) or OpenAI's [ada-002 model](https://youtu.be/ocxq84ocYi0). We will use OpenAI's offering in this example.

We have already authenticated our OpenAI connection, so to create an embedding we just do:

In [12]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [13]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [14]:
len(res['data'])

2

In [15]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

### Data preparation

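Raw transcript data consists of many small snippets of text. Snippets from the same video need to be merged into more substantial chunks so that each one carries enough information to be useful as a context. The records loaded below already hold multi-sentence chunks (see the first record), so the only preparation here is stripping newline characters.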

In [48]:
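import json

# load the transcript records from the JSON file
with open('Healthcare-ML-caption.json', 'r') as f:
    data = json.load(f)
transcript_data = data['HC-ML']

# strip newline characters from each record's text
# NOTE: replacing "\n" with "" fuses the words on either side of a line
# break (e.g. "welcometo" in the output below); replacing it with " "
# would keep them separated
for record in transcript_data:
    record['text'] = record['text'].replace("\n", "")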

In [49]:
transcript_data[0]

{'text': "  [CLICK] DAVID SONTAG: So welcometo spring 2019 Machine Learning for Healthcare. My name is David Sontag. I'm a professor incomputer science. Also I'm in the Institutefor Medical Engineering and Science. My co-instructor todaywill be Pete Szolovits, who I'll introduce more towardsthe end of today's lecture, along with the restof the course staff. So the problem. The problem is that healthcarein the United States costs too much. Currently, we're spending$3 trillion a year, and we're not even necessarilydoing a very good job. Patients who havechronic disease often find that these chronicdiseases are diagnosed late. They're often not managed well.",
 'id': 'vof7x8r_ZUA_0',
 'title': '1. What Makes Healthcare Unique?',
 'MIT OpenCourseWare': 'MIT OpenCourseWare',
 '2020-10-22T19:38:19Z': '2020-10-22T19:38:19Z'}

In [11]:
len(data['HC-ML'])

1879
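Merging small snippets into larger, overlapping chunks could look like the sketch below. The record layout (`id` and `text` keys), the `merge_snippets` name, and the `window` and `stride` values are assumptions for illustration. The sketch also assumes that all snippets belong to the same video and are already in order.

```python
from typing import Dict, List


def merge_snippets(
    snippets: List[Dict[str, str]],
    window: int = 4,   # number of snippets merged into one chunk
    stride: int = 2,   # step between chunk starts (stride < window overlaps)
) -> List[Dict[str, str]]:
    """Merge consecutive snippets into larger chunks of `window` snippets."""
    chunks = []
    for start in range(0, len(snippets), stride):
        group = snippets[start:start + window]
        chunks.append({
            # reuse the id of the first snippet in the chunk
            "id": group[0]["id"],
            "text": " ".join(s["text"].strip() for s in group),
        })
    return chunks


# hypothetical snippets, not taken from the dataset
toy = [{"id": f"vid_{i}", "text": t} for i, t in enumerate(
    ["So welcome", "to the course.", "My name is", "David.", "Let's begin."]
)]
for chunk in merge_snippets(toy):
    print(chunk)
```

Overlapping windows mean a sentence split across two snippets still appears whole in at least one chunk, at the cost of storing some text more than once.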

Now we need a place to store these embeddings and enable an efficient _vector search_ through them all. To do that we use Pinecone. We can get a [free API key](https://app.pinecone.io) and enter it below, where we initialize our connection to Pinecone and create a new index.

In [43]:
# create pinecone vector database, specify name and metadata, vector length
import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="*****************",  # never publish a real key in a shared notebook
    environment="eu-west1-gcp"  # may be different, check at app.pinecone.io
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine',
        metadata_config={'indexed': ['title']}
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We can see the index is currently empty, with a `total_vector_count` of `0`. We can begin populating it with embeddings built by OpenAI's `text-embedding-ada-002` model like so:

In [51]:
from tqdm.auto import tqdm
from time import sleep
embed_model = "text-embedding-ada-002"
batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(transcript_data), batch_size)):
    # find end of batch
    i_end = min(len(transcript_data), i+batch_size)
    meta_batch = transcript_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, retrying every 5 seconds if the request fails
    # (e.g. with a RateLimitError); catching Exception rather than using a
    # bare except keeps Ctrl-C able to interrupt the loop
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            done = True
        except Exception:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'title': x.get('title','unknown'),
        'text': x.get('text','NaN')
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/19 [00:00<?, ?it/s]

In [54]:
to_upsert[0][2]

{'title': '25. Interpretability',
 'text': '  PROFESSOR: OK, so thelast topic for the class is interpretability. As you know, the modernmachine learning models are justifiably reputed to bevery difficult to understand. So if I give you somethinglike the GPT2 model, which we talked about in naturallanguage processing, and I tell you that ithas 1.5 billion parameters and then you say,why is it working? Clearly the answeris not because these particular parametershave these particular values. There is no way tounderstand that. And so the topictoday is something'}

In [53]:
len(to_upsert)

79
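The retry logic in the upsert loop above can be factored into a small helper with a bounded number of attempts and exponential backoff. This is a sketch under assumptions: `with_retries`, `max_attempts`, and `base_delay` are hypothetical names and values, and the flaky function stands in for a real API call.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


# stand-in for an API call that fails twice before succeeding
attempts: List[int] = []

def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return "embeddings"

# pass a no-op sleep so the demo does not actually wait
print(with_retries(flaky_call, sleep=lambda seconds: None))
```

In the loop above, the embedding request could then be written as `res = with_retries(lambda: openai.Embedding.create(input=texts, engine=embed_model))`. Unlike an unbounded retry loop, this gives up after `max_attempts` and re-raises the error.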

Now we search. For this we need to create a _query vector_ `xq`:

In [55]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

In [56]:
res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# the embedding of our query is the query vector
xq = res['data'][0]['embedding']

# retrieve the two most similar contexts from Pinecone, with their metadata
res = index.query(xq, top_k=2, include_metadata=True)

In [57]:
res

{'matches': [{'id': 'yYWyLZrdRRI_31',
              'metadata': {'text': '  studied multi-tasklearning in class? So '
                                   "for those of youwho did, don't answer. For "
                                   'everyone else,what are some ways that you '
                                   'might try totie these two prediction '
                                   'problems together? Yeah. AUDIENCE: Maybe '
                                   'you could sharecertain weight parameters, '
                                   "so if you've got acommon set of "
                                   'biomarkers. DAVID SONTAG: So maybe you '
                                   'couldshare some weight parameters. Well, I '
                                   'mean, the simplest wayto tie them together '
                                   "is just to say, we're going to-- so you "
                                   "might say,let's first of all add these two "
                   

In [58]:
limit = 3750

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # the embedding of the query is our query vector
    xq = res['data'][0]['embedding']

    # retrieve the most similar contexts from Pinecone
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts one at a time until adding another would hit the limit
    selected = []
    for context in contexts:
        candidate = "\n\n---\n\n".join(selected + [context])
        if len(candidate) >= limit:
            break
        selected.append(context)
    # `prompt` is always defined, even when zero or one context is returned
    prompt = prompt_start + "\n\n---\n\n".join(selected) + prompt_end
    return prompt

In [59]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

"Answer the question based on the context below.\n\nContext:\n  studied multi-tasklearning in class? So for those of youwho did, don't answer. For everyone else,what are some ways that you might try totie these two prediction problems together? Yeah. AUDIENCE: Maybe you could sharecertain weight parameters, so if you've got acommon set of biomarkers. DAVID SONTAG: So maybe you couldshare some weight parameters. Well, I mean, the simplest wayto tie them together is just to say, we're going to-- so you might say,let's first of all add these two objectivefunctions together. And now we'regoing to minimize-- instead of minimizing just-- now we're going to minimize overthe two weight vectors jointly. So now we have a singleoptimization problem. All I've done is I'venow-- we're optimizing.\n\n---\n\n  your model's trained on English,and you're testing it out in Chinese. That would be an example-- if you use a bag ofwords model, that would be an example whereyour model, obviously, wouldn't gen

In [60]:
# then we complete the context-infused query
complete(query_with_contexts)

'You should use a supervised learning method such as supervised machine translation or a recurrent neural network (RNN) to train your sentence transformers.'

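This answer does not mention _multiple negatives ranking (MNR) loss_, the answer we were hoping for. The retrieved contexts explain why: the knowledge base here holds Machine Learning for Healthcare lecture transcripts, which discuss topics such as multi-task learning but not sentence transformers, so the model has no relevant source to draw on. Retrieval can only ground an answer in information that the knowledge base actually contains.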
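Retrieved contexts only help when they are actually relevant to the query. One way to guard against off-topic contexts is to drop matches whose similarity score is low before building the prompt. The sketch below assumes each match carries a `'score'` field next to its `'metadata'`, as in the query result shape used above. The `filter_contexts` name and the `0.8` threshold are arbitrary assumptions that would need tuning, and the toy result is made up.

```python
from typing import Any, Dict, List


def filter_contexts(result: Dict[str, Any], min_score: float = 0.8) -> List[str]:
    """Keep the text of matches whose similarity score is at least `min_score`."""
    return [
        match["metadata"]["text"]
        for match in result["matches"]
        if match["score"] >= min_score
    ]


# made-up query result for illustration only
toy_result = {
    "matches": [
        {"id": "a_0", "score": 0.91, "metadata": {"text": "on-topic context"}},
        {"id": "b_7", "score": 0.62, "metadata": {"text": "off-topic context"}},
    ]
}
print(filter_contexts(toy_result))
```

If no context passes the threshold, the prompt can instruct the model to say that it does not know rather than answer from unrelated text.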

Once we're done with the index we delete it to save resources:

In [None]:
#pinecone.delete_index(index_name)

---