# Steps 3 & 4: Querying a Completion Model with a Custom Text Prompt

Add your API key to the cell below, then run it.

In [27]:
import openai

# Read the key from a local file; strip() drops any trailing newline
with open("../../openai_app.key", "rt") as f:
    openai.api_key = f.read().strip()

The code below loads in the data sorted by cosine distance that you previously created. Run it as-is.

In [28]:
import pandas as pd

df = pd.read_csv("distances.csv", index_col=0)
df

| Unnamed: 0 | text | embeddings | distances |
|---|---|---|---|
| 20 | The USGS Prompt Assessment of Global Earthquak... | [-0.00565063 -0.02241383 0.00731853 ... 0.00... | 0.087576 |
| 2 | There was widespread damage in an area of abou... | [-0.00294703 -0.02162242 -0.01029019 ... 0.00... | 0.088664 |
| 0 | On 6 February 2023, at 04:17 TRT (01:17 UTC), ... | [-0.00861731 -0.01604001 -0.01179847 ... -0.00... | 0.117818 |
| 37 | The Turkish Government was criticized on socia... | [-0.00018135 0.00045083 -0.00526761 ... -0.00... | 0.122369 |
| 51 | Mahase, Elisabeth (7 February 2023). "Death to... | [-0.01100119 -0.02313346 -0.00543663 ... 0.00... | 0.124095 |
| ... | ... | ... | ... |
| 34 | NATO secretary-general Jens Stoltenberg said t... | [-0.00911843 -0.03215307 0.00483754 ... -0.00... | 0.216761 |
| 31 | Arab League secretary-general Ahmed Aboul Ghei... | [-0.0294337 -0.00111019 0.02146475 ... -0.03... | 0.221374 |
| 41 | President Erdoğan declared seven days of natio... | [-0.00719139 -0.01002615 0.00472568 ... 0.01... | 0.226036 |
| 56 | ReliefWeb's main page for this event. | [-2.54744310e-02 -7.11684162e-03 2.90931948e-... | 0.227360 |
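
For reference, the `distances` column holds the cosine distance between each row's embedding and the question's embedding, so smaller values mean closer matches. A minimal pure-Python sketch of that metric follows. `cosine_distance` is a hypothetical helper written for illustration, not the function the notebook calls, and the vectors are made up.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 2-D vectors (made up for illustration, not real embeddings)
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way give a distance near 0 regardless of length, while unrelated (orthogonal) vectors give a distance near 1.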


## TODO 1: Build the Custom Text Prompt

Run the cell below as-is:

In [29]:
import tiktoken
# Create a tokenizer that is designed to align with our embeddings
tokenizer = tiktoken.get_encoding("cl100k_base")

token_limit = 1000
USER_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""

Now your task is to compose the custom text prompt.

The overall structure of the prompt should look like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

In the place marked `context`, provide as much information from `df['text']` as possible without exceeding `token_limit`. In the place marked `question`, add `USER_QUESTION`.

Your overall goal is to create a string called `prompt` that contains all of the relevant information.
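
One way to fill in the two places is sketched here, assuming named `str.format` fields that match the structure above. `PROMPT_TEMPLATE` and `fill_prompt` are hypothetical names, the context rows are made up, and the `###` separator between rows is only one possible choice.

```python
# Hypothetical template with named fields mirroring the structure above
PROMPT_TEMPLATE = """Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't know"

Context:

{context}

---

Question: {question}
Answer:"""

def fill_prompt(context_rows, question):
    """Join context rows with a separator and substitute both fields."""
    context = "\n\n###\n\n".join(context_rows)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Made-up rows and question, for illustration only
example = fill_prompt(
    ["First row of text.", "Second row of text."],
    "What happened?",
)
print(example)
```

This sketch does not enforce `token_limit`; choosing how many rows to pass in is still part of the task.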

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
# Count the number of tokens in the prompt template and question
prompt_template = """
Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

{}

---

Question: {}
Answer:"""
token_count = len(tokenizer.encode(prompt_template)) + \
                        len(tokenizer.encode(USER_QUESTION))

# Create a list to store text for context
context_list = []

# Loop over rows of the sorted dataframe
for text in df["text"].values:
    
    # Append text to context_list if there is enough room
    token_count += len(tokenizer.encode(text))
    if token_count <= token_limit:
        context_list.append(text)
    else:
        # Break once we're over the token limit
        break

# Use string formatting to complete the prompt
prompt = prompt_template.format(
    "\n\n###\n\n".join(context_list),
    USER_QUESTION
)
print(prompt)
```

</details>
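
Note that the solution's loop stops at the first row that would push the total over `token_limit`; it does not skip ahead to look for shorter rows. The sketch below isolates that behavior. It assumes a hypothetical `pack_context` helper and, purely so that it runs without `tiktoken`, counts whitespace-separated words instead of real tokens.

```python
def pack_context(texts, budget, count_tokens, used=0):
    """Collect texts in order until the next one would exceed the budget."""
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected

# Stand-in counter (an assumption): words, not tiktoken tokens
def count_words(text):
    return len(text.split())

rows = ["one two three", "four five six seven eight nine", "ten"]
picked = pack_context(rows, budget=5, count_tokens=count_words)
print(picked)
```

Here the third row would fit in the remaining budget, but it is never reached because the second row ends the loop. Since the rows are sorted by relevance, stopping early keeps only the most relevant text at the cost of some unused budget.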

In [36]:
import numpy as np
import openai
# Read the key from a local file; strip() drops any trailing newline
with open("../../openai_app.key", "rt") as f:
    openai.api_key = f.read().strip()

from openai.embeddings_utils import get_embedding, distances_from_embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["embeddings"] = df_copy["embeddings"].apply(eval).apply(np.array)
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].to_list(),
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
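
The `.apply(eval)` step turns each stored embedding string back into a Python list. As a hedged alternative, `ast.literal_eval` from the standard library parses the same kind of literal without executing arbitrary code. This sketch assumes the embeddings are stored as comma-separated list literals such as `"[0.1, -0.2]"`; `parse_embedding` is a hypothetical name and the sample value is made up.

```python
import ast

def parse_embedding(raw):
    """Parse a stringified list of numbers into a list of floats."""
    values = ast.literal_eval(raw)  # accepts Python literals only
    return [float(v) for v in values]

# Made-up value for illustration, not a real embedding
vector = parse_embedding("[0.25, -0.5, 1e-3]")
print(vector)
```

In the function above, this would take the place of `eval` in the chain, for example `df_copy["embeddings"].apply(parse_embedding).apply(np.array)`.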

In [45]:
# Count the number of tokens in the prompt template and question
prompt_template = """
Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

{}

---

Question: {}
Answer:"""
token_count = len(tokenizer.encode(prompt_template)) + \
                        len(tokenizer.encode(USER_QUESTION))
print("token count {}".format(token_count))

# Load the text rows and their precomputed embeddings
import pandas as pd
df = pd.read_csv("embeddings.csv", index_col=0)

# Create a list to store text for context
context_list = []

# Loop over rows of the sorted dataframe
for text in get_rows_sorted_by_relevance(USER_QUESTION, df)["text"].values:
    # Append text to context_list if there is enough room
    token_count += len(tokenizer.encode(text))
    if token_count <= token_limit:
        context_list.append(text)
    else:
        # Break once we're over the token limit
        break

# Use string formatting to complete the prompt
prompt = prompt_template.format(
    "\n\n###\n\n".join(context_list),
    USER_QUESTION
)
print(prompt)


token count 57

Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

The USGS Prompt Assessment of Global Earthquakes for Response (PAGER) service estimated a 35 percent probability of economic losses between US$10 billion and US$100 billion. There was a 34 percent probability of economic losses exceeding US$100 billion. The service estimated a 36 percent probability of deaths between 10,000 and 100,000; 26 percent probability of deaths exceeding 100,000. For the second large earthquake, there was a 46 percent probability of deaths between 1,000 and 10,000; 30 percent probability of deaths between 100 and 1,000. The service also estimated a 35 percent percent probability of economic losses between US$1 billion and US$10 billion; 27 percent probability of economic losses between US$10 billion and US$100 billion.Risklayer estimated a death toll of between 23,284 and 105,671. According to geophysics pr

Define a `create_prompt` function based on the code above.

In [48]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

## TODO 2: Send Custom Text Prompt to Completion Model

Using the `prompt` string you created, query an OpenAI `Completion` model to get an answer. Specify a `max_tokens` of 150.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
# text-davinci-003 has been retired; use the model from the cells below
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)
```

</details>
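
The answer text sits inside the first entry of the response's `choices` list and can begin with whitespace, which is why the solution calls `.strip()`. A small offline sketch of that extraction follows. `fake_response` is a hand-written dictionary that only mimics the fields indexed in the solution, and `extract_answer` is a hypothetical helper.

```python
def extract_answer(response):
    """Return the first choice's text with surrounding whitespace removed."""
    return response["choices"][0]["text"].strip()

# Hand-written stand-in for an API response (not real model output)
fake_response = {"choices": [{"text": "\n\n I don't know.\n"}]}
print(extract_answer(fake_response))
```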

In [50]:
import pandas as pd
df = pd.read_csv("embeddings.csv", index_col=0)

In [49]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
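
Because `answer_question` catches every exception and returns an empty string, a caller cannot tell an API failure from an empty answer without reading the printed message. The sketch below shows the same fallback pattern with the completion call passed in as a parameter, so it can be exercised offline. `safe_answer`, `complete`, and both stand-in model functions are hypothetical.

```python
def safe_answer(prompt, complete):
    """Call complete(prompt); on any error, print it and return ''."""
    try:
        return complete(prompt).strip()
    except Exception as e:
        print(e)
        return ""

# Stand-ins for the real API call (assumptions for illustration)
def working_model(prompt):
    return " I don't know. "

def failing_model(prompt):
    raise RuntimeError("simulated API failure")

ok_result = safe_answer("Question: ...", working_model)
failed_result = safe_answer("Question: ...", failing_model)
print(repr(ok_result), repr(failed_result))
```

Passing the model call in as an argument is a design choice for testability; the notebook's version calls `openai.Completion.create` directly.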

In [56]:
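# Each assignment overrides the previous one, so only the last question
# is sent. It is not covered by the earthquake data.
MY_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""
MY_QUESTION = """Who owns Twitter?"""
MY_QUESTION = """When did Russia invade Ukraine?"""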

answer_question(MY_QUESTION, df)

"I don't know."

In [53]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt,
    max_tokens=150
)
answer = response["choices"][0]["text"]
print(answer)


 The estimated damages were over US$100 billion in Turkey and US$5.1 billion in Syria.


## 🎉 Congratulations 🎉

You have now completed the prompt engineering process using unsupervised ML to get a custom answer from an OpenAI model!