# Custom Chatbot Project

I chose Wikipedia pages related to the topic (the 2024 United States presidential election) because they are human-curated and typically spell-checked.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import wikipedia
import pandas as pd


# Get content from Wikipedia article
def get_wiki_content(title):
    try:
        page = wikipedia.page(title)
        return page.content
    except Exception:  # e.g. page not found or ambiguous title; fall back to empty text
        return ""

# We chose relevant Wikipedia articles about the 2024 presidential election, as well
# as one about the 2020 election, to see whether this ends up confusing our chatbot
articles = [
    "2024 United States presidential election",
    "Kamala Harris",
    "Joe Biden",
    "Donald Trump",
    "2020 United States presidential election",
]

# Create dataframe with wiki content
wiki_data = []
for article in articles:
    content = get_wiki_content(article)
    wiki_data.append({
        'title': article,
        'text': content
    })

df = pd.DataFrame(wiki_data)


In [2]:
print("Viewing one text example to examine the contents")
df.iloc[0]["text"]

Viewing one text example to examine the contents




We notice a few things to clean up. The headers use the format `=== <header> ===`, with the number of `=` signs varying by header level. The References section is full of links that are not useful for answering questions. Finally, there are many newlines that we can get rid of.

In [3]:
from helpers import remove_everything_after_references, convert_wiki_headers

df_cleaned=df.copy()
df_cleaned['text'] = df_cleaned['text'].apply(remove_everything_after_references)

df_cleaned['text'] = df_cleaned['text'].apply(convert_wiki_headers)

df_cleaned['text'] = df_cleaned['text'].apply(lambda x: x.replace("\n"," "))
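
The `helpers` module is not shown in this notebook, so here is a minimal sketch of what the two cleanup functions might look like. The regular expressions are assumptions based on the header format described above, not the actual helper code.

```python
import re


def remove_everything_after_references(text: str) -> str:
    """Drop the References header line and everything that follows it."""
    # Assumed header shape: "== References ==" with any number of '=' signs.
    match = re.search(r"^=+[ \t]*References[ \t]*=+[ \t]*$", text, flags=re.MULTILINE)
    return text[:match.start()] if match else text


def convert_wiki_headers(text: str) -> str:
    """Turn '=== Header ===' lines into plain 'Header.' sentences."""
    return re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1.", text, flags=re.MULTILINE)


sample = "Intro.\n== History ==\nText.\n=== Early years ===\nMore.\n== References ==\n[1] link\n"
trimmed = remove_everything_after_references(sample)
converted = convert_wiki_headers(trimmed)
print(converted)
```

Ending each converted header with a period keeps it a separate sentence once newlines are replaced by spaces, which matters if the later chunking step splits on sentence boundaries.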

In [4]:
print("Verifying things are fine after cleanup using a sample:")
df_cleaned.iloc[0]["text"]

Verifying things are fine after cleanup using a sample:




We create a new dataframe with the text split into chunks, so that we can embed the chunks later and pass only the relevant ones as context to our chatbot.

In [5]:
from helpers import split_into_chunks

chunked_data = []
for _, row in df_cleaned.iterrows():
    chunks = split_into_chunks(row['text'], max_chunk_size=50)
    for chunk in chunks:
        chunked_data.append({
            'title': row['title'],
            'text': chunk
        })

# Create new dataframe with chunked texts
df_chunked = pd.DataFrame(chunked_data)
df_chunked.head()
print(f"The number of rows after distributing the text into chunks is {len(df_chunked)}")

The number of rows after distributing the text into chunks is 1060
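
`split_into_chunks` also lives in the unseen `helpers` module. The sketch below shows one plausible way to implement it, assuming that `max_chunk_size` counts words and that chunks are built from whole sentences. Both are assumptions; the real helper may count tokens or characters instead.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chunk_size: int = 50) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_size words."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and current_len + n_words > max_chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


example_chunks = split_into_chunks(
    "One two three. Four five six. Seven eight nine ten.", max_chunk_size=6
)
print(example_chunks)
```

In this sketch, a single sentence longer than the limit becomes its own oversized chunk rather than being cut mid-sentence. The naive split also breaks after abbreviations such as "U.S.", so chunks can end mid-thought.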


In [6]:
print(f"Let's view one sample chunk: {df_chunked.iloc[0]['text']}")

Let's view one sample chunk: Presidential elections were held in the United States on November 3, 2020. The Democratic ticket of former vice president Joe Biden and the junior U.S. senator from California Kamala Harris defeated the incumbent Republican president Donald Trump, and vice president Mike Pence.


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [7]:

prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

{}

---

Question: {}
Answer:"""



In [9]:
import openai
openai.api_key = "YOUR_API_KEY"  # fill the key here


In [10]:

openai.api_base = "https://openai.vocareum.com/v1"

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
NUM_COMPLETION_TOKENS=100 # we reserve this for the answer

QUESTION_1 = "Who was chosen US president in 2024"
prompt=prompt_template.format("", QUESTION_1)
response = openai.Completion.create(model=COMPLETION_MODEL_NAME, prompt=prompt, max_tokens=NUM_COMPLETION_TOKENS)
answer = response["choices"][0]["text"].strip()
print(f"The returned answer for the question\n=={QUESTION_1}==\nfrom the completion model is:\n{answer}")

The returned answer for the question
==Who was chosen US president in 2024==
from the completion model is:
I don't know.


In [11]:
QUESTION_2 = "Was it Joe Biden or Kamala Harris that lost the presedential elections against Trump in 2024?"
prompt=prompt_template.format("", QUESTION_2)
response = openai.Completion.create(model=COMPLETION_MODEL_NAME, prompt=prompt, max_tokens=NUM_COMPLETION_TOKENS)
answer = response["choices"][0]["text"].strip()
print(f"The returned answer for the question\n=={QUESTION_2}==\nfrom the completion model is:\n{answer}")

The returned answer for the question
==Was it Joe Biden or Kamala Harris that lost the presedential elections against Trump in 2024?==
from the completion model is:
I don't know


In [12]:
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_template_tokens = tokenizer.encode(prompt_template)

# Print the number of tokens and the tokens themselves
print(f"Number of tokens: {len(prompt_template_tokens)}")
print(prompt_template_tokens)

Number of tokens: 40
[198, 16533, 279, 3488, 3196, 389, 279, 2317, 3770, 11, 323, 422, 279, 198, 7998, 649, 956, 387, 19089, 3196, 389, 279, 2317, 11, 2019, 198, 7189, 1541, 956, 1440, 1875, 2014, 1473, 32583, 45464, 14924, 25, 5731, 16533, 25]


In [13]:
MODEL_TOKEN_LIMIT=4096 # We know this is the limit for the gpt-3.5-turbo-instruct model
available_context_tokens = MODEL_TOKEN_LIMIT - len(prompt_template_tokens)
print(f"The number of tokens available for context and question is: {available_context_tokens}")

The number of tokens available for context and question is: 4056
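
To make the token budget explicit, the arithmetic can be written out as below. The numbers come from this notebook; the name `SAFETY_MARGIN_TOKENS` is hypothetical and stands for the extra 100 tokens that the packing loop further down subtracts, presumably to leave room for the question.

```python
MODEL_TOKEN_LIMIT = 4096       # context window assumed for gpt-3.5-turbo-instruct
PROMPT_TEMPLATE_TOKENS = 40    # length of the prompt template measured above
NUM_COMPLETION_TOKENS = 100    # reserved for the answer
SAFETY_MARGIN_TOKENS = 100     # hypothetical name for the extra margin used later

# Tokens left for context and question once the template is accounted for.
available = MODEL_TOKEN_LIMIT - PROMPT_TEMPLATE_TOKENS
# Tokens that can actually be filled with retrieved chunks.
context_budget = available - NUM_COMPLETION_TOKENS - SAFETY_MARGIN_TOKENS

print(f"available for context and question: {available}")
print(f"budget for retrieved chunks: {context_budget}")
```

The prompt and the completion share the same context window, which is why the completion tokens are subtracted from the context budget rather than counted separately.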


We now compute an embedding for each chunk so that we can retrieve the chunks most relevant to a question.

In [14]:
from helpers import get_rows_sorted_by_relevance

# Code adapted from the GENAIND exercise notebooks
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 389
embeddings = []
for i in range(0, len(df_chunked), batch_size):
    # Send text data to OpenAI model to get embeddings
    print(f"Getting embeddings for rows {i}:{i+batch_size}")
    response = openai.Embedding.create(
        input=df_chunked.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    data=response["data"]
    # Add embeddings to list
    embeddings.extend([data_row["embedding"] for data_row in data])

# Add embeddings list to dataframe
df_chunked["embeddings"] = embeddings

df_chunked=get_rows_sorted_by_relevance(QUESTION_1, df_chunked, EMBEDDING_MODEL_NAME)

Getting embeddings for rows 0:389
Getting embeddings for rows 389:778
Getting embeddings for rows 778:1167
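
`get_rows_sorted_by_relevance` is another helper whose code is not shown. Judging by the `distances` column it adds, it probably embeds the question and ranks the rows by cosine distance. The sketch below illustrates only the ranking step on plain Python lists, with a precomputed question embedding. The function names and the list-of-dicts layout are assumptions chosen to keep the example dependency-free.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sort_by_relevance(question_embedding, rows):
    """Return copies of rows (dicts with an 'embeddings' key), closest first."""
    scored = [
        {**row, "distances": cosine_distance(question_embedding, row["embeddings"])}
        for row in rows
    ]
    return sorted(scored, key=lambda row: row["distances"])


rows = [
    {"text": "far", "embeddings": [0.0, 1.0]},
    {"text": "near", "embeddings": [1.0, 0.1]},
]
ranked = sort_by_relevance([1.0, 0.0], rows)
print([row["text"] for row in ranked])
```

A smaller distance means the chunk points in nearly the same direction as the question in embedding space, so sorting in ascending order puts the most relevant chunks first.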


In [15]:
print("Let's verify the embedding column has been added:")
df_chunked.head()

Let's verify the embedding column has been added:


| | title | text | embeddings | distances |
|---|---|---|---|---|
| 634 | Donald Trump | On September 15, 2024, he was targeted in anot... | [-0.03755922242999077, -0.020260214805603027, ...] | 0.151698 |
| 733 | 2020 United States presidential election | This was the first of two elections won by Tru... | [-0.025832142680883408, -0.03599441051483154, ...] | 0.158835 |
| 0 | 2024 United States presidential election | Presidential elections were held in the United... | [-0.03762064129114151, -0.0432426854968071, 0....] | 0.164513 |
| 196 | 2024 United States presidential election | Biden became the oldest president ever elected... | [-0.02744671143591404, -0.03909603878855705, -...] | 0.169185 |
| 713 | 2020 United States presidential election | Presidential elections were held in the United... | [-0.028128311038017273, -0.027444636449217796,...] | 0.170454 |


Now we pack in as many of the **most** relevant chunks as the token budget allows.

In [16]:
selected_texts = []
total_tokens = 0

for _, row in df_chunked.iterrows():  # df_chunked must already be sorted by ascending distance
    text_tokens = tokenizer.encode(row['text'])
    # Keep room for the answer plus an extra 100-token margin (presumably for the question)
    if total_tokens + len(text_tokens) < (available_context_tokens-NUM_COMPLETION_TOKENS-100):
        selected_texts.append(row['text'])
        total_tokens += len(text_tokens)
    else:
        break


print(f"Total tokens used: {total_tokens}")
print(f"Number of selected texts: {len(selected_texts)}")

Total tokens used: 3853
Number of selected texts: 76


In [17]:
print("Let's view the selected texts that will be used as context")
selected_texts

Let's view the selected texts that will be used as context


["On September 15, 2024, he was targeted in another assassination attempt in Florida. Trump won the election in November 2024 with 312 electoral votes to incumbent vice president Kamala Harris's 226, making him the second president in U.S.",
 'This was the first of two elections won by Trump, the second being in 2024 against Kamala Harris, following his defeat by Joe Biden in 2020.',
 'Presidential elections were held in the United States on November 3, 2020. The Democratic ticket of former vice president Joe Biden and the junior U.S. senator from California Kamala Harris defeated the incumbent Republican president Donald Trump, and vice president Mike Pence.',
 "Biden became the oldest president ever elected, besting Ronald Reagan's record in 1984, and the oldest non-incumbent ever, besting Trump in 2016; however, both records were broken by Trump in 2024.",
 'Presidential elections were held in the United States on November 8, 2016.',
 'On the same day, Politico released an article p

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [18]:
prompt=prompt_template.format(selected_texts, QUESTION_1)
response = openai.Completion.create(model=COMPLETION_MODEL_NAME, prompt=prompt,max_tokens=NUM_COMPLETION_TOKENS-1)
answer = response["choices"][0]["text"].strip()
print(f"The returned answer for the question\n=={QUESTION_1}==\nfrom the completion model is:\n{answer}")

The returned answer for the question
==Who was chosen US president in 2024==
from the completion model is:
Donald Trump
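
Note that the cell above passes `selected_texts`, a Python list, straight into `str.format`, so the context reaches the model as a list literal with brackets, quotes, and commas. Those extra characters cost tokens that the packing loop did not count. A possible alternative, sketched below with a hypothetical `build_prompt` helper, joins the chunks into plain text first. The answers recorded in this notebook were produced with the list version, not with this sketch.

```python
from typing import List

prompt_template = """
Answer the question based on the context below, and if the
question can't be answered based on the context, say
"I don't know"

Context:

{}

---

Question: {}
Answer:"""


def build_prompt(template: str, context_chunks: List[str], question: str) -> str:
    """Fill the template with chunks joined as plain text instead of a list repr."""
    context = "\n\n".join(context_chunks)
    return template.format(context, question)


demo_prompt = build_prompt(
    prompt_template,
    ["first chunk", "second chunk"],
    "Who was chosen US president in 2024",
)
print(demo_prompt)
```

Separating chunks with a blank line keeps them visually distinct for the model without adding Python syntax to the context.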


### Question 2

In [19]:
prompt=prompt_template.format(selected_texts, QUESTION_2)
response = openai.Completion.create(model=COMPLETION_MODEL_NAME, prompt=prompt,max_tokens=NUM_COMPLETION_TOKENS-1)
answer = response["choices"][0]["text"].strip()
print(f"The returned answer for the question\n=={QUESTION_2}==\nfrom the completion model is:\n{answer}")

The returned answer for the question
==Was it Joe Biden or Kamala Harris that lost the presedential elections against Trump in 2024?==
from the completion model is:
It was Kamala Harris.
