# Custom Chatbot Project

I choose dataset wikipedia "2024 United States presidential election", because the gpt model doesn't have information about things happening in 2024. By providing these information, it can help ppl better understand what had happened during 2024 presidential election.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import requests

# Get the Wikipedia page for "2024" since OpenAI's models stop in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2024 United States presidential election",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)


In [19]:
import pandas as pd

# Load page text into a dataframe
content = resp.json()["query"]["pages"][0]["extract"]
df = pd.DataFrame()
df["text"] = content.split("\n")
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
print(df["text"])

0      Presidential elections were held in the United...
,1      The incumbent president, Joe Biden of the Demo...
,2      Trump, who lost in 2020 to Biden, ran for re-e...
,3      Trump achieved victory in the Electoral Colleg...
,4      According to polls, the most important issues ...
,                             ...                        
,528    An Extremely Detailed Map of the 2024 Election...
,529    "Misinformation Dashboard: Election 2024. A to...
,530    Dovere, Edward-Isaac (November 6, 2024). "Wher...
,531    "The Choice 2024: Harris vs. Trump". Frontline...
,532    "The VP Choice: Vance vs. Walz". Frontline. Se...
,Name: text, Length: 224, dtype: object


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [21]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "API_KEY"

In [22]:
# Verify GPT 3.5 is unware of 2024 US presidential election.

presidential_prompt = """
Question: "Who win the election of 2024 US presidential election"
Answer:
"""
initial_presidential_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=presidential_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_presidential_answer)

Unfortunately, it is impossible to accurately predict the outcome of the 2024 US presidential election at this time. The election will be determined by the American people through the voting process.


In [23]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

In [24]:
df.to_csv("embeddings.csv")

In [27]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from least to most relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [28]:
get_rows_sorted_by_relevance(presidential_prompt, df)

Unnamed: 0,text,embeddings,distances
513,2024 United States elections,"[-0.013307408429682255, -0.02391788363456726, ...",0.123527
517,Timeline of the 2024 United States presidentia...,"[-0.025916269049048424, -0.024590199813246727,...",0.126945
514,2024 United States gubernatorial elections,"[-0.014362308196723461, -0.013297007419168949,...",0.145521
516,2024 United States Senate elections,"[-0.01573183573782444, -0.024501660838723183, ...",0.149709
0,Presidential elections were held in the United...,"[-0.04213929921388626, -0.032862551510334015, ...",0.153677
...,...,...,...
359,"Clayton County, Georgia – 84.31%","[0.01413523405790329, 0.02612915262579918, -0....",0.308170
271,"""tilt"" (used by some predictors): advantage th...","[-0.0323086641728878, -0.014604916796088219, 0...",0.309112
365,"Borden County, Texas – 95.61%","[0.01437399722635746, 0.01590009219944477, -0....",0.313476
364,"Roberts County, Texas – 95.63%","[0.006969737354665995, 0.019120965152978897, -...",0.314162


In [29]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [30]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

In [31]:
custom_presidential_answer = answer_question(presidential_prompt, df)
print(custom_presidential_answer)

Donald Trump won the 2024 US presidential election.


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [32]:
presidential_prompt_1 = """
Question: "Does anyone interfere 2024 US presidential election"
Answer:
"""

In [33]:
initial_presidential_answer_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=presidential_prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_presidential_answer)

Unfortunately, it is impossible to accurately predict the outcome of the 2024 US presidential election at this time. The election will be determined by the American people through the voting process.


In [34]:
custom_presidential_answer_1 = answer_question("Does anyone interfere 2024 US presidential election", df)
print(custom_presidential_answer_1)

Yes, multiple countries such as China, Russia, and Iran were identified as attempting to interfere with the 2024 US Presidential Election through various means including propaganda, disinformation campaigns, and hacking attempts.


### Question 2

In [36]:
presidential_prompt_2 = """
Question: "Did Nikki Haley beat Donald Trump in the election?"
Answer:
"""

In [37]:
initial_presidential_answer_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=presidential_prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_presidential_answer)

Unfortunately, it is impossible to accurately predict the outcome of the 2024 US presidential election at this time. The election will be determined by the American people through the voting process.


In [38]:
custom_presidential_answer_2 = answer_question("Did Nikki Haley beat Donald Trump in the election?", df)
print(custom_presidential_answer_2)

No
