# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task


I chose the Wikipedia article on the [2024 Pakistani general election](https://en.wikipedia.org/wiki/2024_Pakistani_general_election#Reruns) because the training cutoff for gpt-3.5 is September 2021, so the model knows nothing about this event. I want to supply this dataset to the model as context in the prompt (rather than fine-tuning the model) so that I can retrieve up-to-date information about who the current prime minister of Pakistan is, the results of the election, and whether there were allegations of rigging.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [6]:
import openai
import requests
import pandas as pd

openai.api_key = "YOUR API KEY"

In [4]:
# Get the Wikipedia page for "2024 Pakistani general election", since the model's training data stops in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2024_Pakistani_general_election",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")

['General elections, originally scheduled to be held in 2023, were held in Pakistan on 8 February 2024 to elect the members of the 16th National Assembly. The Election Commission of Pakistan announced the detailed schedule on 15 December 2023.',
 'The elections were held following two years of political unrest after Prime Minister Imran Khan of the Pakistan Tehreek-e-Insaf (PTI) was removed from office by a no-confidence motion. Subsequently, Khan was arrested and convicted for corruption and barred from politics for five years. In the run-up to the elections, a Supreme Court ruling stripped the PTI of their electoral symbol for failing to hold intra-party elections for years.',
 "On election night, television broadcasts showed PTI-backed independent candidates leading in at least 127 national assembly seats, which hinted at a potential majority. However, the announcement of final results was abruptly halted. Subsequently, independent candidates ended up winning 103 general seats inclu

In [7]:
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")

In [9]:
df.tail()

Unnamed: 0,text
264,
265,== Notes ==
266,
267,
268,== References ==


In [13]:
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
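The filter above keeps rows that are non-empty and are not wiki section headings such as `== Notes ==`. A minimal sketch of the same mask on a toy frame follows; the sample strings and the names `toy`, `mask`, and `toy_clean` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for the split Wikipedia extract (illustrative only)
toy = pd.DataFrame(
    {"text": ["First paragraph.", "", "== Notes ==", "Second paragraph."]}
)

# Keep rows that are non-empty AND do not start with "==" (section headings)
mask = (toy["text"].str.len() > 0) & (~toy["text"].str.startswith("=="))

# Drop the filtered rows and renumber the index from 0
toy_clean = toy[mask].reset_index(drop=True)

print(toy_clean["text"].tolist())
```

Note that short non-heading rows (for example the "See also" link titles visible in the output below) pass this filter, so they remain in the dataset.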

In [15]:
df.tail(10)

Unnamed: 0,text
250,"On 27 February, Moody's retained Pakistan's ra..."
251,"On 28 February, at least 31 members of the Uni..."
252,"On 8 March, in an unusual political statement,..."
253,"On 20 March, a Congressional hearing titled ""P..."
257,List of members of the 16th National Assembly ...
258,2024 Punjab provincial election
259,2024 Sindh provincial election
260,2024 Khyber Pakhtunkhwa provincial election
261,2024 Balochistan provincial election
262,2024 Pakistani presidential election


In [17]:
df.reset_index(inplace=True, drop=True)

In [18]:
df

Unnamed: 0,text
0,"General elections, originally scheduled to be ..."
1,The elections were held following two years of...
2,"On election night, television broadcasts showe..."
3,The Military Establishment was accused of rigg...
4,PTI chair Gohar Ali Khan alleged election rigg...
...,...
125,2024 Punjab provincial election
126,2024 Sindh provincial election
127,2024 Khyber Pakhtunkhwa provincial election
128,2024 Balochistan provincial election


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [31]:
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
response = openai.Embedding.create(
        input = df["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
)

len(response["data"][2]["embedding"])

1536

In [32]:
embeddings = [data["embedding"] for data in response["data"]]

In [33]:
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"General elections, originally scheduled to be ...","[0.006877138279378414, -0.020588282495737076, ..."
1,The elections were held following two years of...,"[-0.03136055916547775, -0.027397744357585907, ..."
2,"On election night, television broadcasts showe...","[0.005314709153026342, -0.03167024254798889, 0..."
3,The Military Establishment was accused of rigg...,"[-0.01875847578048706, -0.00018845761951524764..."
4,PTI chair Gohar Ali Khan alleged election rigg...,"[-0.04294861853122711, -0.04385370761156082, 0..."
...,...,...
125,2024 Punjab provincial election,"[-0.01601020246744156, -0.026716869324445724, ..."
126,2024 Sindh provincial election,"[0.012851830571889877, -0.049909114837646484, ..."
127,2024 Khyber Pakhtunkhwa provincial election,"[-0.019449619576334953, -0.056799981743097305,..."
128,2024 Balochistan provincial election,"[-0.003315426642075181, -0.04778061807155609, ..."


In [34]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
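To make the ranking step concrete, here is a minimal sketch of cosine distance computed by hand with NumPy on toy 2-dimensional vectors. It is a stand-in written for illustration, not the implementation of `distances_from_embeddings`; the helper name `cosine_distances` and the toy vectors are assumptions.

```python
import numpy as np

def cosine_distances(query, rows):
    """Return 1 - cosine similarity between one query vector and each row."""
    query = np.asarray(query, dtype=float)
    rows = np.asarray(rows, dtype=float)
    # Dot product of every row with the query, divided by the vector norms
    sims = rows @ query / (np.linalg.norm(rows, axis=1) * np.linalg.norm(query))
    return 1.0 - sims

# Toy "embeddings": row 0 points the same way as the query, row 1 is orthogonal
toy_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]

dists = cosine_distances(toy_query, toy_rows)

# Ascending distance = most relevant first, matching the sort above
order = np.argsort(dists)
print(order.tolist())
```

A smaller distance means the row points in nearly the same direction as the question embedding, which is why the dataframe is sorted in ascending order.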

In [35]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
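The context-selection loop above is a greedy token budget: rows are taken in relevance order until the next one would exceed the limit. A minimal offline sketch of that idea follows, assuming a whitespace word count as a stand-in for the `tiktoken` tokenizer; `pick_context` and `count_tokens` are hypothetical names used only here.

```python
def pick_context(rows, budget, count_tokens=lambda s: len(s.split())):
    """Greedily keep rows, in order, until the next row would exceed budget.

    count_tokens is a stand-in (whitespace word count), not a real tokenizer.
    """
    used = 0
    kept = []
    for row in rows:
        cost = count_tokens(row)
        if used + cost > budget:
            # Stop at the first row that does not fit, as the loop above does
            break
        kept.append(row)
        used += cost
    return kept

# Rows are assumed to be already sorted from most to least relevant
ranked = ["a b c", "d e", "f g h i", "j"]
print(pick_context(ranked, budget=5))
```

Because the loop breaks at the first row that does not fit, a later, shorter row (here `"j"`) is not considered even though it would fit in the remaining budget. That keeps the context strictly in relevance order.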

In [36]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=2000, max_answer_tokens=300
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
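The cells above use the pre-1.0 `openai` interface (`openai.Completion.create`). In `openai` 1.0 and later, requests go through a client object instead. The sketch below shows roughly what the completion call could look like in that style; `complete_with_client` is a hypothetical helper, `client` is assumed to be an `openai.OpenAI()` instance, and a fake client is used so the example runs without an API key.

```python
from types import SimpleNamespace

def complete_with_client(client, prompt,
                         model="gpt-3.5-turbo-instruct", max_tokens=300):
    """Hypothetical helper: send a prompt through an openai>=1.0 style client.

    Returns an empty string on any error, mirroring answer_question above.
    """
    try:
        response = client.completions.create(
            model=model, prompt=prompt, max_tokens=max_tokens
        )
        # In the 1.0+ client, responses are objects rather than dicts
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

# Offline stand-in for openai.OpenAI(), for illustration only
class FakeCompletions:
    def create(self, model, prompt, max_tokens):
        return SimpleNamespace(choices=[SimpleNamespace(text="  stub answer  ")])

fake_client = SimpleNamespace(completions=FakeCompletions())
print(complete_with_client(fake_client, "Question: ...\nAnswer:"))
```

With a real client, the same function would be called as `complete_with_client(OpenAI(), prompt)` after `from openai import OpenAI`, assuming the API key is available in the environment.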
        

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [37]:
q1 = """
Question: Who was the prime minister of Pakistan before the 2024 general elections?
Answer:
"""

In [38]:
initial_prime_minister_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=q1,
    max_tokens=200
)["choices"][0]["text"].strip()

custom_prime_minister_answer = answer_question("Who was the prime minister of Pakistan before the 2024 general elections?", df)

In [43]:
print(f"""
Who was the prime minister of Pakistan before the 2024 general elections?

Original Answer: {initial_prime_minister_answer} \n
Custom Answer:   {custom_prime_minister_answer}
"""
)


Who was the prime minister of Pakistan before the 2024 general elections?

Original Answer: The prime minister of Pakistan before the 2024 general elections would depend on when the elections are held. As of now (2021), the prime minister of Pakistan is Imran Khan. 

Custom Answer:   The prime minister of Pakistan before the 2024 general elections was Shehbaz Sharif, who was re-elected on 3 March 2024 after a vote of no confidence against the previous prime minister, Imran Khan, was successful.



### Question 2

In [51]:
q2 = """
Question: Which party won the 2024 general elections of Pakistan, by how many seats and who is the new prime minister and why?
Answer:
"""

In [52]:
q2.split('Answer')[0]

'\nQuestion: Which party won the 2024 general elections of Pakistan, by how many seats and who is the new prime minister and why?\n'

In [53]:
initial_election_result_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=q2,
    max_tokens=200
)["choices"][0]["text"].strip()

custom_election_result_answer = answer_question(q2.split('Answer')[0], df)

In [54]:
print(f"""
Which party won the 2024 general elections of Pakistan, and by how many seats and who is the new prime minister and why?

Original Answer: {initial_election_result_answer} \n
Custom Answer:   {custom_election_result_answer}
"""
)


Which party won the 2024 general elections of Pakistan, and by how many seats and who is the new prime minister and why?

Original Answer: As a language model AI, I cannot accurately answer this question as it has not happened yet. The 2024 general elections of Pakistan have yet to occur and the results are not known. It is impossible for me to predict or provide information about something that has not happened yet. 

Custom Answer:   The PTI won the 2024 general elections in Pakistan, with a plurality of seats in parliament. It won 127 seats in the National Assembly, followed by the PML-N with 75 seats and the PPP with 54 seats. The PTI's candidate, Shehbaz Sharif, was elected as the new prime minister with support from the PPP, MQM-P, PML-Q, BAP, PML-Z, IPP, NP, and other smaller parties. This was due to the absence of a clear majority for any single party in the assembly.



## Using a while loop for better user experience

In [55]:
while True:
    question = input("Type your question (or 'exit' to quit):\n")
    if question.lower() == 'exit':
        break
    
    # Wrap the question in the same Question/Answer template used above
    prompt = f"""
    Question: {question}
    Answer:
    """
    
    print(f"You asked ChatGPT: {question}")
    
    # Baseline: ask the model directly, with no retrieved context
    initial_answer = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=200
    )["choices"][0]["text"].strip()
    
    print(f"AI-generated answer: {initial_answer}")
    
    # Custom: same question, answered with context retrieved from df
    custom_answer = answer_question(prompt.split('Answer')[0], df)
    print(f"Custom answer: {custom_answer}")

Type your question (or 'exit' to quit):
how many seats did MQM win in 2024 election
You asked ChatGPT: how many seats did MQM win in 2024 election
AI-generated answer: It is not possible to accurately predict the results of an election that has not yet occurred.
Custom answer: I don't know.
Type your question (or 'exit' to quit):
how many seats did MQM party win in 2024 pakistan election
You asked ChatGPT: how many seats did MQM party win in 2024 pakistan election
AI-generated answer: It is not possible to accurately answer this question since the 2024 Pakistan election has not happened yet.
Custom answer: The context given in the passage does not mention the results of the 2024 Pakistan election. Therefore, it is not possible to answer this question based on the given information.
Type your question (or 'exit' to quit):
Which party won the 2024 general elections of Pakistan, and by how many seats and who is the new prime minister and why?
You asked ChatGPT: Which party won the 2024 ge