# Custom Chatbot Project

Dataset: Character_description.

The dataset contains descriptions of characters from plays, movies, and limited series, along with the setting of each.

## Data Wrangling

In the cells below, I have loaded a chosen dataset into a `pandas` dataframe with a column named `"text"`. This column contains all of the text data, separated into at least 20 rows.

In [35]:
import pandas as pd

# Load the raw character descriptions and keep only the description text
data = pd.read_csv("data/character_descriptions.csv")
df = pd.DataFrame({"text": data["Description"]})
df

| | text |
|---|---|
| 0 | A young woman in her early 20s, Emily is an as... |
| 1 | A middle-aged man in his 40s, Jack is a succes... |
| 2 | A woman in her late 30s, Alice is a warm and n... |
| 3 | A man in his 50s, Tom is a retired soldier and... |
| 4 | A woman in her mid-20s, Sarah is a free-spirit... |
| 5 | A man in his early 30s, George is a charming a... |
| 6 | A woman in her late 20s, Rachel is a shy and i... |
| 7 | A man in his 60s, John is a retired professor ... |
| 8 | A middle-aged Latina woman in her 40s, Maria i... |
| 9 | A young African American man in his early 20s,... |


In [33]:
import openai
openai.api_key = 'API_KEY'

In [3]:
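EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# Request one embedding per row of the "text" column
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

# Extract the embeddings; the loop variable is named `item` to avoid
# confusion with the `data` dataframe loaded earlier
embeddings = [item["embedding"] for item in response["data"]]
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")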

In [7]:
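import ast

import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv", index_col=0)
# The CSV stores each embedding as a string; ast.literal_eval parses it
# without executing arbitrary code, unlike eval
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df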

| | text | embeddings |
|---|---|---|
| 0 | A young woman in her early 20s, Emily is an as... | [-0.01690799556672573, -0.0075475238263607025,... |
| 1 | A middle-aged man in his 40s, Jack is a succes... | [0.006835197098553181, -0.019300920888781548, ... |
| 2 | A woman in her late 30s, Alice is a warm and n... | [0.004433026071637869, -0.004276566207408905, ... |
| 3 | A man in his 50s, Tom is a retired soldier and... | [0.01690339483320713, -0.008678719401359558, 0... |
| 4 | A woman in her mid-20s, Sarah is a free-spirit... | [-0.01835668459534645, -0.02156326174736023, 7... |
| 5 | A man in his early 30s, George is a charming a... | [-0.02232372760772705, -0.008801703341305256, ... |
| 6 | A woman in her late 20s, Rachel is a shy and i... | [-0.005023646168410778, -0.006127253174781799,... |
| 7 | A man in his 60s, John is a retired professor ... | [0.022159453481435776, -0.010995248332619667, ... |
| 8 | A middle-aged Latina woman in her 40s, Maria i... | [-0.007437407970428467, -0.0008923872373998165... |
| 9 | A young African American man in his early 20s,... | [0.0017888408619910479, -0.01861933246254921, ... |
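
Writing the dataframe to CSV turns each embedding list into a string, which is why the cell above has to parse the column after reading the file back. Below is a minimal standard-library sketch of that round trip. The two-row CSV and its short vectors are made up for illustration, and `ast.literal_eval` is used here because it accepts only Python literals.

```python
import ast
import csv
import io

# Hypothetical two-row stand-in for embeddings.csv; real vectors are much longer.
csv_text = (
    "text,embeddings\n"
    '"Emily, early 20s","[0.1, -0.2, 0.3]"\n'
    '"Jack, 40s","[0.0, 0.5, -0.5]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Each embedding comes back from the CSV as a string, not a list.
print(type(rows[0]["embeddings"]).__name__)

# literal_eval parses Python literals only, so it cannot run arbitrary code.
parsed = [ast.literal_eval(row["embeddings"]) for row in rows]
print(parsed[0])
```

In the notebook itself the parsed lists are additionally wrapped in `np.array` so that distance computations can operate on them.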


## Custom Query Completion

In the cells below, I have composed a custom query using the chosen dataset and retrieved results from an OpenAI `Completion` model.

In [24]:
def answers(question):
    """Answer `question` using the most relevant rows of `df` as context."""
    response = openai.Embedding.create(
        input=question,
        engine=EMBEDDING_MODEL_NAME
    )
    # Extract the question embedding from the response
    q_embeddings = response["data"][0]["embedding"]

    # Compute the cosine distance from the question to every row, then sort
    # so that the closest rows come first
    distances = distances_from_embeddings(
        q_embeddings, df["embeddings"].values, distance_metric="cosine"
    )
    df["distances"] = distances
    df.sort_values(by="distances", ascending=True, inplace=True)

    #Creating the prompt
    tokenizer = tiktoken.get_encoding("cl100k_base")
    token_limit = 1000
    prompt_template = """
    Answer the question based on the context below, and if the 
    question can't be answered based on the context, say 
    "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""
    token_count = len(tokenizer.encode(prompt_template) + tokenizer.encode(question))
    # list to store text for context
    context_list = []

    # Loop over rows of the sorted dataframe, closest match first
    for text in df["text"].values:
        text_token_count = len(tokenizer.encode(text))
        # Stop once the next text would push the prompt past the token limit
        if token_count + text_token_count > token_limit:
            break
        token_count += text_token_count
        context_list.append(text)

    #string formatting to complete the prompt
    prompt = prompt_template.format(
        "\n\n###\n\n".join(context_list),
        question
        )
    
    #Querying the model
    response = openai.Completion.create(prompt=prompt,model=COMPLETION_MODEL_NAME, max_tokens=150)
    answer = response['choices'][0]['text']
    return answer
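
The `answers` function ranks rows by cosine distance to the question embedding. As a rough illustration of what that metric measures, here is a pure-Python sketch assuming the usual definition (1 minus cosine similarity). The helper `cosine_distance` and the toy 3-dimensional vectors are hypothetical and are not part of the notebook or of the `openai` library.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical); real ada-002 embeddings have 1536 dimensions.
question_vec = [1.0, 0.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Smaller distance means more similar, so sorting ascending puts the best match first.
ranked = sorted(row_vecs, key=lambda name: cosine_distance(question_vec, row_vecs[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Because the metric depends only on direction, the vector `[2.0, 0.0, 0.0]` is at distance 0 from the question vector despite having a different length.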

In [31]:
from openai.embeddings_utils import distances_from_embeddings
import tiktoken
EMBEDDING_MODEL_NAME="text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
while True:
    # Checking for user input to exit the loop
    user_input = input("Enter question or 'exit' to quit: ")
    if user_input.lower() == 'exit':
        print("Exiting the program...")
        break  # Exit the loop if user enters 'exit'
    else:
        final=answers(user_input)
        print(f"Answer:{final}\n")

Enter question or 'exit' to quit: Whom is tom is relationship with?
Answer: Tom is in a relationship with Rachel.

Enter question or 'exit' to quit: exit
Exiting the program...
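
The prompt is filled with the closest descriptions until a token limit is reached. The sketch below isolates that budgeting step and checks the cost of each text before appending it, so the limit is never exceeded. The function `select_context` and the word-count tokenizer are hypothetical stand-ins; the notebook itself counts tokens with `tiktoken`.

```python
def select_context(texts, base_tokens, token_limit, count_tokens):
    """Greedily keep texts (already sorted by relevance) that fit in the budget."""
    selected = []
    used = base_tokens
    for text in texts:
        cost = count_tokens(text)
        if used + cost > token_limit:
            break  # the next text would overflow the prompt
        used += cost
        selected.append(text)
    return selected, used


# Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word.
def word_count(text):
    return len(text.split())


texts = ["Tom is a retired soldier", "Rachel is shy", "George is a charming man"]
context, used = select_context(texts, base_tokens=2, token_limit=10, count_tokens=word_count)
print(context, used)  # ['Tom is a retired soldier', 'Rachel is shy'] 10
```

Stopping at the first text that does not fit keeps the context in strict relevance order; skipping it and trying shorter texts further down the list would be another reasonable choice.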


## Custom Performance Demonstration

In the cells below, I demonstrate the performance of the custom query using two questions. For each question, I show the answer from a basic `Completion` model query as well as the answer from the custom query.

### Question 1

In [29]:
Question1 = """Whom is tom is relationship with?"""

initial_answer1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Question1,
    max_tokens=150
)["choices"][0]["text"].strip()

custom_answer1=answers(Question1)

In [30]:
print(f"""{Question1}

Original Answer: {initial_answer1}
Custom Answer:   {custom_answer1}

""")

Whom is tom is relationship with?

Original Answer: I am sorry, I am an AI and I do not have access to information about specific individuals or their personal relationships. Therefore, I am unable to answer your question.
Custom Answer:    Rachel




### Question 2

In [27]:
Question2 = """What is George's age?"""

initial_answer2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Question2,
    max_tokens=150
)["choices"][0]["text"].strip()

custom_answer2=answers(Question2)

In [28]:
print(f"""{Question2}

Original Answer: {initial_answer2}
Custom Answer:   {custom_answer2}

""")

What is George's age?

Original Answer: I don't have enough information to determine George's age. Please provide more context.
Custom Answer:    Early 30s.
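
The two comparisons above repeat the same steps, so they could be wrapped in a small helper. This is only a sketch: `compare_answers` is a hypothetical name, and the stub callables stand in for the basic `Completion` query and the `answers` function so that the example runs without calling the API.

```python
def compare_answers(question, baseline_fn, custom_fn):
    """Return a side-by-side report for one question."""
    baseline = baseline_fn(question).strip()
    custom = custom_fn(question).strip()
    return f"{question}\n\nOriginal Answer: {baseline}\nCustom Answer:   {custom}\n"


# Stub callables (hypothetical) that return canned strings instead of model output.
def fake_baseline(question):
    return " I don't know. "


def fake_custom(question):
    return " Early 30s. "


report = compare_answers("What is George's age?", fake_baseline, fake_custom)
print(report)
```

In the notebook, `custom_fn` would be `answers` and `baseline_fn` would wrap the plain `openai.Completion.create` call used in the cells above.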


