# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

### Why this dataset was used:

Because the dataset is custom and proprietary, it is unlikely to have been included in the training data of any OpenAI LLM. Without RAG or in-context learning, the LLM is therefore unlikely to answer questions about the data accurately.

Additionally, questions about the dataset tend to be broad and open to more than one answer (for example, 'Describe the character Emily'), so in-context learning is needed to give the LLM the context in which these questions are asked.

In [75]:
import pandas as pd
import csv
from openai import OpenAI
from typing import List, Optional
from scipy import spatial

In [120]:
# Data
filepath = 'data/'
filename = 'character_descriptions.csv'

# OpenAI

COMPLETION_MODEL_NAME = "gpt-3.5-turbo"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"

client = OpenAI(api_key = 'INSERT_OPENAI_KEY')
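The cell above passes the key as a string literal. As a hedged alternative, the sketch below reads it from an environment variable so the key stays out of the notebook. The helper name `read_api_key` is hypothetical; `OPENAI_API_KEY` is the variable the `openai` client looks for by default when no `api_key` argument is given.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client")
    return key


# Possible usage in the notebook (not executed here):
# client = OpenAI(api_key=read_api_key())
```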

In [15]:
# Load the character data from file
with open(filepath+filename, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    record_strings = []

    for row in reader:
        record_string = ' '.join([f"{column}: {value}" for column, value in row.items()])
        record_strings.append(record_string)

In [16]:
data = pd.DataFrame(record_strings, columns=['text'])
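The `csv.DictReader` loop and the `DataFrame` construction can also be done in `pandas` alone. The following is a minimal sketch of that alternative, not the notebook's own method. The helper name `records_to_text` and the two sample rows are illustrative assumptions; the real file is `character_descriptions.csv`.

```python
import io

import pandas as pd


def records_to_text(df: pd.DataFrame) -> pd.DataFrame:
    """Join every column of each row into one 'Column: value' string."""
    text = df.astype(str).apply(
        lambda row: " ".join(f"{col}: {val}" for col, val in row.items()),
        axis=1,
    )
    return pd.DataFrame({"text": text})


# Tiny in-memory stand-in for the CSV file; these rows are illustrative only.
sample_csv = io.StringIO(
    "Name,Description,Medium,Setting\n"
    "Emily,An aspiring actress,Play,England\n"
    "Jack,A successful businessman,Play,England\n"
)

# dtype=str and keep_default_na=False keep empty cells as empty strings,
# which mirrors what csv.DictReader returns.
sample = records_to_text(pd.read_csv(sample_csv, dtype=str, keep_default_na=False))
print(sample.loc[0, "text"])
```

Either approach yields a single `"text"` column with one row per character.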

In [61]:
data.head()

|   | text | embedding |
|---|------|-----------|
| 0 | Name: Emily Description: A young woman in her ... | [0.035315584391355515, 0.0066185323521494865, ... |
| 1 | Name: Jack Description: A middle-aged man in h... | [0.0018264292739331722, -0.007305717095732689,... |
| 2 | Name: Alice Description: A woman in her late 3... | [0.03802965208888054, -0.008232181891798973, 0... |
| 3 | Name: Tom Description: A man in his 50s, Tom i... | [0.00259851454757154, 0.02653220109641552, -0.... |
| 4 | Name: Sarah Description: A woman in her mid-20... | [0.041170552372932434, 0.035535845905542374, -... |


In [23]:
# Create OpenAI embeddings
data['embedding'] = None

for index, row in data.iterrows():
    
    embedding = client.embeddings.create(input=row['text'],
                                         model=EMBEDDING_MODEL_NAME)
    
    data.at[index, 'embedding'] = embedding.data[0].embedding
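The loop above makes one API request per row. The embeddings endpoint also accepts a list of strings as `input`, so the rows can be sent in batches. The sketch below assumes a `client` object shaped like the `OpenAI` client used above; the function name `embed_texts` and the default `batch_size` are illustrative choices, not part of the original notebook.

```python
from typing import List


def embed_texts(client, texts: List[str], model: str, batch_size: int = 100) -> List[List[float]]:
    """Embed texts in batches and return one vector per input, in input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of its input within the batch;
        # sorting by it keeps the vectors aligned with the rows.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

In the notebook this could be used as `data['embedding'] = pd.Series(embed_texts(client, data['text'].tolist(), EMBEDDING_MODEL_NAME), index=data.index)`, which should produce the same column with fewer requests.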

In [24]:
data.head()

|   | text | embedding |
|---|------|-----------|
| 0 | Name: Emily Description: A young woman in her ... | [0.035315584391355515, 0.0066185323521494865, ... |
| 1 | Name: Jack Description: A middle-aged man in h... | [0.0018264292739331722, -0.007305717095732689,... |
| 2 | Name: Alice Description: A woman in her late 3... | [0.03802965208888054, -0.008232181891798973, 0... |
| 3 | Name: Tom Description: A man in his 50s, Tom i... | [0.00259851454757154, 0.02653220109641552, -0.... |
| 4 | Name: Sarah Description: A woman in her mid-20... | [0.041170552372932434, 0.035535845905542374, -... |


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [46]:
# Modification of the OpenAI 'distances_from_embeddings' function
def distances_from_embeddings(query_embedding: List[float], embeddings: List[List[float]], distance_metric: str = "cosine") -> List[float]:
    
    """Return the distances between a query embedding and a list of embeddings."""
    
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    
    return distances
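With the default `"cosine"` metric, each distance is one minus the cosine similarity between the query and a stored embedding, so 0 means the vectors point in the same direction and larger values mean less similar. As a rough sketch of the same calculation done for all rows at once, the NumPy version below can stand in for the per-row `scipy` call. The function name `cosine_distances` is an assumption for illustration, and the sketch assumes no embedding is all zeros.

```python
import numpy as np


def cosine_distances(query_embedding, embeddings) -> np.ndarray:
    """Return 1 - cosine similarity between the query and every embedding."""
    query = np.asarray(query_embedding, dtype=float)
    matrix = np.asarray(list(embeddings), dtype=float)
    # Dot product of every row with the query, divided by the product of norms.
    similarities = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - similarities


# Same direction, orthogonal, and opposite direction relative to the query.
example = cosine_distances([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(example)
```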

In [92]:
prompt_template = """
Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

{}

---

Question: {}
Answer:
"""

### Question 1

In [93]:
question = 'Who is Emily?'

In [106]:
# Function to prepare prompt with context

def prepare_prompt(question):

    # Get embedding for the question
    q_embedding = client.embeddings.create(input=question, model=EMBEDDING_MODEL_NAME).data[0].embedding
    # Get the distances from the question for all text embeddings
    distances = distances_from_embeddings(query_embedding=q_embedding, embeddings=data['embedding'])
    # Find the position of the embedding with the smallest distance
    closest_position = distances.index(min(distances))
    # Get the text for that row (positional lookup, because the position comes from a list)
    context = data.iloc[closest_position]['text']
    # Build the prompt from the template
    return prompt_template.format(context, question)
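`prepare_prompt` uses only the single closest row as context. Questions that span several characters may benefit from more than one row, so here is a hedged sketch of a top-k variant. The helper name `top_k_context` and the default `k` are assumptions, not part of the original notebook.

```python
from typing import Sequence


def top_k_context(texts: Sequence[str], distances: Sequence[float], k: int = 3) -> str:
    """Join the k texts with the smallest distances, closest first."""
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return "\n\n".join(texts[i] for i in ranked[:k])


# Illustrative values only: "b" is closest, then "a", then "c".
print(top_k_context(["a", "b", "c"], [0.5, 0.1, 0.9], k=2))
```

Inside `prepare_prompt`, the result of `top_k_context(data['text'].tolist(), distances)` could be passed to `prompt_template.format` in place of the single-row context.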

In [112]:
prompt = prepare_prompt(question)

print(prompt)


Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

Name: Emily Description: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. Medium: Play Setting: England

---

Question: Who is Emily?
Answer:



In [113]:
# Get the response from OpenAI for the prompt with context

response = client.chat.completions.create(model=COMPLETION_MODEL_NAME, messages=[{"role": "user", "content": prompt}])

print(response.choices[0].message.content)

Emily is a young woman in her early 20s, an aspiring actress, and Alice's daughter.


In [114]:
# Get the response from OpenAI WITHOUT context

response = client.chat.completions.create(model=COMPLETION_MODEL_NAME, messages=[{"role": "user", "content": question}])

print(response.choices[0].message.content)

It is unclear who Emily is as the name is quite common. Can you provide more context or details to help identify which Emily you are referring to?


### Question 2

In [115]:
question2 = 'What is Jack\'s play setting?'

In [117]:
prompt2 = prepare_prompt(question2)

print(prompt2)


Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

Name: Jack Description: A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice. Medium: Play Setting: England

---

Question: What is Jack's play setting?
Answer:



In [118]:
# Get the response from OpenAI for the prompt with context

response = client.chat.completions.create(model=COMPLETION_MODEL_NAME, messages=[{"role": "user", "content": prompt2}])

print(response.choices[0].message.content)

England


In [119]:
# Get the response from OpenAI WITHOUT context

response = client.chat.completions.create(model=COMPLETION_MODEL_NAME, messages=[{"role": "user", "content": question2}])

print(response.choices[0].message.content)

The play setting for Jack is the garden of his family home.
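Both questions repeat the same pair of calls, one with the retrieved context and one without. The sketch below wraps that pattern in a single helper so further questions can be compared side by side. It assumes a `client` shaped like the `OpenAI` client above; the name `compare_answers` is hypothetical.

```python
from typing import Dict


def compare_answers(client, model: str, question: str, prompt_with_context: str) -> Dict[str, str]:
    """Ask the same question with and without retrieved context."""

    def ask(content: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

    return {
        "with_context": ask(prompt_with_context),
        "without_context": ask(question),
    }


# Possible usage in the notebook (not executed here):
# compare_answers(client, COMPLETION_MODEL_NAME, question2, prepare_prompt(question2))
```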
