# Custom Chatbot Project

#### Data Source: 
`character_descriptions.csv` - this file contains character descriptions from theater, television, and film productions. Each row contains the name, description, medium, and setting. All characters were invented by an OpenAI model.

#### Reasoning for selection:

This dataset of character descriptions from theater, television, and film productions suits the project for one main reason: the characters were invented by an OpenAI model, so they are unlikely to exist in the knowledge base of a large language model (LLM). Asking the LLM about these characters without additional context is therefore unlikely to produce correct answers.

To address this, we will generate embeddings for the character descriptions, allowing us to retrieve relevant context based on the user’s query. This context will be incorporated into a custom prompt, enabling the LLM to provide more accurate and contextually relevant responses. By leveraging this method, we enhance the LLM's ability to handle inquiries about these specific, unique characters, ultimately improving the quality and relevance of the generated answers.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

## Data Wrangling

In [2]:
import pandas as pd

# Load the CSV file
file_path = 'data/character_descriptions.csv'
character_data = pd.read_csv(file_path)

# Function to combine columns into a single descriptive text
def combine_columns(row):
    # Combine the relevant columns into a single descriptive text
    text = f"{row['Name']} is a {row['Description']} This character appears in a {row['Medium']} set in {row['Setting']}."
    return text

# Apply the function to each row in the dataframe
character_data['text'] = character_data.apply(combine_columns, axis=1)

# Create a new dataframe with only the 'text' column
combined_df = character_data[['text']]

# Display the resulting dataframe
combined_df.head()

| | text |
|---|---|
| 0 | Emily is a A young woman in her early 20s, Emi... |
| 1 | Jack is a A middle-aged man in his 40s, Jack i... |
| 2 | Alice is a A woman in her late 30s, Alice is a... |
| 3 | Tom is a A man in his 50s, Tom is a retired so... |
| 4 | Sarah is a A woman in her mid-20s, Sarah is a ... |
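
The sample output above shows a doubled article ("Emily is a A young woman...") because each description already opens with its own article or noun phrase. The following is a minimal sketch of one way to avoid that, assuming every description can follow "is" directly. The helper name `combine_columns_clean` is hypothetical and not part of the original notebook, and the example row uses a shortened description for illustration.

```python
def combine_columns_clean(row):
    """Combine one row into a sentence without adding an extra article."""
    description = row["Description"].strip()
    # Descriptions such as "A young woman ..." already carry an article,
    # so only a leading "A"/"An" is lowercased; other openings
    # (for example "Lady Olivia's ...") are left untouched.
    first_word, _, rest = description.partition(" ")
    if first_word in ("A", "An"):
        description = f"{first_word.lower()} {rest}"
    return (
        f"{row['Name']} is {description} "
        f"This character appears in a {row['Medium']} set in {row['Setting']}."
    )


# A plain dict stands in for a dataframe row here (shortened description).
example_row = {
    "Name": "George",
    "Description": "A man in his early 30s, George is a charming and charismatic businessman.",
    "Medium": "Play",
    "Setting": "England",
}
print(combine_columns_clean(example_row))
```

Because a pandas row supports the same `row['Name']` lookups, the helper could be passed to `character_data.apply(..., axis=1)` in place of `combine_columns`. Changing the text would also change the embeddings, so the outputs recorded in this notebook would no longer match exactly.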


## Inspecting Non-Customized Results

In [3]:
# Randomly select two rows from the dataframe
sampled_rows = combined_df.sample(2).reset_index(drop=True)

# Extract the full character names (the text before " is a ") for the questions,
# so that multi-word names such as "Lady Olivia" are kept intact
character_1 = sampled_rows.iloc[0]['text'].split(" is a ", 1)[0]
character_2 = sampled_rows.iloc[1]['text'].split(" is a ", 1)[0]

# Generate two questions based on the selected characters
prompt1 = f"What is {character_1}'s profession?"
prompt2 = f"In what setting does {character_2}'s story take place?"

print(f'Prompt1: {prompt1}')

answer1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer1)

print(f'Prompt2: {prompt2}')

answer2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer2)

# Prompt1: What is Viola's profession?
# Viola's profession is never explicitly stated in the play "Twelfth Night" by William Shakespeare. However, it can be inferred that she is a young noblewoman as she is described as being a member of the aristocracy and dresses in men's clothing. She also has the skills and knowledge necessary to impersonate a man, suggesting a well-rounded education and upbringing. Some interpretations of the character speculate that she may have been trained in music or dance due to her talent for playing the viola and her willingness to participate in a masquerade ball. Therefore, Viola's profession can be seen as a lady-in-waiting or a ward of a noble household.
# Prompt2: In what setting does Malvolio's story take place?
# Malvolio's story takes place in the fictional kingdom of Illyria.

# Prompt1: What is Prince's profession?
# Prince was a musician, singer, songwriter, and artist.
# Prompt2: In what setting does Sebastian's story take place?
# Sebastian's story takes place in a village in the mountains of Switzerland.

# Prompt1: What is Malvolio's profession?
# Malvolio's profession is a steward or steward-in-training for Olivia, the Lady of the estate in which he works.
# Prompt2: In what setting does Karma's story take place?
# Karma's story takes place in a small village in rural India.

Prompt1: What is Prince's profession?
Prince was a musician and singer, known for his mastery of multiple instruments and his innovative style that blended various genres such as funk, rock, and pop. He was also a songwriter, producer, and actor.
Prompt2: In what setting does Sonya's story take place?
Sonya's story takes place in a small village in Russia during the late 19th century.


#### Get the embeddings for the text data


In [4]:

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(combined_df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=combined_df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add the embeddings as a new column. assign() returns a new dataframe,
# which avoids pandas' SettingWithCopyWarning on the sliced dataframe.
combined_df = combined_df.assign(embeddings=embeddings)
combined_df


| | text | embeddings |
|---|---|---|
| 0 | Emily is a A young woman in her early 20s, Emi... | [-0.018288955092430115, -0.00899435393512249, ... |
| 1 | Jack is a A middle-aged man in his 40s, Jack i... | [0.004010607022792101, -0.018672117963433266, ... |
| 2 | Alice is a A woman in her late 30s, Alice is a... | [0.002794065745547414, -0.005867373663932085, ... |
| 3 | Tom is a A man in his 50s, Tom is a retired so... | [0.01342708244919777, -0.01089267898350954, -0... |
| 4 | Sarah is a A woman in her mid-20s, Sarah is a ... | [-0.017453497275710106, -0.02167949452996254, ... |
| 5 | George is a A man in his early 30s, George is ... | [-0.02309379354119301, -0.01107707992196083, -... |
| 6 | Rachel is a A woman in her late 20s, Rachel is... | [-0.007816864177584648, -0.007189466152340174,... |
| 7 | John is a A man in his 60s, John is a retired ... | [0.017793340608477592, -0.012889099307358265, ... |
| 8 | Maria is a A middle-aged Latina woman in her 4... | [-0.009754852391779423, -0.011455489322543144,... |
| 9 | Caleb is a A young African American man in his... | [0.006247839890420437, -0.024651341140270233, ... |


In [5]:
# Save the csv file with embeddings
combined_df.to_csv('character_descriptions_with_embeddings.csv')
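
Writing the dataframe to CSV stores each embedding as the string form of a Python list, so it has to be parsed back into numbers when the file is loaded. Below is a small sketch of that round trip using only the standard library. The in-memory buffer and the three-element vector are illustrative stand-ins, not the notebook's data.

```python
import csv
import io
import json

# Toy stand-in for one row of the dataframe (real embeddings have 1536 values).
rows = [{"text": "Example character text.", "embeddings": [0.1, -0.2, 0.3]}]

# Write: serialise the vector as a JSON string inside a single CSV cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow(
        {"text": row["text"], "embeddings": json.dumps(row["embeddings"])}
    )

# Read: parse the string back into a list of floats without using eval().
buffer.seek(0)
loaded = [
    {"text": record["text"], "embeddings": json.loads(record["embeddings"])}
    for record in csv.DictReader(buffer)
]
print(loaded[0]["embeddings"])
```

The point of the sketch is that the stored cell is plain text, and a strict parser such as `json.loads` can recover the numbers without executing arbitrary code.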

In [6]:
len(combined_df['embeddings'][0])
!ls

README.md				    data	   requirements.txt
character_descriptions_with_embeddings.csv  project.ipynb


## Custom Query Completion

#### Get relevant text for custom query

In [7]:
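import ast

import numpy as np
import pandas as pd
# import openai
# openai.api_base = "https://openai.vocareum.com/v1"
# openai.api_key = "YOUR API KEY"

# load the df with embeddings
file_path = 'character_descriptions_with_embeddings.csv'
df = pd.read_csv(file_path, index_col=0)
# Parse each stringified list safely: ast.literal_eval only accepts Python
# literals, unlike eval, which would execute arbitrary code from the file
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)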

from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
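
The ranking above relies on `distances_from_embeddings` from `openai.embeddings_utils`, a helper module that appears to ship only with pre-1.0 releases of the `openai` package. As a rough sketch of what the cosine-distance ranking computes, here is a dependency-free version on toy 2-D vectors. The function names `cosine_distance` and `rank_by_distance` are hypothetical, and real embeddings would have 1536 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, vectors):
    """Return (index, distance) pairs sorted from nearest to farthest."""
    distances = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    return sorted(distances, key=lambda pair: pair[1])


# Toy vectors, purely illustrative
query = [1.0, 0.0]
vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_by_distance(query, vectors)
print([index for index, _ in ranking])
```

The vector pointing in the same direction as the query comes first, which mirrors sorting the dataframe by ascending distance.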


In [8]:
df['embeddings'][0].shape

(1536,)

In [9]:
# do this for question 1
question1 = "What is Malvolio's profession?"
sorted_df1 = get_rows_sorted_by_relevance(question1, df)
sorted_df1.head()

| | text | embeddings | distances |
|---|---|---|---|
| 42 | Malvolio is a A pompous and self-righteous ste... | [-0.015121678821742535, -0.03440052270889282, ... | 0.112659 |
| 45 | Bianca is a Lady Olivia's cunning and quick-wi... | [-0.015460110269486904, -0.025932665914297104,... | 0.162326 |
| 43 | Viola is a A plucky and resourceful young woma... | [-0.01021886058151722, -0.04151612147688866, 0... | 0.1696 |
| 40 | Lady Olivia is a A wealthy and beautiful noble... | [-0.01951126754283905, -0.023369355127215385, ... | 0.185548 |
| 41 | Sir Toby Belch is a A drunken and lecherous kn... | [-0.0015530632808804512, -0.032748542726039886... | 0.19205 |


In [10]:
# do this for question 2
question2 = "In what setting does Karma's story take place?"
sorted_df2 = get_rows_sorted_by_relevance(question2, df)
sorted_df2.head()

| | text | embeddings | distances |
|---|---|---|---|
| 24 | Karma is a A chameleon-like performer, Karma i... | [-0.006721639074385166, -0.020649712532758713,... | 0.185987 |
| 11 | Sonya is a A white woman in her late 20s, Sony... | [0.002251438098028302, -0.026818307116627693, ... | 0.232027 |
| 4 | Sarah is a A woman in her mid-20s, Sarah is a ... | [-0.017453497275710106, -0.02167949452996254, ... | 0.235207 |
| 3 | Tom is a A man in his 50s, Tom is a retired so... | [0.01342708244919777, -0.01089267898350954, -0... | 0.23838 |
| 8 | Maria is a A middle-aged Latina woman in her 4... | [-0.009754852391779423, -0.011455489322543144,... | 0.240816 |


##### Create a Function that Composes a Text Prompt

Building on that sorted list of rows, we're going to create a text prompt that provides context to a Completion model in order to help it answer a question. The outline of the prompt looks like this:

```
Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't
know"

Context:

{context}

---

Question: {question}
Answer:
```

We want to fit as much of our dataset as possible into the "context" part of the prompt without exceeding the number of tokens allowed by the `Completion` model. For `gpt-3.5-turbo-instruct` that limit is roughly 4,000 tokens (4,096), and it covers the prompt and the generated answer together. So we'll loop over the rows in order of relevance, counting the tokens as we go, and stop when the next row would exceed the limit. Then we'll join the selected text into a single string and add it to the prompt.
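
The budget loop described above can be sketched independently of any tokenizer. In this sketch, `select_context` and `count_words` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real token counter, which would usually report higher counts.

```python
def select_context(texts, budget, count_tokens):
    """Greedily keep texts, in order, until the next one would exceed budget."""
    selected = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first row that does not fit
        selected.append(text)
        used += cost
    return selected, used


def count_words(text):
    """Stand-in token counter (an assumption for illustration only)."""
    return len(text.split())


texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected, used = select_context(texts, budget=6, count_tokens=count_words)
print(selected, used)
```

Note the design choice: the loop stops at the first row that does not fit, even if a later, shorter row would. Since rows arrive sorted by relevance, this keeps the context to the most relevant rows only.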

In [11]:
import tiktoken

prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""


def create_prompt(question, df, max_token_count, prompt_template=prompt_template):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [12]:
# create prompt for question 1
max_token_count = 300
print(create_prompt(question1, df, max_token_count))



Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Malvolio is a A pompous and self-righteous steward in Lady Olivia's household. Malvolio is humorless and uptight, and is often the target of Sir Toby Belch's pranks. He is secretly in love with Lady Olivia and harbors dreams of marrying her. This character appears in a Play set in Ancient Greece.

###

Bianca is a Lady Olivia's cunning and quick-witted maid. Bianca is a master of mischief and pranks, and often collaborates with Sir Toby Belch to torment Malvolio. She is also secretly in love with Sir Toby. This character appears in a Play set in Ancient Greece.

###

Viola is a A plucky and resourceful young woman who is shipwrecked on the coast of Illyria. Viola disguises herself as a man, taking on the identity of Cesario, and becomes a servant in Duke Orsino's household. She develops feelings for the Duke, but is unable to reveal her true identi

In [13]:
# create prompt for question 2
max_token_count = 300
print(create_prompt(question2, df, max_token_count))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Karma is a A chameleon-like performer, Karma is known for her ability to transform herself into any character. She's a master of illusion and is always pushing boundaries with her looks and performances, but can sometimes struggle with authenticity and staying true to herself. She's also a friend of Dolly, often offering her a listening ear when she needs it. This character appears in a Musical set in USA.

###

Sonya is a A white woman in her late 20s, Sonya is a free-spirited journalist who's always on the hunt for the next big story. She's passionate, tenacious, and unafraid to speak her mind. However, she's also struggling to balance her career with her personal life, and has a tendency to push people away. This character appears in a Movie set in Texas.

###

Sarah is a A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employe

#### Create a Function that Answers a Question

In [14]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
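
The calls in this notebook use the pre-1.0 `openai` interface (`openai.Completion.create`). In the 1.x releases of the package, completions are requested through a client object instead. The sketch below shows the same request-and-extract logic written against such a client. The function name `complete_with_client` is hypothetical, and a fake client is used so the example runs offline; it is not a drop-in replacement for `answer_question`.

```python
from types import SimpleNamespace


def complete_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Request a completion through a 1.x-style client and return its text."""
    try:
        response = client.completions.create(
            model=model, prompt=prompt, max_tokens=max_tokens
        )
        return response.choices[0].text.strip()
    except Exception as error:
        # Mirror answer_question: report the error and return an empty string
        print(error)
        return ""


# Stand-in client so the sketch runs without network access. A real client
# would be created with `from openai import OpenAI` and
# `client = OpenAI(api_key=..., base_url=...)`.
def _fake_create(model, prompt, max_tokens):
    return SimpleNamespace(choices=[SimpleNamespace(text=" Steward \n")])


fake_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
print(complete_with_client(fake_client, "What is Malvolio's profession?"))
```

Passing the client in as an argument keeps the function easy to test and avoids relying on module-level configuration such as `openai.api_key`.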

In [15]:
# answer question 1
answer1 = answer_question(question1, df)
print(answer1)

Steward


In [16]:
answer2 = answer_question(question2, df)
print(answer2)

Story set in USA.


## Custom Performance Demonstration

### Question 1

In [17]:
question1 = "What is George's profession?"

answer1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer1)

There is not enough information given to determine George's profession.


In [18]:
custom_chatbot_answer1 = answer_question(question1, df)
print(custom_chatbot_answer1)

Businessman.


### Question 2

In [19]:
question2 = "In what setting does Karma's story take place?"

answer2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer2)

Karma's story takes place in a mountain village in Nepal.


In [20]:
custom_chatbot_answer2 = answer_question(question2, df)
print(custom_chatbot_answer2)

Musical set in USA


In [21]:
# .csv data of the characters for reference.
# George,"A man in his early 30s, George is a charming and charismatic businessman who is in a relationship with Emily. He's ambitious, confident, and always looking for the next big opportunity. However, he's also prone to bending the rules to get what he wants.",Play,England
# Karma,"A chameleon-like performer, Karma is known for her ability to transform herself into any character. She's a master of illusion and is always pushing boundaries with her looks and performances, but can sometimes struggle with authenticity and staying true to herself. She's also a friend of Dolly, often offering her a listening ear when she needs it.",Musical,USA