# Custom Chatbot Project

I have selected `character_descriptions.csv` for this project.

Each entry includes `Name`, `Description`, `Medium`, and `Setting`. To optimize this data for a custom chatbot, I am merging these attributes into a single "textual narrative" column (e.g., "The character: [Description]. [Name] lives in [Setting] and appears in [Medium].").

This chatbot is designed to assist creators, such as directors and writers, in quickly sourcing characters that align with specific cultural, linguistic, or narrative requirements for their projects.

## Data Wrangling

Load the chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of the text data, separated into at least 20 rows.

In [1]:
import openai
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.getenv('OPENAI_API_KEY')

In [2]:
import pandas as pd

df = pd.read_csv('./data/character_descriptions.csv')
df.describe()

| | Name | Description | Medium | Setting |
|---|---|---|---|---|
| count | 55 | 55 | 55 | 55 |
| unique | 55 | 55 | 7 | 6 |
| top | Emily | A young woman in her early 20s, Emily is an as... | Play | USA |
| freq | 1 | 1 | 18 | 21 |


In [3]:
df['text'] = 'The character: ' + df['Description'] + '. ' + df['Name'] + ' lives in ' + df['Setting'] + ' and appears in ' + df['Medium'] + '.'
df['text'][0]

"The character: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. Emily lives in England and appears in Play."
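
The sample output above contains a doubled period ("George.. Emily") because each `Description` already ends with a period before `'. '` is appended. Below is a minimal sketch of one way to avoid this. The helper name `build_text` is hypothetical and not part of the original notebook, and the description passed in is shortened for illustration.

```python
def build_text(name, description, setting, medium):
    """Merge one character's fields into a single sentence-style string."""
    # Drop surrounding whitespace and any trailing period so that the
    # separator added below does not produce "..".
    description = description.strip().rstrip(".")
    return (
        f"The character: {description}. "
        f"{name} lives in {setting} and appears in {medium}."
    )


text = build_text(
    name="Emily",
    description="She's also in a relationship with George.",
    setting="England",
    medium="Play",
)
print(text)
```

With pandas, the same helper could be applied row by row, for example `df.apply(lambda row: build_text(row['Name'], row['Description'], row['Setting'], row['Medium']), axis=1)`. Regenerating the text this way would also change the embeddings and the outputs recorded in this notebook.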

In [4]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# Attempt to do a single API call with entire list

# Convert 'text' column to a list
text_list = df['text'].tolist()

# Make a single call
response = openai.Embedding.create(
    input=text_list,
    engine=EMBEDDING_MODEL_NAME
)

# Extract embedding from response
df['embedding'] = [item.embedding for item in response.data]
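
A single call works here because the dataset has only 55 rows. For a larger dataset, the list may need to be split into several requests. The following is a sketch under that assumption: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the stand-in embedder replaces the real API call so the example runs without credentials.

```python
def embed_in_batches(texts, embed_batch, batch_size=20):
    """Call embed_batch on consecutive slices of texts and collect the results in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch):
    # Stand-in for a real embedding call (assumption for illustration only):
    # returns a one-dimensional "embedding" holding the text length.
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

In this notebook, `embed_batch` would wrap the `openai.Embedding.create(input=batch, engine=EMBEDDING_MODEL_NAME)` call and return `[item.embedding for item in response.data]`.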

In [5]:
# Save the text and embeddings to a file for later use
df[['text', 'embedding']].to_csv('./data/character_descriptions_embeddings.csv', index=False)

## Custom Query Completion

Now, we will compose a custom query using the dataset and retrieve results from an OpenAI `Completion` model.

In [6]:
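import ast

import numpy as np
import pandas as pd

df = pd.read_csv('./data/character_descriptions_embeddings.csv')
# Embeddings are stored in the CSV as strings; parse them back into arrays.
# ast.literal_eval only accepts Python literals, unlike eval.
df["embedding"] = df["embedding"].apply(ast.literal_eval).apply(np.array)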
df.head()

| | text | embedding |
|---|---|---|
| 0 | The character: A young woman in her early 20s,... | [-0.01328309066593647, -0.0152988126501441, -0... |
| 1 | The character: A middle-aged man in his 40s, J... | [0.006481784861534834, -0.02580893039703369, 0... |
| 2 | The character: A woman in her late 30s, Alice ... | [0.006487519480288029, -0.010747930966317654, ... |
| 3 | The character: A man in his 50s, Tom is a reti... | [0.015963882207870483, -0.01928156055510044, 0... |
| 4 | The character: A woman in her mid-20s, Sarah i... | [-0.012216025032103062, -0.026548108085989952,... |
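
The embedding column has to be parsed after loading because CSV stores each list as a plain string. A minimal sketch of that round trip follows, using only the standard library instead of pandas and a made-up three-value embedding. It parses with `ast.literal_eval`, which accepts only Python literals.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list; csv stores it as text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["example row", [0.1, -0.2, 0.3]])

# Read the row back: the embedding is now a string, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embedding"]
print(type(raw).__name__, raw)

# Parse the string back into a list of floats.
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed)
```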


## Custom Performance Demonstration

In the cells below, we demonstrate the performance of the custom query using two questions. For each question, we show the answer from a basic `Completion` model query as well as the answer from the custom query.

In [7]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken

MAX_TOKENS = 1000

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embedding"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in df["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
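
The context-selection loop in `create_prompt` is a greedy token budget: rows are taken in relevance order until the next one would exceed the limit. The sketch below isolates that logic. `select_context` and `count_words` are hypothetical names, and a whitespace word count stands in for the `tiktoken` tokenizer so the example has no dependencies.

```python
def select_context(texts, base_token_count, max_token_count, count_tokens):
    """Keep texts in order until adding the next one would exceed the budget."""
    total = base_token_count
    selected = []
    for text in texts:
        total += count_tokens(text)
        if total > max_token_count:
            break
        selected.append(text)
    return selected


def count_words(text):
    # Stand-in for a real tokenizer (assumption for illustration only).
    return len(text.split())


rows = ["one two three", "four five", "six seven eight nine"]
selected = select_context(
    rows, base_token_count=2, max_token_count=8, count_tokens=count_words
)
print(selected)
```

Because the loop stops at the first row that does not fit, the rows should already be sorted with the most relevant first, as `get_rows_sorted_by_relevance` does.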

### Question 1

In [8]:
question1 = 'Who would you recommend for a gritty crime drama set in a dark, urban environment?'

# Rank rows by cosine distance to question1
df_q1 = get_rows_sorted_by_relevance(question1, df)

df_q1.head()

| | text | embedding | distances |
|---|---|---|---|
| 10 | The character: A white man in his mid-30s, Tyl... | [0.01253554131835699, -0.04597228392958641, -0... | 0.236864 |
| 3 | The character: A man in his 50s, Tom is a reti... | [0.015963882207870483, -0.01928156055510044, 0... | 0.245044 |
| 14 | The character: A young Australian woman in her... | [-0.006705560255795717, -0.01931357942521572, ... | 0.245286 |
| 32 | The character: A driven and ambitious attorney... | [-0.0040601580403745174, -0.007333209272474050... | 0.245836 |
| 52 | The character: The charismatic and dashing cap... | [-0.00556804146617651, -0.022552067413926125, ... | 0.247529 |
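
The `distances` column holds cosine distances, so smaller values mean a row points in a direction closer to the question's embedding. As a rough illustration of what that metric computes, here is a pure-Python sketch on toy two-dimensional vectors. The `cosine_distance` function is written out here for illustration and is not the library implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending: the smallest distance is the most relevant row.
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not affect the result, only direction, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.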


In [9]:
prompt_q1 = create_prompt(question1, df_q1, MAX_TOKENS)

print(prompt_q1)


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

The character: A white man in his mid-30s, Tyler is a tough-as-nails sheriff who takes his job very seriously. He's stoic, no-nonsense, and has a strong sense of justice. However, he's also struggling to come to terms with a recent tragedy in his personal life.. Tyler lives in Texas and appears in Movie.

###

The character: A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.. Tom lives in England and appears in Play.

###

The character: A young Australian woman in her mid-20s, Mia is a driven and ambitious lawyer who's just landed her dream job at a top law firm in Sydney. She's the younger sister of Max, a former soldier who's struggling with PTSD, and is trying to help him navigate his chall

In [10]:
# Ask question1

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

response1 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt_q1,
    max_tokens=150
)

answer1 = response1["choices"][0]["text"].strip()

print(answer1)

I would recommend Tyler or Will for a gritty crime drama set in a dark, urban environment. Both of them have traits and experiences that would make them compelling characters in a crime-focused story.


In [11]:
basic_prompt1 = f"Question: {question1}\nAnswer:"

response1_basic = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=basic_prompt1,
    max_tokens=150
)

basic_answer1 = response1_basic["choices"][0]["text"].strip()

print(basic_answer1)

I would recommend David Simon, creator of shows like "The Wire" and "Homicide: Life on the Street." His work is known for its vivid and realistic portrayal of crime and poverty in urban cities. Another great option is Denis Lehane, who has written books like "Mystic River" and "Gone Baby Gone" that have been adapted into gritty crime films set in Boston.


### Question 2

In [12]:
question2 = 'I need two names of young female actress for historical theater settings.'

# Rank rows by cosine distance to question2
df_q2 = get_rows_sorted_by_relevance(question2, df)

df_q2.head()

| | text | embedding | distances |
|---|---|---|---|
| 0 | The character: A young woman in her early 20s,... | [-0.01328309066593647, -0.0152988126501441, -0... | 0.206501 |
| 37 | The character: A fiery and passionate young wo... | [-0.015376008115708828, -0.01927143521606922, ... | 0.216253 |
| 43 | The character: A plucky and resourceful young ... | [-0.014041371643543243, -0.041595470160245895,... | 0.224975 |
| 11 | The character: A white woman in her late 20s, ... | [0.006553735584020615, -0.03139369934797287, -... | 0.226446 |
| 4 | The character: A woman in her mid-20s, Sarah i... | [-0.012216025032103062, -0.026548108085989952,... | 0.22706 |


In [13]:
prompt_q2 = create_prompt(question2, df_q2, MAX_TOKENS)

print(prompt_q2)


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

The character: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.. Emily lives in England and appears in Play.

###

The character: A fiery and passionate young woman who works as a blacksmith. She is strong-willed and independent, and her singing voice is bold and powerful. Francesca has caught the eye of Prince Lorenzo, but she is hesitant to give her heart to a man who comes from such a different world.. Francesca lives in Italy and appears in Opera.

###

The character: A plucky and resourceful young woman who is shipwrecked on the coast of Illyria. Viola disguises herself as a man, taking on the identity of Cesario, and becomes a servant in Duke Orsino's household. She develops fe

In [14]:
# Ask question2

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

response2 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt_q2,
    max_tokens=150
)

answer2 = response2["choices"][0]["text"].strip()
print(answer2)

Emily and Sarah


In [15]:
basic_prompt2 = f"Question: {question2}\nAnswer:"

response2_basic = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=basic_prompt2,
    max_tokens=150
)

basic_answer2 = response2_basic["choices"][0]["text"].strip()

print(basic_answer2)

1. Lily Collins - known for her roles in historical films such as "Mirror Mirror" and "The Mortal Instruments: City of Bones"
2. Saoirse Ronan - known for her roles in historical dramas such as "Brooklyn" and "Mary Queen of Scots"
