# Custom Chatbot Project

I chose the `character_descriptions.csv` dataset and the `gpt-3.5-turbo-instruct` model to show how a domain-specific dataset improves model responses in a targeted context. The dataset holds structured descriptions of fictional characters, so supplying the relevant rows as context lets the model answer questions about those characters that it otherwise cannot.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [15]:
import pandas as pd

df = pd.read_csv('data/character_descriptions.csv')

# Add concatenated text to the dataframe
df['text'] = df['Name'] + " is a character in a " + df['Medium'] + ". " + df['Description'] + " The story is set in " + df['Setting'] + "."

df.iloc[0]["text"]


"Emily is a character in a Play. A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. The story is set in England."

In [16]:
import openai

openai.api_key = "YOUR API KEY"

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
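
The batching loop above can be separated from the API call so it can be checked offline. This is a sketch under assumptions: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for a wrapper around the `openai.Embedding.create` request in the cell above.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Call embed_batch on slices of at most batch_size texts and
    return one embedding per input text, in the original order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        # Guard against a response that drops or duplicates items
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch):
    """Hypothetical stand-in for the API: one 2-dimensional vector per text."""
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

The length check matters because `df["embeddings"] = embeddings` only lines up if the list has exactly one entry per row.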

In [17]:
df

Unnamed: 0,Name,Description,Medium,Setting,text,embeddings
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,Emily is a character in a Play. A young woman ...,"[-0.016368035227060318, -0.014631644822657108,..."
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,Jack is a character in a Play. A middle-aged m...,"[0.0023951339535415173, -0.027941077947616577,..."
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,Alice is a character in a Play. A woman in her...,"[0.007741514127701521, -0.011314022354781628, ..."
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,Tom is a character in a Play. A man in his 50s...,"[0.015514831058681011, -0.020425723865628242, ..."
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,Sarah is a character in a Play. A woman in her...,"[-0.01451121736317873, -0.03126079961657524, -..."
5,George,"A man in his early 30s, George is a charming a...",Play,England,George is a character in a Play. A man in his ...,"[-0.01863767020404339, -0.015743380412459373, ..."
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England,Rachel is a character in a Play. A woman in he...,"[-0.006452818401157856, -0.014104107394814491,..."
7,John,"A man in his 60s, John is a retired professor ...",Play,England,John is a character in a Play. A man in his 60...,"[0.017432034015655518, -0.017279010266065598, ..."
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas,Maria is a character in a Movie. A middle-aged...,"[-0.01244370173662901, -0.019594186916947365, ..."
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas,Caleb is a character in a Movie. A young Afric...,"[0.006574748549610376, -0.030858173966407776, ..."


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [18]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
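
To make the ranking step concrete, here is a small pure-Python sketch of cosine distance, assuming it is defined as one minus cosine similarity (the SciPy convention). `cosine_distance` and `rank_by_distance` are illustrative names and are not the library functions used above.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, vectors):
    """Return indices of vectors ordered from closest to farthest from query."""
    distances = [cosine_distance(query, vector) for vector in vectors]
    return sorted(range(len(vectors)), key=lambda index: distances[index])


# Toy 2-dimensional vectors; real embeddings have many more dimensions
order = rank_by_distance([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(order)
```

A smaller distance means the vectors point in a more similar direction, which is why the dataframe is sorted in ascending order.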


In [19]:
get_rows_sorted_by_relevance("Who is Emily?", df)

Unnamed: 0,Name,Description,Medium,Setting,text,embeddings,distances
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,Emily is a character in a Play. A young woman ...,"[-0.016368035227060318, -0.014631644822657108,...",0.11404
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,Alice is a character in a Play. A woman in her...,"[0.007741514127701521, -0.011314022354781628, ...",0.170748
5,George,"A man in his early 30s, George is a charming a...",Play,England,George is a character in a Play. A man in his ...,"[-0.01863767020404339, -0.015743380412459373, ...",0.193296
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England,Rachel is a character in a Play. A woman in he...,"[-0.006452818401157856, -0.014104107394814491,...",0.214381
26,Olivia,A confident and charismatic marketing executiv...,Reality Show,USA,Olivia is a character in a Reality Show. A con...,"[-0.00042307059629820287, -0.01519947499036789...",0.218513
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,Sarah is a character in a Play. A woman in her...,"[-0.01451121736317873, -0.03126079961657524, -...",0.220237
14,Mia,"A young Australian woman in her mid-20s, Mia i...",Limited Series,Australia,Mia is a character in a Limited Series. A youn...,"[-0.012351165525615215, -0.022184068337082863,...",0.223251
30,Sophia,"A fun-loving and adventurous travel blogger, S...",Reality Show,USA,Sophia is a character in a Reality Show. A fun...,"[0.024226823821663857, -0.015650367364287376, ...",0.224164
49,Abigail,A plucky and resourceful young woman who works...,Sitcom,USA,Abigail is a character in a Sitcom. A plucky a...,"[-0.021924767643213272, -0.019728370010852814,...",0.224644
18,Ava,"A middle-aged Australian woman in her 50s, Ava...",Limited Series,Australia,Ava is a character in a Limited Series. A midd...,"[9.845461318036541e-05, -0.013735860586166382,...",0.228283


In [20]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
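
The token-budget logic in `create_prompt` can be illustrated without the tokenizer. In this sketch, `select_context` and `count_words` are hypothetical names, and whitespace word counts stand in for `len(tokenizer.encode(text))`, so the numbers are not real token counts.

```python
def select_context(ranked_texts, count_tokens, budget, overhead=0):
    """Take texts in ranked order until the next one would exceed budget.

    overhead is the cost already spent on the template and question.
    """
    used = overhead
    selected = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        used += cost
        selected.append(text)
    return selected


def count_words(text):
    """Hypothetical stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


chosen = select_context(["a b c", "d e", "f g h i"], count_words, budget=6, overhead=1)
print(chosen)
```

Like the notebook's function, this sketch does not charge for the `###` separators joined between rows, so the final prompt can run slightly over the stated maximum.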


In [21]:
print(create_prompt("Who is Emily?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Emily is a character in a Play. A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. The story is set in England.

###

Alice is a character in a Play. A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack. The story is set in England.

---

Question: Who is Emily?
Answer:


In [33]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def openai_query(prompt, max_answer_tokens=150):
    """
    Given a prompt, query the openai.Completion model.
    Return the first response choice from the model.
    """
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

def answer_question(question, df, max_prompt_tokens=1800, max_answer_tokens=150):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    return openai_query(prompt, max_answer_tokens)
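
The cell above uses the module-level `openai.Completion.create` interface from pre-1.0 releases of the `openai` package. In 1.0 and later, requests go through a client object instead. The following is a hedged sketch of the same query in that style. `query_completion` and `make_fake_client` are hypothetical names, and the fake client exists only so the function can be exercised offline.

```python
from types import SimpleNamespace


def query_completion(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Return the first completion's text, or "" if the request fails."""
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        return response.choices[0].text.strip()
    except Exception as error:
        print(error)
        return ""


def make_fake_client(reply):
    """Hypothetical offline stand-in that mimics the client's response shape."""
    def create(model, prompt, max_tokens):
        return SimpleNamespace(choices=[SimpleNamespace(text=reply)])
    return SimpleNamespace(completions=SimpleNamespace(create=create))


print(query_completion(make_fake_client("  Emily.  "), "Who is Alice daughter?"))
```

With the real package installed, the client would be created with `from openai import OpenAI` and `client = OpenAI()`, which reads the key from the `OPENAI_API_KEY` environment variable.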


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [34]:
question = "Who is Emily?"
emily_answer = openai_query(question)
print(emily_answer)

There is not enough information to determine who Emily is. It is a fairly common name and could refer to anyone.


In [35]:
emily_answer = answer_question(question, df)
print(emily_answer)

Emily is a character in a play, an aspiring actress and Alice's daughter. She is in a relationship with George and struggles with self-doubt and insecurity. The story is set in England.


### Question 2

In [36]:
question = "Who is Alice daughter?"
daughter_answer = openai_query(question)
print(daughter_answer)

It depends on which Alice you are referring to. There are many fictional characters named Alice, and their daughters may have different names or may not even exist in the story. If you are referring to the traditional character of Alice in "Alice's Adventures in Wonderland" by Lewis Carroll, she does not have a daughter. However, in some adaptations and sequels to the story, she may have a daughter named Lily or Dinah.


In [37]:
daughter_answer = answer_question(question, df)
print(daughter_answer)

Emily.
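
The two comparisons above follow the same pattern, which could be wrapped in a small helper. This is only a sketch: `compare_answers` is a hypothetical name, and the two stand-in functions return canned strings rather than calling the model.

```python
def compare_answers(questions, baseline, custom):
    """Ask each question of both answer functions and collect the results."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "baseline": baseline(question),
            "custom": custom(question),
        })
    return results


def fake_baseline(question):
    """Stand-in for a plain Completion query with no added context."""
    return "I don't know"


def fake_custom(question):
    """Stand-in for the context-augmented query."""
    return "Answer to: " + question


for row in compare_answers(["Who is Emily?"], fake_baseline, fake_custom):
    print(row)
```

In the notebook, the stand-ins would be replaced by `openai_query` and `lambda question: answer_question(question, df)`.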
