# Custom Chatbot Project

For this project, I chose the `character_descriptions.csv` dataset, which contains descriptions of fictional characters from theater, television, and film. Each row gives a character's name, a description covering personality, background, and relationships, the medium, and the setting. Because these characters are invented for the dataset, a model queried without context has nothing specific to draw on. Customizing the chatbot with this data lets it answer questions about the characters, simulate their personalities, or help users explore the stories and settings interactively. This is particularly useful in creative writing, education, or entertainment, where users may want to learn about or interact with fictional characters in depth.

## Data Wrangling

In the cells below, we will load the `character_descriptions.csv` dataset into a pandas dataframe and create a column named `"text"` that combines relevant information for each character. This will prepare the data for use in our custom chatbot.

In [1]:
import pandas as pd

df = pd.read_csv('data/character_descriptions.csv')
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [2]:
# Combine relevant columns into a single 'text' column for each character
df['text'] = df.apply(lambda row: f"Name: {row['Name']}\nDescription: {row['Description']}\nMedium: {row['Medium']}\nSetting: {row['Setting']}", axis=1)
df[['text']].head()

Unnamed: 0,text
0,Name: Emily\nDescription: A young woman in her...
1,Name: Jack\nDescription: A middle-aged man in ...
2,Name: Alice\nDescription: A woman in her late ...
3,"Name: Tom\nDescription: A man in his 50s, Tom ..."
4,Name: Sarah\nDescription: A woman in her mid-2...
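The row-wise `apply` above is easy to read; on larger frames the same column can also be built with vectorized string concatenation. The sketch below is one way to do that, run on a two-row stand-in frame (the `sample` rows are made-up placeholders, not rows from the dataset, and `build_text_column` is a hypothetical helper name). Unlike the `apply` version, it turns missing values into empty strings.

```python
import pandas as pd

# Tiny stand-in frame; these rows are illustrative placeholders only.
sample = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Description": ["A painter.", "A sailor."],
    "Medium": ["Play", "Movie"],
    "Setting": ["England", "USA"],
})

def build_text_column(frame):
    """Join the four descriptive columns into one newline-separated string per row."""
    # fillna("") guards against missing values; astype(str) guards against non-string cells.
    cols = {
        c: frame[c].fillna("").astype(str)
        for c in ["Name", "Description", "Medium", "Setting"]
    }
    return (
        "Name: " + cols["Name"]
        + "\nDescription: " + cols["Description"]
        + "\nMedium: " + cols["Medium"]
        + "\nSetting: " + cols["Setting"]
    )

sample["text"] = build_text_column(sample)
print(sample["text"].iloc[0])
```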


In [3]:
# Get the total number of rows in the dataframe
total_rows = df.shape[0]
print(f"Total number of rows: {total_rows}")

df['text']

Total number of rows: 55


0     Name: Emily\nDescription: A young woman in her...
1     Name: Jack\nDescription: A middle-aged man in ...
2     Name: Alice\nDescription: A woman in her late ...
3     Name: Tom\nDescription: A man in his 50s, Tom ...
4     Name: Sarah\nDescription: A woman in her mid-2...
5     Name: George\nDescription: A man in his early ...
6     Name: Rachel\nDescription: A woman in her late...
7     Name: John\nDescription: A man in his 60s, Joh...
8     Name: Maria\nDescription: A middle-aged Latina...
9     Name: Caleb\nDescription: A young African Amer...
10    Name: Tyler\nDescription: A white man in his m...
11    Name: Sonya\nDescription: A white woman in her...
12    Name: Manuel\nDescription: A middle-aged Hispa...
13    Name: Will\nDescription: A white man in his ea...
14    Name: Mia\nDescription: A young Australian wom...
15    Name: Lucas\nDescription: A middle-aged Austra...
16    Name: Tahlia\nDescription: A young Indigenous ...
17    Name: Max\nDescription: A white Australian

## Custom Query Completion

In the cells below, we will compose a custom query using the character descriptions dataset and retrieve results from an OpenAI Completion model. We will compare the chatbot's performance with and without the custom prompt.

##### Step 1: Setting Up the OpenAI API Key
In the next cell, we will import the necessary libraries and configure the OpenAI API for use in this notebook. Make sure to replace `"YOUR_API_KEY"` with your actual API key.

In [None]:
import os
import numpy as np
import time
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR_API_KEY"

##### Step 2: Basic (no context) Response generator functions

In the next cell, we define a helper function that sends a question to the OpenAI Completion API with no additional context. It serves as the baseline against which the custom, context-augmented completion is compared.

In [5]:
# Helper function to get a basic completion from the model alone (no custom data, no chat memory)
def basic_completion(prompt, model="gpt-3.5-turbo-instruct"):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=200,
        temperature=0.7
    )
    return response.choices[0].text.strip()


# ## Code used to verify the structure of the 'response' output

# response = openai.Completion.create(
#         model="gpt-3.5-turbo-instruct",
#         prompt="Who is Emily and what is her personality like?",
#         max_tokens=200,
#         temperature=0.7
#     )
# print(response)
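With `max_tokens=200`, a long answer can be cut off mid-sentence, as happens in the second basic completion later in this notebook. The Completion response reports this through `choices[0].finish_reason`, which is `"length"` when the token cap ended generation and `"stop"` when the model finished on its own. The sketch below shows one way to surface that flag. It makes no API call: `fake_response` is a stand-in built with `SimpleNamespace` that mimics only the two fields used, and `extract_text` is a hypothetical helper name.

```python
from types import SimpleNamespace

def extract_text(response):
    """Return (text, truncated) from a Completion-style response object."""
    choice = response.choices[0]
    # finish_reason == "length" means generation stopped at the max_tokens cap.
    truncated = choice.finish_reason == "length"
    return choice.text.strip(), truncated

# Stand-in object mimicking only the fields used above; not a real API response.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            text="  An answer that was cut off mid-sen",
            finish_reason="length",
        )
    ]
)

text, truncated = extract_text(fake_response)
print(text)
print("truncated:", truncated)
```

A caller could use the flag to append an ellipsis or to retry with a larger `max_tokens`.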

##### Step 3: Generate Embeddings for Each Row

We will use OpenAI's Embedding API to generate an embedding for each character description. This will allow us to later compare the relevance of each row to a user's question.

In [6]:
def get_embedding(text, model="text-embedding-ada-002"):
    # Retry after a 1-second pause whenever the API reports a rate limit (no cap on retries)
    while True:
        try:
            result = openai.Embedding.create(
                input=text,
                model=model
            )
            return result["data"][0]["embedding"]
        except openai.error.RateLimitError:
            time.sleep(1)

def compute_embeddings_for_df(df, text_col="text", embedding_col="embedding"):
    embeddings = []
    for text in df[text_col]:
        emb = get_embedding(text)
        embeddings.append(emb)
    df[embedding_col] = embeddings
    return df

# Only run this once, or cache the embeddings to avoid repeated API calls
# df = compute_embeddings_for_df(df)
# df.to_pickle('data/character_descriptions_with_embeddings.pkl')
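`get_embedding` sleeps for one second and retries for as long as the rate limit persists. Where an upper bound on waiting is preferable, a bounded retry with exponential backoff is a common alternative. The sketch below is one possible shape for it, not part of the notebook's pipeline: `TransientError` stands in for `openai.error.RateLimitError`, `with_retries` and `flaky` are hypothetical names, and the `sleep` function is injectable so the demo runs instantly.

```python
import time

class TransientError(Exception):
    """Stand-in for a rate-limit exception such as openai.error.RateLimitError."""

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error instead of looping forever.
            sleep(base_delay * (2 ** attempt))

# Demo: a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return [0.1, 0.2, 0.3]

delays = []
result = with_retries(flaky, sleep=delays.append)  # Record delays instead of sleeping.
print(result, delays)
```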

##### Step 4: Load or Generate Embeddings

To avoid repeated API calls, we recommend saving the dataframe with embeddings. If you have already generated embeddings, load them from disk. Otherwise, generate and save them.

In [7]:
EMBEDDINGS_PATH = 'data/character_descriptions_with_embeddings.pkl'
if os.path.exists(EMBEDDINGS_PATH):
    df = pd.read_pickle(EMBEDDINGS_PATH)
else:
    df = compute_embeddings_for_df(df)
    df.to_pickle(EMBEDDINGS_PATH)
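The load-or-generate pattern can be checked without spending API calls. The sketch below repeats it in a temporary directory with a fake embedding function, so the second call is served from the pickle on disk. `load_or_compute` and `fake_embed` are hypothetical names introduced for illustration. One caveat that applies to the real cache as well: if the CSV changes, the pickle must be deleted to force regeneration.

```python
import os
import tempfile

import pandas as pd

def load_or_compute(path, frame, embed_fn):
    """Load a cached dataframe from path, or compute embeddings and write the cache."""
    if os.path.exists(path):
        return pd.read_pickle(path), True   # True: served from cache
    out = frame.copy()
    out["embedding"] = [embed_fn(t) for t in out["text"]]
    out.to_pickle(path)
    return out, False                       # False: freshly computed

# Hypothetical stand-in for an embedding call, so the sketch runs offline.
def fake_embed(text):
    return [float(len(text)), 1.0]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "embeddings.pkl")
    toy = pd.DataFrame({"text": ["short", "a longer text"]})
    first, first_cached = load_or_compute(cache_path, toy, fake_embed)
    second, second_cached = load_or_compute(cache_path, toy, fake_embed)

print(first_cached, second_cached)
```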

##### Step 5: Compute Relevance Using Cosine Similarity

We will now implement a function to rank the rows by their relevance to a user's question, using cosine similarity between embeddings.

In [8]:
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_rows_sorted_by_relevance(question, df, embedding_col="embedding", text_col="text", model="text-embedding-ada-002"):
    question_emb = get_embedding(question, model=model)
    similarities = df[embedding_col].apply(lambda emb: cosine_similarity(question_emb, emb))
    sorted_df = df.assign(similarity=similarities).sort_values(by="similarity", ascending=False)
    return sorted_df
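To see the ranking behave as expected, it helps to run it on vectors whose similarities are known in advance. The sketch below computes cosine similarity for all rows in one matrix operation and sorts the row indices by score. The 2-D vectors are toy values (real `text-embedding-ada-002` vectors have 1536 dimensions), and `rank_by_cosine` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

def rank_by_cosine(query_vec, matrix):
    """Return row indices sorted from most to least similar, plus the scores."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(matrix, dtype=float)
    # Cosine similarity for every row at once: dot products divided by the norms.
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return order, sims

# Toy 2-D "embeddings": identical, orthogonal, and 45 degrees from the query.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order, sims = rank_by_cosine([1.0, 0.0], rows)
print(order.tolist())                # [0, 2, 1]
print(np.round(sims, 3).tolist())    # [1.0, 0.0, 0.707]
```

With the real dataframe, the matrix could be built as `np.array(df["embedding"].tolist())`, assuming every row holds a vector of the same length.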

##### Step 6: Token Counting and Prompt Construction

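We will use tiktoken to count tokens and keep the prompt within the model's context window. Rows are added in order of relevance until the context plus the question would exceed `token_limit`. The instruction text around the context and the 200 tokens reserved for the completion are not counted, so `token_limit` is set below the model's maximum to leave headroom.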

In [9]:
import tiktoken

def create_prompt(question, df, token_limit=3500, text_col="text"):
    encoding = tiktoken.get_encoding("cl100k_base")
    question_tokens = len(encoding.encode(question))
    context = []
    context_tokens = 0
    for text in df[text_col]:
        tokens = len(encoding.encode(text))
        if context_tokens + tokens + question_tokens > token_limit:
            break
        context.append(text)
        context_tokens += tokens
    context_str = "\n\n".join(context)
    prompt = f"Below is information about several characters, including their names, descriptions, and backgrounds:\n{context_str}\n\nWhen answering the user's question, always reference specific character names and details from the context above. Be concise, informative, and conversational. If the answer is not found in the context, say so honestly.\n\nUser question: {question}\nAnswer:"
    return prompt
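The budgeting loop in `create_prompt` is greedy: it walks the rows in relevance order and stops at the first one that does not fit, without skipping ahead to shorter rows. The sketch below isolates that logic so it can be run offline. It swaps tiktoken for a crude stand-in tokenizer that counts whitespace-separated words; `pack_context` and `word_count` are hypothetical names.

```python
def pack_context(question, texts, token_limit, count_tokens):
    """Greedily add texts (assumed sorted by relevance) until the budget is used up."""
    used = count_tokens(question)
    chosen = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > token_limit:
            break  # Stop at the first text that does not fit, as create_prompt does.
        chosen.append(text)
        used += cost
    return chosen, used

# Crude stand-in tokenizer: one token per whitespace-separated word.
# Real code would use len(encoding.encode(text)) from tiktoken instead.
def word_count(text):
    return len(text.split())

texts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
chosen, used = pack_context("who is alpha", texts, token_limit=9, count_tokens=word_count)
print(chosen, used)  # ['alpha beta gamma', 'delta epsilon'] 8
```

Note that `"kappa"` would have fit in the remaining budget but is never considered. That matches the `break` in the original loop and keeps the context strictly in relevance order.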

##### Step 7: Answering Questions with Relevant Context

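We will now create the `custom_completion` function. It ranks the rows by relevance to the question, builds a prompt from the top-ranked rows that fit the token budget, and sends that prompt to the Completion model.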

In [10]:
def custom_completion(question, df, model="gpt-3.5-turbo-instruct", token_limit=3500):
    relevant_df = get_rows_sorted_by_relevance(question, df)
    prompt = create_prompt(question, relevant_df, token_limit=token_limit)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=200,
        temperature=0.7
    )
    return response.choices[0].text.strip()


In [11]:
# # Earlier version of custom_completion, kept for reference: it uses a fixed slice of rows as context instead of relevance ranking
# def custom_completion(user_question, df, model="gpt-3.5-turbo-instruct"):
#     # Create a context from the character descriptions

#     # Use the first 40 rows for context
#     context = "\n\n".join(df['text'].head(40))
    
#     # ## Other context options:
    
#     # Random sample of 40 rows
#     # context = "\n\n".join(df['text'].sample(40, random_state=42))

#     # Use all rows for context
#     # context = "\n\n".join(df['text'])
    
#     # Create a prompt for the model

#     # Detailed prompt
#     prompt = f"Below is information about several characters, including their names, descriptions,\
#           and backgrounds:\n" f"{context}\n\n\
#             When answering the user's question, always reference specific character names and details\
#                   from the context above.\
#                     Be concise, informative, and conversational. If the answer is not found in the\
#                           context, say so honestly.\
#                             \n\nUser question: {user_question}\n\
#                             Answer:"
    
#     # Compact Prompt
#     # prompt = f"You are an expert on fictional characters. Here is the information about some characters:\n{context}\n\nUser question: {user_question}\nAnswer:"
#     response = openai.Completion.create(
#         model=model,
#         prompt=prompt,
#         max_tokens=200,
#         temperature=0.7
#     )
#     return response.choices[0].text.strip()

# # ## Code used to verify the structure of the 'response' output

# # context = "\n\n".join(df['text'].head(40))
# # user_question = "Who is Emily and what is her personality like?"
# # prompt = f"Below is information about several characters, including their names, descriptions,\
# #           and backgrounds:\n" f"{context}\n\n\
# #             When answering the user's question, always reference specific character names and details\
# #                   from the context above.\
# #                     Be concise, informative, and conversational. If the answer is not found in the\
# #                           context, say so honestly.\
# #                             \n\nUser question: {user_question}\n\
# #                             Answer:"
# # response = openai.Completion.create(
# #         model="gpt-3.5-turbo-instruct",
# #         prompt=prompt,
# #         max_tokens=200,
# #         temperature=0.7
# #     )
# # print(response)

## Custom Performance Demonstration

Below, we demonstrate the performance of the chatbot with two example questions, showing both the basic and custom completions.

### Question 1

Who is Emily and what is her personality like?

In [12]:
# Basic completion
question1 = "Who is Emily and what is her personality like?"
print(f"Basic Completion:\n\n Question: {question1}\n\n Response:",basic_completion(question1))

Basic Completion:

 Question: Who is Emily and what is her personality like?

 Response: Emily is a fictional character and there is not enough information provided to accurately describe her personality. It is up to the author or creator to determine her personality traits and characteristics.


In [13]:
# Custom completion
print(f"Custom Completion:\n\n Question: {question1}\n\n Response:",custom_completion(question1, df))

Custom Completion:

 Question: Who is Emily and what is her personality like?

 Response: Emily is a young aspiring actress in her early 20s who is the daughter of Alice. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She is also in a relationship with George, a charming and ambitious businessman.


### Question 2

Which character is a retired soldier and what challenges does he face?

In [14]:
# Basic completion
question2 = "Which character is a retired soldier and what challenges does he face?"
print(f"Basic Completion:\n\n Question: {question2}\n\n Response:",basic_completion(question2))


Basic Completion:

 Question: Which character is a retired soldier and what challenges does he face?

 Response: One possible character that fits this description is John, a retired soldier who served in the military for 25 years before retiring. He faces several challenges in his transition to civilian life, including:

1. Adjusting to a new routine: After being in the military for so long, John is used to a structured and disciplined lifestyle. However, in retirement, he has to create a new routine for himself, which can be overwhelming and disorienting.

2. Dealing with physical and mental health issues: John may have injuries or disabilities from his time in the military, which can make it difficult for him to perform daily tasks or find employment. He may also struggle with mental health issues such as post-traumatic stress disorder (PTSD) from his experiences in combat.

3. Financial struggles: Depending on his rank and length of service, John may not receive a substantial pensio

In [15]:
# Custom completion
print(f"Custom Completion:\n\n Question: {question2}\n\n Response:",custom_completion(question2, df))


Custom Completion:

 Question: Which character is a retired soldier and what challenges does he face?

 Response: Tom, a retired soldier and John's son, faces challenges with PTSD and adjusting to civilian life in England.


## Interactive Chatbot Loop

You can now interact with the chatbot by typing your own questions below. Type 'exit' or 'quit' to end the conversation.

In [16]:
# Interactive chatbot loop
print("Welcome to the Custom Character Chatbot! Type your question (or 'exit' to quit).")
while True:
    user_question = input("You: ")
    if user_question.lower() in ["exit", "quit"]:
        print("Goodbye!")
        break
    response = custom_completion(user_question, df)
    print("Bot:", response)

Welcome to the Custom Character Chatbot! Type your question (or 'exit' to quit).
Bot: Captain James is a charismatic and dashing character in the sitcom medium set in the USA. He is the captain of the local militia and enjoys flirting with the women of the town. He has a friendly rivalry with Reverend Brown and is known for his charm and wit.
Bot: Can you tell me about a character named Maria?

Sure! Maria is a middle-aged Latina woman in her 40s who owns a small family-run diner in a small Texas town. She's a hard-working single mother who is fiercely protective of her teenage daughter, Sofia. Maria is always trying to balance work and family, and she's a strong and resilient woman who is admired by her community. She's a central character in the movie set in Texas, and her story showcases the challenges and triumphs of being a working single mother in a small town.
Goodbye!
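The loop above depends on `input()` and live API calls, which makes it awkward to test. One option is to pass the answer, input, and output functions in as parameters, so the same loop can be driven by a script. The sketch below does this; `run_chat` is a hypothetical helper, and the demo uses canned inputs with an echo function in place of the model. It also skips blank lines and strips surrounding whitespace before checking for `exit` or `quit`.

```python
def run_chat(answer_fn, input_fn=input, output_fn=print):
    """Read questions until 'exit' or 'quit'; return the number of answers given."""
    answered = 0
    while True:
        user_question = input_fn("You: ").strip()
        if user_question.lower() in ("exit", "quit"):
            output_fn("Goodbye!")
            return answered
        if not user_question:
            continue  # Ignore empty lines instead of sending them to the model.
        output_fn("Bot: " + answer_fn(user_question))
        answered += 1

# Scripted demo: canned inputs and an echo "model" stand in for input() and the API.
scripted = iter(["Who is Maria?", "   ", "QUIT"])
transcript = []
count = run_chat(
    answer_fn=lambda q: f"(answer to: {q})",
    input_fn=lambda prompt: next(scripted),
    output_fn=transcript.append,
)
print(count, transcript)
```

In the notebook, the interactive version would be started with `run_chat(lambda q: custom_completion(q, df))`.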

