# Custom Chatbot Project

# Character Description Query System

## Project Objective
Develop a query system that retrieves the most relevant character descriptions via semantic embeddings and supplies them as context to a language model.

## Dataset
- Source: `character_descriptions.csv`
- Content: AI-generated character descriptions from various media productions

## Key Technical Approach
- Semantic embedding generation
- Contextual relevance retrieval
- Custom prompt engineering

## Dataset Selection Rationale

The chosen dataset of AI-generated character descriptions is a useful test case for a retrieval-based query system. Because these artificially created characters are unlikely to appear in a language model's training data, correct answers have to come from the supplied context. This lets us:

- Test semantic embedding effectiveness
- Demonstrate context retrieval capabilities
- Showcase how custom prompting can enhance language model responses

In [1]:
import pandas as pd
import numpy  as np
import re
import ast
import openai
from   openai.embeddings_utils import get_embedding, distances_from_embeddings

In [2]:
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key  = "OpenAI API KEY"

In [3]:
pd.set_option('display.max_colwidth', None)
# ===========================================
data_path = "data/character_descriptions.csv"
df        = pd.read_csv(data_path)
df.head(5)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.",Play,England


## Data Wrangling

### Data Preparation Objectives

- Transform raw character data into embedding-ready format
- Create comprehensive text representations
- Prepare for semantic search and contextual query processing

In [4]:
df['text'] = df['Name'] + ', residing in ' + df['Setting'] + ', who performs in a ' + df['Medium'] + ', ' + df['Description']
df[['text']].sample(5)

Unnamed: 0,text
42,"Malvolio, residing in Ancient Greece, who performs in a Play, A pompous and self-righteous steward in Lady Olivia's household. Malvolio is humorless and uptight, and is often the target of Sir Toby Belch's pranks. He is secretly in love with Lady Olivia and harbors dreams of marrying her."
13,"Will, residing in Texas, who performs in a Movie, A white man in his early 40s, Will is a successful businessman who's come back to his hometown after many years away. He's confident, charming, and knows how to get what he wants. However, he's also hiding a dark secret from his past that threatens to destroy everything he's worked for."
14,"Mia, residing in Australia, who performs in a Limited Series, A young Australian woman in her mid-20s, Mia is a driven and ambitious lawyer who's just landed her dream job at a top law firm in Sydney. She's the younger sister of Max, a former soldier who's struggling with PTSD, and is trying to help him navigate his challenges while also balancing her demanding career."
45,"Bianca, residing in Ancient Greece, who performs in a Play, Lady Olivia's cunning and quick-witted maid. Bianca is a master of mischief and pranks, and often collaborates with Sir Toby Belch to torment Malvolio. She is also secretly in love with Sir Toby."
48,"Antonio, residing in Ancient Greece, who performs in a Play, A sea captain who rescues Sebastian from the shipwreck. Antonio becomes devoted to Sebastian and follows him to Illyria, despite the danger it poses to himself."


## Baseline Queries with the Completion Model

- Sample two characters at random
- Ask `gpt-3.5-turbo-instruct` about each character's personality and setting, supplying only the raw `Description` field
- Establish a reference point before adding embedding-based retrieval

In [5]:
sampled_df = df.sample(2).reset_index(drop=True)

for _, row in sampled_df.iterrows():
    character = row['Name']
    prompt1   = f"What is {character}'s character?"
    prompt2   = f"In what setting does {character}'s story take place?"

    print(f'Prompt1: {prompt1}')
    answer1   = openai.Completion.create(
                                        model      = "gpt-3.5-turbo-instruct",
                                        prompt     = f"Based on this description: {row['Description']}, {prompt1}",
                                        max_tokens = 100
                                       )["choices"][0]["text"].strip()
    print(answer1)

    
    print(f'\nPrompt2: {prompt2}')
    answer2   = openai.Completion.create(
                                         model      = "gpt-3.5-turbo-instruct",
                                         prompt     = f"Based on this description: {row['Description']}, {prompt2}",
                                         max_tokens = 100
                                        )["choices"][0]["text"].strip()
    print(answer2 + '\n')
    print("=============================================")

Prompt1: What is Donna's character?
Donna's character can be described as a larger-than-life and confident performer who exudes a diva-like personality on stage. She is also a mentor and guide to help others improve their craft.

Prompt2: In what setting does Donna's story take place?
It is likely that Donna's story takes place in the world of performance or entertainment, specifically in a theatrical or musical setting. This could include Broadway, a concert tour, or a drag show.

Prompt1: What is Thomas's character?
Thomas's character is charming and playful, with a good sense of humor and a willingness to go along with his friend's schemes. He brings a lighthearted energy to the group and is well-liked by his coworkers.

Prompt2: In what setting does Thomas's story take place?
Thomas's story likely takes place in a medieval or fantasy tavern setting.



## Semantic Embedding Generation

### Objective
- Use OpenAI's `text-embedding-ada-002` model
- Convert text descriptions to high-dimensional vector representations
- Prepare data for semantic similarity search
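
For larger datasets it may be preferable to split the texts into batches rather than send them all in one request. The sketch below is an illustration under assumptions, not part of the notebook: `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names, and `embed_fn` stands in for any callable that returns one vector per input text.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn is a hypothetical callable: it takes a list of strings and
    returns one embedding (a list of floats) per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder for demonstration: [length, vowel count]."""
    return [
        [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
        for t in batch
    ]


demo_vectors = embed_in_batches(["Emily", "Jack", "Alice"], fake_embed, batch_size=2)
```

With the API style used in this notebook, `embed_fn` could plausibly wrap `openai.Embedding.create` and extract the `"embedding"` field from each item in the response's `"data"` list.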

In [6]:
response         = openai.Embedding.create(input  = df["text"].tolist(), 
                                           engine = "text-embedding-ada-002")

embeddings       = [data["embedding"] for data in response["data"]]
df["embeddings"] = embeddings

In [7]:
print(type(df['embeddings'].iloc[0]))
print(df['embeddings'].iloc[0][:5])  # First 5 elements


<class 'list'>
[-0.012010179460048676, -0.01708943024277687, -0.008300396613776684, -0.024174664169549942, -0.03551618382334709]


In [8]:
pd.set_option('display.max_colwidth', 300)
# ========================================
df["embeddings"] = df["embeddings"].apply(np.array)
df               = df[['text', 'embeddings']]
df[['text', 'embeddings']].head(3)

Unnamed: 0,text,embeddings
0,"Emily, residing in England, who performs in a Play, A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.","[-0.012010179460048676, -0.01708943024277687, -0.008300396613776684, -0.024174664169549942, -0.03551618382334709, 0.030269766226410866, -0.011727283708751202, 0.02576916292309761, -0.0065965973772108555, -0.012781711295247078, -0.005583961494266987, -0.007593159098178148, 0.015032012015581131, -..."
1,"Jack, residing in England, who performs in a Play, A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.","[0.013250970281660557, -0.02648891881108284, 0.002238339511677623, -0.03300045058131218, -0.02852051705121994, 0.021058298647403717, -0.009389631450176239, 0.010737518779933453, 0.005525036249309778, -0.007384079042822123, 0.0029432130977511406, 0.006928271614015102, 0.009936599992215633, -0.009..."
2,"Alice, residing in England, who performs in a Play, A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.","[0.006659980397671461, -0.013561666943132877, -0.015978727489709854, -0.03441371023654938, -0.03731418028473854, 0.035093098878860474, -0.009380806237459183, 0.026117313653230667, 0.0023027395363897085, -0.014606881886720657, -0.008368253707885742, 0.004883114714175463, 0.006153704132884741, -0...."


In [9]:
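# Store each embedding as a plain list: NumPy abbreviates the string form of
# large arrays, which would make the saved column impossible to parse back.
df.assign(embeddings=df["embeddings"].apply(lambda e: e.tolist())).to_csv('DataFrame_With_Embeddings.csv')

In [10]:
# df = pd.read_csv("DataFrame_With_Embeddings.csv", index_col=0)
# df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
# df = df[["text", "embeddings"]]
# df.head()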
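
Embeddings stored in a CSV file come back as strings, so the round trip only works if the saved string is a complete list literal. NumPy abbreviates the printed form of long arrays, so one cautious approach is to convert each vector to a plain list before saving and parse it with `ast.literal_eval` on load. The sketch below illustrates this on a made-up toy frame and an in-memory buffer; `toy`, `to_save`, and `restored` are illustrative names, not part of the notebook.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's DataFrame (values are made up).
toy = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [np.array([0.1, -0.2, 0.3]), np.array([0.4, 0.5, -0.6])],
})

# Serialize each vector as a plain Python list so its string form is complete.
to_save = toy.assign(embeddings=toy["embeddings"].apply(lambda v: v.tolist()))
buffer = io.StringIO()
to_save.to_csv(buffer)

# Reload and parse each list literal back into a NumPy array.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
restored["embeddings"] = restored["embeddings"].apply(
    lambda s: np.array(ast.literal_eval(s))
)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for parsing file contents.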

## Semantic Relevance Ranking Function

### Purpose
- Find most relevant text for a given question
- Use semantic similarity via embeddings
- Rank text by cosine distance

### Key Components
- Convert question to embedding
- Calculate embedding distances
- Sort DataFrame by relevance
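
To make the ranking step concrete, here is a minimal NumPy sketch of cosine distance on made-up two-dimensional vectors. It does not use the OpenAI helper; `cosine_distances` and the toy values are assumptions for illustration only.

```python
import numpy as np


def cosine_distances(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Return 1 - cosine similarity between query and each row of vectors."""
    unit_query = query / np.linalg.norm(query)
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit_rows @ unit_query


# Toy 2-D "embeddings" (made-up values, not real model output).
toy_vectors = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
])
toy_query = np.array([2.0, 0.0])

toy_distances = cosine_distances(toy_query, toy_vectors)
ranking = np.argsort(toy_distances)  # row indices from most to least relevant
```

A distance of 0 means the vectors point in the same direction, so sorting in ascending order puts the most relevant rows first.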

In [11]:
def get_rows_sorted_by_relevance(question, df):
    
    """ This function finds the most relevant text for a question 
        by comparing semantic similarity via embeddings. It returns 
        the dataframe sorted by relevance (most relevant first).    """
    
    question_embeddings = get_embedding(question, engine = "text-embedding-ada-002")
    
    # Calculate distances efficiently and sort
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
                                                    question_embeddings,
                                                    df_copy["embeddings"].values,
                                                    distance_metric = "cosine"
                                                    )
    
    return df_copy.sort_values("distances", ascending=True)
    

In [12]:
#df['embeddings'][0].shape

In [13]:
question1 = "What is Jack's profession?"
sorted_df1 = get_rows_sorted_by_relevance(question1, df)
sorted_df1.head(3)

Unnamed: 0,text,embeddings,distances
1,"Jack, residing in England, who performs in a Play, A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and family. He's married to Alice.","[0.013250970281660557, -0.02648891881108284, 0.002238339511677623, -0.03300045058131218, -0.02852051705121994, 0.021058298647403717, -0.009389631450176239, 0.010737518779933453, 0.005525036249309778, -0.007384079042822123, 0.0029432130977511406, 0.006928271614015102, 0.009936599992215633, -0.009...",0.140983
4,"Sarah, residing in England, who performs in a Play, A woman in her mid-20s, Sarah is a free-spirited artist and Jack's employee. She's creative, unconventional, and passionate about her work. However, she can also be flighty and impulsive at times.","[-0.007598050870001316, -0.029405105859041214, -0.00887738075107336, -0.01937827654182911, -0.03052208572626114, 0.010396990925073624, -0.024911217391490936, 0.013338800519704819, -0.01844313181936741, -0.011812696233391762, 0.010585319250822067, 0.007909765467047691, 0.008046140894293785, -0.01...",0.181321
7,"John, residing in England, who performs in a Play, A man in his 60s, John is a retired professor and Tom's father. He has a dry wit and a love of intellectual debate, but can also be stubborn and set in his ways.","[0.020502500236034393, -0.016809985041618347, -0.019650381058454514, -0.026880482211709023, -0.020076440647244453, 0.04183129593729973, -0.012878617271780968, 0.025331174954771996, -0.033413395285606384, -0.0002644716005306691, -0.002861376851797104, 0.016009509563446045, 0.016616320237517357, 0...",0.19527


In [14]:
question2 = "Suggest two names of older female actresses who would be perfect for an American sitcom."
sorted_df2 = get_rows_sorted_by_relevance(question2, df)
sorted_df2.head(3)

Unnamed: 0,text,embeddings,distances
0,"Emily, residing in England, who performs in a Play, A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.","[-0.012010179460048676, -0.01708943024277687, -0.008300396613776684, -0.024174664169549942, -0.03551618382334709, 0.030269766226410866, -0.011727283708751202, 0.02576916292309761, -0.0065965973772108555, -0.012781711295247078, -0.005583961494266987, -0.007593159098178148, 0.015032012015581131, -...",0.214187
53,"Mrs. Mercer, residing in USA, who performs in a Sitcom, The matriarch of the wealthiest family in Williamsburg. Mrs. Mercer is a bit of a snob and enjoys reminding everyone of her social standing. She often hires Abigail to work in her home and is very demanding.","[-0.015823304653167725, -0.014490398578345776, -0.012359069660305977, -0.03286074101924896, 0.015585756860673428, 0.031145120039582253, -0.007680702954530716, 0.001680978573858738, -0.01834394782781601, -0.02715959958732128, 0.014925902709364891, 0.005249140318483114, 0.006578746717423201, -0.00...",0.215965
8,"Maria, residing in Texas, who performs in a Movie, A middle-aged Latina woman in her 40s, Maria is a hard-working single mother who owns a small family-run diner in a small Texas town. She's fiercely protective of her teenage daughter, Sofia, and is always trying to balance work and family.","[-0.012487166561186314, -0.017634496092796326, -0.012305019423365593, -0.030033962801098824, -0.016878925263881683, 0.006715815514326096, -0.01276375912129879, 0.014463795349001884, -0.018686899915337563, -0.036780137568712234, 0.0189567469060421, 0.003501263912767172, 0.009471626952290535, 0.01...",0.216364


## Context-Aware Prompt Generation

Prompt creation assembles the most relevant context that fits within a token limit. The function uses tiktoken to count tokens and adds rows to the context until the limit is reached.

By sorting text rows based on semantic similarity and incrementally adding them to the context, the function ensures that the most pertinent information is included first. The separator "###" allows clear demarcation between different contextual snippets, enabling the model to distinguish between discrete pieces of information.

The prompt template instructs the model to answer from the given context or to reply "I don't know" when the context is insufficient. This keeps the prompt focused on the retrieved material and makes accurate, relevant answers more likely.
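
The greedy budgeting idea can be tried without any API or tokenizer dependency. The sketch below is a simplified illustration: `build_context` and `count_words` are hypothetical names, and the whitespace word count is only a stand-in for a real token count.

```python
from typing import Callable, List


def build_context(
    ranked_texts: List[str],
    budget: int,
    count_tokens: Callable[[str], int],
    separator: str = "\n\n###\n\n",
) -> str:
    """Greedily add texts (most relevant first) until the budget is used up."""
    chosen: List[str] = []
    used = 0
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first row that does not fit
        chosen.append(text)
        used += cost
    return separator.join(chosen)


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


ranked = [
    "Jack is a businessman",
    "Sarah is an artist",
    "John is a retired professor",
]
context = build_context(ranked, budget=8, count_tokens=count_words)
```

With a real tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`. Note that this sketch, like the word count it uses, ignores the tokens consumed by the separator itself.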

In [15]:
import tiktoken

prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""


# =============================================================
# =============================================================

def create_prompt(question, df, max_token_count, prompt_template=prompt_template):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    
    current_token_count =   len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count     = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [26]:
# create prompt for question 1
max_token_count = 300
print(create_prompt(question1, df, max_token_count))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Captain James, residing in USA, who performs in a Sitcom, The charismatic and dashing captain of the local militia. Captain James is a ladies' man and enjoys flirting with the women of the town. He has a friendly rivalry with Reverend Brown and often teases him about his piousness.

###

James, residing in USA, who performs in a Reality Show, A handsome and athletic personal trainer, James is always up for a challenge. He's looking for someone who is as passionate about fitness as he is, and who can keep up with his intense workout regimen. He can sometimes come across as a bit too competitive, but his heart is always in the right place.

###

Jack, residing in England, who performs in a Play, A middle-aged man in his 40s, Jack is a successful businessman and Sarah's boss. He has a no-nonsense attitude, but is fiercely loyal to his friends and fami

In [17]:
max_token_count = 300
print(create_prompt(question2, df, max_token_count))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Emily, residing in England, who performs in a Play, A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.

###

Mrs. Mercer, residing in USA, who performs in a Sitcom, The matriarch of the wealthiest family in Williamsburg. Mrs. Mercer is a bit of a snob and enjoys reminding everyone of her social standing. She often hires Abigail to work in her home and is very demanding.

###

Maria, residing in Texas, who performs in a Movie, A middle-aged Latina woman in her 40s, Maria is a hard-working single mother who owns a small family-run diner in a small Texas town. She's fiercely protective of her teenage daughter, Sofia, and is always trying to balance work and family.

---

Question: Suggest

## Question Answering Function with Semantic Context

### Function Purpose
- Generate answers using OpenAI Completion model
- Leverage semantically relevant context
- Handle potential errors gracefully

### Key Parameters
- `question`: User's input query
- `df`: DataFrame with `text` and `embeddings` columns
- `max_prompt_tokens`: Prompt token limit, including template and question (default: 1800)
- `max_answer_tokens`: Response length limit (default: 150)

### Workflow
1. Create context-enriched prompt
2. Request model completion
3. Return stripped response
4. Handle exceptions by returning empty string

### Model Configuration
- Uses `gpt-3.5-turbo-instruct`
- Dynamically generates context-aware prompts

In [18]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(question, df, max_prompt_tokens = 1800, max_answer_tokens = 150):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
                                            model      = COMPLETION_MODEL_NAME,
                                            prompt     = prompt,
                                            max_tokens = max_answer_tokens
                                        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [19]:
answer1 = answer_question(question1, df)
print(answer1)

Successful businessman.


In [20]:
answer2 = answer_question(question2, df)
print(answer2)

Mrs. Mercer and Donna.


## Query Performance Demonstration

Comparative analysis of:
- Standard language model responses
- Custom context-enhanced responses
- Measuring improvement in answer relevance and accuracy
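
One way to organize such a comparison is to ask the same question through both paths and collect the answers side by side. The sketch below is hypothetical: `compare_answers`, `baseline_fn`, and `contextual_fn` are assumed names, and the stub replies are placeholders rather than real model output.

```python
from typing import Callable, Dict


def compare_answers(
    question: str,
    baseline_fn: Callable[[str], str],
    contextual_fn: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the same question both ways and return the answers side by side."""
    return {
        "question": question,
        "baseline": baseline_fn(question),
        "contextual": contextual_fn(question),
    }


# Stubs with placeholder replies; they only demonstrate the result's shape.
def stub_baseline(question: str) -> str:
    return "placeholder baseline reply"


def stub_contextual(question: str) -> str:
    return "placeholder context-enhanced reply"


comparison = compare_answers(
    "What is Jack's profession?", stub_baseline, stub_contextual
)
```

In this notebook, `contextual_fn` could plausibly be `lambda q: answer_question(q, df)`, while `baseline_fn` would send the question to the completion model without any retrieved context.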

### Question 1

In [21]:
question1 = "What does Captain James enjoy?"

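# Note: df['text'] is a pandas Series, so the f-string below embeds the printed
# form of the whole column (subject to pandas display truncation).
answer1 = openai.Completion.create(
                                    model      = "gpt-3.5-turbo-instruct",
                                    prompt     = f"Based on this description: {df['text']}, {question1}",
                                    max_tokens = 150
                                )["choices"][0]["text"].strip()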
print(answer1)

Flirting with women and teasing Reverend Brown about his seriousness.


### Question 2

In [22]:
question2 = "In what setting does Sir Toby Belch's story take place?"

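# Note: df['text'] is a pandas Series, so the f-string below embeds the printed
# form of the whole column (subject to pandas display truncation).
answer2   = openai.Completion.create(
                                     model      = "gpt-3.5-turbo-instruct",
                                     prompt     = f"Based on this description: {df['text']}, {question2}",
                                     max_tokens = 150
                                    )["choices"][0]["text"].strip()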
print(answer2)

The setting for Sir Toby Belch's story takes place in Illyria, an ancient Greece setting.
