# Custom Chatbot Project

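This project uses `data/character_descriptions.csv`, a table of characters with a name, a description, a medium (for example, play or opera), and a setting. The character details are specific to this file, so the base model cannot answer questions about them without being given the relevant rows as context, which makes the dataset suitable for a retrieval-augmented chatbot.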

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [2]:
import pandas as pd

In [13]:
df = pd.read_csv('data/character_descriptions.csv')
df.head()

| | Name | Description | Medium | Setting |
|---|---|---|---|---|
| 0 | Emily | A young woman in her early 20s, Emily is an as... | Play | England |
| 1 | Jack | A middle-aged man in his 40s, Jack is a succes... | Play | England |
| 2 | Alice | A woman in her late 30s, Alice is a warm and n... | Play | England |
| 3 | Tom | A man in his 50s, Tom is a retired soldier and... | Play | England |
| 4 | Sarah | A woman in her mid-20s, Sarah is a free-spirit... | Play | England |


In [28]:
# Combine all fields into a single string per character
df['text'] = (
    df['Name'] + ': ' + df['Description']
    + ' Medium: ' + df['Medium']
    + ' Setting: ' + df['Setting']
)
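
The concatenation above produces one string per character. As a minimal sketch of the resulting format, the plain-Python version below builds the same kind of string from a single record. The helper `build_text` is hypothetical, and the description is shortened from the first row shown above for illustration.

```python
def build_text(row):
    """Hypothetical helper: join one record's fields in the notebook's format."""
    return (
        f"{row['Name']}: {row['Description']} "
        f"Medium: {row['Medium']} Setting: {row['Setting']}"
    )


# Record abbreviated from the first row of the dataset (description shortened)
emily = {
    "Name": "Emily",
    "Description": "A young woman in her early 20s.",
    "Medium": "Play",
    "Setting": "England",
}
emily_text = build_text(emily)
print(emily_text)
# Emily: A young woman in her early 20s. Medium: Play Setting: England
```

Keeping the field labels (`Medium:`, `Setting:`) inside the text lets the completion model find those values in the retrieved context.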

In [136]:
df.tail()

| | Name | Description | Medium | Setting | text | embeddings | distances |
|---|---|---|---|---|---|---|---|
| 48 | Antonio | A sea captain who rescues Sebastian from the s... | Play | Ancient Greece | Antonio: A sea captain who rescues Sebastian f... | [-0.0019626300781965256, -0.03169017285108566,... | 0.2899 |
| 44 | Sir Andrew Aguecheek | A dim-witted and gullible nobleman who is brou... | Play | Ancient Greece | Sir Andrew Aguecheek: A dim-witted and gullibl... | [-0.024965493008494377, -0.0236018318682909, 0... | 0.294769 |
| 34 | Prince Lorenzo | A charming and handsome prince who has recentl... | Opera | Italy | Prince Lorenzo: A charming and handsome prince... | [-0.00457723718136549, -0.010547829791903496, ... | 0.29601 |
| 38 | Don Carlo | A charming and charismatic young man who is of... | Opera | Italy | Don Carlo: A charming and charismatic young ma... | [-0.01118087861686945, -0.010726531967520714, ... | 0.300988 |
| 36 | Baron Gustavo | A wealthy and arrogant nobleman who loves to f... | Opera | Italy | Baron Gustavo: A wealthy and arrogant nobleman... | [-0.022414058446884155, -0.0189004335552454, 0... | 0.31314 |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [30]:
df['text'].iloc[0]

"Emily: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. Medium: Play Setting: England"

In [87]:
question = "what setting Emily is active on?"

In [31]:
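import os
import openai

# Note: openai.Completion and openai.embeddings_utils require openai<1.0
# Read the key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]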

In [34]:
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question,
    max_tokens=150,
)

In [38]:
response['choices'][0]['text']

"\nI'm sorry, I am not sure what you are referring to. Can you please provide more context or information? Thank you."

In [41]:
EMBEDDING_MODEL_NAME = 'text-embedding-ada-002'
# Embed every row of text in a single request
response = openai.Embedding.create(
    input=df['text'].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [56]:
len(response['data'][0]['embedding'])

1536

In [52]:
embeddings = [item['embedding'] for item in response['data']]

In [54]:
df['embeddings'] = embeddings

In [57]:
from openai.embeddings_utils import get_embedding

In [59]:
# EMBEDDING_MODEL_NAME was defined above; the question must use the same model as the rows
question_embedding = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

In [63]:
from openai.embeddings_utils import distances_from_embeddings

In [65]:
distances = distances_from_embeddings(
    question_embedding,
    df['embeddings'].tolist(),
    distance_metric='cosine'
)

In [67]:
df['distances'] = distances

In [69]:
# Smaller distance = more relevant, so the best matches come first
df = df.sort_values(by='distances')
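
To make the ranking step concrete, here is a plain-Python illustration of the cosine distance metric (1 minus cosine similarity). It is a sketch of the metric itself, not the library's implementation, and the three-dimensional vectors are toy values standing in for the 1536-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors, for illustration only
query = [1.0, 0.0, 0.0]
rows = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
    "partly aligned": [1.0, 1.0, 0.0],
}

# Sort row names from smallest to largest distance to the query
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(ranked)
# ['same direction', 'partly aligned', 'orthogonal']
```

Because the metric depends only on direction, the vector `[2.0, 0.0, 0.0]` is at distance 0 from the query even though its length differs.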

In [116]:
prompt_template = """
Answer the question based on the context below, and if the 
question can't be answered based on the context, say 
"I don't know"

Context: 

{}

---

Question: {}
Answer:"""



In [90]:
# prompt_template = prompt_template.format('context' , question)

In [72]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")

In [96]:
current_token_count = len(tokenizer.encode(prompt_template))
current_token_count

48

In [120]:
max_token_count = 1000
context = []
for text in df['text'].values:
    text_token_count = len(tokenizer.encode(text))
    # Stop before adding a row that would push the prompt over the limit
    if current_token_count + text_token_count > max_token_count:
        break
    context.append(text)
    current_token_count += text_token_count

In [118]:
# Keep the template intact so the cell can be re-run; store the filled prompt separately
prompt = prompt_template.format("\n\n###\n\n".join(context), question)

In [123]:
openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt
)['choices'][0]['text']

' Play Setting: England'

In [127]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [128]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
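
The budgeting loop in `create_prompt` can be tried without any API access. The sketch below isolates that logic under some assumptions: `pack_context` and `count_words` are hypothetical names, word counting is a crude stand-in for the `tiktoken` tokenizer, the rows are made up, and the budget here covers only the context rather than the template and question.

```python
def pack_context(texts, budget, count_tokens):
    """Keep texts in order until adding the next one would exceed the budget."""
    used = 0
    kept = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break
        kept.append(text)
        used += cost
    return kept, used


def count_words(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


# Made-up rows, assumed to be sorted from most to least relevant already
rows = [
    "Emily: aspiring actress",                  # 3 words
    "Jack: successful businessman",             # 3 words
    "Tom: retired soldier and a proud father",  # 7 words
]
kept, used = pack_context(rows, budget=8, count_tokens=count_words)
print(kept, used)
# ['Emily: aspiring actress', 'Jack: successful businessman'] 6
```

The loop stops at the first row that does not fit instead of skipping ahead to shorter rows, so the kept context is always a prefix of the relevance ranking.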

In [129]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

In [132]:
answer = answer_question(question, df)

In [139]:
answer

'Play Setting: England'

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [137]:
question1 = 'what medium Baron Gustavo is active on?'

In [145]:
# Baseline: ask the model directly, without any retrieved context
basic_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question1,
    max_tokens=150,
)['choices'][0]['text']
basic_answer

'\n\nThere is limited information available about Baron Gustavo, so it is not possible to determine what specific mediums he may be active on.'

In [142]:
custom_answer = answer_question(question1, df)
custom_answer

'Opera'

### Question 2

In [144]:
question2 = 'in which country or setting Prince Lorenzo is active?'

In [146]:
# Baseline: ask the model directly, without any retrieved context
basic_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question2,
    max_tokens=150,
)['choices'][0]['text']
basic_answer

'\n\nPrince Lorenzo does not appear to be an active member of a royal family or political figure. Therefore, he is not associated with any specific country or setting.'

In [147]:
custom_answer = answer_question(question2, df)
custom_answer

'Italy'
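
With more than a couple of questions, the baseline and custom answers could be collected side by side in one pass. The sketch below is one possible shape for that, not part of the original notebook: `compare_answers` is a hypothetical helper, and the two `fake_*` functions are stand-ins that return fixed strings so the example runs without API access. In the notebook they would wrap `openai.Completion.create` and `answer_question`.

```python
def compare_answers(questions, baseline_fn, custom_fn):
    """Return one dict per question with both answers side by side."""
    return [
        {
            "question": q,
            "baseline": baseline_fn(q).strip(),
            "custom": custom_fn(q).strip(),
        }
        for q in questions
    ]


# Stand-ins with fixed outputs, for illustration only
def fake_baseline(question):
    return "\n\nI am not sure."


def fake_custom(question):
    return "Opera"


results = compare_answers(
    ["what medium Baron Gustavo is active on?"],
    fake_baseline,
    fake_custom,
)
for row in results:
    print(row)
```

Stripping both answers removes the leading newlines that the basic completions above contain, which makes the two columns easier to compare.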