# Custom Chatbot Project

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import os
import openai
openai.api_base = "https://openai.vocareum.com/v1"
# Key is read from the environment instead of being hard-coded (the variable name is an assumption)
openai.api_key = os.environ["OPENAI_API_KEY"]

In [3]:
import pandas as pd

df = pd.read_csv("data/character_descriptions.csv")
df

,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England
5,George,"A man in his early 30s, George is a charming a...",Play,England
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England
7,John,"A man in his 60s, John is a retired professor ...",Play,England
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas


In [4]:
df["text"] = df["Name"] + ": " + df["Description"] + ", " + df["Medium"] + ", " + df["Setting"]
df = df[["text"]]
df

,text
0,"Emily: A young woman in her early 20s, Emily i..."
1,"Jack: A middle-aged man in his 40s, Jack is a ..."
2,"Alice: A woman in her late 30s, Alice is a war..."
3,"Tom: A man in his 50s, Tom is a retired soldie..."
4,"Sarah: A woman in her mid-20s, Sarah is a free..."
5,"George: A man in his early 30s, George is a ch..."
6,"Rachel: A woman in her late 20s, Rachel is a s..."
7,"John: A man in his 60s, John is a retired prof..."
8,"Maria: A middle-aged Latina woman in her 40s, ..."
9,Caleb: A young African American man in his ear...


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["embeddings"] = embeddings


,text,embeddings
0,"Emily: A young woman in her early 20s, Emily i...","[-0.016337227076292038, -0.01441893633455038, ..."
1,"Jack: A middle-aged man in his 40s, Jack is a ...","[0.007218827493488789, -0.02205824851989746, 0..."
2,"Alice: A woman in her late 30s, Alice is a war...","[0.006097876466810703, -0.011728930287063122, ..."
3,"Tom: A man in his 50s, Tom is a retired soldie...","[0.01787872612476349, -0.017673371359705925, 0..."
4,"Sarah: A woman in her mid-20s, Sarah is a free...","[-0.01838371343910694, -0.023540902882814407, ..."
5,"George: A man in his early 30s, George is a ch...","[-0.021037409082055092, -0.012453104369342327,..."
6,"Rachel: A woman in her late 20s, Rachel is a s...","[-0.0037197889760136604, -0.00964366365224123,..."
7,"John: A man in his 60s, John is a retired prof...","[0.019544873386621475, -0.013885458000004292, ..."
8,"Maria: A middle-aged Latina woman in her 40s, ...","[-0.006797728594392538, -0.01301904208958149, ..."
9,Caleb: A young African American man in his ear...,"[0.00687229773029685, -0.029478205367922783, 0..."
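The warning printed above the table is pandas' `SettingWithCopyWarning`. It appears because `df` was produced by the selection `df[["text"]]`, and a new column was later assigned to that result. One common way to avoid it is to take an explicit copy at selection time. The sketch below assumes a toy two-row frame; the names `raw` and `text_df` and all values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the character table; the values are illustrative only
raw = pd.DataFrame({
    "Name": ["A", "B"],
    "Description": ["first", "second"],
})
raw["text"] = raw["Name"] + ": " + raw["Description"]

# .copy() makes the one-column frame an independent object, so the later
# column assignment is unambiguous
text_df = raw[["text"]].copy()
text_df["embeddings"] = [[0.1, 0.2], [0.3, 0.4]]

print(list(text_df.columns))
print(list(raw.columns))
```

The original frame is left untouched, and the new column lands only on the copy.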


In [6]:
df.to_csv("embeddings.csv")
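One thing to keep in mind about this round trip: a CSV cell holds text, so a list-valued column such as `embeddings` is written as the string form of each list and comes back as a string when the file is read again. The sketch below illustrates the effect with the standard-library `csv` module rather than pandas; the row content and the three-value vector are made up for illustration.

```python
import ast
import csv
import io

# Toy row; the text and the 3-value vector are made up for illustration
rows = [{"text": "Example: a placeholder description", "embeddings": [0.25, -0.5, 1.0]}]

# Write to an in-memory CSV; the csv module stringifies the list with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now plain text
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_cell = loaded["embeddings"]

# Parse the string form back into a list of floats
restored = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, type(restored).__name__)
```

With pandas, the same parsing step could probably be applied after `pd.read_csv`, for example with `df["embeddings"].apply(ast.literal_eval)`.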


In [7]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
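To make the ranking concrete, here is a minimal pure-Python sketch of cosine distance, assuming the usual definition of one minus cosine similarity. The helper `cosine_distance` and the two-dimensional vectors are hypothetical stand-ins; real embeddings have far more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical query vector and row vectors
query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Vector length does not affect the result, only direction, which is why the row pointing the same way as the query ranks first even though it is twice as long.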



In [8]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = (
        len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    )
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
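The budgeting loop in `create_prompt` can be sketched in isolation. The version below swaps the `tiktoken` tokenizer for a whitespace word count, which is only a stand-in and does not match real token counts; `count_tokens` and `select_context` are hypothetical names.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); words are NOT real tokens
    return len(text.split())

def select_context(rows, used_tokens, max_tokens):
    """Keep rows in order until adding the next one would exceed the budget."""
    context = []
    for text in rows:
        used_tokens += count_tokens(text)
        if used_tokens > max_tokens:
            break
        context.append(text)
    return context

# Rows are assumed to be sorted most relevant first
rows = ["one two three", "four five six seven eight nine", "ten"]
selected = select_context(rows, used_tokens=2, max_tokens=6)
print(selected)  # ['one two three']
```

Because the loop stops at the first row that overflows, the short third row is never considered even though it would have fit. This keeps the context strictly in relevance order at the cost of occasionally leaving budget unused.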
    

In [9]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [10]:
prompt_1 = """
Question: "What does Maya work as?"
Answer:
"""
initial_answer_1 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()

custom_answer_1 = answer_question(prompt_1, df)
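Note that `prompt_1` already carries its own `Question:` and `Answer:` labels, so passing it as the `question` argument places those labels inside the template's `Question:` slot. The sketch below shows the effect with a shortened copy of the template from `create_prompt`; the context string is a placeholder and no API call is made.

```python
# Shortened copy of the template from create_prompt
template = """Context: 

{}

---

Question: {}
Answer:"""

prompt_1 = """
Question: "What does Maya work as?"
Answer:
"""

# Passing the full prompt nests the labels; passing the bare question does not
nested = template.format("<context rows>", prompt_1)
bare = template.format("<context rows>", "What does Maya work as?")

print(nested.count("Question:"), nested.count("Answer:"))  # 2 2
print(bare.count("Question:"), bare.count("Answer:"))      # 1 1
```

The model still answered correctly with the nested form here, but keeping the bare question in its own variable would likely make the baseline and custom prompts easier to compare.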

In [11]:
print(prompt_1 + initial_answer_1)
print(prompt_1 + custom_answer_1)


Question: "What does Maya work as?"
Answer:
Maya's job is not specified in the question, so it is impossible to accurately answer this question. Maya could work in a variety of occupations depending on her interests, qualifications, and experience.

Question: "What does Maya work as?"
Answer:
Maya is a nurse.


### Question 2

In [12]:
prompt_2 = """
Question: "Who is an Indigenous Australian?"
Answer:
"""
initial_answer_2 = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,
    prompt=prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()

custom_answer_2 = answer_question(prompt_2, df)

In [13]:
print(prompt_2 + initial_answer_2)
print(prompt_2 + custom_answer_2)


Question: "Who is an Indigenous Australian?"
Answer:
An indigenous Australian is a person who is descended from the original inhabitants of Australia, known as Aboriginal Australians and Torres Strait Islanders. These groups have lived on the land for thousands of years and have a unique connection to the country's cultural and spiritual heritage. Indigenous Australians have a diverse range of cultures, languages, and traditions, and are recognized as the traditional custodians of the land.

Question: "Who is an Indigenous Australian?"
Answer:
Tahlia is an Indigenous Australian woman in her early 20s, who is a talented artist and is struggling to find her place in the world.
