# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

I have chosen the dataset `nyc_food_scrap_drop_off_sites.csv` provided by the course. 

This file contains locations, hours, and other information about food scrap drop-off sites in New York City.

I believe I can create a chatbot that helps users extract information from the dataset instead of searching it manually, so they can make decisions such as:
- Confirming that a site is open and accepting food scraps.
- Reviewing the listed opening hours.
- Seeing which items are accepted.

Many of the columns are text or simple dates, so I think the model should be able to understand the information they contain, helped by a prompt that hints at the structure of the data passed in.

This dataset is also sufficiently difficult and will provide an opportunity to learn data pre-processing.

I also presumed the model hadn't seen the data, but we will see later that this wasn't the case for some data points.

### Dataset Details

```
df.index: RangeIndex(start=0, stop=576, step=1)
df.columns: Index(['Unnamed: 0', 'borough', 'ntaname', 'food_scrap_drop_off_site',
       'location', 'hosted_by', 'open_months', 'operation_day_hours',
       'website', 'borocd', 'councildist', 'latitude', 'longitude', 'precinct',
       'object_id', 'location_point', ':@computed_region_yeji_bk3q',
       ':@computed_region_92fq_4b7q', ':@computed_region_sbqj_enih',
       ':@computed_region_efsh_h5xi', ':@computed_region_f5dn_yrer', 'notes',
       'ct2010', 'bbl', 'bin'],
      dtype='object')
df.ndim: 2
df.shape: (576, 25)
original_row_size: 576
```


## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import os

import openai
import pandas as pd
from dotenv import load_dotenv
from pandas import DataFrame

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

# Load environment variables from .env before reading the API key
load_dotenv()
openai.api_key = os.getenv("API_KEY")

# Load the dataset
df = pd.read_csv('data/nyc_food_scrap_drop_off_sites.csv')

In [2]:
# Drop unnecessary columns
original_row_size = df.shape[0]
columns_to_drop = [
    'borocd',
    'councildist',
    'latitude',
    'longitude', 
    'precinct', 
    'object_id', 
    'location_point',
    ':@computed_region_yeji_bk3q',
    ':@computed_region_92fq_4b7q',
    ':@computed_region_sbqj_enih',
    ':@computed_region_efsh_h5xi',
    ':@computed_region_f5dn_yrer',
    'ct2010',
    'bbl',
    'bin',
    'Unnamed: 0']
df.drop(columns_to_drop, axis=1, inplace=True)
df = df.dropna()
new_row_size = df.shape[0]
difference = original_row_size - new_row_size
print(f"old_size: {original_row_size}. new_size: {new_row_size}. removed: {difference} rows.")

old_size: 576. new_size: 292. removed: 284 rows.
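
Since `dropna()` removed roughly half of the rows, it can help to see which columns carry the missing values before deciding what to drop. The following is a minimal sketch on a made-up toy frame (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "borough": ["Queens", "Bronx", "Brooklyn"],
    "website": ["example.org", None, "example.com"],
    "notes": [None, None, "Not accepted: meat"],
})

# Count missing values per column to see which columns drive the row loss.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# dropna() keeps only rows with no missing value in any column.
rows_kept = len(toy.dropna())
print(f"rows kept: {rows_kept} of {len(toy)}")
```

Running the same `isna().sum()` call on the real `df` before `dropna()` would show whether a single sparse column is responsible for most of the removed rows.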


In [3]:
# Create the text column by combining the relevant columns
columns_to_combine = ['borough', 'ntaname','food_scrap_drop_off_site','location','hosted_by','open_months','operation_day_hours','website','notes']
df["text"] = df[columns_to_combine].agg('#'.join, axis=1)

In [4]:
# View Examples
print(df['text'].iloc[0])
print(df['text'].iloc[10])
print(df['text'].iloc[55])

Queens#Astoria (North)-Ditmars-Steinway#Astoria Pug: 41st Street#Ditmars Boulevard and 41st Street#Astoria Pug#Year Round#Mondays (Start Time: 8:00 AM - End Time:  2:00 PM)#https://www.instagram.com/astoriapug/?hl=en#Not accepted: meat, bones, or dairy
Queens#Astoria (Central)#SE Corner of Crescent St & 30th Dr#Crescent St & 30th Dr SE#Department of Sanitation#Year Round#24/7#www.nyc.gov/smartcomposting#Download the app to access bins. Accepts all food scraps, including meat and dairy. Do not leave food scraps outside of bin!
Brooklyn#Flatbush (West)-Ditmas Park-Parkville#Q Gardens#58 E 18th St, Brooklyn, NY 11226#Q Gardens#Year Round#Tuesdays, Fridays, Saturdays, and Sundays (Start Time: Tuesday 6pm; Friday - Sunday dawn - End Time:  Tuesday 8pm; Fridays + Saturdays all night; Sundays until 4:00PM)#https://qgardenscf.com/places-to-drop-off-your-compost/#Not accepted: meat, bones, or dairy
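
One caveat with a single-character separator is that a value may contain it. This dataset includes a location written as `Ditmars #1 Municipal Parking Field 22-18 33rd Street`, which adds an extra `#` to its row. The sketch below uses a hypothetical helper, `split_row`, to map a joined row back to its columns and to flag rows whose field count is off. The example rows are shortened for illustration.

```python
COLUMNS = [
    "borough", "ntaname", "food_scrap_drop_off_site", "location", "hosted_by",
    "open_months", "operation_day_hours", "website", "notes",
]

def split_row(text, columns=COLUMNS, sep="#"):
    """Hypothetical helper: map one '#'-joined row back to its column names."""
    parts = text.split(sep)
    if len(parts) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(parts)}")
    return dict(zip(columns, parts))

# Row with exactly nine fields (notes shortened for illustration).
clean_row = ("Queens#Astoria (Central)#SE Corner of Crescent St & 30th Dr#"
             "Crescent St & 30th Dr SE#Department of Sanitation#Year Round#24/7#"
             "www.nyc.gov/smartcomposting#Accepts all food scraps")

# Row whose location contains '#', so it splits into ten fields (hours shortened).
ambiguous_row = ("Queens#Astoria (North)-Ditmars-Steinway#Astoria Pug: Ditmars#"
                 "Ditmars #1 Municipal Parking Field 22-18 33rd Street#Astoria Pug#"
                 "Year Round#Saturdays#https://www.instagram.com/astoriapug/?hl=en#"
                 "Not accepted: meat, bones, or dairy")

fields = split_row(clean_row)
print(fields["website"])

try:
    split_row(ambiguous_row)
    ambiguous_detected = False
except ValueError as err:
    ambiguous_detected = True
    print(err)
```

The model only sees the joined string, so a row like this may be read with its fields shifted by one. A separator that cannot occur in the data, or labelled `column: value` pairs, would avoid that.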


In [5]:
batch_size = 100  # number of rows sent per embedding request

def generate_open_ai_embedding(df: DataFrame) -> list:
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Get embeddings from OpenAI model
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=EMBEDDING_MODEL_NAME
        )
        
        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])
    return embeddings

# Add embeddings list to dataframe, rows should match
df["embeddings"] = generate_open_ai_embedding(df)
print(df.iloc[0]["embeddings"])

[-0.004058544524013996, 0.01975923776626587, -0.00805631186813116, -0.01986728422343731, -0.017611786723136902, 0.010311809368431568, 0.004473853390663862, 0.013438441790640354, -0.019421588629484177, -0.008441232144832611, -0.0002931217895820737, 0.0026809354312717915, -0.014221788384020329, 0.001304170466028154, -0.04510994628071785, -0.019083937630057335, 0.009595992974936962, -0.0002079708647215739, 0.012006809003651142, -0.00320598017424345, -0.020299475640058517, -0.017449716106057167, -0.004929679911583662, 0.005196422804147005, -0.010298303328454494, 0.01519421860575676, 0.008299419656395912, -0.005750167649239302, -0.027038956061005592, -0.0034490874968469143, 0.013924657367169857, 0.012027068063616753, -0.014572943560779095, -0.00671246787533164, 0.006141840945929289, -0.017247125506401062, 0.007495814468711615, 0.0018570711836218834, 0.0063646892085671425, -0.021433977410197258, -0.0016300020506605506, -0.02177162654697895, 0.00343389343470335, 0.0013176763895899057, -0.0215
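
With 292 rows left after cleaning and a batch size of 100, the loop above issues three embedding requests. Here is a small sketch of the slice boundaries it produces (the helper name `batch_bounds` is made up for illustration):

```python
def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) row positions of each batch."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

bounds = batch_bounds(292, 100)
print(bounds)  # [(0, 100), (100, 200), (200, 292)]
```

`df.iloc[i:i+batch_size]` clips the last slice in the same way, so the final batch holds the remaining 92 rows.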

### Save DataFrame to file

In [6]:
df.to_csv("data/embeddings_nyc_food.csv")

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Load DataFrame
The notebook can be run from this point, since the embeddings are loaded from the saved file.

In [8]:
import ast
import os

import numpy as np
import openai
import pandas as pd
from dotenv import load_dotenv
from pandas import DataFrame

load_dotenv()  # take environment variables from .env.

openai.api_key = os.getenv("API_KEY")

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

# Read the DataFrame from the csv file
df = pd.read_csv("data/embeddings_nyc_food.csv", index_col=0)
# Each embedding was saved as the string form of a Python list. Parse it back
# into a list (ast.literal_eval only accepts literals, unlike eval), then
# convert the list to a numpy array.
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
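
The conversion step is needed because a CSV file stores each embedding list as text. Below is a minimal sketch of the round trip on a toy frame. It writes to an in-memory buffer instead of a file, and uses `ast.literal_eval` as a stricter alternative to `eval`.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with a list-valued column, standing in for the real embeddings.
toy = pd.DataFrame({"text": ["a", "b"], "embeddings": [[0.1, -0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)

# After the round trip each embedding is a string such as "[0.1, -0.2]".
was_string = isinstance(loaded["embeddings"].iloc[0], str)

# Parse the string back into a list, then into a numpy array.
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval).apply(np.array)
print(type(loaded["embeddings"].iloc[0]))
```

For larger datasets a binary format would avoid the parsing step altogether, but CSV keeps the saved file easy to inspect.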

In [9]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_relevant_rows_to_question(question, df):   
    # Get question embeddings
    question_embedding = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    # Copy DataFrame to preserve original
    df_copy = df.copy()
    # Calculate distances from the question embeddings to the embeddings in the DataFrame
    df_copy["distances"] = distances_from_embeddings(
        question_embedding,
        df_copy["embeddings"].values,
        distance_metric="cosine" 
    )
    
    # Sort the copied DataFrame by the distance, smallest distance first
    df_copy.sort_values("distances", ascending=True, inplace=True)
    # Only return the top 5 results.
    return df_copy[:5]
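
To make the ranking concrete, here is a minimal sketch of cosine distance with plain `numpy`. It assumes the usual definition of one minus cosine similarity, which as far as I understand is what the `"cosine"` metric computes. The two-dimensional vectors are toy values, not real embeddings.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vec = [1.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {name: cosine_distance(question_vec, vec) for name, vec in row_vecs.items()}

# Smallest distance first, mirroring sort_values("distances", ascending=True).
ranked = sorted(distances, key=distances.get)
print(ranked)
```

Vector length does not matter here, only direction, which is why the row pointing the same way as the question ranks first even though it is twice as long.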

In [11]:
import tiktoken

# Prompt template; create_prompt counts its tokens together with the question's
prompt_template = """
System: Answer the question based on the context below which is a list seperated by the character '#' in this order:
['borough', 'ntaname','food_scrap_drop_off_site','location','hosted_by','open_months','operation_day_hours','website','notes'].
If the question can't be answered based on the Context, say "I don't know".

Context: {}

User: {}

AI:"""

def create_prompt(question, df, max_token_count):
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
        
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    
    context = []
    for text in get_relevant_rows_to_question(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            print("Token limit reached: ", current_token_count, "max: ", max_token_count)
            break

    return prompt_template.format(",".join(context), question) #suggested in the live class
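
The budgeting logic can be tried without an API key. The sketch below is a simplified variant that swaps `tiktoken` for a whitespace word count (an assumption made only so that it runs anywhere). Unlike `create_prompt`, it checks the budget before adding a row's cost, so the running total excludes the rejected row.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace words instead of tiktoken tokens.
    return len(text.split())

def select_rows_within_budget(rows, fixed_cost, max_tokens, count=count_tokens):
    """Keep rows in order until adding the next one would exceed max_tokens."""
    selected = []
    used = fixed_cost  # tokens already spent on the template and the question
    for row in rows:
        cost = count(row)
        if used + cost > max_tokens:
            break
        selected.append(row)
        used += cost
    return selected, used

rows = ["a b c", "d e", "f g h i"]
selected, used = select_rows_within_budget(rows, fixed_cost=2, max_tokens=8)
print(selected, used)
```

Because rows arrive sorted by distance, stopping at the first row that does not fit keeps the most relevant context and drops the rest.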

In [12]:
# Test prompt
print(create_prompt("What is the location of the Queens, Astoria (North)-Ditmars-Steinway drop of site?", df, 1000))


System: Answer the question based on the context below which is a list seperated by the character '#' in this order:
['borough', 'ntaname','food_scrap_drop_off_site','location','hosted_by','open_months','operation_day_hours','website','notes'].
If the question can't be answered based on the Context, say "I don't know".

Context: Queens#Astoria (North)-Ditmars-Steinway#Astoria Pug: Hoyt#Northwest corner of Hoyt Avenue North and 21st Street, in a courtyard that's officially called SITTING AREA#Astoria Pug#Year Round#Saturdays (Start Time: 9:00 AM - End Time:  4:00 PM)#https://www.instagram.com/astoriapug/?hl=en#Not accepted: meat, bones, or dairy,Queens#Astoria (North)-Ditmars-Steinway#Astoria Pug: Ditmars#Ditmars #1 Municipal Parking Field 22-18 33rd Street#Astoria Pug#Year Round#Saturdays (Start Time: 9:00 AM - End Time:  3:45 PM)#https://www.instagram.com/astoriapug/?hl=en#Not accepted: meat, bones, or dairy,Queens#Astoria (East)-Woodside (North)#Astoria Pug: Steinway#38-12 30th Ave,

In [13]:

def custom_query(question, df, max_prompt_tokens=3000, max_answer_tokens=750):
    """
    Answer a question with an OpenAI Completion model, using the most
    relevant rows of the dataframe as context.

    Args:
        question: The user's question.
        df: DataFrame with "text" and "embeddings" columns.
        max_prompt_tokens: Maximum number of tokens in the prompt.
        max_answer_tokens: Maximum number of tokens in the response.

    Returns:
        The model's answer, or an empty string if the request raises an error.
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [14]:
question1 = "What is the website for Bronx, Mount Eden-Claremont (West) drop of site?"
print(df['text'].iloc[2])
# Prompt without context
q1_prompt = f"""
User: {question1}
AI:
"""

response_1_without_context = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,  # same instruct model used by custom_query
    prompt=q1_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()

print(response_1_without_context)

Bronx#Mount Eden-Claremont (West)#SE Corner of Eastburn Avenue & East 174th Street#SE Eastburn Avenue & East 174th Street#Department of Sanitation#Year Round#24/7#www.nyc.gov/smartcomposting#Download the app to access bins. Accepts all food scraps, including meat and dairy. Do not leave food scraps outside of bin!
I'm sorry, I'm not able to find the specific website for Bronx, Mount Eden-Claremont (West) drop off site. Can you provide more context or details?


In [16]:
response_1_with_context = custom_query(question1, df)
print(response_1_with_context)

www.nyc.gov/smartcomposting


### Question 2

In [19]:
question2 = "What food can i not drop at Brooklyn, Crown Heights (North)?"
print(df['text'].iloc[56])
# Prompt without context
q2_prompt = f"""
User: {question2}
AI:
"""

response_2_without_context = openai.Completion.create(
    model=COMPLETION_MODEL_NAME,  # same instruct model used by custom_query
    prompt=q2_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()

print(response_2_without_context)

Brooklyn#Crown Heights (North)#Crown Heights Franklin Ave Food Scrap Drop-off#Franklin Avenue & Eastern Parkway#GrowNYC#Year Round#Thursdays (Start Time: 8:30 AM - End Time:  11:30 AM)#grownyc.org/compost#Not accepted: meat, bones, or dairy
There are a lot of great food options in Brooklyn and Crown Heights (North), but one food you might want to avoid dropping is soup or any other liquid-based dish. Not only could it make a mess, but it could also potentially harm other people if it spills and they slip on it.


In [20]:
response_2_with_context = custom_query(question2, df)
print(response_2_with_context)

Not accepted: meat, bones, or dairy


### Results


In [22]:
pd.DataFrame(
    data={
        "Question": [question1, question2],
        "Initial Response": [response_1_without_context, response_2_without_context],
        "Response": [response_1_with_context, response_2_with_context],
        "Correct": ["Yes", "Yes"]
    }
)

| | Question | Initial Response | Response | Correct |
|---|---|---|---|---|
| 0 | What is the website for Bronx, Mount Eden-Clar... | I'm sorry, I'm not able to find the specific w... | www.nyc.gov/smartcomposting | Yes |
| 1 | What food can i not drop at Brooklyn, Crown He... | There are a lot of great food options in Brook... | Not accepted: meat, bones, or dairy | Yes |
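
The "Correct" column above was filled in by hand. With more questions, a rough automatic check could compare each answer against a string expected from the dataset row. The sketch below assumes a simple case-insensitive substring match. The helper `contains_expected` is hypothetical, and the answers are copied from the outputs in this notebook.

```python
def contains_expected(answer, expected):
    """Hypothetical check: case-insensitive substring match."""
    return expected.lower() in answer.lower()

# (answer with context, expected string taken from the matching dataset row)
cases = [
    ("www.nyc.gov/smartcomposting", "www.nyc.gov/smartcomposting"),
    ("Not accepted: meat, bones, or dairy", "meat, bones, or dairy"),
]
results = ["Yes" if contains_expected(answer, expected) else "No"
           for answer, expected in cases]
print(results)

# The answer without context for question 1 does not contain the website.
baseline = ("I'm sorry, I'm not able to find the specific website for Bronx, "
            "Mount Eden-Claremont (West) drop off site.")
baseline_correct = contains_expected(baseline, "www.nyc.gov/smartcomposting")
print(baseline_correct)
```

A substring match is strict about wording, so a paraphrased but correct answer would be marked "No". It is only a first filter before manual review.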
