Custom Chatbot Project

Dataset Selection: NYC Food Scrap Drop-off Sites

For this custom chatbot project, I have selected the NYC Food Scrap Drop-off Sites dataset. It describes food scrap drop-off sites in New York City, including each site's location, host organization, open months, and operating hours. The dataset contains at least 20 rows of text data, so it is well suited to the task.

Use Case:

The custom chatbot will be developed to provide users with accurate and current information regarding food scrap drop-off sites in New York City. This will be particularly useful for individuals interested in composting and supporting a more sustainable urban environment. Leveraging this dataset, the chatbot will be able to answer questions about site locations, hours of operation, and other relevant details.

This customization will benefit NYC residents and businesses seeking to responsibly dispose of their food scraps, as well as tourists who wish to maintain eco-friendly practices during their visit. By offering precise and helpful information on food scrap drop-off sites, the chatbot can assist users in adopting sustainable habits and contribute to reducing overall waste in New York City.

Additionally, by providing this service, the chatbot promotes a culture of caring and environmental stewardship among its users. It encourages individuals to make conscious, eco-friendly decisions and fosters a community spirit centered on sustainability and responsibility. Through this initiative, the chatbot not only aids in practical waste disposal but also inspires a deeper commitment to caring for our planet.

In [1]:
import pandas as pd
import numpy as np
import openai
import tiktoken

In [None]:
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [3]:
df = pd.read_csv('/home/goutham/GENAI/customchatbot/nyc_food_scrap_drop_off_sites.csv')

In [4]:
def create_text_column(row):
    """
    Generate a comprehensive text representation of each food scrap drop-off site
    This ensures that all key information is captured in a single text field
    """
    return f"Food scrap drop-off site in {row['borough']} located at {row['location']} " \
           f"hosted by {row['hosted_by']} open during {row['open_months']} " \
           f"with operating hours {row['operation_day_hours']}. " \
           f"More details available at {row['website']} " \
           f"in neighborhood {row['ntaname']}."

df['text'] = df.apply(create_text_column, axis=1)
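
One thing the text builder above does not handle is missing values: an empty cell in the CSV is read as NaN and would appear in the sentence as the literal word "nan". The sketch below shows one possible way to skip missing fields. The names `describe_site`, `is_missing`, and `FIELD_TEMPLATES` are hypothetical, and the sample row uses illustrative values rather than an actual record from the dataset.

```python
import math

# (column name, phrase template) pairs; order controls the sentence layout
FIELD_TEMPLATES = [
    ("location", "located at {}"),
    ("hosted_by", "hosted by {}"),
    ("open_months", "open during {}"),
    ("operation_day_hours", "with operating hours {}"),
    ("ntaname", "in neighborhood {}"),
    ("website", "more details at {}"),
]


def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def describe_site(row):
    """Build a one-sentence description, leaving out missing fields."""
    parts = ["Food scrap drop-off site"]
    if not is_missing(row.get("borough")):
        parts.append(f"in {row['borough']}")
    for key, template in FIELD_TEMPLATES:
        value = row.get(key)
        if not is_missing(value):
            parts.append(template.format(str(value).strip()))
    return " ".join(parts) + "."


# Illustrative row with two missing fields (not a real record)
sample_row = {
    "borough": "Manhattan",
    "location": "E 47th St & 2nd Ave",
    "hosted_by": "GrowNYC",
    "open_months": float("nan"),
    "operation_day_hours": "Wednesdays 8:00 AM to 12:30 PM",
    "ntaname": "East Midtown-Turtle Bay",
    "website": None,
}
print(describe_site(sample_row))
```

Because a pandas row also supports `.get()`, the same function could in principle be passed to `df.apply(..., axis=1)`, though that has not been run against the real CSV here.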


In [5]:
def get_embedding(text, model="text-embedding-ada-002"):
    """
    Generate embeddings for text using OpenAI's embedding model
    
    Args:
        text (str): Input text to generate embedding for
        model (str): Embedding model to use
    
    Returns:
        list: Embedding vector or None if error occurs
    """
    try:
        response = openai.Embedding.create(
            input=[text],
            model=model
        )
        return response['data'][0]['embedding']
    except Exception as e:
        print(f"Embedding error: {e}")
        return None
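
Embedding requests can fail for transient reasons such as rate limits or network hiccups, and `get_embedding` above gives up after the first error. A minimal sketch of retrying with exponential backoff follows. `with_retries` and `flaky_embedding` are hypothetical names, the wrapper assumes the wrapped function raises on failure (unlike `get_embedding`, which returns None), and it retries every exception type, which may be too broad for errors such as an invalid API key.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap func so that failures are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
                print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
                sleep(delay)
    return wrapper


attempts = {"count": 0}


def flaky_embedding(text):
    """Hypothetical stand-in for an API call: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network problem")
    return [0.1, 0.2, 0.3]


# The sleep function is replaced with a no-op so the demo does not wait
robust_embedding = with_retries(flaky_embedding, sleep=lambda seconds: None)
vector = robust_embedding("Food scrap drop-off site in Manhattan")
print(vector, "after", attempts["count"], "attempts")
```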


In [6]:
def cosine_similarity(a, b):
    """
    Calculate cosine similarity between two vectors
    
    Args:
        a (np.array): First vector
        b (np.array): Second vector
    
    Returns:
        float: Cosine similarity score
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
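
To make the formula concrete, here is a plain-Python sketch of the same calculation on small vectors that can be checked by hand. It is written without NumPy only so the arithmetic is visible, and `cosine_similarity_safe` is a hypothetical name. The zero-vector guard is an addition: the NumPy version above would divide by zero in that case.

```python
import math


def cosine_similarity_safe(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # direction is undefined for a zero vector
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity_safe([1, 0], [1, 0])
orthogonal = cosine_similarity_safe([1, 0], [0, 1])
opposite = cosine_similarity_safe([3, 4], [-3, -4])
zero_vector = cosine_similarity_safe([0, 0], [1, 2])

print("same direction:", same_direction)
print("orthogonal:", orthogonal)
print("opposite:", opposite)
print("zero vector:", zero_vector)
```

Scores close to 1 mean the two vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposite directions.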

In [7]:
def retrieve_context(query, df, top_k=3):
    """
    Retrieve most relevant context based on query similarity
    
    Args:
        query (str): User's query
        df (pd.DataFrame): Dataset to search
        top_k (int): Number of top results to return
    
    Returns:
        pd.DataFrame: Top k most similar rows (empty if retrieval is not possible)
    """
    # Required columns for comprehensive context
    required_cols = ['text', 'borough', 'ntaname', 'food_scrap_drop_off_site', 
                     'location', 'hosted_by', 'open_months', 'operation_day_hours', 
                     'website', 'notes']
    
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"Required columns not found in the dataset: {missing_cols}")
        return pd.DataFrame(columns=required_cols)
    
    # Embed the dataset only once; later calls reuse the cached column
    if 'embedding' not in df.columns:
        df['embedding'] = df['text'].apply(get_embedding)
    
    # Get query embedding (get_embedding returns None on failure)
    query_embedding = get_embedding(query)
    
    # Score only rows whose embedding was generated successfully
    valid = df[df['embedding'].notna()]
    if query_embedding is None or valid.empty:
        return pd.DataFrame(columns=required_cols)
    
    # Calculate similarities
    similarity = valid['embedding'].apply(lambda x: cosine_similarity(x, query_embedding))
    
    # Return top k most similar rows
    return valid.loc[similarity.nlargest(top_k).index, required_cols]
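
The retrieval step can be tried offline by swapping the API call for a stand-in. The sketch below embeds the site texts once, then ranks them against each query. `toy_embedding` is a hypothetical keyword counter, not a real embedding model, and `SiteIndex` and the three site texts are illustrative.

```python
import math


def toy_embedding(text):
    """Hypothetical stand-in for an embedding API: counts a few keywords."""
    vocabulary = ["manhattan", "brooklyn", "wednesdays", "24/7"]
    lowered = text.lower()
    return [float(lowered.count(word)) for word in vocabulary]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SiteIndex:
    """Embeds every text once at construction and reuses the vectors per query."""

    def __init__(self, texts, embed):
        self.texts = list(texts)
        self.embed = embed
        self.vectors = [embed(text) for text in self.texts]

    def search(self, query, top_k=3):
        """Return up to top_k (score, text) pairs, best match first."""
        query_vector = self.embed(query)
        scored = [(cosine(vec, query_vector), text)
                  for vec, text in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


site_texts = [
    "Food scrap drop-off site in Manhattan open 24/7.",
    "Food scrap drop-off site in Brooklyn open Wednesdays.",
    "Food scrap drop-off site in Manhattan open Wednesdays.",
]

calls = {"count": 0}


def counting_embedding(text):
    """Wrap toy_embedding to count how many embedding calls are made."""
    calls["count"] += 1
    return toy_embedding(text)


index = SiteIndex(site_texts, counting_embedding)
best = index.search("Which Brooklyn sites are open?", top_k=1)
index.search("Anything open 24/7?", top_k=2)
print(best[0][1])
print("embedding calls:", calls["count"])
```

With three texts and two queries, the counter shows five embedding calls in total: one per text at construction and one per query afterwards.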



In [8]:
def food_scrap_chatbot(query, df, use_context=True):
    """
    Generate response for food scrap drop-off site queries
    
    Args:
        query (str): User's query
        df (pd.DataFrame): Dataset to search
        use_context (bool): Whether to use contextual retrieval
    
    Returns:
        str: Generated response
    """
    if use_context:
        # Retrieve relevant context
        context = retrieve_context(query, df)
        
        # Prepare context string
        context_str = context['text'].str.cat(sep='\n\n')
        
        # Construct prompt with retrieved context
        messages = [
            {
                "role": "system",
                "content": """You are a helpful NYC Food Scrap Drop-Off Sites assistant.
                Provide detailed and accurate information about food scrap recycling locations in New York City
                based on the given context. If specific details are not available,
                explain what information you can provide."""
            },
            {
                "role": "user",
                "content": f"Context of Food Scrap Drop-Off Sites:\n{context_str}\n\nQuery: {query}"
            }
        ]
    else:
        # Basic query without context
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant providing general information about food scrap recycling."
            },
            {
                "role": "user",
                "content": query
            }
        ]
    
    # Generate response using GPT-3.5 Turbo
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            max_tokens=500
        )
        return response['choices'][0]['message']['content']
    except Exception as e:
        print(f"Error generating response: {e}")
        return "I'm sorry, but I couldn't generate a response at this time."
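
The prompt-assembly step can also be separated from the API call so it is easy to inspect. The following sketch builds the same two-message structure while capping the amount of context and handling the case where nothing was retrieved. `build_messages` and `max_context_chars` are hypothetical, and the character budget is only a rough stand-in for real token counting.

```python
SYSTEM_PROMPT = (
    "You are a helpful NYC Food Scrap Drop-Off Sites assistant. "
    "Provide detailed and accurate information about food scrap recycling "
    "locations in New York City based on the given context. If specific "
    "details are not available, explain what information you can provide."
)


def build_messages(query, context_texts, max_context_chars=1500):
    """Assemble chat messages, keeping context within a character budget."""
    kept, used = [], 0
    for text in context_texts:
        if used + len(text) > max_context_chars:
            break  # stop before the budget is exceeded
        kept.append(text)
        used += len(text)

    if kept:
        context_str = "\n\n".join(kept)
    else:
        # Tell the model explicitly that retrieval returned nothing
        context_str = "(no matching sites were retrieved)"

    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Context of Food Scrap Drop-Off Sites:\n{context_str}\n\nQuery: {query}",
        },
    ]


messages = build_messages(
    "Where can I drop off food scraps?",
    ["Food scrap drop-off site in Manhattan open 24/7."],
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```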

In [9]:
def demonstrate_chatbot():
    """
    Demonstrate the chatbot with both contextual and non-contextual responses
    """
    # Queries with contextual retrieval
    contextual_queries = [
        "Where can I drop off food scraps in NYC?",
        "What are the hours for food scrap recycling locations?"
    ]
    
    # Queries without contextual retrieval
    basic_queries = [
        "Tell me about food scrap recycling in New York City",
        "What is food scrap recycling and why is it important?"
    ]
    
    print("--- Contextual Responses ---")
    for query in contextual_queries:
        print("\n--- Query: " + query + " ---")
        response = food_scrap_chatbot(query, df, use_context=True)
        print(response)
    
    print("\n--- Basic Responses ---")
    for query in basic_queries:
        print("\n--- Query: " + query + " ---")
        response = food_scrap_chatbot(query, df, use_context=False)
        print(response)

In [10]:
def explain_dataset_appropriateness():
    """
    Explain the key features that make this dataset appropriate for a custom chatbot
    """
    print("\n--- Dataset Appropriateness Explanation ---")
    print("A high-quality, appropriate dataset for a custom chatbot should have:")
    print("1. Domain-Specific Information: This dataset contains unique, localized information about NYC food scrap drop-off sites")
    print("2. Comprehensive Attributes: Multiple columns capture different aspects of each location")
    print("3. Unique, Current Data: Provides real, up-to-date information not likely to be in the model's training data")
    print("4. Structured Format: Clean, tabular data that can be easily processed")
    print("5. Contextual Richness: Includes details like borough, location, hours, hosting organization")
    
    print("\nBad datasets typically suffer from:")
    print("1. Outdated or generic information")
    print("2. Incomplete or inconsistent data")
    print("3. Missing domain-specific context")
    print("4. Unstructured or messy data")


In [11]:
# Run demonstration and explanation
demonstrate_chatbot()
explain_dataset_appropriateness()

--- Contextual Responses ---

--- Query: Where can I drop off food scraps in NYC? ---
You can drop off food scraps at the following locations in Manhattan:
1. NW West 126th Street & Adam Clayton Powell Jr Blvd in the neighborhood of Harlem (North).
2. SW St. Nicholas Avenue & West 118 Street in the neighborhood of Harlem (South).
3. NE East 130th Street & 5th Avenue in the neighborhood of East Harlem (North).

These drop-off sites are hosted by the Department of Sanitation, open year-round, and operate 24/7. For more details, you can visit www.nyc.gov/smartcomposting.

--- Query: What are the hours for food scrap recycling locations? ---
For the food scrap drop-off site hosted by GrowNYC in East Midtown-Turtle Bay at E 47th St & 2nd Ave, the operating hours are Wednesdays from 8:00 AM to 12:30 PM.

For the food scrap drop-off site hosted by the Department of Sanitation in East Harlem (South), the operating hours are 24/7, indicating that you can drop off food scraps at any time through