**Toronto Bike Share Analysis: ChatBot**

The purpose of this part of the project is to build a chatbot that can answer questions about Bike Share Toronto ridership data.

**Install the basic packages required for the chatbot**

In [None]:
# uninstall faiss
!pip uninstall faiss-gpu faiss faiss-cpu

Found existing installation: faiss-gpu 1.7.2
Uninstalling faiss-gpu-1.7.2:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/faiss/*
    /usr/local/lib/python3.10/dist-packages/faiss_gpu-1.7.2.dist-info/*
    /usr/local/lib/python3.10/dist-packages/faiss_gpu.libs/libgfortran-040039e1.so.5.0.0
    /usr/local/lib/python3.10/dist-packages/faiss_gpu.libs/libgomp-a34b3233.so.1.0.0
    /usr/local/lib/python3.10/dist-packages/faiss_gpu.libs/libquadmath-96973f99.so.0.0.0
    /usr/local/lib/python3.10/dist-packages/faiss_gpu.libs/libz-745e0a09.so.1.2.7
Proceed (Y/n)? y
  Successfully uninstalled faiss-gpu-1.7.2

In [None]:
!pip install langchain pandas sentence-transformers openai
!pip install -U sentence-transformers
!pip install langchain-openai

!pip install faiss-gpu

Collecting faiss-gpu
  Using cached faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Using cached faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


**Import Libraries**

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import torch
import os
import openai
from getpass import getpass
from langchain_openai import OpenAI
from langchain import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.schema import Document



**Import Data**

In this section, I import a copy of trip_data and filter it to include only entries from 2022, which avoids running into compute limitations. The filtered dataset is then saved to Google Drive for easy access in future sessions.

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#import a copy of trip_data
# csv_file_path = "/trip_data_copy.csv"
# trip_data_copy.to_csv(csv_file_path, index=False)

# Drop columns that I added during EDA
# trip_data_copy = trip_data_copy.drop(columns = ['Day of Week', 'Hour of Day'])

# # Filter the dataset to include only rows where the 'Year' column is 2022
# gpt_data = trip_data_copy[trip_data_copy['Year'] == 2022]
csv_file_path = "/content/drive/MyDrive/M.Eng Project/gpt_data.csv"
# gpt_data.to_csv(csv_file_path, index=False)
gpt_data = pd.read_csv(csv_file_path)
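
The filtering step described above is commented out in the cell because the filtered file already exists on Drive. As a minimal sketch of that step on a toy frame: the helper name `filter_year` and the sample values are hypothetical, while the column names `Year`, `Day of Week`, and `Hour of Day` come from the commented-out code.

```python
import pandas as pd

def filter_year(trips, year, eda_columns=("Day of Week", "Hour of Day")):
    """Drop EDA helper columns (if present) and keep rows for one year."""
    cleaned = trips.drop(columns=list(eda_columns), errors="ignore")
    return cleaned[cleaned["Year"] == year].reset_index(drop=True)

# Toy data standing in for trip_data_copy (values are made up)
toy_trips = pd.DataFrame({
    "Year": [2021, 2022, 2022],
    "Trip Duration": [540, 600, 1260],
    "Day of Week": ["Fri", "Sat", "Sun"],
    "Hour of Day": [8, 17, 12],
})

toy_2022 = filter_year(toy_trips, 2022)
print(toy_2022)
```

On the real data, the result would then be written once with `to_csv(csv_file_path, index=False)` and read back in later sessions, as the cell above does.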

**Embedding Generation**

In this step, I perform data preprocessing and embedding generation on an A100 GPU. Previously, on a High-RAM instance, embedding the 2022 dataset ran for over 6 hours and had processed only 10% of the data. Using the A100 GPU significantly accelerated this process.

In [None]:
# Check if GPU is available and set the device accordingly
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Check for FAISS GPU version
if not hasattr(faiss, 'StandardGpuResources'):
    print("FAISS GPU version is not available. Please install the GPU version of FAISS using: !pip install faiss-gpu")
else:
    print("FAISS GPU version is available.")

# File paths for saving/loading
embedding_file_path = '/content/drive/MyDrive/M.Eng Project/embeddings.npy'
faiss_index_file_path = '/content/drive/MyDrive/M.Eng Project/bike_share_index.faiss'
processed_data_file_path = '/content/drive/MyDrive/M.Eng Project/gpt_data_processed.csv'

# Check if the embeddings file and FAISS index already exist
if os.path.exists(embedding_file_path) and os.path.exists(faiss_index_file_path):
    print("Loading existing embeddings and FAISS index...")

    # Load the embeddings
    embeddings = np.load(embedding_file_path)

    # Load the FAISS index
    index = faiss.read_index(faiss_index_file_path)

    # Load the processed DataFrame
    data = pd.read_csv(processed_data_file_path)
else:
    print("Embeddings and FAISS index not found. Generating new embeddings...")

    # Load the CSV file into a DataFrame
    data = pd.read_csv('/content/drive/MyDrive/M.Eng Project/gpt_data.csv')

    # Combine relevant columns into a single string for embedding
    data['combined_text'] = data.apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

    # Load and move the SentenceTransformer model to the GPU
    model = SentenceTransformer('all-MiniLM-L6-v2')  # Choose a model based on your needs
    model = model.to(device)  # Move model to GPU if available

    # Compute embeddings using GPU
    embeddings = model.encode(data['combined_text'].tolist(),
                              batch_size=512,   # Adjust batch size as necessary based on memory capacity
                              convert_to_numpy=True,
                              show_progress_bar=True,
                              device=device)  # Ensure embeddings are computed on the specified device

    # Convert embeddings to float32 if they are not, as FAISS requires this
    embeddings = np.array(embeddings).astype('float32')

    # Save embeddings to file
    np.save(embedding_file_path, embeddings)
    print(f"Embeddings saved to {embedding_file_path}")

    # Get the dimension of the embeddings
    dimension = embeddings.shape[1]  # Define the dimension variable correctly

    # Store embeddings in FAISS using GPU index if GPU resources are available
    if hasattr(faiss, 'StandardGpuResources'):
        # Initialize FAISS GPU resources
        res = faiss.StandardGpuResources()  # Use a single GPU
        index_flat = faiss.IndexFlatL2(dimension)  # Create a CPU index with the correct dimension
        index = faiss.index_cpu_to_gpu(res, 0, index_flat)  # Transfer index to GPU
    else:
        print("Warning: Using FAISS on CPU. Performance may be degraded.")
        index = faiss.IndexFlatL2(dimension)  # Create a CPU index with the correct dimension

    # Add embeddings to the FAISS index
    index.add(embeddings)  # Add embeddings to index

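    # Save the FAISS index and data for later use.
    # A GPU index must be copied back to the CPU before writing; a CPU index is written as-is.
    index_to_save = faiss.index_gpu_to_cpu(index) if hasattr(faiss, 'StandardGpuResources') else index
    faiss.write_index(index_to_save, faiss_index_file_path)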
    print(f"FAISS index saved to {faiss_index_file_path}")

    # Save the processed DataFrame to a CSV file
    data.to_csv(processed_data_file_path, index=False)
    print(f"Processed data saved to {processed_data_file_path}")

Using device: cuda
FAISS GPU version is available.
Loading existing embeddings and FAISS index...
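
To make the `combined_text` step concrete, here is a small sketch on a toy frame showing the string that ends up being embedded for each row. The column names are taken from elsewhere in this notebook, but the values are made up. The `labeled_text` variant is an assumption of mine, not what the notebook does: it keeps the column names in the text, which may give the embedding model more context.

```python
import pandas as pd

toy = pd.DataFrame({
    "Bike Id": [101, 202],
    "Start Station Name": ["Union Station", "Queens Quay"],
    "Trip Duration": [600, 1260],
})
source_columns = list(toy.columns)

# Same expression as the notebook: values only, separated by spaces
toy["combined_text"] = toy[source_columns].apply(
    lambda row: " ".join(row.values.astype(str)), axis=1
)

# Possible variant (an assumption, not the notebook's method): keep column names
toy["labeled_text"] = toy[source_columns].apply(
    lambda row: "; ".join(f"{col}: {val}" for col, val in row.items()), axis=1
)

print(toy["combined_text"].iloc[0])
print(toy["labeled_text"].iloc[0])
```

Note that the values-only form drops the column names, so the model sees numbers such as a trip duration without any label.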


In [None]:
# Select a model to generate dense vector embeddings for text data, capturing semantic meaning in a compact form
model = SentenceTransformer('all-MiniLM-L6-v2')



In [None]:
def analyze_overall_trends(data):
    # Calculate some general statistics
    total_records = len(data)

    # Calculate the number of unique bikes used (you can choose another metric)
    unique_bikes = data['Bike Id'].nunique()

    # Monthly and daily trip counts, bucketed by the 'End Time' column
    # Convert 'End Time' to datetime first (resample requires a datetime column)
    data['End Time'] = pd.to_datetime(data['End Time'])
    monthly_trends = data.resample('M', on='End Time').size()
    daily_trends = data.resample('D', on='End Time').size()

    # Calculate average trip duration
    average_trip_duration = data['Trip Duration'].mean()

    # Calculate the most popular start and end stations
    popular_start_station = data['Start Station Name'].value_counts().idxmax()
    popular_end_station = data['End Station Name'].value_counts().idxmax()

    return {
        "total_records": total_records,
        "unique_bikes": unique_bikes,
        "monthly_trends": monthly_trends,
        "daily_trends": daily_trends,
        "average_trip_duration": average_trip_duration,
        "popular_start_station": popular_start_station,
        "popular_end_station": popular_end_station
    }
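
As a quick illustration of the time bucketing used in `analyze_overall_trends`, the sketch below counts trips per day and per month on a toy frame with made-up values. The daily count uses the same `resample('D', on='End Time')` call as the function. For the monthly count I swap in a `to_period('M')` group-by instead of `resample('M', ...)`, on the assumption that the `'M'` resample alias is deprecated in newer pandas releases; unlike `resample`, this variant only lists months that contain at least one trip.

```python
import pandas as pd

toy = pd.DataFrame({
    "End Time": ["2022-01-01 08:15", "2022-01-01 17:40",
                 "2022-01-03 09:05", "2022-02-01 12:00"],
    "Trip Duration": [600, 900, 300, 1200],
})
toy["End Time"] = pd.to_datetime(toy["End Time"])

# Trips per calendar day; days without trips appear with a count of 0
daily_trends = toy.resample("D", on="End Time").size()

# Trips per calendar month, grouped by monthly period
monthly_trends = toy.groupby(toy["End Time"].dt.to_period("M")).size()

average_trip_duration = toy["Trip Duration"].mean()

print(daily_trends.head())
print(monthly_trends)
print(average_trip_duration)
```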

In [None]:
def search_faiss_index(query, index, model, data, k=10):
    # Generate the query embedding (FAISS expects a 2-D float32 array)
    query_embedding = np.asarray(model.encode([query]), dtype='float32')

    # Search in the index
    distances, indices = index.search(query_embedding, k)  # k: number of results to retrieve

    # Retrieve and return relevant data
    return data.iloc[indices[0]], distances[0]
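
For intuition about what `index.search` returns, the sketch below reproduces an exhaustive nearest-neighbour search in plain NumPy. `IndexFlatL2` compares the query against every stored vector and reports squared L2 distances, and this is the same computation without FAISS. The function name `brute_force_l2_search` and the toy vectors are hypothetical. It handles a single query vector, whereas FAISS accepts a batch.

```python
import numpy as np

def brute_force_l2_search(vectors, query, k):
    """Return (squared L2 distances, row indices) of the k nearest vectors."""
    diffs = vectors - query                      # broadcast query against every row
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise sum of squares
    order = np.argsort(sq_dists, kind="stable")[:k]
    return sq_dists[order], order

# Toy "embeddings": three 2-D vectors (values are made up)
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
query = np.array([0.9, 0.0], dtype="float32")

distances, indices = brute_force_l2_search(vectors, query, k=2)
print(indices, distances)
```

The returned indices play the same role as `indices[0]` in `search_faiss_index`: they are row positions that can be passed to `data.iloc`.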

In [None]:
# Get the API Key
openai.api_key = getpass('Enter your API key: ')

Enter your API key: ··········


In [None]:
# Langchain: Create a template for the chatbot to generate responses based on the search results or analysis
prompt_template = """
You are an intelligent chatbot with knowledge of the Bike Share Toronto network and access to all of its relevant datasets. Your role is to provide short yet informed answers based on insights from the datasets. Carefully answer the question using the following information:

Information: {context}

Question: {question}

Answer:
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

# Initialize the OpenAI LLM
llm = OpenAI(temperature=0.5, api_key=openai.api_key)

# Load the LangChain for generating responses
qa_chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt)

# Decide which analysis to use based on the query
def chatbot_response(question):
    if 'trend' in question or 'overview' in question or 'summary' in question or 'overall' in question or 'total' in question:
        # Holistic Analysis
        trends = analyze_overall_trends(data)
        context = f"Total Records: {trends['total_records']}, Monthly Trends: {trends['monthly_trends']}"
    else:
        # Specific Retrieval
        relevant_data, distances = search_faiss_index(question, index, model, data)
        context = '\n'.join(relevant_data['combined_text'].tolist())

    # Generate response using LangChain; the "stuff" chain expects the context as Document objects
    response = qa_chain.run(
        input_documents=[Document(page_content=context)],
        question=question
    )

    return response
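
The keyword routing in `chatbot_response` relies on substring checks, which are case-sensitive and can match inside longer words. A minimal sketch of a whole-word, case-insensitive alternative follows. The helper name `wants_holistic_analysis` and the exact keyword set are my assumptions, not part of the notebook.

```python
import re

# Hypothetical keyword set; adjust to the questions you expect
HOLISTIC_KEYWORDS = {"trend", "trends", "overview", "summary", "overall", "total"}

def wants_holistic_analysis(question, keywords=HOLISTIC_KEYWORDS):
    """Return True if any whole word in the question is a routing keyword."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return not words.isdisjoint(keywords)

print(wants_holistic_analysis("What is the overall TREND in ridership?"))
print(wants_holistic_analysis("How many trips started at Union Station?"))
```

Matching on whole words means a question such as "Totally unrelated" is not routed to the holistic branch just because it contains "total".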

In [None]:
# Gpt4o: Function to query gpt-4o-2024-08-06
def query_gpt4o(prompt):

    response = openai.chat.completions.create(
        model="gpt-4o-2024-08-06",
        temperature=0.5,
        max_tokens=1000,
        messages=[
            {"role": "system", "content": "You are an expert data scientist. You have knowledge of bikeshare toronto system. You should always take a moment to think and carefully ensure accuracy before answering any questions."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content


In [None]:
# Function to generate chatbot response
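# Note: empty-string and single-space keywords match every question and would
# disable the FAISS retrieval branch, so they are left out of the list.
keywords = ['trend', 'overview', 'summary', 'overall', 'total', 'dataset', 'all']
def chatbot_response(question):
    if any(keyword in question.lower() for keyword in keywords):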
        # Holistic Analysis
        trends = analyze_overall_trends(data)
        context = (
            f"Total Records: {trends['total_records']}\n"
            f"Unique Bikes: {trends['unique_bikes']}\n"
            f"Monthly Trends: {trends['monthly_trends']}\n"
            f"Daily Trends: {trends['daily_trends']}\n"
            f"Average Trip Duration: {trends['average_trip_duration']} minutes\n"
            f"Most Popular Start Station: {trends['popular_start_station']}\n"
            f"Most Popular End Station: {trends['popular_end_station']}"
        )
    else:
        # Specific Retrieval
        relevant_data, distances = search_faiss_index(question, index, model, data)
        context = '\n'.join(relevant_data['combined_text'].tolist())

    ''' Langchain Approach
    # Convert context to Document object
    input_documents = [Document(page_content=context)]

    # Create the input for qa_chain
    input_data = {
        "input_documents": input_documents,  # Pass the list of Document objects
        "question": question
    }

    # Generate response using LangChain and extract only 'output_text'
    response = qa_chain(input_data)
    return response['output_text']  # Return only the output_text
    '''

    # ''' GPT-4o Approach
    # Prepare the prompt for GPT-4o
    prompt_gpt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

    # Query GPT-4o for the response
    response = query_gpt4o(prompt_gpt)
    return response
    # '''


# Setting up the command-line chatbot
print("Welcome to the Bike Share Chatbot! Type 'exit()' to end the chat.")

while True:
    # Get user input
    user_input = input("You: ")

    # Check if user wants to exit
    if user_input.lower() == 'exit()':
        print("Bot: Goodbye!")
        break

    # Generate response
    try:
        bot_response = chatbot_response(user_input)
        print(f"Bot: {bot_response}")
    except Exception as e:
        print(f"Bot: Sorry, something went wrong. Error: {e}")

Welcome to the Bike Share Chatbot! Type 'exit()' to end the chat.
Bot: The Toronto Bike Share, also known as Bike Share Toronto, is a public bicycle sharing system in Toronto, Canada. It provides residents and visitors with an accessible and sustainable mode of transportation by offering a network of bicycles and docking stations throughout the city. Users can rent bikes for short trips, typically using a membership or pay-per-use system. The service aims to promote cycling as an efficient, healthy, and environmentally friendly way to travel around Toronto, reducing traffic congestion and encouraging active transportation.
Bot: To determine the columns in the dataset, we can infer some potential columns based on the context and typical data collected by bikeshare systems. However, without direct access to the dataset, I can only provide an educated guess. Here are some likely columns:

1. **Trip ID**: A unique identifier for each trip.
2. **Start Time**: The date and time when a trip b