<a href="https://colab.research.google.com/github/LaxmiGunda/MovieChatbot-CaseStudyII/blob/main/Laxmivasavi_Gunda_MovieChatbotML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie chatbot

- ## Laxmivasavi (Laxmi) Gunda
- ## laxmi.gunda@gmail.com
- ## Movie chatbot for interactive movie search


**Problem**: The goal was to create a chatbot that can answer questions and provide recommendations about movies based on a dataset, going beyond simple keyword matching to understand the user's intent and the movie content.

**Solution - RAG System Implementation**: This notebook builds a Retrieval-Augmented Generation (RAG) system. The core idea of RAG is to give a large language model (LLM) the ability to retrieve information from a custom knowledge base before it generates a response.

Here's how it is achieved:

1. **Data Preparation**: I started by loading the movie dataset, cleaning it (handling missing values and incorrect data types), and creating a comprehensive "description" for each movie.

2. **Information Chunking and Indexing**: I split the movie descriptions into smaller text chunks and extracted the relevant metadata (Title, Genre, Year, etc.). I then generated numerical representations (vector embeddings) for these chunks and stored them in a searchable **Vector Store (FAISS)**. I also created a **BM25 index** for keyword-based retrieval.

3. **Hybrid Retrieval**: To answer user queries, I implemented a hybrid retrieval mechanism. It performs both a vector similarity search (finding semantically similar chunks) and a BM25 keyword search, then combines the results from both methods to get a more comprehensive set of relevant documents.

4. **Language Model (LLM) and Prompting**: I used a large language model (OpenAI's GPT-4o) as the core of the chatbot and defined a Prompt Template that instructs the LLM to use the retrieved documents as context when answering user questions.

5. **Agent and Tools**: I built a LangChain Agent powered by the LLM. The agent's role is to understand the user's natural language query and decide which tool to use. The agent is equipped with the following tools:
    - A tool for direct filtering (list_movies_by_filters) to handle simple requests for lists of movies based purely on metadata (like genre or year), bypassing the LLM for efficiency.
    - A tool for hybrid semantic and metadata search (semantic_search_movies_with_filters) to handle more complex queries requiring both understanding content and applying filters. This tool performs the combined vector and BM25 search and passes the results to the LLM for summarization.
    - A tool to fetch the plot of a specific movie (get_movie_plot) from an external API.
    - A tool to get the description of a specific movie (get_movie_description) from the dataset.

6. **Session Memory**: I also added Conversation Memory to the agent so it can maintain context and respond naturally to follow-up questions within a conversation.

7. **User Interface**: I used Gradio to create a simple web-based chat interface, allowing users to interact with the agent.


By combining these components, the chatbot can understand user queries, retrieve relevant movie information from its knowledge base using different methods (keyword, semantic, hybrid, direct lookup), and use the LLM to generate helpful and contextually relevant responses. The system is made more robust by handling potential data errors and by making the filtering process more explicit for the agent.
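As a rough illustration of the hybrid retrieval step, the sketch below merges two ranked result lists with reciprocal rank fusion (RRF). RRF is one common fusion technique and is swapped in here purely for illustration; it is not necessarily how this notebook combines the FAISS and BM25 results. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the sample chunk IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs returned by each retriever, best match first
vector_hits = [12, 7, 33, 5]
bm25_hits = [7, 41, 12, 9]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks 7 and 12 appear in both lists, so they end up ahead of chunks that only one retriever found.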


## Usage

**Note**: Anyone who downloads and runs this notebook needs to:

- **Have their own API keys.**
- **Add those keys to their own Colab Secrets Manager under the same names (`OPENAI_API_KEY`, `OMDB_API_KEY`).**
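For readers who run the notebook outside Colab, a small helper along these lines could look up a key in Colab Secrets first and fall back to an environment variable. This is a minimal sketch, not part of the original notebook: the helper name `load_api_key` and the fallback behaviour are assumptions.

```python
import os


def load_api_key(name):
    """Return the secret called `name`, or raise if it cannot be found."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook:
        # fall back to an ordinary environment variable
        value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing API key '{name}'. Add it to Colab Secrets or set it as an environment variable."
        )
    return value


# Example usage (commented out so the sketch runs without real keys):
# os.environ["OPENAI_API_KEY"] = load_api_key("OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API client.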

In [None]:
# Importing the OpenAI library to interact with OpenAI's API services.
import openai

# Import CSV module for reading and writing CSV files
import csv

# Import pandas for data manipulation and analysis
import pandas as pd

import langchain

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')

## Import OpenAI API Key

In [None]:
# Store your OpenAI API key

import os
from google.colab import userdata

# Access the secret using the name you gave it in the secrets manager
openai_api_key = userdata.get('OPENAI_API_KEY')

# You can then set it as an environment variable for the OpenAI library to use
os.environ["OPENAI_API_KEY"] = openai_api_key

## Mount Google Drive


In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
print("Google Drive mounted successfully!")

## Load the data set




In [None]:
import pandas as pd # Ensure pandas is imported

# Load the data from Google Drive first, with a fallback to /content/

# Define the primary path in your Google Drive
drive_file_path = '/content/drive/MyDrive/MovieChatbotData/IMDb_Dataset.csv'

# Define the fallback path in the Colab environment
content_file_path = '/content/IMDb_Dataset.csv'

df = None # Initialize df to None

try:
    print(f"Attempting to load dataset from Google Drive: {drive_file_path}")
    df = pd.read_csv(drive_file_path)
    print("Dataset loaded successfully from Google Drive.")

except FileNotFoundError:
    print(f"File not found in Google Drive at {drive_file_path}. Attempting to load from Colab content directory: {content_file_path}")
    try:
        df = pd.read_csv(content_file_path)
        print("Dataset loaded successfully from Colab content directory.")

    except FileNotFoundError:
        print(f"Error: File not found in Colab content directory at {content_file_path} either.")
        df = None # Ensure df is None if not found in both locations
    except pd.errors.ParserError as e:
        print(f"Error: Could not parse the CSV file from Colab content directory: {e}")
        df = None
    except Exception as e:
        print(f"An unexpected error occurred loading from Colab content directory: {e}")
        df = None

except pd.errors.ParserError as e:
    print(f"Error: Could not parse the CSV file from Google Drive: {e}")
    df = None
except Exception as e:
    print(f"An unexpected error occurred loading from Google Drive: {e}")
    df = None


# Final check and print if df was loaded
if df is not None:
   print(f"\nDataFrame shape: {df.shape}")
   # Check if the DataFrame is empty after loading
   if df.empty:
       print(f"Warning: The loaded DataFrame is empty.")
       df = None # Set df to None if it's empty even after loading
else:
   print("\nDataFrame was not loaded.")

Attempting to load dataset from Google Drive: /content/drive/MyDrive/MovieChatbotData/IMDb_Dataset.csv
Dataset loaded successfully from Google Drive.

DataFrame shape: (3173, 10)


## Data loading and preparation

In [None]:
# View & Understand the data
if df is not None:
   display(df.head())
else:
   print("DataFrame not loaded.")

In [None]:
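# Show column names, dtypes, and non-null counts
# (df.info() prints its report directly and returns None, so it is not wrapped in display())
if df is not None:
   df.info()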

In [None]:
# Summarize the data frame with describe.
if df is not None:
  display(df.describe())

In [None]:
# Check the missing values across the columns
if df is not None:
  display(df.isnull().sum())

### Handle Missing Values

Add code to handle potential missing values in the dataset, focusing on columns used for creating the movie description and metadata.

In [None]:
# Handle missing values in critical columns
if df is not None:
    # Define columns critical for the movie description and metadata
    critical_columns = ['Title', 'IMDb Rating', 'Year', 'Certificates', 'Genre', 'Director', 'Star Cast', 'Duration (minutes)']

    # Check for missing values in critical columns
    print("Missing values in critical columns before handling:")
    display(df[critical_columns].isnull().sum())

    initial_rows = len(df)
    df.dropna(subset=critical_columns, inplace=True)
    rows_dropped = initial_rows - len(df)

    if rows_dropped > 0:
        print(f"\nDropped {rows_dropped} rows with missing values in critical columns.")
    else:
        print("\nNo rows with missing values in critical columns found.")

    # Verify no missing values in critical columns after dropping
    print("\nMissing values in critical columns after handling:")
    display(df[critical_columns].isnull().sum())

else:
    print("DataFrame not loaded, cannot check or handle missing values.")

In [None]:
# Get the unique directors and display the top 10 directors by number of movies

if df is not None:
  # Get the unique directors
  unique_directors = df['Director'].unique()
  print("Number of unique directors:", len(unique_directors))
  print("First 10 unique directors:", unique_directors[:10])

  # Count the number of movies per director (top 10)
  director_counts = df['Director'].value_counts()
  print("\nTop 10 directors by number of movies:")
  display(director_counts.head(10))
else:
  print("DataFrame not loaded.")

### Handle Unexpected Data Types

Add code to convert columns expected to be numeric to the correct data type, handling potential errors.
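The idea behind coercing bad values can be shown without pandas. The sketch below mimics, in plain Python, what `errors='coerce'` does conceptually: values that cannot be parsed as numbers become NaN instead of raising an error. The helper name `coerce_numeric` and the sample values are assumptions, and `pd.to_numeric` handles more cases than this simplified version.

```python
import math


def coerce_numeric(value):
    """Return `value` as a float, or NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Hypothetical raw 'Year' values with two bad entries
raw_years = ["1994", "2008", "N/A", None, 2019]
years = [coerce_numeric(v) for v in raw_years]

invalid_count = sum(math.isnan(y) for y in years)
print(years, invalid_count)
```

Counting the NaN values before and after conversion shows how many entries were actually unparseable, as opposed to already missing.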

In [None]:
# Handle unexpected data types in numeric columns
if df is not None:
    numeric_columns = ['Year', 'IMDb Rating', 'MetaScore', 'Duration (minutes)']
    print("Attempting to convert numeric columns...")

    for col in numeric_columns:
        if col in df.columns:
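            # Convert to numeric, coercing unparseable values to NaN
            initial_dtype = df[col].dtype
            nulls_before = df[col].isnull().sum()
            df[col] = pd.to_numeric(df[col], errors='coerce')
            # Only report NaN values introduced by the conversion, not ones that already existed
            new_nulls = df[col].isnull().sum() - nulls_before
            if new_nulls > 0:
                print(f"Warning: {new_nulls} non-numeric value(s) in column '{col}' were converted to NaN.")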
            if df[col].dtype != initial_dtype:
                print(f"Converted column '{col}' from {initial_dtype} to {df[col].dtype}.")
            else:
                 print(f"Column '{col}' is already numeric or conversion did not change dtype ({df[col].dtype}).")
        else:
            print(f"Warning: Numeric column '{col}' not found in DataFrame.")

    print("\nChecking for new missing values after type conversion:")
    display(df.isnull().sum())
else:
    print("DataFrame not loaded, cannot handle unexpected data types.")

In [None]:
# Count the number of movies by Genre

genre_counts = df['Genre'].value_counts()
display(genre_counts)

#### Observations:
    - There are 3173 movies in this dataset.
    - There are no missing values in the dataset.
    - The middle 50% of IMDb ratings fall between 6.4 and 7.5, and the middle 50% of durations fall between 105 and 122 minutes (25th to 75th percentiles).
    - The 50% row is the median; because the median and mean are close, the numeric columns appear roughly symmetric.
    - The most common genre is "Biography" (868 of 3173 movies).

### Data visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.boxplot(x=df['Duration (minutes)'])
plt.title('Box Plot of Movie Duration (minutes)')
plt.xlabel('Duration (minutes)')
plt.show()

#### Observations:
    - The box plot of movie duration shows outliers at both ends: some movies are unusually short and some are unusually long.

In [None]:
# Bar plot of the number of movies per genre

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values)
plt.title("Distribution of Movies Across Genres")
plt.xlabel("Genre")
plt.ylabel("Number of Movies")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

#### Observations

  - The number of movies per genre was counted, showing a range from 868 for 'Biography' down to 1 for 'Reality-TV'.

## Text chunking

### Add new column: Description

In [None]:
# Create movie description for each movie from the details provided in the dataset

df['description'] = (df['Title'] + " | " + df["Genre"] + " | " +
                     df['Year'].astype(str) + " | Directed by " + df['Director'] +
                     " | Starring " + df['Star Cast'] + " | IMDb Rating: " +
                     df['IMDb Rating'].astype(str))

# Display the first few descriptions to check the result
display(df.head())

### Define Save/Load Path in Google Drive



In [None]:
import os

# Directory in Google Drive where the FAISS vector store is saved
# Change 'MovieChatbotData' if your Drive folder has a different name
drive_save_path = '/content/drive/MyDrive/MovieChatbotData/faiss_index'

# Create the directory if it doesn't exist
if not os.path.exists(drive_save_path):
    os.makedirs(drive_save_path)
    print(f"Created directory: {drive_save_path}")
else:
    print(f"Directory already exists: {drive_save_path}")

print(f"Vector store will be saved to and loaded from: {drive_save_path}")

### Perform chunking and save to JSON file

In [None]:
# Perform text chunking of the new "Description" feature and store to a JSON file if it doesn't exist.

import json
import os
from langchain.text_splitter import CharacterTextSplitter # Import CharacterTextSplitter here

file_path = '/content/drive/MyDrive/MovieChatbotData/IMDB_Data_chunks.json' # Changed to a Drive path
chunk_data = [] # Use a list to store dictionaries with chunk text and metadata
chunk_id_counter = 1 # Initialize a counter to start IDs from 1

# Ensure the directory for the JSON file exists
json_dir = os.path.dirname(file_path)
if not os.path.exists(json_dir):
    os.makedirs(json_dir)
    print(f"Created directory for JSON file: {json_dir}")

# Initialize the text splitter OUTSIDE the if/else block
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
print("Text splitter initialized.") # Added print statement to confirm initialization


if os.path.exists(file_path):
    print(f"Loading chunks from {file_path}")
    # Load the structured JSON data
    with open(file_path, 'r') as f:
        chunk_data = json.load(f)
    # If loading existing data, find the max ID to continue numbering
    if chunk_data:
        # Continue numbering after the highest existing ID (chunks without an 'id' count as 0)
        max_id = max(item.get('id', 0) for item in chunk_data)
        chunk_id_counter = max_id + 1
    else:
        # If file exists but is empty, start counting from 1
        chunk_id_counter = 1
else:
    print("Creating chunks and saving to file...")
    # The text splitter is already initialized above
    print(text_splitter) # Keep this print if you want to see the splitter config


    # Chunk the descriptions and include metadata
    if df is not None:
        for index, row in df.iterrows():
            movie_chunks = text_splitter.split_text(row['description'])
            for i, chunk in enumerate(movie_chunks):
                # Create a dictionary for each chunk including text, metadata, and a unique ID
                chunk_info = {
                    "id": chunk_id_counter, # Add a unique ID
                    "text": chunk,
                    "metadata": {
                        "Title": row['Title'],
                        "Year": row['Year'],
                        "Certificates": row['Certificates'],
                        "Genre": row['Genre'],
                        "Director": row['Director'],
                        "Star Cast": row['Star Cast'],
                        "IMDb Rating": row['IMDb Rating'],
                        "Duration (minutes)": row['Duration (minutes)'],
                        "chunk_index": i # Keep track of chunk order within the movie
                    }
                }
                chunk_data.append(chunk_info)
                chunk_id_counter += 1 # Increment the counter for the next chunk
    else:
        print("DataFrame not loaded, cannot create chunks.")


    # Save the structured chunk data to the JSON file
    if chunk_data: # Only save if there's data to save
        with open(file_path, 'w') as f:
            json.dump(chunk_data, f, indent=4)

        print(f"Chunks with ID and metadata successfully saved to {file_path}")
    else:
        print("No chunk data generated to save.")


# Display the first few chunks to verify the structure
print("\nFirst 5 chunks with ID and metadata:")
for i, item in enumerate(chunk_data[:5]):
    print(f"Chunk {i+1}:")
    print(f"  ID: {item.get('id', 'N/A')}") # Use .get for safety in case 'id' is missing somehow
    print(f"  Text: {item['text']}")
    print(f"  Metadata: {item['metadata']}")
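The cell above follows a load-or-build caching pattern: reuse the JSON file if it exists, otherwise build the chunks and save them. A stripped-down sketch of that pattern is shown here with a temporary directory instead of Google Drive. The helper name `load_or_build`, the `build_chunks` stand-in, and the sample record are assumptions for illustration.

```python
import json
import os
import tempfile


def load_or_build(path, build_fn):
    """Return cached JSON data from `path`, or build it once and save it."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    # Create the parent directory if needed before writing the cache
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data


build_calls = []


def build_chunks():
    # Stand-in for the expensive chunking loop
    build_calls.append(1)
    return [{"id": 1, "text": "Sample Movie | Drama | 2001", "metadata": {"chunk_index": 0}}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "cache", "chunks.json")
    first = load_or_build(cache_path, build_chunks)   # builds and saves
    second = load_or_build(cache_path, build_chunks)  # loads from disk

print(first == second, len(build_calls))
```

One caveat of this pattern is that the cache can go stale: if the dataset or the description format changes, the saved file needs to be deleted so the chunks are rebuilt.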

### Implement BM25 Retrieval

Add code to implement BM25 retrieval using the `rank_bm25` library. This is a keyword-based retrieval method that can complement vector search.
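To show what a BM25 score measures, here is a small pure-Python sketch of a common textbook variant of the formula. It is not the `rank_bm25` implementation, whose IDF handling may differ in detail. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the toy corpus are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus_tokens` against `query_tokens`."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (always positive): rarer terms carry more weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy tokenized corpus (already lowercased, no stopwords)
corpus = [
    ["inception", "scifi", "christopher", "nolan"],
    ["godfather", "crime", "francis", "ford", "coppola"],
    ["dunkirk", "war", "christopher", "nolan"],
]
scores = bm25_scores(["nolan", "war"], corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

The third document matches both query terms and scores highest, while the second matches neither and scores zero. This exact-term behaviour is what complements the semantic vector search.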

In [None]:
%pip install rank_bm25

In [None]:
%pip install nltk

### Download NLTK Data

Explicitly download necessary NLTK data resources like 'punkt' and 'stopwords'. This should be run before using NLTK resources for tokenization or preprocessing.

In [None]:
import nltk

# Download 'punkt' tokenizer data
try:
    nltk.download('punkt')
    print("NLTK 'punkt' data downloaded successfully.")
except Exception as e:
    print(f"Error downloading NLTK 'punkt' data: {e}")

# Download 'stopwords' corpus data
try:
    nltk.download('stopwords')
    print("NLTK 'stopwords' data downloaded successfully.")
except Exception as e:
    print(f"Error downloading NLTK 'stopwords' data: {e}")

try:
    nltk.download('punkt_tab')
    print("NLTK 'punkt_tab' data downloaded successfully.")
except Exception as e:
    print(f"Error downloading NLTK 'punkt_tab' data: {e}")

print("\nNLTK data download attempts complete.")

### Create BM25 Index

Create the BM25 index from the processed text chunks.

In [None]:
# Implement BM25 Retrieval
from rank_bm25 import BM25Okapi
import nltk
from nltk.corpus import stopwords
import string
import re

# --- Text Preprocessing for BM25 ---
# BM25 works best with tokenized and potentially cleaned text (e.g., removed stopwords, punctuation)
# Use the text content from your chunk_data (ensure chunk_data is available from cell 0ec59d01)

if 'chunk_data' not in globals() or not chunk_data:
    print("Chunk data not available. Please run the chunking cell (0ec59d01) first.")
else:
    print("Preparing text data for BM25...")

    stop_words = set(stopwords.words('english'))

    # Define the preprocessing function (should be accessible globally or within the tool)
    def preprocess_text(text):
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Tokenize
        text = nltk.word_tokenize(text) # Tokenize after punctuation removal
        # Remove stopwords and non-alphabetic tokens
        tokens = [word for word in text if word.isalpha() and word not in stop_words]
        return tokens

    # Preprocess and tokenize each chunk
    # Using the 'text' field from the chunk_data dictionaries
    tokenized_corpus = [preprocess_text(item['text']) for item in chunk_data]

    # --- Create BM25 Index ---
    print("Creating BM25 index...")
    # Initialize BM25 with the tokenized corpus
    bm25 = BM25Okapi(tokenized_corpus)
    print("BM25 index created.")

    print("\nBM25 setup complete.")

## Vector embeddings for chunks

In [None]:
%pip install sentence-transformers

### Create vector embeddings

In [None]:
# Create embeddings for the chunks
from sentence_transformers import SentenceTransformer
import numpy as np # Import numpy for checking shape

# Load Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract just the text content from the chunk_data dictionaries
chunk_texts = [item['text'] for item in chunk_data]

# Create vector embeddings for chunks using the extracted text
semanticEmbeddings = model.encode(chunk_texts)

# Display the shape of the embeddings
display(semanticEmbeddings.shape)

In [None]:
# Display the shape of the semantic embeddings and the first 10 embeddings
display(semanticEmbeddings.shape)
display(semanticEmbeddings[:10])
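Semantic search works by comparing the query embedding with each chunk embedding. The sketch below uses cosine similarity on toy 3-dimensional vectors to show the idea; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the vector store may use a different distance measure. The vectors and chunk names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the query points in nearly the same direction as the action chunk
query = [0.9, 0.1, 0.0]
chunks = {
    "action_chunk": [0.8, 0.2, 0.1],
    "romance_chunk": [0.1, 0.9, 0.3],
}

ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]), reverse=True)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query is ranked first, regardless of vector length.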

### Use Vector store to store embeddings

In [None]:
%pip install faiss-cpu

In [None]:
%pip install langchain-community

### Store embeddings to FAISS

In [None]:
# Store embeddings to FAISS
import faiss
import numpy as np
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document # Import Document
import os # Import os for path checking

embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vector_store = None # Initialize vector_store to None

# Define the path where the FAISS index is saved (same as defined earlier)
# Ensure drive_save_path is defined in a previous cell (e.g., cell 767cb6f7)
# drive_save_path = '/content/drive/MyDrive/MovieChatbotData/faiss_index' # Assuming this is defined

# Check if the FAISS index already exists in Google Drive
if drive_save_path and os.path.exists(drive_save_path) and os.path.exists(os.path.join(drive_save_path, 'index.faiss')):
    print(f"Loading existing FAISS index from {drive_save_path}")
    try:
        # Load the vector store from the specified path
        vector_store = FAISS.load_local(
            drive_save_path,
            embedding_function,
            allow_dangerous_deserialization=True # Set to True if loading a FAISS index from a trusted source
        )
        print("FAISS index successfully loaded.")
    except Exception as e:
        print(f"Error loading FAISS index: {e}")
        # If loading fails, you might want to proceed with creating a new index
        print("Proceeding with creating a new FAISS index.")

# If the vector store was not loaded (because it didn't exist or loading failed), create a new one
if vector_store is None:
    print("Creating a new FAISS index.")

    # Create a list of Document objects from the chunks, including metadata
    # Ensure 'df' and 'text_splitter' are available from previous cells (e.g., 0XiPvpp1FxKb and 0ec59d01)
    documents = []
    if df is not None and 'description' in df.columns and text_splitter is not None:
        for index, row in df.iterrows():
            movie_chunks = text_splitter.split_text(row['description'])
            for i, chunk in enumerate(movie_chunks):
                metadata = {
                    "Title": row['Title'],
                    "Year": row['Year'],
                    "Certificates": row['Certificates'],
                    "Genre": row['Genre'],
                    "Director": row['Director'],
                    "Star Cast": row['Star Cast'],
                    "IMDb Rating": row['IMDb Rating'],
                    "Duration (minutes)": row['Duration (minutes)'],
                    "chunk_index": i
                }
                documents.append(Document(page_content=chunk, metadata=metadata))
    elif df is None:
         print("DataFrame 'df' not loaded, cannot create documents for FAISS.")
    elif 'description' not in df.columns:
         print("'description' column not found in DataFrame, cannot create documents for FAISS.")
    elif text_splitter is None:
         print("'text_splitter' not initialized, cannot create documents for FAISS.")


    if documents:
        try:
            vector_store = FAISS.from_documents(
                documents=documents,
                embedding=embedding_function
            )
            print("Successfully created Langchain FAISS vector store from documents with metadata.")

            # Save the newly created vector store to Google Drive
            if drive_save_path: # Check if the save path is defined
                try:
                    vector_store.save_local(drive_save_path)
                    print(f"FAISS index successfully saved to {drive_save_path}")
                except Exception as e:
                    print(f"Error saving FAISS index to {drive_save_path}: {e}")
            else:
                print("drive_save_path is not defined. Cannot save FAISS index.")

        except Exception as e:
            print(f"Error creating FAISS index from documents: {e}")
            vector_store = None # Ensure vector_store is None if creation fails
    else:
        print("No documents available to create FAISS vector store.")
        vector_store = None # Ensure vector_store is None if no documents


# After this cell runs, 'vector_store' will either contain the loaded index or a newly created one (or None if errors occurred)
if vector_store:
    print("\nFAISS vector store is ready to use.")
else:
    print("\nFAISS vector store is NOT available.")

In [None]:
if vector_store is not None:
    # Access the underlying FAISS index and its ntotal attribute
    faiss_index = vector_store.index
    print(f"The size of the vector store (number of vectors) is: {faiss_index.ntotal}")
else:
    print("Vector store has not been loaded yet.")

## Define LLM

In [None]:
%pip install langchain_openai

### Initialize GPT-4o

In [None]:
# Create the LLM - using OpenAI GPT-4o

from langchain_openai import ChatOpenAI

# Initialize the GPT-4o model
llm = ChatOpenAI(model="gpt-4o", temperature=0) # temperature=0 for more deterministic responses

print("OpenAI GPT-4o model is now available")

### Add a new prompt template

In [None]:
# Create the prompt template
from langchain.prompts import PromptTemplate

movie_search_template = """You are a movie recommendation assistant.
Given the following movie information:
{context}

Answer the user's question about movies. If you don't know the answer, don't try to make up an answer.

Question: {question}
Helpful Answer:"""

movie_search_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=movie_search_template,
)

print(f"Movie search prompt: {movie_search_prompt}")

### Create retrieval chain

In [None]:
# Create the document processing chain
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate # Import PromptTemplate

# Initialize the document processing chain with the new custom movie_search_prompt

# Configure the retriever to potentially use metadata filters
# For example, to filter by genre (this is a static example,
# you'll likely make this dynamic with the agent later)
# search_kwargs = {'filter': {'Genre': 'Action'}} # Example filter
# search_kwargs = {'k': 5} # Example: retrieve top 5 documents

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(
        search_kwargs={'k': 5}
    ),
    chain_type_kwargs={"prompt": movie_search_prompt} # Passing movie_search_prompt here
)

print("Retrieval chain created with configured retriever!")
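To make the role of the prompt template concrete, the sketch below fills the same template text with plain `str.format`. It assumes the retrieved chunks are joined with blank lines before being placed into `{context}`, which is how a "stuff"-style chain is generally understood to work; the sample chunks are invented.

```python
template = """You are a movie recommendation assistant.
Given the following movie information:
{context}

Answer the user's question about movies. If you don't know the answer, don't try to make up an answer.

Question: {question}
Helpful Answer:"""

# Hypothetical retrieved chunks (stand-ins for what the retriever would return)
retrieved_chunks = [
    "Example Movie A | Action | 2020 | Directed by Jane Doe | Starring Actor X | IMDb Rating: 7.5",
    "Example Movie B | Action | 2015 | Directed by John Roe | Starring Actor Y | IMDb Rating: 8.0",
]

# All retrieved chunks are "stuffed" into a single context string
context = "\n\n".join(retrieved_chunks)
prompt_text = template.format(context=context, question="What is a good action movie?")
print(prompt_text)
```

Because every retrieved chunk is placed in one prompt, the retriever's `k` setting directly controls how much context the LLM sees per question.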

In [None]:
# Encode a sample query and check the shape of its embedding

user_query = "What is a good action movie?"
# Reuse the SentenceTransformer model loaded earlier (variable 'model') to encode the query
query_embedding = model.encode(user_query)

display(query_embedding.shape)



In [None]:
# Run the sample query through the retrieval chain and display the answer
result = qa_chain.invoke(user_query)
display(result)

### Create User interface for intermediate check

In [None]:
# Optional: Test the functionality using a Gradio UI (intermediate check)
import gradio as gr

# qa_chain.invoke() returns a dictionary containing the original query and
# the generated answer under the 'result' key, so extract 'result' explicitly.

# Note: to debug Gradio errors in Colab, pass debug=True to launch() below.

gr.ChatInterface(lambda message, history: qa_chain.invoke(message)['result']).launch(share=True)

## Define and initialize agent

In [None]:
from langchain.agents import initialize_agent, AgentType
from langchain.tools import tool

### Define agent tools

In [None]:
tools = [] # Initialize or clear the tools list here

from langchain.tools import tool
# from pydantic import BaseModel, Field # Already imported in other tool cells

@tool
def get_movie_description(movie_title: str) -> str:
  """
  Searches the dataset for a movie by title (case-insensitive substring match) and returns its description from the dataset.
  Use this tool when the user asks for general information about a specific movie.
  The input MUST be the movie title as a string for the 'movie_title' parameter.
  """
  if df is None:
      return "Error: DataFrame is not loaded."
  try:
    # Assuming df is available globally
    # case=False gives case-insensitive matching; regex=False treats the title as literal text,
    # so titles containing characters such as '(' or '*' do not break the search
    matching_movies = df[df['Title'].str.contains(movie_title, case=False, na=False, regex=False)]

    if not matching_movies.empty:
        # Return the description of the first matching movie (can be refined for multiple matches)
        return matching_movies['description'].values[0]
    else:
        return f"Sorry, I could not find a movie with the title '{movie_title}' in the dataset."
  except Exception as e:
    return f"An error occurred while searching for the movie description: {e}"

tools.append(get_movie_description)

print("Added a new tool 'get_movie_description'")

### Add a new tool for Direct Filtering

This tool uses pandas to directly filter the DataFrame based on metadata, bypassing the vector store and LLM for filter-only queries.

In [None]:
from langchain.tools import tool
import pandas as pd # Ensure pandas is imported

@tool
def list_movies_by_filters(filter_criteria: dict) -> str: # Keep the type hint for clarity
    """
    Filters the movie DataFrame directly based on metadata criteria and lists matching titles.
    Use this tool ONLY for queries that can be answered by listing movies based purely on explicit metadata filters (like Genre, Year, Director, or Rating ranges), without needing semantic search or summarization.
    The input MUST be a dictionary for 'filter_criteria'.

    Example filter_criteria:
    {'Genre': 'Action', 'Year': 2022}
    {'Director': 'Christopher Nolan'}
    {'IMDb Rating': {'$gte': 8.0}} # Example of range filter for numeric columns

    Returns a comma-separated string of matching movie titles.
    """
    if df is None:
        return "Error: DataFrame is not loaded."

    filtered_df = df.copy() # Work on a copy to avoid modifying the original df

    try:
        # Iterate through the filter_criteria dictionary
        for key, value in filter_criteria.items():
            if key in filtered_df.columns:
                if isinstance(value, dict):
                    # Handle range filters (e.g., {'$gte': 2010, '$lte': 2019})
                    if '$gte' in value and '$lte' in value:
                        filtered_df = filtered_df[
                            (filtered_df[key] >= value['$gte']) &
                            (filtered_df[key] <= value['$lte'])
                        ]
                    elif '$gte' in value:
                        filtered_df = filtered_df[filtered_df[key] >= value['$gte']]
                    elif '$lte' in value:
                        filtered_df = filtered_df[filtered_df[key] <= value['$lte']]
                    # Add more range operators if needed ($gt, $lt)
                    else:
                         return f"Unsupported range filter format for key '{key}'."
                else:
                    # Handle single-value filters: substring match for strings, equality otherwise
                    if filtered_df[key].dtype == 'object':
                         # Case-insensitive literal substring match (regex=False avoids regex errors on special characters)
                         filtered_df = filtered_df[filtered_df[key].str.contains(str(value), case=False, na=False, regex=False)]
                    else:
                         # Direct equality for numeric or other types
                         filtered_df = filtered_df[filtered_df[key] == value]
            else:
                return f"Error: Filter key '{key}' not found in DataFrame columns."

        if not filtered_df.empty:
            # Return a formatted string of titles
            movie_titles = filtered_df['Title'].tolist()
            return "Movies matching your criteria: " + ", ".join(movie_titles[:20]) + (", ..." if len(movie_titles) > 20 else "")
        else:
            return "Sorry, I could not find any movies matching your criteria."

    except Exception as e:
        return f"An error occurred during direct filtering: {e}"

print("list_movies_by_filters tool defined without explicit input schema.")

# Add this new tool to your existing tools list
try:
    # Remove the old tool definition if it exists to avoid duplicates
    tools = [t for t in tools if t.name != 'list_movies_by_filters']
    tools.append(list_movies_by_filters)
    print("list_movies_by_filters tool added to the tools list.")
except NameError:
    print("Warning: 'tools' list not found. Please ensure the 'tools' list from cell 9rlYRLxl0bft is defined before adding this tool.")

### Add a new tool for Filtered Movie Search

This tool lets the agent run a hybrid search (vector similarity plus BM25 keywords), with optional metadata filters applied to the vector search.
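The filter dictionaries used by the tools in this notebook take either a plain value or a MongoDB-like range such as `{'$gte': 7.5}`. Since the BM25 side of the search does not filter by metadata, one option is to check each retrieved chunk's metadata in plain Python. The sketch below shows how such a check could look. `matches_filters` and `_OPS` are hypothetical names that are not part of the notebook or of LangChain, the `$gt` and `$lt` operators are additions, and the sample metadata is illustrative only.

```python
import operator
from typing import Any, Dict, Optional

# Hypothetical helper table: MongoDB-like operator names mapped to comparisons.
_OPS = {
    "$gte": operator.ge,
    "$lte": operator.le,
    "$gt": operator.gt,   # not handled in the notebook, added for illustration
    "$lt": operator.lt,   # not handled in the notebook, added for illustration
}

def matches_filters(metadata: Dict[str, Any], filters: Optional[Dict[str, Any]]) -> bool:
    """Return True when `metadata` satisfies every entry in `filters`."""
    if not filters:
        return True
    for key, expected in filters.items():
        if key not in metadata:
            return False
        actual = metadata[key]
        if isinstance(expected, dict):
            # Range filter: every operator in the dict must hold.
            for op_name, bound in expected.items():
                compare = _OPS.get(op_name)
                if compare is None:
                    raise ValueError(f"Unsupported operator: {op_name}")
                if not compare(actual, bound):
                    return False
        elif isinstance(expected, str):
            # Case-insensitive substring match, mirroring the string handling
            # in list_movies_by_filters.
            if expected.lower() not in str(actual).lower():
                return False
        elif actual != expected:
            return False
    return True

# Illustrative metadata for a single chunk.
sample = {"Title": "Inception", "Genre": "Action, Sci-Fi", "Year": 2010, "IMDb Rating": 8.8}
print(matches_filters(sample, {"Genre": "action", "Year": {"$gte": 2000, "$lte": 2019}}))  # True
print(matches_filters(sample, {"IMDb Rating": {"$gt": 9.0}}))  # False
```

Applied to the BM25 hits, this would amount to keeping only the documents whose `metadata` passes `matches_filters(doc.metadata, filters)` before merging them with the vector results.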

In [None]:
from typing import Optional # Needed because 'filters' may be omitted (None)
from langchain.tools import tool
from langchain.chains import RetrievalQA # Import RetrievalQA if you want the tool to run a QA chain internally
from pydantic import BaseModel, Field # Import BaseModel and Field
from langchain.schema import Document # Import Document object

# Assume bm25 index and preprocess_text function are available from cell f0a4c179
# Assume chunk_data is available from cell 0ec59d01


# Define a Pydantic model for the input of the semantic_search_movies_with_filters tool
class SemanticSearchInput(BaseModel):
    query: str = Field(..., description="The user's natural language query about the movie content or theme.")
    filters: Optional[dict] = Field(default=None, description="Optional: A dictionary where keys are metadata fields (e.g., 'Genre', 'Year', 'Director', 'IMDb Rating') and values are the desired filter values or range queries. Use this to refine semantic search results based on metadata.")

@tool(args_schema=SemanticSearchInput) # Link the Pydantic model to the tool
def semantic_search_movies_with_filters(query: str, filters: Optional[dict] = None) -> str:
    """
    Performs a hybrid semantic (vector) and keyword (BM25) search for movies,
    applies metadata filters, and uses an LLM to answer the query.
    Use this tool for queries that involve semantic understanding, keyword matching,
    and/or specific criteria (like Genre, Year, Rating).
    The input MUST include a 'query' string and can optionally include a 'filters' dictionary.

    Example filters:
    {'Genre': 'Action'}
    {'Year': 2022}
    {'IMDb Rating': {'$gte': 7.5}} # Example using MongoDB-like query for rating >= 7.5
    """
    if vector_store is None:
        return "Error: Vector store is not available."
    # Ensure bm25 index, preprocess_text function, and chunk_data are available globally
    if 'bm25' not in globals() or 'preprocess_text' not in globals():
        return "Error: BM25 index or preprocess_text function not available. Please run the BM25 setup cell (f0a4c179 or fffe56b4)."
    if 'chunk_data' not in globals() or not chunk_data:
         return "Error: Chunk data not available. Please run the chunking cell (0ec59d01)."


    try:
        # --- 1. Perform Vector Search ---
        # Configure the base retriever with filters if provided
        vector_search_k = 10 # Retrieve more documents initially before combining
        retriever_kwargs = {'k': vector_search_k}
        if filters:
            retriever_kwargs['filter'] = filters
            print(f"Applying filters to vector search: {filters}") # For debugging

        base_retriever = vector_store.as_retriever(search_kwargs=retriever_kwargs)
        vector_docs = base_retriever.get_relevant_documents(query)
        print(f"Vector search retrieved {len(vector_docs)} documents.")

        # --- 2. Perform BM25 Search ---
        # Note: BM25 filtering by metadata is not directly supported like in Vector Store/FAISS.
        # BM25 searches the entire corpus based on keywords.

        tokenized_query = preprocess_text(query)

        # Get top documents by BM25 score
        bm25_search_k = 10 # Retrieve top 10 documents from BM25

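        # Score every chunk against the query and keep the indices of the top N.
        # Assumes bm25 is a rank_bm25 index built over chunk_data in the same order;
        # its get_top_n() returns documents rather than indices, so get_scores() is used here.
        bm25_scores = bm25.get_scores(tokenized_query)
        bm25_top_indices = sorted(
            range(len(bm25_scores)), key=lambda idx: bm25_scores[idx], reverse=True
        )[:bm25_search_k]

        # Retrieve the original Document objects from chunk_data using the indices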
        bm25_docs = []
        for i in bm25_top_indices:
            # Check if the index is valid for chunk_data
            if 0 <= i < len(chunk_data):
                item = chunk_data[i]
                 # Ensure the item has 'text' and 'metadata' keys as expected
                if isinstance(item, dict) and 'text' in item and 'metadata' in item:
                     bm25_docs.append(Document(page_content=item['text'], metadata=item['metadata']))
                else:
                     print(f"Warning: BM25 index {i} returned an item not in expected dictionary format.")
            else:
                 print(f"Warning: BM25 index {i} is out of bounds for chunk_data.")


        print(f"BM25 search retrieved {len(bm25_docs)} documents.")


        # --- 3. Combine Results (Simple Union) ---
        # Combine documents from both searches. Remove duplicates based on page_content and metadata.
        unique_docs = {}
        for doc in vector_docs + bm25_docs:
            # Create a unique key based on content and metadata
            # Use a tuple of sorted metadata items for consistent hashing
            metadata_key = tuple(sorted(doc.metadata.items()))
            doc_key = (doc.page_content, metadata_key)
            unique_docs[doc_key] = doc

        combined_docs = list(unique_docs.values())
        print(f"Combined retrieval resulted in {len(combined_docs)} unique documents.")

        # Optionally, apply reranking here on the combined documents if needed
        # or just pass the combined results to the LLM.

        # --- 4. Pass Combined Results to LLM via QA Chain ---
        # Ensure 'llm' and 'movie_search_prompt' are available
        if llm is None:
             return "Error: Language model (LLM) is not available."
        if movie_search_prompt is None:
             return "Error: Movie search prompt is not available."

        # Use a simple retriever to pass the combined docs to the QA chain
        # Create a retriever from the list of documents
        from langchain.schema import BaseRetriever as LangchainBaseRetriever
        from typing import List
        from langchain.callbacks.manager import CallbackManagerForRetrieverRun

        class ListBasedRetriever(LangchainBaseRetriever):
            docs: List[Document]

            def _get_relevant_documents(
                self, query: str, *, run_manager: CallbackManagerForRetrieverRun
            ) -> List[Document]:
                return self.docs


        # Create an instance of the ListBasedRetriever with the combined documents
        combined_retriever = ListBasedRetriever(docs=combined_docs)


        qa_chain_in_tool = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff", # Stuff all docs into the context
            retriever=combined_retriever, # Use the retriever with combined docs
            chain_type_kwargs={"prompt": movie_search_prompt}
        )

        # Invoke the QA chain with the user's query
        result = qa_chain_in_tool.invoke({"query": query})

        # RetrievalQA returns a dictionary whose answer is stored under the 'result' key
        return result.get('result', 'Could not generate a helpful answer based on the hybrid search results.')

    except Exception as e:
        return f"An error occurred during hybrid search: {e}"

print("semantic_search_movies_with_filters tool defined with hybrid search.")

# Add this tool to your existing tools list (assuming 'tools' list is defined elsewhere)
# Ensure the 'tools' list is accessible and defined in cell 9rlYRLxl0bft
try:
    # Remove the old tool definition if it exists to avoid duplicates
    tools = [t for t in tools if t.name != 'semantic_search_movies_with_filters']
    tools.append(semantic_search_movies_with_filters)
    print("semantic_search_movies_with_filters tool added to the tools list.")
except NameError:
    print("Warning: 'tools' list not found. Please ensure the 'tools' list from cell 9rlYRLxl0bft is defined before adding this tool.")
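The tool above merges the vector and BM25 hits as a plain union, so the ranking information from each retriever is discarded. One common alternative is reciprocal rank fusion (RRF), which scores each document by summing `1 / (k + rank)` over the lists it appears in. The following is a minimal sketch of that technique, not what the notebook does: it works on plain chunk identifiers rather than LangChain `Document` objects, and the name `reciprocal_rank_fusion`, the constant `k=60`, and the chunk ids are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked lists into one list ordered by summed 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical chunk ids, best hit first in each list.
vector_hits = ["chunk_3", "chunk_7", "chunk_1"]
bm25_hits = ["chunk_7", "chunk_9", "chunk_3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['chunk_7', 'chunk_3', 'chunk_9', 'chunk_1']
```

Chunks found by both retrievers rise to the top, which is usually the behaviour wanted from a hybrid search before the documents are passed to the LLM.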

### Add a new tool to fetch Movie Plot

This tool attempts to fetch the movie plot from an external API (OMDb). A new OMDb API key was obtained and stored in Colab Secrets.

In [None]:
import requests
from google.colab import userdata
from langchain.tools import tool
import os
from pydantic import BaseModel, Field # Import BaseModel and Field

# Define a Pydantic model for the input of the get_movie_plot tool
class GetMoviePlotInput(BaseModel):
    movie_title: str = Field(..., description="The exact title of the movie for which to fetch the plot.")


# --- IMPORTANT ---
# You will need an API key for an external movie database like OMDb (www.omdbapi.com).
# Get your API key and add it to Colab Secrets under the name 'OMDB_API_KEY'.
OMDB_API_KEY = userdata.get('OMDB_API_KEY')

@tool(args_schema=GetMoviePlotInput) # Link the Pydantic model to the tool
def get_movie_plot(movie_title: str) -> str:
  """
  Fetches the plot summary for a specific movie given its exact title using an external API (like OMDb).
  Use this tool when the user explicitly asks for the plot of a named movie.
  Requires an API key for the external service.
  The input MUST be the exact movie title as a string for the 'movie_title' parameter.
  """
  if not OMDB_API_KEY:
      return "Error: OMDB_API_KEY not found in Colab Secrets. Please add it."

  base_url = "http://www.omdbapi.com/"
  params = {
      "t": movie_title,
      "apikey": OMDB_API_KEY,
      "plot": "full" # Request the full plot
  }

  try:
    response = requests.get(base_url, params=params, timeout=10) # Time out instead of hanging if the API is unresponsive
    response.raise_for_status() # Raise an exception for bad status codes
    data = response.json()

    if data.get("Response") == "True":
      return data.get("Plot", "Plot information not available.")
    else:
      return data.get("Error", f"Could not find plot information for '{movie_title}'.")

  except requests.exceptions.RequestException as e:
    return f"Error fetching movie plot from API: {e}"
  except Exception as e:
    return f"An unexpected error occurred: {e}"

# Add this new tool to your existing tools list (assuming 'tools' list is defined elsewhere)
# Ensure the 'tools' list is accessible in this cell's scope if it's not global
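try:
    # Remove the old tool definition if it exists to avoid duplicates when the cell is re-run
    tools = [t for t in tools if t.name != 'get_movie_plot']
    tools.append(get_movie_plot)
    print("get_movie_plot tool added to the tools list.")
except NameError:
    print("Warning: 'tools' list not found. Please ensure the 'tools' list from cell 9rlYRLxl0bft is defined before adding this tool.")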
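The cell above reads the key only from Colab Secrets, so it cannot run outside Colab. A minimal sketch of a loader that falls back to an environment variable is shown below. The helper name `load_api_key` is hypothetical, and the environment-variable fallback is an assumption rather than something the notebook implements.

```python
import os
from typing import Optional

def load_api_key(name: str) -> Optional[str]:
    """Return the secret `name` from Colab Secrets, else from the environment."""
    try:
        from google.colab import userdata  # Only importable inside Colab
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook: fall through to the environment variable.
        pass
    return os.environ.get(name)

# Hypothetical usage; the key value itself never appears in the code.
OMDB_API_KEY = load_api_key("OMDB_API_KEY")
print("OMDb key found." if OMDB_API_KEY else "OMDb key missing.")
```

Either way the code refers to the key by name only, which keeps the value out of the notebook.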

### Initialize the agent (Orchestration)

In [None]:
# Initialize the agent with memory
# Ensure 'tools' list is updated with the new tools
# Ensure 'llm' is defined

from langchain.memory import ConversationBufferMemory # Import memory component
from langchain.agents import initialize_agent, AgentType # Ensure these are imported

# Add a prefix to guide the agent's behavior, encouraging it to use the appropriate tool
# This prefix is still relevant for guiding the agent's overall thought process
agent_prefix = """You are a helpful movie recommendation assistant.
You have access to the following tools:

{tools}

When a user asks a question about movies, first think step-by-step:
1. Identify the user's intent. Are they asking for a list of movies based purely on filters (like Genre, Year, Director, Rating range)? Or are they asking a more open-ended question that requires semantic search and summarization?
2. If the query is purely filter-based (e.g., "List all action movies from 2020"), use the `list_movies_by_filters` tool and provide the identified filters as a dictionary to the tool.
3. If the query requires semantic understanding or a combination of semantic search and filtering (e.g., "Tell me about a good sci-fi movie from the 90s"), use the `semantic_search_movies_with_filters` tool, providing the core query and any identified filters as a dictionary to the tool.
4. If the query is about a specific movie's plot, use the `get_movie_plot` tool.
5. If the query is about basic details of a specific movie available in the dataset description, use the `get_movie_description` tool first, and then `get_movie_plot` if more details are needed.
6. If you cannot answer the question using the available tools, state that you cannot help with that request.

Include conversational history in your responses where relevant.

Begin!

Previous conversation:
{chat_history}

Question: {input}
{agent_scratchpad}
"""

# Initialize ConversationBufferMemory
# Ensure memory_key and input_key match the agent's expected input format
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)


# Initialize the agent with memory
# For OPENAI_FUNCTIONS agent type with memory, initialize_agent handles
# passing the memory and chat_history correctly if memory_key is set.
# The prefix also needs to include {chat_history}.

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS, # Use OPENAI_FUNCTIONS
    verbose=True,
    handle_parsing_errors=True,
    memory=memory, # Add memory to the agent
    agent_kwargs={
        'prefix': agent_prefix, # Pass the custom prefix
        'input_variables': ['input', 'chat_history', 'agent_scratchpad', 'tools'] # Explicitly define input variables
    }
)

print("Langchain agent initialized with memory and tools.")

In [None]:
# Display available tools for the agent with their provided descriptions.

if 'tools' in globals() and tools:
    print("Available tools for the agent:")
    for t in tools: # 't' avoids shadowing the 'tool' decorator imported from langchain.tools
        print(f"- **Tool Name:** {t.name}")
        print(f"  **Description:** {t.description}")
        print("-" * 20) # Separator for readability
elif 'tools' in globals() and not tools:
    print("The 'tools' list is defined but currently empty.")
else:
    print("The 'tools' list is not defined yet. Please run the cells that define and populate the 'tools' list.")

Available tools for the agent:
- **Tool Name:** get_movie_description
  **Description:** Searches the dataset for a movie by its exact title and returns its full description from the dataset metadata.
Use this tool when the user asks for general information about a specific movie.
The input MUST be the exact movie title as a string for the 'movie_title' parameter.
--------------------
- **Tool Name:** list_movies_by_filters
  **Description:** Filters the movie DataFrame directly based on metadata criteria and lists matching titles.
Use this tool ONLY for queries that can be answered by listing movies based purely on explicit metadata filters (like Genre, Year, Director, or Rating ranges), without needing semantic search or summarization.
The input MUST be a dictionary for 'filter_criteria'.

Example filter_criteria:
{'Genre': 'Action', 'Year': 2022}
{'Director': 'Christopher Nolan'}
{'IMDb Rating': {'$gte': 8.0}} # Example of range filter for numeric columns

Returns a comma-separated 

## Test the agent with a sample query

In [None]:
# Test the agent with a sample query
agent_response = agent.invoke("How can I find movies from the 90s")
print(agent_response)

## Gradio UI for testing agent

In [None]:
# Implement agent interface, integrate Gradio.

# The output of agent.invoke() is a dictionary that includes the
# original query and the generated answer under the key 'output',
# so the answer is read explicitly from the 'output' key.

# Note: to debug any errors with Gradio in Colab, set debug=True in launch below.

gr.ChatInterface(lambda message, history: agent.invoke(message)['output']).launch(share=True)
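The lambda above surfaces any agent exception directly in the chat window. A slightly more defensive callback could look like the sketch below, which wraps the call, extracts the `'output'` key, and returns a readable message on failure. `make_chat_fn` and `EchoAgent` are hypothetical names. `EchoAgent` only stands in for the real agent so that the sketch runs on its own, and passing `{"input": message}` assumes the agent's single input key is `input`.

```python
from typing import Any, Callable, List

def make_chat_fn(agent: Any) -> Callable[[str, List[Any]], str]:
    """Build a (message, history) -> str callback around an agent-like object."""
    def respond(message: str, history: List[Any]) -> str:
        try:
            result = agent.invoke({"input": message})
        except Exception as exc:
            # Show a readable message in the chat instead of a stack trace.
            return f"Sorry, something went wrong: {exc}"
        if isinstance(result, dict):
            return str(result.get("output", "No answer was produced."))
        return str(result)
    return respond

class EchoAgent:
    """Stand-in for the real agent, used only to exercise the wrapper."""
    def invoke(self, inputs):
        return {"input": inputs["input"], "output": f"You asked: {inputs['input']}"}

chat_fn = make_chat_fn(EchoAgent())
print(chat_fn("Any good 90s thrillers?", []))  # You asked: Any good 90s thrillers?
```

With the real agent, the callback would be passed to Gradio as `gr.ChatInterface(make_chat_fn(agent))`.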

## Tests


In [None]:
# Refactored test code to run within the notebook

# Assume the necessary variables and functions (df, get_movie_description,
# list_movies_by_filters, get_movie_plot, OMDB_API_KEY if needed)
# are defined in previous cells and are accessible in the current scope.

def test_get_movie_description_exists():
  """Tests if get_movie_description returns a description for a known movie."""
  print("\nRunning test_get_movie_description_exists...")
  if 'df' not in globals() or df is None:
      print("Skipping test: DataFrame not loaded.")
      return # Skip test if df is not available
  known_movie_title = "End of the Spear" # Replace with a title from your dataset
  description = get_movie_description(known_movie_title)
  assert known_movie_title in description, f"Expected '{known_movie_title}' in description, but got: {description}"
  print("test_get_movie_description_exists passed.")

def test_get_movie_description_not_exists():
  """Tests if get_movie_description handles a non-existent movie."""
  print("\nRunning test_get_movie_description_not_exists...")
  if 'df' not in globals() or df is None:
      print("Skipping test: DataFrame not loaded.")
      return # Skip test if df is not available
  non_existent_movie_title = "NonExistentMovie123"
  description = get_movie_description(non_existent_movie_title)
  assert "Sorry, I could not find" in description, f"Expected 'not found' message, but got: {description}"
  print("test_get_movie_description_not_exists passed.")


# Add tests for get_movie_plot if you have OMDB_API_KEY set up
# Note: This test requires the OMDB_API_KEY to be available in the environment or mocked.
def test_get_movie_plot_exists():
    """Tests if get_movie_plot fetches a plot for a known movie (requires API key)."""
    print("\nRunning test_get_movie_plot_exists...")
    if 'get_movie_plot' not in globals():
        print("Skipping test: get_movie_plot tool not defined.")
        return # Skip if tool not defined
    try:
        from google.colab import userdata
        OMDB_API_KEY_CHECK = userdata.get('OMDB_API_KEY')
        if not OMDB_API_KEY_CHECK:
             print("Skipping test: OMDB_API_KEY not found in Colab Secrets.")
             return
    except ImportError:
         print("Skipping test: Not running in Colab environment or userdata not available.")
         return

    known_movie_title = "Inception" # Replace with a known movie
    plot = get_movie_plot(known_movie_title)
    assert "Plot information not available." not in plot and "Could not find plot information" not in plot and "Movie not found!" not in plot, f"Expected plot, but got: {plot}" # Should return a plot and find the movie
    assert len(plot) > 50, f"Expected plot to be reasonably long, but got length: {len(plot)}" # Plot should be reasonably long
    print("test_get_movie_plot_exists passed.")


def test_get_movie_plot_not_exists():
    """Tests if get_movie_plot handles a non-existent movie (requires API key)."""
    print("\nRunning test_get_movie_plot_not_exists...")
    if 'get_movie_plot' not in globals():
        print("Skipping test: get_movie_plot tool not defined.")
        return # Skip if tool not defined
    try:
        from google.colab import userdata
        OMDB_API_KEY_CHECK = userdata.get('OMDB_API_KEY')
        if not OMDB_API_KEY_CHECK:
             print("Skipping test: OMDB_API_KEY not found in Colab Secrets.")
             return
    except ImportError:
         print("Skipping test: Not running in Colab environment or userdata not available.")
         return

    non_existent_movie_title = "ThisMovieDoesNotExist12345"
    plot = get_movie_plot(non_existent_movie_title)
    assert "Movie not found!" in plot, f"Expected 'Movie not found!' message, but got: {plot}" # Should return the exact "Movie not found!" message
    print("test_get_movie_plot_not_exists passed.")


# --- Run the tests ---
print("--- Running Movie Tool Tests ---")
test_get_movie_description_exists()
test_get_movie_description_not_exists()
test_get_movie_plot_exists()
test_get_movie_plot_not_exists()
print("--- Finished Running Movie Tool Tests ---")
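The plot tests above need a live OMDb key and network access. As the comment in the test cell notes, the API can also be mocked. The sketch below shows one way to do that by injecting the HTTP call: `fetch_plot` and `fake_get_json` are hypothetical names, `fetch_plot` mirrors the response handling in `get_movie_plot` rather than replacing it, and the fake responses are hand-written stand-ins that only mimic the `Response`, `Plot`, and `Error` fields used in this notebook.

```python
from typing import Any, Callable, Dict

def fetch_plot(
    title: str,
    api_key: str,
    http_get_json: Callable[[str, Dict[str, str]], Dict[str, Any]],
) -> str:
    """Same response handling as get_movie_plot, with the HTTP call injected."""
    params = {"t": title, "apikey": api_key, "plot": "full"}
    data = http_get_json("http://www.omdbapi.com/", params)
    if data.get("Response") == "True":
        return data.get("Plot", "Plot information not available.")
    return data.get("Error", f"Could not find plot information for '{title}'.")

def fake_get_json(url: str, params: Dict[str, str]) -> Dict[str, Any]:
    """Hand-written stand-in for the API; no network call is made."""
    if params["t"] == "Inception":
        return {"Response": "True", "Plot": "Placeholder plot text used only for testing."}
    return {"Response": "False", "Error": "Movie not found!"}

found = fetch_plot("Inception", "dummy-key", fake_get_json)
missing = fetch_plot("ThisMovieDoesNotExist12345", "dummy-key", fake_get_json)
print(found)
print(missing)
```

Tests written this way check the branching logic without depending on the API being reachable, while the live tests above still confirm the integration.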

## Check the edge cases and handle them appropriately

- Edge Case: File not found or incorrect type of file loaded:
    - Solution: Exceptions that can occur while loading the dataset are caught and reported with readable messages, so low-level errors are not shown to the user.

- Edge Case: Issue with losing all the data when the Colab runtime is disconnected.

   - Solution: Mounted Google Drive so the data can be loaded from the drive rather than generated again.


- Edge Case: Data related issues like handling missing values, incorrect data types:
    - Solution: Because all features of the DataFrame are used in the 'Metadata', rows with missing values or incorrect data types are dropped from the DataFrame. In this dataset, no rows were affected.

- Edge Case: Creating vector embeddings takes a significant amount of time with Colab's default CPU runtime:
    - Solution: Switched the runtime to a T4 GPU, which made execution much faster.

- Edge Case: A tool is called with invalid input types or formats:
   - Solution: Using Pydantic models with args_schema helps the agent understand the expected format and provides structured validation.

- Edge Case: An external API call fails:
   - Solution: The call is preceded by input checks and wrapped in try/except, so any exception is caught and returned as a readable message.

- Edge Case: The agent fails to correctly parse the user's natural language query and identify the correct tool and its arguments (as seen with the list_movies_by_filters issue).

  - Solution: Initially used the free GPT-2 model, but it failed to return correct search results. Switching to GPT-4o resolved the parsing issue.

- Edge Case: The vector search returns no relevant documents.
   
  - Solution: This cannot be fully prevented, so movie_search_prompt instructs the LLM not to make up an answer when it does not know.

- Edge Case: Refining the agent's tool-selection prompting
   - Solution: Added BM25, a lexical retrieval technique that matches query keywords directly, to complement vector search, which focuses on semantic similarity.

- Edge Case: Fetching movie plot information on demand rather than storing it directly in the dataset.
   - Solution: The get_movie_plot tool is better suited to providing the most current plot for a specific movie on demand, and it keeps the RAG vector store smaller. Including the plot in the dataset and embeddings is the better choice when semantic search over plot content is needed.

- Edge Case: Keeping responses relevant to earlier user queries

  - Solution: Added session history so that the LLM can remember earlier turns and include them in its results where possible.

- Edge Case: Security concerns with API keys

   - Solution: Instead of entering the keys manually, they are stored in Google Colab's Secrets manager and accessed in the code by key name rather than by actual key value.

- Edge Case: Troubleshooting issues during search

  - Solution:
      - Set the agent output to verbose.
      - Set debug=True in the Gradio launch call when needed (left disabled by default, because debug=True keeps the UI running until it is stopped manually).