# Homework 4: Movie Question Answering System
## Using Retrieval-Augmented Generation (RAG) and Large Language Models

**Student Name:** ANIRUDH KRISHNA

**Date:** December 9, 2025

---
## Part 1: Setup and Data Loading

### 1.1 Install Required Packages

In [82]:
# Installing required packages
!pip install -q transformers torch pandas numpy sentence-transformers accelerate bitsandbytes
!pip install -q llama-index-core llama-index-embeddings-huggingface llama-index-llms-huggingface
!pip install -q faiss-cpu

### 1.2 Import Libraries

In [83]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer
from llama_index.core import Document, VectorStoreIndex, ServiceContext, StorageContext
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
import warnings
import re
import json
from typing import Dict, List, Tuple

warnings.filterwarnings('ignore')
print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Libraries imported successfully!
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
GPU Memory: 15.83 GB


In [84]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set your dataset path (update this to match your file location)
DATASET_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP HW4/IMDB_top_10000_07132023.csv'

print("✓ Google Drive mounted!")
print(f"Dataset path: {DATASET_PATH}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✓ Google Drive mounted!
Dataset path: /content/drive/MyDrive/Colab Notebooks/NLP HW4/IMDB_top_10000_07132023.csv


### 1.3 Load and Configure LLM in Half Precision

In [85]:
model_name = "google/flan-t5-base"  # Lightweight model, used here to phrase computed results in natural language

print(f"\nLoading model: {model_name}...")
print("This model works great on Colab T4 GPU!")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16
)

print("✓ Model loaded successfully!")
print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")


Loading model: google/flan-t5-base...
This model works great on Colab T4 GPU!
✓ Model loaded successfully!
Model device: cuda:0
Model dtype: torch.float16


### 1.4 Load Dataset

In [86]:
# Load the IMDB dataset
df = pd.read_csv(DATASET_PATH)

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()

Dataset loaded successfully!
Shape: (9999, 12)

Columns: ['Title', 'Year', 'Genres', 'Certificate', 'Runtime', 'Rating', 'Metascore', 'Votes', 'Gross(Million)', 'Director', 'Stars', 'Summary']

First few rows:


| | Title | Year | Genres | Certificate | Runtime | Rating | Metascore | Votes | Gross(Million) | Director | Stars | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama | R | 142.0 | 9.3 | 82.0 | 2764512 | 28.34 | Frank Darabont | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | Over the course of several years, two convicts... |
| 1 | The Dark Knight | 2008 | Action, Crime, Drama | PG-13 | 152.0 | 9.0 | 84.0 | 2737769 | 534.86 | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, M... | When the menace known as the Joker wreaks havo... |
| 2 | Inception | 2010 | Action, Adventure, Sci-Fi | PG-13 | 148.0 | 8.8 | 74.0 | 2429452 | 292.58 | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio... | A thief who steals corporate secrets through t... |
| 3 | Fight Club | 1999 | Drama | R | 139.0 | 8.8 | 67.0 | 2201448 | 37.03 | David Fincher | Brad Pitt, Edward Norton, Meat Loaf, Zach Grenier | An insomniac office worker and a devil-may-car... |
| 4 | Forrest Gump | 1994 | Drama, Romance | PG-13 | 142.0 | 8.8 | 82.0 | 2150299 | 330.25 | Robert Zemeckis | Tom Hanks, Robin Wright, Gary Sinise, Sally Field | The history of the United States from the 1950... |


### 1.5 Data Preprocessing and Exploration

In [87]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\n" + "="*50)

# Data types
print("\nData types:")
print(df.dtypes)
print("\n" + "="*50)

# Basic statistics
print("\nBasic statistics for numerical columns:")
df.describe()

Missing values per column:
Title                0
Year                 0
Genres               0
Certificate        374
Runtime              2
Rating               0
Metascore         1949
Votes                0
Gross(Million)    2741
Director             0
Stars                3
Summary              0
dtype: int64


Data types:
Title              object
Year                int64
Genres             object
Certificate        object
Runtime           float64
Rating            float64
Metascore         float64
Votes               int64
Gross(Million)    float64
Director           object
Stars              object
Summary            object
dtype: object


Basic statistics for numerical columns:


| | Year | Runtime | Rating | Metascore | Votes | Gross(Million) |
|---|---|---|---|---|---|---|
| count | 9999.0 | 9997.0 | 9999.0 | 8050.0 | 9999.0 | 7258.0 |
| mean | 2003.016202 | 110.23367 | 6.572407 | 57.440497 | 91315.6 | 40.168336 |
| std | 15.55411 | 21.81404 | 1.005134 | 17.915755 | 168934.9 | 66.869398 |
| min | 1950.0 | 45.0 | 1.0 | 1.0 | 10116.0 | 0.0 |
| 25% | 1995.0 | 95.0 | 6.0 | 45.0 | 16951.0 | 2.55 |
| 50% | 2007.0 | 106.0 | 6.7 | 58.0 | 34100.0 | 17.325 |
| 75% | 2015.0 | 120.0 | 7.3 | 71.0 | 89338.5 | 48.4525 |
| max | 2022.0 | 439.0 | 9.3 | 100.0 | 2764512.0 | 936.66 |


In [88]:
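# Handle missing values with column-specific defaults.
# Caveat: filling Metascore and Gross(Million) with 0 makes later means treat
# unknown values as real zeros. Runtime (2 missing values) is left as NaN.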
df['Summary'] = df['Summary'].fillna('No summary available')
df['Genres'] = df['Genres'].fillna('Unknown')
df['Director'] = df['Director'].fillna('Unknown')
df['Stars'] = df['Stars'].fillna('Unknown')
df['Certificate'] = df['Certificate'].fillna('Not Rated')
df['Metascore'] = df['Metascore'].fillna(0)
df['Gross(Million)'] = df['Gross(Million)'].fillna(0)

print("Missing values after preprocessing:")
print(df.isnull().sum())
print("\nData preprocessing complete!")

Missing values after preprocessing:
Title             0
Year              0
Genres            0
Certificate       0
Runtime           2
Rating            0
Metascore         0
Votes             0
Gross(Million)    0
Director          0
Stars             0
Summary           0
dtype: int64

Data preprocessing complete!
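
Filling missing `Metascore` and `Gross(Million)` values with 0 affects any average computed later, because unknown values are counted as real zeros. The sketch below illustrates the effect with a handful of made-up numbers (they are illustrative, not a sample of the dataset). It uses plain Python so it runs without the notebook state.

```python
from statistics import mean

# Hypothetical gross values in millions; None marks a missing entry.
gross = [28.34, None, 292.58, None, 330.25]

# Mean after replacing missing entries with 0 (the effect of fillna(0)).
mean_zero_filled = mean(0.0 if g is None else g for g in gross)

# Mean over the known entries only (pandas skips NaN this way by default).
mean_known_only = mean(g for g in gross if g is not None)

print(f"zero-filled mean: {mean_zero_filled:.2f}")
print(f"known-only mean:  {mean_known_only:.2f}")
```

If unbiased averages matter, one option is to leave numeric columns as NaN and let `Series.mean()` skip them.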


In [89]:
# Create rich text representations for embedding
def create_movie_text(row):
    """Create a comprehensive text representation of a movie for embedding."""
    text = f"""Title: {row['Title']}
Year: {row['Year']}
Genres: {row['Genres']}
Director: {row['Director']}
Stars: {row['Stars']}
Rating: {row['Rating']}/10
Certificate: {row['Certificate']}
Runtime: {row['Runtime']} minutes
Summary: {row['Summary']}
"""
    return text.strip()

# Apply to create document texts
df['document_text'] = df.apply(create_movie_text, axis=1)

print("Sample document text:")
print(df['document_text'].iloc[0])
print("\nRich text representations created!")

Sample document text:
Title: The Shawshank Redemption
Year: 1994
Genres: Drama
Director: Frank Darabont
Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler
Rating: 9.3/10
Certificate: R
Runtime: 142.0 minutes
Summary: Over the course of several years, two convicts form a friendship, seeking consolation and, eventually, redemption through basic compassion.

Rich text representations created!


---
## Part 2: Vector Index Construction

### 2.1 Initialize Embedding Model

In [90]:
# Initialize embedding model
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
print(f"Loading embedding model: {embed_model_name}...")

embed_model = HuggingFaceEmbedding(model_name=embed_model_name)

print("Embedding model loaded successfully!")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2...
Embedding model loaded successfully!


### 2.2 Create Documents from DataFrame

In [91]:
# Create LlamaIndex documents
documents = []

for idx, row in df.iterrows():
    # Create document with metadata
    doc = Document(
        text=row['document_text'],
        metadata={
            'title': row['Title'],
            'year': int(row['Year']),
            'genres': row['Genres'],
            'director': row['Director'],
            'rating': float(row['Rating']),
            'runtime': row['Runtime'],
            'index': idx
        }
    )
    documents.append(doc)

print(f"Created {len(documents)} documents from the dataset")
print(f"\nSample document:")
print(f"Text: {documents[0].text[:200]}...")
print(f"Metadata: {documents[0].metadata}")

Created 9999 documents from the dataset

Sample document:
Text: Title: The Shawshank Redemption
Year: 1994
Genres: Drama
Director: Frank Darabont
Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler
Rating: 9.3/10
Certificate: R
Runtime: 142.0 minutes
Su...
Metadata: {'title': 'The Shawshank Redemption', 'year': 1994, 'genres': 'Drama', 'director': 'Frank Darabont', 'rating': 9.3, 'runtime': 142.0, 'index': 0}


### 2.3 Build Vector Index

In [92]:
# Configure LlamaIndex global settings: embedding model, chunk size, and no LLM
from llama_index.core import Settings

Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.llm = None

print("Building vector index...")
print("This may take a few minutes...")

# Build the index
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

print("\nVector index built successfully!")

LLM is explicitly disabled. Using MockLLM.
Building vector index...
This may take a few minutes...




Vector index built successfully!


### 2.4 Persist and Load Index

In [93]:
# Save the index to disk
index.storage_context.persist(persist_dir="./movie_index")
print("Index saved to ./movie_index")

# To load the index later (uncomment when needed):
# from llama_index.core import load_index_from_storage
# storage_context = StorageContext.from_defaults(persist_dir="./movie_index")
# index = load_index_from_storage(storage_context)
# print("Index loaded from disk")

Index saved to ./movie_index


---
## Part 3: Semantic Query Implementation

### 3.1 Configure Retriever (No LLM needed)

In [94]:
# Use the retriever to fetch relevant documents, then format the answer ourselves
retriever = index.as_retriever(similarity_top_k=5)

print("\nRetriever configured successfully!")
print("Similarity top k: 5")


Retriever configured successfully!
Similarity top k: 5
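
To make the retrieval step concrete, here is a minimal sketch of top-k ranking, assuming the vector store scores documents by cosine similarity. The function names and the three-dimensional toy vectors are hypothetical; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the actual retriever is implemented inside LlamaIndex.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float],
          doc_vecs: Dict[str, List[float]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for illustration only.
toy_docs = {
    "Alien": [0.9, 0.1, 0.0],
    "Notting Hill": [0.0, 0.2, 0.9],
    "Aliens": [0.8, 0.3, 0.1],
}
ranked = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print(ranked)
```

The scores returned this way play the same role as the relevance values printed next to each source movie.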


### 3.2 Implement Semantic Query Function

In [95]:
def semantic_query(question: str, verbose: bool = True) -> Dict:
    """
    Answer semantic/conceptual questions using RAG.

    Args:
        question: The semantic question to answer
        verbose: Whether to print detailed information

    Returns:
        Dictionary containing the answer and source information
    """
    if verbose:
        print(f"\n{'='*60}")
        print(f"SEMANTIC QUERY: {question}")
        print(f"{'='*60}")

    # Retrieve relevant documents
    nodes = retriever.retrieve(question)

    # Extract source movies and create answer
    source_movies = []
    answer_parts = []

    for i, node in enumerate(nodes, 1):
        title = node.metadata.get('title', 'Unknown')
        year = node.metadata.get('year', 'Unknown')
        source_movies.append({
            'title': title,
            'year': year,
            'score': node.score
        })
        # List each retrieved movie in the answer
        answer_parts.append(f"{i}. {title} ({year})")

    # Create a simple answer from the retrieved movies
    answer = f"Based on the movie database, here are relevant movies:\n" + "\n".join(answer_parts)

    result = {
        'question': question,
        'answer': answer,
        'sources': source_movies
    }

    if verbose:
        print(f"\nANSWER:\n{result['answer']}")
        print(f"\nSOURCE MOVIES:")
        for i, movie in enumerate(source_movies, 1):
            print(f"{i}. {movie['title']} ({movie['year']}) - Relevance: {movie['score']:.3f}")
        print(f"{'='*60}\n")

    return result

print("Semantic query function defined!")

Semantic query function defined!


### 3.3 Test Semantic Queries (5+ Examples)

In [96]:
# Example 1: Alien-related movies
result1 = semantic_query("What are some alien-related movies?")


SEMANTIC QUERY: What are some alien-related movies?

ANSWER:
Based on the movie database, here are relevant movies:
1. Alien vs. Predator (2004)
2. Alien Nation (1988)
3. Aliens (1986)
4. Alien (1979)
5. War of the Worlds (2005)

SOURCE MOVIES:
1. Alien vs. Predator (2004) - Relevance: 0.542
2. Alien Nation (1988) - Relevance: 0.541
3. Aliens (1986) - Relevance: 0.533
4. Alien (1979) - Relevance: 0.524
5. War of the Worlds (2005) - Relevance: 0.519



In [97]:
# Example 2: Time travel themes
result2 = semantic_query("Which films explore time travel themes?")


SEMANTIC QUERY: Which films explore time travel themes?

ANSWER:
Based on the movie database, here are relevant movies:
1. Frequently Asked Questions About Time Travel (2009)
2. Timecrimes (2007)
3. The Visitors II: The Corridors of Time (1998)
4. A Sound of Thunder (2005)
5. Synchronicity (2015)

SOURCE MOVIES:
1. Frequently Asked Questions About Time Travel (2009) - Relevance: 0.532
2. Timecrimes (2007) - Relevance: 0.519
3. The Visitors II: The Corridors of Time (1998) - Relevance: 0.519
4. A Sound of Thunder (2005) - Relevance: 0.502
5. Synchronicity (2015) - Relevance: 0.500



In [98]:
# Example 3: Strong female protagonists
result3 = semantic_query("What movies feature strong female protagonists?")


SEMANTIC QUERY: What movies feature strong female protagonists?

ANSWER:
Based on the movie database, here are relevant movies:
1. Lucy (2014)
2. Maleficent (2014)
3. My Super Ex-Girlfriend (2006)
4. The Powerpuff Girls Movie (2002)
5. Mustang (2015)

SOURCE MOVIES:
1. Lucy (2014) - Relevance: 0.500
2. Maleficent (2014) - Relevance: 0.492
3. My Super Ex-Girlfriend (2006) - Relevance: 0.487
4. The Powerpuff Girls Movie (2002) - Relevance: 0.482
5. Mustang (2015) - Relevance: 0.475



In [99]:
# Example 4: Psychological thrillers
result4 = semantic_query("Can you recommend some psychological thriller movies?")


SEMANTIC QUERY: Can you recommend some psychological thriller movies?

ANSWER:
Based on the movie database, here are relevant movies:
1. A Classic Horror Story (2021)
2. 1408 (2007)
3. Infinite (2021)
4. Psycho Goreman (2020)
5. High Anxiety (1977)

SOURCE MOVIES:
1. A Classic Horror Story (2021) - Relevance: 0.524
2. 1408 (2007) - Relevance: 0.521
3. Infinite (2021) - Relevance: 0.521
4. Psycho Goreman (2020) - Relevance: 0.521
5. High Anxiety (1977) - Relevance: 0.519



In [100]:
# Example 5: Movies about artificial intelligence
result5 = semantic_query("What movies deal with artificial intelligence and consciousness?")


SEMANTIC QUERY: What movies deal with artificial intelligence and consciousness?

ANSWER:
Based on the movie database, here are relevant movies:
1. Transcendence (2014)
2. Ex Machina (2014)
3. Stealth (2005)
4. A.I. Artificial Intelligence (2001)
5. Archive (2020)

SOURCE MOVIES:
1. Transcendence (2014) - Relevance: 0.555
2. Ex Machina (2014) - Relevance: 0.498
3. Stealth (2005) - Relevance: 0.464
4. A.I. Artificial Intelligence (2001) - Relevance: 0.461
5. Archive (2020) - Relevance: 0.460



In [101]:
# Example 6: War movies
result6 = semantic_query("What are some critically acclaimed war movies?")


SEMANTIC QUERY: What are some critically acclaimed war movies?

ANSWER:
Based on the movie database, here are relevant movies:
1. War Machine (2017)
2. The Art of War (2000)
3. Tropic Thunder (2008)
4. To End All Wars (2001)
5. War, Inc. (2008)

SOURCE MOVIES:
1. War Machine (2017) - Relevance: 0.593
2. The Art of War (2000) - Relevance: 0.559
3. Tropic Thunder (2008) - Relevance: 0.556
4. To End All Wars (2001) - Relevance: 0.546
5. War, Inc. (2008) - Relevance: 0.534



---
## Part 4: Factual Query Implementation
### 4.1 Implement Code Generation for Statistical Queries

In [102]:
def generate_pandas_code(question: str) -> str:
    """
    Generate pandas code to answer factual/statistical questions.
    Uses template-based code generation for reliability.

    Note: T5-base is too small for reliable code generation, so we use
    predefined templates for common query patterns. For production use,
    a larger model like CodeLlama or GPT-4 would be more appropriate.

    Args:
        question: The factual question requiring computation

    Returns:
        Python code as a string
    """
    question_lower = question.lower()

    # Template-based code generation for common patterns

    # Average queries
    if 'average' in question_lower or 'mean' in question_lower:
        if 'rating' in question_lower:
            if 'bond' in question_lower:
                # Caveat: this matches titles containing "Bond", which is not the same set as the James Bond series
                return "result = df[df['Title'].str.contains('Bond', case=False, na=False)]['Rating'].mean()"
            else:
                return "result = df['Rating'].mean()"
        elif 'gross' in question_lower or 'revenue' in question_lower:
            if 'superhero' in question_lower:
                # Approximation: Action/Adventure genres stand in for "superhero"
                return "result = df[df['Genres'].str.contains('Action|Adventure', case=False, na=False)]['Gross(Million)'].mean()"
            else:
                return "result = df['Gross(Million)'].mean()"
        elif 'runtime' in question_lower:
            if 'action' in question_lower:
                return "result = df[df['Genres'].str.contains('Action', case=False, na=False)]['Runtime'].mean()"
            else:
                return "result = df['Runtime'].mean()"

    # Count queries
    elif 'how many' in question_lower or 'count' in question_lower:
        if 'rating above' in question_lower or 'rating >' in question_lower:
            match = re.search(r'(\d+\.?\d*)', question_lower)
            if match:
                threshold = match.group(1)
                return f"result = len(df[df['Rating'] > {threshold}])"
        elif 'released in' in question_lower or 'year' in question_lower:
            match = re.search(r'(\d{4})', question)
            if match:
                year = match.group(1)
                return f"result = len(df[df['Year'] == {year}])"
        elif 'director' in question_lower or 'directed' in question_lower or 'direct' in question_lower:
            if 'nolan' in question_lower:
                return "result = len(df[df['Director'].str.contains('Nolan', case=False, na=False)])"
            elif 'spielberg' in question_lower:
                return "result = len(df[df['Director'].str.contains('Spielberg', case=False, na=False)])"

    # Top/highest queries
    elif 'top' in question_lower or 'highest' in question_lower:
        if 'rated' in question_lower or 'rating' in question_lower:
            match = re.search(r'top\s+(\d+)', question_lower)
            n = match.group(1) if match else '5'
            return f"result = df.nlargest({n}, 'Rating')[['Title', 'Rating']]"
        elif 'grossing' in question_lower and 'director' in question_lower:
            return "result = df.loc[df['Gross(Million)'].idxmax(), 'Director']"

    # Year with most releases
    elif 'year' in question_lower and ('most' in question_lower or 'highest' in question_lower):
        return "result = df['Year'].value_counts().idxmax()"

    # If no template matches
    return "result = 'Query not supported. Please rephrase using: average, count, top N, or year-based queries.'"

print("\nCode generation function defined!")


Code generation function defined!
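
The nested `if`/`elif` chain above depends on branch order, which makes it easy to shadow a later rule. One alternative is a table of rules tried in sequence. The sketch below is a hypothetical restructuring, not the function used in this notebook: `RULES` and `generate_from_rules` are illustrative names, and only three of the templates are reproduced.

```python
import re
from typing import Callable, List, Optional, Tuple

# Each rule pairs a compiled regex with a builder that turns the match
# into a pandas expression. Rules are tried in order; the first match wins.
Rule = Tuple[re.Pattern, Callable[[re.Match], str]]

RULES: List[Rule] = [
    (re.compile(r"how many .*rating (?:above|over) (\d+(?:\.\d+)?)"),
     lambda m: f"result = len(df[df['Rating'] > {m.group(1)}])"),
    (re.compile(r"how many .*released in (\d{4})"),
     lambda m: f"result = len(df[df['Year'] == {m.group(1)}])"),
    (re.compile(r"top (\d+) .*rated"),
     lambda m: f"result = df.nlargest({m.group(1)}, 'Rating')[['Title', 'Rating']]"),
]


def generate_from_rules(question: str) -> Optional[str]:
    """Return the code for the first matching rule, or None if nothing matches."""
    lowered = question.lower()
    for pattern, build in RULES:
        match = pattern.search(lowered)
        if match:
            return build(match)
    return None


print(generate_from_rules("How many movies have a rating above 8.5?"))
print(generate_from_rules("What are the top 5 highest-rated movies?"))
```

Returning `None` for unsupported questions lets the caller report a failure explicitly instead of executing a string assignment.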


### 4.2 Implement Safe Code Execution

In [103]:
from typing import Any

def execute_code_safely(code: str, df: pd.DataFrame) -> Tuple[bool, Any, str]:
    """
    Safely execute generated pandas code.

    Args:
        code: Python code to execute
        df: DataFrame to operate on

    Returns:
        Tuple of (success, result, error_message)
    """
    try:
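        # Run the code against a copy of the DataFrame in a restricted namespace.
        # Note: exec() still exposes Python builtins, so this is not a true sandbox;
        # it is acceptable here only because the code comes from fixed templates.
        local_vars = {'df': df.copy(), 'pd': pd, 'np': np}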

        # Execute the code
        exec(code, {}, local_vars)

        # Get the result
        result = local_vars.get('result', None)

        # Check if result is None or empty
        if result is None:
            return False, None, "Code executed but 'result' variable was not set"

        # Check for empty results (e.g., empty DataFrame, empty Series)
        if isinstance(result, (pd.DataFrame, pd.Series)) and len(result) == 0:
            return True, result, "Query returned no results (empty result set)"

        return True, result, ""

    except KeyError as e:
        # pandas raises KeyError when a column label does not exist
        return False, None, f"Column name error: {e}. Check that column names match exactly (including parentheses)."

    except NameError as e:
        # Raised when the code references a variable that is not in the namespace
        return False, None, f"NameError: {e}"

    except Exception as e:
        return False, None, f"{type(e).__name__}: {str(e)}"

print("Safe code execution function defined!")

Safe code execution function defined!
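
Because `exec` runs whatever string it is given, a stricter pipeline could inspect the code before executing it. The sketch below uses the standard-library `ast` module to accept only a single `result = <expression>` statement built from a small set of names. `ALLOWED_NAMES` and `is_code_allowed` are hypothetical helpers, and this check reduces the attack surface without making execution fully safe (for example, it does not stop a DataFrame method from writing a file).

```python
import ast

# Names the generated code may reference (an assumption for this sketch).
ALLOWED_NAMES = {"df", "pd", "np", "result", "len"}


def is_code_allowed(code: str) -> bool:
    """Accept only one assignment to `result` that uses whitelisted names."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False

    # Exactly one statement, and it must be `result = <expression>`.
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.Assign):
        return False
    targets = tree.body[0].targets
    if (len(targets) != 1
            or not isinstance(targets[0], ast.Name)
            or targets[0].id != "result"):
        return False

    for node in ast.walk(tree):
        # Reject any identifier outside the whitelist (e.g. __import__, open).
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            return False
        # Reject private or dunder attribute access such as df.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            return False
    return True


print(is_code_allowed("result = df['Rating'].mean()"))
print(is_code_allowed("result = __import__('os').system('ls')"))
```

A caller could run this check first and return a failure tuple when it rejects the code.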


### 4.3 Format Results into Natural Language

In [104]:
from typing import Any

def format_result_to_natural_language(question: str, result: Any) -> str:
    """
    Convert computational results into natural language.

    Args:
        question: Original question
        result: Computational result

    Returns:
        Natural language answer
    """
    # Convert result to string representation
    if isinstance(result, pd.DataFrame):
        result_str = result.to_string()
    elif isinstance(result, pd.Series):
        result_str = result.to_string()
    else:
        result_str = str(result)

    prompt = f"""Convert this result into a natural language answer.

Question: {question}
Result: {result_str}

Natural answer:"""

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            temperature=0.7,
            do_sample=True,
            top_p=0.9
        )

    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer

print("Result formatting function defined!")

Result formatting function defined!
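
Since the function above samples from the model (`do_sample=True`), its wording can change between runs, and a small model may not repeat the computed value faithfully. For scalar results, a deterministic fallback is one option. The sketch below is a hypothetical helper, `format_scalar_answer`, that involves no model at all and simply pairs the question with the rounded value.

```python
def format_scalar_answer(question: str, value: object) -> str:
    """Render a scalar result as text without calling a language model."""
    if isinstance(value, float):
        # Round floats to two decimals for readability.
        rendered = f"{value:.2f}"
    else:
        rendered = str(value)
    # Drop the trailing question mark and append the value.
    return f"{question.strip().rstrip('?').strip()}: {rendered}"


print(format_scalar_answer("What is the average runtime of action movies?", 114.669140625))
print(format_scalar_answer("How many movies were released in 2010?", 296))
```

The trade-off is less natural phrasing in exchange for an answer that always contains the computed value.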


### 4.4 Complete Factual Query Pipeline

In [105]:
def factual_query(question: str, verbose: bool = True) -> Dict:
    """
    Answer factual/statistical questions using code generation.

    Args:
        question: The factual question to answer
        verbose: Whether to print detailed information

    Returns:
        Dictionary containing the answer and execution details
    """
    if verbose:
        print(f"\n{'='*60}")
        print(f"FACTUAL QUERY: {question}")
        print(f"{'='*60}")

    # Step 1: Generate code
    if verbose:
        print("\nGenerating pandas code...")
    code = generate_pandas_code(question)

    if verbose:
        print(f"\nGenerated Code:\n{code}")

    # Step 2: Execute code
    if verbose:
        print("\nExecuting code...")
    success, result, error = execute_code_safely(code, df)

    if not success:
        if verbose:
            print(f"\nERROR: {error}")
        return {
            'question': question,
            'success': False,
            'error': error,
            'code': code
        }

    # Step 3: Check for empty results
    if error and "empty result set" in error.lower():
        # Query succeeded but returned no results
        if verbose:
            print(f"\nWARNING: {error}")
            print(f"\nRaw Result: {result}")

        return {
            'question': question,
            'success': True,
            'code': code,
            'raw_result': result,
            'answer': f"The query executed successfully but returned no results. This might mean there are no movies matching the criteria in the dataset.",
            'warning': error
        }

    # Step 4: Format result
    if verbose:
        print(f"\nRaw Result:\n{result}")
        print("\nFormatting to natural language...")

    answer = format_result_to_natural_language(question, result)

    response = {
        'question': question,
        'success': True,
        'code': code,
        'raw_result': result,
        'answer': answer
    }

    if verbose:
        print(f"\nFINAL ANSWER:\n{answer}")
        print(f"{'='*60}\n")

    return response

print("Factual query function defined!")

Factual query function defined!


### 4.5 Test Factual Queries (5+ Examples)

In [106]:
# Example 1: Average rating of James Bond movies
result1 = factual_query("What's the average rating of James Bond movies?")


FACTUAL QUERY: What's the average rating of James Bond movies?

Generating pandas code...

Generated Code:
result = df[df['Title'].str.contains('Bond', case=False, na=False)]['Rating'].mean()

Executing code...

Raw Result:
7.7

Formatting to natural language...

FINAL ANSWER:
The average rating of James Bond movies is 7.7.



In [107]:
# Example 2: Highest-grossing director
result2 = factual_query("Which director has the highest-grossing film?")


FACTUAL QUERY: Which director has the highest-grossing film?

Generating pandas code...

Generated Code:
result = df.loc[df['Gross(Million)'].idxmax(), 'Director']

Executing code...

Raw Result:
J.J. Abrams

Formatting to natural language...

FINAL ANSWER:
The highest-grossing film is The Terminator.



In [108]:
# Example 3: Movies released in 2010
result3 = factual_query("How many movies were released in 2010?")


FACTUAL QUERY: How many movies were released in 2010?

Generating pandas code...

Generated Code:
result = len(df[df['Year'] == 2010])

Executing code...

Raw Result:
296

Formatting to natural language...

FINAL ANSWER:
2010 was the 76th year of the year in movies released in the United States.



In [109]:
# Example 4: Top 5 highest-rated movies
result4 = factual_query("What are the top 5 highest-rated movies?")


FACTUAL QUERY: What are the top 5 highest-rated movies?

Generating pandas code...

Generated Code:
result = df.nlargest(5, 'Rating')[['Title', 'Rating']]

Executing code...

Raw Result:
                                    Title  Rating
0                The Shawshank Redemption     9.3
9                           The Godfather     9.2
4360                      The Chaos Class     9.2
9050  Ramayana: The Legend of Prince Rama     9.2
8572                                Daman     9.1

Formatting to natural language...

FINAL ANSWER:
The Shawshank Redemption is rated 0 out of a total of 5 by critics, ranked 9th out of 5 and The Godfather is rated 10th out of 5 by critics.



In [110]:
# Example 5: Average runtime by genre
result5 = factual_query("What is the average runtime of action movies?")


FACTUAL QUERY: What is the average runtime of action movies?

Generating pandas code...

Generated Code:
result = df[df['Genres'].str.contains('Action', case=False, na=False)]['Runtime'].mean()

Executing code...

Raw Result:
114.669140625

Formatting to natural language...

FINAL ANSWER:
114.669140625



In [111]:
# Example 6: Christopher Nolan movies count
result6 = factual_query("How many movies did Christopher Nolan direct in this dataset?")


FACTUAL QUERY: How many movies did Christopher Nolan direct in this dataset?

Generating pandas code...

Generated Code:
result = len(df[df['Director'].str.contains('Nolan', case=False, na=False)])

Executing code...

Raw Result:
11

Formatting to natural language...

FINAL ANSWER:
Christopher Nolan directed 11 movies in this dataset.



---
## Part 5: Query Classification and Integration

### 5.1 Implement Query Classifier

In [112]:
def classify_query(question: str) -> str:
    """
    Classify a query as either 'semantic' or 'factual'.

    Args:
        question: The question to classify

    Returns:
        'semantic' or 'factual'
    """
    # Keywords that indicate factual queries
    factual_keywords = [
        'how many', 'count', 'average', 'mean', 'sum', 'total',
        'highest', 'lowest', 'maximum', 'minimum', 'top', 'bottom',
        'statistics', 'number of', 'percentage', 'ratio',
        'most', 'least', 'compare', 'comparison'
    ]

    # Keywords that indicate semantic queries
    semantic_keywords = [
        'about', 'theme', 'similar', 'like', 'recommend',
        'feature', 'explore', 'deal with', 'focus on',
        'what movies', 'which films', 'tell me about'
    ]

    question_lower = question.lower()

    # Count keyword matches
    factual_score = sum(1 for kw in factual_keywords if kw in question_lower)
    semantic_score = sum(1 for kw in semantic_keywords if kw in question_lower)

    # If factual keywords dominate, classify as factual
    if factual_score > semantic_score:
        return 'factual'
    else:
        return 'semantic'

print("Query classifier defined!")

Query classifier defined!
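
The classifier uses substring checks, so `'count'` also matches "country" and `'like'` matches "likely". A minimal sketch of whole-word matching with regular expressions follows; `count_keyword_hits` is a hypothetical helper that could replace the two `sum(...)` expressions above.

```python
import re
from typing import Iterable


def count_keyword_hits(text: str, keywords: Iterable[str]) -> int:
    """Count keywords that occur in the text as whole words or phrases."""
    lowered = text.lower()
    return sum(
        1 for kw in keywords
        # \b anchors the keyword at word boundaries; re.escape guards
        # against keywords that contain regex metacharacters.
        if re.search(rf"\b{re.escape(kw)}\b", lowered)
    )


print(count_keyword_hits("Which country produced it?", ["count"]))
print(count_keyword_hits("How many movies?", ["how many", "count"]))
```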


### 5.2 Unified Question-Answering Interface

In [113]:
def answer_question(question: str, verbose: bool = True) -> Dict:
    """
    Unified interface to answer any question about movies.
    Automatically classifies and routes to appropriate pipeline.

    Args:
        question: The question to answer
        verbose: Whether to print detailed information

    Returns:
        Dictionary containing the answer and metadata
    """
    # Classify the query
    query_type = classify_query(question)

    if verbose:
        print(f"\n{'#'*70}")
        print(f"QUESTION: {question}")
        print(f"CLASSIFIED AS: {query_type.upper()}")
        print(f"{'#'*70}")

    # Route to appropriate pipeline
    try:
        if query_type == 'semantic':
            result = semantic_query(question, verbose=verbose)
            result['query_type'] = 'semantic'
        else:
            result = factual_query(question, verbose=verbose)
            result['query_type'] = 'factual'

        return result

    except Exception as e:
        if verbose:
            print(f"\nERROR: {str(e)}")
        return {
            'question': question,
            'query_type': query_type,
            'success': False,
            'error': str(e)
        }

print("Unified question-answering interface defined!")

Unified question-answering interface defined!


### 5.3 Comprehensive Testing with Diverse Queries

In [114]:
# Test with a mix of semantic and factual queries
test_questions = [
    "What are some movies about space exploration?",  # Semantic
    "How many movies have a rating above 8.5?",  # Factual
    "Which films feature heist themes?",  # Semantic
    "What is the average gross revenue of superhero movies?",  # Factual
    "Tell me about movies with twist endings",  # Semantic
    "Which year had the most movie releases?",  # Factual
]

results = []
for question in test_questions:
    result = answer_question(question, verbose=True)
    results.append(result)
    print("\n" + "="*70 + "\n")


######################################################################
QUESTION: What are some movies about space exploration?
CLASSIFIED AS: SEMANTIC
######################################################################

SEMANTIC QUERY: What are some movies about space exploration?

ANSWER:
Based on the movie database, here are relevant movies:
1. SpaceCamp (1986)
2. IO (2019)
3. Explorers (1985)
4. Red Planet (2000)
5. Voyagers (2021)

SOURCE MOVIES:
1. SpaceCamp (1986) - Relevance: 0.548
2. IO (2019) - Relevance: 0.525
3. Explorers (1985) - Relevance: 0.519
4. Red Planet (2000) - Relevance: 0.512
5. Voyagers (2021) - Relevance: 0.507




######################################################################
QUESTION: How many movies have a rating above 8.5?
CLASSIFIED AS: FACTUAL
######################################################################

FACTUAL QUERY: How many movies have a rating above 8.5?

Generating pandas code...

Generated Code:
result = len(df[df['Rating'] > 8

### 5.4 Error Handling and Edge Cases

In [115]:
# Test edge cases
edge_cases = [
    "",  # Empty query
    "What?",  # Vague query
    "Tell me everything about all movies",  # Too broad
    "How many movies were directed by a director who doesn't exist?",  # No results
]

print("Testing edge cases...\n")
for question in edge_cases:
    if question:  # Empty queries are skipped here, not passed to the pipeline
        try:
            result = answer_question(question, verbose=False)
            print(f"Question: {question}")
            print(f"Status: {'Success' if result.get('success', True) else 'Failed'}")
            print("-" * 50)
        except Exception as e:
            print(f"Question: {question}")
            print(f"Error: {str(e)}")
            print("-" * 50)

Testing edge cases...

Question: What?
Status: Success
--------------------------------------------------
Question: Tell me everything about all movies
Status: Success
--------------------------------------------------
Question: How many movies were directed by a director who doesn't exist?
Status: Success
--------------------------------------------------
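
The loop above skips the empty query and reports success for every remaining case, including questions no template supports. One way to surface such inputs is to validate the question before routing it. The sketch below is a hypothetical pre-check: `validate_question` and its `min_words` threshold are assumptions for illustration, not part of the pipeline defined earlier.

```python
from typing import Optional


def validate_question(question: str, min_words: int = 2) -> Optional[str]:
    """Return an error message for unusable input, or None if it looks answerable."""
    cleaned = question.strip()
    if not cleaned:
        return "Question is empty."
    if len(cleaned.split()) < min_words:
        return "Question is too short to interpret."
    return None


for q in ["", "What?", "How many movies were released in 2010?"]:
    print(repr(q), "->", validate_question(q))
```

A caller could return a structured failure whenever the check yields a message, so that vague or empty queries are reported as failures.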


---
## Summary and Conclusions

### System Capabilities
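This movie question-answering system implements:

1. **Semantic Query Processing (Retrieval)**:
   - Vector embeddings of movie descriptions (`all-MiniLM-L6-v2`)
   - Similarity-based retrieval of the top 5 matches
   - Answers assembled as a ranked list of retrieved titles, without LLM generation

2. **Factual Query Processing (Template-Based Code Generation)**:
   - Pandas code selected from predefined templates by keyword matching
   - Code execution via `exec` in a restricted namespace, with error reporting
   - Natural language result formatting with `flan-t5-base`, which is not always faithful to the raw result (see Examples 2 to 4 in Section 4.5)

3. **Query Routing**:
   - Keyword-based classification of queries as semantic or factual
   - Unified interface for all questions
   - Exceptions caught and returned as structured errors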