# Test Movie Web Search Retriever Functions

This notebook tests the `retrieve_movie_plot`, `retrieve_movie_curiosities`, and `retrieve_movie_reviews` functions against live web sources, using "The Matrix" as the test title.

In [1]:
import os
import sys
import asyncio
import platform
import nest_asyncio
from dotenv import load_dotenv
from pathlib import Path

# Apply nest_asyncio to allow running async functions in Jupyter
nest_asyncio.apply()

# Set event loop policy for Windows if possible
if platform.system() == 'Windows':
    try:
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    except Exception as e:
        print(f"Note: Could not set WindowsSelectorEventLoopPolicy - {e}")

# Locate the project root: three directory levels above the current working directory
project_root_path = Path(os.getcwd()).resolve().parents[2]
dotenv_path = project_root_path / ".env"  # Path to .env

# Load environment variables
load_dotenv(dotenv_path)

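# Add the back-end root to sys.path so the `app` package can be imported.
# NOTE: this is a hard-coded, machine-specific path; adjust it for your checkout.
sys.path.append("D:\\Internship\\recsys\\back_end")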

# Import the retriever functions
from app.web_search.movie_cast_and_crew_web_search_retriever import (
    retrieve_movie_plot,
    retrieve_movie_curiosities,
    retrieve_movie_reviews
)
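
The setup cell locates the project root with a fixed `parents[2]` offset, which only works when the notebook runs from one specific directory depth. A possible alternative, sketched here under the assumption that the `.env` file sits in the project root, is to walk upward until a marker file is found. The helper name `find_project_root` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = ".env") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # No marker found; the caller decides how to handle this.
```

Calling `find_project_root(Path.cwd())` would then return the same directory regardless of how deeply the notebook is nested, or `None` if no `.env` exists above it.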

## Define Test Parameters

In [2]:
test_movie_title = "The Matrix"
query_curiosities = "Tell me some trivia about The Matrix."
query_reviews = "What did people think of The Matrix?"

## Test `retrieve_movie_plot`

In [3]:
plot_result = await retrieve_movie_plot(test_movie_title)
print(f"Plot for '{test_movie_title}':\n{plot_result[:500]}...")
assert isinstance(plot_result, str)
assert len(plot_result) > 100

Retrieving plot for: The Matrix
Rate limit detected on attempt 1. Retrying in 5.0 seconds...
Rate limit detected on attempt 2. Retrying in 10.0 seconds...
Rate limit detected on attempt 3. Retrying in 20.0 seconds...
Rate limit hit after 4 attempts. Failing search for query: 'The Matrix site:imdb.com'
Failed to get IMDb ID for 'The Matrix' after retries: https://html.duckduckgo.com/html 202 Ratelimit
Could not find IMDb ID for 'The Matrix', falling back to general search.
Falling back to general web search for plot.
Rate limit detected on attempt 1. Retrying in 5.0 seconds...
Rate limit detected on attempt 2. Retrying in 10.0 seconds...
[INIT].... → Crawl4AI 0.6.1
[FETCH]... ↓ https://en.wikipedia.org/wiki/The_Matrix                                                             | ✓ | ⏱: 0.35s
[SCRAPE].. ◆ https://en.wikipedia.org/wiki/The_Matrix                                                             | ✓ | ⏱: 0.93s
[COMPLETE] ● https://en.wikipedia.org/wiki/The_Matrix                
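
The log above shows the search being retried with waits of 5, 10, and 20 seconds before giving up on the fourth attempt, which is the pattern of exponential backoff. The retriever's actual implementation is not shown in this notebook, so the following is only a minimal sketch of that pattern. `RateLimitError` and `retry_with_backoff` are hypothetical names, and the injectable `sleep` parameter is an assumption added to make the helper easy to test.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the search client's rate-limit error."""


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 5.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call `operation`, doubling the wait after each rate-limited attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Out of attempts; let the caller fall back.
            delay = base_delay * 2 ** (attempt - 1)  # 5.0, 10.0, 20.0, ...
            print(f"Rate limit detected on attempt {attempt}. Retrying in {delay} seconds...")
            await sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, a call that is rate limited every time waits 5, 10, and 20 seconds and then re-raises on the fourth attempt, so the caller can switch to a fallback such as the general web search seen in the log.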

## Test `retrieve_movie_curiosities`

In [4]:
curiosity_chunks = await retrieve_movie_curiosities(test_movie_title, query_curiosities, k=5)
print(f"Retrieved {len(curiosity_chunks)} curiosity chunks for '{test_movie_title}':")
for i, chunk in enumerate(curiosity_chunks):
    print(f"  Chunk {i+1} (Score: {chunk['score']:.4f}, Source: {chunk['source']}): {chunk['text'][:100]}...")
assert isinstance(curiosity_chunks, list)
if curiosity_chunks:
    assert isinstance(curiosity_chunks[0], dict)
    assert 'text' in curiosity_chunks[0]
    assert 'source' in curiosity_chunks[0]
    assert 'score' in curiosity_chunks[0]

Retrieving curiosities for: The Matrix with query: 'Tell me some trivia about The Matrix.'
Attempting to find trivia on triviaforyou.com with query: "The Matrix" trivia site:triviaforyou.com
Found potential triviaforyou.com URL: https://triviaforyou.com/matrix-trivia/
[INIT].... → Crawl4AI 0.6.1
[FETCH]... ↓ https://triviaforyou.com/matrix-trivia/                                                              | ✓ | ⏱: 0.73s
[SCRAPE].. ◆ https://triviaforyou.com/matrix-trivia/                                                              | ✓ | ⏱: 0.02s
[COMPLETE] ● https://triviaforyou.com/matrix-trivia/                                                              | ✓ | ⏱: 0.76s
Found 'entry-content' div on triviaforyou.com.
Successfully extracted trivia from triviaforyou.com.
FAISS index built successfully with 9 vectors.
Retrieved 5 curiosity chunks for The Matrix.
Retrieved 5 curiosity chunks for 'The Matrix':
  Chunk 1 (Score: 0.1828, Source: https://triviaforyou.com/matrix-trivia/): T
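
The assertions in the curiosities test inspect only the first chunk, so a malformed entry later in the list would go unnoticed. One way to tighten this, sketched here with a hypothetical helper `validate_chunks`, is to check every chunk for the `text`, `source`, and `score` keys that the test already expects.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("text", "source", "score")


def validate_chunks(chunks: List[Dict[str, Any]]) -> None:
    """Assert that `chunks` is a list of dicts that each carry the required keys."""
    assert isinstance(chunks, list), f"expected list, got {type(chunks).__name__}"
    for i, chunk in enumerate(chunks):
        assert isinstance(chunk, dict), f"chunk {i} is not a dict"
        missing = [key for key in REQUIRED_KEYS if key not in chunk]
        assert not missing, f"chunk {i} is missing keys: {missing}"
```

The same helper could be reused for the review chunks, since both tests check the same three keys.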

## Test `retrieve_movie_reviews`

In [5]:
review_chunks = await retrieve_movie_reviews(test_movie_title, query_reviews, initial_k=15)
print(f"Retrieved {len(review_chunks)} review chunks for '{test_movie_title}':")
total_chars = 0
for i, chunk in enumerate(review_chunks):
    print(f"  Chunk {i+1} (Score: {chunk['score']:.4f}, Source: {chunk['source']}): {chunk['text'][:100]}...")
    total_chars += len(chunk.get('text', ''))
print(f"Total characters in retrieved reviews: {total_chars}")
assert isinstance(review_chunks, list)
if review_chunks:
    assert isinstance(review_chunks[0], dict)
    assert 'text' in review_chunks[0]
    assert 'source' in review_chunks[0]
    assert 'score' in review_chunks[0]
assert total_chars <= 25000  # Total review text must stay within the character limit

Retrieving reviews for: The Matrix with query: 'What did people think of The Matrix?'
Attempting to fetch reviews from IMDb: https://www.imdb.com/title/tt0133093/reviews/
[INIT].... → Crawl4AI 0.6.1
[FETCH]... ↓ https://www.imdb.com/title/tt0133093/reviews/                                                        | ✓ | ⏱: 0.54s
[SCRAPE].. ◆ https://www.imdb.com/title/tt0133093/reviews/                                                        | ✓ | ⏱: 0.10s
[COMPLETE] ● https://www.imdb.com/title/tt0133093/reviews/                                                        | ✓ | ⏱: 0.65s
Found 94 potential review containers using combined strategies on IMDb page.
Using fallback text extraction for container: 10
/
10
Just wow
When this came out, I was living with a roommate. He went out and saw it, came home...
Using fallback text extraction for container: When this came out, I was living with a roommate. He went out and saw it, came home and said, "Dude,...
Using fallback text extraction for co
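
The review test asserts that the combined text of the returned chunks stays at or below 25,000 characters. How `retrieve_movie_reviews` enforces this is not shown here. As an illustration only, a budget of this kind could be applied by keeping chunks in ranked order until the next one would exceed the limit. The function name `apply_char_budget` is hypothetical.

```python
from typing import Any, Dict, List


def apply_char_budget(
    chunks: List[Dict[str, Any]], max_chars: int = 25000
) -> List[Dict[str, Any]]:
    """Keep chunks in order until adding the next one would exceed `max_chars`."""
    kept: List[Dict[str, Any]] = []
    used = 0
    for chunk in chunks:
        size = len(chunk.get("text", ""))
        if used + size > max_chars:
            break  # Stop at the first chunk that does not fit.
        kept.append(chunk)
        used += size
    return kept
```

Stopping at the first chunk that does not fit preserves the ranking order; a variant could instead skip oversized chunks and keep scanning for smaller ones.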