# Semantic Search with ChromaDB

First, let's install all the required packages for our semantic search system.

In [16]:
# Install required packages
# %pip install chromadb pandas openai python-dotenv

print("All packages installed successfully!")

All packages installed successfully!


## Import Required Libraries

First, let's import all the necessary libraries for our semantic search system.

In [17]:
import chromadb
import pandas as pd
import openai
import os
import json

from pathlib import Path
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from dotenv import load_dotenv

## Setup Environment and API Keys

Load environment variables and set up the OpenAI API key.

**Important:** You need to create a `.env` file in the same directory as this notebook with your OpenAI API key:

```env
OPENAI_API_KEY=your_api_key_here
```

In [18]:
# Load environment variables from .env file
load_dotenv()

# Set up OpenAI API key
try:
    openai.api_key = os.environ["OPENAI_API_KEY"]
    EMBEDDING_MODEL = os.environ["OPENAI_EMBEDDINGS_MODEL"] # "text-embedding-ada-002"
    print("✅ Environment setup complete!")
    print(f"🔑 OpenAI API key loaded (ends with: ...{openai.api_key[-4:]})")
    print(f"🤖 Using embedding model: {EMBEDDING_MODEL}")
except KeyError:
    print("❌ ERROR: OPENAI_API_KEY not found in environment variables!")
    print("\n📝 To fix this:")
    print("1. Create a file named '.env' in the same directory as this notebook")
    print("2. Add this line to the .env file:")
    print("   OPENAI_API_KEY=your_actual_api_key_here")
    print("3. Replace 'your_actual_api_key_here' with your OpenAI API key")
    print("4. Restart the notebook kernel and run this cell again")
    raise

✅ Environment setup complete!
🔑 OpenAI API key loaded (ends with: ...co0A)
🤖 Using embedding model: text-embedding-3-small


## Define File Paths

Set up the paths for our data files. We'll use the `dataset.json` file in the same directory as our data source.

In [19]:
# Define paths relative to the current notebook location
current_dir = Path.cwd()
input_datapath = current_dir / "dataset.json"
db_path = current_dir / "chroma_db"

# Create database directory if it doesn't exist
if not db_path.exists():
    db_path.mkdir()

print(f"Data file: {input_datapath}")
print(f"Database path: {db_path}")
print(f"Data file exists: {input_datapath.exists()}")

Data file: /Users/adiazpace/Documents/GitHub/ai-agents-desde-cero/semantic-search/dataset.json
Database path: /Users/adiazpace/Documents/GitHub/ai-agents-desde-cero/semantic-search/chroma_db
Data file exists: True


## Load and Explore the Dataset

Load the movie data from the JSON file and convert it to a pandas DataFrame for easier manipulation.

In [20]:
# Load JSON data into a DataFrame
with open(input_datapath, 'r') as f:
    movie_data = json.load(f)

df = pd.DataFrame(movie_data)

# Add an ID column based on the index
df['id'] = df.index.astype(str)

# Display basic information about the dataset
print("Dataset Information:")
print(f"Number of movies: {len(df)}")
print(f"Columns: {list(df.columns)}")

Dataset Information:
Number of movies: 10
Columns: ['title', 'release_date', 'genres', 'original_language', 'vote_average', 'overview', 'tagline', 'combined', 'n_tokens', 'embedding', 'id']


## Preview the Data

Let's take a look at the first few rows to understand the structure of our data.

In [21]:
# Display the first few rows
print("First 3 movies in the dataset:")
display(df[['title', 'release_date', 'genres', 'vote_average', 'overview']].head(3))

# Check if embeddings are present
if 'embedding' in df.columns:
    print(f"\nEmbedding dimensions: {len(df['embedding'].iloc[0])}")
    print("Embeddings are already calculated and available!")
else:
    print("\nNo embeddings found in the dataset.")

First 3 movies in the dataset:


Unnamed: 0,title,release_date,genres,vote_average,overview
0,The Pope's Exorcist,2023-04-05,"['Horror', 'Mystery', 'Thriller']",7.4,"Father Gabriele Amorth, Chief Exorcist of the ..."
1,Ant-Man and the Wasp: Quantumania,2023-02-15,"['Action', 'Adventure', 'Science Fiction']",6.6,Super-Hero partners Scott Lang and Hope van Dy...
2,Ghosted,2023-04-18,"['Action', 'Comedy', 'Romance']",7.2,Salt-of-the-earth Cole falls head over heels f...



Embedding dimensions: 1536
Embeddings are already calculated and available!


## Initialize ChromaDB

Set up the ChromaDB client and create or get the collection for storing our movie embeddings.

In [22]:
# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path=str(db_path))

# Create embedding function
embedding_function = OpenAIEmbeddingFunction(
    api_key=openai.api_key, 
    model_name=EMBEDDING_MODEL
)

print("ChromaDB client initialized successfully!")

ChromaDB client initialized successfully!


## Create or Load Collection

Create a new collection or load an existing one. If the collection doesn't exist, we'll create it and populate it with our movie data.

In [23]:
# Try to get existing collection, create if it doesn't exist
collection_name = "movies"

try:
    movies_collection = chroma_client.get_collection(
        name=collection_name, 
        embedding_function=embedding_function
    )
    print(f"Loaded existing collection '{collection_name}'")
    print(f"Collection count: {movies_collection.count()}")
    
except (ValueError, Exception):  # Catch both ValueError and NotFoundError
    # Collection doesn't exist, create it
    movies_collection = chroma_client.create_collection(
        name=collection_name, 
        embedding_function=embedding_function
    )
    print(f"Created new collection '{collection_name}'")
    
    # Prepare metadata for each movie
    metadatas = []
    for _, row in df.iterrows():
        metadata = {
            'title': row['title'],
            'release_date': row['release_date'],
            'genres': row['genres'],
            'vote_average': row['vote_average'],
            'overview': row['overview']
        }
        metadatas.append(metadata)
    
    # Add the movie data to the collection with embeddings and metadata
    print("Adding pre-calculated embeddings and metadata to the collection...")
    movies_collection.add(
        ids=df.id.astype(str).tolist(),  # ChromaDB requires string IDs
        embeddings=df.embedding.tolist(),
        metadatas=metadatas
    )
    
    print(f"Successfully added {len(df)} movies to the collection!")
    print(f"Collection count: {movies_collection.count()}")

Loaded existing collection 'movies'
Collection count: 10


## Define Search Function

Create a function to query the collection and return relevant movies based on semantic similarity.

In [24]:
def query_collection(collection, query, max_results=10, dataframe=None):
    """
    Query the ChromaDB collection for similar movies.
    
    Args:
        collection: ChromaDB collection object
        query: Search query string
        max_results: Maximum number of results to return
        dataframe: Optional DataFrame to get additional movie details
    
    Returns:
        Search results with movie information
    """
    # Perform the search
    search_results = collection.query(
        query_texts=[query],
        n_results=max_results,
        include=['distances', 'metadatas']
    )
    
    # If dataframe is provided, get additional details
    if dataframe is not None:
        result_ids = search_results['ids'][0]
        movies_df = dataframe[dataframe['id'].isin(result_ids)]
        return movies_df
    
    return search_results

def display_search_results(results, query, max_display=5):
    """
    Display search results in a formatted way.
    """
    print(f"\n🔍 Search Results for: '{query}'")
    print("=" * 50)
    
    if isinstance(results, pd.DataFrame) and not results.empty:
        for idx, (_, movie) in enumerate(results.head(max_display).iterrows()):
            print(f"\n{idx + 1}. {movie['title']} ({movie['release_date'][:4] if 'release_date' in movie else 'N/A'})")
            print(f"   Rating: ⭐ {movie.get('vote_average', 'N/A')}/10")
            print(f"   Genres: {movie.get('genres', 'N/A')}")
            if 'overview' in movie:
                overview = movie['overview'][:150] + "..." if len(movie['overview']) > 150 else movie['overview']
                print(f"   Plot: {overview}")
    else:
        print("No results found.")

print("Search functions defined successfully!")

Search functions defined successfully!


## Test the Semantic Search

Let's test our semantic search system with some example queries.

In [25]:
# Test queries
test_queries = [
    "superhero adventure",
    "horror movie",
    "family fantasy film",
    "action thriller"
]

print("🎬 Testing Semantic Search System")
print("=" * 40)

for query in test_queries:
    results = query_collection(movies_collection, query, max_results=3, dataframe=df)
    display_search_results(results, query, max_display=3)
    print("\n" + "-" * 40)

🎬 Testing Semantic Search System

🔍 Search Results for: 'superhero adventure'

1. Avatar: The Way of Water (2022)
   Rating: ⭐ 7.7/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follo...

2. Guardians of the Galaxy Volume 3 (2023)
   Rating: ⭐ 8.3/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mi...

3. Creed III (2023)
   Rating: ⭐ 7.3/10
   Genres: ['Drama', 'Action']
   Plot: After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodig...

----------------------------------------

🔍 Search Results for: 'horror movie'

1. Ant-Man and the Wasp: Quantumania (2023)
   Ratin

## Interactive Search

Now you can perform your own searches! Try different queries to see how the semantic search works.

In [26]:
# Interactive search cell
# Modify the query below to search for different types of movies

user_query = "magical adventure with children"  # Change this to your desired search
num_results = 5  # Change this to get more or fewer results

search_results = query_collection(movies_collection, user_query, max_results=num_results, dataframe=df)
display_search_results(search_results, user_query, max_display=num_results)


🔍 Search Results for: 'magical adventure with children'

1. Ant-Man and the Wasp: Quantumania (2023)
   Rating: ⭐ 6.6/10
   Genres: ['Action', 'Adventure', 'Science Fiction']
   Plot: Super-Hero partners Scott Lang and Hope van Dyne, along with with Hope's parents Janet van Dyne and Hank Pym, and Scott's daughter Cassie Lang, find t...

2. Avatar: The Way of Water (2022)
   Rating: ⭐ 7.7/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follo...

3. Guardians of the Galaxy Volume 3 (2023)
   Rating: ⭐ 8.3/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mi...

4. Creed III (2023)
   Rating: ⭐ 7.3/10
   Genres: ['Drama', 'Action']
   Plot: After dominating the boxin

## Advanced Search Analysis

Let's analyze the search results in more detail, including similarity scores.

In [27]:
def detailed_search_analysis(collection, query, max_results=5):
    """
    Perform a detailed search analysis including similarity scores.
    """
    results = collection.query(
        query_texts=[query],
        n_results=max_results,
        include=['distances', 'metadatas', 'documents']
    )
    
    print(f"\n📊 Detailed Analysis for: '{query}'")
    print("=" * 60)
    
    for i, (movie_id, distance, metadata) in enumerate(zip(
        results['ids'][0], 
        results['distances'][0], 
        results['metadatas'][0]
    )):
        similarity_score = 1 - distance  # Convert distance to similarity
        
        # Handle case where metadata might be None
        if metadata is None:
            # Fallback to using the DataFrame
            movie_row = df[df['id'] == movie_id].iloc[0]
            title = movie_row['title']
            rating = movie_row['vote_average']
            genres = movie_row['genres']
            overview = movie_row['overview']
        else:
            title = metadata.get('title', 'Unknown')
            rating = metadata.get('vote_average', 'N/A')
            genres = metadata.get('genres', 'N/A')
            overview = metadata.get('overview', '')
        
        print(f"\n{i + 1}. {title}")
        print(f"   Similarity: {similarity_score:.3f} (Distance: {distance:.3f})")
        print(f"   Rating: ⭐ {rating}/10")
        print(f"   Genres: {genres}")
        if overview:
            overview_short = overview[:120] + "..." if len(overview) > 120 else overview
            print(f"   Plot: {overview_short}")

In [28]:
# Example detailed analysis
analysis_query = "science fiction space adventure"
detailed_search_analysis(movies_collection, analysis_query, max_results=3)


📊 Detailed Analysis for: 'science fiction space adventure'

1. Avatar: The Way of Water
   Similarity: -0.881 (Distance: 1.881)
   Rating: ⭐ 7.7/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their...

2. Scream VI
   Similarity: -0.921 (Distance: 1.921)
   Rating: ⭐ 7.3/10
   Genres: ['Horror', 'Mystery', 'Thriller']
   Plot: Following the latest Ghostface killings, the four survivors leave Woodsboro behind and start a fresh chapter.

3. Guardians of the Galaxy Volume 3
   Similarity: -0.924 (Distance: 1.924)
   Rating: ⭐ 8.3/10
   Genres: ['Science Fiction', 'Adventure', 'Action']
   Plot: Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with pro...


## Collection Statistics

Let's examine some statistics about our movie collection.

In [29]:
# Collection statistics
print("📈 Collection Statistics")
print("=" * 30)
print(f"Total movies in collection: {movies_collection.count()}")
print(f"Total movies in DataFrame: {len(df)}")

if not df.empty:
    print(f"\n🎭 Genre Analysis:")
    # Extract unique genres (this is simplified - you might want to parse the genre strings more carefully)
    all_genres = []
    for genres in df['genres']:
        if isinstance(genres, str):
            # Remove brackets and quotes, split by comma
            clean_genres = genres.strip("[]").replace("'", "").split(", ")
            all_genres.extend(clean_genres)
    
    genre_counts = pd.Series(all_genres).value_counts().head(10)
    print(genre_counts)
    
    print(f"\n⭐ Rating Statistics:")
    if 'vote_average' in df.columns:
        print(f"Average rating: {df['vote_average'].mean():.2f}")
        print(f"Highest rated: {df['vote_average'].max():.1f}")
        print(f"Lowest rated: {df['vote_average'].min():.1f}")
        
        # Show top rated movies
        print(f"\n🏆 Top 3 Highest Rated Movies:")
        top_movies = df.nlargest(3, 'vote_average')[['title', 'vote_average', 'release_date']]
        for _, movie in top_movies.iterrows():
            print(f"   {movie['title']} - ⭐ {movie['vote_average']}/10 ({movie['release_date'][:4] if movie['release_date'] else 'N/A'})")

📈 Collection Statistics
Total movies in collection: 10
Total movies in DataFrame: 10

🎭 Genre Analysis:
Action             7
Adventure          6
Science Fiction    3
Comedy             3
Fantasy            3
Horror             2
Mystery            2
Thriller           2
Romance            1
Drama              1
Name: count, dtype: int64

⭐ Rating Statistics:
Average rating: 7.20
Highest rated: 8.3
Lowest rated: 5.9

🏆 Top 3 Highest Rated Movies:
   Guardians of the Galaxy Volume 3 - ⭐ 8.3/10 (2023)
   Avatar: The Way of Water - ⭐ 7.7/10 (2022)
   Dungeons & Dragons: Honor Among Thieves - ⭐ 7.5/10 (2023)


## Cleanup and Summary

Summary of what we've accomplished in this notebook.

In [30]:
print("✅ Semantic Search System Summary")
print("=" * 40)
print("1. ✓ Loaded movie data from dataset.json")
print("2. ✓ Set up ChromaDB with persistent storage")
print("3. ✓ Created/loaded movie collection with embeddings")
print("4. ✓ Implemented semantic search functionality")
print("5. ✓ Tested search with various queries")
print("6. ✓ Analyzed search results and collection statistics")

print(f"\n📊 Final Stats:")
print(f"   Movies in collection: {movies_collection.count()}")
print(f"   Database location: {db_path}")
print(f"   Embedding model: {EMBEDDING_MODEL}")

print("\n🎯 You can now search for movies using natural language queries!")
print("   Try queries like: 'romantic comedy', 'dark thriller', 'family adventure', etc.")

✅ Semantic Search System Summary
1. ✓ Loaded movie data from dataset.json
2. ✓ Set up ChromaDB with persistent storage
3. ✓ Created/loaded movie collection with embeddings
4. ✓ Implemented semantic search functionality
5. ✓ Tested search with various queries
6. ✓ Analyzed search results and collection statistics

📊 Final Stats:
   Movies in collection: 10
   Database location: /Users/adiazpace/Documents/GitHub/ai-agents-desde-cero/semantic-search/chroma_db
   Embedding model: text-embedding-3-small

🎯 You can now search for movies using natural language queries!
   Try queries like: 'romantic comedy', 'dark thriller', 'family adventure', etc.
