# Trailer Recommendation System Pipeline

This notebook implements the full multimodal pipeline for trailer analysis using the Gemini API. It reads from `movie_trailers.csv` and the `trailers/` folder, extracts five types of features, and stores the resulting embeddings in ChromaDB so that similar trailers can be found and recommended.

## 1. Imports and Configuration

In [10]:
%pip install -q -U google-generativeai chromadb opencv-python transformers sentence-transformers librosa scikit-learn scikit-image plotly nbformat

Note: you may need to restart the kernel to use updated packages.




In [2]:
import cv2
import torch
from transformers import CLIPProcessor, CLIPModel
import numpy as np
import os
import pandas as pd
import librosa
from sentence_transformers import SentenceTransformer
import warnings
import time
from skimage.metrics import structural_similarity as ssim
import google.generativeai as genai
import chromadb

# Suppress warnings
warnings.filterwarnings('ignore')

# --- Configuration ---
TRAILERS_DIR = "trailers"
CSV_FILE = "movie_trailers.csv"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

# --- Gemini Configuration ---
# Use the Secrets tab (key icon) in Colab to store GOOGLE_API_KEY
try:
    # `userdata` is only available inside Google Colab
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✓ Gemini API configured from Secrets.")
except Exception:
    # FALLBACK: outside Colab, or if the secret is not set, paste your key here
    print("! Warning: using hardcoded API key.")
    genai.configure(api_key='YOUR_API_KEY')


Using device: cpu


In [3]:
# --- Configuration ---
TRAILERS_DIR = "trailers"
CSV_FILE = "movie_trailers.csv"
CHUNK_DURATION = 15 
chroma_client = chromadb.PersistentClient(path="./trailer_db")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 2. INITIALIZE THREE SEPARATE COLLECTIONS (Late Fusion)
# This allows us to search "Visuals" independently from "Narrative"
vis_collection = chroma_client.get_or_create_collection(name="db_visuals")   
aud_collection = chroma_client.get_or_create_collection(name="db_audio")    
nar_collection = chroma_client.get_or_create_collection(name="db_narrative") 

print(f"--- Trailer Recommendation System ---")
print(f"Using device: {DEVICE}")
print(f"NOTE: This script expects '{CSV_FILE}' to be in the same directory.")
print(f"NOTE: It also expects a '{TRAILERS_DIR}/' folder populated with .mp4 files.")

--- Trailer Recommendation System ---
Using device: cpu
NOTE: This script expects 'movie_trailers.csv' to be in the same directory.
NOTE: It also expects a 'trailers/' folder populated with .mp4 files.
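
Keeping one collection per modality means each search returns its own ranked list, and the lists are only combined afterwards (late fusion). The snippet below is a minimal sketch of that combination step, assuming each modality query has already been reduced to a `{title: distance}` mapping. The function name `fuse_rankings`, the weights, and the `missing_distance` penalty are illustrative assumptions, not part of this notebook.

```python
from typing import Dict, List, Tuple


def fuse_rankings(
    per_modality: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    missing_distance: float = 2.0,
) -> List[Tuple[str, float]]:
    """Combine per-modality distances into one ranking (lower is better)."""
    # Collect every title that appears in at least one modality.
    titles = set()
    for scores in per_modality.values():
        titles.update(scores)

    total_weight = sum(weights[m] for m in per_modality)
    fused = []
    for title in titles:
        # Weighted average; a title missing from a modality gets a fixed penalty.
        score = sum(
            weights[m] * per_modality[m].get(title, missing_distance)
            for m in per_modality
        ) / total_weight
        fused.append((title, score))
    # Sort by score, then by title so ties are deterministic.
    return sorted(fused, key=lambda pair: (pair[1], pair[0]))


# Hypothetical distances, shaped like the results of three separate searches.
results = {
    "visual": {"Film A": 0.20, "Film B": 0.60},
    "audio": {"Film A": 0.50, "Film B": 0.10},
    "narrative": {"Film A": 0.30, "Film B": 0.40},
}
weights = {"visual": 1.0, "audio": 0.5, "narrative": 2.0}
ranking = fuse_rankings(results, weights)
print(ranking)
```

Distances from different embedding spaces are not necessarily on the same scale, so a real fusion step would probably rescale each modality before weighting it.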


## 2. Load All Models

Load the pre-trained CLIP and SentenceTransformer models onto the GPU (if available).

In [4]:
print("Loading AI models...")
# 1a. CLIP for Visual Semantics
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 1b. SentenceTransformer for Text Semantics
text_model = SentenceTransformer('all-MiniLM-L6-v2').to(DEVICE)
print("Models loaded successfully.")

Loading AI models...


Models loaded successfully.


## 3. Feature Extraction Functions

Define the functions that extract the features, using our local models plus Gemini for a deeper analysis. The five planned feature types are:
1.  **Visual (CLIP):** Semantic meaning of frames.
2.  **Visual (Color):** Color histogram for mood.
3.  **Audio:** Tempo and spectral contrast.
4.  **Pacing:** Shot cut detection using SSIM (`ssim` is imported above, but the cell below does not define a pacing function).
5.  **Text:** Semantic meaning of Gemini's description of the trailer.
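
Item 4 names SSIM-based cut detection. The snippet below is a minimal sketch of the idea, assuming grayscale frames are already available as NumPy arrays. It uses a simplified single-window (global) SSIM written in NumPy rather than the windowed `structural_similarity` from scikit-image, and the names `global_ssim` and `count_cuts` as well as the `0.5` threshold are illustrative assumptions.

```python
import numpy as np


def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM over the whole frame (a simplification of windowed SSIM)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


def count_cuts(frames, threshold: float = 0.5) -> int:
    """Count consecutive frame pairs whose similarity drops below the threshold."""
    cuts = 0
    for prev, cur in zip(frames, frames[1:]):
        if global_ssim(prev, cur) < threshold:
            cuts += 1
    return cuts


# Synthetic "video": three noisy dark gradient frames, then three inverted bright ones.
rng = np.random.default_rng(0)
base = np.tile(np.linspace(0, 100, 64), (64, 1))
scene_a = [base + rng.normal(0, 1, base.shape) for _ in range(3)]
scene_b = [255 - base + rng.normal(0, 1, base.shape) for _ in range(3)]
frames = scene_a + scene_b

n_cuts = count_cuts(frames)
print(f"Detected cuts: {n_cuts}")
```

Dividing the cut count by the trailer duration would give a cuts-per-second value that could serve as a pacing feature.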

In [5]:
def normalize(v):
    """Helper to L2 normalize vectors."""
    norm = np.linalg.norm(v)
    if norm == 0: return v
    return v / norm

def extract_full_visuals(cap, frame_interval=24):
    """Extracts average CLIP & Color features for the ENTIRE video."""
    clip_vectors = []
    color_histograms = []
    
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
    
    current_frame = 0
    while True:
        ret, frame = cap.read()
        if not ret: break
        
        # Sample every Nth frame (e.g., once per second)
        if current_frame % frame_interval == 0:
            # CLIP
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            inputs = clip_processor(images=rgb, return_tensors="pt").to(DEVICE)
            with torch.no_grad():
                embedding = clip_model.get_image_features(**inputs)
            clip_vectors.append(embedding.squeeze().cpu().numpy())
            
            # Color Hist
            hist = np.concatenate([cv2.calcHist([frame], [i], None, [32], [0, 256]) for i in range(3)])
            cv2.normalize(hist, hist)
            color_histograms.append(hist.flatten())
            
        current_frame += 1
        
    if not clip_vectors: return np.zeros(512), np.zeros(96)
    
    # Return average vector for the whole trailer
    return np.mean(np.stack(clip_vectors), axis=0), np.mean(np.stack(color_histograms), axis=0)

def extract_full_audio(video_path):
    """Extracts Librosa features for the ENTIRE audio track."""
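    try:
        y, sr = librosa.load(video_path, sr=None)  # Load the full track at its native sample rate
        if len(y) == 0: return np.zeros(8)

        # Tempo estimation needs the onset strength envelope, not the detected onset positions
        onset_env = librosa.onset.onset_strength(y=y, sr=sr)
        # Note: librosa >= 0.10 exposes this function as librosa.feature.tempo
        tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr)
        spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
        mean_spectral = np.mean(spectral_contrast, axis=1)

        tempo_val = tempo[0] if isinstance(tempo, np.ndarray) and tempo.size > 0 else (tempo if np.isscalar(tempo) else 120.0)

        # 1 tempo value + 7 spectral contrast bands = 8 dimensions
        feat = np.concatenate(([tempo_val], mean_spectral[:7]))
        return normalize(feat)
    except Exception:
        # Any decoding or analysis failure yields a zero vector
        return np.zeros(8)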

def get_gemini_trailer_analysis(video_file_obj):
    """
    Analyzes the FULL trailer with Persistent Smart Model Switching.
    Priority: 2.5 Flash -> 2.0 Flash -> 2.0 Flash Lite.
    Behavior: If all models fail due to rate limits, it waits and retries indefinitely 
              until a valid response is recorded.
    """
    prompt = """
    ROLE: Senior Film Archivist, Computer Vision Specialist, and Audio Analyst.
    TASK: Analyze this ENTIRE movie trailer.
    
    OUTPUT: Provide a dense, keyword-rich description in exactly this format:
    
    [VISUAL_STYLE]: <Overall lighting>, <Color Palette>, <Camera work>, <VFX style>.
    [NARRATIVE_ARC]: <Premise>, <Key Action Set Pieces>, <Genre Elements>, <Plot Twist Hints>.
    [AUDIO_LANDSCAPE]: <Score genre>, <Dominant SFX>, <Dialogue tone>.
    [EMOTIONAL_VIBE]: <Key emotions>, <Target Audience Vibe>.
    
    CONSTRAINT: Be specific. Describe the defining characteristics of this film. 
    Do not mention 'the trailer starts with' or timestamps, or any affirmations, just start describing the movie itself.
    """
    
    # Models to try, in priority order
    models_to_try = [
        'gemini-2.5-flash', 
        'gemini-2.0-flash',
        'gemini-2.0-flash-lite-preview-02-05' 
    ]
    
    # Persistent Loop: Keep trying until we get a result
    while True:
        rate_limit_hit_on_all = True # Assume we might hit limits on all
        
        for model_name in models_to_try:
            try:
                # print(f"    (Trying {model_name}...)")
                model = genai.GenerativeModel(model_name)
                response = model.generate_content(
                    [video_file_obj, prompt],
                    generation_config={"temperature": 0},
                    request_options={"timeout": 600} # 10 minute timeout for processing
                )
                return response.text # SUCCESS! Return immediately.
            
            except Exception as e:
                error_msg = str(e).lower()
                if "429" in error_msg or "resourceexhausted" in error_msg:
                    print(f"    ! {model_name} Rate Limit. Switching...")
                    continue # Try the next model in the list
                else:
                    print(f"    ! Non-Retryable Error with {model_name}: {e}")
                    # If it's a safety/policy error, waiting won't fix it. 
                    # We shouldn't retry indefinitely on these.
                    rate_limit_hit_on_all = False
                    break 

        # Reaching this point means no model returned a result: either every model
        # was rate limited, or one of them raised a non-retryable error.
        if rate_limit_hit_on_all:
            wait_time = 60
            print(f"    ! All models rate limited. Sleeping {wait_time}s before retrying movie...")
            time.sleep(wait_time)
            # Loop continues back to the top to try 'gemini-2.5-flash' again
        else:
            # If we hit a fatal error (like a blocked file), return Unknown to avoid infinite loop
            return "Unknown content."
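
The fallback logic above retries forever when every model is rate limited. The snippet below is a minimal sketch of a bounded variant of the same pattern, decoupled from the Gemini SDK so it can be tested offline. The `RateLimited` exception, the `call_with_fallback` helper, and the `max_rounds` parameter are hypothetical names introduced for illustration. The notebook itself detects rate limits by inspecting the error message instead.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimited(Exception):
    """Stand-in for a 429 / resource-exhausted error (hypothetical)."""


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str], str],
    max_rounds: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each model in order; wait and retry when all of them are rate limited."""
    for round_idx in range(max_rounds):
        for name in models:
            try:
                return call(name)  # first success wins
            except RateLimited:
                continue  # try the next model in the list
            except Exception:
                return None  # non-retryable error: give up immediately
        # Every model was rate limited in this round.
        if round_idx < max_rounds - 1:
            sleep(wait_seconds)
    return None


# Usage with a fake backend: the first model is always rate limited.
calls = []


def fake_call(name: str) -> str:
    calls.append(name)
    if name == "primary":
        raise RateLimited(name)
    return f"ok from {name}"


result = call_with_fallback(["primary", "backup"], fake_call, sleep=lambda s: None)
print(result)
```

Capping the number of rounds trades completeness for a guaranteed exit, which may or may not suit a long unattended batch run.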

## 4. Main Processing Pipeline

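This is the main loop. It will:
1.  Read `movie_trailers.csv`.
2.  Loop through each trailer, skipping missing files and titles that are already indexed.
3.  Upload the trailer to Gemini and run all extraction functions.
4.  Concatenate the normalized CLIP and color vectors into a single visual vector.
5.  Store the narrative, visual, and audio vectors in three separate ChromaDB collections.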
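
In the cell below, the visual vector is built by L2-normalizing the CLIP and color vectors separately and then concatenating them. The snippet below is a minimal sketch of why the per-block normalization matters, using random stand-in vectors rather than real embeddings. The magnitudes chosen for `clip_vec` and `color_vec` are assumptions made only to exaggerate the effect.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; leave all-zero vectors untouched."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


rng = np.random.default_rng(0)
clip_vec = rng.normal(0, 10.0, 512)   # stand-in for a CLIP embedding (large values)
color_vec = rng.random(96) * 0.01     # stand-in for a color histogram (small values)

# Without normalization the larger block dominates any distance computation.
raw = np.concatenate([clip_vec, color_vec])
raw_share = np.sum(color_vec ** 2) / np.sum(raw ** 2)

# With per-block normalization each block contributes the same squared length.
balanced = np.concatenate([l2_normalize(clip_vec), l2_normalize(color_vec)])
balanced_share = np.sum(balanced[512:] ** 2) / np.sum(balanced ** 2)

print(f"Color share of squared length, raw:      {raw_share:.2e}")
print(f"Color share of squared length, balanced: {balanced_share:.2f}")
```

After normalization, both blocks carry equal weight regardless of their dimensionality. Multiplying a block by a scalar before concatenating would be one way to weight it differently.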

In [6]:
# 4. MAIN PROCESSING LOOP
try:
    df_trailers = pd.read_csv(CSV_FILE)
    print(f"Found {len(df_trailers)} trailers.")
    
    for idx, row in df_trailers.iterrows():
        title = row['Movie Title']
        yt_link = row.get('YouTube Link', '')
        video_path = os.path.join(TRAILERS_DIR, f"{title}.mp4")
        
        if not os.path.exists(video_path): continue

        # Resume Check (Check Narrative DB)
        existing = nar_collection.get(where={"title": title})
        if existing['ids']:
            print(f"[{idx+1}] Skipping '{title}' (Already Processed)")
            continue
        
        print(f"\n[{idx+1}/{len(df_trailers)}] Processing: {title}")
        
        # A. Upload to Gemini
        print("  ...Uploading to Gemini...")
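        gemini_file = genai.upload_file(path=video_path)
        while gemini_file.state.name == "PROCESSING":
            print('.', end='', flush=True)
            time.sleep(1)
            gemini_file = genai.get_file(gemini_file.name)
        # Skip files that Gemini could not process instead of analyzing them
        if gemini_file.state.name == "FAILED":
            print(f"\n  ! Gemini failed to process '{title}'. Skipping.")
            continue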
            
        # B. Analyze (Single Pass)
        print("  ...Analyzing Full Trailer...")
        
        # 1. Narrative (Gemini)
        gemini_text = get_gemini_trailer_analysis(gemini_file)
        nar_vec = text_model.encode(gemini_text)
        
        # 2. Visuals (CLIP + Color - Full Video)
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frame_int = int(fps) if fps > 0 else 24
        clip_v, color_v = extract_full_visuals(cap, frame_interval=frame_int)
        cap.release()
        vis_vec = np.concatenate([normalize(clip_v), normalize(color_v)])
        
        # 3. Audio (Full Audio)
        aud_vec = extract_full_audio(video_path) # Uses Librosa internally
        
        # C. Store Data (One Entry Per Movie)
        uid = f"{title}_full"
        meta = {
            "title": title, 
            "desc": gemini_text[:100], 
            "link": yt_link
        }
        
        # Add to separate collections
        nar_collection.add(ids=[uid], embeddings=[nar_vec.tolist()], metadatas=[meta], documents=[gemini_text])
        vis_collection.add(ids=[uid], embeddings=[vis_vec.tolist()], metadatas=[meta])
        aud_collection.add(ids=[uid], embeddings=[aud_vec.tolist()], metadatas=[meta])

        # Cleanup
        try:
            genai.delete_file(gemini_file.name)
        except Exception:
            pass  # A failed cleanup should not stop the run
        print(f"  ✓ Indexed {title} (Full)")
        
        # Respect Rate Limits (Small sleep is still good practice)
        time.sleep(2)

except Exception as e:
    print(f"CRITICAL ERROR: {e}")

Found 255 trailers.

[1/255] Processing: Ella McCay
  ...Uploading to Gemini...
...  ...Analyzing Full Trailer...
  ✓ Indexed Ella McCay (Full)

[2/255] Processing: Send Help
  ...Uploading to Gemini...
...  ...Analyzing Full Trailer...
  ✓ Indexed Send Help (Full)

[4/255] Processing: Psycho Killer
  ...Uploading to Gemini...
...  ...Analyzing Full Trailer...
  ✓ Indexed Psycho Killer (Full)

[6/255] Processing: The Amateur
  ...Uploading to Gemini...
....  ...Analyzing Full Trailer...
  ✓ Indexed The Amateur (Full)

[8/255] Processing: The First Omen
  ...Uploading to Gemini...
....  ...Analyzing Full Trailer...
  ✓ Indexed The First Omen (Full)

[9/255] Processing: Kingdom of the Planet of the Apes
  ...Uploading to Gemini...
...  ...Analyzing Full Trailer...
  ✓ Indexed Kingdom of the Planet of the Apes (Full)

[10/255] Processing: The Abyss
  ...Uploading to Gemini...
..  ...Analyzing Full Trailer...
  ✓ Indexed The Abyss (Full)

[11/255] Processing: Quiz Lady
  ...Uploading to Ge

## 5. 2D Visualization of the Trailer Embeddings


In [3]:
import chromadb
import plotly.express as px
from sklearn.manifold import TSNE
import pandas as pd
import numpy as np

# --- 1. SAFE CONNECT (Restores connection if kernel restarted) ---
try:
    print("Connecting to database...")
    # Ensure this path matches where your script saved the data
    client = chromadb.PersistentClient(path="./trailer_db") 
    nar_collection = client.get_collection("db_narrative")
    count = nar_collection.count()
    print(f"✓ Connected to 'db_narrative'. Total Movies Found: {count}")
except Exception as e:
    print(f"⚠ Connection Error: {e}")
    print("Did you run the 'Main Processing' cell? The 'trailer_db' folder might be missing.")
    count = 0

# --- 2. VISUALIZATION FUNCTION ---
def visualize_database(collection):
    # Count the collection that was passed in rather than relying on the global `count`
    if collection.count() == 0:
        print("\n❌ DATABASE IS EMPTY. Cannot visualize.")
        return

    print("Generating 3D DNA Map...")
    
    # Fetch data
    data = collection.get(include=['embeddings', 'metadatas'])
    
    # Check for None and for length explicitly. A bare `if data['embeddings']` is
    # ambiguous when the embeddings come back as a NumPy array.
    if data['embeddings'] is None or len(data['embeddings']) == 0:
        print("No embeddings found in this collection.")
        return

    # Convert to Numpy (Handling the case where it might already be one)
    embeddings = np.array(data['embeddings'])
    metadatas = data['metadatas']
    
    # Safety Check for t-SNE (Needs at least 2 points to calculate distance)
    n_samples = len(embeddings)
    if n_samples < 2:
        print(f"⚠ Not enough data to visualize ({n_samples} item). Need at least 2 movies.")
        return
        
    # Reduce dimensions (t-SNE)
    print(f"Running t-SNE on {n_samples} movies...")
    perp = min(30, n_samples - 1)
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    projections = tsne.fit_transform(embeddings)
    
    # Create DataFrame
    df_viz = pd.DataFrame({
        'x': projections[:, 0],
        'y': projections[:, 1],
        'title': [m.get('title', 'Unknown') for m in metadatas],
        'desc': [m.get('desc', 'No Desc')[:100] + "..." for m in metadatas]
    })
    
    # Interactive Plot
    fig = px.scatter(
        df_viz, x='x', y='y',
        color='title', 
        hover_data=['desc'],
        title=f'Movie Map ({collection.name} - {collection.count()} items)',
        template='plotly_dark'
    )
    fig.show()

# Run it
if count > 0:
    visualize_database(nar_collection)

Connecting to database...
✓ Connected to 'db_narrative'. Total Movies Found: 206
Generating 3D DNA Map...
Running t-SNE on 206 movies...
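
The `perp = min(30, n_samples - 1)` line in the cell above keeps the t-SNE perplexity below the number of samples, which recent scikit-learn versions require. The snippet below is a minimal sketch of that guard as a standalone helper. The name `choose_perplexity` and its `preferred` parameter are hypothetical.

```python
def choose_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Return the preferred perplexity, capped so it stays below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    return float(min(preferred, n_samples - 1))


for n in (2, 10, 206):
    print(n, choose_perplexity(n))
```

For small collections the cap takes over, so the map of a handful of trailers will look quite different from the map of a few hundred.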
