# Notebook 1: Data Collection

## Purpose
This notebook collects raw movie data from four sources:
- **TMDB API**: Movie metadata (budget, cast, crew, genres, runtime, release dates)
- **Box Office Mojo**: Box office revenue data (opening weekend, total domestic, worldwide)
- **OMDb API**: Supplemental metadata and IMDb ratings
- **YouTube Data API**: Trailer view counts and engagement metrics

## Objectives
1. Set up API connections and test endpoints
2. Write data collection functions with error handling and rate limiting
3. Collect data for 3,000+ movies from 2010-2024
4. Merge data sources on IMDb ID
5. Save raw datasets to CSV files in `data/raw/` directory
6. Perform initial data inspection
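
Objective 4 (merging on IMDb ID) can be sketched with a pandas left merge. The two frames below are tiny made-up stand-ins for the real raw files, and every column other than `imdb_id` (for example `domestic_total`) is a hypothetical name chosen for illustration, not one confirmed by the actual datasets.

```python
import pandas as pd

# Made-up placeholder rows; the real frames would be read from data/raw/.
tmdb = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", None],
    "title": ["Movie A", "Movie B", "Movie C"],
})
revenue = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000003"],
    "domestic_total": [1_000_000, 2_500_000],  # hypothetical column name
})

# Rows without an IMDb ID cannot be matched, so drop them before merging.
tmdb_keyed = tmdb.dropna(subset=["imdb_id"])

# validate="one_to_one" raises if either side has duplicate IDs;
# indicator=True adds a _merge column showing which rows found a match.
merged = tmdb_keyed.merge(
    revenue, on="imdb_id", how="left", validate="one_to_one", indicator=True
)
print(merged["_merge"].value_counts())
```

A left merge keeps every TMDB movie even when no revenue row matches, so the `_merge` column doubles as a coverage report for the secondary source.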

## Outputs
- `data/raw/movies_tmdb_raw.csv`
- `data/raw/revenue_boxofficemojo_raw.csv`
- `data/raw/trailers_youtube_raw.csv`

## Notes
- This notebook may take several hours to run due to API rate limits
- Once data is collected, subsequent runs should load from saved CSV files
- API keys should be stored in a `.env` file (not committed to git)
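
The "load from saved CSV files" note could be implemented along these lines. This is a minimal sketch, not the notebook's actual mechanism: `load_or_collect` is a hypothetical helper, and `collect_fn` stands for any zero-argument function that returns a DataFrame.

```python
import os
import pandas as pd

def load_or_collect(path, collect_fn):
    """Return the cached CSV at `path` if it exists; otherwise call
    `collect_fn()`, save its result to `path`, and return it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = collect_fn()
    parent = os.path.dirname(path)
    if parent:
        # Create the target directory (e.g. data/raw) on first use.
        os.makedirs(parent, exist_ok=True)
    df.to_csv(path, index=False)
    return df
```

With the functions defined later in this notebook, a call might look like `load_or_collect('data/raw/movies_tmdb_raw.csv', lambda: collect_movies_for_year_range(2010, 2024, pages_per_year=17))`, so the slow collection only runs when the file is absent.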

---
## Setup and Imports

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
import os
from dotenv import load_dotenv
from datetime import datetime
import json

# Load environment variables
load_dotenv()

# API Keys
TMDB_API_KEY = os.getenv('TMDB_API_KEY')
OMDB_API_KEY = os.getenv('OMDB_API_KEY')
YOUTUBE_API_KEY = os.getenv('YOUTUBE_API_KEY')

# Verify API keys are loaded
print("API Keys loaded:")
print(f"  TMDB: {'✓' if TMDB_API_KEY else '✗'}")
print(f"  OMDb: {'✓' if OMDB_API_KEY else '✗'}")
print(f"  YouTube: {'✓' if YOUTUBE_API_KEY else '✗'}")

# Test TMDB API connection (movie ID 550 serves as a known test record)
print("\nTesting TMDB API connection...")
test_url = "https://api.themoviedb.org/3/movie/550"
try:
    # Pass the key via params and set a timeout so a stalled connection cannot hang the cell
    response = requests.get(test_url, params={'api_key': TMDB_API_KEY}, timeout=10)
    if response.status_code == 200:
        print("✓ TMDB API connection successful!")
        print(f"  Test movie: {response.json()['title']}")
    else:
        print(f"✗ TMDB API error: {response.status_code}")
except Exception as e:
    print(f"✗ Connection error: {e}")

API Keys loaded:
  TMDB: ✓
  OMDb: ✓
  YouTube: ✓

Testing TMDB API connection...
✓ TMDB API connection successful!
  Test movie: Fight Club
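
The check above only prints a mark per key and then carries on. A slightly stricter variant, sketched here under the assumption that all three keys are required, collects the missing names in one place. `missing_env_vars` and `REQUIRED_KEYS` are hypothetical names introduced for this example.

```python
import os

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]

# Hypothetical list of keys this notebook expects to find in .env
REQUIRED_KEYS = ["TMDB_API_KEY", "OMDB_API_KEY", "YOUTUBE_API_KEY"]

missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
else:
    print("All API keys present.")
```

Replacing the `print` with a raised exception would stop the notebook before any long-running collection starts with an unusable key.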


---
## Data Collection Functions

In [4]:
# TMDB API Base URL
TMDB_BASE_URL = "https://api.themoviedb.org/3"

# Rate limiter class: client-side throttle sized for TMDB's legacy limit of 40 requests per 10 seconds
class RateLimiter:
    def __init__(self, max_calls=40, time_period=10):
        self.max_calls = max_calls
        self.time_period = time_period
        self.calls = []
    
    def wait_if_needed(self):
        now = time.time()
        # Remove calls older than time_period
        self.calls = [call_time for call_time in self.calls if now - call_time < self.time_period]
        
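        if len(self.calls) >= self.max_calls:
            sleep_time = self.time_period - (now - self.calls[0]) + 0.1
            print(f"  Rate limit reached, waiting {sleep_time:.1f} seconds...")
            time.sleep(sleep_time)
            # Re-prune rather than clear: calls still inside the window must keep counting
            now = time.time()
            self.calls = [call_time for call_time in self.calls if now - call_time < self.time_period]
        
        self.calls.append(time.time())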

# Initialize rate limiter
rate_limiter = RateLimiter(max_calls=35, time_period=10)  # Using 35 to be safe

def get_popular_movies_by_year(year, pages=5):
    """
    Get popular movies for a specific year using TMDB discover endpoint.
    
    Args:
        year: Release year (e.g., 2020)
        pages: Number of pages to fetch (20 movies per page)
    
    Returns:
        List of movie IDs
    """
    movie_ids = []
    
    for page in range(1, pages + 1):
        rate_limiter.wait_if_needed()
        
        url = f"{TMDB_BASE_URL}/discover/movie"
        params = {
            'api_key': TMDB_API_KEY,
            'language': 'en-US',
            'sort_by': 'popularity.desc',
            'primary_release_year': year,
            'page': page,
            'vote_count.gte': 50  # Minimum votes to ensure it's not obscure
        }
        
        try:
            response = requests.get(url, params=params, timeout=10)
            if response.status_code == 200:
                data = response.json()
                movie_ids.extend([movie['id'] for movie in data['results']])
            else:
                print(f"  Error fetching page {page} for year {year}: {response.status_code}")
        except Exception as e:
            print(f"  Exception for year {year}, page {page}: {e}")
            time.sleep(2)
    
    return movie_ids

def get_movie_details(movie_id):
    """
    Get detailed information for a specific movie.
    
    Args:
        movie_id: TMDB movie ID
    
    Returns:
        Dictionary with movie details or None if error
    """
    rate_limiter.wait_if_needed()
    
    url = f"{TMDB_BASE_URL}/movie/{movie_id}"
    params = {
        'api_key': TMDB_API_KEY,
        'append_to_response': 'credits,release_dates,videos'
    }
    
    try:
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print(f"  Error fetching movie {movie_id}: {response.status_code}")
            return None
    except Exception as e:
        print(f"  Exception for movie {movie_id}: {e}")
        return None

def extract_movie_data(movie_details):
    """
    Extract relevant fields from TMDB movie details.
    
    Args:
        movie_details: Raw JSON response from TMDB
    
    Returns:
        Dictionary with extracted fields
    """
    if not movie_details:
        return None
    
    # Extract release dates to find US release
    us_release_date = None
    us_certification = None
    if 'release_dates' in movie_details and 'results' in movie_details['release_dates']:
        for country_release in movie_details['release_dates']['results']:
            if country_release['iso_3166_1'] == 'US':
                for release in country_release['release_dates']:
                    if release.get('type') in [2, 3]:  # 2 = limited theatrical, 3 = theatrical
                        us_release_date = release.get('release_date')
                        us_certification = release.get('certification')
                        break
                break
    
    # Extract cast (top 5 actors)
    cast = []
    if 'credits' in movie_details and 'cast' in movie_details['credits']:
        cast = [
            {
                'id': actor['id'],
                'name': actor['name'],
                'order': actor['order']
            }
            for actor in movie_details['credits']['cast'][:5]
        ]
    
    # Extract director (first crew member whose job is 'Director')
    director = None
    if 'credits' in movie_details and 'crew' in movie_details['credits']:
        for crew_member in movie_details['credits']['crew']:
            if crew_member['job'] == 'Director':
                director = {
                    'id': crew_member['id'],
                    'name': crew_member['name']
                }
                break
    
    # Extract YouTube trailer key
    trailer_key = None
    if 'videos' in movie_details and 'results' in movie_details['videos']:
        for video in movie_details['videos']['results']:
            if video['type'] == 'Trailer' and video['site'] == 'YouTube':
                trailer_key = video['key']
                break
    
    # Extract genres
    genres = [genre['name'] for genre in movie_details.get('genres', [])]
    
    # Extract production companies
    production_companies = [company['name'] for company in movie_details.get('production_companies', [])]
    
    return {
        'tmdb_id': movie_details.get('id'),
        'imdb_id': movie_details.get('imdb_id'),
        'title': movie_details.get('title'),
        'original_title': movie_details.get('original_title'),
        'release_date': movie_details.get('release_date'),
        'us_release_date': us_release_date,
        'us_certification': us_certification,
        'budget': movie_details.get('budget'),
        'revenue': movie_details.get('revenue'),  # Note: TMDB revenue often incomplete
        'runtime': movie_details.get('runtime'),
        'genres': '|'.join(genres) if genres else None,
        'primary_genre': genres[0] if genres else None,
        'num_genres': len(genres),
        'popularity': movie_details.get('popularity'),
        'vote_average': movie_details.get('vote_average'),
        'vote_count': movie_details.get('vote_count'),
        'director_id': director['id'] if director else None,
        'director_name': director['name'] if director else None,
        'cast_ids': '|'.join([str(actor['id']) for actor in cast]) if cast else None,
        'cast_names': '|'.join([actor['name'] for actor in cast]) if cast else None,
        'production_companies': '|'.join(production_companies) if production_companies else None,
        'num_production_companies': len(production_companies),
        'original_language': movie_details.get('original_language'),
        'production_countries': '|'.join([country['iso_3166_1'] for country in movie_details.get('production_countries', [])]) or None,
        'youtube_trailer_key': trailer_key,
        'tagline': movie_details.get('tagline'),
        'overview': movie_details.get('overview')
    }

def collect_movies_for_year_range(start_year, end_year, pages_per_year=5):
    """
    Collect movie data for a range of years.
    
    Args:
        start_year: Starting year (inclusive)
        end_year: Ending year (inclusive)
        pages_per_year: Number of pages to fetch per year
    
    Returns:
        DataFrame with collected movie data
    """
    all_movies = []
    total_movies = 0
    
    for year in range(start_year, end_year + 1):
        print(f"\n=== Collecting movies for {year} ===")
        
        # Get movie IDs for this year
        movie_ids = get_popular_movies_by_year(year, pages=pages_per_year)
        print(f"  Found {len(movie_ids)} movie IDs for {year}")
        
        # Get details for each movie
        year_movies = 0
        for i, movie_id in enumerate(movie_ids, 1):
            if i % 20 == 0:
                print(f"  Progress: {i}/{len(movie_ids)} movies processed for {year}")
            
            movie_details = get_movie_details(movie_id)
            if movie_details:
                extracted_data = extract_movie_data(movie_details)
                if extracted_data:
                    all_movies.append(extracted_data)
                    year_movies += 1
        
        print(f"  Collected {year_movies} movies for {year}")
        total_movies += year_movies
        print(f"  Total movies collected so far: {total_movies}")
    
    df = pd.DataFrame(all_movies)
    return df

print("Data collection functions loaded successfully!")

Data collection functions loaded successfully!
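
The functions above report a non-200 response and move on, so a transient failure such as HTTP 429 or 503 silently drops that page or movie. One possible extension is a retry wrapper with exponential backoff. The sketch below is an assumption about how that could look, not part of the notebook: `fetch_with_retry`, `RETRYABLE_STATUS`, and the delay schedule are all hypothetical choices.

```python
import time

# Status codes treated as transient (an assumption, tune as needed)
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-retryable status.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute (for example a requests.Response).
    Returns the last response, or None if the final attempt raised.
    """
    response = None
    for attempt in range(max_attempts):
        try:
            response = fetch()
        except Exception as exc:
            print(f"  Attempt {attempt + 1} raised: {exc}")
            response = None
        else:
            if response.status_code not in RETRYABLE_STATUS:
                return response
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return response
```

Inside `get_movie_details`, the request line could then become `fetch_with_retry(lambda: requests.get(url, params=params, timeout=10))`, with the existing status check left as is. The `sleep` parameter exists so the backoff can be replaced with a stub when testing.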


---
## Collect Data

In [None]:
# Collect TMDB data for movies from 2010-2024
# This will take a while due to rate limiting (approximately 2-3 hours)

# Set parameters
START_YEAR = 2010
END_YEAR = 2024
PAGES_PER_YEAR = 17  # 17 pages x 20 movies = up to 340 movies/year; x 15 years = up to 5,100 movies

print(f"Starting data collection for {START_YEAR}-{END_YEAR}")
print(f"Fetching {PAGES_PER_YEAR} pages per year (~{PAGES_PER_YEAR * 20} movies/year)")
print(f"Estimated total movies: ~{(END_YEAR - START_YEAR + 1) * PAGES_PER_YEAR * 20}")
print("This will take approximately 2-3 hours due to API rate limiting.\n")

# Collect the data
start_time = time.time()
df_tmdb = collect_movies_for_year_range(START_YEAR, END_YEAR, pages_per_year=PAGES_PER_YEAR)
end_time = time.time()

print(f"\n{'='*60}")
print("Data collection complete!")
print(f"Total movies collected: {len(df_tmdb)}")
print(f"Time elapsed: {(end_time - start_time) / 60:.1f} minutes")
print(f"{'='*60}")
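
To put the runtime estimate in context, the request budget for these parameters can be worked out directly. The sketch below assumes one request per discover page plus one per movie, and uses the limiter settings from earlier (35 calls per 10 seconds). `estimate_requests` and `throttle_floor_minutes` are hypothetical helper names.

```python
def estimate_requests(years, pages_per_year, movies_per_page=20):
    """Upper bound on API calls: one per discover page plus one per movie."""
    discover_calls = years * pages_per_year
    detail_calls = discover_calls * movies_per_page
    return discover_calls + detail_calls

def throttle_floor_minutes(total_requests, max_calls=35, time_period=10):
    """Minimum wall-clock minutes that the rate limiter alone permits."""
    requests_per_second = max_calls / time_period
    return total_requests / requests_per_second / 60

total = estimate_requests(years=15, pages_per_year=17)
print(f"Requests: {total}")                                        # 5355
print(f"Throttle floor: {throttle_floor_minutes(total):.1f} min")  # 25.5 min
```

This figure is only the floor imposed by the throttle. Actual runs also pay network latency on every sequential request, plus any waits after errors, so the observed time can be considerably longer.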

---
## Save Raw Data

In [None]:
# Create data/raw directory if it doesn't exist
os.makedirs('data/raw', exist_ok=True)

# Save to CSV
output_file = 'data/raw/movies_tmdb_raw.csv'
df_tmdb.to_csv(output_file, index=False)

print(f"Data saved to {output_file}")
print(f"File size: {os.path.getsize(output_file) / 1024:.1f} KB")
print(f"Total rows: {len(df_tmdb)}")
print(f"Total columns: {len(df_tmdb.columns)}")

---
## Initial Data Inspection

In [6]:
# Basic data inspection
print("="*60)
print("DATASET OVERVIEW")
print("="*60)

print(f"\nShape: {df_tmdb.shape}")
print(f"  Rows (movies): {df_tmdb.shape[0]}")
print(f"  Columns (features): {df_tmdb.shape[1]}")

print("\n" + "="*60)
print("COLUMN NAMES")
print("="*60)
print(df_tmdb.columns.tolist())

print("\n" + "="*60)
print("DATA TYPES")
print("="*60)
print(df_tmdb.dtypes)

print("\n" + "="*60)
print("MISSING VALUES")
print("="*60)
missing = df_tmdb.isnull().sum()
missing_pct = (missing / len(df_tmdb) * 100).round(1)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False))

print("\n" + "="*60)
print("FIRST 5 ROWS")
print("="*60)
print(df_tmdb.head())

print("\n" + "="*60)
print("BASIC STATISTICS (Numeric Columns)")
print("="*60)
print(df_tmdb.describe())

print("\n" + "="*60)
print("KEY METRICS")
print("="*60)
# Each metric is a boolean mask; the helper prints its count and share of all rows
def print_coverage(label, mask):
    count = int(mask.sum())
    print(f"{label}: {count} ({count / len(df_tmdb) * 100:.1f}%)")

print_coverage("Movies with budget data", df_tmdb['budget'].notna())
print_coverage("Movies with non-zero budget", df_tmdb['budget'] > 0)
print_coverage("Movies with revenue data", df_tmdb['revenue'].notna())
print_coverage("Movies with non-zero revenue", df_tmdb['revenue'] > 0)
print_coverage("Movies with IMDb ID", df_tmdb['imdb_id'].notna())
print_coverage("Movies with director", df_tmdb['director_name'].notna())
print_coverage("Movies with cast data", df_tmdb['cast_names'].notna())
print_coverage("Movies with YouTube trailer", df_tmdb['youtube_trailer_key'].notna())

print("\n" + "="*60)
print("SAMPLE MOVIES")
print("="*60)
print(df_tmdb[['title', 'release_date', 'budget', 'revenue', 'primary_genre', 'director_name']].sample(10))

DATASET OVERVIEW

Shape: (2100, 27)
  Rows (movies): 2100
  Columns (features): 27

COLUMN NAMES
['tmdb_id', 'imdb_id', 'title', 'original_title', 'release_date', 'us_release_date', 'us_certification', 'budget', 'revenue', 'runtime', 'genres', 'primary_genre', 'num_genres', 'popularity', 'vote_average', 'vote_count', 'director_id', 'director_name', 'cast_ids', 'cast_names', 'production_companies', 'num_production_companies', 'original_language', 'production_countries', 'youtube_trailer_key', 'tagline', 'overview']

DATA TYPES
tmdb_id                       int64
imdb_id                      object
title                        object
original_title               object
release_date                 object
us_release_date              object
us_certification             object
budget                        int64
revenue                       int64
runtime                       int64
genres                       object
primary_genre                object
num_genres                    int64
