TMDB Movie Data Collector for Movie Success Prediction

This notebook creates a CSV dataset from the TMDB API for data provisioning analysis.

In [14]:
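import os
import requests
import pandas as pd
import time
import json
from datetime import datetime
import numpy as np


# Read credentials from the environment instead of hardcoding them in the notebook.
# The environment variable names TMDB_API_KEY and OMDB_API_KEY are assumptions.
API_KEY = os.environ.get("TMDB_API_KEY", "")
BASE_URL = "https://api.themoviedb.org/3"
API_KEY_OMDB = os.environ.get("OMDB_API_KEY", "")
REQUEST_DELAY = 0.25  # seconds to wait before each TMDB request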

API Handler

Manages communication with multiple APIs (TMDB + OMDb) for richer dataset collection.

I added OMDb integration because TMDB lacks IMDb ratings and Rotten Tomatoes scores, which are crucial validation metrics for movie success prediction.
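
OMDb returns its numbers as strings and marks missing values with 'N/A', so each field has to be normalised before it can be used as a feature. The following is a minimal offline sketch of that step. The payload is made up for illustration and `parse_number` is a hypothetical helper, not part of the notebook's `APIHandler`.

```python
# Hypothetical payload shaped like an OMDb title lookup; all values are made up.
sample_omdb_payload = {
    "Response": "True",
    "imdbRating": "7.4",
    "imdbVotes": "123,456",
    "Metascore": "N/A",
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "7.4/10"},
        {"Source": "Rotten Tomatoes", "Value": "81%"},
    ],
}


def parse_number(raw, cast):
    """Return cast(raw) with thousands separators removed, or None if missing or 'N/A'."""
    if not raw or raw == "N/A":
        return None
    try:
        return cast(raw.replace(",", ""))
    except ValueError:
        return None


def rotten_tomatoes_score(ratings):
    """Pick the Rotten Tomatoes percentage out of the Ratings list, if present."""
    for rating in ratings:
        if rating.get("Source") == "Rotten Tomatoes":
            return parse_number(rating.get("Value", "").rstrip("%"), int)
    return None


parsed = {
    "imdb_rating": parse_number(sample_omdb_payload.get("imdbRating"), float),
    "imdb_votes": parse_number(sample_omdb_payload.get("imdbVotes"), int),
    "metacritic_score": parse_number(sample_omdb_payload.get("Metascore"), int),
    "rotten_tomatoes_score": rotten_tomatoes_score(sample_omdb_payload.get("Ratings", [])),
}
print(parsed)
```

Missing values come back as None rather than 0, so a missing score is not mistaken for a bad score later on.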



In [15]:
class APIHandler:
    def __init__(self, tmdb_key, omdb_key=None):
        self.tmdb_key = tmdb_key
        self.omdb_key = omdb_key
        self.session = requests.Session()
        
    def get_tmdb_data(self, endpoint, params=None):
        if params is None:
            params = {}
        params['api_key'] = self.tmdb_key
        
        url = f"{BASE_URL}/{endpoint}"
        
        try:
            time.sleep(REQUEST_DELAY)  # pause before every TMDB request
            response = self.session.get(url, params=params, timeout=10)
            response.raise_for_status()
            return response.json()
        except Exception as e:
            print(f"TMDB request failed: {e}")
            return None
    
    def get_omdb_data(self, title, year):
        if not self.omdb_key:
            return {}
            
        try:
            params = {
                'apikey': self.omdb_key,
                't': title,
                'y': year
            }
            
            time.sleep(0.1)
            response = self.session.get("https://www.omdbapi.com/", params=params, timeout=10)
            
            if response.status_code == 200:
                data = response.json()
                if data.get('Response') == 'True':
                    return {
                        'imdb_rating': self._to_float(data.get('imdbRating')),
                        'imdb_votes': self._parse_votes(data.get('imdbVotes')),
                        'rotten_tomatoes_score': self._get_rt_score(data.get('Ratings', [])),
                        'metacritic_score': self._to_int(data.get('Metascore')),
                        'awards': data.get('Awards'),
                        'writer': data.get('Writer'),
                        'rated': data.get('Rated'),
                        'country': data.get('Country')
                    }
        except Exception as e:
            print(f"OMDb request failed for {title}: {e}")
        
        return {}
    
    def _to_float(self, value):
        if value and value != 'N/A':
            try:
                return float(value)
            except (TypeError, ValueError):
                pass
        return None
    
    def _to_int(self, value):
        if value and value != 'N/A':
            try:
                return int(value)
            except (TypeError, ValueError):
                pass
        return None
    
    def _parse_votes(self, votes_str):
        if votes_str and votes_str != 'N/A':
            try:
                return int(votes_str.replace(',', ''))
            except (AttributeError, ValueError):
                pass
        return None
    
    def _get_rt_score(self, ratings_list):
        for rating in ratings_list:
            if rating.get('Source') == 'Rotten Tomatoes':
                score_str = rating.get('Value', '')
                if '%' in score_str:
                    try:
                        return int(score_str.replace('%', ''))
                    except (AttributeError, ValueError):
                        pass
        return None

Movie Discovery Engine

This class discovers movies using TMDB's filtering system to find movies suitable for success prediction

I'm filtering for movies with revenue data since that's essential for creating our target variable (Hit/Break-even/Flop)

I focus on modern movies (1990 onward) because older movies have different market dynamics

Paging through the results in a fixed popularity order gives a consistent sample for training our prediction model, although it leans toward well-known titles
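
One practical detail when paging through a popularity-sorted list: if the ordering shifts between requests, the same movie may appear on two pages. That is an assumption about the API's behaviour, not something the notebook checks. Under that assumption, a minimal sketch of de-duplicating discovered results by TMDB id follows; `dedupe_by_id` is a hypothetical helper.

```python
def dedupe_by_id(movies):
    """Keep the first occurrence of each TMDB id, preserving discovery order."""
    seen = set()
    unique = []
    for movie in movies:
        movie_id = movie.get("id")
        if movie_id is None or movie_id in seen:
            continue  # skip records without an id and repeats of an id already kept
        seen.add(movie_id)
        unique.append(movie)
    return unique


# Made-up results standing in for two consecutive discover pages.
page_1 = [{"id": 10, "title": "A"}, {"id": 11, "title": "B"}]
page_2 = [{"id": 11, "title": "B"}, {"id": 12, "title": "C"}]

unique_movies = dedupe_by_id(page_1 + page_2)
print([movie["id"] for movie in unique_movies])
```

Running this before the per-movie detail requests would also avoid spending API calls on repeats.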



In [16]:
class MovieDiscovery:
    def __init__(self, api_handler):
        self.api = api_handler
        
    def discover_movies(self, pages=100, start_page=1):  
        print(f"Discovering movies from pages {start_page} to {start_page + pages - 1}...")
        discovered_movies = []
        
        for page in range(start_page, start_page + pages):  
            params = {
                'page': page,
                'sort_by': 'popularity.desc',
                'include_adult': 'false',
                'with_revenue.gte': 1,
                'primary_release_date.gte': '1990-01-01',
                'primary_release_date.lte': '2024-12-31'
            }
            
            data = self.api.get_tmdb_data('discover/movie', params)
            if data and 'results' in data:
                discovered_movies.extend(data['results'])
                print(f"Page {page}: Found {len(data['results'])} movies")
            else:
                print(f"Failed to get data for page {page}")
                
        print(f"Total discovered movies: {len(discovered_movies)}")
        return discovered_movies

Movie Details Extractor

This class extracts comprehensive financial and content details for each movie

Budget and revenue are critical for creating my success classification target variable

Genre, runtime, release date affect audience appeal and box office performance

These features form the core predictive variables for my ML model

Combines TMDB base data with OMDb enrichment for comprehensive movie profiles.

I expanded this to include external data validation because relying on a single source
can lead to biased or incomplete information that affects model accuracy.

The integration provides cross-validation between data sources and captures
different perspectives on movie quality (audience vs professional critics).
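
To compare audience and critic sources they first need a common scale. The sketch below assumes TMDB `vote_average` and IMDb ratings are on 0-10 and that Rotten Tomatoes and Metacritic scores are on 0-100. The helper names and the sample record are hypothetical.

```python
def ratings_on_ten_point_scale(movie):
    """Collect the ratings that are present, rescaled to 0-10 and keyed by source."""
    scaled = {}
    if movie.get("vote_average") is not None:
        scaled["tmdb"] = float(movie["vote_average"])  # already 0-10
    if movie.get("imdb_rating") is not None:
        scaled["imdb"] = float(movie["imdb_rating"])  # already 0-10
    if movie.get("rotten_tomatoes_score") is not None:
        scaled["rotten_tomatoes"] = movie["rotten_tomatoes_score"] / 10  # 0-100 to 0-10
    if movie.get("metacritic_score") is not None:
        scaled["metacritic"] = movie["metacritic_score"] / 10  # 0-100 to 0-10
    return scaled


def rating_spread(movie):
    """Gap between the highest and lowest rescaled rating, or None with fewer than two sources."""
    scaled = ratings_on_ten_point_scale(movie)
    if len(scaled) < 2:
        return None
    return max(scaled.values()) - min(scaled.values())


# Made-up record: audiences are lukewarm, critics are enthusiastic.
sample_movie = {
    "vote_average": 7.0,
    "imdb_rating": 7.5,
    "rotten_tomatoes_score": 90,
    "metacritic_score": None,
}
spread = rating_spread(sample_movie)
print(ratings_on_ten_point_scale(sample_movie), spread)
```

A large spread flags a record where the sources disagree, which is the audience-versus-critic signal described above.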

In [17]:
class MovieDetailsExtractor:
    def __init__(self, api_handler):
        self.api = api_handler
        
    def get_movie_details(self, movie_id, title=None, year=None, include_omdb=True):
        """Get complete movie information"""
        details = self.api.get_tmdb_data(f'movie/{movie_id}')
        if not details:
            return None
            
        movie_data = {
            'id': details.get('id'),
            'title': details.get('title'),
            'budget': details.get('budget', 0),
            'revenue': details.get('revenue', 0),
            'runtime': details.get('runtime'),
            'release_date': details.get('release_date'),
            'vote_average': details.get('vote_average'),
            'vote_count': details.get('vote_count'),
            'popularity': details.get('popularity'),
            'overview': details.get('overview'),
            'original_language': details.get('original_language'),
            'adult': details.get('adult')
        }
        
        if details.get('genres'):
            movie_data['genres'] = [g['name'] for g in details['genres']]
            movie_data['primary_genre'] = details['genres'][0]['name']
            movie_data['genre_count'] = len(details['genres'])
        else:
            movie_data['genres'] = []
            movie_data['primary_genre'] = None
            movie_data['genre_count'] = 0
            
        if details.get('production_companies'):
            movie_data['production_companies'] = [pc['name'] for pc in details['production_companies']]
            movie_data['main_production_company'] = details['production_companies'][0]['name']
            movie_data['production_company_count'] = len(details['production_companies'])
        else:
            movie_data['production_companies'] = []
            movie_data['main_production_company'] = None
            movie_data['production_company_count'] = 0
        
        if details.get('production_countries'):
            movie_data['production_countries'] = [pc['name'] for pc in details['production_countries']]
            movie_data['main_production_country'] = details['production_countries'][0]['name']
            movie_data['is_us_movie'] = any('United States' in pc['name'] for pc in details['production_countries'])
        else:
            movie_data['production_countries'] = []
            movie_data['main_production_country'] = None
            movie_data['is_us_movie'] = False
        
        if include_omdb and title and year:
            omdb_data = self.api.get_omdb_data(title, year)
            movie_data.update(omdb_data)
        
        return movie_data

Credits and Cast Extractor

This class extracts director and cast information which are strong predictors of movie success

Director track record (previous hit rate) is one of the most reliable success predictors in the industry

Lead actor popularity and star power significantly influence opening weekend box office

Top-billed cast affects marketing appeal and audience draw

This data enables my model to factor in human talent as a success variable
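
The extractor below only records who directed each movie. Turning that into a track-record feature is a separate step that needs the success labels. A minimal sketch of one way to do it follows, counting only strictly earlier releases so the movie's own outcome does not leak into its feature. The field name `director_prior_hit_rate` and the sample records are hypothetical.

```python
from collections import defaultdict


def add_director_prior_hit_rate(movies):
    """Annotate each movie with its director's hit rate over strictly earlier releases."""
    history = defaultdict(lambda: [0, 0])  # director -> [hits so far, films so far]
    annotated = []
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for movie in sorted(movies, key=lambda m: m["release_date"]):
        hits, films = history[movie["director"]]
        prior_rate = hits / films if films else None  # None for a first film
        annotated.append({**movie, "director_prior_hit_rate": prior_rate})
        history[movie["director"]][0] += int(movie["success_category"] == "Hit")
        history[movie["director"]][1] += 1
    return annotated


# Made-up filmographies for two directors.
movies = [
    {"title": "First", "director": "Director A", "release_date": "2001-05-01", "success_category": "Hit"},
    {"title": "Second", "director": "Director A", "release_date": "2005-07-15", "success_category": "Flop"},
    {"title": "Third", "director": "Director A", "release_date": "2010-11-20", "success_category": "Hit"},
    {"title": "Other", "director": "Director B", "release_date": "2003-03-03", "success_category": "Flop"},
]

annotated = add_director_prior_hit_rate(movies)
rates = {m["title"]: m["director_prior_hit_rate"] for m in annotated}
print(rates)
```

The same pattern could be keyed on `director_id` or `lead_actor_id` instead of the name, which avoids collisions between people who share a name.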


In [18]:
class CreditsExtractor:
    def __init__(self, api_handler):
        self.api = api_handler
        
    def get_movie_credits(self, movie_id):
        credits = self.api.get_tmdb_data(f'movie/{movie_id}/credits')
        if not credits:
            return {}
            
        credit_data = {}
        
        # Always set the director keys, even when the crew list is empty or missing.
        directors = [person for person in (credits.get('crew') or []) if person.get('job') == 'Director']
        credit_data['director'] = directors[0]['name'] if directors else None
        credit_data['director_id'] = directors[0]['id'] if directors else None
        
        if credits.get('cast'):
            main_cast = credits['cast'][:5]
            credit_data['main_cast'] = [actor['name'] for actor in main_cast]
            credit_data['main_cast_ids'] = [actor['id'] for actor in main_cast]
            credit_data['lead_actor'] = main_cast[0]['name'] if main_cast else None
            credit_data['lead_actor_id'] = main_cast[0]['id'] if main_cast else None
        else:
            credit_data['main_cast'] = []
            credit_data['main_cast_ids'] = []
            credit_data['lead_actor'] = None
            credit_data['lead_actor_id'] = None
            
        return credit_data

Success Classification Creator

This class creates the target variable for my machine learning model using industry-standard profitability ratios

As a rule of thumb, a movie needs roughly 2.5x its budget in revenue to be truly profitable after marketing and distribution costs

This classification system (Hit/Break-even/Flop) matches real Hollywood investment decision-making

Creating accurate target labels is essential for supervised learning and model evaluation

I expanded this to leverage external data for more sophisticated feature engineering
that captures market dynamics and audience-critic divides.

The rating comparison features help identify movies that perform differently
with critics versus general audiences, which affects long-term commercial success.
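
A worked example makes the thresholds concrete. The sketch below applies the same cut-offs as the classifier in the next cell (below 1.0 is a Flop, below 2.5 is Break-even, otherwise a Hit). The budget and revenue figures are made up.

```python
BREAK_EVEN_THRESHOLD = 1.0  # below this, revenue did not cover the budget
HIT_THRESHOLD = 2.5         # at or above this, the movie counts as a Hit


def classify(revenue, budget):
    """Label a movie by its revenue-to-budget ratio; None when either figure is missing."""
    if not revenue or not budget:
        return None
    ratio = revenue / budget
    if ratio < BREAK_EVEN_THRESHOLD:
        return "Flop"
    if ratio < HIT_THRESHOLD:
        return "Break-even"
    return "Hit"


# (budget, revenue) pairs in US dollars, invented for illustration.
examples = [
    (50_000_000, 40_000_000),    # ratio 0.8
    (100_000_000, 180_000_000),  # ratio 1.8
    (20_000_000, 50_000_000),    # ratio 2.5, exactly on the Hit boundary
    (0, 10_000_000),             # unknown budget
]
labels = [classify(revenue, budget) for budget, revenue in examples]
print(labels)
```

Note that a movie which merely doubles its budget is still labelled Break-even under this scheme, which is the point of the 2.5x rule.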

In [19]:
class SuccessClassifier:
    def classify_movie_success(self, revenue, budget):
        if not budget or not revenue:  # also covers None, not only 0
            return None
            
        profit_ratio = revenue / budget
        
        if profit_ratio < 1.0:
            return "Flop"
        elif profit_ratio < 2.5:
            return "Break-even" 
        else:
            return "Hit"
    
    def add_useful_features(self, movie_data):
        
        if movie_data.get('release_date'):
            try:
                date_obj = datetime.strptime(movie_data['release_date'], '%Y-%m-%d')
                movie_data['release_year'] = date_obj.year
                movie_data['release_month'] = date_obj.month
                movie_data['release_quarter'] = (date_obj.month - 1) // 3 + 1
                movie_data['is_summer_movie'] = date_obj.month in [5, 6, 7, 8]  
                movie_data['is_holiday_movie'] = date_obj.month in [11, 12]    
            except (TypeError, ValueError):
                movie_data['release_year'] = None
                movie_data['release_month'] = None
                movie_data['release_quarter'] = None
                movie_data['is_summer_movie'] = None
                movie_data['is_holiday_movie'] = None
        
        movie_data['success_category'] = self.classify_movie_success(
            movie_data.get('revenue', 0), 
            movie_data.get('budget', 0)
        )
        
        if movie_data.get('budget', 0) > 0:
            movie_data['profit_ratio'] = movie_data.get('revenue', 0) / movie_data['budget']
        else:
            movie_data['profit_ratio'] = None

        if movie_data.get('imdb_rating') and movie_data.get('vote_average'):
            movie_data['imdb_vs_tmdb_difference'] = movie_data['imdb_rating'] - movie_data['vote_average']
            movie_data['ratings_agree'] = abs(movie_data['imdb_vs_tmdb_difference']) < 0.5
        
        movie_data['has_awards'] = self._check_for_awards(movie_data.get('awards'))
        movie_data['has_oscar_mention'] = self._check_for_oscars(movie_data.get('awards'))
        
        movie_data['is_r_rated'] = movie_data.get('rated') == 'R'
        movie_data['is_family_friendly'] = movie_data.get('rated') in ['G', 'PG', 'PG-13']
        
        budget = movie_data.get('budget', 0)
        if budget >= 100_000_000:
            movie_data['budget_category'] = 'Blockbuster'
        elif budget >= 50_000_000:
            movie_data['budget_category'] = 'Major Studio'
        elif budget >= 15_000_000:
            movie_data['budget_category'] = 'Mid-Budget'
        elif budget > 0:
            movie_data['budget_category'] = 'Independent'
        else:
            movie_data['budget_category'] = 'Unknown'
        
        runtime = movie_data.get('runtime')
        if runtime:
            if runtime < 90:
                movie_data['runtime_category'] = 'Short'
            elif runtime <= 120:
                movie_data['runtime_category'] = 'Standard'
            else:
                movie_data['runtime_category'] = 'Long'
        else:
            movie_data['runtime_category'] = 'Unknown'
            
        return movie_data
    
    def _check_for_awards(self, awards_str):
        if not awards_str or awards_str == 'N/A':
            return False
        
        major_awards = ['Oscar', 'Golden Globe', 'BAFTA', 'Emmy', 'SAG Award']
        return any(award in awards_str for award in major_awards)
    
    def _check_for_oscars(self, awards_str):
        if not awards_str or awards_str == 'N/A':
            return False
        return 'Oscar' in awards_str or 'Academy Award' in awards_str

Data Quality Controller

This class ensures collected data meets quality standards for machine learning

Movies without budget or revenue data cannot be used for success prediction training

Filtering out incomplete records prevents model training on unreliable data

Quality control at collection stage reduces data preparation work later

I expanded this to handle validation across multiple APIs and create completeness metrics
that help identify the most reliable records for model training.

The completeness scoring helps weight predictions by data quality and identifies
records that might need additional validation or exclusion from training.
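
The completeness score is a weighted average: 70% for the share of core fields present and 30% for the share of extra fields present. A minimal sketch with a made-up record, using the same field lists and weights as the checker in the next cell:

```python
CORE_FIELDS = ["budget", "revenue", "title", "release_date", "runtime", "vote_average"]
EXTRA_FIELDS = ["imdb_rating", "rotten_tomatoes_score", "awards", "director", "lead_actor"]


def completeness(movie, core_weight=0.7, extra_weight=0.3):
    """Return (core share, extra share, weighted overall score) for one record."""
    core = sum(movie.get(field) is not None for field in CORE_FIELDS) / len(CORE_FIELDS)
    extra = sum(movie.get(field) is not None for field in EXTRA_FIELDS) / len(EXTRA_FIELDS)
    return core, extra, core_weight * core + extra_weight * extra


# Invented record: all six core fields present, two of the five extra fields present.
record = {
    "budget": 30_000_000,
    "revenue": 95_000_000,
    "title": "Example Movie",
    "release_date": "2015-06-12",
    "runtime": 104,
    "vote_average": 6.8,
    "imdb_rating": 7.1,
    "director": "Jane Doe",
}

core, extra, overall = completeness(record)
print(round(core, 2), round(extra, 2), round(overall, 2))
```

Here the core share is 6/6, the extra share is 2/5, and the overall score is 0.7 x 1.0 + 0.3 x 0.4 = 0.82.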

In [20]:
class DataQualityChecker:
    def is_good_movie_data(self, movie_data):
        required_fields = ['budget', 'revenue', 'title', 'release_date']
        
        for field in required_fields:
            if not movie_data.get(field):
                return False
                
        if movie_data.get('budget', 0) <= 0:
            return False
            
        if movie_data.get('revenue', 0) <= 0:
            return False
        
        runtime = movie_data.get('runtime')
        if runtime and (runtime < 60 or runtime > 300):
            print(f"Warning: Weird runtime {runtime} for {movie_data.get('title')}")
            
        tmdb_rating = movie_data.get('vote_average')
        if tmdb_rating and (tmdb_rating < 0 or tmdb_rating > 10):
            print(f"Warning: Bad TMDB rating {tmdb_rating} for {movie_data.get('title')}")
            return False
            
        imdb_rating = movie_data.get('imdb_rating')
        if imdb_rating and (imdb_rating < 0 or imdb_rating > 10):
            print(f"Warning: Bad IMDb rating {imdb_rating} for {movie_data.get('title')}")
            return False
        
        return True
    
    def calculate_data_completeness(self, movie_data):
        core_fields = ['budget', 'revenue', 'title', 'release_date', 'runtime', 'vote_average']
        extra_fields = ['imdb_rating', 'rotten_tomatoes_score', 'awards', 'director', 'lead_actor']
        
        core_complete = sum(1 for field in core_fields if movie_data.get(field) is not None)
        extra_complete = sum(1 for field in extra_fields if movie_data.get(field) is not None)
        
        core_score = core_complete / len(core_fields)
        extra_score = extra_complete / len(extra_fields)
        
        overall_score = (core_score * 0.7) + (extra_score * 0.3)
        
        return {
            'core_completeness': core_score,
            'extra_completeness': extra_score,
            'overall_completeness': overall_score
        }
    
    def clean_movie_data(self, movie_data):
        if isinstance(movie_data.get('genres'), list):
            movie_data['genres_count'] = len(movie_data['genres'])
        else:
            movie_data['genres_count'] = 0
            
        if isinstance(movie_data.get('main_cast'), list):
            movie_data['cast_count'] = len(movie_data['main_cast'])
        else:
            movie_data['cast_count'] = 0

        text_fields = ['title', 'overview']
        for field in text_fields:
            if movie_data.get(field):
                movie_data[field] = movie_data[field].strip()
        
        completeness = self.calculate_data_completeness(movie_data)
        movie_data.update(completeness)
        
        return movie_data


CSV Dataset Exporter

This class saves the collected movie data to CSV format for the data provisioning phase

Proper file naming with timestamps enables version control and reproducible research
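
The two ideas here are a timestamped filename and flattening list-valued columns into delimited strings before writing. The sketch below shows both using only the standard library and an in-memory buffer. The helper names are hypothetical and the row is made up.

```python
import csv
import io
from datetime import datetime


def timestamped_filename(prefix="movie_dataset", now=None):
    """Build a name like movie_dataset_YYYYMMDD_HHMMSS.csv."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


def flatten_lists(row, separator=", "):
    """Join list values into one string so each CSV cell holds plain text."""
    return {
        key: separator.join(map(str, value)) if isinstance(value, list) else value
        for key, value in row.items()
    }


rows = [
    {"id": 1, "title": "Example Movie", "genres": ["Drama", "Thriller"], "main_cast_ids": [101, 102]},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
for row in rows:
    writer.writerow(flatten_lists(row))
csv_text = buffer.getvalue()

# Reading the file back: split on the same separator to recover the list.
restored = next(csv.DictReader(io.StringIO(csv_text)))
restored_genres = restored["genres"].split(", ")

filename = timestamped_filename(now=datetime(2024, 1, 31, 9, 5, 0))
print(filename)
print(csv_text)
```

One caveat with a comma separator: a value that itself contains ", " (some company names might) will not split back cleanly, so a separator such as "|" may be safer.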



In [21]:
class CSVExporter:
    def save_to_csv(self, movies_data, filename=None):
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"movie_dataset_{timestamp}.csv"
        
        df = pd.DataFrame(movies_data)
        
        list_columns = ['genres', 'production_companies', 'production_countries', 'main_cast', 'main_cast_ids']
        for col in list_columns:
            if col in df.columns:
                df[col] = df[col].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else x)
        
        df.to_csv(filename, index=False)
        
        print(f"Dataset saved to: {filename}")
        print(f"Total movies: {len(df)}")
        print(f"Total columns: {len(df.columns)}")
        
        return filename, df

Data Integration Techniques (Applied from Croatia GDP exercise)
 **Why I Added Data Integration and Versioning Classes**

**Why Data Integration is Critical for Movie Success Prediction:**

Single data sources have limitations: TMDB lacks critical validation metrics like IMDb ratings and professional critic scores that significantly impact movie success patterns.

The Croatia GDP exercise taught me that combining multiple reliable data sources creates richer datasets that improve model accuracy and provide cross-validation.

 **Business Need:** Movie studios make million-dollar decisions based on these models. Missing external validation signals leads to poor investment choices.


 **Why These Specific Integration Techniques:**
 - Union: Combine data from multiple collection runs as dataset grows
 - Inner Join: Create high-quality training sets with complete external validation
 - Left Join: Preserve all movies while adding external data where available
 - Full Outer Join: Analyze different movie eras with varying data availability
 - Exclusion: Remove anomalous periods (like pandemic) that would skew predictions
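
The five techniques above differ only in which movie ids survive. The sketch below spells out that set logic on plain dictionaries, independent of pandas, using made-up records keyed by TMDB id.

```python
# Made-up records keyed by TMDB id.
tmdb = {1: {"title": "A"}, 2: {"title": "B"}, 3: {"title": "C"}}
omdb = {2: {"imdb_rating": 7.1}, 3: {"imdb_rating": 6.4}, 4: {"imdb_rating": 8.0}}

# Inner join: only ids present in both sources.
inner = {i: {**tmdb[i], **omdb[i]} for i in tmdb.keys() & omdb.keys()}

# Left join: every TMDB movie; a missing OMDb value becomes None.
left = {
    i: {**row, "imdb_rating": omdb.get(i, {}).get("imdb_rating")}
    for i, row in tmdb.items()
}

# Full outer join: every id from either source, with None on the missing side.
outer = {
    i: {
        "title": tmdb.get(i, {}).get("title"),
        "imdb_rating": omdb.get(i, {}).get("imdb_rating"),
    }
    for i in tmdb.keys() | omdb.keys()
}

# Union: stack two collection runs and keep the first record seen for each id.
batch_1 = [{"id": 1}, {"id": 2}]
batch_2 = [{"id": 2}, {"id": 5}]
union = {}
for movie in batch_1 + batch_2:
    union.setdefault(movie["id"], movie)

# Exclusion: drop movies released in the excluded years.
release_years = {1: 2019, 2: 2020, 3: 2022}
kept = {i for i, year in release_years.items() if year not in {2020, 2021}}

print(sorted(inner), sorted(left), sorted(outer), sorted(union), sorted(kept))
```

The inner join is the smallest result and the full outer join the largest. The left join always keeps exactly the TMDB ids.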

In [22]:
class MovieDataIntegrator:
    def __init__(self):
        self.integration_log = []
    
    def log_integration(self, technique, description, result_shape):
        self.integration_log.append({
            'technique': technique,
            'description': description,
            'result_shape': result_shape,
            'timestamp': datetime.now()
        })
    
    def union_integration(self, tmdb_data_batch1, tmdb_data_batch2):
        print("=== UNION INTEGRATION ===")
        
        combined = pd.concat([tmdb_data_batch1, tmdb_data_batch2], ignore_index=True)
        combined = combined.drop_duplicates(subset=['id'])
        
        print(f"Batch 1: {len(tmdb_data_batch1)} movies")
        print(f"Batch 2: {len(tmdb_data_batch2)} movies")
        print(f"Union result: {len(combined)} movies")
        
        self.log_integration("Union", "Combined TMDB collection batches", combined.shape)
        return combined
    
    def inner_join_integration(self, tmdb_data, omdb_enhanced_data):
        print("=== INNER JOIN INTEGRATION ===")
        
        omdb_cols = ['id']
        available_omdb = []
        for col in ['imdb_rating', 'rotten_tomatoes_score', 'awards']:
            if col in omdb_enhanced_data.columns:
                omdb_cols.append(col)
                available_omdb.append(col)
        
        if available_omdb:
            # Keep only rows where the first available OMDb column is populated.
            has_omdb_data = omdb_enhanced_data[available_omdb[0]].notna()
            omdb_movies = omdb_enhanced_data[has_omdb_data].copy()
        else:
            omdb_movies = omdb_enhanced_data.copy()
        inner_joined = pd.merge(tmdb_data, omdb_movies[omdb_cols], on='id', how='inner')
        
        print(f"TMDB data: {len(tmdb_data)} movies")
        print(f"Movies with OMDb data: {len(omdb_movies)} movies")
        print(f"Inner join result: {len(inner_joined)} movies")
        
        self.log_integration("Inner Join", "TMDB + OMDb intersection", inner_joined.shape)
        return inner_joined
    
    def left_join_integration(self, tmdb_data, external_ratings):
        print("=== LEFT OUTER JOIN INTEGRATION ===")
        
        left_joined = pd.merge(tmdb_data, external_ratings, on='id', how='left')
        
        external_cols = [col for col in external_ratings.columns if col != 'id']
        if external_cols:
            null_count = left_joined[external_cols[0]].isnull().sum()
        else:
            null_count = 0
        
        print(f"TMDB data: {len(tmdb_data)} movies")
        print(f"Left join result: {len(left_joined)} movies")
        print(f"NULL values introduced: {null_count}")
        
        self.log_integration("Left Join", "TMDB enriched with external ratings", left_joined.shape)
        return left_joined
    
    def full_outer_join_integration(self, modern_movies, classic_movies):
        print("=== FULL OUTER JOIN INTEGRATION ===")
        
        full_joined = pd.merge(modern_movies, classic_movies, 
                              on=['title'], how='outer', suffixes=('_modern', '_classic'))
        
        print(f"Modern movies: {len(modern_movies)}")
        print(f"Classic movies: {len(classic_movies)}")
        print(f"Full join result: {len(full_joined)} movies")
        
        self.log_integration("Full Outer Join", "Complete movie timeline", full_joined.shape)
        return full_joined
    
    def exclusion_integration(self, movie_data, exclude_years=None):
        print("=== EXCLUSION INTEGRATION ===")
        
        if exclude_years is None:
            exclude_years = [2020, 2021]  
        
        filtered_data = movie_data[~movie_data['release_year'].isin(exclude_years)].copy()
        excluded_count = len(movie_data) - len(filtered_data)
        
        print(f"Original: {len(movie_data)} movies")
        print(f"Excluded: {excluded_count} movies") 
        print(f"Clean dataset: {len(filtered_data)} movies")
        
        self.log_integration("Exclusion", f"Removed {exclude_years}", filtered_data.shape)
        return filtered_data

Dataset Versioning System (Applied from Croatia GDP exercise)
 
 **Why Dataset Versioning is Critical:**
 
 Movie data changes constantly: new releases, updated ratings, box office corrections.
 
 Without versioning, you cannot reproduce model results or track which data version produced which predictions. Versioning is essential for production ML systems and audit trails.

**Real-World Need:** When a model makes wrong predictions, you must trace back to the exact dataset version used for training to debug and improve.
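
One way to make that trace-back possible is to store a small metadata record next to each exported CSV. The sketch below follows the `dataset_id` naming used in the next cell. The `sha256` content hash and the `row_count` field are additions for illustration, not part of the notebook's version info, and `build_version_record` is a hypothetical helper.

```python
import hashlib
import json
from datetime import datetime


def build_version_record(csv_bytes, row_count, version="1.0.0", now=None):
    """Describe one dataset export so a trained model can point back to it."""
    now = now or datetime.now()
    return {
        "dataset_id": f"MOVIE_SUCCESS_{now.strftime('%Y%m%d_%H%M')}",
        "version": version,
        "collection_date": now.isoformat(),
        "row_count": row_count,
        # Assumption: a content hash to detect silent changes to the file.
        "sha256": hashlib.sha256(csv_bytes).hexdigest(),
    }


# Stand-in for the bytes of an exported CSV file.
csv_bytes = b"id,title\n1,Example Movie\n"

record = build_version_record(csv_bytes, row_count=1, now=datetime(2024, 1, 31, 9, 5))
sidecar_json = json.dumps(record, indent=2, sort_keys=True)
print(sidecar_json)
```

If the same file is hashed again later and the digest differs, the data changed after the model was trained on it.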

In [23]:
class DatasetVersionManager:
    def create_version_info(self, dataset, integration_log):
        timestamp = datetime.now()
        dataset_id = f"MOVIE_SUCCESS_{timestamp.strftime('%Y%m%d_%H%M')}"
        
        version_info = {
            'dataset_id': dataset_id,
            'version': '1.0.0',
            'collection_date': timestamp.isoformat(),
            'data_sources': {
                'tmdb_api': {'movies_collected': len(dataset), 'api_version': 'v3'},
                'omdb_api': {'external_data_added': int(dataset['imdb_rating'].notna().sum()) if 'imdb_rating' in dataset.columns else 0}
            },
            'integration_techniques': [log['technique'] for log in integration_log],
            'quality_filters': ['budget > 0', 'revenue > 0', 'release >= 1990'],
            'target_variable': 'success_category (Hit/Break-even/Flop)'
        }
        
        return version_info, dataset_id
    
    def create_refresh_strategy(self):
        """Define data update strategy"""
        return {
            'tmdb_data': {'frequency': 'Weekly', 'reason': 'New releases'},
            'omdb_data': {'frequency': 'Monthly', 'reason': 'Rating updates'}
        }

Main Data Collector (All Components Combined)


This class coordinates all collection components to create the complete movie dataset

Combines discovery, detailed extraction, credits, and quality control in systematic workflow

Provides progress tracking and error recovery for large-scale data collection

Creates the foundational dataset needed for the data provisioning and modeling phases

I built this to systematically collect and integrate data from multiple sources while maintaining
the original TMDB workflow and adding external validation layers.

This creates production-ready datasets with comprehensive quality metrics
and progress tracking for large-scale data collection operations.
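
For batch collection, each batch maps to a fixed range of discover pages, which is what makes it possible to resume from a given batch number. The sketch below reproduces that page arithmetic on its own. It assumes 20 results per discover page, as the collector's page math does, and `plan_batches` is a hypothetical helper.

```python
ASSUMED_RESULTS_PER_PAGE = 20  # assumption carried over from the collector's page math


def plan_batches(total_target, batch_size):
    """Return (batch number, first page, last page) for every batch."""
    num_batches = (total_target + batch_size - 1) // batch_size  # ceiling division
    pages_per_batch = max(1, batch_size // ASSUMED_RESULTS_PER_PAGE)
    plan = []
    for batch_num in range(1, num_batches + 1):
        start_page = (batch_num - 1) * pages_per_batch + 1
        end_page = start_page + pages_per_batch - 1
        plan.append((batch_num, start_page, end_page))
    return plan


plan = plan_batches(total_target=10000, batch_size=2500)
for batch_num, start_page, end_page in plan:
    print(f"Batch {batch_num}: pages {start_page} to {end_page}")
```

With these numbers there are four batches of 125 pages each, so a run interrupted during batch 3 can restart at page 251 without re-fetching the first two batches.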

In [24]:
class MovieDataCollector:
    def __init__(self, tmdb_key, omdb_key=None):
        self.api_handler = APIHandler(tmdb_key, omdb_key)
        self.discovery = MovieDiscovery(self.api_handler)  
        self.details_extractor = MovieDetailsExtractor(self.api_handler)  
        self.credits_extractor = CreditsExtractor(self.api_handler)  
        self.success_classifier = SuccessClassifier()  
        self.data_checker = DataQualityChecker()  
        self.csv_exporter = CSVExporter()  
        
    def collect_movie_data(self, target_movies=5000, save_progress=True, include_omdb=True):
        print("Starting movie data collection...")
        
        pages_needed = max(1, target_movies // 20)
        discovered_movies = self.discovery.discover_movies(pages_needed)  
        
        complete_movies = []
        
        for i, movie in enumerate(discovered_movies):
            if i >= target_movies:
                break
                
            movie_id = movie['id']
            title = movie.get('title')
            release_date = movie.get('release_date', '')
            year = release_date[:4] if release_date and len(release_date) >= 4 else None
            
            print(f"Processing {i+1}/{min(target_movies, len(discovered_movies))}: {title}")
            
            movie_details = self.details_extractor.get_movie_details(movie_id, title, year, include_omdb=include_omdb)
            if not movie_details:
                continue

            credits = self.credits_extractor.get_movie_credits(movie_id)
            
            complete_movie = {**movie_details, **credits}
            complete_movie = self.success_classifier.add_useful_features(complete_movie)
            
            if self.data_checker.is_good_movie_data(complete_movie):
                complete_movie = self.data_checker.clean_movie_data(complete_movie)
                complete_movies.append(complete_movie)
        
        filename, df = self.csv_exporter.save_to_csv(complete_movies)
        
        return filename, df, complete_movies
    
    def collect_in_batches(self, total_target=10000, batch_size=2500, start_batch=1, include_omdb=True):
        print(f"Starting batch collection: {total_target} total movies in batches of {batch_size}")
        
        all_movies = []
        batch_files = []
        
        num_batches = (total_target + batch_size - 1) // batch_size  
        pages_per_batch = max(1, batch_size // 20)  
        
        for batch_num in range(start_batch, num_batches + 1):
            print(f"\n{'='*50}")
            print(f"BATCH {batch_num}/{num_batches}")
            print(f"{'='*50}")
            
            start_page = (batch_num - 1) * pages_per_batch + 1
            end_page = start_page + pages_per_batch - 1
            print(f"Fetching from TMDB pages {start_page} to {end_page}")
            
            discovered_movies = self.discovery.discover_movies(
                pages=pages_per_batch, 
                start_page=start_page
            )
            
            complete_movies = []
            processed_count = 0
            
            for i, movie in enumerate(discovered_movies):
                if len(complete_movies) >= batch_size:  
                    break
                    
                movie_id = movie['id']
                title = movie.get('title')
                release_date = movie.get('release_date', '')
                year = release_date[:4] if release_date and len(release_date) >= 4 else None
                
                processed_count += 1
                print(f"Processing {processed_count}/{len(discovered_movies)} (Batch {batch_num}): {title}")
                
                movie_details = self.details_extractor.get_movie_details(movie_id, title, year, include_omdb=include_omdb)
                if not movie_details:
                    continue

                credits = self.credits_extractor.get_movie_credits(movie_id)
                
                complete_movie = {**movie_details, **credits}
                complete_movie = self.success_classifier.add_useful_features(complete_movie)
                
                if self.data_checker.is_good_movie_data(complete_movie):
                    complete_movie = self.data_checker.clean_movie_data(complete_movie)
                    complete_movies.append(complete_movie)
            
            batch_df = pd.DataFrame(complete_movies)
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            batch_specific_name = f"movie_batch_{batch_num}_of_{num_batches}_{timestamp}.csv"
            
            list_columns = ['genres', 'production_companies', 'production_countries', 'main_cast', 'main_cast_ids']
            for col in list_columns:
                if col in batch_df.columns:
                    batch_df[col] = batch_df[col].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else x)
            
            batch_df.to_csv(batch_specific_name, index=False)
            
            print(f"Batch {batch_num} saved as: {batch_specific_name}")
            print(f"Batch {batch_num} collected: {len(complete_movies)} valid movies")
            print(f"Batch {batch_num} processed: {processed_count} total movies")
            
            all_movies.extend(complete_movies)
            batch_files.append(batch_specific_name)
            
            if len(all_movies) >= total_target:
                print(f"Reached target of {total_target} movies. Stopping collection.")
                break
        
        return self._combine_batches(batch_files, all_movies)

    def collect_separated_datasets(self, total_target=5000, batch_size=2500):
        print("=== COLLECTING SEPARATED DATASETS FOR INTEGRATION ===")
        
        # Step 1: Collect TMDB-only dataset
        print("\n1. Collecting TMDB-only dataset...")
        tmdb_filename, tmdb_df, tmdb_movies = self.collect_in_batches(
            total_target=total_target, 
            batch_size=batch_size, 
            include_omdb=False
        )
        
        # Step 2: Create OMDb dataset for the same movies
        print("\n2. Collecting OMDb data for the same movies...")
        omdb_data = []
        
        for i, movie in enumerate(tmdb_movies): 
            title = movie.get('title')
            release_date = movie.get('release_date', '')
            year = release_date[:4] if release_date and len(release_date) >= 4 else None
            movie_id = movie.get('id')
            
            if title and year and movie_id:
                print(f"Getting OMDb data {i+1}/{len(tmdb_movies)}: {title}")
                external_data = self.api_handler.get_omdb_data(title, str(year))
                
                if external_data and any(external_data.values()):  # Has meaningful data
                    omdb_record = {'id': movie_id, **external_data}
                    omdb_data.append(omdb_record)
        
        omdb_df = pd.DataFrame(omdb_data)
        
        # Save OMDb dataset
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        omdb_filename = f"omdb_dataset_{len(omdb_df)}_movies_{timestamp}.csv"
        omdb_df.to_csv(omdb_filename, index=False)
        
        print("\nSeparated datasets created:")
        print(f"TMDB dataset: {tmdb_filename} ({len(tmdb_df)} movies)")
        print(f"OMDb dataset: {omdb_filename} ({len(omdb_df)} movies)")
        
        return tmdb_df, omdb_df

    def _combine_batches(self, batch_files, all_movies):
        print(f"\n{'='*50}")
        print("COMBINING BATCHES")
        print(f"{'='*50}")
        
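        final_df = pd.DataFrame(all_movies)
        
        initial_count = len(final_df)
        # Guard: an empty collection has no 'id' column, so drop_duplicates would raise KeyError
        if 'id' in final_df.columns:
            final_df = final_df.drop_duplicates(subset=['id'], keep='first')
        final_count = len(final_df)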
        
        print(f"Total movies collected: {initial_count}")
        print(f"After removing duplicates: {final_count}")
        print(f"Duplicates removed: {initial_count - final_count}")
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        final_filename = f"movie_dataset_COMPLETE_{final_count}_movies_{timestamp}.csv"
        final_df.to_csv(final_filename, index=False)
        
        print(f"Final dataset saved: {final_filename}")
        print(f"Batch files created: {batch_files}")
        
        return final_filename, final_df, all_movies
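
`_combine_batches` deduplicates `final_df` but returns `all_movies` unchanged, so the OMDb loop in `collect_separated_datasets` can meet the same `id` more than once if batches overlap. The following is a minimal sketch of preparing one lookup per id before calling OMDb. The helper `unique_omdb_lookups` and the sample records are hypothetical and not part of the collector; the year rule (first four characters of `release_date`) mirrors the code above.

```python
def unique_omdb_lookups(movies):
    """Yield (movie_id, title, year) once per id, skipping incomplete records."""
    seen_ids = set()
    for movie in movies:
        movie_id = movie.get("id")
        title = movie.get("title")
        release_date = movie.get("release_date") or ""
        # Same rule as the collector: the year is the first four characters of the date
        year = release_date[:4] if len(release_date) >= 4 else None
        if not (movie_id and title and year):
            continue  # not enough information for a title/year lookup
        if movie_id in seen_ids:
            continue  # already queued, avoid a second request for the same movie
        seen_ids.add(movie_id)
        yield movie_id, title, year


# Hypothetical records, for illustration only
sample_movies = [
    {"id": 1, "title": "First", "release_date": "2019-05-01"},
    {"id": 1, "title": "First", "release_date": "2019-05-01"},  # repeated across batches
    {"id": 2, "title": "No Date", "release_date": ""},
    {"id": 3, "title": "Third", "release_date": "2022-11-18"},
]
lookups = list(unique_omdb_lookups(sample_movies))
print(lookups)
```

Each tuple could then be passed to `get_omdb_data(title, year)`, which keeps the number of OMDb requests equal to the number of distinct movies.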

Execute Data Collection


In [25]:
print("MOVIE DATA COLLECTION WITH MEANINGFUL INTEGRATION")
print("=" * 60)
collector = MovieDataCollector(API_KEY, API_KEY_OMDB)
print("Collecting separated datasets for integration demonstration...")
tmdb_df, omdb_df = collector.collect_separated_datasets(
    total_target=10000, 
    batch_size=2500
)

print("\nDataset Collection Complete!")
print(f"TMDB dataset: {len(tmdb_df)} movies")
print(f"OMDb dataset: {len(omdb_df)} movies")

integrator = MovieDataIntegrator()

print("\nDemonstrating Meaningful Integration Techniques:")

# 1. Union: recombine the two halves of the TMDB data as if they were separate collection batches
mid = len(tmdb_df) // 2
batch1 = tmdb_df.iloc[:mid].copy()
batch2 = tmdb_df.iloc[mid:].copy()
union_result = integrator.union_integration(batch1, batch2)

# 2. Inner join: keep only movies that have both TMDB and OMDb data
inner_result = integrator.inner_join_integration(tmdb_df, omdb_df)

# 3. Left join: keep every TMDB movie and attach OMDb data where available
left_result = integrator.left_join_integration(tmdb_df, omdb_df)

# 4. Exclusion: drop pandemic-year releases from the union result
clean_result = integrator.exclusion_integration(union_result)

print("\nIntegration Results:")
print(f"TMDB only: {len(tmdb_df)} movies")
print(f"OMDb enrichment: {len(omdb_df)} movies") 
print(f"Union (combined batches): {len(union_result)} movies")
print(f"Inner join (complete data): {len(inner_result)} movies")
print(f"Left join (TMDB + available OMDb): {len(left_result)} movies")
print(f"Clean dataset (no pandemic years): {len(clean_result)} movies")

final_analysis_dataset = left_result

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
final_filename = f"movie_dataset_INTEGRATED_{len(final_analysis_dataset)}_movies_{timestamp}.csv"
final_analysis_dataset.to_csv(final_filename, index=False)

print("\nIntegration Insights:")
print("• Union combines data collection batches while removing duplicates")
print(f"• Inner join creates high-quality subset ({len(inner_result)} movies with complete validation)")
print("• Left join preserves all movies while adding external validation where possible")
print(f"• Exclusion removes {len(union_result) - len(clean_result)} pandemic-affected movies")

print(f"\nFinal dataset saved: {final_filename}")
print("Ready for Data Provisioning analysis!")
print("DONE!")

MOVIE DATA COLLECTION WITH MEANINGFUL INTEGRATION
Collecting separated datasets for integration demonstration...
=== COLLECTING SEPARATED DATASETS FOR INTEGRATION ===

1. Collecting TMDB-only dataset...
Starting batch collection: 10000 total movies in batches of 2500

BATCH 1/4
Fetching from TMDB pages 1 to 125
Discovering movies from pages 1 to 125...
Page 1: Found 20 movies
Page 2: Found 20 movies
Page 3: Found 20 movies
Page 4: Found 20 movies
Page 5: Found 20 movies
Page 6: Found 20 movies
Page 7: Found 20 movies
Page 8: Found 20 movies
Page 9: Found 20 movies
Page 10: Found 20 movies
Page 11: Found 20 movies
Page 12: Found 20 movies
Page 13: Found 20 movies
Page 14: Found 20 movies
Page 15: Found 20 movies
Page 16: Found 20 movies
Page 17: Found 20 movies
Page 18: Found 20 movies
Page 19: Found 20 movies
Page 20: Found 20 movies
Page 21: Found 20 movies
Page 22: Found 20 movies
Page 23: Found 20 movies
Page 24: Found 20 movies
Page 25: Found 20 movies
Page 26: Found 20 movies
[output truncated]
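
The four integration calls in the cell above go through `MovieDataIntegrator`, whose definition is not shown in this section. As a rough illustration of what such steps usually amount to in plain pandas, here is a minimal sketch under stated assumptions: both frames share an `id` column, the TMDB frame has a `release_year` column, and "pandemic years" means 2020 and 2021. The sample frames and the `PANDEMIC_YEARS` constant are hypothetical, and the real class may differ.

```python
import pandas as pd

# Hypothetical miniature datasets standing in for tmdb_df and omdb_df
tmdb = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "release_year": [2018, 2020, 2021, 2023],
})
omdb = pd.DataFrame({"id": [1, 3], "imdb_rating": [7.1, 6.4]})

# Union: stack two batches and drop repeated ids
mid = len(tmdb) // 2
union = (
    pd.concat([tmdb.iloc[:mid], tmdb.iloc[mid:]], ignore_index=True)
    .drop_duplicates(subset=["id"], keep="first")
)

# Inner join: only movies present in both sources
inner = tmdb.merge(omdb, on="id", how="inner")

# Left join: every TMDB movie; indicator=True adds a _merge column
# showing whether an OMDb match was found ("both") or not ("left_only")
left = tmdb.merge(omdb, on="id", how="left", indicator=True)

# Exclusion: drop rows released in the assumed pandemic years
PANDEMIC_YEARS = {2020, 2021}  # assumption, not taken from the notebook
clean = union[~union["release_year"].isin(PANDEMIC_YEARS)]

print(len(union), len(inner), len(left), len(clean))
```

The `_merge` column is a convenient way to report how many TMDB movies actually received OMDb enrichment, which the row count of a left join alone does not show.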

In [26]:
import pandas as pd

# Create simple data dictionary table
data = [
    ['id', 'Integer', 'Primary key for joins', 'Unique identifier for data integration', 'TMDB API'],
    ['title', 'String', 'Not used in ML (text processing complex)', 'Movie identification and validation', 'TMDB API'],
    ['budget', 'Integer', 'Core predictor - Higher budgets often mean higher marketing spend, star power', 'Essential for ROI calculation and success threshold', 'TMDB API'],
    ['revenue', 'Integer', 'Target variable calculation - Used to create Hit/Break-even/Flop labels', 'Box office performance is ultimate success measure', 'TMDB API'],
    ['runtime', 'Integer', 'Predictor - Affects theater scheduling, audience engagement patterns', 'Longer movies = fewer showings per day = potential revenue impact', 'TMDB API'],
    ['release_year', 'Integer', 'Predictor - Market conditions, competition levels vary by year', 'Different years have different box office environments', 'TMDB API (derived)'],
    ['release_month', 'Integer', 'Strong predictor - Summer/holiday releases perform differently', 'Strategic release timing affects box office potential', 'TMDB API (derived)'],
    ['is_summer_movie', 'Boolean', 'Categorical predictor - Summer blockbuster strategy', 'Summer movies target different audiences, higher revenue potential', 'TMDB API (derived)'],
    ['is_holiday_movie', 'Boolean', 'Categorical predictor - Holiday release strategy', 'November-December releases target awards season and family audiences', 'TMDB API (derived)'],
    ['vote_average', 'Float', 'Predictor - Audience reception affects word-of-mouth marketing', 'Positive reception drives sustained box office performance', 'TMDB API'],
    ['vote_count', 'Integer', 'Confidence measure - Sample size for rating reliability', 'More votes = more reliable audience sentiment measure', 'TMDB API'],
    ['primary_genre', 'String', 'Categorical predictor - Genre popularity varies by market conditions', 'Different genres have different success patterns and audience sizes', 'TMDB API'],
    ['genre_count', 'Integer', 'Predictor - Multi-genre movies target broader audiences', 'More genres might indicate broader appeal or confused marketing', 'TMDB API (calculated)'],
    ['director', 'String', 'High-impact predictor - Director track record strongly correlates with success', 'Proven directors reduce investment risk, established fan bases', 'TMDB API'],
    ['lead_actor', 'String', 'Star power predictor - A-list actors drive opening weekend performance', 'Star recognition affects marketing effectiveness and audience draw', 'TMDB API'],
    ['main_production_company', 'String', 'Predictor - Studio resources and distribution networks vary', 'Major studios have better marketing budgets and theater access', 'TMDB API'],
    ['is_us_movie', 'Boolean', 'Market predictor - US movies have different success patterns', 'US films have domestic market advantage and global distribution', 'TMDB API (calculated)'],
    ['imdb_rating', 'Float', 'External validation - Independent rating source for model validation', 'IMDb ratings provide second opinion on movie quality', 'OMDb API'],
    ['rotten_tomatoes_score', 'Integer', 'Professional critics predictor - Critical acclaim affects awards and longevity', 'Critics scores influence awards season and long-term revenue', 'OMDb API'],
    ['metacritic_score', 'Integer', 'Aggregated critics predictor - Professional review consensus', 'Metacritic aggregates multiple professional reviews for balanced view', 'OMDb API'],
    ['has_awards', 'Boolean', 'Prestige predictor - Awards recognition drives additional revenue streams', 'Awards boost home video sales, streaming rights, international sales', 'OMDb API (calculated)'],
    ['has_oscar_mention', 'Boolean', 'Premium predictor - Oscar recognition significantly affects profitability', 'Oscar nominations/wins create long-term revenue opportunities', 'OMDb API (calculated)'],
    ['budget_category', 'String', 'Budget tier predictor - Different budget levels have different success patterns', 'Blockbusters vs indies have different risk/reward profiles', 'TMDB API (calculated)'],
    ['profit_ratio', 'Float', 'Target variable - Continuous measure of financial performance', 'Revenue/budget ratio determines investment success', 'Both APIs (calculated)'],
    ['success_category', 'String', 'ML Target - Hit/Break-even/Flop classification for supervised learning', 'Industry-standard profitability categories for investment decisions', 'Both APIs (calculated)']
]

df = pd.DataFrame(data, columns=['Field Name', 'Data Type', 'ML Purpose', 'Business Rationale', 'Data Source'])
print("Data Dictionary")
print("=" * 50)
df
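
The dictionary describes `profit_ratio` as the revenue/budget ratio and `success_category` as a Hit/Break-even/Flop label derived from it. The actual thresholds live in the success classifier, which is not shown in this section, so the cut-offs below (2.0 and 1.0) are assumptions chosen only to make the idea concrete. A minimal sketch with hypothetical helper names:

```python
HIT_THRESHOLD = 2.0         # assumed cut-off, not taken from the notebook
BREAK_EVEN_THRESHOLD = 1.0  # assumed cut-off, not taken from the notebook


def profit_ratio(budget, revenue):
    """Return revenue / budget, or None when the ratio cannot be computed."""
    if revenue is None or not budget or budget <= 0:
        return None  # a missing or zero budget would make the ratio meaningless
    return revenue / budget


def success_category(ratio):
    """Map a profit ratio to one of the three labels used in the data dictionary."""
    if ratio is None:
        return None
    if ratio >= HIT_THRESHOLD:
        return "Hit"
    if ratio >= BREAK_EVEN_THRESHOLD:
        return "Break-even"
    return "Flop"


# Hypothetical budget/revenue pairs, for illustration only
for budget, revenue in [(100, 250), (100, 120), (100, 40), (0, 500)]:
    ratio = profit_ratio(budget, revenue)
    print(budget, revenue, ratio, success_category(ratio))
```

Returning `None` for unusable budgets keeps such rows out of the labelled classes instead of silently counting them as hits or flops.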

Data Dictionary


Unnamed: 0,Field Name,Data Type,ML Purpose,Business Rationale,Data Source
0,id,Integer,Primary key for joins,Unique identifier for data integration,TMDB API
1,title,String,Not used in ML (text processing complex),Movie identification and validation,TMDB API
2,budget,Integer,Core predictor - Higher budgets often mean hig...,Essential for ROI calculation and success thre...,TMDB API
3,revenue,Integer,Target variable calculation - Used to create H...,Box office performance is ultimate success mea...,TMDB API
4,runtime,Integer,"Predictor - Affects theater scheduling, audien...",Longer movies = fewer showings per day = poten...,TMDB API
5,release_year,Integer,"Predictor - Market conditions, competition lev...",Different years have different box office envi...,TMDB API (derived)
6,release_month,Integer,Strong predictor - Summer/holiday releases per...,Strategic release timing affects box office po...,TMDB API (derived)
7,is_summer_movie,Boolean,Categorical predictor - Summer blockbuster str...,"Summer movies target different audiences, high...",TMDB API (derived)
8,is_holiday_movie,Boolean,Categorical predictor - Holiday release strategy,November-December releases target awards seaso...,TMDB API (derived)
9,vote_average,Float,Predictor - Audience reception affects word-of...,Positive reception drives sustained box office...,TMDB API
