# TMDB Movie Data Analysis - Interactive Notebook

This notebook provides interactive analysis and visualization of TMDB movie data processed using Apache Spark.

## Overview
- **Data Source**: TMDB API
- **Processing**: Apache Spark ETL
- **Visualization**: Matplotlib & Seaborn
- **Analysis**: Pandas operations on processed Parquet data

## Contents
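1. Load Data
2. Visualizations
3. Analysis Summary
4. Top Performing Movies
5. Advanced Filtering & Search Queries
6. Franchise vs Standalone Analysis
7. Top Directors Analysis (includes genre performance)
8. Key Insights & Conclusions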

In [1]:
import sys
import warnings
from pathlib import Path

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Add project root to path to import model
project_root = Path("..").resolve()
sys.path.append(str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from model.config import PROCESSED_DATA_PATH, PLOTS_DIR
from model.visualization.plots import create_all_visualizations

# Setup pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print(f"✓ Project Root: {project_root}")
print(f"✓ Processed Data Path: {PROCESSED_DATA_PATH}")
print(f"✓ Plots Directory: {PLOTS_DIR}")

✓ Project Root: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl
✓ Processed Data Path: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\data\processed
✓ Plots Directory: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots


## 1. Load Data
Load the processed Parquet output of the Spark ETL into a pandas DataFrame.

In [2]:
# Load processed data
try:
    # No engine is passed, so pandas picks one automatically (pyarrow when installed)
    df = pd.read_parquet(str(PROCESSED_DATA_PATH))
    print(f"✓ Successfully loaded {len(df)} records")
    print(f"✓ Dataset shape: {df.shape}")
    print(f"\nColumns: {', '.join(df.columns.tolist())}\n")
    
    # Display basic statistics
    print("Dataset Info:")
    print(f"  - Total movies: {len(df)}")
    print(f"  - Date range: {df['release_year'].min():.0f} - {df['release_year'].max():.0f}")
    print(f"  - Franchise movies: {(df['belongs_to_collection'].notna()).sum()}")
    print(f"  - Standalone movies: {(df['belongs_to_collection'].isna()).sum()}")
    print(f"  - Avg rating: {df['vote_average'].mean():.2f}/10")
    print(f"  - Median budget: ${df['budget_musd'].median():.2f}M")
    print(f"  - Median revenue: ${df['revenue_musd'].median():.2f}M\n")
    
    # Show sample
    print("Sample Data (first 3 rows):")
    display(df.head(3))
    
except Exception as e:
    print(f"✗ Error: {e}")
    print("Please ensure the ETL pipeline has been run. Run: python main.py")

✓ Successfully loaded 18 records
✓ Dataset shape: (18, 25)

Columns: id, title, tagline, release_date, genres, belongs_to_collection, original_language, budget_musd, revenue_musd, production_companies, production_countries, vote_count, vote_average, popularity, runtime, overview, spoken_languages, poster_path, cast, cast_size, director, crew_size, profit, roi, release_year

Dataset Info:
  - Total movies: 18
✗ Error: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one

Please ensure the ETL pipeline has been run. Run: python main.py
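The data itself loaded correctly (18 records), so the hint about running the ETL pipeline is misleading here. The exception is raised by `.min()` on `release_year`, which arrived as an unordered categorical column. One plausible cause, stated as an assumption, is that the Parquet dataset is partitioned by that column. A minimal sketch of a workaround on toy data follows; the helper name `year_range` is hypothetical and not part of the project.

```python
import pandas as pd

# Toy frame standing in for the loaded data. release_year is an unordered
# categorical, mimicking the dtype that triggered the error (assumption:
# the real column holds year values as categories).
df = pd.DataFrame({
    "title": ["Titanic", "Avatar", "Avengers: Endgame"],
    "release_year": pd.Categorical(["1997", "2009", "2019"]),
})


def year_range(frame: pd.DataFrame, column: str = "release_year") -> tuple[int, int]:
    """Return (earliest, latest) year after coercing the column to numbers."""
    # Go through str first so the conversion works for any category dtype;
    # unparseable values become NaN and are dropped.
    years = pd.to_numeric(frame[column].astype(str), errors="coerce").dropna()
    return int(years.min()), int(years.max())


first, last = year_range(df)
print(f"  - Date range: {first} - {last}")
```

Converting the column once, right after loading, would also keep later operations such as sorting by `release_year` from failing in the same way.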


## 2. Visualizations
Generating plots and saving them to `output/plots`.

In [3]:
# Generate all visualizations
PLOTS_DIR.mkdir(parents=True, exist_ok=True)
create_all_visualizations(df, PLOTS_DIR)

print("\n✓ All visualizations have been generated!")


GENERATING VISUALIZATIONS

1. Creating Revenue vs Budget plot...
✓ Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\revenue_vs_budget.png
2. Creating ROI by Genre plot...
✓ Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\roi_by_genre.png
3. Creating Popularity vs Rating plot...
✓ Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\popularity_vs_rating.png
4. Creating Yearly Trends plot...
✓ Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\yearly_trends.png
5. Creating Franchise vs Standalone plot...
✓ Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\franchise_vs_st

## 3. Analysis Summary
Quick view of top movies.

In [4]:
## 4.1 Top Performing Movies

print("=" * 80)
print("TOP 5 MOVIES BY REVENUE")
print("=" * 80)
top_revenue = df.dropna(subset=['revenue_musd']).nlargest(5, 'revenue_musd')[
    ['title', 'revenue_musd', 'budget_musd', 'vote_average', 'release_year']
]
display(top_revenue)

print("\n" + "=" * 80)
print("TOP 5 MOVIES BY ROI (Budget >= $10M)")
print("=" * 80)
top_roi = df[(df['budget_musd'] >= 10) & (df['roi'].notna())].nlargest(5, 'roi')[
    ['title', 'roi', 'budget_musd', 'revenue_musd', 'release_year']
]
display(top_roi)

print("\n" + "=" * 80)
print("TOP 5 HIGHEST RATED MOVIES (min. 10 votes)")
print("=" * 80)
top_rated = df[df['vote_count'] >= 10].nlargest(5, 'vote_average')[
    ['title', 'vote_average', 'vote_count', 'popularity', 'release_year']
]
display(top_rated)

TOP 5 MOVIES BY REVENUE


| | title | revenue_musd | budget_musd | vote_average | release_year |
|---|---|---|---|---|---|
| 1 | Avatar | 2923.706026 | 237.0 | 7.6 | 2009 |
| 15 | Avengers: Endgame | 2799.4391 | 356.0 | 8.236 | 2019 |
| 0 | Titanic | 2264.162353 | 200.0 | 7.902 | 1997 |
| 6 | Star Wars: The Force Awakens | 2068.223624 | 245.0 | 7.3 | 2015 |
| 12 | Avengers: Infinity War | 2052.415039 | 300.0 | 8.234 | 2018 |



TOP 5 MOVIES BY ROI (Budget >= $10M)


| | title | roi | budget_musd | revenue_musd | release_year |
|---|---|---|---|---|---|
| 1 | Avatar | 12.336312 | 237.0 | 2923.706026 | 2009 |
| 0 | Titanic | 11.320812 | 200.0 | 2264.162353 | 1997 |
| 5 | Jurassic World | 11.143583 | 150.0 | 1671.537444 | 2015 |
| 2 | Harry Potter and the Deathly Hallows: Part 2 | 10.73209 | 125.0 | 1341.511219 | 2011 |
| 17 | Frozen II | 9.691223 | 150.0 | 1453.683476 | 2019 |



TOP 5 HIGHEST RATED MOVIES (min. 10 votes)


| | title | vote_average | vote_count | popularity | release_year |
|---|---|---|---|---|---|
| 15 | Avengers: Endgame | 8.236 | 27206 | 16.1581 | 2019 |
| 12 | Avengers: Infinity War | 8.234 | 31424 | 25.5575 | 2018 |
| 2 | Harry Potter and the Deathly Hallows: Part 2 | 8.081 | 21643 | 15.6738 | 2011 |
| 3 | The Avengers | 7.932 | 35655 | 49.997 | 2012 |
| 0 | Titanic | 7.902 | 26732 | 30.8471 | 1997 |


## 5. Advanced Filtering & Search Queries

In [5]:
## Query 1: Sci-Fi Action Movies with Bruce Willis
print("=" * 80)
print("QUERY 1: SCI-FI ACTION MOVIES WITH BRUCE WILLIS (sorted by rating)")
print("=" * 80)

bruce_scifi = df[
    (df['genres'].str.contains('Science Fiction', na=False)) &
    (df['genres'].str.contains('Action', na=False)) &
    (df['cast'].str.contains('Bruce Willis', na=False))
].sort_values('vote_average', ascending=False)[
    ['title', 'vote_average', 'release_year', 'genres']
]

if len(bruce_scifi) > 0:
    display(bruce_scifi)
else:
    print("No movies found matching criteria.")

## Query 2: Uma Thurman + Quentin Tarantino
print("\n" + "=" * 80)
print("QUERY 2: MOVIES WITH UMA THURMAN DIRECTED BY QUENTIN TARANTINO (by runtime)")
print("=" * 80)

uma_tarantino = df[
    (df['cast'].str.contains('Uma Thurman', na=False)) &
    (df['director'] == 'Quentin Tarantino')
].sort_values('runtime', ascending=True)[
    ['title', 'release_year', 'runtime', 'vote_average']
]

if len(uma_tarantino) > 0:
    display(uma_tarantino)
else:
    print("No movies found matching criteria.")

QUERY 1: SCI-FI ACTION MOVIES WITH BRUCE WILLIS (sorted by rating)
No movies found matching criteria.

QUERY 2: MOVIES WITH UMA THURMAN DIRECTED BY QUENTIN TARANTINO (by runtime)
No movies found matching criteria.
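Both queries come back empty, which fits a sample of 18 top-grossing titles that apparently contains no matching films. The filtering pattern can still be checked on toy data. The sketch below assumes `cast` uses the same `|` separator that the notebook later splits `genres` on. The helper `contains_all` is hypothetical and not part of the project.

```python
import pandas as pd

# Toy rows for illustration only. The '|' separator in cast is an assumption.
movies = pd.DataFrame({
    "title": ["Avatar", "Titanic", "The Avengers"],
    "genres": [
        "Action|Adventure|Fantasy|Science Fiction",
        "Drama|Romance",
        "Science Fiction|Action|Adventure",
    ],
    "cast": ["Sam Worthington|Zoe Saldaña", "Leonardo DiCaprio|Kate Winslet", None],
    "vote_average": [7.6, 7.902, 7.932],
})


def contains_all(series: pd.Series, terms: list[str]) -> pd.Series:
    """Boolean mask that is True where the string contains every term."""
    mask = pd.Series(True, index=series.index)
    for term in terms:
        # regex=False treats the term literally; na=False maps missing values to False
        mask &= series.str.contains(term, case=False, regex=False, na=False)
    return mask


scifi_action = movies[contains_all(movies["genres"], ["Science Fiction", "Action"])]
with_actor = movies[contains_all(movies["cast"], ["Zoe Saldaña"])]

print(scifi_action["title"].tolist())
print(with_actor["title"].tolist())
```

Passing `regex=False` avoids surprises when a name contains characters such as `.` or `(`, which the default regex mode would interpret as pattern syntax.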


## 6. Franchise vs Standalone Analysis

In [6]:
# Classify movies as franchise or standalone
df['movie_type'] = df['belongs_to_collection'].apply(
    lambda x: 'Franchise' if pd.notna(x) else 'Standalone'
)

# Get clean data for analysis
df_clean = df.dropna(subset=['revenue_musd', 'budget_musd', 'vote_average'])

print("=" * 80)
print("FRANCHISE VS STANDALONE COMPARISON")
print("=" * 80)

franchise = df_clean[df_clean['movie_type'] == 'Franchise']
standalone = df_clean[df_clean['movie_type'] == 'Standalone']

comparison_data = {
    'Movie Type': ['Franchise', 'Standalone'],
    'Count': [len(franchise), len(standalone)],
    'Avg Revenue (M)': [franchise['revenue_musd'].mean(), standalone['revenue_musd'].mean()],
    'Avg Budget (M)': [franchise['budget_musd'].mean(), standalone['budget_musd'].mean()],
    'Avg Rating': [franchise['vote_average'].mean(), standalone['vote_average'].mean()],
    'Avg Popularity': [franchise['popularity'].mean(), standalone['popularity'].mean()]
}

comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)

print("\n" + "=" * 80)
print("TOP 10 FRANCHISES BY TOTAL REVENUE")
print("=" * 80)

franchise_stats = df.dropna(subset=['belongs_to_collection']).groupby('belongs_to_collection').agg({
    'id': 'count',
    'revenue_musd': ['sum', 'mean'],
    'budget_musd': 'mean',
    'vote_average': 'mean'
}).rename(columns={'id': 'movie_count'})

franchise_stats.columns = ['movie_count', 'total_revenue', 'mean_revenue', 'mean_budget', 'mean_rating']
franchise_stats = franchise_stats.sort_values('total_revenue', ascending=False)
top_franchises = franchise_stats[franchise_stats['movie_count'] >= 2].head(10)

display(top_franchises)

FRANCHISE VS STANDALONE COMPARISON


| | Movie Type | Count | Avg Revenue (M) | Avg Budget (M) | Avg Rating | Avg Popularity |
|---|---|---|---|---|---|---|
| 0 | Franchise | 16 | 1682.668411 | 218.0 | 7.390813 | 16.86015 |
| 1 | Standalone | 2 | 1765.139159 | 180.0 | 7.435 | 20.30035 |



TOP 10 FRANCHISES BY TOTAL REVENUE


| belongs_to_collection | movie_count | total_revenue | mean_revenue | mean_budget | mean_rating |
|---|---|---|---|---|---|
| The Avengers Collection | 4 | 7776.073348 | 1944.018337 | 277.75 | 7.9255 |
| Star Wars Collection | 2 | 3400.922454 | 1700.461227 | 272.5 | 7.0295 |
| Jurassic Park Collection | 2 | 2982.006481 | 1491.003241 | 160.0 | 6.6165 |
| Frozen Collection | 2 | 2727.902485 | 1363.951242 | 150.0 | 7.224 |
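The franchise aggregation in the cell above renames a column and then overwrites all column names by position, which is easy to get wrong if the `agg` dictionary changes. A possible alternative, sketched here on made-up numbers, is pandas named aggregation, where each output column is declared next to its source column and function. The collection names and values below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative data only; the column names follow the notebook's schema.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "belongs_to_collection": ["A Collection", "A Collection",
                              "B Collection", "B Collection", None],
    "revenue_musd": [1000.0, 1500.0, 800.0, 600.0, 2000.0],
    "budget_musd": [200.0, 250.0, 150.0, 100.0, 220.0],
    "vote_average": [7.5, 8.0, 6.5, 7.0, 7.8],
})

franchise_stats = (
    toy.dropna(subset=["belongs_to_collection"])   # drop standalone movies
    .groupby("belongs_to_collection")
    .agg(
        movie_count=("id", "count"),
        total_revenue=("revenue_musd", "sum"),
        mean_revenue=("revenue_musd", "mean"),
        mean_budget=("budget_musd", "mean"),
        mean_rating=("vote_average", "mean"),
    )
    .sort_values("total_revenue", ascending=False)
)

# Same filter as the notebook: keep collections with at least two movies
top_franchises = franchise_stats[franchise_stats["movie_count"] >= 2].head(10)
print(top_franchises)
```

The result has flat column names from the start, so no manual assignment to `franchise_stats.columns` is needed.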


## 7. Top Directors Analysis

In [7]:
print("=" * 80)
print("TOP 10 DIRECTORS BY TOTAL REVENUE")
print("=" * 80)

director_stats = df[df['director'] != 'Unknown'].dropna(subset=['revenue_musd']).groupby('director').agg({
    'id': 'count',
    'revenue_musd': 'sum',
    'vote_average': 'mean'
}).rename(columns={'id': 'movie_count', 'revenue_musd': 'total_revenue', 'vote_average': 'mean_rating'})

director_stats = director_stats.sort_values('total_revenue', ascending=False).head(10)

display(director_stats)

print("\n" + "=" * 80)
print("GENRE PERFORMANCE ANALYSIS")
print("=" * 80)

# Expand genres
genre_data = []
for idx, row in df.iterrows():
    if pd.notna(row['genres']):
        genres = [g.strip() for g in str(row['genres']).split('|')]
        for genre in genres:
            genre_data.append({
                'genre': genre,
                'revenue': row['revenue_musd'],
                'rating': row['vote_average'],
                'popularity': row['popularity']
            })

genre_df = pd.DataFrame(genre_data).dropna()

genre_stats = genre_df.groupby('genre').agg({
    'rating': ['count', 'mean'],
    'revenue': 'mean',
    'popularity': 'mean'
}).round(2)

genre_stats.columns = ['count', 'avg_rating', 'avg_revenue', 'avg_popularity']
genre_stats = genre_stats.sort_values('avg_revenue', ascending=False)

display(genre_stats)

TOP 10 DIRECTORS BY TOTAL REVENUE


| director | movie_count | total_revenue | mean_rating |
|---|---|---|---|
| James Cameron | 2 | 5187.868379 | 7.751 |
| Joss Whedon | 2 | 2924.219209 | 7.616 |
| Anthony Russo | 1 | 2799.4391 | 8.236 |
| J.J. Abrams | 1 | 2068.223624 | 7.3 |
| Joe Russo | 1 | 2052.415039 | 8.234 |
| Colin Trevorrow | 1 | 1671.537444 | 6.699 |
| Jon Favreau | 1 | 1662.020819 | 7.097 |
| James Wan | 1 | 1515.4 | 7.217 |
| Jennifer Lee | 1 | 1453.683476 | 7.2 |
| Ryan Coogler | 1 | 1349.926083 | 7.363 |



GENRE PERFORMANCE ANALYSIS


| genre | count | avg_rating | avg_revenue | avg_popularity |
|---|---|---|---|---|
| Drama | 2 | 7.5 | 1963.09 | 19.99 |
| Science Fiction | 10 | 7.4 | 1843.26 | 19.69 |
| Action | 12 | 7.39 | 1765.94 | 17.96 |
| Romance | 2 | 7.44 | 1765.14 | 20.3 |
| Adventure | 15 | 7.4 | 1693.82 | 17.39 |
| Fantasy | 5 | 7.42 | 1651.85 | 19.07 |
| Crime | 1 | 7.22 | 1515.4 | 8.87 |
| Thriller | 3 | 6.82 | 1499.14 | 9.16 |
| Comedy | 1 | 7.2 | 1453.68 | 10.37 |
| Animation | 4 | 7.25 | 1408.29 | 12.08 |
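The genre expansion in the cell above loops over rows with `iterrows`, which works for 18 movies but scales poorly. A vectorized variant using `str.split` and `explode` might look like the sketch below. The data is made up for illustration, and the `|` separator is taken from the notebook's own split call.

```python
import pandas as pd

# Illustrative data only; note the stray space and the missing genres value.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Action| Drama", None],
    "revenue_musd": [1000.0, 500.0, 300.0],
    "vote_average": [7.0, 8.0, 6.0],
    "popularity": [10.0, 20.0, 5.0],
})

genre_df = (
    toy.dropna(subset=["genres"])
    # one list of genres per movie, split on a literal '|'
    .assign(genre=lambda d: d["genres"].str.split("|", regex=False))
    # one row per (movie, genre) pair
    .explode("genre", ignore_index=True)
    .assign(genre=lambda d: d["genre"].str.strip())
    .dropna(subset=["revenue_musd", "vote_average", "popularity"])
)

genre_stats = (
    genre_df.groupby("genre")
    .agg(
        count=("vote_average", "count"),
        avg_rating=("vote_average", "mean"),
        avg_revenue=("revenue_musd", "mean"),
        avg_popularity=("popularity", "mean"),
    )
    .round(2)
    .sort_values("avg_revenue", ascending=False)
)
print(genre_stats)
```

As in the original loop, a movie with several genres contributes its full revenue to each of them, so the per-genre averages overlap and should not be summed.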


## 8. Key Insights & Conclusions

### Data Insights
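- The processed dataset is small: 18 movies with 25 columns covering financial metrics (budget, revenue, profit, ROI) and audience metrics (rating, vote count, popularity)
- 16 of the 18 movies belong to a franchise. With only 2 standalone titles, the comparison (average revenue of about $1,683M for franchise vs. $1,765M for standalone) is indicative rather than conclusive
- Average revenue by genre ranges from about $1,408M (Animation) to $1,963M (Drama), but several genres are represented by only one or two movies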

### Performance Metrics
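- The five highest-grossing movies all had budgets of $200M or more, yet revenue rank does not track rating: Avatar leads on revenue with a 7.6 rating, while Avengers: Endgame rates 8.236
- ROI, which in this dataset equals revenue divided by budget, peaks at about 12.3 for Avatar; the rest of the top five fall between roughly 9.7 and 11.3
- James Cameron leads directors by total revenue (about $5,188M across two movies). Most directors in the top 10 appear only once, so the data cannot show whether any director delivers consistent revenue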
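The `roi` values in the section 3 table match revenue divided by budget, so the metric can be reproduced as in the sketch below. The `profit` definition (revenue minus budget) is an assumption, since the notebook never shows how the ETL computes it. The helper `add_roi` is hypothetical.

```python
import pandas as pd

# Budget and revenue figures (million USD) copied from the ROI table in section 3.
movies = pd.DataFrame({
    "title": ["Avatar", "Titanic", "Jurassic World"],
    "budget_musd": [237.0, 200.0, 150.0],
    "revenue_musd": [2923.706026, 2264.162353, 1671.537444],
})


def add_roi(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with roi (revenue / budget) and profit columns."""
    out = frame.copy()
    # Budgets of zero or below become NaN so the division never produces inf
    budget = out["budget_musd"].where(out["budget_musd"] > 0)
    out["roi"] = out["revenue_musd"] / budget
    # Assumed definition; the ETL's actual profit formula is not shown
    out["profit"] = out["revenue_musd"] - out["budget_musd"]
    return out


result = add_roi(movies)
print(result[["title", "roi", "profit"]])
```

Under this definition an ROI of 1.0 means the movie earned back exactly its budget, so every title in the top-five table returned roughly ten times or more what it cost.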

### Visualization Outputs
All visualizations have been saved to `output/plots/`:
- **revenue_vs_budget.png** - Budget-Revenue correlation by year
- **roi_by_genre.png** - Risk-return profile across genres
- **popularity_vs_rating.png** - Popularity vs. average user rating
- **yearly_trends.png** - Historical box office trends
- **franchise_vs_standalone.png** - Comparative performance analysis
- **rating_distribution.png** - Quality distribution
- **top_directors.png** - Director success metrics
- **genre_performance.png** - Genre-based analysis

### Recommendations
1. Analyze franchise profitability for investment decisions
2. Study top directors' strategies for success
3. Monitor genre trends for market shifts
4. Use ROI metrics for budget allocation
5. Balance critical ratings with commercial success