# TMDB Movie Data Analysis - Interactive Notebook

This notebook provides interactive analysis and visualization of TMDB movie data processed using Apache Spark.

## Overview
- **Data Source**: TMDB API
- **Processing**: Apache Spark ETL
- **Visualization**: Matplotlib & Seaborn
- **Analysis**: Pandas operations on processed Parquet data

## Contents
1. Data Loading & Setup
2. Visualizations
3. Analysis Summary (top movies)
4. Advanced Filtering & Search Queries
5. Franchise vs Standalone Analysis
6. Top Directors & Genre Analysis
7. Key Insights & Conclusions

In [None]:
import sys
import os
import warnings
from pathlib import Path

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Add the project root to sys.path so the local `model` package can be imported
project_root = Path("..").resolve()
sys.path.append(str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from model.config import PROCESSED_DATA_PATH, PLOTS_DIR
from model.visualization.plots import create_all_visualizations

# Setup pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print(f"✓ Project Root: {project_root}")
print(f"✓ Processed Data Path: {PROCESSED_DATA_PATH}")
print(f"✓ Plots Directory: {PLOTS_DIR}")

Project Root: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl
Data Path: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\data\processed


## 1. Load Data
Load the processed Parquet data written by the Spark ETL pipeline.

In [None]:
# Load processed data
try:
    # pandas' default engine="auto" tries pyarrow first, then fastparquet
    df = pd.read_parquet(str(PROCESSED_DATA_PATH))
    print(f"✓ Successfully loaded {len(df)} records")
    print(f"✓ Dataset shape: {df.shape}")
    print(f"\nColumns: {', '.join(df.columns.tolist())}\n")
    
    # Display basic statistics
    print("Dataset Info:")
    print(f"  - Total movies: {len(df)}")
    print(f"  - Release years: {df['release_year'].min():.0f} - {df['release_year'].max():.0f}")
    print(f"  - Franchise movies: {(df['belongs_to_collection'].notna()).sum()}")
    print(f"  - Standalone movies: {(df['belongs_to_collection'].isna()).sum()}")
    print(f"  - Avg rating: {df['vote_average'].mean():.2f}/10")
    print(f"  - Median budget: ${df['budget_musd'].median():.2f}M")
    print(f"  - Median revenue: ${df['revenue_musd'].median():.2f}M\n")
    
    # Show sample
    print("Sample Data (first 3 rows):")
    display(df.head(3))
    
except Exception as e:
    print(f"✗ Error: {e}")
    print("Please ensure the ETL pipeline has been run. Run: python main.py")
    raise  # re-raise so later cells do not fail with a NameError on `df`

Loaded DataFrame with 18 rows and 18 columns.


| | id | title | release_date | genres | original_language | budget | revenue | profit | roi | vote_average | vote_count | popularity | cast_size | crew_size | director | production_companies | production_countries | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 597 | Titanic | 1997-12-18 | [Drama, Romance] | en | 200000000 | 2264162353 | 2064162353 | 11.320812 | 7.902 | 26730 | 30.8471 | 116 | 264 | James Cameron | [Paramount Pictures, 20th Century Fox, Lightst... | [US] | 1997 |
| 1 | 19995 | Avatar | 2009-12-16 | [Action, Adventure, Fantasy, Science Fiction] | en | 237000000 | 2923706026 | 2686706026 | 12.336312 | 7.6 | 33392 | 40.4645 | 67 | 992 | James Cameron | [Dune Entertainment, Lightstorm Entertainment,... | [US, GB] | 2009 |
| 2 | 12445 | Harry Potter and the Deathly Hallows: Part 2 | 2011-07-12 | [Adventure, Fantasy] | en | 125000000 | 1341511219 | 1216511219 | 10.73209 | 8.081 | 21642 | 15.6738 | 105 | 160 | David Yates | [Warner Bros. Pictures, Heyday Films] | [GB, US] | 2011 |
| 3 | 24428 | The Avengers | 2012-04-25 | [Science Fiction, Action, Adventure] | en | 220000000 | 1518815515 | 1298815515 | 6.903707 | 7.932 | 35655 | 49.997 | 112 | 642 | Joss Whedon | [Marvel Studios] | [US] | 2012 |
| 4 | 109445 | Frozen | 2013-11-20 | [Animation, Family, Adventure, Fantasy] | en | 150000000 | 1274219009 | 1124219009 | 8.494793 | 7.248 | 17327 | 19.1135 | 60 | 285 | Chris Buck | [Walt Disney Animation Studios] | [US] | 2013 |
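
The analysis cells in this notebook reference columns such as `budget_musd`, `revenue_musd`, `belongs_to_collection`, `cast`, and `runtime`, while the recorded output above shows `budget` and `revenue` and none of the others. A small guard run right after loading can surface that kind of mismatch early. The sketch below is one possible way to do it; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the list is simply collected from the column names used in the later cells.

```python
import pandas as pd

# Column names referenced by the analysis cells in this notebook (assumed list).
REQUIRED_COLUMNS = [
    "id", "title", "release_year", "vote_average", "vote_count", "popularity",
    "budget_musd", "revenue_musd", "roi", "belongs_to_collection",
    "genres", "cast", "director", "runtime",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`, in order."""
    return [name for name in required if name not in frame.columns]

# Tiny stand-in frame using column names from the recorded output above.
sample = pd.DataFrame(
    {"title": ["Titanic"], "budget": [200000000], "revenue": [2264162353]}
)
print(missing_columns(sample))
```

Calling `missing_columns(df)` after `pd.read_parquet` and stopping when the result is non-empty avoids a `KeyError` several cells later.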


## 2. Visualizations
Generating plots and saving them to `output/plots`.

In [None]:
# Generate all visualizations
PLOTS_DIR.mkdir(parents=True, exist_ok=True)
create_all_visualizations(df, PLOTS_DIR)

print("\n✓ All visualizations have been generated!")

Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\revenue_vs_budget.png
Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\roi_distribution.png
Saved plot: C:\Users\Amalitech\OneDrive - AmaliTech gGmbH\Desktop\Moodle Labs\Specilization\DEM05\TMDB-project\Spark-impl\output\plots\yearly_trends.png


## 3. Analysis Summary
Top movies by revenue, ROI, and rating.

In [None]:
# Top performing movies

print("=" * 80)
print("TOP 5 MOVIES BY REVENUE")
print("=" * 80)
top_revenue = df.dropna(subset=['revenue_musd']).nlargest(5, 'revenue_musd')[
    ['title', 'revenue_musd', 'budget_musd', 'vote_average', 'release_year']
]
display(top_revenue)

print("\n" + "=" * 80)
print("TOP 5 MOVIES BY ROI (Budget >= $10M)")
print("=" * 80)
top_roi = df[(df['budget_musd'] >= 10) & (df['roi'].notna())].nlargest(5, 'roi')[
    ['title', 'roi', 'budget_musd', 'revenue_musd', 'release_year']
]
display(top_roi)

print("\n" + "=" * 80)
print("TOP 5 HIGHEST RATED MOVIES (min. 10 votes)")
print("=" * 80)
top_rated = df[df['vote_count'] >= 10].nlargest(5, 'vote_average')[
    ['title', 'vote_average', 'vote_count', 'popularity', 'release_year']
]
display(top_rated)

Top 5 Movies by Revenue:


| | title | revenue | budget | release_year |
|---|---|---|---|---|
| 1 | Avatar | 2923706026 | 237000000 | 2009 |
| 15 | Avengers: Endgame | 2799439100 | 356000000 | 2019 |
| 0 | Titanic | 2264162353 | 200000000 | 1997 |
| 7 | Star Wars: The Force Awakens | 2068223624 | 245000000 | 2015 |
| 11 | Avengers: Infinity War | 2052415039 | 300000000 | 2018 |



Top 5 Movies by ROI (Budget > 1M):


| | title | roi | budget | revenue |
|---|---|---|---|---|
| 1 | Avatar | 12.336312 | 237000000 | 2923706026 |
| 0 | Titanic | 11.320812 | 200000000 | 2264162353 |
| 6 | Jurassic World | 11.143583 | 150000000 | 1671537444 |
| 2 | Harry Potter and the Deathly Hallows: Part 2 | 10.73209 | 125000000 | 1341511219 |
| 16 | Frozen II | 9.691223 | 150000000 | 1453683476 |
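
The `roi` values above are consistent with revenue divided by budget (for Avatar, 2923706026 / 237000000 ≈ 12.336312). The ETL code that produces the column is not shown in this notebook, so that definition is an inference from the output rather than a confirmed formula. Under that assumption, a minimal sketch of the calculation could look like this; `add_roi` is a hypothetical helper, and treating non-positive budgets as missing is a choice made here, not something the source specifies.

```python
import pandas as pd

def add_roi(frame, budget_col="budget", revenue_col="revenue"):
    """Return a copy of `frame` with roi = revenue / budget.

    Rows with a non-positive or missing budget get NaN instead of inf.
    """
    out = frame.copy()
    budget = out[budget_col].where(out[budget_col] > 0)  # non-positive -> NaN
    out["roi"] = out[revenue_col] / budget
    return out

movies = pd.DataFrame({
    # The third row is made up to show the zero-budget case.
    "title": ["Avatar", "Titanic", "No Budget Reported"],
    "budget": [237000000, 200000000, 0],
    "revenue": [2923706026, 2264162353, 1000000],
})
print(add_roi(movies)[["title", "roi"]].round(6))
```

Masking zero budgets before dividing keeps unreported budgets from showing up as infinite ROI at the top of a `nlargest` ranking.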


## 5. Advanced Filtering & Search Queries

In [None]:
## Query 1: Sci-Fi Action Movies with Bruce Willis
print("=" * 80)
print("QUERY 1: SCI-FI ACTION MOVIES WITH BRUCE WILLIS (sorted by rating)")
print("=" * 80)

bruce_scifi = df[
    (df['genres'].str.contains('Science Fiction', na=False)) &
    (df['genres'].str.contains('Action', na=False)) &
    (df['cast'].str.contains('Bruce Willis', na=False))
].sort_values('vote_average', ascending=False)[
    ['title', 'vote_average', 'release_year', 'genres']
]

if len(bruce_scifi) > 0:
    display(bruce_scifi)
else:
    print("No movies found matching criteria.")

## Query 2: Uma Thurman + Quentin Tarantino
print("\n" + "=" * 80)
print("QUERY 2: MOVIES WITH UMA THURMAN DIRECTED BY QUENTIN TARANTINO (by runtime)")
print("=" * 80)

uma_tarantino = df[
    (df['cast'].str.contains('Uma Thurman', na=False)) &
    (df['director'] == 'Quentin Tarantino')
].sort_values('runtime', ascending=True)[
    ['title', 'release_year', 'runtime', 'vote_average']
]

if len(uma_tarantino) > 0:
    display(uma_tarantino)
else:
    print("No movies found matching criteria.")
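
The queries above rely on `str.contains`, which matches substrings: a filter on `'Action'` would also match any value that merely contains that text. Where exact membership matters, the delimited string can be split and compared item by item. The sketch below assumes `genres` and `cast` are `|`-delimited strings (the same assumption the genre expansion later in this notebook makes); `has_all_tokens` is a hypothetical helper, and the sample values are made up for illustration.

```python
import pandas as pd

def has_all_tokens(series, tokens, sep="|"):
    """True where every token appears as a whole item in the sep-delimited string."""
    wanted = set(tokens)

    def check(value):
        if pd.isna(value):
            return False
        items = {item.strip() for item in str(value).split(sep)}
        return wanted.issubset(items)

    return series.apply(check).astype(bool)

genres = pd.Series(
    ["Science Fiction|Action", "Action|Thriller", "Live Action Short", None]
)
print(has_all_tokens(genres, ["Science Fiction", "Action"]).tolist())
# prints [True, False, False, False]
```

The resulting boolean Series can be combined with other masks using `&`, in the same way as the `str.contains` conditions in the queries.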

## 6. Franchise vs Standalone Analysis

In [None]:
# Classify movies as franchise or standalone
df['movie_type'] = df['belongs_to_collection'].apply(
    lambda x: 'Franchise' if pd.notna(x) else 'Standalone'
)

# Get clean data for analysis
df_clean = df.dropna(subset=['revenue_musd', 'budget_musd', 'vote_average'])

print("=" * 80)
print("FRANCHISE VS STANDALONE COMPARISON")
print("=" * 80)

franchise = df_clean[df_clean['movie_type'] == 'Franchise']
standalone = df_clean[df_clean['movie_type'] == 'Standalone']

comparison_data = {
    'Movie Type': ['Franchise', 'Standalone'],
    'Count': [len(franchise), len(standalone)],
    'Avg Revenue (M)': [franchise['revenue_musd'].mean(), standalone['revenue_musd'].mean()],
    'Avg Budget (M)': [franchise['budget_musd'].mean(), standalone['budget_musd'].mean()],
    'Avg Rating': [franchise['vote_average'].mean(), standalone['vote_average'].mean()],
    'Avg Popularity': [franchise['popularity'].mean(), standalone['popularity'].mean()]
}

comparison_df = pd.DataFrame(comparison_data)
display(comparison_df)

print("\n" + "=" * 80)
print("TOP 10 FRANCHISES BY TOTAL REVENUE")
print("=" * 80)

# Named aggregation yields flat, descriptive column names directly
franchise_stats = df.dropna(subset=['belongs_to_collection']).groupby('belongs_to_collection').agg(
    movie_count=('id', 'count'),
    total_revenue=('revenue_musd', 'sum'),
    mean_revenue=('revenue_musd', 'mean'),
    mean_budget=('budget_musd', 'mean'),
    mean_rating=('vote_average', 'mean'),
)
franchise_stats = franchise_stats.sort_values('total_revenue', ascending=False)
top_franchises = franchise_stats[franchise_stats['movie_count'] >= 2].head(10)

display(top_franchises)

## 7. Top Directors & Genre Analysis

In [None]:
print("=" * 80)
print("TOP 10 DIRECTORS BY TOTAL REVENUE")
print("=" * 80)

director_stats = df[df['director'] != 'Unknown'].dropna(subset=['revenue_musd']).groupby('director').agg({
    'id': 'count',
    'revenue_musd': 'sum',
    'vote_average': 'mean'
}).rename(columns={'id': 'movie_count', 'revenue_musd': 'total_revenue', 'vote_average': 'mean_rating'})

director_stats = director_stats.sort_values('total_revenue', ascending=False).head(10)

display(director_stats)

print("\n" + "=" * 80)
print("GENRE PERFORMANCE ANALYSIS")
print("=" * 80)

# Expand genres into one record per (movie, genre); assumes '|'-delimited strings
genre_data = []
for idx, row in df.iterrows():
    if pd.notna(row['genres']):
        genres = [g.strip() for g in str(row['genres']).split('|')]
        for genre in genres:
            genre_data.append({
                'genre': genre,
                'revenue': row['revenue_musd'],
                'rating': row['vote_average'],
                'popularity': row['popularity']
            })

genre_df = pd.DataFrame(genre_data).dropna()

genre_stats = genre_df.groupby('genre').agg({
    'rating': ['count', 'mean'],
    'revenue': 'mean',
    'popularity': 'mean'
}).round(2)

genre_stats.columns = ['count', 'avg_rating', 'avg_revenue', 'avg_popularity']
genre_stats = genre_stats.sort_values('avg_revenue', ascending=False)

display(genre_stats)
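
The row-by-row genre expansion above can also be written with `str.split` and `DataFrame.explode`, which avoids `iterrows`. The sketch below keeps the same assumption as the loop, namely that `genres` is a `|`-delimited string; if the Parquet file stores genres as lists (as the recorded sample output suggests), the split step would be skipped. The two rows are illustrative: titles and ratings come from the sample output, genres are re-joined with `|`, and revenues are rounded to millions of USD.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Titanic", "Frozen"],
    "genres": ["Drama|Romance", "Animation|Family|Adventure|Fantasy"],
    "revenue_musd": [2264.16, 1274.22],
    "vote_average": [7.902, 7.248],
})

# One row per (movie, genre) pair.
exploded = (
    movies.assign(genre=movies["genres"].str.split("|"))
    .explode("genre", ignore_index=True)
    .assign(genre=lambda d: d["genre"].str.strip())
    .dropna(subset=["genre", "revenue_musd", "vote_average"])
)

genre_stats = (
    exploded.groupby("genre")
    .agg(
        count=("title", "size"),
        avg_rating=("vote_average", "mean"),
        avg_revenue=("revenue_musd", "mean"),
    )
    .sort_values("avg_revenue", ascending=False)
)
print(genre_stats)
```

Because a movie with several genres contributes to each of them, the per-genre means are not independent and the counts sum to more than the number of movies.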

## 8. Key Insights & Conclusions

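### Data Insights
- The recorded run loaded 18 movies with 18 columns covering financial metrics (budget, revenue, profit, ROI) and audience metrics (vote average, vote count, popularity), so any conclusion drawn from it is indicative only
- The franchise vs. standalone comparison reports count, mean revenue, mean budget, mean rating, and mean popularity for each group
- The genre table and `yearly_trends.png` show how revenue varies by genre and release year

### Performance Metrics
- In the recorded output, the five highest-grossing movies all had budgets of at least $200M, but a higher budget does not imply a better rating: Avatar (budget $237M) is rated 7.6, below Harry Potter and the Deathly Hallows: Part 2 (budget $125M) at 8.081
- ROI ranks films differently from raw revenue: Jurassic World, Harry Potter and the Deathly Hallows: Part 2, and Frozen II are in the ROI top five but not in the revenue top five
- The director table ranks directors by total revenue; on its own it does not show that a director's reputation drives revenue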
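
The relationship between budget, revenue, and rating described in the bullets above can be quantified with a correlation matrix. The sketch below uses only the five rows shown in the sample output of Section 1 (raw USD, as displayed there), which is far too few to support a conclusion; it illustrates the call rather than the result. On the full dataset the same call would presumably be applied to `budget_musd`, `revenue_musd`, and `vote_average`.

```python
import pandas as pd

# The five rows from the sample output in Section 1.
sample = pd.DataFrame({
    "title": [
        "Titanic",
        "Avatar",
        "Harry Potter and the Deathly Hallows: Part 2",
        "The Avengers",
        "Frozen",
    ],
    "budget": [200000000, 237000000, 125000000, 220000000, 150000000],
    "revenue": [2264162353, 2923706026, 1341511219, 1518815515, 1274219009],
    "vote_average": [7.902, 7.6, 8.081, 7.932, 7.248],
})

# Pairwise Pearson correlation between the numeric columns.
corr = sample[["budget", "revenue", "vote_average"]].corr(method="pearson")
print(corr.round(2))
```

Pearson correlation captures linear association only; `method="spearman"` is an alternative when the relationship is monotonic but skewed by a few very large films.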

### Visualization Outputs
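Visualizations are saved to `output/plots/`. The recorded run in Section 2 logged only `revenue_vs_budget.png`, `roi_distribution.png`, and `yearly_trends.png`, so verify that the files below exist before relying on them: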
- **revenue_vs_budget.png** - Budget-Revenue correlation by year
- **roi_by_genre.png** - Risk-return profile across genres
- **popularity_vs_rating.png** - Audience vs. Critical reception
- **yearly_trends.png** - Historical box office trends
- **franchise_vs_standalone.png** - Comparative performance analysis
- **rating_distribution.png** - Quality distribution
- **top_directors.png** - Director success metrics
- **genre_performance.png** - Genre-based analysis

### Recommendations
1. Analyze franchise profitability for investment decisions
2. Study top directors' strategies for success
3. Monitor genre trends for market shifts
4. Use ROI metrics for budget allocation
5. Balance critical ratings with commercial success