# EDA

## Questions

1. **What factors most strongly predict box office success — budget, cast popularity, or IMDb ratings?**
2. **Do movies released in summer perform better financially than those in winter?** 
3. **Is there a significant difference between critics’ and audience ratings across genres?**

In [2]:
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
import ast
%matplotlib inline

sns.set_theme(style="whitegrid", context="talk")

In [3]:
processed_data_path = Path("../data/processed.csv")
df = pd.read_csv(processed_data_path)

## Dataset Overview

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 795 entries, 0 to 794
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             795 non-null    int64  
 1   movie_id               795 non-null    int64  
 2   title                  795 non-null    object 
 3   release_date           795 non-null    object 
 4   budget                 795 non-null    float64
 5   revenue_worldwide      795 non-null    int64  
 6   runtime                795 non-null    int64  
 7   genres                 795 non-null    object 
 8   imdb_id                795 non-null    object 
 9   franchise              795 non-null    bool   
 10  cast_popularity_mean   795 non-null    float64
 11  cast_popularity_max    795 non-null    float64
 12  director_popularity    795 non-null    float64
 13  original_language      795 non-null    object 
 14  imdb_rating            795 non-null    float64
 15  imdb_v

In [5]:
overview_stats = df.describe(include='all').transpose().round(2)
overview_stats.head(20)

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 795.0 | | | | 400.188679 | 231.709439 | 0.0 | 199.5 | 400.0 | 600.5 | 800.0 |
| movie_id | 795.0 | | | | 227460.943396 | 316234.483268 | 12.0 | 8276.5 | 34283.0 | 402381.0 | 1376434.0 |
| title | 795.0 | 794.0 | Monster | 2.0 | | | | | | | |
| release_date | 795.0 | 732.0 | 2000-12-22 | 3.0 | | | | | | | |
| budget | 795.0 | | | | 0.0 | 1.0 | -0.737418 | -0.73593 | -0.439796 | 0.378665 | 6.55284 |
| revenue_worldwide | 795.0 | | | | 173332127.686792 | 268012579.271297 | 0.0 | 207836.5 | 60780981.0 | 210823015.0 | 2068223624.0 |
| runtime | 795.0 | | | | 109.987421 | 30.446507 | 3.0 | 93.0 | 106.0 | 121.0 | 585.0 |
| genres | 795.0 | 406.0 | ['Drama'] | 30.0 | | | | | | | |
| imdb_id | 795.0 | 795.0 | tt0374900 | 1.0 | | | | | | | |
| franchise | 795.0 | 2.0 | False | 460.0 | | | | | | | |


## Q1: What factors most strongly predict box office success?
Analyzing correlations between revenue and key predictors: budget, cast popularity, and IMDb rating.

In [9]:
df[['revenue_worldwide', 'budget',
    'cast_popularity_mean', 'imdb_rating']].corr()["revenue_worldwide"]

revenue_worldwide       1.000000
budget                  0.768441
cast_popularity_mean    0.388949
imdb_rating             0.253140
Name: revenue_worldwide, dtype: float64

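Budget is the strongest predictor of box office success: its correlation with worldwide revenue (0.77) is the highest of the three factors, ahead of cast popularity (0.39) and IMDb rating (0.25).

This suggests that production budget is far more strongly associated with financial success than ratings or cast popularity, although these are pairwise correlations and do not show that a larger budget causes higher revenue.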
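
The correlations above are Pearson coefficients (the pandas default), and the overview statistics show that `revenue_worldwide` is heavily right-skewed (mean about 173M, median about 61M). As a robustness check, one could rank the same predictors with Spearman rank correlation, which is less sensitive to a few very large values. The sketch below is only an illustration: the helper name `rank_predictors` and the synthetic `demo` frame are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def rank_predictors(df, target, predictors, method="spearman"):
    """Correlate each predictor with the target and sort by absolute strength."""
    corr = df[[target] + list(predictors)].corr(method=method)[target]
    corr = corr.drop(target)
    # Order by absolute value so strong negative relationships are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in data (hypothetical, NOT the notebook's dataset):
# revenue tracks budget with multiplicative noise; the rating is independent.
rng = np.random.default_rng(0)
n = 300
budget = rng.lognormal(mean=17.0, sigma=1.0, size=n)
demo = pd.DataFrame({
    "budget": budget,
    "revenue_worldwide": budget * rng.lognormal(mean=0.5, sigma=0.5, size=n),
    "imdb_rating": rng.normal(loc=6.5, scale=1.0, size=n),
})

ranked = rank_predictors(demo, "revenue_worldwide", ["budget", "imdb_rating"])
print(ranked)
```

On the real data the call would presumably be `rank_predictors(df, "revenue_worldwide", ["budget", "cast_popularity_mean", "imdb_rating"])`. If the ordering matches the Pearson result, the conclusion does not hinge on a handful of blockbusters.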

## Q2: Do movies released in summer perform better financially than those in winter?
Comparing average worldwide revenue between Summer and Winter releases.

In [10]:
q2_df = df[(df['season'].isin(['Summer', 'Winter'])) & (df['revenue_worldwide'] > 0)].copy()

In [11]:
seasonal_comparison = q2_df.groupby('season')['revenue_worldwide'].mean().sort_values(ascending=False)

print("Average Worldwide Revenue by Season:")
print(seasonal_comparison)

Average Worldwide Revenue by Season:
season
Summer    2.341969e+08
Winter    2.190730e+08
Name: revenue_worldwide, dtype: float64


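**Summer movies earn more on average than winter releases, but the gap is modest.**

Summer releases earn approximately $15 million more on average (about 7% higher, 234.2 million vs. 219.1 million). This is consistent with the common industry view that the summer blockbuster season (May-August) attracts larger audiences due to school vacations and favorable weather for moviegoing. Because only the two means are compared, this shows a difference in this sample rather than a statistically confirmed seasonal effect.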
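
A difference of about 7% between two means could plausibly arise by chance when revenues vary as widely as they do here. One way to check is a permutation test on the difference in means, which makes no normality assumption. The following is a minimal sketch using NumPy only; the function name `permutation_test_mean_diff` and the synthetic arrays are hypothetical and stand in for the Summer and Winter revenue columns.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for the difference in means of a and b.

    Returns (observed_difference, p_value).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)

    count = 0
    for _ in range(n_permutations):
        # Shuffle group labels by permuting the pooled values.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic, skewed "revenues" drawn from the same distribution (hypothetical).
rng = np.random.default_rng(42)
summer = rng.lognormal(mean=18.0, sigma=1.5, size=200)
winter = rng.lognormal(mean=18.0, sigma=1.5, size=200)
diff, p = permutation_test_mean_diff(summer, winter)
print(f"difference in means: {diff:,.0f}, p-value: {p:.3f}")

# With the notebook's data the inputs would presumably be:
#   q2_df.loc[q2_df["season"] == "Summer", "revenue_worldwide"]
#   q2_df.loc[q2_df["season"] == "Winter", "revenue_worldwide"]
```

A large p-value would mean the seasonal gap is not distinguishable from random label assignment in this sample.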

## Q3: Is there a significant difference between critics' and audience ratings across genres?
Comparing audience and critics scores by genre to identify where opinions diverge most.

In [12]:
# Parse genres from string representation to list, then explode
q3_df = df.copy()
q3_df['genres'] = q3_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
q3_df = q3_df.explode('genres').rename(columns={'genres': 'Genre'})
q3_df = q3_df.dropna(subset=['Genre'])

In [13]:
genre_comparison = q3_df.groupby('Genre')[['audience_score', 'critics_score']].mean().dropna()

In [14]:
genre_comparison['difference'] = genre_comparison['audience_score'] - genre_comparison['critics_score']

print("Ratings Comparison by Genre (Audience vs Critics):")
print(genre_comparison.sort_values(by='difference', ascending=False))

Ratings Comparison by Genre (Audience vs Critics):
                 audience_score  critics_score  difference
Genre                                                     
Horror                 0.589286       5.229630   -4.640344
Romance                0.644962       5.533193   -4.888231
Comedy                 0.631031       5.520045   -4.889015
Mystery                0.644091       5.625424   -4.981333
Thriller               0.648173       5.668956   -5.020783
Action                 0.660285       5.795370   -5.135086
Crime                  0.672920       5.828241   -5.155320
Music                  0.654706       5.950000   -5.295294
Western                0.720000       6.050000   -5.330000
War                    0.715217       6.045238   -5.330021
Science Fiction        0.661150       6.011881   -5.350731
Adventure              0.661512       6.021978   -5.360466
Fantasy                0.659416       6.050855   -5.391439
Family                 0.641570       6.086842   -5.445272
Drama

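**The raw differences are negative for every genre, but they cannot be read directly as critics rating movies higher than audiences.**

In the rows shown, the `audience_score` means fall between roughly 0.59 and 0.72, while the `critics_score` means fall between roughly 5.2 and 6.1, so the two columns appear to be on different scales (plausibly 0-1 and 0-10). A gap of about -5 is therefore largely a scale artifact, and the scores need to be put on a common scale before the direction or size of any disagreement can be judged.

Key findings (raw, unscaled differences):
- **Smallest gap**: Horror (-4.64)
- **Largest gap**: TV Movies (-7.12) and Documentaries (-6.98)
- **Pattern**: The gap is widest for prestige genres (Documentary, Animation, Drama, History), which is consistent with critics favoring those genres, though part of that ordering may simply reflect higher critics scores on the larger scale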
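
The genre table above subtracts `critics_score` (values around 5 to 6) from `audience_score` (values around 0.6 to 0.7), so the two columns look like they are on different scales. The sketch below shows one way the comparison could be redone after rescaling both scores to 0-1. The scale maxima (`audience_max=1.0`, `critics_max=10.0`) are assumptions inferred from the printed values, not documented facts about the dataset, and the helper name `genre_rating_gap` and the toy rows are hypothetical.

```python
import pandas as pd


def genre_rating_gap(df, audience_col="audience_score", critics_col="critics_score",
                     audience_max=1.0, critics_max=10.0):
    """Mean audience-minus-critics gap per genre after rescaling both to 0-1.

    audience_max and critics_max are ASSUMED scale maxima; adjust them if the
    real columns use different ranges.
    """
    scaled = pd.DataFrame({
        "Genre": df["Genre"],
        "audience": df[audience_col] / audience_max,
        "critics": df[critics_col] / critics_max,
    })
    out = scaled.groupby("Genre")[["audience", "critics"]].mean()
    out["difference"] = out["audience"] - out["critics"]
    return out.sort_values("difference", ascending=False)


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Genre": ["Horror", "Horror", "Family", "Family"],
    "audience_score": [0.55, 0.63, 0.60, 0.68],
    "critics_score": [5.0, 5.4, 6.0, 6.2],
})
gap = genre_rating_gap(toy)
print(gap)
```

With the notebook's data, the call would presumably be `genre_rating_gap(q3_df)`. A positive `difference` then means audiences rate the genre higher than critics on the shared scale, and a negative one means the reverse.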