# EDA

## Questions

1. **What factors most strongly predict box office success — budget, cast popularity, or IMDb ratings?**
2. **Do movies released in summer perform better financially than those in winter?** 
3. **Is there a significant difference between critics’ and audience ratings across genres?**

In [70]:
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
import ast
%matplotlib inline

sns.set_theme(style="whitegrid", context="talk")

In [71]:
processed_data_path = Path("../data/processed.csv")
df = pd.read_csv(processed_data_path)

## Dataset Overview

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1790 entries, 0 to 1789
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1790 non-null   int64  
 1   movie_id               1790 non-null   int64  
 2   title                  1790 non-null   object 
 3   release_date           1790 non-null   object 
 4   budget                 1790 non-null   float64
 5   revenue_worldwide      1790 non-null   float64
 6   runtime                1790 non-null   float64
 7   genres                 1790 non-null   object 
 8   imdb_id                1790 non-null   object 
 9   franchise              1790 non-null   bool   
 10  cast_popularity_mean   1790 non-null   float64
 11  cast_popularity_max    1790 non-null   float64
 12  director_popularity    1790 non-null   float64
 13  original_language      1790 non-null   object 
 14  imdb_rating            1790 non-null   float64
 15  imdb

In [73]:
overview_stats = df.describe(include='all').transpose().round(2)
overview_stats.head(20)

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1790.0 | | | | 4878.524022 | 4906.619837 | 0.0 | 451.25 | 899.5 | 10307.75 | 10758.0 |
| movie_id | 1790.0 | | | | 223472.013408 | 326361.926039 | 12.0 | 1546.5 | 32839.5 | 393729.75 | 1410082.0 |
| title | 1790.0 | 1538.0 | Monster | 4.0 | | | | | | | |
| release_date | 1790.0 | 1320.0 | 2003-01-31 | 5.0 | | | | | | | |
| budget | 1790.0 | | | | 0.078648 | 0.118963 | 0.0 | 0.0 | 0.024495 | 0.106144 | 1.0 |
| revenue_worldwide | 1790.0 | | | | 0.063586 | 0.112916 | 0.0 | 0.0 | 0.011996 | 0.072967 | 1.0 |
| runtime | 1790.0 | | | | 0.186734 | 0.044868 | 0.0 | 0.160684 | 0.181197 | 0.206838 | 1.0 |
| genres | 1790.0 | 613.0 | ['Drama'] | 95.0 | | | | | | | |
| imdb_id | 1790.0 | 1543.0 | tt0364569 | 2.0 | | | | | | | |
| franchise | 1790.0 | 2.0 | False | 1193.0 | | | | | | | |


## Q1: What factors most strongly predict box office success?
Analyzing correlations between revenue and key predictors: budget, cast popularity, and IMDb rating.

In [74]:
df[['revenue_worldwide', 'budget',
    'cast_popularity_mean', 'imdb_rating']].corr()["revenue_worldwide"]

revenue_worldwide       1.000000
budget                  0.784820
cast_popularity_mean    0.370509
imdb_rating             0.243541
Name: revenue_worldwide, dtype: float64

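Budget is the strongest predictor of box office success: its correlation with worldwide revenue (0.78) is the highest of the three factors, well above cast popularity (0.37) and IMDb rating (0.24).

This suggests that production budget is far more strongly associated with revenue than ratings or cast popularity. These are pairwise Pearson correlations, so they indicate linear association rather than causation.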
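Pearson correlation captures only linear association, and the summary statistics above show that both `budget` and `revenue_worldwide` are right-skewed (their medians sit well below their means). As a robustness check, one could compare Pearson with Spearman rank correlation. The sketch below is illustrative only: the helper `revenue_correlations` and the synthetic frame `demo` are assumptions standing in for the notebook's `df`, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def revenue_correlations(frame, target="revenue_worldwide", predictors=None):
    """Return Pearson and Spearman correlations of each predictor with the target."""
    if predictors is None:
        predictors = ["budget", "cast_popularity_mean", "imdb_rating"]
    cols = [target] + list(predictors)
    pearson = frame[cols].corr(method="pearson")[target].drop(target)
    spearman = frame[cols].corr(method="spearman")[target].drop(target)
    return pd.DataFrame({"pearson": pearson, "spearman": spearman})


# Synthetic stand-in for the notebook's df (hypothetical data, illustration only).
rng = np.random.default_rng(0)
n = 500
budget = rng.lognormal(mean=0.0, sigma=1.0, size=n)
demo = pd.DataFrame({
    "budget": budget,
    "cast_popularity_mean": rng.normal(size=n),
    "imdb_rating": rng.normal(size=n),
    "revenue_worldwide": 2.0 * budget + rng.lognormal(size=n),
})

corr_table = revenue_correlations(demo)
print(corr_table.sort_values("spearman", ascending=False))
```

On the real data, calling `revenue_correlations(df)` would show whether the ranking of predictors holds once the influence of a few very large films is reduced.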

## Q2: Do movies released in summer perform better financially than those in winter?
Comparing average worldwide revenue between Summer and Winter releases.

In [75]:
q2_df = df[(df['season'].isin(['Summer', 'Winter'])) & (df['revenue_worldwide'] > 0)].copy()

In [76]:
seasonal_comparison = q2_df.groupby('season')['revenue_worldwide'].mean().sort_values(ascending=False)

print("Average Worldwide Revenue by Season:")
print(seasonal_comparison)

Average Worldwide Revenue by Season:
season
Summer    0.098661
Winter    0.080007
Name: revenue_worldwide, dtype: float64


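**On average, summer movies perform better financially than winter releases.**

Summer releases average 0.099 on the normalized revenue scale, compared with 0.080 for winter releases: a gap of about 0.019, or roughly 23% higher. Because `revenue_worldwide` is scaled to the [0, 1] range in this dataset, the gap cannot be read directly as a dollar amount. The result aligns with industry knowledge that summer blockbuster season (May-August) attracts larger audiences due to school vacations and favorable weather for moviegoing. Only group means are compared here, so this does not by itself establish statistical significance.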
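The comparison above reports only group means. Two follow-ups may help: the relative gap can be computed directly from the reported means, and a permutation test can indicate whether a difference in means is plausible under chance alone. The sketch below is a minimal illustration; `relative_gap` and `permutation_pvalue` are hypothetical helpers, and the `summer` and `winter` samples are synthetic stand-ins for the `revenue_worldwide` values in `q2_df`.

```python
import numpy as np


def relative_gap(mean_a, mean_b):
    """Relative difference of mean_a over mean_b, as a fraction."""
    return (mean_a - mean_b) / mean_b


def permutation_pvalue(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Means taken from the seasonal output above (normalized revenue).
gap = relative_gap(0.098661, 0.080007)
print(f"Summer vs Winter relative gap: {gap:.1%}")

# Synthetic samples (hypothetical), used only to show the call pattern.
rng = np.random.default_rng(1)
summer = rng.exponential(scale=0.10, size=300)
winter = rng.exponential(scale=0.08, size=300)
p_value = permutation_pvalue(summer, winter)
print("permutation p-value:", p_value)
```

A permutation test is used here because revenue is heavily skewed, which makes a test without normality assumptions a reasonable choice. With the notebook's data, the two arrays would come from `q2_df` filtered by `season`.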

## Q3: Is there a significant difference between critics' and audience ratings across genres?
Comparing audience and critics scores by genre to identify where opinions diverge most.

In [77]:
# Parse genres from string representation to list, then explode
q3_df = df.copy()
q3_df['genres'] = q3_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
q3_df = q3_df.explode('genres').rename(columns={'genres': 'Genre'})
q3_df = q3_df.dropna(subset=['Genre'])

In [78]:
genre_comparison = q3_df.groupby('Genre')[['audience_score', 'critics_score']].mean().dropna()

In [79]:
genre_comparison['difference'] = genre_comparison['audience_score'] - genre_comparison['critics_score']

print("Ratings Comparison by Genre (Audience vs Critics):")
print(genre_comparison.sort_values(by='difference', ascending=False))

Ratings Comparison by Genre (Audience vs Critics):
                 audience_score  critics_score  difference
Genre                                                     
War                    0.742915       0.636004    0.106911
Action                 0.666798       0.564380    0.102418
Romance                0.679787       0.585342    0.094445
Crime                  0.695120       0.602177    0.092943
Adventure              0.671761       0.586925    0.084837
Science Fiction        0.664384       0.580408    0.083976
Mystery                0.663363       0.579974    0.083389
Fantasy                0.674389       0.591399    0.082990
Thriller               0.655674       0.574658    0.081016
Comedy                 0.650575       0.572410    0.078166
History                0.725801       0.655414    0.070387
Western                0.632051       0.577512    0.054539
Drama                  0.716926       0.666565    0.050361
Family                 0.662497       0.612737    0.049759
Horro

**Key findings:**

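In almost all genres, audiences rate films slightly higher than critics. Among the genres shown above, the gap is widest for War (0.107) and Action (0.102) and narrowest for Drama (0.050) and Family (0.050).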
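The genre table compares means but does not show whether the gaps are distinguishable from zero. One possible check is a bootstrap confidence interval for the mean per-film difference `audience_score - critics_score` within each genre. The sketch below assumes a long-format frame shaped like `q3_df` (one row per film-genre pair); the helper `bootstrap_mean_ci` and the synthetic frame `demo` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx is one resample (with replacement) of the original indices.
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return low, high


# Synthetic stand-in for q3_df (hypothetical scores, illustration only).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Genre": np.repeat(["War", "Drama"], 200),
    "critics_score": rng.uniform(0.3, 0.9, size=400),
})
shift = np.where(demo["Genre"] == "War", 0.10, 0.05)
demo["audience_score"] = demo["critics_score"] + shift + rng.normal(0, 0.1, size=400)

demo["difference"] = demo["audience_score"] - demo["critics_score"]
rows = []
for genre, diffs in demo.groupby("Genre")["difference"]:
    low, high = bootstrap_mean_ci(diffs.to_numpy())
    rows.append({"Genre": genre, "mean_diff": diffs.mean(),
                 "ci_low": low, "ci_high": high})
ci_table = pd.DataFrame(rows).set_index("Genre")
print(ci_table.round(3))
```

An interval that excludes zero suggests the gap for that genre is unlikely to be sampling noise. Because a film with several genres appears in several rows after `explode`, the genre-level intervals are not independent of one another.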