# 1. Problem Understanding

The film industry is a high-stakes domain in which billions of dollars are invested each year in production, distribution, and marketing. Predicting a movie's box office revenue before release remains difficult, however, because revenue depends on many interrelated factors such as genre, cast, production budget, and release timing.

## Objective

The main objective of this project is to develop and compare several machine learning regression models that predict a movie's box office revenue from metadata available before its release. The project also analyzes how much individual features (e.g., budget, cast, genre) contribute to the predictions, and aims for a model that is interpretable and temporally robust, meaning it should still perform well on films released after those it was trained on.

## Motivation

By solving this problem, the project aims to provide valuable insights for stakeholders such as producers, investors, and marketers, enabling them to:
- Make informed decisions about production and marketing strategies.
- Assess the financial viability of upcoming film projects.
- Allocate resources more efficiently.

## Scope

This project is focused on:
- Using the TMDB 5000 Movie Dataset sourced from Kaggle.
- Treating the task as a regression problem.
- Implementing robust preprocessing and feature engineering techniques.
- Applying and comparing models like Linear Regression, Ridge, Lasso, Random Forest, Gradient Boosting, and XGBoost.
- Ensuring interpretability through SHAP values.
- Avoiding data leakage and considering real-world deployment using a Streamlit app.
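
One common way to guard against temporal leakage is to split the data by release date instead of at random, so that the model is always evaluated on films newer than those it was trained on. The following is a minimal sketch under that assumption; the helper name `chronological_split` and the 80/20 ratio are illustrative choices, not part of the project code.

```python
import pandas as pd

def chronological_split(frame, date_col="release_date", test_fraction=0.2):
    """Hypothetical helper: the newest rows become the test set."""
    # Rows without a date cannot be placed on the timeline, so drop them.
    ordered = frame.dropna(subset=[date_col]).sort_values(date_col, kind="stable")
    n_test = int(round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy frame built from the five rows shown by movies.head() later on.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2009-12-10", "2007-05-19", "2015-10-26", "2012-07-16", "2012-03-07"]
    ),
    "revenue": [2787965087, 961000000, 880674609, 1084939099, 284139100],
})

train, test = chronological_split(toy, test_fraction=0.2)
print("Latest training release:", train["release_date"].max())
print("Earliest test release:  ", test["release_date"].min())
```

With this kind of split, any statistic used for preprocessing (medians, encodings, scalers) would be fitted on `train` only and then applied to `test`.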

## Key Questions

- Which metadata features contribute most significantly to movie revenue?
- Which regression model offers the best predictive performance?
- How can interpretability and model robustness be incorporated into the solution?
- Can the model be reliably used in a production environment for real-time revenue prediction?



# 2. Data Collection and Loading

## Data Source

The dataset used in this project is the **TMDB 5000 Movie Dataset**, available on [Kaggle](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata). Despite its name, it contains metadata for roughly 4,800 movies (4,803 rows in the files used here), including various attributes relevant for predicting box office revenue.

### Files Used:
- `tmdb_5000_movies.csv`: Contains information about each movie, including budget, genres, original language, popularity, production companies, release date, revenue, runtime, spoken languages, and more.
- `tmdb_5000_credits.csv`: Contains information about the cast and crew for each movie, including the director and main actors.
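
Several columns in both files (for example `genres`, `cast`, and `crew`) store JSON-encoded lists of dictionaries as plain strings, so they have to be parsed before they can be used as features. The sketch below shows one possible way to do this with the standard `json` module. The helper names are hypothetical, and the `"job"` key used to find the director is an assumption about the crew records, since the previews in this notebook truncate them.

```python
import json

def names_from_json(cell):
    """Return the 'name' of every entry in a JSON-encoded list (hypothetical helper)."""
    try:
        return [item["name"] for item in json.loads(cell)]
    except (TypeError, ValueError):
        # Missing or malformed cells yield an empty list instead of raising.
        return []

def director_from_crew(cell):
    """Return the first crew member whose 'job' is 'Director', if any.

    Assumes each crew entry carries 'job' and 'name' keys.
    """
    try:
        crew = json.loads(cell)
    except (TypeError, ValueError):
        return None
    for member in crew:
        if member.get("job") == "Director":
            return member.get("name")
    return None

# Illustrative cells; the crew names are placeholders, not real data.
genres_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew_cell = '[{"job": "Editor", "name": "A. Example"}, {"job": "Director", "name": "B. Example"}]'

print(names_from_json(genres_cell))
print(director_from_crew(crew_cell))
```

On the real data these helpers could be applied column-wise, e.g. `movies["genres"].apply(names_from_json)`.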



## Loading the Data

The datasets are loaded using `pandas` for inspection and preprocessing:

In [23]:
import pandas as pd

# load datasets
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [24]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [25]:
print("Movies dataset loaded with columns:", movies.columns)
print("Credits dataset loaded with columns:", credits.columns)

Movies dataset loaded with columns: Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')
Credits dataset loaded with columns: Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


## Merging Movie and Credits Data

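To enrich our dataset with cast and crew information, we merge the two CSV files, `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`. Both files identify a movie by the same TMDB identifier, but under different column names:
- `id` in `tmdb_5000_movies.csv`
- `movie_id` in `tmdb_5000_credits.csv`

First, we rename `movie_id` to `id` to enable a clean merge. Because both files also contain a `title` column, pandas suffixes the two copies as `title_x` and `title_y` in the merged frame. The merged dataset allows us to engineer features from cast and crew data, which may have predictive value for box office revenue.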
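
As an alternative to renaming, the merge can also be written with `left_on`/`right_on`, and pandas can verify that the key is unique on both sides. The sketch below uses small toy frames (not the real files) to illustrate the `validate` and `indicator` options of `DataFrame.merge`; dropping the duplicate `title` column before merging is an optional choice that avoids the `title_x`/`title_y` suffixes.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (ids and titles taken from the previews above).
movies_toy = pd.DataFrame({
    "id": [19995, 285, 206647],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"],
})
credits_toy = pd.DataFrame({
    "movie_id": [19995, 285, 206647],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"],
    "cast": ["[]", "[]", "[]"],
})

merged = movies_toy.merge(
    credits_toy.drop(columns="title"),  # avoid duplicate title columns
    left_on="id",
    right_on="movie_id",
    how="left",
    validate="one_to_one",              # raises MergeError if a key repeats on either side
    indicator=True,                     # adds a '_merge' column describing each row's origin
).drop(columns="movie_id")

unmatched = int((merged["_merge"] != "both").sum())
print("Rows without credits:", unmatched)
print(merged.columns.tolist())
```

If `validate` raises, the key is duplicated in one of the files and the row count after merging would otherwise silently grow.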


In [26]:
movies['id'].head()

0     19995
1       285
2    206647
3     49026
4     49529
Name: id, dtype: int64

In [27]:
credits['movie_id'].head()

0     19995
1       285
2    206647
3     49026
4     49529
Name: movie_id, dtype: int64

In [28]:
# rename the 'movie_id' column in credits to 'id' for merging
credits.rename(columns= {'movie_id': 'id'}, inplace= True)


In [29]:
credits['id'].head()

0     19995
1       285
2    206647
3     49026
4     49529
Name: id, dtype: int64

In [30]:
# merge movies and credits on the shared 'id' column
df = movies.merge(credits, on='id')

print("Merged DataFrame shape:", df.shape)
df.head()

Merged DataFrame shape: (4803, 23)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [31]:
print("Merged DataFrame columns:", df.columns)
df.dtypes

Merged DataFrame columns: Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'title_y', 'cast', 'crew'],
      dtype='object')


budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title_x                  object
vote_average            float64
vote_count                int64
title_y                  object
cast                     object
crew                     object
dtype: object

In [32]:

# number of columns in the merged DataFrame
len(df.columns)

23

# 3. Exploratory Data Analysis (EDA)

The purpose of EDA is to explore the structure, quality, and initial patterns in the dataset. This step helps in understanding the distributions, identifying outliers, detecting data quality issues, and forming hypotheses for model building.


## Dataset Overview

In [33]:
# 1. Drop columns that are not needed
columns_to_drop = ['title_y', 'title_x', 'homepage', 'overview', 'tagline', 'keywords',
                   'spoken_languages', 'id']
df.drop(columns=columns_to_drop, inplace=True)

print("Columns after dropping unnecessary ones:", df.columns)

Columns after dropping unnecessary ones: Index(['budget', 'genres', 'original_language', 'original_title', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'status', 'vote_average', 'vote_count', 'cast',
       'crew'],
      dtype='object')


After dropping irrelevant or deferred columns, the dataset now contains **15 columns**, including:

- **3 numeric predictors**:
  - `budget`, `popularity`, `runtime`

- **1 numeric target and 2 rating statistics**:
  - `revenue` (the prediction target), `vote_average`, `vote_count`

- **9 object/categorical or nested fields**:
  - `genres`, `original_language`, `original_title`, `production_companies`, `production_countries`, `release_date`, `status`, `cast`, `crew`


In [35]:
# 2. Print basic structure
print("Shape: ", df.shape)

df.info()

Shape:  (4803, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   original_language     4803 non-null   object 
 3   original_title        4803 non-null   object 
 4   popularity            4803 non-null   float64
 5   production_companies  4803 non-null   object 
 6   production_countries  4803 non-null   object 
 7   release_date          4802 non-null   object 
 8   revenue               4803 non-null   int64  
 9   runtime               4801 non-null   float64
 10  status                4803 non-null   object 
 11  vote_average          4803 non-null   float64
 12  vote_count            4803 non-null   int64  
 13  cast                  4803 non-null   object 
 14  crew                  4803 non-null   object 
dtypes: float64(3), int64(3), object(9)

- The dataset contains **4,803 rows** and **15 columns**.
- Only **two columns** have explicit missing values (`NaN`):
  - `release_date`: 1 missing
  - `runtime`: 2 missing

In [40]:
# 3. Summary statistics for numerical columns
print("\nSummary Statistics:")
print(df.describe().T)


Summary Statistics:
               count          mean           std  min           25%  \
budget        4803.0  2.904504e+07  4.072239e+07  0.0  790000.00000   
popularity    4803.0  2.149230e+01  3.181665e+01  0.0       4.66807   
revenue       4803.0  8.226064e+07  1.628571e+08  0.0       0.00000   
runtime       4801.0  1.068759e+02  2.261193e+01  0.0      94.00000   
vote_average  4803.0  6.092172e+00  1.194612e+00  0.0       5.60000   
vote_count    4803.0  6.902180e+02  1.234586e+03  0.0      54.00000   

                       50%           75%           max  
budget        1.500000e+07  4.000000e+07  3.800000e+08  
popularity    1.292159e+01  2.831350e+01  8.755813e+02  
revenue       1.917000e+07  9.291719e+07  2.787965e+09  
runtime       1.030000e+02  1.180000e+02  3.380000e+02  
vote_average  6.200000e+00  6.800000e+00  1.000000e+01  
vote_count    2.350000e+02  7.370000e+02  1.375200e+04  


## 🔹 Summary Statistics Insights

### `budget`
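- **Mean**: about $29 million (median: $15 million)
- **Max**: $380 million
- **Min**: **$0**, which is implausible for a produced film and most likely marks an unreported budget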

---

### `revenue`
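- **Mean**: about $82 million (median: about $19 million)
- **Max**: about $2.79 billion
- **Min**: **$0**; the 25th percentile is also $0, so at least a quarter of the rows have no recorded revenue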


---

### `popularity`
- **Range**: 0 to about 876 (median around 13), so a few titles sit far above the typical value

---

### `runtime`
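- **Mean**: ~107 minutes (median: 103 minutes)
- **Max**: 338 minutes
- **Min**: **0** minutes, which is not a valid runtime and probably indicates a missing value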


---

### `vote_average` and `vote_count`
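- `vote_count` ranges from **0** to **13,752**, with a median of 235, so some movies have no votes at all while a few have extremely high counts.
- `vote_average` spans the full 0 to 10 scale; a value of 0 probably reflects a movie without votes rather than a genuinely low rating.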
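
The summary table shows a mean revenue far above the median, which suggests a strongly right-skewed distribution. A common option for such targets, sketched here as an assumption and not as the approach this project necessarily takes, is to model `log1p(revenue)` and convert predictions back with `expm1`. The five values below are the revenues from the `movies.head()` preview.

```python
import numpy as np
import pandas as pd

# Revenues of the five movies shown in the preview above.
revenue = pd.Series(
    [2787965087, 961000000, 880674609, 1084939099, 284139100], name="revenue"
)

# log1p computes log(1 + x), so it stays defined for rows with zero revenue.
log_revenue = np.log1p(revenue)

# expm1 is the exact inverse and maps predictions back to dollars.
recovered = np.expm1(log_revenue)

print(log_revenue.round(2).tolist())
print(np.allclose(recovered, revenue))
```

Errors measured on the log scale are relative errors, which tends to stop a handful of blockbusters from dominating the loss.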

In [45]:
# Count rows with zero budget, revenue, or runtime
print("Zero budget:", (df['budget'] == 0).sum())
print("Zero revenue:", (df['revenue'] == 0).sum())
print("Zero runtime:", (df['runtime'] == 0).sum())

Zero budget: 1037
Zero revenue: 1427
Zero runtime: 35
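
Since a budget, revenue, or runtime of exactly zero is almost certainly a placeholder, one possible next step is to convert these zeros to `NaN` so that they are counted and handled like other missing values. The sketch below demonstrates the idea on a toy frame; whether the affected rows are later dropped or imputed is a separate decision.

```python
import numpy as np
import pandas as pd

# Toy frame with a mix of valid values, zero placeholders, and one real NaN.
toy = pd.DataFrame({
    "budget": [237000000, 0, 245000000],
    "revenue": [2787965087, 961000000, 0],
    "runtime": [162.0, 0.0, np.nan],
})

placeholder_cols = ["budget", "revenue", "runtime"]

cleaned = toy.copy()
for col in placeholder_cols:
    # mask() replaces values where the condition is True with NaN.
    cleaned[col] = cleaned[col].mask(cleaned[col] == 0)

print(cleaned.isna().sum())
```

Note that integer columns become `float64` once they contain `NaN`.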


In [43]:
# 4. Function to display unique values for each object (categorical) column
def unique_values(df, max_unique= 20):
    object_col = df.select_dtypes(include= "object").columns
    for col in object_col:
        unique_vals = df[col].nunique()
        print(f"\n'{col}' - {unique_vals} unique values")
        if unique_vals <= max_unique:
            print(df[col].unique())
        else:
            print(f"Too many to display (> {max_unique})")

unique_values(df)


'genres' - 1175 unique values
Too many to display (> 20)

'original_language' - 37 unique values
Too many to display (> 20)

'original_title' - 4801 unique values
Too many to display (> 20)

'production_companies' - 3697 unique values
Too many to display (> 20)

'production_countries' - 469 unique values
Too many to display (> 20)

'release_date' - 3280 unique values
Too many to display (> 20)

'status' - 3 unique values
['Released' 'Post Production' 'Rumored']

'cast' - 4761 unique values
Too many to display (> 20)

'crew' - 4776 unique values
Too many to display (> 20)


In [44]:
# 5. Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
missing_percent = df.isnull().mean().sort_values(ascending= False) * 100
print("\nMissing Value Percentage:")
print(missing_percent[missing_percent > 0 ])


Missing values in each column:
budget                  0
genres                  0
original_language       0
original_title          0
popularity              0
production_companies    0
production_countries    0
release_date            1
revenue                 0
runtime                 2
status                  0
vote_average            0
vote_count              0
cast                    0
crew                    0
dtype: int64

Missing Value Percentage:
runtime         0.041641
release_date    0.020820
dtype: float64
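
With so few explicit gaps, simple handling is usually sufficient. The sketch below, on a toy frame, parses `release_date` into a datetime column, derives year and month features, and fills a missing `runtime` with the median. Median imputation is an illustrative assumption; in a leakage-safe pipeline the median would be computed on the training split only.

```python
import pandas as pd

# Toy frame: the middle row mimics a movie with missing date and runtime.
toy = pd.DataFrame({
    "release_date": ["2009-12-10", None, "2015-10-26"],
    "runtime": [162.0, None, 148.0],
})

# Unparseable or missing dates become NaT instead of raising an error.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Calendar features; rows with NaT get NaN here.
toy["release_year"] = toy["release_date"].dt.year
toy["release_month"] = toy["release_date"].dt.month

# Fill missing runtimes with the median of the observed ones.
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())

print(toy)
```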
