# Movie Box Office revenue predictor


## Authors: Christopher Budd, Mustafa Syed, and Jayant Varma 

### Objective: 
To predict the revenue generated by a movie given its other features such as plot keywords, cast, budget, release dates, languages, production companies, countries, TMDB vote counts and vote averages, reviews, etc.



**Dataset citation: The dataset used was https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies/data . This dataset is updated daily; we used the version that was available on November 5th, 2023.**

**You can find the exact dataset we used here: https://drive.google.com/file/d/1uPtHyqpAKkqZUpft8A0FPVXPR2iT32SN/view?usp=sharing** Download the dataset to your local machine and load it from there (see how we loaded it under 'Loading the dataset').

# Movies daily updated dataset description:

**Attributes for the dataset:**
The below attributes are copied **AS IS** from the original dataset website https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies/data 

1. id	--> TMDB id

2. title	--> Title of the movie

3. genres	--> Genres are separated by'-'

4. original_language	--> The language the movie was made in

5. overview	    --> short description of movie

6. popularity   --> TMDB metric, in depth description can be found here, https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies/discussion/400671 

7. production_companies	--> '-' separated production company

8. release_date     --> movie release date

9. budget	--> budget of the movie

10. revenue	    --> Revenue generated by the movie

11. runtime	    --> duration of the movie

12. status	--> status (Released, or planned, or other)

13. tagline	    --> tagline

14. vote_average	--> average of votes given by tmdb users

15. vote_count	       --> vote counts

16. credits	        --> '-' separated cast if movie

17. keywords	    --> '-' separated keywords that desciption of movie

18. poster_path	    --> poster image

19. backdrop_path	--> background images

20. recommendations --> '-' separated recommended movie id


**Missing values:** There exist missing values in multiple features of the above dataset as we'll soon see


**Duplicated values:** 
There exist duplicated values in multiple features of the above dataset as we'll soon see


# 1: Looking at the big picture, framing the problem, and business practicality

### Frame the problem
1. Supervised learning – training examples are labeled.
2. A regression task – predict a value (Revenue).
3. Batch learning 
    - Small data set
    - No continuous flow of data coming into the system
    - No need to adjust to changing data rapidly

### Big Picture/Business objective:
At the end of the day, every business wants to know how much revenue it can generate given all of its production inputs. Our project addresses this problem for the movie industry. See the objective below.

### Objective: 
To predict the revenue generated by a movie given its other features such as plot keywords, cast, budget, release dates, languages, production companies, countries, TMDB vote counts and vote averages, reviews, and recommendations.


# Initial set up

In [1]:
# Import libraries

import sklearn
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset

In [2]:
url="https://raw.githubusercontent.com/Jayant1Varma/Movie-Box-Office-predictor/Model-Training/input/modified_data.csv"
movies = pd.read_csv(url, sep=',') 
# Please note: the file is massive and cannot be referenced online. You MUST download the file to your local machine from https://drive.google.com/file/d/1uPtHyqpAKkqZUpft8A0FPVXPR2iT32SN/view?usp=drive_link (MAKE SURE YOU ONLY USE YOUR YORK UNIVERSITY GOOGLE ACCOUNT!!!)
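Since the note above asks readers to work from a local copy, a minimal sketch of choosing between a local file and the remote URL is given here. The local file name `modified_data.csv` and the helper `dataset_source` are assumptions for illustration; they are not part of the original notebook.

```python
from pathlib import Path

# Hypothetical local path: adjust to wherever the downloaded file was saved.
LOCAL_CSV = Path("modified_data.csv")
# Remote URL taken from the loading cell above.
REMOTE_CSV = ("https://raw.githubusercontent.com/Jayant1Varma/"
              "Movie-Box-Office-predictor/Model-Training/input/modified_data.csv")


def dataset_source(local_path: Path = LOCAL_CSV, remote_url: str = REMOTE_CSV) -> str:
    """Return the local path if the file exists, otherwise the remote URL."""
    return str(local_path) if local_path.exists() else remote_url


# With pandas imported as pd (as in the notebook), the result can be passed on:
# movies = pd.read_csv(dataset_source(), sep=',')
print(dataset_source())
```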

# 2 First impressions on the dataset, graphs of the EDA, and patterns found


Note: We use only the rows with release dates from January 1st, 2015 through November 5th, 2023.

In [3]:
# Convert 'release_date' to datetime format
movies['release_date'] = pd.to_datetime(movies['release_date'])

# Create a mask for filtering dates
mask = (movies['release_date'] >= '2015-01-01') & (movies['release_date'] <= '2023-11-05')

# Apply the mask to filter rows
movies = movies[mask]

# Sort the DataFrame based on 'release_date'
movies = movies.sort_values(by='release_date', ascending=False)
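The filtering step above can be checked on a tiny example. The sketch below uses made-up rows (not taken from the dataset) and `Series.between`, which is inclusive at both ends by default and so matches the `>=` / `<=` mask. Passing `errors="coerce"` is an assumption added here to show how unparseable dates could be turned into `NaT`; the notebook itself does not do this.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "release_date": ["2014-12-31", "2015-01-01", "2023-11-05", "not a date"],
})

# Unparseable strings become NaT instead of raising an error
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Inclusive window; NaT never satisfies the comparison, so it is dropped
in_window = toy["release_date"].between("2015-01-01", "2023-11-05")
filtered = toy[in_window].sort_values(by="release_date", ascending=False)

print(filtered["title"].tolist())
```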

In [4]:
movies.head()

| | id | title | genres | original_language | overview | popularity | production_companies | release_date | budget | revenue | runtime | status | tagline | vote_average | vote_count | credits | keywords | recommendations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14093 | 763144 | The Last Rifleman | Drama-Thriller | en | A WWII veteran escapes his care home in Northe... | 5.212 | WestEnd Films-Wee Buns-Ripple World Pictures-I... | 2023-11-05 | 0 | 0 | 0.0 | Released | Meet Private Artie Crawford. He’s 92¾ and he’s... | 0.0 | 0 | Pierce Brosnan-Jürgen Prochnow-Louis Gossett J... | based on true story | |
| 7113 | 652319 | Chivaraku Migiledi | | te | A black and white film that takes place on a S... | 10.024 | | 2023-11-04 | 0 | 0 | 63.0 | Released | | 0.0 | 0 | Jagadeesh Prathap Bandari-Sai Yogi-Laxman Meesala | | |
| 4420 | 956226 | Courtney Gets Possessed | Horror-Comedy | en | A bumbling wedding party must battle the force... | 14.01 | Peach Jam Pictures | 2023-11-03 | 0 | 0 | 0.0 | Released | In Sickness and in Hell | 0.0 | 0 | Lauren Buglioli-Madison Hatfield-Jonathan Pawl... | | |
| 4367 | 629925 | Silence of Smoke | Drama | zh | Tells a series of misunderstandings between Li... | 14.13 | Magilm Pictures | 2023-11-03 | 0 | 0 | 0.0 | Released | | 0.0 | 0 | Han Geng-Zhang Guoli-Xu Qing-Xue Haojing | | |
| 2700 | 940721 | Godzilla Minus One | Science Fiction-Horror-Action | ja | In postwar Japan a new terror rises. Will the ... | 20.46 | Robot Communications-TOHO Studios-TOHO | 2023-11-03 | 0 | 0 | 125.0 | Released | Postwar Japan. From zero to minus. | 0.0 | 0 | Ryunosuke Kamiki-Minami Hamabe-Yuki Yamada-Mun... | monster-giant monster-reboot-kaiju-post war ja... | |


#### Use head() to look at the first 5 rows

#### Use describe() method to see a summary of the numerical attributes.


Note: 
- Since the count of id equals the counts of budget and revenue, every row has a budget and a revenue entry, giving a one-to-one mapping between id and revenue. The rows exist; however, the values may need preprocessing.
- Parts of this data make little sense. For example, the maximum runtime is 43,200, which is 30 days if the unit is minutes. Clearly, a lot of data cleaning and preprocessing is required. Since 75% of movies run 90 time units or less, we arbitrarily take 150 units as the maximum length allowed and delete all instances that run longer than this (see preprocessing).
- The minimum revenue is 0, and so are its 25%, 50%, and 75% quartiles, so most rows have no usable revenue value.

In [5]:
movies.describe()

| | id | popularity | release_date | budget | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|
| count | 231531.0 | 231531.0 | 231531 | 231531.0 | 231531.0 | 225962.0 | 231531.0 | 231531.0 |
| mean | 661458.03095 | 2.560032 | 2018-12-05 22:32:46.316389376 | 351227.0 | 948964.4 | 50.819664 | 2.56055 | 26.624137 |
| min | 10148.0 | 0.6 | 2015-01-01 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 521855.0 | 0.6 | 2017-03-04 00:00:00 | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 |
| 50% | 665913.0 | 0.621 | 2019-01-01 00:00:00 | 0.0 | 0.0 | 37.0 | 0.0 | 0.0 |
| 75% | 806401.0 | 1.4 | 2020-09-28 00:00:00 | 0.0 | 0.0 | 90.0 | 6.0 | 2.0 |
| max | 968149.0 | 8763.998 | 2023-11-05 00:00:00 | 540000000.0 | 2799439000.0 | 43200.0 | 10.0 | 28462.0 |
| std | 176614.846096 | 36.128391 | | 6059478.0 | 23966680.0 | 107.824413 | 3.339721 | 373.046485 |


One issue we noticed is that the dataset has many zeros in the target column. This makes no sense, because revenue is the box office figure, not the profit. We therefore treat zeros like missing values and remove them.

In [6]:
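# Split rows by whether a revenue value was reported (0 is treated as missing)
nonzero_revenue = movies[movies['revenue'] != 0]
zero_revenue = movies[movies['revenue'] == 0]

### comment out the sampling below if you want only the non-zero rows
# Cap the sample size at the number of zero-revenue rows available, because
# sampling more rows than exist (or from an empty frame) raises a ValueError
n_zero = min(len(nonzero_revenue), len(zero_revenue))
zero_sample = zero_revenue.sample(n=n_zero, random_state=42)

movies = pd.concat((nonzero_revenue, zero_sample))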

#### Use info() to get a quick description of the data, the total number of rows, each attribute’s type, and the number of non-null values.

In [None]:
movies.info()

In [None]:
corr_matrix = movies.corr(numeric_only=True)
corr_matrix
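The full correlation matrix can be hard to scan. A minimal sketch of pulling out only the correlations with the target and ranking them by absolute value is given below; it uses made-up numbers, and `corr_with_revenue` is a name introduced here for illustration.

```python
import pandas as pd

# Toy numbers, invented for illustration only
toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "runtime": [90.0, 100.0, 95.0, 120.0, 80.0],
    "revenue": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Keep the 'revenue' column of the matrix, drop its self-correlation,
# and sort the rest by strength regardless of sign
corr_with_revenue = (
    toy.corr(numeric_only=True)["revenue"]
    .drop("revenue")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_revenue)
```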

We plot a histogram of each numerical column to see how its values are distributed.

In [None]:
import matplotlib.pyplot as plt

numerical_columns = movies.select_dtypes(include='number').columns

num_plots = len(numerical_columns)

### 3 rows and 3 columns of subplots
rows = 3 
cols = 3  


### create the grid of subplots
fig, axes = plt.subplots(rows, cols, figsize=(24, 16))

### plot one histogram per numerical column
for i, column in enumerate(numerical_columns):
    row = i // cols  
    col = i % cols   
    movies[column].plot(kind='hist', ax=axes[row, col])
    axes[row, col].set_title(column)
    
    # Access and modify x-axis properties
    axes[row, col].set_xlabel(numerical_columns[i])
    # You can set ticks, limits, etc., for the x-axis similarly
    
    # Access and modify y-axis properties
    axes[row, col].set_ylabel('Frequency')
    # You can set ticks, limits, etc., for the y-axis similarly
    
### remove the unused axes in the last row
fig.delaxes(axes[2,1])
fig.delaxes(axes[2,2])

### set scale to logarithmic to display the data
axes[0,1].set_yscale('log')
axes[0,2].set_yscale('log')
axes[1,0].set_yscale('log')
axes[1,1].set_yscale('log')
axes[1,2].set_yscale('log')
axes[2,0].set_yscale('log')

plt.show()

### Curious note: Usually, one may presume that the greater the budget of the movie, the better revenue it may generate. So let's put this to the test:

In [None]:
# Plot budget vs. revenue generated

budgetVsRevenue = sns.lineplot(x="budget", y="revenue", data=movies, errorbar=None)
# Set labels and title
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Budget vs. Revenue (in hundred-millions)')

plt.show()

In [None]:

releaseDateVsRevenue = sns.lineplot(x="release_date", y="revenue", data=movies, errorbar=None)
plt.xlabel('Release Date')
plt.ylabel('Revenue')
plt.title('Release Date vs. Revenue (in hundred-millions)')
plt.show()

Observation: As you can see, a higher budget does not necessarily generate more revenue. For better visualization, we must clean this data and scale the x-axis (i.e., budget) to see the correlation more clearly.
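One possible way to act on this observation, which the notebook does not do, is to log-transform budget and revenue before plotting or correlating them. The sketch below uses invented values and `np.log1p`, which maps 0 to 0 and so keeps zero entries finite.

```python
import numpy as np
import pandas as pd

# Toy values spanning several orders of magnitude, invented for illustration
toy = pd.DataFrame({
    "budget": [0, 10_000, 1_000_000, 100_000_000],
    "revenue": [0, 30_000, 5_000_000, 900_000_000],
})

# log1p(x) = log(1 + x): compresses large values and is defined at 0
toy["log_budget"] = np.log1p(toy["budget"])
toy["log_revenue"] = np.log1p(toy["revenue"])

print(toy[["log_budget", "log_revenue"]].round(2))
print(toy["log_budget"].corr(toy["log_revenue"]))
```

On the log scale, small and large productions occupy comparable space on the axis, so a single blockbuster no longer dominates the picture.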

In [None]:
x_axis = movies["runtime"]
y_axis = movies["revenue"]
# Plot points
fig, pl = plt.subplots()
pl.scatter(x_axis, y_axis, color = 'b')
plt.xlabel("runtime")
plt.ylabel("revenue")

Observation: clearly, we need to get rid of the rightmost value, as it compresses every other point against the y-axis. We need feature scaling after that.
### See the data cleaning part below: we remove the outliers by using a threshold on movie runtime, then plot this graph again for easier interpretation.

# 3 Preprocessing: Preparing data for Machine Learning tasks


## 3.1. - Data cleaning

Recall from above, that one runtime value is extraordinarily big. We will remove this outlier now.

In [None]:
# Assume runtime is measured in minutes and treat anything above 150 as an outlier
threshold_runtime = 150

# Create a boolean mask indicating which rows have runtime below or equal to the threshold
# (rows with a missing runtime fail this comparison, so they are dropped as well)
mask = movies['runtime'] <= threshold_runtime

# now, our dataframe contains only the rows with runtime below or equal to the threshold
movies = movies[mask]
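The 150 threshold above is chosen by hand. As a point of comparison only, and not the method used in this notebook, a common rule of thumb is Tukey's fence: values above Q3 + 1.5 × IQR are flagged as outliers. The sketch below applies it to invented runtimes.

```python
import pandas as pd

# Toy runtimes with one extreme value, invented for illustration
runtime = pd.Series([7, 37, 90, 95, 110, 125, 43200], dtype=float)

# Interquartile range and the upper Tukey fence
q1, q3 = runtime.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep only the values at or below the fence
kept = runtime[runtime <= upper_fence]
print(upper_fence, len(kept))
```

A data-driven fence adapts to the distribution, whereas a fixed cutoff such as 150 is easier to explain; either choice is a judgment call.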



In [None]:
# We now show how our earlier plot becomes so much better:

x_axis = movies["runtime"]
y_axis = movies["revenue"]

# Plot points
fig, pl = plt.subplots()
pl.scatter(x_axis, y_axis, color = 'b')
plt.xlabel("runtime")
plt.ylabel("revenue")


### 3.1.1 Checking for duplicates and dropping them

In [None]:
movies.duplicated().sum()

In [None]:
movies.drop_duplicates(inplace=True)

### 3.1.2 Handling the missing values

In [None]:
# Find the number of missing values in each column as a fraction out of total instances

movies.isna().sum()/len(movies)

In [None]:
# 'genres' is stored as a single string, which we cannot analyse directly.
# We convert it to a list so that our data pipeline can separate each genre
# into its own class.

### the genres in each string are separated by '-', so we split on it to get a list
movies['Genres_list'] = movies['genres'].str.split('-')

### many movies have no genres (still missing after the split), so we use a lambda to set these to empty lists
movies['Genres_list'] = movies['Genres_list'].apply(lambda x: x if isinstance(x, list) else [])

### show the new feature
movies['Genres_list']


### Removing unnecessary features ### 

We removed a number of fields for different reasons. The first set of features (recommendations, tagline, keywords, production companies, overview, status, credits, and title) are strings that we cannot use directly to predict revenue, and we concluded that feature engineering on them was not feasible. 'genres' is removed because we have rebuilt it as 'Genres_list'. Finally, 'id' is entirely irrelevant to revenue.

In [None]:
### drop the string features as they cannot be interpreted effectively
string_features = ['recommendations', 'tagline', 'keywords', 'production_companies',
                   'overview', 'status', 'credits', 'title']
movies.drop(columns=string_features, inplace=True)

### the regression models cannot process the datetime feature directly; feature engineering would be necessary
movies.drop(columns=['release_date'], inplace=True)

### drop this because Genres_list has been generated
movies.drop(columns=['genres'], inplace=True)

### an identifier that carries no information about revenue
movies.drop(columns=['id'], inplace=True)
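The cell above drops 'release_date' because it would need feature engineering. A minimal sketch of what that engineering could look like is shown here, using the pandas `.dt` accessor on two invented dates. The new column names are hypothetical, and this step is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy dates, invented for illustration
toy = pd.DataFrame({"release_date": pd.to_datetime(["2023-11-03", "2019-07-19"])})

# Numeric features derived from the datetime column
toy["release_year"] = toy["release_date"].dt.year
toy["release_month"] = toy["release_date"].dt.month
toy["release_dayofweek"] = toy["release_date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

Columns like these are plain integers, so they could pass through the numerical pipeline described in the next section.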

#### For the remaining missing values, we will fill them with the mean if it is a numerical value and the most frequent if it is a categorical column

#### We will do this through creating a pipeline, that will also scale the features and perform encoding in the next step.

### 3.1.3 Creating a pipeline that will:

1. Fill in the missing numerical values with the mean using a SimpleImputer

2. Scale the numerical columns using StandardScaler. Do not scale the target

3. Fill in the missing categorical values with the most_frequent value using SimpleImputer

4. Encode the categorical columns using OneHotEncoder


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import StandardScaler

In [None]:
# The MultiLabelBinarizer is used to process list-based data and encode each list entry as its own category
mlb = MultiLabelBinarizer()

# Transform the 'Genres_list' column into a binary representation
genres_encoded = mlb.fit_transform(movies['Genres_list'])

# Create a DataFrame with the encoded genres
genres_encoded_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)

# MultiLabelBinarizer outputs integer 0/1 columns; cast them to float (which is used by the regression models)
genres_encoded_df=genres_encoded_df.astype(float)

# shows all the different genres that are encoded into the dataframe (fewer than you'd think for over 10,000 movies)
genres_encoded_df.info()

In [None]:
# drop the 'Genres_list' so that it is not processed within the pipeline 
movies.drop(labels=['Genres_list'], axis=1, inplace=True)

In [None]:
# Create the cat and num columns
# Get a list of column names from the 'movies' DataFrame that are of numerical data types.
# Get a list of column names from the 'movies' DataFrame that are not of numerical data types.

num_cols = movies.select_dtypes(include='number').columns.to_list()
cat_cols = movies.select_dtypes(exclude='number').columns.to_list()


# Exclude the target from numerical columns
num_cols.remove("revenue")


# Create pipelines for numeric and categorical columns
num_pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),  OneHotEncoder(sparse_output=False))

# Use ColumnTransformer to set the estimators and transformations

preprocessing = ColumnTransformer([('num', num_pipeline, num_cols),
                                   ('cat', cat_pipeline, cat_cols)],
                                    remainder='passthrough'
                                 )

In [None]:
num_cols

In [None]:
cat_cols

In [None]:
preprocessing

In [None]:
# Apply the preprocessing pipeline on the dataset                 
movies_prepared = preprocessing.fit_transform(movies)

# Scikit-learn strips the column headers, so just add them back on afterwards
feature_names=preprocessing.get_feature_names_out()
movies_prepared = pd.DataFrame(data=movies_prepared, columns=feature_names)

# Concatenated the two encoded datasets together that were independently processed 
movies_prepared=pd.concat([movies_prepared,genres_encoded_df],axis=1)
movies_prepared
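`pd.concat(..., axis=1)` aligns rows by index label, not by position. In the cell above both frames are built with a fresh 0..n-1 index, so the rows should line up. The sketch below, on invented data, shows what happens when one frame still carries its original filtered index and how `reset_index(drop=True)` avoids the problem.

```python
import pandas as pd

# 'left' keeps labels from an earlier filter; 'right' has a fresh RangeIndex
left = pd.DataFrame({"budget": [1.0, 2.0]}, index=[10, 20])
right = pd.DataFrame({"Drama": [1.0, 0.0]})

# Labels do not match, so concat produces the union of both indexes with NaNs
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows align by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

If either frame is ever rebuilt from a filtered DataFrame without resetting its index, checking the shape and the NaN count after the concat is a cheap safeguard.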


# 4 Training and evaluation of 3 Machine Learning Algorithms, findings, and result comparison

## Train/test split

In [None]:
from sklearn.model_selection import train_test_split

# split the training and test data
x = movies_prepared.drop(["remainder__revenue"], axis=1)
y = movies_prepared['remainder__revenue']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
x

In [None]:
y

## 4.1. - Training of 3 ML algorithms
- Algorithm 1: LinearRegression (no regularization)
- Algorithm 2: XGBRegressor
- Algorithm 3: Ridge regression

### Algorithm 1: Linear Regression (no regularization)

In [None]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()

lr_model.fit(X_train,y_train)

In [None]:
lr_y_predict = lr_model.predict(X_test)

from sklearn.metrics import mean_absolute_error as mae
lr_mae=mae(y_test, lr_y_predict)
lr_mae
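Mean absolute error is the average of |actual − predicted| over the test rows, so it is expressed in the same units as revenue. The sketch below, on three invented values, checks a manual calculation against scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy actual and predicted revenues, invented for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, library_mae)
```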

### Algorithm 2: XGBRegressor

In [None]:
from xgboost import XGBRegressor
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)

In [None]:
xgb_y_predict = xgb_model.predict(X_test)
xgb_mae=mae(y_test, xgb_y_predict)
xgb_mae

### Algorithm 3: Ridge regression

In [None]:
from sklearn.linear_model import Ridge
RidgeRegression = Ridge(alpha=1)
ridge_model = RidgeRegression.fit(X_train, y_train)

In [None]:
ridge_y_predict = ridge_model.predict(X_test)
ridge_mae = mae(y_test, ridge_y_predict)
ridge_mae

In [None]:
plt.scatter(lr_y_predict, y_test)  # y is your actual target values
plt.xlabel("Revenue predicted")
plt.ylabel("revenue actual")
plt.xscale('log')
plt.yscale('log')
plt.title("Predicted vs. Actual Values")
plt.show()

In [None]:
plt.scatter(xgb_y_predict, y_test)  # y is your actual target values
plt.xlabel("Revenue predicted")
plt.ylabel("revenue actual")
plt.xscale('log')
plt.yscale('log')
plt.title("Predicted vs. Actual Values")
plt.show()

In [None]:
plt.scatter(ridge_y_predict, y_test)  # y is your actual target values
plt.xlabel("Revenue predicted")
plt.ylabel("revenue actual")
plt.xscale('log')
plt.yscale('log')
plt.title("Predicted vs. Actual Values")
plt.show()

## 4.2. - Analysis of findings


In [None]:
print(f'Linear Regression: {lr_mae}')
print(f'XGB Regression: {xgb_mae}')
print(f'Ridge Regression: {ridge_mae}')

XGBoost Regression has the lowest mean absolute error, so it is our best-performing algorithm.
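Comparing models against each other shows which one is best, but not whether the best one is good in absolute terms. One way to add that context, sketched here on synthetic data rather than the movie data, is to compare against a trivial baseline such as scikit-learn's `DummyRegressor`, which here always predicts the training median. All variable names below are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on one feature plus noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Baseline ignores the features and predicts the median of y_tr
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
model_mae = mean_absolute_error(y_te, model.predict(X_te))
print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
```

A model whose MAE is not clearly below the baseline's has learned little from the features, regardless of how it ranks against other models.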

In [None]:
# Pair each feature name with its importance score from the fitted XGBoost model
feature_importance = xgb_model.feature_importances_
features = x.columns.tolist()
importance_by_feature = dict(zip(features, feature_importance))

# Sort from most to least important and print the top 15
sorted_importance = sorted(importance_by_feature.items(), key=lambda item: item[1], reverse=True)
for rank, (name, value) in enumerate(sorted_importance[:15], start=1):
    print(f'{rank}. {name:25}: {value}')

## Some Interesting Remarks about the weights

- According to our model, the most significant factor determining the success of a movie was its budget.
- The genre that produces the greatest revenue simply by virtue of its genre is animation.
- Vote count was significantly more impactful than vote average, implying that what increases revenue is not how much viewers loved the movie but how many people watched it and want to talk about it (which makes sense).
- The language with the most value for a movie's initial release is Simplified Chinese, which is plausible given that Mandarin Chinese has more native speakers than any other language. English is significantly behind it.
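Tree-based importance scores rank features but do not say whether a feature pushes revenue up or down, so remarks like the ones above are best cross-checked. One option, not used in this notebook, is scikit-learn's `permutation_importance`, which measures how much the model's score drops when one feature is shuffled. The sketch below uses synthetic data and a linear model purely to keep it small; the feature labels are invented.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first feature influences the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

toy_model = LinearRegression().fit(X_toy, y_toy)

# Shuffle each feature 10 times and record the average drop in R^2
result = permutation_importance(toy_model, X_toy, y_toy, n_repeats=10, random_state=0)

for name, score in zip(["informative", "noise"], result.importances_mean):
    print(f"{name:12}: {score:.3f}")
```

The same call accepts any fitted scikit-learn-compatible estimator, so in principle it could be applied to a held-out split of the movie data as well.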


# 5 - Three Graphs for the best-performing algorithm

In [None]:
plt.scatter(xgb_y_predict, y_test)  # y is your actual target values
plt.xlabel("Revenue predicted")
plt.ylabel("revenue actual")
plt.title("Predicted vs. Actual Values")
plt.xscale('log')
plt.yscale('log')
plt.show()


This bar graph compares our predictions with the actual revenues side by side. The more closely the paired bars match in height, the closer our predictions were to the actual revenue.

In [None]:
n = len(xgb_y_predict)  # Number of data points
index = np.arange(n)  # Create an array of indices
    
bar_width = 0.35  # Width of the bars

plt.bar(index, xgb_y_predict, bar_width, label='Predicted Revenue')
plt.bar(index + bar_width, y_test, bar_width, label='Actual Revenue')

plt.xlabel('Data Point Index')
plt.ylabel('Revenue')
plt.title('Predicted vs. Actual Revenue')
plt.xticks(index + bar_width, (str(i) for i in range(n)))  # X-axis labels
plt.yscale('log')  # Setting y-axis to logarithmic scale
plt.legend()

plt.tight_layout()
plt.show()

This violin plot shows how the density and median of our predictions compare with those of the actual revenue. As seen, the shapes are relatively similar; however, the median of our model's predictions is higher.

In [None]:
data = [xgb_y_predict, y_test]

plt.figure(figsize=(10, 6))  # You can adjust the figure size as needed
sns.violinplot(data=data)

plt.xticks([0, 1], ['Predicted Revenue', 'Actual Revenue'])
plt.ylabel('Revenue')
plt.title('Violin Plot of Predicted vs. Actual Revenue')
plt.yscale('log')

plt.show()

# 6-Limitations of the project

## Many zeros for budget and revenue

The biggest limitation of this project was the data itself, which left a lot to be desired. We did not foresee the lack of revenue and budget data for recent movies, and this became the biggest issue. Much of our data lacked proper budget and revenue values; we remedied this by sampling our zero and non-zero data. Even with this workaround, better data from the start would have given us more instances and a more precise model.

## Feature Data engineering

Many of the usable features require very complex data engineering. For example, production company or credits would need to be converted from a string format to a more usable one. While the dataset has a massive amount of data, much of it is not usable in its base form.
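As a rough illustration of what such engineering might involve, and not something this project implemented, the sketch below splits a '-' separated company string, counts how often each company appears, and adds an indicator column for the most frequent one. The three company strings are taken from the head() preview above; the missing row, the `top_k` value, and the column naming are assumptions. One known caveat: a company whose own name contains a hyphen would be split incorrectly.

```python
import pandas as pd

toy = pd.DataFrame({"production_companies": [
    "Robot Communications-TOHO Studios-TOHO",
    "Peach Jam Pictures",
    None,
    "TOHO",
]})

# Split on '-' and replace missing entries with empty lists
company_lists = toy["production_companies"].str.split("-").apply(
    lambda c: c if isinstance(c, list) else []
)

# Count company occurrences across all movies and keep the most frequent ones
top_k = 1  # hypothetical cutoff
counts = company_lists.explode().value_counts()
top_companies = counts.head(top_k).index.tolist()

# One 0/1 indicator column per frequent company
for company in top_companies:
    toy[f"company_{company}"] = company_lists.apply(lambda c: float(company in c))

print(toy)
```

Limiting the encoding to the most frequent companies keeps the number of new columns manageable, at the cost of ignoring rare ones.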

## Lack of focus on more complex and equally relevant factors

Our model does not consider many factors that could have been very relevant, such as the company that produced the movie, the amount of money spent on advertising, and potentially even the country where it was initially released. Many complex factors have a significant bearing on how much money a movie earns. While our model makes predictions despite the limited set of features it uses, we would argue that with more features it would be more precise and predict more accurately.

# Appendix 1:

Empty unless we used someone else's code, then we cite it here

# Appendix 2: 

### Github repository link: https://github.com/Jayant1Varma/Movie-Box-Office-predictor.git 

**Original dataset citation: The dataset used was https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies/data . This dataset is updated daily; we used the version that was available on November 5th, 2023.**

**You can find the exact dataset we used here: https://drive.google.com/file/d/1uPtHyqpAKkqZUpft8A0FPVXPR2iT32SN/view?usp=sharing**