# Movie Gross Prediction using Decision Tree Classifier

In this notebook, we explore using a decision tree classifier to predict a movie's gross earnings. We examine whether certain features of a movie can be used to estimate how much a movie with manually selected features can expect to earn. The model is trained on data for the 1000 highest-grossing movies, updated in January 2023.

### Approach
This task involves training an AI model on the provided movie dataset so that it learns patterns and can make predictions for unseen movies. The decision tree classifier bases its predictions on features such as release year, runtime, rating, and certificate.

The model predicts the page on which a new movie will land. Page 1 means the movie ranks in the top 100, and page 10 means it ranks between 901 and 1000. Each page has an average gross across its 100 movies, and this is the amount we estimate the new movie will earn.
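The relation between a movie's rank and its page can be sketched as follows. This is only an illustration: `rank_to_page` is a hypothetical helper, and it assumes ranks are 1-based with exactly 100 movies per page, as described above.

```python
def rank_to_page(rank: int, page_size: int = 100) -> int:
    """Map a 1-based rank in the list to its page number (hypothetical helper)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    # Integer arithmetic: ranks 1..100 -> page 1, 101..200 -> page 2, and so on
    return (rank - 1) // page_size + 1


print(rank_to_page(1), rank_to_page(100), rank_to_page(101), rank_to_page(1000))
# 1 1 2 10
```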

### Libraries
For this notebook, we require the following Python libraries and imports:

In [1]:
# Imports
import pandas as pd
import numpy as np

# Scikit-learn for machine learning methods
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.datasets import make_multilabel_classification

# Pickle to save machine learning model for later use
import pickle

# Visualization
from matplotlib import pyplot as plt

## Data

The classifier is trained and tested on data from IMDb's top 1000 highest-grossing movies. The dataframe is shown below.

In [2]:
df = pd.read_csv('movies_new.csv')

df

Unnamed: 0,id,title,release year,certificate,runtime,genre,rating,summary,directors,actors,lifetime gross in $,reviews,review_score,review_user,publishers,page
0,tt0499549,Avatar,2009,PG-13,162 min,"Action, Adventure, Fantasy",7.9,A paraplegic Marine dispatched to the moon Pan...,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",2922917914,['I\'m not exactly sure when it became a thing...,"['No rating', '10/10', '9/10', '8/10', '10/10'...","['gogoschka-1', 'eldreddsouza', 'Samiam3', 'Cl...","['Twentieth Century Fox', 'Dune Entertainment'...",1
1,tt4154796,Avengers: Endgame,2019,PG-13,181 min,"Action, Adventure, Drama",8.4,After the devastating events of Avengers: Infi...,"Anthony Russo, Joe Russo","Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",2797501328,"[""But its a pretty good film. A bit of a mess ...","['7/10', '8/10', '7/10', '10/10', '10/10', '10...","['MoistMovies', 'ACollegeStudent', 'nickgray-1...","['Marvel Studios', 'Walt Disney Pictures']",1
2,tt0120338,Titanic,1997,PG-13,194 min,"Drama, Romance",7.9,A seventeen-year-old aristocrat falls in love ...,James Cameron,"Leonardo DiCaprio, Kate Winslet, Billy Zane, K...",2201647264,"[""I have watched Titanic how many times I don'...","['10/10', '10/10', 'No rating', '10/10', '10/1...","['TaylorYee94', 'BJG-Reviews', 'MrHeraclius', ...","['Twentieth Century Fox', 'Paramount Pictures'...",1
3,tt2488496,Star Wars: Episode VII - The Force Awakens,2015,PG-13,138 min,"Action, Adventure, Sci-Fi",7.8,"As a new threat to the galaxy rises, Rey, a de...",J.J. Abrams,"Daisy Ridley, John Boyega, Oscar Isaac, Domhna...",2069521700,"[""This film really is nothing more than an Adv...","['7/10', '8/10', '8/10', 'No rating', '9/10', ...","['coasterdude44', 'Ben-Hibburd', 'Sewaat', 'El...","['Lucasfilm', 'Bad Robot', 'Walt Disney Pictur...",1
4,tt4154756,Avengers: Infinity War,2018,PG-13,149 min,"Action, Adventure, Sci-Fi",8.4,The Avengers and their allies must be willing ...,"Anthony Russo, Joe Russo","Robert Downey Jr., Chris Hemsworth, Mark Ruffa...",2048359754,['Infinity War is remembered mostly for how it...,"['10/10', 'No rating', '10/10', '10/10', '9/10...","['RJBrez', 'MrHeraclius', 'Alex_Lo', 'grztxks'...","['Marvel Studios', 'Jason Roberts Productions'...",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,tt1536044,Paranormal Activity 2,2010,R,91 min,"Horror, Mystery",5.7,After experiencing what they think are a serie...,Tod Williams,"Katie Featherston, Micah Sloat, Molly Ephraim,...",177512032,"[""More of the same though less interestingly d...","['4/10', 'No rating', 'No rating', '7/10', '7/...","['dbborroughs', 'hyprsleepy', 'Cujo108', 'Smel...","['Paramount Pictures', 'Blumhouse Productions'...",10
996,tt0251127,How to Lose a Guy in 10 Days,2003,PG-13,116 min,"Comedy, Romance",6.4,Benjamin Barry is an advertising executive and...,Donald Petrie,"Kate Hudson, Matthew McConaughey, Adam Goldber...",177502387,['So its fluffy and predictable. It\'s unreali...,"['No rating', '9/10', '10/10', '9/10', '7/10',...","['Scaramouche2004', 'Zi_Reviews_Movies', 'cris...",['Unknown Publisher'],10
997,tt0370263,Alien vs. Predator,2004,PG-13,101 min,"Action, Adventure, Horror",5.6,During an archaeological expedition on Bouvetø...,Paul W.S. Anderson,"Sanaa Lathan, Lance Henriksen, Raoul Bova, Ewe...",177427090,['Just watched it again yesterday - it\'s stri...,"['No rating', '7/10', 'No rating', '7/10', '7/...","['gogoschka-1', 'kevin_robbins', 'amesmonde', ...","['Twentieth Century Fox', 'Davis Entertainment...",10
998,tt0299977,Hero,2002,PG-13,120 min,"Action, Adventure, Drama",7.9,"A defense officer, Nameless, was summoned by t...",Yimou Zhang,"Jet Li, Tony Leung Chiu-wai, Maggie Cheung, Zi...",177395557,"[""It's a story of a man who's on a mission, wi...","['9/10', '9/10', '10/10', '10/10', '8/10', '9/...","['Xstal', 'PolishBear', 'brandon_veracka', 'rs...","['Edko Films', 'Zhang Yimou Studio', 'China Fi...",10


## Data preparation
#### To implement a classification AI model, we need to convert the relevant data to numeric values. This includes methods such as:

1. Directly converting strings, such as release year and movie runtime, to integer or float values by removing non-numeric characters and typecasting.
2. Creating mapping dictionaries with the original string as key and the integer that represents it as value.
3. Applying statistical methods such as sums or averages where desired or needed (for example, on list-valued columns).
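As an illustration of the third method, the sketch below averages a list of review scores into a single number. It is not part of the notebook's pipeline: `mean_review_score` is a hypothetical helper, and it assumes each `review_score` cell is a string holding a valid Python list literal with entries such as `'9/10'` or `'No rating'`, as the dataframe preview suggests.

```python
import ast
from typing import Optional


def mean_review_score(raw: str) -> Optional[float]:
    """Average the 'n/10' entries of a stringified list, skipping 'No rating'."""
    entries = ast.literal_eval(raw)  # assumption: the CSV stores the list as a string
    scores = [int(entry.split("/")[0]) for entry in entries if "/" in entry]
    return sum(scores) / len(scores) if scores else None


example = "['No rating', '10/10', '9/10', '8/10', '10/10']"
print(mean_review_score(example))  # 9.25
```

Applied to the dataframe, this could look like `df["review_score"].apply(mean_review_score)`, which would yield one more numeric column to consider as a feature.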

In [3]:
# Convert select columns to numeric values to allow for classification

# Convert runtime to int representing the runtime in minutes
df["runtime"] = df["runtime"].apply(lambda x: int(x.split()[0]))

# Convert release year to int by removing all non-numeric characters
df['release year'] = (
    df['release year'].astype(str).str.replace(r'\D', '', regex=True).astype(int)
)

In [4]:
# Convert columns where no apparent typecast is available

# Mapping dictionaries 
certificate_mapping = {}

# Certificate

# Get all unique certificates in the dataset
unique_certificates = df['certificate'].unique()

# Assign each certificate an integer code starting at 1, in order of first appearance
certificate_mapping = {cert: i + 1 for i, cert in enumerate(unique_certificates)}
    
df['certificate_numeric'] = df['certificate'].map(certificate_mapping)

print(certificate_mapping)

{'PG-13': 1, 'PG': 2, 'R': 3, 'G': 4, 'Not Rated': 5, 'TV-MA': 6, 'Passed': 7, 'TV-Y7': 8, 'TV-PG': 9, 'Approved': 10}
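To read a numeric code back as a certificate, for example when interpreting a prediction input, the mapping can be inverted. The sketch below uses only a subset of the mapping printed above; `code_to_certificate` is a name introduced here for illustration.

```python
# Subset of the mapping printed above, repeated so the example is self-contained
certificate_mapping = {'PG-13': 1, 'PG': 2, 'R': 3, 'G': 4}

# Invert the dictionary: integer code -> certificate string
code_to_certificate = {code: name for name, code in certificate_mapping.items()}

print(code_to_certificate[1])                # PG-13
print(code_to_certificate.get(99, "unknown"))  # unknown
```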


## Numeric data

After preparing our data, we end up with the following dataframe containing only numeric values:

In [5]:
# Reorder the page column to allow for easier splitting
df = pd.concat([df.drop('page', axis=1), df['page']], axis=1)

df = df.drop('lifetime gross in $', axis=1)

numeric_cols = df.select_dtypes(include=['number']).columns


df = df[numeric_cols]

df.dtypes

release year             int64
runtime                  int64
rating                 float64
certificate_numeric      int64
page                     int64
dtype: object

## Creating the model

Now that we have the numeric data (our features and labels) needed for our classification, we need to create the model and prepare our data for such a task.

This includes:

1. Splitting the data into training and testing sets.
2. Applying the decision tree algorithm to the training data.
3. Testing the model on the test set.
4. Predicting the page of a new movie with specified input values.

### Step 1: Splitting the data

In [6]:
# We will first convert our dataframe to a numpy ndarray
array = df.values

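# Split the data into features X (first four columns) and labels y (last column).
# Indexing with -1 rather than -1: keeps y one-dimensional, the usual shape for scikit-learn labels.
X, y = array[:, :4], array[:, -1]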

In [7]:
# Split the data into a training set and a test set in proportion 80 train / 20 test randomly using seed = 1 
# For different results that may prove better, you can adjust the random generation by changing the seed value.

# 20% test data
set_prop = 0.2

seed = 1

# Split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=set_prop, random_state=seed)
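The split above is purely random, so the pages may be unevenly represented in the test set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below demonstrates it on synthetic data; `X_demo` and `y_demo` are stand-ins, assuming ten pages with 100 movies each.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))         # synthetic stand-in for the four features
y_demo = np.repeat(np.arange(1, 11), 100)   # ten pages with 100 movies each

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Count how many test samples each page contributes (index 0 is unused)
test_counts = np.bincount(y_te)[1:]
print(test_counts)
```

With stratification, each page contributes the same share of its movies to the training and test sets, which makes the test accuracy less dependent on the particular seed.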

### Step 2: Building the model

In [8]:
# Build Decision Trees Classifier 
params = {'max_depth': 4}
classifier = DecisionTreeClassifier(**params)
 
classifier.fit(X_train, y_train)
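A fitted tree can be inspected to see which features it actually splits on. The sketch below uses `export_text` and `feature_importances_` on synthetic data in which the label depends only on the third feature. The demo data and `demo_tree` are assumptions for illustration, and the feature names simply mirror the notebook's columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 4))
y_demo = (X_demo[:, 2] > 0.5).astype(int) + 1   # label depends only on the third feature

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

feature_names = ["release year", "runtime", "rating", "certificate_numeric"]

# Text rendering of the learned decision rules
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)

# Relative contribution of each feature to the splits
importances = {
    name: round(float(value), 2)
    for name, value in zip(feature_names, demo_tree.feature_importances_)
}
print(importances)
```

Running the same two calls on `classifier` would show which of the four movie features the real model relies on.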

In [9]:
# Predict the pages of the test data
y_testp = classifier.predict(X_test)
print(y_testp)
# Calculate the accuracy of the model by comparing the observed and predicted pages
print("Accuracy is: ", accuracy_score(y_test, y_testp))

[ 2.  5.  5. 10.  5. 10.  2.  4.  1.  1.  5.  2.  5.  2. 10.  2.  4.  4.
  4. 10.  2. 10.  5.  5.  5.  3.  1.  5.  5.  5.  5. 10.  5.  2.  1.  2.
  1.  5.  5.  2.  5.  1.  2.  1.  5. 10.  5.  4. 10. 10.  3.  2. 10.  1.
  5.  5. 10.  5.  5.  5.  5.  3. 10.  2.  5. 10.  2.  5.  4.  1.  1.  5.
 10. 10.  5.  5.  5.  1.  8. 10.  3.  2. 10.  4.  5. 10.  4.  8.  5. 10.
  5.  5.  5. 10.  4. 10.  5. 10.  5.  5.  5.  4.  5.  4.  5.  4.  1. 10.
  5.  9.  8.  5.  5.  1. 10.  5.  5.  2.  5.  5.  5.  2.  2.  5.  4.  3.
  4.  5. 10.  5. 10.  5.  1.  1. 10. 10.  2.  4.  3.  5.  5.  4.  5.  5.
  1.  2.  4.  5.  4.  2.  2.  2.  2.  5.  2.  5.  5.  4.  2.  8.  3.  5.
  4. 10. 10.  8.  8.  3.  4.  1.  2.  2. 10.  5.  1.  2.  3.  3.  3. 10.
  5.  1.  5. 10.  5. 10.  5.  2.  2.  4. 10.  3.  5.  4.  2. 10.  5.  4.
  2.  5.]
Accuracy is:  0.135
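To put the accuracy of 0.135 in context, it helps to compare it with a trivial baseline. With ten pages of roughly equal size, a model that always predicts the same page scores about 0.1. The sketch below measures such a baseline with scikit-learn's `DummyClassifier` on synthetic data; `X_demo` and `y_demo` are stand-ins, assuming 100 movies per page.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))         # synthetic stand-in for the features
y_demo = np.repeat(np.arange(1, 11), 100)   # ten pages with 100 movies each

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Baseline that ignores the features and always predicts the most frequent training page
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print("Baseline accuracy:", baseline_acc)
```

Swapping in the notebook's own `X_train`, `X_test`, `y_train`, and `y_test` would give the baseline for the real split, against which the decision tree's score can be judged.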


### Step 3: Applying the model on a new movie

In [10]:
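# Test the model on a new movie with set features:
# release year, runtime (minutes), rating, certificate code
new_movie_X = np.array([[2018, 100, 5.0, 1]])

# predict returns an array with one entry per input row, so take the first element
predicted_page = int(classifier.predict(new_movie_X)[0])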

print("Page:",predicted_page)

Page: 10
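A single predicted page hides how confident the tree is. `predict_proba` returns the class distribution of the leaf that a sample falls into. The sketch below uses a tree fitted on random synthetic data, so the probabilities themselves are meaningless; `X_demo`, `y_demo`, and `demo_clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(1980, 2023, size=500),   # release year
    rng.integers(80, 200, size=500),      # runtime in minutes
    rng.uniform(4.0, 9.0, size=500),      # rating
    rng.integers(1, 4, size=500),         # certificate code
])
y_demo = rng.integers(1, 11, size=500)    # random page labels, for illustration only

demo_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)

new_movie = np.array([[2018, 100, 5.0, 1]])
proba = demo_clf.predict_proba(new_movie)[0]

# One probability per class, in the order given by classes_
for page, p in zip(demo_clf.classes_, proba):
    print(f"page {page}: {p:.2f}")
```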


### Step 4: Results

A movie released in 2018 with a runtime of 100 minutes, a rating of 5.0, and the certificate PG-13 is classified as landing on page 10 (ranks 901 to 1000).

To see how much money we predict the movie will earn, we compare it to the other movies on the same page.

In [11]:
# Get fresh dataframe
df = pd.read_csv('../data/movies.csv')
df = df[['title', 'page', 'lifetime gross in $']]
df = df[df['page'] == predicted_page]

max_gross = format(max(df['lifetime gross in $']), ',')
min_gross = format(min(df['lifetime gross in $']), ',')

print(f'Your movie will be expected to earn between {min_gross}$ and {max_gross}$')

Your movie will be expected to earn between 177,378,645$ and 193,967,670$
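Besides the range, the average gross of the page can serve as a single point estimate. The sketch below computes the minimum, mean, and maximum per page with `groupby`. The small `demo` frame is an assumption for illustration, built from four rows of the dataframe preview above rather than from the full dataset.

```python
import pandas as pd

# Four rows taken from the dataframe preview (two from page 1, two from page 10)
demo = pd.DataFrame({
    "page": [1, 1, 10, 10],
    "lifetime gross in $": [2922917914, 2797501328, 177512032, 177502387],
})

# Minimum, mean, and maximum gross for each page
page_stats = demo.groupby("page")["lifetime gross in $"].agg(["min", "mean", "max"])
print(page_stats.loc[10])
```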


### Step 5: Save the model for future use

In [12]:
# Save the decision tree classifier model for later use
with open("../server/pre_trained_models/decision_tree_earnings.pkl", "wb") as f:
    pickle.dump(classifier, f)
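Loading the model back is the mirror image of saving it. The sketch below round-trips a small demo classifier through a temporary file and checks that its predictions are unchanged. The file name `demo_tree.pkl` and the synthetic data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(1, 11, size=100)
demo_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_tree.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(demo_clf.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.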
