# Predicting Box Office Revenue

## Preprocessing and Training Notebook

The purpose of this notebook is twofold:

First I want to reduce the dimensionality of my data set in order to make the machine learning models I have to work with more tractable. Currently I have over 10,000 features and only 2,333 observations in the data set, causing serious computational slowdown for my models.  Further, at the time this notebook was initially written, the machine I was working on didn't have the computational power to even create dummy variables from my 'cast' and 'crew' categories.


I plan to 'bin' categorical dummy variables into new features to reduce dimensionality.  This is relatively easy for features that are highly skewed (spoken language) or where each variable has a low frequency in the data set compared to 'None' (collection).  However, I'll need to test performance on a basic linear/polynomial regressor for other categories where the data is exponentially skewed and there isn't a clear way to bin the data.  This will allow me to compare R^2 and Mean Absolute Percentage Error metrics and determine which method of dimensionality reduction produces the more accurate model.

Finally, I'll use Lasso and Ridge regressors to test my initial linear regressor for overfit; additionally I'll be able to use the Lasso regressor coefficients to determine what features can be dropped to further reduce dimensionality and complexity. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, r2_score
import time

In [2]:
# Read in CSV from EDA notebook

boxoffice = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\boxoffice_EDA.csv', index_col=0, 
                        header=[0,1])

boxoffice.shape

(2333, 10456)

In [3]:
boxoffice.isnull().sum()

Genre          Action       0
               Adventure    0
               Animation    0
               Comedy       0
               Crime        0
                           ..
Release_month  8            0
               9            0
               10           0
               11           0
               12           0
Length: 10456, dtype: int64

In [4]:
boxoffice.head(3)

Unnamed: 0_level_0,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,...,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month
Unnamed: 0_level_1,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,...,3,4,5,6,7,8,9,10,11,12
0,0,0,0,1,0,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,1,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0


In [5]:
# Displaying names of top level of multi-index for later reference
boxoffice.columns.get_level_values(0).unique()

Index(['Genre', 'Collection', 'Company', 'Country', 'Spoken_lang', 'Keywords',
       'Descriptive', 'Numerical', 'Release_year', 'Release_month'],
      dtype='object')

## Binning Categorical Data

To begin I want to significantly cut down on the number of features that I have in the data set.  By far the largest category that I have is Keywords.  The distribution of that category and several others leaves me a couple of options for binning my data: I can bin them into quartiles by either frequency or median/mean revenue in the data set.
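A minimal sketch of the frequency-based option on toy data, assuming every dummy column holds only 0s and 1s. The helper name `bin_dummies_by_frequency` and the toy frame are hypothetical and not part of this notebook's data.

```python
import pandas as pd

def bin_dummies_by_frequency(dummies: pd.DataFrame, n_bins: int = 4) -> pd.DataFrame:
    """Collapse 0/1 dummy columns into n_bins columns grouped by how often each appears."""
    freq = dummies.sum()  # number of films each dummy appears in
    # Rank first so that ties do not produce duplicate bin edges in qcut
    bins = pd.qcut(freq.rank(method="first"), q=n_bins, labels=False)
    binned = {}
    for b in range(n_bins):
        cols = bins[bins == b].index
        # A film gets a 1 if it has any dummy that falls in this frequency bin
        binned[f"freq_bin_{b}"] = dummies[cols].max(axis=1)
    return pd.DataFrame(binned, index=dummies.index)

# Toy data: four dummies appearing in 4, 3, 2, and 1 films respectively
toy = pd.DataFrame({
    "a": [1, 1, 1, 1],
    "b": [1, 1, 1, 0],
    "c": [1, 1, 0, 0],
    "d": [1, 0, 0, 0],
})
binned = bin_dummies_by_frequency(toy, n_bins=2)
print(binned)
```

Binning by median revenue would follow the same pattern, with the per-column frequency replaced by the median revenue of the films that carry each dummy.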

However, I do have several feature categories that I won't be touching: Genre, Release_year, Country, and Release_month.  These categories all have fewer than 70 individual features.  Additionally, I have a compelling reason to believe that each of these will have a large impact on revenue.  Genre, Release_year, and Release_month were all significantly more evenly distributed across films than other categories.  For these three categories median revenue was also skewed towards certain sub-categories, which indicates that they have an impact on revenue.

While the Country category is **heavily** skewed towards films made in the United States, revenue is heavily skewed towards more 'exotic' countries.  This is likely a result of blockbuster films like 'Pirates of the Caribbean' or 'The Avengers' being filmed on site in other locations.  I suspect that this category will be highly correlated with budget and may be dropped after I check the Lasso regressor coefficients.  Regardless, this category only has 67 sub-categories, and consolidating those into fewer bins would lose what appears to be useful information while doing little to reduce the 10,456 dimensions that the data set currently has.

This means that I need to bin the Spoken_lang, Company, Keywords, and Collection columns.  Later I'll come back through and treat the Cast and Crew categories the same as the Keyword category once I have access to a machine with enough RAM to handle those categories. 

The best place for me to start with binning is going to be the Spoken_lang category since it's highly skewed towards the 'English' sub-category.  

#### Binning Spoken_lang

Based on the work from my EDA notebook the 'English' sub-category accounts for 2,180 films in this data set.  It's clear that the simplest way to bin this category is to reduce it to a single column that indicates if a film's primary language is English or not.

In [6]:
# Create a list of column labels to be dropped
dropped = list(boxoffice['Spoken_lang'].columns)
dropped.remove('English')

In [7]:
# Loop over the dropped list and remove every language from Spoken_lang other than English

for col in dropped:
    boxoffice.drop(col, level=1, axis=1, inplace=True)

In [8]:
# verify that English is the only remaining column
boxoffice.Spoken_lang.head(3)

Unnamed: 0,English
0,1
1,1
2,1
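One thing to be aware of: dropping with level=1 matches a label on the second level under every top-level category, so a language name that also happened to exist under another category would be removed there too. The sketch below uses a toy frame with a hypothetical clash to show one way of restricting the drop to Spoken_lang with a boolean mask. Whether such a clash exists in the real data set is not something I have checked.

```python
import pandas as pd

# Toy frame: 'French' appears under two top-level categories (hypothetical clash)
cols = pd.MultiIndex.from_tuples([
    ("Spoken_lang", "English"),
    ("Spoken_lang", "French"),
    ("Keywords", "French"),
])
toy = pd.DataFrame([[1, 0, 1], [1, 1, 0]], columns=cols)

level0 = toy.columns.get_level_values(0)
level1 = toy.columns.get_level_values(1)

# Keep everything except the non-English columns that sit under Spoken_lang
keep = ~((level0 == "Spoken_lang") & (level1 != "English"))
trimmed = toy.loc[:, keep]
print(trimmed.columns.tolist())
```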


#### Binning Film Collection dummy variables

Ultimately this was going to be relatively complex with the multi-indexed dataframe, so I went back to my EDA notebook and created a boolean column that indicates whether a film belongs to a collection.  The final step is to cast this column to 1 or 0.

In [9]:
# Casting the column in question to a numeric data type
boxoffice['Descriptive', 'Collection'] = boxoffice['Descriptive', 'Collection'].astype('int')
boxoffice['Descriptive', 'Collection'].head(3)

0    1
1    0
2    0
Name: (Descriptive, Collection), dtype: int32

In [10]:
boxoffice['Descriptive'].head(3)

Unnamed: 0,original_title,overview,tagline,title,Collection
0,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,1
1,Whiplash,"Under the direction of a ruthless instructor, ...",The road to greatness can take you to the edge.,Whiplash,0
2,Pinocchio and the Emperor of the Night,"Pinocchio and his friends, a glow worm and a m...",,Pinocchio and the Emperor of the Night,0


While I'm working on this column in Descriptive I'm also going to drop the overview, tagline, and original_title columns, and store the title column in a separate Series before dropping it as well.

In [11]:
boxoffice.drop('overview', level=1, axis=1, inplace=True)
boxoffice.drop('tagline', level=1, axis=1, inplace=True)
boxoffice.drop('original_title', level=1, axis=1, inplace=True)
# Storing the titles in a separate Series for later use as needed
Titles = boxoffice['Descriptive', 'title']
boxoffice.drop('title', level=1, axis=1, inplace=True)

In [12]:
#Verify that the Descriptive category has been reduced to only the one-hot encoded column for collections
boxoffice['Descriptive'].head(3)

Unnamed: 0,Collection
0,1
1,0
2,0


## Testing the accuracy of linear regression with the data set as is

At this point I have a huge number of columns, and I haven't made significant reductions in the number of features yet. 

However once I've binned the Company and Keyword columns I'll have eliminated thousands of features.  Prior to doing this I want to get a baseline for how accurate a model is with all of these columns left in the data set. 

The reason that this is important to do now, prior to reducing the dimensionality of the Company and Keywords categories, is that each movie has multiple companies and keywords associated with it, while there is only a single language and film collection for each movie.  Reducing the Spoken_lang and Collection categories doesn't eliminate complex information about each film the way binning the Company and Keywords categories will.

This will also give me a baseline of accuracy prior to scaling the numeric data that I have to work with as well.

If the model takes excessively long to train I'll be forced to trim a lot of those columns from my data set, so I'll time the training run and print the result in addition to testing the accuracy of the model.

#### Renaming columns in Release_month

While working with dummy variables I think that it's wise to rename the columns for Release_month since it's possible that I'll be dropping the hierarchical index before creating a train/test split and training my model. 

In [13]:
months = {'1':'Jan', '2':'Feb', '3':'Mar', '4':'Apr', '5':'May', '6':'June', 
         '7':'July', '8':'Aug', '9':'Sep', '10':'Oct', '11':'Nov', '12':'Dec'}
boxoffice.rename(columns=months, level=1, inplace=True)
boxoffice['Release_month'].head(3)

Unnamed: 0,Jan,Feb,Mar,Apr,May,June,July,Aug,Sep,Oct,Nov,Dec
0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,1,0,0,0,0
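If the hierarchical index does get dropped before training, one option is to join the two levels into a single string so that every column keeps its category as a prefix. This is a sketch on a toy frame with a few column names borrowed from this data set. The underscore separator is an arbitrary choice.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ("Genre", "Comedy"),
    ("Release_month", "Aug"),
    ("Numerical", "budget"),
])
toy = pd.DataFrame([[1, 1, 40000000.0]], columns=cols)

# Join both levels into one label, e.g. ('Release_month', 'Aug') -> 'Release_month_Aug'
flat = toy.copy()
flat.columns = ["_".join(map(str, pair)) for pair in toy.columns]
print(flat.columns.tolist())
```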


#### Creating the train/test split

In [14]:
y = boxoffice['Numerical', 'revenue']
y.head(3)

0    134734481.0
1     48982041.0
2      3418605.0
Name: (Numerical, revenue), dtype: float64

In [15]:
X = boxoffice.drop('revenue', level=1, axis=1)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#### Testing the model's accuracy

In [17]:
linear_1 = LinearRegression()

In [18]:
# Training the first regressor and noting the time it takes to train
train_start = time.time()
linear_1.fit(X_train, y_train)
train_finish = time.time()
print('Time to train: ', str(round(train_finish-train_start, 3)))


Time to train:  4.097


In [19]:
# predicting new values with the first regressor
preds = linear_1.predict(X_test)

# calculating R^2 and Mean Absolute Percentage Error
MAPE = mean_absolute_percentage_error(y_test, preds)
R2 = r2_score(y_test, preds)

print('Baseline test MAPE score: ', MAPE)
print('Baseline test R^2 score: ', R2)

Baseline test MAPE score:  6.759422418739245
Baseline test R^2 score:  0.39157992817626586
Baseline test R^2 score:  0.39157992817626586


Here it's clear from the R^2 score that my initial model is not all that accurate.  Additionally, scikit-learn's mean_absolute_percentage_error returns a fraction rather than a percentage, so a score of 6.76 means the model's revenue predictions are off by roughly 676% on average.  The baseline linear regression model is therefore far from an accurate guess at revenue.
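How to read the MAPE value depends on scikit-learn's convention: mean_absolute_percentage_error reports the relative error as a fraction, not as a number between 0 and 100. The made-up figures below are only there to illustrate that convention.

```python
from sklearn.metrics import mean_absolute_percentage_error

actual = [100.0, 200.0]
predicted = [110.0, 180.0]  # each prediction is off by 10% of the actual value

score = mean_absolute_percentage_error(actual, predicted)
print(score)            # relative error as a fraction
print(f"{score:.0%}")   # the same value formatted as a percentage
```

By the same reading, a score above 1 means the average error is larger than the actual value itself.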

It's not clear from this whether the low scores are a result of overfitting, or because I need to choose a different model for this data set.  Prior to moving forward I need to determine how well the model predicts on the training data itself.

In [20]:
# To get a sense for overfit I also want to see how well the model performs on the training data

train_preds = linear_1.predict(X_train)

MAPE = mean_absolute_percentage_error(y_train, train_preds)
R2 = r2_score(y_train, train_preds)

print('Baseline training MAPE score: ', MAPE)
print('Baseline training R^2 score: ', R2)

Baseline training MAPE score:  6.593331256826679e-06
Baseline training R^2 score:  0.9999999999999856


It's abundantly clear at this point that the issue is that the model is too complex and overfitted to the training data.  I have several options to remedy this but the first step I can take is to reduce the number of features that I have significantly.  
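Regularization is one of the other options, and it is what the Lasso check planned at the top of this notebook would look like. The sketch below runs on synthetic data with more features than observations, not on the box office data. The alpha value is an arbitrary assumption that would need tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 80 observations, 200 features, only the first 5 carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 200))
true_coef = np.zeros(200)
true_coef[:5] = [5.0, -4.0, 3.0, -2.0, 1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

ols = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.5, max_iter=10_000).fit(X_train, y_train)  # alpha is a guess

# A large gap between train and test R^2 is the overfitting signature seen above
for name, model in [("OLS", ols), ("Lasso", lasso)]:
    print(name,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))

# Features whose Lasso coefficient is exactly zero are candidates for dropping
kept = np.flatnonzero(lasso.coef_)
print("Features kept by Lasso:", kept.size, "of", X.shape[1])
```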

At this point I have several options.  Currently I can bin the production companies and keywords categories into fewer features, and later I'll have access to the cast & crew information as well.  From a business perspective, when a studio is presented a film as an option to invest in, they control who is cast in the film, who works on its crew, and which other studios they collaborate with.  What they don't have control over when presented with a script is what the story **is** - it's already been written.  If the core story of a film is not worth much, then it doesn't make sense to sink additional resources into a quality cast/crew, high-value filming on location in other countries, or collaboration with other studios.

There are several features that can be used to characterize a story which are immutable about that film.  One is whether it's part of a larger story - a collection - or not.  Another feature that a studio has no control over is what genre a film is; if the screenwriter has written a thriller, it's rather difficult to change that into a romantic comedy.  Finally, there are keywords that can describe a film's nature as well.  Films about detectives and murder can't easily be changed into family-friendly animation movies.

I've already reduced the dimensionality of the collections category to 1, and the genres category is small enough that I don't see reducing its dimensionality as effective.  I either need to strip the keywords category down to a few features or eliminate it entirely.

My instinct is to drop the category in its entirety, and see how that affects the performance metrics.  The most frequently appearing keyword only appears in 188 films, and the top three most frequent keywords all indicate that a film is part of a larger film collection.  The other frequently occurring keywords indicate if there is a female director for a film, and if the film involves a murder.  All of these features of a film are represented in some way by other categories that the data set contains.

In [21]:
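# Prior to just dropping all 7,000+ features I want to double check the distribution of the Keywords

# Create a new data frame of the keywords category
kwords = boxoffice['Keywords']
# Count the 0s and 1s in each keyword column (pd.Series.value_counts replaces the deprecated pd.value_counts)
count = kwords.apply(pd.Series.value_counts)
x = list(kwords.columns)
# The row labelled 1 holds the number of films each keyword appears in
y = count.loc[1]
kwords.shape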

(2333, 7134)

In [22]:
# Selecting all keywords that appear at least 5 times in the data set
y = y.sort_values(ascending=False)
z = y[y >= 5]
z.shape

(1000,)

Only about 1 in every 7 keywords appears in at least **five** films, so the majority of these features are not useful.

In [23]:
# Checking how many keywords appear in at least 50 films
k = z[z >= 50]
k

duringcreditsstinger    188
based on novel          146
aftercreditsstinger     110
woman director           98
murder                   96
sequel                   90
3d                       85
dystopia                 83
violence                 82
revenge                  67
friendship               61
biography                59
No Keywords              59
superhero                54
love                     53
new york                 53
london england           52
magic                    51
family                   51
based on comic           51
musical                  50
police                   50
alien                    50
Name: 1, dtype: int64

Only 23 of the 7,134 keywords actually appear in 50 or more films.  Five of those keywords indicate that a film is part of a collection, another ten can easily be mapped to genres, and two indicate the filming location.  I'm going to drop the keywords data.  With the distribution among films being so extreme, it's not worth binning the keywords.  If this dramatically affects the model's accuracy I can always add these 23 columns back into the data set from the kwords data frame.
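Should the frequent keywords need to come back later, the step could look roughly like the sketch below. It uses small toy frames in place of kwords and boxoffice, and a threshold of 3 films standing in for the 50-film cut-off. The names `kwords_toy`, `main_toy`, and `min_films` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the kwords frame: one 0/1 column per keyword
kwords_toy = pd.DataFrame({
    "sequel": [1, 1, 1, 0],
    "murder": [1, 0, 1, 1],
    "rare keyword": [0, 0, 0, 1],
})

# Toy stand-in for the main frame after the Keywords category was dropped
main_toy = pd.DataFrame(
    [[1, 0], [0, 1], [1, 1], [0, 0]],
    columns=pd.MultiIndex.from_tuples([("Genre", "Action"), ("Genre", "Comedy")]),
)

min_films = 3  # stands in for the 50-film threshold
frequent = kwords_toy.loc[:, kwords_toy.sum() >= min_films].copy()

# Rebuild the two-level column index before joining back onto the main frame
frequent.columns = pd.MultiIndex.from_product([["Keywords"], frequent.columns])
restored = pd.concat([main_toy, frequent], axis=1)
print(restored.columns.tolist())
```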

In [24]:
boxoffice.drop('Keywords', level=0, axis=1, inplace=True)

#### Testing how a linear regression model works on the smaller data set

In [25]:
y = boxoffice['Numerical', 'revenue']
X = boxoffice.drop('revenue', level=1, axis=1)

In [26]:
# Creating a train/test split on the smaller data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [27]:
linear_2 = LinearRegression()

In [28]:
# Training the second regressor and noting the time it takes to train
train_start = time.time()
linear_2.fit(X_train, y_train)
train_finish = time.time()
print('Time to train: ', str(round(train_finish-train_start, 3)))

Time to train:  2.397


In [29]:
preds = linear_2.predict(X_test)
MAPE = mean_absolute_percentage_error(y_test, preds)
R2 = r2_score(y_test, preds)

print('No keywords test MAPE score: ', MAPE)
print('No keywords test R^2 score: ', R2)

No keywords test MAPE score:  1828.0863028094454
No keywords test R^2 score:  -1440938.821415582


In [30]:
train_preds = linear_2.predict(X_train)
MAPE = mean_absolute_percentage_error(y_train, train_preds)
R2 = r2_score(y_train, train_preds)

print('No keywords train MAPE score: ', MAPE)
print('No keywords train R^2 score: ', R2)

No keywords train MAPE score:  1.0640574186591432
No keywords train R^2 score:  0.9637359443804455


Clearly, cutting out the Keywords category had a detrimental effect on the accuracy of the model on the test set.  However, I still haven't scaled the numerical data, and I suspect that may be affecting things.

## Scaling Numerical Data

In [31]:
# Display the unscaled numerical columns before applying any scaling
boxoffice['Numerical']

Unnamed: 0,budget,popularity,runtime,revenue,Overview_length,Tag_length
0,40000000.0,8.248895,113.0,134734481.0,393.0,60.0
1,3300000.0,64.299990,105.0,48982041.0,130.0,47.0
2,8000000.0,0.743274,83.0,3418605.0,150.0,0.0
3,14000000.0,7.286477,92.0,85446075.0,208.0,36.0
4,35000000.0,6.902423,100.0,4259710.0,397.0,27.0
...,...,...,...,...,...,...
2328,35000000.0,15.968492,93.0,95437994.0,292.0,44.0
2329,55000000.0,18.079094,122.0,71171825.0,340.0,23.0
2330,4000000.0,8.418662,144.0,3190832.0,93.0,62.0
2331,155000000.0,30.188198,126.0,440603537.0,321.0,16.0
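A minimal sketch of what the scaling step could look like, using a few rows from the table above in a flat toy frame. StandardScaler is an assumption here since no scaler has been chosen yet. The revenue column is left out because it is the target, and the scaler is fitted on the training split only so that no information from the test rows leaks into it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A handful of rows taken from the Numerical table (flat columns for simplicity)
toy = pd.DataFrame({
    "budget": [40000000.0, 3300000.0, 8000000.0, 14000000.0, 35000000.0, 35000000.0],
    "popularity": [8.248895, 64.299990, 0.743274, 7.286477, 6.902423, 15.968492],
    "runtime": [113.0, 105.0, 83.0, 92.0, 100.0, 93.0],
    "revenue": [134734481.0, 48982041.0, 3418605.0, 85446075.0, 4259710.0, 95437994.0],
})

feature_cols = ["budget", "popularity", "runtime"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy["revenue"], test_size=0.25, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=feature_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=feature_cols, index=X_test.index
)
print(X_train_scaled.round(3))
```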
