# Predicting Box Office Revenue

## Preprocessing and Training Notebook

The purpose of this notebook is twofold:

First, I want to reduce the dimensionality of my data set so that the machine learning models I train on it are more tractable. Currently I have over 10,000 features and only 2,333 observations, which causes serious computational slowdown for my models. Further, when this notebook was initially written, the machine I was working on didn't have the computational power to even create dummy variables from my 'cast' and 'crew' categories.


I plan to 'bin' categorical dummy variables into new features to reduce dimensionality. This is relatively easy for features that are highly skewed (spoken language) or where each value has a low frequency in the data set compared to 'None' (collection). However, for other categories where the data is exponentially skewed and there isn't a clear way to bin it, I'll need to test performance on a basic linear/polynomial regressor. This will allow me to compare R^2 and Mean Absolute Percentage Error metrics and determine which method of dimensionality reduction yields the more accurate model.
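As a rough illustration of the two metrics mentioned above, the sketch below computes R^2 and Mean Absolute Percentage Error on a handful of invented numbers. The helper name `mape_percent` and the toy values are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np
from sklearn.metrics import r2_score


def mape_percent(y_true, y_pred):
    """Mean of |error| / |actual|, expressed as a percentage (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Toy revenue figures, invented for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

r2 = r2_score(y_true, y_pred)
mape = mape_percent(y_true, y_pred)
print(f"R^2: {r2:.3f}, MAPE: {mape:.2f}%")
```

Note that this form of MAPE divides by the actual value, so any film with a revenue of zero would need to be handled separately before the metric is computed.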

Finally, I'll use Lasso and Ridge regressors to test my initial linear regressor for overfitting; additionally, I'll be able to use the Lasso regressor coefficients to determine which features can be dropped to further reduce dimensionality and complexity.
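The idea of using Lasso coefficients to find droppable features could look something like the sketch below. It runs on synthetic data rather than the box office set, and the feature names, the `alpha=0.1` penalty, and the scaling step are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 5 features, only the first two drive the target
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Lasso penalises coefficient size, so features should share a common scale
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Hypothetical feature names for the five synthetic columns
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])

# Features whose coefficient was shrunk exactly to zero are candidates to drop
droppable = feature_names[lasso.coef_ == 0]
kept = feature_names[lasso.coef_ != 0]
print("kept:", list(kept))
print("droppable:", list(droppable))
```

Which features end up at exactly zero depends on the penalty strength, so in practice `alpha` would need to be tuned (for example by cross-validation) before trusting the list.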

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from datetime import datetime

In [27]:
# Read in CSV from EDA notebook

boxoffice = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\boxoffice_EDA.csv', index_col=0, 
                        header=[0,1])

boxoffice.shape

(2333, 10456)

In [28]:
boxoffice.isnull().sum()

Genre          Action       0
               Adventure    0
               Animation    0
               Comedy       0
               Crime        0
                           ..
Release_month  8            0
               9            0
               10           0
               11           0
               12           0
Length: 10456, dtype: int64

In [29]:
boxoffice.head(3)

Unnamed: 0_level_0,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,...,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month
Unnamed: 0_level_1,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,...,3,4,5,6,7,8,9,10,11,12
0,0,0,0,1,0,0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,1,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0


In [30]:
# Displaying names of top level of multi-index for later reference
boxoffice.columns.get_level_values(0).unique()

Index(['Genre', 'Collection', 'Company', 'Country', 'Spoken_lang', 'Keywords',
       'Descriptive', 'Numerical', 'Release_year', 'Release_month'],
      dtype='object')

## Binning Categorical Data

To begin, I want to significantly cut down on the number of features in the data set. By far the largest category is Keywords; however, the distribution of that category and several others gives me several options for binning my data. I can bin them into quartiles by either frequency or median/mean revenue in the data set.
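One way the frequency-based quartile binning could work is sketched below on a tiny invented set of keyword dummies. The column names and values are made up, and the choice to count how many of a film's keywords fall in each quartile is an assumption about how the new features might be built. Median revenue per keyword could be substituted for frequency in the same pattern.

```python
import pandas as pd

# Toy multi-hot keyword dummies (rows are films); values invented for illustration
keywords = pd.DataFrame(
    {
        "sequel":  [1, 1, 1, 1, 0, 1],
        "heist":   [1, 0, 1, 0, 0, 0],
        "robot":   [0, 1, 0, 0, 0, 0],
        "wedding": [0, 0, 1, 1, 1, 0],
    }
)

# How many films carry each keyword
freq = keywords.sum(axis=0)

# Rank first so ties cannot produce duplicate bin edges, then cut into quartiles
quartile = pd.qcut(freq.rank(method="first"), q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# One new feature per quartile: how many of a film's keywords fall in that bin
binned = keywords.T.groupby(quartile.astype(str)).sum().T
print(binned)
```

This collapses any number of keyword columns into four, at the cost of no longer knowing which specific keyword a film had.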

However, I do have several feature categories that I won't be touching: Genre, Release_year, Country, and Release_month. These categories all have fewer than 70 individual features. Additionally, I have a compelling reason to believe that each of these will have a large impact on revenue. Genre, Release_year, and Release_month were all significantly more evenly distributed across films than other categories. For these three categories, median revenue was also skewed towards certain sub-categories, which indicates that they have an impact on revenue.

While the Country category is **heavily** skewed towards films made in the United States, revenue is heavily skewed towards more 'exotic' countries. This is likely a result of blockbuster films like 'Pirates of the Caribbean' or 'The Avengers' being filmed on location elsewhere. I suspect that this category will be highly correlated with budget and may be dropped after I check the Lasso regressor coefficients. Regardless, this category only has 67 sub-categories, and consolidating those into fewer bins would lose what appears to be useful information while doing little to reduce the 10,456 dimensions that the data set currently has.

This means that I need to bin the Spoken_lang, Company, Keywords, and Collection columns.  Later I'll come back through and treat the Cast and Crew categories the same as the Keyword category once I have access to a machine with enough RAM to handle those categories. 

The best place for me to start with binning is going to be the Spoken_lang category since it's highly skewed towards the 'English' sub-category.  

#### Binning Spoken_lang

Based on the work from my EDA notebook the 'English' sub-category accounts for 2,180 films in this data set.  It's clear that the simplest way to bin this category is to reduce it to a single column that indicates if a film's primary language is English or not.

In [31]:
# Create a list of column labels to be dropped
dropped = list(boxoffice['Spoken_lang'].columns)
dropped.remove('English')

In [32]:
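# Remove all languages from Spoken_lang other than English.
# Each column is addressed by its full (category, label) pair so that a
# same-named column in another category (e.g. Keywords) is not dropped too.

boxoffice.drop(columns=[('Spoken_lang', col) for col in dropped], inplace=True)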

In [33]:
# verify that English is the only remaining column
boxoffice.Spoken_lang.head(3)

Unnamed: 0,English
0,1
1,1
2,1


#### Binning Film Collection dummy variables

Doing this directly on the multi-indexed dataframe was going to be relatively complex, so I went back to my EDA notebook and created a boolean column that indicates whether a film belongs to a collection. The final step is to cast that column to 1 or 0.

In [34]:
# Casting the column in question to a numeric data type
#boxoffice['Descriptive', 'Collection'] = boxoffice['Descriptive', 'Collection'].astype('int')
#boxoffice['Descriptive', 'Collection'].head(3)

In [35]:
boxoffice['Descriptive'].head(3)

Unnamed: 0,original_title,overview,tagline,title
0,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement
1,Whiplash,"Under the direction of a ruthless instructor, ...",The road to greatness can take you to the edge.,Whiplash
2,Pinocchio and the Emperor of the Night,"Pinocchio and his friends, a glow worm and a m...",,Pinocchio and the Emperor of the Night


While I'm working on this column in Descriptive, I'm also going to drop the overview, tagline, and original_title columns, and move the title column into a separate Series.

In [36]:
boxoffice.drop('overview', level=1, axis=1, inplace=True)
boxoffice.drop('tagline', level=1, axis=1, inplace=True)
boxoffice.drop('original_title', level=1, axis=1, inplace=True)
# Storing the titles in a separate Series for later use as needed
Titles = boxoffice['Descriptive', 'title']
boxoffice.drop('title', level=1, axis=1, inplace=True)

In [37]:
#Verify that the Descriptive category has been reduced to only the one-hot encoded column for collections
boxoffice['Descriptive'].head(3)

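KeyError: 'Descriptive'

This check fails because the output of cell 35 shows that Descriptive held only the four text columns, all of which were dropped in cell 36, so no columns remain under that category. The boolean Collection column from the EDA notebook does not appear among the columns shown above, so it will need to be present in the CSV before this verification can pass.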

## Testing the accuracy of linear regression with the data set as is

At this point I have a huge number of columns, and I haven't made significant reductions in the number of features yet. 

However once I've binned the Company and Keyword columns I'll have eliminated thousands of features.  Prior to doing this I want to get a baseline for how accurate a model is with all of these columns left in the data set. 

The reason that this is important to do now, prior to reducing the dimensionality of the Company and Keywords categories, is that each movie has multiple companies and keywords associated with it, while there is only a single language and film collection for each movie. Reducing the Spoken_lang and Collection categories doesn't eliminate complex information about each film the way binning the Company and Keywords categories will.

This will also give me a baseline of accuracy prior to scaling the numeric data that I have to work with as well.

If the model takes excessively long to train, I'll be forced to trim a lot of those columns from my data set, so I'll record and print the time required to train as well as test the accuracy of the model.

#### Renaming columns in Release_month

While working with dummy variables I think that it's wise to rename the columns for Release_month since it's possible that I'll be dropping the hierarchical index before creating a train/test split and training my model. 
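If the hierarchical index does get dropped later, one possible way to flatten it while keeping every column name unique is to join the two levels. The sketch below uses a small invented frame; the underscore separator and the column values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two-level columns shaped like the notebook's (values invented)
columns = pd.MultiIndex.from_tuples(
    [("Genre", "Action"), ("Release_month", "Jan"), ("Numerical", "budget")]
)
df = pd.DataFrame([[1, 0, 5.0e6], [0, 1, 2.0e7]], columns=columns)

flat = df.copy()
# Join the two levels so each name stays unique and traceable to its category
flat.columns = [f"{top}_{sub}" for top, sub in df.columns]
print(list(flat.columns))
```

Joining the levels rather than simply discarding the top one avoids collisions between bare labels such as '1' through '12', which is the same concern that motivates renaming the months.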

In [13]:
months = {'1':'Jan', '2':'Feb', '3':'Mar', '4':'Apr', '5':'May', '6':'June', 
         '7':'July', '8':'Aug', '9':'Sep', '10':'Oct', '11':'Nov', '12':'Dec'}
boxoffice.rename(columns=months, level=1, inplace=True)
boxoffice['Release_month'].head(3)

Unnamed: 0,Jan,Feb,Mar,Apr,May,June,July,Aug,Sep,Oct,Nov,Dec
0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,1,0,0,0,0


In [14]:
boxoffice.isnull()

Unnamed: 0_level_0,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,...,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month,Release_month
Unnamed: 0_level_1,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,...,Mar,Apr,May,June,July,Aug,Sep,Oct,Nov,Dec
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2328,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2329,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2330,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2331,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


#### Creating the train/test split

In [15]:
y = boxoffice['Numerical', 'revenue']
y.head(3)

0    134734481.0
1     48982041.0
2      3418605.0
Name: (Numerical, revenue), dtype: float64

In [16]:
X = boxoffice.drop('revenue', level=1, axis=1)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [18]:
linear_1 = LinearRegression()

In [19]:
# Time the fit with full datetime objects; two datetime.time values cannot be subtracted
train_start = datetime.now()
linear_1.fit(X_train, y_train)
train_finish = datetime.now()
Delta = train_finish - train_start
print('Time to train: ', Delta)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
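A reasonable next step would be to find out which columns trigger this error. The sketch below shows one way to list columns containing NaN or infinite values, using a small invented frame in place of `X_train`; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with one NaN and one infinite value (invented for illustration)
X_demo = pd.DataFrame(
    {
        "budget": [1.0e6, np.nan, 3.0e6],
        "runtime": [90.0, 120.0, np.inf],
        "popularity": [1.2, 3.4, 5.6],
    }
)

# Restrict to numeric columns, since np.isinf is not defined for object columns
numeric = X_demo.select_dtypes(include="number")
nan_counts = numeric.isna().sum()
inf_counts = np.isinf(numeric).sum()

# Any column with at least one NaN or infinite entry would make the fit fail
bad_columns = sorted(
    set(nan_counts[nan_counts > 0].index) | set(inf_counts[inf_counts > 0].index)
)
print(bad_columns)
```

Printing a whole `isnull()` frame only shows the truncated corners of a wide data set, so a per-column count filtered to the non-zero entries is more likely to reveal where the problem values are.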