# Wide and Deep Neural Network

### Introduction to Our Problem
There are literally tens of thousands of movies out there today. While some do great at the box office and bring in a lot of money, others flop, making only a fraction of what the top hits earn. What if we had a scientific way of accurately predicting how much revenue a movie would generate over its lifetime? Through machine learning, we believe that we can!

The dataset we are using is found on <a href="https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset">Kaggle</a>. It consists of 5000+ movies scraped from the review site IMDB using a Python library called Scrapy. There is quite a bit of data recorded for each movie, so we had a lot to work with to try to predict the next big hit. The features recorded for each movie are:

Basic Info:
- movie title
- color (black and white or color)	
- duration of the movie
- director name
- gross (total revenue)
- genres (a list of different genres ascribed to the movie)
- number of faces in movie poster
- language of the movie
- country the movie was produced in
- content rating (G, PG, PG-13, R, NC-17)
- budget
- year of release
- aspect ratio
- name of the 3rd actor
- name of the 2nd actor
- name of the 1st actor

Facebook Info:
- number of director facebook likes
- number of facebook likes for the whole cast
- number of the movie's facebook likes
- number of the 3rd actor's facebook likes
- number of the 2nd actor's facebook likes
- number of the 1st actor's facebook likes

IMDB Specific Info:
- number of imdb users who rated the movie
- number of critical reviews for the movie
- number of users who left a review
- imdb score
- top plot keywords


With all of this data collected on so many movies, we hope to be able to use this to build out a combined wide and deep neural network  to accurately predict the financial success (measured in categories of gross revenue: low, low-mid, high-mid, and high) of a movie. We think that this could be a useful tool to anyone in the movie industry who is concerned with making a profit on their movie. It could also help a producer understand which of these features are the most important to an accurate prediction, what content rating is most important, how budget affects outcome, etc.


We believe that the algorithm would have to predict with a relatively low cost (under ~30) to be found useful by movie directors, producers, etc. 

### Data Pre-processing:
We made a number of changes to both the original csv obtained from kaggle before we loaded it and to the data once it was loaded in.


Pre-processing of the CSV:
- We first removed the imdb link from the csv because we knew we would never need to use that (**Note: this was the only feature removed from the csv**)
- We then deleted all of the movies that were made in another country (foreign films). We did this because we wanted to look only at American films, and because the budget and gross for those movies were recorded in their native currency rather than USD. With changing exchange rates, those values are not easy to compare across countries.
- We then converted all 0 values for gross, movie_facebook_likes, and director_facebook_likes to a blank value in the csv (so that pandas reads them in as NaN), so that we can more easily impute values later. Note: according to the description on the kaggle entry, some movies had missing data because of the way the data was scraped, and the Python scraper recorded these values as 0 instead of NaN.
- We then removed all movies with an undefined gross. Since gross is the feature we are trying to predict, we should not impute values for it to train our model. Doing so would basically reduce our model to an imputation algorithm.
- We then removed all movies that were made before 1935. There were only a handful of movies from 1915 to 1935, and the way we are classifying budget (described below) would not work with such a small sample of movies from that time period. We could have cut at a different year (say 1960), but we didn't want to exclude such classics as "Bambi" or "Gone With the Wind".
- Lastly, we had to adjust the gross revenue and budget values for inflation, since the movies span many years. We obtained a csv of the consumer price index (CPI) for every month since 1947. To simplify, we took the value for January of each year and used it for the whole year. We then calculated, for each year, the ratio of the 2017 CPI to that year's CPI, and multiplied each movie's budget and gross by the ratio for its release year. We then exported the result to the csv that we use for the rest of this lab. **NB:** This was done outside of this notebook because the process took a very long time when it was run inside the notebook every time.
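
The inflation adjustment in the last step can be sketched as follows. This is only an illustration of the ratio idea: the CPI numbers and the movie rows are made up, and the `budget_2017` and `gross_2017` column names are hypothetical, not the ones in our exported csv.

```python
import pandas as pd

# Hypothetical January CPI values per year (illustrative numbers only).
cpi_by_year = {1990: 125.0, 2000: 170.0, 2017: 250.0}

# Toy movie table; the input column names mirror the dataset, the values do not.
movies = pd.DataFrame({
    "title_year": [1990, 2000, 2017],
    "budget": [10_000_000.0, 20_000_000.0, 30_000_000.0],
    "gross": [50_000_000.0, 40_000_000.0, 30_000_000.0],
})

# Ratio that converts dollars of a given year into 2017 dollars.
ratio = movies["title_year"].map(lambda year: cpi_by_year[2017] / cpi_by_year[year])

movies["budget_2017"] = movies["budget"] * ratio
movies["gross_2017"] = movies["gross"] * ratio
print(movies[["title_year", "budget_2017", "gross_2017"]])
```

A movie released in 2017 keeps its original values, while older movies are scaled up by how much prices have risen since their release year.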

Pre-processing of the Data:
- After the above steps, we made more edits to the data using pandas. First, we removed features that we thought would not be useful to our prediction algorithm. We removed all features concerning facebook likes. We did this because a significant portion of the movies in the training set debuted before facebook was invented and widely adopted. While some of these movies have received retroactive "likes" on facebook, only the most famous classics received a substantial number of them. Most lesser known films received very few "likes" (presumably because modern movie watchers don't really care to search for lesser known movies on facebook, or because the movie doesn't have a facebook page). For this reason we decided to remove movie_facebook_likes.
- Likewise, we removed the other "likes" for the same reasons as above. For example, the esteemed director George Lucas has a total of 0 "likes" across all of his films. This feature obviously would not help us predict the profitability of movies.
- We also removed irrelevant information such as aspect_ratio, language, and country. Because we deleted all foreign films, the country will always be USA. A simple filter of the data reveals that there are no more than 20 movies made in the US that use a language other than English, so there is not enough data to use language as a training feature. However, we did not delete the movies in a different language, because most of them were famous films such as *Letters from Iwo Jima* and *The Kite Runner*. We still count them as a valuable part of the dataset; we just don't find the language of particular value. Lastly, we removed aspect_ratio because it seems to be unimportant for predicting the success of a movie.
- Lastly, we removed other features that would be difficult to use in our machine learning model such as actor names and plot keywords. We initially tried to include these in our model using one-hot encoding, but the resultant array was so enormous that the model would take a very, very long time to train.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("inflation_corrected_dataset.csv")
for x in ['movie_facebook_likes', 'director_facebook_likes', 'actor_2_facebook_likes', 
          'actor_1_facebook_likes','actor_3_facebook_likes', 'cast_total_facebook_likes',
          'aspect_ratio', 'language', 'country', 'plot_keywords', 'actor_3_name', 'actor_2_name', 'movie_title', 'genres', 'color']:
    if x in df:
        del df[x]
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3222 entries, 0 to 3221
Data columns (total 12 columns):
director_name             3222 non-null object
num_critic_for_reviews    3219 non-null float64
duration                  3221 non-null float64
gross                     3222 non-null int64
actor_1_name              3220 non-null object
num_voted_users           3222 non-null int64
facenumber_in_poster      3216 non-null float64
num_user_for_reviews      3221 non-null float64
content_rating            3196 non-null object
budget                    3062 non-null float64
title_year                3222 non-null int64
imdb_score                3222 non-null float64
dtypes: float64(6), int64(3), object(3)
memory usage: 302.1+ KB
None


Below we group the rows by director_name and fill each missing numeric value with the median for that director, imputing as many values as we can. Rows that still have missing values afterwards are dropped.

In [2]:
# Tamper with the groupings to improve imputations? How do we improve how many values get imputed?
df_grouped = df.groupby(by=['director_name'])
# director_name adds about 50 rows (imputes about 50 rows and then deletes about 100)

In [3]:
df_imputed = df_grouped.transform(lambda grp: grp.fillna(grp.median()))
col_deleted = list( set(df.columns) - set(df_imputed.columns)) #in case the median op deleted columns
df_imputed[col_deleted] = df[col_deleted]

# drop rows that still have missing values after imputation
df_imputed.dropna(inplace=True)
print(df_imputed.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3127 entries, 0 to 3220
Data columns (total 12 columns):
num_critic_for_reviews    3127 non-null float64
duration                  3127 non-null float64
gross                     3127 non-null int64
num_voted_users           3127 non-null int64
facenumber_in_poster      3127 non-null float64
num_user_for_reviews      3127 non-null float64
budget                    3127 non-null float64
title_year                3127 non-null int64
imdb_score                3127 non-null float64
content_rating            3127 non-null object
director_name             3127 non-null object
actor_1_name              3127 non-null object
dtypes: float64(6), int64(3), object(3)
memory usage: 317.6+ KB
None
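
To make the group-wise imputation concrete, here is a minimal sketch on a made-up table with a single numeric column. It is not the exact call used above (which transforms every column at once); the toy values and the `toy` and `toy_imputed` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B", "C"],
    "budget": [10.0, np.nan, 30.0, 5.0, 7.0, np.nan],
})

# Fill each missing budget with the median budget of the same director.
toy["budget"] = (
    toy.groupby("director_name")["budget"]
       .transform(lambda s: s.fillna(s.median()))
)

# Director C has no known budget at all, so that row stays NaN and is dropped.
toy_imputed = toy.dropna()
print(toy_imputed)
```

This also shows why some rows are lost: a director with no observed value for a feature has no median to impute from.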


### Scaling the Data
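Below we prepare the data for scaling, so as not to adversely affect the gamma value. Our intent was to scale the value of budget down to roughly between -1 and 1. Note that the cell below only extracts the budget values and resets the index; the StandardScaler itself is applied to the numeric columns in a later cell.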

In [4]:
%%time
#scaling budgets!
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_rows', 200)

budget = df_imputed['budget'].values.reshape(-1, 1)
df_imputed.reset_index(drop=True, inplace=True)
print("df: ",df_imputed.shape)

append_list = [df_imputed]

df = pd.concat(append_list, axis=1)

print(df.shape)

df:  (3127, 12)
(3127, 12)
Wall time: 281 ms
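
As a side note on what the scaler does: standardization (what StandardScaler computes) gives a column zero mean and unit variance, but it does not by itself confine values to the range -1 to 1. The sketch below compares it with a min-max rescaling using plain numpy on made-up numbers.

```python
import numpy as np

budget = np.array([1.0, 2.0, 3.0, 4.0, 40.0])  # made-up, skewed values

# Standardization: subtract the mean, divide by the population standard deviation.
standardized = (budget - budget.mean()) / budget.std()

# Min-max rescaling to the interval [-1, 1].
minmax = 2 * (budget - budget.min()) / (budget.max() - budget.min()) - 1

print(standardized.round(3))
print(minmax.round(3))
```

With skewed data such as movie budgets, the largest standardized values can land well above 1, so a min-max style rescaling is the one to reach for if a hard bound is required.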


### Cutting the gross into categories
Below we cut the inflation-adjusted gross into 4 main categories: low, low-mid, high-mid, and high. We did this because the model would not be able to predict the raw gross value accurately. We used the "qcut" function, which bins by quantile, so that the movies are evenly distributed among the four classes. When we used the normal "cut" method, which uses bins of equal width, most of the movies fell into the lowest category and threw off our predictions.

In [5]:
# quartile bins, ordered from lowest to highest gross
labels = ["low", "low-mid", "high-mid", "high"]
df['gross_group'] = pd.qcut(df['gross'], 4, labels=labels)

df = df.drop('gross', axis=1)
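
The difference between the two binning methods is easy to see on a small, made-up, heavily skewed series (the numbers below are illustrative only):

```python
import pandas as pd

gross = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # made-up, heavily skewed
labels = ["low", "low-mid", "high-mid", "high"]

equal_width = pd.cut(gross, 4, labels=labels)    # bins of equal width
equal_count = pd.qcut(gross, 4, labels=labels)   # bins of equal population

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

With equal-width bins a single outlier stretches the range so far that almost everything lands in "low", while the quantile bins keep the classes balanced.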


## Evaluation
### Choosing Evaluation Metrics

For our dataset, accuracy is not the best evaluation metric, because it does not properly account for false positives. For our business case, a false positive is MUCH worse than a false negative. It would be very bad to predict that a movie will gross high when in fact it grosses low. However, if we predict that the movie will gross low and it ends up grossing high, that isn't as bad: the director will either be pleasantly surprised or will have chosen not to undertake the filming in the first place. It is better to not film and miss out on the potential money than to undertake the film thinking that it will be lucrative when in fact it is not.

Because we are using a multi-class classification model, we cannot simply use precision, recall, or f1 score. Instead we construct a cost matrix whose weights correspond to the different combinations of true and predicted class. Below we have our cost matrix defined, with rows as the true class and columns as the predicted class, ordered from low to high. As you can see, the worst false positive (a low-grossing movie predicted as high) has a cost of 20, and the worst false negative (a high-grossing movie predicted as low) has a cost of 6. We give them this much of a cost difference because of the aforementioned reasons about false positives. Correct predictions have a cost of negative one, and the other numbers in the matrix are scaled according to how bad the corresponding mistake would be.

In [6]:
from sklearn.metrics import confusion_matrix, roc_curve, auc, accuracy_score

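# rows: true class, columns: predicted class (intended order: low, low-mid, high-mid, high)
# over-predictions (above the diagonal) cost more than under-predictions (below it)
cost_matrix = np.array([[-1, 10, 14, 20],
                        [ 2, -1, 10, 14],
                        [ 4,  2, -1, 10],
                        [ 6,  4,  2, -1]])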
print(cost_matrix)

[[-1 10 14 20]
 [ 2 -1 10 14]
 [ 4  2 -1 10]
 [ 6  4  2 -1]]
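
To show how this matrix turns into a single score, here is a small sketch that multiplies it element-wise with a confusion matrix and sums the result. The confusion matrix is made up, and the example assumes class indices 0 to 3 correspond to low through high.

```python
import numpy as np

cost_matrix = np.array([[-1, 10, 14, 20],
                        [ 2, -1, 10, 14],
                        [ 4,  2, -1, 10],
                        [ 6,  4,  2, -1]])

# Made-up confusion matrix: rows are true classes, columns are predictions,
# which is the layout sklearn's confusion_matrix returns.
conf = np.array([[8, 1, 0, 1],
                 [2, 6, 2, 0],
                 [0, 1, 7, 2],
                 [1, 0, 2, 7]])

total_cost = int(np.sum(conf * cost_matrix))  # element-wise product, then sum
cost_per_movie = total_cost / conf.sum()      # normalizes for the fold size
print(total_cost, cost_per_movie)
```

Dividing by the number of test movies gives a per-movie cost, which is easier to compare across folds or datasets of different sizes than the raw total.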


### Dividing up training/testing data, Scaling

For our dataset we want to use stratified 5-fold cross validation, because it keeps the four gross classes equally represented in every fold. Below we create the StratifiedKFold object, which we then use many times later on in the lab. We selected this method instead of a simple 80/20 split because we knew that we wanted to test on multiple sets of data, instead of just the same one, so as to avoid data snooping and improper parameter tuning. We chose not to use the shuffle option because we wanted to compare our combined deep learning classifier to a standard MLP using the same indices for training and testing data.

In [1]:
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
cv = StratifiedKFold(n_splits=5)
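
A quick way to see what stratification buys us is to count the classes in each test fold. The labels below are made up (four balanced classes), and `X_toy`, `y_toy`, and `cv_toy` are hypothetical names used only for this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: four balanced classes, 10 samples each.
y_toy = np.repeat([0, 1, 2, 3], 10)
X_toy = np.zeros((len(y_toy), 1))  # the features do not influence the split

cv_toy = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv_toy.split(X_toy, y_toy):
    # count how many samples of each class ended up in this test fold
    print(np.bincount(y_toy[test_idx], minlength=4))
```

Because shuffling is off, calling `split` again returns the same indices, which is what lets two different models be compared on identical folds.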

### Processing Different Feature Inputs
Below we set up a process input function that can handle various types of feature data types and prepares the features to be used in creating the wide and deep network.

In this first block we set the categorical and numeric headers and encode the categorical data to use integers to represent the different categories.

In [8]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
encoders = dict() 
categorical_headers = ['content_rating', 'actor_1_name', 'director_name']

for col in categorical_headers+['gross_group']:
    
    if col=="gross_group":
        # note: LabelEncoder numbers the classes in sorted (alphabetical) label order
        tmp = LabelEncoder()
        df[col] = tmp.fit_transform(df[col])
    else:
        encoders[col] = LabelEncoder()
        df[col+'_int'] = encoders[col].fit_transform(df[col])

numeric_headers = ['num_critic_for_reviews', 'duration', 'num_voted_users', 
                   'facenumber_in_poster', 'num_user_for_reviews', 'title_year', 'imdb_score']

for col in numeric_headers:
    df[col] = df[col].astype(float)  # np.float was only an alias for the builtin float
    
    ss = StandardScaler()
    df[col] = ss.fit_transform(df[col].values.reshape(-1, 1))
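
One detail worth checking when integer codes are later matched against an ordered structure such as the cost matrix is the order in which the encoder numbers the classes. The sketch below, on made-up values, contrasts LabelEncoder with the codes of an ordered pandas categorical; it is an illustration, not part of the pipeline above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = ["low", "low-mid", "high-mid", "high"]
groups = pd.Series(pd.Categorical(["low", "high", "low-mid", "high-mid"],
                                  categories=labels, ordered=True))

# LabelEncoder assigns integers by sorted label text.
alphabetical = LabelEncoder().fit(groups.astype(str))
print(list(alphabetical.classes_))

# The categorical codes follow the declared low-to-high order instead.
ordered_codes = groups.cat.codes.tolist()
print(ordered_codes)
```

If the intent is for 0 to mean "low" and 3 to mean "high", the categorical codes give that directly, whereas the alphabetical encoding puts "high" first.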


Here we define our process input function, which takes the different headers and prepares them to be used to make the model. Specifically, it wraps each categorical column in a sparse tensor so that TensorFlow can treat the values as one-hot encoded. Because TensorFlow works with these sparse representations directly, the one-hot arrays can be processed extremely quickly compared to using dense one-hot arrays with a normal MLP. With TensorFlow we are able to use categorical features that we would not have been able to use otherwise!

In [9]:
# Prepare the inputs for the wide and deep model
# https://www.tensorflow.org/tutorials/wide_and_deep
def process_input(df, label_header, categ_headers, numeric_headers):
    # input: DataFrame, label column name (or None), categorical and numeric column names
    # output: (dict of feature columns as tensors), (labels as tensors)
    
    # ========Process Inputs=========
    # the numeric columns are passed through as dense tf.constant tensors
    continuous_cols = {k: tf.constant(df[k].values) for k in numeric_headers}
    # and we shift these tensors to be sparse one-hot encoded values
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
                              indices=[[i, 0] for i in range(df[k].size)],
                              values=df[k].values,
                              dense_shape=[df[k].size, 1])
                        for k in categ_headers}
    
    # Merges the two dictionaries into one.
    feature_cols = dict(categorical_cols)
    feature_cols.update(continuous_cols)
    
    # Convert the label column into a constant Tensor.
    label = None
    if label_header is not None:
        label = tf.constant(df[label_header].values)
        
    return feature_cols, label
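
For readers unfamiliar with the sparse layout, the (indices, values, dense_shape) triple built above can be mimicked in plain numpy. This sketch uses made-up content ratings and does not call TensorFlow; it only shows what the triple encodes.

```python
import numpy as np

ratings = np.array(["PG", "R", "G"])  # made-up categorical values for three rows

# Same layout as in process_input: one entry per row, always in column 0.
indices = [[i, 0] for i in range(ratings.size)]
values = ratings
dense_shape = [ratings.size, 1]

# Rebuilding the dense array shows what the sparse triple describes.
dense = np.empty(dense_shape, dtype=object)
for (row, col), value in zip(indices, values):
    dense[row, col] = value
print(dense.ravel())
```

Each row carries exactly one category, so the tensor is an n-by-1 column; the expansion into one-hot positions happens later, inside the feature columns.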


# Modeling
### Create a combined wide and deep network to classify data with tensorflow
Below we import the necessary TensorFlow libraries and prepare to build the wide and deep network.

In [10]:
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers
from tensorflow.contrib.learn.python import SKCompat
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib
tf.logging.set_verbosity(tf.logging.ERROR) # control the verbosity of tensor flow

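Here we set up the feature columns for the first of the two wide and deep architectures. Every categorical column goes into the wide part as a sparse column and into the deep part as an embedding, while the numeric columns go into the deep part as real-valued columns. We also cross the main actor and director columns, because we believe that particular actor and director pairings carry the most signal and would yield the most interesting and performant results. This first setup is meant for the deeper of the two networks; the depth itself is set by dnn_hidden_units when the classifier is created. We will compare the performance of the two later to see whether a shallow or deep network works better.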

In [11]:
# build the feature columns for the wide and deep parts of the model
def setup_wide_deep_columns_1():
    # input:  none (uses the global categorical_headers, numeric_headers, and encoders)
    # output: (wide_columns, deep_columns)
    
    wide_columns = []
    deep_columns = []
    # add in each of the categorical columns to both wide and deep features
    for col in categorical_headers:
        wide_columns.append(
            layers.sparse_column_with_keys(col, keys=encoders[col].classes_)
        )
        
        dim = round(np.log2(len(encoders[col].classes_)))
        deep_columns.append(
            layers.embedding_column(wide_columns[-1], dimension=dim)
        )
        
    # also add in some specific crossed columns
    cross_columns = [('actor_1_name', 'director_name')]
    for tup in cross_columns:
        wide_columns.append(
            layers.crossed_column(
                [layers.sparse_column_with_keys(tup[0], keys=encoders[tup[0]].classes_),
                 layers.sparse_column_with_keys(tup[1], keys=encoders[tup[1]].classes_)],
            hash_bucket_size=int(1e4))
        )
   
    # and add in the regular dense features 
    for col in numeric_headers:
        deep_columns.append(
            layers.real_valued_column(col)
        )
    return wide_columns, deep_columns
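
Two choices in the function above are worth illustrating: the embedding width, which is the rounded base-2 logarithm of the number of categories, and the crossed column, which hashes each (actor, director) pair into a fixed number of buckets. The sketch below uses hypothetical category counts and a stand-in hash (`zlib.crc32`); TensorFlow uses its own hashing internally, so the bucket numbers will not match.

```python
import zlib
import numpy as np

# Hypothetical category counts; the real ones come from encoders[col].classes_.
n_categories = {"content_rating": 12, "actor_1_name": 1400, "director_name": 1500}

# Embedding width as used above: log2 of the number of categories, rounded.
embedding_dims = {col: int(round(np.log2(n))) for col, n in n_categories.items()}
print(embedding_dims)

def crossed_bucket(actor, director, hash_bucket_size=int(1e4)):
    """Map an (actor, director) pair to one of hash_bucket_size buckets.

    crc32 is a stand-in chosen for reproducibility, not TensorFlow's hash.
    """
    key = f"{actor}_X_{director}".encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size

print(crossed_bucket("Harrison Ford", "George Lucas"))
```

Hashing keeps the crossed feature at a fixed size of 10,000 weights instead of one weight per possible pair, at the price of occasional collisions between unrelated pairs.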

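This second setup of the wide and deep network columns is for a shallower network. The column definitions are the same as in the first setup; only the hidden layer sizes passed to the classifier differ. We believe that this one will perform better once we compare them. Again, we cross actor and director name.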

In [12]:
# build the feature columns for the wide and deep parts of the model
def setup_wide_deep_columns_2():
    # input:  none (uses the global categorical_headers, numeric_headers, and encoders)
    # output: (wide_columns, deep_columns)
    
    wide_columns = []
    deep_columns = []
    # add in each of the categorical columns to both wide and deep features
    for col in categorical_headers:
        wide_columns.append(
            layers.sparse_column_with_keys(col, keys=encoders[col].classes_)
        )
        
        dim = round(np.log2(len(encoders[col].classes_)))
        deep_columns.append(
            layers.embedding_column(wide_columns[-1], dimension=dim)
        )
        
    # also add in some specific crossed columns
    cross_columns = [('actor_1_name', 'director_name')]
    for tup in cross_columns:
        wide_columns.append(
            layers.crossed_column(
                [layers.sparse_column_with_keys(tup[0], keys=encoders[tup[0]].classes_),
                 layers.sparse_column_with_keys(tup[1], keys=encoders[tup[1]].classes_)],
            hash_bucket_size=int(1e4))
        )

        
    # and add in the regular dense features 
    for col in numeric_headers:
        deep_columns.append(
            layers.real_valued_column(col)
        )
                    
    return wide_columns, deep_columns

Finally, we instantiate and run the combined deep neural network classifiers, one for the deeper architecture and one for the shallow. For every cross validation fold we train both classifiers and store the confusion matrix of each on the test split. We then use those confusion matrices to see which network has the better generalization performance.

In [13]:
X = df.drop('gross_group', axis=1)
y = df['gross_group']

In [14]:
%%time

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
cv = StratifiedKFold(n_splits=5)
classes = [0, 1, 2, 3]
conf_list = [[],[]]
roc_list = []
auc_list = []
mlp_confusion = []
for train_idx, test_idx in cv.split(X,y):
    df_train, df_test = df.loc[train_idx], df.loc[test_idx]    

    
    input_wrapper = lambda:process_input(df_train,'gross_group',categorical_headers, numeric_headers)
    output_wrapper = lambda:process_input(df_test,None,categorical_headers, numeric_headers)
    
    wide_columns, deep_columns = setup_wide_deep_columns_1()
    clf1 = learn.DNNLinearCombinedClassifier(
                            n_classes=4,
                            linear_feature_columns=wide_columns,
                            dnn_feature_columns=deep_columns,
                            dnn_hidden_units=[80, 50, 50])

    clf1.fit(input_fn=input_wrapper, steps=300)
    
    y_test = df_test['gross_group'].values
    yhat1 = clf1.predict(input_fn=output_wrapper)
    # the output is now an iterable value, so we need to step over it
    yhat1 = [x for x in yhat1]
    yhat1_probs = [x for x in clf1.predict_proba(input_fn=output_wrapper)]

    conf_list[0].append(confusion_matrix(y_test, yhat1))

    

    clf2 = learn.DNNLinearCombinedClassifier(
                            n_classes=4,
                            linear_feature_columns=wide_columns,
                            dnn_feature_columns=deep_columns,
                            dnn_hidden_units=[80])

    clf2.fit(input_fn=input_wrapper, steps=300)
    yhat2 = clf2.predict(input_fn=output_wrapper)
    yhat2 = [x for x in yhat2]
    yhat2_probs = [x for x in clf2.predict_proba(input_fn=output_wrapper)]

    conf_list[1].append(confusion_matrix(y_test, yhat2))
    
    


Wall time: 2min 10s


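We expected the shallow network to be better because our dataset is small, but the costs printed below do not clearly support that. The shallow network (DNN #2) has the lower cost in two of the five folds, while the deeper network (DNN #1) is lower in the other three and slightly lower on average (about 1784 versus 1801). With differences this small compared to the variation between folds, neither architecture is a clear winner.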

In [15]:
import matplotlib.pyplot as plt
%matplotlib inline

print("Custom Evaluation Costs for DNN #1")
for x in conf_list[0]:
    print(np.sum(x * cost_matrix))
    
print("Custom Evaluation Costs for DNN #2")
for x in conf_list[1]:
    print(np.sum(x * cost_matrix))


Custom Evaluation Costs for DNN #1
1340
1939
1493
1557
2589
Custom Evaluation Costs for DNN #2
1235
2027
1762
1690
2293
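
To compare the two networks on more than a fold-by-fold reading, the per-fold costs can be summarized as below. The numbers are copied from the output above; the variable names are hypothetical.

```python
import numpy as np

# Per-fold costs copied from the output above.
costs_dnn1 = np.array([1340, 1939, 1493, 1557, 2589])  # hidden units [80, 50, 50]
costs_dnn2 = np.array([1235, 2027, 1762, 1690, 2293])  # hidden units [80]

for name, costs in [("DNN #1", costs_dnn1), ("DNN #2", costs_dnn2)]:
    print(name, "mean:", costs.mean(), "std:", round(costs.std(ddof=1), 1))

# Number of folds in which the deeper network had the lower cost.
dnn1_wins = int(np.sum(costs_dnn1 < costs_dnn2))
print("DNN #1 lower in", dnn1_wins, "of", len(costs_dnn1), "folds")
```

The gap between the two means is small relative to the spread across folds, so a summary like this is a useful check before declaring either architecture the winner.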
