# Project Luther - Predicting Movie Lifetime Gross

# Table of contents
1. [Introduction](#introduction)
2. [Webscrape Movie Reviews From IMDb](#Webscrape Movie Reviews From IMBD)
3. [Merging IMDb and BoxOfficeMojo Data](#Merging IMBD and BoxOfficeMojo Data)
4. [Predictive Models](#Predictive Models)
5. [Simple Test Case](#Simple Test Case)

## Introduction <a name="introduction"></a>

This project scrapes movie ratings and vote counts from www.imdb.com, cleans them, and merges them with box-office data from www.boxofficemojo.com. Linear regression, decision tree, and random forest models are then fit to predict lifetime gross; the scikit-learn linear regression produces the highest R^2, at 82%. Overall, the models rank the "votes" and "opening" features as the most important. Finally, a simple test case predicts the lifetime box-office gross of a fictional movie with a given rating, opening, and vote count.

## Webscrape Movie Reviews From IMDb <a name="Webscrape Movie Reviews From IMBD"></a>

In [2]:
# Webscraping/importing/system
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os
import json
import pickle
# Graphing
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
# Modeling
import statsmodels.formula.api as smf
import patsy
from sklearn import metrics
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module
from sklearn.ensemble import RandomForestRegressor
# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')
# Display
from IPython.display import display, HTML

In [3]:
def get_imdb_data():

#Sample website below:
#http://www.imdb.com/search/title?count=250&countries=us&languages=en&release_date
#=1972-01-01,2014-12-31&title_type=feature&view=simple&start=1
    
    base_url = "http://www.imdb.com/search/title"
    params = {
        "count": 250,
        "countries": "us",
        "languages": "en",
        "release_date": "1972-01-01,2014-12-31",
        "title_type": "feature",
        "view":"simple",
        "start": 1
    }
    data = []
    # Loop to navigate through pages by changing the start parameter
    while True:
        # Get content from IMDB
        content = requests.get(base_url, params=params).content
        params["start"] += 250
        
        soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
        
        # If no data is found, end the while loop
        if not soup.find("tr", class_=["even", "odd"]) or params["start"] > 10000:
            break
        
        # For each row, get title, year, rating, and votes. Rows are tagged with the "even" or "odd" class
        # tr = table row, td = table data cell
        for tr in soup.find_all("tr", class_=["even", "odd"]):
            title_tag = tr.find("td", class_="title") # find the td with class "title"
            title = title_tag.find("a").text # the link text is the movie title
            
            # Get the year. Strip the first and last characters, which are the parentheses
            year = title_tag.find("span", class_="year_type").text[1:-1]
            
            # Get all td tags to get to rating and votes that don't have any class to them
            # No class names on rating
            td_tags = tr.find_all("td")
            # Get ratings.  If they don't exist, make empty string
            rating_tag = td_tags[2].find("b")
            if rating_tag:
                rating = rating_tag.text
            else:
                rating = ""
            # Get votes and remove the thousands separator to get the number
            votes = td_tags[3].text.strip().replace(",", "")
            # Append the results to data
            data.append({
                    "title": title,
                    "year": year,
                    "rating": rating,
                    "votes": votes
                })
    return data
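
The two string clean-ups inside the loop are easy to check in isolation. The sketch below, written for Python 3, applies the same slicing and replacement to hard-coded sample strings. `clean_year` and `clean_votes` are hypothetical helper names that do not appear in the notebook, and the `int` conversion is an addition (the scraper keeps votes as a string).

```python
def clean_year(text):
    """Strip the surrounding parentheses, e.g. "(1972)" -> "1972"."""
    return text[1:-1]


def clean_votes(text):
    """Drop surrounding whitespace and thousands separators, then convert to int."""
    return int(text.strip().replace(",", ""))


print(clean_year("(1972)"))
print(clean_votes(" 1,097,794\n"))
```

The slice assumes the span holds nothing but a parenthesized year; any extra text inside the parentheses would be kept as part of the year string.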

In [4]:
data = get_imdb_data()

In [5]:
# Transform data to a dataframe, save as a pickle file

In [6]:
# imdb = pd.DataFrame(data)
# with open('imdb1970.pkl', 'wb') as picklefile:
#     pickle.dump(imdb, picklefile)

In [7]:
imdbfile = 'imdb1970.pkl'
assert os.path.isfile(imdbfile), 'Oops, move your pickle or change the imdbfile path'
imdb_df = pd.DataFrame(pd.read_pickle(imdbfile))
imdb_df.head()

| | rating | title | votes | year |
|---|---|---|---|---|
| 0 | 9.2 | The Godfather | 1097794 | 1972 |
| 1 | 7.4 | Labyrinth | 90636 | 1986 |
| 2 | 8.6 | Interstellar | 832622 | 2014 |
| 3 | 7.8 | Kingsman: The Secret Service | 351702 | 2014 |
| 4 | 8.2 | The Wolf of Wall Street | 709910 | 2013 |


In [8]:
print(imdb_df.shape)
print(imdb_df.columns.values)

(9750, 4)
['rating' 'title' 'votes' 'year']
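
The 9,750 rows are consistent with the stopping rule in `get_imdb_data`: the loop advances `start` by 250 before testing `start > 10000`, so the last page it keeps is the one requested with `start=9501`. A minimal Python 3 sketch of that arithmetic, with no network access, is below. `page_starts` is a hypothetical helper, and the early exit on an empty results page is left out.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com/search/title"
PAGE_SIZE = 250
MAX_START = 10000


def page_starts(page_size=PAGE_SIZE, max_start=MAX_START):
    """Yield the `start` offsets of the pages the scraper keeps."""
    start = 1
    # A page is kept only if the already-incremented offset is still <= max_start
    while start + page_size <= max_start:
        yield start
        start += page_size


starts = list(page_starts())
print(len(starts), len(starts) * PAGE_SIZE)

# Query string for the second page (only two of the parameters shown)
second_page = BASE_URL + "?" + urlencode({"count": PAGE_SIZE, "start": starts[1]})
print(second_page)
```

With these constants the generator yields 39 offsets, and 39 pages of 250 rows give the 9,750 rows reported above.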


## Merging IMDb and BoxOfficeMojo Data <a name="Merging IMBD and BoxOfficeMojo Data"></a>

### BoxOfficeMojo Data

In [9]:
mojofile = 'mojo_movies.pkl'

In [10]:
assert os.path.isfile(mojofile), 'Oops, move your pickle or change the mojofile path'

In [16]:
mojo_df = pd.DataFrame(pd.read_pickle(mojofile))
mojo_df = mojo_df.T # same as mojo_df.transpose()
HTML(mojo_df.to_html())

| | date | genres | lifetime gross | lifetime gross theaters | mojo_url | opening | opening theaters | rank | studio | title | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| /movies/?id= | 2015-07-31 00:00:00 | [Biopic - Music, Foreign Language] | 8808.0 | 5 | /movies/?id= | 5295.0 | 5 | 44 | MBox | Paul Coelho's Best Story | 2015.0 |
| /movies/?id=10000bc.htm | 2008-03-07 00:00:00 | [Adventure - Period] | 94784201.0 | 3454 | /movies/?id=10000bc.htm | 35867488.0 | 3410 | 20 | WB | 10,000 B.C. | 2008.0 |
| /movies/?id=1000ae.htm | 2013-05-31 00:00:00 | [Sci-Fi - Adventure] | 60522097.0 | 3401 | /movies/?id=1000ae.htm | 27520040.0 | 3401 | 48 | Sony | After Earth | 2013.0 |
| /movies/?id=1000times.htm | 2014-10-24 00:00:00 | [Foreign Language] | 53895.0 | 24 | /movies/?id=1000times.htm | 24120.0 | 24 | 1197 | FM | 1,000 Times Good Night | 2014.0 |
| /movies/?id=1001rabbittales.htm | 1982-11-19 00:00:00 | [Animation] | 78350.0 | 33 | /movies/?id=1001rabbittales.htm | 78350.0 | 33 | 322 | WB | Bugs Bunny's 1001 Rabbit Tales | 1982.0 |
| /movies/?id=100bloodyacres.htm | 2013-06-28 00:00:00 | [Horror Comedy] | 6388.0 | 13 | /movies/?id=100bloodyacres.htm | 3419.0 | 13 | 106 | DR | 100 Bloody Acres | 2013.0 |
| /movies/?id=100foot.htm | 2014-08-08 00:00:00 | [Cooking, Drama - Summer] | 54240821.0 | 2167 | /movies/?id=100foot.htm | 10979290.0 | 2023 | 4 | BV | The Hundred-Foot Journey | 2014.0 |
| /movies/?id=100yearoldman.htm | 2015-05-01 00:00:00 | [Foreign Language] | 944193.0 | 76 | /movies/?id=100yearoldman.htm | NaN | - | 388 | MBox | The 100-Year Old Man Who Climbed Out the Windo... | 2015.0 |
| /movies/?id=101dalmatiansliveaction.htm | 1996-11-27 00:00:00 | [Dog, Family - Remake, Family - Talking Animal... | 136189294.0 | 2901 | /movies/?id=101dalmatiansliveaction.htm | 33504025.0 | 2794 | 3 | BV | 101 Dalmatians (1996) | 1996.0 |
| /movies/?id=102dalmatians.htm | 2000-11-22 00:00:00 | [Dog, Family - Talking Animal (Live action), R... | 66957026.0 | 2704 | /movies/?id=102dalmatians.htm | 19883351.0 | 2704 | 11 | BV | 102 Dalmatians | 2000.0 |


### Check for Abnormalities

In [12]:
# Incorrect Dates

In [None]:
mojo_df.year.unique()

In [None]:
# It turns out the movies with crazy dates are 100 years off and full of NaNs.
crazy_dates = mojo_df[mojo_df.year > 2016]
print(len(crazy_dates))
crazy_dates.head(2)

In [None]:
mjdf = mojo_df.copy()
mjdf = mjdf[mjdf.year < 2017]

In [None]:
# Get important columns and change to more user friendly column names

In [None]:
col_names = [col.replace(' ','_') for col in mjdf.columns]
wanted = [2,3,5,6,9,10]
col_names = [col for i,col in enumerate(col_names) if i in wanted]
col_names_old = [col.replace('_',' ') for col in col_names]
mjdf = pd.DataFrame(mjdf[col_names_old].values,columns=col_names)
mjdf.head()

In [None]:
# 'n/a' in columns make each column an object type: Fix

In [None]:
mjdf.dtypes # Everything is an object

In [None]:
mjdf = mjdf.replace('n/a',np.nan)
mjdf.dtypes

In [None]:
# Drop Nans in Lifetime_gross column

In [None]:
mjdf = mjdf.dropna(subset=['lifetime_gross'])

In [None]:
# Plot lifetime gross over the years

In [None]:
plt.plot(mjdf['year'],mjdf['lifetime_gross'],'mo',alpha=0.3);

In [None]:
# Plot movies per year

In [None]:
mjdf.year.hist(bins=25);

#### Clean and Merge with IMDb Data

In [None]:
merged_on_raw_title = pd.merge(mjdf,imdb_df,on='title')
merged_on_raw_title.head(2)

In [None]:
# Function to create uniform title names

In [None]:
def lightly_process_title(title):
    title = title.replace(' ','').lower()
    charlist = list(title)
    charlist = [char for char in charlist if char.isalnum()]
    return ''.join(charlist)

In [None]:
t = "102 Dalmatians: A Puppy Comes-of-Age"
lightly_process_title(t)
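
To make the effect of the normalization concrete, here is a compact Python 3 sketch of the same idea (lowercase, keep only alphanumeric characters). `normalize_title` is a hypothetical stand-in written for this example, not the notebook's function, although it is intended to return the same strings for these inputs.

```python
def normalize_title(title):
    """Lowercase the title and keep only letters and digits."""
    return "".join(char for char in title.lower() if char.isalnum())


print(normalize_title("102 Dalmatians: A Puppy Comes-of-Age"))

# Punctuation and spacing differences disappear ...
same = normalize_title("10,000 B.C.") == normalize_title("10000 BC")

# ... but a year suffix in the title survives, so these two keys differ
differs = normalize_title("101 Dalmatians (1996)") != normalize_title("101 Dalmatians")
print(same, differs)
```

The last comparison is the main limitation: a BoxOfficeMojo title that carries a disambiguating year, such as "101 Dalmatians (1996)", will not match a plain title on the other side.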

In [None]:
imdb_df.loc[:,'title2'] = imdb_df.loc[:,'title'].apply(lightly_process_title)
mjdf['title2'] = mjdf['title'].apply(lightly_process_title)
merged_on_lighty_processed_title = pd.merge(mjdf,imdb_df,on='title2',how='inner')

In [None]:
merged_on_lighty_processed_title.head(2)
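
Both frames carry `title` and `year` columns, so `pd.merge` with its default suffixes renames the left (BoxOfficeMojo) copies to `title_x`/`year_x` and the right (IMDb) copies to `title_y`/`year_y`; that is where `year_y` in the following cells comes from. The plain-Python sketch below mimics an inner join on lists of dicts to show the suffixing and the row multiplication on duplicate keys. `inner_join` is a hypothetical helper and the rows are made-up toy values, not data from the project.

```python
from collections import defaultdict


def inner_join(left, right, key, suffixes=("_x", "_y")):
    """Inner-join two lists of dicts on `key`, suffixing clashing column names."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for lrow in left:
        # Keys missing from `right` produce no output row (inner join)
        for rrow in index.get(lrow[key], []):
            clashes = (set(lrow) & set(rrow)) - {key}
            merged = {key: lrow[key]}
            for name, value in lrow.items():
                if name != key:
                    merged[name + suffixes[0] if name in clashes else name] = value
            for name, value in rrow.items():
                if name != key:
                    merged[name + suffixes[1] if name in clashes else name] = value
            joined.append(merged)
    return joined


# Toy rows (invented for illustration)
mojo_rows = [
    {"title2": "moviea", "year": 2000.0, "opening": 100.0},
    {"title2": "movieb", "year": 2010.0, "opening": 200.0},
]
imdb_rows = [
    {"title2": "moviea", "year": "2000", "rating": "7.0"},
    {"title2": "moviea", "year": "1961", "rating": "6.5"},
]

joined = inner_join(mojo_rows, imdb_rows, "title2")
for row in joined:
    print(row)
```

Two IMDb rows share the key `moviea`, so the single BoxOfficeMojo row appears twice in the result. Remakes that share a normalized title would be paired the same way in the real merge, which is one reason to compare `year_x` and `year_y` afterwards.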

In [None]:
# Filter for relevant columns to model
combined = merged_on_lighty_processed_title
combined.columns
imdb_num = combined[['rating','votes','lifetime_gross','year_y','lifetime_gross_theaters','opening','opening_theaters']]

In [None]:
imdb_num.head(2)

In [None]:
imdb_num.info()

In [None]:
# Get rid of rows with '-'

In [None]:
imdb_copy = imdb_num.copy()
# Drop rows where either theater count is the '-' placeholder
imdb_copy = imdb_copy[imdb_copy.lifetime_gross_theaters != '-']
imdb_copy = imdb_copy[imdb_copy.opening_theaters != '-']
imdb_copy.head(2)

In [None]:
# print(imdb_copy.isnull())  # boolean True/False per cell
# print(imdb_copy.isnull().any(axis=1))  # one True/False per row: does the row contain any null?
# print(np.flatnonzero(imdb_copy.isnull().any(axis=1)))  # positions of the rows that contain nulls
print(imdb_copy.opening.shape)
print(np.flatnonzero(imdb_copy.isnull().any(axis=1)))

In [None]:
# check a row with Nan
imdb_copy.iloc[567,:]

In [None]:
# drop Nans, create final df
imdb_copy.dropna().info()
imdb_final = imdb_copy.dropna()

In [None]:
# Change dtypes
imdb_final['rating'] = imdb_final['rating'].map(lambda x: float(x))
imdb_final['votes'] = imdb_final['votes'].map(lambda x: int(x))
imdb_final['lifetime_gross'] = imdb_final['lifetime_gross'].map(lambda x: int(x))
imdb_final['year_y'] = imdb_final['year_y'].map(lambda x: int(x))
imdb_final['lifetime_gross_theaters'] = imdb_final['lifetime_gross_theaters'].map(lambda x: int(x))
#imdb_final['opening'] = imdb_final['opening'].map(lambda x: float(x))
imdb_final['opening_theaters'] = imdb_final['opening_theaters'].map(lambda x: int(x))

In [None]:
imdb_final.info()

## Predictive Models <a name="Predictive Models"></a>

### Regression - Scikit Learn - 82% R^2

In [None]:
features = ['votes', 'rating','year_y','lifetime_gross_theaters','opening','opening_theaters']
response = ['lifetime_gross']

In [None]:
# Create an empty model
lin_reg = LinearRegression()
# Choose the predictor variables listed in `features`
X = imdb_final[features]
# Choose the response variable
y = imdb_final[response]
# Fit the model to the full dataset. Unlike a statsmodels formula, scikit-learn takes X and y explicitly
lin_reg_results = lin_reg.fit(X, y)
# R^2 of the model on the same data it was fit to (an in-sample score)
lin_reg.score(X,y)
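
For reference, the R^2 that `score` reports is one minus the ratio of the residual sum of squares to the total sum of squares. A dependency-free Python 3 sketch of that definition follows; `r_squared` is a hypothetical helper and the numbers are toy values, not the project's data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values (invented for illustration)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(actual, predicted)
print(round(r2, 4))
```

Because the score here is computed on the data the model was fit to, it is not directly comparable with the held-out R^2 values reported for the tree models later on.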

### Regression - Scikit Learn - Add Predictions

In [None]:
predictions = lin_reg.predict(X)
len(predictions) # 3958
imdb_final['predictions'] = predictions
imdb_final['predictions'] = imdb_final['predictions'].astype(int)

In [None]:
imdb_final.head(2)

In [None]:
# The scikit-learn predicted-vs-actual plot shows the predictions are relatively accurate

In [None]:
# Plot predictions vs. actual lifetime gross
fig, ax = plt.subplots(1, 1)
ax.scatter(imdb_final['lifetime_gross'],imdb_final['predictions'])
ax.set_xlabel('lifetime_gross')
ax.set_ylabel('Predictions')
ax.axis('equal')

In [None]:
lin_reg.intercept_

In [None]:
lin_reg.coef_

In [None]:
imdb_final[['lifetime_gross','predictions']].head(2)

In [None]:
# R squared - matches lin_reg.score above, up to the integer truncation of the predictions
metrics.r2_score(imdb_final['lifetime_gross'], imdb_final['predictions'])

In [None]:
# RMSE (square root of the mean squared error)
np.sqrt(metrics.mean_squared_error(imdb_final['lifetime_gross'], imdb_final['predictions']))

In [None]:
# MAE - Scikit Learn
metrics.mean_absolute_error(imdb_final['lifetime_gross'], imdb_final['predictions'])
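
The two error metrics differ in how they weight large misses: the square root of the mean squared error (RMSE) penalizes them more heavily than the mean absolute error (MAE), and both are in dollars here. A small Python 3 sketch with toy numbers (not the project's data) is below; `mae` and `rmse` are hypothetical helper names.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Toy values (invented for illustration): errors of 10, 10 and 30
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mae(actual, predicted), rmse(actual, predicted))
```

In this toy case the single error of 30 pulls the RMSE above the MAE; a large gap between the two on real data suggests a few movies with very large misses.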

### Regression - StatsModels - Residual Plots

In [None]:
# Define the model 
model = smf.ols('lifetime_gross ~ votes + rating + year_y + lifetime_gross_theaters + opening + opening_theaters', data=imdb_final)
# Fit the model
fit = model.fit()
# Check out the results
fit.summary()

In [None]:
# Use statsmodels to plot the residuals
# The fit looks reasonable: most residuals cluster around zero
fit.resid.plot(style='o', figsize=(12,8))

### Regression - StatsModels - Array Method
#### Ranks features by p-value; votes is the most important feature

In [None]:
X2 = X.copy()  # copy, so the constant column added below does not also end up in X

In [None]:
X2["Index"] = 1  # constant column of ones so the array-based OLS fits an intercept
X2.head(2)
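
The column of ones matters because the array interface of statsmodels' OLS does not add an intercept on its own, unlike the formula interface used earlier. The single-feature Python 3 sketch below shows how the fitted slope changes when the intercept is left out. Both helpers are hypothetical and the data are toy values generated from y = 2x + 1.

```python
def fit_through_origin(x, y):
    """Least-squares slope with no intercept term."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_with_intercept(x, y):
    """Least-squares slope and intercept for one feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


# Toy data (invented for illustration): exactly y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

slope_no_const = fit_through_origin(x, y)
slope, intercept = fit_with_intercept(x, y)
print(slope_no_const, slope, intercept)
```

Without the constant, the slope has to absorb the offset and is biased away from the true value of 2; with it, the fit recovers both the slope and the intercept.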

In [None]:
import statsmodels.api as sm  # the array-based OLS class lives in statsmodels.api
lm = sm.OLS(np.array(y), np.array(X2))
results = lm.fit()
results.summary()

In [None]:
results.params

In [None]:
results.rsquared

In [None]:
# confidence intervals of coefficients
results.conf_int()

In [None]:
# coef_ is 2-D because y was passed as a one-column DataFrame, so flatten it first
list(zip(features, lin_reg.coef_.ravel()))

In [None]:
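# P-values in feature order: results.pvalues follows the column order of X2,
# and the last entry (the constant "Index" column) is dropped
pvalues_results = results.pvalues[:len(features)]
pvalues_results

In [None]:
# DataFrame of features sorted by p-value, smallest first
p = pd.DataFrame({'feature': features, 'p_value': pvalues_results}).sort_values('p_value')
p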

In [None]:
relevant_features = list(p[p.p_value < .05]['feature'])
relevant_features

### Decision Trees

#### The decision tree model gives a lower R^2 of 60.5% and a different feature importance ranking

In [None]:
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=1)
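
With no `test_size` given, `train_test_split` holds out 25% of the rows, and `random_state=1` makes the shuffle repeatable. The standard-library sketch below illustrates the idea in Python 3. `simple_train_test_split` is a hypothetical helper, and it will not reproduce the exact rows scikit-learn selects, since the random number generators differ.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.25, seed=1):
    """Shuffle row positions reproducibly and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


rows = list(range(8))  # stand-in for eight movies
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

The tree models below are scored on the held-out rows only, which is why their R^2 values are out-of-sample figures.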

In [None]:
# Create a decision tree regressor instance (no depth limit is set, so the tree grows fully)
ctree = tree.DecisionTreeRegressor(random_state=1)

In [None]:
# Fit the decision tree regressor
ctree.fit(X_train, y_train)

In [None]:
# List of feature names
features2 = X_train.columns.tolist()

In [None]:
features2

In [None]:
# predictions_tree
predictions_tree = ctree.predict(X_test)

In [None]:
# R Squared - Scikit Learn
metrics.r2_score(y_test, predictions_tree)

In [None]:
# MAE
metrics.mean_absolute_error(y_test, predictions_tree)

In [None]:
# RMSE (square root of the mean squared error)
np.sqrt(metrics.mean_squared_error(y_test, predictions_tree))

In [None]:
# Which features are the most important?
# Importances sum to 1, so they can be read as percentages
pd.DataFrame(list(zip(X.columns, ctree.feature_importances_))).sort_values(by=1, ascending=False)

### Random Forest

#### The random forest model gives a slightly lower R^2 of 78.7% and a different feature importance ranking

In [None]:
# max_features=1.0 considers every feature at each split, which is what 'auto' meant for regressors
rfclf = RandomForestRegressor(n_estimators=100, max_features=1.0, oob_score=True, random_state=1)
rfclf.fit(X_train, y_train)

In [None]:
predictions_tree = rfclf.predict(X_test)

In [None]:
# R2
metrics.r2_score(y_test, predictions_tree)

In [None]:
# MAE
metrics.mean_absolute_error(y_test, predictions_tree)

In [None]:
# RMSE (square root of the mean squared error)
np.sqrt(metrics.mean_squared_error(y_test, predictions_tree))

In [None]:
# compute the feature importances
rf_feature_imp = pd.DataFrame(list(zip(X.columns, rfclf.feature_importances_))).sort_values(by=1, ascending=False)
rf_feature_imp

## Simple Test Case <a name="Simple Test Case"></a>

#### A simple test case with a rating of 9, a $10,000,000 opening, and 150,000 votes is predicted to make about $120 million in lifetime gross

In [None]:
X_sub_features = imdb_final[['rating', 'opening', 'votes']]

X_train, X_test, y_train, y_test = train_test_split(X_sub_features,y, random_state=1)
# max_features=1.0 considers every feature at each split, which is what 'auto' meant for regressors
rfclf = RandomForestRegressor(n_estimators=100, max_features=1.0, oob_score=True, random_state=1)
rfclf.fit(X_train, y_train)

predictions_tree = rfclf.predict(X_test)
metrics.r2_score(y_test, predictions_tree)

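# Predict one movie's lifetime gross. predict expects a 2-D array (one row per movie),
# with columns in the training order: rating, opening, votes
rfclf.predict([[9, 10000000, 150000]])[0]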

In [None]:
rfclf.feature_importances_