# Modelling revenue on a narrowly-defined tier of films

In previous notebooks, models fitted to smaller subsets of films (e.g. films grouped by release season) were far more accurate than models fitted to the entire corpus. This makes sense: predicting revenue from films that are similar to the one in question should be more accurate than predicting it from the entire body of films.

In this notebook I use intuition and some EDA to construct a tier that is more narrowly defined than seasonal release alone. I then model the revenue of that tier to see whether the narrower definition significantly improves the accuracy of the revenue predictions.

In [67]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

In [38]:
box = pd.read_csv('../Data/Reduced_dimensions2.csv', index_col=0, header = [0,1])
box.head(3)

Unnamed: 0_level_0,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,Genre,...,Release_year,Release_year,Numerical,Country,Country,Country,Country,Country,Country,Country
Unnamed: 0_level_1,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,...,00s,10s,Num_production_countries,Europe,Oceania,Asia,Middle_East,Latin_America,North_America,Africa
0,0,0,0,1,0,0,1,1,0,0,...,1,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,1,1,0,0,0,0,0,0,0
2,0,1,1,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0


In [39]:
# Mean absolute percentage error, in percent (undefined if any true value is 0)
def MAPE(true, pred):
    return np.mean(np.abs((true - pred) / true)) * 100
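As a quick sanity check of the metric, here is a minimal sketch with made-up numbers. The revenue values below are illustrative only and do not come from the dataset, and the lower-case `mape` helper is a stand-in for the function above.

```python
import numpy as np

def mape(true, pred):
    """Mean absolute percentage error, in percent."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.mean(np.abs((true - pred) / true)) * 100

# Made-up revenues, chosen so the per-film errors are 10%, 25% and 50%
true = np.array([100e6, 200e6, 50e6])
pred = np.array([110e6, 150e6, 75e6])

example_mape = mape(true, pred)  # mean of 10, 25 and 50
print(round(example_mape, 2))
```

Because every error is divided by the true value, a miss on a low-revenue film weighs as heavily as the same proportional miss on a blockbuster.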

### Defining an initial tier

The first group of films that comes to mind is 'blockbuster' action sequels released during the summer and Christmas seasons.

In [31]:
# Gives me 545 films
tier = box[box['Genre', 'Action'] == 1]
# Films that are action films AND sequels gives me 211 films to model
tier = tier[tier['Descriptive', 'Collection']==1]
# Limiting to action films released in June, July, or December that were also sequels leaves me with 75 films. 
tier = tier[(tier['Release_month', '7']==1) | (tier['Release_month', '12']==1) | (tier['Release_month', '6']==1)]
tier.shape

(75, 84)
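The three filtering steps can also be combined into a single boolean mask. This is a minimal sketch on a toy frame: the `toy` data and the variable names are made up for illustration, and only the two-level column labels mirror those of `box`.

```python
import pandas as pd

# Toy frame with two-level columns mimicking the layout of `box`
cols = pd.MultiIndex.from_tuples([
    ('Genre', 'Action'),
    ('Descriptive', 'Collection'),
    ('Release_month', '6'),
    ('Release_month', '7'),
    ('Release_month', '12'),
])
toy = pd.DataFrame(
    [[1, 1, 1, 0, 0],   # action sequel released in June      -> kept
     [1, 1, 0, 0, 0],   # action sequel, some other month     -> dropped
     [1, 0, 0, 1, 0],   # action film in July, not a sequel   -> dropped
     [0, 1, 0, 0, 1],   # December sequel, not an action film -> dropped
     [1, 1, 0, 0, 1]],  # action sequel released in December  -> kept
    columns=cols,
)

# True where a film was released in June, July or December
in_season = toy['Release_month'][['6', '7', '12']].any(axis=1)

mask = (
    (toy[('Genre', 'Action')] == 1)
    & (toy[('Descriptive', 'Collection')] == 1)
    & in_season
)
toy_tier = toy[mask]
print(toy_tier.shape)
```

Selecting with a tuple such as `('Genre', 'Action')` returns a single column, while selecting with only the top-level label `'Release_month'` returns every column in that group.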

In [42]:
# Drop redundant columns: every film in the tier is a sequel, and none were released in the months below
tier_1 = tier.drop(['Collection', '1', '2', '3', '4', '5', '8', '9', '10', '11'], level=1, axis=1)
y = np.log(tier_1['Numerical', 'revenue'])
X = tier_1.drop('revenue', level=1, axis=1)
tier_1.shape

(75, 74)

In [36]:
scaler = MinMaxScaler()
scaler.fit(X['Numerical'])
X['Numerical'] = scaler.transform(X['Numerical'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
exp_y_test = np.exp(y_test)
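In the cell above the scaler is fit on all 75 films before the split, so the test rows influence the scaling. A more cautious ordering is to split first and fit the scaler on the training rows only. The sketch below uses synthetic stand-in arrays (`X_num` and `y_log` are assumptions, not the notebook's data) to show that ordering.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_num = rng.uniform(0, 300, size=(75, 3))  # stand-in for the numerical columns
y_log = rng.normal(18, 1, size=75)         # stand-in for log revenue

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_num, y_log, test_size=0.2, random_state=42
)

# ... then learn the min/max from the training rows only
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test values may fall slightly outside [0, 1]

print(X_tr_scaled.shape, X_te_scaled.shape)
```

For tree-based models such as gradient boosting the effect is likely small, since split thresholds are unaffected by monotonic rescaling of a feature, but the ordering matters for scale-sensitive models such as linear regression with regularisation.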

In [64]:
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
tier_1_preds = gb.predict(X_test)
tier_1_preds = np.exp(tier_1_preds)
r2_1 = r2_score(exp_y_test, tier_1_preds)
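# np.sqrt of the MSE gives the RMSE and also works on scikit-learn versions without the `squared` argument
RMSE_1 = np.sqrt(mean_squared_error(exp_y_test, tier_1_preds))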
MAE_1 = mean_absolute_error(exp_y_test, tier_1_preds)
MAPE_1 = MAPE(exp_y_test, tier_1_preds)
scores_1 = {'R2':r2_1, 'RMSE':RMSE_1, "MAE":MAE_1, "MAPE":MAPE_1}
scores_1

{'R2': 0.3294376807570927,
 'RMSE': 59636179.77900985,
 'MAE': 47350674.83634695,
 'MAPE': 69.35751604941464}

#### We have great results!

Gradient boosting regression has reduced the MAPE to around 60-70% (depending on the model's random state). Next I'll tune the hyperparameters to see whether I can bring it below 20%.
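With only 75 films, a single 80/20 split leaves 15 films in the test set, which is one reason the score moves with the random state. One way to get a steadier estimate is repeated k-fold cross-validation. The sketch below runs on synthetic stand-in data (`X_demo` and `y_log_demo` are assumptions, as are the fold counts and `n_estimators=50`) and computes the MAPE on the back-transformed revenue scale, as the notebook does.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(75, 5))                            # stand-in features
y_log_demo = 18 + X_demo[:, 0] + 0.5 * rng.normal(size=75)   # stand-in log revenue

# 5 folds repeated 3 times gives 15 train/test splits
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

fold_mapes = []
for train_idx, test_idx in rkf.split(X_demo):
    model = GradientBoostingRegressor(n_estimators=50, random_state=42)
    model.fit(X_demo[train_idx], y_log_demo[train_idx])
    # Back-transform predictions and targets from log scale to revenue scale
    pred = np.exp(model.predict(X_demo[test_idx]))
    true = np.exp(y_log_demo[test_idx])
    fold_mapes.append(np.mean(np.abs((true - pred) / true)) * 100)

print(f"MAPE: {np.mean(fold_mapes):.1f}% +/- {np.std(fold_mapes):.1f}% "
      f"over {len(fold_mapes)} splits")
```

Reporting the mean together with the spread across splits makes it easier to tell a real improvement from split-to-split noise.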

In [74]:
params = {'n_estimators':np.arange(5,10000), 'max_depth':np.arange(1,70), 
          'min_samples_split': np.arange(2,10), 'learning_rate': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1]}
grade = GradientBoostingRegressor()
gradient = GridSearchCV(grade, param_grid=params)
gradient.fit(X_train, y_train)
grade_best_score = gradient.best_score_
grade_best_params = gradient.best_params_
print('GB best score: ', grade_best_score)
print('GB best parameters: ', grade_best_params)

MemoryError: 
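The `MemoryError` is most likely a consequence of the size of the grid: 9,995 values of `n_estimators` times 69 depths times 8 split sizes times 7 learning rates is roughly 38.6 million candidates, each of which would be refit once per cross-validation fold. `RandomizedSearchCV` (already imported above) samples a fixed number of candidates instead. The sketch below counts the original grid and then runs a randomized search on synthetic stand-in data; `X_demo`, `y_demo`, `n_iter=10`, `cv=3` and the narrowed `search_space` ranges are all assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# The grid from the cell above; len() counts candidates without building them
params = {'n_estimators': np.arange(5, 10000), 'max_depth': np.arange(1, 70),
          'min_samples_split': np.arange(2, 10),
          'learning_rate': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1]}
n_candidates = len(ParameterGrid(params))
print(f"{n_candidates:,} candidates in the full grid")

# Synthetic stand-in data with the same number of rows as the tier
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(75, 5))
y_demo = 18 + X_demo[:, 0] + 0.5 * rng.normal(size=75)

# Narrowed ranges (an assumption) so that the sketch finishes in seconds
search_space = {'n_estimators': np.arange(5, 200), 'max_depth': np.arange(1, 10),
                'min_samples_split': np.arange(2, 10),
                'learning_rate': [0.001, 0.01, 0.1, 1]}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=search_space,
    n_iter=10,        # number of sampled candidates
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)
print('Best CV score: ', search.best_score_)
print('Best parameters: ', search.best_params_)
```

The cost of a randomized search is set by `n_iter` rather than by the size of the search space, so wide ranges can be explored without enumerating every combination.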

In [69]:
# Testing new optimized GradientBooster with 'best' hyperparameters:
gb = GradientBoostingRegressor(n_estimators = 17, min_samples_split = 6, max_depth=11, random_state=42)
gb.fit(X_train, y_train)
tier_1_preds = gb.predict(X_test)
tier_1_preds = np.exp(tier_1_preds)
r2_2 = r2_score(exp_y_test, tier_1_preds)
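# np.sqrt of the MSE gives the RMSE and also works on scikit-learn versions without the `squared` argument
RMSE_2 = np.sqrt(mean_squared_error(exp_y_test, tier_1_preds))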
MAE_2 = mean_absolute_error(exp_y_test, tier_1_preds)
MAPE_2 = MAPE(exp_y_test, tier_1_preds)
scores_2 = {'R2':r2_2, 'RMSE':RMSE_2, "MAE":MAE_2, "MAPE":MAPE_2}
scores_2

{'R2': 0.3005550488747464,
 'RMSE': 60906972.462909244,
 'MAE': 49651270.86655247,
 'MAPE': 70.61337101467102}

{'R2': -1289.6628008202847,
 'RMSE': 2616354808.3560324,
 'MAE': 1203782767.8232892,
 'MAPE': 970.1347722377333}