# Create a predictive model that will tell us if a stand-up comedy special will receive an above or below average IMDb rating

1) Train weak learners: Random Forest, Stochastic Gradient Descent.

2) Perform a grid search to find optimal parameters for an XGBoost classifier.

3) Put all three models into an ensemble for a final accuracy of 0.76.

By combining three individually weaker models into an ensemble, it was possible to predict whether a comedy special's IMDb rating is above or below average with an accuracy of about 0.76 on the held-out test set. The models would probably improve with more training data. The LDA model that produced the topic vectors used here (in topic_modeling_LDA.ipynb) could also be improved with more training data or perhaps with different hyperparameter settings.
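
To make the ensembling idea concrete, here is a minimal sketch of hard (majority) voting, the scheme this notebook uses for its final model. The `hard_vote` helper and the prediction lists are hypothetical and made up for illustration; they are not the notebook's data or scikit-learn's implementation.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    # zip(*...) groups the votes that all models cast for the same sample
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from three models on four samples (made-up labels)
rf_pred = [1, 0, 1, 1]
sgd_pred = [1, 1, 0, 1]
xgb_pred = [0, 0, 1, 1]

ensemble_pred = hard_vote([rf_pred, sgd_pred, xgb_pred])
print(ensemble_pred)  # [1, 0, 1, 1]
```

With three voters and two classes a tie cannot occur, so each ensemble prediction is simply the label that at least two of the three models agree on.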

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

df = pd.read_pickle('stand-up-data-w-LDA.pkl')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 322 entries, 0 to 329
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   title            322 non-null    object 
 1   date_posted      322 non-null    object 
 2   link             322 non-null    object 
 3   name             318 non-null    object 
 4   year             306 non-null    float64
 5   transcript       322 non-null    object 
 6   language         322 non-null    object 
 7   runtime          272 non-null    float64
 8   rating           272 non-null    float64
 9   rating_type      322 non-null    int64  
 10  words            322 non-null    object 
 11  word_count       322 non-null    int64  
 12  f_words          322 non-null    int64  
 13  s_words          322 non-null    int64  
 14  diversity        322 non-null    int64  
 15  diversity_ratio  322 non-null    float64
 16  police_AA        322 non-null    float64
 17  clean           

In [2]:
X = np.array(df[['police_AA', 'clean', 'UK', 'relationships', 'animals', 'politics', 'big_picture']])
y = np.array(df.rating_type)
print(X.shape)
print(y.shape)

(322, 7)
(322,)


### Split data into training and testing sets and train models.

- Train Random Forest model

- Train SGD model

- Perform grid search and train XGB model

- Create an ensemble of three classifiers

In [3]:
# Split the data into training and testing sets (15% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

In [4]:
# Random Forest
rf = RandomForestClassifier(n_estimators=1001).fit(X_train, y_train)
print(f'RF score: {rf.score(X_test, y_test)}')

RF score: 0.7142857142857143


In [5]:
# SGD
sgd = linear_model.SGDClassifier().fit(X_train, y_train)
print(f'SGD score: {sgd.score(X_test, y_test)}')

SGD score: 0.7346938775510204


In [35]:
%%time
xgb = XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(xgb,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train, y_train)

CPU times: user 18.4 s, sys: 491 ms, total: 18.9 s
Wall time: 1min 59s


GridSearchCV(cv=3, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estim...
                                     subsample=None, tree_method=None,
                                     validate_parameters=None, verbosity=None),
             iid='deprecated', n_jobs=4,
             param_grid={'colsample_bytree': [0.3, 0.4, 0.5, 0.7],
                        

In [36]:
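# GridSearchCV (refit=True by default) has already refit the best estimator on
# X_train; the extra .fit() call repeats that fit and is kept for explicitness.
best_xgb = grid.best_estimator_.fit(X_train, y_train)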
print(f'Best params: {grid.best_params_}')
print(f'Best XGB score: {best_xgb.score(X_test, y_test)}')

Best params: {'colsample_bytree': 0.3, 'eta': 0.05, 'gamma': 0.4, 'max_depth': 3, 'min_child_weight': 1}
Best XGB score: 0.6938775510204082
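
To put the grid search above in perspective, the following sketch counts how many parameter combinations and cross-validation fits the grid implies. It copies the `parameters` dictionary and `cv=3` from the cell above; the names `n_candidates` and `n_fits` are introduced here for illustration only.

```python
from math import prod

# Same grid as in the search cell above
parameters = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}
cv_folds = 3

# An exhaustive grid tries every combination: the product of the list lengths
n_candidates = prod(len(values) for values in parameters.values())
# Each candidate is fit once per cross-validation fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 3840 11520
```

Note that the search ranks candidates by `neg_log_loss`, while the reported test score is accuracy, so the selected parameters are not guaranteed to be the ones that maximize test accuracy.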


In [37]:
from sklearn.ensemble import VotingClassifier

# Ensemble
estimators = [('rf', rf), ('sgd', sgd), ('xgb', best_xgb)]

ensemble = VotingClassifier(estimators, voting='hard')
ensemble.fit(X_train, y_train)
print('Voting Classifier, Ensemble Acc: {}'.format(ensemble.score(X_test, y_test)))

Voting Classifier, Ensemble Acc: 0.7551020408163265
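
Because the test set is small, the differences between these scores deserve some caution. The sketch below estimates the test-set size from the 322 rows and `test_size=0.15`, then computes a rough binomial standard error for the ensemble accuracy. It assumes a normal approximation, which is only approximate at this sample size; the variable names are illustrative.

```python
from math import ceil, sqrt

n_samples = 322
test_size = 0.15

# train_test_split rounds the test count up, giving 49 test samples
n_test = ceil(n_samples * test_size)

accuracy = 0.7551020408163265          # ensemble score reported above
n_correct = round(accuracy * n_test)   # 37 of 49 predictions correct

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
std_err = sqrt(accuracy * (1 - accuracy) / n_test)

# Rough 95% interval under a normal approximation (an assumption)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(n_test, n_correct)
print(f"std err: {std_err:.3f}, approx. 95% interval: {low:.2f} to {high:.2f}")
```

Each test sample is worth about 0.02 of accuracy here, so the four reported scores (34, 35, 36, and 37 correct out of 49) all fall well within this interval of one another.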


### Save the ensemble model for later

In [39]:
import pickle

# Save ensemble model (the context manager closes the file after writing)
with open('rating_prediction_ens_model.pkl', 'wb') as f:
    pickle.dump(ensemble, f)

# # Load ensemble model
# with open('rating_prediction_ens_model.pkl','rb') as f:
#     ensemble = pickle.load(f)