## Shortlist Promising Models

This notebook assumes the data has already been split into a training set and a test set. If it has not, run `get_data.ipynb` first.

1. Try these models:
  - Linear Regression
  - Random Forest Regressor
  - Dense Neural Network
  - Linear SVR
2. Measure and compare their performance using RMSE, looking at both the mean and the standard deviation of the RMSE across cross-validation folds for each model
3. Make a quick round of feature selection and engineering:
  - Try transforming variables to normal distributions
  - Try removing unimportant features
  - Try adding polynomial features
4. Perform one or two more quick iterations of the previous steps.
5. Shortlist the top three to five most promising models, preferring models that make different types of errors.
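
Step 3 mentions polynomial features, which the cells below do not try directly. As a minimal sketch of what that step could look like, the following uses scikit-learn's `PolynomialFeatures` on synthetic data; `X_demo`, `y_demo` and the degree are illustrative assumptions, not part of this notebook's data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): the target depends on the product x0 * x1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# A degree-2 expansion keeps the 3 inputs and adds 3 squares and 3 pairwise products
poly = PolynomialFeatures(degree=2, include_bias=False)
n_expanded = poly.fit_transform(X_demo).shape[1]

# Compare a plain linear fit with a linear fit on the expanded features
plain_r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_r2 = poly_model.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"expanded features: {n_expanded}, plain R^2: {plain_r2:.3f}, polynomial R^2: {poly_r2:.3f}")
```

The number of columns grows quickly with the degree, so it is usually worth checking the expanded width before feeding it to the slower models.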

In [1]:
import pandas as pd

TRAINING_FILEPATH = 'data/training_set.csv'
TEST_FILEPATH = 'data/test_set.csv'

training_set = pd.read_csv(TRAINING_FILEPATH, index_col='index')
test_set = pd.read_csv(TEST_FILEPATH, index_col='index')

In [71]:
from preprocessing_utils import separate_features_targets, FeaturePreprocessor

train_X, train_y = separate_features_targets(training_set)

# preprocess training features: power transform
feature_preprocessor = FeaturePreprocessor(add_combinations=True, powertransform_num=True, onehot_type=True)
train_X = feature_preprocessor.fit_transform(train_X)
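
`FeaturePreprocessor` lives in `preprocessing_utils` and is not shown here. Assuming `powertransform_num=True` applies something similar to scikit-learn's `PowerTransformer` to the numerical columns, the sketch below illustrates the effect on a synthetic right-skewed feature; the data and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (assumption), loosely resembling a count column
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson power transform, followed by standardization to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
counts_pt = pt.fit_transform(counts)

skew_before = float(skew(counts[:, 0]))
skew_after = float(skew(counts_pt[:, 0]))
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```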

In [72]:
# baseline: a model that always predicts the mean target has an RMSE equal to the standard deviation of the targets
train_y.std()

1.1203423450376466
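
The reasoning behind this baseline can be checked on a toy example. The sketch below uses made-up targets (not `train_y`) to show that the RMSE of a constant mean prediction equals the population standard deviation, and that pandas' default `std()` (which divides by n - 1) differs only by a factor of sqrt(n / (n - 1)), which is negligible for about 10,000 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical targets standing in for train_y
y = pd.Series([5.1, 6.3, 7.0, 4.2, 8.4, 6.6])
n = len(y)

# RMSE of a model that always predicts the mean of the targets
rmse_of_mean = float(np.sqrt(np.mean((y - y.mean()) ** 2)))

population_std = float(y.std(ddof=0))  # divides by n
sample_std = float(y.std())            # pandas default (ddof=1) divides by n - 1

print(rmse_of_mean, population_std, sample_std)
```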

In [73]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

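def cross_val_rmse(model, X, y, cv=5, random_state=None, model_name=None):
    """
    Evaluate a model with K-fold cross-validation.

    Returns a data frame with one row for the RMSE on the training folds and one row
    for the RMSE on the validation fold, plus the mean and standard deviation across folds.
    """

    # make sure X and y are numpy arrays for slicing later
    X = np.array(X)
    y = np.array(y)

    # split data into contiguous folds; random_state only matters when shuffle=True,
    # and newer scikit-learn versions raise a ValueError if it is set with shuffle=False
    kf = KFold(n_splits=cv, shuffle=False)
    fold_indices = kf.split(X)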

    rmse_list = []
    for train_indices, val_indices in fold_indices:
        # train the model on the training folds
        model.fit(X[train_indices], y[train_indices])

        # predict for every row once, then score the training and validation rows separately
        y_pred = model.predict(X)
        train_rmse = np.sqrt(mean_squared_error(y[train_indices], y_pred[train_indices]))
        val_rmse = np.sqrt(mean_squared_error(y[val_indices], y_pred[val_indices]))
        rmse_list.append([train_rmse, val_rmse])
    
    # create a data frame with one row per split and one column per fold
    prefix = f"{model_name} " if model_name is not None else ""
    index = [prefix + split for split in ('train', 'val')]
    df = pd.DataFrame(np.array(rmse_list).T, index=index, columns=[f"fold {i}" for i in range(cv)])
    
    # compute mean and standard deviation
    df_mean = df.mean(axis=1)
    df_std = df.std(axis=1)
    df['mean'] = df_mean
    df['std'] = df_std

    return df
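
For comparison, scikit-learn's built-in `cross_validate` can produce the same pair of training and validation scores per fold. This is a sketch on synthetic data rather than a drop-in replacement for `cross_val_rmse`; `X_demo` and `y_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption): linear signal plus noise with standard deviation 0.3
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100)

cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",  # scores are maximized, so the MSE is negated
    return_train_score=True,
)

# Undo the sign and take the square root to get one RMSE per fold
train_rmse = np.sqrt(-cv_results["train_score"])
val_rmse = np.sqrt(-cv_results["test_score"])
print(f"train RMSE: {train_rmse.mean():.3f} +/- {train_rmse.std():.3f}")
print(f"val RMSE:   {val_rmse.mean():.3f} +/- {val_rmse.std():.3f}")
```

Unlike the hand-written loop, `cross_validate` clones the estimator for every fold, so the model object passed in is left unfitted.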

In [74]:
from sklearn.dummy import DummyRegressor

baseline_model = DummyRegressor(strategy='mean')
baseline_errors = cross_val_rmse(baseline_model, train_X, train_y, cv=5, random_state=42, model_name='baseline')
display(baseline_errors)

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |


In [75]:
from sklearn.linear_model import LinearRegression

linreg_model = LinearRegression()
linreg_errors = cross_val_rmse(linreg_model, train_X, train_y, cv=5, random_state=42, model_name='linreg')
display(linreg_errors)

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| linreg train | 0.893267 | 0.896926 | 0.88998 | 0.890549 | 0.894051 | 0.892955 | 0.002815 |
| linreg val | 0.897161 | 0.881527 | 0.910644 | 0.908679 | 0.894478 | 0.898498 | 0.011802 |


In [76]:
from sklearn.ensemble import RandomForestRegressor

forestreg_model = RandomForestRegressor()
forestreg_errors = cross_val_rmse(forestreg_model, train_X, train_y, cv=5, random_state=42, model_name='forestreg')
display(forestreg_errors)

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| forestreg train | 0.35151 | 0.357484 | 0.358066 | 0.358803 | 0.352321 | 0.355637 | 0.003441 |
| forestreg val | 0.853699 | 0.833393 | 0.84267 | 0.842541 | 0.818561 | 0.838173 | 0.013113 |


In [77]:
from sklearn.svm import LinearSVR

linsvr_model = LinearSVR(max_iter=100000)
linsvr_errors = cross_val_rmse(linsvr_model, train_X, train_y, cv=5, random_state=42, model_name='linsvr')
display(linsvr_errors)

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| linsvr train | 0.906789 | 0.912054 | 0.903541 | 0.903642 | 0.908154 | 0.906836 | 0.003536 |
| linsvr val | 0.908406 | 0.893704 | 0.923566 | 0.921239 | 0.904131 | 0.910209 | 0.012376 |


In [78]:
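# import everything from tensorflow.keras so the layers and the wrapper come from the same package
# (the scikit_learn wrapper module is deprecated and may be missing from recent TensorFlow/Keras releases)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor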

def create_neuralnet_model(input_shape):
    def create_model():
        model = Sequential([
            Dense(32, input_shape=input_shape, activation='relu'),
            Dense(1)
        ])
        model.compile(loss='mean_squared_error', optimizer='sgd')
        return model

    # wrap the neural network model to be used by scikit-learn
    neuralnet_model = KerasRegressor(create_model, epochs=150)
    return neuralnet_model

neuralnet_model = create_neuralnet_model(train_X.shape[1:])
neuralnet_errors = cross_val_rmse(neuralnet_model, train_X, train_y, cv=5, random_state=42, model_name='neuralnet')
display(neuralnet_errors)

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| neuralnet train | 0.730743 | 0.734475 | 0.736975 | 0.747654 | 0.733245 | 0.736619 | 0.006565 |
| neuralnet val | 0.820865 | 0.809628 | 0.809639 | 0.820308 | 0.81377 | 0.814842 | 0.005513 |


In [79]:
power_tr_model_errors = pd.concat([baseline_errors, linreg_errors, forestreg_errors, linsvr_errors, neuralnet_errors])
power_tr_model_errors

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |
| linreg train | 0.893267 | 0.896926 | 0.88998 | 0.890549 | 0.894051 | 0.892955 | 0.002815 |
| linreg val | 0.897161 | 0.881527 | 0.910644 | 0.908679 | 0.894478 | 0.898498 | 0.011802 |
| forestreg train | 0.35151 | 0.357484 | 0.358066 | 0.358803 | 0.352321 | 0.355637 | 0.003441 |
| forestreg val | 0.853699 | 0.833393 | 0.84267 | 0.842541 | 0.818561 | 0.838173 | 0.013113 |
| linsvr train | 0.906789 | 0.912054 | 0.903541 | 0.903642 | 0.908154 | 0.906836 | 0.003536 |
| linsvr val | 0.908406 | 0.893704 | 0.923566 | 0.921239 | 0.904131 | 0.910209 | 0.012376 |
| neuralnet train | 0.730743 | 0.734475 | 0.736975 | 0.747654 | 0.733245 | 0.736619 | 0.006565 |
| neuralnet val | 0.820865 | 0.809628 | 0.809639 | 0.820308 | 0.81377 | 0.814842 | 0.005513 |


In [84]:
def evaluate_models(train_X, train_y):
    baseline_errors = cross_val_rmse(baseline_model, train_X, train_y, cv=5, model_name='baseline')
    linreg_errors = cross_val_rmse(linreg_model, train_X, train_y, cv=5, model_name='linreg')
    forestreg_errors = cross_val_rmse(forestreg_model, train_X, train_y, cv=5, model_name='forestreg')
    linsvr_errors = cross_val_rmse(linsvr_model, train_X, train_y, cv=5, model_name='linsvr')

    neuralnet_model = create_neuralnet_model(train_X.shape[1:])
    neuralnet_errors = cross_val_rmse(neuralnet_model, train_X, train_y, cv=5, model_name='neuralnet')
    
    return pd.concat([baseline_errors, linreg_errors, forestreg_errors, linsvr_errors, neuralnet_errors])

In [85]:
train_X, train_y = separate_features_targets(training_set)

# preprocess training features (standardization)
feature_preprocessor_std = FeaturePreprocessor(add_combinations=True, std_scale_num=True, onehot_type=True)
train_X_std = feature_preprocessor_std.fit_transform(train_X)

In [86]:
std_errors = evaluate_models(train_X_std, train_y)
std_errors

|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |
| linreg train | 0.913938 | 0.915948 | 0.906405 | 0.909648 | 0.916653 | 0.912518 | 0.004373 |
| linreg val | 0.909653 | 0.900677 | 0.939456 | 0.92941 | 0.911433 | 0.918126 | 0.015838 |
| forestreg train | 0.352948 | 0.348856 | 0.35639 | 0.352738 | 0.354742 | 0.353135 | 0.002814 |
| forestreg val | 0.839054 | 0.823235 | 0.850745 | 0.842143 | 0.827498 | 0.836535 | 0.011161 |
| linsvr train | 0.931366 | 0.933233 | 0.92304 | 0.925438 | 0.93006 | 0.928627 | 0.004247 |
| linsvr val | 0.922163 | 0.92007 | 0.954817 | 0.948644 | 0.918581 | 0.932855 | 0.017415 |
| neuralnet train | 0.772485 | 0.787028 | 0.759109 | 0.765387 | 0.766973 | 0.770196 | 0.010547 |
| neuralnet val | 0.859897 | 0.829562 | 0.858548 | 0.844658 | 0.83305 | 0.845143 | 0.014024 |


In [88]:
train_X, train_y = separate_features_targets(training_set)

train_X = feature_preprocessor.fit_transform(train_X)
forestreg_model.fit(train_X, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [90]:
forestreg_feature_importances = pd.DataFrame({'feature': train_X.columns, 'importance': forestreg_model.feature_importances_})
forestreg_feature_importances = forestreg_feature_importances.sort_values(by='importance', ascending=False)
forestreg_feature_importances = forestreg_feature_importances.reset_index(drop=True)
forestreg_feature_importances

|  | feature | importance |
| --- | --- | --- |
| 0 | year | 0.143466 |
| 1 | ratingCount | 0.127177 |
| 2 | reviewsPerRating | 0.112401 |
| 3 | duration | 0.097468 |
| 4 | type_video.movie | 0.070445 |
| 5 | nrOfWins | 0.068954 |
| 6 | nrOfUserReviews | 0.042679 |
| 7 | nrOfPhotos | 0.039809 |
| 8 | winsPerNomination | 0.039808 |
| 9 | nrOfNewsArticles | 0.038612 |


In [98]:
least_important_features = list(forestreg_feature_importances.iloc[-10:]['feature'].values)
least_important_features

['Mystery',
 'Short',
 'Biography',
 'Sport',
 'War',
 'GameShow',
 'Western',
 'type_game',
 'Adult',
 'FilmNoir']
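
The ranking above relies on the forest's impurity-based `feature_importances_`. As a cross-check before dropping features, permutation importance on held-out rows is a common alternative. The following is a minimal sketch on synthetic data, where only the first two columns carry signal; all names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): columns 0 and 1 drive the target, columns 2 and 3 are pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=400)

X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time on the held-out rows and record how much the score drops
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important column first

for col in ranking:
    print(f"column {col}: {result.importances_mean[col]:.3f} +/- {result.importances_std[col]:.3f}")
```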

In [99]:
train_X, train_y = separate_features_targets(training_set)

# preprocess training features (power transform, remove least important features)
feature_preprocessor = FeaturePreprocessor(add_combinations=True, powertransform_num=True, onehot_type=True,
                                           drop_features=least_important_features)
train_X = feature_preprocessor.fit_transform(train_X)

In [100]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 11613 to 14254
Data columns (total 34 columns):
ratingCount           10840 non-null float64
duration              10840 non-null float64
year                  10840 non-null float64
nrOfWins              10840 non-null float64
nrOfNominations       10840 non-null float64
nrOfPhotos            10840 non-null float64
nrOfNewsArticles      10840 non-null float64
nrOfUserReviews       10840 non-null float64
nrOfGenre             10840 non-null float64
totalNominations      10840 non-null float64
winsPerNomination     10840 non-null float64
reviewsPerRating      10840 non-null float64
type_video.episode    10840 non-null float64
type_video.movie      10840 non-null float64
type_video.tv         10840 non-null float64
Action                10840 non-null int64
Adventure             10840 non-null int64
Animation             10840 non-null int64
Comedy                10840 non-null int64
Crime                 10840 non-null int

In [101]:
power_tr_drop_errors = evaluate_models(train_X, train_y)

In [104]:
print("Standardization Only")
display(std_errors)

print("\n\nPower Transform Only")
display(power_tr_model_errors)

print("\n\nPower Transform and Drop 10 Least Important Features")
display(power_tr_drop_errors)

Standardization Only


|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |
| linreg train | 0.913938 | 0.915948 | 0.906405 | 0.909648 | 0.916653 | 0.912518 | 0.004373 |
| linreg val | 0.909653 | 0.900677 | 0.939456 | 0.92941 | 0.911433 | 0.918126 | 0.015838 |
| forestreg train | 0.352948 | 0.348856 | 0.35639 | 0.352738 | 0.354742 | 0.353135 | 0.002814 |
| forestreg val | 0.839054 | 0.823235 | 0.850745 | 0.842143 | 0.827498 | 0.836535 | 0.011161 |
| linsvr train | 0.931366 | 0.933233 | 0.92304 | 0.925438 | 0.93006 | 0.928627 | 0.004247 |
| linsvr val | 0.922163 | 0.92007 | 0.954817 | 0.948644 | 0.918581 | 0.932855 | 0.017415 |
| neuralnet train | 0.772485 | 0.787028 | 0.759109 | 0.765387 | 0.766973 | 0.770196 | 0.010547 |
| neuralnet val | 0.859897 | 0.829562 | 0.858548 | 0.844658 | 0.83305 | 0.845143 | 0.014024 |




Power Transform Only


|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |
| linreg train | 0.893267 | 0.896926 | 0.88998 | 0.890549 | 0.894051 | 0.892955 | 0.002815 |
| linreg val | 0.897161 | 0.881527 | 0.910644 | 0.908679 | 0.894478 | 0.898498 | 0.011802 |
| forestreg train | 0.35151 | 0.357484 | 0.358066 | 0.358803 | 0.352321 | 0.355637 | 0.003441 |
| forestreg val | 0.853699 | 0.833393 | 0.84267 | 0.842541 | 0.818561 | 0.838173 | 0.013113 |
| linsvr train | 0.906789 | 0.912054 | 0.903541 | 0.903642 | 0.908154 | 0.906836 | 0.003536 |
| linsvr val | 0.908406 | 0.893704 | 0.923566 | 0.921239 | 0.904131 | 0.910209 | 0.012376 |
| neuralnet train | 0.730743 | 0.734475 | 0.736975 | 0.747654 | 0.733245 | 0.736619 | 0.006565 |
| neuralnet val | 0.820865 | 0.809628 | 0.809639 | 0.820308 | 0.81377 | 0.814842 | 0.005513 |




Power Transform and Drop 10 Least Important Features


|  | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline train | 1.123358 | 1.121115 | 1.114294 | 1.119928 | 1.122722 | 1.120283 | 0.003609 |
| baseline val | 1.107938 | 1.116999 | 1.14397 | 1.121741 | 1.110553 | 1.12024 | 0.014327 |
| linreg train | 0.895211 | 0.898788 | 0.892349 | 0.892274 | 0.896531 | 0.895031 | 0.002792 |
| linreg val | 0.898901 | 0.883642 | 0.910442 | 0.91087 | 0.893702 | 0.899511 | 0.011559 |
| forestreg train | 0.353823 | 0.353798 | 0.348534 | 0.358351 | 0.353111 | 0.353523 | 0.003484 |
| forestreg val | 0.829275 | 0.840292 | 0.832306 | 0.853556 | 0.814249 | 0.833936 | 0.014468 |
| linsvr train | 0.908247 | 0.91356 | 0.905552 | 0.90545 | 0.910071 | 0.908576 | 0.003395 |
| linsvr val | 0.90948 | 0.896384 | 0.923342 | 0.923145 | 0.904059 | 0.911282 | 0.01187 |
| neuralnet train | 0.743611 | 0.755037 | 0.738372 | 0.735314 | 0.749691 | 0.744405 | 0.008072 |
| neuralnet val | 0.833834 | 0.814097 | 0.828096 | 0.823149 | 0.817441 | 0.823323 | 0.007953 |


Notes about the models so far:
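- Power-transforming the numerical columns instead of standardizing them lowers the mean validation RMSE for linear regression (0.918 to 0.898), linear SVR (0.933 to 0.910) and the neural network (0.845 to 0.815). The random forest is the exception: its error rises slightly (0.837 to 0.838), a change well within its fold-to-fold standard deviation of about 0.01.
- Removing the 10 least important features raises the mean validation RMSE slightly for linear regression, linear SVR and the neural network (0.815 to 0.823 for the network), while the random forest improves slightly (0.838 to 0.834).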
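
The two notes can be read directly off the mean validation RMSE columns of the three tables. The sketch below copies those means into a small data frame and computes the change for each preprocessing step; the column labels are illustrative names, not identifiers used elsewhere in this notebook. Since the random forest and the neural network are not seeded, differences of a few thousandths may be run-to-run noise.

```python
import pandas as pd

# Mean validation RMSE copied from the three result tables above
val_mean = pd.DataFrame(
    {
        "standardize": [0.918126, 0.836535, 0.932855, 0.845143],
        "power": [0.898498, 0.838173, 0.910209, 0.814842],
        "power_drop10": [0.899511, 0.833936, 0.911282, 0.823323],
    },
    index=["linreg", "forestreg", "linsvr", "neuralnet"],
)

# Negative values mean the change lowered the validation RMSE
effect = pd.DataFrame(
    {
        "power_vs_standardize": val_mean["power"] - val_mean["standardize"],
        "drop10_vs_power": val_mean["power_drop10"] - val_mean["power"],
    }
)
print(effect.round(4))
```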

Best models so far:
- Random Forest Regressor
- Dense Neural Network
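
The checklist prefers shortlisted models that make different types of errors. One way to check this, sketched here on synthetic data with a linear model standing in for the neural network, is to correlate the out-of-fold residuals of two models: a lower correlation suggests the models err on different rows, which usually makes them better candidates for blending. The data, model choices and names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data (assumption): one linear effect and one nonlinear effect
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = X_demo[:, 0] + np.sin(3 * X_demo[:, 1]) + 0.2 * rng.normal(size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linreg": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Out-of-fold predictions: every row is predicted by a model that did not see it during fitting
preds = {name: cross_val_predict(model, X_demo, y_demo, cv=cv) for name, model in models.items()}
residuals = {name: y_demo - pred for name, pred in preds.items()}
rmse = {name: float(np.sqrt(np.mean(res ** 2))) for name, res in residuals.items()}

# Correlation between the two models' residuals
corr = float(np.corrcoef(residuals["linreg"], residuals["forest"])[0, 1])

# RMSE of a simple average of the two models' predictions
blend_rmse = float(np.sqrt(np.mean((y_demo - (preds["linreg"] + preds["forest"]) / 2) ** 2)))

print(f"RMSE: {rmse}, residual correlation: {corr:.2f}, blend RMSE: {blend_rmse:.3f}")
```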