## Shortlist Promising Models

This notebook assumes the data has been split into a training and a test set. If not, run get_data.ipynb first.

1. Try these models:
  - Linear Regression
  - Random Forest Regressor
  - Dense Neural Network
  - Linear SVR
2. Measure and compare their performance using RMSE, looking at both the mean and the standard deviation of the RMSE across cross-validation folds for each model
3. Make a quick round of feature selection and engineering:
  - Try transforming variables to normal distributions
  - Try removing unimportant features
  - Try adding polynomial features
4. Perform one or two more quick iterations of the previous steps.
5. Shortlist the top three to five most promising models, preferring models that
make different types of errors.
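
As a rough illustration of the "adding polynomial features" idea in step 3, here is a minimal sketch using scikit-learn's `PolynomialFeatures`. The array `X_demo` is made-up data, not the notebook's training set, and this is not necessarily how `FeaturePreprocessor` builds its feature combinations.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# made-up feature matrix: 4 samples, 2 numerical features
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# degree-2 expansion without the constant column:
# x0, x1, x0^2, x0*x1, x1^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

print(X_poly.shape)
print(X_poly[0])
```

The number of columns grows quickly with the degree, so it is probably worth re-running the cross-validation comparison after adding such features rather than assuming they help.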

In [1]:
import pandas as pd

TRAINING_FILEPATH = 'data/training_set.csv'
TEST_FILEPATH = 'data/test_set.csv'

training_set = pd.read_csv(TRAINING_FILEPATH, index_col='index')
test_set = pd.read_csv(TEST_FILEPATH, index_col='index')

In [2]:
from preprocessing_utils import separate_features_targets, FeaturePreprocessor

train_X, train_y = separate_features_targets(training_set)

# preprocess training features: power transform
feature_preprocessor = FeaturePreprocessor(add_combinations=True, powertransform_num=True, onehot_type=True)
train_X = feature_preprocessor.fit_transform(train_X)
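
The internals of `FeaturePreprocessor` are not shown in this notebook. As a hedged sketch of what a power transform on a skewed numerical column can look like, the example below applies scikit-learn's `PowerTransformer` (Yeo-Johnson) to synthetic log-normal data. The column name `ratingCount` is borrowed only as a label; the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# synthetic, right-skewed data standing in for a count-like feature
rng = np.random.default_rng(42)
skewed = pd.DataFrame({"ratingCount": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Yeo-Johnson transform; standardize=True also rescales to zero mean, unit variance
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pd.DataFrame(
    transformer.fit_transform(skewed),
    columns=skewed.columns,
    index=skewed.index,
)

skew_before = skewed["ratingCount"].skew()
skew_after = transformed["ratingCount"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformer should be fitted on the training data only and then reused to transform the test set, which is what a `fit_transform` / `transform` split is for.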

In [3]:
# the baseline RMSE is the standard deviation of the targets
train_y.std()

1.1203423450376466
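
The reason the standard deviation works as a baseline: a model that always predicts the mean of the targets has an RMSE equal to the population standard deviation of those targets. The short check below uses made-up target values. Note that pandas' `.std()` divides by n - 1 by default, so it is very slightly larger than that RMSE.

```python
import math
import statistics

# made-up target values, for illustration only
targets = [6.1, 7.4, 5.0, 8.2, 6.8, 4.9]

# RMSE of a model that always predicts the mean
mean_prediction = statistics.fmean(targets)
rmse = math.sqrt(sum((y - mean_prediction) ** 2 for y in targets) / len(targets))

population_std = statistics.pstdev(targets)  # divides by n
sample_std = statistics.stdev(targets)       # divides by n - 1, like pandas .std()

print(rmse, population_std, sample_std)
```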

In [4]:
from sklearn.dummy import DummyRegressor
from train_utils import cross_val_rmse

baseline_model = DummyRegressor(strategy='mean')
baseline_errors = cross_val_rmse(baseline_model, train_X, train_y, cv=5, random_state=42, model_name='baseline')
display(baseline_errors)

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
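
The helper `cross_val_rmse` comes from `train_utils`, which is not shown here. The sketch below is one plausible way such a helper could be written, assuming shuffled K-fold splits; the name `cross_val_rmse_sketch` and its internals are hypothetical and may differ from the real implementation. It returns a table in the same layout as the output above.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cross_val_rmse_sketch(model, X, y, cv=5, random_state=42, model_name="model"):
    """Return per-fold train/validation RMSE plus their mean and std."""
    X, y = np.asarray(X), np.asarray(y)
    folds = KFold(n_splits=cv, shuffle=True, random_state=random_state)
    scores = {"train": [], "val": []}
    for train_idx, val_idx in folds.split(X):
        # fit a fresh copy of the model on each training fold
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        for split, idx in (("train", train_idx), ("val", val_idx)):
            mse = mean_squared_error(y[idx], fitted.predict(X[idx]))
            scores[split].append(np.sqrt(mse))
    table = pd.DataFrame(
        [scores["train"], scores["val"]],
        index=[f"{model_name} train", f"{model_name} val"],
        columns=[f"fold {i}" for i in range(cv)],
    )
    fold_values = table.to_numpy()
    table["mean"] = fold_values.mean(axis=1)
    table["std"] = fold_values.std(axis=1, ddof=1)  # sample std across folds
    return table


# usage on synthetic data (not the notebook's data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
demo_errors = cross_val_rmse_sketch(
    DummyRegressor(strategy="mean"), X_demo, y_demo, model_name="baseline"
)
print(demo_errors)
```

Reporting the training RMSE next to the validation RMSE makes overfitting visible: a large gap between the two rows suggests the model is memorizing the training folds.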


In [5]:
from sklearn.linear_model import LinearRegression

linreg_model = LinearRegression()
linreg_errors = cross_val_rmse(linreg_model, train_X, train_y, cv=5, random_state=42, model_name='linreg')
display(linreg_errors)

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
linreg train,0.893267,0.896926,0.88998,0.890549,0.894051,0.892955,0.002815
linreg val,0.897161,0.881527,0.910644,0.908679,0.894478,0.898498,0.011802


In [6]:
from sklearn.ensemble import RandomForestRegressor

forestreg_model = RandomForestRegressor()
forestreg_errors = cross_val_rmse(forestreg_model, train_X, train_y, cv=5, random_state=42, model_name='forestreg')
display(forestreg_errors)

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
forestreg train,0.348015,0.354293,0.351649,0.364987,0.350212,0.353831,0.00664
forestreg val,0.858008,0.847944,0.852273,0.844156,0.828591,0.846194,0.011109


In [7]:
from sklearn.svm import LinearSVR

linsvr_model = LinearSVR(max_iter=100000)
linsvr_errors = cross_val_rmse(linsvr_model, train_X, train_y, cv=5, random_state=42, model_name='linsvr')
display(linsvr_errors)

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
linsvr train,0.906312,0.911642,0.9035,0.902954,0.907608,0.906403,0.003508
linsvr val,0.908033,0.89322,0.923385,0.920661,0.9032,0.9097,0.01249


In [8]:
# import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

def create_neuralnet_model(input_shape):
    def create_model():
        model = Sequential([
                            Dense(32, input_shape=input_shape, activation='relu'),
                            Dense(1)
                ])
        model.compile(loss='mean_squared_error', optimizer='sgd')
        return model

    # wrap the neural network model to be used by scikit-learn
    neuralnet_model = KerasRegressor(create_model, epochs=150)
    return neuralnet_model

neuralnet_model = create_neuralnet_model(train_X.shape[1:])
neuralnet_errors = cross_val_rmse(neuralnet_model, train_X, train_y, cv=5, random_state=42, model_name='neuralnet')
display(neuralnet_errors)

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
neuralnet train,0.726723,0.730062,0.737216,0.727451,0.728118,0.729914,0.004266
neuralnet val,0.819689,0.811171,0.827034,0.81255,0.81017,0.816123,0.00715


In [9]:
power_tr_model_errors = pd.concat([baseline_errors, linreg_errors, forestreg_errors, linsvr_errors, neuralnet_errors])
power_tr_model_errors

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
linreg train,0.893267,0.896926,0.88998,0.890549,0.894051,0.892955,0.002815
linreg val,0.897161,0.881527,0.910644,0.908679,0.894478,0.898498,0.011802
forestreg train,0.348015,0.354293,0.351649,0.364987,0.350212,0.353831,0.00664
forestreg val,0.858008,0.847944,0.852273,0.844156,0.828591,0.846194,0.011109
linsvr train,0.906312,0.911642,0.9035,0.902954,0.907608,0.906403,0.003508
linsvr val,0.908033,0.89322,0.923385,0.920661,0.9032,0.9097,0.01249
neuralnet train,0.726723,0.730062,0.737216,0.727451,0.728118,0.729914,0.004266
neuralnet val,0.819689,0.811171,0.827034,0.81255,0.81017,0.816123,0.00715


In [10]:
def evaluate_models(train_X, train_y):
    """Cross-validate the baseline and the four candidate models on the given
    features and return their RMSE tables stacked into one DataFrame."""
    # the scikit-learn models defined in the cells above are reused as-is
    baseline_errors = cross_val_rmse(baseline_model, train_X, train_y, cv=5, model_name='baseline')
    linreg_errors = cross_val_rmse(linreg_model, train_X, train_y, cv=5, model_name='linreg')
    forestreg_errors = cross_val_rmse(forestreg_model, train_X, train_y, cv=5, model_name='forestreg')
    linsvr_errors = cross_val_rmse(linsvr_model, train_X, train_y, cv=5, model_name='linsvr')

    # the neural network is rebuilt because its input shape depends on the number of features
    neuralnet_model = create_neuralnet_model(train_X.shape[1:])
    neuralnet_errors = cross_val_rmse(neuralnet_model, train_X, train_y, cv=5, model_name='neuralnet')

    return pd.concat([baseline_errors, linreg_errors, forestreg_errors, linsvr_errors, neuralnet_errors])

In [11]:
train_X, train_y = separate_features_targets(training_set)

# preprocess training features (standardization)
feature_preprocessor_std = FeaturePreprocessor(add_combinations=True, std_scale_num=True, onehot_type=True)
train_X_std = feature_preprocessor_std.fit_transform(train_X)

In [12]:
std_errors = evaluate_models(train_X_std, train_y)
std_errors

Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
linreg train,0.913938,0.915948,0.906405,0.909648,0.916653,0.912518,0.004373
linreg val,0.909653,0.900677,0.939456,0.92941,0.911433,0.918126,0.015838
forestreg train,0.352452,0.357309,0.352901,0.35927,0.3563,0.355646,0.002918
forestreg val,0.855975,0.828011,0.852394,0.868915,0.830429,0.847145,0.0175
linsvr train,0.931163,0.933242,0.923078,0.925309,0.929819,0.928522,0.00421
linsvr val,0.922091,0.92,0.954623,0.948079,0.918281,0.932615,0.017312
neuralnet train,0.754513,0.765305,0.760587,0.761416,0.888814,0.786127,0.057534
neuralnet val,0.847024,0.815533,0.854683,0.849999,0.940134,0.861475,0.046596


In [13]:
train_X, train_y = separate_features_targets(training_set)

train_X = feature_preprocessor.fit_transform(train_X)
forestreg_model.fit(train_X, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [14]:
forestreg_feature_importances = pd.DataFrame({'feature': train_X.columns, 'importance': forestreg_model.feature_importances_})
forestreg_feature_importances = forestreg_feature_importances.sort_values(by='importance', ascending=False)
forestreg_feature_importances = forestreg_feature_importances.reset_index(drop=True)
forestreg_feature_importances

Unnamed: 0,feature,importance
0,year,0.138135
1,ratingCount,0.126447
2,reviewsPerRating,0.11728
3,duration,0.095743
4,nrOfWins,0.07846
5,type_video.movie,0.076736
6,nrOfUserReviews,0.043138
7,nrOfPhotos,0.038654
8,nrOfNewsArticles,0.038006
9,winsPerNomination,0.025696


In [15]:
least_important_features = list(forestreg_feature_importances.iloc[-10:]['feature'].values)
least_important_features

['Mystery',
 'Biography',
 'History',
 'War',
 'Western',
 'GameShow',
 'Sport',
 'Adult',
 'type_game',
 'FilmNoir']

In [16]:
train_X, train_y = separate_features_targets(training_set)

# preprocess training features (power transform, remove least important features)
feature_preprocessor = FeaturePreprocessor(add_combinations=True, powertransform_num=True, onehot_type=True,
                                           drop_features=least_important_features)
train_X = feature_preprocessor.fit_transform(train_X)

In [17]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 11613 to 14254
Data columns (total 34 columns):
ratingCount           10840 non-null float64
duration              10840 non-null float64
year                  10840 non-null float64
nrOfWins              10840 non-null float64
nrOfNominations       10840 non-null float64
nrOfPhotos            10840 non-null float64
nrOfNewsArticles      10840 non-null float64
nrOfUserReviews       10840 non-null float64
nrOfGenre             10840 non-null float64
totalNominations      10840 non-null float64
winsPerNomination     10840 non-null float64
reviewsPerRating      10840 non-null float64
type_video.episode    10840 non-null float64
type_video.movie      10840 non-null float64
type_video.tv         10840 non-null float64
Action                10840 non-null int64
Adventure             10840 non-null int64
Animation             10840 non-null int64
Comedy                10840 non-null int64
Crime                 10840 non-null int

In [18]:
power_tr_drop_errors = evaluate_models(train_X, train_y)

In [19]:
print("Standardization Only")
display(std_errors)

print("\n\nPower Transform Only")
display(power_tr_model_errors)

print("\n\nPower Transform and Drop 10 Least Important Features")
display(power_tr_drop_errors)

Standardization Only


Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
linreg train,0.913938,0.915948,0.906405,0.909648,0.916653,0.912518,0.004373
linreg val,0.909653,0.900677,0.939456,0.92941,0.911433,0.918126,0.015838
forestreg train,0.352452,0.357309,0.352901,0.35927,0.3563,0.355646,0.002918
forestreg val,0.855975,0.828011,0.852394,0.868915,0.830429,0.847145,0.0175
linsvr train,0.931163,0.933242,0.923078,0.925309,0.929819,0.928522,0.00421
linsvr val,0.922091,0.92,0.954623,0.948079,0.918281,0.932615,0.017312
neuralnet train,0.754513,0.765305,0.760587,0.761416,0.888814,0.786127,0.057534
neuralnet val,0.847024,0.815533,0.854683,0.849999,0.940134,0.861475,0.046596




Power Transform Only


Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
linreg train,0.893267,0.896926,0.88998,0.890549,0.894051,0.892955,0.002815
linreg val,0.897161,0.881527,0.910644,0.908679,0.894478,0.898498,0.011802
forestreg train,0.348015,0.354293,0.351649,0.364987,0.350212,0.353831,0.00664
forestreg val,0.858008,0.847944,0.852273,0.844156,0.828591,0.846194,0.011109
linsvr train,0.906312,0.911642,0.9035,0.902954,0.907608,0.906403,0.003508
linsvr val,0.908033,0.89322,0.923385,0.920661,0.9032,0.9097,0.01249
neuralnet train,0.726723,0.730062,0.737216,0.727451,0.728118,0.729914,0.004266
neuralnet val,0.819689,0.811171,0.827034,0.81255,0.81017,0.816123,0.00715




Power Transform and Drop 10 Least Important Features


Unnamed: 0,fold 0,fold 1,fold 2,fold 3,fold 4,mean,std
baseline train,1.123358,1.121115,1.114294,1.119928,1.122722,1.120283,0.003609
baseline val,1.107938,1.116999,1.14397,1.121741,1.110553,1.12024,0.014327
linreg train,0.894511,0.898298,0.891318,0.891449,0.895522,0.89422,0.002937
linreg val,0.897661,0.881696,0.910642,0.910051,0.893674,0.898745,0.012113
forestreg train,0.35287,0.361841,0.353304,0.364521,0.35451,0.357409,0.005387
forestreg val,0.844424,0.83691,0.854385,0.845759,0.823655,0.841027,0.011523
linsvr train,0.907547,0.913225,0.904705,0.90397,0.908863,0.907662,0.003701
linsvr val,0.908602,0.894419,0.923609,0.921272,0.904178,0.910416,0.012145
neuralnet train,0.75531,0.748043,0.740307,0.751168,0.74472,0.747909,0.005774
neuralnet val,0.824819,0.807315,0.827559,0.832562,0.826229,0.823697,0.009612


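Notes about the models so far:
- Compared with standardization only, power-transforming the numerical columns lowers the mean validation RMSE of linear regression, linear SVR, and the neural network (for example, 0.861 to 0.816 for the neural network). The random forest is essentially unchanged (0.847 to 0.846), which is expected, since tree-based models are largely insensitive to monotonic transformations of the features.
- Removing the 10 least important features slightly raises the mean validation RMSE of every model except the random forest, where it slightly decreases (0.846 to 0.841). These changes are comparable to or smaller than the fold-to-fold standard deviation (roughly 0.01), so they should be read with caution.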
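
The comparison in these notes can be checked directly from the mean validation RMSE values in the three tables above. The sketch below copies those values into a plain dictionary; the helper name `rmse_change` is made up for this example. Negative numbers mean the error went down.

```python
# mean validation RMSE per model, copied from the "mean" column of the tables above
val_rmse = {
    "standardization": {
        "linreg": 0.918126, "forestreg": 0.847145, "linsvr": 0.932615, "neuralnet": 0.861475,
    },
    "power transform": {
        "linreg": 0.898498, "forestreg": 0.846194, "linsvr": 0.909700, "neuralnet": 0.816123,
    },
    "power transform + drop": {
        "linreg": 0.898745, "forestreg": 0.841027, "linsvr": 0.910416, "neuralnet": 0.823697,
    },
}


def rmse_change(before, after):
    """Change in mean validation RMSE per model when moving from one setup to another."""
    return {
        model: round(val_rmse[after][model] - val_rmse[before][model], 6)
        for model in val_rmse[before]
    }


power_vs_std = rmse_change("standardization", "power transform")
drop_vs_power = rmse_change("power transform", "power transform + drop")

print("power transform vs. standardization:", power_vs_std)
print("drop 10 features vs. power transform:", drop_vs_power)
```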

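Best models so far (lowest mean validation RMSE):
- Dense Neural Network (about 0.82 with the power transform)
- Random Forest Regressor (about 0.84 to 0.85, although it overfits heavily: its training RMSE is about 0.35)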
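
Step 5 prefers models that make different types of errors. One rough way to check this is to compare the out-of-fold residuals of two models: a low correlation suggests they err on different samples and may combine well later. The sketch below uses synthetic data from `make_regression` and two scikit-learn models as stand-ins; it is an illustration under those assumptions, not a result for this notebook's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# synthetic regression problem (not the notebook's data set)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "linreg": LinearRegression(),
    "forestreg": RandomForestRegressor(n_estimators=50, random_state=0),
}

# out-of-fold residuals: each sample is predicted by a model that never saw it
residuals = {
    name: y_demo - cross_val_predict(model, X_demo, y_demo, cv=folds)
    for name, model in models.items()
}

residual_corr = np.corrcoef(residuals["linreg"], residuals["forestreg"])[0, 1]
print(f"residual correlation: {residual_corr:.3f}")
```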