# Ensemble Model

In this section, we train and evaluate the ensemble model, a linear regression over the predictions of the component models.

## Training

We import the predictions on ensemble_train made by each component model. These data are used as the training input for our ensemble model.

In [None]:
brismf_train = pd.read_csv('brismf_train_predicted.csv')
iter_svd_train = pd.read_csv('iter_svd_train_predicted.csv')
svdpp_train = pd.read_csv('svdpp_train_predicted.csv')
nb_corr_train = pd.read_csv('predicted_rating_nb_corr_train.csv')
nb_ls_train = pd.read_csv('predicted_rating_nb_ls_train.csv')
funk_svd_train = pd.read_csv('predicted_rating_funk_svd_train.csv')

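We combine the predictions column-wise into a single DataFrame usable by sklearn's LinearRegression. Note that iter_svd_train is loaded above but is not part of the concatenation, so the ensemble is built from five component models.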

In [None]:
# Combine the predictions into a large DataFrame
X_train = pd.concat([brismf_train, svdpp_train, nb_corr_train, nb_ls_train, funk_svd_train], axis=1)
X_train

(index),Unnamed: 0,PredictedRatingsBRISMF,Unnamed: 0.1,TrueRating_svdpp,PredictedRating_svdpp,predicted_rating_nb_corr,predicted_rating_nb_ls,predicted_rating_funk_svd
0,0,3.656787,0,5,3.712891,3.657468,3.657468,3.755237
1,1,4.437363,1,4,4.564961,4.437715,4.437715,4.700147
2,2,3.558480,2,4,3.539099,3.557692,3.557692,3.552832
3,3,4.288393,3,4,4.384660,4.287962,4.287962,4.527634
4,4,4.336045,4,5,4.507645,4.336538,4.336538,4.484777
...,...,...,...,...,...,...,...,...
79995,79995,3.892506,79995,5,4.008569,3.888232,3.888232,4.063092
79996,79996,3.150403,79996,3,2.809697,3.145349,3.145349,2.755603
79997,79997,3.799601,79997,3,3.777003,3.797619,3.797619,3.774624
79998,79998,3.379320,79998,3,3.119826,3.377966,3.377966,3.171951


In [None]:
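# Drop the two leftover CSV index columns and TrueRating_svdpp (positions 0, 2, 3),
# keeping only the five prediction columns
X_train = X_train.drop(columns=X_train.columns[[0, 2, 3]])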
X_train

(index),PredictedRatingsBRISMF,PredictedRating_svdpp,predicted_rating_nb_corr,predicted_rating_nb_ls,predicted_rating_funk_svd
0,3.656787,3.712891,3.657468,3.657468,3.755237
1,4.437363,4.564961,4.437715,4.437715,4.700147
2,3.558480,3.539099,3.557692,3.557692,3.552832
3,4.288393,4.384660,4.287962,4.287962,4.527634
4,4.336045,4.507645,4.336538,4.336538,4.484777
...,...,...,...,...,...
79995,3.892506,4.008569,3.888232,3.888232,4.063092
79996,3.150403,2.809697,3.145349,3.145349,2.755603
79997,3.799601,3.777003,3.797619,3.797619,3.774624
79998,3.379320,3.119826,3.377966,3.377966,3.171951
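
Dropping columns by position works only as long as the CSV files keep their current column order. As a minimal sketch of an alternative, the prediction columns can be selected by name. The toy frame below reuses the column names and the first two rows shown above; the regex is an assumption about the naming pattern and is not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same columns as the concatenated X_train (first two rows only)
combined = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "PredictedRatingsBRISMF": [3.656787, 4.437363],
    "Unnamed: 0.1": [0, 1],
    "TrueRating_svdpp": [5, 4],
    "PredictedRating_svdpp": [3.712891, 4.564961],
    "predicted_rating_nb_corr": [3.657468, 4.437715],
    "predicted_rating_nb_ls": [3.657468, 4.437715],
    "predicted_rating_funk_svd": [3.755237, 4.700147],
})

# Keep only columns whose name starts with "predicted" (case-insensitive)
features = combined.filter(regex=r"(?i)^predicted")
print(list(features.columns))
```

Selecting by name leaves the five prediction columns in their original order, so the fitted coefficients would line up with the same models as before.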


Here we prepare the training targets: the true ratings in ensemble_train.

In [None]:
y_train = ensemble_train['rating']
y_train

(index),rating
140476,5
52693,4
99958,4
96366,4
102343,5
...,...
141665,5
142463,3
68961,3
151628,3
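
LinearRegression pairs the rows of X and y by position, not by index label, so X_train (indexed from 0) and y_train (which keeps original row labels such as 140476) must be in the same row order. One way to check this, sketched here on the first five rows shown above, is to compare y_train with the TrueRating_svdpp column that came with the SVD++ predictions. Treating that column as a reliable copy of the true ratings is an assumption.

```python
import pandas as pd

# First five true ratings carried in the SVD++ predictions file (index 0..4)
true_rating_svdpp = pd.Series([5, 4, 4, 4, 5], name="TrueRating_svdpp")

# First five entries of y_train, which keeps the original row labels as its index
y_head = pd.Series([5, 4, 4, 4, 5],
                   index=[140476, 52693, 99958, 96366, 102343],
                   name="rating")

# Compare by position; comparing the two Series directly would fail
# because their index labels differ
same_order = bool((true_rating_svdpp.to_numpy() == y_head.to_numpy()).all())
print(same_order)
```

In the notebook, the same check would compare svdpp_train['TrueRating_svdpp'] with y_train over all rows.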


We now train the ensemble model using sklearn's LinearRegression. The predictions of each component model on ensemble_train are treated as the input matrix $X$, and the actual labels of ensemble_train form the output vector $y$.

In [None]:
# Import needed packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Model training
ensemble = LinearRegression()
ensemble.fit(X_train, y_train)

In [None]:
# Fitted regression coefficients, in X_train column order:
# BRISMF, SVD++, nb_corr, nb_ls, Funk SVD
ensemble.coef_

array([ 0.12225842, -0.15651894, -0.03258624, -0.03502745,  1.08396098])

In [None]:
# Fitted intercept term
ensemble.intercept_

0.06833796090471456
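
The fit above is an ordinary least squares problem on the stacked predictions: the ensemble output is the intercept plus a weighted sum of the component predictions. Below is a minimal NumPy sketch of the same idea on synthetic data. The data, seed, weights, and variable names are illustrative assumptions and are unrelated to the notebook's files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "component predictions": 200 samples from 3 hypothetical models
P = rng.uniform(1.0, 5.0, size=(200, 3))

# Hypothetical blending weights and intercept used to generate noise-free targets
true_w = np.array([0.2, -0.1, 0.9])
true_b = 0.05
y = P @ true_w + true_b

# Ordinary least squares with an intercept: append a column of ones and solve
A = np.column_stack([P, np.ones(len(P))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]

print(w_hat, b_hat)
```

Because the synthetic targets contain no noise, the recovered weights and intercept match the generating values up to floating-point error. With real ratings, the fit instead minimizes the squared error between the blended prediction and the true rating.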

## Testing

We now test the ensemble model. We first import the predictions on the test set made by each component model. These data are used as the testing input for our ensemble model.

In [None]:
brismf_test = pd.read_csv('brismf_test_predicted.csv')
iter_svd_test = pd.read_csv('iter_svd_test_predicted.csv')
svdpp_test = pd.read_csv('svdpp_test_predicted.csv')
nb_corr_test = pd.read_csv('predicted_rating_nb_corr_test.csv')
nb_ls_test = pd.read_csv('predicted_rating_nb_ls_test.csv')
funk_svd_test = pd.read_csv('predicted_rating_funk_svd_test.csv')

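We combine the predictions column-wise into a single DataFrame, using the same five component models and column order as in training. As before, iter_svd_test is loaded but is not part of the concatenation.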

In [None]:
# Combine the predictions into a large DataFrame
X_test = pd.concat([brismf_test, svdpp_test, nb_corr_test, nb_ls_test, funk_svd_test], axis=1)
X_test

(index),Unnamed: 0,PredictedRatingsBRISMF,Unnamed: 0.1,TrueRating_svdpp,PredictedRating_svdpp,predicted_rating_nb_corr,predicted_rating_nb_ls,predicted_rating_funk_svd
0,0,2.848916,0,2,2.479055,2.849673,2.849673,2.385604
1,1,3.371281,1,1,3.296824,3.369841,3.369841,3.234210
2,2,3.010753,2,2,3.107863,3.343739,3.343739,3.057453
3,3,3.335129,3,3,3.157093,3.333333,3.333333,3.134786
4,4,3.659837,4,2,3.544076,3.666804,3.666804,3.553986
...,...,...,...,...,...,...,...,...
19995,19995,4.035565,19995,4,4.025113,4.035714,4.035714,4.009129
19996,19996,3.933333,19996,4,3.934987,3.805029,3.805029,3.926119
19997,19997,4.183126,19997,5,3.485354,4.182927,4.182927,3.430911
19998,19998,2.994034,19998,2,2.490224,2.993351,2.993351,2.402651


In [None]:
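# Drop the two leftover CSV index columns and TrueRating_svdpp (positions 0, 2, 3),
# so that X_test has the same five columns as X_train
X_test = X_test.drop(columns=X_test.columns[[0, 2, 3]])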
X_test

(index),PredictedRatingsBRISMF,PredictedRating_svdpp,predicted_rating_nb_corr,predicted_rating_nb_ls,predicted_rating_funk_svd
0,2.848916,2.479055,2.849673,2.849673,2.385604
1,3.371281,3.296824,3.369841,3.369841,3.234210
2,3.010753,3.107863,3.343739,3.343739,3.057453
3,3.335129,3.157093,3.333333,3.333333,3.134786
4,3.659837,3.544076,3.666804,3.666804,3.553986
...,...,...,...,...,...
19995,4.035565,4.025113,4.035714,4.035714,4.009129
19996,3.933333,3.934987,3.805029,3.805029,3.926119
19997,4.183126,3.485354,4.182927,4.182927,3.430911
19998,2.994034,2.490224,2.993351,2.993351,2.402651


Here we prepare the actual ratings in the test set.

In [None]:
y_test = test['rating']
y_test

(index),rating
180000,2
180001,1
180002,2
180003,3
180004,2
...,...
199995,4
199996,4
199997,5
199998,2


We predict ratings for the test set with the fitted ensemble and compute the test RMSE.

In [None]:
# Make predictions on the test set
ensembled_predictions = ensemble.predict(X_test)
ensembled_predictions

array([2.42184742, 3.24239962, 3.03806534, ..., 3.4703873 , 2.44660502,
       3.74009021])
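
As a sanity check, the first prediction can be reproduced by hand from the fitted coefficients and intercept reported in the training section and the first row of X_test. The sketch below copies those printed values; since the table entries are rounded to six decimals, the result should agree with the first array entry (about 2.4218) only to several decimal places.

```python
# Values copied from the outputs printed in this section
coef = [0.12225842, -0.15651894, -0.03258624, -0.03502745, 1.08396098]
intercept = 0.06833796090471456

# First row of X_test, in column order: BRISMF, SVD++, nb_corr, nb_ls, Funk SVD
row0 = [2.848916, 2.479055, 2.849673, 2.849673, 2.385604]

# Linear model: intercept plus the weighted sum of the component predictions
prediction = intercept + sum(w * x for w, x in zip(coef, row0))
print(round(prediction, 4))
```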

In [None]:
# Compute the test RMSE
compute_rmse(ensembled_predictions, y_test)

0.9257528036019571
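
The helper compute_rmse is defined elsewhere in the notebook and is not shown in this section. As a minimal sketch, assuming it implements the standard root-mean-square error and takes predictions first and targets second (as in the call above), it might look like this:

```python
import numpy as np

def compute_rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

# Tiny example: the errors are 0, 0 and -2, so the RMSE is sqrt(4 / 3)
print(compute_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Converting both inputs with np.asarray means a pandas Series such as y_test is compared by position, regardless of its index labels.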