####    MODELING
    
Modeling is the process of training a machine learning algorithm to predict the label 
from the features, tuning it for the business need, and validating it on holdout data. 
    
In the movie recommendation engine, we are looking for a model that predicts the rating of
a movie. Because the ratings lie on a numeric scale (0.5 to 5 in this data) and can take any
floating-point value on it (rounded to 1 decimal place), this is a regression task.

In [1]:
# Import libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 

%matplotlib inline

In [2]:
# Import Dataset; this is the data that we prepared
movie_ratings = pd.read_csv("movie_ratings_features.csv")
movie_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,global_avg_rating,movie_avg_rating,user_avg_rating,sim_movie_1,sim_movie_2,sim_movie_3,sim_movie_4,sim_movie_5,sim_user_1,sim_user_2,sim_user_3,sim_user_4,sim_user_5
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,3.501557,3.92093,4.366379,4.0,4.0,3.0,4.0,4.0,5.0,3.0,4.0,3.0,3.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,3.501557,3.92093,4.366379,4.0,4.0,3.0,4.0,4.0,5.0,3.0,4.0,3.0,3.0
2,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,3.501557,3.259615,4.366379,3.0,2.0,3.0,3.0,3.0,5.0,3.0,4.0,3.0,3.0
3,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,3.501557,3.259615,4.366379,3.0,2.0,3.0,3.0,3.0,5.0,3.0,4.0,3.0,3.0
4,6,Heat (1995),Action|Crime|Thriller,1,4.0,3.501557,3.946078,4.366379,4.0,4.0,4.0,4.0,4.0,5.0,3.0,4.0,3.0,3.0


In [3]:
# Drop "title", "genres"...
# These features would need serious encoding to be used in our models...
movie_ratings.drop(labels=['title', 'genres'], axis=1, inplace=True)

In [4]:
# Round "global_avg_rating", "movie_avg_rating", and "user_avg_rating" to 1 decimal place
movie_ratings['global_avg_rating'] = np.round(movie_ratings['global_avg_rating'], decimals=1)
movie_ratings['movie_avg_rating'] = np.round(movie_ratings['movie_avg_rating'], decimals=1)
movie_ratings['user_avg_rating'] = np.round(movie_ratings['user_avg_rating'], decimals=1)

In [5]:
# Checking for missing values
movie_ratings.isna().sum()

movieId              0
userId               0
rating               0
global_avg_rating    0
movie_avg_rating     0
user_avg_rating      0
sim_movie_1          0
sim_movie_2          0
sim_movie_3          0
sim_movie_4          0
sim_movie_5          0
sim_user_1           0
sim_user_2           0
sim_user_3           0
sim_user_4           0
sim_user_5           0
dtype: int64

In [6]:
movie_ratings.describe()

Unnamed: 0,movieId,userId,rating,global_avg_rating,movie_avg_rating,user_avg_rating,sim_movie_1,sim_movie_2,sim_movie_3,sim_movie_4,sim_movie_5,sim_user_1,sim_user_2,sim_user_3,sim_user_4,sim_user_5
count,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0,201660.0
mean,19430.333839,326.131618,3.501567,3.5,3.500952,3.500876,3.494739,3.466235,3.468214,3.45902,3.459942,3.481137,3.504939,3.511524,3.528136,3.475483
std,35523.814773,182.616124,1.042546,0.0,0.565794,0.466348,0.66264,0.693876,0.701689,0.690533,0.679804,0.667666,0.665892,0.623058,0.635271,0.584784
min,1.0,1.0,0.5,3.5,0.5,1.3,0.0,0.5,0.0,0.5,0.5,1.0,1.0,1.0,1.0,1.0
25%,1199.0,177.0,3.0,3.5,3.2,3.3,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
50%,2991.0,325.0,3.5,3.5,3.6,3.5,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0
75%,8119.0,477.0,4.0,3.5,3.9,3.8,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
max,193609.0,610.0,5.0,3.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


    'movieId' and 'userId' are on a different scale compared to other features. We can standardize all features
    to be on the same scale or we can experiment with the features as they are.
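
    As a minimal sketch of the standardization option mentioned above, the snippet below applies
    StandardScaler to a small made-up frame. The `toy` frame and its values are assumptions for
    illustration only; they are not rows from `movie_ratings`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for movie_ratings (values are made up for illustration)
toy = pd.DataFrame({
    "movieId": [1, 3, 6, 193609],
    "userId": [1, 1, 1, 610],
    "movie_avg_rating": [3.9, 3.3, 3.9, 3.5],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# After scaling, every column has mean 0 and unit (population) standard deviation
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

    In practice the scaler would be fit on the training split only (or placed inside a pipeline),
    so that statistics from the test set do not leak into training.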

    * Model Selection
    
    In this process, we will train the different models below and select which one to use based
    on their scores. In the training phase, we will use the cross-validation score, so that every
    part of the training data is used for both fitting and validation.
    
        Models:
            i)   Support Vector Machine - Regressor
            ii)  Stochastic Gradient Descent - Regressor
            iii) Nearest Neighbours Regressor
            iv)  Decision Trees Regressor
            v)   Xgboost Regressor

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

In [8]:
X = movie_ratings.drop(labels='rating', axis=1) # Features
y = movie_ratings['rating'] # Label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
from sklearn import svm # Support Vector Machines Regressor
from sklearn.linear_model import SGDRegressor # Stochastic Gradient Descent Regressor
from sklearn.neighbors import KNeighborsRegressor # Nearest Neighbours Regressor
from sklearn.tree import DecisionTreeRegressor # Decision Tree Regressor
import xgboost as xgb # Xgboost model

    In the process below, we use the training set (80% of the entire dataset) for cross-validation scoring.
    The idea is to pick a model based on its cross-validation score, and later evaluate that model
    on the test set to see how well it generalizes.

In [10]:
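# Candidate models; svm.SVR() is imported above but is not included in this run
models = [DecisionTreeRegressor(), SGDRegressor(), KNeighborsRegressor(), xgb.XGBRegressor()]

for model in models:
    # With no `scoring` argument, cross_val_score uses the estimator's default scorer (R^2 for regressors)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(str(model))
    print("Score: {}".format(scores.mean()))
    print("")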

DecisionTreeRegressor()
Score: 0.5785704131077829

SGDRegressor()
Score: -2.7989811511858286e+34

KNeighborsRegressor()
Score: 0.13957204451885083

XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=None, importance_type='gain', interaction_constraints=None,
             learning_rate=None, max_delta_step=None, max_depth=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             random_state=None, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=None, tree_method=None,
             validate_parameters=None, verbosity=None)
Score: 0.4618884103727007



    Score Interpretation:
    
    * The score above is the R^2 (coefficient of determination).
    * R^2 is at most 1 and typically lies between 0 and 1, but negative scores are possible. They indicate
      a model that fits worse than always predicting the mean rating. Hence, we pick the model whose score
      is closest to 1.
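
    To make the interpretation above concrete, here is a small sketch that computes R^2 by hand and
    compares it with scikit-learn's r2_score. The arrays and the helper `r2_by_hand` are made up for
    illustration; they are not taken from the movie data.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 4.0, 5.0, 2.0])
y_mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean rating
y_bad_pred = np.array([5.0, 1.0, 1.0, 5.0])        # predictions worse than the mean

def r2_by_hand(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_mean = r2_by_hand(y_true, y_mean_pred)  # baseline: predicting the mean
r2_bad = r2_by_hand(y_true, y_bad_pred)    # negative: worse than the baseline

print("Mean predictor R^2: {}".format(r2_mean))
print("Bad predictor R^2: {}".format(r2_bad))
```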
      
    From the models above, we will pick DecisionTreeRegressor(), which has the highest score (about 0.58).
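
    A side note on the SGDRegressor score: a value that far below zero is consistent with gradient
    descent diverging on unscaled inputs such as 'movieId'. This is an assumption and has not been
    verified on the movie data. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up)
    to show how scaling can be placed in a pipeline, so that it is re-fit inside every
    cross-validation fold.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data: one ID-like column with a huge
# range and one rating-like column (values are made up for illustration)
rng = np.random.default_rng(42)
n = 1000
ids = rng.integers(1, 190000, size=n).astype(float)
avg_rating = rng.uniform(1, 5, size=n)
X_demo = np.column_stack([ids, avg_rating])
y_demo = 0.8 * avg_rating + 0.7 + rng.normal(0, 0.2, size=n)

# The scaler is fit on the training part of each fold only
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_scores = cross_val_score(scaled_sgd, X_demo, y_demo, cv=5)
print("Scaled SGD R^2: {}".format(scaled_scores.mean()))
```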

    * Hyper-parameter Tuning

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [12]:
params = {
    'max_depth': [5,10,15,20,30,40,50], 
    'max_leaf_nodes': [5,10,15,20,25]
}

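# make_scorer assumes greater_is_better=True by default, so with a raw error metric
# GridSearchCV ranks the combination with the largest MAE as "best"
mae_scorer = make_scorer(mean_absolute_error)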

grid_search_cv = GridSearchCV(DecisionTreeRegressor(), param_grid=params, cv=5, scoring=mae_scorer)
grid_search_cv.fit(X_train, y_train)
print("Best score: {}".format(grid_search_cv.best_score_))
print("Best params: {}".format(grid_search_cv.best_params_))

Best score: 0.6756754358864777
Best params: {'max_depth': 5, 'max_leaf_nodes': 5}


    Score: 
        - Mean Absolute Error (MAE)
        - MAE is the average absolute difference between the actual and predicted values. For a given
          rating, our model's prediction is off by about 0.68, above or below, on average.
        
    Best Parameters:
        - max_depth: 5
        - max_leaf_nodes: 5    

    Note:
        - Given the best parameters, we will test our model to see how it performs on unseen data.
        - In testing our model, we will evaluate its performance based on the Mean Absolute Error.
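
    One caveat about the search above: GridSearchCV always maximizes its scoring function, and
    make_scorer treats higher values as better unless told otherwise. A possible alternative is the
    built-in "neg_mean_absolute_error" scorer, which negates the MAE so that the smallest error wins.
    The sketch below runs on synthetic data (`X_demo`, `y_demo` and the reduced grid are assumptions
    for illustration), so its output says nothing about the movie data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X_train / y_train (not the movie data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 5, size=(400, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.2, size=400)

params = {"max_depth": [2, 5, 10], "max_leaf_nodes": [5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=params,
    cv=5,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MAE, so flip the sign to report the error itself
best_mae = -search.best_score_
print("Best params: {}".format(search.best_params_))
print("Cross-validated MAE: {}".format(best_mae))
```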

In [15]:
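def train_and_score():
    # 'absolute_error' is the current name of the criterion formerly spelled 'mae'
    # (renamed in scikit-learn 1.0)
    tree_reg = DecisionTreeRegressor(max_depth=5, max_leaf_nodes=5, criterion='absolute_error')
    tree_reg.fit(X_train, y_train)
    # Note: score() returns R^2 on the test set, not the mean absolute error
    score = tree_reg.score(X_test, y_test)
    return score, tree_reg

In [16]:
score, model = train_and_score()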

    Interpretation:
    
    - The model obtained 0.28 on the test set. Because score() returns R^2 rather than the MAE, this value
      cannot be compared directly with the cross-validated MAE of 0.68 from the grid search.
    - Using this model, we can experiment with predicting the rating of a movie by a user.
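
    Since score() on a regressor returns R^2, a test-set MAE has to be computed explicitly from the
    predictions. The sketch below shows one way to do that with mean_absolute_error. It uses synthetic
    data (`X_demo`, `y_demo` are made up), so the printed numbers do not describe the movie model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and ratings (not the movie data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(500, 3))
y_demo = X_demo[:, 0] + rng.normal(0, 0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(max_depth=5, max_leaf_nodes=5, random_state=0)
tree.fit(X_tr, y_tr)
pred = tree.predict(X_te)

test_mae = mean_absolute_error(y_te, pred)  # same unit as the ratings
test_r2 = r2_score(y_te, pred)              # identical to tree.score(X_te, y_te)

print("Test MAE: {}".format(test_mae))
print("Test R^2: {}".format(test_r2))
```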

    * Saving the Model

In [17]:
import pickle

In [19]:
# Save to file in the current working directory
pkl_filename = "model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)
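
    For completeness, a short sketch of the reverse step: loading a pickled model and checking that it
    predicts the same values as before saving. The model, the data and the temporary path below are
    made up for illustration; in this notebook the file would be "model.pkl".

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small made-up model standing in for the trained decision tree
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(0, 0.3, size=100)
demo_model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory, then load it back
demo_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(demo_path, "wb") as file:
    pickle.dump(demo_model, file)

with open(demo_path, "rb") as file:
    loaded_model = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(loaded_model.predict(X_demo), demo_model.predict(X_demo))
print("Predictions match: {}".format(same_predictions))
```

    Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn
    version that was used to save them.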