# Amazon Recommender System

In this notebook we prototype a set of models to be used in order to build a Recommender System for the Amazon music data.

We first establish a few simple baselines and then progress to implementing two different classes of models:
*  Collaborative filtering models based only on user and item data (no text)
*  A textual based model

We then finally combine the best model from each class into a meta model and evaluate it's performance.

We start with some necessary imports

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
plt.style.use('seaborn-darkgrid')
%matplotlib inline

random.seed(17)

Loading data from json into a pandas dataframe

In [2]:
data = None
with open(os.path.join('data', 'train.json'), 'r') as train_file:
    data = [json.loads(row) for row in train_file]

In [3]:
data_df = pd.DataFrame(data)
data_df = data_df[0:20000]
data_df

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,image
0,4.0,"08 24, 2010",u04428712,"So is Katy Perry's new album ""Teenage Dream"" c...",Amazing that I Actually Bought This...More Ama...,1282608000,Pop,$35.93,p70761125,85559980,
1,5.0,"10 31, 2009",u06946603,"I got this CD almost 10 years ago, and given t...",Excellent album,1256947200,Alternative Rock,$11.28,p85427891,41699565,
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,
4,4.0,"10 12, 2015",u07141505,"Look at all star cast. Outstanding record, pl...",Love these guys.,1444608000,Jazz,$15.24,p82618188,53377470,
...,...,...,...,...,...,...,...,...,...,...,...
19995,5.0,"03 12, 2018",u60319848,LOve you Jerry,Five Stars,1520812800,Pop,$19.99,p57635618,78272876,
19996,4.0,"04 14, 2011",u87292135,"This album is appropriately titled ""Wasting Li...",Wasting Light Wastes None of the Band's Talent,1302739200,Pop,$11.04,p39196472,64829313,
19997,3.0,"07 22, 2014",u77481859,MOTT it's NOTT! I have been a Mott and Ian Hun...,MOTT IT'S NOTT!,1405987200,Pop,$12.98,p51004404,81054257,
19998,5.0,"07 23, 2015",u98141334,"Excelent cd, great band, great sound great exp...",Five Stars,1437609600,Jazz,$22.98,p10929209,53091955,


Defining a function to handle price so that it can be converted to a float

In [4]:
def trim_price(price):
    """Trims `price` to remove the $ sign.
    
    If the price variable does not have the format $x.xx
    then the empty string is returned.
    
    Parameters
    ----------
    price: str
        A string representing a price.
    
    Returns
    -------
    str
        A string representing `price` but with the $ sign removed,
        or the empty string if `price` does not have the correct
        format.
    
    """
    if (not pd.isnull(price) and isinstance(price, str) and
        len(price) > 0 and price[0] == '$'):
        return price[1:]
    return ""

### Preprocessing

We add some additional features:
*  reviewMonth - the month in which the review was done.
*  reviewYear - the year in which the review was done.
*  reviewHour - the hour in which the review was done
*  cleanedPrice - a numeric version of the price column. We only keep this column if the price is correctly formatted.
*  fullReviewText - a column that combines the summary followed by reviewText
*  reviewWordCount - indicates whether the record has an associated review based on the fullReviewText column

We also add an indicator variable for each music category to indicate if the record is in that category

In [5]:
from datetime import datetime

data_df['reviewMonth'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[0])
data_df['reviewYear'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[2])
data_df['reviewHour'] = data_df['unixReviewTime'].apply(lambda x: datetime.fromtimestamp(x).hour)
data_df['reviewMonthYear'] = data_df['reviewYear'] + '-' + data_df['reviewMonth']

data_df['cleanedPrice'] = data_df['price'].apply(lambda x: trim_price(x))
data_df = data_df[data_df['cleanedPrice'] != ""]
data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')

data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
data_df['fixedSummary'] = np.where(pd.isnull(data_df['summary']), "", data_df['summary'])
data_df['fullReviewText'] = data_df['fixedSummary'] + " " + data_df['fixedReviewText']

data_df = data_df.drop(columns=['fixedReviewText', 'fixedSummary'])

genres = data_df['category'].unique()

for genre in genres:
    genre_col = "is" + genre.replace(" ", "").replace("&", "")
    data_df[genre_col] = data_df['category'].apply(lambda x: 1 if x == genre else 0)

data_df['reviewWordCount'] = data_df['fullReviewText'].apply(lambda x: len(x.split()))

data_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['fixedSummary'] = np.where(pd.is

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,...,reviewHour,reviewMonthYear,cleanedPrice,fullReviewText,isPop,isAlternativeRock,isJazz,isClassical,isDanceElectronic,reviewWordCount
0,4.0,"08 24, 2010",u04428712,"So is Katy Perry's new album ""Teenage Dream"" c...",Amazing that I Actually Bought This...More Ama...,1282608000,Pop,$35.93,p70761125,85559980,...,20,2010-08,35.93,Amazing that I Actually Bought This...More Ama...,1,0,0,0,0,277
1,5.0,"10 31, 2009",u06946603,"I got this CD almost 10 years ago, and given t...",Excellent album,1256947200,Alternative Rock,$11.28,p85427891,41699565,...,20,2009-10,11.28,Excellent album I got this CD almost 10 years ...,0,1,0,0,0,125
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,...,20,2015-10,89.86,"Love the Music, Hate the Light Show I REALLY e...",1,0,0,0,0,133
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,...,20,2017-06,11.89,Great Finally got it . It was everything thoug...,1,0,0,0,0,15
4,4.0,"10 12, 2015",u07141505,"Look at all star cast. Outstanding record, pl...",Love these guys.,1444608000,Jazz,$15.24,p82618188,53377470,...,20,2015-10,15.24,Love these guys. Look at all star cast. Outst...,0,0,1,0,0,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19994,5.0,"04 20, 2002",u73076812,I must have at least four different box sets o...,One of the best Trio collections out there,1019260800,Pop,$28.43,p26751948,33022251,...,20,2002-04,28.43,One of the best Trio collections out there I m...,1,0,0,0,0,226
19995,5.0,"03 12, 2018",u60319848,LOve you Jerry,Five Stars,1520812800,Pop,$19.99,p57635618,78272876,...,20,2018-03,19.99,Five Stars LOve you Jerry,1,0,0,0,0,5
19996,4.0,"04 14, 2011",u87292135,"This album is appropriately titled ""Wasting Li...",Wasting Light Wastes None of the Band's Talent,1302739200,Pop,$11.04,p39196472,64829313,...,20,2011-04,11.04,Wasting Light Wastes None of the Band's Talent...,1,0,0,0,0,165
19997,3.0,"07 22, 2014",u77481859,MOTT it's NOTT! I have been a Mott and Ian Hun...,MOTT IT'S NOTT!,1405987200,Pop,$12.98,p51004404,81054257,...,20,2014-07,12.98,MOTT IT'S NOTT! MOTT it's NOTT! I have been a ...,1,0,0,0,0,282


### Evaluation Metrics

Definining a MSE function

In [6]:
def calculate_MSE(actuals, predicteds):
    """Calculates the Mean Squared Error between `actuals` and `predicteds`.
    
    Parameters
    ----------
    actuals: np.array
        A numpy array of the actual values.
    predicteds: np.array
        A numpy array of the predicted values.
    
    Returns
    -------
    float
        A float representing the Mean Squared Error between `actuals` and
        `predicteds`.
    
    """
    return (((actuals - predicteds)**2).sum()) / (len(actuals))

Separate targets and data.

Then split into training and validation sets.

Note that we split into validation sets for each music genre and then concatenate the data frames so that the proportion of each genre in the train and validation sets is equal

In [7]:
from sklearn.model_selection import train_test_split

genres = data_df['category'].unique()
X_train_set = []
X_val_set = []
y_train_set = []
y_val_set = []

for genre in genres:
    genre_df = data_df[data_df['category'] == genre]
    targets = genre_df['overall']
    feature_data = genre_df.drop(columns=['overall'])
    X_train, X_val, y_train, y_val = train_test_split(
        feature_data, targets, shuffle=True, test_size=0.2, random_state=17)
    X_train_set.append(X_train)
    X_val_set.append(X_val)
    y_train_set.append(y_train)
    y_val_set.append(y_val)

X_train = pd.concat(X_train_set)
X_val = pd.concat(X_val_set)
y_train = pd.concat(y_train_set)
y_val = pd.concat(y_val_set)

### Model Fitting

Throughout the model fitting process we will keep 3 arrays that store the model name, training error, and validation error respectively for all models that we prototype.

##### Baselines

We will look at two simple baseline models.

The first is the same baseline model implemented in `baseline.py`. But we will evaluate its performance on the validation set in order to fit models and compare performance on data that is distinct from the test set.

In this model we simply compute the average rating and assign this as our prediction.

In [8]:
model_names = []
train_errors = []
validation_errors = []

In [9]:
def error_on_average(targets, avg):
    """Computers the error based on using average rating as the prediction.
    
    Parameters
    ----------
    targets: np.array
        The actual ratings.
    avg: float
        The predicted rating based on an average.
    
    Returns
    -------
    float
        A float representing the mean squared error from predicting
        based on `avg`.
    
    """
    return calculate_MSE(targets, avg)

In [10]:
train_avg = y_train.mean()

model_names.append("Average")
train_errors.append(error_on_average(y_train, train_avg))
validation_errors.append(error_on_average(y_val, train_avg))

print("Training error based on average prediction: %.3f" % train_errors[0])
print("Validation error based on average prediction: %.3f" % validation_errors[0])

Training error based on average prediction: 0.984
Validation error based on average prediction: 0.977


Our second baseline model is slightly more complicated. We will calculate three types of quantities:
*  The overall average
*  The difference between the average rating for each item and the overall average
*  The difference between the average rating for each user and the overall average

Our prediction for a particular user and item will then be the sum of these 3 quantities.

We will denote this model as Weighted Average

In [11]:
train_df = pd.concat([X_train, y_train], axis=1)

train_avg_total = y_train.mean()
train_user_avg = train_df.groupby(train_df['reviewerID'], as_index=False)['overall'].mean()
train_item_avg = train_df.groupby(train_df['itemID'], as_index=False)['overall'].mean()
train_user_avg.columns = ['reviewerID', 'userAverage']
train_item_avg.columns = ['itemID', 'itemAverage']

In [12]:
def threshold_rating(rating):
    """Thresholds `rating` to lie in the range [1, 5].
    
    Parameters
    ----------
    rating: float
        The rating to be thresholded.
    
    Returns
    -------
    float
        A float representing the thresholded rating.
    
    """
    if rating < 1:
        return 1
    if rating > 5:
        return 5
    return rating

def weighted_average_error(X, y, total_avg, user_avgs, item_avgs):
    """Calculates the error based on the weighted average prediction.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    y: np.array
        A numpy array containing the targets
    total_avg: float
        The average across all users/items.
    user_avgs: pd.DataFrame
        A DataFrame containing the average rating for each user.
    item_avgs: pd.DataFrame
        A DataFrame containing the average rating for each item.
    
    Returns
    -------
    float
        A float representing the mean squared error of the predictions.
    
    """
    df_user = pd.merge(X, user_avgs, how='left', on=['reviewerID'])
    df_final = pd.merge(df_user, item_avgs, how='left', on=['itemID'])
    df_final = df_final[['userAverage', 'itemAverage']]
    df_final.fillna(total_avg)
    df_final['pred'] = df_final['userAverage'] + df_final['itemAverage'] - total_avg
    df_final['pred'].apply(lambda x: threshold_rating(x))
    return calculate_MSE(y, df_final['pred'])

In [13]:
train_MSE = weighted_average_error(X_train, y_train, train_avg_total, train_user_avg, train_item_avg)
val_MSE = weighted_average_error(X_val, y_val, train_avg_total, train_user_avg, train_item_avg)

model_names.append("Weighted Average")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on weighted average prediction: %.3f" % train_MSE)
print("Validation error based on weighted average prediction: %.3f" % val_MSE)

Training error based on weighted average prediction: 2.618
Validation error based on weighted average prediction: 0.093


##### Feature Models

In this section we build a set of feature based models to predict ratings on our validation set and compare their performance. These models do not follow the typical recommender system approach of collaborative filtering / matrix factorization

We start with a linear regression model. At this stage we do not have a lot of features and so it makes more sense to use the $L_{2}$-norm for regularization (a.k.a. ridge regression)

In [30]:
columns_to_keep = ['cleanedPrice', 'isPop', 'isAlternativeRock', 'isJazz', 'isClassical', 'isDanceElectronic', 'reviewWordCount']
X_train_reg = X_train[columns_to_keep]
X_val_reg = X_val[columns_to_keep]

In [31]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_train_reg['reviewWordCount'] = X_train_reg['reviewWordCount'].apply(lambda x: np.log(x))
X_val_reg['reviewWordCount'] = X_val_reg['reviewWordCount'].apply(lambda x: np.log(x))

X_train_reg = min_max_scaler.fit_transform(X_train_reg)
X_val_reg = min_max_scaler.transform(X_val_reg)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_reg['reviewWordCount'] = X_train_reg['reviewWordCount'].apply(lambda x: np.log(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val_reg['reviewWordCount'] = X_val_reg['reviewWordCount'].apply(lambda x: np.log(x))


In [32]:
from sklearn.linear_model import Ridge

vthreshold_rating = np.vectorize(threshold_rating)

alphas = [0.0, 0.01, 0.03, 0.1, 0.3]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Ridge(alpha=alpha)
    reg_model.fit(X_train_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_reg)))))
    print()

Alpha = 0.0
------------
Training Error: 0.9550244154429685
Validation Error: 0.9533155393424353

Alpha = 0.01
------------
Training Error: 0.9550248555325158
Validation Error: 0.9533145486453865

Alpha = 0.03
------------
Training Error: 0.9550257401702213
Validation Error: 0.9533125835818699

Alpha = 0.1
------------
Training Error: 0.9550288816532788
Validation Error: 0.9533058738935752

Alpha = 0.3
------------
Training Error: 0.9550382062304418
Validation Error: 0.9532880625744887



After $\alpha = 0.01$ the MSE does not marginally change and so we will use a regularization term of 0.01

In [33]:
reg_model = Ridge(alpha=0.01)
reg_model.fit(X_train_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_reg)))

model_names.append("L2-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.955
Validation error based on L2 regularized regression prediction: 0.953


We now look at some natural language processing models.

We start by processing the review column. This involves the following:
* Removing all non-alphanumeric characters
* converting to lower case
* removing a set of exclusion words (the english stopwords)
* and stemming (getting the root word)

In [40]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def process_review_text(review_text, exclude_text, ps):
    """Pre-processes the text given by `review_text`.
    
    Parameters
    ----------
    review_text: str
        The review text to be processed.
    exclude_text: collection
        A collection of words to be excluded.
    ps: PorterStemmer
        The PorterStemmer used to perform word stemming.
    
    Returns
    -------
    str
        A string representing the processed version of `review_text`.
    
    """
    review = re.sub('[^a-zA-Z0-9]', ' ', review_text).lower().split()
    review = [ps.stem(word) for word in review if not word in exclude_text]
    return ' '.join(review)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Matthew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
exclude_english = set(stopwords.words('english'))
ps = PorterStemmer()
X_train['processedReview'] = X_train['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_val['processedReview'] = X_val['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_train

Unnamed: 0,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,image,...,reviewMonthYear,cleanedPrice,fullReviewText,isPop,isAlternativeRock,isJazz,isClassical,isDanceElectronic,reviewWordCount,processedReview
161,"11 14, 2006",u59086436,Released in conjuction with the mock-documenta...,A True Pop Gem,1163462400,Pop,$5.84,p53306771,81549761,,...,2006-11,5.84,A True Pop Gem Released in conjuction with the...,1,0,0,0,0,471,true pop gem releas conjuct mock documentari f...
7950,"02 17, 2006",u15975953,"'The Mamas & The Papas, 16 of Their Greatest H...",Totally Evocative of our Youth.,1140134400,Pop,$7.06,p02023893,38511203,,...,2006-02,7.06,Totally Evocative of our Youth. 'The Mamas & T...,1,0,0,0,0,197,total evoc youth mama papa 16 greatest hit qui...
1409,"08 29, 2016",u36717398,This is my favorite Lady Gaga CD. I play it a...,Lady Gaga's Become a Favorite of Mine,1472428800,Pop,$26.56,p42917906,50060419,,...,2016-08,26.56,Lady Gaga's Become a Favorite of Mine This is ...,1,0,0,0,0,195,ladi gaga becom favorit mine favorit ladi gaga...
18573,"11 1, 2016",u50807686,I may be too harsh in giving this album only 3...,Good but not as special as their other albums,1477958400,Pop,$11.21,p95754874,05446185,,...,2016-11,11.21,Good but not as special as their other albums ...,1,0,0,0,0,201,good special album may harsh give album 3 star...
19049,"01 14, 2015",u67619343,Male quartets used to be so popular and this g...,Those days one could understand every word the...,1421193600,Pop,$9.95,p16122621,56569540,,...,2015-01,9.95,Those days one could understand every word the...,1,0,0,0,0,56,day one could understand everi word sang harmo...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11701,"12 23, 2010",u32058100,At first I was really let down with this album...,Took some time...,1293062400,Dance & Electronic,$2.02,p91276678,50571640,,...,2010-12,2.02,Took some time... At first I was really let do...,0,0,0,0,1,95,took time first realli let album weird deadmau...
6058,"02 17, 2015",u72292741,like,Four Stars,1424131200,Dance & Electronic,$13.96,p87964435,31658850,,...,2015-02,13.96,Four Stars like,0,0,0,0,1,3,four star like
2546,"05 3, 2000",u17343788,Moby's latest album touches on many genres of ...,Excellent,957312000,Dance & Electronic,$6.99,p33557762,03321056,,...,2000-05,6.99,Excellent Moby's latest album touches on many ...,0,0,0,0,1,80,excel mobi latest album touch mani genr music ...
19646,"11 2, 2009",u15370865,This is by far one of JR's best albums to date...,Another great album by JR,1257120000,Dance & Electronic,$9.95,p62149488,69626205,,...,2009-11,9.95,Another great album by JR This is by far one o...,0,0,0,0,1,30,anoth great album jr far one jr best album dat...


We now use a CountVectorizer to build counts of the 1500 most common words

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X_train_cv = cv.fit_transform(X_train['processedReview'])
X_val_cv = cv.transform(X_val['processedReview'])

In [43]:
import scipy.sparse as sp

X_train_reg_sp = sp.csr_matrix(X_train_reg)
X_train_cv_reg = sp.hstack((X_train_cv, X_train_reg_sp), format='csr')

X_val_reg_sp = sp.csr_matrix(X_val_reg)
X_val_cv_reg = sp.hstack((X_val_cv, X_val_reg_sp), format='csr')

Now we will fit a few sample models to this dataset.

First we will perform linear regression. In this case we use $L_{1}$ regularization as we have 1507 features.

In [46]:
from sklearn.linear_model import Lasso, LinearRegression

print("Alpha = 0")
print("------------")
reg_model = LinearRegression()
reg_model.fit(X_train_cv_reg, y_train)
print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
print()

alphas = [0.001, 0.003, 0.01, 0.03]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Lasso(alpha=alpha)
    reg_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
    print()

Alpha = 0
------------
Training Error: 0.9702267832256429
Validation Error: 0.7343669218867732

Alpha = 0.001
------------
Training Error: 1.0172938046370201
Validation Error: 0.7084168824390358

Alpha = 0.003
------------
Training Error: 1.0505511212466743
Validation Error: 0.7278164470848788

Alpha = 0.01
------------
Training Error: 0.7878701843021085
Validation Error: 0.8028965298964436

Alpha = 0.03
------------
Training Error: 0.9012085275848064
Validation Error: 0.9024814384396381



A value of $\alpha = 0.01$ seems to work the best

In [47]:
reg_model = Lasso(alpha=0.01)
reg_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))

model_names.append("CV-L1-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.788
Validation error based on L2 regularized regression prediction: 0.803


We will now try a DecisionTreeRegressor and we will try various values for `min_samples_split`. This is the minimum number of samples required to split an internal node. Intuitively, lower values will correspond to higher variance and thus overfitting, whereas higher values will correspond to higher bias and thus overfitting. Obviously, this value show be at least 2

In [54]:
from sklearn.tree import DecisionTreeRegressor
samples_split_lst = [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]

for samples_split in samples_split_lst:
    print("Samples Split = {}".format(samples_split))
    print("-------------------")
    tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=samples_split)
    tree_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))))
    print()

Samples Split = 2
-------------------
Training Error: 0.0
Validation Error: 1.3621169916434541

Samples Split = 5
-------------------
Training Error: 0.061990582372566404
Validation Error: 1.2772255986044287

Samples Split = 10
-------------------
Training Error: 0.14434726168857676
Validation Error: 1.1985976957818383

Samples Split = 20
-------------------
Training Error: 0.21662396829778507
Validation Error: 1.139970892726337

Samples Split = 50
-------------------
Training Error: 0.3487511087905338
Validation Error: 1.0140588194819098

Samples Split = 100
-------------------
Training Error: 0.4179916778859212
Validation Error: 0.9681598567080321

Samples Split = 200
-------------------
Training Error: 0.4994314849422262
Validation Error: 0.9277689865884085

Samples Split = 500
-------------------
Training Error: 0.636519419002298
Validation Error: 0.8577306502336626

Samples Split = 1000
-------------------
Training Error: 0.7138154668215363
Validation Error: 0.819981661067408

Sam

The low values of `min_samples_split` clearly show overfitting as the training error is very low but with very high validation error. Conversely, once `min_samples_split` is past 1000 the validation error is not changing much and starts to increase. So we will stick with a value of 1000

In [55]:
tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=1000)
tree_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))

model_names.append("CV-DecTree")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.714
Validation error based on L2 regularized regression prediction: 0.814
