# Amazon Recommender System

In this notebook we prototype a set of models to be used in order to build a Recommender System for the Amazon music data.

This is done on a small subset of the data (10%) corresponding to 20% rows. In this version we use a single validation set for a wide range of models in order to restrict to a few candidate models which will be trained on the full dataset using cross validation.

We first establish a few simple baselines and then progress to implementing two different classes of models:
*  Collaborative filtering models based only on user and item data (no text)
*  A textual based model

We then finally combine the best model from each class into a meta model and evaluate it's performance.

We start with some necessary imports

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
plt.style.use('seaborn-darkgrid')
%matplotlib inline

random.seed(17)

Loading data from json into a pandas dataframe

In [2]:
data = None
with open(os.path.join('data', 'train.json'), 'r') as train_file:
    data = [json.loads(row) for row in train_file]

In [3]:
data_df = pd.DataFrame(data)
data_df = data_df[0:50000]
data_df

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,image
0,4.0,"08 24, 2010",u04428712,"So is Katy Perry's new album ""Teenage Dream"" c...",Amazing that I Actually Bought This...More Ama...,1282608000,Pop,$35.93,p70761125,85559980,
1,5.0,"10 31, 2009",u06946603,"I got this CD almost 10 years ago, and given t...",Excellent album,1256947200,Alternative Rock,$11.28,p85427891,41699565,
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,
4,4.0,"10 12, 2015",u07141505,"Look at all star cast. Outstanding record, pl...",Love these guys.,1444608000,Jazz,$15.24,p82618188,53377470,
...,...,...,...,...,...,...,...,...,...,...,...
49995,5.0,"07 1, 2005",u35885825,This DVD rates a strong five stars with me. T...,Absolutely Wonderful,1120176000,Pop,$11.91,p00193036,43343710,
49996,5.0,"04 5, 2017",u10979151,No wonder this sold 32 million copies it is an...,... wonder this sold 32 million copies it is a...,1491350400,Pop,$8.78,p21938211,03563170,
49997,5.0,"06 13, 2005",u50404725,Learning to Breathe is one of my favorite albu...,Another must have from Switchfoot,1118620800,Alternative Rock,$9.99,p78687570,61979023,
49998,5.0,"03 17, 2016",u53979205,I remember first hearing this band when they c...,giving back to the band the love and joy that ...,1458172800,Jazz,$57.37,p95260169,52021524,


Defining a function to handle price so that it can be converted to a float

In [4]:
def trim_price(price):
    """Trims `price` to remove the $ sign.
    
    If the price variable does not have the format $x.xx
    then the empty string is returned.
    
    Parameters
    ----------
    price: str
        A string representing a price.
    
    Returns
    -------
    str
        A string representing `price` but with the $ sign removed,
        or the empty string if `price` does not have the correct
        format.
    
    """
    if (not pd.isnull(price) and isinstance(price, str) and
        len(price) > 0 and price[0] == '$'):
        return price[1:]
    return ""

### Preprocessing

We add some additional features:
*  reviewMonth - the month in which the review was done.
*  reviewYear - the year in which the review was done.
*  reviewHour - the hour in which the review was done
*  cleanedPrice - a numeric version of the price column. We only keep this column if the price is correctly formatted.
*  fullReviewText - a column that combines the summary followed by reviewText
*  reviewWordCount - indicates whether the record has an associated review based on the fullReviewText column

We also add an indicator variable for each music category to indicate if the record is in that category

In [5]:
from datetime import datetime

data_df['reviewMonth'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[0])
data_df['reviewYear'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[2])
data_df['reviewHour'] = data_df['unixReviewTime'].apply(lambda x: datetime.fromtimestamp(x).hour)
data_df['reviewMonthYear'] = data_df['reviewYear'] + '-' + data_df['reviewMonth']

data_df['cleanedPrice'] = data_df['price'].apply(lambda x: trim_price(x))
data_df = data_df[data_df['cleanedPrice'] != ""]
data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')

data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
data_df['fixedSummary'] = np.where(pd.isnull(data_df['summary']), "", data_df['summary'])
data_df['fullReviewText'] = data_df['fixedSummary'] + " " + data_df['fixedReviewText']

data_df = data_df.drop(columns=['fixedReviewText', 'fixedSummary'])

genres = data_df['category'].unique()

for genre in genres:
    genre_col = "is" + genre.replace(" ", "").replace("&", "")
    data_df[genre_col] = data_df['category'].apply(lambda x: 1 if x == genre else 0)

data_df['reviewWordCount'] = data_df['fullReviewText'].apply(lambda x: len(x.split()))

data_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_df['fixedSummary'] = np.where(pd.is

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,...,reviewHour,reviewMonthYear,cleanedPrice,fullReviewText,isPop,isAlternativeRock,isJazz,isClassical,isDanceElectronic,reviewWordCount
0,4.0,"08 24, 2010",u04428712,"So is Katy Perry's new album ""Teenage Dream"" c...",Amazing that I Actually Bought This...More Ama...,1282608000,Pop,$35.93,p70761125,85559980,...,20,2010-08,35.93,Amazing that I Actually Bought This...More Ama...,1,0,0,0,0,277
1,5.0,"10 31, 2009",u06946603,"I got this CD almost 10 years ago, and given t...",Excellent album,1256947200,Alternative Rock,$11.28,p85427891,41699565,...,20,2009-10,11.28,Excellent album I got this CD almost 10 years ...,0,1,0,0,0,125
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,...,20,2015-10,89.86,"Love the Music, Hate the Light Show I REALLY e...",1,0,0,0,0,133
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,...,20,2017-06,11.89,Great Finally got it . It was everything thoug...,1,0,0,0,0,15
4,4.0,"10 12, 2015",u07141505,"Look at all star cast. Outstanding record, pl...",Love these guys.,1444608000,Jazz,$15.24,p82618188,53377470,...,20,2015-10,15.24,Love these guys. Look at all star cast. Outst...,0,0,1,0,0,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,5.0,"07 1, 2005",u35885825,This DVD rates a strong five stars with me. T...,Absolutely Wonderful,1120176000,Pop,$11.91,p00193036,43343710,...,20,2005-07,11.91,Absolutely Wonderful This DVD rates a strong f...,1,0,0,0,0,130
49996,5.0,"04 5, 2017",u10979151,No wonder this sold 32 million copies it is an...,... wonder this sold 32 million copies it is a...,1491350400,Pop,$8.78,p21938211,03563170,...,20,2017-04,8.78,... wonder this sold 32 million copies it is a...,1,0,0,0,0,24
49997,5.0,"06 13, 2005",u50404725,Learning to Breathe is one of my favorite albu...,Another must have from Switchfoot,1118620800,Alternative Rock,$9.99,p78687570,61979023,...,20,2005-06,9.99,Another must have from Switchfoot Learning to ...,0,1,0,0,0,720
49998,5.0,"03 17, 2016",u53979205,I remember first hearing this band when they c...,giving back to the band the love and joy that ...,1458172800,Jazz,$57.37,p95260169,52021524,...,20,2016-03,57.37,giving back to the band the love and joy that ...,0,0,1,0,0,137


### Evaluation Metrics

Definining a MSE function

In [6]:
def calculate_MSE(actuals, predicteds):
    """Calculates the Mean Squared Error between `actuals` and `predicteds`.
    
    Parameters
    ----------
    actuals: np.array
        A numpy array of the actual values.
    predicteds: np.array
        A numpy array of the predicted values.
    
    Returns
    -------
    float
        A float representing the Mean Squared Error between `actuals` and
        `predicteds`.
    
    """
    return (((actuals - predicteds)**2).sum()) / (len(actuals))

Separate targets and data.

Then split into training and validation sets.

Note that we split into validation sets for each music genre and then concatenate the data frames so that the proportion of each genre in the train and validation sets is equal

In [7]:
from sklearn.model_selection import train_test_split

genres = data_df['category'].unique()
X_train_set = []
X_val_set = []
y_train_set = []
y_val_set = []

for genre in genres:
    genre_df = data_df[data_df['category'] == genre]
    targets = genre_df['overall']
    feature_data = genre_df.drop(columns=['overall'])
    X_train, X_val, y_train, y_val = train_test_split(
        feature_data, targets, shuffle=True, test_size=0.2, random_state=17)
    X_train_set.append(X_train)
    X_val_set.append(X_val)
    y_train_set.append(y_train)
    y_val_set.append(y_val)

X_train = pd.concat(X_train_set)
X_val = pd.concat(X_val_set)
y_train = pd.concat(y_train_set)
y_val = pd.concat(y_val_set)

### Model Fitting

Throughout the model fitting process we will keep 3 arrays that store the model name, training error, and validation error respectively for all models that we prototype.

##### Baselines

We will look at two simple baseline models.

The first is the same baseline model implemented in `baseline.py`. But we will evaluate its performance on the validation set in order to fit models and compare performance on data that is distinct from the test set.

In this model we simply compute the average rating and assign this as our prediction.

In [8]:
model_names = []
train_errors = []
validation_errors = []

In [9]:
def error_on_average(targets, avg):
    """Computers the error based on using average rating as the prediction.
    
    Parameters
    ----------
    targets: np.array
        The actual ratings.
    avg: float
        The predicted rating based on an average.
    
    Returns
    -------
    float
        A float representing the mean squared error from predicting
        based on `avg`.
    
    """
    return calculate_MSE(targets, avg)

In [10]:
train_avg = y_train.mean()

model_names.append("Average")
train_errors.append(error_on_average(y_train, train_avg))
validation_errors.append(error_on_average(y_val, train_avg))

print("Training error based on average prediction: %.3f" % train_errors[0])
print("Validation error based on average prediction: %.3f" % validation_errors[0])

Training error based on average prediction: 0.994
Validation error based on average prediction: 0.950


Our second baseline model is slightly more complicated. We will calculate three types of quantities:
*  The overall average
*  The difference between the average rating for each item and the overall average
*  The difference between the average rating for each user and the overall average

Our prediction for a particular user and item will then be the sum of these 3 quantities.

We will denote this model as Weighted Average

In [11]:
train_df = pd.concat([X_train, y_train], axis=1)

train_avg_total = y_train.mean()
train_user_avg = train_df.groupby(train_df['reviewerID'], as_index=False)['overall'].mean()
train_item_avg = train_df.groupby(train_df['itemID'], as_index=False)['overall'].mean()
train_user_avg.columns = ['reviewerID', 'userAverage']
train_item_avg.columns = ['itemID', 'itemAverage']

In [12]:
def threshold_rating(rating):
    """Thresholds `rating` to lie in the range [1, 5].
    
    Parameters
    ----------
    rating: float
        The rating to be thresholded.
    
    Returns
    -------
    float
        A float representing the thresholded rating.
    
    """
    if rating < 1:
        return 1
    if rating > 5:
        return 5
    return rating

def weighted_average_error(X, y, total_avg, user_avgs, item_avgs):
    """Calculates the error based on the weighted average prediction.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    y: np.array
        A numpy array containing the targets
    total_avg: float
        The average across all users/items.
    user_avgs: pd.DataFrame
        A DataFrame containing the average rating for each user.
    item_avgs: pd.DataFrame
        A DataFrame containing the average rating for each item.
    
    Returns
    -------
    float
        A float representing the mean squared error of the predictions.
    
    """
    df_user = pd.merge(X, user_avgs, how='left', on=['reviewerID'])
    df_final = pd.merge(df_user, item_avgs, how='left', on=['itemID'])
    df_final = df_final[['userAverage', 'itemAverage']]
    df_final.fillna(total_avg)
    df_final['pred'] = df_final['userAverage'] + df_final['itemAverage'] - total_avg
    df_final['pred'].apply(lambda x: threshold_rating(x))
    return calculate_MSE(y, df_final['pred'])

In [13]:
train_MSE = weighted_average_error(X_train, y_train, train_avg_total, train_user_avg, train_item_avg)
val_MSE = weighted_average_error(X_val, y_val, train_avg_total, train_user_avg, train_item_avg)

model_names.append("Weighted Average")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on weighted average prediction: %.3f" % train_MSE)
print("Validation error based on weighted average prediction: %.3f" % val_MSE)

Training error based on weighted average prediction: 2.234
Validation error based on weighted average prediction: 0.173


##### Feature Models

In this section we build a set of feature based models to predict ratings on our validation set and compare their performance. These models do not follow the typical recommender system approach of collaborative filtering / matrix factorization

We start with a linear regression model. At this stage we do not have a lot of features and so it makes more sense to use the $L_{2}$-norm for regularization (a.k.a. ridge regression)

In [14]:
columns_to_keep = ['cleanedPrice', 'isPop', 'isAlternativeRock', 'isJazz', 'isClassical', 'isDanceElectronic', 'reviewWordCount']
X_train_reg = X_train[columns_to_keep]
X_val_reg = X_val[columns_to_keep]

In [15]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_train_reg['reviewWordCount'] = X_train_reg['reviewWordCount'].apply(lambda x: np.log(x))
X_val_reg['reviewWordCount'] = X_val_reg['reviewWordCount'].apply(lambda x: np.log(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_reg['reviewWordCount'] = X_train_reg['reviewWordCount'].apply(lambda x: np.log(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val_reg['reviewWordCount'] = X_val_reg['reviewWordCount'].apply(lambda x: np.log(x))


In [16]:
def clean_dataset(df):
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep].astype(np.float64)

X_train_reg = clean_dataset(X_train_reg)
y_train = y_train[y_train.index.isin(X_train_reg.index)]
X_train = X_train[X_train.index.isin(X_train_reg.index)]

X_val_reg = clean_dataset(X_val_reg)
y_val = y_val[y_val.index.isin(X_val_reg.index)]
X_val = X_val[X_val.index.isin(X_val_reg.index)]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)


In [17]:
X_train_reg = min_max_scaler.fit_transform(X_train_reg)
X_val_reg = min_max_scaler.transform(X_val_reg)

In [20]:
from sklearn.linear_model import Ridge

vthreshold_rating = np.vectorize(threshold_rating)

alphas = [0.0, 0.01, 0.03, 0.1, 0.3]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Ridge(alpha=alpha)
    reg_model.fit(X_train_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_reg)))))
    print()

Alpha = 0.0
------------
Training Error: 0.9616192418446101
Validation Error: 0.9227567987732785

Alpha = 0.01
------------
Training Error: 0.9616194518528689
Validation Error: 0.9227570277991929

Alpha = 0.03
------------
Training Error: 0.9616198715719257
Validation Error: 0.9227574853932203

Alpha = 0.1
------------
Training Error: 0.9616213374549035
Validation Error: 0.9227590821637957

Alpha = 0.3
------------
Training Error: 0.9616254985913107
Validation Error: 0.9227636031245654



After $\alpha = 0.01$ the MSE does not marginally change and so we will use a regularization term of 0.01

In [21]:
from sklearn.linear_model import Ridge
reg_model = Ridge(alpha=0.01)
reg_model.fit(X_train_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_reg)))

model_names.append("L2-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.962
Validation error based on L2 regularized regression prediction: 0.923


We now look at some natural language processing models.

We start by processing the review column. This involves the following:
* Removing all non-alphanumeric characters
* converting to lower case
* removing a set of exclusion words (the english stopwords)
* and stemming (getting the root word)

In [22]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def process_review_text(review_text, exclude_text, ps):
    """Pre-processes the text given by `review_text`.
    
    Parameters
    ----------
    review_text: str
        The review text to be processed.
    exclude_text: collection
        A collection of words to be excluded.
    ps: PorterStemmer
        The PorterStemmer used to perform word stemming.
    
    Returns
    -------
    str
        A string representing the processed version of `review_text`.
    
    """
    review = re.sub('[^a-zA-Z0-9]', ' ', review_text).lower().split()
    review = [ps.stem(word) for word in review if not word in exclude_text]
    return ' '.join(review)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Matthew/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
exclude_english = set(stopwords.words('english'))
ps = PorterStemmer()
X_train['processedReview'] = X_train['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_val['processedReview'] = X_val['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_train

Unnamed: 0,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,image,...,reviewMonthYear,cleanedPrice,fullReviewText,isPop,isAlternativeRock,isJazz,isClassical,isDanceElectronic,reviewWordCount,processedReview
32105,"01 29, 2017",u67724906,"I liked it a lot when I first heard it, and it...","Dependable, authentic, great for those long dr...",1485648000,Pop,$12.51,p07525947,94030061,,...,2017-01,12.51,"Dependable, authentic, great for those long dr...",1,0,0,0,0,31,depend authent great long drive like lot first...
14490,"05 24, 2006",u10309608,The best Chili Peppers albums were Blood Sugar...,I give it 4 stars because it's RHCP...but it's...,1148428800,Pop,$11.89,p36536384,32777266,,...,2006-05,11.89,I give it 4 stars because it's RHCP...but it's...,1,0,0,0,0,515,give 4 star rhcp noth special best chili peppe...
16921,"05 28, 2013",u04929059,When I saw the review and buy the songs. Expec...,Few songs,1369699200,Pop,$8.99,p03487927,11500344,,...,2013-05,8.99,Few songs When I saw the review and buy the so...,1,0,0,0,0,30,song saw review buy song expect bring action o...
39859,"06 5, 2017",u20690354,GREAT!!,Five Stars,1496620800,Pop,$6.96,p36784389,87729782,,...,2017-06,6.96,Five Stars GREAT!!,1,0,0,0,0,3,five star great
25563,"12 6, 2013",u63753846,"There is very little; this Lady can sing, that...",A GREAT ONE....,1386288000,Pop,$9.10,p28779552,92947167,,...,2013-12,9.10,A GREAT ONE.... There is very little; this Lad...,1,0,0,0,0,25,great one littl ladi sing like immens show man...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20877,"10 29, 2003",u83502050,"I'm undoubtedly in the minority here, but this...",For me personally.....Underworlds finest work!!!,1067385600,Dance & Electronic,$11.97,p73092793,32251137,,...,2003-10,11.97,For me personally.....Underworlds finest work!...,0,0,0,0,1,177,person underworld finest work undoubtedli mino...
6058,"02 17, 2015",u72292741,like,Four Stars,1424131200,Dance & Electronic,$13.96,p87964435,31658850,,...,2015-02,13.96,Four Stars like,0,0,0,0,1,3,four star like
22092,"08 31, 2001",u13926491,"Bjork is back with ""Vespertine"".\nA good effor...","VESPERTINE DELIVERS ON COVER ART, BUT DON'T JU...",999216000,Dance & Electronic,$18.29,p15146077,59635469,,...,2001-08,18.29,"VESPERTINE DELIVERS ON COVER ART, BUT DON'T JU...",0,0,0,0,1,233,vespertin deliv cover art judg book bjork back...
34562,"06 4, 2003",u11852145,If hip-hop is truly where the mainstream music...,Some of the best hip-hop/R&B yet to be release...,1054684800,Dance & Electronic,$0.57,p22525318,94089341,,...,2003-06,0.57,Some of the best hip-hop/R&B yet to be release...,0,0,0,0,1,239,best hip hop r b yet releas u hip hop truli ma...


We now use a CountVectorizer to build counts of the 1500 most common words

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X_train_cv = cv.fit_transform(X_train['processedReview'])
X_val_cv = cv.transform(X_val['processedReview'])

In [25]:
import scipy.sparse as sp

X_train_reg_sp = sp.csr_matrix(X_train_reg)
X_train_cv_reg = sp.hstack((X_train_cv, X_train_reg_sp), format='csr')

X_val_reg_sp = sp.csr_matrix(X_val_reg)
X_val_cv_reg = sp.hstack((X_val_cv, X_val_reg_sp), format='csr')

Now we will fit a few sample models to this dataset.

First we will perform linear regression. In this case we use $L_{1}$ regularization as we have 1507 features.

In [22]:
from sklearn.linear_model import Lasso, LinearRegression

print("Alpha = 0")
print("------------")
reg_model = LinearRegression()
reg_model.fit(X_train_cv_reg, y_train)
print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
print()

alphas = [0.001, 0.003, 0.01, 0.03]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Lasso(alpha=alpha)
    reg_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
    print()

Alpha = 0
------------
Training Error: 0.9702267832256429
Validation Error: 0.7343669218867732

Alpha = 0.001
------------
Training Error: 1.0172938046370201
Validation Error: 0.7084168824390358

Alpha = 0.003
------------
Training Error: 1.0505511212466743
Validation Error: 0.7278164470848788

Alpha = 0.01
------------
Training Error: 0.7878701843021085
Validation Error: 0.8028965298964436

Alpha = 0.03
------------
Training Error: 0.9012085275848064
Validation Error: 0.9024814384396381



A value of $\alpha = 0.01$ seems to work the best

In [26]:
from sklearn.linear_model import Lasso
reg_model = Lasso(alpha=0.01)
reg_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))

model_names.append("CV-L1-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.800
Validation error based on L2 regularized regression prediction: 0.775


We will now try a DecisionTreeRegressor and we will try various values for `min_samples_split`. This is the minimum number of samples required to split an internal node. Intuitively, lower values will correspond to higher variance and thus overfitting, whereas higher values will correspond to higher bias and thus overfitting. Obviously, this value show be at least 2

In [54]:
from sklearn.tree import DecisionTreeRegressor
samples_split_lst = [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]

for samples_split in samples_split_lst:
    print("Samples Split = {}".format(samples_split))
    print("-------------------")
    tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=samples_split)
    tree_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))))
    print()

Samples Split = 2
-------------------
Training Error: 0.0
Validation Error: 1.3621169916434541

Samples Split = 5
-------------------
Training Error: 0.061990582372566404
Validation Error: 1.2772255986044287

Samples Split = 10
-------------------
Training Error: 0.14434726168857676
Validation Error: 1.1985976957818383

Samples Split = 20
-------------------
Training Error: 0.21662396829778507
Validation Error: 1.139970892726337

Samples Split = 50
-------------------
Training Error: 0.3487511087905338
Validation Error: 1.0140588194819098

Samples Split = 100
-------------------
Training Error: 0.4179916778859212
Validation Error: 0.9681598567080321

Samples Split = 200
-------------------
Training Error: 0.4994314849422262
Validation Error: 0.9277689865884085

Samples Split = 500
-------------------
Training Error: 0.636519419002298
Validation Error: 0.8577306502336626

Samples Split = 1000
-------------------
Training Error: 0.7138154668215363
Validation Error: 0.819981661067408

Sam

The low values of `min_samples_split` clearly show overfitting as the training error is very low but with very high validation error. Conversely, once `min_samples_split` is past 1000 the validation error is not changing much and starts to increase. So we will stick with a value of 1000

In [27]:
from sklearn.tree import DecisionTreeRegressor
tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=1000)
tree_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))

model_names.append("CV-DecTree")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on L2 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on L2 regularized regression prediction: %.3f" % val_MSE)

Training error based on L2 regularized regression prediction: 0.666
Validation error based on L2 regularized regression prediction: 0.785


Bagging is a technique used to lower variance. One decision tree could have high variance if the fit is specific to one dataset. The random forest model is a form of bagging used with decision trees with the additional technique of splitting on a random subset of features. This additional technique is done to decorrelate the predictions.

We now look at a random forest model

In [61]:
from sklearn.ensemble import RandomForestRegressor

estimator_lst = [50, 100, 200, 500, 1000]
samples_split_lst = [100, 200, 500, 1000, 2000, 5000]
for estimators in estimator_lst:
    for samples_split in samples_split_lst:
        print("Estimator: {0}, Samples Split: {1}".format(estimators, samples_split))
        print("-----------------------------------")
        forest_model = RandomForestRegressor(n_estimators=estimators, criterion='mse', min_samples_split=samples_split)
        forest_model.fit(X_train_cv_reg, y_train)
        print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_cv_reg)))))
        print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_cv_reg)))))
        print()

Estimator: 50, Samples Split: 100
-----------------------------------
Training Error: 0.4057792553637449
Validation Error: 0.7191506672122061

Estimator: 50, Samples Split: 200
-----------------------------------
Training Error: 0.49593356805206995
Validation Error: 0.7205042054861818

Estimator: 50, Samples Split: 500
-----------------------------------
Training Error: 0.6101750128150603
Validation Error: 0.7439081543549444

Estimator: 50, Samples Split: 1000
-----------------------------------
Training Error: 0.6899881867827375
Validation Error: 0.762581111737615

Estimator: 50, Samples Split: 2000
-----------------------------------
Training Error: 0.7681853361350132
Validation Error: 0.7928475095683795

Estimator: 50, Samples Split: 5000
-----------------------------------
Training Error: 0.8106714637644119
Validation Error: 0.8281190868440167

Estimator: 100, Samples Split: 100
-----------------------------------
Training Error: 0.40068404774153965
Validation Error: 0.710482975489

A value of `n_estimators = 200` and `min_samples_split = 200` seems to work quite well

In [28]:
from sklearn.ensemble import RandomForestRegressor
forest_model = RandomForestRegressor(n_estimators=200, criterion="mse", min_samples_split=200)
forest_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_cv_reg)))

model_names.append("CV-RanFor")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on Random Forest Count Vectorizer: %.3f" % train_MSE)
print("Validation error based on Random Forest Count Vectorizer: %.3f" % val_MSE)

Training error based on Random Forest Count Vectorizer: 0.455
Validation error based on Random Forest Count Vectorizer: 0.672


In [27]:
from xgboost import XGBRegressor

learning_rates = [0.01, 0.03, 0.1, 0.3, 0.5]
estimators = [10, 50, 100, 200, 500]
depths = [1, 2, 5, 10]

for learning_rate in learning_rates:
    for estimator in estimators:
        for depth in depths:
            print("Learning Rate: {0}, # Estimators: {1}, Depth: {2}".format(learning_rate, estimator, depth))
            print("--------------------------------------------------")
            xg_reg = XGBRegressor(
                learning_rate=learning_rate, max_depth=depth, n_estimators=estimator)
            xg_reg.fit(X_train_cv_reg, y_train)
            print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))))
            print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))))
            print()

Learning Rate: 0.01, # Estimators: 10, Depth: 1
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 2
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 5
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 10
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 6.567148287775521
Validation Error: 6.580003963790851

Learning Rate: 0.01, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 6.544153143964236
Validati

Training Error: 0.8055230137896795
Validation Error: 0.8218780722604903

Learning Rate: 0.1, # Estimators: 100, Depth: 2
--------------------------------------------------
Training Error: 0.7115322855995211
Validation Error: 0.7448900468672212

Learning Rate: 0.1, # Estimators: 100, Depth: 5
--------------------------------------------------
Training Error: 0.8942100595464335
Validation Error: 0.6493338358336637

Learning Rate: 0.1, # Estimators: 100, Depth: 10
--------------------------------------------------
Training Error: 0.6788926897250729
Validation Error: 0.6255026057516613

Learning Rate: 0.1, # Estimators: 200, Depth: 1
--------------------------------------------------
Training Error: 0.751404750011802
Validation Error: 0.7742689567006416

Learning Rate: 0.1, # Estimators: 200, Depth: 2
--------------------------------------------------
Training Error: 1.0214747244393767
Validation Error: 0.6964735118018198

Learning Rate: 0.1, # Estimators: 200, Depth: 5
-------------------

Training Error: 0.978398581021158
Validation Error: 0.6803343245180234

Learning Rate: 0.5, # Estimators: 500, Depth: 2
--------------------------------------------------
Training Error: 0.7940580260990752
Validation Error: 0.6603559222672145

Learning Rate: 0.5, # Estimators: 500, Depth: 5
--------------------------------------------------
Training Error: 0.5364880273660205
Validation Error: 0.699476446541734

Learning Rate: 0.5, # Estimators: 500, Depth: 10
--------------------------------------------------
Training Error: 0.00011779267994905767
Validation Error: 1.1147125854646747



The parameters `learning_rate=0.03`, `n_estimators=500` and `max_depth=10` seem to give really good performance

In [29]:
from xgboost import XGBRegressor
xg_reg = XGBRegressor(learning_rate=0.03, n_estimators=500, max_depth=10)
xg_reg.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))

model_names.append("CV-XGB")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on XGBoost CountVectorizer prediction: %.3f" % train_MSE)
print("Validation error based on XGBoost CountVectorizer prediction: %.3f" % val_MSE)

Training error based on XGBoost CountVectorizer prediction: 0.240
Validation error based on XGBoost CountVectorizer prediction: 0.578


##### Bigrams

Note that the default for CountVectorizer is `ngram_range=(1,1)` which corresponds to unigrams (single words). We are now going to look at bigrams (collections of 2 words). The reason is we could have a sentence such as "the song was well done and not generic" or a sentence like "the song was not well done and generic". These two sentences mean the same thing but the first sentence would correspond to liking the song whereas the second would correspond to disliking the song

In [30]:
cv = CountVectorizer(ngram_range=(2,2))
X_train_cv = cv.fit_transform(X_train['processedReview'])
X_val_cv = cv.transform(X_val['processedReview'])

In [31]:
X_train_reg_sp = sp.csr_matrix(X_train_reg)
X_train_cv_reg = sp.hstack((X_train_cv, X_train_reg_sp), format='csr')

X_val_reg_sp = sp.csr_matrix(X_val_reg)
X_val_cv_reg = sp.hstack((X_val_cv, X_val_reg_sp), format='csr')

Again, we will start with a linear regression model

In [32]:
print("Alpha = 0")
print("------------")
reg_model = LinearRegression()
reg_model.fit(X_train_cv_reg, y_train)
print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
print()

alphas = [0.001, 0.003, 0.01, 0.03]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Lasso(alpha=alpha)
    reg_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))))
    print()

Alpha = 0
------------


NameError: name 'LinearRegression' is not defined

A parameter of `alpha = 0.003` seems to work best

In [33]:
reg_model = Lasso(alpha=0.003)
reg_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_cv_reg)))

model_names.append("bigram-L1-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on bigram L1 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on bigram L1 regularized regression prediction: %.3f" % val_MSE)

Training error based on bigram L1 regularized regression prediction: 0.858
Validation error based on bigram L1 regularized regression prediction: 0.824


Next we look at a DecisionTree model

In [54]:
samples_split_lst = [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]

for samples_split in samples_split_lst:
    print("Samples Split = {}".format(samples_split))
    print("-------------------")
    tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=samples_split)
    tree_model.fit(X_train_cv_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))))
    print()

Samples Split = 2
-------------------
Training Error: 0.00034840998352970987
Validation Error: 1.5720435553304635

Samples Split = 5
-------------------
Training Error: 0.14395667046750285
Validation Error: 1.380092709828086

Samples Split = 10
-------------------
Training Error: 0.2517588922472997
Validation Error: 1.2500459664418744

Samples Split = 20
-------------------
Training Error: 0.334875317575188
Validation Error: 1.2045591036207899

Samples Split = 50
-------------------
Training Error: 0.38170810626141705
Validation Error: 1.1547422191619612

Samples Split = 100
-------------------
Training Error: 0.44518864141093517
Validation Error: 1.12558213619203

Samples Split = 200
-------------------
Training Error: 0.47686615493204587
Validation Error: 1.1039643057616346

Samples Split = 500
-------------------
Training Error: 0.5870212534753286
Validation Error: 0.9927061595309912

Samples Split = 1000
-------------------
Training Error: 0.6849479740741126
Validation Error: 0.898

The `min_samples_split=5000` seems to work best

In [34]:
tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=5000)
tree_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_cv_reg)))

model_names.append("bigram-DecTree")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on bigram decision tree prediction: %.3f" % train_MSE)
print("Validation error based on bigram decision tree prediction: %.3f" % val_MSE)

Training error based on bigram decision tree prediction: 0.651
Validation error based on bigram decision tree prediction: 0.829


Next we try RandomForestRegressor

In [56]:
estimator_lst = [50, 100, 200, 500, 1000]
samples_split_lst = [100, 200, 500, 1000, 2000, 5000]
for estimators in estimator_lst:
    for samples_split in samples_split_lst:
        print("Estimator: {0}, Samples Split: {1}".format(estimators, samples_split))
        print("-----------------------------------")
        forest_model = RandomForestRegressor(n_estimators=estimators, criterion='mse', min_samples_split=samples_split)
        forest_model.fit(X_train_cv_reg, y_train)
        print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_cv_reg)))))
        print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_cv_reg)))))
        print()

Estimator: 50, Samples Split: 100
-----------------------------------
Training Error: 0.44141249668554217
Validation Error: 0.8757471183270643

Estimator: 50, Samples Split: 200
-----------------------------------
Training Error: 0.4928900322931837
Validation Error: 0.8629796406063882

Estimator: 50, Samples Split: 500
-----------------------------------
Training Error: 0.6075063573811836
Validation Error: 0.8386345191945126

Estimator: 50, Samples Split: 1000
-----------------------------------
Training Error: 0.6761420982965163
Validation Error: 0.8286892342528405

Estimator: 50, Samples Split: 2000
-----------------------------------
Training Error: 0.7690676434275397
Validation Error: 0.8279859307599594

Estimator: 50, Samples Split: 5000
-----------------------------------
Training Error: 0.8431295781666054
Validation Error: 0.8595007183466646

Estimator: 100, Samples Split: 100
-----------------------------------
Training Error: 0.43739480862845315
Validation Error: 0.87280517646

In this case the number of estimators doesn't seem to make much of a difference but `min_samples_split=2000` seems to work best. In order to keep it simple we will use 50 estimators

In [None]:
forest_model = RandomForestRegressor(n_estimators=50, criterion="mse", min_samples_split=2000)
forest_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_cv_reg)))

model_names.append("bigram-RanFor")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on Random Forest Count Vectorizer: %.3f" % train_MSE)
print("Validation error based on Random Forest Count Vectorizer: %.3f" % val_MSE)

Lastly, we will look at XGBoost

In [58]:
from xgboost import XGBRegressor

learning_rates = [0.01, 0.03, 0.1, 0.3, 0.5]
estimators = [10, 50, 100, 200, 500]
depths = [1, 2, 5, 10]

for learning_rate in learning_rates:
    for estimator in estimators:
        for depth in depths:
            print("Learning Rate: {0}, # Estimators: {1}, Depth: {2}".format(learning_rate, estimator, depth))
            print("--------------------------------------------------")
            xg_reg = XGBRegressor(
                learning_rate=learning_rate, max_depth=depth, n_estimators=estimator)
            xg_reg.fit(X_train_cv_reg, y_train)
            print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))))
            print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))))
            print()

Learning Rate: 0.01, # Estimators: 10, Depth: 1
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 2
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 5
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 10
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 6.568249772343689
Validation Error: 6.57867399079515

Learning Rate: 0.01, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 6.553732014268306
Validatio

Training Error: 0.8562371391738242
Validation Error: 0.8646289936763091

Learning Rate: 0.1, # Estimators: 100, Depth: 2
--------------------------------------------------
Training Error: 0.8122424035463263
Validation Error: 0.8363016979402929

Learning Rate: 0.1, # Estimators: 100, Depth: 5
--------------------------------------------------
Training Error: 0.7130391586430229
Validation Error: 0.8009448747730243

Learning Rate: 0.1, # Estimators: 100, Depth: 10
--------------------------------------------------
Training Error: 0.5694270929171044
Validation Error: 0.7913095545411614

Learning Rate: 0.1, # Estimators: 200, Depth: 1
--------------------------------------------------
Training Error: 0.8286290055102565
Validation Error: 0.8435263055403349

Learning Rate: 0.1, # Estimators: 200, Depth: 2
--------------------------------------------------
Training Error: 0.7766458375238707
Validation Error: 0.8153344332339758

Learning Rate: 0.1, # Estimators: 200, Depth: 5
------------------

Training Error: 1.0441530470036742
Validation Error: 0.7848066543164833

Learning Rate: 0.5, # Estimators: 500, Depth: 2
--------------------------------------------------
Training Error: 0.9222095527682757
Validation Error: 0.7876996121689137

Learning Rate: 0.5, # Estimators: 500, Depth: 5
--------------------------------------------------
Training Error: 0.6551374635753199
Validation Error: 0.8663779028219787

Learning Rate: 0.5, # Estimators: 500, Depth: 10
--------------------------------------------------
Training Error: 0.513429621183327
Validation Error: 1.2770321600405166



Parameters `learning_rate=0.1`, `n_estimators=200` and `max_depth=5` seem to work best

In [35]:
xg_reg = XGBRegressor(learning_rate=0.1, n_estimators=200, max_depth=5)
xg_reg.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))

model_names.append("bigram-XGB")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on XGBoost CountVectorizer prediction: %.3f" % train_MSE)
print("Validation error based on XGBoost CountVectorizer prediction: %.3f" % val_MSE)

Training error based on XGBoost CountVectorizer prediction: 0.682
Validation error based on XGBoost CountVectorizer prediction: 0.726


##### TF-IDF

We now look at some Term Frequency, Inverse Document Frequency models

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train['processedReview'])
X_val_tfidf = tfidf.transform(X_val['processedReview'])

In [None]:
X_train_reg_sp = sp.csr_matrix(X_train_reg)
X_train_tfidf_reg = sp.hstack((X_train_tfidf, X_train_reg_sp), format='csr')

X_val_reg_sp = sp.csr_matrix(X_val_reg)
X_val_tfidf_reg = sp.hstack((X_val_tfidf, X_val_reg_sp), format='csr')

In [42]:
print("Alpha = 0")
print("------------")
reg_model = LinearRegression()
reg_model.fit(X_train_tfidf_reg, y_train)
print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_tfidf_reg)))))
print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_tfidf_reg)))))
print()

alphas = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Lasso(alpha=alpha)
    reg_model.fit(X_train_tfidf_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_tfidf_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_tfidf_reg)))))
    print()

Alpha = 0
------------
Training Error: 0.48764728240212846
Validation Error: 4.35882501899215

Alpha = 0.0001
------------
Training Error: 0.4971762425824536
Validation Error: 0.5943364603053036

Alpha = 0.0003
------------
Training Error: 0.6132796048231367
Validation Error: 0.6367738258983342

Alpha = 0.001
------------
Training Error: 0.7396541286275682
Validation Error: 0.7446951232339728

Alpha = 0.003
------------
Training Error: 0.8926590888416531
Validation Error: 0.8860759005064034

Alpha = 0.01
------------
Training Error: 0.9449249189212761
Validation Error: 0.9401356110678446

Alpha = 0.03
------------
Training Error: 0.9829243379548028
Validation Error: 0.9757225376331816



Very low values of the regularization parameter seems to correspond to overfitting. So we will use $\alpha = 0.0003$

In [None]:
reg_model = Lasso(alpha=0.0003)
reg_model.fit(X_train_tfidf_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(reg_model.predict(X_train_tfidf_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(reg_model.predict(X_val_tfidf_reg)))

model_names.append("TFIDF-L1-Reg")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on TF-IDF L1 regularized regression prediction: %.3f" % train_MSE)
print("Validation error based on TF-IDF L1 regularized regression prediction: %.3f" % val_MSE)

Next we look at some decision trees. As very low values of `min_samples_split` corresponded to overfitting in simpler models, low values will also correspond to overfitting in this model. So we will start with higher values. This makes sense as our prediction is in the range [1, 5] and so there shouldn't be a lot of leaf nodes

In [45]:
samples_split_lst = [10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]

for samples_split in samples_split_lst:
    print("Samples Split = {}".format(samples_split))
    print("-------------------")
    tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=samples_split)
    tree_model.fit(X_train_tfidf_reg, y_train)
    print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_tfidf_reg)))))
    print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_tfidf_reg)))))
    print()

Samples Split = 10
-------------------
Training Error: 0.09078405317953019
Validation Error: 1.1542923660840494

Samples Split = 20
-------------------
Training Error: 0.1481975958452695
Validation Error: 1.1519107693805148

Samples Split = 50
-------------------
Training Error: 0.2172065901834366
Validation Error: 1.0530955269682027

Samples Split = 100
-------------------
Training Error: 0.2646238230552557
Validation Error: 1.0479410075498214

Samples Split = 200
-------------------
Training Error: 0.3291466366217442
Validation Error: 1.0350827073757387

Samples Split = 500
-------------------
Training Error: 0.42699768973783386
Validation Error: 0.9471850793626092

Samples Split = 1000
-------------------
Training Error: 0.5308267222973465
Validation Error: 0.8943621767262115

Samples Split = 2000
-------------------
Training Error: 0.6517068542407606
Validation Error: 0.863809577630791

Samples Split = 5000
-------------------
Training Error: 0.7570083021109453
Validation Error: 0.

It seems a value of `min_samples_split=5000` works the best

In [None]:
tree_model = DecisionTreeRegressor(criterion="mse", min_samples_split=5000)
tree_model.fit(X_train_tfidf_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(tree_model.predict(X_train_tfidf_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(tree_model.predict(X_val_tfidf_reg)))

model_names.append("TFIDF-DecTree")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on TF-IDF decision tree prediction: %.3f" % train_MSE)
print("Validation error based on TF-IDF decision tree prediction: %.3f" % val_MSE)

Now we look at some random forest models

In [39]:
estimator_lst = [50, 100, 200, 500, 1000]
samples_split_lst = [100, 200, 500, 1000, 2000, 5000]
for estimators in estimator_lst:
    for samples_split in samples_split_lst:
        print("Estimator: {0}, Samples Split: {1}".format(estimators, samples_split))
        print("-----------------------------------")
        forest_model = RandomForestRegressor(n_estimators=estimators, criterion='mse', min_samples_split=samples_split)
        forest_model.fit(X_train_tfidf_reg, y_train)
        print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_tfidf_reg)))))
        print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_tfidf_reg)))))
        print()

Estimator: 50, Samples Split: 100
-----------------------------------
Training Error: 0.27764912169968925
Validation Error: 0.682414274687323

Estimator: 50, Samples Split: 200
-----------------------------------
Training Error: 0.3321947147412355
Validation Error: 0.6896463357974781

Estimator: 50, Samples Split: 500
-----------------------------------
Training Error: 0.40365306798451744
Validation Error: 0.6999955085365609

Estimator: 50, Samples Split: 1000
-----------------------------------
Training Error: 0.5145458898616719
Validation Error: 0.7331253142760055

Estimator: 50, Samples Split: 2000
-----------------------------------
Training Error: 0.5958559191822098
Validation Error: 0.7449028000432648

Estimator: 50, Samples Split: 5000
-----------------------------------
Training Error: 0.7522311519697088
Validation Error: 0.797617220088879

Estimator: 100, Samples Split: 100
-----------------------------------
Training Error: 0.2786782324210541
Validation Error: 0.6791026227302

KeyboardInterrupt: 

In [None]:
forest_model = RandomForestRegressor(n_estimators=300, criterion="mse", min_samples_split=2000)
forest_model.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(forest_model.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(forest_model.predict(X_val_cv_reg)))

model_names.append("TFIDF-RanFor")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on Random Forest Count Vectorizer: %.3f" % train_MSE)
print("Validation error based on Random Forest Count Vectorizer: %.3f" % val_MSE)

In [36]:
learning_rates = [0.01, 0.03, 0.1, 0.3, 0.5]
estimators = [10, 50, 100, 200, 500]
depths = [1, 2, 5, 10]

for learning_rate in learning_rates:
    for estimator in estimators:
        for depth in depths:
            print("Learning Rate: {0}, # Estimators: {1}, Depth: {2}".format(learning_rate, estimator, depth))
            print("--------------------------------------------------")
            xg_reg = XGBRegressor(
                learning_rate=learning_rate, max_depth=depth, n_estimators=estimator)
            xg_reg.fit(X_train_cv_reg, y_train)
            print("Training Error: {}".format(calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))))
            print("Validation Error: {}".format(calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))))
            print()

Learning Rate: 0.01, # Estimators: 10, Depth: 1
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 2
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 5
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 10, Depth: 10
--------------------------------------------------
Training Error: 12.647472443937666
Validation Error: 12.663965560901493

Learning Rate: 0.01, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 6.568249772343689
Validation Error: 6.57867399079515

Learning Rate: 0.01, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 6.553732014268306
Validatio

Training Error: 0.8562371391738242
Validation Error: 0.8646289936763091

Learning Rate: 0.1, # Estimators: 100, Depth: 2
--------------------------------------------------
Training Error: 0.8122424035463263
Validation Error: 0.8363016979402929

Learning Rate: 0.1, # Estimators: 100, Depth: 5
--------------------------------------------------
Training Error: 0.7130391586430229
Validation Error: 0.8009448747730243

Learning Rate: 0.1, # Estimators: 100, Depth: 10
--------------------------------------------------
Training Error: 0.5694270929171044
Validation Error: 0.7913095545411614

Learning Rate: 0.1, # Estimators: 200, Depth: 1
--------------------------------------------------
Training Error: 0.8286290055102565
Validation Error: 0.8435263055403349

Learning Rate: 0.1, # Estimators: 200, Depth: 2
--------------------------------------------------
Training Error: 0.7766458375238707
Validation Error: 0.8153344332339758

Learning Rate: 0.1, # Estimators: 200, Depth: 5
------------------

Training Error: 1.0441530470036742
Validation Error: 0.7848066543164833

Learning Rate: 0.5, # Estimators: 500, Depth: 2
--------------------------------------------------
Training Error: 0.9222095527682757
Validation Error: 0.7876996121689137

Learning Rate: 0.5, # Estimators: 500, Depth: 5
--------------------------------------------------
Training Error: 0.6551374635753199
Validation Error: 0.8663779028219787

Learning Rate: 0.5, # Estimators: 500, Depth: 10
--------------------------------------------------
Training Error: 0.513429621183327
Validation Error: 1.2770321600405166



In [None]:
from xgboost import XGBRegressor
xg_reg = XGBRegressor(learning_rate=0.03, n_estimators=500, max_depth=10)
xg_reg.fit(X_train_cv_reg, y_train)

train_MSE = calculate_MSE(y_train, vthreshold_rating(xg_reg.predict(X_train_cv_reg)))
val_MSE = calculate_MSE(y_val, vthreshold_rating(xg_reg.predict(X_val_cv_reg)))

model_names.append("CV-XGB")
train_errors.append(train_MSE)
validation_errors.append(val_MSE)

print("Training error based on XGBoost CountVectorizer prediction: %.3f" % train_MSE)
print("Validation error based on XGBoost CountVectorizer prediction: %.3f" % val_MSE)

In [None]:
def user_item_matrix(df, ratings, user_col, item_col):
    return sp.csr_matrix(ratings, (df[user_col], df[item_col]))

In [None]:
X_train_sparse = user_item_matrix(X_train, y_train, 'reviewerID', 'itemID')
X_val_sparse = user_item_matrix(X_val, y_val, 'reviewerID', 'itemID')

In [None]:
average_rating_total = X_train_sparse.sum() / X_train_sparse.count_nonzero()

print("Average Rating: {}".format(average_rating_total))