# Final Models

In this notebook we look at some final candidate models and assess their performance on a validation set.

In [2]:
import json
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
plt.style.use('seaborn-darkgrid')
%matplotlib inline

random.seed(17)
np.random.seed(17)  # pandas' .sample() draws from NumPy's global RNG, not Python's `random`

### Loading data

We will subsample 50,000 of the 200,000 records for training by sampling 25% of the data from each of the five music categories.

In [3]:
data = None
with open(os.path.join('data', 'train.json'), 'r') as train_file:
    data = [json.loads(row) for row in train_file]

In [4]:
data_df = pd.DataFrame(data).drop(columns=['image'])
del data

In [5]:
categories = data_df['category'].unique()
dfs = []
for category in categories:
    dfs.append(data_df[data_df['category'] == category].sample(frac=0.25))
data_df = pd.concat(dfs, axis=0)
data_df = data_df.sort_index()
data_df

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631
10,3.0,"04 3, 2002",u25030850,Me personally I am not a big fan of Pearl Jam ...,Hmmmmm...........,1017792000,Pop,$6.81,p99659606,41124728
11,5.0,"11 16, 2006",u40719083,Katchen's performace throughout this collectio...,Superb interpretations by Katchen,1163635200,Classical,$31.04,p63362921,40704096
16,5.0,"09 19, 2016",u55172429,"Wow, I had this in the late 60s, I saw Deaf Ge...",Hello Darkness my old friend......,1474243200,Pop,$5.89,p72166335,75947625
...,...,...,...,...,...,...,...,...,...,...
199983,5.0,"02 9, 2017",u57558427,Those old crooners like Hartman really had a w...,Hartman and Wine Get Better With Age,1486598400,Pop,$1.98,p33434439,24285962
199986,5.0,"11 8, 2015",u85136324,Love it,Five Stars,1446940800,Pop,$11.88,p02978017,27368058
199987,4.0,"03 23, 2014",u32710934,"Love priest, halford one of the best live show...",Remaster ?,1395532800,Alternative Rock,$11.89,p75893713,97935046
199988,5.0,"10 20, 2014",u37750462,Great,Five Stars,1413763200,Alternative Rock,$18.98,p47415569,75099110
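
The per-category subsampling above can also be written with a single grouped call. The following is a minimal sketch on a toy frame (the `toy_df` data is made up for illustration), assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. Passing `random_state` makes the draw repeatable.

```python
import pandas as pd

# Toy frame standing in for data_df; only the 'category' column matters here.
toy_df = pd.DataFrame({
    'category': ['Pop'] * 8 + ['Jazz'] * 4,
    'overall': [5.0, 4.0, 3.0, 5.0, 4.0, 5.0, 2.0, 1.0, 5.0, 4.0, 3.0, 5.0],
})

# Draw 25% of the rows within each category; random_state makes the draw repeatable.
subsample = (toy_df.groupby('category', group_keys=False)
                   .sample(frac=0.25, random_state=17)
                   .sort_index())

print(subsample['category'].value_counts().to_dict())
```

Each category keeps a quarter of its rows, so the category proportions of the full data are preserved in the subsample.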


### Pre-Processing

We apply feature cleaning as prototyped before and then split into a training and validation set, ensuring that the proportion of data points in training vs validation is consistent for each music category.

In [6]:
def trim_price(price):
    """Trims `price` to remove the $ sign.
    
    If the price variable does not have the format $x.xx
    then the empty string is returned.
    
    Parameters
    ----------
    price: str
        A string representing a price.
    
    Returns
    -------
    str
        A string representing `price` but with the $ sign removed,
        or the empty string if `price` does not have the correct
        format.
    
    """
    if (not pd.isnull(price) and isinstance(price, str) and
        len(price) > 0 and price[0] == '$'):
        return price[1:]
    return ""

In [7]:
from datetime import datetime

data_df['reviewMonth'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[0])
data_df['reviewYear'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[2])
data_df['reviewHour'] = data_df['unixReviewTime'].apply(lambda x: datetime.fromtimestamp(x).hour)
data_df['reviewMonthYear'] = data_df['reviewYear'] + '-' + data_df['reviewMonth']

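data_df['cleanedPrice'] = data_df['price'].apply(trim_price)
# .copy() makes the filtered frame independent, so later column assignments
# do not trigger SettingWithCopyWarning.
data_df = data_df[data_df['cleanedPrice'] != ""].copy()
data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')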

data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
data_df['fixedSummary'] = np.where(pd.isnull(data_df['summary']), "", data_df['summary'])
data_df['fullReviewText'] = data_df['fixedSummary'] + " " + data_df['fixedReviewText']

data_df = data_df.drop(columns=['fixedReviewText', 'fixedSummary'])

genres = data_df['category'].unique()

for genre in genres:
    genre_col = "is" + genre.replace(" ", "").replace("&", "")
    data_df[genre_col] = data_df['category'].apply(lambda x: 1 if x == genre else 0)

data_df['reviewWordCount'] = data_df['fullReviewText'].apply(lambda x: len(x.split()))

data_df

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,...,reviewHour,reviewMonthYear,cleanedPrice,fullReviewText,isPop,isClassical,isAlternativeRock,isJazz,isDanceElectronic,reviewWordCount
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,...,20,2015-10,89.86,"Love the Music, Hate the Light Show I REALLY e...",1,0,0,0,0,133
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,...,20,2017-06,11.89,Great Finally got it . It was everything thoug...,1,0,0,0,0,15
10,3.0,"04 3, 2002",u25030850,Me personally I am not a big fan of Pearl Jam ...,Hmmmmm...........,1017792000,Pop,$6.81,p99659606,41124728,...,19,2002-04,6.81,Hmmmmm........... Me personally I am not a big...,1,0,0,0,0,77
11,5.0,"11 16, 2006",u40719083,Katchen's performace throughout this collectio...,Superb interpretations by Katchen,1163635200,Classical,$31.04,p63362921,40704096,...,19,2006-11,31.04,Superb interpretations by Katchen Katchen's pe...,0,1,0,0,0,226
16,5.0,"09 19, 2016",u55172429,"Wow, I had this in the late 60s, I saw Deaf Ge...",Hello Darkness my old friend......,1474243200,Pop,$5.89,p72166335,75947625,...,20,2016-09,5.89,"Hello Darkness my old friend...... Wow, I had ...",1,0,0,0,0,84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199983,5.0,"02 9, 2017",u57558427,Those old crooners like Hartman really had a w...,Hartman and Wine Get Better With Age,1486598400,Pop,$1.98,p33434439,24285962,...,19,2017-02,1.98,Hartman and Wine Get Better With Age Those old...,1,0,0,0,0,25
199986,5.0,"11 8, 2015",u85136324,Love it,Five Stars,1446940800,Pop,$11.88,p02978017,27368058,...,19,2015-11,11.88,Five Stars Love it,1,0,0,0,0,4
199987,4.0,"03 23, 2014",u32710934,"Love priest, halford one of the best live show...",Remaster ?,1395532800,Alternative Rock,$11.89,p75893713,97935046,...,20,2014-03,11.89,"Remaster ? Love priest, halford one of the bes...",0,0,1,0,0,22
199988,5.0,"10 20, 2014",u37750462,Great,Five Stars,1413763200,Alternative Rock,$18.98,p47415569,75099110,...,20,2014-10,18.98,Five Stars Great,0,0,1,0,0,3


In [8]:
def calculate_MSE(actuals, predicteds):
    """Calculates the Mean Squared Error between `actuals` and `predicteds`.
    
    Parameters
    ----------
    actuals: np.array
        A numpy array of the actual values.
    predicteds: np.array
        A numpy array of the predicted values.
    
    Returns
    -------
    float
        A float representing the Mean Squared Error between `actuals` and
        `predicteds`.
    
    """
    return (((actuals - predicteds)**2).sum()) / (len(actuals))
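
As a quick sanity check of the formula used in `calculate_MSE`, here is a small worked example on made-up numbers. It is only an illustration of the arithmetic, not part of the modelling pipeline.

```python
import numpy as np

actuals = np.array([5.0, 3.0, 4.0, 1.0])
predicteds = np.array([4.0, 3.0, 5.0, 3.0])

# Squared errors are 1, 0, 1 and 4, so the mean is 6 / 4 = 1.5.
mse = ((actuals - predicteds) ** 2).sum() / len(actuals)

# The same quantity written as a plain mean of squared errors.
assert np.isclose(mse, np.mean((actuals - predicteds) ** 2))
print(mse)
```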

In [9]:
from sklearn.model_selection import train_test_split

genres = data_df['category'].unique()
X_train_set = []
X_val_set = []
y_train_set = []
y_val_set = []

for genre in genres:
    genre_df = data_df[data_df['category'] == genre]
    targets = genre_df['overall']
    feature_data = genre_df.drop(columns=['overall'])
    X_train, X_val, y_train, y_val = train_test_split(
        feature_data, targets, shuffle=True, test_size=0.2, random_state=17)
    X_train_set.append(X_train)
    X_val_set.append(X_val)
    y_train_set.append(y_train)
    y_val_set.append(y_val)

X_train = pd.concat(X_train_set)
X_val = pd.concat(X_val_set)
y_train = pd.concat(y_train_set)
y_val = pd.concat(y_val_set)
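
The per-genre loop above keeps the train/validation ratio the same within every category. A roughly equivalent one-call alternative is the `stratify` argument of `train_test_split`. The sketch below uses a made-up `toy_df`; the selected rows (and the per-category rounding) may differ slightly from the loop-based split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 Pop rows and 5 Jazz rows.
toy_df = pd.DataFrame({
    'category': ['Pop'] * 10 + ['Jazz'] * 5,
    'overall': [float(i % 5 + 1) for i in range(15)],
})

# stratify keeps the Pop/Jazz proportions equal in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_df.drop(columns=['overall']), toy_df['overall'],
    test_size=0.2, shuffle=True, random_state=17,
    stratify=toy_df['category'])

print(X_va['category'].value_counts().to_dict())
```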

### Collaborative Filtering

In this model we only need a user's ID, the item's ID, and the rating that the user gave the item.

In [10]:
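def user_item_matrix(df, rating_col, user_col, item_col):
    # csr_matrix needs integer row/column indices, so encode the string IDs as category codes
    rows, cols = (df[c].astype('category').cat.codes for c in (user_col, item_col))
    return sp.csr_matrix((df[rating_col].to_numpy(), (rows, cols)))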

In [11]:
train_data = pd.concat([X_train, pd.DataFrame(y_train, columns=['overall'])], axis=1)
val_data = pd.concat([X_val, pd.DataFrame(y_val, columns=['overall'])], axis=1)

In [12]:
import scipy.sparse as sp

item_matrix = train_data.pivot(index='itemID', columns='reviewerID', values='overall')
item_matrix = item_matrix.fillna(0)
user_item_train_matrix = sp.csr_matrix(item_matrix.values)

In [13]:
global_average = train_data['overall'].mean()

In [14]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5)
model_knn.fit(user_item_train_matrix)
item_neighbors = np.asarray(model_knn.kneighbors(user_item_train_matrix, return_distance=False))

In [15]:
user_matrix = train_data.pivot(index='reviewerID', columns='itemID', values='overall')
user_matrix = user_matrix.fillna(0)
user_item_train_matrix = sp.csr_matrix(user_matrix.values)

In [16]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5)
model_knn.fit(user_item_train_matrix)
user_neighbors = np.asarray(model_knn.kneighbors(user_item_train_matrix, return_distance=False))

In [17]:
train_user_avg = train_data.groupby(train_data['reviewerID'], as_index=False)['overall'].mean()
train_item_avg = train_data.groupby(train_data['itemID'], as_index=False)['overall'].mean()
train_user_avg.columns = ['reviewerID', 'userAverage']
train_item_avg.columns = ['itemID', 'itemAverage']
train_user_avg = train_user_avg.set_index('reviewerID')
train_item_avg = train_item_avg.set_index('itemID')

In [18]:
item_avgs = []
for i in range(len(item_neighbors)):
    item_avgs.append(train_item_avg['itemAverage'][item_matrix.index[item_neighbors[i]]].mean())

item_avgs = pd.concat([pd.DataFrame(item_matrix.index, columns=['itemID']), pd.DataFrame(item_avgs, columns=['itemRating'])], axis=1)

In [19]:
user_avgs = []
for i in range(len(user_neighbors)):
    user_avgs.append(train_user_avg['userAverage'][user_matrix.index[user_neighbors[i]]].mean())

user_avgs = pd.concat([pd.DataFrame(user_matrix.index, columns=['reviewerID']), pd.DataFrame(user_avgs, columns=['userRating'])], axis=1)
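
To make the neighbour-averaging step concrete, here is a minimal sketch on a tiny made-up item-by-user matrix (`toy_items` and its values are assumptions for illustration). It follows the same idea as the cells above: find each item's nearest neighbours under cosine distance, then average the neighbours' mean ratings.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings: rows are items, columns are users, 0 means "not rated".
toy_items = pd.DataFrame(
    [[5, 4, 0],
     [5, 5, 0],
     [0, 1, 2]],
    index=['p1', 'p2', 'p3'], columns=['u1', 'u2', 'u3'])

# Mean of the observed (non-zero) ratings for each item.
item_average = toy_items.replace(0, np.nan).mean(axis=1)

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
knn.fit(toy_items.values)
neighbors = knn.kneighbors(toy_items.values, return_distance=False)

# Each item's feature is the mean of the averages of its nearest neighbours
# (the item itself is among them, at cosine distance 0).
item_rating = pd.Series(
    [item_average.iloc[idx].mean() for idx in neighbors],
    index=toy_items.index)
print(item_rating)
```

Because every item is its own nearest neighbour, the resulting feature is a smoothed version of the item's own average rather than a prediction made purely from other items.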

In [20]:
def weighted_average_data(X, total_avg, user_avgs, item_avgs):
    """Looks up the neighbour-based user and item ratings for each row of `X`.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features; must contain `reviewerID` and `itemID`.
    total_avg: float
        The average across all users/items, used as a fallback for users
        or items that do not appear in `user_avgs` or `item_avgs`.
    user_avgs: pd.DataFrame
        A DataFrame with columns `reviewerID` and `userRating`.
    item_avgs: pd.DataFrame
        A DataFrame with columns `itemID` and `itemRating`.
    
    Returns
    -------
    pd.DataFrame
        A DataFrame indexed like `X` with the columns `userRating` and
        `itemRating`.
    
    """
    df_user = pd.merge(X, user_avgs, how='left', on=['reviewerID'])
    df_final = pd.merge(df_user, item_avgs, how='left', on=['itemID'])
    df_final = df_final[['userRating', 'itemRating']]
    df_final = df_final.fillna(total_avg)
    df_final.index = X.index
    return df_final

In [21]:
X_train_aug = weighted_average_data(X_train, global_average, user_avgs, item_avgs)
X_val_aug = weighted_average_data(X_val, global_average, user_avgs, item_avgs)

In [22]:
X_train_mod = pd.concat([X_train, X_train_aug], axis=1)
X_val_mod = pd.concat([X_val, X_val_aug], axis=1)

In [23]:
def threshold_rating(rating):
    """Thresholds `rating` to lie in the range [1, 5].
    
    Parameters
    ----------
    rating: float
        The rating to be thresholded.
    
    Returns
    -------
    float
        A float representing the thresholded rating.
    
    """
    if rating < 1:
        return 1
    if rating > 5:
        return 5
    return rating
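
For whole arrays of predictions, the same thresholding can be done in one vectorised call. This is a small sketch with made-up values, using `np.clip` as an alternative to applying `threshold_rating` element by element.

```python
import numpy as np

raw_preds = np.array([0.2, 1.0, 3.7, 5.0, 6.4])

# np.clip bounds every element to [1, 5] in one vectorised call.
clipped = np.clip(raw_preds, 1, 5)
print(clipped)
```

For a pandas Series, `Series.clip(1, 5)` gives the same result while keeping the index.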

In [24]:
X_train_mod['pred'] = (0.5 * X_train_mod['userRating']) + (0.5 * X_train_mod['itemRating'])
X_train_mod['pred'] = X_train_mod['pred'].apply(threshold_rating)  # store the thresholded values
print("Training MSE: {}".format(calculate_MSE(y_train, X_train_mod['pred'])))

X_val_mod['pred'] = (0.5 * X_val_mod['userRating']) + (0.5 * X_val_mod['itemRating'])
X_val_mod['pred'] = X_val_mod['pred'].apply(threshold_rating)  # store the thresholded values
print("Validation MSE: {}".format(calculate_MSE(y_val, X_val_mod['pred'])))

Training MSE: 0.8036093923598844
Validation MSE: 1.011061051627551


### Language Models

In [25]:
columns_to_keep = ['cleanedPrice', 'isPop', 'isAlternativeRock', 'isJazz', 'isClassical', 'isDanceElectronic', 'reviewWordCount']
X_train_reg1 = X_train[columns_to_keep].copy()
X_val_reg1 = X_val[columns_to_keep].copy()

In [26]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
# log(0) yields -inf for empty reviews; such rows are dropped by clean_dataset below.
X_train_reg1['reviewWordCount'] = np.log(X_train_reg1['reviewWordCount'])
X_val_reg1['reviewWordCount'] = np.log(X_val_reg1['reviewWordCount'])


In [27]:
def clean_dataset(df):
    """Drops rows containing NaN or infinite values and casts to float64."""
    df = df.dropna()
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

X_train_reg1 = clean_dataset(X_train_reg1)
y_train1 = y_train[y_train.index.isin(X_train_reg1.index)]
X_train1 = X_train[X_train.index.isin(X_train_reg1.index)].copy()

X_val_reg1 = clean_dataset(X_val_reg1)
y_val1 = y_val[y_val.index.isin(X_val_reg1.index)]
X_val1 = X_val[X_val.index.isin(X_val_reg1.index)].copy()

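
The cleaning step matters because taking the log of a zero word count produces `-inf`, which the regression models cannot handle. The sketch below reproduces that situation on a made-up three-row Series (the values and index labels are assumptions) and shows the filter keeping only the finite rows.

```python
import numpy as np
import pandas as pd

word_counts = pd.Series([12, 0, 3], index=[10, 11, 12], name='reviewWordCount')

with np.errstate(divide='ignore'):      # silence the log(0) runtime warning
    logged = np.log(word_counts)        # the zero count becomes -inf

features = logged.to_frame()
# Flag rows that contain NaN or +/-inf in any column, then keep the rest.
finite_rows = ~features.isin([np.nan, np.inf, -np.inf]).any(axis=1)
cleaned = features[finite_rows]

print(cleaned.index.tolist())
```

Because rows are removed from the features, the targets have to be filtered by the surviving index as well, which is what the `index.isin(...)` lines do.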


In [28]:
X_train_mod = X_train_mod[X_train_mod.index.isin(X_train_reg1.index)]
X_val_mod = X_val_mod[X_val_mod.index.isin(X_val_reg1.index)]  # keep validation rows aligned too

In [30]:
X_train_reg1 = min_max_scaler.fit_transform(X_train_reg1)
X_val_reg1 = min_max_scaler.transform(X_val_reg1)

In [31]:
from sklearn.linear_model import Ridge

vthreshold_rating = np.vectorize(threshold_rating)

alphas = [0.0, 0.01, 0.03, 0.1, 0.3]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Ridge(alpha=alpha)
    reg_model.fit(X_train_reg1, y_train1)
    print("Training Error: {}".format(calculate_MSE(y_train1, vthreshold_rating(reg_model.predict(X_train_reg1)))))
    print("Validation Error: {}".format(calculate_MSE(y_val1, vthreshold_rating(reg_model.predict(X_val_reg1)))))
    print()

Alpha = 0.0
------------
Training Error: 0.9528565786877277
Validation Error: 0.9782989806075709

Alpha = 0.01
------------
Training Error: 0.9527874581451989
Validation Error: 0.9780706174350472

Alpha = 0.03
------------
Training Error: 0.9527879364316555
Validation Error: 0.9780715543169359

Alpha = 0.1
------------
Training Error: 0.9527896104574072
Validation Error: 0.9780748237435576

Alpha = 0.3
------------
Training Error: 0.9527943917923807
Validation Error: 0.9780840821773407



In [32]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def process_review_text(review_text, exclude_text, ps):
    """Pre-processes the text given by `review_text`.
    
    Parameters
    ----------
    review_text: str
        The review text to be processed.
    exclude_text: collection
        A collection of words to be excluded.
    ps: PorterStemmer
        The PorterStemmer used to perform word stemming.
    
    Returns
    -------
    str
        A string representing the processed version of `review_text`.
    
    """
    review = re.sub('[^a-zA-Z0-9]', ' ', review_text).lower().split()
    review = [ps.stem(word) for word in review if word not in exclude_text]
    return ' '.join(review)

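
To show what the regular expression and stop-word filter in `process_review_text` do, here is a dependency-free sketch of just those two steps. The `tokenize` helper and the `tiny_stopwords` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), and stemming is left out.

```python
import re

tiny_stopwords = {'the', 'a', 'is', 'this'}  # hypothetical stand-in for NLTK's list

def tokenize(review_text, exclude_text):
    """Replaces non-alphanumerics with spaces, lowercases, and drops excluded words."""
    words = re.sub('[^a-zA-Z0-9]', ' ', review_text).lower().split()
    return [word for word in words if word not in exclude_text]

tokens = tokenize("This is the BEST album... 10/10!", tiny_stopwords)
print(tokens)
```

Punctuation such as `/` is turned into a space, so `10/10` is split into two separate tokens.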


In [33]:
exclude_english = set(stopwords.words('english'))
ps = PorterStemmer()
X_train1['processedReview'] = X_train1['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_val1['processedReview'] = X_val1['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))


In [34]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X_train_cv1 = cv.fit_transform(X_train1['processedReview'])
X_val_cv1 = cv.transform(X_val1['processedReview'])

In [35]:
import scipy.sparse as sp

X_train_reg1_sp = sp.csr_matrix(X_train_reg1)
X_train_cv_reg1 = sp.hstack((X_train_cv1, X_train_reg1_sp), format='csr')

X_val_reg1_sp = sp.csr_matrix(X_val_reg1)
X_val_cv_reg1 = sp.hstack((X_val_cv1, X_val_reg1_sp), format='csr')
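
The feature matrix built above places the bag-of-words counts next to the scaled numeric features. A minimal sketch of the same pattern on made-up reviews and numbers (all toy values are assumptions) looks as follows; note that the vocabulary is learned on the training text only.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

train_reviews = ['great album great sound', 'poor sound']
val_reviews = ['great unknownword']

# Toy numeric features (e.g. scaled price and log word count), one row per review.
train_numeric = np.array([[0.2, 1.0], [0.9, 0.0]])
val_numeric = np.array([[0.5, 0.5]])

cv = CountVectorizer(max_features=1500)
train_counts = cv.fit_transform(train_reviews)   # vocabulary is learned here
val_counts = cv.transform(val_reviews)           # words outside the vocabulary are ignored

# Stack text counts and numeric columns side by side, keeping everything sparse.
train_features = sp.hstack((train_counts, sp.csr_matrix(train_numeric)), format='csr')
val_features = sp.hstack((val_counts, sp.csr_matrix(val_numeric)), format='csr')

print(train_features.shape, val_features.shape)
```

Both matrices end up with the same number of columns, which is what allows a model fitted on the training matrix to score the validation matrix.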

In [38]:
from xgboost import XGBRegressor

learning_rates = [0.01, 0.03, 0.1, 0.3, 0.5]
estimators = [10, 50, 100, 200, 500]
depths = [1, 2, 5, 10]

for learning_rate in learning_rates:
    for estimator in estimators:
        for depth in depths:
            print("Learning Rate: {0}, # Estimators: {1}, Depth: {2}".format(learning_rate, estimator, depth))
            print("--------------------------------------------------")
            xg_reg = XGBRegressor(
                learning_rate=learning_rate, max_depth=depth, n_estimators=estimator)
            xg_reg.fit(X_train_cv_reg1, y_train1)
            print("Training Error: {}".format(calculate_MSE(y_train1, vthreshold_rating(xg_reg.predict(X_train_cv_reg1)))))
            print("Validation Error: {}".format(calculate_MSE(y_val1, vthreshold_rating(xg_reg.predict(X_val_cv_reg1)))))
            print()

Learning Rate: 0.01, # Estimators: 10, Depth: 1
--------------------------------------------------
Training Error: 12.593585326307256
Validation Error: 12.67473662884927

Learning Rate: 0.01, # Estimators: 10, Depth: 2
--------------------------------------------------
Training Error: 12.593585326307256
Validation Error: 12.67473662884927

Learning Rate: 0.01, # Estimators: 10, Depth: 5
--------------------------------------------------
Training Error: 12.593585326307256
Validation Error: 12.67473662884927

Learning Rate: 0.01, # Estimators: 10, Depth: 10
--------------------------------------------------
Training Error: 12.593585326307256
Validation Error: 12.67473662884927

Learning Rate: 0.01, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 6.550438196243737
Validation Error: 6.5880611764144055

Learning Rate: 0.01, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 6.52725632182597
...

Validation Error: 0.6136955626194984

Learning Rate: 0.1, # Estimators: 100, Depth: 1
--------------------------------------------------
Training Error: 0.8155976600043131
Validation Error: 0.7875032466405151

Learning Rate: 0.1, # Estimators: 100, Depth: 2
--------------------------------------------------
Training Error: 0.726521861767224
Validation Error: 0.7121344775180954

Learning Rate: 0.1, # Estimators: 100, Depth: 5
--------------------------------------------------
Training Error: 0.5527333729149018
Validation Error: 0.6203616906595741

Learning Rate: 0.1, # Estimators: 100, Depth: 10
--------------------------------------------------
Training Error: 0.30108368671816804
Validation Error: 0.5902296716747013

Learning Rate: 0.1, # Estimators: 200, Depth: 1
--------------------------------------------------
Training Error: 0.7603381182947971
Validation Error: 0.7376392402745808

Learning Rate: 0.1, # Estimators: 200, Depth: 2
--------------------------------------------------
...

Training Error: 0.04102141880108724
Validation Error: 0.7051522031851937

Learning Rate: 0.5, # Estimators: 500, Depth: 1
--------------------------------------------------
Training Error: 0.6162994432119324
Validation Error: 0.6326113331540619

Learning Rate: 0.5, # Estimators: 500, Depth: 2
--------------------------------------------------
Training Error: 0.4906136669008255
Validation Error: 0.5963565134729887

Learning Rate: 0.5, # Estimators: 500, Depth: 5
--------------------------------------------------
Training Error: 0.17578364274699398
Validation Error: 0.6432141219299883

Learning Rate: 0.5, # Estimators: 500, Depth: 10
--------------------------------------------------
Training Error: 0.0030042775739243374
Validation Error: 0.7145314862685995



It seems that `learning_rate=0.3`, `n_estimators=500`, and `max_depth=2` provide a very good model.

In [36]:

from xgboost import XGBRegressor

xg_reg = XGBRegressor(learning_rate=0.3, n_estimators=500, max_depth=2)
xg_reg.fit(X_train_cv_reg1, y_train1)

train_MSE = calculate_MSE(y_train1, vthreshold_rating(xg_reg.predict(X_train_cv_reg1)))
val_MSE = calculate_MSE(y_val1, vthreshold_rating(xg_reg.predict(X_val_cv_reg1)))

print("Training error based on XGBoost CountVectorizer prediction: %.3f" % train_MSE)
print("Validation error based on XGBoost CountVectorizer prediction: %.3f" % val_MSE)
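
Reading the best configuration off a long printed log is error-prone. One option is to collect the grid-search results in a table and sort by validation error. The sketch below illustrates this on synthetic data and, to stay dependency-light, uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; the data, the reduced grid, and the stand-in model are all assumptions made for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic ratings in [1, 5] driven by two of five random features.
rng = np.random.default_rng(17)
X = rng.normal(size=(300, 5))
y = np.clip(3 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300), 1, 5)
X_tr, X_va, y_tr, y_va = X[:240], X[240:], y[:240], y[240:]

results = []
for lr, n_est, depth in product([0.1, 0.3], [10, 50], [1, 2]):
    model = GradientBoostingRegressor(
        learning_rate=lr, n_estimators=n_est, max_depth=depth, random_state=17)
    model.fit(X_tr, y_tr)
    val_pred = np.clip(model.predict(X_va), 1, 5)
    results.append({'learning_rate': lr, 'n_estimators': n_est, 'max_depth': depth,
                    'val_mse': np.mean((y_va - val_pred) ** 2)})

# Lowest validation error first.
results_df = pd.DataFrame(results).sort_values('val_mse')
print(results_df.head(3))
```

The same loop structure should carry over to `XGBRegressor`, since it accepts the same three keyword arguments used in the cells above.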

### Meta Models

In this section we look at models that combine both collaborative filtering and language models.

We start by using the predictions of the collaborative filtering as features in our language model.

In [None]:
columns_to_keep = ['cleanedPrice', 'isPop', 'isAlternativeRock', 'isJazz', 'isClassical', 'isDanceElectronic', 'reviewWordCount', 'userRating', 'itemRating']
X_train_reg2 = X_train_mod[columns_to_keep].copy()
X_val_reg2 = X_val_mod[columns_to_keep].copy()

In [None]:
min_max_scaler = MinMaxScaler()
X_train_reg2['reviewWordCount'] = X_train_reg2['reviewWordCount'].apply(lambda x: np.log(x))
X_val_reg2['reviewWordCount'] = X_val_reg2['reviewWordCount'].apply(lambda x: np.log(x))

In [None]:
X_train_reg2 = clean_dataset(X_train_reg2)
y_train2 = y_train[y_train.index.isin(X_train_reg2.index)]
X_train2 = X_train[X_train.index.isin(X_train_reg2.index)].copy()

X_val_reg2 = clean_dataset(X_val_reg2)
y_val2 = y_val[y_val.index.isin(X_val_reg2.index)]
X_val2 = X_val[X_val.index.isin(X_val_reg2.index)].copy()

In [None]:
X_train_reg2 = min_max_scaler.fit_transform(X_train_reg2)
X_val_reg2 = min_max_scaler.transform(X_val_reg2)

In [None]:
alphas = [0.0, 0.01, 0.03, 0.1, 0.3]
for alpha in alphas:
    print("Alpha = {}".format(alpha))
    print("------------")
    reg_model = Ridge(alpha=alpha)
    reg_model.fit(X_train_reg2, y_train2)
    print("Training Error: {}".format(calculate_MSE(y_train2, vthreshold_rating(reg_model.predict(X_train_reg2)))))
    print("Validation Error: {}".format(calculate_MSE(y_val2, vthreshold_rating(reg_model.predict(X_val_reg2)))))
    print()

In [None]:
exclude_english = set(stopwords.words('english'))
ps = PorterStemmer()
X_train2['processedReview'] = X_train2['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))
X_val2['processedReview'] = X_val2['fullReviewText'].apply(lambda x: process_review_text(x, exclude_english, ps))

In [None]:
cv = CountVectorizer(max_features=1500)
X_train_cv2 = cv.fit_transform(X_train2['processedReview'])
X_val_cv2 = cv.transform(X_val2['processedReview'])

In [None]:
X_train_reg2_sp = sp.csr_matrix(X_train_reg2)
X_train_cv_reg2 = sp.hstack((X_train_cv2, X_train_reg2_sp), format='csr')

X_val_reg2_sp = sp.csr_matrix(X_val_reg2)
X_val_cv_reg2 = sp.hstack((X_val_cv2, X_val_reg2_sp), format='csr')

In [254]:
learning_rates = [0.01, 0.03, 0.1, 0.3, 0.5]
estimators = [50, 100, 200, 500]
depths = [1, 2, 5, 10]

for learning_rate in learning_rates:
    for estimator in estimators:
        for depth in depths:
            print("Learning Rate: {0}, # Estimators: {1}, Depth: {2}".format(learning_rate, estimator, depth))
            print("--------------------------------------------------")
            xg_reg = XGBRegressor(
                learning_rate=learning_rate, max_depth=depth, n_estimators=estimator)
            xg_reg.fit(X_train_cv_reg2, y_train2)
            print("Training Error: {}".format(calculate_MSE(y_train2, vthreshold_rating(xg_reg.predict(X_train_cv_reg2)))))
            print("Validation Error: {}".format(calculate_MSE(y_val2, vthreshold_rating(xg_reg.predict(X_val_cv_reg2)))))
            print()

Learning Rate: 0.01, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 6.491714957506724
Validation Error: 6.595133463794399

Learning Rate: 0.01, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 6.461911787213483
Validation Error: 6.581446631609598

Learning Rate: 0.01, # Estimators: 50, Depth: 5
--------------------------------------------------
Training Error: 6.40314233843784
Validation Error: 6.494756910963362

Learning Rate: 0.01, # Estimators: 50, Depth: 10
--------------------------------------------------
Training Error: 6.322468623454848
Validation Error: 6.449049836055851

Learning Rate: 0.01, # Estimators: 100, Depth: 1
--------------------------------------------------
Training Error: 2.9063349781748853
Validation Error: 3.0541690432646975

Learning Rate: 0.01, # Estimators: 100, Depth: 2
--------------------------------------------------
Training Error: 2.864983122568773
...

Training Error: 0.06969777971966795
Validation Error: 1.0286062081558125

Learning Rate: 0.3, # Estimators: 50, Depth: 1
--------------------------------------------------
Training Error: 0.6692930402447763
Validation Error: 0.8200044562755053

Learning Rate: 0.3, # Estimators: 50, Depth: 2
--------------------------------------------------
Training Error: 0.5884008659699668
Validation Error: 0.7383251494368801

Learning Rate: 0.3, # Estimators: 50, Depth: 5
--------------------------------------------------
Training Error: 0.42549690960893155
Validation Error: 0.6699183704615137

Learning Rate: 0.3, # Estimators: 50, Depth: 10
--------------------------------------------------
Training Error: 0.18309838036827866
Validation Error: 1.0646175694867113

Learning Rate: 0.3, # Estimators: 100, Depth: 1
--------------------------------------------------
Training Error: 0.6252972247435069
Validation Error: 0.7682903278136153

Learning Rate: 0.3, # Estimators: 100, Depth: 2
--------------------------------------------------
...

It seems that `learning_rate=0.3`, `n_estimators=200`, and `max_depth=2` perform best.

In [None]:
xg_reg = XGBRegressor(learning_rate=0.3, n_estimators=200, max_depth=2)
xg_reg.fit(X_train_cv_reg2, y_train2)

train_MSE = calculate_MSE(y_train2, vthreshold_rating(xg_reg.predict(X_train_cv_reg2)))
val_MSE = calculate_MSE(y_val2, vthreshold_rating(xg_reg.predict(X_val_cv_reg2)))

print("Training error based on XGBoost CountVectorizer prediction: %.3f" % train_MSE)
print("Validation error based on XGBoost CountVectorizer prediction: %.3f" % val_MSE)

This is actually much worse compared to the pure language model.

However, we could also create a meta model by taking a weighted average of predictions from collaborative filtering and the pure language model. We now try this for a few candidate weightings.

In [245]:
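weights = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

xg_reg = XGBRegressor(learning_rate=0.3, n_estimators=500, max_depth=2)
xg_reg.fit(X_train_cv_reg1, y_train1)

# Thresholded predictions of the pure language model
reg_train_preds = vthreshold_rating(xg_reg.predict(X_train_cv_reg1))
reg_val_preds = vthreshold_rating(xg_reg.predict(X_val_cv_reg1))

# Thresholded predictions of the collaborative filtering model
cf_train_preds = vthreshold_rating(X_train_mod['pred'])
cf_val_preds = vthreshold_rating(X_val_mod['pred'])

# weight = 1.0 uses only the language model, weight = 0.0 only collaborative filtering
for weight in weights:
    print("Weight: %.1f" % weight)
    print("------------")
    train_MSE = calculate_MSE(y_train1, ((weight*reg_train_preds) + ((1.0 - weight)*cf_train_preds)))
    val_MSE = calculate_MSE(y_val1, ((weight*reg_val_preds) + ((1.0 - weight)*cf_val_preds)))
    print("Training error: %.3f" % train_MSE)
    print("Validation error: %.3f" % val_MSE)
    print()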

Weight: 0.0
------------
Training error: 0.811
Validation error: 0.979

Weight: 0.1
------------
Training error: 0.755
Validation error: 0.910

Weight: 0.3
------------
Training error: 0.660
Validation error: 0.793

Weight: 0.5
------------
Training error: 0.590
Validation error: 0.702

Weight: 0.7
------------
Training error: 0.545
Validation error: 0.639

Weight: 0.9
------------
Training error: 0.524
Validation error: 0.602

Weight: 1.0
------------
Training error: 0.522
Validation error: 0.594
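
The weighted-average meta model can also be evaluated on a finer grid and the best weight picked programmatically. The following is a minimal sketch on made-up predictions (the arrays below are toy values, not results from this notebook), where `weight` multiplies the language-model prediction as in the cell above.

```python
import numpy as np

y_true = np.array([5.0, 4.0, 2.0, 5.0, 3.0])
language_preds = np.array([4.6, 4.1, 2.5, 4.8, 3.4])   # toy language-model predictions
cf_preds = np.array([4.2, 4.2, 4.0, 4.3, 4.1])         # toy collaborative-filtering predictions

weights = np.linspace(0.0, 1.0, 11)
# weight = 1.0 uses only the language model, weight = 0.0 only collaborative filtering.
val_mse = [np.mean((y_true - (w * language_preds + (1.0 - w) * cf_preds)) ** 2)
           for w in weights]

best = int(np.argmin(val_mse))
print("best weight: %.1f, MSE: %.3f" % (weights[best], val_mse[best]))
```

The weight should be chosen on validation data only; picking it on the training set would favour whichever component overfits more.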

