# Amazon Recommender System

In this notebook we prototype a set of models for building a recommender system on the Amazon music data.

We first establish a few simple baselines and then progress to implementing two different classes of models:
*  Collaborative filtering models based only on user and item data (no text)
*  A text-based model

Finally, we combine the best model from each class into a meta-model and evaluate its performance.

We start with the necessary imports.

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
plt.style.use('seaborn-darkgrid')
%matplotlib inline

random.seed(17)

Load the data from JSON into a pandas DataFrame.

In [2]:
data = None
with open(os.path.join('data', 'train.json'), 'r') as train_file:
    data = [json.loads(row) for row in train_file]

In [3]:
data_df = pd.DataFrame(data)
data_df

Unnamed: 0,overall,reviewTime,reviewerID,reviewText,summary,unixReviewTime,category,price,itemID,reviewHash,image
0,4.0,"08 24, 2010",u04428712,"So is Katy Perry's new album ""Teenage Dream"" c...",Amazing that I Actually Bought This...More Ama...,1282608000,Pop,$35.93,p70761125,85559980,
1,5.0,"10 31, 2009",u06946603,"I got this CD almost 10 years ago, and given t...",Excellent album,1256947200,Alternative Rock,$11.28,p85427891,41699565,
2,4.0,"10 13, 2015",u92735614,I REALLY enjoy this pairing of Anderson and Po...,"Love the Music, Hate the Light Show",1444694400,Pop,$89.86,p82172532,24751194,
3,5.0,"06 28, 2017",u35112935,Finally got it . It was everything thought it ...,Great,1498608000,Pop,$11.89,p15255251,22820631,
4,4.0,"10 12, 2015",u07141505,"Look at all star cast. Outstanding record, pl...",Love these guys.,1444608000,Jazz,$15.24,p82618188,53377470,
...,...,...,...,...,...,...,...,...,...,...,...
199995,4.0,"05 1, 2004",u68902609,"With this, Mariah's third album, Mariah proved...",Well Done Mariah! You Show 'Em!,1083369600,Pop,$7.98,p84118731,35077372,
199996,5.0,"02 27, 2017",u15269603,Fantastic CD. All the hits are here and even ...,"Great collection, excellent sound!",1488153600,Pop,$11.49,p08613950,09788722,
199997,3.0,"03 1, 2011",u25124021,"This recording is rather disappointing, to a c...",Odd Couplings,1298937600,Classical,$13.57,p25341819,71627957,
199998,5.0,"03 20, 2016",u04485604,Get it now ! Right now ! I am partial. I am a ...,Our Poet,1458432000,Alternative Rock,$11.07,p19134748,27463540,
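The loading cells above read one JSON object per line, which suggests that `train.json` is in JSON Lines form. Under that assumption, pandas can parse such input directly with `pd.read_json(..., lines=True)`. The sketch below uses a small in-memory sample (two records abridged from the rows shown above) rather than the real file; `dtype=False` and `convert_dates=False` are passed so that pandas does not try to re-interpret column values.

```python
import io

import pandas as pd

# Two abridged records in JSON Lines form (one JSON object per line).
sample = io.StringIO(
    '{"overall": 4.0, "reviewerID": "u04428712", "itemID": "p70761125", "price": "$35.93"}\n'
    '{"overall": 5.0, "reviewerID": "u06946603", "itemID": "p85427891", "price": "$11.28"}\n'
)

# lines=True tells pandas to treat each line as a separate record.
sample_df = pd.read_json(sample, lines=True, dtype=False, convert_dates=False)
print(sample_df)
```

For the real data, the same call would take the path `os.path.join('data', 'train.json')` in place of the in-memory sample.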


Defining a function to handle price so that it can be converted to a float

In [4]:
def trim_price(price):
    """Trims `price` to remove the $ sign.
    
    If `price` is missing or is not a string that starts with `$`,
    then the empty string is returned.
    
    Parameters
    ----------
    price: str
        A string representing a price.
    
    Returns
    -------
    str
        A string representing `price` but with the $ sign removed,
        or the empty string if `price` does not have the correct
        format.
    
    """
    if (not pd.isnull(price) and isinstance(price, str) and
        len(price) > 0 and price[0] == '$'):
        return price[1:]
    return ""
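`trim_price` only checks for a leading `$`, so a value such as `$1,299.00` would pass through it and then fail when converted to a float. As a hedged alternative, the sketch below defines a hypothetical `parse_price` helper (not part of the original notebook) that accepts only `$` followed by digits with an optional decimal part, and returns NaN for everything else.

```python
import math
import re

# Hypothetical stricter pattern: "$" followed by digits and an optional
# decimal part, e.g. "$35.93" or "$7".
_PRICE_PATTERN = re.compile(r'\$(\d+(?:\.\d+)?)')


def parse_price(price):
    """Return `price` as a float, or NaN if it is not of the form $x.xx."""
    if isinstance(price, str):
        match = _PRICE_PATTERN.fullmatch(price.strip())
        if match:
            return float(match.group(1))
    return math.nan


examples = ["$35.93", "$7", "", None, "$1,299.00", "35.93"]
parsed = [parse_price(p) for p in examples]
print(parsed)
```

Returning NaN instead of an empty string would let malformed rows be dropped with `dropna` rather than a string comparison; whether that is preferable depends on how the rest of the pipeline treats missing prices.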

### Preprocessing

We add some additional features:
*  reviewMonth - the month in which the review was written.
*  reviewYear - the year in which the review was written.
*  reviewHour - the hour of day of the review, derived from unixReviewTime.
*  reviewMonthYear - the review year and month combined into a single string.
*  cleanedPrice - a numeric version of the price column. Rows whose price is not correctly formatted are dropped.
*  fullReviewText - the summary followed by reviewText, joined with a space.
*  hasReviewText - 1 if fullReviewText is non-empty, 0 otherwise.

In [5]:
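from datetime import datetime, timezone

data_df['reviewMonth'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[0])
data_df['reviewYear'] = data_df['reviewTime'].apply(lambda x: x.split(' ')[2])
# Interpret the Unix timestamp in UTC so the hour does not depend on the
# local time zone of the machine running the notebook.
data_df['reviewHour'] = data_df['unixReviewTime'].apply(
    lambda x: datetime.fromtimestamp(x, tz=timezone.utc).hour)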
data_df['reviewMonthYear'] = data_df['reviewYear'] + '-' + data_df['reviewMonth']

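data_df['cleanedPrice'] = data_df['price'].apply(trim_price)
# Take an explicit copy after filtering so that the column assignments below
# act on an independent DataFrame and do not raise SettingWithCopyWarning.
data_df = data_df[data_df['cleanedPrice'] != ""].copy()
data_df['cleanedPrice'] = data_df['cleanedPrice'].astype('float')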

data_df['fixedReviewText'] = np.where(pd.isnull(data_df['reviewText']), "", data_df['reviewText'])
data_df['fixedSummary'] = np.where(pd.isnull(data_df['summary']), "", data_df['summary'])
data_df['fullReviewText'] = data_df['fixedSummary'] + " " + data_df['fixedReviewText']

data_df['hasReviewText'] = data_df['fullReviewText'].apply(lambda x: 0 if x == "" or x == " " else 1)

data_df = data_df.drop(columns=['fixedReviewText', 'fixedSummary'])


### Evaluation Metrics

Defining an MSE function

In [6]:
def calculate_MSE(actuals, predicteds):
    """Calculates the Mean Squared Error between `actuals` and `predicteds`.
    
    Parameters
    ----------
    actuals: np.array
        A numpy array of the actual values.
    predicteds: np.array
        A numpy array of the predicted values.
    
    Returns
    -------
    float
        A float representing the Mean Squared Error between `actuals` and
        `predicteds`.
    
    """
    return (((actuals - predicteds)**2).sum()) / (len(actuals))
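A small worked example may help when using this function with pandas objects. The sketch below repeats the same formula and shows that, when both arguments are pandas Series with different index labels, the subtraction aligns by label rather than by position, so the result can be silently wrong. The toy values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd


def calculate_MSE(actuals, predicteds):
    # Same formula as the notebook's function.
    return (((actuals - predicteds) ** 2).sum()) / (len(actuals))


actuals = np.array([5.0, 3.0, 4.0])
predicteds = np.array([4.0, 3.0, 2.0])

# With numpy arrays the comparison is positional: (1 + 0 + 4) / 3.
mse_arrays = calculate_MSE(actuals, predicteds)

# The same values as Series whose index labels do not overlap.
s_actuals = pd.Series(actuals, index=[10, 11, 12])
s_predicteds = pd.Series(predicteds, index=[0, 1, 2])

# Label alignment yields only NaN differences, which .sum() skips.
mse_misaligned = calculate_MSE(s_actuals, s_predicteds)

# Converting to numpy first restores the positional comparison.
mse_safe = calculate_MSE(s_actuals.to_numpy(), s_predicteds.to_numpy())

print(mse_arrays, mse_misaligned, mse_safe)
```

Passing numpy arrays, as the docstring specifies, avoids this pitfall.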

Separate the targets from the features, then split into training (80%) and validation (20%) sets.

In [7]:
from sklearn.model_selection import train_test_split

targets = data_df['overall']
feature_data = data_df.drop(columns=['overall'])

X_train, X_val, y_train, y_val = train_test_split(
    feature_data, targets, shuffle=True, test_size=0.2, random_state=17)

### Model Fitting

Throughout the model fitting process we keep three lists that store the model name, training error, and validation error, respectively, for every model that we prototype.

##### Baselines

We will look at two simple baseline models.

The first is the same baseline model implemented in `baseline.py`, but here we evaluate it on the validation set so that models are fitted and compared on data that is distinct from the test set.

In this model we simply compute the average rating and assign this as our prediction.

In [8]:
model_names = []
train_errors = []
validation_errors = []

In [9]:
def error_on_average(targets, avg):
    """Computes the error based on using average rating as the prediction.
    
    Parameters
    ----------
    targets: np.array
        The actual ratings.
    avg: float
        The predicted rating based on an average.
    
    Returns
    -------
    float
        A float representing the mean squared error from predicting
        based on `avg`.
    
    """
    return calculate_MSE(targets, avg)

In [10]:
train_avg = y_train.mean()

model_names.append("Average")
train_errors.append(error_on_average(y_train, train_avg))
validation_errors.append(error_on_average(y_val, train_avg))

print("Training error based on average prediction: %.3f" % train_errors[0])
print("Validation error based on average prediction: %.3f" % validation_errors[0])

Training error based on average prediction: 0.986
Validation error based on average prediction: 1.000
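Because the prediction is the training mean, the training error of this baseline is the population variance of the training ratings, which gives a useful reference point for every later model. The sketch below checks this identity on a handful of made-up ratings.

```python
import numpy as np

# Made-up ratings, for illustration only.
y = np.array([5.0, 4.0, 3.0, 5.0, 1.0])

mean_prediction = y.mean()
mse_of_mean = np.mean((y - mean_prediction) ** 2)

# np.var uses ddof=0 by default, i.e. the population variance.
population_variance = np.var(y)

print(mse_of_mean, population_variance)
```

Note that `pandas.Series.var` defaults to `ddof=1`, so it would differ slightly from this value unless `ddof=0` is passed.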


Our second baseline model is slightly more complicated. We will calculate three types of quantities:
*  The overall average
*  The difference between the average rating for each item and the overall average
*  The difference between the average rating for each user and the overall average

Our prediction for a particular user and item is then the sum of these three quantities.

We will refer to this model as Weighted Average.
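Written out, with $\mu$ the overall training average, $\bar{r}_u$ the average rating given by user $u$, and $\bar{r}_i$ the average rating received by item $i$ (notation introduced here for illustration), the prediction is:

```latex
\hat{r}_{ui} = \mu + (\bar{r}_i - \mu) + (\bar{r}_u - \mu) = \bar{r}_u + \bar{r}_i - \mu
```

The right-hand side corresponds to the expression `userAverage + itemAverage - total_avg` used in the code.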

In [20]:
train_df = pd.concat([X_train, y_train], axis=1)

train_avg_total = y_train.mean()
train_user_avg = train_df.groupby('reviewerID', as_index=False)['overall'].mean()
train_item_avg = train_df.groupby('itemID', as_index=False)['overall'].mean()
train_user_avg.columns = ['reviewerID', 'userAverage']
train_item_avg.columns = ['itemID', 'itemAverage']

In [29]:
def threshold_rating(rating):
    """Thresholds `rating` to lie in the range [1, 5].
    
    Parameters
    ----------
    rating: float
        The rating to be thresholded.
    
    Returns
    -------
    float
        A float representing the thresholded rating.
    
    """
    if rating < 1:
        return 1
    if rating > 5:
        return 5
    return rating

def weighted_average_error(X, y, total_avg, user_avgs, item_avgs):
    """Calculates the error based on the weighted average prediction.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    y: np.array
        A numpy array containing the targets
    total_avg: float
        The average across all users/items.
    user_avgs: pd.DataFrame
        A DataFrame containing the average rating for each user.
    item_avgs: pd.DataFrame
        A DataFrame containing the average rating for each item.
    
    Returns
    -------
    float
        A float representing the mean squared error of the predictions.
    
    """
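    df_user = pd.merge(X, user_avgs, how='left', on=['reviewerID'])
    df_final = pd.merge(df_user, item_avgs, how='left', on=['itemID'])
    # Users/items unseen in training fall back to the overall average.
    # fillna returns a new DataFrame, so the result must be assigned.
    df_final = df_final[['userAverage', 'itemAverage']].fillna(total_avg)
    print(len(df_final) == len(y))
    df_final['pred'] = df_final['userAverage'] + df_final['itemAverage'] - total_avg
    # apply also returns a new Series; assign it back to keep the thresholding.
    df_final['pred'] = df_final['pred'].apply(threshold_rating)
    # The merges reset the index, so compare positionally via numpy arrays
    # rather than letting pandas align `y` and the predictions by label.
    return calculate_MSE(np.asarray(y), df_final['pred'].to_numpy())

In [30]:
train_MSE = weighted_average_error(X_train, y_train, train_avg_total, train_user_avg, train_item_avg)
val_MSE = weighted_average_error(X_val, y_val, train_avg_total, train_user_avg, train_item_avg)

print("Training error based on weighted average prediction: %.3f" % train_MSE)
print("Validation error based on weighted average prediction: %.3f" % val_MSE)

True
True
Training error based on weighted average prediction: 1.613
Validation error based on weighted average prediction: 0.293

The two error values above were recorded with a version of `weighted_average_error` that discarded the results of `fillna` and `apply` and passed `y` and the predictions to `calculate_MSE` as pandas Series with different indices, so the values were paired by index label rather than by row position. They are therefore not valid MSEs (note the validation error far below the training error) and should be regenerated by re-running the two cells above.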
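To make the Weighted Average computation concrete, here is a minimal sketch on a toy dataset. All values are made up for illustration, and the use of `Series.map`, `fillna`, and `clip` is one possible way to express the steps, not necessarily how the notebook does it. The validation frame deliberately contains a user that never appears in training and has a non-default index.

```python
import numpy as np
import pandas as pd

# Toy training data (made-up values).
train = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2'],
    'itemID':     ['p1', 'p2', 'p1', 'p2'],
    'overall':    [5.0, 3.0, 4.0, 2.0],
})

# Toy validation data: 'u3' is unseen in training.
val = pd.DataFrame({
    'reviewerID': ['u1', 'u3'],
    'itemID':     ['p1', 'p2'],
    'overall':    [5.0, 1.0],
}, index=[7, 9])

mu = train['overall'].mean()
user_avg = train.groupby('reviewerID')['overall'].mean()
item_avg = train.groupby('itemID')['overall'].mean()

# Look up each average; unseen users/items fall back to the overall mean.
u = val['reviewerID'].map(user_avg).fillna(mu)
i = val['itemID'].map(item_avg).fillna(mu)

# Combine and keep the prediction inside the valid rating range [1, 5].
pred = (u + i - mu).clip(lower=1, upper=5)

# Compare positionally by converting both sides to numpy arrays.
mse = np.mean((val['overall'].to_numpy() - pred.to_numpy()) ** 2)
print(pred.tolist(), mse)
```

For the seen pair the prediction is 4.0 + 4.5 - 3.5 = 5.0, and for the unseen user it is 3.5 + 2.5 - 3.5 = 2.5, giving an MSE of (0 + 2.25) / 2 = 1.125.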
