

*  Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”
> This system recommends restaurants to users.
*  Find a dataset, or build out your own toy dataset.  As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.
> This project uses data from Kaggle: https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings/data


In [1]:
"""Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing).  
From there, create a user-item matrix. """

import pandas as pd

ratings = pd.read_csv('rating_final.csv')
"""If you choose to work with a large dataset, you’re encouraged to also create a small, relatively dense “user-item” matrix as a 
subset so that you can hand-verify your calculations. """
places = [135085, 132825, 135032, 132834, 135052, 135038]
ratings2 = ratings[ratings['placeID'].isin(places)]
user_matrix = ratings2.pivot(index='userID', columns='placeID', values='rating')
subset = user_matrix.tail(15)
subset

userID,132825,132834,135032,135038,135052,135085
U1108,,,2.0,,,1.0
U1109,2.0,,2.0,2.0,,2.0
U1112,,1.0,,,,
U1113,,,0.0,1.0,,1.0
U1114,0.0,0.0,,,,
U1116,2.0,2.0,,2.0,2.0,2.0
U1120,,,1.0,0.0,,0.0
U1122,,2.0,,2.0,,2.0
U1124,,,1.0,,,
U1125,,,1.0,2.0,,
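To hand-verify the pivot step, it can help to run it on a few made-up rows first. The sketch below assumes a long-format table with the same three columns used above (`userID`, `placeID`, `rating`); the `toy_ratings` values are invented for illustration and are not taken from `rating_final.csv`.

```python
import pandas as pd

# Hypothetical toy ratings in long format (one row per user/place pair)
toy_ratings = pd.DataFrame({
    "userID":  ["U1", "U1", "U2", "U2", "U3"],
    "placeID": [101, 102, 101, 103, 102],
    "rating":  [2, 1, 0, 2, 1],
})

# pivot() raises ValueError if a (userID, placeID) pair appears twice;
# pivot_table() would aggregate duplicates instead
toy_matrix = toy_ratings.pivot(index="userID", columns="placeID", values="rating")

# Fraction of user-item cells that actually hold a rating
density = toy_matrix.notna().to_numpy().mean()
print(toy_matrix)
print("density:", density)
```

Unrated pairs come out as `NaN`, which is what the later mean and RMSE steps rely on to skip missing data.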


In [0]:
"""Break your ratings into separate training and test datasets. """
import numpy as np
TRAIN_SIZE = 0.80
msk = np.random.rand(len(subset)) < TRAIN_SIZE

train = subset[msk]  
test = subset[~msk]
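The mask above splits by row, so each user lands entirely in train or entirely in test, and the unseeded `np.random.rand` call gives a different split on every run. One alternative, sketched here under the assumption that holding out individual ratings is acceptable for this exercise, is a seeded cell-level split. `split_ratings` and the `demo` matrix are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

def split_ratings(matrix, train_size=0.8, seed=0):
    """Randomly assign each observed rating to train or test (hypothetical helper)."""
    rng = np.random.default_rng(seed)          # seeded, so the split is repeatable
    observed = matrix.notna().to_numpy()
    in_train = rng.random(matrix.shape) < train_size
    train = matrix.where(observed & in_train)  # held-out cells become NaN
    test = matrix.where(observed & ~in_train)
    return train, test

# Made-up 4x3 user-item matrix with some missing ratings
demo = pd.DataFrame(
    {101: [2.0, 0.0, np.nan, 1.0],
     102: [2.0, np.nan, 1.0, 2.0],
     103: [np.nan, 1.0, 0.0, 2.0]},
    index=["U1", "U2", "U3", "U4"],
)
train_demo, test_demo = split_ratings(demo, seed=42)
```

With this layout, train and test keep the same shape as the full matrix, so every user and place can still receive a bias estimate from whatever ratings remain in train.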

In [3]:
"""Using your training data, calculate the raw average (mean) rating for every user-item combination."""
average = train.unstack().mean()
average

1.2857142857142858

In [4]:
"""Calculate the RMSE for raw average for both your training data and your test data. Using your training data, calculate the bias for each user and each item."""
SE = (train - average)*(train - average)
MSE = SE.mean().mean()
RMSE = MSE ** (1/2)
"train RMSE is " + str(RMSE)

'train RMSE is 0.8386452019503516'

In [5]:
SE = (test - average)*(test - average)
MSE = SE.mean().mean()
RMSE = MSE ** (1/2)
"test RMSE is " + str(RMSE)

'test RMSE is 0.8622861560792835'
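Note that `SE.mean().mean()` averages the per-place means, so a place with one rating counts as much as a place with ten. If the intent is an RMSE pooled over every observed rating, a version along the following lines may be closer. This is a sketch on made-up numbers; `rmse` and `demo` are hypothetical names, and the figures printed above are left as the notebook produced them.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """RMSE pooled over every cell where `actual` holds a rating."""
    err = (actual - predicted).to_numpy(dtype=float)
    return float(np.sqrt(np.nanmean(err ** 2)))

# Made-up matrix: place A has three ratings, place B has one
demo = pd.DataFrame(
    {"A": [2.0, 0.0, 2.0], "B": [1.0, np.nan, np.nan]},
    index=["u1", "u2", "u3"],
)
mu = np.nanmean(demo.to_numpy())           # raw average over observed ratings

pooled = rmse(demo, mu)                    # every rating weighted equally
se = (demo - mu) ** 2
by_column = se.mean().mean() ** 0.5        # every place weighted equally
print(pooled, by_column)
```

The two numbers agree only when every column has the same count of observed ratings, which is rarely the case in a sparse user-item matrix.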

In [6]:
"""From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination."""
user_averages = subset.mean(axis=1)
place_averages = train.mean(axis=0)

user_averages - average

userID
U1108    0.214286
U1109    0.714286
U1112   -0.285714
U1113   -0.619048
U1114   -1.285714
U1116    0.714286
U1120   -0.952381
U1122    0.714286
U1124   -0.285714
U1125    0.214286
U1126    0.214286
U1132    0.314286
U1134    0.214286
U1135   -1.285714
U1137    0.714286
dtype: float64

In [7]:
place_averages - average

placeID
132825    0.214286
132834    0.047619
135032   -0.285714
135038    0.047619
135052    0.047619
135085    0.047619
dtype: float64
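For reference, a bias is the gap between a user's (or place's) mean rating and the raw average: b_u = mean(user u) - mu and b_i = mean(place i) - mu. The cell above takes user means from `subset`, which includes the test rows; the sketch below assumes the biases should come from training data only, and uses an invented `train_demo` matrix so the arithmetic can be checked by hand.

```python
import numpy as np
import pandas as pd

# Made-up training matrix: 3 users x 2 places, NaN = not rated
train_demo = pd.DataFrame(
    {101: [2.0, 0.0, np.nan], 102: [2.0, np.nan, 1.0]},
    index=["U1", "U2", "U3"],
)

mu = np.nanmean(train_demo.to_numpy())       # (2 + 0 + 2 + 1) / 4 = 1.25
user_bias = train_demo.mean(axis=1) - mu     # row means minus raw average
item_bias = train_demo.mean(axis=0) - mu     # column means minus raw average

print(user_bias)
print(item_bias)
```

A positive bias marks a generous user or a well-liked place; a negative one marks the opposite.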

In [8]:
"""Calculate the RMSE for the baseline predictors for both your training data and your test data."""

train1 = train.copy()
for val in places:
  a = train1.apply(lambda x: x[val] if pd.notnull(x[val]) else average+place_averages[val], axis=1)
  train1[val] = (a + user_averages).clip(0, 2)
train1

userID,132825,132834,135032,135038,135052,135085
U1108,2.0,2.0,2.0,2.0,2.0,2.0
U1109,2.0,2.0,2.0,2.0,2.0,2.0
U1113,2.0,2.0,0.666667,1.666667,2.0,1.666667
U1116,2.0,2.0,2.0,2.0,2.0,2.0
U1120,2.0,2.0,1.333333,0.333333,2.0,0.333333
U1124,2.0,2.0,2.0,2.0,2.0,2.0
U1125,2.0,2.0,2.0,2.0,2.0,2.0
U1132,2.0,2.0,2.0,2.0,2.0,2.0
U1134,2.0,2.0,2.0,2.0,2.0,2.0
U1135,0.0,0.0,0.0,2.0,0.0,0.0


In [9]:
test1 = test.copy()

for val in places:
  a = test1.apply(lambda x: x[val] if pd.notnull(x[val]) else average+place_averages[val], axis=1)
  test1[val] = (a + user_averages).clip(0, 2)
test1

userID,132825,132834,135032,135038,135052,135085
U1112,2.0,2.0,2.0,2.0,2.0,2.0
U1114,0.0,0.0,2.0,2.0,2.0,2.0
U1122,2.0,2.0,2.0,2.0,2.0,2.0
U1126,2.0,2.0,2.0,2.0,2.0,2.0
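The two cells above add the user mean to either the observed rating or to the raw average plus the place mean, which is why most entries saturate at the clip ceiling of 2. The usual baseline predictor is instead mu + b_u + b_i for every user-item pair, with both biases measured as deviations from the raw average. The sketch below is one way to write that, assuming biases come from training data only and that users or places unseen in training get a bias of 0. `baseline_predictions` and `train_demo` are hypothetical names, and the 0 to 2 clip range matches the rating scale used in this notebook.

```python
import numpy as np
import pandas as pd

def baseline_predictions(train, users, items, lo=0.0, hi=2.0):
    """Predict mu + b_u + b_i for every (user, item) pair, clipped to [lo, hi]."""
    mu = np.nanmean(train.to_numpy())
    # Unseen users/items have no training mean, so their bias defaults to 0
    user_bias = (train.mean(axis=1) - mu).reindex(users).fillna(0.0)
    item_bias = (train.mean(axis=0) - mu).reindex(items).fillna(0.0)
    grid = mu + user_bias.to_numpy()[:, None] + item_bias.to_numpy()[None, :]
    return pd.DataFrame(np.clip(grid, lo, hi), index=users, columns=items)

# Made-up training matrix; U9 below never appears in it
train_demo = pd.DataFrame(
    {101: [2.0, 0.0, np.nan], 102: [2.0, np.nan, 1.0]},
    index=["U1", "U2", "U3"],
)
preds = baseline_predictions(train_demo, ["U1", "U2", "U3", "U9"], [101, 102])
print(preds)
```

Because the prediction never reads the rating it is being compared against, the resulting RMSE reflects how well the baseline generalizes rather than how close a shifted copy of the data is to itself.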


In [10]:
"""Summarize your results. """

SE = (test - test1)*(test - test1)
MSE = SE.mean().mean()
RMSE = MSE ** (1/2)
"test RMSE using baseline predictors is " + str(RMSE)

'test RMSE using baseline predictors is 0.408248290463863'

In [12]:
SE = (train - train1)*(train - train1)
MSE = SE.mean().mean()
RMSE = MSE ** (1/2)
"train RMSE using baseline predictors is " + str(RMSE) + ", which is much lower than using a straight average"

'train RMSE using baseline predictors is 0.4169751944147297, which is much lower than using a straight average'
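To put "much lower" on a scale, the four RMSE values printed in this notebook can be compared as relative reductions. This sketch only reuses those reported numbers; `relative_reduction` is a hypothetical helper, and the result is only as meaningful as the way the baseline predictions were built.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

# RMSE values copied from the cell outputs in this notebook
raw_train, raw_test = 0.8386452019503516, 0.8622861560792835
base_train, base_test = 0.4169751944147297, 0.408248290463863

train_gain = relative_reduction(raw_train, base_train)
test_gain = relative_reduction(raw_test, base_test)
print(f"train RMSE reduction: {train_gain:.1%}")
print(f"test RMSE reduction:  {test_gain:.1%}")
```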