# Amazon Product Ratings - Case Study
Problem Statement:
    * Build your own recommendation system for products on an e-commerce website like Amazon.com.

Dataset:
    * Amazon Reviews data ratings_Electronics.csv

Dataset columns: 
    1. The first three columns are userId, productId, and rating; the fourth column is timestamp.
    2. You can discard the timestamp column, since it is not needed for this case study.
    3. The repository has several datasets.
    4. For this case study, please use the Electronics dataset.
    5. The host page has several pointers to scripts and other examples that can help with parsing the datasets.
    6. The data set consists of: 7,824,482 Ratings (1-5) for Electronics products.
    7. Other metadata about products.
    8. Please see the description of the fields available on the web page cited above.
    9. For convenience of future use, parse the raw data file (using Python, for example).
    10. Extract the following fields:
        a. 'product/productId' as prod_id,
        b. 'product/title' as prod_name,
        c. 'review/userId' as user_id,
        d. 'review/score' as rating
    11. Save these to a tab-separated file.
    12. Name this file as product_ratings.csv.

Mark Distribution:
    * Steps 1, 2, 3, 8: 5 marks each
    * Steps 4, 5, 6, 7: 10 marks each

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

import warnings
warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from surprise import Dataset,Reader
from surprise.model_selection import train_test_split as surp_train_test_split
from surprise import KNNWithMeans
from surprise import accuracy
from surprise import Prediction
from surprise.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split as skl_train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

%matplotlib inline

Q1. Read and explore the dataset. (Rename column, plot histograms, find data characteristics)

In [2]:
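# The file's column order is userId, productId, rating, timestamp (see "Dataset columns"),
# so with these names 'prod_name' holds the user IDs and 'user_id' holds the timestamps.
rating=pd.read_csv('ratings_Electronics.csv',names=['prod_name','prod_id','rating','user_id'])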
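
Going by the column order given under "Dataset columns" (userId, productId, rating, timestamp), the load step could be sketched as below. This is only an illustration: the helper `load_ratings`, the `COLUMNS` list, and the two-row in-memory sample (copied from the first rows shown in this notebook) are assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Assumed column order, taken from the dataset description:
# userId, productId, rating, timestamp.
COLUMNS = ['user_id', 'prod_id', 'rating', 'timestamp']

def load_ratings(source):
    """Read a headerless ratings CSV and drop the unused timestamp column."""
    df = pd.read_csv(source, names=COLUMNS,
                     dtype={'user_id': str, 'prod_id': str})
    return df.drop(columns='timestamp')

# Tiny in-memory stand-in for ratings_Electronics.csv.
sample = io.StringIO(
    "AKM1MP6P0OYPR,132793040,5.0,1365811200\n"
    "A2CX7LUOHB2NDG,321732944,5.0,1341100800\n"
)
ratings_demo = load_ratings(sample)
print(ratings_demo)
```

For the real file, `load_ratings('ratings_Electronics.csv')` would be the corresponding call.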

In [3]:
rating.dtypes

prod_name     object
prod_id       object
rating       float64
user_id        int64
dtype: object

In [4]:
rating=rating[['prod_name','prod_id','user_id','rating']]
rating.head()

        prod_name    prod_id     user_id  rating
0   AKM1MP6P0OYPR  132793040  1365811200     5.0
1  A2CX7LUOHB2NDG  321732944  1341100800     5.0
2  A2NWSAGRHCP8N5  439886341  1367193600     1.0
3  A2WNBOD3WNDNKT  439886341  1374451200     3.0
4  A1GI0U4ZRJA8WN  439886341  1334707200     1.0


In [5]:
rating.iloc[:,0].count()

7824482

In [6]:
rating.user_id=rating.user_id.astype('str')

In [7]:
rating.describe().transpose()

            count      mean      std  min  25%  50%  75%  max
rating  7824482.0  4.012337  1.38091  1.0  3.0  5.0  5.0  5.0


In [8]:
# Number of distinct values in each column
for col in rating.columns:
    print(col,rating[col].nunique())

prod_name 4201696
prod_id 476002
user_id 5489
rating 5


In [9]:
rating.drop(columns='prod_name',inplace=True)
rating.head()

     prod_id     user_id  rating
0  132793040  1365811200     5.0
1  321732944  1341100800     5.0
2  439886341  1367193600     1.0
3  439886341  1374451200     3.0
4  439886341  1334707200     1.0


In [10]:
rating.corr()

        rating
rating     1.0


Q2. Take a subset of the dataset to make it less sparse (denser). For example, keep only the users who have given 50 or more ratings.

In [11]:
rating.drop_duplicates(inplace=True)

In [12]:
rating.count()

prod_id    6801855
user_id    6801855
rating     6801855
dtype: int64

In [13]:
rating.isnull().any()

prod_id    False
user_id    False
rating     False
dtype: bool

In [14]:
rating_sub=rating.groupby(by=['prod_id','user_id']).max().reset_index()

In [15]:
dupe=rating_sub.groupby(by=['user_id']).count()
dupe['review_count']=dupe.rating
dupe=dupe.drop(columns=['prod_id','rating']).sort_values(by='review_count',ascending=False)
dupe=dupe[dupe.review_count>50].reset_index()
dupe.head()

      user_id  review_count
0  1380672000         10963
1  1389052800         10821
2  1388707200          9732
3  1356652800          9114
4  1404691200          8779


In [16]:
rating_sub.shape
rating_sub=rating_sub[rating_sub.user_id.isin(dupe.user_id)]
rating_sub.shape

(6108711, 3)

(6075674, 3)
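
The user filter in cells 15 and 16 can also be written in one step with `groupby(...).transform('count')`. The sketch below is illustrative only: `keep_active_users` and the toy frame are made up for the example. It uses `>=` so that users with exactly `min_ratings` ratings are kept, matching the "50 or more" wording of Q2 (cell 15 uses a strict `>`).

```python
import pandas as pd

def keep_active_users(df, min_ratings=50):
    """Keep only the rows whose user has at least `min_ratings` ratings."""
    # transform('count') returns, for every row, the rating count of that row's user.
    counts = df.groupby('user_id')['rating'].transform('count')
    return df[counts >= min_ratings]

# Toy data: u1 has three ratings, u2 and u3 have one each.
toy = pd.DataFrame({
    'prod_id': ['p1', 'p2', 'p3', 'p1', 'p2'],
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u3'],
    'rating':  [5.0, 4.0, 3.0, 2.0, 1.0],
})
active = keep_active_users(toy, min_ratings=3)
print(active)
```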

In [17]:
rating_sub.head()

     prod_id     user_id  rating
0  132793040  1365811200     5.0
1  321732944  1341100800     5.0
2  439886341  1334707200     1.0
3  439886341  1367193600     1.0
4  439886341  1374451200     3.0


In [18]:
dupe=pd.DataFrame(data=rating_sub.iloc[:,[0,1]].groupby(by=['prod_id']).count()['user_id'].sort_values(ascending=False).reset_index())
dupe.columns=['prod_id','review_count']
dupe=dupe[dupe.review_count>1000].reset_index(drop=True)
rating_sub=rating_sub[rating_sub.prod_id.isin(dupe.prod_id)]
rating_sub.shape

(108076, 3)

In [19]:
np.size(rating_sub.user_id.unique())
np.size(rating_sub.prod_id.unique())

3732

87
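
To check how dense the subset has become, the share of filled cells in the user-by-product matrix can be computed from the three numbers printed above (108,076 ratings, 3,732 users, 87 products). A small sketch, assuming each user/product pair appears at most once (cell 14 collapses duplicates); `matrix_density` is a name chosen for this example.

```python
def matrix_density(n_ratings, n_users, n_items):
    """Fraction of user/item cells that contain a rating."""
    return n_ratings / (n_users * n_items)

# Figures taken from the outputs of cells 18 and 19.
density = matrix_density(108076, 3732, 87)
print(f"{density:.3f}")  # 0.333
```

Roughly one third of the possible user/product pairs carry a rating in the filtered subset.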

Q3. Split the data randomly into train and test dataset. (For example split it in 70/30 ratio)

In [20]:
r_train,r_test=skl_train_test_split(rating_sub,test_size=0.3)
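
The split above has no `random_state`, so each run produces a different partition. A minimal sketch of a repeatable 70/30 split is shown below; it uses `DataFrame.sample` rather than scikit-learn, and `split_frame`, the seed value, and the toy frame are assumptions made for the example.

```python
import pandas as pd

def split_frame(df, test_size=0.3, seed=42):
    """Return (train, test); assumes df has a unique index."""
    test = df.sample(frac=test_size, random_state=seed)
    train = df.drop(test.index)  # everything not sampled into test
    return train, test

# Toy frame with 10 rows and a default RangeIndex.
toy = pd.DataFrame({
    'prod_id': [f'p{i % 4}' for i in range(10)],
    'user_id': [f'u{i}' for i in range(10)],
    'rating':  [float(i % 5 + 1) for i in range(10)],
})
train_demo, test_demo = split_frame(toy)
print(len(train_demo), len(test_demo))  # 7 3
```

With scikit-learn, passing `random_state` to `train_test_split` gives the same kind of repeatability.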

Q4. Build Popularity Recommender model.

In [21]:
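r_test_dupe=r_test.copy()  # keep a copy of the true test ratings
r_test=r_test.assign(rating=np.nan)  # mask the test ratings so only training ratings feed the product means
rating_new=pd.concat([r_train,r_test]).reset_index(drop=True)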
rating_new.shape

(108076, 3)

In [22]:
rating_new.loc[:,['prod_id','user_id']].shape
rating_new.loc[:,['prod_id','user_id']].drop_duplicates().shape
rating_new.loc[:,['prod_id','user_id']].shape

(108076, 2)

(108076, 2)

(108076, 2)

In [23]:
rating_new.isnull().any()

prod_id    False
user_id    False
rating      True
dtype: bool

In [24]:
rating_new.groupby(by='prod_id').mean().sort_values(by=['rating'],ascending=False).head()

              rating
prod_id
B000LRMS66  4.944111
B003LR7ME6  4.933742
B003ES5ZUU  4.929742
B0019EHU8G  4.913978
B004GF8TIK  4.912306


In [25]:
rating_new.shape
len(rating_new['prod_id'].unique())
len(rating_new['user_id'].unique())
rating_new.groupby(by='prod_id')['rating'].mean().sort_values(ascending=False).head(5)

(108076, 3)

87

3732

prod_id
B000LRMS66    4.944111
B003LR7ME6    4.933742
B003ES5ZUU    4.929742
B0019EHU8G    4.913978
B004GF8TIK    4.912306
Name: rating, dtype: float64
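
The ranking above is a popularity model in its simplest form: every user gets the products with the highest mean rating. A hedged sketch of the same idea as a reusable function follows. The `min_ratings` threshold and the tie-break on rating count are additions for illustration, not something the notebook does, and `popularity_top_n` is a hypothetical name.

```python
import pandas as pd

def popularity_top_n(train, n=5, min_ratings=1):
    """Rank products by mean rating, ignoring products with too few ratings."""
    stats = (train.groupby('prod_id')['rating']
                  .agg(mean_rating='mean', n_ratings='count'))
    stats = stats[stats['n_ratings'] >= min_ratings]
    # Ties on the mean are broken in favour of the product with more ratings.
    return (stats.sort_values(['mean_rating', 'n_ratings'], ascending=False)
                 .head(n)
                 .reset_index())

# Toy data: p1 and p2 both average 5.0, but p2 has a single rating.
toy = pd.DataFrame({
    'prod_id': ['p1', 'p1', 'p2', 'p3', 'p3', 'p3'],
    'rating':  [5.0, 5.0, 5.0, 4.0, 3.0, 4.0],
})
top = popularity_top_n(toy, n=2, min_ratings=2)
print(top)
```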

In [26]:
x_train,x_test=skl_train_test_split(rating_new,test_size=0.3,random_state=1)
x_rating_value=x_train.groupby(by='prod_id')['rating'].mean().sort_values(ascending=False)
x_rating_test_value=x_test.groupby(by='prod_id')['rating'].mean().sort_values(ascending=False)

In [27]:
train_value=x_rating_value.reset_index(name='rating')
train_value.head()
test_value=x_rating_test_value.reset_index(name='rating')
test_value.head()

      prod_id    rating
0  B003LR7ME6  4.944351
1  B000LRMS66  4.943348
2  B003ES5ZUU  4.940754
3  B0019EHU8G  4.920493
4  B004GF8TIK  4.909449

      prod_id    rating
0  B000LRMS66  4.945892
1  B004GF8TIK  4.919598
2  B0052YFYFK  4.916996
3  B00316263Y  4.911538
4  B003ES5ZUU  4.909091


In [28]:
same_products=pd.merge(test_value,train_value,on='prod_id',how='inner')
same_products.head()

      prod_id  rating_x  rating_y
0  B000LRMS66  4.945892  4.943348
1  B004GF8TIK  4.919598  4.909449
2  B0052YFYFK  4.916996  4.908911
3  B00316263Y  4.911538  4.852190
4  B003ES5ZUU  4.909091  4.940754


In [29]:
test_value[test_value['prod_id'].isin(same_products.prod_id.head(1))]
train_value[train_value['prod_id'].isin(same_products.prod_id.head(1))]
print('RMSE:',np.sqrt(mean_squared_error(same_products['rating_y'],same_products['rating_x'])))

      prod_id    rating
0  B000LRMS66  4.945892

      prod_id    rating
1  B000LRMS66  4.943348


RMSE: 0.08175152155462859
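
The RMSE above compares per-product mean ratings between two splits, so it measures how stable the product means are rather than how well individual ratings are predicted. One alternative reading of Q6 is to predict every held-out rating with the training mean of its product and score those predictions row by row. The sketch below assumes that reading; `popularity_rmse` and the toy frames are hypothetical, and products unseen in training fall back to the global training mean.

```python
import numpy as np
import pandas as pd

def popularity_rmse(train, test):
    """RMSE when each test rating is predicted by its product's training mean."""
    item_means = train.groupby('prod_id')['rating'].mean()
    global_mean = train['rating'].mean()
    # Products missing from the training data get the global mean instead.
    predicted = test['prod_id'].map(item_means).fillna(global_mean)
    return float(np.sqrt(np.mean((test['rating'] - predicted) ** 2)))

train_toy = pd.DataFrame({'prod_id': ['p1', 'p1', 'p2'],
                          'rating':  [4.0, 2.0, 5.0]})
test_toy = pd.DataFrame({'prod_id': ['p1', 'p2'],
                         'rating':  [5.0, 5.0]})
rmse_toy = popularity_rmse(train_toy, test_toy)
print(rmse_toy)
```

In the toy data the p1 prediction is 3.0 (error 2) and the p2 prediction is 5.0 (error 0), so the result is the square root of 2.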


Q5. Build Collaborative Filtering model.

In [30]:
reader=Reader(rating_scale=(1,5))
data=Dataset.load_from_df(rating_new[['user_id','prod_id','rating']],reader)

In [31]:
trainset,testset=surp_train_test_split(data,test_size=0.3,random_state=0)

In [32]:
trainset.to_raw_uid(0)
trainset.to_raw_iid(0)

'1401148800'

'B002HWRJY4'

In [33]:
len(rating_new['user_id'].unique())
len(rating_new['prod_id'].unique())
trainset_cv=trainset
testset_cv=testset
trainset_cv

3732

87

<surprise.trainset.Trainset at 0x15f3f366978>
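
The collaborative filtering model used in the following cells is `KNNWithMeans` with `user_based=False`, i.e. an item-based neighbourhood model that corrects each item's mean rating by the mean-centred ratings the user gave to similar items. The function below is a simplified sketch of that idea for intuition only; it is not the library's code, it skips details such as clipping to the rating scale, and the input format is invented for the example.

```python
def knn_with_means_predict(item_mean, neighbours, k=5):
    """Mean-centred KNN estimate for one (user, item) pair.

    neighbours: list of (similarity, user's rating of the neighbour item,
                         mean rating of the neighbour item).
    """
    # Take the k most similar neighbours, then keep positive similarities only.
    top = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    top = [(s, r, m) for s, r, m in top if s > 0]
    sim_sum = sum(s for s, _, _ in top)
    if sim_sum == 0:
        return item_mean  # no usable neighbours: fall back to the item mean
    offset = sum(s * (r - m) for s, r, m in top) / sim_sum
    return item_mean + offset

est = knn_with_means_predict(
    item_mean=4.0,
    neighbours=[(0.8, 5.0, 4.5), (0.2, 3.0, 4.0), (-0.5, 1.0, 3.0)],
)
print(round(est, 3))  # 4.2
```

Here the offset is (0.8 * 0.5 + 0.2 * (-1.0)) / (0.8 + 0.2) = 0.2; the neighbour with negative similarity is ignored.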

Q6. Evaluate both the models. (Once the model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.)

The RMSE for the popularity-based model is computed above. The RMSE for the collaborative filtering model is computed below.

In [34]:
param_grid={'k':[5],
            'sim_options':{'name':['msd','cosine','pearson'],
                           'user_based':[False]}}
gs=GridSearchCV(KNNWithMeans,param_grid,measures=['rmse','mae'],cv=5)
gs.fit(data)
gs.best_score['rmse']
gs.best_params['rmse']

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson si

nan

{'k': 5, 'sim_options': {'name': 'msd', 'user_based': False}}
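
The `nan` score is most likely a consequence of cell 21: `rating_new` contains NaN for every test row (cell 23 confirms the missing values), and that frame is what `Dataset.load_from_df` receives in cell 30. A single NaN in the errors is enough to turn the mean squared error into NaN, as the sketch below shows on made-up numbers; the `rmse` helper is written for this example.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error; NaN inputs propagate to the result."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

ratings = pd.DataFrame({'rating': [5.0, np.nan, 3.0],
                        'est':    [4.0, 4.0, 4.0]})
rmse_all = rmse(ratings['rating'], ratings['est'])
print(rmse_all)  # nan

# Dropping the rows without a known rating restores a finite score.
known = ratings.dropna(subset=['rating'])
rmse_known = rmse(known['rating'], known['est'])
print(rmse_known)  # 1.0
```

One possible remedy, under that assumption, is to build the surprise dataset from rows with known ratings only, for example `rating_new.dropna(subset=['rating'])` or the unmasked `rating_sub`.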

Q7. Get top-K (K = 5) recommendations. Since our goal is to recommend new products to each user based on their habits, we will recommend 5 new products.

Q8. Summarise your insights.

In [35]:
algo=KNNWithMeans(k=5,sim_options={'name':'pearson','user_based':False})
algo.fit(trainset)
test_pred=algo.test(testset)
accuracy.rmse(test_pred)
test_pred[:2]

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x15f46b02a58>

RMSE: nan


nan

[Prediction(uid='1386288000', iid='B00007M1TZ', r_ui=5.0, est=5, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='1295827200', iid='B00007M1TZ', r_ui=5.0, est=5, details={'actual_k': 0, 'was_impossible': False})]

In [36]:
# One row per test prediction: user, product, estimated rating
predictions_df=pd.DataFrame([[x.uid,x.iid,x.est] for x in test_pred])
predictions_df.columns=['user_id','prod_id','est_rating']
predictions_df.sort_values(by=['user_id','est_rating'],ascending=False,inplace=True)
predictions_df[:5]
predictions_df.head(20)
predictions_df.groupby('user_id').head(5).reset_index(drop=True).head()

          user_id     prod_id  est_rating
5388   1405987200  B000WYVBR0           5
6988   1405987200  B0012S4APK           5
15410  1405987200  B000S5Q9CA           5
18668  1405987200  B00B46XUQU           5
23244  1405987200  B004GF8TIK           5

          user_id     prod_id  est_rating
5388   1405987200  B000WYVBR0           5
6988   1405987200  B0012S4APK           5
15410  1405987200  B000S5Q9CA           5
18668  1405987200  B00B46XUQU           5
23244  1405987200  B004GF8TIK           5
25342  1405987200  B00020S7XK           5
29582  1405987200  B0007MXZB2           5
1362   1405900800  B000HPV3RW           5
3959   1405900800  B0043T7FXE           5
11072  1405900800  B000WL6YY8           5

      user_id     prod_id  est_rating
0  1405987200  B000WYVBR0           5
1  1405987200  B0012S4APK           5
2  1405987200  B000S5Q9CA           5
3  1405987200  B00B46XUQU           5
4  1405987200  B004GF8TIK           5
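
Q7 asks for new products, whereas the predictions above come from the test set, i.e. from products each user has already rated. A sketch of a top-K step that removes already-rated products before ranking is given below. It assumes a frame of estimated ratings for candidate user/product pairs and a frame of known ratings; `top_k_new_items` and both toy frames are hypothetical.

```python
import pandas as pd

def top_k_new_items(predictions, seen, k=5):
    """Top-k products per user by estimated rating, excluding rated products."""
    merged = predictions.merge(seen[['user_id', 'prod_id']],
                               on=['user_id', 'prod_id'],
                               how='left', indicator=True)
    # 'left_only' marks pairs that do not occur among the known ratings.
    unseen = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
    ranked = unseen.sort_values(['user_id', 'est_rating'],
                                ascending=[True, False])
    return ranked.groupby('user_id').head(k).reset_index(drop=True)

preds = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u1', 'u2', 'u2'],
    'prod_id':    ['pA', 'pB', 'pC', 'pA', 'pB'],
    'est_rating': [4.5, 3.0, 5.0, 2.0, 4.0],
})
already_rated = pd.DataFrame({'user_id': ['u1'], 'prod_id': ['pC']})
recs = top_k_new_items(preds, already_rated, k=1)
print(recs)
```

In the toy data u1 has already rated pC, so the top recommendation for u1 is pA rather than the higher-scored pC.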
