Problem Statement -
Build your own recommendation system for products on an e-commerce website like Amazon.com.


Dataset - Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/), file ratings_Electronics_Ver3.tar.xz (you may use the WinRAR application to extract the .csv file)

Dataset columns - The first three columns are userId, productId, and rating; the fourth column is timestamp. You can discard the timestamp column, as it is not needed in this case.


o The repository has several datasets. For this case study, please use the Electronics dataset.
o The host page has several pointers to scripts and other examples that can help with parsing the datasets.
o The data set consists of:
● 7,824,482 Ratings (1-5) for Electronics products.
● Other metadata about products. Please see the description of the fields available on the web page cited above.


o For convenience of future use, parse the raw data file (using Python, for example) and extract the following fields: 'product/productId' as prod_id, 'product/title' as prod_name, 'review/userId' as user_id, 'review/score' as rating
o Save these to a tab-separated file named product_ratings.csv.
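
The parsing step described above could be sketched as follows. This is only a sketch under a stated assumption: that the raw file stores each review as a block of `key: value` lines separated by blank lines, which is the layout suggested by field names such as `product/productId`. The names `parse_reviews`, `write_tsv`, and `FIELDS`, and the output column name `user_id`, are hypothetical and not part of the dataset or the assignment.

```python
import csv
from typing import Dict, Iterable, Iterator

# Raw field name -> output column name (mapping taken from the task text;
# the spelling "user_id" is an assumption).
FIELDS = {
    "product/productId": "prod_id",
    "product/title": "prod_name",
    "review/userId": "user_id",
    "review/score": "rating",
}


def parse_reviews(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per review from 'key: value' blocks split by blank lines."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:  # a blank line closes the current review block
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:  # keep only the four requested fields
            record[FIELDS[key]] = value
    if record:  # the last block may not be followed by a blank line
        yield record


def write_tsv(records: Iterable[Dict[str, str]], path: str) -> None:
    """Write parsed records to a tab-separated file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(FIELDS.values()), delimiter="\t")
        writer.writeheader()
        writer.writerows(records)
```

With a raw file opened as `fh`, a call such as `write_tsv(parse_reviews(fh), "product_ratings.csv")` would produce the tab-separated file. Fields missing from a review are written as empty cells.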

Steps -
1. Read and explore the dataset. (Rename columns, plot histograms, find data characteristics.)
2. Take a subset of the dataset to make it less sparse (denser). (For example, keep only the users who have given 50 or more ratings.)
3. Split the data randomly into train and test datasets. (For example, split it in a 70/30 ratio.)
4. Build a Popularity Recommender model.
5. Build a Collaborative Filtering model.
6. Evaluate both models. (Once a model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.)
7. Get top-K (K = 5) recommendations. Since our goal is to recommend new products to each user based on his/her habits, we will recommend 5 new products.
8. Summarise your insights.
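
Steps 2, 3, and 6 in the list above can be illustrated with a short sketch. It assumes a ratings frame with the columns `userid`, `productid`, and `rating` (the names used when the file is loaded in this notebook). The helper names `keep_active_users`, `random_split`, and `rmse` are hypothetical, and this is one possible approach rather than the required one.

```python
import numpy as np
import pandas as pd


def keep_active_users(ratings: pd.DataFrame, min_ratings: int = 50) -> pd.DataFrame:
    """Step 2: keep only rows from users with at least `min_ratings` ratings."""
    per_user = ratings.groupby("userid")["rating"].transform("count")
    return ratings[per_user >= min_ratings]


def random_split(ratings: pd.DataFrame, test_frac: float = 0.3, seed: int = 42):
    """Step 3: random row-level split (70/30 by default).

    Assumes the frame's index is unique, so dropping the sampled index
    labels leaves exactly the remaining rows.
    """
    test = ratings.sample(frac=test_frac, random_state=seed)
    train = ratings.drop(test.index)
    return train, test


def rmse(actual, predicted) -> float:
    """Step 6: root mean squared error between true and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```

A row-level random split can leave some test users or products unseen in training. Whether that matters depends on the model being evaluated.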

Mark Distribution -
Steps 1, 2, 3, 8 - 5 marks each
Steps 4, 5, 6, 7 - 10 marks each

Please note: Going forward, you will be pushing all your assessment files to the same repository for the remainder of the program, so it is important to follow a consistent naming structure that identifies each assessment submission properly.

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
col_names = ['userid', 'productid', 'rating', 'timestamp']
df = pd.read_csv('ratings_Electronics.csv', names = col_names)

In [5]:
df.shape

(7824482, 4)

In [6]:
df.drop(['timestamp'], axis=1, inplace = True)
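
As an alternative to loading all four columns and then dropping `timestamp`, the column can be skipped while reading. This is a minimal sketch that assumes the same headerless four-column CSV layout as above; the function name `load_ratings` is hypothetical.

```python
import io

import pandas as pd

col_names = ["userid", "productid", "rating", "timestamp"]


def load_ratings(source) -> pd.DataFrame:
    """Read a headerless ratings CSV, keeping only the first three columns."""
    return pd.read_csv(source, names=col_names, usecols=col_names[:3])


# Tiny in-memory stand-in for the real file, for illustration only.
demo_csv = io.StringIO("A1,P1,5.0,1365811200\nA2,P2,3.0,1341100800\n")
demo = load_ratings(demo_csv)
```

Skipping the column at read time avoids holding the timestamps in memory at all, which may help with a file of this size.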

In [7]:
df.head()

,userid,productid,rating
0,AKM1MP6P0OYPR,132793040,5.0
1,A2CX7LUOHB2NDG,321732944,5.0
2,A2NWSAGRHCP8N5,439886341,1.0
3,A2WNBOD3WNDNKT,439886341,3.0
4,A1GI0U4ZRJA8WN,439886341,1.0


In [8]:
print('**********Describe**********************')
df.describe(include='all').transpose()

print('**********Info**********************')
df.info()

print('**********Is NA Count**********************')
df.isna().sum()

print('**********Is Null**********************')
df.isnull().any(axis=0)

**********Describe**********************


,count,unique,top,freq,mean,std,min,25%,50%,75%,max
userid,7824482.0,4201696.0,A5JLAU2ARJ0BO,520.0,,,,,,,
productid,7824482.0,476002.0,B0074BW614,18244.0,,,,,,,
rating,7824480.0,,,,4.01234,1.38091,1.0,3.0,5.0,5.0,5.0


**********Info**********************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824482 entries, 0 to 7824481
Data columns (total 3 columns):
userid       object
productid    object
rating       float64
dtypes: float64(1), object(2)
memory usage: 179.1+ MB
**********Is NA Count**********************


userid       0
productid    0
rating       0
dtype: int64

**********Is Null**********************


userid       False
productid    False
rating       False
dtype: bool

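#### The first two columns (userid, productid) are of object type, and the last column (rating) is numeric with a mean rating of ~4. There are no null values in the dataframe.
#### The dataset has too many records (7,824,482 ratings) to work with comfortably as-is.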
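
The size concern can be quantified with a quick density check: the share of all possible user-product pairs that actually carry a rating. The sketch below assumes each (user, product) pair appears at most once, and the function name `matrix_density` is hypothetical. With the counts reported by `describe` above (7,824,482 ratings, 4,201,696 users, 476,002 products), this fraction comes to only a few ratings per million pairs.

```python
import pandas as pd


def matrix_density(ratings: pd.DataFrame) -> float:
    """Fraction of the user x product matrix that holds a rating.

    Assumes each (userid, productid) pair occurs at most once.
    """
    n_users = ratings["userid"].nunique()
    n_items = ratings["productid"].nunique()
    return len(ratings) / (n_users * n_items)
```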

In [9]:
pd.crosstab(df['rating'], df['productid'] )

rating \ productid,0132793040,0321732944,0439886341,0511189877,0528881469,0558835155,059400232X,0594012015,0594017343,0594017580,...,B00LOLBBQQ,B00LPQRT34,B00LS5WBYE,B00LTAUTHE,B00LXEC8CU,BT008G3W52,BT008SXQ4C,BT008T2BGK,BT008UKTMW,BT008V9J9U
1.0,0,0,2,0,9,0,0,6,1,0,...,1,0,0,0,0,0,1,0,1,0
2.0,0,0,0,1,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3.0,0,0,1,0,1,1,0,0,0,1,...,0,0,0,0,0,0,0,0,2,0
4.0,0,0,0,0,5,0,0,0,0,0,...,0,1,0,0,0,0,0,0,4,0
5.0,1,1,0,5,7,0,3,2,0,0,...,1,0,1,1,1,1,0,1,7,1
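
The crosstab above has one column per product, which makes it hard to read. For the overall picture, a simple count per rating value may be enough. A minimal sketch, assuming the same `rating` column; the function name `rating_distribution` is hypothetical.

```python
import pandas as pd


def rating_distribution(ratings: pd.DataFrame) -> pd.Series:
    """Count how many ratings fall on each score, ordered by score."""
    return ratings["rating"].value_counts().sort_index()
```

Calling `rating_distribution(df).plot(kind="bar")` would draw the counts as a bar chart, which covers the histogram part of step 1.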


In [19]:
s_product = df['productid'].value_counts()

In [20]:
type(s_product)

pandas.core.series.Series

In [21]:
s_product

B0074BW614    18244
B00DR0PDNE    16454
B007WTAJTO    14172
B0019EHU8G    12285
B006GWO5WK    12226
B003ELYQGG    11617
B003ES5ZUU    10276
B007R5YDYA     9907
B00622AG6S     9823
B0002L5R78     9487
B008OHNZI0     8966
B003LR7ME6     8840
B000LRMS66     8715
B009SYZ8OC     8370
B00BGA9WK2     7561
B004QK7HI8     7060
B009A5204K     7059
B00BGGDVOO     6893
B0098F5W0Q     6616
B002MAPRYU     6599
B002WE6D44     6509
B005HMKKH4     6134
B0012S4APK     5642
B0052YFYFK     5521
B0001FTVEK     5345
B0044YU60M     5239
B00316263Y     5038
B000I68BD4     4903
B006ZP8UOW     4842
B0041Q38NU     4774
              ...  
B0055MNG86        1
B00009R9EY        1
B00006I5Z4        1
B0000DFZOE        1
B004YLRO4G        1
B00HS5R3ZU        1
B000PG5RWU        1
B000NRDPGQ        1
B000247YNA        1
B000YJKBLA        1
B00155Z0BS        1
B002CJKH8M        1
B002LWUTDS        1
B0056BEU18        1
B00456QOKK        1
B003O5T0MC        1
B004T39OGA        1
B004YYPI1O        1
B000WGSILK        1


In [22]:
df_product = s_product.to_frame()


In [23]:
type(df_product)

pandas.core.frame.DataFrame

In [24]:
df_product.shape

(476002, 1)

In [25]:
df_product.head()

,productid
B0074BW614,18244
B00DR0PDNE,16454
B007WTAJTO,14172
B0019EHU8G,12285
B006GWO5WK,12226


In [27]:
# Copy the product IDs from the index into a regular column
df_product['productidorig'] = list(df_product.index)

In [28]:
df_product.head()

,productid,productidorig
B0074BW614,18244,B0074BW614
B00DR0PDNE,16454,B00DR0PDNE
B007WTAJTO,14172,B007WTAJTO
B0019EHU8G,12285,B0019EHU8G
B006GWO5WK,12226,B006GWO5WK


In [29]:
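# Rename the columns. Note that 'rating' here holds the number of ratings
# per product (a count), not a rating value; 'productid' is the ID copied from the index.
df_product.columns = ['rating', 'productid']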

In [30]:
df_product.head()

,rating,productid
B0074BW614,18244,B0074BW614
B00DR0PDNE,16454,B00DR0PDNE
B007WTAJTO,14172,B007WTAJTO
B0019EHU8G,12285,B0019EHU8G
B006GWO5WK,12226,B006GWO5WK
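
The same per-product count table can be built in one step, which avoids the intermediate renaming. This is a sketch only; the function name `ratings_per_product` and the column name `rating_count` are assumptions. Using `groupby(...).size()` also sidesteps the fact that the column name produced by `value_counts().to_frame()` differs between pandas versions.

```python
import pandas as pd


def ratings_per_product(ratings: pd.DataFrame) -> pd.DataFrame:
    """Return one row per product with its number of ratings, most rated first."""
    return (
        ratings.groupby("productid")
        .size()                                  # number of rows per product
        .reset_index(name="rating_count")        # index -> 'productid' column
        .sort_values("rating_count", ascending=False, ignore_index=True)
    )
```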


In [39]:
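# Per-product rating count and mean rating. A list of aggregations keeps the
# column order fixed (a set, as in pd.pivot_table(..., aggfunc={"count", "mean"}), has no defined order).
table = df.groupby('productid')['rating'].agg(['count', 'mean'])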
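
The per-product count and mean computed here are the usual ingredients of a popularity-based recommender (step 4). The sketch below shows one possible way to turn them into a top-K list. Ranking by count first and mean second, and the `min_count` threshold, are assumptions rather than the approach prescribed by the assignment; the function name `top_k_popular` is hypothetical.

```python
import pandas as pd


def top_k_popular(ratings: pd.DataFrame, k: int = 5, min_count: int = 1) -> pd.DataFrame:
    """Return the k most-rated products, with ties broken by mean rating."""
    stats = ratings.groupby("productid")["rating"].agg(["count", "mean"])
    stats = stats[stats["count"] >= min_count]  # optionally ignore rarely rated items
    return stats.sort_values(["count", "mean"], ascending=False).head(k)
```

Because a popularity model ignores the individual user, every user receives the same list. Filtering out products a user has already rated would be a separate step.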

In [40]:
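# Note: without parentheses this displays the bound method's repr (the whole frame, truncated);
# table.head() would return only the first five rows.
table.head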

<bound method NDFrame.head of             count      mean
productid                  
0132793040      1  5.000000
0321732944      1  5.000000
0439886341      3  1.666667
0511189877      6  4.500000
0528881469     27  2.851852
0558835155      1  3.000000
059400232X      3  5.000000
0594012015      8  2.000000
0594017343      1  1.000000
0594017580      1  3.000000
0594033896      5  4.400000
0594033926     15  4.533333
0594033934      2  5.000000
0594202442      1  4.000000
0594287995      1  5.000000
0594296420      6  4.666667
0594450209      2  5.000000
0594450705      1  5.000000
0594451647     14  4.357143
0594477670      3  4.666667
0594478162      1  4.000000
0594481813     31  4.225806
0594481902     13  4.384615
0594482127      1  4.000000
0594511488      2  5.000000
0594514681      2  4.500000
0594514789      1  5.000000
0594549507      1  4.000000
0594549558      1  5.000000
0743610431      2  4.000000
...           ...       ...
B00LGN7Y3G      1  5.000000
B00LGQ6HL8      6 