# Problem Statement - Build your own recommendation system for products on an e-commerce website like Amazon.com.


Dataset - Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/), file ratings_Electronics_Ver3.tar.xz (you may use the WinRAR application to extract the .csv file)

Dataset columns - the first three columns are userId, productId, and ratings; the fourth column is timestamp. You can discard the timestamp column, as it is not needed for this case study.


o The repository has several datasets. For this case study, please use the Electronics dataset.
o The host page has several pointers to scripts and other examples that can help with parsing the datasets.
o The data set consists of:
● 7,824,482 Ratings (1-5) for Electronics products.
● Other metadata about products. Please see the description of the fields available on the web page cited above.


o For convenience of future use, parse the raw data file (using Python, for example) and extract the following fields: 'product/productId' as prod_id, 'product/title' as prod_name, 'review/userId' as user id, 'review/score' as rating
o Save these to a tab separated file. Name this file as product_ratings.csv.
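
The parsing step above can be sketched as follows. This is a minimal sketch, not the dataset's official parser: it assumes the raw file stores one `key: value` pair per line with a blank line between reviews, and the sample records, the `user_id` column name, and the helper names `parse_records` and `write_tsv` are hypothetical.

```python
import csv
import io

# Raw field names (from the task description) mapped to output column names.
# "user_id" is an assumed spelling of the "user id" column.
FIELDS = {
    "product/productId": "prod_id",
    "product/title": "prod_name",
    "review/userId": "user_id",
    "review/score": "rating",
}


def parse_records(lines):
    """Yield one dict per review; records are assumed to be blank-line separated."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        # Split on the first ": " only, so titles containing ": " stay intact.
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            record[FIELDS[key]] = value
    if record:
        yield record


def write_tsv(records, out):
    """Write the parsed records as tab-separated rows with a header line."""
    writer = csv.DictWriter(
        out, fieldnames=list(FIELDS.values()), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)


# Hypothetical sample input, used only to exercise the functions.
sample = """product/productId: B000TEST01
product/title: Example USB Cable
review/userId: A1EXAMPLEUSER
review/score: 5.0

product/productId: B000TEST02
product/title: Example Headphones
review/userId: A2EXAMPLEUSER
review/score: 3.0
"""

buffer = io.StringIO()
write_tsv(parse_records(sample.splitlines()), buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write product_ratings.csv, the same `write_tsv` call can be pointed at a file opened with `open("product_ratings.csv", "w", newline="")` instead of the in-memory buffer.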

Steps -
1. Read and explore the dataset. (Rename column, plot histograms, find data characteristics)

2. Take a subset of the dataset to make it less sparse/more dense. (For example, keep only the users who have given 50 or more ratings.)
3. Split the data randomly into train and test datasets. (For example, split it in a 70/30 ratio.)
4. Build Popularity Recommender model.
5. Build Collaborative Filtering model.
6. Evaluate both the models. (Once the model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.)
7. Get top-K (K = 5) recommendations. Since our goal is to recommend new products to each user based on his/her habits, we will recommend 5 new products.
8. Summarise your insights.

Mark Distributions -
Step - 1,2,3,8 - 5 marks each
Step - 4,5,6,7 - 10 marks each

In [2]:
import pandas as pd

In [3]:
# Read and explore the dataset. (Rename column, plot histograms, find data characteristics)
df = pd.read_csv('ratings_Electronics.csv',  header = None)
df.columns = ['userId', 'productId', 'ratings','timestamp']    


In [4]:
df.head()

Unnamed: 0,userId,productId,ratings,timestamp
0,AKM1MP6P0OYPR,132793040,5.0,1365811200
1,A2CX7LUOHB2NDG,321732944,5.0,1341100800
2,A2NWSAGRHCP8N5,439886341,1.0,1367193600
3,A2WNBOD3WNDNKT,439886341,3.0,1374451200
4,A1GI0U4ZRJA8WN,439886341,1.0,1334707200


In [5]:
# data characteristics
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ratings,7824482.0,4.012337,1.38091,1.0,3.0,5.0,5.0,5.0
timestamp,7824482.0,1338178000.0,69004260.0,912729600.0,1315354000.0,1361059000.0,1386115000.0,1406074000.0


In [6]:
df.shape

(7824482, 4)

In [7]:
df.dtypes

userId        object
productId     object
ratings      float64
timestamp      int64
dtype: object

In [8]:
# plot histograms
df['ratings'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1d78bed5548>

In [9]:
df['ratings'].value_counts()
# more than 4 million ratings are exactly 5.0

5.0    4347541
4.0    1485781
1.0     901765
3.0     633073
2.0     456322
Name: ratings, dtype: int64

In [10]:
# making new data frame - Dropping Rows with at least 1 null value.
new_df = df.dropna(axis = 0, how ='any') 

In [11]:
new_df.shape

(7824482, 4)

In [12]:
new_df['ratings'].value_counts()
# since nothing changed after dropna, there are no null values

5.0    4347541
4.0    1485781
1.0     901765
3.0     633073
2.0     456322
Name: ratings, dtype: int64

In [13]:
user_counts = new_df['userId'].value_counts()
user_counts

A5JLAU2ARJ0BO     520
ADLVFFE4VBT8      501
A3OXHLG6DIBRW8    498
A6FIAB28IS79      431
A680RUE1FDO8B     406
                 ... 
A1U6GF9UP215KE      1
A35WMTYW24WN06      1
A3LLKTR1OX9H06      1
A900YI1TA8R7Q       1
AWVL22DSRMKEH       1
Name: userId, Length: 4201696, dtype: int64

In [14]:
# Take a subset of the dataset to make it less sparse/more dense.
# (For example, keep only the users who have given 50 or more ratings.)

In [15]:
# Create a subset with the users who have given more than 50 ratings
# (the mask uses > 50; change it to >= 50 to also keep users with exactly 50 ratings)

subset_df = new_df[new_df.userId.isin(user_counts[user_counts > 50].index)]
subset_df.head()

Unnamed: 0,userId,productId,ratings,timestamp
118,AT09WGFUM934H,0594481813,3.0,1377907200
177,A32HSNCNPRUMTR,0970407998,1.0,1319673600
178,A17HMM1M7T9PJ1,0970407998,4.0,1281744000
492,A3CLWR1UUZT6TG,0972683275,5.0,1373587200
631,A3TAS1AG6FMBQW,0972683275,5.0,1353456000


In [16]:
subset_df.shape

(122171, 4)
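
How much denser the subset is can be checked by dividing the number of ratings by the number of possible user-product pairs. The sketch below is illustrative only: the toy frame and the helper names `filter_active_users` and `density` are hypothetical, and the density formula assumes each user rates a product at most once. The filter uses `>=` so that users with exactly `min_ratings` ratings are kept, matching the "50 or more" wording of step 2.

```python
import pandas as pd


def filter_active_users(ratings, min_ratings=50):
    """Keep only rows from users with at least `min_ratings` ratings."""
    counts = ratings["userId"].value_counts()
    keep = counts[counts >= min_ratings].index
    return ratings[ratings["userId"].isin(keep)]


def density(ratings):
    """Fraction of the user x product matrix that is filled.

    Assumes every (userId, productId) pair appears at most once.
    """
    n_users = ratings["userId"].nunique()
    n_items = ratings["productId"].nunique()
    return len(ratings) / (n_users * n_items)


# Hypothetical toy data: u1 rated three products, u2 and u3 one each.
toy = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u3"],
    "productId": ["p1", "p2", "p3", "p1", "p4"],
    "ratings": [5.0, 4.0, 3.0, 2.0, 1.0],
})

dense = filter_active_users(toy, min_ratings=3)
print(density(toy), density(dense))
```

On the real data the same two calls would be `density(new_df)` and `density(subset_df)`.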

In [18]:
# mean rating per product (note: this ignores how many ratings each product received)
popular_products=subset_df.groupby('productId')['ratings'].mean()

In [19]:
popular_products

productId
0594481813    3.0
0970407998    2.5
0972683275    5.0
1400501466    3.0
1400501520    5.0
             ... 
B00LED02VY    4.0
B00LGN7Y3G    5.0
B00LGQ6HL8    5.0
B00LI4ZZO8    4.5
B00LKG1MC8    5.0
Name: ratings, Length: 47155, dtype: float64

In [20]:
# Top Rated products
popular_products.sort_values(ascending=False).head(10)

productId
B00LKG1MC8    5.0
B000H8WLKC    5.0
B000HA4EZK    5.0
B004EHZZDW    5.0
B004EI0EG4    5.0
B000H9J3WA    5.0
B004EK9ODG    5.0
B004EKEBNY    5.0
B004EKEF0S    5.0
B004EKOCSS    5.0
Name: ratings, dtype: float64
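
Every product in the list above has a mean of 5.0, which is what happens when products with very few ratings are ranked by mean alone. One possible popularity model, sketched here under assumptions, ranks by the number of ratings first and uses the mean rating as a tie-breaker. The function name `popularity_table`, the `min_count` parameter, and the toy data are hypothetical; the column names follow the notebook.

```python
import pandas as pd


def popularity_table(ratings, min_count=1):
    """Rank products by number of ratings, then by mean rating."""
    stats = (
        ratings.groupby("productId")["ratings"]
        .agg(rating_count="count", mean_rating="mean")
        .reset_index()
    )
    # Drop products with too few ratings to be considered popular.
    stats = stats[stats["rating_count"] >= min_count]
    return stats.sort_values(
        ["rating_count", "mean_rating"], ascending=False
    ).reset_index(drop=True)


# Hypothetical toy data: p2 has a perfect mean but only one rating.
toy = pd.DataFrame({
    "userId": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "productId": ["p1", "p1", "p1", "p2", "p3", "p3"],
    "ratings": [5.0, 4.0, 4.0, 5.0, 3.0, 5.0],
})

ranked = popularity_table(toy)
ranked_min2 = popularity_table(toy, min_count=2)
print(ranked)
```

Because this model is not personalised, every user receives the same top-K list, for example `popularity_table(subset_df).head(5)`.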

In [22]:
# unique users
len(subset_df["userId"].unique())


1466

In [16]:
# Split the data randomly into train and test dataset. (For example split it in 70/30 ratio)

In [23]:
# Requires the scikit-surprise package (pip install scikit-surprise)
from surprise import Dataset, Reader
reader = Reader(rating_scale=(1, 5))

ModuleNotFoundError: No module named 'surprise'

In [None]:
# load_from_df expects the columns in the order user, item, rating
data = Dataset.load_from_df(subset_df[['userId', 'productId', 'ratings']], reader)

In [None]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=123)
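
If Surprise is not installed, a random 70/30 split can also be made directly on the DataFrame. This is a minimal sketch assuming the frame has a unique index; the helper name `random_split` and the toy data are hypothetical.

```python
import pandas as pd


def random_split(ratings, train_frac=0.7, seed=123):
    """Randomly split rows into train and test sets (index must be unique)."""
    train = ratings.sample(frac=train_frac, random_state=seed)
    test = ratings.drop(train.index)
    return train, test


# Hypothetical toy data with 20 ratings.
toy = pd.DataFrame({
    "userId": [f"u{i % 4}" for i in range(20)],
    "productId": [f"p{i}" for i in range(20)],
    "ratings": [float(1 + i % 5) for i in range(20)],
})

train, test = random_split(toy)
print(len(train), len(test))
```

A purely random split can leave some users or products only in the test set; models that cannot score unseen users or products need a fallback for those rows.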

In [None]:
user_records = trainset.ur

In [None]:
user_records[1]

In [None]:
print(trainset.to_raw_uid(0))
print(trainset.to_raw_iid(1))

In [None]:
from surprise import KNNWithMeans
from surprise import accuracy
from surprise import Prediction

In [None]:
algo = KNNWithMeans(k=5, sim_options={'name': 'pearson', 'user_based': False})
algo.fit(trainset)
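
To make the model above less of a black box, the following is a simplified, pure-Python sketch of item-based k-nearest-neighbour prediction with mean-centring and Pearson similarity. It is an illustration of the general technique, not Surprise's implementation: the function names, the choice to ignore non-positive similarities, the clipping to the 1-5 scale, and the toy ratings are all assumptions.

```python
from math import sqrt


def item_means(ratings):
    """ratings maps item -> {user: rating}; return the mean rating per item."""
    return {item: sum(r.values()) / len(r) for item, r in ratings.items()}


def pearson(ri, rj):
    """Pearson correlation over users who rated both items (0.0 if undefined)."""
    common = set(ri) & set(rj)
    if len(common) < 2:
        return 0.0
    mi = sum(ri[u] for u in common) / len(common)
    mj = sum(rj[u] for u in common) / len(common)
    num = sum((ri[u] - mi) * (rj[u] - mj) for u in common)
    den = sqrt(
        sum((ri[u] - mi) ** 2 for u in common)
        * sum((rj[u] - mj) ** 2 for u in common)
    )
    return num / den if den else 0.0


def predict(ratings, user, item, k=5, scale=(1.0, 5.0)):
    """Estimate the rating of `user` for `item` (item must exist in `ratings`)."""
    means = item_means(ratings)
    neighbours = []
    for other, users in ratings.items():
        if other != item and user in users:
            sim = pearson(ratings[item], users)
            if sim > 0:  # assumption: only positively correlated items vote
                neighbours.append((sim, users[user] - means[other]))
    top = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    if not top:
        return means[item]  # no usable neighbours: fall back to the item mean
    estimate = means[item] + sum(s * d for s, d in top) / sum(s for s, _ in top)
    return min(scale[1], max(scale[0], estimate))


# Hypothetical toy ratings: item -> {user: rating}.
toy = {
    "A": {"u1": 5, "u2": 3, "u3": 4},
    "B": {"u1": 4, "u2": 2, "u3": 3, "u4": 4},
    "C": {"u1": 1, "u2": 5, "u4": 2},
}

estimate = predict(toy, "u4", "A")
print(estimate)
```

In the toy data, item B moves in step with item A while item C moves against it, so only B contributes to the estimate for user u4.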

In [None]:
len(testset)

In [None]:
# Evaluate on test set
test_pred = algo.test(testset)

# compute RMSE
accuracy.rmse(test_pred)
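
RMSE is the square root of the mean squared difference between the true rating and the estimated rating, so larger errors are penalised more heavily. A minimal hand-rolled version, with a hypothetical helper name and made-up example values, looks like this:

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (true, estimate) pair is required")
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


# Made-up example: errors of 1, 0 and -2 rating points.
example = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0)]
score = rmse(example)
print(score)
```

The same function can score a popularity model by pairing each test rating with that product's mean rating in the training data, which puts both models on the same metric.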

In [None]:
# View all predictions
test_pred[:]

In [None]:
# convert results to dataframe
test_pred_df = pd.DataFrame(test_pred)
test_pred_df["was_impossible"] = [x["was_impossible"] for x in test_pred_df["details"]]

In [None]:
test_pred_df.loc[test_pred_df.was_impossible]

In [None]:
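# Make a prediction for a single user-item pair.
# uid and iid are raw IDs and must match the userId/productId strings in the data;
# "1" is not a userId in this dataset, so a real ID from user_counts is used instead.
algo.predict(uid="A5JLAU2ARJ0BO",iid="B000YMJ6ZE")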

In [None]:
# top n recommendations
testset_new = trainset.build_anti_testset()

In [None]:
len(testset_new)

In [None]:
testset_new[:10]

In [None]:
predictions = algo.test(testset_new[:10])

In [None]:
predictions_df = pd.DataFrame([[x.uid,x.iid,x.est] for x in predictions])

In [None]:
predictions_df.columns = ["user","product","rating"]
predictions_df.sort_values(by = ["user", "rating"],ascending=False,inplace=True)

In [None]:
predictions_df

In [None]:
top_10_recos = predictions_df.groupby("user").head(10).reset_index(drop=True)
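
Step 7 of the task asks for K = 5, and the `[:10]` slice above scores only the first ten user-product pairs, so the table holds very few candidates per user. The sketch below shows the top-K selection on its own, assuming a predictions frame with the same `user`, `product`, and `rating` columns; the helper name `top_k_per_user` and the toy values are hypothetical.

```python
import pandas as pd


def top_k_per_user(predictions, k=5):
    """Return the k highest-rated predicted products for every user."""
    ranked = predictions.sort_values(["user", "rating"], ascending=[True, False])
    return ranked.groupby("user").head(k).reset_index(drop=True)


# Hypothetical predicted ratings for products the users have not rated yet.
toy = pd.DataFrame({
    "user": ["u1"] * 7 + ["u2"] * 3,
    "product": ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p1", "p3", "p5"],
    "rating": [3.1, 4.9, 2.0, 4.5, 1.2, 3.8, 4.0, 2.5, 4.2, 3.3],
})

top5 = top_k_per_user(toy, k=5)
print(top5)
```

Users with fewer than K candidates simply get a shorter list, as user u2 does here.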

In [None]:
# Top 10 recommendations for each user

In [None]:
# SVD Based Recommendation

In [None]:
from surprise import Dataset,Reader
reader = Reader(rating_scale=(1, 5))
# prod_rating_filtered is not defined in this notebook; use the filtered subset_df instead
data = Dataset.load_from_df(subset_df[['userId', 'productId', 'ratings']], reader)

In [None]:
# Split data to train and test
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=123)

In [None]:
from surprise import SVD
from surprise import accuracy

In [None]:
svd_model = SVD(n_factors=4,biased=False)
svd_model.fit(trainset)
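
With `biased=False`, the model predicts a rating as the dot product of a user factor vector and an item factor vector. The following pure-Python sketch trains such factors with stochastic gradient descent on a toy dataset. It illustrates the general technique only, not Surprise's code: the function names, the hyperparameter values, and the toy ratings are assumptions.

```python
import random
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_mf(ratings, n_factors=4, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Learn user and item factors from (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random initial factors.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - dot(p[u], q[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on squared error with L2 regularisation.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def train_rmse(ratings, p, q):
    """RMSE of the factor model on the given triples."""
    se = sum((r - dot(p[u], q[i])) ** 2 for u, i, r in ratings)
    return sqrt(se / len(ratings))


# Hypothetical toy ratings.
toy = [
    ("u1", "p1", 5.0), ("u1", "p2", 3.0), ("u1", "p3", 4.0),
    ("u2", "p1", 4.0), ("u2", "p2", 2.0),
    ("u3", "p2", 5.0), ("u3", "p3", 1.0),
    ("u4", "p1", 3.0), ("u4", "p3", 4.0),
]

p0, q0 = train_mf(toy, n_epochs=0)  # untrained factors as a baseline
p, q = train_mf(toy)
initial_rmse = train_rmse(toy, p0, q0)
trained_rmse = train_rmse(toy, p, q)
print(initial_rmse, trained_rmse)
```

The error here is measured on the training triples only, which shows that the factors fit the data; generalisation still has to be judged on a held-out test set as in the notebook.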

In [None]:
test_pred = svd_model.test(testset)

In [None]:
test_pred_df = pd.DataFrame([[x.uid,x.iid,x.est] for x in test_pred])

In [None]:
test_pred_df.head()

In [None]:
test_pred_df.columns = ["user","item","rating"]
test_pred_df.sort_values(by = ["user", "rating"],ascending=False,inplace=True)

In [None]:
test_pred_df.head()

In [None]:
# note: top_10_recos still holds the KNN-based recommendations built earlier, not SVD output
top_10_recos.head(10)

In [None]:
# compute RMSE
accuracy.rmse(test_pred)