# <font style = "color:rgb(139,0,0)">RECOMMENDATION SYSTEMS PROJECT</font>

### <font style = "color:rgb(139,0,0)">Data Description</font>

  • **`author`**  : name of the person who gave the rating
  • **`country`** : country the person who gave the rating belongs to
  • **`date`**    : date of the rating
  • **`domain`**  : website from which the rating was taken
  • **`extract`** : rating content
  • **`language`**: language in which the rating was given
  • **`product`** : name of the product/mobile phone for which the rating was given
  • **`score`**   : average rating for the phone
  • **`score_max`**: highest rating given for the phone
  • **`source`**   : source from where the rating was taken

### <font style = "color:rgb(139,0,0)">Domain</font>

Smartphone, Electronics

### <font style = "color:rgb(139,0,0)">Context</font>

India is the second largest market globally for smartphones after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in the average time spent on mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and the average smartphone consumer's behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product, compared with 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour or choices in the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer's behaviour or choices.


### <font style = "color:rgb(139,0,0)">Project Objective</font>
We will build a recommendation system using popularity-based and collaborative filtering methods to recommend mobile phones to a user: the most popular phones and personalised suggestions, respectively.

## <font style = "color:rgb(184,134,11)">Steps and tasks</font>

### <font style = "color:rgb(184,134,11)">Step 1 :- Importing necessary Libraries & the Dataset</font>

In [135]:
#necessary imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

#for recommendation system
from surprise import SVD, Dataset, Reader, KNNWithMeans
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy
import scipy.sparse
from scipy.sparse.linalg import svds

# To encode categorical variables
from sklearn.preprocessing import LabelEncoder

# for merging
import os
import glob
import warnings; warnings.simplefilter('ignore')
%matplotlib inline
%matplotlib notebook

### <font style = "color:rgb(184,134,11)">Step 2 :- Exploratory Data Analysis and Data Cleaning</font>

In [136]:
# Importing the data sets and merging them into one dataframe.
# `path` is the folder that contains all the CSV files.
path = r"C:\Users\DELL\Desktop\phone"

# Build the glob pattern from the folder; joining two absolute paths would discard the first one.
files = glob.glob(os.path.join(path, "*.csv"))

df1 = []
for f in files:
    df2 = pd.read_csv(f, sep=',', encoding='latin1')
    df1.append(df2)
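
The loading cell passes a folder and a second, absolute pattern to `os.path.join`. As a minimal sketch of why only the second one takes effect, the snippet below uses `ntpath` (the Windows flavour of `os.path`, importable on any platform) so that the behaviour can be checked outside Windows too. The variable names are illustrative only.

```python
import ntpath  # Windows path rules, usable on any operating system

folder = "C:/Users/DELL/Desktop/Electronics"
absolute_pattern = r"C:\Users\DELL\Desktop\phone\*.csv"

# An absolute second argument replaces everything that came before it.
joined = ntpath.join(folder, absolute_pattern)
print(joined)

# Joining a folder with a bare pattern keeps the folder.
pattern = ntpath.join(r"C:\Users\DELL\Desktop\phone", "*.csv")
print(pattern)
```

Both calls end up with the same pattern, which is why the files were read from the `phone` folder regardless of the first argument.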
    


In [137]:
df3 = pd.concat(df1, ignore_index=True, sort=True)

# dropping irrelevant columns and rows as we only need 'author','product' and 'score' attributes.
df = df3.drop(['phone_url','date','lang','country','source','domain','extract','score_max'], axis=1)

In [138]:
# viewing first 5 rows of the imported dataset
df.head()

Unnamed: 0,author,product,score
0,CarolAnn35,Samsung Galaxy S8,10.0
1,james0923,Samsung Galaxy S8,10.0
2,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl...",6.0
3,Buster2020,Samsung Galaxy S8 64GB (AT&T),9.2
4,S Ate Mine,Samsung Galaxy S8,4.0


In [139]:
# viewing the last 5 rows of the imported dataset
df.tail()

Unnamed: 0,author,product,score
1415128,david.paul,Alcatel Club Plus Handy,2.0
1415129,Christiane14,Alcatel Club Plus Handy,10.0
1415130,michaelawr,Alcatel Club Plus Handy,2.0
1415131,claudia0815,Alcatel Club Plus Handy,8.0
1415132,michaelawr,Alcatel Club Plus Handy,2.0


Now that we have successfully read the dataset and dropped the irrelevant columns, let's explore the data.

In [140]:
# viewing the number of rows and columns (shape of the dataset)
rows, columns = df.shape
print("No of rows: ", rows) 
print("No of columns: ", columns)

No of rows:  1415133
No of columns:  3


In [141]:
#Checking the Data types
df.dtypes

author      object
product     object
score      float64
dtype: object

## Now there are no missing values present.

## Rounding off 'score' attribute column to nearest integer.

In [147]:
df.score = df.score.round()
df

Unnamed: 0,author,product,score
0,CarolAnn35,Samsung Galaxy S8,10.0
1,james0923,Samsung Galaxy S8,10.0
2,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl...",6.0
3,Buster2020,Samsung Galaxy S8 64GB (AT&T),9.0
4,S Ate Mine,Samsung Galaxy S8,4.0
...,...,...,...
1415127,anjuli,Alcatel Club Plus Handy,2.0
1415128,david.paul,Alcatel Club Plus Handy,2.0
1415129,Christiane14,Alcatel Club Plus Handy,10.0
1415130,michaelawr,Alcatel Club Plus Handy,2.0


Now every value in the 'score' column is rounded to the nearest integer.
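
One detail worth keeping in mind: `Series.round` relies on NumPy, which sends exact halves to the nearest even integer, and Python's built-in `round` does the same. The sketch below compares that with a round-half-up helper; `round_half_up` is a hypothetical helper written for this illustration, not part of the notebook.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x):
    """Round to the nearest integer, sending exact halves upwards."""
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# value, half-to-even result, half-up result
for value in [9.2, 8.5, 9.5, 0.5]:
    print(value, round(value), round_half_up(value))
```

For a score such as 9.2 both rules agree; they only differ on values that end exactly in .5.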

In [148]:
# Checking for missing values present in the dataset
print('Number of missing values in the dataset are :-\n', df.isnull().sum())

Number of missing values in the dataset are :-
 author     0
product    0
score      0
dtype: int64


In [149]:
# Checking for duplicate values present in the dataset
print('Number of duplicate values in the dataset are :-\n', df.duplicated().sum())

Number of duplicate values in the dataset are :-
 0


In [150]:
# dropping rows with duplicate values
df.drop_duplicates(keep = 'first' , inplace = True)

In [151]:
# dropping rows with missing values before limiting the dataset to 1000000 records
df.dropna(how='any', inplace=True)

In [152]:
# Checking for missing values again
print('Number of missing values in the dataset are-\n', df.isnull().sum())

Number of missing values in the dataset are-
 author     0
product    0
score      0
dtype: int64


In [153]:
# Checking for duplicate values again
print('Number of duplicate values in the dataset are-\n', df.duplicated().sum())

Number of duplicate values in the dataset are-
 0


In [154]:
# Checking the number of rows and columns again(shape of the data)
rows, columns = df.shape
print("No of rows: ", rows) 
print("No of columns: ", columns) 

No of rows:  1163104
No of columns:  3


In [155]:
# finding minimum and maximum score 

def find_min_max_score():
    print('The minimum score is: %d' %(df['score'].min()))
    print('The maximum score is: %d' %(df['score'].max()))
    
find_min_max_score() 

The minimum score is: 0
The maximum score is: 10


We can see here that the score ranges from 0 to 10.

In [156]:
#let's see how many distinct authors have given a score.
len(df["author"].unique())

778979

There are 778979 unique authors but 1163104 rows, which means that some authors have given scores to multiple products.
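
As a quick worked check of that observation, the arithmetic below uses the row count from the shape output above (1163104) and the number of unique authors. The variable names are chosen for this sketch.

```python
rows = 1163104           # rows after cleaning, from the shape printed above
unique_authors = 778979  # distinct values in the 'author' column

ratings_per_author = rows / unique_authors
extra_ratings = rows - unique_authors  # ratings beyond one per distinct author name
print(round(ratings_per_author, 2), extra_ratings)
```

So on average each author name accounts for roughly one and a half ratings.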

In [157]:
#let's see how many different products are there.
len(df["product"].unique())

55274

There are also 55274 unique products; the remaining rows are repeated ratings of those same products.

In [158]:
#unique scores given.
df["score"].unique()

array([10.,  6.,  9.,  4.,  8.,  2.,  7.,  5.,  3.,  1.,  0.])

In [159]:
# Summary statistics of 'score' variable
df[['score']].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
score,1163104.0,8.018545,2.597187,0.0,7.0,9.0,10.0,10.0


In [160]:
#Limiting Dataset to 1000000 records
df = df.sample(n=1000000,random_state=612)

In [161]:
#checking number of records in Dataset
df.shape

(1000000, 3)

In [162]:
#checking updated DataFrame for records
df.head()

Unnamed: 0,author,product,score
865670,B. McCarthy,"Kyocera Torque, Black 4GB (Sprint)",6.0
710715,CMariaG,Samsung Smartphone Samsung Galaxy Gran Duos De...,10.0
87204,kevkyle,LG V10,9.0
19955,soki,Samsung G935 Galaxy S7 Edge Smartphone da 32GB...,10.0
833744,sylvain,Samsung Galaxy Note Smartphone HSPA/EDGE/GPRS ...,10.0


### <font style = "color:rgb(184,134,11)">Step 3 :- Creating a Popularity Based Recommendation System</font>

## Identifying the users with the most ratings

In [163]:
#Here we calculate the number of ratings given by each user
rating_by_user = df.groupby(by='author')['score'].count().sort_values(ascending=False)
rating_by_user.head()

author
Amazon Customer    14465
Cliente Amazon      5496
Client d'Amazon     2341
Amazon Kunde        1745
Anonymous           1631
Name: score, dtype: int64

## Identifying the most rated products

In [164]:
#Here we calculate the number of times the Product has been rated
rating_by_product = df.groupby(by='product')['score'].count().sort_values(ascending=False)
rating_by_product.head(5)

product
OnePlus 3 (Graphite, 64 GB)          1915
Lenovo Vibe K4 Note (White,16GB)     1839
Lenovo Vibe K4 Note (Black, 16GB)    1578
Samsung Galaxy J3 (8GB)              1563
OnePlus 3 (Soft Gold, 64 GB)         1555
Name: score, dtype: int64

## Here we are selecting the ratings from users who have given at least 50 ratings.

In [165]:
#Here we keep only the rows whose author has at least 50 ratings; the dataframe still has only 3 attributes i.e. Author, Product and Score
rating_count = df['author'].value_counts()
top_products = df[df['author'].isin(rating_count[rating_count >= 50].index)]
top_products.head()

Unnamed: 0,author,product,score
746232,M.,BlackBerry Q10 SQN100-1 16GB 4G LTE Unlocked G...,6.0
1389526,Anonymous,Sharp TM150,10.0
669850,Steve,BLU Q170T Samba TV Unlocked Dual SIM Quad-Band...,8.0
343482,Andre,"Samsung Galaxy S5 G900A AT&T GSM Cellphone, 16...",10.0
173541,Amazon Customer,"Huawei P8 Lite ALE-L21 16GB Gold, Dual Sim, 5-...",2.0


## Building a Popularity based model and recommending top 5 mobile phones

In [166]:
#Here we count the number of ratings each product received and use that count as its popularity score.
cut_off = df.groupby('product').agg({'author': 'count'}).reset_index()
cut_off.rename(columns = {'author': 'score'},inplace=True)
cut_off.head()

Unnamed: 0,product,score
0,"'Smartphone Meizu Pro 5, 5,7 pouces avec Exyno...",1
1,'Sony Xperia X (F5122) â White â Dual Sim ...,1
2,'Sony Xperia X (F5122) â rosa â Dual Sim (...,1
3,"(7.62 cm (3 )Afficheur/Ã©cran, 2 MPixCamÃ©ra;b...",1
4,(DG300 Versione Aggiornata)5'' DOOGEE VOYAGER2...,41


In [167]:
#Here we rank the products by their number of ratings (ties are broken alphabetically by product name).
sorted_out = cut_off.sort_values(['score', 'product'], ascending=[False, True])
sorted_out['Rank'] = sorted_out['score'].rank(ascending=False, method='first')
# The top 5 rows are the popularity-based recommendations
popularity_recommendation = sorted_out.head()
popularity_recommendation

Unnamed: 0,product,score,Rank
30671,"OnePlus 3 (Graphite, 64 GB)",1915,1.0
21112,"Lenovo Vibe K4 Note (White,16GB)",1839,2.0
21111,"Lenovo Vibe K4 Note (Black, 16GB)",1578,3.0
36304,Samsung Galaxy J3 (8GB),1563,4.0
30672,"OnePlus 3 (Soft Gold, 64 GB)",1555,5.0
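
The popularity model above boils down to counting ratings per product, sorting, and numbering the result. A dependency-free sketch of the same idea on a few hypothetical rating events (the user and phone names are made up) could look like this:

```python
from collections import Counter

# Hypothetical (author, product) rating events
ratings = [
    ("u1", "Phone A"), ("u2", "Phone A"), ("u3", "Phone A"),
    ("u1", "Phone B"), ("u2", "Phone B"),
    ("u4", "Phone C"), ("u5", "Phone C"),
    ("u6", "Phone D"),
]

# Popularity score = number of ratings per product
counts = Counter(product for _, product in ratings)

# Sort by count (descending), then by product name (ascending) to break ties
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Rank 1 is the most rated product; tied counts get distinct ranks in sort order
ranking = [(rank, product, count) for rank, (product, count) in enumerate(ordered, start=1)]
top_n = ranking[:3]
print(top_n)
```

Because the ranking ignores who is asking, every user receives the same list, which is what the `Recommend` calls further down show.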


In [168]:
# Creating a function that takes the 'User id' of an 'Author' as input and gives back 'Recommendations' as output.

def Recommend(userID):
    # Work on a copy so the shared popularity_recommendation table is not modified.
    user_recommendation = popularity_recommendation.copy()
    user_recommendation['author'] = userID  # Adding userID column for which the recommendations are being generated.
    col = user_recommendation.columns.tolist()  # Bringing userID column to the front.
    col = col[-1:] + col[:-1]
    user_recommendation = user_recommendation[col]

    return user_recommendation

In [169]:
#Testing Recommendation on random Authors
Recommend(100)

Unnamed: 0,author,product,score,Rank
30671,100,"OnePlus 3 (Graphite, 64 GB)",1915,1.0
21112,100,"Lenovo Vibe K4 Note (White,16GB)",1839,2.0
21111,100,"Lenovo Vibe K4 Note (Black, 16GB)",1578,3.0
36304,100,Samsung Galaxy J3 (8GB),1563,4.0
30672,100,"OnePlus 3 (Soft Gold, 64 GB)",1555,5.0


In [170]:
Recommend(1000)

Unnamed: 0,author,product,score,Rank
30671,1000,"OnePlus 3 (Graphite, 64 GB)",1915,1.0
21112,1000,"Lenovo Vibe K4 Note (White,16GB)",1839,2.0
21111,1000,"Lenovo Vibe K4 Note (Black, 16GB)",1578,3.0
36304,1000,Samsung Galaxy J3 (8GB),1563,4.0
30672,1000,"OnePlus 3 (Soft Gold, 64 GB)",1555,5.0


## Collaborative filtering Model using SVD

Building a collaborative filtering model using SVD. 

In [171]:
# Shape of the dataframe
df.shape

(1000000, 3)

In [172]:
#Limiting dataset to 5000 records and sampling these records randomly
df = df.sample(n=5000,random_state=612)
df.shape #checking dimensions of the dataframe again

(5000, 3)

In [173]:
#Calculating the number of ratings each product has received.
rating_per_product = df.groupby(by='product')['score'].count().sort_values(ascending=False)

In [174]:
# Dropping features other than Author, Product and Score
df.drop(df.columns.difference(['author','product','score']),inplace=True,axis=1)

In [175]:
#Checking for unique users in the sampled dataset.
df['author'].nunique()

4717

In [176]:
#Checking for unique products in the sampled dataset.
df['product'].nunique()

3940

In [177]:
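#initiating one LabelEncoder per column so that each mapping can be inverted later
le_author = LabelEncoder()
le_product = LabelEncoder()
#Encoding 'author' and 'product' as unique integers for ease.
df['author'] = le_author.fit_transform(df['author'])
df['product'] = le_product.fit_transform(df['product'])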

In [178]:
#checking encoding
df.head()

Unnamed: 0,author,product,score
636764,461,404,4.0
862691,3347,866,10.0
465184,2294,2919,4.0
968849,3673,894,2.0
129553,3917,2657,10.0
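
To make the encoding step concrete, here is a small sketch of what label encoding does: each distinct value is mapped to an integer in sorted order, and decoding needs the mapping that was fitted on that particular column. `fit_encoder` is a hypothetical stand-in written for this illustration, not the scikit-learn implementation.

```python
def fit_encoder(values):
    """Map each distinct value to an integer, in sorted order."""
    classes = sorted(set(values))
    to_code = {value: code for code, value in enumerate(classes)}
    return classes, to_code

authors = ["soki", "kevkyle", "soki", "CMariaG"]
products = ["LG V10", "Sharp TM150", "LG V10", "LG V10"]

# One mapping per column
author_classes, author_codes = fit_encoder(authors)
product_classes, product_codes = fit_encoder(products)

encoded_authors = [author_codes[a] for a in authors]
encoded_products = [product_codes[p] for p in products]

# Decoding uses the class list that belongs to the same column
decoded_authors = [author_classes[c] for c in encoded_authors]
print(encoded_authors, encoded_products, decoded_authors)
```

If a single encoder is refitted on a second column, the first mapping is lost, so the integer author ids could no longer be turned back into names.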


In [179]:
# Applying Pivot 'Product Vs Users' (rows are products, columns are authors) with Ratings as values.
# Missing ratings are filled with 0; note that 0 is also a valid score in this dataset.
df.set_index('product',inplace=True)
pivot_data = df.pivot_table(index='product',columns='author',values='score',aggfunc='mean').fillna(0)
pivot_data

product \ author,0,1,2,3,4,5,6,7,8,9,...,4707,4708,4709,4710,4711,4712,4713,4714,4715,4716
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3937,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
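
The pivot table above is almost entirely zeros. A rough worked check, using the sample size and the unique counts reported in the earlier cells, shows how sparse it is. The figure is an upper bound, since repeated author and product pairs are averaged into one cell.

```python
n_ratings = 5000   # rows sampled for the SVD step
n_products = 3940  # unique products in the sample
n_authors = 4717   # unique authors in the sample

cells = n_products * n_authors
density = n_ratings / cells  # upper bound on the share of filled cells
print(cells, f"{density:.4%}")
```

With fewer than three filled cells in ten thousand, there is very little overlap between users for a factorisation to learn from.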


Applying SVD to the pivoted data, limiting the number of singular values and vectors to 50, and assigning the results to U, sigma and Vt.

In [180]:
U, sigma, Vt = svds(pivot_data, k = 50) 

In [181]:
# Inspecting U
U

array([[-4.11756114e-21,  4.41543824e-23, -4.47749707e-18, ...,
        -1.18336931e-18, -1.94969575e-18, -4.11149531e-19],
       [ 1.02291631e-18, -1.02834187e-20,  8.16613582e-19, ...,
         5.09319079e-19,  4.98704556e-18, -1.34352316e-18],
       [-1.13198536e-21,  9.39892950e-24,  3.43253527e-18, ...,
         1.76728159e-18,  1.56433145e-19, -2.88794405e-18],
       ...,
       [-1.19739747e-20,  1.32601338e-19, -5.99377977e-19, ...,
         7.71879281e-19,  4.22863724e-18, -2.95494448e-18],
       [-2.74914407e-20,  6.37072611e-21,  7.90375781e-19, ...,
         1.67262208e-18,  2.31937268e-17, -1.43779723e-17],
       [ 6.28461599e-21, -6.35895387e-23,  2.60948834e-18, ...,
         2.21425795e-18,  2.85903150e-18,  5.47110149e-19]])

In [182]:
# Inspecting sigma
sigma

array([20.        , 20.        , 20.        , 20.        , 20.        ,
       20.        , 20.        , 20.08944121, 20.22374842, 20.22374842,
       20.23944342, 20.32240143, 20.32240143, 20.59126028, 20.68816087,
       20.74313293, 21.04165761, 21.28379665, 21.34638658, 21.47091055,
       21.54065923, 21.54065923, 21.67948339, 21.9089023 , 22.36067977,
       22.36067977, 22.54952208, 22.55673137, 22.97825059, 23.15167381,
       23.53578663, 24.67792536, 25.21904043, 25.3968502 , 25.47152182,
       25.61249695, 25.76819745, 25.7812022 , 25.84569597, 26.45751311,
       26.55183609, 27.27636339, 27.48201398, 28.05907415, 28.14249456,
       29.01895008, 29.06888371, 29.39387691, 46.15096519, 66.67807299])

In [183]:
# Inspecting Vt
Vt

array([[ 6.67678911e-21, -7.21960284e-21,  1.06896731e-21, ...,
        -3.03045227e-23, -3.09830746e-21, -4.98950487e-21],
       [-9.94854637e-23,  7.48081464e-23, -1.21350455e-23, ...,
         2.46940605e-26,  4.51010270e-23,  5.54663061e-23],
       [ 9.44823787e-19, -3.07844486e-20, -1.07431332e-18, ...,
         1.11792816e-18, -4.08798081e-19,  2.57413656e-18],
       ...,
       [ 1.54046759e-18, -1.69229360e-18,  3.96010418e-20, ...,
         3.84842954e-20, -4.72362775e-19, -1.17014795e-18],
       [ 1.32993104e-18, -1.24900642e-18, -7.52392560e-20, ...,
         4.05225070e-22,  8.81832255e-19, -1.06119156e-18],
       [ 1.00982988e-19, -1.71196168e-19,  6.28542381e-20, ...,
        -2.75476957e-20,  5.00766129e-19, -1.71451780e-19]])

In [184]:
# Turning the 1-D array of singular values into a diagonal matrix
sigma = np.diag(sigma)

In [185]:
# Multiplying U, sigma and Vt back together to get the predicted ratings
predictions_ratings = np.dot(np.dot(U, sigma), Vt) 

In [186]:
# Creating new DataFrame for Predictions
df_predict = pd.DataFrame(predictions_ratings, columns = pivot_data.columns)

In [187]:
df_predict.head()

author,0,1,2,3,4,5,6,7,8,9,...,4707,4708,4709,4710,4711,4712,4713,4714,4715,4716
0,-2.054914e-33,2.570151e-33,1.1576159999999999e-34,3.2878719999999997e-34,3.812367e-34,-4.793875e-34,-4.695994e-34,-1.232528e-34,1.055014e-33,-1.005412e-17,...,1.1366440000000001e-33,1.608428e-33,2.924663e-35,1.649036e-33,6.156768e-34,-1.49128e-33,-1.984722e-33,-7.014097e-35,-1.074923e-33,1.522006e-33
1,-1.2568030000000002e-33,1.964473e-33,-1.61092e-34,2.9975560000000003e-33,2.701372e-34,-3.9536709999999995e-34,3.1859999999999997e-34,-3.66054e-34,6.660114e-34,9.368030000000001e-17,...,-2.441914e-31,1.2283650000000001e-33,3.476966e-34,1.3136180000000001e-31,4.0251389999999995e-34,-1.296586e-32,-2.379669e-32,1.87597e-35,-7.377819000000001e-33,1.634667e-33
2,-7.776122e-34,1.252946e-33,-1.005908e-34,9.087942e-34,6.193247e-35,-2.115716e-34,-3.795859e-33,-3.549005e-34,3.175654e-34,-1.1862130000000002e-17,...,2.169266e-32,4.091245e-34,2.740647e-35,1.808003e-32,1.9773159999999997e-34,6.385396e-33,9.808713000000001e-33,2.718291e-34,-2.620516e-33,9.983884999999999e-34
3,-4.6338580000000006e-33,5.962125e-33,5.010227e-35,2.382767e-34,6.413053e-34,-9.733839e-34,-1.149801e-33,-8.498589e-34,2.235542e-33,-2.6577790000000002e-17,...,1.50623e-32,3.19865e-33,-9.644923e-35,3.683954e-33,1.180748e-33,-3.415395e-33,-5.915033e-33,6.900394999999999e-35,-2.550497e-33,4.137473e-33
4,1.855982e-33,-2.4736520000000002e-33,4.712436e-35,1.782836e-34,-1.976308e-34,3.652283e-34,5.387395e-34,5.088808e-34,-8.663879e-34,1.027355e-17,...,-8.60589e-33,-1.1802270000000001e-33,6.99898e-35,-2.870668e-33,-4.185459e-34,1.900303e-33,3.6180920000000005e-33,-9.578891e-35,1.1438160000000001e-33,-1.850582e-33
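
The cells above multiply U, sigma and Vt back together to obtain a low-rank approximation of the rating matrix. As a minimal sketch of that idea, the snippet below computes only the largest singular triplet of a tiny, hypothetical matrix with power iteration in plain Python (a swapped-in technique, not what `svds` does internally) and rebuilds the rank-1 approximation, i.e. the k = 1 case of the same product.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_triplet(A, iterations=200):
    """Largest singular value and vectors of A via power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return u, sigma, v

# Toy product x author matrix, 0 = no rating (hypothetical values)
A = [
    [10.0, 8.0, 0.0],
    [9.0, 0.0, 2.0],
    [0.0, 7.0, 1.0],
]

u, sigma, v = top_singular_triplet(A)

# Rank-1 reconstruction: sigma * u * v^T
approx = [[sigma * ui * vj for vj in v] for ui in u]
for row in approx:
    print([round(x, 2) for x in row])
```

The cells that were 0 in `A` receive non-zero values in `approx`; those filled-in values are what the notebook treats as predicted ratings.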


Now we create a function that takes a User ID and the number of recommendations as input and prints the recommendations as output.


In [189]:
def recommend_items(UserID, num_recommendations):

    # pivot_data and df_predict have products as rows and authors as columns,
    # so .iloc[UserID] selects a row by position in that product-indexed table
    # and the index of the result holds the author ids.
    sorted_user_ratings = pivot_data.iloc[UserID].sort_values(ascending=False)
    sorted_user_predictions = df_predict.iloc[UserID].sort_values(ascending=False)

    temp = pd.concat([sorted_user_ratings, sorted_user_predictions], axis=1)
    temp.index.name = 'Recommended Items'
    temp.columns = ['user_ratings', 'user_predictions']

    # Keep only the entries without an existing rating and rank them by predicted rating
    temp = temp.loc[temp.user_ratings == 0]
    temp = temp.sort_values('user_predictions', ascending=False)
    print('\nRecommended Products for user(UserID = {}):\n'.format(UserID))
    print(temp.head(num_recommendations))

## Testing Recommendations on various users with varying inputs

In [190]:
recommend_items(486,5)


Recommended Products for user(UserID = 486):

                   user_ratings  user_predictions
Recommended Items                                
619                         0.0      1.132171e-17
1951                        0.0      5.431120e-18
224                         0.0      4.324425e-18
3961                        0.0      4.066611e-18
2629                        0.0      4.066611e-18


In [191]:
recommend_items(832,4)


Recommended Products for user(UserID = 832):

                   user_ratings  user_predictions
Recommended Items                                
858                         0.0      3.520503e-16
573                         0.0      3.520503e-16
2279                        0.0      3.520503e-16
3693                        0.0      3.520503e-16


In [192]:
recommend_items(669,7)


Recommended Products for user(UserID = 669):

                   user_ratings  user_predictions
Recommended Items                                
619                         0.0      2.913487e-16
1951                        0.0      1.272000e-16
224                         0.0      1.139220e-16
3792                        0.0      1.087699e-16
2973                        0.0      1.087699e-16
2686                        0.0      1.087699e-16
2546                        0.0      1.087699e-16


## SVD 

In [193]:
df3.head()

Unnamed: 0,author,country,date,domain,extract,lang,phone_url,product,score,score_max,source
0,CarolAnn35,us,5/2/2017,verizonwireless.com,As a diehard Samsung fan who has had every Sam...,en,/cellphones/samsung-galaxy-s8/,Samsung Galaxy S8,10.0,10.0,Verizon Wireless
1,james0923,us,4/28/2017,phonearena.com,Love the phone. the phone is sleek and smooth ...,en,/cellphones/samsung-galaxy-s8/,Samsung Galaxy S8,10.0,10.0,Phone Arena
2,R. Craig,us,5/4/2017,amazon.com,Adequate feel. Nice heft. Processor's still sl...,en,/cellphones/samsung-galaxy-s8/,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl...",6.0,10.0,Amazon
3,Buster2020,us,5/2/2017,samsung.com,Never disappointed. One of the reasons I've be...,en,/cellphones/samsung-galaxy-s8/,Samsung Galaxy S8 64GB (AT&T),9.2,10.0,Samsung
4,S Ate Mine,us,5/11/2017,verizonwireless.com,I've now found that i'm in a group of people t...,en,/cellphones/samsung-galaxy-s8/,Samsung Galaxy S8,4.0,10.0,Verizon Wireless


In [194]:
# Dropping features other than Author, Product and Score
df3.drop(df3.columns.difference(['author','product','score']),inplace=True,axis=1)

In [195]:
df3.head()

Unnamed: 0,author,product,score
0,CarolAnn35,Samsung Galaxy S8,10.0
1,james0923,Samsung Galaxy S8,10.0
2,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl...",6.0
3,Buster2020,Samsung Galaxy S8 64GB (AT&T),9.2
4,S Ate Mine,Samsung Galaxy S8,4.0


In [196]:
# Identifying the most rated products
df3.groupby('product')['score'].count().reset_index().sort_values('score', ascending=False)[:10]

Unnamed: 0,product,score
23698,"Lenovo Vibe K4 Note (White,16GB)",5226
23697,"Lenovo Vibe K4 Note (Black, 16GB)",4390
34801,"OnePlus 3 (Graphite, 64 GB)",4103
34802,"OnePlus 3 (Soft Gold, 64 GB)",3563
17134,Huawei P8lite zwart / 16 GB,2707
23701,"Lenovo Vibe K5 (Gold, VoLTE update)",2534
44664,Samsung Galaxy S6 zwart / 32 GB,2345
23703,"Lenovo Vibe K5 (Grey, VoLTE update)",2108
30658,Nokia 5800 XpressMusic,2070
23662,"Lenovo Used Lenovo Zuk Z1 (Space Grey, 64GB)",1952


In [197]:
# Identifying the users with most reviews

df3.groupby('author')['score'].count().reset_index().sort_values('score', ascending=False)[:10]

Unnamed: 0,author,score
30408,Amazon Customer,76978
97098,Cliente Amazon,19304
576241,e-bit,8663
97069,Client d'Amazon,7716
30847,Amazon Kunde,4750
41621,Anonymous,2746
578127,einer Kundin,2610
578124,einem Kunden,1898
749930,unknown,1738
41622,Anonymous,1461


In [198]:
# Keeping only products with more than 50 ratings and users who have given more than 50 ratings
min_ratings = 50
filter_products = df3['product'].value_counts() > min_ratings
filter_products = filter_products[filter_products].index.tolist()

min_user_ratings = 50
filter_users = df3['author'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df3[(df3['product'].isin(filter_products)) & (df3['author'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df3.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(1415133, 3)
The new data frame shape:	(175577, 3)
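
The filter above counts ratings per product and per author on the full table and then keeps rows that pass both thresholds. A small stdlib sketch of the same logic, on hypothetical rows and with a threshold of 1 instead of 50, is:

```python
from collections import Counter

rows = [  # hypothetical (author, product, score) rows
    ("ann", "Phone A", 9), ("ann", "Phone B", 7), ("ann", "Phone C", 8),
    ("bob", "Phone A", 6), ("bob", "Phone B", 10),
    ("cat", "Phone A", 4),
]
min_ratings = 1  # keep products/authors with MORE than this many ratings

# Counts are taken on the full data, before any row is removed
product_counts = Counter(product for _, product, _ in rows)
author_counts = Counter(author for author, _, _ in rows)

kept = [
    (author, product, score)
    for author, product, score in rows
    if product_counts[product] > min_ratings and author_counts[author] > min_ratings
]
print(kept)
```

Since the counts are computed before filtering, an author or product in the filtered table can end up with fewer ratings than the threshold once the other filter has removed some of its rows.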


In [199]:
# Loading Dataset for Surprise and assigning Rating on a scale of 1-10.
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(df_new[['author', 'product', 'score']], reader)

### Building a collaborative filtering model using kNNWithMeans from surprise. 

In [200]:
# Comparing SVD and KNNWithMeans (k=20 neighbours)
# Using 5-fold cross validation to get a more reliable estimate of the error

benchmark = []
# Iterating over all the algorithms
for algorithm in [SVD(), KNNWithMeans(k=20)]:
    # Performing cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)

    # Averaging the results over the folds and appending the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    name = str(algorithm).split(' ')[0].split('.')[-1]
    tmp = pd.concat([tmp, pd.Series([name], index=['Algorithm'])])
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
SVD,,14.192789,0.527698
KNNWithMeans,,2.694651,10.454016


### Evaluating the collaborative model and Printing RMSE value.

In [201]:
trainset, testset = train_test_split(data, test_size=0.25)
algo_svd = SVD()
predictions = algo_svd.fit(trainset).test(testset)
accuracy.rmse(predictions)

RMSE: nan


nan
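
An RMSE of `nan` is what the formula returns when at least one true rating is missing. Here `df3` is the raw concatenation, and the earlier `dropna` was applied to `df`, not `df3`, so missing scores may still be present in `df_new`. That is an assumption about the data, not something the outputs confirm. The sketch below, on hypothetical values, shows how a single missing rating propagates and how filtering it out restores a usable number.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))

pairs = [(10.0, 9.0), (4.0, 6.0), (float("nan"), 8.0)]  # hypothetical values

# One missing actual rating is enough to turn the whole metric into nan
with_nan = rmse(pairs)

# Dropping pairs with a missing actual rating gives a finite RMSE
clean = [(a, p) for a, p in pairs if not math.isnan(a)]
without_nan = rmse(clean)
print(with_nan, round(without_nan, 3))
```

If that is indeed the cause, dropping rows with a missing `score` from `df_new` before calling `Dataset.load_from_df` would be the corresponding fix. The `Reader` scale of 1 to 10 also deserves a second look, since the minimum score found earlier is 0.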

### Predicting average rating for test users.

In [212]:
test_author = 'Amazon Customer'
test_product = 'OnePlus 3 (Graphite, 64 GB)'

In [213]:
# Predicting the score for the chosen test user and product
svd_test = algo_svd.predict(test_author,test_product)

In [214]:
# Applying Knn with Means(k=30)
trainset, testset = train_test_split(data, test_size=0.25)
algo_knn = KNNWithMeans(k=30)
predictions = algo_knn.fit(trainset).test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: nan


nan

In [215]:
# Predicting the score for the chosen test user and product
knn_test = algo_knn.predict(test_author,test_product)

### Recommending top 5 products for test users.

In [207]:
#  Function to return number of items rated by given user 

def get_Iu(uid):
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError:
        return 0
       
# Function to return number of users that have rated given item

def get_Ui(iid):
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
        
# Getting the 5 best predictions, i.e. those with the smallest absolute error between actual and predicted rating
df5 = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df5['Iu'] = df5.uid.apply(get_Iu)
df5['Ui'] = df5.iid.apply(get_Ui)
df5['err'] = abs(df5.est - df5.rui)
best_predictions = df5.sort_values(by='err')[:5]

In [208]:
best_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
23000,davide,"Asus ZenFone Max Smartphone, Schermo da 5.5"" H...",10.0,10.0,"{'actual_k': 29, 'was_impossible': False}",84,87,0.0
18257,Ð®ÑÐ¸Ð¹,Sony Xperia Z1 (ÑÐµÑÐ½ÑÐ¹),10.0,10.0,"{'actual_k': 0, 'was_impossible': False}",103,133,0.0
9693,Simone,"Samsung Galaxy J5 Smartphone, Bianco [Italia]",10.0,10.0,"{'actual_k': 30, 'was_impossible': False}",166,44,0.0
18256,Stefano,"Honor 6+ Smartphone, 4G LTE, Dual SIM, Display...",10.0,10.0,"{'actual_k': 19, 'was_impossible': False}",200,23,0.0
18255,Client d'Amazon,Microsoft Lumia 435 Smartphone dÃ©bloquÃ© 3G+ ...,10.0,10.0,"{'actual_k': 13, 'was_impossible': False}",3512,13,0.0


In [209]:
# Verifying output from Function get_Ui
get_Ui('Smartphone LG K10 K430TV')

168

In [210]:
# Verifying output from Function get_Iu
get_Iu('Amazon Customer')


46452

### In what business scenario should you use Popularity Based Recommendation system?

We should use a popularity-based recommendation system in rating businesses such as Rotten Tomatoes or IMDb, because it does not suffer from the cold-start problem: even a business that is just starting out can use it, since it needs no past or historical data about the individual user, and it can be applied at various stages of filtering.

### In what business scenario should you use CF based Recommendation Systems?

We can use CF-based recommendation systems for businesses that deal with large amounts of data about users and products, such as Flipkart or Jio Store. Either user-based or item-based recommendation can be used.