• PROJECT OBJECTIVE: We will build a recommendation system using popularity-based and collaborative filtering methods to recommend mobile phones to a user: the most popular phones and personalised picks, respectively.

• DOMAIN: Smartphone, Electronics

• CONTEXT: India is the second largest market globally for smartphones after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to rise to about 442 million in 2022. India ranked second in the average time spent on mobile web by
smartphone users across Asia Pacific. The combination of very high sales volumes and average smartphone consumer behaviour has
made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they
are buying a product vs. 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour/choice at the
right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system
based on an individual consumer's behaviour or choice.

1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps.

• Merge the provided CSVs into one data-frame.

• Check a few observations and shape of the data-frame.

• Round off scores to the nearest integers.

• Check for missing values. Impute the missing values if there is any.

• Check for duplicate values and remove them if there is any.

• Keep only 1000000 data samples. Use random state=612.

• Drop irrelevant features. Keep features like Author, Product, and Score

In [87]:
import pandas as pd
import numpy as np
from surprise import KNNWithMeans

In [88]:
df1=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_1.csv",encoding='latin1')
df2=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_2.csv",encoding='latin1')
df3=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_3.csv",encoding='latin1')
df4=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_4.csv",encoding='latin1')
df5=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_5.csv",encoding='latin1')
df6=pd.read_csv("D:/ML/Recommended_system/Dataset/phone_user_review_file_6.csv",encoding='latin1')

In [3]:
print(df1.shape)
print(df2.shape)
print(df3.shape)
print(df4.shape)
print(df5.shape)
print(df6.shape)

(374910, 11)
(114925, 11)
(312961, 11)
(98284, 11)
(350216, 11)
(163837, 11)


In [4]:
df = pd.concat([df1, df2, df3, df4, df5, df6], ignore_index=True)

In [5]:
df.shape

(1415133, 11)
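The six `read_csv` calls and the `pd.concat` above can also be written as a loop. This is a minimal sketch, assuming all files share the same columns; the helper name `load_reviews` and the tiny demo files are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_reviews(paths, encoding="latin1"):
    """Read each CSV and stack them into one frame with a fresh 0..n-1 index."""
    frames = [pd.read_csv(p, encoding=encoding) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on two tiny stand-in files (made-up data, not the real review dumps).
demo_dir = Path(tempfile.mkdtemp())
for i, rows in enumerate([["a,1", "b,2"], ["c,3"]], start=1):
    (demo_dir / f"part_{i}.csv").write_text("author,score\n" + "\n".join(rows) + "\n")

demo = load_reviews(sorted(demo_dir.glob("part_*.csv")))
print(demo.shape)
```

With the real data, `paths` would be the six `phone_user_review_file_*.csv` files read above.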

In [6]:
df.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1415133 entries, 0 to 1415132
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 118.8+ MB


In [8]:
duplicate = df[df.duplicated(keep = 'first')] 

In [9]:
duplicate.shape

(6412, 11)

In [10]:
df.drop_duplicates(subset=None, keep='first', inplace=True,ignore_index=False)

In [11]:
df.isnull().mean() * 100

phone_url    0.000000
date         0.000000
lang         0.000000
country      0.000000
source       0.000000
domain       0.000000
score        4.478743
score_max    4.478743
extract      1.349735
author       4.388165
product      0.000071
dtype: float64

In [12]:
df.dropna(inplace=True)

In [13]:
df.shape

(1271451, 11)
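The cell above drops every row that has any missing value. One possible alternative, sketched here on made-up data, is to drop only the rows that lack a field the rating models need (`author`, `product`, `score`) and to fill a missing review text with an empty string. Treating `extract` as non-essential is an assumption of this sketch, not something the notebook states.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "author": ["ann", None, "cy", "dee"],
    "product": ["P1", "P2", "P3", "P4"],
    "score": [8.0, 6.0, np.nan, 10.0],
    "extract": ["good", "ok", "bad", None],
})

# Rows without author, product or score cannot feed a rating model: drop them.
cleaned = demo.dropna(subset=["author", "product", "score"]).copy()
# A missing review text does not affect rating-based models: fill with "".
cleaned["extract"] = cleaned["extract"].fillna("")

print(cleaned.shape)
```

On the real frame this would keep the rows that are missing only `extract`, which `df.dropna()` discards.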

In [14]:
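# np.floor always rounds down (e.g. 9.6 -> 9.0); for rounding to the nearest
# integer, as the task asks, df['score'].round() would be the usual call.
df['score'] = df['score'].apply(np.floor)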

In [15]:
df.score

0          10.0
1          10.0
2           6.0
3           9.0
4           4.0
           ... 
1415128     2.0
1415129    10.0
1415130     2.0
1415131     8.0
1415132     2.0
Name: score, Length: 1271451, dtype: float64
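The cell above uses `np.floor`, which always rounds down, whereas the task asks for the nearest integer. The short comparison below, on made-up scores, shows how the two differ and how ties are handled; note that `Series.round()` follows NumPy and rounds exact halves to the even neighbour.

```python
import numpy as np
import pandas as pd

scores = pd.Series([9.2, 9.6, 4.5])

floored = np.floor(scores)        # always rounds down
rounded = scores.round()          # nearest integer, ties go to the even neighbour
half_up = np.floor(scores + 0.5)  # nearest integer, ties go up

print(floored.tolist(), rounded.tolist(), half_up.tolist())
```

Which variant is appropriate depends on how the scores are meant to be binned; the notebook's later results were produced with the floored values.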

In [16]:
df_1mill=df.sample(n=1000000,replace=False,random_state=612)

In [17]:
df_1mill

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
609806,/cellphones/nokia-lumia-1020/,11/20/2013,en,gb,Amazon,amazon.co.uk,10.0,10.0,"How did I live without it, excellent in every ...",jimperkins110,Lumia Nokia Lumia 1020 UK Sim Free Windows Sma...
1372024,/cellphones/motorola-v557/,1/19/2006,en,us,Amazon,amazon.com,2.0,10.0,I've had this phone for a little more than a y...,Penelope Brown,Motorola V557 Unlocked Quadband Cell Phone
1110810,/cellphones/sanyo-scp-3810/,12/14/2009,en,us,Phone Scoop,phonescoop.com,4.0,10.0,I won't give it a rating so far. I just acquir...,JohnnyPasta,SCP-3810 / Mirro
532252,/cellphones/apple-iphone-5s/,8/26/2014,en,us,Amazon,amazon.com,10.0,10.0,"my in-laws really love the gifts iph5s, thank you",Mitchell Coleman,Apple iPhone 5s 16GB 4G LTE GSM Gold - AT&T Wi...
1334423,/cellphones/motorola-l6/,1/28/2009,es,ec,MercadoLibre,opinion.mercadolibre.com.ec,6.0,10.0,ESTA BUENO BONITO Y BARATO,JESSROSA,Motorola L6
...,...,...,...,...,...,...,...,...,...,...,...
419842,/cellphones/bq-aquaris-e4-5/,1/24/2015,es,es,Amazon,amazon.es,10.0,10.0,Buen procesador. Va rápido y tiene una baterí...,Etelop,BQ Aquaris E4.5 - Smartphone libre Android (pa...
915112,/cellphones/lg-p690/,5/2/2012,ru,ua,Price.ua,price.ua,10.0,10.0,"Смарт очень продуманный, ...",ipopov-sky,LG P690 Optimus Link
1263546,/cellphones/sony-ericsson-w300i/,11/12/2009,es,ar,MercadoLibre,opinion.mercadolibre.com.ar,10.0,10.0,no tengo mucho decir les quiero decir que teng...,DINAMICA99,Sony Ericsson W300
301402,/cellphones/bq-aquaris-e5/,12/5/2014,es,es,Amazon,amazon.es,2.0,10.0,Estoy muy. Contento y satisfecho con este móv...,Cliente Amazon,BQ Aquaris E5 HD - Smartphone libre Android (p...


2. Answer the following questions

• Identify the most rated features.

• Identify the users with most number of reviews.

• Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final
dataset.

In [18]:
print("total unique author - ",len(df["author"].unique()))

total unique author -  770521


In [19]:
print("Author with highest number of reviews- ",(df["author"].value_counts().head(1)))

Author with highest number of reviews-  Amazon Customer    76933
Name: author, dtype: int64


In [20]:
print("total unique product - ",len(df["product"].unique()))

total unique product -  54872


In [21]:
print("Product with highest number of reviews- ",(df["product"].value_counts().head(1)))

Product with highest number of reviews-  Lenovo Vibe K4 Note (White,16GB)    5223
Name: product, dtype: int64


In [28]:
df_features=df_1mill[['lang','country','score','author','product']]

In [29]:
df_features

Unnamed: 0,lang,country,score,author,product
609806,en,gb,10.0,jimperkins110,Lumia Nokia Lumia 1020 UK Sim Free Windows Sma...
1372024,en,us,2.0,Penelope Brown,Motorola V557 Unlocked Quadband Cell Phone
1110810,en,us,4.0,JohnnyPasta,SCP-3810 / Mirro
532252,en,us,10.0,Mitchell Coleman,Apple iPhone 5s 16GB 4G LTE GSM Gold - AT&T Wi...
1334423,es,ec,6.0,JESSROSA,Motorola L6
...,...,...,...,...,...
419842,es,es,10.0,Etelop,BQ Aquaris E4.5 - Smartphone libre Android (pa...
915112,ru,ua,10.0,ipopov-sky,LG P690 Optimus Link
1263546,es,ar,10.0,DINAMICA99,Sony Ericsson W300
301402,es,es,2.0,Cliente Amazon,BQ Aquaris E5 HD - Smartphone libre Android (p...


In [30]:
# Note: >= 50 also keeps products with exactly 50 ratings; use > 50 for a strict "more than 50".
df_prod=df_features.groupby("product").filter(lambda x: len(x) >= 50)

In [31]:
df_prod

Unnamed: 0,lang,country,score,author,product
1334423,es,ec,6.0,JESSROSA,Motorola L6
1213429,en,us,2.0,Amazon Customer,Motorola Triumph Prepaid Android Phone (Virgin...
1087957,en,us,6.0,Gary D. Scott,Samsung SGH-A847 Rubgy 2 Rugged GSM Unlocked A...
230610,it,it,10.0,Lorise,"Microsoft Lumia 950 XL Smartphone, 5.7"", camer..."
508581,en,in,8.0,Vishal P.,"Lenovo Vibe K5 (Gold, VoLTE update)"
...,...,...,...,...,...
1052390,de,de,8.0,Chris,"Microsoft Nokia C1-01 Handy (4,6 cm (1,8 Zoll)..."
419842,es,es,10.0,Etelop,BQ Aquaris E4.5 - Smartphone libre Android (pa...
1263546,es,ar,10.0,DINAMICA99,Sony Ericsson W300
301402,es,es,2.0,Cliente Amazon,BQ Aquaris E5 HD - Smartphone libre Android (p...


In [32]:
# Note: >= 50 also keeps authors with exactly 50 ratings; this also overwrites the earlier df.
df=df_prod.groupby("author").filter(lambda x: len(x) >= 50)

In [33]:
df

Unnamed: 0,lang,country,score,author,product
1213429,en,us,2.0,Amazon Customer,Motorola Triumph Prepaid Android Phone (Virgin...
478771,en,in,6.0,Amazon Customer,Sony Xperia XA Dual (Graphite Black)
74417,en,in,10.0,Amazon Customer,"OnePlus 3 (Soft Gold, 64 GB)"
499887,en,in,10.0,Amazon Customer,"Lenovo PHAB Plus Tablet (6.8 inch, 32GB, Wi-Fi..."
1306428,ru,ua,4.0,Михаил,Nokia 2630
...,...,...,...,...,...
64934,it,it,10.0,Davide,"Huawei P9 Lite Smartphone, LTE, Display 5.2'' ..."
149289,en,in,10.0,Amazon Customer,"Honor 6X (Grey, 32GB)"
1052390,de,de,8.0,Chris,"Microsoft Nokia C1-01 Handy (4,6 cm (1,8 Zoll)..."
301402,es,es,2.0,Cliente Amazon,BQ Aquaris E5 HD - Smartphone libre Android (p...
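The two `groupby(...).filter(...)` cells above can also be expressed with `value_counts` and `map`, which may run faster on a frame of this size, and the task's "report the shape" step is then a single `print`. This is a sketch on toy data; the helper name `filter_min_ratings` is hypothetical, and it uses a strict `>` to match the "more than 50" wording, unlike the `>=` in the cells above.

```python
import pandas as pd


def filter_min_ratings(frame, min_ratings=50):
    """Keep rows whose product, then author, has more than min_ratings rows."""
    product_counts = frame["product"].value_counts()
    kept = frame[frame["product"].map(product_counts) > min_ratings]
    # Author counts are taken after the product filter, as in the notebook.
    author_counts = kept["author"].value_counts()
    return kept[kept["author"].map(author_counts) > min_ratings]


# Toy data with a threshold of 2 so the effect is visible.
toy = pd.DataFrame({
    "author":  ["u1", "u1", "u1", "u2", "u2", "u3"],
    "product": ["A",  "A",  "A",  "A",  "B",  "A"],
    "score":   [10,   8,    6,    4,    2,    10],
})
final = filter_min_ratings(toy, min_ratings=2)
print(final.shape)
```

Only product A has more than two ratings, and among its rows only `u1` has more than two, so three rows survive.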


3. Build a popularity based model and recommend top 5 mobile phones.

In [34]:
# Mean score and number of ratings per product
score_mean_count = df.groupby('product')['score'].agg(score='mean', score_counts='count')
# Popularity = number of ratings x mean score (i.e. the total score per product)
score_mean_count['rating'] = score_mean_count['score_counts'] * score_mean_count['score']
score_mean_count.sort_values(by='rating', ascending=False).head(5)

product,score,score_counts,rating
"Lenovo Vibe K4 Note (White,16GB)",6.944073,2396,16638.0
"Lenovo Vibe K4 Note (Black, 16GB)",7.012309,2031,14242.0
"OnePlus 3 (Graphite, 64 GB)",8.510345,1450,12340.0
"OnePlus 3 (Soft Gold, 64 GB)",8.294246,1373,11388.0
"Lenovo Vibe K5 (Gold, VoLTE update)",6.587768,1259,8294.0

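The `rating` column above is the number of ratings times the mean score, which is the total score a product has collected. The same ranking can be wrapped in a small reusable function; this is a sketch on toy data, and the name `top_n_popular` is hypothetical.

```python
import pandas as pd


def top_n_popular(frame, n=5):
    """Rank products by total score (number of ratings x mean score)."""
    stats = frame.groupby("product")["score"].agg(score="mean", score_counts="count")
    stats["rating"] = stats["score"] * stats["score_counts"]
    return stats.sort_values("rating", ascending=False).head(n)


toy = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B", "C"],
    "score":   [6.0, 8.0, 10.0, 10.0, 10.0, 10.0],
})
top = top_n_popular(toy, n=2)
print(top)
```

Product A ranks first here despite the lowest mean score, because this measure rewards volume; the same effect puts the two Lenovo Vibe K4 Note listings ahead of the higher-rated OnePlus 3 in the table above.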

4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: in case you're building it from scratch, you
can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using KNNWithMeans from surprise. You
can try both user-based and item-based models.

5. Evaluate the collaborative model. Print RMSE value.

In [105]:
from surprise import Dataset,Reader
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df[['author', 'product', 'score']], reader)

In [106]:
data

<surprise.dataset.DatasetAutoFolds at 0x2a451446648>

In [107]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=43)

In [108]:
type(trainset)

surprise.trainset.Trainset

In [109]:
user_records = trainset.ur
type(user_records)

collections.defaultdict

In [144]:
user_records

defaultdict(list,
            {0: [(0, 4.0),
              (2, 10.0),
              (4, 8.0),
              (8, 6.0),
              (9, 8.0),
              (10, 4.0),
              (12, 6.0),
              (14, 10.0),
              (4, 8.0),
              (18, 2.0),
              (21, 10.0),
              (25, 2.0),
              (26, 10.0),
              (27, 2.0),
              (29, 8.0),
              (30, 10.0),
              (33, 10.0),
              (34, 8.0),
              (4, 2.0),
              (38, 2.0),
              (41, 8.0),
              (43, 2.0),
              (45, 10.0),
              (46, 2.0),
              (48, 10.0),
              (50, 10.0),
              (51, 6.0),
              (57, 2.0),
              (59, 8.0),
              (60, 8.0),
              (62, 8.0),
              (68, 10.0),
              (70, 10.0),
              (72, 4.0),
              (74, 8.0),
              (78, 10.0),
              (4, 6.0),
              (80, 6.0),
              (81, 8.0),


In [112]:
print(trainset.to_raw_uid(10))
print(trainset.to_raw_iid(1000))

Maria
Apple iPhone 4S Verizon Cellphone, 16GB, White


Using the KNNWithMeans Algorithm

In [113]:
from surprise import KNNWithMeans
from surprise import accuracy
from surprise import Prediction

In [114]:
algo = KNNWithMeans(k=100, sim_options={'name': 'cosine', 'user_based': True})
algo.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x2a40465b688>
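For intuition, a user-based KNNWithMeans estimate starts from the target user's mean rating and adds a similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's own mean. The pure-Python sketch below illustrates that formula with hand-picked similarities; the function name and all numbers are made up, and it is not the library's implementation.

```python
def knn_with_means_estimate(user_mean, neighbours, k=100):
    """neighbours: list of (similarity, neighbour_rating, neighbour_mean)
    for users who rated the target item."""
    # Keep the k most similar neighbours with positive similarity.
    nearest = sorted((n for n in neighbours if n[0] > 0), reverse=True)[:k]
    weight = sum(sim for sim, _, _ in nearest)
    if weight == 0:
        return user_mean  # no usable neighbours: fall back to the user's mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in nearest)
    return user_mean + offset / weight


# Made-up numbers: the target user averages 7.0; two neighbours rated the phone.
estimate = knn_with_means_estimate(
    user_mean=7.0,
    neighbours=[(0.75, 10.0, 8.0), (0.25, 6.0, 8.0)],
)
print(estimate)
```

The more similar neighbour rated the phone two points above their own average, so the estimate moves above the target user's mean.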

In [115]:
len(testset)

25295

In [116]:
testset[5:10]

[('Mauro',
  'Asus ZenFone 2 Smartphone, Schermo da 5.5" Full HD, Processore Quad Core 2,3 GHz, RAM 4 GB, 32 GB, 4G/LTE, Argento',
  8.0),
 ('Amazon Customer', 'OnePlus 3 (Soft Gold, 64 GB)', 6.0),
 ('Cliente Amazon',
  'Samsung J320 Galaxy J3 (2016) Dual SIM 5", Quad Core, 8GB LTE, Nero [Italia]',
  8.0),
 ('Rahul', 'Samsung Guru GT-E1200 (Indigo Blue)', 4.0),
 ('e-bit', 'Smartphone Samsung Galaxy S4 Mini GT-I9192', 10.0)]

In [117]:
# Evaluate on test set
test_pred = algo.test(testset)

# compute RMSE
accuracy.rmse(test_pred)

RMSE: 2.7325


2.7325254981995597
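RMSE is the square root of the mean squared difference between the true rating and the estimate, so an RMSE of about 2.73 on a 1 to 10 scale means estimates are typically off by roughly that many points. A hand-rolled version on made-up pairs (the helper `rmse` is illustrative only) looks like this:

```python
import math


def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Three made-up (true, estimated) pairs with errors of 2, 0 and -2.
value = rmse([(10.0, 8.0), (6.0, 6.0), (2.0, 4.0)])
print(round(value, 4))
```

Applied to the notebook's predictions, the pairs would be `(p.r_ui, p.est)` for each `p` in `test_pred`.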

6. Predict score (average rating) for test users. 

In [121]:
test_pred[101:200]

[Prediction(uid='Amazon-Kunde', iid='Wiko Rainbow Smartphone (12,7 cm (5 Zoll) Display , 8 GB interner Speicher, Android 4.4.2 KitKat) neongelb', r_ui=10.0, est=9.365307406864968, details={'actual_k': 4, 'was_impossible': False}),
 Prediction(uid='e-bit', iid='Smartphone Motorola Moto G 3ª Geração XT1543 8GB', r_ui=8.0, est=7.906742117832665, details={'actual_k': 37, 'was_impossible': False}),
 Prediction(uid='einem Kunden', iid='Samsung Galaxy S7 edge Smartphone, 13,9 cm (5,5 Zoll) Display, LTE (4G)', r_ui=10.0, est=9.76, details={'actual_k': 100, 'was_impossible': False}),
 Prediction(uid='Cliente Amazon', iid='Samsung G920 Galaxy S6 Smartphone, 32 GB, Oro [Italia]', r_ui=10.0, est=8.10854946446826, details={'actual_k': 69, 'was_impossible': False}),
 Prediction(uid='Cliente Amazon', iid='Microsoft Telefonia Lumia 950 XL Smartphone, 32 GB, Bianco [Italia]', r_ui=10.0, est=7.369693139978514, details={'actual_k': 11, 'was_impossible': False}),
 Prediction(uid='Amazon Customer', iid=

Using the SVD Algorithm

In [126]:
from surprise import SVD
from surprise import accuracy

In [127]:
svd_model = SVD(n_factors=50)
svd_model.fit(trainset)
test_pred = svd_model.test(testset)


In [128]:
# compute RMSE
accuracy.rmse(test_pred)

RMSE: 2.7418


2.741804224420085
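The SVD model estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user's and the item's latent factor vectors; by default surprise then clips the estimate to the rating scale, which is why estimates never exceed 10. The sketch below shows that arithmetic with made-up biases and two-dimensional factors (the notebook's model uses 50 factors); the function names are hypothetical.

```python
def clip(value, low=1.0, high=10.0):
    """Clamp an estimate to the rating scale."""
    return max(low, min(high, value))


def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Baseline (mean + biases) plus the user-item factor interaction."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return clip(global_mean + user_bias + item_bias + interaction)


# Made-up parameters for one user-item pair.
estimate = svd_estimate(7.0, 0.5, -1.0, [0.5, 1.0], [1.0, 0.5])
# A case where the raw sum (12.5) exceeds the scale and is clipped.
clipped = svd_estimate(9.5, 1.0, 1.0, [1.0, 0.0], [1.0, 0.0])
print(estimate, clipped)
```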

In [129]:
test_pred_df = pd.DataFrame([[x.uid,x.iid,x.est] for x in test_pred])

8. Try and recommend top 5 products for test users.

In [130]:
test_pred_df.head()

Unnamed: 0,0,1,2
0,Andrew,Motorola Droid 3 Verizon Xt862 Verizon Cell Phone,7.918593
1,Павел,Samsung Galaxy J1,7.43444
2,Amazon Customer,"Lenovo Vibe K4 Note (Black, 16GB)",6.651575
3,Amazon Customer,"Lenovo Vibe K4 Note (White,16GB)",6.61303
4,einer Kundin,"Apple iPhone 7 Plus 5,5"" 32 GB",9.234653


In [136]:
test_pred_df.columns = ["AUTHOR","MOBILE_NAME","EST_RATING"]
test_pred_df.sort_values(by = ["EST_RATING"],ascending=False,inplace=True)

In [137]:
# Keep the 5 highest estimated ratings per test user (rows are already sorted by EST_RATING)
top_5_phones = test_pred_df.groupby("AUTHOR").head(5).reset_index(drop=True)

In [138]:
top_5_phones.head(30)

Unnamed: 0,AUTHOR,MOBILE_NAME,EST_RATING
0,Luca,Lenovo Motorola Moto G 4G 3 Generazione Smartp...,10.0
1,Michael,"Samsung E1200 Handy (3,9 cm (1,52 Zoll) Displa...",10.0
2,Amazon-Kunde,Microsoft Nokia Lumia 630 Single-SIM Smartphon...,10.0
3,Amazon-Kunde,"Samsung Galaxy Note 3 Smartphone (14,5 cm (5,7...",10.0
4,Amazon-Kunde,"Samsung Galaxy S4 Smartphone (12,7 cm (4,9 Zol...",10.0
5,Amazon-Kunde,"Motorola Moto G 3. Generation Smartphone (12,7...",10.0
6,Михаил,Apple iPhone 5s 16GB (серебристый),10.0
7,Александра,Samsung Galaxy S7 edge,10.0
8,Luca,Lenovo Motorola Moto G 4G 3 Generazione Smartp...,10.0
9,Francesco,"Meizu M2 Note Smartphone, 5.5"" Full HD, 4G, 13...",10.0
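To recommend per user rather than overall, the predictions can be grouped by author and each group cut to its highest estimates. The pure-Python sketch below does this on made-up tuples; the helper name `top_n_per_user` is hypothetical. It only ranks products that appear in the test set for that user, so it is a ranking of held-out items, not of the full catalogue.

```python
from collections import defaultdict


def top_n_per_user(predictions, n=5):
    """predictions: iterable of (author, product, estimated_rating) tuples."""
    per_user = defaultdict(list)
    for author, product, est in predictions:
        per_user[author].append((product, est))
    # Sort each user's products by estimated rating, best first, and keep n.
    return {
        author: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for author, items in per_user.items()
    }


preds = [
    ("u1", "Phone A", 9.1), ("u1", "Phone B", 7.4), ("u1", "Phone C", 8.2),
    ("u2", "Phone A", 6.0),
]
top = top_n_per_user(preds, n=2)
print(top["u1"])
```

With the notebook's data, `predictions` could be built as `(p.uid, p.iid, p.est)` for each `p` in `test_pred`.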


11. In what business scenario should you use popularity-based recommendation systems?

A popularity-based system is the easiest recommender to build: it simply recommends the products that are most popular overall.
Popular products can be identified, for example, as those that are bought (or rated) most often. Because it needs no history for the individual user, it suits new or anonymous users.
Business scenarios such as mobile phones, the automobile industry, and service-based industries all have strong scope for it.

12. In what business scenario should you use CF-based recommendation systems?

Collaborative filtering (CF) models are based on the assumption that people like things similar to other things they like,
and things that are liked by other people with similar taste. They suit scenarios where enough user ratings or interactions exist to compare users or items. There are two types of collaborative filtering model:

I. Nearest neighbor

II. Matrix factorization

Collaborative filtering is the most popular and widely used approach for recommender systems (RS); it estimates a user's interest in a target item from the views expressed by other like-minded users.

13. What other possible methods can you think of which can further improve the recommendation for different users?

a. Improve the accuracy of the recommender system by clustering items based on the stability of user similarity. Collaborative filtering, one of the most widely used approaches in recommender systems, predicts a user's rating for an item by aggregating the ratings given by users whose preferences are similar to that user's.