# Project Based

**DOMAIN**: Smartphone, Electronics


**CONTEXT**: India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to rise to about 442 million in 2022. India ranked second in Asia Pacific for average time spent on the mobile web by smartphone users. The combination of very high sales volumes and this consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when buying a product, versus 15% who turn to social media. If a seller succeeds in showing smartphones that match a user's behaviour or choices in the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on each individual consumer's behaviour or choices.


**PROJECT OBJECTIVE**: We will build a recommendation system using **popularity-based** and **collaborative filtering** methods, to recommend the most popular mobile phones and personalised picks to a user, respectively.

## Import the libraries

In [1]:
import numpy as np # mathematical manipulations
import pandas as pd # data manipulations

# pre-processing of data
from sklearn import preprocessing

# collection of data
from collections import defaultdict

# splitting into train and test sets and cross validations
from surprise.model_selection import train_test_split, cross_validate

# Recommender libraries
from surprise import SVD,Dataset, KNNWithMeans, accuracy, Reader

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# uncomment these to see whole data
# (full option names avoid ambiguous matches in newer pandas versions)
# pd.set_option('display.max_columns',None)
# pd.set_option('display.max_rows',None)

## Import and Warehouse the data

In [2]:
# importing data to a data frame from csv
user_review_1 = pd.read_csv('phone_user_review_file_1.csv')

In [3]:
# checking top five rows to see if data is imported
user_review_1.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [4]:
# analysing shape of the dataframe
user_review_1.shape

(374910, 11)

In [5]:
# seeing shape and data types of various features
user_review_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374910 entries, 0 to 374909
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  374910 non-null  object 
 1   date       374910 non-null  object 
 2   lang       374910 non-null  object 
 3   country    374910 non-null  object 
 4   source     374910 non-null  object 
 5   domain     374910 non-null  object 
 6   score      366691 non-null  float64
 7   score_max  366691 non-null  float64
 8   extract    371934 non-null  object 
 9   author     371641 non-null  object 
 10  product    374910 non-null  object 
dtypes: float64(2), object(9)
memory usage: 31.5+ MB


#### Observations:
- The dataframe has **374910 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values.
- All the data types of the features seem to be correct.

In [6]:
# importing data to a data frame from csv
user_review_2 = pd.read_csv('phone_user_review_file_2.csv')

In [7]:
# checking top five rows to see if data is imported
user_review_2.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/leagoo-lead-7/,4/15/2015,en,us,Amazon,amazon.com,2.0,10.0,"The telephone headset is of poor quality , not...",luis,Leagoo Lead7 5.0 Inch HD JDI LTPS Screen 3G Sm...
1,/cellphones/leagoo-lead-7/,5/23/2015,en,gb,Amazon,amazon.co.uk,10.0,10.0,This is my first smartphone so I have nothing ...,Mark Lavin,Leagoo Lead 7 Lead7 MTK6582 Quad core 1GB RAM ...
2,/cellphones/leagoo-lead-7/,4/27/2015,en,gb,Amazon,amazon.co.uk,8.0,10.0,Great phone. Battery life not great but seems ...,tracey,Leagoo Lead 7 Lead7 MTK6582 Quad core 1GB RAM ...
3,/cellphones/leagoo-lead-7/,4/22/2015,en,gb,Amazon,amazon.co.uk,10.0,10.0,Best 90 quid I've ever spent on a smart phone,Reuben Ingram,Leagoo Lead 7 Lead7 MTK6582 Quad core 1GB RAM ...
4,/cellphones/leagoo-lead-7/,4/18/2015,en,gb,Amazon,amazon.co.uk,10.0,10.0,I m happy with this phone.it s very good.thx team,viorel,Leagoo Lead 7 Lead7 MTK6582 Quad core 1GB RAM ...


In [8]:
# analysing shape of the dataframe
user_review_2.shape

(114925, 11)

In [9]:
# seeing shape and data types of various features
user_review_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114925 entries, 0 to 114924
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  114925 non-null  object 
 1   date       114925 non-null  object 
 2   lang       114925 non-null  object 
 3   country    114925 non-null  object 
 4   source     114925 non-null  object 
 5   domain     114925 non-null  object 
 6   score      112166 non-null  float64
 7   score_max  112166 non-null  float64
 8   extract    113965 non-null  object 
 9   author     113290 non-null  object 
 10  product    114925 non-null  object 
dtypes: float64(2), object(9)
memory usage: 9.6+ MB


#### Observations:
- The dataframe has **114925 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values.
- All the data types of the features seem to be correct.

In [10]:
# importing data to a data frame from csv
user_review_3 = pd.read_csv('phone_user_review_file_3.csv')

In [11]:
# checking top five rows to see if data is imported
user_review_3.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,11/7/2015,pt,br,Submarino,submarino.com.br,6.0,10.0,"recomendo, eu comprei um, a um ano, e agora co...",herlington tesch,Samsung Smartphone Samsung Galaxy S3 Slim G381...
1,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,10/2/2015,pt,br,Submarino,submarino.com.br,10.0,10.0,Comprei um pouco desconfiada do site e do celu...,Luisa Silva Marieta,Samsung Smartphone Samsung Galaxy S3 Slim G381...
2,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,9/2/2015,pt,br,Submarino,submarino.com.br,10.0,10.0,"Muito bom o produto, obvio que tem versões mel...",Cyrus,Samsung Smartphone Samsung Galaxy S3 Slim G381...
3,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,9/2/2015,pt,br,Submarino,submarino.com.br,8.0,10.0,Unica ressalva fica para a camera que poderia ...,Marcela Santa Clara Brito,Samsung Smartphone Samsung Galaxy S3 Slim G381...
4,/cellphones/samsung-galaxy-s-iii-slim-sm-g3812/,9/1/2015,pt,br,Colombo,colombo.com.br,8.0,10.0,Rapidez e atenção na entrega. O aparelho é mui...,Claudine Maria Kuhn Walendorff,"Smartphone Samsung Galaxy S3 Slim, Dual Chip, ..."


In [12]:
# analysing shape of the dataframe
user_review_3.shape

(312961, 11)

In [13]:
# seeing shape and data types of various features
user_review_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312961 entries, 0 to 312960
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  312961 non-null  object 
 1   date       312961 non-null  object 
 2   lang       312961 non-null  object 
 3   country    312961 non-null  object 
 4   source     312961 non-null  object 
 5   domain     312961 non-null  object 
 6   score      304933 non-null  float64
 7   score_max  304933 non-null  float64
 8   extract    310231 non-null  object 
 9   author     302173 non-null  object 
 10  product    312960 non-null  object 
dtypes: float64(2), object(9)
memory usage: 26.3+ MB


#### Observations:
- The dataframe has **312961 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values, and `product` has one.
- All the data types of the features seem to be correct.

In [14]:
# importing data to a data frame from csv
user_review_4 = pd.read_csv('phone_user_review_file_4.csv')

In [15]:
# checking top five rows to see if data is imported
user_review_4.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-s7262-duos-galaxy-ace/,3/11/2015,en,us,Amazon,amazon.com,2.0,10.0,was not conpatable with my phone as stated. I ...,Frances DeSimone,Samsung Galaxy Star Pro DUOS S7262 Unlocked Ce...
1,/cellphones/samsung-s7262-duos-galaxy-ace/,17/11/2015,en,in,Zopper,zopper.com,10.0,10.0,Decent Functions and Easy to Operate Pros:- Th...,Expert Review,Samsung Galaxy Star Pro S7262 Black
2,/cellphones/samsung-s7262-duos-galaxy-ace/,29/10/2015,en,in,Amazon,amazon.in,4.0,10.0,Not Good Phone such price. Hang too much and v...,Amazon Customer,Samsung Galaxy Star Pro GT-S7262 (Midnight Black)
3,/cellphones/samsung-s7262-duos-galaxy-ace/,29/10/2015,en,in,Amazon,amazon.in,6.0,10.0,not bad for features,Amazon Customer,Samsung Galaxy Star Pro GT-S7262 (Midnight Black)
4,/cellphones/samsung-s7262-duos-galaxy-ace/,29/10/2015,en,in,Amazon,amazon.in,10.0,10.0,Excellent product,NHK,Samsung Galaxy Star Pro GT-S7262 (Midnight Black)


In [16]:
# analysing shape of the dataframe
user_review_4.shape

(98284, 11)

In [17]:
# seeing shape and data types of various features
user_review_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98284 entries, 0 to 98283
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   phone_url  98284 non-null  object 
 1   date       98284 non-null  object 
 2   lang       98284 non-null  object 
 3   country    98284 non-null  object 
 4   source     98284 non-null  object 
 5   domain     98284 non-null  object 
 6   score      93706 non-null  float64
 7   score_max  93706 non-null  float64
 8   extract    96857 non-null  object 
 9   author     92696 non-null  object 
 10  product    98284 non-null  object 
dtypes: float64(2), object(9)
memory usage: 8.2+ MB


#### Observations:
- The dataframe has **98284 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values.
- All the data types of the features seem to be correct.

In [18]:
# importing data to a data frame from csv
user_review_5 = pd.read_csv('phone_user_review_file_5.csv')

In [19]:
# checking top five rows to see if data is imported
user_review_5.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/karbonn-k1616/,7/13/2016,en,in,91 Mobiles,91mobiles.com,2.0,10.0,I bought 1 month before. currently speaker is ...,venkatesh,Karbonn K1616
1,/cellphones/karbonn-k1616/,7/13/2016,en,in,91 Mobiles,91mobiles.com,6.0,10.0,"I just bought one week back, I have Airtel con...",Venkat,Karbonn K1616
2,/cellphones/karbonn-k1616/,7/13/2016,en,in,91 Mobiles,91mobiles.com,4.0,10.0,one problem in this handset opera is not worki...,krrish,Karbonn K1616
3,/cellphones/karbonn-k1616/,4/25/2014,en,in,Naaptol,naaptol.com,10.0,10.0,here Karbonn comes up with an another excellen...,BRIJESH CHAUHAN,Karbonn K1616 - Black
4,/cellphones/karbonn-k1616/,4/23/2013,en,in,Naaptol,naaptol.com,10.0,10.0,"What a phone, all so on Naaptol my god 23% off...",Suraj CHAUHAN,Karbonn K1616 - Black


In [20]:
# analysing shape of the dataframe
user_review_5.shape

(350216, 11)

In [21]:
# seeing shape and data types of various features
user_review_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350216 entries, 0 to 350215
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  350216 non-null  object 
 1   date       350216 non-null  object 
 2   lang       350216 non-null  object 
 3   country    350216 non-null  object 
 4   source     350216 non-null  object 
 5   domain     350216 non-null  object 
 6   score      321983 non-null  float64
 7   score_max  321983 non-null  float64
 8   extract    341836 non-null  object 
 9   author     321351 non-null  object 
 10  product    350216 non-null  object 
dtypes: float64(2), object(9)
memory usage: 29.4+ MB


#### Observations:
- The dataframe has **350216 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values.
- All the data types of the features seem to be correct.

In [22]:
# importing data to a data frame from csv
user_review_6 = pd.read_csv('phone_user_review_file_6.csv')

In [23]:
# checking top five rows to see if data is imported
user_review_6.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-instinct-sph-m800/,9/16/2011,en,us,Phone Arena,phonearena.com,8.0,10.0,I've had the phone for awhile and it's a prett...,ajabrams95,Samsung Instinct HD
1,/cellphones/samsung-instinct-sph-m800/,2/13/2014,en,us,Amazon,amazon.com,6.0,10.0,to be clear it is not the sellers fault that t...,Stephanie,Samsung SPH M800 Instinct
2,/cellphones/samsung-instinct-sph-m800/,12/30/2011,en,us,Phone Scoop,phonescoop.com,9.0,10.0,Well i love this phone. i have had ton of phon...,snickers,Instinct M800
3,/cellphones/samsung-instinct-sph-m800/,10/18/2008,en,us,HandCellPhone,handcellphone.com,4.0,10.0,I have had my Instinct for several months now ...,A4C,Samsung Instinct
4,/cellphones/samsung-instinct-sph-m800/,9/6/2008,en,us,Reviewed.com,reviewed.com,6.0,10.0,i have had this instinct phone for about two m...,betaBgood,Samsung Instinct


In [24]:
# analysing shape of the dataframe
user_review_6.shape

(163837, 11)

In [25]:
# seeing shape and data types of various features
user_review_6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163837 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   phone_url  163837 non-null  object 
 1   date       163837 non-null  object 
 2   lang       163837 non-null  object 
 3   country    163837 non-null  object 
 4   source     163837 non-null  object 
 5   domain     163837 non-null  object 
 6   score      152165 non-null  float64
 7   score_max  152165 non-null  float64
 8   extract    160949 non-null  object 
 9   author     150780 non-null  object 
 10  product    163837 non-null  object 
dtypes: float64(2), object(9)
memory usage: 13.7+ MB


#### Observations:
- The dataframe has **163837 data points** and **11 features**.
- The `score`, `score_max`, `extract` and `author` columns contain **null** values.
- All the data types of the features seem to be correct.

In [26]:
# concat all the individual data frames and ignoring the index
user_review = pd.concat([user_review_1,user_review_2,user_review_3,user_review_4,user_review_5,user_review_6],ignore_index=True)

In [27]:
# checking top five rows to see if data is concatenated
user_review.head()

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
0,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Verizon Wireless,verizonwireless.com,10.0,10.0,As a diehard Samsung fan who has had every Sam...,CarolAnn35,Samsung Galaxy S8
1,/cellphones/samsung-galaxy-s8/,4/28/2017,en,us,Phone Arena,phonearena.com,10.0,10.0,Love the phone. the phone is sleek and smooth ...,james0923,Samsung Galaxy S8
2,/cellphones/samsung-galaxy-s8/,5/4/2017,en,us,Amazon,amazon.com,6.0,10.0,Adequate feel. Nice heft. Processor's still sl...,R. Craig,"Samsung Galaxy S8 (64GB) G950U 5.8"" 4G LTE Unl..."
3,/cellphones/samsung-galaxy-s8/,5/2/2017,en,us,Samsung,samsung.com,9.2,10.0,Never disappointed. One of the reasons I've be...,Buster2020,Samsung Galaxy S8 64GB (AT&T)
4,/cellphones/samsung-galaxy-s8/,5/11/2017,en,us,Verizon Wireless,verizonwireless.com,4.0,10.0,I've now found that i'm in a group of people t...,S Ate Mine,Samsung Galaxy S8


In [28]:
# analysing shape of the dataframe
user_review.shape

(1415133, 11)

In [29]:
# describing the number columns for statistic values
user_review.describe()

Unnamed: 0,score,score_max
count,1351644.0,1351644.0
mean,8.00706,10.0
std,2.616121,0.0
min,0.2,10.0
25%,7.2,10.0
50%,9.2,10.0
75%,10.0,10.0
max,10.0,10.0


#### Observations:
- The dataframe has **1415133 data points** and **11 features**.
- The `score`, `score_max`, `extract`, `author` and `product` columns contain **null** values.
- All the data types of the features seem to be correct.
- We have to take care of the null values by pre-processing the data.
- All the data manipulations will be done on a copy of the dataframe, to avoid any loss of the original data.
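
The six near-identical load cells above could be collapsed into a loop. The following is a minimal sketch, assuming the files keep the `phone_user_review_file_<n>.csv` naming used above. The helper name `load_review_files` and its `data_dir` parameter are hypothetical, and the sketch writes two tiny stand-in files to a temporary directory so it can run without the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_review_files(data_dir, n_files):
    """Read phone_user_review_file_1..n_files and stack them into one frame."""
    frames = [
        pd.read_csv(Path(data_dir) / f"phone_user_review_file_{i}.csv")
        for i in range(1, n_files + 1)
    ]
    # ignore_index=True rebuilds a clean 0..N-1 index, as in the concat cell
    return pd.concat(frames, ignore_index=True)


# Stand-in data (made up) so the sketch runs without the real CSV files
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        toy = pd.DataFrame(
            {"author": [f"user_{i}"], "product": ["Phone A"], "score": [8.0]}
        )
        toy.to_csv(Path(tmp) / f"phone_user_review_file_{i}.csv", index=False)
    combined = load_review_files(tmp, n_files=2)

print(combined.shape)
```

With the real data, the call would be something like `load_review_files('.', n_files=6)`.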

In [30]:
# making a copy of dataframe
user_review_copy = user_review.copy()

In [31]:
# checking for the missing values
user_review_copy.isna().sum()

phone_url        0
date             0
lang             0
country          0
source           0
domain           0
score        63489
score_max    63489
extract      19361
author       63202
product          1
dtype: int64

#### Observations:
- There are **63489 null data points** in `score`, `score_max`.
- There are **19361 null data points** in `extract`.
- There are **63202 null data points** in `author`.
- There is **1 null data point** in `product`.
- We can handle these null values by:
    + `dropping` the rows with nulls in the non-number columns `extract`, `author` and `product`, as we have a huge number of data points.
    + imputing with `zero`, `median` or `mean` values for number columns like `score` and `score_max`.
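
The two strategies above can be tried on a small frame first. This is only a sketch on made-up values, not the notebook's data; it fills numeric gaps with the column median and then drops rows that still miss a text field.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "author": ["a", None, "c", "d"],
    "product": ["P1", "P2", "P3", "P4"],
    "score": [2.0, 6.0, np.nan, 8.0],
})

# Fill numeric gaps with the column median (plain assignment, no chained inplace call)
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Drop rows that still have a missing text field
toy = toy.dropna()

print(toy)
```

The row with the missing `score` is kept with the median value, while the row with the missing `author` is removed.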

In [32]:
# imputing null/NaN number values with the column median
# and changing the data type to int
# (astype(int) truncates towards zero; it does not round to the nearest integer)
for col in user_review_copy.columns:
    if user_review_copy[col].dtype != 'object': # selecting only number columns
        med = user_review_copy[col].median() # getting median of that feature
        user_review_copy[col] = user_review_copy[col].fillna(med) # filling null values with median
        user_review_copy[col] = user_review_copy[col].astype(int) # changing data type to int

In [33]:
# dropping non-number null values
user_review_copy.dropna(inplace=True)

In [34]:
# checking for the missing values
user_review_copy.isna().sum()

phone_url    0
date         0
lang         0
country      0
source       0
domain       0
score        0
score_max    0
extract      0
author       0
product      0
dtype: int64

In [35]:
# checking for duplicate values
user_review_copy.duplicated().any()

True

In [36]:
# dropping duplicate values
user_review_copy = user_review_copy.drop_duplicates()

In [37]:
# keeping only 1000000 samples with random_state as 612
user_review_copy=user_review_copy.sample(n=1000000,random_state=612)

In [38]:
# checking shape of the dataframe
user_review_copy.shape

(1000000, 11)

In [39]:
# keeping only relevant features like author, product, score
auth_prd_scr = user_review_copy[['author','product','score']]
auth_prd_scr.head()

Unnamed: 0,author,product,score
1005326,Paul B,Samsung i897 Captivate Android Smartphone Gala...,10
453603,Yuvraj,"Blu Win JR LTE (Grey, 4GB)",10
1010409,Pankaj Bhalla,"Lenovo P780 (Deep Black, 4GB)",10
866960,Bgrazina,Samsung Galaxy XCover 2,6
498651,Joyce D. Pratt,"BLU Vivo XL Smartphone - 5.5"" 4G LTE - GSM Unl...",2


In [40]:
# Top 10 products by number of reviews
auth_prd_scr['product'].value_counts().head(10)

Lenovo Vibe K4 Note (White,16GB)                3908
Lenovo Vibe K4 Note (Black, 16GB)               3234
OnePlus 3 (Graphite, 64 GB)                     3128
OnePlus 3 (Soft Gold, 64 GB)                    2643
Huawei P8lite zwart / 16 GB                     1994
Samsung Galaxy Express I8730                    1990
Lenovo Vibe K5 (Gold, VoLTE update)             1865
Samsung Galaxy S6 zwart / 32 GB                 1729
Lenovo Vibe K5 (Grey, VoLTE update)             1603
Lenovo Used Lenovo Zuk Z1 (Space Grey, 64GB)    1454
Name: product, dtype: int64

In [41]:
# Top 10 users with most number of reviews
auth_prd_scr['author'].value_counts().head(10)

Amazon Customer    57765
Cliente Amazon     14564
e-bit               6309
Client d'Amazon     5720
Amazon Kunde        3624
einer Kundin        1963
Anonymous           1939
einem Kunden        1433
unknown             1283
Anonymous           1087
Name: author, dtype: int64
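
Most of the top "authors" above look like storefront placeholders (`Amazon Customer`, `Anonymous`, `unknown`) rather than individual reviewers, so treating each as a single user may merge many different people. One possible way to set such rows aside is sketched below. The `GENERIC_AUTHORS` set is an assumption assembled from the output above and is not exhaustive, and the toy frame is made up.

```python
import pandas as pd

# Assumption: placeholder names copied from the value_counts output above
GENERIC_AUTHORS = {
    "Amazon Customer", "Cliente Amazon", "Client d'Amazon", "Amazon Kunde",
    "einer Kundin", "einem Kunden", "Anonymous", "unknown",
}

# Toy frame (made-up rows)
toy = pd.DataFrame({
    "author": ["Amazon Customer", "Paul B", " Anonymous", "Yuvraj"],
    "product": ["P1", "P2", "P3", "P4"],
    "score": [10, 8, 6, 4],
})

# Strip surrounding whitespace before matching, in case names differ only by spacing
is_generic = toy["author"].str.strip().isin(GENERIC_AUTHORS)
named_reviews = toy[~is_generic]

print(named_reviews["author"].tolist())
```

Whether to drop these rows or keep them for the popularity ranking only is a design choice that depends on how much data remains afterwards.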

In [42]:
# extracting products with more than 50 reviews
product_counts = auth_prd_scr['product'].value_counts()
products_with_more_than_50_reviews = product_counts[product_counts > 50].index
product_has_more_than_50_reviews = auth_prd_scr['product'].isin(products_with_more_than_50_reviews)

In [43]:
# extracting users with more than 50 reviews
author_counts = auth_prd_scr['author'].value_counts()
users_with_more_than_50_reviews = author_counts[author_counts > 50].index
user_gave_more_than_50_reviews = auth_prd_scr['author'].isin(users_with_more_than_50_reviews)

In [44]:
# extracting data with products and users having more than 50 reviews
# (both masks are aligned to auth_prd_scr, so combine them before indexing)
auth_prd_scr_more_than_50 = auth_prd_scr[product_has_more_than_50_reviews & user_gave_more_than_50_reviews]

In [45]:
# checking shape of the dataframe
auth_prd_scr_more_than_50.shape

(108983, 3)

#### Observations:
- There are **108983 data points** where the product has more than 50 reviews and the user has also given more than 50 reviews.
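
The same kind of filter can also be written with `groupby(...).transform('size')`, which attaches the per-product and per-author counts to every row. This is a sketch on made-up rows with a hypothetical helper name `filter_min_reviews`; as in the cells above, both counts are taken over the full frame.

```python
import pandas as pd


def filter_min_reviews(df, min_reviews):
    """Keep rows whose product AND author each appear more than min_reviews times."""
    product_counts = df.groupby("product")["score"].transform("size")
    author_counts = df.groupby("author")["score"].transform("size")
    return df[(product_counts > min_reviews) & (author_counts > min_reviews)]


# Toy frame (made-up rows): product A has 4 reviews, author u1 has 3
toy = pd.DataFrame({
    "author":  ["u1", "u1", "u1", "u2", "u2", "u3"],
    "product": ["A",  "A",  "B",  "A",  "B",  "A"],
    "score":   [10,   8,    6,    4,    2,    10],
})
dense = filter_min_reviews(toy, min_reviews=2)

print(dense)
```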

### Popularity Based Recommender System

In [46]:
def recommend_top_n_mobiles(n=5):
    '''
    Recommend top n mobiles, based on simple popularity based recommendation system.
    It does NOT take into account the number of people who have provided the rating.
    Default value of n=5
    '''
    return pd.DataFrame(auth_prd_scr.groupby(by=['product'])['score'].mean().sort_values(ascending=False).head(n))

In [47]:
recommend_top_n_mobiles(7)

Unnamed: 0_level_0,score
product,Unnamed: 1_level_1
Smartphone Sony Xperia E1 Desbloqueado Vivo Android 4.3 Tela 4 4GB 3G Wi-Fi Câmera 3MP - Branco,10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Branco Android 4.4.2 4G Câmera 16 MP Memória Interna 16 GB,10.0
Samsung Smartphone Samsung Galaxy S5 Duos Desbloqueado/ Dual Chip / Branco / 4G / 16 MP / Android 4.4,10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado/ Branco / 4G / 16 MP / Android 4.4.2 / 16 GB / USB 3.0,10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Vivo Preto Android 4.4.2 4G Câmera 16 MP Memória Interna 16GB,10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Tim Preto Android 4.4.2 4G Câmera 16 MP Memória Interna 16GB,10.0
Samsung Smartphone Samsung Galaxy S5 Desbloqueado Preto Android 4.4.2 4G Câmera 16 MP Memória Interna 16GB,10.0


In [48]:
score_mean_count = pd.DataFrame(auth_prd_scr.groupby(by=['product'])['score'].mean())

In [49]:
score_mean_count['score_count'] = auth_prd_scr.groupby(by=['product'])['score'].count()

In [50]:
def count_based_recommend_top_n_mobiles(n=5):
    '''
    Recommend top n mobiles, based on popularity based recommendation system.
    It takes into account the number of people who have provided the rating.
    Default value of n=5
    '''
    return score_mean_count.sort_values(by=['score','score_count'],ascending=[False,False]).head(n)

In [51]:
count_based_recommend_top_n_mobiles(7)

Unnamed: 0_level_0,score,score_count
product,Unnamed: 1_level_1,Unnamed: 2_level_1
Samsung Galaxy Note5,10.0,144
Nokia Smartphone Nokia Lumia 520 Desbloqueado Oi Preto Windows Phone 8 Câmera 5MP 3G Wi-Fi Memória Interna 8G GPS,10.0,132
Motorola Smartphone Motorola Moto X Desbloqueado Preto Android 4.2.2 Câmera 10MP e Frontal 2MP Memória Interna de 16GB GSM,10.0,131
Samsung Smartphone Galaxy Win Duos Branco Desbloqueado Dual Chip Câmera 5MP Processador Quad Core 1.2 Ghz Android 4.1 3G Wi- Fi e Memória 8GB,10.0,127
Motorola Smartphone Motorola Moto G Dual Chip Desbloqueado TIM Android 4.3 Tela 4.5 8GB 3G Wi-Fi Câmera 5MP - Preto,10.0,126
Samsung Smartphone Dual Chip Samsung Galaxy SIII Duos Desbloqueado Claro Azul Android 4.1 3G/Wi-Fi Câmera 5MP,10.0,123
Motorola Smartphone Motorola Novo Moto G DTV Colors Dual Chip XT 1069 Desbloqueado Android 4.4 Tela 5 16GB 3G Wi-Fi Câmera de 8MP - Preto,10.0,119
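
Sorting by mean score first and count second still ranks a product with a single 10 above one with many ratings averaging slightly below 10. An alternative, which is not what the cells above do, is a damped mean that shrinks each product's average towards the global mean by `m` pseudo-ratings: `(sum + m * global_mean) / (count + m)`. The helper name `damped_mean_ranking`, the default `m=10` and the toy data are all assumptions for illustration.

```python
import pandas as pd


def damped_mean_ranking(df, m=10, n=5):
    """Rank products by a mean shrunk towards the global mean by m pseudo-ratings."""
    global_mean = df["score"].mean()
    stats = df.groupby("product")["score"].agg(score_sum="sum", score_count="count")
    stats["damped_score"] = (
        (stats["score_sum"] + m * global_mean) / (stats["score_count"] + m)
    )
    return stats.sort_values("damped_score", ascending=False).head(n)


# Toy data (made up): one product with a single perfect score,
# one with many high scores, one with many middling scores
toy = pd.DataFrame({
    "product": ["One-hit"] + ["Well-reviewed"] * 20 + ["Mediocre"] * 10,
    "score": [10] + [10] * 19 + [9] + [5] * 10,
})
ranking = damped_mean_ranking(toy, m=10, n=2)

print(ranking)
```

In this toy case the raw mean would put `One-hit` first, while the damped score puts `Well-reviewed` first. A larger `m` demands more evidence before a product can rise to the top.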


### Collaborative Filtering using Singular Value Decomposition(SVD)

**NOTE**: Building the recommender system on the whole data set ran out of memory, so only 5000 samples are used here.

In [52]:
# getting 5k samples
auth_prd_scr_5k = user_review_copy.sample(n=5000,random_state=612)

# keeping only relevant features like author, product, score
auth_prd_scr_5k = auth_prd_scr_5k[['author','product','score']]

In [53]:
# supplying range of rating scale
reader = Reader(rating_scale=(1,10))

In [54]:
# load data in format of surprise SVD
data = Dataset.load_from_df(auth_prd_scr_5k,reader = reader)

In [55]:
# build trainset from data
trainset = data.build_full_trainset()

In [56]:
# prints user data with ratings in the format of
# {user_id:[(item_id,ratings)]}
trainset.ur

defaultdict(list,
            {0: [(0, 10.0)],
             1: [(1, 4.0)],
             2: [(2, 8.0)],
             3: [(3, 10.0)],
             4: [(4, 10.0)],
             5: [(5, 4.0)],
             6: [(6, 6.0)],
             7: [(7, 4.0)],
             8: [(8, 10.0)],
             9: [(9, 10.0)],
             10: [(10, 2.0),
              (15, 2.0),
              (33, 8.0),
              (36, 6.0),
              (43, 10.0),
              (45, 10.0),
              (51, 10.0),
              (65, 2.0),
              (67, 2.0),
              (89, 10.0),
              (61, 6.0),
              (95, 4.0),
              (101, 10.0),
              (109, 2.0),
              (113, 10.0),
              (115, 10.0),
              (117, 4.0),
              (146, 6.0),
              (147, 10.0),
              (177, 2.0),
              (185, 2.0),
              (204, 8.0),
              (215, 2.0),
              (216, 2.0),
              (65, 10.0),
              (275, 10.0),
              (185, 

In [57]:
algo = SVD() # initialise the SVD algorithm
algo.fit(trainset) # fit on train data

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1eaa9d7c1f0>

In [58]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()

In [59]:
# testset is in the form of [(user_id,item_id_not_rated,global_mean)]
testset

[('Bill W',
  'LG Electronics G4 Smartphone 14 cm (5,5 Zoll) (Touch-Display, 32 GB Speicher, Android 6) braune Lederversion',
  7.9896),
 ('Bill W', 'Motorola KRZR K1', 7.9896),
 ('Bill W', 'LG G2 Sprint LS980', 7.9896),
 ('Bill W', 'LG Spirit H420 Smartphone, Bianco [Italia]', 7.9896),
 ('Bill W',
  'Générique Ecran Vitre Tactile WIKO RAINBOW - Noir + outils - NEUF',
  7.9896),
 ('Bill W', 'Samsung Galaxy S4 GT-I9500 16GB (черный)', 7.9896),
 ('Bill W', 'Samsung Galaxy S5 SM-G900F 16GB (синий)', 7.9896),
 ('Bill W',
  'Sony Xperia Z5 Compact E5823 2GB/32GB 23MP 4.6-inch 4G LTE Factory Unlocked (WHITE) - International Stock No Warranty',
  7.9896),
 ('Bill W',
  'Samsung Galaxy Note 7 SM-N930F 64GB 4G Azul - Smartphone (Android, NanoSIM, GSM, HSPA, LTE, USB Type-C), color azul',
  7.9896),
 ('Bill W',
  'Samsung Galaxy S3 I747 16GB 4G LTE Unlocked GSM Android Smartphone - Marble White (International version, No Warranty)',
  7.9896),
 ('Bill W',
  'Microsoft Lumia 640 XL 8GB Unlocked G
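
To make the structure of this list concrete, here is a plain-Python imitation of an anti-testset on three made-up ratings. It is a sketch of the idea only, not the library's implementation. Every user is paired with every item they have not rated, and the global mean is used as the placeholder rating. The number of pairs grows roughly with users times items, which is likely one reason larger samples exhaust memory.

```python
def build_anti_testset(ratings):
    """Return (user, item, global_mean) for every user-item pair with no rating."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, global_mean) for u in users for i in items if (u, i) not in rated]


# Toy ratings (made up): (user, item, rating)
ratings = [
    ("ann", "Phone A", 10.0),
    ("ann", "Phone B", 6.0),
    ("bob", "Phone A", 8.0),
]
anti = build_anti_testset(ratings)

print(anti)
```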

In [60]:
# making predictions on test data
predictions = algo.test(testset)

In [61]:
# [(user_id,item_id,global_mean,predicted_rating,was_impossible)]
predictions

[Prediction(uid='Bill W', iid='LG Electronics G4 Smartphone 14 cm (5,5 Zoll) (Touch-Display, 32 GB Speicher, Android 6) braune Lederversion', r_ui=7.9896, est=7.934320998646421, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='Motorola KRZR K1', r_ui=7.9896, est=8.36129986961754, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='LG G2 Sprint LS980', r_ui=7.9896, est=8.44798527794245, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='LG Spirit H420 Smartphone, Bianco [Italia]', r_ui=7.9896, est=8.259973627509096, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='Générique Ecran Vitre Tactile WIKO RAINBOW - Noir + outils - NEUF', r_ui=7.9896, est=7.906385640234391, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='Samsung Galaxy S4 GT-I9500 16GB (черный)', r_ui=7.9896, est=8.182695477021419, details={'was_impossible': False}),
 Prediction(uid='Bill W', iid='Samsung Galaxy S5 SM-G900F 16GB (синий)', r_ui

In [62]:
def get_top_n(predictions, n=5):
    '''
    Return the top n recommendations for each user from a list of surprise
    predictions (from SVD, user-user or item-item collaborative filtering),
    ranked by estimated rating. Items the user has already rated are excluded
    only if the predictions were built from an anti-testset.
    By default gives the top 5 recommendations.
    '''
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [63]:
get_top_n(predictions,7)

defaultdict(list,
            {'Bill W': [('Samsung Galaxy S7 edge 32GB (Verizon)',
               8.969860245475427),
              ('Samsung Galaxy S7 32GB (T-Mobile)', 8.967496325517399),
              ('Lenovo Motorola Moto G Smartphone, 4,5 pollici display HD, processore Qualcomm, memoria 16GB, MicroSIM, Android 4.3 OS, fotocamera da 5 MP, Nero [Germania]',
               8.949367664191657),
              ('Huawei Honor 5X Unlocked Smartphone, 16GB Dark Grey (US Warranty)',
               8.935005040225537),
              ('Nokia E71', 8.887114149040647),
              ('Apple iPhone 5s 16GB (серебристый)', 8.870085605296001),
              ('Nokia T-Mobile Nokia Lumia 635 - No Contract Phone - White',
               8.857637579498245)],
             'Uwe J.': [('Nokia 5530 XpressMusic', 8.574412608263945),
              ('Samsung Galaxy S7 edge 32GB (Verizon)', 8.5183676685449),
              ('Samsung Galaxy S7 32GB (T-Mobile)', 8.4812769701692),
              ('HTC Desire C', 8
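
When each user has many candidate items, sorting the full list just to keep a handful is more work than needed. A possible variant using `heapq.nlargest` is sketched below. The name `get_top_n_heap` is hypothetical, and the `Prediction` namedtuple is a stand-in with the same field order as the predictions printed above, so the sketch runs without the library.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in for the prediction tuples shown above (same field order)
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def get_top_n_heap(predictions, n=5):
    """Group estimates by user and keep the n highest per user."""
    per_user = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in per_user.items()
    }


# Toy predictions (made-up values)
toy_predictions = [
    Prediction("ann", "Phone A", 8.0, 7.5, {}),
    Prediction("ann", "Phone B", 8.0, 9.1, {}),
    Prediction("ann", "Phone C", 8.0, 8.2, {}),
    Prediction("bob", "Phone A", 8.0, 6.4, {}),
]
top = get_top_n_heap(toy_predictions, n=2)

print(top)
```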

In [64]:
# Root Mean Square Error (RMSE) against the anti-testset
# note: r_ui here is the global-mean fill value, not a rating a user actually gave
print("SVD Model:")
accuracy.rmse(predictions, verbose=True)

SVD Model:
RMSE: 0.3397


0.3397376623994116

In [65]:
# 5-fold cross-validation for a more reliable error estimate
cross_validate(algo, data, measures=['rmse'], cv=5)

{'test_rmse': array([2.61486829, 2.69961639, 2.54512565, 2.58547366, 2.60690819]),
 'fit_time': (0.3711731433868408,
  0.22060632705688477,
  0.20150232315063477,
  0.2503774166107178,
  0.28049635887145996),
 'test_time': (0.012577056884765625,
  0.005304574966430664,
  0.004000663757324219,
  0.005524873733520508,
  0.013354301452636719)}

#### Observations:
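- Building a collaborative filter using `SVD` on a large sample requires a lot of computational power.
- The `RMSE` of 0.34 above is measured on the anti-testset, where `r_ui` is the global mean fill value rather than a real rating, so it only shows how close the estimates stay to the global mean. The 5-fold cross-validated `RMSE` of about 2.6 is the more realistic error estimate.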
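
To see why the two RMSE figures are so far apart, the toy sketch below scores the same estimates twice: once against a constant reference (as in the anti-testset, where every `r_ui` is the global mean) and once against varied ratings (as in held-out folds). All numbers are made up for illustration.

```python
import math


def rmse(pairs):
    """Root mean square error over (reference, estimate) pairs."""
    return math.sqrt(sum((ref - est) ** 2 for ref, est in pairs) / len(pairs))


global_mean = 8.0
estimates = [7.9, 8.4, 8.1, 7.6]

# Anti-testset style: every reference is the global-mean fill value
vs_fill = rmse([(global_mean, est) for est in estimates])

# Held-out style: references are ratings users actually gave (made-up values)
actual = [10.0, 4.0, 8.0, 2.0]
vs_actual = rmse(list(zip(actual, estimates)))

print(vs_fill, vs_actual)
```

Estimates that hover near the global mean look very accurate against the fill value and much less so against real ratings.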

### User-User Based Collaborative Filtering

In [66]:
# supplying range of rating scale
reader = Reader(rating_scale=(1,10))

In [67]:
# load data in format of surprise library
data = Dataset.load_from_df(auth_prd_scr_5k,reader = reader)

In [68]:
# split into test and train data
trainset, testset = train_test_split(data, test_size=.15,random_state=612)

In [69]:
# user-based collaborative filtering which uses up to 50 nearest neighbours,
# pearson_baseline similarity and a user-user filter

algo_U = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True}) # initialise user-user filtering
algo_U.fit(trainset) # fit on trainset

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ea9dbc0280>

In [70]:
# run the trained model against the testset
test_pred_U = algo_U.test(testset)

In [71]:
# [(user_id, item_id, actual_rating, predicted_rating, details{used_neighbours, was_impossible})]
test_pred_U

[Prediction(uid='Amazon Customer', iid='Asus ZenFone 3 Max ZC520TL-4G860IN (Gold)', r_ui=8.0, est=10, details={'actual_k': 1, 'was_impossible': False}),
 Prediction(uid='deviaje', iid='Nokia Lumia 920', r_ui=10.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='TekGru', iid='Samsung Galaxy S III', r_ui=2.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='saleh', iid='BlackBerry RIM BlackBerry Bold 9930 Verizon Unlocked for QUAD Band GSM NON CAMERA VERSION OS7 Touch Screen', r_ui=8.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='voin2002', iid='Nokia X2-00', r_ui=10.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Daekmun', iid='Sony Ericsson K850', r_ui=8.0, est=8.002588235294118, details={'was_impossible': Tr

In [72]:
get_top_n(test_pred_U,5)

defaultdict(list,
            {'Amazon Customer': [('Asus ZenFone 3 Max ZC520TL-4G860IN (Gold)',
               10),
              ('Asus Zenfone Max ZC550KL-6A068IN (Black, 2GB, 16GB)', 10),
              ('Motorola Moto G, 4th Gen (Black, 2 GB, 16 GB)',
               8.991874162372332),
              ('YU Yuphoria YU5010A (Black+Silver)', 8.991279216961317),
              ('Lenovo S60 (Pearl White)', 8.002588235294118)],
             'deviaje': [('Nokia Lumia 920', 8.002588235294118)],
             'TekGru': [('Samsung Galaxy S III', 8.002588235294118)],
             'saleh': [('BlackBerry RIM BlackBerry Bold 9930 Verizon Unlocked for QUAD Band GSM NON CAMERA VERSION OS7 Touch Screen',
               8.002588235294118)],
             'voin2002': [('Nokia X2-00', 8.002588235294118)],
             'Daekmun': [('Sony Ericsson K850', 8.002588235294118)],
             'Julie1998': [('Samsung E1150 1.43" 72.5g Brown - Handys (3,632 cm (1.430"), 128 x 128 Pixel, CSTN, 0 MB, Polyphonisch, L

In [73]:
# evaluate model accuracy with Root Mean Square Error(RMSE)
print("User-based Model :")
accuracy.rmse(test_pred_U, verbose=True)

User-based Model :
RMSE: 2.6714


2.6713503296589796
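
Many of the predictions shown above are flagged `was_impossible: True` and share the same estimate (8.0026), which in Surprise is the fallback to the global mean of the training ratings when the user or item is unknown. Splitting the RMSE by that flag shows how much of the error comes from these fallbacks. The sketch below is an assumption-laden illustration: `Pred` is a hypothetical stand-in with the same fields as Surprise's `Prediction`, `rmse_by_feasibility` is an illustrative helper name, and the sample rows are copied from the user-based output above.

```python
import math
from collections import namedtuple

# Hypothetical stand-in with the same fields as Surprise's Prediction tuple
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est", "details"])


def rmse_by_feasibility(predictions):
    """Return RMSE separately for 'possible' and 'impossible' predictions.

    A group with no predictions maps to None.
    """
    groups = {"possible": [], "impossible": []}
    for p in predictions:
        key = "impossible" if p.details.get("was_impossible") else "possible"
        groups[key].append((p.r_ui - p.est) ** 2)
    return {
        key: math.sqrt(sum(errs) / len(errs)) if errs else None
        for key, errs in groups.items()
    }


# A few rows copied from the user-based output above
sample = [
    Pred("Amazon Customer", "Asus ZenFone 3 Max ZC520TL-4G860IN (Gold)", 8.0, 10,
         {"actual_k": 1, "was_impossible": False}),
    Pred("deviaje", "Nokia Lumia 920", 10.0, 8.002588235294118,
         {"was_impossible": True, "reason": "User and/or item is unknown."}),
    Pred("TekGru", "Samsung Galaxy S III", 2.0, 8.002588235294118,
         {"was_impossible": True, "reason": "User and/or item is unknown."}),
]

split_rmse = rmse_by_feasibility(sample)
print(split_rmse)
```

Because the helper only reads the `r_ui`, `est` and `details` attributes, it should also accept the real `test_pred_U` list, although that has not been run here.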

In [74]:
# cross-validations for better results
cross_validate(algo_U, data, measures=['rmse'], cv=5)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([2.65433094, 2.644126  , 2.67421973, 2.60045312, 2.58556446]),
 'fit_time': (0.3231630325317383,
  0.30826401710510254,
  0.27933335304260254,
  0.27205991744995117,
  0.33126091957092285),
 'test_time': (0.004999876022338867,
  0.0052301883697509766,
  0.0050013065338134766,
  0.005305767059326172,
  0.004396915435791016)}

### Item-Item Based Collaborative Filtering

In [75]:
# supplying range of rating scale
reader = Reader(rating_scale=(1,10))

In [76]:
# load data in format of surprise library
data = Dataset.load_from_df(auth_prd_scr_5k,reader = reader)

In [77]:
# split into test and train data
trainset, testset = train_test_split(data, test_size=.15,random_state=612)

In [78]:
# item-based collaborative filtering:
# up to 50 nearest neighbours, pearson_baseline similarity, item-item filter

algo_I = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False}) # initialise item-item filtering
algo_I.fit(trainset) # fit on trainset

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ea9dbc0370>

In [79]:
# run the trained model against the testset
test_pred_I = algo_I.test(testset)

In [80]:
# [(user_id, item_id, actual_rating, predicted_rating, details{used_neighbours, was_impossible})]
test_pred_I

[Prediction(uid='Amazon Customer', iid='Asus ZenFone 3 Max ZC520TL-4G860IN (Gold)', r_ui=8.0, est=9.311266913952057, details={'actual_k': 50, 'was_impossible': False}),
 Prediction(uid='deviaje', iid='Nokia Lumia 920', r_ui=10.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='TekGru', iid='Samsung Galaxy S III', r_ui=2.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='saleh', iid='BlackBerry RIM BlackBerry Bold 9930 Verizon Unlocked for QUAD Band GSM NON CAMERA VERSION OS7 Touch Screen', r_ui=8.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='voin2002', iid='Nokia X2-00', r_ui=10.0, est=8.002588235294118, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Daekmun', iid='Sony Ericsson K850', r_ui=8.0, est=8.002588235294118, details={'was

In [81]:
get_top_n(test_pred_I,5)

defaultdict(list,
            {'Amazon Customer': [('Asus Zenfone Max ZC550KL-6A068IN (Black, 2GB, 16GB)',
               10),
              ('Huawei P8 Lite US Version- 5 Unlocked Android 4G LTE Smartphone - Octa Core 1.5GHz, Dual SIM, Gorilla Glass, 13MP Camera - White (U.S. Warranty)',
               10),
              ('Motorola Moto G, 4th Gen (Black, 2 GB, 16 GB)',
               9.410066842689323),
              ('Asus ZenFone 3 Max ZC520TL-4G860IN (Gold)', 9.311266913952057),
              ('Desconocido Xiaomi Redmi 2 - Smartphone libre Android (pantalla 4.7", cámara 8 Mp, 8 GB, Quad-Core 1.2 GHz, 1 GB RAM), gris',
               9.0)],
             'deviaje': [('Nokia Lumia 920', 8.002588235294118)],
             'TekGru': [('Samsung Galaxy S III', 8.002588235294118)],
             'saleh': [('BlackBerry RIM BlackBerry Bold 9930 Verizon Unlocked for QUAD Band GSM NON CAMERA VERSION OS7 Touch Screen',
               8.002588235294118)],
             'voin2002': [('Nokia X2-00',

In [82]:
# evaluate model accuracy with Root Mean Square Error(RMSE)
print("Item-based Model :")
accuracy.rmse(test_pred_I, verbose=True)

Item-based Model :
RMSE: 2.6762


2.6762364822715283

In [83]:
# cross-validations for better results
cross_validate(algo_I, data, measures=['rmse'], cv=5)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([2.52259376, 2.57234206, 2.67476769, 2.65469343, 2.81606693]),
 'fit_time': (0.26340579986572266,
  0.29419755935668945,
  0.25064635276794434,
  0.25007200241088867,
  0.29479336738586426),
 'test_time': (0.024475812911987305,
  0.009218215942382812,
  0.013002634048461914,
  0.012319087982177734,
  0.009493827819824219)}

#### Observations:
- `Lenovo Vibe K4 Note (White,16GB)` is the most rated product.
- `Amazon Customer` is the user who has given most of the ratings.
- `Samsung Galaxy Note5` is the most popular product based on both its ratings and the number of people who rated it.
- `Smartphone Sony Xperia E1 Desbloqueado Vivo Android 4.3 Tela 4 4GB 3G Wi-Fi Câmera 3MP - Branco` is the most popular based on ratings.
- `Root Mean Squared Error(RMSE)` for `Singular Value Decomposition(SVD)` is **0.34**.
- `Root Mean Squared Error(RMSE)` for `Cross-Validated Singular Value Decomposition(SVD)` is **2.61** (mean over 5 folds).
- `Root Mean Squared Error(RMSE)` for `User-User based Collaborative Filtering` is **2.67**.
- `Root Mean Squared Error(RMSE)` for `Cross-Validated User-User based Collaborative Filtering` is **2.63** (mean over 5 folds).
- `Root Mean Squared Error(RMSE)` for `Item-Item based Collaborative Filtering` is **2.68**.
- `Root Mean Squared Error(RMSE)` for `Cross-Validated Item-Item based Collaborative Filtering` is **2.65** (mean over 5 folds).

#### Shortcomings:
- Due to `out-of-memory` issues, the whole dataset could not be used.
- Running the notebook on a `Cloud based` or `Colab based` Python environment would allow the whole dataset to be used and better inferences to be made.
- A `GridSearchCV` or `RandomizedSearchCV` run could be used to find optimal **hyper-parameters** for each algorithm.

#### Popularity based Recommendation System
**Popularity based recommendation system** relies on the `popularity`, `trends` and `frequency` counts of the most purchased items. It suits business scenarios where no prior user data is available, or where a product is being launched in a new geographic region, and non-personalised recommendations based purely on past popularity are acceptable.
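
As a rough illustration of the idea, the sketch below ranks products by mean rating after discarding products with too few ratings. This is only one plausible popularity score; the function name `top_popular`, the `min_ratings` threshold and the toy data are all assumptions made for the example, not part of the analysis above.

```python
from collections import defaultdict


def top_popular(ratings, n=5, min_ratings=2):
    """Rank products by mean rating, keeping only those with enough ratings.

    ratings: iterable of (user, product, score) tuples.
    Returns a list of (product, mean_score, rating_count), best first.
    """
    scores = defaultdict(list)
    for _user, product, score in ratings:
        scores[product].append(score)

    ranked = [
        (product, sum(vals) / len(vals), len(vals))
        for product, vals in scores.items()
        if len(vals) >= min_ratings
    ]
    # Sort by mean rating, then by number of ratings, both descending
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked[:n]


# Made-up ratings purely for illustration (not taken from the dataset)
toy_ratings = [
    ("u1", "Phone A", 10), ("u2", "Phone A", 8), ("u3", "Phone A", 9),
    ("u1", "Phone B", 10),  # only one rating, so it is filtered out
    ("u2", "Phone C", 6), ("u3", "Phone C", 8),
]

popular = top_popular(toy_ratings, n=2)
print(popular)
```

The minimum-count filter keeps a product with a single perfect score from outranking products that many users have rated highly. Every user receives the same list, which is what makes the approach non-personalised.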

#### Collaborative Filtering based Recommendation System
**Collaborative Filtering based recommendation system** is used to build `intelligent` recommender systems that learn to give better recommendations as more information about users is collected. It is a `personalised` recommender system: recommendations are made based on the past behaviour of the user and of similar users. It suits business scenarios where enough user profile data, product profile data or user-item interaction data is available.
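
To make the mechanics concrete, here is a small, library-free sketch of mean-centred user-user filtering, the general idea behind `KNNWithMeans`. It is a simplification under stated assumptions, not Surprise's implementation: plain Pearson correlation stands in for `pearson_baseline`, the helper names (`pearson`, `predict`) are illustrative, the ratings are made up, and the result is not clipped to the rating scale.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson(ratings_a, ratings_b):
    """Pearson correlation over items both users rated; 0.0 if undefined."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(user, item, ratings, k=2):
    """Predict a rating as the user's mean plus the similarity-weighted
    average of the neighbours' deviations from their own means."""
    user_mean = mean(list(ratings[user].values()))
    candidates = [
        (pearson(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    # Keep the k most similar neighbours with positive similarity
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not neighbours:
        return user_mean  # no usable neighbour: fall back to the user's mean
    weighted_dev = sum(
        sim * (ratings[v][item] - mean(list(ratings[v].values())))
        for sim, v in neighbours
    )
    return user_mean + weighted_dev / sum(sim for sim, _ in neighbours)


# Made-up ratings purely for illustration
toy = {
    "alice": {"Phone A": 9, "Phone B": 7, "Phone C": 8},
    "bob": {"Phone A": 10, "Phone B": 8, "Phone C": 9, "Phone D": 10},
    "carol": {"Phone A": 4, "Phone B": 9, "Phone D": 3},
}

estimate = predict("alice", "Phone D", toy)
print(round(estimate, 2))
```

In this toy data `bob` rates in the same pattern as `alice` while `carol` rates in the opposite pattern, so only `bob` contributes to the estimate for `Phone D`.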

#### Other Recommendation Systems:
- **Content Based Recommendation Systems**, where product descriptions are matched against the query supplied by the customer to find the most similar products.
- **Classification Model Based Recommendation Systems**, which build individual (and therefore hard to scale) models per product to predict whether a customer will like it or not.
- **Association Rule Mining** or **Market-Basket Analysis** (for example with the **Apriori** algorithm) to discover which products are frequently purchased together, so that they can be marketed together to lift sales.
- **Hybrid Approaches** combine multiple recommendation systems so that each one compensates for the shortcomings of the others. Although any types of recommender system can be combined, a common approach in industry is to combine `content based` and `collaborative filtering` approaches. `Content based` models can be used to address the **Cold Start** and **Gray Sheep** problems in `Collaborative Filtering`. Typical hybridisation methods include:
    + `Weighted`: recommendations from each system are weighted to calculate the final recommendation.
    + `Switching`: the system switches between different recommendation models.
    + `Mixed`: recommendations from different recommenders are presented together.
- A common approach is to use `Latent Factor` models for high-level recommendations and then improve them with `content based` systems that use information about users or items.
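
The weighted method of hybridisation can be sketched as a linear blend of two score dictionaries. The function name `weighted_hybrid`, the `cf_weight` parameter and the toy scores are assumptions made for illustration, and the sketch assumes both recommenders score items on the same scale.

```python
def weighted_hybrid(cf_scores, content_scores, cf_weight=0.7):
    """Blend two {item: score} dicts into one ranked list.

    Items scored by both recommenders get a weighted average; items scored
    by only one keep that single score (useful for cold-start items that
    collaborative filtering cannot score).
    """
    if not 0.0 <= cf_weight <= 1.0:
        raise ValueError("cf_weight must be between 0 and 1")
    blended = {}
    for item in set(cf_scores) | set(content_scores):
        if item in cf_scores and item in content_scores:
            blended[item] = (cf_weight * cf_scores[item]
                             + (1 - cf_weight) * content_scores[item])
        else:
            blended[item] = cf_scores.get(item, content_scores.get(item))
    # Highest blended score first
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores purely for illustration
cf = {"Phone A": 9.0, "Phone B": 6.0}
content = {"Phone B": 8.0, "Phone C": 6.5}

ranked = weighted_hybrid(cf, content, cf_weight=0.5)
print(ranked)
```

In practice the weight would be tuned on held-out data rather than fixed by hand.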