### DOMAIN: Smartphone, Electronics
#### CONTEXT: 
India is the second-largest smartphone market globally, after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to rise to about 442 million in 2022. India also ranked second in Asia Pacific for the average time smartphone users spend on the mobile web. The combination of very high sales volumes and this consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when buying a product, versus 15% who turn to social media. If a seller succeeds in presenting smartphones that match a user’s behaviour or choice in the right place, there is a 90% chance that the user will enquire about them.

#### This case study aims to build a recommendation system based on an individual consumer’s behaviour or choice.

### DATA DESCRIPTION: 
1. author: name of the person who gave the rating
2. country: country the person who gave the rating belongs to
3. date: date of the rating
4. domain: website from which the rating was taken
5. extract: rating content
6. language: language in which the rating was given
7. product: name of the product/mobile phone for which the rating was given
8. score: average rating for the phone
9. score_max: highest rating given for the phone
10. source: source from where the rating was taken
*Data source: 
#### PROJECT OBJECTIVE: We will build a recommendation system using popularity-based and collaborative filtering methods to recommend mobile phones to a user: the most popular phones and personalised picks, respectively.

#### step 1: Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os
from os import listdir

#### step 2: Merge the provided CSVs into one data-frame.

In [2]:
# Build the data directory path portably with os.path.join
path = os.path.join(os.getcwd(), "Data", "Data Set")
csv_files = [name for name in listdir(path) if name.endswith(".csv")]

# Read every CSV and stack them row-wise into a single data frame
d1 = pd.concat([pd.read_csv(os.path.join(path, name), encoding='latin1') for name in csv_files], axis=0)
d1.shape

(1415133, 11)

In [3]:
## Check a few observations and shape of the data-frame.
d1.sample(3)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
169763,/cellphones/samsung-galaxy-j1-2016-4-5-sm-j120/,7/18/2016,en,us,Amazon,amazon.com,10.0,10.0,I love this phone. It works great,Amazon Customer,Samsung Galaxy Express 3 AT&T Prepaid (U.S. Wa...
141341,/cellphones/cubot-s308/,12/27/2014,de,de,Amazon,amazon.de,10.0,10.0,"artikel wie beschrieben, nur fÃ¼r ein 5 zoler ...",!!!!!!!!!!!!!!!!!!!!!!!!!!!!,Cubot 5 '' CUBOT S308 IPS HD Display 3G Smartp...
117933,/cellphones/samsung-wave-ii-s8530/,9/16/2011,es,es,Ciao,ciao.es,10.0,10.0,"Una estetica elegante, pedazo de movil y parec...",abuquina,Samsung S8530 Wave II


#### step 3 : data preprocessing and exploration

In [4]:
##Round off scores to the nearest integers
df=d1.copy()
df['score']=df['score'].round()
df['score_max']=df['score_max'].round()

In [5]:
## Check for missing values. Impute the missing values if there is any
df.isnull().mean()*100

phone_url    0.000000
date         0.000000
lang         0.000000
country      0.000000
source       0.000000
domain       0.000000
score        4.486433
score_max    4.486433
extract      1.368140
author       4.466153
product      0.000071
dtype: float64

In [6]:
#missing value imputation
df['score']= df['score'].replace(np.nan, 0)
df['score_max']= df['score_max'].replace(np.nan, 0)
df['extract']= df['extract'].replace(np.nan,'no captured')
df['author']= df['author'].replace(np.nan,'no captured')
df['product']= df['product'].replace(np.nan,'no captured')

In [64]:
#Check for duplicate values and remove them if there is any
if len(df[df.duplicated()]) > 0:
    print('Data has duplicate rows')
    df_temp1=df.drop_duplicates()
    print('Number of rows dropped', len(df)-len(df_temp1))
    print(df_temp1.shape)
else :
    print('Data does not have duplicate rows')
    df_temp1=df.copy()
    print(df_temp1.shape)
    print('Number of rows dropped', len(df)-len(df_temp1))

Data has duplicate rows
Number of rows dropped 6436
(1408697, 11)


In [8]:
## highest rating given for the phone
df_temp1['score_max'].value_counts()

10.0    1345604
0.0       63093
Name: score_max, dtype: int64

In [9]:
len(df_temp1['lang'].value_counts())

22

In [10]:
df_temp1[df_temp1['lang']=='ar'].sample(2)

Unnamed: 0,phone_url,date,lang,country,source,domain,score,score_max,extract,author,product
61385,/cellphones/huawei-p9-plus/,7/14/2016,ar,ae,Souq,uae.souq.com,10.0,10.0,Ø´ÙØ±Ø§,mahdi1408,ÙØ§ÙØ§ÙÙ P9 Ø¨ÙÙØ³ - 64 Ø¬ÙØ¬Ø§Ø¨Ø§ÙØª...
169811,/cellphones/samsung-galaxy-j1-2016-4-5-sm-j120/,5/10/2017,ar,ae,Souq,uae.souq.com,10.0,10.0,ÙÙØ§Ø³Ù Ø§Ù ÙØ°Ù ÙÙ Ø§ÙÙØ±Ù Ø§ÙØ«Ø...,melody1989,Ø³Ø§ÙØ³ÙÙØ¬ Ø¬Ø§ÙÙØ³Ù J1 2016 SM-J120H -...


#### Summary : 
1. score_max is 10 for every rated row (0 marks the imputed missing values), and the reviews span 22 languages, some of which appear in an unreadable, non-English form.
2. The rows shown above add no value for this analysis, as their text is not readable English.
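
The unreadable strings in the samples above (for example `fÃ¼r`) look like the usual result of UTF-8 text being decoded as Latin-1, which is the encoding passed to `read_csv` earlier. A minimal sketch of how that garbling arises and how it can sometimes be reversed follows; `repair_mojibake` is a hypothetical helper, not part of the original notebook, and it only helps when the original bytes survived intact.

```python
def repair_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 garbling; return the input unchanged if that is not possible."""
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was not produced by this specific mis-decoding, so leave it alone
        return text

original = "für"
# Simulate the problem: UTF-8 bytes interpreted as Latin-1 characters
garbled = original.encode("utf-8").decode("latin1")
print(garbled)
print(repair_mojibake(garbled))
```

Whether this recovers the Arabic or Russian rows in this data set would need to be checked on the actual files.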

In [68]:
## Drop irrelevant features. Keep features like Author, Product, and Score
drop_col_list=['phone_url','date','country','domain','source','extract','score_max', 'lang']
df_temp1_sub=df_temp1.drop(drop_col_list, axis=1)
df_temp1_sub.sample(2)

Unnamed: 0,score,author,product
265063,10.0,Kurt C. Anchorstar,"Apple iPhone 6, Space Gray, 16GB (AT&T)"
221618,9.0,Tazzymama71,Samsung Galaxy S4 Active 16GB (AT&T)


In [12]:
##Keep only 1000000 data samples. Use random state=612.
df_sampled = df_temp1_sub.sample(n=1000000 ,random_state=612)
df_sampled.shape

(1000000, 3)

In [13]:
df_sampled.head(2)

Unnamed: 0,score,author,product
55265,8.0,Vinod Kumar Chengespur,"Lenovo Vibe K4 Note (White,16GB)"
97318,2.0,Sharon,HTC Desire 816 Black (Virgin mobile) - 5.5 inc...


In [14]:
#identify most rated feature

#### step 3.1: processing on data frame column

In [15]:
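import re
# Normalise case so the same product or author is not split by capitalisation
df_sampled['product']= df_sampled['product'].map(lambda x: x.lower())
df_sampled['author']= df_sampled['author'].map(lambda x: x.lower())
# Use the first three whitespace-separated tokens as a short product name
df_sampled['prod_details']=df_sampled['product'].map(lambda x: ' '.join(x.split()[:3]))
# Split that short name on non-word characters and keep the first three pieces
df_sampled['prod_details1']= df_sampled['prod_details'].map(lambda x: ' '.join(re.split(r'\W+', x)[:3]))
# Collapse every run of characters outside a-z, A-Z, 0-9 into "1";
# a value of exactly "1" means the text contains no ASCII letters or digits
df_sampled['nonalpha']=df_sampled['prod_details1'].map(lambda x: re.sub("[^a-zA-Z0-9]+", "1",x))
df_sampled['authornonalpha']=df_sampled['author'].map(lambda x: re.sub("[^a-zA-Z0-9]+", "1",x))
df_sampled.sample(2)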

Unnamed: 0,score,author,product,prod_details,prod_details1,nonalpha,authornonalpha
304936,4.0,la tommi,asus zenfone 2 smartphone 4gb 32gb android 5.0...,asus zenfone 2,asus zenfone 2,asus1zenfone12,la1tommi
280755,8.0,sqej,ð¡ð¾ñð¾ð²ñð¹ ñðµð»ðµñð¾ð½ nokia 206 dual sim,ð¡ð¾ñð¾ð²ñð¹ ñðµð»ðµñð¾ð½ nokia,ð ð¾ñ ð¾ð²ñ,1,sqej
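
The column transformations in step 3.1 can be read as one function applied per row. The sketch below restates them in plain Python so the behaviour is easy to check on a few names; `product_key` and `is_ascii_readable` are illustrative names that do not appear in the notebook.

```python
import re

def product_key(name, n_tokens=3):
    """Lower-case a product name and keep its first few tokens, as in prod_details1."""
    head = " ".join(name.lower().split()[:n_tokens])
    return " ".join(re.split(r"\W+", head)[:n_tokens])

def is_ascii_readable(key):
    """Mirror the nonalpha != "1" filter: False when a non-empty key has no ASCII letters or digits."""
    return re.sub("[^a-zA-Z0-9]+", "1", key) != "1"

for raw in ["Lenovo Vibe K4 Note (White,16GB)", "Apple iPhone 6, Space Gray, 16GB (AT&T)"]:
    key = product_key(raw)
    print(key, is_ascii_readable(key))
```

Because only three tokens are kept, different variants of a phone (colour, storage, carrier) collapse into one key, which is what lets ratings accumulate per model.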


#### step 3.2: getting readable data from product and user data

In [16]:
df_sampled_eng=df_sampled[df_sampled['nonalpha']!="1"]
df_sampled_eng.shape

df_sampled_auteng=df_sampled[df_sampled['authornonalpha']!="1"]
df_sampled_auteng.shape
df_sampled_auteng['author'].unique().tolist()

['vinod kumar chengespur',
 'sharon',
 'an sionnach',
 'jomine jose',
 'walter',
 'cliente amazon',
 'hjkage',
 'egornesterenko2009',
 'no captured',
 'chads3010',
 'hitesh malhotra',
 'luis del rã\xado',
 'w kessing',
 'elgringo_221',
 'viktoria-o',
 'amazon customer',
 'josef',
 'hussainn',
 'alessio t.',
 'amigo-vulnerable',
 'merendeiro',
 'lucas',
 'scorpionne666',
 'j. cerquera',
 'anonymous',
 'r. o. davis',
 'phobia',
 'dennis kee',
 'stuart gosney',
 'lynneb',
 'maribel',
 'dilemma21',
 'vlad',
 'adam',
 'nino',
 'mk',
 'farooque ahmed',
 'elke kaiser',
 'i. palmer',
 'aaabadtec ',
 'josemanuel',
 'rdraper',
 'marie-pierre',
 'frank westphal',
 'john',
 'raymond ong a kwien',
 'ja-topograf',
 'molux',
 'dominik',
 'claude',
 'irinags2009',
 'marcato mauro',
 'oc1905',
 'felivrin',
 'killingjoke',
 'ajai mohan',
 'iomar lemes barbosa',
 'chris p',
 'pravin',
 'leorock65',
 'b210bl',
 'annette barrett',
 'roticbeatz',
 'ramires',
 'lypssos',
 'andfri80',
 'netmax',
 'n. n. kaong

In [44]:
## most rated product
prod=df_sampled.groupby(['prod_details1']).size().reset_index(name='counts')
prod.sort_values('counts',ascending=False)[0:5]

Unnamed: 0,prod_details1,counts
14162,samsung galaxy s6,19578
14163,samsung galaxy s7,19015
14152,samsung galaxy s,14071
14155,samsung galaxy s4,13487
14156,samsung galaxy s5,12425


#### The data frame prod gives the most-rated products; above are the top 5 products by number of reviews received.


In [18]:
df_sampled.groupby('prod_details1')['score'].agg(['mean','size']).sort_values(['mean','size'],ascending=False).head(10).reset_index()

Unnamed: 0,prod_details1,mean,size
0,apple smartphone libre,10.0,22
1,iphone 6s 16gb,10.0,16
2,samsung smartphone libre,10.0,16
3,lg optimus gj,10.0,15
4,samsung note ii,10.0,13
5,iphone 6s 64gb,10.0,12
6,coque sony xperia,10.0,11
7,seminovo apple iphone,10.0,11
8,blackberry dtek50 16,10.0,10
9,htc p4550 tytn,10.0,10


Summary : Above are the top 10 highest-rated products, sorted by mean rating and then by number of ratings.

In [19]:
prod_rd=df_sampled_eng.groupby(['prod_details1']).size().reset_index(name='counts')
print(df_sampled_eng.groupby('prod_details1')['score'].agg(['mean','size']).sort_values(['mean','size'], ascending=False).head(2))  
prod_rd_gt50_rate_list=prod_rd[prod_rd['counts']>50]['prod_details1']
prod_rd_gt50_rate_list[0:2]

                        mean  size
prod_details1                     
apple smartphone libre  10.0    22
iphone 6s 16gb          10.0    16


3          apple
16     envios da
Name: prod_details1, dtype: object

Summary: Since the data contains unreadable characters, those rows were removed; above are the highest-rated products among the readable ones.

In [20]:
## authors with most reviews
author=df_sampled.groupby(['author']).size().reset_index(name='counts')
author.sort_values('counts',ascending=False)[0:10]

Unnamed: 0,author,counts
26747,amazon customer,54605
375541,no captured,43809
99693,cliente amazon,13674
140909,e-bit,5959
99668,client d'amazon,5496
27077,amazon kunde,3283
39136,anonymous,2067
145533,einer kundin,1890
145530,einem kunden,1350
518385,unknown,1225


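Summary : Above is the list of authors with the most reviews. Most of them are generic placeholders (for example 'amazon customer' and 'no captured') rather than individual users.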

In [21]:
## authors names too have junk data so checking list of authors whose names are readable
## authors with most reviews
author_rd=df_sampled_auteng.groupby(['author']).size().reset_index(name='counts')
author_rd.sort_values('counts',ascending=False)[0:2]
author_rd_gt50_rate_list=author_rd[author_rd['counts']>50]['author'].to_list()
author_rd_gt50_rate_list[0:3]

['a', 'aaron', 'abhishek']

Summary : Above are the first three authors with readable names who have more than 50 reviews.

In [22]:
## Select the data with products having more than 50 ratings
prod_gt50_rate_list=prod[prod['counts']>50]['prod_details1']
prod_gt50_rate_list[0:4]

3          apple
16     envios da
19           htc
28        huawei
Name: prod_details1, dtype: object

In [23]:
## users who have given more than 50 ratings.
author=df_sampled.groupby(['author']).size().reset_index(name='counts')
author_gt50_rate_list=author[author['counts']>50]['author'].to_list()
author_gt50_rate_list[0:2]

['#', '????????']

In [24]:
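# Note: the >50-ratings product list was built from 'prod_details1', while this filter matches on 'prod_details';
# the two columns differ when the first three tokens contain punctuation.
df_final=df_sampled[(df_sampled['prod_details'].isin(prod_gt50_rate_list)) & (df_sampled['author'].isin(author_gt50_rate_list))]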
df_final.shape

(174326, 7)

## Popularity-based recommendation

In [70]:
df_final1=df_final.drop_duplicates()
print(df_final1.shape)
df_final.groupby('prod_details1')['score'].agg(['size','mean']).sort_values(['mean','size'], ascending=False)[0:5]

(89069, 7)


Unnamed: 0_level_0,size,mean
prod_details1,Unnamed: 1_level_1,Unnamed: 2_level_1
blackberry oem z10,12,10.0
blackberry passport,12,10.0
samsung e1150 handy,9,10.0
nokia n8 sim,8,10.0
iphone 6 16,6,10.0


#### Summary : Above are the 5 products recommended based on users' ratings.
These products have a high mean rating, but each was rated by very few users; the products with the most reviews do not appear at the top of the list.
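
One common way to keep products with a handful of perfect scores from dominating a popularity list is to shrink each product's mean towards the global mean in proportion to how few ratings it has. The sketch below applies that weighted-rating idea to toy data; it is a swapped-in technique, not what the notebook does, and `min_votes` and the sample ratings are assumptions chosen for illustration.

```python
from collections import defaultdict

def weighted_popularity(ratings, min_votes=50):
    """Rank products by a mean rating shrunk towards the global mean.

    ratings: iterable of (product, score) pairs.
    Returns (product, weighted_score, n_ratings) tuples, best first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for product, score in ratings:
        totals[product] += score
        counts[product] += 1
    global_mean = sum(totals.values()) / sum(counts.values())

    ranked = []
    for product, v in counts.items():
        mean = totals[product] / v
        # Few ratings -> the score stays close to the global mean
        weighted = (v / (v + min_votes)) * mean + (min_votes / (v + min_votes)) * global_mean
        ranked.append((product, weighted, v))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked

toy = [("niche", 10)] * 2 + [("popular", 9)] * 100 + [("weak", 5)] * 100
for product, weighted, n in weighted_popularity(toy):
    print(product, round(weighted, 2), n)
```

With a plain mean the two-rating product would rank first; with the shrunk score the widely reviewed one does.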

In [71]:
df_final_v1=df_sampled[(df_sampled['prod_details'].isin(prod_rd_gt50_rate_list)) & (df_sampled['author'].isin(author_rd_gt50_rate_list))]
df_final_v1.shape
df_final_v1=df_final_v1.drop_duplicates()

### Summary : this data has less noise.

In [72]:
df_final_v1.groupby('prod_details1')['score'].agg(['mean','size']).sort_values(['mean','size'],ascending=False).head(5)  

Unnamed: 0_level_0,mean,size
prod_details1,Unnamed: 1_level_1,Unnamed: 2_level_1
blackberry oem z10,10.0,9
samsung e1150 handy,10.0,9
blackberry passport,10.0,8
nokia n8 sim,10.0,8
sony ericsson arc,10.0,5


Summary: the less noisy data produces almost the same list of products based on popularity.

## Build a collaborative filtering model using SVD.

In [29]:
from surprise import SVD, KNNWithMeans
from surprise import accuracy
from surprise import Dataset,Reader

In [89]:
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df_final1[['author', 'prod_details1', 'score']], reader)

# Split data to train and test
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=123)

print(trainset.to_raw_iid(0))

svd_model = SVD(n_factors=5,biased=False)
svd_model.fit(trainset)

test_pred = svd_model.test(testset)

# compute RMSE
print(accuracy.rmse(test_pred))
pred = pd.DataFrame(test_pred)
# # pred[pred['uid'] == 'amazon customer'][['iid', 'r_ui','est']].sort_values(by = 'r_ui',ascending = False).head(10)
# pred[['uid','uid','iid', 'r_ui','est']].sort_values(by = ['uid','r_ui','est'],ascending = False).head(10)

htc incredible s
RMSE: 2.9514
2.9513655907437624
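
The RMSE reported above is the square root of the mean squared difference between the actual rating and the estimate, so a value near 2.95 means typical errors are on the order of three points on the 1 to 10 scale. A small sketch of the same calculation on made-up (actual, estimate) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual_rating, estimated_rating) pairs."""
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values for illustration only, not taken from the data set
toy_pairs = [(10, 8), (6, 8), (9, 9), (3, 7)]
print(rmse(toy_pairs))
```

Because errors are squared before averaging, one badly missed rating (the last pair) weighs more than several small misses.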


In [93]:
from collections import defaultdict
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

### New product recommendation for user

In [95]:
top_n = get_top_n(test_pred, n=10)
# pred.sort_values(['uid','r_ui','est'],ascending=False).head(10)
top_n

defaultdict(list,
            {'alexander': [('huawei ascend g600', 10),
              ('htc one smartphone', 9.556021453405585),
              ('huawei ascend mate', 9.425918521969589),
              ('asus computer zenfone', 9.083667572781483),
              ('sony xperia s', 9.032531882250506),
              ('samsung s5230 star', 8.91584944079047),
              ('honor 7 smartphone', 8.639872568266167),
              ('htc one x', 8.534793979834358),
              ('lenovo motorola moto', 8.320956781191608),
              ('sony ericsson k750i', 8.297947151611009)],
             'juan carlos': [('sony xperia z5', 9.22362311479064),
              ('bq aquaris e5', 8.993835718545952),
              ('lg l bello', 8.953404624477994),
              ('apple iphone 5s', 8.934542845367217),
              ('samsung galaxy note', 8.878961035375392),
              ('samsung galaxy s3', 8.779275442592148),
              ('samsung galaxy s3', 8.779275442592148),
              ('galaxy samsung
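
The output above lists 'samsung galaxy s3' twice for one user, which happens when the same (user, product) pair appears more than once in the test set. A possible variant of get_top_n that keeps only the highest estimate per product is sketched below; `get_top_n_unique` is a hypothetical name, and the plain tuples stand in for Surprise prediction objects, which unpack to the same five fields.

```python
from collections import defaultdict

def get_top_n_unique(predictions, n=10):
    """Top-n items per user, keeping the best estimate when an item repeats."""
    best = defaultdict(dict)
    for uid, iid, true_r, est, _ in predictions:
        if est > best[uid].get(iid, float("-inf")):
            best[uid][iid] = est
    return {
        uid: sorted(items.items(), key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in best.items()
    }

# Toy predictions: (uid, iid, true rating, estimate, details)
toy_predictions = [
    ("juan carlos", "samsung galaxy s3", 8.0, 8.78, None),
    ("juan carlos", "samsung galaxy s3", 10.0, 8.78, None),
    ("juan carlos", "sony xperia z5", 9.0, 9.22, None),
]
print(get_top_n_unique(toy_predictions, n=2))
```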

In [145]:
### validating
prod_listdf=df_final1[df_final1['author']=='alexander'][['prod_details','score']].sort_values('score', ascending=False)
prod_list=prod_listdf['prod_details'][0:10]
print('user has given high rating to below products')
print(prod_listdf[0:10])
u1=pred[pred['uid']=='alexander'][['uid','iid','r_ui','est']].sort_values(['r_ui','est'], ascending=False)
recom= u1[~u1['iid'].isin(prod_list)]
print(' ')
print('recommended products for user')
recom[0:5]

user has given high rating to below products
                prod_details  score
8497     sony ericsson k750i   10.0
52070       sony xperia tipo   10.0
85396         sprint lg volt   10.0
244829        sony xperia sp   10.0
91564   samsung s7582 galaxy   10.0
93964      samsung galaxy j5   10.0
69493         htc wildfire s   10.0
367736         huawei mate 2   10.0
144259     samsung galaxy a5   10.0
156158       lg google nexus   10.0
 
recommended products for user


Unnamed: 0,uid,iid,r_ui,est
11762,alexander,huawei ascend g600,10.0,10.0
5272,alexander,htc one smartphone,10.0,9.556021
5006,alexander,huawei ascend mate,10.0,9.425919
20044,alexander,honor 7 smartphone,10.0,8.639873
0,alexander,htc one x,10.0,8.534794


In [147]:
### same analysis for readable data
reader = Reader(rating_scale=(1, 10))
data1 = Dataset.load_from_df(df_final_v1[['author', 'prod_details1', 'score']], reader)

# Split data to train and test
from surprise.model_selection import train_test_split
trainset1, testset1 = train_test_split(data1, test_size=.25,random_state=123)

print(trainset.to_raw_iid(0))

svd_model = SVD(n_factors=5,biased=False)
svd_model.fit(trainset1)

test_pred1= svd_model.test(testset1)

# compute RMSE
print(accuracy.rmse(test_pred1))

pred = pd.DataFrame(test_pred1)
top_n = get_top_n(test_pred1, n=10)
top_n

htc incredible s
RMSE: 2.9341
2.934079326909404


defaultdict(list,
            {'ivan': [('sim free apple', 9.849520331227229),
              ('huawei ascend g300', 9.746287445747132),
              ('sony xperia z3', 9.677772880088678),
              ('lg google nexus', 9.622053758080268),
              ('sim free lg', 9.518425465063867),
              ('apple iphone 6s', 9.489873393209338),
              ('smartphone sony xperia', 9.398736164751838),
              ('motorola moto e', 9.208653663742641),
              ('lg h340 leon', 9.12702213803825),
              ('samsung n9005 galaxy', 9.077084783747063)],
             'karthik': [('motorola moto z', 9.126515551035514),
              ('lenovo vibe x3', 9.08239032349194),
              ('apple iphone 6s', 8.853761728133719),
              ('samsung galaxy s7', 8.812557048020615),
              ('blu vivo air', 8.768879703522025),
              ('sony xperia xa', 8.68173984215288),
              ('meizu m1 note', 8.651091885854106),
              ('lg g2 d802', 8.500407655990585

In [32]:
## KNNWithMeans- userbased

In [149]:
algo_i = KNNWithMeans(k=10, sim_options={ 'user_based': True})
algo_i.fit(trainset)

test_pred=algo_i.test(testset)
print(accuracy.rmse(test_pred))

pred = pd.DataFrame(test_pred)
# pred[['uid','iid', 'r_ui','est']].sort_values(by = ['uid','r_ui','est'],ascending = False).head(10)
print("top recommended products for user based on KNN")
top_n = get_top_n(test_pred, n=10)
top_n

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.0518
3.0517901953414484
top recommended products for user based on KNN


defaultdict(list,
            {'alexander': [('lg l90 d410', 9.733740710156896),
              ('samsung nexus s', 9.503452651795786),
              ('sony xperia s', 9.432499241267998),
              ('blackberry bold 9780', 9.104632564370364),
              ('htc one x', 8.98343533966499),
              ('htc one smartphone', 8.818571683081387),
              ('huawei ascend mate', 8.795993695797284),
              ('honor 7 smartphone', 8.78480218022114),
              ('nokia lumia 920', 8.644115005366027),
              ('lenovo motorola moto', 8.541184706339369)],
             'juan carlos': [('samsung galaxy ace', 10),
              ('samsung galaxy ace', 10),
              ('huawei p9 lite', 9.719324902175472),
              ('galaxy samsung galaxy', 9.588578566735277),
              ('samsung galaxy note', 9.424349377772906),
              ('samsung galaxy s3', 9.267188121610644),
              ('samsung galaxy s3', 9.267188121610644),
              ('sony xperia p', 9.0889708

In [150]:
# algo_i = KNNWithMeans(k=10, sim_options={ 'user_based': True})
# algo_i.fit(trainset1)

# test_pred=algo_i.test(testset1)
# print(accuracy.rmse(test_pred))

# pred = pd.DataFrame(test_pred)
# pred[['uid','iid', 'r_ui','est']].sort_values(by = ['uid','r_ui','est'],ascending = False).head(10)

In [152]:
algo_i = KNNWithMeans(k=10, sim_options={ 'user_based': False})
algo_i.fit(trainset)

test_pred=algo_i.test(testset)
print(accuracy.rmse(test_pred))

print("top recommended products for Item based on KNN")
top_n = get_top_n(test_pred, n=10)
top_n
# pred = pd.DataFrame(test_pred)
# pred[['uid','iid', 'r_ui','est']].sort_values(by = ['uid','r_ui'],ascending = False).head(10)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.0940
3.0939967304435956
top recommended products for Item based on KNN


defaultdict(list,
            {'alexander': [('huawei ascend mate', 10),
              ('sony xperia s', 10),
              ('huawei ascend g600', 9.994287461368568),
              ('lenovo motorola moto', 9.835957156113393),
              ('htc one smartphone', 9.73597788675107),
              ('nokia c6 unlocked', 9.302562236082824),
              ('lg l90 d410', 9.151245433819657),
              ('samsung nexus s', 9.134897828863346),
              ('honor 7 smartphone', 8.92399103324069),
              ('lg g flex', 8.780911711633735)],
             'juan carlos': [('samsung galaxy ace', 9.526617680618426),
              ('samsung galaxy ace', 9.526617680618426),
              ('lg l bello', 9.40518861981026),
              ('huawei p9 lite', 9.330970872040398),
              ('samsung galaxy s3', 8.974208642382083),
              ('samsung galaxy s3', 8.974208642382083),
              ('samsung galaxy note', 8.909815424939193),
              ('bq aquaris e5', 8.699290082101463),
 

In [37]:
### Cross validate

In [38]:
from surprise.model_selection import cross_validate

In [153]:
algo_user = KNNWithMeans(k=10, sim_options={ 'user_based': True})
cross_validate(algo_user, data1, measures=['RMSE', 'MAE'], cv=5, verbose=True)
test_pred=algo_user.test(testset)
print(accuracy.rmse(test_pred))

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.0025  3.0105  2.9840  3.0126  3.0126  3.0044  0.0109  
MAE (testset)     2.3690  2.3802  2.3507  2.3779  2.3717  2.3699  0.0104  
Fit time          1.19    1.19    1.19    1.01    1.22    1.16    0.07    
Test time         4.29    4.21    4.13    4.34    4.21    4.24    0.07    
RMSE: 2.3860
2.38600805666108
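
cross_validate refits the model on each fold, so a test set that was split off separately is not guaranteed to be unseen afterwards. The sketch below uses plain index lists (no Surprise calls) to show that the training part of a 5-fold split necessarily overlaps an independently drawn 25% hold-out; `k_fold_indices` and the sizes are illustrative assumptions.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n_items = 100
# A hold-out drawn independently of the folds, like a separate train/test split
holdout = set(random.Random(123).sample(range(n_items), n_items // 4))
splits = list(k_fold_indices(n_items))
last_train, last_test = splits[-1]
overlap = holdout & set(last_train)
print(len(overlap), "of", len(holdout), "hold-out rows were in the last training fold")
```

An RMSE computed on such a hold-out after cross-validation is therefore likely to look better than the per-fold scores, which is worth keeping in mind when comparing the figures above.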


In [154]:
top_cv = get_top_n(test_pred, n=10)
top_cv

defaultdict(list,
            {'alexander': [('huawei ascend mate', 10),
              ('htc one x', 9.763126622869105),
              ('sony ericsson k750i', 9.586751027820915),
              ('samsung galaxy s3', 9.530019967925494),
              ('huawei ascend g600', 9.449439889797251),
              ('htc one smartphone', 9.199174390110715),
              ('nokia c6 unlocked', 8.978927974770249),
              ('sony xperia m4', 8.902603289372875),
              ('lenovo motorola moto', 8.836198570468838),
              ('sony ericsson txt', 8.73360017338344)],
             'juan carlos': [('huawei p9 lite', 10),
              ('apple iphone 5s', 10),
              ('sony xperia z5', 9.737967435330516),
              ('samsung galaxy s3', 9.52399959197863),
              ('samsung galaxy s3', 9.52399959197863),
              ('samsung galaxy ace', 9.417568737623087),
              ('samsung galaxy ace', 9.417568737623087),
              ('galaxy samsung galaxy', 9.377393093787685)

In [155]:
algo_item = KNNWithMeans(k=10, sim_options={ 'user_based': False})
cross_validate(algo_item, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
test_pred=algo_item.test(testset)
print(accuracy.rmse(test_pred))

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.0973  3.0773  3.0927  3.0952  3.1083  3.0941  0.0100  
MAE (testset)     2.4582  2.4415  2.4543  2.4533  2.4714  2.4557  0.0096  
Fit time          11.57   9.74    11.02   10.75   11.61   10.94   0.68    
Test time         21.24   22.64   22.85   23.72   24.05   22.90   0.98    
RMSE: 2.2945
2.2945428292895196


In [156]:
top_cv = get_top_n(test_pred, n=10)
top_cv

defaultdict(list,
            {'alexander': [('htc one smartphone', 10),
              ('huawei ascend g600', 10),
              ('lenovo motorola moto', 10),
              ('honor 7 smartphone', 10),
              ('huawei ascend mate', 9.698431250001432),
              ('htc one x', 9.364739418483428),
              ('lg l90 d410', 9.032467027374937),
              ('samsung nexus s', 9.025320230181341),
              ('nokia c6 unlocked', 9.012462537462538),
              ('sony xperia s', 8.974523019874384)],
             'juan carlos': [('samsung galaxy note', 9.580253844168851),
              ('sony xperia z5', 9.576841237471939),
              ('huawei p9 lite', 9.558952050727974),
              ('sony xperia p', 9.459104393555748),
              ('samsung galaxy ace', 9.40092013898581),
              ('samsung galaxy ace', 9.40092013898581),
              ('apple iphone 5s', 9.38138981350631),
              ('blackberry bold 9790', 9.209568717403231),
              ('samsung ga

Summary:
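1. We tried SVD and KNNWithMeans (user-user and item-item), with and without cross-validation.
2. The lowest reported RMSE (2.29) comes from the item-item model evaluated after cross-validation, followed by the user-user model (2.39). These two figures are probably optimistic, because after cross-validation the model has been refit on folds that likely overlap the held-out test set. On the plain 75/25 split SVD has the lowest RMSE (2.95, versus 3.05 for user-user and 3.09 for item-item KNN), and the cross-validated mean RMSEs are 3.00 (user-user, readable subset) and 3.09 (item-item).
3. SVD and KNNWithMeans produce slightly different recommendations: sometimes the recommended products differ, and sometimes only their rank.
4. Products can therefore be recommended to users from the model with the lowest RMSE on data it has not seen during training.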

In [41]:
# 11. In what business scenario you should use popularity based Recommendation Systems ?
# 12. In what business scenario you should use CF based Recommendation Systems ?
# 13. What other possible methods can you think of which can further improve the recommendation for different users ?

### Popularity based Recommendation :
1. A popularity-based recommender is the simplest recommendation system to build.
2. It is based on popularity, i.e. the average ratings of items or products.
3. It is the easiest way to showcase the current market trend: songs, movies or e-commerce products that are popular in the industry can be shown to users for exploration. For example, Netflix shows the top 10 movies trending in India.
4. It provides no personalisation, as it is based only on average ratings.
5. This type of recommender is useful for any business that wants to show users the current trends in its industry. It also helps with new users who have no history or profile, and with new businesses that have no user base yet: they can still show what is popular in the industry.
6. A few businesses that can benefit from recommender systems: e-commerce, retail, media, banking, telecom, utilities.

### CF based Recommendation Systems
1. Collaborative filtering (CF) is a personalised recommender system: recommendations are based on the user's past behaviour and do not depend on any additional information. Such recommendations are useful in marketing campaigns and retail, and can be used for upselling and cross-selling.
2. Because CF is personalised, it helps most where a business already has a user base. Items can then be recommended to new customers based on similarity between user profiles.
3. Item-item CF can help a business cater to new users: based on their navigation, similar items can be recommended to them.
4. Similarly, by combining user-user and item-item CF, products from the same or a different category can be suggested to a user, adding to cross-sell and upsell.
5. A few businesses that can benefit from CF: e-commerce, retail, media, banking, telecom, utilities.
6. CF can help a business through increased sales/conversion, increased user satisfaction, increased loyalty and reduced churn.
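
How a similarity-based prediction is formed can be sketched in a few lines. The example below follows the mean-centred neighbour formula that KNNWithMeans is built around (the user's mean plus a similarity-weighted average of the neighbours' deviations from their own means), with a mean-squared-difference similarity. It is a simplified illustration on made-up ratings, not Surprise's implementation, and the function names are assumptions.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def msd_similarity(ratings_u, ratings_v):
    """1 / (1 + mean squared difference) over items both users rated; 0 if none."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (1.0 + msd)

def predict_with_means(ratings, user, item, k=10):
    """Estimate user's rating of item from the k most similar users who rated it."""
    user_mean = mean(ratings[user].values())
    neighbours = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = msd_similarity(ratings[user], other_ratings)
        if sim > 0:
            deviation = other_ratings[item] - mean(other_ratings.values())
            neighbours.append((sim, deviation))
    top = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    if not top:
        return user_mean  # no usable neighbour: fall back to the user's own mean
    return user_mean + sum(s * d for s, d in top) / sum(s for s, _ in top)

# Toy ratings: user -> {product: score}
toy_ratings = {
    "ann": {"p1": 8, "p2": 6},
    "bob": {"p1": 8, "p2": 6, "p3": 10},
    "cat": {"p1": 2, "p3": 4},
}
print(predict_with_means(toy_ratings, "ann", "p3"))
```

Here "bob" rates exactly like "ann" on shared items, so his above-average score for "p3" dominates the estimate, while "cat" contributes very little.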

### Improvement in Recommendations
1. User-user and item-item recommendations are based on similarity between user profiles and between item profiles, but they lack personalisation based on an individual user's own preferences.
2. This can be improved by considering other user information, such as geographical information, age if available, dynamic information from social networks, historical logs and more.
3. Two or more recommendation systems can be combined to produce the recommendations.
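
Point 3 can be as simple as falling back to the popularity list when collaborative filtering has nothing, or too little, to offer for a user. A minimal sketch under that assumption; `recommend`, the dictionary layout and the sample items are hypothetical.

```python
def recommend(user, cf_top_n, popular_items, n=5):
    """Personalised picks first, padded with popular items not already picked."""
    # Keep CF order, drop repeated items, and cap at n
    picks = list(dict.fromkeys(item for item, _ in cf_top_n.get(user, [])))[:n]
    for item in popular_items:
        if len(picks) >= n:
            break
        if item not in picks:
            picks.append(item)
    return picks

# cf_top_n has the same shape as the get_top_n output: user -> [(item, estimate), ...]
cf_top_n = {"alexander": [("huawei ascend g600", 10.0), ("htc one x", 8.5)]}
popular_items = ["samsung galaxy s6", "htc one x", "samsung galaxy s7"]

print(recommend("alexander", cf_top_n, popular_items, n=3))
print(recommend("new user", cf_top_n, popular_items, n=3))
```

A known user gets personalised items topped up from the popular list, while an unknown user receives the popular list alone.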
