# Phone Recommendation System

## Introduction

This project focuses on building a phone recommendation system using collaborative filtering techniques. Collaborative filtering relies on the preferences and behaviors of users to suggest items they may like, based on similar users' ratings. In this case, we aim to recommend mobile phones to users by leveraging the data from user reviews and ratings.

### Dataset Overview

Two datasets are used in this project:

1. **items.csv**: This dataset contains information about mobile phones, including:
   - **asin**: Unique identifier for each phone.
   - **brand**: The brand or manufacturer of the phone (e.g., Apple, Samsung).
   - **title**: The full name or model of the phone.
   - **rating**: The average rating given by users (on a scale of 1 to 5).
   - **price**: The price of the phone.
   - **totalReviews**: The total number of reviews received by the phone.

2. **reviews.csv**: This dataset includes user reviews for each phone model:
   - **asin**: Identifier linking reviews to specific phones.
   - **name**: The name of the user who left the review.
   - **rating**: The individual rating given by the user.
   - **date**: The date when the review was posted (reduced to the year during preprocessing).

### Objective

The primary objective of this project is to implement a collaborative filtering model that suggests phones to users based on the ratings provided by similar users. By identifying users whose ratings resemble those of a target user, we can recommend phones that those similar users rated highly.

# Importing Libraries

In [6]:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn.preprocessing import LabelEncoder

# Data Loading and Overview
- The datasets are loaded from CSV files (`items.csv` for phone listings and `reviews.csv` for user reviews). The first rows of `items_df` are displayed to give an overview of the available columns, such as brand, title, price, rating, and total reviews.

In [3]:
itemFile = "items.csv"
ratingFile = "reviews.csv"
items_df = pd.read_csv(itemFile)
ratings_df = pd.read_csv(ratingFile) 

In [4]:
items_df.head()

Unnamed: 0,asin,brand,title,url,image,rating,reviewUrl,totalReviews,price,originalPrice
0,B0000SX2UC,,Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice...,https://www.amazon.com/Dual-Band-Tri-Mode-Acti...,https://m.media-amazon.com/images/I/2143EBQ210...,3.0,https://www.amazon.com/product-reviews/B0000SX2UC,14,0.0,0.0
1,B0009N5L7K,Motorola,Motorola I265 phone,https://www.amazon.com/Motorola-i265-I265-phon...,https://m.media-amazon.com/images/I/419WBAVDAR...,3.0,https://www.amazon.com/product-reviews/B0009N5L7K,7,49.95,0.0
2,B000SKTZ0S,Motorola,MOTOROLA C168i AT&T CINGULAR PREPAID GOPHONE C...,https://www.amazon.com/MOTOROLA-C168i-CINGULAR...,https://m.media-amazon.com/images/I/71b+q3ydkI...,2.7,https://www.amazon.com/product-reviews/B000SKTZ0S,22,99.99,0.0
3,B001AO4OUC,Motorola,Motorola i335 Cell Phone Boost Mobile,https://www.amazon.com/Motorola-i335-Phone-Boo...,https://m.media-amazon.com/images/I/710UO8gdT+...,3.3,https://www.amazon.com/product-reviews/B001AO4OUC,21,0.0,0.0
4,B001DCJAJG,Motorola,Motorola V365 no contract cellular phone AT&T,https://www.amazon.com/Motorola-V365-contract-...,https://m.media-amazon.com/images/I/61LYNCVrrK...,3.1,https://www.amazon.com/product-reviews/B001DCJAJG,12,149.99,0.0


# Data Cleaning and Preprocessing
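- The data is cleaned by dropping rows with missing values, removing columns the model does not use, reducing review dates to the year, keeping only reviews from 2015 onward, and encoding reviewer names as integer IDs.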

In [5]:
items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   asin           720 non-null    object 
 1   brand          716 non-null    object 
 2   title          720 non-null    object 
 3   url            720 non-null    object 
 4   image          720 non-null    object 
 5   rating         720 non-null    float64
 6   reviewUrl      720 non-null    object 
 7   totalReviews   720 non-null    int64  
 8   price          720 non-null    float64
 9   originalPrice  720 non-null    float64
dtypes: float64(3), int64(1), object(6)
memory usage: 56.4+ KB


In [6]:
items_df.isnull().sum()

asin             0
brand            4
title            0
url              0
image            0
rating           0
reviewUrl        0
totalReviews     0
price            0
originalPrice    0
dtype: int64

In [7]:
items_df = items_df.dropna()

In [8]:
items_df.columns

Index(['asin', 'brand', 'title', 'url', 'image', 'rating', 'reviewUrl',
       'totalReviews', 'price', 'originalPrice'],
      dtype='object')

In [9]:
items_df = items_df.drop(['url', 'image' , 'reviewUrl' , 'totalReviews' , 'price', 'originalPrice'] , axis = 1)
items_df.head()

Unnamed: 0,asin,brand,title,rating
1,B0009N5L7K,Motorola,Motorola I265 phone,3.0
2,B000SKTZ0S,Motorola,MOTOROLA C168i AT&T CINGULAR PREPAID GOPHONE C...,2.7
3,B001AO4OUC,Motorola,Motorola i335 Cell Phone Boost Mobile,3.3
4,B001DCJAJG,Motorola,Motorola V365 no contract cellular phone AT&T,3.1
5,B001GQ3DJM,Nokia,Nokia 1680 Black Phone (T-Mobile),2.7


In [10]:
ratings_df.head()

Unnamed: 0,asin,name,rating,date,verified,title,body,helpfulVotes
0,B0000SX2UC,Janet,3,"October 11, 2005",False,"Def not best, but not worst",I had the Samsung A600 for awhile which is abs...,1.0
1,B0000SX2UC,Luke Wyatt,1,"January 7, 2004",False,Text Messaging Doesn't Work,Due to a software issue between Nokia and Spri...,17.0
2,B0000SX2UC,Brooke,5,"December 30, 2003",False,Love This Phone,"This is a great, reliable phone. I also purcha...",5.0
3,B0000SX2UC,amy m. teague,3,"March 18, 2004",False,"Love the Phone, BUT...!","I love the phone and all, because I really did...",1.0
4,B0000SX2UC,tristazbimmer,4,"August 28, 2005",False,"Great phone service and options, lousy case!",The phone has been great for every purpose it ...,1.0


In [11]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67986 entries, 0 to 67985
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   asin          67986 non-null  object 
 1   name          67983 non-null  object 
 2   rating        67986 non-null  int64  
 3   date          67986 non-null  object 
 4   verified      67986 non-null  bool   
 5   title         67957 non-null  object 
 6   body          67960 non-null  object 
 7   helpfulVotes  27215 non-null  float64
dtypes: bool(1), float64(1), int64(1), object(5)
memory usage: 3.7+ MB


In [12]:
ratings_df.columns

Index(['asin', 'name', 'rating', 'date', 'verified', 'title', 'body',
       'helpfulVotes'],
      dtype='object')

In [13]:
ratings_df.isnull().sum()

asin                0
name                3
rating              0
date                0
verified            0
title              29
body               26
helpfulVotes    40771
dtype: int64

In [14]:
ratings_df = ratings_df.drop(['verified', 'title', 'body','helpfulVotes'] , axis = 1)
ratings_df=  ratings_df.dropna()
ratings_df.head()

Unnamed: 0,asin,name,rating,date
0,B0000SX2UC,Janet,3,"October 11, 2005"
1,B0000SX2UC,Luke Wyatt,1,"January 7, 2004"
2,B0000SX2UC,Brooke,5,"December 30, 2003"
3,B0000SX2UC,amy m. teague,3,"March 18, 2004"
4,B0000SX2UC,tristazbimmer,4,"August 28, 2005"


In [15]:
ratings_df["date"] = pd.to_datetime(ratings_df["date"] , errors = "coerce")
ratings_df["date"] = ratings_df["date"].dt.year
ratings_df.head()

Unnamed: 0,asin,name,rating,date
0,B0000SX2UC,Janet,3,2005
1,B0000SX2UC,Luke Wyatt,1,2004
2,B0000SX2UC,Brooke,5,2003
3,B0000SX2UC,amy m. teague,3,2004
4,B0000SX2UC,tristazbimmer,4,2005
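
The date handling above can be tried on a small hand-made series. This is a minimal sketch on made-up values, not the project's data. It illustrates that `errors="coerce"` turns unparseable strings into `NaT`, so the derived year is `NaN` for those rows and a later year filter drops them without raising an error.

```python
import pandas as pd

# Toy review dates (made up for illustration); the last entry is deliberately invalid.
dates = pd.Series(["October 11, 2005", "January 7, 2004", "not a date"])

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

# Extracting the year yields NaN for NaT rows.
years = parsed.dt.year
print(years.tolist())

# Comparisons against NaN are False, so a year filter silently drops those rows.
recent = years[years >= 2004]
print(recent.tolist())
```

Checking `parsed.isna().sum()` before filtering is one way to see how many reviews would be lost to parsing failures.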


In [16]:
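# Keep reviews from 2015 onward; .copy() avoids SettingWithCopyWarning when columns are reassigned later.
ratings_df = ratings_df[ratings_df["date"] >= 2015].copy()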

In [17]:
ratings_df.shape

(61570, 4)

In [18]:
ratings_df.head()

Unnamed: 0,asin,name,rating,date
14,B0009N5L7K,Marcel Thomas,1,2016
17,B0009N5L7K,Stephen Cahill,1,2016
22,B000SKTZ0S,"Kei, San Jose, CA",1,2017
23,B000SKTZ0S,Kristy,1,2019
24,B000SKTZ0S,MARIO GAUTIER,5,2017


In [19]:
encoder = LabelEncoder()
ratings_df["name"] = encoder.fit_transform(ratings_df["name"])
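
To see what this encoding does, the sketch below reproduces the mapping on a few made-up names. It uses `pd.factorize(..., sort=True)` as a stand-in for `LabelEncoder` (a swapped-in technique, chosen only to keep the example dependency-light); both assign integer codes in sorted order of the unique values. The names are hypothetical. Note that display names are not guaranteed to be unique, so two different reviewers with the same name receive the same ID.

```python
import pandas as pd

# Toy reviewer names (made up); the repeated name may belong to two different people.
names = pd.Series(["Kristy", "Amazon Customer", "Brooke", "Amazon Customer"])

# sort=True assigns codes in sorted order of the unique values,
# the same ordering LabelEncoder uses for its classes.
codes, uniques = pd.factorize(names, sort=True)
print(codes.tolist())
print(list(uniques))

# Codes map back to the original names by indexing into `uniques`.
decoded = [uniques[c] for c in codes]
print(decoded)
```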

In [20]:
ratings_df = ratings_df.drop("date" , axis = 1)
ratings_df = ratings_df.reset_index(drop = True)
ratings_df.head()

Unnamed: 0,asin,name,rating
0,B0009N5L7K,21499,1
1,B0009N5L7K,30908,1
2,B000SKTZ0S,18266,1
3,B000SKTZ0S,19053,1
4,B000SKTZ0S,20999,5


# User-Item Matrix Creation
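For a chosen target user (here `userId = 580`), the cells below collect that user's ratings, select every other user who rated at least one of the same phones, and keep the 100 users with the most overlapping rows as candidate neighbors. This overlap plays the role of the user-item interaction matrix (rows are users, columns are phones identified by `asin`) and forms the foundation of the collaborative filtering model, without materializing the full matrix.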
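
For reference, an explicit user-item matrix can be sketched with `pivot_table`. The ratings below are toy values, and averaging duplicate (user, phone) pairs with `aggfunc="mean"` is an assumption made for this illustration, not something the notebook does.

```python
import pandas as pd

# Toy ratings (made up): three users, three phones.
ratings = pd.DataFrame({
    "name": [0, 0, 1, 1, 2],
    "asin": ["A", "B", "A", "C", "B"],
    "rating": [5, 3, 4, 1, 2],
})

# Rows are users, columns are phones; unrated cells are NaN.
matrix = ratings.pivot_table(index="name", columns="asin",
                             values="rating", aggfunc="mean")
print(matrix)
```

With tens of thousands of users and hundreds of phones, most cells of such a matrix are empty, which is one reason to work on the overlapping subset instead.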

In [21]:
userId = 580
userInput = ratings_df[ratings_df["name"]==userId]
userInput = userInput.reset_index(drop = True)
userInput

Unnamed: 0,asin,name,rating
0,B00O2ALRNS,580,1
1,B06XRJQX91,580,5
2,B06Y16RL4W,580,5
3,B07NZXXZB2,580,5


In [22]:
userSubset =  ratings_df[ratings_df["asin"].isin(userInput["asin"].tolist())]
userSubset

Unnamed: 0,asin,name,rating
7435,B00O2ALRNS,26017,1
7436,B00O2ALRNS,14147,1
7437,B00O2ALRNS,1722,1
7438,B00O2ALRNS,18827,2
7439,B00O2ALRNS,12858,1
...,...,...,...
52080,B07NZXXZB2,11374,5
52081,B07NZXXZB2,13146,5
52082,B07NZXXZB2,13061,4
52083,B07NZXXZB2,27198,5


In [23]:
userSubset = userSubset[userSubset["name"]!=userId]

In [24]:
userSubsetgroup = userSubset.groupby("name")

In [25]:
userSubsetgroup = sorted(userSubsetgroup , key = lambda x: len(x[1]) , reverse = True)
userSubsetgroup[0:5]

[(1722,
               asin  name  rating
  7437   B00O2ALRNS  1722       1
  7442   B00O2ALRNS  1722       1
  7446   B00O2ALRNS  1722       5
  7451   B00O2ALRNS  1722       3
  7455   B00O2ALRNS  1722       5
  ...           ...   ...     ...
  52026  B07NZXXZB2  1722       5
  52031  B07NZXXZB2  1722       5
  52041  B07NZXXZB2  1722       5
  52049  B07NZXXZB2  1722       5
  52061  B07NZXXZB2  1722       3
  
  [159 rows x 3 columns]),
 (5131,
               asin  name  rating
  27436  B06Y16RL4W  5131       5
  51907  B07NZXXZB2  5131       4
  51940  B07NZXXZB2  5131       5),
 (6114,
               asin  name  rating
  27865  B06Y16RL4W  6114       2
  27892  B06Y16RL4W  6114       1
  28003  B06Y16RL4W  6114       1),
 (16266,
               asin   name  rating
  7514   B00O2ALRNS  16266       4
  25508  B06XRJQX91  16266       1
  52065  B07NZXXZB2  16266       1),
 (25184,
               asin   name  rating
  27724  B06Y16RL4W  25184       5
  51834  B07NZXXZB2  25184      

In [26]:
userSubsetgroup = userSubsetgroup[0:100]
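
The neighbor selection above relies on the fact that iterating a `groupby` object yields `(key, sub-DataFrame)` pairs. A minimal sketch on toy data follows. It also computes the number of distinct phones per user, which can differ from the row count when one name has several reviews for the same phone.

```python
import pandas as pd

# Toy overlap table (made up): ratings by other users on the target user's phones.
subset = pd.DataFrame({
    "asin": ["A", "B", "A", "A", "B", "C"],
    "name": [10, 10, 11, 12, 12, 12],
    "rating": [5, 4, 3, 1, 2, 5],
})

# Sort the (name, group) pairs by group size, largest first.
groups = sorted(subset.groupby("name"), key=lambda pair: len(pair[1]), reverse=True)
top_names = [name for name, _ in groups[:2]]
print(top_names)

# Distinct phones per user, as an alternative measure of overlap.
distinct = subset.groupby("name")["asin"].nunique()
print(distinct.to_dict())
```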

# Recommendation System Setup

In [27]:
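# Pearson correlation between the target user and each candidate neighbor.
# The dictionary name is kept as-is because later cells refer to it.
pearsonColleration = {}
# Sort the target user's ratings once so they follow the same asin order as each group.
userInput = userInput.sort_values(by="asin")
for name, group in userSubsetgroup:
    group = group.sort_values(by="asin")
    nRatings = len(group)
    # Target user's ratings restricted to the phones this neighbor also rated.
    temp_df = userInput[userInput["asin"].isin(group["asin"].tolist())]
    tempRatinglist = temp_df["rating"].tolist()
    tempGrouplist = group["rating"].tolist()
    # Caveat: if a name has several reviews for one phone, the two lists differ
    # in length and zip() pairs only the first len(tempRatinglist) entries.
    Sxx = sum([i**2 for i in tempRatinglist]) - pow(sum(tempRatinglist), 2) / float(nRatings)
    Syy = sum([i**2 for i in tempGrouplist]) - pow(sum(tempGrouplist), 2) / float(nRatings)
    Sxy = sum([i*j for i, j in zip(tempRatinglist, tempGrouplist)]) - sum(tempRatinglist) * sum(tempGrouplist) / float(nRatings)
    # With zero variance on either side the correlation is undefined; store 0.
    if Sxx != 0 and Syy != 0:
        pearsonColleration[name] = Sxy / sqrt(Sxx * Syy)
    else:
        pearsonColleration[name] = 0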
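
The similarity measure used here is the Pearson correlation, $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. The sketch below isolates that formula in a small helper (the function name `pearson` and the toy ratings are illustrative, not part of the notebook) and cross-checks it against `numpy.corrcoef`. It assumes the two lists are already aligned phone by phone and have equal length.

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation of two aligned rating lists; 0.0 if undefined."""
    if len(x) != len(y):
        raise ValueError("rating lists must be aligned and of equal length")
    n = len(x)
    if n == 0:
        return 0.0
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    # Zero variance on either side leaves the correlation undefined.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Toy ratings on four shared phones (made up for illustration).
target = [1, 5, 5, 5]
neighbor = [2, 4, 5, 3]
r = pearson(target, neighbor)
print(round(r, 4), round(float(np.corrcoef(target, neighbor)[0, 1]), 4))
```

With only two or three shared phones, correlations of exactly 1 or -1 are common, so a minimum-overlap threshold may be worth considering before trusting a neighbor.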

In [28]:
pearsonColleration

{1722: 0.019473931059564114,
 5131: -0.49999999999999756,
 6114: 0.9999999999999998,
 16266: -1.0000000000000002,
 25184: 0,
 32760: -0.49999999999999756,
 345: 1.0,
 705: 0,
 1219: 0,
 1421: 0,
 2072: -1.0,
 2809: 0,
 7631: 0,
 7831: 0,
 14696: 0,
 16603: 0,
 18827: 0,
 21950: 1.0,
 22349: -1.0,
 25981: 0,
 26864: 1.0,
 28071: 0,
 30087: 0,
 31000: 0,
 32479: 0,
 36070: 0,
 41609: 0,
 57: 0,
 118: 0,
 123: 0,
 138: 0,
 148: 0,
 149: 0,
 212: 0,
 217: 0,
 293: 0,
 295: 0,
 321: 0,
 370: 0,
 379: 0,
 385: 0,
 386: 0,
 504: 0,
 511: 0,
 519: 0,
 537: 0,
 559: 0,
 563: 0,
 584: 0,
 599: 0,
 618: 0,
 750: 0,
 793: 0,
 839: 0,
 879: 0,
 899: 0,
 959: 0,
 978: 0,
 985: 0,
 1015: 0,
 1033: 0,
 1042: 0,
 1078: 0,
 1138: 0,
 1146: 0,
 1157: 0,
 1174: 0,
 1182: 0,
 1289: 0,
 1328: 0,
 1360: 0,
 1369: 0,
 1379: 0,
 1381: 0,
 1385: 0,
 1412: 0,
 1457: 0,
 1489: 0,
 1492: 0,
 1559: 0,
 1568: 0,
 1607: 0,
 1636: 0,
 1637: 0,
 1646: 0,
 1695: 0,
 1718: 0,
 1757: 0,
 1834: 0,
 1874: 0,
 1949: 0,
 1954

In [29]:
pearsonDf = pd.DataFrame.from_dict(pearsonColleration , orient = "index")
pearsonDf.columns = ["similarityIndex"]
pearsonDf["name"] = pearsonDf.index
pearsonDf.index = range(len(pearsonDf))
pearsonDf.head()

Unnamed: 0,similarityIndex,name
0,0.019474,1722
1,-0.5,5131
2,1.0,6114
3,-1.0,16266
4,0.0,25184


In [30]:
topUsers = pearsonDf.sort_values(by = "similarityIndex" , ascending = False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,name
6,1.0,345
20,1.0,26864
17,1.0,21950
2,1.0,6114
0,0.019474,1722


In [31]:
topUsersRating=topUsers.merge(ratings_df, left_on='name', right_on='name', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,name,asin,rating
0,1.0,345,B00O2ALRNS,1
1,1.0,345,B00ZE8HRYK,5
2,1.0,345,B06XRJQX91,5
3,1.0,345,B06Y14K2C6,5
4,1.0,345,B07GNHMZ3N,3


In [32]:
topUsersRating["weighthedRating"]=topUsersRating["similarityIndex"]*topUsersRating["rating"]
topUsersRating

Unnamed: 0,similarityIndex,name,asin,rating,weighthedRating
0,1.0,345,B00O2ALRNS,1,1.0
1,1.0,345,B00ZE8HRYK,5,5.0
2,1.0,345,B06XRJQX91,5,5.0
3,1.0,345,B06Y14K2C6,5,5.0
4,1.0,345,B07GNHMZ3N,3,3.0
...,...,...,...,...,...
6565,0.0,1636,B07NQNK5F4,1,0.0
6566,0.0,1636,B07WSJYDXX,5,0.0
6567,0.0,1607,B06Y16RL4W,2,0.0
6568,0.0,1568,B00JS73V2U,3,0.0


In [33]:
tempTopUsersRating = topUsersRating.groupby("asin")[["similarityIndex", "weighthedRating"]].sum()
tempTopUsersRating.columns= ["sum_similarityIndex" , "sum_weightedRating"]
tempTopUsersRating.head()

asin,sum_similarityIndex,sum_weightedRating
B000SKTZ0S,0.019474,0.019474
B001AO4OUC,0.019474,0.09737
B002AS9WEA,0.019474,0.038948
B002UHS0UI,0.214213,0.564744
B002WTC1NG,0.214213,0.759483


In [34]:
recommendation_df = pd.DataFrame()
recommendation_df["weighted average recommedation score"] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df["asin"] = tempTopUsersRating.index
recommendation_df

asin,weighted average recommedation score,asin
B000SKTZ0S,1.000000,B000SKTZ0S
B001AO4OUC,5.000000,B001AO4OUC
B002AS9WEA,2.000000,B002AS9WEA
B002UHS0UI,2.636364,B002UHS0UI
B002WTC1NG,3.545455,B002WTC1NG
...,...,...
B07YZS6QT3,5.000000,B07YZS6QT3
B07Z8BL2VW,5.000000,B07Z8BL2VW
B07ZDJCL76,3.666667,B07ZDJCL76
B07ZPKZSSC,1.000000,B07ZPKZSSC
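
The score computed above is a similarity-weighted average of the neighbors' ratings: for each phone, the sum of `similarity * rating` divided by the sum of similarities. A minimal sketch on toy data (all values and the column name `weightedRating` are made up for illustration):

```python
import pandas as pd

# Toy neighbor ratings with their similarity to the target user.
neighbors = pd.DataFrame({
    "name": [1, 1, 2, 2],
    "asin": ["A", "B", "A", "C"],
    "rating": [5, 2, 3, 4],
    "similarityIndex": [1.0, 1.0, 0.5, 0.5],
})

# Weight each rating by the neighbor's similarity.
neighbors["weightedRating"] = neighbors["similarityIndex"] * neighbors["rating"]

# Per phone: sum of weights and sum of weighted ratings.
sums = neighbors.groupby("asin")[["similarityIndex", "weightedRating"]].sum()

# Weighted average score per phone.
scores = sums["weightedRating"] / sums["similarityIndex"]
print(scores)
```

A phone rated by a single neighbor simply receives that neighbor's rating, whatever the similarity, which helps explain why many phones end up with a score of exactly 5.0. Phones whose similarity sum is zero produce an undefined score.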


In [35]:
recommendation_df = recommendation_df.sort_values(by = "weighted average recommedation score" , ascending=False)
recommendation_df.head()

asin,weighted average recommedation score,asin
B081H6STQQ,5.0,B081H6STQQ
B07PB5BXMZ,5.0,B07PB5BXMZ
B07H7SBBWQ,5.0,B07H7SBBWQ
B07HKQ61NV,5.0,B07HKQ61NV
B07J2Q68N4,5.0,B07J2Q68N4


# Phone Recommendations
Based on the collaborative filtering model, recommendations are generated for the sample user (`userId = 580`). This section looks up the 20 highest-scoring phones in `items_df` to display their brand, title, and average rating.

In [36]:
items_df.loc[items_df["asin"].isin(recommendation_df.head(20)["asin"].tolist())]

Unnamed: 0,asin,brand,title,rating
101,B010CK8O9Q,Sony,"Sony Xperia Z3 Plus E6533 32GB Black 3G/4G, Du...",3.1
110,B016WYTYFE,Samsung,Samsung Galaxy S5 SM-G900T - 16GB - Shimmery W...,3.4
204,B06X9X15Y8,Apple,"Apple iPhone 7 Plus, 128GB, Silver - Fully Unl...",3.9
279,B074XF6JCD,Motorola,Motorola Moto Z2 Force XT1789 64GB ATT only (S...,3.5
290,B075SPVK8D,Samsung,Samsung Galaxy Note 8 SM-N950U 64GB Gray Veriz...,4.4
446,B07H7SBBWQ,Xiaomi,"Xiaomi Redmi 6 - 64GB + 4GB RAM, Dual Camera, ...",4.1
459,B07HKQ61NV,Apple,Apple iPhone 8 a1905 64GB LTE GSM Unlocked (Re...,3.7
464,B07J2Q68N4,Samsung,Samsung Galaxy S9 G960U Verizon + GSM Unlocked...,3.6
465,B07J4YLV46,Motorola,Motorola Moto Z3 MOTXT192917 Verizon Locked Ed...,3.5
467,B07J5GH52Q,Samsung,Samsung Qi Certified Fast Charge Wireless Char...,4.0
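
The `isin` lookup returns rows in the order of `items_df`, so the ranking by score is not preserved, and phones the target user has already rated are not excluded. One possible refinement, sketched on toy data with hypothetical names (`recs`, `items`, `already_rated`), is to filter out rated phones and then merge, since an inner merge keeps the order of the left table's keys:

```python
import pandas as pd

# Toy ranked recommendations, best first (made up).
recs = pd.DataFrame({"asin": ["C", "A", "B"], "score": [5.0, 4.5, 4.0]})

# Toy item catalog (made up).
items = pd.DataFrame({"asin": ["A", "B", "C"],
                      "title": ["Phone A", "Phone B", "Phone C"]})

# Phones the target user has already rated.
already_rated = {"A"}

# Drop rated phones, then attach item details while keeping the ranking order.
unseen = recs[~recs["asin"].isin(already_rated)]
ranked = unseen.merge(items, on="asin", how="inner")
print(ranked)
```

In the notebook, `recommendation_df` carries `asin` both as its index name and as a column, so it would likely need `reset_index(drop=True)` before such a merge.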


## Conclusion

In this project, we developed a phone recommendation system using collaborative filtering techniques. By leveraging user reviews and ratings, we were able to predict and recommend phones that align with the preferences of individual users. The collaborative filtering model effectively identified similar users, allowing for personalized phone suggestions based on shared tastes and past interactions.

Key insights from the project include:
- **Collaborative Filtering** recommends phones based on the behavior of users with similar preferences, measured here by the Pearson correlation between their ratings and the target user's ratings.
- The model produced personalized recommendations for the sample user from their previous ratings and the ratings of similar users; its accuracy was not measured against held-out ratings.

However, there are areas for improvement:
- The model's performance could be enhanced by incorporating more user-specific data, such as demographic information or usage patterns.
- While collaborative filtering works well for users with sufficient rating history, it may struggle with new or inactive users, a problem known as the **cold start** issue.
- Future iterations of the model could explore hybrid approaches, combining collaborative filtering with content-based filtering for a more comprehensive recommendation system.
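
As an illustration of the cold start point, a user with no rating history could be served a popularity-based list until enough ratings accumulate. The sketch below is one hypothetical fallback, not part of this project: the function `popular_items`, its thresholds, and the toy data are assumptions, while the column names follow `items.csv`.

```python
import pandas as pd

# Toy item table (made up) using the rating and totalReviews columns of items.csv.
items = pd.DataFrame({
    "asin": ["A", "B", "C"],
    "rating": [4.8, 4.2, 4.5],
    "totalReviews": [3, 500, 120],
})


def popular_items(items, min_reviews=50, top_n=2):
    """Hypothetical fallback: highest-rated phones among those with enough reviews."""
    eligible = items[items["totalReviews"] >= min_reviews]
    ranked = eligible.sort_values("rating", ascending=False)
    return ranked.head(top_n)["asin"].tolist()


print(popular_items(items))
```

The review-count threshold keeps a phone with a handful of perfect ratings from outranking well-reviewed ones.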

Overall, this project demonstrates the potential of collaborative filtering to create an effective recommendation system, providing valuable phone suggestions to users based on their previous behavior and the preferences of similar individuals.