# Amazon Cell Phones Reviews
### Context
I am currently working on my undergraduate thesis about sentiment analysis, and I am planning to use Amazon customer reviews of cell phones. Other datasets on Amazon mobile/cell phones already exist, but this one covers both unlocked and carrier-locked phones and is scoped to ten brands: ASUS, Apple, Google, HUAWEI, Motorola, Nokia, OnePlus, Samsung, Sony, and Xiaomi.

### Content
Published here are two files, items.csv and reviews.csv, each with a date prefix that indicates when the data was retrieved.

   items.csv contains items retrieved (read: scraped) from Amazon.com search results, using a generated URL and a query string that restricts the search to specific brands and to items that have a minimum 1-star review.

   reviews.csv contains the reviews for the items listed in items.csv, but without the columns from items.csv.
   
Task website: https://www.kaggle.com/datasets/grikomsn/amazon-cell-phones-reviews

# Import libraries
First, let's get all of the imports out of the way:

In [1]:
# The fundamental package for scientific computing with Python
import numpy as np
#Dataframe manipulation library
import pandas as pd
# Comprehensive library for creating static, animated, and interactive visualizations in Python
import matplotlib.pyplot as plt
# We use scipy to compute the Pearson correlation
from scipy.stats import pearsonr

Now let's read each file into its own DataFrame:

In [2]:
#Storing the cell phones information into a pandas dataframe
items = pd.read_csv('items.csv')
#Storing the user reviews into a pandas dataframe
reviews = pd.read_csv('reviews.csv')
items.head()

Unnamed: 0,asin,brand,title,url,image,rating,reviewUrl,totalReviews,price,originalPrice
0,B0000SX2UC,,Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice...,https://www.amazon.com/Dual-Band-Tri-Mode-Acti...,https://m.media-amazon.com/images/I/2143EBQ210...,3.0,https://www.amazon.com/product-reviews/B0000SX2UC,14,0.0,0.0
1,B0009N5L7K,Motorola,Motorola I265 phone,https://www.amazon.com/Motorola-i265-I265-phon...,https://m.media-amazon.com/images/I/419WBAVDAR...,3.0,https://www.amazon.com/product-reviews/B0009N5L7K,7,49.95,0.0
2,B000SKTZ0S,Motorola,MOTOROLA C168i AT&T CINGULAR PREPAID GOPHONE C...,https://www.amazon.com/MOTOROLA-C168i-CINGULAR...,https://m.media-amazon.com/images/I/71b+q3ydkI...,2.7,https://www.amazon.com/product-reviews/B000SKTZ0S,22,99.99,0.0
3,B001AO4OUC,Motorola,Motorola i335 Cell Phone Boost Mobile,https://www.amazon.com/Motorola-i335-Phone-Boo...,https://m.media-amazon.com/images/I/710UO8gdT+...,3.3,https://www.amazon.com/product-reviews/B001AO4OUC,21,0.0,0.0
4,B001DCJAJG,Motorola,Motorola V365 no contract cellular phone AT&T,https://www.amazon.com/Motorola-V365-contract-...,https://m.media-amazon.com/images/I/61LYNCVrrK...,3.1,https://www.amazon.com/product-reviews/B001DCJAJG,12,149.99,0.0


Choose the columns from the items and reviews data that we need in order to make recommendations to our active user.
We also remove items whose price is zero.

In [3]:
np.random.seed(0)
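# Keep a full copy of the item table; its url and image columns are needed at the end
totalItems=items.copy()
# Select the columns we need and rename in one step, which avoids an in-place edit on a slice
items = items[['asin','brand', 'rating', 'price']].rename(columns={'rating': 'avg-rating'})
# Drop items whose price is recorded as zero
items = items[items['price'] != 0]
reviews = reviews[['asin', 'name', 'rating']]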
reviews

Unnamed: 0,asin,name,rating
0,B0000SX2UC,Janet,3
1,B0000SX2UC,Luke Wyatt,1
2,B0000SX2UC,Brooke,5
3,B0000SX2UC,amy m. teague,3
4,B0000SX2UC,tristazbimmer,4
...,...,...,...
67981,B081H6STQQ,jande,5
67982,B081H6STQQ,2cool4u,5
67983,B081H6STQQ,simon,5
67984,B081TJFVCJ,Tobiasz Jedrysiak,5


Define a user profile to predict which cell phones we can recommend to the user.

In [4]:
userProfile = [
            {'asin':'B079HB518K', 'rating':4.5},
            {'asin':'B07KFNRQ5S', 'rating':5},
            {'asin':'B07L78G3D2', 'rating':2},
            {'asin':'B00K0NS0P4', 'rating':3},
            {'asin':'B07V5NSD8N', 'rating':4}
         ] 
usersInput = pd.DataFrame(userProfile)
usersInput

Unnamed: 0,asin,rating
0,B079HB518K,4.5
1,B07KFNRQ5S,5.0
2,B07L78G3D2,2.0
3,B00K0NS0P4,3.0
4,B07V5NSD8N,4.0


Add the other features of the cell phones that the user rated.

In [5]:
usersInput=usersInput.merge(items, on='asin', how='left')
usersInput

Unnamed: 0,asin,rating,brand,avg-rating,price
0,B079HB518K,4.5,Apple,3.9,199.0
1,B07KFNRQ5S,5.0,Apple,4.1,664.99
2,B07L78G3D2,2.0,Nokia,2.7,79.0
3,B00K0NS0P4,3.0,Motorola,4.0,209.75
4,B07V5NSD8N,4.0,Samsung,4.7,499.99
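
Because zero-price items were dropped earlier, a phone in the user profile could in principle be missing from `items`, and the left merge would then fill `brand`, `avg-rating`, and `price` with NaN. A minimal sketch of how to check for that, using made-up toy frames (`toy_items`, `toy_input`, and the `A1`..`A3` identifiers are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy stand-ins for the notebook's `items` and `usersInput` frames (values are made up)
toy_items = pd.DataFrame({
    'asin': ['A1', 'A2'],
    'brand': ['Apple', 'Nokia'],
    'avg-rating': [4.1, 2.7],
    'price': [664.99, 79.0],
})
toy_input = pd.DataFrame({'asin': ['A1', 'A2', 'A3'], 'rating': [5.0, 2.0, 4.0]})

# indicator=True adds a `_merge` column that records where each row was found
checked = toy_input.merge(toy_items, on='asin', how='left', indicator=True)

# Rows marked 'left_only' had no match in the item table
missing = checked.loc[checked['_merge'] == 'left_only', 'asin'].tolist()
print(missing)
```

In the run shown above all five phones matched, so this check would return an empty list there.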


Assign a weight to each brand the user rated, in the order in which the brands first appear in usersInput.

In [6]:
brandsOrder={}
currentValue=200
for item in usersInput['brand']:
    if item not in brandsOrder.keys():
        brandsOrder[item]=currentValue
        currentValue-=1
brandsOrder

{'Apple': 200, 'Nokia': 199, 'Motorola': 198, 'Samsung': 197}
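
The same mapping can be sketched more compactly. This is only an equivalent formulation under the assumption that brands should be weighted by order of first appearance; `brands` and `start` are illustrative names, and the series repeats the brand column shown earlier.

```python
import pandas as pd

# Brand column as it appears in usersInput above
brands = pd.Series(['Apple', 'Apple', 'Nokia', 'Motorola', 'Samsung'])

start = 200  # same starting weight as in the cell above

# Series.unique() keeps the order of first appearance, so earlier brands get larger weights
brand_order = {brand: start - i for i, brand in enumerate(brands.unique())}
print(brand_order)
```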

Join the items and reviews tables to build the pool of rated cell phones from which we choose recommendations for the active user.

In [7]:
df = pd.merge(items, reviews, on='asin')
df = df.loc[:, ['asin', 'name', 'price', 'avg-rating', 'rating']]
df

Unnamed: 0,asin,name,price,avg-rating,rating
0,B0009N5L7K,Marcel Thomas,49.95,3.0,1
1,B0009N5L7K,William B.,49.95,3.0,4
2,B0009N5L7K,K. Mcilhargey,49.95,3.0,5
3,B0009N5L7K,Stephen Cahill,49.95,3.0,1
4,B0009N5L7K,Mihir,49.95,3.0,5
...,...,...,...,...,...
56226,B081H6STQQ,jande,948.00,4.5,5
56227,B081H6STQQ,2cool4u,948.00,4.5,5
56228,B081H6STQQ,simon,948.00,4.5,5
56229,B081TJFVCJ,Tobiasz Jedrysiak,478.97,5.0,5


Keep only the reviews of the cell phones that the active user has rated, matched by asin.

In [8]:
otherUsers = df[df['asin'].isin(np.asanyarray(usersInput['asin']))]
otherUsers.head(10)

Unnamed: 0,asin,name,price,avg-rating,rating
6125,B00K0NS0P4,farmfreshk,209.75,4.0,1
6126,B00K0NS0P4,Richard H Everson,209.75,4.0,2
6127,B00K0NS0P4,Daisy S,209.75,4.0,5
6128,B00K0NS0P4,Natty AKA Picky Polly,209.75,4.0,4
6129,B00K0NS0P4,Marcus R,209.75,4.0,5
6130,B00K0NS0P4,J. Shaw,209.75,4.0,5
6131,B00K0NS0P4,M. S. Prager,209.75,4.0,4
6132,B00K0NS0P4,Q Johnson,209.75,4.0,5
6133,B00K0NS0P4,J & G,209.75,4.0,4
6134,B00K0NS0P4,Timmy,209.75,4.0,2


Group the reviews by reviewer name to see how each reviewer rated each of these cell phones.

In [9]:
groupedUsers = otherUsers.groupby('name')
groupedUsers

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C943593D00>

Sort the groups so that the reviewers with the most reviews of these cell phones come first.

In [10]:
groupedUsers = sorted(groupedUsers, key=lambda x: len(x[1]), reverse=True)
groupedUsers

[('Amazon Customer',
               asin             name   price  avg-rating  rating
  34842  B079HB518K  Amazon Customer  199.00         3.9       5
  34846  B079HB518K  Amazon Customer  199.00         3.9       1
  34848  B079HB518K  Amazon Customer  199.00         3.9       2
  34863  B079HB518K  Amazon Customer  199.00         3.9       5
  34888  B079HB518K  Amazon Customer  199.00         3.9       5
  34891  B079HB518K  Amazon Customer  199.00         3.9       5
  34902  B079HB518K  Amazon Customer  199.00         3.9       5
  34903  B079HB518K  Amazon Customer  199.00         3.9       5
  34908  B079HB518K  Amazon Customer  199.00         3.9       5
  34912  B079HB518K  Amazon Customer  199.00         3.9       3
  34929  B079HB518K  Amazon Customer  199.00         3.9       1
  34938  B079HB518K  Amazon Customer  199.00         3.9       1
  34940  B079HB518K  Amazon Customer  199.00         3.9       5
  34948  B079HB518K  Amazon Customer  199.00         3.9       5
  34
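
The output above suggests that 'Amazon Customer' is a generic display name shared by many reviewers (an assumption based on how often it repeats for a single phone), so grouping by name merges them into one very large group. A small sketch on made-up data shows a lighter way to inspect group sizes before relying on them; `toy_reviews` and the `P1`/`P2` identifiers are hypothetical.

```python
import pandas as pd

# Made-up reviews; only the column names follow the notebook
toy_reviews = pd.DataFrame({
    'name': ['Amazon Customer', 'Amazon Customer', 'Amazon Customer',
             'Daisy S', 'Daisy S', 'Timmy'],
    'asin': ['P1', 'P1', 'P2', 'P1', 'P2', 'P2'],
    'rating': [5, 1, 4, 5, 3, 2],
})

# Number of reviews per name, largest first
review_counts = toy_reviews.groupby('name').size().sort_values(ascending=False)

# Number of distinct phones per name; a count of reviews far above this hints at a shared name
phones_per_name = toy_reviews.groupby('name')['asin'].nunique()

print(review_counts)
print(phones_per_name)
```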

Collect the active user's ratings in a list, in the same order as usersInput.

In [11]:
userRating = usersInput['rating'].tolist()
userRating

[4.5, 5.0, 2.0, 3.0, 4.0]

In [12]:
pearsonDic={}
for name, group in groupedUsers:
    # Build a frame with the active user's cell phones and a placeholder rating of zero
    new_df = pd.DataFrame(data={'asin': usersInput['asin'], 'rating': np.zeros(len(usersInput))})
    # Keep only the asin and rating columns for this reviewer
    group = group[['asin', 'rating']]
    # Average the reviewer's ratings per cell phone (a name can appear more than once per phone)
    temp_group = group.groupby(['asin']).mean()
    # Right-merge so that every cell phone the active user rated appears, in the active user's order
    mergedData = temp_group.merge(new_df, on='asin', how='right')
    # rating_x holds the reviewer's mean rating; cell phones the reviewer has not rated become zero
    mergedData['rating'] = mergedData['rating_x'].fillna(0)
    mergedData.drop(['rating_x','rating_y'],axis=1,inplace=True)
    groupRating=mergedData['rating'].tolist()
    # Find the Pearson correlation between the active user's ratings and this reviewer's ratings
    pearsonDic[name]=pearsonr(userRating,groupRating)[0]
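
To make the similarity step concrete, here is a small worked example of the Pearson correlation that the loop above computes. The active user's ratings are the ones listed earlier; the other reviewer is hypothetical and is assumed to have rated only the last phone, with zeros filled in for the rest.

```python
import numpy as np
from scipy.stats import pearsonr

active = [4.5, 5.0, 2.0, 3.0, 4.0]   # the active user's ratings
other = [0.0, 0.0, 0.0, 0.0, 5.0]    # hypothetical reviewer: rated only the last phone

# Library result
r_scipy = pearsonr(active, other)[0]

# The same value from the definition: covariance divided by the product of the spreads
a = np.asarray(active)
b = np.asarray(other)
a_c = a - a.mean()
b_c = b - b.mean()
r_manual = (a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c))

# Pearson correlation ignores positive rescaling, so a 1-star rating gives the same value
r_one_star = pearsonr(active, [0.0, 0.0, 0.0, 0.0, 1.0])[0]

print(r_scipy, r_manual, r_one_star)
```

The result is roughly 0.139, which is consistent with the value shown below for reviewers of B07V5NSD8N. One consequence of filling unrated phones with zero is that, for a reviewer who rated a single phone, the similarity depends on which phone was rated rather than on the rating given.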

Build a similarity table with one row per reviewer.

In [13]:
similarity_df=pd.DataFrame.from_dict(pearsonDic, orient='index')
similarity_df.columns=['similarity']
similarity_df['name']=similarity_df.index
similarity_df.sort_values(by='similarity',ascending=False,inplace=True)
similarity_df.reset_index(drop=True,inplace=True)
# Keep only the similarities that are greater than zero
similarity_df=similarity_df[similarity_df['similarity']>0]
similarity_df

Unnamed: 0,similarity,name
0,0.851390,Amazon Customer
1,0.603510,JaKyra Finley
2,0.603510,KemD
3,0.603510,Kandury Venkata
4,0.603510,Kaitlyn
...,...,...
312,0.139272,Phillipe Oliveira
313,0.139272,Karem Cerrato
314,0.139272,Eleazar Pollack Halford
315,0.139272,Wilson K Kirwa


Bring in every review written by these similar reviewers and calculate a weighted rating for each review by multiplying the reviewer's similarity by the rating.

In [14]:
topUsersRatings=similarity_df.merge(reviews,on='name')
topUsersRatings['weighted']=topUsersRatings['similarity']*topUsersRatings['rating']
topUsersRatings

Unnamed: 0,similarity,name,asin,rating,weighted
0,0.851390,Amazon Customer,B0000SX2UC,3,2.554169
1,0.851390,Amazon Customer,B000SKTZ0S,1,0.851390
2,0.851390,Amazon Customer,B001AO4OUC,5,4.256949
3,0.851390,Amazon Customer,B002AS9WEA,2,1.702779
4,0.851390,Amazon Customer,B002AS9WEA,3,2.554169
...,...,...,...,...,...
7591,0.139272,Phillipe Oliveira,B07V5NSD8N,5,0.696358
7592,0.139272,Karem Cerrato,B07V5NSD8N,5,0.696358
7593,0.139272,Eleazar Pollack Halford,B07V5NSD8N,5,0.696358
7594,0.139272,Wilson K Kirwa,B07V5NSD8N,4,0.557086


Build the weighted matrix: for each cell phone, the sum of the similarities and the sum of the weighted ratings.

In [15]:
# Select the numeric columns before summing so that the text column 'name' is not aggregated
weightedMatrix=topUsersRatings.groupby('asin')[['similarity','weighted']].sum()
weightedMatrix.columns=['sum_similarity','sum_weighted']
weightedMatrix

asin,sum_similarity,sum_weighted
B0000SX2UC,0.851390,2.554169
B000SKTZ0S,1.454900,3.265429
B001AO4OUC,0.851390,4.256949
B002AS9WEA,1.702779,4.256949
B002UHS0UI,12.662238,30.061423
...,...,...
B07Z8BL2VW,0.851390,4.256949
B07ZDJCL76,3.900460,16.096743
B07ZP9ZWFM,0.371391,1.856953
B07ZPKZSSC,0.851390,0.851390
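
Dividing the second column by the first gives a similarity-weighted average rating per phone. A minimal sketch on made-up numbers (the reviewers `u1` and `u2`, the phones `X` and `Y`, and all values are hypothetical):

```python
import pandas as pd

# Two hypothetical reviewers with similarities 0.8 and 0.4
toy = pd.DataFrame({
    'name': ['u1', 'u2', 'u1'],
    'asin': ['X', 'X', 'Y'],
    'similarity': [0.8, 0.4, 0.8],
    'rating': [5, 2, 3],
})
toy['weighted'] = toy['similarity'] * toy['rating']

# Per phone: sum of similarities and sum of weighted ratings
sums = toy.groupby('asin')[['similarity', 'weighted']].sum()

# Weighted average: X -> (0.8*5 + 0.4*2) / (0.8 + 0.4), Y -> (0.8*3) / 0.8
scores = sums['weighted'] / sums['similarity']
print(scores)
```

Phone X ends up close to 4 because the more similar reviewer's rating counts twice as much as the other one's. A phone rated by a single reviewer, such as Y, simply receives that reviewer's rating, whatever the similarity.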


Make recommendation matrix.

In [16]:
# Calculate the average score by dividing the sum of weighted ratings by the sum of similarities
recommendation_df=pd.DataFrame(data={'asin':weightedMatrix.index,'average_score':weightedMatrix['sum_weighted']/weightedMatrix['sum_similarity']})
# Reset the index of the DataFrame
recommendation_df.reset_index(drop=True,inplace=True)
# Remove the cell phones that the active user has already rated
recommendation_df=recommendation_df[~recommendation_df['asin'].isin(usersInput['asin'].tolist())]
# Add the other features of the remaining cell phones so they can be shown to the user
recommendation_df=pd.merge(recommendation_df,totalItems[['asin','brand','url','image','price']],on='asin')
# Remove cell phones priced at 10 dollars or less
recommendation_df = recommendation_df[recommendation_df['price'] >10]
# Keep only cell phones whose brand matches a brand that the active user rated
recommendation_df=recommendation_df[recommendation_df['brand'].isin(usersInput['brand'].tolist())].copy()
# Replace each brand name with its weight so that it can serve as the secondary sort key
recommendation_df['brand']=recommendation_df['brand'].map(brandsOrder)
recommendation_df.sort_values(by=['average_score','brand'],inplace=True,ascending=False)
recommendation_df.reset_index(drop=True,inplace=True)
recommendation_df

Unnamed: 0,asin,average_score,brand,url,image,price
0,B06X9X15Y8,5.0,200,https://www.amazon.com/Apple-iPhone-Fully-Unlo...,https://m.media-amazon.com/images/I/61C5dT6qy0...,299.99
1,B076KC34PM,5.0,200,https://www.amazon.com/Apple-iPhone-Unlocked-S...,https://m.media-amazon.com/images/I/61XjpQucvy...,334.98
2,B07753NSQZ,5.0,200,https://www.amazon.com/Apple-iPhone-GSM-Unlock...,https://m.media-amazon.com/images/I/810MbmOEoq...,349.80
3,B077NJQPGB,5.0,200,https://www.amazon.com/Apple-iPhone-Fully-Unlo...,https://m.media-amazon.com/images/I/71A45XDJ09...,256.82
4,B077NTKFDB,5.0,200,https://www.amazon.com/Apple-iPhone-Unlocked-Q...,https://m.media-amazon.com/images/I/61C5dT6qy0...,329.97
...,...,...,...,...,...,...
318,B07QFS3L4G,1.0,197,https://www.amazon.com/Samsung-Galaxy-M20-Unlo...,https://m.media-amazon.com/images/I/418T9z1d6z...,195.00
319,B07STDQDKF,1.0,197,https://www.amazon.com/Samsung-Galaxy-Verizon-...,https://m.media-amazon.com/images/I/71nHQ3SWE+...,534.95
320,B07STFZQ9Y,1.0,197,https://www.amazon.com/Samsung-Galaxy-Verizon-...,https://m.media-amazon.com/images/I/61PoeRICr+...,596.99
321,B07SVFGKCJ,1.0,197,https://www.amazon.com/Samsung-Galaxy-Verizon-...,https://m.media-amazon.com/images/I/41D0pyBkdj...,538.99
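
In the table above the brand names have been replaced by their weights, which makes the result harder to read. One possible alternative, sketched here on made-up rows, keeps the brand name and sorts on a separate helper column; `brand_rank`, `rec`, and the `R1`..`R3` identifiers are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up recommendation rows; the weights repeat the brandsOrder values shown earlier
rec = pd.DataFrame({
    'asin': ['R1', 'R2', 'R3'],
    'average_score': [5.0, 5.0, 4.0],
    'brand': ['Samsung', 'Apple', 'Nokia'],
})
brand_order = {'Apple': 200, 'Nokia': 199, 'Motorola': 198, 'Samsung': 197}

# Keep the readable brand name and put the weight in its own column
rec['brand_rank'] = rec['brand'].map(brand_order)

# Sort by score first, then by brand weight, both descending
rec = rec.sort_values(by=['average_score', 'brand_rank'], ascending=False)
rec = rec.reset_index(drop=True)
print(rec)
```

With equal scores, the Apple row is placed ahead of the Samsung row because Apple has the larger weight.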


Save the output as HTML files so that it can be displayed.

In [17]:
recommendation_df.to_html('C:/Users/ASUS/Data mining/Presentation/result/recommendation.html',justify='center')
usersInput.to_html('C:/Users/ASUS/Data mining/Presentation/result/usersInput.html',justify='center')
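
The paths above are specific to one machine. As a sketch of a more portable variant, the example below writes to a folder relative to the working directory; the folder name `result` and the toy frame are assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd

# Hypothetical output folder, relative to the current working directory
out_dir = Path('result')
out_dir.mkdir(parents=True, exist_ok=True)

# Toy frame standing in for recommendation_df
toy = pd.DataFrame({'asin': ['R1', 'R2'], 'average_score': [5.0, 4.0]})

# DataFrame.to_html accepts a path-like target and writes the table markup to it
target = out_dir / 'recommendation.html'
toy.to_html(target, justify='center')
print(target.exists())
```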