# Sentiment Analysis of Amazon reviews


#### Dataset:
http://deepyeti.ucsd.edu/jianmo/amazon/index.html

Software category: 459,436 reviews; metadata for 26,815 products.

### Evaluating customer reviews on software products sold on Amazon.

In [50]:
import pandas as pd
import numpy as np



In [52]:
# load dataset (one JSON object per line)
df = pd.read_json(r'Software.json', lines=True)

In [53]:
df.shape

(459436, 12)

In [54]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,4,True,"03 11, 2014",A240ORQ2LF9LUI,77613252,{'Format:': ' Loose Leaf'},Michelle W,The materials arrived early and were in excell...,Material Great,1394496000,,
1,4,True,"02 23, 2014",A1YCCU0YRLS0FE,77613252,{'Format:': ' Loose Leaf'},Rosalind White Ames,I am really enjoying this book with the worksh...,Health,1393113600,,
2,1,True,"02 17, 2014",A1BJHRQDYVAY2J,77613252,{'Format:': ' Loose Leaf'},Allan R. Baker,"IF YOU ARE TAKING THIS CLASS DON""T WASTE YOUR ...",ARE YOU KIDING ME?,1392595200,7.0,
3,3,True,"02 17, 2014",APRDVZ6QBIQXT,77613252,{'Format:': ' Loose Leaf'},Lucy,This book was missing pages!!! Important pages...,missing pages!!,1392595200,3.0,
4,5,False,"10 14, 2013",A2JZTTBSLS1QXV,77775473,,Albert V.,I have used LearnSmart and can officially say ...,Best study product out there!,1381708800,,


### Data Exploration

Understanding the data provided and identifying the data needed for evaluation.
Deciding which values need to be excluded.


In [6]:
# Empty values in each column 
df.isnull().sum()

overall                0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             225035
reviewerName          24
reviewText            66
summary               56
unixReviewTime         0
vote              331583
image             457928
dtype: int64

In [7]:
df["verified"].count()

459436

In [8]:
# Number of unverified purchases (count the rows where 'verified' is False)
(~df['verified']).sum()


150091

The 150,091 unverified purchases make up roughly a third of all reviews. Excluding them would discard too much data, so they are kept.
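Whether keeping the unverified reviews distorts the picture can be checked by comparing the two groups. A minimal sketch on made-up rows (the `toy` frame and its values are assumptions for illustration; on the real data the same lines would run on `df`):

```python
import pandas as pd

# Toy stand-in for the review data; values are invented for illustration
toy = pd.DataFrame({
    "verified": [True, False, True, True, False, True],
    "overall":  [5, 1, 4, 5, 2, 3],
})

# The mean of a boolean column is the fraction of True values
unverified_share = (~toy["verified"]).mean()

# Mean star rating per group, to see whether unverified reviews rate differently
mean_rating = toy.groupby("verified")["overall"].mean()

print(f"unverified share: {unverified_share:.2%}")
print(mean_rating)
```

If the two groups rate very differently on the real data, it may be worth reporting results for them separately instead of dropping either one.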

In [39]:
#number of individual reviewers
df['reviewerID'].nunique()

375142

In [40]:
# Timeframe of the dataset (reviewTime holds strings, so min/max compare text, not dates)

print(df.reviewTime.min())
print(df.reviewTime.max())

01 1, 2000
12 9, 2017
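Because `reviewTime` is stored as text, `min()` and `max()` compare the strings character by character rather than chronologically. The rows listed further down with missing review text include dates in 2018, so the range printed above probably understates the real timeframe. A sketch of a chronological comparison using `unixReviewTime`, on three rows copied from this notebook (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Three (reviewTime, unixReviewTime) pairs taken from rows shown in this notebook
toy = pd.DataFrame({
    "reviewTime": ["03 11, 2014", "10 14, 2013", "05 3, 2018"],
    "unixReviewTime": [1394496000, 1381708800, 1525305600],
})

# String comparison: "10 14, 2013" sorts last because '1' > '0' in the first character
print(toy["reviewTime"].min(), "|", toy["reviewTime"].max())

# Chronological comparison: convert the Unix seconds to timestamps first
dates = pd.to_datetime(toy["unixReviewTime"], unit="s")
print(dates.min().date(), "|", dates.max().date())
```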


In [11]:
#number of products reviewed
df['asin'].nunique()

21663

In [12]:
#Stars rating
df['overall'].value_counts()

5    212452
1    102548
4     73596
3     39395
2     31445
Name: overall, dtype: int64

The data covers a wide range of ratings, with 5-star and 1-star reviews being the two largest groups, which makes it an interesting basis for the analysis.

### Data preparation

The language analysis needs the text itself, so there is no point in keeping entries without it. Rows with a missing `reviewText` or `summary` are therefore dropped.

reviewText            66

summary               56

In [42]:
# Null values
df.isnull().sum()


overall                0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             225030
reviewerName          24
reviewText            60
summary               50
unixReviewTime         0
vote              331577
image             457922
dtype: int64

In [55]:
#Taking a closer look at the NaN values before deleting

df[df['reviewText'].isna()]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
54226,5,True,"06 10, 2012",A1J59HUIE22VK6,B000XYUSMI,,chale,,hallmark/kodak,1339286400,,
62732,5,True,"06 10, 2012",A1J59HUIE22VK6,B000XYUSMI,,chale,,hallmark/kodak,1339286400,,
70273,4,True,"07 11, 2014",A16YPRS120VV5R,B001B057U6,,Boon Kiat,,Four Stars,1405036800,,
86543,5,True,"05 3, 2018",A461O3V81H0GY,B002IKIHEG,{'Format:': ' DVD'},joe,,Five Stars,1525305600,,[https://images-na.ssl-images-amazon.com/image...
86904,5,True,"11 19, 2014",A4EFRSUB5W8Y7,B002IKIHEG,{'Format:': ' Amazon Video'},Marlin G,,Five Stars,1416355200,,
...,...,...,...,...,...,...,...,...,...,...,...,...
453877,5,True,"07 5, 2018",A1HUGNK5PTQBV6,B0130P9E0I,,Amazon Customer,,,1530748800,,
454201,5,True,"04 2, 2018",A1K19WSFBM8C6C,B013EXF9T6,,Joe Mama,,Five Stars,1522627200,,
456027,1,True,"07 18, 2018",AUGXBAROCF67,B01617VPUY,{'Platform:': ' PC/MacDisc'},JLO,,They sent a basic knowingly . After advertisin...,1531872000,,[https://images-na.ssl-images-amazon.com/image...
456446,5,False,"03 28, 2016",A109P5888SV0N1,B016RRQD5A,,a b,,Five Stars,1459123200,,


In [56]:
# NaN values in the summary column
df[df['summary'].isna()]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
5467,5,True,"08 29, 2007",AGZDSR4R8SA2S,B000050ZRE,,Kelly Carlson,I know little about computers. I wanted to be ...,,1188345600,3.0,
6116,4,False,"03 7, 2002",A15S4XW3CRISZ5,B00005AFI4,,Andre Da Costa,Microsoft Publisher 2002 contains new features...,,1015459200,38.0,
16504,5,True,"08 29, 2007",AGZDSR4R8SA2S,B0001FS9NE,,Kelly Carlson,I know little about computers. I wanted to be ...,,1188345600,,
31221,5,True,"08 29, 2007",AGZDSR4R8SA2S,B000EORV8Q,,Kelly Carlson,I know little about computers. I wanted to be ...,,1188345600,,
51457,5,True,"08 29, 2007",AGZDSR4R8SA2S,B0000AZJY6,,Kelly Carlson,I know little about computers. I wanted to be ...,,1188345600,3.0,
56135,5,True,"06 27, 2008",ALD8MK1VA5BQK,B0013OAHTG,,G. Sabio,"First of all, I want to say that the quality o...",,1214524800,6.0,
59963,5,True,"08 29, 2007",AGZDSR4R8SA2S,B0000AZJY6,,Kelly Carlson,I know little about computers. I wanted to be ...,,1188345600,3.0,
64641,5,True,"06 27, 2008",ALD8MK1VA5BQK,B0013OAHTG,,G. Sabio,"First of all, I want to say that the quality o...",,1214524800,6.0,
68884,5,True,"04 11, 2015",AT71IDXZWU9EG,B001AMHWP8,,jonathan,Fully satisfied,,1428710400,,
128674,3,True,"12 10, 2013",A15NSG16V2LJ8N,B005S4Y13K,{'Format:': ' Software Download'},NickR,The devil you know is better than the devil yo...,,1386633600,,


While looking through the review texts with a missing summary, it becomes noticeable that one customer has repeatedly entered the same text for different products.

In [23]:
kellyC = df[df['reviewerID'] == 'AGZDSR4R8SA2S'] 
print(kellyC)

       overall  verified   reviewTime     reviewerID        asin style  \
5467         5      True  08 29, 2007  AGZDSR4R8SA2S  B000050ZRE   NaN   
16504        5      True  08 29, 2007  AGZDSR4R8SA2S  B0001FS9NE   NaN   
31221        5      True  08 29, 2007  AGZDSR4R8SA2S  B000EORV8Q   NaN   
51457        5      True  08 29, 2007  AGZDSR4R8SA2S  B0000AZJY6   NaN   
59963        5      True  08 29, 2007  AGZDSR4R8SA2S  B0000AZJY6   NaN   

        reviewerName                                         reviewText  \
5467   Kelly Carlson  I know little about computers. I wanted to be ...   
16504  Kelly Carlson  I know little about computers. I wanted to be ...   
31221  Kelly Carlson  I know little about computers. I wanted to be ...   
51457  Kelly Carlson  I know little about computers. I wanted to be ...   
59963  Kelly Carlson  I know little about computers. I wanted to be ...   

      summary  unixReviewTime vote image  
5467      NaN      1188345600    3   NaN  
16504     NaN     

In [205]:
kellyCReview = df[df['reviewerID'] == 'AGZDSR4R8SA2S'].sum() 

In [25]:
print(kellyCReview)

overall                                                          25
verified                                                          5
reviewTime        08 29, 200708 29, 200708 29, 200708 29, 200708...
reviewerID        AGZDSR4R8SA2SAGZDSR4R8SA2SAGZDSR4R8SA2SAGZDSR4...
asin              B000050ZREB0001FS9NEB000EORV8QB0000AZJY6B0000A...
style                                                             0
reviewerName      Kelly CarlsonKelly CarlsonKelly CarlsonKelly C...
reviewText        I know little about computers. I wanted to be ...
summary                                                           0
unixReviewTime                                           5941728000
image                                                             0
dtype: object


In [28]:
df[df['reviewerID']=='AGZDSR4R8SA2S']['reviewText'] 

5467     I know little about computers. I wanted to be ...
16504    I know little about computers. I wanted to be ...
31221    I know little about computers. I wanted to be ...
51457    I know little about computers. I wanted to be ...
59963    I know little about computers. I wanted to be ...
Name: reviewText, dtype: object

When taking a closer look it appears that the text is identical for multiple products.
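This kind of check can also be done for all reviewers at once instead of one ID at a time. A minimal sketch, assuming the column names of this dataset and using invented rows (`toy`, `reuse` and `n_products` are illustrative names):

```python
import pandas as pd

# Invented rows: reviewer A1 posts the same text on two products, one of them twice
toy = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A1", "B2", "C3"],
    "asin":       ["P1", "P2", "P2", "P1", "P3"],
    "reviewText": ["same text", "same text", "same text", "works fine", "same text"],
})

# Rows that exactly repeat an earlier row (same reviewer, product and text)
exact_dupes = toy.duplicated(subset=["reviewerID", "asin", "reviewText"], keep="first")

# Number of distinct products on which each reviewer posted the same text
reuse = (
    toy.groupby(["reviewerID", "reviewText"])["asin"]
       .nunique()
       .reset_index(name="n_products")
)
reused = reuse[reuse["n_products"] > 1]

print("exact duplicate rows:", int(exact_dupes.sum()))
print(reused)
```

Exact repeats could then be removed with `drop_duplicates` on the same subset, while text reused across products is better inspected by hand before deciding.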

In [57]:
#Review for a WIFI extender
df.iloc[16504]['reviewText']

"I know little about computers. I wanted to be able to use the desktop without having to have our wireless router on all the time. Ordered this, at first it scared me that there was really no instructions. I plugged in the Cat5 from the cable box, ran one cat 5 to desktop, one to wireless router, turned it all back on, and everything works!!! If it's easy enough that I can do it, anyone can. <G>"

In [58]:
#Review for a 4port HUB
df.iloc[31221]['reviewText']

"I know little about computers. I wanted to be able to use the desktop without having to have our wireless router on all the time. Ordered this, at first it scared me that there was really no instructions. I plugged in the Cat5 from the cable box, ran one cat 5 to desktop, one to wireless router, turned it all back on, and everything works!!! If it's easy enough that I can do it, anyone can. <G>"

In [38]:
# checking the time of the reviews (the Unix timestamp is interpreted as UTC)
from datetime import datetime, timezone

unix_ts = 1188345600
reviewTime = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
print(reviewTime)

2007-08-29 00:00:00


These are probably not genuine reviews: the texts are identical, they carry the same timestamp, and they are attached to different products.
Since their summary is missing, they are removed anyway when the rows with empty text columns are dropped.

In [63]:
# dropping rows with empty values in either of the two columns
df.dropna(how='any', subset=['reviewText', 'summary'], axis=0, inplace=True)


In [64]:
# checking if there are any null values left
df.isnull().sum()

overall                0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             224977
reviewerName          24
reviewText             0
summary                0
unixReviewTime         0
vote              331479
image             457822
dtype: int64

### Sentiment Analysis

In [65]:

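import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; download it once if it is not installed yet
# nltk.download('vader_lexicon')

# create the sentiment analyzer object
sid = SentimentIntensityAnalyzer()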

In [67]:
#polarity score analysis on the reviewText column
df['scores'] = df['reviewText'].apply(lambda review:sid.polarity_scores(review))

In [70]:
#polarity score analysis on the summary column
df['summary_scores'] = df['summary'].apply(lambda review:sid.polarity_scores(review))

The next step is to display the results, restricted to the customers' overall rating (from 5 to 1) and the two score columns: `scores` (computed from `reviewText`) and `summary_scores` (computed from `summary`).

In [73]:
# show results of the polarity score analysis
df[['overall', 'scores', 'summary_scores']].head()

Unnamed: 0,overall,scores,summary_scores
0,4,"{'neg': 0.0, 'neu': 0.802, 'pos': 0.198, 'comp...","{'neg': 0.0, 'neu': 0.196, 'pos': 0.804, 'comp..."
1,4,"{'neg': 0.0, 'neu': 0.89, 'pos': 0.11, 'compou...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
2,1,"{'neg': 0.123, 'neu': 0.837, 'pos': 0.04, 'com...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
3,3,"{'neg': 0.139, 'neu': 0.786, 'pos': 0.074, 'co...","{'neg': 0.736, 'neu': 0.264, 'pos': 0.0, 'comp..."
4,5,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp...","{'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'comp..."
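Each cell in the score columns holds a dictionary, which is hard to read and to filter on. One option is to spread the dictionary keys into separate columns. A minimal sketch with hand-written dictionaries in the shape shown above (the values and the `review_` prefix are assumptions for illustration):

```python
import pandas as pd

# Hand-written score dictionaries with the four keys shown above (illustrative values)
toy = pd.DataFrame({
    "overall": [4, 1],
    "scores": [
        {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.69},
        {"neg": 0.12, "neu": 0.84, "pos": 0.04, "compound": -0.73},
    ],
})

# Build one column per dictionary key; reusing toy.index keeps the rows aligned
# even when the index has gaps (for example after dropna)
expanded = pd.DataFrame(toy["scores"].tolist(), index=toy.index).add_prefix("review_")
result = pd.concat([toy[["overall"]], expanded], axis=1)

print(result)
```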


In [90]:
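# extract the compound value from each score dictionary

df['compoundRev']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
# fixed: read from 'summary_scores'; the outputs shown below were produced while this
# line still read from 'scores', which is why both columns are identical there
df['compoundSum']  = df['summary_scores'].apply(lambda score_dict: score_dict['compound'])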

In [93]:
df[['overall', 'scores', 'summary_scores', 'compoundRev', 'compoundSum']].head()

Unnamed: 0,overall,scores,summary_scores,compoundRev,compoundSum
0,4,"{'neg': 0.0, 'neu': 0.802, 'pos': 0.198, 'comp...","{'neg': 0.0, 'neu': 0.196, 'pos': 0.804, 'comp...",0.687,0.687
1,4,"{'neg': 0.0, 'neu': 0.89, 'pos': 0.11, 'compou...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.5709,0.5709
2,1,"{'neg': 0.123, 'neu': 0.837, 'pos': 0.04, 'com...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",-0.7307,-0.7307
3,3,"{'neg': 0.139, 'neu': 0.786, 'pos': 0.074, 'co...","{'neg': 0.736, 'neu': 0.264, 'pos': 0.0, 'comp...",-0.3753,-0.3753
4,5,"{'neg': 0.0, 'neu': 0.868, 'pos': 0.132, 'comp...","{'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'comp...",0.938,0.938


In [96]:
df[['overall', 'compoundRev', 'compoundSum']].head()

| | overall | compoundRev | compoundSum |
|---|---|---|---|
| 0 | 4 | 0.687 | 0.687 |
| 1 | 4 | 0.5709 | 0.5709 |
| 2 | 1 | -0.7307 | -0.7307 |
| 3 | 3 | -0.3753 | -0.3753 |
| 4 | 5 | 0.938 | 0.938 |


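The compound values shown for the review text and the summary are identical. This is not a finding about the data: in the displayed run `compoundSum` was filled from the `scores` column rather than from `summary_scores`, so both columns hold the review-text compound. The `summary_scores` dictionaries themselves clearly differ from `scores` (row 0: `pos` 0.804 versus 0.198).
For better readability the compound is displayed as a predicted label, either positive or negative.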

In [98]:
# return 'pos' if the compound score is greater than or equal to 0, else return 'neg'
df['reviewLabel'] = df['compoundRev'].apply(lambda c: 'pos' if c >=0 else 'neg')
df['summaryLabel'] = df['compoundSum'].apply(lambda c: 'pos' if c >=0 else 'neg')
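With `>= 0` as the cut-off, a compound of exactly 0.0 (no sentiment-bearing words found) is labeled positive. The VADER authors suggest a neutral band instead: positive at 0.05 and above, negative at -0.05 and below, neutral in between. A sketch of that three-way labeling on a few example values (the `compound` series here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Example compound scores, including values inside the neutral band
compound = pd.Series([0.687, 0.0, -0.7307, 0.03, -0.05])

# First matching condition wins; everything else falls into the neutral default
labels = np.select(
    [compound >= 0.05, compound <= -0.05],
    ["pos", "neg"],
    default="neu",
)

print(labels.tolist())
```

Whether a neutral class is useful here depends on how the labels are later compared with the star ratings, so this is an alternative to consider rather than a replacement.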

In [105]:
df[['overall', 'compoundRev', 'compoundSum', 'reviewLabel', 'summaryLabel']].head(20)

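| | overall | compoundRev | compoundSum | reviewLabel | summaryLabel |
|---|---|---|---|---|---|
| 0 | 4 | 0.687 | 0.687 | pos | pos |
| 1 | 4 | 0.5709 | 0.5709 | pos | pos |
| 2 | 1 | -0.7307 | -0.7307 | neg | neg |
| 3 | 3 | -0.3753 | -0.3753 | neg | neg |
| 4 | 5 | 0.938 | 0.938 | pos | pos |
| 5 | 4 | 0.7184 | 0.7184 | pos | pos |
| 6 | 3 | 0.4404 | 0.4404 | pos | pos |
| 7 | 5 | 0.8399 | 0.8399 | pos | pos |
| 8 | 5 | 0.8955 | 0.8955 | pos | pos |
| 9 | 5 | 0.7391 | 0.7391 | pos | pos |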


Because both compound columns hold the same values in the displayed output, the review label and the summary label agree in every row.
Compared with the overall rating left by the customers, however, the sentiment label does not always match. In the rows shown, ratings of 4 and 5 are all labeled positive, while the lower ratings are mixed: of the two 3-star reviews, one is labeled negative and the other positive.
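The impression from the first rows can be quantified for the whole table with a cross-tabulation of star rating against predicted label. A minimal sketch on invented rows (the `toy` frame is an assumption; on the real data the two columns would come from `df`):

```python
import pandas as pd

# Invented ratings and labels for illustration
toy = pd.DataFrame({
    "overall":     [5, 5, 4, 3, 3, 1, 1, 1],
    "reviewLabel": ["pos", "pos", "pos", "pos", "neg", "neg", "pos", "neg"],
})

# Share of 'neg' and 'pos' labels within each star rating (each row sums to 1)
share = pd.crosstab(toy["overall"], toy["reviewLabel"], normalize="index")

print(share.round(2))
```

A rating whose row is close to 50/50 is one where the sign of the compound score says little about the customer's verdict.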


### Evaluating Accuracy

To compare the predictions against the overall rating left by the customers, I have to go a step back and convert the compound scores into numeric labels from 5 to 1.


In [193]:
# list of conditions (np.select uses the first condition that matches)
conditions = [
    (df['compoundRev'] >= 0.8),  # no upper bound, so a compound of exactly 1.0 is not left at the default 0
    (df['compoundRev'] < 0.8) & (df['compoundRev'] >= 0.6),
    (df['compoundRev'] < 0.6) & (df['compoundRev'] >= 0.4),
    (df['compoundRev'] < 0.4) & (df['compoundRev'] >= 0.2),
    (df['compoundRev'] < 0.2)
    ]

# list of assigned values for each condition
values = [5, 4, 3, 2, 1]

# assign values to the new column by using np.select()
df['newLabel'] = np.select(conditions, values)
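`np.select` falls back to its default of 0 for any row that matches none of the conditions, so the conditions have to cover the whole value range. An equivalent way to express the same cut points is `pd.cut` with open-ended outer bins. A sketch using the compound values from the rows shown earlier plus one boundary value (the `compound` series and `edges` list are illustrative):

```python
import pandas as pd

# First five compound values from the table above, plus the upper boundary 1.0
compound = pd.Series([0.687, 0.5709, -0.7307, -0.3753, 0.938, 1.0])

# Same cut points as the conditions above, written as bin edges.
# right=False makes each bin closed on the left: [0.2, 0.4), [0.4, 0.6), ...
edges = [-float("inf"), 0.2, 0.4, 0.6, 0.8, float("inf")]
new_label = pd.cut(compound, bins=edges, labels=[1, 2, 3, 4, 5], right=False).astype(int)

print(new_label.tolist())
```

Note that with these cut points every compound below 0.2, including all negative scores, ends up as label 1, while the positive range is split into four labels. The cut points are a modelling choice, not something VADER prescribes.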


In [194]:

df[['overall', 'compoundRev', 'newLabel']].head(20) 


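| | overall | compoundRev | newLabel |
|---|---|---|---|
| 0 | 4 | 0.687 | 4 |
| 1 | 4 | 0.5709 | 3 |
| 2 | 1 | -0.7307 | 1 |
| 3 | 3 | -0.3753 | 1 |
| 4 | 5 | 0.938 | 5 |
| 5 | 4 | 0.7184 | 4 |
| 6 | 3 | 0.4404 | 3 |
| 7 | 5 | 0.8399 | 5 |
| 8 | 5 | 0.8955 | 5 |
| 9 | 5 | 0.7391 | 4 |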


A first look at the two columns shows that not all new labels derived from the text analysis match the customers' ratings.
The next step is therefore to compute the accuracy score.

In [195]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [196]:
accuracy_score(df['overall'], df['newLabel'])  # argument order is (y_true, y_pred)

0.36623704606810065

As expected, the overall accuracy is quite low at about 0.37.

In [199]:
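# classification report. scikit-learn expects (y_true, y_pred); the predicted label is
# passed first here, so 'support' counts newLabel values and precision/recall are swapped
# relative to treating the customer rating as ground truth
print(classification_report(df['newLabel'],df['overall']))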

              precision    recall  f1-score   support

           1       0.64      0.45      0.53    146757
           2       0.08      0.08      0.08     31294
           3       0.12      0.08      0.09     58284
           4       0.19      0.16      0.18     85921
           5       0.39      0.60      0.47    137064

    accuracy                           0.37    459320
   macro avg       0.28      0.27      0.27    459320
weighted avg       0.37      0.37      0.36    459320



In [201]:
# confusion matrix (rows: newLabel 1 to 5, columns: overall 1 to 5, because newLabel is passed first)
print(confusion_matrix(df['newLabel'],df['overall']))

[[65482 17158 16443 15862 31812]
 [ 6942  2361  3783  5958 12250]
 [ 8228  2712  4566 10945 31833]
 [ 8829  3281  5095 14024 54692]
 [13047  5930  9503 26797 81787]]
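A few reference numbers can be read directly off this matrix. Since `newLabel` is passed as the first argument, the rows correspond to the predicted label and the columns to the customer rating. The sketch below recomputes the accuracy, compares it with the naive strategy of always predicting the most frequent rating, and checks how often the prediction is at most one star off (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix as printed above: rows = newLabel (1..5), columns = overall (1..5)
cm = np.array([
    [65482, 17158, 16443, 15862, 31812],
    [ 6942,  2361,  3783,  5958, 12250],
    [ 8228,  2712,  4566, 10945, 31833],
    [ 8829,  3281,  5095, 14024, 54692],
    [13047,  5930,  9503, 26797, 81787],
])

total = cm.sum()

# Exact matches sit on the diagonal
accuracy = np.trace(cm) / total

# Column sums are the number of reviews per customer rating;
# always predicting the most frequent rating gets that share right
rating_counts = cm.sum(axis=0)
baseline = rating_counts.max() / total

# Share of predictions that differ from the rating by at most one star
idx = np.arange(5)
within_one = cm[np.abs(idx[:, None] - idx[None, :]) <= 1].sum() / total

print(f"accuracy:   {accuracy:.3f}")
print(f"baseline:   {baseline:.3f}")
print(f"within one: {within_one:.3f}")
```

On this matrix the accuracy is about 0.37, the majority baseline about 0.46, and roughly 0.65 of the predictions land within one star. The threshold-based labels therefore do worse than always guessing 5 stars, which supports reading the result as a limit of the approach rather than as a usable classifier.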


### Explanation

In [202]:
df.iloc[19]['reviewText']

"Disappointing textbook. To start, the lack of color is dismal, but of less importance. However, from a MARKETING book, I expected it to be a little more eye-grabbing.\nMore importantly this book regularly cites wikipedia as a source, and I caught at least two examples of incorrect information when discussing how companies fit certain profiles. In one example it states that Kit Kat is a Nestle brand when it is produced by Hershey's. It also uses companies as examples that have long since been bought out by other companies. While neither of these two items errodes the core idea of marketing, they were just easily spotted. It makes it difficult to trust the core information when something so simple as a little research could correct these errors. Along with the wikipedia citing, I don't know that I would trust this source as 'authoritive'"

In [203]:
df.iloc[10]['reviewText']

"Maybe it's just me (I have no marketing background but desperately want to learn for my start-up) but I cannot get hardly anything out of this text. I have tried very hard to tread through the writing and learn something useful but chapter after chapter seems to be the exact same thing... overly wordy, rambling & unnecessarily academic writing with no comprehensible message behind it, no strategies I can apply to my own business and no organization that I can make sense of. Feels almost like I'm reading a paper a college student BSed their way through with a whole bunch of long words and cryptic sentences in order to sound impressive. What a rip-off. I paid $70 for this and it has been the most useless book I've paid for in my research process (and I've read 16 other books thus far to help me with my venture).\n\nThe only reason I'm giving this two stars and not one star is that out of the first six chapters I've read, two of them actually made sense. One was on International Marketin

In both cases the criticism is carried by the tone and by the examples given rather than by explicitly negative vocabulary. The many neutral words, together with words classified as positive, therefore outweigh the clearly negative expressions.

In the second example the reviewer stresses how bad the text is by pointing out the only two things that were good about the book. Expressions like "did a very good job explaining" or "very helpful information", which a human reader easily understands in context as the exception to the norm, have the opposite effect on the computed analysis.

The review with index number 2, by contrast, is an example of a text that is recognized as negative.

In [204]:
df.iloc[2]['reviewText']

'IF YOU ARE TAKING THIS CLASS DON"T WASTE YOUR MONEY ON THIS SO CALLED BOOK! $140.00 FOR A "BOOK" THAT ISIN\'T EVEN BOUND LOOSE LEAFS, THAT I HAD TO PROVIDE MY OWN BINDER FOR. TURNS OUT YOU CAN BUY ACCESS TO THE BOOK AT MCGRAW HILL CONNECT CORE FOR $70.00\n\nTHIS BOOK IS A COMPLETE WASTE OF MONEY!'