# Exploring our Amazon Consumer Reviews dataset

The goal of this project is to analyse the sentiment of customer reviews across a variety of Amazon products and then develop predictive sentiment models. After that, I will apply topic modelling techniques to each sentiment category in order to pinpoint its key topics and gain a more precise understanding of customer feedback.

*NOTE: To see the Plotly figures, please run the notebook in either Jupyter or Colab.*

In [23]:
import pandas as pd
import pandas_profiling as pp
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import graph_objs as go
import plotly.express as px
%matplotlib inline
import spacy
import string
import re
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
import nltk
from nltk.corpus import stopwords

## The Amazon reviews dataset

In [24]:
amz = pd.read_csv('../Amazon product reviews dataset/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
amz.head()

| | id | dateAdded | dateUpdated | name | asins | brand | categories | primaryCategories | imageURLs | keys | ... | reviews.didPurchase | reviews.doRecommend | reviews.id | reviews.numHelpful | reviews.rating | reviews.sourceURLs | reviews.text | reviews.title | reviews.username | sourceURLs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 3 | https://www.amazon.com/product-reviews/B00QWO9... | I order 3 of them and one of the item is bad q... | ... 3 of them and one of the item is bad quali... | Byger yang | https://www.barcodable.com/upc/841710106442,ht... |
| 1 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 4 | https://www.amazon.com/product-reviews/B00QWO9... | Bulk is always the less expensive way to go fo... | ... always the less expensive way to go for pr... | ByMG | https://www.barcodable.com/upc/841710106442,ht... |
| 2 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | Well they are not Duracell but for the price i... | ... are not Duracell but for the price i am ha... | BySharon Lambert | https://www.barcodable.com/upc/841710106442,ht... |
| 3 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | Seem to work as well as name brand batteries a... | ... as well as name brand batteries at a much ... | Bymark sexson | https://www.barcodable.com/upc/841710106442,ht... |
| 4 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | These batteries are very long lasting the pric... | ... batteries are very long lasting the price ... | Bylinda | https://www.barcodable.com/upc/841710106442,ht... |


### Missing Values & Duplicates

In [25]:
amz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28332 entries, 0 to 28331
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   28332 non-null  object 
 1   dateAdded            28332 non-null  object 
 2   dateUpdated          28332 non-null  object 
 3   name                 28332 non-null  object 
 4   asins                28332 non-null  object 
 5   brand                28332 non-null  object 
 6   categories           28332 non-null  object 
 7   primaryCategories    28332 non-null  object 
 8   imageURLs            28332 non-null  object 
 9   keys                 28332 non-null  object 
 10  manufacturer         28332 non-null  object 
 11  manufacturerNumber   28332 non-null  object 
 12  reviews.date         28332 non-null  object 
 13  reviews.dateSeen     28332 non-null  object 
 14  reviews.didPurchase  9 non-null      object 
 15  reviews.doRecommend  16086 non-null 

Missing values are found within the review data attributes, namely reviews.didPurchase, reviews.doRecommend and reviews.id. However, given that these attributes were most likely left blank by reviewers and do not contribute to review sentiment, they need not be handled and we can keep these samples.

In [26]:
print("Number of duplicates found: " + str(np.sum(amz.duplicated())))

Number of duplicates found: 0


### Analysing review ratings
Review ratings will serve as our 'truth labels', or proxy for sentiment, with the scale of 1 to 5 representing negative to positive sentiment.

In [27]:
amz['reviews.rating'].describe()

count    28332.000000
mean         4.514048
std          0.934957
min          1.000000
25%          4.000000
50%          5.000000
75%          5.000000
max          5.000000
Name: reviews.rating, dtype: float64

In [28]:
ratings_data = amz.groupby(['reviews.rating']).count().reset_index().sort_values(by='id', ascending=False)[['reviews.rating', 'id']].rename(columns={'reviews.rating': 'Rating', 'id': 'Count'})
ratings_data['Proportion %'] = np.round((ratings_data.Count/ np.sum(ratings_data.Count)), 2)
ratings_data

| | Rating | Count | Proportion % |
| --- | --- | --- | --- |
| 4 | 5 | 19897 | 0.7 |
| 3 | 4 | 5648 | 0.2 |
| 2 | 3 | 1206 | 0.04 |
| 0 | 1 | 965 | 0.03 |
| 1 | 2 | 616 | 0.02 |


In [50]:
# Ratings distribution
px.bar(ratings_data, x='Rating', y='Count', color='Rating', title='Amazon Product: Ratings Distribution')

The distribution above shows that the dataset is heavily skewed towards high ratings of 4 and 5, which comprise 90% of our Amazon reviews dataset. This means that if we were to predict just 5s for every review, we would already achieve 70% accuracy. This would be exacerbated should we categorise our reviews as follows to generate sentiment polarities:

- Ratings 1 - 2: Negative
- Ratings 3: Neutral
- Ratings 4 - 5: Positive

By doing so, if we were to predict 'Positive' for every single review in our representative validation and test sets, we would achieve a lower-bound accuracy of 90%. This, consequently, yields misleading results and a practically useless predictive model. To deal with this, we will perform data augmentation on our reviews.text column for reviews with Negative and Neutral ratings. This will be addressed in our Text Processing phase. In the meantime, we will continue with creating our sentiment polarities.
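The two baseline figures quoted above can be reproduced directly from the rating counts in the table. The sketch below is plain Python so that it runs without the dataset; the counts are copied from the table, and the variable names are my own.

```python
# Rating counts copied from the ratings table above (rating -> number of reviews)
rating_counts = {5: 19897, 4: 5648, 3: 1206, 1: 965, 2: 616}
total = sum(rating_counts.values())

# Baseline 1: always predict a rating of 5
always_five_acc = rating_counts[5] / total

# Baseline 2: always predict 'Positive' (ratings 4 and 5 combined)
always_positive_acc = (rating_counts[4] + rating_counts[5]) / total

print(f"Total reviews: {total}")
print(f"Always predict 5: {always_five_acc:.1%}")
print(f"Always predict 'Positive': {always_positive_acc:.1%}")
```

Any model we train later should be judged against these baselines rather than against 0%.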


### Creating sentiment polarities
To perform predictive sentiment analyses, we must first convert our numerical ratings into sentiment polarities. As such, our ratings will be translated as specified earlier.

Further, we will only keep relevant columns in our working dataset, namely:
- id
- name
- brand
- primaryCategories
- reviews.rating
- reviews.text
- reviews.title

In [30]:
# Narrowed down dataset (.copy() gives an independent frame, avoiding SettingWithCopyWarning when adding columns)
amz2 = amz[['id', 'name', 'brand', 'primaryCategories', 'reviews.rating', 'reviews.text', 'reviews.title']].copy()
# Add sentiment column
amz2['sentiment'] = amz2['reviews.rating'].map({1: 'Negative', 2: 'Negative', 3: 'Neutral', 4: 'Positive', 5: 'Positive'})
amz2.head()

| | id | name | brand | primaryCategories | reviews.rating | reviews.text | reviews.title | sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AVpgNzjwLJeJML43Kpxn | AmazonBasics AAA Performance Alkaline Batterie... | Amazonbasics | Health & Beauty | 3 | I order 3 of them and one of the item is bad q... | ... 3 of them and one of the item is bad quali... | Neutral |
| 1 | AVpgNzjwLJeJML43Kpxn | AmazonBasics AAA Performance Alkaline Batterie... | Amazonbasics | Health & Beauty | 4 | Bulk is always the less expensive way to go fo... | ... always the less expensive way to go for pr... | Positive |
| 2 | AVpgNzjwLJeJML43Kpxn | AmazonBasics AAA Performance Alkaline Batterie... | Amazonbasics | Health & Beauty | 5 | Well they are not Duracell but for the price i... | ... are not Duracell but for the price i am ha... | Positive |
| 3 | AVpgNzjwLJeJML43Kpxn | AmazonBasics AAA Performance Alkaline Batterie... | Amazonbasics | Health & Beauty | 5 | Seem to work as well as name brand batteries a... | ... as well as name brand batteries at a much ... | Positive |
| 4 | AVpgNzjwLJeJML43Kpxn | AmazonBasics AAA Performance Alkaline Batterie... | Amazonbasics | Health & Beauty | 5 | These batteries are very long lasting the pric... | ... batteries are very long lasting the price ... | Positive |


In [31]:
# Count reviews per sentiment once and reuse it for both labels and values
sentiment_counts = amz2.sentiment.value_counts()

fig = go.Figure(go.Funnelarea(
    text = sentiment_counts.index,
    values = sentiment_counts,
    title = {"position": "top center", "text": "Sentiment Breakdown"}
    ))
fig.show()

### Sentiments across a Product's key attributes (e.g. Brands, Primary Product Categories)


In [49]:
pdt_data = amz2.groupby(['primaryCategories', 'sentiment']).count().reset_index()
pdt_data = pdt_data[['primaryCategories', 'sentiment', 'id']].rename(columns={'id': 'count'}).sort_values(['count'])
px.bar(pdt_data, x='primaryCategories', y='count', color='sentiment', title='Sentiment breakdown across Primary Product Categories')

In [33]:
pdt_data.sort_values(by=['primaryCategories', 'count'])

| | primaryCategories | sentiment | count |
| --- | --- | --- | --- |
| 0 | Animals & Pet Supplies | Neutral | 1 |
| 1 | Animals & Pet Supplies | Positive | 5 |
| 2 | Electronics | Negative | 370 |
| 3 | Electronics | Neutral | 551 |
| 4 | Electronics | Positive | 13074 |
| 5 | Electronics,Furniture | Positive | 2 |
| 7 | Electronics,Media | Neutral | 3 |
| 6 | Electronics,Media | Negative | 4 |
| 8 | Electronics,Media | Positive | 178 |
| 10 | Health & Beauty | Neutral | 534 |


The product category breakdown above shows that certain categories of products either have very few samples (< 10) or do not contain the full spectrum of sentiments. As such, it might be useful to perform data augmentation at the granularity of primaryCategories, so as to ensure that our predictive models are robust to a variety of product types.
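As a rough illustration of how such under-represented categories could be flagged, the sketch below works on a handful of rows copied from the breakdown above. The helper `needs_augmentation` and the `MIN_SAMPLES` cut-off are hypothetical names I introduce here; the threshold of 10 simply mirrors the "< 10" mentioned above and is not a tuned value.

```python
from collections import defaultdict

# (category, sentiment, count) rows copied from the breakdown above
rows = [
    ("Animals & Pet Supplies", "Neutral", 1),
    ("Animals & Pet Supplies", "Positive", 5),
    ("Electronics", "Negative", 370),
    ("Electronics", "Neutral", 551),
    ("Electronics", "Positive", 13074),
    ("Electronics,Furniture", "Positive", 2),
]
SENTIMENTS = {"Negative", "Neutral", "Positive"}
MIN_SAMPLES = 10  # hypothetical cut-off, mirroring the "< 10" in the text

# Collect sentiment counts per category
per_category = defaultdict(dict)
for category, sentiment, count in rows:
    per_category[category][sentiment] = count

def needs_augmentation(sentiment_counts, min_samples=MIN_SAMPLES):
    """Return the reasons a category is under-represented (empty list if none)."""
    reasons = []
    if sum(sentiment_counts.values()) < min_samples:
        reasons.append("few samples")
    missing = SENTIMENTS - set(sentiment_counts)
    if missing:
        reasons.append("missing: " + ", ".join(sorted(missing)))
    return reasons

# Keep only the categories with at least one reason
flagged = {}
for category, sentiment_counts in per_category.items():
    reasons = needs_augmentation(sentiment_counts)
    if reasons:
        flagged[category] = reasons

for category, reasons in flagged.items():
    print(category, "->", reasons)
```

On these rows, only Electronics passes both checks; the other two categories are flagged for having few samples and for missing at least one sentiment.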

### Word length of reviews across Positive, Neutral & Negative sentiments

In [34]:
# Instantiate spaCy tokenizer
nlp = English()
tokenizer = Tokenizer(nlp.vocab)

# Get document word length
def get_doc_len(text, tokenizer):
    doc_len = len(tokenizer(text))
    return doc_len
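A `Tokenizer` built from only a vocab, as above, carries no prefix, suffix or infix rules, so to my understanding it splits on whitespace alone. The plain-Python helper below (`whitespace_word_count`, a hypothetical name that is not part of this notebook) sketches what that implies for the lengths we are about to compute: punctuation stays attached to the neighbouring word and is not counted separately. spaCy may additionally emit tokens for runs of repeated whitespace, so its counts can differ slightly from this approximation.

```python
def whitespace_word_count(text):
    """Count whitespace-separated chunks; punctuation stays attached to words."""
    return len(text.split())

sample = "Great price, works well!"
print(whitespace_word_count(sample))  # "price," and "well!" each count as one word
```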

In [35]:
amz2['reviews.text.len'] = amz2['reviews.text'].apply(lambda x: get_doc_len(x, tokenizer))
amz2['reviews.text.len'].describe()

count    28332.000000
mean        25.945150
std         37.076179
min          1.000000
25%         10.000000
50%         17.000000
75%         31.000000
max       1539.000000
Name: reviews.text.len, dtype: float64

In [36]:
# Breakdown by sentiment category
# Select the length column before averaging so that non-numeric columns are not aggregated
review_len_by_sent = amz2.groupby('sentiment')['reviews.text.len'].mean().reset_index()
px.bar(review_len_by_sent, x='sentiment', y='reviews.text.len', color='sentiment', title='Review Length vs. Sentiment Polarity')

On average, Negative reviews are about 15 words longer than Positive reviews and about 8 words longer than Neutral reviews.

### Write amz2 dataframe as our working dataset for Amazon product reviews

In [48]:
amz2.to_csv('../Amazon product reviews dataset/Amazon_product_review_with_sent.csv', index=False)

### Top words for Positive, Neutral & Negative reviews

In [37]:
def create_review_corpus(reviews):
    # Concatenate all reviews into one string in a single pass
    return ''.join(reviews)

In [38]:
positive_reviews = amz2[amz2['sentiment'] == 'Positive']['reviews.text']
neutral_reviews = amz2[amz2['sentiment'] == 'Neutral']['reviews.text']
negative_reviews = amz2[amz2['sentiment'] == 'Negative']['reviews.text']

pos_corpus = create_review_corpus(positive_reviews)
neu_corpus = create_review_corpus(neutral_reviews)
neg_corpus = create_review_corpus(negative_reviews)

In [39]:
# Create custom stop word list
alphabets = list(string.ascii_lowercase)[1:]
all_stopwords = stopwords.words('english') + alphabets
all_punctuation = string.punctuation

# Create custom tokenizer
def custom_tokenizer(nlp):
    prefix_re = re.compile(r'(?<=[:;()[\]+.,!?\\-])[A-Za-z]')
    suffix_re = re.compile(r'(?<=[A-Za-z])[:;()[\]+.,!?\\-]')
    return Tokenizer(nlp.vocab, prefix_search = prefix_re.search, suffix_search = suffix_re.search)

# Create quick preprocessing pipeline
def quick_preprocess(tok_text, all_punctuations, all_stopwords):
    doc = [word.lemma_.lower() for word in tok_text if word.lemma_.lower() not in all_stopwords]
    doc = [word for word in doc if word not in all_punctuations]
    return doc
    

#### Positive Reviews

In [40]:
# Process text
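# Raise spaCy's text-length limit so that the largest corpus fits in a single call
nlp.max_length = max(len(pos_corpus), len(neu_corpus), len(neg_corpus)) + 1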
nlp.tokenizer = custom_tokenizer(nlp)

# Positive
tok_pos_corpus = nlp(pos_corpus)
pos_review_vocab = quick_preprocess(tok_pos_corpus, all_punctuation, all_stopwords)
top_20_pos_words = pd.Series(pos_review_vocab).value_counts().reset_index().rename(columns={'index': 'word', 0: 'count'}).head(20)

In [53]:
px.bar(top_20_pos_words, x='word', y='count', color='word', title='Top 20 words amongst Positive Reviews')

#### Neutral Reviews

In [42]:
# Neutral
tok_neu_corpus = nlp(neu_corpus)
neu_review_vocab = quick_preprocess(tok_neu_corpus, all_punctuation, all_stopwords)
top_20_neu_words = pd.Series(neu_review_vocab).value_counts().reset_index().rename(columns={'index': 'word', 0: 'count'}).head(20)

In [52]:
px.bar(top_20_neu_words, x='word', y='count', color='word', title='Top 20 words amongst Neutral Reviews')

#### Negative Reviews

In [44]:
# Negative
tok_neg_corpus = nlp(neg_corpus)
neg_review_vocab = quick_preprocess(tok_neg_corpus, all_punctuation, all_stopwords)
top_20_neg_words = pd.Series(neg_review_vocab).value_counts().reset_index().rename(columns={'index': 'word', 0: 'count'}).head(20)

In [54]:
px.bar(top_20_neg_words, x='word', y='count', color='word', title='Top 20 words amongst Negative Reviews')
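
The three top-20 counts above follow the same pattern: lower-case the tokens, drop stop words and punctuation, then count what remains. The sketch below shows that pattern on a toy sentence using only the standard library, so it runs without spaCy or NLTK. The `toy_stopwords` set and the `top_words` helper are hypothetical stand-ins; the notebook itself uses NLTK's English stop-word list and spaCy lemmas.

```python
from collections import Counter
import string

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list
toy_stopwords = {"the", "are", "and", "a", "for", "very", "is"}
punctuation = set(string.punctuation)

def top_words(tokens, stopwords, n=3):
    """Lower-case tokens, drop stop words and punctuation, return the n most common."""
    cleaned = [token.lower() for token in tokens]
    cleaned = [t for t in cleaned if t not in stopwords and t not in punctuation]
    return Counter(cleaned).most_common(n)

# Pre-split toy review text; punctuation marks are separate tokens here
tokens = "The batteries are great , great price and the batteries last .".split()
print(top_words(tokens, toy_stopwords))
```

`Counter.most_common` returns `(word, count)` pairs sorted by count, which is the same information the `value_counts()` calls above produce.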