## Sentiment Analysis of Amazon Reviews on Musical Instruments using Machine Learning

1. Dataset: https://www.kaggle.com/eswarchandt/amazon-music-reviews
2. Problem statement: Given the review text, determine its polarity: positive or negative
3. Type of problem: Classification, Supervised
4. Data type: Review text and other parameters stored in a CSV file
5. Performance Measures: Accuracy, Precision, Recall, Confusion Matrix
6. Feature Importance: Not required
7. Interpretability: Why the review is classified as positive or negative

### Classification Algorithms:
1. K-Nearest Neighbor
2. Logistic Regression (one-vs-rest)
3. SVM Classifier
4. Decision Tree
5. Random Forest
6. XGBoost
7. Naive Bayes

### Libraries required
1. Pandas
2. NumPy
3. Matplotlib and seaborn
4. NLTK
5. scikit-learn
6. xgboost
7. tqdm

### Dataset description

#### Content
This file contains the reviewer ID, product ID, reviewer name, review text, helpfulness votes, summary (obtained from the review text), overall rating on a scale of 5, and review time.
Description of the columns in the file:

1. reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
2. asin - ID of the product, e.g. 0000013714
3. reviewerName - name of the reviewer
4. helpful - helpfulness rating of the review, e.g. 2/3
5. reviewText - text of the review
6. overall - rating of the product
7. summary - summary of the review
8. unixReviewTime - time of the review (unix time)
9. reviewTime - time of the review (raw) 

#### Important Features
1. reviewerID - It is unique for every customer, so it can be removed
2. asin - ID of the product; it identifies the product rather than the sentiment, so it can be removed
3. reviewerName - Does not affect the final sentiment of the review, so it can be removed
4. helpful - Review helpfulness may be related to the polarity of the review text, so keep it with the following modification:
   percentage of helpfulness = (number of customers who found it helpful) / (total number of customers who voted helpful or not helpful)
5. reviewText - The most important field; polarity is decided from this text
6. summary - Short summary of the review text, so keep it
7. unixReviewTime - Because this is time-based data, this feature is helpful for partitioning the data into train/cv/test sets, but it does not help in deciding the polarity of the review text
8. reviewTime - Removed because unixReviewTime is used instead
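
The helpfulness formula in item 4 can be sketched as follows. This is a minimal illustration, not the notebook's own code: the function name `helpfulness_ratio` is hypothetical, the input format `"[13, 14]"` is taken from the `helpful` column shown in the data preview, and returning 0.0 for reviews without votes is an assumption.

```python
import ast

def helpfulness_ratio(helpful):
    """Return helpful_votes / total_votes for a value such as "[13, 14]"."""
    # The CSV stores the pair as text, so parse it into a Python list first.
    if isinstance(helpful, str):
        helpful = ast.literal_eval(helpful)
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        return 0.0  # assumption: reviews with no votes get a ratio of 0
    return helpful_votes / float(total_votes)

print(helpfulness_ratio("[13, 14]"))
print(helpfulness_ratio("[0, 0]"))
```

With pandas, the same function could be applied column-wise through `apply`, provided the `helpful` column is still present at that point.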

### Import Libraries

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import random
from random import randint
from tqdm import tqdm

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


import re
import nltk
#nltk.download()
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [6]:
!python --version
!python -m timeit "import nltk"

Python 2.7.15 :: Anaconda, Inc.


10 loops, best of 3: 0.933 usec per loop


### Data Preprocessing

In [7]:
rawData = pd.read_csv("Musical_instruments_reviews.csv")

Convert the 'reviewText' and 'summary' columns to string type; see the following link for the detailed reason:
- https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame


In [8]:
rawData.reviewText=rawData.reviewText.astype(str)
rawData.summary=rawData.summary.astype(str)

In [9]:
rawData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10261 entries, 0 to 10260
Data columns (total 9 columns):
reviewerID        10261 non-null object
asin              10261 non-null object
reviewerName      10234 non-null object
helpful           10261 non-null object
reviewText        10261 non-null object
overall           10261 non-null float64
summary           10261 non-null object
unixReviewTime    10261 non-null int64
reviewTime        10261 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 721.5+ KB


In [10]:
print("Data size shape",rawData.shape)
rawData.head()

('Data size shape', (10261, 9))


Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


In [11]:
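# Display the reviews in chronological order; the result is not assigned, so rawData keeps its original order
rawData.sort_values('unixReviewTime', ascending=True).reset_index()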

Unnamed: 0,index,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,4420,AV8MDYLHHTUOY,B000CD3QY2,"Amazon Customer ""eyegor""","[18, 19]",The ability to quickly change the range and se...,4.0,GREAT Wah,1095465600,"09 18, 2004"
1,7413,A33H0WC9MI8OVW,B002Q0WT6U,Clare Chu,"[12, 13]",Jade rosin gives a extra grippiness to the bow...,5.0,Excellent sticky rosin,1096416000,"09 29, 2004"
2,954,A33H0WC9MI8OVW,B0002D0COE,Clare Chu,"[9, 11]",This compact humidifier is easily filled with ...,5.0,"Very Easy to Use, Non-Messy",1096416000,"09 29, 2004"
3,5581,A3SMT15X2QVUR8,B000SZVYLQ,"Victoria Tarrani ""writer, editor, artist, des...","[63, 63]",When I purchased this pedal from a local music...,5.0,Competes with many high-end pedals,1101686400,"11 29, 2004"
4,1560,A3SMT15X2QVUR8,B0002E2EOE,"Victoria Tarrani ""writer, editor, artist, des...","[58, 59]",I purchased this key on a whim. When it arriv...,5.0,This actually works - and works well,1101686400,"11 29, 2004"
5,2004,A3SMT15X2QVUR8,B0002F73YY,"Victoria Tarrani ""writer, editor, artist, des...","[21, 22]",This is an ingenious clutch that engages when ...,5.0,Essential for double bass pedal players,1101686400,"11 29, 2004"
6,2590,A3SMT15X2QVUR8,B0002GXRF2,"Victoria Tarrani ""writer, editor, artist, des...","[22, 22]",These heads are virtually indestructable and p...,5.0,Perfect for rock,1101859200,"12 1, 2004"
7,3721,A1MI9FDCNB3CMR,B0006OHVK2,"Jorge Barbarosa ""the_bassist""","[12, 13]","Flatwound? Ribbon wound? It's all the same, n...",5.0,No Squeaking,1106870400,"01 28, 2005"
8,1941,A1RPTVW5VEOSI,B0002F4MKC,Michael J. Edelman,"[8, 9]",I was in a local music store the other day and...,4.0,Not bad...!,1110499200,"03 11, 2005"
9,2632,A2PD27UKAD3Q00,B0002GXZK4,"Wilhelmina Zeitgeist ""coolartsybabe""","[156, 160]",I was thrilled when my guitar arrived and even...,5.0,This guitar DOES have a BIG SOUND for a small ...,1111708800,"03 25, 2005"
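
The time-based partitioning mentioned under Important Features could look roughly like the sketch below: sort by timestamp, then cut the ordered rows into train, cv and test slices so that later reviews never leak into earlier splits. The function name `time_based_split` and the 60/20/20 fractions are assumptions for illustration; the notebook does not specify split sizes.

```python
def time_based_split(rows, time_key, train_frac=0.6, cv_frac=0.2):
    """Split rows chronologically into (train, cv, test) lists."""
    # Oldest reviews first, so the test set holds the most recent ones.
    ordered = sorted(rows, key=lambda r: r[time_key])
    n = len(ordered)
    train_end = int(n * train_frac)
    cv_end = train_end + int(n * cv_frac)
    return ordered[:train_end], ordered[train_end:cv_end], ordered[cv_end:]

# Timestamps taken from rows shown in the outputs above.
reviews = [{"unixReviewTime": t} for t in
           (1393545600, 1095465600, 1363392000, 1101686400, 1377648000)]
train, cv, test = time_based_split(reviews, "unixReviewTime")
print(len(train), len(cv), len(test))
```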


In [12]:
rawData.drop(['asin','reviewerName', 'reviewTime','reviewerID','helpful'], inplace = True, axis = 1)

In [13]:
rawData.head()

Unnamed: 0,reviewText,overall,summary,unixReviewTime
0,"Not much to write about here, but it does exac...",5.0,good,1393545600
1,The product does exactly as it should and is q...,5.0,Jake,1363392000
2,The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000
3,Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000
4,This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800


In [14]:
reviewTextNsummary = rawData[['reviewText','summary']]
ratings = rawData[['overall']]

In [15]:
reviewTextNsummary.head()
ratings.head()

Unnamed: 0,overall
0,5.0
1,5.0
2,5.0
3,5.0
4,5.0


In [16]:
def newRatings(vals):
    # Ratings above 3 are labelled positive (1.0); ratings of 3 or below are labelled negative (0.0)
    if vals > 3.0:
        return 1.0
    else:
        return 0.0

In [17]:
ratings = ratings['overall'].apply(newRatings)

### Text Data Processing

#### Data Filtering
1. Convert all sentences into lower case
2. Remove special symbols
3. Remove urls
4. Remove digits
5. Expand contractions such as i've and can't into i have and cannot

#### NLP Processing
1. Tokenize sentences
2. Remove stopwords
3. Apply Stemming/Lemmatization
4. Convert text words into numerical vectors using
    1. Bag of Words
    2. TFIDF
    3. Word2Vec
    4. Average word2vec
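
To make the first two vectorization options concrete, here is a from-scratch sketch of Bag of Words counts and TF-IDF weights on tokenized text. It uses only the standard library and the plain `log(N / df)` form of idf. Library implementations such as scikit-learn's vectorizers use smoothed variants, so their numbers will differ. The function name `tfidf_vectors` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / float(df[tok])) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)            # bag-of-words counts for this document
        total = float(len(doc)) or 1.0   # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors

docs = [["good", "cabl"], ["great", "cabl"], ["not", "good"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[1])
```

Terms that appear in many documents, such as `cabl` here, receive a lower weight than terms that appear in only one.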

In [18]:
rawData.head(5)

Unnamed: 0,reviewText,overall,summary,unixReviewTime
0,"Not much to write about here, but it does exac...",5.0,good,1393545600
1,The product does exactly as it should and is q...,5.0,Jake,1363392000
2,The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000
3,Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000
4,This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800


In [57]:
rawData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10261 entries, 0 to 10260
Data columns (total 4 columns):
reviewText        10261 non-null object
overall           10261 non-null float64
summary           10261 non-null object
unixReviewTime    10261 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 320.7+ KB


### Data Filtering

In [20]:
contractions = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [58]:
def preProcessText(review):
    review = review.lower() # Convert string into lower case
    review = re.sub(r'https?://\S+', '', review) # Remove urls first, while ':' and '/' are still present
    for word in review.split():
        if word in contractions:
            review = review.replace(word, contractions[word]) # Expand contractions
    review = review.replace("?","") # Remove question marks
    review = " ".join(re.findall(r"[a-zA-Z]+", review)) # Keep letters only: drops special symbols and digits
    return review
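
One possible refinement of the contraction step, shown here as a sketch rather than as the notebook's method: splitting on whitespace misses contractions that carry trailing punctuation (for example `can't,`), whereas a regular expression can match each contraction as a whole word. `SAMPLE_CONTRACTIONS` is a hypothetical three-entry stand-in for the full `contractions` dictionary, and `expand_contractions` is an illustrative name.

```python
import re

# Hypothetical sample; in practice the full contractions dictionary would be passed in.
SAMPLE_CONTRACTIONS = {"can't": "cannot", "i've": "i have", "doesn't": "does not"}

# A word made of letters with one or more apostrophe-joined parts, e.g. "can't".
CONTRACTION_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)+")

def expand_contractions(text, mapping):
    """Lower-case the text and replace every known contraction with its expansion."""
    return CONTRACTION_PATTERN.sub(
        lambda m: mapping.get(m.group(0), m.group(0)),  # unknown matches are left unchanged
        text.lower(),
    )

print(expand_contractions("I've heard it can't fail, but it doesn't fit.", SAMPLE_CONTRACTIONS))
# prints: i have heard it cannot fail, but it does not fit.
```

Entries that start with an apostrophe, such as `'cause`, are not matched by this pattern and would need separate handling.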

In [111]:
processedData = pd.DataFrame()
processedData['Summary'] = rawData.summary.apply(preProcessText)
processedData['ReviewText'] = rawData.reviewText.apply(preProcessText)




In [112]:
processedData.head()

Unnamed: 0,Summary,ReviewText
0,good,not much to write about here but it does exact...
1,jake,the product does exactly as it should and is q...
2,it does the job well,the primary job of this device is to block the...
3,good windscreen for the money,nice windscreen protects my mxl mic and preven...
4,no more pops when i record my vocals,this pop filter is great it looks and performs...


### NLP Processing

#### 1. Tokenize sentences

In [114]:
#https://stackoverflow.com/questions/45878720/dataframe-apply-doesnt-accept-axis-argument?rq=1
processedData['tokenized_Summary'] = processedData[['Summary']].apply(lambda row: nltk.word_tokenize(row['Summary']), axis = 1)
processedData['tokenized_ReviewText'] = processedData[['ReviewText']].apply(lambda row: nltk.word_tokenize(row['ReviewText']), axis = 1)

In [115]:
processedData.head()

Unnamed: 0,Summary,ReviewText,tokenized_Summary,tokenized_ReviewText
0,good,not much to write about here but it does exact...,[good],"[not, much, to, write, about, here, but, it, d..."
1,jake,the product does exactly as it should and is q...,[jake],"[the, product, does, exactly, as, it, should, ..."
2,it does the job well,the primary job of this device is to block the...,"[it, does, the, job, well]","[the, primary, job, of, this, device, is, to, ..."
3,good windscreen for the money,nice windscreen protects my mxl mic and preven...,"[good, windscreen, for, the, money]","[nice, windscreen, protects, my, mxl, mic, and..."
4,no more pops when i record my vocals,this pop filter is great it looks and performs...,"[no, more, pops, when, i, record, my, vocals]","[this, pop, filter, is, great, it, looks, and,..."


#### 2. Stopword removal
- https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

In [116]:
def removeStopwords(column,stopWords):
    filtered_sentence = []
    for w in column:
        if w not in stopWords:
            filtered_sentence.append(w)
    return filtered_sentence

In [117]:
# Note: stopWords is created in cell In [119] below and must exist before this cell runs
processedData['tokenized_Summary'] = processedData[['tokenized_Summary']].apply(lambda x: removeStopwords(x['tokenized_Summary'], stopWords), axis = 1)
processedData['tokenized_ReviewText'] = processedData[['tokenized_ReviewText']].apply(lambda x: removeStopwords(x['tokenized_ReviewText'], stopWords), axis = 1)

In [118]:
processedData.head()

Unnamed: 0,Summary,ReviewText,tokenized_Summary,tokenized_ReviewText
0,good,not much to write about here but it does exact...,[good],"[not, much, write, exactly, supposed, filters,..."
1,jake,the product does exactly as it should and is q...,[jake],"[product, exactly, quite, affordable, not, rea..."
2,it does the job well,the primary job of this device is to block the...,"[job, well]","[primary, job, device, block, breath, would, o..."
3,good windscreen for the money,nice windscreen protects my mxl mic and preven...,"[good, windscreen, money]","[nice, windscreen, protects, mxl, mic, prevent..."
4,no more pops when i record my vocals,this pop filter is great it looks and performs...,"[pops, record, vocals]","[pop, filter, great, looks, performs, like, st..."


In [119]:
stopWords = set(stopwords.words('english'))
print(len(stopWords))
stopWords.remove('not')
print(len(stopWords))

179
178


In [120]:
Tokenized_data = processedData.copy()

In [121]:
Tokenized_data.drop(['Summary','ReviewText'], axis = 1, inplace = True)
Tokenized_data.head()

Unnamed: 0,tokenized_Summary,tokenized_ReviewText
0,[good],"[not, much, write, exactly, supposed, filters,..."
1,[jake],"[product, exactly, quite, affordable, not, rea..."
2,"[job, well]","[primary, job, device, block, breath, would, o..."
3,"[good, windscreen, money]","[nice, windscreen, protects, mxl, mic, prevent..."
4,"[pops, record, vocals]","[pop, filter, great, looks, performs, like, st..."


#### 3. Stemming or Lemmatization
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

In [122]:
def StemSentence(column):
    porter = PorterStemmer() # Porter stemming is used here rather than lemmatization
    stem_list = []
    for w in column:
        stem_list.append(porter.stem(w))
    return stem_list

In [123]:
Tokenized_data['Ltokenized_Summary'] = Tokenized_data['tokenized_Summary'].apply(StemSentence)
Tokenized_data['Ltokenized_ReviewText'] = Tokenized_data['tokenized_ReviewText'].apply(StemSentence)

In [126]:
Tokenized_data.head(10)

Unnamed: 0,tokenized_Summary,tokenized_ReviewText,Ltokenized_Summary,Ltokenized_ReviewText
0,[good],"[not, much, write, exactly, supposed, filters,...",[good],"[not, much, write, exactli, suppos, filter, po..."
1,[jake],"[product, exactly, quite, affordable, not, rea...",[jake],"[product, exactli, quit, afford, not, realiz, ..."
2,"[job, well]","[primary, job, device, block, breath, would, o...","[job, well]","[primari, job, devic, block, breath, would, ot..."
3,"[good, windscreen, money]","[nice, windscreen, protects, mxl, mic, prevent...","[good, windscreen, money]","[nice, windscreen, protect, mxl, mic, prevent,..."
4,"[pops, record, vocals]","[pop, filter, great, looks, performs, like, st...","[pop, record, vocal]","[pop, filter, great, look, perform, like, stud..."
5,"[best, cable]","[good, bought, another, one, love, heavy, cord...","[best, cabl]","[good, bought, anoth, one, love, heavi, cord, ..."
6,"[monster, standard, instrument, cable]","[used, monster, cables, years, good, reason, l...","[monster, standard, instrument, cabl]","[use, monster, cabl, year, good, reason, lifet..."
7,"[not, fit, fender, strat]","[use, cable, run, output, pedal, chain, input,...","[not, fit, fender, strat]","[use, cabl, run, output, pedal, chain, input, ..."
8,"[great, cable]","[perfect, epiphone, sheraton, ii, monster, cab...","[great, cabl]","[perfect, epiphon, sheraton, ii, monster, cabl..."
9,"[best, instrument, cables, market]","[monster, makes, best, cables, lifetime, warra...","[best, instrument, cabl, market]","[monster, make, best, cabl, lifetim, warranti,..."


In [127]:
stemmedData = Tokenized_data.copy()
stemmedData.drop(['tokenized_Summary','tokenized_ReviewText'], inplace = True, axis = 1)

In [129]:
stemmedData.rename(columns = {'Ltokenized_Summary':'summary', 'Ltokenized_ReviewText':'reviewText'}, inplace = True) 
stemmedData.head()

Unnamed: 0,summary,reviewText
0,[good],"[not, much, write, exactli, suppos, filter, po..."
1,[jake],"[product, exactli, quit, afford, not, realiz, ..."
2,"[job, well]","[primari, job, devic, block, breath, would, ot..."
3,"[good, windscreen, money]","[nice, windscreen, protect, mxl, mic, prevent,..."
4,"[pop, record, vocal]","[pop, filter, great, look, perform, like, stud..."


#### 4. Get number of unique words

In [133]:
unique_words = []
for sent in stemmedData.reviewText:
    for word in sent:
        if word not in unique_words:
            unique_words.append(word) 

In [136]:
for sent in stemmedData.summary:
    for word in sent:
        if word not in unique_words:
            unique_words.append(word) 

In [137]:
print("Unique Words in Text: ",len(unique_words))

('Unique Words in Text: ', 13141)


In [140]:
JonedColumns = stemmedData.summary + stemmedData.reviewText 

In [148]:
JonedColumns = JonedColumns.to_frame(name='Text') # to_frame returns a new DataFrame, so assign it back
JonedColumns.Text

0        [good, not, much, write, exactli, suppos, filt...
1        [jake, product, exactli, quit, afford, not, re...
2        [job, well, primari, job, devic, block, breath...
3        [good, windscreen, money, nice, windscreen, pr...
4        [pop, record, vocal, pop, filter, great, look,...
5        [best, cabl, good, bought, anoth, one, love, h...
6        [monster, standard, instrument, cabl, use, mon...
7        [not, fit, fender, strat, use, cabl, run, outp...
8        [great, cabl, perfect, epiphon, sheraton, ii, ...
9        [best, instrument, cabl, market, monster, make...
10       [one, best, instrument, cabl, within, brand, m...
11       [work, great, hardli, use, got, need, found, n...
12       [get, use, size, not, use, use, larg, sustain,...
13       [awesom, love, use, yamaha, ypt, work, great, ...
14       [work, bought, use, home, studio, control, mid...
15       [definit, not, season, piano, player, bought, ...
16       [durabl, instrument, cabl, fender, cabl, perfe.

Create a dictionary to store the frequency count of each word in the corpus
- https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/

In [None]:
wordFreq = {}
for sent in JonedColumns.Text:
    for word in sent:
        if word not in wordFreq: # Test membership on the dict itself instead of building a key list
            wordFreq[word] = 1
        else:
            wordFreq[word] += 1
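
The same frequency count can be written more compactly with `collections.Counter`, and the most frequent words can then serve as the vocabulary for Bag of Words vectors. The sketch below is self-contained. The function name `build_bow`, the `vocab_size` parameter, and the toy token lists are assumptions for illustration.

```python
from collections import Counter

def build_bow(token_lists, vocab_size):
    """Return (word frequencies, vocabulary, one count vector per document)."""
    freq = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep only the vocab_size most frequent words as features.
    vocab = [tok for tok, _ in freq.most_common(vocab_size)]
    vectors = [[tokens.count(tok) for tok in vocab] for tokens in token_lists]
    return freq, vocab, vectors

token_lists = [["good", "cabl", "good"], ["great", "cabl"], ["good"]]
freq, vocab, vectors = build_bow(token_lists, vocab_size=2)
print(freq["good"], vocab, vectors)
```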

#### Check whether the data is balanced or imbalanced

In [114]:
ratings.value_counts()

1.0    9022
0.0    1239
Name: overall, dtype: int64

#### As the dataset is imbalanced, the following performance measures can be used
1. Confusion Matrix
2. Precision and Recall
3. ROC Curve
4. F1-Score
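
A short worked example shows why accuracy alone is misleading on these class counts. A trivial baseline that labels every review as positive is scored from the 9022 positive and 1239 negative labels reported above. The helper `binary_metrics` is an illustrative name, and returning 0.0 for undefined ratios is an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Baseline that predicts "positive" for all 10261 reviews.
positive_view = binary_metrics(tp=9022, fp=1239, fn=0, tn=0)
# The same predictions scored with the negative class as the class of interest.
negative_view = binary_metrics(tp=0, fp=0, fn=1239, tn=9022)

print(positive_view)
print(negative_view)
```

The baseline reaches roughly 88% accuracy without learning anything, while its recall and F1 for the negative class are zero, which is why the measures listed above are more informative here.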