## Sentiment Analysis of Amazon Reviews on Musical Instruments using Machine Learning

1. Dataset: https://www.kaggle.com/eswarchandt/amazon-music-reviews
2. Problem statement: Given the review text, determine its polarity: positive or negative
3. Type of problem: Classification, Supervised
4. Data type: Review text and other parameters stored in a CSV file
5. Performance Measures: Accuracy, Precision, Recall, Confusion Matrix
6. Feature Importance: Not required
7. Interpretability: Why the review is classified as positive or negative
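
The performance measures listed above can be computed with scikit-learn, which the notebook already imports. The following is a minimal sketch on made-up labels (`y_true` and `y_pred` are hypothetical and are not results from this dataset), assuming 1 means positive and 0 means negative:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # of predicted positives, share that are truly positive
recall = recall_score(y_true, y_pred)        # of actual positives, share that were found

# Rows are true classes and columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(accuracy, precision, recall)
print(cm)
```

Because most reviews in a ratings dataset tend to be positive, accuracy alone can look good for a model that rarely predicts the negative class, which is one reason to read it alongside precision, recall, and the confusion matrix.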

### Classification Algorithms:
1. K-Nearest Neighbor
2. Logistic Regression (one-vs-rest)
3. SVM Classifier
4. Decision Tree
5. Random Forest
6. XGBoost
7. Naive Bayes

### Libraries required
1. pandas
2. NumPy
3. Matplotlib and seaborn
4. nltk
5. scikit-learn
6. xgboost
7. tqdm

### Dataset description

#### Content
The file contains the reviewer ID, product ID, reviewer name, review text, helpfulness votes, summary (obtained from the review text), overall rating on a scale of 5, and review time.
Description of columns in the file:

1. reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
2. asin - ID of the product, e.g. 0000013714
3. reviewerName - name of the reviewer
4. helpful - helpfulness rating of the review, e.g. 2/3
5. reviewText - text of the review
6. overall - rating of the product
7. summary - summary of the review
8. unixReviewTime - time of the review (unix time)
9. reviewTime - time of the review (raw) 

#### Important Features
1. reviewerID - It is unique to each customer and carries no sentiment information, so it can be removed
2. asin - It identifies the product, not the sentiment of the review, so it can be removed
3. reviewerName - It does not affect the sentiment of the review, so it can be removed
4. helpful - Review helpfulness may be related to the polarity of the review text, so keep it after converting it to a ratio:
   helpfulness ratio = (number of customers who found the review helpful) / (total number of customers who voted helpful or not helpful)
5. reviewText - The most important feature; the polarity is decided from this text
6. summary - A short summary of the review text, so keep it
7. unixReviewTime - Because the data is time-based, this feature is useful for partitioning the data into train/cv/test sets, but it does not help decide the polarity of the review text
8. reviewTime - Removed because unixReviewTime carries the same information
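
The helpfulness formula in the list above can be sketched as follows. In the CSV the `helpful` column holds a pair such as `[13, 14]` stored as text, so the sketch parses it first. The function name `helpful_ratio`, the column name `helpfulRatio`, and the choice to return 0.0 for reviews with no votes are assumptions made for illustration:

```python
import ast

import pandas as pd


def helpful_ratio(value):
    """Return helpful votes / total votes, or 0.0 when nobody voted."""
    # The CSV stores the pair as text such as "[13, 14]"; parse it if needed.
    votes = ast.literal_eval(value) if isinstance(value, str) else value
    helpful_votes, total_votes = votes
    if total_votes == 0:
        return 0.0  # assumption: unvoted reviews get 0.0 rather than NaN
    return float(helpful_votes) / total_votes


# Tiny stand-in for the real column, using values that appear in the data.
sample = pd.DataFrame({"helpful": ["[0, 0]", "[13, 14]", "[1, 1]"]})
sample["helpfulRatio"] = sample["helpful"].apply(helpful_ratio)
print(sample)
```

Treating `[0, 0]` as 0.0 makes unvoted reviews indistinguishable from reviews that everyone found unhelpful; returning NaN or adding a separate vote-count column are alternatives if that distinction matters.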

### Import Libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import random
from random import randint
from tqdm import tqdm

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Data Preprocessing

In [26]:
rawData = pd.read_csv("Musical_instruments_reviews.csv")

In [27]:
print("Data size shape",rawData.shape)
rawData.head()

('Data size shape', (10261, 9))


Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


In [28]:
# sort_values returns a new frame; it is displayed here but not assigned back to rawData
rawData.sort_values('unixReviewTime',ascending=True).reset_index()

Unnamed: 0,index,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,4420,AV8MDYLHHTUOY,B000CD3QY2,"Amazon Customer ""eyegor""","[18, 19]",The ability to quickly change the range and se...,4.0,GREAT Wah,1095465600,"09 18, 2004"
1,7413,A33H0WC9MI8OVW,B002Q0WT6U,Clare Chu,"[12, 13]",Jade rosin gives a extra grippiness to the bow...,5.0,Excellent sticky rosin,1096416000,"09 29, 2004"
2,954,A33H0WC9MI8OVW,B0002D0COE,Clare Chu,"[9, 11]",This compact humidifier is easily filled with ...,5.0,"Very Easy to Use, Non-Messy",1096416000,"09 29, 2004"
3,5581,A3SMT15X2QVUR8,B000SZVYLQ,"Victoria Tarrani ""writer, editor, artist, des...","[63, 63]",When I purchased this pedal from a local music...,5.0,Competes with many high-end pedals,1101686400,"11 29, 2004"
4,1560,A3SMT15X2QVUR8,B0002E2EOE,"Victoria Tarrani ""writer, editor, artist, des...","[58, 59]",I purchased this key on a whim. When it arriv...,5.0,This actually works - and works well,1101686400,"11 29, 2004"
5,2004,A3SMT15X2QVUR8,B0002F73YY,"Victoria Tarrani ""writer, editor, artist, des...","[21, 22]",This is an ingenious clutch that engages when ...,5.0,Essential for double bass pedal players,1101686400,"11 29, 2004"
6,2590,A3SMT15X2QVUR8,B0002GXRF2,"Victoria Tarrani ""writer, editor, artist, des...","[22, 22]",These heads are virtually indestructable and p...,5.0,Perfect for rock,1101859200,"12 1, 2004"
7,3721,A1MI9FDCNB3CMR,B0006OHVK2,"Jorge Barbarosa ""the_bassist""","[12, 13]","Flatwound? Ribbon wound? It's all the same, n...",5.0,No Squeaking,1106870400,"01 28, 2005"
8,1941,A1RPTVW5VEOSI,B0002F4MKC,Michael J. Edelman,"[8, 9]",I was in a local music store the other day and...,4.0,Not bad...!,1110499200,"03 11, 2005"
9,2632,A2PD27UKAD3Q00,B0002GXZK4,"Wilhelmina Zeitgeist ""coolartsybabe""","[156, 160]",I was thrilled when my guitar arrived and even...,5.0,This guitar DOES have a BIG SOUND for a small ...,1111708800,"03 25, 2005"
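
Sorting by `unixReviewTime` is what makes a time-based train/cv/test partition possible. A minimal sketch of such a split follows, assuming 60/20/20 proportions (the notebook does not state the proportions) and using a toy frame named `toy` in place of `rawData`:

```python
import pandas as pd

# Toy frame standing in for rawData; only the timestamp column matters here.
toy = pd.DataFrame({
    "reviewText": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "unixReviewTime": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})

# Keep the sorted result, since sort_values does not modify the frame in place.
ordered = toy.sort_values("unixReviewTime").reset_index(drop=True)

# Assumed 60/20/20 split: oldest reviews for training, newest for testing.
n = len(ordered)
train_end = n * 6 // 10
cv_end = n * 8 // 10

train = ordered.iloc[:train_end]
cv = ordered.iloc[train_end:cv_end]
test = ordered.iloc[cv_end:]

print(len(train), len(cv), len(test))
```

Splitting by position after sorting keeps every training review older than every validation and test review, which a random split would not guarantee.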


In [29]:
# Drop the columns not used from here on; note that 'helpful' is dropped as well
rawData.drop(['asin','reviewerName', 'reviewTime','reviewerID','helpful'], inplace = True, axis = 1)

In [39]:
rawData.head()

Unnamed: 0,reviewText,overall,summary,unixReviewTime
0,"Not much to write about here, but it does exac...",5.0,good,1393545600
1,The product does exactly as it should and is q...,5.0,Jake,1363392000
2,The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000
3,Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000
4,This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800


In [48]:
reviewTextNsummary = rawData[['reviewText','summary']]
ratings = rawData[['overall']]

In [49]:
reviewTextNsummary.head()
ratings.head()

Unnamed: 0,overall
0,5.0
1,5.0
2,5.0
3,5.0
4,5.0


In [54]:
def newRatings(vals):
    # Ratings above 3 are labelled positive (1.0); ratings of 3 or below are labelled negative (0.0)
    if vals > 3.0:
        return 1.0
    else:
        return 0.0

In [55]:
ratings = ratings['overall'].apply(newRatings)

### Text Data Processing

#### Data Filtering
1. Convert all sentences into lower case
2. Remove special symbols
3. Remove urls
4. Remove digits
5. Expand contractions, e.g. convert "i've" and "can't" into "i have" and "can not"
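
The filtering steps above can be sketched with the standard `re` module. The function name `clean_text` and the short `CONTRACTIONS` map are illustrative assumptions, and a real contraction list would be much longer. The steps run in a slightly different order from the list: URLs and contractions are handled before special symbols are stripped, because removing symbols first would destroy the `://` and apostrophes those steps rely on.

```python
import re

# Small illustrative contraction map (assumption); specific forms come first
# so that "can't" is expanded before the generic "n't" rule is applied.
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "i've": "i have",
    "it's": "it is",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
}


def clean_text(text):
    """Apply the filtering steps and return the cleaned string."""
    text = text.lower()                                 # lower case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove urls
    for short, full in CONTRACTIONS.items():            # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)                    # remove digits
    text = re.sub(r"[^a-z\s]", " ", text)               # remove special symbols
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace


print(clean_text("I've bought 2 cables; can't complain! See http://example.com"))
```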

#### NLP Processing
1. Tokenize sentences
2. Apply lemmatization
3. Convert the text into numerical vectors using
    1. Bag of Words
    2. TF-IDF
    3. Word2Vec
    4. Average word2vec
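
As a rough illustration of the vectorization options above, the sketch below builds Bag of Words and TF-IDF matrices with scikit-learn and averages word vectors for a sentence. The three-sentence `corpus` and the two-dimensional `toy_vectors` are made up for the example; in practice the word vectors would come from a trained Word2Vec model, and tokenization and lemmatization with nltk would run first. Neither of those steps is shown here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already cleaned reviews.
corpus = [
    "great pedal great sound",
    "cable stopped working",
    "great cable",
]

# Bag of Words: each column counts the occurrences of one vocabulary word.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)

# TF-IDF: counts are reweighted so words found in many reviews weigh less.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)

# Toy 2-d embeddings (assumption) standing in for trained Word2Vec vectors.
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "cable": np.array([0.0, 1.0]),
    "pedal": np.array([0.5, 0.5]),
}


def average_vector(sentence, vectors, dim=2):
    """Mean of the vectors of known words; zeros if no word is known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


print(bow_matrix.shape, tfidf_matrix.shape)
print(average_vector("great cable", toy_vectors))
```

Bag of Words and TF-IDF produce sparse matrices with one column per vocabulary word, while averaged word vectors give one short dense vector per review, so the choice affects which classifiers are practical on the full dataset.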