In this project we'll perform a sentiment analysis on "Health and Personal Care" product reviews from Amazon. The goal is to create a model that can accurately classify whether a review is positive or negative based on its text.

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [44]:
amazon_data = pd.read_json("Health_and_Personal_Care_5.json",lines=True)
amazon_data.head(3)

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,159985130X,"[1, 1]",5,This is a great little gadget to have around. ...,"01 5, 2011",ALC5GH8CAMAI7,AnnN,Handy little gadget,1294185600
1,159985130X,"[1, 1]",4,I would recommend this for a travel magnifier ...,"02 18, 2012",AHKSURW85PJUE,"AZ buyer ""AZ buyer""",Small & may need to encourage battery,1329523200
2,159985130X,"[75, 77]",4,What I liked was the quality of the lens and t...,"06 8, 2010",A38RMU1Y5TDP9,"Bob Tobias ""Robert Tobias""",Very good but not great,1275955200


In [45]:
amazon_data.shape

(346355, 9)

In [46]:
amazon_data.isnull().sum()

asin                 0
helpful              0
overall              0
reviewText           0
reviewTime           0
reviewerID           0
reviewerName      3051
summary              0
unixReviewTime       0
dtype: int64

The most relevant features for our model appear to be "overall" and "reviewText". For simplicity, we'll say any "overall" value greater than or equal to 4 indicates a positive rating.
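The labelling rule described here can be sketched on a toy frame. This is a minimal illustration only: the `ratings` frame, its values, and the `label` column are hypothetical stand-ins, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical star ratings standing in for the "overall" column.
ratings = pd.DataFrame({"overall": [5, 4, 3, 2, 1]})

# 1 = positive (4 or 5 stars), 0 = not positive (3 stars or fewer).
ratings["label"] = np.where(ratings["overall"] >= 4, 1, 0)

print(ratings)
```

Under this rule a 3-star review counts as not positive, which is a modelling choice rather than a fact about the data.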

## NLP Preprocessing

In [47]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [48]:
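# Build these once; rebuilding the stop-word set for every word is very slow
STOP_WORDS = set(stopwords.words('english'))
lmz = WordNetLemmatizer()

def text_cleaner(text_data):
    # keep letters only, lowercase, and split into words
    words = re.sub("[^a-zA-Z]", ' ', text_data).lower().split()
    # drop stop words and lemmatize what remains
    words = [lmz.lemmatize(word) for word in words if word not in STOP_WORDS]
    return ' '.join(words)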

Our dataset is rather large for building a bag-of-words model. Cleaning all 346,355 reviews and fitting models on the result is slow on an average computer, so we'll work with a random sample instead.
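One way to put a number on that cost is to time the cleaner on a small batch and scale linearly to the full row count. The sketch below assumes a simplified stand-in cleaner (`simple_cleaner`, hypothetical) so that it runs without NLTK data. Passing the real `text_cleaner` as `cleaner` would give a more relevant estimate.

```python
import re
import time

def simple_cleaner(text):
    # Hypothetical stand-in for text_cleaner: letters only, lowercased.
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())

def estimate_total_seconds(texts, total_rows, cleaner=simple_cleaner):
    """Time `cleaner` on `texts` and scale linearly to `total_rows`."""
    start = time.perf_counter()
    for text in texts:
        cleaner(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(texts) * total_rows

# A repeated review snippet used purely as timing input.
sample_texts = ["This is a great little gadget to have around."] * 1000
estimated = estimate_total_seconds(sample_texts, total_rows=346355)
print(f"Estimated cleaning time for all rows: {estimated:.1f} s")
```

The linear extrapolation is rough, since real reviews vary in length, but it helps decide whether sampling is needed at all.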

In [49]:
amz = amazon_data[["reviewText"]]
amz.head(3)

Unnamed: 0,reviewText
0,This is a great little gadget to have around. ...
1,I would recommend this for a travel magnifier ...
2,What I liked was the quality of the lens and t...


In [50]:
# let's take a random 25% sample of the data
amz = amz.sample(frac=0.25,random_state=42)

In [51]:
amz.shape

(86589, 1)

In [52]:
corpus = []
for text in amz["reviewText"]:
    clean_text = text_cleaner(text)
    corpus.append(clean_text)

In [53]:
len(corpus)

86589

In [54]:
# create the "Rating" column: TextBlob's polarity score (from -1 to 1) for each cleaned review
from textblob import TextBlob
amz["cleanText"] = corpus
amz["Rating"] = amz["cleanText"].apply(lambda x: TextBlob(x).sentiment[0])

In [60]:
# binary dependent variable: 1 if polarity > 0 (positive review), 0 otherwise (neutral or negative)
amz["RatingScore"] = np.where(amz["Rating"] > 0, 1,0)

In [61]:
amz.head(3)

Unnamed: 0,reviewText,cleanText,Rating,RatingScore
64943,I keep my scruff trimmed up great with this. W...,keep scruff trimmed great included guard go cl...,0.291667,1
270734,Great quality and smell could even be worn alo...,great quality smell could even worn alone care...,0.215,1
75565,Gaia Herbs comes through again with this fabul...,gaia herb come fabulous product liquid high qu...,0.386667,1


In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(
    max_df = 0.5,
    min_df = 2,
    use_idf = True,
    norm = u'l2',
    smooth_idf = True
)

In [62]:
X = tfidf_v.fit_transform(amz["cleanText"])
y = amz["RatingScore"].values

In [63]:
X.shape

(86589, 27890)

In [34]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

## Naive Bayes

In [68]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [69]:
print("Model accuracy score: ", clf.score(X_test,y_test))

Model accuracy score:  0.8099665088347384


In [72]:
# trying out Bernoulli Naive Bayes to see if there's a difference in accuracy (using tfidf matrix)
from sklearn.naive_bayes import BernoulliNB

clfb = BernoulliNB()
clfb.fit(X_train,y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [73]:
print("Model accuracy score: ", clfb.score(X_test,y_test))

Model accuracy score:  0.7730107402702391


Multinomial Naive Bayes performed better than Bernoulli Naive Bayes on this dataset (81.0% vs. 77.3% accuracy). A likely reason is that BernoulliNB binarizes its input (binarize=0.0 above), so it only sees whether a word is present and discards the tf-idf weights, whereas MultinomialNB uses the weights themselves.
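The effect of binarization on weighted features can be checked directly. The sketch below uses random, hypothetical weights in place of the tf-idf matrix and compares a BernoulliNB fitted on the weights with one fitted on plain presence/absence indicators.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Hypothetical weighted features: roughly half the entries are zero,
# the rest are positive weights, loosely mimicking a tf-idf matrix.
weights = rng.random((200, 20)) * (rng.random((200, 20)) > 0.5)
labels = rng.integers(0, 2, size=200)

# The same matrix reduced to presence/absence indicators.
presence = (weights > 0).astype(float)

nb_weighted = BernoulliNB().fit(weights, labels)
nb_presence = BernoulliNB().fit(presence, labels)

# With the default binarize=0.0, both models see identical inputs.
same = np.allclose(
    nb_weighted.predict_proba(weights),
    nb_presence.predict_proba(presence),
)
print("Identical predictions:", same)
```

Because the two models agree, the size of each weight has no influence on BernoulliNB with its default settings.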

## Logistic Regression

In [71]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [74]:
print("Model accuracy score: ", logreg.score(X_test,y_test))

Model accuracy score:  0.8559302459868345


## Random Forest

In [75]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [76]:
print("Model accuracy score: ", rfc.score(X_test,y_test))

Model accuracy score:  0.8211687261808522


## XGBoost

In [77]:
from xgboost import XGBClassifier

xgc = XGBClassifier()
xgc.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [78]:
print("Model accuracy score: ", xgc.score(X_test,y_test))

Model accuracy score:  0.820418062131886



Of all the models used above, logistic regression had the highest accuracy (85.6%). This may be attributable to how we set up the features: a tf-idf matrix as the independent variables and a binary dependent variable. The accuracy of the Naive Bayes models was a pleasant surprise, since it was unclear how well they would work with tf-idf features. The ensemble models (random forest and XGBoost) do not appear to be overfitting, but we cannot say for certain without cross-validation, which would be time-consuming on a dataset this large.
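If full cross-validation is too slow, one option is to run it on a smaller subsample. The sketch below is a minimal illustration on a hypothetical toy corpus (the `texts` and `labels` values are made up). It wraps the vectorizer and classifier in a pipeline so the tf-idf vocabulary is refit on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for a subsample of cleaned reviews.
texts = [
    "great product love it",
    "terrible waste of money",
    "works well highly recommend",
    "broke after one week awful",
] * 10
labels = [1, 0, 1, 0] * 10

# Vectorizer and model are fitted together inside each fold.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(pipeline, texts, labels, cv=5)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

A large gap between training accuracy and these fold scores would be a sign of overfitting.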

Let's explore the logistic regression model further.

In [80]:
from sklearn.metrics import confusion_matrix

y_pred = logreg.predict(X_test)
# use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[ 1287  2044]
 [  451 13536]]


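In scikit-learn's confusion matrix, rows are the actual labels and columns are the predicted labels, so the top row holds the 3,331 negative reviews and the bottom row the 13,987 positive reviews.

Reviews correctly classified: 14823/17318 (85.59%)

Reviews incorrectly classified: 2495/17318 (14.41%)

False Positive: 2044/3331 (61.36%; negative reviews classified as positive)

False Negative: 451/13987 (3.22%; positive reviews classified as negative)

Sensitivity: 13536/13987 (96.78%; positive reviews correctly classified)

Specificity: 1287/3331 (38.64%; negative reviews correctly classified)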
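These rates can be recomputed from the printed matrix. The sketch below assumes scikit-learn's layout (rows are actual labels, columns are predicted labels, class 0 first) and adds one extra figure that is not in the notebook: the accuracy of always predicting the majority class, as a point of comparison.

```python
import numpy as np

# Confusion matrix as printed above: rows = actual, columns = predicted.
cm = np.array([[1287, 2044],
               [451, 13536]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # share of positive reviews found
specificity = tn / (tn + fp)   # share of negative reviews found

# Accuracy of always predicting the larger class (an added baseline).
majority_baseline = max(tp + fn, tn + fp) / cm.sum()

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} baseline={majority_baseline:.4f}")
```

Since about four in five test reviews are labelled positive, the low specificity matters more than the headline accuracy suggests.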