# NLP - Sentiment Analysis for Amazon Product Reviews
# Naive Bayes Classifier - based on Bayes' Probability Theorem

In this notebook we will not be doing sentiment analysis based on a corpus of text, but we'll apply Naive Bayes' statistical method to calculate the probability of customers liking current whey protein products from Amazon. This method classify whether consumers will provide a positive comment or a negative review.

In [22]:
# Import libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Libraries for Naive Bayes
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from mlxtend.plotting import plot_decision_regions

In [2]:
# Read scraped results from CSV
df = pd.read_csv('Whey_Protein_Amazon_Preprocessed_Reviews.csv')

In [4]:
# Change data type for 'Review' to 'string' & fill empty cells (from CSV) with NA
df['Reviews'] = df['Reviews'].astype('string')
df = df.fillna('NA')
# Drop extra unnamed column
#col_0 = df.columns[0]
#df.drop(col_0, axis = 1, inplace = True)

In [5]:
#!python -m nltk.downloader stopwords

In [6]:
print(df.dtypes)

ID                int64
Product_Name     object
Date             object
Rating_Score    float64
Reviews          string
Link             object
Product_ID       object
dtype: object


## Naive Bayes Classifier Approach

In [7]:
# Tfidf Vectorizer
stopset= set(stopwords.words('english'))
vectorizer= TfidfVectorizer(use_idf = True, \
                            lowercase = True, \
                            token_pattern = '[a-zA-Z.0-9+#-/]*[^.\s]', \
                            strip_accents = 'ascii', \
                            stop_words = stopset)

In [8]:
# A) We assign new column with values of 'one' to 3, 4 & 5 star-comments, and a 'zero' to 1 & 2 star-comments
# B) We assign new column with values of 'one' to 4 & 5 star-comments, and a 'zero' to 1, 2 & 3 star-comments
df['sentiments'] = df['Rating_Score'].apply(lambda x: 0 if x in [1, 2] else 1)
#df['sentiments'] = df['Rating_Score'].apply(lambda x: 0 if x in [1, 2, 3] else 1)

In [9]:
# In this case our dependant variable will be 'sentiments' as 0 (didn't liket) 
# OR 1 (did like the product or are neutral)
y = df.sentiments.values
X = df.Reviews.values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(y)

In [10]:
# We split the data into 80% train and 20% test parts
X_train, X_test, y_train, y_test = train_test_split(X, encoded_labels, stratify = encoded_labels)

In [11]:
# We store words in a dictionary called ‘word_counts’. All the unique words in the corpus are stored in ‘vocab’
vec = CountVectorizer(max_features = 3000)
X_trained = vec.fit_transform(X_train)
vocab = vec.get_feature_names_out()
X_trained = X_trained.toarray()
word_counts = {}
for l in range(2):
    word_counts[l] = defaultdict(lambda: 0)
for i in range(X_trained.shape[0]):
    l = y_train[i]
    for j in range(len(vocab)):
        word_counts[l][vocab[j]] += X_trained[i][j]

In [12]:
# We need to perform Laplace smoothing to handle words in the test set which are absent in the training set. 
# We define a function ‘laplace_smoothing’ which takes the vocabulary and the raw ‘word_counts’ dictionary and 
# returns the smoothened conditional probabilities.
def laplace_smoothing(n_label_items, vocab, word_counts, word, text_label):
    a = word_counts[text_label][word] + 1
    b = n_label_items[text_label] + len(vocab)
    return math.log(a/b)

In [13]:
# We define the ‘fit’ and ‘predict’ functions for our classifier
def group_by_label(x, y, labels):
    data = {}
    for l in labels:
        data[l] = x[np.where(y == l)]
    return data

In [14]:
def fit(x, y, labels):
    n_label_items = {}
    log_label_priors = {}
    n = len(x)
    grouped_data = group_by_label(x, y, labels)
    for l, data in grouped_data.items():
        n_label_items[l] = len(data)
        log_label_priors[l] = math.log(n_label_items[l] / n)
    return n_label_items, log_label_priors

In [15]:
#!python -m nltk.downloader punkt

In [16]:
def predict(n_label_items, vocab, word_counts, log_label_priors, labels, x):
    result = []
    for text in x:
        label_scores = {l: log_label_priors[l] for l in labels}
        words = set(sent_tokenize(text))
        for word in words:
            if word not in vocab: continue
            for l in labels:
                log_w_given_l = laplace_smoothing(n_label_items, vocab, word_counts, word, l)
                label_scores[l] += log_w_given_l
        result.append(max(label_scores, key = label_scores.get))
    return result

In [17]:
labels = [0,1]
n_label_items, log_label_priors = fit(X_train, y_train, labels)
pred = predict(n_label_items, vocab, word_counts, log_label_priors, labels, X_test)
print("Accuracy of prediction on test set : ", accuracy_score(y_test, pred))

Accuracy of prediction on test set :  0.9445114595898673


# Conclusion 1 - Case when out hypothesis states that 3-star ratings are biased towards ‘positive’ sentiment: The classifier is now fitted on the X_train and is used to predict labels for the X_test. The accuracy of the positive sentiment prediction on the test set comes out to be 94.45%, which is excellent! This means that the probability of a customer liking current available whey protein products is 94.5%. 

# Conclusion 2 - Case when out hypothesis states that 3-star ratings are biased towards ‘negative’ sentiment: The classifier is now fitted on the X_train and is used to predict labels for the X_test. The accuracy of the positive sentiment prediction on the test set comes out to be 85.65%. This means that the probability of a customer liking current available whey protein products decreases 8.8% according to this model/classifier.