In [None]:
from IPython import display
display.Image("https://www.filehippopc.online/wp-content/uploads/2021/04/sentiment-analysis.png")

# Sentiment Analysis
Sentiment analysis of product reviews is an applied problem that has recently become very popular in text mining and computational linguistics research. Here, we study the correlation between Amazon product reviews and the ratings that customers gave to the products. We use traditional machine learning algorithms: Naive Bayes, Support Vector Machines, and Logistic Regression. Comparing their results gives us a better understanding of these algorithms. Such classifiers could also act as a supplement to other fraud scoring and detection methods.

# Importing packages and loading data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [None]:
df= pd.read_csv('1429_1.csv')

In [None]:
df.head(10)

# Dataset Description
Our dataset comes from **Consumer Reviews of Amazon Products**. This dataset has **34660** data points in total. Each example includes the type and name of the product as well as the review text and the rating of the product. To make better use of the data, we first extract the rating and review columns, since these two are the essential part of this project. While going through the data, we found some data points that have no rating. After eliminating those examples, we have **34627** data points in **total**.

In [None]:
print("Data points before elimination : ",len(df))
df=df.dropna(subset=["reviews.rating"])
print("Data points after elimination : ",len(df))
df.head(2)

In [None]:
df.head(10)

# Exploratory Data Analysis (EDA)
To get a brief overview of the dataset, we plot the distribution of the ratings. It shows that we have 5 classes (ratings 1 to 5) and how the reviews are distributed among them. These five classes are imbalanced: class 1 and class 2 have only a small amount of data, while class 5 has more than 20000 reviews.

In [None]:
sns.countplot(x='reviews.rating', data=df)

plt.title('Distribution of rating scores')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

### Due to the high imbalance of our dataset, we found and added more data points with low ratings from other sources.
We think this might help mitigate the problem of data imbalance.

In [None]:
# Load the additional datasets
df2 = pd.read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv")
df3 = pd.read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv")

# Eliminate data points that have no rating
df2 = df2.dropna(subset=["reviews.rating"])
df3 = df3.dropna(subset=["reviews.rating"])

# Keep only reviews rated 3 or lower, resetting the index after filtering rows
df2 = df2[df2["reviews.rating"] <= 3].reset_index(drop=True)
df3 = df3[df3["reviews.rating"] <= 3].reset_index(drop=True)
# Print the counts explicitly: a bare expression is displayed only when it is the last line of a cell
print(df2['reviews.rating'].value_counts().sort_index(ascending=False))
print(df3['reviews.rating'].value_counts().sort_index(ascending=False))

# Concatenate all three datasets; ignore_index avoids duplicate index labels
data = pd.concat([df, df2, df3], ignore_index=True)
len(data)

In [None]:
sns.countplot(x='reviews.rating', data=data)

In [None]:
data.describe()

Based on the descriptive statistics above, we see the following:

* The average review score has decreased to 4.38, with a low standard deviation
* Most reviews are positive from the 2nd quartile onwards
* The average number of people who found a review helpful (reviews.numHelpful) is 0.65, but with a high standard deviation
* The data are quite spread out around the mean; since the number of people finding a review helpful cannot be negative, the spread lies entirely in the right tail
* Most reviews were found helpful by between 0 and 13 people (reviews.numHelpful)
* The most helpful review was helpful to 814 people
* This could be a detailed, rich review that is worth looking at

In [None]:
data.info()

Based on the information above:

* Drop reviews.userCity, reviews.userProvince, reviews.id, and reviews.didPurchase since these values are floats (for exploratory analysis only)
* Not every column is fully populated relative to the total number of rows
* The reviews.text column has almost no missing data (37727/37728) -> Good news!
* We need to clean up the name column by referencing asins (unique products), since it has 7000 missing values

In [None]:
data["asins"].unique()

In [None]:
asins_unique = len(data["asins"].unique())
print("Number of Unique ASINs: " + str(asins_unique))

Next, we will explore the following columns:

* asins
* reviews.rating
* (reviews.numHelpful - not possible since numHelpful is only between 0-13 as per previous analysis in Raw Data)
* (reviews.text - not possible since the text is long, free-form prose)

# reviews.rating / ASINs

In [None]:
asins_count_ix = data["asins"].value_counts().index
plt.subplots(2,1,figsize=(16,12))
plt.subplot(2,1,1)
data["asins"].value_counts().plot(kind="bar", title="ASIN Frequency",color=['yellow', 'red', 'blue'])
plt.subplot(2,1,2)
sns.pointplot(x="asins", y="reviews.rating", order=asins_count_ix, data=data )
plt.xticks(rotation=90)
plt.show()

* 1a) The most frequently reviewed products have average review ratings in the 4.5 - 4.8 range, with little variance
* 1b) Although there is a slight inverse relationship between ASIN frequency and average review rating for the first 4 ASINs, this relationship is not significant, since the averages for these 4 ASINs all fall between 4.5 and 4.8, which counts as a good overall rating
* 2a) For ASINs with lower frequencies in the bar graph (top), the corresponding average review ratings in the point plot (bottom) have significantly higher variance, as shown by the length of the vertical lines. We therefore suggest that the average review ratings for low-frequency ASINs are not significant for our analysis
* 2b) On the other hand, the lower review frequency of these ASINs may itself be a result of lower-quality products
* 2c) Furthermore, the last 4 ASINs show no variance because of their very low frequencies. Although their review ratings are a perfect 5.0, we should not treat these ratings as significant, for the reason given in 2a)

**Note that the point plot automatically takes the average of the reviews.rating data for each ASIN.**
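The per-ASIN averages that the point plot draws can also be inspected as numbers. The sketch below uses a small hypothetical DataFrame (the ASIN labels `A` and `B` and their ratings are made up for illustration) to show why a product with very few reviews has an average that tells us little about its spread.

```python
import pandas as pd

# Hypothetical toy data: product "A" has three reviews, product "B" only one.
toy = pd.DataFrame({
    "asins": ["A", "A", "A", "B"],
    "reviews.rating": [5, 4, 5, 5],
})

# Review count, mean rating, and sample standard deviation per ASIN.
summary = toy.groupby("asins")["reviews.rating"].agg(["count", "mean", "std"])
print(summary)
```

With a single review, the sample standard deviation is undefined (NaN), which is the numerical counterpart of the "no variance" observation in 2c).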

# Analysis 
Using the features in place, we will build a classifier that can determine a review's sentiment.

## Set Target Variable (Sentiments)
Segregate ratings from 1-5 into positive, neutral, and negative.

In [None]:
def sentiments(rating):
    if (rating == 5) or (rating == 4):
        return "Positive"
    elif rating == 3:
        return "Neutral"
    elif (rating == 2) or (rating == 1):
        return "Negative"
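As a quick check of the mapping, the sketch below applies an equivalent dictionary-based lookup to a few hypothetical ratings. The name `RATING_TO_SENTIMENT` is introduced here for illustration only and is not used elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical toy ratings; the notebook itself uses data["reviews.rating"].
toy_ratings = pd.Series([5, 4, 3, 2, 1])

# Dictionary equivalent of the sentiments() function (assumed name, for illustration).
RATING_TO_SENTIMENT = {
    5: "Positive", 4: "Positive",
    3: "Neutral",
    2: "Negative", 1: "Negative",
}

toy_sentiments = toy_ratings.map(RATING_TO_SENTIMENT)
print(toy_sentiments.tolist())

# A rating outside 1-5 has no entry in the dictionary and maps to NaN.
unknown = pd.Series([6]).map(RATING_TO_SENTIMENT)
print(unknown.isna().tolist())
```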

# Splitting Dataset into Train and Test Set
* Before we build the classifier, we split the dataset into a training set and a test set
* Our goal is to eventually train a sentiment analysis classifier
* Since the majority of reviews are positive (5 stars), we do a stratified split on the review score so that the training and test sets keep the same rating proportions as the full dataset (stratification preserves the class proportions; it does not remove the imbalance)
* To stratify with sklearn's **train_test_split** function, we convert all review ratings to the **integer** datatype
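To illustrate what stratification does, here is a minimal sketch on hypothetical labels with an 80/10/10 class split. The label counts and placeholder texts are made up; only `train_test_split` and its `stratify` parameter come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 positive, 10 neutral, 10 negative.
labels = pd.Series(["Positive"] * 80 + ["Neutral"] * 10 + ["Negative"] * 10)
texts = pd.Series([f"review {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)

# Class counts in each split follow the proportions of the full label set.
train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Both splits remain imbalanced in the same way as the original data, so any rebalancing has to happen separately.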

In [None]:
from sklearn.model_selection import train_test_split

data["reviews.rating"] = data["reviews.rating"].astype(int)
data["Sentiment"] = data["reviews.rating"].apply(sentiments)

X_train, X_test, y_train, y_test = train_test_split(
    data["reviews.text"], data["Sentiment"], test_size=0.20, random_state=42,
    stratify=data["reviews.rating"])

In [None]:
print("Training Sample :",len(X_train))
print("Testing sample :", len(X_test))

# Feature Engineering and Selection
Here we turn the review text into numerical feature vectors using the **Bag of Words** strategy:

* **Assign a fixed integer id to each word occurring in the training set (a dictionary from words to integer indices)**
* **For each document i, count the occurrences of each word and store the count in X[i,j], where j is the index of that word in the dictionary and X is the resulting document-term matrix (built from our training set)**


To implement the **Bag of Words** strategy, we use scikit-learn's **CountVectorizer**, which performs the following:

* Text preprocessing:
    * Tokenization (breaking sentences into words)
    * Stopwords (filtering "the", "are", etc.; this step is optional and only applied when the **stop_words** parameter is set, which the default **CountVectorizer()** used here does not do)
* Occurrence counting (builds a dictionary mapping each word to an integer index and counts how often each word occurs)
* Feature Vector (converts the collection of text documents into a matrix of feature vectors)
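The steps above can be sketched on a tiny hypothetical corpus. The two sentences are made up for illustration; `CountVectorizer`, its `vocabulary_` attribute, and the `stop_words` parameter are part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the review text.
docs = ["great tablet great price", "the battery is bad"]

vect = CountVectorizer()       # default settings: no stop-word filtering
X = vect.fit_transform(docs)   # sparse matrix with one row per document

# vocabulary_ maps each word to its column index j.
j = vect.vocabulary_["great"]
print("column of 'great':", j)
print("count in document 0:", X[0, j])
print("count in document 1:", X[1, j])

# With the built-in English stop-word list, words such as "the" are dropped.
vect_stop = CountVectorizer(stop_words="english")
vect_stop.fit(docs)
print(sorted(vect_stop.vocabulary_))
```

Each entry `X[i, j]` is the number of times word `j` occurs in document `i`, which is the document-term matrix described above.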

In [None]:
# Replace missing values ("nan") with a space so the vectorizer only receives strings
X_train = X_train.fillna(' ')
X_test = X_test.fillna(' ')
y_train = y_train.fillna(' ')
y_test = y_test.fillna(' ')

# Text preprocessing and occurrence counting
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
print("Training Sample :", X_train_counts.shape[0])
print("Distinct Words :", X_train_counts.shape[1])

Longer documents typically have higher average count values than shorter ones, even for words that carry very little meaning, so raw counts let long documents overshadow short documents on the same topic. To reduce this effect, we use **TfidfTransformer**:

* **Term Frequencies (Tf) divides the number of occurrences of each word in a document by the total number of words in that document**
* **Term Frequencies times Inverse Document Frequency (Tfidf) downscales the weights of words that occur in many documents (assigns less value to uninformative words, e.g. "the", "are", etc.)**
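A small numeric sketch may make the two weightings concrete. The count matrix below is hypothetical. Note that `TfidfTransformer` normalizes rows with the L2 norm by default, so `norm="l1"` is set explicitly here to match the "divide by total number of words" description. The manual IDF formula shown is the one scikit-learn documents for its default `smooth_idf=True` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 2 terms.
# Term 0 occurs in every document (like "the"); term 1 occurs in only one.
counts = np.array([[3, 0],
                   [2, 0],
                   [1, 1]])

# Term frequencies only: each row is divided by its total word count.
tf_only = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts).toarray()
print(tf_only)

# Default transformer: use_idf=True, smooth_idf=True, norm="l2".
tfidf = TfidfTransformer().fit(counts)

# Smoothed IDF computed by hand: ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(tfidf.idf_, manual_idf)
```

The term that appears in every document receives the smallest IDF weight, which is how common words end up contributing less to the feature vectors.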

In [None]:
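from sklearn.feature_extraction.text import TfidfTransformer
# use_idf=False keeps only the normalized term frequencies (no IDF weighting);
# the pipelines below use the default TfidfTransformer(), which applies IDF as well
tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)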

In [None]:
from sklearn.pipeline import Pipeline
clf_multiNB_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf_nominalNB", MultinomialNB())])
clf_multiNB_pipe.fit(X_train, y_train)
predictedMultiNB = clf_multiNB_pipe.predict(X_test)
np.mean(predictedMultiNB == y_test)

In [None]:
from sklearn.linear_model import LogisticRegression
clf_logReg_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf_logReg", LogisticRegression())])
clf_logReg_pipe.fit(X_train, y_train)

predictedLogReg = clf_logReg_pipe.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, predictedLogReg)))

In [None]:
from sklearn.svm import LinearSVC
clf_linearSVC_pipe = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer()), ("clf_linearSVC", LinearSVC())])
clf_linearSVC_pipe.fit(X_train, y_train)

predictedLinearSVC = clf_linearSVC_pipe.predict(X_test)
np.mean(predictedLinearSVC == y_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictedLinearSVC))
print('Accuracy: {}'.format(accuracy_score(y_test, predictedLinearSVC)))

In [None]:
from sklearn import metrics
metrics.confusion_matrix(y_test, predictedLinearSVC)
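The confusion matrix is returned as a bare array, so it can be hard to tell which row and column belong to which sentiment. The sketch below uses hypothetical true and predicted labels to show one way of labelling it. Passing `labels=` fixes the class order, and the `true`/`pred` prefixes are only an illustrative naming choice.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted sentiments for six reviews.
y_true = ["Positive", "Positive", "Positive", "Neutral", "Negative", "Negative"]
y_pred = ["Positive", "Positive", "Neutral", "Positive", "Negative", "Positive"]

# Fix the class order explicitly so rows and columns are unambiguous.
labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are the true classes, columns are the predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)
print(cm_df)
```

Reading along a row shows where reviews of one true class ended up, which makes it easier to see how often the minority classes are predicted as "Positive".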