# Sentiment Analysis Using Text Classification Techniques on Amazon Reviews Dataset

- Libraries Used:  __Pandas, Contractions, Numpy, nltk, re, sklearn__
- Time to Run the code: __Approximately 4 minutes__


In [None]:
import pandas as pd
import contractions
import numpy as np
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import re


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hetvishah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hetvishah/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/hetvishah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hetvishah/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Read Data
The following cell reads the data.tsv file (available at [Amazon Reviews Dataset](https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-beauty-v1-00-tsv-zip)), stored in the working directory. Since the file is tab-separated, I pass a tab separator to read it. I have also added the "on_bad_lines" parameter, which handles malformed lines in the data by skipping them.

In [None]:
dataFile = pd.read_csv(r'data.tsv', sep='\t', on_bad_lines = 'skip')
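
To illustrate what skipping bad lines means, here is a minimal standard-library sketch. It is not how pandas implements the option; it only assumes that a "bad line" is a row with more fields than the header, and the helper name `read_tsv_skipping_bad_lines` and the sample text are made up for this example.

```python
import csv
import io

def read_tsv_skipping_bad_lines(text):
    """Parse tab-separated text, dropping rows with more fields than the header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows, skipped = [], 0
    for row in reader:
        if len(row) > len(header):
            skipped += 1  # too many fields: treat as a bad line and skip it
            continue
        # Pad short rows with None so every row has the same columns
        row = row + [None] * (len(header) - len(row))
        rows.append(dict(zip(header, row)))
    return rows, skipped

# Hypothetical sample: one good row, one row with extra fields, one short row
sample = "star_rating\treview_body\n5\tgreat\n1\tbad\textra\tfields\n3\n"
rows, skipped = read_tsv_skipping_bad_lines(sample)
print(len(rows), "rows kept,", skipped, "skipped")
```

The short row is kept with a missing value, which is one reason the later cells still need to filter and cast the columns.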



## Keep Reviews and Ratings
In the next cell, I filter the columns required for the analysis, namely "review_body", "review_headline" and "star_rating". After carefully analyzing the "star_rating" column, I noticed some unwanted date-type strings. To clean that, I remove all the rows whose rating is not numeric. Secondly, to make sure all the rows are of the same datatype, I convert the columns "review_body" and "review_headline" to string type.

In [None]:
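reducedDataFile = dataFile.filter(['review_body','review_headline','star_rating'])
#Keep only the rows whose star_rating parses as a number
reducedDataFile = reducedDataFile[pd.to_numeric(reducedDataFile['star_rating'], errors='coerce').notnull()]
reducedDataFile = reducedDataFile.astype({'star_rating':'int'})
#Cast both text columns to str so the string operations below work on every row
reducedDataFile = reducedDataFile.astype({'review_body':'str','review_headline':'str'})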
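
The "coerce, then drop" idea used above can be sketched without pandas. This is only an illustration under the assumption that a value counts as numeric when `float()` accepts it; the helper `is_numeric` and the sample values are hypothetical, and edge cases such as the string "nan" may be treated differently by pandas.

```python
def is_numeric(value):
    """Return True when the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical mix of ratings like the ones described above
ratings = [5, "4", "2015-04-02", None, "3"]

# Drop everything that is not numeric, then cast the survivors to int
kept = [int(r) for r in ratings if is_numeric(r)]
print(kept)
```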

## We form three classes and select 20000 reviews randomly from each class.
In the following cell, as per the homework requirement, I assign a class to every review based on its star rating. There are 3 classes, distributed as:
- Class 1 -> Star rating 1 and 2
- Class 2 -> Star rating 3
- Class 3 -> Star rating 4 and 5



The function "classes(x)" returns the class a star rating belongs to, as per the above requirement. I apply it to the "star_rating" column and use the pandas assign function to add the result as a new dataframe column.
Once the classes are created, the code downsamples the dataset and randomly selects 60K rows (20K from each class) for further analysis.

In [None]:
def classes(x):
    if x == 1 or x == 2:
        temp = 1
    elif x == 3:
        temp = 2
    else:
        temp = 3
    return temp
#Map every star rating to its class label
tempClassList = reducedDataFile['star_rating'].apply(classes)
reviewDataFrame = reducedDataFile.assign(classes = tempClassList)
reviewDataFrame = reviewDataFrame.groupby(['classes']).sample(n=20000)
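
The class mapping and the balanced downsampling can also be sketched in plain Python. This is a minimal illustration, not the notebook's code: `rating_to_class`, `balanced_sample`, the `seed` argument and the toy rows are assumptions for the example. Unlike the cell above, it fixes a seed, so the same rows are drawn on every run.

```python
import random
from collections import Counter, defaultdict

def rating_to_class(star_rating):
    """Map a star rating to class 1 (1-2 stars), 2 (3 stars) or 3 (everything else)."""
    if star_rating in (1, 2):
        return 1
    if star_rating == 3:
        return 2
    return 3

def balanced_sample(rows, per_class, seed=0):
    """Draw the same number of rows from every class, reproducibly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[rating_to_class(row["star_rating"])].append(row)
    picked = []
    for label in sorted(by_class):
        picked.extend(rng.sample(by_class[label], per_class))
    return picked

# Hypothetical toy data: ten reviews with unbalanced ratings
toy_ratings = [1, 2, 3, 3, 4, 5, 5, 1, 3, 4]
rows = [{"star_rating": r, "review_body": f"review {i}"} for i, r in enumerate(toy_ratings)]

picked = balanced_sample(rows, per_class=2)
counts = Counter(rating_to_class(row["star_rating"]) for row in picked)
print(dict(counts))
```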

# Data Cleaning



# Pre-processing
For the preprocessing part, the snippet below cleans the "review_body" column. I have commented the meaning of every line below. An important observation in the cleaning process was that stop word removal strips many negative words from the sentence. To avoid that, I tag the first instance of each of 3 negative words, not, no and never, as NEG, NEGATE and NEGATIVE respectively (Reference: https://reader.elsevier.com/reader/sd/pii/S1877050913001385?token=F183B53A28D401B590A6DD7C190DFCBBA1D4F045114C33277B8BB4789B7ADCB9B9514DF3AC285AADDD9D5B2932E4EB1D&originRegion=us-east-1&originCreation=20230125105004 ). This preserves the negation of a sentence. I also observed that the stopword list contains words like weren, haven, hasn, etc., which show negation. The contractions library cannot fix these, so I manually expand them to their original meaning.

In [None]:
#Character count before preprocessing
BeforeCleaning = sum(reviewDataFrame['review_body'].str.len())/len(reviewDataFrame['review_body'])

#FOR REVIEW BODY
#Tokenization

#Convert reviews to lowercase
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.lower())
#Remove any data inside brackets
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub(r'\[.*?\]', '', x))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub(r'\(.*?\)', '', x))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub(r'\{.*?\}', '', x))
#Remove web URLs from the reviews (this pattern also drops the rest of the line after the first https:// link)
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub(r'https:\/\/.*', '', x))
#Remove HTML tags
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub('<[^<]+?>', '', x))
#Remove words containing digits
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
#Remove '\n' from the text. I noticed some text containing them.
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x:  re.sub('\n', '', x))
#Fix words like couldn't -> could not using the contraction library
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: contractions.fix(x))
#Manually include the negative stopwords in a sentence
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" weren ", " were not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" haven ", " have not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" wouldn ", " would not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" hasn ", " has not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" shouldn ", " should not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" hadn ", " had not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" mightn ", " might not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" couldn ", " could not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" wasn ", " was not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" needn ", " need not "))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" not ", " NEG ",1))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" no ", " NEGATE ",1))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: x.replace(" never ", " NEGATIVE ",1))
#Replace every non-alphabetic character with a space
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
#Removing any extra space
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: " ".join(x.split()))
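
The negation tagging step can be shown in isolation. The sketch below reuses the three replacements from the cell above on a made-up sentence; the helper name `tag_first_negations` is hypothetical. It shows that only the first occurrence of each word is tagged, and that a word at the very start of the text is missed because the pattern requires a space on both sides.

```python
# (word, tag) pairs taken from the cleaning cell above
NEGATION_TAGS = [(" not ", " NEG "), (" no ", " NEGATE "), (" never ", " NEGATIVE ")]

def tag_first_negations(text):
    """Replace only the first occurrence of each negation word with its tag."""
    for word, tag in NEGATION_TAGS:
        text = text.replace(word, tag, 1)
    return text

tagged = tag_first_negations("it is not good and not cheap , no refund , never again")
print(tagged)

# A leading "not" has no space before it, so it is left untouched
print(tag_first_negations("not bad"))
```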

The snippet below cleans the "review_headline" column. Here I have not included the negative tags, to keep the headline meaning intact, since a headline is generally short. To differentiate it from review_body, I have converted all headlines to uppercase and concatenated them with review_body.

In [None]:
#FOR REVIEW Headline

#Data Preprocessing

#Convert text to uppercase
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: x.upper())
#Remove text inside brackets
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub(r'\[.*?\]', '', x))
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub(r'\(.*?\)', '', x))
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub(r'\{.*?\}', '', x))
#Remove web URLs (the lowercase pattern only matches text that is still lowercase)
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub(r'https:\/\/.*', '', x))
#Remove HTML tags
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub('<[^<]+?>', '', x))
#Remove words containing digits
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
#Remove '\n'
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x:  re.sub('\n', '', x))
#Fix contracted words
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: contractions.fix(x))
#Negation-tag patterns kept from review_body; they are lowercase, so they rarely match the uppercased headline
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: x.replace(" not ", " NEG ",1))
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: x.replace(" no ", " NEGATE ",1))
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: x.replace(" never ", " NEGATIVE ",1))
#Remove symbols
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))
#Remove extra space
reviewDataFrame['review_headline'] = reviewDataFrame['review_headline'].apply(lambda x: " ".join(x.split()))

#Concatenate the cleaned headline and the cleaned body
reviewDataFrame['review_body'] = reviewDataFrame['review_headline'] + ' ' + reviewDataFrame['review_body']
AfterCleaning = sum(reviewDataFrame['review_body'].str.len())/len(reviewDataFrame['review_body'])

print("Average length of reviews before cleaning: ", BeforeCleaning)
print("Average length of reviews after cleaning: ", AfterCleaning)

Average length of reviews before cleaning:  269.00935
Average length of reviews after cleaning:  277.7574833333333


The average length increases because I am concatenating review_headline to review_body. It might also have increased due to the negation tags, which are longer than the words they replace.

## remove the stop words
Here, using the NLTK library's built-in stopword list, I remove all stop words from the review. This gets rid of the common words and improved the accuracy, as the review is no longer dominated by them.

In [None]:
from nltk.corpus import stopwords
BeforePreprocessing = AfterCleaning
filtered_words = set(stopwords.words('english'))
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: ' '.join([word for word in x.split() if word not in filtered_words]))
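
Because the stopword set is lowercase and the comparison is case-sensitive, the uppercase negation tags and the uppercase headline survive this step, while any remaining lowercase negation words are removed. The sketch below shows this with a small hypothetical stopword subset standing in for the NLTK list; `STOP_WORDS` and `remove_stop_words` are names made up for the example.

```python
# Hypothetical subset standing in for the NLTK English stopword list
STOP_WORDS = {"it", "is", "and", "not", "no", "the", "a"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in the stopword set."""
    return " ".join(word for word in text.split() if word not in stop_words)

# Uppercase headline + tagged body, as produced by the cleaning cells
cleaned = remove_stop_words("GREAT BUY it is NEG good and not cheap")
print(cleaned)
```

The first "not" lives on as NEG, whereas the second, untagged "not" is filtered out.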

## perform lemmatization  
This snippet performs lemmatization. When the lemmatizer is used directly, without a part-of-speech argument, it treats every word as a noun.
For better pre-processing, I created a getPosTag function (Ref: StackOverflow). Each word in a sentence is first tagged with its part of speech; getPosTag then maps that tag to verb, adjective, noun or adverb and returns the corresponding lemmatizer parameter. Once the words are tagged, the lemmatizer can lemmatize each of them according to its part of speech. After finishing lemmatization, I noticed some one- and two-letter words still present in the reviews. These words were either typing errors or irrelevant to the review, so I remove all words of length less than or equal to 2 from the reviews.

In [None]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
def getPosTag(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def lemmatize(token):
    lemmatizer = WordNetLemmatizer()
    tokenizedList = pos_tag(word_tokenize(token))
    lemmStr = ''
    for val in tokenizedList:
        Tag = getPosTag(val[1])
        if Tag!= '':
            lemmStr+= ' '+ lemmatizer.lemmatize(val[0], Tag)
        else:
            lemmStr+= ' '+ val[0]
    return lemmStr
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: lemmatize(x))


#Remove tokens of length 1 or 2
reviewDataFrame['review_body'] = reviewDataFrame['review_body'].apply(lambda x: ' '.join([word for word in x.split() if len(word) >= 3]))
AfterPreprocessing = sum(reviewDataFrame['review_body'].str.len())/len(reviewDataFrame['review_body'])
print("Average length of reviews before pre-processing: ", BeforePreprocessing)
print("Average length of reviews after pre-processing: ", AfterPreprocessing)


Average length of reviews before pre-processing:  277.7574833333333
Average length of reviews after pre-processing:  170.36905
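
As a compact view of the two helper steps above, the sketch below maps Penn Treebank tags to WordNet part-of-speech codes by their first letter and then drops short tokens. It uses the plain strings 'a', 'v', 'n' and 'r', which are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV, so it runs without NLTK. The names `TREEBANK_TO_WORDNET`, `wordnet_pos` and `drop_short_tokens` are made up for this example.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def wordnet_pos(treebank_tag):
    """Return the WordNet POS code for a tag, or '' when there is no mapping."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], "")

def drop_short_tokens(text, min_length=3):
    """Keep only the tokens with at least min_length characters."""
    return " ".join(word for word in text.split() if len(word) >= min_length)

for tag in ["JJR", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", repr(wordnet_pos(tag)))

print(drop_short_tokens("it be a good buy"))
```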


# TF-IDF Feature Extraction
TF-IDF is the final feature extraction step. Here I use sklearn's train_test_split function to split the data into training and test sets, with 80% of the data used for training and 20% held out for testing. After splitting, I create a TfidfVectorizer object, fit it on the training reviews, and use it to create features for the model. The labels are the review classes.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

trainingData, testingData , trainY, testY = train_test_split(reviewDataFrame['review_body'].values,reviewDataFrame['classes'].values,test_size=0.2,random_state=123, stratify = reviewDataFrame['classes'].values )
vector = TfidfVectorizer()
trainX = vector.fit_transform(trainingData)
testX = vector.transform(testingData)
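
To make the TF-IDF weighting concrete, here is a small hand-rolled sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document vector, which to my understanding corresponds to TfidfVectorizer's default settings. Tokenisation is simplified to whitespace splitting, so the numbers will not match sklearn on real text. The function name `tfidf` and the toy documents are assumptions for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["good product good price", "bad product", "good value"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document, "bad" appears in fewer documents than "product", so it receives the larger weight.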


Once we have the features and labels ready, we use four linear models to compare accuracy.

# Perceptron

Using the sklearn inbuilt perceptron model, the output here is as follows:
- Class 1 precision, recall, f1-score
- Class 2 precision, recall, f1-score
- Class 3 precision, recall, f1-score
- Avg Precision, Avg recall, Avg f1-score


In [None]:
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report

#alpha: Regularization term multiplier (only takes effect when a penalty is set; the default penalty is None)
#eta0: Constant by which the updates are multiplied
#n_iter_no_change: Number of iterations with no improvement to wait before early stopping
#early_stopping: use early stopping to terminate training when validation score is not improving.
modelPerceptron = Perceptron(random_state=0,alpha = 0.2,eta0 = 10,n_iter_no_change = 10,early_stopping=True)

modelPerceptron.fit(trainX, trainY)
accuracyPerceptron = modelPerceptron.score(testX,testY)
Predictions = modelPerceptron.predict(testX)
reportPerceptron = classification_report(testY,Predictions, digits=5, output_dict = True)

print(reportPerceptron['1']['precision'],',',reportPerceptron['1']['recall'],',',reportPerceptron['1']['f1-score'])
print(reportPerceptron['2']['precision'],',',reportPerceptron['2']['recall'],',',reportPerceptron['2']['f1-score'])
print(reportPerceptron['3']['precision'],',',reportPerceptron['3']['recall'],',',reportPerceptron['3']['f1-score'])
print(reportPerceptron['macro avg']['precision'],',',reportPerceptron['macro avg']['recall'],',', reportPerceptron['macro avg']['f1-score'])

0.739521800281294 , 0.65725 , 0.6959629384513568
0.6130250117980179 , 0.6495 , 0.6307356154406409
0.7613501307344901 , 0.80075 , 0.7805531863043743
0.7046323142712674 , 0.7025 , 0.702417246732124


# SVM
Using the sklearn inbuilt linear SVM model with the "hinge" loss function, the output is as follows:
- Class 1 precision, recall, f1-score
- Class 2 precision, recall, f1-score
- Class 3 precision, recall, f1-score
- Avg Precision, Avg recall, Avg f1-score



In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

#tol: Tolerance for stopping criteria.
# C: strength of the regularization is inversely proportional to C
#intercept_scaling: scale of the synthetic intercept feature; a smaller value increases the effect of regularization on its weight (and therefore on the intercept).
#max_iter: maximum number of iterations to run
modelSVM = LinearSVC(loss  = 'hinge',tol = 1e-4, C = 0.7,intercept_scaling = 0.1, max_iter = 5000)

modelSVM.fit(trainX, trainY)
accuracySVM = modelSVM.score(testX,testY)
PredictionsSVM = modelSVM.predict(testX)
reportSVM = classification_report(testY,PredictionsSVM, digits=5, output_dict = True)

print(reportSVM['1']['precision'],',',reportSVM['1']['recall'],',',reportSVM['1']['f1-score'])
print(reportSVM['2']['precision'],',',reportSVM['2']['recall'],',',reportSVM['2']['f1-score'])
print(reportSVM['3']['precision'],',',reportSVM['3']['recall'],',',reportSVM['3']['f1-score'])
print(reportSVM['macro avg']['precision'],',',reportSVM['macro avg']['recall'],',', reportSVM['macro avg']['f1-score'])


0.7569959339870844 , 0.79125 , 0.7737440410707738
0.7166441136671178 , 0.662 , 0.6882391163092918
0.8273520853540253 , 0.853 , 0.8399803052683408
0.7669973776694091 , 0.7687500000000002 , 0.7673211542161354
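
For reference, the hinge loss named above can be written in a few lines for a single binary decision. This is a sketch of the loss only, not of how LinearSVC optimises it. My understanding is that LinearSVC handles the three classes with a one-vs-rest scheme by default, so each class gets its own binary problem with labels +1 and -1. The function name `hinge_loss` and the sample scores are assumptions.

```python
def hinge_loss(y_true, score):
    """Hinge loss for one sample; y_true is +1 or -1, score is the decision value w.x + b."""
    return max(0.0, 1.0 - y_true * score)

# Correct and outside the margin: no loss
print(hinge_loss(+1, 2.5))
# Correct but inside the margin: small loss
print(hinge_loss(+1, 0.3))
# Wrong side of the boundary: loss grows linearly with the score
print(hinge_loss(-1, 0.3))
```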


# Logistic Regression
Using the sklearn inbuilt Logistic Regression model, and the "saga" solver, the output is as follows:
- Class 1 precision, recall, f1-score
- Class 2 precision, recall, f1-score
- Class 3 precision, recall, f1-score
- Avg Precision, Avg recall, Avg f1-score

In [None]:
from sklearn.linear_model import LogisticRegression


modelLR = LogisticRegression(tol = 1e-4, C = 0.8,solver = 'saga',max_iter=10000)
modelLR.fit(trainX, trainY)
accuracyLR = modelLR.score(testX,testY)
PredictionsLR = modelLR.predict(testX)
reportLR = classification_report(testY,PredictionsLR, digits=5, output_dict = True)
print(reportLR['1']['precision'],',',reportLR['1']['recall'],',',reportLR['1']['f1-score'])
print(reportLR['2']['precision'],',',reportLR['2']['recall'],',',reportLR['2']['f1-score'])
print(reportLR['3']['precision'],',',reportLR['3']['recall'],',',reportLR['3']['f1-score'])
print(reportLR['macro avg']['precision'],',',reportLR['macro avg']['recall'],',', reportLR['macro avg']['f1-score'])



0.7647058823529411 , 0.78 , 0.7722772277227723
0.7054322876817138 , 0.6915 , 0.6983966670874889
0.8457114278569643 , 0.8455 , 0.8456057007125892
0.7719498659638732 , 0.7723333333333334 , 0.7720931985076168


# Naive Bayes
Using the sklearn inbuilt Multinomial Naive Bayes model and tuning the alpha (smoothing) hyperparameter, the output here is as follows:
- Class 1 precision, recall, f1-score
- Class 2 precision, recall, f1-score
- Class 3 precision, recall, f1-score
- Avg Precision, Avg recall, Avg f1-score

In [None]:
from sklearn.naive_bayes import MultinomialNB
modelNB = MultinomialNB(alpha = 20)
modelNB.fit(trainX, trainY)
accuracyNB = modelNB.score(testX,testY)
PredictionsNB = modelNB.predict(testX)
reportNB = classification_report(testY,PredictionsNB, digits=5, output_dict = True)
print(reportNB['1']['precision'],',',reportNB['1']['recall'],',',reportNB['1']['f1-score'])
print(reportNB['2']['precision'],',',reportNB['2']['recall'],',',reportNB['2']['f1-score'])
print(reportNB['3']['precision'],',',reportNB['3']['recall'],',',reportNB['3']['f1-score'])
print(reportNB['macro avg']['precision'],',',reportNB['macro avg']['recall'],',', reportNB['macro avg']['f1-score'])


0.7401613297482278 , 0.757 , 0.7484859720677296
0.6492504409171076 , 0.73625 , 0.6900187441424555
0.879632374740587 , 0.74175 , 0.8048284280482843
0.7563480484686408 , 0.745 , 0.747777714752823
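
To compare the four models side by side, the sketch below ranks them by the macro-averaged F1 scores printed in the outputs above. The numbers are copied from this run; since the 60K reviews are sampled without a fixed seed, a rerun may give slightly different values and possibly a different order. The variable names are made up for this example.

```python
# Macro-averaged F1 scores copied from the outputs above
macro_f1 = {
    "Perceptron": 0.702417246732124,
    "SVM": 0.7673211542161354,
    "Logistic Regression": 0.7720931985076168,
    "Naive Bayes": 0.747777714752823,
}

# Sort from best to worst macro F1
ranking = sorted(macro_f1.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.4f}")
```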
