## Kindle Review Sentiment Analysis Project

5-core dataset of product reviews from Amazon Kindle Store category from May 1996 - July 2014. Contains total of 982619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

This dataset is taken from Amazon product data, Julian McAuley, UCSD website.
Dataset link: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html

#### Best Practices 
- Data Preprocessing and Cleaning
- Train, Test split
- BOW, TF-IDF, Word2Vec
- Train ML Algorithms

In [27]:
import gzip
import json

In [28]:
data = []
with gzip.open('Kindle_dataset/reviews_Kindle_Store_5.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # skip empty lines
            data.append(json.loads(line))

In [29]:
import pandas as pd
df = pd.DataFrame(data)

In [30]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A1F6404F1VG29J,B000F83SZQ,Avidreader,"[0, 0]",I enjoy vintage books and movies so I enjoyed ...,5.0,Nice vintage story,1399248000,"05 5, 2014"
1,AN0N05A9LIJEQ,B000F83SZQ,critters,"[2, 2]",This book is a reissue of an old one; the auth...,4.0,Different...,1388966400,"01 6, 2014"
2,A795DMNCJILA6,B000F83SZQ,dot,"[2, 2]",This was a fairly interesting read. It had ol...,4.0,Oldie,1396569600,"04 4, 2014"
3,A1FV0SX13TWVXQ,B000F83SZQ,"Elaine H. Turley ""Montana Songbird""","[1, 1]",I'd never read any of the Amy Brewster mysteri...,5.0,I really liked it.,1392768000,"02 19, 2014"
4,A3SPTOKDG7WBLN,B000F83SZQ,Father Dowling Fan,"[0, 1]","If you like period pieces - clothing, lingo, y...",4.0,Period Mystery,1395187200,"03 19, 2014"


In [31]:
## selecting only the required columns
df = df[['reviewText', 'overall']]
df.head()

Unnamed: 0,reviewText,overall
0,I enjoy vintage books and movies so I enjoyed ...,5.0
1,This book is a reissue of an old one; the auth...,4.0
2,This was a fairly interesting read. It had ol...,4.0
3,I'd never read any of the Amy Brewster mysteri...,5.0
4,"If you like period pieces - clothing, lingo, y...",4.0


In [32]:
## renaming the "overall" column to 'rating'
df = df.rename(columns = {'overall': 'rating'})
df.head()

Unnamed: 0,reviewText,rating
0,I enjoy vintage books and movies so I enjoyed ...,5.0
1,This book is a reissue of an old one; the auth...,4.0
2,This was a fairly interesting read. It had ol...,4.0
3,I'd never read any of the Amy Brewster mysteri...,5.0
4,"If you like period pieces - clothing, lingo, y...",4.0


In [33]:
df.shape

(982619, 2)

In [34]:
## checking for missing values
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [35]:
df['rating'].unique()

array([5., 4., 3., 2., 1.])

In [36]:
## converting rating values to int
df['rating'] = df['rating'].astype(int)
df['rating'].dtype

dtype('int32')

In [37]:
df['rating'].unique()

array([5, 4, 3, 2, 1])

In [38]:
df['rating'].value_counts()

rating
5    575264
4    254013
3     96194
2     34130
1     23018
Name: count, dtype: int64

In [39]:
## categorizing the reviews positive - 1 and negative - 0;
df['rating'] = df['rating'].apply(lambda x: 1 if x > 3 else 0)

In [46]:
df['rating'].value_counts()

rating
1    829277
0    153342
Name: count, dtype: int64

In [47]:
## lowering all the reviews
df['reviewText'] = df['reviewText'].str.lower()

In [50]:
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\himan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [51]:
from bs4 import BeautifulSoup

In [61]:
## removing special characters
stop_words = set(stopwords.words('english'))

def clean_review_text(text):
    # Remove special characters (keep letters, numbers, and hyphens)
    text = re.sub(r'[^a-zA-Z0-9\s-]', '', text)
    
    # Remove stopwords
    text = " ".join([word for word in text.split() if word.lower() not in stop_words])
    
    # Remove HTML tags using BeautifulSoup
    text = BeautifulSoup(text, 'html.parser').get_text()

    # Remove extra spaces
    text = " ".join(text.split())
    
    return text


In [62]:
# Apply the function to the 'reviewText' column
df['reviewText'] = df['reviewText'].map(clean_review_text)

In [63]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage books movies enjoyed reading boo...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mysteries one reall...,1
4,like period pieces - clothing lingo enjoy myst...,1


In [64]:
## importing the lemmatizer
from nltk.stem import WordNetLemmatizer

In [65]:
## initializing lemmatizer object
lemmatizer = WordNetLemmatizer()

In [66]:
def lemma_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [67]:
df['reviewText'] = df['reviewText'].apply(lambda x: lemma_words(x))

In [68]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage book movie enjoyed reading book ...,1
1,book reissue old one author born 1910 era say ...,1
2,fairly interesting read old- style terminology...,1
3,id never read amy brewster mystery one really ...,1
4,like period piece - clothing lingo enjoy myste...,1


In [69]:
## Dependent and Independent Feature
X = df['reviewText']
y = df['rating']

In [70]:
## Train, Test and split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [72]:
## Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()

X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

In [73]:
## TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [88]:
## As our Matrix is Sparse, we are going to use Naive Bayes Algorithm
from sklearn.naive_bayes import MultinomialNB
nb_model_bow = MultinomialNB().fit(X_train_bow, y_train)
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)

In [76]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [77]:
y_pred_bow = nb_model_bow.predict(X_test_bow)

In [78]:
y_pred_tfidf = nb_model_bow.predict(X_test_tfidf)

In [79]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

BOW accuracy:  0.8931886181840386
TFIDF accuracy:  0.8646577517249802


In [80]:
print("Classification Report for Bag of Words:")
print(classification_report(y_test, y_pred_bow))

Classification Report for Bag of Words:
              precision    recall  f1-score   support

           0       0.70      0.54      0.61     30433
           1       0.92      0.96      0.94    166091

    accuracy                           0.89    196524
   macro avg       0.81      0.75      0.77    196524
weighted avg       0.89      0.89      0.89    196524



In [81]:
print("Classification Report for TF-IDF:")
print(classification_report(y_test, y_pred_tfidf))

Classification Report for TF-IDF:
              precision    recall  f1-score   support

           0       0.87      0.15      0.25     30433
           1       0.86      1.00      0.93    166091

    accuracy                           0.86    196524
   macro avg       0.87      0.57      0.59    196524
weighted avg       0.86      0.86      0.82    196524

