# Dataset 

A subset of the Amazon product reviews dataset, restricted to the Kindle Store category.

This is the 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains 982,619 entries in total. Every reviewer and every product in this dataset has at least 5 reviews.

## ***Columns***

1) asin - ID of the product, like B000FA64PK

2) helpful - helpfulness rating of the review, e.g. 2/3 (stored in the CSV as a list such as [2, 2]).

3) overall - rating of the product (1 to 5).

4) reviewText - text of the review (body).

5) reviewTime - time of the review (raw).

6) reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN

7) reviewerName - name of the reviewer.

8) summary - summary of the review (headline).

9) unixReviewTime - unix timestamp.
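
In the CSV the `helpful` column arrives as a string such as `"[2, 2]"`. A minimal sketch of turning it into a ratio, assuming the two numbers are the helpful-vote count followed by the total-vote count; `parse_helpful` is a hypothetical helper and is not used elsewhere in this notebook:

```python
import ast

def parse_helpful(raw):
    """Parse a string like "[2, 3]" into (helpful_votes, total_votes, ratio).

    Assumption: the first number counts helpful votes, the second all votes.
    The ratio is None when nobody voted.
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    ratio = helpful_votes / total_votes if total_votes else None
    return helpful_votes, total_votes, ratio

print(parse_helpful("[2, 2]"))  # (2, 2, 1.0)
print(parse_helpful("[0, 0]"))  # (0, 0, None)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for values read from a file.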


## Acknowledgement 

This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license for the data files belongs to them.

## Steps to perform 

1. Processing and cleaning the data
2. Splitting the data (to measure performance on held-out reviews and detect overfitting)
3. Applying Bag of Words / TF-IDF / average Word2Vec
4. Training the model
5. Predicting

## Loading the dataset 

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
df = pd.read_csv('/kaggle/input/kindle-review/kindle_reviews.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
2,2,B000F83SZQ,"[2, 2]",4,This was a fairly interesting read. It had ol...,"04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
3,3,B000F83SZQ,"[1, 1]",5,I'd never read any of the Amy Brewster mysteri...,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
4,4,B000F83SZQ,"[0, 1]",4,"If you like period pieces - clothing, lingo, y...","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200


## Information about the data 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982619 entries, 0 to 982618
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      982619 non-null  int64 
 1   asin            982619 non-null  object
 2   helpful         982619 non-null  object
 3   overall         982619 non-null  int64 
 4   reviewText      982597 non-null  object
 5   reviewTime      982619 non-null  object
 6   reviewerID      982619 non-null  object
 7   reviewerName    978797 non-null  object
 8   summary         982500 non-null  object
 9   unixReviewTime  982619 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 75.0+ MB


Only two columns are used for sentiment analysis: reviewText and overall, which is renamed to rating.

In [5]:
# renaming overall as rating 
df=df.rename(columns={'overall':'rating'})

In [6]:
df = df[['reviewText','rating']]

In [7]:
df.shape

(982619, 2)

In [8]:
df.isnull().sum()

reviewText    22
rating         0
dtype: int64

In [9]:
# drop the 22 rows with missing review text; this is negligible in a dataset of 900,000+ rows
df = df.dropna(subset=['reviewText'])

In [10]:
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [11]:
df['rating'].value_counts()

rating
5    575246
4    254010
3     96193
2     34130
1     23018
Name: count, dtype: int64

Most ratings are either 5 or 4. The count drops sharply at rating 3, and very few reviews have a rating of 2 or 1.

For a binary positive/negative split, the in-between reviews would have to be merged into one of the two classes (here, the low / negative one), which blurs the labels. The rows with a rating of 3 (neutral) are therefore dropped.
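
The imbalance can be quantified directly from the counts printed above. A small sketch that uses only those numbers; the variable names are illustrative:

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {5: 575246, 4: 254010, 3: 96193, 2: 34130, 1: 23018}

total = sum(rating_counts.values())
shares = {rating: count / total for rating, count in rating_counts.items()}

for rating, share in shares.items():
    print(f"rating {rating}: {share:.1%}")

# Class sizes once the neutral rating 3 is dropped.
positive = rating_counts[5] + rating_counts[4]
negative = rating_counts[2] + rating_counts[1]
print(f"positive: {positive}, negative: {negative}, ratio: {positive / negative:.1f} to 1")
```

The two class sizes computed here are the ones that the binarised `value_counts()` reports after the mapping step.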


## Preprocessing the data 


In [12]:
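## Drop neutral reviews (rating 3), then map positive reviews (4, 5) to 1 and negative reviews (1, 2) to 0.
df = df[df['rating'] != 3].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the next line
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)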

In [13]:
df.shape

(886404, 2)

In [14]:
df['rating'].unique()

array([1, 0])

In [15]:
df['rating'].value_counts()

rating
1    829256
0     57148
Name: count, dtype: int64

In [16]:
df['reviewText']=df['reviewText'].str.lower()

In [17]:
df.head()

Unnamed: 0,reviewText,rating
0,i enjoy vintage books and movies so i enjoyed ...,1
1,this book is a reissue of an old one; the auth...,1
2,this was a fairly interesting read. it had ol...,1
3,i'd never read any of the amy brewster mysteri...,1
4,"if you like period pieces - clothing, lingo, y...",1


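The data is highly imbalanced (829,256 positive vs. 57,148 negative reviews), so the positive class is undersampled to match the number of negative reviews. Note that the result is stored in `df_balanced`, while the cells that follow keep working on `df`; the models below are therefore still trained on the imbalanced data.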

In [18]:
positive = df[df['rating'] == 1]
negative = df[df['rating'] == 0]

positive_sampled = positive.sample(len(negative), random_state=42)

df_balanced = pd.concat([positive_sampled, negative], ignore_index=True).sample(frac=1, random_state=42)

## Cleaning of the data 


In [19]:
import re 
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
stop_words = set(stopwords.words('english'))

## Cleaning the text with regular expressions

In [20]:

## removing the special characters
## note: 'A-z' also covers [ \ ] ^ _ ` and '0_9' matches only 0, _ and 9 ('a-z0-9' was probably intended)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-z A-z 0_9]+','',x))
## removing the stopwords

df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))
## removing the urls (note: ':', '/' and '.' were already stripped above, so this pattern can no longer match)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'https?://\S+|www\.\S+', '', x))
## removing html tags (note: '<' and '>' were already stripped above, so no tags are left to parse)
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## removing any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x:" ".join(x.split()))
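
The order of the cleaning steps matters: once punctuation has been stripped, URLs and HTML tags can no longer be recognised. The sketch below shows one possible ordering (tags, entities, URLs, then special characters). It is an illustration only and not the code that produced the outputs in this notebook: `clean_review` and `DEMO_STOP_WORDS` are hypothetical names, the regex tag stripper is a crude stand-in for BeautifulSoup, and the tiny stop-word set replaces NLTK's English list so that the example runs without downloads.

```python
import html
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"a", "an", "and", "is", "it", "the", "this", "was"}

def clean_review(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()
    # 1. Strip HTML tags while '<' and '>' are still present.
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Decode entities such as "&amp;".
    text = html.unescape(text)
    # 3. Remove URLs while ':', '/' and '.' are still present.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # 4. Keep only lowercase letters, digits and spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # 5. Drop stop words; split/join also collapses repeated spaces.
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_review("This was <b>GREAT</b>! See http://example.com/x for more &amp; enjoy."))
# great see for more enjoy
```

Replacing unwanted characters with a space rather than an empty string keeps neighbouring words from being glued together.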

In [21]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage books movies enjoyed reading boo...,1
1,book reissue old one author born 90 era say ne...,1
2,fairly interesting read old style terminologyi...,1
3,id never read amy brewster mysteries one reall...,1
4,like period pieces clothing lingo enjoy myster...,1


In [22]:
from nltk.stem import WordNetLemmatizer

In [23]:
lemmatizer = WordNetLemmatizer()

In [24]:
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    

In [25]:
df['reviewText'] = df['reviewText'].apply(lemmatize_text)

In [26]:
df.head()

Unnamed: 0,reviewText,rating
0,enjoy vintage book movie enjoyed reading book ...,1
1,book reissue old one author born 90 era say ne...,1
2,fairly interesting read old style terminologyi...,1
3,id never read amy brewster mystery one really ...,1
4,like period piece clothing lingo enjoy mystery...,1


## Train test split 

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train,X_test,y_train,y_test = train_test_split(df['reviewText'],df['rating'],test_size = 0.2)
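
The split above passes neither `random_state` nor `stratify` to `train_test_split`, so it changes on every run and the share of negative reviews in the test set can drift. Both are real parameters of scikit-learn's `train_test_split`. The dependency-free sketch below only illustrates what a seeded, stratified split does; `stratified_split` is a hypothetical helper written for this example and is not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), keeping the class ratio in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a similar kind of imbalance: 90 positive, 10 negative.
toy_labels = [1] * 90 + [0] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print(sum(toy_labels[i] == 0 for i in test_idx), "negative reviews in the test part")
```

With the real data, the equivalent would presumably be to pass `stratify=df['rating']` and a fixed `random_state` to `train_test_split`.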

## Applying Bag of Words / TF-IDF

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()



In [30]:
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()


In [32]:
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
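
To make the difference between the two representations concrete, here is a hand-rolled sketch on three toy sentences. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to `TfidfVectorizer`'s default settings; the toy documents and variable names are made up for illustration.

```python
import math
from collections import Counter

docs = ["good book good story", "bad book", "good read"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of Words: raw term counts per document.
bow_rows = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

# TF-IDF: count * idf, then scale each row to unit Euclidean length.
tfidf_rows = []
for row in bow_rows:
    weighted = [count * idf[word] for count, word in zip(row, vocab)]
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf_rows.append([value / norm for value in weighted])

print(vocab)
print(bow_rows[0])
print([round(value, 3) for value in tfidf_rows[0]])
```

Words that occur in fewer documents (such as "read" here) receive a larger idf than words that occur in many (such as "good"), so TF-IDF down-weights common terms that Bag of Words counts at face value.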

## Model training using BOW and TF-IDF

In [33]:
from sklearn.naive_bayes import MultinomialNB

In [34]:
nb_model_bow = MultinomialNB().fit(X_train_bow,y_train)

In [35]:
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf,y_train)

In [36]:
from sklearn.metrics import confusion_matrix,accuracy_score , classification_report

In [37]:
y_pred_bow = nb_model_bow.predict(X_test_bow)

In [38]:
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

In [39]:
print("BOW Accuracy score",accuracy_score(y_test,y_pred_bow))

BOW Accuracy score 0.9557200151172432


In [41]:
print("TF-IDF Accuracy score",accuracy_score(y_test,y_pred_tfidf))

TF-IDF Accuracy score 0.9355937748546094
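
Because the classes are imbalanced, these accuracy scores are easier to interpret next to a majority-class baseline. The sketch below derives that baseline from the class counts printed earlier; it uses the counts of the full data, so the exact value for the test split may differ slightly.

```python
# Class counts copied from the value_counts() output after binarising the ratings.
class_counts = {1: 829256, 0: 57148}

majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / sum(class_counts.values())

print(f"always predicting {majority_label} gives accuracy {baseline_accuracy:.4f}")
# always predicting 1 gives accuracy 0.9355
```

The TF-IDF score above is almost identical to this baseline, which suggests that the TF-IDF Naive Bayes model predicts the positive class for nearly every review.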


## Implementing Word2Vec 

In [42]:
from gensim.models import Word2Vec
import gensim
from nltk.tokenize import word_tokenize

# Tokenize review text into list of words
df['tokens'] = df['reviewText'].apply(word_tokenize)

# Train Word2Vec model
w2v_model = Word2Vec(sentences=df['tokens'], vector_size=100, window=5, min_count=2, workers=4)


In [43]:
def get_avg_word2vec(tokens, model, k=100):
    vec = np.zeros(k)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    if count != 0:
        vec /= count
    return vec

# Compute average vectors
df['w2v_vector'] = df['tokens'].apply(lambda x: get_avg_word2vec(x, w2v_model, k=100))
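
Averaging word vectors ignores word order and skips tokens the model has never seen; a review without any known token maps to the zero vector. A dependency-free sketch of the same idea, with a plain dict standing in for `model.wv` and hypothetical 2-dimensional embeddings:

```python
def average_vectors(tokens, vectors, size):
    """Average the vectors of known tokens; return all zeros if none is known."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
            count += 1
    return [t / count for t in total] if count else total

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "book": [0.0, 1.0]}

print(average_vectors(["good", "book", "unseen"], toy_vectors, size=2))  # [0.5, 0.5]
print(average_vectors(["unseen"], toy_vectors, size=2))                  # [0.0, 0.0]
```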


In [44]:
X = np.array(df['w2v_vector'].tolist())
y = df['rating'].values

X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(X, y, test_size=0.2, random_state=42)


In [45]:
from sklearn.linear_model import LogisticRegression
clf_w2v = LogisticRegression(max_iter=1000)
clf_w2v.fit(X_train_w2v, y_train_w2v)

y_pred_w2v = clf_w2v.predict(X_test_w2v)


In [46]:
from sklearn.metrics import classification_report, accuracy_score
print("Word2Vec Accuracy:", accuracy_score(y_test_w2v, y_pred_w2v))
print(classification_report(y_test_w2v, y_pred_w2v))


Word2Vec Accuracy: 0.9568425268359272
              precision    recall  f1-score   support

           0       0.76      0.49      0.60     11562
           1       0.97      0.99      0.98    165719

    accuracy                           0.96    177281
   macro avg       0.86      0.74      0.79    177281
weighted avg       0.95      0.96      0.95    177281
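
The report shows why accuracy alone can mislead on imbalanced data: overall accuracy is 0.96, yet recall for the negative class is only 0.49. The toy sketch below reproduces the effect on made-up labels; `per_class_metrics` is a hypothetical helper, not scikit-learn's implementation.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 18 positive and 2 negative reviews; one negative review is missed.
y_true = [1] * 18 + [0, 0]
y_pred = [1] * 18 + [0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("negative class (precision, recall, f1):", per_class_metrics(y_true, y_pred, label=0))
```

Here accuracy is 0.95 even though half of the negative reviews are missed, which mirrors the pattern in the report above.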



## Trying to predict the sentiment 

In [47]:
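def preprocess_text(text):
    # same cleaning steps, in the same order, as in the training cells above
    text = text.lower()
    text = re.sub('[^a-z A-z 0_9]+', '', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = BeautifulSoup(text, 'lxml').get_text()
    text = " ".join(text.split())
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # note: `text` is returned without lemmatization, whereas the BoW / TF-IDF
    # vectorizers were fitted on lemmatized text; only `tokens` is lemmatized
    return text, tokens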



In [48]:
def predict_bow(text):
    clean_text, _ = preprocess_text(text)
    bow_vec = bow.transform([clean_text])
    pred = nb_model_bow.predict(bow_vec)[0]
    return "Positive" if pred == 1 else "Negative"


In [49]:
def predict_tfidf(text):
    clean_text, _ = preprocess_text(text)
    tfidf_vec = tfidf.transform([clean_text])
    pred = nb_model_tfidf.predict(tfidf_vec)[0]
    return "Positive" if pred == 1 else "Negative"


In [50]:
def get_avg_word2vec(tokens, model, k=100):
    vec = np.zeros(k)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count != 0 else vec

def predict_w2v(text):
    _, tokens = preprocess_text(text)
    vec = get_avg_word2vec(tokens, w2v_model, k=100).reshape(1, -1)
    pred = clf_w2v.predict(vec)[0]
    return "Positive" if pred == 1 else "Negative"


In [51]:
def predict_all_models(text):
    print(f"\nReview: {text}")
    print("BoW Prediction     :", predict_bow(text))
    print("TF-IDF Prediction  :", predict_tfidf(text))
    print("Word2Vec Prediction:", predict_w2v(text))


In [52]:
predict_all_models("This Kindle book was absolutely amazing, I couldn't stop reading!")
predict_all_models("Worst experience ever. I regret buying it.")



Review: This Kindle book was absolutely amazing, I couldn't stop reading!
BoW Prediction     : Positive
TF-IDF Prediction  : Positive
Word2Vec Prediction: Positive

Review: Worst experience ever. I regret buying it.
BoW Prediction     : Positive
TF-IDF Prediction  : Positive
Word2Vec Prediction: Negative


In [53]:
predict_all_models("Really nice , but it could be better ")


Review: Really nice , but it could be better 
BoW Prediction     : Positive
TF-IDF Prediction  : Positive
Word2Vec Prediction: Positive


In [54]:
predict_all_models("At first, I thought this book was going to change my life — turns out, it just wasted a week of it.")


Review: At first, I thought this book was going to change my life — turns out, it just wasted a week of it.
BoW Prediction     : Positive
TF-IDF Prediction  : Positive
Word2Vec Prediction: Negative


In [55]:
predict_all_models("“I really wanted to love this book. The premise was intriguing, and the initial chapters showed promise. However, as I continued reading, it became painfully clear that the plot was going nowhere, the characters were one-dimensional, and the dialogue felt robotic. I pushed through hoping it would get better, but sadly, it never did.”")


Review: “I really wanted to love this book. The premise was intriguing, and the initial chapters showed promise. However, as I continued reading, it became painfully clear that the plot was going nowhere, the characters were one-dimensional, and the dialogue felt robotic. I pushed through hoping it would get better, but sadly, it never did.”
BoW Prediction     : Negative
TF-IDF Prediction  : Positive
Word2Vec Prediction: Negative


In [56]:
predict_all_models("“I hated the first few chapters. The pacing was awful, the characters felt bland, and I almost gave up entirely. But I’m so glad I stuck with it — the story blossomed in the second half, the characters grew in depth, and by the end, I was genuinely moved. One of the most rewarding reads I’ve had in a while.”")


Review: “I hated the first few chapters. The pacing was awful, the characters felt bland, and I almost gave up entirely. But I’m so glad I stuck with it — the story blossomed in the second half, the characters grew in depth, and by the end, I was genuinely moved. One of the most rewarding reads I’ve had in a while.”
BoW Prediction     : Negative
TF-IDF Prediction  : Positive
Word2Vec Prediction: Negative
