# ***KINDLE REVIEW SENTIMENT ANALYSIS***
## **About Dataset**
Context: This is a subset of a dataset of book reviews from the Amazon Kindle Store category.

**Content** 
* 5-core dataset of product reviews from the Amazon Kindle Store category, from May 1996 to July 2014. Contains a total of 98260 entries. Each product has at least 5 reviews in the dataset.

**Columns**
* `asin` - ID of the product, like B000FA64PK
* `helpful` - Helpful rating of the review - example 2/3
* `overall` - Rating of the product (this column is named `rating` in the CSV loaded below).
* `reviewText` - text of the review (heading).
* `reviewTime` - time of the review (raw)
* `reviewerID` - ID of the reviewer, like A3SPTOKDG7WBLN
* `reviewerName` - name of the reviewer.
* `summary` - summary of the review
* `unixReviewTime` - unix timestamp
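
The `helpful` and `unixReviewTime` columns are stored in raw form. The following is a minimal sketch of how they could be decoded, assuming `helpful` arrives as a string such as `"[8, 10]"` (helpful votes, total votes), which is how it appears in the loaded CSV. The helper names `parse_helpful` and `helpful_ratio` are hypothetical and not part of the notebook.

```python
import ast
from datetime import datetime, timezone

def parse_helpful(raw):
    """Turn a string like "[8, 10]" into (helpful_votes, total_votes)."""
    helpful_votes, total_votes = ast.literal_eval(raw)
    return helpful_votes, total_votes

def helpful_ratio(raw):
    """Share of voters who found the review helpful, or None if nobody voted."""
    helpful_votes, total_votes = parse_helpful(raw)
    return helpful_votes / total_votes if total_votes else None

print(helpful_ratio("[8, 10]"))  # 0.8
print(helpful_ratio("[0, 0]"))   # None

# unixReviewTime counts seconds since 1970-01-01 UTC
review_date = datetime.fromtimestamp(1283385600, tz=timezone.utc).date()
print(review_date)  # 2010-09-02
```

Returning `None` for reviews without votes keeps "no votes" distinct from "0% helpful", which matters if helpfulness is analysed later.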

**Inspiration**
- Sentiment Analysis
- Understanding how people rate the usefulness of a review / which factors influence the helpfulness of a review.
- Fake reviews/ outliers.
- Best rated product IDs, or similarity between products based on review alone

**Process**
The dataset contains 11 columns, but this project needs only two of them, `rating` and `reviewText`, for sentiment analysis.

**Steps**
1. Preprocessing and Cleaning
2. Train Test Split
3. Bag of Words (BoW), TF-IDF and Word2Vec
4. Train with Machine Learning Algorithm
5. Model Evaluation and Performance Measures

In [19]:
# Load the dataset
import pandas as pd
data = pd.read_csv('Data/all_kindle_review.csv')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [20]:
data.shape

(12000, 11)

In [21]:
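# Keep only the two columns needed; .copy() avoids pandas' SettingWithCopyWarning on later assignments
data = data[['reviewText','rating']].copy()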

In [22]:
data.shape

(12000, 2)

In [23]:
data.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [24]:
# Check missing values
data.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [25]:
data['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [26]:
data['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

### Great!!

## 1. Preprocessing and cleaning the data

In [27]:
# Positive review (rating 3, 4 or 5) -> 1, negative review (rating 1 or 2) -> 0
data['rating'] = data['rating'].apply(lambda x: 0 if x < 3 else 1)

In [28]:
data.rating.unique()

array([1, 0], dtype=int64)

In [29]:
data.rating.value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
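
The class counts above follow directly from the rating distribution. As a small worked check (the function name `to_sentiment` is illustrative, and the counts are copied from the `value_counts()` output earlier in the notebook):

```python
from collections import Counter

def to_sentiment(rating):
    """Same rule as the notebook: ratings below 3 are negative (0), the rest positive (1)."""
    return 0 if rating < 3 else 1

# Rating distribution reported by data['rating'].value_counts()
rating_counts = {5: 3000, 4: 3000, 3: 2000, 2: 2000, 1: 2000}

label_counts = Counter()
for rating, n_reviews in rating_counts.items():
    label_counts[to_sentiment(rating)] += n_reviews

print(dict(label_counts))  # {1: 8000, 0: 4000}
```

Note that the 2000 three-star reviews are counted as positive under this rule, which is what makes the classes 2:1 rather than 6000 to 4000.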

In [30]:
# 1.1 Lower case all the data
data["reviewText"] = data['reviewText'].str.lower()

In [31]:
# 1.2 Remove all special characters (keep only letters, digits and spaces)
import re
data['reviewText'] = data['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9 ]+', '', x))

In [32]:
# 1.3 Remove the stopwords
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
data['reviewText'] = data['reviewText'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))

In [33]:
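# 1.4 Remove URLs
# Note: step 1.2 has already stripped ':' and '/', so this pattern can only match if it runs before 1.2
# (the same applies to the '<' and '>' that step 1.5 relies on)
data['reviewText'] = data['reviewText'].apply(lambda x: re.sub(r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?","",str(x)))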

In [34]:
# 1.5 Removing html tags
from bs4 import BeautifulSoup
data['reviewText'] = data['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

In [35]:
data['reviewText']

0        jace rankin may short hes nothing mess man hau...
1        great short read didnt want put read one sitti...
2        ill start saying first four books wasnt expect...
3        aggie angela lansbury carries pocketbooks inst...
4        expect type book library pleased find price right
                               ...                        
11995    valentine cupid vampire jena ian another vampi...
11996    read seven books series apocalypticadventure o...
11997    book really wasnt cuppa situation man capturin...
11998    tried use charge kindle didnt even register ch...
11999    taking instruction look often hidden world sex...
Name: reviewText, Length: 12000, dtype: object
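
The order of the cleaning steps matters: once special characters are gone, URLs and HTML tags can no longer be recognised. The sketch below contrasts the two orders on a single string. It is a simplified illustration, not the notebook's exact code: the regexes are shortened, the stopword list is a tiny stand-in for NLTK's English list, and the function names and sample text are made up for the example.

```python
import re

URL_RE = re.compile(r"(?:http|https|ftp)://\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9 ]+")

# Tiny illustrative stopword list (assumption; the notebook uses NLTK's list)
STOPWORDS = {"i", "it", "at", "the", "a"}

def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def clean_urls_first(text):
    """Remove URLs and tags while their punctuation is still present."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub("", text)
    return drop_stopwords(text)

def clean_specials_first(text):
    """Order used in the notebook: special characters are stripped first."""
    text = NON_ALNUM_RE.sub("", text.lower())
    text = URL_RE.sub(" ", text)   # no '://' left, so nothing matches
    text = TAG_RE.sub(" ", text)   # no '<' or '>' left, so nothing matches
    return drop_stopwords(text)

sample = "I LOVED it!<br/>Read more at https://example.com/review?id=1"
print(clean_urls_first(sample))      # loved read more
print(clean_specials_first(sample))  # loved itbrread more httpsexamplecomreviewid1
```

With the second order, tag names and URL fragments survive as junk tokens and end up in the vocabulary.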

In [40]:
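# Reload the raw file for comparison (same path as the first load; paths are case-sensitive on Linux/macOS)
original_data = pd.read_csv('Data/all_kindle_review.csv')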
original_data = original_data[['reviewText','rating']]

In [41]:
original_data['reviewText']

0        Jace Rankin may be short, but he's nothing to ...
1        Great short read.  I didn't want to put it dow...
2        I'll start by saying this is the first of four...
3        Aggie is Angela Lansbury who carries pocketboo...
4        I did not expect this type of book to be in li...
                               ...                        
11995    Valentine cupid is a vampire- Jena and Ian ano...
11996    I have read all seven books in this series. Ap...
11997    This book really just wasn't my cuppa.  The si...
11998    tried to use it to charge my kindle, it didn't...
11999    Taking Instruction is a look into the often hi...
Name: reviewText, Length: 12000, dtype: object

In [43]:
len(original_data) == len(data)

True

In [46]:
# 1.6 Apply lemmatization
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    # pos='v' lemmatizes every token as a verb, e.g. "saying" -> "say"
    return " ".join(lemmatizer.lemmatize(word, pos='v') for word in text.split())

data['reviewText'] = data['reviewText'].apply(lemmatize_words)

In [47]:
data.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sit s...,1
2,ill start say first four book wasnt expect 34c...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library please find price right,1


## 2. Train Test Split

In [48]:
# 2.1 Split the data into X and y
X = data['reviewText']
y = data['rating']

In [49]:
# 2.2 train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [53]:
X_train

9182     look forward book come double space every para...
11091    already own book spouse forget already part li...
6428     cool forget request rate come make mine unreli...
288      short short story basically scene party one ni...
2626     secret service agent secrests even longer serv...
                               ...                        
11964    download book read review usually read preview...
5191     far one hottest book ive ever get hand ondont ...
5390     even though book free reservation base majorit...
860      little mushy 34must take care woman folk34 cha...
7270     book good good set charaterswith background le...
Name: reviewText, Length: 9600, dtype: object

In [54]:
y_train

9182     1
11091    0
6428     1
288      0
2626     1
        ..
11964    0
5191     1
5390     0
860      1
7270     1
Name: rating, Length: 9600, dtype: int64

In [50]:
X_train.shape

(9600,)

In [55]:
y_train.shape

(9600,)

In [56]:
X_test

1935         really great read wish would hope find author
6494     nope try cant read take greatest delight delet...
1720     story line drug like book much mystery fan wou...
9120     read several angel book one work didnt really ...
360      possibly worst book ever read begin positively...
                               ...                        
1195     enjoy read think fan humorous must err use lik...
11877    pleasantly surprise book enjoy m dubois tell s...
5421     love best friend since 15 year old 30 he serve...
3855     fascinate book enough twist turn keep read lon...
4414     plot note publisher blurb publisher make fun o...
Name: reviewText, Length: 2400, dtype: object

In [52]:
y_test

1935     1
6494     0
1720     0
9120     0
360      0
        ..
1195     1
11877    1
5421     1
3855     1
4414     1
Name: rating, Length: 2400, dtype: int64

In [58]:
X_test.shape

(2400,)

In [59]:
y_test.shape

(2400,)
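
The split above is random, so the share of negative reviews in the test set can drift slightly from the overall one third. A stratified split keeps the class ratio identical in both parts; in scikit-learn this is what the `stratify=y` argument of `train_test_split` does. The following pure-Python sketch only illustrates the idea on the label counts from this notebook; `stratified_split` is a hypothetical helper, not the library's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with the same class ratio in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # take test_size of each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# 8000 positive and 4000 negative labels, as in the binarised `rating` column
labels = [1] * 8000 + [0] * 4000
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))           # 9600 2400
print(sum(labels[i] == 0 for i in test_idx))   # 800
```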

## 3. Bag of Words, TF-IDF and Word2Vec

In [64]:
# 3.1. Applying Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()

In [65]:
# 3.2. Applying TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

In [66]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [67]:
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
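
The two matrices above are mostly zeros because each review uses only a small part of the vocabulary. To make their contents concrete, here is a hand-rolled sketch on three toy documents. It assumes whitespace tokenisation (scikit-learn's vectorizers use their own token pattern) and uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalisation, which are the `TfidfVectorizer` defaults. The toy documents and variable names are made up for the example.

```python
import math
from collections import Counter

docs = ["great book great read", "bad book", "great story"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per distinct word, in sorted order
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of words: raw term counts per document
counts = [Counter(doc) for doc in tokenized]
bow = [[c[word] for word in vocab] for c in counts]

# Smoothed inverse document frequency
n_docs = len(docs)
df = {word: sum(word in c for c in counts) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def l2_normalize(row):
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# TF-IDF: count * idf, then scale each row to unit length
tfidf = [l2_normalize([count * idf[word] for count, word in zip(row, vocab)])
         for row in bow]

print(vocab)   # ['bad', 'book', 'great', 'read', 'story']
print(bow[0])  # [0, 1, 2, 1, 0]
print([round(v, 3) for v in tfidf[1]])
```

In the second document, the rarer word "bad" gets a larger weight than "book", which appears in two of the three documents.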

In [127]:
import numpy as np  # needed for np.zeros, np.mean and np.array below
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Download the necessary NLTK data files
nltk.download('punkt')

# Define the tokenization function
def tokenize(text):
    return word_tokenize(text.lower())  # Convert to lowercase to maintain consistency

# Apply tokenization to train and test data
X_train_tokenized = X_train.apply(tokenize)
X_test_tokenized = X_test.apply(tokenize)

# Train the Word2Vec model on the tokenized data
sentences = X_train_tokenized.tolist()  # Convert tokenized data to list of lists
wv = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=0)

# Define function to get average vector
def get_average_vector(tokens, model):
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Extract features for training and test data
X_train_wv = np.array(X_train_tokenized.apply(lambda tokens: get_average_vector(tokens, wv)).tolist())
X_test_wv = np.array(X_test_tokenized.apply(lambda tokens: get_average_vector(tokens, wv)).tolist())

print(f"Shape of X_train_wv: {X_train_wv.shape}")
print(f"Shape of X_test_wv: {X_test_wv.shape}")



[nltk_data] Downloading package punkt to C:\Users\Pavilion\OneDrive\De
[nltk_data]     sktop\UdemyMLCourse\venv\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Shape of X_train_wv: (9600, 100)
Shape of X_test_wv: (2400, 100)


In [128]:
X_train_wv.shape

(9600, 100)

## 4. Applying Machine Learning Algorithms

In [129]:
# 4.1. Applying Gaussian Naive Bayes algorithm
from sklearn.naive_bayes import GaussianNB
nb_model_bow = GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf,y_train)
nb_model_wv = GaussianNB().fit(X_train_wv,y_train)


## 5. Evaluation and Performance Measures

In [130]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

# Predict with X_test
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)
y_pred_wv = nb_model_wv.predict(X_test_wv)

print(f"Accuracy BOW : {accuracy_score(y_test,y_pred_bow)*100:.2f}")
print(f"Classification Report BOW :\n{classification_report(y_test,y_pred_bow)}")
print(f"Confusion Matrix BOW:\n{confusion_matrix(y_test,y_pred_bow)}")
print('\n\n')

print(f"Accuracy TF-IDF : {accuracy_score(y_test,y_pred_tfidf)*100:.2f}")
print(f"Classification Report TF-IDF :\n{classification_report(y_test,y_pred_tfidf)}")
print(f"Confusion Matrix TF-IDF:\n{confusion_matrix(y_test,y_pred_tfidf)}")
print('\n\n')

print(f"Accuracy Word2Vec : {accuracy_score(y_test,y_pred_wv)*100:.2f}")
print(f"Classification Report Word2Vec :\n{classification_report(y_test,y_pred_wv)}")
print(f"Confusion Matrix Word2Vec:\n{confusion_matrix(y_test,y_pred_wv)}")


Accuracy BOW : 57.21
Classification Report BOW :
              precision    recall  f1-score   support

           0       0.41      0.65      0.50       803
           1       0.75      0.53      0.62      1597

    accuracy                           0.57      2400
   macro avg       0.58      0.59      0.56      2400
weighted avg       0.64      0.57      0.58      2400

Confusion Matrix BOW:
[[521 282]
 [745 852]]



Accuracy TF-IDF : 57.54
Classification Report TF-IDF :
              precision    recall  f1-score   support

           0       0.41      0.63      0.50       803
           1       0.75      0.55      0.63      1597

    accuracy                           0.58      2400
   macro avg       0.58      0.59      0.57      2400
weighted avg       0.63      0.58      0.59      2400

Confusion Matrix TF-IDF:
[[507 296]
 [723 874]]



Accuracy Word2Vec : 71.25
Classification Report Word2Vec :
              precision    recall  f1-score   support

           0       0.56      
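
The figures in the BOW and TF-IDF reports can be re-derived from their confusion matrices. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Derive accuracy, per-class precision/recall and the majority baseline
    from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
        # Accuracy of always predicting the more frequent class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

bow_metrics = binary_metrics([[521, 282], [745, 852]])
tfidf_metrics = binary_metrics([[507, 296], [723, 874]])

print(f"BOW accuracy      : {bow_metrics['accuracy'] * 100:.2f}")    # 57.21
print(f"TF-IDF accuracy   : {tfidf_metrics['accuracy'] * 100:.2f}")  # 57.54
print(f"Majority baseline : {bow_metrics['majority_baseline'] * 100:.2f}")  # 66.54
```

Always predicting class 1 would score about 66.5% on this test set (1597 of 2400), so the BOW and TF-IDF Gaussian Naive Bayes models are below that baseline, while the Word2Vec model at 71.25% is above it.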

# **- - - - - - - - - -  * - - - - - - - - - -**