# Kindle Review Analysis
##### Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, and which factors influence a review's helpfulness.
- Fake reviews and outliers.
- Best-rated product IDs, or similarity between products based on their reviews.

## Best practices
1. Pre-processing and cleaning
2. Train Test Split
3. BOW, TFIDF, Word2Vec
4. Train ML Algorithms

### URL FOR DATASET : https://www.kaggle.com/datasets/bharadwaj6/kindle-reviews

In [2]:
import pandas as pd
data = pd.read_csv('kindle_reviews.csv')
data.head(2)

Unnamed: 0.1,Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,B000F83SZQ,"[0, 0]",5,I enjoy vintage books and movies so I enjoyed ...,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
1,1,B000F83SZQ,"[2, 2]",4,This book is a reissue of an old one; the auth...,"01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400


In [3]:
data = data[['reviewText', 'overall']]

In [5]:
data = data.sample(10000)
data.shape

(10000, 2)

In [6]:
## unique values
data.overall.unique()

array([5, 4, 2, 3, 1], dtype=int64)

In [7]:
data.overall.value_counts()

5    5848
4    2547
3    1031
2     343
1     231
Name: overall, dtype: int64

In [8]:
## missing values
data.isnull().sum()

reviewText    0
overall       0
dtype: int64

In [9]:
## no null values in this sample; dropna is kept as a safeguard in case another sample contains some
data.dropna(inplace=True)

In [10]:
## recheck for null values
data.isnull().sum()

reviewText    0
overall       0
dtype: int64

In [11]:
## Pre-processing and cleaning
### positive review (rating >= 3) -> 1
### negative review (rating < 3)  -> 0
data['overall'] = data['overall'].apply(lambda x: 0 if x < 3 else 1)

In [12]:
data.overall.value_counts()

1    9426
0     574
Name: overall, dtype: int64
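
The counts above show a strong class imbalance: roughly 94% of the sampled reviews are labelled positive. A minimal sketch of the majority-class baseline that any classifier on this sample should beat is given here. The label counts are copied from the output above, and the helper name `majority_baseline` is hypothetical, not part of the notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Label counts copied from the value_counts() output above.
labels = [1] * 9426 + [0] * 574
baseline = majority_baseline(labels)
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # 0.9426
```

Accuracy alone is therefore a weak signal on this data; per-class precision and recall are more informative.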

In [13]:
## 1. convert the text to lowercase
data['reviewText'] = data['reviewText'].str.lower()

In [14]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
# Step 1: Remove HTML tags (must run before punctuation is stripped, otherwise '<' and '>' are already gone)
data['reviewText'] = data['reviewText'].apply(lambda x: BeautifulSoup(str(x), 'lxml').get_text())

# Step 2: Remove URLs (the pattern needs '://' and '.', so this also runs before punctuation removal)
data['reviewText'] = data['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://[\w_-]+(?:\.[\w_-]+)+[\w.,@?^=%&:/~+#-]*', '', str(x)))

# Step 3: Remove special characters
data['reviewText'] = data['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
data['reviewText'] = data['reviewText'].apply(lambda x: " ".join([word for word in x.split() if word.lower() not in stop_words]))

# Step 5: Remove extra spaces
data['reviewText'] = data['reviewText'].apply(lambda x: " ".join(x.split()))
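
The order of these cleaning steps matters: the URL pattern relies on `://` and HTML tags rely on `<` and `>`, so both need to be handled before punctuation is stripped. The sketch below illustrates this using only the `re` module. The sample string is made up, the function names are hypothetical, and the tag-stripping regex is a simplified stand-in for BeautifulSoup rather than a replacement for it.

```python
import re

URL_RE = re.compile(r'(http|https|ftp|ssh)://[\w_-]+(?:\.[\w_-]+)+[\w.,@?^=%&:/~+#-]*')
TAG_RE = re.compile(r'<[^>]+>')            # simplified stand-in for BeautifulSoup
SPECIAL_RE = re.compile(r'[^a-zA-Z0-9\s]')


def clean_markup_first(text):
    """Strip tags and URLs while their delimiters still exist, then punctuation."""
    text = TAG_RE.sub(' ', text)
    text = URL_RE.sub('', text)
    text = SPECIAL_RE.sub('', text)
    return " ".join(text.split())


def clean_punctuation_first(text):
    """Strip punctuation first; the tag and URL patterns then no longer match."""
    text = SPECIAL_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = TAG_RE.sub(' ', text)
    return " ".join(text.split())


sample = "great read <br> see http://example.com/book for more"
print(clean_markup_first(sample))       # great read see for more
print(clean_punctuation_first(sample))  # great read br see httpexamplecombook for more
```

With punctuation removed first, the tag name and the mangled URL survive as junk tokens and end up in the vocabulary.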

In [16]:
data.head()

Unnamed: 0,reviewText,overall
367334,im opinion story good like characters even lea...,1
315361,leather lace rocknroll interesting story roman...,1
729208,love love love author awakened series keeps ge...,1
327736,short story characterization plot resolution c...,0
705292,losing control sexy good read strong alpha alw...,1


In [17]:
## applying WordNet lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data is required by the lemmatizer; a no-op if already present
lemmatizer = WordNetLemmatizer()

In [18]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [19]:
data['reviewText'] = data['reviewText'].apply(lemmatize_words)

In [20]:
data.head()

Unnamed: 0,reviewText,overall
367334,im opinion story good like character even lead...,1
315361,leather lace rocknroll interesting story roman...,1
729208,love love love author awakened series keep get...,1
327736,short story characterization plot resolution c...,0
705292,losing control sexy good read strong alpha alw...,1


In [21]:
## train-test-split
from sklearn.model_selection import train_test_split
x = data['reviewText']
y = data['overall']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
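
Because only about 6% of the labels are negative, a purely random split can leave the train and test sets with slightly different class ratios. One option, shown here as a sketch on synthetic labels rather than on the notebook's data, is the `stratify` argument of `train_test_split`. The arrays `x_demo` and `y_demo` are placeholders introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the imbalance seen earlier (assumption:
# 9426 positive, 574 negative); the features are placeholder indices.
y_demo = np.array([1] * 9426 + [0] * 574)
x_demo = np.arange(len(y_demo))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both shares stay close to the overall 5.74%.
print(f"negative share in train: {(y_tr == 0).mean():.4f}")
print(f"negative share in test:  {(y_te == 0).mean():.4f}")
```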

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()

In [23]:
x_train_bow = bow.fit_transform(x_train).toarray()
x_test_bow = bow.transform(x_test).toarray()

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
x_train_tf = tf.fit_transform(x_train).toarray()
x_test_tf = tf.transform(x_test).toarray()
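
To make the difference between the two representations concrete, here is a small sketch on a made-up two-document corpus. `CountVectorizer` stores raw term counts, while `TfidfVectorizer` (with its defaults) down-weights terms that occur in many documents and L2-normalizes each row. The names `toy_corpus`, `bow_demo` and `tfidf_demo` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["good story good character", "bad story"]  # made-up examples

bow_demo = CountVectorizer()
counts = bow_demo.fit_transform(toy_corpus)

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(bow_demo.vocabulary_, key=bow_demo.vocabulary_.get)
print(vocab)             # ['bad', 'character', 'good', 'story']
print(counts.toarray())  # [[0 1 2 1], [1 0 0 1]]

tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(toy_corpus).toarray()
print(weights.round(2))

# In the second document, 'bad' outweighs 'story' because 'story'
# appears in both documents and so carries a lower IDF.
bad_idx = tfidf_demo.vocabulary_["bad"]
story_idx = tfidf_demo.vocabulary_["story"]
print(weights[1, bad_idx] > weights[1, story_idx])  # True
```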

In [25]:
x_train_tf, x_train_bow

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64))

In [26]:
## naive bayes bow
from sklearn.naive_bayes import GaussianNB
nb_model_bow = GaussianNB().fit(x_train_bow, y_train)

In [27]:
## naive bayes tfidf
from sklearn.naive_bayes import GaussianNB
nb_model_tf = GaussianNB().fit(x_train_tf, y_train)
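
`GaussianNB` requires dense input, which is why the vectorizer outputs above are converted with `.toarray()`; for a large vocabulary this can use a lot of memory. A possible alternative, not used in this notebook, is `MultinomialNB`, which accepts the sparse matrices returned by the vectorizers directly. The sketch below uses a made-up toy corpus, and all variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data for illustration only.
train_texts = ["love great story", "great character love",
               "boring plot", "bad boring story"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
x_sparse = vec.fit_transform(train_texts)  # SciPy sparse matrix, no .toarray()

clf = MultinomialNB().fit(x_sparse, train_labels)
pred = clf.predict(vec.transform(["love great character", "boring bad plot"]))
print(pred)  # [1 0]
```

Whether this improves on the Gaussian model for the Kindle reviews would need to be checked on the actual data.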

In [28]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [31]:
y_pred_bow = nb_model_bow.predict(x_test_bow)

In [32]:
y_pred_tf = nb_model_tf.predict(x_test_tf)

In [33]:
## accuracy
print(f"BOW Accuracy {accuracy_score(y_test, y_pred_bow)}")
print(f"TFIDF Accuracy {accuracy_score(y_test, y_pred_tf)}")

BOW Accuracy 0.82
TFIDF Accuracy 0.82
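
Since roughly 94% of the sampled reviews are positive, a model that always predicts 1 would likely score higher than 0.82 on accuracy alone, so accuracy by itself says little here. The `confusion_matrix` and `classification_report` functions imported earlier give a per-class view. The sketch below runs on made-up labels; substituting `y_test` and `y_pred_bow` (or `y_pred_tf`) would apply it to the models above.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration only; replace with y_test / y_pred_bow.
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 1, 0]

# Rows are the true class (0, 1); columns are the predicted class (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)  # [[1 1], [2 4]]

# Precision, recall and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

The recall on class 0 is the number to watch: it shows how many negative reviews the model actually catches.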
