#### 📚 About the Dataset:

This dataset is a smaller slice of Amazon’s Kindle Store book reviews, collected between May 1996 and July 2014. It includes 982,619 reviews, where each product and reviewer has at least 5 reviews, ensuring consistent and meaningful data.

#### 💡 What’s Inside:

- asin: Unique ID for each product (e.g., B000FA64PK)

- helpful: Ratio showing how helpful others found the review (e.g., 2 out of 3 users)

- overall: The rating given to the product

- reviewText: The main review content

- reviewTime: Date of the review

- reviewerID: ID of the reviewer

- reviewerName: Name of the reviewer (if available)

- summary: Short title or summary of the review

- unixReviewTime: Review date as a UNIX timestamp
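
A minimal sketch of how the `helpful` and `unixReviewTime` fields might be decoded, assuming `helpful` is stored as a string such as `"[8, 10]"` (helpful votes, total votes), as in the sample rows of the CSV loaded in this notebook. The helper names `helpful_ratio` and `review_date` are illustrative and not part of the dataset.

```python
import ast
from datetime import datetime, timezone

def helpful_ratio(raw):
    """Parse a '[helpful, total]' string; return None when nobody voted."""
    helpful, total = ast.literal_eval(raw)
    return helpful / total if total else None

def review_date(unix_seconds):
    """Convert a UNIX timestamp (seconds) to a UTC calendar date."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()

print(helpful_ratio("[8, 10]"))   # 0.8
print(helpful_ratio("[0, 0]"))    # None
print(review_date(1283385600))    # 2010-09-02
```

Returning `None` for reviews with no votes keeps them distinguishable from reviews that were voted on but found unhelpful.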

#### 🔍 Dataset Origin:

Sourced from Julian McAuley’s research group at UCSD, this dataset is part of a larger collection of Amazon product reviews. All rights to the data belong to the original creators.

#### 🧠 Why Use This?
- Perform sentiment analysis on book reviews

- Study what makes a review helpful

- Identify fake reviews or unusual patterns

- Find top-rated books or explore product similarities based purely on review text

- Try out other creative data analysis ideas!

Workflow followed in this notebook:

1. Preprocess and clean the text
2. Split the data into train and test sets
3. Apply BOW, TF-IDF, and Word2Vec
4. Train the ML model

In [2]:
import pandas as pd
df = pd.read_csv("all_kindle_review.csv")
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [3]:
df = df[['reviewText','rating']]
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [4]:
# Basic data inspection
df.columns

Index(['reviewText', 'rating'], dtype='object')

In [5]:
print("The number of unique values in the reviewText feature : ", df['reviewText'].nunique())
print("The number of unique values in the rating feature : ", df['rating'].nunique())

The number of unique values in the reviewText feature :  12000
The number of unique values in the rating feature :  5


In [6]:
# Finding out the unique values
print("The unique values in the rating column are : ", df['rating'].unique())

The unique values in the rating column are :  [3 5 4 2 1]


In [7]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [8]:
df.isna().sum()
# No null values found

reviewText    0
rating        0
dtype: int64

In [9]:
df['reviewText'][0]

'Jace Rankin may be short, but he\'s nothing to mess with, as the man who was just hauled out of the saloon by the undertaker knows now. He\'s a famous bounty hunter in Oregon in the 1890s who, when he shot the man in the saloon, just finished a years long quest to avenge his sister\'s murder and is now trying to figure out what to do next. When the snotty-nosed farm boy he just rescued from a gang of bullies offers him money to kill a man who forced him off his ranch, he reluctantly agrees to bring the man to justice, but not to kill him outright. But, first he needs to tell his sister\'s widower the news.Kyla "Kyle" Springer Bailey has been riding the trails and sleeping on the ground for the past month while trying to find Jace. She wants revenge on the man who killed her husband and took her ranch, amongst other crimes, and she\'s not so keen on the detour Jace wants to take. But she realizes she\'s out of options, so she hides behind her boy persona as best she can and tries to ke

#### Preprocessing and cleaning the text:

Here we have five ratings [1, 2, 3, 4, 5]. Any rating below 3 is treated as negative (0); ratings of 3 and above are treated as positive (1).

In [10]:
# Positive review is 1 and Negative review is 0
df['rating'] = df['rating'].apply(lambda x:0 if x<3 else 1)


In [11]:
df['rating'].value_counts()
# The classes are split 2:1 (8000 positive vs. 4000 negative). This is a moderate imbalance that most ML models tolerate, so no resampling is applied.

rating
1    8000
0    4000
Name: count, dtype: int64
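
To put the class split in perspective, here is a small worked calculation of the majority-class baseline, i.e. the accuracy of a model that always predicts the more frequent label. The counts are taken from the output above; the variable names are illustrative.

```python
# Label counts after binarizing the ratings (1 = positive, 0 = negative)
counts = {1: 8000, 0: 4000}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label)               # 1
print(round(baseline_accuracy, 3))  # 0.667
```

A classifier needs to exceed this figure on accuracy to add value, which is one reason to read macro-averaged precision, recall, and F1 alongside accuracy.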

In [12]:
# 1. Lowercase all words in reviewText
df['reviewText'] = df['reviewText'].str.lower()

In [13]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

## Remove URLs (before special characters are stripped, otherwise '://' is gone and the pattern cannot match)
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))

## Remove HTML tags (also before '<' and '>' are stripped)
df['reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

## Remove special characters: keep only letters, digits, spaces and hyphens
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9 -]+', '', x))

## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(y for y in x.split() if y not in stop_words))

## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))


In [62]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1
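
The cleaning steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's exact pipeline: the stopword set is a tiny hypothetical sample rather than NLTK's list, HTML tags are stripped with a rough regex instead of BeautifulSoup, and the URL pattern is deliberately simpler. It also shows why URLs and tags are best removed before punctuation is stripped, since `://` and `<...>` can no longer be matched once those characters are gone.

```python
import re

# Tiny illustrative stopword sample (assumption; the notebook uses NLTK's English list)
STOPWORDS = {"the", "a", "is", "it", "at", "this", "and", "to"}

def clean_review(text):
    text = text.lower()
    text = re.sub(r"(?:https?|ftp)://\S+", " ", text)  # drop URLs first
    text = re.sub(r"<[^>]+>", " ", text)               # rough HTML tag removal
    text = re.sub(r"[^a-z0-9 -]+", "", text)           # strip remaining special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                             # joining on split() also collapses extra spaces

print(clean_review("This book is <b>GREAT</b>! See it at https://example.com/x?id=1"))
# book great see
```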


In [19]:
# 2. Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data is required by the lemmatizer (one-time download)
lemmatizer = WordNetLemmatizer()

In [20]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [21]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)

#### Train Test Split

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['reviewText'],df['rating'],test_size=0.2)
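
Because the split above is random and unseeded, the metrics reported later will vary from run to run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this purpose; the sketch below shows the underlying idea with the standard library only, using a hypothetical `stratified_split` helper and toy labels.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping each class's share roughly equal in both parts."""
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 8 + [0] * 4             # same 2:1 ratio as the review labels
train_idx, test_idx = stratified_split(labels, test_size=0.25)
print(len(train_idx), len(test_idx))         # 9 3
print(sorted(labels[i] for i in test_idx))   # [0, 1, 1]
```

Calling the helper again with the same seed returns the same indices, which is what makes later comparisons between feature representations repeatable.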

#### Applying the BOW:

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()

In [24]:
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()

In [25]:
X_train_bow

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
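
As a rough illustration of what the count matrix above contains, the following sketch builds a bag-of-words matrix by hand for two toy reviews. It is a simplification: `CountVectorizer` applies its own tokenization rules (by default it drops single-character tokens), and the `bag_of_words` helper here is hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, X = bag_of_words(["great short read", "short short book"])
print(vocab)  # ['book', 'great', 'read', 'short']
print(X)      # [[0, 1, 1, 1], [1, 0, 0, 2]]
```

Each row has one column per vocabulary word, so with thousands of reviews the real matrix is mostly zeros, which is why such matrices are usually kept sparse.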

#### Training the Naive Bayes model:

In [26]:
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()

In [27]:
nb_model.fit(X_train_bow,y_train)

In [28]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

In [29]:
y_pred_bow = nb_model.predict(X_test_bow)

#### Metrics:

In [30]:
confusion_matrix(y_test, y_pred_bow)

array([[517, 274],
       [723, 886]], dtype=int64)

In [31]:
print(classification_report(y_test,y_pred_bow))

              precision    recall  f1-score   support

           0       0.42      0.65      0.51       791
           1       0.76      0.55      0.64      1609

    accuracy                           0.58      2400
   macro avg       0.59      0.60      0.57      2400
weighted avg       0.65      0.58      0.60      2400



In [32]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))

BOW accuracy:  0.5845833333333333
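
The figures in the classification report can be reproduced by hand from the confusion matrix, which makes it easier to see what each metric measures. The sketch below uses the BOW confusion matrix shown above (rows are true labels, columns are predicted labels, as in scikit-learn); the variable names are illustrative.

```python
# BOW confusion matrix from above: rows = true class, columns = predicted class
cm = [[517, 274],
      [723, 886]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def per_class(cm, k):
    """Precision, recall and F1 for class k."""
    tp = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as k
    actual_k = sum(cm[k])               # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p0, r0, f0 = per_class(cm, 0)
p1, r1, f1 = per_class(cm, 1)
macro_precision = (p0 + p1) / 2         # unweighted mean over classes

print(round(accuracy, 4))                         # 0.5846
print(round(p0, 2), round(r0, 2), round(f0, 2))   # 0.42 0.65 0.51
print(round(p1, 2), round(r1, 2), round(f1, 2))   # 0.76 0.55 0.64
print(round(macro_precision, 2))                  # 0.59
```

Note that 1609 of the 2400 test reviews are positive, so always predicting class 1 would already score about 0.67 accuracy.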


#### Applying TF-IDF:

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [34]:
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
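
For intuition about how TF-IDF weights differ from raw counts, here is a hand-rolled sketch that follows the defaults of `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf_matrix` helper and the toy documents are illustrative, and tokenization is plain whitespace splitting rather than the vectorizer's own pattern.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with smoothed-IDF weights, each row L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocabulary, rows

vocab, rows = tfidf_matrix(["great short read", "short boring book"])
for word, weight in zip(vocab, rows[0]):
    print(word, round(weight, 3))
```

In the first toy review, `short` appears in both documents and therefore receives a lower weight than `great` and `read`, which are unique to that review.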

In [35]:
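# Refit on the TF-IDF features first: nb_model was last fitted on BOW counts (the outputs shown below were produced without this refit)
nb_model.fit(X_train_tfidf, y_train)
y_pred_tfidf = nb_model.predict(X_test_tfidf)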

#### Metrics

In [36]:
confusion_matrix(y_test, y_pred_tfidf)

array([[511, 280],
       [714, 895]], dtype=int64)

In [37]:
print(classification_report(y_test,y_pred_tfidf))

              precision    recall  f1-score   support

           0       0.42      0.65      0.51       791
           1       0.76      0.56      0.64      1609

    accuracy                           0.59      2400
   macro avg       0.59      0.60      0.57      2400
weighted avg       0.65      0.59      0.60      2400



In [38]:
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

TFIDF accuracy:  0.5858333333333333


#### Applying Word2Vec:

In [41]:
import numpy as np

In [40]:
# Tokenize the cleaned reviews: BOW and TF-IDF accept raw strings, but Word2Vec needs lists of tokens
X_train_tokens = X_train.apply(lambda x: x.split())
X_test_tokens = X_test.apply(lambda x: x.split())

In [44]:
# Train the Word2Vec model:
from gensim.models import Word2Vec

# Train the word2vec on the training data
w2v_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=2, workers=4)

In [45]:
# Create Fixed Length Review Vectors:
# For each review, average the vectors of all words present in the Word2Vec vocabulary.

def get_review_vector(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

X_train_vec = np.array([get_review_vector(tokens, w2v_model, 100) for tokens in X_train_tokens])
X_test_vec = np.array([get_review_vector(tokens, w2v_model, 100) for tokens in X_test_tokens])
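
To make the averaging step concrete, the following toy example applies the same logic to a hypothetical two-dimensional embedding table (a plain dict standing in for `model.wv`), including the fallback to a zero vector when none of the tokens are in the vocabulary.

```python
# Hypothetical 2-dimensional embeddings standing in for a trained Word2Vec vocabulary
toy_vectors = {
    "great": [1.0, 0.0],
    "read":  [0.0, 1.0],
}

def average_vector(tokens, vectors, vector_size):
    """Mean of the known tokens' vectors; a zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * vector_size
    return [sum(dim) / len(known) for dim in zip(*known)]

print(average_vector(["great", "read", "unknownword"], toy_vectors, 2))  # [0.5, 0.5]
print(average_vector(["unknownword"], toy_vectors, 2))                   # [0.0, 0.0]
```

Averaging discards word order, so two reviews containing the same words in a different order map to the same vector.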


In [47]:
# Train the GaussianNB Classifier
# Now we can use these vectors as features for our classifier.

nb_model.fit(X_train_vec, y_train)


In [48]:
# Make predictions on the X_test_vec
y_pred = nb_model.predict(X_test_vec)

#### Metrics:

In [49]:
print(confusion_matrix(y_test, y_pred))

[[ 588  203]
 [ 561 1048]]


In [50]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.51      0.74      0.61       791
           1       0.84      0.65      0.73      1609

    accuracy                           0.68      2400
   macro avg       0.67      0.70      0.67      2400
weighted avg       0.73      0.68      0.69      2400



In [52]:
print("Word2Vec accuracy: ", np.round(accuracy_score(y_test, y_pred),2))

Word2Vec accuracy:  0.68


#### Trying out different classification models:

In [56]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compare_classifiers_table(X_train_vec, y_train, X_test_vec, y_test):
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    # from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier  

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
        "GaussianNB": GaussianNB(),
        # "MLP": MLPClassifier(max_iter=500),
        "XGBoost": XGBClassifier(eval_metric='logloss'),  # binary task: 'logloss' rather than the multi-class 'mlogloss'
    }

    results = []

    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        y_pred = model.predict(X_test_vec)
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
        recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
        f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)
        results.append({
            "Model": name,
            "Accuracy": accuracy,
            "Precision (macro)": precision,
            "Recall (macro)": recall,
            "F1-score (macro)": f1
        })

    df_results = pd.DataFrame(results)
    print(df_results.sort_values(by="F1-score (macro)", ascending=False).reset_index(drop=True))
    return df_results


In [57]:
# Usage:
df_metrics = compare_classifiers_table(X_train_vec, y_train, X_test_vec, y_test)

                 Model  Accuracy  Precision (macro)  Recall (macro)  F1-score (macro)
0              XGBoost  0.749583           0.716400        0.698513          0.705226
1        Random Forest  0.747083           0.714334        0.688936          0.697337
2  Logistic Regression  0.754583           0.729114        0.684567          0.696078
3                  SVM  0.750000           0.726322        0.673115          0.684732
4           GaussianNB  0.681667           0.674740        0.697350          0.669526
5                  KNN  0.720000           0.679568        0.657811          0.664519
6        Decision Tree  0.672917           0.628479        0.626873          0.627626
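
Since no single model leads on every metric, it can help to pick out the best model per metric programmatically. The sketch below hard-codes the values from the results table above purely for illustration; the dictionary layout and key names are assumptions.

```python
# Values copied from the results table above
results = {
    "XGBoost":             {"accuracy": 0.749583, "precision": 0.716400, "recall": 0.698513, "f1": 0.705226},
    "Random Forest":       {"accuracy": 0.747083, "precision": 0.714334, "recall": 0.688936, "f1": 0.697337},
    "Logistic Regression": {"accuracy": 0.754583, "precision": 0.729114, "recall": 0.684567, "f1": 0.696078},
    "SVM":                 {"accuracy": 0.750000, "precision": 0.726322, "recall": 0.673115, "f1": 0.684732},
    "GaussianNB":          {"accuracy": 0.681667, "precision": 0.674740, "recall": 0.697350, "f1": 0.669526},
    "KNN":                 {"accuracy": 0.720000, "precision": 0.679568, "recall": 0.657811, "f1": 0.664519},
    "Decision Tree":       {"accuracy": 0.672917, "precision": 0.628479, "recall": 0.626873, "f1": 0.627626},
}

metrics = ["accuracy", "precision", "recall", "f1"]
best = {metric: max(results, key=lambda name: results[name][metric]) for metric in metrics}

for metric in metrics:
    print(metric, "->", best[metric])
# accuracy -> Logistic Regression
# precision -> Logistic Regression
# recall -> XGBoost
# f1 -> XGBoost
```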


#### Best Performing Model
- Logistic Regression has the highest accuracy (0.7546) and precision (macro) (0.7291).

- XGBoost has the highest recall (macro) (0.6985) and F1-score (macro) (0.7052), but its accuracy is slightly lower than Logistic Regression.

- Random Forest ranks second on F1-score (macro) (0.6973), just ahead of Logistic Regression (0.6961). SVM is close to the leaders on accuracy (0.7500) but has lower recall (macro) (0.6731).

#### Conclusion:
- No model dominates: Logistic Regression leads on accuracy and precision (macro), while XGBoost leads on recall (macro) and F1-score (macro). The gap between the two is at most about 1.4 percentage points on any metric, and the results come from a single unseeded split, so both are reasonable choices for this sentiment analysis task.

- Logistic Regression is preferred for this project due to its simplicity, interpretability, and strong performance.