## Project Workflow


### 1. Preprocessing and cleaning (feature engineering)
### 2. Train-test split
### 3. Feature extraction using Bag of Words, TF-IDF, or Word2Vec
### 4. Training machine learning algorithms

In [46]:
#load the dataset
import pandas as pd
df=pd.read_csv('all_kindle_review.csv')

In [47]:
df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000
...,...,...,...,...,...,...,...,...,...,...,...
11995,11995,2183,B001DUGORO,"[0, 0]",4,Valentine cupid is a vampire- Jena and Ian ano...,"02 28, 2014",A1OKS5Q1HD8WQC,lisa jon jung,jena,1393545600
11996,11996,6272,B002JCSFSQ,"[2, 2]",5,I have read all seven books in this series. Ap...,"05 16, 2011",AQRSPXLNEQAMA,TerryLP,Peacekeepers Series,1305504000
11997,11997,12483,B0035N1V7K,"[0, 1]",3,This book really just wasn't my cuppa. The si...,"07 26, 2013",A2T5QLT5VXOJAK,hwilson,a little creepy,1374796800
11998,11998,3640,B001W1XT40,"[1, 2]",1,"tried to use it to charge my kindle, it didn't...","09 17, 2013",A28MHD2DDY6DXB,"Allison A. Slater ""Gryphon50""",didn't work,1379376000


In [48]:
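# keep only the review text and the rating; .copy() makes an independent frame for the in-place edits below
df=df[['reviewText','rating']].copy()
df.head(6)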

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4
5,Aislinn is a little girl with big dreams. Afte...,5
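
The `Unnamed` columns in the full frame look like index columns written out by earlier saves of the CSV (an assumption; the source does not say where they come from). As a minimal sketch, the two needed columns can also be selected at load time with the `usecols` parameter of `pd.read_csv`. The in-memory CSV and its rows below are hypothetical stand-ins for `all_kindle_review.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for all_kindle_review.csv (hypothetical rows, real column names).
sample_csv = io.StringIO(
    "Unnamed: 0,rating,reviewText,summary\n"
    "0,5,Great short read,Loved it\n"
    "1,2,Not my cup of tea,Meh\n"
)

# usecols keeps only the listed columns, so no later column slicing is needed.
reviews = pd.read_csv(sample_csv, usecols=["reviewText", "rating"])
print(reviews.columns.tolist())
print(reviews.shape)
```

Because `read_csv` returns a new frame rather than a slice of an existing one, later column assignments on it do not raise `SettingWithCopyWarning`.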


In [49]:
df.shape

(12000, 2)

In [50]:
# missing values
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [51]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [52]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

### Preprocessing and cleaning

For sentiment analysis, we collapse the ratings into two classes: ratings below 3 are labeled negative (0), and ratings of 3 or above are labeled positive (1). The transformation is applied with a lambda function.

In [53]:
#positive review as 1 and negative review as 0
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)


In [54]:
df['rating'].unique()


array([1, 0], dtype=int64)

In [55]:
df['rating'].value_counts()


rating
1    8000
0    4000
Name: count, dtype: int64
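
Note that 3-star reviews fall into the positive class, which is why the split comes out 8,000 to 4,000 (3,000 + 3,000 + 2,000 versus 2,000 + 2,000). The same rule can be sketched as a vectorized comparison instead of a per-row lambda. The `toy` frame and the `label` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame with one review per star rating.
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Same rule as the lambda: ratings below 3 become 0, all others become 1.
toy["label"] = (toy["rating"] >= 3).astype(int)
print(toy)
```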

### Text Preprocessing Steps
We perform several preprocessing steps on the review text to prepare it for modeling:

#### Convert all text to lowercase.
#### Remove URLs and email addresses.
#### Remove HTML tags.
#### Remove special characters.
#### Remove stopwords.
#### Remove extra spaces.
#### Lemmatize the remaining words.

In [58]:
# lowercase all review text
df['reviewText']=df['reviewText'].str.lower()


In [59]:
# imports and the NLTK English stopword list used by the cleaning function
import re
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import nltk

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [60]:
def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'(http|https|ftp|ssh)://[^\s]+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    # Remove HTML tags
    text = BeautifulSoup(text, 'lxml').get_text()
    # Remove special characters except alphanumeric and spaces
    text = re.sub(r'[^a-z0-9 ]', ' ', text)
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    # Join words back
    text = ' '.join(words)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
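
To see the order of the steps on a concrete string without downloading NLTK data, here is a dependency-free sketch of the same pipeline. It is not the function used in this notebook: `TOY_STOPWORDS` is a deliberately tiny, hypothetical stopword set, and a simple regex stands in for BeautifulSoup's tag removal.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "is", "at", "me", "it", "this"}

def clean_text_sketch(text):
    text = text.lower()
    text = re.sub(r"(http|https|ftp|ssh)://\S+", "", text)  # URLs
    text = re.sub(r"\S+@\S+", "", text)                      # email addresses
    text = re.sub(r"<[^>]+>", " ", text)                     # crude tag stripper (assumes no '>' inside a tag)
    text = re.sub(r"[^a-z0-9 ]", " ", text)                  # punctuation and other symbols
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    return " ".join(words)                                   # joining also collapses extra spaces

sample = "<p>This book is GREAT!</p> Mail me at fan@example.com or see https://example.com/review"
print(clean_text_sketch(sample))
```

URLs and email addresses are removed before the special-character step because that step would otherwise break them into ordinary-looking word fragments.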

In [61]:
df['reviewText'] = df['reviewText'].apply(clean_text)
print(df['reviewText'].head())


0    jace rankin may short nothing mess man hauled ...
1    great short read want put read one sitting sex...
2    start saying first four books expecting conclu...
3    aggie angela lansbury carries pocketbooks inst...
4    expect type book library pleased find price right
Name: reviewText, dtype: object



In [62]:
# lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [63]:
df['reviewText'] = df['reviewText'].apply(lemmatize_text)
print(df['reviewText'].head())

0    jace rankin may short nothing mess man hauled ...
1    great short read want put read one sitting sex...
2    start saying first four book expecting conclud...
3    aggie angela lansbury carry pocketbook instead...
4    expect type book library pleased find price right
Name: reviewText, dtype: object



In [68]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df['reviewText'], df['rating'], test_size=0.20, random_state=42
)

print(f'Training samples: {len(x_train)}')
print(f'Testing samples: {len(x_test)}')

Training samples: 9600
Testing samples: 2400
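
With a 2:1 class ratio, a purely random split only approximately preserves the class proportions in each part. As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. The labels and texts below are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same 2:1 positive/negative ratio as the dataset.
labels = [1] * 80 + [0] * 40
texts = [f"review {i}" for i in range(len(labels))]

# stratify=labels asks for the same class proportions in train and test.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)

print("train:", y_tr.count(1), "positive,", y_tr.count(0), "negative")
print("test: ", y_te.count(1), "positive,", y_te.count(0), "negative")
```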


In [69]:
# Bag of Words features
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()

In [70]:

x_train_bow = bow.fit_transform(x_train)
x_test_bow = bow.transform(x_test)

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x_train_tfidf = tfidf.fit_transform(x_train)
x_test_tfidf = tfidf.transform(x_test)
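
To make the two representations concrete, the sketch below fits both vectorizers on a hypothetical two-review corpus. Bag of Words stores raw term counts, while TF-IDF down-weights terms that appear in every document, so in the second review "boring" ends up weighted above "book".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great book great read", "boring book"]  # hypothetical corpus

# Bag of Words: one column per vocabulary term, values are raw counts.
bow_toy = CountVectorizer()
counts = bow_toy.fit_transform(docs).toarray()
vocab = sorted(bow_toy.vocabulary_, key=bow_toy.vocabulary_.get)
print(vocab)
print(counts)

# TF-IDF: same columns, but terms shared by all documents get a lower weight.
tfidf_toy = TfidfVectorizer()
weights = tfidf_toy.fit_transform(docs).toarray()
book = tfidf_toy.vocabulary_["book"]
boring = tfidf_toy.vocabulary_["boring"]
print(weights[1, boring] > weights[1, book])
```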

In [72]:
# GaussianNB needs dense input, so convert the sparse matrices to arrays
x_train_bow = x_train_bow.toarray()
x_test_bow = x_test_bow.toarray()
x_train_tfidf = x_train_tfidf.toarray()
x_test_tfidf = x_test_tfidf.toarray()
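
The dense conversion is needed because `GaussianNB` does not accept sparse input. As an alternative that this notebook does not use, `MultinomialNB` models word counts directly and can be fitted on the sparse matrix as it comes out of the vectorizer. The sketch below uses a hypothetical four-review training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training reviews: 1 = positive, 0 = negative.
train_docs = ["great book loved it", "wonderful read great",
              "boring waste", "terrible boring book"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
x_sparse = vec.fit_transform(train_docs)  # stays a SciPy sparse matrix

clf = MultinomialNB()
clf.fit(x_sparse, train_labels)           # no .toarray() required

preds = clf.predict(vec.transform(["great wonderful read", "boring terrible"]))
print(preds)
```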

In [74]:
from sklearn.naive_bayes import GaussianNB
nb_model_bow = GaussianNB()


In [75]:
nb_model_bow.fit(x_train_bow, y_train)
nb_model_tfidf = GaussianNB()
nb_model_tfidf.fit(x_train_tfidf, y_train)

In [76]:
y_pred_bow = nb_model_bow.predict(x_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(x_test_tfidf)

In [77]:
from sklearn.metrics import accuracy_score, confusion_matrix
print('Bag of Words Accuracy:', accuracy_score(y_test, y_pred_bow))
print('TF-IDF Accuracy:', accuracy_score(y_test, y_pred_tfidf))
print('Bag of Words Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_bow))
print('TF-IDF Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_tfidf))

Bag of Words Accuracy: 0.5654166666666667
TF-IDF Accuracy: 0.5720833333333334
Bag of Words Confusion Matrix:
[[509 294]
 [749 848]]
TF-IDF Confusion Matrix:
[[490 313]
 [714 883]]
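
The confusion matrices say more than the accuracy alone. The sketch below recomputes a few metrics from the Bag of Words matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. It also computes the accuracy of always predicting the larger class, as a reference point.

```python
# Bag of Words confusion matrix copied from the output above.
# scikit-learn convention: rows are true labels, columns are predicted labels.
cm = [[509, 294],
      [749, 848]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_pos = tp / (tp + fn)       # share of truly positive reviews that were found
precision_pos = tp / (tp + fp)    # share of positive predictions that were correct
majority_baseline = max(tn + fp, fn + tp) / total  # always predict the larger class

print(f"accuracy={accuracy:.4f} recall={recall_pos:.4f} "
      f"precision={precision_pos:.4f} baseline={majority_baseline:.4f}")
```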


Bag of Words and TF-IDF convert the review text into numeric vectors that machine learning models can consume.
Both are computed with scikit-learn's CountVectorizer and TfidfVectorizer.
Gaussian Naive Bayes is trained on each feature set.
Both models reach only about 57% accuracy (56.5% for Bag of Words, 57.2% for TF-IDF), which is below the 66.5% that always predicting the positive class would score on this test split (1,597 of 2,400 reviews). This motivates trying Word2Vec embeddings next.

In [80]:
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r'[^a-z ]', '', text)
    return text.split()

df['tokens'] = df['reviewText'].apply(tokenize)




In [81]:
from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=df['tokens'],
    vector_size=100,
    window=5,
    min_count=2
)


In [82]:
import numpy as np

def sent_vector(tokens, model):
    vec = np.zeros(model.vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count else vec

X_w2v = np.array([sent_vector(t, w2v) for t in df['tokens']])
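
The averaging step can be checked by hand on a toy example. In the sketch below, `toy_vectors` is a hypothetical 3-dimensional lookup standing in for `model.wv`, and `average_vector` is an illustrative helper, not part of gensim. Unknown words are skipped, and a review with no known words falls back to the zero vector, as in `sent_vector`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "book": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, size=3):
    # Average the vectors of known tokens; all-unknown input gives zeros.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

mixed = average_vector(["great", "book", "unseenword"], toy_vectors)
unknown_only = average_vector(["unseenword"], toy_vectors)
print(mixed)
print(unknown_only)
```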


In [85]:
x_train, x_test, y_train, y_test = train_test_split(
    X_w2v, df['rating'], test_size=0.2, random_state=42
)



from sklearn.linear_model import LogisticRegression

model_w2v = LogisticRegression(max_iter=1000)
model_w2v.fit(x_train, y_train)

y_pred_w2v = model_w2v.predict(x_test)
print("Word2Vec Accuracy:", accuracy_score(y_test, y_pred_w2v))


Word2Vec Accuracy: 0.76875
