# Amazon Book Review Classification with BERT

In this notebook, a binary classification model is trained and fine-tuned using BERT on an Amazon Book Review dataset to predict whether a review is positive or negative.


# Outline
- [1 - Packages](#1)
- [2 - Preprocessing Data](#2)
  - [2.1 Loading and Visualizing the Data](#2.1)
  - [2.2 Preprocessing](#2.2)
  - [2.3 Text Processing](#2.3)
  - [2.4 Data Split](#2.4)
- [3 - Classification Model](#3)
  - [3.1 BERT Model](#3.1)
  - [3.2 Training](#3.2)
- [4 - Results](#4)

<a name="1"></a>
## 1 - Packages 

Below are all the needed packages for this notebook.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
- [tensorflow](https://www.tensorflow.org/) is an end-to-end machine learning platform.
- [scikit-learn](https://scikit-learn.org/stable/) is a library of simple and efficient tools for predictive data analysis.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Dropout
from sklearn.model_selection import train_test_split

<a name="2"></a>
## 2 - Preprocessing Data

The dataset for our model contains about 3 million book reviews covering 212,404 unique books, along with the users who wrote each review.
The dataset can be found here: [Amazon Book Reviews](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?utm_source=pocket_mylist).
<br/><br/>
<a name="2.1"></a>
### 2.1 Loading and Visualizing the Data


In [2]:
#Load data
data = pd.read_csv("./Data/Books_rating.csv")

In [3]:
data.head()

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


In [4]:
print(f"Shape(rows, columns): {data.shape}\n")
print(f"review/score values: {data['review/score'].unique().tolist()}")

Shape(rows, columns): (3000000, 10)

review/score values: [4.0, 5.0, 1.0, 3.0, 2.0]


<a name="2.2"></a>
### 2.2 Preprocessing

The dataset has 3 million rows and 10 columns, and the review scores range from 1 to 5. Since we want to train a binary classifier, let's take a closer look at some of the reviews with a middle rating of 3.

In [5]:
#Get dataframe with rows that have only 3 as score value
trees = data.loc[data['review/score'] == 3.0]
trees.head()

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
51,B0007FIF28,The Overbury affair (Avon),,A2GERYVE64DIPL,lisamac,0/0,3.0,1313366400,Overbury,Full of intrigue and a good overview of the co...
72,1858683092,Mensa Number Puzzles (Mensa Word Games for Kids),,A1AYN4J7T43M11,"""dirtpile""",4/4,3.0,981417600,Made me wish I was Einstein.,Not much I can say about this book... It's ful...
76,0792391810,Vector Quantization and Signal Compression (Th...,76.94,A30DX2BO4Y4NLU,Moosh,3/5,3.0,1113609600,Comprehensive but marred by poor printing,"This book appears to be a ""print on demand"" st..."
81,0974289108,The Ultimate Guide to Law School Admission: In...,14.95,A1KZ0RDJZQSY4O,sayock,27/29,3.0,1090368000,No &quot;Insider&quot; Secrets,If you are someone who is fairly new to the la...
105,B000NKGYMK,Alaska Sourdough,,A258YNWJW2264M,"Tessa F. Briggs ""Tessa B""",8/14,3.0,1241827200,Not your quick refrence cookbook,After having a chance to read through the book...


In [6]:
print(f"Review 1: {trees['review/text'].iloc[0]}\n")
print(f"Review 2: {trees['review/text'].iloc[1]}\n")
print(f"Review 3: {trees['review/text'].iloc[2]}\n")

Review 1: Full of intrigue and a good overview of the court of James 1 and the key players. Provides a good general history of all the facts of the case.


Review 3: This book appears to be a "print on demand" style book which manifests itself, in this case, as a poor quality hardcover. Some of the text is laughably bad, the few images near the end look like they are snapshots from a black and white TV screen, but most bothersome is the "muddy" look of the text. The price seems a bit steep for such a poor quality print. That said, the actual content is very comprehensive. One of the authors (Gray) is co-creator of the now commonplace LBG (Linde Buzo Gray) VQ algorithm. Fortunately, the quality of the content shines through the terrible printing.



We can see that reviews with a rating of 3 can be positive, negative, or a mix of both. We'll exclude the reviews with a score of 3, since we want to train our model on clearly negative or clearly positive reviews.

In [7]:
#Copy of original dataframe
df = data.copy()

#Rename columns for simplicity
df.rename(columns = {'review/text':'text'}, inplace = True)
df.rename(columns = {'review/score':'score'}, inplace = True)
df.rename(columns = {'review/summary':'summary'}, inplace = True)

#Dataframe without rows with 3 as score value
df = df[df.score != 3.0]

Let's drop the unnecessary columns. For our training data, we'll only need the score and text columns. The summary will also be kept for now as a reference.

In [8]:
df = df.drop(['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/time'], axis=1)
df.head()

Unnamed: 0,score,summary,text
0,4.0,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,5.0,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,5.0,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,4.0,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,4.0,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


Let's also check for missing values and get rid of the rows with missing review text.

In [9]:
#null values
df.isnull().sum()

score       0
summary    36
text        8
dtype: int64

In [10]:
print(f"# of rows: {len(df)}")

# of rows: 2745705


In [11]:
#Dataframe without null values in text
df = df[df['text'].notna()]
df.isnull().sum()

score       0
summary    36
text        0
dtype: int64

In [12]:
print(f"# of rows: {len(df)}")

# of rows: 2745697


Now let's transform the score variable into binary labels. We'll assign a value of 0 to scores 1.0 and 2.0, and a value of 1 to scores 4.0 and 5.0, so that 0 marks a negative review and 1 a positive review.

In [13]:
#Change score values to binary
classes = {1.0: 0, 2.0: 0, 4.0: 1, 5.0: 1}
df['score'] = df['score'].map(classes)
df.head()

Unnamed: 0,score,summary,text
0,1,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,1,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,1,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,1,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,1,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


In [14]:
#Number of rows per value
df['score'].value_counts()

1    2392951
0     352746
Name: score, dtype: int64

The dataset is highly imbalanced, with roughly seven positive reviews for every negative one, so we'll reduce the number of positive reviews to match the number of negative reviews.

In [15]:
#Count positive and negative reviews
pos = df['score'].value_counts()[1]
neg = df['score'].value_counts()[0]
#Drop the last (pos - neg) positive rows, in index order, so both classes have the same size
df.drop(df[df.score == 1].index[-(pos-neg):], inplace=True)
df.score.value_counts()

1    352746
0    352746
Name: score, dtype: int64
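
The cell above balances the classes by removing the trailing positive rows in index order. If the file happens to be ordered by book, this keeps positive reviews only from the books that appear first. A minimal sketch of random undersampling as an alternative is shown below, using plain Python on a made-up list of rows (the `undersample` helper and the toy data are hypothetical, not part of this notebook):

```python
import random
from collections import Counter

# Toy labelled reviews (hypothetical data): 8 positive, 3 negative
rows = [(1, f"good book {i}") for i in range(8)] + [(0, f"bad book {i}") for i in range(3)]

def undersample(rows, seed=0):
    """Randomly keep as many rows per class as the smallest class has."""
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, []).append((label, text))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))  # sample without replacement
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced

balanced = undersample(rows)
counts = Counter(label for label, _ in balanced)
print(counts)
```

With pandas, the same idea could likely be expressed by calling `DataFrame.sample(n=neg, random_state=0)` on the positive rows and concatenating the result with the negative rows.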

<a name="2.3"></a>
### 2.3 Text Processing

Now we can process and clean the text itself. For this, let's remove unnecessary vocabulary like stopwords and single characters, as well as punctuation and numbers.

Stopwords are common words that carry little meaning on their own and will be of no use as keywords, such as: about, are, at, because, does, etc.

In [16]:
import string
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))

def remove_stopwords(text):
    filtered_words = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(filtered_words)

def remove_punct(text):
    translator = str.maketrans("", "", string.punctuation)
    return text.translate(translator)

def remove_nums(text):
    translator = str.maketrans("", "", "0123456789")
    return text.translate(translator)

def remove_single_char(text):
    threshold = 1
    filtered_words =[word for word in text.split() if len(word) > threshold]
    return " ".join(filtered_words)

In [17]:
#Process text
df['text'] = df.text.map(remove_punct)
df['text'] = df.text.map(remove_nums)
df['text'] = df.text.map(remove_single_char)
df['text'] = df.text.map(remove_stopwords)
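
To see what these four steps do to a single review, here is a small self-contained sketch that applies them in the same order. It assumes a tiny, made-up stopword set (`STOP`) in place of the NLTK list so that it runs without downloading the corpus, and the `clean` helper is hypothetical:

```python
import string

# Hypothetical, tiny stopword set standing in for nltk's English list
STOP = {"this", "is", "the", "it", "and", "of", "don't"}

def clean(text, stop=STOP):
    """Apply the four cleaning steps in the same order as the cell above."""
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove_punct
    text = text.translate(str.maketrans("", "", string.digits))       # remove_nums
    words = [w for w in text.split() if len(w) > 1]                   # remove_single_char
    return " ".join(w.lower() for w in words if w.lower() not in stop)  # remove_stopwords

cleaned = clean("This is a GREAT book, 10/10: the plot and the pace of it!")
print(cleaned)  # great book plot pace

contraction = clean("I don't mind")
print(contraction)  # dont mind
```

The second call shows a side effect of the ordering: because punctuation is stripped first, a contraction such as "don't" becomes "dont" and no longer matches its stopword entry. Running the stopword step before the punctuation step would be one way to avoid this.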

<a name="2.4"></a>
### 2.4 Data Split

The dataset is now ready. Before we define our model, let's extract the x and y arrays from the dataframe and split them into training and validation sets.

In [18]:
#Convert pandas dataframe to numpy array
x = df.text.to_numpy() #text column
y = df.score.to_numpy() #score column

In [19]:
#Split data
train_x, val_x, train_y, val_y = train_test_split(x, y, test_size=0.3, random_state=0)

The validation set will be 30% of the data.

In [20]:
#y values
train_y

array([0, 0, 1, ..., 1, 0, 1], dtype=int64)

In [21]:
print(f"train length: {len(train_x)}")
print(f"validation length: {len(val_x)}")

train length: 493844
validation length: 211648
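
The two lengths above can be checked by hand. The sketch below assumes the validation size is rounded up to the next integer, which is consistent with the printed values, and uses the balanced class count from section 2.2:

```python
import math

n_samples = 352746 * 2  # balanced dataset: equal positive and negative reviews
test_size = 0.3         # fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)  # validation rows, rounded up
n_train = n_samples - n_val               # the rest goes to training

print(n_train, n_val)  # 493844 211648
```

Because the split is random, the two classes are only approximately balanced within each set. `train_test_split` also accepts a `stratify` argument for cases where an exact class ratio in both sets matters.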


<a name="3"></a>
## 3 - Classification Model

<a name="3.1"></a>
### 3.1 BERT Model

We load the BERT preprocessor and encoder from tensorflow_hub:

In [22]:
#Load BERT model preprocessor and encoder
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")



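The model is built with the Keras functional API. The pooled BERT output is passed through a dropout layer to reduce overfitting, followed by a Dense output layer with a sigmoid activation that returns the probability of a positive review. Note that `hub.KerasLayer` is not trainable by default, so in this setup the BERT weights stay frozen and training updates only the output layer.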

In [23]:
#Build functional model
text_input = Input(shape=(), dtype=tf.string, name="text")
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

l = Dropout(0.3, name="dropout")(outputs['pooled_output'])
l = Dense(1, activation='sigmoid', name="output")(l)

model = tf.keras.Model(inputs=[text_input], outputs=[l])
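
To make the output layer more concrete, the sketch below reproduces in plain Python what a single-unit Dense layer with a sigmoid activation computes, together with the binary cross-entropy loss used for training. The 4-dimensional vector, weights and bias are made-up values for illustration (the real pooled output has 768 dimensions), and the helper names are hypothetical:

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(features, weights, bias):
    """One Dense unit: weighted sum plus bias, followed by a sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Made-up stand-in for the pooled BERT output and the layer parameters
pooled = [0.2, -0.5, 0.1, 0.7]
weights = [0.4, -0.3, 0.0, 0.5]
bias = -0.1

p = dense_sigmoid(pooled, weights, bias)        # predicted probability of a positive review
label = int(p >= 0.5)                           # threshold at 0.5 to get a class
loss_if_positive = binary_cross_entropy(1, p)   # small when p is close to 1
loss_if_negative = binary_cross_entropy(0, p)   # larger, since p leans positive

print(p, label, loss_if_positive, loss_if_negative)
```

The loss is lower when the predicted probability agrees with the true label, which is what training pushes the weights toward.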

In [24]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                      
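
The summary shows that the preprocessing layer turns each review into three tensors of length 128: `input_word_ids`, `input_type_ids` and `input_mask`. The sketch below is a simplified illustration of what packing to a fixed length means. The token ids are made up, the `pack` helper is hypothetical, and the real layer additionally performs WordPiece tokenization with its own vocabulary:

```python
SEQ_LENGTH = 128  # sequence length shown in the model summary

def pack(token_ids, seq_length=SEQ_LENGTH, cls_id=101, sep_id=102, pad_id=0):
    """Truncate or pad a list of token ids to a fixed length (simplified sketch)."""
    body = token_ids[: seq_length - 2]       # leave room for the two special tokens
    word_ids = [cls_id] + body + [sep_id]    # start and end markers
    mask = [1] * len(word_ids)               # 1 marks a real token
    padding = seq_length - len(word_ids)
    word_ids += [pad_id] * padding           # pad short inputs
    mask += [0] * padding                    # 0 marks padding
    type_ids = [0] * seq_length              # single-segment input: all zeros
    return {
        "input_word_ids": word_ids,
        "input_mask": mask,
        "input_type_ids": type_ids,
    }

short = pack([2023, 2338, 2003, 2307])       # made-up ids for a 4-token review
long = pack(list(range(1000, 1500)))         # made-up ids for a 500-token review

print(len(short["input_word_ids"]), sum(short["input_mask"]))
print(len(long["input_word_ids"]), sum(long["input_mask"]))
```

One consequence is that anything beyond the first 128 tokens of a long review is not seen by the encoder.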

<a name="3.2"></a>
### 3.2 Training

In [25]:
#Configure and train
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),  #the sigmoid output is already a probability
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

history = model.fit(
    train_x, train_y,
    batch_size = 32,
    validation_data = (val_x, val_y),
    epochs = 3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<a name="4"></a>
## 4 - Results

The loss and accuracy on the training data and the validation data are close, which suggests that the model is not overfitting and can generalize to new reviews. The training loss is 0.56 with an accuracy of 0.71, and the validation loss is 0.52 with an accuracy of 0.75. Since the classes are balanced, chance level is 0.50, so the model is a reasonable baseline for this data. Data augmentation, along with other models, could be explored to obtain better results.
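
The comparison above can be written down as a small check. The sketch below uses the four numbers reported in this section, and the `looks_overfit` rule with its 0.05 tolerance is a hypothetical rule of thumb rather than a standard test:

```python
# Final metrics as reported in the text above
train = {"loss": 0.56, "accuracy": 0.71}
val = {"loss": 0.52, "accuracy": 0.75}

def gaps(train, val):
    """Validation minus training value for each metric."""
    return {name: val[name] - train[name] for name in train}

def looks_overfit(train, val, tolerance=0.05):
    """Flag the model when validation is clearly worse than training."""
    worse_loss = val["loss"] - train["loss"] > tolerance
    worse_accuracy = train["accuracy"] - val["accuracy"] > tolerance
    return worse_loss or worse_accuracy

metric_gaps = gaps(train, val)
overfit = looks_overfit(train, val)

print(metric_gaps)
print(overfit)  # False
```

Validation metrics that are slightly better than the training ones are not unusual here, because dropout is active while the training metrics are computed and disabled during validation. In the notebook itself, the per-epoch values should be available in `history.history` under the keys `loss`, `accuracy`, `val_loss` and `val_accuracy`.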