# Text Classification With BERT and KerasNLP

Having built the sentiment analysis model with several different algorithms, I will now use BERT, a popular masked language model that is bidirectional (it sees the words to both the left and the right of each token), to build the text classification model. I will also use KerasNLP, which provides a simple Keras API for training and fine-tuning NLP models, to classify the sentiments.

In [2]:
# import the required libraries

import pandas as pd
import numpy as np
import re
import zipfile
import os
import string
import tensorflow as tf
from tensorflow import keras
import keras_nlp
from transformers import BertTokenizer, TFBertForSequenceClassification
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

Using TensorFlow backend


In [3]:
# load the exported data
df1 = pd.read_csv('/kaggle/input/sentiments/exported_sentiments.csv')

In [4]:
# encode the target labels
df1['Sentiments'] = df1['Sentiments'].replace({
    'negative': 0,
    'positive': 1
})
df1['Sentiments'].value_counts()

Sentiments
0    59
1    41
Name: count, dtype: int64
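
The counts above give a useful reference point for the accuracy figures reported later: a model that always predicts the majority class would already be right 59% of the time on the full dataset. A minimal sketch of that calculation follows; the `majority_baseline` helper is illustrative and not part of the notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority_label, accuracy obtained by always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Class counts copied from the value_counts() output above.
labels = [0] * 59 + [1] * 41
label, acc = majority_baseline(labels)
print(f"always predicting {label} gives {acc:.2%} accuracy")
```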

In [5]:
X = df1['Feedback']
y = df1['Sentiments']

In [6]:
print(y)
print()
X.to_frame()

0     1
1     1
2     0
3     0
4     0
     ..
95    1
96    0
97    0
98    1
99    1
Name: Sentiments, Length: 100, dtype: int64



Unnamed: 0,Feedback
0,"The man is too fast in his teaching,he clearly..."
1,The class is dry but he really puts in efforts
2,The course is shit and it's a threat to my bra...
3,"He no try at all, didn’t teach well."
4,Ogbeni you sef know as e dae go
...,...
95,easy and no wahala
96,terrible way of teaching with the I-dont-care ...
97,do not like coding
98,this practical is hard on top 1 unit course haba


In [7]:
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/README  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.sense  
  inflating: /usr/share/nltk_data

In [8]:
# Text preprocessing of the feedback column using NLTK
def preprocess_text(text):
    text = text.lower()
    # remove URLs, @mentions and #hashtags
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    # remove anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # remove standalone numbers
    text = re.sub(r'\b[0-9]+\b\s*', '', text)
    # remove leftover punctuation (e.g. underscores, which \w keeps)
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    # drop English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # reduce each token to its dictionary form
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

X_preprocessed = [preprocess_text(text) for text in X]

# Split the preprocessed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.25)
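
One thing worth checking in `preprocess_text` is that NLTK's English stop-word list includes negations such as "not" and "no", so the stop-word filter can change the meaning of short feedback. The sketch below illustrates the effect with a small hard-coded subset of that list (an assumption made so the example runs without the NLTK corpus); `drop_stop_words` and `keep_negations` are hypothetical names, and whether keeping negations actually helps this classifier would need to be tested.

```python
# Small subset of NLTK's English stop words, hard-coded for illustration only.
STOP_WORDS = {"do", "not", "no", "is", "the", "this", "a"}
# Words we might want to keep because they carry sentiment.
NEGATIONS = {"not", "no", "nor", "never"}


def drop_stop_words(text, keep_negations=False):
    """Remove stop words, optionally leaving negation words in place."""
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(tok for tok in text.lower().split() if tok not in stop)


print(drop_stop_words("do not like coding"))
print(drop_stop_words("do not like coding", keep_negations=True))
```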

In [9]:
X_train, X_test, y_train, y_test = train_test_split(pd.Series(X_preprocessed), y, test_size=0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(75,) (75,)
(25,) (25,)
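
With only 100 rows, a purely random split can leave the 25 test rows with a different class mix from the training rows, and the split changes on every run. scikit-learn's `train_test_split` accepts `stratify=` and `random_state=` arguments for this purpose; the pure-Python sketch below only illustrates the idea and is not the notebook's method. `stratified_split` and its seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same class counts as the dataset: 59 negative, 41 positive.
labels = [0] * 59 + [1] * 41
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```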


In [10]:
# Convert labels to one-hot encoded format
y_train = tf.keras.utils.to_categorical(y_train, num_classes=2, dtype='float32')
y_test = tf.keras.utils.to_categorical(y_test, num_classes=2, dtype='float32')
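
For reference, one-hot encoding turns each integer label `k` into a vector with a 1 at position `k`, which is why `np.argmax` over a prediction can later be used to index `sentiment_categories`. A minimal pure-Python sketch of the same mapping (the `one_hot` helper is illustrative, not the Keras implementation):

```python
def one_hot(labels, num_classes):
    """Map each integer label to a list with 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# 1 = positive, 0 = negative, as encoded earlier in the notebook.
encoded = one_hot([1, 0, 0], num_classes=2)
print(encoded)
```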

In [11]:
# load the pretrained BERT model that has been finetuned for sentiment analysis

model_name = "bert_tiny_en_uncased_sst2"
classifier = keras_nlp.models.BertClassifier.from_preset(
    model_name,
    num_classes=2,
    load_weights = True,
    activation='sigmoid' # for the binary classification task
)

Downloading data from https://storage.googleapis.com/keras-nlp/models/bert_tiny_en_uncased_sst2/v1/vocab.txt
Downloading data from https://storage.googleapis.com/keras-nlp/models/bert_tiny_en_uncased_sst2/v1/model.h5


The next step is to compile and train the model. The aim is to take the pre-trained model and fine-tune it on this dataset.

In [12]:
# Freeze the backbone before compiling so that only the classification head is trained;
# Keras expects `trainable` to be set before `compile()` is called.
classifier.backbone.trainable = False
classifier.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(),
    jit_compile=True,
    metrics=["accuracy"],
)
# Train for one epoch (the default), validating on the held-out split.
classifier.fit(x=X_train, y=y_train, validation_data=(X_test,y_test), batch_size=64)



<keras.callbacks.History at 0x7c6ecf0c60b0>

In [13]:
# evaluate the model on the testing data
classifier.evaluate(X_test, y_test,batch_size=32)



[0.3557356894016266, 0.8799999952316284]
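
The second number is the accuracy on the 25 held-out rows, so 0.88 corresponds to 22 correct predictions. With a test set this small the estimate is quite uncertain; one common way to quantify that is a Wilson score interval. The sketch below is an illustration under that assumption (the `wilson_interval` helper is not part of the notebook).

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 22 of 25 test rows classified correctly (0.88 accuracy).
low, high = wilson_interval(correct=22, total=25)
print(f"95% interval for accuracy: {low:.2f} to {high:.2f}")
```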

In [15]:
# spot-check predictions on 10 random feedback rows
# (sampled from the full dataset, so some may have been seen during training)
sentiment_categories = ["negative", "positive"]

new_examples = list(df1['Feedback'].sample(10))

scores = classifier.predict([preprocess_text(example) for example in new_examples])

for i, score in enumerate(scores):
    print(f"{new_examples[i]}:➡ {sentiment_categories[np.argmax(score)]} with a { (100 * np.max(score)).round(2) } percent confidence.")
    print()

love to code and course is about coding. A plus for me:➡ negative with a 81.2 percent confidence.

The course is shit and it's a threat to my brain,the teaching mode is so poor :➡ negative with a 93.55 percent confidence.

great teaching method from lecturer:➡ negative with a 64.99 percent confidence.

nice:➡ positive with a 77.77 percent confidence.

this course is hard:➡ negative with a 92.44 percent confidence.

He no try at all, didn’t teach well.:➡ negative with a 84.03 percent confidence.

Akanni, you are a bad teacher wtf:➡ negative with a 93.4 percent confidence.

I just hope I pass this course cos omo:➡ negative with a 91.12 percent confidence.

I struggled at the start but it all went easy as time goes by:➡ negative with a 81.65 percent confidence.

The teaching mode is okay as the lecturer do revision of what's being taught from time to time.:➡ negative with a 80.12 percent confidence.
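
A note on reading these percentages: the classifier was created with `activation='sigmoid'`, so each of the two outputs is squashed independently and the pair need not sum to 1. The printed "confidence" is simply the larger of the two. The sketch below contrasts sigmoid and softmax on made-up logits (not taken from the model) to show the difference.

```python
import math


def sigmoid(x):
    """Squash a single logit into (0, 1), independently of other logits."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Normalise logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for (negative, positive); purely illustrative.
logits = [1.5, 0.5]
sig = [sigmoid(x) for x in logits]
soft = softmax(logits)
print("sigmoid:", [round(s, 3) for s in sig], "sum =", round(sum(sig), 3))
print("softmax:", [round(s, 3) for s in soft], "sum =", round(sum(soft), 3))
```

Both give the same predicted class here, but only the softmax scores can be read as a probability split between the two classes.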



### Improving model accuracy

In [16]:
from tensorflow.keras.callbacks import ReduceLROnPlateau

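# Define a learning rate scheduler
lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-6)

# Fit for one more epoch (the default). Note that with patience=3 the scheduler
# needs several epochs without improvement before it lowers the learning rate.
classifier.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), batch_size=32, callbacks=[lr_scheduler])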



<keras.callbacks.History at 0x7c6ea8a22950>
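
To see how `factor`, `patience` and `min_lr` interact, here is a simplified pure-Python simulation of the reduce-on-plateau rule. It is a sketch, not Keras's implementation: it ignores the callback's `cooldown` and `min_delta` options, and the validation losses are invented. The starting rate of 1e-3 matches the default for `keras.optimizers.Adam()`.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.2, patience=3, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
        history.append(lr)
    return history


# Invented validation losses: improvement stalls for three epochs, then resumes.
losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.39]
rates = simulate_reduce_on_plateau(losses)
print(rates)
```

Under this rule a single-epoch `fit` call can never trigger a reduction, which is why the scheduler only becomes useful once `epochs` is larger than `patience`.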

In [17]:
# evaluate the model on the testing data
classifier.evaluate(X_test, y_test,batch_size=32)



[0.3510168194770813, 0.9200000166893005]

In [18]:
# spot-check predictions on 10 random feedback rows
# (sampled from the full dataset, so some may have been seen during training)
sentiment_categories = ["negative", "positive"]

new_examples = list(df1['Feedback'].sample(10))

scores = classifier.predict([preprocess_text(example) for example in new_examples])

for i, score in enumerate(scores):
    print(f"{new_examples[i]}:➡ {sentiment_categories[np.argmax(score)]} with a { (100 * np.max(score)).round(2) } percent confidence.")
    print()

Omooooo is all I can say😭:➡ negative with a 78.77 percent confidence.

good one:➡ positive with a 87.95 percent confidence.

Teaching mode is bad but course is sometimes easy to understand:➡ negative with a 77.9 percent confidence.

I learnt a lot in the course, but the lecturers are too demanding:➡ negative with a 68.98 percent confidence.

hard and lecturer is fast when teaching:➡ negative with a 53.13 percent confidence.

The lecturer is good and his course is also good:➡ positive with a 82.27 percent confidence.

do not like coding:➡ negative with a 78.99 percent confidence.

The course is very very difficult and the lecturer no dey even make am easy.
He actually taught and explained 27 pages of the material within 3 hours.:➡ negative with a 75.1 percent confidence.

love to code:➡ positive with a 89.35 percent confidence.

The outline of the course is difficult and lecturer is bad:➡ negative with a 85.93 percent confidence.



In [163]:
# # Unfreeze only the last three layers of the BERT backbone
# classifier.backbone.trainable = True
# for layer in classifier.backbone.layers[:-3]:
#     layer.trainable = False

# # Compile and fit the model again
# classifier.compile(
#     loss=keras.losses.BinaryCrossentropy(),
#     optimizer=keras.optimizers.Adam(learning_rate=1e-5),  # Adjust the learning rate
#     metrics=["accuracy"]
# )
# classifier.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), batch_size=32, callbacks=[lr_scheduler])

## Finetune BERT With User-controlled Preprocessing

In [19]:
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    model_name,
    sequence_length=128,
)
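
Roughly speaking, the preprocessor tokenizes each string and packs it into fixed-length arrays: a start token, the text tokens (truncated if necessary), an end token, then padding up to `sequence_length`. The pure-Python sketch below mimics that packing step for a single sentence. The `pack` helper is hypothetical, the special-token ids (101, 102, 0) are assumed from the standard BERT uncased vocabulary, and the example token ids are arbitrary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids for [CLS], [SEP], [PAD]


def pack(token_ids, sequence_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to a fixed length."""
    body = token_ids[: sequence_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    n_pad = sequence_length - len(ids)
    return {
        "token_ids": ids + [PAD_ID] * n_pad,
        "segment_ids": [0] * sequence_length,  # single sentence: all zeros
        "padding_mask": [True] * len(ids) + [False] * n_pad,
    }


# Arbitrary token ids standing in for a three-token sentence.
packed = pack([2307, 4083, 4118], sequence_length=8)
print(packed)
```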

In [20]:
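# Wrapping each split in a list makes from_tensor_slices yield a single element that
# holds the whole split, so every epoch is one full-batch step.
training_data = tf.data.Dataset.from_tensor_slices(([X_train], [y_train]))
validation_data = tf.data.Dataset.from_tensor_slices(([X_test], [y_test]))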

train_cached = (
    training_data.map(preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)
)
test_cached = (
    validation_data.map(preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)
)

In [23]:
# Pretrained classifier. preprocessor=None because the inputs are already
# preprocessed by the tf.data pipeline above.
classifier2 = keras_nlp.models.BertClassifier.from_preset(
    model_name,
    preprocessor=None,
    num_classes=2,
    load_weights = True,
    activation='sigmoid'
)
classifier2.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=keras.optimizers.Adam(),
    jit_compile=True,
    metrics=["accuracy"],
)
# The backbone is left trainable here, so all weights are fine-tuned for 10 epochs.
classifier2.fit(train_cached, validation_data=test_cached,epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7c6e795c20e0>

In [22]:
# spot-check predictions on 30 random feedback rows
# (sampled from the full dataset, so some may have been seen during training)
sentiment_categories = ["negative", "positive"]

new_examples = list(df1['Feedback'].sample(30))

test_data =  preprocessor([preprocess_text(example) for example in new_examples])
scores = classifier2.predict(test_data)

for i, score in enumerate(scores):
    print(f"{new_examples[i]}:➡ {sentiment_categories[np.argmax(score)]} with a { (100 * np.max(score)).round(2) } percent confidence.")
    print()

neutral:➡ negative with a 98.74 percent confidence.

good:➡ positive with a 98.73 percent confidence.

easy and no wahala:➡ positive with a 98.73 percent confidence.

way to go. Nice job from lecturer:➡ positive with a 98.73 percent confidence.

Awful & terrible from both the course and lecturer:➡ negative with a 98.74 percent confidence.

this course is hard:➡ negative with a 98.74 percent confidence.

thank God for my coding skills  bruh:➡ negative with a 98.74 percent confidence.

The lecturer is fucking terrible. With his I-dont-care attitude towards students. The worst lecturer so far.:➡ negative with a 98.73 percent confidence.

I hate this course plus the man:➡ negative with a 98.74 percent confidence.

the teaching method is so terrible :➡ negative with a 98.74 percent confidence.

positive experience:➡ positive with a 98.72 percent confidence.

Applied my math's knowledge from 200L for the most part of course:➡ negative with a 98.74 percent confidence.

I felt the course was a

## Saving models

In [26]:
# first model
classifier.save("sentiment_model1", save_format='tf')

# second model
classifier2.save("sentiment_model2", save_format='tf')

In [27]:
# first model
classifier.save("keras1", save_format='keras')

# second model
classifier2.save("keras2", save_format='keras')

## Download saved models

In [30]:
directory_to_zip = "/kaggle/working/keras1"
output_zip_file = "/kaggle/working/keras1.zip"

# Create a Zip file
with zipfile.ZipFile(output_zip_file, 'w') as zipf:
    for root, dirs, files in os.walk(directory_to_zip):
        for file in files:
            zipf.write(os.path.join(root, file))
            
print(f"Zip file created: {output_zip_file}")

Zip file created: /kaggle/working/keras1.zip


In [31]:
import zipfile
import os

directory_to_zip = "/kaggle/working/keras2"
output_zip_file = "/kaggle/working/keras2.zip"

# Create a Zip file
with zipfile.ZipFile(output_zip_file, 'w') as zipf:
    for root, dirs, files in os.walk(directory_to_zip):
        for file in files:
            zipf.write(os.path.join(root, file))
            
print(f"Zip file created: {output_zip_file}")

Zip file created: /kaggle/working/keras2.zip
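
The two zip cells above are identical apart from their paths, and `zipf.write(path)` stores each file under its full path inside the archive. A possible refactoring is sketched below: a reusable helper that stores paths relative to the parent of the zipped directory and enables compression. `zip_directory` is a hypothetical name, and the demo runs on a throwaway temporary directory rather than the real model folders.

```python
import os
import tempfile
import zipfile


def zip_directory(directory, output_zip):
    """Zip `directory`, storing paths relative to its parent folder."""
    base = os.path.dirname(os.path.abspath(directory))
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full_path = os.path.join(root, name)
                zipf.write(full_path, arcname=os.path.relpath(full_path, base))
    return output_zip


# Demo on a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "keras1")
    os.makedirs(os.path.join(model_dir, "variables"))
    with open(os.path.join(model_dir, "variables", "weights.bin"), "w") as fh:
        fh.write("placeholder")
    archive = zip_directory(model_dir, os.path.join(tmp, "keras1.zip"))
    with zipfile.ZipFile(archive) as zipf:
        names = zipf.namelist()
    print(names)
```

In the notebook this would be called once per model, for example `zip_directory("/kaggle/working/keras1", "/kaggle/working/keras1.zip")`.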
