## Text Classification With BERT and KerasNLP

Now that the sentiment analysis model built with the XGBoost algorithm is done, I will build a text classification model with BERT, a popular masked language model that is bidirectional (it has access to the words on both the left and the right of each token). I will also use KerasNLP, which provides a simple Keras API for training and fine-tuning NLP models, to classify the sentiments.

In [1]:
# import the required libraries

import pandas as pd
import numpy as np
import re
import string
import tensorflow as tf
from tensorflow import keras
import keras_nlp
from transformers import BertTokenizer, TFBertForSequenceClassification
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

Using TensorFlow backend


In [2]:
# load the exported data
df1 = pd.read_csv('tweets2.csv')
df1.head()

Unnamed: 0,tweet,time,year,month,day,day_name,time_of_tweet,hour_of_the_day,processed_tweet,sentiments,character_count,word_count
0,We celebrate our dynamic GMD/CEO; Dr. Ebenezer...,2023-06-07 07:28:33,2023,June,7,Wednesday,07:28:33,7,celebrate dynamic gmdceo dr ebenezer onyeagwu ...,positive,91,12
1,"If you believe it, you will get it.",2023-06-07 06:25:39,2023,June,7,Wednesday,06:25:39,6,believe get,negative,11,2
2,"If you believe it, you will get it.",2023-06-07 06:17:12,2023,June,7,Wednesday,06:17:12,6,believe get,negative,11,2
3,"If you believe it, you will get it.",2023-06-07 05:04:10,2023,June,7,Wednesday,05:04:10,5,believe get,negative,11,2
4,"If you believe it, you will get it.\n\nSimply ...",2023-06-07 05:02:33,2023,June,7,Wednesday,05:02:33,5,believe get simply visit information,negative,36,5


In [3]:
# encode the target labels
df1['sentiments'] = df1['sentiments'].replace({
    'negative': 0,
    'positive': 1
})
df1['sentiments'].value_counts()

0    5495
1    4045
Name: sentiments, dtype: int64
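
The counts above show a moderate class imbalance, which is worth keeping in mind when reading the accuracy reported later. As a rough sketch, the share of each class and the accuracy of always predicting the majority class can be worked out directly from those two numbers (the variable names here are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {0: 5495, 1: 4045}  # 0 = negative, 1 = positive

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = max(shares.values())

print(total)
print(f"{majority_baseline:.3f}")
```

Any trained classifier should be compared against this baseline rather than against 50 percent.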

In [4]:
X = df1['tweet']
y = df1['sentiments']

In [5]:
# Preprocess the tweet column with regular expressions and NLTK
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r'\b[0-9]+\b\s*', '', text)
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

X_preprocessed = [preprocess_text(text) for text in X]

# Split the preprocessed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.25)
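
To make the effect of the regular expressions in `preprocess_text` concrete, here is a minimal sketch that applies only the lowercasing, regex, and punctuation steps to a made-up tweet. It leaves out the NLTK tokenization, stop-word removal, and lemmatization, and collapses whitespace with `split()` instead. The function name `strip_noise` and the sample text are assumptions for illustration, not part of the original notebook.

```python
import re
import string

def strip_noise(text: str) -> str:
    """Apply the lowercase, URL/mention/hashtag, punctuation, and digit steps."""
    text = text.lower()
    # Drop URLs, @mentions, and #hashtags (the hashtag word is removed too)
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    # Drop every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Drop standalone numbers together with any whitespace that follows them
    text = re.sub(r"\b[0-9]+\b\s*", "", text)
    # Drop leftover punctuation such as underscores, which \w keeps
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

sample = "@gtbank your app crashed 3 times today!!! See https://example.com #fail"
cleaned = strip_noise(sample)
print(cleaned)
```

Note that the mention, the link, the hashtag, and the number all disappear, so any sentiment they carried is no longer visible to the model.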

In [6]:
# Convert labels to one-hot encoded format
y_train = tf.keras.utils.to_categorical(y_train, num_classes=2, dtype='float32')
y_test = tf.keras.utils.to_categorical(y_test, num_classes=2, dtype='float32')
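
For reference, the two-class one-hot encoding produced by `to_categorical` can be reproduced with plain NumPy. This is a minimal sketch on a hand-written label array rather than the notebook's data:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])  # 0 = negative, 1 = positive

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype="float32")[labels]
print(one_hot)
```

Each row has a 1 in the column of its class, so a negative tweet becomes `[1, 0]` and a positive tweet becomes `[0, 1]`.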

In [10]:
# load the pretrained BERT model that has been finetuned for sentiment analysis

model_name = "bert_tiny_en_uncased_sst2"
classifier = keras_nlp.models.BertClassifier.from_preset(
    model_name,
    num_classes=2,
    load_weights=True,
    activation='sigmoid'  # each of the two outputs is scored independently in [0, 1]
)

The next step is to compile and train the model. The aim here is to use the pre-trained model and finetune it on the dataset.

In [11]:
# Freeze the pretrained backbone so that only the classification head is trained.
# Setting `trainable` before compiling ensures the change is picked up.
classifier.backbone.trainable = False
classifier.compile(
    loss=keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(),
    jit_compile=True,
    metrics=["accuracy"],
)
# Fit on the training data (a single epoch, since `epochs` is left at its default).
classifier.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), batch_size=32)



<keras.callbacks.History at 0x24dd4018c40>
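
The setup used here (two sigmoid outputs, one-hot targets, and `BinaryCrossentropy`) treats each class as an independent yes/no decision, so the two scores for a tweet need not sum to 1. The sketch below recomputes that kind of loss for a single example with NumPy. The logits are invented for illustration, and the helper functions are a simplified stand-in following the textbook definition, not Keras internals.

```python
import numpy as np

def sigmoid(z):
    """Map raw logits to independent scores in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the last axis."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    per_class = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_class.mean(axis=-1)

logits = np.array([[-0.5, 1.5]])  # hypothetical model outputs for one tweet
y_true = np.array([[0.0, 1.0]])   # one-hot target: positive

probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)

print(probs, probs.sum())
print(loss)
```

A softmax output with a categorical cross-entropy loss would be the alternative if scores that sum to 1 are needed.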

In [12]:
# evaluate the model on the testing data
classifier.evaluate(X_test, y_test, batch_size=32)



[0.3683411180973053, 0.8360586762428284]

In [14]:
# checking the model to see performance on new samples
sentiment_categories = ["negative", "positive"]
scores = classifier.predict([preprocess_text("Nigerian Banks are doing pretty okay but need to do better with their awful customer service!")])
print(scores)
print(f"{sentiment_categories[np.argmax(scores)]} with a { (100 * np.max(scores)).round(2) } percent confidence.")

[[0.30848458 0.69052   ]]
positive with a 69.05 percent confidence.
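
The decoding step can be checked by hand on the scores printed above. Because the outputs come from a sigmoid, the two values are independent scores rather than a probability distribution, so the "percent confidence" is best read as a relative score. The sketch below repeats the argmax decoding and, as an illustrative option that is not part of the notebook, rescales the pair so it sums to 1.

```python
import numpy as np

sentiment_categories = ["negative", "positive"]
scores = np.array([[0.30848458, 0.69052]])  # values copied from the output above

# Pick the class with the highest score, as in the cell above
label = sentiment_categories[int(np.argmax(scores[0]))]
raw_confidence = round(float(100 * scores[0].max()), 2)

# Optional: rescale the two scores so they sum to 1 (an assumption for illustration)
normalized = scores[0] / scores[0].sum()

print(label, raw_confidence)
print(normalized)
```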


In [21]:
# checking the model to see performance on new samples
sentiment_categories = ["negative", "positive"]
new_examples = [
    "@ZenithBank, what's up with your ATMs? 🏧🤷‍♀️ Half of them are out of cash, and the rest are always broken. Do you guys even maintain them?",
    "I swear @gtbank has the worst online banking platform! 😠📱 It's slow, clunky, and full of bugs. Time to find a better bank.",
    "How hard is it for @UBA to answer a simple email? 📧🤦‍♂️ Been waiting for days, and still no response. Way to treat your customers!",
    "Dear @FidelityBankPLC, your interest rates are a joke! 💤💤 Might as well keep my money under the mattress.",
    "Just had the best experience at First Bank 🎉 Love their friendly staff and quick service! 💯",
    "Ugh, seriously @ZenithBank? 🙄 Been waiting in line for ages, and no one seems to care. Time to switch banks, I guess. 😒",
    "Shoutout to @FidelityBankPLC 🙌 Just got my savings interest, and it's way better than I expected! 💰",
    "Naija banks, step up your game! 🚀 We need more innovative products and better customer support!",
    "Make una no go vex perosn with this early morning poor service all this banks ooo!",
    "Zenith bank, abeg make una allow this money drop or revise it. Abeg, the money is in need for urgent medical attention",
    "Awon Bank yi ti ya werey sha",
    "Why am I receiving pos debit for February and March over a declined transaction?? Is the bank robbing me @gtbank_help",
    "Okay, First Bank na better bank"
]

scores = classifier.predict([preprocess_text(example) for example in new_examples])

for i, score in enumerate(scores):
    print(f"{new_examples[i]}: {sentiment_categories[np.argmax(score)]} with a { (100 * np.max(score)).round(2) } percent confidence.")
    print()

@ZenithBank, what's up with your ATMs? 🏧🤷‍♀️ Half of them are out of cash, and the rest are always broken. Do you guys even maintain them?: negative with a 96.76 percent confidence.

I swear @gtbank has the worst online banking platform! 😠📱 It's slow, clunky, and full of bugs. Time to find a better bank.: positive with a 52.71 percent confidence.

How hard is it for @UBA to answer a simple email? 📧🤦‍♂️ Been waiting for days, and still no response. Way to treat your customers!: negative with a 96.69 percent confidence.

Dear @FidelityBankPLC, your interest rates are a joke! 💤💤 Might as well keep my money under the mattress.: negative with a 83.24 percent confidence.

Just had the best experience at First Bank 🎉 Love their friendly staff and quick service! 💯: positive with a 95.99 percent confidence.

Ugh, seriously @ZenithBank? 🙄 Been waiting in line for ages, and no one seems to care. Time to switch banks, I guess. 😒: negative with a 95.63 percent confidence.

Shoutout to @FidelityBank