# Fine-tuning BERT with TensorFlow for Text Classification

In this notebook, we'll fine-tune a BERT model with TensorFlow to classify tweets into different classes of cyberbullying. The data comes from Kaggle:

https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification

In [1]:
# !pip install neattext

## Import libraries.

In [2]:
import spacy
import re
import nltk
import string
import sklearn
import neattext as nt
import neattext.functions as nfx
import pandas as pd 
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences

from transformers import AutoTokenizer, TFBertModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
bert = TFBertModel.from_pretrained('bert-base-cased')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Surface\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Read data with pandas

In [4]:
data = pd.read_csv('cyberbullying_tweets.csv')

### Inspecting the data.

In [5]:
data.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


In [6]:
data.groupby('cyberbullying_type').describe()

cyberbullying_type,tweet_text count,tweet_text unique,tweet_text top,tweet_text freq
age,7992,7992,The girls who bullied me in middle and high sc...,1
ethnicity,7961,7959,Racism won't stop as long as u stil select ur ...,2
gender,7973,7948,No offense. @NigelBigMeech I'm not sexist but ...,2
not_cyberbullying,7945,7937,Our pancakes are selling like hotcakes Shaz - ...,2
other_cyberbullying,7823,7823,@KirinDave I love it. I'm invested in it. My f...,1
religion,7998,7997,A Pakistani court has sentenced 86 members of ...,2


In [7]:
data['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64

In [8]:
data['cyberbullying_type'].nunique()

6

In [9]:
data['cyberbullying_type'].count()

47692

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47692 entries, 0 to 47691
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_text          47692 non-null  object
 1   cyberbullying_type  47692 non-null  object
dtypes: object(2)
memory usage: 745.3+ KB


In [11]:
data.shape

(47692, 2)

### The rows are grouped by cyberbullying type, so we need to shuffle them.

In [12]:
data.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


In [13]:
data.tail()

Unnamed: 0,tweet_text,cyberbullying_type
47687,"Black ppl aren't expected to do anything, depe...",ethnicity
47688,Turner did not withhold his disappointment. Tu...,ethnicity
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity
47691,Bro. U gotta chill RT @CHILLShrammy: Dog FUCK ...,ethnicity


In [14]:
data.isnull().sum()

tweet_text            0
cyberbullying_type    0
dtype: int64

In [15]:
data.drop_duplicates(inplace=True)

In [16]:
data.shape

(47656, 2)

### Shuffle data and split it into training and test sets.

In [17]:
data_train, data_test = train_test_split(data, test_size = 0.3, random_state = 42, shuffle = True, stratify = data.cyberbullying_type)
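The idea behind `stratify` can be sketched in plain Python: split each class separately so that both parts keep roughly the same class proportions. The helper `stratified_split` and the toy labels below are hypothetical and only illustrate the technique; this is not how scikit-learn implements `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so that each class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # shuffle within the class
        n_test = round(len(idxs) * test_size)  # same test fraction for every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    rng.shuffle(train_idx)                     # mix the classes again
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Toy labels (illustrative only, not the real dataset)
labels = ['age'] * 100 + ['gender'] * 50 + ['religion'] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Each class contributes 30% of its rows to the test part, so the class balance of the full data carries over to both parts.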

## Clean data using the neattext library.

### Remove hashtags, multiple spaces and user-handles.

In [18]:
data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_hashtags)
data_train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_hashtags)


Unnamed: 0,tweet_text,cyberbullying_type
37246,when cis ppl spread hate & misinfo about trans...,age
12581,@Thats_So_Lauren they like making over the top...,gender
29310,"So, just to save you all some time, BSD is my ...",other_cyberbullying
44701,Fuck I fucking hate dumb Hawaiians with nigger...,ethnicity
30210,RT @grbradbury: @davidlipson @Nick_Xenophon ca...,other_cyberbullying


In [19]:
# data_train['tweet_text'] = data_train['tweet_text'].apply(lambda x: nfx.remove_custom_pattern(x, term_pattern=r'&#\$ '))

data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_userhandles)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_userhandles)


In [20]:
data_train.tail()

Unnamed: 0,tweet_text,cyberbullying_type
44713,Ur first one that is so apt!,ethnicity
31973,Art was something I turned to when I was raped...,age
46332,I disagree with your reading of it. Why would ...,ethnicity
14085,she was hating on jlaw for things she did when...,gender
43904,first impression: very blunt and straightforwa...,ethnicity


In [22]:
data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_multiple_spaces)
data_train.head()

Unnamed: 0,tweet_text,cyberbullying_type
37246,when cis ppl spread hate & misinfo about trans...,age
12581,they like making over the top gay jokes invol...,gender
29310,"So, just to save you all some time, BSD is my ...",other_cyberbullying
44701,Fuck I fucking hate dumb Hawaiians with nigger...,ethnicity
30210,RT can you please call him an evil capitalist ...,other_cyberbullying
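To make the three cleaning steps concrete, here is a rough approximation using only the standard `re` module. The patterns and the function `collapse_spaces` are assumptions chosen for illustration; neattext's own implementations may use different patterns and handle edge cases differently.

```python
import re

def remove_hashtags(text):
    # Drop '#' followed by word characters (assumed pattern)
    return re.sub(r'#\w+', '', text)

def remove_userhandles(text):
    # Drop '@' followed by word characters (assumed pattern)
    return re.sub(r'@\w+', '', text)

def collapse_spaces(text):
    # Replace runs of whitespace with a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

tweet = "@Jason_Gio meh.   thanks for the heads up #MKR #theblock"
cleaned = collapse_spaces(remove_userhandles(remove_hashtags(tweet)))
print(cleaned)
```

Collapsing spaces last is convenient, because removing hashtags and handles leaves gaps behind.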


### We'll use only a subset of the training data because we are running on a CPU.

In [21]:
data_train  = data_train[:12000]
data_train.shape

(12000, 2)

In [23]:
data_train.tail()

Unnamed: 0,tweet_text,cyberbullying_type
37779,OMG the peeps are such bullies to Jamie hiding...,age
18801,Hmm the whole world can see it .Terrorism is a...,religion
4835,Drasko and Steve need to hate-fuck each other ...,not_cyberbullying
35922,This woman on Love It it List It talks just li...,age
27149,The author also fails to understand that nearl...,other_cyberbullying


In [24]:
data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_stopwords)
data_train.head()

Unnamed: 0,tweet_text,cyberbullying_type
37246,cis ppl spread hate & misinfo trans ppl though...,age
12581,like making gay jokes involving rape...not fun...,gender
29310,"So, save time, BSD family. negative response G...",other_cyberbullying
44701,Fuck fucking hate dumb Hawaiians nigger lips!,ethnicity
30210,RT evil capitalist sellout cunt https://t.co/N...,other_cyberbullying


In [26]:
data_train['tweet_text'] = data_train['tweet_text'].apply(nfx.remove_urls)
data_train.head()

Unnamed: 0,tweet_text,cyberbullying_type
37246,cis ppl spread hate & misinfo trans ppl though...,age
12581,like making gay jokes involving rape...not fun...,gender
29310,"So, save time, BSD family. negative response G...",other_cyberbullying
44701,Fuck fucking hate dumb Hawaiians nigger lips!,ethnicity
30210,RT evil capitalist sellout cunt,other_cyberbullying


### Check the different classes in our target variable for the train and test sets.

In [29]:
data_train['cyberbullying_type'].unique()

array(['age', 'gender', 'other_cyberbullying', 'ethnicity',
       'not_cyberbullying', 'religion'], dtype=object)

In [30]:
data_test['cyberbullying_type'].unique()

array(['age', 'other_cyberbullying', 'religion', 'gender', 'ethnicity',
       'not_cyberbullying'], dtype=object)

In [31]:
# We'll take only a portion of the test set also.

data_test = data_test[:500]
data_test.shape

(500, 2)

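### Encoding the target variable with the scikit-learn LabelEncoder. We fit the encoder on the training set only and reuse it to transform the test set, so both sets share the same label mapping and no information from the test set is used during fitting.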

In [32]:
label_enc = LabelEncoder() 


In [33]:
data_train['cyberbullying_type'] = label_enc.fit_transform(data_train['cyberbullying_type'])
data_train.head()

Unnamed: 0,tweet_text,cyberbullying_type
37246,cis ppl spread hate & misinfo trans ppl though...,0
12581,like making gay jokes involving rape...not fun...,2
29310,"So, save time, BSD family. negative response G...",4
44701,Fuck fucking hate dumb Hawaiians nigger lips!,1
30210,RT evil capitalist sellout cunt,4


In [34]:
data_test['cyberbullying_type'] = label_enc.transform(data_test['cyberbullying_type'])


In [35]:
data_test['cyberbullying_type'].unique()

array([0, 4, 5, 2, 1, 3])

In [36]:
data_train.dtypes

tweet_text            object
cyberbullying_type     int32
dtype: object

In [37]:
data_train['cyberbullying_type'].unique()

array([0, 2, 4, 1, 3, 5])

In [38]:
data_train['cyberbullying_type'].value_counts()

1    2070
3    2021
5    2015
2    1996
0    1953
4    1945
Name: cyberbullying_type, dtype: int64

#### LabelEncoder assigns integer codes to the classes in alphabetical order.

In [39]:
# data_train['cyberbullying_type'].value_counts()


# '''
# Cyberbullying labels are

# 0 - age
# 1 - ethnicity
# 2 - gender
# 3 - not_cyberbullying
# 4 - other_cyberbullying
# 5 - religion
# '''
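The alphabetical mapping listed above can be reproduced with a small sketch in plain Python. It mimics the effect of fitting on the training labels and then transforming the test labels; the toy label lists and the names `label_to_id` and `id_to_label` are illustrative. With the fitted encoder itself, `label_enc.classes_` and `label_enc.inverse_transform` give the same information.

```python
# Toy label lists (illustrative only)
train_labels = ['age', 'gender', 'other_cyberbullying', 'ethnicity',
                'not_cyberbullying', 'religion', 'age']
test_labels = ['religion', 'age']

# "Fit": collect the unique classes and sort them
classes = sorted(set(train_labels))
label_to_id = {name: i for i, name in enumerate(classes)}
id_to_label = {i: name for name, i in label_to_id.items()}

# "Transform": reuse the training mapping for the test labels
encoded_test = [label_to_id[name] for name in test_labels]

print(label_to_id)
print(encoded_test)
```

Reusing one mapping for both sets matters: fitting a second encoder on the test labels could assign different integers if a class were missing from the subset.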

### Tokenize the training text with the BERT AutoTokenizer that we loaded earlier.

In [40]:
x_train = tokenizer(
#     text = x_train.tolist(),
    text = data_train['tweet_text'].tolist(),
    add_special_tokens = True,
    max_length = 100,
    truncation = True,
    padding = 'max_length',  # pad every tweet to max_length so the shape always matches the model's input
    return_tensors = 'tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True
)

In [41]:
x_train['input_ids']

<tf.Tensor: shape=(12000, 100), dtype=int32, numpy=
array([[  101,   172,  1548, ...,     0,     0,     0],
       [  101,  1176,  1543, ...,     0,     0,     0],
       [  101,  1573,   117, ...,     0,     0,     0],
       ...,
       [  101,  1987, 25611, ...,     0,     0,     0],
       [  101,  1590,  2185, ...,     0,     0,     0],
       [  101,  2351, 12169, ...,     0,     0,     0]])>
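The output above starts every row with 101 (the `[CLS]` token) and ends with zeros (the `[PAD]` token). The sketch below shows, in plain Python, what padding and truncation to a fixed length do and how the attention mask is derived. The function `pad_or_truncate` is hypothetical and simplified: the real tokenizer also keeps the closing `[SEP]` token (id 102) when it truncates. Note that `padding=True` pads only to the longest sequence in the batch, while `padding='max_length'` always pads to `max_length`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]          # truncation (simplified: plain cut)
    mask = [1] * len(ids)                 # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

short = [101, 172, 1548, 102]             # [CLS] ... [SEP]
long = list(range(101, 113))              # 12 made-up ids, longer than max_length

ids_s, mask_s = pad_or_truncate(short, max_length=8)
ids_l, mask_l = pad_or_truncate(long, max_length=8)

print(ids_s, mask_s)
print(ids_l, mask_l)
```

The attention mask lets BERT ignore the padded positions, which is why it is passed to the model alongside `input_ids`.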

### Building and fine-tuning the model.

The BERT model returns the last hidden state at index 0 and the pooler output at index 1.
We need only the last hidden state, on top of which we add more layers and fine-tune the model.
We'll use the Keras functional API.

In [42]:
max_len = 100


input_ids = Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
input_mask = Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')

# 0 is the last hidden state, 1 means pooler_output
# We need only the hidden state, so that we can add more layers and fine-tune the model.
# We'll use functional API
embeddings = bert(input_ids, attention_mask=input_mask)[0]
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(32, activation='relu')(out)

y = Dense(6, activation='sigmoid')(out)

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
model.layers[2].trainable = True
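The pooling step can be illustrated with NumPy. The last hidden state has shape (batch, sequence length, hidden size), and global max pooling keeps, for each hidden feature, the largest value over all token positions. The small shapes below are made up for readability; in this notebook the shapes are (batch, 100, 768) before pooling and (batch, 768) after.

```python
import numpy as np

batch, seq_len, hidden = 2, 4, 3          # toy sizes, not the real ones
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))

# Max over the token axis: one vector per example
pooled = last_hidden_state.max(axis=1)

print(last_hidden_state.shape)
print(pooled.shape)
```

The dense layers that follow therefore see one fixed-size vector per tweet, regardless of how many tokens it contains.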

### Compile the model.

In [43]:
# BERT is usually fine-tuned with a small learning rate; values from about 2e-5 to 5e-5 are commonly recommended.

optimizer = Adam(
    learning_rate=5e-05,
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0
)

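# Caution: from_logits=True tells Keras that the model outputs raw logits, but the output
# layer above applies a sigmoid. For single-label classification the usual pairing is a
# softmax output with from_logits=False, or no activation with from_logits=True.
# The results recorded in this notebook were produced with the settings shown here.
loss = CategoricalCrossentropy(from_logits=True)
# Despite its name, this metric is plain categorical accuracy, not class-balanced accuracy.
metric = CategoricalAccuracy('balanced_accuracy')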

model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metric)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 100)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 100)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 100,                                           
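Why the pairing of output activation and `from_logits` matters can be shown with a few lines of arithmetic. The sketch below compares cross-entropy computed on raw logits with cross-entropy computed after the same values have first been squashed by a sigmoid and are then treated as logits. The numbers are made up, and how Keras actually handles such a mismatch may depend on the version, so this is only an illustration of the principle.

```python
import math

def softmax(logits):
    m = max(logits)                                   # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(probs, true_index):
    return -math.log(probs[true_index])

logits = [4.0, -1.0, 0.0, 1.0, 0.0, -0.5]             # made-up scores for 6 classes
true_index = 0

# Raw logits: softmax inside the loss turns them into probabilities
loss_from_logits = cross_entropy(softmax(logits), true_index)

# Sigmoid outputs treated as logits: all values lie in (0, 1), so softmax
# can never become confident and the loss stays above a fixed floor
squashed = [sigmoid(z) for z in logits]
loss_sigmoid_as_logits = cross_entropy(softmax(squashed), true_index)

floor = math.log(1 + 5 / math.e)                      # lowest reachable loss with 6 classes
print(loss_from_logits, loss_sigmoid_as_logits, floor)
```

With six classes, the second loss cannot drop below about 1.04 no matter how good the prediction is, which weakens the training signal.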

#### Tokenize test data.

In [44]:
x_test = tokenizer(
    text = data_test['tweet_text'].tolist(),
    add_special_tokens = True,
    max_length = 100,
    truncation = True,
    padding = 'max_length',  # always pad to max_length so the shape matches the model's (100,) inputs
    return_tensors = 'tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True
)

### Train the model.

In [45]:
bert_train = model.fit(
    x={'input_ids':x_train['input_ids'], 'attention_mask':x_train['attention_mask']},
    y=to_categorical(data_train.cyberbullying_type),
    validation_data=(
        {'input_ids':x_test['input_ids'], 'attention_mask':x_test['attention_mask']},to_categorical(data_test.cyberbullying_type)
    ),
    epochs=1,
    batch_size=36
)

  return dispatch_target(*args, **kwargs)

### We achieved a training accuracy of 75% and a validation accuracy of 83.6%.

The metric is named balanced_accuracy in the code, but it is the Keras CategoricalAccuracy metric, that is, plain accuracy. We trained for only one epoch on a 12,000-tweet subset, so this accuracy is reasonable.


In [46]:
# save_weights stores only the weights, not the architecture.
# To reuse the model later, rebuild the same architecture (the model-building cell above) and then load the weights.


model.save_weights('model_cyber.h5')

# To use the model again, load the weights into the rebuilt model:
# model.load_weights('model_cyber.h5')

In [47]:
pred_raw = model.predict({'input_ids':x_test['input_ids'], 'attention_mask':x_test['attention_mask']})



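We want to check the prediction for the first input in the test set. The model returns one score per class; because the output layer uses a sigmoid, the scores are independent values between 0 and 1 and need not sum to 1. We'll use np.argmax to get the index of the highest score.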

In [48]:
pred_raw[0]

array([0.99881345, 0.14684954, 0.49971724, 0.72339106, 0.4689358 ,
       0.3500493 ], dtype=float32)

In [49]:
y_pred = np.argmax(pred_raw, axis=1)
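As a small worked example, the scores printed above for the first test tweet can be mapped back to a class name. The list `class_names` is written out by hand here in the alphabetical order used by the label encoder; it is an illustrative helper, not a variable from the notebook.

```python
import numpy as np

# Scores for the first test tweet, copied from the output above
pred_first = np.array([0.99881345, 0.14684954, 0.49971724,
                       0.72339106, 0.4689358, 0.3500493])

# Class names in the encoder's alphabetical order (written out by hand)
class_names = ['age', 'ethnicity', 'gender',
               'not_cyberbullying', 'other_cyberbullying', 'religion']

best = int(np.argmax(pred_first))       # index of the highest score
print(best, class_names[best])          # 0 age
```

For the full prediction matrix, `axis=1` applies the same operation to every row, giving one predicted class per tweet.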

In [50]:
data_test.cyberbullying_type 

31785    0
30876    4
17444    5
9823     2
45433    1
        ..
27093    4
14133    2
36007    0
26678    4
27029    4
Name: cyberbullying_type, Length: 500, dtype: int32

#### Checking the classification report.

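Overall accuracy is 0.84. The age (0), ethnicity (1) and religion (5) classes reach F1 scores of 0.94 or higher, while not_cyberbullying (3) and other_cyberbullying (4) are the weakest at 0.53 and 0.69.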

In [52]:
print(classification_report(data_test.cyberbullying_type, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        88
           1       0.98      0.96      0.97        85
           2       0.89      0.85      0.87        78
           3       0.65      0.45      0.53        74
           4       0.62      0.78      0.69        98
           5       0.90      0.99      0.94        77

    accuracy                           0.84       500
   macro avg       0.84      0.83      0.83       500
weighted avg       0.84      0.84      0.83       500
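The F1 column and the macro average can be recomputed from the precision and recall values in the report. F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The inputs below are the rounded values from the report, so in general the last digit could differ slightly from what scikit-learn prints.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the report above
per_class = {
    0: (1.00, 0.97),
    1: (0.98, 0.96),
    2: (0.89, 0.85),
    3: (0.65, 0.45),
    4: (0.62, 0.78),
    5: (0.90, 0.99),
}

f1_scores = {c: f1(p, r) for c, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for c, score in f1_scores.items():
    print(c, round(score, 2))
print('macro avg', round(macro_f1, 2))
```

The low recall of class 3 (0.45) pulls its F1 down to 0.53, which is the main reason the macro average sits at 0.83 despite four classes scoring 0.87 or higher.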

