# Fine-Tuning BERT for Multiclass Text Classification

## Model - 'bert-base-cased'


## Dataset Link - https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv

## First, What Is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name summarizes the design: a stack of Transformer encoders that represents each token using the context on both its left and its right.

The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

### There are two standard BERT model sizes:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.
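
The parameter counts follow from the layer sizes. The sketch below is a rough tally for the encoder, not code from the original notebook. It assumes the standard configuration values that are not stated above: a feed-forward (intermediate) size of 3072, 512 position embeddings, 2 segment types, and the 28,996-entry vocabulary of `bert-base-cased`. The function name is hypothetical.

```python
def bert_encoder_params(vocab_size, hidden, layers, intermediate,
                        max_positions=512, type_vocab=2):
    """Count the weights in a BERT encoder: embeddings + layers + pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embedding tables, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Expand to the intermediate size, project back, then LayerNorm
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden
                    + layer_norm)
    # Dense layer applied to the [CLS] vector
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

# Assumed bert-base-cased configuration values
base_cased = bert_encoder_params(vocab_size=28996, hidden=768, layers=12, intermediate=3072)
print(f"{base_cased:,}")  # 108,310,272
```

The result is consistent with the roughly 110M parameters quoted for BERT base. The number of attention heads does not appear in the formula, because the heads split the same projection matrices rather than adding new ones.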



### BERT Input and Output

The BERT model expects a sequence of tokens (words or word pieces) as input. In each sequence, BERT expects two special tokens:

- [CLS]: The first token of every sequence. It stands for "classification".
- [SEP]: The separator token, which tells BERT which tokens belong to which sequence. It matters mainly for two-sequence tasks such as next sentence prediction or question answering. If we have only one sequence, this token is appended to the end of it.


It is also important to note that BERT accepts at most 512 tokens per sequence. If a sequence has fewer than 512 tokens, we pad the unused slots with the [PAD] token. If it has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
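
To make the input format concrete, here is a minimal sketch of what the special tokens, truncation, and padding do to a list of token IDs. It is an illustration rather than the tokenizer's real implementation. The IDs 101, 102, and 0 for [CLS], [SEP], and [PAD] are assumed from the `bert-base-cased` vocabulary, and `build_inputs` is a hypothetical helper.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed IDs from the bert-base-cased vocabulary

def build_inputs(token_ids, max_length):
    """Wrap token IDs in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]       # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding          # 0 marks padding
    return input_ids, attention_mask

short_ids, short_mask = build_inputs([189, 1964, 2174], max_length=8)
long_ids, long_mask = build_inputs(list(range(1000, 1020)), max_length=8)
print(short_ids)   # [101, 189, 1964, 2174, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the input IDs.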

The BERT base model then outputs an embedding vector of size 768 for each token. We can use these vectors as input for different kinds of NLP applications, such as text classification, next sentence prediction, Named-Entity Recognition (NER), or question answering.


------------

**For a text classification task**, we focus our attention on the embedding vector output for the special [CLS] token. This means that we’re going to use the embedding vector of size 768 from the [CLS] token as the input to our classifier, which then outputs a vector whose size equals the number of classes in our classification task (five for this dataset).
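
As a rough illustration of that last step, the sketch below maps a 768-dimensional vector to five class probabilities with one linear layer and a softmax. The vector and weights are random stand-ins (assumptions for illustration only), and the head is simpler than the two dense layers used in the model later in this notebook.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, biases):
    """Linear layer (hidden_size -> num_classes) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

random.seed(0)
hidden_size, num_classes = 768, 5
# Random stand-in for the [CLS] embedding that BERT would produce
cls_vector = [random.gauss(0, 1) for _ in range(hidden_size)]
# Random stand-in for classifier weights that training would learn
weights = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classify(cls_vector, weights, biases)
print(len(probs), round(sum(probs), 6))
```

The predicted class is simply the index of the largest probability.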

-----------------------



In [1]:
!pip install transformers



In [2]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import tensorflow as tf
from transformers import BertTokenizer

In [4]:
df=pd.read_csv("/content/sample_data/bbc-text.csv")
df.head(5)

        category                                               text
0           tech  tv future in the hands of viewers with home th...
1       business  worldcom boss left books alone former worldc...
2          sport  tigers wary of farrell gamble leicester say ...
3          sport  yeading face newcastle in fa cup premiership s...
4  entertainment  ocean s twelve raids box office ocean s twelve...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 909 entries, 0 to 908
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  909 non-null    object
 1   text      909 non-null    object
dtypes: object(2)
memory usage: 14.3+ KB


In [6]:
df.groupby('category').describe()

               text
              count unique                                                top freq
category
business        210    205  sec to rethink post-enron rules the us stock m...    2
entertainment   157    154  prince crowned top music earner prince earne...    2
politics        168    164  brown outlines third term vision gordon brown ...    2
sport           214    213  hantuchova in dubai last eight daniela hantuch...    2
tech            160    156  microsoft gets the blogging bug software giant...    2


In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [8]:
token = tokenizer.encode_plus(
    df['text'].iloc[0],
    max_length=256,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_tensors='tf'
)

In [9]:
token.input_ids

<tf.Tensor: shape=(1, 256), dtype=int32, numpy=
array([[  101,   189,  1964,  2174,  1107,  1103,  1493,  1104,  6827,
         1114,  1313,  4041,  2344, 13441,  1344,   118,  5754,   189,
         1964,  1116,  1105,  3539,  1888, 18898,  1116,  2232,  1154,
         1103,  1690,  1395,  1103,  1236,  1234,  2824,   189,  1964,
         1209,  1129,  8276,  1193,  1472,  1107,  1421,  1201,  1159,
          119,  1115,  1110,  2452,  1106,  1126,  6640,  5962,  1134,
         5260,  1120,  1103,  2683,  8440, 11216,  1437,  1107, 17496,
         1396, 11305,  1106,  6265,  1293,  1292,  1207,  7951,  1209,
         3772,  1141,  1104,  1412,  9122,  1763, 15370,   119,  1114,
         1103,  1366,  2020,  1103, 10209,  8473,  1105,  1168,  3438,
         1209,  1129,  4653,  1106,  6827,  2258,  1313,  6379,  1194,
         6095,  5989, 21359,  1513,  8178,  1116,  2557,  1105, 26577,
         1555, 12263,  1106,  1524,  4045,  1105, 15139,  5197,   119,
         1141,  1104,  1103, 

In [10]:
X_input_ids = np.zeros((len(df), 256))
X_attn_masks = np.zeros((len(df), 256))

# Histogram of the count of text

In [11]:
df['count'] = df['text'].apply(lambda x: len(x.split()))
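
The word counts can be summarized as a histogram to judge how many texts are likely to exceed the 256-token limit used for tokenization. The sketch below is a plain-Python version that assumes nothing beyond a list of strings. `word_count_histogram` and the sample texts are hypothetical stand-ins for `df['text']`.

```python
from collections import Counter

def word_count_histogram(texts, bin_width=100):
    """Count texts per word-count bucket; keys are the lower bucket edges."""
    buckets = Counter((len(text.split()) // bin_width) * bin_width for text in texts)
    return dict(sorted(buckets.items()))

# Hypothetical stand-ins for df['text']
sample_texts = ["tv future in the hands of viewers", "word " * 150, "word " * 320, "word " * 390]

histogram = word_count_histogram(sample_texts)
for lower_edge, n in histogram.items():
    print(f"{lower_edge:>4}+ words: {'#' * n} ({n})")
```

Word counts underestimate token counts, because the tokenizer can split one word into several word pieces, so a text well under 256 words may still be truncated.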

In [12]:
def generate_training_data(df, ids, masks, tokenizer):
    """Tokenize every text in df and fill the preallocated ids/masks arrays row by row."""
    for i, text in tqdm(enumerate(df['text'])):
        # Pad or truncate each text to exactly 256 tokens, including [CLS] and [SEP]
        tokenized_text = tokenizer.encode_plus(
            text,
            max_length=256,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks

In [13]:
X_input_ids, X_attn_masks = generate_training_data(df, X_input_ids, X_attn_masks, tokenizer)

0it [00:00, ?it/s]

In [14]:
df['encoded_text'] = df['category'].astype('category').cat.codes
df['encoded_text']

0      4
1      0
2      3
3      3
4      1
      ..
904    0
905    1
906    3
907    2
908    4
Name: encoded_text, Length: 909, dtype: int8

In [15]:
labels = np.zeros((len(df), 5))
labels[np.arange(len(df)), df['encoded_text'].values] = 1
labels

array([[0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])
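
The two cells above first map each category name to an integer code and then turn that code into a one-hot row. The sketch below reproduces the idea without pandas or NumPy, assuming the codes are assigned in sorted (alphabetical) order of the category names. `one_hot_labels` is a hypothetical helper.

```python
def one_hot_labels(categories):
    """Map category names to one-hot rows, with codes assigned in sorted order."""
    classes = sorted(set(categories))
    code_of = {name: code for code, name in enumerate(classes)}
    rows = []
    for name in categories:
        row = [0.0] * len(classes)
        row[code_of[name]] = 1.0  # set the position of this sample's class
        rows.append(row)
    return classes, rows

classes, rows = one_hot_labels(['tech', 'business', 'sport', 'sport', 'entertainment', 'politics'])
print(classes)   # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(rows[0])   # [0.0, 0.0, 0.0, 0.0, 1.0]
```

The order of `classes` matters at prediction time: index 0 of the model output corresponds to business, index 4 to tech.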

In [16]:
# creating a data pipeline using tensorflow dataset utility, creates batches of data for easy loading...
dataset = tf.data.Dataset.from_tensor_slices((X_input_ids, X_attn_masks, labels))
dataset.take(1) # one sample data

<_TakeDataset element_spec=(TensorSpec(shape=(256,), dtype=tf.float64, name=None), TensorSpec(shape=(256,), dtype=tf.float64, name=None), TensorSpec(shape=(5,), dtype=tf.float64, name=None))>

In [17]:
def SentimentDatasetMapFunction(input_ids, attn_masks, labels):
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels

In [18]:
dataset = dataset.map(SentimentDatasetMapFunction) # converting to required format

In [19]:
dataset.take(1)

<_TakeDataset element_spec=({'input_ids': TensorSpec(shape=(256,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(256,), dtype=tf.float64, name=None)}, TensorSpec(shape=(5,), dtype=tf.float64, name=None))>

In [20]:
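dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(16, drop_remainder=True) # shuffle once so a later take/skip split stays fixed across epochs; batch size 16, drop the last partial batch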

In [21]:
dataset.take(1)

<_TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 256), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(16, 256), dtype=tf.float64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>

In [54]:
p = 0.8
train_size = int((len(df)//16)*p)

In [55]:
train_size

44
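
The value 44 is a number of batches, not a number of rows. The arithmetic can be checked with a small sketch, using the 909 rows, batch size of 16, and 0.8 fraction from this notebook. `split_batches` is a hypothetical helper.

```python
def split_batches(num_rows, batch_size, train_fraction):
    """Return (train_batches, val_batches) when the last partial batch is dropped."""
    total_batches = num_rows // batch_size
    train_batches = int(total_batches * train_fraction)
    return train_batches, total_batches - train_batches

train_batches, val_batches = split_batches(num_rows=909, batch_size=16, train_fraction=0.8)
print(train_batches, val_batches)            # 44 12
print(train_batches * 16, val_batches * 16)  # 704 192
```

So about 704 texts are used for training and 192 for validation, and the 13 rows left over after batching are dropped.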

# Train/Validation Split

In [56]:
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

# Model Definition

In [57]:
from transformers import TFBertModel

In [58]:
model = TFBertModel.from_pretrained('bert-base-cased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [59]:
# defining 2 input layers for input_ids and attn_masks
input_ids = tf.keras.layers.Input(shape=(256,), name='input_ids', dtype='int32')
attn_masks = tf.keras.layers.Input(shape=(256,), name='attention_mask', dtype='int32')

bert_embds = model.bert(input_ids, attention_mask=attn_masks)[1] # 0 -> last hidden state (3D, one vector per token), 1 -> pooled output (2D, derived from the [CLS] token)
intermediate_layer = tf.keras.layers.Dense(512, activation='relu', name='intermediate_layer')(bert_embds)
output_layer = tf.keras.layers.Dense(5, activation='softmax', name='output_layer')(intermediate_layer) # softmax -> calcs probs of classes

sentiment_model = tf.keras.Model(inputs=[input_ids, attn_masks], outputs=output_layer)
sentiment_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 256)]                0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 256)]                0         []                            
 )                                                                                                
                                                                                                  
 bert (TFBertMainLayer)      TFBaseModelOutputWithPooli   1083102   ['input_ids[0][0]',           
                             ngAndCrossAttentions(last_   72         'attention_mask[0][0]']      
                             hidden_state=(None, 256, 7                                     

In [60]:
optim = tf.keras.optimizers.legacy.Adam(learning_rate=1e-5, decay=1e-6)
loss_func = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')


In [61]:
sentiment_model.compile(optimizer=optim, loss=loss_func, metrics=[acc])
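
For intuition about the loss values reported during training, here is a small sketch of categorical cross-entropy for a single one-hot label. It is a simplified stand-in for the Keras loss, and the clipping constant `eps` is an assumption of this sketch.

```python
import math

def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Loss for one sample: -sum(y * log(p)), with p clipped away from 0."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

label = [0.0, 0.0, 0.0, 0.0, 1.0]            # one-hot row for 'tech'
uniform = [0.2] * 5                          # an undecided model
confident = [0.01, 0.01, 0.01, 0.01, 0.96]   # a mostly correct model

print(round(categorical_cross_entropy(label, uniform), 4))    # 1.6094
print(round(categorical_cross_entropy(label, confident), 4))  # 0.0408
```

With five classes, a loss near ln(5) ≈ 1.61 means the model is doing no better than guessing, which makes it a useful baseline when reading the first epoch's output.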

In [62]:
hist = sentiment_model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


# Saving & Loading the model

In [65]:
save_directory = "/saved_models"

sentiment_model.save(save_directory)



In [66]:
tokenizer.save_pretrained(save_directory)

('/saved_models/tokenizer_config.json',
 '/saved_models/special_tokens_map.json',
 '/saved_models/vocab.txt',
 '/saved_models/added_tokens.json')

# Loading the Fine-Tuned Model

In [67]:
sentiment_model = tf.keras.models.load_model(save_directory)

tokenizer = BertTokenizer.from_pretrained(save_directory)

def prepare_data(input_text, tokenizer):
    token = tokenizer.encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token.input_ids, tf.float64),
        'attention_mask': tf.cast(token.attention_mask, tf.float64)
    }

def make_prediction(model, processed_data, classes=('Business', 'Entertainment', 'Politics', 'Sport', 'Tech')):
    # classes must follow the alphabetical order of the category codes used to build the labels
    probs = model.predict(processed_data)[0]
    return classes[np.argmax(probs)]

In [68]:
input_text = input('Enter news text here: ')
processed_data = prepare_data(input_text, tokenizer)
result = make_prediction(sentiment_model, processed_data=processed_data)
print(f"Predicted Category: {result}")

Enter news text here: politics
Predicted Category: Sport


# Inference with PyTorch

In [70]:
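import torch

from transformers import BertForSequenceClassification

tokenizer_fine_tuned_pt = BertTokenizer.from_pretrained(save_directory)

# from_tf=True belongs to the PyTorch classes and converts a Hugging Face TensorFlow
# checkpoint (config.json + tf_model.h5). sentiment_model.save() writes a Keras
# SavedModel instead, so this call is expected to raise OSError for save_directory as it stands.
model_fine_tuned_pt = BertForSequenceClassification.from_pretrained(save_directory, from_tf=True)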


OSError: ignored

In [None]:
test_text = "bjp won the election in 3 different state and now the new chieff minister is jaat"
predict_input_pt = tokenizer_fine_tuned_pt(test_text, truncation=True, padding=True, return_tensors='pt')

model_fine_tuned_pt.eval()  # disable dropout for inference

# Perform inference without tracking gradients
with torch.no_grad():
    output_pt = model_fine_tuned_pt(**predict_input_pt)

# Index of the highest-scoring class
prediction_value_pt = torch.argmax(output_pt.logits, dim=1).item()

prediction_value_pt

In [None]:
!zip -r saved_models.zip /saved_models


In [None]:
from google.colab import files

# files.download('/content/saved_models.zip')
