# BERT Models 
##By: Sanchana Mohankumar

# Problem Statement:
In this notebook we are implementing 2 transformer model BERT and DistilBERT using bert-base-cased, distilbert-base-cased  Tokenizer and Model. And we are checking for better performing models using the following metrics to detect whether the posted tweet is real or fake
*   Accuracy
*   F1-Score
*   Precision
*   Recall

# Installation

In [None]:
#!pip install -q tensorflow-text==2.6.0
!pip install transformers

# Libraries

In [2]:
import pandas as pd
import numpy as np

import torch
import tensorflow as tf
from transformers import AutoTokenizer,TFBertModel, BertTokenizer, BertModel, TFTrainer, TFTrainingArguments
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from sklearn.metrics import classification_report

from torch import nn
from tqdm import tqdm
from torch.optim import Adam

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Load Data
As we can see below we are exporting covid19 tweet train, valid and test data for further analysis

### Training Dataset

In [4]:
#Load the tweet training set
covid_train = pd.read_csv("/content/drive/MyDrive/Covid_data_fake_news/Constraint_Train.csv") 
covid_train.head(2)

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real


### Validation Dataset

In [5]:
#Load the tweet validation set
covid_valid = pd.read_csv("/content/drive/MyDrive/Covid_data_fake_news/Constraint_Val.csv") 
covid_valid.head(2)

Unnamed: 0,id,tweet,label
0,1,Chinese converting to Islam after realising th...,fake
1,2,11 out of 13 people (from the Diamond Princess...,fake


### Test Dataset

In [6]:
#Load the tweet Test set
covid_test = pd.read_csv("/content/drive/MyDrive/Covid_data_fake_news/english_test_with_labels.csv") 
covid_test.head(2)

Unnamed: 0,id,tweet,label
0,1,Our daily update is published. States reported...,real
1,2,Alfalfa is the only cure for COVID-19.,fake


# Preparing data to use for implementing in model
- As we know the dataset contains label column with Fake and real we are now converting into 0's and 1's, we now replace the categorical value with a numeric value. We are applying this for Train, Validation and Test data

*   Fake as 0 
*   Real as 1. 

In [7]:
encoded_dict = {'fake':0,'real':1}

covid_train['label'] = covid_train.label.map(encoded_dict)
covid_valid['label'] = covid_valid.label.map(encoded_dict)
covid_test['label']  = covid_test.label.map(encoded_dict)

In [8]:
y_train = to_categorical(covid_train.label)
y_valid = to_categorical(covid_valid.label)
y_test = to_categorical(covid_test.label)

#BERT Model using Hugging face transfomers by Tensorflow and Keras

## Tokenize

- We will download the pretrained tokenizer for the BERT uncased model. This tokenizer will be used to convert text tokens to numbers and we will tokenize the raw dataset to tokenized sentences.

- It is also important to note that the maximum size of tokens that can be fed into BERT model is 512. If the tokens in a sequence are less than 512, we can use padding to fill the unused token slots with [PAD] token. If the tokens in a sequence are longer than 512, then we need to do a truncation.

In [9]:
# Calling pretrained Tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Downloading tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/426k [00:00<?, ?B/s]


- Before training, we need to convert the input textual data into BERT’s input data format using a tokenizer.

- Since we have loaded bert-base-cased, so tokenizer will also be Bert-base-cased.

In [None]:
# tokenizer used here is from bert-base-cased
x_train = tokenizer(
    text=covid_train.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True
    )
x_valid = tokenizer(
    text=covid_valid.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True
    )

x_test = tokenizer(
    text=covid_test.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True)

In [None]:
x_train

{'input_ids': <tf.Tensor: shape=(6420, 70), dtype=int32, numpy=
array([[  101,  1109,  2891, ...,     0,     0,     0],
       [  101,  1311,  2103, ...,     0,     0,     0],
       [  101,  6679,  1193, ...,     0,     0,     0],
       ...,
       [  101,   168,   138, ...,     0,     0,     0],
       [  101,   138, 11787, ...,     0,     0,     0],
       [  101,  1135,  1144, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(6420, 70), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

# Model

In [None]:
# Calling pretrained Model
bert = TFBertModel.from_pretrained('bert-base-cased')

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Bert layers accept three input arrays, input_ids, attention_mask, token_type_ids

- input_ids means our input words encoding, then attention mask,

- token_type_ids is necessary for the question-answering model; in this case, we will not pass token_type_ids.

For the Bert layer, we need two input layers, in this case, input_ids, attention_mask.
Embeddings contain hidden states of the Bert layer.

In [None]:
input_ids       = x_train['input_ids']
attention_mask  = x_train['attention_mask']

We are using functional API to design our model.

In [None]:
input_ids       = Input(shape=(70,), dtype=tf.int32, name="input_ids")        # Max_Length is set to 70
attention_mask  = Input(shape=(70,), dtype=tf.int32, name="attention_mask")  

embeddings = bert(input_ids, attention_mask)[0] 

out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(24, activation='relu')(out)
out = tf.keras.layers.Dropout(0.5)(out)
#out = Dense(32,activation = 'relu')(out)

y = Dense(2,activation = 'softmax')(out) #set to 2 as we have 2 Intends 

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=y)
model.layers[2].trainable = True

# Model Summary 

In [None]:
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 70)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 70)]         0           []                               
                                                                                                  
 tf_bert_model_4 (TFBertModel)  TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 70,                                          

# Model Compilation
Defining learning parameters and compiling the model.
- Learning_rate = 2e-05 the learning rate for the model will be significantly lower.

- Loss = CategoricalCrossentropy since we are passing the categorical data as the target.

- Balanced accuracy will take care of our average accuracy for all the classes.



In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate =2e-5,
                                     epsilon=1e-08,
                                     decay=0.01,
                                     clipnorm=1.0) # this learning rate is for bert model , taken from huggingface website 

# Set loss and metrics
loss =CategoricalCrossentropy(from_logits = True)
metric = CategoricalAccuracy('balanced_accuracy'),
# Compile the model
model.compile(
    optimizer = optimizer,
    loss = loss, 
    metrics = metric)

# Model Training
- You have the model ready with x_train, y_train. You can now train the model.

- Training and fine-tuning of the BERT model takes a bit longer time. so be Patience

In [None]:
validation_history = model.fit(
    x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} , y = y_train,
    validation_data = ({'input_ids':x_valid['input_ids'],'attention_mask':x_valid['attention_mask']}, y_valid),
    epochs=5, batch_size = 12)

model.save('model_bert.h5')

#### Testing our model on the test data.

In [None]:
predicted_bert = model.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})
predicted_bert[0]

array([1.9634266e-04, 9.9980372e-01], dtype=float32)

Taking the index of value having maximum probability.

In [None]:
y_predicted_bert = np.argmax(predicted_bert, axis = 1)
y_true_bert = covid_test.label

#### Classification Report

In [None]:
print(classification_report(y_true_bert, y_predicted_bert))

              precision    recall  f1-score   support

           0       0.97      0.96      0.96      1020
           1       0.96      0.97      0.97      1120

    accuracy                           0.97      2140
   macro avg       0.97      0.97      0.97      2140
weighted avg       0.97      0.97      0.97      2140



#Distil BERT Model using Hugging face transfomers by Tensorflow and Keras

### Tokenize
- We will download the pretrained tokenizer for the Distilbert uncased model. This tokenizer will be used to convert text tokens to numbers and we will tokenize the raw dataset to tokenized sentences.

- It is also important to note that the maximum size of tokens that can be fed into BERT model is 512. If the tokens in a sequence are less than 512, we can use padding to fill the unused token slots with [PAD] token. If the tokens in a sequence are longer than 512, then we need to do a truncation

In [11]:
# Calling pretrained Tokenizer
distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

Downloading tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/426k [00:00<?, ?B/s]


- Before training, we need to convert the input textual data into Distilbert input data format using a tokenizer.

- Since we have loaded distilbert-base-cased, so tokenizer will also be distilbert-base-cased.

In [12]:
# tokenizer used here is from bert-base-cased
x_train = distilbert_tokenizer(
    text=covid_train.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True
    )
x_valid = distilbert_tokenizer(
    text=covid_valid.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True
    )

x_test = distilbert_tokenizer(
    text=covid_test.tweet.tolist(),
    add_special_tokens=True,
    max_length=70,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',           # tf if tensorflow pt if pytorch 
    return_token_type_ids = False, # Can be set true if its question answering case
    return_attention_mask = True,
    verbose = True)

In [13]:
x_train

{'input_ids': <tf.Tensor: shape=(6420, 70), dtype=int32, numpy=
array([[  101,  1109,  2891, ...,     0,     0,     0],
       [  101,  1311,  2103, ...,     0,     0,     0],
       [  101,  6679,  1193, ...,     0,     0,     0],
       ...,
       [  101,   168,   138, ...,     0,     0,     0],
       [  101,   138, 11787, ...,     0,     0,     0],
       [  101,  1135,  1144, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(6420, 70), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

### Model

In [14]:
# Calling pretrained Model
distilbert = TFBertModel.from_pretrained('distilbert-base-cased')

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


Downloading tf_model.h5:   0%|          | 0.00/338M [00:00<?, ?B/s]

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFBertModel: ['activation_13', 'vocab_transform', 'vocab_layer_norm', 'distilbert', 'vocab_projector']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['bert']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distillbert layers accept three input arrays, input_ids, attention_mask, token_type_ids

- input_ids means our input words encoding, then attention mask,

- token_type_ids is necessary for the question-answering model; in this case, we will not pass token_type_ids.

For the Distilbert layer, we need two input layers, in this case, input_ids, attention_mask.
Embeddings contain hidden states of the Bert layer.

In [15]:
input_ids       = x_train['input_ids']
attention_mask  = x_train['attention_mask']

In [22]:
input_ids       = Input(shape=(70,), dtype=tf.int32, name="input_ids")        # Max_Length is set to 70
attention_mask  = Input(shape=(70,), dtype=tf.int32, name="attention_mask")  

embeddings = distilbert(input_ids, attention_mask)[0] 

out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(24, activation='relu')(out)
out = tf.keras.layers.Dropout(0.5)(out)
out = Dense(32,activation = 'relu')(out)
#out = tf.keras.layers.Dropout(0.5)(out)
y = Dense(2,activation = 'softmax')(out) #set to 2 as we have 2 Intends 

model_distilbert = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=y)
model_distilbert.layers[2].trainable = True
model_distilbert.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 70)]         0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 70)]         0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 70,                                          

# Model Compilation
Defining learning parameters and compiling the model.
- Learning_rate = 2e-05 the learning rate for the model will be significantly lower.

- Loss = CategoricalCrossentropy since we are passing the categorical data as the target.

- Balanced accuracy will take care of our average accuracy for all the classes.

In [23]:
optimizer = tf.keras.optimizers.Adam(learning_rate =2e-5,
                                     epsilon=1e-08,
                                     decay=0.01,
                                     clipnorm=1.0) # this learning rate is for bert model , taken from huggingface website 

# Set loss and metrics
loss =CategoricalCrossentropy(from_logits = True)
metric = CategoricalAccuracy('balanced_accuracy'),
# Compile the model
model_distilbert.compile(
    optimizer = optimizer,
    loss = loss, 
    metrics = metric)

# Model Training
- You have the model ready with x_train, y_train. You can now train the model.

- Training and fine-tuning of the Distilbert model takes a bit longer time. so be Patience

In [None]:
validation_history = model_distilbert.fit(
    x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} , y = y_train,
    validation_data = ({'input_ids':x_valid['input_ids'],'attention_mask':x_valid['attention_mask']}, y_valid),
    epochs=5, batch_size = 12)

#### Model Evaluation 
#### Testing our model on the test data.

In [None]:
predicted_distilbert = model_distilbert.predict({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']})
predicted_distilbert[0]

array([0.0512109, 0.9487891], dtype=float32)

Taking the index of value having maximum probability.

In [None]:
y_predicted_distilbert = np.argmax(predicted_distilbert, axis = 1)
y_true_distilbert = covid_test.label

#### Classification Report

In [None]:
print(classification_report(y_true_distilbert, y_predicted_distilbert))

              precision    recall  f1-score   support

           0       0.91      0.90      0.91      1020
           1       0.91      0.92      0.92      1120

    accuracy                           0.91      2140
   macro avg       0.91      0.91      0.91      2140
weighted avg       0.91      0.91      0.91      2140



# Conclusion
**Advantages of BERT꞉**
- BERT works well for task‑specific models. The BERT model has been trained on a large corpus, making it easier for smaller, more
defined tasks. Metrics can be fine‑tuned and be used immediately.
- The accuracy of the model is outstanding because it is frequently updated. You can achieve this with successful fine‑tuning.
- The BERT model is available and pre‑trained in more than 100 languages. This can be useful for projects that are not Englishbased.

**Disadvantages of BERT꞉**
- The model is large because of the training structure and corpus.
It is slow to train because it is big and there are a lot of weights to update.
- It is expensive. It requires more computation because of its size, which comes at a cost.

**Advantages of DistilBERT꞉**
- DistilBERT retains 97% performance of the BERT with 40% fewer parameters than BERT

**Disadvantages of DistilBERT꞉**

- It is expensive. It requires more computation because of its size, which comes at a cost.

Let us compare the results of trained models. Following table lists the different metrics evaluated:

| Model              | Accuracy | Precision | Recall | F1-Score |
|--------------------|----------|-----------|--------|----------|
| Trained BERT       | 97       | 97        | 97     | 96       |
| Trained DistilBERT | 91       | 91        | 91     | 91       |