<a href="https://colab.research.google.com/github/Chirag-Bansal/Machine-Learning-Collection/blob/master/Fine_Tune_BERT_for_Text_Classification_with_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align=center> Fine-Tune BERT for Text Classification with TensorFlow</h2>

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 1: BERT Classification Model</p>
</div>

The pretrained BERT model used in this project is [available](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2) on [TensorFlow Hub](https://tfhub.dev/).

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Install TensorFlow and TensorFlow Model Garden

In [None]:
import tensorflow as tf
print(tf.version.VERSION)

2.7.0


In [None]:
!pip install -q tensorflow==2.3.0

In [None]:
!git clone --depth 1 -b v2.3.0 https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 2650, done.
remote: Counting objects: 100% (2650/2650), done.
remote: Compressing objects: 100% (2311/2311), done.
remote: Total 2650 (delta 506), reused 1388 (delta 306), pack-reused 0
Receiving objects: 100% (2650/2650), 34.02 MiB | 6.46 MiB/s, done.
Resolving deltas: 100% (506/506), done.
Note: checking out '400d68abbccda2f0f6609e3a924467718b144233'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>



In [None]:
# install requirements to use tensorflow/models repository
!pip install -Uqr models/official/requirements.txt
# you may have to restart the runtime afterwards


## Download and Import the Quora Insincere Questions Dataset

In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import sys
sys.path.append('models')
from official.nlp.data import classifier_data_lib
from official.nlp.bert import tokenization
from official.nlp import optimization

In [None]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

TF Version:  2.7.0
Eager mode:  True
Hub version:  0.12.0
GPU is NOT AVAILABLE


A downloadable copy of the [Quora Insincere Questions Classification data](https://www.kaggle.com/c/quora-insincere-questions-classification/data) is available at [https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip](https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip). Decompress and read the data into a pandas DataFrame.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip',
                 compression='zip', low_memory=False)
df.shape

In [None]:
df.head(20)

In [None]:
df.target.plot(kind='hist', title='Target distribution')

## Create tf.data.Datasets for Training and Evaluation

In [None]:
train_df, remaining = train_test_split(df, random_state=42, train_size=0.0075, stratify=df.target.values)
valid_df, _ = train_test_split(remaining, random_state=42, train_size=0.00075, stratify=remaining.target.values)

train_df.shape, valid_df.shape

In [None]:
train_df.target.plot(kind='hist', title='Train Distribution')

In [None]:
with tf.device('/cpu:0'):
  train_data = tf.data.Dataset.from_tensor_slices((train_df.question_text.values,train_df.target.values))
  valid_data = tf.data.Dataset.from_tensor_slices((valid_df.question_text.values,valid_df.target.values))

  for text, label in train_data.take(1):
    print(text,label)

## Download a Pre-trained BERT Model from TensorFlow Hub

In [None]:
"""
Each row of the dataset is composed of the question text and its label
- Data preprocessing consists of transforming text to BERT input features:
input_word_ids, input_mask, segment_ids
- In the process, the text is tokenized with the tokenizer provided with the BERT model
"""

label_list = [0, 1]  # Label categories
max_seq_length = 128  # maximum length of (token) input sequences
train_batch_size = 32

# Get BERT layer and tokenizer:
# More details here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [None]:
tokenizer.wordpiece_tokenizer.tokenize('hi, how are you doing?')

In [None]:
tokenizer.convert_tokens_to_ids(tokenizer.wordpiece_tokenizer.tokenize('hi, how are you doing?'))
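
As a rough illustration of what a WordPiece tokenizer does with a single word, the sketch below applies greedy longest-match-first splitting against a vocabulary. The `wordpiece` function and `toy_vocab` are hypothetical stand-ins written for this example. They are a simplification, not the Model Garden implementation, and the real BERT vocabulary is far larger.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hi", "do", "##ing", "play", "##ed"}

print(wordpiece("doing", toy_vocab))
print(wordpiece("played", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Words that cannot be covered by vocabulary pieces fall back to the unknown token, which is why a vocabulary of sub-word units can handle most unseen words without growing very large.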

## Tokenize and Preprocess Text for BERT

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1-SpKFELnEvBMBqO7h3iypo8q9uUUo96P' />
    <p style="text-align: center;color:gray">Figure 2: BERT Tokenizer</p>
</div>

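We'll need to transform our data into a format BERT understands. This involves two steps. First, we create an `InputExample` for each row using the `InputExample` constructor from `classifier_data_lib` in the TensorFlow Model Garden. Second, we convert each example into BERT input features (`input_word_ids`, `input_mask`, and `segment_ids`) with `classifier_data_lib.convert_single_example`.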
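
To make the target format concrete, here is a minimal pure-Python sketch of how the three feature vectors are laid out for a single-sentence input. The `to_bert_features` helper and the ids in `toy_vocab` are hypothetical and chosen only for illustration; in this notebook the actual conversion is done by `classifier_data_lib.convert_single_example` with the real BERT vocabulary.

```python
def to_bert_features(tokens, vocab, max_seq_length):
    """Build padded input ids, input mask, and segment ids for one sentence."""
    # Reserve two positions for the [CLS] and [SEP] markers, truncating if needed.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks a real token
    segment_ids = [0] * len(input_ids)  # a single sentence is all segment 0

    # Pad every vector with zeros up to the fixed sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Hypothetical toy vocabulary; these ids are not the real BERT ids.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "how": 3, "are": 4, "you": 5}

ids, mask, segments = to_bert_features(["how", "are", "you"], toy_vocab, max_seq_length=8)
print(ids)
print(mask)
print(segments)
```

The mask lets the model ignore padded positions, and the segment ids would only become non-zero for the second sentence of a sentence-pair task, which this notebook does not use.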

In [None]:
# This provides a function to convert row to input features and label

def to_feature(text, label, label_list=label_list, max_seq_length=max_seq_length, tokenizer=tokenizer):
  example = classifier_data_lib.InputExample(guid=None,
                                             text_a=text.numpy(),
                                             text_b=None,
                                             label=label.numpy())
  feature = classifier_data_lib.convert_single_example(0,example,label_list,max_seq_length,tokenizer)

  return (feature.input_ids, feature.input_mask, feature.segment_ids, feature.label_id)
  

You want to use [`Dataset.map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) to apply this function to each element of the dataset. [`Dataset.map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) runs in graph mode.

- Graph tensors do not have a value.
- In graph mode you can only use TensorFlow Ops and functions.

So you can't `.map` this function directly; you need to wrap it in a [`tf.py_function`](https://www.tensorflow.org/api_docs/python/tf/py_function). The [`tf.py_function`](https://www.tensorflow.org/api_docs/python/tf/py_function) passes regular eager tensors (which have a value and a `.numpy()` method to access it) to the wrapped Python function.
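
The pattern can be tried on a tiny dataset before applying it to BERT features. The sketch below is illustrative only: `shout`, `shout_map`, and the two example strings are hypothetical names and data, and the Python function simply upper-cases the text to show that `.numpy()` is usable inside the wrapped function.

```python
import tensorflow as tf


def shout(text, label):
    # Inside tf.py_function the arguments are eager tensors, so .numpy() works.
    upper = text.numpy().decode("utf-8").upper()
    return upper, label


def shout_map(text, label):
    # Wrap the Python function so Dataset.map can call it from graph mode.
    upper, label = tf.py_function(shout, inp=[text, label],
                                  Tout=[tf.string, tf.int32])
    # py_function outputs have unknown static shapes, so restore them here.
    upper.set_shape([])
    label.set_shape([])
    return upper, label


texts = tf.constant(["is this sincere?", "hello"])
labels = tf.constant([0, 1], dtype=tf.int32)
toy_data = tf.data.Dataset.from_tensor_slices((texts, labels)).map(shout_map)

results = [(t.numpy().decode("utf-8"), int(l.numpy())) for t, l in toy_data]
print(results)
```

Calling `set_shape` after `tf.py_function` matters because downstream steps such as batching and the Keras input layers rely on known static shapes.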

## Wrap a Python Function into a TensorFlow op for Eager Execution

In [None]:
def to_feature_map(text, label):
  input_ids, input_mask, segment_ids, label_id = tf.py_function(to_feature, inp=[text,label],
                                                                Tout=[tf.int32,tf.int32,tf.int32,tf.int32])
  
  input_ids.set_shape([max_seq_length])
  input_mask.set_shape([max_seq_length])
  segment_ids.set_shape([max_seq_length])
  label_id.set_shape([])

  x = {
      'input_word_ids': input_ids,
      'input_mask': input_mask,
      'input_type_ids': segment_ids
  }

  return (x,label_id)
  

## Create a TensorFlow Input Pipeline with `tf.data`

In [None]:
with tf.device('/cpu:0'):
  # train
  train_data = (train_data.map(to_feature_map,
      num_parallel_calls = tf.data.experimental.AUTOTUNE)
  .shuffle(100)
  .batch(32,drop_remainder=True)
  .prefetch(tf.data.experimental.AUTOTUNE))

  # valid
  valid_data = (valid_data.map(to_feature_map,
      num_parallel_calls = tf.data.experimental.AUTOTUNE)
  .batch(32,drop_remainder=True)
  .prefetch(tf.data.experimental.AUTOTUNE))
  

The resulting `tf.data.Datasets` return `(features, labels)` pairs, as expected by [`keras.Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit):

In [None]:
# train data spec
train_data.element_spec

In [None]:
# valid data spec
valid_data.element_spec

## Add a Classification Head to the BERT Layer

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 3: BERT Layer</p>
</div>

In [None]:
# Building the model
def create_model():
  input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
  input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="input_mask")
  input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                      name="input_type_ids")
  
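  # Version 4 of the TF Hub BERT encoder takes a dict of inputs and returns a dict of outputs
  bert_outputs = bert_layer({
      'input_word_ids': input_word_ids,
      'input_mask': input_mask,
      'input_type_ids': input_type_ids
  })
  pooled_output = bert_outputs['pooled_output']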

  drop = tf.keras.layers.Dropout(0.4)(pooled_output)
  output = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(drop)

  model = tf.keras.Model(
      inputs={
          'input_word_ids': input_word_ids,
          'input_mask': input_mask,
          'input_type_ids': input_type_ids
      },
      outputs=output)
  return model


## Fine-Tune BERT for Text Classification

In [None]:
model = create_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])
model.summary()

In [None]:
tf.keras.utils.plot_model(model=model, show_shapes=True, dpi=76)

In [None]:
# Train model
epochs = 4
history = model.fit(train_data,
                    validation_data=valid_data,
                    epochs=epochs,
                    verbose=1)

## Evaluate the BERT Text Classification Model

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()

In [None]:
plot_graphs(history,'loss')

In [None]:
plot_graphs(history,'binary_accuracy')