<a href="https://colab.research.google.com/github/Nazneen-akram/coursera-rep/blob/main/Fine_Tune_BERT_for_Text_Classification_with_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align=center> Fine-Tune BERT for Text Classification with TensorFlow</h2>

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 1: BERT Classification Model</p>
</div>

In this [project](https://www.coursera.org/projects/fine-tune-bert-tensorflow/), you will learn how to fine-tune a BERT model for text classification using TensorFlow and TF-Hub.

The pretrained BERT model used in this project is [available](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2) on [TensorFlow Hub](https://tfhub.dev/).

### Learning Objectives

By the time you complete this project, you will be able to:

- Build TensorFlow Input Pipelines for Text Data with the [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data) API
- Tokenize and Preprocess Text for BERT
- Fine-tune BERT for text classification with TensorFlow 2 and [TF Hub](https://tfhub.dev)

### Prerequisites

To be successful with this project, you should be:

- Competent in the Python programming language
- Familiar with deep learning for Natural Language Processing (NLP)
- Familiar with TensorFlow and its Keras API

### Contents

This project/notebook consists of several Tasks.

- **[Task 1]()**: Introduction to the Project.
- **[Task 2]()**: Set up your TensorFlow and Colab Runtime
- **[Task 3]()**: Download and Import the Quora Insincere Questions Dataset
- **[Task 4]()**: Create tf.data.Datasets for Training and Evaluation
- **[Task 5]()**: Download a Pre-trained BERT Model from TensorFlow Hub
- **[Task 6]()**: Tokenize and Preprocess Text for BERT
- **[Task 7]()**: Wrap a Python Function into a TensorFlow op for Eager Execution
- **[Task 8]()**: Create a TensorFlow Input Pipeline with `tf.data`
- **[Task 9]()**: Add a Classification Head to the BERT `hub.KerasLayer`
- **[Task 10]()**: Fine-Tune BERT for Text Classification
- **[Task 11]()**: Evaluate the BERT Text Classification Model

## Task 2: Set Up Your TensorFlow and Colab Runtime

You will only be able to use the Colab Notebook after you save it to your Google Drive folder. Click on the File menu and select “Save a copy in Drive…”

![Copy to Drive](https://drive.google.com/uc?id=1CH3eDmuJL8WR0AP1r3UE6sOPuqq8_Wl7)


### Check GPU Availability

Check whether your Colab notebook is configured to use Graphics Processing Units (GPUs). If no GPU is available, change the hardware accelerator setting (Menu > Runtime > Change Runtime Type).

![Hardware Accelerator Settings](https://drive.google.com/uc?id=1qrihuuMtvzXJHiRV8M7RngbxFYipXKQx)


In [51]:
!nvidia-smi

Fri Nov 17 10:41:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P0    29W /  70W |   2391MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

### Install TensorFlow and TensorFlow Model Garden

In [52]:
import tensorflow as tf
print(tf.version.VERSION)

2.14.0


In [53]:
#!pip install -q tensorflow==2.3.0

ERROR: Could not find a version that satisfies the requirement tensorflow==2.3.0 (from versions: 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1, 2.14.0rc0, 2.14.0rc1, 2.14.0, 2.14.1, 2.15.0rc0, 2.15.0rc1, 2.15.0)
ERROR: No matching distribution found for tensorflow==2.3.0

In [54]:
!git clone --depth 1 -b v2.3.0 https://github.com/tensorflow/models.git

fatal: destination path 'models' already exists and is not an empty directory.


In [55]:
# install requirements to use tensorflow/models repository
!pip install -Uqr models/official/requirements.txt
# you may have to restart the runtime afterwards

  Preparing metadata (setup.py) ... done


## Restart the Runtime

**Note**
After installing the required Python packages, you'll need to restart the Colab Runtime Engine (Menu > Runtime > Restart runtime...)

![Restart of the Colab Runtime Engine](https://drive.google.com/uc?id=1xnjAy2sxIymKhydkqb0RKzgVK9rh3teH)

## Task 3: Download and Import the Quora Insincere Questions Dataset

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import sys
sys.path.append('models')
from official.nlp.data import classifier_data_lib
from official.nlp.bert import tokenization
from official.nlp import optimization


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



In [2]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

TF Version:  2.14.0
Eager mode:  True
Hub version:  0.15.0
GPU is available


A downloadable copy of the [Quora Insincere Questions Classification data](https://www.kaggle.com/c/quora-insincere-questions-classification/data) can be found at [https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip](https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip). Decompress the archive and read the data into a pandas DataFrame.

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import io
import requests
import zipfile

In [4]:
# Download the zip file from the link
url = "https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

In [5]:
# Extract the csv file from the zip file
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open("train.csv") as f:
        # Read the csv file into a pandas DataFrame
        df = pd.read_csv(f)

# Print the first five rows of the DataFrame
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


## Task 4: Create tf.data.Datasets for Training and Evaluation

In [6]:
train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)

In [7]:
# Create TensorFlow Datasets
with tf.device('/cpu:0'):
    train_dataset = tf.data.Dataset.from_tensor_slices((train_df['question_text'].values, train_df['target'].values))
    eval_dataset = tf.data.Dataset.from_tensor_slices((eval_df['question_text'].values, eval_df['target'].values))

    # Shuffle and batch the datasets
    batch_size = 32
    train_dataset = train_dataset.shuffle(buffer_size=len(train_df)).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    eval_dataset = eval_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

## Task 5: Download a Pre-trained BERT Model from TensorFlow Hub

In [None]:
"""
Each line of the dataset is composed of the question text and its label
- Data preprocessing consists of transforming text to BERT input features:
input_word_ids, input_mask, segment_ids
- In the process, tokenizing the text is done with the provided BERT model tokenizer
"""

 # Label categories
 # maximum length of (token) input sequences


# Get BERT layer and tokenizer:
# More details here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2



In [8]:
# Label categories
label_categories = [0, 1]

In [9]:
# Maximum length of (token) input sequences
max_seq_length = 128

# Get BERT layer and tokenizer from TensorFlow Hub

input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",
                            trainable=True)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [13]:
from bert import run_classifier

ImportError: ignored

In [14]:
#pip install bert-for-tf2

In [20]:
# `bert.run_classifier` could not be imported above, so this cell uses the
# corresponding helpers in `classifier_data_lib` (imported in Task 3) instead.

# Create InputExamples
train_InputExamples = [
    classifier_data_lib.InputExample(guid=None, text_a=text, text_b=None, label=label)
    for text, label in zip(train_df['question_text'], train_df['target'])
]

# Convert InputExamples to features (one example at a time; slow on the full training set)
train_data = [
    classifier_data_lib.convert_single_example(index, example, label_categories,
                                               max_seq_length, tokenizer)
    for index, example in enumerate(train_InputExamples)
]

# Convert features to numpy arrays
train_input_ids = np.array([feature.input_ids for feature in train_data])
train_input_mask = np.array([feature.input_mask for feature in train_data])
train_segment_ids = np.array([feature.segment_ids for feature in train_data])
train_labels = np.array([feature.label_id for feature in train_data])

# Example usage:
print("Input IDs Shape:", train_input_ids.shape)
print("Input Mask Shape:", train_input_mask.shape)
print("Segment IDs Shape:", train_segment_ids.shape)
print("Labels Shape:", train_labels.shape)

## Task 6: Tokenize and Preprocess Text for BERT

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1-SpKFELnEvBMBqO7h3iypo8q9uUUo96P' />
    <p style="text-align: center;color:gray">Figure 2: BERT Tokenizer</p>
</div>

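We need to transform the data into a format BERT understands. This involves two steps. First, we create `InputExample` objects using the `InputExample` constructor from `classifier_data_lib`, provided in the BERT library. Second, we convert each `InputExample` into the features BERT expects: input word IDs, an input mask, and segment IDs.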
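To make the target format concrete, here is a minimal pure-Python sketch of how the three BERT input features relate to one another. The vocabulary `toy_vocab`, the helper `build_bert_features`, and the whitespace tokenization are all hypothetical stand-ins for illustration; the real tokenizer uses WordPiece with the vocabulary shipped with the model.

```python
def build_bert_features(tokens, vocab, max_seq_length):
    """Toy illustration of BERT's single-sentence input features."""
    # Reserve two positions for the special [CLS] and [SEP] tokens
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    # Map each token to its vocabulary id, falling back to [UNK]
    input_word_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # 1 marks a real token, 0 marks padding
    input_mask = [1] * len(input_word_ids)
    # A single-sentence input uses segment 0 everywhere
    segment_ids = [0] * len(input_word_ids)
    # Pad all three features with zeros up to max_seq_length
    padding = [0] * (max_seq_length - len(input_word_ids))
    return input_word_ids + padding, input_mask + padding, segment_ids + padding


# Hypothetical vocabulary, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "why": 4, "does": 5, "velocity": 6, "affect": 7, "time": 8}

ids, mask, segments = build_bert_features(
    "why does velocity affect time".split(), toy_vocab, max_seq_length=10)
print(ids)       # [2, 4, 5, 6, 7, 8, 3, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists always have length `max_seq_length`, which is what lets the examples be batched into fixed-size tensors.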

In [48]:
# Define a function to pre-process the text and label
def preprocess_text(text, label):
    # tf.py_function passes eager tensors, so unwrap them with .numpy() first
    example = classifier_data_lib.InputExample(guid=None,
                                               text_a=text.numpy(),
                                               text_b=None,
                                               label=label.numpy())
    # Tokenize, truncate, and pad to max_seq_length with the BERT tokenizer from Task 5
    feature = classifier_data_lib.convert_single_example(0, example, label_categories,
                                                         max_seq_length, tokenizer)
    # Return the input ids, token type ids, attention masks, and label
    return feature.input_ids, feature.segment_ids, feature.input_mask, feature.label_id

# Define a wrapper function to use tf.py_function
def wrapper_fn(text, label):
    # Use tf.py_function to wrap the pre-processing function.
    # Return a tuple: tf.data would try to pack a list into a single tensor.
    return tuple(tf.py_function(preprocess_text, inp=[text, label],
                                Tout=[tf.int32, tf.int32, tf.int32, tf.int32]))

# Create a dummy dataset of text and labels
text = ["Sentence to embed", "Another sentence to embed", "One more sentence to embed"]
label = [0, 1, 0]
dataset = tf.data.Dataset.from_tensor_slices((text, label))

In [49]:
# Apply the wrapper function to the dataset
dataset = dataset.map(wrapper_fn)

# Shuffle, batch, and prefetch the dataset
dataset = dataset.shuffle(buffer_size=3).batch(batch_size=2).prefetch(tf.data.AUTOTUNE)

# Print the results
for input_ids, token_type_ids, attention_masks, label in dataset:
    print("Input ids:", input_ids)
    print("Token type ids:", token_type_ids)
    print("Attention masks:", attention_masks)
    print("Label:", label)

In [45]:
#pip install transformers

In [19]:
#!pip install bert-for-tf2

You want to use [`Dataset.map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) to apply this function to each element of the dataset. [`Dataset.map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) runs in graph mode.

- Graph tensors do not have a value.
- In graph mode you can only use TensorFlow Ops and functions.

So you can't `.map` this function directly: you need to wrap it in a [`tf.py_function`](https://www.tensorflow.org/api_docs/python/tf/py_function). The [`tf.py_function`](https://www.tensorflow.org/api_docs/python/tf/py_function) will pass regular tensors (with a value and a `.numpy()` method to access it) to the wrapped Python function.
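The wrapping pattern can be sketched on a toy function that needs plain Python string handling. The names `count_words` and `count_words_map` are hypothetical and used only for illustration; the BERT preprocessing itself is left out so that the example runs on its own.

```python
import tensorflow as tf


def count_words(text, label):
    # Inside tf.py_function the arguments are eager tensors,
    # so .numpy() is available and ordinary Python code can run
    words = text.numpy().decode("utf-8").split()
    return len(words), label


def count_words_map(text, label):
    # Dataset.map traces this function in graph mode; tf.py_function
    # defers the Python code to run eagerly on each element
    n_words, label = tf.py_function(
        count_words, inp=[text, label], Tout=[tf.int32, tf.int32])
    # py_function loses static shape information, so restore it
    n_words.set_shape([])
    label.set_shape([])
    # Return a tuple (not a list) so tf.data keeps the two components separate
    return n_words, label


texts = ["Sentence to embed", "Another sentence to embed"]
labels = [0, 1]
toy_dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).map(count_words_map)

results = [(int(n), int(l)) for n, l in toy_dataset]
print(results)  # [(3, 0), (4, 1)]
```

Calling `text.numpy()` directly inside `count_words_map` would fail, because the tensors passed by `Dataset.map` are symbolic and carry no value.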

## Task 7: Wrap a Python Function into a TensorFlow op for Eager Execution

In [None]:
def to_feature_map(text, label):



## Task 8: Create a TensorFlow Input Pipeline with `tf.data`

In [None]:
with tf.device('/cpu:0'):
  # train


  # valid



The resulting `tf.data.Datasets` return `(features, labels)` pairs, as expected by [`keras.Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit):
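As a rough sketch of what such a `(features, labels)` structure can look like, the example below maps a toy dataset into a dictionary of features plus a label. The function `toy_to_feature` is a hypothetical stand-in for real BERT preprocessing, `MAX_LEN` is an illustrative value, and the dictionary keys are assumed to match the `Input` layer names defined in Task 5.

```python
import tensorflow as tf

MAX_LEN = 8  # hypothetical short sequence length, for illustration only


def toy_to_feature(text, label):
    # Hypothetical stand-in for BERT preprocessing: one fake id per word, zero-padded
    n_words = min(len(text.numpy().split()), MAX_LEN)
    input_ids = [1] * n_words + [0] * (MAX_LEN - n_words)
    input_mask = list(input_ids)
    segment_ids = [0] * MAX_LEN
    return input_ids, input_mask, segment_ids, label


def toy_to_feature_map(text, label):
    input_ids, input_mask, segment_ids, label_id = tf.py_function(
        toy_to_feature, inp=[text, label],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    # Restore the static shapes that py_function drops
    for tensor in (input_ids, input_mask, segment_ids):
        tensor.set_shape([MAX_LEN])
    label_id.set_shape([])
    # Keys are assumed to match the names of the model's Input layers
    features = {
        "input_word_ids": input_ids,
        "input_mask": input_mask,
        "segment_ids": segment_ids,
    }
    return features, label_id


toy_ds = (
    tf.data.Dataset.from_tensor_slices(
        (["Sentence to embed", "Another sentence to embed"], [0, 1]))
    .map(toy_to_feature_map, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

# Inspect the structure Keras will receive
print(toy_ds.element_spec)
features, labels = next(iter(toy_ds))
```

Keying the features by input name lets Keras route each tensor to the matching `Input` layer regardless of ordering.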

In [None]:
# train data spec


In [None]:
# valid data spec


## Task 9: Add a Classification Head to the BERT Layer

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 3: BERT Layer</p>
</div>
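One common way to build such a head for a binary task is a dropout layer followed by a single sigmoid unit on top of BERT's pooled output. The sketch below is only an illustration under that assumption: `add_classification_head` and the dropout rate are hypothetical, and a plain `Input` of width 768 (the `H-768` in the model name) stands in for the pooled output so that the example runs without downloading BERT.

```python
import tensorflow as tf


def add_classification_head(pooled_output, dropout_rate=0.1):
    # Regularize the pooled sentence representation
    x = tf.keras.layers.Dropout(dropout_rate)(pooled_output)
    # One sigmoid unit: estimated probability of the positive class (label 1)
    return tf.keras.layers.Dense(1, activation="sigmoid", name="classifier")(x)


# Stand-in for the pooled output of the BERT layer (hidden size 768)
pooled = tf.keras.layers.Input(shape=(768,), dtype="float32", name="pooled_output")
head_model = tf.keras.Model(inputs=pooled, outputs=add_classification_head(pooled))

# Two dummy "sentences" produce two probabilities
probs = head_model(tf.zeros((2, 768)))
print(probs.shape)  # (2, 1)
```

In the notebook, the same function could be applied to the `pooled_output` tensor from Task 5, with the three `Input` layers as the model inputs.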

In [None]:
# Building the model
def create_model():


## Task 10: Fine-Tune BERT for Text Classification

In [None]:
# Train model


## Task 11: Evaluate the BERT Text Classification Model

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()