# **Fine-tune a BERT Model for Classification**
In this notebook you will learn how to fine-tune a transformer-based large language model on your own dataset:

- Preprocess a text dataset
  - Load the dataset
  - Analyze the features and the target
  - Tokenize the text
  - Add padding so that all sequences in a batch have the same length
  - Convert the `DatasetDict` to a TensorFlow dataset that a Keras model can train on
- Configure the model for training
  - You will learn how to set the loss, the optimizer, and the learning rate
- Train the model on your prepared dataset
- Evaluate your trained model
  - You will learn how to compute the model's accuracy and F1 score
  

## Import the libraries and packages used in this notebook

In [1]:
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from datasets import load_dataset
from tensorflow.keras.losses import SparseCategoricalCrossentropy

**Load dataset**

In [2]:
raw_datasets = load_dataset("glue", "sst2")


**Analyze features and target columns of the raw dataset**

In [3]:
raw_datasets['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

**Load the pretrained TensorFlow model and the tokenizer from the same checkpoint**

In [4]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Helper function that tokenizes the `sentence` field of each row in the raw data**

In [5]:
def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

**`DataCollatorWithPadding` pads the tokenized sentences in each batch to the same length using the padding token id (`0` for BERT)**

In [6]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
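
To make the padding step concrete, here is a minimal pure-Python sketch of what padding a batch to its longest sequence looks like. It is an illustration only, not the library's implementation: `pad_batch` is a hypothetical helper, and the token ids in `batch` are made-up examples (only `101` and `102`, the `[CLS]` and `[SEP]` ids of `bert-base-uncased`, are real).

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks.

    Hypothetical helper for illustration; DataCollatorWithPadding does this for you.
    """
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        # Real tokens are kept; padding is appended on the right.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different lengths.
batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padding length is chosen per batch rather than for the whole dataset, short batches stay short, which is the main reason to pad in the collator instead of in the tokenize function.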

**Apply the tokenize function to every split with `map`; the result is a `DatasetDict` stored in the `tokenized_datasets` variable**

In [7]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

**Inspect `tokenized_datasets` again: tokenization added three new features. These three are the only model inputs needed for training; the `label` column provides the target.**

1.   input_ids
2.   token_type_ids
3.   attention_mask


In [8]:
tokenized_datasets['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

## **Create TensorFlow Dataset**
 - `columns`: names of the features from our tokenized dataset that are used as model inputs
 - `label_cols`: name of the target column. The tokenized dataset calls it `label`; the data collator renames it to `labels`, which is why `labels` is passed here

**Running the cell below prepares the dataset for training our model. The hardest part is done!**


In [9]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
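
As a quick sanity check on the batching above, the number of steps Keras reports per epoch follows from the split sizes and `batch_size=8`. The sketch below is plain arithmetic. Whether the last partial training batch is kept depends on the `drop_remainder` setting of `to_tf_dataset`, whose default may differ between `datasets` versions, so both cases are shown.

```python
import math

train_rows, val_rows, batch_size = 67349, 872, 8

# If the last, smaller batch is kept, round up.
train_steps_keep_last = math.ceil(train_rows / batch_size)
# If the last, smaller batch is dropped, round down.
train_steps_drop_last = train_rows // batch_size
# 872 divides evenly by 8, so both conventions agree for validation.
val_steps = math.ceil(val_rows / batch_size)

print(train_steps_keep_last, train_steps_drop_last, val_steps)
```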


**Set the loss, the optimizer, and the learning rate, then compile the model**

In [10]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Define the optimizer configuration
optimizer_config = {
    "class_name": "Adam",
    "config": {
        "learning_rate": 5e-5,
    },
}

# Create the optimizer object
optimizer = tf.keras.optimizers.deserialize(optimizer_config)

# Compile the model
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
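
The model outputs raw logits rather than probabilities, which is why the loss is created with `from_logits=True`. The following is a minimal pure-Python sketch of the quantity computed per example: the negative log of the softmax probability assigned to the true class. The function name and the example logits are hypothetical, chosen only for illustration.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Made-up logits for two examples with integer ("sparse") labels.
batch_logits = [[2.0, -1.0], [0.0, 0.0]]
batch_labels = [0, 1]

losses = [sparse_ce_from_logits(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)  # averaged over the batch
print(losses, mean_loss)
```

A confident correct prediction (first example) gives a loss close to zero, while equal logits (second example) give `log(2)`, the loss of a coin flip on two classes.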

**Fit the model on the prepared dataset to start training**

In [11]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

Epoch 1/3


Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x79bce60ea350>

**Predict logits on the validation set and convert them to class labels**

In [12]:
preds = model.predict(tf_validation_dataset)["logits"]



In [14]:
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)

(872, 2) (872,)
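
Each row of `preds` holds two logits, one per class, and `argmax` picks the index of the larger one. The sketch below, in pure Python with made-up logits, shows the same step on two rows and also converts the logits to probabilities with a softmax, which can be handy when a confidence value is wanted. The `softmax` helper and the example values are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["negative", "positive"]  # same order as the dataset's ClassLabel
example_logits = [[-1.2, 2.3], [0.8, -0.4]]  # made-up values

predicted = []
for row in example_logits:
    probs = softmax(row)
    # Index of the largest logit; softmax keeps the ordering, so argmax is unchanged.
    pred = max(range(len(row)), key=row.__getitem__)
    predicted.append(pred)
    print(label_names[pred], [round(p, 3) for p in probs])
```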


**Evaluate the Model**

In [17]:
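import evaluate

# The dataset is SST-2, but the "sst2" GLUE metric reports accuracy only.
# The "mrpc" configuration is loaded here because it reports both accuracy and F1.
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])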

Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

{'accuracy': 0.8738532110091743, 'f1': 0.8791208791208791}
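
To see what the two reported numbers mean, here is a small pure-Python sketch that computes accuracy and binary F1 (with the positive class as label `1`) from predictions and references. It is an illustration on made-up toy labels, not the `evaluate` implementation; `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy over all examples and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up toy labels: 4 of 6 correct, 2 true positives, 1 false positive, 1 false negative.
toy_preds = [1, 0, 1, 1, 0, 0]
toy_refs = [1, 0, 0, 1, 1, 0]
scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```

Accuracy counts every correct prediction equally, while F1 balances precision and recall on the positive class, so the two can differ when the classes or the error types are unbalanced.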