<h2>Sequence Classifier for GLUE dataset</h2>

In this notebook we take advantage of the collection of pre-trained language models on the Hugging Face platform to fine-tune a sequence classifier, i.e. a model that checks whether two sentences are equivalent or not.


In [None]:
!pip install datasets
!pip install evaluate

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
from IPython.display import display
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding, TFAutoModelForSequenceClassification
import tensorflow as tf
import evaluate


We start by loading the dataset: "mrpc" (Microsoft Research Paraphrase Corpus) is a collection of pairs of sentences that may or may not be paraphrases (i.e. have the same meaning) and is one of the 10 datasets in the <i>GLUE benchmark</i>, an academic benchmark used to measure the performance of ML models:

In [None]:
#dataset = load_dataset("glue", "qqp")
dataset = load_dataset("glue", "mrpc")
dataset


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [None]:
display(dataset['train'][:5])

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

<h3>Tokenizer</h3>

Before feeding the data to an LLM we need to convert the text into numbers; to do so we use a tokenizer:

In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sample_input = dataset['train'][0]
#sample_output = tokenizer(sample_input['question1'], sample_input['question2'])
sample_output = tokenizer(sample_input['sentence1'], sample_input['sentence2'])
sample_output


{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

where:
<ul>
<li>input_ids are the token IDs, where a token is a word or a significant subword.</li>
<li>token_type_ids tracks whether a token belongs to the first or the second sentence.</li>
<li>attention_mask defines which tokens are relevant and which should be ignored (for example because they are padding).</li>
</ul>
As a check we can also convert the token IDs back to text:

In [None]:
tokenizer.convert_ids_to_tokens(sample_output['input_ids'])

['[CLS]',
 'am',
 '##ro',
 '##zi',
 'accused',
 'his',
 'brother',
 ',',
 'whom',
 'he',
 'called',
 '"',
 'the',
 'witness',
 '"',
 ',',
 'of',
 'deliberately',
 'di',
 '##stor',
 '##ting',
 'his',
 'evidence',
 '.',
 '[SEP]',
 'referring',
 'to',
 'him',
 'as',
 'only',
 '"',
 'the',
 'witness',
 '"',
 ',',
 'am',
 '##ro',
 '##zi',
 'accused',
 'his',
 'brother',
 'of',
 'deliberately',
 'di',
 '##stor',
 '##ting',
 'his',
 'evidence',
 '.',
 '[SEP]']
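
To make the three fields concrete, the following is a minimal sketch of how a sentence pair is laid out for a BERT-style model. The whitespace splitting, the tiny `toy_vocab` and the `encode_pair` helper are hypothetical stand-ins for illustration only; the real tokenizer uses WordPiece and its own vocabulary.

```python
# Toy illustration of the layout [CLS] sentence1 [SEP] sentence2 [SEP].
# encode_pair and toy_vocab are hypothetical, not part of the transformers API.

def encode_pair(first, second, vocab, max_len=None, pad_id=0):
    tokens_a = first.lower().split()
    tokens_b = second.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    input_ids = [vocab[token] for token in tokens]
    # 0 for [CLS], the first sentence and its [SEP]; 1 for the second sentence and its [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 marks real tokens, 0 will mark padding
    attention_mask = [1] * len(tokens)

    if max_len is not None:
        n_pad = max(0, max_len - len(input_ids))
        input_ids += [pad_id] * n_pad
        token_type_ids += [0] * n_pad
        attention_mask += [0] * n_pad

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "cat": 4,
             "sat": 5, "a": 6, "was": 7, "sitting": 8}

encoded = encode_pair("the cat sat", "a cat was sitting", toy_vocab, max_len=12)
for name, values in encoded.items():
    print(name, values)
```

The two trailing positions are padding: they get a `0` in the attention mask, which is how the model knows to ignore them.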

To convert all the data and keep it in dataset form we use a map with a tokenizing function.<br>
In the map we set <i>batched</i> to true to convert multiple samples at once and speed up the tokenization. In the function we set the truncation parameter but not the padding: padding at this stage would lengthen every text to the longest one in a whole chunk of the dataset. Since the data will be passed to the model in batches, we use dynamic padding instead, so that each text is lengthened at most to the longest text in its batch:

In [None]:
def tokenize_func(sample):
  #return tokenizer(sample['question1'], sample['question2'], truncation = True)
  return tokenizer(sample['sentence1'], sample['sentence2'], truncation = True)

In [None]:
dataset_tokenized = dataset.map(tokenize_func, batched = True)
dataset_tokenized


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
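
The saving from dynamic padding can be sketched with NumPy alone. The token-ID sequences below are made up for illustration (they are not MRPC samples), and `pad_batch` is a hypothetical helper, not the data collator itself.

```python
import numpy as np


def pad_batch(sequences, pad_id=0):
    """Pad a list of token-ID lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


# Hypothetical sequences with lengths 3, 5, 2 and 8.
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]
toy_batch_size = 2

# Static padding: every sequence is padded to the longest one overall.
static = pad_batch(sequences)

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = [
    pad_batch(sequences[start:start + toy_batch_size])
    for start in range(0, len(sequences), toy_batch_size)
]

static_cells = static.size
dynamic_cells = sum(batch.size for batch in dynamic)
print("static padding cells: ", static_cells)
print("dynamic padding cells:", dynamic_cells)
```

With real data the gap is usually larger, since a few long sentences would otherwise set the length for every sample.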

Dynamic padding is handled by a data collator; as a final preprocessing step we convert the dataset into a TensorFlow format that we can use for training with Keras:

In [None]:
BATCH_SIZE = 8
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = 'tf')

dataset_train_tf = dataset_tokenized['train'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = ['label'],
    shuffle = True,
    collate_fn = data_collator,
    batch_size = BATCH_SIZE,
)

dataset_val_tf = dataset_tokenized['validation'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = ['label'],
    shuffle = False,  # keep the original order so predictions line up with dataset['validation']['label']
    collate_fn = data_collator,
    batch_size = BATCH_SIZE,
)



<h3>Model setup/training</h3>

The transformers library also provides a sequence classifier; we load it with the same checkpoint as the tokenizer (BERT uncased) to ensure consistency between the tokenizer output and the input expected by the model:

In [None]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)
model.summary()


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
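
As a quick sanity check, the size of the classifier head in the summary can be reproduced by hand. The sketch assumes a hidden size of 768 for BERT-base, which is not printed in the summary itself; the encoder count is copied from the summary.

```python
# Sanity check of the parameter counts reported by model.summary().
# hidden_size = 768 is an assumption about BERT-base, not read from the summary.
hidden_size = 768
num_labels = 2

# Dense layer: one weight per (hidden unit, label) pair plus one bias per label.
classifier_params = hidden_size * num_labels + num_labels

# Encoder parameters as reported in the summary, plus the classification head.
encoder_params = 109_482_240
total_params = encoder_params + classifier_params

print("classifier head:", classifier_params)
print("total:          ", total_params)
```

These are the only parameters that start from a random initialization, which is why the loading message asks for training on a downstream task.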


We set up the training phase:
<ul>
<li>Due to the dataset size, hardware limitations, and since we only need fine-tuning, we limit ourselves to 1 epoch. This is acceptable because the goal of this project is to showcase how Hugging Face transformers work; for meaningful results we suggest multiple epochs.</li>
<li>We define a learning rate scheduler with exponential decay: the initial learning rate is of order $10^{-5}$ since the model is already pre-trained, and this setup lets us plan a decay within an epoch.</li>
<li>The scheduler is passed to an Adam optimizer, which typically provides good performance.</li>
</ul>
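
The exponential decay described in the list can be written out by hand. This is a sketch of the non-staircase formula, `initial_lr * decay_rate ** (step / decay_steps)`, using the initial rate and decay factor from the training cell. The step count assumes that the 3668 training samples are split into batches of 8 and that the last partial batch is kept.

```python
import math


def exponential_decay(step, initial_lr=5e-5, decay_steps=459, decay_rate=0.1):
    """Learning rate after `step` optimizer steps (continuous, non-staircase decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)


train_samples = 3668
batch_size = 8
# Assumption: the final partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)

for step in (0, steps_per_epoch // 2, steps_per_epoch):
    lr = exponential_decay(step, decay_steps=steps_per_epoch)
    print(f"step {step:4d}: learning rate {lr:.2e}")
```

With `decay_steps` equal to the total number of training steps, the learning rate ends the run at one tenth of its initial value.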

In [None]:
NUM_EPOCHS = 1

# number of training steps; note that dataset_train_tf is already divided into batches
s = NUM_EPOCHS * len(dataset_train_tf)
learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate = 5e-5,
    decay_steps = s,
    decay_rate = 0.1
)

model.compile(
    optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate_schedule),
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True),
    metrics = [tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(
    dataset_train_tf,
    validation_data = dataset_val_tf,
    epochs = NUM_EPOCHS,
)
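
The loss is configured with `from_logits = True` because the model outputs raw scores rather than probabilities. Below is a minimal NumPy sketch of what a sparse categorical cross-entropy on logits computes, averaged over the batch. The logits and labels are made up, and the helper is an illustration, not the Keras implementation.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean negative log-probability of the true class, computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each sample.
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return per_sample.mean()


# Hypothetical logits for two samples and two classes.
toy_logits = [[0.0, 0.0], [2.0, -2.0]]
toy_labels = [1, 0]

loss = sparse_categorical_crossentropy_from_logits(toy_logits, toy_labels)
print(f"loss: {loss:.4f}")
```

The first sample is an undecided prediction and contributes `log(2)`; the second is confidently correct and contributes almost nothing.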



In [None]:
#pd.DataFrame(history.history).plot(figsize = (8, 5))
#plt.grid(True)
#plt.show()

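The dataset also has an associated metric, which we can load with the evaluate library. First we compute the model predictions, which are returned as logits, and retrieve the predicted class as the index of the highest logit; then we compare the predictions with the true values. For the comparison to be meaningful the predictions must be in the same order as the reference labels, so the validation set must not be shuffled: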

In [None]:
preds = model.predict(dataset_val_tf)["logits"]
class_preds = np.argmax(preds, axis = 1)
#metric = evaluate.load("glue", "qqp")
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions = class_preds, references = dataset["validation"]["label"])




{'accuracy': 0.6029411764705882, 'f1': 0.7317880794701986}
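
The two reported numbers can be reproduced by hand on a toy example. The sketch below assumes that the F1 score is computed for the positive class (label 1, i.e. "paraphrase"). The logits and references are made up, and `accuracy_and_f1` is a hypothetical helper rather than the evaluate implementation.

```python
import numpy as np


def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1 (positive class = 1) for integer class labels."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    accuracy = float((predictions == references).mean())

    true_pos = int(((predictions == 1) & (references == 1)).sum())
    false_pos = int(((predictions == 1) & (references == 0)).sum())
    false_neg = int(((predictions == 0) & (references == 1)).sum())
    denominator = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denominator if denominator else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Hypothetical logits for four sentence pairs.
example_logits = np.array([[0.2, 1.5], [1.0, -0.3], [-0.5, 0.1], [0.4, 0.9]])
example_preds = np.argmax(example_logits, axis=1)  # index of the highest logit
example_refs = [1, 0, 0, 1]

scores = accuracy_and_f1(example_preds, example_refs)
print(scores)
```

Both scores depend on the predictions and the references being aligned sample by sample.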

This result has limited meaning due to the simple setup; different approaches may be tested, for example:
<ul>
<li>A higher number of epochs.</li>
<li>A different learning rate scheduler, for example a polynomial decay or a reduction on plateau of the validation loss.</li>
<li>Adding dropout to reduce overfitting.</li>
</ul>
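
As an illustration of the second suggestion, a polynomial decay can be sketched in a few lines. The end rate of 0 and the power of 1 (a linear decay) are illustrative choices, not values tested in this notebook; the initial rate and step count mirror the setup above.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=459, power=1.0):
    """Polynomial decay from initial_lr to end_lr over decay_steps steps."""
    # After decay_steps the rate stays at end_lr.
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


for step in (0, 115, 230, 345, 459):
    print(f"step {step:3d}: learning rate {polynomial_decay(step):.2e}")
```

With `power = 1.0` the rate falls in a straight line, whereas the exponential schedule drops faster in the early steps.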