<a href="https://colab.research.google.com/github/SohaHussain/HuggingFace-course/blob/main/fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.18.0-py3-none-any.whl (311 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecti

#### loading data from datasets library

In [3]:
from datasets import load_dataset

We are using the MRPC dataset, which contains 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases (i.e., whether both mean the same thing).

#### loading and caching the MRPC (Microsoft Research Paraphrase Corpus) dataset from the GLUE benchmark

In [4]:
raw_dataset = load_dataset("glue","mrpc")
raw_dataset

Downloading and preparing dataset glue/mrpc (download: 1.43 MiB, generated: 1.43 MiB, post-processed: Unknown size, total: 2.85 MiB) to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

#### to access data in raw_dataset object, use indexing just like in a dict

In [5]:
raw_train_dataset = raw_dataset["train"]
raw_train_dataset.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}

Here label is of type ClassLabel, and the mapping from integers to label names is stored in its names list: 0 corresponds to not_equivalent and 1 to equivalent.
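
As a minimal sketch of that mapping, the snippet below copies the names list from the features output above into a plain Python list and looks up a label by index. The variable `label_names` and the helper `label_to_name` are illustrative names chosen here, not part of the datasets library.

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["not_equivalent", "equivalent"]


def label_to_name(label: int) -> str:
    """Translate an integer label into its human-readable name."""
    return label_names[label]


print(label_to_name(0))  # not_equivalent
print(label_to_name(1))  # equivalent
```

With the real dataset, the ClassLabel feature's own `int2str` method (for example `raw_train_dataset.features["label"].int2str(1)`) should give the same answer.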

In [6]:
raw_train_dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

In [7]:
raw_val_dataset = raw_dataset["validation"]
raw_val_dataset[10]

{'idx': 79,
 'label': 1,
 'sentence1': 'The delegates said raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers .',
 'sentence2': 'Bin Laden ’ s men pointed out that raising and distributing funds has been complicated by the U.S. crackdown on jihadi charitable foundations , bank accounts of terror-related organizations and money transfers .'}

In [8]:
raw_test_dataset = raw_dataset["test"]
raw_test_dataset[10]

{'idx': 10,
 'label': 1,
 'sentence1': 'Consumers would still have to get a descrambling security card from their cable operator to plug into the set .',
 'sentence2': 'To watch pay television , consumers would insert into the set a security card provided by their cable service .'}

### Preprocessing

We need to convert the text into numbers that the model can make sense of.

In [9]:
from transformers import AutoTokenizer
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


We can’t just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects.

In [10]:
# example usage

inputs = tokenizer("this is first sentence.","this is second sentence.")
inputs

{'input_ids': [101, 1142, 1110, 1148, 5650, 119, 102, 1142, 1110, 1248, 5650, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here *token_type_ids* tells the model which tokens belong to the first sentence (0) and which belong to the second sentence (1).

In [11]:
# example usage
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'second',
 'sentence',
 '.',
 '[SEP]']

We see that the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences.
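
The layout can be sketched in plain Python. This is only an illustration of the format shown above, not how the tokenizer is implemented; the helper name `build_pair` is hypothetical, and the inputs are the already-split tokens from the example.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble a BERT-style sentence pair and its segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence, and the first [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "first", "sentence", "."],
    ["this", "is", "second", "sentence", "."],
)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

The printed segment ids match the token_type_ids returned by the tokenizer for the same pair.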

If we select a different checkpoint, we won’t necessarily have the token_type_ids in our tokenized inputs (for instance, they’re not returned if you use a DistilBERT model). They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

We can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options. So, one way to preprocess the training dataset is:

In [12]:
tokenized_dataset = tokenizer(
    raw_dataset["train"]["sentence1"],
    raw_dataset["train"]["sentence2"],
    padding = True,
    truncation = True
)

This method has the disadvantage of returning a dictionary (with the keys input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It also only works if you have enough RAM to hold the whole dataset during tokenization.

To keep the data as a dataset, we will use the Dataset.map() method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset, so let’s define a function that tokenizes our inputs:

In [13]:
def tokenize_func(example):
    # No padding here: each batch is padded later, when it is built.
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys input_ids, attention_mask, and token_type_ids. Note that it also works if the example dictionary contains several samples (each key as a list of sentences) since the tokenizer works on lists of pairs of sentences, as seen before. This will allow us to use the option batched=True in our call to map(), which will greatly speed up the tokenization.

Note that we’ve left the padding argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!



Now we can apply the tokenization function on all our datasets at once. We’re using batched=True in our call to map so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing.

In [14]:
tokenized_dataset = raw_dataset.map(tokenize_func,batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as dynamic padding.



### Dynamic padding

The Transformers library provides such a function via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [15]:
from transformers import DataCollatorWithPadding

data_col = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = "tf")

Let’s grab a few samples from our training set that we would like to batch together. Here, we remove the columns idx, sentence1, and sentence2, as they won’t be needed and contain strings (and we can’t create tensors with strings), and have a look at the lengths of each entry in the batch:



In [16]:
sample = tokenized_dataset["train"][:10]
sample = {k:v for k,v in sample.items() if k not in ["idx","sentence1","sentence2"]}
[len(x) for x in sample["input_ids"]]

[52, 59, 47, 69, 60, 50, 66, 32, 48, 64]

Dynamic padding means the samples in this batch should all be padded to a length of 69, the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let’s double-check that our data_collator is dynamically padding the batch properly:

In [17]:
batch = data_col(sample)
{k:v.shape for k,v in batch.items()}

{'attention_mask': TensorShape([10, 69]),
 'input_ids': TensorShape([10, 69]),
 'labels': TensorShape([10]),
 'token_type_ids': TensorShape([10, 69])}
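
To make the idea concrete, here is a rough NumPy sketch of padding a batch to its own longest sequence. It is not the DataCollatorWithPadding implementation; the function `pad_batch`, the `pad_id=0` default, and the dummy sequences are assumptions made for illustration, and only the ten lengths are taken from the output above.

```python
import numpy as np


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, : len(seq)] = seq
        attention_mask[row, : len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask


# Dummy sequences with the same lengths as the ten samples above.
lengths = [52, 59, 47, 69, 60, 50, 66, 32, 48, 64]
sequences = [[1] * n for n in lengths]

input_ids, attention_mask = pad_batch(sequences)
print(input_ids.shape)  # (10, 69)

# Padding positions needed with dynamic padding versus a fixed length of 512
# (512 is used here only as a comparison point for a model-wide maximum).
print(input_ids.size - attention_mask.sum())  # 143
print(len(lengths) * 512 - sum(lengths))  # 4573
```

For this batch, padding to the batch maximum adds 143 padding positions, while padding every sample to 512 would add 4,573.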

Now that we have our dataset and a data collator, we need to put them together. We could manually load batches and collate them, but that’s a lot of work, and probably not very performant either. Instead, there’s a simple method that offers a performant solution to this problem: to_tf_dataset(). This will wrap a tf.data.Dataset around your dataset, with an optional collation function. tf.data.Dataset is a native TensorFlow format that Keras can use for model.fit(), so this one method immediately converts a 🤗 Dataset to a format that’s ready for training.

In [18]:
import tensorflow as tf

tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns = ["attention_mask","input_ids","token_type_ids"],
    label_cols = ["labels"],
    shuffle = True,
    collate_fn = data_col,
    batch_size = 8,
)

In [19]:
tf_val_dataset = tokenized_dataset["validation"].to_tf_dataset(
    columns = ["attention_mask","input_ids","token_type_ids"],
    label_cols = ["labels"],
    shuffle = False,
    collate_fn = data_col,
    batch_size = 8,
)

### Fine-tuning

In [20]:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint,num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer="adam",
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(tf_train_dataset,
          validation_data = tf_val_dataset)



<keras.callbacks.History at 0x7fc9f834dad0>

The code above runs, but the loss declines only slowly or sporadically. The primary cause is the learning rate. When we pass Keras the name of an optimizer as a string, Keras initializes that optimizer with default values for all parameters, including the learning rate. Transformer models benefit from a much lower learning rate than the Adam default of 1e-3 (0.001). A value of 5e-5 (0.00005), twenty times lower, is a much better starting point.

We can also slowly reduce the learning rate over the course of training. In Keras, the best way to do this is to use a learning rate scheduler. A good one to use is PolynomialDecay: despite the name, with default settings it simply decays the learning rate linearly from the initial value to the final value over the course of training, which is exactly what we want. To use a scheduler correctly, though, we need to tell it how long training is going to be. We compute that as num_train_steps below.

In [26]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = 8
num_epochs = 5

# The number of training steps is the number of samples in the dataset, divided by the
# batch size, then multiplied by the total number of epochs. tf_train_dataset is already
# batched, so its length is the number of batches per epoch, not the number of samples.
num_train_steps = len(tf_train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=lr_scheduler)
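
To see what the schedule configured above does, the following plain-Python sketch evaluates the linear case of the polynomial decay formula (power 1). The function name `linear_decay` is illustrative, and the step count assumes that the 3,668 training samples are split into batches of 8 with the last partial batch kept; the real value comes from `len(tf_train_dataset)` and may differ by a batch.

```python
import math


def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=2295):
    """Learning rate after `step` optimizer steps, decaying linearly to end_lr."""
    step = min(step, decay_steps)  # the rate stays at end_lr once decay is over
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr


# Assumption: 3,668 training samples in batches of 8, keeping the last partial batch.
steps_per_epoch = math.ceil(3668 / 8)
num_train_steps = steps_per_epoch * 5
print(steps_per_epoch, num_train_steps)  # 459 2295

# Learning rate at the start, roughly halfway, and at the end of training.
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

The rate starts at 5e-5, is about half of that midway through training, and reaches 0.0 on the final step.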


Let’s reload the model to reset the changes to the weights from the training run we just did, and then compile it with the new optimizer.

In [27]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint,num_labels=2)
loss=SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
model.fit(tf_train_dataset, validation_data=tf_val_dataset, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc977706810>

### Model predictions

In [29]:
preds = model.predict(tf_val_dataset)["logits"]

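This returns the logits from the output head of the model, one per class.
We can convert these logits into the model’s class predictions by using argmax to find the highest logit, which corresponds to the most likely class. The softmax in the next cell first turns the logits into probabilities; it does not change which class argmax selects.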

In [31]:
import numpy as np
probabilities = tf.nn.softmax(preds)
class_preds = np.argmax(probabilities,axis=1)

In [32]:
print(preds.shape, class_preds.shape)

(408, 2) (408,)
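
As a small check that the softmax step does not affect the predicted classes, the sketch below applies a hand-written softmax to made-up logits and compares the two argmax results. The logits are invented for illustration and are not taken from the model above.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for three sentence pairs and two classes.
logits = np.array([[-1.2, 2.3], [0.8, -0.5], [0.1, 0.4]])

probabilities = softmax(logits)
print(probabilities.sum(axis=-1))  # each row sums to 1 (up to rounding)

# Softmax preserves the ordering within each row, so both calls pick the same classes.
print(np.argmax(logits, axis=1))  # [1 0 1]
print(np.argmax(probabilities, axis=1))  # [1 0 1]
```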


Let’s use these predictions to compute some metrics. We can load the metrics associated with the dataset as easily as we loaded the dataset, this time with the load_metric() function. The object returned has a compute() method we can use to do the metric calculation.

In [34]:
from datasets import load_metric
metric = load_metric("glue","mrpc")
metric.compute(predictions=class_preds, references=raw_dataset["validation"]["label"])

{'accuracy': 0.8504901960784313, 'f1': 0.8950086058519794}
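
For intuition about the two numbers, here is a hand-rolled sketch of accuracy and of F1 for the positive class (label 1, equivalent). It is not the code behind load_metric; the function name `accuracy_and_f1` and the toy predictions and labels are made up for illustration.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy plus F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and labels, not taken from the run above.
toy_predictions = [1, 0, 1, 1, 0, 1]
toy_references = [1, 0, 0, 1, 1, 1]
result = accuracy_and_f1(toy_predictions, toy_references)
print(result)
```

On the toy data, four of six predictions are correct, and precision and recall are both 3/4, so the F1 score is 0.75.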