## COMP8240 - Major project

The authors of the original paper released the code for their experiments,
along with their trained DistilBERT models, in the PyTorch Transformers
library for Python.

Replicating the original work involves downloading the GLUE benchmark datasets with the `datasets` library,
pre-processing them, and then fine-tuning the pre-trained DistilBERT
model in Python within the Google Colaboratory environment.

Training the model
involves writing Python code in a Google Colab notebook,
based on the "Fine-tune a pretrained model" tutorial provided by Hugging Face. The URL is provided below.

https://huggingface.co/docs/transformers/training

Another reference notebook is linked below:
https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification_quantization_inc.ipynb#scrollTo=zVvslsfMIrIh


Install the datasets and transformers packages

In [None]:
# Install datasets and transformers

! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Check the transformers version
import transformers

print(transformers.__version__)

4.23.1


Select the GLUE benchmark task to download, and set the model checkpoint and batch size.


In [None]:
# Set 'task' to one GLUE task per run, and re-run the notebook for each task.

# GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
task = "rte"

# model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # used for the sst2 dataset
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
max_train_samples = 100

Load the dataset

In [None]:
# Use datasets library for downloading data and metric used for the evaluation

from datasets import load_dataset, load_metric

In [None]:
# Pass the task name to load_dataset and load_metric. Note: "mnli-mm" is a special case that loads the "mnli" dataset.
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)

Downloading and preparing dataset glue/rte (download: 680.81 KiB, generated: 1.83 MiB, post-processed: Unknown size, total: 2.49 MiB) to /root/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data:   0%|          | 0.00/697k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2490 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/277 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]



```
Downloading builder script: 100%
28.8k/28.8k [00:00<00:00, 480kB/s]
Downloading metadata: 100%
28.7k/28.7k [00:00<00:00, 462kB/s]
Downloading readme: 100%
22.0k/22.0k [00:00<00:00, 360kB/s]
Downloading and preparing dataset glue/stsb (download: 784.05 KiB, generated: 1.09 MiB, post-processed: Unknown size, total: 1.86 MiB) to /root/.cache/huggingface/datasets/glue/stsb/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Downloading data: 100%
803k/803k [00:00<00:00, 8.85MB/s]
Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/stsb/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
100%
3/3 [00:00<00:00, 57.24it/s]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:4: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  after removing the cwd from sys.path.
Downloading builder script:
5.76k/? [00:00<00:00, 17.1kB/s]
```
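The special-casing of `mnli-mm` can be summarised in one place. The sketch below is illustrative only: `resolve_glue_task` is a hypothetical helper name, and the mapping rules are copied from the notebook cells (the dataset configuration used here and the validation split chosen later when the trainer is built).

```python
# Hypothetical helper (not part of the notebook); the rules mirror the notebook cells.
def resolve_glue_task(task):
    """Return (dataset_config, validation_split) for a GLUE task name."""
    # "mnli-mm" is not a separate dataset: it is MNLI evaluated on the mismatched split.
    dataset_config = "mnli" if task == "mnli-mm" else task
    if task == "mnli-mm":
        validation_split = "validation_mismatched"
    elif task == "mnli":
        validation_split = "validation_matched"
    else:
        validation_split = "validation"
    return dataset_config, validation_split


for name in ["rte", "mnli", "mnli-mm"]:
    print(name, resolve_glue_task(name))
```

Keeping both rules in a single function makes it harder for the dataset name and the validation split to drift apart when switching tasks.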



Pre-process the data

The loaded data must be pre-processed before it is fed to the model.

For this, the `AutoTokenizer.from_pretrained` method instantiates
the tokenizer, which tokenizes the data and converts it into the format
the model expects.

The `distilbert-base-uncased` checkpoint is passed as the
pretrained model parameter.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)



```
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}
```

Dictionary to keep track of column names containing the sentences

In [None]:
# Map each task to the column names that hold its sentences:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [None]:
# Double-check that the mapping works on the current dataset:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence 1: No Weapons of Mass Destruction Found in Iraq Yet.
Sentence 2: Weapons of Mass Destruction Found in Iraq.


In [None]:
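# Function that will preprocess the samples.
# Note: max_seq_length and padding are defined here but are not passed to the tokenizer calls below.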
max_seq_length = min(128, tokenizer.model_max_length)
padding = "max_length"

def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

In [None]:
# Use the 'map' method to apply the preprocess function to every sentence in the dataset.

encoded_dataset = dataset.map(preprocess_function, batched=True)

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

Fine-tune the model

Download the pre-trained model and fine-tune it.

The `AutoModelForSequenceClassification` class is used since the tasks are sentence classification tasks.

The number of labels is 2 for every task except MNLI, which has 3 labels, and STS-B, which is a regression task and uses a single output.

In [None]:
# download the pretrained model and fine-tune it
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)




```
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/043235d6088ecd3dd5fb5ca3592b6913fd516027/pytorch_model.bin
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```
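The label-count rule used when loading the model can be checked in isolation. This is a minimal sketch: `num_labels_for` is a hypothetical name, and the rule itself is the one in the notebook cell (3 for the MNLI variants, 1 for STS-B, 2 otherwise).

```python
# Hypothetical helper (not part of the notebook) that restates the num_labels rule.
def num_labels_for(task):
    """Return the size of the classification head for a GLUE task."""
    if task.startswith("mnli"):
        return 3  # entailment / neutral / contradiction
    if task == "stsb":
        return 1  # regression: a single similarity score
    return 2      # all remaining GLUE tasks are binary


for name in ["rte", "mnli-mm", "stsb"]:
    print(name, num_labels_for(name))
```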



Initialise the training arguments, which contain the attributes used to customise training.

In [None]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
# Function to compute metrics

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
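To make the post-processing in `compute_metrics` concrete, the sketch below reproduces the two branches with plain Python lists instead of NumPy, so it runs without any dependencies. The names `to_predictions` and `accuracy` and the toy logits are made up for illustration; the real metric values come from `metric.compute`, not from this helper.

```python
# Illustrative only: plain-Python stand-ins for the argmax / regression branches.
def to_predictions(logits, task):
    """Turn raw model outputs into predictions, as compute_metrics does."""
    if task == "stsb":
        # Regression: the model emits one value per example, so take column 0.
        return [row[0] for row in logits]
    # Classification: pick the index of the largest logit (first index on ties).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


toy_logits = [[0.2, 1.3], [2.0, -0.5], [0.1, 0.4]]  # made-up values
toy_labels = [1, 0, 0]

preds = to_predictions(toy_logits, "rte")
print(preds)                        # [1, 0, 1]
print(accuracy(preds, toy_labels))  # two of the three predictions match
```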

In [None]:
# Instantiate trainer

from transformers import default_data_collator

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
def train_func(model):
    trainer.model_wrapped = model
    trainer.model = model    
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.save_model() 
    trainer.save_metrics("train", metrics)
    trainer.save_state()

In [None]:
# finetune our model by just calling the train method:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence1, idx, sentence2. If sentence1, idx, sentence2 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2490
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 780
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.695367,0.512635
2,No log,0.671643,0.559567
3,No log,0.697532,0.628159
4,0.599100,0.768304,0.620939
5,0.599100,0.828201,0.610108


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence1, idx, sentence2. If sentence1, idx, sentence2 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-finetuned-rte/checkpoint-156
Configuration saved in distilbert-base-uncased-finetuned-rte/checkpoint-156/config.json
Model weights saved in distilbert-base-uncased-finetuned-rte/checkpoint-156/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-rte/checkpoint-156/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-rte/checkpoint-156/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been i

TrainOutput(global_step=780, training_loss=0.4891508053510617, metrics={'train_runtime': 11077.0478, 'train_samples_per_second': 1.124, 'train_steps_per_second': 0.07, 'total_flos': 527113134340248.0, 'train_loss': 0.4891508053510617, 'epoch': 5.0})
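The "Total optimization steps" figure in the log follows from the dataset size, batch size, and epoch count. The sketch below assumes a single device, gradient accumulation of 1 (as the log reports), and that the last partial batch of each epoch is kept; `total_optimization_steps` is a hypothetical helper name.

```python
import math


# Hypothetical helper: steps per epoch is the number of batches, rounded up.
def total_optimization_steps(num_examples, batch_size, num_epochs):
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# RTE run above: 2490 examples, batch size 16, 5 epochs.
print(math.ceil(2490 / 16))                   # 156 steps per epoch
print(total_optimization_steps(2490, 16, 5))  # 780
```

The 156 steps per epoch also match the first checkpoint name in the log, `checkpoint-156`, which is written at the end of epoch 1.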

```
Epoch	Training Loss	Validation Loss	Matthews Correlation
    1	     0.523800	       0.532709	            0.409330
    2	     0.346000	       0.512254	            0.469412
    3	     0.233000	       0.557987	            0.532073
    4	     0.170300	       0.807962	            0.535357
    5	     0.117500	       0.849115	            0.546065

cola
***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2675
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  [2675/2675 1:42:26, Epoch 5/5]
Epoch	Training Loss	Validation Loss	Matthews Correlation
1	0.525700	0.531297	0.421483
2	0.349900	0.519009	0.486351
3	0.242300	0.548332	0.537004
4	0.177200	0.751369	0.542127
5	0.130100	0.794502	0.539626
```
---

```
mnli
***** Running training *****
  Num examples = 392702
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 147264
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [147264/147264 3:05:37, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy
1	0.510700	0.509039	0.805502
2	0.435200	0.515022	0.818441
3	0.305000	0.714891	0.818237
```
---
```
mrpc
***** Running training *****
  Num examples = 3668
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1150
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [1150/1150 2:10:05, Epoch 5/5]
Epoch	Training Loss	Validation Loss	Accuracy	F1
1	No log	0.419512	0.808824	0.865979
2	No log	0.380192	0.845588	0.895175
3	0.450300	0.489498	0.843137	0.891892
4	0.450300	0.564360	0.850490	0.896435
5	0.186800	0.619606	0.852941	0.898649
```
---
```
qnli
***** Running training *****
  Num examples = 104743
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 19641
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [19641/19641 46:57, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy
1	0.346800	0.290548	0.882299
2	0.249100	0.308109	0.881933
3	0.182800	0.355076	0.893282
```
---
```
qqp
***** Running training *****
  Num examples = 363846
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 68223
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [68223/68223 2:00:26, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy	F1
1	0.273800	0.262496	0.886000	0.850750
2	0.205600	0.268540	0.897156	0.864666
3	0.156400	0.330116	0.902350	0.869531
```
---
```
rte
***** Running training *****
  Num examples = 2490
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 780
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [780/780 3:04:10, Epoch 5/5]
Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.695367	0.512635
2	No log	0.671643	0.559567
3	No log	0.697532	0.628159
4	0.599100	0.768304	0.620939
5	0.599100	0.828201	0.610108
```
---
```
sst2
***** Running training *****
  Num examples = 67349
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 12630
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [12630/12630 15:21, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy
1	0.189600	0.348546	0.903670
2	0.119000	0.371430	0.909404
3	0.089800	0.416540	0.904817
```
---
```
stsb
***** Running training *****
  Num examples = 5749
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1800
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [1800/1800 2:41:17, Epoch 5/5]
Epoch	Training Loss	Validation Loss	Pearson	Spearmanr
1	No log	0.701316	0.858809	0.856591
2	1.054900	0.557682	0.870256	0.866578
3	0.408200	0.591738	0.867210	0.864538
4	0.408200	0.572526	0.870534	0.867496
5	0.252100	0.551649	0.871539	0.867857
The following columns in the evaluation set don't have a correspondin
```
---
```
wnli
***** Running training *****
  Num examples = 635
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [200/200 21:18, Epoch 5/5]
Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.691467	0.563380
2	No log	0.694759	0.535211
3	No log	0.695949	0.535211
4	No log	0.698332	0.478873
5	No log	0.698421	0.492958
```







In [None]:
# check with the evaluate method for the best model
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence1, idx, sentence2. If sentence1, idx, sentence2 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16


{'eval_loss': 0.6975318789482117,
 'eval_accuracy': 0.628158844765343,
 'eval_runtime': 77.3666,
 'eval_samples_per_second': 3.58,
 'eval_steps_per_second': 0.233,
 'epoch': 5.0}
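The reported `eval_accuracy` of about 0.628 equals the epoch 3 value in the training table rather than the epoch 5 value, which is consistent with `load_best_model_at_end=True` and `metric_for_best_model` being set. The sketch below only illustrates the selection rule on the logged numbers; `best_epoch` is a hypothetical helper, not the Trainer's internal implementation.

```python
# Per-epoch validation accuracy copied from the RTE training log above.
rte_accuracy_by_epoch = {
    1: 0.512635,
    2: 0.559567,
    3: 0.628159,
    4: 0.620939,
    5: 0.610108,
}


# Hypothetical helper: pick the epoch whose metric is best.
def best_epoch(metric_by_epoch, greater_is_better=True):
    chooser = max if greater_is_better else min
    return chooser(metric_by_epoch, key=metric_by_epoch.get)


epoch = best_epoch(rte_accuracy_by_epoch)
print(epoch, rte_accuracy_by_epoch[epoch])  # 3 0.628159
```

For a metric where lower is better, such as validation loss, the same helper can be called with `greater_is_better=False`.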

```
cola
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 [66/66 00:30]
{'eval_loss': 0.849115252494812,
 'eval_matthews_correlation': 0.5460649174667868,
 'eval_runtime': 31.1777,
 'eval_samples_per_second': 33.453,
 'eval_steps_per_second': 2.117,
 'epoch': 5.0}

 ***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 [66/66 00:34]
{'eval_loss': 0.7513687610626221,
 'eval_matthews_correlation': 0.5421267938482849,
 'eval_runtime': 35.5279,
 'eval_samples_per_second': 29.357,
 'eval_steps_per_second': 1.858,
 'epoch': 5.0}
```
---
```
mnli
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 8
 [1227/1227 00:23]
{'eval_loss': 0.5150220394134521,
 'eval_accuracy': 0.8184411614875191,
 'eval_runtime': 23.7727,
 'eval_samples_per_second': 412.868,
 'eval_steps_per_second': 51.614,
 'epoch': 3.0}
```
---
```
mrpc
***** Running Evaluation *****
  Num examples = 408
  Batch size = 16
 [26/26 00:49]
{'eval_loss': 0.6196056604385376,
 'eval_accuracy': 0.8529411764705882,
 'eval_f1': 0.8986486486486487,
 'eval_runtime': 51.1012,
 'eval_samples_per_second': 7.984,
 'eval_steps_per_second': 0.509,
 'epoch': 5.0}
```
---
```
qnli
***** Running Evaluation *****
  Num examples = 5463
  Batch size = 16
 [342/342 00:16]
{'eval_loss': 0.35507574677467346,
 'eval_accuracy': 0.8932820794435292,
 'eval_runtime': 16.3771,
 'eval_samples_per_second': 333.575,
 'eval_steps_per_second': 20.883,
 'epoch': 3.0}
```
---
```
qqp
***** Running Evaluation *****
  Num examples = 40430
  Batch size = 16
 [2527/2527 01:16]
{'eval_loss': 0.33011600375175476,
 'eval_accuracy': 0.9023497402918624,
 'eval_f1': 0.8695307336417714,
 'eval_runtime': 77.0041,
 'eval_samples_per_second': 525.037,
 'eval_steps_per_second': 32.816,
 'epoch': 3.0}
```
---
```
rte
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16
 [18/18 01:12]
{'eval_loss': 0.6975318789482117,
 'eval_accuracy': 0.628158844765343,
 'eval_runtime': 77.3666,
 'eval_samples_per_second': 3.58,
 'eval_steps_per_second': 0.233,
 'epoch': 5.0}
```
---
```
sst2
***** Running Evaluation *****
  Num examples = 872
  Batch size = 16
 [55/55 00:01]
{'eval_loss': 0.3714296221733093,
 'eval_accuracy': 0.9094036697247706,
 'eval_runtime': 1.321,
 'eval_samples_per_second': 660.12,
 'eval_steps_per_second': 41.636,
 'epoch': 3.0}
```
---
```
stsb
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 16
 [94/94 01:43]
{'eval_loss': 0.5516488552093506,
 'eval_pearson': 0.8715385809065247,
 'eval_spearmanr': 0.8678567440532469,
 'eval_runtime': 104.5147,
 'eval_samples_per_second': 14.352,
 'eval_steps_per_second': 0.899,
 'epoch': 5.0}
```
---
```
wnli
***** Running Evaluation *****
  Num examples = 71
  Batch size = 16
 [5/5 00:06]
{'eval_loss': 0.6914669275283813,
 'eval_accuracy': 0.5633802816901409,
 'eval_runtime': 8.5434,
 'eval_samples_per_second': 8.311,
 'eval_steps_per_second': 0.585,
 'epoch': 5.0}
```





Define an evaluation function

In [None]:
metric_name = "eval_" + ("pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy")

def eval_func(model):
    trainer.model = model
    metrics = trainer.evaluate()
    return metrics.get(metric_name)

fp_model_result = eval_func(model)
print(f"The full-precision **{task}** model has an {metric_name} of {round(fp_model_result * 100, 2)}.")

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence1, idx, sentence2. If sentence1, idx, sentence2 are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16


The full-precision **rte** model has an eval_accuracy of 62.82.


```
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 [66/66 01:06]
The full-precision **cola** model has an eval_matthews_correlation of 54.61.

***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
 [66/66 08:04]
The full-precision **cola** model has an eval_matthews_correlation of 54.21.
```
---
```
***** Running Evaluation *****
  Num examples = 9815
  Batch size = 8
 [1227/1227 00:49]
The full-precision **mnli** model has an eval_accuracy of 81.84.
```
---
```
***** Running Evaluation *****
  Num examples = 408
  Batch size = 16
 [26/26 01:36]
The full-precision **mrpc** model has an eval_accuracy of 85.29.
```
---
```
***** Running Evaluation *****
  Num examples = 5463
  Batch size = 16
 [342/342 00:32]
The full-precision **qnli** model has an eval_accuracy of 89.33.
```
---
```
***** Running Evaluation *****
  Num examples = 40430
  Batch size = 16
 [2527/2527 02:35]
The full-precision **qqp** model has an eval_accuracy of 90.23.
```
---
```
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16
 [18/18 02:44]
The full-precision **rte** model has an eval_accuracy of 62.82.
```
---
```
***** Running Evaluation *****
  Num examples = 872
  Batch size = 16
 [55/55 00:02]
The full-precision **sst2** model has an eval_accuracy of 90.94.
```
---
```
***** Running Evaluation *****
  Num examples = 1500
  Batch size = 16
 [94/94 03:26]
The full-precision **stsb** model has an eval_pearson of 87.15.
```
---
```
***** Running Evaluation *****
  Num examples = 71
  Batch size = 16
 [5/5 00:15]
The full-precision **wnli** model has an eval_accuracy of 56.34.
```




