## NOTEBOOK: Fine-tune a BERT-Tiny Model

This notebook uses the Idiom Software Stack and the Accuracy Estimator tool to evaluate a BERT-Tiny model, estimate its accuracy under simulated Envise conditions, and fine-tune it for Envise.

We use the BERT question-answering model [mrm8488/bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2). It is relatively small, so all the steps complete within 30 minutes. You can find the model details on Hugging Face.

The dataset for this model is ``squad_v2``.

Run this Jupyter notebook in an environment that has a GPU.

This notebook continues from the `bert-tiny-profile.ipynb` notebook. For the full context, run that notebook before continuing with this one.

**PROCESS**

This notebook fine-tunes the model in three steps:

* First, evaluate the out-of-box accuracy of the FP32 BERT-Tiny model.

* Next, estimate the model's accuracy in Envise precision by running the model through the Idiom API, which calls the Envise Accuracy Estimator. This is an estimate, and it will most likely be lower than the original FP32 accuracy.

* Last, apply the fine-tuning algorithm by calling the Idiom API, which returns a fine-tuned FP32 model whose accuracy score is closer to the original FP32 score.

**SYSTEM COMPONENT MINIMUM REQUIREMENTS**

* CPU: Any X86-64 architecture with 4 cores
* RAM: 64 GB memory
* GPU: One Nvidia 2080
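
Before installing dependencies, it can help to see what the host actually provides. The following is a minimal sketch using only the Python standard library; the function name `describe_host` is hypothetical, the memory query works on POSIX systems only, and the presence of `nvidia-smi` on the `PATH` is used as a rough proxy for a usable GPU rather than as proof of one.

```python
import os
import shutil


def describe_host() -> dict:
    """Collect the host facts relevant to the minimum requirements above."""
    facts = {"cpu_cores": os.cpu_count()}
    try:
        # POSIX only: physical memory = page size * number of physical pages.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        facts["ram_gib"] = round(ram_bytes / 1024**3, 1)
    except (AttributeError, ValueError, OSError):
        facts["ram_gib"] = None  # memory size not available on this platform
    # Assumption: finding the NVIDIA driver CLI suggests a GPU is installed.
    facts["nvidia_smi_found"] = shutil.which("nvidia-smi") is not None
    return facts


print(describe_host())
```

The sketch reports values instead of pass/fail flags because a machine sold with 64 GB usually reports slightly less usable memory, so a strict comparison could fail on a host that meets the requirement.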

#### Install dependencies

In [1]:
!pip install -r requirements.txt



#### Set up Imports

In [2]:
import logging
from dataclasses import dataclass
from dataclasses import field
from typing import Optional
import torch
from trainer_qa import QuestionAnsweringTrainer
from transformers import HfArgumentParser
from transformers import TrainingArguments
from transformers.utils import check_min_version
from transformers.utils.versions import require_version



#### Set up Idiom Imports

In [3]:
from idiom.ml.torch import setup_for_evaluation
from idiom.ml.torch import setup_for_export
from idiom.ml.torch import setup_for_tuning

#### Set up Imports from support scripts 

In [4]:
from bert_args import ModelArguments, IdiomMLArguments, DataTrainingArguments
from trainer_support import get_trainer_support

#### Version Sanity Checks

In [5]:
# Raises an error if the minimum required version of Transformers is not installed. Remove at your own risk.
check_min_version("4.16.2")

require_version(
    "datasets>=1.8.0",
    "To fix: pip install -r examples/pytorch/question-answering/requirements.txt",
)

logger = logging.getLogger(__name__)


BERT fine-tuning and evaluation are parameterized with many arguments, most of which keep their default values.

The arguments tailored for this notebook are in the JSON file ``tiny-bert-args.json``.
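
To illustrate how one flat JSON object can feed several argument groups, here is a minimal standard-library sketch. It shows the general idea only and is not the `HfArgumentParser` implementation: the dataclasses `IdiomFlags` and `DataFlags` and the helper `split_args` are hypothetical stand-ins, and the inline JSON string is an invented sample, not the contents of ``tiny-bert-args.json``.

```python
import dataclasses
import json
from dataclasses import dataclass


@dataclass
class IdiomFlags:  # hypothetical stand-in for IdiomMLArguments
    do_envise_eval: bool = False
    finetune_with_dft: bool = False


@dataclass
class DataFlags:  # hypothetical stand-in for DataTrainingArguments
    dataset_name: str = ""
    max_seq_length: int = 128


def split_args(raw: dict, *dataclass_types):
    """Build one instance per dataclass from the keys that match its fields."""
    instances = []
    for dtype in dataclass_types:
        names = {f.name for f in dataclasses.fields(dtype)}
        matching = {k: v for k, v in raw.items() if k in names}
        instances.append(dtype(**matching))
    return tuple(instances)


# Invented sample input; keys missing from the JSON keep their defaults.
raw = json.loads(
    '{"do_envise_eval": true, "finetune_with_dft": true, '
    '"dataset_name": "squad_v2", "max_seq_length": 384}'
)
idiom_flags, data_flags = split_args(raw, IdiomFlags, DataFlags)
print(idiom_flags)
print(data_flags)
```

Keeping each group in its own dataclass means a cell can test a single flag, such as `idiom_ml_args.do_envise_eval`, without knowing about the other groups.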

In [6]:
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser(
    (
        ModelArguments,
        IdiomMLArguments,
        DataTrainingArguments,
        TrainingArguments,
    )
)
(
    model_args,
    idiom_ml_args,
    data_args,
    training_args,
) = parser.parse_json_file(json_file="tiny-bert-args.json")
# Force single-GPU execution by overriding the private `_n_gpu` attribute.
training_args._n_gpu = 1
print(f'{model_args}, \n{idiom_ml_args}, \n{data_args}, \n{training_args}')


ModelArguments(model_name_or_path='mrm8488/bert-tiny-finetuned-squadv2', config_name=None, tokenizer_name=None, cache_dir=None, model_revision='main', use_auth_token=False), 
IdiomMLArguments(do_envise_eval=True, finetune_with_dft=True, finetune_with_ept=False, idiom_ml_seed=42, setup_for_export=True), 
DataTrainingArguments(dataset_name='squad_v2', dataset_config_name=None, train_file=None, validation_file=None, test_file=None, overwrite_cache=False, preprocessing_num_workers=None, max_seq_length=384, pad_to_max_length=True, max_train_samples=None, max_eval_samples=None, max_predict_samples=None, version_2_with_negative=True, null_score_diff_threshold=0.0, doc_stride=128, n_best_size=20, max_answer_length=30), 
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug

The notebook needs several supporting pieces: the downloaded model and tokenizer, the training and evaluation datasets, and the pre- and post-processing functions.
All of these have been placed outside the notebook.

The next cell fetches them.

In [7]:
model_params, all_datasets, other_params \
    = get_trainer_support(model_args, idiom_ml_args, data_args, training_args, logger)

trainer, model = model_params['trainer'], model_params['model']
eval_dataset, train_dataset  = all_datasets['eval_dataset'], all_datasets['train_dataset']
eval_examples = all_datasets['eval_examples']
tokenizer, data_collator = other_params['tokenizer'], other_params['data_collator']
post_processing_function  = other_params['post_processing_function']
compute_metrics = other_params['compute_metrics']
last_checkpoint = other_params['last_checkpoint']


08/19/2022 17:25:15 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_

100%|█████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 456.03it/s]
[INFO|configuration_utils.py:644] 2022-08-19 17:25:15,686 >> loading configuration file https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2/resolve/main/config.json from cache at /home/auro/.cache/huggingface/transformers/1c9c47debcf1ea704edc79a69cba3adee79cab3027129d23952abb913d834dc6.21458fa63aa73ccb6cf8f15144ae7279582dfa5bf1b3c4c7091f60b0a047af05
[INFO|configuration_utils.py:680] 2022-08-19 17:25:15,688 >> Model config BertConfig {
  "_name_or_path": "mrm8488/bert-tiny-finetuned-squadv2",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_laye

08/19/2022 17:25:16 - INFO - datasets.fingerprint - Parameter 'function'=<function get_trainer_support.<locals>.prepare_validation_features at 0x7fdc800d7310> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead.


First, we evaluate the model as we received it, i.e., with FP32 precision.

The accuracy metric we track is `eval_exact`, the Exact Match score.

Exact Match is a match/no-match measure of whether the evaluation output matches the
ground truth answer exactly. This is a strict metric.
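
As a rough illustration of how strict the metric is, the sketch below scores a single prediction. It assumes the normalization conventions commonly used for SQuAD-style evaluation (lowercasing, then removing punctuation, English articles, and extra whitespace); the function names are hypothetical, and the `compute_metrics` function used by this notebook may differ in detail.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, then drop punctuation, English articles, and extra spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, otherwise 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1: differs only in case, article, punctuation
print(exact_match("in Paris", "Paris"))                  # 0: one extra word is enough to miss
print(exact_match("", ""))                               # 1: both sides say "no answer"
```

A prediction earns no partial credit under this scheme, which is why the F1 score reported alongside it is never lower.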

#### Evaluate out-of-the-box accuracy

In [8]:
# Out of the Box Model Evaluation
if training_args.do_eval:
    oob_metrics = trainer.evaluate()
    trainer.log_metrics("Out of the Box eval matrics", oob_metrics)


[INFO|trainer.py:553] 2022-08-19 17:25:20,775 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
[INFO|trainer.py:2340] 2022-08-19 17:25:20,782 >> ***** Running Evaluation *****
[INFO|trainer.py:2342] 2022-08-19 17:25:20,783 >>   Num examples = 12106
[INFO|trainer.py:2345] 2022-08-19 17:25:20,783 >>   Batch size = 32


08/19/2022 17:25:42 - INFO - utils_qa - Post-processing 11873 example predictions split into 12106 features.


100%|█████████████████████████████████████████████████████████████| 11873/11873 [00:36<00:00, 329.53it/s]

08/19/2022 17:26:18 - INFO - utils_qa - Saving predictions to tune_output/eval_predictions.json.
08/19/2022 17:26:18 - INFO - utils_qa - Saving nbest_preds to tune_output/eval_nbest_predictions.json.





08/19/2022 17:26:21 - INFO - utils_qa - Saving null_odds to tune_output/eval_null_odds.json.
08/19/2022 17:26:23 - INFO - datasets.metric - Removing /home/auro/.cache/huggingface/metrics/squad_v2/default/default_experiment-1-0.arrow
***** Out of the Box eval matrics metrics *****
  eval_HasAns_exact      =  9.3455
  eval_HasAns_f1         = 11.6074
  eval_HasAns_total      =    5928
  eval_NoAns_exact       = 87.7376
  eval_NoAns_f1          = 87.7376
  eval_NoAns_total       =    5945
  eval_best_exact        = 50.0969
  eval_best_exact_thresh =     0.0
  eval_best_f1           = 50.2419
  eval_best_f1_thresh    =     0.0
  eval_exact             = 48.5977
  eval_f1                =  49.727
  eval_total             =   11873
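
The overall `eval_exact` value can be read as a count-weighted average of the two per-category scores. The sketch below checks that reading against the numbers logged above; the weighting formula is an assumption about how the metric is aggregated, and the inputs are the rounded values from the log.

```python
# Values copied from the out-of-the-box metrics logged above.
has_ans_exact, has_ans_total = 9.3455, 5928
no_ans_exact, no_ans_total = 87.7376, 5945

total = has_ans_total + no_ans_total
# Weight each category score by the number of questions in that category.
overall_exact = (
    has_ans_exact * has_ans_total + no_ans_exact * no_ans_total
) / total

print(total)                    # matches the logged eval_total
print(round(overall_exact, 4))  # close to the logged eval_exact of 48.5977
```

Because the two categories are nearly equal in size, the overall score sits close to the midpoint between a low HasAns score and a high NoAns score.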


Next, we estimate the accuracy on Envise. Note that this is an estimate using Envise precision, and the metric `eval_exact` is expected to be lower than the original FP32 accuracy.

#### Evaluate for Envise Precision

In [9]:
# Envise Evaluation
if idiom_ml_args.do_envise_eval:
    setup_for_evaluation(model)
    
if training_args.do_eval:
    logger.info("*** Envise Eval ***")
    metrics = trainer.evaluate()

    max_eval_samples = (
        data_args.max_eval_samples
        if data_args.max_eval_samples is not None
        else len(eval_dataset)
    )
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)



08/19/2022 17:26:23 - INFO - __main__ - *** Envise Eval ***


[INFO|trainer.py:553] 2022-08-19 17:26:23,978 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
[INFO|trainer.py:2340] 2022-08-19 17:26:23,980 >> ***** Running Evaluation *****
[INFO|trainer.py:2342] 2022-08-19 17:26:23,981 >>   Num examples = 12106
[INFO|trainer.py:2345] 2022-08-19 17:26:23,982 >>   Batch size = 32


08/19/2022 17:27:02 - INFO - utils_qa - Post-processing 11873 example predictions split into 12106 features.


100%|█████████████████████████████████████████████████████████████| 11873/11873 [00:36<00:00, 325.90it/s]

08/19/2022 17:27:39 - INFO - utils_qa - Saving predictions to tune_output/eval_predictions.json.
08/19/2022 17:27:39 - INFO - utils_qa - Saving nbest_preds to tune_output/eval_nbest_predictions.json.





08/19/2022 17:27:41 - INFO - utils_qa - Saving null_odds to tune_output/eval_null_odds.json.
08/19/2022 17:27:44 - INFO - datasets.metric - Removing /home/auro/.cache/huggingface/metrics/squad_v2/default/default_experiment-1-0.arrow
***** eval metrics *****
  eval_HasAns_exact      =  7.6923
  eval_HasAns_f1         = 10.5563
  eval_HasAns_total      =    5928
  eval_NoAns_exact       = 85.7023
  eval_NoAns_f1          = 85.7023
  eval_NoAns_total       =    5945
  eval_best_exact        =   50.08
  eval_best_exact_thresh =     0.0
  eval_best_f1           = 50.1101
  eval_best_f1_thresh    =     0.0
  eval_exact             = 46.7531
  eval_f1                = 48.1831
  eval_samples           =   12106
  eval_total             =   11873


#### Fine-tuning

In [10]:
if training_args.do_train and idiom_ml_args.finetune_with_dft:
    # prepare model to simulate training on Envise with DFT
    trainer_dft = QuestionAnsweringTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        eval_examples=eval_examples if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        post_process_function=post_processing_function,
        compute_metrics=compute_metrics,
    )
    train_dft_dataloader = trainer_dft.get_train_dataloader()
    inputs_dnf = next(iter(train_dft_dataloader))

    def batch_process_func(model, inputs):
        device = next(model.parameters()).device
        for k in inputs:
            inputs[k] = inputs[k].to(device)
        with torch.no_grad():
            model(**inputs)

    setup_for_tuning(
        model,
        inputs=inputs_dnf,
        batch_process_func=batch_process_func,
    )
    del trainer_dft

# fine-tuning
if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    metrics = train_result.metrics
    max_train_samples = (
        data_args.max_train_samples
        if data_args.max_train_samples is not None
        else len(train_dataset)
    )
    metrics["train_samples"] = min(max_train_samples, len(train_dataset))

    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()




Computing outputs of first model (source)...




Computing outputs of second model (target)...


[INFO|trainer.py:1244] 2022-08-19 17:27:47,425 >> ***** Running training *****
[INFO|trainer.py:1245] 2022-08-19 17:27:47,425 >>   Num examples = 131422
[INFO|trainer.py:1246] 2022-08-19 17:27:47,426 >>   Num Epochs = 1
[INFO|trainer.py:1247] 2022-08-19 17:27:47,426 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:1248] 2022-08-19 17:27:47,427 >>   Total train batch size (w. parallel, distributed & accumulation) = 12
[INFO|trainer.py:1249] 2022-08-19 17:27:47,427 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1250] 2022-08-19 17:27:47,427 >>   Total optimization steps = 10952


Computing output noise tensors...
Registering hooks into the model...
Successfully registered 13 hooks.


Step,Training Loss
1000,2.9694
2000,2.9816
3000,2.9487
4000,2.9356
5000,2.9065
6000,2.9554
7000,2.9194
8000,2.93
9000,2.9356
10000,2.9206


[INFO|trainer.py:2090] 2022-08-19 17:28:50,167 >> Saving model checkpoint to tune_output/checkpoint-1000
[INFO|configuration_utils.py:430] 2022-08-19 17:28:50,170 >> Configuration saved in tune_output/checkpoint-1000/config.json
[INFO|modeling_utils.py:1074] 2022-08-19 17:28:50,205 >> Model weights saved in tune_output/checkpoint-1000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2074] 2022-08-19 17:28:50,207 >> tokenizer config file saved in tune_output/checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2080] 2022-08-19 17:28:50,208 >> Special tokens file saved in tune_output/checkpoint-1000/special_tokens_map.json
[INFO|trainer.py:2090] 2022-08-19 17:29:43,562 >> Saving model checkpoint to tune_output/checkpoint-2000
[INFO|configuration_utils.py:430] 2022-08-19 17:29:43,566 >> Configuration saved in tune_output/checkpoint-2000/config.json
[INFO|modeling_utils.py:1074] 2022-08-19 17:29:43,600 >> Model weights saved in tune_output/checkpoint-2000/pytorch_model.

***** train metrics *****
  epoch                    =        1.0
  total_flos               =   111970GF
  train_loss               =     2.9368
  train_runtime            = 0:09:58.62
  train_samples            =     131422
  train_samples_per_second =     219.54
  train_steps_per_second   =     18.295
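
The step count and throughput figures in the log follow from a few logged quantities. The sketch below reproduces them, assuming one epoch on a single device with no gradient accumulation and with the last partial batch kept (`dataloader_drop_last=False`), as the logged arguments indicate.

```python
import math

# Values copied from the training log above.
num_examples = 131422
batch_size = 12
runtime_s = 9 * 60 + 58.62  # train_runtime of 0:09:58.62 in seconds

# The final, partially filled batch still counts as one optimization step.
steps = math.ceil(num_examples / batch_size)
samples_per_s = num_examples / runtime_s
steps_per_s = steps / runtime_s

print(steps)                    # 10952, the logged "Total optimization steps"
print(round(samples_per_s, 2))  # close to the logged 219.54
print(round(steps_per_s, 3))    # close to the logged 18.295
```

The same arithmetic gives a quick runtime estimate when you change the batch size or the number of epochs.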


#### Simulate Evaluation after Fine-tuning

In [11]:
if idiom_ml_args.do_envise_eval:
    setup_for_evaluation(model)

# Evaluation
if training_args.do_eval:
    logger.info("*** Evaluate ***")
    metrics = trainer.evaluate()

    max_eval_samples = (
        data_args.max_eval_samples
        if data_args.max_eval_samples is not None
        else len(eval_dataset)
    )
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

08/19/2022 17:37:46 - INFO - __main__ - *** Evaluate ***


[INFO|trainer.py:553] 2022-08-19 17:37:46,069 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id.
[INFO|trainer.py:2340] 2022-08-19 17:37:46,074 >> ***** Running Evaluation *****
[INFO|trainer.py:2342] 2022-08-19 17:37:46,075 >>   Num examples = 12106
[INFO|trainer.py:2345] 2022-08-19 17:37:46,075 >>   Batch size = 32


08/19/2022 17:38:17 - INFO - utils_qa - Post-processing 11873 example predictions split into 12106 features.


100%|█████████████████████████████████████████████████████████████| 11873/11873 [00:35<00:00, 330.11it/s]

08/19/2022 17:38:53 - INFO - utils_qa - Saving predictions to tune_output/eval_predictions.json.
08/19/2022 17:38:53 - INFO - utils_qa - Saving nbest_preds to tune_output/eval_nbest_predictions.json.





08/19/2022 17:38:55 - INFO - utils_qa - Saving null_odds to tune_output/eval_null_odds.json.
08/19/2022 17:38:58 - INFO - datasets.metric - Removing /home/auro/.cache/huggingface/metrics/squad_v2/default/default_experiment-1-0.arrow
***** eval metrics *****
  epoch                  =     1.0
  eval_HasAns_exact      =  6.1741
  eval_HasAns_f1         =  8.0907
  eval_HasAns_total      =    5928
  eval_NoAns_exact       =  90.513
  eval_NoAns_f1          =  90.513
  eval_NoAns_total       =    5945
  eval_best_exact        =   50.08
  eval_best_exact_thresh =     0.0
  eval_best_f1           = 50.1179
  eval_best_f1_thresh    =     0.0
  eval_exact             = 48.4039
  eval_f1                = 49.3609
  eval_samples           =   12106
  eval_total             =   11873
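
Comparing the per-category scores across the three evaluations shows where the tuning gain comes from. The sketch below uses the values logged in this notebook; the dictionary layout and the `delta` helper are hypothetical conveniences, not part of the Idiom API.

```python
# Exact Match scores copied from the three evaluation logs in this notebook.
runs = {
    "fp32": {"HasAns": 9.3455, "NoAns": 87.7376, "overall": 48.5977},
    "envise": {"HasAns": 7.6923, "NoAns": 85.7023, "overall": 46.7531},
    "envise_tuned": {"HasAns": 6.1741, "NoAns": 90.513, "overall": 48.4039},
}


def delta(before: str, after: str) -> dict:
    """Per-category change in Exact Match between two named runs."""
    return {
        key: round(runs[after][key] - runs[before][key], 4)
        for key in runs[before]
    }


tuning_effect = delta("envise", "envise_tuned")
print(tuning_effect)
```

In this run the overall gain comes from the NoAns questions, while the HasAns score falls further, so the overall score alone does not show how the model's behavior shifted.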


#### Export the model to ONNX

In [15]:
if idiom_ml_args.setup_for_export:
    model = setup_for_export(model)
    model.save_pretrained("bert-tiny-idiom-finetuned")
    tokenizer.save_pretrained("bert-tiny-idiom-finetuned")


[INFO|configuration_utils.py:430] 2022-08-19 17:40:12,869 >> Configuration saved in bert-tiny-idiom-finetuned/config.json
[INFO|modeling_utils.py:1074] 2022-08-19 17:40:12,923 >> Model weights saved in bert-tiny-idiom-finetuned/pytorch_model.bin
[INFO|tokenization_utils_base.py:2074] 2022-08-19 17:40:12,924 >> tokenizer config file saved in bert-tiny-idiom-finetuned/tokenizer_config.json
[INFO|tokenization_utils_base.py:2080] 2022-08-19 17:40:12,924 >> Special tokens file saved in bert-tiny-idiom-finetuned/special_tokens_map.json


In [16]:
!python -m transformers.onnx --model=bert-tiny-idiom-finetuned --feature=question-answering bert-tiny-idiom-onnx

Using framework PyTorch: 1.10.0+cu111
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'end_logits', 'start_logits'})
	- Validating ONNX Model output "start_logits":
		-[✓] (2, 8) matches (2, 8)
		-[✓] all values close (atol: 1e-05)
	- Validating ONNX Model output "end_logits":
		-[✓] (2, 8) matches (2, 8)
		-[✓] all values close (atol: 1e-05)
All good, model saved at: bert-tiny-idiom-onnx/model.onnx


#### Conclusion

This tutorial shows how to use the Accuracy Estimator to estimate the accuracy of a BERT-Tiny model under simulated Envise conditions, and how to fine-tune the model to bring that accuracy closer to the original FP32 score.

Results vary slightly from run to run (the run logged above scored 46.7531 before tuning and 48.4039 after one epoch), but they are expected to stay close to the values below:

| Eval type           | EM score |
|---------------------|----------|
| Out of the Box      | 48.5977  |
| Envise w/ no tuning | 46.8037  |
| One-epoch tuning    | 48.4629  |


Fine-tuning for more epochs may improve the scores further.
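
One way to summarize the table is the fraction of the accuracy gap that tuning closes. The sketch below computes it from the three EM scores in the table above; the function name `recovered_fraction` is hypothetical, and the ratio is an informal summary, not a metric reported by the Idiom tools.

```python
def recovered_fraction(fp32: float, before: float, after: float) -> float:
    """Share of the FP32-to-Envise accuracy gap closed by fine-tuning."""
    gap = fp32 - before
    if gap <= 0:
        raise ValueError("no accuracy gap to recover")
    return (after - before) / gap


# EM scores from the table above.
recovered = recovered_fraction(fp32=48.5977, before=46.8037, after=48.4629)
print(f"{recovered:.0%} of the gap recovered")  # roughly 92% with these values
```

A value near 1.0 means the tuned model is estimated to match FP32 accuracy, and a value above 1.0 would mean it is estimated to exceed it.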