## NOTEBOOK: Fine-tune a BERT-Tiny Model

This notebook evaluates a BERT-Tiny model, simulates its behavior under Envise conditions, and fine-tunes it for Envise using the Idiom Software Stack and the Accuracy Estimator tool.

We use the BERT question-answering model [mrm8488/bert-tiny-finetuned-squadv2](https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2). It is relatively small, so all the steps can be completed within 30 minutes. You can find the model details on Hugging Face.

The dataset for this model is ``squadv2``.

Run this Jupyter notebook in an environment that has a GPU.

This notebook is a continuation of the `bert-tiny-profile.ipynb` notebook. To understand the full context, run that notebook before continuing with this one.

**PROCESS**

This notebook follows these steps:

* First, evaluate the out-of-the-box accuracy of the FP32 BERT-Tiny model.

* Next, estimate the model's accuracy at Envise precision by running the model through the Idiom API, which calls the Envise Accuracy Estimator. This is an estimate and will most likely be lower than the original FP32 accuracy.

* Last, apply the fine-tuning algorithm by calling the Idiom API, which returns a fine-tuned FP32 model whose accuracy score is closer to the original FP32 accuracy score.

**SYSTEM COMPONENT MINIMUM REQUIREMENTS**

* CPU: Any x86-64 architecture with 4 cores
* RAM: 64 GB memory
* GPU: One NVIDIA 2080
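
Before installing anything, it can help to confirm that the machine roughly matches the list above. The following is a minimal sketch, not part of the Idiom tooling: `check_environment` is a hypothetical helper name, it only checks the core count and whether a CUDA device is visible, and it does not check RAM because the Python standard library has no portable way to do so.

```python
import os
import shutil


def check_environment(min_cores=4):
    """Return a small report on CPU cores and GPU visibility."""
    report = {
        "cpu_cores": os.cpu_count() or 0,
        # True if the NVIDIA driver utility is on PATH
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # None means PyTorch is not installed yet, so CUDA could not be queried
        "cuda_available": None,
    }
    try:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    report["enough_cores"] = report["cpu_cores"] >= min_cores
    return report


print(check_environment())
```

If `cuda_available` is `None`, run the check again after the dependency installation step below.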

#### Install dependencies

In [None]:
!pip install -r requirements.txt

#### Set up Imports

In [None]:
import logging
from dataclasses import dataclass
from dataclasses import field
from typing import Optional
import torch
from trainer_qa import QuestionAnsweringTrainer
from transformers import HfArgumentParser
from transformers import TrainingArguments
from transformers.utils import check_min_version
from transformers.utils.versions import require_version

#### Set up Idiom Imports

In [None]:
from idiom.ml.torch import setup_for_evaluation
from idiom.ml.torch import setup_for_export
from idiom.ml.torch import setup_for_tuning

#### Set up Imports from support scripts 

In [None]:
from bert_args import ModelArguments, IdiomMLArguments, DataTrainingArguments
from trainer_support import get_trainer_support

#### Version Sanity Checks

In [None]:
# Raises an error if the minimum required version of Transformers is not installed. Remove at your own risk.
check_min_version("4.16.2")

require_version(
    "datasets>=1.8.0",
    "To fix: pip install -r examples/pytorch/question-answering/requirements.txt",
)

logger = logging.getLogger(__name__)


BERT fine-tuning and evaluation are parameterized with many arguments, most of which keep their default values.

The arguments that are tailored for this notebook are available in the JSON file, ``tiny-bert-args.json``.
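
To illustrate the idea of one flat JSON object feeding several argument groups, here is a standard-library sketch. It is an assumption-laden stand-in, not the `HfArgumentParser` implementation: `IdiomFlags`, `DataFlags`, and `split_args` are hypothetical names, the field names are the ones this notebook reads later, and the defaults and JSON values are invented for illustration.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdiomFlags:  # hypothetical stand-in for IdiomMLArguments
    do_envise_eval: bool = False
    finetune_with_dft: bool = False
    setup_for_export: bool = False


@dataclass
class DataFlags:  # hypothetical stand-in for DataTrainingArguments
    max_train_samples: Optional[int] = None
    max_eval_samples: Optional[int] = None


def split_args(flat, *classes):
    """Build one instance per dataclass from a single flat dict."""
    used = set()
    instances = []
    for cls in classes:
        names = {f.name for f in dataclasses.fields(cls)}
        kwargs = {k: v for k, v in flat.items() if k in names}
        used.update(kwargs)
        instances.append(cls(**kwargs))
    unknown = set(flat) - used
    if unknown:
        # Design choice of this sketch: fail loudly on misspelled keys
        raise ValueError(f"Unrecognized keys: {sorted(unknown)}")
    return tuple(instances)


# Illustrative values only; not the contents of tiny-bert-args.json.
example_json = '{"do_envise_eval": true, "finetune_with_dft": true, "max_eval_samples": 100}'
idiom_flags, data_flags = split_args(json.loads(example_json), IdiomFlags, DataFlags)
print(idiom_flags)
print(data_flags)
```

Each key lands in whichever dataclass declares a field of that name, which is why the argument groups can be kept separate in code while sharing a single file.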

In [None]:
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser(
    (
        ModelArguments,
        IdiomMLArguments,
        DataTrainingArguments,
        TrainingArguments,
    )
)
(
    model_args,
    idiom_ml_args,
    data_args,
    training_args,
) = parser.parse_json_file(json_file="tiny-bert-args.json")
# Force single-GPU execution by overriding the private `_n_gpu` attribute.
training_args._n_gpu = 1
print(f'{model_args}, \n{idiom_ml_args}, \n{data_args}, \n{training_args}')


The notebook needs support functions such as model and tokenizer download, training and evaluation datasets, and pre- and post-processing functions. 
All of these support functions have been placed outside the notebook.

We just fetch them below.

In [None]:
model_params, all_datasets, other_params = get_trainer_support(
    model_args, idiom_ml_args, data_args, training_args, logger
)

trainer, model = model_params['trainer'], model_params['model']
eval_dataset, train_dataset = all_datasets['eval_dataset'], all_datasets['train_dataset']
eval_examples = all_datasets['eval_examples']
tokenizer, data_collator = other_params['tokenizer'], other_params['data_collator']
post_processing_function = other_params['post_processing_function']
compute_metrics = other_params['compute_metrics']
last_checkpoint = other_params['last_checkpoint']


First, we evaluate the model as we received it, i.e., with FP32 precision.

The accuracy metric we track is `eval_exact`, the Exact Match score.

Exact Match is a match/no-match measure of whether the evaluation output matches the
ground truth answer exactly. This is a strict metric.
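
To make the metric concrete, here is a simplified sketch of an Exact Match computation. It follows the normalization commonly used for SQuAD-style evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace), but it is an illustration only: the function names are hypothetical, and the `compute_metrics` function actually used by this notebook lives in the support scripts and may differ in detail.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truths):
    """Return 1 if the prediction equals any reference after normalization."""
    if not ground_truths:
        # Unanswerable question: the empty string is the only correct answer
        ground_truths = [""]
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gt) for gt in ground_truths))


def exact_match_score(predictions, references):
    """Percentage of predictions that match, on a 0-100 scale."""
    hits = [exact_match(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(hits) / len(hits)


# Made-up examples: one match, one near miss, one correctly unanswered question
preds = ["The Eiffel Tower.", "in 1889", ""]
refs = [["Eiffel Tower"], ["1887"], []]
print(exact_match_score(preds, refs))
```

The second example shows why the metric is strict: a prediction that is off by a single character scores zero, with no partial credit.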

#### Evaluate out-of-the-box accuracy

In [None]:
# Out of the Box Model Evaluation
if training_args.do_eval:
    oob_metrics = trainer.evaluate()
    trainer.log_metrics("Out of the Box eval", oob_metrics)


Next, we estimate the accuracy on Envise. Note that this is an estimate using Envise precision, and the metric `eval_exact` is expected to be lower than the original FP32 accuracy.

#### Evaluate for Envise Precision

In [None]:
# Envise Evaluation
if idiom_ml_args.do_envise_eval:
    setup_for_evaluation(model)
    
if training_args.do_eval:
    logger.info("*** Envise Eval ***")
    metrics = trainer.evaluate()

    max_eval_samples = (
        data_args.max_eval_samples
        if data_args.max_eval_samples is not None
        else len(eval_dataset)
    )
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)



#### Fine-tuning

In [None]:
if training_args.do_train and idiom_ml_args.finetune_with_dft:
    # prepare model to simulate training on Envise with DFT
    trainer_dft = QuestionAnsweringTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        eval_examples=eval_examples if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        post_process_function=post_processing_function,
        compute_metrics=compute_metrics,
    )
    train_dft_dataloader = trainer_dft.get_train_dataloader()
    inputs_dnf = next(iter(train_dft_dataloader))

    def batch_process_func(model, inputs):
        device = next(model.parameters()).device
        for k in inputs:
            inputs[k] = inputs[k].to(device)
        with torch.no_grad():
            model(**inputs)

    setup_for_tuning(
        model,
        inputs=inputs_dnf,
        batch_process_func=batch_process_func,
    )
    del trainer_dft

# fine-tuning
if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    metrics = train_result.metrics
    max_train_samples = (
        data_args.max_train_samples
        if data_args.max_train_samples is not None
        else len(train_dataset)
    )
    metrics["train_samples"] = min(max_train_samples, len(train_dataset))

    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()




#### Simulate Evaluation after Fine-tuning

In [None]:
if idiom_ml_args.do_envise_eval:
    setup_for_evaluation(model)

# Evaluation
if training_args.do_eval:
    logger.info("*** Evaluate ***")
    metrics = trainer.evaluate()

    max_eval_samples = (
        data_args.max_eval_samples
        if data_args.max_eval_samples is not None
        else len(eval_dataset)
    )
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

#### Export the model to ONNX

In [None]:
if idiom_ml_args.setup_for_export:
    model = setup_for_export(model)
    model.save_pretrained("bert-tiny-idiom-finetuned")
    tokenizer.save_pretrained("bert-tiny-idiom-finetuned")


In [None]:
!python -m transformers.onnx --model=bert-tiny-idiom-finetuned --feature=question-answering bert-tiny-idiom-onnx

#### Conclusion

This tutorial shows how to use the Accuracy Estimator to simulate Envise conditions, evaluate the accuracy of a BERT-Tiny model under those conditions, and recover accuracy through fine-tuning.

Results can vary slightly between runs but are expected to stay close to the values below:


| Eval type           | EM score |
|---------------------|----------|
| Out of the Box      | 48.5977  |
| Envise w/ no tuning | 46.8037  |
| One-epoch tuning    | 48.4629  |
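
One way to read the table above is to ask how much of the accuracy lost under Envise precision is regained by tuning. The sketch below does that arithmetic with the three EM scores from the table; `recovered_fraction` is a hypothetical helper name, not part of the Idiom API.

```python
def recovered_fraction(baseline, degraded, tuned):
    """Fraction of the baseline-to-degraded gap closed by tuning."""
    gap = baseline - degraded
    if gap <= 0:
        raise ValueError("degraded score is not below the baseline")
    return (tuned - degraded) / gap


# EM scores copied from the table above
out_of_box, envise_no_tuning, one_epoch = 48.5977, 46.8037, 48.4629

drop = out_of_box - envise_no_tuning
recovered = recovered_fraction(out_of_box, envise_no_tuning, one_epoch)
print(f"Drop under Envise precision: {drop:.4f} EM points")
print(f"Recovered by one epoch:      {recovered:.1%}")
```

With these numbers, the drop is about 1.79 EM points, and one epoch of tuning closes roughly 92% of that gap.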


If you fine-tune for more epochs, you could get better scores.
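
As a sketch of how a longer run could be configured, the helper below copies an arguments file and overrides `num_train_epochs`, which is a standard `TrainingArguments` field. It assumes the file is a flat JSON object, as `parse_json_file` expects; the helper name and the output filename in the example are hypothetical.

```python
import json
from pathlib import Path


def write_args_with_epochs(src, dst, num_train_epochs):
    """Copy a flat JSON args file, overriding only `num_train_epochs`."""
    args = json.loads(Path(src).read_text())
    args["num_train_epochs"] = num_train_epochs
    Path(dst).write_text(json.dumps(args, indent=2))
    return args


# Example (hypothetical output filename):
# write_args_with_epochs("tiny-bert-args.json", "tiny-bert-args-3ep.json", 3)
```

To use the new file, pass its path as `json_file` in the argument-parsing cell and rerun the notebook from there.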