# Tutorial: Hardware-aware Training and Hyper-parameter Optimization for MobileBERT / SQuADv1.1

### Authors: [Corey Lammie](https://www.linkedin.com/in/coreylammie/)

<a href="https://colab.research.google.com/github/IBM/aihwkit/blob/master/notebooks/iscas-tutorial/mobilebert_squad.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

The IBM Analog Hardware Acceleration Kit (AIHWKIT) is an open-source Python toolkit for exploring and using the capabilities of in-memory computing devices (PCM, RRAM and others) in the context of artificial intelligence. The PyTorch integration consists of a series of primitives and features that allow using the toolkit within PyTorch.
The GitHub repository can be found at: https://github.com/IBM/aihwkit
To learn more about Analog AI and the hardware behind it, refer to this webpage: https://aihw-composer.draco.res.ibm.com/about

This notebook demonstrates how AIHWKIT can be used in conjunction with [W&B](https://wandb.ai/site) to perform hyper-parameter optimization.

## Install the AIHWKIT
This tutorial assumes that you have installed the AIHWKIT. If you have not, it can be installed by un-commenting the relevant lines in the following cell:

In [None]:
# To install the CPU-only version of the kit, un-comment the line below
# !pip install aihwkit

# To install the GPU-enabled wheel, un-comment the lines below
# !wget https://aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud/aihwkit-0.9.0+cuda117-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# !pip install aihwkit-0.9.0+cuda117-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

## Install other Requirements

In [None]:
!pip install wandb accelerate transformers tqdm

## Authenticate W&B
If you do not already have a W&B account, please create one [here](https://wandb.ai/site).

In [None]:
import wandb


wandb.login()

## Define an RPU Config
The RPU configuration specifies the parameters necessary for hardware-aware training.

In [4]:
import aihwkit
from aihwkit.simulator.presets.inference import StandardHWATrainingPreset


rpu_config = StandardHWATrainingPreset()

## Define a Configuration File Describing the Training Configurations and Parameters to Optimize
The following attributes describe how the optimization is performed:
* **method**: The optimization method to use (Bayesian Optimization/Grid search/etc.).
* **early_terminate**: The early-stopping strategy for poorly performing runs (here, Hyperband with a minimum of 100 iterations).
* **metric**: The quantity to optimize and whether it should be minimized or maximized.

A full description of each parameter is as follows:
* **logging_step_frequency**: The number of training steps between logging events.
* **max_seq_length**: The maximum sequence length of the transformer model (MobileBERT).
* **batch_size_train**: The batch size used during training.
* **batch_size_eval**: The batch size used during evaluation.
* **weight_decay**: The L2 weight decay parameter (used during training).
* **num_training_epochs**: The number of fine-tuning/training epochs.
* **learning_rate**: The exponent `x` of the learning rate, sampled uniformly between `min` and `max`. The learning rate used for training is `10 ** -x`.

In [5]:
configuration = """
method: bayes
early_terminate:
  min_iter: 100
  type: hyperband
metric:
  goal: minimize
  name: training_loss
parameters:
  logging_step_frequency:
    value: 5
  max_seq_length:
    value: 320
  batch_size_train:
    value: 16
  batch_size_eval:
    value: 32
  weight_decay:
    value: 0.0005
  num_training_epochs:
    value: 1
  learning_rate:
    distribution: uniform
    max: 6.0 # 10 ** -max
    min: 1.0 # 10 ** -min
"""
with open("configuration.yaml", "w") as file:
    file.write(configuration[1:])
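
The `learning_rate` entry in the configuration is an exponent rather than the rate itself; the training function later converts it with `10 ** -x`. The following is a minimal sketch of that mapping. The helper name `exponent_to_lr` is hypothetical, and `random.uniform` is only a stand-in for the sweep's sampler, not part of the W&B API.

```python
import random


def exponent_to_lr(exponent: float) -> float:
    """Convert a sampled exponent x into a learning rate 10 ** -x."""
    return 10 ** -exponent


# Bounds copied from the `learning_rate` entry of the sweep configuration.
MIN_EXPONENT, MAX_EXPONENT = 1.0, 6.0

random.seed(0)  # Only to make this illustration repeatable.
for _ in range(3):
    # Stand-in for the sweep's sampler: draw an exponent uniformly.
    x = random.uniform(MIN_EXPONENT, MAX_EXPONENT)
    print(f"exponent={x:.2f} -> learning_rate={exponent_to_lr(x):.2e}")
```

Searching uniformly over the exponent amounts to a log-uniform search over learning rates between `1e-6` and `1e-1`, which spreads trials evenly across orders of magnitude.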

## Define Utility Functions to Load the Model and Dataset and to Perform Training and Evaluation

In [6]:
import os

import torch
import wandb
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    squad_convert_examples_to_features,
)
from transformers.data.metrics.squad_metrics import (
    compute_predictions_logits,
    squad_evaluate,
)
from transformers.data.processors.squad import SquadResult, SquadV1Processor


def to_list(tensor):
    return tensor.detach().cpu().tolist()


def load_and_cache_examples(
    model_name_or_path,
    tokenizer,
    max_seq_length,
    evaluate=False,
    output_examples=False,
    overwrite_cache=False,
    cache_dir="data",
):
    cached_features_file = os.path.join(
        "cached_{}_{}_{}".format(
            "dev" if evaluate else "train",
            list(filter(None, model_name_or_path.split("/"))).pop(),
            str(max_seq_length),
        ),
    )
    if os.path.exists(cached_features_file) and not overwrite_cache:
        features_and_dataset = torch.load(cached_features_file)
        features, dataset, examples = (
            features_and_dataset["features"],
            features_and_dataset["dataset"],
            features_and_dataset["examples"],
        )
    else:
        import tensorflow_datasets as tfds

        tfds_examples = tfds.load("squad", data_dir=cache_dir)
        examples = SquadV1Processor().get_examples_from_dataset(
            tfds_examples, evaluate=evaluate
        )
        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=128,
            max_query_length=64,
            is_training=not evaluate,
            return_dataset="pt",
            threads=8,
        )
        torch.save(
            {"features": features, "dataset": dataset, "examples": examples},
            cached_features_file,
        )

    if output_examples:
        return dataset, examples, features

    return dataset


def evaluate(
    model,
    tokenizer,
    examples,
    features,
    eval_dataloader,
    cache_dir,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    early_exit_n_iters=-1,
):
    all_results = []
    batch_idx = 0
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }
            feature_indices = batch[3]
            outputs = model(**inputs)

        for i, feature_index in enumerate(feature_indices):
            eval_feature = features[feature_index.item()]
            unique_id = int(eval_feature.unique_id)
            output = [to_list(output[i]) for output in outputs.to_tuple()]
            if len(output) >= 5:
                start_logits = output[0]
                start_top_index = output[1]
                end_logits = output[2]
                end_top_index = output[3]
                cls_logits = output[4]
                result = SquadResult(
                    unique_id,
                    start_logits,
                    end_logits,
                    start_top_index=start_top_index,
                    end_top_index=end_top_index,
                    cls_logits=cls_logits,
                )

            else:
                start_logits, end_logits = output
                result = SquadResult(unique_id, start_logits, end_logits)

            all_results.append(result)

        if batch_idx == early_exit_n_iters:
            break

        batch_idx += 1

    output_prediction_file = os.path.join(cache_dir, "predictions.json")
    output_nbest_file = os.path.join(cache_dir, "nbest_predictions.json")
    # A negative early_exit_n_iters disables early exit. Slicing with [0:-1]
    # would silently drop the last item, so use None to keep everything.
    limit = early_exit_n_iters if early_exit_n_iters >= 0 else None
    predictions = compute_predictions_logits(
        examples[0:limit],
        features[0:limit],
        all_results[0:limit],
        20,
        30,
        True,
        output_prediction_file,
        output_nbest_file,
        None,
        False,
        False,
        0.0,
        tokenizer,
    )
    results = squad_evaluate(examples[0:limit], predictions)
    return results


def train_epoch(
    train_dataloader,
    model,
    optimizer,
    scheduler,
    current_epoch,
    logging_step_frequency,
    wandb_logging=False,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    early_exit_n_iters=-1,
):
    model.train()
    step = 0
    n_steps = len(train_dataloader)
    with tqdm(train_dataloader, desc="Iteration") as tepoch:
        for batch in tepoch:
            batch = tuple(t.to(device) for t in batch)
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
                "start_positions": batch[3],
                "end_positions": batch[4],
            }
            outputs = model(**inputs)
            loss = outputs[0]
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            model.zero_grad(set_to_none=True)
            if step % logging_step_frequency == 0:
                # get_last_lr() returns the rate set by the most recent scheduler step.
                tepoch.set_postfix(loss=loss.item(), lr=scheduler.get_last_lr()[0])
                if wandb_logging:
                    wandb.log(
                        {
                            "step": n_steps * current_epoch + step,
                            "training_loss": loss.item(),
                        }
                    )

            if step == early_exit_n_iters:
                break

            step += 1



def load_model_tokenizer(
    model_id,
    cache_dir="cache",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
):
    config = AutoConfig.from_pretrained(
        model_id,
        cache_dir=cache_dir,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        do_lower_case=True,
        cache_dir=cache_dir,
        use_fast=False,
    )
    model = AutoModelForQuestionAnswering.from_pretrained(
        model_id,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
    )
    model = model.to(device)
    return model, tokenizer


def load_dataloader_examples_features(
    model_id, tokenizer, evaluate, batch_size=16, max_seq_length=320,
):
    dataset, examples, features = load_and_cache_examples(
        model_id, tokenizer, max_seq_length, evaluate=evaluate, output_examples=True
    )
    if evaluate:
        sampler = SequentialSampler(dataset)
    else:
        sampler = RandomSampler(dataset)

    dataloader = DataLoader(
        dataset,
        sampler=sampler,
        batch_size=batch_size,
    )
    return dataloader, examples, features
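
`train_epoch` calls `scheduler.step()` once per batch, so the learning rate changes after every optimizer step. With the linear schedule this notebook passes in (zero warm-up steps), the rate decays from its initial value towards zero over the total number of training steps. The sketch below re-implements that multiplier in plain Python to show the shape of the decay. `linear_schedule_multiplier` is a hypothetical helper written for illustration, not a function from `transformers`, and the base rate and step count are placeholder values.

```python
def linear_schedule_multiplier(
    step: int, num_training_steps: int, num_warmup_steps: int = 0
) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warm-up.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, clamped at 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4      # Placeholder value, for illustration only.
total_steps = 1000  # Placeholder value, for illustration only.
for step in (0, 250, 500, 1000):
    lr = base_lr * linear_schedule_multiplier(step, total_steps)
    print(f"step={step:4d} lr={lr:.2e}")
```

Because the multiplier reaches zero at `num_training_steps`, the total step count handed to the scheduler should cover every batch of every epoch; otherwise the later steps would train with a learning rate of zero.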

## Define a Function to Load the Model (MobileBERT) and Dataset (SQuADv1.1), and to Execute the Training Loop

In [7]:
from aihwkit.optim import AnalogAdam
from transformers import get_linear_schedule_with_warmup
from aihwkit.nn.conversion import convert_to_analog
import numpy as np


def main(t_inferences=(0.0, 3600.0, 86400.0), n_reps=5, early_exit_n_iters=-1):
    wandb.init()
    max_seq_length = wandb.config.max_seq_length
    logging_step_frequency = wandb.config.logging_step_frequency
    batch_size_train = wandb.config.batch_size_train
    batch_size_eval = wandb.config.batch_size_eval
    weight_decay = wandb.config.weight_decay
    num_training_epochs = wandb.config.num_training_epochs
    learning_rate = 10 ** -wandb.config.learning_rate
    model_id = "csarron/mobilebert-uncased-squad-v1"
    model, tokenizer = load_model_tokenizer(model_id, "data")
    print("Loading and parsing training features.")
    train_dataloader, train_examples, train_features = load_dataloader_examples_features(model_id, tokenizer, evaluate=False, batch_size=batch_size_train, max_seq_length=max_seq_length)
    print("Loading and parsing evaluation features.")
    test_dataloader, test_examples, test_features = load_dataloader_examples_features(model_id, tokenizer, evaluate=True, batch_size=batch_size_eval, max_seq_length=max_seq_length)
    no_decay = ["bias", "LayerNorm.weight"]
    model = convert_to_analog(model, rpu_config, verbose=False)
    optimizer_grouped_parameters = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": weight_decay,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AnalogAdam(
        optimizer_grouped_parameters, lr=learning_rate,
    )
    # Total optimizer steps: batches per epoch times the number of epochs.
    t_total = len(train_dataloader) * num_training_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=t_total
    )
    model.zero_grad()
    for current_epoch in range(0, num_training_epochs):
        print("Training epoch: ", current_epoch)
        model.train()
        train_epoch(
            train_dataloader,
            model,
            optimizer,
            scheduler,
            current_epoch=current_epoch,
            logging_step_frequency=logging_step_frequency,
            wandb_logging=True,
            early_exit_n_iters=early_exit_n_iters,
        )
        with torch.no_grad():
            model.eval()
            for t in t_inferences:
                print("t_inference:", t)
                f1_scores = []
                for rep in range(n_reps):
                    # Apply weight drift for an inference time of t seconds.
                    model.drift_analog_weights(t)
                    result = evaluate(
                        model,
                        tokenizer,
                        test_examples,
                        test_features,
                        test_dataloader,
                        cache_dir="data",
                        early_exit_n_iters=early_exit_n_iters,
                    )
                    f1_scores.append(result["f1"])

                print("=====", t, np.mean(f1_scores), np.std(f1_scores))
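
The optimizer setup in `main` splits the model parameters into two groups by name, so that biases and LayerNorm weights are exempt from weight decay. The filter is a plain substring match on each parameter name. The sketch below reproduces that rule on a few made-up parameter names; both the names and the helper `split_by_weight_decay` are hypothetical and chosen only to illustrate the matching, and the names in the converted analog model may differ.

```python
NO_DECAY = ["bias", "LayerNorm.weight"]


def split_by_weight_decay(names, no_decay=NO_DECAY):
    """Split parameter names into (decayed, exempt) using substring matching."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    exempt = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, exempt


# Made-up parameter names, for illustration only.
example_names = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "qa_outputs.weight",
]

decayed, exempt = split_by_weight_decay(example_names)
print("weight decay applied:", decayed)
print("weight decay skipped:", exempt)
```

Every name lands in exactly one of the two lists, which mirrors how the two entries of `optimizer_grouped_parameters` together cover all parameters of the model.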

## Load the Configuration File and Execute the Optimization Loop

In [None]:
import yaml


with open('configuration.yaml') as f:
    sweep_configuration = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep=sweep_configuration, project="mobilebert_squadv1")
wandb.agent(sweep_id, function=main, count=10)