##### Copyright 2021 Habana Labs, Ltd. an Intel Company.

# Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

# DistilBert Sequence Classification with IMDb Reviews

**An adaptation of [Huggingface Sequence Classification with IMDB Reviews](https://github.com/huggingface/notebooks/blob/master/transformers_doc/pytorch/custom_datasets.ipynb) using Habana Gaudi AI processors.**

## Overview

This tutorial walks through one example of fine-tuning a Huggingface Transformers model on the IMDb dataset. The guide shows the workflow for training the model on Gaudi and is meant to be illustrative rather than definitive.

Note: The dataset can be explored on the Huggingface hub (IMDb), and can alternatively be downloaded with the Huggingface datasets library using load_dataset("imdb").

## Setup

Let’s start by downloading the dataset from the Large Movie Review Dataset webpage.

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [None]:
!tar -xf aclImdb_v1.tar.gz

### Install required libraries
We will install the Habana integration for transformers (optimum.habana), along with the other required packages, inside the Docker container.

In [None]:
pip install numpy pandas scikit-learn datasets optimum.habana

This data is organized into pos and neg folders with one text file per example.

In [None]:
import os
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('./aclImdb/train')
test_texts, test_labels = read_imdb_split('./aclImdb/test')
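
To make the expected folder layout concrete, here is a minimal sketch that builds a tiny, made-up directory tree in a temporary folder and reads it back with the same logic as read_imdb_split. The file names and review texts are invented for illustration. The explicit encoding and the sorted() call are additions for reproducibility and are not part of the original function.

```python
import tempfile
from pathlib import Path

def read_split(split_dir):
    """Same logic as read_imdb_split: one text file per review, label from the folder name."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() gives a reproducible order; iterdir() order is arbitrary
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

# Build a hypothetical two-review tree that mimics aclImdb/train
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train"
    for label, review in [("pos", "Loved it."), ("neg", "Dull and slow.")]:
        (root / label).mkdir(parents=True)
        (root / label / "0_1.txt").write_text(review, encoding="utf-8")
    demo_texts, demo_labels = read_split(root)

print(demo_texts, demo_labels)
```

Only the pos and neg folders are visited, so any other folder under the split directory is ignored.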

We now have a train and test dataset, but let’s also create a validation set which we can use for evaluation and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

In [None]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
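
For intuition, a hold-out split boils down to shuffling the example indices and cutting off a fraction of them. The sketch below is a simplified stand-in written with the standard library only. It is not sklearn's implementation, and holdout_split and its seed parameter are hypothetical names used for illustration.

```python
import math
import random

def holdout_split(texts, labels, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out a `test_size` fraction."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(indices) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    # Texts and labels are indexed together so each pair stays aligned
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )

toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
tr_x, va_x, tr_y, va_y = holdout_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))
```

With the real utility, passing random_state to train_test_split makes the split reproducible between runs, and stratify=train_labels keeps the class balance the same in both parts.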

Alright, we’ve read in our dataset. Now let’s tackle tokenization. We’ll eventually train a classifier using pre-trained DistilBert, so let’s use the DistilBert tokenizer.

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We’ll pass truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model’s maximum input length. This will allow us to feed batches of sequences into the model at the same time.

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
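
The effect of truncation=True and padding=True can be sketched on plain lists of token ids. This toy version is an illustration under simplifying assumptions: the ids are made up, pad_and_truncate is a hypothetical helper, and truncation here is a plain slice, whereas the real tokenizer also keeps the closing special token.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad all of them to the longest one left."""
    longest = min(max(len(seq) for seq in sequences), max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Every row ends up with the same length, which is what allows a batch to be stacked into a single rectangular tensor.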

## Building the model and Fine-tuning with Trainer on Gaudi

Now, let’s turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__. We put the data in this format so that it can be easily batched, with each key in the batch encoding corresponding to a named parameter of the forward() method of the model we will train.

In [None]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
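
The link between dataset items and the model's named parameters can be shown without PyTorch. In this sketch ToyDataset mirrors the class above but uses plain lists, and collate is a hypothetical helper that stands in for the batching done by a data loader.

```python
class ToyDataset:
    """Plain-Python mirror of IMDbDataset: one dict per example."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

def collate(items):
    """Group per-example dicts into one dict of lists, keyed by field name."""
    return {key: [item[key] for item in items] for key in items[0]}

toy = ToyDataset(
    {"input_ids": [[101, 7, 102], [101, 9, 102]],
     "attention_mask": [[1, 1, 1], [1, 1, 1]]},
    [1, 0],
)
toy_collated = collate([toy[i] for i in range(len(toy))])
print(toy_collated)
```

Because the batch is a dict keyed by input_ids, attention_mask and labels, it can be unpacked as keyword arguments when calling the model.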

The steps above prepared the datasets in the format the trainer expects. Now all we need to do is create a model to fine-tune, define the training arguments, and instantiate a trainer. Next, let's enable training on Gaudi by setting the following variables:
- The argument use_habana selects Gaudi as the default device;
- The argument hmp enables mixed precision;
- The ops_hmp_bf16 and ops_hmp_fp32 files specify the BF16 op list and the FP32 op list;
- The hmp_verbose argument controls the printout of datatype conversions between BF16 and FP32.

In this example, we set hmp_verbose=False for a clean output. The code below passes use_habana and use_lazy_mode directly and points gaudi_config_name at a Gaudi configuration file.

In [None]:
from transformers import DistilBertForSequenceClassification
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    use_habana=True,
    use_lazy_mode=True,
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    gaudi_config_name='gaudi_config.json', 
)
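
To see what warmup_steps=500 does, the sketch below reproduces a linear warmup followed by a linear decay. It assumes the trainer keeps its default linear scheduler and default learning rate of 5e-5, neither of which is set explicitly above, and it assumes 3750 optimizer steps in total (20,000 training examples, batch size 16, 3 epochs, one device). The function name linear_warmup_lr is hypothetical.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3750):
    """Learning rate at a given optimizer step: ramp up linearly, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Print the rate at a few points along the run
for step in (0, 250, 500, 2125, 3750):
    print(step, linear_warmup_lr(step))
```

The rate starts at zero, reaches its peak at step 500, and falls back to zero at the final step.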

Now we can either train the model starting from the pretrained weights or, if a previous run already saved a checkpoint, load that checkpoint instead. The full training run is configured for 3 epochs.

In [None]:
# Train from the pretrained model unless a saved checkpoint already exists
if not os.path.isdir("./results/checkpoint-3500"):
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
    trainer = GaudiTrainer(
        model=model,                  # the instantiated Transformers model to be trained
        args=training_args,           # training arguments, defined above
        train_dataset=train_dataset,  # training dataset
        eval_dataset=val_dataset,     # evaluation dataset
    )
    trainer.train()
else:
    # Reuse the weights saved by an earlier run
    model = DistilBertForSequenceClassification.from_pretrained("./results/checkpoint-3500")
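
The checkpoint-3500 folder name follows from simple step arithmetic. The sketch below assumes a single device, no gradient accumulation, and a checkpoint written every 500 steps, which is the trainer's default when save_steps is not set. With a different device count or save interval, the final checkpoint number changes.

```python
import math

train_examples = 25_000 * 8 // 10  # 80% of the 25,000 IMDb training reviews
batch_size = 16                    # per_device_train_batch_size, one device assumed
epochs = 3                         # num_train_epochs
save_steps = 500                   # assumption: default checkpoint interval

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Checkpoints land on multiples of save_steps, so the last one is at or before the final step
last_checkpoint = (total_steps // save_steps) * save_steps

print(steps_per_epoch, total_steps, last_checkpoint)
```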

After the training finishes, we can evaluate the training results using the validation dataset. The function compute_metrics is used to calculate the accuracy number.

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; pick the highest-scoring class per example
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Re-create the trainer so that evaluation reports accuracy
trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
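
As a sanity check on what compute_metrics calculates, here is a dependency-free sketch of the same argmax-then-compare logic on made-up logits. The helper names argmax and accuracy are hypothetical, and the numbers are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 1.5]]
toy_labels = [0, 1, 0, 1]
toy_result = accuracy(toy_logits, toy_labels)
print(toy_result)  # {'accuracy': 0.75}
```

Three of the four predictions (0, 1, 1, 1) match the labels, hence 0.75.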

## Print out the final results

At the end of the training, we can print out the final training/evaluation result.

In [None]:
print("**************** Evaluation below************")
metrics = trainer.evaluate()
metrics["eval_samples"] = len(val_dataset)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

## Gaudi training tips for the Huggingface Transformers trainer

In TrainingArguments setup:
- Set use_habana=True to enable the Gaudi device.
- Set use_lazy_mode=True to enable lazy mode for better performance.
- Set use_fused_adam=True to use Gaudi customized adam optimizer for better performance.
- Set use_fused_clip_norm=True to use Gaudi customized clip_norm kernel for better performance.
- Set hmp=True to enable mixed precision.
    * The default hmp_verbose value is True. Setting hmp_verbose=False gives a cleaner printout.
    * For mixed precision, the following flags are needed:
      * hmp_bf16='./ops_bf16_distilbert_pt.txt'
      * hmp_fp32='./ops_fp32_distilbert_pt.txt'

## Summary

A model script can be enabled on Gaudi by specifying a few Gaudi-specific arguments in TrainingArguments.