##### Copyright 2021 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# DistilBert Sequence Classification with IMDb Reviews

**An adaptation of [Huggingface Sequence Classification with IMDB Reviews](https://github.com/huggingface/notebooks/blob/master/transformers_doc/pytorch/custom_datasets.ipynb) using Habana Gaudi AI processors.**

## Overview

This tutorial walks through one example of using Huggingface Transformers models with the IMDb dataset. It shows the workflow for training the model on Gaudi and is meant to be illustrative rather than definitive.

Note: The dataset can be explored on the Huggingface hub (IMDb), and can alternatively be downloaded with the Huggingface datasets library (formerly named NLP) using load_dataset("imdb").

## Setup

Let’s start by downloading the dataset from the Large Movie Review Dataset webpage.

In [1]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2022-03-02 06:42:46--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving proxy-us.intel.com (proxy-us.intel.com)... 10.1.192.48
Connecting to proxy-us.intel.com (proxy-us.intel.com)|10.1.192.48|:911... connected.
Proxy request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-03-02 06:42:49 (38.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
!tar -xf aclImdb_v1.tar.gz

### Install required libraries
We will install the Habana version of transformers inside the Docker container.

In [3]:
!git clone --depth=1 https://github.com/HabanaAI/Model-References.git

Cloning into 'Model-References'...
remote: Enumerating objects: 2956, done.
remote: Counting objects: 100% (2956/2956), done.
remote: Compressing objects: 100% (2132/2132), done.
remote: Total 2956 (delta 832), reused 2323 (delta 745), pack-reused 0
Receiving objects: 100% (2956/2956), 19.04 MiB | 2.38 MiB/s, done.
Resolving deltas: 100% (832/832), done.
Checking out files: 100% (2668/2668), done.


In [4]:
pip install Model-References/PyTorch/nlp/finetuning/huggingface/bert/transformers/.

Processing ./Model-References/PyTorch/nlp/finetuning/huggingface/bert/transformers
Collecting filelock
  Downloading https://files.pythonhosted.org/packages/cd/f1/ba7dee3de0e9d3b8634d6fbaa5d0d407a7da64620305d147298b683e5c36/filelock-3.6.0-py3-none-any.whl
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading https://files.pythonhosted.org/packages/c8/df/1b454741459f6ce75f86534bdad42ca17291b14a83066695f7d2c676e16c/huggingface_hub-0.4.0-py3-none-any.whl (67kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/ec/e5/407e634cbd3b96a9ce6960874c5b66829592ead9ac762bd50662244ce20b/sacremoses-0.0.47-py2.py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/e4/bd/10c052faa46f4effb18651b66f01010872f8eddb5f4034d72c08818129bd/tokenizers-0.10.3-cp38-cp38-ma

In [5]:
pip install scikit-learn 

Collecting sklearn
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/40/d3/206905d836cd496c1f78a15ef92a0f0477d74113b4f349342bf31dfd62ca/scikit_learn-1.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7MB)
Collecting scipy>=1.1.0
  Downloading https://files.pythonhosted.org/packages/d2/27/b2648569175ba233cb6ad13029f8df4049a581c268156c5dd1db5ca44a8c/scipy-1.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.6MB)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/61/cf/6e354304bcb9c6413c4e02a747b600061c21d38ba51e7e544ac7bc66aecc/threadpoolc

In [6]:
pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/a6/45/ecbd6d5d6385b9702f8bb53801c66379edf044b373bbb77f184289cd3811/datasets-1.18.3-py3-none-any.whl (311kB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/6a/cf/50f4cfde85d90c2b3e9c98b46e17d190bbdd97b54d3e0876e1d9360e487f/xxhash-3.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212kB)
Collecting pyarrow!=4.0.0,>=3.0.0
  Downloading https://files.pythonhosted.org/packages/98/7d/fb38132dd606533b36a3fde8b17db95a36351dc58afbc6dc6b3d668ef3f0/pyarrow-7.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7MB)
Collecting multiprocess
  Downloading https://files.pythonhosted.org/packages/e6/22/b09b8394f8c86ff0cfebd725ea96bba0accd4a4b2be437bcba

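The data is organized into pos and neg folders, with one text file per example. The function below reads one split into parallel lists of texts and labels (0 for negative, 1 for positive).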

In [7]:
import os
from pathlib import Path

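def read_imdb_split(split_dir):
    """Read one IMDb split into parallel lists of review texts and labels."""
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            # Read explicitly as UTF-8 so the result does not depend on the system locale.
            texts.append(text_file.read_text(encoding="utf-8"))
            # Label encoding: 0 = negative review, 1 = positive review.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels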

train_texts, train_labels = read_imdb_split('./aclImdb/train')
test_texts, test_labels = read_imdb_split('./aclImdb/test')
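
To make the folder layout concrete, the sketch below builds a miniature tree with the same pos/neg structure in a temporary directory and reads it with the same logic. The file names and review texts are made up for illustration; only the directory layout mirrors aclImdb, and the files are sorted here purely to get a deterministic order.

```python
import tempfile
from pathlib import Path

def read_split(split_dir):
    """Mirrors read_imdb_split: one label per folder, one text per file (sorted for determinism)."""
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        for text_file in sorted((Path(split_dir) / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

# Build a toy split: <tmp>/train/pos/*.txt and <tmp>/train/neg/*.txt (hypothetical files).
tmp = tempfile.TemporaryDirectory()
toy_split = Path(tmp.name) / "train"
for label, reviews in {"pos": ["Loved it."], "neg": ["Dull.", "Too long."]}.items():
    (toy_split / label).mkdir(parents=True)
    for i, review in enumerate(reviews):
        (toy_split / label / f"{i}_0.txt").write_text(review, encoding="utf-8")

toy_texts, toy_labels = read_split(toy_split)
print(list(zip(toy_labels, toy_texts)))
# [(1, 'Loved it.'), (0, 'Dull.'), (0, 'Too long.')]
tmp.cleanup()
```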

We now have a train and test dataset, but let’s also create a validation set which we can use for evaluation and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

In [8]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
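
As a rough illustration of what test_size=.2 does, here is a plain-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation; the helper name holdout_split and the stand-in data are hypothetical. With the 25,000 training reviews it yields the 20,000 / 5,000 split that appears later in the training and evaluation logs.

```python
import random

def holdout_split(items, labels, test_size=0.2, seed=0):
    """Shuffle indices, then hold out a `test_size` fraction as validation data."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Stand-in data with the same size as the IMDb training split (25,000 reviews).
texts = [f"review {i}" for i in range(25000)]
labels = [i % 2 for i in range(25000)]

tr_texts, va_texts, tr_labels, va_labels = holdout_split(texts, labels)
print(len(tr_texts), len(va_texts))  # 20000 5000
```

Because the split is random, repeated runs of the notebook produce different validation sets. Passing random_state to train_test_split makes the real split reproducible.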

Alright, we’ve read in our dataset. Now let’s tackle tokenization. We’ll eventually train a classifier using pre-trained DistilBert, so let’s use the DistilBert tokenizer.

In [9]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Now we can simply pass our texts to the tokenizer. We’ll pass truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model’s maximum input length. This will allow us to feed batches of sequences into the model at the same time.

In [10]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
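
The effect of truncation and padding can be sketched on plain lists of token ids. This is a toy illustration, not the tokenizer's implementation: the function name pad_and_truncate and the token ids are made up, the pad id of 0 is an assumption, and a real tokenizer also keeps the special end-of-sequence token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [seq[:max_length] for seq in batch]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[101, 7592, 102], [101, 2023, 3185, 2001, 2307, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
# [[101, 7592, 102, 0], [101, 2023, 3185, 2001], [101, 102, 0, 0]]
print(encoded["attention_mask"])
# [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
```

After this step every row has the same length, which is what allows a batch to be stacked into a single tensor.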

## Building the model and Fine-tuning with Trainer on Gaudi

Now, let’s turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing torch.utils.data.Dataset and implementing `__len__` and `__getitem__`. We put the data in this format so that it can be easily batched, with each key in the batch encoding corresponding to a named parameter of the forward() method of the model we will train.

In [11]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
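
The same protocol can be seen without torch. The sketch below uses a hypothetical ToyDataset and a hand-written collate function (both assumptions for illustration, with made-up token ids) to show how per-example dictionaries turn into a batch whose keys match the keyword arguments the model's forward() accepts.

```python
class ToyDataset:
    """Map-style dataset: same __len__/__getitem__ protocol as IMDbDataset, without torch."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: pick row `idx` from every encoding field, then attach its label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

def collate(items):
    """Stack per-example dicts into one dict of lists, keyed by argument name."""
    return {key: [item[key] for item in items] for key in items[0]}

toy_encodings = {
    "input_ids": [[101, 7592, 102], [101, 2307, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyDataset(toy_encodings, labels=[1, 0])
batch = collate([toy[i] for i in range(len(toy))])
print(sorted(batch))    # ['attention_mask', 'input_ids', 'labels']
print(batch["labels"])  # [1, 0]
```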

The steps above prepared the datasets in the format the Trainer expects. Now all we need to do is create a model to fine-tune, define the TrainingArguments and instantiate a Trainer. Next, let's enable training on Gaudi by setting the following variables in TrainingArguments:
- The argument use_hpu sets Gaudi as the default device;
- The argument hmp enables mixed precision;
- The argument hmp_opt_level defines the level of optimization. It accepts two values, 'O1' and 'O2', and its default value is 'O1'. For hmp_opt_level='O1', hmp_bf16 and hmp_fp32 are required to specify the BF16 op list and the FP32 op list;
- The argument hmp_verbose controls the printout of datatype conversions between BF16 and FP32.

In this example, we keep the default mixed precision optimization level hmp_opt_level='O1' and set hmp_verbose=False for a clean output.

In [12]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    use_hpu=True,
    use_lazy_mode=True,
    use_fused_adam=True,
    use_fused_clip_norm=True,
    hmp=True,
    hmp_bf16='./ops_bf16_distilbert_pt.txt',
    hmp_fp32='./ops_fp32_distilbert_pt.txt',
    hmp_verbose=False,
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib


synapse_logger INFO. pid=14 at /home/jenkins/workspace/cdsoftwarebuilder/create-pytorch---bpt-d/repos/pytorch-integration/pytorch_helpers/synapse_logger/synapse_logger.cpp:340 Done command: restart


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
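
To see what warmup_steps=500 means in practice, here is a minimal sketch of a linear warmup followed by linear decay. It assumes the Trainer's usual defaults, a linear scheduler and a base learning rate of 5e-5, neither of which is set explicitly in the cell above, and it uses the 3750 total optimization steps reported in the training log. The function name linear_warmup_lr is hypothetical.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=3750):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

# Start of training, mid-warmup, peak, halfway through the decay, and the final step.
for step in (0, 250, 500, 2125, 3750):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate peaks at step 500 and falls back to zero at the last step.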


Now we can either fine-tune the pretrained model or, if a previous run has already saved its last checkpoint, load that checkpoint and skip training. The full training run is configured for 3 epochs.

In [13]:
if not os.path.isdir("./results/checkpoint-3500"):
    # No final checkpoint yet: fine-tune from the pretrained weights.
    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
    trainer = Trainer(
        model=model,                         # the instantiated Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=train_dataset,         # training dataset
        eval_dataset=val_dataset             # evaluation dataset
    )
    trainer.train()
else:
    # A previous run already finished: reuse its last checkpoint and skip training.
    model = DistilBertForSequenceClassification.from_pretrained("./results/checkpoint-3500")

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_clas

hmp:verbose_mode  False
hmp:opt_level O1


***** Running training *****
  Num examples = 20000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3750


| Step | Training Loss |
|------|---------------|
| 10 | 0.6875 |
| 20 | 0.6813 |
| 30 | 0.6875 |
| 40 | 0.6875 |
| 50 | 0.6844 |
| 60 | 0.6813 |
| 70 | 0.6625 |
| 80 | 0.6312 |
| 90 | 0.6 |
| 100 | 0.4656 |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3000
Configuration saved in ./results/checkpoint-3
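
The step counts in the log follow from simple arithmetic, which also explains why the cell above checks for ./results/checkpoint-3500. The sketch below reproduces the numbers; the 500-step save interval is read off the checkpoint names in the log rather than set explicitly in TrainingArguments.

```python
import math

num_examples = 20000   # training examples after the 80/20 split
batch_size = 16        # per_device_train_batch_size, single device
num_epochs = 3         # num_train_epochs
save_steps = 500       # checkpoint interval seen in the log above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps)  # 1250 3750
print(checkpoints[-1])               # 3500
```

Since 3750 is not a multiple of 500, the last checkpoint written during training is checkpoint-3500, 250 steps before the end of the run.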

After training finishes, we can evaluate the model on the validation dataset. The function compute_metrics calculates the accuracy from the predicted logits, and we pass it to a new Trainer instance.

In [14]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

Downloading:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

Enabled lazy mode


hmp:verbose_mode  False
hmp:opt_level O1
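
What compute_metrics calculates can be shown on a handful of made-up logits in plain Python. This is a toy equivalent for illustration, not the datasets metric itself; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four examples, two classes (0 = negative, 1 = positive).
toy_logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The Trainer prefixes the returned key with eval_, which is why the value is reported as eval_accuracy.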


## Print out the final results

At the end of the training, we can print out the final evaluation results.

In [15]:
print("**************** Evaluation below************")
metrics = trainer.evaluate()
metrics["eval_samples"] = len(val_dataset)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** Running Evaluation *****
  Num examples = 5000
  Batch size = 64


**************** Evaluation below************


***** eval metrics *****
  eval_accuracy           =     0.9248
  eval_loss               =     0.2556
  eval_runtime            = 0:00:26.47
  eval_samples            =       5000
  eval_samples_per_second =    188.837
  eval_steps_per_second   =      2.984
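
The reported throughput figures are consistent with each other, as a quick check shows. The sketch below only uses numbers from the output above; the runtime is derived from the samples-per-second figure because the printed eval_runtime is shortened to two decimals.

```python
import math

eval_samples = 5000
eval_batch_size = 64            # per_device_eval_batch_size
samples_per_second = 188.837    # as reported above
eval_accuracy = 0.9248          # as reported above

eval_steps = math.ceil(eval_samples / eval_batch_size)
runtime_seconds = eval_samples / samples_per_second
steps_per_second = eval_steps / runtime_seconds
correct_predictions = round(eval_accuracy * eval_samples)

print(eval_steps)                  # 79
print(round(runtime_seconds, 2))   # 26.48
print(round(steps_per_second, 3))  # 2.984
print(correct_predictions)         # 4624
```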


## Gaudi training tips for the Trainer in Huggingface Transformers

In the TrainingArguments setup:
- Set use_hpu=True to enable the Gaudi device.
- Set use_lazy_mode=True to enable lazy mode for better performance.
- Set use_fused_adam=True to use the Gaudi-customized Adam optimizer for better performance.
- Set use_fused_clip_norm=True to use the Gaudi-customized clip_norm kernel for better performance.
- Set hmp=True to enable mixed precision.
    * The default hmp_verbose value is True. Setting hmp_verbose=False gives a cleaner printout.
    * The default hmp_opt_level value is 'O1'.
    * For hmp_opt_level='O1', the following flags are needed:
      * hmp_bf16='./ops_bf16_distilbert_pt.txt'
      * hmp_fp32='./ops_fp32_distilbert_pt.txt'
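
One way to keep these settings together is to collect them in a dictionary and check the O1 rule before building TrainingArguments. The sketch below is only an illustration: check_hmp_args is a hypothetical helper, not part of transformers or any Habana library, and the values are the ones used in this tutorial.

```python
# Gaudi-specific TrainingArguments values used in this tutorial.
gaudi_args = {
    "use_hpu": True,
    "use_lazy_mode": True,
    "use_fused_adam": True,
    "use_fused_clip_norm": True,
    "hmp": True,
    "hmp_bf16": "./ops_bf16_distilbert_pt.txt",
    "hmp_fp32": "./ops_fp32_distilbert_pt.txt",
    "hmp_verbose": False,
}

def check_hmp_args(args):
    """Hypothetical helper: return the op-list flags that are missing for mixed precision at 'O1'."""
    if not args.get("hmp"):
        return []  # mixed precision disabled, nothing to check
    if args.get("hmp_opt_level", "O1") != "O1":
        return []  # the op lists are only required at level 'O1'
    return [key for key in ("hmp_bf16", "hmp_fp32") if not args.get(key)]

print(check_hmp_args(gaudi_args))     # []
print(check_hmp_args({"hmp": True}))  # ['hmp_bf16', 'hmp_fp32']
```

With the Habana version of transformers installed, the dictionary can be unpacked alongside the regular arguments, for example TrainingArguments(output_dir='./results', **gaudi_args).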


## Summary

One can easily enable their model script on Gaudi by specifying a few Gaudi arguments in TrainingArguments.