# Task-specific knowledge distillation for BERT using Hugging Face Transformers
### Text Classification Example using `BERT-Base` as Teacher and `BERT-Tiny` as Student

Welcome to our end-to-end task-specific knowledge distillation text-classification example using Transformers, PyTorch & Amazon SageMaker. Distillation is the process of training a small "student" model to mimic a larger "teacher" model. In this example, we will use [BERT-base](https://huggingface.co/textattack/bert-base-uncased-SST-2) as the Teacher and [BERT-Tiny](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) as the Student. We will use [Text-Classification](https://huggingface.co/tasks/text-classification) as the task-specific knowledge distillation task and the [Stanford Sentiment Treebank v2 (SST-2)](https://paperswithcode.com/dataset/sst) dataset for training.


There are two different types of knowledge distillation: task-agnostic knowledge distillation (right) and task-specific knowledge distillation (left). In this example we are going to use task-specific knowledge distillation.

![knowledge-distillation](https://github.com/JadMokdad/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/imgs/knowledge-distillation.png?raw=1)
_Task-specific distillation (left) versus task-agnostic distillation (right). Figure from FastFormers by Y. Kim and H. Awadalla [arXiv:2010.13382]._


In task-specific knowledge distillation, a "second step of distillation" is used to "fine-tune" the model on a given dataset. This idea comes from the [DistilBERT paper](https://arxiv.org/pdf/1910.01108.pdf), where it was shown that a student trained this way performed better than simply fine-tuning the distilled language model:

> We also studied whether we could add another step of distillation during the adaptation phase by fine-tuning DistilBERT on SQuAD using a BERT model previously fine-tuned on SQuAD as a teacher for an additional term in the loss (knowledge distillation). In this setting, there are thus two successive steps of distillation, one during the pre-training phase and one during the adaptation phase. In this case, we were able to reach interesting performances given the size of the model: 79.8 F1 and 70.4 EM, i.e. within 3 points of the full model.

If you are interested in these topics, you should definitely read:
* [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)
* [FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382)

The [FastFormers paper](https://arxiv.org/abs/2010.13382) in particular contains great research on what works and doesn't work when using knowledge distillation.

---

Huge thanks to [Lewis Tunstall](https://www.linkedin.com/in/lewis-tunstall/) and his great [Weeknotes: Distilling distilled transformers](https://lewtun.github.io/blog/weeknotes/nlp/huggingface/transformers/2021/01/17/wknotes-distillation-and-generation.html#fn-1)


## Installation

In [1]:
#%pip install "torch==1.10.1"
%pip install transformers datasets tensorboard --upgrade

!sudo apt-get install git-lfs


In [2]:
!pip install --upgrade peft


In [3]:
%pip install -U bitsandbytes



This example will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To be able to push our model to the Hub, you need to register on [Hugging Face](https://huggingface.co/join).
If you already have an account, you can skip this step.
Once you have an account, we will use the `notebook_login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.

In [4]:
from huggingface_hub import notebook_login

notebook_login()



## Setup & Configuration

In this step we will define global configurations and parameters, which are used across the whole end-to-end fine-tuning process, e.g. the `teacher` and `student` we will use.

In this example, we will use [BERT-base](https://huggingface.co/textattack/bert-base-uncased-SST-2) as the Teacher and [BERT-Tiny](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) as the Student. Our Teacher is already fine-tuned on our dataset, so we can start the distillation training job directly instead of fine-tuning the teacher first and distilling it afterwards.

_**IMPORTANT**: This example will only work with a `Teacher` & `Student` combination where the Tokenizer is creating the same output._

Additionally, the [FastFormers: Highly Efficient Transformer Models for Natural Language Understanding](https://arxiv.org/abs/2010.13382) paper describes a further phenomenon.
> In our experiments, we have observed that distilled models do not work well when distilled to a different model type. Therefore, we restricted our setup to avoid distilling RoBERTa model to BERT or vice versa. The major difference between the two model groups is the input token (sub-word) embedding. We think that different input embedding spaces result in different output embedding spaces, and knowledge transfer with different spaces does not work well

In [5]:
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForSequenceClassification, DataCollatorWithPadding, BitsAndBytesConfig
from peft import PeftConfig, PeftModel
from datasets import load_dataset, load_metric
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

In [6]:
# student_id = "google/bert_uncased_L-2_H-128_A-2"
student_id = "FacebookAI/roberta-base"
# teacher_id = "textattack/bert-base-uncased-SST-2"
peft_teacher_id = "hassanalsawadi/roberta-large-lora"
peft_teacher_config = PeftConfig.from_pretrained(peft_teacher_id)

# name for our repository on the hub
repo_name = "tiny-bert-sst2-distilled"


Below is a check to make sure the `Teacher` & `Student` tokenizers create the same output.

In [7]:
# init tokenizer
teacher_tokenizer = AutoTokenizer.from_pretrained(peft_teacher_config.base_model_name_or_path)
student_tokenizer = AutoTokenizer.from_pretrained(student_id)

# sample input
sample = "This is a basic example, with different words to test."

# assert results
assert teacher_tokenizer(sample) == student_tokenizer(sample), "Tokenizers haven't created the same output"
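
The check above compares the tokenizers on a single sentence. A slightly broader check over several inputs might look like the sketch below. The helper `tokenizers_agree` and the two toy tokenizers are hypothetical stand-ins so that the example runs without downloading anything; with the real models you would pass `teacher_tokenizer` and `student_tokenizer` instead.

```python
def tokenizers_agree(tokenizer_a, tokenizer_b, samples):
    """Return the samples for which the two tokenizers produce different output."""
    return [text for text in samples if tokenizer_a(text) != tokenizer_b(text)]

# Hypothetical stand-in tokenizers (not real Hugging Face tokenizers).
def lowercase_tokenizer(text):
    return {"tokens": text.lower().split()}

def cased_tokenizer(text):
    return {"tokens": text.split()}

samples = [
    "this is a basic example, with different words to test.",
    "Capital Letters and punctuation?!",
    "",
]

# Only inputs containing uppercase characters distinguish the two toy tokenizers.
mismatches = tokenizers_agree(lowercase_tokenizer, cased_tokenizer, samples)
print(mismatches)
```

An empty list means the tokenizers agreed on every sample; it does not prove they agree on all possible inputs.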



## Dataset & Pre-processing

As our dataset we will use the [Stanford Sentiment Treebank v2 (SST-2)](https://paperswithcode.com/dataset/sst), a text-classification dataset for `sentiment-analysis` that is included in the [GLUE benchmark](https://gluebenchmark.com/). The dataset is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. It uses the two-way (positive/negative) class split, with only sentence-level labels.


In [8]:
dataset_id="glue"
dataset_config="sst2"

To load the `sst2` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.


In [9]:
dataset = load_dataset(dataset_id,dataset_config)
dataset


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

### Pre-processing & Tokenization

To distill our model we need to convert our "Natural Language" to token IDs. This is done by a 🤗 Transformers Tokenizer which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary). If you are not sure what this means check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

We are going to use the tokenizer of the `Teacher`, but since both tokenizers create the same output, you could also go with the `Student` tokenizer.


In [10]:
tokenizer = AutoTokenizer.from_pretrained(peft_teacher_config.base_model_name_or_path)

Additionally, we pass `truncation=True` and `max_length=512` to truncate texts that are longer than the maximum size allowed by the model, and `padding='max_length'` to pad shorter texts to that same length.

In [11]:
def process(examples):
    tokenized_inputs = tokenizer(
        examples["sentence"], truncation=True, max_length=512, padding='max_length'
    )
    return tokenized_inputs

tokenized_datasets = dataset.map(process, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label","labels")

tokenized_datasets["test"].features


{'sentence': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
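
The `process` function above pads every sentence to 512 tokens, whereas `DataCollatorWithPadding` by default pads each batch only to its longest sequence. A rough sketch of how many tokens each strategy produces is shown below. The sequence lengths are made up and the helper `padded_token_count` is hypothetical, so treat the numbers as an illustration rather than a measurement on SST-2.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total tokens after padding: to a fixed length if pad_to is given,
    otherwise to the longest sequence in each batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total

# made-up token counts for eight short sentences
lengths = [7, 12, 5, 30, 9, 14, 22, 6]

fixed = padded_token_count(lengths, batch_size=4, pad_to=512)
dynamic = padded_token_count(lengths, batch_size=4)
print(fixed, dynamic)  # 4096 208
```

For short inputs such as SST-2 phrases, most of a 512-token row is padding, so leaving the padding to the collator may reduce compute; how much depends on the actual length distribution.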

In [12]:
tokenized_datasets["test"].column_names

['sentence', 'labels', 'idx', 'input_ids', 'attention_mask']

In [13]:
# example of creating an iterable data loader from the tokenized training input

# use this line for models whose tokenizer returns token_type_ids (e.g. BERT)
#tokenized_datasets["train"].set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

# use this line for models whose tokenizer does not return token_type_ids (e.g. RoBERTa, as used here, or Llama)
tokenized_datasets["train"].set_format(type='torch', columns=['input_ids', 'idx', 'attention_mask', 'labels'])

dataloader = torch.utils.data.DataLoader(tokenized_datasets["train"], batch_size=32)
next(iter(dataloader))

{'labels': tensor([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
         0, 1, 0, 1, 1, 0, 0, 1]),
 'idx': tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]),
 'input_ids': tensor([[    0, 37265,    92,  ...,     1,     1,     1],
         [    0, 10800,  5069,  ...,     1,     1,     1],
         [    0,  6025,  6138,  ...,     1,     1,     1],
         ...,
         [    0,   627,  6197,  ...,     1,     1,     1],
         [    0,   627,   814,  ...,     1,     1,     1],
         [    0,   261,    70,  ...,     1,     1,     1]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

In [14]:
# another example to take a sample of the original data
# useful to examine the data
iterable_dataset = dataset["train"].shuffle(seed=42).to_iterable_dataset()
sample_list = list(iterable_dataset.take(10))
sample_list

[{'sentence': 'klein , charming in comedies like american pie and dead-on in election , ',
  'label': 1,
  'idx': 32326},
 {'sentence': 'be fruitful ', 'label': 1, 'idx': 27449},
 {'sentence': 'soulful and ', 'label': 1, 'idx': 60108},
 {'sentence': 'the proud warrior that still lingers in the souls of these characters ',
  'label': 1,
  'idx': 23141},
 {'sentence': 'covered earlier and much better ', 'label': 0, 'idx': 35226},
 {'sentence': 'wise and powerful ', 'label': 1, 'idx': 66852},
 {'sentence': 'a powerful and reasonably fulfilling gestalt ',
  'label': 1,
  'idx': 65093},
 {'sentence': 'smart and newfangled ', 'label': 1, 'idx': 47847},
 {'sentence': 'it too is a bomb . ', 'label': 0, 'idx': 39440},
 {'sentence': 'guilty about it ', 'label': 0, 'idx': 56428}]

In [15]:
# follow this link if you're having troubles running GPU on Colab https://saturncloud.io/blog/how-to-activate-gpu-computing-in-google-colab/
!nvidia-smi

Mon Jul 29 19:15:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              48W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Distilling the model using `PyTorch` and `DistillationTrainer`


Now that our `dataset` is processed, we can distill the model. Normally, when fine-tuning a transformer model using PyTorch, you would use the `Trainer` API. The [Trainer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.Trainer) class provides an API for feature-complete training in PyTorch for most standard use cases.

In our example we cannot use the `Trainer` out of the box, since we need to pass in two models, the `Teacher` and the `Student`, and compute a loss from the outputs of both. But we can subclass the `Trainer` to create a `DistillationTrainer` that takes care of this by overriding only the [compute_loss](https://github.com/huggingface/transformers/blob/c4ad38e5ac69e6d96116f39df789a2369dd33c21/src/transformers/trainer.py#L1962) method and the `__init__` method. In addition, we also need to subclass `TrainingArguments` to include our distillation hyperparameters.
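
Written out, the objective that the `compute_loss` override in the next cell implements can be sketched as follows. The symbols are introduced here for illustration and do not appear in the original notebook: $z_s$ and $z_t$ are the student and teacher logits, $y$ the labels, $T$ the `temperature`, $\alpha$ the `alpha` weight, and $\sigma$ the softmax.

$$
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}}\big(\sigma(z_s),\, y\big) + (1 - \alpha)\, T^{2}\, \mathrm{KL}\big(\sigma(z_t / T) \,\big\|\, \sigma(z_s / T)\big)
$$

With $\alpha = 1$ the student learns from the hard labels only, and with $\alpha = 0$ it learns from the teacher's softened distribution only.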


In [16]:
class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        self.alpha = alpha
        self.temperature = temperature

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        # place teacher on same device as student
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # num_items_in_batch is accepted (and ignored) because newer
        # transformers releases pass it to compute_loss

        # compute student output; its loss is the cross-entropy against the labels
        outputs_student = model(**inputs)
        student_loss = outputs_student.loss
        # compute teacher output without tracking gradients
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # both models must produce logits of the same shape
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # soften both distributions with the temperature and compute the KL divergence;
        # the T^2 factor keeps gradient magnitudes comparable across temperatures
        loss_function = nn.KLDivLoss(reduction="batchmean")
        loss_logits = loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1),
        ) * (self.args.temperature ** 2)
        # return the weighted sum of the student loss and the distillation loss
        loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
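
To make the loss in `compute_loss` concrete, here is a minimal sketch that evaluates the same weighted sum for a single example using only the standard library. The logits are made up, and `softmax` and `distillation_loss` are hypothetical helper names; for one example, the `batchmean` reduction used above reduces to the plain KL sum computed here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of cross-entropy and T^2-scaled KL(teacher || student) for one example."""
    # hard-label cross-entropy on the unscaled student distribution
    cross_entropy = -math.log(softmax(student_logits)[label])
    # KL divergence between the softened teacher and student distributions
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(p_teacher, p_student))
    return alpha * cross_entropy + (1.0 - alpha) * (temperature ** 2) * kl

# made-up logits for the two SST-2 classes (negative, positive)
teacher = [-2.0, 3.0]
student = [0.5, 1.0]
loss = distillation_loss(student, teacher, label=1, alpha=0.5, temperature=4.0)
print(loss)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains.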

### Hyperparameter Definition, Model Loading

In [17]:
# create label2id, id2label dicts for nice outputs for the model
labels = tokenized_datasets["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

label2id, id2label

({'negative': '0', 'positive': '1'}, {'0': 'negative', '1': 'positive'})

In [30]:
# Freeze the teacher: set requires_grad to False for all parameters
# (teacher_model is created in the model-loading cell below)
for param in teacher_model.parameters():
    param.requires_grad_(False)

In [33]:
from huggingface_hub import HfFolder

# create label2id, id2label dicts for nice outputs for the model
labels = tokenized_datasets["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# define training args
training_args = DistillationTrainingArguments(
    output_dir=repo_name,
    num_train_epochs=7,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
    learning_rate=6e-5,
    seed=33,
    # logging & evaluation strategies
    logging_dir=f"{repo_name}/logs",
    logging_strategy="epoch", # to get more information to TB
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="tensorboard",
    # push to hub parameters
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repo_name,
    hub_token=HfFolder.get_token(),
    # distillation parameters
    alpha=0.5,
    temperature=4.0,
    warmup_steps=100,
    max_steps=200,
    )

# define data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 4-bit quantization config used to load the teacher
config = BitsAndBytesConfig(
    load_in_4bit=True, # quantize the model to 4-bits when you load it
    load_in_8bit=False, # must stay False when loading in 4-bit
    bnb_4bit_quant_type="nf4", # use a special 4-bit data type for weights initialized from a normal distribution
    bnb_4bit_use_double_quant=True, # nested quantization scheme to quantize the already quantized weights
    bnb_4bit_compute_dtype=torch.bfloat16, # use bfloat16 for faster computation
    llm_int8_skip_modules=["classifier", "pre_classifier"] # keep the classification head unquantized
)

# define model
teacher_model = AutoModelForSequenceClassification.from_pretrained(
    peft_teacher_config.base_model_name_or_path,
    quantization_config=config,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    return_dict=True,
    device_map='auto'
)
# freeze the teacher: it is only used for inference during distillation
teacher_model.requires_grad_(False)


teacher_model = PeftModel.from_pretrained(teacher_model, peft_teacher_id)


# define student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student_id,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
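
To get a feel for why the teacher is loaded in 4-bit, the following back-of-the-envelope sketch compares weight storage at 16 and 4 bits per parameter. It assumes roughly 355 million parameters for the `roberta-large` teacher and ignores quantization constants, layers that stay unquantized (such as the skipped classifier), and activations, so the real footprint will be somewhat higher. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

# assumption: roughly 355 million parameters for the roberta-large teacher
teacher_parameters = 355_000_000

fp16_gb = weight_memory_gb(teacher_parameters, 16)
nf4_gb = weight_memory_gb(teacher_parameters, 4)
print(f"fp16: {fp16_gb:.2f} GB, 4-bit: {nf4_gb:.2f} GB")
```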


### Evaluation metric

We can create a `compute_metrics` function to evaluate our model on the validation set. This function will be used during the training process to compute the `accuracy` of our model.

In [34]:
# define metrics and metrics function
# note: load_metric is deprecated in 🤗 Datasets; the evaluate library is its replacement
accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    return {
        "accuracy": acc["accuracy"],
    }
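
The `compute_metrics` function above reports accuracy only. If F1 is wanted as well, the computation can be sketched in plain Python as below. The function name `accuracy_and_f1` and the toy logits are hypothetical, and in practice a library metric would probably be used instead of hand-written counting.

```python
def accuracy_and_f1(logits, labels, positive_class=1):
    """Compute accuracy and binary F1 from per-example logits and integer labels."""
    # argmax over the class dimension of each row
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive_class and y == positive_class for p, y in pairs)
    fp = sum(p == positive_class and y != positive_class for p, y in pairs)
    fn = sum(p != positive_class and y == positive_class for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# made-up logits for four examples over (negative, positive)
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(logits, labels)
print(metrics)
```

Here three of four predictions are correct, and one positive example is missed, so the accuracy is 0.75 and the F1 is about 0.8.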



## Training

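Before calling `trainer.train`, we redefine the training arguments with step-based logging and evaluation and create the `DistillationTrainer`.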

In [42]:
# define training args
training_args = DistillationTrainingArguments(
    output_dir=repo_name,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    learning_rate=6e-5,
    seed=33,
    # logging & evaluation strategies
    logging_dir=f"{repo_name}/logs",
    logging_strategy="steps", # to get more information to TB
    evaluation_strategy="steps",
    save_strategy="steps",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="tensorboard",
    # push to hub parameters
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repo_name,
    hub_token=HfFolder.get_token(),
    # distillation parameters
    alpha=0.5,
    temperature=4.0,
    warmup_steps=100,
    # max_steps=200,
    eval_steps=1000,
    save_steps=1000,
    logging_steps=1000,
    )

trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



Start training using the `DistillationTrainer`.

In [43]:
trainer.train()

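| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 1000 | 0.5616 | 0.379434 | 0.891055 |
| 2000 | 0.332 | 0.311878 | 0.905963 |
| 3000 | 0.2627 | 0.238475 | 0.901376 |
| 4000 | 0.205 | 0.280751 | 0.909404 |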


TrainOutput(global_step=4210, training_loss=0.33283523867645626, metrics={'train_runtime': 931.9398, 'train_samples_per_second': 72.268, 'train_steps_per_second': 4.517, 'total_flos': 1.772026646744064e+16, 'train_loss': 0.33283523867645626, 'epoch': 1.0})
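
The `global_step` and throughput in the `TrainOutput` above can be reproduced from numbers already shown in this notebook. The sketch assumes a single device and no gradient accumulation, which is an assumption about the run rather than something the output states.

```python
import math

train_examples = 67349       # size of the SST-2 train split shown earlier
batch_size = 16              # per_device_train_batch_size in the training arguments
epochs = 1                   # num_train_epochs in the training arguments
runtime_seconds = 931.9398   # train_runtime from the TrainOutput above

# the last, partially filled batch still counts as one optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / runtime_seconds

print(total_steps)                   # 4210
print(round(samples_per_second, 1))  # 72.3
```

Both values match the reported `global_step=4210` and `train_samples_per_second` of about 72.3.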

In [None]:
# Create a data collator that handles padding dynamically
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create data loader for the training set
train_loader = DataLoader(tokenized_datasets["train"], batch_size=32, collate_fn=data_collator)

# Iterate over the data and print a batch
def iterate_and_print_batch(data_loader):
    num = 1
    for batch in data_loader:
        print(f"Batch num {num}")
        num = num + 1
        print("Features:", batch["input_ids"])
        print("Labels:", batch["labels"])
        break  # Print only the first batch

iterate_and_print_batch(train_loader)

Batch num 1
Features: tensor([[    2, 37265,    92,  ...,     1,     1,     1],
        [    2, 10800,  5069,  ...,     1,     1,     1],
        [    2,  6025,  6138,  ...,     1,     1,     1],
        ...,
        [    2,   627,  6197,  ...,     1,     1,     1],
        [    2,   627,   814,  ...,     1,     1,     1],
        [    2,   261,    70,  ...,     1,     1,     1]])
Labels: tensor([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
        0, 1, 0, 1, 1, 0, 0, 1])


In [None]:
# Ensure the models are on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# teacher_model.to(device)
# student_model.to(device)


# Function to perform inference
def run_inference(text, tokenizer, model, device):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Run the input through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the logits and apply softmax to get probabilities
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=-1)

    return probabilities

# Example input text
prompt = "The movie was nice!"
# prompt = "holden caulfield did it better ."
# prompt = "it does n't believe in itself , it has no sense of humor ... it 's just plain bored .	"
# # Define the prompt with examples
# prompt = """
# Examples of positive sentences:
# ""
# 1. I love this product!
# 2. This is the best movie I have ever seen.
# ""
# Examples of negative sentences:
# ""
# 1. I hate this product.
# 2. This is the worst movie I have ever seen.
# ""

# Classify the sentiment of the following sentence as positive or negative.
# Input sentence: holden caulfield did it better .
# """

# Run inference using the teacher model
teacher_probabilities = run_inference(prompt, tokenizer, teacher_model, device)
print("Teacher Model Probabilities:", teacher_probabilities)

# Run inference using the student model
student_probabilities = run_inference(prompt, tokenizer, student_model, device)
print("Student Model Probabilities:", student_probabilities)

Teacher Model Probabilities: tensor([[0.0046, 0.9956]], device='cuda:0', dtype=torch.float16)
Student Model Probabilities: tensor([[0.0015, 0.9985]], device='cuda:0')
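
To turn such a probability vector into a readable prediction, it can be mapped through `id2label`. The sketch below works on a plain list; with a tensor like the ones printed above you would first convert it, for example with `probabilities.squeeze(0).tolist()`. The helper name `top_label` is hypothetical.

```python
# id2label as built earlier in this notebook (string keys)
id2label = {"0": "negative", "1": "positive"}

def top_label(probabilities, id2label):
    """Return the most probable label name and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return id2label[str(best)], probabilities[best]

# student probabilities copied from the output above
label, score = top_label([0.0015, 0.9985], id2label)
print(label, score)  # positive 0.9985
```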


In [None]:
tokenized_datasets["test"].set_format(type='torch', columns=['input_ids', 'idx', 'attention_mask', 'labels'])

# Create data loader for the test set
test_loader = DataLoader(tokenized_datasets["test"], batch_size=32, collate_fn=data_collator)

In [None]:
tokenized_datasets["validation"].set_format(type='torch', columns=['input_ids', 'idx', 'attention_mask', 'labels'])

# Create data loader for the validation set
valid_loader = DataLoader(tokenized_datasets["validation"], batch_size=32, collate_fn=data_collator)

In [None]:
# Ensure the models are on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
teacher_model.to(device)
student_model.to(device)

# Function to run both models over a labeled data loader and average the distillation loss
def evaluate_models(data_loader, teacher_model, student_model, device, alpha=0.5, temperature=2.0):
    teacher_model.eval()
    student_model.eval()
    loss_function = torch.nn.KLDivLoss(reduction="batchmean")

    total_loss = 0
    total_batches = 0

    with torch.no_grad():
        for batch in data_loader:
            inputs = {key: value.to(device) for key, value in batch.items()}
            labels = inputs.pop("labels")
            # drop the dataset index: forward() does not accept it and would raise
            # a TypeError (unexpected keyword argument 'idx')
            inputs.pop("idx", None)

            # Teacher model outputs (cast to float32 to match the student)
            teacher_outputs = teacher_model(**inputs)
            teacher_logits = teacher_outputs.logits.float()

            # Student model outputs
            student_outputs = student_model(**inputs)
            student_logits = student_outputs.logits

            # Calculate loss
            student_loss = F.cross_entropy(student_logits, labels)
            distillation_loss = loss_function(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1)
            ) * (temperature ** 2)

            loss = alpha * student_loss + (1 - alpha) * distillation_loss
            total_loss += loss.item()
            total_batches += 1

    average_loss = total_loss / total_batches
    return average_loss

# Calculate the loss on the validation set; the GLUE test split has hidden labels (-1),
# so the cross-entropy term cannot be computed on it
average_loss = evaluate_models(valid_loader, teacher_model, student_model, device)
print("Average Loss:", average_loss)

## Hyperparameter Search for Distillation parameters `alpha` & `temperature` with optuna

The `alpha` & `temperature` parameters of the `DistillationTrainer` can also be tuned with a hyperparameter search to maximize our "knowledge extraction". As the hyperparameter optimization framework we are using [Optuna](https://optuna.org/), which has an integration into the `Trainer` API. Since the `DistillationTrainer` is a subclass of the `Trainer`, we can use `hyperparameter_search` without any code changes.


In [None]:
%pip install optuna

To run hyperparameter optimization with `optuna` we need to define our hyperparameter space. In this example we search over `num_train_epochs`, `learning_rate`, `alpha` and `temperature` for our `student_model`.

In [None]:
def hp_space(trial):
    return {
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "alpha": trial.suggest_float("alpha", 0, 1),
        "temperature": trial.suggest_int("temperature", 2, 30),
    }
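
To get a feel for what this search space produces without installing anything, the sketch below calls the same `hp_space` function with a hypothetical stand-in for the Optuna trial. `RandomTrialStub` is not an Optuna class; it only mimics the two `suggest_*` calls used here with plain random sampling, whereas Optuna's samplers pick values adaptively. The `log=True` branch shows the idea behind log-scale sampling: draw uniformly in log space, then exponentiate.

```python
import math
import random


class RandomTrialStub:
    """Hypothetical stand-in for an Optuna trial (illustration only)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def suggest_int(self, name, low, high):
        # Both bounds are inclusive
        return self.rng.randint(low, high)

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Log-uniform: sample uniformly in log space, then exponentiate
            return math.exp(self.rng.uniform(math.log(low), math.log(high)))
        return self.rng.uniform(low, high)


def hp_space(trial):
    # Same search space as in the cell above
    return {
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "alpha": trial.suggest_float("alpha", 0, 1),
        "temperature": trial.suggest_int("temperature", 2, 30),
    }


# Draw a few candidate configurations
for seed in range(3):
    print(hp_space(RandomTrialStub(seed)))
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, so values near `1e-5` are explored as often as values near `1e-3`.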

To start our hyperparameter search we just need to call `hyperparameter_search` and provide our `hp_space` and the number of trials to run.

In [None]:
def student_init():
    return AutoModelForSequenceClassification.from_pretrained(
        student_id,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id
    )

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=50,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2024-07-20 22:36:30,472] A new study created in memory with name: no-name-d38b57e5-356a-4cf6-8961-5611a9e7904d



Since Optuna only finds the best hyperparameters, we need to fine-tune our model again using the hyperparameters from `best_run`.

In [None]:
# overwrite the initial hyperparameters with the ones from best_run
for k, v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# Define a new repository to store our distilled model
best_model_ckpt = "tiny-bert-best"
training_args.output_dir = best_model_ckpt
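
One caveat with a plain `setattr` loop is that a misspelled key silently creates a new attribute instead of overriding an existing one. The following sketch shows a guarded variant of the same idea. `ToyTrainingArguments`, `ToyBestRun` and `apply_best_run` are hypothetical stand-ins written for this example, and the hyperparameter values are made up; only the pattern carries over to the real `training_args` and `best_run`.

```python
from dataclasses import dataclass
from typing import Any, Dict, NamedTuple


@dataclass
class ToyTrainingArguments:
    """Hypothetical stand-in for the notebook's training arguments."""
    output_dir: str = "tiny-bert"
    num_train_epochs: float = 3.0
    learning_rate: float = 5e-5
    alpha: float = 0.5
    temperature: float = 2.0


class ToyBestRun(NamedTuple):
    """Hypothetical stand-in with the fields of best_run that are read here."""
    run_id: str
    objective: float
    hyperparameters: Dict[str, Any]


def apply_best_run(args, best_run):
    """Copy hyperparameters onto args, refusing names that args does not define."""
    for name, value in best_run.hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"unknown hyperparameter: {name}")
        setattr(args, name, value)
    return args


# Made-up values, for illustration only
toy_args = ToyTrainingArguments()
toy_best = ToyBestRun(
    run_id="0",
    objective=0.84,
    hyperparameters={"num_train_epochs": 7, "learning_rate": 3e-4, "alpha": 0.2, "temperature": 9},
)
apply_best_run(toy_args, toy_best)
print(toy_args)
```

Fields that the search did not touch, such as `output_dir`, keep their previous values, which is why the output directory is set separately.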

We have overwritten the default hyperparameters with those from our `best_run` and can now start the training.

In [None]:
# Create a new Trainer with optimal parameters
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()


# save best model, metrics and create model card
optimal_trainer.create_model_card(model_name=training_args.hub_model_id)
optimal_trainer.push_to_hub()

In [None]:
from huggingface_hub import HfApi

whoami = HfApi().whoami()
username = whoami['name']

print(f"https://huggingface.co/{username}/{repo_name}")

## Results

We achieved an `accuracy` of 0.8337, which is a very good result for a model of this size. Our distilled `Tiny-Bert` has 96% fewer parameters than the teacher `bert-base` and runs ~46.5x faster while preserving roughly 90% of BERT’s performance as measured on the SST-2 dataset.

| Model      | Parameters | Speed-up | Accuracy |
|------------|------------|----------|----------|
| BERT-base  | 109M      | 1x       | 93%      |
| tiny-BERT  | 4M        | 46.5x    | 83%      |
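
The headline ratios can be sanity-checked with a few lines of arithmetic. The sketch below uses the rounded parameter counts and the teacher accuracy from the table, plus the unrounded student accuracy of 0.8337 reported above; the variable names are chosen for this example. Because the teacher accuracy is rounded to 93%, the retained-accuracy ratio comes out just under 90% here.

```python
# Figures taken from the results table (rounded) and the reported student accuracy
teacher_params = 109e6
student_params = 4e6
teacher_acc = 0.93
student_acc = 0.8337

# Fraction of parameters removed by switching to the student
param_reduction = 1 - student_params / teacher_params
# Fraction of the teacher's accuracy that the student keeps
accuracy_retained = student_acc / teacher_acc

print(f"parameter reduction: {param_reduction:.1%}")
print(f"accuracy retained:   {accuracy_retained:.1%}")
```

The speed-up is a measured quantity and cannot be derived from the table in the same way.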

_Note: The [FastFormers paper](https://arxiv.org/abs/2010.13382) found that the biggest boost in performance is observed when the student has 6 or more layers. The [google/bert_uncased_L-2_H-128_A-2](https://huggingface.co/google/bert_uncased_L-2_H-128_A-2) we used only has 2, which means that when changing our student to, e.g., `distilbert-base-uncased`, we should see better accuracy._

If you are now planning to add task-specific knowledge distillation to your models, I suggest taking a look at the [sagemaker-distillation](https://github.com/philschmid/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/sagemaker-distillation.ipynb) notebook, which shows how to run task-specific knowledge distillation on Amazon SageMaker. For the example I created a script derived from this notebook to make it as easy as possible to use. You only need to define your `teacher_id` and `student_id` as well as your `dataset` config to run task-specific knowledge distillation for `text-classification`.

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'teacher_id':'textattack/bert-base-uncased-SST-2',           
    'student_id':'google/bert_uncased_L-2_H-128_A-2',           
    'dataset_id':'glue',           
    'dataset_config':'sst2',             
    # distillation parameter
    'alpha': 0.5,
    'temperature': 4,
    # hpo parameter
    "run_hpo": True,
    "n_trials": 100,            
}

# create the Estimator
huggingface_estimator = HuggingFace(..., hyperparameters=hyperparameters)

# start knowledge distillation training
huggingface_estimator.fit()
```