<a href="https://colab.research.google.com/github/azizbarank/distilroberta-base-sst-2-distilled/blob/main/knowledge_distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary packages

In [1]:
!pip install transformers datasets tensorboard
!sudo apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


## Choosing our "teacher" and "student" models

In [2]:
student = "distilroberta-base"
teacher = "textattack/roberta-base-SST-2"

## Loading the SST-2 subset of the GLUE dataset

In [3]:
from datasets import load_dataset

dataset = load_dataset("glue","sst2")




## Tokenization

### Initializing the tokenizer of our student model

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(student)


In [5]:
def process(examples):
    tokenized_inputs = tokenizer(
        examples["sentence"], truncation=True, max_length=512
    )
    return tokenized_inputs

sst2_enc = dataset.map(process, batched=True)
sst2_enc = sst2_enc.rename_column("label","labels")

sst2_enc["test"].features






{'sentence': Value(dtype='string', id=None),
 'labels': ClassLabel(num_classes=2, names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

## Creating our Knowledge Distillation Trainer

In [6]:
from transformers import TrainingArguments

class DistillationTrainingArguments(TrainingArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        super().__init__(*args, **kwargs)

        self.alpha = alpha
        self.temperature = temperature
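
To make the role of `temperature` concrete, here is a minimal sketch in plain Python (standard library only, so it runs without PyTorch). The helper name `softmax_with_temperature` and the logits are made up for illustration and are not part of the notebook. Dividing the logits by a larger temperature before the softmax flattens the distribution, so the student also sees how confident the teacher is about the non-predicted class.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up two-class logits (negative, positive); not taken from the teacher model.
logits = [3.0, 0.5]
for t in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The probability of the top class shrinks toward 0.5 as the temperature grows, which is the "softening" that the distillation loss relies on.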

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher = teacher_model
        self._move_model_to_device(self.teacher,self.model.device)
        self.teacher.eval()

    def compute_loss(self, model, inputs, return_outputs=False):

        # compute student output
        outputs_student = model(**inputs)
        student_loss=outputs_student.loss
        # compute teacher output
        with torch.no_grad():
            outputs_teacher = self.teacher(**inputs)

        # assert size
        assert outputs_student.logits.size() == outputs_teacher.logits.size()

        # compute distillation loss and soften probabilities
        loss_function = nn.KLDivLoss(reduction="batchmean")
        loss_logits = (loss_function(
            F.log_softmax(outputs_student.logits / self.args.temperature, dim=-1),
            F.softmax(outputs_teacher.logits / self.args.temperature, dim=-1)) * (self.args.temperature ** 2))
        # return weighted student loss
        loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
        return (loss, outputs_student) if return_outputs else loss
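
The loss above can be traced by hand for a single example. The following is a sketch in plain Python, under the assumption that `nn.KLDivLoss` with a log-probability input and a probability target reduces, for one example, to KL(teacher || student). The function names and the toy numbers are hypothetical; the cross-entropy term (`hard_loss`) is passed in rather than recomputed.

```python
import math

def log_softmax(logits, temperature):
    """Log of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits, hard_loss, alpha, temperature):
    """alpha * hard_loss + (1 - alpha) * T^2 * KL(teacher || student) for one example."""
    log_p_student = log_softmax(student_logits, temperature)
    log_p_teacher = log_softmax(teacher_logits, temperature)
    # KL(teacher || student) = sum_i p_t[i] * (log p_t[i] - log p_s[i])
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    soft_loss = kl * temperature ** 2  # same T^2 rescaling as in compute_loss
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Toy values, chosen only for illustration.
loss = distillation_loss(
    student_logits=[0.2, 1.0],
    teacher_logits=[-1.5, 2.5],
    hard_loss=0.37,
    alpha=0.5,
    temperature=4.0,
)
print(loss)
```

When the student's logits equal the teacher's, the KL term is zero and only `alpha * hard_loss` remains, so `alpha` directly controls how much weight the ground-truth labels receive relative to the teacher's soft targets.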

## Defining the Metric

In [8]:
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    return {
        "accuracy": acc["accuracy"],
    }
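
As a quick sanity check of what `compute_metrics` calculates, the sketch below reproduces the argmax-then-compare logic in plain Python on a few made-up logits. The helper names and data are illustrative only; the notebook itself uses NumPy and the `accuracy` metric from `datasets`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four made-up examples with two classes (0 = negative, 1 = positive).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))  # 0.75: three of four predictions match
```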



## Defining the Training Arguments

In [9]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding

# id2label, label2id dicts for the outputs for the model
labels = sst2_enc["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# training arguments
training_args = DistillationTrainingArguments(
    output_dir="distilroberta-base-sst2-distilled",
    num_train_epochs=7, per_device_train_batch_size=128,
    per_device_eval_batch_size=128, fp16=True, 
    learning_rate=6e-5, seed=33, 
    logging_dir="distilroberta-base-sst2-distilled/logs",
    logging_strategy="epoch", evaluation_strategy="epoch",
    save_strategy="epoch", save_total_limit=2, 
    load_best_model_at_end=True, metric_for_best_model="accuracy", 
    report_to="tensorboard", push_to_hub=False,
    alpha=0.5, temperature=4.0
    )

# data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# teacher model
teacher_model = AutoModelForSequenceClassification.from_pretrained(
    teacher,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# student model
student_model = AutoModelForSequenceClassification.from_pretrained(
    student,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

Some weights of the model checkpoint at textattack/roberta-base-SST-2 were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm

## Training

In [10]:
trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Using cuda_amp half precision backend


In [11]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67349
  Num Epochs = 7
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 3689
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5134 | 0.435302 | 0.895642 |
| 2 | 0.2153 | 0.433695 | 0.909404 |
| 3 | 0.15 | 0.335221 | 0.917431 |
| 4 | 0.1179 | 0.362863 | 0.917431 |
| 5 | 0.0956 | 0.366062 | 0.915138 |
| 6 | 0.0817 | 0.327687 | 0.916284 |
| 7 | 0.0729 | 0.31592 | 0.920872 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/checkpoint-527
Configuration saved in distilroberta-base-sst2-distilled/checkpoint-527/config.json
Model weights saved in distilroberta-base-sst2-distilled/checkpoint-527/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/checkpoint-527/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/checkpoint-527/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expe

TrainOutput(global_step=3689, training_loss=0.17812904881151365, metrics={'train_runtime': 1077.7269, 'train_samples_per_second': 437.442, 'train_steps_per_second': 3.423, 'total_flos': 5988547867083024.0, 'train_loss': 0.17812904881151365, 'epoch': 7.0})
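
The step counts in the log follow from the dataset size and batch size. The short calculation below is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept, which matches the "Total train batch size = 128" and "Gradient Accumulation steps = 1" lines above.

```python
import math

num_examples = 67349   # "Num examples" from the training log
batch_size = 128       # per-device batch size on a single device

# The last, smaller batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)

print(steps_per_epoch)      # 527, matching the "checkpoint-527" directory name
print(steps_per_epoch * 7)  # 3689, matching "Total optimization steps"
```

With `save_strategy="epoch"`, the first checkpoint is written after one full epoch, which is why its directory is named `checkpoint-527`.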

## Installing Optuna for Hyperparameter Tuning

In [12]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.0.2-py3-none-any.whl (348 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako
  Downloading Mako-1.2.3-py3-none-any.whl (78 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.10.0-py2.py3-none-any.whl (112 kB)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.1-py3-none-any.whl (2

## Defining the Hyperparameter Space to Optimize Over

In [13]:
def hp_space(trial):
    return {
      "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 10),
      "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3 ,log=True),
      "alpha": trial.suggest_float("alpha", 0, 1),
      "temperature": trial.suggest_int("temperature", 2, 30),
      }
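
The `log=True` flag on the learning-rate range asks for values to be drawn from the log domain, so each decade of the range is equally likely a priori. The sketch below illustrates what that means with the standard library only. It is not Optuna's implementation, the helper `sample_log_uniform` is hypothetical, and Optuna's sampler may additionally adapt its proposals to earlier trials.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]

# 1e-4 is the geometric midpoint of [1e-5, 1e-3], so roughly half of the
# samples should fall below it. A plain uniform draw would put only
# about 9% of the samples there.
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(round(below_1e4, 2))
```

For learning rates this is usually the more sensible choice, since the difference between 1e-5 and 2e-5 tends to matter as much as the difference between 1e-4 and 2e-4.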

## Running the Hyperparameter Search

In [14]:
def student_init():
    return AutoModelForSequenceClassification.from_pretrained(
        student,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id
    )

trainer = DistillationTrainer(
    model_init=student_init,
    args=training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    n_trials=2,
    direction="maximize",
    hp_space=hp_space
)

print(best_run)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilroberta-base/snapshots/c1149320821601524a8d373726ed95bbd2bc0dc2/config.json
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": "0",
    "positive": "1"
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file pytorc

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5906 | 0.479539 | 0.895642 |
| 2 | 0.2478 | 0.480304 | 0.909404 |
| 3 | 0.1731 | 0.476644 | 0.909404 |
| 4 | 0.132 | 0.438816 | 0.909404 |
| 5 | 0.1028 | 0.45274 | 0.90711 |
| 6 | 0.0843 | 0.392715 | 0.919725 |
| 7 | 0.0723 | 0.381166 | 0.918578 |
| 8 | 0.0618 | 0.359256 | 0.917431 |
| 9 | 0.0558 | 0.361851 | 0.922018 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-0/checkpoint-527
Configuration saved in distilroberta-base-sst2-distilled/run-0/checkpoint-527/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-0/checkpoint-527/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-527/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-0/checkpoint-527/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence.

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3958 | 0.349726 | 0.904817 |
| 2 | 0.1842 | 0.385815 | 0.916284 |
| 3 | 0.1179 | 0.348604 | 0.918578 |
| 4 | 0.0829 | 0.359995 | 0.925459 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 128
Saving model checkpoint to distilroberta-base-sst2-distilled/run-1/checkpoint-527
Configuration saved in distilroberta-base-sst2-distilled/run-1/checkpoint-527/config.json
Model weights saved in distilroberta-base-sst2-distilled/run-1/checkpoint-527/pytorch_model.bin
tokenizer config file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-527/tokenizer_config.json
Special tokens file saved in distilroberta-base-sst2-distilled/run-1/checkpoint-527/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence.

BestRun(run_id='1', objective=0.9254587155963303, hyperparameters={'num_train_epochs': 4, 'learning_rate': 0.00014093912322591537, 'alpha': 0.8464471686848708, 'temperature': 7})


## Updating the training arguments

In [15]:
# overwriting the previous hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(training_args, k, v)

# new output directory for the final run
best_model_ckpt = "distilroberta-best"
training_args.output_dir = best_model_ckpt
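
The `setattr` loop above simply overwrites attributes by name. The sketch below shows the same pattern on a small stand-in dataclass, `ToyArgs`, which is hypothetical and only mimics the four tuned fields. The hyperparameter values are copied from the `BestRun` output printed earlier.

```python
from dataclasses import dataclass

@dataclass
class ToyArgs:
    """Hypothetical stand-in for DistillationTrainingArguments (illustration only)."""
    num_train_epochs: int = 7
    learning_rate: float = 6e-5
    alpha: float = 0.5
    temperature: float = 4.0

# Values taken from the BestRun printed by the hyperparameter search.
best_hyperparameters = {
    "num_train_epochs": 4,
    "learning_rate": 0.00014093912322591537,
    "alpha": 0.8464471686848708,
    "temperature": 7,
}

args = ToyArgs()
for name, value in best_hyperparameters.items():
    setattr(args, name, value)  # overwrite the default with the tuned value

print(args)
```

Note that `setattr` on an existing object does not re-run whatever checks the constructor performs, so it is worth printing the arguments once to confirm that the overridden values are the ones you expect.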

## Final Training

In [16]:
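# New Trainer with the updated hyperparameters.
# Note: student_model was already fine-tuned by trainer.train() above, so this run
# continues from those weights. To restart from the pretrained checkpoint, pass
# model_init=student_init instead of student_model.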
optimal_trainer = DistillationTrainer(
    student_model,
    training_args,
    teacher_model=teacher_model,
    train_dataset=sst2_enc["train"],
    eval_dataset=sst2_enc["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

optimal_trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67349
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 2108


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.144 | 0.37922 | 0.90711 |
| 2 | 0.1085 | 0.466671 | 0.911697 |
| 3 | 0.0786 | 0.359551 | 0.915138 |
| 4 | 0.0574 | 0.358214 | 0.920872 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 128
Saving model checkpoint to distilroberta-best/checkpoint-527
Configuration saved in distilroberta-best/checkpoint-527/config.json
Model weights saved in distilroberta-best/checkpoint-527/pytorch_model.bin
tokenizer config file saved in distilroberta-best/checkpoint-527/tokenizer_config.json
Special tokens file saved in distilroberta-best/checkpoint-527/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore 

TrainOutput(global_step=2108, training_loss=0.0971325217433401, metrics={'train_runtime': 617.2345, 'train_samples_per_second': 436.457, 'train_steps_per_second': 3.415, 'total_flos': 3418702066089216.0, 'train_loss': 0.0971325217433401, 'epoch': 4.0})