
# Knowledge Distillation

Knowledge distillation is the process of transferring knowledge between two models, called the teacher model and the student model. The idea is to train a smaller model (the student) so that it reproduces the knowledge of a larger model (the teacher).

Several algorithms implement knowledge distillation with different strategies, for example:

- Adversarial Distillation;
- Multi-Teacher Distillation;
- Cross-Modal Distillation; among others.

In this notebook, we use the knowledge distillation strategy proposed by Reimers and Gurevych [1].
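
As a rough illustration of the objective, the student is trained so that its sentence embeddings match the teacher's, with mean squared error as the loss. The sketch below uses plain Python lists as stand-in embeddings; `mse_distillation_loss` and the toy values are hypothetical and not part of sentence-transformers.

```python
def mse_distillation_loss(teacher_embeddings, student_embeddings):
    """Mean squared error between teacher and student embeddings.

    Both arguments are lists of equal-length vectors (lists of floats).
    """
    if len(teacher_embeddings) != len(student_embeddings):
        raise ValueError("teacher and student batches must have the same size")
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_embeddings, student_embeddings):
        if len(t_vec) != len(s_vec):
            raise ValueError("embedding dimensions must match")
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count


# Toy embeddings (hypothetical values): the student is halfway to the teacher.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.5, 0.0], [0.0, 0.5]]
loss = mse_distillation_loss(teacher, student)
```

The loss reaches zero only when the student reproduces the teacher's embeddings exactly, which is why both models must produce vectors of the same dimension.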

Code adapted from [GitHub](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/distillation/model_distillation.py).

Data used to train the model:

[dataset1](https://sbert.net/datasets/AllNLI.tsv.gz)

[dataset2](https://sbert.net/datasets/stsbenchmark.tsv.gz)
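
The notebook expects both files under a local `datasets/` folder. A minimal sketch for fetching them once, assuming the URLs above are reachable; `ensure_dataset` is a hypothetical helper, and its `fetch` argument exists only so the logic can be exercised without network access.

```python
import os
import urllib.request


def ensure_dataset(url, path, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    fetch(url, path)
    return True


DATASETS = {
    "datasets/AllNLI.tsv.gz": "https://sbert.net/datasets/AllNLI.tsv.gz",
    "datasets/stsbenchmark.tsv.gz": "https://sbert.net/datasets/stsbenchmark.tsv.gz",
}

# Uncomment to download (requires network access):
# for path, url in DATASETS.items():
#     ensure_dataset(url, path)
```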

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
cd '/content/drive/My Drive/Colab Notebooks'

/content/drive/My Drive/Colab Notebooks


In [6]:
!pip install sentence-transformers
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
     86.0/86.0 kB 4.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
     7.1/7.1 MB 76.3 MB/s eta 0:00:00
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     1.3/1.3 MB 71.6 MB/s eta 0:00:00
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.14.1-py3-

In [25]:
from torch.utils.data import DataLoader
from sentence_transformers import models, losses, evaluation
from sentence_transformers import LoggingHandler, SentenceTransformer, util, InputExample
from sentence_transformers.datasets import ParallelSentencesDataset
import logging
from datetime import datetime
import os
import gzip
import pandas as pd
import csv
import random
from sklearn.decomposition import PCA
import torch

import warnings
warnings.filterwarnings("ignore")

In [None]:
inference_batch_size = 64
train_batch_size = 64

In [None]:
teacher_model_name = 'stsb-roberta-base-v2'
teacher_model = SentenceTransformer(teacher_model_name)

In [None]:
output_path = "output/model-distillation-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

## Prepare datasets

In [8]:
nli_dataset_path = 'datasets/AllNLI.tsv.gz'
sts_dataset_path = 'datasets/stsbenchmark.tsv.gz'

In [39]:
def prepare_nli_dataset(df):

  df_train = df.loc[df['split'] == 'train']
  df_dev = df.loc[df['split'] == 'dev']

  train_sentences = df_train['sentence1'].tolist() + df_train['sentence2'].tolist() 
  random.shuffle(train_sentences)

  dev_sentences = df_dev['sentence1'].tolist() + df_dev['sentence2'].tolist() 
  random.shuffle(dev_sentences)

  return train_sentences, dev_sentences

In [28]:
def prepare_sts_dataset(df):
  
  dev_samples = []
  for index, row in df.iterrows():
      if row['split'] == 'dev':
          score = float(row['score']) / 5.0
          dev_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=score))

  return dev_samples

In [40]:
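# on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0;
# 'warn' skips malformed rows and reports them.
nli = pd.read_csv(nli_dataset_path, sep='\t', compression='gzip', on_bad_lines='warn')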

Skipping line 593843: expected 6 fields, saw 7
Skipping line 602994: expected 6 fields, saw 7
Skipping line 644944: expected 6 fields, saw 7

Skipping line 669147: expected 6 fields, saw 7
Skipping line 719671: expected 6 fields, saw 7
Skipping line 727867: expected 6 fields, saw 7
Skipping line 742137: expected 6 fields, saw 7
Skipping line 747285: expected 6 fields, saw 7

Skipping line 790984: expected 6 fields, saw 7
Skipping line 855878: expected 6 fields, saw 7
Skipping line 883143: expected 6 fields, saw 7



In [41]:
train_sentences, dev_sentences = prepare_nli_dataset(nli)

In [27]:
# on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0.
sts = pd.read_csv(sts_dataset_path, sep='\t', compression='gzip', on_bad_lines='warn', encoding='utf8', quoting=csv.QUOTE_NONE)

In [29]:
dev_samples = prepare_sts_dataset(sts)

## Evaluate the teacher model on the STS task

In [30]:
dev_evaluator_sts = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

In [31]:
dev_evaluator_sts(teacher_model)

0.896508823316642
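
An embedding similarity evaluation of this kind typically scores each sentence pair by the cosine similarity of its two embeddings and then measures the Spearman rank correlation with the gold scores. The sketch below works through that computation on toy data, assuming cosine similarity and Spearman correlation; the exact metric returned by `EmbeddingSimilarityEvaluator` may depend on the library version and its `main_similarity` setting. All helper names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy sentence-pair embeddings and gold scores already scaled to [0, 1].
pairs = [([1.0, 0.0], [1.0, 0.0]),
         ([1.0, 0.0], [1.0, 1.0]),
         ([1.0, 0.0], [0.0, 1.0])]
gold_scores = [1.0, 0.6, 0.1]

predicted = [cosine(u, v) for u, v in pairs]
score = spearman(predicted, gold_scores)
```

Because Spearman correlation only compares rankings, the predicted similarities do not need to match the gold scores numerically; they only need to order the pairs the same way.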

## Define student model

In this notebook, I show two options for creating a distilled model. The first option builds a student model that keeps only some of the teacher model's layers. The second option takes a different, smaller model and trains it to reproduce the teacher model's behavior. To choose between the two modes, set use_layer_reduction to True (layer reduction) or False (smaller model).

In [33]:
use_layer_reduction = True

In [34]:
if use_layer_reduction:

    student_model = SentenceTransformer(teacher_model_name)
    auto_model = student_model._first_module().auto_model
    layers_to_keep = [1, 4, 7, 10]          #Keep 4 layers from the teacher

    logging.info("Remove layers from student. Only keep these layers: {}".format(layers_to_keep))
    new_layers = torch.nn.ModuleList([layer_module for i, layer_module in enumerate(auto_model.encoder.layer) if i in layers_to_keep])
    auto_model.encoder.layer = new_layers
    auto_model.config.num_hidden_layers = len(layers_to_keep)
else:

    word_embedding_model = models.Transformer('nreimers/TinyBERT_L-4_H-312_v2')
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
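
The layer-reduction branch filters the encoder's layer list by index. The same selection logic is sketched below on a plain list of names, assuming a 12-layer encoder (as in RoBERTa-base) and 0-based indices; `select_layers` is a hypothetical helper.

```python
def select_layers(layers, layers_to_keep):
    """Return the layers whose 0-based index is in `layers_to_keep`, in original order."""
    keep = set(layers_to_keep)
    return [layer for i, layer in enumerate(layers) if i in keep]


# Stand-in for a 12-layer encoder; real layers would be torch modules.
teacher_layers = ["layer_{}".format(i) for i in range(12)]
student_layers = select_layers(teacher_layers, [1, 4, 7, 10])
```

The student keeps every third layer, so it starts from trained teacher weights rather than from a random initialization.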

In [35]:

if student_model.get_sentence_embedding_dimension() < teacher_model.get_sentence_embedding_dimension():
    logging.info("Student model has fewer dimensions than the teacher. Compute PCA for down projection")
    pca_sentences = train_sentences[0:20000]
    pca_embeddings = teacher_model.encode(pca_sentences, convert_to_numpy=True)
    pca = PCA(n_components=student_model.get_sentence_embedding_dimension())
    pca.fit(pca_embeddings)

    #Add Dense layer to teacher that projects the embeddings down to the student embedding size
    dense = models.Dense(in_features=teacher_model.get_sentence_embedding_dimension(), out_features=student_model.get_sentence_embedding_dimension(), bias=False, activation_function=torch.nn.Identity())
    dense.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_))
    teacher_model.add_module('dense', dense)

    logging.info("Teacher Performance with {} dimensions:".format(teacher_model.get_sentence_embedding_dimension()))
    dev_evaluator_sts(teacher_model)
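
The Dense layer above has no bias, so it computes only the matrix product with the PCA components. scikit-learn's `PCA.transform` additionally subtracts the fitted mean first, so the two results differ by a constant vector. The sketch below illustrates this with made-up numbers projecting from 3 to 2 dimensions; `matvec` and all values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


components = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]  # each row acts as one projection direction
mean = [0.5, 0.5, 0.5]          # stand-in for the mean a PCA would fit
x = [2.0, 3.0, 4.0]             # stand-in for one teacher embedding

# Bias-free linear layer: components @ x
dense_output = matvec(components, x)

# Centered projection: components @ (x - mean)
centered_output = matvec(components, [a - m for a, m in zip(x, mean)])

# The difference is the same for every input: components @ mean
offset = [d - c for d, c in zip(dense_output, centered_output)]
```

Since the offset is identical for all sentences, teacher and student targets stay consistent with each other, though the projected embeddings are not mean-centered.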

In [43]:
train_data = ParallelSentencesDataset(student_model=student_model,
                                      teacher_model=teacher_model,
                                      batch_size=inference_batch_size,
                                      use_embedding_cache=False)
train_data.add_dataset([[str(sent)] for sent in train_sentences], max_sentence_length=256)

In [44]:
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MSELoss(model=student_model)

In [45]:
dev_evaluator_mse = evaluation.MSEEvaluator(dev_sentences, dev_sentences, teacher_model=teacher_model)

In [None]:
student_model.fit(train_objectives=[(train_dataloader, train_loss)],
                  evaluator=evaluation.SequentialEvaluator([dev_evaluator_sts, dev_evaluator_mse]),
                  epochs=1,
                  warmup_steps=1000,
                  evaluation_steps=5000,
                  output_path=output_path,
                  save_best_model=True,
                  optimizer_params={'lr': 1e-4, 'eps': 1e-6},
                  use_amp=True)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/17815 [00:00<?, ?it/s]

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-46-0a08db0c63ef>", line 1, in <cell line: 1>
    student_model.fit(train_objectives=[(train_dataloader, train_loss)],
  File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py", line 709, in fit
    with autocast():
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/autocast_mode.py", line 25, in __init__
    super().__init__("cuda", enabled=enabled, dtype=dtype, cache_enabled=cache_enabled)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 203, in __init__
    if enabled and torch.cuda.amp.common.amp_definitely_not_available() and self.device == 'cuda':
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/common.py", line 7, in amp_definitely_not_available
    return not (torch.cuda.
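
The traceback above is cut off inside PyTorch's AMP availability check, so the actual cause is not visible here. One defensive option, offered as an assumption rather than a confirmed fix, is to enable `use_amp` only when a CUDA device is available. `cuda_available` is a hypothetical helper; it returns False if PyTorch is not installed.

```python
def cuda_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# Pass this as use_amp=use_amp to student_model.fit(...)
use_amp = cuda_available()
```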

# References

[1] Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813.