In [None]:
!git clone https://github.com/huggingface/transformers.git
!pip install transformers
!pip install datasets

Cloning into 'transformers'...
remote: Enumerating objects: 139757, done.[K
remote: Counting objects: 100% (1735/1735), done.[K
remote: Compressing objects: 100% (696/696), done.[K
remote: Total 139757 (delta 1125), reused 1392 (delta 882), pack-reused 138022[K
Receiving objects: 100% (139757/139757), 137.79 MiB | 20.70 MiB/s, done.
Resolving deltas: 100% (104604/104604), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m83.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m30.1 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-c

In [1]:
!pip install sentencepiece


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [None]:
!pip install protobuf==3.20.1

This notebook demonstrates how to use a script that provides a way to improve the speed and memory performance of a zero-shot classifier by training a more efficient student model from the zero-shot teacher's predictions over an unlabeled dataset.

For a given sequence, the zero-shot classification pipeline requires each possible label to be fed through the large NLI model separately. This requirement slows results considerably, particularly for tasks with a large number of classes K.

We'll use the `tyqiangz/multilingual-sentiments` dataset for this example.

In [1]:
from datasets import load_dataset
train, test = load_dataset("tyqiangz/multilingual-sentiments", "all", split=['train', 'test'])

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset multilingual-sentiments (/home/lxyuan/.cache/huggingface/datasets/tyqiangz___multilingual-sentiments/all/1.0.0/b7cdd8874d82679e59432edf79e074f595c4ad26d2e562eba4fb55f361691b07)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 301.07it/s]


In [2]:
train

Dataset({
    features: ['text', 'source', 'language', 'label'],
    num_rows: 270399
})

In [3]:
test

Dataset({
    features: ['text', 'source', 'language', 'label'],
    num_rows: 14465
})

In [4]:
unique_lang = set(train["language"])
unique_lang

{'arabic',
 'chinese',
 'english',
 'french',
 'german',
 'hindi',
 'indonesian',
 'italian',
 'japanese',
 'malay',
 'portuguese',
 'spanish'}

In [5]:
filtered_train = train.filter(lambda example: example["language"] not in ["italian", "japanese", "portuguese"])
filtered_test = test.filter(lambda example: example["language"] not in ["italian", "japanese", "portuguese"])

Loading cached processed dataset at /home/lxyuan/.cache/huggingface/datasets/tyqiangz___multilingual-sentiments/all/1.0.0/b7cdd8874d82679e59432edf79e074f595c4ad26d2e562eba4fb55f361691b07/cache-c1f1d041f22604ae.arrow
Loading cached processed dataset at /home/lxyuan/.cache/huggingface/datasets/tyqiangz___multilingual-sentiments/all/1.0.0/b7cdd8874d82679e59432edf79e074f595c4ad26d2e562eba4fb55f361691b07/cache-1d835284274eae4f.arrow


In [6]:
filtered_train

Dataset({
    features: ['text', 'source', 'language', 'label'],
    num_rows: 146721
})

In [7]:
filtered_test

Dataset({
    features: ['text', 'source', 'language', 'label'],
    num_rows: 9725
})

In [8]:
filtered_train.features["label"]

ClassLabel(names=['positive', 'neutral', 'negative'], id=None)

### 🤗 Zero-shot classification pipeline

The [zero-shot classification pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline) is a tool withing 🤗 Transformers that can be used to classify text sequences out of the box, provided only a list of possible class names:

In [9]:
from transformers import pipeline
zero_shot_classifier = pipeline('zero-shot-classification', model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", device=0)



In [None]:
sequence = "I love this movie and i would watch it again and again!"
class_names = ["positive", "neutral", "negative"]
zero_shot_classifier(sequence, class_names, hypothesis_template="The sentiment of this text is {}.")

{'sequence': 'I love this movie and i would watch it again and again!',
 'labels': ['positive', 'neutral', 'negative'],
 'scores': [0.9610546827316284, 0.029341870918869972, 0.009603399783372879]}

In [None]:
sequence = "我喜欢这部电影，我会一遍又一遍地看！"
class_names = ["positive", "neutral", "negative"]
zero_shot_classifier(sequence, class_names, hypothesis_template="The sentiment of this text is {}.")

{'sequence': '我喜欢这部电影，我会一遍又一遍地看！',
 'labels': ['positive', 'neutral', 'negative'],
 'scores': [0.9576952457427979, 0.031247487291693687, 0.011057236231863499]}

In [None]:
zero_shot_classifier.model.num_parameters() 

278811651

This method serves as a convenient out-of-the-box classifier. Unfortunately, the method is by necessity somewhat slow. This is partially due to the large underlying model being used, but more important is the fact that for this method to work, every possible sequence / class name pair must be fed through the model together. So in order to classify `N` sequences into `K` classes, the model has to be called `N*K` times (whereas a typical classifier would only be called `N` times). This makes the method comparatively slow, especially for settings with a large number of classes.

In [None]:
# classify 1600 examples with K=4 classes
%%time
for _ in range(100):
    zero_shot_classifier([sequence] * 16, class_names)



CPU times: user 1min 44s, sys: 363 ms, total: 1min 44s
Wall time: 1min 46s


In [None]:
# classify 1600 examples with K=8 classes
%%time
expanded_class_names = class_names + ["politics", "health", "food", "weather"]
for _ in range(100):
    zero_shot_classifier([sequence] * 16, expanded_class_names)

CPU times: user 3min 56s, sys: 624 ms, total: 3min 57s
Wall time: 4min 4s


As we can see, increasing the number of classes from `K=4` to `K=8` approximately doubles the inference time. This classification method is extremely useful, but ideally we'd like to speed up inference.

### Distilling a more efficient student model

The best way to speed up inference is to **train a more efficient student model on the zero-shot classifier's predictions** over an unlabeled dataset. This can be done with the [`distill_classifier.py`](https://github.com/huggingface/transformers/blob/master/examples/research_projects/zero-shot-distillation/distill_classifier.py) script provided in the `transformers` repo.

Given (1) an unlabeled corpus and (2) a set of `K` class names, this script allows a user to train a standard classification head with `K` output dimensions. The script generates a softmax distribution for the provided data & class names, and a student classifier is then fine-tuned on these proxy labels. The resulting student model can be used for classifying novel text instances into the previously specified `K` classes with an order-of-magnitude boost in inference speed plus decreased memory usage.

Let's see how to do this with the [tyqiangz/multilingual-sentiments](https://huggingface.co/datasets/tyqiangz/multilingual-sentiments)[AG's News](https://huggingface.co/datasets/ag_news) topic classification dataset. The first thing we need is an unlabeled dataset (in reality the multilingual-sentiment dataset is annotated of course, but we'll pretend and ignore the annotations for the sake of example). Let's put the sequences from the train set into a `txt` file:

In [9]:
!mkdir multilingual-sentiments
with open("multilingual-sentiments/train_unlabeled.txt", 'w') as f:
    for seq in filtered_train["text"]:
        f.write(seq + '\n')

The other thing the script needs is the names of the classes. We'll put these into their own newline-delimitted `txt` as well:

In [10]:
class_names = ['positive', 'neutral', 'negative']
with open("multilingual-sentiments/class_names.txt", 'w') as f:
    for label in class_names:
        f.write(label + '\n')

In [11]:
!cat multilingual-sentiments/train_unlabeled.txt | head -5

yang memerlukan pemerhatian dan tindakan serius
sentiasa memikirkan dan merancang inisiatif bagi menambah baik sistem penyampaian kerajaan kepada rakyat
Kita akan tengok daripada pelbagai aspek supaya akhirnya hak rakyat dapat dikembalikan, itu fokus kita pada tahun ini selaras dengan arahan pucuk pimpinan SPRM
justeru asean perlu mengambil tindakan sebagaimana dianjurkan oleh malaysia
@_Niiar_ Jangan punah dulu, aku belum ke labuan bajo
cat: write error: Broken pipe


In [12]:
!cat multilingual-sentiments/class_names.txt | head -5

positive
neutral
negative


Now we can run the script. First the zero-shot model will loop through the data and generate (soft) proxy-labels, and then a student `Multilinguai DistilBert` model will be fine-tuned on these predictions. The student will then be saved in `./distilbert-base-multilingual-cased-sentiments-student/`. See the [script readme](https://github.com/huggingface/transformers/blob/master/examples/research_projects/zero-shot-distillation/README.md) for more information about the available script arguments.

On a single P100, this will take about ~2 hours with the full training set of 130K examples. On a V100 with mixed precision (just pass `--fp16`), it will take ~30 minutes.

---
Code changes:

###### modify L78 to disable fast tokenizer 
```python
default=False,
```

###### update dataset map part at L313
```python
dataset = dataset.map(tokenizer, input_columns="text", fn_kwargs={"padding": "max_length", "truncation": True, "max_length": 512})
```

###### add following lines to L213
```python
del model
print(f"Manually deleted Teacher model")
```

###### add following lines to L337
```python
trainer.push_to_hub()
tokenizer.push_to_hub("distilbert-base-multilingual-cased-sentiments-student")
```

---

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [13]:
#!python transformers/examples/research_projects/zero-shot-distillation/distill_classifier.py \
!python distill_classifier.py \
--data_file ./multilingual-sentiments/train_unlabeled.txt \
--class_names_file ./multilingual-sentiments/class_names.txt \
--hypothesis_template "The sentiment of this text is {}." \
--teacher_name_or_path MoritzLaurer/mDeBERTa-v3-base-mnli-xnli \
--teacher_batch_size 32 \
--student_name_or_path distilbert-base-multilingual-cased \
--output_dir ./distilbert-base-multilingual-cased-sentiments-student \
--per_device_train_batch_size 16 \
--fp16

05/06/2023 10:05:30 - INFO - __main__ - Training/evaluation parameters DistillTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hu

### Using the student model

The resulting model can now be loaded and used like any other pre-trained model:

(you can also use `"lxyuan/distilbert-base-multilingual-cased-sentiments-student"` to download this model from the hub if you want to try it without running whole script above)

In [15]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-multilingual-cased-sentiments-student")
model = AutoModelForSequenceClassification.from_pretrained("./distilbert-base-multilingual-cased-sentiments-student")
model.config

DistilBertConfig {
  "_name_or_path": "./distilbert-base-multilingual-cased-sentiments-student",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "positive",
    "1": "neutral",
    "2": "negative"
  },
  "initializer_range": 0.02,
  "label2id": {
    "negative": 2,
    "neutral": 1,
    "positive": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.28.1",
  "vocab_size": 119547
}

and even used trivially with a `TextClassificationPipeline`:

In [50]:
from transformers import TextClassificationPipeline
distilled_classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True, device=0)
distilled_classifier(train["text"][2])

[[{'label': 'positive', 'score': 0.6660619974136353},
  {'label': 'neutral', 'score': 0.12189780175685883},
  {'label': 'negative', 'score': 0.21204017102718353}]]

In [17]:
# 0: positive
train["text"][2], train["label"][2]

('Kita akan tengok daripada pelbagai aspek supaya akhirnya hak rakyat dapat dikembalikan, itu fokus kita pada tahun ini selaras dengan arahan pucuk pimpinan SPRM',
 0)

----

Let's compare the speed & accuracy of the two methods.

Original zero-shot model:

In [10]:
import numpy as np
from time import time
from tqdm.auto import tqdm

In [42]:
start = time()
batch_size = 32
hypothesis_template = "The sentiment of this text is {}."
class_names = ['positive', 'neutral', 'negative']
preds = []
teacher_preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]['text']
    outputs = zero_shot_classifier(examples, class_names, hypothesis_template=hypothesis_template)
    preds += [class_names.index(o['labels'][0]) for o in outputs]
    teacher_preds += [class_names.index(o['labels'][0]) for o in outputs]
accuracy = np.mean(np.array(preds) == np.array(test['label']))
print(f"Teacher model accuracy: {accuracy*100:0.2f}%")
print(f"Runtime: {time() - start : 0.2f} seconds")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 453/453 [07:55<00:00,  1.05s/it]

Teacher model accuracy: 58.87%
Runtime:  475.53 seconds





Distilled student model:

In [46]:
start = time()
batch_size = 128 # larger batch size bc distilled model is more memory efficient
distilled_classifier.return_all_scores = False
tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}
preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]['text']
    outputs = distilled_classifier(examples, **tokenizer_kwargs)
    preds += [class_names.index(o['label']) for o in outputs]
accuracy = np.mean(np.array(preds) == np.array(test['label']))
print(f"Distilled model accuracy: {accuracy*100:0.2f}%")
print(f"Runtime: {time() - start : 0.2f} seconds")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 114/114 [00:46<00:00,  2.46it/s]

Distilled model accuracy: 54.26%
Runtime:  46.37 seconds





In [49]:
# error analysis
for idx, (student_pred, true, teacher_pred) in enumerate(zip(preds, test["label"], teacher_preds)):
    if idx >20:
        break
        
    if pred != true:
        print(f"Text: {test['text'][idx]}")
        print(f"{teacher_pred = }  |  {student_pred = }  |  {true = }")
        print("\n")

Text: Sepatutnya berbuat begitu  demi untuk menawarkan sesuatu yang lebih baik kepada rakyat.
teacher_pred = 0  |  student_pred = 0  |  true = 0


Text: Alhamdulillah, sama2 bantu kerajaan memerangi Covid19, menyelamatkan nyawa rakyat dan memulihkan semula negara. Buk https://t.co/A4uyo9BtXL
teacher_pred = 0  |  student_pred = 0  |  true = 0


Text: Biasanya bantuan disalurkan kepada sekolah berdaftar
teacher_pred = 0  |  student_pred = 0  |  true = 0


Text: Kerajaan wajar mengkaji semula had tunggakan cukai yang dinaikkan kepada RM10,000 untuk pembayar cukai individu dan RM50,000 untuk syarikat.
teacher_pred = 2  |  student_pred = 0  |  true = 0


Text: me; those everytime nak beli baju or seluar tak confidence dengan size sendiri
teacher_pred = 2  |  student_pred = 2  |  true = 0


Text: Apa pun yang berlaku, kami jalan terus.
teacher_pred = 1  |  student_pred = 2  |  true = 0


Text: hahahah betul2 tea ni..
teacher_pred = 0  |  student_pred = 0  |  true = 1


Text: PKR Perak hadiah

As you can see, **the disitlled model gets similar accuracy on a held-out test set while running in 1/10th the time**. 

Note: The goal here is demostrate the idea of model distillation using zero-shot classification pipline. We aim to get a student model with similar performance as the teacher model but with smaller model size and more efficient inference spped.
