# Fast Text Classification: When You Have No Labeled Data

This notebook demonstrates how to train an efficient student classifier on the predictions of a pretrained Hugging Face zero-shot classifier (the teacher), using those predictions in place of labeled data.

In [1]:
import os
import sys
from time import time
from tqdm.auto import tqdm

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
    TextClassificationPipeline,
    TrainingArguments,
)

sys.path.append(os.path.abspath(os.path.join('..')))
from utilities.distill_classifier_ import (
    ZeroShotStudentTrainer,
    read_lines,
    get_results_df
)

In [2]:
OUTPUT_DIR = "../distilled_text_classifier"

# read in synthetic data
EXAMPLES = read_lines('./examples.txt')

# define example class names
CLASS_NAMES = [
    'quality',
    'texture',
    'scent',
    'value',
    'results',
    'color',
    'dryness',
    'brightening',
    'staining',
    'experience',
    'quantity',
    'longevity',
    'antiaging'
]

TRAINING_ARGS = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    seed=48,
    fp16=False,
    local_rank=-1
)

# TOKENIZERS_PARALLELISM = False

In [3]:
len(EXAMPLES)

95000

In [4]:
train = EXAMPLES[:10000]

In [5]:
# Initialize zero shot student trainer with chosen text and class names
zero_shot_student_trainer = ZeroShotStudentTrainer(train,
                                                   class_names=CLASS_NAMES,
                                                   hypothesis_template="This text is about {}.")
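A zero-shot classifier built on an NLI model scores each text against one hypothesis per class, so the template above is filled in once for every class name. The sketch below shows that expansion in plain Python; `build_hypotheses` and `build_pairs` are hypothetical helper names, not part of `ZeroShotStudentTrainer`. The teacher batch size of 32 is an assumption, chosen because it is consistent with the 4063 teacher batches shown in the progress bar of the next cell.

```python
from math import ceil


def build_hypotheses(class_names, template):
    """Fill the hypothesis template once per class name."""
    return [template.format(name) for name in class_names]


def build_pairs(texts, hypotheses):
    """Pair every text (premise) with every class hypothesis."""
    return [(text, hyp) for text in texts for hyp in hypotheses]


class_names = ["quality", "texture", "scent"]  # subset of CLASS_NAMES
hypotheses = build_hypotheses(class_names, "This text is about {}.")
pairs = build_pairs(["slightly thick"], hypotheses)

# Cost of labeling the training slice: 10,000 texts x 13 classes.
n_pairs = 10_000 * 13
n_batches = ceil(n_pairs / 32)  # 32 is an assumed teacher batch size
```

Because every text is scored once per class, the teacher's cost grows with the number of classes, while a distilled student needs a single forward pass per text.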

In [6]:
# Get predictions from the teacher model, then train the student model on them
zero_shot_student_trainer.distill_text_classifier(TRAINING_ARGS)

Generating predictions from zero-shot teacher model


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/4063 [00:00<?, ?it/s]

Initializing student model


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Splitting dataset into training and testing
DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1000
    })
})
Tokenizing training and testing datasets


Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
Training student model on teacher predictions




  0%|          | 0/4500 [00:00<?, ?it/s]

{'loss': 0.2742, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.44}
{'loss': 0.2544, 'learning_rate': 3.888888888888889e-05, 'epoch': 0.89}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.2511460483074188, 'eval_agreement': 0.643, 'eval_runtime': 9.748, 'eval_samples_per_second': 102.585, 'eval_steps_per_second': 12.823, 'epoch': 1.0}
{'loss': 0.2511, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.33}
{'loss': 0.2497, 'learning_rate': 2.777777777777778e-05, 'epoch': 1.78}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.24967394769191742, 'eval_agreement': 0.669, 'eval_runtime': 9.5117, 'eval_samples_per_second': 105.134, 'eval_steps_per_second': 13.142, 'epoch': 2.0}
{'loss': 0.2489, 'learning_rate': 2.2222222222222223e-05, 'epoch': 2.22}
{'loss': 0.2468, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.67}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.24932238459587097, 'eval_agreement': 0.67, 'eval_runtime': 9.4953, 'eval_samples_per_second': 105.315, 'eval_steps_per_second': 13.164, 'epoch': 3.0}
{'loss': 0.2472, 'learning_rate': 1.1111111111111112e-05, 'epoch': 3.11}
{'loss': 0.2459, 'learning_rate': 5.555555555555556e-06, 'epoch': 3.56}
{'loss': 0.2463, 'learning_rate': 0.0, 'epoch': 4.0}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.24907821416854858, 'eval_agreement': 0.691, 'eval_runtime': 10.0169, 'eval_samples_per_second': 99.831, 'eval_steps_per_second': 12.479, 'epoch': 4.0}
{'train_runtime': 3548.5545, 'train_samples_per_second': 10.145, 'train_steps_per_second': 1.268, 'train_loss': 0.2516319105360243, 'epoch': 4.0}
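The step counts and learning rates in the log above can be reproduced with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults are in effect (per-device batch size 8 on a single device, base learning rate 5e-5, linear decay with no warmup), since `TRAINING_ARGS` does not override them; `linear_lr` is a hypothetical helper written for illustration.

```python
from math import ceil


def linear_lr(step, total_steps, base_lr=5e-5):
    """Linearly decay the learning rate from base_lr to 0 (no warmup)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


train_rows, test_rows = 9000, 1000  # from the DatasetDict printed above
epochs = 4                          # num_train_epochs in TRAINING_ARGS
batch_size = 8                      # assumed library default

steps_per_epoch = ceil(train_rows / batch_size)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs           # training progress bar total
eval_batches = ceil(test_rows / batch_size)      # evaluation progress bar total

lr_at_first_log = linear_lr(500, total_steps)    # first logged learning rate
epoch_at_first_log = 500 / steps_per_epoch       # logged as 'epoch': 0.44
```

Under these assumptions the totals match the 4500 training steps and 125 evaluation batches in the progress bars, and the learning rate at step 500 matches the first logged value.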


  0%|          | 0/125 [00:00<?, ?it/s]

Agreement of student and teacher predictions: 69.10%
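The notebook does not show how the agreement figure is computed. A common definition, assumed here, is the fraction of examples on which the student's top-scoring class equals the teacher's top-scoring class. The sketch below implements that definition on made-up score vectors; `agreement` is a hypothetical helper, not the function used inside `ZeroShotStudentTrainer`.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def agreement(student_scores, teacher_scores):
    """Fraction of rows where student and teacher pick the same class."""
    if not student_scores or len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must be non-empty and equal in length")
    matches = sum(
        argmax(s) == argmax(t) for s, t in zip(student_scores, teacher_scores)
    )
    return matches / len(student_scores)


# Made-up scores for three examples and three classes (illustrative only).
student = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
teacher = [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.3, 0.1, 0.6]]
rate = agreement(student, teacher)  # student and teacher agree on rows 0 and 2
```

Under this definition, agreement measures how closely the student imitates the teacher, not accuracy against ground truth, because no human labels are involved.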


## Inference with Student Model

In [7]:
# load the tokenizer and student model
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForSequenceClassification.from_pretrained(OUTPUT_DIR)

In [8]:
student_distilled_pipeline = TextClassificationPipeline(model=model,
                                                        tokenizer=tokenizer,
                                                        return_all_scores=False)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [9]:
student_distilled_pipeline("slightly thick")

[{'label': 'texture', 'score': 0.25389814376831055}]
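A score of about 0.25 looks low, but with 13 classes a uniform guess would score only 1/13, roughly 0.077. The sketch below shows how a top label and score can be derived from raw logits, assuming the pipeline applies a softmax over the classes (the usual behavior for a single-label, multi-class model). The logits are made up and `top_prediction` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, class_names):
    """Return the highest-probability class in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": class_names[best], "score": probs[best]}


class_names = ["quality", "texture", "scent"]  # subset of CLASS_NAMES
logits = [0.1, 1.2, -0.3]                      # made-up values
prediction = top_prediction(logits, class_names)

uniform_baseline = 1 / 13  # score of a uniform guess over 13 classes
```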

## Compare Speed between Original and Distilled Model

In [10]:
test = EXAMPLES[-1000:]
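The training slice takes the first 10,000 examples and the test slice takes the last 1,000, so with 95,000 examples the two do not overlap by position. A minimal check, using a hypothetical helper `slices_overlap`:

```python
def slices_overlap(n_total, n_train_head, n_test_tail):
    """True if a head slice and a tail slice share any index."""
    train_idx = range(n_total)[:n_train_head]  # indices of EXAMPLES[:n_train_head]
    test_idx = range(n_total)[-n_test_tail:]   # indices of EXAMPLES[-n_test_tail:]
    return train_idx.stop > test_idx.start


overlap = slices_overlap(95_000, 10_000, 1_000)
```

This only checks positions; it does not rule out identical texts appearing in both slices.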

### First test original Zero Shot Classifier

In [11]:
zero_shot_classifier = pipeline('zero-shot-classification', model="roberta-large-mnli")

start = time()
batch_size = 32
hypothesis_template = "This text is about {}."
preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]
    outputs = zero_shot_classifier(examples, CLASS_NAMES, hypothesis_template=hypothesis_template)
    preds += [CLASS_NAMES.index(o['labels'][0]) for o in outputs]

print(f"Runtime: {time() - start : 0.2f} seconds")



  0%|          | 0/32 [00:00<?, ?it/s]

Runtime:  4106.51 seconds
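The loop above relies on the shape of the zero-shot pipeline's output: one dict per input text, with `labels` and `scores` sorted from most to least likely, so `o['labels'][0]` is the predicted class. The sketch below mimics that shape with made-up dicts to show how predictions become class indices; `top_class_indices` is a hypothetical helper.

```python
CLASS_NAMES = ["quality", "texture", "scent"]  # subset, for illustration

# Made-up outputs shaped like the zero-shot pipeline's result dicts;
# the texts and scores are illustrative, not real model predictions.
outputs = [
    {
        "sequence": "slightly thick",
        "labels": ["texture", "quality", "scent"],
        "scores": [0.6, 0.3, 0.1],
    },
    {
        "sequence": "smells like roses",
        "labels": ["scent", "quality", "texture"],
        "scores": [0.7, 0.2, 0.1],
    },
]


def top_class_indices(outputs, class_names):
    """Map each output's top-ranked label to its index in class_names."""
    return [class_names.index(o["labels"][0]) for o in outputs]


preds = top_class_indices(outputs, CLASS_NAMES)
```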


### Distilled Model

In [12]:
start = time()
batch_size = 128  # larger batch size because the distilled model is more memory-efficient
preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]
    outputs = student_distilled_pipeline(examples)
    preds += [CLASS_NAMES.index(o['label']) for o in outputs]

print(f"Runtime: {time() - start : 0.2f} seconds")

  0%|          | 0/8 [00:00<?, ?it/s]

Runtime:  31.71 seconds
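The two runtimes can be turned into a speedup factor and per-model throughput. The sketch below uses only the numbers printed above; `throughput` is a hypothetical helper.

```python
def throughput(n_examples, seconds):
    """Examples processed per second."""
    return n_examples / seconds


n_examples = 1000          # len(test)
teacher_seconds = 4106.51  # zero-shot classifier runtime
student_seconds = 31.71    # distilled model runtime

speedup = teacher_seconds / student_seconds
teacher_rate = throughput(n_examples, teacher_seconds)
student_rate = throughput(n_examples, student_seconds)

print(f"Speedup: {speedup:.1f}x")
print(f"Teacher: {teacher_rate:.2f} examples/s, student: {student_rate:.2f} examples/s")
```

The two runs used different batch sizes (32 and 128), so the ratio reflects the settings in this notebook and not a controlled benchmark.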


In [13]:
results = get_results_df(test, class_names=CLASS_NAMES, preds=preds)
results

| | text | label |
|---|---|---|
| 0 | religious on | experience |
| 1 | different reasonsoccasion | experience |
| 2 | strechmark recovery | results |
| 3 | need caffeine everyday | quantity |
| 4 | home bleach | color |
| ... | ... | ... |
| 995 | real deal guy | value |
| 996 | trump manual | experience |
| 997 | four stars good eyeliner | quality |
| 998 | mauve mama | scent |
