# Fast Text Classification: When You Have No Labeled Data

This notebook shows how to train an efficient student classifier on the predictions (used as labels) of a pretrained Hugging Face zero-shot classifier, so that no manually labeled data is required.
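
As a rough sketch of the teacher-to-student hand-off, the snippet below turns zero-shot outputs into integer labels that a student classifier could be trained on. The dictionaries mimic the `sequence` / `labels` / `scores` layout returned by the Hugging Face zero-shot pipeline, where `labels` is sorted by descending score. The texts, scores, and the helper name `to_pseudo_labels` are made up for illustration and are not taken from `ZeroShotStudentTrainer`.

```python
# Hypothetical teacher outputs in the zero-shot pipeline's result layout.
# The scores are invented; only the structure matters here.
class_names = ["quality", "texture", "scent"]
teacher_outputs = [
    {"sequence": "slightly thick",
     "labels": ["texture", "quality", "scent"],
     "scores": [0.81, 0.12, 0.07]},
    {"sequence": "smells like roses",
     "labels": ["scent", "quality", "texture"],
     "scores": [0.90, 0.06, 0.04]},
]


def to_pseudo_labels(outputs, labels):
    """Pair each text with the index of the teacher's top-ranked label."""
    return [(o["sequence"], labels.index(o["labels"][0])) for o in outputs]


pseudo_labeled = to_pseudo_labels(teacher_outputs, class_names)
for text, label_id in pseudo_labeled:
    print(f"{text!r} -> {label_id} ({class_names[label_id]})")
```

This uses hard labels (the teacher's top choice). A distillation setup may instead train on the full score distribution, which is an alternative this sketch does not cover.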

In [1]:
import os
import sys
from time import time
from tqdm.auto import tqdm

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
    TextClassificationPipeline,
    TrainingArguments,
)

sys.path.append(os.path.abspath(os.path.join('..')))
from utilities.distill_classifier_ import (
    ZeroShotStudentTrainer,
    read_lines,
    get_results_df
)

In [4]:
OUTPUT_DIR = "distilled_text_classifier"

# read in synthetic data
EXAMPLES = read_lines('./examples.txt')

# define example class names
CLASS_NAMES = [
    'quality',
    'texture',
    'scent',
    'value',
    'results',
    'color',
    'dryness',
    'brightening',
    'staining',
    'experience',
    'quantity',
    'longevity',
    'antiaging'
]

TRAINING_ARGS = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    seed=48,
    fp16=False,
    local_rank=-1
)

# TOKENIZERS_PARALLELISM = False

In [15]:
len(EXAMPLES)

95000

In [5]:
train = EXAMPLES[:10000]

In [6]:
# Initialize zero shot student trainer with chosen text and class names
zero_shot_student_trainer = ZeroShotStudentTrainer(train,
                                                   class_names=CLASS_NAMES,
                                                   hypothesis_template="This text is about {}.")
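
To make the role of `hypothesis_template` concrete, here is a minimal sketch of how the template expands into one natural language inference (NLI) hypothesis per class name. The teacher then scores each (text, hypothesis) pair for entailment. The helper `build_hypotheses` and the three-label subset are illustrative assumptions, not part of the `utilities` module.

```python
# Illustrative only: expand a hypothesis template into one hypothesis
# per candidate label. The label subset is arbitrary.
hypothesis_template = "This text is about {}."
class_names = ["quality", "texture", "scent"]


def build_hypotheses(template, labels):
    """Return one filled-in hypothesis string per candidate label."""
    return [template.format(label) for label in labels]


hypotheses = build_hypotheses(hypothesis_template, class_names)
premise = "slightly thick"
for hypothesis in hypotheses:
    # Each pair is what an NLI teacher model would score for entailment.
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")
```

Because every text is paired with every class name, the teacher's cost grows with the number of classes, which is one reason a single-pass student model can be much faster.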

In [7]:
# Get predictions from the teacher model and train the student model on them
zero_shot_student_trainer.distill_text_classifier(TRAINING_ARGS)

Generating predictions from zero-shot teacher model


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/5 [00:00<?, ?it/s]

Initializing student model


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Splitting dataset into training and testing
DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 9
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1
    })
})
Tokenizing training and testing datasets


Map:   0%|          | 0/9 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 9
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
})
Training student model on teacher predictions




  0%|          | 0/8 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.6830998063087463, 'eval_agreement': 0.0, 'eval_runtime': 0.1831, 'eval_samples_per_second': 5.462, 'eval_steps_per_second': 5.462, 'epoch': 1.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.6665518879890442, 'eval_agreement': 0.0, 'eval_runtime': 0.126, 'eval_samples_per_second': 7.935, 'eval_steps_per_second': 7.935, 'epoch': 2.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.6522639393806458, 'eval_agreement': 0.0, 'eval_runtime': 0.148, 'eval_samples_per_second': 6.759, 'eval_steps_per_second': 6.759, 'epoch': 3.0}


  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.6446555852890015, 'eval_agreement': 0.0, 'eval_runtime': 0.2043, 'eval_samples_per_second': 4.894, 'eval_steps_per_second': 4.894, 'epoch': 4.0}
{'train_runtime': 27.9975, 'train_samples_per_second': 1.286, 'train_steps_per_second': 0.286, 'train_loss': 0.6709115505218506, 'epoch': 4.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Agreement of student and teacher predictions: 0.00%
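
The agreement figure is easier to read with a concrete definition in hand. The sketch below assumes agreement means the fraction of examples where the student's top class matches the teacher's top class; the actual computation inside `ZeroShotStudentTrainer` is not shown in this notebook, so treat this as an assumption. All numbers are toy values.

```python
def argmax(values):
    """Index of the largest value in a non-empty sequence."""
    return max(range(len(values)), key=values.__getitem__)


def agreement(student_logits, teacher_probs):
    """Fraction of rows where student and teacher pick the same class."""
    if len(student_logits) != len(teacher_probs):
        raise ValueError("student and teacher must cover the same examples")
    if not student_logits:
        return 0.0
    matches = sum(
        argmax(s) == argmax(t) for s, t in zip(student_logits, teacher_probs)
    )
    return matches / len(student_logits)


# Toy values: three examples, three classes.
student_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.1]]
teacher_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
score = agreement(student_logits, teacher_probs)
print(f"Agreement: {score:.2%}")
```

Under this definition, a test split with a single row (as in the log above, `num_rows: 1`) can only yield 0% or 100%, so the reported 0.00% says little about student quality.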


## Inference with Student Model

In [9]:
# load the tokenizer and student model
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForSequenceClassification.from_pretrained(OUTPUT_DIR)

In [10]:
student_distilled_pipeline = TextClassificationPipeline(model=model,
                                                        tokenizer=tokenizer,
                                                        return_all_scores=False)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [11]:
student_distilled_pipeline("slightly thick")

[{'label': 'texture', 'score': 0.4992130398750305}]

## Compare Speed between Original and Distilled Model

In [16]:
test = EXAMPLES[-1000:]
len(test)

1000

### First test the original Zero Shot Classifier

In [13]:
zero_shot_classifier = pipeline('zero-shot-classification', model="roberta-large-mnli")

start = time()
batch_size = 32
hypothesis_template = "This text is about {}."
preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]
    outputs = zero_shot_classifier(examples, CLASS_NAMES, hypothesis_template=hypothesis_template)
    preds += [CLASS_NAMES.index(o['labels'][0]) for o in outputs]

print(f"Runtime: {time() - start : 0.2f} seconds")

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/1 [00:00<?, ?it/s]

Runtime:  47.40 seconds


### Distilled Model

In [19]:
start = time()
batch_size = 128  # larger batch size because the distilled model is more memory efficient
preds = []
for i in tqdm(range(0, len(test), batch_size)):
    examples = test[i:i+batch_size]
    outputs = student_distilled_pipeline(examples)
    preds += [CLASS_NAMES.index(o['label']) for o in outputs]

print(f"Runtime: {time() - start : 0.2f} seconds")

  0%|          | 0/1 [00:00<?, ?it/s]

Runtime:  2.13 seconds
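
For a quick read on the two timings, the arithmetic below computes the speedup ratio from the runtimes printed above. It is only a back-of-the-envelope comparison: the two runs used different batch sizes (32 and 128), and the teacher's timing includes pipeline construction, so the ratio is not a controlled benchmark.

```python
# Runtimes copied from the two cell outputs above, in seconds.
teacher_runtime_s = 47.40
student_runtime_s = 2.13

speedup = teacher_runtime_s / student_runtime_s
print(f"Speedup: {speedup:.1f}x")
```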


In [20]:
results = get_results_df(test, class_names=CLASS_NAMES, preds=preds)
results

|   | text | label |
|---|------|-------|
| 0 | less wasteful tube | value |
| 1 | heel within | value |
| 2 | complexion matte | staining |
| 3 | slightly more flat | staining |
| 4 | disappointed star | quality |
| 5 | real deal guy | staining |
| 6 | trump manual | staining |
| 7 | four stars good eyeliner | staining |
| 8 | mauve mama | quality |
| 9 | happen exfoliate | value |
