# SetFit for Text Classification

In [None]:
#!pip install setfit

This notebook is designed to work with any multiclass [text classification dataset](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) and pretrained [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub. Change the values below to try a different dataset / model!

In [1]:
dataset_id = "banking77"
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"

## Loading and sampling the dataset

We will use the 🤗 Datasets library to download the data, which can be done as follows:

In [2]:
from datasets import load_dataset

dataset = load_dataset(dataset_id)
dataset

Found cached dataset banking77 (/root/.cache/huggingface/datasets/banking77/default/1.1.0/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3080
    })
})

Most datasets on the Hub have many more labeled examples than those one encounters in few-shot settings. To simulate the effect of training on a limited number of examples, let's subsample the training set to have 8 labeled examples per class:

In [3]:
from setfit import sample_dataset

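# Keep 8 labeled examples per class (the library default, made explicit here)
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)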
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 616
})
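
To make the sampling step concrete, here is a minimal, library-free sketch of per-class subsampling. It is not the `sample_dataset` implementation: the helper `sample_per_class` and the toy rows are hypothetical, and they only illustrate why 77 classes at 8 examples each give the 616 rows shown above.

```python
import random
from collections import defaultdict


def sample_per_class(rows, num_samples=8, seed=42):
    """Keep at most `num_samples` rows per label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take everything if a class has fewer rows than requested.
        k = min(num_samples, len(group))
        sampled.extend(rng.sample(group, k))
    rng.shuffle(sampled)
    return sampled


# Toy stand-in for banking77: 77 classes with 20 rows each.
toy_rows = [
    {"text": f"example {i} for class {c}", "label": c}
    for c in range(77)
    for i in range(20)
]
toy_train = sample_per_class(toy_rows, num_samples=8)
print(len(toy_train))  # 77 * 8 = 616
```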

In [4]:
eval_dataset = dataset["test"]

Now that we have the dataset, let's load and train a model!

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the `from_pretrained()` method associated with the `SetFitModel` class:

In [5]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Here, we've downloaded a pretrained Sentence Transformer from the Hub and added a logistic regression classification head to create the SetFit model. As indicated in the message, we need to train this model on some labeled examples. We can do so by using the `SetFitTrainer` class as follows:

In [6]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    #eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    num_epochs=1,
    batch_size=128,
    column_mapping={"text": "text", "label": "label"},
)

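The main arguments to notice in the trainer are the following:

* `loss_class`: The loss function to use for contrastive learning with the Sentence Transformer body
* `num_iterations`: The number of pair-generation iterations for contrastive learning. Each iteration produces one positive and one negative text pair per training example, so 616 examples with `num_iterations=20` yield 616 × 20 × 2 = 24,640 pairs, matching `Num examples` in the training log.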
* `column_mapping`: The `SetFitTrainer` expects the inputs to be found in a `text` and `label` column. This mapping automatically formats the training and evaluation datasets for us.
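
The pair generation behind `num_iterations` can be sketched in plain Python. This is a simplified illustration, not SetFit's actual code: `generate_pairs` is a hypothetical helper, the toy sentences are made up, and it assumes one positive (same label) and one negative (different label) pair per example per iteration, with targets of 1.0 and 0.0 as a cosine-similarity loss would expect.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=0):
    """Build (text_a, text_b, target) triples for a cosine-similarity loss.

    Assumption: each iteration yields one positive pair (same label,
    target 1.0) and one negative pair (different label, target 0.0)
    per example. Hypothetical helper, not the SetFit implementation.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            same = [t for t, l in zip(texts, labels) if l == label and t != text]
            other = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(same), 1.0))
            pairs.append((text, rng.choice(other), 0.0))
    return pairs


# Made-up examples: two intents with two sentences each.
texts = ["card arrived?", "where is my card", "top up failed", "top-up did not work"]
labels = [0, 0, 1, 1]
pairs = generate_pairs(texts, labels, num_iterations=20)
print(len(pairs))  # 4 examples * 20 iterations * 2 = 160
```

Because every labeled example is reused in many pairs, a handful of examples per class still produces a sizeable contrastive training set.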

Now that we've created a trainer, we can train the model!

In [7]:
trainer.train()

Applying column mapping to training dataset



***** Running training *****
  Num examples = 24640
  Num epochs = 1
  Total optimization steps = 193
  Total train batch size = 128
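
The numbers in this log can be reproduced with a little arithmetic. The sketch below assumes two generated pairs per training example per iteration (one positive, one negative) and that the last, partially filled batch still counts as an optimization step.

```python
import math

num_train_examples = 616  # 77 classes * 8 examples per class
num_iterations = 20
batch_size = 128
num_epochs = 1

# Assumption: one positive and one negative pair per example per iteration.
num_pairs = num_train_examples * num_iterations * 2
# Assumption: the final partial batch is rounded up to a full step.
steps = math.ceil(num_pairs / batch_size) * num_epochs

print(num_pairs)  # 24640
print(steps)      # 193
```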



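The final step is to compute the model's performance using the `evaluate()` method. Because `eval_dataset` was commented out when the trainer was created, we first attach the test split to the trainer: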

In [13]:
eval_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 3080
})

In [18]:
trainer.eval_dataset = eval_dataset

In [21]:
metrics = trainer.evaluate()
metrics

Applying column mapping to evaluation dataset
***** Running evaluation *****



{'accuracy': 0.8152597402597402}

## Comparing with an open-source fine-tuned model from the Hugging Face Hub

In [22]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer

tokenizer = AutoTokenizer.from_pretrained("lxyuan/banking-intent-distilbert-classifier")
finetuned_model = AutoModelForSequenceClassification.from_pretrained("lxyuan/banking-intent-distilbert-classifier")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



In [34]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
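
The `compute_metrics` function above turns per-class logits into predicted labels with an arg-max and then scores them against the references. A dependency-free sketch of the same idea on toy values (the logits are made up, and `argmax` and `toy_accuracy` are hypothetical helpers, not part of `evaluate` or NumPy):

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for 4 examples over 3 classes.
toy_logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.3, 1.5, 0.2],   # predicts class 1
    [-0.5, 0.0, 0.9],  # predicts class 2
    [1.2, 0.8, 0.1],   # predicts class 0 (label is 1, so this one is wrong)
]
toy_labels = [0, 1, 2, 1]
print(toy_accuracy(toy_logits, toy_labels))  # 0.75
```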

In [25]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)


In [26]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3080
    })
})

In [30]:
finetuned_trainer = Trainer(
    model=finetuned_model,
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

[codecarbon INFO @ 18:54:46] [setup] RAM Tracking...
[codecarbon INFO @ 18:54:46] [setup] GPU Tracking...
[codecarbon INFO @ 18:54:46] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 18:54:46] [setup] CPU Tracking...


[codecarbon INFO @ 18:54:47] CPU Model on constant consumption mode: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
[codecarbon INFO @ 18:54:47] >>> Tracker's metadata:
[codecarbon INFO @ 18:54:47]   Platform system: Linux-4.15.0-136-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 18:54:47]   Python version: 3.8.0
[codecarbon INFO @ 18:54:47]   CodeCarbon version: 2.2.3
[codecarbon INFO @ 18:54:47]   Available RAM : 88.490 GB
[codecarbon INFO @ 18:54:47]   CPU count: 8
[codecarbon INFO @ 18:54:47]   CPU model: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
[codecarbon INFO @ 18:54:47]   GPU count: 1
[codecarbon INFO @ 18:54:47]   GPU model: 1 x Tesla V100S-PCIE-32GB


In [35]:
finetuned_trainer.evaluate()

{'eval_loss': 0.2885190546512604,
 'eval_accuracy': 0.9243506493506494,
 'eval_runtime': 3.0169,
 'eval_samples_per_second': 1020.91,
 'eval_steps_per_second': 127.614}
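
To put the two accuracy figures side by side, the short calculation below converts them back into counts of correctly classified test examples. It uses only the numbers reported in this notebook (3,080 test rows and the two accuracy values).

```python
num_test_examples = 3080

setfit_accuracy = 0.8152597402597402     # SetFit, 8 examples per class
finetuned_accuracy = 0.9243506493506494  # DistilBERT checkpoint from the Hub

setfit_correct = round(setfit_accuracy * num_test_examples)
finetuned_correct = round(finetuned_accuracy * num_test_examples)
gap_points = (finetuned_accuracy - setfit_accuracy) * 100

print(setfit_correct)                      # 2511
print(finetuned_correct)                   # 2847
print(finetuned_correct - setfit_correct)  # 336
print(round(gap_points, 1))                # 10.9
```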

----

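Conclusion: SetFit is a good library to try when you have a limited number of training examples. With only 8 labeled examples per class (616 in total), it reaches about 81.5% accuracy on the banking77 test set, compared with about 92.4% for a DistilBERT classifier from the Hub that was already fine-tuned for this task.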