# SetFit for Multilabel Text Classification

In this notebook, we'll learn how to do few-shot text classification on a multilabel dataset with SetFit.

## Setup

If you're running this notebook on Colab or another cloud platform, you will need to install the `setfit` library. Uncomment the following cell and run it:

In [1]:
#%pip install setfit

Alternatively, if you are running directly from source, you can add the `setfit` source directory to your Python path:

In [2]:
#import sys
#sys.path.append('src')

To be able to share your model with the community, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face Hub (sign up [here](https://huggingface.co/join) if you haven't already!). To do so, execute the following cell and input an [access token](https://huggingface.co/docs/hub/security-tokens) associated with your account:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS, which you can do by uncommenting and running the following command:

In [5]:
# !apt install git-lfs

This notebook is designed to work with any multilabel [text classification dataset](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) and pretrained [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub. Change the values below to try a different dataset or model!

In [6]:
dataset_id = "SetFit/go_emotions"
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"

## Loading and sampling the dataset

We will use the 🤗 Datasets library to download the data, which can be done as follows:

In [None]:
from datasets import load_dataset

dataset = load_dataset(dataset_id)

Most datasets on the Hub have many more labeled examples than one encounters in few-shot settings. To simulate training on a limited number of examples, let's subsample the training set by drawing 8 labeled examples for each label feature.

Note that if your dataset has other columns which are not label features, or differently formatted labels, you may need to adapt this section.

In [8]:
import numpy as np

features = dataset['train'].column_names
features.remove('text')
features

['admiration',
 'amusement',
 'anger',
 'annoyance',
 'approval',
 'caring',
 'confusion',
 'curiosity',
 'desire',
 'disappointment',
 'disapproval',
 'disgust',
 'embarrassment',
 'excitement',
 'fear',
 'gratitude',
 'grief',
 'joy',
 'love',
 'nervousness',
 'optimism',
 'pride',
 'realization',
 'relief',
 'remorse',
 'sadness',
 'surprise',
 'neutral']

In [9]:
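num_samples = 8
# For each label column, draw `num_samples` row indices where that label is set.
# Note: np.random.choice samples with replacement by default, so an index can repeat.
samples = np.concatenate(
    [np.random.choice(np.where(dataset["train"][f])[0], num_samples) for f in features]
)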
train_dataset = dataset["train"].select(samples)
train_dataset

Dataset({
    features: ['text', 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'],
    num_rows: 224
})

This gives us 224 training examples in total, since the `go_emotions` dataset has 28 labels and we draw 8 examples for each (28 × 8 = 224).
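The sampling step can be tried in isolation on toy data. The sketch below is a minimal illustration, not the notebook's exact cell: the `toy_labels` columns are made up, the generator is seeded, and it passes `replace=False` so that the indices drawn for a label are distinct (the cell above relies on NumPy's default, which samples with replacement).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical 0/1 label columns over 20 rows (not taken from go_emotions).
toy_labels = {
    "joy": np.array([1, 0] * 10),
    "anger": np.array([0, 1] * 10),
    "fear": np.array([1, 1, 0, 0] * 5),
}
num_samples = 4

# For each label, pick `num_samples` distinct rows where that label is set.
per_label = {
    name: rng.choice(np.flatnonzero(column), size=num_samples, replace=False)
    for name, column in toy_labels.items()
}
samples = np.concatenate(list(per_label.values()))

print(len(samples))  # 3 labels x 4 samples = 12 indices
```

With `replace=False`, NumPy raises a `ValueError` when a label has fewer than `num_samples` positive rows, so a dataset with very rare labels would need a smaller `num_samples` or sampling with replacement.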

We then encode the emotions as a single `'label'` feature: a list with one entry per emotion.

In [None]:
def encode_labels(record):
    return {'label': [record[feature] for feature in features]}

train_dataset = train_dataset.map(encode_labels, remove_columns=features)
eval_dataset = dataset["test"].map(encode_labels, remove_columns=features)
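To see what `encode_labels` produces, here is a small sketch on a single hand-written record. The shortened `features` list and the example record are hypothetical and only stand in for the real dataset columns.

```python
# Shortened, hypothetical subset of the label columns.
features = ["admiration", "amusement", "anger"]


def encode_labels(record):
    # Collect the per-emotion columns into one list, in the order of `features`.
    return {"label": [record[feature] for feature in features]}


record = {"text": "that was hilarious", "admiration": 0, "amusement": 1, "anger": 0}
encoded = encode_labels(record)

print(encoded)  # {'label': [0, 1, 0]}
```

The position of each entry in `label` follows the order of `features`, which is why that list is needed again later to turn predictions back into emotion names.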

Okay, now that we have the dataset, let's load and train a model!

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the `from_pretrained()` method associated with the `SetFitModel` class.

**Note that the `multi_target_strategy` parameter here signals to both the model and the trainer to expect a multi-labelled dataset.**

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id, multi_target_strategy="one-vs-rest")
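The general idea behind a one-vs-rest strategy is to fit one independent binary classifier per label. The sketch below illustrates that idea with scikit-learn on made-up 2-D "embeddings"; it is an assumption-laden illustration of the technique, not a description of what SetFit does internally, and the toy data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 2-D "embeddings"; real sentence embeddings have many more dimensions.
X = np.array([[2, 2], [2, -2], [-2, 2], [-2, -2]] * 2, dtype=float)

# Multi-hot targets: label 0 is on when the first coordinate is positive,
# label 1 is on when the second coordinate is positive.
Y = (X > 0).astype(int)

# One independent binary logistic regression is fitted per label column.
head = OneVsRestClassifier(LogisticRegression())
head.fit(X, Y)

preds = head.predict(X)
print(len(head.estimators_))  # one binary classifier per label
print(preds[:4])              # multi-hot predictions, one row per input
```

Because each label gets its own classifier, a single input can be assigned several labels at once, or none at all.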

Here, we've downloaded a pretrained Sentence Transformer from the Hub and added a logistic classification head to create the SetFit model. As indicated in the message, we need to train this model on some labeled examples. We can do so by using the `SetFitTrainer` class as follows:

In [12]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_epochs=1,
    num_iterations=20
)

The main arguments to notice in the trainer are the following:

* `loss_class`: The loss function to use for contrastive learning with the Sentence Transformer body
* `num_iterations`: The number of text pairs to generate for contrastive learning

Now that we've created a trainer, we can train it!

In [13]:
trainer.train()

***** Running training *****
  Num examples = 10700
  Num epochs = 1
  Total optimization steps = 669
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/669 [00:00<?, ?it/s]

The final step is to compute the model's performance using the `evaluate()` method. This is a difficult dataset, and the metric we are using counts a sample as correct only when all 28 labels are right, so you can expect 'only' around 15% accuracy. For higher performance, increase the number of samples per class:

In [14]:
metrics = trainer.evaluate()
metrics

***** Running evaluation *****


{'accuracy': 0.1461212456237332}
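To get a feel for how strict an all-labels-correct metric is, the sketch below compares it with per-label accuracy on a tiny made-up example. It assumes the reported accuracy is the exact-match kind described above; the `y_true` and `y_pred` arrays are hypothetical.

```python
import numpy as np

# Hypothetical multi-hot targets and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])

# Exact match: a sample counts only if every one of its labels is right.
exact_match = np.all(y_true == y_pred, axis=1).mean()

# Per-label accuracy: each individual 0/1 decision counts separately.
per_label = (y_true == y_pred).mean()

print(exact_match)  # 2 of 4 samples fully correct -> 0.5
print(per_label)    # 10 of 12 individual labels correct
```

Even though most individual label decisions are right in this toy case, half of the samples still count as wrong under exact match, and the gap widens as the number of labels grows.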

And once the model is trained, you can push it to the Hub:

In [15]:
#trainer.push_to_hub("setfit-go-emotions-example")

You can now share this model with all your friends, family, and favorite pets. They can all load it with the identifier `your-username/the-name-you-picked`, for instance:

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("lewtun/setfit-go-emotions-example")

In [17]:
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
preds

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])

In [18]:
# Show predicted labels, requires you to have stored the 'features' somewhere
[[f for f,p in zip(features, ps) if p] for ps in preds]

[[], ['disgust']]