# SetFit - Efficient Few-shot Learning with Sentence Transformers

Modified from [SetFit - Efficient Few-shot Learning with Sentence Transformers](https://github.com/huggingface/setfit) by HK Turesson.

## Introduction

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of [Sentence Transformers](https://sbert.net/). It achieves high accuracy with little labeled data. For instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:

 * No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that's suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
 * Fast to train: SetFit doesn't require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
 * Multilingual support: SetFit can be used with any [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.

## Install

In [1]:
# !pip install -U setfit
!pip install transformers==4.42.2
!pip install git+https://github.com/huggingface/setfit.git


Verify that we use a compatible version of transformers.

In [2]:
import transformers

# Fail early if a different transformers version is active
assert transformers.__version__ == '4.42.2'


ImportError: tokenizers>=0.20,<0.21 is required for a normal functioning of this module, but found tokenizers==0.19.1.
Try: `pip install transformers -U` or `pip install -e '.[dev]'` if you're working with git main
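
The `ImportError` above says that the `tokenizers` version found at import time does not match what the loaded `transformers` build expects. A minimal sketch for listing installed versions without importing the packages (so it still works when the import itself fails), using only the standard library. The helper name `installed_version` is made up for this example, and the package list is simply the set named in this notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions that matter for this notebook without importing the packages
for name in ("transformers", "tokenizers", "setfit"):
    print(f"{name}: {installed_version(name)}")
```

If the printed versions differ from what was just installed, restarting the kernel after `pip install` is usually the first thing to try.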

Disable wandb integration.

In [None]:
import os
os.environ['WANDB_DISABLED'] = 'true'

## Usage

The examples below give a quick overview of the main features supported by `setfit`.

### Training a SetFit model

`setfit` is integrated with the [Hugging Face Hub](https://huggingface.co/) and provides two main classes:

 * `SetFitModel`: a wrapper that combines a pretrained body from `sentence_transformers` and a classification head from either [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) or [`SetFitHead`](https://github.com/huggingface/setfit/blob/main/src/setfit/modeling.py) (a differentiable head built upon `PyTorch` with similar APIs to `sentence_transformers`).
 * `Trainer`: a helper class that wraps the fine-tuning process of SetFit (called `SetFitTrainer` in older releases).
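
To make the body-plus-head split concrete, here is a deliberately tiny sketch in plain Python. Everything in it is a stand-in chosen for illustration: `toy_body` is a bag-of-words counter rather than a Sentence Transformer, and `ToyCentroidHead` is a nearest-centroid classifier rather than logistic regression or `SetFitHead`. It only shows how an embedding function and a classifier compose, not how `setfit` implements either part.

```python
from collections import Counter


def toy_body(text, vocab):
    """Stand-in for a sentence-embedding body: bag-of-words counts over `vocab`."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


class ToyCentroidHead:
    """Stand-in for a classification head: predicts the nearest class centroid."""

    def fit(self, embeddings, labels):
        sums, counts = {}, Counter(labels)
        for vec, label in zip(embeddings, labels):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
        # Average the summed vectors to get one centroid per class
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, embeddings):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            min(self.centroids, key=lambda label: sq_dist(vec, self.centroids[label]))
            for vec in embeddings
        ]


vocab = ["great", "loved", "awful", "boring"]
train_texts = ["great film loved it", "awful and boring", "loved the cast", "boring plot"]
train_labels = ["positive", "negative", "positive", "negative"]

# Embed with the body, then fit the head on the embeddings
head = ToyCentroidHead().fit([toy_body(t, vocab) for t in train_texts], train_labels)
preds = head.predict([toy_body(t, vocab) for t in ["loved it", "so boring"]])
print(preds)  # ['positive', 'negative']
```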

Here is an end-to-end example using a classification head from `scikit-learn`:

In [4]:
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset


# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"].select(range(100))
test_dataset = dataset["validation"].select(range(100, len(dataset["validation"])))

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=["negative", "positive"],
)

args = TrainingArguments(
    batch_size=32,
    num_epochs=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    logging_strategy="no",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"sentence": "text", "label": "label"}  # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate(test_dataset)
print(metrics)
# Example: {'accuracy': 0.8691709844559585} (the exact value depends on the run)

# Download from Hub
# model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2")
# Run inference
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)
# Example: ["positive", "negative"] (the exact predictions depend on the run)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

***** Running training *****
  Num unique pairs = 144
  Batch size = 32
  Num epochs = 1


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.204         | 0.244624        |


Applying column mapping to the evaluation dataset
***** Running evaluation *****


{'accuracy': 0.7590673575129534}
['negative', 'negative']
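
The training log above reports `Num unique pairs = 144` for the 16 sampled sentences. One counting scheme that reproduces this figure is sketched below. It is an assumption about the default sampling behaviour, not something taken from the library's source: pair every sentence with every sentence (itself included), call same-class pairs positive and the rest negative, then oversample the smaller group up to the size of the larger one. The function name `count_contrastive_pairs` is made up for this example.

```python
from itertools import combinations_with_replacement


def count_contrastive_pairs(labels):
    """Count sentence pairs, assuming self-pairs are included and the smaller
    group (positive or negative) is oversampled to match the larger one."""
    positives = negatives = 0
    for a, b in combinations_with_replacement(range(len(labels)), 2):
        if labels[a] == labels[b]:
            positives += 1
        else:
            negatives += 1
    return positives, negatives, 2 * max(positives, negatives)


# 2 classes x 8 examples, as in the SST-2 run above
binary_labels = [0] * 8 + [1] * 8
print(count_contrastive_pairs(binary_labels))  # (72, 64, 144)

# 5 classes x 8 examples
five_class_labels = [c for c in range(5) for _ in range(8)]
print(count_contrastive_pairs(five_class_labels))  # (180, 640, 1280)
```

The second figure matches the `Num unique pairs = 1280` line logged by the multiclass run later in this notebook.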


In [5]:
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])

In [6]:
preds


['negative', 'negative']

In [7]:
metrics = trainer.evaluate()

***** Running evaluation *****


In [8]:
metrics

{'accuracy': 0.82}
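
The two accuracy figures reported above come from different splits. `trainer.evaluate(test_dataset)` scores the validation rows from index 100 onward, whereas `trainer.evaluate()` without an argument presumably falls back to the `eval_dataset` given to the trainer (the first 100 rows). A small sketch of the metric and the arithmetic behind both numbers; the `accuracy` helper is written here for illustration and is not the `evaluate` library's implementation:

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy check: 3 of 4 predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75

# Split sizes used in this notebook: the SST-2 validation split has 872 rows
eval_size = 100
test_size = 872 - eval_size
print(test_size)  # 772

# Number of correct predictions implied by each reported accuracy
print(round(0.82 * eval_size))  # 82
print(round(0.7590673575129534 * test_size))  # 586
```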

## Multiclass


### Data
We'll use the reviews dataset with all five classes.

In [15]:
valid['Sentiment'].unique()


array([0, 1, 2])

In [23]:
Dataset.from_list?


In [17]:
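import pandas as pd
from datasets import Dataset

# Load the tab-separated reviews file
train = pd.read_csv('reviews.csv', sep='\t')

# Ratings run from 1 to 5; shift them to zero-based class ids (0-4),
# which a 5-way classification head with a cross-entropy loss expects
labels = (train['RatingValue'] - 1).tolist()
docs = train['Review'].tolist()

train_dataset = Dataset.from_dict({'text': docs, 'label': labels})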

In [31]:
train_dataset_mini = sample_dataset(train_dataset, label_column="label", num_samples=8)
valid_dataset = sample_dataset(train_dataset, label_column="label", num_samples=100)
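
Conceptually, few-shot sampling draws a fixed number of rows for each class. The sketch below shows that idea in plain Python on toy data. `sample_per_class` is a hypothetical helper written for this example; it is not the implementation of `setfit.sample_dataset`.

```python
import random
from collections import Counter, defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Draw up to `num_samples` rows for each label value (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Classes with fewer rows than requested contribute all of their rows
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    rng.shuffle(sampled)
    return sampled


# Toy data: 5 rating classes with 20 reviews each
rows = [{"text": f"review {i}", "label": i % 5} for i in range(100)]
subset = sample_per_class(rows, num_samples=8)
print(len(subset))  # 40
print(sorted(Counter(row["label"] for row in subset).items()))
# [(0, 8), (1, 8), (2, 8), (3, 8), (4, 8)]
```

When a training subset and a validation subset are both drawn from the same pool, as in the cell above, they can share rows, so the validation score may be optimistic.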


In [29]:
train['RatingValue'].unique()


array([4, 3, 5, 1, 2])

In [28]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5",
                                    use_differentiable_head=True,
                                    head_params={"out_features": 5})

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset_mini,
    eval_dataset=valid_dataset,
    metric="accuracy",
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate(valid_dataset)
print(metrics)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

***** Running training *****
  Num unique pairs = 1280
  Batch size = 16
  Num epochs = 1


## Multilabel

TODO

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(
    "BAAI/bge-small-en-v1.5",
    multi_target_strategy="multi-output",
)
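
In a multilabel setting each example can carry several labels at once, so the label column is typically a 0/1 indicator vector rather than a single class id. A minimal sketch of that encoding, assuming this is the format the trainer expects; the helper `to_multi_hot` and the tag names are hypothetical:

```python
def to_multi_hot(tag_lists, classes):
    """Encode each list of tags as a 0/1 vector ordered like `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    encoded = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1
        encoded.append(row)
    return encoded


classes = ["food", "service", "price"]  # hypothetical label set
tag_lists = [["food"], ["food", "price"], []]
print(to_multi_hot(tag_lists, classes))
# [[1, 0, 0], [1, 0, 1], [0, 0, 0]]
```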