<a href="https://colab.research.google.com/github/DJCordhose/transformers/blob/main/notebooks/SetFit_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (Sentence Transformer Fine-tuning): Efficient Few-Shot Learning Without Prompts

## How it works
1. Fine-tune a pretrained Sentence Transformer (ST) model on a small number of text pairs, in a contrastive Siamese manner
1. Train a classifier head on the embeddings generated by the fine-tuned ST

![](https://raw.githubusercontent.com/huggingface/setfit/main/assets/setfit.png)
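
To make the first step more concrete, here is a minimal sketch of how a handful of labeled texts can be turned into training pairs. The function `make_pairs` and the toy texts are hypothetical and only illustrate the idea; the SetFit library uses its own pair sampler, which is not reproduced here.

```python
import itertools
import random


def make_pairs(texts, labels, seed=0):
    """Build (text_a, text_b, target) triples from labeled texts.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Hypothetical helper for illustration, not SetFit's sampler.
    """
    rng = random.Random(seed)
    pairs = []
    for i, j in itertools.combinations(range(len(texts)), 2):
        target = 1.0 if labels[i] == labels[j] else 0.0
        pairs.append((texts[i], texts[j], target))
    rng.shuffle(pairs)  # mix positive and negative pairs
    return pairs


# Toy data: two positive and two negative reviews
texts = ["great film", "loved it", "boring plot", "a waste of time"]
labels = [1, 1, 0, 0]

pairs = make_pairs(texts, labels)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
print(len(pairs), len(positives), len(negatives))
```

Even four labeled texts yield six pairs, which is why pair-based training can get a lot out of very few labels.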

## Detailed steps of contrastive Siamese training for fine-tuning

1. **Embedding Generation**: The input data is passed through a pretrained transformer model, like Sentence-BERT or RoBERTa, to generate embeddings. These embeddings are vector representations of the text data, capturing the semantic nuances in a high-dimensional space.

1. **Contrastive Loss Calculation**: In contrastive training, the goal is to adjust the embeddings so that similar texts (texts with the same label) are closer together in the embedding space, and dissimilar texts (texts with different labels) are farther apart. Training operates on pairs of texts: the model minimizes the distance between similar pairs while ensuring that dissimilar pairs are separated by at least a margin.

1. **Model Training**: The model is trained by optimizing this contrastive loss across all selected texts in the dataset. During training, the parameters of the model (or a portion of the model if using fine-tuning) are adjusted to reduce the loss, thereby learning to generate embeddings that effectively group similar texts together and push dissimilar texts apart.

1. **Outcome**: After this step, the model produces high-quality embeddings that are more useful for the specific classification or analysis tasks because they better represent the differences and similarities as per the task-specific data.

This contrastive training step effectively leverages a small amount of labeled data to teach the model a nuanced understanding of the task at hand, setting a strong foundation for the subsequent training of the classifier head. This makes SetFit particularly powerful in scenarios where labeled data is scarce but quality embeddings are crucial for performance.
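
The margin-based loss described in the steps above can be sketched in a few lines of plain Python. This is a generic textbook formulation under assumed choices (Euclidean distance, `margin=1.0`, squared terms), not necessarily what SetFit computes; the notebook cells further down pass `CosineSimilarityLoss` to the trainer instead.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_label, margin=1.0):
    """Margin-based contrastive loss for a single pair of embeddings.

    Similar pairs are penalized by their squared distance.
    Dissimilar pairs are penalized only if they are closer than `margin`.
    """
    d = euclidean(u, v)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Same label, distance 0.5: penalized for not being closer
close = contrastive_loss([0.0, 0.0], [0.0, 0.5], same_label=True)
# Different labels, distance 5.0: already beyond the margin, no penalty
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False)
# Different labels, distance 0.5: penalized for being inside the margin
too_close = contrastive_loss([0.0, 0.0], [0.0, 0.5], same_label=False)
print(close, far_apart, too_close)
```

The margin keeps the model from spending effort on dissimilar pairs that are already well separated.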

## Links
* Introduction: https://huggingface.co/blog/setfit
* This code is mostly taken from: https://huggingface.co/docs/setfit/quickstart
* Source code with technical details and notebooks: https://github.com/huggingface/setfit
  * https://github.com/huggingface/setfit/tree/main/notebooks
* Paper [2209.11055] Efficient Few-Shot Learning Without Prompts: https://arxiv.org/abs/2209.11055




In [1]:
!nvidia-smi

Mon May  6 09:35:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install -q setfit

In [3]:
from setfit import SetFitModel

# Massive Text Embedding Benchmark (MTEB) Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

# https://huggingface.co/BAAI/bge-small-en-v1.5
model_id = "BAAI/bge-small-en-v1.5"
# https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2
# model_id = "sentence-transformers/paraphrase-mpnet-base-v2"

model = SetFitModel.from_pretrained(model_id)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [4]:
model.labels = ["negative", "positive"]



In [5]:
# https://huggingface.co/datasets/sst2

from datasets import load_dataset

dataset = load_dataset("SetFit/sst2")
dataset

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 6920
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 872
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1821
    })
})

In [6]:
test_dataset = dataset["test"]
test_dataset[0]

{'text': 'no movement , no yuks , not much of anything .',
 'label': 0,
 'label_text': 'negative'}

In [7]:
from setfit import sample_dataset

num_samples=64
# num_samples=16  # typical
# num_samples=8  # also typical
# num_samples=4
# num_samples=2
# num_samples=1

# more realistic, very small dataset, num_samples samples per category

train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=num_samples)
train_dataset

Dataset({
    features: ['text', 'label', 'label_text'],
    num_rows: 128
})
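
The output shows 128 rows, which is 64 samples for each of the two classes. A minimal sketch of such per-class sampling is given here; `sample_per_class` and the toy rows are hypothetical and are not the implementation of `setfit.sample_dataset`.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Draw up to `num_samples` rows per label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(num_samples, len(group))  # small classes contribute all rows
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy dataset with two balanced classes
rows = [{"text": f"example {i}", "label": i % 2} for i in range(1000)]
subset = sample_per_class(rows, num_samples=64)
print(len(subset))
```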

In [8]:
train_dataset.data.to_pylist()



[{'text': '-lrb- a -rrb- crushing disappointment .',
  'label': 0,
  'label_text': 'negative'},
 {'text': "eddie murphy and owen wilson have a cute partnership in i spy , but the movie around them is so often nearly nothing that their charm does n't do a load of good .",
  'label': 0,
  'label_text': 'negative'},
 {'text': 'bogdanich is unashamedly pro-serbian and makes little attempt to give voice to the other side .',
  'label': 0,
  'label_text': 'negative'},
 {'text': 'represents the depths to which the girls-behaving-badly film has fallen .',
  'label': 0,
  'label_text': 'negative'},
 {'text': 'the secrets of time travel will have been discovered , indulged in and rejected as boring before i see this piece of crap again .',
  'label': 0,
  'label_text': 'negative'},
 {'text': "the movie is concocted and carried out by folks worthy of scorn , and the nicest thing i can say is that i ca n't remember a single name responsible for it .",
  'label': 0,
  'label_text': 'negative'},
 {'

In [9]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # Number of text pairs to generate for contrastive learning
    num_epochs=1 # Number of epochs to use for contrastive learning
)




Map:   0%|          | 0/128 [00:00<?, ? examples/s]
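
The trainer above is given `CosineSimilarityLoss`. In its default configuration this loss is, to my understanding, the mean squared error between the cosine similarity of the two embeddings in a pair and the pair's target value. The sketch below illustrates that idea in plain Python; the toy vectors and the 1.0 / 0.0 targets for same-label / different-label pairs are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs):
    """Mean squared error between cosine similarity and target per pair."""
    errors = [(cosine_similarity(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)


pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical, same label: error 0
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, different labels: error 0
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # identical, different labels: error 1
]
loss = cosine_similarity_loss(pairs)
print(loss)
```

Only the third pair contributes to the loss: its embeddings coincide although the labels differ, which is exactly what fine-tuning is meant to correct.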

In [10]:
# from setfit import Trainer, TrainingArguments

# args = TrainingArguments(
#     batch_size=32, # even though we have fewer samples, this makes sense - we train on unique pairs
#     num_epochs=10,
# )

# trainer = Trainer(
#     model=model,
#     args=args,
#     train_dataset=train_dataset,
# )

In [11]:
%%time

trainer.train()

***** Running training *****
  Num unique pairs = 5120
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 320




CPU times: user 34.7 s, sys: 856 ms, total: 35.6 s
Wall time: 39 s
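
The numbers in the training log can be reproduced with simple arithmetic. The factor of 2 (one positive and one negative pair per example and iteration) is an assumption that happens to match the log; it is not taken from the SetFit source code.

```python
# Values from the cells above
num_train_examples = 128   # 64 samples per class, 2 classes
num_iterations = 20
batch_size = 16
num_epochs = 1

# Assumption: each iteration yields one positive and one negative pair per example
pairs_per_example_per_iteration = 2

num_pairs = num_train_examples * num_iterations * pairs_per_example_per_iteration
steps = (num_pairs // batch_size) * num_epochs

print(num_pairs)  # matches "Num unique pairs" in the log
print(steps)      # matches "Total optimization steps" in the log
```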


In [12]:
# for "BAAI/bge-small-en-v1.5"
# Takes only a few seconds to train
# 84% with 1 sample per category
# 83% with 2 samples per category
# 84% with 4 samples per category
# 85% with 8 samples per category
# 87% with 64 samples per category

# for "sentence-transformers/paraphrase-mpnet-base-v2"
# 90% with 8 samples per category

trainer.evaluate(test_dataset)

***** Running evaluation *****


{'accuracy': 0.8671059857221307}
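
Accuracy is the fraction of predictions that match the reference labels. The sketch below uses a hypothetical `accuracy` helper with toy values, then checks what the reported score implies for the 1821 test rows; the count of correct predictions is derived from the score and is not printed by the notebook.

```python
def accuracy(predictions, references):
    """Fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example: three of four predictions are right
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)

# The reported accuracy on 1821 test rows implies this many correct predictions
reported = 0.8671059857221307
implied_correct = round(reported * 1821)
print(implied_correct, implied_correct / 1821)
```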

In [13]:
# note: the name says 8-shot, but the run above used num_samples=64
model_name = "setfit-bge-small-v1.5-sst2-8-shot"



In [14]:
model.save_pretrained(model_name)

In [15]:
!ls -lh {model_name}

total 129M
drwxr-xr-x 2 root root 4.0K May  6 09:17 1_Pooling
drwxr-xr-x 2 root root 4.0K May  6 09:32 2_Normalize
-rw-r--r-- 1 root root  706 May  6 09:37 config.json
-rw-r--r-- 1 root root  172 May  6 09:37 config_sentence_transformers.json
-rw-r--r-- 1 root root   85 May  6 09:37 config_setfit.json
-rw-r--r-- 1 root root 3.9K May  6 09:37 model_head.pkl
-rw-r--r-- 1 root root 128M May  6 09:37 model.safetensors
-rw-r--r-- 1 root root  349 May  6 09:37 modules.json
-rw-r--r-- 1 root root 8.4K May  6 09:37 README.md
-rw-r--r-- 1 root root   52 May  6 09:37 sentence_bert_config.json
-rw-r--r-- 1 root root  695 May  6 09:37 special_tokens_map.json
-rw-r--r-- 1 root root 1.3K May  6 09:37 tokenizer_config.json
-rw-r--r-- 1 root root 695K May  6 09:37 tokenizer.json
-rw-r--r-- 1 root root 227K May  6 09:37 vocab.txt


In [16]:
# obviously not necessary, only here to illustrate how to load a saved model
model = SetFitModel.from_pretrained(model_name)

In [17]:
preds = model.predict([
    "It's a charming and often affecting journey.",
    "It's slow -- very, very slow.",
    "A sometimes tedious film.",
    "Greatest experience of my life (not)",
])
preds

['positive', 'negative', 'negative', 'positive']

In [18]:
!nvidia-smi

Mon May  6 09:37:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   78C    P0              45W /  70W |   1541MiB / 15360MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

