<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Assessment_SetFit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (Sentence Transformer Fine-tuning): Efficient Few-Shot Learning Without Prompts

## Options
1. Training on GPU, inference on GPU
1. Training on GPU, inference on CPU (recommended)
1. **Training on CPU, inference on CPU (training a bit slow, but this is the default set here)**

## How it works
1. Fine-tuning a pretrained Sentence Transformer (ST) model on a small number of text pairs, in a contrastive, Siamese manner
1. Training a classifier head on the embeddings generated by the fine-tuned ST model

![](https://raw.githubusercontent.com/huggingface/setfit/main/assets/setfit.png)

## Detailed steps of contrastive Siamese training for fine-tuning

1. **Embedding Generation**: The input data is passed through a pretrained transformer model, like Sentence-BERT or RoBERTa, to generate embeddings. These embeddings are vector representations of the text data, capturing the semantic nuances in a high-dimensional space.

1. **Contrastive Loss Calculation**: In contrastive training, the goal is to adjust the embeddings so that similar texts (texts with the same label) are closer together in the embedding space, and dissimilar texts (texts with different labels) are farther apart. The loss is computed on pairs of texts: the model tries to minimize the distance between similar texts while ensuring that dissimilar texts are separated by at least a margin.

1. **Model Training**: The model is trained by optimizing this contrastive loss across all selected texts in the dataset. During training, the parameters of the model (or a portion of the model if using fine-tuning) are adjusted to reduce the loss, thereby learning to generate embeddings that effectively group similar texts together and push dissimilar texts apart.

1. **Outcome**: After this step, the model produces high-quality embeddings that are more useful for the specific classification or analysis tasks because they better represent the differences and similarities as per the task-specific data.

This contrastive training step leverages a small amount of labeled data to adapt the embeddings to the task at hand, setting a strong foundation for the subsequent training of the classifier head. This makes SetFit particularly powerful in scenarios where labeled data is scarce but high-quality embeddings are crucial for performance.
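
The margin-based objective described in the steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-D vectors, the names `euclidean` and `contrastive_loss`, and the margin of 1.0 are made up for this example, and the loss that the SetFit library actually uses by default may differ from this formulation.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_label, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Same label: penalize any distance (pull the pair together).
    Different label: penalize only if the pair is closer than `margin`.
    """
    d = euclidean(u, v)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings (hypothetical values, not produced by a real model)
embeddings = {
    "pos_a": ([0.9, 0.1], 1),
    "pos_b": ([0.8, 0.2], 1),
    "neg_a": ([0.1, 0.9], 0),
}

# Every unordered pair of texts becomes one training pair
for (name_a, (vec_a, lab_a)), (name_b, (vec_b, lab_b)) in combinations(embeddings.items(), 2):
    loss = contrastive_loss(vec_a, vec_b, same_label=(lab_a == lab_b))
    print(f"{name_a} / {name_b}: same_label={lab_a == lab_b}, loss={loss:.4f}")
```

Pairs with the same label contribute loss whenever they are apart, while pairs with different labels contribute nothing once they are at least `margin` apart, which is the behavior the list above describes.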

## Links
* Introduction: https://huggingface.co/blog/setfit
* This code is mostly taken from: https://huggingface.co/docs/setfit/quickstart
* Sources and notebooks with technical details: https://github.com/huggingface/setfit
  * https://github.com/huggingface/setfit/tree/main/notebooks
* Paper [2209.11055] Efficient Few-Shot Learning Without Prompts: https://arxiv.org/abs/2209.11055




In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [2]:
!pip install -q setfit


In [3]:
import setfit
setfit.__version__

'1.0.3'

# Data

In [4]:
positive = [
  "With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.",
  "The socio-medical prerequisites for the prescribed aid supply have been met.",
  "Everyday relevant usage benefits have been determined.",
  "Socio-medical indication for the aid is confirmed.",
  "Contraindications have been excluded; there are no contraindications for the use of the requested aid."
]



In [5]:
negative = [
  "No specific findings can be derived from the diagnosis currently named as the basis for the regulation.",
  "According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.",
  "A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.",
  "From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.",
  "A medical justification for why a product not listed in the directory of aids should be used in the present case has not been transmitted."
]

In [6]:
from datasets import Dataset
ds = Dataset.from_dict({"text": positive + negative, "label": len(positive) * [1] + len(negative) * [0]})
# ds.to_list()

In [7]:
from setfit import sample_dataset

train_dataset = sample_dataset(ds, label_column="label", num_samples=3)
train_dataset.to_list()

[{'text': 'Socio-medical indication for the aid is confirmed.', 'label': 1},
 {'text': 'A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.',
  'label': 0},
 {'text': 'Everyday relevant usage benefits have been determined.',
  'label': 1},
 {'text': 'The socio-medical prerequisites for the prescribed aid supply have been met.',
  'label': 1},
 {'text': 'From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.',
  'label': 0},
 {'text': 'According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.',
  'label': 0}]
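
The output above contains three examples per label. A minimal sketch of that kind of per-class sampling, assuming nothing about how `sample_dataset` is implemented internally, could look like this. The helper `sample_per_class`, its parameters, and the toy rows are hypothetical and only serve the illustration.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=3, seed=42):
    """Pick up to `num_samples` rows per label and shuffle the result.

    Hypothetical helper for illustration, not the SetFit implementation.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Never ask for more rows than the class actually has
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    rng.shuffle(sampled)
    return sampled


# Toy data: ten rows, alternating labels 0 and 1
rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
subset = sample_per_class(rows, num_samples=3)
print(len(subset), sorted(r["label"] for r in subset))
```

With five rows per class and `num_samples=3`, the subset holds six rows, three per label, just like `train_dataset` above.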

# Few Shot Training

In [8]:
%%time

from setfit import SetFitModel
model_id = "BAAI/bge-small-en-v1.5"
model = SetFitModel.from_pretrained(model_id)
model.labels = ["negative", "positive"]




model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


CPU times: user 1.11 s, sys: 513 ms, total: 1.63 s
Wall time: 6.86 s


In [9]:
from setfit import Trainer, TrainingArguments

args = TrainingArguments(
    batch_size=16, # even though we have fewer samples, this makes sense: we train on generated text pairs, not on single samples
    num_epochs=1, # Number of epochs to use for contrastive learning
    num_iterations=20, # Number of text pairs to generate for contrastive learning
)

# https://github.com/huggingface/setfit/issues/512#issuecomment-2118679266
args.eval_strategy = args.evaluation_strategy

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset
)


In [10]:
%%time

# this trains on CPU and takes about 1.5 minutes for one epoch
trainer.train()

***** Running training *****
  Num unique pairs = 240
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 15




CPU times: user 1min 14s, sys: 2.69 s, total: 1min 16s
Wall time: 1min 21s
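
The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes that each of the 6 training samples yields one positive and one negative pair per iteration. That assumption is consistent with the logged 240 pairs, but it is an inference from the log rather than a statement about the library internals.

```python
import math

num_samples = 6      # rows in train_dataset
num_iterations = 20  # from TrainingArguments
batch_size = 16      # from TrainingArguments
num_epochs = 1       # from TrainingArguments

# Assumption: one positive and one negative pair per sample and iteration
num_pairs = num_samples * num_iterations * 2

# One optimization step per batch of pairs
steps = math.ceil(num_pairs / batch_size) * num_epochs

print(num_pairs, steps)  # 240 15
```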


In [11]:
trainer.evaluate(train_dataset)

***** Running evaluation *****



{'accuracy': 1.0}

In [12]:
trainer.evaluate(ds)

***** Running evaluation *****


{'accuracy': 0.8}

# Trying it out

In [13]:
model.predict(negative)



['negative', 'negative', 'negative', 'negative', 'negative']

In [14]:
positive



['With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.',
 'The socio-medical prerequisites for the prescribed aid supply have been met.',
 'Everyday relevant usage benefits have been determined.',
 'Socio-medical indication for the aid is confirmed.',
 'Contraindications have been excluded; there are no contraindications for the use of the requested aid.']

In [15]:
model.predict(positive)

['negative', 'positive', 'positive', 'positive', 'negative']
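
The two prediction lists explain the accuracy values reported earlier. The sketch below copies the predictions from the outputs above and marks which list positions ended up in `train_dataset` (read off the sampled rows shown in the Data section). The variable names are chosen for this illustration only.

```python
# Predictions copied from the outputs of model.predict(negative) and model.predict(positive)
pred_negative = ["negative"] * 5
pred_positive = ["negative", "positive", "positive", "positive", "negative"]

# 0-based positions in `negative` / `positive` that were sampled into train_dataset
train_idx_negative = {1, 2, 3}
train_idx_positive = {1, 2, 3}

# One record per example: (prediction correct?, part of the training set?)
records = []
for i, pred in enumerate(pred_negative):
    records.append((pred == "negative", i in train_idx_negative))
for i, pred in enumerate(pred_positive):
    records.append((pred == "positive", i in train_idx_positive))


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


overall = accuracy(ok for ok, _ in records)
train_only = accuracy(ok for ok, in_train in records if in_train)
held_out = accuracy(ok for ok, in_train in records if not in_train)

print(overall, train_only, held_out)  # 0.8 1.0 0.5
```

Because the full dataset includes the six training examples, the 0.8 is optimistic: on the four examples the model has not seen, only the two negative ones are classified correctly.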

In [16]:
%%time

# give it a shot: try your own examples
model.predict([
    "Give them what they want.",
    "They get nothing",
    "Are you kidding me?"
])



CPU times: user 62.2 ms, sys: 7.89 ms, total: 70 ms
Wall time: 71.9 ms


['positive', 'negative', 'negative']

In [17]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found




# Saving model

In [18]:
model_name = "setfit-bge-small-v1.5-sst2-8-shot"

In [19]:
model.save_pretrained(model_name)

In [20]:
!ls -lh {model_name}

total 129M
drwxr-xr-x 2 root root 4.0K Jul 18 08:03 1_Pooling
drwxr-xr-x 2 root root 4.0K Jul 18 08:03 2_Normalize
-rw-r--r-- 1 root root  706 Jul 18 08:03 config.json
-rw-r--r-- 1 root root  201 Jul 18 08:03 config_sentence_transformers.json
-rw-r--r-- 1 root root   85 Jul 18 08:03 config_setfit.json
-rw-r--r-- 1 root root 3.9K Jul 18 08:03 model_head.pkl
-rw-r--r-- 1 root root 128M Jul 18 08:03 model.safetensors
-rw-r--r-- 1 root root  349 Jul 18 08:03 modules.json
-rw-r--r-- 1 root root 8.5K Jul 18 08:03 README.md
-rw-r--r-- 1 root root   52 Jul 18 08:03 sentence_bert_config.json
-rw-r--r-- 1 root root  695 Jul 18 08:03 special_tokens_map.json
-rw-r--r-- 1 root root 1.3K Jul 18 08:03 tokenizer_config.json
-rw-r--r-- 1 root root 695K Jul 18 08:03 tokenizer.json
-rw-r--r-- 1 root root 227K Jul 18 08:03 vocab.txt


In [26]:
# pack the model directory into an archive that can be downloaded to your local machine
!tar czvf {model_name}.tgz {model_name}

setfit-bge-small-v1.5-sst2-8-shot/
setfit-bge-small-v1.5-sst2-8-shot/vocab.txt
setfit-bge-small-v1.5-sst2-8-shot/model_head.pkl
setfit-bge-small-v1.5-sst2-8-shot/special_tokens_map.json
setfit-bge-small-v1.5-sst2-8-shot/tokenizer.json
setfit-bge-small-v1.5-sst2-8-shot/README.md
setfit-bge-small-v1.5-sst2-8-shot/config_setfit.json
setfit-bge-small-v1.5-sst2-8-shot/1_Pooling/
setfit-bge-small-v1.5-sst2-8-shot/1_Pooling/config.json
setfit-bge-small-v1.5-sst2-8-shot/tokenizer_config.json
setfit-bge-small-v1.5-sst2-8-shot/modules.json
setfit-bge-small-v1.5-sst2-8-shot/2_Normalize/
setfit-bge-small-v1.5-sst2-8-shot/sentence_bert_config.json
setfit-bge-small-v1.5-sst2-8-shot/model.safetensors
setfit-bge-small-v1.5-sst2-8-shot/config.json
setfit-bge-small-v1.5-sst2-8-shot/config_sentence_transformers.json


# In case you were training on GPU: Loading onto CPU and making inferences

**Only if you were training on GPU and want to try inference on CPU**: Switch the runtime to CPU (which also restarts it), upload the `.tgz` archive saved before, and execute *only the following cells* on CPU.

In [28]:
model_name = "setfit-bge-small-v1.5-sst2-8-shot"

In [29]:
!tar xzvf {model_name}.tgz

setfit-bge-small-v1.5-sst2-8-shot/
setfit-bge-small-v1.5-sst2-8-shot/vocab.txt
setfit-bge-small-v1.5-sst2-8-shot/model_head.pkl
setfit-bge-small-v1.5-sst2-8-shot/special_tokens_map.json
setfit-bge-small-v1.5-sst2-8-shot/tokenizer.json
setfit-bge-small-v1.5-sst2-8-shot/README.md
setfit-bge-small-v1.5-sst2-8-shot/config_setfit.json
setfit-bge-small-v1.5-sst2-8-shot/1_Pooling/
setfit-bge-small-v1.5-sst2-8-shot/1_Pooling/config.json
setfit-bge-small-v1.5-sst2-8-shot/tokenizer_config.json
setfit-bge-small-v1.5-sst2-8-shot/modules.json
setfit-bge-small-v1.5-sst2-8-shot/2_Normalize/
setfit-bge-small-v1.5-sst2-8-shot/sentence_bert_config.json
setfit-bge-small-v1.5-sst2-8-shot/model.safetensors
setfit-bge-small-v1.5-sst2-8-shot/config.json
setfit-bge-small-v1.5-sst2-8-shot/config_sentence_transformers.json


In [30]:
!ls -lh {model_name}

total 129M
drwxr-xr-x 2 root root 4.0K Jul 18 08:03 1_Pooling
drwxr-xr-x 2 root root 4.0K Jul 18 08:03 2_Normalize
-rw-r--r-- 1 root root  706 Jul 18 08:03 config.json
-rw-r--r-- 1 root root  201 Jul 18 08:03 config_sentence_transformers.json
-rw-r--r-- 1 root root   85 Jul 18 08:03 config_setfit.json
-rw-r--r-- 1 root root 3.9K Jul 18 08:03 model_head.pkl
-rw-r--r-- 1 root root 128M Jul 18 08:03 model.safetensors
-rw-r--r-- 1 root root  349 Jul 18 08:03 modules.json
-rw-r--r-- 1 root root 8.5K Jul 18 08:03 README.md
-rw-r--r-- 1 root root   52 Jul 18 08:03 sentence_bert_config.json
-rw-r--r-- 1 root root  695 Jul 18 08:03 special_tokens_map.json
-rw-r--r-- 1 root root 1.3K Jul 18 08:03 tokenizer_config.json
-rw-r--r-- 1 root root 695K Jul 18 08:03 tokenizer.json
-rw-r--r-- 1 root root 227K Jul 18 08:03 vocab.txt


In [31]:
!pip install -q setfit

In [32]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_name)

In [33]:
%%time

# give it a shot: try your own examples
model.predict([
    "Give them what they want.",
    "They get nothing",
    "Are you kidding me?"
])

CPU times: user 73.7 ms, sys: 3.98 ms, total: 77.7 ms
Wall time: 166 ms


['positive', 'negative', 'negative']