<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Assessment_SetFit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (Sentence Transformer Fine-tuning): Efficient Few-Shot Learning Without Prompts

## How it works
1. Fine-tune a pretrained Sentence Transformer (ST) on a small number of text pairs in a contrastive, Siamese manner.
1. Train a classifier head on the embeddings generated by the fine-tuned ST.

![](https://raw.githubusercontent.com/huggingface/setfit/main/assets/setfit.png)
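
The second stage can be illustrated without any library. The sketch below trains a logistic regression head on fixed, made-up 2-D "embeddings" using plain gradient descent. It is a stand-in only: the toy vectors and the helper names `train_head` and `predict` are assumptions for illustration, and SetFit itself relies on a ready-made classifier head rather than hand-written code like this.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(embeddings, labels, lr=0.5, steps=200):
    """Fit a logistic regression head on frozen embeddings (hypothetical helper)."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            # Difference between predicted probability and true label
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for k in range(dim):
                grad_w[k] += error * x[k] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)

# Toy embeddings: after contrastive fine-tuning, same-label texts should cluster
toy_embeddings = [(1.0, 0.2), (0.9, 0.1), (0.1, 0.9), (0.2, 1.0)]
toy_labels = [1, 1, 0, 0]

weights, bias = train_head(toy_embeddings, toy_labels)
predictions = [predict(weights, bias, x) for x in toy_embeddings]
```

Because the embeddings are already well separated, a very simple head is enough. This is the reason the first stage matters: the better the fine-tuned embeddings cluster by label, the less work the head has to do.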

## Detailed steps of contrastive Siamese training for fine-tuning

1. **Embedding Generation**: The input data is passed through a pretrained transformer model, like Sentence-BERT or RoBERTa, to generate embeddings. These embeddings are vector representations of the text data, capturing the semantic nuances in a high-dimensional space.

1. **Contrastive Loss Calculation**: In contrastive training, the goal is to adjust the embeddings so that similar texts (texts with the same label) are closer together in the embedding space, and dissimilar texts (texts with different labels) are farther apart. The loss is computed on pairs of texts: the model tries to minimize the distance between pairs of similar texts while ensuring that pairs of dissimilar texts are separated by at least a margin.

1. **Model Training**: The model is trained by optimizing this contrastive loss across all selected texts in the dataset. During training, the parameters of the model (or a portion of the model if using fine-tuning) are adjusted to reduce the loss, thereby learning to generate embeddings that effectively group similar texts together and push dissimilar texts apart.

1. **Outcome**: After this step, the model produces high-quality embeddings that are more useful for the specific classification or analysis tasks because they better represent the differences and similarities as per the task-specific data.

This contrastive training step uses a small amount of labeled data to adapt the embeddings to the task at hand, setting a strong foundation for the subsequent training of the classifier head. This makes SetFit particularly powerful in scenarios where labeled data is scarce but quality embeddings are crucial for performance.
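
The pair-based loss described in the steps above can be sketched in a few lines. The example builds all pairs from a tiny labeled set and applies the classic margin-based contrastive loss to made-up 2-D embeddings. The helper names (`make_pairs`, `contrastive_loss`), the toy vectors, and the margin of 1.0 are assumptions for illustration; the loss SetFit actually uses is configurable and may differ from this formulation.

```python
import math
from itertools import combinations

def make_pairs(labels):
    """Return (i, j, same_label) for every unordered pair of example indices."""
    return [(i, j, labels[i] == labels[j])
            for i, j in combinations(range(len(labels)), 2)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Pull same-label pairs together, push different-label pairs past the margin."""
    distance = euclidean(emb_a, emb_b)
    if same:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Toy embeddings and labels (illustrative values only)
embeddings = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (0.5, 0.0)]
labels = [1, 1, 0, 0]

pairs = make_pairs(labels)
losses = [contrastive_loss(embeddings[i], embeddings[j], same)
          for i, j, same in pairs]
mean_loss = sum(losses) / len(losses)
```

A dissimilar pair that is already farther apart than the margin contributes no loss, so training effort concentrates on pairs that are still confused.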

## Links
* Introduction: https://huggingface.co/blog/setfit
* This code is mostly taken from: https://huggingface.co/docs/setfit/quickstart
* Source code and notebooks with technical details: https://github.com/huggingface/setfit
  * https://github.com/huggingface/setfit/tree/main/notebooks
* Paper [2209.11055] Efficient Few-Shot Learning Without Prompts: https://arxiv.org/abs/2209.11055




In [None]:
!nvidia-smi

Wed Jul 17 15:23:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install -q setfit

In [None]:
import setfit
setfit.__version__

'1.0.3'

In [None]:
positive = [
  "With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.",
  "The socio-medical prerequisites for the prescribed aid supply have been met.",
  "Everyday relevant usage benefits have been determined.",
  "Socio-medical indication for the aid is confirmed.",
  "Contraindications have been excluded; there are no contraindications for the use of the requested aid."
]



In [None]:
negative = [
  "No specific findings can be derived from the diagnosis currently named as the basis for the regulation.",
  "According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.",
  "A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.",
  "From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.",
  "A medical justification for why a product not listed in the directory of aids should be used in the present case has not been transmitted."
]

In [None]:
from datasets import Dataset
ds = Dataset.from_dict({"text": positive + negative, "label": len(positive) * [1] + len(negative) * [0]})
# ds.to_list()

In [None]:
from setfit import sample_dataset

train_dataset = sample_dataset(ds, label_column="label", num_samples=3)
train_dataset.to_list()

[{'text': 'Socio-medical indication for the aid is confirmed.', 'label': 1},
 {'text': 'A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.',
  'label': 0},
 {'text': 'Everyday relevant usage benefits have been determined.',
  'label': 1},
 {'text': 'The socio-medical prerequisites for the prescribed aid supply have been met.',
  'label': 1},
 {'text': 'From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.',
  'label': 0},
 {'text': 'According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.',
  'label': 0}]
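
The output above contains three examples per label, which is what `num_samples=3` asks for. The idea of few-shot sampling per class can be sketched as follows. `sample_per_class` is a hypothetical helper written for this illustration, with placeholder texts, and is not the library's implementation.

```python
import random
from collections import defaultdict

def sample_per_class(texts, labels, num_samples, seed=42):
    """Pick up to `num_samples` examples for each label, then shuffle the result."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    sampled = []
    for label, items in by_label.items():
        k = min(num_samples, len(items))
        sampled.extend((text, label) for text in rng.sample(items, k))
    rng.shuffle(sampled)
    return sampled

# Placeholder data with the same shape as the notebook's dataset: 5 + 5 examples
texts = [f"positive {i}" for i in range(5)] + [f"negative {i}" for i in range(5)]
labels = [1] * 5 + [0] * 5

few_shot = sample_per_class(texts, labels, num_samples=3)
```

With five examples per class and three sampled, two examples per class stay out of the training set and can serve as a very small held-out check.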

In [None]:
%%time

from setfit import SetFitModel
model_id = "BAAI/bge-small-en-v1.5"
model = SetFitModel.from_pretrained(model_id)
model.labels = ["negative", "positive"]

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


CPU times: user 475 ms, sys: 339 ms, total: 814 ms
Wall time: 2.49 s


In [None]:
from setfit import Trainer, TrainingArguments

args = TrainingArguments(
    batch_size=16, # larger than our sample count, but still sensible: we train on text pairs, not single samples
    num_epochs=1, # Number of epochs to use for contrastive learning
    num_iterations=20, # Number of text pairs to generate for contrastive learning
)

# https://github.com/huggingface/setfit/issues/512#issuecomment-2118679266
args.eval_strategy = args.evaluation_strategy

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset
)


In [None]:
%%time

trainer.train()

***** Running training *****
  Num unique pairs = 240
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 15




CPU times: user 3.36 s, sys: 447 ms, total: 3.81 s
Wall time: 7.19 s
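
The numbers in the training log can be reproduced with simple arithmetic. The sketch assumes that every training sample contributes `num_iterations` positive and `num_iterations` negative pairs; this assumption is consistent with the log above but is not taken from the library's source.

```python
import math

num_train_samples = 6   # 3 per class, from sample_dataset
num_iterations = 20     # from TrainingArguments
batch_size = 16
num_epochs = 1

# Assumption: each sample yields num_iterations positive and negative pairs
num_pairs = num_train_samples * num_iterations * 2

# One optimization step per batch of pairs
total_steps = math.ceil(num_pairs / batch_size) * num_epochs

print(num_pairs, total_steps)
```

This also explains why a batch size of 16 is reasonable with only six training texts: the batches are drawn from the 240 pairs, not from the six samples.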


In [None]:
trainer.evaluate(train_dataset)

***** Running evaluation *****


{'accuracy': 1.0}

In [None]:
trainer.evaluate(ds)

***** Running evaluation *****


{'accuracy': 0.8}

In [None]:
model.predict(negative)



['negative', 'negative', 'negative', 'negative', 'negative']

In [None]:
positive



['With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.',
 'The socio-medical prerequisites for the prescribed aid supply have been met.',
 'Everyday relevant usage benefits have been determined.',
 'Socio-medical indication for the aid is confirmed.',
 'Contraindications have been excluded; there are no contraindications for the use of the requested aid.']

In [None]:
model.predict(positive)

['negative', 'positive', 'positive', 'positive', 'negative']
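
The two prediction lists above account for the accuracy of 0.8 reported on the full dataset. The check below copies the predictions from the outputs and recomputes the score by hand.

```python
# Predictions copied from the outputs above
predicted_for_negative = ['negative', 'negative', 'negative', 'negative', 'negative']
predicted_for_positive = ['negative', 'positive', 'positive', 'positive', 'negative']

expected = ['negative'] * 5 + ['positive'] * 5
predicted = predicted_for_negative + predicted_for_positive

correct = sum(p == e for p, e in zip(predicted, expected))
accuracy = correct / len(expected)

# Positions of the misclassified positive examples
missed_positive = [i for i, p in enumerate(predicted_for_positive) if p != 'positive']

print(accuracy, missed_positive)
```

The two misclassified positives (the first and the last) are exactly the two positive texts that were not drawn into the training sample, while both held-out negatives are classified correctly. On the four held-out texts alone the accuracy is therefore 2 of 4, which is a more honest estimate than the 1.0 measured on the training set, though four examples are far too few for a reliable number.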

In [None]:
# Give it a shot: try your own example sentences
model.predict(["Give them what they want.", "They get nothing", "Are you kidding me?"])



['positive', 'negative', 'negative']

In [None]:
!nvidia-smi

Wed Jul 17 15:24:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0              37W /  70W |   1139MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

