## SetFit

🤗 SetFit is an efficient, prompt-free framework for few-shot fine-tuning of Sentence Transformers. It achieves high accuracy with little labeled data: for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, 🤗 SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples! See the documentation for all the details:
https://huggingface.co/docs/setfit/index

## Task
Using the SetFit library, we will build a sentence classifier to classify inclusive language-related sentences depending on the context.

## Setup

In [None]:
!pip3 install setfit

First, store your authentication token from the Hugging Face Hub (https://huggingface.co/; sign up there if you haven't already!). To do so, execute the following cell and enter an access token associated with your account:

In [3]:
from huggingface_hub import notebook_login

notebook_login()

## Loading and sampling the dataset

We load the dataset from a repository in your Hugging Face account:

In [4]:
# Load the dataset
from datasets import load_dataset, Dataset

# Note: newer versions of datasets replace use_auth_token=True with token=True
dataset = load_dataset("your_account/english_sentences", use_auth_token=True)
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['label', 'sentence', 'purpose', 'word'],
        num_rows: 310
    })
})

Convert the dataset to a pandas DataFrame to explore it:

In [5]:
df = dataset["train"].to_pandas()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   label     310 non-null    int64 
 1   sentence  310 non-null    object
 2   purpose   310 non-null    object
 3   word      310 non-null    object
dtypes: int64(1), object(3)
memory usage: 9.8+ KB


Split the data into train and test sets, using the purpose column to tell them apart:

In [8]:
# Train data: keep the rows marked "train" and only the label and sentence columns
df_train = df[df["purpose"] == "train"]
df_train = df_train[["label", "sentence"]].reset_index(drop=True)
# Convert the DataFrame back to a Hugging Face Dataset
df_train = Dataset.from_pandas(df_train)

In [9]:
# Test data: keep the rows marked "test" and only the label and sentence columns
df_test = df[df["purpose"] == "test"]
df_test = df_test[["label", "sentence"]].reset_index(drop=True)
# Convert the DataFrame back to a Hugging Face Dataset
df_test = Dataset.from_pandas(df_test)

Explore the train and test sets

In [11]:
df_train

Dataset({
    features: ['label', 'sentence'],
    num_rows: 186
})

In [12]:
df_test

Dataset({
    features: ['label', 'sentence'],
    num_rows: 66
})

We are now ready to train the model.

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the from_pretrained() method associated with the SetFitModel class:

Change the value of model_id below to try different models! Try at least three.
For example, also try sentence-transformers/paraphrase-mpnet-base-v2

You can explore all available pretrained sentence-transformers models here:
https://huggingface.co/models?library=sentence-transformers&sort=downloads

In [13]:
from setfit import SetFitModel
model_id = "sentence-transformers/all-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Here, we've downloaded a pretrained Sentence Transformer from the Hub and added a logistic classification head to create the SetFit model. As indicated in the message, we need to train this model on some labeled examples. We can do so by using the SetFitTrainer class as follows:

In [None]:
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=df_train,
    eval_dataset=df_test,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_epochs=1,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label"},
)

The main arguments to notice in the trainer are the following:

- loss_class: The loss function to use for contrastive learning with the Sentence Transformer body.
- num_iterations: The number of iterations of text-pair generation for contrastive learning; a higher value produces more training pairs.
- column_mapping: The SetFitTrainer expects the inputs to be found in a text and a label column. This mapping automatically formats the training and evaluation datasets for us.
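
To make num_iterations more concrete, here is a minimal, simplified sketch of contrastive pair generation in plain Python. It is not SetFit's actual implementation: the function name generate_pairs, the toy sentences, and the sampling strategy are assumptions made for illustration. For each sentence and each iteration, it draws one partner with the same label (target similarity 1.0) and one partner with a different label (target similarity 0.0).

```python
import random


def generate_pairs(sentences, labels, num_iterations, seed=0):
    """Hypothetical helper: build (text_a, text_b, target) triples.

    Simplified illustration only, not SetFit's internal code.
    Assumes at least two classes and at least two sentences per class.
    """
    rng = random.Random(seed)
    indexed = list(zip(sentences, labels))
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(indexed):
            same = [s for j, (s, l) in enumerate(indexed) if l == label and j != i]
            other = [s for s, l in indexed if l != label]
            # Positive pair: same label, target cosine similarity 1.0
            pairs.append((text, rng.choice(same), 1.0))
            # Negative pair: different label, target cosine similarity 0.0
            pairs.append((text, rng.choice(other), 0.0))
    return pairs


# Toy data, made up for this sketch (not taken from the dataset)
sentences = ["text a", "text b", "text c", "text d"]
labels = [0, 0, 1, 1]
pairs = generate_pairs(sentences, labels, num_iterations=3)
print(len(pairs))  # 4 sentences * 2 pairs * 3 iterations = 24
```

A loss such as CosineSimilarityLoss then pushes the embeddings of each pair toward its target similarity.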

Now that we've created a trainer, we can train it!

**Note!** Be sure to train on a GPU. You can check and change the runtime type in the "Runtime" tab above.
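
One way to confirm that a GPU is visible before training is to ask PyTorch, which Sentence Transformers (and therefore SetFit) runs on. This is a minimal sketch; describe_device is a hypothetical helper name, and it falls back gracefully if PyTorch is not installed.

```python
def describe_device():
    """Return "cuda" if PyTorch sees a GPU, "cpu" if not, or a hint if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_device())
```

If this prints "cpu", switch the runtime to a GPU before calling trainer.train().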

In [None]:
trainer.train()

Applying column mapping to training dataset
***** Running training *****
  Num examples = 7440
  Num epochs = 1
  Total optimization steps = 465
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/465 [00:00<?, ?it/s]
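
The numbers in the training log above can be reproduced with a little arithmetic. The sketch below assumes that each iteration generates one positive and one negative pair per training example, which is consistent with the 7440 examples reported for 186 training sentences.

```python
import math

num_train_examples = 186  # rows in df_train
num_iterations = 20
batch_size = 16
num_epochs = 1

# Assumption: one positive and one negative pair per example per iteration
num_pairs = num_train_examples * num_iterations * 2
steps = math.ceil(num_pairs / batch_size) * num_epochs

print(num_pairs)  # 7440
print(steps)      # 465
```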

## Metrics and evaluation

The final step is to compute the model's performance using the evaluate() method:

In [None]:
metrics = trainer.evaluate()

Applying column mapping to evaluation dataset
***** Running evaluation *****


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Check the metrics:

In [None]:
metrics

{'accuracy': 0.9242424242424242}
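
Accuracy is the fraction of test sentences whose predicted label matches the true label. The sketch below computes it in plain Python; the accuracy helper and the toy labels are illustrative assumptions, not the evaluation code that SetFit runs. The reported value is consistent with 61 correct predictions out of the 66 test sentences.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy labels, made up for this sketch
print(accuracy([1, 0, 0, 1], [1, 0, 1, 1]))  # 0.75

# 61 correct out of 66 matches the accuracy reported above
print(round(61 / 66, 4))  # 0.9242
```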

Try training different models and compare their metrics.
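
A simple way to keep track of the comparison is to collect each run's accuracy in a dictionary and sort it. In the sketch below, rank_models is a hypothetical helper; only the all-mpnet-base-v2 score comes from the run above, and the other entries are left for you to fill in after training.

```python
def rank_models(results):
    """Sort a {model_id: accuracy} mapping from best to worst accuracy."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


results = {
    # Accuracy from the run above
    "sentence-transformers/all-mpnet-base-v2": 0.9242424242424242,
    # Add one entry per model you train, for example:
    # "sentence-transformers/paraphrase-mpnet-base-v2": metrics["accuracy"],
}

for model_id, acc in rank_models(results):
    print(f"{model_id}: {acc:.3f}")
```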

## Predictions and tests

Below are sample sentences that you can run to see whether your classifier catches the nuances of the language and distinguishes between two senses of the same word.

"you look like an old fossyl", "Fossil fuels is an old source of energy", "you will have a flexible schedule", "you need to have a flexible personality", "you should give a valuable impact", "We have the latest styles & trends of Fossil watches, wallets, bags and accessories.", "FREE Shipping & Returns at Fossil.com.", "a flexible substance or material, as rubber or leather.", "The impact of water pollution on aquatic life and land life can be devastating."


In [None]:
# Run inference
preds = model(["you look like an old fossyl", "Fossil fuels is an old source of energy", "you will have a flexible schedule", "you need to have a flexible personality", "you should give a valuable impact"])
preds

array([1, 0, 0, 1, 1])
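
To read the predictions more easily, you can pair each input sentence with its predicted label. The sketch below hard-codes the predictions shown above instead of calling the model, and assumes, as in the final cell of this notebook, that label 1 means "highlight" and label 0 means "false positive".

```python
sentences = [
    "you look like an old fossyl",
    "Fossil fuels is an old source of energy",
    "you will have a flexible schedule",
    "you need to have a flexible personality",
    "you should give a valuable impact",
]
preds = [1, 0, 0, 1, 1]  # copied from the output above

# Assumed meaning of the labels, taken from the last cell of the notebook
label_names = {1: "highlight", 0: "false positive"}

labeled = [(sentence, label_names[int(pred)]) for sentence, pred in zip(sentences, preds)]
for sentence, name in labeled:
    print(f"{name:>14}: {sentence}")
```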

In [None]:
preds = model(["A simple black spider is a fast face painting design that can make a big impact come Halloween."])

In [None]:
preds = model(["you should keep your schedule flexible"])
for item in preds:
    # Label 1 is reported as "highlight", anything else as "false positive"
    if item == 1:
        print("highlight")
    else:
        print("false positive")

highlight
