# Few-shot learning with SetFit and Argilla

This tutorial covers the case where no labels are available and we need to bootstrap a small training dataset for a few-shot approach, SetFit. It also covers the use of Argilla for data labelling and quality assurance.

Here's a summary of the key steps in each approach:

**SetFit Framework:**

1. **Data Preparation**: The training data is read from a Parquet file and sampled to create a smaller dataset for demonstration purposes. This data is then converted into a Hugging Face `datasets` Dataset.

2. **Text Encoding**: The text data is encoded with a pre-trained sentence transformer model to obtain an embedding for each text instance. These embeddings are then used for semantic search within Argilla to speed up labelling.

After that, we label a few examples manually in Argilla.

3. **Training with SetFit**: The hand-labelled records are loaded from Argilla and prepared as a training dataset for text classification. The model, loss function, and trainer are instantiated, and training is started. After training, the model's performance is evaluated on the AG News test split.
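
The semantic search mentioned in step 2 relies on comparing embeddings: records whose vectors point in a similar direction tend to be about similar topics, so labelling one record makes it easy to find likely candidates for the same label. The following is a minimal sketch of that idea, not Argilla's actual implementation. The helper names and the three-dimensional toy vectors are hypothetical and only stand in for real encoder output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vec, vectors, top_k=2):
    """Return the ids of the top_k records most similar to query_vec."""
    ranked = sorted(
        vectors,
        key=lambda record_id: cosine_similarity(query_vec, vectors[record_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy "embeddings" (hypothetical values, not real model output)
toy_vectors = {
    "stocks_rally": [0.9, 0.1, 0.0],
    "oil_prices": [0.8, 0.2, 0.1],
    "football_final": [0.0, 0.1, 0.9],
}

# Find the records closest to the "stocks_rally" record (it matches itself first)
similar_ids = most_similar(toy_vectors["stocks_rally"], toy_vectors, top_k=2)
print(similar_ids)
```

In the toy data the two business-like records rank ahead of the sports record, which is the behaviour that makes "find similar" useful during labelling.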

**Traditional Machine Learning Approach:**

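To compare performance, we run a simple traditional classifier that uses TF-IDF features and logistic regression. Note that this baseline is trained on the full AG News training split, whereas SetFit only sees the handful of hand-labelled examples.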

1. **Data Loading**: The AG News dataset is loaded using the `load_dataset` function from Hugging Face datasets.

2. **Data Preprocessing**: The dataset's predefined training and test splits are used, and TF-IDF vectorization is applied to convert the text into numerical features.

3. **Model Training and Evaluation**: A logistic regression model is trained on the TF-IDF transformed text data, and predictions are made on the test set. The performance of the model is evaluated with a classification report that shows accuracy, precision, recall, and F1-score.



In [None]:
%pip install argilla setfit -qqq


In [None]:
import argilla as rg

In [None]:
rg.init(
    api_url="https://YOUR-ARGILLA.hf.space",  # replace with the URL of your Argilla Space
    api_key="owner.apikey",
    workspace="admin"
)



In [None]:
from datasets import load_dataset, Dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

In [None]:
import pandas as pd



In [None]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

In [None]:
## PREPPING THE DATA ##

# Read in the training data from a parquet file
data_train = pd.read_parquet('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/ag_news_unlabelled.pq')





In [None]:
# Draw 30 random examples per class (120 in total) as a small demo sample
data_train_sample = data_train.groupby('label').sample(n=30).reset_index(drop=True)




In [None]:
# Convert the Pandas DataFrame to a Hugging Face `datasets` Dataset
dataset_news = Dataset.from_pandas(data_train_sample)



In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence-transformer encoder (device="cuda" assumes a GPU is available)
encoder = SentenceTransformer("intfloat/multilingual-e5-base", device="cuda")



In [None]:

# Encode text field using batched computation
dataset_news = dataset_news.map(lambda batch: {"vectors": encoder.encode(batch["text"])}, batch_size=32, batched=True)

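# Wrap each vector in a {name: vector} dictionary, the format Argilla expects.
# The name is only a key; the vectors come from the e5 encoder loaded above.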
dataset_news = dataset_news.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)


In [None]:
unlabelled = rg.DatasetForTextClassification.from_datasets(dataset_news)



In [None]:
rg.log(unlabelled, "news_sample_vecs")


BulkResponse(dataset='news_sample_vecs', processed=120, failed=0)

In [None]:
# Load the hand-labelled dataset from Argilla and the AG News test split for evaluation
train_ds = rg.load("news_sample_vecs").prepare_for_training()
test_ds = load_dataset("ag_news", split="test")



In [None]:
train_ds[0]



{'id': '02de3f25-14f2-4b77-8c5f-c18129d8a550',
 'text': 'Stocks Rally on Lower Oil Prices Stocks rallied in quiet trading Wednesday as lower oil prices brought out buyers, countering a pair of government reports that gave a mixed picture of the economy.',
 'label': 0}

In [None]:
train_ds.features

{'id': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['Business', 'Sci/Tech', 'Sports', 'World'], id=None)}

In [None]:
test_ds.features



{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}

In [None]:
# Map the label ids of train_ds (order shown in train_ds.features) to the AG News ids of test_ds
# e.g. 'Business' is 0 in train_ds but 2 in test_ds
label_mapping_a_to_b = {0: 2, 1: 3, 2:1, 3:0}

In [None]:
# Function to apply label mapping
def apply_label_mapping(example, label_mapping):
    example['label'] = label_mapping[example['label']]
    return example

# Apply the mapping so that the train_ds label ids follow the same order as test_ds
train_ds = train_ds.map(lambda x: apply_label_mapping(x, label_mapping_a_to_b))


Map:   0%|          | 0/28 [00:00<?, ? examples/s]
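
The hard-coded mapping above can also be derived from the two lists of label names, which avoids mistakes when the label order changes. The sketch below is one possible way to do this; `build_label_mapping` is a hypothetical helper, and the two name lists are copied from the `train_ds.features` and `test_ds.features` outputs shown earlier.

```python
def build_label_mapping(source_names, target_names):
    """Map each source label id to the id of the same label name in target_names."""
    if sorted(source_names) != sorted(target_names):
        raise ValueError("Source and target label sets differ")
    return {
        source_id: target_names.index(name)
        for source_id, name in enumerate(source_names)
    }


# Label names as printed by train_ds.features and test_ds.features
argilla_names = ['Business', 'Sci/Tech', 'Sports', 'World']
ag_news_names = ['World', 'Sports', 'Business', 'Sci/Tech']

derived_mapping = build_label_mapping(argilla_names, ag_news_names)
print(derived_mapping)  # {0: 2, 1: 3, 2: 1, 3: 0}
```

In the notebook itself, the name lists could be read from `train_ds.features["label"].names` and `test_ds.features["label"].names` instead of being typed out.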

In [None]:
model = SetFitModel.from_pretrained("intfloat/multilingual-e5-base")

model = model.to('cuda')

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # Number of iterations used to generate text pairs for contrastive learning
)


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Map:   0%|          | 0/28 [00:00<?, ? examples/s]

In [None]:
# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

***** Running training *****
  Num unique pairs = 1120
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 70




***** Running evaluation *****



In [None]:
metrics

{'accuracy': 0.8686842105263158}
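
The numbers in the training log are consistent with simple arithmetic on the settings used above. The sketch below assumes that each pair-generation iteration produces one positive and one negative pair per labelled example; this matches the logged values but is an interpretation, not a statement about SetFit internals. The test set size of 7600 is taken from the classification report further down.

```python
import math

num_labelled = 28      # hand-labelled examples in train_ds (see the Map progress line)
num_iterations = 20    # as passed to SetFitTrainer
batch_size = 16        # as passed to SetFitTrainer
num_epochs = 1         # from the training log

# Assumption: one positive and one negative pair per example and iteration
num_pairs = 2 * num_iterations * num_labelled
optimization_steps = math.ceil(num_pairs / batch_size) * num_epochs
print(num_pairs, optimization_steps)  # 1120 70

# Translate the reported accuracy into a count of correct predictions
accuracy = 0.8686842105263158
test_size = 7600
num_correct = round(accuracy * test_size)
print(num_correct)  # 6602
```

Roughly 6,600 of the 7,600 test articles are classified correctly after training on only 28 labelled examples.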

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Load the AG News dataset
dataset = load_dataset("ag_news")

# Prepare training and test sets
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']

test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

# Create a TF-IDF vectorizer and logistic regression pipeline
model_lg = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000)
)

# Train the model
model_lg.fit(train_texts, train_labels)

# Predict on the test set
predicted_labels = model_lg.predict(test_texts)

# Evaluate the model
print(classification_report(test_labels, predicted_labels))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92      1900
           1       0.96      0.98      0.97      1900
           2       0.89      0.88      0.88      1900
           3       0.89      0.90      0.89      1900

    accuracy                           0.92      7600
   macro avg       0.92      0.92      0.92      7600
weighted avg       0.92      0.92      0.92      7600
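
The report above lists classes by their numeric id. To make it easier to read, the sketch below attaches the AG News label names (in the order shown by `test_ds.features`) to the per-class scores copied from the report, recomputes the macro averages, and finds the class with the lowest F1-score. The variable names are illustrative only.

```python
# AG News label names in id order, as shown by test_ds.features
ag_news_names = ['World', 'Sports', 'Business', 'Sci/Tech']

# (precision, recall, f1) per class id, copied from the report above
per_class = {
    0: (0.93, 0.91, 0.92),
    1: (0.96, 0.98, 0.97),
    2: (0.89, 0.88, 0.88),
    3: (0.89, 0.90, 0.89),
}

# Macro average: unweighted mean of each metric over the classes
macro_avg = tuple(
    sum(scores[i] for scores in per_class.values()) / len(per_class)
    for i in range(3)
)

# Class with the lowest F1-score
weakest_id = min(per_class, key=lambda class_id: per_class[class_id][2])
weakest_name = ag_news_names[weakest_id]

for class_id, (precision, recall, f1) in per_class.items():
    print(f"{ag_news_names[class_id]:<10} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("macro avg ", " ".join(f"{value:.3f}" for value in macro_avg))
print("weakest class:", weakest_name)
```

Because the inputs are already rounded to two decimals, the recomputed averages only approximate the values in the report. Alternatively, `classification_report` accepts a `target_names` argument, so passing the label names there would print them directly.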

