# <h1 align="center"><font color="green">SetFit For Few Shot Classification of Advertisements</font></h1>

<font color="pink">Senior Data Scientist: Dr. Eddy Giusepe Chirinos Isidro</font>

<font color="orange">This notebook is based on the tutorial by [Dr. Ishita Gopal](https://ishitagopal.github.io/)</font>

# <font color="red">Few-shot classification</font>

Few-shot classification is a machine learning approach designed to train models with a very limited number of labeled examples per class. Unlike traditional methods, which require large datasets to reach robust performance, `few-shot classification` leverages techniques such as `Transfer Learning` and `Data Augmentation` to generalize from only a handful of training samples. This is particularly useful when collecting labeled data is costly or impractical.

# <font color="red">Advertisement Classification Task</font>

In her tutorial, `Dr. Ishita Gopal` evaluates Few Shot Learning using [SetFit](https://huggingface.co/docs/setfit/en/index) to classify political advertisements into the categories "promote", "attack", or "contrast". A detailed discussion of the data and the classification task is available [here](https://colab.research.google.com/drive/1wz0btgzCYdXPzwVcuxCpRFiiJV13u7AZ?usp=sharing#scrollTo=4gQHpJdqKXFE). In her earlier posts [1](https://colab.research.google.com/drive/1wz0btgzCYdXPzwVcuxCpRFiiJV13u7AZ?usp=sharing#scrollTo=4gQHpJdqKXFE) and [2](https://colab.research.google.com/drive/1-tO8BRHArDWqZwe5hviqWfzxzyDBbDBs#scrollTo=jsEn3-SQ_B_Q), she evaluated the `Facebook/BART-large-MNLI` `zero-shot classification` model and a `SetFit` model adapted for zero-shot classification. Both models struggled to classify the "contrast" category.

For a simplified binary task, in which the "contrast" category was collapsed into the "attack" category, she obtained a balanced accuracy of approximately `80%` using `Facebook/BART-large-MNLI`. In this notebook, we explore `Few Shot Learning` and examine how much it improves performance with limited training data.

# <font color="red">The process of Few Shot Learning with SetFit</font>

* `Contrastive Tuning of Sentence Embeddings:` a pre-trained model is `Fine-Tuned` with a small number of labeled examples for each advertisement category by means of `contrastive learning`. The idea is to minimize a `cosine similarity` loss so that the `embeddings` of similar advertisements move closer together while those of different advertisements are pushed apart, which improves class separation.

* `Tuning the Classifier:` after the `embeddings` have been optimized, a classifier is trained on top of these improved representations using the same few labeled examples. The classifier learns decision boundaries between the `"promote"`, `"attack"`, and `"contrast"` categories. At inference time, the model generates embeddings for new political advertisements and predicts the corresponding class labels.
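To make the first stage more concrete, here is a minimal standard-library sketch of how labeled examples can be turned into contrastive pairs and scored with a cosine similarity loss. It is an illustration only, not SetFit's actual implementation: the helper names (`make_contrastive_pairs`, `cosine_similarity_loss`) and the toy ad texts are hypothetical, and the pairing scheme (one same-class and one different-class partner per example per iteration) is an assumption.

```python
import math
import random


def make_contrastive_pairs(texts, labels, num_iterations=1, seed=42):
    """Build (text_a, text_b, target) triples.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Assumption: one positive and one negative partner per example per iteration.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            same = [j for j, other in enumerate(labels) if other == label and j != i]
            diff = [j for j, other in enumerate(labels) if other != label]
            if same:
                pairs.append((text, texts[rng.choice(same)], 1.0))
            if diff:
                pairs.append((text, texts[rng.choice(diff)], 0.0))
    return pairs


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cosine_similarity_loss(u, v, target):
    """Squared error between the pair's cosine similarity and its target."""
    return (cosine_similarity(u, v) - target) ** 2


# Hypothetical toy ads, two per class.
texts = [
    "I will cut taxes for working families.",
    "My plan creates new jobs.",
    "My opponent lied to you.",
    "She voted against veterans.",
    "He raised taxes; I will lower them.",
    "They failed; we will deliver.",
]
labels = ["promote", "promote", "attack", "attack", "contrast", "contrast"]

pairs = make_contrastive_pairs(texts, labels, num_iterations=2)
print(len(pairs), "pairs generated")
```

Fine-tuning then nudges the sentence embeddings so that this loss shrinks across all pairs. The second stage simply fits a standard classifier on the resulting embeddings.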

# <font color="yellow">Python packages, loading and cleaning the data</font>

In [1]:
%%capture
%pip install sentence-transformers datasets setfit


In [3]:
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments, get_templated_dataset, sample_dataset
from matplotlib import pyplot as plt
import numpy as np

# To update Jupyter: pip install --upgrade jupyter

In [None]:
data = pd.read_csv('transcripts_df_clean.csv')
data.head()
# Rename the transcript column to "text" (the "label" column already has the right name)
data.rename(columns={"transcription": "text"}, inplace=True)

# Print the number of missing transcripts
print(f'Num NAs: {data.text.isna().sum()}')

# Remove rows with NAs
data = data.loc[~data.text.isna()] # 3 NAs

# print num of duplicate ads
print(f'Num duplicates: {data.text.duplicated().sum()}') # 63 duplicates in the data

# Remove duplicates
data.drop_duplicates(subset=['text'], keep='first', inplace=True)
print(f'Num duplicates: {data.text.duplicated().sum()}')

# Keep only English Ads
data = data.loc[data['is_english']==True]
data.head()

In [None]:
data.label.value_counts()

In [None]:
# There is class imbalance and the contrast category has the least number of observations
data.label.hist()


# **Create Train and Test Dataset**

In [None]:
def sample_data(data, label_column, num_samples):
  sampled_df = data.groupby(label_column).apply(lambda x: x.sample(n=num_samples, random_state=42)).reset_index(level=label_column, drop=True)
  return sampled_df

def train_test_split(data, label_column, num_samples):
  train_df = sample_data(data, label_column, num_samples)
  test_df = data.drop(train_df.index)
  return train_df, test_df

In [None]:
# Train data
train_df, test_df = train_test_split(data, label_column="label", num_samples=100)
train_dataset = Dataset.from_pandas(train_df[["text", "label"]])

# Eval and Test dataset subset
test_dataset = Dataset.from_pandas(test_df[["text", "label"]])
dataset_dict = test_dataset.train_test_split(test_size=500, seed=42)

eval_dataset = dataset_dict["test"] # Evaluate the models on this = 500 samples
test_dataset = dataset_dict["train"] # Remaining test data

# **Testing Different Settings for Tuning SetFit**

I experimented with a few different parameters:

1. **Increasing the Number of Samples per Class**: This was done to observe how many more samples were needed to see significant improvements in accuracy.
2. **Testing Different Base Sentence Transformer Models**: I evaluated the performance of two distinct models.
3. **Adjusting Iteration Settings**: I modified the number of text pairs generated for contrastive learning to assess its impact on model performance.

Note that there are several other settings and hyperparameters we could adjust, such as tuning the learning rate, batch size, and classification head ([see documentation](https://huggingface.co/docs/setfit/en/how_to/hyperparameter_optimization)). However, in this post, I opted to use the default values for these parameters based on the provided recommendations.

It is advisable to keep the [learning rate around 2e-5](https://github.com/huggingface/setfit/issues/208) and to use the [sklearn classification head](https://wandb.ai/gladiator/SetFit/reports/SetFit-Efficient-Few-Shot-Learning-Without-Prompts--VmlldzozMDUyMzk2), which performs better than the differentiable SetFit head and is faster to train. I opted for the top-performing sentence transformers from the [sentence-transformers](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) library and selected two general-purpose models that were trained to detect semantic similarity between sentence pairs, a task that closely aligns with our text classification problem. For a similar application, see this [blog post](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Sentence-Transformer-Fine-Tuning-SetFit/post/1407712) on Intel's website.
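To get a feel for what the iteration setting costs, the back-of-the-envelope sketch below estimates how many contrastive pairs and optimizer steps each configuration implies. It rests on an assumption about how `num_iterations` is interpreted (one positive and one negative pair per training example per iteration); the exact count depends on the SetFit version and sampling strategy, and `contrastive_training_budget` is a hypothetical helper, not part of the library.

```python
import math


def contrastive_training_budget(num_classes, samples_per_class, num_iterations, batch_size):
    """Estimate (number of pairs, steps per epoch) for the contrastive stage.

    Assumption: each iteration yields one positive and one negative pair
    for every training example.
    """
    num_samples = num_classes * samples_per_class
    num_pairs = 2 * num_iterations * num_samples
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    return num_pairs, steps_per_epoch


# Settings used in this notebook: 3 classes, num_iterations=20, batch_size=16.
for k in [25, 50, 75, 100]:
    pairs, steps = contrastive_training_budget(3, k, num_iterations=20, batch_size=16)
    print(f"{k:>3} samples/class -> {pairs} pairs, about {steps} steps per epoch")
```

Under this assumption the training cost grows linearly with both the number of labeled examples and `num_iterations`, which is why I only tried a few values of each.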

## **Training Sample Size Vs Balanced Accuracy**

I evaluate the model on a random sample of 500 ads that reflects the true distribution of classes in the data.
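Because the evaluation sample keeps the original class imbalance, plain accuracy can look good even when a minority class is ignored. Balanced accuracy is the mean of the per-class recalls, so every class counts equally. The sketch below computes both from a confusion matrix in plain Python; the counts are hypothetical and only illustrate the difference.

```python
def per_class_recall(cm):
    """cm[i][j] = number of items whose true class is i and predicted class is j."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def balanced_accuracy_from_cm(cm):
    """Mean of the per-class recalls."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)


def plain_accuracy_from_cm(cm):
    """Share of all items that sit on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Hypothetical counts: a classifier that always predicts the majority class.
cm = [
    [90, 0, 0],  # true class 0
    [8, 0, 0],   # true class 1
    [2, 0, 0],   # true class 2
]

print(f"plain accuracy:    {plain_accuracy_from_cm(cm):.2f}")
print(f"balanced accuracy: {balanced_accuracy_from_cm(cm):.2f}")
```

In this made-up case the plain accuracy is high while the balanced accuracy sits at chance level for three classes, which is the behavior we want the metric to expose.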

In [None]:
# Define a custom metric function -- returns balanced acc and confusion matrix
def balanced_accuracy(y_pred, y_true):
  cm = confusion_matrix(y_true, y_pred)
  balanced_acc = balanced_accuracy_score(y_true, y_pred)
  return {'confusion_matrix': cm,
          'balanced_accuracy': balanced_acc}

In [None]:
models = ["paraphrase-mpnet-base-v2"] # "all-mpnet-base-v2"

# Initialize a list to store evaluation results and run information
results = []

for model_name in models:
  for num_train_samples in [25, 50, 75, 100]:
    for num_iterations in [20]: # 15, 25

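      # Subset the full 100-per-class pool to num_train_samples per class.
      # Resample from train_df on every run: re-sampling the already reduced
      # train_dataset would cap all later runs at the first (smallest) subset.
      full_train_dataset = Dataset.from_pandas(train_df[["text", "label"]])
      train_dataset = sample_dataset(full_train_dataset, label_column="label", num_samples=num_train_samples)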
      model = SetFitModel.from_pretrained(model_name)

      # Set training arguments
      args = TrainingArguments(
          num_iterations=num_iterations,
          batch_size=16,
          eval_strategy="epoch",  # same argument name as in the final training cell
        )

      # Create trainer
      trainer = Trainer(
          model=model,
          train_dataset=train_dataset,
          eval_dataset=eval_dataset,
          args=args,
          metric=balanced_accuracy,
        )

      # Train the model
      trainer.train()

      # Evaluate the model on train and test dataset
      eval_metrics_train = trainer.evaluate(dataset=train_dataset)
      eval_metrics_test = trainer.evaluate(dataset=eval_dataset)

      results_row = {
            'model' : model_name,
            'num_train_samples':num_train_samples,
            #'num_iterations': num_iterations,
            'train_balanced_accuracy': eval_metrics_train['balanced_accuracy'],
            'test_balanced_accuracy': eval_metrics_test['balanced_accuracy'],
            'test_confusion_matrix': eval_metrics_test['confusion_matrix']
          }
      print(results_row)
      results.append(results_row)


In [None]:
results = pd.DataFrame(results)
#results.to_csv('results_runs.csv', index=False)

In [None]:
# Plot using pandas built-in plotting functionality
results.plot(x='num_train_samples', y='test_balanced_accuracy', kind='line', marker='o', linestyle='-', color='b', legend=False)

# Adding titles and labels
plt.title('Political Advertisement Classification Task')
plt.xlabel('Number of Train Samples Per Class (Attack,  Promote, Contrast)')
plt.ylabel('Test Balanced Accuracy')
plt.ylim(0.6, .9)
plt.grid(True)

# Show the plot
plt.show()

The graph illustrates balanced accuracy as a function of the number of training samples (ad texts) provided. With only 100 training sentences per class and 20 contrastive pairs per sample, SetFit achieves an impressive 85% balanced accuracy 🥵🎉!!

In contrast, both the zero-shot adaptation of SetFit ([implemented here](https://colab.research.google.com/drive/1-tO8BRHArDWqZwe5hviqWfzxzyDBbDBs#scrollTo=W-cv1hCioeFe)) and the `Facebook/BART-large-MNLI` zero-shot model ([implemented here](https://colab.research.google.com/drive/1wz0btgzCYdXPzwVcuxCpRFiiJV13u7AZ?usp=sharing)) performed poorly on the three-class task, with balanced accuracies of 43% and 60%, respectively. Traditional machine learning models like SVM and Random Forest also lagged, with balanced accuracies between 60% and 65%. Overall, the few-shot approach **significantly improved** classification performance, delivering major gains with minimal data 🏆.


## **Different Base Transformer Models and `num_iterations` Vs Balanced Accuracy**

I also tested the performance of `all-mpnet-base-v2`. In all the runs below, the training data consisted of 25 labeled examples for each class, totaling 75 labeled examples. Since `paraphrase-mpnet-base-v2` performed better in these tests, and considering the GPU resources I had available, I decided to continue working with `paraphrase-mpnet-base-v2`.

| Model                    | Num Iterations | Test Balanced Accuracy |
|--------------------------|----------------|------------------------|
| all-mpnet-base-v2         | 15             | 0.67                   |
| all-mpnet-base-v2         | 20             | 0.70                   |
| all-mpnet-base-v2         | 25             | 0.69                   |
| paraphrase-mpnet-base-v2  | 15             | 0.72                   |
| paraphrase-mpnet-base-v2  | 20             | 0.72                   |
| paraphrase-mpnet-base-v2  | 25             | 0.71                   |
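As a quick sanity check on the table, the snippet below averages the balanced accuracy per base model and reports the spread across `num_iterations`. The numbers are copied from the table above; each one is a single run, so the comparison is indicative rather than statistically rigorous.

```python
from collections import defaultdict
from statistics import mean

# (model, num_iterations, test balanced accuracy), copied from the table.
runs = [
    ("all-mpnet-base-v2", 15, 0.67),
    ("all-mpnet-base-v2", 20, 0.70),
    ("all-mpnet-base-v2", 25, 0.69),
    ("paraphrase-mpnet-base-v2", 15, 0.72),
    ("paraphrase-mpnet-base-v2", 20, 0.72),
    ("paraphrase-mpnet-base-v2", 25, 0.71),
]

by_model = defaultdict(list)
for model_name, _, acc in runs:
    by_model[model_name].append(acc)

# Per model: mean accuracy and the range covered by the three settings.
summary = {m: (mean(accs), max(accs) - min(accs)) for m, accs in by_model.items()}

for model_name, (avg, spread) in summary.items():
    print(f"{model_name}: mean={avg:.3f}, spread={spread:.2f}")
```

The gap between the two base models is about as large as the variation caused by changing `num_iterations`, so the choice of `paraphrase-mpnet-base-v2` is a modest preference rather than a decisive win.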


# **Fine-Tuning based on best model**

I used the best-performing configuration (the `paraphrase-mpnet-base-v2` sentence embedding model, 100 samples per class, and 20 iterations) to train the final model and evaluate it on the entire test dataset.

In [None]:
# Train and test dataset -- using 100 samples per class for training
train_df, test_df = train_test_split(data, label_column="label", num_samples=100)
train_dataset = Dataset.from_pandas(train_df[["text", "label"]])
test_dataset = Dataset.from_pandas(test_df[["text", "label"]])

# Fine Tune model
model = SetFitModel.from_pretrained("paraphrase-mpnet-base-v2")

# Set training arguments
args = TrainingArguments(
    num_iterations=20,
    eval_strategy="epoch")

# Create trainer

trainer = Trainer(
          model=model,
          train_dataset=train_dataset,
          eval_dataset=eval_dataset,
          args=args,
          metric=balanced_accuracy,
        )

# Train the model
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
results = trainer.evaluate(dataset=test_dataset)

In [None]:
results

In [None]:
from sklearn.metrics import classification_report
y_pred = trainer.model.predict(test_dataset['text'])
print(classification_report(test_dataset['label'], y_pred))

In [None]:
import seaborn as sns
sns.heatmap(results['confusion_matrix'], annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

## **Results**
Based on the classification report and confusion matrix, we observe that the overall performance on the test set remains consistent, with a balanced accuracy of 85%. The classifier performs exceptionally well on the Promote and Attack categories, but struggles a bit with the Contrast class. This lower performance could be attributed to the inherently ambiguous nature of contrast ads, which is reflected in the low precision score for Class 2. Despite this challenge, the model still captures a significant portion of the actual Class 2 ads.

To further improve this result, it may be beneficial to tune the model's hyperparameters and to add labeled or synthetic data for the Contrast class. Overall, these results are very impressive, especially considering that the model was trained with only 300 labeled examples!
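The combination of low precision and good recall for a minority class is easy to see from a confusion matrix. The sketch below uses hypothetical counts, not the results of this notebook, with class 2 standing in for Contrast and `precision_recall` as an illustrative helper.

```python
def precision_recall(cm, k):
    """Precision and recall for class k; cm[i][j] = true class i predicted as j."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = true_positives / predicted_k if predicted_k else 0.0
    recall = true_positives / actual_k if actual_k else 0.0
    return precision, recall


# Hypothetical counts; rows are true classes, columns are predictions.
cm = [
    [180, 5, 15],   # class 0
    [6, 170, 24],   # class 1
    [2, 3, 45],     # class 2 (the small class)
]

for k in range(3):
    p, r = precision_recall(cm, k)
    print(f"class {k}: precision={p:.2f}, recall={r:.2f}")
```

Even though most class 2 items are found, the much larger classes leak enough items into the class 2 column to pull its precision down. More labeled or synthetic Contrast examples would mainly help reduce that leakage.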