# Implementation of the SetFit Framework

Open this notebook in Google Colab first:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MouadEt-tali/setfit/blob/main/Simple-text-classification.ipynb)


SetFit is based on <font color="yellow">Sentence Transformers</font> (STs), which are modifications of pretrained transformer models that use Siamese and triplet network structures to derive **semantically meaningful** sentence embeddings.

These models are trained to minimize the distance between pairs of semantically similar sentences and to maximize the distance between pairs that are semantically distant. A standard ST outputs a fixed-size dense vector that represents the input text and can then be used by downstream machine learning algorithms.
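
To make "distance between sentences" concrete, here is a minimal sketch that compares toy embedding vectors with cosine similarity, one common choice of similarity measure. The three vectors are hypothetical stand-ins invented for illustration; real ST embeddings have hundreds of dimensions and come from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (assumption, not model output).
liked_it = [0.9, 0.1, 0.2]
loved_it = [0.8, 0.2, 0.1]
hated_it = [-0.7, 0.3, 0.1]

sim_similar = cosine_similarity(liked_it, loved_it)
sim_distant = cosine_similarity(liked_it, hated_it)

# Similar sentences should score higher than dissimilar ones.
print(sim_similar > sim_distant)
```

A well-trained ST should give semantically similar sentences a similarity close to 1 and push unrelated or opposite sentences toward lower values.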

# Image classification analogy

In a Siamese network, we take an input image of a person and compute its encoding. We then feed an image of a different person through the same network, without updating any weights or biases, and compute that image's encoding as well.

<center><img src="https://github.com/MouadEt-tali/setfit/blob/main/simese_network.png?raw=true"/></center>

Now we compare the two encodings to check how similar the two images are. The encodings act as latent feature representations of the images, and images of the same person have similar encodings. Using this comparison, we can tell whether the two images show the same person.

## Training

How do we actually train the network? One way is to take an anchor image and compare it with both a positive sample and a negative sample. The dissimilarity between the anchor image and the positive image must be low, and the dissimilarity between the anchor image and the negative image must be high.


<center><img src="https://github.com/MouadEt-tali/setfit/blob/main/triplet_loss.png?raw=true"/></center>
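
The anchor/positive/negative idea is usually written as a triplet loss. The sketch below assumes Euclidean distance and a margin of 1.0; both are illustrative choices, and the 2-D points are made-up encodings rather than outputs of a real network.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical 2-D encodings (assumption for illustration).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
far_negative = [3.0, 0.0]
near_negative = [0.5, 0.0]

# Negative already far away: nothing left to correct.
loss_easy = triplet_loss(anchor, positive, far_negative)
# Negative too close to the anchor: the loss is positive.
loss_hard = triplet_loss(anchor, positive, near_negative)
print(loss_easy, loss_hard)
```

The loss only penalizes triplets in which the negative sample is not yet clearly farther from the anchor than the positive sample.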

# SetFit approach to few-shot text classification

SetFit uses a two-step training approach: we first fine-tune an ST (Sentence Transformer) and then train a classifier head.


In the first step, the ST is fine-tuned on the input data in a **contrastive, Siamese manner on sentence pairs.** 

In the second step, a text classification head is trained using the encoded training data generated by the fine-tuned ST from the first step.

<center><img src="https://github.com/MouadEt-tali/setfit/blob/main/setfit_architecture.png?raw=true"/></center>

As in the image analogy, SetFit starts by creating pairs of training examples in the following manner:

Suppose you have a sentiment analysis task with a dataset of sentences and corresponding labels. For simplicity, let's assume there are only 2 classes (negative sentiment 0 and positive sentiment 1):

In [18]:
example = ["That movie was awesome, I wish I could watch it all over again " , "LOOVED IT, next time I'll bring my kids" ,"I was totally DISAPPOINTED, the plot was horrible as well "]
labels  = [1, 1, 0]

for sentence, label in zip(example, labels):
    print(f'{sentence:65} {label:2} ')

That movie was awesome, I wish I could watch it all over again     1 
LOOVED IT, next time I'll bring my kids                            1 
I was totally DISAPPOINTED, the plot was horrible as well          0 


The idea behind sentence-pair generation is that it is possible to **map** the sentences to a feature space where similar sentences are close together and dissimilar sentences are far apart.

<center><img src="https://github.com/MouadEt-tali/setfit/blob/main/feature_space.png?raw=true"/></center>

To map our example to this feature space, we need to create training examples following the Siamese/triplet network ideas:

In [21]:
#  (sentence 1 , sentence 2 , label )  where label is 1 if the sentences belong to the same class and 0 if they don't
train_examples = [(example[0],example[1],1),(example[0],example[2],0)]
train_examples

[('That movie was awesome, I wish I could watch it all over again ',
  "LOOVED IT, next time I'll bring my kids",
  1),
 ('That movie was awesome, I wish I could watch it all over again ',
  'I was totally DISAPPOINTED, the plot was horrible as well ',
  0)]

These `train_examples` are what we use to fine-tune our Sentence Transformer.
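
The implementation section of this notebook fine-tunes with `losses.CosineSimilarityLoss`. Conceptually, that objective pushes the cosine similarity of each pair's embeddings toward the pair's label. The sketch below computes a squared error for a single pair. The 2-D vectors and the helper name `pair_loss` are hypothetical; the real loss is averaged over a batch and backpropagated through the transformer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(emb_a, emb_b, label):
    """Squared error between a pair's cosine similarity and its 0/1 label."""
    return (cosine_similarity(emb_a, emb_b) - label) ** 2

# Hypothetical embeddings (assumption): one aligned pair, one orthogonal pair.
aligned = ([1.0, 0.0], [2.0, 0.0])      # cosine similarity 1
orthogonal = ([1.0, 0.0], [0.0, 1.0])   # cosine similarity 0

loss_pos_good = pair_loss(*aligned, 1.0)     # same class, already aligned
loss_pos_bad = pair_loss(*orthogonal, 1.0)   # same class, but not aligned
loss_neg_good = pair_loss(*orthogonal, 0.0)  # different class, already apart
print(loss_pos_good, loss_pos_bad, loss_neg_good)
```

Pairs whose similarity already matches their label contribute no loss, so training concentrates on pairs the model still places incorrectly.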

Once the ST is fine-tuned, we encode the original sentences simply by calling

`ST.encode(example)`  

Remember that `example` contains only the sentences.

Finally, now that we have rich vector embeddings for each sentence that carry a notion of distance between positive and negative labels, we can train a simple classification model on these vectors.
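
As a rough illustration of this second step, the sketch below fits a tiny logistic regression with plain gradient descent on hypothetical 2-D embeddings. The notebook itself uses scikit-learn's `LogisticRegression`; the hand-written trainer, the learning rate, the epoch count, and the toy vectors here are all assumptions chosen to keep the example dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(vectors, labels, lr=0.5, epochs=200):
    """Fit weights and bias with batch gradient descent on the log loss."""
    dim = len(vectors[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(vectors, weights, bias):
    """Return 1 where the predicted probability is at least 0.5, else 0."""
    return [
        int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)
        for x in vectors
    ]

# Hypothetical embeddings (assumption): fine-tuning has already pulled the
# two classes apart, so a linear classifier separates them easily.
embeddings = [[1.0, 0.9], [0.8, 1.1], [-1.0, -0.8], [-0.9, -1.2]]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(embeddings, labels)
predictions = predict(embeddings, weights, bias)
print(predictions)
```

Because the fine-tuned embeddings already separate the classes, even a simple linear head like this can be enough.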

# Implementation 

In [None]:
!pip install sentence_transformers

In [None]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets, evaluation
from torch.utils.data import DataLoader

from sklearn.manifold import TSNE
from matplotlib import pyplot as plt

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np

import random
import torch

def set_seed(seed):
  # Seed the Python, NumPy, and PyTorch RNGs so sampling and training are repeatable
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)

In [None]:
def sentence_pairs_generation(sentences, labels, pairs):
  # Append one positive and one negative InputExample per sentence to `pairs`.
  # `sentences` and `labels` must be NumPy arrays of equal length.

  numClassesList = np.unique(labels)
  # idx[k] holds the indices of all sentences whose label is numClassesList[k]
  idx = [np.where(labels == i)[0] for i in numClassesList]

  for idxA in range(len(sentences)):
    currentSentence = sentences[idxA]
    label = labels[idxA]
    # randomly pick a sentence with the same label (this may be the
    # current sentence itself)
    idxB = np.random.choice(idx[np.where(numClassesList==label)[0][0]])
    posSentence = sentences[idxB]
    # positive pair: same class, target similarity 1.0
    pairs.append(InputExample(texts=[currentSentence, posSentence], label=1.0))

    # randomly pick a sentence with a different label
    negIdx = np.where(labels != label)[0]
    negSentence = sentences[np.random.choice(negIdx)]
    # negative pair: different classes, target similarity 0.0
    pairs.append(InputExample(texts=[currentSentence, negSentence], label=0.0))

  # return the extended list of sentence pairs
  return pairs

In [None]:
#SST-2
# Load SST-2 dataset into a pandas dataframe.

train_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

# Load the test dataset into a pandas dataframe.
eval_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None)

text_col=train_df.columns.values[0] 
category_col=train_df.columns.values[1]

x_eval = eval_df[text_col].values.tolist()
y_eval = eval_df[category_col].values.tolist()

In [None]:
#@title SetFit
st_model = 'paraphrase-mpnet-base-v2' #@param ['paraphrase-mpnet-base-v2', 'all-mpnet-base-v1', 'all-mpnet-base-v2', 'stsb-mpnet-base-v2', 'all-MiniLM-L12-v2', 'paraphrase-albert-small-v2', 'all-roberta-large-v1']
num_training = 32 #@param ["8", "16", "32", "64", "128", "256", "512"] {type:"raw"}
num_itr = 5 #@param ["1", "2", "3", "4", "5", "10"] {type:"raw"}
plot2d_checkbox = True #@param {type: 'boolean'}

set_seed(0)
# Equal samples per class training
train_df_sample = pd.concat([train_df[train_df[1]==0].sample(num_training), train_df[train_df[1]==1].sample(num_training)])
x_train = train_df_sample[text_col].values.tolist()
y_train = train_df_sample[category_col].values.tolist()

train_examples = [] 
for x in range(num_itr):
  train_examples = sentence_pairs_generation(np.array(x_train), np.array(y_train), train_examples)

orig_model = SentenceTransformer(st_model)
model = SentenceTransformer(st_model)

# S-BERT adaptation: fine-tune the Sentence Transformer on the sentence pairs
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10, show_progress_bar=True)

# No Fit: baseline using the pretrained model without fine-tuning
X_train_noFT = orig_model.encode(x_train)
X_eval_noFT = orig_model.encode(x_eval)

sgd =  LogisticRegression()
sgd.fit(X_train_noFT, y_train)
y_pred_eval_sgd = sgd.predict(X_eval_noFT)

print('Acc. No Fit', accuracy_score(y_eval, y_pred_eval_sgd))

# With Fit (SetFit)
X_train = model.encode(x_train)
X_eval = model.encode(x_eval)

sgd =  LogisticRegression()
sgd.fit(X_train, y_train)
y_pred_eval_sgd = sgd.predict(X_eval)

print('Acc. SetFit', accuracy_score(y_eval, y_pred_eval_sgd))

# Plot 2-D t-SNE projections in a 2x2 grid:
# left column = pretrained model (No Fit), right column = SetFit
if plot2d_checkbox:

  plt.figure(figsize=(20,10))

  # (subplot position, title, embeddings, labels) for each panel
  panels = [
      (221, 'X_train No Fit', X_train_noFT, y_train),
      (223, 'X_eval No Fit', X_eval_noFT, y_eval),
      (222, 'X_train SetFit', X_train, y_train),
      (224, 'X_eval SetFit', X_eval, y_eval),
  ]

  for position, title, embeddings, targets in panels:
    # Project the embeddings to 2-D for visualization
    X_embedded = TSNE(n_components=2).fit_transform(np.array(embeddings))
    plt.subplot(position)
    plt.title(title)

    # One scatter series per class label
    for t in set(np.array(targets)):
      idx = np.array(targets) == t
      plt.scatter(X_embedded[idx, 0], X_embedded[idx, 1], label=t)

    plt.legend(bbox_to_anchor=(1, 1))
