# Few-shot for Text Classification

In this notebook, we'll perform few-shot text classification using the SetFit framework.

In [None]:
# In Google Colab, install git-lfs first
#!sudo apt-get install git-lfs
#!git lfs install

# Then
#!git clone https://huggingface.co/Salesforce/codet5p-110m-embedding

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))
    !nvidia-smi

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4
Fri Jan 19 12:30:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8              11W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
        

***Class labels:***

0- Bioinformatics

1- Economics

2- Social Sciences

3- Statistical and data analysis

4- Environmental Sciences
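
The integer ids above can be kept alongside their names in a small lookup table, which is handy later for turning predictions back into readable labels. The names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
# Illustrative lookup tables (names are assumptions, not part of the notebook).
ID2LABEL = {
    0: "Bioinformatics",
    1: "Economics",
    2: "Social Sciences",
    3: "Statistical and data analysis",
    4: "Environmental Sciences",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(predictions):
    """Map a sequence of integer class ids to their class names."""
    return [ID2LABEL[int(p)] for p in predictions]


print(decode([1, 4]))
```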


## Setup

In [None]:
!pip install setfit
!pip install datasets
from datasets import load_dataset
from setfit import sample_dataset
import pandas as pd



In [None]:
model_id = "aubmindlab/bert-base-arabertv02"



## Loading and sampling the dataset

In [None]:
df_train = pd.read_csv('Train.csv', on_bad_lines='skip')
df_test = pd.read_csv('Test.csv' , on_bad_lines='skip')

In [None]:
#df_train.columns = df_train.columns.str.replace('Label', 'label')



Textual data

In [None]:
# Build the model input from the title, description, and keywords,
# truncated to 50, 100, and 100 words respectively.
# .loc avoids pandas' chained-assignment pitfall (df[col][i] = ...).
for i in range(len(df_train)):
  df_train.loc[i, 'sentence'] = 'Title: ' + ' '.join(str(df_train['Paper'][i]).split()[:50]) + ' Description: ' + ' '.join(str(df_train['description'][i]).split()[:100]) + ' Keywords: ' + ' '.join(str(df_train['Keywords'][i]).split()[:100]) # + ' files: ' + ' '.join(str(df_train['files'][i]).split()[:100])
# str() guards against missing (NaN) values, as in the training loop.
for i in range(len(df_test)):
  df_test.loc[i, 'sentence'] = 'Title: ' + ' '.join(str(df_test['Paper'][i]).split()[:50]) + ' Description: ' + ' '.join(str(df_test['description'][i]).split()[:100]) + ' Keywords: ' + ' '.join(str(df_test['keywords'][i]).split()[:100]) # + ' files: ' + ' '.join(str(df_test['files'][i]).split()[:100])
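
The cell above caps each field at a fixed number of whitespace-separated words before concatenating them. A minimal sketch of that truncation as a standalone helper, assuming the same 50/100/100 word limits; the function names `first_words` and `build_sentence` are hypothetical and not part of the notebook:

```python
def first_words(value, limit):
    """Return at most `limit` whitespace-separated words of `value` as one string."""
    return " ".join(str(value).split()[:limit])


def build_sentence(title, description, keywords):
    """Concatenate the truncated fields using the same prefixes as the notebook."""
    return (
        "Title: " + first_words(title, 50)
        + " Description: " + first_words(description, 100)
        + " Keywords: " + first_words(keywords, 100)
    )


# Made-up inputs for illustration; repeated whitespace is collapsed by split().
example = build_sentence("A  short\ttitle", "Some description text", "nlp, few-shot")
print(example)
```

Truncating by words keeps the title from being crowded out by a long description, although the tokenizer may still cut the combined text at the model's maximum sequence length.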

Code data

In [None]:
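# Use the raw source code as the model input.
# .loc avoids pandas' chained-assignment pitfall (df[col][i] = ...).
for i in range(len(df_train)):
  df_train.loc[i, 'sentence'] = str(df_train['content'][i])
for i in range(len(df_test)):
  df_test.loc[i, 'sentence'] = str(df_test['Code'][i])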

In [None]:
df_train.to_csv('Train.csv',index=False)
df_test.to_csv('Test.csv',index=False)

In [None]:

dataset = load_dataset('csv', data_files={'train': 'Train.csv'}, on_bad_lines='skip')
dataset_ = load_dataset('csv', data_files={'test': 'Test.csv'})

eval_dataset = dataset_["test"]

train_dataset = sample_dataset(dataset["train"], num_samples=500)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
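
Note that `sample_dataset` draws up to `num_samples` examples per class rather than `num_samples` in total. The idea can be sketched in plain Python as follows; this illustrates per-class sampling only and is not SetFit's actual implementation, and `sample_per_class` is a hypothetical name:

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Keep at most `num_samples` randomly chosen rows for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)                   # random order within the class
        sampled.extend(group[:num_samples])  # classes with fewer rows keep all of them
    rng.shuffle(sampled)                     # mix the classes together again
    return sampled


# Toy data: 30 rows spread evenly over 3 labels.
rows = [{"sentence": f"doc {i}", "label": i % 3} for i in range(30)]
subset = sample_per_class(rows, num_samples=4)
print(len(subset))
```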

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the `from_pretrained()` method associated with the `SetFitModel` class:

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


The main arguments to note in the trainer are the following:

* `loss_class`: The loss function to use for contrastive learning with the Sentence Transformer body
* `num_iterations`: The number of text pairs to generate for contrastive learning
* `column_mapping`: The `SetFitTrainer` expects the inputs to be found in a `text` and `label` column. This mapping automatically formats the training and evaluation datasets for us.
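
To make `num_iterations` concrete: for contrastive learning, each training sentence is paired with sentences of the same class (positive pairs, target similarity 1.0) and of different classes (negative pairs, target similarity 0.0). The sketch below assumes one positive and one negative pair per sentence per iteration. It illustrates the idea only and is not SetFit's sampling code; `generate_pairs` is a hypothetical name.

```python
import random


def generate_pairs(sentences, labels, num_iterations=5, seed=42):
    """Build (sentence_a, sentence_b, similarity) triples for contrastive training."""
    rng = random.Random(seed)
    indexed = list(zip(sentences, labels))
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(indexed):
            # Candidates from the same class (excluding the sentence itself).
            same = [s for j, (s, l) in enumerate(indexed) if l == label and j != i]
            # Candidates from any other class.
            other = [s for s, l in indexed if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))
            if other:
                pairs.append((text, rng.choice(other), 0.0))
    return pairs


# Made-up sentences for illustration.
sentences = ["gene expression", "protein folding", "inflation rate", "labour market"]
labels = [0, 0, 1, 1]
pairs = generate_pairs(sentences, labels, num_iterations=5)
print(len(pairs))
```

Under this assumption the number of pairs grows as 2 × `num_iterations` × the number of training sentences, which is why even a small labeled set yields a sizeable contrastive training set.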

In [None]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=5,
    num_epochs=5,
    batch_size=32,
    seed=42,
    metric='accuracy',
    column_mapping={"sentence": "text", "label": "label"},
)

Now that we've created a trainer, we can train it!

In [None]:
trainer.train()

***** Running training *****
  Num unique pairs = 10000
  Batch size = 32
  Num epochs = 5
  Total optimization steps = 1565
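
The numbers in this log are mutually consistent: the step count is the number of batches per epoch, rounded up, times the number of epochs. A quick check using only the values reported above:

```python
import math

num_pairs = 10000   # "Num unique pairs" from the log
batch_size = 32     # "Batch size" from the log
num_epochs = 5      # "Num epochs" from the log

# The last batch of each epoch is only partially filled, hence the ceiling.
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 313 1565
```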




The final step is to compute the model's performance using the `evaluate()` method:

In [None]:
metrics = trainer.evaluate()
metrics

***** Running evaluation *****


{'accuracy': 0.60375}
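
Accuracy here is the fraction of test examples whose predicted label matches the true label. As a point of reference, with five classes a uniformly random guess would score about 0.20 if the classes were balanced (the class balance of the test set is not stated in this notebook). A minimal sketch of the metric, with a hypothetical function name and made-up toy labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only, not taken from the notebook's data.
y_true = [0, 1, 2, 3, 4, 0, 1, 2]
y_pred = [0, 1, 2, 0, 4, 0, 3, 1]
print(accuracy(y_true, y_pred))
```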

Once the model is trained, you can push it to the Hub or save it to a local directory:

In [None]:
# Use the public save method rather than the private `_save_pretrained` hook.
model.save_pretrained('/content/Model_Few_shot_TDK_Exp3')