<a href="https://colab.research.google.com/github/Ramkuchana/LLMs/blob/master/Fine-tuning%20with%20Few-shot%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning with Few-shot Classification

In [1]:
# %%capture
# Quote each requirement so the shell does not treat ">=" as an output redirect
!pip install "datasets>=2.18.0" "transformers>=4.38.2" "sentence-transformers>=2.5.1" "setfit>=1.0.3" "accelerate>=0.27.2" "seqeval>=1.2.2"

This project involves:

* Simulating a few-shot setting for movie review sentiment classification by sampling 15 labeled reviews per class from the IMDb dataset.

* Fine-tuning a pre-trained SentenceTransformer model, specifically sentence-transformers/all-mpnet-base-v2, with SetFit and evaluating it with the F1 score on a balanced test set.

## IMDb dataset

We will use the IMDb dataset from Hugging Face's datasets library to fine-tune our model for binary sentiment classification. The dataset consists of movie reviews labeled as either positive or negative. The original dataset has balanced train and test datasets, each containing 25,000 labeled samples (reviews) and an additional 50,000 unlabeled samples for unsupervised learning.

However, we will only use a subset of the IMDb dataset to reduce the computational requirements.

Let's first load and explore the original IMDb dataset.

In [2]:
from datasets import load_dataset, DatasetDict, concatenate_datasets

# Load the IMDb dataset
imdb_dataset = load_dataset("imdb")



In [3]:
# Print the original IMDb dataset

print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [4]:
# Inspecting the dataset structure

print(imdb_dataset["train"].features)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


In [5]:
# Checking the first review to see what it looks like

print(imdb_dataset["train"][0]["text"])

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve

In [6]:
# Checking the label of the first review

print(imdb_dataset["train"][0]["label"])

0
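The printed `0` is an integer class id. According to the `ClassLabel` feature shown earlier (`names=['neg', 'pos']`), id 0 stands for a negative review. The sketch below reproduces that mapping in plain Python. The helper names `int_to_name` and `name_to_int` are hypothetical and exist only for illustration.

```python
# Label names copied from the ClassLabel feature printed above.
LABEL_NAMES = ["neg", "pos"]


def int_to_name(label_id: int) -> str:
    """Map an integer class id to its name (hypothetical helper)."""
    return LABEL_NAMES[label_id]


def name_to_int(name: str) -> int:
    """Map a class name back to its integer id (hypothetical helper)."""
    return LABEL_NAMES.index(name)


first_label = 0  # the value printed for the first training review
print(int_to_name(first_label))  # neg
```

Inside the notebook, `imdb_dataset["train"].features["label"].int2str(0)` should give the same answer without a hand-written list.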


## Creating a Smaller Dataset



We will create a new IMDb dataset containing a balanced training set with 8,000 samples (reviews) and a balanced test set with 2,000 samples, both drawn from the original IMDb dataset.



In [7]:
# Define a function to balance the dataset
def balance_dataset(dataset, label_col, num_samples):
    # Filter positive and negative examples
    positive_samples = dataset.filter(lambda example: example[label_col] == 1)
    negative_samples = dataset.filter(lambda example: example[label_col] == 0)

    # Subsample both to the desired number
    positive_samples = positive_samples.shuffle(seed=42).select(range(num_samples // 2))
    negative_samples = negative_samples.shuffle(seed=42).select(range(num_samples // 2))

    # Concatenate positive and negative examples to form a balanced dataset
    balanced_dataset = concatenate_datasets([positive_samples, negative_samples]).shuffle(seed=42)

    return balanced_dataset
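To make the balancing logic easier to follow without downloading IMDb, here is a minimal plain-Python sketch of the same steps (filter by label, shuffle, take half from each class, shuffle again). The function `balance_records` and the toy records are assumptions made for illustration; they do not use the `datasets` API.

```python
import random
from collections import Counter


def balance_records(records, label_key, num_samples, seed=42):
    """Return num_samples records, half with label 1 and half with label 0."""
    rng = random.Random(seed)

    # Split the records by class.
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]

    # Shuffle each class, then keep the first half-quota from each.
    rng.shuffle(positives)
    rng.shuffle(negatives)
    half = num_samples // 2
    balanced = positives[:half] + negatives[:half]

    # Shuffle again so the two classes are interleaved.
    rng.shuffle(balanced)
    return balanced


# Toy stand-in for the reviews: 100 records with alternating labels.
toy = [{"text": f"review {i}", "label": i % 2} for i in range(100)]
subset = balance_records(toy, "label", 20)
print(Counter(r["label"] for r in subset))
```

The counter should report ten records for each class, which is the property `balance_dataset` is meant to guarantee on the real data.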

In [8]:
# Create a balanced train and test dataset
balanced_train = balance_dataset(imdb_dataset["train"], "label", 8000)
balanced_test = balance_dataset(imdb_dataset["test"], "label", 2000)

# Create a new IMDb dataset of DatasetDict type with the balanced datasets
new_imdb_dataset = DatasetDict({
    "train": balanced_train,
    "test": balanced_test
})


In [9]:
# Print the new imdb dataset

print(new_imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


In [10]:
# Create train and test dataset splits

train_data, test_data = new_imdb_dataset["train"], new_imdb_dataset["test"]

## Few-shot Classification

In [11]:
from setfit import sample_dataset

# We simulate a few-shot setting by sampling 15 examples per class
sampled_train_data = sample_dataset(imdb_dataset["train"], num_samples=15)
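Conceptually, sampling 15 examples per class means grouping the records by label and drawing a fixed number from each group. The sketch below illustrates that idea in plain Python. It is not SetFit's implementation, and `sample_per_class` is a hypothetical name.

```python
import random
from collections import defaultdict


def sample_per_class(records, label_key="label", num_samples=15, seed=42):
    """Draw up to num_samples records from every class."""
    rng = random.Random(seed)

    # Group the records by their label.
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    # Sample from each class in a fixed label order for reproducibility.
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(num_samples, len(group))
        sampled.extend(rng.sample(group, k))

    rng.shuffle(sampled)
    return sampled


# Toy stand-in for the training split: 200 records, two classes.
toy_train = [{"text": f"review {i}", "label": i % 2} for i in range(200)]
few_shot = sample_per_class(toy_train, num_samples=15)
print(len(few_shot))  # 30
```

With two classes this yields 30 training examples.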

In [12]:
from setfit import SetFitModel

# Load a pre-trained SentenceTransformer model
model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [19]:
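# Optional: reload the model with a differentiable (PyTorch) classification head
# instead of the default scikit-learn logistic regression head
model = SetFitModel.from_pretrained(
    "sentence-transformers/all-mpnet-base-v2", use_differentiable_head=True
)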

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [13]:
from setfit import TrainingArguments as SetFitTrainingArguments
from setfit import Trainer as SetFitTrainer

# Define training arguments
args = SetFitTrainingArguments(
    num_epochs=3,  # Number of epochs for the contrastive fine-tuning of the embedding model
    num_iterations=20,  # Pair-generation iterations; each adds one positive and one negative pair per sample
)
#args.eval_strategy = args.evaluation_strategy

# Create trainer
trainer = SetFitTrainer(
    model=model,
    args=args,
    train_dataset=sampled_train_data,
    eval_dataset=test_data,
    metric="f1"
)


In [14]:
# Training loop
trainer.train()

***** Running training *****
  Num unique pairs = 1200
  Batch size = 16
  Num epochs = 3


| Step | Training Loss |
|------|---------------|
| 1    | 0.2416        |
| 50   | 0.105         |
| 100  | 0.0003        |
| 150  | 0.0001        |
| 200  | 0.0001        |
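The logged pair count is consistent with simple arithmetic: 30 sampled reviews × 20 iterations × 2 pair types (one same-label pair and one different-label pair) = 1,200 pairs. The sketch below is a simplified illustration of contrastive pair generation, not SetFit's actual sampler. `generate_pairs` is a hypothetical helper, and the step count assumes one optimizer step per batch of 16.

```python
import math
import random


def generate_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, similarity) triples for contrastive training."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates with the same label (excluding the anchor itself).
            same = [j for j, other in enumerate(labels) if other == label and j != i]
            # Candidates with a different label.
            diff = [j for j, other in enumerate(labels) if other != label]
            pairs.append((text, texts[rng.choice(same)], 1.0))  # positive pair
            pairs.append((text, texts[rng.choice(diff)], 0.0))  # negative pair
    return pairs


# Toy stand-in for the 30 few-shot reviews (15 per class).
texts = [f"review {i}" for i in range(30)]
labels = [0] * 15 + [1] * 15

pairs = generate_pairs(texts, labels, num_iterations=20)
steps_per_epoch = math.ceil(len(pairs) / 16)  # batch size from the log
total_steps = steps_per_epoch * 3             # num_epochs from the log
print(len(pairs), steps_per_epoch, total_steps)  # 1200 75 225
```

Under these assumptions training runs for 225 steps, which agrees with the loss log ending at step 200 if the loss is logged every 50 steps.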



In [15]:
# Evaluate the model on our test data
trainer.evaluate()

***** Running evaluation *****


{'f1': 0.90104662226451}
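For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for binary labels, assuming class 1 (`pos`) is treated as the positive class. `binary_f1` is a hypothetical helper and the labels are toy values, not the notebook's predictions.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Compute the F1 score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(round(binary_f1(y_true, y_pred), 3))  # 0.667
```

An F1 of about 0.90 on the balanced 2,000-review test set therefore means precision and recall on the positive class are both high, even though only 30 labeled reviews were used for training.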


In [None]:
model.model_head