# Overview

With Sentence Transformers v3, we can finetune its models to improve their performance on specific tasks.


# Why Finetune?

Finetuning helps because each task requires its own notion of similarity. For example, take these two news headlines:

* "Apple launches the new iPad"
* "NVIDIA is gearing up for the next GPU generation"

A classification model for news articles could treat these texts as similar, since both belong to the Technology category. A semantic textual similarity or retrieval model, on the other hand, may consider them dissimilar because their meanings are distinct.
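To make the two notions of similarity concrete, here is a minimal sketch using cosine similarity, a common way to compare sentence embeddings. The 3-dimensional vectors are made up for illustration and are not outputs of any real model; the variable names are hypothetical as well.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings for the two headlines (illustrative only, not model outputs).
# A topic classifier would place both near a shared "Technology" direction.
topic_apple, topic_nvidia = [0.9, 0.1, 0.1], [0.8, 0.2, 0.1]
# An STS or retrieval model would separate them by what each sentence says.
sts_apple, sts_nvidia = [0.9, 0.1, 0.1], [0.1, 0.9, 0.2]

topic_sim = cosine_similarity(topic_apple, topic_nvidia)
sts_sim = cosine_similarity(sts_apple, sts_nvidia)
print(f"topic-style similarity: {topic_sim:.2f}")
print(f"STS-style similarity:   {sts_sim:.2f}")
```

The same pair of texts scores high in one embedding space and low in the other, which is the behaviour finetuning is meant to steer.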

In [1]:
!pip install -U -q sentence-transformers==3.0.0
!pip install -U -q datasets==2.18.0

[0m[31mERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info/METADATA'
[0m[31m
[0m

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"] = user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune model with Sentence Transformer"
os.environ["WANDB_NAME"] = "ft-with-st-v3"
os.environ["MODEL_NAME"] = "bert-base-uncased"

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from datasets import load_dataset

# (anchor, positive, negative)
all_nli_triplet_train = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:500]")
# (sentence1, sentence2) + score
stsb_pair_score_train = load_dataset("sentence-transformers/stsb", split="train[:500]")

# (anchor, positive, negative)
all_nli_triplet_dev = load_dataset("sentence-transformers/all-nli", "triplet", split="dev[:400]")
# (sentence1, sentence2) + score
stsb_pair_score_dev = load_dataset("sentence-transformers/stsb", split="validation[:400]")


# Combine all datasets into a dictionary with dataset names to datasets
train_dataset = {
    "all-nli-triplet": all_nli_triplet_train,
    "stsb": stsb_pair_score_train
}


eval_dataset = {
    "all-nli-triplet": all_nli_triplet_dev,
    "stsb": stsb_pair_score_dev
}


# Model

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(os.getenv("MODEL_NAME"))




# Loss Function

It's crucial to ensure that your dataset format matches your chosen loss function.

* If your loss function requires a Label (as indicated in the Loss Overview table), your dataset must have a column named "label" or "score".

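* All columns other than "label" or "score" are treated as inputs. Their number must match the number of inputs the loss expects; their order matters, their names do not.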
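The two rules above can be sketched as a small check. This is an illustrative helper, not part of the Sentence Transformers API: `split_columns` and `check_format` are hypothetical names, and the column names are taken from the dataset comments in this notebook.

```python
def split_columns(column_names, label_names=("label", "score")):
    """Separate label columns from input columns, preserving column order."""
    inputs = [name for name in column_names if name not in label_names]
    labels = [name for name in column_names if name in label_names]
    return inputs, labels

def check_format(column_names, num_inputs, needs_label):
    """True if the columns fit a loss expecting num_inputs inputs and an optional label."""
    inputs, labels = split_columns(column_names)
    return len(inputs) == num_inputs and (len(labels) == 1) == needs_label

triplet_columns = ["anchor", "positive", "negative"]
stsb_columns = ["sentence1", "sentence2", "score"]

# Triplet data with a loss that takes three inputs and no label.
triplet_ok = check_format(triplet_columns, num_inputs=3, needs_label=False)
# Scored pairs with a loss that takes two inputs plus a score.
stsb_ok = check_format(stsb_columns, num_inputs=2, needs_label=True)
# Scored pairs do not fit a loss that expects three unlabeled inputs.
mismatch = check_format(stsb_columns, num_inputs=3, needs_label=False)
print(triplet_ok, stsb_ok, mismatch)
```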

In [6]:
from sentence_transformers.losses import CoSENTLoss, MultipleNegativesRankingLoss

# (anchor, positive), (anchor, positive, negative)
mnrl_loss = MultipleNegativesRankingLoss(model)

# (sentence_A, sentence_B) + score
cosent_loss = CoSENTLoss(model)


# Map each dataset name to its loss; the keys must match those in train_dataset
losses = {
    "all-nli-triplet": mnrl_loss,
    "stsb": cosent_loss,
}
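For intuition about what `MultipleNegativesRankingLoss` optimizes, here is a minimal pure-Python sketch of the in-batch negatives idea: each anchor should score its own positive higher than every other positive in the batch, expressed as a cross-entropy over scaled cosine similarities. It is a simplified illustration and not the library's implementation; `mnrl_loss` is a hypothetical name, `scale=20.0` is assumed to mirror the library default, and the toy vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch with two distinct pairs: each anchor matches only its own positive.
clean_loss = mnrl_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Toy batch where the same pair appears twice: the duplicate positive is
# wrongly treated as a negative, so the loss cannot fall below log(2).
duplicate_loss = mnrl_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
print(clean_loss, duplicate_loss)
```

The second batch shows why a batch sampler that avoids duplicates suits this loss: a repeated sample penalizes the model for a correct match.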

In [8]:
from sentence_transformers import (
    SentenceTransformerTrainingArguments, 
    SentenceTransformerTrainer
)

from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=os.getenv("WANDB_NAME"),
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if GPU can't handle FP16
    bf16=False,  # Set to True if GPU supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    report_to="wandb",
    run_name=os.getenv('WANDB_NAME'),
)


trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=losses,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss


TrainOutput(global_step=64, training_loss=2.8195793628692627, metrics={'train_runtime': 29.2099, 'train_samples_per_second': 34.235, 'train_steps_per_second': 2.191, 'total_flos': 0.0, 'train_loss': 2.8195793628692627, 'epoch': 1.0})
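The `global_step=64` in the output above, and the empty Step/Training Loss/Validation Loss table, can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, incomplete final batches being kept, and every batch of both datasets being drawn once per epoch; `steps_per_epoch` is a hypothetical helper, not a library function.

```python
import math

def steps_per_epoch(dataset_sizes, batch_size):
    """Total optimizer steps when every dataset is batched separately."""
    return sum(math.ceil(n / batch_size) for n in dataset_sizes)

dataset_sizes = [500, 500]  # train[:500] of all-nli triplets and of stsb
total_steps = steps_per_epoch(dataset_sizes, batch_size=16)

eval_steps = 100  # same value as logging_steps and save_steps above
evaluations = total_steps // eval_steps

print(total_steps)   # 64
print(evaluations)   # 0
```

Under these assumptions, one epoch ends after 64 steps, before step 100 is reached, so no loss is logged and no evaluation runs. Lowering `eval_steps` and `logging_steps` below 64, or training on more data, would populate the table.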

In [12]:
model.save_pretrained(os.getenv("WANDB_NAME"))
model.push_to_hub(os.getenv("WANDB_NAME"))


'https://huggingface.co/aisuko/ft-with-st/commit/457f15442bebc3e8d9319ac412a6b4705b993173'

# Acknowledgements

* https://huggingface.co/blog/train-sentence-transformers?utm_source=substack&utm_medium=email#dataset
* https://sbert.net