# Fine-Tuning a Sentence Transformer Model for Financial Text

This notebook demonstrates a best-practice workflow for fine-tuning a pre-trained sentence-transformer model on a specialized domain, in this case a financial question-and-context dataset.

The key steps are:
1.  **Setup**: Install necessary libraries.
2.  **Data Preparation**: Load the dataset and create a robust three-way split (train, validation and test).
3.  **Model Configuration**: Define the training arguments, loss function, and evaluators.
4.  **Training**: Run the training process, using the validation set to monitor performance and save the best model.
5.  **Evaluation**: Perform a final, unbiased evaluation on the held-out test set.
6.  **Comparison**: Compare the performance of the fine-tuned model against the original pre-trained model to quantify the improvement.

## 1. Setup

First, we install the required Python libraries. `sentence-transformers` is the core library for this task; `datasets` handles data loading, and `torch` and `transformers` are the underlying dependencies.

In [None]:
!pip install torch sentence-transformers datasets transformers



## 2. Data Loading and Initial Processing

We will load the `philschmid/finanical-rag-embedding-dataset` from the Hugging Face Hub. This dataset contains pairs of financial questions and their corresponding context paragraphs, which is ideal for our retrieval task.

We also load the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model. This is a strong, general-purpose model that we will specialize for our financial domain.

In [None]:
import torch
from datasets import load_dataset, concatenate_datasets
from sentence_transformers import SentenceTransformer

# Load the base model from the Hugging Face Hub.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Load the financial dataset.
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")

## 3. Creating a Robust Train, Validation, and Test Split

This is a critical step for reliable model evaluation. We will split our data into three distinct sets:
- **Training Set (81%)**: The data the model learns from.
- **Validation Set (9%)**: Data held out from training, used to check the model's performance at the end of each epoch. This helps us find the best model and prevent overfitting.
- **Test Set (10%)**: Data that is completely untouched during training and validation. It's used only once at the end to get a final, unbiased measure of the model's performance.

We will save these splits to JSON files for easy access.
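
As a quick sanity check on those percentages, the sketch below works through the arithmetic of two chained 10% splits. It assumes 7,000 rows, the size reported when the dataset was loaded for this run. `three_way_split_sizes` is a hypothetical helper, and its simple rounding may differ from the library's exact rounding for other dataset sizes.

```python
def three_way_split_sizes(total, test_fraction=0.1, val_fraction=0.1):
    """Return (train, validation, test) sizes for two chained splits."""
    # First split: hold out the test set from the full dataset.
    n_test = round(total * test_fraction)
    n_remaining = total - n_test
    # Second split: hold out the validation set from what remains.
    n_val = round(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


total_rows = 7000  # assumption: dataset size for this run
n_train, n_val, n_test = three_way_split_sizes(total_rows)
for name, size in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {size} rows ({size / total_rows:.0%})")
```

Under that assumption, taking 10% of the remaining 90% is what produces the 81% / 9% / 10% proportions.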

In [None]:
# Rename the columns to the format expected by the trainer: 'anchor' for the query and 'positive' for the relevant document.
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")

# Add a unique ID to each row, which is useful for creating our evaluation dictionaries.
dataset = dataset.add_column("id", range(len(dataset)))

# First, split off the test set (10% of the total data).
train_val_dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Next, split the remaining 90% into a new training set and a validation set.
# The validation set will be 10% of this remaining data.
train_dataset_final = train_val_dataset['train'].train_test_split(test_size=0.1, seed=42)

# Save the three final datasets to disk.
train_dataset_final["train"].to_json("train_dataset.json", orient="records")
train_dataset_final["test"].to_json("validation_dataset.json", orient="records") # This is our validation set
train_val_dataset["test"].to_json("test_dataset.json", orient="records") # This is our final test set

In [None]:
# Load the splits from the JSON files we just created.
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
validation_dataset = load_dataset("json", data_files="validation_dataset.json", split="train")
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")

# The InformationRetrievalEvaluator needs a 'corpus' of all possible documents to search from.
# We create this by combining all three splits.
corpus_dataset = concatenate_datasets([train_dataset, validation_dataset, test_dataset])

## 4. Preparing Data for the Evaluator

The `InformationRetrievalEvaluator` requires the data to be in a specific dictionary format:
- `corpus`: A dictionary mapping a document ID to the document text.
- `queries`: A dictionary mapping a query ID to the query text.
- `relevant_docs`: A dictionary mapping a query ID to a list of relevant document IDs.

We will create these dictionaries for both our validation set and our test set.

In [None]:
# Create the corpus dictionary from the combined dataset.
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)

# Create the queries and relevant documents dictionaries for the VALIDATION set.
val_queries = dict(
    zip(validation_dataset["id"], validation_dataset["anchor"])
)
val_relevant_docs = {}
for q_id in val_queries:
    val_relevant_docs[q_id] = [q_id]

# Create the queries and relevant documents dictionaries for the TEST set.
test_queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)
test_relevant_docs = {}
for q_id in test_queries:
    test_relevant_docs[q_id] = [q_id]
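
To make the shape of these three dictionaries concrete, here is a toy version built from plain Python rows. The row contents are invented for illustration and are not taken from the dataset; only the structure mirrors the code above.

```python
# Hypothetical rows with the same columns as our dataset.
rows = [
    {"id": 0, "anchor": "What was net revenue in 2023?", "positive": "Net revenue for 2023 was ..."},
    {"id": 1, "anchor": "How is goodwill tested?", "positive": "Goodwill is tested annually ..."},
    {"id": 2, "anchor": "What drove operating costs?", "positive": "Operating costs rose due to ..."},
]

# The corpus contains every document, regardless of split.
toy_corpus = {row["id"]: row["positive"] for row in rows}

# Pretend the last two rows form the held-out split we want to evaluate on.
held_out = rows[1:]
toy_queries = {row["id"]: row["anchor"] for row in held_out}

# Each query has exactly one relevant document, and it shares the query's ID.
toy_relevant_docs = {q_id: [q_id] for q_id in toy_queries}

print(toy_relevant_docs)
```

Because a question and its context share one `id`, the relevant document for query `q_id` is simply corpus entry `q_id`.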

## 5. Model and Training Configuration

Now we configure the components needed for training.

### Loss Function

We use `MultipleNegativesRankingLoss`, which suits datasets of (anchor, positive) pairs that have no explicit negatives. For each anchor, it treats the positives of all other pairs in the same batch as negative examples, so a batch of 32 pairs gives every anchor 31 negatives at no extra labeling cost. The model learns to score the true positive above these in-batch negatives, which improves its ability to distinguish a relevant passage from other passages in the same domain.

In [None]:
from sentence_transformers.losses import MultipleNegativesRankingLoss

loss = MultipleNegativesRankingLoss(model)
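
The in-batch negatives idea can be sketched in plain Python as a cross-entropy over similarity scores, where the correct "class" for anchor `i` is positive `i`. This is an illustration of the technique, not the library's implementation; the scale factor of 20 and the use of cosine similarity are assumptions made for the example, and `in_batch_negatives_loss` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Score the anchor against every positive in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the other anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is near zero when each anchor is most similar to its own positive and large when another pair's positive scores higher, which is the signal the model trains on.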

### Training Arguments

We define all the training parameters using `SentenceTransformerTrainingArguments`. This is where we set the number of epochs, batch sizes, learning rate, and evaluation strategy.

Crucially, we set:
- `eval_strategy="epoch"`: runs evaluation at the end of each epoch.
- `load_best_model_at_end=True`: makes the trainer reload the weights from the epoch that performed best on the validation set.
- `metric_for_best_model="eval_validation_cosine_ndcg@10"`: defines the "best" model as the one with the highest `ndcg@10` score on the validation set.
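
The metric name has to match the key the trainer logs, so it helps to see how the string is put together. The sketch below assumes the pattern visible in this notebook: an `eval_` prefix, then the evaluator's `name`, the score function's key, and the metric. `best_model_metric_key` is a hypothetical helper, not a library function.

```python
def best_model_metric_key(evaluator_name, score_function, metric):
    """Compose the logged metric key from its parts (assumed pattern)."""
    return f"eval_{evaluator_name}_{score_function}_{metric}"


key = best_model_metric_key("validation", "cosine", "ndcg@10")
print(key)
```

If the evaluator's `name` or the key in `score_functions` changes, `metric_for_best_model` needs to change with it.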

In [None]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    # A name for the output directory.
    output_dir="all-MiniLM-L6-v2-financial",
    # The number of training epochs.
    num_train_epochs=6,
    # The batch size for the training dataloader.
    per_device_train_batch_size=32,
    # The batch size for the evaluation dataloader.
    per_device_eval_batch_size=16,
    # The learning rate.
    learning_rate=2e-5,
    # Use a cosine learning rate scheduler.
    lr_scheduler_type="cosine",
    # Use a fused AdamW optimizer for faster training.
    optim="adamw_torch_fused",
    # Use mixed precision training for a speedup.
    fp16=True,
    # This loss benefits from not having duplicates in the batch.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    # Run evaluation at the end of each epoch.
    eval_strategy="epoch",
    # Save the model at the end of each epoch.
    save_strategy="epoch",
    # Only keep the last 3 saved models.
    save_total_limit=3,
    # When training is finished, load the best model found during training.
    load_best_model_at_end=True,
    # The metric to use to compare models and select the best one.
    metric_for_best_model="eval_validation_cosine_ndcg@10",
)
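
The `NO_DUPLICATES` setting matters because of in-batch negatives: if the same context appears twice in one batch, the second copy is treated as a negative for an anchor it actually answers. The greedy sketch below illustrates the idea only; it is not the library's sampler, and `batches_without_duplicates` is a hypothetical name.

```python
def batches_without_duplicates(pairs, batch_size):
    """Greedily group (anchor, positive) pairs so no text repeats in a batch."""
    remaining = list(pairs)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for anchor, positive in remaining:
            fits = len(batch) < batch_size
            if fits and anchor not in seen and positive not in seen:
                batch.append((anchor, positive))
                seen.update((anchor, positive))
            else:
                # Batch is full or the text is already present: try a later batch.
                deferred.append((anchor, positive))
        batches.append(batch)
        remaining = deferred
    return batches


# Two different questions share the same context "docA".
pairs = [("q1", "docA"), ("q2", "docA"), ("q3", "docB"), ("q4", "docC")]
batches = batches_without_duplicates(pairs, batch_size=2)
for batch in batches:
    print(batch)
```

Here the two pairs that share `docA` end up in different batches, so neither is used as a false negative for the other.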

### Validation Evaluator

We create an `InformationRetrievalEvaluator` that will be used by the trainer to assess the model's performance on the **validation set** after each epoch.

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.util import cos_sim

# The evaluator passed to the trainer will use the validation set.
validation_evaluator = InformationRetrievalEvaluator(
    queries=val_queries,
    corpus=corpus,
    relevant_docs=val_relevant_docs,
    score_functions={"cosine": cos_sim},
    name="validation"
)
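
To interpret what the evaluator reports, it helps to see how the main metrics reduce when every query has exactly one relevant document, as in this dataset. The sketch below is a simplified illustration under that assumption, not the evaluator's code; `single_relevant_metrics` is a hypothetical helper that takes the 1-based rank of the relevant document for each query, or `None` if it was not retrieved.

```python
import math


def single_relevant_metrics(ranks, k=10):
    """Accuracy, precision, MRR and nDCG at k for one relevant doc per query."""
    n = len(ranks)
    # Keep only the queries whose relevant document appears in the top k.
    hits = [r for r in ranks if r is not None and r <= k]
    accuracy = len(hits) / n
    return {
        "accuracy": accuracy,                 # equals recall@k in this setting
        "precision": accuracy / k,            # at most 1 of the k results is relevant
        "mrr": sum(1.0 / r for r in hits) / n,
        # The ideal DCG is 1 (relevant doc at rank 1), so nDCG equals DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }


# Four queries: relevant doc ranked 1st, 2nd, not retrieved, and 4th.
metrics = single_relevant_metrics([1, 2, None, 4], k=10)
for name, value in metrics.items():
    print(f"{name}@10: {value:.4f}")
```

This is why, in this notebook's results, recall@k always equals accuracy@k and precision@k equals accuracy@k divided by k. nDCG@10 is the most informative single number because it rewards placing the relevant document nearer the top.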

### Assembling the Trainer

Finally, we bring all the components together in the `SentenceTransformerTrainer`.

In [None]:
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select_columns(
        ["anchor", "positive"]
    ),
    loss=loss,
    evaluator=validation_evaluator,
)

## 6. Training the Model

Now we can start the training process. The trainer will display a table showing the performance on the validation set after each epoch. Notice how it tracks the `eval_validation_cosine_ndcg@10` metric we specified.

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Validation Cosine Accuracy@1 | Validation Cosine Accuracy@3 | Validation Cosine Accuracy@5 | Validation Cosine Accuracy@10 | Validation Cosine Precision@1 | Validation Cosine Precision@3 | Validation Cosine Precision@5 | Validation Cosine Precision@10 | Validation Cosine Recall@1 | Validation Cosine Recall@3 | Validation Cosine Recall@5 | Validation Cosine Recall@10 | Validation Cosine NDCG@10 | Validation Cosine MRR@10 | Validation Cosine MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | No log | No log | 0.677778 | 0.819048 | 0.857143 | 0.91746 | 0.677778 | 0.273016 | 0.171429 | 0.091746 | 0.677778 | 0.819048 | 0.857143 | 0.91746 | 0.798128 | 0.760034 | 0.763571 |
| 2 | No log | No log | 0.698413 | 0.838095 | 0.88254 | 0.926984 | 0.698413 | 0.279365 | 0.176508 | 0.092698 | 0.698413 | 0.838095 | 0.88254 | 0.926984 | 0.813546 | 0.777188 | 0.780508 |
| 3 | 0.053400 | No log | 0.693651 | 0.84127 | 0.884127 | 0.925397 | 0.693651 | 0.280423 | 0.176825 | 0.09254 | 0.693651 | 0.84127 | 0.884127 | 0.925397 | 0.813164 | 0.776847 | 0.780266 |
| 4 | 0.053400 | No log | 0.7 | 0.849206 | 0.890476 | 0.92381 | 0.7 | 0.283069 | 0.178095 | 0.092381 | 0.7 | 0.849206 | 0.890476 | 0.92381 | 0.816792 | 0.781934 | 0.78549 |
| 5 | 0.053400 | No log | 0.693651 | 0.847619 | 0.885714 | 0.925397 | 0.693651 | 0.28254 | 0.177143 | 0.09254 | 0.693651 | 0.847619 | 0.885714 | 0.925397 | 0.81444 | 0.778381 | 0.78185 |
| 6 | 0.022900 | No log | 0.695238 | 0.847619 | 0.885714 | 0.926984 | 0.695238 | 0.28254 | 0.177143 | 0.092698 | 0.695238 | 0.847619 | 0.885714 | 0.926984 | 0.815507 | 0.779355 | 0.782652 |


TrainOutput(global_step=1068, training_loss=0.03762949986404247, metrics={'train_runtime': 306.2295, 'train_samples_per_second': 111.093, 'train_steps_per_second': 3.488, 'total_flos': 0.0, 'train_loss': 0.03762949986404247, 'epoch': 6.0})

In [None]:
trainer.save_model()

## 7. Final Evaluation on the Test Set

Training is complete. Because we set `load_best_model_at_end=True`, the `trainer` object now holds the model from the epoch with the best validation score, and the `trainer.save_model()` call above has written that model to `args.output_dir`.
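
As a rough illustration of that selection, the sketch below picks the epoch with the highest validation `ndcg@10`, using the values copied from the training table above. It mirrors the comparison only; the trainer's own checkpoint bookkeeping is not shown.

```python
# Validation nDCG@10 per epoch, copied from the training output above.
ndcg_by_epoch = {
    1: 0.798128,
    2: 0.813546,
    3: 0.813164,
    4: 0.816792,
    5: 0.814440,
    6: 0.815507,
}

# Higher is better for nDCG, so the best epoch is the one with the maximum value.
best_epoch = max(ndcg_by_epoch, key=ndcg_by_epoch.get)
print(f"Best epoch: {best_epoch} (ndcg@10 = {ndcg_by_epoch[best_epoch]:.6f})")
```

In this run the best score comes from epoch 4 rather than the final epoch, which is why reloading the best checkpoint matters.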

Now, we perform the final evaluation on the held-out **test set** to get an unbiased measure of its performance.

In [None]:
# Load the best model, which trainer.save_model() wrote to args.output_dir.
final_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)

# Create a new evaluator specifically for the test set.
test_evaluator = InformationRetrievalEvaluator(
    queries=test_queries,
    corpus=corpus,
    relevant_docs=test_relevant_docs,
    score_functions={"cosine": cos_sim},
    name="test",
)

# Evaluate the model on the test set.
test_results = test_evaluator(final_model)
print("Final results on the test set:")
print(test_results)

Final results on the test set:
{'test_cosine_accuracy@1': 0.7171428571428572, 'test_cosine_accuracy@3': 0.8442857142857143, 'test_cosine_accuracy@5': 0.8914285714285715, 'test_cosine_accuracy@10': 0.9228571428571428, 'test_cosine_precision@1': 0.7171428571428572, 'test_cosine_precision@3': 0.2814285714285714, 'test_cosine_precision@5': 0.17828571428571427, 'test_cosine_precision@10': 0.09228571428571428, 'test_cosine_recall@1': 0.7171428571428572, 'test_cosine_recall@3': 0.8442857142857143, 'test_cosine_recall@5': 0.8914285714285715, 'test_cosine_recall@10': 0.9228571428571428, 'test_cosine_ndcg@10': 0.8226461612658785, 'test_cosine_mrr@10': 0.790068027210884, 'test_cosine_map@100': 0.7934531683753783}


## 8. Comparison and Conclusion

The final step is to compare the performance of our fine-tuned model against the original, off-the-shelf `all-MiniLM-L6-v2` model. This will clearly demonstrate the value of fine-tuning on our domain-specific data.

We run the original model through the same `test_evaluator` and then display the results side-by-side in a DataFrame.

In [None]:
# Load the original, pre-trained model again.
original_model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Evaluate the original model on the same test set.
print("Evaluating the original model...")
original_model_results = test_evaluator(original_model)

print("\n--- Evaluation Complete ---")

Evaluating the original model...

--- Evaluation Complete ---


In [None]:
import pandas as pd

# Create a dictionary to hold both sets of results.
comparison_data = {
    "Original Model": original_model_results,
    "Fine-Tuned Model": test_results
}

# Convert to a Pandas DataFrame for easy viewing.
df_comparison = pd.DataFrame(comparison_data)

# Calculate the percentage improvement.
df_comparison['Improvement'] = (
    (df_comparison['Fine-Tuned Model'] - df_comparison['Original Model']) / df_comparison['Original Model']
) * 100

# Format the improvement column to show as a percentage.
df_comparison['Improvement'] = df_comparison['Improvement'].map('{:.2f}%'.format)

print("--- Performance Comparison on the Test Set ---")
print(df_comparison)

--- Performance Comparison on the Test Set ---
                          Original Model  Fine-Tuned Model Improvement
test_cosine_accuracy@1          0.628571          0.717143      14.09%
test_cosine_accuracy@3          0.762857          0.844286      10.67%
test_cosine_accuracy@5          0.818571          0.891429       8.90%
test_cosine_accuracy@10         0.860000          0.922857       7.31%
test_cosine_precision@1         0.628571          0.717143      14.09%
test_cosine_precision@3         0.254286          0.281429      10.67%
test_cosine_precision@5         0.163714          0.178286       8.90%
test_cosine_precision@10        0.086000          0.092286       7.31%
test_cosine_recall@1            0.628571          0.717143      14.09%
test_cosine_recall@3            0.762857          0.844286      10.67%
test_cosine_recall@5            0.818571          0.891429       8.90%
test_cosine_recall@10           0.860000          0.922857       7.31%
test_cosine_ndcg@10           

### Conclusion

The comparison table shows an improvement on every metric listed. Accuracy@1, for example, rises from 0.629 to 0.717 on the test set, a relative gain of 14.09%, and the **`test_cosine_ndcg@10`** score, a key measure of ranking quality, improved by over **10%**. This indicates that fine-tuning adapted the model to financial language and made it a more effective retrieval tool for this specific domain.
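
Note that the `Improvement` column reports relative change, not percentage points. A minimal sketch of that formula follows, using the accuracy@1 and accuracy@10 values from the comparison table; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(original, fine_tuned):
    """Relative change from the original to the fine-tuned score, in percent."""
    return (fine_tuned - original) / original * 100


# Values copied from the comparison table.
acc1 = relative_improvement(0.628571, 0.717143)
acc10 = relative_improvement(0.860000, 0.922857)

print(f"accuracy@1:  {acc1:.2f}% relative")
print(f"accuracy@10: {acc10:.2f}% relative")
```

The relative gain shrinks as k grows because the original model already retrieves the right document within the top 10 for most queries, leaving less room to improve.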