# Fine-Tuning SentenceTransformers for Multilingual Entity Resolution

This notebook demonstrates how to fine-tune a SentenceTransformer model for entity matching across languages and character sets. We use contrastive learning to train a model that compares person and company names and scores their similarity even when the names are written in different languages or scripts.

## Background

Entity resolution for person and company names is challenging because:
1. Names vary across cultures and languages
2. Transliteration creates variations (e.g., "Yevgeny Prigozhin" vs "Евгений Пригожин")
3. Abbreviations and alternate forms add further variation (e.g., "John Smith" vs "J. Smith")

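Traditional string matching methods such as edit distance often fail on these variations, especially across scripts, where two spellings of the same name may share no characters at all. We use representation learning to create embeddings that capture semantic meaning across languages.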
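
To make the failure mode concrete, here is a minimal sketch that uses only Python's standard-library `difflib` as a stand-in for a character-based matcher. It is not part of this notebook's pipeline, and the helper name `char_similarity` is introduced here purely for illustration.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-overlap ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


# Same person, different scripts: almost no shared characters
cross_script = char_similarity("Yevgeny Prigozhin", "Евгений Пригожин")

# Same script, abbreviated first name: substantial character overlap
abbreviated = char_similarity("John Smith", "J. Smith")

print(f"Latin vs Cyrillic: {cross_script:.3f}")
print(f"Abbreviation:      {abbreviated:.3f}")
```

The cross-script pair scores close to zero even though both strings name the same person, which is the gap that learned embeddings are meant to close.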

## Setup and Imports

We start by importing the required libraries and setting up our environment.

In [None]:
import logging
import os
import random
import sys
import warnings
from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.quantization as tq
from datasets import Dataset  # type: ignore
from scipy.stats import iqr
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    losses,
)
from sentence_transformers.evaluation import BinaryClassificationEvaluator
from sentence_transformers.model_card import SentenceTransformerModelCardData
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from transformers import EarlyStoppingCallback
from transformers.integrations import WandbCallback

import wandb

# For reproducibility
RANDOM_SEED = 31337
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
# Seed the MPS generator only when an Apple GPU is available
if torch.backends.mps.is_available():
    torch.mps.manual_seed(RANDOM_SEED)

# Setup logging and suppress warnings
logging.basicConfig(stream=sys.stderr, level=logging.ERROR)
logger = logging.getLogger(__name__)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", UserWarning)

# HuggingFace settings
os.environ["HF_ENDPOINT"] = "https://huggingface.co/"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Configure Pandas to show more rows
pd.set_option("display.max_rows", 40)
pd.set_option("display.max_columns", None)

## Utility Functions

First, let's define some utility functions to help with our training and evaluation. These are imported from `eridu.train.utils` in the full codebase.

In [None]:
def compute_sbert_metrics(eval_pred):
    """Compute accuracy, precision, recall, f1 and roc_auc
    
    This function is called during model evaluation and logs metrics to W&B automatically
    through the WandbCallback.
    """
    predictions, labels = eval_pred
    predictions = np.asarray(predictions)

    # Treat floating-point predictions as similarity scores and threshold them at 0.5.
    # Check the dtype rather than isinstance(..., float): np.float32 values are not
    # instances of Python's float, so the isinstance test would skip them.
    is_continuous = np.issubdtype(predictions.dtype, np.floating)
    binary_preds = (predictions >= 0.5).astype(int) if is_continuous else predictions

    # Calculate metrics
    metrics = {
        "accuracy": accuracy_score(labels, binary_preds),
        "precision": precision_score(labels, binary_preds, zero_division=0),
        "recall": recall_score(labels, binary_preds, zero_division=0),
        "f1": f1_score(labels, binary_preds, zero_division=0),
    }

    # Calculate AUC only if predictions are continuous (not binary)
    if is_continuous:
        metrics["auc"] = roc_auc_score(labels, predictions)

    return metrics

def sbert_compare(sbert_model, name1, name2, use_gpu=True):
    """Compare two names using SBERT embeddings and cosine similarity.
    
    Args:
        sbert_model: The SentenceTransformer model to use
        name1: First name to compare
        name2: Second name to compare
        use_gpu: Whether to use GPU acceleration
        
    Returns:
        Cosine similarity between the two name embeddings, in the range [-1, 1]
    """
    # Get the device from the model
    device = next(sbert_model.parameters()).device if use_gpu else torch.device("cpu")
    device_str = str(device)
    
    # Determine whether to use GPU based on availability and parameter
    convert_to_tensor = use_gpu and (device.type == "cuda" or device.type == "mps")
    
    # Encode both names into embeddings
    if convert_to_tensor:
        # GPU path
        embedding1_tensor = sbert_model.encode(
            name1,
            convert_to_tensor=True,
            convert_to_numpy=False,
            device=device_str,
        )
        embedding2_tensor = sbert_model.encode(
            name2,
            convert_to_tensor=True,
            convert_to_numpy=False,
            device=device_str,
        )
        
        # Handle dimensions
        if len(embedding1_tensor.shape) > 1 and embedding1_tensor.shape[0] == 1:
            embedding1_tensor = embedding1_tensor.squeeze(0)
            embedding2_tensor = embedding2_tensor.squeeze(0)
            
        # Normalize the embeddings
        embedding1_tensor = embedding1_tensor / torch.norm(embedding1_tensor)
        embedding2_tensor = embedding2_tensor / torch.norm(embedding2_tensor)
        
        # For a single vector, use a simple dot product
        similarity = torch.sum(embedding1_tensor * embedding2_tensor).item()
        return float(similarity)
    else:
        # CPU implementation
        from scipy.spatial import distance
        embedding1_np = sbert_model.encode(name1, convert_to_numpy=True)
        embedding2_np = sbert_model.encode(name2, convert_to_numpy=True)
        # SciPy returns cosine distance, so subtract it from 1 to get similarity
        similarity = 1 - distance.cosine(embedding1_np, embedding2_np)
        return float(similarity)

def sbert_compare_multiple(sbert_model, names1, names2, use_gpu=True):
    """Compare multiple pairs of names efficiently.
    
    Args:
        sbert_model: The SentenceTransformer model to use
        names1: List of first names to compare
        names2: List of second names to compare
        use_gpu: Whether to use GPU acceleration
        
    Returns:
        Array of cosine similarities between corresponding name pairs
    """
    # Handle pandas Series
    if isinstance(names1, pd.Series):
        names1 = names1.astype(str).tolist()
    if isinstance(names2, pd.Series):
        names2 = names2.astype(str).tolist()
    
    # Get device
    device = next(sbert_model.parameters()).device if use_gpu else torch.device("cpu")
    device_str = str(device)
    
    # Determine whether to use GPU
    convert_to_tensor = use_gpu and (device.type == "cuda" or device.type == "mps")
    
    if convert_to_tensor:
        # GPU path
        embeddings1_tensor = sbert_model.encode(
            names1,
            convert_to_tensor=True,
            convert_to_numpy=False,
            device=device_str,
        )
        embeddings2_tensor = sbert_model.encode(
            names2,
            convert_to_tensor=True,
            convert_to_numpy=False,
            device=device_str,
        )
        
        # Handle small sample sizes
        if len(embeddings1_tensor.shape) == 1:
            embeddings1_tensor = embeddings1_tensor.unsqueeze(0)
            embeddings2_tensor = embeddings2_tensor.unsqueeze(0)
            
        # Normalize along embedding dimension
        embeddings1_tensor = embeddings1_tensor / torch.norm(embeddings1_tensor, dim=1, keepdim=True)
        embeddings2_tensor = embeddings2_tensor / torch.norm(embeddings2_tensor, dim=1, keepdim=True)
        
        # Compute similarity
        tensor_similarities = torch.sum(embeddings1_tensor * embeddings2_tensor, dim=1)
        
        # Convert to numpy
        similarities = tensor_similarities.cpu().numpy()
    else:
        # CPU path
        embeddings1_np = sbert_model.encode(names1, convert_to_numpy=True)
        embeddings2_np = sbert_model.encode(names2, convert_to_numpy=True)
        
        # Handle small sample sizes
        if len(embeddings1_np.shape) == 1:
            embeddings1_np = np.expand_dims(embeddings1_np, axis=0)
            embeddings2_np = np.expand_dims(embeddings2_np, axis=0)
            
        # Normalize
        embeddings1_np = embeddings1_np / np.linalg.norm(embeddings1_np, axis=1, keepdims=True)
        embeddings2_np = embeddings2_np / np.linalg.norm(embeddings2_np, axis=1, keepdims=True)
        
        # Compute similarity
        similarities = np.sum(embeddings1_np * embeddings2_np, axis=1)
        
    return similarities

def sbert_compare_multiple_df(sbert_model, names1, names2, matches, use_gpu=True):
    """Compute similarities and return as DataFrame."""
    similarities = sbert_compare_multiple(sbert_model, names1, names2, use_gpu=use_gpu)
    return pd.DataFrame(
        {"name1": names1, "name2": names2, "similarity": similarities, "match": matches}
    )
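
The batch helpers above normalize each embedding and then take a row-wise dot product. The following NumPy sketch shows that computation on made-up three-dimensional vectors that stand in for model embeddings; the function name `rowwise_cosine` is hypothetical and used only here.

```python
import numpy as np


def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between corresponding rows of a and b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)


# Toy vectors standing in for name embeddings (not real model output)
left = np.array([[1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
right = np.array([[2.0, 4.0, 6.0], [0.0, 1.0, 0.0], [-1.0, -1.0, 0.0]])

# Rows are, in order: parallel, orthogonal, and opposite
similarities = rowwise_cosine(left, right)
print(similarities)
```

Parallel rows score 1, orthogonal rows 0, and opposite rows -1, so cosine similarity spans [-1, 1] rather than [0, 1]. Any fixed threshold, such as the 0.5 used in `compute_sbert_metrics`, should be read with that range in mind.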

## Configuration Parameters

Now, let's set up the configuration parameters for our training run.

In [None]:
# Configure sample size and model training parameters
SAMPLE_FRACTION = 0.01  # Fraction of data to use (set lower for faster testing)
SBERT_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
VARIANT = "original"
OPTIMIZER = "adafactor"
MODEL_SAVE_NAME = (SBERT_MODEL + "-" + VARIANT + "-" + OPTIMIZER).replace("/", "-")
EPOCHS = 6
BATCH_SIZE = 1024
GRADIENT_ACCUMULATION_STEPS = 4
PATIENCE = 2
LEARNING_RATE = 5e-5
SBERT_OUTPUT_FOLDER = f"data/fine-tuned-sbert-{MODEL_SAVE_NAME}"
SAVE_EVAL_STEPS = 100
USE_FP16 = True

# Weights & Biases configuration
WANDB_PROJECT = "eridu"
WANDB_ENTITY = "your_wandb_username"  # Replace with your W&B username

# GPU configuration
USE_GPU = True

# Display the configuration
print("Training configuration:")
print(f"  Model: {SBERT_MODEL}")
print(f"  Sample fraction: {SAMPLE_FRACTION}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {EPOCHS}")
print(f"  FP16: {USE_FP16}")
print(f"  GPU enabled: {USE_GPU}")
print(f"  Output folder: {SBERT_OUTPUT_FOLDER}")
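
Several of these settings interact. The sketch below works through the arithmetic for a hypothetical training set of one million pairs on a single device; that figure is an assumption chosen for illustration, and the step count is an approximation of how the trainer schedules optimizer updates.

```python
import math

# Values copied from the configuration cell above
BATCH_SIZE = 1024
GRADIENT_ACCUMULATION_STEPS = 4
EPOCHS = 6
SAVE_EVAL_STEPS = 100

# Hypothetical number of training pairs, chosen only for illustration
num_train_pairs = 1_000_000

# Gradients from several batches are accumulated before each optimizer step
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
batches_per_epoch = math.ceil(num_train_pairs / BATCH_SIZE)

# Approximation: one optimizer step per GRADIENT_ACCUMULATION_STEPS batches
steps_per_epoch = max(batches_per_epoch // GRADIENT_ACCUMULATION_STEPS, 1)
total_steps = steps_per_epoch * EPOCHS
evaluations = total_steps // SAVE_EVAL_STEPS

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Step-based evaluations: {evaluations}")
```

With a very small `SAMPLE_FRACTION`, the total step count can fall below `SAVE_EVAL_STEPS`, in which case no step-based evaluation or checkpoint would be triggered during training. Lowering `SAVE_EVAL_STEPS` or `BATCH_SIZE` is one way to handle that case.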

## Initialize Weights & Biases

We'll use Weights & Biases for experiment tracking.

In [None]:
# Initialize Weights & Biases
# You need to run wandb.login() first if you haven't already
wandb.init(
    entity=WANDB_ENTITY,
    project=WANDB_PROJECT,
    config={
        "variant": VARIANT,
        "optimizer": OPTIMIZER,
        "epochs": EPOCHS,
        "sample_fraction": SAMPLE_FRACTION,
        "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
        "use_fp16": USE_FP16,
        "batch_size": BATCH_SIZE,
        "patience": PATIENCE,
        "learning_rate": LEARNING_RATE,
        "sbert_model": SBERT_MODEL,
        "model_save_name": MODEL_SAVE_NAME,
        "sbert_output_folder": SBERT_OUTPUT_FOLDER,
        "save_eval_steps": SAVE_EVAL_STEPS,
    },
    save_code=True,
)

## GPU Setup

Check for GPU availability and set the device accordingly.

In [None]:
# Check for CUDA or MPS availability and set the device
if USE_GPU:
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using Apple GPU (Metal) acceleration")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using NVIDIA CUDA GPU acceleration")
    else:
        device = torch.device("cpu")
        print("No GPU available, falling back to CPU")
else:
    device = torch.device("cpu")
    print("GPU disabled, using CPU for training")

print(f"Device for fine-tuning SBERT: {device}")

## Data Loading and Preprocessing

Load the labeled pairs dataset and prepare it for training.

In [None]:
# Load the dataset
dataset = pd.read_parquet("data/pairs-all.parquet")

# Display a sample of the raw data
print("\nRaw training data sample:\n")
display(dataset.sample(n=5))

# Sample the dataset if needed
if SAMPLE_FRACTION < 1.0:
    dataset = dataset.sample(frac=SAMPLE_FRACTION, random_state=RANDOM_SEED)

# Split into train, eval, and test sets
train_df, tmp_df = train_test_split(dataset, test_size=0.2, random_state=RANDOM_SEED, shuffle=True)
eval_df, test_df = train_test_split(tmp_df, test_size=0.5, random_state=RANDOM_SEED, shuffle=True)

print(f"\nTraining data:   {len(train_df):,}")
print(f"Evaluation data: {len(eval_df):,}")
print(f"Test data:       {len(test_df):,}\n")

# Convert to HuggingFace Datasets format
train_dataset = Dataset.from_dict({
    "sentence1": train_df["left_name"].tolist(),
    "sentence2": train_df["right_name"].tolist(),
    "label": train_df["match"].astype(float).tolist(),
})

eval_dataset = Dataset.from_dict({
    "sentence1": eval_df["left_name"].tolist(),
    "sentence2": eval_df["right_name"].tolist(),
    "label": eval_df["match"].astype(float).tolist(),
})

test_dataset = Dataset.from_dict({
    "sentence1": test_df["left_name"].tolist(),
    "sentence2": test_df["right_name"].tolist(),
    "label": test_df["match"].astype(float).tolist(),
})
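
The two-stage split above yields an 80% / 10% / 10% partition. As a rough illustration of those proportions, the sketch below partitions row positions with NumPy alone. It does not reproduce scikit-learn's exact shuffling, and the dataset size is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(31337)
n_rows = 1_000  # hypothetical dataset size

# Shuffle row positions once, then carve out 80% / 10% / 10%
shuffled = rng.permutation(n_rows)
n_train = int(n_rows * 0.8)
n_eval = (n_rows - n_train) // 2

train_idx = shuffled[:n_train]
eval_idx = shuffled[n_train:n_train + n_eval]
test_idx = shuffled[n_train + n_eval:]

print(len(train_idx), len(eval_idx), len(test_idx))
```

The three index sets are disjoint by construction, so no pair appears in more than one split. Note that this only prevents row-level overlap: the same name can still occur in different pairs across splits.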

## Model Setup

Initialize the SentenceTransformer model that we'll fine-tune.

In [None]:
# Initialize the SBERT model
sbert_model = SentenceTransformer(
    SBERT_MODEL,
    device=str(device),
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name=f"{SBERT_MODEL}-address-matcher-{VARIANT}",
    ),
)

# Enable gradient checkpointing to save memory
sbert_model.gradient_checkpointing_enable()

# Put the model in training mode
sbert_model.train()

# Apply quantization if not using fp16
if not USE_FP16:
    # Tell PyTorch to quantize the Linear layers
    for module in sbert_model.modules():
        if isinstance(module, torch.nn.Linear):
            module.qconfig = tq.get_default_qat_qconfig("fbgemm")
    
    # Prepare QAT: inserts FakeQuant and Observer modules
    tq.prepare_qat(sbert_model, inplace=True)

## Testing the Pre-Trained Model

Before fine-tuning, let's test the base model to see how it performs.

In [None]:
print("Testing raw (un-fine-tuned) SBERT model:")
examples = []
examples.append(
    ["John Smith", "John Smith", sbert_compare(sbert_model, "John Smith", "John Smith", use_gpu=True)]
)
examples.append(
    ["John Smith", "John H. Smith", sbert_compare(sbert_model, "John Smith", "John H. Smith", use_gpu=True)]
)
# Russian name example
examples.append(
    ["Yevgeny Prigozhin", "Евгений Пригожин", sbert_compare(sbert_model, "Yevgeny Prigozhin", "Евгений Пригожин", use_gpu=True)]
)
# Chinese name example
examples.append(
    ["Ben Lorica", "罗瑞卡", sbert_compare(sbert_model, "Ben Lorica", "罗瑞卡", use_gpu=True)]
)
examples_df = pd.DataFrame(examples, columns=["sentence1", "sentence2", "similarity"])
display(examples_df)

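# Create a visualization with one bar per name pair
# (using sentence1 alone as the x value would merge the two "John Smith" rows into one bar)
pair_labels = examples_df["sentence1"] + " vs " + examples_df["sentence2"]
plt.figure(figsize=(10, 6))
sns.barplot(x=pair_labels, y=examples_df["similarity"])
plt.title("Pre-trained Model Similarity Scores")
plt.ylabel("Similarity Score")
plt.xlabel("Name Pairs")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()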

## Initial Model Evaluation

Evaluate the pre-trained model on our evaluation dataset.

In [None]:
# Sample a subset of the evaluation data
sample_df = eval_df.sample(frac=0.1, random_state=RANDOM_SEED)
if len(sample_df) < 5 and len(eval_df) >= 5:
    sample_df = eval_df.sample(n=5, random_state=RANDOM_SEED)
print(f"Running initial evaluation on {device} for {len(sample_df):,} sample records")

# Compare the names and measure error
result_df = sbert_compare_multiple_df(
    sbert_model, sample_df["left_name"], sample_df["right_name"], sample_df["match"], use_gpu=True
)
error_s = np.abs(result_df.match.astype(float) - result_df.similarity)
score_diff_s = np.abs(error_s - sample_df.score)

# Calculate statistics
stats_df = pd.DataFrame(
    [
        {"mean": error_s.mean(), "std": error_s.std(), "iqr": iqr(error_s)},
        {"mean": score_diff_s.mean(), "std": score_diff_s.std(), "iqr": iqr(score_diff_s.dropna())},
    ],
    index=["Raw SBERT", "Raw SBERT - Levenshtein Score"],
)
print("\nRaw SBERT model stats:")
display(stats_df)

# Log metrics to W&B
wandb.log({
    "raw_model/error_mean": error_s.mean(),
    "raw_model/error_std": error_s.std(),
    "raw_model/error_iqr": iqr(error_s),
})

# Create dataset for binary classification evaluation
sample_dataset = Dataset.from_dict({
    "sentence1": sample_df["left_name"].tolist(),
    "sentence2": sample_df["right_name"].tolist(),
    "label": sample_df["match"].astype(float).tolist(),
})

# Set up evaluation directory
eval_dir = f"{SBERT_OUTPUT_FOLDER}/eval/binary_classification_evaluation_{SBERT_MODEL.replace('/', '-')}"
os.makedirs(eval_dir, exist_ok=True)
evaluation_name = SBERT_MODEL.replace("/", "-")

# Use SentenceTransformers evaluator
binary_acc_evaluator = BinaryClassificationEvaluator(
    sentences1=sample_dataset["sentence1"],
    sentences2=sample_dataset["sentence2"],
    labels=sample_dataset["label"],
    name=evaluation_name,
)
binary_acc_results = binary_acc_evaluator(sbert_model)
binary_acc_df = pd.DataFrame([binary_acc_results])
display(binary_acc_df)

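# Log binary metrics to W&B
# The evaluator prefixes each key with its name and the similarity function,
# e.g. "<name>_cosine_accuracy"; adjust the prefix if your installed version differs
metric_prefix = f"{evaluation_name}_cosine_"
wandb.log({
    "raw_model/binary_accuracy": binary_acc_results.get(f"{metric_prefix}accuracy", 0.0),
    "raw_model/binary_f1": binary_acc_results.get(f"{metric_prefix}f1", 0.0),
    "raw_model/binary_precision": binary_acc_results.get(f"{metric_prefix}precision", 0.0),
    "raw_model/binary_recall": binary_acc_results.get(f"{metric_prefix}recall", 0.0),
    "raw_model/binary_ap": binary_acc_results.get(f"{metric_prefix}ap", 0.0),
})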
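
The error statistics above measure how far each similarity score lies from its 0/1 label. The following sketch reproduces the same three statistics on a handful of made-up values using NumPy only; the numbers are illustrative and are not model output.

```python
import numpy as np

# Toy labels and similarity scores (illustrative values, not model output)
labels = np.array([1.0, 1.0, 0.0, 0.0])
similarities = np.array([0.9, 0.6, 0.2, 0.5])

# Absolute error between the 0/1 label and the similarity score
errors = np.abs(labels - similarities)

mean_error = errors.mean()
std_error = errors.std(ddof=1)  # ddof=1 matches the pandas Series.std default
q75, q25 = np.percentile(errors, [75, 25])
iqr_error = q75 - q25  # interquartile range of the errors

print(f"mean={mean_error:.3f} std={std_error:.3f} iqr={iqr_error:.3f}")
```

A perfect matcher would score 1.0 on every match and 0.0 on every non-match, giving a mean error of zero. The interquartile range is less sensitive than the standard deviation to a few badly scored pairs.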

## Fine-Tuning Setup

Now we'll set up the training configuration for fine-tuning.

In [None]:
# Set up contrastive loss for training
loss = losses.ContrastiveLoss(model=sbert_model)

# Configure training arguments
sbert_args = SentenceTransformerTrainingArguments(
    output_dir=SBERT_OUTPUT_FOLDER,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=USE_FP16,
    fp16_opt_level="O1" if USE_FP16 else "O0",
    warmup_ratio=0.1,
    run_name=SBERT_MODEL,
    load_best_model_at_end=True,
    save_total_limit=5,
    save_steps=SAVE_EVAL_STEPS,
    eval_steps=SAVE_EVAL_STEPS,
    save_strategy="steps",
    eval_strategy="steps",
    greater_is_better=False,
    metric_for_best_model="eval_loss",
    learning_rate=LEARNING_RATE,
    logging_dir="./logs",
    weight_decay=0.02,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    gradient_checkpointing=True,
    optim=OPTIMIZER,
)

# Initialize the trainer
trainer = SentenceTransformerTrainer(
    model=sbert_model,
    args=sbert_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=binary_acc_evaluator,
    compute_metrics=compute_sbert_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=PATIENCE), WandbCallback()],
)
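
Contrastive loss pulls matching pairs together and pushes non-matching pairs apart until they are at least a margin away. The NumPy sketch below illustrates that idea. It assumes the library's documented defaults of cosine distance and a margin of 0.5, which are worth checking against your installed version, and `contrastive_loss` is a hypothetical helper rather than the library's implementation.

```python
import numpy as np


def contrastive_loss(distances, labels, margin=0.5):
    """Mean contrastive loss over pairs.

    distances: cosine distance (1 - cosine similarity) for each pair
    labels: 1.0 for matching pairs, 0.0 for non-matching pairs
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Matching pairs are penalized by their squared distance
    positive_term = labels * distances**2
    # Non-matching pairs are penalized only while closer than the margin
    negative_term = (1.0 - labels) * np.maximum(margin - distances, 0.0) ** 2
    return float(np.mean(0.5 * (positive_term + negative_term)))


per_pair = [
    contrastive_loss([0.0], [1.0]),  # match, identical embeddings
    contrastive_loss([0.4], [1.0]),  # match, embeddings far apart
    contrastive_loss([0.1], [0.0]),  # non-match, embeddings too close
    contrastive_loss([0.8], [0.0]),  # non-match, already beyond the margin
]
print(per_pair)
```

Only the second and third pairs contribute to the loss: a match that is too far apart and a non-match that is too close. Non-matches already beyond the margin produce no gradient, which keeps training focused on the hard cases.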

## Model Training

Run the fine-tuning process.

In [None]:
# Train the model
# Note: This will take a while - lower SAMPLE_FRACTION for faster training
# For CPU training, use a very small sample (e.g., 0.001)
trainer.train()

print(f"Best model checkpoint path: {trainer.state.best_model_checkpoint}")
display(pd.DataFrame([trainer.evaluate()]))

# Save the final model
trainer.save_model(SBERT_OUTPUT_FOLDER)
print(f"Saved model to {SBERT_OUTPUT_FOLDER}")
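
Early stopping with `PATIENCE = 2` ends training once the evaluation loss fails to improve for two consecutive evaluations. The sketch below is a simplified simulation of that rule, not the callback's actual code; the loss values and the helper name `first_stop_index` are made up for illustration.

```python
def first_stop_index(eval_losses, patience):
    """Return the index of the evaluation at which training would stop,
    or None if patience is never exhausted. Lower loss is better."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best:
            # New best loss: remember it and reset the counter
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical eval_loss values, one per evaluation
history = [0.50, 0.40, 0.41, 0.42, 0.30]
stop_at = first_stop_index(history, patience=2)
print(stop_at)
```

In this made-up history, training stops at the fourth evaluation and never reaches the later loss of 0.30. A larger patience trades extra training time for the chance of finding such late improvements. With `load_best_model_at_end=True`, the checkpoint with the lowest recorded loss is the one that is restored.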

## Testing the Fine-Tuned Model

Let's test our fine-tuned model on the same examples to see if it improved.

In [None]:
print("\nTesting fine-tuned SBERT model:\n")
tuned_examples = []
tuned_examples.append(
    ["John Smith", "John Smith", sbert_compare(sbert_model, "John Smith", "John Smith", use_gpu=True)]
)
tuned_examples.append(
    ["John Smith", "John H. Smith", sbert_compare(sbert_model, "John Smith", "John H. Smith", use_gpu=True)]
)
# Russian name
tuned_examples.append(
    ["Yevgeny Prigozhin", "Евгений Пригожин", sbert_compare(sbert_model, "Yevgeny Prigozhin", "Евгений Пригожин", use_gpu=True)]
)
# Chinese name
tuned_examples.append(
    ["Ben Lorica", "罗瑞卡", sbert_compare(sbert_model, "Ben Lorica", "罗瑞卡", use_gpu=True)]
)
tuned_examples_df = pd.DataFrame(tuned_examples, columns=["sentence1", "sentence2", "similarity"])
display(tuned_examples_df)

# Combine the pre-trained and fine-tuned results for comparison
comparison_df = pd.DataFrame({
    "Name Pair": examples_df["sentence1"] + " vs " + examples_df["sentence2"],
    "Pre-trained": examples_df["similarity"],
    "Fine-tuned": tuned_examples_df["similarity"]
})

# Plot the comparison
# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one behind
comparison_df.plot(x="Name Pair", y=["Pre-trained", "Fine-tuned"], kind="bar", figsize=(12, 6))
plt.title("Pre-trained vs Fine-tuned Model Comparison")
plt.ylabel("Similarity Score")
plt.xticks(rotation=45, ha="right")
plt.ylim(0, 1)
plt.legend(title="Model Type")
plt.tight_layout()
plt.show()

## Final Model Evaluation

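Let's evaluate the fine-tuned model on a sample of the test dataset and determine the F1-optimal similarity threshold. Because the threshold is tuned on the same test sample it is scored on, the reported F1 is optimistic. Choosing the threshold on the evaluation split and applying it to the test split gives a less biased estimate.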

In [None]:
# Sample test data
test_sample_size = max(int(len(test_df) * 0.1), min(len(test_df), 10))
test_sample_df = test_df.sample(n=test_sample_size, random_state=RANDOM_SEED)

# Get ground truth labels
y_true = test_sample_df["match"].astype(float).tolist()

# Run inference with GPU acceleration
print(f"Running inference on {device} for {len(test_sample_df):,} test records")
y_scores = sbert_compare_multiple(
    sbert_model, test_sample_df["left_name"], test_sample_df["right_name"], use_gpu=True
)

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Calculate F1 score for each threshold
f1_scores = [f1_score(y_true, y_scores >= t) for t in thresholds]

# Find the best threshold
best_threshold_index = np.argmax(f1_scores)
best_threshold = thresholds[best_threshold_index]
best_f1_score = f1_scores[best_threshold_index]

print(f"Best Threshold: {best_threshold:.4f}")
print(f"Best F1 Score: {best_f1_score:.4f}")

# Calculate ROC AUC
roc_auc = roc_auc_score(y_true, y_scores)
print(f"AUC-ROC: {roc_auc:.4f}")

# Calculate additional metrics at the best threshold
accuracy = accuracy_score(y_true, y_scores >= best_threshold)
precision_at_best = precision_score(y_true, y_scores >= best_threshold)
recall_at_best = recall_score(y_true, y_scores >= best_threshold)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision_at_best:.4f}")
print(f"Recall: {recall_at_best:.4f}")

# Create a DataFrame for visualization
pr_data = pd.DataFrame({"Precision": precision[:-1], "Recall": recall[:-1], "F1 Score": f1_scores})

# Plot precision-recall curve
plt.figure(figsize=(10, 6))
sns.lineplot(data=pr_data, x="Recall", y="Precision", marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall Curve (ROC AUC = {roc_auc:.4f})")
plt.grid(True, alpha=0.3)
plt.show()

# Plot F1 score vs threshold
plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores)
plt.axvline(x=best_threshold, color="r", linestyle="--", label=f"Best Threshold = {best_threshold:.4f}")
plt.xlabel("Threshold")
plt.ylabel("F1 Score")
plt.title(f"F1 Score vs Threshold (Best F1 = {best_f1_score:.4f})")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Log final metrics to W&B
wandb.log({
    "final/best_threshold": best_threshold,
    "final/best_f1_score": best_f1_score,
    "final/accuracy": accuracy,
    "final/precision": precision_at_best,
    "final/recall": recall_at_best,
    "final/auc": roc_auc,
})
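
One way to keep threshold selection separate from the final report is to pick the threshold on a validation split and then apply it unchanged to the test split. The sketch below does this with NumPy only on toy data; `best_f1_threshold` is a hypothetical helper, and all values are illustrative.

```python
import numpy as np


def best_f1_threshold(y_true, y_scores):
    """Return the (threshold, f1) pair that maximizes F1 over the observed scores."""
    y_true = np.asarray(y_true).astype(bool)
    y_scores = np.asarray(y_scores, dtype=float)
    best_t, best_f1 = None, -1.0
    for t in np.unique(y_scores):
        pred = y_scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), defined as 0 when there are no positives at all
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy validation and test splits (illustrative values only)
val_true, val_scores = [0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]
test_true, test_scores = [0, 1, 1, 0], [0.20, 0.50, 0.90, 0.30]

# Tune on validation, then apply the frozen threshold to the test scores
threshold, val_f1 = best_f1_threshold(val_true, val_scores)
test_pred = np.asarray(test_scores) >= threshold

print(f"threshold={threshold:.2f} validation F1={val_f1:.2f}")
print(test_pred.tolist())
```

In the notebook's terms, this would mean computing the threshold from `eval_df` scores and reporting accuracy, precision, and recall on `test_df` at that fixed value.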

## W&B Visualizations

Log the PR curve to Weights & Biases and finish the run.

In [None]:
# Log the precision-recall curve to W&B
try:
    # Convert to the format W&B expects (probabilities for binary classification)
    y_probs_formatted = np.vstack([1 - y_scores, y_scores]).T
    wandb.log({"final/pr_curve": wandb.plot.pr_curve(y_true, y_probs_formatted)})
except Exception as e:
    print(f"Warning: Could not log PR curve to W&B: {e}")
    # Log individual metrics instead
    wandb.log({"final/y_true": y_true, "final/y_scores": y_scores.tolist()})

# Finish the W&B run
wandb.finish()

## Accessing W&B Visualizations

You can also access your W&B visualizations directly from the notebook.

In [None]:
# Load a previous W&B run
import wandb
from IPython.display import HTML, display

# You need to login first if you haven't already
# wandb.login()

# Initialize the W&B API
api = wandb.Api()

# Replace with your entity and project
entity = WANDB_ENTITY
project = WANDB_PROJECT

# Get the most recent run
try:
    runs = api.runs(f"{entity}/{project}")
    if not runs:
        print("No runs found in the project.")
    else:
        run = runs[0]  # Most recent run
        print(f"Loaded run: {run.name}")

        # Display run summary, skipping W&B's internal keys (those starting with "_")
        print("\nRun Summary:")
        metrics = [k for k in run.summary._json_dict.keys() if not k.startswith("_")]
        metrics_df = pd.DataFrame(
            [(k, run.summary._json_dict[k]) for k in metrics], columns=["Metric", "Value"]
        )
        display(metrics_df)

        # Create a link to the W&B dashboard
        display(HTML(f'<a href="{run.url}" target="_blank">View full results on W&B Dashboard</a>'))
except Exception as e:
    print(f"Error loading W&B run: {e}")

## Conclusion

In this notebook, we've demonstrated how to fine-tune a SentenceTransformer model for multilingual entity resolution. The model can now compare names across languages and character sets, capturing semantic similarities that traditional string matching algorithms miss.

Key takeaways:
1. The pre-trained model already had some cross-lingual capabilities
2. Fine-tuning on domain-specific data significantly improved performance
3. GPU acceleration makes training and inference much faster
4. Weights & Biases provides valuable visualizations and tracking for model performance

To use this model in your own applications, you can load it from the saved directory and use it to compare names:

```python
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Load the model
model_path = "data/fine-tuned-sbert-sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2-original-adafactor"
model = SentenceTransformer(model_path)

# Compare names
name1 = "John Smith"
name2 = "Jon Smith"
embedding1 = model.encode(name1)
embedding2 = model.encode(name2)
similarity = 1 - cosine(embedding1, embedding2)
print(f"Similarity: {similarity:.3f}")
```