# BOND Reranker Training - Complete Guide

This notebook provides a comprehensive guide to training a cross-encoder reranker for the BOND (Biomedical Ontology Normalization and Disambiguation) system.

## Overview

The BOND reranker is a cross-encoder model that improves ontology normalization accuracy by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. This notebook will walk you through:

1. **Understanding the Dataset Structure** - How training data is formatted
2. **Setup and Configuration** - Installing dependencies and setting paths
3. **Data Loading** - Loading and inspecting training data
4. **Model Initialization** - Setting up the cross-encoder model
5. **Training** - Training the reranker with proper hyperparameters
6. **Evaluation** - Evaluating model performance
7. **Saving and Using** - Saving the trained model and integrating it into BOND

## Why Use a Reranker?

- **Improves Accuracy**: Boosts Hit@10 accuracy from ~75-80% (retrieval only) to ~85-90% (with reranker)
- **Context-Aware**: Learns context-dependent relevance (e.g., "lymphocyte" in tonsil vs. blood)
- **Handles One-to-Many Mappings**: The same author term can map to different ontology IDs depending on context
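
Hit@10 is read here as the fraction of queries whose correct ontology ID appears among the top 10 ranked candidates. A minimal sketch of that metric follows, assuming one `correct_id` per query and a ranked list of candidate IDs. The function name `hit_at_k` and the input layout are illustrative and not part of BOND.

```python
from typing import Sequence


def hit_at_k(ranked_ids: Sequence[Sequence[str]], correct_ids: Sequence[str], k: int = 10) -> float:
    """Fraction of queries whose correct ID is within the top-k ranked candidates."""
    if len(ranked_ids) != len(correct_ids):
        raise ValueError("ranked_ids and correct_ids must have the same length")
    if not correct_ids:
        return 0.0
    hits = sum(1 for ranked, gold_id in zip(ranked_ids, correct_ids) if gold_id in ranked[:k])
    return hits / len(correct_ids)


# Hypothetical rankings for two queries (illustration only)
rankings = [["CL:0000084", "CL:0000236"], ["CL:0000236", "CL:0000623"]]
gold = ["CL:0000084", "CL:0000084"]
print(hit_at_k(rankings, gold, k=10))  # 0.5
```

Computing this once on the retrieval order and once on the reranked order gives a direct before/after comparison.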

## Prerequisites

- Python 3.11+
- GPU recommended (T4, V100, or A100 with 16GB+ VRAM)
- Training data in JSONL format (see dataset structure below)


## Understanding the Dataset Structure

Before training, it's important to understand the format of your training data. The reranker training data should be in **JSONL format** (one JSON object per line).

### Dataset Format

Each line in your training data file should be a JSON object with the following structure:

```json
{
  "query": "cell_type: T-cell; tissue: blood; organism: Homo sapiens",
  "candidate": "label: T cell; synonyms: T lymphocyte | T-lymphocyte | thymocyte",
  "candidate_id": "CL:0000084",
  "correct_id": "CL:0000084",
  "label": 1.0,
  "retrieval_score": 0.85,
  "retrieval_rank": 0,
  "example_type": "positive"
}
```

### Field Descriptions

- **`query`**: The formatted query string containing field type, author term, tissue, and organism
  - Format: `"{field_type}: {author_term}; tissue: {tissue}; organism: {organism}"`
  - Example: `"cell_type: T-cell; tissue: blood; organism: Homo sapiens"`

- **`candidate`**: The formatted candidate ontology term
  - Format: `"label: {label}; synonyms: {syn1} | {syn2} | ...; definition: {definition}"`
  - Example: `"label: T cell; synonyms: T lymphocyte | T-lymphocyte"`

- **`candidate_id`**: The ontology ID (CURIE) of the candidate term
  - Example: `"CL:0000084"` (Cell Ontology ID)

- **`correct_id`**: The correct ontology ID for this query
  - This is the ground truth label

- **`label`**: Binary label (1.0 = positive match, 0.0 = negative)
  - `1.0`: Candidate matches the correct ontology ID
  - `0.0`: Candidate does not match (hard negative or random negative)

- **`retrieval_score`**: Confidence score from initial retrieval (0.0 to 1.0)

- **`retrieval_rank`**: Rank position from initial retrieval (0 = top result)

- **`example_type`**: Type of training example
  - `"positive"`: Correct match that was retrieved
  - `"positive_missed"`: Correct match that wasn't retrieved (added manually)
  - `"hard_negative"`: Retrieved but wrong (hard negative)
  - `"random_negative"`: Same field type but not retrieved (random negative)
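
The field templates above can be sketched as small helpers, together with a consistency check on each record. This is an illustrative sketch rather than BOND's own code: the helper names are hypothetical, and it assumes that the `definition` part is optional and that a record is positive exactly when `candidate_id` equals `correct_id`.

```python
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("query", "candidate", "candidate_id", "correct_id", "label")


def build_query(field_type: str, author_term: str, tissue: str, organism: str) -> str:
    """Format a query string following the template described above."""
    return f"{field_type}: {author_term}; tissue: {tissue}; organism: {organism}"


def build_candidate(label: str, synonyms: List[str], definition: Optional[str] = None) -> str:
    """Format a candidate string; the definition is appended only if given (assumption)."""
    text = f"label: {label}; synonyms: {' | '.join(synonyms)}"
    if definition:
        text += f"; definition: {definition}"
    return text


def validate_example(example: Dict) -> List[str]:
    """Return a list of problems found in one training record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in example]
    if problems:
        return problems
    label = float(example["label"])
    if label not in (0.0, 1.0):
        problems.append("label must be 0.0 or 1.0")
    # Assumption: label is 1.0 exactly when the candidate is the correct term
    is_match = example["candidate_id"] == example["correct_id"]
    if is_match != (label == 1.0):
        problems.append("label disagrees with candidate_id/correct_id")
    return problems


record = {
    "query": build_query("cell_type", "T-cell", "blood", "Homo sapiens"),
    "candidate": build_candidate("T cell", ["T lymphocyte", "T-lymphocyte"]),
    "candidate_id": "CL:0000084",
    "correct_id": "CL:0000084",
    "label": 1.0,
}
print(validate_example(record))  # []
```

Running `validate_example` over a sample of lines before training is a cheap way to catch mislabeled or truncated records.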

### Expected Data Distribution

For a well-balanced training set:
- **Positives**: roughly 5-10% of examples
- **Hard Negatives** and **Random Negatives**: the remaining ~90-95% combined
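
To see how a given file is actually distributed, the share of each `example_type` can be tallied. A minimal sketch, using hypothetical toy records in place of a real `train.jsonl` (the function name is illustrative):

```python
from collections import Counter
from typing import Dict, Iterable


def example_type_shares(examples: Iterable[Dict]) -> Dict[str, float]:
    """Return the fraction of records per example_type."""
    counts = Counter(ex.get("example_type", "unknown") for ex in examples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Hypothetical toy records, for illustration only
toy = (
    [{"example_type": "positive"}] * 1
    + [{"example_type": "hard_negative"}] * 6
    + [{"example_type": "random_negative"}] * 3
)
for name, share in sorted(example_type_shares(toy).items()):
    print(f"{name}: {share:.0%}")
```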

### File Structure

Your training data should be organized as:
```
reranker_training_data/
├── train.jsonl      # Training set (~1.5M examples)
├── dev.jsonl        # Validation set (~90K examples)
└── test.jsonl       # Test set (~85K examples, optional)
```

**Note**: If you don't have training data yet, you can generate it using the `build_reranker_training_data.py` script from the BOND repository.


## Step 1: Installation and Setup

First, let's install the required dependencies and check GPU availability.


In [None]:
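# Install required packages (matches original Colab notebook)
# Uncomment the line below if running in Google Colab or a fresh environment.
# The CrossEncoderTrainer API imported later requires sentence-transformers 4.0 or newer.
# !pip install -q "sentence-transformers>=4.0" accelerate datasets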

import os
import json
import torch
from pathlib import Path
from tqdm import tqdm
from collections import Counter
from typing import Dict, List
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("BOND Reranker Training - Setup")
print("=" * 60)

# Check GPU availability
print("\n>>> Checking GPU availability...")
if torch.cuda.is_available():
    print(f"✓ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"✓ CUDA Version: {torch.version.cuda}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    device = "cuda"
else:
    print("⚠️  No GPU available - training will be very slow!")
    print("   Consider using Google Colab with GPU runtime")
    device = "cpu"

print(f"\n>>> Using device: {device}")


## Step 2: Configuration

**IMPORTANT**: Update the paths below to match your setup!

### Path Configuration

You need to specify:
1. **Training data path**: Path to your `train.jsonl` file
2. **Validation data path**: Path to your `dev.jsonl` file
3. **Output path**: Where to save the trained model
4. **Model name**: Base model to fine-tune (using: `bioformers/bioformer-16L` - same as Colab)

### For Google Colab Users

If you're using Google Colab:
- Upload your `reranker_training_data` folder to Colab
- Or mount Google Drive and point to your data there
- Update paths to `/content/...` or `/content/drive/MyDrive/...`

### For Local Users

If running locally:
- Use absolute paths or paths relative to your working directory
- Example: `/Users/yourname/projects/BOND/reranker_training_data/train.jsonl`
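
If the same notebook is run in several environments, the data directory can be picked automatically instead of edited by hand. A small sketch under that assumption; the helper name `first_existing_dir` is hypothetical and the candidate locations are the example paths mentioned above:

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing_dir(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first directory that contains both train.jsonl and dev.jsonl."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if (path / "train.jsonl").is_file() and (path / "dev.jsonl").is_file():
            return path
    return None


# Candidate locations taken from the examples above; adjust to your setup
data_dir = first_existing_dir([
    "/content/reranker_training_data",
    "/content/drive/MyDrive/BOND/reranker_training_data",
    "./reranker_training_data",
])
print(data_dir if data_dir else "No data directory found - set the paths manually")
```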


In [None]:
# ============================================================================
# CONFIGURATION - UPDATE THESE PATHS FOR YOUR SETUP
# ============================================================================

CONFIG = {
    # ========================================================================
    # DATA PATHS - UPDATE THESE!
    # ========================================================================
    # For Google Colab:
    # 'train_path': '/content/reranker_training_data/train.jsonl',
    # 'val_path': '/content/reranker_training_data/dev.jsonl',
    
    # For Google Drive (mounted):
    # 'train_path': '/content/drive/MyDrive/BOND/reranker_training_data/train.jsonl',
    # 'val_path': '/content/drive/MyDrive/BOND/reranker_training_data/dev.jsonl',
    
    # For local machine (UPDATE THESE PATHS):
    'train_path': '/path/to/your/reranker_training_data/train.jsonl',  # ⚠️ CHANGE THIS
    'val_path': '/path/to/your/reranker_training_data/dev.jsonl',      # ⚠️ CHANGE THIS
    
    # ========================================================================
    # MODEL CONFIGURATION
    # ========================================================================
    'model_name': 'bioformers/bioformer-16L',  # Bioformer model used in Colab
    # This is the same model used in the original Colab training
    
    # Output directory for trained model
    # For Google Colab:
    # 'output_path': '/content/reranker_checkpoints/bond-reranker-v1',
    # For local:
    'output_path': './reranker_checkpoints/bond-reranker-v1',  # ⚠️ CHANGE IF NEEDED
    
    # ========================================================================
    # TRAINING HYPERPARAMETERS
    # ========================================================================
    'epochs': 3,                    # Number of training epochs
    'batch_size': 32,               # Batch size (matches Colab config)
    'learning_rate': 2e-5,         # Learning rate
    'warmup_ratio': 0.1,            # Warmup ratio (10% of training steps)
    'weight_decay': 0.01,          # Weight decay for regularization
    'max_grad_norm': 1.0,          # Gradient clipping
    'seed': 42,                     # Random seed for reproducibility
    'pos_weight': 5.0,             # Weight positives 5x more (for imbalanced data)
    
    # ========================================================================
    # DATA LIMITS (for testing)
    # ========================================================================
    'max_train_examples': None,    # Set to 1000 for quick testing, None for full dataset
    'max_val_examples': 10000,     # Limit validation examples for faster evaluation
}

# Print configuration
print("\n" + "=" * 60)
print("CONFIGURATION")
print("=" * 60)
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
print("=" * 60)

# Verify paths exist
print("\n>>> Checking data files...")
train_exists = os.path.exists(CONFIG['train_path'])
val_exists = os.path.exists(CONFIG['val_path'])

if not train_exists:
    print(f"❌ ERROR: Training data not found at {CONFIG['train_path']}")
    print("\nPlease:")
    print("1. Update CONFIG['train_path'] with the correct path to your train.jsonl file")
    print("2. Make sure the file exists")
else:
    print(f"✓ Training data found: {CONFIG['train_path']}")

if not val_exists:
    print(f"❌ ERROR: Validation data not found at {CONFIG['val_path']}")
    print("\nPlease:")
    print("1. Update CONFIG['val_path'] with the correct path to your dev.jsonl file")
    print("2. Make sure the file exists")
else:
    print(f"✓ Validation data found: {CONFIG['val_path']}")

if train_exists and val_exists:
    print("\n✓ All data files found! Ready to proceed.")
else:
    print("\n⚠️  Please fix the paths above before continuing.")


## Step 3: Data Loading Functions

Let's create functions to load and inspect the training data.


In [None]:
from datasets import Dataset  # imported here because prepare_cross_encoder_dataset uses it

def load_jsonl(file_path: str) -> List[Dict]:
    """Load JSONL file into a list of dictionaries."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    return data

def prepare_cross_encoder_dataset(data: List[Dict]) -> Dataset:
    """
    Convert raw data to cross-encoder format (matches original Colab notebook).

    Each example should have:
    - sentence1: query
    - sentence2: candidate
    - label: binary label (0 or 1)
    """
    prepared_data = {
        'sentence1': [],
        'sentence2': [],
        'label': []
    }

    for item in data:
        prepared_data['sentence1'].append(item['query'])
        prepared_data['sentence2'].append(item['candidate'])
        prepared_data['label'].append(float(item['label']))  # Ensure float for BCE loss

    return Dataset.from_dict(prepared_data)

# Test loading a few examples to verify format
if train_exists and val_exists:
    print("\n>>> Testing data loading (first 3 examples)...")
    test_data = load_jsonl(CONFIG['train_path'])
    if len(test_data) > 3:
        test_data = test_data[:3]
    
    print("\nSample training example:")
    print(json.dumps(test_data[0], indent=2))
    
    test_dataset = prepare_cross_encoder_dataset(test_data)
    print("\nPrepared sample:")
    print(test_dataset[0])
else:
    print("\n⚠️  Skipping data loading test - fix paths first")


## Step 4: Import Required Libraries

Import the sentence-transformers libraries needed for training.


In [None]:
# Import sentence-transformers components
from datasets import Dataset
from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator
from sentence_transformers.cross_encoder.training_args import CrossEncoderTrainingArguments

print("✓ All imports successful!")


## Step 5: Load Training and Validation Data

Now let's load the full training and validation datasets.
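
When only a subset is needed (for example a quick test with `max_train_examples` set), parsing the whole file first is wasteful. A minimal sketch of a loader that stops early follows. The names `iter_jsonl` and `load_jsonl_head` are illustrative additions, not functions from the original notebook.

```python
import json
import os
import tempfile
from itertools import islice
from typing import Dict, Iterator, List, Optional


def iter_jsonl(file_path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def load_jsonl_head(file_path: str, max_examples: Optional[int] = None) -> List[Dict]:
    """Load at most max_examples records; None loads the whole file."""
    return list(islice(iter_jsonl(file_path), max_examples))


# Demo on a temporary file with three records and one blank line
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"label": 1.0}\n\n{"label": 0.0}\n{"label": 0.0}\n')
    head = load_jsonl_head(demo_path, max_examples=2)
    everything = load_jsonl_head(demo_path)
print(len(head), len(everything))  # 2 3
```

Like the slicing used in this notebook, this keeps the first records in file order, so a shuffled file gives a more representative subset.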


In [None]:
# Load datasets (matches original Colab notebook format)
print("\n" + "=" * 60)
print("LOADING DATASETS")
print("=" * 60)

print("Loading datasets...")
train_data = load_jsonl(CONFIG['train_path'])
if CONFIG['max_train_examples']:
    train_data = train_data[:CONFIG['max_train_examples']]

dev_data = load_jsonl(CONFIG['val_path'])
if CONFIG['max_val_examples']:
    dev_data = dev_data[:CONFIG['max_val_examples']]

print(f"Train samples: {len(train_data):,}")
print(f"Dev samples: {len(dev_data):,}")

# Display sample
print("\nSample training example:")
print(json.dumps(train_data[0], indent=2))

# Prepare datasets for cross-encoder
print("\nPreparing datasets for cross-encoder training...")
train_dataset = prepare_cross_encoder_dataset(train_data)
dev_dataset = prepare_cross_encoder_dataset(dev_data)

print(f"Prepared train dataset: {len(train_dataset):,} samples")
print(f"Prepared dev dataset: {len(dev_dataset):,} samples")

# Display prepared sample
print("\nPrepared sample:")
print(dev_dataset[0])

# Analyze label distribution
print("\nLabel distribution:")
train_labels = [float(item['label']) for item in train_data]
label_counts = Counter(train_labels)
for label, count in sorted(label_counts.items()):
    percentage = count / len(train_data) * 100
    label_name = "Positive" if label == 1.0 else "Negative"
    print(f"  {label_name} (label={label}): {count:,} ({percentage:.1f}%)")

print("\n✓ Data loading complete!")


## Step 6: Initialize the Model

Initialize the cross-encoder model. We'll use `bioformers/bioformer-16L` as the base model, which is the same model used in the original Colab training. This is a biomedical domain-specific transformer model optimized for biological text.


In [None]:
print("\n" + "=" * 60)
print("INITIALIZING MODEL")
print("=" * 60)

print(f"\n>>> Loading base model: {CONFIG['model_name']}")
print("    This may take a few minutes on first run (downloading model)...")

# Initialize cross-encoder (matches Colab configuration)
model = CrossEncoder(
    CONFIG['model_name'],
    num_labels=1,        # Binary classification (relevance score)
    max_length=512,     # Maximum sequence length
    device=device       # Use GPU if available
)

print(f"✓ Model loaded on device: {model.device}")
print(f"✓ Model max length: {model.max_length}")
print(f"✓ Model type: Cross-Encoder (binary classification)")

# Calculate model size
try:
    total_params = sum(p.numel() for p in model.model.parameters())
    trainable_params = sum(p.numel() for p in model.model.parameters() if p.requires_grad)
    print(f"✓ Total parameters: {total_params:,}")
    print(f"✓ Trainable parameters: {trainable_params:,}")
except AttributeError:
    # Parameter counts are informational only; skip if the wrapped model is not exposed
    pass


## Step 7: Setup Loss Function and Evaluator

### Loss Function

We use `BinaryCrossEntropyLoss` with class weighting to handle imbalanced data. Since we have many more negative examples (~90-95%) than positive examples (~5-10%), we weight positive examples more heavily.
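
To make the effect of the weighting concrete, here is a small pure-Python sketch of a per-example weighted binary cross-entropy. It assumes the loss follows the convention of PyTorch's `BCEWithLogitsLoss`, where `pos_weight` scales only the positive term. The function `weighted_bce` is illustrative and not part of the training code.

```python
import math


def weighted_bce(logit: float, label: float, pos_weight: float = 1.0) -> float:
    """BCE-with-logits for one example, with the positive term scaled by pos_weight."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


# At logit 0 the model predicts 0.5, so both classes are equally "wrong"
print(round(weighted_bce(0.0, 1.0, pos_weight=5.0), 4))  # 3.4657
print(round(weighted_bce(0.0, 0.0, pos_weight=5.0), 4))  # 0.6931
```

With `pos_weight=5.0`, a missed positive costs five times as much as an equally uncertain negative. At 5-10% positives the negative-to-positive ratio is roughly 9:1 to 19:1, so this weight offsets the imbalance only in part.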

### Evaluator

The evaluator computes metrics (accuracy, F1, precision, recall) on the validation set during training.
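
For reference, these metrics can be computed by hand from predicted probabilities and binary labels. The sketch below uses a fixed threshold of 0.5, which is an assumption: the library's evaluator may choose its thresholds differently, so its numbers need not match exactly.

```python
from typing import Dict, Sequence


def binary_metrics(probs: Sequence[float], labels: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 at a fixed decision threshold."""
    preds = [p >= threshold for p in probs]
    golds = [y >= 0.5 for y in labels]
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    tn = sum(not p and not g for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(golds) if golds else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical scores for five query-candidate pairs
probs = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1.0, 0.0, 1.0, 0.0, 0.0]
print(binary_metrics(probs, labels))
```

On imbalanced data, accuracy can look high even when most positives are missed, which is one reason to select the best model by F1.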


In [None]:
print("\n" + "=" * 60)
print("SETTING UP LOSS FUNCTION AND EVALUATOR")
print("=" * 60)

# Binary Cross Entropy Loss for binary relevance prediction
# Weight positives 5x more (matches original Colab configuration)
pos_weight = torch.tensor([CONFIG['pos_weight']])
loss = BinaryCrossEntropyLoss(model, pos_weight=pos_weight)

print(f"\n>>> Using BinaryCrossEntropyLoss with pos_weight={pos_weight.item()}")
print(f"✓ Loss function initialized")

# Setup evaluator (matches original Colab format)
print(f"\n>>> Creating evaluator...")
evaluator = CEBinaryClassificationEvaluator(
    sentence_pairs=list(zip(dev_dataset['sentence1'], dev_dataset['sentence2'])),
    labels=dev_dataset['label'],
    name='dev'
)

print("✓ Evaluator configured for development set")
print(f"  Validation examples: {len(dev_dataset):,}")
print(f"  Metrics tracked: accuracy, F1, precision, recall")


## Step 8: Configure Training Arguments

Set up training arguments including learning rate, batch size, evaluation strategy, and checkpointing.
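
It can help to work out the schedule these settings imply before starting a long run. The following sketch does the arithmetic for a single GPU without gradient accumulation, assuming about 1.5M training examples as in the file layout above. The helper `training_schedule` is illustrative and the results are approximate.

```python
import math
from typing import Dict


def training_schedule(num_examples: int, batch_size: int, epochs: int,
                      warmup_ratio: float, eval_steps: int) -> Dict[str, int]:
    """Approximate optimizer steps, warmup steps and evaluation count."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "evaluations": total_steps // eval_steps,
    }


schedule = training_schedule(1_500_000, batch_size=32, epochs=3, warmup_ratio=0.1, eval_steps=500)
print(schedule)
```

At these settings, evaluating every 500 steps means a few hundred validation passes over the course of training. Raising `eval_steps` and `save_steps` together is one way to reduce that overhead.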


In [None]:
print("\n" + "=" * 60)
print("CONFIGURING TRAINING ARGUMENTS")
print("=" * 60)

# Create output directory
os.makedirs(CONFIG['output_path'], exist_ok=True)
print(f"\n>>> Output directory: {CONFIG['output_path']}")

# Setup training arguments (matches Colab configuration exactly)
training_args = CrossEncoderTrainingArguments(
    output_dir=CONFIG['output_path'],
    
    # Training hyperparameters
    num_train_epochs=CONFIG['epochs'],
    per_device_train_batch_size=CONFIG['batch_size'],
    per_device_eval_batch_size=CONFIG['batch_size'],
    learning_rate=CONFIG['learning_rate'],
    weight_decay=CONFIG['weight_decay'],
    warmup_ratio=CONFIG['warmup_ratio'],
    
    # Evaluation and saving
    eval_strategy='steps',
    eval_steps=500,
    save_strategy='steps',
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='dev_f1',
    
    # Optimization
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    gradient_accumulation_steps=1,
    max_grad_norm=CONFIG['max_grad_norm'],
    
    # Logging
    logging_dir='./logs',
    logging_steps=100,
    logging_first_step=True,
    report_to='none',  # Change to 'wandb' or 'tensorboard' if needed
    
    # Other settings
    seed=CONFIG['seed'],
    dataloader_drop_last=False,
)

print("\nTraining arguments configured:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Warmup ratio: {training_args.warmup_ratio}")
print(f"  Max grad norm: {training_args.max_grad_norm}")
print(f"  Seed: {training_args.seed}")
print(f"  FP16: {training_args.fp16}")
print(f"  Best model metric: {training_args.metric_for_best_model}")

print("\n✓ Training arguments configured")


## Step 9: Initialize Trainer

Create the trainer object that will handle the training loop.


In [None]:
print("\n" + "=" * 60)
print("INITIALIZING TRAINER")
print("=" * 60)

trainer = CrossEncoderTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    loss=loss,
    evaluator=evaluator
)

print("\n✓ Trainer initialized")
print(f"  Training examples: {len(train_dataset):,}")
print(f"  Validation examples: {len(dev_dataset):,}")
print(f"  Device: {device}")

# Estimate training steps from the dataset that was actually loaded
# (rounded up because the last partial batch is kept)
steps_per_epoch = -(-len(train_dataset) // CONFIG['batch_size'])
total_steps = steps_per_epoch * CONFIG['epochs']

print(f"\nEstimated training steps: ~{total_steps:,}")
print(f"Estimated checkpoints: ~{total_steps // training_args.save_steps}")
print("\n⚠️  Training may take 2-4 hours depending on dataset size and GPU")


## Step 10: Start Training

Now we're ready to train! This will take some time depending on your dataset size and hardware.

**Training Tips:**
- Monitor the loss - it should decrease over time
- Watch validation metrics (F1, accuracy) - they should improve
- If you run out of memory (OOM), reduce `batch_size` in CONFIG
- Training will automatically save checkpoints every 500 steps
- The best model (by F1 score) will be loaded at the end


In [None]:
print("\n" + "=" * 60)
print("STARTING TRAINING")
print("=" * 60)
print(f"Epochs: {CONFIG['epochs']}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Learning rate: {CONFIG['learning_rate']}")
print(f"Device: {device}")
print(f"Output directory: {CONFIG['output_path']}")
print("=" * 60)
print("\nTraining started... This may take several hours.\n")

# Start training
trainer.train()

print("\n" + "=" * 60)
print("TRAINING COMPLETE!")
print("=" * 60)


## Step 11: Save Final Model

Save the final trained model to disk.


In [None]:
print("\n>>> Saving final model...")
trainer.save_model()

print(f"✓ Model saved to: {CONFIG['output_path']}")
print("\nModel files saved:")
print(f"  - config.json (model configuration)")
print(f"  - model.safetensors (model weights)")
print(f"  - tokenizer files (tokenizer.json, vocab.txt, etc.)")


## Step 12: Evaluate Model Performance

Let's evaluate the trained model on the validation set to see final performance metrics.


In [None]:
print("\n" + "=" * 60)
print("EVALUATING MODEL")
print("=" * 60)

# Run evaluation
eval_results = trainer.evaluate()

print("\nFinal Evaluation Results:")
print("=" * 60)
for metric, value in sorted(eval_results.items()):
    if isinstance(value, float):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")
print("=" * 60)

# Expected metrics (evaluator name is 'dev'). The trainer usually reports
# evaluator metrics with an 'eval_' prefix, e.g. eval_dev_f1:
# - dev_accuracy: Overall accuracy
# - dev_f1: F1 score
# - dev_precision: Precision
# - dev_recall: Recall
# - eval_loss: Validation loss


## Step 13: Test the Trained Model

Let's test the trained model on a few example queries to see how it performs.


In [None]:
# Load the trained model
print("\n>>> Loading trained model for testing...")
trained_model = CrossEncoder(CONFIG['output_path'], device=device)
print("✓ Model loaded")

# Example test cases
test_cases = [
    {
        "query": "cell_type: T-cell; tissue: blood; organism: Homo sapiens",
        "candidates": [
            "label: T cell; synonyms: T lymphocyte | T-lymphocyte | thymocyte",
            "label: B cell; synonyms: B lymphocyte | B-lymphocyte",
            "label: NK cell; synonyms: natural killer cell | NK lymphocyte"
        ]
    },
    {
        "query": "tissue: liver; organism: Mus musculus",
        "candidates": [
            "label: liver; synonyms: hepatic organ",
            "label: kidney; synonyms: renal organ",
            "label: lung; synonyms: pulmonary organ"
        ]
    }
]

print("\n" + "=" * 60)
print("TESTING MODEL ON EXAMPLE QUERIES")
print("=" * 60)

for i, test_case in enumerate(test_cases, 1):
    print(f"\nTest Case {i}:")
    print(f"  Query: {test_case['query']}")
    print(f"\n  Candidates:")
    
    # Score all candidates in one batch.
    # For a single-label model (num_labels=1), CrossEncoder.predict() applies a
    # sigmoid by default, so the returned scores are already probabilities in [0, 1].
    pairs = [(test_case['query'], candidate) for candidate in test_case['candidates']]
    probs = trained_model.predict(pairs)
    scores = [(candidate, float(prob)) for candidate, prob in zip(test_case['candidates'], probs)]
    
    # Sort by probability (descending)
    scores.sort(key=lambda x: x[1], reverse=True)
    
    for rank, (candidate, prob) in enumerate(scores, 1):
        print(f"    Rank {rank}: {prob:.4f} - {candidate[:80]}...")
    
print("\n" + "=" * 60)


## Step 14: Using the Trained Model in BOND

Now that you have a trained reranker, here's how to use it in the BOND pipeline.

### Option 1: Update BOND Configuration

Set the reranker path in your BOND settings:

```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

settings = BondSettings(
    reranker_path="/path/to/your/reranker_checkpoints/bond-reranker-v1",  # Your trained model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)
```

### Option 2: Environment Variable

Set the environment variable:

```bash
export BOND_RERANKER_PATH="/path/to/your/reranker_checkpoints/bond-reranker-v1"
export BOND_ENABLE_RERANKER=1
```

### Option 3: Direct Usage

You can also use the reranker directly:

```python
from sentence_transformers import CrossEncoder
import torch

# Load your trained model
model = CrossEncoder(
    "/path/to/your/reranker_checkpoints/bond-reranker-v1",
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Score query-candidate pairs
query = "cell_type: T-cell; tissue: blood; organism: Homo sapiens"
candidates = [
    "label: T cell; synonyms: T lymphocyte",
    "label: B cell; synonyms: B lymphocyte"
]

# predict() applies a sigmoid by default for single-label models,
# so the returned scores are already probabilities in [0, 1]
scores = model.predict([(query, c) for c in candidates])
probabilities = [float(s) for s in scores]

# Rank by probability
ranked = sorted(zip(candidates, probabilities), key=lambda x: x[1], reverse=True)
for candidate, prob in ranked:
    print(f"{prob:.4f}: {candidate}")
```
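
In a pipeline, the scores usually need to be tied back to ontology IDs and cut to a top-k list. A minimal sketch of such a wrapper, with the scorer injected as a function so it can be tried without a trained model. The names `rerank` and `toy_score_fn` are hypothetical; in real use `score_fn` would be `model.predict`.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(query: str,
           candidates: Sequence[Tuple[str, str]],
           score_fn: Callable[[List[Pair]], Sequence[float]],
           top_k: int = 10) -> List[Tuple[str, float]]:
    """Score (candidate_id, candidate_text) tuples and return the top_k IDs with scores."""
    if not candidates:
        return []
    pairs = [(query, text) for _, text in candidates]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip((cid for cid, _ in candidates), scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]


def toy_score_fn(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer for illustration: number of shared lowercase tokens."""
    return [float(len(set(q.lower().split()) & set(c.lower().split()))) for q, c in pairs]


ranked = rerank("T cell blood", [("CL:0000236", "B cell"), ("CL:0000084", "T cell")], toy_score_fn)
print(ranked)  # [('CL:0000084', 2.0), ('CL:0000236', 1.0)]
```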


## Troubleshooting

### Out of Memory (OOM) Errors

**Problem**: Training crashes with CUDA out of memory error.

**Solutions**:
1. Reduce batch size in CONFIG: `'batch_size': 4` or `'batch_size': 2`
2. Reduce max sequence length: Change `max_length=512` to `max_length=256` in model initialization
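3. Use gradient accumulation: lower `batch_size` and raise `gradient_accumulation_steps` in the training arguments (set to `1` above) to keep the effective batch size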
4. Use a smaller base model: `'model_name': 'cross-encoder/ms-marco-MiniLM-L-6-v2'`
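
When the batch size is reduced to fit in memory, gradient accumulation can keep the effective batch size close to the original 32. A small sketch of the arithmetic (the helper name `accumulation_steps` is illustrative):

```python
import math


def accumulation_steps(target_batch_size: int, per_device_batch_size: int) -> int:
    """Number of micro-batches to accumulate to reach at least the target batch size."""
    if target_batch_size <= 0 or per_device_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_batch_size / per_device_batch_size)


for micro_batch in (32, 8, 4, 2):
    steps = accumulation_steps(32, micro_batch)
    print(f"batch_size={micro_batch}, gradient_accumulation_steps={steps}, "
          f"effective batch size={micro_batch * steps}")
```

The resulting value would go into `gradient_accumulation_steps` in the training arguments, alongside the smaller `batch_size`.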

### Training Too Slow

**Problem**: Training is taking too long.

**Solutions**:
1. Use a smaller base model
2. Limit training examples: `'max_train_examples': 100000` for testing
3. Ensure GPU is being used (check device output)
4. Use mixed precision (FP16) - already enabled if GPU available

### Poor Performance

**Problem**: Model accuracy/F1 is low.

**Solutions**:
1. Check data quality - ensure labels are correct
2. Verify data format matches expected structure
3. Increase training epochs: `'epochs': 5`
4. Adjust learning rate: Try `'learning_rate': 1e-5` or `'learning_rate': 3e-5`
5. Check class balance - should have ~5-10% positives

### File Not Found Errors

**Problem**: Cannot find training data files.

**Solutions**:
1. Verify paths in CONFIG are correct
2. Use absolute paths instead of relative paths
3. Check file permissions
4. For Colab: Make sure files are uploaded or Drive is mounted

### Model Not Improving

**Problem**: Validation metrics not improving during training.

**Solutions**:
1. Check if learning rate is too high/low
2. Verify data is being loaded correctly
3. Check if model is actually training (loss should decrease)
4. Try different base model
5. Increase warmup: raise `'warmup_ratio'` in CONFIG (e.g. `0.2`), or pass `warmup_steps=2000` to `CrossEncoderTrainingArguments`


## Next Steps

1. **Evaluate on Test Set**: If you have a test set, evaluate the model on it
2. **Fine-tune Hyperparameters**: Experiment with different learning rates, batch sizes, etc.
3. **Train Field-Specific Models**: Consider training separate rerankers for different field types (cell_type, tissue, disease, etc.)
4. **Upload to Hugging Face**: Share your trained model on Hugging Face Hub
5. **Integrate into BOND**: Use the trained model in your BOND pipeline for improved accuracy
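
For field-specific models, the training records first need to be split by field type. Since each query starts with its field type, a minimal sketch could group records on that prefix. The helper names are hypothetical, and the sketch assumes every query follows the `"{field_type}: ..."` format described earlier.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def field_type_of(query: str) -> str:
    """Extract the leading field type from a query such as 'cell_type: T-cell; ...'."""
    return query.split(":", 1)[0].strip()


def split_by_field_type(examples: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Group training records by the field type encoded in their query."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for example in examples:
        groups[field_type_of(example["query"])].append(example)
    return dict(groups)


examples = [
    {"query": "cell_type: T-cell; tissue: blood; organism: Homo sapiens"},
    {"query": "tissue: liver; organism: Mus musculus"},
]
groups = split_by_field_type(examples)
print({name: len(items) for name, items in groups.items()})  # {'cell_type': 1, 'tissue': 1}
```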

## Summary

You've successfully trained a cross-encoder reranker for BOND! The model should improve ontology normalization accuracy by 10-15% compared to retrieval-only approaches (Hit@10 rising from ~75-80% to ~85-90%).

**Key Takeaways**:
- Training data format: JSONL with query, candidate, and label fields
- Model: Cross-encoder (bioformers/bioformer-16L - same as Colab)
- Loss: Binary cross-entropy with pos_weight=5.0 for imbalanced data
- Evaluation: F1 score (dev_f1) used to select best model
- Output: Trained model saved to specified directory
- Configuration: Matches original Colab training exactly (batch_size=32, warmup_ratio=0.1, weight_decay=0.01, etc.)

For more information, see:
- [BOND Reranker Training Guide](../RERANKER_TRAINING_GUIDE.md)
- [BOND Documentation](../README.md)
