# MolEncoder Regression Fine-tuning Tutorial

This notebook shows you how to fine-tune the pretrained *MolEncoder* model for regression tasks on your own molecular data.

## What this notebook does:

1. **Data Loading**: Load your molecular dataset with SMILES strings and target values
2. **Data Preprocessing**: Tokenize SMILES and scale labels for stable training  
3. **Training Duration Selection**: Use cross-validation to find the optimal number of training epochs
4. **Model Training**: Fine-tune MolEncoder on your specific regression task
5. **Prediction & Evaluation**: Make predictions on new molecules and evaluate performance

## Your data requirements:

- **SMILES column**: Molecular structures as SMILES strings (e.g., 'CCO', 'CCOCC')
- **Labels column**: Continuous target values for regression (e.g., solubility, toxicity scores)
- **Format**: CSV, pandas DataFrame, or any format that can be converted to a Hugging Face Dataset
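
As a minimal sketch of the format requirement, the snippet below reads a small CSV and renames its columns to the `smiles` and `labels` names this notebook expects. The column names `compound` and `logS` and the CSV content are made up for illustration; substitute your own file and names.

```python
import csv
import io

# Hypothetical CSV content; in practice you would open your own file instead.
csv_text = """compound,logS
CCO,0.2
CCOCC,0.1
CCC,0.3
"""

# Map your own column names (assumed here) to the names the notebook expects.
column_map = {"compound": "smiles", "logS": "labels"}

columns = {"smiles": [], "labels": []}
for row in csv.DictReader(io.StringIO(csv_text)):
    for source_name, target_name in column_map.items():
        columns[target_name].append(row[source_name])

# Regression targets must be numeric, so convert the label strings to floats.
columns["labels"] = [float(value) for value in columns["labels"]]

print(columns)
```

A dictionary of equal-length columns like this can be passed to `Dataset.from_dict`; alternatively, load the CSV with pandas and use `Dataset.from_pandas`.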

## Getting started:

1. Replace the example **training dataset** with your own data
2. Replace the example **test dataset** with your own test data (towards the end of the notebook)
3. Run all cells - the notebook will guide you through the entire process!

---




First, we'll install all necessary dependencies and import the required libraries for fine-tuning MolEncoder:


In [None]:
# Install all necessary dependencies for the fine-tuning notebook
%pip install torch transformers datasets accelerate schedulefree scikit-learn pandas numpy

In [1]:
import tempfile
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import KFold
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

## Data Loading & Preprocessing

In this section, we'll load your training dataset and prepare it for fine-tuning. **You will need to add your own data loading here.** The data needs to contain SMILES strings and target values for your regression task. The code then:
1) Tokenizes the dataset
2) Scales the labels to be centered around 0 before training (necessary to avoid large gradients at the beginning of training)
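
To make the scaling step concrete, here is a small worked example on five example labels, written with the standard library only. It mirrors the median/IQR scaling performed by the `LabelScaler` class in this notebook; the `inclusive` quantile method is chosen here because it uses the same linear interpolation as NumPy's default percentile.

```python
import statistics

labels = [0.2, 0.1, 0.3, 0.25, 0.15]

median = statistics.median(labels)
# Quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(labels, n=4, method="inclusive")
iqr = q3 - q1

# Forward transform: centre on the median, divide by the IQR.
scaled = [(value - median) / iqr for value in labels]
# Inverse transform: what is applied to model predictions later on.
restored = [value * iqr + median for value in scaled]

print("median:", median, "IQR:", round(iqr, 6))
print("scaled:", [round(value, 3) for value in scaled])
```

The scaled labels are centred on zero with a spread of roughly one, and applying the inverse transform recovers the original values.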


In [2]:
# Load your training dataset (insert your own dataset loading code here)

# e.g. as a pandas DataFrame that is then converted to a Hugging Face dataset
raw_data = pd.DataFrame({
    'smiles': ['CCO', 'CCOCC', 'CCC', 'CCCO', 'CCCN'],
    'labels': [0.2, 0.1, 0.3, 0.25, 0.15]
})
dataset = Dataset.from_pandas(raw_data)


In [3]:
# Make sure your dataset is in the correct format
assert 'smiles' in dataset.column_names, "Dataset must contain 'smiles' column. Use dataset.rename_column('old_name', 'smiles') to rename."
assert 'labels' in dataset.column_names, "Dataset must contain 'labels' column. Use dataset.rename_column('old_name', 'labels') to rename."


In [None]:
# Now we load the tokenizer and tokenize the dataset so the model can understand the SMILES strings
model_name = "fabikru/MolEncoder" 
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["smiles"], truncation=True, max_length=502)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [5]:
class LabelScaler:
    """Scale labels using robust scaling (median and interquartile range (IQR))."""
    
    def __init__(self, dataset: Dataset):
        labels = np.array(dataset['labels'])
        self.median = np.median(labels)
        q1 = np.percentile(labels, 25)
        q3 = np.percentile(labels, 75)
        self.iqr = q3 - q1
        if self.iqr == 0:
            raise ValueError("The interquartile range of the labels is zero, so they cannot be scaled without dividing by zero.")
    
    def scale_labels(self, dataset: Dataset) -> Dataset:
        """Scale the labels using median and IQR."""
        labels = np.array(dataset['labels'])
        scaled_labels = (labels - self.median) / self.iqr
        new_dataset = dataset.remove_columns(['labels'])
        new_dataset = new_dataset.add_column('labels', scaled_labels.tolist())
        return new_dataset
    
    def scale_predictions(self, predictions: List[float]) -> List[float]:
        """Rescale predictions back to original scale."""
        predictions = np.array(predictions)
        rescaled_predictions = predictions * self.iqr + self.median
        return rescaled_predictions.tolist()

In [6]:
# Now scale the labels (important for stable training)
label_scaler = LabelScaler(tokenized_dataset)
tokenized_dataset_scaled = label_scaler.scale_labels(tokenized_dataset)

# Update our dataset variable to use the scaled version
tokenized_dataset = tokenized_dataset_scaled

## Training Duration Selection

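Here we use cross-validation to estimate a suitable number of training epochs for your dataset: each fold trains with early stopping, and the epoch with the lowest validation loss is recorded. Training the final model for roughly that many epochs helps limit overfitting.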
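
The selection logic can be sketched without any training code. The function below replays a list of per-epoch validation losses, records the epoch with the lowest loss, and stops once the loss has failed to improve for `patience` consecutive evaluations, which is a simplified model of early stopping with a patience of 3. The loss values are the fold 1 validation losses printed further down, and the per-fold best epochs are taken from the same output; `replay_early_stopping` is a hypothetical helper, not part of any library.

```python
def replay_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    stop_epoch = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch


fold_1_losses = [1.262935, 1.28349, 1.311018, 1.344887]
best_epoch, stop_epoch = replay_early_stopping(fold_1_losses)
print(f"best epoch: {best_epoch}, training stopped after epoch {stop_epoch}")

# The per-fold best epochs are then averaged and rounded to a whole number.
best_epochs_per_fold = [1, 1, 1, 13, 1]
optimal_epochs = round(sum(best_epochs_per_fold) / len(best_epochs_per_fold))
print(f"epochs for the final run: {optimal_epochs}")
```

Because the result is a mean, a single fold that keeps improving for much longer than the others can shift it noticeably, so it is worth looking at the per-fold values as well.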


In [7]:
class BestEpochTracker(TrainerCallback):
    def __init__(self):
        self.best_eval_loss = float("inf")
        self.best_epoch = None

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is None:
            return control  # Nothing to do if no metrics provided

        current_loss = metrics.get("eval_loss")
        if current_loss is not None and current_loss < self.best_eval_loss:
            self.best_eval_loss = current_loss
            self.best_epoch = metrics.get("epoch")
        return control

In [None]:
def find_optimal_epochs(dataset, model_name, tokenizer, n_splits=5, max_epochs=50):
    """
    Use K-fold cross-validation to find the epoch with the lowest validation loss in each fold and return the rounded mean across folds.
    
    Args:
        dataset: The tokenized dataset with 'input_ids', 'attention_mask', and 'labels'
        model_name: The pretrained model name to fine-tune
        tokenizer: The tokenizer (needed for DataCollator)
        n_splits: Number of cross-validation folds
        max_epochs: Maximum number of epochs to consider
    
    Returns:
        int: The optimal number of epochs for training
    """
    
    
    print(f"Finding optimal epochs using {n_splits}-fold cross-validation...")
    
    data_collator = DataCollatorWithPadding(tokenizer)
    
    # Setup cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    best_epochs = []
    
    # Convert to numpy for KFold splitting
    indices = np.arange(len(dataset))
    
    for fold_num, (train_idx, val_idx) in enumerate(kf.split(indices)):
        print(f"Training fold {fold_num + 1}/{n_splits}")
        
        # Create fold datasets
        train_fold = dataset.select(train_idx.tolist())
        val_fold = dataset.select(val_idx.tolist())
        
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
        
        best_epoch_tracker = BestEpochTracker()
        early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
        
        with tempfile.TemporaryDirectory() as temp_dir:
            # These are the default hyperparameters that we used. Feel free to change them and optimize them further.
            training_args = TrainingArguments(
                output_dir=temp_dir,
                logging_dir=temp_dir,
                num_train_epochs=max_epochs,
                per_device_train_batch_size=32,
                per_device_eval_batch_size=32,
                learning_rate=8e-4,
                weight_decay=1e-5,
                warmup_steps=100,
                optim="schedule_free_adamw",
                lr_scheduler_type="constant",
                adam_beta1=0.9,
                adam_beta2=0.999,
                adam_epsilon=1e-8,
                fp16=True, # Try turning this off if you get some weird errors. Swap this to bf16 if you have a GPU with bf16 support.
                eval_strategy="epoch",
                save_strategy="no",
                max_grad_norm=1.0,
                load_best_model_at_end=False,
                metric_for_best_model="eval_loss",
                greater_is_better=False,
                logging_steps=1,
            )
            
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_fold,
                eval_dataset=val_fold,
                data_collator=data_collator,  
                callbacks=[early_stopping, best_epoch_tracker],
            )
            
            trainer.train()
            best_epochs.append(best_epoch_tracker.best_epoch)
        
        # Clean up model to save memory
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    optimal_epochs = int(np.round(np.mean(best_epochs)))
    print(f"Best epochs per fold: {best_epochs}")
    print(f"Optimal epochs: {optimal_epochs}")
    
    return optimal_epochs


In [9]:
# Find optimal number of epochs using cross-validation
optimal_epochs = find_optimal_epochs(
    dataset=tokenized_dataset, 
    model_name=model_name,
    tokenizer=tokenizer, 
    n_splits=5, 
    max_epochs=50 
)

Finding optimal epochs using 5-fold cross-validation...
Training fold 1/5


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.
Compiling the model with `torch.compile` and using a `torch.mps` device is not supported. Falling back to non-compiled mode.


Epoch,Training Loss,Validation Loss
1,0.3419,1.262935
2,0.337,1.28349
3,0.3293,1.311018
4,0.3189,1.344887


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training fold 2/5


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


Epoch,Training Loss,Validation Loss
1,0.568,0.445132
2,0.5603,0.446407
3,0.5478,0.44915
4,0.5309,0.455296


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training fold 3/5


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


Epoch,Training Loss,Validation Loss
1,0.4845,0.786283
2,0.4745,0.798454
3,0.4586,0.814665
4,0.4377,0.835194


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training fold 4/5


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


Epoch,Training Loss,Validation Loss
1,0.6752,0.014186
2,0.6667,0.012098
3,0.6532,0.009654
4,0.6349,0.00714
5,0.6124,0.004841
6,0.5863,0.002973
7,0.557,0.001638
8,0.5238,0.000816
9,0.4847,0.000393
10,0.4382,0.000201


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training fold 5/5


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


Epoch,Training Loss,Validation Loss
1,0.5404,0.523911
2,0.5357,0.529148
3,0.528,0.535683
4,0.5172,0.542847


Best epochs per fold: [1.0, 1.0, 1.0, 13.0, 1.0]
Optimal epochs: 3


## Final Model Training

Now we'll train the final model using the optimal number of epochs determined above. The model will be saved for future use.
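
Note that the saved model directory holds the model weights and configuration, but not the median and IQR used to scale the labels, which are needed to convert predictions back to the original units. One possible way to keep them, sketched below, is to write them to a small JSON file next to the model. The file name `label_scaler.json` and the two helper functions are assumptions for illustration, not part of the notebook or of any library.

```python
import json
import tempfile
from pathlib import Path


def save_scaler_params(directory, median, iqr):
    """Write the scaling parameters to a JSON file inside `directory`."""
    path = Path(directory) / "label_scaler.json"  # hypothetical file name
    path.write_text(json.dumps({"median": float(median), "iqr": float(iqr)}))
    return path


def load_scaler_params(directory):
    """Read the scaling parameters back as a (median, iqr) tuple."""
    params = json.loads((Path(directory) / "label_scaler.json").read_text())
    return params["median"], params["iqr"]


# Demonstration in a temporary directory with example values.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_scaler_params(tmp_dir, median=0.2, iqr=0.1)
    median, iqr = load_scaler_params(tmp_dir)

scaled_prediction = 0.5  # a model output in the scaled space
print(scaled_prediction * iqr + median)
```

In this notebook you would pass `label_scaler.median` and `label_scaler.iqr` together with the model's `output_dir`.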


In [None]:
def train_final_model(dataset, model_name, tokenizer, epochs, output_dir=Path("./trained_model")):
    """
    Train the final model using the optimal number of epochs.
    
    Args:
        dataset: The tokenized dataset with scaled labels
        model_name: The pretrained model name
        tokenizer: The tokenizer
        epochs: Number of epochs to train
        output_dir: Where to save the trained model
    
    Returns:
        Trained model and trainer for making predictions
    """
    
    print(f"Training final model for {epochs} epochs...")
    
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
    
    data_collator = DataCollatorWithPadding(tokenizer)
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Training arguments for final model
    training_args = TrainingArguments(
        output_dir=output_dir,
        logging_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        learning_rate=8e-4,
        weight_decay=1e-5,
        warmup_steps=100,
        optim="schedule_free_adamw",
        lr_scheduler_type="constant",
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-8,
        fp16=True, # Try turning this off if you get some weird errors. Swap this to bf16 if you have a GPU with bf16 support.
        save_strategy="epoch",
        eval_strategy="no", # we train on all available data
        save_total_limit=1, 
        max_grad_norm=1.0,
        logging_steps=1,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
    )
    
    print("Starting training...")
    trainer.train()
    
    trainer.save_model()
    print(f"Model saved to {output_dir}")
    
    return model, trainer


In [11]:
# Train the final model using optimal epochs

output_dir = Path("./my_finetuned_model")
final_model, final_trainer = train_final_model(
    dataset=tokenized_dataset,
    model_name=model_name,
    tokenizer=tokenizer,
    epochs=optimal_epochs,
    output_dir=output_dir
)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training final model for 3 epochs...
Starting training...


Step,Training Loss
1,0.5433
2,0.5362
3,0.5249


Model saved to my_finetuned_model


## Prediction & Evaluation

Finally, we'll use our trained model to make predictions on your test data and evaluate the performance using standard regression metrics.
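
For reference, the four metrics reported at the end of this section can be computed by hand as in the following sketch. The values are the three example rows printed in the output of this notebook, so the results differ from the metrics over the full five-molecule test set.

```python
import math

actual = [0.200, 0.100, 0.300]
predicted = [0.211, 0.209, 0.212]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

mse = sum(e ** 2 for e in errors) / n           # mean squared error
rmse = math.sqrt(mse)                           # same units as the labels
mae = sum(abs(e) for e in errors) / n           # mean absolute error

# R² compares the model against always predicting the mean label.
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

An R² close to zero means the predictions explain little more of the variance than the mean of the labels would.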


In [12]:
def make_predictions(trainer, tokenized_test_data, label_scaler):
    """
    Make predictions on new data and rescale them back to original scale.
    
    Args:
        trainer: The trained Trainer object
        tokenized_test_data: Tokenized test dataset (without labels)
        label_scaler: The LabelScaler used during training
    
    Returns:
        List of predictions in original scale
    """
    print("Making predictions...")
    
    predictions = trainer.predict(tokenized_test_data)
    scaled_predictions = predictions.predictions.flatten()
    
    # Rescale predictions back to original scale
    original_scale_predictions = label_scaler.scale_predictions(scaled_predictions.tolist())
    
    print(f"Made {len(original_scale_predictions)} predictions")
    return original_scale_predictions

**You need to add your data loading logic here:**


In [None]:
# Load your test dataset (insert your own dataset loading code here).
# The cell works both with and without a 'labels' column.

# Example with labels (prediction + evaluation)

test_data = pd.DataFrame({
    'smiles': ['CCOC', 'CCN', 'CCC', 'CCCO', 'CCCN'],
    'labels': [0.2, 0.1, 0.3, 0.25, 0.15]  # Your ground truth labels. If you don't have labels, just provide the smiles column.
})


test_dataset = Dataset.from_pandas(test_data)

# Check if labels are present
has_labels = 'labels' in test_dataset.column_names

def tokenize_function(examples):
    return tokenizer(examples["smiles"], truncation=True, max_length=502)

tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare dataset for prediction
if has_labels:
    ground_truth = test_data['labels'].tolist()
    prediction_dataset = tokenized_test_dataset.remove_columns(['labels'])
else:
    ground_truth = None
    prediction_dataset = tokenized_test_dataset

predictions = make_predictions(final_trainer, prediction_dataset, label_scaler)

# Show some example predictions
print("\nExample Predictions:")
for i in range(min(3, len(test_data['smiles']))):
    smiles = test_data['smiles'].iloc[i]
    prediction = predictions[i]
    
    print(f"SMILES: {smiles}")
    print(f"Prediction: {prediction:.3f}")
    
    if has_labels:
        actual = ground_truth[i]
        print(f"Actual: {actual:.3f}")

# Calculate and display metrics if labels are available
if has_labels:
    mse = mean_squared_error(ground_truth, predictions)
    mae = mean_absolute_error(ground_truth, predictions) 
    r2 = r2_score(ground_truth, predictions)
    rmse = np.sqrt(mse)
    
    print("\n\nPerformance Metrics:")
    print(f"  MSE:  {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE:  {mae:.4f}")
    print(f"  R²:   {r2:.4f}")

# Create a results DataFrame
results_df = pd.DataFrame({
    'smiles': test_data['smiles'].values,
    'prediction': predictions
})

# Add labels column if available
if has_labels:
    results_df['label'] = ground_truth

# Create results directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)
# Save to CSV
results_df.to_csv(str(output_dir / "predictions.csv"), index=False)
print(f"\nResults saved to {output_dir / 'predictions.csv'}")
print(f"Saved {len(results_df)} predictions")


Making predictions...


Made 5 predictions

Example Predictions:
SMILES: CCOC
Prediction: 0.211
Actual: 0.200
SMILES: CCN
Prediction: 0.209
Actual: 0.100
SMILES: CCC
Prediction: 0.212
Actual: 0.300


Performance Metrics:
  MSE:  0.0049
  RMSE: 0.0702
  MAE:  0.0601
  R²:   0.0136

Results saved to my_finetuned_model/predictions.csv
Saved 5 predictions
