# MolEncoder Classification Fine-tuning Tutorial

This notebook shows you how to fine-tune the pretrained *MolEncoder* model for classification tasks on your own molecular data.

## What this notebook does:

1. **Data Loading**: Load your molecular dataset with SMILES strings and class labels
2. **Data Preprocessing**: Tokenize SMILES for model input  
3. **Training Duration Selection**: Use cross-validation to find the optimal number of training epochs
4. **Model Training**: Fine-tune MolEncoder on your specific classification task
5. **Prediction & Evaluation**: Make predictions on new molecules and evaluate performance

## Your data requirements:

- **SMILES column**: Molecular structures as SMILES strings (e.g., 'CCO', 'CCOCC')
- **Labels column**: Integer class labels for classification (e.g., 0, 1, 2, 3... for any number of classes)
- **Format**: CSV, pandas DataFrame, or any format that can be converted to a Hugging Face Dataset

**Note**: This notebook supports both binary classification (2 classes) and multi-class classification (3 or more classes).
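
As a minimal sketch of the expected layout, the snippet below parses a small in-memory CSV into the two columns the notebook needs. The column names `structure` and `activity_class`, the helper `load_smiles_csv`, and the sample rows are hypothetical. The resulting dictionary of lists is the shape that `datasets.Dataset.from_dict` accepts.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from your own file.
csv_text = """structure,activity_class
CCO,0
CCOCC,1
CCCO,2
"""

def load_smiles_csv(handle, smiles_col, label_col):
    """Read a CSV and return {'smiles': [...], 'labels': [...]} with integer labels."""
    columns = {"smiles": [], "labels": []}
    for row in csv.DictReader(handle):
        columns["smiles"].append(row[smiles_col].strip())
        columns["labels"].append(int(row[label_col]))
    return columns

columns = load_smiles_csv(io.StringIO(csv_text), "structure", "activity_class")
print(columns)
# The dictionary can then be wrapped with Dataset.from_dict(columns).
```

Converting the labels with `int(...)` fails early on non-numeric labels, which is usually preferable to a confusing error during training.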

## Getting started:

1. Replace the example **training dataset** with your own data
2. Replace the example **test dataset** with your own test data (towards the end of the notebook)
3. Run all cells - the notebook will guide you through the entire process!

---



First, we'll install all necessary dependencies and import the required libraries for fine-tuning MolEncoder:


In [None]:
# Install all necessary dependencies for the fine-tuning notebook
%pip install torch transformers datasets accelerate schedulefree scikit-learn pandas numpy


In [1]:
import tempfile
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
from sklearn.model_selection import KFold
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)


## Data Loading & Preprocessing

In this section, we'll load your training dataset and prepare it for fine-tuning. **You will need to add your own data loading here.** The data must contain SMILES strings and integer class labels for your classification task. The code then tokenizes the dataset so it can be fed to the model.


In [2]:
# Load your training dataset (insert your own dataset loading code here)

# e.g. load it as a pandas DataFrame and then convert it to a Hugging Face Dataset
# Multi-class classification example (0 = low activity, 1 = medium activity, 2 = high activity)
raw_data = pd.DataFrame({
    'smiles': ['CCO', 'CCOCC', 'CCC', 'CCCO', 'CCCN', 'CCCC', 'CCCCC', 'CCCCCC', 'CC', 'CCCCCCCC'],
    'labels': [0, 1, 0, 2, 1, 0, 2, 1, 0, 2]  # Multi-class labels: 0, 1, 2
})
dataset = Dataset.from_pandas(raw_data)


In [3]:
# Make sure your dataset is in the correct format
assert 'smiles' in dataset.column_names, "Dataset must contain 'smiles' column. Use dataset.rename_column('old_name', 'smiles') to rename."
assert 'labels' in dataset.column_names, "Dataset must contain 'labels' column. Use dataset.rename_column('old_name', 'labels') to rename."

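# Check the number of unique classes
classes = sorted(set(dataset['labels']))
num_classes = len(classes)
# The classification head expects labels 0..num_classes-1, so gaps or a non-zero start would break training
assert classes == list(range(num_classes)), "Labels must be consecutive integers starting at 0."
print(f"Number of classes: {num_classes}")
print(f"Classes: {classes}")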


Number of classes: 3
Classes: [0, 1, 2]
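
The cell above counts distinct labels, but a classification head created with `num_labels=K` expects labels in the range 0 to K-1. If your labels are strings or non-contiguous integers, one option is to remap them first. The sketch below shows one way to do that in plain Python; `encode_labels` and the example label values are hypothetical.

```python
def encode_labels(raw_labels):
    """Map arbitrary hashable labels to contiguous integers 0..K-1.

    Returns the encoded labels and the mapping so predictions can be decoded later.
    """
    classes = sorted(set(raw_labels))
    label_to_id = {label: idx for idx, label in enumerate(classes)}
    encoded = [label_to_id[label] for label in raw_labels]
    return encoded, label_to_id

raw = ["inactive", "active", "inactive", "moderate", "active"]
encoded, label_to_id = encode_labels(raw)
# Inverse mapping, useful for turning predicted ids back into names.
id_to_label = {idx: label for label, idx in label_to_id.items()}
print(encoded)
print(label_to_id)
```

Keeping `id_to_label` around makes it possible to report predictions in the original label vocabulary.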


In [None]:
# Now we load the tokenizer and tokenize the dataset so the model can understand the SMILES strings
model_name = "fabikru/MolEncoder" 
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["smiles"], truncation=True, max_length=502)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
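
Tokenized SMILES have different lengths, so each batch has to be padded to a common length before it reaches the model. In this notebook `DataCollatorWithPadding` takes care of that during training. The sketch below illustrates the idea with a toy character-level vocabulary. It is not MolEncoder's tokenizer, and `toy_tokenize`, `pad_batch`, and the padding id 0 are assumptions made for illustration.

```python
def toy_tokenize(smiles, vocab, max_length=502):
    """Map each character to an id and truncate to max_length."""
    return [vocab[ch] for ch in smiles][:max_length]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

vocab = {"C": 1, "O": 2, "N": 3}  # hypothetical vocabulary; id 0 is reserved for padding
batch = pad_batch([toy_tokenize(s, vocab) for s in ["CCO", "CCOCC", "CCCN"]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short molecules cheap to process.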



## Training Duration Selection

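Here we use cross-validation to estimate a suitable number of training epochs for your dataset. Each fold trains with early stopping, and the epoch with the lowest validation loss is recorded. The rounded mean across folds is then used for the final training run, which reduces the risk of overfitting.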
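
In k-fold cross-validation the data is split into k parts, and each part serves once as the validation set while the remaining parts are used for training. The notebook uses scikit-learn's `KFold` for this. The sketch below is a simplified stand-in written with the standard library only, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples as evenly as possible over the folds.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=5))
for fold_num, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_num}: {len(train_idx)} train, {len(val_idx)} validation")
```

With 10 samples and 5 folds, every fold trains on 8 molecules and validates on 2, so validation losses on such a small dataset are very noisy.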


In [5]:
class BestEpochTracker(TrainerCallback):
    def __init__(self):
        self.best_eval_loss = float("inf")
        self.best_epoch = None

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is None:
            return control  # Nothing to do if no metrics provided

        current_loss = metrics.get("eval_loss")
        if current_loss is not None and current_loss < self.best_eval_loss:
            self.best_eval_loss = current_loss
            self.best_epoch = metrics.get("epoch")
        return control
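
To see how tracking the best epoch interacts with early stopping, the sketch below replays a validation-loss curve in plain Python. It is a simplified simulation, not the `transformers` implementation: `simulate_early_stopping` is a hypothetical helper, and it assumes training stops once the loss has failed to improve for `patience` consecutive evaluations.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(val_losses)  # ran through all epochs

# Hypothetical loss curve: improves until epoch 3, then gets worse.
losses = [0.90, 0.80, 0.75, 0.76, 0.78, 0.81, 0.85]
best_epoch, stopped_epoch = simulate_early_stopping(losses, patience=3)
print(best_epoch, stopped_epoch)
```

In this example the best epoch is 3 and training stops after epoch 6, so the remaining epoch budget is never spent.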


In [None]:
def find_optimal_epochs(dataset, model_name, tokenizer, num_classes, n_splits=5, max_epochs=50):
    """
    Use K-fold cross-validation to determine the number of training epochs that gives the lowest validation loss.
    
    Args:
        dataset: The tokenized dataset with 'input_ids', 'attention_mask', and 'labels'
        model_name: The pretrained model name to fine-tune
        tokenizer: The tokenizer (needed for DataCollator)
        num_classes: Number of classes for classification
        n_splits: Number of cross-validation folds
        max_epochs: Maximum number of epochs to consider
    
    Returns:
        int: The optimal number of epochs for training
    """
    
    
    print(f"Finding optimal epochs using {n_splits}-fold cross-validation...")
    
    data_collator = DataCollatorWithPadding(tokenizer)
    
    # Setup cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    best_epochs = []
    
    # Convert to numpy for KFold splitting
    indices = np.arange(len(dataset))
    
    for fold_num, (train_idx, val_idx) in enumerate(kf.split(indices)):
        print(f"Training fold {fold_num + 1}/{n_splits}")
        
        # Create fold datasets
        train_fold = dataset.select(train_idx.tolist())
        val_fold = dataset.select(val_idx.tolist())
        
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
        
        best_epoch_tracker = BestEpochTracker()
        early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
        
        with tempfile.TemporaryDirectory() as temp_dir:
            # These are the default hyperparameters that we used. Feel free to change them and optimize them further.
            training_args = TrainingArguments(
                output_dir=temp_dir,
                logging_dir=temp_dir,
                num_train_epochs=max_epochs,
                per_device_train_batch_size=32,
                per_device_eval_batch_size=32,
                learning_rate=8e-4,
                weight_decay=1e-5,
                warmup_steps=100,
                optim="schedule_free_adamw",
                lr_scheduler_type="constant",
                adam_beta1=0.9,
                adam_beta2=0.999,
                adam_epsilon=1e-8,
                fp16=True,  # Disable this if you run into errors (e.g. on hardware without fp16 support). Use bf16=True instead if your GPU supports bf16.
                eval_strategy="epoch",
                save_strategy="no",
                max_grad_norm=1.0,
                load_best_model_at_end=False,
                metric_for_best_model="eval_loss",
                greater_is_better=False,
                logging_steps=1,
            )
            
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_fold,
                eval_dataset=val_fold,
                data_collator=data_collator,  
                callbacks=[early_stopping, best_epoch_tracker],
            )
            
            trainer.train()
            best_epochs.append(best_epoch_tracker.best_epoch)
        
        # Clean up model to save memory
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    
    optimal_epochs = int(np.round(np.mean(best_epochs)))
    print(f"Best epochs per fold: {best_epochs}")
    print(f"Optimal epochs: {optimal_epochs}")
    
    return optimal_epochs
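
The training arguments above hard-code `fp16=True`. One way to avoid editing that flag by hand is to derive it from the available hardware. The helper below is a hypothetical sketch: it takes two booleans so it can be read without a GPU, and the comments show how they might be filled in with `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`. The preference order (bf16, then fp16, then full precision) is an assumption, not something the notebook prescribes.

```python
def pick_precision_flags(has_cuda, supports_bf16):
    """Return keyword arguments for mixed-precision training.

    Assumption: bf16 is preferred when supported, fp16 otherwise on CUDA,
    and full precision when no CUDA GPU is present.
    """
    if has_cuda and supports_bf16:
        return {"bf16": True, "fp16": False}
    if has_cuda:
        return {"bf16": False, "fp16": True}
    return {"bf16": False, "fp16": False}

# In the notebook these could be obtained as:
#   has_cuda = torch.cuda.is_available()
#   supports_bf16 = has_cuda and torch.cuda.is_bf16_supported()
flags = pick_precision_flags(has_cuda=True, supports_bf16=False)
print(flags)
# The result could then be unpacked into TrainingArguments(..., **flags).
```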


In [9]:
# Find optimal number of epochs using cross-validation
optimal_epochs = find_optimal_epochs(
    dataset=tokenized_dataset, 
    model_name=model_name,
    tokenizer=tokenizer,
    num_classes=num_classes,
    n_splits=5, 
    max_epochs=50 
)


Finding optimal epochs using 5-fold cross-validation...
Training fold 1/5


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.
Compiling the model with `torch.compile` and using a `torch.mps` device is not supported. Falling back to non-compiled mode.


Epoch,Training Loss,Validation Loss
1,1.1157,1.27344
2,1.1133,1.27442
3,1.1092,1.276269
4,1.1034,1.279199


Training fold 2/5


Epoch,Training Loss,Validation Loss
1,1.1123,1.127296
2,1.1088,1.130762
3,1.1031,1.135065
4,1.0953,1.139908


Training fold 3/5


Epoch,Training Loss,Validation Loss
1,1.0762,1.269171
2,1.0731,1.268849
3,1.0681,1.268488
4,1.0611,1.26816
5,1.0521,1.267976
6,1.041,1.268084
7,1.0277,1.26861
8,1.0117,1.269581


Training fold 4/5


Epoch,Training Loss,Validation Loss
1,1.1725,0.950231
2,1.1692,0.954692
3,1.1639,0.960663
4,1.1566,0.967963


Training fold 5/5


Epoch,Training Loss,Validation Loss
1,1.172,0.890116
2,1.1678,0.896568
3,1.1609,0.905498
4,1.1513,0.917317


Best epochs per fold: [1.0, 1.0, 5.0, 1.0, 1.0]
Optimal epochs: 2
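
The final number is the rounded mean of the per-fold best epochs. The sketch below recomputes it from the validation losses printed above for folds 1 and 3 and from the reported per-fold results. It uses only the standard library, whereas the notebook uses `np.round(np.mean(...))`; `best_epoch` is a hypothetical helper.

```python
from statistics import mean

# Validation losses copied from the fold 1 and fold 3 logs above.
fold_1 = [1.27344, 1.27442, 1.276269, 1.279199]
fold_3 = [1.269171, 1.268849, 1.268488, 1.26816, 1.267976, 1.268084, 1.26861, 1.269581]

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

print(best_epoch(fold_1), best_epoch(fold_3))

# Best epochs per fold as reported by the run above.
best_epochs = [1, 1, 5, 1, 1]
optimal_epochs = round(mean(best_epochs))
print(mean(best_epochs), optimal_epochs)
```

The mean is 1.8, which rounds to 2. With only five folds, a single outlier such as fold 3 noticeably shifts the result: without it, the mean would be 1.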


## Final Model Training

Now we'll train the final model using the optimal number of epochs determined above. The model will be saved for future use.


In [None]:
def train_final_model(dataset, model_name, tokenizer, num_classes, epochs, output_dir=Path("./trained_model")):
    """
    Train the final model using the optimal number of epochs.
    
    Args:
        dataset: The tokenized dataset
        model_name: The pretrained model name
        tokenizer: The tokenizer
        num_classes: Number of classes for classification
        epochs: Number of epochs to train
        output_dir: Where to save the trained model
    
    Returns:
        Trained model and trainer for making predictions
    """
    
    print(f"Training final model for {epochs} epochs...")
    
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    
    data_collator = DataCollatorWithPadding(tokenizer)
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    # Training arguments for final model
    training_args = TrainingArguments(
        output_dir=output_dir,
        logging_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        learning_rate=8e-4,
        weight_decay=1e-5,
        warmup_steps=100,
        optim="schedule_free_adamw",
        lr_scheduler_type="constant",
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-8,
        fp16=True,  # Disable this if you run into errors (e.g. on hardware without fp16 support). Use bf16=True instead if your GPU supports bf16.
        save_strategy="epoch",
        eval_strategy="no", # we train on all available data
        save_total_limit=1, 
        max_grad_norm=1.0,
        logging_steps=1,
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
    )
    
    print("Starting training...")
    trainer.train()
    
    trainer.save_model()
    print(f"Model saved to {output_dir}")
    
    return model, trainer


In [11]:
# Train the final model using optimal epochs

output_dir = Path("./my_finetuned_classification_model")
final_model, final_trainer = train_final_model(
    dataset=tokenized_dataset,
    model_name=model_name,
    tokenizer=tokenizer,
    num_classes=num_classes,
    epochs=optimal_epochs,
    output_dir=output_dir
)


Training final model for 2 epochs...


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at fabikru/MolEncoder and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training...


Step,Training Loss
1,1.1148
2,1.1121


Model saved to my_finetuned_classification_model


## Prediction & Evaluation

Finally, we'll use our trained model to make predictions on your test data and evaluate the performance using standard classification metrics.
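
As a sketch of what the support-weighted metrics used in this section measure, the snippet below computes them by hand for a hypothetical degenerate classifier that always predicts class 2. `weighted_metrics` is an illustrative helper rather than part of scikit-learn, and it treats undefined ratios as 0, which mirrors the `zero_division=0` setting used later.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1; undefined ratios count as 0."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls, n_true in support.items():
        true_pos = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Each class contributes in proportion to its share of the true labels.
        weight = n_true / len(y_true)
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

y_true = [1, 0, 2, 1, 0]
y_pred = [2, 2, 2, 2, 2]  # hypothetical classifier that always predicts class 2
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, weighted_metrics(y_true, y_pred))
```

Support-weighted recall always equals accuracy, so the two numbers coincide. Precision and F1 drop much further when a model collapses onto a single class.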


In [12]:
def make_predictions(trainer, tokenized_test_data):
    """
    Make predictions on new data.
    
    Args:
        trainer: The trained Trainer object
        tokenized_test_data: Tokenized test dataset (without labels)
    
    Returns:
        Tuple of (predicted_classes, prediction_probabilities)
    """
    print("Making predictions...")
    
    predictions = trainer.predict(tokenized_test_data)
    
    # For classification, get the predicted class (argmax of logits)
    predicted_classes = np.argmax(predictions.predictions, axis=1)
    
    # Convert logits to probabilities using softmax
    prediction_probs = torch.softmax(torch.tensor(predictions.predictions), dim=1).numpy()
    
    print(f"Made {len(predicted_classes)} predictions")
    return predicted_classes, prediction_probs
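
The conversion from logits to probabilities and classes can be sketched with NumPy alone. The logits below are hypothetical, and `softmax` is a hand-written helper that stands in for the `torch.softmax` call used above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for two molecules and three classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
probs = softmax(logits)
predicted = probs.argmax(axis=1)
print(probs.round(3))
print(predicted)
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same classes as taking the argmax of the raw logits.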


**You need to add your data loading logic here:**


In [None]:
# Load your test dataset (insert your own dataset loading code here). This works both with and without labels.

# Option 1: With labels (prediction + evaluation)

test_data = pd.DataFrame({
    'smiles': ['CCOC', 'CCN', 'CCCCO', 'CCCCN', 'CCCCCCC'],
    'labels': [1, 0, 2, 1, 0]  # Ground-truth class labels (same classes as in training). If you have no labels, provide only the 'smiles' column.
})


test_dataset = Dataset.from_pandas(test_data)

# Check if labels are present
has_labels = 'labels' in test_dataset.column_names

def tokenize_function(examples):
    return tokenizer(examples["smiles"], truncation=True, max_length=502)

tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare dataset for prediction
if has_labels:
    ground_truth = test_data['labels'].tolist()
    prediction_dataset = tokenized_test_dataset.remove_columns(['labels'])
else:
    ground_truth = None
    prediction_dataset = tokenized_test_dataset

predicted_classes, prediction_probs = make_predictions(final_trainer, prediction_dataset)

# Show some example predictions
print("\nExample Predictions:")
for i in range(min(3, len(test_data['smiles']))):
    smiles = test_data['smiles'].iloc[i]
    prediction = predicted_classes[i]
    
    print(f"SMILES: {smiles}")
    print(f"Predicted Class: {prediction}")
    
    # Show probabilities for each class
    for class_idx in range(num_classes):
        prob = prediction_probs[i][class_idx]
        print(f"  Class {class_idx} probability: {prob:.3f}")
    
    if has_labels:
        actual = ground_truth[i]
        print(f"Actual Class: {actual}")
    print()

# Calculate and display metrics if labels are available
if has_labels:
    accuracy = accuracy_score(ground_truth, predicted_classes)
    precision, recall, f1, support = precision_recall_fscore_support(ground_truth, predicted_classes, average='weighted', zero_division=0)
    
    print("\nPerformance Metrics:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-score:  {f1:.4f}")
    
    # Detailed classification report
    print("\nDetailed Classification Report:")
    print(classification_report(ground_truth, predicted_classes, zero_division=0))
    
    # Confusion matrix
    print("\nConfusion Matrix:")
    print(confusion_matrix(ground_truth, predicted_classes))

# Create a results DataFrame
results_df = pd.DataFrame({
    'smiles': test_data['smiles'].values,
    'predicted_class': predicted_classes
})

# Add probability columns for each class
for class_idx in range(num_classes):
    results_df[f'class_{class_idx}_probability'] = prediction_probs[:, class_idx]

# Add labels column if available
if has_labels:
    results_df['true_class'] = ground_truth

# Save to CSV
results_df.to_csv(str(output_dir / "predictions.csv"), index=False)
print(f"\nResults saved to {output_dir / 'predictions.csv'}")
print(f"Saved {len(results_df)} predictions")



Making predictions...


Made 5 predictions

Example Predictions:
SMILES: CCOC
Predicted Class: 2
  Class 0 probability: 0.323
  Class 1 probability: 0.271
  Class 2 probability: 0.406
Actual Class: 1

SMILES: CCN
Predicted Class: 2
  Class 0 probability: 0.319
  Class 1 probability: 0.313
  Class 2 probability: 0.368
Actual Class: 0

SMILES: CCCCO
Predicted Class: 2
  Class 0 probability: 0.353
  Class 1 probability: 0.240
  Class 2 probability: 0.406
Actual Class: 2


Performance Metrics:
  Accuracy:  0.2000
  Precision: 0.0400
  Recall:    0.2000
  F1-score:  0.0667

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         2
           2       0.20      1.00      0.33         1

    accuracy                           0.20         5
   macro avg       0.07      0.33      0.11         5
weighted avg       0.04      0.20      0.07         5


Confusion Matrix:
[[0 0 2]
 [0 0 2