# Combining ChemProp, `polaris`, and `nepare`

This notebook demonstrates using ChemProp as a learnable embedding for Neural Pairwise Regression (via `nepare`), with data and evaluation handled by the `polaris` benchmarking library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - rdkit
 - chemprop
 - fastprop
 - lightning
 - torch
 - ipywidgets

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [1]:
import polaris as po
import pandas as pd

In [2]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"

`polaris` makes it really easy to run different tasks quickly - just change the name inside `load_competition` to try something else.
Here I'm using the ASAP Discovery antiviral competitions, which have been made conveniently available on `polaris`: the potency task is loaded below, and the commented-out line loads the ADMET task instead.

In [3]:
%%capture
competition = po.load_competition("asap-discovery/antiviral-potency-2025")
# or
# competition = po.load_competition("asap-discovery/antiviral-admet-2025")

In [4]:
train, test = competition.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [5]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

In [6]:
train_df

| index | CXSMILES | pIC50 (MERS-CoV Mpro) | pIC50 (SARS-CoV-2 Mpro) |
| --- | --- | --- | --- |
| 103 | `CNC(=O)CN1C[C@]2(C(=O)N(C3=CN=CC4=CC=CC=C34)C[...` | 4.09 | 5.20 |
| 200 | `CNC(=O)CN1C[C@]2(C(=O)N(C3=CN=CC4=CC=CC=C34)C[...` | 4.15 | 5.27 |
| 605 | `C[C@H]1CN(C2=CN=CC3=CC=CC=C23)C(=O)[C@@]12CN(C...` | 5.50 | 6.44 |
| 430 | `O=C(CN1CC2=CC=C(Cl)C=C2[C@H](C(=O)NC2=CN=CC3=C...` | 5.05 | 6.14 |
| 169 | `O=C(CC1=CC(F)=CN=C1)NC1=CN=CC2=CC=CC=C12` | 4.18 | 5.33 |
| ... | ... | ... | ... |
| 280 | `CC1=NC=C(C2=CC=CC(Cl)=C2)C(=O)N1C1=CN=CC2=CC=C...` | 4.91 | 4.80 |
| 852 | `O=C(CC1=CN=CC2=CC=CC=C12)N1CC[C@H]2C[C@H]2C1 \|...` | 4.62 | 4.55 |
| 526 | `O=C(CC1=CN=CC2=CC=CC=C12)N1CCC(F)(C2=CC=CC=C2)CC1` | 3.84 | 4.63 |
| 528 | `O=C(CC1=CN=CC2=CC=CC=C12)N1CC[C@H](CC2=CC=CC(C...` | 4.04 | 4.98 |


## Learn an Embedding with ChemProp
ChemProp uses message-passing graph neural networks to learn a molecular representation tailored to the problem at hand.
We can 'plug it in' to `nepare` to take advantage of that, with the additional benefit that the pairwise setup gives ChemProp many more training examples from which to learn its representation.

In [7]:
val_idx = int(len(train_df) * 0.15)  # the first 15% of the shuffled rows are held out for validation

We'll first write a function that converts our SMILES into their ChemProp input (a `MolGraph`).

In [8]:
from rdkit.Chem import MolFromSmiles
from chemprop.featurizers import MolGraphCache, SimpleMoleculeMolGraphFeaturizer

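def smiles2molgraphcache(smiles: list[str]):
    """Convert SMILES strings into a cache of featurized ChemProp `MolGraph`s."""
    mols = list(map(MolFromSmiles, smiles))
    featurizer = SimpleMoleculeMolGraphFeaturizer()
    # no extra atom or bond features are supplied, hence the two lists of None
    mgc = MolGraphCache(mols, [None] * len(mols), [None] * len(mols), featurizer)
    return mgc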

In [9]:
import torch

In [10]:
train_mgc = smiles2molgraphcache(train_df["CXSMILES"][val_idx:])
train_targets = torch.tensor(train_df[list(competition.target_cols)][val_idx:].to_numpy(), dtype=torch.float32)
val_mgc = smiles2molgraphcache(train_df["CXSMILES"][:val_idx])
val_targets = torch.tensor(train_df[list(competition.target_cols)][:val_idx].to_numpy(), dtype=torch.float32)
test_mgc = smiles2molgraphcache(test_df["CXSMILES"])
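
The slicing above can be pictured with a toy list: after shuffling, the first 15% of the rows become the validation set and the remainder the training set, so the two never overlap. A minimal sketch with made-up row labels (the names `rows`, `n_val`, `val_rows` and `train_rows` are illustrative and do not appear in the notebook):

```python
import random

rows = [f"mol_{i}" for i in range(20)]  # hypothetical stand-ins for DataFrame rows
random.Random(1701).shuffle(rows)       # seeded shuffle, so the split is reproducible

n_val = int(len(rows) * 0.15)           # size of the validation slice
val_rows = rows[:n_val]                 # first slice: validation
train_rows = rows[n_val:]               # everything else: training

print(len(val_rows), len(train_rows))
```

Because both slices use the same cut point, every row ends up in exactly one of the two sets.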

In [11]:
from fastprop.data import standard_scale

In [12]:
train_targets, target_means, target_vars = standard_scale(train_targets)
val_targets = standard_scale(val_targets, target_means, target_vars)
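
The detail worth noticing is that the validation targets are scaled with the statistics of the training targets, not their own. A rough pure-Python sketch of the idea for a single target column, assuming that scaling means subtracting the mean and dividing by the square root of the variance (the helpers `scale` and `unscale` are hypothetical and are not the `fastprop` API):

```python
from statistics import fmean, pvariance

def scale(values, mean, var):
    # shift to zero mean, then divide by the standard deviation
    return [(v - mean) / var ** 0.5 for v in values]

def unscale(values, mean, var):
    # exact inverse of scale()
    return [v * var ** 0.5 + mean for v in values]

train_y = [4.0, 5.0, 6.0, 7.0]  # made-up training targets
val_y = [5.5, 6.5]              # made-up validation targets

mean, var = fmean(train_y), pvariance(train_y)  # statistics from training data only
train_scaled = scale(train_y, mean, var)
val_scaled = scale(val_y, mean, var)            # reuse the training statistics
restored = unscale(val_scaled, mean, var)       # back on the original scale
```

Keeping the mean and variance around is what makes it possible to map predictions back to the original units later.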

In [13]:
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset

In [14]:
train_dataset = PairwiseAugmentedDataset(train_mgc, train_targets)
val_dataset = PairwiseAnchoredDataset(train_mgc, train_targets, val_mgc, val_targets)
test_dataset = PairwiseInferenceDataset(train_mgc, train_targets, test_mgc)
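
To make the 'more training data' point concrete: pairing training molecules with one another turns N examples into on the order of N² examples, each labeled with a difference of targets. The sketch below is an assumption about what the pairwise datasets produce conceptually (the sign convention, the pair ordering, and the exact set of pairs that `nepare` generates may differ), and it uses plain strings in place of `MolGraph`s:

```python
from itertools import product

train = [("mol_a", 4.1), ("mol_b", 5.2), ("mol_c", 6.4)]  # hypothetical (input, target)
val = [("mol_x", 5.0)]

# augmented training set: every ordered pair of training points
train_pairs = [(x1, x2, y1 - y2) for (x1, y1), (x2, y2) in product(train, repeat=2)]

# anchored validation set: each validation point against every training anchor
val_pairs = [(xv, xt, yv - yt) for (xv, yv), (xt, yt) in product(val, train)]

print(len(train), "training points ->", len(train_pairs), "training pairs")
```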

Next, we need to write a function to collate our `MolGraph`s and target values - ChemProp has a class for batches of `MolGraph` aptly named `BatchMolGraph`.

In [15]:
from typing import Iterable

import torch
import numpy as np
from chemprop.data.molgraph import MolGraph
from chemprop.data.collate import BatchMolGraph

In [16]:
def _collate(batch: Iterable[tuple[MolGraph, MolGraph, float]]):
    mgs_1, mgs_2, ys = zip(*batch)  # transpose the batch; ys still needs converting back into a tensor
    return BatchMolGraph(mgs_1), BatchMolGraph(mgs_2), torch.tensor(np.array(ys), dtype=torch.float32)
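
The `zip(*batch)` idiom transposes a list of per-sample tuples into per-field tuples, which is what lets each field be batched on its own. A toy illustration with strings standing in for `MolGraph`s (all values are made up):

```python
batch = [
    ("mg_a", "mg_b", [0.5, -0.2]),  # (first input, second input, target values)
    ("mg_c", "mg_d", [1.1, 0.3]),
    ("mg_e", "mg_f", [-0.7, 0.0]),
]

firsts, seconds, ys = zip(*batch)  # one tuple per field instead of one per sample
print(firsts)   # ('mg_a', 'mg_c', 'mg_e')
print(seconds)  # ('mg_b', 'mg_d', 'mg_f')
```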

In [17]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=_collate)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=512, collate_fn=_collate)
predict_loader = torch.utils.data.DataLoader(test_dataset, batch_size=512, collate_fn=_collate)

Finally, we just need to define a class to take our collated batches and convert them into their learned representations.
This class can then be passed to the `nepare` class `LearnedEmbeddingNeuralPairwiseRegressor`, which will call our class on the two inputs for each batch.

In [18]:
from chemprop.conf import DEFAULT_HIDDEN_DIM
from chemprop.nn.agg import MeanAggregation
from chemprop.nn.message_passing import BondMessagePassing

from nepare.nn import LearnedEmbeddingNeuralPairwiseRegressor

In [19]:
class ChemPropEmbedder(torch.nn.Module):
    def __init__(self, mp, agg):
        super().__init__()
        self.mp = mp
        self.agg = agg

    def forward(self, bmg):
        H = self.mp(bmg)
        Z = self.agg(H, bmg.batch)
        return Z

In [20]:
mp = BondMessagePassing()
agg = MeanAggregation()
embedder = ChemPropEmbedder(mp, agg)
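
For intuition, the aggregation step collapses the per-atom hidden states `H` into one vector per molecule by averaging the rows that share a molecule index in `bmg.batch`. A rough pure-Python sketch of mean aggregation by index (an illustration of the idea, not ChemProp's implementation; `mean_aggregate` is a hypothetical helper):

```python
from collections import defaultdict

def mean_aggregate(H, batch):
    """Average the rows of H that share the same molecule index."""
    groups = defaultdict(list)
    for row, mol_idx in zip(H, batch):
        groups[mol_idx].append(row)
    # one averaged vector per molecule, ordered by molecule index
    return [
        [sum(col) / len(rows) for col in zip(*rows)]
        for _, rows in sorted(groups.items())
    ]

H = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]  # three atoms, hidden size 2
batch = [0, 0, 1]                           # atoms 0-1 belong to molecule 0, atom 2 to molecule 1
Z = mean_aggregate(H, batch)
print(Z)  # [[2.0, 3.0], [10.0, 20.0]]
```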

In [21]:
npr = LearnedEmbeddingNeuralPairwiseRegressor(embedder, DEFAULT_HIDDEN_DIM, 100, 2, n_targets=len(competition.target_cols))

/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/parsing.py:209: Attribute 'embedding_module' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['embedding_module'])`.


## Training and Predicting

From here on out we follow a very standard `lightning` training workflow - see `demo.ipynb` for a slightly more in-depth explanation of what's going on.

In [22]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint

from nepare.inference import predict

In [23]:
early_stopping = EarlyStopping(monitor="validation/loss", patience=3)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")
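
As a reminder of what `patience=3` means: training stops once the monitored loss has failed to improve for three validation checks in a row. A simplified sketch of that logic with made-up losses (it ignores options such as `min_delta`, and `stopping_epoch` is a hypothetical helper, not part of `lightning`):

```python
def stopping_epoch(losses, patience=3):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    bad_checks = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, bad_checks = loss, 0  # improvement: reset the counter
        else:
            bad_checks += 1             # no improvement on this check
            if bad_checks >= patience:
                return epoch
    return None

losses = [0.90, 0.70, 0.72, 0.71, 0.75, 0.60]
print(stopping_epoch(losses))  # 4
```

Note that the final loss of 0.60 is never reached, which is why the best checkpoint is saved separately and reloaded after training.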

In [None]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)

In [25]:
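# reload the best checkpoint (lowest validation loss); replace this path with your own,
# e.g. `model_checkpoint.best_model_path` if training ran in this session
npr = LearnedEmbeddingNeuralPairwiseRegressor.load_from_checkpoint("/home/jackson/neural-pairwise-regression/notebooks/lightning_logs/version_0/checkpoints/epoch=1-step=24036.ckpt")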

/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/parsing.py:209: Attribute 'embedding_module' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['embedding_module'])`.


In [26]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: |          | 0/? [00:00<?, ?it/s]
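
Conceptually, each test molecule is compared against training anchors: adding a predicted difference to an anchor's known target gives one estimate per anchor, and the spread of those estimates gives an uncertainty. The sketch below assumes that this is roughly what `how="all"` does (an assumption about `nepare`, whose exact aggregation may differ) and uses made-up numbers:

```python
from statistics import fmean, pstdev

anchor_targets = [4.1, 5.2, 6.4]     # known targets of three training anchors
predicted_diffs = [0.8, -0.2, -1.5]  # hypothetical model outputs: test minus anchor

# one estimate of the test target per anchor
estimates = [y + d for y, d in zip(anchor_targets, predicted_diffs)]

y_hat = fmean(estimates)   # point prediction
y_std = pstdev(estimates)  # disagreement between anchors
print(round(y_hat, 2), round(y_std, 2))
```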

In [27]:
from fastprop.data import inverse_standard_scale

In [28]:
y_pred = inverse_standard_scale(torch.tensor(y_pred), target_means, target_vars)


In [29]:
predictions = {name: y_pred[:, i].tolist() for i, name in enumerate(competition.target_cols)}

This last block will fail unless you are me - replace the inputs with your own if you are submitting this for yourself.

In [31]:
competition.submit_predictions(
    predictions=predictions,
    prediction_name="nepare_chemprop",
    prediction_owner="jacksonburns",
    report_url="https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/meta",
    github_url="https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/nepare_asap.ipynb",
    description="Neural Pairwise Regression with ChemProp as a learnable embedding",
)
