# Combining ChemProp, `polaris`, and `nepare`

This notebook demonstrates using ChemProp as a learnable embedding for Neural Pairwise Regression (via `nepare`), with data from the `polaris` benchmarking library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - rdkit
 - chemprop
 - fastprop
 - lightning
 - torch
 - packaging
 - ipywidgets

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [1]:
import polaris as po
import pandas as pd

In [2]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"

`polaris` makes it really easy to try different challenges quickly - just change the name inside `load_competition` to try something else.
I also use this same notebook for a few benchmarks from the Fang Biogen ADME paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00160), which have been made conveniently available on `polaris`; the cell below instead loads one of the ASAP Discovery antiviral competitions.

In [3]:
%%capture
# competition = po.load_competition("asap-discovery/antiviral-potency-2025")
# or
competition = po.load_competition("asap-discovery/antiviral-admet-2025")

In [4]:
train, test = competition.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [5]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

In [6]:
train_df

| | CXSMILES | MLM | LogD | KSOL | HLM | MDR1-MDCKII |
|---|---|---|---|---|---|---|
| 214 | `NCC1=CC=CC(NC(=O)[C@H](NC(=O)OCC2=CC=CC=C2)C2C...` | | 2.0 | | 5.29 | 1.580 |
| 166 | `CC1=CC(C2=NOC(C(F)(F)F)=N2)=CC=C1OCCCC1=NC=CN1` | 25.8 | | 68.0 | | 1.890 |
| 356 | `CC1=CC=CC=C1OCC(=O)NC1=CN=CC(C2=CN=C3NC=CC3=C2...` | | | 6.0 | 113.00 | 2.060 |
| 77 | `CNC(=O)[C@H]1CC12CCN(C(=O)CC1=CN=CC3=CC=CC=C13...` | | | 366.0 | | 1.980 |
| 233 | `CN1CCN(C)[C@H](CNC(=O)C2=CC=C3COB(O)C3=C2)C1 \|...` | 16.0 | -0.5 | | 14.00 | 1.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 339 | `CNC(=O)CN1C=C(C2=CC3=C(N[C@H](C4=CC=C5CCCS(=O)...` | 44.0 | 2.3 | 294.0 | 237.00 | 0.916 |
| 282 | `CN1N=NN=C1CN1C[C@]2(CCN(C3=CN=CC4=CC=CC=C34)C2...` | 129.0 | 1.9 | 398.0 | 57.00 | 3.800 |
| 94 | `CCN1C(C)=NN=C1SC1=CC(C)=NC2=NC=NN12` | 10.0 | -0.5 | 348.0 | 16.50 | 0.401 |
| 4 | `CC1=CC(CC(=O)N2CCC[C@H](C(N)=O)C2)=CC=N1 \|&1:11\|` | | -0.3 | 375.0 | | 0.900 |


## Learn an Embedding with ChemProp
ChemProp uses message passing graph neural networks to learn a molecular representation tailored to the problem at hand.
We can plug it into `nepare` to take advantage of that, with the added benefit for ChemProp that pairwise training gives it far more examples from which to learn its representation.
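
To make the "more training data" point concrete, here is a minimal plain-Python sketch of pairwise augmentation. It assumes each training example is an ordered pair of molecules labeled with the difference of their target values. The function name `make_pairs` and the toy data are hypothetical, and `nepare`'s own dataset classes may build pairs differently.

```python
from itertools import product


def make_pairs(inputs, targets):
    """Build every ordered pair (x_i, x_j) labeled with y_i - y_j (illustrative only)."""
    return [
        (inputs[i], inputs[j], targets[i] - targets[j])
        for i, j in product(range(len(inputs)), repeat=2)
    ]


# three toy 'molecules' with one made-up target value each
smiles = ["CCO", "CCN", "CCC"]
values = [0.5, -0.25, 2.0]

pairs = make_pairs(smiles, values)
print(len(smiles), "molecules ->", len(pairs), "pairs")
```

With n molecules this scheme yields n² pairs, so the embedding is trained on many more examples than there are individual molecules.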

In [7]:
val_idx = int(len(train_df) * 0.15)  # hold out the first 15% of the shuffled rows for validation
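
The shuffle-then-slice pattern used here can be sketched without pandas. This is only an illustration: `shuffled_split` is a hypothetical helper, and Python's `random` module will not reproduce the row order that `DataFrame.sample` gives for the same seed.

```python
import random


def shuffled_split(items, val_fraction=0.15, seed=1701):
    """Shuffle a copy of items with a fixed seed, then hold out the first val_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


train_part, val_part = shuffled_split(range(20))
print(len(train_part), len(val_part))
```

Because the rows are shuffled first, taking the leading slice gives a random validation subset rather than whatever happened to be at the top of the file.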

We'll first write a function that converts our SMILES into their ChemProp input (one `MolGraph` per molecule, held in a `MolGraphCache`).

In [8]:
from rdkit.Chem import MolFromSmiles
from chemprop.featurizers import MolGraphCache, SimpleMoleculeMolGraphFeaturizer

def smiles2molgraphcache(smiles: list[str]):
    mols = list(map(MolFromSmiles, smiles))
    featurizer = SimpleMoleculeMolGraphFeaturizer()
    mgc = MolGraphCache(mols, [None] * len(mols), [None] * len(mols), featurizer)
    return mgc

In [9]:
import torch

In [10]:
train_mgc = smiles2molgraphcache(train_df["CXSMILES"][val_idx:])
train_targets = torch.tensor(train_df[list(competition.target_cols)][val_idx:].to_numpy(), dtype=torch.float32)
val_mgc = smiles2molgraphcache(train_df["CXSMILES"][:val_idx])
val_targets = torch.tensor(train_df[list(competition.target_cols)][:val_idx].to_numpy(), dtype=torch.float32)
test_mgc = smiles2molgraphcache(test_df["CXSMILES"])

In [11]:
from fastprop.data import standard_scale

In [12]:
train_targets, target_means, target_vars = standard_scale(train_targets)
val_targets = standard_scale(val_targets, target_means, target_vars)
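
The scaling step above can be illustrated on a single target column in plain Python. This is a sketch, not `fastprop`'s implementation: it assumes missing values are skipped when computing statistics and that the population variance is used. The helper names are hypothetical.

```python
import math


def fit_scaler(column):
    """Mean and variance of one target column, skipping missing (NaN) entries."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)
    return mean, var


def scale(column, mean, var):
    """Standardize values; NaN stays NaN because arithmetic on NaN yields NaN."""
    return [(v - mean) / math.sqrt(var) for v in column]


def unscale(column, mean, var):
    """Undo scale(), mapping standardized values back to the original units."""
    return [v * math.sqrt(var) + mean for v in column]


train_col = [1.0, float("nan"), 3.0, 5.0]
val_col = [3.0, 7.0]

mean, var = fit_scaler(train_col)        # statistics come from the training data only
train_scaled = scale(train_col, mean, var)
val_scaled = scale(val_col, mean, var)   # validation reuses the training statistics
restored = unscale(val_scaled, mean, var)
```

Reusing the training mean and variance for the validation targets keeps both sets on the same scale and avoids leaking validation statistics into training. The same statistics are needed later to map predictions back to the original units.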

In [13]:
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset

In [14]:
train_dataset = PairwiseAugmentedDataset(train_mgc, train_targets)
val_dataset = PairwiseAnchoredDataset(train_mgc, train_targets, val_mgc, val_targets)
test_dataset = PairwiseInferenceDataset(train_mgc, train_targets, test_mgc)
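
For intuition on why the validation and test datasets are also given the training data, here is a rough sketch of anchored inference. It assumes the model predicts the difference between a query molecule and each training "anchor", and that these differences are added to the known anchor targets. The function and the numbers are hypothetical, and `nepare` may combine the estimates differently.

```python
from statistics import mean, pstdev


def predict_from_anchors(anchor_targets, predicted_differences):
    """Combine predicted differences (query minus anchor) with known anchor targets.

    Each anchor yields one estimate of the query's target; the mean is the
    prediction and the spread is a rough uncertainty.
    """
    estimates = [y + d for y, d in zip(anchor_targets, predicted_differences)]
    return mean(estimates), pstdev(estimates)


# toy numbers: three training anchors and made-up model outputs
anchor_targets = [1.0, 2.0, 4.0]
predicted_differences = [1.5, 0.25, -1.25]

y_hat, y_spread = predict_from_anchors(anchor_targets, predicted_differences)
print(y_hat, y_spread)
```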

Next, we need to write a function to collate our `MolGraph`s and target values - ChemProp has a class for batches of `MolGraph` aptly named `BatchMolGraph`.

In [15]:
from typing import Iterable

import torch
import numpy as np
from chemprop.data.molgraph import MolGraph
from chemprop.data.collate import BatchMolGraph

In [16]:
def _collate(batch: Iterable[tuple[MolGraph, MolGraph, float]]):
    mgs_1, mgs_2, ys = zip(*batch)  #  now need to convert y back into a tensor
    return BatchMolGraph(mgs_1), BatchMolGraph(mgs_2), torch.tensor(np.array(ys), dtype=torch.float32)
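
The `zip(*batch)` idiom in `_collate` transposes a list of samples into parallel tuples. A minimal stand-alone illustration, with placeholder strings standing in for `MolGraph` objects (the helper name is hypothetical):

```python
def transpose_batch(batch):
    """Turn a list of (first, second, target) samples into three parallel tuples."""
    firsts, seconds, targets = zip(*batch)
    return firsts, seconds, targets


# placeholder strings stand in for MolGraph objects
batch = [("g_a", "g_b", 0.5), ("g_a", "g_c", -1.0), ("g_b", "g_c", -1.5)]
firsts, seconds, targets = transpose_batch(batch)
print(firsts, seconds, targets)
```

In the notebook, the first two tuples are each wrapped in a `BatchMolGraph` and the third is converted to a tensor.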

In [17]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=_collate)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=512, collate_fn=_collate)
predict_loader = torch.utils.data.DataLoader(test_dataset, batch_size=512, collate_fn=_collate)

Finally, we just need to define a class to take our collated batches and convert them into their learned representations.
This class can then be passed to the `nepare` class `LearnedEmbeddingNeuralPairwiseRegressor`, which will call our class on the two inputs for each batch.

In [18]:
from chemprop.conf import DEFAULT_HIDDEN_DIM
from chemprop.nn.agg import MeanAggregation
from chemprop.nn.message_passing import BondMessagePassing

from nepare.nn import LearnedEmbeddingNeuralPairwiseRegressor

In [19]:
class ChemPropEmbedder(torch.nn.Module):
    def __init__(self, mp, agg):
        super().__init__()
        self.mp = mp
        self.agg = agg

    def forward(self, bmg):
        H = self.mp(bmg)
        Z = self.agg(H, bmg.batch)
        return Z
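
The aggregation step turns per-node hidden vectors into one vector per molecule, using `bmg.batch` to say which molecule each node belongs to. The following plain-Python sketch shows what a mean aggregation of that kind computes. It is an illustration with a hypothetical helper, not ChemProp's code.

```python
def mean_aggregate(node_vectors, batch_index):
    """Average node vectors per molecule, where batch_index[k] names node k's molecule."""
    n_mols = max(batch_index) + 1
    dim = len(node_vectors[0])
    sums = [[0.0] * dim for _ in range(n_mols)]
    counts = [0] * n_mols
    for vec, mol in zip(node_vectors, batch_index):
        counts[mol] += 1
        for d, value in enumerate(vec):
            sums[mol][d] += value
    return [[s / counts[m] for s in sums[m]] for m in range(n_mols)]


# two nodes belong to molecule 0, one node to molecule 1
node_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
batch_index = [0, 0, 1]
embeddings = mean_aggregate(node_vectors, batch_index)
print(embeddings)
```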

In [20]:
mp = BondMessagePassing()
agg = MeanAggregation()
embedder = ChemPropEmbedder(mp, agg)

In [21]:
npr = LearnedEmbeddingNeuralPairwiseRegressor(embedder, DEFAULT_HIDDEN_DIM, 100, 2, n_targets=len(competition.target_cols))

/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/parsing.py:209: Attribute 'embedding_module' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['embedding_module'])`.


## Training and Predicting

From here on out we follow a very standard `lightning` training workflow - see `demo.ipynb` for a slightly more in-depth explanation of what's going on.

In [27]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint

from nepare.inference import predict

In [23]:
early_stopping = EarlyStopping(monitor="validation/loss", patience=3)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")
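
As a simplified picture of what `patience=3` means, the sketch below stops once the monitored loss has failed to improve for three consecutive checks. It is a hypothetical stand-in that ignores the other options of `lightning`'s callback (such as `min_delta`), and the loss values are made up.

```python
def epochs_until_stop(val_losses, patience=3):
    """Return the index of the epoch at which training would stop, or None.

    Simplified: any strictly lower loss counts as an improvement.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


losses = [1.00, 0.80, 0.90, 0.85, 0.81, 0.70]
stop_epoch = epochs_until_stop(losses)
print(stop_epoch)
```

Stopping can leave the final weights slightly worse than the best ones seen, which is why the checkpoint callback also monitors the same loss and the best checkpoint is reloaded afterwards.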

In [24]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type             | Params | Mode 
--------------------------------------------------------------
0 | fnn              | Sequential       | 70.7 K | train
1 | embedding_module | ChemPropEmbedder | 227 K  | train
--------------------------------------------------------------
298 K     Trainable params
0         Non-trainable params
298 K     Total params
1.194     Total estimated model params size (MB)
16        Modules in train mode
0         Modules in eval mode
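
As a sanity check on the summary, the `fnn` parameter count can be reproduced by hand. The layout below is an assumption that happens to match the reported 70.7 K: the two 300-dimensional embeddings are concatenated, passed through two hidden layers of width 100, and mapped to the five targets. It also assumes `DEFAULT_HIDDEN_DIM` is 300.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


embedding_dim = 300  # assumed value of chemprop's DEFAULT_HIDDEN_DIM
hidden = 100
n_targets = 5  # MLM, LogD, KSOL, HLM, MDR1-MDCKII

layer_sizes = [2 * embedding_dim, hidden, hidden, n_targets]  # assumed layout
fnn_params = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(fnn_params)  # 70705
```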



In [25]:
npr = LearnedEmbeddingNeuralPairwiseRegressor.load_from_checkpoint(model_checkpoint.best_model_path)  # reload best model based on early stopping

In [39]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")


In [43]:
from fastprop.data import inverse_standard_scale

In [44]:
# as_tensor avoids a copy (and the copy-construct warning) if predict already returned a tensor
y_pred = inverse_standard_scale(torch.as_tensor(y_pred), target_means, target_vars)



In [46]:
predictions = {name: y_pred[:, i].tolist() for i, name in enumerate(competition.target_cols)}

This last block will fail unless you are me - replace the inputs with your own if you are submitting this for yourself.

In [48]:
competition.submit_predictions(
    predictions=predictions,
    prediction_name="nepare_chemprop",
    prediction_owner="jacksonburns",
    report_url="https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/meta",
    github_url="https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_chemprop_nepare.ipynb",
    description="Neural Pairwise Regression with ChemProp as a learnable embedding",
)
