# Combining ChemProp, `polaris`, and `nepare`

This notebook demonstrates using ChemProp as a learnable embedding for Neural Pairwise Regression (via `nepare`), evaluated with the `polaris` benchmarking library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - rdkit
 - lightning
 - torch
 - ipywidgets
 - chemprop

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [11]:
import polaris as po
import pandas as pd

In [12]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"

`polaris` makes it easy to switch between benchmarks: just change the name passed to `load_benchmark` to try something else.
I'm using this same notebook for several benchmarks, most of them from the Fang et al. Biogen ADME paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00160), which have been made conveniently available on `polaris`.

In [13]:
%%capture
# https://polarishub.io/benchmarks/polaris/adme-fang-rppb-1
benchmark = po.load_benchmark("polaris/adme-fang-RPPB-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-solu-1
# benchmark = po.load_benchmark("polaris/adme-fang-SOLU-1")
# another interesting one
# https://polarishub.io/benchmarks/polaris/pkis1-ret-wt-mut-r-1
# benchmark = po.load_benchmark("polaris/pkis1-ret-wt-mut-r-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-hppb-1
# benchmark = po.load_benchmark("polaris/adme-fang-HPPB-1")

# benchmark = po.load_benchmark("tdcommons/half-life-obach")

In [14]:
train, test = benchmark.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [15]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

In [16]:
train_df

| index | Drug | Y |
|---|---|---|
| 519 | `CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21` | 42.0 |
| 391 | `COc1ccc(Cl)cc1[C@]1(F)C(=O)Nc2cc(C(F)(F)F)ccc21` | 37.0 |
| 31 | `CCN(C)C(=O)Oc1cccc([C@H](C)N(C)C)c1` | 1.4 |
| 27 | `N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O` | 1.3 |
| 407 | `Cc1nnc(SCC2=C(C(=O)O)N3C(=O)[C@@H](NC(=O)[C@H]...` | 1.0 |
| ... | ... | ... |
| 451 | `CCn1cc(C(=O)O)c(=O)c2cc(F)c(N3CCNCC3)nc21` | 5.1 |
| 494 | `CN(C)/N=N/c1[nH]cnc1C(N)=O` | 6.2 |
| 94 | `CC=CC1=C(C(=O)O)N2C(=O)[C@@H](NC(=O)[C@H](N)c3...` | 1.2 |
| 516 | `O=C1Nc2ccc(Cl)cc2C(c2ccccc2Cl)=NC1O` | 17.0 |


## Learn an Embedding with ChemProp
ChemProp uses message-passing graph neural networks to learn a molecular representation tailored to the problem at hand.
We can plug it into `nepare` to take advantage of that, with the added benefit for ChemProp that the pairwise formulation gives it more training examples from which to learn its representation.
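
To make the "more training data" point concrete, here is a minimal sketch, independent of `nepare`, of how pairing N labelled molecules yields on the order of N² training examples whose targets are differences. The helper name `make_pairs` and the sign convention (first minus second) are assumptions for illustration; `nepare`'s datasets may order or subsample pairs differently.

```python
from itertools import product


def make_pairs(inputs, targets):
    """Pair every input with every input; the target is the difference."""
    pairs = []
    for (x_i, y_i), (x_j, y_j) in product(zip(inputs, targets), repeat=2):
        pairs.append((x_i, x_j, y_i - y_j))
    return pairs


smiles = ["CCO", "CCN", "CCC"]  # toy molecules
values = [1.0, 2.5, 4.0]        # toy target values
pairs = make_pairs(smiles, values)
print(len(smiles), "molecules ->", len(pairs), "pairs")
print(pairs[1])  # ('CCO', 'CCN', -1.5)
```

Three molecules already give nine pairs, which is why the embedding sees far more gradient updates per labelled molecule than it would in ordinary regression.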

In [17]:
val_idx = int(len(train_df) * 0.20)  # number of rows held out for validation (the first 20%)
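
The split used in the following cells is a plain index cut on the shuffled frame: the first `val_idx` rows become validation and the remainder training. A minimal sketch with a toy list standing in for `train_df` (the names `rows`, `val_rows`, and `train_rows` are illustrative only):

```python
import random

rows = list(range(10))             # stand-in for the rows of train_df
random.Random(1701).shuffle(rows)  # shuffle once, reproducibly

val_idx = int(len(rows) * 0.20)    # number of rows held out for validation
val_rows = rows[:val_idx]          # first 20% -> validation
train_rows = rows[val_idx:]        # remaining 80% -> training

print(len(val_rows), len(train_rows))  # 2 8
```

Because the data was shuffled beforehand, taking the first rows is equivalent to drawing a random validation subset.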

We'll first write a function that converts our SMILES strings into ChemProp's input format (a cache of `MolGraph`s).

In [18]:
from rdkit.Chem import MolFromSmiles
from chemprop.featurizers import MolGraphCache, SimpleMoleculeMolGraphFeaturizer

def smiles2molgraphcache(smiles: list[str]):
    mols = list(map(MolFromSmiles, smiles))
    featurizer = SimpleMoleculeMolGraphFeaturizer()
    mgc = MolGraphCache(mols, [None] * len(mols), [None] * len(mols), featurizer)
    return mgc

In [19]:
import torch

In [20]:
train_mgc = smiles2molgraphcache(train_df["Drug"][val_idx:])
train_targets = torch.tensor(train_df[list(benchmark.target_cols)][val_idx:].to_numpy(), dtype=torch.float32)
val_mgc = smiles2molgraphcache(train_df["Drug"][:val_idx])
val_targets = torch.tensor(train_df[list(benchmark.target_cols)][:val_idx].to_numpy(), dtype=torch.float32)
test_mgc = smiles2molgraphcache(test_df["Drug"])

In [21]:
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset

In [22]:
train_dataset = PairwiseAugmentedDataset(train_mgc, train_targets, how='sut')
val_dataset = PairwiseAnchoredDataset(train_mgc, train_targets, val_mgc, val_targets, how='half')
test_dataset = PairwiseInferenceDataset(train_mgc, train_targets, test_mgc, how='half')

Next, we need to write a function to collate our `MolGraph`s and target values - ChemProp has a class for batches of `MolGraph` aptly named `BatchMolGraph`.

In [23]:
from typing import Iterable

import numpy as np
from chemprop.data.molgraph import MolGraph
from chemprop.data.collate import BatchMolGraph

In [24]:
def _collate(batch: Iterable[tuple[MolGraph, MolGraph, float]]):
    # transpose the list of (graph_1, graph_2, y) samples into three parallel tuples
    mgs_1, mgs_2, ys = zip(*batch)
    # batch each side's graphs and convert the targets back into a tensor
    return BatchMolGraph(mgs_1), BatchMolGraph(mgs_2), torch.tensor(np.array(ys), dtype=torch.float32)
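
The `zip(*batch)` idiom in `_collate` transposes a list of `(first, second, target)` samples into three parallel tuples, one per field. A minimal sketch with strings standing in for `MolGraph` objects:

```python
batch = [
    ("mol_a", "mol_b", 0.5),
    ("mol_a", "mol_c", -1.0),
    ("mol_b", "mol_c", -1.5),
]

firsts, seconds, ys = zip(*batch)  # transpose samples into per-field tuples
print(firsts)   # ('mol_a', 'mol_a', 'mol_b')
print(seconds)  # ('mol_b', 'mol_c', 'mol_c')
print(ys)       # (0.5, -1.0, -1.5)
```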

In [25]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=_collate)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, collate_fn=_collate)
predict_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, collate_fn=_collate)

Finally, we just need to define a class to take our collated batches and convert them into their learned representations.
This class can then be passed to the `nepare` class `LearnedEmbeddingNeuralPairwiseRegressor`, which will call our class on the two inputs for each batch.

In [26]:
from chemprop.nn.agg import MeanAggregation
from chemprop.nn.message_passing import BondMessagePassing

from nepare.nn import LearnedEmbeddingNeuralPairwiseRegressor

In [27]:
class ChemPropEmbedder(torch.nn.Module):
    def __init__(self, mp, agg):
        super().__init__()
        self.mp = mp
        self.agg = agg
        self.bn = torch.nn.BatchNorm1d(mp.output_dim)

    def forward(self, bmg):
        H = self.mp(bmg)
        Z = self.agg(H, bmg.batch)
        return self.bn(Z)
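
The aggregation step collapses per-atom hidden vectors into one vector per molecule, using `bmg.batch` to say which molecule each atom belongs to. The following pure-Python sketch illustrates that idea only; `mean_aggregate` is a hypothetical helper, not ChemProp's implementation.

```python
def mean_aggregate(atom_vectors, batch_index, n_mols):
    """Average atom vectors per molecule; batch_index[i] is atom i's molecule."""
    dim = len(atom_vectors[0])
    sums = [[0.0] * dim for _ in range(n_mols)]
    counts = [0] * n_mols
    for vec, mol in zip(atom_vectors, batch_index):
        counts[mol] += 1
        for k, value in enumerate(vec):
            sums[mol][k] += value
    return [[s / c for s in row] for row, c in zip(sums, counts)]


H = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]  # three atoms, hidden size 2
batch = [0, 0, 1]  # atoms 0 and 1 belong to molecule 0, atom 2 to molecule 1
Z = mean_aggregate(H, batch, n_mols=2)
print(Z)  # [[2.0, 3.0], [10.0, 20.0]]
```

Averaging rather than summing keeps the embedding's scale independent of molecule size.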

In [28]:
HIDDEN = 400

In [29]:
mp = BondMessagePassing(d_h=HIDDEN, depth=3)
agg = MeanAggregation()
embedder = ChemPropEmbedder(mp, agg)

In [30]:
npr = LearnedEmbeddingNeuralPairwiseRegressor(embedder, HIDDEN, HIDDEN, 2, lr=5e-5, n_targets=len(benchmark.target_cols))

/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/utilities/parsing.py:209: Attribute 'embedding_module' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['embedding_module'])`.


## Training and Predicting

From here on out we follow a very standard `lightning` training workflow - see `demo.ipynb` for a slightly more in-depth explanation of what's going on.

In [31]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint

from nepare.inference import predict

In [32]:
early_stopping = EarlyStopping(monitor="validation/loss", patience=5)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")
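
Early stopping with `patience=5` halts training once the monitored validation loss has failed to improve for five consecutive checks. A simplified sketch of that rule (the function `stop_epoch` is illustrative and ignores options such as `min_delta`):

```python
def stop_epoch(losses, patience):
    """Return the index of the check at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss   # new best: reset the counter
            waited = 0
        else:
            waited += 1   # no improvement at this check
            if waited >= patience:
                return epoch
    return None


val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9]
print(stop_epoch(val_losses, patience=5))  # 7
```

The best loss in this toy run occurs at index 2, several checks before training stops, which is why the checkpoint callback is used to recover the best weights afterwards.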

In [33]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type             | Params | Mode 
--------------------------------------------------------------
0 | fnn              | Sequential       | 481 K  | train
1 | embedding_module | ChemPropEmbedder | 384 K  | train
--------------------------------------------------------------
865 K     Trainable params
0         Non-trainable params
865 K     Total params
3.462     Total estimated model params size (MB)
17        Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

In [34]:
npr = LearnedEmbeddingNeuralPairwiseRegressor.load_from_checkpoint(model_checkpoint.best_model_path)  # reload the checkpoint with the lowest validation loss

In [35]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/jackson/neural-pairwise-regression/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Predicting: |          | 0/? [00:00<?, ?it/s]
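
`predict` returns both a prediction and a standard deviation for each test molecule. One way such values can arise in pairwise regression, sketched here under assumptions, is to predict the difference between the test molecule and each training anchor, add back the anchor's known target, and summarize the resulting estimates. The sign convention and the helper `combine` are assumptions and may not match what `nepare.inference.predict` does internally.

```python
from statistics import mean, pstdev


def combine(anchor_targets, predicted_diffs):
    """Turn predicted (test - anchor) differences into a mean and a spread."""
    estimates = [y_a + d for y_a, d in zip(anchor_targets, predicted_diffs)]
    return mean(estimates), pstdev(estimates)


anchor_y = [1.0, 2.0, 4.0]  # known targets of three training anchors
diffs = [2.1, 0.9, -1.0]    # hypothetical model outputs: test minus anchor
y_hat, y_std = combine(anchor_y, diffs)
print(round(y_hat, 3), round(y_std, 3))  # 3.0 0.082
```

When the anchors agree closely, as here, the spread is small; large disagreement between anchors gives a natural signal of an uncertain prediction.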

In [36]:
results = benchmark.evaluate(y_pred.numpy().flatten())

In [37]:
results.name = "nepare_chemprop"
results.description = "Neural Pairwise Regression with ChemProp Learned Embedding"
results.github_url = "https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_chemprop_nepare.ipynb"

In [38]:
results

| test_set | target_label | metric | score |
|---|---|---|---|
| test | Y | spearmanr | 0.2748942684755672 |

| field | value |
|---|---|
| name | nepare_chemprop |
| description | Neural Pairwise Regression with ChemProp Learned Embedding |
| tags | |
| user_attributes | |
| owner | |
| polaris_version | 0.11.8 |
| github_url | https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_chemprop_nepare.ipynb |
| paper_url | |
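
The reported metric, `spearmanr`, is the Spearman rank correlation: the Pearson correlation computed on ranks rather than raw values, so it rewards getting the ordering of molecules right. A minimal sketch for data without ties (the helper names are illustrative; a library routine such as `scipy.stats.spearmanr` also handles ties):

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(y_true, y_pred):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(y_true)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(y_true), ranks(y_pred)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


y_true = [1.2, 3.4, 2.2, 5.0]
y_pred = [0.9, 2.0, 2.5, 4.1]
print(spearman(y_true, y_pred))  # 0.8
```

Only one adjacent pair is swapped in this toy example, so the correlation stays high even though the raw predicted values differ noticeably from the true ones.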


This last line is commented out because it will fail unless you are me. To upload your own results, replace the `owner` with your own name and update the link, name, and description above.

In [39]:
# results.upload_to_hub(owner="jacksonburns", access="public")