# Combining ChemProp, `polaris`, and `nepare`

This notebook demonstrates using ChemProp as a learnable embedding for Neural Pairwise Regression (via `nepare`), evaluated with the `polaris` benchmarking library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - fastprop
 - mordredcommunity
 - rdkit
 - lightning
 - torch
 - numpy
 - ipywidgets
 - chemprop 2.1 or newer

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [1]:
import polaris as po
import pandas as pd

In [2]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"
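
The check above uses `packaging.version.Version` rather than comparing raw strings because version strings do not sort correctly as text. A minimal standard-library sketch of the difference follows; the `as_tuple` helper is hypothetical and only handles plain `X.Y.Z` versions, and `0.9.0` is just an illustrative older version number.

```python
def as_tuple(version: str) -> tuple[int, ...]:
    """Hypothetical helper: parse a plain 'X.Y.Z' version into integers."""
    return tuple(int(part) for part in version.split("."))


required = "0.11.6"
installed = "0.9.0"  # hypothetical older version, for illustration only

# Lexicographic string comparison wrongly treats "0.9.0" as newer than "0.11.6"
string_says_ok = installed >= required
# Comparing integer components gives the intended answer
tuple_says_ok = as_tuple(installed) >= as_tuple(required)

print(string_says_ok, tuple_says_ok)
```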

`polaris` makes it really easy to run different benchmarks quickly - just change the name inside `load_benchmark` to try something else.
I'm using this same notebook for a few different benchmarks, all from the Biogen ADME paper by Fang et al. (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00160), which have been made conveniently available on `polaris`.

In [3]:
%%capture
# https://polarishub.io/benchmarks/polaris/adme-fang-rppb-1
# benchmark = po.load_benchmark("polaris/adme-fang-RPPB-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-solu-1
benchmark = po.load_benchmark("polaris/adme-fang-SOLU-1")

In [4]:
train, test = benchmark.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [5]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

In [6]:
train_df

Unnamed: 0,smiles,LOG_SOLUBILITY
474,O=C(CCc1ccco1)Nc1ccccc1OC(F)F,1.649627
913,CCCN(Cc1ccccc1)C(=O)CC1(N)CCC1,1.751356
1268,Cc1cc(C)c(S(=O)(=O)Nc2ccc(N(C)C)cc2)c(C)c1-n1c...,0.436163
618,CCOc1cccc(CNCCc2c[nH]c3ccccc23)c1,1.863263
1334,NCC(=O)Nc1ccc(-n2nc(C(F)(F)F)cc2-c2ccc3c(ccc4c...,-0.602060
...,...,...
1475,CCn1ccc2cc(C(=O)N3CCN(c4ccc(C)cc4)C(=O)C3)ccc21,1.660201
1518,CN(C)[C@@H]1CN(C(=O)CCn2cnc3ccccc32)C[C@H]1O,2.179264
1118,Nc1ncc(-c2cccc(C(F)(F)F)c2)c(C2CCCCN2C(=O)c2cc...,1.399674
1540,COc1cccc(CN(C)c2ncnc3c2CN(C)CC3)c1,1.761402


## Learn an Embedding with ChemProp
ChemProp uses message passing graph neural networks to learn a molecular representation tailored to the problem at hand.
We can 'plug it in' to `nepare` to take advantage of that, with the additional benefit for ChemProp that pairwise augmentation gives it more training data from which to learn its representation.
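
To make the 'more training data' point concrete: pairwise regression trains on pairs of molecules and the difference between their targets, so n molecules yield up to n² ordered pairs. The sketch below is a plain-Python illustration of that idea with made-up molecule names and target values, not `nepare`'s actual implementation.

```python
from itertools import product

# Made-up targets for three molecules (illustrative values only)
targets = {"mol_a": 1.5, "mol_b": 0.5, "mol_c": 2.0}

# Every ordered pair (i, j) becomes one training example whose label is y_i - y_j
pairs = [
    (name_i, name_j, targets[name_i] - targets[name_j])
    for name_i, name_j in product(targets, repeat=2)
]

print(len(targets), "molecules ->", len(pairs), "pairs")
for name_i, name_j, delta in pairs[:4]:
    print(name_i, name_j, delta)
```

In practice the number of pairs grows quickly, which is why the training set may need to be downsampled.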

In [7]:
val_idx = 150  # the first 150 shuffled rows are held out for validation

In [7]:
# from chemprop import data, featurizers, models, nn
# import torch
# import numpy as np

# from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset

In [13]:
# train_mols = list(map(MolFromSmiles, train_df["smiles"][val_idx:]))
# train_targets = torch.tensor(train_df["LOG_SOLUBILITY"][val_idx:].to_numpy(), dtype=torch.float32)
# val_mols = list(map(MolFromSmiles, train_df["smiles"][:val_idx]))
# val_targets = torch.tensor(train_df["LOG_SOLUBILITY"][:val_idx].to_numpy(), dtype=torch.float32)
# test_mols = list(map(MolFromSmiles, test_df["smiles"]))
# train_dataset = PairwiseAugmentedDataset(train_mols, train_targets)
# train_dataset.downsample_(n=100_000)
# val_dataset = PairwiseAnchoredDataset(train_mols, train_targets, val_mols, val_targets)
# test_dataset = PairwiseInferenceDataset(train_mols, train_targets, test_mols)

We start by converting our SMILES into RDKit mols, and then convert them into their corresponding ChemProp features.

TODO: refactor this into a function

In [8]:
from rdkit.Chem import MolFromSmiles

In [9]:
train_mols = list(map(MolFromSmiles, train_df["smiles"][val_idx:]))
val_mols = list(map(MolFromSmiles, train_df["smiles"][:val_idx]))
test_mols = list(map(MolFromSmiles, test_df["smiles"]))

In [10]:
from chemprop.featurizers import MolGraphCache, SimpleMoleculeMolGraphFeaturizer

In [11]:
featurizer = SimpleMoleculeMolGraphFeaturizer()

In [12]:
train_mgc = MolGraphCache(train_mols, [None] * len(train_mols), [None] * len(train_mols), featurizer)
val_mgc = MolGraphCache(val_mols, [None] * len(val_mols), [None] * len(val_mols), featurizer)
test_mgc = MolGraphCache(test_mols, [None] * len(test_mols), [None] * len(test_mols), featurizer)

In [13]:
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset

In [14]:
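train_targets = train_df["LOG_SOLUBILITY"][val_idx:].to_numpy()
val_targets = train_df["LOG_SOLUBILITY"][:val_idx].to_numpy()
# training pairs come from the training set; validation and test molecules are paired against training anchors
train_dataset = PairwiseAugmentedDataset(train_mgc, train_targets)
val_dataset = PairwiseAnchoredDataset(train_mgc, train_targets, val_mgc, val_targets)
test_dataset = PairwiseInferenceDataset(train_mgc, train_targets, test_mgc)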

In [20]:
from typing import Iterable

import torch  # needed by _collate below
from chemprop.data.collate import BatchMolGraph
from chemprop.data.molgraph import MolGraph

In [41]:
def _collate(batch: Iterable[tuple[MolGraph, MolGraph, float]]):
    mgs_1, mgs_2, ys = zip(*batch)  # transpose the samples into per-field tuples
    return BatchMolGraph(mgs_1), BatchMolGraph(mgs_2), torch.tensor(ys, dtype=torch.float32).reshape(-1, 1)
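
The `zip(*batch)` idiom in `_collate` transposes a list of per-sample tuples into one tuple per field, which is what lets each field be batched separately. A small sketch, with placeholder strings standing in for the `MolGraph` objects:

```python
# Each sample is (first_input, second_input, target); strings stand in for MolGraphs
batch = [
    ("graph_a", "graph_b", 0.5),
    ("graph_c", "graph_d", -1.25),
    ("graph_e", "graph_f", 2.0),
]

# zip(*batch) regroups the samples field by field
firsts, seconds, ys = zip(*batch)

print(firsts)
print(seconds)
print(ys)
```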

In [42]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=_collate)

In [None]:
for batch in train_loader:
    print(batch)
    break

In [None]:
# TODO: make a list of MolGraph objects

In [37]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
import torch
import numpy as np

## Implementing Pairwise Regression

`nepare` provides a number of convenience classes that handle training, validation, and testing augmentation automatically.

In [15]:
from nepare.nn import NeuralPairwiseRegressor as NPR
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset
from nepare.inference import predict

In [16]:
from chemprop.data import BatchMolGraph

In [17]:
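# MolGraph pairs cannot be batched by the default collate function, so reuse _collate
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=_collate)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, collate_fn=_collate)
# assumption: inference samples have the same three-item layout; if not, this loader needs its own collate function
predict_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, collate_fn=_collate)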

In [22]:
nt = torch.nested.nested_tensor([[1,2],[2]])

In [None]:
nt.size(0)

In [None]:
for batch in train_loader:
    print(batch)
    break

## Building the Learnable Embedding Module

In [33]:
from nepare.nn import LearnedEmbeddingNeuralPairwiseRegressor

In [26]:
from chemprop import nn  # provides Predictor, BondMessagePassing, and MeanAggregation


class PassThroughPredictor(nn.Predictor):
    def forward(self, Z):
        return Z

    def train_step(self, Z):
        return Z

    def encode(self, Z, i: int):
        return Z

In [35]:
from chemprop.conf import DEFAULT_HIDDEN_DIM
from chemprop.models import MPNN

In [None]:
from chemprop.nn.agg import MeanAggregation

In [30]:
class ChemPropEmbedder(torch.nn.Module):
    def __init__(self, mp, agg):
        super().__init__()
        self.mp = mp
        self.agg = agg

    def forward(self, bmg):
        H = self.mp(bmg)
        Z = self.agg(H, bmg.batch)
        return Z
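
The embedder's aggregation step pools per-atom hidden states into one vector per molecule, using the `batch` index to record which molecule each atom belongs to. Below is a NumPy sketch of mean pooling under that assumption. It illustrates the idea and is not ChemProp's implementation; `mean_aggregate` is a hypothetical name.

```python
import numpy as np


def mean_aggregate(H: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Average the rows of H that share the same index in batch."""
    n_graphs = int(batch.max()) + 1
    sums = np.zeros((n_graphs, H.shape[1]))
    np.add.at(sums, batch, H)  # unbuffered scatter-add of atom rows into their molecule's row
    counts = np.bincount(batch, minlength=n_graphs).reshape(-1, 1)
    return sums / counts


# Five atoms with 2-dimensional hidden states, belonging to two molecules
H = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0], [2.0, 2.0]])
batch = np.array([0, 0, 0, 1, 1])

Z = mean_aggregate(H, batch)
print(Z)
```

Because the aggregation needs both the hidden states and the batch index, a wrapper with an explicit `forward` is a natural fit for chaining the two steps.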

In [31]:
mp = nn.BondMessagePassing()
agg = nn.MeanAggregation()
embedder = ChemPropEmbedder(mp, agg)

In [28]:
# must be a learnable module that takes two inputs of some arbitrary type and generates a vector representation

ptp = PassThroughPredictor()

# torch.nn.Sequential cannot pass `bmg.batch` to the aggregation, so use the wrapper class instead
embedder = ChemPropEmbedder(mp, agg)
# embedder = MPNN(mp, agg, ptp)

These networks can overfit very quickly, so we will use `EarlyStopping` to stop training once we start overfitting and then reset the network to _just before_ it overfit.
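
The logic behind patience-based early stopping with best-checkpoint restoration can be sketched in a few lines. This is a simplified illustration with made-up validation losses, not Lightning's implementation; `best_epoch_with_patience` is a hypothetical helper.

```python
def best_epoch_with_patience(val_losses: list[float], patience: int) -> tuple[int, int]:
    """Return (epoch of the best loss, epoch at which training stops)."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop early; the best checkpoint is from best_epoch
    return best_epoch, len(val_losses) - 1


# Made-up losses: improvement until epoch 2, then overfitting
losses = [1.0, 0.6, 0.5, 0.55, 0.7, 0.9, 1.2]
best, stopped = best_epoch_with_patience(losses, patience=3)
print(best, stopped)
```

Training halts a few epochs after the best validation loss, and the checkpoint from the best epoch is the one that gets reloaded.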

In [None]:
npr = LearnedEmbeddingNeuralPairwiseRegressor(embedder, DEFAULT_HIDDEN_DIM, 100, 2)
early_stopping = EarlyStopping(monitor="validation/loss", patience=10)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")

In [None]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)  # the validation loader is needed for the monitored "validation/loss"

In [19]:
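# reload best model based on early stopping (the class must match the trained model; the embedder may need to be passed again)
npr = LearnedEmbeddingNeuralPairwiseRegressor.load_from_checkpoint(model_checkpoint.best_model_path)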

In [None]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")
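
At inference time, a pairwise model predicts the difference between an unknown molecule and each training 'anchor' with a known target. Adding the anchor's target back gives one estimate per anchor, and the spread of those estimates gives an uncertainty. The sketch below illustrates this with made-up numbers; it is an assumption about what `y_pred` and `y_stdev` represent, not a copy of `nepare.inference.predict`.

```python
from statistics import mean, pstdev

# Known targets of three anchor (training) molecules, made up for illustration
anchor_targets = [1.0, 2.0, 0.5]
# Hypothetical model outputs: predicted (unknown - anchor) for one test molecule
predicted_deltas = [0.6, -0.5, 1.2]

# Each anchor gives its own estimate of the unknown target
estimates = [y_anchor + delta for y_anchor, delta in zip(anchor_targets, predicted_deltas)]

y_pred_sketch = mean(estimates)     # point prediction
y_stdev_sketch = pstdev(estimates)  # disagreement between anchors

print(estimates, y_pred_sketch, y_stdev_sketch)
```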

If we had re-scaled the target, we would have to undo that scaling like this:

```python
y_pred = inverse_standard_scale(torch.tensor(y_pred), target_means, target_vars)
```
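
For reference, standard scaling and its inverse look like the following. This sketch assumes, based on its arguments, that `inverse_standard_scale` multiplies by the square root of the variance and adds the mean back. The function names here are hypothetical stand-ins, so check the actual helper before relying on this.

```python
import math


def standard_scale_sketch(values: list[float], mean: float, var: float) -> list[float]:
    """Shift to zero mean and scale to unit variance."""
    return [(v - mean) / math.sqrt(var) for v in values]


def inverse_standard_scale_sketch(scaled: list[float], mean: float, var: float) -> list[float]:
    """Undo standard_scale_sketch: multiply by the standard deviation, then add the mean."""
    return [s * math.sqrt(var) + mean for s in scaled]


targets = [1.0, 2.0, 3.0, 4.0]
target_mean = sum(targets) / len(targets)
target_var = sum((t - target_mean) ** 2 for t in targets) / len(targets)

scaled = standard_scale_sketch(targets, target_mean, target_var)
restored = inverse_standard_scale_sketch(scaled, target_mean, target_var)
print(scaled)
print(restored)
```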

## Finally, Results!

In [21]:
results = benchmark.evaluate(y_pred)

In [22]:
results.name = "nepare"
results.description = "Neural Pairwise Regression with a ChemProp Learned Embedding"
results.github_url = "https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_adme.ipynb"

In [None]:
results

As of writing, this method lands third on the leaderboard, just barely losing out to a couple of 1 _billion_ parameter MPNN-based foundation models.
We've achieved pretty similar performance (without any tuning!) in just a few minutes - pretty good!

This last line is commented out because it will fail (unless you are me) - you can replace the `owner` with your own name to upload your results (and also update the link, name, and description above).

In [None]:
# results.upload_to_hub(owner="jacksonburns", access="public")