# `polaris adme`

This notebook demonstrates Neural Pairwise Regression (via `nepare`) on Mordred(-community) molecular descriptors, evaluated with the `polaris` benchmarking library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - fastprop
 - mordredcommunity
 - rdkit
 - lightning
 - torch
 - numpy
 - ipywidgets

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [1]:
import polaris as po
import pandas as pd

In [2]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"

`polaris` makes it really easy to run different benchmarks quickly - just change the name inside `load_benchmark` to try something else.
I'm using this same notebook for a few different benchmarks, all from the Fang et al. Biogen ADME paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00160), which have been made conveniently available on `polaris`.

In [3]:
%%capture
# https://polarishub.io/benchmarks/polaris/adme-fang-rppb-1
# benchmark = po.load_benchmark("polaris/adme-fang-RPPB-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-solu-1
# benchmark = po.load_benchmark("polaris/adme-fang-SOLU-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-hppb-1
benchmark = po.load_benchmark("polaris/adme-fang-HPPB-1")
# https://polarishub.io/benchmarks/polaris/pkis1-ret-wt-mut-r-1
# benchmark = po.load_benchmark("polaris/pkis1-ret-wt-mut-r-1")
# https://polarishub.io/benchmarks/polaris/pkis1-kit-wt-mut-r-1
# benchmark = po.load_benchmark("polaris/pkis1-kit-wt-mut-r-1")


In [4]:
train, test = benchmark.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [5]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

In [6]:
train_df

Unnamed: 0,smiles,LOG_HPPB
88,CN(Cc1ccccc1)C1(C(=O)N2CCNC(=O)CC2)Cc2ccccc2C1,0.904932
58,Cc1cnc(C(=O)NCCc2ccc(S(=O)(=O)NC(=O)NC3CCCCC3)...,0.017451
79,Cc1c[nH]c2nccc(Oc3c(F)cc(Nc4cc(Cl)nc(N)n4)cc3F...,-1.221849
80,CNC(=O)C1(Cc2ccc(-c3cccnc3)cc2)CCN(Cc2cccc(F)c...,0.829754
39,NC1CC(NC(=O)c2ccc(-c3cn[nH]c3)cn2)C12CCC2,1.709702
...,...,...
26,Cc1oc2ccccc2c1CNc1nnc(-c2ccncc2)o1,0.717254
110,CC(=O)N1CCN(c2c(Cl)cccc2NC(=O)COc2ccccc2Cl)CC1,0.256958
94,COc1nn(C)cc1C(=O)Nc1cccc2cnccc12,1.504729
4,CC(=O)Nc1ccc(C(=O)N2CCCCC2c2nc(N)ncc2-c2ccc(Cl...,0.596927


## Featurize the Molecules with `mordred(community)`
We use `mordred` to calculate a vector of molecular descriptors for each species in this dataset, and then do some re-scaling and imputing to prepare the data for use.

In [7]:
from mordred import Calculator, descriptors
from rdkit.Chem import MolFromSmiles

In [8]:
calc = Calculator(descriptors, ignore_3D=True)

In [9]:
train_features = calc.pandas(map(MolFromSmiles, train_df["smiles"]), nmols=len(train_df)).fill_missing()
test_features = calc.pandas(map(MolFromSmiles, test_df["smiles"]), nmols=len(test_df)).fill_missing()



In [10]:
train_features

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,22.044380,17.873084,0,1,36.754262,2.548742,4.977040,36.754262,1.312652,4.272717,...,10.317020,78.137325,377.210327,6.858370,1909,49,150.0,179.0,8.090278,6.138889
1,23.878481,17.487201,0,0,38.878448,2.371316,4.742632,38.878448,1.254143,4.342553,...,10.163426,66.435015,445.178375,7.675489,3430,44,156.0,176.0,10.340278,6.819444
2,22.265059,17.391952,0,0,35.459035,2.459516,4.869993,35.459035,1.266394,4.275317,...,10.320321,77.580074,402.080743,9.806847,2161,45,152.0,180.0,9.222222,5.972222
3,23.451254,17.780215,0,1,39.351058,2.469754,4.851481,39.351058,1.311702,4.334236,...,10.297453,79.689919,403.205991,7.200107,2701,46,158.0,185.0,8.590278,6.652778
4,17.692350,14.676448,0,1,28.110239,2.582583,5.165118,28.110239,1.277738,4.072959,...,10.540355,70.744181,297.158960,7.247780,1163,31,124.0,152.0,5.979167,4.722222
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,18.332406,14.490070,0,0,30.950737,2.445783,4.709454,30.950737,1.345684,4.089797,...,10.007983,73.401695,306.111676,8.273289,1311,33,124.0,147.0,5.527778,5.027778
122,21.557952,16.847265,0,0,35.432414,2.411282,4.822565,35.432414,1.265443,4.246539,...,10.140573,63.090395,421.095997,8.593796,2246,43,142.0,164.0,9.500000,6.222222
123,16.320475,13.753705,0,0,27.437194,2.415496,4.803374,27.437194,1.306533,3.980961,...,9.946882,68.861249,282.111676,8.060334,929,32,110.0,130.0,6.527778,4.694444
124,25.053046,18.938515,0,0,41.442231,2.465220,4.930439,41.442231,1.295070,4.393731,...,10.382234,68.062338,449.161853,8.020747,2975,52,168.0,197.0,10.222222,7.000000


In [11]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
import torch
import numpy as np

In [12]:
X = torch.tensor(train_features.to_numpy(dtype=np.float32), dtype=torch.float32)
y = torch.tensor(train_df[list(benchmark.target_cols)].to_numpy(dtype=np.float32), dtype=torch.float32) # keep it 2d!
X_test = torch.tensor(test_features.to_numpy(dtype=np.float32), dtype=torch.float32)

In [13]:
val_idx = int(len(train_df) * 0.20)  # hold out the first 20% of the shuffled training rows for validation

In [14]:
from fastprop.data import standard_scale, inverse_standard_scale

In [15]:
X[val_idx:], means, vars = standard_scale(X[val_idx:])
X[:val_idx] = standard_scale(X[:val_idx], means, vars)
X_test = standard_scale(X_test, means, vars)
# sorta-Winsorization
X.clamp_(-3, 3)
X_test.clamp_(-3, 3)

tensor([[-0.1720,  0.2261, -0.2021,  ...,  0.1601, -0.0423, -0.2105],
        [-0.6350, -0.2958, -0.2021,  ..., -0.5119, -1.2848, -0.5902],
        [ 0.2132,  0.3225, -0.2021,  ...,  0.1825, -0.2952,  0.4091],
        ...,
        [-0.6659, -0.6370, -0.2021,  ..., -0.6239, -0.0610, -0.7101],
        [-0.5371, -0.2711, -0.2021,  ..., -0.5119,  0.0513, -0.5103],
        [-0.4327, -0.5557,  3.0000,  ..., -0.4447, -0.8977, -0.5902]])

In [16]:
y[val_idx:], target_means, target_vars = standard_scale(y[val_idx:])
y[:val_idx] = standard_scale(y[:val_idx], target_means, target_vars)
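
The scaling cells above fit the mean and variance on the training rows only, reuse those statistics for the validation and test rows, and then clamp everything to ±3. Here is a minimal pure-Python sketch of that idea for a single feature column. `fit_scaler`, `apply_scaler`, `clamp`, and `unscale` are hypothetical helpers written for illustration. They are not the `fastprop` functions, which may differ in details such as the variance estimator or the handling of missing and constant columns.

```python
from statistics import fmean, pvariance


def fit_scaler(column):
    """Return (mean, variance) of one feature column, fitted on training rows only."""
    return fmean(column), pvariance(column)


def apply_scaler(column, mean, var):
    """Standardize values with previously fitted statistics (assumes var > 0)."""
    std = var ** 0.5
    return [(v - mean) / std for v in column]


def clamp(column, low=-3.0, high=3.0):
    """Limit extreme values, similar in spirit to Winsorization."""
    return [min(max(v, low), high) for v in column]


def unscale(column, mean, var):
    """Undo apply_scaler, e.g. to bring predictions back to the original units."""
    std = var ** 0.5
    return [v * std + mean for v in column]


train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [3.0, 30.0]  # the second value is an outlier relative to the training data

mean, var = fit_scaler(train_col)
train_scaled = clamp(apply_scaler(train_col, mean, var))
test_scaled = clamp(apply_scaler(test_col, mean, var))  # outlier is clipped to 3.0
restored = unscale(apply_scaler(train_col, mean, var), mean, var)
```

Fitting the statistics on the training rows alone keeps information about the validation and test sets from leaking into the model inputs.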

## Implementing Pairwise Regression

`nepare` provides a number of convenience classes that handle training, validation, and testing augmentation automatically.

In [17]:
from nepare.nn import NeuralPairwiseRegressor as NPR
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset
from nepare.inference import predict

In [18]:
training_dataset = PairwiseAugmentedDataset(X[val_idx:], y[val_idx:], how='full')
validation_dataset = PairwiseAnchoredDataset(X[val_idx:], y[val_idx:], X[:val_idx], y[:val_idx], how='full')
predict_dataset = PairwiseInferenceDataset(X[val_idx:], y[val_idx:], X_test, how='full')
train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=64)
predict_loader = torch.utils.data.DataLoader(predict_dataset, batch_size=64)
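
To make the idea behind these dataset classes more concrete, here is a minimal pure-Python sketch of pairwise regression as it is generally described: train on the differences between pairs of labeled points, then predict an unknown point by pairing it with labeled anchors. This is an illustration under assumptions, not `nepare`'s implementation. `make_pairs`, `predict_from_anchors`, and the `true_delta` stand-in model are hypothetical names.

```python
from itertools import product
from statistics import fmean, pstdev


def make_pairs(X, y):
    """Build all ordered pairs (i, j); the target is the difference y[i] - y[j]."""
    return [((X[i], X[j]), y[i] - y[j]) for i, j in product(range(len(X)), repeat=2)]


def predict_from_anchors(delta_model, x_new, X_anchor, y_anchor):
    """Estimate y_new as (anchor label + predicted difference), aggregated over anchors."""
    estimates = [y_a + delta_model(x_new, x_a) for x_a, y_a in zip(X_anchor, y_anchor)]
    return fmean(estimates), pstdev(estimates)


# Toy data where y = 2 * x
X = [[0.0], [1.0], [2.0]]
y = [0.0, 2.0, 4.0]

pairs = make_pairs(X, y)  # 3 points give 3 * 3 = 9 ordered pairs


def true_delta(x_a, x_b):
    """Stand-in for a trained network: the exact difference for this toy problem."""
    return 2.0 * (x_a[0] - x_b[0])


y_hat, spread = predict_from_anchors(true_delta, [3.0], X, y)
```

Because every anchor yields its own estimate, the spread across anchors is one natural source for an uncertainty estimate alongside the averaged prediction.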

These networks can overfit very quickly, so we will use `EarlyStopping` to stop training once we start overfitting and then reset the network to _just before_ it overfit.

In [19]:
npr = NPR(X.shape[1], 100, 3, n_targets=len(benchmark.target_cols))
early_stopping = EarlyStopping(monitor="validation/loss", patience=6)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")
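
The patience logic behind this kind of early stopping can be sketched in a few lines of plain Python. The hypothetical `early_stop` helper below is a simplified illustration only. It ignores options that the real Lightning callbacks offer (such as `min_delta`) and just reports when training would halt and which epoch a best-loss checkpoint would correspond to.

```python
def early_stop(val_losses, patience=6):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it (this is where a checkpoint would be saved)
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9]
stop_epoch, best_epoch = early_stop(losses, patience=3)
```

In this toy run the loss last improves at epoch 2, so training halts three epochs later and the epoch-2 weights are the ones worth reloading.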

In [20]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type       | Params | Mode 
--------------------------------------------
0 | fnn  | Sequential | 343 K  | train
--------------------------------------------
343 K     Trainable params
0         Non-trainable params
343 K     Total params
1.372     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode



In [21]:
npr = NPR.load_from_checkpoint(model_checkpoint.best_model_path)  # reload best model based on early stopping

In [22]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")


In [23]:
y_pred = inverse_standard_scale(torch.tensor(y_pred), target_means, target_vars)



In [24]:
predictions = {
    name: y_pred[:, i] for i, name in enumerate(benchmark.target_cols)
}

## Finally, Results!

In [25]:
results = benchmark.evaluate(predictions if len(benchmark.target_cols) > 1 else np.array(y_pred).flatten())

In [26]:
results.name = "nepare"
results.description = "Neural Pairwise Regression with Mordred(-community) Molecular Descriptors"
results.github_url = "https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_adme.ipynb"

In [27]:
results

test_set,target_label,metric,score
test,LOG_HPPB,explained_var,0.6212714259646027
test,LOG_HPPB,spearmanr,0.7806555145405341
test,LOG_HPPB,mean_squared_error,0.30329107077970924
test,LOG_HPPB,pearsonr,0.8094660450548139
test,LOG_HPPB,r2,0.4992211495855208
test,LOG_HPPB,mean_absolute_error,0.461897259822166

name,nepare
description,Neural Pairwise Regression with Mordred(-community) Molecular Descriptors
tags,
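
The `explained_var` and `r2` scores above differ noticeably. Under the common definitions of these metrics (assumed here; the exact implementation `polaris` uses was not checked), the two agree only when the residuals average to zero, so a gap between them points to a systematic offset in the predictions. A minimal pure-Python sketch with made-up numbers and hypothetical helper names shows the effect:

```python
from statistics import fmean, pvariance


def r2_score(y_true, y_pred):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); insensitive to a constant offset."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


y_true = [0.0, 1.0, 2.0, 3.0]
y_biased = [t + 0.5 for t in y_true]  # perfect trend, constant offset of 0.5

r2 = r2_score(y_true, y_biased)
ev = explained_variance(y_true, y_biased)
```

Here the offset leaves the explained variance at its maximum while lowering R², which is the same direction as the gap seen in the table.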


As of writing, this method lands third on the leaderboard, just barely losing out to a couple of 1 _billion_ parameter MPNN-based foundation models.
We've achieved pretty similar performance (without any tuning!) in just a few minutes - pretty good!

This last line is commented out because it will fail (unless you are me). To upload your own results, replace the `owner` with your own name and update the link, name, and description above.

In [29]:
# results.upload_to_hub(owner="jacksonburns", access="public")