# `polaris adme`

This notebook demonstrates Neural Pairwise Regression (via `nepare`) on Mordred(-community) molecular descriptors, benchmarked with the `polaris` library.

## Requirements
Python 3.10+ (originally run on 3.12)
 - polaris-lib
 - pandas
 - fastprop
 - mordredcommunity
 - rdkit
 - lightning
 - torch
 - numpy
 - ipywidgets

You will also need to run `pip install .` in the repository's root directory to install `nepare`.

## `polaris` Setup

After running `polaris login` on the command line, we can import everything (checking that the version is recent enough) and then download the benchmark data.

In [1]:
import polaris as po
import pandas as pd

In [2]:
from packaging.version import Version
assert Version(po.__version__) >= Version("0.11.6"), "test.as_dataframe does not work in earlier versions of Polaris, please upgrade"

`polaris` makes it really easy to run different benchmarks quickly - just change the name inside `load_benchmark` to try something else.
I'm using this same notebook for a few different benchmarks, all from the Fang et al. Biogen ADME paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.3c00160), which have been made conveniently available on `polaris`.

In [3]:
%%capture
# https://polarishub.io/benchmarks/polaris/adme-fang-rppb-1
benchmark = po.load_benchmark("polaris/adme-fang-RPPB-1")
# https://polarishub.io/benchmarks/polaris/adme-fang-solu-1
# benchmark = po.load_benchmark("polaris/adme-fang-SOLU-1")

In [4]:
train, test = benchmark.get_train_test_split()
test_df: pd.DataFrame = test.as_dataframe()
train_df: pd.DataFrame = train.as_dataframe()

We'll shuffle the data just for good measure.

In [5]:
train_df = train_df.sample(frac=1.0, random_state=1701)  # shuffle the training data

## Featurize the Molecules with `mordred(community)`
We use `mordred` to calculate a vector of molecular descriptors for each species in this dataset, and then do some re-scaling and imputing to prepare the data for use.

In [6]:
from mordred import Calculator, descriptors
from rdkit.Chem import MolFromSmiles

In [7]:
calc = Calculator(descriptors, ignore_3D=True)

In [8]:
train_features = calc.pandas(map(MolFromSmiles, train_df["smiles"]), nmols=len(train_df)).fill_missing()
test_features = calc.pandas(map(MolFromSmiles, test_df["smiles"]), nmols=len(test_df)).fill_missing()

100%|██████████| 111/111 [00:02<00:00, 52.37it/s]
100%|██████████| 24/24 [00:00<00:00, 39.24it/s]


In [9]:
train_features

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,16.958632,13.546673,0,0,28.642859,2.441470,4.882941,28.642859,1.301948,4.021066,...,10.014984,56.167647,297.111341,8.030036,1054,35,114.0,134.0,6.777778,5.000000
1,24.196109,17.719041,0,0,39.042073,2.479021,4.953308,39.042073,1.301402,4.381167,...,10.558388,84.851780,405.160103,8.103202,2706,49,168.0,203.0,8.833333,6.361111
2,24.110580,17.106061,0,0,40.912296,2.363271,4.726543,40.912296,1.319751,4.353586,...,10.087433,66.284276,418.211724,7.337048,3104,42,156.0,176.0,7.527778,6.944444
3,22.494166,17.802912,0,1,37.841697,2.469719,4.850692,37.841697,1.351489,4.291692,...,10.311117,79.272896,372.175025,7.595409,2088,44,154.0,185.0,6.500000,6.083333
4,12.495823,11.308105,0,0,20.141728,2.358666,4.717332,20.141728,1.184808,3.715821,...,9.507106,49.119415,276.033541,9.201118,566,22,80.0,88.0,7.395833,3.875000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,20.838915,17.520490,0,0,33.703458,2.553484,5.026050,33.703458,1.248276,4.221641,...,10.312911,76.668244,370.180504,7.403610,1792,46,142.0,170.0,10.472222,5.972222
107,16.401355,13.178323,0,0,27.516148,2.357497,4.672669,27.516148,1.310293,4.000209,...,9.789030,72.959236,282.111676,8.060334,1031,28,108.0,124.0,5.916667,4.638889
108,21.526450,17.525930,0,0,35.070347,2.479304,4.867733,35.070347,1.298902,4.244547,...,10.300618,78.156175,397.076409,9.234335,1936,42,148.0,176.0,8.451389,5.736111
109,23.201273,16.661749,0,1,39.311739,2.533536,4.911786,39.311739,1.355577,4.321365,...,10.347757,79.133264,386.185509,7.572265,2342,47,158.0,189.0,6.750000,6.333333


In [10]:
import lightning
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
import torch
import numpy as np

In [11]:
X = torch.tensor(train_features.to_numpy(dtype=np.float32), dtype=torch.float32)
y = torch.tensor(train_df["LOG_RPPB"].to_numpy(dtype=np.float32), dtype=torch.float32)[:, None]  # keep it 2d!
X_test = torch.tensor(test_features.to_numpy(dtype=np.float32), dtype=torch.float32)

In [12]:
val_idx = 12  # use n/110 for validation

In [13]:
from fastprop.data import standard_scale, inverse_standard_scale

In [14]:
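# fit the scaler on the training rows only, then reuse its statistics elsewhere
X[val_idx:], means, variances = standard_scale(X[val_idx:])
X[:val_idx] = standard_scale(X[:val_idx], means, variances)
X_test = standard_scale(X_test, means, variances)
# sorta-Winsorization: clip every scaled feature to [-3, 3]
X.clamp_(-3, 3)
X_test.clamp_(-3, 3)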

tensor([[-1.4571, -1.5857, -0.2616,  ..., -1.6617, -0.9991, -1.2891],
        [ 0.4292,  0.3898, -0.2616,  ...,  0.3805,  0.0612,  0.5553],
        [ 0.1896,  0.2196, -0.2616,  ...,  0.2329,  0.3545,  0.0052],
        ...,
        [ 0.0646,  0.4525, -0.2616,  ...,  0.0114,  1.0507,  0.1347],
        [ 1.4779,  1.4945, -0.2616,  ...,  1.6354,  1.4375,  1.4721],
        [-1.7885, -1.7768, -0.2616,  ..., -1.4895, -1.7179, -1.9685]])
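
The scaling step above can be illustrated without any dependencies. The sketch below is not `fastprop`'s implementation (its handling of variance estimation and missing values may differ); the helpers `fit_scaler` and `scale_and_clamp` and the toy data are hypothetical. It only shows the pattern used in this notebook: compute statistics from the training rows, apply them to held-out rows, then clip to ±3.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Return per-column means and standard deviations computed from `rows`."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # fall back to 1.0 for constant columns to avoid dividing by zero
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def scale_and_clamp(rows, means, stdevs, limit=3.0):
    """Standardize each column, then clip every value to [-limit, limit]."""
    return [
        [max(-limit, min(limit, (value - m) / s)) for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
held_out_rows = [[2.0, 500.0]]  # the second feature is an extreme outlier

means, stdevs = fit_scaler(train_rows)  # statistics come from training rows only
scaled_train = scale_and_clamp(train_rows, means, stdevs)
scaled_held_out = scale_and_clamp(held_out_rows, means, stdevs)
```

Here the held-out outlier ends up at the clip limit instead of dominating the input, which is the point of the clamping step.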

We could also rescale the targets like this:

```python
y, target_means, target_vars = standard_scale(y)
```

But their natural scale is already pretty close to what we want, so we won't bother (I've tried, and it doesn't significantly impact performance).

## Implementing Pairwise Regression

`nepare` provides a number of convenience classes that handle training, validation, and testing augmentation automatically.

In [15]:
from nepare.nn import NeuralPairwiseRegressor as NPR
from nepare.data import PairwiseAugmentedDataset, PairwiseAnchoredDataset, PairwiseInferenceDataset
from nepare.inference import predict

In [16]:
training_dataset = PairwiseAugmentedDataset(X[val_idx:], y[val_idx:], how='full')
validation_dataset = PairwiseAnchoredDataset(X[val_idx:], y[val_idx:], X[:val_idx], y[:val_idx], how='full')
predict_dataset = PairwiseInferenceDataset(X[val_idx:], y[val_idx:], X_test, how='full')
train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(validation_dataset, batch_size=64)
predict_loader = torch.utils.data.DataLoader(predict_dataset, batch_size=64)
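
To make the idea of pairwise augmentation concrete, here is a minimal, dependency-free sketch. It is not `nepare`'s implementation: the helper `make_training_pairs`, the choice to concatenate the two feature vectors, the inclusion of self-pairs, and the sign of the label are all assumptions made for illustration.

```python
from itertools import product


def make_training_pairs(features, targets):
    """Pair every sample with every sample (itself included).

    Each pair's input is the two feature vectors joined together, and its
    label is the difference between the two targets.
    """
    pairs = []
    for i, j in product(range(len(features)), repeat=2):
        pair_input = features[i] + features[j]  # list concatenation
        pair_label = targets[i] - targets[j]
        pairs.append((pair_input, pair_label))
    return pairs


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy descriptors
targets = [1.0, 1.5, 3.0]                        # toy targets
pairs = make_training_pairs(features, targets)
print(len(pairs))  # 3 samples become 3 * 3 = 9 training pairs
```

Under this assumption, n training molecules produce on the order of n² training examples, which is why even a small dataset can keep a network busy.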

These networks can overfit very quickly, so we will use `EarlyStopping` to stop training once we start overfitting and `ModelCheckpoint` to reset the network to _just before_ it overfit.

In [17]:
npr = NPR(X.shape[1], 50, 2)
early_stopping = EarlyStopping(monitor="validation/loss", patience=10)
model_checkpoint = ModelCheckpoint(monitor="validation/loss")
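
As a rough illustration of what a `patience` setting means, the sketch below replays a made-up list of validation losses through a simplified stopping rule. The function `best_epoch_with_patience` is hypothetical and ignores options that the real callbacks offer (such as a minimum improvement threshold); it only shows why the best epoch and the stopping epoch differ.

```python
def best_epoch_with_patience(validation_losses, patience):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(validation_losses) - 1  # default: trained to the end
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # new best: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch


losses = [0.90, 0.60, 0.45, 0.47, 0.50, 0.52, 0.55]  # invented values
best_epoch, stop_epoch = best_epoch_with_patience(losses, patience=3)
```

Training halts a few epochs after the best one, so the weights at the end of training are already slightly overfit. That gap is the reason for reloading the checkpointed best model afterwards.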

In [18]:
trainer = lightning.Trainer(max_epochs=50, log_every_n_steps=1, callbacks=[early_stopping, model_checkpoint])
trainer.fit(npr, train_loader, val_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type       | Params | Mode 
--------------------------------------------
0 | fnn  | Sequential | 163 K  | train
--------------------------------------------
163 K     Trainable params
0         Non-trainable params
163 K     Total params
0.656     Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode



In [19]:
npr = NPR.load_from_checkpoint(model_checkpoint.best_model_path)  # reload best model based on early stopping

In [20]:
y_pred, y_stdev = predict(npr, predict_loader, how="all")

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



If we had re-scaled the target, we would have to undo that scaling like this:

```python
y_pred = inverse_standard_scale(torch.tensor(y_pred), target_means, target_vars)
```
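
The `predict` call returns both a prediction and a standard deviation for each test molecule. One way to picture where those two numbers could come from is sketched below. This is a conceptual illustration, not `nepare`'s code: the helper `predict_from_anchors`, the sign convention (difference = test minus anchor), and the use of a population standard deviation are all assumptions.

```python
from statistics import fmean, pstdev


def predict_from_anchors(predicted_differences, anchor_targets):
    """Combine per-anchor difference predictions into one estimate.

    predicted_differences[k] is the model's guess for (y_test - y_anchor_k).
    """
    estimates = [
        anchor + difference
        for anchor, difference in zip(anchor_targets, predicted_differences)
    ]
    # the mean is the prediction; the spread hints at its uncertainty
    return fmean(estimates), pstdev(estimates)


anchor_targets = [1.0, 2.0, 4.0]          # known training targets
predicted_differences = [1.1, 0.0, -1.9]  # hypothetical model outputs
y_hat, y_spread = predict_from_anchors(predicted_differences, anchor_targets)
```

Each training molecule acts as an anchor that "votes" on the test value; close agreement between anchors gives a small spread.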

## Finally, Results!

In [21]:
results = benchmark.evaluate(y_pred)

In [22]:
results.name = "nepare"
results.description = "Neural Pairwise Regression with Mordred(-community) Molecular Descriptors"
results.github_url = "https://github.com/JacksonBurns/neural-pairwise-regression/blob/main/notebooks/polaris_adme.ipynb"

In [23]:
results

name,nepare
description,Neural Pairwise Regression with Mordred(-community) Molecular Descriptors
tags,
test_set,test
target_label,LOG_RPPB

metric,score
r2,0.6565050593916393
mean_squared_error,0.30518906672948914
spearmanr,0.8243478260869566
pearsonr,0.8126869707267603
explained_var,0.657950659759234
mean_absolute_error,0.41379897609924793
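
For readers less familiar with the reported metrics, the sketch below computes three of them from their standard textbook definitions on toy data. `polaris` computes the real scores itself; the function `regression_metrics` and the numbers here are purely illustrative.

```python
from statistics import fmean


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R^2 from two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = fmean([e * e for e in errors])
    mae = fmean([abs(e) for e in errors])
    mean_true = fmean(y_true)
    # variance of the true values: the error of always predicting the mean
    total_variance = fmean([(t - mean_true) ** 2 for t in y_true])
    r2 = 1.0 - mse / total_variance
    return {"mean_squared_error": mse, "mean_absolute_error": mae, "r2": r2}


y_true = [1.0, 2.0, 3.0, 4.0]  # toy values, unrelated to the benchmark
y_pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(y_true, y_pred)
```

An R² of 1 would mean perfect predictions, while 0 would mean doing no better than always predicting the mean of the test targets.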


As of writing, this method lands third on the leaderboard, just barely losing out to a couple of 1 _billion_ parameter MPNN-based foundation models.
We've achieved pretty similar performance (without any tuning!) in just a few minutes, which is pretty good!

This last line is commented out because it will fail (unless you are me). To upload your own results, replace the `owner` with your own name and update the link, name, and description above.

In [24]:
# results.upload_to_hub(owner="jacksonburns", access="public")