In [1]:
import polaris as po
import pandas as pd

In [2]:
dataset = po.load_dataset("asap-discovery/antiviral-admet-2025-unblinded")

In [3]:
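# Materialize the Polaris dataset as a DataFrame and use its provided train/test labels.
df = pd.DataFrame(dataset[:])
train_df, test_df = df[df["Set"] == "Train"], df[df["Set"] == "Test"]
# Hold out 20% of the training rows for validation (fixed seed for reproducibility).
val_df = train_df.sample(frac=0.2, random_state=42)
train_df = train_df[~train_df.index.isin(val_df.index)]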

## Retrieving the `CheMeleon` Model

The `CheMeleon` model file is stored on Zenodo at [this link](https://zenodo.org/records/15426601).
Please cite the Zenodo record if you use this model in published work.
You can download it manually, or simply run the cell below to download it programmatically with Python:

In [4]:
from pathlib import Path
from urllib.request import urlretrieve

if not Path("chemeleon_mp.pt").exists():
    urlretrieve(
        r"https://zenodo.org/records/15460715/files/chemeleon_mp.pt",
        "chemeleon_mp.pt",
    )
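
Because the download is skipped whenever `chemeleon_mp.pt` already exists, an interrupted transfer could leave a truncated file behind. A minimal sketch of one way to guard against that, assuming you copy the expected MD5 digest from the file listing on the Zenodo record (the digest is not reproduced here, and `file_md5` is a hypothetical helper, not part of any library):

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder (assumption): paste the checksum listed for the file on Zenodo here.
expected_md5 = None

weights_path = Path("chemeleon_mp.pt")
if expected_md5 is not None and weights_path.exists():
    if file_md5(weights_path) != expected_md5:
        raise RuntimeError(f"{weights_path} looks corrupted; delete it and download again.")
```

With `expected_md5` left as `None` the check is a no-op, so the cell is safe to run before you have looked up the digest.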

## Initializing `CheMeleon`

`CheMeleon` uses the following classes for featurization, message passing, and aggregation:

In [5]:
import torch

from chemprop import featurizers, nn

featurizer = featurizers.SimpleMoleculeMolGraphFeaturizer()
agg = nn.MeanAggregation()
chemeleon_mp = torch.load("chemeleon_mp.pt", weights_only=True)
mp = nn.BondMessagePassing(**chemeleon_mp['hyper_parameters'])
mp.load_state_dict(chemeleon_mp['state_dict'])

<All keys matched successfully>

If you have an existing ChemProp model, you can simply replace your `agg`, `featurizer`, and `mp` with the objects above and immediately take advantage of `CheMeleon`!

## Standard ChemProp Preparation

The code below imports the needed modules, sets up the data, and initializes the ChemProp model.
It's **mostly** the same as the `training` example provided in the ChemProp repository; for a more detailed breakdown, check that notebook.

The one important change is that we must set `input_dim=mp.output_dim` when we initialize our FFN.
This ensures that the dimension of the learned representation from `CheMeleon` matches the input size of the regressor.
Also note that, to make the `CheMeleon` model useful, you must set up your own FFN to regress the targets you care about: in this case the five ADMET endpoints in `target_columns` (HLM, KSOL, LogD, MDR1-MDCKII, and MLM).

In [6]:
from pathlib import Path

from lightning import pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

from chemprop import data, models

chemprop_dir = Path.cwd().parent
num_workers = 0
smiles_column = "CXSMILES"
target_columns = ['HLM', 'KSOL', 'LogD', 'MDR1-MDCKII', 'MLM']

def to_datapoints(frame):
    """Build one MoleculeDatapoint per row from its SMILES string and target values."""
    smiles = frame[smiles_column].to_numpy()
    targets = frame[target_columns].to_numpy()
    return [data.MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smiles, targets)]

train_data = to_datapoints(train_df)
val_data = to_datapoints(val_df)
test_data = to_datapoints(test_df)
# Fit the target scaler on the training set and reuse it for validation.
train_dset = data.MoleculeDataset(train_data, featurizer)
scaler = train_dset.normalize_targets()
val_dset = data.MoleculeDataset(val_data, featurizer)
val_dset.normalize_targets(scaler)
# Test targets stay unscaled; at inference the output transform returns original units.
test_dset = data.MoleculeDataset(test_data, featurizer)
train_loader = data.build_dataloader(train_dset, num_workers=num_workers)
val_loader = data.build_dataloader(val_dset, num_workers=num_workers, shuffle=False)
test_loader = data.build_dataloader(test_dset, num_workers=num_workers, shuffle=False)

# Regression head whose input width matches CheMeleon's output representation.
output_transform = nn.UnscaleTransform.from_standard_scaler(scaler)
ffn = nn.RegressionFFN(
    n_tasks=len(target_columns),
    output_transform=output_transform,
    input_dim=mp.output_dim,
)
metric_list = [nn.metrics.RMSE(), nn.metrics.MAE()]
mpnn = models.MPNN(mp, agg, ffn, batch_norm=False, metrics=metric_list)
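
In the cell above, `normalize_targets()` fits a standard scaler on the training targets, the validation set reuses that scaler, and `UnscaleTransform` maps predictions back to the original units. The idea can be sketched in plain NumPy. The numbers below are made up for illustration, and this is a conceptual sketch rather than ChemProp's internal code:

```python
import numpy as np

# Made-up targets for two tasks; NaN marks a missing measurement.
train_y = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
val_y = np.array([[2.5, 20.0]])

# Fit the statistics on the training set only, skipping missing values.
mean = np.nanmean(train_y, axis=0)
std = np.nanstd(train_y, axis=0)

# Scale both splits with the *training* statistics.
train_scaled = (train_y - mean) / std
val_scaled = (val_y - mean) / std

# A model trained on scaled targets predicts in scaled space; undo the
# scaling to report predictions in the original units.
pred_scaled = np.array([[0.0, 1.0]])
pred = pred_scaled * std + mean
print(pred)
```

Reusing the training statistics for validation keeps the two losses comparable and avoids leaking information from held-out data into the scaler.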

Now we can take a look at the model, which we can see has the huge message passing setup from `CheMeleon`:

In [7]:
mpnn

MPNN(
  (message_passing): BondMessagePassing(
    (W_i): Linear(in_features=86, out_features=2048, bias=False)
    (W_h): Linear(in_features=2048, out_features=2048, bias=False)
    (W_o): Linear(in_features=2120, out_features=2048, bias=True)
    (dropout): Dropout(p=0.0, inplace=False)
    (tau): ReLU()
    (V_d_transform): Identity()
    (graph_transform): Identity()
  )
  (agg): MeanAggregation()
  (bn): Identity()
  (predictor): RegressionFFN(
    (ffn): MLP(
      (0): Sequential(
        (0): Linear(in_features=2048, out_features=300, bias=True)
      )
      (1): Sequential(
        (0): ReLU()
        (1): Dropout(p=0.0, inplace=False)
        (2): Linear(in_features=300, out_features=5, bias=True)
      )
    )
    (criterion): MSE(task_weights=[[1.0, 1.0, 1.0, 1.0, 1.0]])
    (output_transform): UnscaleTransform()
  )
  (X_d_transform): Identity()
  (metrics): ModuleList(
    (0): RMSE(task_weights=[[1.0]])
    (1): MAE(task_weights=[[1.0]])
    (2): MSE(task_weights=[[1.0,
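
The printout shows why `input_dim=mp.output_dim` matters: the message passing block emits 2048-dimensional atom representations, mean aggregation pools them into one 2048-dimensional vector per molecule, and the first FFN layer must accept exactly that width. A shape-only NumPy sketch with random weights (an illustration with a hypothetical 12-atom molecule, not ChemProp's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_mp, d_hidden, n_tasks = 2048, 300, 5  # widths taken from the model printout

# Hypothetical molecule with 12 atoms, each with a d_mp-wide learned embedding.
atom_embeddings = rng.normal(size=(12, d_mp))

# Mean aggregation: average over atoms to get one vector per molecule.
mol_embedding = atom_embeddings.mean(axis=0)

# Two-layer regression head; the first weight matrix must have d_mp rows,
# otherwise the matrix product below raises a shape error.
w1 = rng.normal(size=(d_mp, d_hidden)) * 0.01
w2 = rng.normal(size=(d_hidden, n_tasks)) * 0.01
hidden = np.maximum(mol_embedding @ w1, 0.0)  # ReLU
prediction = hidden @ w2
print(prediction.shape)
```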

## Training

The remainder of this notebook again follows the typical training routine.
With the addition of `CheMeleon`, your model may take longer to train but will (hopefully!) perform better, particularly if your dataset is small!

In [8]:
# Configure model checkpointing
checkpointing = ModelCheckpoint(
    "checkpoints",  # Directory where model checkpoints will be saved
    "best-{epoch}-{val_loss:.2f}",  # Filename format for checkpoints, including epoch and validation loss
    "val_loss",  # Metric used to select the best checkpoint (based on validation loss)
    mode="min",  # Save the checkpoint with the lowest validation loss (minimization objective)
    save_last=True,  # Always save the most recent checkpoint, even if it's not the best
)
trainer = pl.Trainer(
    logger=False,
    enable_checkpointing=True, # Use `True` if you want to save model checkpoints. The checkpoints will be saved in the `checkpoints` folder.
    enable_progress_bar=True,
    accelerator="auto",
    devices=1,
    max_epochs=20, # number of epochs to train for
    callbacks=[checkpointing], # Use the configured checkpoint callback
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
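
With this filename template, Lightning writes checkpoints such as `best-epoch=19-val_loss=0.54.ckpt`, the name that appears in the test log further down. By default the callback prefixes each value with its metric name. A rough pure-Python approximation of that naming, where `format_checkpoint_name` is a hypothetical helper and not Lightning's actual code:

```python
import re


def format_checkpoint_name(template, metrics):
    """Approximate how '{name}' fields become 'name=value' in the filename."""
    # Turn '{epoch}' into 'epoch={epoch}' and '{val_loss:.2f}' into
    # 'val_loss={val_loss:.2f}', then fill in the values.
    with_names = re.sub(r"\{(\w+)", r"\1={\1", template)
    return with_names.format(**metrics) + ".ckpt"


name = format_checkpoint_name(
    "best-{epoch}-{val_loss:.2f}", {"epoch": 19, "val_loss": 0.544}
)
print(name)
```

Including the metric in the filename makes it easy to see at a glance which checkpoint `trainer.test` restored.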


In [9]:
trainer.fit(mpnn, train_loader, val_loader)

/home/jackson/miniconda3/envs/ff_tune/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:654: Checkpoint directory /home/jackson/fastprop_foundation/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.
/home/jackson/miniconda3/envs/ff_tune/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.

  | Name            | Type               | Params | Mode 
---------------------------------------------------------------
0 | message_passing | BondMessagePassing | 8.7 M  | train
1 | agg             | MeanAggregation    | 0      | train
2 | bn              | Identity           | 0      | train
3 | predictor       | RegressionFFN      | 616 K  | t

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/jackson/miniconda3/envs/ff_tune/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Epoch 19: 100%|██████████| 6/6 [00:01<00:00,  5.53it/s, train_loss_step=0.324, val_loss=0.544, train_loss_epoch=0.146] 

`Trainer.fit` stopped: `max_epochs=20` reached.


Evaluate the test-set predictions as shown in the challenge's evaluation examples: https://github.com/asapdiscovery/asap-polaris-blind-challenge-examples/tree/main/evaluation

In [10]:
results = trainer.test(dataloaders=test_loader)

Restoring states from the checkpoint path at /home/jackson/fastprop_foundation/checkpoints/best-epoch=19-val_loss=0.54.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at /home/jackson/fastprop_foundation/checkpoints/best-epoch=19-val_loss=0.54.ckpt
/home/jackson/miniconda3/envs/ff_tune/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Testing DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 15.15it/s]
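
The linked repository defines the official scoring. For a quick local look at per-endpoint errors before that, something like the following could be used. It is only a sketch: the arrays are made-up placeholders standing in for test labels and model predictions, `per_task_mae` is a hypothetical helper, and the metric is a plain MAE that skips missing labels, which is not necessarily what the challenge uses.

```python
import numpy as np

target_columns = ["HLM", "KSOL", "LogD", "MDR1-MDCKII", "MLM"]


def per_task_mae(y_true, y_pred):
    """Mean absolute error per column, ignoring entries whose label is NaN."""
    return np.nanmean(np.abs(y_true - y_pred), axis=0)


# Placeholder data: 3 molecules x 5 endpoints, with NaN for unmeasured values.
y_true = np.array([
    [10.0, 100.0, 1.0, np.nan, 20.0],
    [20.0, np.nan, 2.0, 4.0, 40.0],
    [30.0, 300.0, 3.0, 8.0, np.nan],
])
# Pretend every prediction is off by one unit; where the label is missing the
# model still outputs a number, so fill those entries with an arbitrary value.
y_pred = np.nan_to_num(y_true + 1.0, nan=0.0)

for name, mae in zip(target_columns, per_task_mae(y_true, y_pred)):
    print(f"{name}: {mae:.2f}")
```

Reporting one number per endpoint, rather than a single average, makes it easier to spot a task the model handles poorly.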
