# `CheMeleon` $P_{vap}$

For reproducibility in this demo we will seed everything:

In [1]:
from lightning import seed_everything
seed_everything(42)

Seed set to 42


42

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [3]:
smiles_column = "smiles"

In [None]:
df = pd.read_csv("./kruger_pvap.csv", index_col=smiles_column)
scaler = MinMaxScaler((0, 1))
df["log_vp(pa)"] = scaler.fit_transform(df["log_vp(pa)"])
train_df = df[df["split"] == "train"]
test_df = df[df["split"] == "test"]
val_df = train_df.sample(frac=0.2)
train_df = train_df[~train_df.index.isin(val_df.index)]

## Training

The code below shows how to retrieve the `CheMeleon` model and train it as is typically done with `chemprop` - most of it is adapated from the [`CheMeleon` demo notebook](https://github.com/JacksonBurns/chemeleon/blob/main/finetuning_demo.ipynb).

We'll start by downloading `CheMeleon` if it's not already present:

In [5]:
from pathlib import Path
from urllib.request import urlretrieve

mp_path = Path("chemeleon_mp.pt")
if not mp_path.exists():
  urlretrieve(
      r"https://zenodo.org/records/15460715/files/chemeleon_mp.pt",
      mp_path.name,
  )

We can then instantiate it:

In [6]:
import torch

from chemprop import featurizers, nn

featurizer = featurizers.SimpleMoleculeMolGraphFeaturizer()
agg = nn.MeanAggregation()
chemeleon_mp = torch.load(mp_path, weights_only=True)
mp = nn.BondMessagePassing(**chemeleon_mp['hyper_parameters'])
mp.load_state_dict(chemeleon_mp['state_dict'])

<All keys matched successfully>

And from there we follow the typical `chemprop` training procedure, building our datasets and dataloaders using the dataframes we have already made:

In [7]:
from lightning import pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping
from lightning.pytorch.loggers import TensorBoardLogger

from chemprop import data, models

target_columns = ["log_vp(pa)"]

train_data = [data.MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(train_df.index, train_df[target_columns].to_numpy())]
val_data = [data.MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(val_df.index, val_df[target_columns].to_numpy())]
test_data = [data.MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(test_df.index, test_df[target_columns].to_numpy())]

train_dset = data.MoleculeDataset(train_data, featurizer)
scaler = train_dset.normalize_targets()
val_dset = data.MoleculeDataset(val_data, featurizer)
val_dset.normalize_targets(scaler)
test_dset = data.MoleculeDataset(test_data, featurizer)
train_loader = data.build_dataloader(train_dset, batch_size=64)
val_loader = data.build_dataloader(val_dset, shuffle=False)
test_loader = data.build_dataloader(test_dset, shuffle=False)
output_transform = nn.UnscaleTransform.from_standard_scaler(scaler)
ffn = nn.RegressionFFN(output_transform=output_transform, input_dim=mp.output_dim, hidden_dim=1_800)
mpnn = models.MPNN(mp, agg, ffn, batch_norm=False)

In [8]:
checkpointing = ModelCheckpoint("checkpoints", "best-{epoch}-{val_loss:.2f}", "val_loss", mode="min", save_last=True)
early_stopping = EarlyStopping("val_loss", patience=5)
logger = TensorBoardLogger(save_dir="logs", default_hp_metric=False, name=None)
trainer = pl.Trainer(logger=logger, enable_checkpointing=True, max_epochs=50, callbacks=[checkpointing, early_stopping], deterministic=True)

Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
/home/jwburns/.conda/envs/chemprop_live/lib/python3.12/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /home/jwburns/.conda/envs/chemprop_live/lib/python3. ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [9]:
trainer.fit(mpnn, train_loader, val_loader)

/home/jwburns/.conda/envs/chemprop_live/lib/python3.12/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:658: Checkpoint directory /home/jwburns/chemeleon_pvap/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Loading `train_dataloader` to estimate number of stepping batches.
/home/jwburns/.conda/envs/chemprop_live/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=63` in the `DataLoader` to improve performance.

  | Name            | Type               | Params | Mode 
---------------------------------------------------------------
0 | message_passing | BondMessagePassing | 8.7 M  | train
1 | agg             | MeanAggregation    | 0      | train
2 | bn              | Identity           | 0      | train
3 | predictor       | RegressionFFN     

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/jwburns/.conda/envs/chemprop_live/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=63` in the `DataLoader` to improve performance.


Epoch 15: 100%|██████████| 71/71 [00:12<00:00,  5.71it/s, v_num=2, train_loss_step=0.0209, val_loss=0.134, train_loss_epoch=0.123] 


## Inference and Results

With our model trained, we can now roll it back to the best weights as determined by early stopping and then run inference on our testing dataset.

In [10]:
mpnn = models.MPNN.load_from_checkpoint(trainer.checkpoint_callback.best_model_path, model=mpnn)

In [11]:
preds = torch.cat(trainer.predict(mpnn, test_loader), dim=0)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/home/jwburns/.conda/envs/chemprop_live/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=63` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [00:00<00:00, 12.48it/s]


RMSE is the metric of choice for this study, so we'll import that from `scikit-learn` and check our results.

In [12]:
from sklearn.metrics import mean_absolute_error

In [None]:
mae = mean_absolute_error(scaler.inverse_transform(test_df["log_vp(pa)"].to_numpy()), scaler.inverse_transform(preds.detach().numpy()))
mae

0.027042454780363927