This notebook demonstrates how to train a heterogeneous graph neural network (hetero GNN) on a pre-compiled upconverting nanoparticle (UCNP) dataset (SUNSET-1) to predict UCNP emission intensity.

In [1]:
from NanoParticleTools.machine_learning.models.hetero.intra_inter_model import HeteroDCVModel
from NanoParticleTools.machine_learning.data.datamodule import NPMCDataModule
from NanoParticleTools.machine_learning.data.utils import get_sunset_datasets
from NanoParticleTools.machine_learning.data import SummedWavelengthRangeLabelProcessor
from NanoParticleTools.machine_learning.models.hetero.intra_inter_data import HeteroDCVFeatureProcessor
from NanoParticleTools.machine_learning.util.training import train_uv_model
import warnings
warnings.filterwarnings('ignore')

First, download the SUNSET datasets from [Figshare](https://figshare.com/s/49222bae78f228363897) and point `data_path` at the directory that contains them. Then, we prepare a DataModule using the appropriate feature processor and label processor. Since we are using the HeteroDCVModel, we use the HeteroDCVFeatureProcessor. We also define the label processor, which sums the emission intensity over a chosen wavelength range of the UCNP emission spectrum (here 300 to 450, labeled 'uv').

In [2]:
train_dataset, val_dataset, iid_test_dataset, ood_test_dataset = get_sunset_datasets(
    sunset_ids=1,
    feature_processor_cls=HeteroDCVFeatureProcessor,
    label_processor_cls=SummedWavelengthRangeLabelProcessor,
    data_path=r"C:\Users\ChemeGrad2021\Desktop\NanoParticleTools",
    feature_processor_kwargs={'include_zeros': True},
    label_processor_kwargs={
        'spectrum_ranges': {
            'uv': (300, 450)
        },
        'log_constant': 100
    })
Hetero_data_module = NPMCDataModule(train_dataset, val_dataset, ood_test_dataset, iid_test_dataset, batch_size=16)
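
To make the label concrete, here is a minimal sketch of what a summed-wavelength-range label might look like. It is not the library's implementation: the helper name `summed_range_label`, the inclusive range bounds, the nanometer units, and the use of `log_constant` as an additive offset inside a base-10 logarithm are all assumptions made for illustration.

```python
import math

def summed_range_label(wavelengths_nm, intensities, lo_nm, hi_nm, log_constant):
    """Hypothetical helper: sum the intensity inside [lo_nm, hi_nm], then log-scale it."""
    total = sum(i for w, i in zip(wavelengths_nm, intensities)
                if lo_nm <= w <= hi_nm)
    # ASSUMPTION: the constant is added before the logarithm so that a
    # spectrum with no emission in the range still gives a finite label.
    return math.log10(total + log_constant)

# Toy spectrum (made-up numbers, not taken from SUNSET-1)
wavelengths = [250, 350, 400, 500, 650]
intensities = [5.0, 400.0, 500.0, 800.0, 30.0]

# Only the 350 and 400 entries fall inside the 'uv' range (300, 450)
uv_label = summed_range_label(wavelengths, intensities, 300, 450, 100)
print(uv_label)  # log10(400 + 500 + 100)
```

Under these assumptions, a larger `log_constant` compresses the difference between dim and dark particles, which is one plausible reason to expose it as a keyword argument.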

Now, using train_uv_model, we can train a hetero GNN model. We use Weights & Biases (wandb) to track model performance and metrics during training, and to access the model checkpoint file after training is complete. In this demo we train for only 5 epochs, but we recommend training for at least 500 epochs in practice.

In [4]:
config = {
    'n_dopants': 3,
    'embed_dim': 16,
    'n_message_passing': 4,
    'learning_rate': 1e-3,
    'l2_regularization_weight': 1e-5,
    'interaction_embedding': True,
}
model = train_uv_model(config=config,
                       model_cls=HeteroDCVModel,
                       data_module=Hetero_data_module,
                       num_epochs=5,
                       # Disable early stopping; use a bool, since the
                       # non-empty string 'False' is truthy in Python
                       early_stop=False,
                       trainer_device_config={'devices': 1},
                       wandb_config={'name': 'demo-run',
                                     'project': 'demo-model'},
                       lr_scheduler_kwargs={'patience': 1,
                                            'factor': 0.85,
                                            'warmup_epochs': 1})


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mlattia[0m. Use [1m`wandb login --relogin`[0m to force relogin




  | Name                  | Type                          | Params | Mode 
--------------------------------------------------------------------------------
0 | representation_module | HeteroDCVRepresentationModule | 14.5 K | train
1 | readout               | NonLinearMLP                  | 2.3 K  | train
--------------------------------------------------------------------------------
16.8 K    Trainable params
0         Non-trainable params
16.8 K    Total params
0.067     Total estimated model params size (MB)
144       Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=5` reached.


Run history:
epoch             ▁▁▁▁▁▁▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆███████
iid_test_cos_sim  ▁
iid_test_loss     ▁
iid_test_mae      ▁
iid_test_mse      ▁
lr-Adam           ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
test_cos_sim      ▁
test_loss         ▁
test_mae          ▁
test_mse          ▁

Run summary:
epoch             4.0
iid_test_cos_sim  1.00348
iid_test_loss     0.25954
iid_test_mae      0.4338
iid_test_mse      0.25954
lr-Adam           0.0001
test_cos_sim      1.00971
test_loss         0.30397
test_mae          0.46751
test_mse          0.30397
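
In the summary, each `*_loss` value equals the corresponding `*_mse` value, which suggests the training loss is a mean squared error. As a reminder of how the two error metrics differ, here is a minimal pure-Python sketch. The function names and toy numbers are illustrative assumptions; how the library aggregates these metrics across batches is not shown in this notebook.

```python
def mse(predictions, targets):
    """Mean squared error: penalizes large deviations quadratically."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mae(predictions, targets):
    """Mean absolute error: average deviation in the units of the label."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Toy log-intensity labels (made-up values, not model output)
predicted = [2.0, 3.0, 4.0]
actual = [2.5, 3.0, 3.0]

demo_mse = mse(predicted, actual)  # (0.25 + 0.0 + 1.0) / 3
demo_mae = mae(predicted, actual)  # (0.5 + 0.0 + 1.0) / 3
print(demo_mse, demo_mae)
```

Because the labels in this notebook are log-scaled, an MAE of about 0.45 would correspond to predictions that are off by a bit less than half an order of magnitude on average, assuming a base-10 logarithm.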


Now we have a trained model to predict UCNP emission intensity! Training on large datasets can be slow on a typical personal computer (about 1 minute per epoch, so the recommended 500 epochs would take roughly 8 hours), so we recommend training on an HPC system.