# 2.1 Training models with Anvil
<div style="text-align: center">
<img src="../../static/anvil_diagram.png" alt="Anvil diagram" width="500"/>  
</div>

### Background

Anvil is our primary infrastructure for model training and evaluation, built to support scalable, reproducible, and rigorous development of ADMET prediction models. Recognizing that building the best models requires training many variants, ensuring their reproducibility, and enabling robust performance comparisons, Anvil centers around a YAML-based recipe system. These recipes allow users to specify model architectures and training procedures in a standardized, shareable format—minimizing code duplication while supporting both deep learning and traditional machine learning approaches.  

Designed with both internal and external engagement in mind, Anvil aims to lower the barrier for outside users to adopt and fine-tune models by offering simple, transparent workflows. Long-term, it will serve as a foundation for broader community involvement and model reuse.

### Requirements
To run Anvil, you need:
1. A dataset that has been processed with `1.1_Curating_external_datasets.ipynb`.  
2. A `YAML` file with instructions for Anvil. We will show you how to create this file in this notebook.

## 1. Overview

This notebook will walk you through how to run the Anvil model training workflow with human pregnane X receptor (PXR) data processed and cleaned in previous notebooks.

## 2. Creating the YAML file
The heart of an Anvil run is its `YAML` configuration file. Here we specify nearly everything needed to:
- load data
- preprocess it
- split the data appropriately into train/validation/test
- featurize according to model selection
- train the model
- and, finally, validate on the test set (which generates performance metrics and plots)  

We will walk through two `YAML` files: one for training a traditional machine learning model (`anvil_lgbm.yaml`) and one for training a deep learning model (`anvil_chemprop.yaml`).

## 3. Training a traditional machine learning LightGBM model

Here is a `YAML` file for training a LightGBM (LGBM) model. We are using the previously curated PXR data from ChEMBL. Be sure to read through the comments (in green) to understand each field.  

1. At a minimum, ensure `resource`, `input_col`, and `target_cols` are specified to match your dataset, as these will vary per dataset
2. The `procedure` section may not need much modification, especially if not tweaking parameters, but look it over to make sure it’s sensible

```yaml 
# This section specifies the input data
data:
  # Specify the dataset file
  resource: ../1_Data_Curation/processed_data/processed_PXR_chembl.parquet
  type: intake
  input_col: OPENADMET_CANONICAL_SMILES
  # Specify each (1+) of the target columns, or the column that you're trying to predict
  target_cols:
  - pchembl_value_mean
  # Whether or not to drop rows with no target value
  dropna: true

# Additional metadata
metadata:
  authors: Your Name
  email: youremail@mail.com
  biotargets:
  - PXR
  build_number: 0
  description: basic regression using a LightGBM model
  driver: sklearn
  name: lgbm_pchembl
  tag: openadmet-chembl
  tags:
  - openadmet
  - test
  - pchembl
  version: v1

# Section specifying training procedure
procedure:
  # Featurization specification
  feat:
    # Using concatenated features, which combines multiple featurizers
    # here we use DescriptorFeaturizer and FingerprintFeaturizer for 2D RDKit descriptors and ECFP4 fingerprints
    # See openadmet.models.features 
    type: FeatureConcatenator
    # Add parameters for the featurizer. Full description of the featurizer options are in Section 5.
    params:
      featurizers:
        DescriptorFeaturizer:
          descr_type: "desc2d"
        FingerprintFeaturizer:
          fp_type: "ecfp:4"
  
  # Model specification
  model:
    # Indicate model type
    # See openadmet.models.architecture for all model types
    type: LGBMRegressorModel
    # Specify model parameters
    params:
      alpha: 0.005
      learning_rate: 0.05
      n_estimators: 500


  # Specify data splits
  split:
    # Specify how data will be split
    # See openadmet.models.split
    type: ShuffleSplitter
    # Specify split parameters
    params:
      random_state: 42
      train_size: 0.8
      val_size: 0.0 # For LGBM, no validation set is needed
      test_size: 0.2 # If you want to compare tree-based models with DL models later, the test sizes should match
    
  # Specify training configuration
  train:
    # Specify the trainer, here SKLearnBasicTrainer as model has an sklearn interface
    # could also use SKLearnGridSearchTrainer for hyperparameter tuning
    type: SKLearnBasicTrainer


# Section specifying report generation
report:
  # Configure evaluation
  eval:
  # Generate regression metrics
  - type: RegressionMetrics
    params: {}
  # Generate regression plots & do cross validation
  - type: SKLearnRepeatedKFoldCrossValidation
    params:
      axes_labels:
      - True pAC50
      - Predicted pAC50
      max_val: 10
      min_val: 3
      pXC50: true
      n_splits: 5
      n_repeats: 5
      title: True vs Predicted pAC50 on test set

```
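Before launching a run, it can help to sanity-check the fields called out above. The following is a minimal sketch, not part of Anvil: `check_recipe` is a hypothetical helper, the recipe is written by hand as a Python dict that mirrors the YAML (in practice you would parse the file with a YAML library), and the rule that the three split fractions should sum to 1.0 is our assumption rather than a documented Anvil requirement.

```python
import math

# Fields the text above says must match your dataset.
REQUIRED_DATA_KEYS = ("resource", "input_col", "target_cols")


def check_recipe(recipe: dict) -> list[str]:
    """Return a list of human-readable problems found in a parsed recipe (hypothetical helper)."""
    problems = []

    data = recipe.get("data", {})
    for key in REQUIRED_DATA_KEYS:
        if not data.get(key):
            problems.append(f"data.{key} is missing or empty")

    # Assumption: train/val/test fractions are expected to sum to 1.0.
    split_params = recipe.get("procedure", {}).get("split", {}).get("params", {})
    fractions = [split_params.get(name, 0.0) for name in ("train_size", "val_size", "test_size")]
    if not math.isclose(sum(fractions), 1.0, abs_tol=1e-9):
        problems.append(f"split sizes sum to {sum(fractions):.3f}, expected 1.0")

    return problems


# Hand-written dict mirroring the relevant parts of the LightGBM recipe.
recipe = {
    "data": {
        "resource": "../1_Data_Curation/processed_data/processed_PXR_chembl.parquet",
        "input_col": "OPENADMET_CANONICAL_SMILES",
        "target_cols": ["pchembl_value_mean"],
    },
    "procedure": {
        "split": {"params": {"train_size": 0.8, "val_size": 0.0, "test_size": 0.2}},
    },
}

print(check_recipe(recipe))  # prints []
```

An empty list means the sketch found nothing to flag; it does not guarantee that Anvil will accept the recipe.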

After you have created or modified this `YAML` file to your liking, you can run the workflow with the below command either in a `bash` cell or in your command line:
```
openadmet anvil --recipe-path <your_file.yaml>
```

This may take 5-10 minutes to run, depending on the size of your dataset and your hyperparameters (e.g. the number of estimators for LightGBM, or the number of epochs for deep learning models).
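If you would rather launch the workflow from a Python script than from a `bash` cell, the same command can be assembled and run with `subprocess`. This is a sketch under assumptions: `build_anvil_command` and `run_anvil` are hypothetical helper names, and the only CLI options used are `--recipe-path` and `--output-dir`, both of which appear in this notebook.

```python
import shutil
import subprocess


def build_anvil_command(recipe_path: str, output_dir: str) -> list[str]:
    """Assemble the Anvil CLI call as an argument list (hypothetical helper)."""
    return ["openadmet", "anvil", "--recipe-path", recipe_path, "--output-dir", output_dir]


def run_anvil(recipe_path: str, output_dir: str) -> subprocess.CompletedProcess:
    """Run Anvil and raise if the CLI is missing or exits with an error."""
    if shutil.which("openadmet") is None:
        raise RuntimeError("The `openadmet` command was not found on PATH")
    return subprocess.run(build_anvil_command(recipe_path, output_dir), check=True)


# Show the command without executing it.
print(" ".join(build_anvil_command("anvil_lgbm.yaml", "lgbm")))
# prints: openadmet anvil --recipe-path anvil_lgbm.yaml --output-dir lgbm
```

Calling `run_anvil("anvil_lgbm.yaml", "lgbm")` would then start the actual training run.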

%%bash
openadmet anvil --recipe-path anvil_lgbm.yaml --output-dir lgbm

[09/22/25 13:00:49] INFO     Making workflow from specification    specification.py:615
Workflow initialized successfully with recipe: anvil_lgbm.yaml
[09/22/25 13:00:51] INFO     Running workflow from directory    workflow.py:229

[09/22/25 13:01:03] INFO     No transform specified, skipping    workflow.py:290
INFO     Data featurized    workflow.py:292
INFO     Building model

  y = column_or_1d(y, warn=True)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.024852 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20710
[LightGBM] [Info] Number of data points in the train set: 754, number of used features: 571
[LightGBM] [Info] Start training from score 5.575624
[09/22/25 13:01:04] INFO     Model trained    workflow.py:110
INFO     Saving model    workflow.py

INFO     Predictions made    workflow.py:350
INFO     Evaluating    workflow.py:353
[09/22/25 13:01:10] INFO     Starting cross-validation

  y = column_or_1d(y, warn=True)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007810 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 16658
[LightGBM] [Info] Number of data points in the train set: 603, number of used features: 477
[LightGBM] [Info] Start training from score 5.587882


[... the same LightGBM messages repeat for each of the remaining 24 cross-validation fits (5 splits × 5 repeats = 25 fits in total), each with 603 or 604 data points in the train set ...]




[09/22/25 13:01:20] INFO     Cross-validation complete    cross_validation.py:204
[09/22/25 13:01:24] INFO     Evaluation done    workflow.py:370
Workflow completed successfully
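The numbers in the log above follow from the cross-validation settings in the recipe. The initial fit uses 754 training points, and each cross-validation fit reports 603 or 604. The sketch below reproduces those sizes, assuming the evaluator splits the 754 training points with a standard K-fold scheme in which the first `n_samples % n_splits` folds hold one extra sample (the convention used by scikit-learn's `KFold`). That Anvil's `SKLearnRepeatedKFoldCrossValidation` works this way is our assumption; `kfold_train_sizes` is a hypothetical helper.

```python
def kfold_train_sizes(n_samples: int, n_splits: int) -> list[int]:
    """Training-set size for each fold of a plain K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` held-out folds contain one more sample than the rest.
    fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_samples - size for size in fold_sizes]


n_train = 754  # "Number of data points in the train set" from the first LightGBM fit
n_splits, n_repeats = 5, 5  # values from the recipe's report section

print(kfold_train_sizes(n_train, n_splits))  # prints [603, 603, 603, 603, 604]
print(n_splits * n_repeats)                  # prints 25
```

Each repeat therefore contributes four fits on 603 points and one on 604, for 25 fits overall.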


The outputs of the Anvil workflow are written to the directory passed to `--output-dir` (here, `lgbm`):  
- `/data` folder includes the split data, saved as `.csv`
- `/recipe_components` folder contains the inputs from the `anvil_lgbm.yaml` file, split by section
- `cross_validation_metrics.json` contains the cross-validation metrics of the model
- `model.json` contains the model's hyperparameters
- `regression_metrics.json` contains the regression metrics
- `model.pkl` is the trained model, which can be loaded and used for predictions elsewhere
- `cross_validation_regplot.png` is a plot of the cross-validation results of the model
- `anvil_recipe.yaml` is a copy of the input `.yaml`
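To inspect the metrics programmatically, the JSON files listed above can be read with the standard library. This is a minimal sketch: `load_metrics` is a hypothetical helper, the output directory is assumed to be `lgbm` (the value passed to `--output-dir` in the cell above), and no assumption is made about the internal structure of the JSON files, so the contents are simply printed.

```python
import json
from pathlib import Path

# File names taken from the list of Anvil outputs above.
METRIC_FILES = ("regression_metrics.json", "cross_validation_metrics.json")


def load_metrics(output_dir: str) -> dict:
    """Load whichever metric JSON files exist in an Anvil output directory."""
    results = {}
    for name in METRIC_FILES:
        path = Path(output_dir) / name
        if path.is_file():
            with path.open() as handle:
                results[name] = json.load(handle)
    return results


# Prints nothing if the directory or the files do not exist yet.
for name, content in load_metrics("lgbm").items():
    print(name, "->", content)
```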

Here are the results of the LGBM model trained above:

<img src="lgbm/cross_validation_regplot.png" alt="LGBM model results" width="500"/>

## 4. Training a deep learning Chemprop model

Here is a `YAML` file (`anvil_chemprop.yaml`) for training OpenADMET's ChemProp model. We are using the same ChEMBL PXR dataset. Be sure to note the different fields required for deep learning.

```yaml
# This section specifies the input data
data:
  # Specify the dataset file
  resource: ../../1_Data_Curation/processed_data/processed_PXR_chembl.parquet
  type: intake
  input_col: OPENADMET_CANONICAL_SMILES
  # Specify each (1+) of the target columns, or the column that you're trying to predict
  target_cols:
  - OPENADMET_LOGAC50


# Additional metadata
metadata:
  authors: Your Name
  email: youremail@mail.com
  biotargets:
  - PXR
  build_number: 0
  description: basic regression using a single-task ChemProp model
  driver: pytorch
  name: chemprop_pchembl
  tag: chemprop-PXR-chembl
  tags:
  - openadmet
  - test
  version: v1

# Section specifying training procedure
procedure:
  # Featurization specification
  feat:
    # Using the ChemPropFeaturizer (for ChemProp model)
    # See openadmet.models.features
    type: ChemPropFeaturizer
    # No parameters passed
    params: {}
  
  # Model specification
  model:
    # Indicate model type
    # See openadmet.models.architecture
    type: ChemPropModel
    # Specify model parameters
    params:
      depth: 4
      ffn_hidden_dim: 1024
      ffn_hidden_num_layers: 4
      message_hidden_dim: 2048
      dropout: 0.2
      batch_norm: True
      messages: bond
      n_tasks: 1 # Number of tasks should match the number of target columns
      from_chemeleon: False

  # Specify data splits
  split:
    # Specify how data will be split
    # See openadmet.models.split
    type: ShuffleSplitter
    # Specify split parameters
    params:
      random_state: 42
      train_size: 0.7
      val_size: 0.1
      test_size: 0.2
    
  # Specify training configuration
  train:
    # Specify the trainer, here LightningTrainer as ChemProp is a PyTorch Lightning model
    # See openadmet.models.trainer
    type: LightningTrainer
    # Specify trainer parameters
    params:
      accelerator: gpu
      early_stopping: true
      early_stopping_patience: 10
      early_stopping_mode: min
      early_stopping_min_delta: 0.001
      max_epochs: 50
      monitor_metric: val_loss
      use_wandb: false
      wandb_project: demos # Specify wandb project name according to guidelines

# Section specifying report generation
report:
  # Configure evaluation
  eval:
  # Generate regression metrics
  - type: RegressionMetrics
    params: {}
  # Generate regression plots & do cross validation
  - type: PytorchLightningRepeatedKFoldCrossValidation
    params:
      axes_labels:
      - True LogAC50
      - Predicted LogAC50
      n_repeats: 5
      n_splits: 5
      random_state: 42
      pXC50: true
      title: True vs Predicted LogAC50 on test set
```

The command is
```
openadmet anvil --recipe-path anvil_chemprop.yaml --output-dir chemprop
```
We recommend training deep learning models on GPU.
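The `early_stopping_*` settings in the recipe above control when training ends before `max_epochs` is reached. The sketch below illustrates the generic patience logic for `early_stopping_mode: min`: a validation loss counts as an improvement only if it beats the best value so far by more than `min_delta`, and training stops after `patience` consecutive checks without improvement. It is an illustration with made-up loss values, assuming one validation check per epoch; the exact behavior depends on how Anvil's `LightningTrainer` passes these parameters to PyTorch Lightning, which we have not verified here. `epochs_until_stop` is a hypothetical helper.

```python
def epochs_until_stop(val_losses, patience=10, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up losses: three real improvements, then a plateau that only
# improves by 0.0005, which is less than min_delta.
losses = [1.0, 0.8, 0.7] + [0.6995] * 12
print(epochs_until_stop(losses))  # prints 13
```

With `patience: 10`, the ten stale epochs are 4 through 13, so training stops at epoch 13 even though `max_epochs` is 50.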

## 5. Training a multitask deep learning model
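Similarly, we can train a multitask deep learning model with the combined data from `1.1_Curating_external_datasets.ipynb`. The Anvil recipe `anvil_multitask.yaml` differs from the single-task recipe in a few places. Notably, it lists several `target_cols`, sets `dropna: False` so that rows with missing target values are kept, and sets `n_tasks` to the number of target columns: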

```yaml
# Section specifying input data
data:
  # Specify the dataset file, can be S3 path etc.
  resource:  ../../1_Data_Curation/processed_data/processed_PXR_chembl.parquet
  # must be intake
  type: intake
  # Specify input column containing SMILES
  input_col: OPENADMET_CANONICAL_SMILES
  # Specify whether or not to drop NaN data rows
  dropna: False
  # Specify each (1+) of the target columns
  target_cols:
  - OPENADMET_LOGAC50_cyp3a4
  - OPENADMET_LOGAC50_pxr
  - OPENADMET_LOGAC50_ahr

# Additional metadata
metadata:
  authors: Your Name
  email: youremail@mail.com
  biotargets:
  - CYP3A4
  - PXR
  - AHR
  build_number: 0
  description: basic regression using a ChemProp multitask model
  driver: pytorch
  name: chemprop_pchembl
  tag: chemprop
  tags:
  - openadmet
  - test
  - chemprop
  version: v1

# Section specifying training procedure
procedure:
  # Featurization specification
  feat:
    # Using the ChemPropFeaturizer (for ChemProp model)
    # See openadmet.models.features
    type: ChemPropFeaturizer
    # No parameters passed
    params: {}
  
  # Model specification
  model:
    # Indicate model type
    # See openadmet.models.architecture
    type: ChemPropModel
    # Specify model parameters
    params:
      depth: 4
      ffn_hidden_dim: 1024
      ffn_hidden_num_layers: 4
      message_hidden_dim: 2048
      dropout: 0.2
      batch_norm: True
      messages: bond
      n_tasks: 3 # Number of tasks should match the number of target columns
      from_chemeleon: False

  # Specify data splits
  split:
    # Specify how data will be split, can be ShuffleSplitter, ScaffoldSplitter, etc.
    # See openadmet.models.split
    type: ShuffleSplitter
    # Specify split parameters
    params:
      random_state: 42
      train_size: 0.7
      val_size: 0.1
      test_size: 0.2
    
  # Specify training configuration
  train:
    # Specify the trainer, here LightningTrainer as ChemProp is a PyTorch Lightning model
    # See openadmet.models.trainer
    type: LightningTrainer
    # Specify trainer parameters
    params:
      accelerator: gpu
      early_stopping: true
      early_stopping_patience: 10
      early_stopping_mode: min
      early_stopping_min_delta: 0.001
      max_epochs: 50
      monitor_metric: val_loss
      use_wandb: false
      wandb_project: demos # Specify wandb project name according to guidelines

# Section specifying report generation
report:
  # Configure evaluation
  eval:
  # Generate regression metrics
  - type: RegressionMetrics
    params: {}
  # Generate regression plots & do cross validation
  - type: PytorchLightningRepeatedKFoldCrossValidation
    params:
      axes_labels:
      - True LogAC50
      - Predicted LogAC50
      n_repeats: 5
      n_splits: 5
      random_state: 42
      pXC50: true
      title: Multitask True vs Predicted LogAC50 on test set
```
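The recipe comment notes that `n_tasks` should match the number of target columns, and this is easy to get wrong when adding or removing targets. A small check can catch the mismatch before a long training run. This is a sketch, not part of Anvil: `check_n_tasks` is a hypothetical helper, and the recipe is written by hand as a Python dict mirroring the YAML keys.

```python
def check_n_tasks(recipe: dict) -> None:
    """Raise ValueError if n_tasks differs from the number of target columns."""
    target_cols = recipe["data"]["target_cols"]
    n_tasks = recipe["procedure"]["model"]["params"]["n_tasks"]
    if n_tasks != len(target_cols):
        raise ValueError(
            f"n_tasks is {n_tasks} but {len(target_cols)} target columns are listed: {target_cols}"
        )


# Hand-written dict mirroring the relevant parts of the multitask recipe.
multitask_recipe = {
    "data": {
        "target_cols": [
            "OPENADMET_LOGAC50_cyp3a4",
            "OPENADMET_LOGAC50_pxr",
            "OPENADMET_LOGAC50_ahr",
        ]
    },
    "procedure": {"model": {"params": {"n_tasks": 3}}},
}

check_n_tasks(multitask_recipe)  # passes silently: 3 tasks, 3 target columns
```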

The command is
```
openadmet anvil --recipe-path anvil_multitask.yaml --output-dir multitask
```

To train these deep learning models, we also provide an example `SLURM` script for submitting jobs to an HPC environment: `run_anvil.sbatch`.

We will examine the full results of these models in `3_Evaluation`.

Congrats! You now know how to train models with the Anvil workflow. Explore our [model catalog](https://github.com/OpenADMET/openadmet-models/tree/2f58b521cdf122d8c929f6b64aead96d1378cd6f/openadmet/models) for other model architectures and featurizers.

✨✨✨✨✨✨✨