# 4.0 Ensemble Model Training

## Why Train an Ensemble of Models?

For complex problems, such as predicting the ADMET profiles of compounds, it can be beneficial to train an **ensemble** of models rather than rely on a single model, since combining several models can make predictions more robust and accurate.
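
To make the idea concrete, here is a minimal sketch of a committee of models, assuming bootstrap resampling as the source of diversity and plain least-squares models as the members. It is an illustration only, not how Anvil's `CommitteeRegressor` is implemented: the function names `fit_committee` and `predict_committee` are hypothetical, the data is synthetic, and the recipe in this notebook uses LightGBM members instead.

```python
import numpy as np


def fit_committee(X, y, n_models=5, seed=0):
    """Fit n_models least-squares linear models, each on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias column
    members = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        coef, *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        members.append(coef)
    return np.array(members)  # shape: (n_models, n_features + 1)


def predict_committee(members, X):
    """Return the committee mean and the member-to-member standard deviation."""
    Xb = np.column_stack([X, np.ones(len(X))])
    preds = Xb @ members.T  # shape: (n_samples, n_models)
    return preds.mean(axis=1), preds.std(axis=1)


# Synthetic toy data (assumption, for illustration only): a noisy linear relationship
rng = np.random.default_rng(42)
true_coef = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_coef + rng.normal(scale=0.3, size=200)

members = fit_committee(X, y, n_models=5)
mean, std = predict_committee(members, X[:10])
```

The mean over members is the ensemble prediction, and the spread between members gives a rough, uncalibrated indication of how much the models disagree on each compound.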

### Requirements
As in `02_Training_Models.ipynb`, you will need:

1. A dataset that has been processed with `01_Curate_ChEMBL_Data.ipynb`.  
2. A `YAML` recipe file that tells Anvil how to train the ensemble. We will show you how to create this file in this notebook.

## Overview
This notebook walks you through training an ensemble of models with the Anvil workflow, using the same CYP3A4 data as in `02_Training_Models.ipynb`.

## Create the YAML file
As in `02_Training_Models.ipynb`, we will use a `YAML` file containing all the information needed to train the ensemble. The only difference from the usual Anvil recipe is the `ensemble` section.  

In the example below, we train a **5-model** ensemble of `LGBM` regressors on the CYP3A4 ChEMBL data.  

```yaml
# This section specifies the input data
data:
  # Specify the dataset file
  resource: ../01_Data_Curation/processed_data/processed_CYP3A4_inhibition.csv
  type: intake
  input_col: OPENADMET_SMILES
  # Specify one or more target columns, i.e. the column(s) you are trying to predict
  target_cols:
  - OPENADMET_LOGAC50
  dropna: true

# Additional metadata
metadata:
  authors: Your Name
  email: youremail@email.com
  biotargets:
  - CYP3A4
  build_number: 0
  description: basic regression using a LightGBM model
  driver: sklearn
  name: lgbm_pchembl
  tag: openadmet-chembl
  tags:
  - openadmet
  - test
  - pchembl
  version: v1

# Section specifying training procedure
procedure:
# Featurization specification
  feat:
    # Use concatenated features, which combine multiple featurizers.
    # Here we use DescriptorFeaturizer and FingerprintFeaturizer for 2D RDKit descriptors and ECFP4 fingerprints
    # See openadmet.models.features
    type: FeatureConcatenator
    # Add parameters for the featurizer. A full description of the featurizer options is in Section 5.
    params:
      featurizers:
        DescriptorFeaturizer:
          descr_type: "desc2d"
        FingerprintFeaturizer:
          fp_type: "ecfp:4"
  
  # Model specification
  model:
    # Indicate model type
    # See openadmet.models.architecture for all model types
    type: LGBMRegressorModel
    # Specify model parameters
    params:
      alpha: 0.005
      learning_rate: 0.05
      n_estimators: 500

  # Ensemble specification
  ensemble:
    type: CommitteeRegressor
    n_models: 5
    calibration_method: scaling-factor

  # Specify data splits
  split:
    # Specify how data will be split
    # See openadmet.models.split
    type: ShuffleSplitter
    # Specify split parameters
    params:
      random_state: 42
      train_size: 0.7
      val_size: 0.1 # Validation set is needed for uncertainty calibration
      test_size: 0.2 # If you want to compare tree-based models with DL models later, the test sizes should match
    
  # Specify training configuration
  train:
    # Specify the trainer; SKLearnBasicTrainer is used here because the model has an sklearn interface.
    # SKLearnGridSearchTrainer could be used instead for hyperparameter tuning
    type: SKLearnBasicTrainer


# Section specifying report generation
report:
  # Configure evaluation
  eval:
  # Generate regression metrics
  - type: RegressionMetrics
    params: {}
  # Generate regression plots & do cross validation
  - type: SKLearnRepeatedKFoldCrossValidation
    params:
      axes_labels:
      - True pAC50
      - Predicted pAC50
      max_val: 10
      min_val: 3
      pXC50: true
      n_splits: 5
      n_repeats: 5
      title: True vs Predicted pAC50 on test set
  # Generate uncertainty metrics
  - type: UncertaintyMetrics
    params:
      bins: 100
      resolution: 99
      scaled: True
  # Generate uncertainty calibration plot
  - type: UncertaintyPlots
    params: {}
```
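
The recipe sets `calibration_method: scaling-factor` and reserves a validation split for uncertainty calibration. As a rough illustration of what a scaling-factor calibration can look like, the sketch below fits a single scalar `s` on validation data so that `s * std` better matches the observed errors. This is one common formulation (the scalar that minimizes the Gaussian negative log-likelihood), assumed here for illustration; the exact method used by Anvil may differ, and `fit_scaling_factor` is a hypothetical helper, not part of the OpenADMET API.

```python
import numpy as np


def fit_scaling_factor(y_val, mean_val, std_val, eps=1e-12):
    """Scalar s minimizing the Gaussian negative log-likelihood of the
    validation residuals when the predicted std is replaced by s * std.

    Setting the derivative to zero gives s^2 = mean(residual^2 / std^2).
    """
    z2 = (y_val - mean_val) ** 2 / np.maximum(std_val, eps) ** 2
    return float(np.sqrt(z2.mean()))


# Synthetic validation set (assumption, for illustration only):
# the true error has std 1.0, but the ensemble reports an overconfident 0.25.
rng = np.random.default_rng(0)
n = 5000
mean_val = rng.normal(size=n)                     # ensemble mean predictions
y_val = mean_val + rng.normal(scale=1.0, size=n)  # observed values
std_val = np.full(n, 0.25)                        # raw ensemble spread

s = fit_scaling_factor(y_val, mean_val, std_val)
calibrated_std = s * std_val  # apply the same factor to test-set spreads later
```

In this toy setup the fitted factor comes out close to 4, which stretches the overconfident spread back toward the true error scale. The factor is fitted on validation data only, which is why the recipe needs a non-zero `val_size`.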

The command for running anvil is exactly the same as it was before!

```bash
openadmet anvil --recipe-path anvil_ensemble.yaml --output-dir ensemble
```

**We highly recommend training ensemble models on a GPU.**

~ End of `04_Ensemble_Model_Training` ~