# 3.1 Comparing models
### Background
After training multiple models with Anvil, you will want to compare their performance in a statistically robust way. We closely follow the guidelines laid out in [this paper](https://chemrxiv.org/engage/chemrxiv/article-details/672a91bd7be152b1d01a926b). The decision chart below helps you decide which models to compare:

<div style="text-align: center">
<img src="comparison_guidelines.png" alt="Model comparison" width="500"/>
</div>

### Requirements
For this demo, you will need:
1. At least two models trained with the Anvil workflow.

## 1. Overview
This notebook will walk you through how to use the OpenADMET CLI to evaluate models that have been trained with the Anvil workflow. In this particular demo, we will compare the two models we trained in `2.1_Training_models_with_Anvil.ipynb`: LightGBM and Chemprop.

## 2. Comparing the models
As with training, comparing models takes a single command with the following arguments:
```bash
# --model-stats: path to the cross_validation_metrics.json file that Anvil wrote for the model
# --model-tag:   any label that lets you tell the models apart
# --task-name:   the name from target_cols in the Anvil recipe.yaml
# Repeat the three arguments above for every model you want to compare.
# --output-dir:  an existing directory to export the plots to
# --report:      whether or not to write a PDF report
openadmet compare \
    --model-stats <path-1/cross_validation_metrics.json> \
    --model-tag <a-tag-to-label-your-trained-model-1> \
    --task-name <name-of-task-1> \
    --model-stats <path-2/cross_validation_metrics.json> \
    --model-tag <a-tag-to-label-your-trained-model-2> \
    --task-name <name-of-task-2> \
    --output-dir <path-to-output-plots> \
    --report <whether-or-not-to-write-pdf-report>
```
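Because the three per-model arguments repeat for every model, the command can also be assembled programmatically. The following is a minimal sketch, not part of the OpenADMET tooling: the helper `build_compare_command` and the placeholder paths, tags, and task names are hypothetical, and only the flag names are taken from the template above.

```python
import shlex


def build_compare_command(models, output_dir, report=True):
    """Assemble an `openadmet compare` argument list.

    `models` is a list of (stats_path, tag, task_name) tuples, one per model.
    This helper is illustrative only; it just repeats the three per-model
    flags shown in the template above.
    """
    cmd = ["openadmet", "compare"]
    for stats_path, tag, task_name in models:
        cmd += [
            "--model-stats", stats_path,
            "--model-tag", tag,
            "--task-name", task_name,
        ]
    cmd += ["--output-dir", output_dir, "--report", str(report)]
    return cmd


# Placeholder values; substitute your own paths, tags, and task names.
models = [
    ("path-1/cross_validation_metrics.json", "model-1", "task-1"),
    ("path-2/cross_validation_metrics.json", "model-2", "task-2"),
]
cmd = build_compare_command(models, "model_comparisons/")

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` to launch the comparison from Python instead of a `%%bash` cell.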
**IMPORTANT NOTE** You can only compare models that were cross-validated with the same number of folds, e.g. a model with `5 splits x 2 repeats` can only be compared to another model that was also cross-validated with `5 splits x 2 repeats`.

For this demo, the `openadmet compare` command is:

```bash
%%bash
openadmet compare \
    --model-stats ../2.1_Training_models_with_Anvil/anvil_training/cross_validation_metrics.json \
    --model-tag lgbm \
    --task-name OPENADMET_LOGAC50 \
    --model-stats ../2.1_Training_models_with_Anvil/anvil_training_2025-07-14_86073d/cross_validation_metrics.json \
    --model-tag chemprop \
    --task-name OPENADMET_LOGAC50 \
    --output-dir model_comparisons/ \
    --report True
```

```
Levene's test results
-------------------------
+-----------+-----------+----------+----------+-------------+
|       mse |       mae |       r2 |     ktau |   spearmanr |
|-----------+-----------+----------+----------+-------------|
| 6.80603   | 3.71671   | 1.18886  | 0.210189 |    0.400492 |
| 0.0177661 | 0.0697995 | 0.289942 | 0.652106 |    0.534793 |
+-----------+-----------+----------+----------+-------------+

Tukey's HSD results
-------------------------
+---------------+-----------+-----------+-------------+-------------+
| method        | metric    |     value |   errorbars |     p-value |
|---------------+-----------+-----------+-------------+-------------|
| lgbm-chemprop | mse       | -0.227728 |   0.0713879 | 2.76495e-06 |
| lgbm-chemprop | mae       | -0.125671 |   0.0309857 | 9.86736e-08 |
| lgbm-chemprop | r2        |  0.333783 |   0.0911281 | 4.24207e-07 |
| lgbm-chemprop | ktau      |  0.147616 |   0.0372833 | 1.40001e-07 |
| lgbm-chemprop | spearmanr |  0.176311 |  
```
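The Levene's test table in the output above has two unlabeled rows. As a rough reading aid, the sketch below assumes that the first row is the test statistic and the second row is the p-value, and flags metrics whose p-value falls below a conventional 0.05 threshold. Both the row interpretation and the threshold are assumptions, not something the tool's output states.

```python
ALPHA = 0.05  # assumed significance threshold

# Values copied from the Levene's test table above.
# Assumed layout: (test statistic, p-value) per metric.
levene = {
    "mse": (6.80603, 0.0177661),
    "mae": (3.71671, 0.0697995),
    "r2": (1.18886, 0.289942),
    "ktau": (0.210189, 0.652106),
    "spearmanr": (0.400492, 0.534793),
}

# A small p-value suggests the variances differ between the models
# for that metric, i.e. the equal-variance assumption is questionable.
unequal_variance = [metric for metric, (_, p) in levene.items() if p < ALPHA]

for metric, (stat, p) in levene.items():
    verdict = "variances differ" if p < ALPHA else "no evidence of unequal variances"
    print(f"{metric:>10}: statistic={stat:.3f}, p={p:.4f} -> {verdict}")
```

Under these assumptions, only `mse` falls below the threshold, so the equal-variance assumption is worth checking for that metric before relying on the parametric tests.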

Now, in `model_comparisons/`, you should find these outputs:
- `Levene.json` - file containing the results of Levene's test, which assesses homogeneity of variances among groups
- `Tukey_HSD.json` - file containing confidence intervals from Tukey's HSD (honestly significant difference) test for pairwise comparisons between models
- `anova.pdf` - ANOVA (analysis of variance) plot showing whether each metric differs statistically significantly across the compared models (p-value ≤ 0.05)
- `mcs_plots.pdf` - multiple comparisons similarity plot where the color denotes effect size and asterisk annotations denote statistical significance
- `mean_diffs.pdf` - plot of confidence intervals of the difference in mean performance between models; intervals that do not cross the zero line imply statistical significance
- `normality_plots.pdf` - plots showing how close to normal the distribution of each metric is, to check the assumptions of parametric tests such as ANOVA
- `paired_plots.pdf` - plots to check pairwise relationships between metrics across the compared models
- `posthoc.pdf` - file containing the tabulated Levene and Tukey HSD results
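The rule used for `mean_diffs.pdf` (an interval that does not cross zero implies significance) can be checked by hand against the Tukey's HSD table printed by the command. This is a minimal sketch that assumes the `errorbars` column is the half-width of a confidence interval centered on `value`; that interpretation is an assumption, not something the output states. The `spearmanr` row is left out because its error bar is not shown in the output.

```python
# (metric, mean difference lgbm - chemprop, assumed CI half-width),
# copied from the Tukey's HSD table in the command output.
tukey_rows = [
    ("mse", -0.227728, 0.0713879),
    ("mae", -0.125671, 0.0309857),
    ("r2", 0.333783, 0.0911281),
    ("ktau", 0.147616, 0.0372833),
]


def interval(value, half_width):
    """Return the (lower, upper) bounds of a symmetric interval."""
    return value - half_width, value + half_width


def excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


results = {}
for metric, value, half_width in tukey_rows:
    lower, upper = interval(value, half_width)
    results[metric] = excludes_zero(lower, upper)
    print(f"{metric:>5}: [{lower:+.4f}, {upper:+.4f}] excludes zero: {results[metric]}")
```

Under this assumption, all four intervals lie entirely on one side of zero, which agrees with the very small p-values reported in the same table.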

✨✨✨✨✨✨✨