# Waterfall Plots

## Background 
Waterfall plots are commonly used to interpret financial data by showing cash inflows and outflows. However, they're also readily adaptable to interpreting base boosting models.

In a base boosting framework, each successive model is trained to predict the residual left by the models before it. Ideally, the residuals get closer and closer to 0 as each model's prediction is added, since a residual of 0 means we've accurately predicted the desired value.

Thus, if we take the standard deviation of the residuals at each "level" of a base boosting model, the decrease in standard deviation from one level to the next reflects that individual model's contribution.
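
To make the idea concrete, here is a minimal NumPy sketch of the bookkeeping described above. It is not how `oce.BaseBoosting` is implemented: the data is synthetic, and the helper `fit_linear` and the one-feature linear stages are hypothetical stand-ins for the real models. It only shows how the residual standard deviation can be tracked level by level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the VEGFR1 dataset): y depends on two features plus noise.
n = 500
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.3, size=n)


def fit_linear(x, target):
    """Least-squares fit of target ~ a*x + b; returns a prediction function."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return lambda x_new: np.column_stack([x_new, np.ones_like(x_new)]) @ coef


# Level 0: the baseline predicts the mean of y for every sample.
residual = y - y.mean()
stds = [residual.std()]

# Each later level is fit to the residual left by the previous levels.
for j in range(X.shape[1]):
    stage = fit_linear(X[:, j], residual)
    residual = residual - stage(X[:, j])
    stds.append(residual.std())

# The drop in standard deviation between levels is each stage's contribution.
contributions = -np.diff(stds)
for level, s in enumerate(stds):
    print(f"level {level}: residual std = {s:.3f}")
print("contributions:", np.round(contributions, 3))
```

Each entry of `contributions` corresponds to one bar in a waterfall plot: the larger the drop, the more that stage reduced the spread of the remaining error.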

Let's get started with a basic example. This is the same dataset and model as the minimal example. 

In [None]:
import olorenchemengine as oce
import pandas as pd

df = pd.read_csv("https://storage.googleapis.com/oloren-public-data/CHEMBL%20Datasets/997_2298%20-%20VEGFR1%20(CHEMBL1868).csv")
dataset = (oce.BaseDataset(data = df.to_csv(),
    structure_col = "Smiles", property_col = "pChEMBL Value") +
           oce.CleanStructures() + 
           oce.ScaffoldSplit()
)

In [None]:
model = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts"), n_estimators=1000),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000),
    oce.ChemPropModel(epochs=20, batch_size=64),
])

model.fit(*dataset.train_dataset)

In [9]:
model.test(*dataset.test_dataset)

100it [00:00, 307.87it/s]
100%|██████████| 2/2 [00:00<00:00, 43.75it/s]


{'r2': 0.5471913837930813,
 'Spearman': 0.7833685687985066,
 'Explained Variance': 0.5991635757305507,
 'Max Error': 2.5811079740992247,
 'Mean Absolute Error': 0.4982035876207892,
 'Mean Squared Error': 0.5514472859713764,
 'Root Mean Squared Error': 0.7425949676447965}

With a base boosting model trained, we can hop into generating waterfall plots.

## Basic Waterfall Plot Generation
The visualization takes in a model and data. In the background, we're running prediction on the provided data, calculating the residuals, and plotting them.

In [10]:
vis = oce.BaseErrorWaterfall(model, *dataset.train_dataset)
# vis = oce.BaseErrorWaterfall(model, dataset.train_dataset[0], dataset.train_dataset[1]) returns the same result 
vis.render_ipynb()

790it [00:02, 303.12it/s]
100%|██████████| 13/13 [00:00<00:00, 25.15it/s]


Note that the dataset baseline is generated by predicting the mean of the provided dataset for every sample.

We can see that the first 2 models meaningfully improved the performance, but the 3rd model made little contribution. Let's explore what happens if we make another base boosting model without the 3rd model (ChemProp).

In [None]:
model2 = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts"), n_estimators=1000),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000),
])

model2.fit(*dataset.train_dataset)

In [11]:
model2.test(*dataset.test_dataset)

{'r2': 0.5468501644869088,
 'Spearman': 0.7803775235341537,
 'Explained Variance': 0.6053431346825286,
 'Max Error': 2.558878681651925,
 'Mean Absolute Error': 0.49958443257914614,
 'Mean Squared Error': 0.5518628356176841,
 'Root Mean Squared Error': 0.7428747105788998}

In [12]:
vis = oce.BaseErrorWaterfall(model2, *dataset.train_dataset)
vis.render_ipynb()

Looking at both the r^2 values (about 0.5472 with ChemProp versus 0.5469 without) and the waterfall plots, the performance is almost the same!
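
For a quick side-by-side check, the two test-set results can be compared metric by metric. The snippet below simply copies the numbers printed by the two `test` calls above into dictionaries; the variable names are illustrative and not part of the library.

```python
# Test-set metrics copied from the two `test(...)` outputs above.
with_chemprop = {
    "r2": 0.5471913837930813,
    "Spearman": 0.7833685687985066,
    "Explained Variance": 0.5991635757305507,
    "Max Error": 2.5811079740992247,
    "Mean Absolute Error": 0.4982035876207892,
    "Mean Squared Error": 0.5514472859713764,
    "Root Mean Squared Error": 0.7425949676447965,
}
without_chemprop = {
    "r2": 0.5468501644869088,
    "Spearman": 0.7803775235341537,
    "Explained Variance": 0.6053431346825286,
    "Max Error": 2.558878681651925,
    "Mean Absolute Error": 0.49958443257914614,
    "Mean Squared Error": 0.5518628356176841,
    "Root Mean Squared Error": 0.7428747105788998,
}

# Positive delta: the value is larger for the two-model ensemble.
deltas = {name: without_chemprop[name] - with_chemprop[name] for name in with_chemprop}
for name, delta in deltas.items():
    print(f"{name:>25}: {delta:+.4f}")
```

Every difference is small relative to the metric itself, which matches what the waterfall plot suggested about the third model.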

## Advanced Waterfall Plots 
### Normalization
By default, these plots show unnormalized standard deviations. We think this leads to better interpretability, since the baseline is with respect to the original dataset. However, the residuals can also be normalized!

In [13]:
vis = oce.BaseErrorWaterfall(model, *dataset.train_dataset, normalization=True)
vis.render_ipynb()

790it [00:02, 282.72it/s]
100%|██████████| 13/13 [00:00<00:00, 32.65it/s]
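
This notebook does not show how `BaseErrorWaterfall` normalizes internally, so the sketch below only illustrates one plausible convention, stated here as an assumption: divide each level's residual standard deviation by the baseline's, so that the baseline sits at 1 and every other level reads as a fraction of the original spread. All values are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic target values and per-level residuals (illustrative only).
y = rng.normal(loc=6.0, scale=1.2, size=1000)
residuals_by_level = [
    y - y.mean(),                        # dataset baseline: predict the mean
    rng.normal(scale=0.8, size=y.size),  # stand-in for residuals after model 1
    rng.normal(scale=0.6, size=y.size),  # stand-in for residuals after model 2
]

raw_stds = np.array([r.std() for r in residuals_by_level])

# Assumed convention: express every level relative to the baseline spread.
normalized_stds = raw_stds / raw_stds[0]

print("raw:       ", np.round(raw_stds, 3))
print("normalized:", np.round(normalized_stds, 3))
```

Under this convention the shape of the plot is unchanged; only the axis is rescaled, which makes plots from datasets with different value ranges easier to compare.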


### Prediction on just SMILES
Furthermore, we're often running predictions on just SMILES without pChEMBL values. While having a baseline makes interpreting the results easier, we also support waterfall plots built from predictions alone.

In [15]:
vis = oce.BaseErrorWaterfall(model, dataset.train_dataset[0], normalization=True) # on the train set smiles
vis.render_ipynb()

790it [00:02, 304.93it/s]
100%|██████████| 13/13 [00:00<00:00, 36.18it/s]


Note that in this case the baseline is with respect to the residuals found after predicting with the first model.