# Machine Learning Model Validation

June 21-23, 2023

This demo (based on the CaliforniaHousing_trim1 data, a regression task) covers:

- WeakSpot test

- Reliability test

- Robustness test

## Install PiML Toolbox

- Run `!pip install piml` to install the latest version of PiML.
- In Google Colab, restart the runtime after installation so that the newly installed version is used.

In [1]:
!pip install piml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting piml
  Downloading PiML-0.5.0.post1-cp310-none-manylinux_2_17_x86_64.whl (11.2 MB)
Collecting lime>=0.2.0.1 (from piml)
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py) ... done
Collecting shap>=0.39.0 (from piml)
  Downloading shap-0.41.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
Collecting pygam==0.8.0 (from piml)
  Downloading pygam-0.8.0-py2.py3-none-any.whl (1.8 MB)

## Load and Prepare Data

Initialize a new experiment with `piml.Experiment()`.

In [2]:
from piml import Experiment
exp = Experiment()

Choose CaliforniaHousing_trim1 from the dataset dropdown.

In [3]:
exp.data_loader()

Data summary

In [4]:
exp.data_summary()

Data Shape: (20640, 9)

Prepare dataset with default settings

In [5]:
exp.data_prepare()

## Train Interpretable Models

- Train EBM, XGB2, and GAMI-Net models with default settings

- Register the fitted models

In [6]:
exp.model_train()

## WeakSpot

Choose XGB2.

- Switch to the "WeakSpot" tab; see the details [here](https://selfexplainml.github.io/PiML-Toolbox/_build/html/guides/testing/weakspot.html).

- Try the following options:

    - **Feature 1 or 2**: choose one or two features of interest as the slicing features.

    - **Method**: choose the slicing method; the available options are "Histogram slicing", "Tree slicing", and "Ensemble slicing".

    - **Threshold**: specify the performance metric threshold that defines a weak region. The threshold is relative to the average metric.

    - **Dataset**: choose whether to run the WeakSpot test on the training or test set.

    - **Metric**: choose the performance metric; for regression tasks the options are MSE, MAE, and R2.

    - **Min Sample**: specify the minimum number of samples a weak region must contain.

    - **Shown in original scale**: enable this check box to display features in their original scale instead of min-max scaled to the range 0 to 1.

- The displayed results include:

    - **Figure**: highlights the detected weak regions.

    - **Table**: lists the information of the detected weak regions.
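
The histogram slicing idea behind the options above can be sketched in plain Python. This is an illustration under stated assumptions, not PiML's implementation: the function name `histogram_weakspots`, the equal-width binning, and the synthetic data are hypothetical, and the threshold is read as a multiple of the overall MSE.

```python
import random


def histogram_weakspots(x, y_true, y_pred, bins=10, threshold=1.1, min_samples=20):
    """Return the overall MSE and the equal-width bins of x flagged as weak.

    A bin is flagged when it holds at least min_samples points and its MSE
    exceeds threshold * overall MSE (assumed reading of a relative threshold).
    """
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    overall = sum(sq_err) / len(sq_err)
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    buckets = [[] for _ in range(bins)]
    for xi, err in zip(x, sq_err):
        idx = min(int((xi - lo) / width), bins - 1)  # the maximum goes in the last bin
        buckets[idx].append(err)
    weak = []
    for i, errs in enumerate(buckets):
        if len(errs) < min_samples:
            continue  # too few samples to call this a region
        mse = sum(errs) / len(errs)
        if mse > threshold * overall:
            weak.append({"lower": lo + i * width, "upper": lo + (i + 1) * width,
                         "n": len(errs), "mse": mse})
    return overall, weak


# Synthetic data (hypothetical): the "model" predicts 2x, and the target noise
# is five times larger for x > 0.8, so that region should look weak.
rng = random.Random(0)
x = [rng.random() for _ in range(2000)]
y_pred = [2 * xi for xi in x]
y_true = [p + rng.gauss(0, 0.5 if xi > 0.8 else 0.1) for xi, p in zip(x, y_pred)]

overall_mse, weak_regions = histogram_weakspots(x, y_true, y_pred)
for region in weak_regions:
    print(region)
```

With this synthetic data, the flagged regions are expected to sit where the noise was made larger (x above 0.8). Raising `threshold` or `min_samples` makes the test stricter and typically reports fewer regions.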

In [7]:
exp.model_diagnose()

## Reliability

Choose XGB2.

- Switch to the "Reliability" tab; see the details [here](https://selfexplainml.github.io/PiML-Toolbox/_build/html/guides/testing/reliability.html).

- Try the following options:

    - **Expected Coverage**: specify the expected coverage of the prediction intervals.

    - **Feature**: choose the feature of interest.

    - **Bins**: specify the number of bins in the histogram slicing.

    - **Bandwidth Threshold**: specify the bandwidth threshold as a ratio relative to the average bandwidth. This is used to separate reliable and unreliable samples.

    - **Distance Metric**: choose the metric for the distributional distance between reliable and unreliable samples.

    - **Shown in original scale**: enable this check box to display features in their original scale instead of min-max scaled to the range 0 to 1.

- The displayed results include:

    - Marginal bandwidth plot showing the average bandwidth against the binning of a feature of interest.

    - Distribution distance between reliable and unreliable data.

    - A table that lists the average bandwidth and coverage.
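
To make the terms bandwidth, coverage, and bandwidth threshold concrete, here is a minimal sketch based on split conformal prediction with a known noise scale. It is not PiML's implementation: the helper names (`conformal_quantile`, `reliability_summary`, `noise_scale`) and the synthetic data are hypothetical, and using a known scale function in place of a fitted one is a simplifying assumption.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """(1 - alpha) quantile of calibration scores with finite-sample correction."""
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[k - 1]


def reliability_summary(y, pred, scale, q, threshold=1.1):
    """Average bandwidth, empirical coverage, and per-sample unreliable flags."""
    lower = [p - q * s for p, s in zip(pred, scale)]
    upper = [p + q * s for p, s in zip(pred, scale)]
    width = [u - l for l, u in zip(lower, upper)]
    avg_width = sum(width) / len(width)
    covered = [l <= t <= u for l, t, u in zip(lower, y, upper)]
    coverage = sum(covered) / len(covered)
    # A sample is unreliable when its interval is wider than threshold * average.
    unreliable = [w > threshold * avg_width for w in width]
    return avg_width, coverage, unreliable


def noise_scale(x):
    # Hypothetical difficulty estimate: noisier targets for x > 0.8.
    return 0.5 if x > 0.8 else 0.1


def make_data(n, rng):
    x = [rng.random() for _ in range(n)]
    pred = [2 * xi for xi in x]
    y = [p + rng.gauss(0, noise_scale(xi)) for xi, p in zip(x, pred)]
    return x, y, pred


rng = random.Random(0)
x_cal, y_cal, pred_cal = make_data(1000, rng)
x_test, y_test, pred_test = make_data(2000, rng)

# Calibrate on held-out data: scaled absolute residuals are the scores.
scores = [abs(t - p) / noise_scale(xi) for xi, t, p in zip(x_cal, y_cal, pred_cal)]
q = conformal_quantile(scores, alpha=0.1)  # expected coverage of 90%

scale_test = [noise_scale(xi) for xi in x_test]
avg_width, coverage, unreliable = reliability_summary(y_test, pred_test, scale_test, q)
print(f"average bandwidth={avg_width:.3f}, coverage={coverage:.3f}, "
      f"unreliable share={sum(unreliable) / len(unreliable):.3f}")
```

In this toy setup the empirical coverage should land near the expected 90%, and the samples flagged as unreliable are those in the noisier region, which is the kind of separation the distance metric then quantifies feature by feature.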

In [8]:
exp.model_diagnose()

## Robustness

Choose XGB2.

- Switch to the "Robustness" tab; see the details [here](https://selfexplainml.github.io/PiML-Toolbox/_build/html/guides/testing/robustness.html).

- Try the following options:

    - **Perturb**: choose "All Features" or a single feature of interest to perturb.

    - **Noise Scale**: choose between "Raw Scale" or "Quantile Scale".

    - **Noise Step**: specify the noise level of perturbation.

    - **Metric**: choose the performance metric; the available metrics for regression tasks are MSE, MAE, and R2.

    - **Worst Ratio**: choose the fraction of worst-performing samples used to show robustness on the hardest cases. It only affects the plot on the right.

    - **Shown in original scale**: enable this check box to display features in their original scale instead of min-max scaled to the range 0 to 1.

- The displayed results include:

    - Model performance on the full test set against perturbation.

    - Model performance on the worst samples of the test set against perturbation.
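
The two curves listed above can be approximated with a short perturbation loop. The sketch below is not PiML's implementation: it assumes raw-scale noise means zero-mean Gaussian noise whose standard deviation is the noise step times each feature's standard deviation, and the names `perturb`, `robustness_curve`, and `toy_model` are hypothetical.

```python
import random
import statistics


def perturb(column, noise_step, rng):
    """Add zero-mean Gaussian noise with std = noise_step * std of the column."""
    sd = statistics.pstdev(column)
    return [v + rng.gauss(0.0, noise_step * sd) for v in column]


def robustness_curve(model, X, y, noise_steps, worst_ratio=0.1, seed=0):
    """Return (noise_step, full-set MSE, worst-sample MSE) for each step."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*X)]  # perturb all features
    curve = []
    for step in noise_steps:
        noisy_rows = list(zip(*(perturb(col, step, rng) for col in columns)))
        # Squared errors sorted from largest to smallest.
        sq_err = sorted(((model(r) - t) ** 2 for r, t in zip(noisy_rows, y)),
                        reverse=True)
        k = max(1, int(worst_ratio * len(sq_err)))
        curve.append((step, statistics.fmean(sq_err), statistics.fmean(sq_err[:k])))
    return curve


def toy_model(row):
    # Stand-in for a fitted model (hypothetical).
    return 2.0 * row[0] - row[1]


data_rng = random.Random(1)
X = [(data_rng.random(), data_rng.random()) for _ in range(500)]
y = [toy_model(row) for row in X]  # noiseless targets, so the unperturbed MSE is 0

curve = robustness_curve(toy_model, X, y, noise_steps=[0.0, 0.1, 0.2, 0.3])
for step, full_mse, worst_mse in curve:
    print(f"noise step={step:.1f}  full MSE={full_mse:.5f}  worst-10% MSE={worst_mse:.5f}")
```

A model whose error grows slowly with the noise step is more robust. The worst-sample column shows how much of the degradation is concentrated in a small share of the test set.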

In [9]:
exp.model_diagnose()

## Model Comparison

Choose XGB2, GAMI-Net, and EBM.

- Switch to the "Robustness" tab.

- Customize the settings and view the comparison results. For example, change Noise Step to 0.02.

In [10]:
exp.model_compare()
