# 📊 **Metric Visualization Notebook**



---



This notebook develops a visualization framework that dynamically generates plots from given parameters by reading the related data from metric files. It also provides a dashboard that serves as an interface to the framework and lets the user easily explore our collected metrics.

The purpose of this notebook is to allow students and researchers of AI fairness to explore how changes in temporal and/or spatial context can affect the fairness and performance of a model, as measured by various metrics. [Recent research has shown](https://arxiv.org/abs/2206.11436) that the interpretation of the fairness of classification models and datasets drastically depends on such context.

<br></br>

⬇️ To start, please <font color="orange">run this cell</font>.






In [None]:
# we need a newer matplotlib version for some features
!pip install -q -U matplotlib

# restart the runtime for the update to take effect
exit()

Google Colab will tell you that your runtime just crashed. **That is completely normal**. After Colab has automatically restarted your session, you can execute the setup by <font color='orange'>clicking on the arrow on the far left</font> of the collapsed cells below:

# ⚙️ **Setup**

⬇️ <font color='orange'>Click on the arrow below to run the setup!</font>

First, we need to download the metrics. They are currently located in a .zip-file on Google Drive, which we will download with the `gdown` command. If you are interested in adding any metrics, you can add them to the unzipped directory or create a new .zip-file and download it with `gdown` here.

In [None]:
%%capture
# download "results.zip" from drive and unzip
# (%%capture is a cell magic and must be the first line of the cell)
!rm -rf results
!gdown 1R397q9bdk7--M6B1xFJDDfSJ5rVLkmfu
!unzip results.zip
!rm results.zip

Now, we create some constants that will be useful throughout the entire notebook.

In [None]:
# constants

# directory containing csv files with metrics
METRICS_DIR = "/content/results"

# style to use for matplotlib plots, affects font sizes etc.
PLOT_STYLE = "poster"

# whether to save PNG and PDF plots
SAVE_PNG = False
SAVE_PDF = False

# output directory for all plots
OUTPUT_DIR_PNG = "/content/plots_png/"
OUTPUT_DIR_PDF = "/content/plots_pdf/"

We will finish our initial environment setup by importing all necessary packages, configuring some global settings and creating directories for our plots.

In [None]:
# imports
import os
from pathlib import Path

import pandas as pd

from datetime import datetime
import numpy as np

from ctypes import ArgumentError

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

import ipywidgets as widgets
from ipywidgets import Layout, AppLayout, VBox, HBox, Label, HTML, Output, Text
from IPython.display import display, clear_output

In [None]:
# global pandas settings
pd.set_option('display.expand_frame_repr', False)  # don't print dataframes across multiple lines
pd.set_option('display.max_rows', None)  # display all rows when printing dataframes
pd.options.mode.chained_assignment = None  # suppress warning for creating columns from two other cols

# global matplotlib/seaborn settings
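# matplotlib >= 3.6 renamed its bundled seaborn styles to "seaborn-v0_8-*"
style_prefix = "seaborn-v0_8" if "seaborn-v0_8-darkgrid" in plt.style.available else "seaborn"
plt.style.use([f"{style_prefix}-darkgrid", f"{style_prefix}-{PLOT_STYLE}"])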
plt.rcParams["figure.autolayout"] = True  # automatically choose plot layouts
sns.set_style('darkgrid',
              {'legend.frameon': True})

In [None]:
# (re-)create directories for plots

# whether to remove directories containing plots (WARNING, all files will be lost)
REMOVE_DIRECTORIES = False

if REMOVE_DIRECTORIES:
    import shutil
    shutil.rmtree(OUTPUT_DIR_PNG, ignore_errors=True)
    shutil.rmtree(OUTPUT_DIR_PDF, ignore_errors=True)

# create directories in which to store plots in (if they don't already exist)
Path(OUTPUT_DIR_PNG).mkdir(parents=True, exist_ok=True)
Path(OUTPUT_DIR_PDF).mkdir(parents=True, exist_ok=True)

## Data Loading

In this section, we will load all metrics from their respective .csv files. All metrics are then stored in a dataframe that will be the basis for our visualizations.

We need a function that extracts all important parameters from the metric file name, since the file name is the easiest place to get them from. The function may look complex, but parsing the name is all it does.

In [None]:
def parse_path_parts(filename):
    """
    Parse a metric filename whose parts are separated by '_' and return the list
    [model_type, dataset_train, years_train, threshold_train,
     dataset_test, years_test, threshold_test, group_non_white].
    """

    model_name_parts = filename.split('_')  # split model name into parts

    # 1. remove "metrics"
    if model_name_parts[0] == "metrics": model_name_parts.pop(0)

    # 2. get model_type
    # 2.1 if "adv", use special handling
    if model_name_parts[0] == "adv":
        model_name_parts.pop(0)  # pop "adv"
        model_name_parts.pop(0)  # pop "deb"
        model_type = f"AdversarialDebiasing ({model_name_parts.pop(0)}ed)"  # biased/debiased
    else:
        model_type = model_name_parts.pop(0)
        model_type = model_type[0].upper() + model_type[1:]
    
    # 3. get train dataset
    dataset_train = ""
    years_train = ""
    threshold_train = ""

    while True:
        tmp = model_name_parts.pop(0)

        if all(i.isdigit() for i in tmp) and len(tmp)==5:
            # threshold
            threshold_train = tmp
            break
        elif any(i.isdigit() for i in tmp):
            # years
            years_train += tmp
        else:
            # dataset name
            dataset_train += tmp

    assert dataset_train != "" and years_train != "" and threshold_train != "",\
            "Error parsing filename, could not find dataset_train, years_train or threshold_train"

    # 4. get test dataset
    dataset_test = ""
    years_test = ""
    threshold_test = ""

    while True:
        tmp = model_name_parts.pop(0)

        if all(i.isdigit() for i in tmp) and len(tmp)==5:
            # threshold
            threshold_test = tmp
            break
        elif any(i.isdigit() for i in tmp):
            # years
            years_test += tmp
        else:
            # dataset name
            dataset_test += tmp

    assert dataset_test != "" and years_test != "" and threshold_test != "",\
            "Error parsing filename, could not find dataset_test, years_test or threshold_test"

    # 5. get group_non_white
    if "True" in model_name_parts[-1]:
        # true
        group_non_white = True
    elif "False" in model_name_parts[-1]:
        # false
        group_non_white = False
    else:
        # error
        raise RuntimeError('Could not find value for group_non_white')

    years_train = years_train.replace(" ", ", ")
    years_test = years_test.replace(" ", ", ")

    return [model_type, dataset_train, years_train, threshold_train, dataset_test, years_test, threshold_test, group_non_white]
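
The two `while` loops above sort each underscore-separated token into one of three buckets using the same rules. As a minimal sketch of that rule in isolation (the file name below is hypothetical and only follows the pattern the parser appears to expect, and `classify_token` is an illustrative helper that is not part of the notebook):

```python
def classify_token(token):
    """Classify one filename token with the same three rules as the parser."""
    if token.isdigit() and len(token) == 5:
        return "threshold"   # exactly five digits, e.g. "50000"
    if any(ch.isdigit() for ch in token):
        return "years"       # contains digits, e.g. "2014 2016"
    return "dataset"         # no digits at all, e.g. "rural"


# hypothetical file name, made up for illustration only
example = "metrics_logreg_rural_2014 2016_50000_rural_2018_50000_True.csv"

# drop "metrics", the model type and the trailing group flag
tokens = example.split("_")[2:-1]
labels = [(token, classify_token(token)) for token in tokens]

for token, label in labels:
    print(f"{token!r:>12} -> {label}")
```

The first threshold token ends the train part, so everything after it belongs to the test part.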

In [None]:
# parse metrics directory and put all metrics into one large dataframe with cols:
#   model_type, dataset_train, dataset_test, threshold_train, threshold_test,
#   years_train, years_test, group_non_white, *METRICS

# number of directories in METRICS_DIR path, we need this to correctly parse
# directory names for different METRICS_DIR values
irrelevant_dirs_num = len(METRICS_DIR.split("/"))-1

# list of all metrics
metrics = []

df_all = pd.DataFrame()

for subdir, dirs, files in os.walk(METRICS_DIR):
    if "Old" in subdir: continue  # skip old metrics
    for file in files:
        if not file.endswith(".csv"): continue  # skip non-metric files
        if file.endswith(").csv"): continue  # skip duplicate files

        model_type, dataset_train, years_train, threshold_train, dataset_test, years_test, threshold_test, group_non_white = parse_path_parts(file)

        full_path = os.path.join(subdir, file)
        df = pd.read_csv(full_path)
        df.drop('model_name', axis=1, inplace=True)

        # add columns for values parsed from filename
        df["model_type"] = model_type
        df["dataset_train"] = dataset_train
        df["years_train"] = years_train
        df["threshold_train"] = threshold_train
        df["dataset_test"] = dataset_test
        df["years_test"] = years_test
        df["threshold_test"] = threshold_test
        df["group_non_white"] = group_non_white

        df_all = pd.concat([df_all, df], ignore_index=True)

# calculate f1 score and add to dataframe
df_all["f1_score"] = 2 * (df_all["precision"]*df_all["recall"])/(df_all["precision"]+df_all["recall"])

# sort columns alphabetically
df_all = df_all.reindex(sorted(df_all.columns), axis=1)

# move filename columns to front
df_all.insert(0, 'group_non_white', df_all.pop('group_non_white'))
df_all.insert(0, 'threshold_test', df_all.pop('threshold_test'))
df_all.insert(0, 'years_test', df_all.pop('years_test'))
df_all.insert(0, 'dataset_test', df_all.pop('dataset_test'))
df_all.insert(0, 'threshold_train', df_all.pop('threshold_train'))
df_all.insert(0, 'years_train', df_all.pop('years_train'))
df_all.insert(0, 'dataset_train', df_all.pop('dataset_train'))
df_all.insert(0, 'model_type', df_all.pop('model_type'))

df_all
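
The F1 score added above is the harmonic mean of precision and recall. The following sketch uses a made-up two-row dataframe (not data from the metric files) to illustrate the same expression. It also shows that a row where precision and recall are both 0 yields NaN instead of raising an error:

```python
import pandas as pd

# toy values, made up for illustration only
toy = pd.DataFrame({"precision": [0.5, 0.0],
                    "recall":    [1.0, 0.0]})

# same expression as in the cell above
toy["f1_score"] = 2 * (toy["precision"] * toy["recall"]) / (toy["precision"] + toy["recall"])

print(toy)
```

This matters because any later step that filters on missing values will treat such a row as missing.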

In [None]:
# save all metrics to csv with current timestamp
dt_string = datetime.now().strftime("%d%m%Y-%H:%M")
df_all.to_csv(f"all_metrics_{dt_string}.csv", encoding='utf-8', index=False)

In [None]:
# keep only columns without any NaN values
# (note: this drops every column that contains at least one NaN)
df_all = df_all.loc[:, df_all.notna().all()]

# drop "object" type columns (confusion matrices etc.) as we can't plot them
df_all.drop(columns=df_all.iloc[:, 8:].select_dtypes(include=["object"]).columns,
            inplace=True)

# replace space in AdversarialDebiasing model names with newline
df_all.loc[:, 'model_type'] = df_all.model_type.apply(lambda x: x.replace(" ", "\n"))
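
The expression `df_all.notna().all()` keeps a column only if it has no missing value at all, which is stricter than dropping columns that are entirely NaN. A minimal sketch on a made-up dataframe (the column names are hypothetical) shows the difference between the two behaviors:

```python
import numpy as np
import pandas as pd

# toy dataframe, made up for illustration only
toy = pd.DataFrame({
    "complete":       [1.0, 2.0],
    "partly_missing": [1.0, np.nan],
    "all_missing":    [np.nan, np.nan],
})

# strict: keep only columns where every value is present
strict = toy.loc[:, toy.notna().all()]

# lenient: drop only columns where every value is missing
lenient = toy.dropna(axis=1, how="all")

print(list(strict.columns))
print(list(lenient.columns))
```

Which variant is appropriate depends on whether a partly filled metric column should still be available for plotting.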

## Visualizations

Now that we have all the relevant data, it is time to work on the visualizations.

### Data Exploration

Let's have a look at our collected metrics and our data distribution.

**We have the following metrics:**

* ***abroca***
* accuracy
* average_abs_odds_difference
* average_odds_difference
* base_rate

>* between_all_groups_coefficient_of_variation
>* between_all_groups_generalized_entropy_index
>* between_all_groups_theil_index
>* between_group_coefficient_of_variation
>* between_group_generalized_entropy_index
>* between_group_theil_index

* binary_confusion_matrix
* coefficient_of_variation
* consistency
* differential_fairness_bias_amplification
* ***disparate_impact***
* ***equal_opportunity_difference*** (alias of true_positive_rate_difference)

>* error_rate
>* error_rate_difference
>* error_rate_ratio

>* false_discovery_rate
>* false_discovery_rate_difference
>* false_discovery_rate_ratio

>* false_negative_rate
>* false_negative_rate_difference
>* false_negative_rate_ratio

>* false_omission_rate
>* false_omission_rate_difference
>* false_omission_rate_ratio

>* false_positive_rate
>* false_positive_rate_difference
>* false_positive_rate_ratio

>* generalized_binary_confusion_matrix
>* generalized_entropy_index
>* generalized_false_negative_rate
>* generalized_false_positive_rate
>* generalized_true_negative_rate
>* generalized_true_positive_rate

* mean_difference (alias of statistical_parity_difference)
* negative_predictive_value

>* num_false_negatives
>* num_false_positives
>* num_generalized_false_negatives
>* num_generalized_false_positives
>* num_generalized_true_negatives
>* num_generalized_true_positives
>* num_instances
>* num_negatives
>* num_positives
>* num_pred_negatives
>* num_pred_positives
>* num_true_negatives
>* num_true_positives

* performance_measures
* positive_predictive_value
* power (alias of num_true_positives)
* precision (alias of positive_predictive_value)
* recall (alias of true_positive_rate)
* selection_rate
* sensitivity (alias of true_positive_rate)
* smoothed_empirical_differential_fairness
* specificity (alias of true_negative_rate)
* statistical_parity_difference
* theil_index

>* true_negative_rate
>* true_positive_rate
>* true_positive_rate_difference

In [None]:
# get overview over distribution of test datasets
print("Distribution of Test Datasets:")
print(df_all.groupby(["dataset_test", "years_test", "threshold_test"]).size())

# get overview over distribution of train datasets
print("-"*60 + "\nDistribution of Train Datasets:")
print(df_all.groupby(["dataset_train", "years_train", "threshold_train"]).size())

Here's a helper function which allows us to easily filter the dataframe for the values we are interested in.

In [None]:
def set_metrics(set_type, dataset, years, threshold):
    if set_type not in {"train", "test"}: raise ArgumentError("Unknown set_type, choose either 'train' or 'test'.")
    return df_all.loc[(df_all[f"dataset_{set_type}"]==dataset) &
           (df_all[f"years_{set_type}"]==years) &
           (df_all[f"threshold_{set_type}"]==threshold)]

This is how we can filter the dataframe to get only metrics from models `test`ed on `east coast geo` in the year `2018` with a threshold of `30000`.

In [None]:
# example of how to filter the dataframe to get only metrics from a specific test set
set_metrics("test", "east coast geo", "2018", "30000")

### Plotting

In this subsection, the actual visualization function is defined, which is the heart of this notebook.

#### Helper Functions

Our visualization relies on helper functions for many tasks.

In [None]:
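# helper function for bar chart labels
def autolabel(rects, use_ints=False, ax=None):
    # annotate on the given axes, falling back to the current axes
    ax = ax if ax is not None else plt.gca()
    for rect in rects: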
        height = rect.get_height()
        annotation = str(int(height)) if use_ints else f'{height:.3f}'
        ax.annotate(annotation,
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    ha='center',
                    va='bottom',
                    fontsize=plt.rcParams['axes.labelsize'])

In [None]:
def dataset_name_to_pos(dataset_name, ignore_dataset=False, ignore_year=False):
    """
    Assign a given dataset name an integer, such that one can sort datasets in a
    custom order.

    The custom order is:
    1. East Coast Geo
    2. West Coast Geo
    3. Urban
    4. Rural

    Inside datasets, it will be sorted by year.
    """

    val = -1

    if not ignore_dataset:
        if "east" in dataset_name:
            val = 0
        elif "west" in dataset_name:
            val = 6
        elif "urban" in dataset_name:
            val = 12
        elif "rural" in dataset_name:
            val = 18
        else:
            raise ArgumentError(f"did not recognize dataset name: {dataset_name}")

    if not ignore_year:
        if "2014" in dataset_name:
            val += 1
        elif "2015" in dataset_name:
            val += 2
        elif "2016" in dataset_name:
            val += 3
        elif "2017" in dataset_name:
            val += 4
        elif "2018" in dataset_name:
            val += 5
        else:
            raise ArgumentError(f"did not recognize year in dataset name: {dataset_name}")

    return val
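
To illustrate how such an integer position can drive a custom sort order, here is a small sketch. `sort_position` is a simplified, hypothetical stand-in for `dataset_name_to_pos` (it uses the same offsets but no error handling), and the labels are made up:

```python
DATASET_OFFSETS = {"east": 0, "west": 6, "urban": 12, "rural": 18}
YEAR_OFFSETS = {"2014": 1, "2015": 2, "2016": 3, "2017": 4, "2018": 5}


def sort_position(label):
    """Dataset offset plus the offset of the first matching year."""
    dataset = next(v for k, v in DATASET_OFFSETS.items() if k in label)
    year = next(v for k, v in YEAR_OFFSETS.items() if k in label)
    return dataset + year


# made-up labels in the "<dataset> (<years>)" form, including one duplicate
labels = ["rural (2018)", "east coast geo (2016)", "rural (2014)",
          "east coast geo (2016)", "west coast geo (2018)"]

# sort by position, then drop duplicates while preserving the sorted order
hue_order = list(dict.fromkeys(sorted(labels, key=sort_position)))
print(hue_order)
```

`dict.fromkeys` is used for de-duplication because, unlike a `set`, it preserves insertion order.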

In [None]:
def colormap_from_dataset_names(dataset_names):
    """
    Given a list of dataset names, create a colormap that assigns each dataset name
    a color.

    The same datasets with different years should receive similar colors.
    """

    cm_east = plt.cm.Blues_r
    cm_west = plt.cm.Oranges_r
    cm_urban = plt.cm.Greens_r
    cm_rural = plt.cm.Purples_r

    count_east = 0
    count_west = 0
    count_urban = 0
    count_rural = 0

    for name in dataset_names:
        if "east" in name:
            count_east += 1
        elif "west" in name:
            count_west += 1
        elif "urban" in name:
            count_urban += 1
        elif "rural" in name:
            count_rural += 1
        else:
            raise ArgumentError(f"did not recognize dataset name: {name}")

    cm = []
    cm_linspace = np.linspace(.2, .8, 5)

    cm.extend(list(cm_east(cm_linspace))[:count_east])
    cm.extend(list(cm_west(cm_linspace))[:count_west])
    cm.extend(list(cm_urban(cm_linspace))[:count_urban])
    cm.extend(list(cm_rural(cm_linspace))[:count_rural])

    return cm
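
The shades within one dataset family come from sampling a single matplotlib colormap at evenly spaced points. A minimal sketch of that step for one family, assuming three datasets of that family are present:

```python
import numpy as np
import matplotlib.pyplot as plt

# five evenly spaced sampling points, avoiding the extreme ends of the colormap
sample_points = np.linspace(0.2, 0.8, 5)

# calling a colormap with an array returns one RGBA row per sampling point
blues = plt.cm.Blues_r(sample_points)

# keep only as many shades as there are datasets of this family (assumed: 3)
shades = list(blues)[:3]

for shade in shades:
    print(np.round(shade, 2))
```

Because the reversed colormap `Blues_r` is used, the first shade is the darkest and later shades become lighter.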

In [None]:
def refilter_metrics(vis_metrics, filters, iter_type):
    """
    Apply the additional filters to vis_metrics.

    filters is ([model_type, dataset, years], include); each of the three entries
    may be None (no filter), a single string or a list of strings.
    If include is True, only matching rows are kept, otherwise they are removed.
    """
    (filter_mt, filter_ds, filter_yr), include = filters

    # map each filter to the column it applies to
    column_filters = {"model_type": filter_mt,
                      f"dataset_{iter_type}": filter_ds,
                      f"years_{iter_type}": filter_yr}

    for column, wanted in column_filters.items():
        if wanted is None:
            continue

        # treat a single value like a one-element list
        values = wanted if isinstance(wanted, list) else [wanted]
        mask = vis_metrics[column].isin(values)
        vis_metrics = vis_metrics.loc[mask if include else ~mask]

    return vis_metrics
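
An include/exclude filter over one or several values can be expressed with `Series.isin`. The following is a minimal sketch on a made-up dataframe; the helper name `filter_column` and the toy values are hypothetical:

```python
import pandas as pd


def filter_column(df, column, wanted, include=True):
    """Keep (include=True) or drop (include=False) rows whose value is in wanted."""
    values = wanted if isinstance(wanted, list) else [wanted]
    mask = df[column].isin(values)
    return df.loc[mask if include else ~mask]


# toy dataframe, made up for illustration only
toy = pd.DataFrame({
    "model_type":  ["A", "B", "C"],
    "years_train": ["2014", "2016", "2018"],
})

only_2016_2018 = filter_column(toy, "years_train", ["2016", "2018"])
without_a = filter_column(toy, "model_type", "A", include=False)

print(only_2016_2018)
print(without_a)
```

Comparing against a list with a sequence of `==` filters would instead keep only rows that equal every listed value at once, which is empty for two or more distinct values.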

In [None]:
# ranges of each metric
metrics_range_zero_one = ['accuracy', 'average_abs_odds_difference', 'base_rate',
 'error_rate', 'f1_score', 'false_discovery_rate', 'false_negative_rate',
 'false_omission_rate', 'false_positive_rate', 'negative_predictive_value',
 'positive_predictive_value', 'precision', 'recall', 'selection_rate',
 'sensitivity', 'specificity', 'theil_index', 'true_negative_rate',
 'true_positive_rate']

pres_metrics_threshold = ["accuracy", "abroca", "f1_score", "disparate_impact"]

def get_limits_for_metric(metric):
    if metric in metrics_range_zero_one:
        return [0, 1]
    elif "ratio" in metric:
        return [0-get_base_for_metric(metric), None]
    return [None, None]

# give baseline value for given metric
def get_base_for_metric(metric):
    if "ratio" in metric or metric=="disparate_impact":
        return 1
    else:
        return 0

# interpretation help for each metric
metrics_hints = {"abroca": "0 is best",
                 "accuracy": "higher is better",
                 "f1_score": "higher is better",
                 "disparate_impact": "1 is best"}

def get_hint_for_metric(metric):
    if metric in metrics_hints:
        return f" ({metrics_hints[metric]})"
    return ""
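
The baseline is what allows bars to grow from a reference value other than 0: the main function below subtracts the baseline from the metric before plotting and adds it back when formatting the tick labels. A minimal sketch of that round trip, using made-up disparate impact values and a hypothetical helper `base_for` that mirrors the rule above:

```python
def base_for(metric):
    """Ratio-like metrics are centered on 1, all others on 0."""
    return 1 if "ratio" in metric or metric == "disparate_impact" else 0


# made-up metric values for illustration only
values = [0.8, 1.0, 1.25]
baseline = base_for("disparate_impact")

# bar heights relative to the baseline: negative means "below 1"
heights = [value - baseline for value in values]

# labels shown to the reader are shifted back to the actual values
tick_labels = [f"{height + baseline:g}" for height in heights]

print(heights)
print(tick_labels)
```

With this shift, a disparate impact of exactly 1 produces a bar of height 0, so deviations in either direction are visible at a glance.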

#### Main Function

Here we define the main visualization function, which takes a series of parameters and automatically creates a corresponding grouped bar chart.

In [None]:
def plot_experiment(eval_type, eval_dataset, eval_years, threshold, metric,
                    filters=(None, True), forced_legend_loc=None, forced_filename=None,
                    save_png=SAVE_PNG, save_pdf=SAVE_PDF):
    """
    Plot a grouped bar chart using metrics from experiment with given parameters.

        Parameters:
            eval_type (str): Either 'train' or 'test'
                             if 'test':  compare different trainings on the same test dataset
                             if 'train': compare the same training on different test datasets 
            eval_dataset (str): Dataset to use for evaluation
            eval_years (str): Years to use for evaluation
            threshold (str): Threshold to use for evaluation
            metric (str): Metric to use for evaluation
            filters ((list[(str or list(str)), (str or list(str)), (str or list(str))], bool), default (None, True)): 
                Additional filters if you don't want to include all existing metrics
                in your plot. Order is ([model_type, dataset, years], include).
                "include" denotes whether filtered plots should be exclusively included
                (True) or excluded (False).
                If None, include everything.
            forced_legend_loc (str, default None): Position to force legend in, if None, choose 'best'.
                                            This can be helpful if the legend overlaps with the graph.
            forced_filename (str, default None): Filename to use for saving, without file ending.
                                                 If None, automatically determine from parameters.
            save_png (bool, default SAVE_PNG): Whether to save plot as PNG
            save_pdf (bool, default SAVE_PDF): Whether to save plot as PDF

        Returns:
            matplotlib.axes.Axes of the created plot, or None if no metrics were found
    """
    # the inverse of eval_type; the type to iterate over
    iter_type = "train" if eval_type=="test" else "test"

    # get metrics from given parameters
    vis_metrics = set_metrics(eval_type, eval_dataset, eval_years, threshold)

    # if specified, only take filtered metrics
    if filters[0] is not None:
        vis_metrics = refilter_metrics(vis_metrics, filters, iter_type)
            
    # if no metrics were found, do not create visualization and output information
    if len(vis_metrics)==0:
        print("No metrics with given parameters were found. Please try different parameters or upload your metrics to the directory.")
        if filters[0] is not None: print("Consider using less strict filters.")
        return

    # create merged column from dataset and years
    vis_metrics[f"dataset_years_{iter_type}"] = [f"{ds} ({yr})" for ds, yr in zip(vis_metrics[f"dataset_{iter_type}"], vis_metrics[f"years_{iter_type}"])]

    # sort that column and use it as the order of our grouped bar chart
    hue_order = list(dict.fromkeys(sorted(vis_metrics[f"dataset_years_{iter_type}"].to_list(),
                                          key=dataset_name_to_pos)))

    # create colormap which assigns each present dataset its respective color
    cmap = colormap_from_dataset_names(hue_order)

    # update rcParams with new color map and figure size 
    plt.rcParams['axes.prop_cycle'] = plt.cycler(color=cmap)
    plt.rcParams['figure.figsize'] = 16, 9

    # some metrics require a baseline other than 0, as they should be considered
    # to be centered around a different value, e.g. 1
    baseline = get_base_for_metric(metric)

    # we need to subtract the baseline from the metrics because seaborn will always
    # let the bars start from 0
    vis_metrics[metric] -= baseline

    # create the actual bar chart
    ax = sns.barplot(x="model_type",              
                    y=metric,         
                    hue=f"dataset_years_{iter_type}",
                    data=vis_metrics,
                    order=vis_metrics["model_type"].sort_values().unique(),
                    hue_order=hue_order,
                    ci=None)

    # add baseline to yticks to make them display the actual values
    ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda x,_: f"{x+baseline:g}"))

    # format plot to look better by doing some small things
    metric_fancy = metric.replace("_", " ").title()
    if metric == "abroca": metric_fancy = "ABROCA"

    ds_fancy = eval_dataset.replace("_", " ").title()

    # "best" sometimes does not work well, especially for the "accuracy" metric
    legend_loc = "lower left" if (metric == "accuracy" or metric == "f1_score") else "best"
    legend_loc = forced_legend_loc if forced_legend_loc is not None else legend_loc

    # set chart title depending on eval_type to help the user understand the plot
    if eval_type == "test":
        title = f"{metric_fancy} on {ds_fancy} ({eval_years}) with Threshold={int(threshold):,}"
    else:
        title = f"{metric_fancy} of Model trained on {ds_fancy} ({eval_years}) with Threshold={int(threshold):,}"

    # we defined a hint for the main metrics, we should add it to the yaxis label
    metric_fancy = f"{metric_fancy}{get_hint_for_metric(metric)}"
    ax.set_ylabel(metric_fancy)
    ax.set_xlabel('Model')
    ax.set_title(title)

    if forced_legend_loc != "none":
        ax.legend(title=f"{iter_type.capitalize()}ing Dataset",
                    fancybox=True,
                    title_fontsize=plt.rcParams['legend.fontsize'],
                    frameon=True,
                    loc=legend_loc,
                    facecolor="w")
    else:
        ax.legend_.remove()
    
    # set ylimits for current metric
    ax.set_ylim(get_limits_for_metric(metric))

    # update ylimit to leave some space if bars come from upper end of plot
    if baseline==1 and ax.get_ylim()[1]==1-baseline: 
        ax.set_ylim(top=0.1)

    if metric=="abroca" and ax.get_ylim()[0] == 0:
        ax.set_ylim(bottom=ax.get_ylim()[1]*-.2)

    # denote if any bars have positive/negative values, useful for baseline highlighting
    any_positive = any([any([d>0 for d in c.datavalues]) for c in ax.containers])
    any_negative = any([any([d<0 for d in c.datavalues]) for c in ax.containers])

    # add horizontal line to highlight baseline, if != 0
    if baseline != 0 or (any_positive and any_negative) or metric=="abroca": ax.axhline(0, color="darkgrey")

    # use integer format for integers, 3-digit float format for floats
    fmt= "{label:.0f}" if all([all([d.is_integer() for d in c.datavalues]) for c in ax.containers]) else "{label:.3f}"

    # add value label to each bar
    for container in ax.containers:
        ax.bar_label(container,
                    fontsize=plt.rcParams['xtick.labelsize'],
                    fmt=fmt,
                    labels=[fmt.format(label=x+baseline) for x in container.datavalues],
                    padding=2)

    # include a tight_layout() call for good measure
    plt.tight_layout()

    # construct filename from parameters, if no forced_filename was given
    filename = f"{eval_type}_{eval_dataset}_{eval_years}_{threshold}_{metric}"
    filename = forced_filename if forced_filename is not None else filename

    # save plots as png/pdf in respective directory
    if save_png: plt.savefig(f"{OUTPUT_DIR_PNG}{filename}.png")
    if save_pdf: plt.savefig(f"{OUTPUT_DIR_PDF}{filename}.pdf")

    # return the plot
    return ax

Here is an example of how to use this function.

In [None]:
plot = plot_experiment(
    eval_type    = "train",
    eval_dataset = "east coast geo",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "disparate_impact")

plt.show(plot)

#### Plots for the Presentation

In this section, we create the plots for our presentation slides.

Please keep in mind that the plots shown in our presentation differ slightly from the ones created here, since we used custom y-axis limits and color nuances for some plots to make them more accurate and easier to interpret.

You can find them in our [presentation slides](https://docs.google.com/presentation/d/1ET3pKml7Sgjx6Y7E3P3XhLVUYo0IEelsB4JrQ7U1ktc/edit?usp=sharing).

##### Temporal vs. Spatial Context

>Evaluate: West Coast (2018)
>
>Training: Everything Compared


In [None]:
tvsc_abr = plot_experiment(
    eval_type    = "test",
    eval_dataset = "west coast geo",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "abroca",
    forced_legend_loc = "lower center",
    forced_filename = "tvsc_abr",
    save_png = True)

plt.show(tvsc_abr)

In [None]:
tvsc_acc = plot_experiment(
    eval_type    = "test",
    eval_dataset = "west coast geo",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "accuracy",
    forced_filename = "tvsc_acc",
    save_png = True)

plt.show(tvsc_acc)

In [None]:
tvsc_f1 = plot_experiment(
    eval_type    = "test",
    eval_dataset = "west coast geo",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "f1_score",
    forced_filename = "tvsc_f1",
    save_png = True)

plt.show(tvsc_f1)

In [None]:
tvsc_di = plot_experiment(
    eval_type    = "test",
    eval_dataset = "west coast geo",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "disparate_impact",
    forced_filename = "tvsc_di",
    save_png = True)

plt.show(tvsc_di)

##### Temporal Context

>Evaluate: Rural (2018)
>
>Training: Rural (2014) vs. Rural (2016) vs. Rural (2018) vs. Rural (2014, 2016) vs. Rural (2014, 2016, 2018)



In [None]:
tc_abr = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "abroca",
    filters      = ([None, "rural", None], True),
    forced_filename = "tc_abr",
    forced_legend_loc = "lower right",
    save_png     = True)

plt.show(tc_abr)

In [None]:
tc_acc = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "accuracy",
    filters      = ([None, "rural", None], True),
    forced_filename = "tc_acc",
    save_png     = True)

plt.show(tc_acc)

In [None]:
tc_di = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "disparate_impact",
    filters      = ([None, "rural", None], True),
    forced_filename = "tc_di",
    save_png     = True)

plt.show(tc_di)

In [None]:
tc_f1 = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "f1_score",
    filters      = ([None, "rural", None], True),
    forced_filename = "tc_f1",
    save_png     = True)

plt.show(tc_f1)

##### Spatial Context

>Evaluate: Rural (2016)
>
>Training: East Coast (2016) vs. West Coast (2016) vs. Rural (2016)


In [None]:
sc_abr = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2016",
    threshold    = "50000",
    metric       = "abroca",
    filters      = ([None, None, "2016"], True),
    forced_filename = "sc_abr",
    save_png     = True)

plt.show(sc_abr)

In [None]:
sc_acc = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2016",
    threshold    = "50000",
    metric       = "accuracy",
    filters      = ([None, None, "2016"], True),
    forced_filename = "sc_acc",
    save_png     = True)

plt.show(sc_acc)

In [None]:
sc_di = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2016",
    threshold    = "50000",
    metric       = "disparate_impact",
    filters      = ([None, None, "2016"], True),
    forced_filename = "sc_di",
    save_png     = True)

plt.show(sc_di)

In [None]:
sc_f1 = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2016",
    threshold    = "50000",
    metric       = "f1_score",
    filters      = ([None, None, "2016"], True),
    forced_filename = "sc_f1",
    save_png     = True)

plt.show(sc_f1)

##### Threshold Impact

>Evaluate: Rural (2018, 30k Threshold), Rural (2018, 40k Threshold), Rural (2018, 50k Threshold), Rural (2018, 60k Threshold)
>
>Training: Rural (2018, 30k Threshold), Rural (2018, 40k Threshold), Rural (2018, 50k Threshold), Rural (2018, 60k Threshold)


###### 30,000

In [None]:
th_abr_30k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "30000",
    metric       = "abroca",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_abr_30k",
    save_png     = True)

plt.show(th_abr_30k)

In [None]:
th_acc_30k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "30000",
    metric       = "accuracy",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_acc_30k",
    save_png     = True)

plt.show(th_acc_30k)

In [None]:
th_f1_30k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "30000",
    metric       = "f1_score",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_f1_30k",
    save_png     = True)

plt.show(th_f1_30k)

In [None]:
th_di_30k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "30000",
    metric       = "disparate_impact",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_di_30k",
    save_png     = True)

plt.show(th_di_30k)

###### 40,000

In [None]:
th_abr_40k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "40000",
    metric       = "abroca",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_abr_40k",
    save_png     = True)

plt.show(th_abr_40k)

In [None]:
th_acc_40k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "40000",
    metric       = "accuracy",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_acc_40k",
    save_png     = True)

plt.show(th_acc_40k)

In [None]:
th_f1_40k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "40000",
    metric       = "f1_score",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_f1_40k",
    save_png     = True)

plt.show(th_f1_40k)

In [None]:
th_di_40k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "40000",
    metric       = "disparate_impact",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_di_40k",
    save_png     = True)

plt.show(th_di_40k)

###### 50,000

In [None]:
th_abr_50k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "abroca",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_abr_50k",
    save_png     = True)

plt.show(th_abr_50k)

In [None]:
th_acc_50k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "accuracy",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_acc_50k",
    save_png     = True)

plt.show(th_acc_50k)

In [None]:
th_f1_50k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "f1_score",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_f1_50k",
    save_png     = True)

plt.show(th_f1_50k)

In [None]:
th_di_50k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "50000",
    metric       = "disparate_impact",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_di_50k",
    save_png     = True)

plt.show(th_di_50k)

###### 60,000

In [None]:
th_abr_60k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "60000",
    metric       = "abroca",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_abr_60k",
    save_png     = True)

plt.show(th_abr_60k)

In [None]:
th_acc_60k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "60000",
    metric       = "accuracy",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_acc_60k",
    save_png     = True)

plt.show(th_acc_60k)

In [None]:
th_f1_60k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "60000",
    metric       = "f1_score",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_f1_60k",
    save_png     = True)

plt.show(th_f1_60k)

In [None]:
th_di_60k = plot_experiment(
    eval_type    = "test",
    eval_dataset = "rural",
    eval_years   = "2018",
    threshold    = "60000",
    metric       = "disparate_impact",
    filters      = ([None, "rural", "2018"], True),
    forced_filename = "th_di_60k",
    save_png     = True)

plt.show(th_di_60k)
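
The sixteen threshold cells above differ only in `threshold`, `metric`, and `forced_filename`. As a minimal sketch, the same parameter sets could be generated in a loop. The names `THRESHOLDS`, `METRICS`, and `threshold_experiment_kwargs` are introduced here for illustration and are not part of the notebook. The call to `plot_experiment` is left as a comment so that the snippet runs on its own.

```python
from itertools import product

THRESHOLDS = ["30000", "40000", "50000", "60000"]
# metric name -> filename abbreviation, as used in the cells above
METRICS = {"abroca": "abr", "accuracy": "acc", "f1_score": "f1", "disparate_impact": "di"}

def threshold_experiment_kwargs():
    """Return one keyword-argument dict per threshold/metric combination."""
    runs = []
    for threshold, (metric, short) in product(THRESHOLDS, METRICS.items()):
        runs.append(dict(
            eval_type="test",
            eval_dataset="rural",
            eval_years="2018",
            threshold=threshold,
            metric=metric,
            filters=([None, "rural", "2018"], True),
            forced_filename=f"th_{short}_{int(threshold) // 1000}k",
            save_png=True,
        ))
    return runs

runs = threshold_experiment_kwargs()
print(len(runs))                    # 16
print(runs[0]["forced_filename"])   # th_abr_30k

# In the notebook, each dict could then be passed on:
# for kwargs in runs:
#     plot_experiment(**kwargs)
#     plt.show()
```

Keeping the explicit cells has the advantage that each plot can be re-run and inspected individually, so the loop is an alternative rather than a replacement.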

## Interactive Plotting

With the plotting function in place, we can build the dashboard that lets users explore the collected metrics without writing code. The dashboard is assembled from IPython widgets, which are defined in the cells below.

In [None]:
# helper function which generates a suitable filename from the current dropdown menu parameters
def get_filename_from_current_parameters():
    return f"{dropdown_eval_type.value}_{dropdown_eval_dataset.value}_{dropdown_eval_years.value}_{dropdown_threshold.value}_{dropdown_metric.value}"
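
To make the naming scheme concrete, here is a small sketch of the same rule as a pure function that takes the five parameters as arguments instead of reading them from the widgets. The function name `build_filename` is an assumption made for this example.

```python
def build_filename(eval_type, eval_dataset, eval_years, threshold, metric):
    """Join the five plot parameters with underscores, in the order used above."""
    parts = (eval_type, eval_dataset, eval_years, threshold, metric)
    return "_".join(str(part) for part in parts)

print(build_filename("test", "rural", "2018", "50000", "abroca"))
# test_rural_2018_50000_abroca
```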

In [None]:
# saving format checkboxes
checkbox_save_png = widgets.Checkbox(
    description='PNG',
    disabled=False,
    style={'description_width': 'initial'},
    layout=Layout(width="117px")
)      

checkbox_save_pdf = widgets.Checkbox(
    description='PDF',
    disabled=False,
    style={'description_width': 'initial'},
    layout=Layout(width="117px")
)      

In [None]:
# evaluation type dropdown menu
dropdown_eval_type = widgets.Dropdown(
    options=['train', 'test'],
    value='train',
    style={'description_width': 'initial'}
)

def on_change_dropdown_eval_type(change):
    if change['type'] == 'change' and change['name'] == 'value':
        dropdown_eval_dataset.options = update_dropdown_eval_dataset_by_filter(dropdown_eval_type.value)
        dropdown_eval_years.options = update_dropdown_eval_years_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value)
        dropdown_threshold.options = update_dropdown_threshold_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value, dropdown_eval_years.value)
        textfield_filename.value = get_filename_from_current_parameters()

dropdown_eval_type.observe(on_change_dropdown_eval_type)

In [None]:
# evaluation dataset dropdown menu

def update_dropdown_eval_dataset_by_filter(filter_eval_type):
    return list(dict.fromkeys(sorted(df_all[f"dataset_{filter_eval_type}"].to_list(), key=lambda x: dataset_name_to_pos(x, ignore_year=True))))

initial_options_dropdown_eval_dataset = update_dropdown_eval_dataset_by_filter(dropdown_eval_type.value)

dropdown_eval_dataset = widgets.Dropdown(
    options=initial_options_dropdown_eval_dataset,
    value=initial_options_dropdown_eval_dataset[0]
)

def on_change_dropdown_eval_dataset(change):
    if change['type'] == 'change' and change['name'] == 'value':
        dropdown_eval_years.options = update_dropdown_eval_years_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value)
        dropdown_threshold.options = update_dropdown_threshold_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value, dropdown_eval_years.value)
        textfield_filename.value = get_filename_from_current_parameters()

dropdown_eval_dataset.observe(on_change_dropdown_eval_dataset)
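
The option list above is built with `list(dict.fromkeys(sorted(...)))`: the values are sorted first, and `dict.fromkeys` then removes duplicates while keeping that order. A minimal sketch of the idiom follows, with a made-up `POSITION` table standing in for `dataset_name_to_pos` and hypothetical dataset names.

```python
# Hypothetical display order; the notebook derives it via dataset_name_to_pos
POSITION = {"east_coast": 0, "west_coast": 1, "rural": 2}

def unique_in_display_order(names):
    """Sort names by display position, then drop duplicates without reordering."""
    ordered = sorted(names, key=lambda name: POSITION[name])
    return list(dict.fromkeys(ordered))

column = ["rural", "east_coast", "rural", "west_coast", "east_coast"]
print(unique_in_display_order(column))
# ['east_coast', 'west_coast', 'rural']
```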

In [None]:
# evaluation years dropdown menu

def update_dropdown_eval_years_by_filter(filter_eval_type, filter_eval_dataset):
    filtered_years = df_all.loc[(df_all[f"dataset_{filter_eval_type}"]==filter_eval_dataset)][f"years_{filter_eval_type}"].to_list()
    return list(dict.fromkeys(sorted(filtered_years)))
    
initial_options_dropdown_eval_years = update_dropdown_eval_years_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value)

dropdown_eval_years = widgets.Dropdown(
    options=initial_options_dropdown_eval_years,
    value=initial_options_dropdown_eval_years[0]
)

def on_change_dropdown_eval_years(change):
    if change['type'] == 'change' and change['name'] == 'value':
        dropdown_threshold.options = update_dropdown_threshold_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value, dropdown_eval_years.value)
        textfield_filename.value = get_filename_from_current_parameters()

dropdown_eval_years.observe(on_change_dropdown_eval_years)

In [None]:
# threshold dropdown menu

def update_dropdown_threshold_by_filter(filter_eval_type, filter_eval_dataset, filter_eval_years):
    filtered_thresholds = df_all.loc[(df_all[f"dataset_{filter_eval_type}"]==filter_eval_dataset) &
                                     (df_all[f"years_{filter_eval_type}"]  ==filter_eval_years)][f"threshold_{filter_eval_type}"].to_list()
    return list(dict.fromkeys(sorted(filtered_thresholds)))

initial_options_dropdown_threshold = update_dropdown_threshold_by_filter(dropdown_eval_type.value, dropdown_eval_dataset.value, dropdown_eval_years.value)

dropdown_threshold = widgets.Dropdown(
    options=initial_options_dropdown_threshold,
    value=initial_options_dropdown_threshold[0]
)
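
The dataset, years, and threshold dropdowns form a cascade: each list of options is computed from the rows that match the selections made further up. The sketch below illustrates the idea with plain Python records instead of `df_all`. The records and the helper `options_for` are assumptions made for this example; only the column naming pattern (`dataset_test`, `years_test`, `threshold_test`) follows the notebook.

```python
# Made-up records standing in for rows of df_all
RECORDS = [
    {"dataset_test": "rural", "years_test": "2016", "threshold_test": "50000"},
    {"dataset_test": "rural", "years_test": "2018", "threshold_test": "30000"},
    {"dataset_test": "rural", "years_test": "2018", "threshold_test": "50000"},
    {"dataset_test": "east_coast", "years_test": "2016", "threshold_test": "50000"},
]

def options_for(records, column, **fixed):
    """Sorted unique values of `column` among records matching all `fixed` values."""
    matching = [r for r in records if all(r[k] == v for k, v in fixed.items())]
    return sorted({r[column] for r in matching})

datasets = options_for(RECORDS, "dataset_test")
years = options_for(RECORDS, "years_test", dataset_test="rural")
thresholds = options_for(RECORDS, "threshold_test",
                         dataset_test="rural", years_test="2018")
print(datasets)    # ['east_coast', 'rural']
print(years)       # ['2016', '2018']
print(thresholds)  # ['30000', '50000']
```

Because every level depends on the ones before it, a change in an upper dropdown has to refresh all lower ones, which is what the `on_change_*` callbacks above do.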

In [None]:
# metric dropdown menu
dropdown_metric = widgets.Dropdown(
    options=["abroca", "accuracy", "disparate_impact", "f1_score"],
    value="abroca"
)

def on_change_dropdown_metric(change):
    if change['type'] == 'change' and change['name'] == 'value':
        textfield_filename.value = get_filename_from_current_parameters()
    elif change['type'] == 'change' and change['name'] == 'options':
        if dropdown_metric.value not in ["abroca", "accuracy", "disparate_impact", "f1_score"]:
            dropdown_metric.value = "abroca"
        else:
            dropdown_metric.value = val_old

dropdown_metric.observe(on_change_dropdown_metric)

In [None]:
# legend location dropdown menu
legend_locations = [
    'best',
    'upper left', 'upper center', 'upper right',
    'center left', 'center', 'center right',
    'lower left', 'lower center', 'lower right',
    'none'
]

dropdown_legend_location = widgets.Dropdown(
    options=legend_locations,
    value=plt.rcParams["legend.loc"]
)

In [None]:
# run button
%matplotlib inline

button_run = widgets.Button(
    description="Display Plot",
    style={'description_width': 'initial',
           'font_weight': 'bold'},
    button_style='primary',
    layout=Layout(width='435px', height='40px')
)

def on_click_button_run(b):
    with output_plot:
        ax = plot_experiment(
            eval_type    = dropdown_eval_type.value,  
            eval_dataset = dropdown_eval_dataset.value,
            eval_years   = dropdown_eval_years.value,
            threshold    = dropdown_threshold.value,
            metric       = dropdown_metric.value,
            save_png     = True,
            save_pdf     = False,
            forced_legend_loc = dropdown_legend_location.value
        )

        output_plot.clear_output()
        plt.show(ax)
    
button_run.on_click(on_click_button_run)

In [None]:
# filename text field and saving button

textfield_filename = Text(get_filename_from_current_parameters())

button_save = widgets.Button(
    description="Save",
    style={'description_width': 'initial'},
    button_style='success',
    layout=Layout(width='67px')
)

def on_click_button_save(b):
    with output_stash:
        ax = plot_experiment(
            eval_type    = dropdown_eval_type.value,  
            eval_dataset = dropdown_eval_dataset.value,
            eval_years   = dropdown_eval_years.value,
            threshold    = dropdown_threshold.value,
            metric       = dropdown_metric.value,
            save_png     = checkbox_save_png.value,
            save_pdf     = checkbox_save_pdf.value,
            forced_legend_loc = dropdown_legend_location.value,
            forced_filename = textfield_filename.value
        )

        plt.close()

button_save.on_click(on_click_button_save)

In [None]:
# filename reset button
button_reset = widgets.Button(
    description="Reset",
    style={'description_width': 'initial'},
    button_style='info',
    layout=Layout(width='80px')
)

def on_click_button_reset(b):
    textfield_filename.value = get_filename_from_current_parameters()

button_reset.on_click(on_click_button_reset)

In [None]:
# advanced metrics checkbox
checkbox_advanced_metrics = widgets.Checkbox(
    description='Show Advanced Metrics',
    disabled=False,
    style={'description_width': 'initial'},
    layout=Layout(width="200px")
)

val_old = None

def on_trait_change_checkbox_advanced_metrics(b):
    global val_old
    val_old = dropdown_metric.value
    simple_metrics = ["abroca", "accuracy", "disparate_impact", "f1_score"]
    if checkbox_advanced_metrics.value:
        # convert the pandas Index to a plain list before handing it to the widget
        dropdown_metric.options = list(df_all.columns[8:])
    else:
        if val_old not in simple_metrics: val_old = "abroca"
        dropdown_metric.options = simple_metrics

checkbox_advanced_metrics.observe(on_trait_change_checkbox_advanced_metrics)
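
Swapping the option list of a dropdown resets its selection, which is why the previously selected metric is remembered in `val_old`. The rule for choosing the selection afterwards can be sketched as a pure function. This is an illustration of the idea under simplified assumptions, not a drop-in replacement for the callbacks; `resolve_selection` and the advanced metric name are made up.

```python
SIMPLE_METRICS = ["abroca", "accuracy", "disparate_impact", "f1_score"]

def resolve_selection(previous, new_options, default="abroca"):
    """Keep `previous` if it is still offered, otherwise fall back to `default`."""
    if previous in new_options:
        return previous
    return default if default in new_options else new_options[0]

advanced = SIMPLE_METRICS + ["some_advanced_metric"]  # hypothetical column name
print(resolve_selection("f1_score", advanced))
# f1_score
print(resolve_selection("some_advanced_metric", SIMPLE_METRICS))
# abroca
```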

In [None]:
%%capture
# app
output_plot = Output()
output_stash = Output()

header = HTML("<h1 style='font-family:verdana'>Interactive Fairness Monitoring</h1>",
              layout=Layout(height='auto',
                            margin='0cm 0cm 0cm 5cm'))

vbox_saving = VBox([HTML("<b style='font-size:18px'>Saving Name and Format</b>"),
                    HBox([textfield_filename,
                          button_reset],
                         layout=Layout(width="305px")),
                    HBox([checkbox_save_png,
                          checkbox_save_pdf,
                          button_save],
                         layout=Layout(width="305px"))                 
                ])

vbox_legend_pos = VBox([HTML("<b style='font-size:18px'>Legend Position</b>"),
                        dropdown_legend_location])

app_layout = AppLayout(  
    header=header,

    left_sidebar=HBox([
        HTML("<style>.left-one {margin-left: 1cm;}</style>"),
        VBox([
            HTML("<b style='font-size:18px'>Parameters</b>"),
            Label("Evaluation Mode"),
            Label("Evaluation Dataset"),
            Label("Evaluation Year"),
            Label("Income Threshold"),
            Label("Metric"),
            Label()
        ]).add_class("left-one")
    ]),
    
    center=VBox([
        Label(),
        dropdown_eval_type,
        dropdown_eval_dataset,
        dropdown_eval_years,
        dropdown_threshold,
        dropdown_metric,
        checkbox_advanced_metrics
    ]),
    
    right_sidebar=VBox([
        vbox_legend_pos,
        Label(),
        vbox_saving,
    ]),
    
    pane_widths=['160px', '350px', 1],
    grid_gap="10px"
)

button_run.add_class("top-three")
button_run.add_class("left-one")
output_plot.add_class("top-three")
output_plot.add_class("left-one")


---


# 🔎 **Interaction**

Once the setup has finished, you can <font color='orange'>run the cell below</font> to start the interactive dashboard.

Here, you can easily explore our data by creating your own plots. Simply select the parameters you are interested in and click <i>'Display Plot'</i>; the corresponding plot is then generated, provided there are metrics for the selected parameters.

**Interpretation Guidelines**

<table>
  <tr>
    <th>Parameter</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td>Evaluation Mode</td>
    <td>Compare different models tested on the same dataset => mode <i>'test'</i> <br></br>Compare one model, trained on a single dataset, across multiple test datasets => mode <i>'train'</i></td>
  </tr>
  <tr>
    <td>Evaluation Dataset</td>
    <td>For which dataset do you want to explore metrics?</td>
  </tr>
  <tr>
    <td>Evaluation Year</td>
    <td>For which years do you want to explore metrics?</td>
  </tr>
  <tr>
    <td>Income Threshold</td>
    <td>For which threshold do you want to explore metrics?</td>
  </tr>
  <tr>
    <td>Metric</td>
    <td>Which metric do you want to plot?<br></br>Click <i>'Show Advanced Metrics'</i> to see all collected metrics.</td>
  </tr>
  <tr>
    <td>Legend Position</td>
    <td>(Where) do you want the legend to be placed?<br></br>Choose <i>'best'</i> for automatic placement.</td>
  </tr>  
</table>
<br/><br/>
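
To illustrate the difference between the two evaluation modes, the following sketch fixes a dataset on one side (train or test) and lists the datasets that appear on the other side. The run records and the helper `compared_datasets` are assumptions made for this example; only the column names `dataset_train` and `dataset_test` follow the notebook.

```python
# Made-up run records; only the two dataset columns matter here
RUNS = [
    {"dataset_train": "east_coast", "dataset_test": "rural"},
    {"dataset_train": "west_coast", "dataset_test": "rural"},
    {"dataset_train": "rural", "dataset_test": "rural"},
    {"dataset_train": "rural", "dataset_test": "east_coast"},
]

def compared_datasets(runs, eval_type, eval_dataset):
    """Fix `eval_dataset` on the `eval_type` side and list the other side."""
    other = "train" if eval_type == "test" else "test"
    return [r[f"dataset_{other}"] for r in runs
            if r[f"dataset_{eval_type}"] == eval_dataset]

# mode 'test': models trained on different datasets, all tested on "rural"
print(compared_datasets(RUNS, "test", "rural"))
# ['east_coast', 'west_coast', 'rural']

# mode 'train': one model trained on "rural", tested on several datasets
print(compared_datasets(RUNS, "train", "rural"))
# ['rural', 'east_coast']
```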

**Have fun exploring the metrics!** 

In [None]:
display(app_layout)
display(HTML("<style>.top-three {margin-top: .5cm;}</style>"))
display(HBox([button_run, HTML("<style>.left-onefive {margin-left: 1.45cm;}</style>")]))
display(output_plot)