Demo: visualizing significance tests of dataset metrics across different LMMs

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

sys.path.append(os.path.join(os.getcwd(), "..","..","src"))
from unitxt.metric_paired_significance import PairedDifferenceTest

Assume we have a dataset of records (e.g., questions with ground-truth answers) together with predictions from several models we want to compare.  For each record, we can compute a metric that scores the quality of a model's prediction (e.g., recall, precision, S-BERT similarity).  Given a metric, we can compare model A with model B by testing whether the metric values differ significantly between the two models (two-sided test), or whether one model's values are significantly higher or lower (one-sided test).  The metric values are *paired*: every model is scored on the same underlying records, rather than each model being scored on its own independent random sample.  The tests conducted therefore assume paired observations.
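
To make the idea of a paired test concrete, here is a minimal standard-library sketch of a two-sided sign-flip permutation test on per-record differences. It is an illustration only: the function name `paired_sign_flip_test` and the scores are hypothetical, and this is not necessarily the test that `PairedDifferenceTest` runs internally.

```python
import random


def paired_sign_flip_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis that the two models perform the same,
    each per-record difference is equally likely to be positive or negative.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # randomly flip the sign of each difference and recompute the mean
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_resamples + 1)


# hypothetical per-record metric values for two models on the same 10 records
model_a = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.77, 0.69, 0.63]
model_b = [0.58, 0.66, 0.53, 0.74, 0.65, 0.69, 0.55, 0.72, 0.66, 0.60]

p_value = paired_sign_flip_test(model_a, model_b)
print(f"paired p-value: {p_value:.4f}")
```

Model A beats model B on every record here, so the paired test reports a small p-value even though the two sets of scores overlap heavily; a test that ignored the pairing would have much less power on the same data.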

Significance is assessed in two ways:
- The p-value, which takes values in [0,1]; lower values indicate stronger evidence of a difference
- The effect size, which measures the magnitude of the difference and offsets the p-value's tendency to flag even small differences as significant when the sample size is large
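
The contrast between the two measures can be sketched with the standard library alone. The example below assumes one common paired variant of Cohen's d (mean difference divided by the standard deviation of the differences) and a normal-approximation p-value; the library may compute either quantity differently, and the data are made up.

```python
import math
from statistics import mean, stdev


def paired_cohens_d(diffs):
    """Effect size for paired data: mean difference over SD of the differences."""
    return mean(diffs) / stdev(diffs)


def approx_two_sided_p(diffs):
    """Normal-approximation p-value for H0: the mean difference is zero."""
    z = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return math.erfc(abs(z) / math.sqrt(2))


# hypothetical per-record differences (model A minus model B)
base = [0.3, -0.2, 0.1, -0.1, 0.15, -0.05, 0.2, -0.15]
small = base * 5    # 40 records
large = base * 500  # 4000 records with the same distribution of differences

for name, diffs in [("n=40", small), ("n=4000", large)]:
    d = paired_cohens_d(diffs)
    p = approx_two_sided_p(diffs)
    print(f"{name}: effect size d = {d:.3f}, p-value = {p:.3g}")
```

The effect size is nearly identical for both sample sizes, while the p-value drops from clearly non-significant to extremely small as the number of records grows.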

In [None]:
# read in the data
data = {mm: pd.read_csv('{}.csv'.format(mm)) for mm in ["recall", "precision", "sbert"]}
print(data['recall'].head())
# set up pairwise tests between all combinations of models
tester_full = PairedDifferenceTest(nmodels=6, model_names=data["recall"].columns)

# perform test on each metric separately
test_results_twoside = [tester_full.signif_pair_diff(samples_list=[vv[cc].to_numpy() for cc in vv.columns], metric_name=kk) for kk, vv in data.items()]
# perform one-sided test (for example)
test_results_leftside = [tester_full.signif_pair_diff(samples_list=[vv[cc].to_numpy() for cc in vv.columns], metric_name=kk, alternative='less') for kk, vv in data.items()]

Display the objects that hold the test results.

In [None]:
print("two-sided (metrics are different)")
print(test_results_twoside[0])
print("left-sided (model A < model B)")
print(test_results_leftside[0])
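
How the choice of alternative changes the p-value can be sketched with a simple normal approximation on the paired differences. This is an illustration under stated assumptions: `paired_z_pvalue` is a hypothetical helper, not part of the library, the normal approximation is crude for so few records, and the scores are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_z_pvalue(a, b, alternative="two-sided"):
    """Normal-approximation p-value for paired samples a and b.

    alternative="less" tests whether a tends to be lower than b,
    alternative="greater" tests whether a tends to be higher than b.
    """
    diffs = [x - y for x, y in zip(a, b)]
    z = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    cdf = NormalDist().cdf(z)
    if alternative == "less":
        return cdf
    if alternative == "greater":
        return 1 - cdf
    if alternative == "two-sided":
        return 2 * min(cdf, 1 - cdf)
    raise ValueError(f"unknown alternative: {alternative!r}")


# hypothetical per-record scores; model A is lower than model B on every record
model_a = [0.50, 0.61, 0.47, 0.66, 0.58, 0.52, 0.63, 0.55]
model_b = [0.56, 0.64, 0.55, 0.69, 0.66, 0.55, 0.71, 0.60]

p_two = paired_z_pvalue(model_a, model_b)
p_less = paired_z_pvalue(model_a, model_b, alternative="less")
p_greater = paired_z_pvalue(model_a, model_b, alternative="greater")
print(f"two-sided: {p_two:.3g}, less: {p_less:.3g}, greater: {p_greater:.3g}")
```

When the observed difference points in the tested direction, the one-sided p-value is half the two-sided one; in the opposite direction it is close to 1.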

A heatmap summarizes the pairwise tests between models across multiple metrics.  It shows which model pairs differ significantly on which metrics.  More significant comparisons are displayed higher in the heatmap.

In [None]:
# heatmap comparing multiple metrics
tester_full.multiple_metrics_significance_heatmap(test_results_twoside, optimize_color=False)
print("color optimized")
tester_full.multiple_metrics_significance_heatmap(test_results_twoside)

tester_full.multiple_metrics_significance_heatmap(test_results_leftside)
# use Cohen's d effect size instead of p-values
tester_full.multiple_metrics_significance_heatmap(test_results_twoside, use_pvalues=False, optimize_color=False)
print("color optimized")
tester_full.multiple_metrics_significance_heatmap(test_results_twoside, use_pvalues=False)

tester_full.multiple_metrics_significance_heatmap(test_results_leftside, use_pvalues=False)

Plot a connected graph, where nodes represent models.
The vertical position of each node corresponds to the mean value of the metric.
Nodes that are *not* significantly different are connected by an edge; thicker edges mean a less significant difference.  For example, if all comparisons are significant, all nodes appear unconnected, with no edges.  The graph lets us visualize groupings of similarly performing models.
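
The grouping idea can be sketched without any plotting: treat each non-significant pair as an edge and collect the connected components. The helper `similar_model_groups`, the threshold `alpha=0.05`, and the p-values below are all hypothetical; the library's own graph construction may differ.

```python
def similar_model_groups(models, p_values, alpha=0.05):
    """Group models linked by non-significant pairwise comparisons.

    p_values maps a (model, model) pair to the p-value of that comparison.
    Pairs with p >= alpha get an edge; each connected component is a group.
    """
    neighbors = {m: set() for m in models}
    for (a, b), p in p_values.items():
        if p >= alpha:  # not significantly different: connect the pair
            neighbors[a].add(b)
            neighbors[b].add(a)

    groups, seen = [], set()
    for start in models:
        if start in seen:
            continue
        # depth-first search to collect everything reachable from `start`
        stack, group = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            group.append(current)
            for nb in sorted(neighbors[current]):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(sorted(group))
    return groups


# hypothetical p-values for four models
models = ["A", "B", "C", "D"]
p_values = {
    ("A", "B"): 0.40, ("A", "C"): 0.01, ("A", "D"): 0.002,
    ("B", "C"): 0.03, ("B", "D"): 0.004, ("C", "D"): 0.20,
}

groups = similar_model_groups(models, p_values)
print(groups)  # [['A', 'B'], ['C', 'D']]
```

Non-significance is not transitive, so a component can chain together models whose extremes do differ significantly; the drawn graph makes such chains visible, whereas this summary does not.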

In [None]:
# color-code the nodes by one part of each model name (the retrieval and model parts are separated by '_')
# graph shows NOT SIGNIFICANT pairwise comparisons connected by an edge

node_color_levels = [mn.split('_')[1] for mn in tester_full.model_names]

for res in test_results_twoside + test_results_leftside:
    tester_full.metric_significant_pairs_graph(test_res=res, model_name_split_char="_", node_color_levels=node_color_levels, weight_edges=True)

A lineplot shows p-values by connecting each pair of compared models with a line segment.  The models are first arranged along the y-axis according to the sample mean of the metric, and each model is assigned its own color.  The horizontal position of each segment is the p-value of the comparison, with more significant comparisons placed to the left.  With this method, significant comparisons are harder to pick out because the segments tend to overplot one another, so we recommend the connected graph above.

In [None]:
# lineplot shows p-values of pairwise comparisons by significance
# significant pairs are shown toward the left side of the plot
for vv in test_results_twoside:
    tester_full.pvalue_lineplot(vv)