## Validation tasks - comparison of selected task distance candidates with previous work

This notebook uses the experimental results of the VALIDATION tasks and investigates how the selected candidates (see "1_develop_comparson.ipynb") perform in comparison to previous task distance measures.

In [2]:
import mml.interactive
mml.interactive.init()
import pandas as pd
from mml_tf.aggregate import AggregateStrategy
from mml_tf.distances import LoadCachedDistances, map_dist2printable
from mml_tf.evaluation import get_setup_stability_score, get_win_rates, get_evaluations
from mml_tf.experiments import EXPERIMENTS
from mml_tf.ranking import BootstrapRanking
from mml_tf.visualization import init_colors, get_dist_measure_color
from mml_tf.paths import FIG_PATH
from typing import List, Tuple, Optional, Sequence
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from plotly.subplots import make_subplots


 _____ ______   _____ ______   ___
|\   _ \  _   \|\   _ \  _   \|\  \
\ \  \\\__\ \  \ \  \\\__\ \  \ \  \
 \ \  \\|__| \  \ \  \\|__| \  \ \  \
  \ \  \    \ \  \ \  \    \ \  \ \  \____
   \ \__\    \ \__\ \__\    \ \__\ \_______\
    \|__|     \|__|\|__|     \|__|\|_______|
         ____  _  _    __  _  _  ____  _  _
        (  _ \( \/ )  (  )( \/ )/ ___)( \/ )
         ) _ ( )  /    )( / \/ \\___ \ )  /
        (____/(__/    (__)\_)(_/(____/(__/
Interactive MML API initialized.


In [4]:
# load distances and set plot style
comp_distances = [
    'SEMANTIC', # manual baseline
    'VDNA-PP:NN-1000-BINS', # VDNA paper
    'FED', # FED paper
    'FID', # Frechet Inception Distance
    'KLD-PP:NN', # P2L paper (no sample weighting here)
    'KLD-PP:NS-W:TS-100-BINS', # bKLD(small,target)
    'KLD-PP:NS-W:SN-1000-BINS', # bKLD(large,source)
    'KLD-PP:NS-1000-BINS', # bKLD(large,unweighted)
]
all_distances = {name: LoadCachedDistances(name) for name in comp_distances}
plot_order = [map_dist2printable[d] for d in comp_distances]
init_colors(exp=EXPERIMENTS, distance_measures=plot_order + [dist for dist in all_distances.keys() if dist not in comp_distances])
color_map = {dist: get_dist_measure_color(dist) for dist in plot_order}
symbol_map = {d: 'circle' for d in plot_order}
symbol_map[map_dist2printable['KLD-PP:NS-W:SN-1000-BINS']] = 'hexagon'
symbol_map[map_dist2printable['KLD-PP:NS-1000-BINS']] = 'hexagram'
dash_map = {d: 'solid' for d in plot_order}
dash_map[map_dist2printable['KLD-PP:NS-W:SN-1000-BINS']] = 'dash'
dash_map[map_dist2printable['KLD-PP:NS-1000-BINS']] = 'dot'

### Step 1: Setup stability

We evaluate the setup stability (see https://arxiv.org/pdf/2204.01403) for two cases: first for the triple evaluation approach, in which each repetition of the transfer experiments is evaluated separately, and then for the aggregated transfer results.

In [32]:
tripled_evaluations = get_evaluations(all_distances=all_distances.values(), aggregates=[AggregateStrategy.FIRST, AggregateStrategy.SECOND, AggregateStrategy.THIRD], top_meta_metrics=['regret', 'rank', 'delta', 'gain'], top_mode='avg', top_k=3)

Calculating...: 100%|██████████| 8256/8256 [00:56<00:00, 146.64it/s]


In [34]:
components = ['target', 'metric', 'exp', 'meta metric', 'seed'] 
ss_score = {}
for varying in components:
  fixing = [c for c in components if c != varying]
  ss_score[varying] = get_setup_stability_score(tripled_evaluations[tripled_evaluations['meta metric'] != 'gain'], varying=varying, fixing=fixing)

ss_score

common ('barretts_esophagus_diagnosis', 'Co-Training<br>Data', 'regret', 'first') and varying metric is ONLY NAN VALUES
common ('barretts_esophagus_diagnosis', 'Co-Training<br>Data', 'regret', 'third') and varying metric is ONLY NAN VALUES


common ('barretts_esophagus_diagnosis', 'AUROC', 'Co-Training<br>Data', 'regret') and varying seed is ONLY NAN VALUES



{'target': 0.02321746068113792,
 'metric': 0.32243389341191336,
 'exp': 0.007807748369157206,
 'meta metric': 0.6127320273404088,
 'seed': 0.11677354825994406}

The setup stability across transfer repetitions (weightedtau ~ 0.12) is higher than across target tasks (weightedtau ~ 0.02) or transfer scenarios (weightedtau ~ 0.01), so the rankings vary more with the target task and the scenario than with the repetition. Reassuringly, the rankings are considerably more consistent across base metrics (weightedtau ~ 0.32) and meta metrics (weightedtau ~ 0.61).
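To make the notion of setup stability more concrete, here is a minimal sketch. It is not the implementation of `get_setup_stability_score`: it merely assumes that the score is the NaN-ignoring mean of pairwise weighted Kendall's tau values (`scipy.stats.weightedtau`) between the scores that the distance measures obtain under each pair of values of the varying component. The helper `toy_setup_stability` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import weightedtau


def toy_setup_stability(scores: pd.DataFrame) -> float:
    """Mean pairwise weighted tau between the columns of `scores`.

    Rows are distance measures, columns are the values of the varying
    component (e.g. three seeds); all other components are assumed fixed.
    """
    tau_values = []
    for col_a, col_b in combinations(scores.columns, 2):
        # weightedtau returns (correlation, pvalue); only the correlation is used
        tau, _ = weightedtau(scores[col_a].to_numpy(), scores[col_b].to_numpy())
        tau_values.append(tau)
    # NaN-ignoring mean (assumption, mirroring the np.nanmean hint in the output above)
    return float(np.nanmean(tau_values))


# hypothetical scores of three distance measures under three seeds
toy_scores = pd.DataFrame(
    {"seed_1": [0.9, 0.5, 0.1], "seed_2": [0.8, 0.6, 0.2], "seed_3": [0.7, 0.4, 0.3]},
    index=["dist_a", "dist_b", "dist_c"],
)
toy_stability = toy_setup_stability(toy_scores)
```

In this toy table all three seeds order the measures identically, so every pairwise tau, and hence the score, is 1. Values near 0, as observed for target tasks and scenarios, would indicate that the orderings are largely unrelated.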

In [35]:
single_evaluations = get_evaluations(all_distances=all_distances.values(), top_meta_metrics=['regret', 'rank', 'delta', 'gain'], top_mode='avg', top_k=3)

Calculating...: 100%|██████████| 2752/2752 [00:18<00:00, 146.94it/s]


In [36]:
components = ['target', 'metric', 'exp', 'meta metric'] 
ss_score = {}
for varying in components:
  fixing = [c for c in components if c != varying]
  ss_score[varying] = get_setup_stability_score(single_evaluations[single_evaluations['meta metric'] != 'gain'], varying=varying, fixing=fixing)

ss_score

{'target': 0.028777503859202146,
 'metric': 0.40207181455374924,
 'exp': 0.029645823237488902,
 'meta metric': 0.6023195632783143}

Compared to the triple evaluation case, aggregating the transfer experiment repetitions beforehand yields roughly equal stability across target tasks (+0.006) and slightly higher stability across transfer scenarios (+0.022). Stability across base metrics increases noticeably (+0.08), while stability across meta metrics decreases slightly (-0.01).
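The differences can be recomputed directly from the two printed dictionaries. The snippet below is a small worked calculation; the values are copied from the outputs above and the variable names are chosen for illustration only.

```python
# setup stability scores copied from the two outputs above ('seed' exists only in the tripled case)
tripled_scores = {
    "target": 0.02321746068113792,
    "metric": 0.32243389341191336,
    "exp": 0.007807748369157206,
    "meta metric": 0.6127320273404088,
}
single_scores = {
    "target": 0.028777503859202146,
    "metric": 0.40207181455374924,
    "exp": 0.029645823237488902,
    "meta metric": 0.6023195632783143,
}

# change in stability when repetitions are aggregated before evaluation
stability_change = {key: round(single_scores[key] - tripled_scores[key], 3) for key in single_scores}
print(stability_change)
# {'target': 0.006, 'metric': 0.08, 'exp': 0.022, 'meta metric': -0.01}
```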

### Step 2: Win rate analysis

Win rates were introduced in https://arxiv.org/pdf/2204.01403 and compare task similarity measures across a wide range of setup configurations.

In [39]:
def show_win_rates(dist_names: Sequence[str]) -> pd.DataFrame:
    # restrict to the requested distances and exclude the 'gain' meta metric
    evals = tripled_evaluations[tripled_evaluations['distances'].isin(list(dist_names)) & (tripled_evaluations['meta metric'] != 'gain')]
    win_rates_df = pd.DataFrame()
    for exp in EXPERIMENTS:
        win_rates_df[exp] = get_win_rates(evals[evals['exp'] == exp])
    win_rates_df['mean'] = win_rates_df.mean(axis=1)
    return win_rates_df

In [40]:
(show_win_rates(all_distances) * 100).sort_values('mean').round(decimals=2)

distance,Model<br>Architecture,Pretraining<br>Data,Augmentation<br>Policy,Co-Training<br>Data,mean
FID,15.02,13.86,12.6,10.66,13.03
KLD-PP:NN,15.21,14.92,13.66,10.17,13.49
FED,15.89,15.21,11.14,12.79,13.76
VDNA-PP:NN-1000-BINS,13.66,18.41,16.09,12.21,15.09
SEMANTIC,20.64,19.86,11.34,16.96,17.2
KLD-PP:NS-W:TS-100-BINS,14.53,25.58,23.84,17.25,20.3
KLD-PP:NS-1000-BINS,14.53,25.39,22.19,20.64,20.69
KLD-PP:NS-W:SN-1000-BINS,34.98,19.86,27.71,32.66,28.8


In [41]:
# note: the win rates sum to more than 100% because of ties
show_win_rates(comp_distances).sum()

Model<br>Architecture     1.444767
Pretraining<br>Data       1.531008
Augmentation<br>Policy    1.385659
Co-Training<br>Data       1.333333
mean                      1.423692
dtype: float64
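Why the column sums exceed 1 can be illustrated with a toy example. This is only a sketch under the assumption that a measure "wins" a setup whenever it attains the best score in that setup, with every tied measure counted as a winner. It is not the implementation of `get_win_rates`; the helper `toy_win_rates` and the data are hypothetical, and higher scores are assumed to be better.

```python
import pandas as pd


def toy_win_rates(scores: pd.DataFrame) -> pd.Series:
    """Fraction of setups (rows) in which each measure (column) attains the row maximum."""
    is_winner = scores.eq(scores.max(axis=1), axis=0)  # ties mark several winners per row
    return is_winner.mean(axis=0)


# four hypothetical setups; the first two contain a tie for the best score
toy_scores = pd.DataFrame(
    {
        "dist_a": [0.9, 0.5, 0.7, 0.2],
        "dist_b": [0.9, 0.4, 0.1, 0.6],
        "dist_c": [0.3, 0.5, 0.2, 0.1],
    }
)
toy_rates = toy_win_rates(toy_scores)
print(toy_rates.sum())  # 1.5, i.e. 150% because two of the four setups have two winners
```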

### Step 3: Compare absolute values of results

In [42]:
# absolute meta metric table
def get_absolute_meta_values(full_evaluations: pd.DataFrame, meta_metric: str, reverse_regret: bool = False) -> pd.DataFrame:
    meta_values_df = pd.DataFrame()
    for exp in EXPERIMENTS:
        sub_df = full_evaluations[(full_evaluations['exp'] == exp) & (full_evaluations['meta metric'] == meta_metric)]
        for group_name, group_df in sub_df.groupby('distances'):
            if meta_metric == 'regret' and reverse_regret:
                val_series = 1 - group_df['score']
            else:
                val_series = group_df['score']
            meta_values_df.at[group_name, exp + '-mean'] = val_series.mean()
            meta_values_df.at[group_name, exp + '-std'] = val_series.std()
    meta_values_df['mean'] = meta_values_df[[exp + '-mean' for exp in EXPERIMENTS]].mean(axis=1)
    return meta_values_df.sort_values(by='mean', ascending=False)

In [43]:
reversed_regret_df = get_absolute_meta_values(full_evaluations=tripled_evaluations, meta_metric='regret', reverse_regret=True)
aligned_regret_df = get_absolute_meta_values(full_evaluations=tripled_evaluations, meta_metric='regret', reverse_regret=False)
reversed_regret_df

distance,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
SEMANTIC,0.209295,0.178388,0.267078,0.216593,0.271664,0.223355,0.255173,0.211627,0.250803
FED,0.217104,0.190314,0.255387,0.209498,0.26259,0.220438,0.258383,0.214018,0.248366
KLD-PP:NS-W:SN-1000-BINS,0.201879,0.172145,0.281343,0.199631,0.251487,0.196773,0.25808,0.217647,0.248197
VDNA-PP:NN-1000-BINS,0.222161,0.187829,0.25308,0.209676,0.252538,0.210621,0.258111,0.212734,0.246473
KLD-PP:NN,0.2222,0.196256,0.254521,0.211576,0.250956,0.209247,0.258162,0.206372,0.24646
FID,0.221051,0.193798,0.252701,0.203469,0.245827,0.198779,0.255462,0.205751,0.24376
KLD-PP:NS-1000-BINS,0.222943,0.196503,0.250836,0.205636,0.24577,0.202491,0.248436,0.207845,0.241996
KLD-PP:NS-W:TS-100-BINS,0.224697,0.199942,0.248471,0.202656,0.244026,0.203254,0.250519,0.208444,0.241928


In [44]:
rank_df = get_absolute_meta_values(full_evaluations=tripled_evaluations, meta_metric='rank')
rank_df

distance,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.487703,0.165807,0.631475,0.178851,0.562785,0.171492,0.511841,0.187924,0.548451
KLD-PP:NS-1000-BINS,0.490268,0.167235,0.626058,0.181913,0.558253,0.168636,0.516741,0.187641,0.54783
FID,0.486135,0.158128,0.60859,0.175298,0.556355,0.160559,0.509273,0.167104,0.540088
KLD-PP:NN,0.488392,0.159993,0.613576,0.165885,0.547516,0.173574,0.506589,0.157054,0.539018
VDNA-PP:NN-1000-BINS,0.481051,0.165324,0.619323,0.180393,0.547284,0.182798,0.495522,0.164291,0.535795
FED,0.496016,0.154828,0.611311,0.163584,0.525618,0.175057,0.499252,0.178252,0.533049
SEMANTIC,0.512213,0.179149,0.590038,0.183498,0.501288,0.170913,0.505639,0.180506,0.527295
KLD-PP:NS-W:SN-1000-BINS,0.529449,0.174184,0.512074,0.168159,0.522981,0.168882,0.518813,0.173972,0.52083


In [45]:
weightedtau_df = get_absolute_meta_values(full_evaluations=tripled_evaluations, meta_metric='weightedtau')
weightedtau_df

distance,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.069241,0.130738,0.101046,0.16172,0.105286,0.179908,-0.011202,0.162172,0.066093
KLD-PP:NS-1000-BINS,0.069807,0.128245,0.097995,0.163935,0.10277,0.176022,-0.013545,0.164635,0.064257
KLD-PP:NS-W:SN-1000-BINS,0.084286,0.14246,-0.000249,0.180443,0.026336,0.176739,0.020983,0.162898,0.032839
VDNA-PP:NN-1000-BINS,0.042119,0.141304,0.070049,0.157239,-0.012196,0.190926,-0.009478,0.141505,0.022623
FID,0.037082,0.139233,0.061096,0.158157,-0.009736,0.172353,-0.023091,0.150118,0.016338
SEMANTIC,0.04357,0.150731,0.076728,0.169545,-0.056338,0.137762,-0.028704,0.148805,0.008814
FED,0.022515,0.143864,0.055493,0.143864,-0.035473,0.153865,-0.027934,0.159048,0.00365
KLD-PP:NN,0.0286,0.136193,0.055824,0.155724,-0.049582,0.149518,-0.021013,0.141721,0.003457


In [46]:
gain_df = get_absolute_meta_values(tripled_evaluations, 'gain')
gain_df

distance,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-1000-BINS,0.652455,0.379165,0.410853,0.382297,0.611111,0.39661,0.621447,0.388024,0.573966
KLD-PP:NS-W:TS-100-BINS,0.647287,0.378935,0.413437,0.381761,0.614987,0.393307,0.616279,0.391823,0.572997
VDNA-PP:NN-1000-BINS,0.642119,0.374039,0.422481,0.383708,0.602067,0.39687,0.617571,0.380231,0.571059
FID,0.647287,0.375497,0.403101,0.376973,0.604651,0.392906,0.618863,0.377544,0.568475
KLD-PP:NN,0.647287,0.373187,0.403101,0.378118,0.599483,0.400777,0.613695,0.382544,0.565891
KLD-PP:NS-W:SN-1000-BINS,0.687339,0.353482,0.320413,0.369393,0.618863,0.400871,0.627907,0.381428,0.56363
FED,0.648579,0.363273,0.395349,0.377187,0.585271,0.400829,0.613695,0.379138,0.560724
SEMANTIC,0.651163,0.384212,0.399225,0.3771,0.54522,0.393281,0.627907,0.382559,0.555879


In [51]:
# create a spider plot from absolute values
def get_spider_plot(distances_names: List[str], meta_names: Tuple[str, ...] = ('gain', 'regret', 'rank')) -> go.Figure:
    # lookup of the absolute value tables computed above
    meta_dfs = {'weightedtau': weightedtau_df, 'gain': gain_df, 'regret': aligned_regret_df, 'rank': rank_df}
    spider_rows = []
    for exp in EXPERIMENTS:
        for dist in distances_names:
            for meta in meta_names:
                # one axis per scenario and meta metric (or per scenario if there is a single meta metric)
                theta = exp + '-' + meta if len(meta_names) > 1 else exp
                spider_rows.append({'distance': dist, 'meta': meta, 'theta': theta, 'exp': exp, 'score': meta_dfs[meta].at[dist, f'{exp}-mean']})
    spider_df = pd.DataFrame(spider_rows).sort_values(by='theta').replace(map_dist2printable)
    fig = px.line_polar(
        spider_df, r='score', theta='theta', color='distance', template='plotly', color_discrete_map=color_map,
        symbol='distance', line_dash='distance', line_dash_map=dash_map, symbol_map=symbol_map,
        category_orders={'theta': spider_df['theta'].sort_values().unique().tolist(), 'distance': [map_dist2printable[d] for d in comp_distances]},
        markers=True, labels={'distance': 'Task selector'}, line_close=True,
        log_r='weightedtau' not in meta_names,  # weightedtau can be negative, so it gets a linear radial axis
        width=1200, height=500,
    )
    fig.update_layout(font_size=20)
    fig.update_polars(radialaxis_nticks=5)
    fig.update_traces(line_width=3)
    return fig

In [53]:
get_spider_plot(comp_distances)

In [54]:
get_spider_plot(comp_distances, meta_names=('weightedtau',))

### Step 4: Create rankings according to "mean then rank" (with bootstrapping)

In [56]:
def get_bubble_ranking_plot(
        distances: List[str], 
        meta_metric: str = "regret", 
        exp: Optional[str] = None,  # if None all exps are used
        use_median: bool = False,  # if True median instead of mean is used upon ranks
        inverted: bool = False,  # inverts orientation of the y-axis (such that low ranks are at the top)
        tripled: bool = True  # whether to use the tripled or single evaluations
) -> Tuple[go.Figure, pd.DataFrame]:
    rr_df = pd.DataFrame()
    base_evals = tripled_evaluations if tripled else single_evaluations
    sub_evals = base_evals[base_evals['distances'].isin(distances) & (base_evals['meta metric'] == meta_metric)]
    if exp is not None:
        sub_evals = sub_evals[sub_evals['exp'] == exp]
    rr_df['case'] = sub_evals['seed'] + sub_evals['metric'] + sub_evals['target'] + sub_evals['exp']
    rr_df['task'] = 'dummy'
    rr_df['algorithm'] = sub_evals['distances']
    rr_df['value'] = sub_evals['score']
    rr_df = rr_df.replace(map_dist2printable)
    bsr = BootstrapRanking(data=rr_df, use_median=use_median)
    alg_order = list(bsr.statistics.algorithm.values)  # determine ordering
    if inverted:
        alg_order = alg_order[::-1]
    df_counts = bsr.counts.sort_values(by="algorithm", key=lambda column: column.map(lambda e: alg_order.index(e)))
    fig = px.scatter(df_counts, x='algorithm', y='rank', color='algorithm', size='count', template='plotly', title=f'{meta_metric} on {exp}' if exp else meta_metric, labels={'rank': 'Rank', 'algorithm': 'Algorithm'}, color_discrete_map=color_map, 
                     symbol='algorithm', symbol_map=symbol_map)
    fig.update_layout(showlegend=False)
    fig.add_scatter(x=bsr.statistics['algorithm'], y=bsr.statistics['mean_rank'], error_y={'array': bsr.statistics['std_rank']}, marker_symbol='x-thin',marker_size=15, marker_line_width=2, marker_color='black', mode='markers')
    if inverted:
        fig.update_yaxes(autorange='reversed')
    return fig, bsr.statistics
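The "mean then rank" idea behind the bootstrapped ranking can be sketched as follows. This is an illustrative assumption about the procedure, not the implementation of `BootstrapRanking`: cases are resampled with replacement, the values of each algorithm are averaged over the resampled cases, and the algorithms are then ranked by these means. The helper `toy_mean_then_rank` and the data are hypothetical, and higher values are assumed to be better.

```python
import numpy as np
import pandas as pd


def toy_mean_then_rank(values: pd.DataFrame, n_boot: int = 200, seed: int = 0) -> pd.DataFrame:
    """Bootstrap 'mean then rank'; rows of `values` are cases, columns are algorithms."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_boot):
        # resample the cases with replacement
        idx = rng.integers(0, len(values), size=len(values))
        # first aggregate over cases, then rank the algorithms (rank 1 = best)
        means = values.iloc[idx].mean(axis=0)
        ranks.append(means.rank(ascending=False, method="min"))
    rank_df = pd.DataFrame(ranks)
    return pd.DataFrame({"mean_rank": rank_df.mean(), "std_rank": rank_df.std()})


# hypothetical values: dist_a beats dist_b beats dist_c in every single case
toy_values = pd.DataFrame(
    {
        "dist_a": [0.90, 0.80, 0.70, 0.95],
        "dist_b": [0.50, 0.60, 0.40, 0.55],
        "dist_c": [0.10, 0.20, 0.30, 0.15],
    }
)
toy_statistics = toy_mean_then_rank(toy_values)
```

Because the ordering is identical in every case of this toy table, each bootstrap sample yields the same ranking and the rank standard deviation is zero. With real evaluations, the spread of ranks across bootstrap samples is what the bubble sizes and error bars visualize.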

We now create Figure 5 (bottom) - bootstrapped rankings for 4 knowledge transfer scenarios and 4 meta metrics.

In [61]:
# we use the meta metrics explained in the paper (except for gain, as it is a less granular version of improvement, i.e. delta)
_meta_metrics = ['regret', 'rank', 'delta', 'weightedtau']
# map the internal names to the ones of the paper
mm_display_map = {'regret': 'Regret', 'rank': 'Percentile', 'delta': 'Improvement', 'weightedtau': 'Weightedtau'}
_inverted = True
_tripled = True
express_figs = []
statistics_collector = []
for exp in EXPERIMENTS:
    for meta_metric in _meta_metrics:
        fig, stats = get_bubble_ranking_plot(distances=comp_distances, meta_metric=meta_metric, inverted=_inverted, exp=exp, tripled=_tripled)
        express_figs.append(fig)
        statistics_collector.append(stats)
fig = make_subplots(rows=4, cols=4, shared_xaxes=False, shared_yaxes=True, vertical_spacing=0.01, horizontal_spacing=0.01, 
                    column_titles=[mm_display_map[m] for m in _meta_metrics],
                    row_titles=EXPERIMENTS,
                    y_title='Rank')
for fig_idx, sub_fig in enumerate(express_figs):
    row = 1 + fig_idx // 4
    col = 1 + fig_idx % 4
    for trace in sub_fig.data:
        fig.add_trace(trace, row=row, col=col)
fig.update_layout(template='plotly', height=1000, width=1200, font_size=20, legend_title='Task selector')
fig.for_each_yaxis(lambda a: a.update(tickvals=list(range(1, len(comp_distances) +1)), autorange='reversed' if _inverted else True))
fig.for_each_trace(lambda trace: trace.update(showlegend=trace.yaxis=='y' and trace.name in plot_order))
fig.for_each_trace(lambda trace: trace.update(legendrank=1000 + plot_order.index(trace.name)) if trace.showlegend else 1000)
fig.for_each_xaxis(lambda a: a.update(showticklabels=False))
fig.for_each_annotation(lambda a: a.update(font_size=20))
fig.layout.legend.x = 1.08
fig.write_image(FIG_PATH / 'fig_5_bottom.png', engine='kaleido')
fig.write_image(FIG_PATH / 'fig_5_bottom.pdf', engine='kaleido')
fig.show()

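Based on the statistics (gathered in `statistics_collector`) we can compute the upper part of Figure 5: the mean and standard deviation of each task selector's mean rank across all scenarios and meta metrics.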

In [62]:
all_rank_series = [frame.set_index('algorithm')['mean_rank'] for frame in statistics_collector]
stats_df = pd.DataFrame(all_rank_series)
mean_stats_series = stats_df.mean()
mean_stats_series.name = 'mean'
std_stats_series = stats_df.std()
std_stats_series.name = 'std'
plot_df = pd.DataFrame([mean_stats_series, std_stats_series])
plot_df = plot_df.T.sort_values(by='mean', ascending=False)
plot_df.reset_index(inplace=True)

In [65]:
fig = px.scatter(plot_df, x='algorithm', y='mean', error_y='std', color_discrete_map=color_map, color='algorithm',
                     symbol='algorithm', symbol_map=symbol_map, size=[7]*len(plot_df), labels={'algorithm': 'Task selector', 'mean': 'Mean rank'}
                  )
fig.update_layout(template='plotly', font_size=20, width=1200, height=500)
fig.for_each_yaxis(lambda a: a.update(tickvals=list(range(1, len(comp_distances) +1)), autorange='reversed' if _inverted else True))
fig.update_layout(showlegend=False)  # omit legend
fig.update_xaxes({'tickvals': ['']*len(comp_distances)})  # remove labels
fig.write_image(FIG_PATH / 'fig_5_top.png', engine='kaleido')
fig.write_image(FIG_PATH / 'fig_5_top.pdf', engine='kaleido')
fig.show()

Finally, we merge the top and bottom parts into a single figure.

In [68]:
from PIL import Image
OVERLAY = 50
img_top = Image.open(FIG_PATH / 'fig_5_top.png')
img_bottom = Image.open(FIG_PATH / 'fig_5_bottom.png')
merged = Image.new('RGB', size=(img_top.width, img_top.height + img_bottom.height - OVERLAY))
merged.paste(im=img_bottom, box=(0, img_top.height - OVERLAY))
merged.paste(im=img_top, box=(0, 0))
merged.save(FIG_PATH / 'fig_5.png')
merged.save(FIG_PATH / 'fig_5.pdf')

### Step 5: Create rankings according to "rank then mean" (with bootstrapping)

In [76]:
use_tripled = True  # whether to use the tripled or single evaluations
evals = tripled_evaluations if use_tripled else single_evaluations
results = []
rng = np.random.default_rng(seed=42)
for meta in ['regret', 'rank', 'delta', 'weightedtau']:
    for exp in EXPERIMENTS:
        relevant_df = evals[(evals.distances.isin(comp_distances)) & (evals['meta metric'] == meta)].replace(map_dist2printable)
        vary_setups = [col for col in relevant_df.columns if col not in ['distances', 'score']]
        gbobj = relevant_df[relevant_df['exp'] == exp].groupby(vary_setups)
        
        bootstrap_indices = rng.integers(low=0, high=len(gbobj.groups), size=(len(gbobj.groups), 1000))
        group_mapping = {}
        for group_name, group_df in gbobj:
            sub_series = group_df.set_index('distances')['score']
            group_mapping[group_name] = sub_series.rank(method='average', ascending=False).to_dict()
        for dist in relevant_df['distances'].unique().tolist():
            values = np.asarray([ranking[dist] for ranking in group_mapping.values()])
            aggregated = np.mean(values[bootstrap_indices], axis=0)
            results.append({'dist': dist, 'exp': exp, 'mean': np.mean(aggregated), 'std': np.std(aggregated), 'meta': meta})
results = pd.DataFrame(results)
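The two aggregation orders can lead to different conclusions, which a toy example makes visible. The data below are hypothetical and higher values are assumed to be better; the per-case ranking uses `rank(method='average', ascending=False)` as in the cell above.

```python
import pandas as pd

# three hypothetical cases: dist_x wins two cases narrowly and loses one case clearly
toy_values = pd.DataFrame({"dist_x": [0.6, 0.6, 0.0], "dist_y": [0.5, 0.5, 0.9]})

# "mean then rank": average the values over cases, then rank the measures
mean_then_rank = toy_values.mean(axis=0).rank(ascending=False)

# "rank then mean": rank the measures within each case, then average the ranks
rank_then_mean = toy_values.rank(axis=1, ascending=False, method="average").mean(axis=0)

print(mean_then_rank.idxmin())  # dist_y (higher mean value)
print(rank_then_mean.idxmin())  # dist_x (better average rank)
```

"Mean then rank" is sensitive to the size of the score differences, whereas "rank then mean" only counts how often one measure beats another, which is why both views are reported.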

In [78]:

_df = results.replace(mm_display_map)
fig = px.scatter(_df.iloc[::-1], color='dist', x='mean', y='exp', color_discrete_map=color_map, template='plotly',  error_x='std',  facet_row='meta', size=[5] * len(_df), symbol_map=symbol_map, symbol='dist', labels={'mean': 'Rank', 'exp': 'Scenario'}, category_orders={'dist': [map_dist2printable[d] for d in comp_distances]})
fig.update_layout(font_size=20, width=1200, height=1200, showlegend=True, legend_title='Task selector')
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.write_image(FIG_PATH / 'fig_7.png', engine='kaleido')
fig.write_image(FIG_PATH / 'fig_7.pdf', engine='kaleido')
fig.show()