## Validation tasks - comparison of selected task distance candidates

This notebook uses the experimental results of the VALIDATION tasks and investigates how the selected candidates (see "1_develop_comparson.ipynb") perform with respect to all transfer scenarios and meta metrics.

In [54]:
import mml.interactive
from pathlib import Path
mml.interactive.init(Path('~/.config/mml.env').expanduser())
import pandas as pd
import numpy as np
from mml_tf.aggregate import AggregateStrategy, get_aggregated_raws, aggregate_observations
from mml_tf.distances import LoadCachedDistances, get_closest, map_dist2printable
from mml_tf.evaluation import get_win_rates, get_evaluations, SHRUNK
from mml_tf.experiments import EXPERIMENTS, METRICS, load_experiment
from mml_tf.visualization import get_exp_color, init_colors, get_dist_measure_color
from mml_tf.tasks import paper_id_map, shrink_map, shrinkable_tasks, target_tasks
from mml_tf.paths import FIG_PATH
from typing import Tuple, Optional
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

MML API already initialized.


In [2]:
# load candidates and set plotting styles
candidates = [
    'KLD-PP:NS-W:TS-100-BINS',
    'KLD-PP:NS-W:SN-1000-BINS',
    'KLD-PP:NS-1000-BINS',
    'SEMANTIC',  # for comparison with manual selection
]
all_distances = {dist: LoadCachedDistances(dist) for dist in candidates}
init_colors(exp=EXPERIMENTS, distance_measures=[map_dist2printable[d] for d in candidates])
color_map = {map_dist2printable[dist]: get_dist_measure_color(map_dist2printable[dist]) for dist in candidates}
exp_color_map = {exp: get_exp_color(exp) for exp in EXPERIMENTS}
line_dash_map = {map_dist2printable['KLD-PP:NS-W:TS-100-BINS']: 'solid',
                 map_dist2printable['KLD-PP:NS-W:SN-1000-BINS']: 'dash',
                 map_dist2printable['KLD-PP:NS-1000-BINS']: 'dashdot',
                 map_dist2printable['SEMANTIC']: 'dot',
                 'baseline': 'solid'}

### Step 1: Explore domain shift

Evaluating the candidates on the validation tasks with the identical setup, what are the absolute differences when switching from develop to validation? We use the same "averaged" transfer experiment measurements (`AggregateStrategy.MEAN`) and metrics.

In [10]:
dev_evaluations = get_evaluations(all_distances=all_distances.values(), aggregates=[AggregateStrategy.MEAN],
                                   metrics=METRICS, experiments=EXPERIMENTS, top_mode='avg', top_k=3,
                                   top_meta_metrics=['regret', 'rank', 'delta', 'gain'], validation=False)
val_evaluations = get_evaluations(all_distances=all_distances.values(), aggregates=[AggregateStrategy.MEAN],
                                   metrics=METRICS, experiments=EXPERIMENTS, top_mode='avg', top_k=3,
                                   top_meta_metrics=['regret', 'rank', 'delta', 'gain'], validation=True)

Calculating...: 100%|██████████| 736/736 [00:05<00:00, 146.78it/s]
Calculating...: 100%|██████████| 1376/1376 [00:09<00:00, 146.71it/s]


In [13]:
# absolute meta metric table
def get_absolute_meta_values(meta_metric: str, reverse_regret: bool = False, validation: bool = True) -> pd.DataFrame:
    meta_values_df = pd.DataFrame()
    evaluation_df = val_evaluations if validation else dev_evaluations
    for exp in EXPERIMENTS:
        sub_df = evaluation_df[(evaluation_df['exp'] == exp) & (evaluation_df['meta metric'] == meta_metric)]
        for group_name, group_df in sub_df.groupby('distances'):
            if meta_metric == 'regret' and reverse_regret:
                val_series = 1 - group_df['score']
            else:
                val_series = group_df['score']
            meta_values_df.at[group_name, exp + '-mean'] = val_series.mean()
            meta_values_df.at[group_name, exp + '-std'] = val_series.std()
    meta_values_df['mean'] = meta_values_df[[exp + '-mean' for exp in EXPERIMENTS]].mean(axis=1)
    return meta_values_df.sort_values(by='mean', ascending=False)

In [15]:
# this time we do not align regret (no following aggregation with other metrics) 
reversed_regret_dev = get_absolute_meta_values(meta_metric='regret', reverse_regret=True, validation=False)
reversed_regret_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
SEMANTIC,0.279245,0.23469,0.287204,0.20023,0.289398,0.207234,0.256161,0.189903,0.278002
KLD-PP:NS-W:SN-1000-BINS,0.252003,0.196438,0.269076,0.191752,0.275663,0.226451,0.236002,0.165232,0.258186
KLD-PP:NS-W:TS-100-BINS,0.255248,0.232733,0.282134,0.183223,0.248669,0.22597,0.229796,0.150314,0.253962
KLD-PP:NS-1000-BINS,0.258868,0.238555,0.27299,0.174439,0.247942,0.226828,0.230148,0.147894,0.252487


In [16]:
reversed_regret_val = get_absolute_meta_values(meta_metric='regret', reverse_regret=True, validation=True)
reversed_regret_val

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
SEMANTIC,0.18784,0.163522,0.20318,0.181096,0.195592,0.185363,0.1823,0.170127,0.192228
KLD-PP:NS-W:SN-1000-BINS,0.176431,0.147059,0.218833,0.168821,0.17099,0.145447,0.18417,0.170121,0.187606
KLD-PP:NS-1000-BINS,0.205878,0.193693,0.193878,0.17826,0.167464,0.156172,0.173578,0.170979,0.1852
KLD-PP:NS-W:TS-100-BINS,0.207949,0.197788,0.19143,0.174855,0.165669,0.157312,0.175244,0.170518,0.185073


In [17]:
# switching from dev to val, what is the trend?
reversed_regret_val - reversed_regret_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-1000-BINS,-0.052991,-0.044862,-0.079112,0.003821,-0.080477,-0.070656,-0.05657,0.023085,-0.067288
KLD-PP:NS-W:SN-1000-BINS,-0.075571,-0.049379,-0.050243,-0.022931,-0.104673,-0.081004,-0.051832,0.004889,-0.07058
KLD-PP:NS-W:TS-100-BINS,-0.047299,-0.034945,-0.090704,-0.008368,-0.083,-0.068657,-0.054552,0.020204,-0.068889
SEMANTIC,-0.091405,-0.071168,-0.084024,-0.019134,-0.093807,-0.021871,-0.073861,-0.019775,-0.085774


Regret has improved (it got lower) across all 4 task distances (3 bKLD variants and manual task selection) and 4 scenarios. 
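The develop and validation tables above list the candidates in different row orders, so the subtraction in the previous cell relies on pandas matching rows by label rather than by position. A minimal sketch of that behaviour, using only the `mean` column of the two regret tables (the variable names are illustrative and not part of the notebook):

```python
import pandas as pd

# 'mean' column of the develop and validation regret tables shown above;
# the two tables list the candidates in different row orders
dev_mean = pd.Series({
    'SEMANTIC': 0.278002,
    'KLD-PP:NS-W:SN-1000-BINS': 0.258186,
    'KLD-PP:NS-W:TS-100-BINS': 0.253962,
    'KLD-PP:NS-1000-BINS': 0.252487,
})
val_mean = pd.Series({
    'SEMANTIC': 0.192228,
    'KLD-PP:NS-W:SN-1000-BINS': 0.187606,
    'KLD-PP:NS-1000-BINS': 0.1852,
    'KLD-PP:NS-W:TS-100-BINS': 0.185073,
})

# pandas aligns both operands on their index labels before subtracting
shift = val_mean - dev_mean
print(shift.round(6))
```

Every entry of `shift` is negative, which matches the `mean` column of the difference table.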

In [18]:
rank_dev = get_absolute_meta_values(meta_metric='rank', validation=False)
rank_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.550188,0.140238,0.607273,0.160864,0.606307,0.165877,0.543935,0.16318,0.576926
KLD-PP:NS-1000-BINS,0.538111,0.150988,0.62606,0.169965,0.608454,0.16446,0.531857,0.16708,0.576121
KLD-PP:NS-W:SN-1000-BINS,0.550027,0.167754,0.630421,0.157774,0.52441,0.172167,0.541412,0.147742,0.561567
SEMANTIC,0.518344,0.18726,0.593344,0.153184,0.472088,0.170999,0.512279,0.179319,0.524014


In [19]:
rank_val = get_absolute_meta_values(meta_metric='rank', validation=True)
rank_val

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.482569,0.166892,0.676827,0.191611,0.578866,0.167826,0.520081,0.21472,0.564586
KLD-PP:NS-1000-BINS,0.484673,0.170553,0.670819,0.196274,0.571391,0.160038,0.527639,0.213483,0.56363
SEMANTIC,0.513662,0.187015,0.624626,0.205416,0.496665,0.182255,0.504418,0.200418,0.534843
KLD-PP:NS-W:SN-1000-BINS,0.535533,0.185995,0.523605,0.18914,0.531969,0.175136,0.518352,0.178784,0.527365


In [20]:
rank_val - rank_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-1000-BINS,-0.053437,0.019564,0.044759,0.026309,-0.037063,-0.004422,-0.004218,0.046403,-0.01249
KLD-PP:NS-W:SN-1000-BINS,-0.014494,0.018242,-0.106816,0.031367,0.007559,0.002969,-0.023059,0.031042,-0.034203
KLD-PP:NS-W:TS-100-BINS,-0.067619,0.026655,0.069553,0.030748,-0.027441,0.001949,-0.023853,0.05154,-0.01234
SEMANTIC,-0.004682,-0.000246,0.031282,0.052233,0.024577,0.011256,-0.00786,0.021099,0.010829


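Switching to validation, the rank percentile dropped for model architecture transfer across all candidates and improved for pretraining data transfer (except for the source-weighted bKLD variant, which dropped by about 0.11). For augmentation policy transfer it improved only for manual task selection and, marginally, for source-weighted bKLD. For co-training it dropped slightly for all candidates.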

In [21]:
weightedtau_dev = get_absolute_meta_values(meta_metric='weightedtau', validation=False)
weightedtau_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.02098,0.210719,0.092076,0.233045,0.178862,0.212012,-0.009524,0.181079,0.070599
KLD-PP:NS-1000-BINS,0.056977,0.206111,0.033058,0.241982,0.156944,0.191699,0.005589,0.202676,0.063142
KLD-PP:NS-W:SN-1000-BINS,0.011972,0.246552,0.239683,0.239278,0.007186,0.188323,-0.019003,0.239843,0.05996
SEMANTIC,-0.037975,0.269799,0.094596,0.222644,-0.059932,0.231933,0.027682,0.232964,0.006093


In [22]:
weightedtau_val = get_absolute_meta_values(meta_metric='weightedtau', validation=True)
weightedtau_val

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.02853,0.133149,0.149676,0.15263,0.148566,0.20595,-0.008998,0.190978,0.079444
KLD-PP:NS-1000-BINS,0.025797,0.131113,0.146882,0.153107,0.144505,0.201323,-0.013835,0.192682,0.075837
KLD-PP:NS-W:SN-1000-BINS,0.027054,0.162731,-0.001078,0.222002,0.027341,0.207979,0.020443,0.176301,0.01844
SEMANTIC,-0.036082,0.156813,0.103471,0.200676,-0.064462,0.144397,-0.044329,0.175984,-0.01035


In [23]:
weightedtau_val - weightedtau_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.00755,-0.077569,0.0576,-0.080415,-0.030296,-0.006062,0.000526,0.009899,0.008845
KLD-PP:NS-1000-BINS,-0.03118,-0.074998,0.113824,-0.088876,-0.012439,0.009624,-0.019424,-0.009994,0.012695
KLD-PP:NS-W:SN-1000-BINS,0.015082,-0.083821,-0.240761,-0.017276,0.020155,0.019656,0.039446,-0.063541,-0.041519
SEMANTIC,0.001893,-0.112986,0.008876,-0.021968,-0.00453,-0.087536,-0.07201,-0.05698,-0.016443


In [25]:
gain_dev = get_absolute_meta_values('gain', validation=False)
gain_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:SN-1000-BINS,0.797101,0.27647,0.376812,0.326664,0.855072,0.311525,0.775362,0.314696,0.701087
KLD-PP:NS-W:TS-100-BINS,0.768116,0.313243,0.297101,0.331314,0.84058,0.295974,0.797101,0.29379,0.675725
KLD-PP:NS-1000-BINS,0.746377,0.338368,0.297101,0.308147,0.84058,0.304202,0.789855,0.317075,0.668478
SEMANTIC,0.782609,0.291589,0.318841,0.313928,0.811594,0.341761,0.746377,0.315718,0.664855


In [26]:
gain_val = get_absolute_meta_values('gain', validation=True)
gain_val

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-W:TS-100-BINS,0.651163,0.399041,0.368217,0.392841,0.647287,0.400505,0.658915,0.375654,0.581395
KLD-PP:NS-1000-BINS,0.651163,0.402304,0.372093,0.397438,0.635659,0.398126,0.658915,0.38255,0.579457
KLD-PP:NS-W:SN-1000-BINS,0.717054,0.341153,0.263566,0.383026,0.643411,0.420988,0.678295,0.397534,0.575581
SEMANTIC,0.674419,0.392667,0.333333,0.389402,0.569767,0.405409,0.670543,0.366913,0.562016


In [27]:
gain_val - gain_dev

Unnamed: 0,Model<br>Architecture-mean,Model<br>Architecture-std,Pretraining<br>Data-mean,Pretraining<br>Data-std,Augmentation<br>Policy-mean,Augmentation<br>Policy-std,Co-Training<br>Data-mean,Co-Training<br>Data-std,mean
KLD-PP:NS-1000-BINS,-0.095214,0.063936,0.074992,0.089292,-0.204921,0.093924,-0.13094,0.065475,-0.089021
KLD-PP:NS-W:SN-1000-BINS,-0.080047,0.064683,-0.113246,0.056362,-0.211662,0.109463,-0.097068,0.082837,-0.125506
KLD-PP:NS-W:TS-100-BINS,-0.116953,0.085798,0.071116,0.061527,-0.193293,0.104531,-0.138187,0.081864,-0.094329
SEMANTIC,-0.10819,0.101078,0.014493,0.075474,-0.241827,0.063647,-0.075834,0.051195,-0.10284


Gain dropped noticeably for model architecture transfer (by roughly 0.08 to 0.12). In the pretraining scenario it improved for unweighted and target-weighted bKLD (about +0.07) and marginally for manual task selection, while source-weighted bKLD lost about 0.11. The largest drops occur for augmentation policy transfer (about 0.19 to 0.24), and gain also dropped for co-training (about 0.08 to 0.14).
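To make the gain values easier to interpret, here is a minimal sketch of how such a number could be computed. It assumes that gain is the fraction of target tasks whose transfer score strictly exceeds the no-transfer baseline (in line with the "Fraction of improved tasks" axis label used later in this notebook). The actual definition lives in `mml_tf.evaluation` and is not shown here, so the function `gain` and its tie handling are assumptions.

```python
import numpy as np

def gain(transfer_scores, baseline_scores) -> float:
    """Fraction of target tasks whose transfer score strictly exceeds the baseline (assumed definition)."""
    transfer = np.asarray(transfer_scores, dtype=float)
    baseline = np.asarray(baseline_scores, dtype=float)
    return float(np.mean(transfer > baseline))

# hypothetical balanced accuracies for four target tasks
baseline = [0.70, 0.85, 0.60, 0.92]
transfer = [0.74, 0.83, 0.66, 0.92]
print(gain(transfer, baseline))  # 2 of 4 tasks improve -> 0.5
```

Under this reading, a gain of 0.65 means that transfer helped on roughly two thirds of the evaluated cases, independent of how large the individual improvements were.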

### Step 2: Compare bKLD candidates with manual task selection and baseline

We compare the bKLD candidates with manual task selection.

In [28]:
# win rates have been introduced by https://arxiv.org/pdf/2204.01403 
win_rates_df = pd.DataFrame()
for exp in EXPERIMENTS:
    win_rates_df[exp] = get_win_rates(val_evaluations[val_evaluations['exp'] == exp])
win_rates_df['mean'] = win_rates_df.mean(axis=1)
win_rates_df

Unnamed: 0,Model<br>Architecture,Pretraining<br>Data,Augmentation<br>Policy,Co-Training<br>Data,mean
KLD-PP:NS-W:TS-100-BINS,0.348837,0.553488,0.506977,0.369767,0.444767
KLD-PP:NS-W:SN-1000-BINS,0.516279,0.351163,0.439535,0.465116,0.443023
KLD-PP:NS-1000-BINS,0.311628,0.518605,0.476744,0.404651,0.427907
SEMANTIC,0.413953,0.451163,0.327907,0.390698,0.39593


Note that win rates can sum to more than 1.0 per scenario because ties occur. According to these results, bKLD(large,source) wins scenarios 1 & 4, while bKLD(small,target) wins scenarios 2 & 3. However, win rates are a very coarse assessment method, especially with many ties.
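The effect of ties can be reproduced with a small sketch. It assumes that a selector wins a case whenever it reaches the best score of that case, so tied selectors all win. This is an illustration with hypothetical scores, not the implementation of `get_win_rates`.

```python
import pandas as pd

# hypothetical scores (higher is better) of three task selectors on four cases
scores = pd.DataFrame({
    'selector_a': [0.9, 0.5, 0.7, 0.6],
    'selector_b': [0.9, 0.4, 0.7, 0.8],
    'selector_c': [0.1, 0.5, 0.7, 0.2],
})

# a selector wins a case when it reaches the row maximum, so ties count for every tied selector
wins = scores.eq(scores.max(axis=1), axis=0)
win_rates = wins.mean()
print(win_rates)
print(win_rates.sum())  # exceeds 1.0 as soon as any case is tied
```

Only the last case has a unique winner here, which is why the three win rates add up to 2.0 instead of 1.0.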

In [41]:
# next we visualize the performance in comparison with the baseline ("no transfer") 
BUDGET = 3
df_rows = []
for metric in METRICS:
    baseline_full = get_aggregated_raws(strat=AggregateStrategy.MEAN, metric=metric, shrunk=False, validation=True)
    baselines_shrunk = get_aggregated_raws(strat=AggregateStrategy.MEAN, metric=metric, shrunk=True, validation=True)
    for exp in EXPERIMENTS:
        exp_df = aggregate_observations(multi_seed_df=load_experiment(experiment_name=exp, metric=metric),
                                        strat=AggregateStrategy.MEAN, is_loss=False)
        for task in target_tasks:
            target_ref = shrink_map[task] if SHRUNK[exp] else task
            # add baseline value 
            base_val = baselines_shrunk[task] if SHRUNK[exp] and task in shrinkable_tasks else baseline_full[task]
            df_rows.append({'exp': exp, 'metric': metric, 'task': task, 'value': base_val, 'method': 'baseline',
                            'task_id': paper_id_map[task]})
            # selected source task as suggested by distances
            for dist in all_distances:
                sources = get_closest(target_task=target_ref, distances=all_distances[dist], budget=BUDGET)
                transfer_performances = exp_df[task].loc[sources]
                best = transfer_performances.max() if 'loss' not in metric else transfer_performances.min()
                df_rows.append({'exp': exp, 'metric': metric, 'task': task, 'value': best, 'method': dist,
                                'task_id': paper_id_map[task]})
baseline_comparison_df = pd.DataFrame(df_rows)

In [46]:
def get_fig(metric: str = 'BA', exp: Optional[str] = None,
            distances: Tuple[str, ...] = ('baseline', 'KLD-PP:NS-W:TS-100-BINS')) -> go.Figure:
    # determine task order
    tmp_df = baseline_comparison_df[(baseline_comparison_df['method'] == 'baseline') & (baseline_comparison_df['exp'].isin([exp] if exp else EXPERIMENTS))].sort_values(by='value')
    task_order = tmp_df[tmp_df['metric'] == metric]['task'].tolist()
    # sort accordingly
    sorted_df = baseline_comparison_df.sort_values(by='method', ascending=False)
    sorted_df = sorted_df.sort_values(by='task', key=lambda column: column.map(lambda e: task_order.index(e)))
    fig = px.line(sorted_df[
                      (sorted_df['metric'] == metric) & (sorted_df['exp'].isin(EXPERIMENTS if exp is None else [exp])) &
                      sorted_df['method'].isin(distances)].replace(map_dist2printable), x='task_id', y='value',
                  color='method', template='plotly', color_discrete_map=color_map,
                  labels={'distance': 'Distance', 'task_id': 'Task ID', 'method': 'Method', 'value': metric},
                  facet_col='exp' if exp is None else None, facet_col_wrap=2 if exp is None else None,
                  category_orders={'exp': EXPERIMENTS}, facet_row_spacing=0.15, line_dash='method', line_dash_map=line_dash_map)
    fig = fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
    fig.update_layout(
        legend=dict(
            orientation="h",
            y=-0.2,
            xanchor="center",
            x=0.5
        ))
    fig.update_layout(font_size=20, width=2000, height=800)
    fig.update_xaxes(nticks=len(task_order))
    return fig

In [47]:
get_fig('BA', distances=('baseline', 'KLD-PP:NS-W:TS-100-BINS', 'SEMANTIC'))

The plot shows the mean Balanced Accuracy (BA) over three transfer repetitions for the baseline (orange), manual task selection (red) and bKLD(small,target), across all four transfer scenarios (subplots) and all validation tasks (x-axis). Knowledge transfer can improve BA for many tasks, and some tasks (e.g. T71) benefit especially strongly. The validation tasks are sorted by baseline performance and are highly heterogeneous, ranging from tough tasks below 50% BA to tasks close to 99% BA.

In [48]:
get_fig('AUROC', distances=('baseline', 'KLD-PP:NS-W:TS-100-BINS', 'KLD-PP:NS-W:SN-1000-BINS', 'KLD-PP:NS-1000-BINS'))

This plot compares the three bKLD variants, using the same plot setup as the previous one but with AUROC as the metric.

### Step 3: Summarize bKLD performances within figures

We compute figure 4 of the paper, which summarizes bKLD performance as a function of the computational budget. Previously we averaged the top three suggestions of a task selector; below, the best of k suggestions is used. This assumes that a user queries multiple source tasks and, depending on the local computational resources, applies knowledge transfer for k of them. After internal validation, the "best" of these is used for further processing. We also switch from aggregating the three repetitions before computing the meta metric to computing it per repetition and aggregating afterwards. This better captures the potential risk of knowledge transfer caused by non-deterministic model training.
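The difference between averaging the top-k suggestions and taking the best of them can be sketched as follows. The helper `top_k_outcome`, the source task names and all numbers are hypothetical; the sketch only mirrors the idea behind the `top_mode` values "avg" and "best", not the code inside `get_evaluations`.

```python
import pandas as pd

def top_k_outcome(distances: pd.Series, performances: pd.Series, k: int, mode: str = 'best') -> float:
    """Select the k closest source tasks and summarise their transfer performance."""
    closest = distances.nsmallest(k).index
    selected = performances.loc[closest]
    if mode == 'best':
        return float(selected.max())
    if mode == 'avg':
        return float(selected.mean())
    raise ValueError(f'unknown mode: {mode}')

# hypothetical distances to one target task and the resulting transfer performances
distances = pd.Series({'src_a': 0.1, 'src_b': 0.2, 'src_c': 0.3, 'src_d': 0.9})
performances = pd.Series({'src_a': 0.80, 'src_b': 0.86, 'src_c': 0.77, 'src_d': 0.90})

for k in range(1, 4):
    avg = top_k_outcome(distances, performances, k, mode='avg')
    best = top_k_outcome(distances, performances, k, mode='best')
    print(k, round(avg, 4), best)
```

For a metric where higher is better, the "best" outcome can never decrease when the budget grows, whereas the "avg" outcome can drop once a weaker source task enters the selection (as happens at k=3 here).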

In [55]:
def get_budget_plot(mode: str = 'best', 
                    max_budget=5,
                    distances: Tuple[str, ...] = ('SEMANTIC', 'KLD-PP:NS-W:TS-100-BINS'), 
                    meta_metric: str = 'rank',
                    metric='BA', 
                    show_standard_error: bool = False, 
                    bottom_legend: bool = True
                    ) -> Tuple[go.Figure, pd.DataFrame]:
    """
    Computes a meta metric along multiple computational budgets and generates a plot out of the results.
    
    :param mode: the `top_mode` of `get_evaluations` - either "avg" for averaging the outcomes of top-k meta metric or "best" for the best performing
    :param max_budget: up to which max budget the plot is generated
    :param distances: which task distances to compare, must be present in `all_distances`
    :param meta_metric: which meta metric to inspect, note that weightedtau is not eligible as it is budget independent
    :param metric: which base metric to use
    :param show_standard_error: computes and displays the deviation along transfer repetition, if false we fall back to the original scheme of aggregating transfer repetitions first before computing the meta metric
    :param bottom_legend: if True places the legend at the bottom, otherwise at the right
    :return: a plotly figure plus the dataframe computed
    """
    budget_evals = []
    for budget in range(1, max_budget + 1):
        tmp_df = get_evaluations(all_distances=[all_distances[name] for name in distances],
                                 aggregates=[AggregateStrategy.FIRST, AggregateStrategy.SECOND,
                                             AggregateStrategy.THIRD] if show_standard_error else [
                                     AggregateStrategy.MEAN], metrics=[metric], experiments=EXPERIMENTS,
                                 top_meta_metrics=[meta_metric], top_k=budget, top_mode=mode, corr_meta_metrics=[],
                                 disable_pbar=True)
        tmp_df['budget'] = budget
        for grp_name, grp_df in tmp_df.groupby(['distances', 'exp', 'seed']):
            budget_evals.append(
                {'distances': map_dist2printable[grp_name[0]], 'score': grp_df['score'].mean(), 'seed': grp_name[2],
                 'budget': budget, 'exp': grp_name[1]})
    if show_standard_error:
        intermediate_df = pd.DataFrame(budget_evals)
        # compute errors along seeds
        budget_evals = []
        for grp_name, grp_df in intermediate_df.groupby(['distances', 'exp', 'budget']):
            budget_evals.append(
                {'distances': grp_name[0], 'score': grp_df['score'].mean(), 'error': grp_df['score'].std(),
                 'budget': grp_name[2], 'exp': grp_name[1]})
    final_df = pd.DataFrame(budget_evals)
    fig = px.line(final_df, x='budget', y='score', color='distances' if len(distances) > 1 else 'exp',
                  template='plotly',
                  labels={'exp': 'Scenario',
                          'score': f'{metric} improvement (multi shot)' if meta_metric == 'delta' else meta_metric,
                          'distances': 'Task selector'},
                  facet_col='exp' if len(distances) > 1 else None,
                  category_orders={'exp': EXPERIMENTS},
                  color_discrete_map=color_map if len(distances) > 1 else exp_color_map,
                  error_y='error' if show_standard_error else None,
                  line_dash='distances' if len(distances) > 1 else None,
                  line_dash_map={map_dist2printable['KLD-PP:NS-W:TS-100-BINS']: 'solid',
                                 map_dist2printable['KLD-PP:NS-W:SN-1000-BINS']: 'dash',
                                 map_dist2printable['KLD-PP:NS-1000-BINS']: 'dashdot',
                                 map_dist2printable['SEMANTIC']: 'dot'})
    if max_budget <= 5:
        fig.update_layout(xaxis=dict(tickvals=list(range(1, max_budget + 1))),
                          xaxis2=dict(tickvals=list(range(1, max_budget + 1))),
                          xaxis3=dict(tickvals=list(range(1, max_budget + 1))),
                          xaxis4=dict(tickvals=list(range(1, max_budget + 1))))
    fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
    if bottom_legend:
        fig.update_layout(legend=dict(
            orientation="h",
            y=-0.2,
            xanchor="center",
            x=0.5
        ))
    fig.for_each_xaxis(lambda x: x.update({'title': ''}))
    fig.add_annotation(
        showarrow=False,
        xanchor='center',
        xref='paper',
        x=0.5,
        yref='paper',
        y=-0.2,
        text='Number of shots'
    )
    fig.update_layout(font_size=20)
    return fig, final_df

In [56]:
# top plot of figure 4
fig_gain, _ = get_budget_plot(max_budget=5, meta_metric='gain', distances=tuple(['KLD-PP:NS-W:TS-100-BINS']), metric='BA',
                           show_standard_error=True, bottom_legend=False)
fig_gain.update_yaxes(title_text='Fraction of improved tasks')

In [57]:
# bottom plot of figure 4
fig_improve, improve_df = get_budget_plot(distances=tuple(['SEMANTIC', 'KLD-PP:NS-W:TS-100-BINS']),
                      mode='best', max_budget=5, meta_metric='delta', metric='AUROC', show_standard_error=True, bottom_legend=False)
fig_improve.add_hline(y=0., annotation_text='No transfer', annotation_position='bottom right', line_color='grey')

In [58]:
# merging and saving figure 4
joined_fig = make_subplots(rows=2, cols=4, shared_xaxes=False, shared_yaxes=True,
                           # column_widths=[0.6, 0.4], row_heights=[0.4, 0.6], 
                           specs=[[{"colspan": 4}, None, None, None], [{}, {}, {}, {}]], subplot_titles=[''] + EXPERIMENTS)
for trace in fig_gain.data:
    trace.legendgroup = 1
    trace.legendgrouptitle.text = 'Scenario'
    joined_fig.add_trace(trace, row=1, col=1)
for trace in fig_improve.data:
    col = {'x': 1, 'x2': 2, 'x3': 3, 'x4': 4}[trace.xaxis]
    trace.legendgroup = 2
    trace.legendgrouptitle.text = 'Task selector'
    joined_fig.add_trace(trace, row=2, col=col)
joined_fig.update_layout(template='plotly', font_size=20, width=1200, height=700, legend_tracegroupgap=70)
joined_fig.update_layout(xaxis=dict(tickvals=list(range(1, 6))), yaxis_title='Fraction of improved tasks',
                         yaxis2_title='AUROC improvement')
joined_fig.update_xaxes(title='Number of shots')
joined_fig.update_annotations(font_size=20)
joined_fig.add_hline(y=0., annotation_text='No transfer', annotation_position='bottom right', line_color='grey', row=2)
joined_fig.write_image(FIG_PATH / 'fig_4.png', width=1200, height=700, engine='kaleido')
joined_fig.write_image(FIG_PATH / 'fig_4.pdf', width=1200, height=700, engine='kaleido')

Lastly, we compute the reported relative improvement of bKLD(small,target) over manual task selection for the 4 scenarios. Instead of using a single budget, we average the improvement over budgets 1 to 5, which amounts to comparing the areas under the curves in the 4 bottom plots of figure 4.

In [73]:
rearranged_improvement_df = improve_df.set_index(['distances', 'budget', 'exp'])
print('Relative improvement in percent:')
for exp in EXPERIMENTS:
    man_delta = np.mean([rearranged_improvement_df.loc[('Manual', budget, exp)]['score'] for budget in range(1, 6)])
    bkld_delta = np.mean([rearranged_improvement_df.loc[('bKLD(small,target)', budget, exp)]['score'] for budget in range(1, 6)])
    print(exp, (bkld_delta - man_delta) * 100 / man_delta)
print('\nAverage improvement in percentage points:')
for exp in EXPERIMENTS:
    man_delta = np.mean([rearranged_improvement_df.loc[('Manual', budget, exp)]['score'] for budget in range(1, 6)])
    bkld_delta = np.mean([rearranged_improvement_df.loc[('bKLD(small,target)', budget, exp)]['score'] for budget in range(1, 6)])
    print(exp, (bkld_delta - man_delta) * 100)

Relative improvement in percent:
Model<br>Architecture 11.532237438377546
Pretraining<br>Data 56.904790724621854
Augmentation<br>Policy 15.3539301656938
Co-Training<br>Data 2.3152358678444114

Average improvement in percentage points:
Model<br>Architecture 0.2248716446780416
Pretraining<br>Data 0.19508450068244668
Augmentation<br>Policy 0.2768468025118813
Co-Training<br>Data 0.05043448403824202
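
The two summary numbers answer different questions: the relative value divides by the manual improvement and therefore grows quickly when that improvement is small (compare the pretraining scenario, with about 57% relative but only about 0.2 percentage points absolute). A compact sketch of the same computation with a pivot table, run on a hypothetical stand-in for `improve_df` with two budgets and a single scenario:

```python
import pandas as pd

# hypothetical stand-in for `improve_df` with two budgets and one scenario
toy_df = pd.DataFrame({
    'distances': ['Manual', 'Manual', 'bKLD(small,target)', 'bKLD(small,target)'],
    'exp': ['scenario'] * 4,
    'budget': [1, 2, 1, 2],
    'score': [0.010, 0.030, 0.020, 0.040],
})

# average the improvement over budgets per scenario and task selector
mean_delta = toy_df.pivot_table(index='exp', columns='distances', values='score', aggfunc='mean')
man, bkld = mean_delta['Manual'], mean_delta['bKLD(small,target)']

relative_percent = (bkld - man) * 100 / man   # relative improvement in percent
percentage_points = (bkld - man) * 100        # absolute improvement in percentage points
print(relative_percent)
print(percentage_points)
```

With these toy numbers the manual selector averages 0.02 and bKLD 0.03, which gives a 50% relative improvement but only 1 percentage point in absolute terms.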
