# AMLB Tables
This notebook contains code to generate several tables from the paper "AMLB: an AutoML Benchmark" by Gijsbers et al. (2023). As input, it assumes data as preprocessed by `raw_to_clean.ipynb`.

*The notebook's code is pretty messy/terrible. PRs to clean it up are welcome, but they must produce identical tables.*

In [37]:
import itertools
import math
from pathlib import Path
import numpy as np
import pandas as pd

In [38]:
from data_processing import get_print_friendly_name, impute_results, calculate_ranks, add_rescale, is_old

In [39]:
PROJECT_ROOT = Path(".").absolute().parent
DATA_DIRECTORY = PROJECT_ROOT / "data"
TABLE_DIRECTORY = PROJECT_ROOT / "tables"
TABLE_DIRECTORY.mkdir(parents=True, exist_ok=True)

In [40]:
results = pd.read_csv(DATA_DIRECTORY / "amlb_all.csv", dtype={"info": str})
results["framework"] = results["framework"].apply(get_print_friendly_name)

### Naive AutoML
We collaborated with the Naive AutoML (NAML) authors, but despite our best efforts we ran the experiments with one major oversight: the provided per-pipeline evaluation time is not actually sufficient. While one could consider the lack of a good default in the package itself a "bug", the framework was in principle not designed with one-hour runtimes in mind. This has a major effect on how NAML performs, and we feel the results are not representative of the framework and not in line with our philosophy of avoiding "wrong configuration" of the framework. We discuss this in more detail in the appendix. A future version of NAML will address this issue, and we will do our best to provide updated results on our website. For now, we omit Naive AutoML from our comparisons:

In [41]:
results = results[results["framework"] != "NaiveAutoML"].copy()

## Imputation and Scaling
In our tables we report the original performance metrics (AUC, log loss, RMSE) and explicitly state the number of failures. This gives readers a different perspective on the results, for example the absolute differences in achieved performance. For this reason, we neither impute missing values nor apply scaling.

# Table as a DataFrame
Generating pandas DataFrames for tables 4-9 of the appendix.

In [42]:
summary = results.groupby(["framework", "constraint", "task", "id", "metric"], as_index=False).agg({"result": ["mean", "std", "count"]})
summary["fails"] = 10 - summary[("result", "count")]

The benchmark uses a "higher is better" `result` column, which makes processing the data easier. However, when presenting results to the reader, it is more intuitive to show the "normal" log loss and RMSE rather than their negated versions:

In [43]:
summary[("result", "mean")] = summary[("result", "mean")].apply(abs)
summary[("metric", "")] = summary[("metric", "")].apply(lambda metric: metric[4:] if metric.startswith("neg_") else metric)
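
As a minimal sketch of the aggregation and sign-flip steps above, here is the same logic on made-up data. The framework name, task name and scores are invented for illustration, and the example assumes three folds per task instead of the ten used in this notebook:

```python
import pandas as pd

N_FOLDS = 3  # assumption for this toy example; the notebook uses 10

# Invented scores: one of the three folds failed (NaN).
toy = pd.DataFrame({
    "framework": ["fw_a"] * 3,
    "task": ["toy_task"] * 3,
    "metric": ["neg_logloss"] * 3,
    "result": [-0.5, -0.7, float("nan")],
})

toy_summary = toy.groupby(["framework", "task", "metric"], as_index=False).agg(
    {"result": ["mean", "count"]}
)
# `count` ignores NaN, so missing results show up as failures.
toy_summary["fails"] = N_FOLDS - toy_summary[("result", "count")]

# Present the conventional (positive) loss and drop the "neg_" prefix.
toy_summary[("result", "mean")] = toy_summary[("result", "mean")].abs()
toy_summary[("metric", "")] = toy_summary[("metric", "")].apply(
    lambda metric: metric[4:] if metric.startswith("neg_") else metric
)
print(toy_summary)
```

Note that the aggregation produces two-level column labels, which is why columns are addressed with tuples such as `("result", "mean")`.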

We want to show the `(mean, std, fails)` results in a single cell. We have chosen the following format, with the number of failures as a superscript:

In [44]:
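def combine_as_supertext(tuple_):
    metric, mean, std, fails = tuple_
    # NOTE: the "neg_" prefix was already stripped from `metric` in the previous cell,
    # so the `.2g` branches below appear never to be taken when cells run in order.
    # The logic is left unchanged so that the generated tables stay identical.
    std_text = f"({std:.3f})" if metric != "neg_rmse" else f"({std:.2g})"
    if fails == 10:
        # raw string: "\h" is not a valid escape sequence in a regular string
        return r"-$\hspace{0.4em}$"

    # f-string expressions cannot contain a backslash before Python 3.12, hence the variable
    backslash = r"\hspace{0.4em}"
    if metric != "neg_rmse":
        return f"{mean:.3f}{std_text}$^{{{int(fails) if int(fails) != 0 else backslash}}}$"
    return f"{mean:.2g}{std_text}$^{{{int(fails) if int(fails) != 0 else backslash}}}$"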
        
summary["display"] = summary[[("metric", ""), ("result", "mean"), ("result", "std"), ("fails", "")]].agg(combine_as_supertext, axis=1)
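
To illustrate the resulting cell format in isolation, the sketch below uses a simplified stand-alone function. `format_cell` and its `n_folds` parameter are hypothetical names introduced here, and the metric-specific precision is left out:

```python
def format_cell(mean: float, std: float, fails: int, n_folds: int = 10) -> str:
    """Hypothetical stand-alone variant of the cell format: mean(std)^fails."""
    spacer = r"\hspace{0.4em}"
    if fails == n_folds:
        # No successful run at all: show a dash instead of numbers.
        return f"-${spacer}$"
    # With zero failures the superscript holds only a spacer,
    # which keeps cells with and without failures aligned.
    superscript = str(fails) if fails else spacer
    return f"{mean:.3f}({std:.3f})$^{{{superscript}}}$"


print(format_cell(1.111, 0.268, 5))  # 1.111(0.268)$^{5}$
print(format_cell(0.695, 0.086, 0))  # 0.695(0.086)$^{\hspace{0.4em}}$
print(format_cell(0.0, 0.0, 10))     # -$\hspace{0.4em}$
```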

Now that the individual values are combined in `display`, we can drop the separate `result` and `fails` columns:

In [45]:
summary = summary.drop(columns=["result"], level=0)
summary = summary.droplevel(1, axis=1)
summary = summary.drop(columns=["fails"])

We abbreviate the task name so that the table fits the page better. Similarly, the `openml/` prefix of the `id` provides no additional information, so we keep only the last part of the `id`.

In [46]:
summary["id"] = summary["id"].apply(lambda s: s.split("/")[-1])
summary["task"] = summary["task"].apply(lambda s: s if len(s) < 10 else (s[:8] + "..."))
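
A small sketch of what these two transformations do to individual values. The helper names are hypothetical, and the `openml/t/10090` format of the raw `id` is an assumption; only the part after the last `/` matters here:

```python
def shorten_id(task_id: str) -> str:
    # Keep only the part after the last "/".
    return task_id.split("/")[-1]


def abbreviate(name: str) -> str:
    # Names of 10 or more characters are cut to 8 characters plus "...".
    return name if len(name) < 10 else name[:8] + "..."


print(shorten_id("openml/t/10090"))  # 10090
print(abbreviate("wilt"))            # wilt
print(abbreviate("Australian"))      # Australi...
```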

Finally, we want each framework's results in its own column, since that makes it easier to compare results:

In [47]:
table = summary.pivot(index=["id", "task", "constraint", "metric"], columns="framework", values="display")
table.head()

id,task,constraint,metric,AutoGluon(B),AutoGluon(HQ),AutoGluon(HQIL),GAMA(B),H2OAutoML,MLJAR(B),MLJAR(P),RandomForest,TPOT,TunedRandomForest,autosklearn,autosklearn2,constantpredictor,flaml,lightautoml
10090,amazon-c...,1h8c_gp3,logloss,0.695(0.086)$^{\hspace{0.4em}}$,0.747(0.104)$^{\hspace{0.4em}}$,1.187(0.119)$^{\hspace{0.4em}}$,0.910(0.066)$^{\hspace{0.4em}}$,1.077(0.092)$^{\hspace{0.4em}}$,1.169(0.133)$^{\hspace{0.4em}}$,1.252(0.178)$^{\hspace{0.4em}}$,2.057(0.068)$^{\hspace{0.4em}}$,1.111(0.268)$^{5}$,1.462(0.087)$^{\hspace{0.4em}}$,1.139(0.132)$^{\hspace{0.4em}}$,1.125(0.146)$^{\hspace{0.4em}}$,3.912(0.000)$^{\hspace{0.4em}}$,1.115(0.155)$^{\hspace{0.4em}}$,0.843(0.088)$^{\hspace{0.4em}}$
10090,amazon-c...,4h8c_gp3,logloss,0.673(0.072)$^{\hspace{0.4em}}$,0.707(0.079)$^{\hspace{0.4em}}$,,0.907(0.094)$^{\hspace{0.4em}}$,1.077(0.092)$^{\hspace{0.4em}}$,1.181(0.132)$^{\hspace{0.4em}}$,,2.057(0.068)$^{\hspace{0.4em}}$,0.852(0.159)$^{2}$,1.462(0.115)$^{\hspace{0.4em}}$,1.138(0.130)$^{\hspace{0.4em}}$,0.837(0.122)$^{\hspace{0.4em}}$,3.912(0.000)$^{\hspace{0.4em}}$,1.101(0.150)$^{\hspace{0.4em}}$,0.817(0.077)$^{\hspace{0.4em}}$
146818,Australi...,1h8c_gp3,auc,0.941(0.018)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,0.941(0.022)$^{1}$,0.935(0.025)$^{\hspace{0.4em}}$,0.943(0.020)$^{\hspace{0.4em}}$,0.944(0.017)$^{\hspace{0.4em}}$,0.940(0.021)$^{\hspace{0.4em}}$,0.939(0.022)$^{\hspace{0.4em}}$,0.938(0.022)$^{\hspace{0.4em}}$,0.931(0.022)$^{\hspace{0.4em}}$,0.936(0.019)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.938(0.023)$^{\hspace{0.4em}}$,0.946(0.020)$^{\hspace{0.4em}}$
146818,Australi...,4h8c_gp3,auc,0.941(0.018)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,,0.940(0.019)$^{\hspace{0.4em}}$,0.935(0.021)$^{\hspace{0.4em}}$,0.940(0.024)$^{\hspace{0.4em}}$,,0.940(0.021)$^{\hspace{0.4em}}$,0.936(0.024)$^{\hspace{0.4em}}$,0.938(0.022)$^{\hspace{0.4em}}$,0.931(0.023)$^{\hspace{0.4em}}$,0.940(0.020)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.941(0.025)$^{\hspace{0.4em}}$,0.946(0.020)$^{\hspace{0.4em}}$
146820,wilt,1h8c_gp3,auc,0.995(0.008)$^{\hspace{0.4em}}$,0.994(0.010)$^{\hspace{0.4em}}$,0.994(0.010)$^{\hspace{0.4em}}$,0.996(0.004)$^{\hspace{0.4em}}$,0.992(0.010)$^{\hspace{0.4em}}$,0.999(0.000)$^{8}$,0.995(0.009)$^{\hspace{0.4em}}$,0.989(0.012)$^{\hspace{0.4em}}$,0.996(0.004)$^{\hspace{0.4em}}$,0.991(0.010)$^{\hspace{0.4em}}$,0.994(0.009)$^{\hspace{0.4em}}$,0.995(0.007)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.991(0.011)$^{\hspace{0.4em}}$,0.994(0.007)$^{\hspace{0.4em}}$


We sort the frameworks alphabetically, but put the baseline frameworks at the end. This makes it easier to find the column you are looking for.

In [49]:
baselines = ["constantpredictor", "RandomForest", "TunedRandomForest"]
framework_order = [framework for framework in sorted(summary.framework.unique(), key=lambda s: s.lower()) if framework not in baselines]
framework_order = {framework: i for i, framework in enumerate(framework_order + baselines)}
table = table[[c for c in sorted(table.columns, key=framework_order.get)]]
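
The ordering logic can be tried on a plain list of names. The sketch below uses only a handful of the framework names from the table above:

```python
baselines = ["constantpredictor", "RandomForest", "TunedRandomForest"]
columns = ["TPOT", "RandomForest", "flaml", "constantpredictor", "AutoGluon(B)"]

# Case-insensitive alphabetical order for the AutoML frameworks ...
automl = sorted((c for c in columns if c not in baselines), key=str.lower)
# ... followed by the baselines in their fixed order.
position = {name: i for i, name in enumerate(automl + baselines)}
ordered = sorted(columns, key=position.get)

print(ordered)
# ['AutoGluon(B)', 'flaml', 'TPOT', 'constantpredictor', 'RandomForest']
```

Sorting on the lower-cased name is what places `flaml` between `AutoGluon(B)` and `TPOT` instead of after all capitalised names.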

In [50]:
table.head()

id,task,constraint,metric,AutoGluon(B),AutoGluon(HQ),AutoGluon(HQIL),autosklearn,autosklearn2,flaml,GAMA(B),H2OAutoML,lightautoml,MLJAR(B),MLJAR(P),TPOT,constantpredictor,RandomForest,TunedRandomForest
10090,amazon-c...,1h8c_gp3,logloss,0.695(0.086)$^{\hspace{0.4em}}$,0.747(0.104)$^{\hspace{0.4em}}$,1.187(0.119)$^{\hspace{0.4em}}$,1.139(0.132)$^{\hspace{0.4em}}$,1.125(0.146)$^{\hspace{0.4em}}$,1.115(0.155)$^{\hspace{0.4em}}$,0.910(0.066)$^{\hspace{0.4em}}$,1.077(0.092)$^{\hspace{0.4em}}$,0.843(0.088)$^{\hspace{0.4em}}$,1.169(0.133)$^{\hspace{0.4em}}$,1.252(0.178)$^{\hspace{0.4em}}$,1.111(0.268)$^{5}$,3.912(0.000)$^{\hspace{0.4em}}$,2.057(0.068)$^{\hspace{0.4em}}$,1.462(0.087)$^{\hspace{0.4em}}$
10090,amazon-c...,4h8c_gp3,logloss,0.673(0.072)$^{\hspace{0.4em}}$,0.707(0.079)$^{\hspace{0.4em}}$,,1.138(0.130)$^{\hspace{0.4em}}$,0.837(0.122)$^{\hspace{0.4em}}$,1.101(0.150)$^{\hspace{0.4em}}$,0.907(0.094)$^{\hspace{0.4em}}$,1.077(0.092)$^{\hspace{0.4em}}$,0.817(0.077)$^{\hspace{0.4em}}$,1.181(0.132)$^{\hspace{0.4em}}$,,0.852(0.159)$^{2}$,3.912(0.000)$^{\hspace{0.4em}}$,2.057(0.068)$^{\hspace{0.4em}}$,1.462(0.115)$^{\hspace{0.4em}}$
146818,Australi...,1h8c_gp3,auc,0.941(0.018)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,0.931(0.022)$^{\hspace{0.4em}}$,0.936(0.019)$^{\hspace{0.4em}}$,0.938(0.023)$^{\hspace{0.4em}}$,0.941(0.022)$^{1}$,0.935(0.025)$^{\hspace{0.4em}}$,0.946(0.020)$^{\hspace{0.4em}}$,0.943(0.020)$^{\hspace{0.4em}}$,0.944(0.017)$^{\hspace{0.4em}}$,0.939(0.022)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.940(0.021)$^{\hspace{0.4em}}$,0.938(0.022)$^{\hspace{0.4em}}$
146818,Australi...,4h8c_gp3,auc,0.941(0.018)$^{\hspace{0.4em}}$,0.942(0.017)$^{\hspace{0.4em}}$,,0.931(0.023)$^{\hspace{0.4em}}$,0.940(0.020)$^{\hspace{0.4em}}$,0.941(0.025)$^{\hspace{0.4em}}$,0.940(0.019)$^{\hspace{0.4em}}$,0.935(0.021)$^{\hspace{0.4em}}$,0.946(0.020)$^{\hspace{0.4em}}$,0.940(0.024)$^{\hspace{0.4em}}$,,0.936(0.024)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.940(0.021)$^{\hspace{0.4em}}$,0.938(0.022)$^{\hspace{0.4em}}$
146820,wilt,1h8c_gp3,auc,0.995(0.008)$^{\hspace{0.4em}}$,0.994(0.010)$^{\hspace{0.4em}}$,0.994(0.010)$^{\hspace{0.4em}}$,0.994(0.009)$^{\hspace{0.4em}}$,0.995(0.007)$^{\hspace{0.4em}}$,0.991(0.011)$^{\hspace{0.4em}}$,0.996(0.004)$^{\hspace{0.4em}}$,0.992(0.010)$^{\hspace{0.4em}}$,0.994(0.007)$^{\hspace{0.4em}}$,0.999(0.000)$^{8}$,0.995(0.009)$^{\hspace{0.4em}}$,0.996(0.004)$^{\hspace{0.4em}}$,0.500(0.000)$^{\hspace{0.4em}}$,0.989(0.012)$^{\hspace{0.4em}}$,0.991(0.010)$^{\hspace{0.4em}}$


# Conversion to $\LaTeX$
Generating LaTeX code for tables 4-9 of the appendix. Unlike the 2022 preprint, the 2023 evaluation contains too many frameworks to fit each table on one page.
We therefore split the frameworks into two groups: A-H, and I-Z together with the baselines (frameworks are ordered alphabetically, with the baselines last).
We repeat the `id` and `task` columns in each table to make the tables individually interpretable.

We advise you to store each table in a separate `.tex` file, which makes it easier to re-generate a single table when results are updated.
It can then be included in the paper with `\input{MY_TABLE_FILE.tex}`. Additionally, make sure that you include the following three packages:

```
\usepackage{amsmath}
\usepackage{booktabs}  % required for top-, mid- and bottom rules
\usepackage{lscape}  % required for landscape table
```

In [97]:
def to_latex(table: pd.DataFrame) -> str:
    tex = table.style.to_latex()
    # underscores need to be escaped explicitly
    return tex.replace("_", r"\_")

def drop_table_header_and_footer(latex_table: str) -> list[str]:
    """ Drop latex table header and footer. Only applicable to the expected format in this notebook. """
    *body, footer = latex_table.splitlines()[3:]
    return body
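
To show what the header and footer stripping does, the sketch below applies the same slicing to a hand-written string. The string is an assumption that mimics the layout this notebook expects (a `\begin{tabular}` line, two header rows, the body rows and a closing `\end{tabular}` line); it is not actual `Styler.to_latex` output:

```python
def drop_header_and_footer(latex_table: str) -> list[str]:
    # Skip the first three lines and discard the final line.
    *body, _footer = latex_table.splitlines()[3:]
    return body


sample = "\n".join([
    r"\begin{tabular}{llll}",
    r" &  & fw_a & fw_b \\",
    r"id & task &  &  \\",
    r"10090 & toy_task & 0.7 & 0.8 \\",
    r"146820 & wilt & 0.9 & 0.9 \\",
    r"\end{tabular}",
])

# Underscores are escaped first, as in `to_latex` above.
escaped = sample.replace("_", r"\_")
body = drop_header_and_footer(escaped)
print("\n".join(body))
```

Only the two data rows remain, so a custom header and footer can be wrapped around them.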
    

By default, the table is generated in portrait layout, but we need landscape layout to display each (sub)table on a single page.

In [52]:
def generate_landscape_header(table: pd.DataFrame) -> list[str]:
    return [
        r"\footnotesize",
        r"\begin{landscape}",
        r"\begin{table}",
        r"\tiny",
        r"\begin{tabular}{rl" + "r" * len(table.columns) + "}",
        r"\toprule",
        # "\\ \\ " emits two explicit LaTeX spaces ("\ \ ") after each framework name
        " & framework" + "".join([f"& {framework}\\ \\ " for framework in table.columns]) + r"\\",
        " task id & task name " + "& " * len(table.columns) + r"\\",
        r"\midrule",
    ]

In [53]:
def generate_footer(caption: str, label: str) -> list[str]:
    return [
        r"\bottomrule",
        r"\end{tabular}",
        caption,
        label,
        r"\end{table}",
        r"\end{landscape}",
    ]

We create two tables for each (constraint, task type) combination. Each table contains roughly half of the evaluated frameworks.

In [56]:
metrics = {
    "auc": "binary classification (in AUC)", "logloss": "multiclass classification (in logloss)", "rmse": "regression (in RMSE)",
}
constraints = {
    "1h8c_gp3": "one hour", "4h8c_gp3": "four hour"
}
groups = {
    "A-H": list(table.columns)[:8],
    "I-Z": list(table.columns)[8:],
}


In [93]:
def fw_with_old_indicator(framework: str, metric: str, constraint: str) -> str:
    """Provides the framework name with an '*' indicator if the reported results are old. Also adjusts for the change in metric name. """
    if metric.casefold() != "auc":
        metric = f"neg_{metric}"
    indicator = r"\text{*}" if is_old(framework, constraint=constraint, metric=metric) else ""
    return f"{framework}{indicator}"

In [88]:
subset.isna().all(axis=0)

lightautoml                  False
MLJAR(B)                     False
MLJAR(P)                      True
TPOT\text{*}                 False
constantpredictor            False
RandomForest                 False
TunedRandomForest\text{*}    False
dtype: bool

In [96]:
for metric, readable_metric in metrics.items():
    for constraint, readable_constraint in constraints.items():
        for group, frameworks in groups.items():
            caption = f"\\caption{{Results for {readable_metric} on a {readable_constraint} budget, denoted as \\texttt{{mean}}(\\texttt{{std}})$^{{\\mbox{{\\texttt{{fails}}}}}}$.}}"
            label = f"\\label{{tab:{metric}-{constraint}}}"
    
            subset = table.loc[(slice(None), slice(None), constraint, metric), frameworks]
            subset.index = subset.index.droplevel([2, 3])
            subset.columns = [fw_with_old_indicator(framework, metric, constraint) for framework in subset.columns]
            # Some 4 hour budget experiments weren't run on all task types, omit columns for those (framework, type, constraint) combinations:
            subset = subset.loc[slice(None), ~subset.isna().all(axis=0)]
            
            latex_table = to_latex(subset)
            header = generate_landscape_header(subset)
            body = drop_table_header_and_footer(latex_table)
            footer = generate_footer(caption, label)
    
            with open(TABLE_DIRECTORY / f"{metric}-{constraint}-{group}.tex", "w") as fh:
                fh.write("\n".join(header + body + footer))
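
The row selection and the removal of empty columns in the loop above can be sketched on a toy table. All ids, task names, framework names and values below are invented for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("1", "toy_a", "1h8c_gp3", "auc"),
        ("1", "toy_a", "4h8c_gp3", "auc"),
        ("2", "toy_b", "1h8c_gp3", "auc"),
        ("2", "toy_b", "4h8c_gp3", "auc"),
    ],
    names=["id", "task", "constraint", "metric"],
)
# "fw_b" has no results for the four hour constraint.
toy_table = pd.DataFrame(
    {"fw_a": ["0.9", "0.9", "0.8", "0.8"], "fw_b": ["0.7", None, "0.6", None]},
    index=index,
).sort_index()  # slicing on a MultiIndex requires a sorted index

# Select all ids and tasks for one (constraint, metric) combination.
subset = toy_table.loc[(slice(None), slice(None), "4h8c_gp3", "auc"), ["fw_a", "fw_b"]]
# Constraint and metric are constant within the subset, so drop those levels.
subset.index = subset.index.droplevel([2, 3])
# Drop frameworks without any result for this combination.
subset = subset.loc[:, ~subset.isna().all(axis=0)]

print(list(subset.columns))  # ['fw_a']
```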

## Code below this line not updated

In [69]:
raise NotImplementedError("Not used")

NotImplementedError: Not used

Unused code for win/loss tables, champions:

In [None]:
data = get_results("all", budget="1h8c_gp3")
data = impute_values(data, strategy="constantpredictor")
data = data[~data.framework.isin(["mlr3automl", "constantpredictor"])]
data = data.groupby(["framework", "task", "constraint"], as_index=False).mean()

In [None]:
result = data[["framework", "task", "result"]]
cross = result.join(result, how="cross", rsuffix="_other")
cross = cross[(cross["framework"] != cross["framework_other"]) & (cross["task"] == cross["task_other"])]
cross.head()

In [None]:
def win_and_loss(data):
    best_score = data[["result", "result_other"]].max(axis=1)
    return pd.Series(dict(
        win= sum(data["result"] - data["result_other"] > best_score * 0.001),
        loss= sum(data["result_other"] - data["result"] > best_score * 0.001),
        tie= sum(abs(data["result"] - data["result_other"]) < best_score * 0.001)
    ))
# cross.groupby(["framework", "framework_other"], as_index=False).apply(lambda df: sum(df["result"] > df["result_other"]))

In [None]:
wins_and_losses = cross.groupby(["framework", "framework_other"], as_index=False).apply(win_and_loss)
wins_and_losses.sample(5)

In [None]:
wins_and_losses["wl_str"] = wins_and_losses.apply(lambda r: f"{r['win']}/{r['loss']}/{r['tie']}", axis=1)

In [None]:
win_loss_table = wins_and_losses.pivot(index="framework", columns="framework_other", values="wl_str")
framework_order = sorted(win_loss_table.columns, key=lambda s: s.lower())
win_loss_table[framework_order].loc[framework_order]

In [None]:
table = win_loss_table[framework_order].loc[framework_order]
tex = table.style.to_latex().replace("_", r"\_").replace("nan", "-")

with open("win_loss_table.tex", "w") as fh:
    fh.write(tex)
    #fh.write("\n".join(new_header + body + footer))

In [None]:
table = win_loss_table[framework_order].loc[framework_order]
tex = table.style.to_latex().replace("_", r"\_").replace("nan", "-")

# # the headers will be too wide to fit a page, we rotate them:
old_header, body = tex.splitlines()[:3], tex.splitlines()[3:]
new_header = [
    r"\footnotesize",
    r"\begin{landscape}",
    r"\begin{table}",
    r"\tiny",
    r"\begin{tabular}{r" + "r" * len(table.columns) + "}",
    r"\toprule",
    " framework A" + "".join([f"& \\rotatebox[origin=c]{{-90}}{{{framework}}}" for framework in table.columns]) + r"\\",
    #" & \ \ framework B" + "".join([f"& {framework}\ \ " for framework in table.columns]) + r"\\",
    " framework B " + "& " * len(table.columns) + r"\\",
    r"\midrule",
]
*body, footer = body
caption = f"\\caption{{Results of direct comparison between frameworks on a one hour budget across all suites. Each cell denotes the wins, losses, and ties of the row-framework over the column-framework (e.g., AutoGluon(B) wins from autosklearn 78 times). A tie is recorded if the relative difference is smaller than 0.1 percent of the greater score. No statistical tests are used.}}"
label = f"\\label{{tab:head-to-head}}"
footer = [r"\bottomrule", footer, caption, label, r"\end{table}", r"\end{landscape}",]

with open("win_loss_table.tex", "w") as fh:
    fh.write("\n".join(new_header + body + footer))

In [None]:
data = get_results("all", budget="1h8c_gp3")
data = impute_values(data, strategy="constantpredictor")
data = data[~data.framework.isin(["mlr3automl", "constantpredictor"])]
data = data.groupby(["framework", "task", "constraint"], as_index=False).mean()

In [None]:
# Find max per task to have a reference point for `champions'
max_per_task = data.groupby(["task"]).max()["result"]
data["max_for_task"] = data["task"].apply(lambda t: max_per_task.loc[t])
data["is_champion"] = abs(data["max_for_task"] - data["result"]) < abs(data["max_for_task"] * 0.001)

In [None]:
data.groupby("framework").sum()["is_champion"]