# Data Cleaning Notebook
This notebook combines a number of files into a clean dataset that is used for visualization. These files are:
 - `data/amlb_2023Q2.csv`: the raw results of the benchmark that took place in June 2023, used for all the AutoML framework results.
 - `data/amlb_2021Q3.csv`: the _cleaned_ results of the benchmark that took place in the fall of 2021, used for the `constantpredictor` and `tunedrandomforest` baselines as well as `mlr3automl`.

Note that a few (< 20) (framework, task, fold)-combinations still need to be run due to technical difficulties.
The original file contains `NaiveAutoML` data, but after encountering severe issues and corresponding with the authors, we decided to exclude these results from the analysis and therefore from the file.

The output file will have results for each (framework, task, fold)-combination as per the table below. 

Legend:
 - ❌ no results
 - ☑️ results from 2021Q3
 - ✅ results from 2023Q2

| [framework](https://github.com/openml/automlbenchmark/blob/12046acc4824dd48414c0543b518cd628490a12d/resources/frameworks_2023Q2.yaml) | [classification, 1h](https://www.openml.org/s/271) | [classification, 4h](https://www.openml.org/s/271) | [regression, 1h](https://www.openml.org/s/269) | [regression, 4h](https://www.openml.org/s/269) |
|--|:--:|:--:|:--:|:--:|
| AutoGluon (benchmark) | ✅ | ✅ | ✅ | ✅ | 
| AutoGluon (high quality)| ✅ | ✅ | ✅ | ✅ | 
| AutoGluon (high quality, infer limit) | ✅ | ❌ | ✅ | ✅ | 
| autosklearn | ✅ | ✅ | ✅ | ✅ | 
| autosklearn 2 | ✅ | ☑️ | ❌ | ❌ | 
| flaml | ✅ | ✅ | ✅ | ✅ | 
| GAMA (benchmark) | ✅ | ☑️ | ✅ | ☑️ | 
| H2O AutoML | ✅ | ✅ | ✅ | ✅ | 
| Light AutoML | ✅ | ✅ | ✅ | ✅ | 
| MLJar Supervised (benchmark) | ✅ | ☑️ | ✅ | ✅ | 
| MLJar Supervised (perform) | ✅ | ❌ | ✅ | ✅ | 
| mlr3automl | ☑️ | ☑️ | ☑️ | ☑️ | 
| Naive AutoML | ✅ | ❌ | ✅ | ❌ | 
| TPOT | ✅ | ☑️ | ✅ | ☑️ | 
| RandomForest | ✅ | ✅ | ✅ | ✅ | 
| TunedRandomForest | ☑️ | ☑️ | ☑️ | ☑️ |

In [1]:
import itertools
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path(".").absolute().parent
DATA_DIRECTORY = PROJECT_ROOT / "data"

In [2]:
pd.concat(
    pd.read_csv(f"http://openml-test.win.tue.nl/amlb/{ttype}_cleaned.csv")    
    for ttype in ["classification", "regression"]
).to_csv(DATA_DIRECTORY / "amlb_2021Q3.csv", index=False)

We load all the data and add meta-data about the source for easier filtering later:

In [3]:
amlb_2021Q3 = pd.read_csv(DATA_DIRECTORY / "amlb_2021Q3.csv")
amlb_2021Q3["source"] = "2021Q3"
amlb_2023Q2 = pd.read_csv(DATA_DIRECTORY / "amlb_2023Q2.csv")
amlb_2023Q2["source"] = "2023Q2"

## Cleaning Duplicates

`autogluon_benchmark` was evaluated twice on `fold 0` of the `regression` benchmark by accident. We will only keep the second evaluation. However, there is one experiment in the second evaluation which was aborted during inference time measurements (`Santander_transaction_value`). Normally, we would run this experiment again without inference time measurements (as above), but since we already had the accidental second batch of results, we can just make sure that is used instead.

In [4]:
is_autogluon_benchmark = (amlb_2023Q2["framework"] == "AutoGluon_benchmark")
fold_0_Santander = (amlb_2023Q2["task"] == "Santander_transaction_value") & (amlb_2023Q2["fold"] == 0)
one_hour_constraint = (amlb_2023Q2["constraint"] == "1h8c_gp3")
amlb_2023Q2[is_autogluon_benchmark & fold_0_Santander & one_hour_constraint]

| | id | task | framework | constraint | fold | type | result | metric | mode | version | ... | r2 | rmse | infer_batch_size_df_1 | infer_batch_size_file_1 | infer_batch_size_file_10 | infer_batch_size_file_100 | infer_batch_size_file_1000 | infer_batch_size_file_10000 | models_ensemble_count | source |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 815 | openml.org/t/233214 | Santander_transaction_value | AutoGluon_benchmark | 1h8c_gp3 | 0 | regression | -7233710.0 | neg_rmse | aws.docker | 0.8.0 | ... | 0.229936 | 7233710.0 | 20.3407 | 20.5419 | 21.4607 | 22.6191 | 32.9448 | NaN | 18.0 | 2023Q2 |
| 869 | openml.org/t/233214 | Santander_transaction_value | AutoGluon_benchmark | 1h8c_gp3 | 0 | regression | NaN | neg_rmse | aws.docker | 0.8.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2023Q2 |


In [5]:
amlb_2023Q2 = amlb_2023Q2.drop(869)
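
Dropping by index label ties the cell above to the row order of the raw file. As a hedged alternative, the aborted run could be selected by its key columns together with its missing result. The sketch below uses a made-up toy frame (`toy`, with invented result values and a hypothetical third row), not the benchmark data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for amlb_2023Q2; all result values are invented for illustration.
toy = pd.DataFrame(
    {
        "framework": ["AutoGluon_benchmark", "AutoGluon_benchmark", "flaml"],
        "task": ["Santander_transaction_value"] * 3,
        "fold": [0, 0, 0],
        "constraint": ["1h8c_gp3"] * 3,
        "result": [-1.5, np.nan, -2.0],
    },
    index=[815, 869, 900],
)

# Select the aborted run by what identifies it, not by its position in the file.
aborted = (
    (toy["framework"] == "AutoGluon_benchmark")
    & (toy["task"] == "Santander_transaction_value")
    & (toy["fold"] == 0)
    & (toy["constraint"] == "1h8c_gp3")
    & toy["result"].isna()
)
toy_cleaned = toy[~aborted]
```

With this selection the cell would keep working if the raw file were regenerated with a different row order, at the cost of being more verbose.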

There are several ordinary reasons why one might find multiple entries for the same (framework, task, fold, constraint)-combination, although normally only one of them has a result.
The other entries are typically failures through no fault of the AutoML framework. Causes include, but are not limited to:
 - Bugs in the AutoML benchmark which affected the specific (framework, task)-combination.
 - A job was stopped because the inference time measurements took too long, in which case it was tried again without inference time measurements.
 - Random issues, such as errors when downloading the dataset from `openml`.

There are also other cases where an entry was missing completely, for example because docker denied a `docker pull` during setup, in which case a retry was also queued. Now we can proceed and keep, for each combination, the entry that has a result:

In [6]:
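# Sort rows without a result (NaN) to the front, so that `keep="last"`
# retains an entry with a result whenever one exists.
amlb_2023Q2 = amlb_2023Q2.sort_values(
    by="result", na_position="first"
).drop_duplicates(
    ["task", "framework", "fold", "constraint"],
    keep="last"
)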

In [7]:
assert amlb_2023Q2[amlb_2023Q2.duplicated(["framework", "task", "fold", "constraint"])].empty
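
The sort-then-deduplicate idiom used above can be checked on a small example. The frame below (`toy_runs`) is invented for illustration: task `a` has one failed and one successful entry, task `b` has a single entry. Note that if two entries for the same combination both had a result, sorting by `result` would keep the larger value rather than the chronologically later run.

```python
import numpy as np
import pandas as pd

# Invented example: task "a" was retried after a failure, task "b" ran once.
toy_runs = pd.DataFrame(
    {
        "task": ["a", "a", "b"],
        "framework": ["flaml", "flaml", "flaml"],
        "fold": [0, 0, 0],
        "constraint": ["1h8c_gp3"] * 3,
        "result": [np.nan, 0.9, 0.7],
    }
)

# NaN results are sorted first, so the last duplicate is one with a result.
deduplicated = toy_runs.sort_values(
    by="result", na_position="first"
).drop_duplicates(
    ["task", "framework", "fold", "constraint"],
    keep="last",
)
```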

# Show Missing Results

There are some (framework, fold, task, constraint)-combinations which do not have any entries:

In [8]:
from IPython.display import display

with pd.option_context("display.max_rows", 64):
    display(amlb_2023Q2.groupby(by=["type","constraint","framework"]).size())

type        constraint  framework                
binary      1h8c_gp3    AutoGluon_benchmark          410
                        AutoGluon_hq                 410
                        AutoGluon_hq_il001           410
                        GAMA_benchmark               410
                        H2OAutoML                    410
                        NaiveAutoML                  410
                        RandomForest                 410
                        TPOT                         410
                        autosklearn                  410
                        autosklearn2                 410
                        flaml                        410
                        lightautoml                  410
                        mljarsupervised_benchmark    410
                        mljarsupervised_perform      410
            4h8c_gp3    AutoGluon_benchmark          410
                        AutoGluon_hq                 410
                        H2OAutoML     
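
The group sizes above show how many entries exist, but not which combinations are absent. One way to list them explicitly is to compare the observed keys against a full cross product. This is a minimal sketch on an invented frame (`toy_results`), and it assumes every framework is expected on every (task, fold, constraint), which does not hold for the real benchmark where some frameworks were not run under every constraint:

```python
import itertools

import pandas as pd

# Invented example data; not taken from the benchmark results.
toy_results = pd.DataFrame(
    {
        "framework": ["flaml", "flaml", "flaml", "TPOT", "TPOT"],
        "task": ["a", "a", "b", "a", "b"],
        "fold": [0, 1, 0, 0, 0],
        "constraint": ["1h8c_gp3"] * 5,
    }
)

key_columns = ["framework", "task", "fold", "constraint"]

# All combinations we would expect if every framework ran everywhere.
expected = set(
    itertools.product(
        *(toy_results[column].drop_duplicates().tolist() for column in key_columns)
    )
)
# Combinations that actually have an entry.
observed = set(toy_results[key_columns].itertuples(index=False, name=None))

missing = sorted(expected - observed)
```

Here `missing` holds the three fold-1 combinations that have no entry in the toy data.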

# Transfer Random Forest Results
The `RandomForest` baseline trains at most 2000 trees. This means that results for `1h8c_gp3` and `4h8c_gp3` should be identical as long as 2000 trees were trained. For that reason, we only ran `RandomForest` on `4h8c_gp3` when the baseline did not train 2000 trees under the `1h8c_gp3` constraint. We copy all fully trained `1h8c_gp3` results to `4h8c_gp3` to obtain a complete set of `4h8c_gp3` results:

In [9]:
fully_trained_randomforest = (amlb_2023Q2["framework"] == "RandomForest") & (amlb_2023Q2["models_count"] == 2000)
randomforest_1h8c_gp3 = amlb_2023Q2[fully_trained_randomforest & (amlb_2023Q2["constraint"] == "1h8c_gp3")].copy()
randomforest_1h8c_gp3["constraint"] = "4h8c_gp3"
amlb_2023Q2 = pd.concat([amlb_2023Q2, randomforest_1h8c_gp3])

In [10]:
assert amlb_2023Q2[amlb_2023Q2.duplicated(["framework", "task", "fold", "constraint"])].empty
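
To make the transfer logic concrete, here is a small sketch on invented data (`toy_rf`): task `a` finished all 2000 trees within one hour and has no four-hour run, while task `b` did not and was rerun under `4h8c_gp3`. After the transfer, both tasks have exactly one `4h8c_gp3` entry.

```python
import pandas as pd

# Invented example data; the tree counts are made up for illustration.
toy_rf = pd.DataFrame(
    {
        "framework": ["RandomForest"] * 3,
        "task": ["a", "b", "b"],
        "fold": [0, 0, 0],
        "constraint": ["1h8c_gp3", "1h8c_gp3", "4h8c_gp3"],
        "models_count": [2000, 1200, 2000],
    }
)

# Only one-hour runs that trained all 2000 trees are valid four-hour results.
fully_trained = (toy_rf["models_count"] == 2000) & (toy_rf["constraint"] == "1h8c_gp3")
copied = toy_rf[fully_trained].copy()
copied["constraint"] = "4h8c_gp3"

toy_rf_padded = pd.concat([toy_rf, copied], ignore_index=True)
four_hour_tasks = sorted(
    toy_rf_padded.loc[toy_rf_padded["constraint"] == "4h8c_gp3", "task"]
)
```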

# Transfer Old Results

We couldn't rerun all experiments. We tried to prioritize rerunning frameworks that had bigger changes from their previously benchmarked version. For completeness, we add old results for which we do not have a newer version. These are easily filtered out by the `"source"` column (either `"2023Q2"` or `"2021Q3"`). Here are the version differences for frameworks that had their results transferred:

| framework | 2021Q3 | 2023Q2 | note |
| --------- | ------:| ------: | -- |
| autosklearn 2 | 0.14.0 | 0.15.0 | |
| TPOT | 0.11.7 | 0.12.0 | |
| GAMA | 21.0.1 | 23.0.0 | Uses [CalVer](http://calver.org) variant, minor changes only |
| MLJar Supervised | 0.11.0 | 0.11.5 | |
| Tuned Random Forest | 0.24.2 | 1.2.2 | |
| constant predictor | 0.24.2 | 1.2.2 | Results independent of version |

In [11]:
four_hours = (amlb_2021Q3["constraint"] == "4h8c_gp3")
autosklearn2 = (amlb_2021Q3["framework"] == "autosklearn2") & four_hours
tpot = (amlb_2021Q3["framework"] == "TPOT") & four_hours
gama = (amlb_2021Q3["framework"] == "GAMA_benchmark") & four_hours
# MLJar: only the classification results are transferred.
mljar = (
    (amlb_2021Q3["framework"] == "mljarsupervised_benchmark")
    & four_hours
    & (amlb_2021Q3["type"].isin(["binary", "multiclass"]))
)
constant_predictor = (amlb_2021Q3["framework"] == "constantpredictor")
tuned_random_forest = (amlb_2021Q3["framework"] == "TunedRandomForest")
to_transfer = amlb_2021Q3[autosklearn2 | tpot | gama | mljar | constant_predictor | tuned_random_forest]

In [12]:
amlb_2023Q2_padded = pd.concat([to_transfer, amlb_2023Q2])

In [13]:
from IPython.display import display

with pd.option_context("display.max_rows", 90):
    display(amlb_2023Q2_padded.groupby(by=["type","constraint","framework"]).size())

type        constraint  framework                
binary      1h8c_gp3    AutoGluon_benchmark          410
                        AutoGluon_hq                 410
                        AutoGluon_hq_il001           410
                        GAMA_benchmark               410
                        H2OAutoML                    410
                        NaiveAutoML                  410
                        RandomForest                 410
                        TPOT                         410
                        TunedRandomForest            410
                        autosklearn                  410
                        autosklearn2                 410
                        constantpredictor            410
                        flaml                        410
                        lightautoml                  410
                        mljarsupervised_benchmark    410
                        mljarsupervised_perform      410
            4h8c_gp3    AutoGluon_benc

In [14]:
amlb_2023Q2_padded.to_csv(DATA_DIRECTORY / "amlb_all.csv", index=False)
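
Downstream code can use the `source` column to separate transferred results from new ones. The sketch below illustrates this on an invented frame (`toy_all`) that stands in for the contents of `amlb_all.csv`; the rows are not real benchmark entries.

```python
import pandas as pd

# Invented stand-in for the combined output file.
toy_all = pd.DataFrame(
    {
        "framework": ["TPOT", "TPOT", "flaml"],
        "constraint": ["1h8c_gp3", "4h8c_gp3", "1h8c_gp3"],
        "source": ["2023Q2", "2021Q3", "2023Q2"],
    }
)

# Keep only results produced in the 2023Q2 benchmark run.
only_new = toy_all[toy_all["source"] == "2023Q2"]

# Count how many entries come from each benchmark run.
per_source = toy_all.groupby("source").size()
```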