In this notebook we show how to generate all results and then plot the figures. Running all experiments can be quite slow unless you have access to multiple CPUs. Consequently, we have included all finished results in the folder "experiment_results" so that you do not need to run them locally. If you just want to generate the plots from these results, skip to "Generating results" below.

## Running experiments

In the cell below we define a config for the experiments to be executed. You may remove "COBRAS" and "nCOBRAS" from "acq_fn", as they can be quite slow (nCOBRAS in particular). QECC is also the more interesting baseline since it is based on correlation clustering.

In [22]:
config = {
    "_experiment_name": "acc_experiment",
    "_num_repeats": 8,
    "_n_workers": 8,
    "_verbose": True,
    "_overwrite": False,

    "seed": [14],
    "batch_size": [0.001],
    "noise_level": [0.2, 0.4],
    "warm_start": [0],
    "K_init": 10,
    "sim_init": [0.1],
    "sim_init_type": ["random_clustering", "kmeans"],
    "acq_fn": ["unif", "uncert", "freq", "maxmin", "maxexp", "QECC", "COBRAS", "nCOBRAS"],
    "eps": [0.3],
    "beta": [1],
    "tau": [3],
    "num_maxmin_edges": -1,

    "dataset_name": ["synthetic", "20newsgroups", "cifar10", "mnist", "mushrooms", "cardiotocography", "ecoli", "forest_type_mapping", "user_knowledge", "yeast"],

    "dataset_n_samples": 500,
    "dataset_n_clusters": [10],
    "dataset_class_balance": [None],
    "dataset_class_sep": [1.5],
    "dataset_n_features": [10],
    "dataset_y_flip": [0],
}
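To drop the slow baselines mentioned above, the "acq_fn" list can be filtered before the config is written. The following is a minimal sketch: `slow_acq_fns` and `fast_config` are names introduced here for illustration, and only the "acq_fn" entry of the config is reproduced.

```python
# Only the "acq_fn" entry is shown; the remaining keys stay as in the cell above.
config = {
    "acq_fn": ["unif", "uncert", "freq", "maxmin", "maxexp", "QECC", "COBRAS", "nCOBRAS"],
}

# Hypothetical name: the baselines described as slow in the text above.
slow_acq_fns = {"COBRAS", "nCOBRAS"}

# Copy the config so the original stays available, then drop the slow baselines.
fast_config = dict(config)
fast_config["acq_fn"] = [a for a in config["acq_fn"] if a not in slow_acq_fns]

print(fast_config["acq_fn"])
```

Passing `**fast_config` instead of `**config` in the cells below would then skip the two slow methods.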

The cell below generates a .json config file based on the config above. The options in "options_to_keep" are the options with more than one value: noise_level (2 values), sim_init_type (2 values), acq_fn (8 values) and dataset_name (10 values). Each experiment is repeated 8 times.

In [23]:
from acc.experiment_data import ExperimentReader
er = ExperimentReader()
start_index = 1
start_index = er.generate_experiments(
    folder="../configs/acc_experiment",
    options_to_keep=["noise_level", "sim_init_type", "acq_fn", "dataset_name"],
    start_index=start_index, 
    **config
)
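The list passed as "options_to_keep" above can be cross-checked against the config. This is a small sketch, assuming that a "multi-valued" option is a list with more than one entry and that keys starting with an underscore are run settings rather than experiment options. The config here is abbreviated, and `multi_valued` is a name introduced for illustration.

```python
# Abbreviated copy of the config above (single-valued options mostly omitted).
config = {
    "_num_repeats": 8,
    "seed": [14],
    "noise_level": [0.2, 0.4],
    "K_init": 10,
    "sim_init_type": ["random_clustering", "kmeans"],
    "acq_fn": ["unif", "uncert", "freq", "maxmin", "maxexp", "QECC", "COBRAS", "nCOBRAS"],
    "dataset_name": ["synthetic", "20newsgroups", "cifar10", "mnist", "mushrooms",
                     "cardiotocography", "ecoli", "forest_type_mapping",
                     "user_knowledge", "yeast"],
    "dataset_n_samples": 500,
}

# Keep every non-underscore option whose value is a list with more than one entry.
multi_valued = [
    key for key, value in config.items()
    if not key.startswith("_") and isinstance(value, list) and len(value) > 1
]

print(multi_valued)
# ['noise_level', 'sim_init_type', 'acq_fn', 'dataset_name']
```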

Run the following terminal command **when located in the root folder of the repository**. This runs all combinations of experiments in the above config file, i.e., 8 * 2 * 2 * 8 * 10 = 2560 jobs (8 repeats for each combination of the multi-valued options). The "_n_workers" option specifies how many jobs to run in parallel (i.e., how many cores of your CPU to use). See the bottom of the notebook for how to parallelize this across multiple CPUs, as this will be very slow on one CPU. Alternatively, you can modify the config to only run the jobs you are interested in.

In [None]:
!python acc/run_experiments.py --config=configs/acc_experiment/experiment1.json
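The job count quoted above can be reproduced with a few lines of arithmetic. The sketch below only uses the option counts from the config; the variable names are chosen here for illustration.

```python
from math import prod

num_repeats = 8  # "_num_repeats" in the config

# Number of values for each multi-valued option in the config.
option_counts = {
    "noise_level": 2,
    "sim_init_type": 2,
    "acq_fn": 8,
    "dataset_name": 10,
}

# One configuration per combination of option values, each repeated num_repeats times.
num_configs = prod(option_counts.values())
num_jobs = num_repeats * num_configs
print(num_configs, num_jobs)  # 320 2560

# Same count with "COBRAS" and "nCOBRAS" removed from "acq_fn".
reduced_counts = {**option_counts, "acq_fn": 6}
num_configs_reduced = prod(reduced_counts.values())
print(num_configs_reduced, num_repeats * num_configs_reduced)  # 240 1920
```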

## Generating results

After the command above finishes, all experiment results are saved to the folder "experiment_results_local". Because this can be slow, we have already run the experiments and placed the results in the folder "experiment_results". We now illustrate how plots can be generated from the results. After running the cells below, plots for all datasets (and other parameters) can be found in "../plots/acc_experiment/". You can modify the config if you only want to generate results for a subset of the experiments.

In [12]:
from acc.experiment_data import ExperimentReader
er = ExperimentReader(metrics=["rand", "ami", "num_repeat_queries", "num_violations", "time_select_batch"])
data = er.read_all_data(folder="../experiment_results/acc_experiment")

# uncomment to load data from your locally executed experiments
#data = er.read_all_data(folder="../experiment_results_local/acc_experiment")

In [21]:
data[data["acq_fn"] == "QECC"].iloc[15]["rand"][0]

array([0.45632351, 0.        , 0.        , 0.        , 0.        ,
       0.32986312, 0.09961036, 0.00577823, 0.35953738, 0.29999942,
       0.39184728, 0.27762071, 0.30126823, 0.34607857, 0.34987935,
       0.3078325 , 0.1497615 , 0.4089467 , 0.23585546, 0.32665499,
       0.19588229, 0.37656926, 0.20331101, 0.18593849, 0.30666243,
       0.31591129, 0.41272392, 0.35803657, 0.36069275, 0.35180472,
       0.22170296, 0.19967989, 0.34386342, 0.2661924 , 0.19074452,
       0.36678703, 0.33303105, 0.23113206, 0.24027535, 0.2959439 ,
       0.31928153, 0.28259881, 0.16019252, 0.18636357, 0.39684724,
       0.23321008, 0.15905161, 0.34335448, 0.36207486, 0.3344368 ,
       0.40120047, 0.33591049, 0.21286232, 0.362982  , 0.40032621,
       0.37578277, 0.34334859, 0.19718765, 0.24644271, 0.39865751,
       0.40755771, 0.29189617, 0.3555325 , 0.30397863, 0.3514252 ,
       0.32613314, 0.35606187, 0.282835  , 0.1478686 , 0.44559785,
       0.21821176, 0.40484403, 0.40209739, 0.31458631, 0.42579
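The cell above picks out a single curve via `["rand"][0]`. A common next step is to average such curves before plotting. The sketch below assumes (this is not confirmed by the source) that the "rand" entry of a row holds one equally long array per repeat. It uses a small toy list in place of the real data, and `rand_curves` is a name introduced for illustration.

```python
import numpy as np

# Toy stand-in for one row's "rand" entry: assumed to be one curve per repeat.
rand_curves = [
    np.array([0.10, 0.30, 0.50]),
    np.array([0.20, 0.40, 0.60]),
]

# Stack into shape (num_repeats, num_iterations) and summarise across repeats.
curves = np.stack(rand_curves)
mean_curve = curves.mean(axis=0)
std_curve = curves.std(axis=0)

print(curves.shape)
print(mean_curve)
```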

In [13]:
config = {
    "_experiment_name": "acc_experiment",
    "_num_repeats": 8,
    "_n_workers": 8,
    "_verbose": True,
    "_overwrite": False,

    "seed": [14],
    "batch_size": [0.001],
    "noise_level": [0.2, 0.4],
    "warm_start": [0],
    "K_init": 10,
    "sim_init": [0.1],
    "sim_init_type": ["random_clustering", "kmeans"],
    "acq_fn": ["unif", "uncert", "freq", "maxmin", "maxexp", "QECC", "COBRAS", "nCOBRAS"],
    "eps": [0.3],
    "beta": [1],
    "tau": [3],
    "num_maxmin_edges": -1,

    "dataset_name": ["synthetic", "20newsgroups", "cifar10", "mnist", "mushrooms", "cardiotocography", "ecoli", "forest_type_mapping", "user_knowledge", "yeast"],

    "dataset_n_samples": 500,
    "dataset_n_clusters": [10],
    "dataset_class_balance": [None],
    "dataset_class_sep": [1.5],
    "dataset_n_features": [10],
    "dataset_y_flip": [0],
}

er.generate_AL_curves(
    data,
    save_location="../plots/acc_experiment/",
    categorize=[],
    compare=["acq_fn"],
    options_in_file_name=["dataset_name", "noise_level", "sim_init_type"],
    err_style="band",
    marker="o",
    markersize=6,
    capsize=6,
    linestyle="solid",
    prop=True,
    **config
)

<Figure size 900x600 with 0 Axes>

## Running experiment on multiple CPUs

This works the same as above, except that we now leave the "options_to_keep" parameter empty. This generates one .json file per experiment (where each experiment is repeated 8 times). With the config below this gives 2 * 2 * 8 * 10 = 320 .json files (240 if "COBRAS" and "nCOBRAS" are excluded), each of which can be run on its own CPU. Each CPU can then run all 8 repeats in parallel on 1 core each.

This assumes access to a compute cluster with multiple CPUs. With such access, distributing the .json files across the CPUs should be straightforward.

In [24]:
from acc.experiment_data import ExperimentReader
er = ExperimentReader()
start_index = 1

config = {
    "_experiment_name": "acc_experiment",
    "_num_repeats": 8,
    "_n_workers": 8,
    "_verbose": True,
    "_overwrite": False,

    "seed": [14],
    "batch_size": [0.001],
    "noise_level": [0.2, 0.4],
    "warm_start": [0],
    "K_init": 10,
    "sim_init": [0.1],
    "sim_init_type": ["random_clustering", "kmeans"],
    "acq_fn": ["unif", "uncert", "freq", "maxmin", "maxexp", "QECC", "COBRAS", "nCOBRAS"],
    "eps": [0.3],
    "beta": [1],
    "tau": [3],
    "num_maxmin_edges": -1,

    "dataset_name": ["synthetic", "20newsgroups", "cifar10", "mnist", "mushrooms", "cardiotocography", "ecoli", "forest_type_mapping", "user_knowledge", "yeast"],

    "dataset_n_samples": 500,
    "dataset_n_clusters": [10],
    "dataset_class_balance": [None],
    "dataset_class_sep": [1.5],
    "dataset_n_features": [10],
    "dataset_y_flip": [0],
}

start_index = er.generate_experiments(
    folder="../configs/acc_experiment",
    options_to_keep=[],
    start_index=start_index, 
    **config
)
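Once the per-experiment .json files exist, one way to distribute them is a simple round-robin split over the available CPUs. The sketch below only builds the command lists and does not launch anything. The file naming `experiment{i}.json` (numbered from "start_index") and the value of `num_nodes` are assumptions made for illustration; adapt them to the files actually produced and to your cluster.

```python
# One file per combination of the multi-valued options in the config above.
num_files = 2 * 2 * 8 * 10
num_nodes = 16  # hypothetical number of CPUs/nodes available

# Assumed naming scheme, mirroring "experiment1.json" used earlier in the notebook.
config_paths = [
    f"configs/acc_experiment/experiment{i}.json" for i in range(1, num_files + 1)
]

# Round-robin assignment: node k gets files k, k + num_nodes, k + 2 * num_nodes, ...
assignments = {node: config_paths[node::num_nodes] for node in range(num_nodes)}

# Commands each node would run from the root folder of the repository.
commands = {
    node: [f"python acc/run_experiments.py --config={path}" for path in paths]
    for node, paths in assignments.items()
}

print(len(commands[0]), commands[0][0])
```

Each node then runs its own list sequentially, with "_n_workers" controlling how many of the 8 repeats run in parallel on that node.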