## Benchmark results analysis

In this section, we demonstrate how to analyse the results of the benchmark using the same data presented in the paper.


### Hardness of the MDPs in the benchmark
To investigate the hardness of the MDPs in the benchmark, we position each MDP according to its diameter and its environmental value norm.
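
As a rough illustration of the first of these two measures: the diameter of an MDP is the largest, over ordered pairs of distinct states, of the minimal expected number of steps needed to travel from one state to the other. The sketch below computes it for a tiny hand-written MDP by value iteration on expected hitting times. The functions `min_expected_hitting_times` and `diameter`, the transition format `P[s][a]`, and the toy chain are all hypothetical and only meant to convey the idea. This is not how Colosseum computes the measure, and it assumes a communicating MDP so that the iteration converges.

```python
def min_expected_hitting_times(P, target, max_iter=10_000, tol=1e-10):
    """Value iteration for the minimal expected number of steps to `target`.

    P[s][a] maps next states to probabilities (toy format, assumed here).
    """
    h = [0.0] * len(P)
    for _ in range(max_iter):
        new_h = []
        for s, actions in enumerate(P):
            if s == target:
                new_h.append(0.0)
                continue
            # One step is always paid, then the best action is followed.
            best = min(
                sum(p * h[s2] for s2, p in next_states.items())
                for next_states in actions
            )
            new_h.append(1.0 + best)
        if max(abs(x - y) for x, y in zip(h, new_h)) < tol:
            return new_h
        h = new_h
    return h


def diameter(P):
    """Largest minimal expected hitting time over ordered pairs of distinct states."""
    n = len(P)
    return max(
        min_expected_hitting_times(P, target)[s]
        for target in range(n)
        for s in range(n)
        if s != target
    )


# Hypothetical 3-state deterministic chain: action 0 moves left, action 1 moves right.
chain = [
    [{0: 1.0}, {1: 1.0}],
    [{0: 1.0}, {2: 1.0}],
    [{1: 1.0}, {2: 1.0}],
]
print(diameter(chain))
```

For this chain the two end states are two steps apart, so the diameter is 2. Making the "right" action succeed only half of the time would double the expected travel time from left to right and hence the diameter.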


In [None]:
import os
import re
import seaborn as sns
from glob import glob

import matplotlib
from matplotlib import cm
from matplotlib import pyplot as plt
from IPython.display import display, HTML

from colosseum.experiments.analyze import analyze
from colosseum.experiments.visualisation import experiment_summary2
from colosseum.experiments.utils import retrieve_experiment_prms, retrieve_n_seed
from colosseum.utils.benchmark_analysis import plot_labels_on_benchmarks_hardness_space
from colosseum.utils.miscellanea import clear_th, get_all_mdp_classes


def crs_in_hardness_space(exp_to_show):
    # plt.get_cmap is used because cm.get_cmap was removed in recent matplotlib releases.
    color_map = plt.get_cmap("Reds")
    df = experiment_summary2(glob(f"{exp_to_show}{os.sep}logs{os.sep}**"))
    # Raw string avoids invalid-escape warnings; the first decimal number in each cell is kept.
    df_numerical = df.applymap(lambda s: float(re.findall(r"\d+\.\d+", s)[0]))

    fig, axes = plt.subplots(1, len(df.columns), figsize=(len(df.columns) * 7, 7))
    for i, (a, ax) in enumerate(zip(df.columns, axes.tolist())):
        plot_labels_on_benchmarks_hardness_space(
            exp_to_show,
            text_f=lambda x: df.loc[x, a]
            .replace("\\textbf", "")
            .replace("$", "")
            .replace("{", "")
            .replace("}", "")
            .replace("\\pm", "±")[:4],
            color_f=lambda x: color_map(
                df_numerical.loc[x, a] / df_numerical.loc[:, a].max()
            ),
            ax=ax,
            multiplicative_factor_xlim=1.1,
            underneath_x_label="\n" + ["(a)", "(b)", "(c)", "(d)"][i],
            set_ylabel=i == 0,
        )
        ax.set_title(clear_th(a))
    plt.tight_layout()
    plt.show()


sns.set_theme()

available_experiments = list(
    sorted(
        filter(
            lambda x: x.split(os.sep)[-1][0] != "_",
            glob(f"experiments_done{os.sep}*"),
        )
    )
)
assert available_experiments == [
    "experiments_done/benchmark_continuous_communicating",
    "experiments_done/benchmark_continuous_ergodic",
    "experiments_done/benchmark_episodic_communicating",
    "experiments_done/benchmark_episodic_ergodic",
]
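
The helper `crs_in_hardness_space` treats each cell of the summary table as a LaTeX-formatted string: the first decimal number in the cell drives the colour, and a cleaned, truncated copy of the cell is used as the label. A minimal sketch of these two string operations on a made-up cell follows. The exact cell format returned by `experiment_summary2` is an assumption inferred from the `replace` calls, and the sample string is hypothetical.

```python
import re

# Hypothetical cell in the style the helper appears to expect.
cell = "$\\textbf{12.34}\\pm0.56$"

# First decimal number in the cell, used to pick the colour.
value = float(re.findall(r"\d+\.\d+", cell)[0])

# Plain-text label: strip the LaTeX markup and keep the first four characters.
label = (
    cell.replace("\\textbf", "")
    .replace("$", "")
    .replace("{", "")
    .replace("}", "")
    .replace("\\pm", "±")[:4]
)
```

Because of the four-character cut, a cell whose mean is 12.34 is displayed as `12.3` and the spread after `±` is dropped from the label.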

### Cumulative regrets in hardness space

To illustrate how the hardness measures relate to cumulative regret and to empirically validate the benchmark, we place the average cumulative regret obtained by each agent in each continuous ergodic MDP at the coordinate given by the diameter and the environmental value norm of that MDP.
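
In the plots produced by `crs_in_hardness_space`, the colour of each label comes from dividing the regret of an agent on an MDP by the largest regret of that same agent, so colours are comparable within a panel but not across panels. A small sketch of that normalisation, with placeholder names and numbers that are invented for illustration only:

```python
# Placeholder regrets for one agent; the keys and numbers are invented.
regrets = {"mdp_a": 50.0, "mdp_b": 200.0, "mdp_c": 100.0}

largest = max(regrets.values())

# Fraction in [0, 1] that would be passed to a colour map such as "Reds".
intensity = {name: value / largest for name, value in regrets.items()}
```

The MDP with the highest regret for the agent always maps to 1.0, the darkest shade of the colour map.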

#### Benchmark continuous communicating
We note that increasing values of the diameter induce higher regret for UCRL2 and QLearning but not for PSRL.
UCRL2's regrets are not significantly influenced by the value norms, contrary to QLearning and PSRL.
Interestingly, higher values of the value norm yield lower regrets for QLearning and higher regrets for PSRL.


In [None]:
mdp_names = list(sorted(set(clear_th(x.__name__) for x in get_all_mdp_classes())))
COLORS = list(matplotlib.colors.TABLEAU_COLORS.keys())
fig, axes = plt.subplots(1, 4, figsize=(4 * 6, 6))  # , sharey=True)
for ii, exp_to_show in enumerate(available_experiments):
    plot_labels_on_benchmarks_hardness_space(
        exp_to_show,
        text_f=lambda x: str(int(x[1][-1]) + 1),
        color_f=lambda x: COLORS[mdp_names.index(clear_th(x[0]))],
        label_f=lambda x: clear_th(x[0]) if "0" in x[1] else None,
        ax=axes[ii],
        multiplicative_factor_xlim=1.05,
        multiplicative_factor_ylim=1.05,
        set_ylabel=ii == 0,
        set_legend=False,
        # xaxis_measure=("num_states", lambda x : x.num_states)
        # xaxis_measure = "suboptimal_gaps"
    )

leg = plt.legend(
    fontsize=22,
    ncol=8,
    loc="center left",
    bbox_to_anchor=(-4.0, 1.11),
    markerscale=1.3,
)
for x in leg.get_lines():
    x.set_linewidth(4)
plt.tight_layout()
plt.show()

In [None]:
crs_in_hardness_space(available_experiments[0])

#### Benchmark continuous ergodic
In this case, increases in the value norm and in the diameter produce increases in the regret in a similar way.
A similar phenomenon can be seen for QLearning.
For PSRL, instead, the ergodic case looks very similar to the communicating one.


In [None]:
crs_in_hardness_space(available_experiments[1])

#### Benchmark episodic communicating
In this scenario, we see that the diameter has a relatively small influence on the cumulative regrets of the agents.
Increases in the environmental value norm induce more significant increases in the regrets, particularly for PSRL.



In [None]:
crs_in_hardness_space(available_experiments[2])

#### Benchmark episodic ergodic
Contrary to the communicating case, here increasing values of the diameter yield more significant increases in the regrets, especially for high values of the environmental value norm.
Similarly to the previous scenario, the environmental value norm drives significant increases in the regrets of the agents, more markedly for QLearning.

In [None]:
crs_in_hardness_space(available_experiments[3])

### Interactive analysis of the benchmark results
In addition to the previous visualizations, $\texttt{Colosseum}$ offers an interactive analysis of the results. The results can be grouped *by MDP*, to analyse the performance of the agents on each MDP in the benchmark, or *by agent*, to investigate how the parameters and hardness of the MDPs influence the final cumulative regrets.

In [None]:
analyze(w=5)
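
The two groupings offered by the interactive tool can be pictured with a few lines of plain Python. The sketch below is only an analogy under stated assumptions: the record layout, the function `group_regrets`, and all names and numbers are hypothetical, and it does not reflect how `analyze` is implemented.

```python
from collections import defaultdict

# Placeholder records (agent, MDP, final cumulative regret); all values are invented.
records = [
    ("agent_1", "mdp_a", 10.0),
    ("agent_1", "mdp_b", 30.0),
    ("agent_2", "mdp_a", 20.0),
    ("agent_2", "mdp_b", 40.0),
]


def group_regrets(records, by):
    """Group the regrets by "agent" or by "mdp"."""
    index = {"agent": 0, "mdp": 1}[by]
    groups = defaultdict(list)
    for record in records:
        groups[record[index]].append(record[2])
    return dict(groups)


by_mdp = group_regrets(records, by="mdp")
by_agent = group_regrets(records, by="agent")
```

Grouping by MDP puts the agents side by side on one environment, while grouping by agent shows how one agent's regret changes across environments.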