## Hyperparameter Tuning {#sec-app-tune}


In [None]:
# Walk up from the working directory to the project root ("CounterfactualTraining.jl").
projectdir = splitpath(pwd()) |>
    ss -> joinpath(ss[1:findfirst(==("CounterfactualTraining.jl"), ss)]...)
cd(projectdir)

using CTExperiments
using CTExperiments.CSV
using CTExperiments.DataFrames
using CTExperiments.StatsBase

using DotEnv
DotEnv.load!()

In [None]:
res_dir = ENV["FINAL_GRID_RESULTS"]

Based on the findings from our initial large grid searches (@sec-app-grid), we tune selected hyperparameters for all datasets: namely, the decision threshold $\tau$ and the strength of the energy regularization $\lambda_{\text{reg}}$. The final hyperparameter choices for each dataset are presented in **ADD TABLE**. Detailed results for each dataset are shown in **ADD FIGURES**. From **ADD TABLE**, we notice that the same decision threshold of $\tau=0.5$ is optimal for all but one dataset. We attribute this to the fact that a low decision threshold results in a higher share of mature counterfactuals and hence more opportunities for the model to learn from examples. This mattered in particular for our real-world tabular datasets and MNIST, which suffered from low levels of maturity at higher decision thresholds. In cases where maturity is not an issue, as for *Moons*, higher decision thresholds lead to better outcomes, which may be because the resulting counterfactuals are more faithful to the model.

Concerning the regularization strength, we find considerable variation across datasets. Most notably, relatively low levels of regularization are optimal for MNIST. We hypothesize that this finding may be attributed to the uniform scaling of all input features (digits). Finally, to increase the proportion of mature counterfactuals for some datasets, we have also investigated the effect of the learning rate $\eta$ for the counterfactual search, but found little effect on the results.
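The link between the decision threshold and maturity can be illustrated with a small sketch. It assumes that a counterfactual counts as mature once its predicted target-class probability reaches $\tau$ within the search budget. The probabilities below are made-up illustrative values, and `share_mature` is a hypothetical helper, not part of `CTExperiments`.

```julia
# Hypothetical target-class probabilities of counterfactuals at the end of the
# search (illustrative values only, not results from the experiments).
target_probs = [0.35, 0.55, 0.62, 0.71, 0.78, 0.86, 0.91, 0.97]

# Assumption: a counterfactual is mature if its target-class probability
# is at least the decision threshold τ.
share_mature(probs, τ) = count(>=(τ), probs) / length(probs)

for τ in (0.5, 0.75, 0.9)
    println("τ = $(τ): share of mature counterfactuals = $(share_mature(target_probs, τ))")
end
```

Under this assumption, lowering $\tau$ can only keep or increase the share of mature counterfactuals, which is consistent with the explanation given for the tabular datasets and MNIST.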

::: {.callout-warning}

## Package Version (Reproducibility)

Tuning was run using `v1.1.3` of `TaijaData`. The follow-up version `v1.1.4` introduced an option to split real-world tabular datasets into train and test sets, ensuring that pre-processing steps like standardization are fit on the training set only. If you rerun the tuning experiments with a version of `TaijaData` higher than `v1.1.3`, then with the default parameters specified in the configuration files you may end up with slightly different results, although we would not expect any changes in terms of qualitative findings. For exact reproducibility, please use `v1.1.3`.

:::
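One way to fix the `TaijaData` version mentioned above is through Julia's package manager. The following is a minimal sketch, assuming that `v1.1.3` is still available in the registry and compatible with the other dependencies of the active project environment. A checked-in `Manifest.toml` that already records this version would serve the same purpose.

```julia
using Pkg

# Install exactly v1.1.3 of TaijaData into the active project environment.
Pkg.add(name="TaijaData", version="1.1.3")

# Optionally pin the package so that later `Pkg.update()` calls leave it alone.
Pkg.pin("TaijaData")
```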

### Key Parameters {#sec-app-tune-key}


In [None]:
grid_dir = joinpath(res_dir, "tune/mlp")
data_dirs = filter(isdir, readdir(grid_dir; join=true))
eval_grids = (p -> EvaluationGrid(joinpath(p,"grid_config.toml"))).(data_dirs)
data_names = basename.(data_dirs)
ce_suffix = "evaluation/results/ce/lambda_energy_reg---decision_threshold_exper---lambda_energy_eval---lambda_energy_reg---decision_threshold_exper/"
logs_suffix = "evaluation/results/logs/objective---decision_threshold---lambda_energy_reg/"

The hyperparameter grid for tuning key parameters is shown in @nte-tune-train. The corresponding evaluation grid used for these experiments is shown in @nte-tune-eval.

::: {#nte-tune-train .callout-note}

## Training Phase


In [None]:
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/grid_config.toml")) 
println(CTExperiments.dict_to_quarto_markdown(dict))

:::

::: {#nte-tune-eval .callout-note}

## Evaluation Phase


In [None]:
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/evaluation/evaluation_grid_config.toml"))
println(CTExperiments.dict_to_quarto_markdown(dict))

:::
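Conceptually, a tuning grid over $\tau$ and $\lambda_{\text{reg}}$ is the Cartesian product of the candidate values for each parameter. The sketch below illustrates this with placeholder values. They are hypothetical and are not the values used in the experiments, which are listed in the configuration shown above.

```julia
# Hypothetical candidate values (placeholders, not the experimental settings).
decision_threshold = [0.5, 0.75, 0.9]
lambda_energy_reg = [0.1, 0.5, 1.0]

# Cartesian product: one (τ, λ_reg) tuple per configuration to train and evaluate.
grid = vec(collect(Iterators.product(decision_threshold, lambda_energy_reg)))

println("Number of configurations: ", length(grid))
for (τ, λ) in grid
    println("decision_threshold = $(τ), lambda_energy_reg = $(λ)")
end
```

With three candidates per parameter this yields nine configurations per dataset, so the cost of tuning grows multiplicatively with every additional parameter or candidate value.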

#### Plausibility


In [None]:
#| output: asis

fig_label_prefix = "tune-plaus"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the plausibility measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)

In [None]:
#| output: asis
 
imgfname = "plausibility_distance_from_target.png"
fig_caption = "Average outcomes for the plausibility measure across key hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(ce_suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption) 
_str = join(include_img_commands, "\n\n")
println(_str)
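For readers unfamiliar with the `output: asis` pattern used in the cell above, the following sketch shows how image markdown with cross-reference labels can be generated programmatically. The function `quarto_img` is a hypothetical stand-in. The actual output of `CTExperiments.get_img_command` may differ, and the dataset names and paths are placeholders.

```julia
# Hypothetical stand-in for the project helper: builds standard Quarto image
# markdown of the form ![caption](path){#label}.
quarto_img(path, label; caption="") = "![$(caption)]($(path)){#$(label)}"

# Placeholder dataset names and image paths (assumptions for illustration).
data_names = ["lin_sep", "moons"]
fig_labels = ["fig-tune-plaus-$(nm)" for nm in data_names]
paths = ["results/$(nm)/plot.png" for nm in data_names]
fig_caption = "Average outcomes."

cmds = [quarto_img(p, l; caption=fig_caption) for (p, l) in zip(paths, fig_labels)]

# Printed inside a cell with `#| output: asis`, each line is rendered as a figure.
println(join(cmds, "\n\n"))
```

Because every figure receives a label derived from its dataset name, the accompanying text can refer to the first and last figure with `@` references, as done in the cells of this section.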

#### Proportion of Mature Counterfactuals


In [None]:
#| output: asis

fig_label_prefix = "tune-mat"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the proportion of mature counterfactuals in each epoch are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)

In [None]:
#| output: asis
 
imgfname = "percent_valid.png"
fig_caption = "Proportion of mature counterfactuals in each epoch."
full_paths = joinpath.(data_dirs, joinpath(logs_suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption) 
_str = join(include_img_commands, "\n\n")
println(_str)