# Grid Searches {#sec-app-grid}


```{julia}
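# Locate the repository root: keep all path components up to and including
# the first one named "CounterfactualTraining.jl", then change into it.
projectdir = splitpath(pwd()) |>
    ss -> joinpath(ss[1:findfirst(==("CounterfactualTraining.jl"), ss)]...)
cd(projectdir)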

using CTExperiments
using CTExperiments.CSV
using CTExperiments.DataFrames
using CTExperiments.StatsBase

using DotEnv
DotEnv.load!()
```

```{julia}
res_dir = ENV["FINAL_GRID_RESULTS"]
```


To assess the hyperparameter sensitivity of our proposed training regime, we ran multiple large grid searches for all of our synthetic datasets. We have grouped these grid searches into three categories: 

1. **Generator Parameters** (@sec-app-grid-gen): Investigates the effect of changing hyperparameters that affect the counterfactual outcomes during the training phase.
2. **Penalty Strengths** (@sec-app-grid-pen): Investigates the effect of changing the penalty strengths in our proposed objective (@eq-obj).
3. **Other Parameters** (@sec-app-grid-train): Investigates the effect of changing other training parameters, including the total number of generated counterfactuals in each epoch.

We begin by summarizing the high-level findings in @sec-app-grid-hl. For each of the categories, @sec-app-grid-gen to @sec-app-grid-train then present all details including the exact parameter grids, average predictive performance outcomes and key evaluation metrics for the generated counterfactuals. 
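
As a minimal sketch of what a grid search enumerates, the snippet below builds the Cartesian product of a small set of hyperparameter values. The parameter names and values are hypothetical placeholders chosen for illustration; the grids actually used are the ones printed in the callouts of the respective sections.

```julia
# Hypothetical grid: names and values are illustrative assumptions only.
grid = (
    generator = ["ecco", "generic", "revise"],
    maxiter = [10, 50],
    decision_threshold = [0.5, 0.9],
)

# One NamedTuple per configuration, i.e. the Cartesian product of all value lists.
configs = [
    NamedTuple{keys(grid)}(vals) for vals in Iterators.product(values(grid)...)
] |> vec

println("Number of configurations: ", length(configs))
```

The number of configurations grows multiplicatively with each additional parameter, which is why we split the searches into separate categories rather than varying everything jointly.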

## Evaluation Details

To measure predictive performance, we compute the accuracy and F1-score for all models on test data (@tbl-acc-gen, @tbl-acc-pen, @tbl-acc-train). With respect to explanatory performance, we report here our findings for the (im)plausibility and cost of counterfactuals at test time. Since the computation of our proposed divergence metric (@eq-impl-div) is memory-intensive, we rely on the distance-based metric for the grid searches. For the counterfactual evaluation, we draw factual samples from the training data for the grid searches to avoid data leakage with respect to our final results reported in the body of the paper. Specifically, we want to avoid choosing our default hyperparameters based on results on the test data. Since we are optimizing for explainability, not predictive performance, we still present test accuracy and F1-scores. 
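
For reference, the two predictive performance measures can be sketched as follows for the binary case. This is a plain-Julia illustration with hypothetical function names and made-up labels; it is not the implementation used to produce our tables.

```julia
# Share of correctly classified samples.
accuracy(y, yhat) = sum(y .== yhat) / length(y)

# Binary F1-score for the class labelled `positive` (an assumed convention).
# Returns NaN if there are no true positives, false positives or false negatives.
function f1_score(y, yhat; positive = 1)
    tp = sum((yhat .== positive) .& (y .== positive))
    fp = sum((yhat .== positive) .& (y .!= positive))
    fn = sum((yhat .!= positive) .& (y .== positive))
    return 2tp / (2tp + fp + fn)
end

# Made-up labels for illustration.
y = [1, 1, 0, 0, 1]
yhat = [1, 0, 0, 1, 1]
println((accuracy = accuracy(y, yhat), f1 = f1_score(y, yhat)))
```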

### Predictive Performance

We find that CT is associated with little to no decrease in average predictive performance for our synthetic datasets: test accuracy and F1-scores decrease by at most ~1 percentage point, but generally much less (@tbl-acc-gen, @tbl-acc-pen, @tbl-acc-train). Variation across hyperparameters is negligible as indicated by small standard deviations for these metrics across the board. 

### Counterfactual Outcomes {#sec-app-grid-hl}

Overall, we find that counterfactual training (CT) achieves its key objectives consistently across all hyperparameter settings and also broadly across datasets: plausibility is improved by up to ~60 percent (%) for the *Circles* data (e.g. @fig-grid-gen_params-plaus-circles), ~25-30% for the *Moons* data (e.g. @fig-grid-gen_params-plaus-moons) and ~10-20% for the *Linearly Separable* data (e.g. @fig-grid-gen_params-plaus-lin_sep). At the same time, the average costs of faithful counterfactuals are reduced in many cases by ~20-25% for *Circles* (e.g. @fig-grid-gen_params-cost-circles) and up to ~50% for *Moons* (e.g. @fig-grid-gen_params-cost-moons). For the *Linearly Separable* data, costs generally increase, although typically by less than 10% (e.g. @fig-grid-gen_params-cost-lin_sep), which reflects a common tradeoff between costs and plausibility [@altmeyer2024faithful]. 
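
To make the percentages above concrete, the sketch below computes a relative change of an average metric against a baseline. It assumes that lower values are better for both the distance-based implausibility measure and cost, so that a negative change is an improvement. The numbers are hypothetical and not taken from our results.

```julia
# Relative change in percent of `treated` with respect to `baseline`.
pct_change(baseline, treated) = 100 * (treated - baseline) / baseline

# Hypothetical averages: implausibility drops from 2.5 (baseline) to 1.0 (CT).
change = pct_change(2.5, 1.0)
println("Relative change: ", change, "%")
```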

We do observe strong sensitivity to certain hyperparameters, with clear and manageable patterns. Concerning generator parameters, we firstly find that using *REVISE* to generate counterfactuals during training typically yields the worst outcomes out of all generators, often leading to a substantial decrease in plausibility. This finding can be attributed to the fact that *REVISE* effectively shifts the task of learning plausible explanations away from the model itself and onto a surrogate VAE. In other words, counterfactuals generated by *REVISE* are less faithful to the model than those generated by *ECCo* and *Generic*, and hence we would expect them to play a less effective and, in fact, potentially detrimental role in our training regime. Secondly, we observe that allowing for a higher number of maximum steps $T$ for the counterfactual search generally yields better outcomes. This is intuitive, because it allows more counterfactuals to reach maturity in any given iteration. Looking in particular at the results for *Linearly Separable*, it seems that higher values for $T$ in combination with higher decision thresholds ($\tau$) yield the best results when using *ECCo*. But depending on the degree of class separability of the underlying data, a high decision threshold can also affect results adversely, as evident from the results for the *Overlapping* data (@fig-grid-gen_params-plaus-over): here we find that CT generally fails to achieve its objective because only a tiny proportion of counterfactuals ever reaches maturity.
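
The interplay between $T$ and $\tau$ can be illustrated with a toy gradient-based search on a logistic classifier. This is a simplified sketch under our own assumptions (a hypothetical linear model, a plain gradient step and a hypothetical step size `η`); it is not an implementation of *ECCo*, *Generic* or *REVISE*. A counterfactual counts as mature here once the predicted target probability reaches $\tau$ within $T$ steps.

```julia
using LinearAlgebra: dot

sigmoid(z) = 1 / (1 + exp(-z))

# Toy search: move x along the gradient of log p(target | x) for a
# hypothetical logistic model p = σ(w'x + b), for at most T steps.
function search_counterfactual(x, w, b; T = 50, τ = 0.9, η = 0.5)
    x = copy(x)  # leave the factual untouched
    for t in 1:T
        p = sigmoid(dot(w, x) + b)
        p >= τ && return (x = x, steps = t - 1, mature = true)
        x .+= η .* (1 - p) .* w  # gradient of log σ(w'x + b) w.r.t. x
    end
    return (x = x, steps = T, mature = sigmoid(dot(w, x) + b) >= τ)
end

w, b = [1.0, 1.0], 0.0
x0 = [-1.0, -1.0]
short = search_counterfactual(x0, w, b; T = 2)
long = search_counterfactual(x0, w, b; T = 200)
println((short = short.mature, long = long.mature))
```

With a small step budget the search stops before the threshold is reached, whereas a larger budget lets the same counterfactual mature. Raising $\tau$ has the opposite effect for a fixed budget, which is consistent with the pattern we observe for the *Overlapping* data.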

Regarding penalty strengths, we find that the strength of the energy regularization, $\lambda_{\text{reg}}$, is a key hyperparameter, while sensitivity with respect to $\lambda_{\text{div}}$ and $\lambda_{\text{adv}}$ is much less evident. Specifically, we observe that not regularizing energy enough or at all typically leads to poor performance in terms of decreased plausibility and increased costs, in particular for *Circles* (@fig-grid-pen-plaus-circles), *Linearly Separable* (@fig-grid-pen-plaus-lin_sep) and *Overlapping* (@fig-grid-pen-plaus-over). High values of $\lambda_{\text{reg}}$ can increase the variability in outcomes, especially when combined with high values for $\lambda_{\text{div}}$ and $\lambda_{\text{adv}}$, but this effect is less pronounced.
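
To illustrate how the three penalty strengths enter training, the sketch below combines already-computed loss components into a single weighted sum. The additive form, the function name and the default values are assumptions made for illustration; the exact objective is the one given in @eq-obj.

```julia
# Weighted sum of loss components (assumed additive form, hypothetical defaults).
function ct_objective(class_loss, div, adv, reg; λ_div = 0.5, λ_adv = 0.5, λ_reg = 0.1)
    return class_loss + λ_div * div + λ_adv * adv + λ_reg * reg
end

# Made-up component values; setting λ_reg = 0 switches off energy regularization.
with_reg = ct_objective(1.0, 2.0, 3.0, 4.0; λ_reg = 0.25)
without_reg = ct_objective(1.0, 2.0, 3.0, 4.0; λ_reg = 0.0)
println((with_reg = with_reg, without_reg = without_reg))
```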

Finally, concerning other hyperparameters, we observe that the effectiveness and stability of CT are positively associated with the number of counterfactuals generated during each training epoch, in particular for *Circles* (@fig-grid-train-plaus-circles) and *Moons* (@fig-grid-train-plaus-moons). We further find that a higher number of training epochs is beneficial as expected, where we tested training models for 50 and 100 epochs. Interestingly, we find that it is not necessary to employ CT during the entire training phase to achieve the desired improvements in explainability: specifically, we have tested training models conventionally during the first half of training before switching to CT after this initial burn-in period. 
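
The burn-in schedule described above can be sketched as a simple per-epoch switch. The helper name `use_ct` and the interpretation of `burnin` as a fraction of the total number of epochs are assumptions made for illustration.

```julia
# Train conventionally for the first `burnin` share of epochs, then switch to CT.
use_ct(epoch, nepochs; burnin = 0.5) = epoch > floor(Int, burnin * nepochs)

nepochs = 100
schedule = [use_ct(e, nepochs) ? :ct : :conventional for e in 1:nepochs]
println("Epochs trained with CT: ", count(==(:ct), schedule))
```

Setting `burnin = 0.0` in this sketch recovers CT throughout the entire training phase.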

## Generator Parameters {#sec-app-grid-gen}


```{julia}
grid_dir = joinpath(res_dir, "gen_params/mlp")
data_dirs = readdir(grid_dir) |> x -> joinpath.(grid_dir, x) |> x -> x[isdir.(x)]
eval_grids = (p -> EvaluationGrid(joinpath(p,"grid_config.toml"))).(data_dirs)
data_names = basename.(data_dirs)
suffix = "evaluation/results/ce/decision_threshold_exper---lambda_energy_exper---maxiter_exper---maxiter---decision_threshold_exper/"
```


The hyperparameter grid with varying generator parameters during training is shown in @nte-gen-params-final-run-train. The corresponding evaluation grid used for these experiments is shown in @nte-gen-params-final-run-eval.

::: {#nte-gen-params-final-run-train .callout-note}

## Training Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/grid_config.toml")) 
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

::: {#nte-gen-params-final-run-eval .callout-note}

## Evaluation Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/evaluation/evaluation_grid_config.toml"))
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

### Predictive Performance

Predictive performance measures for this grid search are shown in @tbl-acc-gen.

::: {#tbl-acc-gen}

::: {.content-hidden unless-format="pdf"}


```{julia}
#| output: asis

df = CTExperiments.aggregate_performance(eval_grids; byvars=["objective"]) 
get_table_inputs(df, "mean";backend=Val(:latex)) |>
    inputs -> tabulate_results(inputs; wrap_table=false)
```


:::

Predictive performance measures by dataset and objective averaged across training-phase parameters (@nte-gen-params-final-run-train) and evaluation-phase parameters (@nte-gen-params-final-run-eval).

:::

### Plausibility


```{julia}
#| output: asis

fig_label_prefix = "grid-gen_params-plaus"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the plausibility measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "plausibility_distance_from_target.png"
fig_caption = "Average outcomes for the plausibility measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption) 
_str = join(include_img_commands, "\n\n")
println(_str)
```


### Cost


```{julia}
#| output: asis

fig_label_prefix = "grid-gen_params-cost"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the cost measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "distance.png"
fig_caption = "Average outcomes for the cost measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption)
_str = join(include_img_commands, "\n\n")
println(_str)
```


## Penalty Strengths {#sec-app-grid-pen}


```{julia}
grid_dir = joinpath(res_dir, "penalties/mlp")
data_dirs = readdir(grid_dir) |> x -> joinpath.(grid_dir, x) |> x -> x[isdir.(x)]
eval_grids = (p -> EvaluationGrid(joinpath(p,"grid_config.toml"))).(data_dirs)
data_names = basename.(data_dirs)
suffix = "evaluation/results/ce/lambda_adversarial---lambda_energy_reg---lambda_energy_diff---lambda_adversarial---lambda_adversarial/"
```


The hyperparameter grid with varying penalty strengths during training is shown in @nte-pen-final-run-train. The corresponding evaluation grid used for these experiments is shown in @nte-pen-final-run-eval.

::: {#nte-pen-final-run-train .callout-note}

## Training Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/grid_config.toml")) 
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

::: {#nte-pen-final-run-eval .callout-note}

## Evaluation Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/evaluation/evaluation_grid_config.toml"))
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

### Predictive Performance

Predictive performance measures for this grid search are shown in @tbl-acc-pen.

::: {#tbl-acc-pen}

::: {.content-hidden unless-format="pdf"}


```{julia}
#| output: asis
df = CTExperiments.aggregate_performance(eval_grids; byvars=["objective"]) 
get_table_inputs(df, "mean";backend=Val(:latex)) |>
    inputs -> tabulate_results(inputs; wrap_table=false)
```


:::

Predictive performance measures by dataset and objective averaged across training-phase parameters (@nte-pen-final-run-train) and evaluation-phase parameters (@nte-pen-final-run-eval).

:::

### Plausibility


```{julia}
#| output: asis

fig_label_prefix = "grid-pen-plaus"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the plausibility measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "plausibility_distance_from_target.png"
fig_caption = "Average outcomes for the plausibility measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption) 
_str = join(include_img_commands, "\n\n")
println(_str)
```


### Cost


```{julia}
#| output: asis

fig_label_prefix = "grid-pen-cost"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the cost measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "distance.png"
fig_caption = "Average outcomes for the cost measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption)
_str = join(include_img_commands, "\n\n")
println(_str)
```


## Other Parameters {#sec-app-grid-train}


```{julia}
grid_dir = joinpath(res_dir, "training_params/mlp")
data_dirs = readdir(grid_dir) |> x -> joinpath.(grid_dir, x) |> x -> x[isdir.(x)]
eval_grids = (p -> EvaluationGrid(joinpath(p,"grid_config.toml"))).(data_dirs)
data_names = basename.(data_dirs)
suffix = "evaluation/results/ce/burnin---nce---nepochs---burnin---burnin/"
```


The hyperparameter grid with other varying training parameters is shown in @nte-train-final-run-train. The corresponding evaluation grid used for these experiments is shown in @nte-train-final-run-eval.

::: {#nte-train-final-run-train .callout-note}

## Training Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/grid_config.toml")) 
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

::: {#nte-train-final-run-eval .callout-note}

## Evaluation Phase


```{julia}
#| output: asis
dict = CTExperiments.from_toml(joinpath(grid_dir, "lin_sep/evaluation/evaluation_grid_config.toml"))
println(CTExperiments.dict_to_quarto_markdown(dict))
```


:::

### Predictive Performance

Predictive performance measures for this grid search are shown in @tbl-acc-train.

::: {#tbl-acc-train}

::: {.content-hidden unless-format="pdf"}


```{julia}
#| output: asis
df = CTExperiments.aggregate_performance(eval_grids; byvars=["objective"]) 
get_table_inputs(df, "mean";backend=Val(:latex)) |>
    inputs -> tabulate_results(inputs; wrap_table=false)
```


:::

Predictive performance measures by dataset and objective averaged across training-phase parameters (@nte-train-final-run-train) and evaluation-phase parameters (@nte-train-final-run-eval).

:::

### Plausibility


```{julia}
#| output: asis

fig_label_prefix = "grid-train-plaus"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the plausibility measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "plausibility_distance_from_target.png"
fig_caption = "Average outcomes for the plausibility measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption) 
_str = join(include_img_commands, "\n\n")
println(_str)
```


### Cost


```{julia}
#| output: asis

fig_label_prefix = "grid-train-cost"
fig_labels = (nm -> "fig-$(fig_label_prefix)-$nm").(data_names)
_str = "The results with respect to the cost measure are shown in @$(fig_labels[1]) to @$(fig_labels[end])."
println(_str)
```

```{julia}
#| output: asis
 
imgfname = "distance.png"
fig_caption = "Average outcomes for the cost measure across hyperparameters."
full_paths = joinpath.(data_dirs, joinpath(suffix,imgfname))
include_img_commands = CTExperiments.get_img_command(data_names, full_paths, fig_labels; fig_caption)
_str = join(include_img_commands, "\n\n")
println(_str)
```