(example-optuna)=

 Example: Model selection with optuna
=========================================

Motivation
----------

We know that model selection and/or hyperparameter optimization (HPO) can
have massive impacts on the prediction quality in regular Machine
Learning. Yet, it seems that model selection and hyperparameter
optimization are  of substantial importance for CATE estimation with
MetaLearners, too, see e.g. [Machlanski et. al](https://arxiv.org/abs/2303.01412>).

However, model selection and HPO for MetaLearners look quite different from what we're used to from e.g. simple supervised learning problems. Concretely,

* In terms of a MetaLearners's option space, there are several levels
  to optimize for:

  1. The MetaLearner architecture, e.g. R-Learner vs DR-Learner
  2. The model to choose per base estimator of said MetaLearner architecture, e.g. ``LogisticRegression`` vs ``LGBMClassifier``
  3. The model hyperparameters per base model

*  On a conceptual level, it's not clear how to measure model quality
   for MetaLearners. As a proxy for the underlying quantity of
   interest one might look into base model performance, the R-Loss of
   the CATE estimates or some more elaborate approaches alluded to by
   [Machlanski et. al](https://arxiv.org/abs/2303.01412).

We think that HPO can be divided into two camps:

* Exploration of (hyperparameter, metric evaluation) pairs where the
  pairs do not influence each other (e.g. grid search, random search)

* Exploration of (hyperparameter, metric evaluation) pairs where the
  pairs do influence each other (e.g. Bayesian optimization,
  evolutionary algorithms); in other words, there is a feedback-loop between
  sample result and sample

In this example, we will illustrate the latter camp based on an
application of [optuna](https://github.com/optuna/optuna) -- a
popular framework for HPO -- in interplay with ``metalearners``.

Installation
------------

In order to use ``optuna``, we first need to install the package.
We can do so either via conda and conda-forge

```console
$ conda install optuna -c conda-forge
```

or via pip and PyPI

```console
$ pip install optuna
```

Usage
-----

### Loading the data

Just like in our {ref}`example on estimating CATEs with a MetaLearner
<example-basic>`, we will first load some experiment data:

In [1]:
import pandas as pd
from pathlib import Path
from git_root import git_root

df = pd.read_csv(git_root("data/learning_mindset.zip"))
outcome_column = "achievement_score"
treatment_column = "intervention"
feature_columns = [
    column for column in df.columns if column not in [outcome_column, treatment_column]
]
categorical_feature_columns = [
    "ethnicity",
    "gender",
    "frst_in_family",
    "school_urbanicity",
    "schoolid",
]
# Note that explicitly setting the dtype of these features to category
# allows both lightgbm as well as shap plots to
# 1. Operate on features which are not of type int, bool or float
# 2. Correctly interpret categoricals with int values to be
#    interpreted as categoricals, as compared to ordinals/numericals.
for categorical_feature_column in categorical_feature_columns:
    df[categorical_feature_column] = df[categorical_feature_column].astype("category")

Now that we've loaded the experiment data, we can split it up into
train and validation data:

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation, w_train, w_validation = train_test_split(
    df[feature_columns], df[outcome_column], df[treatment_column], test_size=0.25
)

### Optimizing base model hyperparameters

Let's say that we want to work with an
{class}`~metalearners.rlearner.RLearner` and ``LightGBM`` estimators
for base models. We will seek two optimize three hyperparameters of
our base models:

* The number of estimators ``n_estimators`` of our outcome model.
* The max depth ``max_depth`` of our outcome model.
* The number of estimators ``n_estimators`` of our treatment effect
  model.

We can mold this ambition into the following simple script creating an
``optuna`` ``study``:

In [3]:
import optuna
from metalearners.rlearner import r_loss
from metalearners.utils import simplify_output
from metalearners import RLearner
from lightgbm import LGBMRegressor, LGBMClassifier


def objective(trial):

    n_estimators_nuisance = trial.suggest_int("n_estimators_nuisance", 5, 250)
    max_depth_nuisance = trial.suggest_int("max_depth_nuisance", 3, 30)
    n_estimators_treatment = trial.suggest_int("n_estimators_treatment", 5, 100)

    rlearner = RLearner(
        nuisance_model_factory=LGBMRegressor,
        nuisance_model_params={
            "n_estimators": n_estimators_nuisance,
            "max_depth": max_depth_nuisance,
            "verbosity": -1,
        },
        propensity_model_factory=LGBMClassifier,
        propensity_model_params={"n_estimators": 5, "verbosity": -1},
        treatment_model_factory=LGBMRegressor,
        treatment_model_params={
            "n_estimators": n_estimators_treatment,
            "verbosity": -1,
        },
        is_classification=False,
        n_variants=2,
    )

    rlearner.fit(X=X_train, y=y_train, w=w_train)

    return rlearner.evaluate(
        X=X_validation,
        y=y_validation,
        w=w_validation,
        is_oos=True,
    )["r_loss_1_vs_0"]


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

[I 2024-06-24 13:22:16,992] A new study created in memory with name: no-name-91d5c937-f1af-49ed-80c9-85ae6b2fef65
[I 2024-06-24 13:22:22,633] Trial 0 finished with value: 0.83053857095663 and parameters: {'n_estimators_nuisance': 142, 'max_depth_nuisance': 29, 'n_estimators_treatment': 8}. Best is trial 0 with value: 0.83053857095663.
[I 2024-06-24 13:22:27,327] Trial 1 finished with value: 0.8279630501887009 and parameters: {'n_estimators_nuisance': 89, 'max_depth_nuisance': 19, 'n_estimators_treatment': 39}. Best is trial 1 with value: 0.8279630501887009.
[I 2024-06-24 13:22:36,426] Trial 2 finished with value: 0.8438524520544455 and parameters: {'n_estimators_nuisance': 223, 'max_depth_nuisance': 28, 'n_estimators_treatment': 48}. Best is trial 1 with value: 0.8279630501887009.
[I 2024-06-24 13:22:42,400] Trial 3 finished with value: 0.8336463393551226 and parameters: {'n_estimators_nuisance': 87, 'max_depth_nuisance': 22, 'n_estimators_treatment': 76}. Best is trial 1 with value: 0

Note that the metric to be optimized is the R-Loss here. We can obtain
it -- among other metrics -- via the
{class}`~metalearners.rlearner.RLearner`'s
{meth}`~metalearners.rlearner.RLearner.evaluate` method. We can see it
evolve as follows across the 100 trials.

Alternatively, if we'd like to optimize a base model in light of its
individual metric -- in this case an RMSE on the observed outcomes for an
the outcome model -- we can easily do that, too:

In [4]:
from sklearn.metrics import root_mean_squared_error


def objective_individual(trial):

    n_estimators_nuisance = trial.suggest_int("n_estimators_nuisance", 5, 250)
    max_depth_nuisance = trial.suggest_int("max_depth_nuisance", 3, 30)

    rlearner = RLearner(
        nuisance_model_factory=LGBMRegressor,
        nuisance_model_params={
            "n_estimators": n_estimators_nuisance,
            "max_depth": max_depth_nuisance,
            "verbosity": -1,
        },
        propensity_model_factory=LGBMClassifier,
        treatment_model_factory=LGBMRegressor,
        is_classification=False,
        n_variants=2,
    )

    rlearner.fit_nuisance(X=X_train, y=y_train, model_kind="outcome_model", model_ord=0)

    outcome_predictions = rlearner.predict_nuisance(
        X=X_validation, model_kind="outcome_model", model_ord=0, is_oos=True
    )

    return root_mean_squared_error(y_validation, outcome_predictions)


study_individual = optuna.create_study(direction="minimize")
study_individual.optimize(objective_individual, n_trials=100)

[I 2024-06-24 13:27:32,573] A new study created in memory with name: no-name-c8e6fb60-ecbd-4bf5-95f8-fdbea711e7d2
[I 2024-06-24 13:27:37,408] Trial 0 finished with value: 0.8557745599990257 and parameters: {'n_estimators_nuisance': 140, 'max_depth_nuisance': 14}. Best is trial 0 with value: 0.8557745599990257.
[I 2024-06-24 13:27:44,610] Trial 1 finished with value: 0.8645415696834704 and parameters: {'n_estimators_nuisance': 225, 'max_depth_nuisance': 10}. Best is trial 0 with value: 0.8557745599990257.
[I 2024-06-24 13:27:51,077] Trial 2 finished with value: 0.8608539811229967 and parameters: {'n_estimators_nuisance': 197, 'max_depth_nuisance': 14}. Best is trial 0 with value: 0.8557745599990257.
[I 2024-06-24 13:27:53,395] Trial 3 finished with value: 0.8463791376788804 and parameters: {'n_estimators_nuisance': 73, 'max_depth_nuisance': 6}. Best is trial 3 with value: 0.8463791376788804.
[I 2024-06-24 13:27:57,908] Trial 4 finished with value: 0.8555935808124866 and parameters: {'n_

### Optimizing over architectures

``optuna``'s flexibility allows for not only the search over classical
hyperparameters of a given estimator but also to iterate over the
choice of base estimator architectures. Pushing it a step further, one
can even optimize over the space of MetaLearner architectures.

In the following example we will attempt to optimize over the
following search space:

1. MetaLearner: R-Learner vs DR-Learner
2. Nuisance model: ``LGBMRegressor`` vs ``Ridge``
3. Hyperparameter: ``n_estimators`` if ``LGBMRegressor`` and ``alpha``
   if ``Ridge``

Note that the choice of the base learner in the second step should be
conditioned on the choice in the first step. In other words, we do not
want to update our belief system on outcome learners for the R-Learner
by observing outcome learner for the DR-Learner. The same idea applies
to the interplay between steps two and three. This conditioning
becomes apparent in the source code below via the underscores,
e.g. ``nuisance_r``, which is only sampled (and thereby updated) if we
are using an R-Learner and and ``nuisance_dr``, which is only sampled
(and thereby updated) if we are using a DR-Learner.

In [5]:
import optuna
from metalearners.utils import metalearner_factory, simplify_output
from sklearn.linear_model import Ridge
from metalearners.rlearner import r_loss


# Arbitrary models for R-Loss
outcome_estimates = (
    LGBMRegressor(verbose=-1).fit(X_train, y_train).predict(X_validation)
)
propensity_scores = (
    LGBMClassifier(verbose=-1).fit(X_train, w_train).predict(X_validation)
)


def objective_overall(trial):

    ### SAMPLING

    # Highest level of granularity: we sample the MetaLearner architecture.
    architecture = trial.suggest_categorical("architecture", ["R", "DR"])

    # We distinguish cases because we do not want a DR-Learner run to influence
    # the optimizing process of R-Learner-related parameters.
    if architecture == "R":

        # Second level of granularity: we sample the nuisance base model.
        # Note that this is conditioned on using the R-Learner.
        nuisance_r = trial.suggest_categorical("nuisance_r", ["LGBMRegressor", "Ridge"])
        nuisance_dr = None
        nuisance_dr_lin_reg_alpha = None
        nuisance_dr_lgbm_n_estimators = None

        if nuisance_r == "LGBMRegressor":

            # Lowest level of granularity: we sample the nuisance base model hyperparameters.
            nuisance_r_lgbm_n_estimators = trial.suggest_int(
                "nuisance_r_lgbm_n_estimators", 5, 250
            )
            nuisance_r_lin_reg_alpha = None

            nuisance_params = {
                "n_estimators": nuisance_r_lgbm_n_estimators,
                "verbose": -1,
            }
        else:
            nuisance_r_lin_reg_alpha = trial.suggest_float(
                "nuisance_r_lin_reg_alpha", 0, 10
            )
            nuisance_r_lgbm_n_estimators = None

            nuisance_params = {"alpha": nuisance_r_lin_reg_alpha}

    else:
        nuisance_dr = trial.suggest_categorical(
            "nuisance_dr", ["LGBMRegressor", "Ridge"]
        )
        nuisance_r = None
        nuisance_r_lin_reg_alpha = None
        nuisance_r_lgbm_n_estimators = None

        if nuisance_dr == "LGBMRegressor":

            # Lowest level of granularity: we sample the nuisance base model hyperparameters.
            nuisance_dr_lgbm_n_estimators = trial.suggest_int(
                "nuisance_dr_lgbm_n_estimators", 5, 250
            )
            nuisance_dr_lin_reg_alpha = None

            nuisance_params = {
                "n_estimators": nuisance_dr_lgbm_n_estimators,
                "verbose": -1,
            }

        else:
            nuisance_dr_lin_reg_alpha = trial.suggest_float(
                "nuisance_dr_lin_reg_alpha", 0, 10
            )
            nuisance_dr_lgbm_n_estimators = None

            nuisance_params = {"alpha": nuisance_dr_lin_reg_alpha}

    ### LEARNING

    _metalearner_factory = metalearner_factory(architecture)
    # We know that only one of them is not None, therefore we can use or.
    nuisance_model_type = nuisance_r or nuisance_dr
    metalearner = _metalearner_factory(
        nuisance_model_factory=(
            LGBMRegressor if nuisance_model_type == "LGBMRegressor" else Ridge
        ),
        nuisance_model_params=nuisance_params,
        propensity_model_factory=LGBMClassifier,
        propensity_model_params={"n_estimators": 5, "max_depth": 5, "verbose": -1},
        treatment_model_factory=LGBMRegressor,
        treatment_model_params={"n_estimators": 5, "max_depth": 5, "verbose": -1},
        is_classification=False,
        n_variants=2,
    )
    metalearner.fit(X_train, y_train, w_train)

    ### EVALUATING

    cate_estimates = simplify_output(metalearner.predict(X_validation, is_oos=True))

    return r_loss(
        cate_estimates=cate_estimates,
        outcome_estimates=outcome_estimates,
        propensity_scores=propensity_scores,
        outcomes=y_validation,
        treatments=w_validation,
    )


study_overall = optuna.create_study(direction="minimize")
study_overall.optimize(objective_overall, n_trials=100)

[I 2024-06-24 13:30:27,597] A new study created in memory with name: no-name-0752a341-284a-424f-9e2e-398ed679066b
[I 2024-06-24 13:30:34,032] Trial 0 finished with value: 0.8365666989350328 and parameters: {'architecture': 'R', 'nuisance_r': 'LGBMRegressor', 'nuisance_r_lgbm_n_estimators': 179}. Best is trial 0 with value: 0.8365666989350328.
[I 2024-06-24 13:30:34,508] Trial 1 finished with value: 0.8369255991554195 and parameters: {'architecture': 'R', 'nuisance_r': 'Ridge', 'nuisance_r_lin_reg_alpha': 4.542475752771536}. Best is trial 0 with value: 0.8365666989350328.
[I 2024-06-24 13:30:39,029] Trial 2 finished with value: 0.8358899088075391 and parameters: {'architecture': 'DR', 'nuisance_dr': 'LGBMRegressor', 'nuisance_dr_lgbm_n_estimators': 54}. Best is trial 2 with value: 0.8358899088075391.
[I 2024-06-24 13:30:43,603] Trial 3 finished with value: 0.8364125332969841 and parameters: {'architecture': 'R', 'nuisance_r': 'LGBMRegressor', 'nuisance_r_lgbm_n_estimators': 114}. Best i