# DoubleML meets FLAML: Comparing AutoML tuning

In this notebook, we explore how to tune learners with [FLAML](https://github.com/microsoft/FLAML), an AutoML library, within the [DoubleML](https://docs.doubleml.org/stable/index.html) framework.

## Data Generation

We create synthetic data using the [make_plr_CCDDHNR2018](https://docs.doubleml.org/stable/api/generated/doubleml.datasets.make_plr_CCDDHNR2018.html) function, which generates data from a partially linear regression model. The generated data set has 1000 observations with 50 covariates, 1 treatment variable `d`, and 1 outcome variable `y`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from doubleml.datasets import make_plr_CCDDHNR2018
import doubleml as dml
from flaml import AutoML

# Generate synthetic data
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=1000, dim_x=50, return_type="DataFrame")
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X43,X44,X45,X46,X47,X48,X49,X50,y,d
0,-0.95878,-0.507249,-1.237299,-0.337038,-0.788422,0.665523,0.678125,0.523511,-0.284777,1.06838,...,-0.143543,0.921138,0.665548,1.054878,0.189905,0.720578,-0.233471,-0.234903,-0.380769,-1.175313
1,-2.087473,-1.130447,0.217536,1.159132,0.127548,0.21289,-0.26966,-0.345525,-0.839233,0.753701,...,-1.894524,-0.906226,-1.606975,-0.064347,-0.114221,-0.440485,-0.636874,-0.587217,-1.186199,-1.249428
2,0.068261,-1.063256,-0.73668,-1.522021,-1.32332,-0.698194,-0.727295,-1.579768,-1.694986,-1.289432,...,-0.88546,-0.930117,0.798309,0.937743,1.833759,2.17664,1.375478,0.615388,0.109074,-0.324468
3,1.747301,1.280251,1.550265,0.94575,0.341581,-0.217725,0.228431,-0.370202,-0.153972,0.265725,...,-0.435378,0.47458,1.246745,0.721851,-0.044084,-0.403426,0.139609,1.077118,2.470187,2.178118
4,-0.193777,0.53578,0.396372,1.069551,1.078034,0.928753,-0.737033,-0.784358,-0.684771,0.346732,...,-0.519718,-1.189992,-0.800305,-0.582279,-0.385156,0.063406,-0.570224,0.270637,-1.056063,-0.723587


## Manual Tuning with FLAML

In this section, we manually tune two [XGBoost](https://xgboost.readthedocs.io/en/stable/) models using FLAML for a [partially linear regression](https://docs.doubleml.org/stable/guide/models.html#partially-linear-regression-model-plr) setup. This means, we implement the tuning with `FLAML` for the nuisance estimation manually. Once the tuning has been completed, we pass the learners to `DoubleML`.
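As a reminder, the partially linear regression model can be written as follows. This is the standard formulation from the DoubleML user guide; the symbols $\theta_0$, $g_0$, $m_0$, $\ell_0$, $\zeta$ and $V$ are introduced here for illustration and are not used elsewhere in this notebook.

$$
\begin{aligned}
Y &= D\theta_0 + g_0(X) + \zeta, & \mathbb{E}[\zeta \mid D, X] &= 0, \\
D &= m_0(X) + V, & \mathbb{E}[V \mid X] &= 0.
\end{aligned}
$$

With the default partialling-out score, the two nuisance functions are $\ell_0(X) = \mathbb{E}[Y \mid X]$, estimated by `ml_l`, and $m_0(X) = \mathbb{E}[D \mid X]$, estimated by `ml_m`. Both are plain regression problems, which is why each of them can be handed to `FLAML` as a `regression` task.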

### Step 1: Initialize and Train the AutoML Models

We use FLAML to automatically tune two separate [XGBoost](https://xgboost.readthedocs.io/en/stable/) models:

• Outcome Model (`ml_l`): This model predicts the outcome variable `y`. We configure FLAML's `AutoML` with a time budget of 120 seconds, XGBoost as the only estimator, and `rmse` as the performance metric.

• Treatment Model (`ml_m`): This model predicts the treatment variable `d`. We use the same settings: a time budget of 120 seconds, XGBoost, and `rmse`.

In [2]:
# Initialize AutoML for outcome model (ml_l): Predict Y based on X
automl_l = AutoML()
settings_l = {
    "time_budget": 120,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_l.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=2, **settings_l)

# Initialize AutoML for treatment model (ml_m): Predict D based on X
automl_m = AutoML()
settings_m = {
    "time_budget": 120,
    "metric": 'rmse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}
automl_m.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=2, **settings_m)


### Step 2: Evaluate the Tuned Models 

We can evaluate the loss reported by `FLAML`. For more details, see the [FLAML documentation](https://microsoft.github.io/FLAML/docs/Getting-Started).

• `rmse_oos_ml_m` represents the out-of-sample RMSE for the treatment model.

• `rmse_oos_ml_l` represents the out-of-sample RMSE for the outcome model.

In [None]:
# Best validation loss (RMSE) found by FLAML for each nuisance model
rmse_oos_ml_m = automl_m.best_loss
rmse_oos_ml_l = automl_l.best_loss
print("rmse_oos_ml_m:", rmse_oos_ml_m)
print("rmse_oos_ml_l:", rmse_oos_ml_l)
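To make the idea of an out-of-sample loss concrete, here is a minimal sketch that compares in-sample and out-of-sample RMSE on toy data. It does not use `FLAML`; the helper `rmse`, the simulated data, and the degree-9 polynomial are assumptions chosen purely for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy data: a noisy nonlinear signal (illustrative, unrelated to the notebook's data)
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=200)

# Hold out the last 50 observations as a test set
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Fit a flexible polynomial on the training part only
coefs = np.polyfit(x_train, y_train, deg=9)

rmse_in = rmse(y_train, np.polyval(coefs, x_train))
rmse_out = rmse(y_test, np.polyval(coefs, x_test))
print("in-sample RMSE:    ", rmse_in)
print("out-of-sample RMSE:", rmse_out)
```

A large gap between the two values would hint at overfitting; the out-of-sample value is the one that is comparable to a validation loss.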


### Step 3: Create and Fit DoubleML Model

We create a [DoubleMLData](https://docs.doubleml.org/stable/guide/data_backend.html) object with the dataset, specifying $y$ as the outcome variable and $d$ as the treatment variable. We then initialize a `DoubleMLPLR` model using the tuned `FLAML` models for both the treatment and outcome components. 

In [None]:
# Create DoubleMLData object from the full data set
obj_dml_data = dml.DoubleMLData(data, "y", "d")

# Initialize DoubleMLPLR with the trained models from flaml
obj_dml_plr_manual_tuned = dml.DoubleMLPLR(obj_dml_data, ml_m=automl_m.model.estimator,
                                           ml_l=automl_l.model.estimator)

# Fit the DoubleMLPLR model
obj_dml_plr_manual_tuned.fit(store_predictions=True)

# Store the summary for the later comparison and print it once
manual_tuned_summary = obj_dml_plr_manual_tuned.summary
print(manual_tuned_summary)


We can also use `DoubleML`'s built-in learner evaluation, which is based on the cross-fitting procedure.

In [None]:
# Evaluate learners using evaluate_learners() (default metric: RMSE for all nuisance components)
learner_rmse = obj_dml_plr_manual_tuned.evaluate_learners()
rmse_dml_ml_l = learner_rmse['ml_l'][0]
rmse_dml_ml_m = learner_rmse['ml_m'][0]

# Print results
print("RMSE evaluated by DoubleML (ml_l):", rmse_dml_ml_l)
print("RMSE evaluated by DoubleML (ml_m):", rmse_dml_ml_m)
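The cross-fitting idea behind this evaluation can be sketched as follows: every observation is predicted by a model that was trained on the other folds, and the RMSE is computed from these out-of-fold predictions. This is a simplified illustration, not `DoubleML`'s implementation; the function `cross_fitted_rmse`, the linear least-squares learner, and the toy data are assumptions.

```python
import numpy as np


def cross_fitted_rmse(X, y, n_folds=5, seed=0):
    """RMSE of out-of-fold predictions from a linear least-squares learner."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    preds = np.empty(n)
    for test_idx in folds:
        # Train on all observations outside the current fold
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Predict only the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        preds[test_idx] = A_test @ beta
    return float(np.sqrt(np.mean((y - preds) ** 2)))


# Toy data with unit-variance noise (illustrative only)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)
print("cross-fitted RMSE:", cross_fitted_rmse(X_toy, y_toy))
```

Because no observation is predicted by a model that has seen it, this number is an out-of-sample measure even though the full data set is used.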

## Comparison of Model Tuning Approaches

Instead of tuning the `FLAML` learners externally, we can also tune them internally. To do so, we define custom classes that integrate `FLAML` with `DoubleML`. Tuning then starts automatically when `DoubleML`'s `fit()` method is called, so the learning tasks no longer have to be specified manually.


### Step 1: Designing Custom FLAML Models for Double Machine Learning

In this section, we define custom classes for integrating FLAML (Fast Lightweight AutoML) with Double Machine Learning (DML). These classes are designed to facilitate automated machine learning model tuning for both regression and classification tasks.


In [42]:
from flaml import AutoML
from sklearn.utils.multiclass import unique_labels

class FlamlRegressorDoubleML:
    _estimator_type = 'regressor'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        # Avoid shadowing the built-in `dict`
        params = self.auto_ml.get_params(deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        self.auto_ml.fit(X, y, task="regression", time_budget=self.time, estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict(self, x):
        preds = self.tuned_model.predict(x)
        return preds
        
class FlamlClassifierDoubleML:
    _estimator_type = 'classifier'

    def __init__(self, time, estimator_list, metric, *args, **kwargs):
        self.auto_ml = AutoML(*args, **kwargs)
        self.time = time
        self.estimator_list = estimator_list
        self.metric = metric

    def set_params(self, **params):
        self.auto_ml.set_params(**params)
        return self

    def get_params(self, deep=True):
        # Avoid shadowing the built-in `dict`
        params = self.auto_ml.get_params(deep)
        params["time"] = self.time
        params["estimator_list"] = self.estimator_list
        params["metric"] = self.metric
        return params

    def fit(self, X, y):
        self.classes_ = unique_labels(y)
        self.auto_ml.fit(X, y, task="classification", time_budget=self.time, estimator_list=self.estimator_list, metric=self.metric, verbose=False)
        self.tuned_model = self.auto_ml.model.estimator
        return self

    def predict_proba(self, x):
        preds = self.tuned_model.predict_proba(x)
        return preds
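The wrapper classes expose `fit`, `predict` (or `predict_proba`), `get_params` and `set_params` because `DoubleML` treats learners like scikit-learn estimators and works on fresh copies of them during cross-fitting. The sketch below illustrates that idea in plain Python. `MeanRegressor` and `clone_from_params` are hypothetical names, and the copy logic is a simplification of what `sklearn.base.clone` does, not its actual implementation.

```python
class MeanRegressor:
    """Toy learner (hypothetical, for illustration only) that predicts a scaled mean."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def get_params(self, deep=True):
        # Everything needed to rebuild an unfitted copy of this learner
        return {"scale": self.scale}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.mean_ = self.scale * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def clone_from_params(estimator):
    """Build a fresh, unfitted copy from the constructor parameters."""
    return type(estimator)(**estimator.get_params())


fitted = MeanRegressor(scale=0.5).fit([[0], [1], [2]], [2.0, 4.0, 6.0])
fresh = clone_from_params(fitted)

print(fitted.predict([[3]]))    # prediction of the fitted learner
print(fresh.get_params())       # same configuration as the original
print(hasattr(fresh, "mean_"))  # the copy carries no fitted state
```

This is why `get_params` in the wrapper classes has to return every constructor argument, including `time`, `estimator_list` and `metric`: a copy built from an incomplete parameter dictionary could not be constructed.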

### Step 2: Using Custom FLAML Models when calling `DoubleML`'s `fit()` Method

We integrate the custom `FLAML`-based learner `FlamlRegressorDoubleML` into the Double Machine Learning (DML) framework. The steps involve defining the `FLAML` regressors, setting up the `DoubleMLPLR` object, and fitting the model.

In [None]:
# Define the FlamlRegressorDoubleML
ml_l = FlamlRegressorDoubleML(time=120, estimator_list=['xgboost'], metric='rmse')
ml_m = FlamlRegressorDoubleML(time=120, estimator_list=['xgboost'], metric='rmse')

# Create DoubleMLPLR object using the new regressors
# (keyword arguments make sure each learner ends up in the intended role)
dml_plr_obj_api_tuned = dml.DoubleMLPLR(obj_dml_data, ml_l=ml_l, ml_m=ml_m)

# Fit the DoubleMLPLR model
dml_plr_obj_api_tuned.fit(store_predictions=True)

# Retrieve the summary for API Tuned Models
api_tuned_summary = dml_plr_obj_api_tuned.summary

# Print the summary
print(api_tuned_summary)

## Comparison to Dummy Models and Untuned AutoML Learners


### Dummy Learners

As a comparison, we can use `sklearn`'s `DummyRegressor` as learners:

• `ml_l_dummy`: A dummy regressor for the outcome model, which predicts the mean value of the outcome.

• `ml_m_dummy`: A dummy regressor for the treatment model, also predicting the mean value.

These dummy models are used to create a `DoubleMLPLR` object, which is then fit to the data. We store the summary of this model to compare it with the other methods.

In [44]:
from sklearn.dummy import DummyRegressor

# Initialize dummy models (they are fit inside DoubleML's cross-fitting)
ml_l_dummy = DummyRegressor(strategy='mean')
ml_m_dummy = DummyRegressor(strategy='mean')

# Create DoubleMLPLR object using dummy regressors
dml_plr_obj_dummy = dml.DoubleMLPLR(obj_dml_data, ml_l=ml_l_dummy, ml_m=ml_m_dummy)
dml_plr_obj_dummy.fit(store_predictions=True)

# Retrieve the summary for dummy models
dummy_summary = dml_plr_obj_dummy.summary
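Why mean-only learners are a useful negative baseline can be sketched with a small simulation. With the partialling-out score, the effect estimate is the ratio of the summed products of the residuals to the summed squared treatment residuals. If both nuisance predictions are just sample means, nothing of the confounding is removed. The example below is a simplified illustration with a single confounder and without cross-fitting; the function `partialling_out` and the data-generating process are assumptions, not part of `DoubleML`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta_true = 0.5

# One confounder x that drives both the treatment d and the outcome y
x = rng.normal(size=n)
d = x + rng.normal(size=n)
y = theta_true * d + x + rng.normal(size=n)


def partialling_out(y, d, l_hat, m_hat):
    """theta = sum((d - m_hat) * (y - l_hat)) / sum((d - m_hat) ** 2)."""
    res_d = d - m_hat
    res_y = y - l_hat
    return float(np.sum(res_d * res_y) / np.sum(res_d ** 2))


# Dummy nuisance predictions: always the sample mean
theta_dummy = partialling_out(y, d, np.full(n, y.mean()), np.full(n, d.mean()))

# Linear nuisance predictions: least-squares fits of y on x and of d on x
A = np.column_stack([np.ones(n), x])
l_hat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
m_hat = A @ np.linalg.lstsq(A, d, rcond=None)[0]
theta_linear = partialling_out(y, d, l_hat, m_hat)

print("theta (dummy nuisance): ", round(theta_dummy, 3))
print("theta (linear nuisance):", round(theta_linear, 3))
```

In this toy setup, the dummy variant reduces to the simple regression slope of `y` on `d`, which approaches 1.0 in large samples instead of the true value 0.5, while the variant with informative nuisance predictions stays close to 0.5.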

### AutoML Untuned Models

We set up AutoML models with a negligible time budget of 0.01 seconds for both the outcome and treatment variables, so that hardly any tuning takes place. This allows us to compare the performance of untuned models against those that have been manually or API-tuned.

In [45]:
# AutoML Untuned
automl_untuned_l = AutoML()
settings = {
    "time_budget": 0.01,
    "metric": 'mse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}

automl_untuned_l.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["y"].values, verbose=0, **settings)

automl_untuned_m = AutoML()
settings = {
    "time_budget": 0.01,
    "metric": 'mse',
    "estimator_list": ['xgboost'],
    "task": 'regression',
}

automl_untuned_m.fit(X_train=data.drop(columns=["y", "d"]).values, y_train=data["d"].values, verbose=0, **settings)

##### DoubleMLPLR with Untuned AutoML Models

Here, we create a `DoubleMLPLR` object using the untuned AutoML models for the outcome and treatment regressions. We then fit the `DoubleMLPLR` model and retrieve the summary of the results. This section allows us to evaluate the performance of the untuned AutoML models in the context of DoubleML.

In [46]:
# Create DoubleMLPLR object using AutoML models
dml_plr_obj_untuned_automl = dml.DoubleMLPLR(obj_dml_data, automl_untuned_l.model.estimator, automl_untuned_m.model.estimator)
untuned_automl_summary = dml_plr_obj_untuned_automl.fit(store_predictions=True).summary

## Summary

We combine the summaries from various models: manually tuned FLAML models, API-tuned FLAML models, untuned AutoML models, and dummy models.

In [47]:
# Combine summaries for comparison
summary = pd.concat([manual_tuned_summary, api_tuned_summary, untuned_automl_summary, dummy_summary],
                    keys=['FLAML Manual Tuned', 'FLAML API Tuned', 'AutoML Untuned', 'Dummy'])
summary.index.names = ['Model Type', 'Metric']

# Print the summary
print(summary)

## Plots

##### Plot Coefficients and 95% Confidence Intervals

This section generates a plot comparing the coefficients and 95% confidence intervals for each model type. The plot helps visualize the differences in the estimated coefficients and their uncertainties.

In [None]:
# Check the structure of the summary DataFrames
print("Manual Tuned Summary:")
print(manual_tuned_summary.head())

print("API Tuned Summary:")
print(api_tuned_summary.head())

print("Untuned Summary:")
print(untuned_automl_summary.head())

print("Dummy Summary:")
print(dummy_summary.head())

# Extract model labels and coefficient values
model_labels = summary.index.get_level_values('Model Type')
coef_values = summary['coef'].values

# Calculate errors
errors = np.full((2, len(coef_values)), np.nan)
errors[0, :] = summary['coef'] - summary['2.5 %']
errors[1, :] = summary['97.5 %'] - summary['coef']

# Plot Coefficients and 95% Confidence Intervals
plt.figure(figsize=(10, 6))
plt.errorbar(model_labels, coef_values, fmt='o', yerr=errors, capsize=5)
plt.axhline(0.5, color='red', linestyle='--')
plt.xlabel('Model')
plt.ylabel('Coefficients and 95%-CI')
plt.title('Comparison of Coefficients and 95% Confidence Intervals')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
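The error bars above are passed to `plt.errorbar` as a `(2, N)` array of distances from each point, not as interval bounds. The following sketch shows how such distances relate to a normal-based 95% confidence interval. The coefficients and standard errors are hypothetical values for illustration only, and the normal approximation is an assumption of this sketch.

```python
from statistics import NormalDist

# Hypothetical estimates and standard errors, for illustration only
coefs = [0.50, 0.62]
std_errs = [0.04, 0.05]

# Two-sided 95% quantile of the standard normal distribution
z = NormalDist().inv_cdf(0.975)

lower = [c - z * s for c, s in zip(coefs, std_errs)]
upper = [c + z * s for c, s in zip(coefs, std_errs)]

# Row 0 holds the downward distances, row 1 the upward distances
errors = [
    [c - lo for c, lo in zip(coefs, lower)],
    [hi - c for c, hi in zip(coefs, upper)],
]

print("z quantile:", round(z, 3))
print("lower bounds:", lower)
print("upper bounds:", upper)
```

For symmetric intervals both rows are identical, but keeping them separate also covers asymmetric intervals.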


##### Compare Metrics for Nuisance Estimation

In this section, we compare the nuisance-estimation metrics of the different models and plot bar charts to visualize the differences in their performance.

In [None]:
# Collect the cross-fitted RMSE of both nuisance learners for each model
dml_objects = {
    "FLAML Manual Tuned": obj_dml_plr_manual_tuned,
    "FLAML API Tuned": dml_plr_obj_api_tuned,
    "AutoML Untuned": dml_plr_obj_untuned_automl,
    "Dummy": dml_plr_obj_dummy
}

scores = {}
for name, dml_obj in dml_objects.items():
    # evaluate_learners() returns one array per learner; its default metric is the RMSE
    learner_rmse = dml_obj.evaluate_learners()
    scores[name] = {key: float(values.mean()) for key, values in learner_rmse.items()}

# Convert the scores dictionary to a DataFrame for plotting
scores_df = pd.DataFrame(scores).T

# Plot the RMSE for ml_l and ml_m separately
for learner in ['ml_l', 'ml_m']:
    scores_df[learner].plot(kind="bar", title=f"RMSE for {learner}")
    plt.ylabel('RMSE')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

### Observations

- **Coefficient Values**: The coefficients for the `FLAML Manual Tuned` and `FLAML API Tuned` models are quite similar, with the API-tuned model having a slightly higher coefficient. Both are lower than the coefficients of the `AutoML Untuned` and `Dummy` models.
- **Untuned AutoML Models**: The `AutoML Untuned` models yield a higher coefficient than the tuned FLAML models. Since the data were generated with a true effect of 0.5, this indicates that the barely tuned learners do not capture the nuisance functions well enough, so the effect is overestimated. The `Dummy` model has the highest coefficient: a learner that always predicts the mean cannot overfit, but it also removes none of the confounding through the covariates.

### Conclusion

- The **FLAML Manual Tuned** and **FLAML API Tuned** models provide similar results with coefficients close to the true value of 0.5, suggesting that both tuning approaches work well.
- The **AutoML Untuned** models yield higher coefficient values, that is, estimates further away from 0.5 than those of the tuned FLAML models.
- The **Dummy** model, having the highest coefficient, shows the largest discrepancy. This reflects the fact that the learner does not learn any meaningful relationship between the features and the outcome or treatment variable.

Overall, the manually tuned and the API-tuned FLAML models align well with expectations, while the untuned and dummy models yield larger coefficients, which underlines the need for properly tuned nuisance learners.