# Python: Sensitivity Analysis

This notebook illustrates the sensitivity analysis tools with the partially linear regression model (PLR). <br>
The DoubleML package implements sensitivity analysis based on [Chernozhukov et al. (2022)](https://www.nber.org/papers/w30302).

In [3]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

import doubleml as dml
from doubleml.datasets import make_confounded_plr_data, fetch_401K

## Simulation Example

For illustration purposes, we will work with generated data. This enables us to set the confounding strength, so that we can correctly assess the quality of e.g. the robustness values.
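
To see why a known confounding strength is useful, the following minimal sketch simulates an unobserved confounder `A` with NumPy only and compares the estimate that omits `A` with the one that includes it. It assumes a simple linear data-generating process, which is not the one used by `make_confounded_plr_data`; the values of `theta`, `gamma_d` and `gamma_y` and the helper `ols_coef_on_first` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
theta = 5.0                    # true treatment effect (hypothetical value)
gamma_d, gamma_y = 0.5, 0.5    # hypothetical confounding strengths

x = rng.normal(size=n)         # observed covariate
a = rng.normal(size=n)         # unobserved confounder
d = 0.5 * x + gamma_d * a + rng.normal(size=n)
y = theta * d + 0.5 * x + gamma_y * a + rng.normal(size=n)


def ols_coef_on_first(y, *regressors):
    """Return the OLS coefficient on the first regressor (intercept included)."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


theta_short = ols_coef_on_first(y, d, x)     # confounder A omitted
theta_long = ols_coef_on_first(y, d, x, a)   # confounder A included

# In this linear setup the omitted-variable bias is
# gamma_y * gamma_d / (gamma_d**2 + 1), since Var(D | X) = gamma_d**2 + 1.
expected_bias = gamma_y * gamma_d / (gamma_d**2 + 1.0)

print(f"estimate without A: {theta_short:.3f}")
print(f"estimate with A:    {theta_long:.3f}")
print(f"expected bias:      {expected_bias:.3f}")
```

Because the confounding is set by hand, the gap between the two estimates can be compared with its theoretical value, which is not possible with real data where `A` is never observed.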

### Data

The data will be generated via `make_confounded_plr_data`, with the confounding values set to `cf_y=0.1` and `cf_d=0.1`.

Both parameters determine the strength of the confounding:

- `cf_y` measures the proportion of residual variance in the outcome explained by confounders
- `cf_d` measures the proportion of residual variance of the Riesz representer generated by confounders. In the PLR,
it has the following representation: $$\text{cf\_d}=\frac{\eta^2_{D\sim A|X}}{1-\eta^2_{D\sim A|X}},$$ where $\eta^2_{D\sim A|X}$ is
the nonparametric $R^2$, which measures the proportion of residual variation of the treatment explained by confounders.
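
Solving the PLR representation above for $\eta^2_{D\sim A|X}$ gives $\eta^2_{D\sim A|X} = \text{cf\_d}/(1+\text{cf\_d})$. The sketch below implements both directions of this conversion; the function names are hypothetical helpers for illustration and are not part of the DoubleML API.

```python
def cf_d_from_eta2(eta2: float) -> float:
    """Map the nonparametric R^2 of D on A given X to cf_d = eta2 / (1 - eta2)."""
    if not 0.0 <= eta2 < 1.0:
        raise ValueError("eta2 must lie in [0, 1)")
    return eta2 / (1.0 - eta2)


def eta2_from_cf_d(cf_d: float) -> float:
    """Invert the relation above: eta2 = cf_d / (1 + cf_d)."""
    if cf_d < 0.0:
        raise ValueError("cf_d must be non-negative")
    return cf_d / (1.0 + cf_d)


# The setting used in this notebook
eta2 = eta2_from_cf_d(0.1)
print(f"cf_d = 0.1 corresponds to eta^2 = {eta2:.4f}")
```

Under this representation, `cf_d=0.1` corresponds to confounders explaining roughly 9% of the residual variation of the treatment.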


### DoubleML Object

### Sensitivity Analysis

## Application 401k

In [4]:
data = fetch_401K(return_type='DataFrame')

# Set up basic model: Specify variables for data-backend
features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'pira', 'hown']

# Initialize DoubleMLData (data-backend of DoubleML)
data_dml = dml.DoubleMLData(data,
                            y_col='net_tfa',
                            d_cols='e401',
                            x_cols=features_base)
print(data_dml)


------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9915 entries, 0 to 9914
Columns: 14 entries, nifa to hown
dtypes: float32(4), int8(10)
memory usage: 329.2 KB



In [6]:
# Random Forest
randomForest = RandomForestRegressor(
    n_estimators=500, max_depth=7, max_features=3, min_samples_leaf=3)
randomForest_class = RandomForestClassifier(
    n_estimators=500, max_depth=5, max_features=4, min_samples_leaf=7)

np.random.seed(42)
dml_plr = dml.DoubleMLPLR(data_dml,
                          ml_l=randomForest,
                          ml_m=randomForest_class,
                          n_folds=5)
dml_plr.fit()

print(dml_plr)


------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_depth=7, max_features=3, min_samples_leaf=3,
                      n_estimators=500)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=4, min_samples_leaf=7,
                       n_estimators=500)
Out-of-sample Performance:
Learner ml_l RMSE: [[53459.62662778]]
Learner ml_m RMSE: [[0.44283111]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
            coef      std err         t         

In [10]:
dml_plr.sensitivity_analysis()

<doubleml.double_ml_plr.DoubleMLPLR at 0x14643dbddd0>

In [11]:
dml_plr.sensitivity_plot()

In [8]:
np.random.seed(42)
dml_irm = dml.DoubleMLIRM(data_dml,
                          ml_g=randomForest,
                          ml_m=randomForest_class,
                          n_folds=5)
dml_irm.fit()

print(dml_irm)


------------------ Data summary      ------------------
Outcome variable: net_tfa
Treatment variable(s): ['e401']
Covariates: ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'pira', 'hown']
Instrument variable(s): None
No. Observations: 9915

------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2

------------------ Machine learner   ------------------
Learner ml_g: RandomForestRegressor(max_depth=7, max_features=3, min_samples_leaf=3,
                      n_estimators=500)
Learner ml_m: RandomForestClassifier(max_depth=5, max_features=4, min_samples_leaf=7,
                       n_estimators=500)
Out-of-sample Performance:
Learner ml_g0 RMSE: [[47356.78511058]]
Learner ml_g1 RMSE: [[63837.15187355]]
Learner ml_m RMSE: [[0.44276213]]

------------------ Resampling        ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: True

------------------ Fit summary       ------------------
             coef   