# DoubleML for Flexible Covariate Adjustment in Regression Discontinuity Designs (RDD)

This notebook demonstrates how to use regression discontinuity designs (RDD) within ``DoubleML``. Our implementation, ``RDFlex``, follows the work of [Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942).

In an RDD, treatment assignment is determined by a continuous running variable ("score", $S$) crossing a known threshold ("cutoff", $c$). We aim to estimate the local average treatment effect
$$\theta_{0} = \mathbb{E}[Y_i(1)-Y_i(0)\mid S = c]$$
at the cutoff. We therefore assume that individuals are not able to manipulate their score in the neighborhood of the cutoff, and that any discontinuity in the outcome at the cutoff is explained solely by the score crossing the cutoff.
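
As a short worked step, and assuming in addition that the conditional means of the potential outcomes are continuous in the score at the cutoff (a standard RDD condition that is not spelled out above), the estimand in the sharp design introduced below can be written as a jump in the regression of the observed outcome on the score:

$$\theta_{0} = \lim_{s \downarrow c}\mathbb{E}[Y_i \mid S_i = s] - \lim_{s \uparrow c}\mathbb{E}[Y_i \mid S_i = s].$$

This is the quantity that the local linear estimators in this notebook approximate from both sides of the cutoff.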

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from statsmodels.nonparametric.kernel_regression import KernelReg

from lightgbm import LGBMRegressor, LGBMClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression

from rdrobust import rdrobust

import doubleml as dml
from doubleml.rdd import RDFlex
from doubleml.rdd.datasets import make_simple_rdd_data

## Sharp RDD

In the sharp design, the treatment assignment is deterministic given the score. Namely, all individuals with a score above the cutoff receive the treatment: $$D_i = \mathbb{I}[S_i > c].$$
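
A minimal sketch of this assignment rule, using a handful of made-up scores and a cutoff of zero (both are illustrative assumptions, not values from the generated data):

```python
import numpy as np

cutoff = 0.0
score = np.array([-0.8, -0.1, 0.0, 0.2, 1.5])  # hypothetical scores

# Sharp design: treatment is a deterministic function of the score.
# The inequality is strict, so a score exactly at the cutoff is untreated.
d = (score > cutoff).astype(float)
print(d)
```

With a continuous score, ties at the cutoff occur with probability zero, so the choice between a strict and a weak inequality only matters for discrete or rounded scores.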

### Generate Data

The function ``make_simple_rdd_data()`` generates data from a fairly standard RDD setting. If we set ``fuzzy = False``, the generated data follows a sharp RDD. We also generate covariates $X$ that can be used to adjust the estimation at a later stage.

In [17]:
np.random.seed(42)

fuzzy = False
data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=fuzzy)

cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]
df = pd.DataFrame(
    np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),
    columns=['y', 'd', 'score'] + cov_names,
)
df.head()

|   | y | d | score | x0 | x1 | x2 |
|---|---|---|---|---|---|---|
| 0 | 2.131533 | 1.0 | 0.496714 | -0.665035 | -0.790864 | 0.27286 |
| 1 | 9.205261 | 0.0 | -0.138264 | 0.412951 | -0.936828 | 0.872424 |
| 2 | 3.337677 | 1.0 | 0.647689 | -0.896057 | 0.082593 | 0.418121 |
| 3 | 7.767124 | 1.0 | 1.52303 | 0.741938 | 0.428174 | 0.603456 |
| 4 | 3.518407 | 0.0 | -0.234153 | -0.3211 | 0.62965 | -0.83977 |


In [18]:
fig = px.scatter(
    x=df['score'],
    y=df['y'],
    color=df['d'].astype(bool),
    labels={
        "x": "Score",   
        "y": "Outcome",
        "color": "Treatment"
    },
    title="Scatter Plot of Outcome vs. Score by Treatment Status"
)

fig.update_layout(
    xaxis_title="Score",
    yaxis_title="Outcome"
)
fig.show()





### Oracle Values and Comparisons

The generated oracle values for the potential outcomes can be used in a kernel regression to get an oracle estimator at the cutoff. 
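
To make the idea of a local linear kernel regression concrete, here is a rough NumPy sketch that evaluates such a fit at a single point. It uses a Gaussian kernel with a fixed, hand-picked bandwidth and simulated individual effects; the function name `local_linear_at`, the bandwidth, and the data are assumptions for illustration and are not how `KernelReg` chooses its bandwidth.

```python
import numpy as np

def local_linear_at(x0, x, y, bandwidth):
    """Local linear estimate of E[y | x = x0] with a Gaussian kernel."""
    u = (np.asarray(x, dtype=float) - x0) / bandwidth
    k = np.exp(-0.5 * u**2)                      # Gaussian kernel weights
    Z = np.column_stack((np.ones_like(u), u))    # intercept and centered slope
    ZtW = Z.T * k
    # Solve the kernel-weighted normal equations; the intercept is the fit at x0.
    intercept, _ = np.linalg.solve(ZtW @ Z, ZtW @ np.asarray(y, dtype=float))
    return intercept

rng = np.random.default_rng(0)
score = rng.normal(size=500)
# Hypothetical individual effects with an average of 1.0 at score = 0.
ite = 1.0 + 0.3 * score + rng.normal(scale=0.2, size=500)

effect_at_cutoff = local_linear_at(0.0, score, ite, bandwidth=0.3)
print(f"Estimated effect at the cutoff: {effect_at_cutoff:.3f}")
```

Because the fit is linear within the kernel window, it reproduces an exactly linear relationship without bias, which is one reason local linear fits are preferred at boundary points such as a cutoff.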

In [23]:
ite = data_dict['oracle_values']['Y1'] - data_dict['oracle_values']['Y0']
score = data_dict['score']

oracle_model = KernelReg(endog=ite, exog=score, reg_type='ll', var_type='c', ckertype='gaussian')

score_grid = np.linspace(-1, 1, 100)
oracle_effects, _ = oracle_model.fit(score_grid)


scatter = go.Scatter(
    x=score,
    y=ite,
    mode='markers',
    name='ITE',
    marker=dict(color='blue')
)
line = go.Scatter(
    x=score_grid,
    y=oracle_effects,
    mode='lines',
    name='Average Effect Estimate',
    line=dict(color='red')
)


fig = go.Figure(data=[scatter, line])
fig.update_layout(
    title='Locally Linear Kernel Regression of ITE on Score',
    xaxis_title='Score',
    yaxis_title='Effect',
    legend=dict(x=0.8, y=0.2)
)

print(f"The oracle LATE is estimated as {oracle_model.fit([0])[0][0]}")

fig.show()

The oracle LATE is estimated as 0.9213381266257252


### RDD with Linear Adjustment

The standard RDD estimator for the sharp design takes the form 

$$\hat{\theta}_{\text{SRD}}(h) = \sum_{i=1}^n w_i(h)(Y_i-X_i^T\hat{\gamma}_h)$$

where $w_i(h)$ are local linear regression weights that depend on the data only through the realizations of the running variable $S_i$, and $h>0$ is a bandwidth. $\hat{\gamma}_h$ is the covariate coefficient vector that minimizes the corresponding kernel-weighted least squares criterion.

The package ``rdrobust`` implements this estimation.
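
The weight representation can be checked numerically. The sketch below builds $w_i(h)$ with a triangular kernel and compares $\sum_i w_i(h)(Y_i - X_i^T\hat{\gamma}_h)$ with the jump coefficient from a joint kernel-weighted least squares fit. It assumes that $\hat{\gamma}_h$ comes from such a joint fit with a common covariate coefficient on both sides of the cutoff; the helper `rd_weights`, the fixed bandwidth, and the simulated data are illustrative and not taken from `rdrobust`.

```python
import numpy as np

def rd_weights(score, cutoff, h):
    """Local linear weights w_i(h); sum(w * y) is the sharp RDD estimate."""
    s = np.asarray(score, dtype=float) - cutoff
    kern = np.maximum(1.0 - np.abs(s) / h, 0.0)   # triangular kernel
    w = np.zeros_like(s)
    for side, sign in ((s > 0, 1.0), (s <= 0, -1.0)):
        Z = np.column_stack((np.ones(side.sum()), s[side]))
        ZtW = Z.T * kern[side]
        # First row of (Z'WZ)^{-1} Z'W maps outcomes to the intercept at the cutoff.
        w[side] = sign * np.linalg.solve(ZtW @ Z, ZtW)[0]
    return w, kern

rng = np.random.default_rng(0)
n, h = 2000, 0.5
s = rng.uniform(-1, 1, size=n)
X = rng.normal(size=(n, 2))
t = (s > 0).astype(float)
y = 1.0 * t + 0.5 * s + X @ np.array([1.0, -2.0]) + rng.normal(scale=0.3, size=n)

w, kern = rd_weights(s, 0.0, h)

# Joint kernel-weighted least squares: intercept, jump, side-specific slopes, covariates.
design = np.column_stack((np.ones(n), t, s, t * s, X))
sqrt_k = np.sqrt(kern)
coef, *_ = np.linalg.lstsq(design * sqrt_k[:, None], y * sqrt_k, rcond=None)
theta_joint, gamma_hat = coef[1], coef[4:]

# Same number via the weight representation with linearly adjusted outcomes.
theta_weights = w @ (y - X @ gamma_hat)
print(theta_joint, theta_weights)
```

The weights sum to one on the treated side and to minus one on the untreated side, so the estimate is the difference of two local linear intercepts at the cutoff.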

In [5]:
rdrobust_linear = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], covs=df[cov_names], c=0.0)
rdrobust_linear

Call: rdrobust
Number of Observations:                  1000
Polynomial Order Est. (p):                  1
Polynomial Order Bias (q):                  2
Kernel:                            Triangular
Bandwidth Selection:                    mserd
Var-Cov Estimator:                         NN

                                Left      Right
------------------------------------------------
Number of Observations           490        510
Number of Unique Obs.            490        510
Number of Effective Obs.         244        263
Bandwidth Estimation           0.653      0.653
Bandwidth Bias                 1.023      1.023
rho (h/b)                      0.638      0.638

Method             Coef.     S.E.   t-stat    P>|t|       95% CI      
-------------------------------------------------------------------------
Conventional       2.798     3.98    0.703   4.821e-01   [-5.003, 10.598]
Robust                 -        -    0.717   4.735e-01   [-5.806, 12.502]




### RDD with Flexible Adjustment

[Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942) propose an estimator that reduces the variance of the above estimator by flexibly adjusting the outcome with ML. For more details, see our User Guide. The estimator here takes the form

$$\hat{\theta}_{\text{RDFlex}}(h) = \sum_{i=1}^n w_i(h)M_i(\eta),\quad M_i(\eta) = Y_i - \eta(X_i),$$


with $\eta(\cdot)$ being a potentially nonlinear adjustment function.
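
The variance reduction can be illustrated with a small Monte Carlo sketch. It assumes an oracle setting in which the nonlinear covariate signal is known and used directly as the adjustment function, with covariates that are independent of the score; a real application would estimate the adjustment with cross-fitted ML learners instead. The helper `rd_weights`, the fixed bandwidth, and the data-generating process are hypothetical.

```python
import numpy as np

def rd_weights(score, h, cutoff=0.0):
    """Local linear weights w_i(h) with a triangular kernel."""
    s = np.asarray(score, dtype=float) - cutoff
    kern = np.maximum(1.0 - np.abs(s) / h, 0.0)
    w = np.zeros_like(s)
    for side, sign in ((s > 0, 1.0), (s <= 0, -1.0)):
        Z = np.column_stack((np.ones(side.sum()), s[side]))
        ZtW = Z.T * kern[side]
        w[side] = sign * np.linalg.solve(ZtW @ Z, ZtW)[0]
    return w

rng = np.random.default_rng(1)
n, h, true_effect = 500, 0.5, 1.0
unadjusted, adjusted = [], []
for _ in range(200):
    s = rng.uniform(-1, 1, size=n)
    x = rng.normal(size=n)
    eta = np.sin(2 * x) + x**2   # nonlinear covariate signal, treated as known here
    y = true_effect * (s > 0) + 0.5 * s + eta + rng.normal(scale=0.5, size=n)
    w = rd_weights(s, h)
    unadjusted.append(w @ y)            # plain local linear RDD estimate
    adjusted.append(w @ (y - eta))      # estimate based on M_i = Y_i - eta(X_i)

print(f"unadjusted: mean {np.mean(unadjusted):.3f}, variance {np.var(unadjusted):.4f}")
print(f"adjusted:   mean {np.mean(adjusted):.3f}, variance {np.var(adjusted):.4f}")
```

Both versions are centered near the true effect because the covariates do not jump at the cutoff, while the adjusted version removes outcome variation that is explained by the covariates.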

We initialize a `DoubleMLData` object using the usual package syntax.

Note: `x_cols` refers to the covariates to be adjusted for, and `s_col` is the score.

In [6]:
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')

In [7]:
ml_g = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)
ml_m = LGBMClassifier(n_estimators=500, learning_rate=0.01, verbose=-1)

rdflex_model = RDFlex(dml_data,
                      ml_g,
                      ml_m,
                      fuzzy=fuzzy,
                      n_folds=5,
                      n_rep=1)
rdflex_model.fit(n_iterations=2)

print(rdflex_model)

Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      3.475     2.231     1.558    1.193e-01  [-0.897, 7.847]
Robust                 -        -     2.021    4.330e-02  [0.185, 12.112]


## Fuzzy RDD

In the fuzzy design, the treatment assignment is still deterministic given the score: $T_i = \mathbb{I}[S_i > c]$.
However, in the neighborhood of the cutoff, some observations do not take up the treatment they were assigned. These "defiers" cause the jump in the treatment probability at the cutoff to be smaller than 1.
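
A small simulation sketch of this mechanism: assignment follows the cutoff rule, but each unit deviates from its assignment with a fixed probability. The deviation probability of 0.2 and the window width of 0.1 are arbitrary choices for illustration and do not reflect how `make_simple_rdd_data()` generates non-compliance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff = 20000, 0.0
score = rng.uniform(-1, 1, size=n)
assigned = score > cutoff                       # T_i: deterministic assignment

# Hypothetical non-compliance: each unit deviates with probability 0.2.
deviates = rng.random(n) < 0.2
d = np.where(deviates, ~assigned, assigned).astype(float)

window = np.abs(score - cutoff) < 0.1           # narrow neighborhood of the cutoff
p_right = d[window & assigned].mean()           # treatment share just above the cutoff
p_left = d[window & ~assigned].mean()           # treatment share just below the cutoff
jump = p_right - p_left
print(f"Jump in treatment probability: {jump:.3f}")
```

In this setup the treatment probability is about 0.8 above and 0.2 below the cutoff, so the jump is roughly 0.6 rather than 1.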

### Generate Data

The function ``make_simple_rdd_data()`` with ``fuzzy = True`` generates basic data for the fuzzy case.

In [24]:
np.random.seed(42)

fuzzy = True
data_dict = make_simple_rdd_data(n_obs=1000, fuzzy=fuzzy)

cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]
df = pd.DataFrame(
    np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),
    columns=['y', 'd', 'score'] + cov_names,
)
df.head()

|   | y | d | score | x0 | x1 | x2 |
|---|---|---|---|---|---|---|
| 0 | 2.131533 | 1.0 | 0.496714 | -0.665035 | -0.790864 | 0.27286 |
| 1 | 10.104291 | 1.0 | -0.138264 | 0.412951 | -0.936828 | 0.872424 |
| 2 | 3.337677 | 1.0 | 0.647689 | -0.896057 | 0.082593 | 0.418121 |
| 3 | 7.767124 | 1.0 | 1.52303 | 0.741938 | 0.428174 | 0.603456 |
| 4 | 4.20974 | 1.0 | -0.234153 | -0.3211 | 0.62965 | -0.83977 |


In [29]:
fig = px.scatter(
    x=df['score'],
    y=df['y'],
    color=df['d'].astype(bool),
    labels={
        "x": "Score",   
        "y": "Outcome",
        "color": "Treatment"
    },
    title="Scatter Plot of Outcome vs. Score by Treatment Status"
)

fig.update_layout(
    xaxis_title="Score",
    yaxis_title="Outcome"
)
fig.show()





### Oracle Values and Comparisons

The generated oracle values for the potential outcomes can be used in a kernel regression to get an oracle estimator at the cutoff. 

Since the fuzzy design identifies the treatment effect only for units that comply with their assignment, we drop defiers for the oracle computation.

In [27]:
complier_mask = ((data_dict["score"] < 0) & (data_dict["D"] == False)) | ((data_dict["score"] > 0) & (data_dict["D"] == True))
ite = data_dict['oracle_values']['Y1'][complier_mask] - data_dict['oracle_values']['Y0'][complier_mask]
score = data_dict['score'][complier_mask]

oracle_model = KernelReg(endog=ite, exog=score, reg_type='ll', var_type='c', ckertype='gaussian')

score_grid = np.linspace(-1, 1, 100)
oracle_effects, _ = oracle_model.fit(score_grid)


scatter = go.Scatter(
    x=score,
    y=ite,
    mode='markers',
    name='ITE',
    marker=dict(color='blue')
)
line = go.Scatter(
    x=score_grid,
    y=oracle_effects,
    mode='lines',
    name='Average Effect Estimate',
    line=dict(color='red')
)


fig = go.Figure(data=[scatter, line])
fig.update_layout(
    title='Locally Linear Kernel Regression of ITE on Score',
    xaxis_title='Score',
    yaxis_title='Effect',
    legend=dict(x=0.8, y=0.2)
)

print(f"The oracle LATE is estimated as {oracle_model.fit([0])[0][0]}")

fig.show()

The oracle LATE is estimated as 0.9087280080883967


### RDD with Linear Adjustment

The standard RDD estimator for the fuzzy design takes the form 

$$\hat{\theta}_{\text{FRD}}(h) = \frac{\hat{\theta}_{\text{SRD}}(h)}{\hat{\theta}_{\text{D}}(h)} = \frac{\sum_{i=1}^n w_i(h)(Y_i-X_i^T\hat{\gamma}_{Y, h})}{\sum_{i=1}^n w_i(h)(D_i-X_i^T\hat{\gamma}_{D, h})}$$
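
The ratio structure can be sketched in NumPy as two sharp RDD estimates, one for the outcome and one for the treatment indicator. For brevity the sketch omits the covariate terms (equivalent to setting both $\hat{\gamma}$ vectors to zero), uses a fixed bandwidth, and relies on simulated data with a homogeneous effect of 1 and 10% non-compliance; the helper `sharp_rd` is hypothetical and is not the `rdrobust` implementation.

```python
import numpy as np

def sharp_rd(outcome, score, h, cutoff=0.0):
    """Difference of right and left local linear intercepts (triangular kernel)."""
    s = np.asarray(score, dtype=float) - cutoff
    outcome = np.asarray(outcome, dtype=float)
    kern = np.maximum(1.0 - np.abs(s) / h, 0.0)
    intercepts = []
    for side in (s > 0, s <= 0):
        Z = np.column_stack((np.ones(side.sum()), s[side]))
        ZtW = Z.T * kern[side]
        intercepts.append(np.linalg.solve(ZtW @ Z, ZtW @ outcome[side])[0])
    return intercepts[0] - intercepts[1]

rng = np.random.default_rng(0)
n = 5000
s = rng.uniform(-1, 1, size=n)
assigned = s > 0
deviates = rng.random(n) < 0.1                   # hypothetical 10% non-compliance
d = np.where(deviates, ~assigned, assigned).astype(float)
y = 1.0 * d + 0.5 * s + rng.normal(scale=0.2, size=n)   # homogeneous effect of 1

theta_y = sharp_rd(y, s, h=0.5)    # jump in the outcome
theta_d = sharp_rd(d, s, h=0.5)    # jump in the treatment probability
theta_frd = theta_y / theta_d
print(f"outcome jump {theta_y:.3f}, treatment jump {theta_d:.3f}, ratio {theta_frd:.3f}")
```

Dividing by the jump in the treatment probability rescales the outcome jump to the units whose treatment status actually changes at the cutoff.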

The package ``rdrobust`` implements this estimation.

In [30]:
rdrobust_linear = rdrobust(y=df['y'], x=df['score'], fuzzy=df['d'], covs=df[cov_names], c=0.0)
rdrobust_linear

Call: rdrobust
Number of Observations:                  1000
Polynomial Order Est. (p):                  1
Polynomial Order Bias (q):                  2
Kernel:                            Triangular
Bandwidth Selection:                    mserd
Var-Cov Estimator:                         NN

                                Left      Right
------------------------------------------------
Number of Observations           490        510
Number of Unique Obs.            490        510
Number of Effective Obs.         244        263
Bandwidth Estimation           0.653      0.653
Bandwidth Bias                 1.023      1.023
rho (h/b)                      0.638      0.638

Method             Coef.     S.E.   t-stat    P>|t|       95% CI      
-------------------------------------------------------------------------
Conventional       2.798     3.98    0.703   4.821e-01   [-5.003, 10.598]
Robust                 -        -    0.717   4.735e-01   [-5.806, 12.502]




### RDD with Flexible Adjustment

[Noack, Olma and Rothe (2024)](https://arxiv.org/abs/2107.07942) propose an estimator that reduces the variance of the above estimator by flexibly adjusting the outcome with ML. For more details, see our User Guide. The estimator here takes the form

$$\hat{\theta}_{\text{RDFlex, FRD}}(h) = \frac{\sum_{i=1}^n w_i(h)(Y_i - \hat{\eta}_Y(X_i))}{\sum_{i=1}^n w_i(h)(D_i - \hat{\eta}_D(X_i))},$$


with $\eta_Y(\cdot), \eta_D(\cdot)$ being potentially nonlinear adjustment functions.

We initialize a `DoubleMLData` object using the usual package syntax.

Note: `x_cols` refers to the covariates to be adjusted for, and `s_col` is the score.

In [None]:
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')

In [None]:
ml_g = LGBMRegressor(n_estimators=500, learning_rate=0.01, verbose=-1)
ml_m = LGBMClassifier(n_estimators=500, learning_rate=0.01, verbose=-1)

rdflex_model = RDFlex(dml_data,
                      ml_g,
                      ml_m,
                      fuzzy=fuzzy,
                      n_folds=5,
                      n_rep=1)
rdflex_model.fit(n_iterations=2)

print(rdflex_model)

Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      3.475     2.231     1.558    1.193e-01  [-0.897, 7.847]
Robust                 -        -     2.021    4.330e-02  [0.185, 12.112]


### Global and Local Learners

All learners have to support the `sample_weight` argument in their `fit` method.
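
The toy sketch below illustrates what this requirement means in practice. It assumes that a "local" learner uses the kernel weights passed as `sample_weight`, while a "global" wrapper accepts the argument but fits without it; this is our reading of the naming and is not spelled out above. The classes `WeightedMeanRegressor` and `IgnoreWeights` and the helper `accepts_sample_weight` are hypothetical stand-ins, not part of `DoubleML` or scikit-learn.

```python
import inspect
import numpy as np

class WeightedMeanRegressor:
    """Toy learner: predicts the (weighted) mean of the training outcomes."""
    def fit(self, X, y, sample_weight=None):
        self.mean_ = np.average(y, weights=sample_weight)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

class IgnoreWeights:
    """Hypothetical wrapper: accepts sample_weight but fits without it."""
    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y, sample_weight=None):
        self.base_estimator.fit(X, y)   # weights are deliberately dropped
        return self

    def predict(self, X):
        return self.base_estimator.predict(X)

def accepts_sample_weight(estimator):
    """Check whether the estimator's fit method has a sample_weight parameter."""
    return "sample_weight" in inspect.signature(estimator.fit).parameters

X = np.zeros((4, 1))
y = np.array([0.0, 0.0, 0.0, 4.0])
kernel_weights = np.array([0.0, 0.0, 1.0, 1.0])   # only the last two points count

local_fit = WeightedMeanRegressor().fit(X, y, sample_weight=kernel_weights)
global_fit = IgnoreWeights(WeightedMeanRegressor()).fit(X, y, sample_weight=kernel_weights)
print(local_fit.predict(X)[0], global_fit.predict(X)[0])
```

The weighted fit uses only the two observations with nonzero weight, while the wrapped fit averages over all four, which is the difference between fitting near the cutoff and fitting on the full sample.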

In [13]:
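# StackingRegressor and StackingClassifier are used below but were not imported above.
from sklearn.ensemble import StackingClassifier, StackingRegressor

from doubleml.utils import GlobalRegressor, GlobalClassifier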

In [14]:
reg_estimators = [
    ('lr local', LinearRegression()),
    ('rf local', RandomForestRegressor()),
    ('lr global', GlobalRegressor(base_estimator=LinearRegression())),
    ('rf global', GlobalRegressor(base_estimator=RandomForestRegressor()))
]

class_estimators = [
    ('lr local', LogisticRegression()),
    ('rf local', RandomForestClassifier()),
    ('lr global', GlobalClassifier(base_estimator=LogisticRegression())),
    ('rf global', GlobalClassifier(base_estimator=RandomForestClassifier()))
]

ml_g = StackingRegressor(
    estimators=reg_estimators,
    final_estimator=RandomForestRegressor(n_estimators=10,
                                          random_state=42)
)

ml_m = StackingClassifier(
    estimators=class_estimators,
    final_estimator=RandomForestClassifier(n_estimators=10,
                                           random_state=42)
)

In [15]:
rdflex_model = RDFlex(dml_data,
                      ml_g,
                      ml_m,
                      fuzzy=fuzzy,
                      n_folds=5,
                      n_rep=1)
rdflex_model.fit(n_iterations=2)

print(rdflex_model)

Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      1.154     1.046     1.103    2.699e-01  [-0.896, 3.205]
Robust                 -        -     1.333    1.825e-01  [-1.000, 5.251]
