# Regression Discontinuity
This notebook illustrates the use of Regression Discontinuity in an empirical study. We analyze the effect of the antipoverty program *Progresa/Oportunidades* on the consumption behavior of families in Mexico in the early 2000s.

The program was intended for families in extreme poverty and included financial incentives for participation in measures that improved the family's health, nutrition and children's education. The effect of this program has been widely studied in the social and economic sciences, and according to the WHO the program was very successful in reducing extreme poverty in Mexico.

Eligibility for the program was determined based on a pre-intervention household poverty-index. Individuals above a certain threshold received the treatment (participation in the program) while individuals below the threshold were excluded and recorded as a control group. All observations above the threshold participated in the program, which makes the analysis fall into the standard (sharp) regression discontinuity design.

First, we need to install and load some packages.

In [None]:
!pip install rdd rdrobust
import pandas as pd
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor
import matplotlib.pyplot as plt
import numpy as np
import patsy
from rdd.rdd import optimal_bandwidth
from rdrobust import rdrobust

We use a dataset assembled by [Calonico et al. (2014)](https://rdpackages.github.io/references/Calonico-Cattaneo-Titiunik_2014_ECMA--Supplemental.pdf) and follow the analysis in [Noack et al. (2023)](https://arxiv.org/pdf/2107.07942.pdf).

First, we open the data and remove any observations that have NaN values.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/progresa.csv",
                 index_col=0)
df = df.dropna()
df.rename(columns={"index": "pov_index"}, inplace=True)
print("Shape of Data:")
print(df.shape)
print("Variable Names:")
print(df.columns)
df.head()

The data set contains 1,944 observations for which full covariate information of 27 variables is available.

We want to measure the local average treatment effect of program participation on four outcome variables. The outcome variables are food and non-food consumption of the recorded families at two points in time, one year and two years after the implementation of the program.

The baseline covariates, recorded prior to program implementation, include the household's size; the household head's age, sex, years of education and employment status; the spouse's age and years of education; the number of children not older than five years and their sex; physical characteristics of the house (whether it has cement floors, a water connection, a water connection inside the house, a bathroom and electricity, as well as its number of rooms); pre-intervention consumption; and an identifier of the urban locality in which the house is located.

The data fit the pattern of a sharp RD design: all individuals below the cut-off index received no intervention, and all individuals above the cut-off were eligible to join the *Progresa* program and thus participated.

## Estimation without Covariates

First, we will perform a very simple RD estimation with a weighted linear regression. We use a triangular kernel, which assigns weights to observations based on their distance from the cutoff point. The weights decrease linearly as the distance from the cutoff point increases.

In [None]:
def triangular_kernel(index, h):
    weights = 1 - np.abs(index) / h
    weights[weights < 0] = 0
    return weights
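
As a quick sanity check, the kernel can be evaluated on a handful of hand-picked distances from the cutoff. The distances and the bandwidth `h=0.1` below are made up for illustration and are not taken from the Progresa data.

```python
import numpy as np


def triangular_kernel(index, h):
    # Same definition as above: weights fall linearly from 1 at the cutoff to 0 at distance h
    weights = 1 - np.abs(index) / h
    weights[weights < 0] = 0
    return weights


# Hypothetical distances from the cutoff (illustrative values only)
toy_index = np.array([-0.2, -0.05, 0.0, 0.05, 0.2])
toy_weights = triangular_kernel(toy_index, h=0.1)
print(toy_weights)
```

An observation exactly at the cutoff receives weight 1, one halfway to the bandwidth receives about 0.5, and anything at or beyond distance `h` receives weight 0 and drops out of the regression.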

The parameter `h` is the bandwidth that controls the range of observations that receive non-zero weights. We use the `optimal_bandwidth` function from the `rdd` package, which implements the *Imbens-Kalyanaraman* method. Another standard approach would be to use the standard deviation of `pov_index`.

In [None]:
h = optimal_bandwidth(X=df.pov_index, Y=df.conspcfood_t1, cut=0)

We use the triangular kernel function to calculate weights for each observation. After that, we can fit two separate linear regressions, one for the treatment group and one for the control group.

In [None]:
weights = triangular_kernel(df.pov_index, h)
model_treated, model_control = LinearRegression(), LinearRegression()
model_treated.fit(y=df.loc[df.pov_index > 0, "conspcfood_t1"].values.reshape(-1, 1),
                  X=df.loc[df.pov_index > 0, "pov_index"].values.reshape(-1, 1),
                  sample_weight=weights[df.pov_index > 0])
model_control.fit(y=df.loc[df.pov_index < 0, "conspcfood_t1"].values.reshape(-1, 1),
                  X=df.loc[df.pov_index < 0, "pov_index"].values.reshape(-1, 1),
                  sample_weight=weights[df.pov_index < 0])

The treatment effect at the cutoff point is estimated as the difference between the predictions of the two models at the cutoff point.

In [None]:
cutoff = 0
treatment_effect = model_treated.predict(np.array([cutoff]).reshape(-1, 1))
treatment_effect -= model_control.predict(np.array([cutoff]).reshape(-1, 1))
treatment_effect[0, 0]
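
To see that the difference of the two boundary predictions recovers a discontinuity, the same recipe can be run on simulated data with a known jump. This is a minimal sketch that uses only NumPy: the data-generating process, the jump size `true_jump`, the bandwidth `h` and the helper `wls_intercept` are all assumptions introduced for illustration, not part of the Progresa analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sharp RD design (all numbers are illustrative assumptions)
n, true_jump, h = 2000, 3.0, 0.5
x = rng.uniform(-1, 1, n)                      # running variable, cutoff at 0
y = 1.0 + 2.0 * x + true_jump * (x > 0) + rng.normal(0, 0.1, n)

# Triangular kernel weights, zero outside the bandwidth
w = np.clip(1 - np.abs(x) / h, 0, None)


def wls_intercept(x_side, y_side, w_side):
    # Weighted least squares of y on (1, x); the intercept is the fitted value at x = 0
    design = np.column_stack([np.ones_like(x_side), x_side])
    sw = np.sqrt(w_side)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y_side * sw, rcond=None)
    return coef[0]


right, left = x > 0, x < 0
jump_hat = wls_intercept(x[right], y[right], w[right]) - wls_intercept(x[left], y[left], w[left])
print(f"estimated jump: {jump_hat:.3f} (true jump: {true_jump})")
```

Because the cutoff is at zero, the intercept of each side's regression is exactly its prediction at the cutoff, so subtracting the two intercepts mirrors the `predict` calls above.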

We estimate that participation in the program reduced food consumption by $22.1$ units in the year following the intervention. The following plot visualizes the two weighted regressions at the cut-off for this outcome variable (food consumption in `t1`). We can clearly see the "jump" at the cut-off, which is our LATE.

In [None]:
plt.scatter(df.pov_index, df.conspcfood_t1, s=10, color='black')  # s controls the size
neg_xval = np.linspace(-0.1, 0, 100)
neg_line = model_control.predict(neg_xval.reshape(-1, 1))
pos_xval = np.linspace(0, 0.1, 100)
pos_line = model_treated.predict(pos_xval.reshape(-1, 1))
plt.plot(neg_xval, neg_line, linewidth=3, label="Control Regression")
plt.plot(pos_xval, pos_line, linewidth=3, label="Treated Regression")
plt.axvline(x=0, color='red', linestyle='--', label="Cut-Off")
plt.legend()
plt.xlim(-0.1, 0.1)
plt.ylim(250, 350)
plt.show()

We can repeat the estimation using the `rdrobust` package, which yields an estimate as well as a confidence band calculated according to the formulas presented in the book. We look at all four targets.

In [None]:
# Count repeated values of the running variable (potential mass points)
df.pov_index.duplicated().sum()

In [None]:
result = []
for outcome in ["conspcfood_t1", "conspcnonfood_t1", "conspcfood_t2", "conspcnonfood_t2"]:
    rdd_result = rdrobust(x=df.pov_index, y=df[outcome], rho=1, masspoints="off")
    result.append([rdd_result.coef.iloc[0].values[0], rdd_result.se.iloc[2].values[0]])
res_dataframe = pd.DataFrame(result, columns=["LATE", "s.e."],
                             index=["Food T_1", "Non-Food T_1", "Food T_2", "Non-Food T_2"])
res_dataframe

While the effects in the first year after the intervention are negative, we observe significant positive effects in the second year after an individual or household was accepted into the *Progresa* program. This is in accordance with previous analyses of this dataset. One possible explanation is that households in the program have more money and can thus afford more. This was the desired effect of the program: to combat hunger and extreme poverty.
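
A table of estimates and standard errors can be turned into a rough significance check with a normal approximation. The sketch below uses made-up numbers, not the notebook's results, and pairs a point estimate with a standard error in the same way as the `LATE` and `s.e.` columns above. The intervals that `rdrobust` prints itself are centred on a bias-corrected estimate and need not coincide with this approximation.

```python
import math

import pandas as pd

# Hypothetical estimates, NOT the values produced by the notebook
toy = pd.DataFrame({"LATE": [-20.0, 40.0], "s.e.": [20.0, 16.0]},
                   index=["Outcome A", "Outcome B"])

z_crit = 1.959964  # 97.5% quantile of the standard normal distribution
toy["z"] = toy["LATE"] / toy["s.e."]
toy["ci_low"] = toy["LATE"] - z_crit * toy["s.e."]
toy["ci_high"] = toy["LATE"] + z_crit * toy["s.e."]
# Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|))
toy["p_value"] = toy["z"].apply(lambda z: math.erfc(abs(z) / math.sqrt(2)))
print(toy.round(3))
```

In this toy table only "Outcome B" has an interval that excludes zero, which is the pattern one would look for when calling an effect significant at the 5% level.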

## Estimation with Covariates

Identifying and estimating the average treatment effect at the cutoff requires no covariate information beyond the running variable. Nevertheless, in many applications additional covariates are collected, and these can be exploited to improve the efficiency of the estimates.

The standard approach is simply to include the covariates as additional regressors in the weighted least squares regression.

In [None]:
model_treated, model_control = LinearRegression(), LinearRegression()
model_treated.fit(y=df.loc[df.pov_index > 0, "conspcfood_t1"].values.reshape(-1, 1),
                  X=df.loc[df.pov_index > 0, ["pov_index", "hhownhouse", "headage", "heademp", "headeduc"]],
                  sample_weight=weights[df.pov_index > 0])
model_control.fit(y=df.loc[df.pov_index < 0, "conspcfood_t1"].values.reshape(-1, 1),
                  X=df.loc[df.pov_index < 0, ["pov_index", "hhownhouse", "headage", "heademp", "headeduc"]],
                  sample_weight=weights[df.pov_index < 0])
pred_t = model_treated.predict(pd.DataFrame({"pov_index": cutoff,
                                             "hhownhouse": np.average(df.loc[df.pov_index > 0, "hhownhouse"],
                                                                      weights=weights[df.pov_index > 0]),
                                             "headage": np.average(df.loc[df.pov_index > 0, "headage"],
                                                                   weights=weights[df.pov_index > 0]),
                                             "heademp": np.average(df.loc[df.pov_index > 0, "heademp"],
                                                                   weights=weights[df.pov_index > 0]),
                                             "headeduc": np.average(df.loc[df.pov_index > 0, "headeduc"],
                                                                    weights=weights[df.pov_index > 0])},
                                            index=[0]))
pred_c = model_control.predict(pd.DataFrame({"pov_index": cutoff,
                                             "hhownhouse": np.average(df.loc[df.pov_index < 0, "hhownhouse"],
                                                                      weights=weights[df.pov_index < 0]),
                                             "headage": np.average(df.loc[df.pov_index < 0, "headage"],
                                                                   weights=weights[df.pov_index < 0]),
                                             "heademp": np.average(df.loc[df.pov_index < 0, "heademp"],
                                                                   weights=weights[df.pov_index < 0]),
                                             "headeduc": np.average(df.loc[df.pov_index < 0, "headeduc"],
                                                                    weights=weights[df.pov_index < 0])},
                                            index=[0]))
treatment_effect = pred_t - pred_c
treatment_effect[0][0]
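
The prediction rows in the cell above repeat the same kernel-weighted average for every covariate. One way to shorten this is a small helper, sketched here on toy data; the function name `weighted_means_at_cutoff` and all values are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd


def weighted_means_at_cutoff(data, covariates, weights, cutoff=0.0, running="pov_index"):
    # One-row frame: running variable fixed at the cutoff, covariates at their kernel-weighted means
    means = {c: np.average(data[c], weights=weights) for c in covariates}
    return pd.DataFrame({running: cutoff, **means}, index=[0])


# Toy data with made-up values, only to show the mechanics
toy = pd.DataFrame({"pov_index": [0.01, 0.05, 0.09],
                    "headage": [30.0, 40.0, 50.0],
                    "headeduc": [6.0, 9.0, 3.0]})
toy_weights = np.array([0.9, 0.5, 0.1])
row = weighted_means_at_cutoff(toy, ["headage", "headeduc"], toy_weights)
print(row)
```

The returned frame keeps the running variable first and the covariates in the order given, so it can be passed to `predict` as long as that order matches the columns used when fitting the model.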

Including these selected covariates does not have a significant impact on the LATE estimate. Again, we can also use `rdrobust` to repeat the estimation with all other outcomes.


In [None]:
res_dataframe

In [None]:
result = []
for outcome in ["conspcfood_t1", "conspcnonfood_t1", "conspcfood_t2", "conspcnonfood_t2"]:
    rdd_result = rdrobust(x=df.pov_index, y=df[outcome], rho=1, masspoints="off",
                          covs=df.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 18, 21]])
    result.append([rdd_result.coef.iloc[0].values[0], rdd_result.se.iloc[2].values[0]])
res_dataframe_adj = pd.DataFrame(result, columns=["LATE", "s.e."],
                                 index=["Food T_1", "Non-Food T_1", "Food T_2", "Non-Food T_2"])
res_dataframe_adj["% reduction"] = (res_dataframe_adj["s.e."] - res_dataframe["s.e."]) * 100 / res_dataframe["s.e."]
res_dataframe_adj

Overall, adjusting for only a few covariates has not changed the estimated coefficients much compared to the results without covariates. However, including covariates does reduce the standard errors of the estimates.
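
Note the sign convention of the `% reduction` column: it is computed as the relative change of the adjusted standard error with respect to the unadjusted one, so a negative entry means the standard error became smaller. A short sketch with hypothetical standard errors (not the notebook's results) makes this explicit.

```python
import pandas as pd

# Hypothetical standard errors (illustrative only)
se_unadjusted = pd.Series({"Outcome A": 20.0, "Outcome B": 16.0})
se_adjusted = pd.Series({"Outcome A": 17.0, "Outcome B": 16.8})

# Same formula as the "% reduction" column: relative change in percent
pct_change = (se_adjusted - se_unadjusted) * 100 / se_unadjusted
print(pct_change)
```

Here "Outcome A" shows a change of about -15, meaning a 15% smaller standard error, while the positive value for "Outcome B" would indicate that the adjustment made the estimate less precise.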

## Estimation using ML

As discussed in the book, including many covariates in RDD estimation can be beneficial for multiple reasons:
1. **Efficiency and power improvements**: As in randomized control trials, using covariates can increase efficiency and improve power.
2. **Auxiliary information**: In RDD the score determines the treatment assignment and measurement errors in the running variable can distort the results. Additional covariates can be exploited to overcome these issues or to deal with missing data problems.
3. **Treatment effect heterogeneity**: Covariates can be used to define subgroups in which the treatment effects differ.
4. **Other parameters of interest and extrapolation**: As the identified treatment effect in RDD is local at the cutoff, additional covariates might help for extrapolation of the treatment effects or identify other causal parameters.

However, including a large number of covariates also comes with additional challenges, such as variable selection, non-linearities or interactions between covariates. Modern ML methods are well suited to overcoming these.

There are multiple ways to implement the estimators presented in the book; we will closely follow the analysis of [Noack et al. (2023)](https://arxiv.org/pdf/2107.07942.pdf). We set up the running variable and outcome as above. The baseline covariates will be all the other variables in the data.

In [None]:
# Running Variable and Outcome
investigated_outcome = "conspcfood_t1"
df_ml = df.rename(columns={"pov_index": "X", investigated_outcome: "Y"})

# Baseline covariates including consumption
b_covs = df_ml.columns[[0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 18, 21]]

# Fixed effects for localities
i_fe = pd.get_dummies(df_ml['clus'], drop_first=True)

# Flexible covariates: baseline covariates and all their pairwise interactions
f_covs = patsy.dmatrix('~ (' + ' + '.join(b_covs) + ')**2', data=df_ml, return_type='dataframe')

# Dropping the intercept column that is automatically added by patsy
f_covs = f_covs.iloc[:, 1:]

Z_lasso = pd.concat([i_fe, f_covs], axis=1)
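
In a `patsy` formula, `(a + b + c)**2` expands to an intercept, the main effects and all pairwise interactions. The sketch below builds the same non-intercept columns by hand with pandas on a toy frame, assuming all covariates are numeric; the column names and values are made up. For $p$ numeric covariates this gives $p + p(p-1)/2$ columns, so 153 for the 17 baseline covariates selected above.

```python
from itertools import combinations

import pandas as pd

# Toy covariates with made-up values
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0], "c": [4.0, 5.0, 6.0]})

# Main effects plus all pairwise products, named like patsy's "a:b" interaction terms
flex = toy.copy()
for left, right in combinations(toy.columns, 2):
    flex[f"{left}:{right}"] = toy[left] * toy[right]

print(list(flex.columns))
```

The number of columns grows quadratically with the number of covariates, which is why a penalized method such as the Lasso is used for the flexible specification.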

We will use the package `rdrobust` for the RD estimation. Before starting the DML procedure, we have to estimate a bandwidth to restrict the samples in the first stage estimation.

In [None]:
h_fs = rdrobust(y=df_ml.Y, x=df_ml.X, masspoints="off").bws.values[1, 0]

The next chunk sets up the cross-fitting and estimates the function $\eta(Z)$, which we will use to adjust $Y$ for the second stage. We use Random Forest, a Boosting implementation, Linear Regression and Lasso with both a baseline and a flexible covariate structure.

In [None]:
def first_stage(df_ml, b_covs, h_fs, Z_lasso, Kf=5, random_seed=123):
    '''
    df_ml: dataframe
    b_covs: which columns of the data frame will be used as baseline covariates
    h_fs: bandwidth around discontinuity for training points
    Z_lasso: dataframe with extra baseline covariates in high-dimensional specification
    '''
    # Set up the cross-fitting
    n = df_ml.shape[0]
    # Matrix to store eta predictions
    eta_fit = np.empty((n, 5))

    # Create vector of observations to be considered in the first stage model
    weights = np.abs(df_ml.X) < h_fs

    for train, test in KFold(shuffle=True, n_splits=Kf, random_state=random_seed).split(df_ml.X, df_ml.Y):

        df_train = df_ml.iloc[train]
        treated_train = (df_train.X > 0) & (weights.iloc[train] > 0)
        control_train = (df_train.X < 0) & (weights.iloc[train] > 0)
        data_treated = df_train[treated_train]
        data_control = df_train[control_train]

        data_fold = df_ml.iloc[test]

        rf1 = RandomForestRegressor(max_features=4, n_estimators=1000, random_state=random_seed)
        rf1.fit(data_treated[b_covs], data_treated.Y)
        rf0 = RandomForestRegressor(max_features=4, n_estimators=1000, random_state=random_seed)
        rf0.fit(data_control[b_covs], data_control.Y)
        eta_fit[test, 0] = (rf1.predict(data_fold[b_covs]) + rf0.predict(data_fold[b_covs])) / 2

        lgbm1 = LGBMRegressor(verbosity=-1, random_state=random_seed)
        lgbm1.fit(data_treated[b_covs], data_treated.Y)
        lgbm0 = LGBMRegressor(verbosity=-1, random_state=random_seed)
        lgbm0.fit(data_control[b_covs], data_control.Y)
        eta_fit[test, 1] = (lgbm1.predict(data_fold[b_covs]) + lgbm0.predict(data_fold[b_covs])) / 2

        lm1 = LinearRegression()
        lm1.fit(data_treated[b_covs], data_treated.Y)
        lm0 = LinearRegression()
        lm0.fit(y=data_control.Y, X=data_control[b_covs])
        eta_fit[test, 2] = (lm1.predict(data_fold[b_covs]) + lm0.predict(data_fold[b_covs])) / 2

        las_base1 = LassoCV(random_state=random_seed)
        las_base1.fit(data_treated[b_covs], data_treated.Y)
        las_base0 = LassoCV(random_state=random_seed)
        las_base0.fit(data_control[b_covs], data_control.Y)
        eta_fit[test, 3] = (las_base1.predict(data_fold[b_covs]) + las_base0.predict(data_fold[b_covs])) / 2

        X_flex_treated = pd.concat([Z_lasso.loc[data_treated.index], data_treated[b_covs]], axis=1)
        X_flex_control = pd.concat([Z_lasso.loc[data_control.index], data_control[b_covs]], axis=1)
        X_flex_fold = pd.concat([Z_lasso.loc[data_fold.index], data_fold[b_covs]], axis=1)
        X_flex_treated.columns = X_flex_treated.columns.astype(str)
        X_flex_control.columns = X_flex_control.columns.astype(str)
        X_flex_fold.columns = X_flex_fold.columns.astype(str)
        las_flex1 = LassoCV(random_state=random_seed)
        las_flex1.fit(X_flex_treated, data_treated.Y)
        las_flex0 = LassoCV(random_state=random_seed)
        las_flex0.fit(X_flex_control, data_control.Y)
        eta_fit[test, 4] = (las_flex1.predict(X_flex_fold) + las_flex0.predict(X_flex_fold)) / 2

    return eta_fit

In [None]:
eta_fit = first_stage(df_ml, b_covs, h_fs, Z_lasso, random_seed=0)
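
The cross-fitting logic inside `first_stage` can be easier to follow in a stripped-down form. The sketch below uses only NumPy, a single linear learner and synthetic data; the data-generating process, the bandwidth `h_fs = 0.5` and the helpers `fit_ols` and `predict_ols` are assumptions made for illustration. As in the function above, models are trained on one side of the cutoff within the bandwidth, and every held-out row receives the average of the treated-side and control-side predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (all values are illustrative assumptions)
n, h_fs, n_folds = 1000, 0.5, 5
x = rng.uniform(-1, 1, n)                       # running variable, cutoff at 0
z = rng.normal(size=(n, 2))                     # two covariates
y = 1.0 + 2.0 * (x > 0) + z @ np.array([3.0, -1.0]) + rng.normal(0, 0.5, n)


def fit_ols(features, target):
    # Ordinary least squares with an intercept; returns the coefficient vector
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_ols(coef, features):
    return coef[0] + features @ coef[1:]


in_window = np.abs(x) < h_fs                    # only train near the cutoff
eta_hat = np.full(n, np.nan)
folds = np.array_split(rng.permutation(n), n_folds)
for test_idx in folds:
    train = np.ones(n, dtype=bool)
    train[test_idx] = False                     # hold out the current fold
    treated = train & in_window & (x > 0)
    control = train & in_window & (x < 0)
    coef1 = fit_ols(z[treated], y[treated])
    coef0 = fit_ols(z[control], y[control])
    # Held-out prediction: average of the treated-side and control-side fits
    eta_hat[test_idx] = (predict_ols(coef1, z[test_idx]) + predict_ols(coef0, z[test_idx])) / 2

print(np.isnan(eta_hat).sum(), "rows without a prediction")
```

Each observation is predicted by models that never saw it during training, which is the point of cross-fitting.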

With the estimated $\hat{\eta}(Z)$ we can adjust $Y$ for the covariates and run the RD estimation again as the second stage.

In [None]:
methods = ["Random Forest", "Gradient Boosting", "Linear Regression",
           "Lasso Baseline", "Lasso Flexible"]


def second_stage(df_ml, eta_fit, methods):
    adj_results = []
    for i in range(len(methods)):
        M_Y = df_ml.Y - eta_fit[:, i]
        rd_call = rdrobust(y=M_Y, x=df_ml.X, masspoints="off")
        adj_results.append([rd_call.coef.iloc[0].values[0],
                            rd_call.se.iloc[2].values[0]])
    return adj_results

In [None]:
adj_frame = pd.DataFrame(second_stage(df_ml, eta_fit, methods),
                         columns=["LATE", "s.e."],
                         index=methods)
adj_frame
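
Why subtracting $\hat{\eta}(Z)$ helps can be illustrated on simulated data. In the sketch below the adjustment function is taken to be known (an oracle), the covariate is independent of the running variable, and a plain local linear fit stands in for `rdrobust`. All of these are simplifying assumptions, and the helper `local_linear_rd` is hypothetical. The jump is preserved by the adjustment, while the noise around the fitted lines shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sharp RD with one strongly predictive covariate (illustrative assumptions)
n, true_jump, h = 4000, 2.0, 0.5
x = rng.uniform(-1, 1, n)
z = rng.normal(size=n)
y = 1.0 + 0.5 * x + true_jump * (x > 0) + 3.0 * z + rng.normal(0, 0.5, n)

eta = 3.0 * z          # oracle adjustment function, assumed known here for simplicity
m_y = y - eta          # adjusted outcome, as in second_stage


def local_linear_rd(outcome, running, bandwidth):
    # Triangular-kernel local linear fit on each side; returns jump and weighted residual spread
    w = np.clip(1 - np.abs(running) / bandwidth, 0, None)
    fitted_at_cutoff, sq_resid, w_sum = [], 0.0, 0.0
    for side in (running > 0, running < 0):
        design = np.column_stack([np.ones(side.sum()), running[side]])
        sw = np.sqrt(w[side])
        coef, *_ = np.linalg.lstsq(design * sw[:, None], outcome[side] * sw, rcond=None)
        resid = outcome[side] - design @ coef
        sq_resid += np.sum(w[side] * resid ** 2)
        w_sum += w[side].sum()
        fitted_at_cutoff.append(coef[0])
    return fitted_at_cutoff[0] - fitted_at_cutoff[1], np.sqrt(sq_resid / w_sum)


jump_raw, spread_raw = local_linear_rd(y, x, h)
jump_adj, spread_adj = local_linear_rd(m_y, x, h)
print(f"raw:      jump={jump_raw:.2f}, residual spread={spread_raw:.2f}")
print(f"adjusted: jump={jump_adj:.2f}, residual spread={spread_adj:.2f}")
```

In the notebook itself the adjustment function is estimated with cross-fitting rather than known, and inference is left to `rdrobust`.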

Finally, we run a small simulation study with only $R=20$ repetitions to show the variance-reducing effect of including ML-based estimators for the covariates. The next block can take up to ten minutes to run.

In [None]:
from joblib import Parallel, delayed


def exp(i):
    eta_fit = first_stage(df_ml, b_covs, h_fs, Z_lasso, random_seed=i)
    return np.array(second_stage(df_ml, eta_fit, methods))


results = Parallel(n_jobs=-1, verbose=3)(delayed(exp)(i) for i in range(20))

estimates = [results[i][:, 0] for i in range(20)]
std_err = [results[i][:, 1] for i in range(20)]

We aggregate the median of the estimates and the mean of the standard errors, and also calculate the mean reduction in standard error compared to the "no covariates" estimation. We see that including covariates can reduce the standard error of the estimate by around 15-20%.

In [None]:
med_est = np.median(np.array(estimates), axis=0)
mean_se = np.array(std_err).mean(axis=0)
adj_frame = pd.DataFrame(np.c_[med_est, mean_se], index=methods, columns=["LATE", "s.e."])
adj_frame["% reduction"] = (adj_frame["s.e."] - res_dataframe.loc["Food T_1", "s.e."]) * 100
adj_frame["% reduction"] /= res_dataframe.loc["Food T_1", "s.e."]
adj_frame.loc["Linear Adjusted (no cross-fit)"] = res_dataframe_adj.loc["Food T_1"]
adj_frame

## We now repeat the exercise for the other outcomes (excluding the simulation).

Non-Food Consumption (Year 1)

In [None]:
# Running Variable and Outcome
investigated_outcome = "conspcnonfood_t1"
df_ml = df.rename(columns={"pov_index": "X", investigated_outcome: "Y"})
h_fs = rdrobust(y=df_ml.Y, x=df_ml.X, masspoints="off").bws.values[1, 0]

eta_fit = first_stage(df_ml, b_covs, h_fs, Z_lasso)

adj_frame = pd.DataFrame(second_stage(df_ml, eta_fit, methods),
                         columns=["LATE", "s.e."],
                         index=methods)
adj_frame

Food Consumption (Year 2)

In [None]:
# Running Variable and Outcome
investigated_outcome = "conspcfood_t2"
df_ml = df.rename(columns={"pov_index": "X", investigated_outcome: "Y"})
h_fs = rdrobust(y=df_ml.Y, x=df_ml.X, masspoints="off").bws.values[1, 0]

eta_fit = first_stage(df_ml, b_covs, h_fs, Z_lasso)

adj_frame = pd.DataFrame(second_stage(df_ml, eta_fit, methods),
                         columns=["LATE", "s.e."],
                         index=methods)
adj_frame

Non-Food Consumption (Year 2)

In [None]:
# Running Variable and Outcome
investigated_outcome = "conspcnonfood_t2"
df_ml = df.rename(columns={"pov_index": "X", investigated_outcome: "Y"})

h_fs = rdrobust(y=df_ml.Y, x=df_ml.X, masspoints="off").bws.values[1, 0]

eta_fit = first_stage(df_ml, b_covs, h_fs, Z_lasso)

adj_frame = pd.DataFrame(second_stage(df_ml, eta_fit, methods),
                         columns=["LATE", "s.e."],
                         index=methods)
adj_frame