# Python: Difference-in-Differences

In this example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate the average treatment effect on the treated (ATT) under the conditional parallel trend assumption. The estimation is based on [Chang (2020)](https://doi.org/10.1093/ectj/utaa001), [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) and [Zimmert et al. (2018)](https://arxiv.org/abs/1809.01643).

In this example, we will adopt the notation of [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003).

Throughout this example, both the treatment and the time variable $t\in\{0,1\}$ are binary.
Let $D_i\in\{0,1\}$ denote the treatment status of unit $i$ at time $t=1$ (at time $t=0$ all units are not treated) and let $Y_{it}$ be the outcome of interest of unit $i$ at time $t$.
Using the potential outcome notation, we can write $Y_{it}(d)$ for the potential outcome of unit $i$ at time $t$ and treatment status $d$. Further, let $X_i$ denote a vector of pre-treatment covariates.
In these difference-in-differences settings, [Abadie (2005)](https://doi.org/10.1111/0034-6527.00321) showed that the ATTE

$$\theta = \mathbb{E}[Y_{i1}(1)- Y_{i1}(0)|D_i=1]$$

is identified when panel data are available or under stationarity assumptions for repeated cross-sections. Further, the basic assumptions are 

- **Parallel Trends:** We have $\mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=1] = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0)|X_i, D_i=0]\quad a.s.$

- **Overlap:** For some $\epsilon > 0$, $P(D_i=1) > \epsilon$ and $P(D_i=1|X_i) \le 1-\epsilon$ a.s.

For a detailed explanation of the assumptions see e.g. [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) or [Zimmert et al. (2018)](https://arxiv.org/abs/1809.01643).
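To make the identification argument concrete, the following is a minimal sketch on simulated data in which parallel trends hold unconditionally (no covariates). The data generating process, the value `theta_true` and all variable names are assumptions made for this illustration only; they are not taken from the cited papers or from `DoubleML`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
theta_true = 2.0  # hypothetical ATT, chosen only for this illustration

d = rng.binomial(1, 0.5, size=n)            # treatment group indicator D_i
unit_effect = rng.normal(size=n) + 1.0 * d  # treated units may differ in levels
trend = 0.5                                 # common time trend, so parallel trends hold

y0 = unit_effect + rng.normal(size=n)                           # period 0, nobody is treated
y1 = unit_effect + trend + theta_true * d + rng.normal(size=n)  # period 1

# Difference-in-differences: compare outcome changes of treated and control units
delta_y = y1 - y0
did_estimate = delta_y[d == 1].mean() - delta_y[d == 0].mean()
print(f"DiD estimate: {did_estimate:.3f} (true ATT: {theta_true})")
```

The level difference between the groups cancels when differencing over time, and the common trend cancels when differencing between groups, so the estimate should be close to `theta_true`.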


## Panel Data (Repeated Outcomes)

First, we consider two-period panel data, where we observe i.i.d. data $W_i = (Y_{i0}, Y_{i1}, D_i, X_i)$.

### Data

We will use the implemented data generating process `make_did_SZ2020` to generate data according to the simulation in [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) (Section 4.1). 

In this example, we will use `dgp_type=4`, which corresponds to the misspecified settings in [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) (other data generating processes are also available via the `dgp_type` parameter). In all settings the true ATTE is zero.

To specify a corresponding `DoubleMLData` object, we have to specify a single outcome `y`. For panel data, the outcome is the difference between the two periods,

$$\Delta Y_i = Y_{i1}- Y_{i0}.$$

This difference will then be defined as outcome in our `DoubleMLData` object. The data generating process `make_did_SZ2020` already specifies the outcome `y` accordingly.
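If the raw data come in long format instead, the differenced outcome has to be built by hand. The following sketch assumes a balanced two-period panel stored in three hypothetical arrays (`unit_id`, `period`, `y_long`); these names and the toy values are illustrative and not part of `make_did_SZ2020`.

```python
import numpy as np

# Hypothetical long-format panel: one row per (unit, period) observation
unit_id = np.array([2, 0, 1, 0, 2, 1])
period = np.array([1, 0, 1, 1, 0, 0])
y_long = np.array([2.5, 1.0, 2.0, 1.5, 0.5, 2.0])

# Sort by unit first and by period second (the last key in lexsort is the primary key)
order = np.lexsort((period, unit_id))

# Balanced panel assumed: exactly two rows per unit, so reshape to (n_units, 2)
y_wide = y_long[order].reshape(-1, 2)  # column 0: t=0, column 1: t=1

# Differenced outcome Delta Y_i = Y_i1 - Y_i0, one value per unit
delta_y = y_wide[:, 1] - y_wide[:, 0]
print(delta_y)
```

The resulting `delta_y` has one entry per unit and would play the role of `y` when constructing the data object.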

# Notes
- Description of dgp: 
https://docs.doubleml.org/stable/api/generated/doubleml.datasets.make_did_SZ2020.html#doubleml.datasets.make_did_SZ2020

Which simulations to do?
1. TWFE of DiD (todo)
2. Run the doubly robust estimator of Sant'Anna and Zhao with logistic and linear estimation (done)
    - How do I get the results from the paper? (done)
3. Run with just one method (linear or logistic) (todo)
4. Run the doubly robust estimator of Sant'Anna and Zhao with deep learning and linear estimation (done)
5. Run deep learning only (todo)
    - Is it possible to run just one model?

In [None]:
import numpy as np
from doubleml import DoubleMLData
from doubleml.datasets import make_did_SZ2020

np.random.seed(42)
n_obs = 1000
x, y, d = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=4,
    cross_sectional_data=False,
    return_type="array",
)  # dgp_type can range from 1 to 6
dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d)
print(dml_data)

### ATTE Estimation

To estimate the ATTE with panel data, we will use the `DoubleMLDID` class. 

As for all `DoubleML` classes, we have to specify learners, which have to be initialized first.
Here, we will just rely on a tree-based method.

The learner `ml_g` is used to fit conditional expectations of the outcome $\mathbb{E}[\Delta Y_i|D_i=0, X_i]$, whereas the learner `ml_m` will be used to estimate the propensity score $P(D_i=1|X_i)$.
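To see how the two nuisance functions enter the estimate, here is a minimal sketch of a doubly robust ATT estimate with normalized weights, in the spirit of Sant'Anna and Zhao (2020). It plugs in the true ("oracle") nuisance functions of a simulated design instead of fitted learners, and it skips cross-fitting, so it should not be read as the internal implementation of `DoubleMLDID`. The design and all names (`m_x`, `g0_x`, `theta_true`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
theta_true = 1.5  # hypothetical ATT for this illustration

x = rng.normal(size=n)
m_x = 1.0 / (1.0 + np.exp(-0.5 * x))  # true propensity score P(D=1|X)
d = rng.binomial(1, m_x)              # treatment depends on the covariate
g0_x = 1.0 + x                        # true E[Delta Y | D=0, X]
delta_y = g0_x + theta_true * d + rng.normal(size=n)

# Normalized weights: treated units, and controls reweighted by the propensity odds
w_treated = d / d.mean()
odds = m_x * (1 - d) / (1 - m_x)
w_control = odds / odds.mean()

# Doubly robust estimate: weighted mean of the outcome-regression residuals
att_dr = np.mean((w_treated - w_control) * (delta_y - g0_x))
print(f"Doubly robust ATT estimate: {att_dr:.3f}")
```

In practice `g0_x` would be replaced by predictions from `ml_g` and `m_x` by predictions from `ml_m`; the estimate stays consistent if either of the two is correctly specified.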

In [None]:
from lightgbm import LGBMClassifier, LGBMRegressor

n_estimators = 30
ml_g = LGBMRegressor(n_estimators=n_estimators)  # outcome regression
ml_m = LGBMClassifier(n_estimators=n_estimators)  # propensity

# linear model trials

In [None]:
import lightgbm as lgb

# not exactly the same as in the paper, but similar
# results are more consistent than with the sklearn library
ml_m = lgb.LGBMClassifier(
    objective="binary",
    metric="binary_logloss",
    n_estimators=n_estimators,
)
ml_g = lgb.LGBMRegressor(
    objective="regression",
    metric="mse",
    n_estimators=n_estimators,
)

In [None]:
from sklearn.linear_model import LinearRegression, LogisticRegression

ml_g = LinearRegression()  # as in the paper, estimators not needed
ml_m = LogisticRegression()  # as in the paper, estimators not needed

# main model that worked!

In [None]:
# important main model

import numpy as np
from doubleml import DoubleMLData, DoubleMLDID
from doubleml.datasets import make_did_SZ2020
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_obs = 1000
n_rep = 200
ATTE = 0.0  # The true value of the ATTE

# Storage for estimates
ATTE_estimates = np.full(n_rep, np.nan)
coverage = np.full(n_rep, np.nan)
ci_length = np.full(n_rep, np.nan)
asymptotic_variance = np.full(n_rep, np.nan)

# Model definitions
poly_features = PolynomialFeatures(degree=2, include_bias=False)
linear_model = Pipeline([("poly", poly_features), ("linear", LinearRegression())])
logistic_model = LogisticRegression(solver="liblinear")

for i_rep in range(n_rep):
    if i_rep % int(n_rep / 10) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")

    # Generate data
    x, y, d = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=4,
        cross_sectional_data=False,
        return_type="array",
    )
    dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d)

    # Define DoubleML model
    dml_did = DoubleMLDID(dml_data, ml_g=linear_model, ml_m=logistic_model, n_folds=5)
    dml_did.fit()

    # Store results
    ATTE_estimates[i_rep] = dml_did.coef.squeeze()
    confint = dml_did.confint(level=0.95)
    coverage[i_rep] = (confint["2.5 %"].iloc[0] <= ATTE) & (
        confint["97.5 %"].iloc[0] >= ATTE
    )
    ci_length[i_rep] = confint["97.5 %"].iloc[0] - confint["2.5 %"].iloc[0]
    # Extract standard error from the summary
    summary_df = dml_did.summary
    std_err = summary_df.loc["d", "std err"]
    asymptotic_variance[i_rep] = std_err**2

# Calculate metrics
avg_bias = np.mean(ATTE_estimates - ATTE)
med_bias = np.median(ATTE_estimates - ATTE)
rmse = np.sqrt(np.mean((ATTE_estimates - ATTE) ** 2))
avg_asymptotic_variance = np.mean(asymptotic_variance)
coverage_probability = np.mean(coverage)
avg_ci_length = np.mean(ci_length)

# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")
print(f"Asy. V: {avg_asymptotic_variance}")
print(f"Cover: {coverage_probability}")
print(f"CIL: {avg_ci_length}")

In [None]:
dml_did.summary

# Two way fixed effects approach

In [None]:
import numpy as np
import pandas as pd
from doubleml.datasets import make_did_SZ2020
from sklearn.linear_model import LinearRegression

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_obs = 1000
n_rep = 200
ATTE = 0.0  # Adjust this to reflect the true treatment effect

# Storage for estimates
ATTE_estimates = np.full(n_rep, np.nan)
coverage = np.full(n_rep, np.nan)
ci_length = np.full(n_rep, np.nan)
asymptotic_variance = np.full(n_rep, np.nan)

biases = []

for i_rep in range(n_rep):
    if i_rep % int(n_rep / 10) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")

    # Generate data
    x, y, d = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=1,
        cross_sectional_data=False,
        return_type="array",
    )
    dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d)

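    # Fit a linear regression of the differenced outcome on covariates and treatment
    model = LinearRegression()
    model.fit(np.column_stack((x, d)), y)

    # Extract the treatment effect estimate (coefficient of the treatment
    # indicator, which is the last column of the design matrix)
    ATTE_estimates[i_rep] = model.coef_[-1]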

    # Calculate and store the bias
    bias = model.coef_[-1] - ATTE
    biases.append(bias)

    # Store other results as needed

# Calculate metrics
avg_bias = np.mean(biases)
med_bias = np.median(biases)
rmse = np.sqrt(np.mean(np.square(biases)))

# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from doubleml.datasets import make_did_SZ2020

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_obs = 1000
n_rep = 200
ATTE = 0.0  # Adjust this to reflect the true treatment effect

# Storage for estimates
ATTE_estimates = np.full(n_rep, np.nan)
coverage = np.full(n_rep, np.nan)
ci_length = np.full(n_rep, np.nan)
asymptotic_variance = np.full(n_rep, np.nan)

biases = []

for i_rep in range(n_rep):
    if i_rep % int(n_rep / 10) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")

    # Generate data
    x, y, d = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=1,
        cross_sectional_data=False,
        return_type="array",
    )

    # Convert to DataFrame
    df = pd.DataFrame(x, columns=[f"X{i+1}" for i in range(x.shape[1])])
    df["y"] = y
    df["d"] = d
    df["time"] = np.random.randint(
        2,
        size=len(df),
    )  # Example time indicator (replace with your time indicator)

    # Fit TWFE model using statsmodels
    formula = "y ~ d + time + " + " + ".join([f"X{i+1}" for i in range(x.shape[1])])
    model = smf.ols(formula=formula, data=df).fit()

    # Extract the treatment effect estimate (coefficient of the treatment variable)
    tau_fe = model.params["d"]
    ATTE_estimates[i_rep] = tau_fe

    # Calculate and store the bias
    bias = tau_fe - ATTE
    biases.append(bias)

    # Confidence intervals
    ci = model.conf_int().loc["d"]
    ci_length[i_rep] = ci[1] - ci[0]
    coverage[i_rep] = ci[0] <= ATTE <= ci[1]
    asymptotic_variance[i_rep] = model.bse["d"] ** 2

# Calculate metrics
avg_bias = np.mean(biases)
med_bias = np.median(biases)
rmse = np.sqrt(np.mean(np.square(biases)))

# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")

## Work in progress: closest to being correct
Problem: DiD works too well

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from doubleml.datasets import make_did_SZ2020

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_obs = 1000
n_rep = 200
ATTE = 0.0  # Adjust this to reflect the true treatment effect

# Storage for estimates
ATTE_estimates = np.full(n_rep, np.nan)
coverage = np.full(n_rep, np.nan)
ci_length = np.full(n_rep, np.nan)
asymptotic_variance = np.full(n_rep, np.nan)

biases = []

for i_rep in range(n_rep):
    if i_rep % int(n_rep / 10) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")

    # Generate data
    x, y, d = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=1,
        cross_sectional_data=False,
        return_type="array",
    )

    # Convert to DataFrame
    df = pd.DataFrame(x, columns=[f"X{i+1}" for i in range(x.shape[1])])
    df["y"] = y
    df["d"] = d
    df["time"] = np.random.randint(
        2,
        size=len(df),
    )  # Example time indicator (replace with your time indicator)

    # Fit TWFE model using statsmodels
    formula = "y ~ d + time + d:time + " + " + ".join(
        [f"X{i+1}" for i in range(x.shape[1])],
    )
    model = smf.ols(formula=formula, data=df).fit()

    # Extract the treatment effect estimate (coefficient of the treatment variable)
    tau_fe = model.params["d"]
    ATTE_estimates[i_rep] = tau_fe

    # Calculate and store the bias
    bias = tau_fe - ATTE
    biases.append(bias)

    # Confidence intervals
    ci = model.conf_int().loc["d"]
    ci_length[i_rep] = ci[1] - ci[0]
    coverage[i_rep] = ci[0] <= ATTE <= ci[1]
    asymptotic_variance[i_rep] = model.bse["d"] ** 2

# Calculate metrics
avg_bias = np.mean(biases)
med_bias = np.median(biases)
rmse = np.sqrt(np.mean(np.square(biases)))
mean_coverage = np.mean(coverage)
mean_ci_length = np.mean(ci_length)
mean_asymptotic_variance = np.mean(asymptotic_variance)
# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")
print(f"Coverage:{mean_coverage}")
print(f"mean_ci_length:{mean_ci_length}")
print(f"mean_asymptotic_variance:{mean_asymptotic_variance}")

Experimenting here

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from doubleml.datasets import make_did_SZ2020

# Set seed for reproducibility
np.random.seed(42)

# Parameters
n_obs = 1000
n_rep = 200
ATTE = 0.0  # Adjust this to reflect the true treatment effect

# Storage for estimates
ATTE_estimates = np.full(n_rep, np.nan)
coverage = np.full(n_rep, np.nan)
ci_length = np.full(n_rep, np.nan)
asymptotic_variance = np.full(n_rep, np.nan)

biases = []

for i_rep in range(n_rep):
    if i_rep % int(n_rep / 10) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")

    # Generate data
    data = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=1,
        cross_sectional_data=False,
        return_type="DataFrame",
    )

    # The generator already returns a DataFrame (columns Z1..Z4, y, d),
    # so use it directly instead of stale arrays from an earlier cell
    df = data.copy()
    df["time"] = np.random.randint(
        2,
        size=len(df),
    )  # Example time indicator (replace with your time indicator)

    # Fit TWFE model using statsmodels
    formula = "y ~ d"
    model = smf.ols(formula=formula, data=df).fit()

    # Extract the treatment effect estimate (coefficient of the treatment variable)
    tau_fe = model.params["d"]
    ATTE_estimates[i_rep] = tau_fe

    # Calculate and store the bias
    bias = tau_fe - ATTE
    biases.append(bias)

    # Confidence intervals
    ci = model.conf_int().loc["d"]
    ci_length[i_rep] = ci[1] - ci[0]
    coverage[i_rep] = ci[0] <= ATTE <= ci[1]
    asymptotic_variance[i_rep] = model.bse["d"] ** 2

# Calculate metrics
avg_bias = np.mean(biases)
med_bias = np.median(biases)
rmse = np.sqrt(np.mean(np.square(biases)))
mean_coverage = np.mean(coverage)
mean_ci_length = np.mean(ci_length)
mean_asymptotic_variance = np.mean(asymptotic_variance)
# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")
print(f"Coverage:{mean_coverage}")
print(f"mean_ci_length:{mean_ci_length}")
print(f"mean_asymptotic_variance:{mean_asymptotic_variance}")

In [None]:
data = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=1,
    cross_sectional_data=False,
    return_type="DataFrame",
)
# data
formula = "y ~ d + Z1 + Z2 + Z3 + Z4"
model = smf.ols(formula=formula, data=data).fit()

# Extract the treatment effect estimate (coefficient of the treatment variable)
model.summary()

In [None]:
import doubleml as dml
import numpy as np
from doubleml.datasets import make_did_SZ2020
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

np.random.seed(42)
ml_g = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=5)
ml_m = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=5)
data = make_did_SZ2020(n_obs=500, return_type="DataFrame")
obj_dml_data = dml.DoubleMLData(data, "y", "d")
dml_did_obj = dml.DoubleMLDID(obj_dml_data, ml_g, ml_m)
dml_did_obj.fit().summary

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_pa = pd.DataFrame(ATTE_estimates, columns=["Estimate"])
g = sns.kdeplot(df_pa, fill=True)
plt.show()

In [None]:
data

In [None]:
data = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=4,
    cross_sectional_data=False,
    return_type="DataFrame",
)

In [None]:
# check for heterogeneity
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import f_oneway

# Visual Inspection: Pairplot for visualizing relationships between variables
sns.pairplot(data)
plt.show()

# Descriptive Statistics: Mean comparison for each group (treated vs. control)
mean_grouped = data.groupby("d").mean()
print("Mean comparison between groups:")
print(mean_grouped)

# Statistical Tests: ANOVA for testing differences in 'y' across treatment groups
group0_y = data[data["d"] == 0]["y"]
group1_y = data[data["d"] == 1]["y"]
f_stat, p_value = f_oneway(group0_y, group1_y)
print("\nANOVA F-statistic:", f_stat)
print("ANOVA p-value:", p_value)

In [None]:
import numpy as np


def f_reg(W):
    # W has shape (n_samples, 4): index columns, not rows
    return 210 + 27.4 * W[:, 0] + 13.7 * np.sum(W[:, 1:], axis=1)


def f_ps(W):
    return 0.75 * (-W[:, 0] + 0.5 * W[:, 1] - 0.25 * W[:, 2] - 0.1 * W[:, 3])


def p_function(Z):
    return np.exp(f_ps(Z)) / (1 + np.exp(f_ps(Z)))


def generate_data(DGP, n_samples, U):
    X = np.random.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=n_samples)
    Z = np.column_stack(
        [
            np.exp(0.5 * X[:, 0]),
            10 + X[:, 1] / (1 + np.exp(X[:, 0])),
            (0.6 + X[:, 0] * X[:, 2] / 25) ** 3,
            (20 + X[:, 1] + X[:, 3]) ** 2,
        ],
    )

    if DGP == 1:
        p = p_function(Z)
        D = (p >= U).astype(int)
        noise_Y0 = np.random.normal(size=(n_samples,))
        noise_Y1 = np.random.normal(size=(n_samples,))
        Y0 = f_reg(Z) + noise_Y0 + np.random.normal(scale=0.1, size=(n_samples,))
        Y1 = 2 * f_reg(Z) + noise_Y1 + np.random.normal(scale=0.1, size=(n_samples,))
        return X, Y0, Y1, D
    elif DGP == 2:
        p = p_function(X)
        D = (p >= U).astype(int)
        noise_Y0 = np.random.normal(size=(n_samples,))
        noise_Y1 = np.random.normal(size=(n_samples,))
        Y0 = f_reg(Z) + noise_Y0 + np.random.normal(scale=0.1, size=(n_samples,))
        Y1 = 2 * f_reg(Z) + noise_Y1 + np.random.normal(scale=0.1, size=(n_samples,))
        return X, Y0, Y1, D
    elif DGP == 3:
        p = p_function(Z)
        D = (p >= U).astype(int)
        noise_Y0 = np.random.normal(size=(n_samples,))
        noise_Y1 = np.random.normal(size=(n_samples,))
        Y0 = f_reg(X) + noise_Y0 + np.random.normal(scale=0.1, size=(n_samples,))
        Y1 = 2 * f_reg(X) + noise_Y1 + np.random.normal(scale=0.1, size=(n_samples,))
        return X, Y0, Y1, D
    elif DGP == 4:
        p = p_function(X)
        D = (p >= U).astype(int)
        noise_Y0 = np.random.normal(size=(n_samples,))
        noise_Y1 = np.random.normal(size=(n_samples,))
        Y0 = f_reg(X) + noise_Y0 + np.random.normal(scale=0.1, size=(n_samples,))
        Y1 = 2 * f_reg(X) + noise_Y1 + np.random.normal(scale=0.1, size=(n_samples,))
        return X, Y0, Y1, D
    else:
        msg = "Invalid DGP number"
        raise ValueError(msg)


# Example usage:
np.random.seed(42)
n_samples = 1000
U = 0.5

X, Y0, Y1, D = generate_data(DGP=1, n_samples=n_samples, U=U)

check tomorrow with gpt 4o

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression


def twfe_did_panel(y1, y0, D, covariates, i_weights=None, inffunc=False):
    # Check dimensions
    n, k = covariates.shape
    assert (
        len(y1) == len(y0) == len(D) == n
    ), "Dimensions of y1, y0, and D must match with covariates"

    # Default weights
    if i_weights is None:
        i_weights = np.ones(n)

    # Main regression model
    X = np.concatenate((covariates, D.reshape(-1, 1)), axis=1)
    model = LinearRegression()
    model.fit(X, y1, sample_weight=i_weights)
    y1_hat = model.predict(X)

    model.fit(X, y0, sample_weight=i_weights)
    y0_hat = model.predict(X)

    ATT = np.mean((y1 - y1_hat) - (y0 - y0_hat))
    se = np.std((y1 - y1_hat) - (y0 - y0_hat)) / np.sqrt(n)

    if inffunc:
        # Calculate influence function
        att_inf_func = None  # Calculate influence function

    return {"ATT": ATT, "se": se, "att_inf_func": att_inf_func if inffunc else None}


# Example data
df_pre = df[df["d"] == 0]
df_post = df[df["d"] == 1]

# Applying the function with pre-treatment and post-treatment periods
result = twfe_did_panel(
    df_post["y"].values,
    df_pre["y"].values,
    df_post["d"].values,
    df_post[["Z1", "Z2", "Z3", "Z4"]].values,
)


print(result)

In [None]:
print(model.summary())

In [None]:
data

In [None]:
ATTE_estimates

TWFE built by hand

In [None]:
import numpy as np
import statsmodels.formula.api as smf
from doubleml import DoubleMLData
from doubleml.datasets import make_did_SZ2020

ATTE = 0.0  # Adjust this to reflect the true treatment effect

custom_delta = 0
np.random.seed(42)
n_obs = 1000
data = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=1,
    cross_sectional_data=False,
    return_type="pd.DataFrame",
)
# data
formula = "y ~ d + Z1 + Z2 + Z3 + Z4"
twfe_model = smf.ols(formula=formula, data=data).fit()

# Extract the treatment effect estimate (coefficient of the treatment variable)

twfe_model.params["d"]

In [None]:
data

In [None]:
x, y, d = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=4,
    cross_sectional_data=False,
    return_type="array",
)  # max 1- 6 dgp types
dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d)
print(dml_data)

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from doubleml.datasets import make_did_SZ2020

# Parameters for the simulation
np.random.seed(42)
n_obs = 1000
n_simulations = 100
true_treatment_effect = 0.0  # Adjust this based on your DGP

# Storage for results
biases = []
ci_lengths = []
coverages = []
asymptotic_variances = []

for _ in range(n_simulations):
    # Generate data with heterogeneous treatment effects and non-parallel trends
    data = make_did_SZ2020(
        n_obs=n_obs,
        dgp_type=2,
        cross_sectional_data=False,
        return_type="DataFrame",
    )  # Use dgp_type=2 to simulate more complex DGP

    # Convert to panel data structure
    data["time"] = np.where(
        data.index < n_obs / 2,
        0,
        1,
    )  # Assume the first half is pre-treatment, the second half is post-treatment
    data["id"] = np.tile(
        np.arange(n_obs // 2),
        2,
    )  # Assign unique IDs for each individual

    # Prepare the data for the TWFE model
    panel_data = pd.DataFrame(
        {
            "id": data["id"],
            "time": data["time"],
            "y": data["y"],
            "d": data["d"],
            "Z1": data["Z1"],
            "Z2": data["Z2"],
            "Z3": data["Z3"],
            "Z4": data["Z4"],
        },
    )

    # Fit the TWFE model
    model = sm.OLS.from_formula("y ~ time * d + Z1 + Z2 + Z3 + Z4", data=panel_data)
    results = model.fit()

    # Extract the estimated treatment effect and confidence interval
    estimated_tau = results.params["time:d"]
    ci = results.conf_int().loc["time:d"]
    ci_length = ci[1] - ci[0]
    asymptotic_variance = results.bse["time:d"] ** 2

    # Calculate bias and check if the true effect is within the confidence interval
    bias = estimated_tau - true_treatment_effect
    coverage = 1 if ci[0] <= true_treatment_effect <= ci[1] else 0

    # Store the results
    biases.append(bias)
    ci_lengths.append(ci_length)
    coverages.append(coverage)
    asymptotic_variances.append(asymptotic_variance)

# Calculate metrics
avg_bias = np.mean(biases)
med_bias = np.median(biases)
rmse = np.sqrt(np.mean(np.array(biases) ** 2))
mean_coverage = np.mean(coverages)
mean_ci_length = np.mean(ci_lengths)
mean_asymptotic_variance = np.mean(asymptotic_variances)

# Print the results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")
print(f"Coverage: {mean_coverage}")
print(f"Mean CI Length: {mean_ci_length}")
print(f"Mean Asymptotic Variance: {mean_asymptotic_variance}")

pip install linearmodels

# Main Model approach

The `DoubleMLDID` class can be used as any other `DoubleML` class. 

The score is set to `score='observational'`, since we generated data where the treatment probability depends on the pretreatment covariates. Further, we will use `in_sample_normalization=True`, since normalization generally improved the results in our simulations (both `score='observational'` and `in_sample_normalization=True` are default values).

After initialization, we have to call the `fit()` method to estimate the nuisance elements.

In [None]:
from doubleml import DoubleMLDID

dml_did = DoubleMLDID(
    dml_data,
    ml_g=ml_g,
    ml_m=ml_m,
    score="observational",
    in_sample_normalization=True,
    n_folds=5,
)

dml_did.fit()
print(dml_did)

As usual, confidence intervals at different levels can be obtained via

In [None]:
print(dml_did.confint(level=0.90))
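For intuition, a pointwise interval at a given level can be reproduced from a coefficient and its standard error under a normal approximation. The helper `normal_confint` below is a hypothetical stand-alone function written for this illustration, not a `DoubleML` method, and the numbers passed to it are made up.

```python
from statistics import NormalDist


def normal_confint(coef, std_err, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    alpha = 1.0 - level
    # Quantile of the standard normal distribution, e.g. about 1.96 for level=0.95
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return coef - z * std_err, coef + z * std_err


# Illustrative values only, not output of the model above
lower, upper = normal_confint(coef=0.2, std_err=0.1, level=0.90)
print(f"90% interval: [{lower:.4f}, {upper:.4f}]")
```

Lowering the level shrinks the quantile and therefore the interval, which is why the 90% interval is narrower than the 95% one.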

### Coverage Simulation

Here, we add a small coverage simulation to highlight the difference to the linear implementation of [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003). We generate multiple datasets, estimate the ATTE and collect the results (this may take some time). 
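The metrics collected in such a simulation can be bundled in a small helper. The sketch below assumes normal-based 95% intervals built from the standard errors; `summarize_simulation` and the toy inputs are hypothetical and only mirror the quantities reported in this notebook (bias, RMSE, squared standard error, coverage, interval length).

```python
import numpy as np


def summarize_simulation(estimates, std_errors, true_value, z=1.959964):
    """Summarize Monte Carlo results; z is the normal quantile for 95% intervals."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    errors = estimates - true_value
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "avg_bias": errors.mean(),
        "med_bias": np.median(errors),
        "rmse": np.sqrt(np.mean(errors**2)),
        "asy_var": np.mean(std_errors**2),  # average squared standard error
        "coverage": covered.mean(),         # share of intervals containing true_value
        "ci_length": np.mean(upper - lower),
    }


# Toy inputs for illustration only
toy = summarize_simulation([0.0, 0.1, -0.1], [0.05, 0.05, 0.01], true_value=0.0)
print(toy)
```

Keeping the summary in one function makes it easier to compare different learners on the same set of metrics.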

In [None]:
n_rep = 200
ATTE = 0.0

ATTE_estimates = np.full((n_rep), np.nan)
coverage = np.full((n_rep), np.nan)
ci_length = np.full((n_rep), np.nan)
asymptotic_variance = np.full((n_rep), np.nan)  # re-initialize; filled in the loop below

np.random.seed(42)
for i_rep in range(n_rep):
    if (i_rep % int(n_rep / 10)) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")
    dml_data = make_did_SZ2020(n_obs=n_obs, dgp_type=4, cross_sectional_data=False)

    dml_did = DoubleMLDID(dml_data, ml_g=ml_g, ml_m=ml_m, n_folds=5)
    dml_did.fit()

    ATTE_estimates[i_rep] = dml_did.coef.squeeze()
    confint = dml_did.confint(level=0.95)
    coverage[i_rep] = (confint["2.5 %"].iloc[0] <= ATTE) & (
        confint["97.5 %"].iloc[0] >= ATTE
    )
    ci_length[i_rep] = confint["97.5 %"].iloc[0] - confint["2.5 %"].iloc[0]

    summary_df = dml_did.summary
    std_err = summary_df.loc["d", "std err"]
    asymptotic_variance[i_rep] = std_err**2

# Calculate metrics
avg_bias = np.mean(ATTE_estimates - ATTE)
med_bias = np.median(ATTE_estimates - ATTE)
rmse = np.sqrt(np.mean((ATTE_estimates - ATTE) ** 2))
avg_asymptotic_variance = np.mean(asymptotic_variance)
coverage_probability = np.mean(coverage)
avg_ci_length = np.mean(ci_length)

# Print results
print(f"Av. Bias: {avg_bias}")
print(f"Med. Bias: {med_bias}")
print(f"RMSE: {rmse}")
print(f"Asy. V: {avg_asymptotic_variance}")
print(f"Cover: {coverage_probability}")
print(f"CIL: {avg_ci_length}")

Let us take a look at the corresponding coverage and the length of the confidence intervals.

In [None]:
print(f"Coverage: {coverage.mean()}")
print(f"Average CI length: {ci_length.mean()}")

Here, we can observe that the coverage is still valid, since we did not rely on linear learners, so the setting is not misspecified in this example. 

If we know the conditional expectation is correctly specified (linear form), we can use this to obtain smaller confidence intervals, but in many applications we may want to safeguard against misspecification and use flexible models such as random forests or boosting.

The distribution of the estimates takes the following form

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_pa = pd.DataFrame(ATTE_estimates, columns=["Estimate"])
g = sns.kdeplot(df_pa, fill=True)
plt.show()

backtracking:
pip install keras==2.12.0

# Deep learning model that works

In [None]:
import numpy as np
from doubleml import DoubleMLData, DoubleMLDID
from doubleml.datasets import make_did_SZ2020
from lightgbm import LGBMRegressor
from scikeras.wrappers import KerasClassifier  # pip install scikeras
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

np.random.seed(42)
n_obs = 1000
x, y, d = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=4,
    cross_sectional_data=False,
    return_type="array",
)
dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d)

# Function to create Keras model


def create_model():
    model = Sequential()
    model.add(Dense(64, input_dim=x.shape[1], activation="relu"))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))  # Assuming binary classification
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Wrap the Keras model with KerasClassifier
keras_classifier = KerasClassifier(
    model=create_model,  # scikeras expects `model`; `build_fn` is deprecated
    epochs=10,
    batch_size=32,
    verbose=0,
)

# Use StandardScaler to normalize data and then use the Keras classifier in a pipeline
ml_m = Pipeline([("scaler", StandardScaler()), ("nn", keras_classifier)])

# Use LGBMRegressor for regression
n_estimators = 30
ml_g = LGBMRegressor(n_estimators=n_estimators)

dml_did = DoubleMLDID(dml_data, ml_g, ml_m)
dml_did.fit()

print(dml_did)

## Repeated Cross-Sectional Data

For repeated cross-sectional data, we assume that we observe i.i.d. data $W_i = (Y_{i}, D_i, X_i, T_i)$. 

Here $Y_i = T_i Y_{i1} + (1-T_i)Y_{i0}$ corresponds to the outcome of unit $i$ which is observed at time $T_i$.
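The observation rule can be illustrated on simulated data: each unit is observed in only one period, and a simple 2x2 comparison of cell means recovers the effect when the period indicator is independent of everything else. The design, `theta_true` and the helper `cell_mean` are assumptions made for this sketch; covariates are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000
theta_true = 1.0  # hypothetical ATT for this illustration

d = rng.binomial(1, 0.5, size=n)      # treatment group indicator D_i
t = rng.binomial(1, 0.5, size=n)      # period T_i in which unit i is observed
level = 1.0 * d + rng.normal(size=n)  # group-specific level differences

y_i0 = level + rng.normal(size=n)                           # outcome at t=0
y_i1 = level + 0.5 + theta_true * d + rng.normal(size=n)    # outcome at t=1

# Only one of the two period outcomes is observed per unit
y = t * y_i1 + (1 - t) * y_i0


def cell_mean(d_val, t_val):
    """Mean observed outcome in the (D=d_val, T=t_val) cell."""
    return y[(d == d_val) & (t == t_val)].mean()


# 2x2 difference-in-differences of the four cell means
did_cs = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(f"Cross-sectional DiD estimate: {did_cs:.3f}")
```

Compared with panel data, no individual difference $Y_{i1}-Y_{i0}$ is available, so the time dimension has to be handled through the four group-by-period cells.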

### Data

As for panel data, we will use the implemented data generating process `make_did_SZ2020` to generate data according to the simulation in [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) (Section 4.2). 

In this example, we will use `dgp_type=4`, which corresponds to the misspecified settings in [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003) (other data generating processes are also available via the `dgp_type` parameter). In all settings the true ATTE is zero.

In contrast to other `DoubleMLData` objects, we have to specify which column corresponds to our time variable $T$.

The time variable can be simply set via the argument `t`.

In [None]:
import numpy as np
from doubleml import DoubleMLData
from doubleml.datasets import make_did_SZ2020

np.random.seed(42)
n_obs = 1000
x, y, d, t = make_did_SZ2020(
    n_obs=n_obs,
    dgp_type=4,
    cross_sectional_data=True,
    return_type="array",
)
dml_data = DoubleMLData.from_arrays(x=x, y=y, d=d, t=t)
print(dml_data)

### ATTE Estimation

To estimate the ATTE with repeated cross-sectional data, we will use the `DoubleMLDIDCS` class. 

As for all `DoubleML` classes, we have to specify learners, which have to be initialized first.
Here, we will just rely on a tree-based method.

The learner `ml_g` is used to fit conditional expectations of the outcome $\mathbb{E}[Y_i| D_i=d, T_i =t, X_i]$ for all combinations of $d,t\in\{0,1\}$, whereas the learner `ml_m` will be used to estimate the propensity score $P(D_i=1|X_i)$.

In [None]:
from lightgbm import LGBMClassifier, LGBMRegressor

n_estimators = 30
ml_g = LGBMRegressor(n_estimators=n_estimators)
ml_m = LGBMClassifier(n_estimators=n_estimators)

The `DoubleMLDIDCS` class can be used as any other `DoubleML` class. 

The score is set to `score='observational'`, since we generated data where the treatment probability depends on the pretreatment covariates. Further, we will use `in_sample_normalization=True`, since normalization generally improved the results in our simulations (both `score='observational'` and `in_sample_normalization=True` are default values).

After initialization, we have to call the `fit()` method to estimate the nuisance elements.

In [None]:
from doubleml import DoubleMLDIDCS

dml_did = DoubleMLDIDCS(
    dml_data,
    ml_g=ml_g,
    ml_m=ml_m,
    score="observational",
    in_sample_normalization=True,
    n_folds=5,
)

dml_did.fit()
print(dml_did)

As usual, confidence intervals at different levels can be obtained via

In [None]:
print(dml_did.confint(level=0.90))

### Coverage Simulation

Again, we add a small coverage simulation to highlight the difference to the linear implementation of [Sant'Anna and Zhao (2020)](https://doi.org/10.1016/j.jeconom.2020.06.003). We generate multiple datasets, estimate the ATTE and collect the results (this may take some time). 

In [None]:
n_rep = 200
ATTE = 0.0

ATTE_estimates = np.full((n_rep), np.nan)
coverage = np.full((n_rep), np.nan)
ci_length = np.full((n_rep), np.nan)

np.random.seed(42)
for i_rep in range(n_rep):
    if (i_rep % int(n_rep / 10)) == 0:
        print(f"Iteration: {i_rep}/{n_rep}")
    dml_data = make_did_SZ2020(n_obs=n_obs, dgp_type=4, cross_sectional_data=True)

    dml_did = DoubleMLDIDCS(dml_data, ml_g=ml_g, ml_m=ml_m, n_folds=5)
    dml_did.fit()

    ATTE_estimates[i_rep] = dml_did.coef.squeeze()
    confint = dml_did.confint(level=0.95)
    coverage[i_rep] = (confint["2.5 %"].iloc[0] <= ATTE) & (
        confint["97.5 %"].iloc[0] >= ATTE
    )
    ci_length[i_rep] = confint["97.5 %"].iloc[0] - confint["2.5 %"].iloc[0]

Let us take a look at the corresponding coverage and the length of the confidence intervals.

In [None]:
print(f"Coverage: {coverage.mean()}")
print(f"Average CI length: {ci_length.mean()}")

As for panel data, the coverage is still valid, since we did not rely on linear learners, so the setting is not misspecified in this example. 

If we know the conditional expectation is correctly specified (linear form), we can use this to obtain smaller confidence intervals, but in many applications we may want to safeguard against misspecification and use flexible models such as random forests or boosting.

The distribution of the estimates takes the following form

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_pa = pd.DataFrame(ATTE_estimates, columns=["Estimate"])
g = sns.kdeplot(df_pa, fill=True)
plt.show()