<a href="https://colab.research.google.com/github/POLSEAN/XTDML/blob/main/examples/02_xtdml_for_wg_approx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DML for panel data: WG (approximation) approach**

---

*Description*

Estimation of the structural parameter using double machine learning (DML) with partially linear regression (PLR) models in the context of panel data with fixed effects, as in Clarke and Polselli (2023).

The package `XTDML` estimates the nuisance functions with machine learning methods and computes the Neyman orthogonal score functions. `XTDML` is built on the CRAN package `DoubleML` (Bach et al., 2024), which uses the `mlr3` ecosystem and the `R6` package.




*References*

[1] Bach, P., Chernozhukov, V., Kurz, M. S., Spindler, M. and Klaassen, S. (2024), DoubleML - An Object-Oriented Implementation of Double Machine Learning in R, *Journal of Statistical Software*, 108(3):1-56.

[2] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. *The Econometrics Journal*, 21(1):C1-C68.

[3] Clarke, P. and Polselli, A. (2023). Double machine learning for static panel models with fixed effects. *arXiv preprint*, arXiv:2312.08174.

[4] Mundlak, Y. (1978). On the pooling of time series and cross section data. *Econometrica*, 46(1):69-85.

*Overview Code*

1. Installation of XTDML and other R packages
2. Loading the data
3. Data management with WG transformation
4. Set up of DML data environment
5. Set up of DML estimation environment
6. Extraction of DML estimates


### **Installation of the `XTDML` package**

The `XTDML` package can be installed using either of the options below:

1. **Installation directly from GitHub:**
  ```
    #install.packages("devtools")
    library(devtools)

    install_github("POLSEAN/XTDML")
    library(XTDML)
  ```
  *Note that this code works **only in RStudio (desktop)**, not on online platforms such as Google Colab or Kaggle.*


2. **Download all folders in `XTDML`** from `https://github.com/POLSEAN/XTDML` pressing `<> CODE > Download ZIP`. Rename the downloaded .zip folder as `XTDML`, and upload it on Google Colab. Get the path and run the code `!unzip XTDML.zip` in Python, then change the RUNTIME to R and run
   ```
    #install.packages("devtools")
    library(devtools)

    wd = "~ your-directory/XTDML"
    devtools::load_all(wd)
   ```

For illustration purposes on Google Colab, we follow the second approach, but the first is recommended with RStudio (desktop).

**Set RUNTIME > CHANGE RUNTIME TYPE > Python 3**

The code below unzips the XTDML.zip folder that you have previously uploaded.

In [None]:
!unzip XTDML.zip

Archive:  XTDML.zip
 extracting: XTDML/.gitignore        
  inflating: XTDML/.Rbuildignore     
  inflating: XTDML/.RData            
  inflating: XTDML/.Rhistory         
   creating: XTDML/.Rproj.user/
   creating: XTDML/.Rproj.user/22C44D20/
   creating: XTDML/.Rproj.user/22C44D20/bibliography-index/
 extracting: XTDML/.Rproj.user/22C44D20/cpp-definition-cache  
   creating: XTDML/.Rproj.user/22C44D20/ctx/
   creating: XTDML/.Rproj.user/22C44D20/explorer-cache/
   creating: XTDML/.Rproj.user/22C44D20/pcs/
  inflating: XTDML/.Rproj.user/22C44D20/pcs/files-pane.pper  
 extracting: XTDML/.Rproj.user/22C44D20/pcs/source-pane.pper  
  inflating: XTDML/.Rproj.user/22C44D20/pcs/windowlayoutstate.pper  
  inflating: XTDML/.Rproj.user/22C44D20/pcs/workbench-pane.pper  
   creating: XTDML/.Rproj.user/22C44D20/presentation/
   creating: XTDML/.Rproj.user/22C44D20/profiles-cache/
 extracting: XTDML/.Rproj.user/22C44D20/rmd-outputs  
 extracting: XTDML/.Rproj.user/22C44D20/saved_source_markers  

**From now set RUNTIME > CHANGE RUNTIME TYPE > R**

In [None]:
# 1. Install and import R packages
# Install packages
list.of.packages <- c("datawizard","mlr3","mlr3learners","mlr3tuning","paradox","xgboost","ranger","MLmetrics","devtools","tidyverse")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")

# Load general packages
library(devtools)
library(tidyverse)
library(checkmate)
library(dplyr)
library(tibble)  ##for add_column()
library(datawizard)
library(data.table)
# ML packages
library(mlr3)
library(mlr3learners)
library(rpart)
library(xgboost)
library(ranger)
# Packages for HP tuning
library(mlr3misc)
library(mlr3tuning)
library(paradox)
library(MLmetrics)

# Reduce logging from ML packages (only warnings and errors are shown)
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘bitops’, ‘gtools’, ‘caTools’, ‘globals’, ‘listenv’, ‘PRROC’, ‘gplots’, ‘insight’, ‘checkmate’, ‘future’, ‘future.apply’, ‘lgr’, ‘mlbench’, ‘mlr3measures’, ‘mlr3misc’, ‘parallelly’, ‘palmerpenguins’, ‘bbotk’, ‘RcppEigen’, ‘ROCR’


Loading required package: usethis

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::fil

In [None]:
# Additional packages required to install XTDML (not always necessary, depending on the R version)
list.of.packages <- c("mvtnorm","clusterGeneration","readstata13")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")

library(mvtnorm)
library(clusterGeneration)
library(readstata13)

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


Attaching package: ‘mvtnorm’


The following object is masked from ‘package:datawizard’:

    standardize


Loading required package: MASS


Attaching package: ‘MASS’


The following object is masked from ‘package:dplyr’:

    select




In [None]:
# Load the XTDML package from the unzipped folder
wd = "/content/XTDML"
devtools::load_all(wd)


ℹ Loading XTDML


### **The Data**

We use simulated data for DGP3 as in Clarke and Polselli (2023). We use a subsample (N=250) of the original dataset (with N=1,000,000), where each unit is observed over $T=10$ periods.

In this dataset, the nuisance functions are generated as follows

\begin{align*}
    l_0(x_{it}) & = b \, (x_{it,1}\cdot x_{it,3}) + a \, (x_{it,3}\cdot 1[x_{it,3}>0])\\
    m_0(x_{it}) & = a \, (x_{it,1}\cdot 1[x_{it,1}>0]) + b \, (x_{it,1}\cdot x_{it,3})
\end{align*}

where $a=0.25$ and $b=0.5$. The true structural effect is 0.5.
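
As a reading aid, the two nuisance functions can be written as plain R functions. This is a minimal sketch: the names `l0` and `m0` and their default arguments are illustrative choices, not part of `XTDML`, and the code only evaluates the formulas above rather than reproducing the full DGP3 simulation.

```r
# Nuisance functions of the DGP as stated above (a = 0.25, b = 0.5).
# `l0` and `m0` are illustrative names, not XTDML functions.
l0 <- function(x1, x3, a = 0.25, b = 0.5) {
  b * (x1 * x3) + a * (x3 * (x3 > 0))
}
m0 <- function(x1, x3, a = 0.25, b = 0.5) {
  a * (x1 * (x1 > 0)) + b * (x1 * x3)
}

# Evaluate both functions at a few hand-picked covariate values
x1 <- c(1, -1, 2)
x3 <- c(2, 2, -1)
l0_values <- l0(x1, x3)
m0_values <- m0(x1, x3)
```

Only `x1` and `x3` enter the two functions, which is why just two of the thirty control variables are relevant in this design.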


Note that the WG approach requires **transformed** (time-demeaned) data, where each variable is expressed as a deviation from its unit-specific time average:
* $\tilde{y}_{it} = y_{it} - \frac{1}{T}\sum_{\tau=1}^T y_{i\tau}$ is the outcome variable (continuous or binary).
* $\tilde{d}_{it} = d_{it} - \frac{1}{T}\sum_{\tau=1}^T d_{i\tau}$ is the treatment variable (continuous or binary).
* $\tilde{x}_{it} = (\tilde{x}_{it,1}, \dots, \tilde{x}_{it,p})'$, where $\tilde{x}_{it,k} = x_{it,k} - \frac{1}{T}\sum_{\tau=1}^T x_{i\tau,k}$, is the vector of $p=30$ control variables, of which only $s=2$ are relevant.
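
A compact base-R sketch of this transformation on a toy panel may help fix ideas. The data frame `toy` and its column names are made up for illustration; `ave()` returns the unit mean aligned with each row, so subtracting it yields the time-demeaned variable. Unlike the fuller `dplyr` example further down in this notebook, the sketch does not add the grand mean back.

```r
# Toy panel: 3 units observed over 4 periods (illustrative data only)
toy <- data.frame(
  id   = rep(1:3, each = 4),
  time = rep(1:4, times = 3),
  y    = c(1, 2, 3, 4, 10, 10, 12, 12, -1, 0, 1, 0)
)

# Within-group (time-demeaning) transformation: y_it minus the unit mean of y
toy$y_tilde <- toy$y - ave(toy$y, toy$id, FUN = mean)

# Unit-level means of the transformed variable, useful as a sanity check
unit_means <- tapply(toy$y_tilde, toy$id, mean)
```

The same line would be repeated for the treatment and for every control variable, since all of them have to be transformed.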

In [None]:
# 2. Load simulated data from GitHub
# The data is already TIME-DEMEANED (the code to apply the within-group transformation is given below)
df = read.csv("https://raw.githubusercontent.com/POLSEAN/XTDML/main/data/dgp3_wg_short.csv")
head(df)

| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | ⋯ | x25 | x26 | x27 | x28 | x29 | x30 | y | d | id | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` |
| 1 | 1.207286 | -0.1393181 | -1.7420989 | 11.5612946 | 7.657902 | -5.766983 | 3.31557 | 2.8193932 | -8.311597 | -1.5291095 | ⋯ | -2.2089016 | -6.522599 | -5.5568672 | -1.8528365 | 1.0801914 | 2.8072718 | -0.3297543 | -0.7631192 | 1 | 1 |
| 2 | 6.456362 | -0.9376213 | -1.0419866 | -3.6251928 | 2.767801 | 4.143498 | 1.89334 | 0.6039319 | 2.802924 | 2.4015315 | ⋯ | 3.0280572 | 3.250733 | -3.3710752 | 7.7339929 | 9.870127 | 4.4161274 | -6.6604693 | -4.7479225 | 1 | 2 |
| 3 | -7.799506 | 4.9583355 | -0.7621036 | 3.0598577 | 6.897287 | 1.549369 | 4.027569 | -0.3454077 | 9.021499 | 0.1900359 | ⋯ | -0.1310073 | -1.586182 | -0.677769 | 0.3936301 | -0.1313761 | 1.4435891 | 14.7532417 | 8.7176737 | 1 | 3 |
| 4 | -1.388265 | -5.3669211 | 2.8604658 | -8.1199156 | 8.376423 | 9.301903 | 2.940048 | -0.7906528 | 9.286537 | -4.3100712 | ⋯ | -1.2243048 | 12.425531 | -0.2647207 | 3.7781437 | 6.0148416 | 6.0582704 | 6.3742881 | 1.9074289 | 1 | 4 |
| 5 | 8.033281 | -0.3526331 | 1.2944688 | -0.1433285 | -7.331704 | -1.02142 | -3.291111 | -3.6899825 | -7.480111 | -1.2449216 | ⋯ | -0.821045 | -6.741179 | 3.7456886 | -2.5871991 | 0.4788044 | 0.2555598 | 6.6105295 | 6.028984 | 1 | 5 |
| 6 | -6.85626 | 3.952757 | -1.5199853 | 2.3989451 | -2.429696 | -2.162288 | -4.30851 | 4.5548429 | 3.380496 | 0.4730454 | ⋯ | 2.8901936 | 4.782938 | 4.8563051 | -1.6722178 | -1.7085113 | -5.2948253 | 15.6372517 | 8.901297 | 1 | 6 |


**NOTE:** If your data is not time-demeaned, you need to transform *all* variables (outcomes, treatments, covariates, dummies). The dataset loaded above has already been transformed. Below is an example of code that applies the within-group transformation.

Adapt the code below to your dataset. Note that Google Colab does not allow the use of `select(df, starts_with("x"))`. A likely reason is that `MASS` masks `select` from `dplyr` (see the package loading messages above), in which case calling `dplyr::select()` explicitly may help.

```
# Load the dataset created for the CRE approach from GitHub
df.git = read.csv("https://raw.githubusercontent.com/POLSEAN/XTDML/main/data/dgp4_cre_short.csv")

# Time-demean the data as follows:
#     1. Calculate the individual means for each variable (xbar_i), and
#     2. Calculate the grand means of each variable (xbar).

# Keep main variables without panel indices (id, time)
# (dplyr::select is called explicitly because MASS may mask select)
df_no_idx  = dplyr::select(df.git, starts_with(c("x","y","d")))
X = paste0("x",1:30)

# Calculate the grand means
df_gm = df_no_idx %>%
  mutate(across(c(all_of(X), d, y), ~ mean(.x)))
gmX_list = as.list(dplyr::select(df_gm[1,], starts_with(c("x", "d", "y"))))

# Calculate the individual means
df_mi = df.git %>%
  group_by(id) %>%
  mutate(across(c(all_of(X), d, y), ~  mean(.x)))
mX_list = as.list(dplyr::select(df_mi, starts_with(c("x", "d", "y"))))
mX_list = mX_list[-1]  # drop id, which select() keeps first for grouped data

# Calculate time-demeaned variables as: xtilde_it = x_it - xbar_i + xbar (the grand mean allows a consistent estimate of the constant term)
df_dm = df_no_idx %>%
  mutate(across(all_of(names(mX_list)), ~ .x - mX_list[[cur_column()]] + gmX_list[[cur_column()]]))
df_dm$id = df.git$id
df_dm$time = df.git$time

# Save the .csv file with the transformed data
write.csv(df_dm, file = "~your_directory_here_to_export_file/dgp3_wg_short.csv", row.names = FALSE)


```
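
Before estimation it can be worth checking whether a dataset has already been time-demeaned. The helper below is a hypothetical sketch (the name `is_time_demeaned` and the tolerance are assumptions, not part of `XTDML`): it tests whether all unit-level means of a variable coincide, which holds both when only the unit mean is subtracted and when the grand mean is added back as in the code above.

```r
# Hypothetical helper: TRUE if all unit-level means of `values` coincide
# (up to `tol`), as they do after the within-group transformation.
is_time_demeaned <- function(values, id, tol = 1e-8) {
  unit_means <- tapply(values, id, mean)
  diff(range(unit_means)) < tol
}

# Illustrative data: two units, three periods each
id  <- rep(1:2, each = 3)
raw <- c(1, 2, 3, 10, 11, 12)
dm  <- raw - ave(raw, id) + mean(raw)   # unit mean removed, grand mean added

check_raw <- is_time_demeaned(raw, id)  # untransformed variable
check_dm  <- is_time_demeaned(dm, id)   # transformed variable
```

Applied to real data, the check would be run column by column, for example with `sapply()` over the outcome, treatment and control columns.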

## **Estimation and inference with DML for WG**

This section sets up the DML data and estimation environments and then runs the estimation.

### **4. Set up DML data environment**
Initialization of `dml_approx_data` from a `data.frame`. Arguments to pass:

```
dml_approx_data_from_data_frame(data,
                  x_cols = NULL,
                  y_col = NULL,
                  d_cols = NULL,
                  z_cols = NULL,
                  cluster_cols = NULL
                  )

```                             

In [None]:
# 4. Set up DML data environment
x_cols <- paste0("x", 1:30)

# set up data for DML procedure
obj_dml_data = dml_approx_data_from_data_frame(df,
                            x_cols = x_cols,  y_col = "y", d_cols = "d",
                            cluster_cols = "id")
obj_dml_data$print()



------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Cluster variable(s): id
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24, x25, x26, x27, x28, x29, x30
Instrument(s): 
No. Observations: 2500


### **5. Set up DML estimation environment**

Arguments to pass to `dml_approx_plr$new()`, which creates a new instance of this R6 class:

```
 dml_approx_plr$new(data,
      ml_l,
      ml_m,
      ml_g = NULL,
      n_folds = 5,
      n_rep = 1,
      score = "orth-PO",               # or "orth-IV"
      dml_procedure = "dml2",          # or "dml1"
      draw_sample_splitting = TRUE,
      apply_cross_fitting = TRUE
      )

```
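
To give some intuition for `n_folds`, `apply_cross_fitting` and `dml_procedure`, here is a minimal base-R sketch of cross-fitting with partialling out. It is an illustration under stated assumptions, not the `XTDML` implementation: the toy data, the use of `lm()` as the learner, and the assignment of whole units to folds are all choices made for this example.

```r
# Toy cross-fitting with folds formed at the unit (cluster) level.
# Data, names and the lm() learner are assumptions for illustration only.
set.seed(1)
n_units <- 200; n_periods <- 5; n_folds <- 5
n_obs <- n_units * n_periods

id <- rep(seq_len(n_units), each = n_periods)
x  <- rnorm(n_obs)
d  <- 0.5 * x + rnorm(n_obs)          # treatment depends on x
y  <- 0.5 * d + x + rnorm(n_obs)      # true effect of d is 0.5
toy <- data.frame(id, x, d, y)

# Assign whole units to folds so that a unit's observations stay together
fold_of_unit <- sample(rep(seq_len(n_folds), length.out = n_units))
fold <- fold_of_unit[toy$id]

# For each fold: fit nuisance models on the other folds, residualize held-out data
res_y <- res_d <- numeric(n_obs)
for (k in seq_len(n_folds)) {
  train   <- toy[fold != k, ]
  holdout <- toy[fold == k, ]
  fit_l <- lm(y ~ x, data = train)    # outcome given controls
  fit_m <- lm(d ~ x, data = train)    # treatment given controls
  res_y[fold == k] <- holdout$y - predict(fit_l, newdata = holdout)
  res_d[fold == k] <- holdout$d - predict(fit_m, newdata = holdout)
}

# Partialling-out estimate from the pooled out-of-fold residuals
theta_hat <- sum(res_d * res_y) / sum(res_d^2)
```

Pooling the residuals from all folds before solving for the parameter follows the `dml2` idea as described for `DoubleML`; averaging fold-specific estimates would instead correspond to `dml1`.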

In [None]:
# 5. Set up DML estimation environment
set.seed(1408)
learner = lrn("regr.rpart")
ml_l = learner$clone()
ml_m = learner$clone()

dml_rpart = dml_approx_plr$new(obj_dml_data,
                            ml_l = ml_l, ml_m = ml_m)

# set up a list of parameter grids
param_grid = list("ml_l" = ps(cp = p_dbl(lower = 0.001, upper = 0.02),
                              maxdepth = p_int(lower = 2, upper = 10)),
                  "ml_m" = ps(cp = p_dbl(lower = 0.001, upper = 0.02),
                              maxdepth = p_int(lower = 2, upper = 10)))

tune_settings = list(terminator = mlr3tuning::trm("evals", n_evals = 10),
                      algorithm = tnr("grid_search"), resolution = 20)

dml_rpart$tune(param_set = param_grid, tune_settings = tune_settings)

# Estimate target/causal parameter
dml_rpart$fit()
dml_rpart$print()
print(dml_rpart$params)

TuningInstanceSingleCrit is deprecated. Use TuningInstanceBatchSingleCrit instead.

TuningInstanceSingleCrit is deprecated. Use TuningInstanceBatchSingleCrit instead.



[1] "rmses in fold 1 : 20.8425618528673"
[1] "theta_subsample_mean in fold 1: 1.20967326152633"
[1] "rmses in fold 2 : 21.6552961084982"
[1] "theta_subsample_mean in fold 2: 1.26796763289583"
[1] "rmses in fold 3 : 19.386253563844"
[1] "theta_subsample_mean in fold 3: 1.29530566370593"
[1] "rmses in fold 4 : 18.8041160665854"
[1] "theta_subsample_mean in fold 4: 1.30031407414006"
[1] "rmses in fold 5 : 19.6349951037836"
[1] "theta_subsample_mean in fold 5: 1.27814720150453"
[1] "theta in dml2: 1.27814720150453"
[1] "rmse in dml2: 19.6349951037836"



------------------ Data summary ------------------
Outcome variable: y
Treatment variable: d
Covariates: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24, x25, x26, x27, x28, x29, x30
Cluster variables: id
No. Observations: 2500
No. Groups: 250

------------------ Score & algorithm ------------------
Score function: orth-PO
DML algorithm: dml2
DML approach: transformed variables 