# R: Sample Selection Models

In this example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate the average treatment effect (ATE) under sample selection or outcome attrition. The estimation is based on a simulated DGP from Appendix E of [Bia, Huber and Lafférs (2023)](https://doi.org/10.1080/07350015.2023.2271071).  

Consider the following DGP:
$$
\begin{align*}
Y_i &= \theta_0 D_i + X_i'\beta_0 + \varepsilon_i,\\
S_i &= \mathbb{1}\{D_i + \gamma_0 Z_i + X_i'\beta_0 + \upsilon_i > 0\}, \\
D_i &= \mathbb{1}\{X_i'\beta_0 + \xi_i > 0\}
\end{align*}
$$
where $Y_i$ is observed only if $S_i=1$,
with
$$X_i \sim N(0, \sigma^2_X), \quad Z_i \sim N(0, 1), \quad (\varepsilon_i, \upsilon_i) \sim N(0, \sigma^2_{\varepsilon, \upsilon}), \quad \xi_i \sim N(0, 1).$$

Let $D_i\in\{0,1\}$ denote the treatment status of unit $i$ and let $Y_{i}$ be the outcome of interest for unit $i$.
Using potential outcome notation, we write $Y_{i}(d)$ for the potential outcome of unit $i$ under treatment status $d$. Further, let $X_i$ denote a vector of pre-treatment covariates.
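
For reference, the target parameter can be written in this notation. If the outcome equation of the DGP above is read as structural (an assumption made here for illustration), then $Y_i(d) = \theta_0 d + X_i'\beta_0 + \varepsilon_i$, and the ATE reduces to the coefficient $\theta_0$:

$$
\begin{align*}
\text{ATE} &= \mathbb{E}[Y_i(1) - Y_i(0)] \\
&= \mathbb{E}\big[(\theta_0 + X_i'\beta_0 + \varepsilon_i) - (X_i'\beta_0 + \varepsilon_i)\big] \\
&= \theta_0.
\end{align*}
$$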

## Outcome missing at random (MAR)  
Now consider the first setting, in which the outcomes are missing at random (MAR) under the assumptions of [Bia, Huber and Lafférs (2023)](https://doi.org/10.1080/07350015.2023.2271071).
Let the covariance matrix $\sigma^2_X$ have entries $a_{ij} = 0.5^{|i - j|}$, let $\gamma_0 = 0$ and $\sigma^2_{\varepsilon, \upsilon} = \begin{pmatrix} 1 & 0 \\  0 & 1 \end{pmatrix}$, and finally let the coefficient vector $\beta_0$ exhibit a quadratic decay in coefficient importance: $\beta_{0,j} = 0.4/j^2$ for $j = 1, \ldots, p$.
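
To make the parameter choices concrete, the DGP can be simulated by hand in base R. This is a minimal sketch, not the implementation behind `make_ssm_data`: the dimension `p = 20`, the variable names, and the choice to code unobserved outcomes as `NA` are assumptions made for illustration.

```r
set.seed(42)
n <- 2000
p <- 20          # hypothetical number of covariates, chosen for illustration
theta0 <- 1      # treatment effect
gamma0 <- 0      # Z does not enter the selection equation (MAR setting)

# Covariance of X with entries 0.5^|i - j| and quadratically decaying coefficients
Sigma_X <- 0.5^abs(outer(1:p, 1:p, "-"))
beta0 <- 0.4 / (1:p)^2

# Draw X ~ N(0, Sigma_X) via the Cholesky factor (chol() returns R with R'R = Sigma_X)
X <- matrix(rnorm(n * p), nrow = n) %*% chol(Sigma_X)
Z <- rnorm(n)

# Independent standard normal errors (identity covariance for eps and ups)
eps <- rnorm(n)
ups <- rnorm(n)
xi <- rnorm(n)

xb <- drop(X %*% beta0)
D <- as.numeric(xb + xi > 0)                      # treatment
S <- as.numeric(D + gamma0 * Z + xb + ups > 0)    # selection
Y <- theta0 * D + xb + eps                        # latent outcome
Y_obs <- ifelse(S == 1, Y, NA_real_)              # outcome observed only if S = 1

# Shares of treated and selected units
c(share_treated = mean(D), share_selected = mean(S))
```

Because $\gamma_0 = 0$ and $\varepsilon_i$ is independent of $\upsilon_i$, selection depends on the outcome only through $D_i$ and $X_i$, which is what the MAR setting requires.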


### Data

We will use the implemented data generating process `make_ssm_data` to generate data according to the simulation in Appendix E of [Bia, Huber and Lafférs (2023)](https://doi.org/10.1080/07350015.2023.2271071). The true ATE in this DGP is equal to $\theta_0=1$ (it can be changed by setting the parameter `theta`). 

By default, `make_ssm_data` returns a `DoubleMLData` object; other formats, such as the `data.table` used in the code below, can be requested via `return_type`. In this first setting, we estimate the ATE under missingness at random, so we set `mar=TRUE`.
The column containing the selection indicator `S` is passed to `DoubleMLData` via `s_col`.

In [None]:
#remotes::install_github("DoubleML/doubleml-for-r", ref = "p-ssm_branch", force = TRUE)

In [None]:
library(DoubleML)
library(mlr3)
library(mlr3learners)  # provides the cv_glmnet learners used below

set.seed(1234)
n_obs = 2000
df = make_ssm_data(n_obs=n_obs, mar=TRUE, return_type="data.table")

dml_data = DoubleMLData$new(df, y_col="y", d_cols="d", s_col="s")
dml_data


### Estimation

To estimate the ATE under sample selection, we will use the `DoubleMLSSM` class. 

As with all `DoubleML` classes, we have to specify learners, which must be initialized first.
Given the simulated quadratic decay in coefficient importance, Lasso regression should be a suitable option (for the propensity scores, this is an $\ell_1$-penalized logistic regression).

The learner `ml_g` is used to fit conditional expectations of the outcome $\mathbb{E}[Y_i|D_i, S_i, X_i]$, whereas the learners `ml_m` and `ml_pi` will be used to estimate the treatment and selection propensity scores $P(D_i=1|X_i)$ and $P(S_i=1|D_i, X_i)$.
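
The following base R sketch shows how these three nuisance functions could enter a doubly robust estimate of the ATE under MAR. It is illustrative only and makes several assumptions: the score is taken to have the augmented inverse-probability-weighting form $\psi_d = \frac{\mathbb{1}\{D=d\}\, S\, [Y - g(d,X)]}{P(D=d|X)\, \pi(d,X)} + g(d,X)$, the nuisances are fitted in-sample with `lm` and probit `glm` instead of cross-fitted Lasso learners, and the covariates are independent with a hypothetical dimension `p = 5`. It is not the code path used by `DoubleMLSSM`.

```r
set.seed(1)
n <- 5000
p <- 5                                   # hypothetical small dimension
theta0 <- 1
beta0 <- 0.4 / (1:p)^2

# Simplified MAR data: independent covariates, gamma_0 = 0
X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("x", 1:p)
xb <- drop(X %*% beta0)
D <- as.numeric(xb + rnorm(n) > 0)
S <- as.numeric(D + xb + rnorm(n) > 0)
Y <- theta0 * D + xb + rnorm(n)
Y[S == 0] <- 0                           # placeholder; never used because of the factor S

dat <- data.frame(Y = Y, D = D, S = S, X)
rhs <- paste(colnames(X), collapse = " + ")

# Treatment propensity m(X) = P(D = 1 | X)
fit_m <- glm(as.formula(paste("D ~", rhs)),
             family = binomial(link = "probit"), data = dat)
m_hat <- predict(fit_m, type = "response")

# Selection propensity pi(D, X) = P(S = 1 | D, X)
fit_pi <- glm(as.formula(paste("S ~ D +", rhs)),
              family = binomial(link = "probit"), data = dat)
pi_hat <- predict(fit_pi, type = "response")

# Outcome regression g(d, X) = E[Y | D = d, S = 1, X], fitted on selected units only
fit_g <- lm(as.formula(paste("Y ~ D +", rhs)), data = dat, subset = S == 1)
g1 <- predict(fit_g, newdata = transform(dat, D = 1))
g0 <- predict(fit_g, newdata = transform(dat, D = 0))

# Doubly robust score components and the resulting ATE estimate
psi1 <- D * S * (Y - g1) / (m_hat * pi_hat) + g1
psi0 <- (1 - D) * S * (Y - g0) / ((1 - m_hat) * pi_hat) + g0
theta_hat <- mean(psi1 - psi0)
theta_hat
```

Evaluating `pi_hat` at the observed treatment is sufficient here, because each correction term is multiplied by the indicator of the corresponding treatment arm.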

In [None]:
ml_g = lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_m = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")
ml_pi = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min")

The `DoubleMLSSM` class can be used like any other `DoubleML` class.

The score is set to `score = "missing-at-random"`, since the parameters of the DGP were chosen to satisfy the assumptions of outcomes missing at random. Further, since the simulation in [Bia, Huber and Lafférs (2023)](https://doi.org/10.1080/07350015.2023.2271071) normalizes the inverse probability weights, we apply the same setting via `normalize_ipw = TRUE`.
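
As a rough intuition for what normalizing inverse probability weights means, the sketch below rescales a set of weights so that they average to one (a Hájek-type normalization). The helper `normalize_weights` and the toy numbers are hypothetical, and the exact rescaling applied internally by `DoubleMLSSM` may differ.

```r
# Hypothetical helper: rescale inverse probability weights to have mean one
normalize_weights <- function(w) {
  w / mean(w)
}

# Toy example: indicator of being treated and selected, with estimated propensities
ind <- c(1, 0, 1, 1, 0, 1)
prop <- c(0.8, 0.5, 0.25, 0.5, 0.4, 0.1)

w_raw <- ind / prop               # a small propensity produces a large raw weight
w_norm <- normalize_weights(w_raw)

rbind(raw = w_raw, normalized = w_norm)
```

Normalization leaves the relative weights unchanged but fixes their average, which tends to stabilize estimates when some estimated propensities are close to zero.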

After initialization, we have to call the `fit()` method to estimate the nuisance elements.

In [None]:
dml_ssm = DoubleMLSSM$new(dml_data, ml_g, ml_m, ml_pi, score="missing-at-random",
                        normalize_ipw = TRUE)