# Data Generating Process Simulation

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Topic" data-toc-modified-id="Topic-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Topic</a></span></li><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Motivation</a></span></li><li><span><a href="#Properties" data-toc-modified-id="Properties-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Properties</a></span></li></ul></li><li><span><a href="#Literature" data-toc-modified-id="Literature-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Literature</a></span></li><li><span><a href="#Package-Content" data-toc-modified-id="Package-Content-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Package Content</a></span><ul class="toc-item"><li><span><a href="#General-Model:-Partial-Linear-Regression" data-toc-modified-id="General-Model:-Partial-Linear-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>General Model: Partial Linear Regression</a></span></li><li><span><a href="#Covariates-generation" data-toc-modified-id="Covariates-generation-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Covariates generation</a></span><ul class="toc-item"><li><span><a href="#Continuous-covariates" data-toc-modified-id="Continuous-covariates-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Continuous covariates</a></span></li><li><span><a href="#Binary-and-categorical-covariates" data-toc-modified-id="Binary-and-categorical-covariates-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Binary and categorical covariates</a></span></li></ul></li><li><span><a href="#Treatment-assignment" data-toc-modified-id="Treatment-assignment-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Treatment assignment</a></span><ul class="toc-item"><li><span><a 
href="#Random" data-toc-modified-id="Random-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Random</a></span></li><li><span><a href="#Dependent-on-covariates" data-toc-modified-id="Dependent-on-covariates-3.3.2"><span class="toc-item-num">3.3.2&nbsp;&nbsp;</span>Dependent on covariates</a></span></li></ul></li><li><span><a href="#Treatment-effects" data-toc-modified-id="Treatment-effects-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Treatment effects</a></span><ul class="toc-item"><li><span><a href="#Positive-&amp;-negative-constant-effect" data-toc-modified-id="Positive-&amp;-negative-constant-effect-3.4.1"><span class="toc-item-num">3.4.1&nbsp;&nbsp;</span>Positive &amp; negative constant effect</a></span></li><li><span><a href="#Positive-&amp;-negative-continuous-heterogeneous-effect" data-toc-modified-id="Positive-&amp;-negative-continuous-heterogeneous-effect-3.4.2"><span class="toc-item-num">3.4.2&nbsp;&nbsp;</span>Positive &amp; negative continuous heterogeneous effect</a></span></li><li><span><a href="#No-effect" data-toc-modified-id="No-effect-3.4.3"><span class="toc-item-num">3.4.3&nbsp;&nbsp;</span>No effect</a></span></li><li><span><a href="#Discrete-heterogeneous-treatment-effect" data-toc-modified-id="Discrete-heterogeneous-treatment-effect-3.4.4"><span class="toc-item-num">3.4.4&nbsp;&nbsp;</span>Discrete heterogeneous treatment effect</a></span></li></ul></li><li><span><a href="#Output-variable" data-toc-modified-id="Output-variable-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Output variable</a></span><ul class="toc-item"><li><span><a href="#Continuous" data-toc-modified-id="Continuous-3.5.1"><span class="toc-item-num">3.5.1&nbsp;&nbsp;</span>Continuous</a></span></li><li><span><a href="#Binary" data-toc-modified-id="Binary-3.5.2"><span class="toc-item-num">3.5.2&nbsp;&nbsp;</span>Binary</a></span></li></ul></li></ul></li></ul></div>

## Introduction

### Topic
As modern science becomes increasingly data-driven across virtually all fields, it is essential to examine not only how scientists analyze data but also *what kind* of data they use. After all, the performance of a model is bounded by the quality of the underlying data.
This blog post explores the properties and constituents of realistic data sets and proposes flexible and user-friendly software to support research by means of 'Simulated Data Generating Processes' (SDGP).

### Motivation
As dimensionality increases, standard machine learning techniques tend to suffer from the 'Curse of Dimensionality': for a fixed sample size data points become sparse, and the large parameter space makes consistent parameter estimation difficult (Chernozhukov).
As more and more data become available, these problems need to be addressed in a specialized framework.
Another fundamental problem that researchers face is the non-observability of counterfactuals and treatment effects, first addressed in the 'Potential Outcomes Framework' (Neyman, Rubin).

### Properties
The simulated setting is a high-dimensional, partially linear model.
The covariates are pseudo-randomly generated such that they are (partially) correlated with the output variable, which itself can take binary, discrete or continuous values. The relation between the covariates and the output variable can be specified as linear, partially non-linear, or entirely non-linear.
Individual treatment assignment can be random or non-random, that is, the propensity (the probability of being in the treatment group) can either be identical for all observations or depend on the covariates. Unlike in real-world data, where counterfactuals are never observed, treatment effects in the proposed setting are *known* and can be customized.
By proposing a model in which all components are fully specified, we give researchers a way to evaluate and compare various models on a realistic data set.

## Literature

## Package Content

In the following sections you will find a step-by-step explanation of how our data simulation package works internally, accompanied by the corresponding formulas, code snippets and explanatory graphs.

### General Model: Partial Linear Regression

The model that our package is based on is a partial linear regression model as described in Chernozhukov et al. (2016). It consists of the covariates $X$, whose relation to $Y$ is possibly non-linear, a treatment term made up of the treatment effect and the treatment assignment vector, and a normally distributed error term.

<br/>

$$ \begin{align} Y = \theta_0  D + g_0(X)+ U,  &&  & E[U|X,D] = 0 \\ 
   D = m_0(X) + V, &&   &  E[V|X] = 0 \\
  \theta_0 =  t_0(Z) + W, && & E[W|Z] = 0, \; Z \subseteq X  \\
   \end{align}$$

$Y$ - Outcome Variable $\quad \theta_0$ - True treatment effect $\quad D$ - Treatment Dummy $ \quad X_{n*k}$ - Covariates

$U$, $V$ & $W$ - normally distributed error terms with expected value 0

### Covariates generation

#### Continuous covariates

The covariate matrix $X$ in our simulation is drawn from a multivariate normal distribution with an expected value of 0 and a specified covariance matrix $\Sigma$. $\Sigma$ is constructed in the following way. First, values for the matrix $A$ are drawn from a uniform distribution. Second, to make sure that negative correlations exist and that not all variables are highly correlated with each other, we create an overlay matrix $B$ consisting of the values 1 and -1. Third, we multiply the two matrices element-wise and scale the result with a correction term to ensure that the values in $\Sigma$ do not increase with $k$. The result is the matrix $\Lambda$. In a final step, we calculate $\Sigma$ by multiplying $\Lambda$ with its transpose, which ensures that $\Sigma$ is symmetric and positive semi-definite (and positive definite whenever $\Lambda$ has full rank).

<br>

$$ X_{n*k} \sim N_k(0,\Sigma)$$

Where, 

$n = Number \; of \; Observations, \quad k =  Number \; of \; Covariates$

$\Sigma = \Lambda*\Lambda^T, \quad \Lambda = \frac{10}{k} (A \circ B), \quad A \sim U(0,1), \quad B \sim Bernoulli(0.5)\;,B \in \{-1,1\}$

$Matrices \; A, \; B, \; and \; \Lambda \; are \; all \; of \; dimension \; k*k$

<br>

<script src="https://gist.github.com/Tobias-K93/f550c942f3ceea379271c9d89913fac7.js"></script>

The following heat-map shows an example of $\Sigma$ with $k=10$ covariates. Depending on the chosen random seed, correlations typically range between -0.7 and 0.7, with slightly varying minimum and maximum values.

<img align="center" width="660" height="500" style="display:block;margin:0 auto;" src="/blog/img/seminar/data_generating_process/covariates_correlation.png">
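
The construction of $\Sigma$ described above can be sketched in a few lines of NumPy. This is a minimal illustration written for this post, not the package's own code; the function name `make_sigma`, the seed, and the sample size are assumptions chosen for the example.

```python
import numpy as np

def make_sigma(k, rng):
    """Build a k x k covariance matrix Sigma = Lambda @ Lambda.T."""
    A = rng.uniform(0.0, 1.0, size=(k, k))    # A ~ U(0, 1)
    B = rng.choice([-1.0, 1.0], size=(k, k))  # overlay of -1 / +1, each with probability 0.5
    Lam = (10.0 / k) * (A * B)                # element-wise product, scaled by 10 / k
    return Lam @ Lam.T                        # symmetric, positive semi-definite

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for this sketch
n, k = 1000, 10
sigma = make_sigma(k, rng)
X = rng.multivariate_normal(mean=np.zeros(k), cov=sigma, size=n)

# Convert the covariance matrix into a correlation matrix for inspection
std = np.sqrt(np.diag(sigma))
corr = sigma / np.outer(std, std)
```

The off-diagonal entries of `corr` are the quantities a heat-map like the one above would display.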

#### Binary and categorical covariates
Binary and categorical covariates are created from the continuous covariates in X. 

$$ p_{binary} = \textrm{min-max-standardize}(X) $$

$$ X_{binary} = Bernoulli\left( p_{binary} \right) $$
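
As a rough sketch of the two formulas above, each continuous column can be rescaled to $[0,1]$ and used as a success probability. Scaling column by column is an assumption of this example, as are the stand-in data and the seed; the package may proceed differently.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed for this sketch
X = rng.normal(size=(500, 3))         # stand-in for continuous covariates

# Min-max standardize each column to [0, 1] (column-wise scaling is an assumption)
p_binary = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# One Bernoulli draw per entry, using the scaled value as success probability
X_binary = rng.binomial(n=1, p=p_binary)
```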

### Treatment assignment
There are two ways treatment can be assigned. The first is to simulate a randomized controlled trial, where all observations are assigned treatment with the same probability $m_0$. The second is to assign treatment in the fashion of an observational study, where the assignment of treatment depends on the covariates $X$ and the probability $m_0$ differs between observations. Either way, the assignment vector $D$ is drawn from a Bernoulli distribution with probability $m_0$.

$$ D_{n*1} \sim Bernoulli(m_0) $$

#### Random 
In the random assignment case, only an assignment probability $m_0$ has to be chosen which is then used to draw realizations from a Bernoulli distribution to create the assignment vector $D$.

$$ m_0 \in [0,1] $$
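
In code, the random case reduces to a single Bernoulli draw per observation. The values of `n` and `m_0` below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)   # arbitrary seed for this sketch
n = 10_000
m_0 = 0.5                             # common assignment probability (illustrative)

# D_i = 1 means observation i is treated
D = rng.binomial(n=1, p=m_0, size=n)
```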

#### Dependent on covariates
To obtain the individual probabilities, a subset $Z \subseteq X$ with dimensions $n$ times $l$ is chosen first. Then, $Z$ is multiplied with a weight vector $b$, whose values are drawn from a uniform distribution, resulting in vector $a$. To include some randomness, noise $\eta$ drawn from a uniform distribution is added to $a$ before it is standardized. The standardized version of $a$ is then passed through the standard normal CDF $\Phi$, and the resulting values serve as the assignment probabilities in vector $m_0$.

<br/>

$$ m_{0, n*1} = \Phi\left(\frac{a-\hat{\mu}(a)}{\hat{\sigma}(a)}\right) $$

Where,

$ a_{n*1} = Z * b + \eta, \quad Z_{n*l} \subseteq X_{n*k}, \quad b_{l*1} \sim U(0,1), \quad  \eta_{n*1} \sim U(0,0.25)$

<br/>

<script src="https://gist.github.com/Tobias-K93/4b94f744158ec4ee2411075c2cf66e06.js"></script>

The following histograms show the distributions of propensity scores ($m_0$) in the case of non-random assignment with low, medium, and high average assignment probability. In the trivial case of random assignment, the propensity score is the same for every observation.

<img align="center" width="800" height="350" style="display:block;margin:0 auto;" src="/blog/img/seminar/data_generating_process/propensity_score_plot.png">
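
The covariate-dependent assignment formula can be sketched as follows. The stand-in covariates, the choice of the first $l$ columns as $Z$, and the seed are assumptions of this example; the normal CDF is computed from the error function to avoid further dependencies.

```python
import math
import numpy as np

rng = np.random.default_rng(seed=3)     # arbitrary seed for this sketch
n, k, l = 2000, 10, 4
X = rng.normal(size=(n, k))             # stand-in covariates
Z = X[:, :l]                            # subset Z of X (first l columns, arbitrary choice)

b = rng.uniform(0.0, 1.0, size=l)       # weights b ~ U(0, 1)
eta = rng.uniform(0.0, 0.25, size=n)    # noise eta ~ U(0, 0.25)
a = Z @ b + eta

a_std = (a - a.mean()) / a.std()        # standardize with sample mean and std

# Standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
m_0 = 0.5 * (1.0 + np.vectorize(math.erf)(a_std / math.sqrt(2.0)))

D = rng.binomial(n=1, p=m_0)            # one Bernoulli draw per observation
```

With this formula alone the average propensity is close to 0.5; how the low and high average assignment probabilities in the histograms are obtained is not covered by this sketch.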



### Treatment effects
The package offers six different options for the treatment effect $\theta_0$. These effects are: positive constant, negative constant, positive heterogeneous, negative heterogeneous, no effect, and discrete heterogeneous. When applying these treatment effects, one can choose a single option or apply a mix of several options.

#### Positive & negative constant effect
In the case of a constant effect, the treatment effect is the same for each individual and is thus just a constant $c$, which is either positive or negative depending on the chosen sign.

**Option 1** positive constant: $$ \theta_0 = c $$

**Option 2** negative constant: $$\theta_0 = - \; c$$

#### Positive & negative continuous heterogeneous effect
In contrast to the constant effect, for which each observation has the same treatment effect, the continuous heterogeneous effect differs between individuals within a specified interval. Moreover, the size of the individual treatment effect depends on a subset $Z$ of the covariates $X$. The creation of this treatment effect begins similarly to the treatment assignment, by taking the dot product of the subset $Z$ and the weight vector $b$ already used in the treatment assignment. The result is passed through a sine function, and a normally distributed error term $W$ is added. The resulting $\gamma$ is then min-max-standardized and scaled with a constant $c$. Note that the eventual size of the treatment effect depends on the chosen intensity, which will be explained later on in the application part. If the heterogeneous effect is supposed to be partly or entirely negative, the resulting distribution is shifted downward by the quantile value $q_{neg}$ that corresponds to the desired share of negative effects.

<br/>

$$\gamma =  sin(Z * b) + W$$


$$\theta_0 = \frac{\gamma - min(\gamma)}{max(\gamma) - min(\gamma)}*c - q_{neg}$$

Where,

$Z \subseteq X,\quad  b_{l*1} \stackrel{ind.}{\sim} U(0,1), \quad W \stackrel{ind.}{\sim} N(0,1), \quad q_{neg} \in [0,c], \quad c = size \; adjustment \; parameter$

<br/>

<script src="https://gist.github.com/Tobias-K93/80c8db186625da23d99caebf1bbfc913.js"></script>

The following histogram shows an example where 30% of the continuous heterogeneous treatment effect is negative and 70% is positive. 

<img align="center" width="500" height="400" style="display:block;margin:0 auto;" src="/blog/img/seminar/data_generating_process/pos_neg_heterogeneous_treatment_effect.png">
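
A minimal sketch of the two formulas for the continuous heterogeneous effect is given below. The values of `c` and `negative_share`, the stand-in subset `Z`, and the seed are assumptions for illustration; $q_{neg}$ is taken here as the empirical quantile of the scaled effect at the desired negative share.

```python
import numpy as np

rng = np.random.default_rng(seed=4)     # arbitrary seed for this sketch
n, l = 5000, 4
Z = rng.normal(size=(n, l))             # stand-in for the covariate subset Z
b = rng.uniform(0.0, 1.0, size=l)       # weight vector b ~ U(0, 1)
W = rng.normal(0.0, 1.0, size=n)        # error term W ~ N(0, 1)

gamma = np.sin(Z @ b) + W

c = 2.0                                 # size adjustment parameter (illustrative)
negative_share = 0.3                    # desired fraction of negative effects (illustrative)

# Min-max standardize gamma and scale it to the interval [0, c]
scaled = (gamma - gamma.min()) / (gamma.max() - gamma.min()) * c

# Shift by the quantile that matches the desired negative share
q_neg = np.quantile(scaled, negative_share)
theta_0 = scaled - q_neg
```

Setting `negative_share` to 0 leaves the effect entirely positive, and setting it to 1 makes it entirely non-positive.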


#### No effect
This option allows for treated observations on which the treatment has no effect, i.e. $\theta_0$ is simply 0.

$$\theta_0 = 0$$

#### Discrete heterogeneous treatment effect




### Output variable
#### Continuous
The continuous output variable $Y$ consists of three main parts: the treatment part $\theta_0 D$ as explained above, the possibly non-linearly transformed covariates $X$, and the error term $U$. The transformation $g_0()$ is built in two steps. First, the product of $X$ and the weighting vector $b$ is taken. The values of $b$ are drawn from a skewed beta distribution, which ensures that a few covariates impact the outcome much more than others; this seems more realistic than, e.g., a uniformly distributed impact. Second, there is either no further transformation, i.e. the relation is linear, a partially non-linear transformation which consists of a linear and a non-linear part, or an entirely non-linear transformation. Moreover, the package offers an option to add interaction terms of the form $x_i*x_j$ that are drawn randomly from $X$ and added into $g_0()$. The number of interaction terms $I$ defaults to $\sqrt{k}$ and can be adjusted if desired.

<br/>

$$ Y = \theta_0  D + g_0(X)+ U, \quad E[U|X,D] = 0$$

where,

$g_0(x) \in  \{x, \; \; 2.5*cos(x)^3 + 0.5*x, \; \; 3*cos(x)^3\}, \qquad x \in \{X_{n*k}*b_{k*1}, \quad X_{n*k}*b_{k*1} + X_{in,n*I}*b_{in,I*1} \},$

$b_{k*1} \sim Beta(1,5), \qquad X_{in}*b_{in} = \sum_{i,j=1}^{I} b_{i,1*1} * (x_{i,n*1} \circ x_{j,n*1}), \qquad i,j \in \{i_1,...,i_I\} \sim U\{0,k\}, $

$I = number \; of \; interaction \; terms$

The following code snippet shows a slightly simplified version of how the continuous output variable is implemented with the two examples of a simple linear transformation and a non-linear transformation including interaction terms.

<script src="https://gist.github.com/Tobias-K93/c8a7a18db3149d5762070fca1315db8e.js"></script>


The following scatter plots display the three different types of relation between $X$ and $Y$ that can be chosen. The underlying data was simulated without treatment effects or interaction terms.

<img align="center" width="850" height="350" style="display:block;margin:0 auto;" src="/blog/img/seminar/data_generating_process/y_transformations_plot.png">
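
The three transformations can also be written out directly. The sketch below omits interaction terms and reads $cos(x)^3$ as $(\cos x)^3$; the constant treatment effect, the unit-variance error term, and the stand-in covariates are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=5)     # arbitrary seed for this sketch
n, k = 1000, 10
X = rng.normal(size=(n, k))             # stand-in covariates
b = rng.beta(1.0, 5.0, size=k)          # weights b ~ Beta(1, 5), right-skewed
x = X @ b

# The three choices of g_0; cos(x)^3 is read here as (cos x)^3
g_0 = {
    "linear": lambda v: v,
    "partial_nonlinear": lambda v: 2.5 * np.cos(v) ** 3 + 0.5 * v,
    "nonlinear": lambda v: 3.0 * np.cos(v) ** 3,
}

D = rng.binomial(n=1, p=0.5, size=n)    # random assignment for simplicity
theta_0 = np.full(n, 1.0)               # constant effect c = 1 (illustrative)
U = rng.normal(0.0, 1.0, size=n)        # error term; unit variance is an assumption

# One outcome vector per transformation
Y = {name: theta_0 * D + g(x) + U for name, g in g_0.items()}
```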

#### Binary

$$Y_{n*1} \sim Bernoulli(p)$$

$$ p_{n*1} = \frac{\delta - min(\delta)}{max(\delta) - min(\delta)}*0.8 + 0.1 + \theta_{binary}*D$$

where,

$ \delta = g_0(X) $ as in the continuous case above

$\theta_{binary} = \frac{1}{10} * \theta_0 $
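
The binary outcome formulas can be sketched in the same way. Here $\delta$ is replaced by stand-in values, the treatment effect is an illustrative constant, and the final clipping step is an assumption of this sketch that guards against probabilities outside $[0,1]$ when $\theta_{binary}$ is large.

```python
import numpy as np

rng = np.random.default_rng(seed=6)     # arbitrary seed for this sketch
n = 2000
delta = rng.normal(size=n)              # stand-in for g_0(X)
D = rng.binomial(n=1, p=0.5, size=n)    # random assignment for simplicity
theta_0 = np.full(n, 0.5)               # illustrative constant effect

theta_binary = theta_0 / 10.0

# Base probability in [0.1, 0.9], shifted by the scaled treatment effect
p = (delta - delta.min()) / (delta.max() - delta.min()) * 0.8 + 0.1 + theta_binary * D

# Keep p a valid probability (clipping is an assumption, not taken from the package)
p = np.clip(p, 0.0, 1.0)

Y = rng.binomial(n=1, p=p)
```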