# Data Generating Process Simulation

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Generating-Process-Simulation" data-toc-modified-id="Data-Generating-Process-Simulation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Generating Process Simulation</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Literature" data-toc-modified-id="Literature-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Literature</a></span></li><li><span><a href="#Package-Content" data-toc-modified-id="Package-Content-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Package Content</a></span><ul class="toc-item"><li><span><a href="#Covariates-generation" data-toc-modified-id="Covariates-generation-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Covariates generation</a></span><ul class="toc-item"><li><span><a href="#Continuous-covariates" data-toc-modified-id="Continuous-covariates-1.3.1.1"><span class="toc-item-num">1.3.1.1&nbsp;&nbsp;</span>Continuous covariates</a></span></li><li><span><a href="#Binary-and-categorical-covariates" data-toc-modified-id="Binary-and-categorical-covariates-1.3.1.2"><span class="toc-item-num">1.3.1.2&nbsp;&nbsp;</span>Binary and categorical covariates</a></span></li></ul></li></ul></li><li><span><a href="#Treatment-Assignment" data-toc-modified-id="Treatment-Assignment-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Treatment Assignment</a></span></li><li><span><a href="#Treatment-effect" data-toc-modified-id="Treatment-effect-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Treatment effect</a></span></li><li><span><a href="#Composition-of-dependent-variable" data-toc-modified-id="Composition-of-dependent-variable-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Composition of dependent variable</a></span></li><li><span><a href="#Application-of-module/package" 
data-toc-modified-id="Application-of-module/package-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Application of module/package</a></span></li><li><span><a href="#Distribution-of-propensity-scores-according-to-treatment-assignment" data-toc-modified-id="Distribution-of-propensity-scores-according-to-treatment-assignment-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Distribution of propensity scores according to treatment assignment</a></span></li><li><span><a href="#Treatment-effect-options" data-toc-modified-id="Treatment-effect-options-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Treatment effect options</a></span></li><li><span><a href="#Customized-treatment-distribution" data-toc-modified-id="Customized-treatment-distribution-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Customized treatment distribution</a></span></li><li><span><a href="#Outputs-depending-on-treatment-assignment" data-toc-modified-id="Outputs-depending-on-treatment-assignment-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Outputs depending on treatment assignment</a></span></li><li><span><a href="#Average-Treatment-Effect" data-toc-modified-id="Average-Treatment-Effect-1.12"><span class="toc-item-num">1.12&nbsp;&nbsp;</span>Average Treatment Effect</a></span></li></ul></li></ul></div>

## Introduction

## Literature

## Package Content

The following sections give a step-by-step explanation of how our data simulation package works internally, accompanied by the corresponding formulas, code snippets, and explanatory graphs.

### Covariates generation

#### Continuous covariates

The covariate matrix **X** in our simulation is drawn from a multivariate normal distribution with mean 0 and a specified covariance matrix Sigma. Sigma is constructed as follows. First, the entries of a matrix A are drawn from a uniform distribution. Second, to make sure that negative correlations exist and that not all variables are highly correlated with each other, we create an overlay matrix B whose entries are 1 or -1. Third, we multiply the two matrices element-wise and scale the result by the correction term $\frac{10}{k}$ so that the values in Sigma do not increase with k. The result is the matrix $\Lambda$. Finally, we calculate Sigma by multiplying $\Lambda$ with its transpose, which ensures that Sigma is symmetric and positive semi-definite (and positive definite whenever $\Lambda$ has full rank).

<br>

$$ X_{n \times k} \sim N_k(0,\Sigma)$$

where

$n = \textrm{number of observations}, \quad k = \textrm{number of covariates}$

$\Sigma = \Lambda \Lambda^T, \quad \Lambda = \frac{10}{k} (A \circ B), \quad A_{ij} \sim U(0,1), \quad B_{ij} \in \{-1,1\} \textrm{ each with probability } 0.5$

The matrices $A$, $B$, and $\Lambda$ are all of dimension $k \times k$.

<br>

<script src="https://gist.github.com/Tobias-K93/f550c942f3ceea379271c9d89913fac7.js"></script>

The following heat map shows an example of the correlations implied by Sigma with k = 10 covariates. Correlations typically range between -0.7 and 0.7, with the minimum and maximum varying slightly with the chosen random seed.

<img align="center" width="660" height="500" style="display:block;margin:0 auto;" src="/blog/img/seminar/data_generating_process/covariates_correlation.png">
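
The construction of Sigma described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the package's own code (which is embedded in the gist above): the helper name `make_sigma`, the seed, and the sample size are hypothetical choices made for this example.

```python
import numpy as np

def make_sigma(k, rng):
    """Build a k x k covariance matrix as Lambda @ Lambda.T (hypothetical helper)."""
    A = rng.uniform(0.0, 1.0, size=(k, k))      # entries of A drawn from U(0, 1)
    B = rng.choice([-1.0, 1.0], size=(k, k))    # sign overlay, each sign with probability 0.5
    Lam = (10.0 / k) * (A * B)                  # element-wise product scaled by 10/k
    return Lam @ Lam.T                          # symmetric and positive semi-definite

rng = np.random.default_rng(0)                  # seed chosen arbitrarily
n, k = 1000, 10
sigma = make_sigma(k, rng)

# Draw the covariate matrix X from N_k(0, Sigma)
X = rng.multivariate_normal(np.zeros(k), sigma, size=n)

# Correlation matrix implied by Sigma, i.e. what a heat map would display
std = np.sqrt(np.diag(sigma))
corr = sigma / np.outer(std, std)
print(X.shape)
```

Multiplying $\Lambda$ with its transpose is what makes the result usable as a covariance matrix: for any vector $v$, $v^T \Lambda \Lambda^T v = \lVert \Lambda^T v \rVert^2 \geq 0$.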

#### Binary and categorical covariates
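Binary and categorical covariates are created from the continuous covariates in X. For binary covariates, the continuous values are first rescaled to the interval [0, 1] by min-max standardization and then used as success probabilities of Bernoulli draws.

$$ probability_{binary} = \textrm{min-max-standardize}(X) $$

$$X_{binary} \sim Ber(probability_{binary})$$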
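
As a rough sketch of the two formulas above, the binary covariates could be generated as follows. The column-wise standardization, the helper name `min_max_standardize`, and the stand-in matrix `X` are assumptions made for this example; the package may standardize differently.

```python
import numpy as np

rng = np.random.default_rng(1)          # seed chosen arbitrarily
X = rng.normal(size=(1000, 3))          # stand-in for continuous covariates

def min_max_standardize(x):
    """Rescale each column to [0, 1] (assumption: column-wise scaling)."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

# Success probabilities, one per entry of X
probability_binary = min_max_standardize(X)

# One Bernoulli draw per entry with the corresponding probability
X_binary = rng.binomial(1, probability_binary)
```

Large values of a continuous covariate thus translate into a high probability of observing a 1 in the corresponding binary covariate.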

The simulated data are built around the following model:

$$ \begin{aligned} Y &= \theta_0 D + g_0(X) + U, & E[U|X,D] &= 0 \\
   D &= m_0(X) + V, & E[V|X] &= 0 \\
   \theta_0 &= t_0(Z) + W, & E[W|Z] &= 0, \quad Z \subseteq X
   \end{aligned}$$


Y - Outcome variable

$\theta_0$ - True treatment effect

D - Treatment dummy

$X_{n \times k}$ - Covariates


## Treatment Assignment
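##### Random
Choose a constant assignment probability $m_0$ that applies to all observations.

##### Conditioned on covariates
Weight vector $b_{k \times 1} \stackrel{ind.}{\sim} U(0,1), \qquad a_{n \times 1} = X_{n \times k} \, b_{k \times 1}$

$$m_0(X) = \Phi\left(\frac{a-\hat{\mu}(a)}{\hat{\sigma}(a)}\right) $$

where $\Phi$ is the standard normal CDF and $\hat{\mu}(a)$ and $\hat{\sigma}(a)$ are the sample mean and standard deviation of $a$.

##### Create assignment vector
$D \stackrel{ind.}{\sim} Bernoulli(m_0)$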
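
Both assignment mechanisms can be illustrated with a short NumPy sketch. The stand-in covariates, the seed, and the helper `standard_normal_cdf` are assumptions for this example and not part of the package.

```python
import math
import numpy as np

rng = np.random.default_rng(2)                 # seed chosen arbitrarily
n, k = 1000, 10
X = rng.normal(size=(n, k))                    # stand-in for the simulated covariates

def standard_normal_cdf(z):
    """Phi(z), computed from the error function (hypothetical helper)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

# Random assignment: the same probability for every observation
m0_random = np.full(n, 0.5)

# Assignment conditioned on covariates: standardize a = X b, then apply Phi
b = rng.uniform(0.0, 1.0, size=k)
a = X @ b
m0_conditioned = standard_normal_cdf((a - a.mean()) / a.std())

# Assignment vectors: one Bernoulli draw per observation
D_random = rng.binomial(1, m0_random)
D_conditioned = rng.binomial(1, m0_conditioned)
```

Standardizing $a$ before applying $\Phi$ centers the propensity scores around 0.5, so roughly half of the observations end up treated in both cases.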


## Treatment effect

###### Option 1: Constant
$\theta_0 = c$

###### Option 2: Continuous heterogeneous effect

$Z \subseteq X$, weight vector $b_{k \times 1} \stackrel{ind.}{\sim} U(0,1)$, $\quad W \stackrel{ind.}{\sim} N(0,1)$

$$\gamma = \sin(Z b)^2 + W$$


$$\theta_0 = \frac{\gamma - \min(\gamma)}{\max(\gamma) - \min(\gamma)}(0.3 - 0.1)$$

###### Option 3: Negative 

$\theta_0 \stackrel{ind.}{\sim} U(-0.3,0)$

###### Option 4: No treatment effect

$\theta_0 = 0$
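
The four options can be written down directly from the formulas above. The following is a minimal sketch, assuming that $Z$ consists of the first five columns of a stand-in covariate matrix, that $b$ has one entry per column of $Z$, and that the constant is $c = 0.2$; all function names are hypothetical and how the package combines the options is not shown here.

```python
import numpy as np

rng = np.random.default_rng(3)                 # seed chosen arbitrarily
n, k = 1000, 10
X = rng.normal(size=(n, k))                    # stand-in for the simulated covariates

def constant_effect(n, c=0.2):
    """Option 1: the same effect c for every observation (c is an assumption)."""
    return np.full(n, c)

def heterogeneous_effect(Z, rng):
    """Option 2: gamma = sin(Z b)^2 + W, min-max scaled and multiplied by (0.3 - 0.1)."""
    b = rng.uniform(0.0, 1.0, size=Z.shape[1])
    W = rng.normal(0.0, 1.0, size=Z.shape[0])
    gamma = np.sin(Z @ b) ** 2 + W
    scaled = (gamma - gamma.min()) / (gamma.max() - gamma.min())
    return scaled * (0.3 - 0.1)                # as written, values lie in [0, 0.2]

def negative_effect(n, rng):
    """Option 3: effects drawn from U(-0.3, 0)."""
    return rng.uniform(-0.3, 0.0, size=n)

def no_effect(n):
    """Option 4: zero effect for every observation."""
    return np.zeros(n)

Z = X[:, :5]                                   # assumption: Z is the first five covariates
theta_constant = constant_effect(n)
theta_heterogeneous = heterogeneous_effect(Z, rng)
theta_negative = negative_effect(n, rng)
theta_none = no_effect(n)
```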


## Composition of dependent variable 

###### Non-linearity:
$$g_0(X) = \sin(X b)^2$$

###### Option 1: Continuous 
$$Y_i  = \theta_{0,i}  D_i + g_0(X_i)+ U_i$$

###### Option 2: Binary
$$p_i = \frac{Y_i - \min(Y)}{\max(Y) - \min(Y)}(0.9 - 0.1)$$

<br/>

$$Y_{binary,i} \stackrel{ind.}{\sim} Bernoulli(p_i)$$
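
To make the composition of the outcome concrete, here is a minimal sketch of both output types. It assumes standard normal noise $U$ that enters once, in the outcome equation, and uses a constant treatment effect with random assignment as stand-ins; these choices and all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)                 # seed chosen arbitrarily
n, k = 1000, 10
X = rng.normal(size=(n, k))                    # stand-in for the simulated covariates
D = rng.binomial(1, 0.5, size=n)               # stand-in: random treatment assignment
theta = np.full(n, 0.2)                        # stand-in: constant treatment effect

# Non-linear function of the covariates
b = rng.uniform(0.0, 1.0, size=k)
g0 = np.sin(X @ b) ** 2

# Option 1: continuous outcome (assumption: U is standard normal)
U = rng.normal(0.0, 1.0, size=n)
y_continuous = theta * D + g0 + U

# Option 2: binary outcome, using the rescaled continuous outcome as probability
y_range = y_continuous.max() - y_continuous.min()
p = (y_continuous - y_continuous.min()) / y_range * (0.9 - 0.1)
y_binary = rng.binomial(1, p)
```

Because the probabilities are derived from the continuous outcome, treated observations with a positive effect are more likely to receive a 1 in the binary case as well.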

## Application of module/package 

    import numpy as np
    from SimulateData import UserInterface

    # Simulate N = 10000 observations with k = 10 covariates
    u = UserInterface(N=10000, k=10, seed=12)

    # treatment_option_weights: default None,
    # order is [constant, heterogeneous, negative, no effect]
    u.generate_treatment(random_assignment=True,
                         assignment_prob=0.5,
                         constant=True,
                         heterogeneous=False,
                         negative=False,
                         no_treatment=False,
                         treatment_option_weights=[0, 0.7, 0.1, 0.2])

    y, X, assignment_vector, treatment_effect = u.output_data(binary=False)

    print('Shapes y, X, assignment, treatment: '
          + str([np.shape(variable) for variable in [y, X, assignment_vector, treatment_effect]]))

Output:

    Shapes y, X, assignment, treatment: [(10000,), (10000, 10), (10000,), (10000,)]


To compare the two assignment mechanisms, the propensity scores are stored once under assignment conditioned on covariates and once under random assignment:

    from plot_functions import (propensity_score_plt, all_treatment_effect_plt,
                                single_treatment_effect_plt, output_difference_plt,
                                avg_treatment_effect_plt)

    # Treatment assignment conditioned on covariates
    u = UserInterface(10000, 10, seed=5)
    u.generate_treatment(random_assignment=False, treatment_option_weights=[1, 0, 0, 0])
    y, X, assignment, treatment = u.output_data()

    prop_score_conditioned = u.s.propensity_score

    # Random treatment assignment with probability 0.5
    u = UserInterface(10000, 10, seed=5)
    u.generate_treatment(random_assignment=True, assignment_prob=0.5, treatment_option_weights=[1, 0, 0, 0])
    y, X, assignment, treatment = u.output_data()

    prop_score_random = u.s.propensity_score

## Distribution of propensity scores according to treatment assignment

    import numpy as np

    # Treatment effect plots: generate each treatment effect option on its own
    treatment_list = []
    assignment_list = []

    for i in range(4):
        # Put all weight on option i of [constant, heterogeneous, negative, no effect]
        treatment_option_weights = np.zeros(4)
        treatment_option_weights[i] = 1

        u = UserInterface(10000, 10, seed=123)
        u.generate_treatment(random_assignment=True, treatment_option_weights=treatment_option_weights)
        y, X, assignment, treatment = u.output_data(binary=False)

        treatment_list.append(treatment)
        assignment_list.append(assignment)

## Treatment effect options 

    # Realistic case: a mix of treatment effect options
    u = UserInterface(10000, 10, seed=23)
    u.generate_treatment(treatment_option_weights=[0, 0.7, 0.1, 0.2])
    y, X, assignment, treatment = u.output_data(binary=False)

## Customized treatment distribution

## Outputs depending on treatment assignment 

Continuous outcome, split into treated and untreated observations:

    u = UserInterface(10000, 10, seed=7)
    u.generate_treatment(random_assignment=True, treatment_option_weights=[0, 1, 0, 0])
    y, X, assignment, treatment = u.output_data(binary=False)

    y_treated = y[assignment == 1]
    y_not_treated = y[assignment == 0]

Binary outcome, split into treated and untreated observations:

    u = UserInterface(10000, 10, seed=15)
    u.generate_treatment(random_assignment=True, treatment_option_weights=[0, 1, 0, 0])
    y, X, assignment, treatment = u.output_data(binary=True)

    y_treated = y[assignment == 1]
    y_not_treated = y[assignment == 0]

## Average Treatment Effect 