# Generate synthetic data nonparametrically

## Objectives
- Understand the mathematical description of a simulation study for medicine
- Understand the code corresponding to this description
- Adapt the data generating process for your own study

### Description of the data generating process

##### 1. Generate $p$ correlated features, $d$ of which are binary and the others log-normally distributed

Sample an orthogonal matrix $O$ (i.e., a matrix such that $OO^{T} = I$).

Choose eigenvalues for the future variance-covariance matrix of our $p$ features. For example:

$$(\lambda_{1},\lambda_{2},\dots,\lambda_{p})= 1+ 0.2 \times (1,2,\dots,p).$$

Generate the variance-covariance matrix $\Sigma$ for our features as follows:

$$      
\Sigma = O
\begin{bmatrix} 
\lambda_{1} & 0               & \dots  & 0\\
0           & \lambda_{2}     & \ddots & \vdots\\
\vdots      & \ddots          & \ddots & 0\\
0           & \dots           & 0      & \lambda_{p}
\end{bmatrix}
O^{T}
.$$
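As a quick sanity check of this construction, the sketch below builds $\Sigma$ for $p=5$ and confirms that it is symmetric and that its eigenvalues are exactly the chosen $\lambda_i$. The fixed `random_state` is an assumption made only so the check is repeatable; it is not part of the data generating process.

```python
import numpy as np
from scipy.stats import ortho_group

p = 5
# Eigenvalues 1 + 0.2 * (1, ..., p), as in the example above
eigval = 1 + 0.2 * np.arange(1, p + 1)

# Random orthogonal matrix (random_state fixed only for this illustration)
O = ortho_group.rvs(dim=p, random_state=0)

# Sigma = O diag(lambda) O^T
Sigma = O @ np.diag(eigval) @ O.T

# O is orthogonal, so O O^T should be the identity
is_orthogonal = np.allclose(O @ O.T, np.eye(p))
# Sigma should be symmetric ...
is_symmetric = np.allclose(Sigma, Sigma.T)
# ... and its eigenvalues (returned in ascending order) should match eigval
recovered = np.linalg.eigvalsh(Sigma)

print(is_orthogonal, is_symmetric, np.allclose(recovered, eigval))
```

Because all eigenvalues are strictly positive, $\Sigma$ is positive definite and therefore a valid covariance matrix.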

Sample our features from a multivariate normal distribution:

$$    \left(X_{1}', X_{2}', \ldots,  X_{p}' \right)^{T} \sim \mathcal{N}(0,\,\Sigma).$$

Binarize the first $d$ features and exponentiate the remaining features so that they are log-normally distributed:

$$(X_{1},\dots,X_{d}) =\big(\text{I}\{X_{1}'>0\},\dots,\text{I}\{X_{d}'>0\}\big),$$
$$(X_{d+1},\dots,X_{p}) =\big(\exp(X_{d+1}'),\dots,\exp(X_{p}')\big).$$
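The transformation can be illustrated on its own. The sketch below assumes an identity covariance matrix and a seeded generator purely for simplicity (both are illustrative choices, not the settings used later). Since each $X_j'$ has mean zero, every binary feature should have a prevalence close to 0.5, and the logarithm of each remaining feature should again look normal with mean close to 0.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the illustration is repeatable
n, p, d = 100_000, 5, 2
Sigma = np.eye(p)                # assumption: identity covariance, for simplicity

# Latent Gaussian features X'
X_prime = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

X = X_prime.copy()
# First d features: indicator that the latent feature is positive
X[:, :d] = (X_prime[:, :d] > 0).astype(float)
# Remaining features: exponentiate, which makes them log-normal
X[:, d:] = np.exp(X_prime[:, d:])

binary_means = X[:, :d].mean(axis=0)        # prevalence of each binary feature
log_means = np.log(X[:, d:]).mean(axis=0)   # mean of the log of each positive feature

print(binary_means, log_means)
```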

##### 2. Generate treatment $T$

$$T|X \sim  \text{Bernoulli}(e(X))$$
where $e(x)=\mathbb{E}(T|X=x)$ denotes a propensity score function.
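A minimal sketch of this step, assuming a hypothetical logistic propensity score `e` on three standard normal features (neither is the choice used later in this notebook): each unit receives treatment with its own probability $e(X)$, so the overall treated fraction should be close to the average propensity score.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)   # seeded only so the illustration is repeatable
n = 100_000
X = rng.normal(size=(n, 3))      # stand-in features for this illustration

def e(X):
    # Hypothetical propensity score: logistic in the first two features
    return expit(X[:, 0] - 0.5 * X[:, 1])

ps = e(X)
# One Bernoulli draw per unit, each with its own success probability
T = rng.binomial(1, ps)

print(T.mean(), ps.mean())
```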

##### 3. Generate a continuous outcome $Y$

$$Y|X,T \sim \mathcal{N}\big(b(X)+T\tau(X),\ \sigma^2\big)$$
where $b(x)=\mathbb{E}(Y|T=0,X=x)$ denotes a baseline risk function, $\tau(x)=\mathbb{E}(Y|T=1,X=x)-\mathbb{E}(Y|T=0,X=x)$ denotes the conditional average treatment effect (CATE) function, and $\sigma$ is the standard deviation of the Gaussian noise added to the outcome.
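To make the roles of $b$, $\tau$ and $\sigma$ concrete, here is a small sketch under illustrative assumptions: a randomized treatment with $e(x)=0.5$, and hypothetical functions `b` and `tau` that are not the ones used later. With randomized treatment, the difference in mean outcomes between arms should approximate the average of $\tau(X)$. Note that `normal` takes the standard deviation $\sigma$, not the variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the illustration is repeatable
n = 200_000
X = rng.normal(size=(n, 2))      # stand-in features for this illustration

def b(X):
    # Hypothetical baseline function
    return X[:, 0]

def tau(X):
    # Hypothetical CATE function, with average value 1
    return 1 + 0.5 * X[:, 1]

sigma = 1.0                           # standard deviation of the noise
T = rng.binomial(1, 0.5, size=n)      # assumption: randomized treatment
Y = b(X) + T * tau(X) + rng.normal(0, sigma, size=n)

# Difference in means between treated and control units
ate_hat = Y[T == 1].mean() - Y[T == 0].mean()
# Sample average of the true CATE
ate_true = tau(X).mean()

print(ate_hat, ate_true)
```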

##### 4. Generate a binary outcome $Y_{bin}$

$$Y_{bin}|X,T \sim \text{Bernoulli}\big(\text{expit}(b(X)+T\tau(X))\big)$$
where $\text{expit}(x)=\frac{1}{1+\exp(-x)}.$
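One consequence of this specification is worth seeing numerically: for the binary outcome, $b$ and $\tau$ act on the logit scale, so the effect on the risk scale depends on the baseline even when $\tau$ is constant. The sketch below uses hypothetical values of $b(x)$ and a hypothetical constant $\tau$ chosen only for illustration.

```python
import numpy as np
from scipy.special import expit

b = np.array([-2.0, 0.0, 2.0])   # hypothetical baseline values b(x)
tau = 1.0                        # hypothetical constant tau(x)

risk_control = expit(b)          # P(Y_bin = 1 | T = 0, X = x)
risk_treated = expit(b + tau)    # P(Y_bin = 1 | T = 1, X = x)

# Effect on the risk scale: differs across baselines although tau is constant
risk_difference = risk_treated - risk_control

print(np.round(risk_difference, 3))
```

The risk difference is largest for the middle baseline and shrinks as the control risk approaches 0 or 1.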

### Programming the data generating process

In [1]:
import numpy as np
from scipy.stats import ortho_group
from scipy.special import expit

In [2]:
def gen_data(ps_fun,
             x_to_y_con,
             x_to_y_bin,
             n = 1000,
             p = 5,
             d = 2,
             eigval = None,
             sig_sq = 1):

    """ Generate data for a simulation study

    Parameters
    ----------
    n : int
        Number of data points
    p : int
        Number of features
    d : int
        Number of discrete features
    eigval : array of shape (p,)
        Eigenvalues of the covariance matrix of the features
    ps_fun : function
        function that maps features to P(treatment|features)
    x_to_y_con : function
        function that maps features and treatment to the conditional mean of a continuous outcome
        ie. E[Y_con|X, T]
    sig_sq : float
        Variance parameter determining noise added to the continuous outcome
    x_to_y_bin : function
        function that maps features and treatment to the conditional mean of a binary outcome
        ie. E[Y_bin|X, T]

    Returns
    -------
    X : array of shape (n, p)
        Features matrix with d discrete features and p-d log-normally distributed features
    T : array of shape (n,)
        Treatment
    Y_con : array of shape (n,)
        Continuous outcome
    Y_bin : array of shape (n,)
        Binary outcome

    Example
    -------
    X, T, Y_con, Y_bin = gen_data(ps_fun = ps_fun,
                                  x_to_y_con = x_to_y_con,
                                  x_to_y_bin = x_to_y_bin,
                                  n = 1000,
                                  p = 5,
                                  d = 2,
                                  eigval = None,
                                  sig_sq = 1)
"""

    if eigval is None:
        # Generate eigenvalues for the covariance matrix
        eigval =  1 + np.arange(1, p + 1) * .2 

    # sample an orthogonal matrix
    O = ortho_group.rvs(dim=len(eigval))

    # Create a variance-covariance matrix
    Sigma = O.dot(np.diag(eigval)).dot(O.T)

    # Sample from a multivariate normal distribution with mean 0 and covariance matrix O diag(eigval) O^T
    X = np.random.multivariate_normal(np.zeros(len(eigval)), Sigma, size=n)

    # Discretize the first d features
    X[:, 0:d] = (X[:, 0:d] > 0).astype(int)

    # Exponentiate the remaining features for log-normality
    X[:, d:] = np.exp(X[:, d:])

    # Generate the treatment
    T = np.random.binomial(1, ps_fun(X))

    # Generate a continuous outcome
    # (np.random.normal takes a standard deviation, so convert the variance sig_sq)
    Y_con = x_to_y_con(X, T) + np.random.normal(0, np.sqrt(sig_sq), size=len(X))

    # Generate a binary outcome
    Y_bin = np.random.binomial(1, x_to_y_bin(X, T))

    return X, T, Y_con, Y_bin

Modify the following functions according to taste.

In [3]:
def base_fun(X):
    # baseline risk function
    # here, baseline risk depends on all features
    mask_mat = np.ones((X.shape[0], X.shape[1]))
    mask_mat[:, ::2] = -1

    return np.log(1 + np.exp((mask_mat * X).sum(axis=1)))

def cate_fun(X):
    # conditional average treatment effect function
    # here, the treatment response depends only on the fourth feature (index X3)
    return np.log(1+np.exp(X[:,3]))

def ps_fun(X, rct=False):
    # propensity score function
    if rct:
        # constant ps
        ps = np.array([.5] * len(X))
    else:
        # propensity score depends on all features
        mask_mat = np.ones((X.shape[0], X.shape[1]))
        mask_mat[:, ::2] = -1 
        ps = 1 / (1 + np.exp(-(mask_mat * X).sum(axis=1)))
    return ps

Do not modify the functions below; there is no need to, as they only call the functions we specified above.

In [4]:
def x_to_y_con(X, T):
    # map X, T to E[Y_con|X, T] (edit functions above not this one)
    # mean continuous outcome is the sum of the baseline risk and the treatment response
    return base_fun(X) + T * cate_fun(X)

def x_to_y_bin(X, T):
    # map X, T to E[Y_bin|X, T] (edit functions above not this one)
    # the risk is expit of the sum of the baseline risk and the treatment response
    risk = expit(base_fun(X) + T * cate_fun(X))
    return risk

### Generate data and put it into a pandas dataframe

In [9]:
import pandas as pd

# Generate data
X, T, Y_con, Y_bin = gen_data(ps_fun = ps_fun,
                              x_to_y_con = x_to_y_con,
                              x_to_y_bin = x_to_y_bin,
                              n = 10000,
                              p = 5,
                              d = 3,
                              eigval = None,
                              sig_sq = 1)

# Put the features into a dictionary 
dict_df = {f"X{i}":X[:,i] for i in range(X.shape[1])}

# Add the treatment and outcomes to the dictionary
dict_df.update({"T":T, "Y_con":Y_con, "Y_bin":Y_bin})

# Create a pandas dataframe
df = pd.DataFrame(dict_df)

# Look at the first 10 rows
df.head(10)

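| | X0 | X1 | X2 | X3 | X4 | T | Y_con | Y_bin |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 1.141308 | 1.108073 | 1 | 2.602736 | 1 |
| 1 | 1.0 | 1.0 | 1.0 | 1.473029 | 5.148575 | 0 | -0.814245 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.404469 | 1.098611 | 1 | 1.451034 | 1 |
| 3 | 0.0 | 0.0 | 1.0 | 0.200296 | 0.158181 | 0 | 0.799718 | 1 |
| 4 | 0.0 | 0.0 | 1.0 | 5.059064 | 5.548925 | 1 | 5.738407 | 1 |
| 5 | 0.0 | 1.0 | 0.0 | 2.709216 | 0.196366 | 1 | 5.20683 | 1 |
| 6 | 0.0 | 0.0 | 0.0 | 1.910389 | 6.522437 | 0 | 0.881251 | 0 |
| 7 | 0.0 | 1.0 | 1.0 | 0.730344 | 0.890449 | 0 | 0.397045 | 1 |
| 8 | 1.0 | 1.0 | 1.0 | 1.270817 | 0.068375 | 1 | 1.62942 | 1 |
| 9 | 1.0 | 1.0 | 0.0 | 0.326774 | 3.447467 | 0 | -0.33227 | 1 |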


Note that for your own study, you will need to set a seed for reproducibility.
Before getting there, I recommend *not* using a seed, so that you can get a sense of the variability the code produces from one run to the next.
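When you do reach that point, one possible approach is sketched below. It relies on the fact that `np.random.multivariate_normal`, `np.random.binomial`, `np.random.normal` and, by default, `ortho_group.rvs` all draw from NumPy's global random state, so calling `np.random.seed` before the data generation fixes every draw. The helper `draw_features` is a hypothetical miniature of the feature-sampling step, and the seed value is arbitrary.

```python
import numpy as np
from scipy.stats import ortho_group

def draw_features(n=5, p=3):
    # Hypothetical miniature of the feature-sampling step of gen_data
    eigval = 1 + 0.2 * np.arange(1, p + 1)
    O = ortho_group.rvs(dim=p)            # uses NumPy's global random state by default
    Sigma = O @ np.diag(eigval) @ O.T
    return np.random.multivariate_normal(np.zeros(p), Sigma, size=n)

np.random.seed(2024)       # arbitrary seed value
first = draw_features()

np.random.seed(2024)       # same seed: same draws
second = draw_features()

third = draw_features()    # no reseeding: a fresh, different draw

print(np.allclose(first, second), np.allclose(second, third))
```

In the notebook itself, the equivalent would be to call `np.random.seed(...)` once, immediately before `gen_data(...)`.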

## References

This data generating process is inspired by:
- For the generation of clinical features: <br>
*[Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.](https://academic.oup.com/jamia/article-abstract/27/8/1244/5867235?redirectedFrom=fulltext)* R Bey et al. JAMIA 2020.
- For the nonparametric baseline, CATE, and propensity score functions: <br>
*[Quasi-oracle estimation of heterogeneous treatment effects.](https://academic.oup.com/biomet/article-abstract/108/2/299/5911092)* X Nie and S Wager. Biometrika 2021. 