# Chapter 10 Design
     
 

## 10.1 Randomized experiment

### 10.1.1 Simple randomized experiments

Suppose that we want to study whether treatment A causes any change in the response. We know from the potential outcome framework that independence between the treatment assignment and the potential outcomes is key to identifying the average treatment (causal) effect. Randomization guarantees this independence by design.

A simple randomized experiment assigns treatments in one of two ways.
1. Bernoulli randomization: for each subject $i=1,\ldots, n$, independently set $Z_i=1$ with probability $p$ and $Z_i=0$ otherwise.
2. Complete randomization: randomly allocate the $n$ subjects into two groups with $n/2$ subjects in each group.
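
The two assignment mechanisms listed above can be sketched in R as follows. The values of `n` and `p` are illustrative assumptions, not values taken from the text.

```r
# Sketch of two assignment mechanisms; n and p are illustrative choices.
set.seed(1)
n <- 20
p <- 0.5

# Mechanism 1: each subject is treated independently with probability p.
Z.bernoulli <- rbinom(n, size = 1, prob = p)

# Mechanism 2: exactly n/2 subjects are treated, chosen uniformly at random.
Z.complete <- rep(0, n)
Z.complete[sample(1:n, size = n / 2, replace = FALSE)] <- 1

table(Z.bernoulli)  # group sizes vary from run to run
table(Z.complete)   # group sizes are fixed at n/2 each
```

Under the first mechanism the group sizes are random, while the second fixes them in advance.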

Define the unit-level treatment effect between treatment levels $j$ and $j'$ as $\tau_i(j,j')=Y_i(j)-Y_i(j')$. 

The population-level treatment effect is 
\[
\tau(j,j')=N^{-1} \sum_{i=1}^N \tau_i (j,j') = N^{-1} \sum_{i=1}^N \{ Y_i(j)-Y_i(j')\} \equiv \bar{Y}(j)-\bar{Y}(j'). 
\]

**Fisher's sharp null hypothesis** of zero individual treatment effects
\[
H_{0F}: Y_i(1) = \cdots = Y_i(J) \ (i=1,\ldots, N).
\]
This leads to the permutation test.
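
A minimal sketch of such a permutation (randomization) test is given below for the special case of two arms under complete randomization. The simulated outcomes, the difference-in-means statistic, and the number of re-randomizations are all assumptions made for illustration.

```r
# Sketch of a randomization test for the sharp null with two arms.
# The data below are simulated purely for illustration.
set.seed(2)
n <- 20
Z <- rep(0, n)
Z[sample(1:n, size = n / 2)] <- 1
Y <- rnorm(n) + 1 * Z  # hypothetical observed outcomes

# Test statistic: difference in means between the two arms.
diff.means <- function(y, z) mean(y[z == 1]) - mean(y[z == 0])
obs.stat <- diff.means(Y, Z)

# Under the sharp null, the observed Y does not depend on the assignment,
# so the null distribution is obtained by re-randomizing Z with Y held fixed.
perm.stat <- replicate(2000, diff.means(Y, sample(Z)))

# Two-sided Monte Carlo p-value.
p.value <- mean(abs(perm.stat) >= abs(obs.stat))
p.value
```

Because only the assignment vector is permuted, the reference distribution mirrors the randomization actually used in the experiment.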

**Neyman's null hypothesis**: no average treatment effects. 
\[
H_{0N}: \bar{Y}(1)=\cdots = \bar{Y}(J).
\]
This is a weaker restriction on the potential outcomes than $H_{0F}$. 

_Claim_: When it is used to test Neyman's null $H_{0N}$, Fisher's randomization test can fail to control the type I error in unbalanced experiments. 




### 10.1.2 Stratified randomized experiment

This is also known as a randomized block design. Blocking can improve efficiency and is sometimes required for feasibility. Stratification during design can be seen as covariate adjustment in a randomized experiment. 

Recall that ${\rm ACE}\equiv \mathbb{E}[Y(1)-Y(0)]$ and $\hat{\rm ACE}=\bar{Y}_1 - \bar{Y}_0$. Let $X_i$ be the centered covariates with $\mathbb{E}[X_i]=0$, and let $\bar{X}_z =N_z^{-1} \sum_{i=1}^{N} X_i 1[Z_i=z]$ be the covariate mean in arm $z$, where $N_z$ is the number of subjects with $Z_i=z$. Finally, we assume that $Z_i \perp X_i$, so that $\mathbb{E}[\bar{X}_z | Z]=0$. Consider a new estimator 
\[
\hat{\rm ACE}(\gamma_1, \gamma_0) = \big(\bar{Y}_1 - \gamma_1^T \bar{X}_1 \big) - \big(\bar{Y}_0 -\gamma_0^T \bar{X}_0\big).
\]
We can verify that $\mathbb{E}\big[\hat{\rm ACE}(\gamma_1, \gamma_0)\big]={\rm ACE}$ for any fixed $\gamma_1$ and $\gamma_0$, since each correction term $\gamma_z^T \bar{X}_z$ has mean zero. The new estimator is therefore unbiased, and our task is to find the $\gamma_0$ and $\gamma_1$ that minimize its variance. We can use the linear projection idea that we discussed in the Causal inference chapter. 
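
The adjusted estimator can be sketched in R for a scalar covariate. The plug-in choice used here, taking $\gamma_z$ to be the least-squares slope of $Y$ on $X$ within arm $z$, is an assumption made for illustration rather than a result derived in the text, and the helper name `ace.adj` and the simulated data are hypothetical. With estimated slopes, the unbiasedness stated for fixed $\gamma_1$ and $\gamma_0$ holds only approximately.

```r
# Sketch of the covariate-adjusted estimator with a scalar covariate.
# Data are simulated for illustration; the true ACE is set to 4.
set.seed(3)
n <- 200
X <- rnorm(n)
X <- X - mean(X)                      # centered covariate
Z <- rep(0, n)
Z[sample(1:n, size = n / 2)] <- 1     # complete randomization
Y <- 4 * Z + 2 * X + rnorm(n)

# Hypothetical helper: the adjusted estimator for given gamma1 and gamma0.
ace.adj <- function(Y, Z, X, gamma1, gamma0) {
  (mean(Y[Z == 1]) - gamma1 * mean(X[Z == 1])) -
    (mean(Y[Z == 0]) - gamma0 * mean(X[Z == 0]))
}

# Assumed plug-in choice: within-arm least-squares slopes.
gamma1.hat <- unname(coef(lm(Y ~ X, subset = Z == 1))[2])
gamma0.hat <- unname(coef(lm(Y ~ X, subset = Z == 0))[2])

ace.adj(Y, Z, X, 0, 0)                    # reduces to the difference in means
ace.adj(Y, Z, X, gamma1.hat, gamma0.hat)  # adjusted estimate
```

Setting both coefficients to zero recovers $\bar{Y}_1 - \bar{Y}_0$, so the unadjusted estimator is a special case of this family.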




In [None]:

# A simple simulation for stratification 
set.seed(10928)
# Data generating mechanism: 
n=40;n.strata=10;
X= sample(x=(1:n.strata),size=n,replace=TRUE);
ACE=4; coef.X=2; 
Y.1=ACE+coef.X*X+rnorm(n,mean=0,sd=1); # potential outcome
Y.0=coef.X*X+rnorm(n,mean=0,sd=1); # potential outcome 
trt= sample(1:n,size=(n/2),replace=FALSE);Z=rep(0,n);Z[trt]=1; # randomization

Z.s=rep(0,n);
for (i in 1:n.strata){# randomization within stratum 
  id.stratum= which(X==i);
  trt= sample(id.stratum,size=floor(length(id.stratum)/2),replace=FALSE);
  Z.s[trt]=1; 
}

Y=Y.1*Z+Y.0*(1-Z); # observation w/o stratification 
Y.s=Y.1*Z.s+Y.0*(1-Z.s); # observation w stratification 


In [None]:
# Analysis, w/o stratification 

lm.vanilla=summary(lm(Y~Z));
lm.X=summary(lm(Y~Z+X));

plot(y=Y,x=X,pch=16,col=c('red','blue')[Z+1])
legend(x=1.1,y=22,legend=c('Ctrl','Trt'),col=c('red','blue'),pch=16) # Z=0 (control) is red, Z=1 (treated) is blue
text(x=5.1,y=5,labels=paste('Without X, Est. ACE = ',round(lm.vanilla$coef[2,1],2), 'with s.e.', round(lm.vanilla$coef[2,2],2),sep=' '  ),pos=4  )

text(x=5.1,y=2,labels=paste('With X,  Est. ACE = ',round(lm.X$coef[2,1],2), 'with s.e.', round(lm.X$coef[2,2],2),sep=' '  ),pos=4  )


In [None]:
# Analysis, w stratification 

lm.simple=summary(lm(Y.s~Z.s));
lm.strat=summary(lm(Y.s~Z.s+X));

plot(y=Y.s,x=X,pch=16,col=c('red','blue')[Z.s+1])
legend(x=1.1,y=22,legend=c('Ctrl','Trt'),col=c('red','blue'),pch=16) # Z.s=0 (control) is red, Z.s=1 (treated) is blue
text(x=5.1,y=5,labels=paste('Without X, Est. ACE = ',round(lm.simple$coef[2,1],2), 'with s.e.', round(lm.simple$coef[2,2],2),sep=' '  ),pos=4  )

text(x=5.1,y=2,labels=paste('With X,  Est. ACE = ',round(lm.strat$coef[2,1],2), 'with s.e.', round(lm.strat$coef[2,2],2),sep=' '  ),pos=4  )


In [None]:
# Repeat the above procedure 10000 times to evaluate the efficiency 

# Wrap up the code in one function 

strat.sim<-function(ACE){ 
n=40;n.strata=10;
X= sample(x=(1:n.strata),size=n,replace=TRUE);
 coef.X=5; 
Y.1=ACE+coef.X*X+rnorm(n,mean=0,sd=1); # potential outcome
Y.0=coef.X*X+rnorm(n,mean=0,sd=1); # potential outcome 
trt= sample(1:n,size=(n/2),replace=FALSE);Z=rep(0,n);Z[trt]=1; # randomization

Z.s=rep(0,n);
for (i in 1:n.strata){# randomization within stratum 
  id.stratum= which(X==i);
  trt= sample(id.stratum,size=floor(length(id.stratum)/2),replace=FALSE);
  Z.s[trt]=1; 
}

Y=Y.1*Z+Y.0*(1-Z); # observation w/o stratification 
Y.s=Y.1*Z.s+Y.0*(1-Z.s); # observation w stratification 


lm.vanilla=summary(lm(Y~Z));
lm.X=summary(lm(Y~Z+X));

lm.simple=summary(lm(Y.s~Z.s));
lm.strat=summary(lm(Y.s~Z.s+X));

est.ACE=c(lm.vanilla$coef[2,1],lm.X$coef[2,1],lm.simple$coef[2,1],lm.strat$coef[2,1])
return(est.ACE)
}
ACE=4;

sim.result=replicate(n=1e4,strat.sim(ACE=ACE));

In [None]:

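# Mean squared error of each estimator around the true ACE. Rows are ordered as:
# no stratification without X, no stratification with X,
# stratification without X, stratification with X
(mse=rowMeans((sim.result-ACE)^2))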

## 10.2 Design of observational study


### 10.2.1 Overview 

In Chapter 9, we mentioned that an observational study is a study in which the treatment is not randomized. We have discussed some designs of randomized experiments, and one would expect that similar designs might apply to observational studies, where one still needs to decide how to collect data. 

There are other designs that also employ the idea of stratification, for instance, repeated measures designs, split-plot designs, nested designs, etc. Note that the causal interpretation relies on randomization. A study can be stratified but not randomized. As a result, a causal interpretation is not always warranted. 

Depending on the length of observation, we would have the so-called longitudinal studies (or panel data) where subjects are observed over a period of time. 



### 10.2.2 Reporting bias in survey sampling 

In Chapter 9, we discuss how we can "remove" selection bias with assumptions. Here we demonstrate how to actually avoid bias with experimental design. 

We consider the classic scenario of self-reporting bias. Survey respondents are less likely to report their true status if they worry about their privacy (e.g., the "shy" Trump voters in recent election polls). To obtain truthful responses, we can employ a randomized response technique (Warner, 1965). Let $\pi_A$ be the proportion of the population that belongs to the sensitive set $A$. With probability $p$, the respondent is asked whether they belong to set $A$; with probability $1-p$, the respondent is asked whether they belong to set $A^c$. The interviewer does not observe which question was asked. Then we have 
\[
{\rm pr}({\rm Yes})=\pi_A p+(1-\pi_A)(1-p), \qquad {\rm pr}({\rm No})=(1-\pi_A)p+\pi_A(1-p).
\]
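Let $n_1$ denote the number of "Yes" responses among the $n$ respondents. Solving the first equation for $\pi_A$ gives $\hat{\pi}_A = [n_1/n-(1-p)]/(2p-1)$, which is well defined and unbiased as long as $p\neq 0.5$. In addition, we have 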
\[
{\rm var}\big( \hat{\pi}_A\big)=\frac{\pi_A (1-\pi_A)}{n} + \frac{p(1-p)}{n(2p-1)^2}.
\]
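
The randomized response scheme can be simulated as a quick check. This is a sketch under assumed values of `n`, `pi.A`, and `p`, and it assumes every respondent answers the assigned question truthfully. The variance is computed from the binomial variance of $n_1$, using ${\rm pr}({\rm Yes})$ from the display above.

```r
# Sketch of the randomized response design; all numeric values are illustrative.
set.seed(4)
n <- 5000
pi.A <- 0.3   # assumed true proportion in the sensitive set A
p <- 0.7      # probability of being asked about A rather than A^c

in.A <- rbinom(n, size = 1, prob = pi.A)   # hidden true membership
ask.A <- rbinom(n, size = 1, prob = p)     # 1: asked about A; 0: asked about A^c

# Truthful answer to whichever question was drawn.
yes <- ifelse(ask.A == 1, in.A, 1 - in.A)
n1 <- sum(yes)

# Estimator of pi.A from the observed proportion of "Yes" answers.
pi.hat <- (n1 / n - (1 - p)) / (2 * p - 1)

# Variance implied by the binomial variance of n1.
pr.yes <- pi.A * p + (1 - pi.A) * (1 - p)
var.theory <- pr.yes * (1 - pr.yes) / (n * (2 * p - 1)^2)

c(estimate = pi.hat, se.theory = sqrt(var.theory))
```

The variance grows quickly as $p$ approaches $0.5$, which is the price paid for stronger privacy protection.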

### 10.2.3 Case-control study 

Another popular design for observational studies is the case-control study. In medical research, cases are often easy to identify because patients' information is available from health records. Furthermore, many diseases are rare in the general population. Therefore, rather than following a group of participants until they develop the disease, it is more practical to select known cases and find a matching control group. 

Briefly, a case-control study identifies two populations by their outcomes. The actual samples are drawn from the two populations, respectively. To be specific, the case group is drawn from $X, Y \mid Y=1$, and the control group is drawn from $X,Y \mid Y=0$. As a result, it is no longer possible to recover the ${\rm ACE}$ from a case-control study, because the averages are taken over the wrong populations. 

However, the log-odds ratio from a case-control study can still be transferred to the general population. 
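
A small numerical sketch illustrates this invariance for a binary exposure $X$. The joint probabilities below are hypothetical, and the risk difference is used only as a simple stand-in for a contrast of averages that does not survive outcome-based sampling.

```r
# Hypothetical population distribution of binary exposure X and rare outcome Y.
# Rows index X, columns index Y; the matrix is filled column by column.
pop <- matrix(c(0.89, 0.09, 0.01, 0.01), nrow = 2,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))

# Case-control sampling draws X within each outcome group, so each column
# is rescaled to the conditional distribution of X given Y.
cc <- sweep(pop, 2, colSums(pop), "/")

odds.ratio <- function(tab) {
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}
risk.diff <- function(tab) {
  tab["1", "1"] / sum(tab["1", ]) - tab["0", "1"] / sum(tab["0", ])
}

# The odds ratio is unchanged by rescaling columns.
c(population = odds.ratio(pop), case.control = odds.ratio(cc))

# The risk difference is not.
c(population = risk.diff(pop), case.control = risk.diff(cc))
```

Rescaling a column multiplies one factor in the numerator and one in the denominator of the odds ratio by the same constant, which is why it cancels.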