<a href="https://colab.research.google.com/github/NikolayLenkovNikolaev/SAS-in-Clinical-Trial/blob/main/B_L6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression models for correlated data.
# Linear models for clustered data
- Linear regression analysis is the basic tool of statistical modeling.
- This technique assumes that residuals are independent and identically
distributed Normal random variables.

- Independence assumption:
  - knowing the value of $e_j$ for one observation would not help us to guess the value of $e_j^*$ for another observation.
  - $y_j = \alpha + \beta X_j + e_j$
  - $E(e_j) = 0$
  - $e_j \sim N(0, \sigma_e^2)$
  - $var(y_j \mid \alpha, \beta, X) = \sigma_e^2$


## Correlated data

Often this assumption does not hold, i.e. $e_j$ is correlated with $e_j^*$:
- twins, siblings or subjects belonging to the same family
- patients treated in the same center
- repeated measures taken on the same subject at different time points

In these cases $cov(e_j, e_j^*) \neq 0$.

Correlated data occur when:
- single observations are collected on subjects. The subjects are clustered
together in groups (clustered, hierarchical or multilevel data)
- multiple observations are collected from a single subject (the cluster is the
subject himself) (longitudinal or repeated measures data)

Clustered data:
- If clustered data are treated as if they were independent, the wrong standard
errors, incorrect confidence intervals and the wrong p-values are obtained for
the measures of interest.
- Confidence intervals may be too narrow or too wide, depending on whether the
factor of interest varies between or within clusters. Therefore, p-values may be
too small or too large.
- In order to make a correct inference, it is necessary to estimate the degree
of correlation present in the data.
- Regression models for clustered data are tools that relax the
independence assumption and accommodate more complicated data
structures in a flexible way.

Example:
- AIM: To study the satisfaction of patients with their medical doctors’ treatments.
A two-stage sample is used since it is cheaper than selecting patients at random.
- fig 1

```
FILENAME REFFILE '/home/u50340329/00.University/Bagnaridi/data/MD.xlsx';

PROC IMPORT DATAFILE=REFFILE
	DBMS=XLSX
	OUT=WORK.MD;
	GETNAMES=YES;
RUN;

/*NAIVE ANALYSIS*/
PROC MEANS DATA=MD MEAN STDDEV STDERR CLM;
VAR Y;
RUN;
```

PROC MEANS reports the standard deviation (StdDev) and the naive standard error of the mean (StdErr):

$StdErr = \frac{\sigma_e}{\sqrt{N}}$

## Effective sample size:
If we ignore the potential correlation among patients of the same medical doctor (some doctors could have more satisfied patients than others), we estimate the standard error of the mean satisfaction as $\frac{\sigma_e}{\sqrt{N}}$. This assumes that each data point contains one data point's worth of
information. However, if the data are correlated, then each data point contains
less than one data point's worth of information.

Effective sample size: essentially, the number of uncorrelated pieces of information that the sample is equivalent to.

If we knew the Intraclass Correlation Coefficient (ICC, or $\rho$), the effective sample size could be estimated as $N_{eff} = N \cdot \frac{1}{1+(k-1)\rho}$, with $k$ the number of patients per cluster. The standard error of the mean satisfaction is then $\frac{\sigma_e}{\sqrt{N_{eff}}}$.

- fig - 2

- fig 3

Adjusted interval estimate with $\rho = 0.45$ and $k = 5$:

$N_{eff} = N \cdot \frac{1}{1+(k-1)\rho} = 15 \cdot \frac{1}{1+(5-1) \cdot 0.45} = 5.36$

$SE_{adjusted} = 1.278 / \sqrt{5.36} = 0.55$
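The adjustment above can be reproduced with a few lines of Python (used here so the cell runs directly in the Colab notebook). This is a minimal sketch that assumes equal cluster sizes and takes 1.278 as the standard deviation of the outcome, as in the formula above; the function and variable names are illustrative and not part of the original SAS analysis.

```python
import math


def effective_sample_size(n_total, cluster_size, icc):
    """N_eff = N / (1 + (k - 1) * rho), assuming equal cluster sizes."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_total / design_effect


# Values taken from the worked example in the text.
n_total = 15        # total number of patients (N)
cluster_size = 5    # patients per doctor (k)
icc = 0.45          # intraclass correlation coefficient (rho)
std_dev = 1.278     # standard deviation used in the text

n_eff = effective_sample_size(n_total, cluster_size, icc)
se_naive = std_dev / math.sqrt(n_total)     # ignores the clustering
se_adjusted = std_dev / math.sqrt(n_eff)    # uses the effective sample size

print(f"N_eff       = {n_eff:.2f}")
print(f"naive SE    = {se_naive:.2f}")
print(f"adjusted SE = {se_adjusted:.2f}")
```

The adjusted standard error is larger than the naive one because the 15 correlated observations carry the information of only about 5 independent ones.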


To clarify the idea of the “effective sample size”, consider the responses of a person’s left and right eyes to the same treatment.
If these have a correlation of 0.54, then 200 eyes, two from each of 100 persons, contribute the “statistical equivalent” of one-eye contributions from each of 130 persons ($200 \times \frac{1}{1 + 0.54} \approx 130$). The closer the correlation is to 1, the closer the effective sample size is to 100.

Sample size considerations for studies with correlated data often rely on the idea of the effective sample size.

- fig-4

## How to estimate $\rho$?

$\rho = \frac{\sigma_u^2}{\sigma_e^2 + \sigma_u^2}$
- $\sigma_u^2$: between-cluster variance
- $\sigma_e^2$: residual (within-cluster) variance

- fig- 5

For a fixed total variance, the higher the variance within the clusters (residual variance), the lower the variance between the clusters and the lower the ICC ($\rho$):

- $\sigma_e^2[a] > \sigma_e^2[b] > \sigma_e^2[c]$

- $\sigma_u^2[a] < \sigma_u^2[b] < \sigma_u^2[c]$

- $\rho(a) < \rho(b) < \rho(c)$


## Mixed Model:

Mixed (random and fixed effects) regression models (also known as multilevel
or hierarchical models) explicitly model and estimate the between-cluster
variation and incorporate this, and the residual variance, into standard errors of the regression parameters.
- fig 6

$\sigma_e^2 =1.0308$

$\sigma_u^2= 0.8422$

$ICC =\frac{0.8422}{1.0308 + 0.8422}= 0.45=\rho$
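As a quick check of the arithmetic, the ICC can be computed from the two variance components in Python. This is only a sketch: the first call uses the estimates quoted above, while the three variance splits in the loop are hypothetical values chosen to illustrate how the ICC grows as the between-cluster share of a fixed total variance grows.

```python
def intraclass_correlation(var_between, var_residual):
    """ICC = sigma_u^2 / (sigma_u^2 + sigma_e^2)."""
    return var_between / (var_between + var_residual)


# Variance components quoted in the text.
rho = intraclass_correlation(var_between=0.8422, var_residual=1.0308)
print(f"ICC = {rho:.2f}")

# Hypothetical splits of a total variance of 2.0 (illustration only):
# (between-cluster variance, residual variance)
for var_u, var_e in [(0.2, 1.8), (1.0, 1.0), (1.8, 0.2)]:
    print(f"sigma_u^2={var_u}, sigma_e^2={var_e}: "
          f"ICC={intraclass_correlation(var_u, var_e):.2f}")
```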

## Random Effects : Estimation Iterative Process

fig 7



PROC MIXED:

```
PROC MIXED DATA=MD;
CLASS MD;
MODEL Y= / SOLUTION;
RANDOM INTERCEPT / SUBJECT=MD;
RUN;
QUIT;
```

Level 1: patient
- $y_{ij} = \alpha_i + e_{ij}$
- $e_{ij} \sim N(0, \sigma_e^2)$

Level 2: MD
- $\alpha_i = \alpha + u_i$
- $u_i \sim N(0, \sigma_u^2)$

where $i$ indexes the cluster (MD) and $j$ the subject (patient).

```
PROC MEANS DATA=MD MEAN STDDEV STDERR CLM;
VAR Y;
RUN;
```
fig-8

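The standard error reported by PROC MEANS is the naive, biased estimate, because it ignores the clustering of patients within doctors.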




## Simulation:

What does it mean to say that the data are correlated?

In this context it means that two randomly chosen patients treated by the same doctor are more similar to each other than two randomly chosen patients treated by two different doctors.

In other words, there is a component due to the 'doctor' effect that makes the two patients more similar.

This component is called a random effect because the doctors are a
sample of all doctors; they do not exhaust the population of doctors.

Moreover, we are not interested in estimating the effect of each individual doctor;
we simply want to account for the correlation among observations that
this effect induces.

Let us simulate data with a structure similar to the one seen in the example.

```
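DATA MEDICI;
CALL STREAMINIT(3567);               /* fix the seed for reproducibility */
DO MEDICO=1 TO 100;                  /* 100 doctors (clusters) */
  U=RAND("NORMAL",0,2);              /* doctor random effect, SD = 2 */
  DO PAZIENTE=1 TO 20;               /* 20 patients per doctor */
    Y=1+U+RAND("NORMAL",0,4);        /* mean 1 + doctor effect + residual, SD = 4 */
    OUTPUT;
  END;
END;
RUN;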
```

$\sigma_e^2 = 4^2 = 16$

$\sigma_u^2= 2^2 = 4$

$ICC = 4/ (16+4)= 4/20=0.2$

Let us verify that the intraclass correlation coefficient estimated with PROC
MIXED approximates the value assumed in the simulation:

```
PROC MIXED DATA=MEDICI;
CLASS MEDICO;
MODEL Y= / SOLUTION;
RANDOM INTERCEPT / SUBJECT=MEDICO;
RUN;
QUIT;
```


$\hat{\sigma}_e^2 = 16.21$

$\hat{\sigma}_u^2 = 4.83$

$\widehat{ICC} = 4.83 / (16.21 + 4.83) = 4.83 / 21.04 = 0.23$

fig -9
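The same simulation can be sketched in Python with the standard library only. Note the assumptions: the random stream differs from SAS `STREAMINIT`, so the numbers will not match the SAS output, and the variance components are estimated with the one-way ANOVA (method of moments) estimator for balanced data, not with the REML fit that PROC MIXED performs. Variable names are illustrative.

```python
import random
import statistics

random.seed(3567)  # arbitrary seed; does not reproduce the SAS stream

N_DOCTORS, N_PATIENTS = 100, 20
SD_DOCTOR, SD_RESIDUAL = 2.0, 4.0  # same standard deviations as the SAS data step

# One list of patient outcomes per doctor.
clusters = []
for _ in range(N_DOCTORS):
    u = random.gauss(0, SD_DOCTOR)  # doctor random effect, shared by all patients
    clusters.append(
        [1 + u + random.gauss(0, SD_RESIDUAL) for _ in range(N_PATIENTS)]
    )

# One-way ANOVA (method of moments) variance components, balanced design.
cluster_means = [statistics.mean(c) for c in clusters]
ms_within = statistics.mean(statistics.variance(c) for c in clusters)
ms_between = N_PATIENTS * statistics.variance(cluster_means)

var_residual = ms_within
var_between = max((ms_between - ms_within) / N_PATIENTS, 0.0)
icc = var_between / (var_between + var_residual)

print(f"residual variance        = {var_residual:.2f}")
print(f"between-doctor variance  = {var_between:.2f}")
print(f"ICC                      = {icc:.2f}")
```

With 100 doctors and 20 patients each, the estimates should land near the simulated values (16, 4 and 0.2), with some sampling variability.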



## Clustered randomized clinical trials

Clustered randomized clinical trials are conducted in several areas of
intervention research, where treatments are randomly assigned to clusters (e.g.
the subjects in the same clinic receive the same treatment).
Under this type of design, the assumption of independence of observations
within a cluster may not hold because the subjects share clinical
characteristics.

## Cluster-Constant covariates

A Cluster-constant covariate has the same value for all the units in the cluster.
In this example, treatment is a Cluster-constant covariate.

- fig-10

```
PROC SGPLOT DATA=CC;
XAXIS INTEGER OFFSETMIN=0.1 OFFSETMAX=0.1 ;
SCATTER X=TREAT Y=Y / GROUP=CENTER;
REG X=TREAT Y=Y / NOMARKERS;
RUN;


PROC REG DATA=CC;
MODEL Y=TREAT;
RUN;QUIT;
```

$\sigma_e^2 = 1.27^2= 1.61$

$S.E(\beta)= \sqrt{\frac{\sigma_e^2}{N_0} +\frac{\sigma_e^2}{N_1}}= \sqrt{\frac{1.61}{8} +\frac{1.61}{8}} = 0.64$

If we assume that the ICC is $\rho = 0.42$, with $k = 4$ subjects per cluster, we have:

$N_{0-eff} = 8 \cdot \frac{1}{1+(4-1) \cdot 0.42} = 3.54$

$N_{1-eff} = 8 \cdot \frac{1}{1+(4-1) \cdot 0.42} = 3.54$

$S.E(\beta) = \sqrt{\frac{\sigma_e^2}{N_{0-eff}} + \frac{\sigma_e^2}{N_{1-eff}}} = \sqrt{\frac{1.61}{3.54} + \frac{1.61}{3.54}} = 0.95$
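The naive and the adjusted standard errors of the treatment effect can be compared with a short Python sketch. It assumes two arms of 8 subjects each, clusters of 4 subjects and the residual variance and ICC quoted above; the helper names are illustrative, and this hand calculation is not a substitute for fitting the mixed model.

```python
import math


def se_difference(variance, n0, n1):
    """Standard error of a difference between two group means."""
    return math.sqrt(variance / n0 + variance / n1)


def effective_n(n, cluster_size, icc):
    """Effective sample size under equal cluster sizes."""
    return n / (1 + (cluster_size - 1) * icc)


residual_variance = 1.61  # from the independence model (PROC REG)
n_per_arm = 8             # subjects per treatment arm
cluster_size = 4          # subjects per center (k)
icc = 0.42                # assumed intraclass correlation

se_naive = se_difference(residual_variance, n_per_arm, n_per_arm)
n_eff = effective_n(n_per_arm, cluster_size, icc)
se_adjusted = se_difference(residual_variance, n_eff, n_eff)

print(f"naive SE    = {se_naive:.2f}")
print(f"N_eff / arm = {n_eff:.2f}")
print(f"adjusted SE = {se_adjusted:.2f}")
```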

Mixed Model:
$y_{ij} = \alpha + \beta T_{ij} + u_i + e_{ij}$
- $\beta T_{ij}$: fixed effect
- $u_i$: random effect

```
PROC MIXED DATA=CC;
CLASS CENTER;
MODEL Y=TREAT / SOLUTION;
RANDOM INTERCEPT / SUBJECT=CENTER;
RUN;
```

Level 1: patient
- $y_{ij} = \alpha_i + \beta T_{ij} + e_{ij}$
- $e_{ij} \sim N(0, \sigma_e^2)$

Level 2: center
- $\alpha_i = \alpha + u_i$
- $u_i \sim N(0, \sigma_u^2)$

where $i$ indexes the cluster (center) and $j$ the subject (patient).


Mixed model:
- $\sigma_e^2 = 1.14$
- $\sigma_u^2 = 0.83$

- $ICC= \frac{0.83}{1.14+0.83}= 0.42=\rho$



## Cluster-Constant covariates

If observations are positively correlated then the variance of the
estimated treatment effect will be underestimated if the data are
analyzed as though all observations are independent ($\rho$ = 0).

## Cluster-Varying covariates
A Cluster-varying covariate varies across the units within each cluster.
In this example, treatment is a Cluster-varying covariate.

- fig-11

```
PROC REG DATA=CV;
MODEL Y=TREAT;
RUN;QUIT;
```


$\sigma_e^2= 2.47$

```
PROC MIXED DATA=CV;
CLASS CENTER;
MODEL Y=TREAT / SOLUTION CL;
RANDOM INTERCEPT / SUBJECT=CENTER;
RUN;
```


Mixed model:
- $\sigma_e^2 = 0.21$
- $\sigma_u^2 = 2.83$

- $ICC= \frac{2.83}{2.83+0.21}= 0.93=\rho$


If observations are positively correlated then the variance of the
estimated effect of a cluster-varying covariate (here, treatment) will be
overestimated if the data are analyzed as though all observations are
independent ($\rho$ = 0).
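A small Python simulation can illustrate this overestimation. It is a sketch under stated assumptions: 50 centers with 2 treated and 2 control patients each and a true treatment effect of 1.0 are hypothetical choices, while the variance components are the ones quoted above. The clustering-aware analysis used here is the simple mean of within-center treated-minus-control differences, a stand-in for the fixed effects or mixed model fit rather than a reproduction of PROC MIXED.

```python
import math
import random
import statistics

random.seed(2024)  # arbitrary seed

N_CENTERS = 50                  # hypothetical number of centers
SD_CENTER = math.sqrt(2.83)     # between-center SD, from the variance in the text
SD_RESIDUAL = math.sqrt(0.21)   # residual SD, from the variance in the text
TRUE_EFFECT = 1.0               # hypothetical treatment effect

treated, control, within_diffs = [], [], []
for _ in range(N_CENTERS):
    u = random.gauss(0, SD_CENTER)  # center effect shared by its patients
    y_control = [u + random.gauss(0, SD_RESIDUAL) for _ in range(2)]
    y_treated = [u + TRUE_EFFECT + random.gauss(0, SD_RESIDUAL) for _ in range(2)]
    control += y_control
    treated += y_treated
    # The center effect cancels in the within-center contrast.
    within_diffs.append(statistics.mean(y_treated) - statistics.mean(y_control))

# Independence model: pooled two-sample standard error.
n0, n1 = len(control), len(treated)
pooled_var = ((n0 - 1) * statistics.variance(control)
              + (n1 - 1) * statistics.variance(treated)) / (n0 + n1 - 2)
se_naive = math.sqrt(pooled_var / n0 + pooled_var / n1)

# Within-center analysis: standard error of the mean within-center difference.
se_within = statistics.stdev(within_diffs) / math.sqrt(N_CENTERS)

print(f"estimated effect = {statistics.mean(within_diffs):.2f}")
print(f"naive SE         = {se_naive:.3f}")
print(f"within-center SE = {se_within:.3f}")
```

Because the large between-center variance is absorbed by the within-center contrast, the naive standard error comes out several times larger than the within-center one.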


In contrast to the mixed model, fixed effects regression includes the clusters
as fixed effects (i.e. the n clusters are represented by n-1 dummy
variables).

Fixed effects model with clusters as dummy variables:

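$Y_{ij} = \alpha + \beta_1 T_{ij} + \beta_2 DummyCluster_{1} + \beta_3 DummyCluster_{2} + \dots + \beta_n DummyCluster_{n-1} + e_{ij}$

$e_{ij} \sim N(0, \sigma_e^2)$

where $i$ indexes the cluster and $j$ the subject; $DummyCluster_{c}$ equals 1 for observations in cluster $c$ and 0 otherwise.

With 2 treatments and n clusters, the model has 1 + (n-1) fixed effects in addition to the intercept.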
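To make the dummy coding concrete, the following Python sketch builds the design matrix of the fixed effects model for a tiny hypothetical data set with three centers. The data, the function name and the choice of the last center as reference level are assumptions made for illustration.

```python
def cluster_dummies(cluster_ids):
    """Return the sorted cluster levels and, per observation, n-1 indicators.

    The last level (in sorted order) is the reference and gets no column.
    """
    levels = sorted(set(cluster_ids))
    non_reference = levels[:-1]
    rows = [[1 if cid == level else 0 for level in non_reference]
            for cid in cluster_ids]
    return levels, rows


# Hypothetical data: 3 centers, one control and one treated patient in each.
centers = ["A", "A", "B", "B", "C", "C"]
treat = [0, 1, 0, 1, 0, 1]

levels, dummies = cluster_dummies(centers)

# Columns: intercept, treatment, dummy for center A, dummy for center B.
design = [[1, t] + d for t, d in zip(treat, dummies)]
for row in design:
    print(row)
```

With n = 3 centers the matrix has 1 + 1 + (n-1) = 4 columns: the intercept, the treatment indicator and two center dummies.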

```
PROC MIXED DATA=CV;
CLASS CENTER;
MODEL Y=TREAT CENTER / SOLUTION CL;
RUN;

```

fig-12

Summary:

Comparison of traditional models to the Mixed model with respect to the estimated
Standard Error, S.E.($\beta$), of the effect of interest (e.g. a binary treatment),
assuming equal cluster sizes.

|Type of covariate|Independence model|Fixed effects (i.e. clusters as dummy variables)|Mixed model (i.e. cluster as random effect)|
|---|---|---|---|
|Cluster-constant (e.g. a treatment randomized at cluster level)|Biased (underestimated)|Biased (underestimated)|Correct|
|Cluster-varying (e.g. a treatment randomized at patient level)|Biased (overestimated)|Correct|Correct|

