<a href="https://colab.research.google.com/github/NikolayLenkovNikolaev/SAS-in-Clinical-Trial/blob/main/B_L3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The phenomenon of regression to the mean

## Galton (1822-1911) and regression to the mean
- “Galton set up a biometrical laboratory in London and collected heights, weights, measurements of specific bones, and other characteristics of family members. He was looking for some way to predict measurements from parents to children.”
- “It was obvious, for instance, that tall parents tended to have tall children, but was there some mathematical formula that would predict how tall the children would be, using only the heights of the parents?”
- “Galton discovered a phenomenon of ‘regression to the mean’. It turned out that sons of very tall fathers tended to be shorter than their fathers and sons of very short fathers tended to be taller than their fathers. As a result, the heights of humans tend to remain stable, on the average.”
- “Regression to the mean is a phenomenon that maintains stability and keeps a given species pretty much the same from generation to generation. Galton discovered a mathematical measure of this relationship. He called it the ‘coefficient of correlation’.”

From: Senn S. “Dicing with Death”

## Regression to the mean: simulation

Let us now see, in a SAS simulation, what would happen if regression to the mean did not hold and the height of the children (generation i+1) were on average equal to the height of their parents (generation i).

```
DATA GENERAZIONI;
CALL STREAMINIT(31102019);
ARRAY H(*) H1-H20;
DO I=1 TO 1000;
/* FIRST GENERATION */
H1=RAND("NORMAL",165,8);
DO GEN=2 TO 20;
/* LATER GENERATIONS: MEAN EQUAL TO THE PARENT HEIGHT (NO REGRESSION TO THE MEAN) */
H(GEN)=RAND("NORMAL",H(GEN-1),8);
END;
OUTPUT;
END;
RUN;
PROC MEANS DATA=GENERAZIONI MEAN MIN MAX;
VAR H1-H20;
RUN;
TITLE;
PROC SGPLOT DATA=GENERAZIONI;
HISTOGRAM H1 / TRANSPARENCY=0.75 FILLATTRS=(COLOR=RED);
HISTOGRAM H5 / TRANSPARENCY=0.75 FILLATTRS=(COLOR=YELLOW);
HISTOGRAM H10 / TRANSPARENCY=0.75 FILLATTRS=(COLOR=GREEN);
HISTOGRAM H20 / TRANSPARENCY=0.75 FILLATTRS=(COLOR=BLUE);
KEYLEGEND / LOCATION=OUTSIDE POSITION=BOTTOM;
RUN;
```


Galton thought about his remarkable finding and then realized that it had to be true, that it could have been predicted before making all his observations. Suppose, he said, that regression to the mean did not occur. Then, on average, the sons of tall fathers would be as tall as their fathers. In this case, some of the sons would have to be taller than their fathers (in order to average out the ones who are shorter). The sons of this generation of taller men would then average their heights, so some sons would be even taller. It would go on, generation after generation. Similarly, there would be some sons shorter than their fathers, and so on. After not too many generations, the human race would consist of ever taller people at one end and ever shorter ones at the other. This does not happen. The heights of humans tend to remain stable, on average. This can only happen if the sons of very tall fathers average shorter heights and the sons of very short fathers average greater heights.

From: Salsburg D. “The Lady Tasting Tea”


The phenomenon of regression to the mean:
- $E(Y|X=x)= \alpha +\beta x$

with Y the dependent variable (height of the children) and X the independent variable (height of the parents). The regression equation can also be written as:
- $E(Y|X=x)= \mu_y +\rho \frac{\sigma_y}{\sigma_x} (x-\mu_x)$

- with $\rho$ the correlation coefficient between Y and X
- $\mu_x, \mu_y$ the means of X and Y
- $\sigma_x, \sigma_y$ the standard deviations of X and Y

- In our case we can assume that $\sigma_x = \sigma_y$ and that X and Y are measured on the same scale. The equation then simplifies to:

$E(Y|X=x)= \mu_y + \rho(x-\mu_x)$

What emerges is that, except in the case of perfect correlation ($\rho=1$), the mean of Y for a given value of x deviates from $\mu_y$ by less, on average, than x deviates from $\mu_x$.

The weaker the correlation, the greater the impact of regression to the mean.
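
As a small numerical illustration of the simplified equation, the following Python sketch evaluates $E(Y|X=x)$ for a very tall and a very short parent at a few correlation values. It is only a sketch: the function name `expected_child_height` is made up for this example, the mean of 165 and standard deviation of 8 are borrowed from the first generation of the SAS simulation above, and the values of $\rho$ are arbitrary assumptions.

```python
def expected_child_height(x, mu_x=165.0, mu_y=165.0,
                          sigma_x=8.0, sigma_y=8.0, rho=0.5):
    """E(Y | X = x) = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)."""
    return mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)


# Assumed correlation values, chosen only for illustration.
for rho in (1.0, 0.7, 0.3):
    tall = expected_child_height(185.0, rho=rho)
    short = expected_child_height(145.0, rho=rho)
    print(f"rho={rho}: parent 185 -> child {tall:.1f}, "
          f"parent 145 -> child {short:.1f}")
```

With $\rho=1$ the expected child height equals the parent's height; as $\rho$ decreases, both expected values move closer to the common mean of 165.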

RegressionToTheMean_Fitzmaurice.pdf

## Regression to the mean and selection bias
Regression to the mean is a very common cause of (spontaneous) change in the values of the variable of interest. It occurs in studies where individuals are selected and studied because their values are extreme. In such cases an intervention could wrongly be judged effective merely because the extreme values observed before it was given regress spontaneously towards the mean.

Stephen Senn. Dicing with Death

About half of the subjects "selected" into the study because of high blood pressure at the start of observation (baseline), and given a treatment with no effect, show a lower blood pressure after some time. This is not because the treatment was effective, but because of regression to the mean.

From a clinical point of view, however, it is correct to treat only the subjects who need it (that is, those who are hypertensive at baseline).

## Simulation in SAS
Replicate the data of Senn's example seen earlier (means 125, standard deviation 12, correlation coefficient 0.7).

```

DATA SBP;
KEEP SBP1 SBP2 CHANGE;
CALL STREAMINIT(12345);
/*MU1->PRE MEAN, MU2->POST MEAN, VAR1->PRE VARIANCE, VAR2->POST VARIANCE*/
MU1=125; MU2=125; VAR1=12**2; VAR2=12**2;
RHO=0.7;
DO I = 1 TO 1000;
/*CODE TO GENERATE A BIVARIATE NORMAL*/
SBP1 = MU1+SQRT(VAR1)*RAND("NORMAL",0,1);
SBP2 = (MU2+RHO*(SQRT(VAR2)/SQRT(VAR1))*(SBP1-MU1)) +
SQRT(VAR2*(1-RHO**2))*RAND("NORMAL",0,1);
CHANGE=SBP2-SBP1;
OUTPUT;
END;
RUN;

PROC SGPLOT DATA=SBP;
SCATTER X=SBP1 Y=SBP2;
REFLINE 140 / AXIS=X LINEATTRS=(COLOR=RED THICKNESS=4);
REFLINE 140 / AXIS=Y LINEATTRS=(COLOR=RED THICKNESS=4);
RUN;

/*analysis on the full dataset (no selection bias)*/
PROC MEANS DATA=SBP MEAN CLM;
VAR CHANGE;
RUN;

/*analysis on the subjects whose pre values are > 140*/
DATA IPERTESI;
SET SBP;
IF SBP1>140;
RUN;
PROC MEANS DATA=IPERTESI MEAN CLM;
VAR CHANGE;
RUN;

PROC SGPLOT DATA=SBP;
SCATTER X=SBP1 Y=CHANGE;
RUN;
```

Because of regression to the mean, baseline values are negatively associated with the changes between post values and baseline values.
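
For readers without a SAS session, the same idea can be sketched in Python using only the standard library. This is a rough analogue of the simulation, not a translation of the SAS program: the function name `simulate_sbp` is invented here, and the sample size of 20000 is an assumption chosen to make the averages stable (the SAS code uses 1000).

```python
import random
from statistics import mean


def simulate_sbp(n, mu=125.0, sd=12.0, rho=0.7, seed=12345):
    """Draw n (baseline, follow-up) pairs from a bivariate normal.

    Both measurements have mean `mu` and standard deviation `sd`;
    `rho` is their correlation. There is no treatment effect.
    """
    rng = random.Random(seed)
    noise_sd = sd * (1.0 - rho ** 2) ** 0.5
    pairs = []
    for _ in range(n):
        sbp1 = rng.gauss(mu, sd)
        sbp2 = mu + rho * (sbp1 - mu) + rng.gauss(0.0, noise_sd)
        pairs.append((sbp1, sbp2))
    return pairs


pairs = simulate_sbp(20000)

# Mean change in everyone (no selection) and in subjects with baseline > 140.
change_all = mean(post - pre for pre, post in pairs)
change_high = mean(post - pre for pre, post in pairs if pre > 140.0)

print(f"mean change, all subjects:    {change_all:.2f}")
print(f"mean change, baseline > 140:  {change_high:.2f}")
```

Under these assumptions the mean change over all subjects should be close to zero, while the subgroup selected for a high baseline shows a clearly negative mean change even though no treatment was applied.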

## Selection bias and the role of randomisation

“…the puzzling, pervasive and apparently perverse phenomenon of
regression to the mean is one amongst many reasons why the
randomized controlled trial (RCT) has gained such popularity as a
means to test the effects of medical innovation. If regression to the
mean applies, it will also affect the control group and this permits its
biasing influence to be removed by comparison.”
Stephen Senn. Dicing with Death.


Exercise:

Simulate the data of a randomised controlled trial designed to assess the effect of a treatment, compared with placebo, in lowering systolic blood pressure (SBP).

Use the same settings as in the previous simulation (N=1000 per group, baseline mean=125, std. dev.=12, rho=0.7), assuming that the treatment lowers SBP by 8 units on average and that the mean decrease in the placebo group is zero.

Then select, in both the simulated placebo group and the simulated treated group, only the patients with baseline SBP > 140, and estimate the treatment effect as the difference between the mean decrease observed in the treated group and that observed in the placebo group.

Did randomisation correct the selection bias due to regression to the mean when estimating the treatment effect?
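
One possible way to approach this exercise is sketched below in Python (standard library only) rather than SAS. It is a sketch under stated assumptions, not a reference solution: the helper names `simulate_arm` and `mean_change` and the seed are made up for illustration, and the treatment effect is modelled as a simple shift of -8 in the follow-up mean.

```python
import random
from statistics import mean


def simulate_arm(n, effect, rng, mu=125.0, sd=12.0, rho=0.7):
    """Return (baseline, follow-up) SBP pairs; `effect` shifts the follow-up mean."""
    noise_sd = sd * (1.0 - rho ** 2) ** 0.5
    arm = []
    for _ in range(n):
        pre = rng.gauss(mu, sd)
        post = mu + effect + rho * (pre - mu) + rng.gauss(0.0, noise_sd)
        arm.append((pre, post))
    return arm


def mean_change(arm, cutoff=140.0):
    """Mean of (follow-up - baseline) among subjects with baseline above the cutoff."""
    return mean(post - pre for pre, post in arm if pre > cutoff)


rng = random.Random(2024)  # arbitrary seed, an assumption of this sketch
placebo = simulate_arm(1000, 0.0, rng)
treated = simulate_arm(1000, -8.0, rng)

change_placebo = mean_change(placebo)
change_treated = mean_change(treated)
effect_estimate = change_treated - change_placebo

print(f"placebo, baseline > 140: mean change {change_placebo:.2f}")
print(f"treated, baseline > 140: mean change {change_treated:.2f}")
print(f"estimated treatment effect: {effect_estimate:.2f}")
```

Both selected subgroups are expected to show a decrease, because regression to the mean acts on each arm. The difference between the two arms removes that common component, so it should land near the assumed effect of -8, up to sampling error.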



# Analysis of controlled pre-post studies with a continuous outcome

## RCT with baseline and follow-up measurements
In many randomised trials researchers measure a continuous variable at baseline and again as an outcome assessed at follow up.

Baseline measurements are common in trials of chronic conditions where researchers want to see whether a treatment can reduce preexisting levels of pain, anxiety, hypertension, and the like.

Vickers, Andrew J., and Douglas G. Altman. "Analysing controlled trials with baseline and follow up measurements." Bmj 323.7321 (2001): 1123-1124.

Example:

As an illustration, Kleinhenz et al randomised 52 patients with shoulder pain to either true or sham acupuncture. Patients were assessed before and after treatment using a 100 point rating scale of pain and function, with lower scores indicating poorer outcome.
- Kleinhenz, Julia, et al. "Randomised clinical trial comparing the effects of acupuncture and a newly designed placebo needle in rotator cuff tendinitis." Pain 83.2 (1999): 235-241.

## Preliminary analysis
First of all we need to assess whether the two groups being compared are comparable with respect to their baseline values.

Since this is a randomised study, it makes no sense to use a statistical test to assess ‘significant’ differences at baseline (randomisation ensures that the null hypothesis of equal distribution between groups at baseline is true).

This does not mean, however, that differences due to chance cannot occur.

```
data work;
set acupuncture;
run;
proc means data=work;
var pre;
class group;
run;
```
## Baseline:

Patients in the placebo group have lower baseline values of the response variable (pain and function scale: low values indicate worse conditions).

Since this is a randomised study (and trusting that randomisation was carried out correctly), we can safely say that the difference is due to chance, even though a formal statistical test would lead to the conclusion that the difference is ‘statistically significant’ (p=0.04). This p-value, however, cannot be interpreted as evidence against the null hypothesis, because we already know that the null hypothesis is true and that the observed difference is due to chance (so there is no point in testing this hypothesis).

## Strategie di analisi
There are four possibilities for how such data can be entered into the statistical analysis of such trials:
1. Comparison of follow up (post treatment) scores (“POST”)
- Vickers, Andrew J., and Douglas G. Altman. "Analysing controlled trials with baseline and follow up measurements." Bmj 323.7321 (2001): 1123-1124.

```
proc ttest data=work;
var post;
class group;
run;
```

Mean scores were 16 points (95% confidence interval 6.3 to
25.8 points) greater in the treatment group.


2. One can analyze the change from baseline, by looking at
absolute differences ("CHANGE")
- Vickers, Andrew J., and Douglas G. Altman. "Analysing controlled trials with baseline and follow up measurements." Bmj 323.7321 (2001): 1123-1124.


```
data work;
set acupuncture;
change=post-pre;
run;
proc ttest data=work;
var change;
class group;
run;
```
Pain reductions were 8.8 points (95% CI 0.3 to 17.3 points)
greater on treatment than control

Some use change scores to take account of chance imbalances at baseline between the treatment groups.

However, analysing change does not control for baseline imbalance because of regression to the mean: baseline values are negatively correlated with change because patients with low scores at baseline generally improve more
than those with normal scores.
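
The negative correlation between baseline and change can be seen from a short calculation. It assumes, as in the simulations earlier in these notes, that baseline $X$ and follow up $Y$ have the same standard deviation $\sigma$ and correlation $\rho$:

$$\operatorname{Cov}(X,\, Y-X) = \operatorname{Cov}(X, Y) - \operatorname{Var}(X) = \rho\sigma^2 - \sigma^2 = -(1-\rho)\,\sigma^2$$

This is negative whenever $\rho < 1$, so under this assumption a group that starts lower by chance is expected to show a larger change score even without any treatment effect.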


3. The most sophisticated method is to construct a regression model which adjusts the post-treatment score by the baseline score ("ANCOVA")

ANCOVA, despite its name, is a regression method.

In effect two parallel straight lines (linear regression) are obtained relating
outcome score to baseline score in each group

They can be summarised as a single regression equation:

follow up score = constant + a×baseline score + b×group

where a and b are estimated coefficients and group is a binary variable coded 1
for treatment and 0 for control.

The coefficient b is the effect of interest—the estimated difference between the
two treatment groups.

In effect an analysis of covariance adjusts each patient's follow up score for his or her baseline score, but has the advantage of being unaffected by baseline
differences.

```
proc glm data=work;
class group;
model post=pre group / solution;
run;quit;
```
The coefficient b for treatment should be interpreted as the expected difference between the mean change scores of the two groups, at any given value of (that is, adjusted for) the baseline score.

Pain and function score improved on average by an estimated 10.9 points (95% CI: 2.2 to 19.6 points) more in the treatment group than in the control group.

If, by chance, baseline scores are worse in the control group, the
treatment effect will be overestimated by a follow up score analysis and
underestimated by looking at change scores (because of regression to
the mean).

By contrast, analysis of covariance gives the same answer whether or not
there is baseline imbalance.
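
The behaviour of the three estimators under chance baseline imbalance can be checked with a small Python simulation (standard library only). This is a sketch, not the analysis of the acupuncture data: the true effect of 10 points, mean 60, standard deviation 15, correlation 0.5, the group size of 26 and the imbalance threshold of 3 points are all assumptions chosen for illustration, and the helper names are made up. The ANCOVA estimate is computed from the pooled within-group slope, which is the least-squares solution of the model with a common slope.

```python
import random
from statistics import mean


def simulate_group(n, effect, rng, mu=60.0, sd=15.0, rho=0.5):
    """Baseline and follow-up scores; `effect` is added to the follow-up mean."""
    noise_sd = sd * (1.0 - rho ** 2) ** 0.5
    pre = [rng.gauss(mu, sd) for _ in range(n)]
    post = [mu + effect + rho * (x - mu) + rng.gauss(0.0, noise_sd) for x in pre]
    return pre, post


def cross_ss(x, y):
    """Sum of cross-products of x and y around their own means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def three_estimates(pre_c, post_c, pre_t, post_t):
    """Treatment effect estimated by POST, CHANGE and ANCOVA."""
    d_pre = mean(pre_t) - mean(pre_c)
    d_post = mean(post_t) - mean(post_c)
    slope = ((cross_ss(pre_c, post_c) + cross_ss(pre_t, post_t))
             / (cross_ss(pre_c, pre_c) + cross_ss(pre_t, pre_t)))
    return d_post, d_post - d_pre, d_post - slope * d_pre


rng = random.Random(7)  # arbitrary seed
kept = []
for _ in range(2000):
    pre_c, post_c = simulate_group(26, 0.0, rng)
    pre_t, post_t = simulate_group(26, 10.0, rng)
    # Keep only trials where, by chance, the control group is worse
    # (lower score) at baseline by more than 3 points.
    if mean(pre_t) - mean(pre_c) > 3.0:
        kept.append(three_estimates(pre_c, post_c, pre_t, post_t))

post_avg, change_avg, ancova_avg = (mean(col) for col in zip(*kept))
print(f"trials kept: {len(kept)}")
print(f"POST   average estimate: {post_avg:.2f}")
print(f"CHANGE average estimate: {change_avg:.2f}")
print(f"ANCOVA average estimate: {ancova_avg:.2f}")
```

Among the trials with this kind of imbalance, the follow up comparison should sit above the assumed effect of 10, the change score comparison below it, and the ANCOVA estimate close to it.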


## Treatment effect variance
Senn_Chapter7.pdf

## Relative efficiency
An additional advantage of analysis of covariance is that it generally has
greater statistical power to detect a treatment effect than the other
methods.

- Borm, Fransen and Lemmens. A simple sample size formula for
analysis of covariance in randomized clinical trials. Journal of
Clinical Epidemiology, 60: 1234-1238. 2007

## Sample size for pre-post RCT

Analysis of post values

$n= \frac{2[z_{1-\alpha/2} + z_{1-\beta}]^2}{[\frac{\mu_0 -\mu_1}{\sigma}]^2}$

with
- $\alpha=0.05$ -> $z_{1-\alpha/2}=1.96$
- $\beta=0.2$ -> $z_{1-\beta}=0.84$
- $2*(1.96+0.84)^2=15.68$
- the denominator is the square of the effect size $\Delta = \frac{\mu_0-\mu_1}{\sigma}$ (standardised difference)

with $\mu_0$ and $\mu_1$ the expected outcome values at post in untreated and treated patients and $\sigma$ the common standard deviation of the outcome.

Analysis of change scores

$n= \frac{2[z_{1-\alpha/2} + z_{1-\beta}]^2}{[\frac{\mu_0-\mu_1}{\sigma}]^2}*[2-2\rho]$


with $\mu_0$ and $\mu_1$ the expected outcome values at post in untreated and treated patients, $\sigma$ the common standard deviation of the outcome and $\rho$ the correlation between post and pre values

#### ANCOVA:
$n= \frac{2[z_{1-\alpha/2} + z_{1-\beta}]^2}{[\frac{\mu_0-\mu_1}{\sigma}]^2}*[1-\rho^2]$

For example, a trial with a correlation between baseline and follow up scores of 0.6 that required 85 patients for analysis of follow up scores, would require 68 [85×(2-2×0.6) = 85×0.8] for a change score analysis but only 54 [85×(1-$0.6^2$) = 85×0.64] for analysis of covariance.

The efficiency gains of analysis of covariance compared with a change
score are low when there is a high correlation (say r > 0.8) between
baseline and follow-up measurements.
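
The three sample size formulas can be collected in one small Python function. This is a sketch: the name `n_per_group` and its arguments are made up for this example, it assumes that n is the number of patients per group (as in the usual two-sample formula), and it uses exact normal quantiles rather than the rounded 1.96 and 0.84, so results can differ slightly from hand calculations.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, rho=0.0, method="post", alpha=0.05, power=0.80):
    """Sample size per group for a standardised difference `delta`.

    method: "post" (follow-up scores), "change" (change scores)
            or "ancova" (analysis of covariance).
    """
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_post = 2 * z_sum ** 2 / delta ** 2
    factor = {
        "post": 1.0,
        "change": 2 - 2 * rho,
        "ancova": 1 - rho ** 2,
    }[method]
    return ceil(n_post * factor)


# Assumed example: standardised difference 0.5, correlation 0.6.
for method in ("post", "change", "ancova"):
    print(method, n_per_group(0.5, rho=0.6, method=method))
```

Note that the change score factor $2-2\rho$ is greater than 1 when $\rho < 0.5$, so in that range a change score analysis needs more patients than a simple comparison of follow up scores.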

4. One can analyze the change from baseline, by looking at
percentage change from baseline ("FRACTION")

```
data work;
set acupuncture;
fraction=(post-pre)/pre*100;
run;
proc ttest data=work;
var fraction;
class group;
run;
```

Functional and pain score improves by 36.7% in the Acupuncture group vs 23.1%
in the placebo group.

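Reporting a percentage change from baseline gives the results of a
randomized trial in clinically relevant terms immediately accessible to
patients and clinicians alike.

However, according to the article linked below, percentage change from baseline has the lowest statistical power of the methods compared and is highly sensitive to changes in variance. It should therefore not be used in statistical analysis.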


https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-1-6


## Reference material

Analysing controlled trials with baseline and follow up measurements
- Vickers_Altman.pdf

Baselines and Covariate Information
- Senn_Chapter7.pdf



## Simulation: power of pre-post analyses


A simulation study to assess the power associated with the different strategies for analysing pre-post data.

Assess, by means of a simulation study, how the statistical power associated with the four strategies for analysing continuous pre-post data presented in the lecture varies with the degree of correlation between the measurement taken at baseline and the measurement taken at the end of follow-up.
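
A possible starting point for this simulation is sketched below in Python (standard library only) rather than SAS. Everything specific in it is an assumption made for illustration: 30 patients per group, baseline mean 60, standard deviation 12, a true effect of 6 points, 1000 replications, and the two correlation values. For simplicity each test uses the normal critical value 1.96 in place of the t quantile, so the estimated power is slightly optimistic. All function names are made up.

```python
import random
from statistics import fmean

Z_CRIT = 1.96  # normal approximation to the two-sided 5% critical value


def cross_ss(x, y):
    """Sum of cross-products of x and y around their own means."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def t_stat(x_c, x_t):
    """Two-sample statistic with pooled variance (equal group sizes)."""
    n = len(x_c)
    pooled_var = (cross_ss(x_c, x_c) + cross_ss(x_t, x_t)) / (2 * n - 2)
    return (fmean(x_t) - fmean(x_c)) / (pooled_var * 2 / n) ** 0.5


def ancova_stat(pre_c, post_c, pre_t, post_t):
    """Test statistic for the group effect in post = c + a*pre + b*group."""
    n_c, n_t = len(pre_c), len(pre_t)
    sxx = cross_ss(pre_c, pre_c) + cross_ss(pre_t, pre_t)
    sxy = cross_ss(pre_c, post_c) + cross_ss(pre_t, post_t)
    syy = cross_ss(post_c, post_c) + cross_ss(post_t, post_t)
    slope = sxy / sxx
    d_pre = fmean(pre_t) - fmean(pre_c)
    b = fmean(post_t) - fmean(post_c) - slope * d_pre
    resid_var = (syy - slope * sxy) / (n_c + n_t - 3)
    se = (resid_var * (1 / n_c + 1 / n_t + d_pre ** 2 / sxx)) ** 0.5
    return b / se


def draw_group(n, effect, rho, rng, mu=60.0, sd=12.0):
    """Baseline and follow-up scores with correlation rho."""
    noise_sd = sd * (1 - rho ** 2) ** 0.5
    pre = [rng.gauss(mu, sd) for _ in range(n)]
    post = [mu + effect + rho * (x - mu) + rng.gauss(0.0, noise_sd) for x in pre]
    return pre, post


def estimate_power(rho, n=30, effect=6.0, n_sim=1000, seed=1):
    """Share of simulated trials in which each analysis rejects the null."""
    rng = random.Random(seed)
    hits = dict.fromkeys(("POST", "CHANGE", "ANCOVA", "FRACTION"), 0)
    for _ in range(n_sim):
        pre_c, post_c = draw_group(n, 0.0, rho, rng)
        pre_t, post_t = draw_group(n, effect, rho, rng)
        chg_c = [b - a for a, b in zip(pre_c, post_c)]
        chg_t = [b - a for a, b in zip(pre_t, post_t)]
        frac_c = [100 * (b - a) / a for a, b in zip(pre_c, post_c)]
        frac_t = [100 * (b - a) / a for a, b in zip(pre_t, post_t)]
        stats = {
            "POST": t_stat(post_c, post_t),
            "CHANGE": t_stat(chg_c, chg_t),
            "ANCOVA": ancova_stat(pre_c, post_c, pre_t, post_t),
            "FRACTION": t_stat(frac_c, frac_t),
        }
        for name, value in stats.items():
            hits[name] += abs(value) > Z_CRIT
    return {name: count / n_sim for name, count in hits.items()}


results = {rho: estimate_power(rho) for rho in (0.3, 0.8)}
for rho, row in results.items():
    print(f"rho={rho}: {row}")
```

Extending the tuple of correlation values gives the full power curve for each method. A version that mirrors the course material more closely would replace the normal critical value with the appropriate t quantile, as PROC TTEST and PROC GLM do.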