# Sample Size - How much data is enough for your experiment?

<div class="alert alert-block alert-danger">
<b>Alert:</b> If you're running this on <b>Google Colab</b>, then uncomment and run the next two cells.
</div>

In [None]:
# !git clone https://github.com/Mark-Kramer/METER-Units.git

In [None]:
# import sys
# sys.path.insert(0,'/content/METER-Units')

## 0 - Setup & Introduction

In [None]:
# Load modules
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import statsmodels.api as sm
# Load custom functions
from sample_size_functions import *

<div class="alert alert-block alert-info">

*Introduction*:

- Recent work suggests there exists a *genetic biomarker for longevity*, substance $x$.
- We'd like to perform an experiment to test this novel biomarker. To do so, we need to compute the **sample size** for our experiment.
- That's the goal of this unit: to perform a sample size calculation.
- To start, we'll provide limited information about substance $x$ and longevity, and ask you to compute the sample size.
- **No idea how to start a sample size calculation?** That's great! The goal of this unit is to teach you a general approach.
- By the end of this unit, you'll have a deeper understanding of what sample size means, and a practical approach to compute it.
</div>

## 1 - Just Google it?

- Recent work suggests there exists a genetic biomarker for longevity (i.e., age at death), substance $x$.
- We'd like to perform an experiment to test this novel biomarker. To do so, we need to compute the **sample size** for our experiment.
- So, let's do it.
- Here's some limited information about substance $x$ and longevity:
    - *People have a normal distribution of expression of substance $x$.*
    - *Individuals at the high end of expression levels tend to live about 5 years longer than people at the low end.*

<div class="alert alert-block alert-success">

**Q:** Given this information, how many individuals do you need to observe to have a reasonable chance of demonstrating this hypothesis is correct? (I.e., What is the **sample size**?)

**A:**

</div>

<div class="alert alert-block alert-danger">
<b>Alert: Wait, I have no idea how to answer this?</b>

</p>

- That's great!
- The goal of this unit is to teach you to tackle this problem.
- Please make your best attempt (guess?) to compute a sample size, even if you’re not confident in the results.

A possible place to start is "[Google it](https://www.google.com/)".
- Doing so, you might end up at a website [like this](https://researchmethodsresources.nih.gov/grt-calculator) or [like this](https://www.abs.gov.au/websitedbs/d3310114.nsf/home/sample+size+calculator).
- Can you compute the sample size using these websites?
- NOTE: This won't be easy / obvious! 


</div>

<div class="alert alert-block alert-success">

**Q:** Given this description, sketch a plot of lifespan versus $x$.
- What types of values do you expect for each variable? What are their distributions, do you think?
- How are the variables related?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** We provided very little information and asked you to compute the sample size. What other information would you have liked to help compute the sample size?

**A:**

</div>

## 2 - Underpowered experiments are doomed to failure.

Now that you've determined (or guessed) the sample size `N` for your experiment, let's perform the experiment.

You collect `N` samples of data, so that you receive from each individual:

* `x` - a measure of the proposed biomarker for longevity,

* `lifespan` - the individual's age at death.


In [None]:
N = 100                               # Here, learner will input N they found in Mini 1.
x,lifespan = load_data(N)             # Use this by default.
# x,lifespan = load_data_Colab(N)     # Use this if on !!GOOGLE COLAB!!

Let's start by plotting the data.

In [None]:
plt.scatter(x,lifespan)
plt.xlabel('Genetic biomarker x')
plt.ylabel('Lifespan (years)');

<div class="alert alert-block alert-success">

**Q:** What do you observe? Do you see the hypothesized relationship between the biomarker $x$ and lifespan?

**A:**

</div>

Let's assess the relationship between the biomarker `x` and `lifespan` beyond visual inspection.

There are many ways to do so.

Here, we'll fit a line to the data and compute the slope.

In [None]:
# Estimate a line from the data.

from statsmodels.formula.api import ols

dat                = {"x": x, "lifespan": lifespan}
regression_results = ols("lifespan ~ 1 + x", data=dat).fit()

<div class="alert alert-block alert-success">

**Q:** If this code is new to you, don't worry. Can you see the equation for the line in the code above?

**A:**
    
</div>

Now, with the line estimated, we can print the estimated slope and its p-value.

In [None]:
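# Use position-based .iloc: plain integer indexing of a labeled pandas Series is deprecated.
print('Slope estimate =',regression_results.params.iloc[1])
print('p-value        =',regression_results.pvalues.iloc[1])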

Let's also **visualize** the estimated line by plotting it with the data.

In [None]:
pred   = regression_results.get_prediction().summary_frame()
mn     = pred['mean']
ci_low = pred['mean_ci_lower'] 
ci_upp = pred['mean_ci_upper']

# And plot it.
indices_sorted = np.argsort(x,0)
plt.scatter(x,lifespan)
plt.plot(x[indices_sorted[:,0]],mn[indices_sorted[:,0]], 'r')
plt.plot(x[indices_sorted[:,0]],ci_low[indices_sorted[:,0]], ':r')
plt.plot(x[indices_sorted[:,0]],ci_upp[indices_sorted[:,0]], ':r')
plt.xlabel('Genetic biomarker x')
plt.ylabel('Lifespan (years)');

<div class="alert alert-block alert-success">

**Q:** Do you find a significant relationship between the genetic biomarker `x` and `lifespan`?

**A:**

</div>

<div class="alert alert-block alert-danger">
<b>Alert: Wait, this doesn't make sense!</b>

</p>

- We've applied a standard approach to compute sample size `N` and performed the experiment using this sample size.

- We see a trend supporting the hypothesized relationship, but it's not significant.

- The experiment has failed!

What's going on?
</div>

<div class="alert alert-block alert-info">

*Moment of tension*:

- Hook the learner - "something isn't right and I want to know why."

</div>

## 3 - What is the effect size?

In Minis 1 & 2, we determined a sample size `N`, collected data with that sample size, and tested for a relationship between longevity and biomarker $x$.

Our results failed to support the hypothesis!

Despite this failure, these data are still useful.

In this Mini, we'll see how to use these data to compute the **effect size**.

<div class="alert alert-block alert-success">

**Q:** What is the effect size for the data you analyzed in Mini 2?
- What is the numerical value?
- What does it mean in words?
- Is it consistent with your hypothesis?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** Imagine you repeated the experiment. What would you expect for the effect size?

**A:**

</div>
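
A note on units may help here. The slope measures the effect in years of lifespan per unit of $x$, while many online calculators ask for a unit-free effect size, such as a correlation. The sketch below uses synthetic stand-in data (the intercept, slope, and noise level are made-up numbers, not the notebook's data) to show how the two are linked: the correlation equals the slope rescaled by the spread of each variable.

```python
import numpy as np

rng = np.random.default_rng(1)               # Assumption: seeded for repeatability.

# Synthetic stand-in data (all numbers are made up for illustration).
n_demo = 100
x_demo = rng.normal(0.0, 1.0, n_demo)
y_demo = 75.0 + 1.5 * x_demo + rng.normal(0.0, 15.0, n_demo)

# Slope of the least-squares line: years of lifespan per unit of x.
slope_demo = np.polyfit(x_demo, y_demo, 1)[0]

# Pearson correlation: a unit-free version of the same effect.
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]

# The two are linked by the spread of each variable.
r_from_slope = slope_demo * np.std(x_demo) / np.std(y_demo)
print('Slope               =', slope_demo)
print('Correlation         =', r_demo)
print('Slope * sd(x)/sd(y) =', r_from_slope)
```

The last two printed values should agree, so either number can describe the same effect; which one you report depends on what your audience (or calculator) expects.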

## 4 - With resampling you can repeat any experiment.


- The data provided in Mini 2 represent one instantiation of the experiment.
- Maybe we were unlucky and repeating the experiment with sample size `N` would produce significant results.
- But, repeating the experiment is expensive.
- An alternative is to resample your data in hand.

In this Mini, we'll implement a resampling procedure (with fixed sample size `N`) and examine variability in the effect size. 

Our resampling procedure consists of 3 steps:

1) Draw a new (random) set of labels we can use to index our data (biomarker $x$ and lifespan).
2) Use these indices to create a resampled data set.
3) Compute the relationship between our data (i.e., between the biomarker $x$ and lifespan).

We'll now describe each step. For a related example, [see this video](https://youtu.be/mqDEJyW_z4c?si=heigY8z5PqAjnwKZ).

**First**, we need to draw a set of labels we can use to index our data.

To do so, we'll create a random list of indices, of size equal to the length of our data.

We'll do so *with replacement*, so that the same index may be listed once, twice, or more, or not at all.

In [None]:
ind = np.random.choice(np.size(x), np.size(x))
print(ind)

<div class="alert alert-block alert-success">

**Q:** Look at the values in `ind`. Do they make sense?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** Run the code to generate `ind` again. What do you find? (I.e., is it the same or different than the first time?)

**A:**

</div>
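
If "with replacement" is still unclear, a toy example may help. The sketch below draws 10 indices and counts how often each one appears. The seeded generator `rng` and the toy size of 10 are assumptions for illustration; the notebook's own code uses `np.random.choice` without a seed, which is why `ind` changes on every run.

```python
import numpy as np

# Assumption: a seeded generator, so this toy draw is repeatable.
rng = np.random.default_rng(0)

n_toy = 10                                   # Toy data length (hypothetical).
ind_toy = rng.choice(n_toy, size=n_toy)      # Draw from 0..9 with replacement (the default).

# Count how many times each index was drawn.
counts = np.bincount(ind_toy, minlength=n_toy)
print('Drawn indices :', ind_toy)
print('Times drawn   :', counts)
print('Never drawn   :', np.flatnonzero(counts == 0))
```

Some indices typically appear more than once and others not at all, which is exactly what makes each resampled data set a little different from the original.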

**Second**, we'll use these indices to generate the resampled data.

In [None]:
x_resampled = x[ind]
lifespan_resampled = lifespan[ind]

<div class="alert alert-block alert-success">

**Q:** Look at the values in `x_resampled` and `lifespan_resampled`. Do they make sense?

**A:**

</div>

**Third**, we determine the relationship between the resampled biomarker $x$ and the resampled lifespan.

To do so, we'll fit the same linear model to our new resampled data, and again compute the slope and significance.

In [None]:
dat                = {"x": x_resampled, "lifespan": lifespan_resampled}
regression_results = ols("lifespan ~ 1 + x", data=dat).fit()

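# Use position-based .iloc: plain integer indexing of a labeled pandas Series is deprecated.
print('Slope estimate =',regression_results.params.iloc[1])
print('p-value        =',regression_results.pvalues.iloc[1])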

<div class="alert alert-block alert-success">

**Q:** Repeat these steps to generate results from multiple "experiments". Do you ever find a significant result? What p-values do you find?

**A:**

</div>

<div class="alert alert-block alert-info">

*Conclusion*:

- Using the sample size `N` and resampling the data, we get qualitatively similar, but quantitatively different, results.
- The effect size (i.e., slope) is consistently positive across resamples. That's consistent with our hypothesis.
- But, we do not (or very infrequently) find a significant relationship between lifespan and $x$.
- The significance (p-value) changes for each resample. 

</div>
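
Rather than re-running the cell by hand, we could collect the slope from many resamples and summarize the results. The sketch below does this on synthetic stand-in data (made-up intercept, slope, and noise level), and uses `np.polyfit` instead of `ols` only to keep the example short. With your own data, the fraction of positive slopes may differ from what you see here.

```python
import numpy as np

rng = np.random.default_rng(2)               # Assumption: seeded for repeatability.

# Synthetic stand-in data (made-up numbers, not the notebook's data).
n_demo = 100
x_demo = rng.normal(0.0, 1.0, n_demo)
y_demo = 75.0 + 1.5 * x_demo + rng.normal(0.0, 15.0, n_demo)

n_resamples = 1000
slopes = np.zeros(n_resamples)
for k in range(n_resamples):
    ind_demo = rng.choice(n_demo, size=n_demo)            # Resample with replacement.
    slopes[k] = np.polyfit(x_demo[ind_demo], y_demo[ind_demo], 1)[0]

# Summarize the resampled slopes.
low, high = np.percentile(slopes, [2.5, 97.5])
print('Fraction of positive slopes =', np.mean(slopes > 0))
print('Middle 95% of slopes        =', low, 'to', high)
```

If the middle 95% of resampled slopes includes zero, that matches what the p-values suggest: the data in hand do not pin down the sign of the effect very firmly.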

## 5 - You have the power!

Resampling provides a direct, computational approach to compute the **power**.

<div class="alert alert-block alert-danger">
Define power.
    
- It's a mysterious parameter in the online calculator.
- What does Power = 0.8 mean?
- What does alpha = 0.05 mean?
</div>
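
One common working reading of these terms, offered here as an assumption rather than this unit's official wording: `alpha` is the fraction of experiments that report a significant result when there is no true effect, and power is the fraction that report a significant result when the effect is real. The sketch below simulates both cases with made-up numbers (a noise standard deviation of 15 years and a true slope of either 0 or 1.5), and uses `scipy.stats.linregress` in place of `ols` only to keep the example short.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)               # Assumption: seeded for repeatability.

def fraction_significant(true_slope, n, n_experiments=2000, alpha=0.05):
    """Simulate many experiments; return the fraction with p < alpha."""
    n_significant = 0
    for _ in range(n_experiments):
        x_sim = rng.normal(0.0, 1.0, n)
        # Made-up model: intercept 75, noise SD 15.
        y_sim = 75.0 + true_slope * x_sim + rng.normal(0.0, 15.0, n)
        if linregress(x_sim, y_sim).pvalue < alpha:
            n_significant += 1
    return n_significant / n_experiments

# No true effect: significant results are false positives, at a rate near alpha.
print('Slope = 0   :', fraction_significant(true_slope=0.0, n=100))
# A true effect: the fraction of significant results estimates the power.
print('Slope = 1.5 :', fraction_significant(true_slope=1.5, n=100))
```

Under this reading, "Power = 0.8" means that 8 out of 10 experiments of this size would detect the effect, if the effect is as large as assumed.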

The procedure to compute the power using resampling is simple. Here are the steps in general:

1. Resample the data.
2. Compute the significance of the proposed effect.
3. Repeat 1-2 many times.

The power is the number of times we detect a significant effect, divided by the total number of repetitions.

Let's apply this procedure to our data and effect of interest:

In [None]:
number_of_repetitions = 1000
p_value = np.zeros(number_of_repetitions)
for k in np.arange(number_of_repetitions):
    ind = np.random.choice(np.size(x), np.size(x))
    x_resampled = x[ind]
    lifespan_resampled = lifespan[ind]
    dat                = {"x": x_resampled, "lifespan": lifespan_resampled}
    regression_results = ols("lifespan ~ 1 + x", data=dat).fit()
    p_value[k] = regression_results.pvalues.iloc[1]   # .iloc: position-based, avoids deprecated Series[int].

<div class="alert alert-block alert-success">

**Q:** Does the code above make sense? Can you see the data resampled, and the estimated model?

**A:**

</div>

Now, compute the power.

In [None]:
Power = np.sum(p_value < 0.05)/number_of_repetitions
print(Power)

<div class="alert alert-block alert-success">

**Q:** What value do you find for `Power`?

- Interpret this value.
- Do you have enough data to detect the effect?

**A:**

</div>
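
The value of `Power` is itself an estimate, because it is built from a finite number of repetitions. As a rough guide, the noise from the number of repetitions can be approximated with a binomial standard error. The helper `power_with_standard_error` and the ten example p-values below are hypothetical, and this calculation captures only the noise from the repetitions, not the uncertainty from having a single original data set.

```python
import numpy as np

def power_with_standard_error(p_values, alpha=0.05):
    """Return the estimated power and its binomial standard error."""
    p_values = np.asarray(p_values)
    n_reps = p_values.size
    power = np.mean(p_values < alpha)
    standard_error = np.sqrt(power * (1.0 - power) / n_reps)
    return power, standard_error

# Hypothetical p-values standing in for the output of a resampling loop.
example_p_values = np.array([0.01, 0.20, 0.03, 0.60, 0.04,
                             0.30, 0.002, 0.08, 0.50, 0.045])
power_est, power_se = power_with_standard_error(example_p_values)
print('Power =', power_est, '+/-', power_se)
```

With 1000 repetitions the standard error is much smaller than in this ten-value toy example, which is one reason to prefer many repetitions.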

## 5* (Optional Extension) - You have the power!

In this optional Mini, we consider an **alternative strategy to compute the power**.

Instead of resampling the data, we **fit a model to the data**, then use that model to generate new data samples.


---
Let's begin by fitting a model to the data.

We'll use a simple model: a line.

We've already fit this model to estimate the effect size and its significance.

We'll now fit the model, save the model parameters, then use this model to generate new data samples.

Let's begin by fitting the model to our data.

In [None]:
# Fit the model (a line) to the data.

from statsmodels.formula.api import ols
dat   = {"x": x, "lifespan": lifespan}
model = ols("lifespan ~ 1 + x", data=dat).fit()

The model estimates two parameters: the `slope` and `intercept`.

Let's get those two parameters:

In [None]:
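# Use position-based .iloc: plain integer indexing of a labeled pandas Series is deprecated.
intercept = model.params.iloc[0]
slope     = model.params.iloc[1]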
print('Intercept estimate = ',intercept)
print('Slope estimate     = ',slope)

We'll need one more estimate from the model fit: the dispersion.

<div class="alert alert-block alert-danger">
The dispersion parameter ... [URI HELP].
</div>
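
One way to read the dispersion, offered as an assumption rather than the author's definition: it is the estimated standard deviation of the residuals, i.e., the typical vertical scatter of the data around the fitted line. For an OLS fit, statsmodels documents `scale` as the residual sum of squares divided by the residual degrees of freedom, so `np.sqrt(model.scale)` should correspond to the hand calculation sketched below. The sketch uses synthetic data with a known, made-up noise level and `np.polyfit` in place of `ols`.

```python
import numpy as np

rng = np.random.default_rng(4)               # Assumption: seeded for repeatability.

# Synthetic data with a known noise level (all numbers are made up).
n_demo = 200
noise_sd = 15.0
x_demo = rng.normal(0.0, 1.0, n_demo)
y_demo = 75.0 + 1.5 * x_demo + rng.normal(0.0, noise_sd, n_demo)

# Fit a line, then compute the residuals (data minus fitted line).
slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
residuals = y_demo - (intercept_demo + slope_demo * x_demo)

# Residual sum of squares divided by (N - 2), since two parameters were estimated.
dispersion_demo = np.sqrt(np.sum(residuals**2) / (n_demo - 2))
print('True noise SD        =', noise_sd)
print('Estimated dispersion =', dispersion_demo)
```

The estimate should land close to the true noise level, which is why the dispersion can be used to add realistic scatter when simulating new data.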

In [None]:
dispersion = np.sqrt(model.scale)
print('Dispersion parameter = ',dispersion)

With the 3 estimated parameters, we can now simulate realizations of the model. 

To do so, we'll evaluate this model:

`lifespan_modeled = intercept + slope * x + np.random.normal(loc=0.0, scale=dispersion, size=[N,1])`

<div class="alert alert-block alert-success">

**Q:** Describe - in words - each term in the equation.
- What variables do you recognize?
- What variables are new?
- What is the equation doing?

**A:**

</div>

Now, let's see what the model does.

To do so, we'll evaluate the model, and compare the `lifespan_modeled` to the original `lifespan`.

In [None]:
lifespan_modeled = intercept + slope * x + np.random.normal(loc=0.0, scale=dispersion, size=[N,1])
plt.scatter(x,lifespan, label='original data')
plt.scatter(x,lifespan_modeled, label='lifespan modeled')
plt.legend()
plt.grid()
plt.ylim([45, 105]);

<div class="alert alert-block alert-success">

**Q:** Compare the original `lifespan` data and the `lifespan_modeled` data.
- How do they look similar?
- How do they look different?
- Do the modeled data provide a "good" representation of the original data?

**A:**

</div>

With the model estimated, we can use it to compute the power. Here are the steps in general:

1. Use the model to generate `lifespan_modeled` data.
2. Compute the significance of the proposed effect.
3. Repeat 1-2 many times.

The power is the number of times we detect a significant effect, divided by the total number of repetitions.

Let's apply this procedure to our data and effect of interest:

In [None]:
number_of_repetitions = 1000
p_value = np.zeros(number_of_repetitions)
for k in np.arange(number_of_repetitions):
    lifespan_modeled = intercept + slope * x + np.random.normal(loc=0.0, scale=dispersion, size=[N,1])
    dat                = {"x": x, "lifespan": lifespan_modeled}
    regression_results = ols("lifespan ~ 1 + x", data=dat).fit()
    p_value[k] = regression_results.pvalues.iloc[1]   # .iloc: position-based, avoids deprecated Series[int].
Power = np.sum(p_value < 0.05)/number_of_repetitions
print('Power = ',Power)

<div class="alert alert-block alert-success">

**Q:** Does the code above make sense? Can you see the modeled lifespan data, and the estimated model?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** What value do you find for `Power`?

- Interpret this value.
- Do you have enough data to detect the effect?
- How do these results compare to the resampling approach described in Mini 5?

**A:**

</div>

## 6 - You have the sample size!

- In Mini 5, we used resampling with a fixed sample size to compute the power.
- We can use the same resampling approach to compute the **sample size** for a fixed power.

<div class="alert alert-block alert-danger">
Provide some intuition for this calculation.
</div>
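
One piece of intuition, offered as a sketch rather than the author's explanation: as the sample size grows, the estimated slope stays roughly where it is, but its standard error shrinks roughly like $1/\sqrt{N}$. The same slope therefore sits more standard errors away from zero, and the p-value tends to drop. The example below illustrates this on synthetic data (made-up slope of 1.5 and noise standard deviation of 15), using `scipy.stats.linregress` to report the standard error of the slope.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(5)               # Assumption: seeded for repeatability.

sample_sizes = [100, 400, 1600]
standard_errors = []
for n in sample_sizes:
    # Synthetic data with the same made-up slope and noise level at every N.
    x_sim = rng.normal(0.0, 1.0, n)
    y_sim = 75.0 + 1.5 * x_sim + rng.normal(0.0, 15.0, n)
    fit = linregress(x_sim, y_sim)
    standard_errors.append(fit.stderr)
    print('N =', n, ' slope =', round(fit.slope, 2),
          ' standard error =', round(fit.stderr, 2))
```

Each fourfold increase in `N` should roughly halve the standard error, so detecting a small effect against large variability can require a surprisingly large sample.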

Let's do it!

In [None]:
N_resampled = 1000
number_of_repetitions = 1000
p_value = np.zeros(number_of_repetitions)
for k in np.arange(number_of_repetitions):
    ind = np.random.choice(np.size(x), N_resampled)
    x_resampled = x[ind]
    lifespan_resampled = lifespan[ind]
    dat                = {"x": x_resampled, "lifespan": lifespan_resampled}
    regression_results = ols("lifespan ~ 1 + x", data=dat).fit()
    p_value[k] = regression_results.pvalues.iloc[1]   # .iloc: position-based, avoids deprecated Series[int].
Power = np.sum(p_value < 0.05)/number_of_repetitions
print('N=',N_resampled,'Power=',Power)

<div class="alert alert-block alert-success">

**Q:** Compare this code to the code in Mini 5. What has changed?

**A:** ... we include `N_resampled` as a new variable, and use it in the code.

</div>

<div class="alert alert-block alert-success">

**Q:** How does increasing the sample size `N_resampled` impact the power?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** At what value of `N_resampled` is the power just above 0.8?

- How does this compare to the results from your online calculator?

**A:**

</div>
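
Instead of editing `N_resampled` by hand and re-running the cell, we could wrap the loop in a function and sweep over several candidate sample sizes. The sketch below is one way to do so. The function name `resampled_power`, the candidate sizes, and the synthetic pilot data are all assumptions for illustration, and the sketch expects 1-D arrays (so with the notebook's data you would likely pass `x.ravel()` and `lifespan.ravel()`). It uses `scipy.stats.linregress` in place of `ols` to keep the loop short.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(6)               # Assumption: seeded for repeatability.

def resampled_power(x_data, y_data, n_resampled, n_reps=500, alpha=0.05):
    """Estimate power by resampling n_resampled points with replacement."""
    n_significant = 0
    for _ in range(n_reps):
        ind = rng.choice(x_data.size, size=n_resampled)
        if linregress(x_data[ind], y_data[ind]).pvalue < alpha:
            n_significant += 1
    return n_significant / n_reps

# Synthetic stand-in for the original data (made-up numbers; 1-D arrays).
x_pilot = rng.normal(0.0, 1.0, 100)
y_pilot = 75.0 + 1.5 * x_pilot + rng.normal(0.0, 15.0, 100)

# Estimate the power at each candidate sample size.
candidate_sizes = [100, 200, 400, 800, 1600]
powers = [resampled_power(x_pilot, y_pilot, n) for n in candidate_sizes]
for n, pw in zip(candidate_sizes, powers):
    print('N =', n, ' Power =', pw)
```

The smallest candidate with power above 0.8 gives a first estimate of the sample size; a finer grid around that value would sharpen it.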

## 6* (Optional Extension) - You have the sample size!

In this optional Mini, we consider an **alternative strategy to compute the sample size**.

Instead of resampling the data, we **fit a model to the data**, then use that model to generate new data samples.

---
If you haven't done so, please first complete *Mini 5\* (Extensions)*.

In that optional Mini, we introduce the modeling approach, and apply it to compute the power.

---
Let's now use this same strategy to compute the sample size.

The first step is to estimate the model. We use the same code as in *Mini 5\* (Extensions)*.

In [None]:
# Fit the model (a line) to the data.
from statsmodels.formula.api import ols
dat   = {"x": x, "lifespan": lifespan}
model = ols("lifespan ~ 1 + x", data=dat).fit()

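# Use position-based .iloc: plain integer indexing of a labeled pandas Series is deprecated.
intercept  = model.params.iloc[0]
slope      = model.params.iloc[1]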
dispersion = np.sqrt(model.scale)

We'll now use the model to generate new samples of data, given a sample size `N_modeled`.

To do so, we'll re-use the code from Mini 6, with some small changes.

In [None]:
N_modeled = 800                                   # This is the new part!
number_of_repetitions = 1000
p_value = np.zeros(number_of_repetitions)
for k in np.arange(number_of_repetitions):
    ind = np.random.choice(np.size(x), N_modeled)  # There's something new here, and in the next line.
    lifespan_modeled = intercept + slope * x[ind] + np.random.normal(loc=0.0, scale=dispersion, size=[N_modeled,1])
    dat                = {"x": x[ind], "lifespan": lifespan_modeled}
    regression_results = ols("lifespan ~ 1 + x", data=dat).fit()
    p_value[k] = regression_results.pvalues.iloc[1]   # .iloc: position-based, avoids deprecated Series[int].
Power = np.sum(p_value < 0.05)/number_of_repetitions
print('N=',N_modeled,'Power=',Power)

<div class="alert alert-block alert-danger">
<b>Alert:</b>
</p>

- Notice the use of `ind` and `x[ind]` in the code. Why do we need this here?
</div>

<div class="alert alert-block alert-success">

**Q:** Compare this code to the code in Mini 6. What has changed?

**A:** ... we include `N_modeled` as a new variable.

</div>

<div class="alert alert-block alert-success">

**Q:** At what value of `N_modeled` is the power just above 0.8?

- How does this compare to the results from your online calculator?
- How does this result compare to your resampling approach in Mini 6?

**A:**

</div>
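
For the comparison with an online calculator, it can help to see the kind of formula such a calculator might apply. The sketch below uses a standard textbook approximation based on the Fisher z-transform of a correlation coefficient; whether a particular website uses this exact formula is an assumption. The function name `approx_sample_size_for_correlation` is hypothetical, and the input is the correlation between $x$ and lifespan, not the slope.

```python
import numpy as np
from scipy.stats import norm

def approx_sample_size_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate N to detect correlation r (two-sided test), via the Fisher z-transform."""
    z_alpha = norm.ppf(1.0 - alpha / 2.0)    # Critical value for the significance level.
    z_power = norm.ppf(power)                # Quantile for the desired power.
    return int(np.ceil(((z_alpha + z_power) / np.arctanh(r)) ** 2 + 3))

# Weak correlations require far larger samples than strong ones.
for r in [0.1, 0.3, 0.5]:
    print('r =', r, ' approximate N =', approx_sample_size_for_correlation(r))
```

The required sample size depends strongly on the assumed correlation, so a calculator fed an optimistic effect size will return a sample size that is too small.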

## 7 - Well powered experiments can provide strong evidence

- We've now used (at least) two approaches to calculate the sample size
  1) online calculator
  2) resampling
- We've found that resampling suggests a much larger sample size is required for sufficient power to detect the effect.
- Let’s now “collect new data” and see if we detect a significant effect.

Let's begin by collecting the new data, now using the sample size you found through resampling.

In [None]:
N = 1000                              # NOTE: Learner inputs the sample size, based on results in previous Minis.
x,lifespan = load_data(N)             # Use this by default.
# x,lifespan = load_data_Colab(N)     # Use this if on !!GOOGLE COLAB!!

Let's plot it.

In [None]:
plt.scatter(x,lifespan)
plt.xlabel('Genetic biomarker x')
plt.ylabel('Lifespan (years)');

<div class="alert alert-block alert-success">

**Q:** Compare the plot of these new data (with the larger sample size `N`) to the plot of the original data. How are the plots similar or different?

- Do you see the same trend in both plots?

**A:**

</div>

Now, let's test our hypothesis in this new data set.

In [None]:
from statsmodels.formula.api import ols

dat                = {"x": x, "lifespan": lifespan}
regression_results = ols("lifespan ~ 1 + x", data=dat).fit()
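# Use position-based .iloc: plain integer indexing of a labeled pandas Series is deprecated.
print('Slope estimate =',regression_results.params.iloc[1])
print('p-value        =',regression_results.pvalues.iloc[1])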

<div class="alert alert-block alert-success">

**Q:** What is the effect size and significance?

- How do these results compare to the original data set?

**A:**

</div>

Let's also **visualize** the estimated line by plotting it with the data.

In [None]:
pred   = regression_results.get_prediction().summary_frame()
mn     = pred['mean']
ci_low = pred['mean_ci_lower'] 
ci_upp = pred['mean_ci_upper']

# And plot it.
indices_sorted = np.argsort(x,0)
plt.scatter(x,lifespan)
plt.plot(x[indices_sorted[:,0]],mn[indices_sorted[:,0]], 'r')
plt.plot(x[indices_sorted[:,0]],ci_low[indices_sorted[:,0]], ':r')
plt.plot(x[indices_sorted[:,0]],ci_upp[indices_sorted[:,0]], ':r')
plt.xlabel('Genetic biomarker x')
plt.ylabel('Lifespan (years)');

<div class="alert alert-block alert-success">

**Q:** How do these results compare to the original data set?

**A:**

</div>

<div class="alert alert-block alert-success">

**Q:** What do you conclude about the relationship between the genetic biomarker `x` and `lifespan`?

**A:**

</div>

<div class="alert alert-block alert-info">

*Conclusion*:

- The resampling procedure allowed us to compute a large enough sample size, so our experiment was sufficiently powered, and we detected a significant effect.


</div>

## 8 - Summary

- Internet searches or stats 101 textbooks will arrive at sample sizes of 10-100.
- Here's a sketch of the initial naive intuition of lifespan vs $x$.
    - We usually underestimate variability in lifespan → underestimate sample size.
- However, the actual required sample size is more like 500-1000.
    - This is due to the large variability in human lifespans compared to the relatively small effect size.
- Although the effect size is meaningful scientifically, it is small compared to the measurement variability.
- Warning: initial draw of a small sample size may produce (by chance) an opposite effect. In that case, resampling will not produce meaningful power/sample size results.

<div class="alert alert-block alert-success">

**Q:** Consider a new data set (provided). Use these data to estimate the sample size required to achieve 80% power.

**A:**

</div>