---
title: Putting the p-value in Context
project:
  type: website
format:
  html:
    code-fold: true
    code-tools: true
jupyter: python 3
number-sections: false
filters:
    - pyodide
---

Neuroscience researchers typically report p-values to express the strength of statistical evidence, but p-values alone are not sufficient to understand the meaning and value of a scientific inference. In this unit, you will learn how to interpret the p-value, how to express the size of an effect and the uncertainty about a result, and how to interpret results at both the individual and population levels.

# 1 - You want to do some science; your PI just wants the p's!

<div class="alert alert-block alert-info">

## Introduction

- You work in a sleep lab studying the effect of a new treatment regimen on memory consolidation during sleep.

- Your lab collects an EEG biomarker of memory ([sleep spindles](https://en.wikipedia.org/wiki/Sleep_spindle)) from N=20 human subjects.

- To do so, your lab measures the [power](https://mark-kramer.github.io/Case-Studies-Python/03.html) in the spindle band (9-15 Hz) twice per minute. Your lab has a [reliable method](https://mark-kramer.github.io/Case-Studies-Python/04.html#multitaper) to detect spindle activity; this detector is known to have small measurement errors outside of treatment. You expect it to still work during treatment, but also expect more variability in the spindle power estimates (hence more variability in the detections) during treatment.
    
- For each subject, your lab measures spindle activity during three conditions:
    
    - **Baseline**: Data collection lasts 7 hours while the subject sleeps the night before the intervention. This results in 840 samples of spindle activity for each subject.
    
    - **During Treatment**: Data collection lasts 15 minutes during the intervention while the subject sleeps. This results in 30 samples of spindle activity for each subject.
    
    - **Post-treatment**: Data collection after intervention lasts 7 hours, while the subject sleeps, resulting in 840 samples of spindle activity for each subject.

Here's a graphical representation of the data collected from one subject:
![title](IMG_Pvalue/Data_Layout.jpg)

Your PI says: "*I hypothesize that some subjects will show an increase in spindle activity as a result of this treatment. Other subjects may not respond to the treatment. Conduct a hypothesis test for each subject to determine if they are responsive and report the p-values associated with each test.*"

</div>

::: {.callout-note}
## What information would the p-values associated with these hypothesis tests provide?

<form id="quiz-form">
    <input type="radio" name="answer" value="incorrect"> The p-values indicate which subjects show an increase in spindle activity as a result of treatment. <br>
    <input type="radio" name="answer" value="incorrect"> The p-values indicate subjects who are more likely to show an increase in spindle activity if the treatment were applied again.<br>
    <input type="radio" name="answer" value="incorrect"> The p-values indicate subjects for whom the null hypothesis that treatment has no effect is probably false. <br>
    <input type="radio" name="answer" value="incorrect"> The p-values indicate subjects for whom the effect of the treatment was large enough to be of scientific significance. <br>
    <input type="radio" name="answer" value="correct"> The p-values indicate the probability under a specific statistical model that a selected statistic would be equal to or more extreme than its observed value if treatment does not have an effect on spindle activity.<br><br>
    <button type="button" onclick="checkAnswer()">Submit</button>
</form>

<p id="feedback"></p>

<script>
function checkAnswer() {
    var selected = document.querySelector('input[name="answer"]:checked');
    var feedback = document.getElementById("feedback");
    
    if (!selected) {
        feedback.innerHTML = "Please select an answer.";
        feedback.style.color = "blue";
        return;
    }

    if (selected.value === "correct") {
        feedback.innerHTML = "✅ Correct! But does this answer the question you want to ask?";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Incorrect. Why?";
        feedback.style.color = "red";
    }
}
</script>

:::

Fundamentally, p-values indicate how incompatible a data set is with a specified statistical model, but they do not express the probability that any scientific hypothesis is correct, or whether one hypothesis is more likely to be true than another. P-values can be part of a strong statistical argument, but they do not provide a robust measure of evidence about a hypothesis on their own. In particular, a p-value needs to be paired with a measure of effect size to describe whether an effect is scientifically meaningful.
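
To see why a p-value needs to be paired with an effect size, here is a minimal sketch using made-up summary statistics (the means, standard deviation, and sample sizes are assumptions for illustration, not values from the spindle data). The effect size is held fixed while only the sample size changes.

```python
from scipy import stats

# Hypothetical summary statistics (not from the spindle data):
# two groups whose means differ by 0.1, both with standard deviation 1.
mean_a, mean_b, sd = 0.0, 0.1, 1.0

# Cohen's d: the mean difference in units of the standard deviation.
cohens_d = (mean_b - mean_a) / sd

p_by_n = {}
for n in (50, 5000):
    # Two-sample t-test computed from summary statistics alone.
    result = stats.ttest_ind_from_stats(mean_b, sd, n, mean_a, sd, n)
    p_by_n[n] = result.pvalue
    print(f"n per group = {n:4d}, Cohen's d = {cohens_d:.2f}, p = {result.pvalue:.2g}")
```

The effect size is identical in both cases, yet the p-value is large with 50 samples per group and very small with 5000. The p-value alone does not tell you whether the effect is big enough to matter.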

Another important issue is that p-values are often only meaningful in cases where the scientific question to answer has only a binary outcome. Most scientific questions require more than yes or no answers, but it is common to see researchers try to shoehorn their experiments to produce binary outcomes just so that they can express their results using p-values. This risks throwing away useful information and decreasing the statistical power of an argument. 

::: {.callout-note}
## For each scenario, is it most appropriate to use a p-value?

<form id="pvalue-quiz">
    <ol>
        <li>
            Does the proportion of people who prefer brand A over brand B differ from 50%?<br>
            <input type="radio" name="q1" value="yes" onclick="showFeedback(1, 'yes')"> Yes
            <input type="radio" name="q1" value="no" onclick="showFeedback(1, 'no')"> No
            <p id="feedback1" class="feedback"></p>
        </li><br>
        <li>
            What is the exact average height of an adult giraffe in meters?<br>
            <input type="radio" name="q2" value="yes" onclick="showFeedback(2, 'yes')"> Yes
            <input type="radio" name="q2" value="no" onclick="showFeedback(2, 'no')"> No
            <p id="feedback2" class="feedback"></p>
        </li><br>
        <li>
            Does a new vaccine significantly reduce infection rates compared to the old vaccine?<br>
            <input type="radio" name="q3" value="yes" onclick="showFeedback(3, 'yes')"> Yes
            <input type="radio" name="q3" value="no" onclick="showFeedback(3, 'no')"> No
            <p id="feedback3" class="feedback"></p>
        </li><br>
        <li>
            How long does it take to completely drain a 50,000-gallon swimming pool?<br>
            <input type="radio" name="q4" value="yes" onclick="showFeedback(4, 'yes')"> Yes
            <input type="radio" name="q4" value="no" onclick="showFeedback(4, 'no')"> No
            <p id="feedback4" class="feedback"></p>
        </li><br>
        <li>
            Does changing the color of a website’s button increase user clicks statistically significantly?<br>
            <input type="radio" name="q5" value="yes" onclick="showFeedback(5, 'yes')"> Yes
            <input type="radio" name="q5" value="no" onclick="showFeedback(5, 'no')"> No
            <p id="feedback5" class="feedback"></p>
        </li>
    </ol>
</form>

<script>
const feedbackText = {
    1: {
        yes: "✅ Perhaps. The p-value helps test if the proportion differs from 50%, but doesn’t express how many prefer Brand A.",
        no:  "⚠️ Not quite. You might miss detecting if there's statistical evidence that the proportion differs from 50%."
    },
    2: {
        yes: "⚠️ Not quite. A p-value is for hypothesis testing, not for estimating an exact value.",
        no:  "✅ Correct. We’re just estimating a quantity, not testing a hypothesis."
    },
    3: {
        yes: "✅ Perhaps. A p-value helps test if the difference is statistically significant. But consider also reporting a confidence interval.",
        no:  "⚠️ Not quite. You might still want to test whether the difference is significant."
    },
    4: {
        yes: "⚠️ Not quite. This is a direct measurement, not a hypothesis test.",
        no:  "✅ Correct. Measuring a quantity doesn’t require a p-value."
    },
    5: {
        yes: "✅ Perhaps. A p-value can help test if the click rate change is significant.",
        no:  "⚠️ Not quite. You might miss a statistically significant improvement in click rates."
    }
};

function showFeedback(questionNumber, choice) {
    const feedback = document.getElementById(`feedback${questionNumber}`);
    feedback.innerHTML = feedbackText[questionNumber][choice];
    feedback.style.color = choice === 'yes' ? 'green' : 'blue';
}
</script>

:::


::: {.callout-note}
## Discussion
Based on your understanding of how to interpret p-values, are there any concerns about your PI's analysis plan to just report p-values for separate tests conducted on each subject? What other approaches could you use to evaluate the effect of treatment in the subject population?
:::

At this point, perhaps you feel that we should just get rid of p-values entirely. Good idea! Many researchers and statisticians agree with you. But despite multiple organized efforts to downplay the use of p-values in scientific research, a focus on computing and reporting p-values has persisted.

You have a detailed discussion with your PI about the issues with focusing on p-values for this study, but your PI says: “We’re not going to be able to publish anything unless we show statistical significance so just give me the p’s!”

---

# 2 - Let’s do it: Define & compute p-values.

Before we compute p-values, let's consider what a p-value means.


### What does a p-value mean?

A p-value is used to compare two competing hypotheses. If our scientific hypothesis is that spindle activity changes during treatment relative to its baseline level of activity, we need another hypothesis to compare this to. In this case, we can hypothesize that the spindle activity does not change during treatment.  This is called the **null hypothesis**. 

Our goal is to collect data that provides evidence in favor of our scientific hypothesis over the null hypothesis. But this is not a fair fight; we start by assuming that the null hypothesis is true and only reject it after we achieve a sufficiently high bar of evidence. The p-value tells us how high a bar we have achieved.

One useful analogy is proof by contradiction. There, we assume that a hypothesis is true and show that this assumption leads to a contradiction. If we were to observe data that could not possibly occur if the null hypothesis were true, this would be definitive evidence against that hypothesis. However, observing data that are merely unlikely under the null hypothesis does not mean that the null hypothesis is itself unlikely.

For example, most people would agree with the following statement: "If a person is American, they probably are not the US President." Now imagine that we select an individual at random and they happen to be the US President. It is clearly not the case that this individual is probably not American. While the observation that this individual is the US President is unlikely under the null hypothesis that this individual is American, it is much more unlikely (or impossible) under the alternative hypothesis that they are not American.
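
The arithmetic behind this example can be sketched in a few lines. The population figure is a rough assumption used only for illustration.

```python
# Rough, assumed numbers for illustration only.
n_americans = 330_000_000        # approximate US population (assumption)
n_presidents = 1                 # one sitting US President
n_american_presidents = 1        # the President is American

# P(President | American): tiny, so the observation is "surprising" under the null.
p_president_given_american = n_american_presidents / n_americans

# P(American | President): the probability we actually care about.
p_american_given_president = n_american_presidents / n_presidents

print(f"P(President | American) = {p_president_given_american:.1e}")
print(f"P(American | President) = {p_american_given_president:.0f}")
```

The first probability plays the role of a p-value: it is very small. The second is the probability that the hypothesis is true given the observation, and it is not small at all. The two should not be confused.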

Another useful analogy is to a prosecutor at a trial. In this analogy, the null hypothesis is akin to the hypothesis that the defendant is innocent. The court assumes that the defendant is innocent until proven guilty. The prosecutor tries to amass and present evidence to demonstrate that the defendant is guilty beyond a reasonable doubt. A strong argument needs to include evidence that would be unlikely to occur if the hypothesis that the defendant is innocent is true, and more likely to occur if the hypothesis that the defendant is guilty is true. If the prosecutor fails to provide sufficient evidence that the defendant is guilty, it doesn’t necessarily mean that they are innocent.

In a statistical test, the p-value indicates how surprising our evidence would be if the null hypothesis were true. For our problem, if we're sufficiently surprised by the observed data, then we'll reject the null hypothesis, and conclude that we have evidence that the spindle activity changes relative to baseline.

Alternatively, if we're not surprised by the observed data, then we'll conclude that we lack sufficient evidence to reject the null hypothesis. There's an important subtlety here that statisticians like to point out - when we're testing this way, **we never accept the null hypothesis**. Instead, the best we can do is talk like a statistician and say things like *"We fail to reject the null hypothesis"*. In our court analogy, this is equivalent to finding the defendant not guilty rather than innocent, because we realize that it is possible that the defendant committed the crime but we lacked the evidence to convince a jury beyond a reasonable doubt.
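
To make "how surprising" concrete, here is a minimal sketch of a permutation test, one of several ways to compute a p-value. The two groups of numbers are hypothetical, and the variable names are chosen only for this illustration. If the null hypothesis were true, the group labels would be interchangeable, so we shuffle the labels many times and ask how often the shuffled difference in means is at least as extreme as the one we observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements (not the spindle data).
group_a = np.array([0.1, -0.4, 0.3, 0.0, -0.2, 0.5, -0.1, 0.2])
group_b = np.array([1.1, 0.7, 1.4, 0.9, 1.3, 0.8, 1.2, 1.0])

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_perm = 5000
count = 0
for _ in range(n_perm):
    # Under the null hypothesis, group labels carry no information: shuffle them.
    shuffled = rng.permutation(pooled)
    diff = shuffled[len(group_a):].mean() - shuffled[:len(group_a)].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Add one to numerator and denominator so the estimate is never exactly zero.
p_value = (count + 1) / (n_perm + 1)
print(f"Observed difference = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

Here the two groups do not overlap at all, so almost no shuffle produces a difference as large as the observed one, and the p-value is very small.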

Multiple factors impact the evidence we have to reject a null hypothesis. In this Unit, we'll explore these factors and how they influence the p-values we compute.

::: {.callout-note}
## The PI says: *“I expect that during treatment the spindle activity exceeds the baseline spindle activity.”* What is the null hypothesis?

<form id="null-hypothesis-form">
    <input type="radio" name="null-hypothesis" value="wrong1"> The average spindle activity during treatment is guaranteed to be higher than baseline.<br>
    <input type="radio" name="null-hypothesis" value="wrong2"> The average spindle activity during treatment is guaranteed to be lower than baseline.<br>
    <input type="radio" name="null-hypothesis" value="wrong3"> A difference in average spindle activity exists between treatment and baseline.<br>
    <input type="radio" name="null-hypothesis" value="correct"> No difference in average spindle activity exists between treatment and baseline.<br><br>
    <button type="button" onclick="checkNullHypothesis()">Submit</button>
</form>

<p id="null-hypothesis-feedback"></p>

<script>
function checkNullHypothesis() {
    var selected = document.querySelector('input[name="null-hypothesis"]:checked');
    var feedback = document.getElementById("null-hypothesis-feedback");
    
    if (!selected) {
        feedback.innerHTML = "Please select an answer.";
        feedback.style.color = "blue";
        return;
    }

    if (selected.value === "correct") {
        feedback.innerHTML = "✅ Correct. The null hypothesis typically assumes no effect or no difference.";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Incorrect. The null hypothesis typically assumes no difference or no effect.";
        feedback.style.color = "red";
    }
}
</script>
:::


### What does p<0.05 mean?

The probability of observing the data, or something more extreme, under the null hypothesis is less than 5%. This is typically considered sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis (which posits that there is an effect or a difference). In other words, a p-value less than 0.05 suggests that the observed data is unlikely to have occurred by random chance alone, assuming the null hypothesis is true, leading researchers to reject the null hypothesis.
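
One way to build intuition for the 0.05 threshold is to simulate many experiments in which the null hypothesis is true by construction. The sketch below uses synthetic data (the group size of 30 and the 2000 repetitions are arbitrary assumptions) and counts how often a two-sample t-test still returns p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 2000
n_per_group = 30

# Both groups come from the SAME distribution, so the null hypothesis is true.
group_1 = rng.normal(size=(n_per_group, n_experiments))
group_2 = rng.normal(size=(n_per_group, n_experiments))

# ttest_ind works along axis 0, giving one p-value per column (experiment).
null_p_values = stats.ttest_ind(group_1, group_2).pvalue

fraction_significant = np.mean(null_p_values < 0.05)
print(f"Fraction of null experiments with p < 0.05: {fraction_significant:.3f}")
```

Roughly 5% of these experiments cross the threshold even though there is no effect to find. That is what the threshold controls: the rate of false alarms when the null hypothesis holds.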

In our case, the null hypothesis we will first investigate is:

**Null hypothesis:** No difference in average spindle activity between treatment and baseline conditions.

::: {.callout-note}
## Which of the following factors might impact the evidence you have to reject this null hypothesis?
(*Select all that apply*)

<form id="factors-form">
    <input type="checkbox" name="factors" value="sample_size"> Sample Size: Collecting more spindle samples reduces random error and provides more precise estimates.<br>
    <input type="checkbox" name="factors" value="effect_size"> Effect Size: Bigger differences in spindle activity between conditions are easier to detect.<br>
    <input type="checkbox" name="factors" value="variability"> Variability (or Noise) in Measurements: High variability in the spindle estimates during treatment can make it harder to detect a real effect.<br><br>
    <button type="button" onclick="checkFactors()">Submit</button>
</form>

<p id="factors-feedback"></p>

<script>
function checkFactors() {
    const correct = ["sample_size", "effect_size", "variability"];
    const selected = Array.from(document.querySelectorAll('input[name="factors"]:checked')).map(cb => cb.value);
    const feedback = document.getElementById("factors-feedback");

    if (selected.length === 0) {
        feedback.innerHTML = "Please select at least one factor.";
        feedback.style.color = "blue";
        return;
    }

    const missed = correct.filter(x => !selected.includes(x));
    const incorrect = selected.filter(x => !correct.includes(x));

    if (missed.length === 0 && incorrect.length === 0) {
        feedback.innerHTML = "✅ Correct! All of these factors influence the evidence against the null hypothesis.";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Not quite. All listed factors affect the strength of evidence you have to reject the null hypothesis.";
        feedback.style.color = "red";
    }
}
</script>
:::


Now, let's load the spindle data and compute p-values to test our null hypothesis:

```{pyodide-python}
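import pandas as pd

# Load custom functions (this defines load_data, used below)
import requests
url = "https://raw.githubusercontent.com/Mark-Kramer/METER-Units/refs/heads/main/pvalue_functions-SHORT.py"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
exec(response.text)

# Load the three recording conditions
baseline, during_treatment, post_treatment = load_data()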
```

Let's start by investigating the structure of the data.

```{pyodide-python}
print(baseline.shape)
print(during_treatment.shape)
print(post_treatment.shape)
```

All three variables consist of observations from **20 subjects** (the number of columns).

During baseline:  We collect **840 samples** per subject.

During treatment: We collect **30 samples** per subject.

After treatment: We collect **840 samples** per subject.

The number of samples is the number of rows for each variable.

You might think of these variables as rectangles (or matrices), where each row indicates a sample of spindle activity and each column indicates a subject:

![](IMG_Pvalue/simple_boxes_during_and_post.jpg)
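
If the rows-and-columns layout is new to you, the sketch below builds synthetic stand-ins with the same shapes (the `demo_` arrays are made up for illustration and contain random numbers, not spindle data) and shows how to pull out one subject or one sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins with the same layout as the real arrays:
# rows are samples, columns are subjects.
demo_baseline = rng.normal(size=(840, 20))
demo_treatment = rng.normal(size=(30, 20))

n_samples, n_subjects = demo_treatment.shape
print("Samples during treatment:", n_samples)
print("Subjects:", n_subjects)

# One column: every sample from the first subject.
subject_0 = demo_treatment[:, 0]
# One row: the first sample from every subject.
sample_0 = demo_treatment[0, :]
print(subject_0.shape, sample_0.shape)
```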

::: {.callout-note}
## Look at the representations of the data above. What differs about the data during treatment, compared to baseline and post-treatment?

<form id="treatment-difference-form">
    <input type="radio" name="treatment-difference" value="wrong"> There are fewer subjects during treatment.<br>
    <input type="radio" name="treatment-difference" value="correct"> There are fewer samples during treatment.<br><br>
    <button type="button" onclick="checkTreatmentDifference()">Submit</button>
</form>

<p id="treatment-difference-feedback"></p>

<script>
function checkTreatmentDifference() {
    const selected = document.querySelector('input[name="treatment-difference"]:checked');
    const feedback = document.getElementById("treatment-difference-feedback");

    if (!selected) {
        feedback.innerHTML = "Please select an answer.";
        feedback.style.color = "blue";
        return;
    }

    if (selected.value === "correct") {
        feedback.innerHTML = "✅ Correct. There are fewer samples during treatment.";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Incorrect. The number of subjects stays the same — it’s the number of samples that’s fewer.";
        feedback.style.color = "red";
    }
}
</script>
:::

To get a sense for the data, let's plot the spindle activity during the `baseline`, `treatment`, and `post-treatment` conditions for one subject:

```{pyodide-python}
import matplotlib.pyplot as plt

# One panel per condition, all for the first subject (column 0)
f, ax = plt.subplots(3,1)
ax[0].plot(baseline[:,0]);         ax[0].set_xlim([0,850]); ax[0].set_ylim([-6,6]); ax[0].set_title('Baseline');
ax[1].plot(during_treatment[:,0]); ax[1].set_xlim([0,850]); ax[1].set_ylim([-6,6]); ax[1].set_title('During Treatment');
ax[2].plot(post_treatment[:,0]);   ax[2].set_xlim([0,850]); ax[2].set_ylim([-6,6]); ax[2].set_title('Post-Treatment');
plt.xlabel('Samples'); plt.ylabel('Spindle activity');
plt.show()
```

::: {.callout-note}
## What values do you observe for the spindle activity?

<form id="spindle-values-form">
    <input type="radio" name="spindle-values" value="wrong1"> The values are always positive.<br>
    <input type="radio" name="spindle-values" value="wrong2"> The values are always negative.<br>
    <input type="radio" name="spindle-values" value="wrong3"> The values stay constant at zero.<br>
    <input type="radio" name="spindle-values" value="correct"> The values tend to fluctuate around 0 and can be both positive and negative.<br><br>
    <button type="button" onclick="checkSpindleValues()">Submit</button>
</form>

<p id="spindle-values-feedback"></p>

<script>
function checkSpindleValues() {
    const selected = document.querySelector('input[name="spindle-values"]:checked');
    const feedback = document.getElementById("spindle-values-feedback");

    if (!selected) {
        feedback.innerHTML = "Please select an answer.";
        feedback.style.color = "blue";
        return;
    }

    if (selected.value === "correct") {
        feedback.innerHTML = "✅ Correct. Spindle activity fluctuates around zero, taking both positive and negative values.";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Incorrect. The spindle activity values actually fluctuate around zero and include both positive and negative values.";
        feedback.style.color = "red";
    }
}
</script>
:::


Here, the spindle activity has been [z-scored](https://en.wikipedia.org/wiki/Standard_score) during each recording interval relative to baseline. 

So, the values we observe indicate changes relative to the mean baseline spindle activity.

Positive (negative) values indicate increases (decreases) in spindle activity relative to the baseline activity.
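
The exact preprocessing is not shown here, so the following is only a sketch of one plausible reading of "z-scored relative to baseline": subtract the baseline mean and divide by the baseline standard deviation, in every condition. The raw values are synthetic and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical raw spindle power for one subject (arbitrary units).
raw_baseline = rng.normal(loc=10.0, scale=2.0, size=840)
raw_treatment = rng.normal(loc=11.0, scale=4.0, size=30)

# Reference statistics come from the baseline recording only.
mu = raw_baseline.mean()
sigma = raw_baseline.std()

# Apply the same baseline reference to every condition.
z_baseline = (raw_baseline - mu) / sigma
z_treatment = (raw_treatment - mu) / sigma

print(f"Baseline:  mean = {z_baseline.mean():.2f}, sd = {z_baseline.std():.2f}")
print(f"Treatment: mean = {z_treatment.mean():.2f}, sd = {z_treatment.std():.2f}")
```

By construction the z-scored baseline has mean 0 and standard deviation 1, so a treatment value of, say, 2 means "two baseline standard deviations above the baseline mean".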

::: {.callout-note}
## What differences or similarities do you notice in spindle activity during the baseline, treatment, and post‐treatment conditions for this subject?

<form id="spindle-comparison-form">
    <input type="radio" name="spindle-compare" value="wrong1">
    All three conditions have the same number of samples and similar fluctuation sizes.<br>
    <input type="radio" name="spindle-compare" value="wrong2">
    Only baseline and post‐treatment fluctuate around zero; treatment values stay positive.<br>
    <input type="radio" name="spindle-compare" value="wrong3">
    Treatment has more samples than baseline and post‐treatment, with smaller fluctuations.<br>
    <input type="radio" name="spindle-compare" value="correct">
    All conditions fluctuate around zero; during treatment there are fewer samples and larger fluctuations (greater variability).<br><br>
    <button type="button" onclick="checkSpindleComparison()">Submit</button>
</form>

<p id="spindle-comparison-feedback"></p>

<script>
function checkSpindleComparison() {
    const selected = document.querySelector('input[name="spindle-compare"]:checked');
    const feedback = document.getElementById("spindle-comparison-feedback");

    if (!selected) {
        feedback.innerHTML = "Please select an answer.";
        feedback.style.color = "blue";
        return;
    }

    if (selected.value === "correct") {
        feedback.innerHTML = "✅ Correct. All series fluctuate around zero; treatment has fewer samples and greater variability.";
        feedback.style.color = "green";
    } else {
        feedback.innerHTML = "❌ Incorrect. Recall that all data hover around zero, but during treatment you see fewer samples and larger fluctuations.";
        feedback.style.color = "red";
    }
}
</script>
:::


::: {.callout-note}
## During treatment, we collect fewer, noisier samples compared to the baseline and post‐treatment conditions. How might these factors impact the evidence we have to reject the null hypothesis?

**Null hypothesis:** No difference in average spindle activity between treatment and baseline conditions.

<form id="impact-form">
  <input type="checkbox" name="impact" value="fewer_increases"> Fewer spindle samples increase our evidence.<br>
  <input type="checkbox" name="impact" value="fewer_decreases"> Fewer spindle samples decrease our evidence.<br>
  <input type="checkbox" name="impact" value="noisier_increases"> Noisier spindle samples increase our evidence.<br>
  <input type="checkbox" name="impact" value="noisier_decreases"> Noisier spindle samples decrease our evidence.<br><br>
  <button type="button" onclick="checkImpact()">Submit</button>
</form>

<p id="impact-feedback"></p>

<script>
function checkImpact() {
  const correct = ["fewer_decreases", "noisier_decreases"];
  const selected = Array.from(
    document.querySelectorAll('input[name="impact"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("impact-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = "✅ Correct! Fewer and noisier samples both decrease our evidence against the null.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Both fewer samples and higher noise reduce the strength of evidence to reject the null.";
    feedback.style.color = "red";
  }
}
</script>
:::
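
The effect of fewer and noisier samples can be checked with a quick calculation. The sketch below uses assumed summary statistics (a shift of 0.3 and the standard deviations shown are hypothetical, not estimates from the spindle data) and holds the size of the shift fixed while changing only the number of treatment samples and their variability.

```python
from scipy import stats

# Hypothetical z-scored summary statistics (not estimated from the spindle data).
base_mean, base_sd, base_n = 0.0, 1.0, 840
treat_mean = 0.3   # assumed shift during treatment, identical in every scenario

# (treatment standard deviation, number of treatment samples)
scenarios = {
    "840 samples, sd = 1": (1.0, 840),
    "30 samples, sd = 1": (1.0, 30),
    "30 samples, sd = 2": (2.0, 30),
}

p_by_scenario = {}
for label, (treat_sd, treat_n) in scenarios.items():
    # Two-sample t-test computed from summary statistics alone.
    result = stats.ttest_ind_from_stats(treat_mean, treat_sd, treat_n,
                                        base_mean, base_sd, base_n)
    p_by_scenario[label] = result.pvalue
    print(f"{label:>20s}: p = {result.pvalue:.2g}")
```

The same shift in the mean gives a very small p-value with 840 samples, but a much larger one with only 30 samples, and a larger one still when those 30 samples are noisier.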


### Is there a significant effect of treatment? Let's now compute some p-values.

To do so, we again assume the null hypothesis:

**Null hypothesis:** *No difference in average spindle activity between treatment and baseline conditions.*

To test this hypothesis, we'll compute a [two-sample t-test](https://en.wikipedia.org/wiki/Student%27s_t-test#:~:text=.-,Two%2Dsample%20t%2Dtests,-%5Bedit%5D).

The two-sample t-test is by far the most popular choice when comparing two distributions.

We use this method to determine if there's a significant difference between the means of two independent groups. It's commonly applied to compare the average values of (continuous) variables across two different populations or conditions.

In our case, we'd like to compare the average spindle activity between treatment and baseline conditions. So, the two-sample t-test is (at first glance) a completely reasonable approach.
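
To see what the test actually computes, here is a minimal sketch that builds the t-statistic by hand and checks it against `scipy.stats.ttest_ind`. The data are synthetic stand-ins (`x` for a baseline-like group, `y` for a treatment-like group), and the pooled-variance form shown is the default, equal-variance version of the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic stand-ins for one subject (not the spindle data).
x = rng.normal(loc=0.0, scale=1.0, size=840)   # baseline-like
y = rng.normal(loc=0.5, scale=1.5, size=30)    # treatment-like

n_x, n_y = len(x), len(y)

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)

# t-statistic: difference in means divided by its standard error.
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n_x + 1 / n_y))

# Two-sided p-value from the t distribution with n_x + n_y - 2 degrees of freedom.
dof = n_x + n_y - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(x, y)
print(f"By hand: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"SciPy:   t = {result.statistic:.4f}, p = {result.pvalue:.4g}")
```

The numerator reflects the effect size, while the denominator grows with variability and shrinks with sample size. These are the same three factors discussed above.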


```{pyodide-python}
import scipy.stats as stats
# Two-sample t-test along axis 0 (the samples), giving one test per column (subject)
result = stats.ttest_ind(baseline, during_treatment)
p_values = result.pvalue
print(p_values)
```

The list above consists of 20 p-values, one for each subject.

Each p-value indicates the probability of observing the data, or something more extreme, under the null hypothesis:

**Null hypothesis**: *No difference in average spindle activity between treatment and baseline conditions.*

Let's print the p-values for each subject:

```{pyodide-python}
import numpy as np
for k in np.arange(0,20):
    print('Subject ', k, ', p=', np.array2string(p_values[k], precision=4))
```

::: {.callout-note}
## For Subject 0, we find *p* = 0.0189. What does this mean?

<form id="subject0-pvalue-form">
  <input type="radio" name="subject0-pvalue" value="wrong1">
    There is a 1.89% difference in the mean spindle rate during treatment versus baseline.<br>
  <input type="radio" name="subject0-pvalue" value="correct">
    There is a 1.89% probability of observing these data (or something more extreme) if the difference in mean spindle rate between treatment and baseline is zero.<br>
  <input type="radio" name="subject0-pvalue" value="wrong2">
    There is a 98.11% probability that the alternative hypothesis is correct.<br>
  <input type="radio" name="subject0-pvalue" value="wrong3">
    There is a 1.89% chance that the result is clinically meaningful.<br><br>
  <button type="button" onclick="checkSubject0P()">Submit</button>
</form>

<p id="subject0-pvalue-feedback"></p>

<script>
function checkSubject0P() {
  const selected = document.querySelector('input[name="subject0-pvalue"]:checked');
  const feedback = document.getElementById("subject0-pvalue-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. A p-value of 0.0189 is the probability of observing data this extreme (or more) under the null hypothesis of no difference in spindle rate between treatment and baseline conditions.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember, the p-value tells us how likely the observed data are under the null, not the chance the null is true.";
    feedback.style.color = "red";
  }
}
</script>
:::


Let's also plot the p-values:

```{pyodide-python}
plt.figure()
plt.stem(p_values)
plt.axhline(y=0.05, color='r', linestyle='--')
plt.xticks(np.arange(len(p_values)))  
plt.xlabel('Subject'); plt.ylabel('p value'); plt.title('During Treatment'); plt.yscale('log')
plt.show()
```

::: {.callout-note}
## Interpret the print‐out and plots of p‐values. What do you see?

<form id="pvalue-interpretation-form">
  <input type="radio" name="pvalue-interpret" value="correct">
    The p‐values are below 0.05, indicating significant effects across subjects.<br>
  <input type="radio" name="pvalue-interpret" value="wrong1">
    The p‐values are above 0.05, indicating no evidence of significant effects across subjects.<br>
  <input type="radio" name="pvalue-interpret" value="wrong2">
    All p‐values cluster exactly at the threshold, suggesting borderline significance.<br>
  <input type="radio" name="pvalue-interpret" value="wrong3">
    The p‐values form a uniform distribution from 0 to 1, indicating no pattern in the data.<br><br>
  <button type="button" onclick="checkPvalueInterpret()">Submit</button>
</form>

<p id="pvalue-interpret-feedback"></p>

<script>
function checkPvalueInterpret() {
  const selected = document.querySelector('input[name="pvalue-interpret"]:checked');
  const feedback = document.getElementById("pvalue-interpret-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The p‐values are all less than 0.05. But we're not done yet ...";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Notice that the p‐values are below 0.05.";
    feedback.style.color = "red";
  }
}
</script>
:::


## Do we have evidence to reject the null hypothesis?

Maybe. If we had performed only one statistical test, then we would typically reject the null hypothesis if

`p < 0.05`

But here we compute 20 tests (one for each subject).

When we perform multiple tests, it's important we consider the impact of **multiple comparisons**. We cover this topic in detail in the [Multiplicity Unit](https://github.com/Mark-Kramer/METER-Units/blob/main/METER_Exploratory.ipynb).
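
A short calculation shows why this matters. Assuming the tests are independent (a simplifying assumption) and every null hypothesis is true, the chance of at least one false positive grows quickly with the number of tests.

```python
alpha = 0.05

fwer_by_m = {}
for m in (1, 5, 20):
    # Probability that at least one of m independent tests gives p < alpha
    # when every null hypothesis is true.
    fwer_by_m[m] = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests: P(at least one false positive) = {fwer_by_m[m]:.3f}")
```

With 20 tests at the usual 0.05 threshold, a false positive somewhere is more likely than not.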

Here we'll choose a specific approach to deal with multiplicity - we'll apply a **Bonferroni correction**. The Bonferroni correction divides the desired overall significance level (here 0.05) by the number of tests performed (here 20 tests, one test per subject) and compares each p-value to this stricter threshold. Doing so reduces the risk of false positives (Type I errors) across the whole set of tests; for more information, see the [Multiplicity Unit](https://github.com/Mark-Kramer/METER-Units/blob/main/METER_Exploratory.ipynb).
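
As a small illustration, the sketch below applies a Bonferroni correction to five made-up p-values (these are hypothetical numbers, not results from the spindle data, so the threshold here is 0.05 / 5). It also shows an equivalent formulation: multiply each p-value by the number of tests and compare to the original significance level.

```python
import numpy as np

# Hypothetical p-values from 5 tests (not results from the spindle data).
demo_p = np.array([0.0004, 0.012, 0.03, 0.2, 0.0019])
alpha = 0.05
m = len(demo_p)

# Option 1: compare each p-value to the stricter threshold alpha / m.
threshold = alpha / m
reject_by_threshold = demo_p < threshold

# Option 2: inflate each p-value by m (capped at 1) and compare to alpha.
adjusted_p = np.minimum(demo_p * m, 1.0)
reject_by_adjusted = adjusted_p < alpha

print("Bonferroni threshold:", threshold)
print("Rejected (threshold):", reject_by_threshold)
print("Rejected (adjusted): ", reject_by_adjusted)
```

Three of these p-values are below 0.05, but only two survive the correction. Both formulations give the same decisions.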

So, for our analysis of the p-values from 20 subjects, let's compare the p-values to a stricter threshold of

`p < 0.05 / 20` or `p < 0.0025`

Thresholding in this way provides a binary, yes/no answer to the question: do we have evidence that the spindle activity during treatment differs from 0?
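
The arithmetic behind this threshold can be sketched in a few lines. The example below assumes the 20 tests are independent (an assumption made here for illustration only) and uses a made-up array `example_p` rather than the study's p-values.

```python
import numpy as np

alpha = 0.05      # desired overall significance level
n_tests = 20      # one test per subject

# Chance of at least one false positive across all tests when every null
# hypothesis is true and the tests are independent (illustrative assumption).
fwer_uncorrected = 1 - (1 - alpha) ** n_tests

# Bonferroni: compare each p-value to alpha / n_tests instead of alpha.
bonferroni_threshold = alpha / n_tests
fwer_bonferroni = 1 - (1 - bonferroni_threshold) ** n_tests

# Hypothetical p-values, for illustration only (not the study's results).
example_p = np.array([0.20, 0.04, 0.01, 0.003, 0.001])
significant = example_p < bonferroni_threshold

print(f"Uncorrected family-wise error rate: {fwer_uncorrected:.3f}")
print(f"Bonferroni threshold:               {bonferroni_threshold:.4f}")
print(f"Corrected family-wise error rate:   {fwer_bonferroni:.3f}")
print("Below the Bonferroni threshold:", significant)
```

Under these assumptions, 20 uncorrected tests at 0.05 give roughly a 64% chance of at least one false positive; the Bonferroni threshold brings that chance back below 5%.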

Let's plot the p-values versus this new threshold.

```{pyodide-python}
plt.figure()
plt.stem(p_values)
plt.axhline(y=0.05/20, color='r', linestyle='--')
plt.xticks(np.arange(len(p_values)))  
plt.xlabel('Subject'); plt.ylabel('p value'); plt.title('During Treatment'); plt.yscale('log')
plt.show()
print('Significant p-values during treatment = ',np.sum(p_values < 0.05/20))
```

::: {.callout-note}
## After the Bonferroni correction, can we reject the null hypothesis for any subject?

<form id="bonferroni-form">
  <input type="radio" name="bonferroni" value="correct">
    No. None of the p-values are less than 0.05/20.<br>
  <input type="radio" name="bonferroni" value="wrong">
    Yes. Some of the p-values are small enough.<br><br>
  <button type="button" onclick="checkBonferroni()">Submit</button>
</form>

<p id="bonferroni-feedback"></p>

<script>
function checkBonferroni() {
  const selected = document.querySelector('input[name="bonferroni"]:checked');
  const feedback = document.getElementById("bonferroni-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. After correcting to 0.05/20, none of the p-values remain below the threshold.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. The adjusted threshold is 0.05/20, and no p-value falls below that.";
    feedback.style.color = "red";
  }
}
</script>
:::  


::: {.callout-note}
## The PI requested “Give me the p’s!”. Do you have evidence to reject the null hypothesis during treatment?

<form id="give-ps-form">
  <input type="radio" name="give-ps" value="correct">
    No! The p-values are large, so we find no evidence to reject the null hypothesis for any subject.<br>
  <input type="radio" name="give-ps" value="wrong">
    Yes! The p-values are large, so the spindle activity during treatment is large.<br><br>
  <button type="button" onclick="checkGivePs()">Submit</button>
</form>

<p id="give-ps-feedback"></p>

<script>
function checkGivePs() {
  const selected = document.querySelector('input[name="give-ps"]:checked');
  const feedback = document.getElementById("give-ps-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The p-values are not small, so there’s no evidence to reject the null during treatment.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Large p-values indicate a lack of evidence against the null, not a large treatment effect.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## We do not find any p-values that pass our significance threshold during treatment. Does this mean that the spindle activity during treatment does not change relative to baseline?

<form id="change-null-form">
  <input type="radio" name="change-null" value="correct">
    No! We never accept the null hypothesis. Instead, we say: “We fail to reject the null hypothesis of no difference in spindle activity between treatment and baseline.”<br>
  <input type="radio" name="change-null" value="wrong">
    Yes! Because the p-values are large, we can accept the null hypothesis.<br><br>
  <button type="button" onclick="checkChangeNull()">Submit</button>
</form>

<p id="change-null-feedback"></p>

<script>
function checkChangeNull() {
  const selected = document.querySelector('input[name="change-null"]:checked');
  const feedback = document.getElementById("change-null-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. We never accept the null hypothesis; we say we fail to reject the null hypothesis of no difference in spindle activity between treatment and baseline.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. In statistics we don’t accept the null hypothesis; instead we say we fail to reject it.";
    feedback.style.color = "red";
  }
}
</script>
:::


<div class="alert alert-block alert-danger">
<b>Summary:</b>
    
    
We've computed a p-value for each subject. After correcting for multiple comparisons, our initial results provide **no evidence** to reject the null hypothesis (of no difference in spindle activity between baseline and treatment).

</div>

## Mini Summary & Review

We sought to answer the scientific question:

- Does the spindle activity during `treatment` differ from the `baseline` spindle activity?

To answer this question, we assumed a null hypothesis:

**Null hypothesis**: *No difference in average spindle activity between treatment and baseline conditions.*

We tested this null hypothesis for each subject, computing a p-value for each subject.

Because we computed 20 p-values (one for each subject), we corrected for multiple comparisons using a Bonferroni correction (see [Multiplicity Unit](https://github.com/Mark-Kramer/METER-Units/blob/main/METER_Exploratory.ipynb)).

We found no p-values small enough to reject the null hypothesis. 

In other words, using our initial approach, we found **no evidence** that the spindle activity during `treatment` differs from `baseline`.

::: {.callout-note}
## Our initial results show no evidence that spindle activity during treatment differs from baseline. What factors might impact our evidence?

*(Select all that apply)*

<form id="evidence-impact-form">
  <input type="checkbox" name="impact" value="sample_size">
    **Sample Size:** We collect only 30 spindle samples during treatment, which increases random error and provides less precise estimates.<br>
  <input type="checkbox" name="impact" value="effect_size">
    **Effect Size:** Small differences in spindle activity between conditions are difficult to detect.<br>
  <input type="checkbox" name="impact" value="variability">
    **Variability (or Noise) in Measurements:** High variability in the spindle estimates during treatment can make it harder to detect a real effect.<br>
  <input type="checkbox" name="impact" value="testing_approach">
    **Approach to Statistical Testing:** A different approach may provide more insight into differences in spindle activity between treatment and baseline.<br>
  <input type="checkbox" name="impact" value="no_effect">
    **Treatment has no Effect:** It may be that the treatment does not impact spindle activity relative to baseline.<br><br>
  <button type="button" onclick="checkEvidenceImpact()">Submit</button>
</form>

<p id="evidence-impact-feedback"></p>

<script>
function checkEvidenceImpact() {
  const correct = [
    "sample_size",
    "effect_size",
    "variability",
    "testing_approach",
    "no_effect"
  ];
  const selected = Array.from(
    document.querySelectorAll('input[name="impact"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("evidence-impact-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one factor.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = "✅ Correct! All of these factors can influence the strength of evidence against the null hypothesis.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember that sample size, effect size, measurement variability, testing approach, and even a true lack of effect all affect our evidence.";
    feedback.style.color = "red";
  }
}
</script>
:::


---

# 3 - Maybe there's something else we can publish?

Our initial results are discouraging; we find no evidence of a change in spindle activity from baseline during treatment.

That's disappointing. Rather than abandon our data (which took years to collect), our PI asks us to continue **exploring the data**.

**Data exploration is common in neuroscience.** In general, as practicing neuroscientists, we explore our data for interesting features.

However, when we explore data, we must be transparent about it (e.g., by reporting everything we explored, whether or not the results are significant).

Our PI recommends that we **examine the change in spindle activity `post-treatment`**.

Perhaps the treatment produces a longer-term effect that manifests during the `post-treatment` period.

<div class="alert alert-block alert-danger">
    
**Exploratory vs Confirmatory Analyses & Guarding Against _p_-Hacking**

Our initial results are discouraging: we find no evidence of a change in spindle activity from baseline during treatment.

Instead of discarding years of data, our PI encourages us to **explore the data** for unexpected patterns — this is perfectly legitimate as long as we remain transparent.

- **_Data exploration_** helps generate new hypotheses. We might notice trends, outliers, or condition-specific features that suggest where real effects could lie, to help guide future experiments.

- **_P-hacking_** occurs when we repeatedly mine the data—trying different subsets, covariates, or outcomes—until something “significant” emerges. This inflates false positives and misleads follow-up studies.

- To stay honest, every exploratory analysis must be clearly labeled as such. We should report exactly what we tested (e.g., “we examined spindle rate in the baseline, treatment, and post-treatment intervals”), and include both significant and nonsignificant findings.

- **_Confirmatory analysis_** comes later: once exploration suggests a specific hypothesis (for example, an increase in post-treatment spindle rates), we pre-register that test or validate it in a fresh dataset. Only then do _p_-values carry their usual weight.

- Our next step—examining post-treatment spindle activity—serves as a bridge. We explore here, but plan to follow up with a dedicated, confirmatory protocol before drawing firm conclusions.

</div>
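
To see why repeated testing inflates false positives, here is a minimal simulation. It assumes a hypothetical setting in which there is truly no effect, each simulated "study" tries 10 independent outcomes, and each outcome compares two groups of 30 samples. All of these numbers are chosen for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 2000    # simulated studies (hypothetical)
n_outcomes = 10     # outcomes tried per study (hypothetical)
n_samples = 30      # samples per group (hypothetical)

# Both groups come from the same distribution, so every null hypothesis is true.
group_a = rng.normal(size=(n_samples, n_studies, n_outcomes))
group_b = rng.normal(size=(n_samples, n_studies, n_outcomes))

# One t-test per (study, outcome) pair, computed along the sample axis.
p_null = stats.ttest_ind(group_a, group_b, axis=0).pvalue

# Fraction of individual tests that look "significant" by chance.
per_test_rate = np.mean(p_null < 0.05)

# Fraction of studies in which at least one outcome looks "significant".
any_hit_rate = np.mean((p_null < 0.05).any(axis=1))

print(f"False positives per test:            {per_test_rate:.3f}")
print(f"Studies with at least one 'finding': {any_hit_rate:.3f}")
```

The per-test rate should sit near 0.05, while the fraction of studies with at least one apparent finding should be much higher (close to 0.40 for 10 independent outcomes), even though no real effect exists anywhere in the simulation.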

::: {.callout-note}
## Given this new analysis, what is the null hypothesis?

<form id="null-hypothesis-post-form">
  <input type="radio" name="null-post" value="correct">
    No difference in average spindle activity between post-treatment and baseline conditions.<br>
  <input type="radio" name="null-post" value="wrong1">
    The average spindle activity is higher post-treatment compared to baseline.<br>
  <input type="radio" name="null-post" value="wrong2">
    The average spindle activity is lower post-treatment compared to baseline.<br><br>
  <button type="button" onclick="checkNullPost()">Submit</button>
</form>

<p id="null-hypothesis-post-feedback"></p>

<script>
function checkNullPost() {
  const selected = document.querySelector('input[name="null-post"]:checked');
  const feedback = document.getElementById("null-hypothesis-post-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The null hypothesis states that there is no difference in average spindle activity between post-treatment and baseline.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Recall that the null hypothesis always assumes no difference or no effect.";
    feedback.style.color = "red";
  }
}
</script>
:::


For our new analysis, our null hypothesis now focuses on the `post-treatment` data:

**Null hypothesis**: *No difference in average spindle activity between `post-treatment` and `baseline` conditions.*

::: {.callout-note}
## How do data in the post-treatment condition differ from the treatment condition?

*(Select all that apply)*

<form id="post-vs-treatment-form">
  <input type="checkbox" name="postdiff" value="more_samples">
    The number of samples is higher in the post-treatment condition.<br>
  <input type="checkbox" name="postdiff" value="more_subjects">
    The number of subjects is higher in the post-treatment condition.<br>
  <input type="checkbox" name="postdiff" value="less_noisy">
    The spindle estimates are less noisy in the post-treatment condition.<br>
  <input type="checkbox" name="postdiff" value="more_noisy">
    The spindle estimates are more noisy in the post-treatment condition.<br><br>
  <button type="button" onclick="checkPostVsTreatment()">Submit</button>
</form>

<p id="post-vs-treatment-feedback"></p>

<script>
function checkPostVsTreatment() {
  const correct = ["more_samples", "less_noisy"];
  const selected = Array.from(
    document.querySelectorAll('input[name="postdiff"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("post-vs-treatment-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = "✅ Correct! Post-treatment has more samples and less noisy spindle estimates.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember: post-treatment has more samples and the data are less noisy.";
    feedback.style.color = "red";
  }
}
</script>
:::


During the `post-treatment` condition:

- we have many more samples to analyze (N=840) compared to the `treatment` condition (N=30).

- the spindle estimates are less noisy compared to the `treatment` condition

::: {.callout-note}
## How do these two factors impact the evidence we have to reject the null hypothesis?

*(Select all that apply)*

<form id="evidence-post-form">
  <input type="checkbox" name="evidence-post" value="sample_size">
    Sample Size: Collecting more spindle samples reduces random error and provides more precise estimates.<br>
  <input type="checkbox" name="evidence-post" value="variability">
    Variability (or Noise) in Measurements: Lower variability in the spindle estimates post-treatment can make it easier to detect an effect.<br><br>
  <button type="button" onclick="checkEvidencePost()">Submit</button>
</form>

<p id="evidence-post-feedback"></p>

<script>
function checkEvidencePost() {
  const correct = ["sample_size", "variability"];
  const selected = Array.from(
    document.querySelectorAll('input[name="evidence-post"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("evidence-post-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one factor.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = "✅ Correct! More samples and lower variability both strengthen the evidence against the null hypothesis.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember that collecting more samples and reducing variability both increase our ability to reject the null.";
    feedback.style.color = "red";
  }
}
</script>
:::


Let's repeat our previous analysis, but now examine the `post-treatment` spindle activity.

```{pyodide-python}
import scipy.stats as stats
result = stats.ttest_ind(baseline, post_treatment)
p_values_post = result.pvalue
print(p_values_post)
```
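
A note on what this call returns: when given two-dimensional arrays, `stats.ttest_ind` tests along the first axis by default (`axis=0`), so each column is tested separately. The sketch below illustrates this with synthetic stand-in arrays (`baseline_demo`, `post_demo`, both hypothetical), assuming rows are samples and columns are subjects, as the indexing elsewhere in this unit suggests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins (not the study data): 840 samples (rows) x 20 subjects (columns).
baseline_demo = rng.normal(loc=0.0, scale=1.0, size=(840, 20))
post_demo = rng.normal(loc=0.0, scale=1.0, size=(840, 20))

# axis=0 is the default: one two-sample t-test per column (subject).
result_demo = stats.ttest_ind(baseline_demo, post_demo)

# The same test applied to the first column only, for comparison.
single = stats.ttest_ind(baseline_demo[:, 0], post_demo[:, 0])

print(result_demo.pvalue.shape)                          # one p-value per subject
print(np.isclose(result_demo.pvalue[0], single.pvalue))  # column 0 matches the 1-D test
```

By default this is the equal-variance (pooled) two-sample t-test; passing `equal_var=False` would request Welch's test instead.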

Let's print and plot the p-values for each subject:

```{pyodide-python}
import numpy as np
for k in np.arange(0,20):
    print('Subject ', k, ', p=', np.array2string(p_values_post[k], precision=4))
```

::: {.callout-note}
## For Subject 0, we find *p* is small. What does this mean?

<form id="subject0-small-p-form">
  <input type="radio" name="subject0-small-p" value="wrong1">
    There is a small chance that the mean spindle rate is truly zero.<br>
  <input type="radio" name="subject0-small-p" value="wrong2">
    There is a small chance that the mean spindle rate is truly nonzero.<br>
  <input type="radio" name="subject0-small-p" value="correct">
    There is a small probability of observing these data (or something more extreme) if the difference in mean spindle rates is zero.<br>
  <input type="radio" name="subject0-small-p" value="wrong3">
    There is a small chance that the result is clinically meaningful.<br><br>
  <button type="button" onclick="checkSubject0SmallP()">Submit</button>
</form>

<p id="subject0-small-p-feedback"></p>

<script>
function checkSubject0SmallP() {
  const selected = document.querySelector('input[name="subject0-small-p"]:checked');
  const feedback = document.getElementById("subject0-small-p-feedback");
  
  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. A small p-value means there is a small probability of observing data this extreme (or more extreme) if the null hypothesis of zero difference in means is true.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Recall that the p-value reflects how likely the observed data are under the null, not the chance the mean itself is zero or nonzero.";
    feedback.style.color = "red";
  }
}
</script>
:::
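
The definition in the correct answer can be checked by simulation. The sketch below uses made-up data (two groups of 50 samples, with a hypothetical shift of 0.4), computes a p-value with `stats.ttest_ind`, and then estimates the same probability directly by generating many datasets in which the null hypothesis is true and counting how often the t-statistic is at least as extreme as the observed one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50                    # samples per group (hypothetical)

# "Observed" data: made up for illustration, with a modest true difference.
group_a = rng.normal(loc=0.0, scale=1.0, size=n)
group_b = rng.normal(loc=0.4, scale=1.0, size=n)
observed = stats.ttest_ind(group_a, group_b)

# Simulate many datasets in which the null hypothesis is true (no difference).
n_sim = 20000
null_a = rng.normal(size=(n, n_sim))
null_b = rng.normal(size=(n, n_sim))
null_t = stats.ttest_ind(null_a, null_b, axis=0).statistic

# Fraction of null datasets at least as extreme as what we observed.
p_simulated = np.mean(np.abs(null_t) >= np.abs(observed.statistic))

print(f"p-value from ttest_ind:  {observed.pvalue:.4f}")
print(f"p-value from simulation: {p_simulated:.4f}")
```

The two numbers should agree closely, up to simulation noise. Neither one is the probability that the null hypothesis is true.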


```{pyodide-python}
plt.figure()
plt.stem(p_values_post)
plt.axhline(y=0.05, color='r', linestyle='--')
plt.xticks(np.arange(len(p_values_post)))  
plt.xlabel('Subject'); plt.ylabel('p value'); plt.title('Post-Treatment'); plt.yscale('log')
plt.show()
```

Because we've computed 20 p-values (one from each subject), let's again correct for multiple comparisons using a Bonferroni correction (see [Multiplicity Unit](https://github.com/Mark-Kramer/METER-Units/blob/main/METER_Exploratory.ipynb)).

```{pyodide-python}
plt.figure()
plt.stem(p_values_post)
plt.axhline(y=0.05/20, color='r', linestyle='--')
plt.xticks(np.arange(len(p_values_post)))  
plt.xlabel('Subject'); plt.ylabel('p value'); plt.title('Post-Treatment'); plt.yscale('log')
plt.show()
print('Significant p-values post-treatment = ',np.sum(p_values_post < 0.05/20))
```

::: {.callout-note}
## Compare these two sets of p‐values, calculated during treatment (previous section) and post‐treatment. After the Bonferroni correction, what do you find?

<form id="compare-pvalues-form">
  <input type="radio" name="compare-p" value="wrong">
    In both sets, none of the p‐values are small enough (less than 0.05/20).<br>
  <input type="radio" name="compare-p" value="correct">
    Post‐treatment, the p‐values are small enough (less than 0.05/20); during treatment, none are.<br><br>
  <button type="button" onclick="checkCompareP()">Submit</button>
</form>

<p id="compare-p-feedback"></p>

<script>
function checkCompareP() {
  const selected = document.querySelector('input[name="compare-p"]:checked');
  const feedback = document.getElementById("compare-p-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. After correction (0.05/20), the post‐treatment p‐values fall below the threshold, while none of the during‐treatment p‐values do.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Compare each set to the Bonferroni threshold of 0.05/20: the post‐treatment p‐values fall below it, but the during‐treatment p‐values do not.";
    feedback.style.color = "red";
  }
}
</script>
:::


Look at how small the p-values are `post-treatment`!

 - All 20 p-values post-treatment are less than 0.05/20, the Bonferroni corrected p-value threshold.

Remember that, during `treatment`, the p-values are much larger, and we find no p-values less than 0.05/20.

We find many more significant p-values `post-treatment` (20 out of 20, after Bonferroni correction).

## Our results seem to reveal a new conclusion: 

- In Mini 2, we found no evidence of a change in spindle activity `during treatment`.

- In this Mini, we find many very small p-values (less than 0.05/20) `post-treatment`.

More specifically, after Bonferroni correction, we find evidence of a significant change in spindle activity post-treatment in all 20 subjects.

The PI is very excited with our new results, which appear to upend the literature.

The PI drafts the title for a high-impact paper:

Draft paper title: *Post-Treatment Paradox: Clear Human Responses, Despite Absence of Treatment Effect*

## But are we sure?

::: {.callout-note}
## Review the characteristics of data during treatment and post‐treatment. How might these characteristics impact the p‐values we observe?

*(Select all that apply)*

<form id="data-characteristics-form">
  <input type="checkbox" name="char" value="more_samples_post">
    We collect more samples post‐treatment, which can provide more precise estimates.<br>
  <input type="checkbox" name="char" value="fewer_samples_treatment">
    We collect fewer samples during treatment, which can provide less precise estimates.<br>
  <input type="checkbox" name="char" value="less_noisy_post">
    The measures are less noisy post‐treatment, which can make it easier to detect an effect.<br>
  <input type="checkbox" name="char" value="more_noisy_treatment">
    The measures are more noisy during treatment, which can make it harder to detect an effect.<br><br>
  <button type="button" onclick="checkDataCharacteristics()">Submit</button>
</form>

<p id="data-characteristics-feedback"></p>

<script>
function checkDataCharacteristics() {
  const correct = [
    "more_samples_post",
    "fewer_samples_treatment",
    "less_noisy_post",
    "more_noisy_treatment"
  ];
  const selected = Array.from(
    document.querySelectorAll('input[name="char"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("data-characteristics-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = "✅ Correct! More post‐treatment samples and lower noise strengthen evidence (lower p), while fewer treatment samples and higher noise weaken it (higher p).";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. All listed factors (more post‐treatment samples, fewer treatment samples, less post‐treatment noise, more treatment noise) influence the p‐values in the described ways.";
    feedback.style.color = "red";
  }
}
</script>
:::

This is a very important question ... and we haven't fully answered it yet.

We collect many more samples `post-treatment`, and our measurements are more accurate `post-treatment` compared to during `treatment`.

Both of these features impact the evidence we collect to reject the null hypothesis.
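
To make this concrete, here is a minimal sketch using hypothetical summary statistics (means, standard deviations, and sample sizes invented for illustration, not taken from the study). It uses `stats.ttest_ind_from_stats` with `equal_var=False` (Welch's test), which differs from the default `stats.ttest_ind` call used elsewhere in this unit; that choice is an assumption made here so that the very different noise levels are handled cleanly.

```python
from scipy import stats

# Hypothetical summary statistics, chosen for illustration (not the study's values).
base_mean, base_std, base_n = 0.00, 0.1, 840    # baseline
treat_mean, treat_std, treat_n = 0.50, 3.0, 30  # large shift, noisy, few samples
post_mean, post_std, post_n = 0.02, 0.1, 840    # tiny shift, quiet, many samples

# Welch's t-test from summary statistics (equal_var=False allows unequal variances).
treat = stats.ttest_ind_from_stats(treat_mean, treat_std, treat_n,
                                   base_mean, base_std, base_n, equal_var=False)
post = stats.ttest_ind_from_stats(post_mean, post_std, post_n,
                                  base_mean, base_std, base_n, equal_var=False)

threshold = 0.05 / 20
print(f"treatment:      shift = {treat_mean - base_mean:.2f}, p = {treat.pvalue:.3g}, "
      f"below threshold: {treat.pvalue < threshold}")
print(f"post-treatment: shift = {post_mean - base_mean:.2f}, p = {post.pvalue:.3g}, "
      f"below threshold: {post.pvalue < threshold}")
```

In this made-up example, the larger shift produces the larger p-value, and the much smaller shift falls below the Bonferroni threshold, because the standard error of the difference shrinks as the sample size grows and the noise falls.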

## So, are you sure about the `post-treatment` results?

<div class="alert alert-block alert-danger">
<b>Alert:</b>

- Wait, I'm not so sure ...

- Why did you ask me to review the characteristics of the data, and think about how they might impact the p-values?
</div>

## Mini Summary & Review

We sought to answer the scientific question:

- Does the spindle activity `post-treatment` differ from the `baseline` spindle activity?

To answer this question, we assumed a null hypothesis:

**Null hypothesis**: *No difference in average spindle activity between post-treatment and baseline conditions.*

We tested this null hypothesis for each subject, computing a p-value for each subject.

Because we computed 20 p-values (one for each subject), we corrected for multiple comparisons using a Bonferroni correction (see [Multiplicity Unit](https://github.com/Mark-Kramer/METER-Units/blob/main/METER_Exploratory.ipynb)).

We found all of the p-values were small enough to reject the null hypothesis. 

In other words, in this exploratory analysis, we found **evidence** that the spindle activity during `post-treatment` differs from `baseline`.

This differs from our results during `treatment`, in which we found **no evidence** that the spindle activity during `treatment` differs from `baseline`.

---

# 4 - Not so fast: visualize the measured data, always.

In our previous analysis, we *may* have found an interesting result: spindle activity `post-treatment`, but not during `treatment`, differs from `baseline`.

Scientifically, we might conclude that our treatment has a long-lasting effect, impacting spindle activity `post-treatment`.

To infer these results, we computed and compared p-values, testing specific null hypotheses for each subject.

We've hinted above that **something isn't right** ... let's now dive in and identify what we could have done differently.

**Our initial approach has focused exclusively on p-values.**

P-values indicate how much evidence we have to reject a null hypothesis given the data we observe.

Let's again plot the p-values during `treatment` and `post-treatment` for each subject:

```{pyodide-python}
import scipy.stats as stats
import numpy as np
result = stats.ttest_ind(baseline, during_treatment)
p_values_during = result.pvalue
result = stats.ttest_ind(baseline, post_treatment)
p_values_post = result.pvalue

fig, ax = plt.subplots()
x = np.arange(p_values_during.shape[0])
ax.stem(x, p_values_during, linefmt='r-', markerfmt='ro', basefmt=' ', label='During Treatment')
ax.stem(x, p_values_post,   linefmt='b-', markerfmt='bo', basefmt=' ', label='Post-treatment')
ax.axhline(y=0.05/20, color='k', linestyle='--', label='Bonferroni Threshold')
ax.set_xticks(x);
ax.set_xlabel('Subject'); ax.set_ylabel('p value'); ax.set_yscale('log')
ax.set_title('During Treatment (red), Post-treatment (blue)')
ax.legend()
plt.show()
```

::: {.callout-note}
## For each subject, compare the p-values during treatment (red) versus post-treatment (blue). What do you observe?

<form id="compare-post-treatment-form">
  <input type="radio" name="compare-pt" value="correct">
    P-values tend to be smaller post-treatment compared to during treatment.<br>
  <input type="radio" name="compare-pt" value="wrong">
    P-values tend to be larger post-treatment compared to during treatment.<br><br>
  <button type="button" onclick="checkComparePT()">Submit</button>
</form>

<p id="compare-pt-feedback"></p>

<script>
function checkComparePT() {
  const selected = document.querySelector('input[name="compare-pt"]:checked');
  const feedback = document.getElementById("compare-pt-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The blue post-treatment p-values are generally smaller than the red during-treatment values.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. In fact, the blue post-treatment p-values are typically smaller than the red during-treatment ones.";
    feedback.style.color = "red";
  }
}
</script>
:::


We've focused on p-values to draw our scientific conclusions. 

**However, we've almost completely ignored the spindle measurements themselves!**

Let's return to the spindle activity measurements themselves, and see how these measurements relate to the p-values.

::: {.callout-note}
## Consider Subject 6. We find during treatment $p = 0.033$, and post-treatment $p = 0.0021$. The p-value is much smaller post-treatment. How do you think the spindle activity differs during treatment versus post-treatment?

<form id="subject4-effectsize-form">
  <input type="radio" name="subject4-effectsize" value="wrong1">
    The p-value is smaller post-treatment, so I expect a big effect – I expect spindle activity that differs from 0.<br>
  <input type="radio" name="subject4-effectsize" value="wrong2">
    The p-value is bigger during treatment, so I expect a small effect – I expect spindle activity near 0.<br>
  <input type="radio" name="subject4-effectsize" value="correct">
    It's dangerous to deduce effect size from the p-value.<br><br>
  <button type="button" onclick="checkSubject4Effect()">Submit</button>
</form>

<p id="subject4-effectsize-feedback"></p>

<script>
function checkSubject4Effect() {
  const selected = document.querySelector('input[name="subject4-effectsize"]:checked');
  const feedback = document.getElementById("subject4-effectsize-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. A p-value speaks only to evidence against the null, not the magnitude of the effect.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember, p-values do not convey effect size—they only quantify evidence against the null hypothesis.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## Consider the p-values computed for all subjects. How do you expect spindle activity to behave during treatment and post-treatment?

<form id="expectation-form">
  <input type="radio" name="expectation" value="wrong1">
    Because we do not find significant p-values during treatment, I expect spindle activity values to appear near 0.<br>
  <input type="radio" name="expectation" value="wrong2">
    Because we do find significant p-values post-treatment, I expect spindle activity values to differ from 0.<br>
  <input type="radio" name="expectation" value="correct">
    It’s dangerous to deduce effect size from the p-value.<br><br>
  <button type="button" onclick="checkExpectation()">Submit</button>
</form>

<p id="expectation-feedback"></p>

<script>
function checkExpectation() {
  const selected = document.querySelector('input[name="expectation"]:checked');
  const feedback = document.getElementById("expectation-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. P-values indicate evidence against the null, not the magnitude or direction of an effect.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. Remember, p-values do not tell you how large or near-zero the underlying values are.";
    feedback.style.color = "red";
  }
}
</script>
:::


Now, let's return to the spindle activity and look at those values directly.

Let's begin with an example from Subject 6.

```{pyodide-python}
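import numpy as np
subject = 6  # zero-based column index, matching the subject labels used above
plt.figure()
# Offset the three conditions horizontally so the points do not overlap.
plt.plot(np.tile(0,(840,1))-0.25, baseline[:,subject], '.', color='k')
plt.plot(np.tile(0,(30,1)), during_treatment[:,subject], '.', color='r')
plt.plot(np.tile(0,(840,1))+0.25, post_treatment[:,subject], '.', color='b')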
plt.axhline(y=0, color='k')
plt.xlabel('Subject 6'); plt.ylabel('Spindle activity'); plt.title('Baseline (black), During Treatment (red), Post-Treatment (blue)'); #plt.ylim([0,1]);
plt.xticks([]);
plt.show()
```

For Subject 6, we found:

- `treatment`   p=0.033

- `post-treatment` p=0.0021

From these p-values, we might expect:

- Spindle activity during `treatment` near 0 (i.e., similar to `baseline`).

- Spindle activity `post-treatment` far from 0 (i.e., different from `baseline`).

But, we **find the opposite**.

- Spindle activity during `treatment` far from 0 (i.e., different from `baseline`).

- Spindle activity `post-treatment` near 0 (i.e., similar to `baseline`).

Let's make similar plots for all 20 subjects.

```{pyodide-python}
plt.figure()
# Plot every sample for each subject; offset the three conditions horizontally so they don't overlap.
conditions = [(baseline, -0.25, 'k'), (during_treatment, 0, 'r'), (post_treatment, 0.25, 'b')]
for data, offset, color in conditions:
    n_samples = np.shape(data)[0]                  # Number of samples per subject in this condition.
    for k in np.arange(0,20):
        plt.plot(np.tile(k,(n_samples,1))+offset, data[:,k], '.', color=color)
plt.axhline(y=0, color='k')
plt.xlabel('Subject')
plt.ylabel('Spindle activity')
plt.title('Baseline (black), During Treatment (red), Post-Treatment (blue)')
plt.xticks(np.arange(0,20))
plt.show()
```

::: {.callout-note}
## Looking at the plots of spindle measurements, do you observe an effect during treatment (red) compared to baseline (black)?

<form id="treatment-effect-plot-form">
  <input type="radio" name="treatment-effect-plot" value="wrong1">
    No, the red and black measurements look about the same.<br>
  <input type="radio" name="treatment-effect-plot" value="wrong2">
    No, the red measurements appear lower than baseline.<br>
  <input type="radio" name="treatment-effect-plot" value="correct">
    Yes, the spindle measurements during treatment (red) appear larger than baseline (black).<br>
  <input type="radio" name="treatment-effect-plot" value="wrong3">
    It’s impossible to tell any difference from the plot.<br><br>
  <button type="button" onclick="checkTreatmentEffectPlot()">Submit</button>
</form>

<p id="treatment-effect-plot-feedback"></p>

<script>
function checkTreatmentEffectPlot() {
  const selected = document.querySelector('input[name="treatment-effect-plot"]:checked');
  const feedback = document.getElementById("treatment-effect-plot-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The red (treatment) measurements generally sit above the black (baseline), suggesting an increase—even though the p-values remain large.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. In fact, the red markers during treatment tend to be higher than the black baseline markers.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## Looking at the plots of spindle measurements, do you observe an effect post-treatment (blue) compared to baseline (black)?

<form id="post-treatment-effect-vis-form">
  <input type="radio" name="post-effect-vis" value="wrong1">
    Yes – the blue post-treatment values clearly exceed the black baseline values.<br>
  <input type="radio" name="post-effect-vis" value="wrong2">
    No – the blue and black measurements look identical with no visible difference.<br>
  <input type="radio" name="post-effect-vis" value="correct">
    No. Although we find significant p-values post-treatment, it’s difficult to see whether the values differ from the distribution of baseline values.<br>
  <input type="radio" name="post-effect-vis" value="wrong3">
    Yes – the blue values appear much more variable than the baseline black, indicating an effect.<br><br>
  <button type="button" onclick="checkPostEffectVis()">Submit</button>
</form>

<p id="post-effect-vis-feedback"></p>

<script>
function checkPostEffectVis() {
  const selected = document.querySelector('input[name="post-effect-vis"]:checked');
  const feedback = document.getElementById("post-effect-vis-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. Despite significant p-values post-treatment, the raw plots don’t clearly show a shift from baseline.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. The key insight is that even with significant p-values, the visual plots may not reveal a clear difference.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## Looking at the plots of spindle measurements, are these plots consistent with your p-value results?

<form id="plots-consistency-form">
  <input type="radio" name="plots-consistency" value="correct">
    No — We found many significant p-values post-treatment and concluded there’s an effect, but these plots of spindle activity aren’t consistent with that conclusion.<br>
  <input type="radio" name="plots-consistency" value="wrong">
    Yes — The raw spindle plots clearly match our statistical conclusions.<br><br>
  <button type="button" onclick="checkPlotsConsistency()">Submit</button>
</form>

<p id="plots-consistency-feedback"></p>

<script>
function checkPlotsConsistency() {
  const selected = document.querySelector('input[name="plots-consistency"]:checked');
  const feedback = document.getElementById("plots-consistency-feedback");

  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }

  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. Despite significant p-values post-treatment, the spindle‐activity plots don’t visually support that effect.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. The plots don’t align with our statistical findings—so they’re not consistent.";
    feedback.style.color = "red";
  }
}
</script>
:::


It's nice to visualize all of the data, but doing so can also be overwhelming.

Let's summarize the spindle activity for each subject by plotting the [mean and the standard error of the mean](https://mark-kramer.github.io/Case-Studies-Python/02.html#cis-m1).

```{pyodide-python}
plt.figure()
# For each condition: (data, horizontal offset, color).
conditions = [(baseline, -0.2, 'k'), (during_treatment, 0, 'r'), (post_treatment, 0.2, 'b')]
for data, offset, color in conditions:
    K = np.shape(data)[0]                          # Number of samples per subject.
    for k in np.arange(0,20):
        mn = np.mean(data[:,k])                    # Mean spindle activity for subject k.
        sd = np.std( data[:,k])                    # Standard deviation for subject k.
        plt.plot(k+offset, mn, 'o', color=color)
        # Error bar: mean +/- 2 standard errors of the mean.
        plt.plot([k+offset,k+offset], [mn-2*sd/np.sqrt(K), mn+2*sd/np.sqrt(K)], color=color)
plt.axhline(y=0, color='k')
plt.xlabel('Subject')
plt.ylabel('Spindle activity')
plt.title('Baseline (black), During Treatment (red), Post-Treatment (blue)')
plt.xticks(np.arange(0,20))
plt.show()
```
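
For reference, the error bars drawn by the code above can be written as follows, where $\bar{x}$ is a subject's mean spindle activity in one condition, $s$ is the standard deviation of those samples, and $K$ is the number of samples in that condition. Reading the bar as an approximate 95% confidence interval is an assumption; it is reasonable when the samples are roughly independent and $K$ is not too small.

$$
\mathrm{SEM} = \frac{s}{\sqrt{K}}, \qquad \text{error bar} = \bar{x} \pm 2\,\mathrm{SEM}
$$

Because $K$ sits under a square root in the denominator, a condition with fewer samples or a larger $s$ produces a wider error bar.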

::: {.callout-note}
## Looking at the summary plots of the spindle activity for each subject, do you observe an effect during treatment (red)?

<form id="summary-effect-form">
  <input type="radio" name="summary-effect" value="wrong1">
    No — the mean spindle activity during treatment is similar to baseline.<br>
  <input type="radio" name="summary-effect" value="wrong2">
    Yes — the mean is larger and the standard error is small.<br>
  <input type="radio" name="summary-effect" value="wrong3">
    No — the mean spindle activity during treatment is smaller than baseline.<br>
  <input type="radio" name="summary-effect" value="correct">
    Yes — the mean spindle activity during treatment appears larger than baseline, but the large standard error explains why p-values aren’t significant.<br><br>
  <button type="button" onclick="checkSummaryEffect()">Submit</button>
</form>

<p id="summary-effect-feedback"></p>

<script>
function checkSummaryEffect() {
  const selected = document.querySelector('input[name="summary-effect"]:checked');
  const feedback = document.getElementById("summary-effect-feedback");
  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }
  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The mean is higher during treatment, but the large standard error means the p-values aren’t significant.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. In reality, the mean treatment effect looks larger but high SE leads to nonsignificant p-values.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## Looking at the summary plots of the spindle activity for each subject, do you observe an effect post-treatment (blue)?

<form id="summary-post-effect-form">
  <input type="radio" name="summary-post-effect" value="wrong1">
    Yes — the blue post-treatment means are clearly above the black baseline means.<br>
  <input type="radio" name="summary-post-effect" value="wrong2">
    Yes — the blue dots appear less variable and shifted from zero.<br>
  <input type="radio" name="summary-post-effect" value="wrong3">
    No — the blue means overlap with baseline, but only because the sample size is too small.<br>
  <input type="radio" name="summary-post-effect" value="correct">
    No — although we found significant p-values post-treatment, the black and blue dots overlap near zero, so we don’t clearly see an effect.<br><br>
  <button type="button" onclick="checkSummaryPostEffect()">Submit</button>
</form>

<p id="summary-post-effect-feedback"></p>

<script>
function checkSummaryPostEffect() {
  const selected = document.querySelector('input[name="summary-post-effect"]:checked');
  const feedback = document.getElementById("summary-post-effect-feedback");
  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }
  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. Despite significant p-values post-treatment, the mean dots (black vs. blue) overlap near zero, so the plot doesn’t show a clear effect.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. The key point is that the blue and black dots overlap near zero, so you can’t visually confirm an effect even with significant p-values.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## Looking at the summary plots of the spindle activity for each subject, are these plots consistent with your p-value results?

<form id="summary-plots-consistency-form">
  <input type="radio" name="summary-consistency" value="correct">
    No — we concluded from p-values that there’s an effect post-treatment but not during treatment, yet these summary plots don’t support that conclusion.<br>
  <input type="radio" name="summary-consistency" value="wrong">
    Yes — the summary plots clearly align with our p-value conclusions.<br><br>
  <button type="button" onclick="checkSummaryPlotsConsistency()">Submit</button>
</form>

<p id="summary-plots-consistency-feedback"></p>

<script>
function checkSummaryPlotsConsistency() {
  const selected = document.querySelector('input[name="summary-consistency"]:checked');
  const feedback = document.getElementById("summary-plots-consistency-feedback");
  if (!selected) {
    feedback.innerHTML = "Please select an answer.";
    feedback.style.color = "blue";
    return;
  }
  if (selected.value === "correct") {
    feedback.innerHTML = "✅ Correct. The summary plots don’t visually support our statistical conclusion of an effect post-treatment but not during treatment.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = "❌ Incorrect. In fact, the summary plots aren’t consistent with our p-value-based conclusions.";
    feedback.style.color = "red";
  }
}
</script>
:::


Let's summarize what we've found so far:

| State | p-values | spindle activity |
|----------|----------|----------|
| During treatment   | p>0.05/20 (not significant) | mean spindle activity > 0   |
| Post-treatment    | p<<0.05/20  (significant)   | mean spindle activity $\approx$ 0   |

Something's not adding up here ... 

- During `treatment`, we find **no** evidence of a significant change in spindle activity from baseline (i.e., the p-values are big). However, looking at the mean spindle activity, we find values that often exceed 0.

- `Post-treatment`, we find evidence of a significant change in spindle activity from baseline (i.e., the p-values are small) in each subject. However, looking at the mean spindle activity, we find that those values tend to appear near 0.

So, why does the spindle activity during `treatment` often exceed 0 (i.e., exceed baseline), yet p>0.05/20?

And why is the `post-treatment` spindle activity so near 0 (i.e., so near baseline), yet p<<0.05/20?

**I'm confused!**

<div class="alert alert-block alert-danger">
<b>Alert:</b>
    
These confusing conclusions occur because **we've made two common errors**:

- We compared p-values between the `treatment` and `post-treatment` conditions.

- We focused exclusively on p-values without thinking more carefully about the data used to compute those p-values.

</div>

## To resolve these confusing conclusions, let's think more carefully about what the p-value represents.

The p-value measures the strength of evidence against the null hypothesis.

Three factors can impact the strength of evidence:

- **Sample Size** (i.e., the number of observations).

- **Effect Size** (i.e., the size of the difference in spindle activity between conditions; bigger differences are easier to detect).

- **Variability (or Noise) in Measurements** (i.e., how reliably we measure spindle activity).
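
To see how these three factors combine, here is a minimal sketch. It assumes a two-sided z-test of a sample mean against 0 (a normal approximation, which may differ from the test used earlier in this unit). The function name `p_value_from_summary` and every number passed to it are hypothetical, chosen only to mimic the two conditions.

```python
from math import sqrt
from statistics import NormalDist

def p_value_from_summary(effect, sd, n):
    """Two-sided p-value for H0: mean = 0, using a normal approximation.

    effect: observed mean (assumed to be expressed relative to baseline)
    sd:     standard deviation of the measurements
    n:      number of measurements
    """
    sem = sd / sqrt(n)   # standard error of the mean
    z = effect / sem     # grows with effect size and n, shrinks with variability
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical values, for illustration only (not taken from the lab's data).
p_during = p_value_from_summary(effect=0.5,  sd=1.5, n=30)   # large effect, noisy, few samples
p_post   = p_value_from_summary(effect=0.05, sd=0.4, n=840)  # small effect, less noisy, many samples

print(f"during treatment: p = {p_during:.4f}")
print(f"post-treatment:   p = {p_post:.4f}")
```

With these made-up numbers, the condition with the tenfold smaller effect produces the smaller p-value, because 28 times more samples and lower variability shrink its standard error.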

::: {.callout-note}
## How does the [sample size] differ during `treatment` versus `post‐treatment`? How might this impact the results?

*(Select all that apply)*

<form id="sample-size-impact-form">
  <input type="checkbox" name="size_impact" value="post_many">
    We have many more observations post‐treatment (N=840).<br>
  <input type="checkbox" name="size_impact" value="during_few">
    We have few observations during treatment (N=30).<br>
  <input type="checkbox" name="size_impact" value="wrong1">
    We have many more observations during treatment (N=840).<br>
  <input type="checkbox" name="size_impact" value="wrong2">
    We have few observations post‐treatment (N=30).<br><br>
  <button type="button" onclick="checkSampleSizeImpact()">Submit</button>
</form>

<p id="sample-size-impact-feedback"></p>

<script>
function checkSampleSizeImpact() {
  const correct = ["post_many", "during_few"];
  const selected = Array.from(
    document.querySelectorAll('input[name="size_impact"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("sample-size-impact-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML = 
      "✅ Correct! We have many more observations post-treatment (N=840), " +
      "so we can accumulate enough evidence to detect a weak effect post-treatment. " +
      "We have few observations during treatment (N=30), so even if the effect is strong, " +
      "we lack enough evidence to reject the null hypothesis during treatment.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML = 
      "❌ Incorrect. Remember that post-treatment has N=840 (high power for weak effects), " +
      "while treatment has only N=30 (low power even if effects exist).";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## How does the [effect size] differ during `treatment` versus `post-treatment`? How might this impact the results?

*(Select all that apply)*

<form id="effect-size-impact-form">
  <input type="checkbox" name="effect_impact" value="post_small">
    The effect size appears small post-treatment (mean values near zero).<br>
  <input type="checkbox" name="effect_impact" value="treatment_large">
    The effect size appears large during treatment (mean values exceed zero).<br>
  <input type="checkbox" name="effect_impact" value="wrong1">
    The effect size is equally small in both treatment and post-treatment.<br>
  <input type="checkbox" name="effect_impact" value="wrong2">
    A larger effect size during treatment guarantees statistical significance.<br><br>
  <button type="button" onclick="checkEffectSizeImpact()">Submit</button>
</form>

<p id="effect-size-impact-feedback"></p>

<script>
function checkEffectSizeImpact() {
  const correct = ["post_small", "treatment_large"];
  const selected = Array.from(
    document.querySelectorAll('input[name="effect_impact"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("effect-size-impact-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML =
      "✅ Correct! The effect size appears small post-treatment (means near zero), so even though we detect a change it’s small. " +
      "The effect size appears large during treatment (means exceed zero), but with few samples and high variability, " +
      "we lack sufficient evidence to reject the null hypothesis during treatment.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML =
      "❌ Incorrect. Remember: small post-treatment effect size means the detected change is small, " +
      "and a large treatment effect size can still fail to reach significance when sample size is low and variability is high.";
    feedback.style.color = "red";
  }
}
</script>
:::


::: {.callout-note}
## How does the [measurement variability] differ during `treatment` versus `post-treatment`? How might this impact the results?

*(Select all that apply)*

<form id="measurement-variability-form">
  <input type="checkbox" name="measurement_impact" value="post_less">
    We have less measurement variability post-treatment. Lower variability makes it easier to detect a difference from 0 (i.e., difference from baseline).<br>
  <input type="checkbox" name="measurement_impact" value="treatment_more">
    We have more measurement variability during treatment. Higher variability makes it harder to detect a difference from 0 (i.e., difference from baseline) and harder to reject the null hypothesis.<br>
  <input type="checkbox" name="measurement_impact" value="wrong1">
    Measurement variability is roughly the same in both conditions, so it has no impact on detection.<br>
  <input type="checkbox" name="measurement_impact" value="wrong2">
    Higher variability post-treatment makes it harder to detect a difference, improving our evidence.<br><br>
  <button type="button" onclick="checkMeasurementVariability()">Submit</button>
</form>

<p id="measurement-variability-feedback"></p>

<script>
function checkMeasurementVariability() {
  const correct = ["post_less", "treatment_more"];
  const selected = Array.from(
    document.querySelectorAll('input[name="measurement_impact"]:checked')
  ).map(cb => cb.value);
  const feedback = document.getElementById("measurement-variability-feedback");

  if (selected.length === 0) {
    feedback.innerHTML = "Please select at least one option.";
    feedback.style.color = "blue";
    return;
  }

  const missed = correct.filter(x => !selected.includes(x));
  const extras = selected.filter(x => !correct.includes(x));

  if (missed.length === 0 && extras.length === 0) {
    feedback.innerHTML =
      "✅ Correct! Post-treatment variability is lower, which increases our power to detect differences, while treatment variability is higher, which reduces our ability to reject the null.";
    feedback.style.color = "green";
  } else {
    feedback.innerHTML =
      "❌ Incorrect. Remember that lower variability post-treatment helps detect an effect, and higher variability during treatment hinders it.";
    feedback.style.color = "red";
  }
}
</script>
:::


## Conclusion / Summary / Moral:

We began with the scientific statement:

*“I expect during treatment that spindle activity exceeds the baseline spindle activity.”*

Our initial approach focused on computing and comparing p-values.

That's a bad idea.

We're not interested in comparing the **evidence** we have for each null hypothesis (the p-value); the evidence depends on the sample size, effect size, and measurement variability.

Instead, we're more interested in comparing the spindle activity between conditions.

In other words, **we're interested in the effect size**, not the p-value.

This observation suggests a different analysis path for an improved approach.

We can answer the same scientific question by **comparing the spindle activities between conditions**, not the p-values.

We've started to see this in the plots of spindle activity at `baseline`, during `treatment`, and `post-treatment`.
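
As a minimal sketch of what comparing spindle activity directly could look like, the snippet below estimates the difference in means between two conditions together with an approximate 95% interval. The helper `mean_difference_with_ci` and the sample values are hypothetical, and the standard-error formula assumes the samples are independent, which may not hold for consecutive spindle measurements.

```python
from math import sqrt
from statistics import mean, variance

def mean_difference_with_ci(condition, reference, z=1.96):
    """Estimate mean(condition) - mean(reference) with an approximate 95% interval."""
    diff = mean(condition) - mean(reference)
    # Standard error of a difference between two independent sample means.
    se = sqrt(variance(condition) / len(condition) + variance(reference) / len(reference))
    return diff, (diff - z * se, diff + z * se)

# Tiny made-up samples, for illustration only (not the lab's data).
reference_samples = [-0.2, 0.1, 0.0, 0.3, -0.1, -0.1]
condition_samples = [0.4, 1.1, -0.3, 0.9, 0.6, 0.3]

diff, (lo, hi) = mean_difference_with_ci(condition_samples, reference_samples)
print(f"difference in means = {diff:.2f}, approx. 95% interval = [{lo:.2f}, {hi:.2f}]")
```

Unlike a p-value, this output is in the units of the measurement: it reports how large the difference is and how uncertain we are about it.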

For more analysis (e.g., different statistical tests and effect sizes), continue on to the next sections.

---

# 5 - So, what went wrong?

In our initial analysis, we made several common mistakes.

**Mistake #1: Confusing p-values with effect size.**

- A p-value indicates the amount of statistical evidence, not the effect size. In our analysis, we found small p-values when comparing the spindle rates at baseline versus post-treatment. However, the effect size was very small; the large number of observations increased our statistical evidence, allowing us to detect a small effect size.
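
The role of the number of observations in Mistake #1 can be sketched by holding a small effect fixed and changing only the sample size. This is an illustration under assumptions: it uses a two-sided z-test (normal approximation), and the effect, standard deviation, and sample sizes are hypothetical rather than estimated from the spindle data.

```python
from math import sqrt
from statistics import NormalDist

effect, sd = 0.05, 0.4   # hypothetical small effect and noise level, both held fixed

p_values = {}
for n in (30, 210, 840):
    z = effect / (sd / sqrt(n))                        # only n changes between iterations
    p_values[n] = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    print(f"n = {n:3d}   effect = {effect}   p = {p_values[n]:.2g}")
```

The effect is identical in every row; only the amount of evidence changes, so the p-value falls as the sample size grows.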

**Remedy #1: Estimate what you care about.**

- We're interested in the effect during treatment; so, examine the spindle rate during treatment. Plotting the mean spindle rate, and standard error of the mean, during treatment allows us to directly estimate the effect of interest.

---

**Mistake #2: Comparing p-values instead of data.**

- We found much smaller p-values post-treatment compared to during treatment (both computed versus the same baseline). Therefore, the effect is "stronger" post-treatment, right? WRONG! The p-values indicate that we have more statistical evidence of a difference post-treatment, but not that the effect size is large.

**Remedy #2: Compare the effect sizes.**

- Comparing the effect sizes, we find larger mean spindle rates during treatment than at baseline (or post-treatment).

---

**Mistake #3: Separate statistical tests for each subject.**

- Our scientific question initially focused on the impact of treatment on spindle rate. We're not necessarily interested in answering that question separately for each individual subject.

**Remedy #3: Use an omnibus test.**

- See the next section to learn more.

---

# 6 - One test to rule them all: an omnibus test.

(PENDING)

Do there exist subjects for which there is a significant effect? NO

1. Concatenate data from all subjects (works if you believe everyone has an effect).
    - Show the CI and provide the associated p-values.
    - Treatment: significant and meaningful at the population level.
    - Post-treatment: significant but not meaningful at the population level.
2. If not everyone has an effect, concatenating dilutes the effect size and reduces power.
3. Alternatively, if you believe not everyone has an effect, use a mixed-effects model.
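
One way the concatenation step in this outline might look is sketched below. It is only an illustration under assumptions: the helper `pooled_mean_with_ci` and the sample values are hypothetical, and treating the pooled samples as independent ignores subject-to-subject differences, which is the simplification a mixed-effects model would relax.

```python
from math import sqrt
from statistics import mean, stdev

def pooled_mean_with_ci(samples_by_subject, z=1.96):
    """Concatenate samples from all subjects, then estimate the mean and an approximate 95% CI."""
    pooled = [x for subject in samples_by_subject for x in subject]
    m = mean(pooled)
    sem = stdev(pooled) / sqrt(len(pooled))   # standard error of the pooled mean
    return m, (m - z * sem, m + z * sem)

# Made-up samples for three hypothetical subjects (illustration only).
samples_by_subject = [
    [0.2, 0.6, 0.4],
    [0.9, 0.1, 0.5],
    [0.3, 0.7, 0.8],
]

m, (lo, hi) = pooled_mean_with_ci(samples_by_subject)
print(f"pooled mean = {m:.2f}, approx. 95% CI = [{lo:.2f}, {hi:.2f}]")
```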

# 7 - Optional Section: LME

(PENDING)

Estimate effect size and responders

In Intro: initial H is some people respond and some don't

# 8 - Summary

(PENDING)