# Reference: Hypothesis Testing

- Is chocolate good for you?
- Does coffee cause sleep deprivation?
- Does smoking cause cancer?

Observation is a key to good science. An ***observational study*** is one in which scientists make conclusions based on data that they have observed but had no hand in generating.

In data science, many such studies involve observations on a group of individuals, a factor of interest called a **treatment**, and an **outcome** measured on each individual.

The establishment of **causality** often takes place in two stages.

- First, an association is observed.
- Next, a more careful analysis leads to a decision about causality.

## The Hierarchy of Scientific Evidence

**Randomized Controlled Trials** (RCTs) are the **gold standard for scientific research**.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Hierarchy_of_Evidence.png/390px-Hierarchy_of_Evidence.png">

Figure: Hierarchy of Scientific Evidence

We note a few things here:

- Not all scientific studies are created equal
- Case reports (of one patient) are at the bottom of the hierarchy
- Observational studies (like cohort and case-control studies) are in the middle
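- Randomized controlled trials (RCTs) sit near the top as the strongest individual study design
- Meta-analyses, which pool the results of many studies, sit above them and are considered the most reliable form of evidence we have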


You can read more about the hierarchy of evidence at [**The Logic of Science**](https://thelogicofscience.com/2016/01/12/the-hierarchy-of-evidence-is-the-studys-design-robust/).

## Step 1: Experiment Design, Conducting Trials and Collecting Data

We need to collect data while avoiding bias. The best tool scientists have for this is *randomization*.

> ***Randomization*** is the process of assigning participants to treatment and control groups, assuming that each participant has an equal chance of being assigned to any group. ([NIH](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267325/#:~:text=What%20Is%20Randomization%3F,being%20assigned%20to%20any%20group)). This helps to control for *confounding factors* that could influence the results.

In *randomized experiments*, we have two groups:

- ***Test or Treatment Group***: a group where the treatment is applied
- ***Control Group***: a group where no treatment is applied (often, a placebo is given to eliminate psychological effects)
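
The assignment into these two groups can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the `randomize` helper, the participant IDs, and the fixed seed are all assumptions made for the example.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into two groups of (near) equal size."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)           # every ordering is equally likely
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant IDs, used only for illustration.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomize(participants, seed=42)

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because the split depends only on the shuffle, neither the researcher nor the participant can influence who receives the treatment.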

### Confounding Factors

In observational studies, differences between treatment and control groups beyond the treatment itself can complicate conclusions about causality. These differences, known as **confounding factors**, can obscure the true relationship between variables.

**Example: Coffee and Lung Cancer**

In the 1960s, research indicated that coffee drinkers had higher lung cancer rates. This led some to mistakenly think coffee was a cause of lung cancer. The real issue was a confounding factor: smoking. Many coffee drinkers also smoked, and smoking is a known cause of lung cancer. Thus, the link between coffee and lung cancer was due to smoking, not coffee.

Confounding factors are prevalent in observational studies. Rigorous studies strive to minimize confounding and adjust for its impact to ensure accurate results.
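
The coffee example can be reproduced with a small simulation. The sketch below assumes made-up rates (30% smokers, smokers more likely to drink coffee, cancer risk depending only on smoking); none of these numbers come from real studies. It compares the naive coffee vs. no-coffee cancer rates with the same comparison made separately among smokers and non-smokers.

```python
import random

rng = random.Random(0)


def simulate_person():
    smokes = rng.random() < 0.3
    # Assumed: smokers are more likely to drink coffee.
    drinks_coffee = rng.random() < (0.8 if smokes else 0.4)
    # Assumed: cancer risk depends ONLY on smoking, never on coffee.
    cancer = rng.random() < (0.15 if smokes else 0.01)
    return smokes, drinks_coffee, cancer


people = [simulate_person() for _ in range(100_000)]


def cancer_rate(rows):
    rows = list(rows)
    return sum(cancer for _, _, cancer in rows) / len(rows)


# Naive comparison: ignores smoking entirely.
naive_coffee = cancer_rate(p for p in people if p[1])
naive_no_coffee = cancer_rate(p for p in people if not p[1])
print(f"All people: coffee {naive_coffee:.3f} vs no coffee {naive_no_coffee:.3f}")

# Stratified comparison: compare coffee vs no coffee within each smoking group.
stratum_diff = {}
for smokes in (True, False):
    with_coffee = cancer_rate(p for p in people if p[0] == smokes and p[1])
    without_coffee = cancer_rate(p for p in people if p[0] == smokes and not p[1])
    stratum_diff[smokes] = with_coffee - without_coffee
    label = "Smokers" if smokes else "Non-smokers"
    print(f"{label}: coffee {with_coffee:.3f} vs no coffee {without_coffee:.3f}")
```

In the naive comparison coffee drinkers show a clearly higher cancer rate, yet within each smoking group the difference nearly disappears, which is what adjusting for a confounder looks like in its simplest form.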

### Sampling Bias

**Sampling bias** occurs when the method used to select participants systematically favors certain individuals or outcomes over others. As a result, the sample fails to act as a "miniature" version of the population, leading to **systematic error** rather than just random chance.

#### What Is Wrong and How It Happens

The core issue is **non-representativeness**. If a study claims to speak for "all people" but only gathers data from one specific type of person, its conclusions hold only for that group. This happens through:

* **Convenience Sampling:** Choosing people who are easy to reach (e.g., only surveying people at a specific mall or university).
* **Self-Selection (Voluntary Response):** Allowing people to "opt-in." This typically attracts those with extreme views (very angry or very happy), leaving out the "silent majority."
* **Undercoverage:** Using a medium that excludes certain groups (e.g., an online-only survey that misses the elderly or low-income households without internet).

#### Impact on Conclusions

When sampling bias is present, your conclusions reflect the **bias of the selection process** rather than the **reality of the population**.

1. **False Generalization:** You might conclude a new app is "easy to use" because tech-savvy students found it easy, even though the general public finds it hard to navigate.
2. **Invisible Factors:** The bias can hide the very thing you are trying to study. If you only study survivors, you’ll never see the factors that caused others to fail.
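
A short simulation makes the false-generalization case concrete. The population below is entirely hypothetical: we assume 20% tech-savvy students who rate the app around 8 and a general public that rates it around 4 (on a rough 0 to 10 scale). The sketch compares a simple random sample with a convenience sample drawn only from students.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: (is_student, ease_of_use_score).
population = []
for _ in range(50_000):
    is_student = rng.random() < 0.2
    score = rng.gauss(8.0, 1.0) if is_student else rng.gauss(4.0, 1.5)
    population.append((is_student, score))

true_mean = statistics.fmean(score for _, score in population)

# Simple random sample: every person has the same chance of being selected.
random_sample = rng.sample(population, 500)
random_mean = statistics.fmean(score for _, score in random_sample)

# Convenience sample: only students are reachable.
students = [p for p in population if p[0]]
convenience_sample = rng.sample(students, 500)
convenience_mean = statistics.fmean(score for _, score in convenience_sample)

print(f"Population mean:         {true_mean:.2f}")
print(f"Random sample mean:      {random_mean:.2f}")
print(f"Convenience sample mean: {convenience_mean:.2f}")
```

Both samples have the same size, so the gap between them is not a matter of collecting more data: the convenience sample is off because of who was allowed into it.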

#### Real-World Examples

##### 1. The 1948 U.S. Presidential Election ("Dewey Defeats Truman")

In one of the most famous blunders in polling history, the *Chicago Daily Tribune* printed a headline declaring Thomas Dewey the winner before all votes were counted.

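* **How it happened:** The major polls of the time relied on quota sampling: interviewers had to fill demographic quotas but were free to choose whom to interview within them, and they tended to pick people who were easier to reach and better off, a group that leaned Republican (for Dewey). The pollsters also stopped surveying weeks before election day. (The often-repeated story of a telephone-based sample fits the 1936 *Literary Digest* poll, which drew names from telephone directories and automobile registrations.)
* **The Result:** The samples were biased toward a specific economic class. The pollsters underestimated the working-class voters who supported Harry Truman.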

##### 2. WWII Aircraft Armor (Survivorship Bias)

During World War II, the U.S. military wanted to add armor to its bombers. Analysts examined returning planes and found that most of the bullet holes were in the wings and fuselage. They initially concluded that armor should be added to those specific "high-damage" areas.

* **How it happened:** The "sample" consisted only of planes that **successfully returned**. They were ignoring the planes that were shot down.
* **The Correction:** Mathematician Abraham Wald realized the planes being studied were the ones that *could* be shot in the wings and still fly home. The holes they *didn't* see—the engines and cockpit—were the areas where a single hit meant the plane never returned. They armored the spots that had **no holes** on the returning planes, saving countless lives by accounting for the missing data.

##### 3. Medical Research and "Healthy User Bias"

Many early studies on the benefits of vitamins or specific diets found that people who took supplements had significantly lower rates of heart disease.

* **How it happened:** People who take vitamins are also more likely to exercise, avoid smoking, and have higher incomes (giving them better healthcare access). This is a form of **selection bias** where the sample is "pre-filtered" for health-conscious people.
* **The Result:** Initial conclusions suggested the vitamins *caused* the health benefits. Later, more rigorous "randomized" trials (where participants didn't choose to take the vitamin) showed that the vitamins themselves often had little to no effect. The original studies were just measuring the habits of a "healthy" group of people, not the power of the pill.

## Step 2: Inference About Each Group

Let's first state our hypotheses:
- The Null Hypothesis $H_0$ states that the mean $\mu_a$ of the Test Group is equal to the mean $\mu_0$ of the Control Group: $H_0: \mu_a = \mu_0$.
- The Alternative Hypothesis $H_a$ states that the mean $\mu_a$ of the Test Group is not equal to the mean $\mu_0$ of the Control Group: $H_a: \mu_a \neq \mu_0$.

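These are two populations (Treatment and Control). We choose our Confidence Level ($95\%$ by convention) and estimate their means, $\mu_0$ and $\mu_a$, by repeatedly (say $k = 1000$ times) drawing large samples (say $n = 50$ each) and looking at the distribution of the sample means. This goes back to the [Central Limit Theorem](../techniques/central_limit_theorem.ipynb): the means of large samples are approximately normally distributed around the population mean.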
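
The repeated-sampling idea can be sketched as follows. The two populations are invented for the example (skewed, exponentially distributed outcomes with means of roughly 10 and 12), and the helper names are assumptions; only $k = 1000$ and $n = 50$ come from the text.

```python
import random
import statistics

rng = random.Random(7)
K, N = 1000, 50  # k repeated samples of size n, as in the text

# Hypothetical populations with skewed (exponential) outcomes.
control_pop = [rng.expovariate(1 / 10.0) for _ in range(20_000)]    # mean ~10
treatment_pop = [rng.expovariate(1 / 12.0) for _ in range(20_000)]  # mean ~12


def sampling_distribution(population, k=K, n=N):
    """Draw k samples of size n and return the k sample means."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(k)]


def summarize(means):
    """Return the average of the sample means and the range holding the middle 95%."""
    ordered = sorted(means)
    lower = ordered[int(0.025 * len(ordered))]
    upper = ordered[int(0.975 * len(ordered)) - 1]
    return statistics.fmean(means), (lower, upper)


mu_0_hat, interval_0 = summarize(sampling_distribution(control_pop))
mu_a_hat, interval_a = summarize(sampling_distribution(treatment_pop))

print(f"Control:   mean ~ {mu_0_hat:.2f}, middle 95% of sample means {interval_0}")
print(f"Treatment: mean ~ {mu_a_hat:.2f}, middle 95% of sample means {interval_a}")
```

Even though the individual outcomes are strongly skewed, the sample means cluster symmetrically around each population mean, which is the Central Limit Theorem at work.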

## Step 3: Statistical Significance

We now have two estimates of the means: $\mu_0$ and $\mu_a$. But how can we be sure that the difference between them is not due to random chance? This is where *Statistical Significance* comes in.

The **t-test** is a statistical test that assesses whether the difference between two means is statistically significant. For a *Confidence Level* of $95\%$, the result is significant if the p-value (the outcome of the t-test) is less than the *Significance Level* $\alpha = 1 - 0.95 = 0.05$.
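
As a rough sketch of the mechanics, the code below computes Welch's t statistic (the variant that does not assume equal variances) using only the standard library. Two things are assumptions of this example rather than part of the text: the simulated data, and the p-value, which uses a normal approximation to the t distribution that is reasonable only for large samples.

```python
import math
import random
import statistics
from statistics import NormalDist

ALPHA = 0.05


def welch_t(sample_a, sample_0):
    """Welch's t statistic: difference in means divided by its standard error."""
    mean_a, mean_0 = statistics.fmean(sample_a), statistics.fmean(sample_0)
    var_a, var_0 = statistics.variance(sample_a), statistics.variance(sample_0)
    standard_error = math.sqrt(var_a / len(sample_a) + var_0 / len(sample_0))
    return (mean_a - mean_0) / standard_error


def approx_two_sided_p(t):
    """Two-sided p-value via a normal approximation (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Simulated data, for illustration only.
rng = random.Random(3)
control = [rng.gauss(100, 15) for _ in range(200)]
treatment = [rng.gauss(110, 15) for _ in range(200)]

t = welch_t(treatment, control)
p = approx_two_sided_p(t)
print(f"t = {t:.2f}, approximate p = {p:.2g}")
print("Significant" if p < ALPHA else "Not significant", f"at alpha = {ALPHA}")
```

If SciPy is available, `scipy.stats.ttest_ind(treatment, control, equal_var=False)` performs the same test with an exact p-value from the t distribution, which matters for small samples.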

The ***p-value*** is the probability of observing a test statistic at least as extreme as the one computed from the sample data, assuming that the null hypothesis ($\mu_a = \mu_0$) is true. A small p-value means the observed difference would be unlikely if the treatment had no effect.
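
One way to see this definition in action is a permutation test, used here in place of the t-test because it computes the p-value directly from its definition. If the null hypothesis is true, the group labels carry no information, so we can shuffle them many times and count how often the shuffled difference in means is at least as extreme as the observed one. The function name, the simulated data, and the number of permutations are assumptions of this sketch.

```python
import random
import statistics


def permutation_p_value(sample_a, sample_0, n_permutations=5000, seed=0):
    """Two-sided p-value for the difference in means under shuffled labels."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(sample_a) - statistics.fmean(sample_0))
    pooled = list(sample_a) + list(sample_0)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Count the observed labeling itself so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Simulated data, for illustration only.
rng = random.Random(5)
control = [rng.gauss(100, 15) for _ in range(50)]
no_effect = [rng.gauss(100, 15) for _ in range(50)]    # same distribution as control
real_effect = [rng.gauss(120, 15) for _ in range(50)]  # shifted mean

p_no_effect = permutation_p_value(no_effect, control)
p_real_effect = permutation_p_value(real_effect, control)
print(f"No real effect: p = {p_no_effect:.3f}")
print(f"Real effect:    p = {p_real_effect:.4f}")
```

The p-value is a statement about the data under the null hypothesis, not the probability that the null hypothesis itself is true.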

Remember, the p-value will always be tied to a null hypothesis.

## Step 4: Check for Type I and Type II Errors

- **Type I Error (False Positive)**: mistakenly concluding an effect is real (when it is due to chance)
    - example: the test result says you have coronavirus, but you actually don’t
    - example: the test result says the drug is effective, but it actually isn’t
- **Type II Error (False Negative)**: mistakenly concluding an effect is due to chance (when it is real)
    - example: the test result says you don’t have coronavirus, but you actually do
    - example: the test result says the drug is not effective, but it actually is
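
Both error rates can be estimated by simulating many experiments. The sketch below is an illustration under stated assumptions: outcomes are normally distributed with standard deviation 1, each group has 50 participants, the "real" effect is a shift of 0.5, and the test is a large-sample z-test on the difference in means (a stand-in for the t-test).

```python
import math
import random
import statistics
from statistics import NormalDist

ALPHA = 0.05
rng = random.Random(11)


def rejects_null(sample_a, sample_0, alpha=ALPHA):
    """Large-sample two-sided z-test on the difference in means."""
    standard_error = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_0) / len(sample_0)
    )
    z = (statistics.fmean(sample_a) - statistics.fmean(sample_0)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha


def rejection_rate(true_effect, n=50, experiments=2000):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rejections = 0
    for _ in range(experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        rejections += rejects_null(treatment, control)
    return rejections / experiments


# Null hypothesis true: every rejection is a false positive (Type I error).
type_1_rate = rejection_rate(true_effect=0.0)
# Null hypothesis false: every non-rejection is a false negative (Type II error).
type_2_rate = 1 - rejection_rate(true_effect=0.5)

print(f"Type I error rate:  {type_1_rate:.3f} (should be close to alpha = {ALPHA})")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I error rate is controlled by the choice of $\alpha$, while the Type II error rate depends on the sample size and the size of the real effect: larger samples or larger effects make false negatives less likely.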