# Module 01: Advanced Statistical Inference

**Estimated Time:** 45 minutes

## Learning Objectives

By the end of this module, you will be able to:

1. ✅ Conduct and interpret hypothesis tests (t-tests, ANOVA, chi-square)
2. ✅ Calculate and interpret effect sizes (Cohen's d, eta-squared, Cramér's V)
3. ✅ Understand and calculate statistical power
4. ✅ Distinguish between statistical and practical significance
5. ✅ Address multiple comparison problems with corrections
6. ✅ Report statistical results professionally

## Why Advanced Statistical Inference Matters

**Beginner Question:** "Is there a difference?"

**Intermediate Question:** "How big is the difference? Is it meaningful? Am I confident in this result?"

### The Problem with p-values Alone

A common mistake:
- ❌ "p < 0.05, therefore it's important!"
- ✅ "p < 0.05, effect size is large, power is adequate, therefore it's important!"

**Statistical significance ≠ Practical significance**

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

warnings.filterwarnings("ignore")

# Try to import pingouin for advanced stats
try:
    import pingouin as pg

    PINGOUIN_AVAILABLE = True
except ImportError:
    PINGOUIN_AVAILABLE = False
    print("⚠️  Pingouin not installed. Install with: pip install pingouin")

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Random seed for reproducibility
np.random.seed(42)

# Create output directory
import os

output_dir = "outputs/module01"
os.makedirs(output_dir, exist_ok=True)

print("✅ Module 01: Advanced Statistical Inference - Setup Complete!")
print(f"📁 Outputs will be saved to: {output_dir}")

---

## Part 1: Hypothesis Testing Fundamentals

### The Null Hypothesis Significance Testing (NHST) Framework

#### Core Concepts:

1. **Null Hypothesis (H₀)**: No effect/difference exists
2. **Alternative Hypothesis (H₁)**: An effect/difference exists
3. **Test Statistic**: Quantifies how far your data depart from what H₀ predicts
4. **p-value**: Probability of observing data at least as extreme as yours if H₀ is true (it is not the probability that H₀ is true)
5. **Significance Level (α)**: Threshold for rejecting H₀ (typically 0.05)

### Types of Errors

| Reality | Decision: Fail to Reject H₀ | Decision: Reject H₀ |
|---------|-----------------------------|----------------------|
| **H₀ is TRUE** | ✅ Correct | ❌ Type I Error (False Positive) |
| **H₀ is FALSE** | ❌ Type II Error (False Negative) | ✅ Correct |

**Type I Error (α)**: False alarm - saying there's an effect when there isn't

**Type II Error (β)**: Missing a real effect

**Power (1-β)**: Probability of correctly detecting a real effect

### Common Hypothesis Tests

| Test | Use Case | Variables |
|------|----------|----------|
| **t-test** | Compare 2 group means | 1 categorical (2 levels) + 1 continuous |
| **ANOVA** | Compare 3+ group means | 1 categorical (3+ levels) + 1 continuous |
| **Chi-square** | Test independence | 2 categorical variables |
| **Correlation** | Test association | 2 continuous variables |
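
The last two rows of the table can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical 2x2 table of counts and simulated continuous data; `stats.chi2_contingency` and `stats.pearsonr` are the SciPy calls, everything else (variable names, counts, seed) is made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # local generator; leaves the global seed untouched

# Chi-square test of independence on a hypothetical 2x2 table of counts
# rows: group (A, B); columns: outcome (yes, no)
observed = np.array([[30, 20],
                     [18, 32]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square: chi2({dof}) = {chi2:.2f}, p = {p_chi2:.4f}")

# Pearson correlation between two simulated continuous variables
x = rng.normal(0, 1, 100)
y = 0.5 * x + rng.normal(0, 1, 100)
r, p_r = stats.pearsonr(x, y)
print(f"Correlation: r = {r:.2f}, p = {p_r:.4f}")
```

Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default; pass `correction=False` to get the uncorrected statistic.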

In [None]:
# Demonstrate Type I and Type II errors through simulation


def simulate_hypothesis_tests(true_effect_size=0, n_simulations=1000, sample_size=50, alpha=0.05):
    """
    Simulate hypothesis tests to demonstrate error rates.

    Parameters:
    -----------
    true_effect_size : float
        True Cohen's d (0 = no effect)
    n_simulations : int
        Number of simulations to run
    sample_size : int
        Sample size per group
    alpha : float
        Significance level
    """
    significant_results = 0

    for _ in range(n_simulations):
        # Generate data
        group1 = np.random.normal(0, 1, sample_size)
        group2 = np.random.normal(true_effect_size, 1, sample_size)

        # Perform t-test
        t_stat, p_value = stats.ttest_ind(group1, group2)

        if p_value < alpha:
            significant_results += 1

    proportion_significant = significant_results / n_simulations

    return proportion_significant


# Simulate Type I error (no true effect)
print("SIMULATION: Type I Error Rate")
print("=" * 80)
type1_rate = simulate_hypothesis_tests(true_effect_size=0, n_simulations=1000)
print("True effect size: 0 (no effect)")
print(f"Proportion of significant results: {type1_rate:.3f}")
print("Expected Type I error rate (α): 0.05")
print(f"Actual rate: {type1_rate:.3f}")
print("\n💡 When there's no effect, we should reject H₀ about 5% of the time (Type I error)")

# Simulate Power (with true effect)
print("\n" + "=" * 80)
print("SIMULATION: Statistical Power")
print("=" * 80)
power = simulate_hypothesis_tests(true_effect_size=0.5, n_simulations=1000)
print("True effect size: 0.5 (medium effect)")
print(f"Proportion of significant results (Power): {power:.3f}")
print(f"Type II error rate (β): {1-power:.3f}")
print("\n💡 Power tells us how likely we are to detect a real effect when it exists")

---

## Part 2: T-Tests in Detail

### Types of T-Tests

#### 1. Independent Samples T-Test
- Compare two **independent** groups
- Example: Treatment vs Control (different participants)

**Assumptions:**
- Independence of observations
- Normality (for each group)
- Homogeneity of variance (equal variances)

#### 2. Paired Samples T-Test
- Compare two **related** measurements
- Example: Before vs After (same participants)

#### 3. One-Sample T-Test
- Compare sample mean to a known value
- Example: Is average score different from 50?
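
The paired and one-sample variants, plus a check of the equal-variance assumption, might look like the sketch below. The data are simulated and the group names are hypothetical; Welch's test (`equal_var=False`) is shown as one common fallback when variances differ, not as the module's prescribed method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # local generator with an arbitrary seed

# Paired samples: the same 25 hypothetical participants measured twice
before = rng.normal(70, 10, 25)
after = before + rng.normal(3, 5, 25)  # simulated average gain of 3 points
t_paired, p_paired = stats.ttest_rel(after, before)

# One-sample: does the mean of `before` differ from a reference value of 50?
t_one, p_one = stats.ttest_1samp(before, popmean=50)

# Independent samples with unequal spread: test the variance assumption,
# then use Welch's t-test, which does not assume equal variances
group_a = rng.normal(70, 5, 30)
group_b = rng.normal(74, 15, 30)
levene_stat, p_levene = stats.levene(group_a, group_b)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Paired:     t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"One-sample: t = {t_one:.2f}, p = {p_one:.4f}")
print(f"Levene:     W = {levene_stat:.2f}, p = {p_levene:.4f}")
print(f"Welch:      t = {t_welch:.2f}, p = {p_welch:.4f}")
```

A paired t-test is equivalent to a one-sample t-test on the within-person differences, which is why it gains power when the two measurements are correlated.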

In [None]:
# Independent samples t-test example

# Generate realistic data: Study time effect on test scores
np.random.seed(42)
n_per_group = 40

# Control group: no extra study (mean=70, sd=10)
control_scores = np.random.normal(70, 10, n_per_group)

# Treatment group: extra study (mean=76, sd=10)
treatment_scores = np.random.normal(76, 10, n_per_group)

# Perform t-test (treatment first, so a positive t means treatment > control)
t_stat, p_value = stats.ttest_ind(treatment_scores, control_scores)

print("INDEPENDENT SAMPLES T-TEST")
print("=" * 80)
print("\nResearch Question: Does extra study time improve test scores?")
print(f"\nControl group (n={n_per_group}):")
print(f"  Mean = {control_scores.mean():.2f}")
print(f"  SD = {control_scores.std(ddof=1):.2f}")

print(f"\nTreatment group (n={n_per_group}):")
print(f"  Mean = {treatment_scores.mean():.2f}")
print(f"  SD = {treatment_scores.std(ddof=1):.2f}")

print(f"\nMean difference: {treatment_scores.mean() - control_scores.mean():.2f} points")

print("\nTest Results:")
print(f"  t-statistic = {t_stat:.3f}")
print(f"  p-value = {p_value:.4f}")
print(f"  df = {n_per_group + n_per_group - 2}")

if p_value < 0.05:
    # The two-sided test only says the means differ; read the direction from the sign of t
    direction = "higher" if t_stat > 0 else "lower"
    print("\n✅ Result: SIGNIFICANT (p < 0.05)")
    print(f"   We reject H₀: The treatment group scored significantly {direction}")
else:
    print("\n❌ Result: NOT SIGNIFICANT (p ≥ 0.05)")
    print("   We fail to reject H₀: No significant difference detected")

# But wait - is this MEANINGFUL? We'll calculate effect size next!

In [None]:
# Visualize the t-test

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
data_dict = {
    "Score": np.concatenate([control_scores, treatment_scores]),
    "Group": ["Control"] * n_per_group + ["Treatment"] * n_per_group,
}
df_ttest = pd.DataFrame(data_dict)

sns.boxplot(data=df_ttest, x="Group", y="Score", ax=ax1)
sns.swarmplot(data=df_ttest, x="Group", y="Score", color="black", alpha=0.3, ax=ax1)
ax1.set_title("Test Scores by Group", fontsize=14, fontweight="bold")
ax1.set_ylabel("Score", fontsize=12)
ax1.set_xlabel("Group", fontsize=12)

# Distribution plot
ax2.hist(control_scores, bins=15, alpha=0.6, label="Control", color="lightblue", edgecolor="black")
ax2.hist(treatment_scores, bins=15, alpha=0.6, label="Treatment", color="orange", edgecolor="black")
ax2.axvline(control_scores.mean(), color="blue", linestyle="--", linewidth=2, label="Control Mean")
ax2.axvline(
    treatment_scores.mean(), color="red", linestyle="--", linewidth=2, label="Treatment Mean"
)
ax2.set_xlabel("Score", fontsize=12)
ax2.set_ylabel("Frequency", fontsize=12)
ax2.set_title("Distribution of Scores", fontsize=14, fontweight="bold")
ax2.legend()

plt.tight_layout()
plt.savefig(os.path.join(output_dir, "ttest_visualization.png"), dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 Visual inspection helps understand the magnitude of difference")

---

## Part 3: Effect Sizes - The Missing Piece

### Why Effect Sizes Matter

**Problem with p-values:**
- With large samples, tiny differences become "significant"
- With small samples, large differences may be "non-significant"
- p-value doesn't tell you HOW BIG the effect is

**Solution: Report Effect Sizes!**

### Common Effect Sizes

#### For T-Tests: Cohen's d

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$

**Interpretation (Cohen, 1988):**
- Small: d = 0.2
- Medium: d = 0.5
- Large: d = 0.8
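
The formula above leaves $s_{pooled}$ undefined. The usual definition for two independent groups, which is also what the `cohens_d` function later in this module computes, is:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

As a quick check with the population values used in this module's t-test example (means 70 and 76, both standard deviations 10), $s_{pooled} = 10$ and $d = (76 - 70)/10 = 0.6$. Sample estimates will vary around that value.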

#### For ANOVA: Eta-Squared (η²)

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

**Interpretation:**
- Small: η² = 0.01 (1% of variance explained)
- Medium: η² = 0.06 (6% of variance explained)
- Large: η² = 0.14 (14% of variance explained)

#### For Chi-Square: Cramér's V

$$V = \sqrt{\frac{\chi^2}{n \cdot (k-1)}}$$

where k = min(rows, columns)
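
The formula translates directly into code. The helper name `cramers_v` and the table of counts are assumptions made for this sketch; the chi-square statistic is taken from `stats.chi2_contingency` without the continuity correction so that it matches the formula as written.

```python
import numpy as np
from scipy import stats


def cramers_v(table):
    """Cramér's V for a contingency table of counts (hypothetical helper)."""
    table = np.asarray(table)
    # Uncorrected chi-square statistic, as used in the formula above
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()          # total number of observations
    k = min(table.shape)     # min(rows, columns)
    return np.sqrt(chi2 / (n * (k - 1)))


# Hypothetical 2x3 table: rows are groups, columns are response categories
observed = np.array([[30, 20, 10],
                     [15, 25, 20]])
print(f"Cramér's V = {cramers_v(observed):.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), which makes it comparable across tables of different sizes.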

### Statistical vs Practical Significance

| Scenario | p-value | Effect Size | Interpretation |
|----------|---------|-------------|----------------|
| A | 0.001 | d = 0.15 | Statistically significant, but TINY effect |
| B | 0.08 | d = 0.9 | Not significant, but LARGE effect (underpowered?) |
| C | 0.01 | d = 0.8 | Significant AND large effect! |

**Best practice: Always report BOTH p-value AND effect size!**

In [None]:
# Calculate Cohen's d


def cohens_d(group1, group2):
    """
    Calculate Cohen's d for independent samples.

    Parameters:
    -----------
    group1, group2 : array-like
        Data for each group

    Returns:
    --------
    float
        Cohen's d effect size
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    # Cohen's d
    d = (np.mean(group1) - np.mean(group2)) / pooled_std

    return d


# Calculate for our earlier example
d = cohens_d(treatment_scores, control_scores)

print("EFFECT SIZE CALCULATION")
print("=" * 80)
print(f"\nCohen's d = {d:.3f}")

# Interpret
if abs(d) < 0.2:
    interpretation = "Very Small"
elif abs(d) < 0.5:
    interpretation = "Small"
elif abs(d) < 0.8:
    interpretation = "Medium"
else:
    interpretation = "Large"

print(f"Interpretation: {interpretation} effect")

# What does this mean in practical terms?
print(f"\nPractical Meaning:")
print(f"  The treatment improved scores by {d:.2f} standard deviations")
print(
    f"  In our example: {abs(treatment_scores.mean() - control_scores.mean()):.1f} points difference"
)

# Combine with p-value for complete reporting
print("\n📊 Complete Reporting:")
print(
    f"  The treatment group (M = {treatment_scores.mean():.2f}, SD = {treatment_scores.std(ddof=1):.2f})"
)
print(f"  scored significantly higher than the control group")
print(f"  (M = {control_scores.mean():.2f}, SD = {control_scores.std(ddof=1):.2f}),")
print(f"  t({n_per_group*2-2}) = {t_stat:.2f}, p = {p_value:.3f}, d = {d:.2f}.")

In [None]:
# Demonstrate the relationship between sample size, effect size, and significance

sample_sizes = [10, 30, 50, 100, 200, 500]
true_effect = 0.3  # Small-to-medium effect

results = []

for n in sample_sizes:
    # Generate data (reseeding makes each sample size reproducible on its own)
    np.random.seed(42)
    group1 = np.random.normal(0, 1, n)
    group2 = np.random.normal(true_effect, 1, n)

    # Test (group2 first, so a positive t and d mean group2 > group1)
    t, p = stats.ttest_ind(group2, group1)
    d = cohens_d(group2, group1)

    results.append(
        {"N per group": n, "p-value": p, "Cohen's d": d, "Significant": "Yes" if p < 0.05 else "No"}
    )

results_df = pd.DataFrame(results)

print("EFFECT OF SAMPLE SIZE")
print("=" * 80)
print(f"True effect size (Cohen's d): {true_effect}")
print("\n" + results_df.to_string(index=False))

print("\n💡 Key Insights:")
print("  • The estimated effect size is noisy in small samples and settles near the true value as n grows")
print("  • The p-value tends to shrink as sample size increases")
print("  • With large samples, even small effects become 'significant'")
print("  • Effect size tells you if the result is MEANINGFUL!")

---

## Part 4: Statistical Power Analysis

### What is Statistical Power?

**Power = Probability of detecting an effect when it truly exists**

**Power = 1 - β (Type II error rate)**

### Factors Affecting Power

1. **Effect Size** ↑ → Power ↑
   - Larger effects are easier to detect

2. **Sample Size** ↑ → Power ↑
   - More data = better ability to detect effects

3. **Significance Level (α)** ↑ → Power ↑
   - More lenient threshold = easier to reject H₀
   - But also increases Type I error!

4. **Variability** ↓ → Power ↑
   - Less noise = clearer signal
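
The four factors can be checked numerically. The sketch below computes two-sided power for an independent t-test from the noncentral t distribution (`stats.nct`); the helper name `power_two_sample` and the scenario values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def power_two_sample(mean_diff, sd, n_per_group, alpha=0.05):
    """Two-sided power of an independent t-test (hypothetical helper)."""
    d = mean_diff / sd                         # standardized effect size
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)         # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)    # critical value under H0
    # Probability that the noncentral t statistic lands beyond either critical value
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


baseline = power_two_sample(mean_diff=5, sd=10, n_per_group=40)
larger_effect = power_two_sample(mean_diff=8, sd=10, n_per_group=40)
larger_n = power_two_sample(mean_diff=5, sd=10, n_per_group=80)
lenient_alpha = power_two_sample(mean_diff=5, sd=10, n_per_group=40, alpha=0.10)
less_noise = power_two_sample(mean_diff=5, sd=5, n_per_group=40)

print(f"Baseline (diff=5, sd=10, n=40, alpha=0.05): {baseline:.3f}")
print(f"Larger effect (diff=8):                     {larger_effect:.3f}")
print(f"Larger sample (n=80):                       {larger_n:.3f}")
print(f"More lenient alpha (0.10):                  {lenient_alpha:.3f}")
print(f"Less variability (sd=5):                    {less_noise:.3f}")
```

Each change raises power relative to the baseline, but only the lenient α does so at the cost of a higher Type I error rate.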

### Standard Power Levels

- **0.80**: Minimum acceptable (80% chance of detecting effect)
- **0.90**: Better (90% chance)
- **0.95**: Excellent (95% chance)

### A Priori vs Post-Hoc Power Analysis

**A Priori (Before Study):**
- "How many participants do I need to detect this effect?"
- **USE THIS!** It's the gold standard

**Post-Hoc (After Study):**
- "What was my power given my sample?"
- Controversial - many statisticians discourage this
- Observed power (computed from the observed effect size) is a one-to-one function of the p-value, so it adds no information beyond the p-value itself
- If result is non-significant, better to report confidence intervals
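
One way to follow the confidence-interval advice is sketched below. It builds a 95% CI for the difference in means from the pooled standard error, matching the equal-variance t-test used in this module. The helper name `mean_diff_ci` and the simulated small study are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def mean_diff_ci(group1, group2, confidence=0.95):
    """Pooled-variance CI for mean(group1) - mean(group2) (hypothetical helper)."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = g1.size, g2.size
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))   # standard error of the difference
    diff = g1.mean() - g2.mean()
    margin = stats.t.ppf(1 - (1 - confidence) / 2, df) * se
    return diff, (diff - margin, diff + margin)


rng = np.random.default_rng(4)       # local generator with an arbitrary seed
treated = rng.normal(73, 10, 20)     # simulated small study, 20 per group
control = rng.normal(70, 10, 20)

diff, (low, high) = mean_diff_ci(treated, control)
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"Mean difference = {diff:.2f}, 95% CI [{low:.2f}, {high:.2f}], p = {p_value:.3f}")
```

A wide interval that includes zero tells the reader the study could not distinguish between no effect and a sizeable one, which is more informative than a post-hoc power figure.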

In [None]:
# A priori power analysis for t-test
from scipy.stats import t as t_dist, norm


def calculate_sample_size_ttest(effect_size, power=0.8, alpha=0.05):
    """
    Calculate required sample size for independent t-test.

    Parameters:
    -----------
    effect_size : float
        Expected Cohen's d
    power : float
        Desired power (typically 0.80)
    alpha : float
        Significance level (typically 0.05)

    Returns:
    --------
    int
        Required sample size per group
    """
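    # Critical values from the normal approximation to the t-distribution.
    # This slightly underestimates n (by about one per group) compared with
    # an exact solution based on the noncentral t-distribution.
    z_alpha = norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = norm.ppf(power)

    # Sample size per group: n = 2 * ((z_alpha + z_beta) / d)^2
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2

    return int(np.ceil(n))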


print("A PRIORI POWER ANALYSIS")
print("=" * 80)
print("\nResearch Question: How many participants do I need?")
print("\nScenario 1: Detecting a SMALL effect (d = 0.2)")
n_small = calculate_sample_size_ttest(effect_size=0.2, power=0.8)
print(f"  Required n per group: {n_small}")
print(f"  Total participants: {n_small * 2}")

print("\nScenario 2: Detecting a MEDIUM effect (d = 0.5)")
n_medium = calculate_sample_size_ttest(effect_size=0.5, power=0.8)
print(f"  Required n per group: {n_medium}")
print(f"  Total participants: {n_medium * 2}")

print("\nScenario 3: Detecting a LARGE effect (d = 0.8)")
n_large = calculate_sample_size_ttest(effect_size=0.8, power=0.8)
print(f"  Required n per group: {n_large}")
print(f"  Total participants: {n_large * 2}")

print("\n💡 Key Insight: Smaller effects require MUCH larger samples!")
print(f"   Detecting d=0.2 requires about {n_small/n_large:.1f}x as many participants as d=0.8")

In [None]:
# Visualize power as a function of sample size


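def calculate_power_ttest(n_per_group, effect_size, alpha=0.05):
    """
    Calculate statistical power for a two-sided independent t-test.
    """
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(n_per_group / 2)

    # Degrees of freedom
    df = 2 * n_per_group - 2

    # Critical t-value (central t-distribution under H₀)
    t_crit = stats.t.ppf(1 - alpha / 2, df)

    # Power: probability that the noncentral t statistic falls beyond ±t_crit.
    # stats.nct is required here: the third positional argument of the central
    # t-distribution's cdf is a location shift, not a noncentrality parameter.
    power = 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    return power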


# Generate power curves
sample_sizes = np.arange(10, 201, 5)
effect_sizes = [0.2, 0.5, 0.8]

fig, ax = plt.subplots(figsize=(10, 6))

for d in effect_sizes:
    powers = [calculate_power_ttest(n, d) for n in sample_sizes]
    ax.plot(sample_sizes, powers, label=f"d = {d}", linewidth=2)

ax.axhline(y=0.8, color="red", linestyle="--", label="Desired Power (0.80)", alpha=0.7)
ax.set_xlabel("Sample Size per Group", fontsize=12, fontweight="bold")
ax.set_ylabel("Statistical Power", fontsize=12, fontweight="bold")
ax.set_title("Power Analysis for Different Effect Sizes", fontsize=14, fontweight="bold")
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

plt.tight_layout()
plt.savefig(os.path.join(output_dir, "power_curve.png"), dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 Power Curve Insights:")
print("  • Larger effects reach 80% power with smaller samples")
print("  • Power increases rapidly at first, then plateaus")
print("  • Always plan for adequate power BEFORE collecting data!")

---

## Part 5: Multiple Comparisons Problem

### The Problem

If you do 20 independent tests at α = 0.05 and every null hypothesis is true:
- Expected number of false positives: 20 × 0.05 = **1 false positive**

**The more tests you run, the more likely you'll find a "significant" result by chance!**

### Family-Wise Error Rate (FWER)

Probability of making at least ONE Type I error across all tests (for independent tests, each run at level α):

$$FWER = 1 - (1 - \alpha)^m$$

where m = number of tests
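
Plugging a few values of m into the formula shows how quickly the error rate grows. This is a plain-Python sketch; the function name `family_wise_error_rate` is an assumption, and the formula holds for independent tests.

```python
def family_wise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 10, 20, 50):
    print(f"m = {m:>2}: FWER = {family_wise_error_rate(m):.3f}")
```

With 20 tests the chance of at least one false positive is already about 64%.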

### Correction Methods

#### 1. Bonferroni Correction
**Most conservative**

$$\alpha_{corrected} = \frac{\alpha}{m}$$

**Pros:** Simple, controls FWER strongly
**Cons:** Very conservative, reduces power

#### 2. Holm-Bonferroni (Step-Down)
**Less conservative than Bonferroni**

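- Sort p-values from smallest to largest
- Compare the i-th smallest p-value to α/(m-i+1), starting from i = 1
- Stop at the first p-value that exceeds its threshold; it and all larger p-values are not rejected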

**Pros:** More powerful than Bonferroni
**Cons:** Still conservative
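
The step-down procedure can be sketched in plain Python as follows. The function name `holm_bonferroni` and the five p-values are hypothetical, chosen so that Holm and plain Bonferroni give different answers.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep flag for each p-value, in the original order."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):       # rank 0 is the smallest p-value
        threshold = alpha / (m - rank)       # alpha/m, alpha/(m-1), ..., alpha/1
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break                            # stop at the first failure
    return reject


p_values = [0.038, 0.001, 0.30, 0.014, 0.012]   # hypothetical results
holm = holm_bonferroni(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]

print(f"Holm rejects:       {sum(holm)} of {len(p_values)}")
print(f"Bonferroni rejects: {sum(bonferroni)} of {len(p_values)}")
```

On these values Holm rejects three hypotheses while Bonferroni rejects only one, even though both control the FWER at 0.05.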

#### 3. False Discovery Rate (FDR) - Benjamini-Hochberg
**Controls the expected proportion of false discoveries among rejected hypotheses**

**Pros:** More powerful, appropriate for exploratory research
**Cons:** Less stringent than FWER control
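
The Benjamini-Hochberg step-up rule is not spelled out above, so here is a minimal plain-Python sketch of the standard procedure: find the largest rank i whose sorted p-value is at most (i/m)·α, then reject that hypothesis and every one with a smaller p-value. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/keep flag for each p-value, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank (1-based) whose p-value is under its threshold rank * alpha / m
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:             # reject everything up to that rank
        reject[idx] = True
    return reject


p_values = [0.038, 0.001, 0.30, 0.014, 0.012]   # hypothetical results
bh = benjamini_hochberg(p_values)
print(f"Benjamini-Hochberg rejects: {sum(bh)} of {len(p_values)}")
```

Unlike a step-down method, the loop does not stop at the first failure: a later p-value that passes its threshold rescues all smaller ones.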

### When to Correct?

**YES, correct when:**
- Testing multiple hypotheses in same analysis
- Looking at multiple outcomes
- Doing post-hoc comparisons after ANOVA

**NO correction needed when:**
- You have ONE pre-specified hypothesis
- Tests are independent research questions

In [None]:
# Demonstrate multiple comparison problem


def simulate_multiple_testing(n_tests=20, alpha=0.05, n_simulations=1000):
    """
    Simulate multiple testing to show inflation of Type I error.
    """
    any_significant = 0

    for _ in range(n_simulations):
        # Generate null data (no real effects)
        p_values = []
        for _ in range(n_tests):
            group1 = np.random.normal(0, 1, 30)
            group2 = np.random.normal(0, 1, 30)  # Same distribution!
            _, p = stats.ttest_ind(group1, group2)
            p_values.append(p)

        # Check if ANY test was significant
        if any(p < alpha for p in p_values):
            any_significant += 1

    return any_significant / n_simulations


print("MULTIPLE COMPARISONS PROBLEM")
print("=" * 80)
print("\nSimulation: 20 tests, all NULL (no real effects)")

fwer = simulate_multiple_testing(n_tests=20)

print(f"\nTheoretical FWER: {1 - (1-0.05)**20:.3f}")
print(f"Observed FWER (simulation): {fwer:.3f}")
print(f"\n⚠️  With 20 tests, you have a {fwer*100:.1f}% chance of at least one false positive!")
print("   Even though NONE of the effects are real!")

# Show correction impact
print("\n" + "=" * 80)
print("CORRECTION METHODS")
print("=" * 80)

original_alpha = 0.05
n_tests = 10
p_values = [0.001, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.1, 0.15]

# Bonferroni
bonferroni_alpha = original_alpha / n_tests
bonf_sig = sum(1 for p in p_values if p < bonferroni_alpha)

print(f"\nOriginal α = {original_alpha}")
print(f"Number of tests = {n_tests}")
print(f"\nP-values: {p_values}")

print("\n1. No Correction:")
print(f"   Significant results: {sum(1 for p in p_values if p < original_alpha)}/{n_tests}")

print("\n2. Bonferroni Correction:")
print(f"   Corrected α = {bonferroni_alpha:.4f}")
print(f"   Significant results: {bonf_sig}/{n_tests}")

print("\n💡 Bonferroni is conservative - reduces false positives but also power")

---

## Part 6: ANOVA - Comparing Multiple Groups

### When to Use ANOVA

**Research Question:** Are there differences among 3+ groups?

**Example:** Compare test scores across 4 teaching methods

### Why Not Multiple T-Tests?

With 4 groups, you'd need 6 t-tests (all pairs):
- A vs B, A vs C, A vs D, B vs C, B vs D, C vs D

**Problem:** Multiple comparison problem! FWER inflates.

**Solution:** Use ANOVA first (omnibus test), then post-hoc comparisons if significant.

### ANOVA Logic

**H₀:** All group means are equal (μ₁ = μ₂ = μ₃ = ...)

**H₁:** At least one mean is different

**F-statistic:**

$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

If F is large → groups differ more than expected by chance
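
To make the ratio concrete, the sketch below computes F by hand as the between-group mean square divided by the within-group mean square, then compares it with `stats.f_oneway`. The three simulated groups are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # local generator with an arbitrary seed
groups = [rng.normal(mu, 10, 30) for mu in (70, 75, 72)]  # hypothetical data

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)               # number of groups
n_total = all_values.size     # total number of observations

# Sums of squares: spread of group means around the grand mean,
# and spread of observations around their own group mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares are sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual: F({k - 1}, {n_total - k}) = {f_manual:.3f}, p = {p_manual:.4f}")
print(f"SciPy:  F = {f_scipy:.3f}, p = {p_scipy:.4f}")
```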

### Post-Hoc Tests

If ANOVA is significant, which groups differ?

**Common Post-Hoc Tests:**
- **Tukey HSD**: Controls FWER, all pairwise comparisons
- **Bonferroni**: Conservative, simple
- **Dunnett**: Compare all groups to one control group
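
As one illustration of the Bonferroni option, the sketch below runs all pairwise t-tests among four simulated groups and compares each p-value with α divided by the number of pairs. The group labels, means, and seed are assumptions for this example. If your SciPy version includes `scipy.stats.tukey_hsd`, that function covers the Tukey HSD option.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # local generator with an arbitrary seed
groups = {
    "A": rng.normal(70, 10, 30),
    "B": rng.normal(75, 10, 30),
    "C": rng.normal(72, 10, 30),
    "D": rng.normal(78, 10, 30),
}

pairs = list(combinations(groups, 2))    # all 6 unordered pairs of labels
alpha = 0.05
alpha_corrected = alpha / len(pairs)     # Bonferroni threshold

results = []
for a, b in pairs:
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    significant = p < alpha_corrected
    results.append((a, b, t_stat, p, significant))
    print(f"{a} vs {b}: t = {t_stat:6.2f}, p = {p:.4f}, "
          f"significant at {alpha_corrected:.4f}: {significant}")
```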

In [None]:
# ANOVA Example: Teaching methods

np.random.seed(42)

# Generate data for 4 teaching methods
method_A = np.random.normal(70, 10, 30)  # Traditional
method_B = np.random.normal(75, 10, 30)  # Flipped classroom
method_C = np.random.normal(72, 10, 30)  # Online
method_D = np.random.normal(78, 10, 30)  # Hybrid

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(method_A, method_B, method_C, method_D)

print("ONE-WAY ANOVA")
print("=" * 80)
print("\nResearch Question: Do teaching methods affect test scores?")

print("\nDescriptive Statistics:")
methods_data = {
    "Method": ["Traditional", "Flipped", "Online", "Hybrid"],
    "Mean": [method_A.mean(), method_B.mean(), method_C.mean(), method_D.mean()],
    "SD": [method_A.std(ddof=1), method_B.std(ddof=1), method_C.std(ddof=1), method_D.std(ddof=1)],
    "n": [len(method_A), len(method_B), len(method_C), len(method_D)],
}
methods_df = pd.DataFrame(methods_data)
print(methods_df.to_string(index=False))

print(f"\nANOVA Results:")
print(f"  F({3}, {len(method_A)*4-4}) = {f_stat:.3f}")
print(f"  p-value = {p_value:.4f}")

if p_value < 0.05:
    print("\n✅ Significant: At least one method differs from the others")
    print("   → Need post-hoc tests to determine which groups differ")
else:
    print("\n❌ Not significant: No evidence of differences among methods")

In [None]:
# Calculate eta-squared (effect size for ANOVA)

# Combine all data
all_scores = np.concatenate([method_A, method_B, method_C, method_D])
all_groups = (
    ["A"] * len(method_A) + ["B"] * len(method_B) + ["C"] * len(method_C) + ["D"] * len(method_D)
)

# Calculate sums of squares
grand_mean = all_scores.mean()
ss_total = np.sum((all_scores - grand_mean) ** 2)

group_means = [method_A.mean(), method_B.mean(), method_C.mean(), method_D.mean()]
group_sizes = [len(method_A), len(method_B), len(method_C), len(method_D)]
ss_between = sum(n * (mean - grand_mean) ** 2 for mean, n in zip(group_means, group_sizes))

# Eta-squared
eta_squared = ss_between / ss_total

print("\nEFFECT SIZE (Eta-Squared)")
print("=" * 80)
print(f"η² = {eta_squared:.4f}")
print(
    f"\nInterpretation: {eta_squared*100:.2f}% of variance in scores is explained by teaching method"
)

if eta_squared < 0.01:
    print("Effect size: Very Small")
elif eta_squared < 0.06:
    print("Effect size: Small")
elif eta_squared < 0.14:
    print("Effect size: Medium")
else:
    print("Effect size: Large")

---

## Practice Exercises

### Exercise 1: Conduct a Complete T-Test Analysis

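**Scenario:** A company tests two website designs (A vs B) on average order value per visitor, a continuous outcome suited to a t-test (a binary converted/not-converted outcome would call for a chi-square test instead).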

**Data:** Design A: n=50, Design B: n=50 (generate synthetic data)

**Tasks:**
1. Conduct independent t-test
2. Calculate Cohen's d
3. Report results professionally
4. Visualize the distributions

### Exercise 2: Power Analysis

**Scenario:** You're planning a study to test if a new drug reduces blood pressure.

**Tasks:**
1. Calculate required sample size for d=0.5, power=0.80
2. Create a power curve showing n from 10 to 100
3. What happens to power if you can only get n=30 per group?

### Exercise 3: Multiple Comparisons

**Scenario:** You test 15 different nutrients for their effect on plant growth.

**Tasks:**
1. If α=0.05, what's the FWER?
2. What's the Bonferroni-corrected α?
3. If the five smallest p-values are [0.001, 0.01, 0.03, 0.05, 0.07], which remain significant after correction?

### Exercise 4: ANOVA Practice

**Scenario:** Compare 5 different diet plans on weight loss.

**Tasks:**
1. Generate synthetic data for 5 groups
2. Conduct one-way ANOVA
3. Calculate eta-squared
4. Create box plots for visualization

---

## Summary and Key Takeaways

### 🎯 What We Learned

1. **Hypothesis Testing Framework**
   - Null vs alternative hypotheses
   - Type I and Type II errors
   - p-values and significance levels

2. **Effect Sizes**
   - Cohen's d for t-tests
   - Eta-squared for ANOVA
   - Statistical vs practical significance

3. **Statistical Power**
   - Definition and importance
   - Factors affecting power
   - A priori sample size calculation

4. **Multiple Comparisons**
   - FWER inflation problem
   - Bonferroni and other corrections
   - When to correct

5. **ANOVA**
   - Comparing 3+ groups
   - Post-hoc tests
   - Effect sizes

### 📚 Best Practices for Reporting

**Always include:**
1. ✅ Descriptive statistics (M, SD, n)
2. ✅ Test statistic and degrees of freedom
3. ✅ Exact p-value (not just p<0.05)
4. ✅ Effect size with interpretation
5. ✅ Confidence intervals when possible

**Example of good reporting:**
> "The treatment group (M = 76.2, SD = 9.8) scored significantly higher than the control group (M = 70.1, SD = 10.3), t(78) = 2.71, p = .008, d = 0.61, 95% CI for the mean difference [1.6, 10.6]. This represents a medium-to-large effect."

### 🚀 Next Steps

1. **Practice**: Complete the exercises above
2. **Apply**: Use these techniques in your own data analysis
3. **Read**: Study statistical reporting in your field's journals
4. **Prepare** for Module 02: Causal Inference (understanding WHY, not just IF)

### 💡 Remember

> "The goal is not just to find significance, but to understand the magnitude and meaning of effects."

Statistical significance + Large effect size + Adequate power = Convincing evidence!

---

## Additional Resources

### Software
- **G*Power**: Free power analysis software
- **Pingouin**: Python statistical package
- **statsmodels**: Comprehensive Python stats

### Reading
- "Statistical Power Analysis" by Cohen (1988)
- "Statistics Done Wrong" by Reinhart
- "The Essential Guide to Effect Sizes" by Ellis (2010)

### Online Resources
- [Statistics Hell](https://statisticshell.com/)
- [Cross Validated](https://stats.stackexchange.com/)
- [Stat 545](https://stat545.com/)

---

**Next Module:** [02_causal_inference_fundamentals.ipynb](02_causal_inference_fundamentals.ipynb) - Learn to distinguish correlation from causation and understand confounding!

---

*Last updated: 2024*