# Module 03: Research Design - Experimental vs Observational

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 60 minutes

**Prerequisites**: [Module 00: Introduction to Research Methodology](00_introduction_research_methodology.ipynb), [Module 02: Research Foundations and Paradigms](02_research_foundations_paradigms.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Compare experimental vs observational study designs and explain their trade-offs
2. Apply core principles of experimental design: randomization, replication, and control
3. Design factorial experiments and interpret interaction effects
4. Understand and navigate the hierarchy of evidence
5. Recognize when observational studies are necessary and how to strengthen causal inference
6. Calculate statistical power and determine necessary sample sizes

## Setup

Let's import the libraries we'll use throughout this notebook.

In [None]:
# Standard data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# For experimental design and power analysis
from statsmodels.stats.power import FTestPower, tt_solve_power
from itertools import product

# Configuration for better visualizations
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")

## 1. The Fundamental Research Design Choice

The most important decision in planning research is choosing between:

1. **Experimental Design** - The researcher manipulates variables
2. **Observational Design** - The researcher observes naturally occurring variation

This choice determines:
- What causal claims you can make
- How many resources you need
- What ethical constraints apply
- How strong your evidence will be

### Experimental Design

**Key characteristic**: The researcher *controls* and *manipulates* the independent variable(s)

**What this enables**:
- Making causal claims with confidence
- Isolating specific effects
- Controlling confounding variables
- Replicating conditions precisely

**What it requires**:
- Ability to assign participants/units to conditions
- Ethical approval (sometimes)
- Controlled environment

**Example**: Testing whether a new website design improves user engagement by randomly assigning visitors to current (control) or new (treatment) design

### Observational Design

**Key characteristic**: The researcher *observes* naturally occurring variation without manipulation

**When it's necessary**:
- Randomization is unethical (e.g., smoking effects)
- Long time horizons (decades of follow-up)
- Rare outcomes (waiting for events to occur naturally)
- Studying existing policies or phenomena

**The challenge**: Causal inference from observational data requires careful statistical work

**Example**: Studying the effect of smoking on health by comparing smokers vs non-smokers in existing data

### Key Distinction: Internal vs External Validity

| Validity Type | Meaning | Experiments | Observational |
|---|---|---|---|
| **Internal** | Can we confidently claim causation? | Typically HIGH | Typically LOW |
| **External** | Do results generalize to real-world settings? | May be LIMITED (lab setting) | Often HIGHER (natural conditions) |

## 2. Core Principles of Experimental Design

Rigorous experimental design rests on three fundamental principles:

### Principle 1: Randomization

**Purpose**: Eliminate systematic bias in assigning units to treatment conditions

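**How it works**: Chance alone, not the researcher or the participant, decides which treatment each unit receives; in the simplest case every unit has the same probability of receiving each treatment

**Why it matters**:
- Balances confounding variables, measured and unmeasured, across groups on average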
- Creates comparable groups except for the treatment
- Enables statistical inference

**Different randomization approaches**:
- **Simple random assignment**: Each unit independently assigned with fixed probability
- **Stratified randomization**: Randomize within subgroups, then combine
- **Block randomization**: Ensure equal group sizes in blocks
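
The three approaches can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a production randomization tool: the function names (`simple_assignment`, `block_assignment`, `stratified_assignment`), the two-arm 50/50 setup, and the "novice"/"expert" strata are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_assignment(n_units, p_treat=0.5):
    """Each unit is independently assigned to treatment with probability p_treat."""
    return rng.random(n_units) < p_treat

def block_assignment(n_units, block_size=4):
    """Permute a balanced set of labels within each consecutive block of units."""
    assert block_size % 2 == 0 and n_units % block_size == 0
    labels = np.array([True] * (block_size // 2) + [False] * (block_size // 2))
    blocks = [rng.permutation(labels) for _ in range(n_units // block_size)]
    return np.concatenate(blocks)

def stratified_assignment(strata):
    """Randomize separately inside each stratum, then combine the assignments."""
    strata = np.asarray(strata)
    treated = np.zeros(len(strata), dtype=bool)
    for level in np.unique(strata):
        idx = np.flatnonzero(strata == level)
        chosen = rng.choice(idx, size=len(idx) // 2, replace=False)
        treated[chosen] = True
    return treated

# Hypothetical sample: 10 novices followed by 10 experts
strata = np.array(["novice"] * 10 + ["expert"] * 10)
simple = simple_assignment(20)
blocked = block_assignment(20)
stratified = stratified_assignment(strata)

print("simple treated:", simple.sum())      # varies from run to run
print("blocked treated:", blocked.sum())    # always half of the units
print("stratified treated per stratum:",
      stratified[:10].sum(), stratified[10:].sum())  # always half of each stratum
```

Simple assignment can produce unequal groups by chance, whereas the block and stratified versions guarantee balance within each block or stratum.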

### Principle 2: Replication

**Purpose**: Observe the effect across multiple units to reduce noise

**Two types**:
1. **Within-experiment replication**: Multiple observations per condition
2. **Across-experiment replication**: Repeating the entire experiment

**Why it matters**:
- Larger sample sizes → more precise estimates
- Allows assessment of statistical significance
- Reduces impact of random variation
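
The link between sample size and precision can be made concrete with the standard error of a difference between two group means, which for equal group sizes and a common standard deviation is `sd * sqrt(2 / n)`. The sketch below assumes a score standard deviation of 10, a value chosen only for illustration; `se_mean_difference` is a helper name invented here.

```python
import math

def se_mean_difference(sd, n_per_group):
    """Standard error of the difference between two independent group means
    (equal group sizes, common standard deviation)."""
    return sd * math.sqrt(2 / n_per_group)

sd = 10  # assumed standard deviation of test scores (illustrative)
for n in [10, 30, 100, 300]:
    se = se_mean_difference(sd, n)
    print(f"n = {n:>3} per group -> SE = {se:.2f}, 95% CI half-width = {1.96 * se:.2f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of units per group only halves the uncertainty.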

### Principle 3: Control

**Purpose**: Hold constant variables that might affect outcomes

**Methods**:
1. **Holding variables constant**: Use identical conditions for all units
2. **Matching**: Pair similar units, give different treatments
3. **Blocking**: Group similar units, randomize treatments within groups
4. **Statistical adjustment**: Use covariates in analysis

**Why it matters**: Reduces noise, makes effects clearer, increases statistical power

### Demonstration: Impact of These Principles

Let's simulate a simple experiment and show how these principles matter:

In [None]:
# Simulate a simple learning experiment
# Question: Does using active recall improve learning compared to passive reading?

# True effect: active recall raises test scores by 15 points on average
true_effect = 15

# Simulate different sample sizes
sample_sizes = [10, 30, 100, 300]
results_summary = []

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, n_per_group in enumerate(sample_sizes):
    # Simulate experiment: compare passive reading (control) vs active recall (treatment)
    # Both groups share the same baseline (65); treatment gets a +15 point boost
    control_scores = np.random.normal(65, 10, n_per_group)
    treatment_scores = np.random.normal(65 + true_effect, 10, n_per_group)
    
    # Statistical test
    t_stat, p_value = stats.ttest_ind(treatment_scores, control_scores)
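    # Cohen's d using the pooled sample standard deviation (ddof=1; equal group sizes)
    pooled_sd = np.sqrt((np.var(control_scores, ddof=1) + np.var(treatment_scores, ddof=1)) / 2)
    effect_size = (treatment_scores.mean() - control_scores.mean()) / pooled_sd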
    
    # Store results
    results_summary.append({
        'Sample Size': n_per_group,
        'Control Mean': control_scores.mean(),
        'Treatment Mean': treatment_scores.mean(),
        'Observed Effect': treatment_scores.mean() - control_scores.mean(),
        'P-value': p_value,
        'Significant': p_value < 0.05,
        'Effect Size (Cohen\'s d)': effect_size
    })
    
    # Visualize
    ax = axes[idx]
    
    # Create violin plots
    parts = ax.violinplot([control_scores, treatment_scores], positions=[1, 2], showmeans=True)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(['Control\n(Passive Reading)', 'Treatment\n(Active Recall)'])
    ax.set_ylabel('Test Score', fontsize=11)
    ax.set_title(f'Sample Size: n={n_per_group} per group\np-value: {p_value:.4f} {"✓ Significant" if p_value < 0.05 else "✗ Not significant"}',
                 fontsize=11, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    ax.set_ylim(30, 100)

plt.suptitle('Impact of Replication (Sample Size) on Detecting a True Effect',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Display results table
results_df = pd.DataFrame(results_summary)
columns_display = ['Sample Size', 'Control Mean', 'Treatment Mean', 'Observed Effect', 'P-value', 'Significant']
print("\n📊 Effect of Replication (Sample Size) on Statistical Power:")
print("="*90)
print(results_df[columns_display].to_string(index=False))
print("\n💡 Key insight: Larger samples increase statistical power to detect real effects!")

## 3. Experimental Designs: From Simple to Complex

### 3.1 Simple Two-Group Design (A/B Test)

The simplest experimental design:

```
RANDOMIZE → CONTROL GROUP (No intervention)
          → TREATMENT GROUP (Intervention)
          → MEASURE & COMPARE
```

**Example use cases**:
- Website button color (blue vs red) and click-through rate
- Email subject lines and open rates
- Price variations and purchase conversion

**Advantages**:
- Simple to implement and analyze
- Clear interpretation

**Disadvantages**:
- Can only test one variable at a time
- Less efficient for studying multiple factors
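
A two-group test on conversion counts is commonly analyzed as a comparison of two proportions. As a minimal sketch, the function below runs a pooled two-proportion z-test using only the standard library. The name `two_proportion_ztest` and the counts passed to it are hypothetical, made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Hypothetical counts: 100 of 2000 convert in control, 140 of 2000 in treatment
diff, z, p_value = two_proportion_ztest(100, 2000, 140, 2000)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

For a 2×2 table the squared z statistic equals the uncorrected chi-square statistic. `scipy.stats.chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, so its p-values are slightly more conservative than this z-test.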

### Simulation: A/B Test for Website Conversion

Suppose we want to test whether a new checkout button design improves conversion rate.

In [None]:
# A/B Test Simulation: Checkout Button Design

def simulate_ab_test(control_conversion_rate, treatment_effect, sample_size_per_group, iterations=1000):
    """
    Simulate multiple A/B tests to understand variability and power.
    
    Parameters:
    -----------
    control_conversion_rate : float
        Baseline conversion rate (0 to 1)
    treatment_effect : float
        Absolute increase in conversion rate for treatment
    sample_size_per_group : int
        Number of users per group
    iterations : int
        Number of simulated A/B tests
    
    Returns:
    --------
    dict : Summary statistics and power analysis
    """
    treatment_conversion_rate = control_conversion_rate + treatment_effect
    
    p_values = []
    observed_effects = []
    
    for _ in range(iterations):
        # Simulate users and conversions
        control_conversions = np.random.binomial(
            n=sample_size_per_group,
            p=control_conversion_rate
        )
        
        treatment_conversions = np.random.binomial(
            n=sample_size_per_group,
            p=treatment_conversion_rate
        )
        
        # Chi-square test for independence
        contingency_table = np.array([
            [control_conversions, sample_size_per_group - control_conversions],
            [treatment_conversions, sample_size_per_group - treatment_conversions]
        ])
        
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
        p_values.append(p_value)
        
        observed_effect = (treatment_conversions / sample_size_per_group) - (control_conversions / sample_size_per_group)
        observed_effects.append(observed_effect)
    
    # Calculate statistical power
    power = np.mean(np.array(p_values) < 0.05)
    
    return {
        'true_effect': treatment_effect,
        'sample_size_per_group': sample_size_per_group,
        'statistical_power': power,
        'p_values': np.array(p_values),
        'observed_effects': np.array(observed_effects),
        'mean_p_value': np.mean(p_values),
        'mean_observed_effect': np.mean(observed_effects)
    }

# Run simulations with different sample sizes
baseline_conversion = 0.05  # 5% baseline
true_effect = 0.02  # 2 percentage point improvement (relative: 40% improvement)

test_sizes = [100, 500, 2000, 5000]
power_results = []

print("📊 A/B Test Power Analysis (1000 simulations each)")
print("="*80)
print(f"Baseline conversion rate: {baseline_conversion*100:.1f}%")
print(f"Expected treatment effect: +{true_effect*100:.2f} percentage points")
print("\n" + "-"*80)

for sample_size in test_sizes:
    result = simulate_ab_test(baseline_conversion, true_effect, sample_size)
    power_results.append(result)
    
    print(f"\nSample Size: {sample_size} per group (Total: {sample_size*2})")
    print(f"  Statistical Power: {result['statistical_power']:.1%}")
    print(f"  (Power = probability of detecting true effect)")

print("\n" + "="*80)
print("💡 Key insight: Larger samples = higher power to detect real effects!")

### Visualizing A/B Test Power

In [None]:
# Visualize the relationship between sample size and power
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Power curve
ax1 = axes[0]
sample_sizes_detailed = np.arange(50, 5000, 50)
powers = []

for n in sample_sizes_detailed:
    result = simulate_ab_test(baseline_conversion, true_effect, int(n), iterations=500)
    powers.append(result['statistical_power'])

ax1.plot(sample_sizes_detailed, powers, linewidth=3, color='darkblue', label='Power curve')
ax1.axhline(y=0.8, color='red', linestyle='--', linewidth=2, label='Target power = 80%')
ax1.fill_between(sample_sizes_detailed, 0.8, 1, alpha=0.2, color='green', label='Acceptable power')
ax1.set_xlabel('Sample Size per Group', fontsize=12)
ax1.set_ylabel('Statistical Power', fontsize=12)
ax1.set_title('A/B Test: Sample Size vs Statistical Power', fontsize=13, fontweight='bold')
ax1.set_ylim(0, 1)
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Plot 2: Distribution of observed effects
ax2 = axes[1]
colors = ['lightcoral', 'gold', 'lightgreen', 'lightblue']

for idx, result in enumerate(power_results):
    ax2.hist(result['observed_effects'], bins=30, alpha=0.5, label=f"n={result['sample_size_per_group']}",
             color=colors[idx], edgecolor='black')

ax2.axvline(x=true_effect, color='red', linestyle='--', linewidth=2, label=f'True effect (+{true_effect*100:.2f} pp)')
ax2.set_xlabel('Observed Effect (difference in conversion rate)', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title('Distribution of Observed Effects Across Simulations', fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📈 Key findings from A/B test simulation:")
print("   - Larger samples: narrower distribution (more consistent results)")
print("   - Larger samples: higher power to detect true effect")
print("   - For 80% power, this scenario needs roughly 2,200 users per group (read the exact value off the power curve)")

### 3.2 Factorial Designs (Testing Multiple Factors)

**When to use**: Testing multiple factors simultaneously and their interactions

**Structure**: All combinations of factor levels

**Example**: 2×2 factorial design
```
Factor A: Button Color (Blue vs Red)
Factor B: Button Text ("Buy Now" vs "Add to Cart")

Results in 4 experimental conditions:
1. Blue button + "Buy Now"
2. Blue button + "Add to Cart"
3. Red button + "Buy Now"
4. Red button + "Add to Cart"
```

**Efficiency**: Testing 2 factors in one study instead of 2 separate studies
- Main effects: Effect of Factor A alone, Effect of Factor B alone
- Interaction effect: Does the effect of A depend on B? (e.g., red works better with "Buy Now" but blue works better with "Add to Cart")

**Advantages**:
- More efficient than one-factor-at-a-time
- Can detect interaction effects
- More realistic (factors usually interact in practice)

**Disadvantages**:
- More complex to analyze
- Larger sample sizes needed
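
In a 2×2 design, main effects and the interaction are simple contrasts of the four cell means. The sketch below uses the button example with made-up conversion rates (all numbers are hypothetical); the interaction is computed as the difference between the two simple effects of color.

```python
# Hypothetical cell means (conversion rate, %) for the button example
cell_means = {
    ("Blue", "Buy Now"): 4.0,
    ("Blue", "Add to Cart"): 5.0,
    ("Red", "Buy Now"): 6.0,
    ("Red", "Add to Cart"): 4.5,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Main effect of color: Red minus Blue, averaged over button text
main_color = (mean(v for (c, _), v in cell_means.items() if c == "Red")
              - mean(v for (c, _), v in cell_means.items() if c == "Blue"))

# Main effect of text: "Buy Now" minus "Add to Cart", averaged over color
main_text = (mean(v for (_, t), v in cell_means.items() if t == "Buy Now")
             - mean(v for (_, t), v in cell_means.items() if t == "Add to Cart"))

# Interaction: does the Red-vs-Blue effect change with the button text?
color_effect_buy = cell_means[("Red", "Buy Now")] - cell_means[("Blue", "Buy Now")]
color_effect_cart = cell_means[("Red", "Add to Cart")] - cell_means[("Blue", "Add to Cart")]
interaction = color_effect_buy - color_effect_cart

print(f"Main effect of color: {main_color:+.2f}")
print(f"Main effect of text:  {main_text:+.2f}")
print(f"Color effect with 'Buy Now':     {color_effect_buy:+.2f}")
print(f"Color effect with 'Add to Cart': {color_effect_cart:+.2f}")
print(f"Interaction (difference of differences): {interaction:+.2f}")
```

In these invented numbers red beats blue with "Buy Now" but loses with "Add to Cart", so the modest main effect of color hides a sizable interaction.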

### Simulation: 2×2 Factorial Design

Let's design an experiment testing two factors in online learning:

In [None]:
# 2x2 Factorial Design: Online Learning Study
# Factor A: Practice type (Spaced vs Massed)
# Factor B: Feedback type (Immediate vs Delayed)

# Generate simulated data
np.random.seed(42)

# True effects (test scores out of 100)
baseline = 65
effect_spacing = 8          # Spaced practice advantage
effect_feedback = 6         # Immediate feedback advantage
interaction_effect = -4     # Interaction: spaced + immediate scores 4 points below the sum of the two separate advantages

n_per_condition = 50  # Sample size per condition
noise_sd = 8

# Create all four conditions
conditions = {
    'Massed + Delayed': baseline,
    'Massed + Immediate': baseline + effect_feedback,
    'Spaced + Delayed': baseline + effect_spacing,
    'Spaced + Immediate': baseline + effect_spacing + effect_feedback + interaction_effect
}

# Simulate data
factorial_data = []
for condition, mean_score in conditions.items():
    scores = np.random.normal(mean_score, noise_sd, n_per_condition)
    
    # Parse condition into factors
    spacing, feedback = condition.split(' + ')
    
    for score in scores:
        factorial_data.append({
            'Practice_Type': spacing,
            'Feedback_Type': feedback,
            'Test_Score': score,
            'Condition': condition
        })

factorial_df = pd.DataFrame(factorial_data)

# Analyze results
print("\n📊 2×2 Factorial Design Results: Learning Study")
print("="*70)
print("\nMean Test Scores by Condition:")
print("-"*70)

summary_stats = factorial_df.groupby('Condition')['Test_Score'].agg(['mean', 'std', 'count'])
print(summary_stats.round(2))

# Calculate main effects
print("\n\nMain Effects Analysis:")
print("-"*70)

spaced_mean = factorial_df[factorial_df['Practice_Type'] == 'Spaced']['Test_Score'].mean()
massed_mean = factorial_df[factorial_df['Practice_Type'] == 'Massed']['Test_Score'].mean()
print(f"Effect of Practice Type:")
print(f"  Spaced:  {spaced_mean:.2f}")
print(f"  Massed:  {massed_mean:.2f}")
print(f"  Difference: {spaced_mean - massed_mean:.2f} points (Spaced is better)")

immediate_mean = factorial_df[factorial_df['Feedback_Type'] == 'Immediate']['Test_Score'].mean()
delayed_mean = factorial_df[factorial_df['Feedback_Type'] == 'Delayed']['Test_Score'].mean()
print(f"\nEffect of Feedback Type:")
print(f"  Immediate: {immediate_mean:.2f}")
print(f"  Delayed:   {delayed_mean:.2f}")
print(f"  Difference: {immediate_mean - delayed_mean:.2f} points (Immediate is better)")

print("\n\nInteraction Effect:")
print("-"*70)
print("\nDoes the benefit of spaced practice depend on feedback type?")

for feedback_type in ['Immediate', 'Delayed']:
    subset = factorial_df[factorial_df['Feedback_Type'] == feedback_type]
    spaced_effect = subset[subset['Practice_Type'] == 'Spaced']['Test_Score'].mean() - \
                    subset[subset['Practice_Type'] == 'Massed']['Test_Score'].mean()
    print(f"\n  With {feedback_type} feedback:")
    print(f"    Spaced advantage: {spaced_effect:.2f} points")

### Visualizing the Factorial Design

In [None]:
# Visualize the 2x2 factorial design
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Box plot showing all four conditions
ax1 = axes[0]
factorial_df_sorted = factorial_df.sort_values('Condition')
sns.boxplot(data=factorial_df_sorted, x='Practice_Type', y='Test_Score', 
            hue='Feedback_Type', ax=ax1, palette='Set2')
ax1.set_xlabel('Practice Type', fontsize=12)
ax1.set_ylabel('Test Score', fontsize=12)
ax1.set_title('2×2 Factorial Design: Main Effects\nand Interaction', fontsize=13, fontweight='bold')
ax1.legend(title='Feedback Type', fontsize=11)
ax1.grid(axis='y', alpha=0.3)

# Plot 2: Interaction plot (line plot)
ax2 = axes[1]

# Calculate means for each combination
for practice in ['Massed', 'Spaced']:
    means = []
    feedback_types = ['Delayed', 'Immediate']
    
    for feedback in feedback_types:
        subset = factorial_df[(factorial_df['Practice_Type'] == practice) & 
                             (factorial_df['Feedback_Type'] == feedback)]
        means.append(subset['Test_Score'].mean())
    
    ax2.plot(feedback_types, means, 'o-', linewidth=2.5, markersize=10, 
             label=f'{practice}', markerfacecolor='white', markeredgewidth=2)

ax2.set_xlabel('Feedback Type', fontsize=12)
ax2.set_ylabel('Mean Test Score', fontsize=12)
ax2.set_title('Interaction Plot\n(Non-parallel lines indicate interaction)', 
              fontsize=13, fontweight='bold')
ax2.legend(title='Practice Type', fontsize=11)
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation of interaction plot:")
print("   - Non-parallel lines indicate an interaction effect")
print("   - Spaced practice helps more with Delayed feedback")
print("   - The benefit differs depending on feedback timing")

### 3.3 Blocking Design (Controlling for Known Confounders)

**When to use**: When you know that some variable will affect outcomes but is not your focus

**How it works**:
1. Divide units into blocks based on the confounding variable
2. Randomize treatments within each block
3. Analyze results accounting for blocks

**Example**: Testing a new teaching method across schools
- Schools vary in quality (confounding variable)
- Solution: Block by school, randomize method within each school
- Result: Compare treatment vs control in each school, then combine estimates

**Advantages**:
- Reduces unexplained variation
- Increases precision of estimates
- Improves statistical power
- Controls known confounders elegantly
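
Step 3, analyzing with the blocks taken into account, is where the precision gain shows up. The sketch below compares the standard error of the treatment effect when school is ignored against the standard error when the effect is estimated within each school and then averaged. The school baselines, effect size, and sample sizes are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: three schools with different baseline scores
baselines = {"A": 45.0, "B": 55.0, "C": 75.0}
true_effect, sd, n = 5.0, 5.0, 25  # n students per condition per school

control = {s: rng.normal(b, sd, n) for s, b in baselines.items()}
treated = {s: rng.normal(b + true_effect, sd, n) for s, b in baselines.items()}

# Unblocked analysis: pool all students and ignore school
all_c = np.concatenate(list(control.values()))
all_t = np.concatenate(list(treated.values()))
unblocked_est = all_t.mean() - all_c.mean()
unblocked_se = np.sqrt(all_t.var(ddof=1) / len(all_t) + all_c.var(ddof=1) / len(all_c))

# Blocked analysis: estimate the effect within each school, then average
# (a plain average is appropriate here because all blocks have equal size)
block_est = [treated[s].mean() - control[s].mean() for s in baselines]
block_var = [treated[s].var(ddof=1) / n + control[s].var(ddof=1) / n for s in baselines]
blocked_est = np.mean(block_est)
blocked_se = np.sqrt(np.sum(block_var)) / len(baselines)

print(f"Unblocked: estimate = {unblocked_est:.2f}, SE = {unblocked_se:.2f}")
print(f"Blocked:   estimate = {blocked_est:.2f}, SE = {blocked_se:.2f}")
```

With balanced blocks the two point estimates coincide; only the standard error changes, because between-school variation no longer counts as noise.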

In [None]:
# Blocking Design Example: Testing new teaching method across schools

# Scenario: Schools differ in quality
schools = ['School A (Low)', 'School B (Medium)', 'School C (High)']
school_effects = [15, 55, 75]  # Baseline test scores
treatment_effect = 5  # New method adds 5 points on average

# Create blocked design
blocked_data = []
n_per_condition_per_block = 25  # 25 students per condition per school

for school_idx, school_name in enumerate(schools):
    baseline_score = school_effects[school_idx]
    
    # Control group in this school
    control_scores = np.random.normal(baseline_score, 5, n_per_condition_per_block)
    
    # Treatment group in this school
    treatment_scores = np.random.normal(baseline_score + treatment_effect, 5, n_per_condition_per_block)
    
    for score in control_scores:
        blocked_data.append({
            'School': school_name,
            'Condition': 'Control (Traditional)',
            'Test_Score': score
        })
    
    for score in treatment_scores:
        blocked_data.append({
            'School': school_name,
            'Condition': 'Treatment (New Method)',
            'Test_Score': score
        })

blocked_df = pd.DataFrame(blocked_data)

print("\n📊 Blocking Design: Effect of School Quality")
print("="*70)
print("\nResults by School and Condition:")
print("-"*70)

results_by_school = blocked_df.groupby(['School', 'Condition'])['Test_Score'].agg(['mean', 'std', 'count'])
print(results_by_school.round(2))

print("\n\nTreatment Effect within Each Block (School):")
print("-"*70)

block_effects = []
for school in schools:
    school_data = blocked_df[blocked_df['School'] == school]
    control_mean = school_data[school_data['Condition'] == 'Control (Traditional)']['Test_Score'].mean()
    treatment_mean = school_data[school_data['Condition'] == 'Treatment (New Method)']['Test_Score'].mean()
    effect = treatment_mean - control_mean
    block_effects.append(effect)
    print(f"\n{school}:")
    print(f"  Control mean:    {control_mean:.2f}")
    print(f"  Treatment mean:  {treatment_mean:.2f}")
    print(f"  Effect:          {effect:.2f} points")

print(f"\n\nOverall Average Treatment Effect: {np.mean(block_effects):.2f} points")
print("(Combining effects across all schools)")

print("\n\n💡 Key advantage of blocking:")
print("   - School quality controlled for")
print("   - Cleaner estimate of method effect")
print("   - Each school's data used to estimate effect")
print("   - Better precision than ignoring school differences")

### Visualizing Blocking Effect

In [None]:
# Compare blocked vs unblocked analysis
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Unblocked view (ignoring school)
ax1 = axes[0]
data_unblocked = blocked_df.groupby('Condition')['Test_Score'].apply(list)
colors_unblocked = ['lightcoral', 'lightgreen']
for idx, condition in enumerate(['Control (Traditional)', 'Treatment (New Method)']):
    scores = blocked_df[blocked_df['Condition'] == condition]['Test_Score'].values
    ax1.scatter([idx]*len(scores), scores, alpha=0.3, s=50, color=colors_unblocked[idx])
    ax1.plot([idx, idx], [scores.mean() - 1.96*scores.std()/np.sqrt(len(scores)),
                           scores.mean() + 1.96*scores.std()/np.sqrt(len(scores))],
             'k-', linewidth=3)
    ax1.scatter(idx, scores.mean(), s=200, color=colors_unblocked[idx], 
               edgecolor='black', linewidth=2, zorder=5)

ax1.set_xticks([0, 1])
ax1.set_xticklabels(['Control', 'Treatment'])
ax1.set_ylabel('Test Score', fontsize=12)
ax1.set_title('Unblocked Analysis\n(Ignoring school differences)', fontsize=13, fontweight='bold')
ax1.set_ylim(0, 100)
ax1.grid(axis='y', alpha=0.3)

# Plot 2: Blocked view (accounting for school)
ax2 = axes[1]
for school_idx, school in enumerate(schools):
    for cond_idx, condition in enumerate(['Control (Traditional)', 'Treatment (New Method)']):
        school_cond_data = blocked_df[(blocked_df['School'] == school) & 
                                      (blocked_df['Condition'] == condition)]['Test_Score']
        x_pos = school_idx + (cond_idx - 0.5) * 0.3
        color = ['lightcoral', 'lightgreen'][cond_idx]
        ax2.scatter([x_pos]*len(school_cond_data), school_cond_data, alpha=0.3, s=50, color=color)
        ax2.scatter(x_pos, school_cond_data.mean(), s=100, color=color, 
                   edgecolor='black', linewidth=2, zorder=5)

# Connect control to treatment within each school
for school_idx, school in enumerate(schools):
    control_mean = blocked_df[(blocked_df['School'] == school) & 
                             (blocked_df['Condition'] == 'Control (Traditional)')]['Test_Score'].mean()
    treatment_mean = blocked_df[(blocked_df['School'] == school) & 
                               (blocked_df['Condition'] == 'Treatment (New Method)')]['Test_Score'].mean()
    ax2.plot([school_idx - 0.15, school_idx + 0.15], [control_mean, treatment_mean], 
            'k--', linewidth=2, alpha=0.7)

ax2.set_xticks([0, 1, 2])
ax2.set_xticklabels(schools)
ax2.set_ylabel('Test Score', fontsize=12)
ax2.set_title('Blocked Analysis\n(Accounting for school differences)', fontsize=13, fontweight='bold')
ax2.set_ylim(0, 100)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📈 Visualization shows:")
print("   - Left: Overall effect appears variable (high noise)")
print("   - Right: Effect is consistent within each school (less noise)")
print("   - Blocking reduces noise and improves clarity")

## 4. Observational Studies: When Randomization Isn't Possible

### When Experiments Aren't Feasible

Sometimes randomization is **impossible, impractical, or unethical**:

1. **Unethical**: Can't randomly assign people to smoke
2. **Impractical**: Can't wait decades for long-term outcomes
3. **Rare events**: Waiting for outcome to occur naturally
4. **Historical**: Can only study what has already happened
5. **Policy**: Testing laws that already exist

### The Fundamental Problem: Causality Without Randomization

In observational studies, groups differ on many dimensions:

```
Example: Smoking and health

Smokers vs Non-smokers differ in:
- Smoking ✓ (treatment)
- Age (confound)
- Socioeconomic status (confound)
- Diet (confound)
- Exercise (confound)
- Genetics (confound)
- And many unmeasured factors...

Observed difference = TRUE EFFECT + CONFOUNDING BIAS
```
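
The decomposition above can be reproduced in a small simulation. This is a sketch under strong simplifying assumptions: a single measured confounder (age), a linear outcome model, and made-up coefficients. The naive group comparison absorbs the confounding bias, while a regression that includes the confounder lands close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
true_effect = -5.0  # assumed effect of smoking on a health score (illustrative)

age = rng.normal(50, 10, n)
# Toy model: older people are more likely to smoke
p_smoke = 1 / (1 + np.exp(-(age - 50) / 10))
smoker = rng.random(n) < p_smoke
# Health declines with age and with smoking
health = 100 - 0.5 * age + true_effect * smoker + rng.normal(0, 5, n)

# Naive comparison: smokers vs non-smokers, ignoring age
naive = health[smoker].mean() - health[~smoker].mean()

# Regression adjustment: health ~ intercept + smoker + age
X = np.column_stack([np.ones(n), smoker.astype(float), age])
coef, *_ = np.linalg.lstsq(X, health, rcond=None)
adjusted = coef[1]

print(f"True effect:       {true_effect:.2f}")
print(f"Naive difference:  {naive:.2f}")
print(f"Adjusted estimate: {adjusted:.2f}")
print(f"Confounding bias in naive estimate: {naive - true_effect:.2f}")
```

The adjustment works here only because the confounder was measured and the model is correctly specified; an unmeasured confounder would leave the bias in place.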

### Strategies to Strengthen Causal Inference

**1. Matching**
- Pair treated and untreated units that are similar on observed confounders
- Compare outcomes within pairs
- Limitation: Can't account for unmeasured confounders

**2. Regression Adjustment**
- Include confounding variables as covariates in regression
- Estimates treatment effect controlling for confounders
- Limitation: Linear relationships may not fit; unmeasured confounders ignored

**3. Instrumental Variables**
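- Find a variable (the instrument) that affects treatment, influences the outcome only through treatment, and is unrelated to the confounders
- Use the variation in treatment induced by the instrument to estimate the causal effect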
- Example: Proximity to college used as instrument for college attendance

**4. Regression Discontinuity**
- Applies when treatment assignment depends on a cutoff (e.g., a passing score)
- Compare units just above and just below the cutoff, which should be similar in every other respect
- Example: Do students who barely passed a test have different later outcomes than those who barely failed?

**5. Difference-in-Differences**
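- Compare the change over time in a group affected by a policy with the change in an unaffected group
- Assumes both groups would have followed parallel trends without the policy; diverging pre-policy trends suggest confounding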
- Example: Compare income growth in treated vs control states before/after policy
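
The difference-in-differences estimate is plain arithmetic on four group means. The income figures below are invented for illustration.

```python
# Hypothetical mean incomes (in $1,000s) before and after a policy change
means = {
    ("treated", "before"): 40.0,
    ("treated", "after"): 46.0,
    ("control", "before"): 38.0,
    ("control", "after"): 41.0,
}

change_treated = means[("treated", "after")] - means[("treated", "before")]
change_control = means[("control", "after")] - means[("control", "before")]

# The control group's change stands in for what would have happened to the
# treated group without the policy (this is the parallel-trends assumption)
did_estimate = change_treated - change_control

print(f"Change in treated states: {change_treated:+.1f}")
print(f"Change in control states: {change_control:+.1f}")
print(f"Difference-in-differences estimate: {did_estimate:+.1f}")
```

Subtracting the control group's change removes any shift that affected both groups equally, such as a nationwide economic trend.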

### Limitations of Observational Studies

Even with these methods, observational studies cannot fully address:
- **Unmeasured confounding**: Variables you didn't measure
- **Selection bias**: How people selected into treatment
- **Reverse causality**: Does X cause Y or does Y cause X?

This is why experiments, when feasible, provide stronger evidence.

## 5. Hierarchy of Evidence

Different study designs provide different levels of evidence for causal claims.

### The Evidence Hierarchy (Strongest to Weakest)

```
    ║  LEVEL 1 (STRONGEST)
    ║  Systematic Reviews & Meta-Analyses
    ║  (Combining results from multiple RCTs)
    ║
    ║  LEVEL 2
    ║  Randomized Controlled Trials (RCTs)
    ║  (Gold standard for individual studies)
    ║
    ║  LEVEL 3
    ║  Cohort Studies
    ║  (Follow groups over time; stronger observational design)
    ║
    ║  LEVEL 4
    ║  Case-Control Studies
    ║  (Retrospective; comparing those with vs without outcome)
    ║
    ║  LEVEL 5
    ║  Case Series / Case Reports
    ║  (Descriptive; individual cases with no comparison group)
    ║
    ║  LEVEL 6 (WEAKEST)
    ║  Expert Opinion, Anecdotes
    ║  (Subjective; prone to bias)
    ▼
```

### Understanding Each Level

| Level | Design | Example | Strength | Limitation |
|-------|--------|---------|----------|------------|
| **Systematic Review** | Combines RCTs | "Meta-analysis of 50 depression trials" | Strongest evidence | Time-consuming |
| **RCT** | Randomized experiment | "Patients randomly assigned to drug or placebo" | Gold standard | Can't always do |
| **Cohort** | Follow exposed/unexposed forward | "Track smokers vs non-smokers for 10 years" | Can measure incidence | Confounding possible |
| **Case-Control** | Compare cases/controls backward | "Compare lung cancer patients to controls, ask about smoking" | Efficient for rare diseases | Recall bias |
| **Case Series** | Describe cases | "Here are 5 patients with unusual symptoms" | Shows what's possible | Anecdotal |
| **Opinion** | Expert judgment | "I think this works based on experience" | Generates hypotheses | Highly subjective |

### Why the Hierarchy Matters

**Different evidence levels for different questions**:
- "Does this treatment cause recovery?" → Needs RCT
- "How common is this condition?" → Cohort study sufficient
- "What are side effects?" → Case reports valuable
- "What's the best current understanding?" → Systematic review

**Real-world example**: COVID-19 vaccine safety
1. Case reports → Noticed blood clotting in rare cases
2. Case series → Confirmed pattern across multiple cases
3. Cohort study → Estimated frequency in general population
4. RCT analysis → Compared rates in vaccinated vs unvaccinated
5. Meta-analysis → Combined evidence across countries

In [None]:
# Create visualization of evidence hierarchy
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Left plot: Pyramid showing study types and evidence strength
levels = ['Systematic Review\n& Meta-Analysis', 'RCTs', 'Cohort Studies', 
          'Case-Control Studies', 'Case Series / Reports', 'Expert Opinion']
heights = [1, 1.5, 2, 2, 2.5, 3]
colors_evidence = ['#2ecc71', '#27ae60', '#f39c12', '#e67e22', '#e74c3c', '#c0392b']

y_position = 0
for level, height, color in zip(levels, heights, colors_evidence):
    # align='edge' stacks each band from y_position onward so bands do not overlap
    ax1.barh(y_position, 10, height=height, align='edge', color=color,
             edgecolor='black', linewidth=2)
    ax1.text(5, y_position + height / 2, level, ha='center', va='center', fontsize=11,
             fontweight='bold', color='white')
    y_position += height

# Inverted y-axis so the strongest evidence sits at the top
ax1.set_ylim(sum(heights), 0)
ax1.set_xlim(0, 10)
ax1.set_xlabel('Evidence Strength →', fontsize=12, fontweight='bold')
ax1.set_title('Hierarchy of Evidence for Causal Claims\n(Green = Strong, Red = Weak)', 
             fontsize=13, fontweight='bold')
ax1.set_yticks([])
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['bottom'].set_visible(False)
ax1.spines['left'].set_visible(False)

# Right plot: Characteristics comparison (illustrative ratings, not measured data)
char_data = {
    'Can infer causation': [95, 80, 40, 30, 20, 10],
    'Free from bias': [90, 85, 50, 40, 30, 15],
    'Generalizable': [80, 70, 75, 60, 50, 40],
    'Quick/feasible': [40, 50, 70, 80, 85, 95]
}

levels_short = ['Sys Rev', 'RCT', 'Cohort', 'Case-Ctrl', 'Series', 'Opinion']
x = np.arange(len(levels_short))
width = 0.2

for idx, (characteristic, values) in enumerate(char_data.items()):
    ax2.bar(x + idx*width, values, width, label=characteristic)

ax2.set_xlabel('Study Type', fontsize=12)
ax2.set_ylabel('Rating (0-100)', fontsize=12)
ax2.set_title('Comparison of Study Characteristics', fontsize=13, fontweight='bold')
ax2.set_xticks(x + width * 1.5)
ax2.set_xticklabels(levels_short, rotation=45, ha='right')
ax2.legend(fontsize=10)
ax2.set_ylim(0, 100)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Trade-offs in Study Design:")
print("="*70)
print("\nSystematic Reviews & RCTs:")
print("  ✓ Strongest causal evidence")
print("  ✗ Time-consuming and expensive")
print("  ✗ May not generalize perfectly")
print("\nCohort Studies:")
print("  ✓ Measure incidence (outcomes develop during study)")
print("  ✓ Faster than RCTs for some questions")
print("  ✗ Cannot definitively prove causation")
print("\nCase-Control Studies:")
print("  ✓ Efficient for rare outcomes")
print("  ✓ Can study historical data")
print("  ✗ Prone to recall bias and confounding")
print("\nCase Series:")
print("  ✓ Identifies unusual patterns")
print("  ✓ Fast to generate hypotheses")
print("  ✗ Cannot determine if outcome is causal effect")

## 6. Statistical Power: Planning Your Study

### What is Statistical Power?

**Power** = Probability of detecting a real effect if it exists

- Power = 0.80 means an 80% chance of finding a significant result (if a true effect exists)
- Power = 0.20 means only a 20% chance, so a real effect is missed 80% of the time (four times the miss rate of a study with 0.80 power)
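
One way to see what these probabilities mean is to simulate many studies and count how often the test comes out significant. The sketch below is a Monte Carlo illustration under stated assumptions: normally distributed outcomes with unit variance, a two-sample t-test, and a hypothetical helper named `simulate_power`.

```python
import numpy as np
from scipy import stats

def simulate_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Estimate power as the share of simulated studies with p < alpha."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study; the true standardized effect is effect_size
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treated = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    _, p_values = stats.ttest_ind(treated, control, axis=1)
    return float(np.mean(p_values < alpha))

power_64 = simulate_power(effect_size=0.5, n_per_group=64)
power_15 = simulate_power(effect_size=0.5, n_per_group=15)

print(f"Simulated power with 64 per group: {power_64:.2f}")
print(f"Simulated power with 15 per group: {power_15:.2f}")
```

With a medium effect, the larger study detects the effect in roughly four out of five simulated runs, while the smaller one misses it most of the time.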

### Why Power Matters

A study with low power wastes resources:
- You spend money and time but fail to detect the effect
- You conclude "no difference" when an effect might exist
- You contribute false negatives to the literature

### Factors Affecting Power

1. **Sample Size**: Larger samples → Higher power
2. **Effect Size**: Larger effects → Higher power
3. **Significance Level (α)**: Standard is 0.05; stricter = lower power
4. **Variability**: More noise → Lower power
5. **Study Design**: Blocking/matching → Higher power
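
The first four factors can be checked numerically with a normal approximation for a two-group comparison. This is a rough sketch, not an exact calculation: `approx_power` is a hypothetical helper, and the mean difference and standard deviation values are made up for illustration. Blocking and matching are not modelled directly; the working assumption is that they help mainly by reducing the residual standard deviation.

```python
import numpy as np
from scipy.stats import norm

def approx_power(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two group means."""
    d = mean_diff / sd                      # standardized effect size
    z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical value
    # The small contribution of the opposite tail is ignored
    return float(norm.cdf(d * np.sqrt(n_per_group / 2) - z_crit))

baseline = approx_power(mean_diff=5.0, sd=10.0, n_per_group=64)
more_n = approx_power(mean_diff=5.0, sd=10.0, n_per_group=128)
bigger_effect = approx_power(mean_diff=8.0, sd=10.0, n_per_group=64)
stricter_alpha = approx_power(mean_diff=5.0, sd=10.0, n_per_group=64, alpha=0.01)
noisier = approx_power(mean_diff=5.0, sd=20.0, n_per_group=64)

print(f"Baseline:        {baseline:.2f}")
print(f"Larger sample:   {more_n:.2f}")
print(f"Larger effect:   {bigger_effect:.2f}")
print(f"Stricter alpha:  {stricter_alpha:.2f}")
print(f"More noise:      {noisier:.2f}")
```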

In [None]:
# Power analysis for common scenarios
from scipy.stats import norm

# Function to estimate sample size needed for desired power
def estimate_sample_size_for_power(effect_size, power=0.8, alpha=0.05, design='independent_ttest'):
    """
    Estimate sample size needed to achieve desired statistical power.
    
    For an independent samples t-test (normal approximation), per group:
    n = 2 * ((z_{1-alpha/2} + z_power) / effect_size) ** 2
    Note: the `design` argument is accepted but not used yet.
    """
    # Critical values
    z_alpha = norm.ppf(1 - alpha/2)  # Two-tailed
    z_power = norm.ppf(power)
    
    # Calculate n per group
    n_per_group = 2 * ((z_alpha + z_power) / effect_size) ** 2
    
    return int(np.ceil(n_per_group))

# Create power analysis table
print("\n📊 Sample Size Requirements for Different Effect Sizes")
print("="*80)
print("\nTo achieve 80% power with α=0.05 (two-tailed test):")
print("-"*80)

effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large
effect_labels = ['Small (0.2)', 'Medium (0.5)', 'Large (0.8)']

sample_size_table = []
for effect_size, label in zip(effect_sizes, effect_labels):
    n_per_group = estimate_sample_size_for_power(effect_size, power=0.80, alpha=0.05)
    total_n = n_per_group * 2
    sample_size_table.append({
        'Effect Size': label,
        'Per Group': n_per_group,
        'Total': total_n
    })
    print(f"\n{label}:")
    print(f"  Sample size per group: {n_per_group}")
    print(f"  Total sample size: {total_n}")

print("\n" + "-"*80)
print("\nKey insight: Smaller effects require larger samples")
# Report the totals computed above instead of hard-coded figures
for row in reversed(sample_size_table):
    print(f"  - {row['Effect Size']}: ~{row['Total']} total")
print("  - Therefore: Research design should target meaningful effects!")
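
The function above relies on a normal approximation, which slightly understates the sample size a t-test needs. As a cross-check, the sketch below solves the same problem with statsmodels' `TTestIndPower.solve_power`, which is based on the noncentral t distribution. The dictionary name `exact_n` is introduced here for illustration only.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
exact_n = {}

for d in (0.2, 0.5, 0.8):
    # Leaving nobs1 unset asks solve_power to solve for the per-group sample size
    n = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05,
                             ratio=1.0, alternative='two-sided')
    exact_n[d] = math.ceil(n)
    print(f"d = {d}: {exact_n[d]} per group, {2 * exact_n[d]} total")
```

The t-based answers come out about one participant per group higher than the normal approximation, a gap that matters little at this scale but is worth knowing when reporting a planned sample size.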

### Power Analysis Visualization

In [None]:
# Visualize relationship between effect size, sample size, and power
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Power vs Sample Size for different effect sizes
ax1 = axes[0]

sample_sizes_range = np.arange(20, 500, 10)
effect_sizes_range = [0.2, 0.5, 0.8]
colors_effect = ['red', 'orange', 'green']

for effect_size, color in zip(effect_sizes_range, colors_effect):
    powers = []
    for n in sample_sizes_range:
        # Calculate power using normal approximation (n is the sample size per group)
        z_alpha = norm.ppf(1 - 0.05/2)
        z_stat = effect_size * np.sqrt(n/2)
        power_val = 1 - norm.cdf(z_alpha - z_stat)
        powers.append(power_val)
    
    label = f'Effect size = {effect_size}'
    ax1.plot(sample_sizes_range, powers, linewidth=2.5, label=label, color=color)

ax1.axhline(y=0.8, color='black', linestyle='--', linewidth=2, label='Target power = 80%')
ax1.set_xlabel('Sample Size (per group)', fontsize=12)
ax1.set_ylabel('Statistical Power', fontsize=12)
ax1.set_title('Power Analysis: Effect Size vs Sample Size', fontsize=13, fontweight='bold')
ax1.set_ylim(0, 1)
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Plot 2: Sample size needed by scenario
ax2 = axes[1]

scenarios = ['Email\n(Small effect)', 'Website\n(Medium effect)', 'Treatment\n(Large effect)']
sample_sizes_by_scenario = [
    estimate_sample_size_for_power(0.2),
    estimate_sample_size_for_power(0.5),
    estimate_sample_size_for_power(0.8)
]

colors_scenarios = ['#3498db', '#e74c3c', '#2ecc71']
bars = ax2.bar(scenarios, [n*2 for n in sample_sizes_by_scenario], color=colors_scenarios, edgecolor='black', linewidth=2)

ax2.set_ylabel('Total Sample Size Needed', fontsize=12)
ax2.set_title('Sample Size Requirements by Research Domain\n(For 80% power, α=0.05)', fontsize=13, fontweight='bold')
ax2.set_ylim(0, 1000)

# Add value labels on bars (bar heights are total sample sizes across both groups)
for bar in bars:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'N = {int(height)} total',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Practical implications:")
print(f"   - Small effects (email variants): Need large samples (~{sample_sizes_by_scenario[0]} per group)")
print(f"   - Medium effects (website changes): Moderate samples (~{sample_sizes_by_scenario[1]} per group)")
print(f"   - Large effects (new treatment): Smaller samples (~{sample_sizes_by_scenario[2]} per group)")

## 7. Exercises

### Exercise 1: Design Choice Decision

For each research question, decide whether to use experimental or observational design and justify your choice:

In [None]:
# Exercise 1: Design choice

research_questions = [
    {
        'Q': 'Does caffeine consumption affect sleep quality?',
        'Your Answer': '???'
    },
    {
        'Q': 'Does vitamin D supplementation prevent COVID-19?',
        'Your Answer': '???'
    },
    {
        'Q': 'What percentage of adults have hypertension?',
        'Your Answer': '???'
    },
    {
        'Q': 'Does this new drug reduce symptoms better than placebo?',
        'Your Answer': '???'
    },
    {
        'Q': 'Do college graduates earn more than high school graduates?',
        'Your Answer': '???'
    }
]

print("\n📋 Exercise 1: Research Design Choices")
print("="*80)
print("\nFor each question, choose:")
print("  A) Experimental (can randomize)")
print("  B) Observational (must observe naturally)")
print("  C) Either (could work both ways)")
print("-"*80)

for i, question_dict in enumerate(research_questions, 1):
    print(f"\n{i}. {question_dict['Q']}")
    print(f"   Your choice: ___")
    print(f"   Reasoning: ___")

### Exercise 2: Factorial Design Planning

You're designing an experiment to optimize an online learning platform.

In [None]:
# Exercise 2: Factorial design planning

print("\n\n📋 Exercise 2: Factorial Design Planning")
print("="*80)
print("""
Scenario: You want to optimize an online learning platform by testing:
- Video presentation (Traditional vs Interactive)
- Quiz timing (After lesson vs During lesson)

You have 1000 students available for the study.

Questions:
1. How many experimental conditions will you create?
   Answer: ___

2. If you split all 1000 students evenly, how many are assigned to each condition?
   Answer: ___

3. What is the main effect of quiz timing?
   Operationalization: ___

4. What would indicate an interaction effect?
   Example: ___

5. Why might interaction effects be important practically?
   Answer: ___
""")

### Exercise 3: Causal Inference Challenge

Interpret the following observational study with confounding in mind:

In [None]:
# Exercise 3: Causal inference challenge

print("\n\n📋 Exercise 3: Observational Study Analysis")
print("="*80)
print("""
Study Finding: "People who own more books have higher incomes."

Interpretation Exercise:

1. What is the observed association?
   Answer: ___

2. What potential confounding variables might explain this relationship?
   (List at least 3)
   Answer: ___

3. Why can't we conclude "buying books makes you richer"?
   Answer: ___

4. What research design would strengthen causal claims?
   Answer: ___

5. Suggest 3 methods to improve causal inference in this observational study:
   Method 1: ___
   Method 2: ___
   Method 3: ___
""")

print("\n\nHints:")
print("  - Consider what types of people buy books")
print("  - Think about reverse causality")
print("  - Consider parental education, wealth, profession")
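
To see how a confounder can create an association like the one in this exercise, consider the simulated illustration below. Everything here is an assumption made for the demonstration: the variable names, the coefficients, and the choice of education as the single confounder. By construction, books have no effect on income in the simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: education drives both book ownership and income
education = rng.normal(size=n)
books = education + rng.normal(size=n)
income = 2.0 * education + rng.normal(size=n)   # no direct effect of books

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept included, then dropped)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[1:]

naive_slope = ols_slopes(income, books)[0]
adjusted_slope = ols_slopes(income, books, education)[0]

print(f"Income on books, unadjusted:          {naive_slope:.2f}")
print(f"Income on books, adjusting education: {adjusted_slope:.2f}")
```

The unadjusted slope is clearly positive even though the true effect is zero, and it shrinks toward zero once the confounder enters the regression. In real data the relevant confounders are rarely all measured, so adjustment reduces but does not eliminate the problem.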

## Summary

### Key Takeaways

✅ **Experimental vs Observational** - Experiments enable causal claims; observational studies are necessary when randomization isn't feasible

✅ **Core Principles** - Randomization, replication, and control are the foundation of experimental design

✅ **Designs Range from Simple to Complex**:
- Two-group A/B tests
- Factorial designs for multiple factors
- Blocked designs to control known confounders

✅ **Observational Studies** - Can be strengthened through matching, regression adjustment, instrumental variables, and regression discontinuity

✅ **Hierarchy of Evidence** - Different designs provide different levels of evidence; systematic reviews are strongest, expert opinion weakest

✅ **Statistical Power** - Essential planning tool; larger samples, larger effects, and better designs increase power

✅ **Trade-offs Exist** - Strong causal evidence vs practical feasibility, internal validity vs external validity

## What's Next?

In **Module 04: Sample Size and Power**, you'll learn:
- Detailed power calculations for different study designs
- How to plan sample sizes before conducting research
- Sensitivity analysis for study planning

## Additional Resources

- **Book**: "Design of Experiments" by Douglas C. Montgomery
- **Book**: "The Book of Why" by Judea Pearl (causal inference)
- **Online**: G*Power software for power analysis (free)
- **Paper**: "Randomized Controlled Trials" review articles
- **Paper**: "Strengthening Causal Inference in Observational Studies" (Rotnitzky et al.)

## Self-Assessment

Before moving to Module 04, ensure you can:

- [ ] Explain the difference between experimental and observational designs
- [ ] Describe the three core principles: randomization, replication, control
- [ ] Design a simple A/B test with appropriate sample size
- [ ] Plan and interpret a factorial experiment
- [ ] Explain what blocking is and when to use it
- [ ] Identify confounding variables in observational studies
- [ ] Describe the hierarchy of evidence and what it means
- [ ] Calculate statistical power for a given sample size
- [ ] Determine sample size needed for desired power
- [ ] Recognize trade-offs between internal and external validity

If you can confidently check all boxes, you're ready for Module 04! 🎉