# Module 05: Sampling Strategies

**Estimated Time**: 45 minutes

## Learning Objectives

By the end of this module, you will be able to:

1. **Distinguish** between probability and non-probability sampling methods
2. **Implement** simple random, stratified, cluster, and systematic sampling
3. **Calculate** required sample sizes for different precision levels
4. **Evaluate** sampling error and bias in different sampling approaches
5. **Select** appropriate sampling methods for different research contexts
6. **Apply** weighting adjustments to correct for sampling bias
7. **Understand** when non-probability sampling is appropriate
8. **Assess** representativeness and generalizability of samples

## Why This Matters

**You can't study everyone, so you sample.**

Sampling determines:
- **Generalizability**: Can findings extend beyond your sample?
- **Precision**: How accurate are your estimates?
- **Cost-effectiveness**: Maximize information per dollar/hour
- **Feasibility**: Some populations are hard to reach

**Bad sampling = bad science**, regardless of how sophisticated your analysis is.

This module equips you to:
- Select sampling methods that balance rigor with practicality
- Calculate appropriate sample sizes
- Identify and mitigate sampling biases
- Make defensible claims about generalizability

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, t as t_dist
import warnings

warnings.filterwarnings("ignore")

# Set style
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")

# Set random seed
np.random.seed(42)

# Create output directory
import os

os.makedirs("outputs/module_05", exist_ok=True)

print("✓ Libraries imported successfully")
print("✓ Output directory created")

## 1. Probability vs. Non-Probability Sampling

### Probability Sampling

**Definition**: Every member of the population has a known, non-zero probability of selection.

**Advantages**:
✓ Allows statistical inference to population  
✓ Sampling error can be calculated  
✓ Unbiased estimates (in expectation)  
✓ Defensible generalization  

**Disadvantages**:
✗ Requires sampling frame (list of population)  
✗ More expensive and time-consuming  
✗ May still have non-response bias  

**Types**:
1. Simple Random Sampling
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
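
Systematic sampling is the one probability method in this list that has no demonstration later in the module. A minimal sketch, assuming an ordered sampling frame and plain Python (the function name `systematic_sample` and the frame of 10,000 IDs are illustrative, not part of the module's code):

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select n items from an ordered frame using a fixed interval and a random start."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and len(frame)")
    rng = random.Random(seed)
    k = N / n  # sampling interval (may be fractional)
    start = rng.uniform(0, k)  # random start within the first interval
    # Take the unit at positions start, start + k, start + 2k, ...
    return [frame[min(int(start + i * k), N - 1)] for i in range(n)]


ids = list(range(1, 10001))  # hypothetical frame of 10,000 IDs
sample = systematic_sample(ids, n=200, seed=42)
print(len(sample), sample[:5])
```

If the frame is sorted in a way that cycles with the same period as the interval, the sample can be badly unrepresentative, so it is worth checking the frame's ordering before relying on this method.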

### Non-Probability Sampling

**Definition**: Selection probabilities unknown; based on convenience, judgment, or quota.

**Advantages**:
✓ Quick and inexpensive  
✓ Useful for exploratory research  
✓ Practical when sampling frame unavailable  

**Disadvantages**:
✗ Cannot calculate sampling error  
✗ Results may not generalize  
✗ Vulnerable to selection bias  

**Types**:
1. Convenience Sampling
2. Purposive/Judgment Sampling
3. Quota Sampling
4. Snowball Sampling

### When to Use Each

| Use Probability Sampling When... | Use Non-Probability Sampling When... |
|----------------------------------|-------------------------------------|
| Making population inferences | Exploratory research |
| Precision estimates needed | Hard-to-reach populations |
| Sampling frame exists | Limited budget/time |
| Generalizing results | Qualitative research |
| Publishing in top journals | Pilot testing |

## 2. Simple Random Sampling (SRS)

**Method**: Every individual has an equal probability of selection, and every possible sample of size $n$ is equally likely.

**Implementation**:
1. Obtain complete sampling frame
2. Assign each member a unique ID
3. Use random number generator to select IDs

**Formula for standard error of mean**:

$$SE = \frac{\sigma}{\sqrt{n}}$$

Where:
- $\sigma$ = population standard deviation
- $n$ = sample size

**Advantages**:
- Simple to understand and implement
- Unbiased
- Foundation for statistical theory

**Disadvantages**:
- May miss important subgroups (by chance)
- Less efficient than stratified sampling

In [None]:
# Demonstrate simple random sampling

# Create a population of 10,000 people
np.random.seed(123)
population_size = 10000

# Population characteristics
population = pd.DataFrame(
    {
        "ID": range(1, population_size + 1),
        "Age": np.random.normal(45, 15, population_size).astype(int),
        "Income": np.random.lognormal(10.5, 0.6, population_size).astype(int),
        "Gender": np.random.choice(["Male", "Female"], population_size, p=[0.49, 0.51]),
        "Region": np.random.choice(
            ["North", "South", "East", "West"], population_size, p=[0.25, 0.25, 0.25, 0.25]
        ),
    }
)

# Population parameters (truth)
pop_mean_income = population["Income"].mean()
pop_sd_income = population["Income"].std()
pop_median_age = population["Age"].median()

print("POPULATION (True Values):")
print(f"Size: {population_size:,}")
print(f"Mean Income: ${pop_mean_income:,.0f}")
print(f"SD Income: ${pop_sd_income:,.0f}")
print(f"Median Age: {pop_median_age:.0f} years")
print(f"Gender distribution: {population['Gender'].value_counts(normalize=True).to_dict()}")

# Take simple random sample
sample_size = 200
srs_sample = population.sample(n=sample_size, random_state=42)

# Sample estimates
sample_mean_income = srs_sample["Income"].mean()
sample_sd_income = srs_sample["Income"].std()
sample_median_age = srs_sample["Age"].median()

print(f"\n{'='*60}")
print(f"SIMPLE RANDOM SAMPLE (n = {sample_size}):")
print(f"Mean Income: ${sample_mean_income:,.0f}")
print(f"SD Income: ${sample_sd_income:,.0f}")
print(f"Median Age: {sample_median_age:.0f} years")

# Calculate sampling error
se_income = sample_sd_income / np.sqrt(sample_size)
margin_error = 1.96 * se_income  # 95% CI

print(f"\nSampling Error (SE): ${se_income:,.0f}")
print(f"95% Margin of Error: ±${margin_error:,.0f}")
print(
    f"\n95% Confidence Interval: ${sample_mean_income - margin_error:,.0f} - ${sample_mean_income + margin_error:,.0f}"
)

# Check if true population mean is in CI
in_ci = (
    (sample_mean_income - margin_error) <= pop_mean_income <= (sample_mean_income + margin_error)
)
print(f"\nDoes CI contain true population mean? {'✓ Yes' if in_ci else '✗ No'}")
print(f"Estimation error: ${abs(sample_mean_income - pop_mean_income):,.0f}")

In [None]:
# Demonstrate sampling distribution
# Take many samples to show that estimates vary

n_samples = 1000
sample_means = []

for i in range(n_samples):
    sample = population.sample(n=sample_size, replace=False)
    sample_means.append(sample["Income"].mean())

sample_means = np.array(sample_means)

# Visualize sampling distribution
fig, ax = plt.subplots(figsize=(12, 6))

ax.hist(
    sample_means,
    bins=50,
    density=True,
    alpha=0.7,
    color="#06A77D",
    edgecolor="black",
    linewidth=0.5,
    label="Sampling Distribution",
)

# True population mean
ax.axvline(
    pop_mean_income,
    color="red",
    linestyle="--",
    linewidth=2.5,
    label=f"True Population Mean (${pop_mean_income:,.0f})",
)

# Theoretical normal curve
x_range = np.linspace(sample_means.min(), sample_means.max(), 100)
theoretical_se = pop_sd_income / np.sqrt(sample_size)
theoretical_dist = norm.pdf(x_range, pop_mean_income, theoretical_se)
ax.plot(x_range, theoretical_dist, "b-", linewidth=2.5, label="Theoretical Normal Distribution")

# Mean of sample means (should equal population mean)
ax.axvline(
    sample_means.mean(),
    color="orange",
    linestyle=":",
    linewidth=2,
    label=f"Mean of Sample Means (${sample_means.mean():,.0f})",
)

ax.set_xlabel("Sample Mean Income ($)", fontsize=12, fontweight="bold")
ax.set_ylabel("Density", fontsize=12, fontweight="bold")
ax.set_title(
    f"Sampling Distribution of Mean Income\n(n={sample_size}, {n_samples} samples)",
    fontsize=14,
    fontweight="bold",
)
ax.legend(loc="upper right", fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("outputs/module_05/sampling_distribution.png", dpi=300, bbox_inches="tight")
plt.show()

print(f"\n📊 Sampling Distribution Results:")
print(f"Mean of {n_samples} sample means: ${sample_means.mean():,.0f}")
print(f"True population mean: ${pop_mean_income:,.0f}")
print(f"Difference (bias): ${sample_means.mean() - pop_mean_income:,.0f}")
print(f"\nStandard error (observed): ${sample_means.std():,.0f}")
print(f"Standard error (theoretical): ${theoretical_se:,.0f}")
print(f"\n💡 With random sampling, sample means are unbiased and approximately normally distributed.")

## 3. Stratified Sampling

**Method**: Divide population into homogeneous subgroups (strata), then sample from each stratum.

**When to use**:
- Population has distinct subgroups
- Want to ensure representation of all groups
- Some strata are more variable than others

### Types

#### 1. Proportionate Stratified Sampling
Sample size from each stratum proportional to stratum size in population.

**Example**: Population is 60% urban, 40% rural  
→ Sample should be 60% urban, 40% rural

#### 2. Disproportionate Stratified Sampling
Oversample small or high-variance strata.

**Example**: Population is 95% majority, 5% minority  
→ Sample 50% majority, 50% minority (then weight in analysis)
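
The "weight in analysis" step can be sketched as follows. This is a minimal illustration under stated assumptions, not a full survey-weighting procedure: the population counts, stratum means, and variable names are all hypothetical, and each respondent's design weight is taken to be $N_h / n_h$, the number of population members that respondent represents.

```python
import random

random.seed(0)

# Hypothetical population: 95% majority (mean 50), 5% minority (mean 70)
N_h = {"majority": 9500, "minority": 500}
true_means = {"majority": 50.0, "minority": 70.0}

# Disproportionate design: 100 respondents from each stratum
n_h = {"majority": 100, "minority": 100}
sample = [
    (group, random.gauss(true_means[group], 10))
    for group in n_h
    for _ in range(n_h[group])
]

# Design weight = number of population members each respondent represents
weights = {group: N_h[group] / n_h[group] for group in n_h}

unweighted = sum(y for _, y in sample) / len(sample)
weighted = sum(weights[g] * y for g, y in sample) / sum(weights[g] for g, _ in sample)

# Mean implied by the generating model above
pop_mean = sum(N_h[g] * true_means[g] for g in N_h) / sum(N_h.values())
print(f"Population mean (by construction): {pop_mean:.1f}")
print(f"Unweighted sample mean: {unweighted:.1f}")
print(f"Weighted sample mean: {weighted:.1f}")
```

The unweighted mean is pulled toward the oversampled minority stratum, while the weighted mean restores each stratum to its population share.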

### Advantages over SRS
- **More precise estimates** (lower standard error)
- **Ensures representation** of all strata
- **Allows separate analysis** by stratum

### Formula for Stratified Sample Size

Proportionate allocation:
$$n_h = n \times \frac{N_h}{N}$$

Where:
- $n_h$ = sample size for stratum h
- $n$ = total sample size
- $N_h$ = population size of stratum h
- $N$ = total population size
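
Because $n \times N_h / N$ is rarely a whole number, the allocation has to be rounded, and simply truncating each $n_h$ can leave the total short of $n$. One common way around this is largest-remainder rounding, sketched below; the function name and the stratum sizes are hypothetical.

```python
import math


def proportionate_allocation(stratum_sizes, n):
    """Allocate n across strata in proportion to N_h, summing to exactly n."""
    N = sum(stratum_sizes.values())
    exact = {h: n * N_h / N for h, N_h in stratum_sizes.items()}
    alloc = {h: math.floor(x) for h, x in exact.items()}
    # Hand out the units lost to rounding down, largest fractional part first
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical stratum sizes for a population of 10,000
sizes = {"North": 2530, "South": 2470, "East": 2515, "West": 2485}
allocation = proportionate_allocation(sizes, n=200)
print(allocation, sum(allocation.values()))
```

With these sizes the exact shares are 50.6, 49.4, 50.3, and 49.7, so truncation alone would allocate only 198 of the 200 units.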

In [None]:
# Demonstrate stratified sampling

# Use the same population, stratify by region
print("POPULATION BY REGION:")
region_counts = population["Region"].value_counts().sort_index()
print(region_counts)
print(f"\nMean income by region:")
region_income = population.groupby("Region")["Income"].mean().sort_index()
print(region_income)

# Proportionate stratified sample
total_sample_size = 200

stratified_sample = pd.DataFrame()

for region in ["North", "South", "East", "West"]:
    # Population proportion
    prop = (population["Region"] == region).sum() / len(population)

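    # Sample size for this stratum (round to nearest: int() alone truncates,
    # which can leave the combined sample short of total_sample_size)
    stratum_n = int(round(total_sample_size * prop))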

    # Sample from stratum
    stratum_sample = population[population["Region"] == region].sample(n=stratum_n, random_state=42)

    stratified_sample = pd.concat([stratified_sample, stratum_sample])

    print(
        f"\n{region}: Population = {(population['Region'] == region).sum()}, "
        f"Sample = {stratum_n} ({stratum_n/total_sample_size*100:.1f}%)"
    )

print(f"\n{'='*60}")
print("COMPARISON: SRS vs. Stratified Sampling\n")

# Compare estimates
srs_estimate = srs_sample["Income"].mean()
stratified_estimate = stratified_sample["Income"].mean()

print(f"True population mean: ${pop_mean_income:,.0f}")
print(f"\nSRS estimate: ${srs_estimate:,.0f}")
print(f"Error: ${abs(srs_estimate - pop_mean_income):,.0f}")

print(f"\nStratified estimate: ${stratified_estimate:,.0f}")
print(f"Error: ${abs(stratified_estimate - pop_mean_income):,.0f}")

# Calculate standard errors
srs_se = srs_sample["Income"].std() / np.sqrt(len(srs_sample))

# Stratified SE: sum over strata of W_h^2 * s_h^2 / n_h, with W_h = N_h / N
stratified_vars = []
for region in ["North", "South", "East", "West"]:
    stratum_data = stratified_sample[stratified_sample["Region"] == region]["Income"]
    weight = (population["Region"] == region).mean()  # population share of stratum
    var_contribution = (weight**2) * (stratum_data.var() / len(stratum_data))
    stratified_vars.append(var_contribution)

stratified_se = np.sqrt(sum(stratified_vars))

print(f"\nSRS Standard Error: ${srs_se:,.0f}")
print(f"Stratified Standard Error: ${stratified_se:,.0f}")
print(f"\n💡 Stratified SE differs from SRS SE by {(stratified_se/srs_se - 1)*100:+.1f}%")
print("   (Stratification helps only when strata differ on the outcome;")
print("    here Region was generated independently of Income, so expect little gain.)")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Panel 1: Region representation
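comparison_data = pd.DataFrame(
    {
        "Region": ["North", "South", "East", "West"],
        # Actual population shares (close to, but not exactly, the 0.25 design value)
        "Population": [
            (population["Region"] == r).mean() for r in ["North", "South", "East", "West"]
        ],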
        "SRS Sample": [
            srs_sample["Region"].value_counts(normalize=True).get(r, 0)
            for r in ["North", "South", "East", "West"]
        ],
        "Stratified Sample": [
            stratified_sample["Region"].value_counts(normalize=True).get(r, 0)
            for r in ["North", "South", "East", "West"]
        ],
    }
)

x = np.arange(len(comparison_data))
width = 0.25

axes[0].bar(
    x - width,
    comparison_data["Population"],
    width,
    label="Population",
    color="#457B9D",
    alpha=0.8,
    edgecolor="black",
)
axes[0].bar(
    x,
    comparison_data["SRS Sample"],
    width,
    label="SRS Sample",
    color="#E63946",
    alpha=0.8,
    edgecolor="black",
)
axes[0].bar(
    x + width,
    comparison_data["Stratified Sample"],
    width,
    label="Stratified Sample",
    color="#06A77D",
    alpha=0.8,
    edgecolor="black",
)

axes[0].set_ylabel("Proportion", fontsize=12, fontweight="bold")
axes[0].set_title("Regional Representation", fontsize=13, fontweight="bold")
axes[0].set_xticks(x)
axes[0].set_xticklabels(comparison_data["Region"])
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis="y")

# Panel 2: Precision comparison (error bars)
methods = ["Population\n(Truth)", "SRS", "Stratified"]
means = [pop_mean_income, srs_estimate, stratified_estimate]
ses = [0, srs_se * 1.96, stratified_se * 1.96]  # 95% CI

colors_bars = ["#457B9D", "#E63946", "#06A77D"]
bars = axes[1].bar(
    methods,
    means,
    yerr=ses,
    capsize=10,
    color=colors_bars,
    alpha=0.7,
    edgecolor="black",
    linewidth=1.5,
)

axes[1].set_ylabel("Mean Income ($)", fontsize=12, fontweight="bold")
axes[1].set_title("Estimate Precision\n(Error bars = 95% CI)", fontsize=13, fontweight="bold")
axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("outputs/module_05/stratified_vs_srs.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 Stratified sampling ensures proportional representation; it improves precision when strata differ on the outcome.")

## 4. Cluster Sampling

**Method**: Divide population into clusters (groups), randomly select clusters, then sample all (or some) members within selected clusters.

**When to use**:
- Population is geographically dispersed
- No complete sampling frame exists
- Cost of traveling to individuals is high

### Example Scenarios

**One-stage cluster sampling**: Select schools (clusters), survey ALL students in selected schools

**Two-stage cluster sampling**: Select schools, then randomly sample students within selected schools

### Key Consideration

**Clusters should be heterogeneous** (diverse within), unlike strata which should be homogeneous.

**Why?** We want each cluster to be a "mini-population."

### Design Effect

Cluster sampling is less efficient than SRS because people within clusters tend to be similar (intracluster correlation).

**Effective sample size** is smaller than actual sample size:

$$\text{Effective } n = \frac{n}{\text{DEFF}}$$

Where DEFF (design effect) = 1 + (cluster size - 1) × ICC  
ICC = intracluster correlation coefficient
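
To see how quickly the design effect erodes a clustered sample, the two formulas above can be evaluated directly. This is a small sketch assuming equal-sized clusters; the function names and the ICC value of 0.05 are illustrative assumptions.

```python
def design_effect(cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC for clusters of equal size m."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that would give the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical design: 10 schools of 100 students each, ICC = 0.05
n, m, icc = 1000, 100, 0.05
deff = design_effect(m, icc)
n_eff = effective_sample_size(n, m, icc)
print(f"DEFF = {deff:.2f}")
print(f"Effective n = {n_eff:.0f}")
```

Here DEFF is 5.95, so the 1,000 clustered observations carry roughly the information of 168 independent ones, even with a modest ICC.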

### Advantages
- Practical for geographically dispersed populations
- Cost-effective (reduce travel)
- Feasible when complete frame unavailable

### Disadvantages
- Less precise than SRS or stratified (higher SE)
- Complex variance calculations
- Risk of cluster selection bias

In [None]:
# Demonstrate cluster sampling

# Create population with 100 clusters (e.g., schools)
np.random.seed(456)
n_clusters = 100
cluster_size = 100  # 100 students per school

# Create population with cluster structure
cluster_population = []

for cluster_id in range(1, n_clusters + 1):
    # Each cluster has a mean (some schools are higher-performing)
    cluster_mean = np.random.normal(75, 10)

    # Students within cluster have scores around cluster mean
    scores = np.random.normal(cluster_mean, 5, cluster_size)

    for score in scores:
        cluster_population.append({"Cluster_ID": cluster_id, "Score": score})

df_clusters = pd.DataFrame(cluster_population)
pop_mean_score = df_clusters["Score"].mean()
pop_sd_score = df_clusters["Score"].std()

print("POPULATION WITH CLUSTER STRUCTURE:")
print(f"Number of clusters: {n_clusters}")
print(f"Cluster size: {cluster_size}")
print(f"Total population: {len(df_clusters):,}")
print(f"Population mean score: {pop_mean_score:.2f}")
print(f"Population SD: {pop_sd_score:.2f}")

# One-stage cluster sampling: Select 10 clusters, take all members
n_clusters_sample = 10
selected_clusters = np.random.choice(
    range(1, n_clusters + 1), size=n_clusters_sample, replace=False
)

cluster_sample = df_clusters[df_clusters["Cluster_ID"].isin(selected_clusters)]

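cluster_mean_estimate = cluster_sample["Score"].mean()

# The SE must come from variation BETWEEN cluster means: with equal-sized clusters,
# the sampled clusters (not the students) are the independent units.
# Using std / sqrt(n) over students would ignore intracluster correlation.
cluster_means = cluster_sample.groupby("Cluster_ID")["Score"].mean()
fpc = 1 - n_clusters_sample / n_clusters  # finite population correction for clusters
cluster_se = np.sqrt(fpc) * cluster_means.std() / np.sqrt(n_clusters_sample)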

print(f"\n{'='*60}")
print(f"CLUSTER SAMPLE ({n_clusters_sample} clusters selected):")
print(f"Sample size: {len(cluster_sample)}")
print(f"Estimated mean: {cluster_mean_estimate:.2f}")
print(f"Error: {abs(cluster_mean_estimate - pop_mean_score):.2f}")
print(f"Standard Error: {cluster_se:.2f}")

# For comparison: SRS of same size
srs_cluster_comparison = df_clusters.sample(n=len(cluster_sample), random_state=42)
srs_mean_estimate = srs_cluster_comparison["Score"].mean()
srs_se_comparison = srs_cluster_comparison["Score"].std() / np.sqrt(len(srs_cluster_comparison))

print(f"\nCOMPARISON WITH SRS (same sample size):")
print(f"SRS mean estimate: {srs_mean_estimate:.2f}")
print(f"SRS error: {abs(srs_mean_estimate - pop_mean_score):.2f}")
print(f"SRS Standard Error: {srs_se_comparison:.2f}")

print(f"\n💡 Cluster sampling SE is {cluster_se/srs_se_comparison:.2f}x larger than SRS")
print(f"   (Due to intracluster correlation - students in same school are similar)")

## 5. Sample Size Determination

**Question**: "How many participants do I need?"

### For Estimating a Mean

Required sample size for desired margin of error:

$$n = \left(\frac{z \cdot \sigma}{E}\right)^2$$

Where:
- $z$ = z-score for confidence level (1.96 for 95%)
- $\sigma$ = population standard deviation (estimate from pilot)
- $E$ = desired margin of error

### For Estimating a Proportion

$$n = \frac{z^2 \cdot p(1-p)}{E^2}$$

Where:
- $p$ = estimated proportion (use 0.5 if unknown for max sample size)
- $E$ = desired margin of error
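
As a worked check of the proportion formula, take a ±3% margin of error at 95% confidence with $p = 0.5$ and the rounded value $z = 1.96$:

$$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.03^2} = \frac{0.9604}{0.0009} \approx 1067.1 \;\Rightarrow\; n = 1068$$

The result is always rounded up, since rounding down would miss the target margin.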

### For Comparing Two Means (t-test)

$$n = 2 \left(\frac{(z_\alpha + z_\beta) \cdot \sigma}{\delta}\right)^2$$

Where:
- $n$ = required sample size per group
- $z_\alpha$ = z-score for significance level (1.96 for two-sided α = 0.05)
- $z_\beta$ = z-score for power (0.84 for 80% power)
- $\sigma$ = pooled standard deviation
- $\delta$ = minimum detectable difference
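
As a quick check of the two-group formula, take the rounded values $z_\alpha = 1.96$ and $z_\beta = 0.84$, and assume a standardized difference ($\sigma = 1$, $\delta = 0.5$):

$$n = 2\left(\frac{(1.96 + 0.84) \times 1}{0.5}\right)^2 = 2 \times 5.6^2 = 62.72 \;\Rightarrow\; 63 \text{ per group}$$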

### Adjustments

**Finite Population Correction** (when n/N > 0.05):
$$n_{\text{adjusted}} = \frac{n}{1 + \frac{n-1}{N}}$$

**Non-response Adjustment**:
$$n_{\text{adjusted}} = \frac{n}{\text{response rate}}$$
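
The two adjustments are usually applied in sequence: shrink for a finite population first, then inflate for expected non-response. A minimal sketch, with hypothetical function names and a hypothetical population of 2,000; the starting value of 385 is what the proportion formula gives for ±5% at 95% confidence with $p = 0.5$.

```python
import math


def finite_population_correction(n, N):
    """Shrink an infinite-population sample size n for a population of size N."""
    return math.ceil(n / (1 + (n - 1) / N))


def inflate_for_nonresponse(n, response_rate):
    """Number of people to invite so that about n respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(n / response_rate)


# Hypothetical case: 385 completes needed, population of 2,000, 60% response
n_required = 385
n_fpc = finite_population_correction(n_required, N=2000)
n_invite = inflate_for_nonresponse(n_fpc, response_rate=0.60)
print(n_fpc, n_invite)
```

Note that inflating the invitation count protects precision only; it does nothing about bias if non-respondents differ systematically from respondents.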

In [None]:
# Sample size calculator functions


def sample_size_mean(margin_error, std_dev, confidence=0.95):
    """
    Calculate required sample size for estimating a mean.

    Parameters:
    - margin_error: Desired margin of error
    - std_dev: Estimated population standard deviation
    - confidence: Confidence level (default 0.95)

    Returns:
    - Required sample size
    """
    z = norm.ppf(1 - (1 - confidence) / 2)
    n = ((z * std_dev) / margin_error) ** 2
    return int(np.ceil(n))


def sample_size_proportion(margin_error, proportion=0.5, confidence=0.95):
    """
    Calculate required sample size for estimating a proportion.

    Parameters:
    - margin_error: Desired margin of error
    - proportion: Estimated proportion (default 0.5 for maximum)
    - confidence: Confidence level (default 0.95)

    Returns:
    - Required sample size
    """
    z = norm.ppf(1 - (1 - confidence) / 2)
    n = (z**2 * proportion * (1 - proportion)) / (margin_error**2)
    return int(np.ceil(n))


def sample_size_ttest(effect_size, alpha=0.05, power=0.80, std_dev=1.0):
    """
    Calculate required sample size for two-sample t-test.

    Parameters:
    - effect_size: Minimum detectable difference (in std units)
    - alpha: Significance level (default 0.05)
    - power: Statistical power (default 0.80)
    - std_dev: Pooled standard deviation (default 1.0)

    Returns:
    - Required sample size per group
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)

    n = 2 * ((z_alpha + z_beta) * std_dev / effect_size) ** 2
    return int(np.ceil(n))


# Example calculations
print("SAMPLE SIZE CALCULATIONS")
print("=" * 60)

# Scenario 1: Estimate mean income
print("\n1. Estimating Mean Income:")
print("   Goal: Estimate within ±$5,000 (95% confidence)")
print("   Population SD: $30,000 (from pilot study)")
n1 = sample_size_mean(margin_error=5000, std_dev=30000, confidence=0.95)
print(f"   Required sample size: {n1}")

# Scenario 2: Estimate proportion
print("\n2. Estimating Proportion (e.g., voter support):")
print("   Goal: Estimate within ±3% (95% confidence)")
print("   Expected proportion: Unknown (use 0.5)")
n2 = sample_size_proportion(margin_error=0.03, proportion=0.5, confidence=0.95)
print(f"   Required sample size: {n2}")

# Scenario 3: Compare two groups
print("\n3. Comparing Two Treatment Groups:")
print("   Goal: Detect difference of 0.5 SD")
print("   Alpha: 0.05, Power: 80%")
n3 = sample_size_ttest(effect_size=0.5, alpha=0.05, power=0.80)
print(f"   Required sample size per group: {n3}")
print(f"   Total sample size: {n3 * 2}")

# Adjust for non-response
print("\n" + "=" * 60)
print("ADJUSTING FOR NON-RESPONSE")
response_rate = 0.60  # Expect 60% response
print(f"\nExpected response rate: {response_rate*100:.0f}%")
print(f"\nAdjusted sample sizes:")
print(
    f"Scenario 1: {int(np.ceil(n1 / response_rate))} (need to recruit {int(np.ceil(n1 / response_rate)) - n1} extra)"
)
print(
    f"Scenario 2: {int(np.ceil(n2 / response_rate))} (need to recruit {int(np.ceil(n2 / response_rate)) - n2} extra)"
)
print(f"Scenario 3: {int(np.ceil(n3 / response_rate))} per group")

In [None]:
# Visualize sample size vs. margin of error trade-off

margins = np.linspace(0.01, 0.10, 50)  # 1% to 10%
sample_sizes = [sample_size_proportion(m, 0.5, 0.95) for m in margins]

fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(margins * 100, sample_sizes, linewidth=3, color="#E63946", marker="o", markersize=4)

# Mark common values
common_margins = [0.03, 0.05, 0.10]
for cm in common_margins:
    cn = sample_size_proportion(cm, 0.5, 0.95)
    ax.plot(
        cm * 100,
        cn,
        "o",
        markersize=12,
        color="#06A77D",
        markeredgecolor="black",
        markeredgewidth=2,
        zorder=5,
    )
    ax.annotate(
        f"±{cm*100:.0f}%\nn={cn}",
        xy=(cm * 100, cn),
        xytext=(10, 10),
        textcoords="offset points",
        fontsize=10,
        fontweight="bold",
        arrowprops=dict(arrowstyle="->", lw=1.5),
    )

ax.set_xlabel("Margin of Error (%)", fontsize=13, fontweight="bold")
ax.set_ylabel("Required Sample Size", fontsize=13, fontweight="bold")
ax.set_title(
    "Sample Size vs. Precision Trade-off\n(95% confidence, p=0.5)", fontsize=14, fontweight="bold"
)
ax.grid(True, alpha=0.3)
ax.set_xlim([1, 10])

plt.tight_layout()
plt.savefig("outputs/module_05/sample_size_tradeoff.png", dpi=300, bbox_inches="tight")
plt.show()

print("\n💡 Required sample size grows with 1/E^2: halving the margin of error quadruples n.")
print("   Each extra point of precision costs more than the last.")

## 6. Practice Exercises

### Exercise 1: Sampling Method Selection

For each scenario, recommend the most appropriate sampling method:

1. **Scenario**: National survey of university students' political views. You have a complete list of all universities but not individual students.
   - **Answer**: ____________

2. **Scenario**: Study of rare disease patients. Difficult to identify patients; rely on referrals.
   - **Answer**: ____________

3. **Scenario**: Quality control in manufacturing. Need to sample from 100 production batches.
   - **Answer**: ____________

4. **Scenario**: Pre-election poll. Population: 60% urban, 40% rural. Both groups equally variable.
   - **Answer**: ____________

In [None]:
# Exercise 2: Calculate required sample size
# You're conducting a customer satisfaction survey

# Requirements:
# - Estimate proportion of satisfied customers
# - Margin of error: ±4%
# - Confidence: 95%
# - Assume 50% satisfaction (worst case)
# - Expected response rate: 70%

# YOUR CODE HERE
# Calculate:
# 1. Required completed surveys
# 2. Adjusted for non-response
# 3. What if you only want 90% confidence?

In [None]:
# Exercise 3: Implement stratified sampling
# Given a population with two strata:
# - Stratum A: N=1000, mean=100, SD=15
# - Stratum B: N=2000, mean=120, SD=20

# Create synthetic population and:
# 1. Take proportionate stratified sample (n=300)
# 2. Compare with SRS of same size
# 3. Calculate which has lower SE

# YOUR CODE HERE

## 7. Summary and Key Takeaways

### Sampling Method Decision Tree

```
Do you have a sampling frame?
    │
    YES → Can you afford to travel widely?
    │        │
    │        YES → Does population have distinct subgroups?
    │        │        │
    │        │        YES → STRATIFIED SAMPLING
    │        │        NO  → SIMPLE RANDOM SAMPLING
    │        │
    │        NO → Are clusters geographically defined?
    │               │
    │               YES → CLUSTER SAMPLING
    │               NO  → SYSTEMATIC SAMPLING
    │
    NO → Is population hard to reach?
           │
           YES → SNOWBALL or RESPONDENT-DRIVEN SAMPLING
           NO  → CONVENIENCE or QUOTA SAMPLING
```

### Sample Size Rules of Thumb

| Goal | Minimum Sample Size |
|------|--------------------|
| Pilot testing survey | 30-50 |
| Simple statistical tests | 30 per group |
| Regression (per predictor) | 10-20 observations |
| National survey (±3% MoE) | 1,000-1,200 |
| Structural equation modeling | 200+ |
| Machine learning | 1,000s to millions |

### Critical Reminders

1. **Bigger ≠ Better**: A large biased sample is worse than a small representative sample
2. **Response rate matters**: 30% response from random sample may be better than 100% from convenience sample
3. **Calculate, don't guess**: Use formulas to determine required sample size
4. **Adjust for non-response**: Always recruit more than your target
5. **Document methods**: Report sampling procedure, response rate, weights used
6. **Check representativeness**: Compare sample to population on known characteristics
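
One simple way to act on the last reminder is a chi-square goodness-of-fit comparison between sample counts and known population shares. The sketch below is a plain-Python illustration under assumptions: the counts, the shares, and the function name are hypothetical, and 7.815 is the standard chi-square critical value for α = 0.05 with 3 degrees of freedom.

```python
def chi_square_gof(sample_counts, population_props):
    """Pearson chi-square statistic comparing sample counts with population shares."""
    n = sum(sample_counts.values())
    stat = 0.0
    for category, observed in sample_counts.items():
        expected = n * population_props[category]
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical sample of 200 vs. known population shares of 25% per region
counts = {"North": 62, "South": 41, "East": 50, "West": 47}
props = {"North": 0.25, "South": 0.25, "East": 0.25, "West": 0.25}

stat = chi_square_gof(counts, props)
CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, alpha = 0.05, df = 3
print(f"chi-square = {stat:.2f}")
print("Sample differs from population" if stat > CRITICAL_5PCT_DF3 else "No evidence of imbalance")
```

A non-significant result only shows balance on the characteristics that were compared; the sample can still differ from the population on unmeasured ones.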

### Moving Forward

You now understand how to select representative samples that allow valid generalization. The next module covers **Systematic Literature Reviews**, teaching you to comprehensively synthesize existing research.

## 8. Additional Resources

### Essential Readings

1. **Cochran, W.G. (1977)**. *Sampling Techniques* (3rd ed.)
   - Classic comprehensive textbook

2. **Lohr, S.L. (2021)**. *Sampling: Design and Analysis* (3rd ed.)
   - Modern treatment with R code

3. **Groves et al. (2009)**. *Survey Methodology* (2nd ed.)
   - Total survey error framework

### Online Calculators

- **Sample Size Calculator** (Raosoft): Free web-based tool
- **G*Power**: Free software for power analysis and sample size
- **Survey System**: Calculator for various sampling designs

### Software

- **R packages**: survey, sampling, samplesize
- **Python**: scipy.stats, statsmodels
- **Stata**: svy commands for complex surveys

In [None]:
# Save sampling method comparison table

sampling_comparison = pd.DataFrame(
    {
        "Method": [
            "Simple Random",
            "Stratified",
            "Cluster",
            "Systematic",
            "Convenience",
            "Quota",
            "Snowball",
        ],
        "Type": [
            "Probability",
            "Probability",
            "Probability",
            "Probability",
            "Non-probability",
            "Non-probability",
            "Non-probability",
        ],
        "Advantages": [
            "Simple; unbiased; foundation of theory",
            "More precise; ensures representation",
            "Cost-effective; practical for dispersed populations",
            "Simple to implement; good spatial coverage",
            "Quick; inexpensive; easy to recruit",
            "Ensures demographic representation; faster than probability",
            "Only option for hidden/rare populations",
        ],
        "Disadvantages": [
            "May miss subgroups; requires frame",
            "Requires knowing strata; more complex",
            "Less precise; complex SE calculation",
            "Risk of periodicity; pseudo-random",
            "Selection bias; cannot generalize",
            "Non-random; interviewer bias possible",
            "High bias; cannot calculate SE",
        ],
        "Best_Use": [
            "Homogeneous populations; research studies",
            "Known subgroups; demographic studies",
            "Geographically dispersed; schools/hospitals",
            "Quality control; spatial sampling",
            "Pilot testing; exploratory research",
            "Market research; when frame unavailable",
            "Hidden populations; network studies",
        ],
    }
)

sampling_comparison.to_csv("outputs/module_05/sampling_methods_comparison.csv", index=False)
print("✓ Sampling methods comparison saved to outputs/module_05/")
print("\n" + sampling_comparison.to_string(index=False))

---

## Congratulations!

You've completed **Module 05: Sampling Strategies**. You can now:

✓ Distinguish between probability and non-probability sampling  
✓ Implement various sampling methods (SRS, stratified, cluster, systematic)  
✓ Calculate required sample sizes for different goals  
✓ Evaluate sampling error and bias  
✓ Select appropriate methods for different contexts  
✓ Understand when non-probability sampling is acceptable  
✓ Assess generalizability of research findings  

**Next Module**: Systematic Literature Reviews  
**File**: `06_systematic_literature_reviews.ipynb`

---