# 🎲 Sampling Techniques
## How to Collect Representative Data from Nature

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/The-Pattern-Hunter/interactive-ecology-biometry/blob/main/unit-4-biometry/notebooks/04_sampling_techniques.ipynb)

---

> *"You can't count every organism in a forest. How do you choose which ones to measure?"*

### 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand **why we sample** instead of measuring everything
2. Learn **four main sampling techniques**: Random, Systematic, Stratified, Cluster
3. See the **strengths and weaknesses** of each method
4. Understand **sampling error** and how to minimize it
5. Apply sampling to **real ecological scenarios**

---

## 🩺 The Stethoscope Analogy

### Step 7 of the Pattern Hunter Journey: **Collecting Good Data**

Before you can use your statistical stethoscope (distributions), you need **good data** to analyze!

**Medical Analogy:**
- A doctor doesn't test EVERY drop of blood
- Takes a small **representative sample**
- If the sample is good → accurate diagnosis
- If the sample is bad → wrong diagnosis!

**Ecological Analogy:**
- Can't measure EVERY tree in a forest (time, money, effort)
- Take a **representative sample**
- Good sample → accurate population estimate
- Bad sample → biased, misleading results

In [None]:
# Setup
!pip install numpy scipy plotly pandas -q

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
import pandas as pd

np.random.seed(42)

print("✅ Ready to learn sampling!")
print("🎲 Let's see how to collect good data!")

---

## 📊 Part 1: Why Sample? Population vs Sample

### Key Concepts:

| Term | Definition | Example |
|------|------------|----------|
| **Population** | ALL individuals of interest | All trees in a forest (100,000 trees) |
| **Sample** | Subset we actually measure | 100 trees we select and measure |
| **Parameter** | True value in population | True mean height = 15.2m (unknown!) |
| **Statistic** | Estimated value from sample | Sample mean = 15.1m (what we calculate) |

### Why We Sample:

✅ **Time**: Measuring 100 trees takes days, not years  
✅ **Cost**: Less expensive  
✅ **Practicality**: Some populations are infinite (all future seeds)  
✅ **Destruction**: Some tests destroy the specimen (seed germination)  

In [None]:
# Create a population (entire forest of 10,000 trees)
np.random.seed(42)
population = np.random.normal(loc=15, scale=3, size=10000)  # True mean=15m, sd=3m

true_mean = np.mean(population)
true_sd = np.std(population)

print("🌲 TRUE POPULATION (10,000 trees):")
print(f"   True Mean Height: {true_mean:.2f} m")
print(f"   True Std Dev: {true_sd:.2f} m")
print(f"\n⚠️ In real life, we DON'T KNOW these true values!")
print(f"   We can only ESTIMATE them from a sample...\n")

# Take a sample
sample_size = 100
sample = np.random.choice(population, size=sample_size, replace=False)

sample_mean = np.mean(sample)
sample_sd = np.std(sample, ddof=1)  # ddof=1: sample SD (n-1 denominator) to estimate the population SD

print(f"📊 OUR SAMPLE ({sample_size} trees):")
print(f"   Sample Mean: {sample_mean:.2f} m")
print(f"   Sample SD: {sample_sd:.2f} m")
print(f"\n✅ Pretty close to true values!")
print(f"   Error in mean: {abs(sample_mean - true_mean):.2f} m ({abs(sample_mean - true_mean)/true_mean*100:.1f}%)")

# Visualize
fig = make_subplots(rows=1, cols=2, subplot_titles=('Population (10,000 trees)', f'Sample ({sample_size} trees)'))

fig.add_trace(go.Histogram(x=population, nbinsx=50, marker_color='lightgreen', name='Population'), row=1, col=1)
fig.add_vline(x=true_mean, line_dash="dash", line_color="red", annotation_text=f"μ={true_mean:.1f}", row=1, col=1)

fig.add_trace(go.Histogram(x=sample, nbinsx=20, marker_color='lightblue', name='Sample'), row=1, col=2)
fig.add_vline(x=sample_mean, line_dash="dash", line_color="blue", annotation_text=f"x̄={sample_mean:.1f}", row=1, col=2)

fig.update_layout(height=400, showlegend=False, template='plotly_white')
fig.update_xaxes(title_text="Tree Height (m)")
fig.update_yaxes(title_text="Frequency")

fig.show()

print("\n💡 Goal of sampling: Estimate population parameters from sample statistics!")

---

## 🎲 Part 2: The Four Main Sampling Techniques

### Overview:

1. **Random Sampling** - Every individual has equal chance
2. **Systematic Sampling** - Select every nth individual
3. **Stratified Sampling** - Divide into groups, sample from each
4. **Cluster Sampling** - Divide into clusters, randomly select entire clusters

---

## 1️⃣ Random Sampling

**Method**: Every individual in the population has an **equal chance** of being selected.

**How**: Use random number generator, lottery, random coordinates

**Ecological Example**: 
- Number all plants 1-1000
- Use random number generator to pick 50 numbers
- Measure those 50 plants
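
The three steps above can be sketched in a few lines. This is a minimal illustration, assuming the plants carry labels 1 to 1000 and using NumPy's `default_rng` generator; the seed and the variable names are placeholders chosen for this example only.

```python
import numpy as np

# Hypothetical sampling frame: plants labelled 1..1000
rng = np.random.default_rng(seed=42)   # seeded only so the example is repeatable
plant_labels = np.arange(1, 1001)

# Simple random sample of 50 labels, drawn without replacement
chosen = rng.choice(plant_labels, size=50, replace=False)
chosen.sort()                          # sorted order is easier to follow in the field

print(chosen[:10])                     # first few labels to visit
```

Drawing without replacement matters here: it guarantees that no plant is measured twice.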

In [None]:
# Simulate a field of 1000 plants scattered at random over a 40 m x 25 m area
np.random.seed(42)

# Create population with spatial structure
n_plants = 1000
plant_ids = np.arange(n_plants)
x_coords = np.random.uniform(0, 40, n_plants)
y_coords = np.random.uniform(0, 25, n_plants)
heights = np.random.normal(50, 10, n_plants)

# Random sampling
sample_size = 50
random_sample_ids = np.random.choice(plant_ids, size=sample_size, replace=False)

# Visualize
fig = go.Figure()

# Population (all plants)
fig.add_trace(go.Scatter(
    x=x_coords,
    y=y_coords,
    mode='markers',
    marker=dict(size=4, color='lightgray', opacity=0.5),
    name='Population (not sampled)',
    hoverinfo='skip'
))

# Random sample (selected plants)
fig.add_trace(go.Scatter(
    x=x_coords[random_sample_ids],
    y=y_coords[random_sample_ids],
    mode='markers',
    marker=dict(size=10, color='red', symbol='star'),
    name='Random Sample',
    hovertemplate='Plant ID: %{text}<br>Height: %{customdata:.1f}cm<extra></extra>',
    text=random_sample_ids,
    customdata=heights[random_sample_ids]
))

fig.update_layout(
    title="🎲 Random Sampling: Every plant has equal chance<br><sub>Red stars = sampled plants</sub>",
    xaxis_title="Field X-coordinate (m)",
    yaxis_title="Field Y-coordinate (m)",
    height=500,
    template='plotly_white'
)

fig.show()

print(f"\n✅ Advantages:")
print(f"   • Unbiased - no systematic patterns")
print(f"   • Simple to understand and implement")
print(f"   • Works well for homogeneous populations")

print(f"\n❌ Disadvantages:")
print(f"   • May miss rare subgroups by chance")
print(f"   • Can be spread out (expensive to visit all locations)")
print(f"   • Requires complete list of population")

---

## 2️⃣ Systematic Sampling

**Method**: Select every **nth** individual from an ordered list.

**How**: 
1. Calculate interval: k = Population size / Sample size
2. Random start point between 1 and k
3. Select every kth individual

**Ecological Example**: 
- Population = 1000 plants, want 50 samples
- k = 1000/50 = 20
- Start at random plant #7, then select #27, #47, #67...
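
The interval-and-start recipe above can be written as a small helper. This is a sketch, assuming 1-based plant labels and a population size that is a whole multiple of the sample size; `systematic_labels` is a hypothetical name, not a library function.

```python
import numpy as np

def systematic_labels(population_size, sample_size, rng):
    """Return 1-based labels chosen by systematic sampling (illustrative helper)."""
    k = population_size // sample_size       # sampling interval
    start = int(rng.integers(1, k + 1))      # random start in 1..k (upper bound is exclusive)
    return np.arange(start, population_size + 1, k)[:sample_size]

rng = np.random.default_rng(7)
labels = systematic_labels(1000, 50, rng)
print(labels[:4])                            # start, start + 20, start + 40, ...
```

Only the start is random. Once it is drawn, every other label is fixed, which is why a periodic pattern in the population can bias the sample.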

In [None]:
# Systematic sampling
k = len(plant_ids) // sample_size  # interval
start = np.random.randint(0, k)  # random start
systematic_sample_ids = np.arange(start, len(plant_ids), k)[:sample_size]

print(f"📐 Systematic Sampling Setup:")
print(f"   Population size: {len(plant_ids)}")
print(f"   Desired sample size: {sample_size}")
print(f"   Interval (k): {k}")
print(f"   Random start: Plant #{start}")
print(f"   Then every {k}th plant: {start}, {start+k}, {start+2*k}, ...")

# Visualize
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x_coords,
    y=y_coords,
    mode='markers',
    marker=dict(size=4, color='lightgray', opacity=0.5),
    name='Population',
    hoverinfo='skip'
))

fig.add_trace(go.Scatter(
    x=x_coords[systematic_sample_ids],
    y=y_coords[systematic_sample_ids],
    mode='markers',
    marker=dict(size=10, color='blue', symbol='diamond'),
    name='Systematic Sample',
    hovertemplate='Plant ID: %{text}<br>Height: %{customdata:.1f}cm<extra></extra>',
    text=systematic_sample_ids,
    customdata=heights[systematic_sample_ids]
))

fig.update_layout(
    title=f"📐 Systematic Sampling: Every {k}th plant<br><sub>Blue diamonds = sampled plants</sub>",
    xaxis_title="Field X-coordinate (m)",
    yaxis_title="Field Y-coordinate (m)",
    height=500,
    template='plotly_white'
)

fig.show()

print(f"\n✅ Advantages:")
print(f"   • Easier to implement than random")
print(f"   • Spreads sample evenly across population")
print(f"   • Good for spatial sampling (transects)")

print(f"\n❌ Disadvantages:")
print(f"   • Can be biased if population has periodic patterns")
print(f"   • Example: If plants are planted in rows of 20 and k=20, you sample the same position in every row!")

---

## 3️⃣ Stratified Sampling

**Method**: Divide population into **strata** (subgroups), then randomly sample from each stratum.

**How**:
1. Identify important subgroups (species, habitat types, age classes)
2. Divide population into strata
3. Sample proportionally (or equally) from each stratum

**Ecological Example**: 
- Forest with 60% oak, 30% maple, 10% birch
- Sample 50 trees: 30 oaks, 15 maples, 5 birch (proportional)
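
The proportional split in this example can be computed rather than worked out by hand. The sketch below assumes a forest of 1000 trees with the 60/30/10 composition given above; `proportional_allocation` is a hypothetical helper that uses largest-remainder rounding so the stratum sample sizes always add up to the requested total.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Illustrative helper: rounds down first, then hands leftover units to the
    strata with the largest fractional parts.
    """
    population = sum(stratum_sizes.values())
    exact = {name: total_sample * size / population
             for name, size in stratum_sizes.items()}
    alloc = {name: int(value) for name, value in exact.items()}
    leftover = total_sample - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

forest = {"oak": 600, "maple": 300, "birch": 100}   # assumed counts (60/30/10 %)
print(proportional_allocation(forest, 50))
# → {'oak': 30, 'maple': 15, 'birch': 5}
```

Within each stratum, the allocated number of trees would then be drawn by simple random sampling.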

In [None]:
# Create stratified population (3 habitat types)
np.random.seed(42)

# Habitat A (50%): High nutrients, tall plants
n_A = 500
habitat_A = np.random.normal(60, 8, n_A)
x_A = np.random.uniform(0, 20, n_A)
y_A = np.random.uniform(0, 25, n_A)

# Habitat B (30%): Medium nutrients, medium plants
n_B = 300
habitat_B = np.random.normal(45, 7, n_B)
x_B = np.random.uniform(20, 30, n_B)
y_B = np.random.uniform(0, 25, n_B)

# Habitat C (20%): Low nutrients, short plants
n_C = 200
habitat_C = np.random.normal(30, 6, n_C)
x_C = np.random.uniform(30, 40, n_C)
y_C = np.random.uniform(0, 25, n_C)

# Proportional stratified sampling
sample_A = round(sample_size * 0.5)  # round() guards against float truncation (e.g. 14.999 -> 14)
sample_B = round(sample_size * 0.3)
sample_C = round(sample_size * 0.2)

sampled_A = np.random.choice(habitat_A, size=sample_A, replace=False)
sampled_B = np.random.choice(habitat_B, size=sample_B, replace=False)
sampled_C = np.random.choice(habitat_C, size=sample_C, replace=False)

print("🌍 Stratified Sampling: Field with 3 habitat types\n")
print("Habitat | % of Field | Population | Sample Size | Mean Height")
print("--------|-----------|------------|-------------|------------")
print(f"   A    |    50%    |    {n_A}    |     {sample_A}      |   {np.mean(habitat_A):.1f} cm")
print(f"   B    |    30%    |    {n_B}    |     {sample_B}      |   {np.mean(habitat_B):.1f} cm")
print(f"   C    |    20%    |    {n_C}    |     {sample_C}      |   {np.mean(habitat_C):.1f} cm")
print(f"\nTotal samples: {sample_A + sample_B + sample_C}")

# Visualize
fig = go.Figure()

# Habitat A
fig.add_trace(go.Scatter(
    x=x_A, y=y_A,
    mode='markers',
    marker=dict(size=5, color='lightgreen', opacity=0.3),
    name='Habitat A (all)',
    hoverinfo='skip'
))

# Habitat B
fig.add_trace(go.Scatter(
    x=x_B, y=y_B,
    mode='markers',
    marker=dict(size=5, color='lightblue', opacity=0.3),
    name='Habitat B (all)',
    hoverinfo='skip'
))

# Habitat C
fig.add_trace(go.Scatter(
    x=x_C, y=y_C,
    mode='markers',
    marker=dict(size=5, color='lightyellow', opacity=0.3),
    name='Habitat C (all)',
    hoverinfo='skip'
))

# Samples from each habitat (draw plant indices so every x/y pair is one real plant)
idx_A = np.random.choice(n_A, size=sample_A, replace=False)
x_A_sample = x_A[idx_A]
y_A_sample = y_A[idx_A]
fig.add_trace(go.Scatter(
    x=x_A_sample, y=y_A_sample,
    mode='markers',
    marker=dict(size=12, color='darkgreen', symbol='star'),
    name='Sampled from A'
))

idx_B = np.random.choice(n_B, size=sample_B, replace=False)  # indices keep x and y paired
x_B_sample = x_B[idx_B]
y_B_sample = y_B[idx_B]
fig.add_trace(go.Scatter(
    x=x_B_sample, y=y_B_sample,
    mode='markers',
    marker=dict(size=12, color='darkblue', symbol='diamond'),
    name='Sampled from B'
))

idx_C = np.random.choice(n_C, size=sample_C, replace=False)  # indices keep x and y paired
x_C_sample = x_C[idx_C]
y_C_sample = y_C[idx_C]
fig.add_trace(go.Scatter(
    x=x_C_sample, y=y_C_sample,
    mode='markers',
    marker=dict(size=12, color='orange', symbol='square'),
    name='Sampled from C'
))

fig.update_layout(
    title="🌍 Stratified Sampling: Sample proportionally from each habitat",
    xaxis_title="Field X-coordinate (m)",
    yaxis_title="Field Y-coordinate (m)",
    height=500,
    template='plotly_white'
)

fig.show()

print(f"\n✅ Advantages:")
print(f"   • Ensures all subgroups are represented")
print(f"   • Usually more precise than simple random sampling when strata differ")
print(f"   • Can compare between strata")

print(f"\n❌ Disadvantages:")
print(f"   • Need to know strata in advance")
print(f"   • More complex to implement")
print(f"   • Requires classification of all individuals")

---

## 4️⃣ Cluster Sampling

**Method**: Divide population into **clusters** (groups), randomly select **entire clusters**, measure all individuals in selected clusters.

**How**:
1. Divide area into clusters (plots, quadrats)
2. Randomly select some clusters
3. Measure ALL individuals in selected clusters

**Ecological Example**: 
- Divide forest into 100 plots (10 m × 10 m each)
- Randomly select 10 plots
- Measure ALL trees in those 10 plots
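
The plot-based procedure above can be sketched as a two-step selection. The tree counts per plot are invented for illustration (drawn at random between 5 and 29), and names such as `plot_of_tree` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical forest: 100 plots, each holding a different number of trees
n_plots = 100
trees_per_plot = rng.integers(5, 30, size=n_plots)
plot_of_tree = np.repeat(np.arange(n_plots), trees_per_plot)  # plot ID of every tree

# Step 1: randomly choose 10 whole plots
chosen_plots = rng.choice(n_plots, size=10, replace=False)

# Step 2: keep EVERY tree that falls inside a chosen plot
in_sample = np.isin(plot_of_tree, chosen_plots)

print(f"Plots visited: {np.sort(chosen_plots)}")
print(f"Trees measured: {in_sample.sum()} of {plot_of_tree.size}")
```

Note that the number of trees measured is not fixed in advance: it depends on which plots happen to be drawn.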

In [None]:
# Create clustered population (20 clusters, 50 plants each)
np.random.seed(42)

n_clusters = 20
plants_per_cluster = 50
n_select_clusters = 4  # Select 4 clusters

# Create clusters in a 4x5 grid
cluster_data = []
for i in range(4):  # rows
    for j in range(5):  # columns
        cluster_id = i * 5 + j
        # Each cluster is a 8x5 area
        x_cluster = np.random.uniform(j*8, (j+1)*8, plants_per_cluster)
        y_cluster = np.random.uniform(i*5, (i+1)*5, plants_per_cluster)
        heights_cluster = np.random.normal(50, 10, plants_per_cluster)
        
        cluster_data.append({
            'cluster_id': cluster_id,
            'x': x_cluster,
            'y': y_cluster,
            'heights': heights_cluster
        })

# Randomly select clusters
selected_clusters = np.random.choice(range(n_clusters), size=n_select_clusters, replace=False)

print(f"🗺️ Cluster Sampling Setup:")
print(f"   Total clusters: {n_clusters} (arranged in 4×5 grid)")
print(f"   Plants per cluster: {plants_per_cluster}")
print(f"   Selected clusters: {list(selected_clusters)}")
print(f"   Total plants sampled: {n_select_clusters * plants_per_cluster}")

# Visualize
fig = go.Figure()

# All clusters (not selected)
for cluster in cluster_data:
    if cluster['cluster_id'] not in selected_clusters:
        fig.add_trace(go.Scatter(
            x=cluster['x'],
            y=cluster['y'],
            mode='markers',
            marker=dict(size=4, color='lightgray', opacity=0.3),
            name=f"Cluster {cluster['cluster_id']}",
            showlegend=False,
            hoverinfo='skip'
        ))

# Selected clusters
colors = ['red', 'blue', 'green', 'orange']
for idx, cluster_id in enumerate(selected_clusters):
    cluster = cluster_data[cluster_id]
    fig.add_trace(go.Scatter(
        x=cluster['x'],
        y=cluster['y'],
        mode='markers',
        marker=dict(size=8, color=colors[idx], symbol='circle'),
        name=f"Selected Cluster {cluster_id}",
        hovertemplate='Cluster %{text}<br>Height: %{customdata:.1f}cm<extra></extra>',
        text=[cluster_id]*len(cluster['x']),
        customdata=cluster['heights']
    ))

# Draw cluster boundaries (5 columns need 6 vertical lines, 4 rows need 5 horizontal lines)
for i in range(6):
    fig.add_vline(x=i*8, line_dash="dot", line_color="gray", opacity=0.3)
for i in range(5):
    fig.add_hline(y=i*5, line_dash="dot", line_color="gray", opacity=0.3)

fig.update_layout(
    title="🗺️ Cluster Sampling: Select entire clusters, measure all plants within<br><sub>Colored = selected clusters</sub>",
    xaxis_title="Field X-coordinate (m)",
    yaxis_title="Field Y-coordinate (m)",
    height=500,
    template='plotly_white'
)

fig.show()

print(f"\n✅ Advantages:")
print(f"   • Cost-effective (visit fewer locations)")
print(f"   • Practical for large areas")
print(f"   • Easy to implement in the field")

print(f"\n❌ Disadvantages:")
print(f"   • Often less precise than other methods for the same number of individuals")
print(f"   • Clusters may not represent full population")
print(f"   • High variance between clusters = poor estimates")

---

## 📊 Part 3: Comparing All Four Methods

In [None]:
# Compare accuracy of all four methods
np.random.seed(42)

# True population
true_population = np.random.normal(50, 10, 1000)
true_mean = np.mean(true_population)

# Repeat sampling 100 times for each method
n_simulations = 100
sample_size = 50

random_means = []
systematic_means = []
stratified_means = []
cluster_means = []

for sim in range(n_simulations):
    # Random
    random_sample = np.random.choice(true_population, size=sample_size, replace=False)
    random_means.append(np.mean(random_sample))
    
    # Systematic
    k = len(true_population) // sample_size
    start = np.random.randint(0, k)
    systematic_sample = true_population[start::k][:sample_size]
    systematic_means.append(np.mean(systematic_sample))
    
    # Stratified (divide into 5 strata)
    strata = np.array_split(true_population, 5)
    stratified_sample = np.concatenate([np.random.choice(stratum, size=10, replace=False) for stratum in strata])
    stratified_means.append(np.mean(stratified_sample))
    
    # Cluster (divide into 40 clusters of 25, select 2 so n = 50 matches the other methods)
    clusters = np.array_split(true_population, 40)
    selected_clusters_idx = np.random.choice(range(40), size=2, replace=False)
    cluster_sample = np.concatenate([clusters[i] for i in selected_clusters_idx])
    cluster_means.append(np.mean(cluster_sample))

# Create comparison plot
fig = go.Figure()

fig.add_trace(go.Box(y=random_means, name='Random', marker_color='red'))
fig.add_trace(go.Box(y=systematic_means, name='Systematic', marker_color='blue'))
fig.add_trace(go.Box(y=stratified_means, name='Stratified', marker_color='green'))
fig.add_trace(go.Box(y=cluster_means, name='Cluster', marker_color='orange'))

fig.add_hline(y=true_mean, line_dash="dash", line_color="black",
              annotation_text=f"True Mean = {true_mean:.2f}")

fig.update_layout(
    title=f"📊 Comparing Sampling Methods ({n_simulations} simulations each)<br><sub>Which is most accurate and precise?</sub>",
    yaxis_title="Estimated Mean",
    height=500,
    template='plotly_white'
)

fig.show()

# Calculate statistics
print(f"\n📈 True Population Mean: {true_mean:.2f}\n")
print("Method      | Mean of Estimates | Std Dev | Bias")
print("------------|------------------|---------|------")
print(f"Random      |      {np.mean(random_means):.2f}       |  {np.std(random_means):.2f}  | {abs(np.mean(random_means)-true_mean):.2f}")
print(f"Systematic  |      {np.mean(systematic_means):.2f}       |  {np.std(systematic_means):.2f}  | {abs(np.mean(systematic_means)-true_mean):.2f}")
print(f"Stratified  |      {np.mean(stratified_means):.2f}       |  {np.std(stratified_means):.2f}  | {abs(np.mean(stratified_means)-true_mean):.2f}")
print(f"Cluster     |      {np.mean(cluster_means):.2f}       |  {np.std(cluster_means):.2f}  | {abs(np.mean(cluster_means)-true_mean):.2f}")

print(f"\n💡 Interpretation:")
print(f"   • Lower Std Dev = More PRECISE (consistent estimates)")
print(f"   • Lower Bias = More ACCURATE (close to true value)")
print(f"   • All methods are unbiased (mean ≈ true mean)")
print(f"   • Stratified often most precise for heterogeneous populations")

---

## 📏 Part 4: Sample Size Matters!
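
A rough way to anticipate the effect of sample size is the standard error of the mean, SE = σ/√n. The sketch below assumes a population standard deviation of 10 (the value used for the simulated population in this part) and ignores the finite-population correction; `standard_error` is a hypothetical helper.

```python
import math

def standard_error(sd, n):
    """Standard error of the sample mean, ignoring finite-population correction."""
    return sd / math.sqrt(n)

assumed_sd = 10  # assumption: matches the simulated population's scale
for n in [10, 30, 50, 100, 300, 500]:
    print(f"n={n:>3}  SE = {standard_error(assumed_sd, n):.2f}")
```

Because the error shrinks with the square root of n, quadrupling the sample size only halves the standard error.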

In [None]:
# Show effect of sample size
np.random.seed(42)
population = np.random.normal(50, 10, 10000)
true_mean = np.mean(population)

sample_sizes = [10, 30, 50, 100, 300, 500]
n_simulations = 100

fig = go.Figure()

for n in sample_sizes:
    estimates = []
    for _ in range(n_simulations):
        sample = np.random.choice(population, size=n, replace=False)
        estimates.append(np.mean(sample))
    
    fig.add_trace(go.Box(
        y=estimates,
        name=f'n={n}',
        boxmean='sd'
    ))

fig.add_hline(y=true_mean, line_dash="dash", line_color="red",
              annotation_text=f"True Mean = {true_mean:.2f}")

fig.update_layout(
    title="📏 Effect of Sample Size on Precision<br><sub>Larger samples = more precise estimates</sub>",
    xaxis_title="Sample Size",
    yaxis_title="Estimated Mean",
    height=500,
    template='plotly_white'
)

fig.show()

print("\n💡 As sample size increases:")
print("   ✅ Estimates get MORE PRECISE (box gets narrower)")
print("   ✅ Estimates cluster closer to true value")
print("   ❌ But diminishing returns: error shrinks roughly as 1/sqrt(n), so gains flatten after ~100-300 samples")
print("   ⚖️ Balance: Accuracy vs. Cost/Time")

---

## 🎓 Summary

### Key Takeaways:

✅ **Sampling is necessary** - can't measure entire populations  
✅ **Four main methods**: Random, Systematic, Stratified, Cluster  
✅ **Each has trade-offs** - choose based on your situation  
✅ **Stratified best** for heterogeneous populations  
✅ **Cluster most practical** for large areas  
✅ **Sample size matters** - larger = more precise  
✅ **Goal**: Representative sample that reflects population  

### Quick Decision Guide:

| Your Situation | Best Method |
|----------------|-------------|
| Homogeneous population, complete list | **Random** |
| Spatial transect, linear arrangement | **Systematic** |
| Distinct subgroups, want precision | **Stratified** |
| Large area, limited budget | **Cluster** |

### Next Notebook:

**05_hypothesis_testing.ipynb** - Chi-square and t-tests

---

<div align="center">

**Made with 💚 by The Pattern Hunter Team**

[🏠 Repository](https://github.com/The-Pattern-Hunter/interactive-ecology-biometry) | 
[📓 Previous: Dispersion](03_dispersion_measures.ipynb) | 
[📓 Next: Hypothesis Testing](05_hypothesis_testing.ipynb)

</div>