# 📍 Measures of Central Tendency
## Finding the "Center" of Your Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/The-Pattern-Hunter/interactive-ecology-biometry/blob/main/unit-4-biometry/notebooks/02_central_tendency_analysis.ipynb)

---

> *"After seeing the shape of your data (distribution-first!), the next question is: Where is the CENTER?"*

### 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand **Mean, Median, and Mode** visually
2. Know **when to use each** measure
3. See how **distribution shape** affects these measures
4. Apply to **real ecological data**
5. Use the **stethoscope analogy**: These are your "vital signs" of data

---

## 🩺 The Stethoscope Analogy Continues

Remember:
- **Step 1-2**: We observed data and discovered its SHAPE (distribution)
- **Step 6**: Now we MEASURE the pattern

Central tendency = **"Where is the heart of your data beating?"**

Just as a doctor measures heart rate, blood pressure, and temperature as vital signs, we measure Mean, Median, and Mode as the "vital signs" of our data.

In [None]:
# Setup
!pip install numpy scipy plotly pandas -q

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
import pandas as pd

np.random.seed(42)

print("✅ Ready to explore central tendency!")
print("📍 Let's find the center of our data!")

---

## 📊 Part 1: The Three Measures

### What Are They?

| Measure | Description | Analogy |
|---------|-------------|----------|
| **Mean (μ)** | Arithmetic average | "Balance point" - where data would balance on a seesaw |
| **Median** | Middle value | "50th percentile" - half above, half below |
| **Mode** | Most frequent | "Peak of the mountain" - where data clusters most |
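
The three analogies in the table can be checked numerically. The sketch below uses a small made-up sample (the `toy` values are hypothetical and not part of the notebook's data) to show that deviations from the mean cancel out like weights on a balanced seesaw, that the median leaves the same number of values on each side, and that the mode is the most frequent value.

```python
import numpy as np

# Hypothetical toy sample, chosen only for illustration
toy = np.array([2, 3, 3, 4, 5, 7, 18], dtype=float)

toy_mean = toy.mean()
toy_median = np.median(toy)

# Mode: the unique value with the highest count
unique_vals, counts = np.unique(toy, return_counts=True)
toy_mode = unique_vals[np.argmax(counts)]

# "Balance point": deviations from the mean sum to zero (up to rounding)
balance = np.sum(toy - toy_mean)

# "Half above, half below": equal counts on each side of the median
n_below = int(np.sum(toy < toy_median))
n_above = int(np.sum(toy > toy_median))

print(f"Mean={toy_mean}, Median={toy_median}, Mode={toy_mode}")
print(f"Sum of deviations from the mean: {balance:.10f}")
print(f"Below median: {n_below}, above median: {n_above}")
```

The single large value (18) drags the mean above both the median and the mode, which previews the skewness discussion later in the notebook.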

### 🌱 Ecological Example: Plant Heights

Let's measure the heights of 100 plants in a population.

In [None]:
# Generate sample plant height data (Normal distribution)
np.random.seed(42)
plant_heights = np.random.normal(loc=50, scale=10, size=100)  # μ=50cm, σ=10cm

# Calculate measures
mean_height = np.mean(plant_heights)
median_height = np.median(plant_heights)
mode_height = stats.mode(np.round(plant_heights), keepdims=True)[0][0]  # For continuous data, round first

print(f"Plant Heights (cm) - First 20 values:")
print(np.round(plant_heights[:20], 1))
print(f"\n📊 Central Tendency Measures:")
print(f"   Mean:   {mean_height:.2f} cm")
print(f"   Median: {median_height:.2f} cm")
print(f"   Mode:   {mode_height:.2f} cm (approximate)")
print(f"\n💡 Notice: For symmetric data, mean ≈ median ≈ mode!")

---

## 🎮 Part 2: Interactive Visualization

### Visualizing Mean, Median, Mode on a Distribution

In [None]:
# Create histogram with central tendency lines
fig = go.Figure()

# Add histogram
fig.add_trace(go.Histogram(
    x=plant_heights,
    nbinsx=20,
    name='Plant Heights',
    marker_color='lightgreen',
    opacity=0.7,
    hovertemplate='Height: %{x:.1f}cm<br>Count: %{y}<extra></extra>'
))

# Add mean line
fig.add_vline(
    x=mean_height, 
    line_dash="solid", 
    line_color="red",
    line_width=3,
    annotation_text=f"Mean = {mean_height:.1f}cm",
    annotation_position="top"
)

# Add median line
fig.add_vline(
    x=median_height, 
    line_dash="dash", 
    line_color="blue",
    line_width=3,
    annotation_text=f"Median = {median_height:.1f}cm",
    annotation_position="bottom right"
)

# Add mode line (approximate)
fig.add_vline(
    x=mode_height, 
    line_dash="dot", 
    line_color="green",
    line_width=3,
    annotation_text=f"Mode ≈ {mode_height:.1f}cm",
    annotation_position="top left"
)

fig.update_layout(
    title="📏 Plant Heights: Central Tendency Measures<br><sub>Red=Mean | Blue=Median | Green=Mode</sub>",
    xaxis_title="Height (cm)",
    yaxis_title="Frequency (Number of Plants)",
    height=500,
    template='plotly_white',
    showlegend=False
)

fig.show()

print("\n💡 For symmetric (Normal) distributions:")
print("   Mean ≈ Median ≈ Mode (all near the center; a finite sample rarely matches exactly)")

---

## 🔍 Part 3: The Power of Distribution Shape

### How Skewness Affects Central Tendency

**Key Insight**: The relationship between mean, median, and mode **changes** based on distribution shape!
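
One way to put a number on "shape" is the sample skewness. The sketch below is a minimal illustration, not part of the original analysis: it draws from the same three distribution families used in this part (the sample size of 5000 and the local generator seed are assumptions) and compares the sign of `scipy.stats.skew` with the sign of mean minus median.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # local generator; leaves np.random.seed state alone

# Same distribution families as in this part, with an assumed sample size
demo_samples = {
    "symmetric": rng.normal(50, 10, 5000),
    "right-skewed": rng.lognormal(3, 0.8, 5000),
    "left-skewed": 100 * rng.beta(8, 2, 5000),
}

skew_report = {}
for name, data in demo_samples.items():
    skewness = stats.skew(data)            # sign indicates the tail direction
    gap = np.mean(data) - np.median(data)  # positive when the mean sits above the median
    skew_report[name] = (skewness, gap)
    print(f"{name:>12}: skewness={skewness:+.2f}, mean - median={gap:+.2f}")
```

For these samples, positive skewness goes with a mean above the median and negative skewness with a mean below it. That pairing is a useful rule of thumb rather than a guarantee for every dataset.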

In [None]:
# Create three different distributions
np.random.seed(42)

# 1. Symmetric (Normal)
symmetric_data = np.random.normal(50, 10, 1000)

# 2. Right-skewed (Log-normal) - like population sizes
right_skewed_data = np.random.lognormal(3, 0.8, 1000)

# 3. Left-skewed (Beta distribution)
left_skewed_data = 100 * np.random.beta(8, 2, 1000)

# Function to calculate and plot
def plot_skewed_comparison(data, title, color):
    mean_val = np.mean(data)
    median_val = np.median(data)
    mode_val = stats.mode(np.round(data), keepdims=True)[0][0]
    
    fig = go.Figure()
    
    fig.add_trace(go.Histogram(
        x=data,
        nbinsx=30,
        marker_color=color,
        opacity=0.7,
        showlegend=False
    ))
    
    fig.add_vline(x=mean_val, line_color="red", line_width=2, 
                  annotation_text=f"Mean={mean_val:.1f}")
    fig.add_vline(x=median_val, line_color="blue", line_width=2, line_dash="dash",
                  annotation_text=f"Median={median_val:.1f}")
    fig.add_vline(x=mode_val, line_color="green", line_width=2, line_dash="dot",
                  annotation_text=f"Mode≈{mode_val:.1f}")
    
    fig.update_layout(
        title=title,
        xaxis_title="Value",
        yaxis_title="Frequency",
        height=400,
        template='plotly_white'
    )
    
    return fig, mean_val, median_val, mode_val

# Plot all three
fig1, m1, med1, mod1 = plot_skewed_comparison(symmetric_data,
                                               "📊 Symmetric Distribution (Normal)",
                                               'lightblue')
fig1.show()
print(f"Symmetric: Mean={m1:.1f}, Median={med1:.1f}, Mode≈{mod1:.1f}")
print("→ All three are approximately EQUAL (near the center)\n")

fig2, m2, med2, mod2 = plot_skewed_comparison(right_skewed_data,
                                               "📈 Right-Skewed (Long tail on right)",
                                               'lightcoral')
fig2.show()
print(f"Right-Skewed: Mode≈{mod2:.1f} < Median={med2:.1f} < Mean={m2:.1f}")
print("→ Mean is PULLED toward the right tail by large values\n")

fig3, m3, med3, mod3 = plot_skewed_comparison(left_skewed_data,
                                               "📉 Left-Skewed (Long tail on left)",
                                               'lightgreen')
fig3.show()
print(f"Left-Skewed: Mean={m3:.1f} < Median={med3:.1f} < Mode≈{mod3:.1f}")
print("→ Mean is PULLED toward the left tail by small values")

---

## 📚 Part 4: The Golden Rules

### When to Use Which Measure?

| Situation | Best Measure | Why? |
|-----------|--------------|------|
| **Symmetric data** | Mean | Represents all data, easy to calculate |
| **Skewed data** | Median | Not affected by extreme values |
| **Outliers present** | Median | Resistant to outliers |
| **Categorical data** | Mode | Only measure that works for categories |
| **Need all info** | Mean | Uses every data point |
| **"Typical" value** | Median | Represents the middle |

### 🌱 Ecological Examples:

**Use Mean when**:
- Plant heights (normally distributed)
- Temperature measurements
- Leaf counts per branch

**Use Median when**:
- Population sizes (right-skewed)
- Territory or home-range sizes of animals (often skewed)
- Survival times (outliers possible)

**Use Mode when**:
- Most common flower color
- Dominant species in community
- Preferred habitat type
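
The rules above can be sketched as a small helper. This is an illustrative assumption rather than a standard procedure: the function name `suggest_measure` and the skewness cutoff of 0.5 are hypothetical choices, and a plot of the data should still come first.

```python
import pandas as pd
from scipy import stats


def suggest_measure(values, skew_threshold=0.5):
    """Return (measure_name, value) following the table's rules of thumb.

    skew_threshold is an assumed cutoff, not a universal constant.
    """
    series = pd.Series(values)

    # Categorical data: the mode is the only measure that applies
    if not pd.api.types.is_numeric_dtype(series):
        return "mode", series.mode().iloc[0]

    # Numeric data: prefer the median when the sample is clearly skewed
    skewness = stats.skew(series.to_numpy(dtype=float))
    if abs(skewness) > skew_threshold:
        return "median", float(series.median())
    return "mean", float(series.mean())


print(suggest_measure(["red", "white", "red"]))          # flower colors
print(suggest_measure([48, 49, 50, 51, 52]))             # symmetric heights
print(suggest_measure([10, 12, 12, 15, 18, 20, 45]))     # one large value
```

When several categories tie for most frequent, `Series.mode()` returns all of them and this sketch reports only the first, so ties deserve a manual look.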

---

## 🎮 Part 5: Interactive Exploration - Effect of Outliers
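
As a numeric companion to the interactive figure in this part, the following sketch tabulates the mean and median for the same ten base heights and the same outlier scenarios used in the demo. The variable names (`base`, `outlier_sets`, `summary`) are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Ten base plant heights (cm) and the outliers appended in each scenario
base = np.array([45, 48, 50, 51, 52, 53, 54, 55, 56, 58])
outlier_sets = {
    "No outliers": [],
    "One outlier (100)": [100],
    "Two outliers (100, 120)": [100, 120],
    "Extreme outlier (200)": [200],
}

rows = []
for label, extra in outlier_sets.items():
    data = np.append(base, extra)
    rows.append({
        "Scenario": label,
        "Mean": round(float(np.mean(data)), 1),
        "Median": float(np.median(data)),
    })

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```

Across the scenarios the mean climbs from 52.2 to 65.6, while the median stays between 52.5 and 53.5.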

In [None]:
# Interactive demonstration of outlier effect
def create_outlier_demo():
    # Original data
    original = np.array([45, 48, 50, 51, 52, 53, 54, 55, 56, 58])
    
    # Create traces for different outlier scenarios
    scenarios = [
        ("No outliers", original),
        ("One outlier (100)", np.append(original, 100)),
        ("Two outliers (100, 120)", np.append(original, [100, 120])),
        ("Extreme outlier (200)", np.append(original, 200))
    ]
    
    fig = go.Figure()
    
    for i, (name, data) in enumerate(scenarios):
        # One histogram per scenario; only the first is visible initially.
        # Mean and median are reported in the figure title, which the buttons update.
        fig.add_trace(go.Histogram(
            x=data,
            nbinsx=15,
            name=name,
            marker_color='lightgreen',
            opacity=0.7,
            visible=(i==0),  # Only first visible
            showlegend=False
        ))
    
    # Create buttons
    buttons = []
    for i, (name, data) in enumerate(scenarios):
        mean = np.mean(data)
        median = np.median(data)
        
        visible = [False] * len(scenarios)
        visible[i] = True
        
        buttons.append(
            dict(
                label=name,
                method="update",
                args=[
                    {"visible": visible},
                    {"title": f"🎯 {name}<br>Mean={mean:.1f} | Median={median:.1f}"}
                ]
            )
        )
    
    fig.update_layout(
        updatemenus=[
            dict(
                type="buttons",
                direction="down",
                x=0.7,
                y=1.15,
                buttons=buttons
            )
        ],
        title="🎯 No outliers<br>Mean=52.2 | Median=52.5",
        xaxis_title="Plant Height (cm)",
        yaxis_title="Frequency",
        height=500,
        template='plotly_white'
    )
    
    return fig

fig = create_outlier_demo()
fig.show()

print("\n💡 Click the buttons above to see different scenarios!")
print("📍 Watch how the MEAN in the title changes dramatically with outliers")
print("🎯 But the MEDIAN barely moves!")

---

## 🧮 Part 6: Calculating by Hand

### Small Dataset Example

Let's say you measured 7 seedling heights (cm): **[10, 12, 12, 15, 18, 20, 45]**

In [None]:
# Small example
seedlings = np.array([10, 12, 12, 15, 18, 20, 45])

print("Seedling Heights: ", seedlings)
print("\n📊 Step-by-step Calculations:\n")

# Mean
print("1️⃣ MEAN (Average):")
print(f"   Formula: Sum of all values / Number of values")
print(f"   = ({' + '.join(map(str, seedlings))}) / {len(seedlings)}")
print(f"   = {np.sum(seedlings)} / {len(seedlings)}")
print(f"   = {np.mean(seedlings):.2f} cm")
print(f"   ⚠️ Pulled up by outlier (45)\n")

# Median
print("2️⃣ MEDIAN (Middle value):")
sorted_seedlings = np.sort(seedlings)
print(f"   Step 1: Sort the data: {sorted_seedlings}")
print(f"   Step 2: Find middle position: position {len(seedlings)//2 + 1}")
print(f"   Step 3: Median = {np.median(seedlings):.1f} cm")
print(f"   ✅ Not affected by outlier!\n")

# Mode
print("3️⃣ MODE (Most frequent):")
mode_val = stats.mode(seedlings, keepdims=True)[0][0]
print(f"   Look for most common value: {mode_val} cm appears twice")
print(f"   Mode = {mode_val} cm\n")

print("\n🎯 Which should you report?")
print(f"   → MEDIAN ({np.median(seedlings):.1f}cm) best represents 'typical' seedling")
print(f"   → MEAN ({np.mean(seedlings):.2f}cm) inflated by one outlier")

---

## 📋 Part 7: Real Ecological Dataset

### Species Abundance in a Forest Plot

In [None]:
# Simulate species abundance data (typically right-skewed)
np.random.seed(42)
species_counts = np.random.lognormal(2, 1, 50).astype(int)  # 50 species

# Create DataFrame
df = pd.DataFrame({
    'Species_ID': [f'Species_{i+1}' for i in range(50)],
    'Individual_Count': species_counts
})

# Calculate measures
mean_abundance = df['Individual_Count'].mean()
median_abundance = df['Individual_Count'].median()
mode_abundance = df['Individual_Count'].mode()[0]

print("üå≥ Forest Plot Species Abundance Data")
print("\nFirst 10 species:")
print(df.head(10).to_string(index=False))

print(f"\nüìä Central Tendency:")
print(f"   Mean:   {mean_abundance:.1f} individuals per species")
print(f"   Median: {median_abundance:.1f} individuals per species")
print(f"   Mode:   {mode_abundance} individuals per species")

print(f"\nüí° Interpretation:")
print(f"   The MEDIAN ({median_abundance:.0f}) is lower than MEAN ({mean_abundance:.0f})")
print(f"   ‚Üí This indicates RIGHT-SKEWED distribution")
print(f"   ‚Üí Most species are rare, few are very abundant")
print(f"   ‚Üí Use MEDIAN to report 'typical' species abundance")

# Visualize
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=species_counts,
    nbinsx=20,
    marker_color='forestgreen',
    opacity=0.7
))

fig.add_vline(x=mean_abundance, line_color="red", line_width=3,
              annotation_text=f"Mean={mean_abundance:.1f}")
fig.add_vline(x=median_abundance, line_color="blue", line_width=3, line_dash="dash",
              annotation_text=f"Median={median_abundance:.1f}")

fig.update_layout(
    title="🌳 Species Abundance Distribution (Right-Skewed)<br><sub>Typical pattern in ecological communities</sub>",
    xaxis_title="Number of Individuals",
    yaxis_title="Number of Species",
    height=500,
    template='plotly_white'
)

fig.show()

---

## 🎓 Summary

### Key Takeaways:

✅ **Three measures of center**: Mean, Median, Mode  
✅ **Distribution shape matters**: Symmetric vs. Skewed  
✅ **Mean**: Sensitive to outliers (use for symmetric data)  
✅ **Median**: Resistant to outliers (use for skewed data)  
✅ **Mode**: Most common value (use for categorical data)  
✅ **Ecological data** is often right-skewed → Use median  

### The Pattern Hunter Way:

```
1. Plot your data FIRST (see the shape)
2. Is it symmetric or skewed?
3. Choose appropriate measure
4. Interpret in context
```

### Next Notebook:

**03_dispersion_measures.ipynb** - How spread out is your data? (Range, SD, Variance)

---

<div align="center">

**Made with 💚 by The Pattern Hunter Team**

[🏠 Repository](https://github.com/The-Pattern-Hunter/interactive-ecology-biometry) | 
[📓 Previous: Distributions](01_distributions_exploration.ipynb) | 
[📓 Next: Dispersion](03_dispersion_measures.ipynb)

</div>