<h4 style="color:#1a73e8;">2.2.1 Descriptive Statistics: The Five Key Measures</h4>

For any numerical feature, these statistics tell its story. Let's use a sample of house prices: `[200, 250, 300, 300, 350, 400, 500, 1000]` (in $1,000s).

**1. Mean (Average)**: The balance point of the data.

**Formula**: Mean = (Sum of all values) / (Number of values)

**Mathematical Notation**: 
```
μ = (1/n) × Σ(x_i) for i=1 to n
```

**Calculation**:
```
Mean = (200 + 250 + 300 + 300 + 350 + 400 + 500 + 1000) / 8
     = 3300 / 8
     = 412.5
```

**Interpretation**: The "typical" price is $412,500. But is it representative? Six of the eight houses cost less than that, because the single $1,000,000 house pulls the mean upward. The mean is **sensitive to outliers**.

**When to Use**: When the data is roughly symmetric (for example, approximately normally distributed) and has no extreme outliers.

**2. Median**: The middle value when sorted.

**Calculation**:
- Sort the data: `[200, 250, 300, 300, 350, 400, 500, 1000]`
- For an even number of points: average the two middle values
- Middle values: 300 and 350
- Median = (300 + 350) / 2 = 325

**Interpretation**: Half the houses cost less than $325,000, half cost more. The median is **robust to outliers** (unlike the mean). The $1M house has no special pull on it: raising that price to $10M would leave the median at $325,000.

**When to Use**: When data has outliers or is skewed. Often more representative than the mean.
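
To make the contrast between the two measures concrete, here is a minimal sketch using Python's standard `statistics` module. Dropping the $1M house is an assumption made purely for illustration, not a recommended cleaning step, and the variable names are hypothetical.

```python
import statistics

# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

# Illustration only: remove the single extreme value
without_outlier = [p for p in prices if p != 1000]

mean_all = statistics.mean(prices)
median_all = statistics.median(prices)
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"Mean:   {mean_all:.2f} -> {mean_trimmed:.2f}")
print(f"Median: {median_all:.2f} -> {median_trimmed:.2f}")
```

Removing one house moves the mean by roughly $84,000 but the median by only $25,000, which is the sense in which the median is the more robust of the two.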

**3. Mode**: The most frequent value.

**Calculation**:
- Count frequencies: 200 (1), 250 (1), 300 (2), 350 (1), 400 (1), 500 (1), 1000 (1)
- Mode = 300 (appears twice)

**Interpretation**: $300,000 is the most common price in this sample.

**When to Use**: For categorical data or when you need to know the most common value.
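
The frequency count above can be sketched with the standard library. One caveat worth seeing in code is that a dataset can have several modes; the second list below is a made-up example (not from the house-price sample) that shows a tie.

```python
from collections import Counter
import statistics

# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

# Count how often each price occurs
counts = Counter(prices)
top_value, top_count = counts.most_common(1)[0]
print(f"Mode: {top_value} (appears {top_count} times)")

# multimode() returns every value tied for the highest count
modes = statistics.multimode(prices)
print(modes)

# Hypothetical data with a tie: both 200 and 300 appear twice
tied_modes = statistics.multimode([200, 200, 300, 300, 400])
print(tied_modes)
```

When every value is unique, as is common for continuous measurements, all values tie and the mode carries little information.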

**4. Variance and Standard Deviation**: Measures of spread (how much values vary).

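**Variance Formula** (population form): 
```
σ² = (1/n) × Σ(x_i - μ)² for i=1 to n
```

This divides by n, which treats the eight prices as the entire population. If they are a sample from a larger market, the *sample* variance divides by n - 1 instead (Bessel's correction); pandas' `.var()` and `.std()` use that n - 1 form by default.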

**Calculation Steps**:
1. Calculate mean: μ = 412.5
2. For each value, calculate (x_i - μ)²:
   - (200 - 412.5)² = 45,156.25
   - (250 - 412.5)² = 26,406.25
   - (300 - 412.5)² = 12,656.25
   - (300 - 412.5)² = 12,656.25
   - (350 - 412.5)² = 3,906.25
   - (400 - 412.5)² = 156.25
   - (500 - 412.5)² = 7,656.25
   - (1000 - 412.5)² = 345,156.25
3. Sum: 453,750
4. Divide by n: 453,750 / 8 = 56,718.75

**Standard Deviation Formula**: 
```
σ = √(variance) = √(σ²)
```

**Calculation**: 
```
σ = √56,718.75 ≈ 238.16
```

**Interpretation**: Prices typically sit about $238,000 from the mean (strictly, σ is the root-mean-square deviation rather than the plain average distance). A high variance or standard deviation means the values are widely spread, which in settings such as pricing or finance reads as higher uncertainty or risk. A low value means the values cluster close to the mean.

**When to Use**: 
- Variance: For mathematical work (for example, the variances of independent variables add), keeping in mind that it is in squared units
- Standard Deviation: For interpretation (it is in the original units, here $1,000s, so it is easier to read)
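
The two divisors (n and n - 1) are easy to mix up, so here is a small sketch that computes both forms by hand for the sample prices. The variable names are illustrative. For reference, NumPy's `np.var` and `np.std` default to the n form (`ddof=0`), while pandas defaults to the n - 1 form (`ddof=1`).

```python
import math

# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean
sum_sq = sum((x - mean) ** 2 for x in prices)

# Population form: divide by n (matches the formula above)
pop_var = sum_sq / n
pop_std = math.sqrt(pop_var)

# Sample form: divide by n - 1 (Bessel's correction)
sample_var = sum_sq / (n - 1)
sample_std = math.sqrt(sample_var)

print(f"Sum of squared deviations: {sum_sq:.2f}")
print(f"Population variance / std: {pop_var:.2f} / {pop_std:.2f}")
print(f"Sample variance / std:     {sample_var:.2f} / {sample_std:.2f}")
```

With only eight values the gap between the two forms is noticeable; it shrinks as n grows.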

**5. Skewness and Kurtosis**: Shape of the distribution.

**Skewness**: Measures asymmetry.
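- **Formula** (Pearson's second skewness coefficient, a quick approximation): Skewness = (3 × (Mean - Median)) / Standard Deviation. For the sample prices, using the population σ, this gives 3 × (412.5 - 325) / σ ≈ 1.10. Library functions such as `scipy.stats.skew` use the moment-based definition instead, so they report a different number for the same data.
- **Positive skew** (right tail): Typically Mean > Median. Common in income, house prices, reaction times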
  - Example: Most people earn $50k, but a few earn $500k, pulling the mean right
- **Negative skew** (left tail): Mean < Median. Common in exam scores (ceiling effect)
  - Example: Most students score 80-100, few score very low
- **Zero skew**: Symmetric distribution (like a normal distribution)
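
The median-based formula and the moment-based definition of skewness generally give different numbers, which can be confusing when a hand calculation is compared with library output. The sketch below computes both for the sample prices using only the standard library. The moment-based version here is the third central moment divided by the 1.5 power of the second, which is what `scipy.stats.skew` computes with its default settings; the variable names are hypothetical.

```python
import math
import statistics

# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

n = len(prices)
mean = sum(prices) / n
median = statistics.median(prices)

# Central moments (population form, dividing by n)
m2 = sum((x - mean) ** 2 for x in prices) / n
m3 = sum((x - mean) ** 3 for x in prices) / n

# Pearson's second skewness coefficient (median-based rule of thumb)
pearson_skew = 3 * (mean - median) / math.sqrt(m2)

# Moment-based skewness: third standardized moment
moment_skew = m3 / m2 ** 1.5

print(f"Pearson's second coefficient: {pearson_skew:.2f}")
print(f"Moment-based skewness:        {moment_skew:.2f}")
```

Both values are positive, so they agree on the direction of the skew even though the magnitudes differ.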

**Kurtosis**: Measures "tailedness" (how many outliers).
- **High kurtosis**: Heavy tails, more outliers (e.g., financial returns, extreme events)
- **Low kurtosis**: Light tails, fewer outliers, more uniform distribution
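- **Normal kurtosis**: The baseline. A normal distribution has kurtosis = 3, which is the same as "excess kurtosis" = 0 (excess kurtosis is kurtosis minus 3). `scipy.stats.kurtosis` reports excess kurtosis by default.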
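
Because kurtosis is reported under two conventions, it helps to see both side by side. This is a minimal sketch, assuming the population (divide-by-n) moments; the variable names are illustrative.

```python
# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

n = len(prices)
mean = sum(prices) / n

# Second and fourth central moments (population form)
m2 = sum((x - mean) ** 2 for x in prices) / n
m4 = sum((x - mean) ** 4 for x in prices) / n

# "Plain" (Pearson) kurtosis: a normal distribution scores 3
kurtosis = m4 / m2 ** 2

# Excess (Fisher) kurtosis: a normal distribution scores 0
excess_kurtosis = kurtosis - 3

print(f"Kurtosis:        {kurtosis:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
```

When comparing numbers across tools, check which convention each one uses before reading a value as "heavy-tailed" or "light-tailed".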

**Python Implementation**:

```python
import pandas as pd
import numpy as np
from scipy import stats

# Sample data (house prices in $1,000s)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]
df = pd.DataFrame({'price': prices})

# Calculate all statistics
print("Descriptive Statistics:")
print(f"Mean: {df['price'].mean():.2f}")
print(f"Median: {df['price'].median():.2f}")
# mode() returns every tied value; take the first
print(f"Mode: {df['price'].mode().values[0]}")
# pandas std()/var() default to ddof=1 (sample form, divide by n - 1),
# so they differ from a population (divide by n) hand calculation
print(f"Standard Deviation: {df['price'].std():.2f}")
print(f"Variance: {df['price'].var():.2f}")
# scipy's skew() is moment-based; kurtosis() returns excess kurtosis (normal = 0)
print(f"Skewness: {stats.skew(df['price']):.2f}")
print(f"Kurtosis: {stats.kurtosis(df['price']):.2f}")

# Using describe() for quick overview
print("\nQuick Summary:")
print(df['price'].describe())
```

**Output**:
```
Descriptive Statistics:
Mean: 412.50
Median: 325.00
Mode: 300
Standard Deviation: 254.60
Variance: 64821.43
Skewness: 1.73
Kurtosis: 1.75

Quick Summary:
count       8.000000
mean      412.500000
std       254.600527
min       200.000000
25%       287.500000
50%       325.000000
75%       425.000000
max      1000.000000
Name: price, dtype: float64
```


> **Figure 2.1**: Three histograms illustrating a normal distribution (skew=0), a positively skewed distribution (e.g., house prices), and a negatively skewed distribution (e.g., easy exam scores). Understanding skewness helps you choose appropriate transformations (e.g., log transform for positive skew).
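
As a rough illustration of the log transform mentioned in the caption, the sketch below compares the moment-based skewness of the sample prices before and after taking natural logs. The helper `moment_skew` is a hypothetical name introduced here, and eight data points are far too few to draw conclusions from; the point is only the direction of the change.

```python
import math

def moment_skew(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Sample house prices in $1,000s (same data as above)
prices = [200, 250, 300, 300, 350, 400, 500, 1000]

# Natural log compresses large values more than small ones
log_prices = [math.log(p) for p in prices]

raw_skew = moment_skew(prices)
log_skew = moment_skew(log_prices)

print(f"Skewness of raw prices: {raw_skew:.2f}")
print(f"Skewness of log prices: {log_skew:.2f}")
```

For these eight prices the log-transformed skewness is smaller than the raw one but still positive, so the transform reduces the right tail without removing it entirely.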