# Week 1: Descriptive Statistics & Distributions - Exercises

**Goal**: Build intuition for distributions, sampling, and descriptive statistics through hands-on practice.

**Milestone**: By the end, you should be able to explain variance vs. standard deviation vs. standard error without looking it up.

---

In [None]:
# Setup - Run this first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

np.random.seed(42)  # For reproducibility
plt.style.use('seaborn-v0_8-whitegrid')

---

## Part 1: Exploring Distributions

### Exercise 1.1: Generate and Visualize Different Distributions

Generate 10,000 random samples from each of the following distributions and visualize them with histograms:

1. **Normal distribution** with mean=50, std=10
2. **Uniform distribution** between 0 and 100
3. **Exponential distribution** with scale=20 (λ = 1/20)

Create a figure with 3 subplots (one for each distribution).
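Note that NumPy's `scale` argument for the exponential sampler is the distribution's mean, 1/λ, not the rate λ itself. The following is a minimal sketch to check this numerically. It assumes the newer `np.random.default_rng` generator API rather than the legacy `np.random.*` functions used in this notebook, and the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

scale = 20.0        # mean of the distribution
rate = 1.0 / scale  # lambda, the rate parameter

draws = rng.exponential(scale=scale, size=100_000)

# For an exponential distribution, the mean and the std both equal the scale (1/lambda).
print(f"rate (lambda): {rate:.3f}")
print(f"sample mean:   {draws.mean():.2f}")
print(f"sample std:    {draws.std():.2f}")
```

Both printed values should land close to 20, which is a quick way to confirm you passed the scale and not the rate.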

In [None]:
# YOUR CODE HERE
# Hints:
# - np.random.normal(loc, scale, size)
# - np.random.uniform(low, high, size)
# - np.random.exponential(scale, size)

# Generate samples
n_samples = 10000

normal_samples = None  # TODO
uniform_samples = None  # TODO
exponential_samples = None  # TODO

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# TODO: Create histograms for each distribution
# Use bins=50 for good resolution
# Add titles and labels

plt.tight_layout()
plt.show()

### Exercise 1.2: Describe the Shapes

In the markdown cell below, describe each distribution in your own words:
- What does the shape look like?
- Where is most of the data concentrated?
- Is it symmetric or skewed? If skewed, which direction?

**Your answers:**

1. **Normal distribution**: 

2. **Uniform distribution**: 

3. **Exponential distribution**: 

### Exercise 1.3: Effect of Parameters on Normal Distribution

Generate and plot normal distributions with:
- Same mean (50) but different standard deviations: 5, 10, 20
- Same standard deviation (10) but different means: 30, 50, 70

Overlay them on the same plots to see the differences.

In [None]:
# YOUR CODE HERE
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Same mean, different std
# TODO: Generate 3 normal distributions and plot their histograms
# Use alpha=0.5 for transparency so you can see overlapping areas
# Use density=True in hist() to normalize the y-axis

# Right plot: Same std, different means
# TODO: Generate 3 normal distributions and plot their histograms

plt.tight_layout()
plt.show()

**Question**: How does changing the standard deviation affect the shape? How does changing the mean affect it?

**Your answer**:

---

## Part 2: Calculating Descriptive Statistics Manually

### Exercise 2.1: Implement Mean, Variance, and Standard Deviation from Scratch

Without using NumPy's statistical functions such as `np.mean`, `np.var`, or `np.std` (basic arithmetic, `sum`, and `len` are fine), implement:

1. `calculate_mean(data)` - returns the arithmetic mean
2. `calculate_variance(data, ddof=0)` - returns variance (ddof=0 for population, ddof=1 for sample)
3. `calculate_std(data, ddof=0)` - returns standard deviation

**Formulas**:
- Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Variance: $\sigma^2 = \frac{1}{n - \text{ddof}} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard Deviation: $\sigma = \sqrt{\sigma^2}$
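To see what the `ddof` term changes, here is a small sketch using Python's standard-library `statistics` module as a reference (not the from-scratch implementation the exercise asks for). The five data values are made up for illustration.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative values, mean = 3

# Sum of squared deviations from the mean: 4 + 1 + 0 + 1 + 4 = 10
pop_var = statistics.pvariance(data)    # divides by n     (ddof=0): 10 / 5
sample_var = statistics.variance(data)  # divides by n - 1 (ddof=1): 10 / 4

print(f"population variance: {pop_var}")
print(f"sample variance:     {sample_var}")
print(f"population std:      {statistics.pstdev(data):.4f}")
print(f"sample std:          {statistics.stdev(data):.4f}")
```

The only difference between the two variances is the divisor, so the sample version is always the larger of the two.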

In [None]:
def calculate_mean(data):
    """Calculate the arithmetic mean."""
    # YOUR CODE HERE
    pass


def calculate_variance(data, ddof=0):
    """
    Calculate variance.
    ddof=0 for population variance
    ddof=1 for sample variance (Bessel's correction)
    """
    # YOUR CODE HERE
    pass


def calculate_std(data, ddof=0):
    """Calculate standard deviation."""
    # YOUR CODE HERE
    pass

In [None]:
# Test your implementations
test_data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print("Your implementations:")
print(f"  Mean: {calculate_mean(test_data)}")
print(f"  Population Variance (ddof=0): {calculate_variance(test_data, ddof=0)}")
print(f"  Sample Variance (ddof=1): {calculate_variance(test_data, ddof=1)}")
print(f"  Population Std (ddof=0): {calculate_std(test_data, ddof=0)}")
print(f"  Sample Std (ddof=1): {calculate_std(test_data, ddof=1)}")

print("\nNumpy's implementations (should match):")
print(f"  Mean: {np.mean(test_data)}")
print(f"  Population Variance (ddof=0): {np.var(test_data, ddof=0)}")
print(f"  Sample Variance (ddof=1): {np.var(test_data, ddof=1)}")
print(f"  Population Std (ddof=0): {np.std(test_data, ddof=0)}")
print(f"  Sample Std (ddof=1): {np.std(test_data, ddof=1)}")

### Exercise 2.2: Why Bessel's Correction (ddof=1)?

Run the simulation below and observe what happens. Then answer the question.

In [None]:
# Simulation: Compare population vs sample variance estimates
# True population: Normal(mean=100, std=15) -> true variance = 225

true_variance = 225
n_simulations = 10000
sample_size = 10

# Store estimates
population_var_estimates = []  # Using ddof=0
sample_var_estimates = []      # Using ddof=1

for _ in range(n_simulations):
    # Take a sample from the population
    sample = np.random.normal(100, 15, sample_size)
    
    # Calculate both estimates
    population_var_estimates.append(np.var(sample, ddof=0))
    sample_var_estimates.append(np.var(sample, ddof=1))

print(f"True population variance: {true_variance}")
print(f"\nAverage of ddof=0 estimates: {np.mean(population_var_estimates):.2f}")
print(f"Average of ddof=1 estimates: {np.mean(sample_var_estimates):.2f}")
print(f"\nBias of ddof=0: {np.mean(population_var_estimates) - true_variance:.2f}")
print(f"Bias of ddof=1: {np.mean(sample_var_estimates) - true_variance:.2f}")

**Question**: Based on the simulation, why do we use ddof=1 (dividing by n-1) for sample variance instead of ddof=0 (dividing by n)?

**Your answer**:

### Exercise 2.3: Median and Mode

Implement median and mode functions from scratch.
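As a reference for the two edge cases in this exercise (an even-length array and a tie for the mode), the sketch below uses the standard-library `statistics` module. It is only a way to check expected results, not a from-scratch solution, and the small lists are made up for illustration.

```python
import statistics

even_data = [1, 2, 3, 4]
tied_data = [3, 3, 1, 1, 2]

# Even length: the median is the average of the two middle values (2 and 3).
median_even = statistics.median(even_data)

# multimode returns every value that shares the highest count (here 3 and 1);
# taking the minimum applies the "smallest value wins" tie rule.
smallest_mode = min(statistics.multimode(tied_data))

print(f"median: {median_even}, mode: {smallest_mode}")
```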

In [None]:
def calculate_median(data):
    """
    Calculate the median.
    If even number of elements, return average of two middle values.
    """
    # YOUR CODE HERE
    # Hint: First sort the data
    pass


def calculate_mode(data):
    """
    Calculate the mode (most frequent value).
    If tie, return the smallest value.
    """
    # YOUR CODE HERE
    # Hint: You can use a dictionary to count frequencies
    pass


# Test
test_odd = np.array([1, 3, 3, 6, 7, 8, 9])
test_even = np.array([1, 2, 3, 4, 5, 6, 8, 9])

print("Odd-length array:", test_odd)
print(f"  Your median: {calculate_median(test_odd)}, Numpy: {np.median(test_odd)}")
print(f"  Your mode: {calculate_mode(test_odd)}, Scipy: {stats.mode(test_odd, keepdims=False).mode}")

print("\nEven-length array:", test_even)
print(f"  Your median: {calculate_median(test_even)}, Numpy: {np.median(test_even)}")

### Exercise 2.4: When to Use Mean vs. Median

Create a dataset representing household incomes (with a few very high outliers) and show why median is often preferred for skewed data.
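Before working with the full simulated dataset, a tiny hand-checkable example may help. The sketch below uses made-up income values (in thousands) and adds one extreme value to show how differently the mean and median react.

```python
import numpy as np

incomes = np.array([40, 50, 60, 70, 80], dtype=float)  # illustrative values
with_outlier = np.append(incomes, 2000.0)              # add one very high income

mean_before, median_before = incomes.mean(), np.median(incomes)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(f"before: mean={mean_before:.1f}, median={median_before:.1f}")
print(f"after:  mean={mean_after:.1f}, median={median_after:.1f}")
```

A single extreme value moves the mean by hundreds while the median shifts by only a few units.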

In [None]:
# Simulate household incomes (in thousands)
# Most households: 40-80k, a few very wealthy: 500k-2M

np.random.seed(42)
regular_incomes = np.random.normal(60, 15, 950)  # 950 regular households
regular_incomes = np.clip(regular_incomes, 20, 150)  # Keep reasonable bounds

wealthy_incomes = np.random.uniform(500, 2000, 50)  # 50 very wealthy households

all_incomes = np.concatenate([regular_incomes, wealthy_incomes])

# YOUR CODE HERE
# 1. Calculate mean and median of all_incomes
# 2. Create a histogram of the distribution
# 3. Add vertical lines showing mean and median
# 4. Answer: Which measure better represents a "typical" household?

mean_income = None  # TODO
median_income = None  # TODO

print(f"Mean income: ${mean_income:.1f}k")
print(f"Median income: ${median_income:.1f}k")

# Create visualization
# TODO: Histogram with vertical lines for mean and median

**Question**: Why does the mean differ so much from the median here? Which is a better measure of "typical" income?

**Your answer**:

---

## Part 3: Sampling from Distributions

### Exercise 3.1: Sampling Distribution of the Mean

This is one of the most important concepts in statistics!

**Setup**: Imagine a population of exam scores that follows a uniform distribution from 0 to 100.

**Task**: 
1. Take many samples of size n from this population
2. Calculate the mean of each sample
3. Plot the distribution of these sample means
4. Observe how the distribution changes with sample size

In [None]:
# Population: Uniform(0, 100)
# True population mean = 50

n_samples = 10000  # Number of samples to take
sample_sizes = [2, 5, 30, 100]  # Different sample sizes to try

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, sample_size in enumerate(sample_sizes):
    # YOUR CODE HERE
    # 1. Generate n_samples samples, each of size sample_size, from Uniform(0, 100)
    # 2. Calculate the mean of each sample
    # 3. Plot histogram of sample means
    # 4. Calculate and display the std of sample means (this is the standard error!)
    
    sample_means = []  # TODO: Fill this
    
    # Hint: Use a loop, or np.random.uniform with a 2D array
    
    ax = axes[idx]
    # TODO: Create histogram, add title with sample size and SE
    
plt.tight_layout()
plt.show()

**Questions to answer**:

1. What shape does the distribution of sample means take, even though the population is uniform?

2. What happens to the spread (standard error) as sample size increases?

3. The theoretical standard error is: SE = σ / √n, where σ is the population std. For Uniform(0,100), σ ≈ 28.87. Calculate the theoretical SE for n=30 and compare to your simulated result.
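The value of σ quoted in question 3 comes from the standard deviation of a continuous uniform distribution, (b − a) / √12. A minimal sketch of the calculation follows, using n = 100 as an illustrative sample size so that the n = 30 case is left for you.

```python
import math

low, high = 0.0, 100.0
sigma = (high - low) / math.sqrt(12)  # std of a continuous Uniform(low, high)

n = 100  # illustrative sample size, not the one asked for in the question
theoretical_se = sigma / math.sqrt(n)

print(f"sigma = {sigma:.2f}")
print(f"theoretical SE for n={n}: {theoretical_se:.3f}")
```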

**Your answers**:

### Exercise 3.2: The Central Limit Theorem in Action

The Central Limit Theorem states that the sampling distribution of the mean approaches a normal distribution as sample size increases, regardless of the shape of the population distribution, provided the population has finite variance.

**Demonstrate this with an extremely non-normal distribution: the exponential distribution.**
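One way to put a number on "approaching normal" is to track the skewness of the sample means, which should shrink toward 0 as n grows. The sketch below does this for a different skewed population (chi-square with 1 degree of freedom) so the exponential case stays as your exercise. The generator API, seed, and names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
n_draws = 20_000                # number of samples taken per sample size

skew_by_n = {}
for n in (2, 10, 50):
    # Each row is one sample of size n; the mean along axis 1 gives one mean per row.
    means = rng.chisquare(df=1, size=(n_draws, n)).mean(axis=1)
    skew_by_n[n] = stats.skew(means)
    print(f"n={n:>2}: skewness of sample means = {skew_by_n[n]:.2f}")
```

The printed skewness should fall steadily as n increases, even though the population itself is strongly right-skewed.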

In [None]:
# Population: Exponential(scale=10) - very skewed!

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# First row: Show the original exponential distribution
exponential_pop = np.random.exponential(scale=10, size=100000)
axes[0, 0].hist(exponential_pop, bins=50, density=True, alpha=0.7, color='steelblue')
axes[0, 0].set_title('Original Exponential Distribution\n(Very skewed!)', fontsize=12)
axes[0, 0].set_xlabel('Value')

# Remove other plots in first row
axes[0, 1].axis('off')
axes[0, 2].axis('off')

# Second row: Sampling distributions of the mean for different n
sample_sizes = [2, 10, 50]
n_samples = 10000

for idx, n in enumerate(sample_sizes):
    # YOUR CODE HERE
    # Generate sample means from exponential distribution
    sample_means = []  # TODO
    
    ax = axes[1, idx]
    # TODO: Plot histogram of sample means
    # TODO: Overlay a normal distribution curve for comparison
    # Hint: Use stats.norm.pdf() with mean and std of your sample_means

plt.tight_layout()
plt.show()

**Question**: At what sample size does the distribution of sample means start looking "normal enough"? Why does this matter for real-world statistics?

**Your answer**:

---

## Part 4: Variance vs. Standard Deviation vs. Standard Error

### Exercise 4.1: Define Each in Your Own Words

This is your Week 1 milestone! Fill in the definitions below without looking anything up.

**Complete these definitions in your own words:**

1. **Variance** measures: 
   - Formula: 
   - Units: 

2. **Standard Deviation** measures: 
   - Relationship to variance: 
   - Units: 

3. **Standard Error (of the mean)** measures: 
   - Formula: 
   - Key insight: It decreases as sample size increases because...

### Exercise 4.2: Calculate All Three

Given the following scenario, calculate variance, standard deviation, and standard error.

In [None]:
# Scenario: You measured the heights (in cm) of 25 people
heights = np.array([165, 170, 168, 175, 180, 162, 178, 172, 169, 174,
                    171, 167, 176, 173, 179, 164, 177, 170, 168, 175,
                    172, 166, 174, 171, 169])

# YOUR CODE HERE
# Calculate (using your own functions or numpy):

sample_mean = None  # TODO
sample_variance = None  # TODO (remember to use ddof=1 for sample!)
sample_std = None  # TODO
standard_error = None  # TODO (hint: std / sqrt(n))

print(f"Sample size (n): {len(heights)}")
print(f"Sample mean: {sample_mean:.2f} cm")
print(f"Sample variance: {sample_variance:.2f} cm²")
print(f"Sample standard deviation: {sample_std:.2f} cm")
print(f"Standard error of the mean: {standard_error:.2f} cm")

print(f"\nInterpretation:")
print(f"  - The average height is {sample_mean:.1f} cm")
print(f"  - Individual heights typically vary by about ±{sample_std:.1f} cm from the mean")
print(f"  - Our estimate of the population mean is probably within ±{2*standard_error:.1f} cm of the true mean (95% CI)")

### Exercise 4.3: How Standard Error Changes with Sample Size

Plot how standard error decreases as you increase sample size.

In [None]:
# Assume we know the population std is 15 cm
population_std = 15

sample_sizes = np.arange(5, 505, 5)  # 5 to 500

# YOUR CODE HERE
# Calculate SE for each sample size using SE = sigma / sqrt(n)
standard_errors = None  # TODO

# Create plot
plt.figure(figsize=(10, 6))
# TODO: Plot sample_sizes vs standard_errors
# Add horizontal reference lines at SE = 3, 2, 1
# Add vertical lines showing sample size needed to achieve each SE

plt.xlabel('Sample Size (n)')
plt.ylabel('Standard Error (cm)')
plt.title('How Standard Error Decreases with Sample Size')
plt.show()

# Calculate: What sample size do you need for SE = 1?
# SE = sigma / sqrt(n) => n = (sigma / SE)^2
n_for_se_1 = None  # TODO
print(f"Sample size needed for SE = 1: {n_for_se_1}")

**Question**: Why does reducing SE from 3 to 1.5 require 4x the sample size, not 2x?

**Your answer**:

---

## Bonus Challenges

### Bonus 1: Implement Skewness and Kurtosis

Skewness measures asymmetry; kurtosis measures "tailedness" (how heavy the tails are relative to a normal distribution).
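When you compare your results against SciPy, keep in mind that `stats.kurtosis` returns excess kurtosis by default (`fisher=True`), and passing `fisher=False` returns the raw fourth standardized moment instead. A short sketch of the difference, with an arbitrary seed and sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(size=50_000)

excess = stats.kurtosis(x)                 # fisher=True by default: about 0 for normal data
pearson = stats.kurtosis(x, fisher=False)  # no subtraction: about 3 for normal data

print(f"excess kurtosis:  {excess:.3f}")
print(f"Pearson kurtosis: {pearson:.3f}")
```

The two values differ by exactly 3, so make sure your own function uses the same convention as the reference you compare it to.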

In [None]:
def calculate_skewness(data):
    """
    Calculate skewness (Fisher's definition).
    Formula: E[(X - μ)³] / σ³
    Positive = right-skewed, Negative = left-skewed
    """
    # YOUR CODE HERE
    pass


def calculate_kurtosis(data):
    """
    Calculate excess kurtosis.
    Formula: E[(X - μ)⁴] / σ⁴ - 3
    (We subtract 3 so normal distribution has kurtosis = 0)
    """
    # YOUR CODE HERE
    pass


# Test on different distributions
normal_data = np.random.normal(0, 1, 10000)
exponential_data = np.random.exponential(1, 10000)
uniform_data = np.random.uniform(-1, 1, 10000)

print("Normal distribution:")
print(f"  Your skewness: {calculate_skewness(normal_data):.3f}, scipy: {stats.skew(normal_data):.3f}")
print(f"  Your kurtosis: {calculate_kurtosis(normal_data):.3f}, scipy: {stats.kurtosis(normal_data):.3f}")

print("\nExponential distribution (should be right-skewed):")
print(f"  Your skewness: {calculate_skewness(exponential_data):.3f}, scipy: {stats.skew(exponential_data):.3f}")

print("\nUniform distribution (should have negative excess kurtosis):")
print(f"  Your kurtosis: {calculate_kurtosis(uniform_data):.3f}, scipy: {stats.kurtosis(uniform_data):.3f}")

### Bonus 2: Create a Comprehensive Statistics Summary Function

Create a function that takes an array and returns a dictionary with all descriptive statistics.
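For the quartile and IQR entries, one possible building block is `np.percentile`, which uses linear interpolation between data points by default. The sketch below applies it to a small made-up array where the results are easy to verify by hand.

```python
import numpy as np

values = np.arange(1, 10)  # 1, 2, ..., 9 (illustrative data)

# 25th, 50th, and 75th percentiles in one call
q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

Other quartile conventions exist and can give slightly different values on small datasets, so treat this as one reasonable choice, not the only one.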

In [None]:
def describe_distribution(data):
    """
    Return comprehensive descriptive statistics for a dataset.
    
    Should include:
    - count, mean, median, mode
    - min, max, range
    - variance, std (sample)
    - standard error
    - skewness, kurtosis
    - quartiles (25th, 50th, 75th percentile)
    - IQR (interquartile range)
    """
    # YOUR CODE HERE
    pass


# Test it
test_data = np.random.exponential(10, 1000)
summary = describe_distribution(test_data)

print("Distribution Summary:")
print("=" * 40)
for key, value in summary.items():
    if isinstance(value, float):
        print(f"{key:20s}: {value:.4f}")
    else:
        print(f"{key:20s}: {value}")

---

## Week 1 Checkpoint

Before moving to Week 2, make sure you can:

- [ ] Generate samples from normal, uniform, and exponential distributions
- [ ] Calculate mean, median, mode, variance, and standard deviation by hand
- [ ] Explain why we use n-1 (Bessel's correction) for sample variance
- [ ] Explain when to use mean vs. median
- [ ] Explain the Central Limit Theorem in plain English
- [ ] **Explain the difference between variance, standard deviation, and standard error** (MILESTONE!)

---

**Great work completing Week 1! 🎉**