# 📊 Week 2: Probability & Statistics for ML

**Learning Objectives:**
1. Master probability fundamentals (distributions, expectation, variance)
2. Understand Bayes' theorem and its applications
3. Apply statistical concepts to ML problems
4. Visualize and interpret probability distributions

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from collections import Counter

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

---
# Section 1: Theory
---

## Why Probability in ML?

Machine Learning is fundamentally about **uncertainty**:
- Models make **probabilistic predictions** (not certainties)
- Training often amounts to **maximum likelihood estimation** (e.g., minimizing cross-entropy)
- Bayesian methods use **prior beliefs + data**

## Key Concepts

| Concept | Formula | Meaning |
|---------|---------|--------|
| Probability | $P(A) \in [0, 1]$ | How likely event A is |
| Expectation | $E[X] = \sum_x x \cdot P(x)$ | Average value (discrete $X$) |
| Variance | $\mathrm{Var}(X) = E[(X - \mu)^2]$ | Spread around the mean $\mu = E[X]$ |
| Bayes | $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$ | Update beliefs with evidence |
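
As a concrete check of the expectation and variance formulas in the table, here is a minimal sketch for a fair six-sided die. The `pmf` dictionary is a hypothetical example, not one of the notebook's utilities; exact fractions are used so the results can be compared with the theoretical values 3.5 and 35/12 ≈ 2.9167.

```python
from fractions import Fraction

# Hypothetical example: PMF of a fair six-sided die, P(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of x * P(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - mu)^2] = sum over x of (x - mu)^2 * P(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(f"E[X]   = {mean} = {float(mean):.4f}")   # 7/2 = 3.5000
print(f"Var(X) = {var} = {float(var):.4f}")     # 35/12 = 2.9167
```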

---
# Section 2: Hands-On Implementation
---

## 2.1 Basic Probability

In [None]:
def probability(events, target):
    """Calculate probability of target in events."""
    return sum(1 for e in events if e == target) / len(events)


def joint_probability(events_a, events_b, target_a, target_b):
    """Calculate P(A and B)."""
    count = sum(1 for a, b in zip(events_a, events_b) 
                if a == target_a and b == target_b)
    return count / len(events_a)


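def conditional_probability(events_a, events_b, target_a, given_b):
    """Calculate P(A | B) from paired observations."""
    # Keep only the pairs where the conditioning event B occurred
    filtered = [(a, b) for a, b in zip(events_a, events_b) if b == given_b]
    if len(filtered) == 0:
        # P(A | B) is undefined when B never occurs; 0 is returned by convention
        return 0
    return sum(1 for a, b in filtered if a == target_a) / len(filtered)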

In [None]:
# Example: Coin flips
coin_flips = [0, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # 0=Tails, 1=Heads

p_heads = probability(coin_flips, 1)
p_tails = probability(coin_flips, 0)

print(f"P(Heads) = {p_heads:.2f}")
print(f"P(Tails) = {p_tails:.2f}")
print(f"Sum = {p_heads + p_tails:.2f} (should be 1.00)")
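
The coin example only exercises `probability`. The sketch below illustrates the product rule $P(A \cap B) = P(A \mid B)\,P(B)$ on a small made-up dataset; the `rain` and `umbrella` lists are hypothetical, and the quantities are recomputed inline so the cell runs on its own.

```python
import math

# Hypothetical paired observations over 10 days (1 = yes, 0 = no).
rain     = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
umbrella = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

n = len(rain)
pairs = list(zip(umbrella, rain))

# P(B): fraction of days with rain
p_rain = sum(rain) / n
# P(A and B): fraction of days with both an umbrella and rain
p_both = sum(1 for u, r in pairs if u == 1 and r == 1) / n
# P(A | B): among rainy days only, fraction with an umbrella
rainy_days = [u for u, r in pairs if r == 1]
p_umbrella_given_rain = sum(rainy_days) / len(rainy_days)

print(f"P(rain)                = {p_rain:.2f}")                  # 0.40
print(f"P(umbrella and rain)   = {p_both:.2f}")                  # 0.30
print(f"P(umbrella | rain)     = {p_umbrella_given_rain:.2f}")   # 0.75
print("Product rule holds:", math.isclose(p_both, p_umbrella_given_rain * p_rain))
```

With the same lists, `joint_probability(umbrella, rain, 1, 1)` and `conditional_probability(umbrella, rain, 1, 1)` from the cell above should give the same 0.30 and 0.75.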

## 2.2 Expectation & Variance

In [None]:
def expectation(values):
    """Calculate expected value (mean)."""
    return sum(values) / len(values)


def variance(values):
    """Calculate population variance."""
    mu = expectation(values)
    return sum((x - mu)**2 for x in values) / len(values)


def std_dev(values):
    """Calculate standard deviation."""
    return variance(values) ** 0.5

In [None]:
# Example: Dice rolls
dice_rolls = [np.random.randint(1, 7) for _ in range(1000)]

print("Dice Roll Statistics (1000 rolls):")
print(f"  E[X] = {expectation(dice_rolls):.4f} (theoretical: 3.5)")
print(f"  Var(X) = {variance(dice_rolls):.4f} (theoretical: 2.9167)")
print(f"  Std(X) = {std_dev(dice_rolls):.4f}")

## 2.3 Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

**Real-World Application: Spam Classification**

In [None]:
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """
    Calculate P(A|B) using Bayes' theorem.
    
    Args:
        p_a: Prior probability P(A)
        p_b_given_a: Likelihood P(B|A)
        p_b_given_not_a: P(B|not A)
    
    Returns:
        Posterior probability P(A|B)
    """
    # P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    
    # Bayes' theorem
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b

In [None]:
# Example 1: Spam Detection
print("=" * 50)
print("SPAM DETECTION EXAMPLE")
print("=" * 50)

# Given:
p_spam = 0.3                    # 30% of emails are spam
p_word_given_spam = 0.8         # 80% of spam contains "free"
p_word_given_ham = 0.1          # 10% of ham contains "free"

# Question: Email contains "free" - what's P(Spam | "free")?
p_spam_given_word = bayes_theorem(p_spam, p_word_given_spam, p_word_given_ham)

print(f"Prior P(Spam) = {p_spam:.0%}")
print(f"P('free' | Spam) = {p_word_given_spam:.0%}")
print(f"P('free' | Ham) = {p_word_given_ham:.0%}")
print(f"\n→ P(Spam | 'free') = {p_spam_given_word:.1%}")
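
Worked by hand with the same numbers, where $P(\text{Ham}) = 1 - 0.3 = 0.7$, the posterior comes out as:

$$P(\text{Spam} \mid \text{'free'}) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.1 \times 0.7} = \frac{0.24}{0.31} \approx 0.774$$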

In [None]:
# Example 2: Medical Test (Classic!)
print("\n" + "=" * 50)
print("MEDICAL TEST EXAMPLE")
print("=" * 50)

# Given:
p_disease = 0.001               # 0.1% have the disease
p_positive_given_disease = 0.99 # 99% sensitivity (true positive rate)
p_positive_given_healthy = 0.05 # 5% false positive rate

# Question: Test is positive - what's P(Disease | Positive)?
p_disease_given_positive = bayes_theorem(
    p_disease, p_positive_given_disease, p_positive_given_healthy
)

print(f"Disease prevalence: {p_disease:.1%}")
print(f"Test sensitivity: {p_positive_given_disease:.0%}")
print(f"False positive rate: {p_positive_given_healthy:.0%}")
print(f"\n→ P(Disease | Positive Test) = {p_disease_given_positive:.1%}")
print("\n💡 Key Insight: Even with 99% sensitivity, a positive test means only a ~2% chance of disease!")
print("   The low prior (0.1% prevalence) dominates, which is why priors matter in ML!")
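
One way to make the ~2% result less surprising is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical cohort of 100,000 (the `population` value is an illustration, not part of the example above) and reuses the same prevalence, sensitivity, and false positive rate.

```python
# Hypothetical cohort size, chosen so that the counts are easy to read.
population = 100_000

p_disease = 0.001             # prevalence, as in the medical test example
sensitivity = 0.99            # P(positive | disease)
false_positive_rate = 0.05    # P(positive | healthy)

sick = population * p_disease                      # about 100 people
healthy = population - sick                        # about 99,900 people
true_positives = sick * sensitivity                # about 99 people
false_positives = healthy * false_positive_rate    # about 4,995 people

# Among everyone who tests positive, what fraction is actually sick?
ppv = true_positives / (true_positives + false_positives)

print(f"True positives:  {true_positives:.0f}")
print(f"False positives: {false_positives:.0f}")
print(f"P(Disease | Positive) = {ppv:.1%}")   # about 1.9%
```

The false positives outnumber the true positives by roughly 50 to 1, which is the whole effect of the low prior.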

## 2.4 Maximum Likelihood Estimation (MLE)

In [None]:
def mle_bernoulli(data):
    """MLE for Bernoulli distribution parameter p."""
    return sum(data) / len(data)


def mle_normal(data):
    """MLE for Normal distribution parameters (mean, variance)."""
    mu = sum(data) / len(data)
    var = sum((x - mu)**2 for x in data) / len(data)
    return mu, var

In [None]:
# Example: Estimate coin bias from flips
biased_coin = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # Biased towards heads

p_hat = mle_bernoulli(biased_coin)
print(f"MLE estimate of P(Heads) = {p_hat:.2f}")

# Example: Estimate normal distribution parameters
normal_data = np.random.normal(loc=5, scale=2, size=1000)
mu_hat, var_hat = mle_normal(normal_data)

print(f"\nMLE for Normal Distribution:")
print(f"  μ̂ = {mu_hat:.4f} (true: 5)")
print(f"  σ̂² = {var_hat:.4f} (true: 4)")
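
To see why the sample fraction is the maximum likelihood estimate, the log-likelihood can be evaluated directly. This is a minimal sketch, assuming a simple grid search over candidate values of $p$; `bernoulli_log_likelihood` and `grid` are hypothetical names introduced here, and the data repeats the `biased_coin` flips.

```python
import math

flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # same data as biased_coin


def bernoulli_log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) observations."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(p) + tails * math.log(1 - p)


# Candidate values 0.01, 0.02, ..., 0.99 (0 and 1 are excluded because log diverges).
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))

print(f"Grid-search maximizer: {best_p:.2f}")                    # 0.80
print(f"Closed-form MLE:       {sum(flips) / len(flips):.2f}")   # 0.80
```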

---
# Section 3: Visualizations
---

## 3.1 Common Probability Distributions

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Normal Distribution
x = np.linspace(-4, 4, 100)
for mu, sigma in [(0, 1), (0, 0.5), (0, 2)]:
    axes[0, 0].plot(x, stats.norm.pdf(x, mu, sigma), 
                    label=f'μ={mu}, σ={sigma}')
axes[0, 0].set_title('Normal Distribution')
axes[0, 0].legend()

# 2. Uniform Distribution
x = np.linspace(-1, 2, 100)
axes[0, 1].plot(x, stats.uniform.pdf(x, 0, 1), linewidth=2)
axes[0, 1].fill_between(x, stats.uniform.pdf(x, 0, 1), alpha=0.3)
axes[0, 1].set_title('Uniform Distribution [0, 1]')

# 3. Exponential Distribution
x = np.linspace(0, 5, 100)
for lam in [0.5, 1, 2]:
    axes[0, 2].plot(x, stats.expon.pdf(x, scale=1/lam), label=f'λ={lam}')
axes[0, 2].set_title('Exponential Distribution')
axes[0, 2].legend()

# 4. Binomial Distribution
n = 20
x = np.arange(0, n+1)
for p in [0.3, 0.5, 0.7]:
    axes[1, 0].bar(x + p*0.2, stats.binom.pmf(x, n, p), 
                   width=0.2, alpha=0.7, label=f'p={p}')
axes[1, 0].set_title(f'Binomial Distribution (n={n})')
axes[1, 0].legend()

# 5. Poisson Distribution
x = np.arange(0, 20)
for lam in [1, 4, 8]:
    axes[1, 1].bar(x + lam*0.1, stats.poisson.pmf(x, lam), 
                   width=0.3, alpha=0.7, label=f'λ={lam}')
axes[1, 1].set_title('Poisson Distribution')
axes[1, 1].legend()

# 6. Beta Distribution (Important for Bayesian!)
x = np.linspace(0, 1, 100)
for a, b in [(0.5, 0.5), (2, 2), (2, 5), (5, 2)]:
    axes[1, 2].plot(x, stats.beta.pdf(x, a, b), label=f'α={a}, β={b}')
axes[1, 2].set_title('Beta Distribution')
axes[1, 2].legend()

plt.tight_layout()
plt.show()

## 3.2 Central Limit Theorem Visualization

In [None]:
def visualize_clt(distribution, n_samples_list, n_experiments=1000):
    """Visualize Central Limit Theorem."""
    fig, axes = plt.subplots(1, len(n_samples_list), figsize=(15, 4))
    
    for ax, n_samples in zip(axes, n_samples_list):
        # Generate sample means
        sample_means = []
        for _ in range(n_experiments):
            sample = distribution(size=n_samples)
            sample_means.append(np.mean(sample))
        
        # Plot histogram
        ax.hist(sample_means, bins=30, density=True, alpha=0.7)
        
        # Overlay normal distribution
        mu, sigma = np.mean(sample_means), np.std(sample_means)
        x = np.linspace(min(sample_means), max(sample_means), 100)
        ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2)
        
        ax.set_title(f'n = {n_samples}')
        ax.set_xlabel('Sample Mean')
    
    plt.suptitle('Central Limit Theorem: Sample Means → Normal Distribution', 
                 fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()


# Demonstrate CLT with uniform distribution
print("CLT with Uniform Distribution:")
visualize_clt(np.random.uniform, [1, 5, 30, 100])
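
The histograms can be backed by a quick numeric check. For Uniform(0, 1) the standard deviation is $\sigma = \sqrt{1/12}$, so the spread of the sample means should be close to $\sigma / \sqrt{n}$. The sketch below uses only the standard library; the seed and `n_experiments` are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility

sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)
n_experiments = 2000

results = {}
for n in [1, 5, 30, 100]:
    # Each experiment draws n uniform values and records their mean
    means = [statistics.fmean(random.random() for _ in range(n))
             for _ in range(n_experiments)]
    observed = statistics.pstdev(means)
    predicted = sigma / math.sqrt(n)
    results[n] = (observed, predicted)
    print(f"n={n:>3}: observed std = {observed:.4f}, sigma/sqrt(n) = {predicted:.4f}")
```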

## 3.3 Bayes' Theorem Visual

In [None]:
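def visualize_bayes_update(prior, likelihood, n_steps=5):
    """Visualize Bayesian updating of a Beta prior on a coin's bias.

    Args:
        prior: (alpha, beta) parameters of the Beta prior
        likelihood: True P(Heads) used to simulate the coin flips
        n_steps: Number of updates, each with 10 new flips
    """
    x = np.linspace(0, 1, 100)
    
    fig, axes = plt.subplots(1, n_steps, figsize=(15, 3))
    
    # Start from the supplied Beta prior; (1, 1) is uniform
    current_alpha, current_beta = prior
    
    # Simulate coin flips with the supplied true bias
    np.random.seed(42)
    data = np.random.binomial(1, likelihood, n_steps * 10)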
    
    for i, ax in enumerate(axes):
        # Update with 10 new observations
        new_data = data[i*10:(i+1)*10]
        successes = sum(new_data)
        failures = len(new_data) - successes
        
        current_alpha += successes
        current_beta += failures
        
        # Plot posterior
        posterior = stats.beta.pdf(x, current_alpha, current_beta)
        ax.plot(x, posterior, 'b-', linewidth=2)
        ax.fill_between(x, posterior, alpha=0.3)
        ax.axvline(x=likelihood, color='r', linestyle='--', label='True p')
        ax.set_title(f'After {(i+1)*10} flips\nα={current_alpha}, β={current_beta}')
        ax.set_xlim(0, 1)
    
    plt.suptitle('Bayesian Updating: Estimating Coin Bias', fontsize=14, y=1.05)
    plt.tight_layout()
    plt.show()


visualize_bayes_update(prior=(1, 1), likelihood=0.7)
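
The update inside the loop is the conjugate Beta-Bernoulli rule: each head adds 1 to $\alpha$ and each tail adds 1 to $\beta$. The sketch below tracks the posterior mean $\alpha / (\alpha + \beta)$ and variance numerically instead of plotting. It uses the standard library's `random` with an arbitrary seed, so the simulated flips differ from those in the plot above.

```python
import random

random.seed(0)        # arbitrary seed for reproducibility
true_p = 0.7          # same true bias as in the visualization
alpha, beta = 1, 1    # uniform Beta(1, 1) prior

history = []
for batch in range(5):
    # Simulate 10 coin flips with P(Heads) = true_p
    flips = [1 if random.random() < true_p else 0 for _ in range(10)]
    heads = sum(flips)
    alpha += heads                # each head adds 1 to alpha
    beta += len(flips) - heads    # each tail adds 1 to beta
    post_mean = alpha / (alpha + beta)
    post_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    history.append((alpha, beta, post_mean, post_var))
    print(f"After {(batch + 1) * 10} flips: Beta({alpha}, {beta}), "
          f"mean = {post_mean:.3f}, std = {post_var ** 0.5:.3f}")
```

As more flips arrive, the posterior mean typically moves toward the true bias and the standard deviation narrows.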

---
# Section 4: Unit Tests
---

In [None]:
def run_tests():
    """Run all unit tests."""
    print("Running Unit Tests...\n")
    
    # Test 1: Probability
    assert probability([0, 1, 1, 1], 1) == 0.75
    print("✓ Probability test passed")
    
    # Test 2: Expectation
    assert expectation([1, 2, 3, 4, 5]) == 3.0
    print("✓ Expectation test passed")
    
    # Test 3: Variance
    assert variance([1, 1, 1, 1]) == 0.0
    print("✓ Variance (identical values) test passed")
    
    # Test 4: Standard deviation
    assert abs(std_dev([2, 4, 4, 4, 5, 5, 7, 9]) - 2.0) < 0.01
    print("✓ Standard deviation test passed")
    
    # Test 5: Bayes theorem
    result = bayes_theorem(0.5, 1.0, 0.0)  # If P(B|A)=1 and P(B|~A)=0
    assert abs(result - 1.0) < 1e-10
    print("✓ Bayes theorem test passed")
    
    # Test 6: MLE Bernoulli
    assert mle_bernoulli([1, 1, 1, 0, 0]) == 0.6
    print("✓ MLE Bernoulli test passed")
    
    # Test 7: MLE Normal
    mu, var = mle_normal([0, 0, 0])
    assert mu == 0 and var == 0
    print("✓ MLE Normal test passed")
    
    print("\n🎉 All tests passed!")


run_tests()

---
# Section 5: Interview Prep
---

## Key Questions

### Q1: What is Bayes' theorem and why is it important in ML?

**Answer:**
- Updates prior beliefs with new evidence
- Foundation of Naive Bayes classifier
- Essential for probabilistic models and uncertainty quantification
- Formula: $P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}$

### Q2: Explain the difference between MLE and MAP.

**Answer:**
- **MLE**: Maximize $P(Data|\theta)$ - find parameters that make data most likely
- **MAP**: Maximize $P(\theta|Data) \propto P(Data|\theta) \cdot P(\theta)$ - includes prior
- MAP with uniform prior equals MLE
- MAP can prevent overfitting (regularization)
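
A small numeric illustration of these points, assuming a coin-flip (Bernoulli) model with a Beta prior. The function names are hypothetical, and the MAP formula is the standard mode of the Beta posterior.

```python
def mle_bernoulli_p(heads, n):
    """MLE of P(Heads): the observed fraction of heads."""
    return heads / n


def map_bernoulli_p(heads, n, alpha, beta):
    """MAP estimate of P(Heads) under a Beta(alpha, beta) prior.

    This is the mode of the Beta(alpha + heads, beta + n - heads) posterior,
    which requires both posterior parameters to be >= 1 and not both equal to 1.
    """
    return (heads + alpha - 1) / (n + alpha + beta - 2)


heads, n = 3, 3  # three flips, all heads

mle = mle_bernoulli_p(heads, n)                      # 1.0: "the coin never lands tails"
map_uniform = map_bernoulli_p(heads, n, 1, 1)        # 1.0: uniform prior reproduces the MLE
map_informative = map_bernoulli_p(heads, n, 2, 2)    # 0.8: prior pulls the estimate toward 0.5

print(f"MLE                = {mle:.2f}")
print(f"MAP, Beta(1, 1)    = {map_uniform:.2f}")
print(f"MAP, Beta(2, 2)    = {map_informative:.2f}")
```

With only three flips the MLE commits to $p = 1$, while the Beta(2, 2) prior acts like a regularizer and keeps the estimate away from the extreme.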

### Q3: What is the Central Limit Theorem?

**Answer:**
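- Sample means of i.i.d. draws from any distribution with finite variance approach a normal distribution
- The approximation improves as the sample size $n$ grows, and the spread of the sample mean shrinks like $\sigma / \sqrt{n}$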
- Justifies using normal distribution in many ML contexts
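
Stated a little more formally (a standard textbook form, assuming i.i.d. samples $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$):

$$\sqrt{n}\,\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Equivalently, for large $n$ the sample mean is approximately $\mathcal{N}(\mu, \sigma^2 / n)$. Heavy-tailed distributions without finite variance, such as the Cauchy distribution, are not covered by this statement.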

### Q4: How do you handle low-confidence predictions?

**Answer:**
- Set confidence thresholds (e.g., abstain if P < 0.7)
- Use calibration to make probabilities meaningful
- Apply ensemble methods for uncertainty estimation
- Consider Bayesian approaches for full posterior
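
The thresholding idea in the first bullet can be sketched as follows. `predict_or_abstain` is a hypothetical helper, the class probabilities are made up, and the 0.7 threshold mirrors the example value above; in practice the threshold would be tuned on calibrated probabilities.

```python
def predict_or_abstain(class_probs, threshold=0.7):
    """Return the most probable label, or None if its probability is below threshold."""
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None


# Hypothetical model outputs for three emails
examples = [
    {"spam": 0.95, "ham": 0.05},   # confident: spam
    {"spam": 0.55, "ham": 0.45},   # too close to call: abstain
    {"spam": 0.20, "ham": 0.80},   # confident: ham
]

decisions = [predict_or_abstain(p) for p in examples]
print(decisions)  # ['spam', None, 'ham']
```

Abstained cases (`None`) can then be routed to a fallback such as human review.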

---
# Section 6: Exercises
---

In [None]:
# Exercise 1: Implement Naive Bayes Classifier
def naive_bayes_train(X, y):
    """
    Train a simple Naive Bayes classifier.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        y: Labels (n_samples,)
    
    Returns:
        Dictionary with class priors and feature likelihoods
    """
    # TODO: Your implementation here
    pass


# Exercise 2: Implement Log-Likelihood
def log_likelihood(data, mu, sigma):
    """
    Calculate log-likelihood of data under normal distribution.
    
    Args:
        data: Observed values
        mu: Mean of normal distribution
        sigma: Standard deviation
    
    Returns:
        Log-likelihood value
    """
    # TODO: Your implementation here
    pass


# Exercise 3: Implement Monte Carlo Estimation
def monte_carlo_pi(n_samples=10000):
    """
    Estimate π using Monte Carlo simulation.
    
    Hint: Sample uniform points in [0,1]x[0,1]
          The fraction with x^2 + y^2 <= 1 (inside the quarter unit circle)
          approximates π/4
    
    Returns:
        Estimated value of π
    """
    # TODO: Your implementation here
    pass

---
# Section 7: Deliverable
---

## What You Built This Week:

1. **`probability_utils.py`** - Probability calculation functions
2. **Bayes' Theorem Calculator** - With real-world examples
3. **Distribution Visualizations** - Understanding common distributions
4. **MLE Estimators** - Parameter estimation from data

## Key Takeaways:

- ML is fundamentally about modeling uncertainty
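- Bayes' theorem: $\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$ (prior beliefs updated by evidence)
- MLE finds parameters that maximize data likelihood
- CLT justifies normal approximations for averages of many independent observations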

## Next Week: ML Core (Week 3)
- Multi-Layer Perceptrons
- Backpropagation from scratch
- Gradient-based optimization