# 📊 Week 2: Probability & Statistics for AI

This notebook covers the essential probability and statistics concepts needed for AI/ML.

## Table of Contents
1. [Probability Fundamentals](#1-probability-fundamentals)
2. [Probability Distributions](#2-probability-distributions)
3. [Bayes' Theorem](#3-bayes-theorem)
4. [Statistical Concepts](#4-statistical-concepts)
5. [Maximum Likelihood Estimation](#5-maximum-likelihood-estimation)
6. [Sampling Methods](#6-sampling-methods)
7. [Applications in ML](#7-applications-in-ml)

---

## 1. Probability Fundamentals

### 1.1 Basic Definitions

**Probability** assigns each event a number between 0 and 1 that measures how likely the event is to occur.

| Concept | Definition | Example |
|---------|------------|---------|
| **Sample Space (Ω)** | Set of all possible outcomes | Coin flip: {H, T} |
| **Event (E)** | Subset of sample space | Getting heads |
| **P(E)** | Probability of event E | P(H) = 0.5 |

### 1.2 Probability Axioms (Kolmogorov)

1. **Non-negativity**: $P(E) \geq 0$ for all events E
2. **Normalization**: $P(\Omega) = 1$
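3. **Additivity**: $P(A \cup B) = P(A) + P(B)$ if A and B are disjoint (more generally, $P(\bigcup_i A_i) = \sum_i P(A_i)$ for any countable collection of pairwise disjoint events)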

### 1.3 Key Formulas

| Formula | Description |
|---------|-------------|
| $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Union (OR) |
| $P(A \cap B) = P(A) \cdot P(B \mid A)$ | Intersection (AND) |
| $P(A') = 1 - P(A)$ | Complement (NOT) |
| $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ | Conditional probability |
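
The conditional, product, and complement formulas in the table can be checked by direct counting. The sketch below is one way to do that with exact fractions on a standard 52-card deck; the helper `prob` and the variable names are illustrative choices, not part of the notebook's own code.

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

def prob(event, space):
    """P(event) under equally likely outcomes: |event| / |space|."""
    return Fraction(len(event), len(space))

hearts = {card for card in deck if card[1] == "hearts"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}

p_face_and_heart = prob(hearts & faces, deck)
p_heart = prob(hearts, deck)

# Conditional probability: P(Face | Heart) = P(Face ∩ Heart) / P(Heart)
p_face_given_heart = p_face_and_heart / p_heart
print(f"P(Face | Heart) = {p_face_given_heart}")  # 3/13

# Product rule: P(Heart ∩ Face) = P(Heart) * P(Face | Heart)
assert p_heart * p_face_given_heart == p_face_and_heart

# Complement rule: P(not Heart) = 1 - P(Heart)
not_hearts = set(deck) - hearts
assert prob(not_hearts, deck) == 1 - p_heart
```

Using `Fraction` keeps the arithmetic exact, so the identities hold with `==` rather than within a floating-point tolerance.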

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Calculating probabilities
def calculate_probability(favorable, total):
    """Calculate probability of an event."""
    return favorable / total

# Example: Card probability
# P(drawing a heart) = 13/52 = 0.25
p_heart = calculate_probability(13, 52)
print(f"P(Heart) = {p_heart:.4f}")

# P(drawing a face card) = 12/52
p_face = calculate_probability(12, 52)
print(f"P(Face Card) = {p_face:.4f}")

# P(Heart AND Face) = 3/52
p_heart_and_face = calculate_probability(3, 52)
print(f"P(Heart AND Face) = {p_heart_and_face:.4f}")

# P(Heart OR Face) = P(Heart) + P(Face) - P(Heart AND Face)
p_heart_or_face = p_heart + p_face - p_heart_and_face
print(f"P(Heart OR Face) = {p_heart_or_face:.4f}")

---

## 2. Probability Distributions

### 2.1 Discrete Distributions

#### Bernoulli Distribution
Single trial with probability p of success.

$$P(X = k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$
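
As a quick sanity check of the formula, the sketch below evaluates the Bernoulli PMF directly and compares it with the success rate of simulated trials. The value `p_success = 0.3`, the seed, and the function name `bernoulli_pmf` are assumptions made for illustration.

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) = p**k * (1 - p)**(1 - k) for k in {0, 1}."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p**k * (1 - p) ** (1 - k)

p_success = 0.3  # illustrative value, not from the notebook
rng = random.Random(0)
n_trials = 100_000

# Simulate Bernoulli trials: 1 with probability p_success, else 0.
samples = [1 if rng.random() < p_success else 0 for _ in range(n_trials)]
empirical = sum(samples) / n_trials

print(f"P(X=1) = {bernoulli_pmf(1, p_success):.2f}")
print(f"P(X=0) = {bernoulli_pmf(0, p_success):.2f}")
print(f"Empirical success rate: {empirical:.3f}")
```

Plugging in k = 1 leaves only the factor p, and k = 0 leaves only 1 - p, which is all the compact formula encodes.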

#### Binomial Distribution
Number of successes in n independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

- Mean: $\mu = np$
- Variance: $\sigma^2 = np(1-p)$

In [None]:
# Binomial Distribution Example
n, p = 10, 0.5  # 10 coin flips, fair coin

# Create distribution
x = np.arange(0, n+1)
binomial_pmf = stats.binom.pmf(x, n, p)

# Plot
plt.figure(figsize=(10, 5))
plt.bar(x, binomial_pmf, color='steelblue', alpha=0.7)
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(x)
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"Mean: {n*p}, Variance: {n*p*(1-p)}")
print(f"P(X = 5) = {stats.binom.pmf(5, n, p):.4f}")
print(f"P(X >= 7) = {1 - stats.binom.cdf(6, n, p):.4f}")

### 2.2 Continuous Distributions

#### Normal (Gaussian) Distribution
One of the most widely used distributions in statistics and ML.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

**Properties:**
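- Symmetric around the mean μ
- 68-95-99.7 rule: about 68%, 95%, and 99.7% of the probability mass lies within 1, 2, and 3 standard deviations of the mean
- Central Limit Theorem: the standardized sum of many independent, identically distributed random variables with finite variance is approximately Normal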
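
The Central Limit Theorem and the 68-95-99.7 rule can be seen together in a small simulation. The sketch below sums Uniform(0, 1) draws, standardizes each sum, and compares the fraction falling within 1, 2, and 3 standard deviations against the exact Normal values. The sizes `n_terms` and `n_sums` and the seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)
n_terms = 30        # how many uniform draws go into each sum
n_sums = 20_000     # how many sums we simulate

# Uniform(0, 1) has mean 1/2 and variance 1/12.
mean_sum = n_terms * 0.5
std_sum = math.sqrt(n_terms / 12)

# Standardize each sum: z = (S - E[S]) / sd(S)
z_scores = [
    (sum(rng.random() for _ in range(n_terms)) - mean_sum) / std_sum
    for _ in range(n_sums)
]

results = {}
for k in (1, 2, 3):
    within = sum(abs(z) <= k for z in z_scores) / n_sums
    # Exact Normal value: P(|Z| <= k) = erf(k / sqrt(2))
    exact = math.erf(k / math.sqrt(2))
    results[k] = (within, exact)
    print(f"within {k} sd: simulated {within:.3f}, Normal {exact:.3f}")
```

Even though each individual draw is uniform rather than bell-shaped, the standardized sums should land close to the Normal percentages.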

In [None]:
# Normal Distribution
mu, sigma = 0, 1  # Standard normal

x = np.linspace(-4, 4, 1000)
normal_pdf = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(12, 5))

# Plot PDF
plt.subplot(1, 2, 1)
plt.plot(x, normal_pdf, 'b-', linewidth=2)
plt.fill_between(x, normal_pdf, alpha=0.3)

# Shade 1 standard deviation
x_fill = np.linspace(-1, 1, 100)
plt.fill_between(x_fill, stats.norm.pdf(x_fill), color='green', alpha=0.3, label='68%')

plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Standard Normal Distribution')
plt.legend()
plt.grid(alpha=0.3)

# Plot different normal distributions
plt.subplot(1, 2, 2)
for m, s in [(0, 1), (0, 2), (2, 1)]:  # separate names so mu, sigma above are not overwritten
    plt.plot(x, stats.norm.pdf(x, m, s), label=f'μ={m}, σ={s}')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title('Different Normal Distributions')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Probability calculations
print(f"P(X < 0) = {stats.norm.cdf(0):.4f}")
print(f"P(-1 < X < 1) = {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")
print(f"P(X > 1.96) = {1 - stats.norm.cdf(1.96):.4f}")

---

## 3. Bayes' Theorem

### 3.1 The Formula

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Where:
- $P(A|B)$ = **Posterior** (probability of A given B)
- $P(B|A)$ = **Likelihood** (probability of B given A)
- $P(A)$ = **Prior** (initial probability of A)
- $P(B)$ = **Evidence** (probability of B)

### 3.2 Why Bayes Matters in ML

- **Naive Bayes Classifier**: Fast, effective for text classification
- **Bayesian Inference**: Update beliefs with new data
- **Probabilistic Models**: Quantify uncertainty
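
To make "update beliefs with new data" concrete, here is a minimal sketch of sequential Bayesian updating, where each posterior becomes the prior for the next observation. The scenario (a coin that is either fair or biased towards heads with probability 0.8), the 50/50 starting prior, the flip sequence, and the helper `bayes_update` are all made up for illustration.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | data) from P(H), P(data | H), and P(data | not H)."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

# Hypothesis H: the coin is biased with P(heads) = 0.8 (otherwise it is fair).
p_heads_biased, p_heads_fair = 0.8, 0.5
belief = 0.5  # prior P(H) before seeing any flips (assumed)

flips = "HHTHHHHT"  # made-up observations
for flip in flips:
    if flip == "H":
        belief = bayes_update(belief, p_heads_biased, p_heads_fair)
    else:
        belief = bayes_update(belief, 1 - p_heads_biased, 1 - p_heads_fair)
    print(f"after {flip}: P(biased) = {belief:.3f}")
```

Each head nudges the belief towards "biased" and each tail pulls it back, and the final value does not depend on the order in which the flips arrive.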

In [None]:
# Bayes' Theorem Example: Medical Testing
# Disease prevalence: 1% of population has the disease
# Test sensitivity: 99% (true positive rate)
# Test specificity: 95% (true negative rate)

# Question: If you test positive, what's the probability you have the disease?

p_disease = 0.01       # P(D) - Prior
p_pos_given_disease = 0.99  # P(+|D) - Sensitivity
p_neg_given_healthy = 0.95  # P(-|¬D) - Specificity
p_pos_given_healthy = 1 - p_neg_given_healthy  # P(+|¬D) - False positive rate

# Calculate P(+) using law of total probability
p_positive = (p_pos_given_disease * p_disease + 
              p_pos_given_healthy * (1 - p_disease))

# Apply Bayes' Theorem
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive

print("Medical Test Bayes Analysis")
print("=" * 40)
print(f"Prior P(Disease) = {p_disease:.2%}")
print(f"P(Positive) = {p_positive:.4f}")
print(f"\n⚠️ P(Disease | Positive) = {p_disease_given_pos:.2%}")
print(f"\nEven after a positive test, there is only a {p_disease_given_pos:.1%}")
print("chance you actually have the disease, because the disease is rare (low prior).")
print("Ignoring this base rate is known as the 'base rate fallacy'.")

In [None]:
# Naive Bayes Classifier from Scratch
class NaiveBayesClassifier:
    """
    Simple Naive Bayes classifier for text.
    
    Assumes features are conditionally independent given the class.
    P(C|X) ∝ P(C) * Π P(xi|C)
    """
    
    def __init__(self):
        self.class_priors = {}  # P(C)
        self.word_probs = {}    # P(word|C)
        self.vocab = set()
    
    def fit(self, documents, labels):
        """Train the classifier."""
        # Count classes
        class_counts = {}
        word_counts = {}  # word_counts[class][word] = count
        
        for doc, label in zip(documents, labels):
            class_counts[label] = class_counts.get(label, 0) + 1
            
            if label not in word_counts:
                word_counts[label] = {}
            
            words = doc.lower().split()
            self.vocab.update(words)
            
            for word in words:
                word_counts[label][word] = word_counts[label].get(word, 0) + 1
        
        # Calculate priors
        total = sum(class_counts.values())
        for label, count in class_counts.items():
            self.class_priors[label] = count / total
        
        # Calculate word probabilities with Laplace smoothing
        for label in class_counts:
            self.word_probs[label] = {}
            total_words = sum(word_counts[label].values())
            
            for word in self.vocab:
                count = word_counts[label].get(word, 0)
                # Laplace smoothing: (count + 1) / (total + vocab_size)
                self.word_probs[label][word] = (count + 1) / (total_words + len(self.vocab))
    
    def predict(self, document):
        """Predict class for a document."""
        words = document.lower().split()
        
        scores = {}
        for label in self.class_priors:
            # Start with log prior
            score = np.log(self.class_priors[label])
            
            # Add log likelihoods
            for word in words:
                if word in self.vocab:
                    score += np.log(self.word_probs[label][word])
            
            scores[label] = score
        
        return max(scores, key=scores.get)

# Example: Sentiment classification
documents = [
    "I love this movie it is great",
    "Amazing film wonderful acting",
    "This movie is terrible waste of time",
    "Horrible film bad acting",
    "Great movie loved every minute",
    "Awful terrible waste"
]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

# Train
nb = NaiveBayesClassifier()
nb.fit(documents, labels)

# Test
test_docs = [
    "I love this amazing film",
    "terrible movie really bad",
    "great acting wonderful story"
]

print("Naive Bayes Predictions:")
for doc in test_docs:
    pred = nb.predict(doc)
    print(f"  '{doc}' → {pred}")

---

## 4. Statistical Concepts

### 4.1 Expectation and Variance

**Expectation (Mean)**:
$$E[X] = \sum_x x \cdot P(X = x) \quad \text{or} \quad \int_{-\infty}^{\infty} x \cdot f(x) dx$$

**Variance**:
$$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

**Standard Deviation**: $\sigma = \sqrt{Var(X)}$
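
A worked example helps confirm that the two variance formulas agree. The sketch below computes the expectation and variance of a fair six-sided die with exact fractions; the die is an illustrative choice, not an example from the notebook.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                          # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())              # E[X^2]
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
var_shortcut = second_moment - mean**2                             # E[X^2] - (E[X])^2

# Both forms of the variance give the same value.
assert var_definition == var_shortcut

std_dev = float(var_shortcut) ** 0.5
print(f"E[X] = {mean}, Var(X) = {var_shortcut}, sd = {std_dev:.4f}")
```

Here $E[X] = 7/2$ and $Var(X) = 91/6 - 49/4 = 35/12$, so the shortcut form saves a pass over the centered values.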

### 4.2 Covariance and Correlation

**Covariance**: How two variables vary together
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

**Correlation**: Normalized covariance (-1 to 1)
$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
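
The two definitions translate almost line for line into code. The sketch below uses a small made-up dataset and plain Python (population formulas, dividing by n) to show that rescaling a variable changes the covariance but leaves the correlation untouched. The function names are illustrative.

```python
import math

# Small made-up dataset: hours studied vs. exam score.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 55.0, 61.0, 70.0, 72.0]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance: average of (a - mean_a) * (b - mean_b)."""
    mean_a, mean_b = mean(a), mean(b)
    return mean([(ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)])

def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))

print(f"Cov(X, Y)   = {covariance(xs, ys):.2f}")
print(f"rho(X, Y)   = {correlation(xs, ys):.4f}")

# Rescaling Y by 10 multiplies the covariance by 10 but not the correlation.
ys_scaled = [10 * y for y in ys]
print(f"Cov(X, 10Y) = {covariance(xs, ys_scaled):.2f}")
print(f"rho(X, 10Y) = {correlation(xs, ys_scaled):.4f}")
```

This scale invariance is why correlation, rather than covariance, is the usual choice when comparing relationships between features measured in different units.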

In [None]:
# Correlation visualization
np.random.seed(42)

# Generate correlated data
n = 200
correlations = [-0.9, 0, 0.9]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, target_corr in zip(axes, correlations):
    # Generate correlated data
    x = np.random.randn(n)
    noise = np.random.randn(n)
    y = target_corr * x + np.sqrt(1 - target_corr**2) * noise
    
    actual_corr = np.corrcoef(x, y)[0, 1]
    
    ax.scatter(x, y, alpha=0.5)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title(f'Correlation = {actual_corr:.2f}')
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

---

## 5. Maximum Likelihood Estimation

### 5.1 The Concept

Find parameters θ that maximize the probability of observing the data:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(X|\theta) = \arg\max_\theta \mathcal{L}(\theta)$$

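In practice, we maximize the log-likelihood instead. The logarithm is strictly increasing, so the maximizer is unchanged, and it turns products over independent observations into sums that are easier to differentiate and numerically more stable: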
$$\hat{\theta}_{MLE} = \arg\max_\theta \log \mathcal{L}(\theta)$$
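
Before the Normal case, a one-parameter example may make the arg max tangible. The sketch below runs a brute-force grid search over the log-likelihood of a Bernoulli parameter and compares the result with the closed-form answer (the sample mean). The coin-flip data and the function name `log_likelihood_bernoulli` are made up for illustration.

```python
import math

# Made-up data: 1 = heads, 0 = tails (7 heads in 10 flips).
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def log_likelihood_bernoulli(p, data):
    """Sum of log P(x | p) over independent Bernoulli observations."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Grid search over candidate values of p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: log_likelihood_bernoulli(p, flips))

# Closed-form MLE for a Bernoulli parameter: the sample mean.
p_hat_closed = sum(flips) / len(flips)

print(f"grid-search MLE: {p_hat_grid:.3f}")
print(f"closed-form MLE: {p_hat_closed:.3f}")
```

Grid search is only practical for one or two parameters; it is used here purely to show that the maximizer of the log-likelihood matches the formula obtained by setting the derivative to zero.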

### 5.2 Example: Estimating Mean of Normal Distribution

In [None]:
# MLE for Normal Distribution
np.random.seed(42)

# Generate data from a Normal with mean 5 and standard deviation 2
true_mu, true_sigma = 5, 2
data = np.random.normal(true_mu, true_sigma, 100)

# Log-likelihood function
def log_likelihood(mu, sigma, data):
    """Log-likelihood of normal distribution."""
    n = len(data)
    ll = -n/2 * np.log(2*np.pi) - n*np.log(sigma) - np.sum((data - mu)**2) / (2*sigma**2)
    return ll

# MLE estimates (closed form)
mle_mu = np.mean(data)
mle_sigma = np.std(data, ddof=0)  # MLE uses n, not n-1

print("Maximum Likelihood Estimation")
print("=" * 40)
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}")
print(f"MLE estimates:   μ = {mle_mu:.3f}, σ = {mle_sigma:.3f}")
print(f"Log-likelihood: {log_likelihood(mle_mu, mle_sigma, data):.2f}")

# Visualize the log-likelihood as a function of μ (σ fixed at its MLE)
mu_range = np.linspace(3, 7, 100)
lls = [log_likelihood(mu, mle_sigma, data) for mu in mu_range]

plt.figure(figsize=(10, 5))
plt.plot(mu_range, lls, 'b-', linewidth=2)
plt.axvline(mle_mu, color='r', linestyle='--', label=f'MLE μ = {mle_mu:.3f}')
plt.axvline(true_mu, color='g', linestyle=':', label=f'True μ = {true_mu}')
plt.xlabel('μ')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood Function')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

---

## 6. Sampling Methods

### 6.1 Monte Carlo Sampling

Use random sampling to estimate quantities that are difficult to compute analytically.
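
The same idea works for any probability: draw samples, count how often the event happens, and divide. The sketch below applies it to a case where the exact answer is known (the sum of two dice being at least 10) so the estimate can be checked. The helper names, the seed, and the sample sizes are illustrative assumptions.

```python
import random

def estimate_probability(event, sampler, n_samples, rng):
    """Monte Carlo estimate of P(event): fraction of samples where event holds."""
    hits = sum(event(sampler(rng)) for _ in range(n_samples))
    return hits / n_samples

def roll_two_dice(rng):
    """Draw one outcome: a pair of independent fair dice."""
    return rng.randint(1, 6), rng.randint(1, 6)

def sum_at_least_10(roll):
    return sum(roll) >= 10

rng = random.Random(0)
exact = 6 / 36  # outcomes (4,6), (5,5), (5,6), (6,4), (6,5), (6,6)

for n in (100, 10_000, 200_000):
    estimate = estimate_probability(sum_at_least_10, roll_two_dice, n, rng)
    print(f"n = {n:>7}: estimate = {estimate:.4f}, exact = {exact:.4f}")
```

The typical error of such an estimate shrinks roughly like $1/\sqrt{n}$, so gaining one more decimal digit of accuracy costs about 100 times more samples.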

In [None]:
# Monte Carlo estimation of π
def estimate_pi(n_samples):
    """Estimate π using Monte Carlo sampling."""
    # Generate random points in [0,1] x [0,1]
    x = np.random.uniform(0, 1, n_samples)
    y = np.random.uniform(0, 1, n_samples)
    
    # Check if inside quarter circle
    inside = (x**2 + y**2) <= 1
    
    # π/4 = area of quarter circle / area of square
    pi_estimate = 4 * np.sum(inside) / n_samples
    return pi_estimate

# Estimate with different sample sizes
sample_sizes = [100, 1000, 10000, 100000]

print("Monte Carlo Estimation of π")
print("=" * 40)
print(f"True π = {np.pi:.6f}")
print()

for n in sample_sizes:
    estimate = estimate_pi(n)
    error = abs(estimate - np.pi)
    print(f"n = {n:>6}: π ≈ {estimate:.6f}, error = {error:.6f}")

---

## 7. Applications in ML

### 7.1 Cross-Entropy Loss

Used for classification, derived from information theory:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

For binary classification:
$$L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
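
To connect the general formula to the binary one, the sketch below evaluates $H(p, q)$ for a one-hot target over three classes and then checks that the two-class case reduces to the binary loss. The probability vectors and the function name `cross_entropy` are made-up illustrations.

```python
import math

def cross_entropy(p, q, eps=1e-15):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# One-hot target over three classes (true class is index 1).
target = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]
uncertain = [0.30, 0.40, 0.30]

print(f"confident prediction: {cross_entropy(target, confident):.4f}")
print(f"uncertain prediction: {cross_entropy(target, uncertain):.4f}")

# With two classes, H(p, q) reduces to the binary formula.
y, y_hat = 1, 0.8
binary = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
assert math.isclose(cross_entropy([1 - y, y], [1 - y_hat, y_hat]), binary)
```

With a one-hot target only the true class contributes, so the loss is simply the negative log of the probability assigned to the correct class.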

### 7.2 Regularization as Prior

- **L2 Regularization** = zero-mean Gaussian prior on weights (under MAP estimation)
- **L1 Regularization** = zero-mean Laplace prior on weights (under MAP estimation)
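
The L2 case can be checked numerically: the negative log of an independent $N(0, \sigma^2)$ prior equals $\lambda \sum_i w_i^2$ plus a constant when $\lambda = 1/(2\sigma^2)$. The sketch below compares differences between two weight vectors so the constant cancels. The weight values, `sigma = 2.0`, and the function names are assumptions chosen for illustration.

```python
import math

def neg_log_gaussian_prior(weights, sigma):
    """-log of a product of independent N(0, sigma^2) densities."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma**2) + w**2 / (2 * sigma**2)
        for w in weights
    )

def l2_penalty(weights, lam):
    """lam * sum of squared weights."""
    return lam * sum(w**2 for w in weights)

sigma = 2.0                 # prior standard deviation (assumed value)
lam = 1 / (2 * sigma**2)    # matching L2 strength

w_a = [0.5, -1.0, 2.0]
w_b = [3.0, 0.0, -0.5]

# The constant term is the same for both vectors, so it cancels in the difference.
diff_prior = neg_log_gaussian_prior(w_a, sigma) - neg_log_gaussian_prior(w_b, sigma)
diff_l2 = l2_penalty(w_a, lam) - l2_penalty(w_b, lam)
print(f"difference in -log prior: {diff_prior:.6f}")
print(f"difference in L2 penalty: {diff_l2:.6f}")
```

Because an optimizer only cares about differences in the objective, adding the L2 penalty to the loss has the same effect as multiplying the likelihood by this Gaussian prior. A narrower prior (smaller σ) corresponds to stronger regularization (larger λ).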

In [None]:
# Cross-entropy loss
def cross_entropy_loss(y_true, y_pred):
    """Binary cross-entropy loss."""
    epsilon = 1e-15  # Avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred_good = np.array([0.9, 0.1, 0.8, 0.95, 0.2])
y_pred_bad = np.array([0.5, 0.5, 0.5, 0.5, 0.5])

print(f"Good predictions loss: {cross_entropy_loss(y_true, y_pred_good):.4f}")
print(f"Bad predictions loss: {cross_entropy_loss(y_true, y_pred_bad):.4f}")

---

## 📝 Summary

Key concepts covered:

1. **Probability fundamentals** - Sample space, events, axioms
2. **Distributions** - Binomial, Normal, and their properties
3. **Bayes' Theorem** - Updating beliefs with evidence
4. **Statistics** - Mean, variance, correlation
5. **MLE** - Finding best parameters from data
6. **Sampling** - Monte Carlo methods
7. **ML applications** - Cross-entropy, regularization

### Key Takeaways for AI/ML

- Understanding probability enables reasoning about uncertainty
- Bayes' theorem is the foundation of probabilistic ML
- MLE connects to loss functions and optimization
- Distributions help model real-world phenomena