# Lab 1.4.5: Probability Distributions Lab

**Module:** 1.4.4
- Knowledge of: Basic probability, logarithms

---

## 🌍 Real-World Context

**Why does probability matter in deep learning?**

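Most loss functions used in deep learning come from probability theory: each is the negative log-likelihood of the distribution we assume for the model's output.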

| Task | Distribution | Loss Function |
|------|--------------|---------------|
| Regression | Gaussian | Mean Squared Error |
| Binary Classification | Bernoulli | Binary Cross-Entropy |
| Multi-class Classification | Categorical | Cross-Entropy |
| Language Modeling | Categorical | Cross-Entropy |

Understanding this connection helps you:
- Choose the right loss for your problem
- Interpret model outputs as probabilities
- Understand techniques like temperature sampling in LLMs
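
As a small illustration of the last point, temperature sampling divides the logits by a temperature `T` before the softmax. The sketch below uses a hypothetical helper `softmax_with_temperature` and made-up logits; it is not part of the lab code.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (illustrative helper, not from the lab)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

example_logits = np.array([2.0, 1.0, 0.1])  # made-up next-token scores
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(example_logits, t).round(3)}")
```

Lower temperatures concentrate probability on the highest-scoring option, while higher temperatures flatten the distribution.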

---

## 🧒 ELI5: What is Probability?

> **Imagine you're a weather predictor...**
>
> You say: "There's a 70% chance of rain tomorrow."
>
> This means:
> - If we had 100 tomorrows just like this one
> - About 70 of them would be rainy
> - About 30 would be sunny
>
> **In neural networks:**
> - The network outputs probabilities: "80% chance this is a cat"
> - Training = making these predictions more accurate
> - Loss functions measure "how wrong were your probabilities?"
>
> **Maximum Likelihood:**
> "What probability distribution makes the data we observed most likely?"

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("🚀 Probability Distributions Lab")
print("=" * 50)

---

## Part 1: Key Probability Distributions

Let's implement and visualize the distributions used in deep learning.

In [None]:
# 1. GAUSSIAN (NORMAL) DISTRIBUTION
# Used for: Regression, weight initialization, noise modeling

def gaussian_pdf(x, mu=0, sigma=1):
    """
    Gaussian probability density function.
    
    p(x) = (1 / (σ√(2π))) × exp(-(x-μ)² / (2σ²))
    
    Args:
        x: Input values
        mu: Mean
        sigma: Standard deviation
    """
    coeff = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * np.exp(exponent)

def gaussian_log_pdf(x, mu=0, sigma=1):
    """Log of Gaussian PDF (more numerically stable)"""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - ((x - mu) ** 2) / (2 * sigma ** 2)

# Visualize
x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Different means
for mu in [-2, 0, 2]:
    axes[0].plot(x, gaussian_pdf(x, mu=mu, sigma=1), linewidth=2, label=f'μ={mu}, σ=1')
axes[0].set_xlabel('x')
axes[0].set_ylabel('p(x)')
axes[0].set_title('Gaussian Distribution (varying mean)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Different variances
for sigma in [0.5, 1, 2]:
    axes[1].plot(x, gaussian_pdf(x, mu=0, sigma=sigma), linewidth=2, label=f'μ=0, σ={sigma}')
axes[1].set_xlabel('x')
axes[1].set_ylabel('p(x)')
axes[1].set_title('Gaussian Distribution (varying std)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Gaussian Distribution:")
print("  - Symmetric bell curve")
print("  - μ controls center, σ controls spread")
print("  - Used for regression outputs and weight init")
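
To see why the log-PDF is described as more numerically stable, compare taking the log of the PDF with computing the log-density directly. This is a minimal sketch; `normal_pdf` and `normal_log_pdf` are stand-ins that mirror the two functions above, and the test point is chosen here only to trigger underflow.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def normal_log_pdf(x, mu=0.0, sigma=1.0):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - ((x - mu) ** 2) / (2 * sigma ** 2)

x_far = 40.0  # 40 standard deviations from the mean
with np.errstate(divide="ignore"):
    naive_log = np.log(normal_pdf(x_far))  # the pdf underflows to 0.0, so the log is -inf
stable_log = normal_log_pdf(x_far)         # computed directly in log space, stays finite

print("log(pdf):", naive_log)
print("log_pdf: ", stable_log)
```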

In [None]:
# 2. BERNOULLI DISTRIBUTION
# Used for: Binary classification

def bernoulli_pmf(k, p):
    """
    Bernoulli probability mass function.
    
    p(k) = p^k × (1-p)^(1-k), where k ∈ {0, 1}
    
    Args:
        k: Outcome (0 or 1)
        p: Probability of k=1
    """
    return (p ** k) * ((1 - p) ** (1 - k))

def bernoulli_log_pmf(k, p):
    """Log of Bernoulli PMF"""
    # Add small epsilon for numerical stability
    eps = 1e-10
    p = np.clip(p, eps, 1 - eps)
    return k * np.log(p) + (1 - k) * np.log(1 - p)

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))

probs = [0.2, 0.5, 0.8]
x_pos = np.arange(len(probs))
width = 0.25

for i, p in enumerate(probs):
    prob_0 = bernoulli_pmf(0, p)
    prob_1 = bernoulli_pmf(1, p)
    
    # Legend entries are added once, after the loop
    ax.bar(i - width/2, prob_0, width, color='red', alpha=0.7)
    ax.bar(i + width/2, prob_1, width, color='blue', alpha=0.7)
    
    # Annotate
    ax.annotate(f'{prob_0:.2f}', (i - width/2, prob_0 + 0.02), ha='center')
    ax.annotate(f'{prob_1:.2f}', (i + width/2, prob_1 + 0.02), ha='center')

ax.set_xticks(x_pos)
ax.set_xticklabels([f'p={p}' for p in probs])
ax.set_ylabel('Probability')
ax.set_title('Bernoulli Distribution for Different p values')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3, axis='y')

# Add legend
ax.bar([], [], color='red', alpha=0.7, label='k=0 (failure)')
ax.bar([], [], color='blue', alpha=0.7, label='k=1 (success)')
ax.legend()

plt.tight_layout()
plt.show()

print("Bernoulli Distribution:")
print("  - Models binary outcomes (yes/no, spam/not spam)")
print("  - p = probability of success (k=1)")
print("  - Network outputs p via sigmoid activation")
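
One way to connect the PMF to data is to draw samples and check that the fraction of successes approaches `p`. The sketch below assumes NumPy's `default_rng`; the seed and sample size are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; seed chosen arbitrarily
p_success = 0.8
# A uniform draw falls below p_success with probability p_success
samples = (rng.random(10_000) < p_success).astype(int)

print(f"Empirical frequency of k=1: {samples.mean():.3f}")
```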

In [None]:
# 3. CATEGORICAL DISTRIBUTION
# Used for: Multi-class classification, language modeling

def categorical_pmf(k, probs):
    """
    Categorical probability mass function.
    
    p(k) = probs[k]
    
    Args:
        k: Class index (0 to K-1)
        probs: Probability vector (must sum to 1)
    """
    return probs[k]

def categorical_log_pmf(k, probs):
    """Log of categorical PMF"""
    eps = 1e-10
    return np.log(probs[k] + eps)

def softmax(logits):
    """Convert logits to probabilities"""
    exp_logits = np.exp(logits - np.max(logits))  # Subtract max for stability
    return exp_logits / exp_logits.sum()

# Visualize with example: ImageNet classes
classes = ['cat', 'dog', 'bird', 'fish', 'horse']

# Network output (logits) → softmax → probabilities
logits = np.array([2.5, 1.2, 0.5, -0.3, -1.0])
probs = softmax(logits)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logits
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(classes)))
axes[0].bar(classes, logits, color=colors)
axes[0].set_ylabel('Logit Value')
axes[0].set_title('Network Output (Logits)')
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[0].grid(True, alpha=0.3, axis='y')

# Probabilities after softmax
axes[1].bar(classes, probs, color=colors)
axes[1].set_ylabel('Probability')
axes[1].set_title('After Softmax (Probabilities)')
for i, (c, p) in enumerate(zip(classes, probs)):
    axes[1].annotate(f'{p:.2%}', (i, p + 0.02), ha='center')
axes[1].set_ylim(0, max(probs) * 1.2)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"Categorical Distribution:")
print(f"  - Logits: {logits}")
print(f"  - After softmax: {probs.round(3)}")
print(f"  - Probabilities sum to: {probs.sum():.6f}")
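
The "subtract max" step in `softmax` matters once logits get large. The sketch below contrasts a naive version with a shifted one; `naive_softmax`, `stable_softmax`, and the logits are illustrative names and values introduced here.

```python
import numpy as np

def naive_softmax(logits):
    exp_logits = np.exp(logits)        # overflows to inf for large logits
    return exp_logits / exp_logits.sum()

def stable_softmax(logits):
    shifted = logits - np.max(logits)  # largest entry becomes 0
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

big_logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print("naive: ", naive_softmax(big_logits))  # inf / inf produces nan
print("stable:", stable_softmax(big_logits))
```

Shifting every logit by the same constant leaves the softmax output unchanged, so the stable version computes the same distribution without overflow.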

---

## Part 2: Maximum Likelihood Estimation (MLE)

**The Big Idea:** Find parameters that maximize the probability of observed data.

### Likelihood vs Probability

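- **Probability:** P(data | parameters), read as a function of the data - "Given these parameters, how likely is this data?"
- **Likelihood:** L(parameters | data) = P(data | parameters), read as a function of the parameters - "Given this data, which parameters explain it best?" Unlike a probability, a likelihood need not sum to 1 over the parameters.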

### 🧒 ELI5: Maximum Likelihood

> **Imagine you found an ancient coin...**
>
> You flip it 10 times: H H T H H H T H H H (8 heads, 2 tails)
>
> **Question:** Is this a fair coin (p=0.5) or biased?
>
> **MLE approach:** Find the p that makes 8/10 heads most likely.
> - If p=0.5: Probability of exactly 8 heads ≈ 0.044 (small)
> - If p=0.8: Probability of exactly 8 heads ≈ 0.302 (large!)
> - MLE says: p = 0.8 (best explains the data)
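
The comparison in the coin example can be checked with the binomial formula. A minimal sketch, assuming independent flips; `prob_k_heads` is a helper name introduced here.

```python
from math import comb

def prob_k_heads(k, n, p_heads):
    """Binomial probability of exactly k heads in n independent flips."""
    return comb(n, k) * p_heads**k * (1 - p_heads) ** (n - k)

for p_heads in (0.5, 0.8):
    print(f"p={p_heads}: P(8 heads in 10 flips) = {prob_k_heads(8, 10, p_heads):.4f}")
```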

In [None]:
# MLE for Bernoulli: Binary Classification

# Observed data: coin flips
data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 8 heads (1), 2 tails (0)

def bernoulli_likelihood(p, data):
    """Likelihood of data under Bernoulli with parameter p"""
    return np.prod([bernoulli_pmf(k, p) for k in data])

def bernoulli_log_likelihood(p, data):
    """Log-likelihood (more stable)"""
    return np.sum([bernoulli_log_pmf(k, p) for k in data])

# Compute likelihood for different p values
p_values = np.linspace(0.01, 0.99, 100)
likelihoods = [bernoulli_likelihood(p, data) for p in p_values]
log_likelihoods = [bernoulli_log_likelihood(p, data) for p in p_values]

# Find MLE
mle_idx = np.argmax(log_likelihoods)
mle_p = p_values[mle_idx]

# Analytical MLE for Bernoulli: p = mean(data)
analytical_mle = data.mean()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Likelihood
axes[0].plot(p_values, likelihoods, 'b-', linewidth=2)
axes[0].axvline(x=mle_p, color='red', linestyle='--', label=f'MLE: p={mle_p:.2f}')
axes[0].set_xlabel('p (probability of heads)')
axes[0].set_ylabel('Likelihood')
axes[0].set_title('Likelihood Function')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Log-likelihood (easier to work with)
axes[1].plot(p_values, log_likelihoods, 'b-', linewidth=2)
axes[1].axvline(x=mle_p, color='red', linestyle='--', label=f'MLE: p={mle_p:.2f}')
axes[1].set_xlabel('p (probability of heads)')
axes[1].set_ylabel('Log-Likelihood')
axes[1].set_title('Log-Likelihood Function')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Data: {data} ({data.sum()} heads, {len(data)-data.sum()} tails)")
print(f"MLE (numerical): p = {mle_p:.4f}")
print(f"MLE (analytical): p = {analytical_mle:.4f}")
print(f"\nThe MLE is simply the fraction of heads!")
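
The analytical result can be derived in two lines. A sketch, writing $h$ for the number of heads, $n$ for the number of flips, and $k_i$ for the individual outcomes (symbols introduced here for illustration):

$$\ell(p) = \log \prod_{i=1}^{n} p^{k_i}(1-p)^{1-k_i} = h \log p + (n-h)\log(1-p)$$

$$\frac{d\ell}{dp} = \frac{h}{p} - \frac{n-h}{1-p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{h}{n}$$

For the data above, $h = 8$ and $n = 10$, so $\hat{p} = 0.8$. The grid search can only return a point on its grid, which is why the numerical value may differ slightly from 0.8.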

---

## Part 3: Deriving MSE Loss from Gaussian MLE

**Key insight:** MSE loss = negative log-likelihood of a Gaussian!

### The Derivation

Assume each target is the model's prediction plus Gaussian noise with fixed variance: $y \sim \mathcal{N}(\hat{y}, \sigma^2)$

**Log-likelihood for one sample:**
$$\log p(y|\hat{y}, \sigma) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y-\hat{y})^2}{2\sigma^2}$$

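**Negative log-likelihood:**
$$-\log p(y|\hat{y}, \sigma) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$$

With $\sigma$ fixed, the second term is a constant and the first is the squared error scaled by $1/(2\sigma^2)$. Averaged over samples, this is **Mean Squared Error** up to a positive scale factor and an additive constant, so both have the same minimizer.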
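
Extending the single-sample result to a dataset makes the link to MSE explicit. The following is a sketch under the same fixed-$\sigma$ assumption, for $N$ independent samples indexed by $i$ (notation introduced here):

$$-\sum_{i=1}^{N} \log p(y_i|\hat{y}_i, \sigma) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = \frac{N}{2}\log(2\pi\sigma^2) + \frac{N}{2\sigma^2}\,\text{MSE}$$

The first term does not depend on the predictions, and the factor $N/(2\sigma^2)$ is positive, so the predictions that minimize the NLL are the ones that minimize the MSE.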

In [None]:
# Demonstrate the connection between Gaussian MLE and MSE

# True values and predictions
y_true = np.array([1.0, 2.5, 3.0, 4.5, 5.0])
y_pred = np.array([1.2, 2.3, 3.5, 4.2, 5.1])

# MSE Loss (what we usually use)
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Gaussian Negative Log-Likelihood (with σ=1)
def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Negative log-likelihood under Gaussian assumption"""
    n = len(y_true)
    const = 0.5 * n * np.log(2 * np.pi * sigma**2)
    squared_errors = np.sum((y_true - y_pred)**2) / (2 * sigma**2)
    return const + squared_errors

# Compute both
mse = mse_loss(y_true, y_pred)
nll = gaussian_nll(y_true, y_pred, sigma=1.0)

print("Gaussian MLE → MSE Connection")
print("=" * 50)
print(f"True values:  {y_true}")
print(f"Predictions:  {y_pred}")
print()
print(f"MSE Loss:     {mse:.6f}")
print(f"Gaussian NLL: {nll:.6f}")
print()
print("Notice: the values differ, but NLL = const + n * MSE / (2σ²), so minimizing MSE = maximizing Gaussian likelihood!")
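
The two numbers printed above differ, so it helps to confirm that the two losses still agree on where the minimum is. The sketch below fits a single constant prediction to the same targets by grid search; the grid and the helper names are assumptions made for this illustration.

```python
import numpy as np

targets = np.array([1.0, 2.5, 3.0, 4.5, 5.0])  # same targets as the cell above
candidates = np.linspace(0.0, 6.0, 601)        # constant predictions to try

def mse_for_constant(c):
    return np.mean((targets - c) ** 2)

def gaussian_nll_for_constant(c, sigma=1.0):
    n = len(targets)
    const = 0.5 * n * np.log(2 * np.pi * sigma**2)
    return const + np.sum((targets - c) ** 2) / (2 * sigma**2)

mse_curve = np.array([mse_for_constant(c) for c in candidates])
nll_curve = np.array([gaussian_nll_for_constant(c) for c in candidates])

best_mse = candidates[np.argmin(mse_curve)]
best_nll = candidates[np.argmin(nll_curve)]
print(f"argmin MSE: {best_mse:.2f}, argmin NLL: {best_nll:.2f}, mean of targets: {targets.mean():.2f}")
```

Both searches pick the same constant, which is also the sample mean, the MLE of a Gaussian mean.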

In [None]:
# Visual demonstration

# Single prediction example
y_true_single = 3.0
y_preds = np.linspace(0, 6, 100)

# Compute MSE and NLL for each prediction
mses = (y_true_single - y_preds) ** 2
nlls = -gaussian_log_pdf(y_true_single, mu=y_preds, sigma=1.0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MSE vs prediction
axes[0].plot(y_preds, mses, 'b-', linewidth=2)
axes[0].axvline(x=y_true_single, color='red', linestyle='--', label=f'True y = {y_true_single}')
axes[0].set_xlabel('Prediction ŷ')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('MSE Loss vs Prediction')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# NLL vs prediction
axes[1].plot(y_preds, nlls, 'b-', linewidth=2)
axes[1].axvline(x=y_true_single, color='red', linestyle='--', label=f'True y = {y_true_single}')
axes[1].set_xlabel('Prediction ŷ')
axes[1].set_ylabel('Negative Log-Likelihood')
axes[1].set_title('Gaussian NLL vs Prediction')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Both curves are parabolas with the same minimum!")
print("Here NLL = MSE / 2 + constant: scaling and shifting don't move the minimum.")

---

## Part 4: Deriving Cross-Entropy from Categorical MLE

**Key insight:** Cross-entropy loss = negative log-likelihood of Categorical!

### The Derivation

For classification with one-hot label $y$ and predicted probabilities $\hat{p}$:

**Log-likelihood:**
$$\log p(y|\hat{p}) = \sum_k y_k \log \hat{p}_k$$

**Negative log-likelihood (Cross-Entropy):**
$$H(y, \hat{p}) = -\sum_k y_k \log \hat{p}_k$$

This is the **Cross-Entropy Loss**!

In [None]:
# Cross-Entropy Loss implementation

def cross_entropy_loss(y_true_onehot, y_pred_probs):
    """
    Cross-entropy loss for classification.
    
    H(y, p) = -Σ y_k log(p_k)
    
    Args:
        y_true_onehot: One-hot encoded true labels
        y_pred_probs: Predicted probabilities (after softmax)
    """
    eps = 1e-10  # Prevent log(0)
    return -np.sum(y_true_onehot * np.log(y_pred_probs + eps))

def categorical_nll(y_true_class, y_pred_probs):
    """
    Negative log-likelihood for categorical distribution.
    
    NLL = -log(p_true_class)
    
    Args:
        y_true_class: True class index
        y_pred_probs: Predicted probabilities
    """
    eps = 1e-10
    return -np.log(y_pred_probs[y_true_class] + eps)

# Example: 5-class classification
n_classes = 5
true_class = 2  # True label is class 2

# One-hot encoding
y_true_onehot = np.zeros(n_classes)
y_true_onehot[true_class] = 1

# Predicted probabilities (from softmax)
logits = np.array([0.5, 1.0, 2.5, 0.3, -0.2])  # Network output
y_pred_probs = softmax(logits)

# Compute both losses
ce_loss = cross_entropy_loss(y_true_onehot, y_pred_probs)
nll = categorical_nll(true_class, y_pred_probs)

print("Categorical MLE → Cross-Entropy Connection")
print("=" * 50)
print(f"True class: {true_class}")
print(f"One-hot:    {y_true_onehot}")
print(f"Logits:     {logits}")
print(f"Probs:      {y_pred_probs.round(4)}")
print()
print(f"Cross-Entropy Loss: {ce_loss:.6f}")
print(f"Categorical NLL:    {nll:.6f}")
print()
print("They're identical! Cross-entropy IS the categorical NLL.")
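
In practice labels often arrive as integer class indices for a whole batch rather than as a single one-hot vector. A minimal sketch of that case follows; `batch_cross_entropy` and the example batch are hypothetical and introduced here only for illustration.

```python
import numpy as np

def batch_cross_entropy(probs, labels):
    """Mean cross-entropy for a batch.

    probs:  array of shape (N, K), each row a probability vector
    labels: integer class indices of shape (N,)
    """
    eps = 1e-10
    picked = probs[np.arange(len(labels)), labels]  # probability given to each true class
    return -np.mean(np.log(picked + eps))

# Hypothetical batch of 3 samples over 4 classes
batch_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
batch_labels = np.array([0, 1, 3])

# Same quantity via explicit one-hot vectors
onehot = np.eye(4)[batch_labels]
onehot_version = -np.mean(np.sum(onehot * np.log(batch_probs + 1e-10), axis=1))

print(f"indexed: {batch_cross_entropy(batch_probs, batch_labels):.6f}")
print(f"one-hot: {onehot_version:.6f}")
```

Indexing the true-class probability avoids multiplying by the zeros of the one-hot vector but gives the same loss.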

In [None]:
# Visualize how cross-entropy penalizes wrong predictions

# True class is 0
# Vary the predicted probability for class 0
p_true_class = np.linspace(0.01, 0.99, 100)

# Cross-entropy = -log(p_true_class)
ce_losses = -np.log(p_true_class)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cross-entropy vs probability
axes[0].plot(p_true_class, ce_losses, 'b-', linewidth=2)
axes[0].axvline(x=1.0, color='green', linestyle='--', label='Perfect prediction')
axes[0].axhline(y=0, color='green', linestyle='--')
axes[0].set_xlabel('Probability assigned to true class')
axes[0].set_ylabel('Cross-Entropy Loss')
axes[0].set_title('Cross-Entropy vs Correct Class Probability')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 1)

# Gradient magnitude |d(-log p)/dp| = 1/p (steeper when very wrong)
gradient = 1 / p_true_class
axes[1].plot(p_true_class, gradient, 'r-', linewidth=2)
axes[1].set_xlabel('Probability assigned to true class')
axes[1].set_ylabel('Gradient magnitude')
axes[1].set_title('Gradient of Cross-Entropy')
axes[1].set_ylim(0, 20)
axes[1].grid(True, alpha=0.3)
axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("Key insight:")
print("  - Loss → 0 as probability → 1 (correct prediction)")
print("  - Loss → ∞ as probability → 0 (wrong prediction)")
print("  - Gradient is larger for worse predictions (faster learning!)")
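
The plot shows the gradient with respect to the probability. During training the gradient with respect to the logits is what is backpropagated, and for softmax followed by cross-entropy it has the simple form `softmax(z) - one_hot(y)`. The sketch below checks that formula against finite differences; the helper names and step size are choices made here.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_from_logits(z, target_idx):
    return -np.log(softmax_1d(z)[target_idx])

z = np.array([0.5, 1.0, 2.5, 0.3, -0.2])  # logits from the earlier example
target_idx = 2

analytic = softmax_1d(z).copy()
analytic[target_idx] -= 1.0               # softmax(z) - one_hot(y)

# Central finite differences as an independent check
h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (ce_from_logits(z + step, target_idx) - ce_from_logits(z - step, target_idx)) / (2 * h)

print("max abs difference:", np.max(np.abs(analytic - numeric)))
```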

---

## Part 5: KL Divergence and Information Theory

### What is KL Divergence?

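KL Divergence measures how much a distribution Q differs from a reference distribution P. It is never negative, it is zero only when P = Q, and it is not symmetric in P and Q: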

$$D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

### 🧒 ELI5: KL Divergence

> **Imagine you're using a map to navigate...**
>
> - P = the real terrain
> - Q = your map
>
> KL Divergence measures: "How much trouble will I get into using this map?"
> - KL = 0: Perfect map! No trouble.
> - KL > 0: Map is wrong. Higher = more trouble.

### Key Relationship

$$\text{Cross-Entropy}(P, Q) = \text{Entropy}(P) + D_{KL}(P || Q)$$

Since Entropy(P) depends only on the true labels and not on the model, it is constant during training (for one-hot labels it is 0):
- **Minimizing Cross-Entropy = Minimizing KL Divergence**
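
The key relationship follows from splitting the logarithm. A short derivation using the definitions above:

$$H(P, Q) = -\sum_x P(x) \log Q(x) = -\sum_x P(x) \log P(x) + \sum_x P(x) \log \frac{P(x)}{Q(x)} = H(P) + D_{KL}(P || Q)$$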

In [None]:
def entropy(p):
    """Compute entropy of distribution p"""
    eps = 1e-10
    p = np.clip(p, eps, 1-eps)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Compute KL divergence D_KL(p || q)"""
    eps = 1e-10
    p = np.clip(p, eps, 1-eps)
    q = np.clip(q, eps, 1-eps)
    return np.sum(p * np.log(p / q))

def cross_entropy_probs(p, q):
    """Compute cross-entropy H(p, q)"""
    eps = 1e-10
    q = np.clip(q, eps, 1-eps)
    return -np.sum(p * np.log(q))

# Example: True distribution vs predicted
p_true = np.array([0.7, 0.2, 0.1])  # True distribution
q_pred = np.array([0.5, 0.3, 0.2])  # Predicted distribution

H_p = entropy(p_true)
D_kl = kl_divergence(p_true, q_pred)
H_pq = cross_entropy_probs(p_true, q_pred)

print("Information Theory Relationships")
print("=" * 50)
print(f"True distribution P:      {p_true}")
print(f"Predicted distribution Q: {q_pred}")
print()
print(f"Entropy H(P):        {H_p:.6f}")
print(f"KL Divergence:       {D_kl:.6f}")
print(f"Cross-Entropy H(P,Q): {H_pq:.6f}")
print()
print(f"H(P) + D_KL(P||Q) = {H_p + D_kl:.6f}")
print(f"This equals Cross-Entropy!")
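
KL divergence depends on the order of its arguments. A quick check with the same two distributions, using a local helper `kl` that skips the clipping because both inputs are strictly positive (an assumption of this sketch):

```python
import numpy as np

def kl(p_dist, q_dist):
    """D_KL(p || q) for strictly positive discrete distributions."""
    return np.sum(p_dist * np.log(p_dist / q_dist))

dist_p = np.array([0.7, 0.2, 0.1])  # same P as in the cell above
dist_q = np.array([0.5, 0.3, 0.2])  # same Q as in the cell above

forward = kl(dist_p, dist_q)
reverse = kl(dist_q, dist_p)
print(f"D_KL(P || Q) = {forward:.4f}")
print(f"D_KL(Q || P) = {reverse:.4f}")
```

The two values differ, which is why KL divergence is not a distance in the usual sense.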

In [None]:
# Visualize KL divergence as Q approaches P

# Fix P, vary Q
p = np.array([0.7, 0.3])

q_probs = np.linspace(0.1, 0.9, 100)  # Probability of first class in Q
kl_values = []
ce_values = []

for q1 in q_probs:
    q = np.array([q1, 1 - q1])
    kl_values.append(kl_divergence(p, q))
    ce_values.append(cross_entropy_probs(p, q))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# KL Divergence
axes[0].plot(q_probs, kl_values, 'b-', linewidth=2)
axes[0].axvline(x=p[0], color='red', linestyle='--', label=f'P = [{p[0]}, {p[1]}]')
axes[0].set_xlabel('Q[0] (probability of first class)')
axes[0].set_ylabel('KL Divergence')
axes[0].set_title('KL Divergence D_KL(P || Q)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Cross-entropy
axes[1].plot(q_probs, ce_values, 'g-', linewidth=2, label='Cross-Entropy')
axes[1].axhline(y=entropy(p), color='orange', linestyle='--', label='Entropy H(P)')
axes[1].axvline(x=p[0], color='red', linestyle='--', label=f'P = [{p[0]}, {p[1]}]')
axes[1].set_xlabel('Q[0] (probability of first class)')
axes[1].set_ylabel('Value')
axes[1].set_title('Cross-Entropy = Entropy + KL Divergence')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations:")
print(f"  - KL = 0 when Q = P (distributions match)")
print(f"  - Cross-Entropy is minimized when Q = P")
print(f"  - At minimum, CE = Entropy(P) = {entropy(p):.4f}")

---

## ⚠️ Common Mistakes

### Mistake 1: Using MSE for Classification

```python
# ❌ Wrong: MSE for classification (probabilities)
loss = np.mean((y_pred_probs - y_true_onehot) ** 2)

# ✅ Right: Cross-entropy for classification
eps = 1e-10  # Prevent log(0)
loss = -np.sum(y_true_onehot * np.log(y_pred_probs + eps))
```

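**Why:** With sigmoid or softmax outputs, the cross-entropy gradient with respect to the logits is simply `p - y`, so a confidently wrong prediction still gets a large gradient. With MSE the gradient is also multiplied by the activation's derivative, which shrinks toward zero when the output saturates, so learning can stall exactly when the model is most wrong.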
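
The difference in gradients is easy to see for a single sigmoid output that is confidently wrong. This is a minimal sketch; the logit value and variable names are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_wrong = -10.0  # logit: the model is confident the label is 0
y_label = 1.0    # but the true label is 1
p_hat = sigmoid(z_wrong)

# d/dz of (p - y)^2 with p = sigmoid(z): the chain rule brings in p * (1 - p)
grad_mse = 2 * (p_hat - y_label) * p_hat * (1 - p_hat)
# d/dz of -[y log p + (1 - y) log(1 - p)] simplifies to p - y
grad_bce = p_hat - y_label

print(f"MSE gradient w.r.t. logit: {grad_mse:.2e}")
print(f"BCE gradient w.r.t. logit: {grad_bce:.2e}")
```

The MSE gradient is close to zero even though the prediction is badly wrong, while the cross-entropy gradient stays close to -1.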

### Mistake 2: Log of Zero

```python
# ❌ Wrong: No protection against log(0)
ce = -np.sum(y * np.log(p))  # Gives inf or nan if p contains 0!

# ✅ Right: Add small epsilon
eps = 1e-10
ce = -np.sum(y * np.log(p + eps))
```
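
A concrete case shows what goes wrong. In the sketch below the prediction contains an exact zero for a class whose label is also zero, so the unprotected sum evaluates `0 * log(0)`. Clipping is used as the fix here, as an alternative to adding epsilon; the arrays are made-up examples.

```python
import numpy as np

y_onehot = np.array([1.0, 0.0])
p_pred = np.array([1.0, 0.0])  # a prediction containing an exact zero

with np.errstate(divide="ignore", invalid="ignore"):
    # log(0) is -inf, and 0 * (-inf) is nan, so the whole sum becomes nan
    unsafe = -np.sum(y_onehot * np.log(p_pred))

clipped = np.clip(p_pred, 1e-10, 1.0)
safe = -np.sum(y_onehot * np.log(clipped))

print("without protection:", unsafe)
print("with clipping:     ", safe)
```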

### Mistake 3: Forgetting Softmax

```python
# ❌ Wrong: Cross-entropy on raw logits
ce = cross_entropy_loss(y_true_onehot, logits)  # Logits aren't probabilities!

# ✅ Right: Apply softmax first
probs = softmax(logits)
ce = cross_entropy_loss(y_true_onehot, probs)

# Or use PyTorch's combined version (numerically stable)
import torch.nn.functional as F
ce = F.cross_entropy(logits, labels)  # Takes raw logits (as tensors); applies log-softmax internally
```
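
The idea behind a combined "logits in, loss out" function can be sketched in NumPy with the log-sum-exp trick. This is an illustration of the technique under that assumption, not PyTorch's implementation; `cross_entropy_from_logits` is a name introduced here.

```python
import numpy as np

def cross_entropy_from_logits(logits, target_idx):
    """-log softmax(logits)[target_idx], computed via log-sum-exp."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log-softmax
    return -log_probs[target_idx]

extreme_logits = np.array([1000.0, 0.0, -1000.0])
print(cross_entropy_from_logits(extreme_logits, target_idx=2))
```

Because the probabilities are never formed explicitly, the loss stays finite even when the true class has a probability that would underflow to zero.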

---

## ✋ Try It Yourself

### Exercise: Implement Binary Cross-Entropy

Implement BCE and show it's equivalent to the negative log-likelihood of a Bernoulli distribution.

<details>
<summary>💡 Hint</summary>

Binary cross-entropy:
```python
BCE = -(y * log(p) + (1-y) * log(1-p))
```

Compare with Bernoulli log-PMF:
```python
log_pmf = y * log(p) + (1-y) * log(1-p)
```
</details>

In [None]:
def binary_cross_entropy(y_true, y_pred):
    """
    Binary cross-entropy loss.
    
    BCE = -[y*log(p) + (1-y)*log(1-p)]
    
    Args:
        y_true: Binary labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
    """
    eps = 1e-10
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Test data
y_true_binary = np.array([1, 0, 1, 1, 0])
y_pred_probs_binary = np.array([0.9, 0.2, 0.8, 0.7, 0.3])

# Compute BCE
bce = binary_cross_entropy(y_true_binary, y_pred_probs_binary)

# Compute as Bernoulli NLL
bernoulli_nll_values = [-bernoulli_log_pmf(y, p) for y, p in zip(y_true_binary, y_pred_probs_binary)]
bernoulli_nll_mean = np.mean(bernoulli_nll_values)

print(f"Binary Cross-Entropy: {bce:.6f}")
print(f"Bernoulli NLL (mean): {bernoulli_nll_mean:.6f}")
print(f"\nThey're the same! BCE = Bernoulli NLL")
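
As an optional extension, BCE can also be computed directly from logits, which avoids clipping altogether. The sketch below uses the identity `max(z, 0) - z*y + log(1 + exp(-|z|))`; `bce_with_logits` and the `*_demo` variables are names introduced here.

```python
import numpy as np

def bce_with_logits(z, y):
    """Mean binary cross-entropy from logits z and labels y in {0, 1}.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)), which never exponentiates
    a large positive number.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

labels_demo = np.array([1, 0, 1, 1, 0])
probs_demo = np.array([0.9, 0.2, 0.8, 0.7, 0.3])     # same values as the test data above
logits_demo = np.log(probs_demo / (1 - probs_demo))  # inverse of the sigmoid

from_probs = -np.mean(labels_demo * np.log(probs_demo)
                      + (1 - labels_demo) * np.log(1 - probs_demo))
print(f"BCE from probabilities: {from_probs:.6f}")
print(f"BCE from logits:        {bce_with_logits(logits_demo, labels_demo):.6f}")
```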

---

## 🎉 Checkpoint

You've learned:

- ✅ **Gaussian distribution** → MSE loss for regression
- ✅ **Bernoulli distribution** → Binary Cross-Entropy for binary classification
- ✅ **Categorical distribution** → Cross-Entropy for multi-class
- ✅ **MLE** = finding parameters that maximize probability of data
- ✅ **KL Divergence** measures difference between distributions
- ✅ **Cross-Entropy = Entropy + KL Divergence**

**Key insight:** Loss functions aren't arbitrary - they come from probability theory!

---

## 📖 Further Reading

- [Visual Information Theory](https://colah.github.io/posts/2015-09-Visual-Information/) - Excellent visualizations
- [Understanding Softmax Cross-Entropy](https://cs231n.github.io/linear-classify/#softmax) - Stanford CS231n
- [KL Divergence Explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained) - Intuitive explanation

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()

print("✅ Cleanup complete!")
print("\n🎉 Congratulations! You've completed Module 1.4: Mathematics for Deep Learning!")
print("\n➡️  Next: Module 1.4 - Neural Network Fundamentals")