## BME i9400
## Fall 2025
### Probability Theory II: Distributions & Covariance

### Today's Problem

You are analyzing biomarker data from a clinical trial studying a new diagnostic test for early-stage Alzheimer's disease. The test measures two biomarkers:
- **Biomarker A**: Amyloid-β protein concentration (μg/mL)
- **Biomarker B**: Tau protein concentration (μg/mL)

From a cohort of 1000 patients, you observe that:
- Both biomarkers appear to follow normal distributions
- The biomarkers seem to be correlated (when one is high, the other tends to be high)
- Some patients have discordant results (high A, low B or vice versa)

**Your task**: Understand the joint distribution of these biomarkers to determine:
1. What is the probability a patient has both biomarkers above threshold?
2. How does the correlation between biomarkers affect diagnostic accuracy?
3. Can we predict one biomarker from the other?

### Mini-Lecture: Probability Distributions (15-20 minutes)

#### Part 1: Discrete Distributions

**Bernoulli Distribution**
- Models a single binary outcome (success/failure)
- Parameter: $p$ = probability of success
- PMF: $P(X=k) = p^k(1-p)^{1-k}$ for $k \in \{0,1\}$
- Mean: $E[X] = p$
- Variance: $\text{Var}(X) = p(1-p)$
- *Clinical example*: Single patient has disease (1) or not (0)

**Binomial Distribution**
- Models number of successes in $n$ independent Bernoulli trials
- Parameters: $n$ (trials), $p$ (success probability)
- PMF: $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$
- Mean: $E[X] = np$
- Variance: $\text{Var}(X) = np(1-p)$
- *Clinical example*: Number of patients responding to treatment out of $n$ patients
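
The PMF formula above can be checked by hand against SciPy. The sketch below is only an illustration: the values `n = 20`, `p = 0.3`, and `k = 6` are assumed for the example, not taken from trial data.

```python
from math import comb

from scipy import stats

# Illustrative assumptions: 20 patients, each responding with probability 0.3
n, p = 20, 0.3
k = 6  # number of responders we ask about

# PMF written out from the formula: C(n, k) * p^k * (1 - p)^(n - k)
pmf_by_hand = comb(n, k) * p**k * (1 - p) ** (n - k)

# The same quantity computed by SciPy
pmf_scipy = stats.binom.pmf(k, n, p)

# Mean and variance from SciPy, to compare with np and np(1 - p)
mean, var = stats.binom.stats(n, p, moments="mv")

print(f"P(X = {k}) by hand: {pmf_by_hand:.4f}")
print(f"P(X = {k}) SciPy:   {pmf_scipy:.4f}")
print(f"mean = {float(mean):.2f} (np = {n * p:.2f})")
print(f"variance = {float(var):.2f} (np(1-p) = {n * p * (1 - p):.2f})")
```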

**Poisson Distribution**
- Models the count of events in a fixed interval of time or space, assuming events occur independently at a constant average rate
- Parameter: $\lambda$ = average rate
- PMF: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
- Mean: $E[X] = \lambda$
- Variance: $\text{Var}(X) = \lambda$
- *Clinical example*: Number of seizures per month in epilepsy patients
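
A quick way to see that the Poisson mean and variance are both $\lambda$ is to simulate. This is a minimal sketch; the rate `lam = 4.0` (seizures per month) and the cutoff of 8 are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

lam = 4.0  # hypothetical average of 4 seizures per month

# P(X = 0): the PMF formula reduces to exp(-lambda) when k = 0
p_zero = stats.poisson.pmf(0, lam)

# P(X >= 8): the survival function sf(k) gives P(X > k), so pass k = 7
p_at_least_8 = stats.poisson.sf(7, lam)

# Simulate many months and compare sample mean and variance with lambda
rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=100_000)

print(f"P(X = 0)  = {p_zero:.4f}")
print(f"P(X >= 8) = {p_at_least_8:.4f}")
print(f"sample mean = {samples.mean():.3f}, sample variance = {samples.var():.3f}")
```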

#### Part 2: Continuous Distributions

**Uniform Distribution**
- Equal probability over an interval $[a,b]$
- PDF: $f(x) = \frac{1}{b-a}$ for $x \in [a,b]$
- Mean: $E[X] = \frac{a+b}{2}$
- Variance: $\text{Var}(X) = \frac{(b-a)^2}{12}$
- *Clinical example*: Arrival time of emergency patients within an hour

**Gaussian (Normal) Distribution**
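- The most widely used continuous distribution, in part because sums of many small independent effects are approximately normal (central limit theorem)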
- Parameters: $\mu$ (mean), $\sigma^2$ (variance)
- PDF: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- Mean: $E[X] = \mu$
- Variance: $\text{Var}(X) = \sigma^2$
- *Clinical example*: Blood pressure, height, many biomarkers
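
Probabilities under a normal distribution come from the CDF rather than the PDF. The sketch below assumes systolic blood pressure with $\mu = 120$ and $\sigma = 15$ mmHg; the 140 mmHg cutoff is a hypothetical threshold used only for illustration.

```python
from scipy import stats

mu, sigma = 120, 15  # illustrative mean and standard deviation (mmHg)

# Note: SciPy's `scale` is the standard deviation, not the variance
bp = stats.norm(loc=mu, scale=sigma)

# P(X > 140) using the survival function, sf(x) = 1 - cdf(x)
p_above_140 = bp.sf(140)

# Probability mass within 1 and 2 standard deviations of the mean
within_1sd = bp.cdf(mu + sigma) - bp.cdf(mu - sigma)
within_2sd = bp.cdf(mu + 2 * sigma) - bp.cdf(mu - 2 * sigma)

print(f"P(BP > 140) = {p_above_140:.4f}")
print(f"P(within 1 SD) = {within_1sd:.4f}")
print(f"P(within 2 SD) = {within_2sd:.4f}")
```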

#### Part 3: Key Concepts

**Expected Value**
- Discrete: $E[X] = \sum_x x \cdot P(X=x)$
- Continuous: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)dx$
- Interpretation: Long-run average value

**Variance**
- Definition: $\text{Var}(X) = E[(X-\mu)^2] = E[X^2] - (E[X])^2$
- Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$
- Interpretation: Spread around the mean
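
The two variance formulas above give the same number. A minimal sketch with a made-up discrete PMF (the values and probabilities are assumptions for illustration only):

```python
import numpy as np

# Hypothetical PMF: number of adverse events per patient
values = np.array([0, 1, 2, 3])
probs = np.array([0.5, 0.3, 0.15, 0.05])  # must sum to 1

# Expected value: sum of x * P(X = x)
mean = np.sum(values * probs)

# Variance from the definition E[(X - mu)^2]
var_definition = np.sum((values - mean) ** 2 * probs)

# Variance from the shortcut E[X^2] - (E[X])^2
var_shortcut = np.sum(values**2 * probs) - mean**2

std = np.sqrt(var_definition)

print(f"E[X] = {mean:.4f}")
print(f"Var(X) by definition = {var_definition:.4f}")
print(f"Var(X) by shortcut   = {var_shortcut:.4f}")
print(f"Standard deviation   = {std:.4f}")
```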

#### Part 4: Joint Distributions & Covariance

**Joint Distribution**
- Describes probability of two variables simultaneously
- For continuous: $f(x,y)$ such that $\int\int f(x,y)dxdy = 1$
- Marginal: $f_X(x) = \int f(x,y)dy$
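
For a bivariate normal, the probability that both variables exceed their thresholds follows from the joint and marginal CDFs: $P(X > a, Y > b) = 1 - F_X(a) - F_Y(b) + F_{XY}(a, b)$. The sketch below assumes a bivariate normal model with the means and standard deviations used in the lab; the thresholds (60 and 38 μg/mL) and the helper name `prob_both_above` are hypothetical.

```python
import numpy as np
from scipy import stats

means = np.array([50.0, 30.0])       # Amyloid-β and Tau means (μg/mL)
std_a, std_b = 10.0, 8.0             # standard deviations (μg/mL)
thresholds = np.array([60.0, 38.0])  # hypothetical diagnostic thresholds


def prob_both_above(rho):
    """P(A > a, B > b) under a bivariate normal with correlation rho."""
    cov = np.array([[std_a**2, rho * std_a * std_b],
                    [rho * std_a * std_b, std_b**2]])
    joint = stats.multivariate_normal(mean=means, cov=cov)
    # Marginal CDFs at each threshold
    f_a = stats.norm.cdf(thresholds[0], loc=means[0], scale=std_a)
    f_b = stats.norm.cdf(thresholds[1], loc=means[1], scale=std_b)
    # Joint CDF: P(A <= a, B <= b), evaluated numerically by SciPy
    f_ab = joint.cdf(thresholds)
    # Inclusion-exclusion gives the upper-right quadrant probability
    return 1.0 - f_a - f_b + f_ab


p_independent = prob_both_above(0.0)
p_correlated = prob_both_above(0.7)

print(f"rho = 0.0: P(both above) = {p_independent:.4f}")
print(f"rho = 0.7: P(both above) = {p_correlated:.4f}")
```

Comparing the two printed values shows how positive correlation raises the chance that both biomarkers are elevated together, relative to the independent case.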

**Covariance**
- Definition: $\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$
- Alternative: $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$
- Correlation: $\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$ (normalized, $-1 \leq \rho \leq 1$)
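
Both covariance formulas, and the normalization to a correlation, can be verified on a small dataset. The five paired measurements below are made up for illustration.

```python
import numpy as np

# Made-up paired measurements (e.g. two biomarkers on five patients)
x = np.array([42.0, 55.0, 61.0, 48.0, 70.0])
y = np.array([25.0, 31.0, 38.0, 27.0, 41.0])

# Definition: E[(X - mu_X)(Y - mu_Y)]
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))

# Alternative: E[XY] - E[X]E[Y]
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

# Correlation: covariance divided by the product of standard deviations.
# np.std defaults to the population form (divide by N), matching np.mean above.
rho = cov_definition / (x.std() * y.std())

print(f"Cov by definition: {cov_definition:.3f}")
print(f"Cov by shortcut:   {cov_shortcut:.3f}")
print(f"Correlation:       {rho:.3f}")
print(f"np.corrcoef:       {np.corrcoef(x, y)[0, 1]:.3f}")
```

Note that `np.cov` divides by $N-1$ by default, so its covariance differs slightly from the population value computed here; the correlation is unaffected because the factor cancels.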

**Covariance Matrix (2D)**

For two variables $X$ and $Y$:

$$\Sigma = \begin{bmatrix}
\text{Var}(X) & \text{Cov}(X,Y) \\
\text{Cov}(Y,X) & \text{Var}(Y)
\end{bmatrix} = \begin{bmatrix}
\sigma_X^2 & \rho\sigma_X\sigma_Y \\
\rho\sigma_X\sigma_Y & \sigma_Y^2
\end{bmatrix}$$

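- **Diagonal terms**: Variances (never negative, and positive unless the variable is constant)
- **Off-diagonal terms**: Covariances (can be positive, negative, or zero)
- **Symmetric**: $\text{Cov}(X,Y) = \text{Cov}(Y,X)$
- **Positive semi-definite**: All eigenvalues ≥ 0, and strictly > 0 (positive definite) whenever $|\rho| < 1$ and both variances are positive (don't worry if you don't know what an eigenvalue is yet)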
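
These properties can be checked numerically. The sketch below builds $\Sigma$ from $\sigma_X$, $\sigma_Y$, and $\rho$ and inspects its eigenvalues; the helper name `covariance_matrix` and the specific values are assumptions for illustration. The degenerate case $\rho = 1$ is included to show that an eigenvalue reaches zero there, so strict positivity requires $|\rho| < 1$.

```python
import numpy as np


def covariance_matrix(sigma_x, sigma_y, rho):
    """Build the 2x2 covariance matrix from standard deviations and correlation."""
    off_diag = rho * sigma_x * sigma_y
    return np.array([[sigma_x**2, off_diag],
                     [off_diag, sigma_y**2]])


for rho in (0.0, 0.7, 1.0):
    sigma_mat = covariance_matrix(10.0, 8.0, rho)
    # eigvalsh is intended for symmetric matrices; eigenvalues come back in ascending order
    eigenvalues = np.linalg.eigvalsh(sigma_mat)
    is_symmetric = np.allclose(sigma_mat, sigma_mat.T)
    print(f"rho = {rho}: symmetric = {is_symmetric}, eigenvalues = {np.round(eigenvalues, 3)}")
```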

### Hands-On Lab: Exploring Distributions & Covariance (45 minutes)

In [None]:
# Setup and imports
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set style and random seed for reproducibility
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

#### Task 1: Visualize Discrete Distributions (10 minutes)

In [None]:
# No deliverable in this cell, just run the cell and carefully observe the output

# Example: Bernoulli Distribution
p = 0.3  # Probability of disease
bernoulli_samples = stats.bernoulli.rvs(p, size=1000)

theoretical_mean = p
empirical_mean = np.mean(bernoulli_samples)

theoretical_variance = p * (1 - p)
empirical_variance = np.var(bernoulli_samples)

print(f"Bernoulli with p={p}:")
print(f"  Empirical mean: {empirical_mean:.3f}")
print(f"  Theoretical mean: {theoretical_mean:.3f}")
print(f"  Empirical variance: {empirical_variance:.3f}")
print(f"  Theoretical variance: {theoretical_variance:.3f}")

In [None]:
# TODO 1: Generate and visualize a Binomial distribution
# Clinical scenario: 20 patients, each with 30% chance of responding to treatment
n_trials = 20
p_success = 0.3

# Generate 1000 binomial samples modeling the number of responders
# Hint: Use stats.binom.rvs
# YOUR CODE HERE
binomial_samples =

# Create a histogram of the binomial samples
# Count how often each possible outcome (0 to n_trials) occurs
# and store the counts in a NumPy array
# Hint: Use np.sum to count occurrences of each outcome
# YOUR CODE HERE


# Create a figure showing the histogram and theoretical PMF
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)

# Plot theoretical PMF overlay, scaled by the 1000 samples to give expected counts
x = np.arange(0, n_trials + 1)
pmf = stats.binom.pmf(x, n_trials, p_success)
plt.plot(x, pmf * 1000, 'r-', lw=2)

# YOUR CODE HERE: Plot your histogram

plt.title('Binomial Distribution: Treatment Response')
plt.xlabel('Number of Responders')
plt.ylabel('Count (out of 1000 samples)')
plt.legend(['Theoretical PMF', 'Observed'])

#### Task 2: Explore Continuous Distributions (10 minutes)

In [None]:
# TODO 2: Generate and visualize a Normal distribution
# Clinical scenario: Systolic blood pressure
mu = 120  # Mean BP
sigma = 15  # Standard deviation

# Generate 1000 patient measurements
# Hint: Use stats.norm.rvs
# YOUR CODE HERE
normal_samples =

# Create a histogram of the normal samples and compare it to the theoretical PDF
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)

# YOUR CODE HERE: Create and plot histogram from your samples
# Hint: Use np.histogram with density=True so the bar heights are comparable to the PDF

# Overlay theoretical PDF
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)
plt.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma), 'r-', lw=2)
plt.title('Normal Distribution: Blood Pressure')
plt.xlabel('Systolic BP (mmHg)')
plt.ylabel('Density')


#### Task 3: Understanding 2D Joint Distributions (15 minutes)

In [None]:
# No deliverable in this cell, just run the cell and carefully observe the output

# Generate correlated biomarker data
# Biomarker A: Amyloid-β (μg/mL)
# Biomarker B: Tau protein (μg/mL)

# Define means
mean_A = 50  # Average Amyloid-β concentration
mean_B = 30  # Average Tau concentration
means = np.array([mean_A, mean_B])

# Define standard deviations
std_A = 10
std_B = 8

# Create three different covariance matrices
# Case 1: No correlation (ρ = 0)
cov_matrix_1 = np.array([[std_A**2, 0],
                         [0, std_B**2]])

# Case 2: Positive correlation (ρ = 0.7)
rho_2 = 0.7
cov_matrix_2 = np.array([[std_A**2, rho_2 * std_A * std_B],
                         [rho_2 * std_A * std_B, std_B**2]])

# Case 3: Negative correlation (ρ = -0.5)
rho_3 = -0.5
cov_matrix_3 = np.array([[std_A**2, rho_3 * std_A * std_B],
                         [rho_3 * std_A * std_B, std_B**2]])

# Generate samples
n_samples = 500
data_1 = np.random.multivariate_normal(means, cov_matrix_1, n_samples)
data_2 = np.random.multivariate_normal(means, cov_matrix_2, n_samples)
data_3 = np.random.multivariate_normal(means, cov_matrix_3, n_samples)

In [None]:
# No deliverable in this cell, just run the cell and carefully observe the output

# Visualize the three cases
from matplotlib.patches import Ellipse

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

datasets = [data_1, data_2, data_3]
titles = ['No Correlation (ρ=0)', 'Positive Correlation (ρ=0.7)', 'Negative Correlation (ρ=-0.5)']
cov_matrices = [cov_matrix_1, cov_matrix_2, cov_matrix_3]

for i, (data, title, cov_mat) in enumerate(zip(datasets, titles, cov_matrices)):
    # Scatter plot
    ax = axes[0, i]
    ax.scatter(data[:, 0], data[:, 1], alpha=0.5, s=20)
    ax.set_xlabel('Amyloid-β (μg/mL)')
    ax.set_ylabel('Tau (μg/mL)')
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    # Add 1- and 2-standard-deviation ellipses, aligned with the
    # eigenvectors of the (symmetric) covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_mat)
    angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))
    
    for n_std in [1, 2]:
        width, height = 2 * n_std * np.sqrt(eigenvalues)
        ellipse = Ellipse(means, width, height, angle=angle, 
                         facecolor='none', edgecolor='red' if n_std==1 else 'blue',
                         linewidth=2, alpha=0.8, linestyle='--')
        ax.add_patch(ellipse)
    
    # Heatmap of covariance matrix
    ax = axes[1, i]
    im = ax.imshow(cov_mat, cmap='coolwarm', aspect='auto')
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])
    ax.set_xticklabels(['Amyloid-β', 'Tau'])
    ax.set_yticklabels(['Amyloid-β', 'Tau'])
    
    # Add text annotations
    for j in range(2):
        for k in range(2):
            text = ax.text(k, j, f'{cov_mat[j, k]:.1f}',
                          ha="center", va="center", color="white", fontsize=12)
    
    ax.set_title('Covariance Matrix')
    plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()

In [None]:
# TODO 3: Calculate and interpret the covariance matrix from sampled data
# Use the positively correlated dataset (data_2)

# Calculate empirical covariance matrix
# YOUR CODE HERE
# Hint: use np.cov with rowvar=False
# Use data_2!
empirical_cov =

print("Theoretical Covariance Matrix:")
print(cov_matrix_2)
print("\nEmpirical Covariance Matrix (from data):")
print(empirical_cov)

# Extract and interpret components

# YOUR CODE HERE - extract variance of A
var_A_empirical =

# YOUR CODE HERE - extract variance of B
var_B_empirical =

# YOUR CODE HERE - extract covariance
cov_AB_empirical =

# Calculate correlation coefficient
# YOUR CODE HERE - calculate correlation from covariance and variances
# Hint: correlation = covariance / (std_A * std_B), where the standard deviations
# are the square roots of the empirical variances extracted above
correlation_empirical =

print(f"\nInterpretation:")
print(f"  Variance of Amyloid-β: {var_A_empirical:.2f} (μg/mL)²")
print(f"  Variance of Tau: {var_B_empirical:.2f} (μg/mL)²")
print(f"  Covariance: {cov_AB_empirical:.2f} (μg/mL)²")
print(f"  Correlation coefficient: {correlation_empirical:.3f}")
print(f"\nMeaning: When Amyloid-β is above average, Tau tends to be "
      f"{'above' if cov_AB_empirical > 0 else 'below'} average too.")

## Micro-Deliverable
- Due at the end of class
- Copy the notebook, rename it `lecture02_yourname.ipynb`, and place it in your `my-work` folder.
- Complete the code in all instances of `# YOUR CODE HERE` in the notebook.

### Challenge Problem (Optional)

In [None]:
# TODO 4 [Optional]: Predict one biomarker from the other using linear regression
# Linear regression formula: B = α + β*A
# where β = Cov(A,B)/Var(A) and α = mean(B) - β*mean(A)

# Calculate regression coefficients
# YOUR CODE HERE
beta =
alpha =

# Make predictions
# YOUR CODE HERE
A_values = np.linspace(20, 80, 100)
B_predicted =

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(data_2[:, 0], data_2[:, 1], alpha=0.5, label='Observed')
plt.plot(A_values, B_predicted, 'r-', linewidth=2, label='Regression line')
plt.xlabel('Amyloid-β (μg/mL)')
plt.ylabel('Tau (μg/mL)')
plt.title('Predicting Tau from Amyloid-β')
plt.legend()
plt.grid(True, alpha=0.3)

# Add an approximate 95% prediction band (±2 residual standard deviations)
residual_std = np.std(data_2[:, 1] - (alpha + beta * data_2[:, 0]))
plt.fill_between(A_values,
                 B_predicted - 2*residual_std,
                 B_predicted + 2*residual_std,
                 alpha=0.2, color='red', label='Approx. 95% prediction band')
plt.legend()
plt.show()

print(f"Regression equation: Tau = {alpha:.2f} + {beta:.2f} × Amyloid-β")
print(f"Interpretation: For each 1 μg/mL increase in Amyloid-β, ")
print(f"               Tau increases by {beta:.2f} μg/mL on average")