# 05: Genomics Applications

**Module 1.1: Calculus & Optimization - Integration**

## Learning Objectives

This notebook ties together all concepts from Module 1.1 with real genomics applications:
1. Understand DESeq2's IRLS algorithm (Newton-like optimization)
2. See why condition numbers matter for stable analysis
3. Compare convex (DESeq2) vs non-convex (scVI) optimization
4. Implement gradient descent for gene expression prediction

## Resources
- DESeq2 paper: Love et al., 2014
- scVI paper: Lopez et al., 2018

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd.functional import hessian

torch.manual_seed(42)
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)

---
## 1. The Optimization Landscape in Genomics

| Tool | Method | Convex? | Reproducible? |
|------|--------|---------|---------------|
| **DESeq2** | IRLS (Newton-like) | ✓ Yes | ✓ Same result every time |
| **edgeR** | IRLS | ✓ Yes | ✓ Same result every time |
| **limma** | Least squares | ✓ Yes | ✓ Same result every time |
| **scVI** | Adam (SGD) | ✗ No | ⚠️ Seed-dependent |
| **scANVI** | Adam | ✗ No | ⚠️ Seed-dependent |

In [None]:
# Visualize convex vs non-convex
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x = np.linspace(-3, 3, 100)

# Convex (DESeq2-like)
y_convex = x**2 + 1
axes[0].plot(x, y_convex, 'b-', linewidth=2)
axes[0].scatter([0], [1], color='green', s=200, zorder=5, label='Global minimum')
axes[0].set_title('Convex Loss (DESeq2, limma)\nOne minimum, always found', fontsize=12)
axes[0].set_xlabel('Parameter')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Non-convex (Neural network-like)
y_nonconvex = np.sin(3*x) + 0.5*x**2
axes[1].plot(x, y_nonconvex, 'r-', linewidth=2)
# Locate the minima numerically: interior grid points lower than both neighbors
is_min = (y_nonconvex[1:-1] < y_nonconvex[:-2]) & (y_nonconvex[1:-1] < y_nonconvex[2:])
min_idx = np.where(is_min)[0] + 1
global_idx = min_idx[np.argmin(y_nonconvex[min_idx])]
local_idx = min_idx[min_idx != global_idx]
axes[1].scatter(x[local_idx], y_nonconvex[local_idx], color='orange', s=150, zorder=5, label='Local minima')
axes[1].scatter([x[global_idx]], [y_nonconvex[global_idx]], color='green', s=200, zorder=5, label='Global minimum')
axes[1].set_title('Non-convex Loss (scVI, Neural Nets)\nMultiple minima, result depends on initialization', fontsize=12)
axes[1].set_xlabel('Parameter')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 2. DESeq2's IRLS Algorithm

DESeq2 uses **Iteratively Reweighted Least Squares** (IRLS), a Newton-like method for fitting generalized linear models (GLMs).

### The Model

$$K_{ij} \sim \text{NegativeBinomial}(\mu_{ij}, \alpha_i)$$

$$\log(\mu_{ij}) = X_j \cdot \beta_i$$

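where:
- $K_{ij}$ = counts for gene $i$, sample $j$
- $\mu_{ij}$ = expected count
- $\alpha_i$ = dispersion for gene $i$
- $X$ = design matrix, with $X_j$ the row for sample $j$
- $\beta_i$ = coefficients for gene $i$

This is a simplified form: DESeq2 also scales the mean by a per-sample size factor $s_j$ ($\mu_{ij} = s_j q_{ij}$) and models $\log_2 q_{ij}$ rather than the natural log.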
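
To make the mean/dispersion parameterization concrete, the sketch below draws counts from a negative binomial with mean $\mu$ and dispersion $\alpha$, for which the variance is $\mu + \alpha\mu^2$. The helper name `sample_nb`, the values of `mu` and `alpha`, and the mapping onto NumPy's `(n, p)` arguments are choices made for this illustration; none of it is code from DESeq2.

```python
import numpy as np

def sample_nb(mu, alpha, size, rng):
    """Draw NB counts with mean mu and dispersion alpha (hypothetical helper)."""
    n = 1.0 / alpha                # NumPy's "number of successes"; may be non-integer
    p = 1.0 / (1.0 + alpha * mu)   # success probability that yields mean mu
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
mu, alpha = 100.0, 0.1             # illustrative values
counts = sample_nb(mu, alpha, size=200_000, rng=rng)

print(f"sample mean:     {counts.mean():.1f}  (target {mu:.1f})")
print(f"sample variance: {counts.var():.1f}  (target {mu + alpha * mu**2:.1f})")
```

With $\alpha = 0$ the variance would collapse to $\mu$, the Poisson case used in the simplified IRLS code.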

### IRLS Update

$$\beta^{(t+1)} = (X^T W X)^{-1} X^T W z$$

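where $W$ is a diagonal weight matrix and $z$ is the working response, both recomputed from the current $\beta^{(t)}$ at every iteration. For a Poisson model with log link, $W = \text{diag}(\mu)$ and $z = \eta + (y - \mu)/\mu$ with $\eta = X\beta^{(t)}$.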
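
One way to see why IRLS is called Newton-like: for a Poisson model with log link, the negative log-likelihood has gradient $-X^T(y - \mu)$ and Hessian $X^T W X$ with $W = \text{diag}(\mu)$, so a Newton step and the IRLS update give the same $\beta^{(t+1)}$. The sketch below checks this numerically on simulated data; the sample size, coefficients, and starting point are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([2.0, 0.5])                       # assumed values for the demo
y = rng.poisson(np.exp(X @ beta_true))

beta = np.array([1.5, 0.0])    # arbitrary current iterate
eta = X @ beta
mu = np.exp(eta)

# IRLS form: weighted least squares on the working response z
z = eta + (y - mu) / mu
XtWX = X.T @ (mu[:, None] * X)                 # X' W X with W = diag(mu)
beta_irls = np.linalg.solve(XtWX, X.T @ (mu * z))

# Newton form: beta - H^{-1} g for the negative log-likelihood
grad = -X.T @ (y - mu)                         # gradient g
beta_newton = beta - np.linalg.solve(XtWX, grad)  # Hessian H = X' W X

print("IRLS step:  ", beta_irls)
print("Newton step:", beta_newton)
print("Agree:", np.allclose(beta_irls, beta_newton))
```

The two updates should agree to floating-point precision, since $X^T W z = X^T W X \beta + X^T (y - \mu)$.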

In [None]:
# Simplified IRLS for Poisson regression (simpler than NB)
def irls_poisson(X, y, max_iter=25, tol=1e-8):
    """
    IRLS for Poisson regression: y ~ Poisson(exp(X @ beta))

    DESeq2's coefficient fit follows the same scheme, with a negative
    binomial likelihood in place of the Poisson.
    """
    # Initialize from a least-squares fit to log(y + 0.5). Starting at
    # beta = 0 (mu = 1) makes the first step overshoot badly for large
    # counts, and the iteration can then need far more than max_iter
    # steps to come back.
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]

    history = {'beta': [beta.copy()], 'deviance': []}

    for iteration in range(max_iter):
        # Current predictions
        eta = X @ beta
        mu = np.exp(eta)

        # Deviance of the current fit; the y*log(y/mu) term is 0 when y == 0
        y_pos = np.where(y > 0, y, 1.0)
        deviance = 2 * np.sum(y * np.log(y_pos / mu) - (y - mu))
        history['deviance'].append(deviance)

        # Working response (Poisson weights are W = diag(mu))
        z = eta + (y - mu) / mu

        # IRLS update: solve the weighted least squares system
        # (X'WX) beta = X'Wz, without forming the n x n matrix W
        XtWX = X.T @ (mu[:, None] * X)
        XtWz = X.T @ (mu * z)
        beta_new = np.linalg.solve(XtWX, XtWz)

        # Check convergence
        change = np.max(np.abs(beta_new - beta))
        beta = beta_new
        history['beta'].append(beta.copy())

        if change < tol:
            print(f"Converged in {iteration + 1} iterations")
            break

    return beta, history

# Simulate gene expression data
np.random.seed(42)
n_samples = 50

# Design matrix: intercept + treatment
X = np.column_stack([
    np.ones(n_samples),           # Intercept
    np.array([0]*25 + [1]*25)     # Treatment (0 = control, 1 = treated)
])

# True coefficients: baseline = 5 (log scale), treatment effect = 1
beta_true = np.array([5.0, 1.0])

# Generate counts
mu_true = np.exp(X @ beta_true)
y = np.random.poisson(mu_true)

print("Simulated gene expression analysis")
print(f"True coefficients: {beta_true}")
print(f"Control mean count: {np.exp(beta_true[0]):.1f}")
print(f"Treatment fold change: {np.exp(beta_true[1]):.2f}x")

# Run IRLS
beta_est, history = irls_poisson(X, y)
print(f"\nEstimated coefficients: {beta_est}")
print(f"Estimated fold change: {np.exp(beta_est[1]):.2f}x")

In [None]:
# Visualize IRLS convergence
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

betas = np.array(history['beta'])
axes[0].plot(betas[:, 0], 'b-o', label='β₀ (intercept)')
axes[0].plot(betas[:, 1], 'r-o', label='β₁ (treatment)')
axes[0].axhline(beta_true[0], color='b', linestyle='--', alpha=0.5)
axes[0].axhline(beta_true[1], color='r', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Coefficient value')
axes[0].set_title('IRLS Convergence')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(history['deviance'], 'g-o')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Deviance')
axes[1].set_title('Deviance (lower = better fit)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 3. Condition Numbers: When Analysis Fails

The **condition number** $\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$ tells us how numerically stable a matrix is.

- $\kappa \approx 1$: Well-conditioned, stable
- $\kappa > 10^{10}$: Ill-conditioned, numerical issues
- $\kappa = \infty$: Singular, no unique solution
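
Conditioning matters even more for IRLS than these thresholds suggest, because the update solves a system in $X^T W X$, and forming $X^T X$ squares the condition number: $\kappa(X^T X) = \kappa(X)^2$. The sketch below checks this with `np.linalg.cond` on a small design with two nearly identical covariates; the matrix and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.arange(n, dtype=float)
# Intercept plus two nearly identical covariates (illustrative values)
X = np.column_stack([np.ones(n), t, t + 1e-3 * rng.normal(size=n)])

kappa_X = np.linalg.cond(X)            # largest / smallest singular value of X
kappa_XtX = np.linalg.cond(X.T @ X)    # conditioning of the normal-equations matrix

print(f"kappa(X)     = {kappa_X:.2e}")
print(f"kappa(X^T X) = {kappa_XtX:.2e}")
print(f"kappa(X)^2   = {kappa_X**2:.2e}")
```

A design that looks only moderately ill-conditioned can therefore lose roughly twice as many digits once the normal equations are formed.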

In [None]:
def analyze_design_matrix(X, name=""):
    """Analyze numerical stability of design matrix."""
    # Singular values only (U and Vh are not needed)
    S = np.linalg.svd(X, compute_uv=False)
    
    # Condition number
    kappa = S[0] / S[-1] if S[-1] > 1e-15 else np.inf
    
    # Rank
    rank = np.sum(S > 1e-10)
    
    print(f"\n{name}")
    print(f"  Shape: {X.shape}")
    print(f"  Rank: {rank} / {min(X.shape)}")
    print(f"  Condition number: {kappa:.2e}")
    print(f"  Singular values: {S[:5].round(4)}...")
    
    if kappa > 1e10:
        print("  ⚠️ WARNING: Ill-conditioned! Results may be unstable.")
    elif kappa > 1e6:
        print("  ⚠️ CAUTION: Moderately ill-conditioned.")
    else:
        print("  ✓ Well-conditioned.")
    
    return kappa

# Good design matrix
X_good = np.column_stack([
    np.ones(100),
    np.array([0]*50 + [1]*50),
    np.random.randn(100)
])
analyze_design_matrix(X_good, "Good design (balanced, independent)")

# Collinear design (bad!)
X_collinear = np.column_stack([
    np.ones(100),
    np.arange(100),
    np.arange(100) + np.random.randn(100) * 1e-9  # Almost identical to previous; noise this small pushes kappa past 1e10
])
analyze_design_matrix(X_collinear, "Collinear design (batch ≈ time)")

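# Imbalanced design: numerically well-conditioned (kappa is about 10), but the
# treatment effect rests on a single sample, so the problem is statistical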
X_imbalanced = np.column_stack([
    np.ones(100),
    np.array([0]*99 + [1]*1)  # Only 1 treated sample!
])
analyze_design_matrix(X_imbalanced, "Imbalanced design (1 vs 99)")

---
## 4. Neural Network for Gene Expression (scVI-like)

Unlike DESeq2's convex optimization, neural networks have non-convex losses.
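
The effect of initialization can be seen on the one-dimensional toy loss plotted in Section 1, $f(x) = \sin(3x) + 0.5x^2$. The sketch below runs plain gradient descent from three starting points; the learning rate, step count, and starting points are assumptions chosen for the illustration, and the function names are hypothetical.

```python
import numpy as np

def loss(x):
    return np.sin(3 * x) + 0.5 * x**2

def grad(x):
    # Derivative of the loss with respect to x
    return 3 * np.cos(3 * x) + x

def gradient_descent(x0, lr=0.05, steps=500):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

starts = [-2.5, 0.0, 2.0]
finals = [gradient_descent(x0) for x0 in starts]
for x0, xf in zip(starts, finals):
    print(f"start {x0:+.1f} -> x = {xf:+.3f}, loss = {loss(xf):.3f}")
```

Each start settles in a different basin, and only one of them reaches the lowest loss. A neural network behaves similarly in many more dimensions, with the random seed playing the role of the starting point.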

In [None]:
# Simple autoencoder for gene expression
class GeneExpressionAutoencoder(nn.Module):
    def __init__(self, n_genes, n_latent=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 128),
            nn.ReLU(),
            nn.Linear(128, n_latent)
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128),
            nn.ReLU(),
            nn.Linear(128, n_genes)
        )
    
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Simulate gene expression data
n_genes = 100
n_cells = 500

# Create data with structure (two cell types)
torch.manual_seed(42)
cell_type = torch.randint(0, 2, (n_cells,))
X = torch.randn(n_cells, n_genes)
X[cell_type == 1, :50] += 2  # Cell type 1 has higher expression of first 50 genes

print(f"Data: {n_cells} cells × {n_genes} genes")
print(f"Cell types: {(cell_type == 0).sum()} type A, {(cell_type == 1).sum()} type B")

In [None]:
# Train with different random seeds → different results!
def train_autoencoder(X, seed, epochs=100):
    torch.manual_seed(seed)
    model = GeneExpressionAutoencoder(n_genes=X.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    
    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        X_recon = model(X)
        loss = nn.MSELoss()(X_recon, X)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    
    return model, losses

# Train with 3 different seeds
results = {}
for seed in [1, 42, 123]:
    model, losses = train_autoencoder(X, seed)
    results[seed] = {'model': model, 'losses': losses, 'final_loss': losses[-1]}
    print(f"Seed {seed}: Final loss = {losses[-1]:.4f}")

print("\n⚠️ Different seeds → different final losses!")
print("   This is why scVI results vary with random seed.")

In [None]:
# Visualize different training trajectories
plt.figure(figsize=(10, 5))

for seed, data in results.items():
    plt.plot(data['losses'], label=f'Seed {seed} (final: {data["final_loss"]:.4f})')

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Non-convex optimization: Different seeds → Different local minima')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

---
## 5. Checking the Hessian at Convergence

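If the gradient is zero and the Hessian is **positive definite** (all eigenvalues positive), the point is a strict local minimum. A negative eigenvalue means the point is a saddle point or a maximum instead.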
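
For a mean-squared-error loss $\frac{1}{n}\lVert Xp - y \rVert^2$ the Hessian has a closed form, $H = \frac{2}{n} X^T X$, which does not depend on $p$ or $y$. The sketch below uses it as an independent check on the eigenvalue test: a full-rank design gives strictly positive eigenvalues, while a redundant column produces a zero eigenvalue. The helper `lsq_hessian` and both matrices are hypothetical examples built for this illustration.

```python
import numpy as np

def lsq_hessian(X):
    """Hessian of mean((X @ params - y)**2) with respect to params."""
    n = X.shape[0]
    return (2.0 / n) * (X.T @ X)

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 3))                  # independent columns
X_deficient = np.column_stack([
    X_full[:, 0],
    X_full[:, 1],
    X_full[:, 0] + X_full[:, 1],                    # redundant third column
])

eig_full = np.linalg.eigvalsh(lsq_hessian(X_full))          # ascending order
eig_deficient = np.linalg.eigvalsh(lsq_hessian(X_deficient))

print("full rank:     ", np.round(eig_full, 4))
print("rank deficient:", np.round(eig_deficient, 4))
```

A zero eigenvalue means the loss is flat along some direction, so the minimizer is not unique; this is the same failure that an infinite condition number signals for the design matrix.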

In [None]:
# For a simple model, check Hessian at convergence
def simple_loss(params, X, y):
    """Simple quadratic loss."""
    return torch.mean((X @ params - y)**2)

# Generate data
torch.manual_seed(42)
n, p = 100, 3
X_data = torch.randn(n, p)
true_params = torch.tensor([1.0, -0.5, 2.0])
y_data = X_data @ true_params + 0.1 * torch.randn(n)

# Find optimal parameters
params = torch.randn(p, requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    loss = simple_loss(params, X_data, y_data)
    loss.backward()
    optimizer.step()

print(f"Optimized params: {params.detach().numpy().round(3)}")
print(f"True params:      {true_params.numpy()}")

# Compute Hessian at solution
def loss_fn(p):
    return simple_loss(p, X_data, y_data)

H = hessian(loss_fn, params.detach())
eigenvalues = torch.linalg.eigvalsh(H)

print(f"\nHessian eigenvalues: {eigenvalues.numpy().round(4)}")
if torch.all(eigenvalues > 0):
    print("✓ All positive → Proper minimum!")
else:
    print("⚠️ Not all positive → Saddle point or maximum!")

---
## Summary: Module 1.1 Complete!

### What You Learned

| Concept | Genomics Application |
|---------|----------------------|
| **Gradient** | Direction to update model parameters |
| **Hessian** | Curvature, used in IRLS (DESeq2) |
| **Positive definite** | Confirms you found a minimum |
| **Jacobian** | Backprop through neural network layers |
| **Autodiff** | How PyTorch computes gradients |
| **Condition number** | Numerical stability of design matrix |
| **Convex vs non-convex** | Why DESeq2 returns the same fit on every run while scVI depends on the random seed |

### Key Takeaways

1. **DESeq2/edgeR:** Newton-like methods (IRLS), convex optimization, reproducible
2. **scVI/Neural nets:** Gradient descent, non-convex, seed-dependent
3. **Condition numbers:** Check design matrix before running analysis
4. **Hessian eigenvalues:** Verify you're at a minimum, not saddle point

---
## Exercises

### Exercise 1: IRLS for Logistic Regression
Modify the IRLS code for binary classification (cell type A vs B).

### Exercise 2: Condition Number Experiments
Create design matrices with varying condition numbers and see how this affects coefficient estimates.

### Exercise 3: Reproducibility Test
Train the autoencoder 10 times with different seeds. Plot the distribution of final losses.

### Exercise 4: Real Data
Load a real scRNA-seq dataset and analyze the design matrix condition number.

In [None]:
# Your solutions here


---
## 🎉 Module 1.1 Complete!

**Next:** Module 1.2 - Linear Systems & Least Squares

You'll learn:
- LU and QR decomposition
- Why QR is more stable than normal equations
- Ridge and LASSO regularization
- Applications to gene expression modeling