# 05: Genomics Applications

**Module 1.2: Linear Systems & Least Squares - Integration**

## Learning Objectives

This notebook ties together Module 1.2 with genomics:
1. DESeq2's per-gene linear systems
2. Design matrix diagnostics
3. When to use regularization
4. A complete analysis pipeline

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(42)
np.random.seed(42)

---
## 1. DESeq2: Per-Gene Linear Systems

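For each gene, DESeq2 fits a negative binomial GLM by iteratively reweighted least squares (IRLS). Each iteration solves a small weighted least-squares system:
$$(X^\top W X)\beta = X^\top W z$$

Where:
- $X$: design matrix (samples × covariates)
- $W$: diagonal matrix of per-sample weights
- $z$: working response, updated at each iteration
- $\beta$: coefficients to estimate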

In [None]:
def deseq2_system_size(n_samples, n_covariates):
    """Analyze DESeq2's linear system size."""
    print(f"Samples: {n_samples}, Covariates: {n_covariates}")
    print(f"X shape: {n_samples} × {n_covariates}")
    print(f"X'WX shape: {n_covariates} × {n_covariates}")
    print(f"→ Inverting {n_covariates}×{n_covariates} matrix (trivial!)")

print("Typical bulk RNA-seq:")
deseq2_system_size(100, 5)
print("\nLarge scRNA-seq:")
deseq2_system_size(50000, 5)
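
The weighted system from this section can be solved for a single gene in a few lines. The sketch below uses NumPy and made-up values for `X`, `w`, and `z` (they are illustrative assumptions, not DESeq2 output), and it shows one weighted solve only, not DESeq2's full iterative fit. It checks that solving $(X^\top W X)\beta = X^\top W z$ directly gives the same answer as ordinary least squares on rows scaled by $\sqrt{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_covariates = 12, 3

# Hypothetical inputs for one gene (illustrative values, not DESeq2 output)
X = np.column_stack([np.ones(n_samples),
                     rng.standard_normal((n_samples, n_covariates - 1))])
w = rng.uniform(0.5, 1.5, size=n_samples)   # diagonal of W, kept positive
z = rng.standard_normal(n_samples)          # working response

# Form the small system (X^T W X) beta = X^T W z without building diag(w)
XtWX = X.T @ (w[:, None] * X)
XtWz = X.T @ (w * z)
beta_normal = np.linalg.solve(XtWX, XtWz)

# Same estimate from ordinary least squares on rows scaled by sqrt(w)
sqrt_w = np.sqrt(w)
beta_scaled, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * z, rcond=None)

print("X^T W X shape:", XtWX.shape)
print("max difference:", np.abs(beta_normal - beta_scaled).max())
```

The system size depends only on the number of covariates, which is why the per-gene solve stays cheap even with tens of thousands of samples.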

---
## 2. Design Matrix Diagnostics

In [None]:
def diagnose_design(X, names=None):
    """Check design matrix health."""
    m, n = X.shape
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    
    kappa = (S[0] / S[-1]).item() if S[-1] > 1e-15 else float('inf')
    rank = (S > S[0] * 1e-10).sum().item()
    
    print(f"Shape: {m} × {n}")
    print(f"Rank: {rank}/{n}")
    print(f"Condition: {kappa:.1e}")
    
    # Test rank first: a rank-deficient design also has a huge (or infinite) kappa
    if rank < n:
        print("⚠️ RANK DEFICIENT")
    elif kappa > 1e6:
        print("⚠️ ILL-CONDITIONED")
    else:
        print("✓ OK")

# Good design
print("Balanced design:")
X_good = torch.tensor([[1.,0.,0.],[1.,0.,1.],[1.,1.,0.],[1.,1.,1.]], dtype=torch.float64)
diagnose_design(X_good)

print("\nConfounded design:")
X_bad = torch.tensor([[1.,0.,0.],[1.,0.,0.],[1.,1.,1.],[1.,1.,1.]], dtype=torch.float64)
diagnose_design(X_bad)
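
Once a design is flagged as rank deficient, the next question is usually which covariates are confounded. One possible approach, sketched here in NumPy, is to inspect the right singular vectors that belong to the (numerically) zero singular values: their nonzero entries mark the columns involved in a linear dependency. The helper `find_collinear_columns`, the column names, and both tolerances are hypothetical choices made for illustration.

```python
import numpy as np

def find_collinear_columns(X, names, tol=1e-10, coef_tol=1e-8):
    """Return one list of column names per linear dependency found in X."""
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Rows of Vt paired with negligible singular values span the null space of X
    null_vectors = Vt[S <= S[0] * tol]
    return [[name for name, c in zip(names, v) if abs(c) > coef_tol]
            for v in null_vectors]

# Same confounded design as above: treatment and batch columns are identical
X_bad = np.array([[1., 0., 0.],
                  [1., 0., 0.],
                  [1., 1., 1.],
                  [1., 1., 1.]])
names = ["intercept", "treatment", "batch"]  # assumed labels for the columns

print(find_collinear_columns(X_bad, names))
```

For this design the only dependency involves the treatment and batch columns, so the intercept is not reported.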

---
## 3. When to Use Regularization

| Scenario | Method | Why |
|----------|--------|-----|
| samples > covariates, good design | OLS (QR) | Well-conditioned; no regularization needed |
| samples > covariates, ill-conditioned | Ridge | Stabilizes the estimates |
| covariates > samples | Ridge/LASSO | OLS solution is not unique |
| Feature selection needed | LASSO | Drives some coefficients exactly to zero |

In [None]:
def recommend_method(n_samples, n_features, want_sparse=False):
    """Recommend regression method."""
    if n_samples > n_features * 5:
        if want_sparse:
            return "LASSO (for feature selection)"
        return "OLS via QR (plenty of samples)"
    elif n_samples > n_features:
        return "Ridge (some regularization helps)"
    else:
        if want_sparse:
            return "LASSO (sparse, underdetermined)"
        return "Ridge (underdetermined, need regularization)"

scenarios = [
    (100, 5, False, "Bulk RNA-seq, few covariates"),
    (50, 1000, False, "Predict phenotype from genes"),
    (50, 1000, True, "Find biomarker genes"),
    (100, 50, False, "Moderate features"),
]

for n, p, sparse, desc in scenarios:
    print(f"{desc}:")
    print(f"  {n} samples, {p} features → {recommend_method(n, p, sparse)}\n")
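
To make the "covariates > samples" row concrete, here is a minimal NumPy sketch of closed-form ridge regression, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, on random data. The function name `ridge_fit`, the data, and the $\lambda$ values are illustrative assumptions; in practice $\lambda$ would be chosen by cross-validation or a similar procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50           # more features than samples
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# rank(X) <= n_samples < n_features, so X^T X alone is singular
rank_X = np.linalg.matrix_rank(X)
print(f"rank(X) = {rank_X}, features = {n_features}")

# Larger lam shrinks the coefficient vector more strongly
norms = []
for lam in (0.1, 1.0, 10.0):
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda = {lam:5.1f}  ||beta|| = {norms[-1]:.4f}")
```

Adding $\lambda I$ with $\lambda > 0$ makes the matrix positive definite, so the solve succeeds even though the unregularized system has no unique solution.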

---
## 4. Complete Analysis Pipeline

In [None]:
# Simulate complete gene expression analysis
n_samples, n_genes, n_covariates = 50, 100, 4

# Design: Intercept + Treatment + Batch + Continuous
X = torch.zeros(n_samples, n_covariates, dtype=torch.float64)
X[:, 0] = 1  # Intercept
X[:25, 1] = 0; X[25:, 1] = 1  # Treatment
X[::2, 2] = 0; X[1::2, 2] = 1  # Batch (balanced!)
X[:, 3] = torch.randn(n_samples, dtype=torch.float64)  # Continuous

# True effects for each gene
true_betas = torch.randn(n_covariates, n_genes, dtype=torch.float64)
true_betas[1, :] *= 2  # Strong treatment effect

# Gene expression
Y = X @ true_betas + 0.5 * torch.randn(n_samples, n_genes, dtype=torch.float64)

# Solve using QR
Q, R = torch.linalg.qr(X)
beta_hat = torch.linalg.solve_triangular(R, Q.T @ Y, upper=True)

# Check accuracy
error = (beta_hat - true_betas).norm() / true_betas.norm()
print("Analysis complete!")
print(f"  {n_genes} genes analyzed")
print(f"  Relative error: {error:.4f}")
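
The pipeline solves through QR rather than forming $X^\top X$. One reason is conditioning: in the 2-norm, the condition number of $X^\top X$ is the square of that of $X$. The NumPy sketch below checks this numerically on a hypothetical design with two nearly collinear covariates; the noise scale `1e-4` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

# Hypothetical design with two nearly collinear covariates (illustrative only)
x1 = rng.standard_normal(n_samples)
x2 = x1 + 1e-4 * rng.standard_normal(n_samples)
X = np.column_stack([np.ones(n_samples), x1, x2])

kappa_X = np.linalg.cond(X)            # 2-norm condition number of X
kappa_XtX = np.linalg.cond(X.T @ X)    # condition number of the normal matrix

print(f"kappa(X)     = {kappa_X:.2e}")
print(f"kappa(X)^2   = {kappa_X**2:.2e}")
print(f"kappa(X^T X) = {kappa_XtX:.2e}")
```

The last two printed values should agree closely, which shows how much precision the normal equations give up on a design that QR still handles comfortably.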

---
## Module 1.2 Complete! 🎉

### Key Takeaways

| Topic | Key Insight |
|-------|-------------|
| System types | Whether the system is over- or underdetermined determines the method |
| Condition number | $\kappa(X^\top X) = \kappa(X)^2$, so avoid the normal equations! |
| QR decomposition | Stable least squares solution |
| Ridge | Makes $X^\top X + \lambda I$ invertible, shrinks coefficients |
| LASSO | Sparse solutions, feature selection |

### Next: Phase 2 - Matrix Decompositions (SVD, PCA, Eigendecomposition)