# 01: Linear Systems

**Module 1.2: Linear Systems & Least Squares**

## Learning Objectives

By the end of this notebook, you will:
1. Classify linear systems as overdetermined, underdetermined, or square
2. Understand when unique solutions exist (rank conditions)
3. Connect system types to genomics scenarios
4. See why DESeq2 solves small systems per gene

## Resources
- Solomon, *Numerical Algorithms*, Chapters 3-4
- Cohen, *Practical Linear Algebra*, Chapters 6-7

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(42)
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)

---
## 1. The Linear System $A\vec{x} = \vec{b}$

A system of linear equations:

$$\begin{align}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m
\end{align}$$

In matrix form: $A\vec{x} = \vec{b}$

Where:
- $A \in \mathbb{R}^{m \times n}$ (coefficient matrix)
- $\vec{x} \in \mathbb{R}^{n}$ (unknowns)
- $\vec{b} \in \mathbb{R}^{m}$ (right-hand side)
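
To make the shapes concrete, here is a minimal sketch using NumPy (which the notebook already imports). The 3×2 system and its numbers are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical system with m = 3 equations and n = 2 unknowns
A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])      # coefficient matrix, shape (m, n)
x = np.array([1., -1.])       # unknowns, shape (n,)
b = A @ x                     # right-hand side, shape (m,)

print(A.shape, x.shape, b.shape)
print(b)
```

Each entry of `b` is the dot product of one row of `A` with `x`, which is exactly one equation of the system written out above.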

---
## 2. Three Types of Systems

| Type | Shape | Equations vs Unknowns | Typical Solutions |
|------|-------|----------------------|-------------------|
| **Square** | $m = n$ | Equal | 0, 1, or ∞ |
| **Overdetermined** | $m > n$ | More equations | Usually none → least squares |
| **Underdetermined** | $m < n$ | Fewer equations | Infinitely many → regularization |
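
The classification in the table depends only on the shape of $A$. A small sketch, where `classify_by_shape` is a hypothetical helper name and not part of any library:

```python
import numpy as np

def classify_by_shape(A):
    """Label the system A x = b using only the shape of A (hypothetical helper)."""
    m, n = A.shape   # m equations, n unknowns
    if m == n:
        return "square"
    return "overdetermined" if m > n else "underdetermined"

for shape in [(2, 2), (3, 2), (1, 2)]:
    print(shape, "->", classify_by_shape(np.zeros(shape)))
```

Shape alone only suggests the typical behavior listed in the table. The actual number of solutions depends on the rank of $A$ and on $\vec{b}$.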

In [None]:
# Visualize 2D examples
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

x = np.linspace(-2, 4, 100)

# Square system (2 equations, 2 unknowns) - unique solution
ax = axes[0]
ax.plot(x, 2 - x, 'b-', linewidth=2, label='x + y = 2')
ax.plot(x, x, 'r-', linewidth=2, label='y = x')
ax.scatter([1], [1], color='green', s=200, zorder=5, label='Solution (1,1)')
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Square System (2×2)\nUnique Solution')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# Overdetermined (3 equations, 2 unknowns) - no exact solution
ax = axes[1]
ax.plot(x, 2 - x, 'b-', linewidth=2, label='x + y = 2')
ax.plot(x, x, 'r-', linewidth=2, label='y = x')
ax.plot(x, 0.5 + 0.3*x, 'g-', linewidth=2, label='y = 0.5 + 0.3x')
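# Least-squares point for the three lines written as A_over @ [x, y] = b_over
# (it does not coincide with the intersection of the first two lines)
A_over = np.array([[1.0, 1.0], [-1.0, 1.0], [-0.3, 1.0]])
b_over = np.array([2.0, 0.0, 0.5])
xy_ls, *_ = np.linalg.lstsq(A_over, b_over, rcond=None)
ax.scatter([xy_ls[0]], [xy_ls[1]], color='orange', s=200, zorder=5, marker='*', label='Best fit (least squares)')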
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Overdetermined (3×2)\nNo exact solution → Least Squares')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# Underdetermined (1 equation, 2 unknowns) - infinite solutions
ax = axes[2]
ax.plot(x, 2 - x, 'b-', linewidth=3, label='x + y = 2 (infinite solutions)')
ax.scatter([0, 1, 2], [2, 1, 0], color='green', s=100, zorder=5, label='Some solutions')
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Underdetermined (1×2)\nInfinite solutions → Need constraint')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

plt.tight_layout()
plt.show()

---
## 3. Genomics Context: The Historical Shift

### Bulk RNA-seq / Microarray Era (2000s)
- ~100 samples, ~20,000 genes
- Predicting phenotype from gene expression (~100 equations, ~20,000 unknowns): **underdetermined**
- Must use regularization (Ridge, LASSO)

### Single-cell Era (2020s)
- ~100,000 cells, ~20,000 genes
- Per-gene analysis with few covariates (~100,000 equations, a handful of unknowns): **overdetermined**
- Can use standard least squares
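
The contrast between the two regimes can be sketched on synthetic data. Everything here is an assumption for illustration: the values are random numbers rather than real expression data, the sizes are scaled down from those in the text so the cell runs quickly, and the ridge penalty `lam` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Wide case (phenotype from expression): few samples, many gene features
n_samples, n_genes = 20, 200           # illustrative sizes, smaller than in the text
X_wide = rng.normal(size=(n_samples, n_genes))
y = rng.normal(size=n_samples)
print("rank(X_wide) =", np.linalg.matrix_rank(X_wide), "with", n_genes, "columns")

# Ridge: X'X + lam*I is invertible for any lam > 0, even though X'X is singular here
lam = 1.0                              # hypothetical penalty strength
beta_ridge = np.linalg.solve(X_wide.T @ X_wide + lam * np.eye(n_genes),
                             X_wide.T @ y)

# Tall case (per-gene model with few covariates): many cells, few columns
n_cells, n_cov = 5000, 5
X_tall = rng.normal(size=(n_cells, n_cov))
y_gene = rng.normal(size=n_cells)
beta_ls, *_ = np.linalg.lstsq(X_tall, y_gene, rcond=None)
print("rank(X_tall) =", np.linalg.matrix_rank(X_tall), "with", n_cov, "columns")
```

In the wide case the rank cannot exceed the number of samples, so plain least squares has no unique minimizer and the penalty term is what selects one. In the tall case the design has full column rank and ordinary least squares is enough.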

In [None]:
# DESeq2's per-gene system
def analyze_deseq2_system(n_samples, n_covariates):
    """Analyze the linear system DESeq2 solves per gene."""
    print(f"DESeq2 per-gene analysis:")
    print(f"  Samples: {n_samples}")
    print(f"  Covariates: {n_covariates}")
    print(f"  Design matrix X: {n_samples} × {n_covariates}")
    # X'WX is the weighted normal-equations matrix; W is a diagonal matrix of per-sample weights
    print(f"  X'WX: {n_covariates} × {n_covariates}")
    
    if n_samples > n_covariates:
        print(f"  System type: OVERDETERMINED")
        print(f"  → Unique least squares solution if X has full column rank")
    elif n_samples < n_covariates:
        print(f"  System type: UNDERDETERMINED")
        print(f"  → Need regularization!")
    else:
        print(f"  System type: SQUARE")
        print(f"  → Unique solution if full rank")

print("Typical RNA-seq experiment:")
analyze_deseq2_system(n_samples=100, n_covariates=5)

print("\n" + "="*50 + "\n")

print("Large scRNA-seq experiment:")
analyze_deseq2_system(n_samples=50000, n_covariates=5)

---
## 4. Rank and Existence of Solutions

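Comparing the **rank** of $A$ with the rank of the augmented matrix $[A|b]$ determines whether solutions exist and whether they are unique (here $n$ is the number of unknowns):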

| Condition | Result |
|-----------|--------|
| $\text{rank}(A) = \text{rank}([A|b]) = n$ | Unique solution |
| $\text{rank}(A) = \text{rank}([A|b]) < n$ | Infinite solutions |
| $\text{rank}(A) < \text{rank}([A|b])$ | No solution |

For least squares, we care about $\text{rank}(X) = n$ (full column rank).

In [None]:
def check_system_solvability(A, b):
    """Check if a linear system has solutions."""
    m, n = A.shape
    
    # Augmented matrix [A|b]
    Ab = torch.cat([A, b.unsqueeze(1)], dim=1)
    
    # Compute ranks
    rank_A = torch.linalg.matrix_rank(A).item()
    rank_Ab = torch.linalg.matrix_rank(Ab).item()
    
    print(f"Matrix A: {m} × {n}")
    print(f"rank(A) = {rank_A}")
    print(f"rank([A|b]) = {rank_Ab}")
    print(f"Number of unknowns: {n}")
    
    if rank_A < rank_Ab:
        print("→ NO SOLUTION (inconsistent system)")
        return "none"
    elif rank_A == rank_Ab == n:
        print("→ UNIQUE SOLUTION")
        return "unique"
    else:
        print(f"→ INFINITE SOLUTIONS ({n - rank_A} free parameters)")
        return "infinite"

# Example 1: Unique solution
print("Example 1: Full rank system")
A1 = torch.tensor([[1., 2.], [3., 4.]])
b1 = torch.tensor([5., 6.])
check_system_solvability(A1, b1)

print("\n" + "="*40 + "\n")

# Example 2: Rank deficient (infinite solutions)
print("Example 2: Rank deficient (dependent rows)")
A2 = torch.tensor([[1., 2.], [2., 4.]])  # Row 2 = 2 × Row 1
b2 = torch.tensor([3., 6.])  # Consistent
check_system_solvability(A2, b2)

print("\n" + "="*40 + "\n")

# Example 3: Inconsistent (no solution)
print("Example 3: Inconsistent system")
A3 = torch.tensor([[1., 2.], [2., 4.]])  # Row 2 = 2 × Row 1
b3 = torch.tensor([3., 7.])  # Inconsistent! (should be 6)
check_system_solvability(A3, b3)

---
## 5. Full Column Rank: The Key Condition

For least squares $\min\|X\beta - y\|^2$:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

This requires $X^\top X$ to be invertible, which happens if and only if $X$ has **full column rank**.
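
As a sanity check on the formula, the sketch below solves the normal equations for a small full-column-rank design and compares the result with NumPy's built-in least-squares solver. The response values in `y` are hypothetical.

```python
import numpy as np

# Small full-column-rank design: intercept plus one covariate
X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([1., 3., 5., 8.])   # hypothetical responses

# Normal equations: solve (X'X) beta = X'y instead of forming the inverse explicitly
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference answer from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_normal, beta_lstsq)

# At the minimizer, the residual is orthogonal to every column of X
residual = y - X @ beta_normal
print(X.T @ residual)            # entries close to zero
```

Solving the linear system with `np.linalg.solve` gives the same $\hat{\beta}$ as the inverse formula while avoiding an explicit matrix inverse.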

In [None]:
def check_least_squares_solvable(X):
    """Check if X has full column rank for least squares."""
    m, n = X.shape
    rank = torch.linalg.matrix_rank(X).item()
    
    print(f"Design matrix X: {m} × {n}")
    print(f"rank(X) = {rank}, need {n} for full column rank")
    
    if rank == n:
        print("✓ Full column rank → unique least squares solution")
        
        # Check X'X
        XtX = X.T @ X
        det = torch.linalg.det(XtX).item()
        print(f"  det(X'X) = {det:.4f}")
    else:
        print(f"✗ Rank deficient → need regularization")
        print(f"  Missing rank: {n - rank}")

# Good design matrix
print("Example 1: Good design matrix")
X_good = torch.tensor([[1., 0.],
                        [1., 1.],
                        [1., 2.],
                        [1., 3.]])
check_least_squares_solvable(X_good)

print("\n" + "="*40 + "\n")

# Bad design matrix (collinear)
print("Example 2: Collinear covariates")
X_bad = torch.tensor([[1., 2.],
                       [1., 2.],
                       [1., 2.],
                       [1., 2.]])
check_least_squares_solvable(X_bad)

---
## 6. Common Causes of Rank Deficiency in Genomics

| Problem | Example | DESeq2 Warning |
|---------|---------|----------------|
| **Collinear covariates** | Batch perfectly correlates with treatment | "Model matrix not full rank" |
| **Redundant factors** | Male + Female + Intercept | "Dropping columns" |
| **Too few samples** | 3 samples, 5 covariates | "More coefficients than samples" |
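
The "redundant factors" row can be checked numerically. This sketch assumes a hypothetical 6-sample design with 3 male and 3 female samples:

```python
import numpy as np

# Hypothetical 6-sample design: 3 male, 3 female
male = np.array([1., 1., 1., 0., 0., 0.])
female = 1.0 - male
intercept = np.ones(6)

# Intercept + Male + Female: the two indicator columns sum to the intercept
X_redundant = np.column_stack([intercept, male, female])
rank_redundant = np.linalg.matrix_rank(X_redundant)
print(f"{X_redundant.shape[1]} columns, rank {rank_redundant}")

# Dropping one level (reference coding) restores full column rank
X_reference = np.column_stack([intercept, female])
rank_reference = np.linalg.matrix_rank(X_reference)
print(f"{X_reference.shape[1]} columns, rank {rank_reference}")
```

Because `male + female` equals the intercept column, the three-column design has rank 2. Keeping only one indicator alongside the intercept removes the dependency.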

In [None]:
# Simulate problematic design matrices

# Problem 1: Collinear covariates
print("Problem 1: Batch confounded with treatment")
print("  Sample 1-5: Control, Batch A")
print("  Sample 6-10: Treatment, Batch B")

X_confounded = torch.tensor([
    [1., 0., 0.],  # Intercept, Treatment, BatchB
    [1., 0., 0.],
    [1., 0., 0.],
    [1., 0., 0.],
    [1., 0., 0.],
    [1., 1., 1.],  # Treatment = BatchB!
    [1., 1., 1.],
    [1., 1., 1.],
    [1., 1., 1.],
    [1., 1., 1.],
], dtype=torch.float32)

check_least_squares_solvable(X_confounded)
print("\nNote: Treatment and BatchB columns are identical!")
print("Cannot separate treatment effect from batch effect.")

---
## Exercises

### Exercise 1: Classify Systems
For each scenario, classify as overdetermined, underdetermined, or square:
- 1000 cells, 5 covariates
- 10 samples, 100 genes as features
- 50 samples, 50 covariates

### Exercise 2: Rank Check
Create a design matrix for: Intercept + Treatment + Sex + Treatment×Sex interaction with 20 samples (5 per group). Check if it has full rank.

### Exercise 3: Fix a Confounded Design
How would you modify the confounded experiment above to make treatment and batch effects separable?

In [None]:
# Your solutions here


---
## Summary

| System Type | Condition | Solution | Method |
|-------------|-----------|----------|--------|
| Overdetermined ($m > n$) | Full column rank | Unique (least squares) | QR decomposition |
| Underdetermined ($m < n$) | Full row rank | Infinite | Regularization |
| Rank deficient | $\text{rank}(X) < n$ | None or infinite | Regularization |
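
The table lists QR decomposition as the method for the overdetermined case. A minimal sketch of that route, assuming a small hypothetical design and response, looks like this:

```python
import numpy as np

X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([1., 3., 5., 8.])   # hypothetical responses

# Reduced QR: Q (4x2) has orthonormal columns, R (2x2) is upper triangular
Q, R = np.linalg.qr(X)

# Minimizing ||X beta - y|| reduces to the triangular system R beta = Q'y.
# np.linalg.solve is a general solver; a dedicated triangular solver would
# exploit the structure of R, but the result is the same.
beta_qr = np.linalg.solve(R, Q.T @ y)
print(beta_qr)
```

This route never forms $X^\top X$, which is one reason QR is often preferred over the normal equations in practice.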

## Next: 02_condition_numbers.ipynb