<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/Genome_Seq_Cover_Poisson_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coverage & the Poisson Model (Genome Sequencing)

This notebook explores the relationship between sequencing **coverage** and the **fraction of bases not observed** under a Poisson model.

**Key equations**
- Coverage: $c = \frac{N \times L}{G}$, where *N* = number of reads, *L* = average read length, *G* = genome size.
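- Probability that a given base is covered by **no** reads (that is, **not** covered even once): $P_0 = e^{-c}$, the zero-count term of a Poisson distribution with mean $c$.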
- Invert (re-arrange) to get the required coverage for a target uncovered fraction: $c = -\ln(P_0)$.
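
As a quick illustration of how the three equations fit together, here is a minimal sketch. The read count, read length, and genome size are hypothetical values chosen only to give a round coverage, and the helper names `coverage` and `uncovered_fraction` are not part of the original notebook.

```python
import math

def coverage(n_reads: int, read_len: float, genome_size: int) -> float:
    """Mean coverage c = N * L / G."""
    return n_reads * read_len / genome_size

def uncovered_fraction(c: float) -> float:
    """Poisson probability that a base is covered by zero reads: P0 = exp(-c)."""
    return math.exp(-c)

# Hypothetical numbers, for illustration only.
c_example = coverage(n_reads=100_000, read_len=150, genome_size=3_000_000)
p0_example = uncovered_fraction(c_example)

# Inverting P0 should give back the coverage we started from.
c_recovered = -math.log(p0_example)

print(f"c = {c_example:.2f}x")
print(f"P0 = {p0_example:.5f}")
print(f"-ln(P0) = {c_recovered:.2f}x")
```

The last line is a round-trip check: taking $-\ln$ of the uncovered fraction should return the original coverage.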

---


In [None]:
import numpy as np
import matplotlib.pyplot as plt

c = np.linspace(0, 40, 400)
P0 = np.exp(-c)

plt.figure(figsize=(7,4.5))
plt.plot(c, P0)
plt.xlabel("Coverage (c)")
plt.ylabel("Fraction not covered (P₀)")
plt.title("Poisson Model: Uncovered Fraction vs Coverage")
# Annotate a few key points; 4.60517 = -ln(0.01), the coverage that leaves 1% uncovered.
for kp in [1, 4.60517, 10, 30]:
    plt.scatter([kp], [np.exp(-kp)])
    plt.annotate(f"c={kp:.1f}\nP0={np.exp(-kp):.3e}", (kp, np.exp(-kp)),
                 xytext=(5, -10), textcoords='offset points')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

## Exercise 1 — How much coverage do you need?

1. For an uncovered fraction target of **1%** ($P_0 = 0.01$), compute the required coverage $c$.
2. For **0.1%** ($P_0 = 0.001$) and **0.01%** ($P_0 = 0.0001$) do the same.
3. Interpret the marginal gains when increasing coverage from 20× to 30×.

Fill in the cell below.


In [1]:
import math

def coverage_for_target_uncovered(P0: float) -> float:
    """Return coverage c needed so that fraction not covered <= P0."""
    if P0 <= 0 or P0 >= 1:
        raise ValueError("P0 must be in (0,1).")
    return -math.log(P0)

targets = [0.01, 0.001, 0.0001]
for t in targets:
    print(f"P0={t}: required c={coverage_for_target_uncovered(t):.3f}x")


# TODO: Discuss marginal utility when moving from 20x to 30x.
# Hint: compare P0 at 20x vs 30x.
for c in [20, 30]:
    print(f"c={c}: P0={math.exp(-c):.3e}")

P0=0.01: required c=4.605x
P0=0.001: required c=6.908x
P0=0.0001: required c=9.210x
c=20: P0=2.061e-09
c=30: P0=9.358e-14
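
One way to make these fractions concrete is to convert them into an expected number of uncovered bases, $G \times P_0$. The sketch below assumes a hypothetical 3.1 Mb genome (the size used in Exercise 2) and the same Poisson model; the function name `expected_uncovered_bases` is introduced here for illustration only.

```python
import math

def expected_uncovered_bases(genome_size: int, c: float) -> float:
    """Expected count of bases with zero reads under the Poisson model: G * exp(-c)."""
    return genome_size * math.exp(-c)

G_example = 3_100_000  # hypothetical genome size in bp

for depth in [5, 10, 20, 30]:
    n_missing = expected_uncovered_bases(G_example, depth)
    print(f"c={depth:>2}x: expected uncovered bases = {n_missing:.3e}")
```

Under these assumptions the expected count already falls well below one base at 20×, which is a useful reference point when weighing the extra sequencing needed to reach 30×.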


## Exercise 2 — How many reads do you need?

Given genome size $G$ and read length $L$, the number of reads $N$ required for coverage $c$ follows from re-arranging $c = \frac{N \times L}{G}$:

$N = \frac{c \times G}{L}$

- **(a)** For a bacterial genome of **3.1 Mb** and **150 bp** reads, how many reads are needed for **30×** coverage?
- **(b)** Repeat for **10 kb** long reads.
- **(c)** Compare cost and library complexity implications.


In [2]:
def reads_needed(c, G_bp, L_bp):
    """Return the number of reads N = c * G / L (may be fractional)."""
    return (c * G_bp) / L_bp

G = 3_100_000  # 3.1 Mbp
print("30x with 150 bp reads:", reads_needed(30, G, 150))
print("30x with 10,000 bp reads:", reads_needed(30, G, 10_000))

30x with 150 bp reads: 620000.0
30x with 10,000 bp reads: 9300.0
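
Because a sequencing run yields a whole number of reads, it can be convenient to round the result up. The following is a minimal sketch of that idea; `whole_reads_needed` is a hypothetical helper, and the 151 bp case is an extra illustrative read length that is not part of the exercise.

```python
import math

def whole_reads_needed(c: float, genome_bp: int, read_bp: int) -> int:
    """Smallest whole number of reads N such that N * L / G >= c."""
    return math.ceil(c * genome_bp / read_bp)

G_bp = 3_100_000  # 3.1 Mbp, as in the exercise

for read_len in [150, 151, 10_000]:
    n = whole_reads_needed(30, G_bp, read_len)
    print(f"30x with {read_len} bp reads: {n} reads")
```

Rounding up (rather than to the nearest integer) guarantees the target coverage is met or slightly exceeded.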


## Exercise 3 — Simulation sanity check (optional)

Assume the read depth at each base is an independent Poisson draw with mean $c$. Simulate the per-base depth for a small genome and compare the observed
fraction of uncovered bases (depth 0) with the theoretical $e^{-c}$.


In [None]:
import numpy as np

def simulate_uncovered_fraction(G=200000, c=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # Poisson sampling per base
    counts = rng.poisson(lam=c, size=G)
    return np.mean(counts == 0)

for c in [1, 2, 5, 10]:
    obs = simulate_uncovered_fraction(G=200000, c=c, seed=42)
    theo = np.exp(-c)
    print(f"c={c:>2} | observed P0={obs:.4f} | theoretical P0={theo:.4f}")
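
The simulation above draws each base's depth independently, which is the model itself rather than a sequencing run. A closer check is to place whole reads at random start positions and count the resulting depth. The sketch below does this under stated assumptions: a circular genome, fixed-length reads, and uniformly random start positions. The function name `simulate_reads_uncovered_fraction` and the default read length `L=100` are choices made here, not taken from the original notebook.

```python
import numpy as np

def simulate_reads_uncovered_fraction(G=200_000, L=100, c=5.0, seed=0):
    """Place N = round(c*G/L) reads of length L on a circular genome of size G
    and return the fraction of bases with depth 0."""
    if L > G:
        raise ValueError("Read length L must not exceed genome size G.")
    rng = np.random.default_rng(seed)
    n_reads = int(round(c * G / L))
    starts = rng.integers(0, G, size=n_reads)  # uniform start positions in [0, G)

    # Difference array over [0, G + L): +1 where a read starts, -1 just past its end.
    diff = np.zeros(G + L + 1, dtype=np.int64)
    np.add.at(diff, starts, 1)
    np.add.at(diff, starts + L, -1)
    depth = np.cumsum(diff)[:G + L]

    # Reads running past position G wrap around to the start of the circle.
    depth[:L] += depth[G:G + L]
    depth = depth[:G]
    return np.mean(depth == 0)

for cov in [1, 2, 5, 10]:
    obs = simulate_reads_uncovered_fraction(G=200_000, L=100, c=cov, seed=42)
    print(f"c={cov:>2} | read-level observed P0={obs:.4f} | theoretical P0={np.exp(-cov):.4f}")
```

Neighbouring bases share reads in this version, so the observed fraction tends to fluctuate more between seeds than in the independent-base simulation, while its expected value stays close to $e^{-c}$ when $L \ll G$.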