# Genome Assembly: 1. Fundamentals

## Overview
This notebook introduces the foundational concepts needed for genome assembly:
- DNA sequences and sequence representation
- Next-generation sequencing (NGS) reads
- Sequencing errors and quality scores
- K-mers: the building blocks of modern assembly algorithms
- De Bruijn graphs: the core data structure for assembly

**Goal**: Understand what we're working with and why k-mers and De Bruijn graphs are powerful.

## 1. DNA Sequences and Basic Concepts

DNA is a sequence of nucleotides: A (Adenine), T (Thymine), G (Guanine), C (Cytosine).
- A genome is the complete set of genetic instructions (millions to billions of bases).
- Sequencing produces millions of short reads (fragments of the genome).
- **Assembly**: Reconstruct the original genome from these overlapping fragments.

In [None]:
# Basic sequence operations
def get_complement(base):
    """Return the complement of a DNA base (A<->T, G<->C)"""
    complement_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return complement_map[base]

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence"""
    # Reverse the sequence and complement each base
    return ''.join(get_complement(base) for base in reversed(sequence))

# Example
dna_sequence = "ATGCGATCG"
print(f"Original sequence:  {dna_sequence}")
print(f"Reverse complement: {reverse_complement(dna_sequence)}")
print(f"\nNote: In real genomes, both strands can be read, so we care about reverse complements")

## 2. NGS Reads and Quality Scores

Modern sequencers produce millions of short reads. Each base has an associated **quality score** (Phred quality):
- Quality score Q = -10 × log₁₀(P_error)
- Q20 = 1% error, Q30 = 0.1% error
- Reads are typically stored in FASTQ format

In [None]:
import numpy as np
from collections import defaultdict

def phred_to_error_prob(q_score):
    """Convert Phred quality score to error probability"""
    return 10 ** (-q_score / 10)

def error_prob_to_phred(error_prob):
    """Convert error probability to Phred quality score"""
    return -10 * np.log10(error_prob)

# Example
print("Phred Quality Score -> Error Probability")
for q in [10, 20, 30, 40]:
    error = phred_to_error_prob(q)
    print(f"  Q{q}: {error:.4%} error")

print("\nExample read with quality scores:")
read_sequence = "ATGCGATCGATCG"
quality_scores = [35, 35, 32, 28, 25, 15, 35, 35, 35, 30, 28, 20, 18]
print(f"Sequence: {read_sequence}")
print(f"Quality:  {quality_scores}")
print(f"Mean quality: {np.mean(quality_scores):.1f}")
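
FASTQ files store each quality score as a single ASCII character rather than a number. The sketch below assumes the common Phred+33 encoding (character code minus 33 gives Q); some older data uses a different offset, so check your files. The helper names `decode_phred33` and `expected_errors` and the record itself are made up for illustration.

```python
def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]

def expected_errors(q_scores):
    """Sum of per-base error probabilities (expected number of wrong bases)."""
    return sum(10 ** (-q / 10) for q in q_scores)

# A hypothetical four-line FASTQ record: name, sequence, separator, qualities
record = ["@read_1", "ATGCGATCGATCG", "+", "DDA=:0DDD?=53"]
name, sequence, _, quality_string = record

scores = decode_phred33(quality_string)
print(f"Read:            {name[1:]}")
print(f"Sequence:        {sequence}")
print(f"Quality string:  {quality_string}")
print(f"Decoded scores:  {scores}")
print(f"Expected errors: {expected_errors(scores):.3f}")
```

Because Q is logarithmic, summing error probabilities is usually more informative than averaging Q values: one low-quality base can dominate the expected error count of an otherwise clean read.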

## 3. K-mers: The Core Unit of Assembly

A **k-mer** is a substring of length k from a sequence.

**Why k-mers?**
- Tolerate sequencing errors: a single wrong base corrupts at most k k-mers, leaving the read's other k-mers intact
- Provide discrete units for comparison and overlap detection
- Enable efficient De Bruijn graph construction

**K-mer choice trade-off:**
- Small k (e.g., 21): More overlaps, but less specific (repeats confuse assembly)
- Large k (e.g., 71): More specific, but sensitive to errors (breaks into more pieces)
- Typical range: 21-127 bp
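
The specificity side of this trade-off can be seen on a toy sequence. The sketch below counts how many k-mers occur more than once as k grows; `repeated_kmers` and `toy_genome` are illustrative names, and the sequence is far shorter than any real genome.

```python
from collections import Counter

def repeated_kmers(sequence, k):
    """Return the k-mers that occur more than once in sequence, with their counts."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

toy_genome = "ATGCGATCGATCG"  # toy sequence in which "CGATCG" occurs twice

for k in [4, 5, 6, 7]:
    repeats = repeated_kmers(toy_genome, k)
    print(f"k={k}: {len(repeats)} repeated k-mers {sorted(repeats)}")
```

Once k exceeds the length of the longest repeat (6 bp here), every k-mer is unique. The cost is that each read then yields fewer k-mers, and each sequencing error corrupts more of them.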

In [None]:
def extract_kmers(sequence, k):
    """Extract all k-mers from a sequence"""
    kmers = []
    for i in range(len(sequence) - k + 1):
        kmers.append(sequence[i:i+k])
    return kmers

def count_kmers(reads, k):
    """Count occurrences of each k-mer across multiple reads"""
    kmer_counts = defaultdict(int)
    for read in reads:
        for kmer in extract_kmers(read, k):
            kmer_counts[kmer] += 1
    return kmer_counts

# Example: simulating reads from a simple sequence
true_sequence = "ATGCGATCGATCGATCG"  # True genome (simplified)
print(f"True sequence: {true_sequence}")
print(f"Length: {len(true_sequence)} bp")

# Simulate a read
read = true_sequence[0:10]
k = 4
kmers = extract_kmers(read, k)
print(f"\nRead: {read}")
print(f"4-mers: {kmers}")
print(f"Count: {len(kmers)} 4-mers")

# Show how many k-mers we get from a read of length L
L = 100  # typical read length
for k in [21, 31, 51, 71]:
    num_kmers = max(0, L - k + 1)
    print(f"\nA {L}bp read yields {num_kmers} k-mers of length {k}")
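
Section 1 noted that either strand can be sequenced, so a k-mer and its reverse complement describe the same genomic location. A common convention is to count both under one *canonical* form, taken here as the lexicographically smaller of the two. This is a minimal sketch; `canonical` and `count_canonical_kmers` are illustrative names, and it assumes reads contain only A, C, G and T.

```python
from collections import Counter

# Translation table for complementing bases (assumes only A, C, G, T)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_comp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_comp)

def count_canonical_kmers(reads, k):
    """Count k-mers across reads, merging each k-mer with its reverse complement."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

# Two toy reads covering the same locus from opposite strands
forward_read = "ATGCA"
reverse_read = "TGCAT"  # reverse complement of forward_read

for kmer, n in sorted(count_canonical_kmers([forward_read, reverse_read], 4).items()):
    print(f"{kmer}: {n}")
```

Without canonicalization the two reads would contribute three distinct 4-mers (ATGC, TGCA, GCAT), splitting the coverage of one locus across two entries.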

## 4. De Bruijn Graphs: The Assembly Data Structure

A **De Bruijn graph** is constructed from k-mers:
- **Nodes**: (k-1)-mers (the prefixes and suffixes of k-mers)
- **Edges**: Represent k-mers; an edge connects the (k-1)-prefix to the (k-1)-suffix
- **Goal**: Find an Eulerian path through the graph (visits every edge exactly once)

**Example**: For k=4, the 3-mers "ATG" and "TGC" are connected by the edge "ATGC":

```
ATG --[ATGC]--> TGC
```

**Why this works:**
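- If reads are error-free and cover the whole genome, every k-mer of the genome appears as an edge in the graph
- The genome itself is then one Eulerian path through the graph, so finding such a path spells out a candidate reconstruction
- Real genomes have repeats and errors, which complicate this: repeats allow several different Eulerian paths, and errors add spurious edges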

In [None]:
class SimpleDeBruijnGraph:
    """A simple De Bruijn graph representation for assembly"""
    
    def __init__(self, k):
        self.k = k
        self.edges = defaultdict(list)  # prefix -> list of (suffix, last_base)
        self.edge_counts = defaultdict(int)  # (prefix, suffix, last_base) -> count
    
    def add_kmer(self, kmer, count=1):
        """Add a k-mer to the graph"""
        if len(kmer) != self.k:
            raise ValueError(f"K-mer length must be {self.k}")
        
        prefix = kmer[:-1]      # (k-1)-mer prefix
        suffix = kmer[1:]       # (k-1)-mer suffix
        last_base = kmer[-1]    # the last base
        
        # Add edge from prefix to suffix
        self.edges[prefix].append((suffix, last_base))
        self.edge_counts[(prefix, suffix, last_base)] += count
    
    def get_nodes(self):
        """Get all (k-1)-mers in the graph"""
        nodes = set(self.edges.keys())
        # Suffix nodes may have no outgoing edges, so collect them explicitly
        for edges_list in self.edges.values():
            for suffix, _ in edges_list:
                nodes.add(suffix)
        return nodes
    
    def get_out_degree(self, node):
        """Get the out-degree of a node"""
        return len(self.edges.get(node, []))
    
    def get_in_degree(self, node):
        """Get the in-degree of a node"""
        count = 0
        for edges_list in self.edges.values():
            count += sum(1 for suffix, _ in edges_list if suffix == node)
        return count

# Example
genome = "ATGCGATCGATCG"
k = 4

# Extract k-mers
kmers = extract_kmers(genome, k)
print(f"Genome: {genome}")
print(f"K-mers (k={k}): {kmers}\n")

# Build De Bruijn graph
graph = SimpleDeBruijnGraph(k)
for kmer in kmers:
    graph.add_kmer(kmer)

# Show nodes
nodes = graph.get_nodes()
print(f"Nodes ({k-1}-mers): {sorted(nodes)}")
print(f"Number of nodes: {len(nodes)}")

# Show edges
print(f"\nEdges (k-mers):")
for prefix in sorted(graph.edges.keys()):
    for suffix, base in graph.edges[prefix]:
        print(f"  {prefix} --[{prefix + base}]--> {suffix}")

# Show node degrees
print(f"\nNode degrees:")
for node in sorted(nodes):
    in_deg = graph.get_in_degree(node)
    out_deg = graph.get_out_degree(node)
    print(f"  {node}: in={in_deg}, out={out_deg}")

## 5. From De Bruijn Graph to Sequence Reconstruction

Once we have a De Bruijn graph, we need to find a **path that uses every edge exactly once** (Eulerian path).

**Properties of such a path** (assuming all edges lie in one connected component):
- If every node has equal in-degree and out-degree: Eulerian circuit (returns to start)
- If exactly one node has out-degree = in-degree + 1 (start), exactly one has in-degree = out-degree + 1 (end), and every other node is balanced: Eulerian path
- Otherwise: no Eulerian path exists

The sequence is reconstructed by following edges and appending the base labels.

In [None]:
def find_eulerian_path(graph):
    """
    Find an Eulerian path using Hierholzer's algorithm.
    Returns the path as a list of nodes, or None if no Eulerian path exists.
    """
    # Check Eulerian path conditions
    nodes = graph.get_nodes()
    start_nodes = []
    end_nodes = []
    
    for node in nodes:
        in_deg = graph.get_in_degree(node)
        out_deg = graph.get_out_degree(node)
        balance = out_deg - in_deg
        
        if balance == 1:
            start_nodes.append(node)
        elif balance == -1:
            end_nodes.append(node)
        elif balance != 0:
            return None  # No Eulerian path
    
    # Determine start node
    if start_nodes:
        if len(start_nodes) != 1:
            return None
        start = start_nodes[0]
    else:
        # Start from any node with outgoing edges (Eulerian circuit)
        start = next(iter(graph.edges.keys())) if graph.edges else None
    
    if start is None:
        return None
    
    # Hierholzer's algorithm
    stack = [start]
    path = []
    edges = defaultdict(list)  # Copy edges for modification
    for prefix, neighbors in graph.edges.items():
        edges[prefix] = list(neighbors)
    
    while stack:
        v = stack[-1]
        if edges[v]:
            suffix, base = edges[v].pop()
            stack.append(suffix)
        else:
            path.append(stack.pop())
    
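    # Edges left unused mean the graph is disconnected: no Eulerian path
    if any(edges.values()):
        return None
    
    path.reverse()
    return path if path else [start]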

def reconstruct_sequence(graph, path):
    """
    Reconstruct the sequence from an Eulerian path.
    The sequence is the path of (k-1)-mers with overlaps merged.
    """
    if not path:
        return ""
    
    # Start with the first node
    sequence = path[0]
    
    # Trace through edges to add final base of each k-mer
    for i in range(len(path) - 1):
        current = path[i]
        next_node = path[i + 1]
        # Find the edge from current to next_node
        for neighbor, base in graph.edges[current]:
            if neighbor == next_node:
                sequence += base
                break
    
    return sequence

# Test
path = find_eulerian_path(graph)
if path:
    print(f"Eulerian path found: {path}")
    reconstructed = reconstruct_sequence(graph, path)
    print(f"\nReconstructed sequence: {reconstructed}")
    print(f"Original sequence:      {genome}")
    print(f"Match: {reconstructed == genome}")
else:
    print("No Eulerian path found")
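
An Eulerian path is not necessarily unique. When a (k-1)-mer is visited several times, the loops hanging off it can be traversed in different orders, and each order spells a different sequence from exactly the same k-mers. The two sequences below were chosen by hand to illustrate this; `kmer_spectrum` is an illustrative helper name.

```python
def kmer_spectrum(sequence, k):
    """Return the sorted multiset of k-mers in a sequence."""
    return sorted(sequence[i:i + k] for i in range(len(sequence) - k + 1))

k = 3
seq_a = "TAGTACTA"  # visits node "TA" three times: TA -G-> ... TA -C-> ... TA
seq_b = "TACTAGTA"  # same two loops around "TA", traversed in the other order

print(f"3-mers of {seq_a}: {kmer_spectrum(seq_a, k)}")
print(f"3-mers of {seq_b}: {kmer_spectrum(seq_b, k)}")
print(f"Same k-mers:    {kmer_spectrum(seq_a, k) == kmer_spectrum(seq_b, k)}")
print(f"Same sequence:  {seq_a == seq_b}")
```

Both sequences produce an identical De Bruijn graph, so the graph alone cannot tell them apart. Resolving such ambiguity requires extra information, such as a larger k or read pairs.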

## 6. Real-World Complications

In practice, genome assembly is much harder due to:

1. **Sequencing Errors**
   - Create spurious k-mers that show up as dead-end "spur" nodes (tips) or small bubbles in the graph
   - Solution: Remove low-frequency k-mers (error correction)

2. **Repeats**
   - Same k-mers appear in multiple locations
   - Produces branching in the graph
   - Solution: Use paired-end reads, longer k-mers, or graph simplification

3. **No Clear Eulerian Path**
   - Multiple valid paths through complex regions
   - Solution: Contigs (unambiguous regions) + scaffolding with pair information

4. **Coverage Variation**
   - Some regions have more reads (higher k-mer frequency)
   - Others have few reads
   - Solution: Use coverage information for graph cleaning
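
The "contigs" idea from point 3 can be sketched as extracting maximal non-branching paths: stretches of the graph where every internal node has exactly one incoming and one outgoing edge. This is a simplified illustration rather than how any particular assembler does it. It works on distinct k-mers, ignores isolated cycles, and `non_branching_contigs` is a hypothetical helper name.

```python
from collections import defaultdict

def non_branching_contigs(kmers):
    """Spell out maximal non-branching paths of the De Bruijn graph of distinct k-mers.

    Simplification: cycles made only of 1-in-1-out nodes are not reported.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for kmer in set(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        successors[prefix].add(suffix)
        predecessors[suffix].add(prefix)
    nodes = set(successors) | set(predecessors)

    def is_simple(node):
        # Exactly one way in and one way out: the path cannot branch here
        return len(predecessors.get(node, ())) == 1 and len(successors.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at branching or terminal nodes
        for nxt in sorted(successors.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                (nxt,) = successors[nxt]  # the single successor
                contig += nxt[-1]
            contigs.append(contig)
    return contigs

toy_genome = "ATGCGATCGATCG"
k = 4
toy_kmers = [toy_genome[i:i + k] for i in range(len(toy_genome) - k + 1)]
for contig in non_branching_contigs(toy_kmers):
    print(contig)
```

On this toy input the repeat collapses into a cycle through node `CGA`, so the graph yields two contigs (the unique prefix and one copy of the repeat unit) instead of the full 13 bp sequence.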

In [None]:
# Example: Sequencing errors creating spurious k-mers
print("=== Effect of Sequencing Errors ===")
true_read = "ATGCGATCGATCGATCG"
# Introduce a single substitution at position 7 (1-based), keeping the read length unchanged
erroneous_read = "ATGCGACCGATCGATCG"  # T -> C substitution

k = 4
true_kmers = set(extract_kmers(true_read, k))
error_kmers = set(extract_kmers(erroneous_read, k))

unique_to_error = error_kmers - true_kmers
missing_from_error = true_kmers - error_kmers

print(f"True k-mers: {sorted(true_kmers)}")
print(f"\nErroneous k-mers: {sorted(error_kmers)}")
print(f"\nUnique to error (spurious): {sorted(unique_to_error)}")
print(f"Missing from error: {sorted(missing_from_error)}")
print(f"\n→ A single base error affects {len(unique_to_error)} k-mers!")
print(f"→ This is why k-mer frequency is used for error detection")
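
Frequency-based filtering can be sketched as follows: with enough coverage, true k-mers are seen many times, while k-mers created by a random error are usually seen once. The reads, the fivefold coverage and the `min_count` threshold below are toy assumptions for illustration; `count_all_kmers` and `solid_kmers` are illustrative names, and real tools typically choose the threshold from the k-mer frequency histogram.

```python
from collections import Counter

def count_all_kmers(reads, k):
    """Count every k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_count):
    """Keep only k-mers seen at least min_count times."""
    return {kmer for kmer, n in counts.items() if n >= min_count}

# Toy data: five error-free copies of a read plus one copy with a substitution
clean_read = "ATGCGATCGATCGATCG"
noisy_read = "ATGCGACCGATCGATCG"  # T -> C at position 7 (1-based)
reads = [clean_read] * 5 + [noisy_read]

k = 4
counts = count_all_kmers(reads, k)
kept = solid_kmers(counts, min_count=2)
removed = set(counts) - kept

print(f"Distinct k-mers before filtering: {len(counts)}")
print(f"Kept (count >= 2):    {sorted(kept)}")
print(f"Removed (count < 2):  {sorted(removed)}")
```

The filter removes exactly the k-mers that overlap the erroneous base. In regions with low coverage the same threshold would also discard genuine k-mers, which is one reason coverage variation complicates graph cleaning.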

## Summary

**Key concepts:**
- **Reads**: Short DNA fragments from sequencing
- **K-mers**: Short substrings that are the fundamental units of assembly
- **De Bruijn Graph**: A directed graph encoding k-mer overlaps
- **Eulerian Path**: A path through the graph that reconstructs the original sequence
- **Real challenges**: Errors, repeats, and coverage variation

**Next**: In the next notebook, we'll handle real sequencing data with quality scores and preprocessing steps.