# Genome Assembly: 2. Read Preprocessing & Quality Control

## Overview
Before assembly, raw sequencing data must be cleaned:
- Quality assessment and visualization
- Adapter trimming (removing adapter sequences introduced during library preparation)
- Quality-based trimming and filtering
- Complexity filtering (low-entropy reads)
- Duplicate detection

**Data Source**: We'll use simulated FASTQ data. You can substitute real data from:
- SRA (Sequence Read Archive): https://www.ncbi.nlm.nih.gov/sra
- ENA (European Nucleotide Archive): https://www.ebi.ac.uk/ena

**Goal**: Transform raw reads into high-quality input for assembly

## 1. FASTQ Format and Reading

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from typing import List, Tuple, Iterator
import gzip
from pathlib import Path

class FASTQRecord:
    """A single FASTQ record (sequencing read)"""
    def __init__(self, header: str, sequence: str, plus: str, qualities: str):
        self.header = header
        self.sequence = sequence
        self.plus = plus
        self.qualities = qualities
    
    def __repr__(self):
        return f"FASTQRecord(id='{self.header[:30]}...', len={len(self.sequence)}bp)"
    
    def get_phred_scores(self) -> List[int]:
        """Convert quality characters to Phred scores"""
        # Illumina 1.8+ and Sanger use ASCII offset 33 (Phred+33);
        # older Illumina pipelines (1.3 to 1.7) used offset 64
        return [ord(c) - 33 for c in self.qualities]
    
    def mean_quality(self) -> float:
        """Calculate mean quality score"""
        return np.mean(self.get_phred_scores())
    
    def min_quality(self) -> int:
        """Get minimum quality score"""
        return min(self.get_phred_scores())

def parse_fastq(file_path: str) -> Iterator[FASTQRecord]:
    """
    Parse a FASTQ file (gzip-compressed or plain text).
    Yields FASTQRecord objects.
    """
    file_path = Path(file_path)
    
    if file_path.suffix == '.gz':
        opener = gzip.open
    else:
        opener = open
    
    with opener(file_path, 'rt') as f:
        while True:
            header = f.readline().strip()
            if not header:
                break
            
            sequence = f.readline().strip()
            plus = f.readline().strip()
            qualities = f.readline().strip()
            
            yield FASTQRecord(header, sequence, plus, qualities)

# Example FASTQ content
fastq_example = """@read1/1
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHHHHHHFFFFFFFFDDDDDDDD
@read2/1
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIHHHHHHHHHHHHHHHHFFFFFFFFFFDDDDDDDD999999"""

print("FASTQ Format Example:")
print(fastq_example)
print(f"\nLine 1: @header - read identifier")
print(f"Line 2: sequence - DNA bases")
print(f"Line 3: + - separator (sometimes repeated header)")
print(f"Line 4: quality - ASCII characters encoding Phred scores")

# Save example for testing
test_fastq_path = "test_reads.fastq"
with open(test_fastq_path, 'w') as f:
    f.write(fastq_example)

print(f"\nTest FASTQ file created at: {test_fastq_path}")
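
A Phred score Q encodes the estimated probability that a base call is wrong: P = 10^(-Q/10). The sketch below decodes a Phred+33 quality string and converts the scores to error probabilities. The helper names `decode_phred33` and `phred_to_error_prob` are illustrative and not part of the notebook's classes.

```python
def decode_phred33(quality_string: str) -> list:
    """Decode a Phred+33 quality string into integer Phred scores."""
    return [ord(c) - 33 for c in quality_string]

def phred_to_error_prob(q: int) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# 'I' is ASCII 73 (Q40) and '9' is ASCII 57 (Q24)
scores = decode_phred33("II99")
for q in sorted(set(scores), reverse=True):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.5f}")
```

Q20 therefore corresponds to a 1-in-100 chance of error and Q30 to 1-in-1000, which is why those two values are the usual thresholds.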

## 2. Generate Synthetic Realistic Reads

In [None]:
import random
random.seed(42)

def phred_to_char(q_score: int) -> str:
    """Convert Phred score to quality character (Phred+33)"""
    return chr(q_score + 33)

def generate_synthetic_reads(reference: str, num_reads: int = 1000, read_length: int = 100, 
                              coverage: int = 10, error_rate: float = 0.01) -> List[FASTQRecord]:
    """
    Generate synthetic reads from a reference genome with realistic errors.
    
    Args:
        reference: Reference genome sequence
        num_reads: Number of reads to generate
        read_length: Length of each read
        coverage: Expected coverage depth (currently unused; the number of reads is set by num_reads)
        error_rate: Probability of each base being erroneous
    """
    bases = ['A', 'T', 'G', 'C']
    reads = []
    
    for i in range(num_reads):
        # Random position in reference
        start = random.randint(0, len(reference) - read_length)
        read_seq = reference[start:start + read_length]
        
        # Introduce substitution errors, remembering where they occur
        read_list = list(read_seq)
        error_positions = set()
        for j in range(len(read_list)):
            if random.random() < error_rate:
                # Substitute with a different random base
                read_list[j] = random.choice([b for b in bases if b != read_list[j]])
                error_positions.add(j)
        read_seq = ''.join(read_list)
        
        # Generate quality scores
        # Quality declines towards the 3' end of the read (typical Illumina pattern)
        qualities = []
        for j in range(read_length):
            position_factor = j / read_length
            base_quality = int(35 - 10 * position_factor)
            # Add some randomness
            q = max(5, base_quality + random.randint(-3, 3))
            # Lower quality at error positions
            if j in error_positions:
                q = min(q, random.randint(5, 15))
            qualities.append(phred_to_char(min(40, q)))
        
        header = f"@read{i}/1"
        plus = "+"
        reads.append(FASTQRecord(header, read_seq, plus, ''.join(qualities)))
    
    return reads

# Generate realistic genome and reads
print("Generating synthetic bacterial genome...")
# Toy genome built by repeating one short unit 50 times (highly repetitive by design)
genome_base = "ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG"
genome = genome_base * 50  # ~2250 bp
print(f"Genome length: {len(genome)} bp")

print("\nGenerating 5000 reads with 1% error rate...")
synthetic_reads = generate_synthetic_reads(genome, num_reads=5000, read_length=100, error_rate=0.01)
print(f"Generated {len(synthetic_reads)} reads")
print(f"Example read: {synthetic_reads[0]}")
print(f"  Sequence: {synthetic_reads[0].sequence[:50]}...")
print(f"  Quality:  {synthetic_reads[0].qualities[:50]}...")
print(f"  Mean Q:   {synthetic_reads[0].mean_quality():.1f}")
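
Expected coverage follows from the read count, read length and genome size: C = N × L / G. The sketch below computes coverage for a given number of reads and, conversely, the number of reads needed for a target depth. The function names and the 1 Mbp genome size are assumptions chosen for illustration, not values from this notebook.

```python
import math

def expected_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth C = N * L / G."""
    return num_reads * read_length / genome_length

def reads_for_coverage(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_length / read_length)

# Hypothetical 1 Mbp genome sequenced with 100 bp reads
genome_length = 1_000_000
print(f"5000 reads give {expected_coverage(5000, 100, genome_length):.1f}x coverage")
print(f"30x coverage needs {reads_for_coverage(30, 100, genome_length):,} reads")
```

Applying the same formula to the toy genome above shows that 5000 reads of 100 bp cover it far more deeply than a typical sequencing run would.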

## 3. Quality Assessment

In [None]:
def analyze_quality(reads: List[FASTQRecord]) -> dict:
    """
    Calculate quality statistics across all reads.
    """
    read_lengths = [len(r.sequence) for r in reads]
    mean_qualities = [r.mean_quality() for r in reads]
    min_qualities = [r.min_quality() for r in reads]
    gc_contents = []
    
    for read in reads:
        gc = (read.sequence.count('G') + read.sequence.count('C')) / len(read.sequence)
        gc_contents.append(gc)
    
    return {
        'num_reads': len(reads),
        'total_bases': sum(read_lengths),
        'read_length_mean': np.mean(read_lengths),
        'read_length_median': np.median(read_lengths),
        'mean_quality_mean': np.mean(mean_qualities),
        'mean_quality_min': np.min(mean_qualities),
        'mean_quality_max': np.max(mean_qualities),
        'min_quality_min': np.min(min_qualities),
        'min_quality_max': np.max(min_qualities),
        'gc_mean': np.mean(gc_contents),
        'gc_std': np.std(gc_contents),
        'mean_qualities': mean_qualities,
        'min_qualities': min_qualities,
        'gc_contents': gc_contents,
        'quality_by_position': quality_by_position(reads)
    }

def quality_by_position(reads: List[FASTQRecord]) -> dict:
    """
    Calculate quality score statistics at each position across all reads.
    """
    if not reads:
        return {}
    
    position_qualities = defaultdict(list)
    
    for read in reads:
        for pos, q in enumerate(read.get_phred_scores()):
            position_qualities[pos].append(q)
    
    # Iterate over the positions actually observed, so reads of different
    # lengths (e.g. after trimming) are handled correctly
    result = {}
    for pos in sorted(position_qualities):
        qs = position_qualities[pos]
        result[pos] = {
            'mean': np.mean(qs),
            'median': np.median(qs),
            'std': np.std(qs)
        }
    
    return result

# Analyze
stats = analyze_quality(synthetic_reads)

print("=== QUALITY ASSESSMENT ===")
print(f"Total reads: {stats['num_reads']:,}")
print(f"Total bases: {stats['total_bases']:,}")
print(f"\nRead length:")
print(f"  Mean: {stats['read_length_mean']:.1f} bp")
print(f"  Median: {stats['read_length_median']:.1f} bp")
print(f"\nMean quality per read:")
print(f"  Mean: {stats['mean_quality_mean']:.1f}")
print(f"  Min: {stats['mean_quality_min']:.1f}")
print(f"  Max: {stats['mean_quality_max']:.1f}")
print(f"\nMinimum quality per read:")
print(f"  Min: {stats['min_quality_min']}")
print(f"  Max: {stats['min_quality_max']}")
print(f"\nGC content:")
print(f"  Mean: {stats['gc_mean']:.2%}")
print(f"  Std:  {stats['gc_std']:.2%}")
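
A common single-number summary of run quality is the fraction of bases at or above Q20 and Q30. The following is a minimal sketch that works directly on Phred+33 quality strings. The function name `fraction_at_or_above` is hypothetical, and the toy strings are made up for the example.

```python
def fraction_at_or_above(quality_strings, threshold: int) -> float:
    """Fraction of all bases whose Phred score (Phred+33) is >= threshold."""
    total = 0
    passing = 0
    for quality_string in quality_strings:
        for char in quality_string:
            total += 1
            if ord(char) - 33 >= threshold:
                passing += 1
    return passing / total if total else 0.0

# 'I' encodes Q40 and '5' encodes Q20
toy_qualities = ["IIII5555", "IIIIIIII"]
q20_fraction = fraction_at_or_above(toy_qualities, 20)
q30_fraction = fraction_at_or_above(toy_qualities, 30)
print(f"Bases >= Q20: {q20_fraction:.1%}")
print(f"Bases >= Q30: {q30_fraction:.1%}")
```

With the notebook's records, the input could be built as `[r.qualities for r in synthetic_reads]`.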

## 4. Quality Visualization

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# 1. Mean quality distribution
axes[0, 0].hist(stats['mean_qualities'], bins=30, color='steelblue', edgecolor='black')
axes[0, 0].axvline(np.mean(stats['mean_qualities']), color='red', linestyle='--', label='Mean')
axes[0, 0].set_xlabel('Mean Quality Score')
axes[0, 0].set_ylabel('Number of Reads')
axes[0, 0].set_title('Distribution of Mean Quality per Read')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Quality by position
positions = sorted(stats['quality_by_position'].keys())
means = [stats['quality_by_position'][p]['mean'] for p in positions]
stds = [stats['quality_by_position'][p]['std'] for p in positions]

axes[0, 1].plot(positions, means, color='steelblue', linewidth=2, label='Mean')
axes[0, 1].fill_between(positions, 
                         np.array(means) - np.array(stds),
                         np.array(means) + np.array(stds),
                         alpha=0.3, color='steelblue', label='±1 Std')
axes[0, 1].axhline(30, color='green', linestyle='--', alpha=0.5, label='Q30 threshold')
axes[0, 1].axhline(20, color='orange', linestyle='--', alpha=0.5, label='Q20 threshold')
axes[0, 1].set_xlabel('Position in Read (bp)')
axes[0, 1].set_ylabel('Quality Score')
axes[0, 1].set_title('Quality Score Profile by Read Position')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
axes[0, 1].set_ylim([0, 42])

# 3. GC content distribution
axes[1, 0].hist(stats['gc_contents'], bins=30, color='coral', edgecolor='black')
axes[1, 0].axvline(np.mean(stats['gc_contents']), color='red', linestyle='--', label='Mean')
axes[1, 0].set_xlabel('GC Content')
axes[1, 0].set_ylabel('Number of Reads')
axes[1, 0].set_title('GC Content Distribution')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# 4. Min quality distribution
axes[1, 1].hist(stats['min_qualities'], bins=30, color='lightgreen', edgecolor='black')
axes[1, 1].axvline(np.mean(stats['min_qualities']), color='red', linestyle='--', label='Mean')
axes[1, 1].set_xlabel('Minimum Quality Score in Read')
axes[1, 1].set_ylabel('Number of Reads')
axes[1, 1].set_title('Distribution of Minimum Quality per Read')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
print("Quality assessment plots displayed.")

## 5. Adapter Trimming

In [None]:
def find_adapter_end(sequence: str, adapter: str, min_overlap: int = 8, max_mismatch: float = 0.1) -> int:
    """
    Find where an adapter sequence begins in the read.
    Uses a simple approach: scan for the adapter allowing a few mismatches.
    
    Despite the function name, the returned index is the adapter's start
    (the position to trim at), or len(sequence) if no adapter is found.
    max_mismatch is the allowed fraction of mismatching bases in the overlap.
    """
    for start in range(len(sequence) - min_overlap + 1):
        # Candidate adapter region: everything from 'start' to the 3' end of the read
        suffix = sequence[start:]
        # Compare against the full adapter and against its truncated forms adapter[k:]
        for adapter_start in range(len(adapter) - min_overlap + 1):
            adapter_part = adapter[adapter_start:]
            
            # Only the overlapping portion can be compared
            match_length = min(len(suffix), len(adapter_part))
            if match_length < min_overlap:
                continue
            mismatches = sum(1 for i in range(match_length) if suffix[i] != adapter_part[i])
            
            if mismatches / match_length <= max_mismatch:
                return start
    
    return len(sequence)

def trim_adapters(reads: List[FASTQRecord], adapter_sequences: dict = None) -> Tuple[List[FASTQRecord], dict]:
    """
    Trim known adapter sequences from read ends.
    Common Illumina adapters are included by default.
    """
    if adapter_sequences is None:
        # Common Illumina adapters
        adapter_sequences = {
            'Illumina_R1': 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCA',
            'Illumina_R2': 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT',
            'Truseq': 'GATCGGAAGAGC'
        }
    
    trimmed_reads = []
    trim_stats = {'total': 0, 'trimmed': 0, 'bases_removed': 0}
    
    for read in reads:
        seq = read.sequence
        qual = read.qualities
        trim_pos = len(seq)  # Default: no trimming
        
        # Check each adapter
        for adapter_name, adapter in adapter_sequences.items():
            pos = find_adapter_end(seq, adapter)
            trim_pos = min(trim_pos, pos)
        
        # Trim if adapter was found
        if trim_pos < len(seq):
            seq = seq[:trim_pos]
            qual = qual[:trim_pos]
            trim_stats['trimmed'] += 1
            trim_stats['bases_removed'] += len(read.sequence) - trim_pos
        
        trimmed_reads.append(FASTQRecord(read.header, seq, read.plus, qual))
        trim_stats['total'] += 1
    
    return trimmed_reads, trim_stats

# Test with synthetic reads (add fake adapters)
test_reads = synthetic_reads[:100].copy()
# Add adapter to some reads
for i in [10, 20, 30]:
    r = test_reads[i]
    adapter = 'AGATCGGAAGAGC'
    test_reads[i] = FASTQRecord(r.header, r.sequence + adapter, r.plus, r.qualities + 'I' * len(adapter))

trimmed, stats_trim = trim_adapters(test_reads)

print(f"Adapter Trimming Results:")
print(f"  Total reads processed: {stats_trim['total']}")
print(f"  Reads with adapters trimmed: {stats_trim['trimmed']}")
print(f"  Total bases removed: {stats_trim['bases_removed']}")
print(f"\nExample:")
for i in [10, 20, 30]:
    orig = test_reads[i]
    trim = trimmed[i]
    if len(orig.sequence) != len(trim.sequence):
        print(f"  Read {i}: {len(orig.sequence)}bp → {len(trim.sequence)}bp (trimmed {len(orig.sequence) - len(trim.sequence)}bp)")
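
For comparison, a 3' adapter can also be located by aligning only the adapter's prefix against each suffix of the read, which is the situation that arises when the insert is shorter than the read. The sketch below takes that stricter approach. `find_3prime_adapter`, its parameters and the toy insert sequence are assumptions for illustration, and dedicated tools such as Cutadapt use more careful alignment than this.

```python
def find_3prime_adapter(read: str, adapter: str, min_overlap: int = 5,
                        max_error_rate: float = 0.1) -> int:
    """
    Return the index where the adapter starts in the read,
    or len(read) if no acceptable match is found.
    """
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run off the end of the read, so compare only the overlap
        overlap = min(len(read) - start, len(adapter))
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter[:overlap]) if a != b
        )
        # Allowed mismatches scale with the overlap length
        if mismatches <= int(max_error_rate * overlap):
            return start
    return len(read)

adapter = "AGATCGGAAGAGC"
insert = "ACGTTGCAAC"  # made-up genomic insert

full_hit = find_3prime_adapter(insert + adapter, adapter)         # complete adapter
partial_hit = find_3prime_adapter(insert + adapter[:6], adapter)  # adapter cut off by read end
no_hit = find_3prime_adapter(insert, adapter)                     # no adapter present

print(full_hit, partial_hit, no_hit)
```

A shorter `min_overlap` catches more truncated adapters but raises the chance of trimming genuine sequence that happens to resemble the adapter start.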

## 6. Quality-Based Trimming

In [None]:
def trim_by_quality_sliding_window(read: FASTQRecord, window_size: int = 5, min_quality: int = 20) -> FASTQRecord:
    """
    Trim read using sliding window approach (similar to Trimmomatic's SLIDINGWINDOW step).
    
    Algorithm:
    1. Scan read from 5' to 3' with a sliding window
    2. If mean quality in window < threshold, trim from this position onwards
    
    Reads shorter than window_size are returned unchanged.
    """
    phred_scores = read.get_phred_scores()
    seq = read.sequence
    qual = read.qualities
    
    # Scan from left to right (5' to 3') and cut at the first failing window
    for i in range(len(seq) - window_size + 1):
        window_quality = np.mean(phred_scores[i:i + window_size])
        if window_quality < min_quality:
            return FASTQRecord(read.header, seq[:i], read.plus, qual[:i])
    
    return read

def trim_by_quality_per_base(read: FASTQRecord, min_quality: int = 20, leading: bool = True, trailing: bool = True) -> FASTQRecord:
    """
    Trim low-quality bases from the read ends (leading and/or trailing).
    
    Args:
        read: FASTQRecord to trim
        min_quality: Minimum quality threshold
        leading: Trim from 5' end
        trailing: Trim from 3' end
    """
    phred_scores = read.get_phred_scores()
    seq = read.sequence
    qual = read.qualities
    
    # Trim trailing low-quality bases
    if trailing:
        end = 0  # If no base passes the threshold, the whole read is trimmed away
        for i in range(len(seq) - 1, -1, -1):
            if phred_scores[i] >= min_quality:
                end = i + 1
                break
        seq = seq[:end]
        qual = qual[:end]
        phred_scores = phred_scores[:end]
    
    # Trim leading low-quality bases
    if leading:
        start = len(seq)  # If no base passes the threshold, nothing is kept
        for i in range(len(seq)):
            if phred_scores[i] >= min_quality:
                start = i
                break
        seq = seq[start:]
        qual = qual[start:]
    
    return FASTQRecord(read.header, seq, read.plus, qual)

def quality_filter(reads: List[FASTQRecord], min_quality: int = 20, method: str = 'sliding_window',
                   min_length: int = 50) -> Tuple[List[FASTQRecord], dict]:
    """
    Apply quality trimming to all reads, then drop reads that end up
    shorter than min_length.
    
    Methods:
    - 'per_base': Trim leading/trailing bases below threshold
    - 'sliding_window': Use sliding window approach
    """
    filtered_reads = []
    # min_length was previously hard-coded to 100 (the full read length),
    # which discarded every read that had been trimmed at all
    stats = {'total': 0, 'kept': 0, 'removed': 0, 'trimmed_bases': 0, 'min_length': min_length}
    
    for read in reads:
        stats['total'] += 1
        
        # Apply trimming
        if method == 'per_base':
            trimmed = trim_by_quality_per_base(read, min_quality)
        else:  # sliding_window
            trimmed = trim_by_quality_sliding_window(read, min_quality=min_quality)
        
        # Check minimum length
        if len(trimmed.sequence) >= min_length:
            filtered_reads.append(trimmed)
            stats['kept'] += 1
            stats['trimmed_bases'] += len(read.sequence) - len(trimmed.sequence)
        else:
            stats['removed'] += 1
    
    return filtered_reads, stats

# Apply quality filtering
filtered_reads, filter_stats = quality_filter(synthetic_reads, min_quality=20, method='sliding_window')

print(f"Quality-Based Filtering Results:")
print(f"  Total reads: {filter_stats['total']}")
print(f"  Kept: {filter_stats['kept']} ({filter_stats['kept']/filter_stats['total']*100:.1f}%)")
print(f"  Removed (too short): {filter_stats['removed']}")
print(f"  Total bases trimmed: {filter_stats['trimmed_bases']:,}")

# Before/After stats
before_stats = analyze_quality(synthetic_reads)
after_stats = analyze_quality(filtered_reads)

print(f"\nBefore and After:")
print(f"  Read count: {before_stats['num_reads']:,} → {after_stats['num_reads']:,}")
print(f"  Mean quality: {before_stats['mean_quality_mean']:.2f} → {after_stats['mean_quality_mean']:.2f}")
print(f"  Min quality: {before_stats['min_quality_min']} → {after_stats['min_quality_min']}")
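
To make the 5'-to-3' window scan concrete, here is a small worked example on a made-up list of Phred scores. It is a sketch of the idea only; `first_failing_window` is a hypothetical helper that returns the cut position instead of a trimmed record.

```python
def first_failing_window(scores, window_size: int = 5, min_quality: float = 20) -> int:
    """
    Return the start index of the first window (scanning 5' to 3') whose
    mean quality is below min_quality, or len(scores) if none fails.
    """
    for i in range(len(scores) - window_size + 1):
        window = scores[i:i + window_size]
        if sum(window) / window_size < min_quality:
            return i
    return len(scores)

# Made-up read whose quality collapses near the 3' end
scores = [38, 37, 36, 35, 30, 28, 12, 10, 9, 8]
cut = first_failing_window(scores)

# Window means: 35.2, 33.2, 28.2, 23.0, 17.8 -> the window starting at index 4 fails
print(f"Cut at index {cut}; kept scores: {scores[:cut]}")
```

Note that the cut lands at the start of the failing window, so a few acceptable bases just before the quality drop (here Q30 and Q28) are removed along with the poor ones.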

## 7. Complexity Filtering

In [None]:
def shannon_entropy(sequence: str) -> float:
    """
    Calculate Shannon entropy of a sequence.
    High entropy = diverse bases, Low entropy = repetitive
    
    Formula: H = -Σ(p_i * log2(p_i)) where p_i is frequency of base i
    Max entropy (DNA): 2 bits
    """
    from collections import Counter
    
    if len(sequence) == 0:
        return 0
    
    counts = Counter(sequence)
    entropy = 0
    for count in counts.values():
        p = count / len(sequence)
        if p > 0:
            entropy -= p * np.log2(p)
    
    return entropy

def homopolymer_runs(sequence: str) -> dict:
    """
    Find homopolymer runs (stretches of identical bases).
    Returns dict with max run length and count of runs > 5 bp.
    """
    if len(sequence) == 0:
        return {'max_run': 0, 'long_runs': 0}
    
    max_run = 1
    current_run = 1
    long_runs = 0
    
    for i in range(1, len(sequence)):
        if sequence[i] == sequence[i-1]:
            current_run += 1
            max_run = max(max_run, current_run)
        else:
            if current_run > 5:
                long_runs += 1
            current_run = 1
    
    if current_run > 5:
        long_runs += 1
    
    return {'max_run': max_run, 'long_runs': long_runs}

def complexity_filter(reads: List[FASTQRecord], min_entropy: float = 1.5, max_homopolymer: int = 8) -> Tuple[List[FASTQRecord], dict]:
    """
    Filter out low-complexity reads (repetitive or homopolymer-rich).
    These reads are problematic for assembly.
    """
    filtered_reads = []
    stats = {'total': 0, 'kept': 0, 'removed_low_entropy': 0, 'removed_homopolymer': 0}
    
    for read in reads:
        stats['total'] += 1
        
        entropy = shannon_entropy(read.sequence)
        homo_info = homopolymer_runs(read.sequence)
        
        if entropy < min_entropy:
            stats['removed_low_entropy'] += 1
        elif homo_info['max_run'] > max_homopolymer:
            stats['removed_homopolymer'] += 1
        else:
            filtered_reads.append(read)
            stats['kept'] += 1
    
    return filtered_reads, stats

# Apply complexity filter
complex_filtered, complex_stats = complexity_filter(filtered_reads, min_entropy=1.5)

print(f"Complexity Filtering Results:")
print(f"  Total reads: {complex_stats['total']}")
print(f"  Kept: {complex_stats['kept']} ({complex_stats['kept']/complex_stats['total']*100:.1f}%)")
print(f"  Removed (low entropy): {complex_stats['removed_low_entropy']}")
print(f"  Removed (long homopolymers): {complex_stats['removed_homopolymer']}")

# Show examples
print(f"\nComplexity examples:")
for i in [0, 100, 200]:
    if i < len(filtered_reads):
        read = filtered_reads[i]
        entropy = shannon_entropy(read.sequence)
        homo = homopolymer_runs(read.sequence)
        print(f"  Read {i}: entropy={entropy:.2f}, max_homopolymer={homo['max_run']}")
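
Single-base entropy cannot see short tandem repeats: a sequence such as GATCGATC... uses all four bases equally and scores the maximum 2 bits. One possible extension, sketched here as an assumption and not as part of the notebook's filter, is to compute entropy over overlapping k-mers. `kmer_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def kmer_entropy(sequence: str, k: int = 1) -> float:
    """Shannon entropy (bits) of the overlapping k-mer distribution of a sequence."""
    if len(sequence) < k:
        return 0.0
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

repeat = "GATC" * 10
homopolymer = "A" * 40

print(f"Repeat, k=1:      {kmer_entropy(repeat, 1):.2f} bits")
print(f"Repeat, k=4:      {kmer_entropy(repeat, 4):.2f} bits")
print(f"Homopolymer, k=1: {kmer_entropy(homopolymer, 1):.2f} bits")
```

The repeat contains only four distinct 4-mers, so its 4-mer entropy stays below 2 bits, whereas a non-repetitive 40 bp sequence would typically contain many more distinct 4-mers and score higher.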

## 8. Duplicate Detection

In [None]:
def find_duplicates(reads: List[FASTQRecord]) -> dict:
    """
    Identify duplicate reads (exact sequence matches).
    Returns dict with duplicate statistics and lists of duplicate clusters.
    """
    sequence_map = defaultdict(list)
    
    # Map sequences to read indices
    for idx, read in enumerate(reads):
        sequence_map[read.sequence].append(idx)
    
    # Count duplicates
    duplicate_clusters = [indices for indices in sequence_map.values() if len(indices) > 1]
    num_duplicates = sum(len(indices) - 1 for indices in duplicate_clusters)
    
    return {
        'total_reads': len(reads),
        'unique_sequences': len(sequence_map),
        'duplicate_reads': num_duplicates,
        'duplicate_clusters': duplicate_clusters,
        'sequence_map': sequence_map
    }

def deduplicate(reads: List[FASTQRecord], keep_first: bool = False) -> Tuple[List[FASTQRecord], dict]:
    """
    Remove duplicate reads (exact sequence matches).
    
    If keep_first is True, the first copy encountered is kept; otherwise
    (the default) the copy with the highest mean quality is kept.
    """
    sequence_map = defaultdict(list)
    
    # Group by sequence
    for idx, read in enumerate(reads):
        sequence_map[read.sequence].append((idx, read))
    
    dedup_reads = []
    stats = {'total': 0, 'unique': 0, 'duplicates_removed': 0}
    
    for sequence, read_list in sequence_map.items():
        stats['total'] += len(read_list)
        stats['unique'] += 1
        
        if len(read_list) > 1:
            stats['duplicates_removed'] += len(read_list) - 1
        
        if keep_first:
            # Reads were appended in input order, so index 0 is the first copy
            chosen = read_list[0][1]
        else:
            # Keep read with highest mean quality
            chosen = max(read_list, key=lambda x: x[1].mean_quality())[1]
        dedup_reads.append(chosen)
    
    return dedup_reads, stats

# Test deduplication on a subset
test_dedup = complex_filtered[:1000].copy()
# Add some duplicates
test_dedup.extend(test_dedup[10:15])

dup_info = find_duplicates(test_dedup)
print(f"Duplicate Analysis (before dedup):")
print(f"  Total reads: {dup_info['total_reads']}")
print(f"  Unique sequences: {dup_info['unique_sequences']}")
print(f"  Duplicate reads: {dup_info['duplicate_reads']}")
print(f"  Largest duplicate cluster: {max((len(c) for c in dup_info['duplicate_clusters']), default=0)} copies")

dedup_reads, dedup_stats = deduplicate(test_dedup)
print(f"\nAfter Deduplication:")
print(f"  Total reads: {dedup_stats['total']}")
print(f"  Unique reads: {dedup_stats['unique']}")
print(f"  Duplicates removed: {dedup_stats['duplicates_removed']}")
print(f"  Reads retained: {len(dedup_reads)} ({len(dedup_reads)/dedup_stats['total']*100:.1f}%)")
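
Grouping full sequences in a dictionary keeps every read in memory. For larger inputs, one option is to stream the reads and remember only a fixed-size digest of each sequence. The sketch below uses plain `(header, sequence, qualities)` tuples and a hypothetical `dedup_stream` generator; it always keeps the first copy and treats a digest match as a duplicate, accepting a negligible collision risk.

```python
import hashlib

def dedup_stream(records):
    """Yield each record whose sequence has not been seen before (first copy wins)."""
    seen = set()
    for header, sequence, qualities in records:
        # 16-byte digest instead of the full sequence to bound memory per read
        digest = hashlib.blake2b(sequence.encode("ascii"), digest_size=16).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, sequence, qualities

# Made-up records: @a and @b share a sequence
toy_records = [
    ("@a", "ACGT", "IIII"),
    ("@b", "ACGT", "HHHH"),
    ("@c", "TTGA", "IIII"),
]
unique_records = list(dedup_stream(toy_records))
duplication_rate = 1 - len(unique_records) / len(toy_records)
print(f"Kept {len(unique_records)} of {len(toy_records)} reads "
      f"(duplication rate {duplication_rate:.1%})")
```

Exact-match deduplication of this kind does not catch duplicates that differ by a sequencing error or that appear as reverse complements.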

## 9. Complete Preprocessing Pipeline

In [None]:
def preprocess_reads(input_reads: List[FASTQRecord], 
                    min_quality: int = 20,
                    min_length: int = 50,
                    remove_adapters: bool = True,
                    remove_duplicates: bool = True,
                    remove_low_complexity: bool = True) -> Tuple[List[FASTQRecord], dict]:
    """
    Complete preprocessing pipeline:
    1. Adapter trimming
    2. Quality trimming
    3. Length filtering
    4. Complexity filtering
    5. Deduplication
    """
    pipeline_stats = {}
    reads = input_reads
    
    print(f"Starting preprocessing with {len(reads)} reads...\n")
    
    # Step 1: Adapter trimming
    if remove_adapters:
        reads, stats = trim_adapters(reads)
        pipeline_stats['adapters'] = stats
        print(f"After adapter trimming: {stats['total'] - stats['trimmed']} reads unchanged, "
              f"{stats['trimmed']} trimmed")
    
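    # Step 2: Quality trimming (sliding window), applied read by read
    # Step 3: Length filtering with the pipeline's own min_length cutoff
    stats = {'total': len(reads), 'kept': 0, 'removed': 0, 'trimmed_bases': 0, 'min_length': min_length}
    kept_reads = []
    for read in reads:
        trimmed = trim_by_quality_sliding_window(read, min_quality=min_quality)
        if len(trimmed.sequence) >= min_length:
            kept_reads.append(trimmed)
            stats['kept'] += 1
            stats['trimmed_bases'] += len(read.sequence) - len(trimmed.sequence)
        else:
            stats['removed'] += 1
    reads = kept_reads
    pipeline_stats['quality'] = stats
    print(f"After quality trimming: {stats['kept']} reads retained (removed {stats['removed']} short reads)")
    
    # Step 4: Complexity filtering
    if remove_low_complexity:
        reads, stats = complexity_filter(reads, min_entropy=1.5)
        pipeline_stats['complexity'] = stats
        print(f"After complexity filtering: {stats['kept']} reads retained "
              f"(removed {stats['removed_low_entropy'] + stats['removed_homopolymer']})")
    
    # Step 5: Deduplication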
    if remove_duplicates:
        reads, stats = deduplicate(reads)
        pipeline_stats['duplicates'] = stats
        print(f"After deduplication: {stats['unique']} unique reads "
              f"({stats['duplicates_removed']} duplicates removed)")
    
    print(f"\nFinal: {len(reads)} reads for assembly")
    pipeline_stats['final_count'] = len(reads)
    pipeline_stats['initial_count'] = len(input_reads)
    
    return reads, pipeline_stats

# Run full pipeline
print("="*50)
print("FULL PREPROCESSING PIPELINE")
print("="*50)
preprocessed_reads, pp_stats = preprocess_reads(
    synthetic_reads[:2000],  # Use subset for demo
    min_quality=20,
    min_length=50,
    remove_adapters=False,  # No adapters in synthetic data
    remove_duplicates=True,
    remove_low_complexity=True
)

print(f"\n" + "="*50)
print(f"Summary: {pp_stats['initial_count']} → {pp_stats['final_count']} reads "
      f"({pp_stats['final_count']/pp_stats['initial_count']*100:.1f}% retained)")
print("="*50)
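
The cleaned reads usually need to be written back to disk for the assembler. The following is a minimal sketch of a FASTQ writer that handles plain and gzip-compressed output. `write_fastq` is a hypothetical helper, and the two records are made-up stand-ins for the pipeline output.

```python
import gzip
import os
import tempfile

def write_fastq(records, path) -> None:
    """Write (header, sequence, qualities) tuples as four-line FASTQ records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "wt") as handle:
        for header, sequence, qualities in records:
            if len(sequence) != len(qualities):
                raise ValueError(f"sequence/quality length mismatch in {header}")
            handle.write(f"{header}\n{sequence}\n+\n{qualities}\n")

# Made-up cleaned reads standing in for the pipeline output
cleaned = [
    ("@read0/1", "ACGTACGT", "IIIIHHHH"),
    ("@read1/1", "GGCATT", "IIIIII"),
]

out_path = os.path.join(tempfile.mkdtemp(), "cleaned_reads.fastq.gz")
write_fastq(cleaned, out_path)

# Read the file back to confirm the layout
with gzip.open(out_path, "rt") as handle:
    lines = handle.read().splitlines()
print(f"Wrote {len(lines) // 4} records to {out_path}")
```

With the notebook's objects, the tuples could be built as `(r.header, r.sequence, r.qualities) for r in preprocessed_reads`.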

## 10. Real Data Source Guide

To use real genome sequencing data:

**1. SRA (NCBI Sequence Read Archive)**
   - Website: https://www.ncbi.nlm.nih.gov/sra
   - Download using: `fastq-dump` or `fasterq-dump` (SRA Toolkit)
   - Example: Bacterial genome with short reads

**2. ENA (European Nucleotide Archive)**
   - Website: https://www.ebi.ac.uk/ena
   - Direct FASTQ downloads available

**3. Public Databases**
   - 1000 Genomes Project (human)
   - Genome in a Bottle (high-confidence benchmarks)
   - IMG/VR (viral genomes)

**Example**: Download bacterial reads from SRA:
```bash
# Install SRA Toolkit first
conda install -c bioconda sra-tools

# Download specific run (e.g., SRR1111111)
fasterq-dump SRR1111111 -O ./reads --threads 8
```

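The preprocessing functions above operate on lists of `FASTQRecord` objects, so a real file only needs to be loaded first, e.g. with `list(parse_fastq(path))`. They assume Phred+33 quality encoding and single-end reads; for paired-end data, both mates must be filtered together so the two files stay in sync.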
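
Real files are occasionally truncated or malformed, so it can help to validate records while reading. The sketch below is one way to do that under the standard four-line FASTQ layout; `read_fastq_strict` is a hypothetical helper, and the in-memory strings stand in for downloaded files.

```python
import io

def read_fastq_strict(handle):
    """Yield (header, sequence, qualities); raise ValueError on malformed records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        plus = handle.readline().strip()
        qualities = handle.readline().strip()
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"missing '+' separator after {header}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header, sequence, qualities

good_file = io.StringIO("@r1\nACGT\n+\nIIII\n")
bad_file = io.StringIO("@r2\nACGT\n+\nIII\n")  # quality string is one character short

records = list(read_fastq_strict(good_file))
try:
    list(read_fastq_strict(bad_file))
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(f"Rejected malformed record: {error}")
```

The same generator works on a file opened with `open(path, "rt")` or `gzip.open(path, "rt")`.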

## Summary

**Preprocessing steps:**
1. **Adapter trimming**: Remove adapter sequences ligated during library preparation
2. **Quality trimming**: Remove low-quality bases using sliding window or per-base filtering
3. **Length filtering**: Discard reads that become too short
4. **Complexity filtering**: Remove repetitive/low-entropy reads
5. **Deduplication**: Remove exact duplicates

**Key metrics:**
- Quality score distribution by position
- GC content vs. expected (a shifted or multimodal distribution can indicate contamination)
- Read length distribution

**Next**: Notebook 3 covers the actual assembly algorithms!