# Bioinformatics: Sequence Alignment
Understanding genetic information is at the heart of modern biology, and sequence alignment is one of the most fundamental tools in this process. Sequence alignment refers to the task of arranging DNA, RNA, or protein sequences to identify regions of similarity. These similarities can indicate shared ancestry, functional relationships, or evolutionary patterns.

Unlike simple data comparison, sequence alignment must account for mutations, insertions, deletions, and gaps that naturally occur over time. This makes the task computationally complex and biologically significant. Performing accurate alignments requires a combination of algorithmic strategies and domain-specific knowledge in molecular biology.
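
To illustrate why position-by-position comparison falls short, the sketch below contrasts a naive positional count with an edit distance that allows insertions and deletions. The function names and the example strings are illustrative assumptions, not part of the notebook's own code.

```python
def positional_mismatches(a, b):
    """Naive comparison: differing positions plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


original = "ATGCT"
with_deletion = "AGCT"  # hypothetical variant: the T at position 2 was lost

print(positional_mismatches(original, with_deletion))  # 4
print(edit_distance(original, with_deletion))          # 1
```

A single deletion shifts every later character, so the naive count reports four differences where only one event occurred. Alignment algorithms are designed to recover the second view.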

This notebook explores how algorithms like Needleman-Wunsch, Smith-Waterman, and multiple sequence alignment (MSA) methods are used to detect conserved regions, infer phylogenetic relationships, and support gene annotation. These algorithms form the backbone of many tools used in genomic research and medical diagnostics.

By studying sequence alignment, we not only improve our ability to interpret genetic data, but we also gain insight into how algorithms can be designed to model the uncertainty and variation found in living systems.
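
As a minimal sketch of the pairwise case, the following computes a Needleman-Wunsch global alignment score with dynamic programming. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; real tools typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


print(needleman_wunsch_score("ATGCT", "AGT"))  # 1
```

Under these assumed scores, the best alignment of `ATGCT` with `AGT` is `A-G-T` (three matches, two gaps), which gives 3 - 2 = 1.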


# Multiple Sequence Alignment (MSA)
MSA is the process of aligning three or more sequences (DNA, RNA, or proteins) so that similar characters are in the same column. This helps identify conserved regions, which may be important biologically (e.g., active sites in enzymes).

In simple terms, we insert gaps (-) into sequences so that as many characters as possible match when stacked vertically.

Example Input Sequences:

In [1]:
seq1 = "ATGCT"
seq2 = "A-G-T"
seq3 = "ATGTT"

Aligned like this:

A T G C T

A - G - T

A T G T T

Each column shows how the characters line up, and differences can be counted.
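
One way to make the column view concrete is to transpose the aligned rows with `zip` and flag the columns where every sequence agrees. The helper name `conserved_columns` is a hypothetical choice for this sketch.

```python
aligned = ["ATGCT", "A-G-T", "ATGTT"]


def conserved_columns(rows):
    """Return 0-based indices of columns where all rows share one non-gap character."""
    return [
        index
        for index, column in enumerate(zip(*rows))  # zip(*rows) yields one tuple per column
        if column[0] != '-' and len(set(column)) == 1
    ]


print(conserved_columns(aligned))  # [0, 2, 4]
```

Columns 1, 3, and 5 (indices 0, 2, 4) are fully conserved here. The other two columns contain a gap or differing bases.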

In [2]:
# Demo Code
sequences = [
    ['A', 'T', 'G', 'C', 'T'],
    ['A', '-', 'G', '-', 'T'],
    ['A', 'T', 'G', 'T', 'T']
]

num_columns = len(sequences[0])
num_sequences = len(sequences)
mismatch_count = 0

print("Aligned Sequences:")
for seq in sequences:
    print(" ".join(seq))

# Go column by column
for col in range(num_columns):
    column = [sequences[row][col] for row in range(num_sequences)]
    most_common = max(set(column), key=column.count)

    # Count mismatches in the column
    mismatches = sum(1 for base in column if base != most_common)
    mismatch_count += mismatches

print(f"\nTotal column mismatches: {mismatch_count}")

Aligned Sequences:
A T G C T
A - G - T
A T G T T

Total column mismatches: 3
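
Note that the demo above allows a gap to be the "most common" character in a column, whereas the general code that follows only lets actual bases win. The sketch below contrasts the two conventions on a single column. The function name and the `gaps_can_win` flag are assumptions made for illustration.

```python
from collections import Counter


def column_mismatches(column, gaps_can_win=True):
    """Count entries that differ from the column's most common character.

    If gaps_can_win is False, '-' is never treated as the reference character,
    so every gap counts as a mismatch unless the column is entirely gaps.
    """
    counts = Counter(column)
    if not gaps_can_win:
        counts.pop('-', None)
    if not counts:  # all-gap column with gaps excluded
        return 0
    return len(column) - max(counts.values())


column = ['-', '-', 'A']
print(column_mismatches(column, gaps_can_win=True))   # 1 (the lone A disagrees with '-')
print(column_mismatches(column, gaps_can_win=False))  # 2 (both gaps disagree with A)
```

The two conventions agree on the demo alignment, but they can diverge on gap-heavy columns, so it is worth stating which one a reported mismatch count uses.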


In [3]:
# General Code
import random  # Import random module for generating random sequences
from collections import Counter  # Import Counter for counting occurrences of bases

def generate_random_sequence(length, gap_probability=0.15):
    """
    Generate a random DNA sequence with possible gaps.

    Args:
        length: Length of the sequence
        gap_probability: Probability of inserting a gap (-)

    Returns:
        List representing the sequence
    """
    bases = ['A', 'T', 'G', 'C']  # Define the four DNA bases
    sequence = []  # Initialize empty list to store the sequence

    # Loop through each position in the sequence
    for _ in range(length):
        # Generate random number between 0 and 1
        if random.random() < gap_probability:
            sequence.append('-')  # Add gap if random number is below threshold
        else:
            sequence.append(random.choice(bases))  # Add random DNA base

    return sequence  # Return the generated sequence as a list

def count_mismatches(sequences):
    """
    Count mismatches in aligned sequences.

    Args:
        sequences: List of sequences (each sequence is a list of bases)

    Returns:
        Dictionary with mismatch analysis
    """
    # Check if sequences list is empty
    if not sequences:
        return {"error": "No sequences provided"}  # Return error if no sequences

    num_columns = len(sequences[0])  # Get length of first sequence (all should be same length)
    num_sequences = len(sequences)  # Count total number of sequences
    total_mismatches = 0  # Initialize counter for total mismatches across all columns

    # Display header for sequence alignment
    print("Aligned Sequences:")
    # Loop through each sequence with index
    for i, seq in enumerate(sequences):
        # Print sequence number and bases separated by spaces
        print(f"Seq {i+1}: {' '.join(seq)}")

    # Display header for mismatch analysis table
    print(f"\nMismatch Analysis:")
    print("Column | Bases | Most Common | Mismatches")
    print("-" * 45)  # Print separator line

    # Analyze each column position across all sequences
    for col in range(num_columns):
        # Extract all bases from current column across all sequences
        column = [sequences[row][col] for row in range(num_sequences)]

        # Count occurrences of each base in the column using Counter
        base_counts = Counter(column)

        # Separate actual bases from gaps
        actual_bases = {base: count for base, count in base_counts.items() if base != '-'}
        gap_count = base_counts.get('-', 0)

        # Find the most common actual base (excluding gaps)
        if actual_bases:
            most_common_base = max(actual_bases, key=actual_bases.get)
            most_common_count = actual_bases[most_common_base]
        else:
            # If all are gaps, set most common as gap for display purposes
            most_common_base = '-'
            most_common_count = gap_count

        # Calculate mismatches:
        # All sequences that don't match the most common actual base are mismatches
        # This includes gaps and different bases
        if actual_bases:
            mismatches = num_sequences - most_common_count
        else:
            # If all are gaps, no mismatches (though this is an edge case)
            mismatches = 0

        total_mismatches += mismatches  # Add to running total

        # Display column analysis results
        bases_str = ''.join(column)  # Convert column list to string for display
        # Print formatted row with column number, bases, most common base, and mismatch count
        print(f"  {col+1:2d}   | {bases_str:5s} | {most_common_base:11s} | {mismatches:9d}")

    # Display summary statistics
    print(f"\nMismatch Summary:")
    print(f"Total sequences: {num_sequences}")
    print(f"Sequence length: {num_columns}")
    print(f"Total mismatches: {total_mismatches}")
    print(f"Average mismatches per column: {total_mismatches/num_columns:.2f}")

    # Calculate mismatch rate as percentage
    total_positions = num_sequences * num_columns  # Total possible positions
    mismatch_rate = (total_mismatches / total_positions) * 100  # Convert to percentage
    print(f"Overall mismatch rate: {mismatch_rate:.1f}%")

    # Return dictionary with analysis results
    return {
        'total_mismatches': total_mismatches,
        'num_sequences': num_sequences,
        'sequence_length': num_columns,
        'mismatch_rate': mismatch_rate
    }

def main():
    # Set random seed for reproducibility - same seed gives same random sequences
    random.seed(42)

    # Parameters for sequence generation
    num_sequences = 4  # Number of sequences to generate
    sequence_length = 6  # Length of each sequence
    gap_probability = 0.2  # 20% chance of gap at each position

    # Display program header and parameters
    print("=== DNA Sequence Mismatch Counter ===\n")
    print(f"Generating {num_sequences} random sequences of length {sequence_length}")
    print(f"Gap probability: {gap_probability:.1%}\n")  # Format as percentage

    # Generate random sequences
    sequences = []  # Initialize empty list to store sequences
    # Loop to generate specified number of sequences
    for i in range(num_sequences):
        seq = generate_random_sequence(sequence_length, gap_probability)  # Generate one sequence
        sequences.append(seq)  # Add sequence to list

    # Count mismatches in the generated sequences
    results = count_mismatches(sequences)

    # Run another example with different parameters
    print("\n" + "="*50)  # Print separator line
    print("Example 2: More sequences, less gaps")
    print("="*50)

    # Generate second set of sequences with different parameters
    sequences2 = []  # Initialize new list for second example
    # Generate 6 sequences of length 8 with 10% gap probability
    for i in range(6):
        seq = generate_random_sequence(8, gap_probability=0.1)  # Lower gap probability
        sequences2.append(seq)  # Add to second sequence list

    # Analyze second set of sequences
    count_mismatches(sequences2)

if __name__ == "__main__":
    main()  # Run main function only if script is executed directly

=== DNA Sequence Mismatch Counter ===

Generating 4 random sequences of length 6
Gap probability: 20.0%

Aligned Sequences:
Seq 1: A T A A A -
Seq 2: A C G A T G
Seq 3: T G - G G C
Seq 4: C - G A - G

Mismatch Analysis:
Column | Bases | Most Common | Mismatches
---------------------------------------------
   1   | AATC  | A           |         2
   2   | TCG-  | T           |         3
   3   | AG-G  | G           |         2
   4   | AAGA  | A           |         1
   5   | ATG-  | A           |         3
   6   | -GCG  | G           |         2

Mismatch Summary:
Total sequences: 4
Sequence length: 6
Total mismatches: 13
Average mismatches per column: 2.17
Overall mismatch rate: 54.2%

Example 2: More sequences, less gaps
Aligned Sequences:
Seq 1: T C G G G A T T
Seq 2: C T A A C T G C
Seq 3: C T G C C T A A
Seq 4: T C C C A A G G
Seq 5: C A G T A G T T
Seq 6: A C - G T A - A

Mismatch Analysis:
Column | Bases | Most Common | Mismatches
---------------------------------------------


In [10]:
# Silent Test Suite for DNA Sequence Mismatch Counter
import random
from collections import Counter
import sys
import time

def generate_random_sequence(length, gap_probability=0.15):
    """Generate a random DNA sequence with possible gaps."""
    bases = ['A', 'T', 'G', 'C']
    sequence = []

    for _ in range(length):
        if random.random() < gap_probability:
            sequence.append('-')
        else:
            sequence.append(random.choice(bases))

    return sequence

def count_mismatches(sequences):
    """Count mismatches in aligned sequences (the function we're testing)."""
    if not sequences:
        return {"error": "No sequences provided"}

    num_columns = len(sequences[0])
    num_sequences = len(sequences)
    total_mismatches = 0

    for col in range(num_columns):
        column = [sequences[row][col] for row in range(num_sequences)]
        base_counts = Counter(column)

        # Separate actual bases from gaps
        actual_bases = {base: count for base, count in base_counts.items() if base != '-'}

        # Find the most common actual base (excluding gaps)
        if actual_bases:
            most_common_base = max(actual_bases, key=actual_bases.get)
            most_common_count = actual_bases[most_common_base]
            mismatches = num_sequences - most_common_count
        else:
            # If all are gaps, no mismatches
            mismatches = 0

        total_mismatches += mismatches

    return {
        'total_mismatches': total_mismatches,
        'num_sequences': num_sequences,
        'sequence_length': num_columns,
        'mismatch_rate': (total_mismatches / (num_sequences * num_columns)) * 100
    }

def brute_force_mismatch_count(sequences):
    """Brute force method to verify mismatch counting."""
    if not sequences:
        return 0

    num_columns = len(sequences[0])
    num_sequences = len(sequences)
    total_mismatches = 0

    for col in range(num_columns):
        column = [sequences[row][col] for row in range(num_sequences)]

        # Count only actual bases (not gaps)
        actual_bases = [base for base in column if base != '-']

        if not actual_bases:
            # All gaps - no mismatches
            continue

        # Find most common actual base
        base_counts = Counter(actual_bases)
        most_common_base = base_counts.most_common(1)[0][0]
        most_common_count = base_counts.most_common(1)[0][1]

        # Count mismatches: total sequences minus most common actual base count
        # This correctly counts gaps as mismatches against actual bases
        mismatches = len(column) - most_common_count
        total_mismatches += mismatches

    return total_mismatches

def assert_test(test_name, sequences, expected_mismatches=None):
    """Silent assertion-based test."""
    # Handle empty sequences
    if not sequences:
        our_result = count_mismatches(sequences)
        if "error" in our_result:
            return True  # Expected behavior for empty sequences
        return False

    # Get results from our function
    our_result = count_mismatches(sequences)

    if "error" in our_result:
        our_mismatches = None
    else:
        our_mismatches = our_result['total_mismatches']

    # Get brute force result
    brute_force_mismatches = brute_force_mismatch_count(sequences)

    # Assertions
    if expected_mismatches is not None:
        assert our_mismatches == expected_mismatches, f"{test_name}: Expected {expected_mismatches}, got {our_mismatches}"

    assert our_mismatches == brute_force_mismatches, f"{test_name}: Our function ({our_mismatches}) != brute force ({brute_force_mismatches})"

    return True

def test_edge_cases():
    """Test edge cases."""

    # Empty sequences
    result = count_mismatches([])
    assert "error" in result, "Empty sequences should return error"

    # Single sequence
    assert_test("Single sequence", [['A', 'T', 'G', 'C']], 0)

    # Single position
    assert_test("Single position - same", [['A'], ['A'], ['A']], 0)
    assert_test("Single position - different", [['A'], ['T'], ['G']], 2)

    # All gaps
    assert_test("All gaps", [['-', '-', '-'], ['-', '-', '-'], ['-', '-', '-']], 0)

    # Mixed gaps only
    assert_test("Mixed gaps", [['A'], ['-'], ['A']], 1)  # 1 gap out of 3 sequences, A is most common

    # Two sequences
    assert_test("Two sequences identical", [['A', 'T'], ['A', 'T']], 0)
    assert_test("Two sequences different", [['A', 'T'], ['T', 'A']], 2)

    # Single base type
    assert_test("All same base", [['A', 'A', 'A'], ['A', 'A', 'A']], 0)

    # Long sequences with no variation
    long_seq = ['A'] * 100
    assert_test("Long identical sequences", [long_seq, long_seq, long_seq], 0)

def test_random_cases():
    """Test random cases with different parameters."""

    for test_num in range(50):  # 50 random tests
        # Random parameters
        num_sequences = random.randint(2, 10)
        sequence_length = random.randint(1, 20)
        gap_probability = random.uniform(0, 0.5)

        # Generate random sequences
        sequences = []
        for _ in range(num_sequences):
            seq = generate_random_sequence(sequence_length, gap_probability)
            sequences.append(seq)

        # Test consistency between methods
        assert_test(f"Random test {test_num}", sequences)

def test_stress_cases():
    """Test stress cases with large inputs."""

    # Large number of sequences
    sequences = []
    for _ in range(100):
        seq = generate_random_sequence(10, 0.1)
        sequences.append(seq)
    assert_test("100 sequences", sequences)

    # Long sequences
    sequences = []
    for _ in range(10):
        seq = generate_random_sequence(1000, 0.1)
        sequences.append(seq)
    assert_test("Long sequences (1000 bases)", sequences)

    # High gap probability
    sequences = []
    for _ in range(10):
        seq = generate_random_sequence(50, 0.8)
        sequences.append(seq)
    assert_test("High gap probability", sequences)

    # Maximum diversity test
    bases = ['A', 'T', 'G', 'C']
    sequences = []
    for i in range(4):
        seq = [bases[i]] * 100  # Each sequence is all one base
        sequences.append(seq)
    assert_test("Maximum diversity", sequences, 300)  # 3 mismatches per column * 100 columns

def test_specific_assertion_cases():
    """Test specific cases with known expected results."""

    # Perfect match
    assert_test("Perfect match",
                [['A', 'T', 'G', 'C'], ['A', 'T', 'G', 'C'], ['A', 'T', 'G', 'C']],
                0)

    # All different - each column has 4 different bases, so 3 mismatches per column
    # 3 columns × 3 mismatches = 9 total
    assert_test("All different",
                [['A', 'A', 'A'], ['T', 'T', 'T'], ['G', 'G', 'G'], ['C', 'C', 'C']],
                9)

    # Mixed with gaps
    # Column 0: ['A', 'A', 'A'] -> A appears 3 times -> 0 mismatches
    # Column 1: ['T', 'T', 'G'] -> T appears 2 times -> 1 mismatch (G)
    # Column 2: ['G', 'C', 'G'] -> G appears 2 times -> 1 mismatch (C)
    # Column 3: ['-', '-', 'T'] -> T appears 1 time, gaps 2 times -> 2 mismatches (the gaps)
    # Total: 0 + 1 + 1 + 2 = 4
    assert_test("Mixed with gaps",
                [['A', 'T', 'G', '-'], ['A', 'T', 'C', '-'], ['A', 'G', 'G', 'T']],
                4)

    # Majority rule - A appears 3 times, T appears 1 time, G appears 1 time
    # So A is most common, 2 mismatches per column × 3 columns = 6 total
    assert_test("Majority rule",
                [['A', 'A', 'A'], ['A', 'A', 'A'], ['A', 'A', 'A'], ['T', 'T', 'T'], ['G', 'G', 'G']],
                6)

    # Complex gap pattern
    # Column 0: ['A', '-', 'A'] -> A appears 2 times, gap 1 time -> A is most common -> 1 mismatch (the gap)
    # Column 1: ['-', 'T', 'T'] -> T appears 2 times, gap 1 time -> T is most common -> 1 mismatch (the gap)
    # Column 2: ['G', 'G', '-'] -> G appears 2 times, gap 1 time -> G is most common -> 1 mismatch (the gap)
    # Total: 3 mismatches
    assert_test("Complex gaps",
                [['A', '-', 'G'], ['-', 'T', 'G'], ['A', 'T', '-']],
                3)

    # All gaps in some columns
    # Column 0: ['A', 'T', 'G'] -> A, T, G each appear 1 time -> A is chosen (first max) -> 2 mismatches
    # Column 1: ['-', '-', '-'] -> All gaps -> 0 mismatches
    # Column 2: ['G', 'C', 'A'] -> G, C, A each appear 1 time -> G is chosen (first max) -> 2 mismatches
    # Total: 2 + 0 + 2 = 4
    assert_test("Partial all gaps",
                [['A', '-', 'G'], ['T', '-', 'C'], ['G', '-', 'A']],
                4)

def run_comprehensive_tests():
    """Run all test categories silently."""

    start_time = time.time()

    try:
        # Run all test categories
        test_edge_cases()
        test_specific_assertion_cases()
        test_random_cases()
        test_stress_cases()

        end_time = time.time()

        return True

    except AssertionError as e:
        return False
    except Exception as e:
        import traceback
        traceback.print_exc()  # This will help debug any remaining issues
        return False

def run_silent_validation():
    """Run validation without any output unless there's an error."""

    try:
        test_edge_cases()
        test_specific_assertion_cases()
        test_random_cases()
        test_stress_cases()
        return True
    except Exception:
        return False

if __name__ == "__main__":
    # Run comprehensive tests with minimal output
    success = run_comprehensive_tests()

From there, one can build a consensus sequence by taking the most common character at each column from a multiple sequence alignment (MSA). It’s often used to summarize conserved regions across DNA, RNA, or protein sequences.

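For example, from the earlier alignment (ATGCT, A-G-T, ATGTT), the result depends on how ties are broken. With the priority used in the demo below (A > T > G > C > -), the consensus sequence would be:
A T G T T → because:

Most frequent in col 1: A

Most frequent in col 2: T

Most frequent in col 3: G

Most frequent in col 4: T (C, T, and - each appear once; the tie goes to T)

Most frequent in col 5: T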

In [5]:
# Demo Code (Same input as earlier example)
msa = [
    ['A', 'T', 'G', 'C', 'T'],
    ['A', '-', 'G', '-', 'T'],
    ['A', 'T', 'G', 'T', 'T']
]

# Priority: Higher number = higher preference in a tie
priority = {'A': 5, 'T': 4, 'G': 3, 'C': 2, '-': 1}

consensus = ""

for col in range(len(msa[0])):
    # Get the column (i.e., characters at this position across all sequences)
    column = [msa[row][col] for row in range(len(msa))]

    # Find the base with the highest frequency; break ties by priority
    most_common = max(
        set(column),
        key=lambda base: (column.count(base), priority.get(base, 0))
    )

    # Add to consensus string
    consensus += most_common

print("Consensus sequence:", consensus)

Consensus sequence: ATGTT
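
Because column 4 is a three-way tie, the consensus depends entirely on the tie-break rule. The sketch below makes that explicit by taking the preference order as a parameter. The function name `consensus_with_priority` and the second ordering are hypothetical, chosen only to show the effect.

```python
def consensus_with_priority(msa, order):
    """Majority-vote consensus; ties go to the character that appears earliest in `order`."""
    # Earlier characters in `order` receive a higher rank
    rank = {char: len(order) - position for position, char in enumerate(order)}
    result = []
    for column in zip(*msa):
        # sorted() keeps the choice deterministic for characters missing from `order`
        best = max(
            sorted(set(column)),
            key=lambda char: (column.count(char), rank.get(char, 0)),
        )
        result.append(best)
    return "".join(result)


msa = ["ATGCT", "A-G-T", "ATGTT"]
print(consensus_with_priority(msa, "ATGC-"))  # ATGTT (same priority as the demo)
print(consensus_with_priority(msa, "CGTA-"))  # ATGCT (C now wins the column-4 tie)
```

Only the tied column changes between the two calls. Columns with a clear majority are unaffected by the ordering.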


In [6]:
# General Code
import random  # Import random module for generating random sequences
from collections import Counter  # Import Counter for counting occurrences of bases

def generate_random_sequence(length, gap_probability=0.15):
    """
    Generate a random DNA sequence with possible gaps.

    Args:
        length: Length of the sequence
        gap_probability: Probability of inserting a gap (-)

    Returns:
        List representing the sequence
    """
    bases = ['A', 'T', 'G', 'C']  # Define the four DNA bases
    sequence = []  # Initialize empty list to store the sequence

    # Loop through each position in the sequence
    for _ in range(length):
        # Generate random number between 0 and 1
        if random.random() < gap_probability:
            sequence.append('-')  # Add gap if random number is below threshold
        else:
            sequence.append(random.choice(bases))  # Add random DNA base

    return sequence  # Return the generated sequence as a list

def count_mismatches(sequences):
    """
    Count mismatches in aligned sequences.

    Args:
        sequences: List of sequences (each sequence is a list of bases)

    Returns:
        Dictionary with mismatch analysis
    """
    # Check if sequences list is empty
    if not sequences:
        return {"error": "No sequences provided"}  # Return error if no sequences

    num_columns = len(sequences[0])  # Get length of first sequence (all should be same length)
    num_sequences = len(sequences)  # Count total number of sequences
    total_mismatches = 0  # Initialize counter for total mismatches across all columns

    # Display header for sequence alignment
    print("Aligned Sequences:")
    # Loop through each sequence with index
    for i, seq in enumerate(sequences):
        # Print sequence number and bases separated by spaces
        print(f"Seq {i+1}: {' '.join(seq)}")

    # Display header for mismatch analysis table
    print(f"\nMismatch Analysis:")
    print("Column | Bases | Most Common | Mismatches")
    print("-" * 45)  # Print separator line

    # Analyze each column position across all sequences
    for col in range(num_columns):
        # Extract all bases from current column across all sequences
        column = [sequences[row][col] for row in range(num_sequences)]

        # Count occurrences of each base in the column using Counter
        base_counts = Counter(column)

        # Separate actual bases from gaps
        actual_bases = {base: count for base, count in base_counts.items() if base != '-'}
        gap_count = base_counts.get('-', 0)

        # Find the most common actual base (excluding gaps)
        if actual_bases:
            most_common_base = max(actual_bases, key=actual_bases.get)
            most_common_count = actual_bases[most_common_base]
        else:
            # If all are gaps, set most common as gap for display purposes
            most_common_base = '-'
            most_common_count = gap_count

        # Calculate mismatches:
        # All sequences that don't match the most common actual base are mismatches
        # This includes gaps and different bases
        if actual_bases:
            mismatches = num_sequences - most_common_count
        else:
            # If all are gaps, no mismatches (though this is an edge case)
            mismatches = 0

        total_mismatches += mismatches  # Add to running total

        # Display column analysis results
        bases_str = ''.join(column)  # Convert column list to string for display
        # Print formatted row with column number, bases, most common base, and mismatch count
        print(f"  {col+1:2d}   | {bases_str:5s} | {most_common_base:11s} | {mismatches:9d}")

    # Display summary statistics
    print(f"\nMismatch Summary:")
    print(f"Total sequences: {num_sequences}")
    print(f"Sequence length: {num_columns}")
    print(f"Total mismatches: {total_mismatches}")
    print(f"Average mismatches per column: {total_mismatches/num_columns:.2f}")

    # Calculate mismatch rate as percentage
    total_positions = num_sequences * num_columns  # Total possible positions
    mismatch_rate = (total_mismatches / total_positions) * 100  # Convert to percentage
    print(f"Overall mismatch rate: {mismatch_rate:.1f}%")

    # Return dictionary with analysis results
    return {
        'total_mismatches': total_mismatches,
        'num_sequences': num_sequences,
        'sequence_length': num_columns,
        'mismatch_rate': mismatch_rate
    }

def main():
    # Optional: Set random seed for reproducibility (comment out for truly random results)
    # random.seed(42)

    # Parameters for sequence generation
    num_sequences = 4  # Number of sequences to generate
    sequence_length = 6  # Length of each sequence
    gap_probability = 0.2  # 20% chance of gap at each position

    # Display program header and parameters
    print("=== DNA Sequence Mismatch Counter ===\n")
    print(f"Generating {num_sequences} random sequences of length {sequence_length}")
    print(f"Gap probability: {gap_probability:.1%}")  # Format as percentage
    print(f"Random seed: None (truly random)\n")

    # Generate random sequences
    sequences = []  # Initialize empty list to store sequences
    # Loop to generate specified number of sequences
    for i in range(num_sequences):
        seq = generate_random_sequence(sequence_length, gap_probability)  # Generate one sequence
        sequences.append(seq)  # Add sequence to list

    # Count mismatches in the generated sequences
    results = count_mismatches(sequences)

    # Run another example with different parameters
    print("\n" + "="*50)  # Print separator line
    print("Example 2: More sequences, less gaps")
    print("="*50)

    # Generate second set of sequences with different parameters
    sequences2 = []  # Initialize new list for second example
    # Generate 6 sequences of length 8 with 10% gap probability
    for i in range(6):
        seq = generate_random_sequence(8, gap_probability=0.1)  # Lower gap probability
        sequences2.append(seq)  # Add to second sequence list

    print(f"Generating 6 random sequences of length 8")
    print(f"Gap probability: 10.0%\n")

    # Analyze second set of sequences
    count_mismatches(sequences2)

if __name__ == "__main__":
    main()  # Run main function only if script is executed directly

=== DNA Sequence Mismatch Counter ===

Generating 4 random sequences of length 6
Gap probability: 20.0%
Random seed: None (truly random)

Aligned Sequences:
Seq 1: T T - A T C
Seq 2: T G G C - T
Seq 3: A A - T G C
Seq 4: C T T - A A

Mismatch Analysis:
Column | Bases | Most Common | Mismatches
---------------------------------------------
   1   | TTAC  | T           |         2
   2   | TGAT  | T           |         2
   3   | -G-T  | G           |         3
   4   | ACT-  | A           |         3
   5   | T-GA  | T           |         3
   6   | CTCA  | C           |         2

Mismatch Summary:
Total sequences: 4
Sequence length: 6
Total mismatches: 15
Average mismatches per column: 2.50
Overall mismatch rate: 62.5%

Example 2: More sequences, less gaps
Generating 6 random sequences of length 8
Gap probability: 10.0%

Aligned Sequences:
Seq 1: G A C C - G T T
Seq 2: G G G T - G C C
Seq 3: A T T - G T G A
Seq 4: - G C A A T T C
Seq 5: - T C A T - C A
Seq 6: T C T G A T C G

Mismatch

In [12]:
# Verification: check consensus generation against a brute-force reference
import random
from collections import Counter

# Define the functions under test (repeated here so the cell is self-contained)
def generate_random_sequence(length, gap_probability=0.15):
    """Generate a random DNA sequence with possible gaps."""
    bases = ['A', 'T', 'G', 'C']
    sequence = []

    for _ in range(length):
        if random.random() < gap_probability:
            sequence.append('-')
        else:
            sequence.append(random.choice(bases))

    return sequence

def generate_consensus_sequence(sequences, threshold=0.5, include_gaps=True):
    """Generate a consensus sequence from aligned sequences."""
    if not sequences:
        return {"error": "No sequences provided"}

    num_columns = len(sequences[0])
    num_sequences = len(sequences)
    consensus = []
    consensus_strength = []

    # Analyze each column
    for col in range(num_columns):
        column = [sequences[row][col] for row in range(num_sequences)]

        # Determine bases to count and the total count for strength calculation
        if include_gaps:
            base_counts = Counter(column)
            total_count_for_strength = num_sequences
        else:
            non_gap_bases = [base for base in column if base != '-']
            base_counts = Counter(non_gap_bases)
            total_count_for_strength = len(non_gap_bases)

        # Find the most common base and check threshold
        if base_counts and total_count_for_strength > 0:
            most_common_base, count = base_counts.most_common(1)[0]
            strength = count / total_count_for_strength

            # Apply the threshold check explicitly
            if strength >= threshold:
                consensus_base = most_common_base
            else:
                consensus_base = 'N'
        else:
            # If no bases (e.g., empty column after removing gaps and include_gaps=False)
            consensus_base = 'N'
            strength = 0

        consensus.append(consensus_base)
        consensus_strength.append(strength)


    # Calculate metrics (guard against a zero-column alignment to avoid dividing by zero)
    num_positions = len(consensus)
    avg_strength = sum(consensus_strength) / num_positions if num_positions else 0.0
    ambiguous_positions = consensus.count('N')

    return {
        'consensus': consensus,
        'consensus_string': ''.join(consensus),
        'strength': consensus_strength,
        'average_strength': avg_strength,
        'ambiguous_positions': ambiguous_positions,
        'coverage': (num_positions - ambiguous_positions) / num_positions if num_positions else 0.0
    }

# VERIFICATION FUNCTIONS

def brute_force_consensus(sequences, threshold=0.5, include_gaps=True):
    """
    Brute force verification: manually generate consensus for comparison.
    This is a simple, obviously correct implementation.
    """
    if not sequences:
        return []

    consensus = []
    sequence_length = len(sequences[0])
    num_sequences = len(sequences)

    # For each column
    for col in range(sequence_length):
        # Get all bases in this column
        column_bases = []
        for row in range(num_sequences):
            column_bases.append(sequences[row][col])

        # Handle gaps
        if include_gaps:
            bases_to_count = column_bases
            total_count = num_sequences
        else:
            bases_to_count = [base for base in column_bases if base != '-']
            total_count = len(bases_to_count)

        # Count each base type manually
        base_counts = {}
        for base in bases_to_count:
            if base in base_counts:
                base_counts[base] += 1
            else:
                base_counts[base] = 1

        # Find the most common base. The strict '>' means the first base seen
        # wins ties; base_counts is empty (and the loop is skipped) when the
        # column held only gaps and include_gaps=False.
        most_common_base = None
        max_count = 0
        for base, count in base_counts.items():
            if count > max_count:
                max_count = count
                most_common_base = base

        # Apply the threshold; a column with nothing to count is ambiguous
        if most_common_base is not None and total_count > 0:
            strength = max_count / total_count
            if strength >= threshold:
                consensus.append(most_common_base)
            else:
                consensus.append('N')
        else:
            # No bases to count (e.g., an all-gap column with include_gaps=False)
            consensus.append('N')


    return consensus
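
# Illustrative sketch (an addition, not part of the original suite): the manual
# scan in brute_force_consensus uses a strict `>` comparison, so when two bases
# tie for the highest count, the one seen first in the column wins. The helper
# name `_first_seen_majority` is hypothetical; it mirrors that loop for a single
# column so the tie rule can be checked in isolation.
def _first_seen_majority(column):
    counts = {}
    for base in column:
        counts[base] = counts.get(base, 0) + 1
    best_base, best_count = None, 0
    for base, count in counts.items():  # dicts preserve insertion order (Python 3.7+)
        if count > best_count:
            best_base, best_count = base, count
    return best_base, best_count

# A 2-2 tie resolves to whichever base appears first in the column.
assert _first_seen_majority(['G', 'G', 'A', 'A']) == ('G', 2)
assert _first_seen_majority(['A', 'G', 'A', 'G']) == ('A', 2)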

def test_known_cases():
    """Test with known, hand-calculated examples."""

    # Test Case 1: Perfect consensus (all same)
    perfect_sequences = [
        ['A', 'T', 'G', 'C'],
        ['A', 'T', 'G', 'C'],
        ['A', 'T', 'G', 'C']
    ]

    result = generate_consensus_sequence(perfect_sequences, threshold=0.5)
    brute_force_result = brute_force_consensus(perfect_sequences, threshold=0.5)

    assert result['consensus'] == ['A', 'T', 'G', 'C'], "Perfect consensus should be ATGC"
    assert result['consensus'] == brute_force_result, "Results should match brute force"
    assert result['ambiguous_positions'] == 0, "Perfect consensus should have no ambiguous positions"
    assert result['coverage'] == 1.0, "Perfect consensus should have 100% coverage"

    # Test Case 2: Majority rule
    majority_sequences = [
        ['A', 'T', 'G', 'C'],
        ['A', 'T', 'G', 'C'],
        ['A', 'G', 'A', 'T']
    ]

    result = generate_consensus_sequence(majority_sequences, threshold=0.5)
    brute_force_result = brute_force_consensus(majority_sequences, threshold=0.5)

    assert result['consensus'] == ['A', 'T', 'G', 'C'], "Majority rule should be ATGC"
    assert result['consensus'] == brute_force_result, "Results should match brute force"
    assert result['ambiguous_positions'] == 0, "Should have 0 ambiguous positions"

    # Test Case 3: Strict consensus (100% threshold)
    result_strict = generate_consensus_sequence(majority_sequences, threshold=1.0)
    brute_force_strict = brute_force_consensus(majority_sequences, threshold=1.0)

    assert result_strict['consensus'] == ['A', 'N', 'N', 'N'], "Strict consensus should be ANNN"
    assert result_strict['consensus'] == brute_force_strict, "Results should match brute force"
    assert result_strict['ambiguous_positions'] == 3, "Should have 3 ambiguous positions"

def test_gap_handling():
    """Test gap handling in consensus generation."""

    # Test Case 1: Gaps included
    gap_sequences = [
        ['A', 'T', '-', 'C'],
        ['A', '-', 'G', 'C'],
        ['A', 'T', 'G', '-']
    ]

    result_with_gaps = generate_consensus_sequence(gap_sequences, threshold=0.5, include_gaps=True)
    brute_force_with_gaps = brute_force_consensus(gap_sequences, threshold=0.5, include_gaps=True)

    assert result_with_gaps['consensus'] == ['A', 'T', 'G', 'C'], "Should be ATGC with gaps included"
    assert result_with_gaps['consensus'] == brute_force_with_gaps, "Results should match brute force"

    # Test Case 2: Gaps excluded
    gap_sequences_2 = [
        ['A', 'T', '-', 'C'],
        ['A', '-', 'G', 'C'],
        ['A', 'T', 'G', '-']
    ]

    result_without_gaps = generate_consensus_sequence(gap_sequences_2, threshold=0.5, include_gaps=False)
    brute_force_without_gaps = brute_force_consensus(gap_sequences_2, threshold=0.5, include_gaps=False)

    assert result_without_gaps['consensus'] == ['A', 'T', 'G', 'C'], "Should be ATGC with gaps excluded"
    assert result_without_gaps['consensus'] == brute_force_without_gaps, "Results should match brute force"

def test_edge_cases():
    """Test edge cases and boundary conditions."""

    # Test Case 1: Single sequence
    single_seq = [['A', 'T', 'G', 'C']]
    result = generate_consensus_sequence(single_seq, threshold=0.5)
    brute_force_result = brute_force_consensus(single_seq, threshold=0.5)

    assert result['consensus'] == ['A', 'T', 'G', 'C'], "Single sequence should be perfect consensus"
    assert result['consensus'] == brute_force_result, "Results should match brute force"
    assert result['coverage'] == 1.0, "Single sequence should have 100% coverage"

    # Test Case 2: All gaps column
    all_gaps = [
        ['A', '-', 'C'],
        ['T', '-', 'G'],
        ['G', '-', 'A']
    ]

    result = generate_consensus_sequence(all_gaps, threshold=0.5, include_gaps=True)
    brute_force_result = brute_force_consensus(all_gaps, threshold=0.5, include_gaps=True)

    assert result['consensus'][1] == '-', "All gaps column should have gap consensus"
    assert result['consensus'] == brute_force_result, "Results should match brute force"

    # Test Case 3: Empty sequences
    empty_result = generate_consensus_sequence([])

    assert "error" in empty_result, "Empty input should return error"

def test_threshold_behavior():
    """Test different threshold values."""

    test_sequences = [
        ['A', 'T', 'G', 'C'],
        ['A', 'T', 'G', 'T'],
        ['A', 'G', 'A', 'T'],
        ['T', 'T', 'A', 'T']
    ]

    # Test different thresholds
    thresholds = [0.25, 0.5, 0.75, 1.0]

    for threshold in thresholds:
        result = generate_consensus_sequence(test_sequences, threshold=threshold)
        brute_force_result = brute_force_consensus(test_sequences, threshold=threshold)

        assert result['consensus'] == brute_force_result, f"Threshold {threshold} failed"

        # Raising the threshold can never reduce the number of ambiguous positions
        if threshold > 0.25:
            prev_result = generate_consensus_sequence(test_sequences, threshold=threshold - 0.25)
            assert result['ambiguous_positions'] >= prev_result['ambiguous_positions'], \
                "Higher threshold should have at least as many ambiguous positions"

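# Illustrative sketch (an addition, not part of the original suite): why the
# monotonicity check in test_threshold_behavior is expected to hold. Under the
# rule used by brute_force_consensus, a column becomes 'N' when its majority
# fraction is below the threshold, so raising the threshold can only turn more
# columns ambiguous. The strengths are hand-computed from that test's
# `test_sequences` matrix (columns: A 3/4, T 3/4, G/A tie 2/4, T 3/4); the
# helper name `_count_below` is hypothetical.
def _count_below(strengths, threshold):
    return sum(1 for s in strengths if s < threshold)

_example_strengths = [0.75, 0.75, 0.5, 0.75]
_ambiguous_by_threshold = [
    _count_below(_example_strengths, t) for t in (0.25, 0.5, 0.75, 1.0)
]
assert _ambiguous_by_threshold == [0, 0, 1, 4]  # non-decreasing in the threshold
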
def test_random_sequences():
    """Test with random sequences."""

    random.seed(42)

    for test_num in range(10):
        # Generate random parameters
        num_sequences = random.randint(2, 6)
        sequence_length = random.randint(3, 8)
        threshold = random.uniform(0.3, 0.9)
        include_gaps = random.choice([True, False])

        # Generate random sequences
        sequences = []
        for _ in range(num_sequences):
            seq = generate_random_sequence(sequence_length, 0.2)
            sequences.append(seq)

        # Test both methods
        result = generate_consensus_sequence(sequences, threshold=threshold, include_gaps=include_gaps)
        brute_force_result = brute_force_consensus(sequences, threshold=threshold, include_gaps=include_gaps)

        assert result['consensus'] == brute_force_result, f"Random test {test_num + 1} failed"

        # Test properties
        assert len(result['consensus']) == sequence_length, "Consensus length should match input"
        assert 0 <= result['coverage'] <= 1.0, "Coverage should be between 0 and 1"
        assert result['ambiguous_positions'] == result['consensus'].count('N'), "Ambiguous count should match N count"

def test_mathematical_properties():
    """Test mathematical properties that should always hold."""

    random.seed(123)

    for _ in range(5):
        num_sequences = random.randint(2, 5)
        sequence_length = random.randint(3, 6)

        # Build the alignment; the comprehension avoids reusing the outer loop's `_`
        sequences = [
            generate_random_sequence(sequence_length, 0.1)
            for _ in range(num_sequences)
        ]

        # Test with different thresholds
        result_low = generate_consensus_sequence(sequences, threshold=0.3)
        result_high = generate_consensus_sequence(sequences, threshold=0.8)

        # Raising the threshold can never reduce the number of ambiguous positions
        assert result_high['ambiguous_positions'] >= result_low['ambiguous_positions'], \
            "Higher threshold should have at least as many ambiguous positions"


if __name__ == "__main__":
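    # Run all tests; any failure raises AssertionError and stops the run
    test_known_cases()
    test_gap_handling()
    test_edge_cases()
    test_threshold_behavior()
    test_random_sequences()
    test_mathematical_properties()
    print("All consensus verification tests passed.")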