# Nucleotide Sequence Alignment Tool — Demo

This notebook demonstrates the **SeqAlign** tool, which performs pairwise nucleotide sequence alignment using the **Smith-Waterman algorithm** with **affine gap penalties**.

## How It Works

1. **Parse** two DNA sequences from FASTA files
2. **Align** them using Smith-Waterman (local alignment) with configurable scoring
3. **Display** the best alignment with statistics
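
To make step 2 concrete, here is a minimal, score-only sketch of local alignment with affine gaps (the Gotoh formulation of Smith-Waterman). It is not SeqAlign's implementation: the function name `local_affine_score` is hypothetical, there is no traceback, and it assumes a gap of length *k* costs `gap_open + k * gap_ext`, which may differ from the convention `smith_waterman` uses.

```python
def local_affine_score(a, b, match=5, mismatch=-4, gap_open=-12, gap_ext=-2):
    """Best local alignment score with affine gaps (score only, no traceback).

    Assumed convention: a gap of length k costs gap_open + k * gap_ext.
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)        # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (m + 1)  # F: best score ending in a gap that skips bases of a
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # E: best score ending in a gap that skips bases of b
        for j in range(1, m + 1):
            # Either open a new gap from H or extend the existing one.
            e = max(h_cur[j - 1] + gap_open + gap_ext, e + gap_ext)
            f_cur[j] = max(h_prev[j] + gap_open + gap_ext, f_prev[j] + gap_ext)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(local_affine_score("ACGTACGTAAAAACGTACGT", "ACGTACGTACGTACGT"))
```

The three-state recurrence (`H`, `E`, `F`) is what distinguishes affine from linear gap penalties: opening a gap and extending one are scored separately.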

In [None]:
import sys
sys.path.insert(0, '..')

from seqalign import parse_fasta, smith_waterman, format_alignment

---
## Example 1: Basic Alignment with Default Parameters

We align two closely related DNA sequences (human and mouse BRCA1 exon fragments) using the default scoring:
- Match: **+5**
- Mismatch: **-4**
- Gap opening: **-12**
- Gap extension: **-2**
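
With affine penalties, the total cost of a gap depends on its length. This notebook does not say whether SeqAlign's gap-opening penalty already covers the first gapped base, so the hypothetical helper `gap_cost` below shows both common conventions side by side, as a sketch rather than a description of SeqAlign's behaviour.

```python
def gap_cost(length, gap_open=-12, gap_ext=-2, open_covers_first_base=False):
    """Total penalty for one contiguous gap of `length` bases.

    open_covers_first_base=False: cost = gap_open + length * gap_ext
    open_covers_first_base=True:  cost = gap_open + (length - 1) * gap_ext
    Which convention SeqAlign follows is an assumption to verify.
    """
    if length <= 0:
        return 0
    extended = length - 1 if open_covers_first_base else length
    return gap_open + extended * gap_ext


for k in (1, 2, 4):
    print(k, gap_cost(k), gap_cost(k, open_covers_first_base=True))
```

The two conventions differ by exactly one `gap_ext`, which matters when comparing scores against other tools.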

In [None]:
# Parse FASTA files
header1, seq1 = parse_fasta('seq1.fasta')
header2, seq2 = parse_fasta('seq2.fasta')

print(f'Sequence 1: {header1}')
print(f'  Length: {len(seq1)} bp')
print(f'  First 60 bp: {seq1[:60]}...')
print()
print(f'Sequence 2: {header2}')
print(f'  Length: {len(seq2)} bp')
print(f'  First 60 bp: {seq2[:60]}...')

In [None]:
# Perform alignment with default parameters
result = smith_waterman(
    seq1, seq2,
    match=5,
    mismatch=-4,
    gap_open=-12,
    gap_ext=-2
)

print(format_alignment(result))

---
## Example 2: Aligning Sequences with Gaps

Let's test with sequences where we expect the algorithm to introduce gaps.

In [None]:
# Sequences with insertions/deletions
seq_a = 'ATCGATCGATCGATCGATCG'
seq_b = 'ATCGATGATCGATCG'  # 5 bp shorter than seq_a; one 'C' of the ATCG repeat is missing after the 6th base

print(f'Seq A: {seq_a} ({len(seq_a)} bp)')
print(f'Seq B: {seq_b} ({len(seq_b)} bp)')
print()

result_gap = smith_waterman(
    seq_a, seq_b,
    match=5,
    mismatch=-4,
    gap_open=-12,
    gap_ext=-2
)

print(format_alignment(result_gap))

---
## Example 3: Effect of Varying Gap Penalties

This example shows how different gap penalty settings affect the alignment. We use the same pair of sequences but vary the gap parameters.
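
To see why the gap parameters matter, it helps to score two candidate alignments of these sequences by hand: one that opens a 4-base gap and one that stays ungapped and accepts mismatches. The sketch below uses a hypothetical helper, `score_alignment`, and assumes each gap costs `gap_open` plus `gap_ext` per gapped base; SeqAlign's own convention may differ.

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_ext):
    """Score two equal-length aligned strings, with '-' marking gaps.

    Assumed convention: gap_open once per gap, plus gap_ext per gapped base.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    open_row = None  # row that currently holds an open gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if open_row != row:
                score += gap_open  # a new gap starts here
                open_row = row
            score += gap_ext
        else:
            open_row = None
            score += match if a == b else mismatch
    return score


gapped = ("ACGTACGTAAAAACGTACGT", "ACGTACGT----ACGTACGT")
ungapped = ("ACGTACGTAAAAACGT", "ACGTACGTACGTACGT")

for label, g_open, g_ext in (("A", -16, -2), ("B", -4, -1)):
    with_gap = score_alignment(*gapped, 5, -4, g_open, g_ext)
    no_gap = score_alignment(*ungapped, 5, -4, g_open, g_ext)
    print(f"Scenario {label}: gapped={with_gap}, ungapped={no_gap}")
```

The ungapped score does not change between scenarios, while the gapped score rises as gaps get cheaper. That difference is what the two scenarios are meant to expose.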

In [None]:
seq_x = 'ACGTACGTAAAAACGTACGT'
seq_y = 'ACGTACGTACGTACGT'  # missing 'AAAA' in the middle

print('=== SCENARIO A: Harsh gap open penalty (gap_open=-16, gap_ext=-2) ===')
print('  (Discourages gaps → may prefer mismatches)\n')
result_a = smith_waterman(seq_x, seq_y, match=5, mismatch=-4, gap_open=-16, gap_ext=-2)
print(format_alignment(result_a))

print('\n=== SCENARIO B: Mild gap open penalty (gap_open=-4, gap_ext=-1) ===')
print('  (Gaps are cheap → algorithm freely introduces them)\n')
result_b = smith_waterman(seq_x, seq_y, match=5, mismatch=-4, gap_open=-4, gap_ext=-1)
print(format_alignment(result_b))

---
## Example 4: Matching EBI LALIGN Parameters

To compare with [EBI LALIGN](https://www.ebi.ac.uk/jdispatcher/psa/lalign?stype=dna&gapext=0), use these parameters:
- Match: **+5**
- Mismatch: **-4**
- Gap opening: **-12**
- Gap extension: **0**

Submit the same sequences at the EBI LALIGN link and compare the results.

In [None]:
# Re-align with EBI LALIGN-like parameters (gap_ext = 0)
result_ebi = smith_waterman(
    seq1, seq2,
    match=5,
    mismatch=-4,
    gap_open=-12,
    gap_ext=0
)

print('Alignment with EBI LALIGN-matching parameters:')
print('  match=5, mismatch=-4, gap_open=-12, gap_ext=0\n')
print(format_alignment(result_ebi))

---
## Example 5: Custom Sequences (User Input)

You can easily align any two sequences by modifying the cell below.

In [None]:
# ── Modify these to try your own sequences ──
my_seq1 = 'GAATTCAGGTTCATGCATCCGATCGATCG'
my_seq2 = 'GAATTCAGGATCATGCATGATCG'

# ── Modify scoring parameters ──
my_match    =  5
my_mismatch = -4
my_gap_open = -12
my_gap_ext  = -2

# ── Run alignment ──
my_result = smith_waterman(
    my_seq1, my_seq2,
    match=my_match,
    mismatch=my_mismatch,
    gap_open=my_gap_open,
    gap_ext=my_gap_ext
)

print(format_alignment(my_result))

---
## Example 6: Aligning Sequences from FASTA Files (User-Provided)

Place your own `.fasta` files in the `examples/` directory and update the paths below. The bare filenames used here are resolved relative to the notebook's working directory.
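
If your sequence exists only as a string, the sketch below is one way to turn it into a FASTA file. The helpers `format_fasta` and `write_fasta` are hypothetical and not part of SeqAlign, and the sketch assumes `parse_fasta` accepts a standard single-record FASTA file with a `>` header line and wrapped sequence lines.

```python
from pathlib import Path


def format_fasta(header, sequence, width=60):
    """Return a single-record FASTA string with lines wrapped at `width`."""
    lines = [f">{header}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def write_fasta(path, header, sequence, width=60):
    """Write one record to `path` (overwrites an existing file)."""
    Path(path).write_text(format_fasta(header, sequence, width))


print(format_fasta("my_sequence demo record", "GAATTCAGGTTCATGCATCCGATCGATCG", width=20))
```

Calling `write_fasta('my_seq.fasta', ...)` would then produce a file whose path you can assign to `file1` or `file2` in the cell below.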

In [None]:
# ── Change these paths to your own FASTA files ──
file1 = 'seq1.fasta'
file2 = 'seq2.fasta'

h1, s1 = parse_fasta(file1)
h2, s2 = parse_fasta(file2)

print(f'File 1: {h1} ({len(s1)} bp)')
print(f'File 2: {h2} ({len(s2)} bp)\n')

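# gap_ext=0 is the LALIGN-style setting from Example 4; use gap_ext=-2 for the defaults from Example 1
result_file = smith_waterman(s1, s2, match=5, mismatch=-4, gap_open=-12, gap_ext=0)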
print(format_alignment(result_file))