Sequence comparison and alignment, combined with the systematic collection and search of databases containing biomolecular sequences, both DNA and protein, have become an essential part of modern molecular biology. Molecular sequence data is important because two proteins with similar sequences often have similar functions or structures. This means that we can learn about the function of a protein in humans by studying the functions of proteins with similar sequences in simpler organisms such as yeast, fruit flies, or frogs.

We will examine methods for comparing two sequences, working with two strings $S$ and $T$ of lengths $m$ and $n$ respectively, composed of characters from some finite alphabet. Write $S_i$ for the $i$-th character of $S$, and $S[i, j]$ for the substring $S_i, \dots, S_j$. If $i > j$ then $S[i, j]$ is the empty string. A prefix of $S$ is a substring $S[1, k]$ for some $k \leq m$ (possibly the empty prefix, when $k = 0$). Similarly, a suffix of $S$ is a substring $S[k, m]$ for some $k \geq 1$ (possibly the empty suffix, when $k = m + 1$).
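
The notation is 1-indexed and inclusive at both ends, whereas Python slices are 0-indexed and exclude the end point. The following minimal sketch shows the translation; the helper name `substring` is introduced here for illustration only.

```python
def substring(s, i, j):
    """Return S[i, j] using the 1-indexed, inclusive convention of the text."""
    if i > j:
        return ""        # empty string by convention
    return s[i - 1:j]    # Python slices are 0-indexed and end-exclusive

S = "fruit"
m = len(S)
print(substring(S, 2, 4))        # rui  (characters S_2, S_3, S_4)
print(substring(S, 1, 3))        # fru  (the prefix S[1, 3])
print(substring(S, 4, m))        # it   (the suffix S[4, m])
print(repr(substring(S, 3, 2)))  # ''   (i > j gives the empty string)
```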

Suppose $S = \textit{fruit}$ and $T = \textit{berry}$. We can transform $S$ into $T$ by
1.  Replacing $f$ with $b$,
2.  Inserting $e$,
3.  Matching $r$ with $r$,
4.  Replacing $u$ with $r$,
5.  Replacing $i$ with $y$,
6.  Deleting $t$.

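We call $\text{RIMRRD}$ the editing transcript, where R stands for Replace, I for Insert, M for Match, and D for Delete. The corresponding alignment writes $S$ above $T$, with a gap opposite each inserted or deleted character, and is read vertically, column by column.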
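
To make the link between a transcript and its alignment concrete, here is a small sketch that replays a transcript over the two strings and builds the two alignment rows. The function name `alignment_from_transcript` and the use of `-` for a gap are assumptions made for this illustration.

```python
def alignment_from_transcript(s, t, transcript):
    """Build the two alignment rows described by an edit transcript.

    R = replace, M = match, I = insert (gap in s), D = delete (gap in t).
    """
    row_s, row_t = [], []
    i = j = 0  # number of characters of s and t consumed so far
    for op in transcript:
        if op in ("M", "R"):
            row_s.append(s[i])
            row_t.append(t[j])
            i += 1
            j += 1
        elif op == "I":
            row_s.append("-")
            row_t.append(t[j])
            j += 1
        elif op == "D":
            row_s.append(s[i])
            row_t.append("-")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s) or j != len(t):
        raise ValueError("transcript does not consume both strings")
    return "".join(row_s), "".join(row_t)

top, bottom = alignment_from_transcript("fruit", "berry", "RIMRRD")
print(top)     # f-ruit
print(bottom)  # berry-
```

Each column of the printed rows corresponds to one letter of the transcript.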

There are many possible ways to transform one string into another, but from an evolutionary point of view there must be a cost associated with any action other than matching. The optimal edit transcripts are those that involve the fewest edit operations (Replace, Insert, and Delete). We define the edit distance $d(S, T)$ to be the minimal number of edit operations needed to transform $S$ into $T$. Let $D(i, j) = d(S[1, i], T[1, j])$. Observe that $d(S, T) = D(m, n)$.


To find the edit distance $D(i, j)$ between the prefixes $S[1, i]$ and $T[1, j]$ for $i, j \geq 1$, we consider the last operation performed to achieve the alignment. There are only three possible ways the alignment can end at indices $i$ and $j$:

1.  Deletion (aligning $S_i$ with a gap)

    We assume that $S_i$ is deleted. This means we must have already transformed the prefix $S[1, i-1]$ into $T[1, j]$. The total cost is the cost of the sub-problem, $D(i-1, j)$, plus the cost of one deletion, giving $D(i-1, j) + 1$.

2.  Insertion (aligning a gap with $T_j$)

    We assume that $T_j$ is inserted. This means we must have already transformed the full prefix $S[1, i]$ into the prefix $T[1, j-1]$. The total cost is the cost of the sub-problem, $D(i, j-1)$, plus the cost of one insertion, giving $D(i, j-1) + 1$.

3.  Match or Replacement (aligning $S_i$ with $T_j$)

    We assume that $S_i$ and $T_j$ are aligned together. This means we must have already transformed $S[1, i-1]$ into $T[1, j-1]$. The total cost is the cost of the sub-problem, $D(i-1, j-1)$, plus the cost of the action on the current characters, which is $0$ for a match and $1$ for a replacement. This gives $D(i-1, j-1) + s(S_i, T_j)$, where

    \begin{equation}
        s(a, b) =
        \begin{cases}
        0 & a = b, \\
        1 & a \neq b.
        \end{cases}
    \end{equation}

Since $D(i, j)$ is defined as the minimal number of edits, it must be the minimum of these three possible scenarios. Thus,
\begin{equation}
    D(i, j) = \min \{ D(i-1, j) + 1, \ D(i, j-1) + 1, \ D(i-1, j-1) + s(S_i, T_j) \}.
\end{equation}

To solve the recurrence, we need base cases for when one or both strings are empty. These are given by
\begin{align}
    D(0, 0) = 0, \quad D(i, 0) = i, \quad D(0, j) = j.
\end{align}
The first comes from the fact that the edit distance between two empty strings is $0$. In the second case, transforming a string of length $i$ into an empty string requires deleting all $i$ characters. In the final situation, transforming an empty string into a string of length $j$ requires inserting all $j$ characters.
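
As a sanity check, the recurrence and base cases can be transcribed almost literally as a memoised recursion. This is a minimal sketch rather than a recommended implementation: the name `edit_distance_recursive` is introduced here, and Python's recursion limit makes it suitable only for short strings.

```python
from functools import lru_cache

def edit_distance_recursive(s, t):
    """Edit distance via the recurrence for D(i, j), evaluated top-down."""

    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 0:   # base case D(0, j) = j: insert all j characters
            return j
        if j == 0:   # base case D(i, 0) = i: delete all i characters
            return i
        cost = 0 if s[i - 1] == t[j - 1] else 1   # s(S_i, T_j)
        return min(
            D(i - 1, j) + 1,         # deletion
            D(i, j - 1) + 1,         # insertion
            D(i - 1, j - 1) + cost,  # match or replacement
        )

    return D(len(s), len(t))

print(edit_distance_recursive("fruit", "berry"))  # 5
```

For the earlier example the result is $5$, so the transcript $\text{RIMRRD}$, which uses five edit operations, is optimal.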


In [33]:
def edit_distance(str1, str2):
    m = len(str1)
    n = len(str2)

    # Create an (m+1) by (n+1) table to store results of subproblems
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # If str1 is empty, then insert all characters of str2
    for j in range(n + 1):
        dp[0][j] = j

    # If str2 is empty, then delete all characters of str1
    for i in range(m + 1):
        dp[i][0] = i

    # Fill the table using the recurrence relation
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Determine cost s(Si, Tj)
            if str1[i - 1] == str2[j - 1]:
                cost = 0
            else:
                cost = 1
            # Apply the formula
            dp[i][j] = min(
                dp[i - 1][j] + 1,       # Deletion
                dp[i][j - 1] + 1,       # Insertion
                dp[i - 1][j - 1] + cost # Match or Replacement
            )

    return dp

s = "shesells"
t = "seashells"
distance = edit_distance(s, t)[-1][-1]
print(f"Distance between '{s}' and '{t}': {distance}")

Distance between 'shesells' and 'seashells': 3


The complexity is analysed in terms of the lengths of the two strings: let $m$ be the length of $S$ and $n$ the length of $T$. The time complexity is $O(m \times n)$, since the outer loop runs $m$ times, the inner loop runs $n$ times, and each inner iteration performs a constant number of operations. The space complexity is also $O(m \times n)$, since we allocate a table of size $(m+1) \times (n+1)$ to store the intermediate edit distances.
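
If only the distance is required and not the full table, the space can be reduced, because each row of the table depends only on the row above it. The sketch below keeps two rows at a time and uses $O(\min(m, n))$ space; the function name `edit_distance_two_rows` is an assumption of this example, and the swap relies on the distance being symmetric under unit costs.

```python
def edit_distance_two_rows(s, t):
    """Edit distance using two rows of the table instead of all m + 1."""
    if len(t) > len(s):
        s, t = t, s  # keep the shorter string along the row
    prev = list(range(len(t) + 1))          # row 0: D(0, j) = j
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)           # D(i, 0) = i
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or replacement
            )
        prev = curr
    return prev[-1]

print(edit_distance_two_rows("shesells", "seashells"))  # 3
```

The trade-off is that the discarded rows are no longer available for recovering an alignment.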

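Usually, we are interested in finding an optimal editing transcript and alignment of $S$ and $T$ rather than just $d(S, T)$. This can be done by assigning a pointer when calculating $D(i, j)$, pointing to whichever of $D(i - 1, j)$, $D(i, j - 1)$, or $D(i - 1, j - 1)$ attains the minimum, and then following the pointers back from $D(m, n)$ to $D(0, 0)$. Equivalently, the pointers can be recomputed during the traceback by checking which of the three cases yields $D(i, j)$.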
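
The pointer idea can be sketched as follows. The names `edit_transcript` and `ptr` are illustrative, and ties are broken in favour of the diagonal move, which is an arbitrary choice: a different tie-breaking rule may return a different transcript with the same number of edits.

```python
def edit_transcript(s, t):
    """Return (edit distance, one optimal transcript) using stored pointers."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]  # operation that ends at (i, j)

    for i in range(1, m + 1):
        D[i][0], ptr[i][0] = i, "D"
    for j in range(1, n + 1):
        D[0][j], ptr[0][j] = j, "I"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = s[i - 1] == t[j - 1]
            candidates = [
                (D[i - 1][j - 1] + (0 if same else 1), "M" if same else "R"),
                (D[i - 1][j] + 1, "D"),
                (D[i][j - 1] + 1, "I"),
            ]
            # min returns the first minimal candidate, so the diagonal wins ties
            D[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])

    # Follow the pointers back from (m, n) to (0, 0)
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op in ("M", "R"):
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        else:
            j -= 1
    return D[m][n], "".join(reversed(ops))

dist, ops = edit_transcript("fruit", "berry")
print(dist, ops)
```

Storing pointers costs an extra table of the same size, but the traceback then needs no further comparisons.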

In [34]:
# @title Protein Data

# Duckbill platypus
Protein_A = """
MGLSDGEWQLVLKVWGKVEGDLPGHGQEVLIRLFKTHPETLEKFDKFKGLKTEDEMKASA
DLKKHGGTVLTALGNILKKKGQHEAELKPLAQSHATKHKISIKFLEYISEAIIHVLQSKH
SADFGADAQAAMGKALELFRNDMAAKYKEFGFQG
"""

# Yellowfin tuna
Protein_B = """
MADFDAVLKCWGPVEADYTTMGGLVLTRLFKEHPETQKLFPKFAGIAQADIAGNAAISAH
GATVLKKLGELLKAKGSHAAILKPLANSHATKHKIPINNFKLISEVLVKVMHEKAGLDAG
GQTALRNVMGIIIADLEANYKELGFSG
"""

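The triple-quoted literals above keep their line breaks, so `len(Protein_A)` counts newline characters as well as residues, and an alignment of the raw strings treats each newline as an extra symbol. If the line breaks are only there for layout, a small helper such as the hypothetical `clean_sequence` below (not part of the original notebook) can remove them before alignment. Results computed on cleaned sequences may differ from those computed on the raw strings.

```python
def clean_sequence(raw):
    """Remove all whitespace, including line breaks, from a sequence literal."""
    return "".join(raw.split())

example = """
MGLSDGEW
QLVLKVWG
"""
# The raw literal has 16 residues plus 3 newline characters
print(len(example), len(clean_sequence(example)))  # 19 16
```
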
In [42]:
def get_optimal_alignment(seq1, seq2):
    m = len(seq1)
    n = len(seq2)
    dp = edit_distance(seq1, seq2)
    edit_dist = dp[-1][-1]

    # Traceback to find the optimal alignment
    align1 = []
    align2 = []
    i, j = m, n

    while i > 0 or j > 0:
        current_score = dp[i][j]

        # Check if Match/Replace
        if i > 0 and j > 0:
            if seq1[i - 1] == seq2[j - 1]:
                cost = 0
            else:
                cost = 1

            if dp[i - 1][j - 1] + cost == current_score:
                align1.append(seq1[i - 1])
                align2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue

        # Check if Deletion (from seq1)
        if i > 0 and dp[i - 1][j] + 1 == current_score:
            align1.append(seq1[i - 1])
            align2.append("-")
            i -= 1
            continue

        # Check if Insertion (into seq1)
        if j > 0 and dp[i][j - 1] + 1 == current_score:
            align1.append("-")
            align2.append(seq2[j - 1])
            j -= 1
            continue

    # The traceback builds the alignment backwards, so reverse it
    align1 = "".join(align1[::-1])
    align2 = "".join(align2[::-1])

    return edit_dist, align1, align2

dist, a1, a2 = get_optimal_alignment(Protein_A, Protein_B)
print(f"Edit Distance: {dist}")
print("First 50 steps of optimal alignment:")
print(f"A: {a1[:50]}")
print(f"B: {a2[:50]}")

Edit Distance: 86
First 50 steps of optimal alignment:
A: 
MGLSDGEWQLVLKVWGKVEGDLPGHGQEVLIRLFKTHPETLEKFDKFKG
B: 
M--AD--FDAVLKCWGPVEADYTTMGGLVLTRLFKEHPETQKLFPKFAG


A protein is essentially a long sequence of amino acids. Approximately twenty types of amino acid (the exact number is species dependent) are involved in the construction of each protein. A gene is a sequence of DNA which can be translated into a sequence of amino acids, i.e., a protein. Mutations in DNA will lead to changes in the sequence of amino acids, and some mutations are more likely than others. We shall adjust our scoring algorithm in order to capture some of these biological considerations.

The adjustment is achieved by replacing the scoring function $s(a, b)$. There are various schemes for assessing the probability of a mutation from amino acid $a$ to amino acid $b$; currently the two dominant schemes are the PAM matrices and the BLOSUM matrices.

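Let $v(S, T)$ be the maximum score over all edit transcripts from $S$ to $T$. We now use a BLOSUM matrix as the scoring function $s$, in which positive scores mark pairs of amino acids that substitute for one another more often than expected by chance, and score $-8$ for each Insert or Delete. Because the scores reward similarity rather than count edits, we maximise instead of minimise.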
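
Under this scheme the recurrence has the same shape as the one for $D(i, j)$, with $\min$ replaced by $\max$ and the unit costs replaced by scores. Writing $V(i, j) = v(S[1, i], T[1, j])$ and $g = -8$ for the gap score (the symbols $V$ and $g$ are introduced here for illustration), one way to state it is
\begin{equation}
    V(i, j) = \max \{ V(i-1, j) + g, \ V(i, j-1) + g, \ V(i-1, j-1) + s(S_i, T_j) \},
\end{equation}
with boundary conditions
\begin{align}
    V(0, 0) = 0, \quad V(i, 0) = i g, \quad V(0, j) = j g,
\end{align}
so that $v(S, T) = V(m, n)$.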

In [55]:
# @title BLOSUM Matrix
blosum_text = """
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T -1  1  5 -1  0 -2  0 -1 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -2
P -3 -1 -1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A  0  1  0 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G -3  0 -2 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N -3  1  0 -2 -2  0  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D -3  0 -1 -1 -2 -1  1  6  2  0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4  0 -1 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q -3  0 -1 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H -3 -1 -2 -2 -2 -2  1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
"""

In [58]:
def solve_alignment(seq_A, seq_B):
    gap_penalty = -8
    blosum = {}
    lines = blosum_text.strip().split('\n')
    aa = lines[0].split()

    # Each data row is a row label followed by one score per column label
    for line in lines[1:]:
        row_char, *scores = line.split()
        for col_char, score in zip(aa, scores):
            blosum[(row_char, col_char)] = int(score)

    def get_score(a, b):
        # Handle cases where chars might not be in BLOSUM (e.g. 'X' or gaps)
        return blosum.get((a, b), -4) # Default miss penalty if not found

    m, n = len(seq_A), len(seq_B)

    # DP Matrix initialisation
    # V[i][j] stores the max score
    V = [[0] * (n + 1) for _ in range(m + 1)]

    # Boundary conditions
    for i in range(1, m + 1):
        V[i][0] = i * gap_penalty
    for j in range(1, n + 1):
        V[0][j] = j * gap_penalty

    # Fill Table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score_match = V[i-1][j-1] + get_score(seq_A[i-1], seq_B[j-1])
            score_del   = V[i-1][j] + gap_penalty
            score_ins   = V[i][j-1] + gap_penalty

            V[i][j] = max(score_match, score_del, score_ins)

    max_score = V[m][n]

    # Traceback
    align_A = []
    align_B = []
    i, j = m, n

    while i > 0 or j > 0:
        current = V[i][j]

        # Check Match/Mutation
        if i > 0 and j > 0:
            s = get_score(seq_A[i-1], seq_B[j-1])
            if current == V[i-1][j-1] + s:
                align_A.append(seq_A[i-1])
                align_B.append(seq_B[j-1])
                i -= 1
                j -= 1
                continue

        # Check Deletion
        if i > 0 and current == V[i-1][j] + gap_penalty:
            align_A.append(seq_A[i-1])
            align_B.append("-")
            i -= 1
            continue

        # Check Insertion
        if j > 0 and current == V[i][j-1] + gap_penalty:
            align_A.append("-")
            align_B.append(seq_B[j-1])
            j -= 1
            continue

    # Reverse strings
    align_A = "".join(align_A[::-1])
    align_B = "".join(align_B[::-1])

    return max_score, align_A, align_B

score, a1, a2 = solve_alignment(Protein_A, Protein_B)
print(f"Alignment Score: {score}")
print("First 50 steps of optimal alignment:")
print(f"A: {a1[:50]}")
print(f"B: {a2[:50]}")

Alignment Score: 262
First 50 steps of optimal alignment:
A: 
MGLSDGEWQLVLKVWGKVEGDLPGHGQEVLIRLFKTHPETLEKFDKFKG
B: 
M--AD--FDAVLKCWGPVEADYTTMGGLVLTRLFKEHPETQKLFPKFAG
