Some mechanisms for DNA mutations involve the deletion or insertion of large chunks of DNA. Proteins are often composed of combinations of different domains from a relatively small repertoire; so two protein sequences might be relatively similar over several regions, but differ in other regions where one protein contains a certain domain but the other does not.

At some computational cost, we can still align two protein strings taking gaps into account. Let $w(l) < 0$, for $l > 1$, be the score of deleting (or inserting) a sequence of amino acids of length $l$ from or into a protein. Let $v_{\text{gap}}(S, T)$ be the gap-weighted score between $S$ and $T$, and write
$V_{\text{gap}}(i, j)$ for $v_{\text{gap}}(S[1, i], T[1, j])$. Then
\begin{align}
    V_{\text{gap}}(i, j) = \max \{E(i, j), F(i, j), G(i, j)\}, \\
    E(i, j) = \max_{0 \leq k \leq j-1} \{V_{\text{gap}}(i, k) + w(j -k)\}, \\
    F(i, j) = \max_{0 \leq k \leq i-1} \{V_{\text{gap}}(k, j) + w(i - k)\}, \\
    G(i, j) = V_{\text{gap}}(i - 1, j - 1) + s(S_i, T_j).
\end{align}
Iterating the above equations on the $n$ by $m$ grid has complexity of $O(mn^2 + nm^2)$.

Fortunately, if $w(l)$ takes some fixed value $u$ for all $l > 1$, then there exists an algorithm for finding $v_{\text{gap}}$ which has complexity $O(mn)$. Indeed, calculating the cost of a gap does not depend on the length of the gap, but simply on the existence of a gap. Substituting the fixed constant $w(l) = u$, we obtain
\begin{equation}
    E(i, j) = u + \max_{0 \leq k \leq j-1} V_{\text{gap}}(i, k).
\end{equation}
This tells us that the score for an Insertion at $(i, j)$ is simply the maximum score seen so far in the current row $i$ plus the constant $u$. Similarly, the score for a Deletion is the maximum score seen so far in the current column $j$ plus $u$. Instead of iterating backwards through $k$ for every cell, we can maintain the row and column maximums and update these values incrementally. Thus, the time complexity is $O(mn)$ as we traverse the grid exactly once, performing a constant number of operations and the space complexity is also $O(mn)$, we as need to store the scores.

For a fixed gap penalty $u < 0$, the boundary conditions are similar to before $V(0, 0) = 0$, $V(i, 0) = u$ and $V(0, j) = u$. Here, $u$ represents the cost of a single deletion or insertion of a chunk which is now independent of the length.

In [42]:
# @title Protein Data

Protein_C = """
MTSDCSSTHCSPESCGTASGCAPASSCSVETACLPGTCATSRCQTPSFLSRSRGLTGCLL
PCYFTGSCNSPCLVGNCAWCEDGVFTSNEKETMQFLNDRLASYLEKVRSLEETNAELESR
IQEQCEQDIPMVCPDYQRYFNTIEDLQQKILCTKAENSRLAVQLDNCKLATDDFKSKYES
ELSLRQLLEADISSLHGILEELTLCKSDLEAHVESLKEDLLCLKKNHEEEVNLLREQLGD
RLSVELDTAPTLDLNRVLDEMRCQCETVLANNRREAEEWLAVQTEELNQQQLSSAEQLQG
CQMEILELKRTASALEIELQAQQSLTESLECTVAETEAQYSSQLAQIQCLIDNLENQLAE
IRCDLERQNQEYQVLLDVKARLEGEINTYWGLLDSEDSRLSCSPCSTTCTSSNTCEPCSA
YVICTVENCCL
"""

Protein_D = """
MPYNFCLPSLSCRTSCSSRPCVPPSCHSCTLPGACNIPANVSNCNWFCEGSFNGSEKETM
QFLNDRLASYLEKVRQLERDNAELENLIRERSQQQEPLLCPSYQSYFKTIEELQQKILCT
KSENARLVVQIDNAKLAADDFRTKYQTELSLRQLVESDINGLRRILDELTLCKSDLEAQV
ESLKEELLCLKSNHEQEVNTLRCQLGDRLNVEVDAAPTVDLNRVLNETRSQYEALVETNR
REVEQWFTTQTEELNKQVVSSSEQLQSYQAEIIELRRTVNALEIELQAQHNLRDSLENTL
TESEARYSSQLSQVQSLITNVESQLAEIRSDLERQNQEYQVLLDVRARLECEINTYRSLL
ESEDCNLPSNPCATTNACSKPIGPCLSNPCTSCVPPAPCTPCAPRPRCGPCNSFVR
"""

In [43]:
# @title BLOSUM Matrix
blosum_text = """
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T -1  1  5 -1  0 -2  0 -1 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -2
P -3 -1 -1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A  0  1  0 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G -3  0 -2 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N -3  1  0 -2 -2  0  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D -3  0 -1 -1 -2 -1  1  6  2  0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4  0 -1 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q -3  0 -1 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H -3 -1 -2 -2 -2 -2  1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
"""

In [44]:
def solve_fixed_gap_alignment(seq_c, seq_d):
    u = -12  # Fixed score for insertion/deletion (regardless of length)

    # Parse BLOSUM Matrix
    blosum_map = {}
    lines = blosum_text.strip().split('\n')
    aa = lines[0].split()
    for line in lines[1:]:
        parts = line.split()
        row_char = parts[0]
        scores = [int(x) for x in parts[1:]]
        for idx, score in enumerate(scores):
            blosum_map[(row_char, aa[idx])] = score

    def get_blosum(c1, c2):
        return blosum_map.get((c1, c2), -4)

    m = len(seq_c)
    n = len(seq_d)

    # V[i][j] stores the max score
    V = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

    # Traceback grid stores (prev_i, prev_j) coordinates
    trace = [[None for _ in range(n + 1)] for _ in range(m + 1)]

    # Deleting whole prefix seq_c[0..i] is one block deletion costing u
    for i in range(1, m + 1):
        V[i][0] = u
        trace[i][0] = (0, 0)

    # Inserting whole prefix seq_d[0..j] is one block insertion costing u
    for j in range(1, n + 1):
        V[0][j] = u
        trace[0][j] = (0, 0)

    # col_max_val[j] tracks the max score seen in column j so far
    col_max_val = [V[0][j] for j in range(n + 1)]
    col_max_idx = [0 for _ in range(n + 1)] # Tracks which row index provided the max

    for i in range(1, m + 1):

        # row_max_val tracks the max score seen in current row i so far
        row_max_val = V[i][0]
        row_max_idx = 0

        for j in range(1, n + 1):

            # Match / Replacement
            score_match = V[i-1][j-1] + get_blosum(seq_c[i-1], seq_d[j-1])

            # Horizontal Gap (Insertion into C)
            # Take the best previous column k in this row row_max_val + u
            score_ins = row_max_val + u

            # Vertical Gap (Deletion from C)
            # Take the best previous row k in this column col_max_val + u
            score_del = col_max_val[j] + u

            # Determine the best move
            best_score = max(score_match, score_ins, score_del)
            V[i][j] = best_score

            # Set Traceback Pointer
            if best_score == score_match:
                trace[i][j] = (i-1, j-1)
            elif best_score == score_ins:
                # Came from (i, row_max_idx), a horizontal jump
                trace[i][j] = (i, row_max_idx)
            else:
                # Came from (col_max_idx[j], j), a vertical jump
                trace[i][j] = (col_max_idx[j], j)

            # Update for the next column step
            if V[i][j] > row_max_val:
                row_max_val = V[i][j]
                row_max_idx = j

            # Update for the next row step
            if V[i][j] > col_max_val[j]:
                col_max_val[j] = V[i][j]
                col_max_idx[j] = i

    final_score = V[m][n]

    # --- Traceback / Alignment Reconstruction
    align_c_list = []
    align_d_list = []

    cur_i, cur_j = m, n

    while cur_i > 0 or cur_j > 0:
        prev_i, prev_j = trace[cur_i][cur_j]

        if prev_i == cur_i - 1 and prev_j == cur_j - 1:
            # Diagonal move (Match/Sub)
            align_c_list.append(seq_c[cur_i - 1])
            align_d_list.append(seq_d[cur_j - 1])
        elif prev_i == cur_i:
            # Horizontal jump (Gap in C)
            # We moved from j back to prev_j.
            # This corresponds to characters seq_d[prev_j : cur_j]
            segment_len = cur_j - prev_j
            align_c_list.append("-" * segment_len)
            align_d_list.append(seq_d[prev_j:cur_j][::-1]) # Reverse for building backwards
        elif prev_j == cur_j:
            # Vertical jump (Gap in D)
            # We moved from i back to prev_i.
            # This corresponds to characters seq_c[prev_i : cur_i]
            segment_len = cur_i - prev_i
            align_c_list.append(seq_c[prev_i:cur_i][::-1]) # Reverse for building backwards
            align_d_list.append("-" * segment_len)

        cur_i, cur_j = prev_i, prev_j

    # Reverse list and join
    final_align_c = "".join(align_c_list)[::-1]
    final_align_d = "".join(align_d_list)[::-1]

    return final_score, final_align_c, final_align_d

score, align_c, align_d = solve_fixed_gap_alignment(Protein_C, Protein_D)

print(f"Gap-Weighted Score: {score}")
print("First 50 steps of the optimal alignment:")
print(f"C: {align_c[:50]}")
print(f"D: {align_d[:50]}")

Gap-Weighted Score: 1085
First 50 steps of the optimal alignment:
C: 
M-----------TSDCSSTHCSPESCGTASGCAPASSCSVETACLPGTC
D: 
MPYNFCLPSLSCRTSCSSRPCVPPSCHS---------------------


We may now ask at what threshold a score $v_{\text{gap}}(S, T)$ should be declared to have biological significance. Suppose there are only two letters in our alphabet, $a$ and $b$, corresponding to hydrophobic and hydrophilic amino acids. Let $s(a, a) = s(b, b) = 1$ and $s(a, b) = s(b, a) = -1$. Let $U^n$ be a random protein of length $n$: all the amino acids $U_1^n, \dots, U_n^n$ are independent and identically distributed, with $\Pr(U_i^n = a) = p$ and $\Pr(U_i^n = b) = 1 - p$.

Consider two random proteins $U^n$ and $V^n$, independent and identically
distributed. Let the score of inserting/deleting a sequence of length $l$ be fixed $w(l) = u$ for all $l > 1$. The limit
\begin{equation}
    \lim_{n \to \infty} \frac{E[v_{\text{gap}}(U^n, V^n)]}{n}
\end{equation}
represents the long-term expected optimal score per character. By the property of superadditivity (the score of a concatenated string is at least the sum of the scores of its parts), we know that
\begin{equation}
    \lim_{n \to \infty} \frac{E[v_{\text{gap}}(U^n, V^n)]}{n} = \sup_{n \geq 1} \frac{E[v_{\text{gap}}(U^n, V^n)]}{n}.
\end{equation}
To prove the limit is strictly positive, it suffices to show that there exists some integer length $K$ such that the expected optimal score $E[v_{\text{gap}}(U^K, V^K)]$ is strictly positive.

Consider the simple strategy where we align each character of $U^K$ and $V^K$ without any gaps. Let this score be denoted
\begin{equation}
    S_{\text{diag}} = \sum_{i=1}^{K} s(U_i, V_i)
\end{equation}
where $s(U_i, V_i) = 1$ if $U_i = V_i$ and $-1$ if $U_i \neq V_i$. Then the relevant probabilities to compute the expected value of a single pair score $X_i = s(U_i, V_i)$ are $\Pr(\text{match}) = p^2 + (1-p)^2$ and $\Pr(\text{mismatch}) = 2p(1-p)$. The expected score for one position is
\begin{equation}
    E[X_i] = 1 \cdot (p^2 + (1-p)^2) + (-1) \cdot (2p(1-p)) = (2p - 1)^2.
\end{equation}
Therefore, the expected diagonal score for length $K$ is
\begin{equation}
    \mathbb{E}[S_{\text{diag}}] = K(2p - 1)^2.
\end{equation}
If $p \neq 1/2$, then $(2p-1)^2 > 0$. The expected score of the diagonal alignment is strictly positive for any $K \geq 1$. Since the optimal score is at least the diagonal score, this gives the result.

It remains to consider the case when $p = 1/2$. The diagonal strategy alone yields an expectation of $0$ so we must use the gap options to improve this. We define a new strategy for alignment:
1.  Calculate the diagonal score $S_{\text{diag}}$.
2.  If $S_{\text{diag}}$ is lower than the cost of just replacing the whole string, then we delete the entire string $U^K$ and insert the entire string $V^K$. Here, the cost of a 'bail-out' is $2u$.

The score of the optimal alignment $v_{\text{gap}}$ must be at least the maximum of these two options
\begin{equation}
    v_{\text{gap}}(U^K, V^K) \geq \max(S_{\text{diag}}, 2u).
\end{equation}

Turning to taking expectations, note that $E[S_{\text{diag}}] = 0$ so
\begin{align}
    E[v_{\text{gap}}]
    &\geq E[\max(S_{\text{diag}}, 2u)] \\
    &= E[S_{\text{diag}} + \max(0, 2u - S_{\text{diag}})] \\
    &= E[S_{\text{diag}}] + E[\max(0, 2u - S_{\text{diag}})] \\
    &= E[\max(0, 2u - S_{\text{diag}})].
\end{align}
To prove the expectation is strictly positive, we need to show that there is a non-zero probability that the diagonal score $Y$ drops below $2u$. First, note that $S_{\text{diag}}$ is the sum of $K$ independent random variables taking values $\pm 1$. Choose $K$ large enough such that $-K < 2u$ so that $S_{\text{diag}} = -K$ is a possible event, specifically, when $U_i \neq V_i$ for all $i$. The probability of a mismatch at one position is $2pq = 1/2$ so the probability of $K$ mismatches is $1/2^K > 0$. Hence $\Pr(S_{\text{diag}} < 2u) > 0$ as required.

In [45]:
import random
import time

def calculate_gap_score(s1, s2, u):
    '''
    Calculates the optimal gap-weighted score between two strings
    using a fixed gap penalty u.
    '''
    m = len(s1)
    n = len(s2)

    # We only need two rows to calculate the score plus the column max trackers.
    # Initialise previous row
    # V[0][j] = u (for j > 0), 0 for j=0
    prev_row = [u] * (n + 1)
    prev_row[0] = 0

    # Initialise column max trackers
    # col_max[j] stores max(V[0...i-1][j])
    col_max = list(prev_row)

    current_row = [0] * (n + 1)

    for i in range(1, m + 1):
        # Initialise row max tracker
        # row_max stores max(V[i][0...j-1])
        current_row[0] = u
        row_max = u

        char1 = s1[i-1]

        for j in range(1, n + 1):
            char2 = s2[j-1]

            # Match/Mismatch (s(a,a)=1, s(a,b)=-1)
            score_char = 1 if char1 == char2 else -1
            score_diag = prev_row[j-1] + score_char

            # Insertion (Horizontal gap)
            # max(V[i][k]) + u  => row_max + u
            score_ins = row_max + u

            # Deletion (Vertical gap)
            # max(V[k][j]) + u => col_max[j] + u
            score_del = col_max[j] + u

            # Maximise
            val = max(score_diag, score_ins, score_del)
            current_row[j] = val

            # Update running maximums
            if val > row_max:
                row_max = val
            if val > col_max[j]:
                col_max[j] = val

        # Move current row to previous for next iteration
        prev_row = list(current_row)

    return current_row[n]

def run_simulation():
    '''
    Monte Carlo simulation.
    '''
    # Parameters
    u = -3
    # Lengths n to test
    n_values = [50, 100, 200, 500, 1000, 2000]

    print(f"{'n':<10} | {'Trials':<10} | {'Avg Score':<12} | {'Estimate (Score/n)':<20}")
    print("-" * 60)

    results = []

    for n in n_values:
        # As n increases, the variance decreases so we can reduce the trials
        trials = 100 if n < 1000 else 20

        total_score = 0

        for _ in range(trials):
            # Generate random strings (p=0.5 for 'a' and 'b')
            s1 = "".join(random.choices(['a', 'b'], k=n))
            s2 = "".join(random.choices(['a', 'b'], k=n))

            score = calculate_gap_score(s1, s2, u)
            total_score += score

        avg_score = total_score / trials
        estimate = avg_score / n
        results.append(estimate)

        print(f"{n:<10} | {trials:<10} | {avg_score:<12.2f} | {estimate:<20.5f}")

    print("-" * 60)
    print(f"Estimated Limit: {results[-1]:.5f}")

run_simulation()

n          | Trials     | Avg Score    | Estimate (Score/n)  
------------------------------------------------------------
50         | 100        | 16.01        | 0.32020             
100        | 100        | 36.79        | 0.36790             
200        | 100        | 80.16        | 0.40080             
500        | 100        | 210.24       | 0.42048             
1000       | 20         | 434.25       | 0.43425             
2000       | 20         | 869.80       | 0.43490             
------------------------------------------------------------
Estimated Limit: 0.43490


To estimate the limit $\lim_{n\to\infty} \frac{1}{n} \mathbb{E}[v_{\text{gap}}(U^n, V^n)]$, we rely LLN and the Kingman's subadditive ergodic theorem. As $n$ increases, the estimate will converge to the true limit.