Full alignment of proteins is meaningful when the two strings are members of the same family. For example, the full sequences of the oxygen-binding proteins myoglobin and haemoglobin are very similar. Often, though, only a small region of the protein is critical to its function and only this region will be conserved throughout the evolutionary process. When we identify two proteins which perform similar functions but look superficially different, it is useful to identify these highly conserved regions.

We aim to find a pair of substrings $S'$ and $T'$ of $S$ and $T$ respectively with the highest alignment score, namely,
\begin{equation}
    v_{\text{sub}}(S, T) = \max\{v(S', T') : S' \text{ is a substring of } S,\, T' \text{ is a substring of } T\}.
\end{equation}
We write $s(-, a) = s(a, -) < 0$ for the score of an insertion or deletion.

Finding $v_{\text{sub}}(S, T)$ seems to be of much higher complexity than solving the global alignment problem, as there are $\Theta(n^2m^2)$ combinations of substrings of $S$ and $T$. In fact, we will solve it using an algorithm whose complexity is still only $O(mn)$.

We will first define a slightly easier problem. Suppose we restrict ourselves to suffixes of $S$ and $T$
\begin{equation}
    v_{\text{sfx}}(S, T) = \max\{v(S', T') : S' \text{ is a suffix of } S,\, T' \text{ is a suffix of } T\}.
\end{equation}


---

A string $S'$ is a substring of $S$ if and only if it is a suffix of some prefix of $S$. Denote the set of prefixes of $S$ of length $i$ as $P_i$ and the set of prefixes of $T$ of length $j$ as $Q_j$. Substituting the set union definition into the local alignment score yields
\begin{equation}
    v_{\text{sub}}(S, T) = \max_{0 \leq i \leq m, \ 0 \leq j \leq n} \max \{ v(S', T') : S' \text{ is a suffix of } P_i,\, T' \text{ is a suffix of } Q_j \}.
\end{equation}
The inner maximisation term is exactly the definition of the suffix alignment score $v_{\text{sfx}}$ for the prefixes $S[1, i]$ and $T[1, j]$:
\begin{equation}
    v_{\text{sfx}}(S[1, i], T[1, j]) = \max \{ v(S', T') : S' \text{ is a suffix of } S[1, i],\, T' \text{ is a suffix of } T[1, j] \}.
\end{equation}
Therefore, by substituting back into the original expression.
\begin{align}
    v_{\text{sub}}(S, T)
    &= \max_{0 \leq i \leq m, \ 0 \leq j \leq n} \{ v_{\text{sfx}}(S[1, i], T[1, j]) \} \\
    &= \max \{ v_{\text{sfx}}(S', T') : S' \text{ is a prefix of } S,\, T' \text{ is a prefix of } T \}.
\end{align}

---

Write $V_{\text{sfx}}(i, j)$ for $v_{\text{sfx}}(S[1, i], T[1, j])$ which denotes the maximum score of aligning a suffix of $S[1, i]$ with a suffix of $T[1, j]$. Let the optimal pair of suffixes be $S^*$ and $T^*$. There are two possibilities for the optimal suffixes:

1.  Null alignment $S^* = \epsilon$ and $T^* = \epsilon$. The empty string is a suffix of every string. The alignment of two empty strings has a score of $0$. If aligning any non-empty suffixes results in a negative score, then the optimal choice is to choose the empty suffixes.

2.  Non-Empty alignment. If the optimal alignment is not empty, then at least one of $S^*$ or $T^*$ is non-empty. We consider the last column of the optimal alignment of $S^*$ and $T^*$. There are three possible configurations for this last column, corresponding to the standard edit operations:

    *   Match/Mismatch, $S_i$ aligned with $T_j$. The last characters of the suffixes are $S_i$ and $T_j$. The remaining parts of the suffixes (removing the last characters) must be optimal suffixes of $S[1, i-1]$ and $T[1, j-1]$. Thus, the score is $V_{\text{sfx}}(i - 1, j - 1) + s(S_i, T_j)$.

    *   Deletion, $S_i$ aligned with a gap $-$. The suffix $S^\*$ ends with $S_i$, but $T^\*$ does not end with $T_j$ relative to the alignment column. The score is the optimal score of a suffix of $S[1, i-1]$ aligned with a suffix of $T[1, j]$, plus the penalty for a gap, $V_{\text{sfx}}(i - 1, j) + s(S_i, -)$.

    *   Insertion, gap $-$ aligned with $T_j$. The suffix $T^\*$ ends with $T_j$, but $S^\*$ does not end with $S_i$. The score is the optimal score of a suffix of $S[1, i]$ aligned with a suffix of $T[1, j-1]$, plus the penalty for a gap, $V_{\text{sfx}}(i, j - 1) + s(-, T_j)$.

Since $V_{\text{sfx}}(i, j)$ must be the maximum of all possible valid alignments, it is given by
\begin{equation}
    V_{\text{sfx}}(i, j) = \max
    \begin{cases}
        0, \\
        V_{\text{sfx}}(i - 1, j - 1) + s(S_i, T_j), \\
        V_{\text{sfx}}(i - 1, j) + s(S_i, -), \\
        V_{\text{sfx}}(i, j - 1) + s(-, T_j).
    \end{cases}
\end{equation}
For $V_{\text{sfx}}(i, 0)$, we are aligning a suffix of $S[1, i]$ with a suffix of an empty string, which must be $\epsilon$. Since gap penalties $s(x, -)$ are negative, any non-empty suffix of $S[1, i]$ aligned with $\epsilon$ would have a negative score. The definition allows us to choose the empty suffix $\epsilon$ for $S$ as well. Thus, $V_{\text{sfx}}(i, 0) = 0$. Similarly, $V_{\text{sfx}}(0, j) = 0$.

In [1]:
# @title Protein Data

Protein_C = """
MTSDCSSTHCSPESCGTASGCAPASSCSVETACLPGTCATSRCQTPSFLSRSRGLTGCLL
PCYFTGSCNSPCLVGNCAWCEDGVFTSNEKETMQFLNDRLASYLEKVRSLEETNAELESR
IQEQCEQDIPMVCPDYQRYFNTIEDLQQKILCTKAENSRLAVQLDNCKLATDDFKSKYES
ELSLRQLLEADISSLHGILEELTLCKSDLEAHVESLKEDLLCLKKNHEEEVNLLREQLGD
RLSVELDTAPTLDLNRVLDEMRCQCETVLANNRREAEEWLAVQTEELNQQQLSSAEQLQG
CQMEILELKRTASALEIELQAQQSLTESLECTVAETEAQYSSQLAQIQCLIDNLENQLAE
IRCDLERQNQEYQVLLDVKARLEGEINTYWGLLDSEDSRLSCSPCSTTCTSSNTCEPCSA
YVICTVENCCL
"""

Protein_D = """
MPYNFCLPSLSCRTSCSSRPCVPPSCHSCTLPGACNIPANVSNCNWFCEGSFNGSEKETM
QFLNDRLASYLEKVRQLERDNAELENLIRERSQQQEPLLCPSYQSYFKTIEELQQKILCT
KSENARLVVQIDNAKLAADDFRTKYQTELSLRQLVESDINGLRRILDELTLCKSDLEAQV
ESLKEELLCLKSNHEQEVNTLRCQLGDRLNVEVDAAPTVDLNRVLNETRSQYEALVETNR
REVEQWFTTQTEELNKQVVSSSEQLQSYQAEIIELRRTVNALEIELQAQHNLRDSLENTL
TESEARYSSQLSQVQSLITNVESQLAEIRSDLERQNQEYQVLLDVRARLECEINTYRSLL
ESEDCNLPSNPCATTNACSKPIGPCLSNPCTSCVPPAPCTPCAPRPRCGPCNSFVR
"""

In [2]:
# @title BLOSUM Matrix
blosum_text = """
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T -1  1  5 -1  0 -2  0 -1 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -2
P -3 -1 -1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A  0  1  0 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G -3  0 -2 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N -3  1  0 -2 -2  0  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D -3  0 -1 -1 -2 -1  1  6  2  0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4  0 -1 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q -3  0 -1 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H -3 -1 -2 -2 -2 -2  1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
"""

In [4]:
def solve_local_alignment(seq_C, seq_D):
    # Gap penalty
    gap = -2
    blosum_map = {}
    lines = blosum_text.strip().split('\n')
    aa = lines[0].split()
    for line in lines[1:]:
        parts = line.split()
        row_char = parts[0]
        scores = [int(x) for x in parts[1:]]
        for idx, score in enumerate(scores):
            blosum_map[(row_char, aa[idx])] = score

    # --- Calculation (Smith-Waterman) ---
    m, n = len(seq_C), len(seq_D)

    # We only need to store the max score found so far
    max_score = 0

    # We can optimize space to O(n) since we don't need the traceback for the score
    # current_row[j] corresponds to V(i, j)
    # prev_row[j] corresponds to V(i-1, j)

    prev_row = [0] * (n + 1)
    current_row = [0] * (n + 1)

    for i in range(1, m + 1):
        for j in range(1, n + 1):

            # Diagonal: Match/Mismatch
            # Use .get() with negative default for safety, though chars should be standard
            s_char = blosum_map.get((seq_C[i-1], seq_D[j-1]), -4)
            score_diag = prev_row[j-1] + s_char

            # Up: Deletion
            score_up = prev_row[j] + gap

            # Left: Insertion
            score_left = current_row[j-1] + gap

            # Local Alignment Rule: Max with 0
            val = max(0, score_diag, score_up, score_left)

            current_row[j] = val

            if val > max_score:
                max_score = val

        # Move current to prev
        prev_row = list(current_row)

    return max_score

result = solve_local_alignment(Protein_C, Protein_D)
print(f"Local Alignment Score v_sub(C, D): {result}")

Local Alignment Score v_sub(C, D): 1286
