# Sequence Alignment

In sequence alignment we arrange DNA, RNA or protein sequences so as to reflect how they are related to each other. From an alignment we can infer evolutionary relationships between sequences and uncover sequences that are under selective constraint.

# Substitution process

Our substitution model gives the matrix $R$ of rates, where $R_{\alpha \beta}$ is the rate of substitution of base $\beta$ by base $\alpha$. For a finite amount of time $t$, the substitution probabilities $P(\alpha | \beta, t)$ are given by

$$
\frac{\partial P(\alpha | \beta, t)}{\partial t} = \sum_{\gamma} R_{\alpha \gamma} P(\gamma | \beta, t)  \Rightarrow P(\alpha | \beta, t) = (e^{Rt})_{\alpha \beta}
$$

In the limit of a long time we reach a limit distribution $\lim_{t \rightarrow \infty} P(\alpha | \beta, t) = \pi_{\alpha}$
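
As a minimal sketch of the relation $P(\alpha | \beta, t) = (e^{Rt})_{\alpha \beta}$, the snippet below approximates the matrix exponential in pure Python (a Taylor series on a scaled-down matrix, followed by repeated squaring). The rate matrix is a hypothetical example with $R_{\alpha \beta} = u \pi_{\alpha}$ for $\alpha \neq \beta$; the values of `PI` and `U` and all helper names are assumptions made for illustration only.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(R, t, squarings=10, terms=20):
    """Approximate exp(R * t): Taylor series on R * t / 2**squarings, then square."""
    n = len(R)
    scale = t / 2 ** squarings
    A = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds A**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical limit frequencies and overall rate (illustrative values only).
PI = [0.1, 0.2, 0.3, 0.4]
U = 1.0
# R[a][b] is the rate of replacing base b by base a; each column sums to zero.
R = [[U * PI[a] if a != b else -U * (1.0 - PI[a]) for b in range(4)]
     for a in range(4)]

P_short = mat_exp(R, 0.1)   # short time: still dominated by the diagonal
P_long = mat_exp(R, 50.0)   # long time: every column approaches PI
```

With this convention `P[a][b]` plays the role of $P(\alpha | \beta, t)$, so each column of the result sums to one, and for large $t$ every column approaches the limit distribution.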

Most substitution rate matrices are reversible, meaning that

$$
    P(\alpha | \beta, t) \pi_{\beta} = P(\beta | \alpha, t) \pi_{\alpha}
$$

For the reversible model, we get

$$
\sum_{\beta} P(\alpha | \beta, t_1) P(\gamma | \beta, t_2) \pi_{\beta} = P(\gamma | \alpha, t_1 + t_2) \pi_{\alpha}
$$
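
The two identities above can be checked numerically. The sketch below uses a simple reversible model with a closed-form solution, $P(\alpha | \beta, t) = e^{-ut} \delta_{\alpha \beta} + (1 - e^{-ut}) \pi_{\alpha}$ (the F81 form); the frequencies in `PI`, the rate `u` and the function names are assumptions chosen only for this illustration.

```python
import math

# Hypothetical limit frequencies (illustrative values only).
PI = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}


def p_f81(a, b, t, u=1.0):
    """P(a | b, t): keep the base with prob. exp(-u t), otherwise redraw from PI."""
    decay = math.exp(-u * t)
    return decay * (a == b) + (1.0 - decay) * PI[a]


def detailed_balance_gap(t):
    """Largest violation of P(a|b,t) pi_b = P(b|a,t) pi_a over all base pairs."""
    return max(abs(p_f81(a, b, t) * PI[b] - p_f81(b, a, t) * PI[a])
               for a in PI for b in PI)


def composition_gap(t1, t2):
    """Largest violation of the identity that sums over the common ancestor."""
    worst = 0.0
    for a in PI:
        for g in PI:
            lhs = sum(p_f81(a, b, t1) * p_f81(g, b, t2) * PI[b] for b in PI)
            rhs = p_f81(g, a, t1 + t2) * PI[a]
            worst = max(worst, abs(lhs - rhs))
    return worst
```

Both gaps should vanish up to rounding error. The second identity is what allows the unknown common ancestor of two sequences to be summed out, so that only the total time $t_1 + t_2$ separating them matters.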

# Insertion and deletion

Assume that in a very short time interval $dt$ three types of events can happen:
1. A base is mutated (with probability $m\,dt$ per base)
2. A new base is inserted (with probability $\lambda\,dt$ after each base)
3. A base is deleted (with probability $\mu\,dt$ per base)

Let $p_{n}(t)$ be the probability that, through the process of insertion and deletion over a time $t$, a single node survives and leaves $n$ descendants (including itself). It obeys the differential equation

$$
\frac{ d p_{n}}{dt} = \underbrace{\lambda (n - 1) p_{n - 1}}_{\text{Prob. to gain a descendant (going from n - 1 to n)}} + \underbrace{\mu n p_{n+1}}_{\text{Prob. to lose descendant (going from n + 1 to n)}} - \underbrace{(\lambda + \mu) n p_n}_{\text{Prob. of going from n to n + 1 or n - 1}}
$$

Which can be solved to give

$$
p_n = \alpha \beta^{n-1}(1 - \beta)
$$

with $\alpha = e^{- \mu t}$ and $\beta = \frac{\lambda - \lambda e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}}$.

![](pnprob.png)
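
A minimal numerical sketch of the solution for $p_n$ follows. It assumes $\lambda \neq \mu$ (the formula for $\beta$ is otherwise of the form $0/0$) and uses hypothetical function names. The helper `ode_residual` compares a central finite-difference estimate of $dp_n/dt$ with the right-hand side of the differential equation.

```python
import math


def alpha(mu, t):
    """Probability that the ancestral node itself survives for time t."""
    return math.exp(-mu * t)


def beta(lam, mu, t):
    """Parameter beta of the text; assumes lam != mu."""
    growth = math.exp((lam - mu) * t)
    return (lam - lam * growth) / (mu - lam * growth)


def p_n(n, lam, mu, t):
    """Probability that the node survives and has n descendants (n >= 1)."""
    b = beta(lam, mu, t)
    return alpha(mu, t) * b ** (n - 1) * (1.0 - b)


def ode_residual(n, lam, mu, t, h=1e-5):
    """|dp_n/dt - right-hand side|, with the derivative taken numerically."""
    lhs = (p_n(n, lam, mu, t + h) - p_n(n, lam, mu, t - h)) / (2.0 * h)
    gain = lam * (n - 1) * p_n(n - 1, lam, mu, t) if n > 1 else 0.0
    rhs = (gain + mu * n * p_n(n + 1, lam, mu, t)
           - (lam + mu) * n * p_n(n, lam, mu, t))
    return abs(lhs - rhs)
```

Summing $p_n$ over all $n \geq 1$ gives $\alpha$, the total probability that the node survives, which is a quick sanity check on the closed form.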

Let $q_n(t)$ be the probability that, through the process of insertion and deletion over a time $t$, a single node disappears and leaves $n$ extra nodes after it.

For $n > 0$ the probability $q_n(t)$ obeys the differential equation

$$
\frac{dq_n}{dt} = \underbrace{\lambda(n-1)q_{n-1}}_{\text{Prob. of gaining a child (going from n - 1 to n)}} + \underbrace{\mu(n+1)q_{n+1}}_{\text{Prob. of a child dying (going from n + 1 to n)}} - \underbrace{(\lambda + \mu)nq_n}_{\text{Prob. of going from n to n + 1 or n - 1}} + \underbrace{\mu p_{n+1}}_{\text{Prob. of the surviving node dying}}
$$

For $n = 0$ we have $\frac{dq_0}{dt} = \mu(q_1 + p_1)$.

With $\gamma = 1 - \frac{\mu (1 - e^{(\lambda - \mu)t})}{(1 - e^{- \mu t})(\mu - \lambda e^{(\lambda - \mu)t})}$ the solution is given as

$$
q_n =
\begin{cases}
(1 - \alpha)(1 - \gamma) \quad \text{for } n = 0 \\
(1 - \alpha)\gamma\beta^{n-1}(1-\beta) \quad \text{for } n > 0
\end{cases}
$$

![](qnprob.png)
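
The closed forms for $q_n$ can be sketched in the same way. The function names below are hypothetical, and the code assumes $\lambda \neq \mu$ and $t > 0$ so that none of the denominators vanish. Since the ancestral node either survives or disappears, $\sum_{n \geq 1} p_n + \sum_{n \geq 0} q_n$ should equal one.

```python
import math


def tkf_parameters(lam, mu, t):
    """Return (alpha, beta, gamma) as defined in the text; lam != mu, t > 0."""
    growth = math.exp((lam - mu) * t)
    a = math.exp(-mu * t)
    b = (lam - lam * growth) / (mu - lam * growth)
    g = 1.0 - mu * (1.0 - growth) / ((1.0 - a) * (mu - lam * growth))
    return a, b, g


def p_n(n, lam, mu, t):
    """Node survives and has n descendants including itself (n >= 1)."""
    a, b, _ = tkf_parameters(lam, mu, t)
    return a * b ** (n - 1) * (1.0 - b)


def q_n(n, lam, mu, t):
    """Node disappears and leaves n extra nodes (n >= 0)."""
    a, b, g = tkf_parameters(lam, mu, t)
    if n == 0:
        return (1.0 - a) * (1.0 - g)
    return (1.0 - a) * g * b ** (n - 1) * (1.0 - b)


def dq0_residual(lam, mu, t, h=1e-5):
    """|dq_0/dt - mu (q_1 + p_1)|, with the derivative taken numerically."""
    lhs = (q_n(0, lam, mu, t + h) - q_n(0, lam, mu, t - h)) / (2.0 * h)
    rhs = mu * (q_n(1, lam, mu, t) + p_n(1, lam, mu, t))
    return abs(lhs - rhs)
```

The residual check for $n = 0$ illustrates how the $p$ and $q$ processes are coupled: probability flows into $q_0$ both when the last extra node dies and when a surviving, childless ancestral node dies.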

Let $r_n(t)$ be the probability that, through the process of insertion and deletion over a time $t$, the immortal link at the start of the sequence leaves $n$ nodes. For $n > 0$ the probability $r_n(t)$ obeys the differential equation

$$
\frac{dr_n}{dt} = \underbrace{\lambda n r_{n-1}}_{\text{Prob. of gaining a child node (going from n - 1 to n)}} + \underbrace{\mu (n+1)r_{n+1}}_{\text{Prob. of losing a child node (going from n + 1 to n)}} - \underbrace{\lambda (n+1)r_n - \mu n r_n}_{\text{Prob. of going from n to n + 1 or n - 1}}
$$

And for $n = 0$ we have $\frac{dr_0}{dt} = \mu r_1 - \lambda r_0$.

The solution is given by $r_n = \beta^n(1 - \beta)$

![](rnprob.png)
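
Because $r_n$ is a geometric distribution in $\beta$, its mean is $\beta / (1 - \beta)$. The sketch below (hypothetical function names, assuming $\lambda < \mu$) evaluates $r_n$ and this mean; for long times $\beta$ tends to $\lambda / \mu$, so the expected number of nodes tends to $\lambda / (\mu - \lambda)$.

```python
import math


def beta(lam, mu, t):
    """Parameter beta of the text; assumes lam != mu."""
    growth = math.exp((lam - mu) * t)
    return (lam - lam * growth) / (mu - lam * growth)


def r_n(n, lam, mu, t):
    """Probability that the immortal link leaves n nodes after time t."""
    b = beta(lam, mu, t)
    return b ** n * (1.0 - b)


def expected_nodes(lam, mu, t):
    """Mean of the geometric distribution r_n."""
    b = beta(lam, mu, t)
    return b / (1.0 - b)


def equilibrium_nodes(lam, mu):
    """Long-time limit of expected_nodes; only finite for lam < mu."""
    return lam / (mu - lam)
```

This makes explicit why the ratio $\lambda / \mu$ sets the equilibrium sequence length: as it approaches one, the expected number of nodes grows without bound.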

- $\alpha$ can be seen as the probability that the ancestral residue survives
- $\beta$ can be seen as the probability of insertion given that the ancestral node survives
- $\gamma$ can be seen as the probability of insertions given that the ancestral node disappears

These formulas can be represented using the following HMM

![](pair_alignment_hmm.png)

The alignment model and the HMM differ in their result because the HMM explicitly models the length of the alignment. The HMM can be further collapsed to give the collapsed pair-HMM

![](collapsed_pair_hmm.png)

All transition probabilities depend only on the two parameters $\lambda$ and $\mu$, and the time $t$. The ratio $\frac{\lambda}{\mu}$ controls the expected length of the sequence and the absolute value $\mu t$ the amount of insertion / deletion.
In a more general model we can introduce more parameters to independently control
- The number of insertions / deletions
- The average length of insertions/deletions
- The total sequence length

# Viterbi Algorithm

This pair-HMM can be used to represent sequence alignments.

Given the transition probabilities
- $\delta$: open an insertion / deletion block
- $\epsilon$: extend an insertion / deletion block
- $\tau$: terminate the alignment

**Initialization**
- $\nu^M(0,0) = 1$ assuming starting from match state
- $\nu^*(i,0) = \nu^*(0,j) = 0$ for all other $\nu^*(i,j)$

**Recursion**
- $\nu^M(i,j) = p_{x_i,y_j} \max \begin{cases} (1-2\delta - \tau) \nu^M (i-1,j-1) \\ (1 - \epsilon - \tau) \nu^X (i-1,j-1) \\ (1 - \epsilon - \tau)\nu^Y (i-1,j-1) \end{cases}$
- $\nu^X(i,j) = q_{x_i} \max \begin{cases} \delta \nu^M (i-1,j) \\ \epsilon \nu^X(i-1,j) \end{cases}$
- $\nu^Y(i,j) = q_{y_j} \max \begin{cases} \delta \nu^M(i, j - 1) \\ \epsilon \nu^Y(i,j-1) \end{cases}$

**Termination**
- $\nu^E = \tau \max(\nu^M(n,m), \nu^X(n,m), \nu^Y(n,m))$

The background model is the model where the sequences are emitted independently of each other.

![](background_log_odds.png)

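Under the background model $R$, the probability of the two sequences is then given as (here $\mu$ denotes the termination probability of the background model, not the deletion rate)

$$
P(x,y | R) = (1 - \mu)^n \mu \prod_{i=1}^n q_{x_i} (1-\mu)^m \mu \prod_{j=1}^m q_{y_j} = \mu^2 (1-\mu)^{m+n} \prod_{i=1}^n q_{x_i} \prod_{j=1}^m q_{y_j}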
$$

The most probable path, scored by its odds ratio against the background model, is again found by a recursion

**Recursion**
- $\nu^X(i,j) = \frac{q_{x_i}}{q_{x_i}} \max \begin{cases} \frac{\delta}{1 - \mu} \nu^M (i-1, j) \\ \frac{\epsilon}{1 - \mu} \nu^X (i-1, j) \end{cases}$
- $\nu^M(i,j) = \frac{p_{x_i, y_j}}{q_{x_i} q_{y_j}} \max \begin{cases} \frac{1 - 2 \delta - \tau}{(1 - \mu)^2} \nu^M(i-1,j-1) \\ \frac{1 -\epsilon - \tau}{(1-\mu)^2} \nu^X(i-1,j-1) \\ \frac{1-\epsilon - \tau}{(1-\mu)^2} \nu^Y(i-1,j-1) \end{cases}$
- $\nu^Y(i,j) = \frac{q_{y_j}}{q_{y_j}} \max \begin{cases} \frac{\delta}{1 - \mu} \nu^M (i, j-1) \\ \frac{\epsilon}{1 - \mu} \nu^Y (i, j-1) \end{cases}$

A more common formulation of this recursion is

- $V^M(i,j) = s(x_i, y_j) + \max \begin{cases} V^M(i-1,j-1) \\ V^X(i-1,j-1) \\ V^Y(i-1,j-1) \end{cases}$
- $V^X(i,j) = \max \begin{cases} V^M(i-1,j) - d \\ V^X(i-1,j) - e \end{cases}$
- $V^Y(i,j) = \max \begin{cases} V^M(i,j-1) - d \\ V^Y(i,j-1) - e \end{cases}$

Here $s(x_i, y_j)$ is the score for a character-to-character alignment, $d$ is the gap opening penalty and $e$ is the gap extension penalty. These scores are built from the following likelihood ratios:

- Likelihood ratio for match after deletion / insertion $$ \frac{p_{x_i, y_j} (1 - \epsilon - \tau)}{ q_{x_i} q_{y_j} (1 - \mu)^2} $$
- Likelihood ratio for match after match $$ \frac{p_{x_i, y_j} (1 - 2 \delta - \tau)}{q_{x_i} q_{y_j} (1 - \mu)^2} $$
- Likelihood ratio for deletion / insertion after match $$ \frac{q_{x_i / y_j} \delta}{ q_{x_i / y_j} (1 - \mu)} = \frac{\delta}{1 - \mu} $$
- Likelihood ratio for deletion / insertion after deletion / insertion $$ \frac{q_{x_i / y_j} \epsilon}{ q_{x_i / y_j} (1 - \mu)} = \frac{\epsilon}{1 - \mu} $$
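
One way to turn these likelihood ratios into additive scores is sketched below. It follows the usual textbook convention, which is an assumption here and not stated in the text: the match-after-match ratio is absorbed into $s$, and the factor $\frac{1 - \epsilon - \tau}{1 - 2\delta - \tau}$ for leaving a gap state is moved into the gap opening penalty $d$. The toy two-letter tables `P_PAIR` and `Q_SINGLE` and all parameter values are hypothetical.

```python
import math


def log_odds_parameters(p, q, delta, eps, tau, mu):
    """Convert pair-HMM probabilities into additive scores (natural logarithm).

    p[(a, b)]: match emission probability, q[a]: background frequency,
    mu: termination probability of the background model.
    """
    match_ratio = (1 - 2 * delta - tau) / (1 - mu) ** 2
    s = {(a, b): math.log(p[a, b] / (q[a] * q[b])) + math.log(match_ratio)
         for (a, b) in p}
    # Opening a gap also pays the correction for later returning to a match.
    d = -math.log(delta * (1 - eps - tau) / ((1 - mu) * (1 - 2 * delta - tau)))
    e = -math.log(eps / (1 - mu))
    return s, d, e


# Hypothetical toy model on a two-letter alphabet.
P_PAIR = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
Q_SINGLE = {"A": 0.5, "B": 0.5}

S, D, E = log_odds_parameters(P_PAIR, Q_SINGLE, delta=0.1, eps=0.3, tau=0.05, mu=0.05)
```

With these toy values identical letters get a positive score, mismatches a negative one, and opening a gap costs more than extending it, which is the pattern expected of a sensible scoring scheme.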

**Initialization**
- $V^M(0,0) = - 2 \log(\mu)$
- $V^*(i,0) = V^*(0,j) = - \infty$ for all other $i, j$

**Recursion**
- $V^M(i,j) = s(x_i, y_j) + \max \begin{cases} V^M(i-1,j-1) \\ V^X(i-1,j-1) \\ V^Y(i-1,j-1) \end{cases}$
- $V^X(i,j) = \max \begin{cases} V^M(i-1,j) - d \\ V^X(i-1,j) - e \end{cases}$
- $V^Y(i,j) = \max \begin{cases} V^M(i,j-1) - d \\ V^Y(i,j-1) - e \end{cases}$

**Termination**
- $V = max(V^M(n,m), V^X(n,m) - c, V^Y(n,m) - c)$, with $c = \log(\frac{1 - \epsilon - \tau}{1 - 2 \delta - \tau})$
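
The log-odds algorithm above can be sketched as a three-table dynamic program. The version below follows the initialization literally (every boundary cell except $V^M(0,0)$ is $-\infty$) and returns only the final score, without a traceback. The names `viterbi_log_odds` and `unit_score` are hypothetical, and `start` defaults to 0 instead of $-2\log(\mu)$, which only shifts every score by a constant.

```python
NEG_INF = float("-inf")


def viterbi_log_odds(x, y, s, d, e, c=0.0, start=0.0):
    """Best log-odds alignment score of x and y with affine gap penalties.

    s(a, b): substitution score, d: gap opening penalty,
    e: gap extension penalty, c: termination correction of the text.
    """
    n, m = len(x), len(y)
    VM = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VX = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VY = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    VM[0][0] = start
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            VM[i][j] = s(x[i - 1], y[j - 1]) + max(
                VM[i - 1][j - 1], VX[i - 1][j - 1], VY[i - 1][j - 1])
            VX[i][j] = max(VM[i - 1][j] - d, VX[i - 1][j] - e)
            VY[i][j] = max(VM[i][j - 1] - d, VY[i][j - 1] - e)
    return max(VM[n][m], VX[n][m] - c, VY[n][m] - c)


def unit_score(a, b):
    """Hypothetical toy substitution score: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


score = viterbi_log_odds("ACGT", "ACT", unit_score, d=2, e=0.5)
```

Keeping three tables instead of one is what allows opening a gap ($d$) to cost more than extending it ($e$): the table a cell belongs to records whether the alignment currently sits inside a gap.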

# Deriving score parameters

The intuitive approach would be to compute character-character alignment, gap initiation and gap extension parameters from confirmed alignments. The difficulties with this approach are

1. Confirmed alignments are hard to come by $\rightarrow$ Use alignments of very closely related sequences, for which we can assume that a very small number of evolutionary changes occurred
2. The overall frequency of various events depends on the evolutionary distance $\rightarrow$ Use alignments generated from sequences that are separated by roughly the same evolutionary distance as the sequences that we will later want to align

# Dayhoff matrices

Proteins that in pairwise comparison did not differ by more than 15% were used to construct maximum parsimony phylogenetic trees and to infer the mutations that occurred along each tree. The number of substitutions from one amino acid $a$ to another $b$, $A_{ab}$, and the number of occurrences of each amino acid that could have undergone mutations (depending on the amino acid frequency and the number of mutations in each branch) were counted. Assuming reversibility, changes were counted symmetrically. The entries in the matrix were scaled so as to obtain 1 substitution in 100 amino acids. This 1 PAM matrix was defined as the substitution matrix that corresponds to an evolutionary time that yields an expected 1% of amino acids to undergo substitution.

For the matrix to correspond to 1 substitution in 100 amino acids, we have to have $\sum_{i=1}^{20} p_i \lambda m_i = \frac{1}{100}$ where $p_i$ is the frequency of amino acid $i$ and $m_i$ is its mutability.

From this we infer $\lambda$, and then the mutation probability matrix $M$, in which $M_{ii} = 1 - \lambda m_i$ is the probability of amino acid $i$ staying unchanged, and $M_{ij} = \frac{\lambda m_i A_{ij}}{\sum_k A_{ik}}$ (for $j \neq i$) is the probability of amino acid $i$ being substituted by $j$.

From the PAM 1 matrix (= $B$) we can obtain the PAM matrix corresponding to an arbitrary number $n$ of evolutionary units by computing $B^n$. Finally, scores for the PAM$_n$ matrix are derived as log-likelihood ratios, $s(a,b) = \log \frac{(B^n)_{ab}}{q_b}$, with $q_b$ being the limit frequency of amino acid $b$.
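
The construction can be sketched on a toy three-letter alphabet. All numbers below are hypothetical, the relative mutability is assumed to be the number of observed changes divided by the frequency, and the score is taken as $\log \frac{(B^n)_{ab}}{q_b}$, the usual log-odds form. Function and variable names are illustrative.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam1(freqs, mutability, counts):
    """1-PAM matrix: M[i][j] is the probability that amino acid i becomes j."""
    n = len(freqs)
    # Scale so that the expected fraction of substituted positions is 1%.
    lam = 0.01 / sum(p * m for p, m in zip(freqs, mutability))
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_total = sum(counts[i][k] for k in range(n) if k != i)
        for j in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * mutability[i]
            else:
                M[i][j] = lam * mutability[i] * counts[i][j] / row_total
    return M


def pam_n(M, n):
    """n-th matrix power of the 1-PAM matrix."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def pam_scores(Mn, limit_freqs):
    """Log-odds scores log(Mn[a][b] / q_b)."""
    size = len(Mn)
    return [[math.log(Mn[a][b] / limit_freqs[b]) for b in range(size)]
            for a in range(size)]


# Toy three-letter alphabet (hypothetical numbers, not real amino acid data).
FREQS = [0.5, 0.3, 0.2]
COUNTS = [[0, 6, 2], [6, 0, 4], [2, 4, 0]]   # symmetric substitution counts
# Assumed relative mutability: observed changes per unit of frequency.
MUTABILITY = [sum(row) / f for row, f in zip(COUNTS, FREQS)]

M1 = pam1(FREQS, MUTABILITY, COUNTS)
S250 = pam_scores(pam_n(M1, 250), FREQS)
```

With symmetric counts and this choice of mutabilities the chain is reversible with limit distribution `FREQS`, so the resulting score matrix is symmetric, as a substitution score should be.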

# BLOSUM matrices

The BLOSUM matrices are derived from ungapped alignment regions of proteins that have a higher degree of divergence. Proteins are initially clustered whenever their percentage of identical residues exceeds some level $L$%, and then only one representative is used per cluster. Frequencies $A_{ab}$, representing the number of times residue $a$ is paired with residue $b$, are calculated, taking into account cluster size. Then the probabilities of individual residues and of pairs of residues are calculated as:

$$
q_a = \frac{\sum_b A_{ab}}{\sum_{c,d} A_{cd}} \quad p_{ab} = \frac{A_{ab}}{\sum_{c,d} A_{cd}}
$$

and the score $s(a,b) = \log\left( \frac{p_{ab}}{q_a q_b} \right)$
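
As a small sketch of these formulas, the function below turns a table of pair counts into $q_a$, $p_{ab}$ and $s(a,b)$. The counts are hypothetical toy values on a two-letter alphabet and are stored symmetrically ($A_{ab} = A_{ba}$); the function name is illustrative, and the natural logarithm is used.

```python
import math


def blosum_scores(counts):
    """Return (q, p, s) from pair counts, where counts[(a, b)] = A_ab."""
    total = sum(counts.values())
    letters = sorted({a for a, _ in counts})
    q = {a: sum(counts.get((a, b), 0) for b in letters) / total for a in letters}
    p = {pair: value / total for pair, value in counts.items()}
    # Pairs that were never observed have no finite log-odds score.
    s = {(a, b): math.log(p[a, b] / (q[a] * q[b]))
         for (a, b) in p if p[a, b] > 0}
    return q, p, s


# Hypothetical pair counts for a two-letter alphabet.
COUNTS = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
Q, P, S = blosum_scores(COUNTS)
```

Pairs that occur more often than expected from the single-residue frequencies get a positive score, and pairs that occur less often get a negative one. Published BLOSUM tables are additionally rescaled and rounded to integers.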

We can calculate the best alignment through a dynamic programming table, where, for a linear gap penalty $d$, we choose the value at a given index in the table as

$$
V(i,j) = \max \begin{cases} V(i - 1, j) - d \\ V(i - 1, j - 1) + s(x_i, y_j) \\ V(i, j - 1) - d \end{cases}
$$

A gap is introduced in the upper sequence through the path $\leftarrow$, one in the lower with $\uparrow$ and a match with $\nwarrow$.

![](blosum_match.png)
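
A minimal sketch of this dynamic programming table with traceback is given below. Here `gap` is a positive penalty that is subtracted (equivalently, a negative gap score that is added). The boundary values $V(i,0) = -i \cdot \text{gap}$ and $V(0,j) = -j \cdot \text{gap}$, the move names and the toy scoring function are assumptions of this illustration.

```python
def needleman_wunsch(x, y, score, gap):
    """Global alignment with a linear gap penalty; returns (score, top, bottom)."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0], move[i][0] = -gap * i, "gap_in_y"
    for j in range(1, m + 1):
        V[0][j], move[0][j] = -gap * j, "gap_in_x"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (V[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "match"),
                (V[i - 1][j] - gap, "gap_in_y"),
                (V[i][j - 1] - gap, "gap_in_x"),
            ]
            # On ties the first option (a match) is preferred.
            V[i][j], move[i][j] = max(options, key=lambda option: option[0])
    # Follow the stored moves back from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif step == "gap_in_y":
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def unit_score(a, b):
    """Hypothetical toy score: +1 for identical characters, -1 otherwise."""
    return 1 if a == b else -1


result = needleman_wunsch("ACGT", "ACT", unit_score, gap=1)
```

Storing the chosen move in each cell avoids recomputing the three candidates during traceback, at the cost of a second table of the same size.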

# Summing over paths

**Initialization**
- $f^M(0,0) = 1, f^X(0,0) = 0, f^Y(0,0) = 0$
- $f^*(i,-1) = f^*(-1,j) = 0$ for all $i, j$

**Recursion**: $i = 0, ..., n$ and $j=0,...,m$ except $(0,0)$
- $f^M(i,j) = p_{x_i, y_j} [(1 - 2\delta - \tau)f^M(i-1,j-1) + (1 - \epsilon - \tau)(f^X(i-1,j-1) + f^Y(i-1,j-1))]$
- $f^X(i,j) = q_{x_i} [\delta f^M(i-1,j) + \epsilon f^X(i-1,j)]$
- $f^Y(i,j) = q_{y_j} [\delta f^M(i,j-1) + \epsilon f^Y(i,j-1)]$

**Termination**
- $f^E = \tau [f^M(n,m) + f^X(n,m) + f^Y(n,m)]$
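
The forward recursion can be sketched as follows. The code runs over all cells except $(0,0)$, as the recursion range states, and treats any cell with a negative index as zero, so alignments may begin with a gap. The emission tables `P_PAIR` and `Q_SINGLE`, the parameter values and the function name are hypothetical.

```python
def forward(x, y, p, q, delta, eps, tau):
    """Total probability of x and y, summed over all alignments (f^E)."""
    n, m = len(x), len(y)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p[x[i - 1], y[j - 1]] * (
                    (1 - 2 * delta - tau) * fM[i - 1][j - 1]
                    + (1 - eps - tau) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:  # x_i is emitted against a gap
                fX[i][j] = q[x[i - 1]] * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:  # y_j is emitted against a gap
                fY[i][j] = q[y[j - 1]] * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    return tau * (fM[n][m] + fX[n][m] + fY[n][m])


# Hypothetical toy model on a two-letter alphabet.
P_PAIR = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
Q_SINGLE = {"A": 0.5, "B": 0.5}

likelihood = forward("AB", "AB", P_PAIR, Q_SINGLE, delta=0.1, eps=0.3, tau=0.05)
```

The only change relative to the Viterbi recursion is that each maximum is replaced by a sum, so every cell accumulates the probability of all partial alignments ending there rather than only the best one.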


## Application

With this we can sample alignments. We trace back through the matrices $f^k(i,j)$ but, instead of following the highest scoring move, choose the move probabilistically. E.g. for a match state we have

$$
f^M(i,j) = p_{x_i, y_j} [(1 - 2\delta - \tau)f^M(i-1,j-1) + (1 - \epsilon - \tau)(f^X(i-1,j-1) + f^Y(i-1,j-1))]
$$

Then from $M(i,j)$ we step back to

- $M(i-1,j-1)$ with probability $\frac{p_{x_i, y_j} (1 - 2 \delta - \tau)f^M(i-1,j-1)}{f^M(i,j)}$
- $X(i-1,j-1)$ with probability $\frac{p_{x_i, y_j} (1 - \epsilon - \tau)f^X(i-1,j-1)}{f^M(i,j)}$
- $Y(i-1,j-1)$ with probability $\frac{p_{x_i, y_j} (1 - \epsilon - \tau)f^Y(i-1,j-1)}{f^M(i,j)}$
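
A single step of this probabilistic traceback can be sketched as below. The emission factor $p_{x_i, y_j}$ appears in all three numerators and in $f^M(i,j)$, so it cancels and only the three forward values at $(i-1, j-1)$ are needed. The function names are hypothetical, and `rng` may be the `random` module or a seeded `random.Random` instance.

```python
import random


def match_predecessor_probs(f_m, f_x, f_y, delta, eps, tau):
    """Probabilities of stepping back from M(i, j) to M, X or Y at (i-1, j-1).

    f_m, f_x, f_y are the forward values f^M, f^X, f^Y at (i-1, j-1).
    """
    weights = {
        "M": (1 - 2 * delta - tau) * f_m,
        "X": (1 - eps - tau) * f_x,
        "Y": (1 - eps - tau) * f_y,
    }
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


def sample_match_predecessor(f_m, f_x, f_y, delta, eps, tau, rng=random):
    """Draw the previous state according to the probabilities above."""
    probs = match_predecessor_probs(f_m, f_x, f_y, delta, eps, tau)
    states = list(probs)
    return rng.choices(states, weights=[probs[s] for s in states])[0]
```

Repeating such a step from the end state back to $(0,0)$, with analogous rules for the $X$ and $Y$ states, yields one alignment drawn from the posterior distribution over alignments.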


The posterior distribution over alignments $\pi$ given sequences $x$ and $y$ is defined as

$$
P(\pi | x,y) = \frac{P(x,y,\pi)}{P(x,y)}
$$

We can compute the posterior probability of specific characters being aligned with each other by summing over all alignments that share this character-to-character alignment and dividing by the sum over all alignments without any constraint.

#  Local alignment (Smith-Waterman alignment)

**Initialization**
$$
V(i, 0) = V(0,j) = 0 \quad \forall i,j
$$

**Iterations**
$$
V(i,j) = \max \begin{cases} 0 \\ V(i-1,j) + \sigma(x_i, -) \\ V(i,j-1) + \sigma(-,y_j) \\ V(i-1,j-1) + \sigma(x_i, y_j) \end{cases}
$$

Starting from the maximum value in the table and tracing back until we reach a cell with $V(i,j) = 0$ gives us the best local alignment. Since every cell is bounded below by 0, the table never contains negative values.

![](local_alignment.png)
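
A minimal sketch of the local alignment recursion and its traceback follows. It assumes a single positive gap penalty `gap`, i.e. $\sigma(x_i, -) = \sigma(-, y_j) = -\text{gap}$, and a hypothetical toy scoring function; the move names are illustrative.

```python
def smith_waterman(x, y, score, gap):
    """Best local alignment; returns (score, aligned part of x, aligned part of y)."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    move = [["stop"] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (0, "stop"),
                (V[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "match"),
                (V[i - 1][j] - gap, "gap_in_y"),
                (V[i][j - 1] - gap, "gap_in_x"),
            ]
            V[i][j], move[i][j] = max(options, key=lambda option: option[0])
            if V[i][j] > best:
                best, best_cell = V[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    top, bottom = [], []
    i, j = best_cell
    while move[i][j] != "stop":
        step = move[i][j]
        if step == "match":
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif step == "gap_in_y":
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def toy_score(a, b):
    """Hypothetical toy score: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


result = smith_waterman("TTACGTAA", "GGACGTCC", toy_score, gap=2)
```

The extra option 0 in the maximum is what makes the alignment local: a poorly scoring prefix is discarded instead of being carried along, so the alignment can start anywhere in both sequences.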