# Markov Chains

A random process is a process whose trajectory is not precisely determined; instead, it is described in terms of probability distributions.

A discrete random process is a random process in which the system moves between discrete states. The parameters governing these moves are called transition probabilities.

Markov property: the probability distribution over states at the next time step depends only on the current state of the system:

$$
\Pr(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \Pr(X_{n+1} = x \mid X_n = x_n)
$$

A Markov chain is a discrete random process with the Markov property. More generally, in an $m$-th order Markov chain the next state depends on the last $m$ states.
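As a small illustration of the Markov property, the sketch below samples a nucleotide sequence from a first-order chain. The transition table `TRANSITIONS` and the function name `sample_chain` are made up for this example and are not estimated from any data.

```python
import random

BASES = ["A", "C", "G", "T"]

# Hypothetical transition table: a base repeats with probability 0.4 and
# switches to each of the other three bases with probability 0.2.
TRANSITIONS = {a: {b: (0.4 if a == b else 0.2) for b in BASES} for a in BASES}

def sample_chain(start, n_steps, rng):
    """Sample a trajectory; each step depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        row = TRANSITIONS[path[-1]]  # only the last state is consulted
        weights = [row[b] for b in BASES]
        path.append(rng.choices(BASES, weights=weights, k=1)[0])
    return "".join(path)

sequence = sample_chain("A", 20, random.Random(0))
print(sequence)
```

Note that the sampler never looks further back than `path[-1]`, which is exactly what the Markov property states.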

# Estimating nucleotide probabilities

Given a genome, let $(\pi_A, \pi_C, \pi_G, \pi_T)$ be the probabilities of drawing each of the four nucleotides. To estimate these probabilities, we write down the likelihood of the sequence under the model and find the parameters that maximize it. We start with a 0-order Markov chain, in which every position is drawn independently of the others. The probability of a sequence $s$ is then

$$
P(s | \vec \pi) = \pi_A^{n_A} \pi_C^{n_C} \pi_G^{n_G} \pi_T^{n_T}
$$

where $n_i$ is the number of occurrences of base $i$ in $s$. The log-likelihood is then

$$
L(s | \vec \pi) = n_A\log(\pi_A) + n_C\log(\pi_C) + n_G\log(\pi_G) + n_T\log(\pi_T)
$$

which we maximize under the constraint $\pi_A + \pi_C + \pi_G + \pi_T = 1$. This is done with a Lagrange multiplier: to maximize a function $f(x)$ subject to the constraint $g(x) = c$, we define $\Lambda(x, \lambda) = f(x) + \lambda(g(x) - c)$ and set the partial derivatives of $\Lambda$ to zero.

\begin{align*}
    \frac{\partial \Lambda(\vec \pi, \lambda) }{\partial \pi_{\alpha}}
    &= 
    \frac{\partial }{\partial \pi_{\alpha}} ( n_A\log(\pi_A) + n_C\log(\pi_C) + n_G\log(\pi_G) + n_T\log(\pi_T) + \lambda (\pi_A + \pi_C + \pi_G + \pi_T - 1)) \\
    &=
    \frac{n_{\alpha}}{\pi_{\alpha}} + \lambda \\
    &= 0 \\
    \Rightarrow - \lambda &= \frac{n_{\alpha}}{\pi_{\alpha}}
\end{align*}

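Solving for $\pi_{\alpha} = - \frac{n_{\alpha}}{\lambda}$ and inserting this into the constraint gives

$$
\sum_{\alpha} \pi_{\alpha} = \sum_{\alpha} - \frac{n_{\alpha}}{\lambda} = 1 \Rightarrow n_A + n_C + n_G + n_T = - \lambda
$$

With $n = n_A + n_C + n_G + n_T$ the total sequence length, we get $\lambda = -n$, so the maximum likelihood solution is $\pi_{\alpha} = \frac{n_{\alpha}}{n}$.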
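The result $\pi_{\alpha} = n_{\alpha} / n$ amounts to counting bases. A minimal sketch, assuming the sequence contains only the letters A, C, G and T (the function name and the example sequence are illustrative):

```python
from collections import Counter

def estimate_base_frequencies(sequence):
    """Return the maximum likelihood estimate pi_alpha = n_alpha / n."""
    counts = Counter(sequence)
    n = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / n for base in "ACGT"}

# Toy sequence with n_A = 2, n_C = 4, n_G = 3, n_T = 1.
pi = estimate_base_frequencies("ACGTACGGCC")
print(pi)
```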


We now turn to a first-order Markov chain, in which the nucleotide at position $i$ depends only on the nucleotide at position $i - 1$. The probability of a sequence $x$ of length $L$ is then

$$
P(x) = P(x_L \mid x_{L - 1}) P(x_{L - 1} \mid x_{L - 2}) \cdots P(x_2 \mid x_1) P(x_1)
$$

As for the single-nucleotide probabilities, the transition probabilities can be estimated by maximum likelihood, now from the dinucleotide counts $n_{\alpha\beta}$, the number of times base $\alpha$ is immediately followed by base $\beta$.

| From\To | A        | C        | G        | T        | Total   |
|---------|----------|----------|----------|----------|---------|
| A       | $n_{AA}$ | $n_{AC}$ | $n_{AG}$ | $n_{AT}$ | $n_{A}$ |
| C       | $n_{CA}$ | $n_{CC}$ | $n_{CG}$ | $n_{CT}$ | $n_{C}$ | 
| G       | $n_{GA}$ | $n_{GC}$ | $n_{GG}$ | $n_{GT}$ | $n_{G}$ | 
| T       | $n_{TA}$ | $n_{TC}$ | $n_{TG}$ | $n_{TT}$ | $n_{T}$ | 

The maximum likelihood estimates of the transition probabilities are $\pi_{\alpha\beta} = \frac{n_{\alpha \beta}}{n_{\alpha}}$, where $n_{\alpha} = \sum_{\beta} n_{\alpha \beta}$ is the row total.
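The table and the estimate $n_{\alpha\beta} / n_{\alpha}$ can be sketched in a few lines. The function name and the toy sequence are illustrative; rows for bases that never occur as the first element of a pair are skipped to avoid dividing by zero.

```python
from collections import Counter

def estimate_transitions(sequence):
    """Estimate pi_{alpha beta} = n_{alpha beta} / n_alpha from adjacent pairs."""
    pair_counts = Counter(zip(sequence, sequence[1:]))  # n_{alpha beta}
    probs = {}
    for a in "ACGT":
        row_total = sum(pair_counts[(a, b)] for b in "ACGT")  # n_alpha
        if row_total == 0:
            continue  # base a is never followed by anything in this sequence
        probs[a] = {b: pair_counts[(a, b)] / row_total for b in "ACGT"}
    return probs

transitions = estimate_transitions("ACGCGCGTAA")
for base, row in transitions.items():
    print(base, row)
```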

# Detection of CpG islands

Given a sequence $\vec x$, the log-odds score is

$$
S(\vec x) = \log \left( \frac{P(\vec x \mid \text{CpG model})}{P(\vec x \mid \text{non-CpG model})} \right) = \log P(\vec x \mid \text{CpG model}) - \log P(\vec x \mid \text{non-CpG model})
$$

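For a first-order Markov chain, the log-probability of a sequence under the CpG island model is, up to the term for the first nucleotide, a sum over adjacent nucleotide pairs:

$$
\sum_{i=2}^{L} \log p_{CpG}(x_i \mid x_{i-1})
$$

Likewise, the log-probability of the sequence under the non-CpG model is

$$
\sum_{i=2}^{L} \log p_{\overline{CpG}}(x_i \mid x_{i-1})
$$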

Given precalculated transition probabilities for both models, we can slide a window over the genome sequence and score each window to see where the CpG islands are located. The difficulty with this approach is that it is unclear how to choose the window size and the score cut-off above which a window is called a CpG island.
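The sliding-window score can be sketched as follows. The two transition tables are toy values invented for this example (uniform rows except for an elevated C to G probability in the CpG model), and the term for the first nucleotide of each window is assumed to be the same in both models, so it cancels in the log-odds.

```python
import math

BASES = "ACGT"

def make_table(p_cg):
    """Hypothetical table: uniform rows, except that C->G has probability p_cg
    and the remaining mass of row C is split evenly over A, C and T."""
    table = {a: {b: 0.25 for b in BASES} for a in BASES}
    rest = (1.0 - p_cg) / 3
    table["C"] = {b: (p_cg if b == "G" else rest) for b in BASES}
    return table

CPG = make_table(0.40)      # assumed values, not estimated from real data
NON_CPG = make_table(0.05)

def log_odds(seq):
    """Sum of log(p_CpG / p_nonCpG) over all adjacent pairs in seq."""
    return sum(math.log(CPG[a][b] / NON_CPG[a][b]) for a, b in zip(seq, seq[1:]))

def window_scores(seq, width):
    """Log-odds score of every window of the given width."""
    return [log_odds(seq[i:i + width]) for i in range(len(seq) - width + 1)]

scores = window_scores("ATATCGCGCGATAT", 6)
best_start = scores.index(max(scores))
print(best_start, scores[best_start])
```

Both the window width (6 here) and any cut-off applied to `scores` are free choices, which is the weakness discussed above.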

# Hidden Markov Model

In a hidden Markov model (HMM), we simultaneously model the hidden states (CpG island and non-CpG island) and the nucleotides observed in these states. The path $\pi$ through the states (not to be confused with the nucleotide probabilities $\pi_{\alpha}$ above) is modeled as a Markov chain, and the transitions into and out of the CpG island state mark the island boundaries.

The model has two sets of parameters:

- State-to-state transition probabilities: the probability of going from state $k$ to state $l$, $$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$
- Emission probabilities: the probability of observing base $\beta$ at position $i$ in state $l$, given the previous base $\alpha$ (the dependence on $\alpha$ is suppressed in the notation), $$e_l(\beta) = P(x_i = \beta \mid \pi_i = l, x_{i-1} = \alpha)$$

The probability of going from state $k$ to state $l$ and emitting the letter $\beta$ there is then

$$
e_l(\beta) a_{kl}
$$

The most likely path through the model given a sequence $\vec x$ corresponds to the most likely assignment of CpG islands along the sequence.
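Multiplying the terms $e_l(\beta) a_{kl}$ along a path gives the joint probability $P(\vec x, \pi)$ of a sequence and a state path. The sketch below uses a hypothetical two-state model with states `I` (island) and `B` (background), explicit start and end transitions, and, as a simplifying assumption, emissions that depend only on the state and not on the previous base. All numbers are invented.

```python
STATES = ("I", "B")  # hypothetical labels: island and background

START = {"I": 0.5, "B": 0.5}                      # a_{Sk}
TRANS = {"I": {"I": 0.8, "B": 0.1},               # a_{kl}
         "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}                        # a_{kE}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(beta)
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(x, path):
    """P(x, path): product of transition and emission terms along the path."""
    p = START[path[0]] * EMIT[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][x[i]]
    return p * END[path[-1]]

# 0.5 * 0.4 * 0.8 * 0.4 * 0.1 = 0.0064
print(joint_probability("CG", "II"))
```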

# The Viterbi Algorithm

The Viterbi algorithm finds the most likely path through the hidden Markov model by recursion. We define $\nu_k(i)$ as the maximum probability over all paths that emit the first $i$ letters of the sequence and end in state $k$. It satisfies the recursion

$$
\nu_k(i) = e_k(x_i) \max_{l} (\nu_l (i - 1) a_{lk})
$$ 

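During the recursion, we store for each position $i$ and each state $k$ the previous state $l$ that achieved the maximum, so that the best path can be recovered by traceback. Below, $\phi$ denotes the starting position, before any symbol has been emitted.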

*Initialization*
- $\nu_S(\phi) = 1$ and $\nu_k(\phi) = 0$ for every state $k$ other than the start state $S$

*Recursion*
- $\nu_k(i) = e_k(x_i) \max_l (\nu_l(i - 1) a_{lk})$
- $ptr_i(k) = \operatorname{argmax}_l (\nu_l(i - 1) a_{lk})$

*Termination*
- $P(\vec x, \pi^*) = \max_k (\nu_k (L) a_{kE})$
- $\pi^*_L = \operatorname{argmax}_k (\nu_k(L) a_{kE})$

*Traceback*
- $\pi_{i-1}^* = ptr_i(\pi^*_i)$ for $i = L, \dots, 1$
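The four steps above can be sketched as follows for a hypothetical two-state model (`I` for island, `B` for background; all numbers invented). As a simplification, emissions depend only on the state, and the start state is folded into the first column. The sketch multiplies raw probabilities to mirror the formulas; for long sequences one would work with logarithms to avoid underflow.

```python
STATES = ("I", "B")  # hypothetical labels: island and background
START = {"I": 0.5, "B": 0.5}                      # a_{Sk}
TRANS = {"I": {"I": 0.8, "B": 0.1},               # a_{kl}
         "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}                        # a_{kE}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(x):
    """Return (P(x, best path), best path) for the toy model above."""
    # Initialization: first symbol, reached directly from the start state.
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    ptr = [{}]
    # Recursion: extend the best path into each state k, remembering the
    # predecessor that achieved the maximum.
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for k in STATES:
            best = max(STATES, key=lambda l: v[i - 1][l] * TRANS[l][k])
            ptr[i][k] = best
            v[i][k] = EMIT[k][x[i]] * v[i - 1][best] * TRANS[best][k]
    # Termination: include the transition into the end state.
    last = max(STATES, key=lambda k: v[-1][k] * END[k])
    best_prob = v[-1][last] * END[last]
    # Traceback: follow the stored pointers from the last position backwards.
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return best_prob, "".join(reversed(path))

prob, path = viterbi("ATATCGCGCGATAT")
print(path, prob)
```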

# Forward Algorithm

The optimal path is only one of many possible paths, which resemble one another to varying degrees. To find more "stable" CpG islands, i.e. islands that occur in many high-probability paths, we look at the posterior probabilities of the paths.

The forward algorithm gives us the probability of a sequence taking into account that multiple paths can give rise to the same sequence of symbols. The total probability is given as

$$
P(\vec x) = \sum_{\pi} P(\vec x, \pi)
$$

This allows us to calculate the posterior probability of an individual path as

$$
P(\pi | \vec x) = \frac{P(\vec x, \pi)}{P(\vec x)} = \frac{P(\vec x, \pi)}{\sum_{\pi'} P(\vec x, \pi')}
$$

In the forward algorithm, we calculate the joint probability of observing the first $i$ symbols and ending in state $k$:

$$
f_k(i) = P(x_1, ..., x_i, \pi_i = k)
$$

*Initialization*
- $f_S(\phi) = 1$ and $f_k(\phi) = 0$ for every state $k$ other than the start state $S$

*Recursion*
- $f_k(i) = e_k(x_i) \underbrace{\sum_l (f_l(i-1) a_{lk})}_{\text{All previous paths transitioning to state } k}$

*Termination*
- $P(\vec x) = \sum_k (f_k(L) a_{kE})$
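A minimal sketch of the forward recursion, again for a hypothetical two-state model with invented numbers and emissions that depend only on the state. It differs from Viterbi only in replacing the maximum by a sum.

```python
STATES = ("I", "B")  # hypothetical labels: island and background
START = {"I": 0.5, "B": 0.5}                      # a_{Sk}
TRANS = {"I": {"I": 0.8, "B": 0.1},               # a_{kl}
         "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}                        # a_{kE}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x):
    """Return the table f (f[i][k] for 0-based position i) and P(x)."""
    # Initialization: first symbol, reached directly from the start state.
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    # Recursion: sum over all predecessors l instead of taking the maximum.
    for i in range(1, len(x)):
        f.append({
            k: EMIT[k][x[i]] * sum(f[i - 1][l] * TRANS[l][k] for l in STATES)
            for k in STATES
        })
    # Termination: include the transition into the end state.
    total = sum(f[-1][k] * END[k] for k in STATES)
    return f, total

f, p_x = forward("CG")
# The four paths II, IB, BI, BB contribute 0.0064 + 0.0004 + 0.0004 + 0.0016.
print(p_x)
```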

# Backward Algorithm

The backward algorithm also gives us the probability of a sequence, taking into account that multiple paths can give rise to the same sequence of symbols, but carries out the computation from the end of the sequence towards the beginning. We calculate the probability of the remaining symbols $x_{i + 1}, ... , x_L$ given that the path is in state $k$ at position $i$:

$$
b_k(i) = P(x_{i+1}, ..., x_L \mid \pi_i = k)
$$

*Initialization*
- $b_k(L) = a_{kE}$ $\forall k$

*Recursion*
- $b_k(i) = \sum_{l} b_l(i+1)e_l(x_{i+1})a_{kl}$

*Termination*
- $b_S(\phi) = P(x_1,...,x_L \mid \pi_0 = S) = P(\vec x) = \sum_k b_k(1) e_k(x_1)a_{Sk}$
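The backward recursion can be sketched in the same way. The two-state model and its numbers are hypothetical, and emissions are assumed to depend only on the state. The termination step returns the same $P(\vec x)$ as the forward algorithm, which is a useful sanity check.

```python
STATES = ("I", "B")  # hypothetical labels: island and background
START = {"I": 0.5, "B": 0.5}                      # a_{Sk}
TRANS = {"I": {"I": 0.8, "B": 0.1},               # a_{kl}
         "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}                        # a_{kE}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def backward(x):
    """Return the table b (b[i][k] for 0-based position i) and P(x)."""
    L = len(x)
    b = [None] * L
    # Initialization: from the last position only the end transition remains.
    b[L - 1] = {k: END[k] for k in STATES}
    # Recursion: move from the end of the sequence towards the beginning.
    for i in range(L - 2, -1, -1):
        b[i] = {
            k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
            for k in STATES
        }
    # Termination: leave the start state and emit the first symbol.
    total = sum(START[k] * EMIT[k][x[0]] * b[0][k] for k in STATES)
    return b, total

b, p_x = backward("CG")
print(p_x)
```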

# Most probable State

The product of the forward and backward probabilities is the total probability of all paths in which position $i$ is in state $k$. The forward part, as the name suggests, covers all partial paths from position $0$ to $i$, and the backward part covers all partial paths from $i$ to $L$.

Given the forward and backward probabilities, we can calculate the probability that position $i$ is in state $k$, for example in a CpG island:

$$
P(\vec x, \pi_i = k) = \underbrace{P(x_1, ..., x_i, \pi_i = k)}_{f_k(i)} \underbrace{P(x_{i+1}, ..., x_L \mid \pi_i = k)}_{b_k(i)}
$$

It then follows that

$$
P(\pi_i = k | \vec x) = \frac{P(\vec x, \pi_i = k)}{P(\vec x)} = \frac{f_k(i)b_k(i)}{P(\vec x)}
$$
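Combining both tables gives the posterior state probabilities at every position. The sketch below is self-contained and therefore repeats compact versions of the forward and backward recursions; the two-state model, its numbers, and the state-only emissions are assumptions made for illustration.

```python
STATES = ("I", "B")  # hypothetical labels: island and background
START = {"I": 0.5, "B": 0.5}                      # a_{Sk}
TRANS = {"I": {"I": 0.8, "B": 0.1},               # a_{kl}
         "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}                        # a_{kE}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x):
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({k: EMIT[k][x[i]] * sum(f[i - 1][l] * TRANS[l][k] for l in STATES)
                  for k in STATES})
    return f, sum(f[-1][k] * END[k] for k in STATES)

def backward(x):
    b = [None] * len(x)
    b[-1] = {k: END[k] for k in STATES}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    return b

def posterior(x):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x) for every position i."""
    f, p_x = forward(x)
    b = backward(x)
    return [{k: f[i][k] * b[i][k] / p_x for k in STATES} for i in range(len(x))]

post = posterior("ATATCGCGCGATAT")
for base, row in zip("ATATCGCGCGATAT", post):
    print(base, round(row["I"], 3))
```

Unlike the Viterbi path, these values quantify how confidently each position is assigned to the island state.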

# Distribution of state lengths

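We would like to know the average length of a CpG island. Writing $C$ for the CpG island state, the probability that the path makes exactly $l$ self-transitions $C \to C$ before leaving the state follows a geometric distribution:

$$
P(l) = a_{CC}^l (1 - a_{CC})
$$

Its expected value is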

\begin{align*}
    \langle l \rangle
    &=
    \sum_{l=0}^{\infty} l P(l) \\
    &=
    \sum_{l=0}^{\infty} l a_{CC}^l (1 - a_{CC}) \\
    &=
    a_{CC}(1 - a_{CC})\sum_{l=0}^{\infty} l a_{CC}^{l - 1} \\
    &=
    a_{CC}(1 - a_{CC}) \frac{\partial}{\partial a_{CC}} \sum_{l=0}^{\infty} a_{CC}^{l} \\
    &=
    a_{CC}(1 - a_{CC}) \frac{\partial}{\partial a_{CC}} \frac{1}{1 - a_{CC}} \\
    &=
    a_{CC}(1 - a_{CC}) \frac{1}{(1 - a_{CC})^2} \\
    &=
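    \frac{a_{CC}}{1 - a_{CC}}
\end{align*}

Since $P(0) = 1 - a_{CC}$ corresponds to leaving the state immediately after entering it, $l$ does not count the position at which the island is entered; including that position, the expected island length is $\langle l \rangle + 1 = \frac{1}{1 - a_{CC}}$.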
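The closed form can be checked numerically by truncating the series. The value $a_{CC} = 0.8$ is an arbitrary choice for illustration, and the number of terms is chosen large enough that the neglected tail is far below floating-point precision.

```python
def geometric_pmf(l, a_cc):
    """P(l) = a_CC^l (1 - a_CC)."""
    return a_cc ** l * (1 - a_cc)

def mean_self_transitions(a_cc, terms=2000):
    """Truncated version of the series sum_l l * P(l)."""
    return sum(l * geometric_pmf(l, a_cc) for l in range(terms))

a_cc = 0.8  # hypothetical self-transition probability
numeric = mean_self_transitions(a_cc)
closed_form = a_cc / (1 - a_cc)
print(numeric, closed_form)
```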

The problem with this setup is that we assumed the transition and emission probabilities to be known, whereas in practice they are not. When the state sequence is known, we simply count the transitions $A_{kl}$ from state $k$ to state $l$ and the emissions $E_k(\beta)$ of symbol $\beta$ in state $k$, and normalize:

$$
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_{k} (\beta) = \frac{E_{k}(\beta)}{\sum_{\beta'}E_k(\beta')}
$$
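For a labeled training sequence, the counting estimate looks as follows. The state labels `I` and `B`, the function name, and the toy data are hypothetical, and transitions from the start state and into the end state are ignored for brevity.

```python
from collections import Counter

SYMBOLS = "ACGT"

def estimate_from_labeled_path(x, path):
    """Count transitions A_kl and emissions E_k(beta), then normalize each row."""
    A = Counter(zip(path, path[1:]))   # A[(k, l)]: transitions k -> l
    E = Counter(zip(path, x))          # E[(k, beta)]: symbol beta emitted in state k
    states = sorted(set(path))
    a, e = {}, {}
    for k in states:
        out_total = sum(A[(k, l)] for l in states)
        if out_total > 0:  # a state seen only at the last position has no row
            a[k] = {l: A[(k, l)] / out_total for l in states}
        emit_total = sum(E[(k, s)] for s in SYMBOLS)
        e[k] = {s: E[(k, s)] / emit_total for s in SYMBOLS}
    return a, e

a, e = estimate_from_labeled_path("ATCGCGAT", "BBIIIIBB")
print(a)
print(e)
```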

When the state sequence is unknown, we use expectation maximization to optimize these parameters.

# Baum-Welch Algorithm

Based on the current values of $a_{kl}$ and $e_k(\beta)$, we compute the expected values of the counts $A_{kl}$ and $E_k(\beta)$, taking all paths into account weighted by their probability, and then re-estimate $a_{kl}$ and $e_k(\beta)$ from them.
The posterior probability that the transition from $k$ to $l$ is used at position $i$ in sequence $\vec x$, given the current parameters $\theta$, is

$$
P(\pi_i = k, \pi_{i+1} = l | \vec x, \theta) = \frac{f_k(i)a_{kl}e_l(x_{i+1})b_l(i+1)}{P(\vec x)}
$$

Summing over all training sequences $j$ and all positions $i$, we get the expected counts

\begin{align*}
    A_{kl} &= \sum_j \frac{1}{P(\vec x^j)} \sum_i f_k^j(i) a_{kl} e_l(x_{i+1}^j) b_l^j(i+1) \\
    E_k(\beta) &= \sum_j \frac{1}{P(\vec x^j)} \sum_{\{i | x_i^j = \beta\}} f_k^j(i) b_k^j(i)
\end{align*}

and then we recalculate 

$$
a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_{k} (\beta) = \frac{E_{k}(\beta)}{\sum_{\beta'}E_k(\beta')}
$$
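One iteration of this procedure can be sketched as below. This is a simplified illustration rather than a complete implementation: the two-state model and its starting values are invented, emissions depend only on the state, the start probabilities are kept fixed, the transition into the end state is counted separately as `A_end`, and raw probabilities are used, so long sequences would underflow.

```python
import math

STATES = ("I", "B")  # hypothetical labels: island and background
SYMBOLS = "ACGT"
START = {"I": 0.5, "B": 0.5}                      # a_{Sk}, kept fixed here
TRANS = {"I": {"I": 0.8, "B": 0.1}, "B": {"I": 0.1, "B": 0.8}}
END = {"I": 0.1, "B": 0.1}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x, trans, end, emit):
    f = [{k: START[k] * emit[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][l] * trans[l][k] for l in STATES)
                  for k in STATES})
    return f, sum(f[-1][k] * end[k] for k in STATES)

def backward(x, trans, end, emit):
    b = [None] * len(x)
    b[-1] = dict(end)
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    return b

def baum_welch_step(seqs, trans, end, emit):
    """One EM iteration; also returns the log-likelihood under the input parameters."""
    A = {k: {l: 0.0 for l in STATES} for k in STATES}
    A_end = {k: 0.0 for k in STATES}
    E = {k: {s: 0.0 for s in SYMBOLS} for k in STATES}
    log_likelihood = 0.0
    for x in seqs:
        f, p_x = forward(x, trans, end, emit)
        b = backward(x, trans, end, emit)
        log_likelihood += math.log(p_x)
        for i in range(len(x)):
            for k in STATES:
                E[k][x[i]] += f[i][k] * b[i][k] / p_x          # expected emissions
                if i + 1 < len(x):
                    for l in STATES:                           # expected transitions
                        A[k][l] += (f[i][k] * trans[k][l]
                                    * emit[l][x[i + 1]] * b[i + 1][l] / p_x)
        for k in STATES:
            A_end[k] += f[-1][k] * end[k] / p_x
    new_trans, new_end, new_emit = {}, {}, {}
    for k in STATES:  # normalize the expected counts row by row
        out_total = sum(A[k].values()) + A_end[k]
        new_trans[k] = {l: A[k][l] / out_total for l in STATES}
        new_end[k] = A_end[k] / out_total
        emit_total = sum(E[k].values())
        new_emit[k] = {s: E[k][s] / emit_total for s in SYMBOLS}
    return new_trans, new_end, new_emit, log_likelihood

seqs = ["ATATCGCGCGATAT", "CGCGATATATCG"]  # toy training sequences
trans, end, emit = TRANS, END, EMIT
for iteration in range(5):
    trans, end, emit, ll = baum_welch_step(seqs, trans, end, emit)
    print(iteration, ll)
```

The printed log-likelihood should not decrease from one iteration to the next, which is a convenient check that the expected counts are computed consistently.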

