We can use HMMs to describe what a gene is and use that model to identify genes in the genomic sequence.

A gene is a continuous stretch of nucleotides in the genome that encodes a protein.

# Open reading frames

A ribosome knows where to start translation because of the Shine-Dalgarno sequence, a G- and A-rich sequence upstream of the start codon. Because of this base richness, one can try to detect genes by looking for this pattern.
The stretch between a start and a stop codon is called the open reading frame (ORF). Given 64 possible codons (base triplets), we may ask what the average ORF length is. Since 3 of the 64 codons are stop codons, $P(stop) = \frac{3}{64}$ if all codons are equally likely, and the probability of an ORF of length $l$ codons is given as

$$
    P((\bar{stop}_{l-1} stop)) = \left(  1 - \frac{3}{64}  \right)^{l-1}  \frac{3}{64}
$$

With $p = 1 - \frac{3}{64}$ denoting the probability of a non-stop codon, the average length is then

$$
\langle l \rangle = \sum_{l=1}^{\infty} l p^{l-1} (1 - p) = \frac{1}{1 - p} = \frac{64}{3}
$$
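As a quick numerical check of this result, the following minimal sketch evaluates the geometric distribution and compares a truncated sum with the closed form. The function names are illustrative and not part of the original notes.

```python
from fractions import Fraction

P_STOP = Fraction(3, 64)  # 3 of the 64 codons are stop codons


def orf_length_prob(l, p_stop=P_STOP):
    """P(the first stop codon is the l-th codon): a geometric distribution."""
    return (1 - p_stop) ** (l - 1) * p_stop


def mean_orf_length(p_stop=P_STOP):
    """Closed form of sum_l l * P(l) for a geometric distribution."""
    return 1 / p_stop


# Truncated sum; the tail beyond 2000 codons is negligible.
approx_mean = sum(l * float(orf_length_prob(l)) for l in range(1, 2000))
print(mean_orf_length(), approx_mean)
```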

The ORF length obviously depends on the G/C versus A/T content, because all three stop codons (TAA, TAG, TGA) are A/T-rich. An increase in G/C content therefore increases the average ORF length.
ORFs alone are not enough to find genes, as we would have to choose a cutoff for a valid ORF length, yet there are also short ORFs that encode functional peptides. Additionally, there are genes that do not encode proteins.
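The dependence on G/C content can be made concrete with a small sketch. It assumes independent bases with $P(G) = P(C)$ and $P(A) = P(T)$, which is a simplification introduced here, not something stated above.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def base_probs(gc):
    """Base frequencies for a G+C fraction `gc`, assuming P(G)=P(C), P(A)=P(T)."""
    at = (1 - gc) / 2
    return {"A": at, "T": at, "G": gc / 2, "C": gc / 2}


def stop_prob(gc):
    """Probability that a random triplet is a stop codon."""
    p = base_probs(gc)
    return sum(p[c[0]] * p[c[1]] * p[c[2]] for c in STOP_CODONS)


def mean_orf_length_gc(gc):
    """Mean of the geometric ORF length distribution, in codons."""
    return 1 / stop_prob(gc)


for gc in (0.3, 0.5, 0.7):
    print(gc, round(mean_orf_length_gc(gc), 1))
```

At 50% G/C this reduces to the uniform case with $P(stop) = \frac{3}{64}$.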

# Promoter regions

Upstream of the transcription start site there are further motifs, such as the $\sigma$ factor binding site and the Pribnow box (a TATA-like element), which are recognized by the $\sigma^{70}$ subunit of RNA polymerase. There are also transcription terminators, which arise from internal base pairing of the RNA that forms a hairpin structure.

# HMM bacterial gene prediction (ORF prediction)

![](hmm_gene.png)

The null block models everything that is not part of a gene. It is followed by the RBS submodule, which models the ribosomal binding site as well as the nucleotides between the RBS and the start codon. The start codon is modeled with 3 nucleotides. After the start codon, the first triplet is explicitly modeled with 3 nucleotides, as this part appears to differ from the rest of the gene. Similarly, the last codon before the stop codon is modeled explicitly, as are the 6 bases after the stop codon, in order to capture information present around the stop codon. Between these parts are 3 looped codon submodels for the interior of the gene. The reason for these submodels is to embed a more realistic length distribution in the HMM instead of the geometric distribution.

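The 1 codon model would give the geometric distribution $P(l + 1) = p^l (1 - p)$, where $p$ is the probability of staying in the codon loop. The 3 codon model, on the other hand, would give $P(l + 3) = \frac{p^l (1 - p)^3 (l+1)(l+2)}{2}$, which peaks at a nonzero length.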
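To see how the two length distributions differ, the sketch below evaluates both formulas and locates their modes. The loop probability `p = 0.97` is an arbitrary illustrative value, not one taken from a trained model.

```python
def one_codon_pmf(l, p):
    """P(length = l + 1) for a single self-looping codon state."""
    return p ** l * (1 - p)


def three_codon_pmf(l, p):
    """P(length = l + 3) for three self-looping codon states in series."""
    return p ** l * (1 - p) ** 3 * (l + 1) * (l + 2) / 2


p = 0.97  # hypothetical self-loop probability
mode_one = max(range(2000), key=lambda l: one_codon_pmf(l, p))
mode_three = max(range(2000), key=lambda l: three_codon_pmf(l, p))
print(mode_one, mode_three)
```

The single-loop model always puts most weight on the shortest possible gene, whereas the three-loop model has its maximum at an intermediate length.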

# Eukaryotic

Due to the presence of introns and exons, prediction becomes more difficult, as not all of the sequence between the start and the stop codon encodes the protein.

There are different intron phases:
- Phase 0 intron: the intron lies between two codons, after the last nucleotide of one codon and before the first nucleotide of the next
- Phase 1 intron: the intron lies after the first nucleotide of a codon
- Phase 2 intron: the intron lies after the second nucleotide of a codon
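The phase follows directly from the number of coding nucleotides that precede the intron. The sketch below assumes that the exon lengths count coding nucleotides only, starting at the first base of the start codon; the function name is illustrative.

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon except the last one.

    The phase is the number of coding nucleotides before the intron, modulo 3.
    """
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Exons with 6, 4 and 5 coding nucleotides: the first intron follows a
# complete codon, the second one splits a codon after its first nucleotide.
print(intron_phases([6, 4, 5]))
```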

![](hmm_eukaryot.png)

# Regulatory elements in DNA and RNA

RNA synthesis is regulated by transcription factors (TFs). Experimentally, TF binding sites are identified by ChIP-seq, that is, chromatin immunoprecipitation followed by sequencing.

![](chip-seq.png)

Using this, one first has to align the reads to the genome. One then finds the shift between the read counts $r_+(i)$ and $r_-(i)$ on the plus and minus strand, that is, the distance $d$ that maximizes the correlation function

$$
C(d) = \sum_{i = 1}^{|G|} r_+(i) r_-(i+d)
$$
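A minimal sketch of this maximization, assuming `r_plus` and `r_minus` are equally long lists of per-position read counts on the two strands and that terms with $i + d$ beyond the end of the sequence are simply dropped:

```python
def strand_cross_correlation(r_plus, r_minus, d):
    """C(d) = sum_i r_plus[i] * r_minus[i + d], truncated at the sequence end."""
    return sum(r_plus[i] * r_minus[i + d] for i in range(len(r_plus) - d))


def best_shift(r_plus, r_minus, max_shift):
    """Shift d in [0, max_shift] that maximizes C(d)."""
    return max(range(max_shift + 1),
               key=lambda d: strand_cross_correlation(r_plus, r_minus, d))


# Toy example: a pile of plus-strand reads at position 2 and a matching
# pile of minus-strand reads at position 6.
r_plus = [0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
r_minus = [0, 0, 0, 0, 0, 0, 5, 0, 0, 0]
print(best_shift(r_plus, r_minus, max_shift=8))
```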

Given
- $n_i, m_i$ = Number of foreground / background reads in window $i$
- $N, M$ = Total number of foreground / background reads in the sample
- $\sigma^2$ = Variance of multiplicative noise introduced during sample preparation
- $\mu$ = Depletion of background reads in bound regions
- $\rho$ = Fraction of background windows
- $W$ = Window size
- $R$ = Range of variation of log read density in foreground vs background in a window

The probability of observing $n$ reads in a given window, depending on whether the window is bound ($P_b$) or unbound ($P_u$) by the TF, is given as

$$
P_b(n | m, N, M) = \frac{1}{R}
$$

$$
P_u(n | m, N, M, \mu, \sigma) = \frac{1}{\sqrt{2\pi (2 \sigma^2 + \frac{1}{n} + \frac{1}{m})}} \exp\left( -\frac{(\log(\frac{n}{N}) - \log(\frac{m}{M}) - \mu)^2}{2 (2 \sigma^2 + \frac{1}{n} + \frac{1}{m})} \right)
$$

The log likelihood of the data is then given as

$$
L = \sum_i \log P_{mix} (n_i | m_i, N, M, \rho, \mu, \sigma) = \sum_i \log \left[ \rho P_u(n_i | m_i, N, M, \mu, \sigma) + (1 - \rho) P_b(n_i | m_i, N, M) \right]
$$

After parameter fitting, a z-score can be calculated for every window

$$
z = \frac{\log(\frac{n}{N}) - \log(\frac{m}{M}) - \mu}{\sqrt{2 \sigma^2 + \frac{1}{n} + \frac{1}{m}}}
$$
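The window model can be sketched as follows. The code sums the logarithm of the mixture over windows and assumes every window has at least one foreground and one background read (otherwise $\frac{1}{n}$ or $\frac{1}{m}$ is undefined); the parameter values in the example are made up.

```python
import math


def noise_variance(n, m, sigma):
    """Variance of the log-ratio: 2*sigma^2 + 1/n + 1/m."""
    return 2 * sigma ** 2 + 1 / n + 1 / m


def z_score(n, m, N, M, mu, sigma):
    """Standardized log enrichment of a window."""
    diff = math.log(n / N) - math.log(m / M) - mu
    return diff / math.sqrt(noise_variance(n, m, sigma))


def p_unbound(n, m, N, M, mu, sigma):
    """Gaussian density of the log-ratio for an unbound window."""
    var = noise_variance(n, m, sigma)
    z = z_score(n, m, N, M, mu, sigma)
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi * var)


def log_likelihood(windows, N, M, rho, mu, sigma, R):
    """Sum over windows (n_i, m_i) of the log of the two-component mixture."""
    total = 0.0
    for n, m in windows:
        total += math.log(rho * p_unbound(n, m, N, M, mu, sigma) + (1 - rho) / R)
    return total


windows = [(10, 10), (40, 10)]  # hypothetical (foreground, background) counts
print(z_score(40, 10, 100, 100, 0.0, 0.1))
print(log_likelihood(windows, 100, 100, rho=0.9, mu=0.0, sigma=0.1, R=10.0))
```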

The coverage per position is then modeled with a mixture of Gaussian peaks and a uniform background. For a given region we get

$$
L(C | \vec \mu, \vec \sigma, \vec \rho, W) = \prod_i \left[ \sum_j \rho_j \frac{1}{\sqrt{2 \pi \sigma^2_j}} \exp\left( -\frac{(i-\mu_j)^2}{2\sigma_j^2} \right) + \left(1 - \sum_j \rho_j \right) \frac{1}{W}\right]^{C(i)}
$$

where $\vec \mu, \vec \sigma, \vec \rho$ are fitted by EM. The maxima of the fitted peaks are then possible binding sites.

# Representing binding sites

The interaction between the TF and the binding site is in essence characterized by two parameters
1. The binding energy $E$ of the interaction between TF and the binding site
2. The concentration $c$ of the transcription factor

As the TF concentration increases, the fraction $P_{bound}$ of the time that the TF is bound to the site increases

$$
P_{bound} = \frac{ce^{\beta E}}{ce^{\beta E} + K }
$$
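A small sketch of this saturation curve, with $\beta$ and $K$ set to 1 as arbitrary default values for illustration:

```python
import math


def p_bound(c, energy, beta=1.0, K=1.0):
    """Fraction of time the site is bound at TF concentration c."""
    weight = c * math.exp(beta * energy)
    return weight / (weight + K)


# Occupancy rises monotonically with concentration and saturates at 1.
for c in (0.1, 1.0, 10.0, 100.0):
    print(c, round(p_bound(c, energy=0.0), 3))
```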

We assume the binding energy of a sequence $s$ is an additive function of the individual bases

$$
E(s) = \sum_{i=1}^l E_i(s_i)
$$

The probability for the site to be bound can be roughly described by the function $P_{bound}(s)$.
Assume that the only constraint on "functional binding sites" is that they have some characteristic average energy $E$. Using the maximum entropy principle, we get

$$
P(s) = \frac{e^{\lambda E(s)}}{\sum_{s'}e^{\lambda E(s')}} = \prod_{i=1}^l \frac{e^{\lambda E_i(s_i)}}{\sum_{\alpha} e^{\lambda E_i(\alpha)}}
$$

Where the Lagrange multiplier $\lambda$ is chosen such that $\sum_s E(s) P(s) = E$.

We can rewrite $P(s)$ in terms of a weight matrix (WM) $w$.

$$
P(s) = \prod_{i=1}^l P_i(s_i) = \prod_{i=1}^l w_{s_i}^i
$$

The probability that a binding site for the TF will have a sequence $s$ is given by

$$
P(s|w) = \prod_{i=1}^l w_{s_i}^i
$$

These weights can be estimated from counts, where $n_b^i$ is the number of known sites with base $b$ at position $i$ and $\alpha_b^i$ is a pseudocount.

$$
w_b^i = \frac{n_b^i + \alpha_b^i}{\sum_b (n_b^i + \alpha_b^i)}
$$
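The counting estimate can be sketched as below, assuming a set of aligned sites of equal length and one pseudocount shared by all bases and positions (the formula above allows a separate $\alpha_b^i$ per entry).

```python
BASES = "ACGT"


def estimate_wm(sites, pseudocount=0.5):
    """Weight matrix as a list of columns; each column maps base -> probability."""
    wm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        wm.append({b: counts[b] / total for b in BASES})
    return wm


sites = ["ACG", "ACT", "AGG"]  # toy aligned binding sites
wm = estimate_wm(sites, pseudocount=1.0)
print(wm[0])
```

Because of the pseudocount, bases never observed at a position still receive a nonzero weight.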

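The question is how to set the pseudocount. Consider the likelihood of a set $S$ of binding sites given the WM, where $n_{\alpha}^i$ is the number of sequences in $S$ with base $\alpha$ at position $i$: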

$$
\begin{align*}
    P(S | w)
    &=
    \prod_{s \in S} P(s | w) \\
    &=
    \prod_{s \in S} [\prod_{i=1}^l w^i_{s_i}] \\
    &=
    \prod_{i=1}^l  [\prod_{s \in S} w^i_{s_i}] \\
    &=
    \prod_{i=1}^l [\prod_{\alpha} (w_{\alpha}^i)^{n_{\alpha}^i}]
\end{align*}
$$

To calculate the posterior $P(w | S)$ we need a prior. For this we use, for each column, the family of Dirichlet priors $P(w)dw = \frac{\Gamma (4 \gamma)}{[\Gamma(\gamma)]^4} \prod_{\alpha} (w_{\alpha}^{\gamma - 1})dw$.
For $\gamma = 1$ the prior is uniform. For $\gamma < 1$ more weight lies on the corners and edges of the simplex, so the prior favors columns dominated by one or two bases. For $\gamma > 1$ more weight lies in the middle of the simplex, so the prior favors columns in which all bases are about equally frequent.

Given this prior, we can calculate the posterior as $P(w | S) = \frac{P(S|w)P(w)}{P(S)}$. This is the probability of the WM $w$ given the observed sites $S$. Since the Dirichlet prior is conjugate to this likelihood, the posterior is again a Dirichlet distribution, and its mean reproduces the counting estimate with pseudocount $\alpha_b^i = \gamma$.

The information score of the WM is a measure of the specificity of the binding factor

$$
I = \sum_{i,b} f_b^i \log(\frac{f_b^i}{p_b})
$$

Where $f_b^i = \frac{n_b^i}{\sum_b n_b^i}$ and $p_b$ is the background frequency of nucleotide $b$.
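A minimal sketch of the information score. The base of the logarithm is not specified above; base 2 (bits) is assumed here, and terms with $f_b^i = 0$ are taken to contribute zero.

```python
import math


def information_score(freqs, background):
    """Sum over positions and bases of f * log2(f / p_b)."""
    total = 0.0
    for column in freqs:
        for base, f in column.items():
            if f > 0:  # the limit of f*log(f) for f -> 0 is 0
                total += f * math.log2(f / background[base])
    return total


background = {b: 0.25 for b in "ACGT"}
freqs = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully specific position
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
]
print(information_score(freqs, background))
```

With a uniform background, a fully specific position contributes 2 bits and an uninformative position contributes nothing.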

The likelihood of a sequence given the WM is simply the product of the probabilities of the nucleotides at their respective positions.

$$
P(s[i..i+\omega - 1] | \vec w) = \prod_{j=1}^{\omega} w_{s[i+j-1]}^j
$$

Where $\omega$ is the length of $\vec w$.
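Scanning a sequence with a WM can be sketched as follows. The code uses 0-based indexing, so the window starting at `start` covers `seq[start:start + len(wm)]`; the toy WM is invented for illustration.

```python
def site_likelihood(seq, start, wm):
    """Product of the WM entries for the window beginning at `start`."""
    prob = 1.0
    for j, column in enumerate(wm):
        prob *= column[seq[start + j]]
    return prob


def scan(seq, wm):
    """Likelihood of every window of length len(wm) in seq."""
    return [site_likelihood(seq, i, wm) for i in range(len(seq) - len(wm) + 1)]


wm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
scores = scan("GACT", wm)
print(scores.index(max(scores)), scores)
```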

# Finding multiple sites

Suppose we have the WMs for two TFs and a model $B$ of the background sequence.

A parse is one way of placing WM and background regions over a sequence. The likelihood of a given parse for each region is then

$$
P(s[i..i+\omega - 1]|\vec w) = \prod_{j=1}^{\omega} w_{s_{i+j-1}}^j
$$

Where $\vec w$ is either the WM corresponding to the TF bound at $i$ or the background, and $\omega$ the length of the WM.

The total likelihood of the sequence, summed over all possible parses, is the partition function $Z$, which can be computed recursively.

$$
Z(1, i) = \sum_{j=1}^m Z(1, i - \omega_j) P(s[i-\omega_j + 1 .. i]| \vec w^j)
$$

This sums, over all $m$ weight matrices (with the background treated as a weight matrix of length 1), the contributions of the parses ending with a given weight matrix at position $i$. The posterior probability of a site for $TF_j$ starting at position $i$ is then

$$
P(TF_j \ at \ i) = \frac{Z(1, i - 1)P(s[i..i+\omega_j - 1] | \vec w^j)Z(i + \omega_j, L)}{Z(1, L)}
$$

![](parses.png)
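The recursion and the posterior can be sketched with a forward and a backward pass. This is a minimal illustration that treats the background as a WM of length 1, uses 0-based indexing, and sets the partition function of the empty sequence to 1; all names are illustrative.

```python
def likelihood(seq, start, wm):
    """Product of WM entries for the window beginning at `start` (0-based)."""
    prob = 1.0
    for j, column in enumerate(wm):
        prob *= column[seq[start + j]]
    return prob


def forward(seq, models):
    """Z[i] = summed weight of all parses of seq[:i]; Z[0] = 1."""
    Z = [1.0] + [0.0] * len(seq)
    for i in range(1, len(seq) + 1):
        for wm in models:
            w = len(wm)
            if w <= i:
                Z[i] += Z[i - w] * likelihood(seq, i - w, wm)
    return Z


def backward(seq, models):
    """B[i] = summed weight of all parses of seq[i:]; B[len(seq)] = 1."""
    L = len(seq)
    B = [0.0] * L + [1.0]
    for i in range(L - 1, -1, -1):
        for wm in models:
            w = len(wm)
            if i + w <= L:
                B[i] += likelihood(seq, i, wm) * B[i + w]
    return B


def site_posterior(seq, models, k, start):
    """Posterior probability that models[k] covers seq[start:start + len]."""
    wm = models[k]
    if start + len(wm) > len(seq):
        return 0.0
    Z, B = forward(seq, models), backward(seq, models)
    return Z[start] * likelihood(seq, start, wm) * B[start + len(wm)] / Z[-1]


background = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
tf = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
      {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}]
models = [background, tf]
print(site_posterior("GGACGG", models, k=1, start=2))
```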

# Inferring novel binding specificities

Imagine we've identified regions that are bound by a given TF whose specificity we do not know. We would like to find the binding specificity of the TF, that is, infer the WM from this set of bound regions.

Using the Gibbs sampling algorithm for inferring binding specificity, we start with windows placed randomly, one in each sequence.

- At each iteration
    - Remove the window of one sequence $i$
    - Use the remaining windows to infer the WM
    - At each position in sequence $i$ calculate the probability that the subsequence starting at that position was generated from the WM
    - Sample a new window for sequence $i$ according to these probabilities
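The iteration above can be sketched as follows. This is a simplified illustration, not a full motif finder: it assumes exactly one site per sequence, a fixed window width, a constant pseudocount, and it ignores the background model when weighting positions.

```python
import random

BASES = "ACGT"


def build_wm(sites, pseudocount=1.0):
    """WM estimated from aligned windows by counting with pseudocounts."""
    wm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        wm.append({b: counts[b] / total for b in BASES})
    return wm


def window_prob(seq, start, wm):
    prob = 1.0
    for j, column in enumerate(wm):
        prob *= column[seq[start + j]]
    return prob


def gibbs_step(seqs, starts, width, rng):
    """Remove one window, rebuild the WM from the others, resample the window."""
    i = rng.randrange(len(seqs))
    others = [seqs[k][starts[k]:starts[k] + width]
              for k in range(len(seqs)) if k != i]
    wm = build_wm(others)
    positions = list(range(len(seqs[i]) - width + 1))
    weights = [window_prob(seqs[i], p, wm) for p in positions]
    starts[i] = rng.choices(positions, weights=weights)[0]


def gibbs_sample(seqs, width, iterations=200, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        gibbs_step(seqs, starts, width, rng)
    return starts


seqs = ["TTGACGTCAA", "CCTGACGTCA", "GGGTGACGTC"]  # toy sequences
print(gibbs_sample(seqs, width=6, iterations=50, seed=1))
```

Because the sampler is stochastic, different seeds can end in different window placements; in practice one would run several chains and compare them.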


Each way of placing a window of length $L$ in each of the $n$ upstream regions represents a state $x$, with an associated probability $P(x)$. $P(x)$ is proportional to the likelihood that all subsequences enclosed by the windows are generated using the same WM.
To find states with high probability, sample from $P(x)$ using a Markov chain.
1. Given state $x$, state $y$ is proposed with probability $P(y|x)$, such that $P(y|x) = P(x|y)$
2. State $y$ is accepted with probability $\min(1, P(y)/P(x))$

For the MEME algorithm, we start with a guess for a weight matrix and for the probability of occurrence of binding sites.

At each iteration, we use the current WM and prior probability for binding sites to calculate the probability at each position in every sequence that there is a binding site starting at that position. We then update the WM and prior probability of sites.

The posterior of a site starting at position $j$ is

$$
P(site \ at \ j) = \frac{\pi_t P(s[j..j+\omega - 1] | \vec w)}{\pi_t P(s[j..j + \omega - 1] | \vec w) + (1 - \pi_t) P(s[j..j+\omega - 1]|\vec B)}
$$

We update the prior with

$$
\pi_{t+1} = \frac{\sum_j P(site \ at \ j)}{L}
$$

And update the WM

$$
\left. w_{\alpha}^{k}\right|_{t+1}= \frac{\sum_{j} P(site \ at \ j) \delta(s[j + k - 1], \alpha)}{\sum_j P(site \ at \ j)}
$$
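One iteration of these updates can be sketched for a single sequence. The background is assumed to be a position-independent base distribution, indexing is 0-based, and the starting WM and prior in the example are arbitrary guesses.

```python
BASES = "ACGT"


def em_iteration(seq, wm, prior, background):
    """One MEME-style update: returns (new_wm, new_prior, site posteriors)."""
    width = len(wm)
    positions = range(len(seq) - width + 1)
    posteriors = []
    for j in positions:
        site = prior
        bg = 1 - prior
        for k in range(width):
            site *= wm[k][seq[j + k]]
            bg *= background[seq[j + k]]
        posteriors.append(site / (site + bg))
    total = sum(posteriors)
    new_prior = total / len(seq)
    new_wm = []
    for k in range(width):
        column = {b: 0.0 for b in BASES}
        for j, post in zip(positions, posteriors):
            column[seq[j + k]] += post  # soft count of base at WM position k
        new_wm.append({b: column[b] / total for b in BASES})
    return new_wm, new_prior, posteriors


background = {b: 0.25 for b in BASES}
wm0 = [{"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
       {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}]
new_wm, new_prior, posteriors = em_iteration("GGACGGACTT", wm0, 0.1, background)
print(new_prior, new_wm[0])
```

Iterating this function until the WM stops changing gives the EM fit; a pseudocount could be added to the soft counts to avoid zero entries.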

Regulatory sites can be discovered in two ways:

1. Collect sets of sequences that are thought to contain binding sites for a common regulatory factor. Then search for overrepresented short sequence motifs.
2. Phylogenetic footprinting: Create multiple alignments of orthologous intergenic sequences and identify sequence segments more conserved than "average".

The nucleotides in one column are not independent samples of the WM, but are phylogenetically related. The probability of an alignment column is $P(S | T, w)$. Given the tree $T$ and the limit frequencies $w$, the probability of the bases at the leaves is the product of the transition probabilities along each of the branches, summed over the possible bases at the internal nodes. For a column with the bases g, g, g, a at the leaves, for example,

$$
P(S | T, w) = \sum_{x,y,z} w_x P(y | x, t_1) P(g|x, t_2) P(g|y, t_3) P(z | y, t_4) P(g|z, t_5) P(a|z, t_6)
$$
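The sum over internal nodes can be sketched by brute force. Two things are assumed here and not given in the notes: the tree topology is read off the indices of the formula (root $x$ with child $y$ and one leaf, $y$ with child $z$ and one leaf, $z$ with two leaves), and the transition probability is a simple model in which a base is kept with probability $e^{-t}$ and otherwise redrawn from $w$.

```python
import math
from itertools import product

BASES = "ACGT"


def transition(child, parent, t, w):
    """Assumed model: keep the base with prob. exp(-t), else redraw from w."""
    stay = math.exp(-t)
    return stay * (child == parent) + (1 - stay) * w[child]


def column_prob(leaves, t, w):
    """P(S | T, w) for leaves attached via branches t2, t3, t5, t6."""
    l1, l2, l3, l4 = leaves
    t1, t2, t3, t4, t5, t6 = t
    total = 0.0
    for x, y, z in product(BASES, repeat=3):  # bases at the internal nodes
        total += (w[x]
                  * transition(y, x, t1, w) * transition(l1, x, t2, w)
                  * transition(l2, y, t3, w) * transition(z, y, t4, w)
                  * transition(l3, z, t5, w) * transition(l4, z, t6, w))
    return total


w = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # hypothetical WM column
t = (0.1, 0.2, 0.3, 0.1, 0.2, 0.3)             # hypothetical branch lengths
print(column_prob(("G", "G", "G", "A"), t, w))
```

For larger trees the brute-force sum is replaced by Felsenstein's pruning recursion, which computes the same quantity in time linear in the number of nodes.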

# Ahab scoring model

Rather than assuming that binding sites are uniformly distributed across the sequence, Ahab assumes that different upstream regions have different prior probabilities of containing binding sites. Thus the likelihood of a sequence segment given a WM depends not only on the sequence, but also on the prior probability $\pi$ of a site for the TF in that genomic region

$$
P(s[i..i+\omega - 1] | \vec w) = \pi \prod_{j=1}^{\omega} w_{s[i+j-1]}^j
$$

Ahab determines the set of priors for all $m$ WMs that maximizes the partition function for a given region of the upstream sequence.

$$
Ahab \ score = \frac{Z(1, L| \{\vec w\})}{Z(1, L | \vec w_B)}
$$
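The score can be sketched for fixed priors as a ratio of two partition functions. The maximization over priors that Ahab performs is left out, and the sketch assumes that the background prior is one minus the sum of the TF priors; the WM and sequence are toy examples.

```python
def partition(seq, models, priors):
    """Z(1, L): summed weight of all parses, each element weighted by its prior."""
    Z = [1.0] + [0.0] * len(seq)
    for i in range(1, len(seq) + 1):
        for wm, pi in zip(models, priors):
            w = len(wm)
            if w <= i:
                weight = pi
                for j in range(w):
                    weight *= wm[j][seq[i - w + j]]
                Z[i] += Z[i - w] * weight
    return Z[-1]


def ahab_score(seq, tf_wms, tf_priors, background):
    """Ratio of the full partition function to the background-only one."""
    bg_wm = [background]  # background as a WM of length 1
    bg_prior = 1 - sum(tf_priors)
    full = partition(seq, tf_wms + [bg_wm], tf_priors + [bg_prior])
    null = partition(seq, [bg_wm], [1.0])
    return full / null


def consensus_wm(word, hit=0.97, miss=0.01):
    """Toy WM that strongly prefers the given consensus word."""
    return [{b: (hit if b == c else miss) for b in "ACGT"} for c in word]


background = {b: 0.25 for b in "ACGT"}
tf = consensus_wm("ACGT")
print(ahab_score("TTACGTTT", [tf], [0.1], background))
```

A score above 1 means that allowing sites for the TF explains the region better than background alone.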