In [53]:
import math  # Just ignore this :-)

def log(x):
    if x == 0:
        return float('-inf')
    return math.log(x)

# CTiB E2023 - Week 13 - Exercises

## Theoretical exercises

***Exercise 1***: Consider the simple "weather-HMM" with a transition diagram as shown on slide 3 in the slides *Hidden Markov Models - Training - Selecting model parameter* from the lecture on Nov 27. Assume that we do not know the model parameters i.e. the start-, transition-, and emission-probabilities, but that we are given two pairs of $({\bf X}, {\bf Z})$ as training data. 

These pairs are: (`HHLLLHHHHLLLLHH`, `SSSRRSSSRRRRSSS`) and (`LLHHLLLHHHHLLHH`, `RRRSSRRRSSSRRRS`), where H and L are the two states of the model, and S and R are the two emissions sunshine and rain.  

Use Training-by-Counting to set the model parameters according to this training data.
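
One way to check a hand count is a small sketch like the one below. It assumes that the first string of each pair holds the states (H, L) and the second holds the emissions (S, R), and it uses plain counts without pseudocounts. The helper names `train_by_counting` and `normalise_rows` are made up for this illustration.

```python
from collections import Counter

def normalise_rows(counts):
    """Turn counts keyed by (row, column) into probabilities that sum to 1 per row."""
    totals = Counter()
    for (row, _), count in counts.items():
        totals[row] += count
    return {key: count / totals[key[0]] for key, count in counts.items()}

def train_by_counting(pairs):
    """Estimate start, transition and emission probabilities from (states, emissions) pairs."""
    start, trans, emit = Counter(), Counter(), Counter()
    for states, emissions in pairs:
        start[states[0]] += 1
        for prev, cur in zip(states, states[1:]):
            trans[prev, cur] += 1
        for state, symbol in zip(states, emissions):
            emit[state, symbol] += 1
    # Every pair contributes exactly one start state.
    start_probs = {state: count / len(pairs) for state, count in start.items()}
    return start_probs, normalise_rows(trans), normalise_rows(emit)

pairs = [
    ('HHLLLHHHHLLLLHH', 'SSSRRSSSRRRRSSS'),
    ('LLHHLLLHHHHLLHH', 'RRRSSRRRSSSRRRS'),
]
start_probs, trans_probs, emit_probs = train_by_counting(pairs)
print(start_probs)
print(trans_probs)
print(emit_probs)
```

The dictionaries are keyed by `(from_state, to_state)` and `(state, symbol)`, so for instance `emit_probs[('H', 'S')]` is the estimated probability of sunshine in state H.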
 


![tabel](tabel.png)


***Exercise 2***: Consider the 7-state HMM on slide 26 in the slides *Hidden Markov Models - Selecting the initial model parameters and using HMMs for (simpel) gene finding* from the lecture on Nov 27, which you will use in the practical exercises below. As stated on slide 27, this HMM is also relevant for gene finding, where we say that state 3 emits non-coding symbols, states 2, 1, 0 emit coding triplets (codons) in the left-to-right direction and states 4, 5, 6 emit coding symbols in the reverse (right-to-left) direction. 

If we are given a DNA string, say 

`ACGTATGCTAATCTAAACCTACGGCATGT`

and information about its gene structure using the N, C, R annotation also used in the slides and practical exercises, say   

`NNNNCCCCCCCCCCCCNNRRRRRRRRRNN`    

then we can convert this gene structure into an actual sequence of states, as also explained on slide 30 (for a different model), as

`33332102102102103345645645633`
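
The conversion can be sketched in code as follows. The function name `annotation_to_states` is hypothetical, and the sketch assumes that every run of `C` or `R` starts at the first state of its cycle (2 for `C`, 4 for `R`) and that the cycle restarts whenever the annotation symbol changes.

```python
def annotation_to_states(annotation):
    """Translate an N/C/R annotation into a string of 7-state HMM state numbers."""
    cycles = {'N': '3', 'C': '210', 'R': '456'}
    states = []
    previous = None
    offset = 0  # position within the current run of identical symbols
    for symbol in annotation:
        if symbol != previous:
            offset = 0
        cycle = cycles[symbol]
        states.append(cycle[offset % len(cycle)])
        offset += 1
        previous = symbol
    return ''.join(states)

states = annotation_to_states('NNNNCCCCCCCCCCCCNNRRRRRRRRRNN')
print(states)
```

For the annotation above this reproduces the state sequence `33332102102102103345645645633`. Two genes in the same direction that touch without an `N` in between simply continue the cycle, which is only consistent when gene lengths are multiples of three.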

Use the above DNA string and information about its gene structure to set the model parameters of the 7-state HMM using Training-by-Counting. (You can perhaps use this small example as a test case for your implementation of Training-by-Counting in the practical exercises.)

In [54]:
DNA_string ='ACGTATGCTAATCTAAACCTACGGCATGT' #observable
hidden_path='33332102102102103345645645633' #hidden states

def translate_observations_to_indices(obs):
    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    return [mapping[symbol.lower()] for symbol in obs]

DNA_string_index = translate_observations_to_indices(DNA_string)

k = 7 # number of hidden states
d = 4 # number of observable symbols

#initial_list=[1]*k
#emission_list=[[1]*d]*k
#transition_list=[[1]*k]*k

initial_list = [1, 1, 1, 1, 1, 1, 1]

transition_list = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1]
]

emission_list = [
    #   A     C     G     T
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1]
]

#print(emission_list)

for i in range(len(DNA_string_index)):
    if i == 0:
        initial_list[int(hidden_path[i])] += 1
        emission_list[int(hidden_path[i])][DNA_string_index[i]] += 1
    else:
        #print(int(hidden_path[i]),DNA_string_index[i])
        emission_list[int(hidden_path[i])][DNA_string_index[i]] += 1
        transition_list[int(hidden_path[i-1])][int(hidden_path[i])] +=1

print((emission_list[1]))

for row in range(len(emission_list)):
    # `total` instead of `sum`: a global named `sum` would shadow the built-in used later
    total = 0
    for i in range(len(emission_list[0])):
        total += emission_list[row][i]
    for col in range(len(emission_list[0])):
        new_value = emission_list[row][col]/total
        emission_list[row][col] = new_value

for row in range(len(transition_list)):
    total = 0
    for i in range(len(transition_list[0])):
        total += transition_list[row][i]
    for col in range(len(transition_list[0])):
        new_value = transition_list[row][col]/total
        transition_list[row][col] = new_value

total = 0
for i in range(len(initial_list)):
    total += initial_list[i]
for row in range(len(initial_list)):
    new_value = initial_list[row]/total
    initial_list[row] = new_value


print(emission_list)
print(transition_list)
print(initial_list)


[2, 1, 1, 4]
[[0.375, 0.25, 0.25, 0.125], [0.25, 0.125, 0.125, 0.5], [0.375, 0.25, 0.125, 0.25], [0.25, 0.25, 0.25, 0.25], [0.14285714285714285, 0.5714285714285714, 0.14285714285714285, 0.14285714285714285], [0.2857142857142857, 0.14285714285714285, 0.2857142857142857, 0.2857142857142857], [0.2857142857142857, 0.14285714285714285, 0.2857142857142857, 0.2857142857142857]]
[[0.09090909090909091, 0.09090909090909091, 0.36363636363636365, 0.18181818181818182, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091], [0.45454545454545453, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091], [0.09090909090909091, 0.45454545454545453, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091, 0.09090909090909091], [0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.42857142857142855, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142], [0.1, 0.1, 0.1, 0.1, 0.1, 0.4

# Practical exercises

In the exercise below, you will implement and experiment with various ways of training an HMM (i.e. deciding parameters from data), and you will see an example of how to apply an HMM for identifying coding regions (genes) in genetic material.

# 1 - Training

You are given the same 7-state HMM and helper functions that you used last week:

In [55]:
class hmm:
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.init_probs = init_probs
        self.trans_probs = trans_probs
        self.emission_probs = emission_probs

In [56]:
init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]

trans_probs_7_state = [
    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],
]

emission_probs_7_state = [
    #   A     C     G     T
    [0.30, 0.25, 0.25, 0.20],
    [0.20, 0.35, 0.15, 0.30],
    [0.40, 0.15, 0.20, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.30, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.15, 0.30, 0.20, 0.35],
]

hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)

In [57]:
def translate_observations_to_indices(obs):
    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    return [mapping[symbol.lower()] for symbol in obs]

def translate_indices_to_observations(indices):
    mapping = ['a', 'c', 'g', 't']
    return ''.join(mapping[idx] for idx in indices)

def translate_path_to_indices(path):
    return list(map(lambda x: int(x), path))

def translate_indices_to_path(indices):
    return ''.join([str(i) for i in indices])

In [58]:
def make_table(m, n):
    """Make a table with `m` rows and `n` columns filled with zeros."""
    return [[0] * n for _ in range(m)]

## Training by counting

Training a hidden Markov model is a matter of estimating the initial, transition and emission probabilities. If we are given training data, i.e. a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, we can do "training by counting" by counting the number of observed transitions and emissions in the training data as explained in the lecture.

Given ${\bf X}$ and ${\bf Z}$ we would like to count the number of transitions from one state to another, and the number of times that symbol $k$ was observed while being in state $i$.  That is, we want to construct a $K \times K$ matrix such that entry $i, j$ is the number of times that a transition from state $i$ to state $j$ is observed in the training data, and a $K \times D$ matrix where entry $i, k$ contains the number of times that symbol $k$ is observed in the training data while being in state $i$.
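
Written out, and assuming the usual maximum-likelihood form of training by counting, the probabilities follow from the two count matrices by normalising each row. The symbols $c^{A}_{ij}$ (transition counts), $c^{\phi}_{ik}$ (emission counts) and the optional pseudocount $\alpha \geq 0$ are names chosen for this note and are not taken from the lecture slides; $\alpha = 0$ gives plain counting.

$$
\hat{A}_{ij} = \frac{c^{A}_{ij} + \alpha}{\sum_{j'=1}^{K} \left( c^{A}_{ij'} + \alpha \right)},
\qquad
\hat{\phi}_{ik} = \frac{c^{\phi}_{ik} + \alpha}{\sum_{k'=1}^{D} \left( c^{\phi}_{ik'} + \alpha \right)}
$$

With $\alpha = 0$ a row that is never observed has a zero denominator, which is one reason to consider a small pseudocount.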

Implement this as the below function:

In [59]:
def count_transitions_and_emissions(K, D, x, z):
    """
    Returns a list of K initial-state counts, a KxK matrix of transition counts
    and a KxD matrix of emission counts. Every count starts at 1 (a pseudocount).
    """
    transition_list = []
    emission_list = []
    initial_list = []

    x = translate_observations_to_indices(x)
    z = translate_path_to_indices(z)

    for k in range(K):
        initial_list.append(1)
    
    for k1 in range(K):
        transition_list.append([])
        for k2 in range(K):
            transition_list[k1].append(1)

    for k in range(K):
        emission_list.append([])
        for d in range(D):
            emission_list[k].append(1)

    for i in range(len(x)):
        if i == 0:
            emission_list[z[i]][x[i]] += 1
            initial_list[z[i]] += 1
        else:
            emission_list[z[i]][x[i]] += 1
            transition_list[z[i-1]][z[i]] +=1

    return initial_list, transition_list, emission_list

In [60]:
x_long = 'TGAGTATCACTTAGGTCTATGTCTAGTCGTCTTTCGTAATGTTTGGTCTTGTCACCAGTTATCCTATGGCGCTCCGAGTCTGGTTCTCGAAATAAGCATCCCCGCCCAAGTCATGCACCCGTTTGTGTTCTTCGCCGACTTGAGCGACTTAATGAGGATGCCACTCGTCACCATCTTGAACATGCCACCAACGAGGTTGCCGCCGTCCATTATAACTACAACCTAGACAATTTTCGCTTTAGGTCCATTCACTAGGCCGAAATCCGCTGGAGTAAGCACAAAGCTCGTATAGGCAAAACCGACTCCATGAGTCTGCCTCCCGACCATTCCCATCAAAATACGCTATCAATACTAAAAAAATGACGGTTCAGCCTCACCCGGATGCTCGAGACAGCACACGGACATGATAGCGAACGTGACCAGTGTAGTGGCCCAGGGGAACCGCCGCGCCATTTTGTTCATGGCCCCGCTGCCGAATATTTCGATCCCAGCTAGAGTAATGACCTGTAGCTTAAACCCACTTTTGGCCCAAACTAGAGCAACAATCGGAATGGCTGAAGTGAATGCCGGCATGCCCTCAGCTCTAAGCGCCTCGATCGCAGTAATGACCGTCTTAACATTAGCTCTCAACGCTATGCAGTGGCTTTGGTGTCGCTTACTACCAGTTCCGAACGTCTCGGGGGTCTTGATGCAGCGCACCACGATGCCAAGCCACGCTGAATCGGGCAGCCAGCAGGATCGTTACAGTCGAGCCCACGGCAATGCGAGCCGTCACGTTGCCGAATATGCACTGCGGGACTACGGACGCAGGGCCGCCAACCATCTGGTTGACGATAGCCAAACACGGTCCAGAGGTGCCCCATCTCGGTTATTTGGATCGTAATTTTTGTGAAGAACACTGCAAACGCAAGTGGCTTTCCAGACTTTACGACTATGTGCCATCATTTAAGGCTACGACCCGGCTTTTAAGACCCCCACCACTAAATAGAGGTACATCTGA'
z_long = '3333321021021021021021021021021021021021021021021021021021021021021021033333333334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210321021021021021021021021033334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563333333456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456332102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102103210210210210210210210210210210210210210210210210210210210210210'

Test your implementation of `count_transitions_and_emissions` on (prefixes of) `x_long` and `z_long` above in order to conclude that your implementation works as expected.

In [61]:
print(count_transitions_and_emissions(7, 4, x_long, z_long))

([1, 1, 1, 2, 1, 1, 1], [[1, 1, 122, 5, 1, 1, 1], [127, 1, 1, 1, 1, 1, 1], [1, 127, 1, 1, 1, 1, 1], [1, 1, 6, 24, 4, 1, 1], [1, 1, 1, 1, 1, 198, 1], [1, 1, 1, 1, 1, 1, 198], [1, 1, 1, 4, 195, 1, 1]], [[27, 40, 29, 34], [27, 44, 17, 42], [57, 16, 28, 29], [6, 9, 8, 12], [43, 70, 66, 22], [69, 44, 56, 32], [27, 70, 33, 71]])


Use your implementation of `count_transitions_and_emissions` to implement a function `training_by_counting` that given the number of hidden states, $K$, the number of observables, $D$, a sequence of observations, ${\bf X}$, and a corresponding sequence of hidden states, ${\bf Z}$, returns an HMM (as an instance of `class hmm`), where the transition, emission, and initial probabilities are set cf. training by counting on ${\bf X}$ and ${\bf Z}$.

In [62]:
def training_by_counting(K, D, x, z):
    """
    Returns a HMM trained on x and z cf. training-by-counting.
    """
    initial_list, transition_list, emission_list = count_transitions_and_emissions(K, D, x, z)
    

    # Normalise every row of emission counts so that it sums to 1
    for row in range(len(emission_list)):
        total = 0
        for i in range(len(emission_list[0])):
            total += emission_list[row][i]
        for col in range(len(emission_list[0])):
            emission_list[row][col] = emission_list[row][col]/total

    # Normalise every row of transition counts so that it sums to 1
    for row in range(len(transition_list)):
        total = 0
        for i in range(len(transition_list[0])):
            total += transition_list[row][i]
        for col in range(len(transition_list[0])):
            transition_list[row][col] = transition_list[row][col]/total

    # Normalise the initial counts
    total = 0
    for i in range(len(initial_list)):
        total += initial_list[i]
    for row in range(len(initial_list)):
        initial_list[row] = initial_list[row]/total

    model = hmm(initial_list, transition_list, emission_list)
    return model

Consider an HMM trained on `x_long` and `z_long`:

In [66]:
hmm_7_state_tbc = training_by_counting(7, 4, x_long, z_long)

print(hmm_7_state_tbc)

<__main__.hmm object at 0x0000021368D43910>


How does this HMM (i.e. its transition, emission, and initial probabilities) compare to `hmm_7_state` as specified above?
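
One possible way to compare two models is to look at the largest absolute difference between corresponding parameters. The sketch below is self-contained and uses two small made-up tables; the helper name `max_abs_diff` is hypothetical. In the notebook it could be applied to, for example, `hmm_7_state.trans_probs` and `hmm_7_state_tbc.trans_probs`.

```python
def max_abs_diff(a, b):
    """Largest absolute difference between two equally shaped (nested) lists of numbers."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    return max(max_abs_diff(item_a, item_b) for item_a, item_b in zip(a, b))

# Made-up 2x2 transition tables, for illustration only.
table_a = [[0.9, 0.1], [0.2, 0.8]]
table_b = [[0.7, 0.3], [0.25, 0.75]]
print(max_abs_diff(table_a, table_b))
```

Printing the parameter lists directly (e.g. `print(hmm_7_state_tbc.trans_probs)`) is also more informative than printing the `hmm` object itself, which only shows its memory address.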

You can e.g. try to perform a Viterbi decoding of `x_long` using the two HMMs and investigate if the decodings differ:

In [68]:
# Your implementation of Viterbi (log transformed) from last week

def compute_w_log(model, x):
    k = len(model.init_probs)
    n = len(x)
    x = translate_observations_to_indices(x)
    
    w = make_table(k, n)
    
    # Base case: fill out w[i][0] for i = 0..k-1
    for j in range(k):
        ip = model.init_probs[j]
        ep = model.emission_probs[j][x[0]]
        w[j][0] = log(ip) + log(ep)
    
    # Inductive case: fill out w[j][i] for i = 1..n-1, j = 0..k-1
    for i in range(1, n):
        for j in range(k):
            max_prob = float('-inf')
            for l in range(k):
                tp = model.trans_probs[l][j]
                ep = model.emission_probs[j][x[i]]
                prob = w[l][i-1] + log(tp) + log(ep)
                if prob > max_prob:
                    max_prob = prob
            w[j][i] = max_prob
    return w  

def opt_path_prob_log(w):
    max_prob = float('-inf')
    
    for j in range(len(w)):
        prob = w[j][-1]
        if prob > max_prob:
            max_prob = prob
    return max_prob

def backtrack_log(model, x, w):
    x = translate_observations_to_indices(x)
    
    max_prob = float('-inf')
    max_prob_position = None 
    for j in range(len(w)):
        prob = w[j][-1]
        if prob > max_prob:
            max_prob = prob
            max_prob_position = j
    path = str(max_prob_position)

    pre_prob = max_prob
    pre_prob_position = max_prob_position
    
    for i in range(-2, -len(x)-1, -1):
        for j in range(len(w)):
            tp = model.trans_probs[j][pre_prob_position]
            ep = model.emission_probs[pre_prob_position][x[i+1]]
            prob = w[j][i] + log(tp) + log(ep)
            if math.isclose(prob, pre_prob):
                pre_prob = w[j][i]
                pre_prob_position = j
                path = str(j) + path
                break  # stop at the first match so exactly one state is added per position

    return path

In [70]:
w = compute_w_log(hmm_7_state, x_long)
z_vit = backtrack_log(hmm_7_state, x_long, w)

w_tbc = compute_w_log(hmm_7_state_tbc, x_long)
z_vit_tbc = backtrack_log(hmm_7_state_tbc, x_long, w_tbc)

print(z_vit)
print(z_vit_tbc)

# Your comparison of z_vit and z_vit_tbc here ...

3333321021021021021021021021021021021021021021021021021021021021021021033333333334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210210321021021021021021021021033334564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564563333333456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456332102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102103210210210210210210210210210210210210210210210210210210210210210
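
A simple comparison, sketched here under the assumption that both decodings are strings of equal length, is to list the positions where they disagree. The function name `differing_positions` and the short example paths are made up for illustration.

```python
def differing_positions(path_a, path_b):
    """Return the positions at which two equally long state paths differ."""
    if len(path_a) != len(path_b):
        raise ValueError('paths must have the same length')
    return [i for i, (a, b) in enumerate(zip(path_a, path_b)) if a != b]

# Short made-up paths; in the notebook one would pass z_vit and z_vit_tbc.
diff = differing_positions('3332103', '3321033')
print(len(diff), diff)
```

An empty list means that the two models produce the same Viterbi decoding.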

# 2 - Using a HMM for Gene Finding

Below we will investigate how to use a hidden Markov model for gene finding in prokaryotes.

You are given a data set containing 2 Staphylococcus genomes, each containing several genes (i.e. substrings) obeying the "gene syntax" explained in class. The genomes are between 1.8 million and 2.8 million nucleotides.

The genomes and their annotations are given in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format).

In [71]:
def read_fasta_file(filename):
    """
    Reads the given FASTA file f and returns a dictionary of sequences.

    Lines starting with ';' in the FASTA file are ignored.
    """
    sequences_lines = {}
    current_sequence_lines = None
    with open(filename) as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(';') or not line:
                continue
            if line.startswith('>'):
                sequence_name = line.lstrip('>')
                current_sequence_lines = []
                sequences_lines[sequence_name] = current_sequence_lines
            else:
                if current_sequence_lines is not None:
                    current_sequence_lines.append(line)
    sequences = {}
    for name, lines in sequences_lines.items():
        sequences[name] = ''.join(lines)
    return sequences

You can use the function like this (note that reading the entire genome will take some time):

In [82]:
g1 = read_fasta_file('genome1.fa')
g1['genome1'][:50]

g2 = read_fasta_file('genome2.fa')

z1 = read_fasta_file('true-ann1.fa')
z2 = read_fasta_file('true-ann2.fa')

The data is:

* The files [genome1.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/genome1.fa) and  [genome2.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/genome2.fa) contain the 2 genomes.
* The files [true-ann1.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/true-ann1.fa) and [true-ann2.fa](http://users-birc.au.dk/cstorm/courses/ML_e22/exercises/true-ann2.fa) contain the annotation of the two genomes with the true gene structure. The annotation is given in FASTA format as a sequence over the symbols `N`, `C`, and `R`. The symbol `N`, `C`, or `R` at position $i$ in `true-annk.fa` gives the "state" of the nucleotide at position $i$ in `genomek.fa`. `N` means that the nucleotide is non-coding. `C` means that the nucleotide is coding and part of a gene in the direction from left to right. `R` means that the nucleotide is coding and part of a gene in the reverse direction from right to left.

The annotation files can also be read with `read_fasta_file`.

You are given the same 7-state HMM that you used above and similar helper functions:

In [73]:
class hmm:
    def __init__(self, init_probs, trans_probs, emission_probs):
        self.init_probs = init_probs
        self.trans_probs = trans_probs
        self.emission_probs = emission_probs

In [74]:
init_probs_7_state = [0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00]

trans_probs_7_state = [
    [0.00, 0.00, 0.90, 0.10, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.90, 0.05, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.00, 0.00, 0.00, 0.10, 0.90, 0.00, 0.00],
]

emission_probs_7_state = [
    #   A     C     G     T
    [0.30, 0.25, 0.25, 0.20],
    [0.20, 0.35, 0.15, 0.30],
    [0.40, 0.15, 0.20, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.40, 0.30, 0.10],
    [0.30, 0.20, 0.30, 0.20],
    [0.15, 0.30, 0.20, 0.35],
]

hmm_7_state = hmm(init_probs_7_state, trans_probs_7_state, emission_probs_7_state)

Notice that this time the function `translate_indices_to_path` is a bit different. In the given model the states 0, 1, 2 represent coding (C), state 3 represents non-coding (N) and states 4, 5, 6 represent reverse-coding (R) as explained in class.

In [75]:
def translate_indices_to_path(indices):
    mapping = ['C', 'C', 'C', 'N', 'R', 'R', 'R']
    return ''.join([mapping[i] for i in indices])

def translate_observations_to_indices(obs):
    mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    return [mapping[symbol.lower()] for symbol in obs]

def translate_indices_to_observations(indices):
    mapping = ['a', 'c', 'g', 't']
    return ''.join(mapping[idx] for idx in indices)

In [76]:
def make_table(m, n):
    """Make a table with `m` rows and `n` columns filled with zeros."""
    return [[0] * n for _ in range(m)]

Now insert your Viterbi implementation (log transformed) in the cell below, this means that you should copy `compute_w_log`, `opt_path_prob_log`, `backtrack_log` and any other functions you may have defined yourself for your Viterbi implementation.

In [79]:
def viterbi_training(K, D, x, z, I):
    model = training_by_counting(K, D, x, z)
    
    for i in range(I):
        w = compute_w_log(model, x)
        z_vit = backtrack_log(model, x, w)
        model = training_by_counting(K, D, x, z_vit)
    
    return z_vit

print(viterbi_training(7, 4, x_long, z_long, 20))

1021021021021021021021021021021021021021021021021021021021021021021064564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564521021021021021021021021021021021064564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564564556456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456456452102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102102100210210210210210210210210210210210210210210210210210210210210210

## Finding genes in a genome

In the cell below, use your Viterbi implementation to compute an annotation for genome 1 and 2. Save the annotation in a variable (remember to translate the indices to a path using `translate_indices_to_path`). Feel free to define a function that wraps `compute_w_log` and `backtrack_log` so that you don't have to call both functions each time you want an annotation for a sequence.

In [83]:
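# Decode genome 1 with the given model and translate the state path to an N/C/R annotation
w1 = compute_w_log(hmm_7_state, g1['genome1'])
pred_ann1 = translate_indices_to_path(translate_path_to_indices(backtrack_log(hmm_7_state, g1['genome1'], w1)))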

## Comparing annotations

We will now compare the predicted annotations to the true annotations. Read the true annotations (`true-ann1.fa` and `true-ann2.fa`) and use the `compute_accuracy` function given below to compare the predicted annotation to the true annotation by computing the accuracy. Note that there are other ways to measure the quality of a predicted annotation against the true annotation, e.g. the ACC as shown in the lectures.

In [None]:
def compute_accuracy(true_ann, pred_ann):
    if len(true_ann) != len(pred_ann):
        return 0.0
    return sum(1 if true_ann[i] == pred_ann[i] else 0 
               for i in range(len(true_ann))) / len(true_ann)
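
Beyond a single accuracy number, it can help to see which symbols get confused with which. The following sketch tallies (true, predicted) pairs. The function name `confusion_counts` and the toy annotations are assumptions for illustration, and this is not the ACC measure from the lectures.

```python
from collections import Counter

def confusion_counts(true_ann, pred_ann):
    """Count how often each true symbol (N, C, R) is predicted as each symbol."""
    if len(true_ann) != len(pred_ann):
        raise ValueError('annotations must have the same length')
    return Counter(zip(true_ann, pred_ann))

# Toy annotations, for illustration only.
counts = confusion_counts('NNCCCRRRN', 'NCCCCRRNN')
for (true_symbol, pred_symbol), count in sorted(counts.items()):
    print(true_symbol, '->', pred_symbol, count)
```

Pairs where both symbols agree are correct predictions; all other pairs show the kind of mistake, e.g. a non-coding position predicted as coding.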

In [2]:
# Your code to read the annotations and compute the accuracies of your predictions...

## Training a model

Above, we used the stock `hmm_7_state` for prediction. In a real application, one would train the HMM on genomes with known gene structure in order to make a model that reflects reality. 

Make an HMM `hmm_7_state_genome1` that has a transition diagram similar to `hmm_7_state`, but where the transition, emission, and initial probabilities are set by training by counting on `genome1.fa` and its corresponding true gene structure as given in `true-ann1.fa`.

You should be able to use your implementation of training by counting as done above, but you must translate the annotation in `true-ann1.fa` into a proper sequence of hidden states, i.e. the annotation `NCCCNRRRN` would correspond to `321034563`.

Use the trained HMM `hmm_7_state_genome1` to predict the gene structure of genome 2, and compare the predicted annotation to the true annotation (`true-ann2.fa`). Is the accuracy better than your prediction on genome 2 using `hmm_7_state`?

Implement training by counting in the cell below. We'll use it to train a new model for predicting genes. Feel free to define any helper functions you find useful.

In [1]:
# Your code to get hmm_7_state_genome1 using training by counting, predict an annotation of genome2, and compare the prediction to true-ann2.fa

Redo the above, where you train on genome 2 and predict on genome 1, i.e. make model `hmm_7_state_genome2` using training by counting on `true-ann2.fa`, predict the gene structure of `genome1.fa` and compare your prediction against `true-ann1.fa`.

In [None]:
# Your code to get hmm_7_state_genome2 using training by counting, predict an annotation of genome1, and compare the prediction to true-ann1.fa

If you have time you can redo the above for other training methods, e.g. Viterbi training that you also considered above. I.e. train a model `hmm_7_state_genome1_vit` using Viterbi training on `genome1.fa`, and use it to predict a gene structure for genome 2.

You can also experiment with other HMMs that allow for a more precise modelling of gene structure as explained in class, e.g. the model with 31 states that models start- and stop-codons. What is the best accuracy that you can obtain?