# Class : Hidden Markov Models - Viterbi

---

## Before Class
In class today we will be implementing the Viterbi algorithm to identify the most likely path through states given model parameters.
Prior to class, please do the following:
1. Review slides on Hidden Markov models in detail
* Focus on how to conceptually translate the algorithm to code
* Understand what argmax versus max means
* How does one implement a max function? argmax?
* Take a look at what arithmetic underflow is.

---
## Learning Objectives

1. Conceptually understand Hidden Markov Models
* Implement a basic HMM
* Implement the Viterbi algorithm

---
## Background

In the last class we described Markov chains. Here we expand this idea to the concept of a hidden state variable along with observed emissions from the model. We will be using the example of CpG islands from the lecture slides. I have provided the class structure of a simple HMM below. All parameters to this model must be provided as inputs, so essentially this is a class containing the parameters described below:

We define a categorical Hidden Markov Model as $M = (\Sigma, Q, \Theta)$ with the following parameters:

$\Sigma$ : Finite alphabet of symbols (eg. A, C, G, T)

$Q$ : Finite discrete hidden states

$\Theta$: set of probabilities containing: $A$ as transition probabilites $a_{kl}$ for all $k,l \in Q$ and $E$ as emission probabilities $e_k(\sigma)$ for all $k \in Q$ and $\sigma \in \Sigma$ and $B$ as starting probabilities $b_k$ for all $k \in Q$.

We also define a number of $T$ emissions as $y_t = 1 \dots T$ that are drawn from $\Sigma$ and hidden states as $\pi_t = 1 \dots T$ that are drawn from $Q$.

The goal today will be to estimate $\pi^*$, the most probable path through the hidden states $Q$ when a HMM $M$ is provided.

We will be following the definition described in the slides as described below:


---
## Imports

In [4]:
import os
import numpy as np

---
## Viterbi algorithm

To estimate $\pi^*$, the most probable path through the hidden states, we will use the Viterbi algorithm, which is a dynamic programming exercise.

Initialization ($i = 0$): $v_{k}(i) = e_{k}(\sigma)b_{k}$.

Recursion ($i = 1 \dots T$): 
$v_{l}(i) = e_{l}(x_{i})$ max$_{k}(v_{k}(i-1)a_{kl})$;  ptr$_{i}(l) = $ argmax$_{k}(v_{k}(i-1)a_{kl})$.

Termination: $P(x, \pi^{*}) =$ max$_{k}(v_{k}(l)a_{k0})$; $\pi^{*}_{l} = $ argmax$_{k}(v_{k}(l)a_{k0})$.

Traceback: ($i = T\dots1$): $\pi^{*}_{i-1} = $ ptr$_{i}(\pi^{*}_{i})$.

A few implementation notes:
1. Break the code up into each of the above phases of the algorithm!
2. You will probably want to move all of your probabilities into log space so that you don't get underflow errors!

In [58]:
class HMM(object):
    """Main class for HMM objects
    
    Class for holding HMM parameters and to allow for implementation of
    functions associated with HMMs
    
    Private Attributes:
        _alphabet (set): The alphabet of emissions
        _hidden_states (set): Hidden states in the model
        _transitions (dict(dict)): A dictionary of transition probabilities
        _emissions (dict(dict)): A dictionary of emission probabilities
        _initial (dict): A dictionary of initial state probabilities

    """

    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A
        self._emissions = E
        self._initial = B
        
    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]
    
    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]
    
    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k
        
    def viterbi(self, sequence):
        """ The viterbi algorithm for decoding a string using a HMM

        Args:
            sequence (list): a list of valid emissions from the HMM

        Returns:
            result (list): optimal path through HMM given the model parameters
                           using the Viterbi algorithm
        
        Pseudocode for Viterbi:
            Initialization (𝑖=0): 𝑣𝑘(𝑖)=𝑒𝑘(𝜎)𝑏𝑘.
            Recursion (𝑖=1…𝑇): 𝑣𝑙(𝑖)=𝑒𝑙(𝑥𝑖) max𝑘(𝑣𝑘(𝑖−1)𝑎𝑘𝑙); 
                                ptr𝑖(𝑙)= argmax𝑘(𝑣𝑘(𝑖−1)𝑎𝑘𝑙).
            Termination: 𝑃(𝑥,𝜋∗)= max𝑘(𝑣𝑘(𝑙)𝑎𝑘0); 
                             𝜋∗𝑙= argmax𝑘(𝑣𝑘(𝑙)𝑎𝑘0).
            Traceback: (𝑖=𝑇…1): 𝜋∗𝑖−1= ptr𝑖(𝜋∗𝑖).
        """

        # Initialization (𝑖=0): 𝑣𝑘(𝑖)=𝑒𝑘(𝜎)𝑏𝑘.
        
        # Recursion (𝑖=1…𝑇): 𝑣𝑙(𝑖)=𝑒𝑙(𝑥𝑖) max𝑘(𝑣𝑘(𝑖−1)𝑎𝑘𝑙); 
        #                 ptr𝑖(𝑙)= argmax𝑘(𝑣𝑘(𝑖−1)𝑎𝑘𝑙).
            
        # Termination: 𝑃(𝑥,𝜋∗)= max𝑘(𝑣𝑘(𝑙)𝑎𝑘0); 
        #                  𝜋∗𝑙= argmax𝑘(𝑣𝑘(𝑙)𝑎𝑘0).

        # Traceback: (𝑖=𝑇…1): 𝜋∗𝑖−1= ptr𝑖(𝜋∗𝑖).
        
        #init
        states = list( self._states() ) #define deterministic ordering of states (get around `set`)
        trace = [ {s:(np.log(self._init(s)*self._emit(s,sequence[0])), None) for s in states} ] #calc init probs
        
        #recurse
        for b in sequence[1:]: #loop over remainder of sequence
            trace.append( dict() ) #add a new dictionary
            for s in states:
                pos = [ np.log( self._emit(s,b) * self._transition(ls,s) ) + trace[-2][ls][0] for ls in states ] #calc lambda probs
                trace[-1][s] = max( zip(pos,states) ) #optimize lambda probs
                
        #term
        ptr = max( ( trace[-1][s][0], s, trace[-1][s][1] ) for s in states ) #find the max endpt
        toreturn, ptr = [ ptr[1] ], ptr[2] #define iterators for traceback
        
        #traceback
        for level in trace[-1::-1]: #go back up the stack
            toreturn.insert( 0, ptr ) #save old pointer to front of stack
            ptr = level[ptr][1] #define new pointer
            
        return toreturn
        
        

In [59]:
# This section of code will initialize your HMM with parameters as defined in the lecture slides
# for the identification of CpG Islands.
# All of this should be able to run whether or not you implement the Viterbi function!

hidden_states = ('I', 'G') # CpG Island or Genome
alphabet = ('A', 'C', 'G', 'T') # DNA Alphabet

# These are the initial probabilities as defined in the lecture slides
initial_probabilities = {
    'I' : 0.1,
    'G' : 0.9
}

# These are the probabilities of transitioning from outer state to inner state
#  as defined in the lecture slides
transition_probabilities = {
    'I': { 'I' : 0.6, 'G' : 0.4 },
    'G': { 'I' : 0.1, 'G' : 0.9 }
}

# These are the probabilites of each state emmitting each alphabet character
emission_probabilities = {
    'I': { 'A' : 0.1, 'C' : 0.4, 'G' : 0.4, 'T' : 0.1 },
    'G': { 'A' : 0.4, 'C' : 0.1, 'G' : 0.1, 'T' : 0.4 }
}

# Build the model
model = HMM(alphabet, hidden_states, transition_probabilities, emission_probabilities, initial_probabilities)

In [60]:
# Exact example from slides
sequence = "ACGCGATC"
print(sequence)
print (''.join(model.viterbi(list(sequence))))

# A slightly more complex example
sequence = "ACGCGATCATACTATATTAGCTAAATAGATACGCGCGCGCGCGCGATATATATATATAGCTAATGATCGATTACCCCCCCCCCCAATTA"
print(sequence)
print (''.join(model.viterbi(sequence)))

ACGCGATC
GIIIIGGGG
ACGCGATCATACTATATTAGCTAAATAGATACGCGCGCGCGCGCGATATATATATATAGCTAATGATCGATTACCCCCCCCCCCAATTA
GIIIIGGGGGGGGGGGGGGGGGGGGGGGGGGIIIIIIIIIIIIIIGGGGGGGGGGGGGGGGGGGGGGGGGGGGIIIIIIIIIIIGGGGGG


In [55]:
hidden_states = ('Ai', 'Ci', 'Gi', 'Ti', 'Ag', 'Cg', 'Gg', 'Tg')
alphabet = ('A', 'C', 'G', 'T')

initial_probabilities = {
    'Ai' : 0.125,
    'Ci' : 0.125,
    'Gi' : 0.125,
    'Ti' : 0.125,
    'Ag' : 0.125,
    'Cg' : 0.125,
    'Gg' : 0.125,
    'Tg' : 0.125
}

transition_probabilities = {
    'Ai': { 'Ai' : 0.2, 'Ci' : 0.36, 'Gi' : 0.2, 'Ti' : 0.2, 'Ag' : 0.01, 'Cg' : 0.01, 'Gg' : 0.01, 'Tg' : 0.01 },
    'Ci': { 'Ai' : 0.1, 'Ci' : 0.1, 'Gi' : 0.66, 'Ti' : 0.1, 'Ag' : 0.01, 'Cg' : 0.01, 'Gg' : 0.01, 'Tg' : 0.01 },
    'Gi': { 'Ai' : 0.1, 'Ci' : 0.39, 'Gi' : 0.1, 'Ti' : 0.1, 'Ag' : 0.1, 'Cg' : 0.01, 'Gg' : 0.1, 'Tg' : 0.1 },
    'Ti': { 'Ai' : 0.2, 'Ci' : 0.36, 'Gi' : 0.2, 'Ti' : 0.2, 'Ag' : 0.01, 'Cg' : 0.01, 'Gg' : 0.01, 'Tg' : 0.01 },
    'Ag': { 'Ai' : 0.01, 'Ci' : 0.1, 'Gi' : 0.01, 'Ti' : 0.01, 'Ag' : 0.2175, 'Cg' : 0.2175, 'Gg' : 0.2175, 'Tg' : 0.2175 },
    'Cg': { 'Ai' : 0.01, 'Ci' : 0.1, 'Gi' : 0.01, 'Ti' : 0.01, 'Ag' : 0.2175, 'Cg' : 0.2175, 'Gg' : 0.2175, 'Tg' : 0.2175 },
    'Gg': { 'Ai' : 0.01, 'Ci' : 0.1, 'Gi' : 0.01, 'Ti' : 0.01, 'Ag' : 0.2175, 'Cg' : 0.2175, 'Gg' : 0.2175, 'Tg' : 0.2175 },
    'Tg': { 'Ai' : 0.01, 'Ci' : 0.1, 'Gi' : 0.01, 'Ti' : 0.01, 'Ag' : 0.2175, 'Cg' : 0.2175, 'Gg' : 0.2175, 'Tg' : 0.2175 }
}

emission_probabilities = {
    'Ai': { 'A' : 1, 'C' : 0.001, 'G' : 0.001, 'T' : 0.001 },
    'Ci': { 'A' : 0.001, 'C' : 1, 'G' : 0.001, 'T' : 0.001 },
    'Gi': { 'A' : 0.001, 'C' : 0.001, 'G' : 1, 'T' : 0.001 },
    'Ti': { 'A' : 0.001, 'C' : 0.001, 'G' : 0.001, 'T' : 1 },
    'Ag': { 'A' : 1, 'C' : 0.001, 'G' : 0.001, 'T' : 0.001 },
    'Cg': { 'A' : 0.001, 'C' : 1, 'G' : 0.001, 'T' : 0.001 },
    'Gg': { 'A' : 0.001, 'C' : 0.001, 'G' : 1, 'T' : 0.001 },
    'Tg': { 'A' : 0.001, 'C' :0.0010, 'G' : 0.001, 'T' : 1 }
}

model = HMM(alphabet, hidden_states, transition_probabilities, emission_probabilities, initial_probabilities)

In [56]:
sequence = "ACGCGATCATACTATATTAGCTAAATAGATACGCGCGCGCGCGCGATATATATATATAGCTAATGATCGATTACCCCCCCCCCCAATTA"

print(sequence)

result = ''.join(model.viterbi(sequence))
result = result.replace("A", "")
result = result.replace("C", "")
result = result.replace("G", "")
result = result.replace("T", "")
result = result.replace("i", "I")

print(result)

ACGCGATCATACTATATTAGCTAAATAGATACGCGCGCGCGCGCGATATATATATATAGCTAATGATCGATTACCCCCCCCCCCAATTA
IIIIIggggggggggggggggggggggggggIIIIIIIIIIIIIIggggggggggggggggggggggggggggggggggggggggggggg
