#  6) Hidden Markov Models - Decoding

* Hidden Markov Models
* 3 Problems for HMM's
    * Compute P(X|M)
    * __Decoding__
    * __Training__  
* Decoding
    * __Viterbi__ 
    * __Posterior__ 


### Hidden Markov Models

In many machine learning settings we assume that the data is __independent and identically distributed__. This is not always a reasonable assumption. For example, with __sequential data__ such as weather observations (rainy, cloudy, sunny, etc.), the probability of seeing rain on a given day is affected by the weather observed the day before (and possibly further back). This leads us to consider __Markov models__, in which the probability of an observation is independent of all but the most recent observations. If the probability is affected only by the previous observation, we call it a first-order Markov model, and in general we can have __i'th-order Markov models__ where:

$$ p(x_N \rvert x_1,...,x_{N-1}) = p(x_N \rvert x_{N-i},..., x_{N-1})$$

If the observations are affected by latent (or hidden) discrete variables, and these hidden variables form a Markov chain, then we have a __hidden Markov model (hmm)__. In the weather example, the weather on a given day is not actually determined by the outcome of the previous days, but rather by hidden factors such as low/high pressure areas.

A k-state hmm can be represented by a 3-tuple of matrices $(\pi,A,\theta)$:

-  $\pi$: $k \times 1$ matrix of initial state probabilities
-  $A$: $k \times k$ matrix of transition probabilities
-  $\theta$: $k \times |\Sigma|$ matrix of emission probabilities

Where $\Sigma$ is the "emission alphabet". For the weather example it contains "sun", "rain", "cloudy" etc.  

__An hmm generates a sequence of observables by jumping from state to state according to $A$, each time emitting an observable according to $\theta$.__
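The generative process can be sketched in a few lines of Python. This is only an illustration: the 2-state model below (0 = low pressure, 1 = high pressure, emitting "sun", "cloudy" or "rain") uses made-up numbers, and `sample_hmm` is a hypothetical helper name, not something defined elsewhere in these notes.

```python
import random

# Hypothetical 2-state weather hmm; all numbers are made up for illustration.
# States: 0 = low pressure, 1 = high pressure
# Emission alphabet: 0 = "sun", 1 = "cloudy", 2 = "rain"
pi = [0.6, 0.4]                 # initial state probabilities
A = [[0.7, 0.3],                # A[i][j] = P(next state j | current state i)
     [0.4, 0.6]]
theta = [[0.1, 0.4, 0.5],       # theta[j][s] = P(emit symbol s | state j)
         [0.6, 0.3, 0.1]]

def sample_hmm(pi, A, theta, N, seed=None):
    """Generate a hidden state sequence Z and an observation sequence X of length N."""
    rng = random.Random(seed)
    states = list(range(len(pi)))
    symbols = list(range(len(theta[0])))
    Z, X = [], []
    z = rng.choices(states, weights=pi)[0]                    # start state drawn from pi
    for _ in range(N):
        Z.append(z)
        X.append(rng.choices(symbols, weights=theta[z])[0])   # emit according to theta
        z = rng.choices(states, weights=A[z])[0]              # jump according to A
    return Z, X

Z, X = sample_hmm(pi, A, theta, N=10, seed=0)
print("Z:", Z)
print("X:", X)
```

Only `X` would be visible to an observer; `Z` is the hidden part that decoding tries to recover.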

An example observation sequence X, with the corresponding underlying state sequence Z, can be seen below for the weather example:

<img src="imgs\hmm.png" alt="Drawing" style="width: 300px;"/>


### 3 Problems for hmm's

For hmm's there are 3 basic problems which must be solved for them to be useful.

#### 1) Compute P(X|M) 

The first problem is to compute $P(X\rvert M)$, that is, the probability of seeing some sequence of observables $X$ given an hmm $M$. This can be done by summing over all possible state sequences:

$$P(X\rvert M) = \sum_{Z} P(X,Z\rvert M)$$

The probability of seeing an observation sequence is the sum of the joint probabilities over all possible underlying state sequences (different state sequences can produce the same sequence of observables). There are $k^N$ possible state sequences, where $k$ is the number of states in the hmm and $N$ is the length of the sequence, so computing this sum directly is infeasible. Instead it can be computed in $O(k^2N)$ time as a by-product of posterior decoding, by summing the last column of the $\alpha$-table.
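As a rough sketch of the difference between the two approaches, the code below computes $P(X\rvert M)$ both by enumerating every state sequence and by filling an $\alpha$-table one column at a time. The notes do not spell out the $\alpha$ recursion, so the usual forward recursion is assumed here, with $\alpha(z_n) = P(x_1,...,x_n, z_n)$. The model numbers and function names are hypothetical.

```python
from itertools import product

# Hypothetical 2-state weather hmm; all numbers are made up for illustration.
# States: 0 = low pressure, 1 = high pressure; symbols: 0 = sun, 1 = cloudy, 2 = rain
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
theta = [[0.1, 0.4, 0.5],
         [0.6, 0.3, 0.1]]

def forward_likelihood(pi, A, theta, X):
    """Return P(X | M) by computing the alpha-table column by column, O(k^2 N)."""
    k = len(pi)
    # First column: alpha[j] = P(x_1, z_1 = j)
    alpha = [pi[j] * theta[j][X[0]] for j in range(k)]
    for x in X[1:]:
        # Next column: sum over all predecessors i, then emit x from state j
        alpha = [theta[j][x] * sum(alpha[i] * A[i][j] for i in range(k))
                 for j in range(k)]
    return sum(alpha)  # sum of the last column

def brute_force_likelihood(pi, A, theta, X):
    """Sum the joint probability over all k^N state sequences (tiny N only)."""
    k, total = len(pi), 0.0
    for Z in product(range(k), repeat=len(X)):
        p = pi[Z[0]] * theta[Z[0]][X[0]]
        for n in range(1, len(X)):
            p *= A[Z[n - 1]][Z[n]] * theta[Z[n]][X[n]]
        total += p
    return total

X = [0, 1, 2]  # "sun", "cloudy", "rain"
p_forward = forward_likelihood(pi, A, theta, X)
p_brute = brute_force_likelihood(pi, A, theta, X)
print(p_forward, p_brute)  # the two values agree up to rounding
```

This version works with raw probabilities, so for long sequences it would underflow; a practical implementation would scale each column or work in log space.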


##### Joint probability P(X,Z | M)

The formula for computing the joint probability looks as follows:

$$ P(X,Z \rvert M) = P(z_1 \rvert \pi) \bigg[\prod_{n=2}^N P(z_n \rvert z_{n-1}, A)\bigg] \prod_{n=1}^N P(x_n \rvert z_n, \theta)$$

-  $P(z_1 \rvert \pi)$  is the probability of starting in state $z_1$
-  $\prod_{n=2}^N P(z_n \rvert z_{n-1}, A)$ is the probability of going through the state sequence Z.
-  $\prod_{n=1}^N P(x_n \rvert z_n, \theta)$ is the emission probabilities for this particular state sequence.

N is often a very large number, so the joint probability can become _very_ small. To avoid numerical underflow one can compute $\log P(X,Z \rvert M)$ instead, which turns the products into sums.
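A minimal sketch of the log-space computation, assuming every probability along the path is non-zero (otherwise `math.log` raises an error). The model numbers and the name `log_joint` are hypothetical.

```python
import math

# Hypothetical 2-state weather hmm; all numbers are made up for illustration.
# States: 0 = low pressure, 1 = high pressure; symbols: 0 = sun, 1 = cloudy, 2 = rain
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
theta = [[0.1, 0.4, 0.5],
         [0.6, 0.3, 0.1]]

def log_joint(pi, A, theta, X, Z):
    """Return log P(X, Z | M); assumes all probabilities on the path are > 0."""
    # Initial state and first emission
    logp = math.log(pi[Z[0]]) + math.log(theta[Z[0]][X[0]])
    # One transition term and one emission term per remaining step
    for n in range(1, len(X)):
        logp += math.log(A[Z[n - 1]][Z[n]]) + math.log(theta[Z[n]][X[n]])
    return logp

short_logp = log_joint(pi, A, theta, X=[0, 1, 2], Z=[1, 0, 0])
print(math.exp(short_logp))   # small example: safe to convert back

# For a long sequence the plain product would underflow to 0.0,
# while the log value stays finite.
long_logp = log_joint(pi, A, theta, X=[0] * 2000, Z=[1] * 2000)
print(long_logp)
```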

#### 2) Decoding

The second basic problem for hmm's is __decoding.__ Here we are interested in finding the most likely hidden state sequence which produced a given observation sequence (__Viterbi__), or finding the most likely state to be in at a given point in the observation sequence (__Posterior__).

##### Viterbi decoding

In Viterbi decoding we wish to find the following:

$$Z^\ast = arg \underset{Z}{\operatorname{max}} P(X,Z \rvert M) $$

That is, the state sequence Z that maximizes the joint probability $P(X,Z \rvert M)$. We do this in 3 steps:

-  Compute the $\omega$-table
-  Pick the row with the largest value in the last column
-  Backtrack to obtain optimal path
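The three steps might look as follows in Python. This is a sketch under a few assumptions: the table is kept in log space to avoid underflow, all probabilities are assumed non-zero, ties are broken in favour of the lowest state index, and the model numbers and the name `viterbi` are hypothetical.

```python
import math

# Hypothetical 2-state weather hmm; all numbers are made up for illustration.
# States: 0 = low pressure, 1 = high pressure; symbols: 0 = sun, 1 = cloudy, 2 = rain
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
theta = [[0.1, 0.4, 0.5],
         [0.6, 0.3, 0.1]]

def viterbi(pi, A, theta, X):
    """Return (most likely state sequence, its log joint probability)."""
    k, N = len(pi), len(X)
    # Step 1: fill the k x N omega-table (in log space) and, for each entry,
    # remember which predecessor state achieved the maximum.
    omega = [[-math.inf] * N for _ in range(k)]
    back = [[0] * N for _ in range(k)]
    for j in range(k):
        omega[j][0] = math.log(pi[j]) + math.log(theta[j][X[0]])
    for n in range(1, N):
        for j in range(k):
            best_i = max(range(k),
                         key=lambda i: omega[i][n - 1] + math.log(A[i][j]))
            back[j][n] = best_i
            omega[j][n] = (omega[best_i][n - 1] + math.log(A[best_i][j])
                           + math.log(theta[j][X[n]]))
    # Step 2: pick the row with the largest value in the last column.
    z = max(range(k), key=lambda j: omega[j][N - 1])
    best_logp = omega[z][N - 1]
    # Step 3: backtrack through the stored predecessors.
    path = [z]
    for n in range(N - 1, 0, -1):
        z = back[z][n]
        path.append(z)
    path.reverse()
    return path, best_logp

path, best_logp = viterbi(pi, A, theta, X=[0, 1, 2])  # "sun", "cloudy", "rain"
print(path, math.exp(best_logp))
```

For this toy model the decoded path is high, low, low pressure, which matches what one would guess from sun followed by clouds and rain.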

##### Posterior  decoding

In Posterior decoding we wish to find the following:

$$z^\ast_n = arg \underset{z_n}{\operatorname{max}} P(z_n  \rvert x_1,...,x_N) $$

That is, the most likely state to be in at the n'th step given an observation sequence.
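One common way to obtain these posteriors is the forward-backward procedure, sketched below. The notes only mention the $\alpha$-table; the $\beta$-table used here, with $\beta(z_n) = P(x_{n+1},...,x_N \rvert z_n)$, is an assumption taken from the standard textbook formulation, as are the model numbers and the name `posterior_decode`. The sketch uses raw probabilities, so it is only suitable for short sequences.

```python
# Hypothetical 2-state weather hmm; all numbers are made up for illustration.
# States: 0 = low pressure, 1 = high pressure; symbols: 0 = sun, 1 = cloudy, 2 = rain
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
theta = [[0.1, 0.4, 0.5],
         [0.6, 0.3, 0.1]]

def posterior_decode(pi, A, theta, X):
    """Return (most likely state at each step, table of P(z_n | x_1..x_N))."""
    k, N = len(pi), len(X)
    # Forward pass: alpha[n][j] = P(x_1..x_n, z_n = j)
    alpha = [[0.0] * k for _ in range(N)]
    for j in range(k):
        alpha[0][j] = pi[j] * theta[j][X[0]]
    for n in range(1, N):
        for j in range(k):
            alpha[n][j] = theta[j][X[n]] * sum(alpha[n - 1][i] * A[i][j]
                                               for i in range(k))
    # Backward pass: beta[n][i] = P(x_{n+1}..x_N | z_n = i), last column is 1
    beta = [[1.0] * k for _ in range(N)]
    for n in range(N - 2, -1, -1):
        for i in range(k):
            beta[n][i] = sum(A[i][j] * theta[j][X[n + 1]] * beta[n + 1][j]
                             for j in range(k))
    # P(X | M) is the sum of the last alpha column
    px = sum(alpha[N - 1])
    posterior = [[alpha[n][j] * beta[n][j] / px for j in range(k)]
                 for n in range(N)]
    # Pick the most probable state independently at each step
    decoded = [max(range(k), key=lambda j: posterior[n][j]) for n in range(N)]
    return decoded, posterior

decoded, posterior = posterior_decode(pi, A, theta, X=[0, 1, 2])
print(decoded)
for row in posterior:
    print(row)
```

Because each step is decided on its own, the resulting sequence is not guaranteed to be a likely path as a whole, which is the main difference from Viterbi decoding.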

#### 3) Training

##### Training by counting

##### Viterbi training

##### Expectation Maximization

### Decoding

#### Viterbi decoding 

- Given a sequence of observations X, what is the most likely explanation Z (hidden state sequence)?

We want the state sequence $Z^\ast$ that maximizes the joint probability $P(X,Z \rvert M)$. To find this $Z^\ast$ we first compute the $(k \times N)$-sized $\omega$-table, where the entry $\omega(z_n)$ is the probability of the most likely sequence of states $z_1,z_2,...,z_n$ ending in $z_n$ that generated the observations $x_1, x_2,...,x_n$. When we have computed the table, we pick the row with the largest value in the last column and backtrack from there to obtain the optimal path.
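
The $\omega$-table can be filled one column at a time. The recursion is not written out in these notes, so the following is the usual textbook form, expressed with the same symbols as the joint probability formula:

$$\omega(z_1) = P(z_1 \rvert \pi)\, P(x_1 \rvert z_1, \theta)$$

$$\omega(z_n) = P(x_n \rvert z_n, \theta) \max_{z_{n-1}} \omega(z_{n-1})\, P(z_n \rvert z_{n-1}, A), \quad n = 2,...,N$$

Each entry takes a maximum over $k$ predecessors and there are $kN$ entries, so filling the table takes $O(k^2N)$ time.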

#### Posterior decoding
- Given an observation sequence X of length N, what is the most likely state to be in at time n?