#  6) Hidden Markov Models - Decoding

* Hidden Markov Models
* 3 Problems for HMM's
    * Compute P(X|M)
    * __Decoding__
    * __Training__  
* Decoding
    * __Viterbi__ 
    * __Posterior__ 


### Hidden Markov Models

In many machine learning settings we assume that the data is __independent and identically distributed__. This is not always a reasonable assumption. For example, with __sequential data__ such as weather observations (rainy, cloudy, sunny, etc.), the probability of seeing rain on a given day is affected by the weather observed the day before (and possibly further back). This leads us to consider __Markov models__, in which the probability of an observation depends only on the most recent observations and is independent of all earlier ones. If the probability is affected only by the previous observation, we call it a first-order Markov model, and in general we can have __i'th-order Markov models__ where:

$$ p(x_N \rvert x_1,...,x_{N-1}) = p(x_N \rvert x_{N-i},..., x_{N-1})$$

If the observations are affected by latent (or hidden) discrete variables, and these variables form a Markov chain, then we have a __hidden Markov model (HMM)__. For the weather observations, the weather on a given day is not actually determined by the outcome of the previous days, but rather by hidden factors such as low/high pressure areas.

A k-state HMM can be represented by a 3-tuple of matrices $(\pi,A,\theta)$:

-  $\pi$: a $k \times 1$ matrix of initial state probabilities.
-  $A$: a $k \times k$ matrix of transition probabilities.
-  $\theta$: a $k \times |\Sigma|$ matrix of emission probabilities.

Here $\Sigma$ is the "emission alphabet". For the weather example it contains "sun", "rain", "cloudy" etc.  

__An HMM generates a sequence of observables by jumping from state to state according to $A$, each time emitting an observable according to $\theta$.__
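The generative process can be sketched in a few lines of Python. This is only an illustration: the two hidden states ("high" and "low" pressure), the three-symbol alphabet and all probabilities are made-up assumptions, and `sample_hmm` is a hypothetical helper name.

```python
import random

# Hypothetical 2-state weather model; every number here is invented for illustration.
states = ["high", "low"]               # hidden pressure states
alphabet = ["sun", "cloudy", "rain"]   # emission alphabet Sigma
pi = [0.6, 0.4]                        # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]           # A[i][j] = P(next state j | current state i)
theta = [[0.6, 0.3, 0.1],              # theta[j][s] = P(symbol s | state j)
         [0.1, 0.3, 0.6]]

def sample_hmm(pi, A, theta, N, rng):
    """Generate N hidden states and N observations (both as indices)."""
    k, m = len(pi), len(theta[0])
    Z, X = [], []
    z = rng.choices(range(k), weights=pi)[0]                   # start state from pi
    for _ in range(N):
        Z.append(z)
        X.append(rng.choices(range(m), weights=theta[z])[0])   # emit according to theta
        z = rng.choices(range(k), weights=A[z])[0]             # jump according to A
    return Z, X

rng = random.Random(0)  # fixed seed so the run is repeatable
Z, X = sample_hmm(pi, A, theta, 7, rng)
print([states[z] for z in Z])
print([alphabet[x] for x in X])
```

Only the second printed list would be visible in practice; the first one is the hidden part that decoding tries to recover.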

An example observation sequence X, with the corresponding underlying state sequence Z, is shown below for the weather example:

<img src="imgs\hmm.png" alt="Drawing" style="width: 300px;"/>


### 3 Problems for HMM's

For HMM's there are 3 basic problems which must be solved for them to be useful.

#### 1) Compute P(X|M) 

The first problem is to compute $P(X\rvert M)$, that is, the probability of seeing some sequence of observables X given an HMM M. By marginalizing over the hidden state sequences we get:

$$P(X\rvert M) = \sum_{Z} P(X,Z\rvert M)$$

The probability of seeing an observation sequence is the sum of the joint probabilities over all possible underlying state sequences (different state sequences can produce the same sequence of observables). There are $k^N$ possible state sequences, where k is the number of states in the HMM and N is the length of X, so computing this sum directly is infeasible. Instead it can be calculated in $O(k^2N)$ time as a by-product of posterior decoding, by summing the last column of the $\alpha$-table.
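To make the difference concrete, the sketch below computes $P(X\rvert M)$ both ways on a tiny example. It assumes the $\alpha$-table is the usual forward table, $\alpha(z_n) = P(x_1,...,x_n,z_n)$, which is not defined in these notes; the model numbers and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy model (2 states, 3 symbols); numbers are for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
X = [0, 2, 2, 1]  # observation indices

def joint(X, Z, pi, A, theta):
    """P(X, Z | M) for one particular state sequence Z."""
    p = pi[Z[0]] * theta[Z[0]][X[0]]
    for n in range(1, len(X)):
        p *= A[Z[n - 1]][Z[n]] * theta[Z[n]][X[n]]
    return p

def likelihood_brute_force(X, pi, A, theta):
    """Sum the joint probability over all k^N state sequences."""
    k = len(pi)
    return sum(joint(X, Z, pi, A, theta)
               for Z in product(range(k), repeat=len(X)))

def likelihood_forward(X, pi, A, theta):
    """O(k^2 N): keep one alpha column at a time and sum the last one."""
    k = len(pi)
    alpha = [pi[j] * theta[j][X[0]] for j in range(k)]
    for x in X[1:]:
        alpha = [theta[j][x] * sum(alpha[i] * A[i][j] for i in range(k))
                 for j in range(k)]
    return sum(alpha)

p_brute = likelihood_brute_force(X, pi, A, theta)
p_fwd = likelihood_forward(X, pi, A, theta)
print(p_brute, p_fwd)  # the two values agree up to rounding
```

The brute-force version is only usable for very short sequences, which is exactly why the forward recursion matters.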


##### Joint probability P(X,Z | M)

The formula for computing the joint probability looks as follows:

$$ P(X,Z \rvert M) = P(z_1 \rvert \pi) \bigg[\prod_{n=2}^N P(z_n \rvert z_{n-1}, A)\bigg] \prod_{n=1}^N P(x_n \rvert z_n, \theta)$$

-  $P(z_1 \rvert \pi)$  is the probability of starting in state $z_1$.
-  $\prod_{n=2}^N P(z_n \rvert z_{n-1}, A)$ is the probability of going through the remaining states $z_2$ to $z_N$.
-  $\prod_{n=1}^N P(x_n \rvert z_n, \theta)$ is the probability of observing X for the state sequence Z.

N is often a very large number, so the joint probability can become _very_ small. To avoid numerical underflow one can compute $\log P(X,Z \rvert M)$ instead, which turns the products into sums.
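The underflow problem is easy to reproduce. The sketch below multiplies the factors directly for a long sequence and compares that with the sum of logarithms; the model numbers are hypothetical, and `log_joint` assumes no zero probabilities are hit (otherwise `math.log` raises an error).

```python
import math

# Hypothetical toy model; numbers are for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]

def log_joint(X, Z, pi, A, theta):
    """log P(X, Z | M): the products of the joint probability become sums."""
    lp = math.log(pi[Z[0]]) + math.log(theta[Z[0]][X[0]])
    for n in range(1, len(X)):
        lp += math.log(A[Z[n - 1]][Z[n]]) + math.log(theta[Z[n]][X[n]])
    return lp

N = 2000
X = [0] * N  # the same symbol every day
Z = [0] * N  # staying in the same state

# Direct product: underflows to 0.0 long before the end of the sequence.
direct = pi[Z[0]] * theta[Z[0]][X[0]]
for n in range(1, N):
    direct *= A[Z[n - 1]][Z[n]] * theta[Z[n]][X[n]]

lp = log_joint(X, Z, pi, A, theta)
print(direct)  # 0.0
print(lp)      # a finite negative number
```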

#### 2) Decoding

The second basic problem for HMM's is __decoding__. Here we are interested in uncovering the hidden part of the model, i.e. the states behind the observations. There are two interpretations of this. One is finding the most likely hidden state sequence which produced a given observation sequence (__Viterbi__). Another is finding the individually most likely state to be in at a given point in the observation sequence (__Posterior__).

##### Viterbi decoding

In Viterbi decoding we wish to find the following:

$$Z^\ast = \underset{Z}{\operatorname{arg\,max}}\; P(X,Z \rvert M) $$

That is, the state sequence Z that maximizes the joint probability $P(X,Z \rvert M)$. We do this in 3 steps:

-  Compute the $\omega$-table.
-  Pick the row with the largest value in the last column.
-  Backtrack to obtain optimal path.

##### Posterior  decoding

In posterior decoding we wish to find the following:

$$z^\ast_n = \underset{z_n}{\operatorname{arg\,max}}\; P(z_n \rvert x_1,...,x_N) $$

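That is, the most likely state to be in at the n'th step given the whole observation sequence. We do this in 2 steps:

1) Compute the $\alpha$-table and the $\beta$-table.

2) For each n, pick the state $z_n$ that maximizes $P(z_n \rvert x_1,...,x_N) = \alpha(z_n)\beta(z_n) / P(X \rvert M)$, where $\alpha(z_n) = P(x_1,...,x_n,z_n)$ and $\beta(z_n) = P(x_{n+1},...,x_N \rvert z_n)$ are the usual forward and backward tables.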


#### 3) Training

Training is the third basic problem for HMM's. It is the problem of selecting the model parameters $(\pi,A,\theta)$ to reflect either given (X,Z) pairs or just a set of X's.

##### Training by counting

If we are given several sequences of observations and the corresponding latent states, how do we set the model parameters to make the given (X,Z)'s most likely to occur? The parameters should reflect what we have seen.

In "training by counting" we count the relevant occurrences (initial states, transitions and emissions) and set the model parameters to the corresponding relative frequencies. Since the Z's are known here, this yields a __maximum likelihood estimate__ of $M = (\pi, A, \theta)$ with respect to the joint likelihood:

$$M^\ast = \underset{M}{\operatorname{arg\,max}}\; P(X,Z\rvert M)$$
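Training by counting can be sketched as below. The function name, the tiny data set and the optional `pseudo` argument (pseudo counts added to every cell) are assumptions made for illustration; states and symbols are represented by integer indices.

```python
def train_by_counting(pairs, k, m, pseudo=0.0):
    """Estimate (pi, A, theta) from (X, Z) pairs by relative frequencies.

    pairs:  list of (X, Z) with X symbol indices and Z state indices
    k, m:   number of states and size of the emission alphabet
    pseudo: pseudo count added to every cell before normalizing
    """
    pi_c = [pseudo] * k
    A_c = [[pseudo] * k for _ in range(k)]
    th_c = [[pseudo] * m for _ in range(k)]
    for X, Z in pairs:
        pi_c[Z[0]] += 1                      # which state the sequence starts in
        for n in range(len(X)):
            th_c[Z[n]][X[n]] += 1            # state Z[n] emitted symbol X[n]
            if n > 0:
                A_c[Z[n - 1]][Z[n]] += 1     # transition Z[n-1] -> Z[n]

    def normalize(row):
        total = sum(row)
        # A row that was never observed falls back to the uniform distribution.
        return [c / total for c in row] if total > 0 else [1.0 / len(row)] * len(row)

    return normalize(pi_c), [normalize(r) for r in A_c], [normalize(r) for r in th_c]

# Two hypothetical training pairs (X, Z).
pairs = [([0, 0, 2, 2], [0, 0, 1, 1]),
         ([2, 1, 0], [1, 0, 0])]
pi, A, theta = train_by_counting(pairs, k=2, m=3)
print(pi)     # [0.5, 0.5]
print(A)      # state 0 row is [2/3, 1/3], state 1 row is [0.5, 0.5]
print(theta)  # [[0.75, 0.25, 0.0], [0.0, 0.0, 1.0]]
```

Without pseudo counts, anything not seen in the training data gets probability 0, which is why a small positive `pseudo` is often preferable.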

##### Viterbi training

If we only have the X's to work with - that is, the Z's are unknown - we can use Viterbi training. It involves 4 steps:

1) Decide on initial parameters $M_0 = (\pi_0, A_0, \theta_0)$.

2) Find the most likely sequence of states $Z^\ast$ explaining X using Viterbi decoding and the current parameters $M_i$.

3) Update parameters to $M_{i+1}$ by “counting” (with pseudo counts) according to $(X,Z^\ast)$.

4) Repeat 2-3 until $P(X,Z^\ast | M_i)$ is satisfactory (or the Viterbi sequence of states does not change).

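This yields a local maximum of: 

$$\operatorname{VIT}_X(M) = \max_Z P(X,Z \rvert M) $$

This is not an MLE, since it maximizes the probability of the single best state sequence rather than $P(X \rvert M)$, but it works ok in practice.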

##### Expectation Maximization

If we only have the X's to work with we can still do an MLE using Expectation Maximization:

Init:   Pick “suitable” parameters (transition and emission probabilities).

E-step: 1) Run the forward- and backward-algorithms with the current choice of parameters (to get the parameters of the Q-function).

Stop?:  2) Compute the likelihood $P(X \rvert M)$; if it is sufficient (or another stopping criterion is met) then stop.

M-step: 3) Compute new parameters using the values stored by the forward- and backward-algorithms. Repeat 1-3.

An iteration of the above steps never decreases the likelihood $P(X \rvert M)$, which therefore converges to a local maximum.
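The notes do not spell out the update formulas, so the sketch below uses the standard Baum-Welch updates as an assumption about what the M-step computes. It works on a single short sequence with unscaled forward and backward tables (long sequences would need scaling or log-space); the start parameters and all function names are hypothetical.

```python
def forward(X, pi, A, theta):
    """alpha[n][j] = P(x_1..x_n, z_n = j), unscaled."""
    k = len(pi)
    alpha = [[pi[j] * theta[j][X[0]] for j in range(k)]]
    for x in X[1:]:
        prev = alpha[-1]
        alpha.append([theta[j][x] * sum(prev[i] * A[i][j] for i in range(k))
                      for j in range(k)])
    return alpha

def backward(X, A, theta):
    """beta[n][i] = P(x_{n+1}..x_N | z_n = i), unscaled."""
    k = len(A)
    beta = [[1.0] * k]
    for x in reversed(X[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * theta[j][x] * nxt[j] for j in range(k))
                        for i in range(k)])
    return beta

def em_step(X, pi, A, theta):
    """One E-step and M-step; also returns P(X | M) for the incoming parameters."""
    k, m, N = len(pi), len(theta[0]), len(X)
    alpha, beta = forward(X, pi, A, theta), backward(X, A, theta)
    px = sum(alpha[-1])  # likelihood = sum of the last alpha column
    # gamma[n][j] = P(z_n = j | X)
    gamma = [[alpha[n][j] * beta[n][j] / px for j in range(k)] for n in range(N)]
    # xi[i][j] = expected number of transitions i -> j given X
    xi = [[sum(alpha[n - 1][i] * A[i][j] * theta[j][X[n]] * beta[n][j]
               for n in range(1, N)) / px
           for j in range(k)] for i in range(k)]
    new_pi = gamma[0][:]
    new_A = [[xi[i][j] / sum(xi[i]) for j in range(k)] for i in range(k)]
    new_theta = [[sum(gamma[n][j] for n in range(N) if X[n] == s)
                  / sum(gamma[n][j] for n in range(N))
                  for s in range(m)] for j in range(k)]
    return new_pi, new_A, new_theta, px

X = [0, 0, 1, 2, 2, 2, 1, 0, 0, 2]   # hypothetical observation indices
pi = [0.6, 0.4]                      # hypothetical start parameters
A = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]

likelihoods = []
for _ in range(10):
    pi, A, theta, px = em_step(X, pi, A, theta)
    likelihoods.append(px)
print(likelihoods)  # should be non-decreasing
```

In a real run the loop would stop once the likelihood stops improving by more than some tolerance rather than after a fixed number of iterations.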

### Decoding

#### Viterbi decoding 

As mentioned earlier, Viterbi decoding is about finding the following state sequence:

$$Z^\ast = \underset{Z}{\operatorname{arg\,max}}\; P(X,Z \rvert M) $$

That is, the state sequence Z that maximizes the joint probability $P(X,Z \rvert M)$. To find $Z^\ast$ we first compute the $k \times N$ $\omega$-table, where the entry $\omega(z_n)$ is the probability of the most likely sequence of states $z_1,z_2,...,z_n$ ending in $z_n$, together with the observations $x_1, x_2,...,x_n$, i.e.:

$$\boxed{\omega(z_n) = \underset{z_1,...,z_{n-1}}{\operatorname{max}} P(x_1,...,x_n,z_1,...,z_n)} $$

If we look at the following formula, we can see how we came up with the definition for $\omega(z_n)$:

\begin{equation} 
       \begin{split}
          P(X,Z^\ast) &=  \underset{Z}{\operatorname{max}} P(X,Z)\\
          &= \underset{z_1,...z_N}{\operatorname{max}} P(x_1,...,x_N,z_1,...,z_N)\\\\
          &= \underset{z_N}{\operatorname{max}} \boxed{\underset{z_1,...z_{N-1}}{\operatorname{max}}  P(x_1,...,x_N,z_1,...,z_N)}\\
         &= \underset{z_N}{\operatorname{max}} \boxed{\omega(z_N)}\\
    \end{split}
    \end{equation}

##### Computing omega table

The $\omega$-table will be computed recursively column by column left to right. The first column will be filled using the base step described below. In the recursive step we take the previous column into consideration.

__Base:__ $\omega(z_1) = P(x_1, z_1) = P(z_1)P(x_1 \rvert z_1)$

__Recursion:__ $\omega(z_n) = \underset{z_{n-1}}{\operatorname{max}} \big[ \omega(z_{n-1}) P(z_n \rvert z_{n-1}) \big] P(x_n \rvert z_n)$
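The recursive step follows from the definition of $\omega(z_n)$ and the factorization of the joint probability. A short derivation in the notation used above, with the model M left implicit:

\begin{equation}
\begin{split}
\omega(z_n) &= \underset{z_1,...,z_{n-1}}{\operatorname{max}} P(x_1,...,x_n,z_1,...,z_n)\\
&= \underset{z_1,...,z_{n-1}}{\operatorname{max}} P(x_1,...,x_{n-1},z_1,...,z_{n-1}) P(z_n \rvert z_{n-1}) P(x_n \rvert z_n)\\
&= P(x_n \rvert z_n) \underset{z_{n-1}}{\operatorname{max}} \Big[ P(z_n \rvert z_{n-1}) \underset{z_1,...,z_{n-2}}{\operatorname{max}} P(x_1,...,x_{n-1},z_1,...,z_{n-1}) \Big]\\
&= P(x_n \rvert z_n) \underset{z_{n-1}}{\operatorname{max}} \big[ \omega(z_{n-1}) P(z_n \rvert z_{n-1}) \big]
\end{split}
\end{equation}

The second line uses that $z_n$ depends only on $z_{n-1}$ and $x_n$ only on $z_n$; the third line moves the factors that do not depend on $z_1,...,z_{n-2}$ outside the inner maximization.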


After computing the table we can find the most likely state sequence $Z^\ast$ which was what we wanted in the first place. We do this by __backtracking__ through the table: 

-  First let $z_N$ be the argmax of the last column.
-  Let $z_{N-1}$ be the entry of column N-1 that was used to compute the value of $\omega(z_N)$ above.
-  Continue in the same way until $z_1$ has been found.
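Putting the base step, the recursive step and the backtracking together gives the sketch below. It works in log-space to avoid underflow and stores a back-pointer for every entry instead of recomputing the argmax during backtracking; both are implementation choices, not something the notes prescribe. The model numbers and names are hypothetical.

```python
import math

def safe_log(p):
    """log p, with log 0 treated as minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(X, pi, A, theta):
    """Return (most likely state sequence, log P(X, Z*))."""
    k, N = len(pi), len(X)
    # omega is stored as a list of columns: omega[n][j] = log omega(z_n = j)
    omega = [[safe_log(pi[j]) + safe_log(theta[j][X[0]]) for j in range(k)]]
    back = []  # back[n-1][j] = best predecessor of state j in column n
    for n in range(1, N):
        col, ptr = [], []
        for j in range(k):
            best = max(range(k), key=lambda i: omega[-1][i] + safe_log(A[i][j]))
            col.append(omega[-1][best] + safe_log(A[best][j])
                       + safe_log(theta[j][X[n]]))
            ptr.append(best)
        omega.append(col)
        back.append(ptr)
    # Backtracking: start at the argmax of the last column and follow the pointers.
    z = max(range(k), key=lambda j: omega[-1][j])
    best_logp = omega[-1][z]
    path = [z]
    for ptr in reversed(back):
        z = ptr[z]
        path.append(z)
    path.reverse()
    return path, best_logp

# Hypothetical toy model; numbers are for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
X = [0, 2, 2, 1]

path, best_logp = viterbi(X, pi, A, theta)
print(path)                 # [0, 1, 1, 1]
print(math.exp(best_logp))  # about 0.0042
```

Filling the table takes $O(k^2N)$ time and the backtracking only $O(N)$, at the cost of storing one pointer per table entry.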


#### Posterior decoding
- Given X of length N, what is the most likely state to be in at time n?
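One way to answer this is sketched below. It assumes the usual forward and backward tables, $\alpha(z_n) = P(x_1,...,x_n,z_n)$ and $\beta(z_n) = P(x_{n+1},...,x_N \rvert z_n)$, and the identity $P(z_n \rvert X) = \alpha(z_n)\beta(z_n)/P(X)$; the tables are unscaled, so this is only suitable for short sequences. The model numbers and function names are hypothetical.

```python
def forward(X, pi, A, theta):
    """alpha[n][j] = P(x_1..x_n, z_n = j), unscaled."""
    k = len(pi)
    alpha = [[pi[j] * theta[j][X[0]] for j in range(k)]]
    for x in X[1:]:
        prev = alpha[-1]
        alpha.append([theta[j][x] * sum(prev[i] * A[i][j] for i in range(k))
                      for j in range(k)])
    return alpha

def backward(X, A, theta):
    """beta[n][i] = P(x_{n+1}..x_N | z_n = i), unscaled."""
    k = len(A)
    beta = [[1.0] * k]
    for x in reversed(X[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * theta[j][x] * nxt[j] for j in range(k))
                        for i in range(k)])
    return beta

def posterior_decode(X, pi, A, theta):
    """Return (most likely state per position, posterior table)."""
    k, N = len(pi), len(X)
    alpha, beta = forward(X, pi, A, theta), backward(X, A, theta)
    px = sum(alpha[-1])  # P(X | M)
    post = [[alpha[n][j] * beta[n][j] / px for j in range(k)] for n in range(N)]
    path = [max(range(k), key=lambda j: post[n][j]) for n in range(N)]
    return path, post

# Hypothetical toy model; numbers are for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
X = [0, 2, 2, 1]

path, post = posterior_decode(X, pi, A, theta)
print(path)  # [0, 1, 1, 1]
print(post)  # each inner list sums to 1
```

Note that each position is decoded independently, so in general the resulting sequence need not be a valid path through the model (it may use a transition of probability 0), unlike the Viterbi sequence.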