# 1. From Markov Models to Hidden Markov Models
We are now going to extend the basic idea of Markov models to hidden Markov models. We have talked about latent variables before, and they will be a very important concept as we move forward. They show up in **K-means clustering**, **Gaussian Mixture Models**, **principal components analysis**, and many other areas. With hidden Markov models, the idea even shows up in the name, so you know that hidden (latent) variables are central to this model. 

The basic idea behind a latent variable is that there is something going on beyond what we can observe/measure. What we observe is generally stochastic/random, since if it were deterministic we could predict it without doing any machine learning at all. The assumption that we make when we assume there are latent or hidden variables is that there is some cause behind the scenes that is leading to the observations that we see. In hidden Markov models, the hidden cause itself is stochastic: it is a random process, the Markov chain. 

An example of this can be seen in genetics. A human is, in a sense, a physical manifestation of some biological code. Now that the code is readable, it is no longer hidden in the sense that we can't measure it, but there was a time when we couldn't. At that point, people would use HMMs to determine how genes map to actual physical attributes. 

Another example is speech to text. A computer isn't able to read the words you are attempting to say, but it can use an internal language model, i.e. a model of likely sequences of hidden states, to try and match those to the sounds that it hears. So, in this case what is observed is the sound signal, and the latent variables are the sentence or phrase that you are saying. 

## 1.1 Markov $\rightarrow$ Hidden Markov
So, how do we go from Markov models to hidden Markov models? The simplest way to explain this is via an example. Suppose you are at a carnival and a magician has two biased coins that he is hiding behind his back. He will choose to flip one of the coins at random, and all you get to see is the result of the coin toss (H/T). So, what are the **hidden states** and what are the **observed variables**? 

* Since we can see the results of the coin toss, that means heads and tails are our *observed variables*. We can think of this as a vocabulary or space of possible observed values. 
* The **hidden states**, of course, are which coin the magician chose to flip. We can't see them, so they are hidden. The sequence of coin choices is called a stochastic or random process, since it is a sequence of random variables. 

## 1.2 Define an HMM
How do we actually go about defining an HMM? Well, an HMM has 3 parts:
> **$\pi$, A, B**

(Note that this is opposed to the regular Markov model, which just has $\pi$ and $A$.) $\pi$ is the **initial state distribution**, or the probability of being in a state when the sequence begins. In our coin example, say the magician really likes coin 1, so the probability that he starts with coin 1 is 0.9.

#### $$\pi_1 = 0.9$$

$A$ is the state transition matrix, which tells us the probability of going from one state to another. 

#### $$A(i,j) = probability \; of \; going \; to \; state \; j \; from \; state \; i$$


In hidden Markov models, the states themselves are hidden, so $A$ corresponds to transitioning from one hidden state to another hidden state. In the coin example, suppose the magician is very fidgety, and the probability of transitioning from coin 1 to coin 2 is 0.9, and the probability of transitioning from 2 to 1 is 0.9. Then, the probability of staying with the same coin for either coin is 0.1.

$$A = \begin{bmatrix}
    A_{11} & A_{12}\\
    A_{21} & A_{22} 
\end{bmatrix}
= \begin{bmatrix}
    0.1 & 0.9\\
    0.9 & 0.1 
\end{bmatrix}
$$

The new variable here, of course, is $B$. This is the probability of observing some symbol given what state you are in. Note this is also a matrix because it has two inputs: what state you are in, which is $j$, and what you observe, which is $k$. 

#### $$B(j,k) = probability \; of \;observing \;symbol\;k\;while\;you\;are\;in\;state\;j$$
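
As a minimal sketch, the three parts of the magician HMM can be written as NumPy arrays. $\pi$ and $A$ use the values from the example above. The text does not give values for $B$, so the coin biases below (coin 1 lands heads 70% of the time, coin 2 lands heads 20% of the time) are assumed purely for illustration, as is the choice to encode states and symbols as 0-based integers.

```python
import numpy as np

# Hidden states: 0 = coin 1, 1 = coin 2. Observed symbols: 0 = heads, 1 = tails.
pi = np.array([0.9, 0.1])            # initial state distribution (from the text)
A = np.array([[0.1, 0.9],
              [0.9, 0.1]])           # A[i, j] = p(next state j | current state i)
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])           # B[j, k] = p(symbol k | state j); ASSUMED values

# pi and every row of A and B are probability distributions, so each sums to 1.
rows_ok = (np.isclose(pi.sum(), 1.0)
           and np.allclose(A.sum(axis=1), 1.0)
           and np.allclose(B.sum(axis=1), 1.0))
print(rows_ok)
```

Storing one distribution per row means that `A[i]` is the distribution over the next state and `B[j]` is the distribution over symbols, which is convenient when sampling or vectorizing later on.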

## 1.3 Independence Assumptions
In the HMM we are making more independence assumptions than just the Markov assumption. Remember, the Markov assumption is that the current state only depends on the previous state, but is independent of any state before the previous state. Now that we have both observed and hidden variables in our model, we have another independence assumption: 

> "What we observe is only dependent on the current state"

So, the observation at time $t$ depends only on the state at time $t$, and not on any other time, state, or observation. 
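
Written out in the notation used later in this document (hidden states $z(t)$ and observations $x(t)$), the two assumptions can be sketched as follows. The first line is the Markov assumption on the hidden states, and the second is the assumption on the observations:

#### $$p\big(z(t) \mid z(t-1), ..., z(1)\big) = p\big(z(t) \mid z(t-1)\big)$$

#### $$p\big(x(t) \mid z(1), ..., z(t), x(1), ..., x(t-1)\big) = p\big(x(t) \mid z(t)\big)$$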

## 1.4 What can we do with an HMM? 
So, what are we able to do with an HMM once we have one? Well, it will be similar to what we had discussed with regular Markov models, with some additions. With Markov models there were two main things we could do:
> 1. **Get the probability of a sequence**. This was just the multiplication of each state transition probability, and the probability of the initial state.  
2. **Train the model.** For this we just used maximum likelihood. That was just using frequency counts. 

With HMMs, we still have these two tasks, but both of these will be harder due to a more complex model. Training will most definitely be harder, because it not only requires the expectation maximization algorithm, but we will also run into the limits of the numerical accuracy of the computer (the limited precision of floats). 

There is also one more task we will go over: *finding the most likely sequence of hidden states*. 

---

<br>
# 2. HMMs are Doubly Embedded 
Let's now discuss how HMMs are doubly embedded stochastic processes. Why do we say that they are doubly embedded? Well, think of the innermost layer. This is already a Markov model, which is a specific type of stochastic process. With regular Markov models, that is all we need: you know the state, end of story. With hidden Markov models, once we hit a state there is yet another random sample that must be drawn. Think about our magician example: once the magician chooses the coin (1 or 2) he still has to flip the coin. So, we pick a state and then we have another random variable whose value has to be observed. 

We can think of this as two layers:

> * On the innermost layer, the state is chosen (choosing of coin). 
* On the outer layer, once the state is chosen a random variable is generated using the observation distribution for that state. 
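
The two layers can be sketched as a sampling loop. This is a minimal illustration, not a reference implementation: `sample_hmm` is a hypothetical helper name, and the emission matrix `B` uses assumed coin biases, since the text does not specify them.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Draw T hidden states and T observations from an HMM (hypothetical helper)."""
    states = np.zeros(T, dtype=int)
    observations = np.zeros(T, dtype=int)
    for t in range(T):
        # Inner layer: pick the state from pi (at t = 0) or from the row of A
        # that belongs to the previous state (the Markov chain).
        probs = pi if t == 0 else A[states[t - 1]]
        states[t] = rng.choice(len(pi), p=probs)
        # Outer layer: draw the observation from that state's row of B.
        observations[t] = rng.choice(B.shape[1], p=B[states[t]])
    return states, observations

pi = np.array([0.9, 0.1])                 # from the magician example
A = np.array([[0.1, 0.9], [0.9, 0.1]])    # from the magician example
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # ASSUMED coin biases, not from the text

rng = np.random.default_rng(0)
states, observations = sample_hmm(pi, A, B, T=10, rng=rng)
print(states)        # which coin was used at each step (hidden in practice)
print(observations)  # 0 = heads, 1 = tails (what we actually get to see)
```

In a real setting only `observations` would be available; `states` is returned here just to make the two layers visible.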

---

<br>
# 3. How can we choose the number of hidden states?
The number of hidden states is a **hyperparameter**. In general, we would choose it using *cross validation*. To see why, suppose we have $N$ training samples and we create a model with $N$ parameters. We could easily train this model to achieve 100% classification accuracy on the training set; however, this does not say anything about how the model will generalize to *unseen data*. Our goal will always be to fit to the trend, and not to the noise. If we can capture the real underlying trend, we should be able to make good predictions on new data. So, we will choose the number of hidden states that gives us the highest validation accuracy. We can use K-fold cross-validation. 

Generally that is all we would need to do when talking about hyperparameters. However, HMMs are a bit different. A lot of the time, the number of states in an HMM can reflect a real physical situation, or what we know about the situation we are trying to model, i.e. *a priori* knowledge. For instance, in the magician example, we know the magician only has two coins, so we would use two states. When we are doing speech to text, we know the number of words in our vocabulary. In addition, we can separately train the hidden state transitions on pure text to give us a good initialization of the transition probabilities. Another example is biology: a codon is a sequence of 3 DNA or RNA nucleotides, and codons code for the amino acids that make up proteins. A simple HMM may then have 3 physical states. So, we can use our knowledge of the physical system to help us determine the number of hidden states.  

# 4. The Forward-Backward Algorithm
The first question that we can ask of our HMM is the simplest one: 

>"What is the probability of a sequence?"

Suppose we have $M$ hidden states, and our sequence of observations is of length $T$. The idea is that we want to *marginalize* the joint probability over all possible values of the hidden states. 

So, we start with:

#### $$p(x,z)$$

Where both $x$ and $z$ are vectors:

#### $$x = \big[x(1), x(2), ..., x(T)\big]$$
#### $$z = \big[z(1), z(2), ..., z(T)\big]$$

However, we want to be able to marginalize out $z$ and find:

#### $$p\big(x(1), x(2),...,x(T)\big)$$

The final equation we end up with is:

$$p\big(x(1), x(2),...,x(T)\big) = \sum_{z(1)=1..M,...,z(T)=1..M}\pi\big(z(1)\big)p\big(x(1)|z(1)\big)\prod_{t=2}^Tp\big(z(t)|z(t-1)\big)p\big(x(t)|z(t)\big)$$

<br>
When we break this down, we see that we have **the probability of the initial state, times the probability of the first observation given that state**: 
#### $$\pi\big(z(1)\big)p\big(x(1)|z(1)\big)$$

We have **A, the probability of going to state j from state i**:
#### $$p\big(z(t)|z(t-1)\big)$$
#### $$A(i,j) = p\big(z(t)=j|z(t-1)=i\big)$$

And we have **B, the probability of seeing symbol k from state j**:
#### $$p\big(x(t)|z(t)\big)$$
#### $$B(j,k) = p\big(x(t)=k|z(t)=j\big)$$

By performing our marginalization:

#### $$\sum_{z(1)=1..M,...,z(T)=1..M}$$

We are essentially saying: 
> For the hidden variable at time 1, we want to look at *each potential value* of $z(1)$. So in the case of the magician, at time 1, we would perform the calculation as if coin 1 was used, $z(1)=1$, and then add to that the calculation as if coin 2 was used, $z(1)=2$. We would then do the same for time 2, and all the way up through time $T$. This process of marginalization is based on the sum rule of probability. 

The question is, how long will this take to calculate? Well, each term of the sum is a product of $2T$ factors, which takes $2T - 1$ multiplications. We can see this by starting with the product over time:

$$\prod_{t=2}^Tp\big(z(t)|z(t-1)\big)p\big(x(t)|z(t)\big)$$

This contributes 2 factors for each of the $T - 1$ values of $t$, and it is multiplied by:

$$\pi\big(z(1)\big)p\big(x(1)|z(1)\big)$$

This contributes 2 more factors, for $2(T-1) + 2 = 2T$ factors in total, and multiplying $2T$ factors together takes $2T - 1$ multiplications. How many times do we need to compute this product? This is equal to the number of possible state sequences, which is $M^T$. So in total that leaves us with $O(TM^T)$. This is exponential growth, which is pretty bad, so we don't want to do this. A better way of doing this is the forward-backward algorithm. The main issue that is causing us so many problems is that we have a product inside of a sum. Normally, we can't simplify a product inside of a sum, but in this case we can factor the expression using the properties of probability to reduce the number of calculations we have to do. 
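
To make the cost concrete, here is a minimal sketch of the naive marginalization, which enumerates all $M^T$ hidden state sequences. The function name `brute_force_likelihood` is hypothetical, states and symbols are encoded as 0-based integers, and the emission matrix `B` uses assumed values since the text does not give any.

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, x):
    """Sum the joint p(x, z) over every possible hidden state sequence z."""
    M, T = len(pi), len(x)
    total = 0.0
    for z in itertools.product(range(M), repeat=T):   # M**T sequences
        # Initial term: pi(z(1)) * p(x(1) | z(1))
        p = pi[z[0]] * B[z[0], x[0]]
        # Product over t = 2..T of p(z(t) | z(t-1)) * p(x(t) | z(t))
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
        total += p
    return total

pi = np.array([0.9, 0.1])                 # from the magician example
A = np.array([[0.1, 0.9], [0.9, 0.1]])    # from the magician example
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # ASSUMED coin biases, not from the text

x = [0, 1, 0]                             # heads, tails, heads
p_x = brute_force_likelihood(pi, A, B, x)
print(p_x)
```

This is only practical for tiny $T$, but it is a useful reference for checking a faster implementation on short sequences.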

## 4.1 Forward-Backward Algorithm Process
So, how does the forward backward algorithm actually work? We need to define a variable called $\alpha$:

> _This is the forward variable, and it represents the joint probability of seeing the sequence you have observed up until now and being in a specific state at that time._

#### $$\alpha(t,i) = p\big(x(1),...,x(t), z(t)=i\big)$$

We can see that there are two indices to $\alpha$: time $t$, and $i$, which indexes the state.

#### 4.1.1 Step 1
So, our first step is to calculate the initial value of $\alpha$ (t = 1):

#### $$\alpha(1, i) = p \big(x(1), z(1)=i\big)$$

Where if we recall the _Kolmogorov definition_ of conditional probability:

#### $$P( A \cap B ) = P(A \mid B) P(B)$$

We can extend that to our scenario:

#### $$\alpha(1, i) =  p\big(z(1) = i \big) p\big(x(1) \mid z(1)= i\big)$$

And we know that:

#### $$p\big(x(1) \mid z(1)= i\big) = B\big(i, x(1)\big)$$

And that:

#### $$p\big(z(1) = i \big) = \pi_i$$

Meaning we end up with:

#### $$\alpha(1,i) = \pi_iB\big(i, x(1)\big)$$

#### 4.1.2 Step 2
The second step is called the **induction step**. This will be done for every state and every time up until $T$. 

#### $$\alpha(t+1, j) = \sum_{i=1}^M \alpha(t,i) A(i,j)B(j, x(t+1))$$

What this is doing is allowing us to update our forward variable, $\alpha$. In other words, we continually update the joint probability of seeing the sequence you have observed up until now and being in a specific state at that time.

#### 4.1.3 Step 3
The final step is the termination step, where we marginalize over the hidden states at time $T$. 

#### $$p(x) = \sum_{i=1}^M\alpha(T,i) = \sum_{i=1}^M p\big(x(1),...,x(T),z(T)=i\big)$$

Notice that we already have our answer. We already know the probability of the sequence after only having done the forward step of the forward-backward algorithm. We can also show that the time complexity of this algorithm is $O(M^2T)$: each of the $T$ time steps updates $M$ values of $\alpha$, and each update is a sum over the $M$ possible previous states. 
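
The three steps can be sketched with explicit loops, which makes the $M \times M$ work per time step easy to see. This is an illustrative sketch: the function name `forward` is hypothetical, Python indices start at 0 (so `alpha[0]` holds $\alpha(1, \cdot)$), and the emission matrix `B` uses assumed values.

```python
import numpy as np

def forward(pi, A, B, x):
    """Forward algorithm with explicit loops; returns (alpha, p(x))."""
    M, T = len(pi), len(x)
    alpha = np.zeros((T, M))
    # Step 1 (initialization): alpha(1, i) = pi_i * B(i, x(1))
    for i in range(M):
        alpha[0, i] = pi[i] * B[i, x[0]]
    # Step 2 (induction): T - 1 time steps, each with M * M multiply-adds
    for t in range(T - 1):
        for j in range(M):
            for i in range(M):
                alpha[t + 1, j] += alpha[t, i] * A[i, j]
            alpha[t + 1, j] *= B[j, x[t + 1]]
    # Step 3 (termination): marginalize over the state at the final time
    return alpha, alpha[-1].sum()

pi = np.array([0.9, 0.1])                 # from the magician example
A = np.array([[0.1, 0.9], [0.9, 0.1]])    # from the magician example
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # ASSUMED coin biases, not from the text

alpha, p_x = forward(pi, A, B, [0, 1, 0])  # heads, tails, heads
print(alpha)
print(p_x)
```

Since $B(j, x(t+1))$ does not depend on $i$, it is factored out of the inner sum here; this is the same quantity as in the induction formula.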

## 4.2 Backward 
Now, at this point we do not need the backward algorithm (it is not needed to solve for $p(x)$), but we are going to use it later! It has two main steps, and it is essentially just the reverse of the forward algorithm. 

To perform the backward algorithm, we will define a variable called $\beta$, which is also indexed by time and the state. 

#### Initialization Step
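The initialization step is to define $\beta$ at time $T$ to be 1 for every state:

#### $$\beta(T, i) = 1 $$

Here, $\beta$ is the conditional probability of the remaining observations, given the state at time $t$:

#### $$\beta(t, i) = p\big(x(t+1), ..., x(T) \mid z(t)=i\big)$$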

#### Induction Step
The induction step is to then calculate the previous $\beta$ for every state, similar to what we did with the forward algorithm: 

#### $$\beta(t, i) = \sum_{j=1}^M A(i,j)B\big(j, x(t+1)\big) \beta(t+1, j)$$

Again, we want to do this for all times down to 1 or 0, depending on how you index, and for every state at each time. 
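
A minimal loop-based sketch of the backward pass is shown below. The function name `backward` is hypothetical, indices are 0-based, and `B` uses assumed values. The last lines use a standard consequence of the definition of $\beta$ that is not stated in the text above: $p(x) = \sum_i \pi_i B\big(i, x(1)\big)\beta(1, i)$, so the backward pass on its own also yields the sequence probability.

```python
import numpy as np

def backward(A, B, x):
    """Backward algorithm with explicit loops; returns beta of shape (T, M)."""
    M, T = A.shape[0], len(x)
    beta = np.zeros((T, M))
    # Initialization: beta(T, i) = 1 for every state
    beta[T - 1] = 1.0
    # Induction: walk backwards in time from T - 1 down to 1
    for t in range(T - 2, -1, -1):
        for i in range(M):
            for j in range(M):
                beta[t, i] += A[i, j] * B[j, x[t + 1]] * beta[t + 1, j]
    return beta

pi = np.array([0.9, 0.1])                 # from the magician example
A = np.array([[0.1, 0.9], [0.9, 0.1]])    # from the magician example
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # ASSUMED coin biases, not from the text

x = [0, 1, 0]                             # heads, tails, heads
beta = backward(A, B, x)
# p(x) = sum_i pi_i * B(i, x(1)) * beta(1, i)
p_x = np.sum(pi * B[:, x[0]] * beta[0])
print(beta)
print(p_x)
```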

## 4.3 Pseudocode

---
```
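import numpy as np

def forward_backward(pi, A, B, x):
    """Return (alpha, beta, p_x) for one sequence x of observed symbol indices."""
    T, M = len(x), len(pi)
    # Forward pass: alpha[t, i] = p(x(1), ..., x(t), z(t) = i)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = alpha[t-1].dot(A) * B[:, x[t]]
    p_x = alpha[-1].sum()  # termination: marginalize over the final state

    # Backward pass: beta[t, i] = p(x(t+1), ..., x(T) | z(t) = i)
    beta = np.zeros((T, M))
    beta[-1] = 1
    for t in range(T-2, -1, -1):
        beta[t] = A.dot(B[:, x[t+1]] * beta[t+1])
    return alpha, beta, p_x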
```
---

We can see above that both $\alpha$ and $\beta$ are arrays of shape $T \times M$, and notice how we have vectorized our operations. 
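
One way to sanity check the two passes against each other is a standard identity that follows from the definitions of $\alpha$ and $\beta$ (it is not derived in this document): for every time $t$, $\sum_i \alpha(t,i)\beta(t,i) = p(x)$. The sketch below is self-contained and uses the magician parameters with an assumed emission matrix `B`.

```python
import numpy as np

pi = np.array([0.9, 0.1])                 # from the magician example
A = np.array([[0.1, 0.9], [0.9, 0.1]])    # from the magician example
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # ASSUMED coin biases, not from the text
x = np.array([0, 1, 1, 0])                # heads, tails, tails, heads
T, M = len(x), len(pi)

# Vectorized forward pass
alpha = np.zeros((T, M))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = alpha[t - 1].dot(A) * B[:, x[t]]

# Vectorized backward pass
beta = np.zeros((T, M))
beta[-1] = 1
for t in range(T - 2, -1, -1):
    beta[t] = A.dot(B[:, x[t + 1]] * beta[t + 1])

p_x = alpha[-1].sum()
per_time = (alpha * beta).sum(axis=1)     # one value per time step
print(p_x)
print(per_time)                           # every entry should match p_x
```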

## 4.4 Forward Algorithm Explanation
The key idea behind the forward algorithm is that we are going to unroll the HMM in time. 

### 4.4.1 Sequence of Length 1
First we can discuss what we do with a sequence of length 1. Remember, the goal is to determine the probability of the sequence. Well, that is just a simple probability problem. 

We have the following:

> * $\pi \rightarrow$ The probability of the first state
* $B \rightarrow$ The probability of observing something given the state

And we want to find:

#### $$p\big(x(1)\big)$$

So, we just marginalize over the states, $z$:

#### $$p\big(x(1)\big) = p\big(z(1)=1\big) p\big(x(1) \mid z(1) =1 \big) +...+p\big(z(1)=M\big) p\big(x(1) \mid z(1) =M \big)$$

#### $$p\big(x(1)\big) = \pi_1 B \big(1, x(1) \big) + \pi_2 B \big(2, x(1) \big)+ ... +\pi_M B \big(M, x(1) \big)$$

Visually, this looks like:

<img src="images/forward-1.png" width="350">

> * We go from the null, or _start_ position, to one of the states. That is $\pi$. For this example, we will assume that the number of states, $M$, is 3.
* We then go from that state to producing an observed variable. That is just $\pi$ times $B$ for the 3 states. 

This is our definition for the initial value of $\alpha$:

#### $$\alpha(1, i) = p \big(x(1), z(1)=i\big)$$

And we can define $\alpha$ at rows 1, 2, and 3 respectively as:

#### $$\alpha(t=1, i=1) = p \big(x(1), z(1)=1\big) = \pi_1 B \big(1, x(1) \big)$$

#### $$\alpha(t=1, i=2) = p \big(x(1), z(1)=2\big) = \pi_2 B \big(2, x(1) \big)$$

#### $$\alpha(t=1, i=3) = p \big(x(1), z(1)=3\big) = \pi_3 B \big(3, x(1) \big)$$

### 4.4.2 Sequence of Length 2 $\rightarrow$ Induction Step
Now let's think about what we can do for a sequence of length 2. How would we find $p\big( x(1), x(2)\big)$, given we already have $p\big(x(1)\big)$? Remember, the observations are not directly dependent, and that each observation at a certain time, depends only on the state at that time. The states are Markov, so we can use the Markov assumption here. Let's try and find the probability of the second observation, $x(2)$, if the state is 1.

The question we want to ask here is: 

> How can we get to state 1 at time t=2, seeing symbol $x(2)$? 

The answer is that we can come from any possible previous state! Visually, that looks like:

<img src="images/forward-2.png" width="350">

Since those paths are mutually exclusive (we can only have come from one previous state), we can sum each of those distinct possibilities:

#### $$\pi_1A(1,1)B(1, x(2))+ \pi_2A(2,1)B(1, x(2))+\pi_3A(3,1)B(1, x(2))$$

Where: 

> * $\pi_1$ is the probability we start at state 1
* $A(1,1)$ is the probability of transitioning from state 1 to state 1 
* $B(1, x(2))$ is the probability of observing $x(2)$ while in state 1
* We then add the probability that we came from state 2 and state 3

Notice, if we include the $B$ for $t=1$, $B(i, x(1))$, this just gives us $\alpha$:

#### $$\pi_1B \big(1, x(1) \big)A(1,1)B(1, x(2))+ \pi_2B \big(2, x(1) \big)A(2,1)B(1, x(2))+\pi_3B \big(3, x(1) \big)A(3,1)B(1, x(2))$$

Hence, we can write the previous probability in terms of the previous $\alpha$:

#### $$\alpha(t=1, i=1)A(1,1)B(1, x(2))+ \alpha(t=1, i=2)A(2,1)B(1, x(2))+\alpha(t=1, i=3)A(3,1)B(1, x(2))$$

But wait, this is just the next $\alpha$ at time $t=2$!

#### $$\alpha(t=2, i=1)$$

That particular $\alpha$, at time $t=2$ and state = 1, is the probability of observing $x(1)$ and observing $x(2)$ and being in the state 1 at time = 2:

#### $$\alpha(t=2, i=1) = p \big(x(1), x(2), z(2)=1 \big)$$

So, we can see that $\alpha$ is defined recursively! Each $\alpha(t=2, i)$ is computed from the values of $\alpha(t=1, \cdot)$ at the previous time step. 

Realize that this induction can be used for any subsequent time step. In other words, the next $\alpha$ can always be defined in terms of the current $\alpha$. The probability that this gives us is the joint probability of the observed sequence so far and of ending up in a particular state:

#### $$\alpha(t+1,i=1) = p \big(x(1),...,x(t+1), z(t+1)=1\big)$$
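
As a small numeric check of the induction step, the sketch below writes out $\alpha(t=2, i=1)$ term by term for $M = 3$ and compares it with the vectorized update. All parameter values are made up for illustration, and Python index 0 corresponds to state 1.

```python
import numpy as np

# Hypothetical 3-state, 2-symbol HMM; every value here is made up for illustration.
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
x = [0, 1]                           # x(1) = symbol 0, x(2) = symbol 1

# alpha(t=1, i) = pi_i * B(i, x(1)) for i = 1, 2, 3
alpha_1 = pi * B[:, x[0]]

# alpha(t=2, i=1) written out as the three-term sum over previous states
alpha_2_state_1 = (alpha_1[0] * A[0, 0] * B[0, x[1]]
                   + alpha_1[1] * A[1, 0] * B[0, x[1]]
                   + alpha_1[2] * A[2, 0] * B[0, x[1]])

# The same update for all three states at once
alpha_2 = alpha_1.dot(A) * B[:, x[1]]
print(alpha_2_state_1, alpha_2[0])   # the two numbers should agree
```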

### 4.4.3 Termination Step
If we keep doing this process, eventually we will end up with $\alpha(T,i)$, which is the joint probability of the entire sequence and of ending up in state $i$:

#### $$\alpha(T,i) = p \big(x(1),...,x(T), z(T)=i\big)$$

Remember, our goal is to find just the probability of the sequence, so how can we do that? We can do the same thing that we did initially; it is just another probability problem. We marginalize over $z$, or in other words, sum the last $\alpha$ over all $i$:

#### $$p \big(x(1),...,x(T)\big) = \sum_{i=1..M}\alpha(T,i)$$


