# 7. HMM's with Continuous Observations 
At this point we are ready to look at the application of Hidden Markov Models to continuous observations. All that is meant by continuous observations is that what you observe is a number on a scale, rather than a symbol such as heads or tails, or a word. This is an interesting topic in and of itself, because it allows us to think about:
* What is continuous? 
* What is discrete?
* What are we approximating as discrete that is really continuous?

Think of an audio signal for a moment. Audio is simply sound, and sound is just air pressure that vibrates. By vibrating I just mean that it is oscillating in time. So, when audio is stored on your computer, obviously your computer cannot store an infinite number of values, and hence the air pressure has to be **quantized**; it has to be given a number that it can represent. Now, there are actually two things to quantize:

1. The **amplitude** of the sound pressure $\rightarrow$ stored as a 16-bit integer.
2. The **time** at which the amplitude occurred $\rightarrow$ sampled at 44100 Hz.
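
To make the two kinds of quantization concrete, here is a minimal sketch. The 440 Hz tone, the 10 ms duration, and the variable names are illustrative assumptions, not something taken from a real recording:

```python
import numpy as np

sample_rate = 44100   # samples per second: the time axis is discretized
num_samples = 441     # 10 ms of audio at 44.1 kHz

# Discrete time stamps, one every 1/44100 of a second
t = np.arange(num_samples) / sample_rate

# A 440 Hz tone whose amplitude is (conceptually) continuous in [-1, 1]
signal = np.sin(2 * np.pi * 440 * t)

# Amplitude quantization: round onto the signed 16-bit integer grid
quantized = np.round(signal * 32767).astype(np.int16)

# Worst-case error introduced by rounding, back on the [-1, 1] scale
max_error = np.max(np.abs(quantized / 32767 - signal))
print(quantized[:5], max_error)
```

The rounding error is bounded by half of one quantization step, which is why treating the amplitude as continuous is a reasonable approximation.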

As has been mentioned before in other posts, sound is generally sampled at 44.1 kHz. That is far too fast for your brain to realize it is not continuous. So, what happens when we use a Hidden Markov Model, but we assume the observed variable is drawn from a mixture of gaussians? Well, in essence:

> The amplitude becomes continuous, and the time stays discrete.

Before we dive into the details of integrating HMM's with GMM's, I want to note that the following is assuming you are familiar with GMM's (if not please review my posts on unsupervised learning, they give a very thorough and intuitive overview of GMM's). 

```python
import numpy as np
from scipy.stats import bernoulli, binom, norm
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
```

## 1. GMM's and HMM's
Now, the question is: How can we relate the concepts of GMM's back to what we already know about HMM's (specifically, to allow our HMM to deal with continuous data)? Remember that a Hidden Markov Model is defined by three things: $\pi$, $A$, and $B$. Because we are now working with continuous emissions, it seems that $B$ is what needs to change, since it used to be an $M \times V$ matrix, and there are no longer just $V$ possible symbols.

Take a moment to think about what $\pi$ and $A$ really represent. Both relate entirely to which of the $M$ hidden states we can expect to be in at a given time. Once we are in a specific hidden state, we know that $B$ determines the probability of observing a specific emission. This actually relates to gaussian mixture models quite nicely! Recall that GMM's also have their own latent variable (the specific gaussian that generated a data point, aka the cluster it belongs to). In other words, our HMM has the concept of being in a specific hidden state, $Z$, and the GMM has the concept of being in a particular hidden gaussian. In combination these can be used to determine the probability of an observation. A visualization should help make this more concrete:

<img src="https://drive.google.com/uc?id=13uFk4T6UTN3OSLA95wzpkqnxrokOkhrq" width="600">

Above, we can see that we still have our transition matrix $A$, which determines how we move between hidden states. However, we now see these hidden states, $Z$, labeled as hidden state 1. This is because we have introduced a second hidden state: the gaussian that generates the observation. So, we transition into a new hidden state $Z$ via our transition matrix $A$, and then we select a specific gaussian, our _second hidden state_. We then sample from that gaussian to generate our observed data point $x$. We can see below that our $B$ emission matrix is replaced with the GMM:

<img src="https://drive.google.com/uc?id=1fAR-hbRjxzoCDyyP4FEyf_PNzgwshJO9" width="600">

A sample path that would generate two $x$ observations is seen below:

<img src="https://drive.google.com/uc?id=1iU8RHZj8VB5KmNFs98h2ksAsxXbkhX0b" width="600">

Now that we have an idea of how this works in practice, let's go through the mathematics required to utilize GMM's in our HMM. Recall that there are three things that we need for a Gaussian Mixture Model:

* The responsibilities, the probability of a specific gaussian: $R$
* For each individual gaussian, the mean: $\mu$
* For each individual gaussian, the covariance: $\Sigma$

So, we are replacing $B$ with three new parameters: $R$, $\mu$ and $\Sigma$; this is seen clearly in the visualization above. We will use the letter $K$ for the number of gaussians. It is very important to keep in mind that we have **two hidden states**, one of which is the choice among $K$ gaussians. We will store $R$ as an $M \times K$ matrix, so there is one row for each hidden state, and each row holds $K$ probabilities, one per gaussian. Since this is a probability matrix, each row must sum to one.

```
R = (M x K)
```

Keep in mind, what we are really saying here is that based on the specific hidden state, $Z$, that we are in, there is a corresponding set of probabilities for selecting one of the $K$ gaussians, which will then generate our data point $x$. As a simple example, let's say that we are looking at household energy usage. The first hidden state, $Z$, is the time of day (morning, afternoon, evening), the second hidden state (gaussian) is what appliance is being utilized (washer, blow dryer, blender, etc.), and the observed $x$ is the energy usage in watts. If we know that the hidden state, time, is morning, then the probability that the blow dryer or blender is being used is higher than in the evening (a person is more likely to blow dry their hair in the morning, and also more likely to make a smoothie for breakfast). So, the probability of being in a specific second hidden state most certainly depends on the first hidden state that we are in.
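
The generative story just described (move between hidden states with $A$, pick a gaussian with the matching row of $R$, then sample $x$) can be sketched as follows. All numeric values are made up for illustration, and the observations are assumed to be one-dimensional, so each gaussian is described by a mean and a standard deviation rather than a full covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, T = 2, 3, 5   # hidden states, gaussians per state, sequence length

pi = np.array([0.6, 0.4])               # initial state probabilities
A = np.array([[0.9, 0.1],               # state transition matrix (M x M)
              [0.2, 0.8]])
R = np.array([[0.7, 0.2, 0.1],          # gaussian selection probabilities (M x K)
              [0.1, 0.3, 0.6]])
mu = np.array([[0.0, 5.0, 10.0],        # mean of gaussian k in state j
               [20.0, 25.0, 30.0]])
std = np.ones((M, K))                   # standard deviation of each gaussian

states, components, observations = [], [], []
z = rng.choice(M, p=pi)                 # first hidden state: drawn from pi
for t in range(T):
    k = rng.choice(K, p=R[z])           # second hidden state: which gaussian
    x = rng.normal(mu[z, k], std[z, k]) # observed value sampled from that gaussian
    states.append(z)
    components.append(k)
    observations.append(x)
    z = rng.choice(M, p=A[z])           # move to the next hidden state

print(states, components, np.round(observations, 2))
```

Only `observations` would be visible in practice; `states` and `components` are the two layers of hidden information.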

If you did in fact go through my post on GMM's, you may recall that we defined the probability of an observation $x$ as:

$$P(x) = \pi_1 N(x \mid \mu_1, \Sigma_1) + \pi_2 N(x \mid \mu_2, \Sigma_2) + \pi_3 N(x \mid \mu_3, \Sigma_3)$$

Here the probability of ending up in a specific latent state (gaussian) was encapsulated entirely by $\pi$ (not the initial state probability that we deal with in HMM's). In our current scenario, the probability of ending up in a specific gaussian is based on our responsibilities, $R$ (to which the GMM $\pi$ above is closely related; see the GMM post), as well as our state transition matrix $A$ and our initial state probabilities, $\pi$ (the HMM $\pi$).
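
As a quick refresher on the plain GMM density above, here is a small one-dimensional sketch with three components. The weights, means, and variances are arbitrary illustrative values, and `normal_pdf` and `gmm_pdf` are hypothetical helper names:

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a 1-D gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

weights = np.array([0.5, 0.3, 0.2])     # the GMM pi: must sum to one
means = np.array([-2.0, 0.0, 3.0])
variances = np.array([1.0, 0.5, 2.0])

def gmm_pdf(x):
    """P(x) = sum_k weight_k * N(x | mean_k, var_k)."""
    x = np.asarray(x, dtype=float)
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Sanity check: a valid density should integrate to (approximately) one
grid = np.linspace(-15, 15, 6001)
area = np.sum(gmm_pdf(grid)) * (grid[1] - grid[0])
print(area)
```

Because the weights sum to one and each gaussian integrates to one, the mixture is itself a valid probability density.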

Moving on to the gaussians themselves, each individual $\mu$ is $D$-dimensional, and there are $K$ of them. We need one set for each of the $M$ states; hence, $\mu$ will be an $M \times K \times D$ array:

```
mu = (M x K x D)
```

$\Sigma$ will be $M \times K \times D \times D$:

```
sigma = (M x K x D x D)
```
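
One possible way to allocate the three parameter arrays with these shapes is sketched below. The random initialization and the identity covariances are assumptions made for illustration; other initialization schemes are equally valid:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D = 3, 2, 4   # hidden states, gaussians per state, observation dimension

# R: one row of K probabilities per hidden state, each row normalized to sum to one
R = rng.random((M, K))
R /= R.sum(axis=1, keepdims=True)

# mu: one D-dimensional mean for every (state, gaussian) pair
mu = rng.normal(size=(M, K, D))

# sigma: one D x D covariance for every (state, gaussian) pair, started at identity
sigma = np.tile(np.eye(D), (M, K, 1, 1))

print(R.shape, mu.shape, sigma.shape)
```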

Once a state ($j$) has been chosen ($\pi$ and $A$ have done their job), we create a "$B$" observation probability matrix of size $M \times T$:

$$B(j,t) = \sum_{k=1}^K \overbrace{R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)}^\text{Gaussian Mixture Model}$$

Even though we don't technically have a $B$ matrix anymore, we can still make one by calculating it for each sequence. And we will store the individual mixture components as well:

$$Comp(j, k, t) = R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)$$
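
The two quantities above can be computed for a single observation sequence roughly as follows. This is a sketch under the assumption that `X` is a $T \times D$ array; `gaussian_pdf` and `emission_probs` are hypothetical helper names, and the density is written out with NumPy so the block stays self-contained:

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at each row of X (shape T x D)."""
    D = mean.shape[0]
    diff = X - mean
    inv = np.linalg.inv(cov)
    # Quadratic form (x - mean)^T cov^{-1} (x - mean) for every row
    exponent = -0.5 * np.einsum("td,de,te->t", diff, inv, diff)
    norm_const = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(exponent) / norm_const

def emission_probs(X, R, mu, sigma):
    """Return B (M x T) and Comp (M x K x T) for one sequence X."""
    M, K = R.shape
    T = X.shape[0]
    comp = np.zeros((M, K, T))
    for j in range(M):
        for k in range(K):
            # Comp(j, k, t) = R(j, k) * N(x(t), mu(j, k), Sigma(j, k))
            comp[j, k] = R[j, k] * gaussian_pdf(X, mu[j, k], sigma[j, k])
    B = comp.sum(axis=1)   # B(j, t): sum over the K mixture components
    return B, comp

# Small illustrative example with made-up parameters
rng = np.random.default_rng(0)
M, K, D, T = 2, 3, 2, 6
X = rng.normal(size=(T, D))
R = np.full((M, K), 1.0 / K)
mu = rng.normal(size=(M, K, D))
sigma = np.tile(np.eye(D), (M, K, 1, 1))
B, comp = emission_probs(X, R, mu, sigma)
print(B.shape, comp.shape)
```

For long sequences these densities can become very small, so a practical implementation would likely work in log space or rescale; that is left out here to keep the sketch close to the formulas.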

#### Expectation Step
We can then calculate a new $\gamma$, which is part of the expectation step:

$$\gamma(j,k,t) = \frac{\alpha(t, j) \beta(t, j)}{\sum_{j'=1}^M \alpha(t, j') \beta(t, j')} 
\frac{R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)}{\sum_{k'=1}^K R(j,k') N\Big( x(t), \mu(j,k'), \Sigma(j,k')\Big)} 
$$

$$\gamma(j,k,t) = \frac{\alpha(t, j) \beta(t, j)}{\sum_{j'=1}^M \alpha(t, j') \beta(t, j')} 
\frac{Comp(j, k, t)}{B(j,t)} 
$$
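
Given $\alpha$ and $\beta$ from the forward-backward pass, the expression for $\gamma$ is a product of two normalized terms. The sketch below assumes `alpha` and `beta` are stored as $T \times M$ arrays (matching the $\alpha(t, j)$ indexing) and uses random positive numbers as stand-ins for real forward-backward output; `compute_gamma` is a hypothetical helper name:

```python
import numpy as np

def compute_gamma(alpha, beta, comp, B):
    """gamma(j, k, t) from alpha, beta (T x M), Comp (M x K x T), B (M x T)."""
    # First factor: posterior probability of being in state j at time t
    state_post = alpha * beta
    state_post = state_post / state_post.sum(axis=1, keepdims=True)   # (T, M)
    # Second factor: posterior probability of gaussian k given state j
    comp_post = comp / B[:, None, :]                                  # (M, K, T)
    return state_post.T[:, None, :] * comp_post                       # (M, K, T)

# Stand-in inputs (random positives, NOT real forward-backward values)
rng = np.random.default_rng(0)
M, K, T = 2, 3, 5
alpha = rng.random((T, M)) + 0.1
beta = rng.random((T, M)) + 0.1
comp = rng.random((M, K, T)) + 0.1
B = comp.sum(axis=1)

gamma = compute_gamma(alpha, beta, comp, B)
print(gamma.sum(axis=(0, 1)))
```

Because both factors are normalized, summing $\gamma$ over all states and gaussians at a fixed time gives one, which is a handy check when debugging.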

#### Maximization Step
We can now define the updates for $R$, $\mu$ and $\Sigma$:

$$R(j, k) = \frac{\sum_{t=1}^T \gamma(j, k, t)}{\sum_{t=1}^T \sum_{k'=1}^K \gamma(j, k', t)}$$

$$\mu(j, k) = \frac{\sum_{t=1}^T \gamma(j,k,t)x(t)}{\sum_{t=1}^T \gamma(j,k,t)}$$

$$\Sigma(j,k) = \frac{\sum_{t=1}^T \gamma(j, k, t)\big(x(t) - \mu(j,k)\big) \big(x(t) - \mu(j, k) \big)^T}{\sum_{t=1}^T \gamma(j, k, t)}$$
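
The three updates translate fairly directly into array operations. The following is a sketch for a single observation sequence, assuming `gamma` has shape $M \times K \times T$ and `X` has shape $T \times D$; `m_step` is a hypothetical helper name, and the random `gamma` used in the example is only a stand-in for the output of the expectation step:

```python
import numpy as np

def m_step(gamma, X):
    """Update R (M x K), mu (M x K x D), sigma (M x K x D x D) from gamma."""
    weight = gamma.sum(axis=2)                       # sum over t, shape (M, K)
    # R(j, k): share of state j's total weight that belongs to gaussian k
    R = weight / weight.sum(axis=1, keepdims=True)
    # mu(j, k): gamma-weighted average of the observations
    mu = np.einsum("jkt,td->jkd", gamma, X) / weight[:, :, None]
    # Sigma(j, k): gamma-weighted average of outer products of deviations
    diff = X[None, None, :, :] - mu[:, :, None, :]   # (M, K, T, D)
    sigma = np.einsum("jkt,jktd,jkte->jkde", gamma, diff, diff)
    sigma = sigma / weight[:, :, None, None]
    return R, mu, sigma

# Illustrative example with a random stand-in for gamma
rng = np.random.default_rng(0)
M, K, T, D = 2, 3, 50, 2
X = rng.normal(size=(T, D))
gamma = rng.random((M, K, T))
gamma /= gamma.sum(axis=(0, 1), keepdims=True)       # sums to one at each t

R, mu, sigma = m_step(gamma, X)
print(R.sum(axis=1), mu.shape, sigma.shape)
```

With several training sequences, the sums over $t$ would also run over sequences; that extension is omitted here.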