# 7. HMM's with Continuous Observations 
At this point we are ready to look at the application of Hidden Markov Models to continuous observations. All that is meant by continuous observations is that what you observe is a number on a scale, rather than a symbol such as heads or tails, or words. This is an interesting topic in and of itself, because it allows us to think about:
* What is continuous? 
* What is discrete?
* What are we approximating as discrete that is really continuous?

Think of an audio signal for a moment. Audio is simply sound, and sound is just air pressure that vibrates. By vibrating I just mean that it is oscillating in time. So, when audio is stored on your computer, obviously your computer cannot store an infinite number of values, and hence the air pressure has to be **quantized**; it has to be given a number that it can represent. Now, there are actually two things to quantize:

1. The **amplitude** of the sound pressure $\rightarrow$ 16 bit int.
2. The **time** the amplitude occurred $\rightarrow$ 44100 Hz.
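
To make the two kinds of quantization concrete, here is a minimal sketch, assuming NumPy is available. The 440 Hz tone, the 10 ms duration, and the function name `quantize_tone` are illustrative choices, not something taken from a real audio pipeline.

```python
import numpy as np

SAMPLE_RATE = 44100   # samples per second (time quantization)
BIT_DEPTH = 16        # bits per sample (amplitude quantization)

def quantize_tone(freq_hz=440.0, duration_s=0.01):
    """Sample a sine tone on a discrete time grid and round it to 16-bit ints."""
    n_samples = int(round(SAMPLE_RATE * duration_s))
    t = np.arange(n_samples) / SAMPLE_RATE           # discrete time stamps
    continuous = np.sin(2 * np.pi * freq_hz * t)     # "true" amplitude in [-1, 1]
    max_int = 2 ** (BIT_DEPTH - 1) - 1               # largest positive 16-bit value
    quantized = np.round(continuous * max_int).astype(np.int16)
    return t, continuous, quantized

t, continuous, quantized = quantize_tone()
# Rounding error introduced by storing the amplitude as an integer.
max_error = np.max(np.abs(quantized / 32767 - continuous))
```

The rounding error is at most half of one integer step, which is why the stored signal is a very close approximation of the continuous amplitude.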

As has been mentioned before in other posts, sound is generally sampled at 44.1 kHz. That is far too fast for your brain to realize it is not continuous. So, what happens when we use a Hidden Markov Model, but we assume the observed variable is a mixture of gaussians? Well, in essence:

> The amplitude becomes continuous, and the time stays discrete.

So, what changes when we use hidden markov models with **Gaussian Mixture Models**? First, the following is assuming you are familiar with GMM's (if not please review my posts on unsupervised learning). With that said, let's go through a very quick recap.

## 1. Gaussian Mixture Models Review
Gaussian Mixture Models are a form of **density estimation**. They give us an approximation of the probability distribution of our data. We want to use gaussian mixture models when we notice that our data is multimodal (meaning there are multiple modes or bumps). From probability, we can recall that the **mode** is just the most common value. 

<img src="https://drive.google.com/uc?id=1d4ePx8RP9Cj1jxJYTjsZdsT4FU_a2qHz" width="400">

A Gaussian mixture is just a weighted sum of gaussians. To represent these weights we will introduce a new symbol called $\pi$. $\pi_k$ is the probability that $x$ belongs to the $k$th Gaussian. For example, with three components:

$$p(x) = \pi_1 N(x | \mu_1, \Sigma_1) + \pi_2 N(x | \mu_2, \Sigma_2) + \pi_3 N(x | \mu_3, \Sigma_3)$$

### 1.1 $\pi$ is a distribution
Notice that there is a constraint here that all of the $\pi$'s have to sum to 1. 

$$1 = \int p(x)dx = \int \Big[ \pi_1 N(x | \mu_1, \Sigma_1) + \pi_2 N(x | \mu_2, \Sigma_2) \Big] dx$$
$$= \pi_1 \cdot 1 + \pi_2 \cdot 1 = \pi_1 + \pi_2$$
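
The constraint can also be checked numerically. The following is a small sketch for a one-dimensional, two-component mixture; the weights, means, and variances are made-up values, and the integral is approximated with a simple Riemann sum on a fine grid.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of 1-D gaussian densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

means = [-2.0, 3.0]          # illustrative values
variances = [1.0, 0.5]
x = np.linspace(-20, 20, 200001)
dx = x[1] - x[0]

# Weights that sum to 1 give a valid density (area close to 1).
area = float(np.sum(mixture_pdf(x, [0.3, 0.7], means, variances)) * dx)

# Weights that sum to 0.6 give an area close to 0.6, so p(x) is not a density.
bad_area = float(np.sum(mixture_pdf(x, [0.3, 0.3], means, variances)) * dx)
```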

### 1.2 Latent Variables
Another way of thinking of this is that we introduced a new random variable called "Z". $Z$ represents which gaussian the data came from. So, we can say that:

$$\pi_k = P(Z = k)$$

This is like saying that there is some hidden cause called $Z$ that we can't measure. Each of these $Z$'s is causing a gaussian to be generated, and all we can see in our data is the combined effects of those individual $Z$'s. This will be important because it puts GMM's into the framework of **expectation maximization**.
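
One way to see the latent variable at work is to generate data from a mixture: first draw $Z$, then draw $x$ from the gaussian that $Z$ selected. This is a sketch with invented one-dimensional parameters; in real data the array `z` would be hidden and only `x` would be observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D mixture parameters (assumptions, not fitted to anything).
pi = np.array([0.2, 0.5, 0.3])        # P(Z = k)
means = np.array([-4.0, 0.0, 5.0])
stds = np.array([1.0, 0.5, 1.5])

N = 20000
z = rng.choice(len(pi), size=N, p=pi)   # hidden cause for each sample
x = rng.normal(means[z], stds[z])       # observed value given the hidden cause

# The fraction of samples produced by each gaussian approximates pi.
empirical_pi = np.bincount(z, minlength=len(pi)) / N
```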

### 1.3 Training a GMM
Training a GMM is as follows:

1. **Calculate Responsibilities**<br>
$R_k^{(n)}$ is the responsibility of the $k$th gaussian for generating the $n$th point. This is just the weighted density of that gaussian at the point, divided by the sum over all of the gaussians. If $\pi_k$ is large here, then it will overtake the other gaussians, and this will be approximately equal to 1. 

$$R_k^{(n)} = p(z^{(n)} = k | x^{(n)}) = \frac{\pi_k N (x^{(n)} | \mu_k, \Sigma_k) }{\sum_{j=1}^K \pi_j N (x^{(n)} | \mu_j, \Sigma_j)}$$

2. **Calculate model parameters of the gaussians**<br>
We now need to recalculate the means, covariances, and $\pi$'s. The way that this is done is also similar to k-means, where we weight each sample's influence on the parameter by the responsibility. If that responsibility is small, then that $x$ matters less in the total calculation. 

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^N R_k^{(n)} x^{(n)}$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^N R_k^{(n)} (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T$$ 

$$\pi_k = \frac{N_k}{N} \quad \text{with} \quad N_k = \sum_{n=1}^N R_k^{(n)}$$
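
The two steps above can be sketched in NumPy as follows. This is a minimal illustration under some assumptions: the helper names `gaussian_pdf` and `em_step`, the synthetic two-cluster data, the initial values, and the fixed 25 iterations are all invented for the example, and no safeguards (such as covariance regularization) are included.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density at each row of X (shape N x D)."""
    D = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T            # cov^{-1} (x - mean)
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(exponent) / norm

def em_step(X, pi, mu, sigma):
    """One EM iteration: responsibilities, then pi, mu and sigma."""
    N, K = X.shape[0], len(pi)
    # Step 1: responsibilities R[n, k]
    weighted = np.stack(
        [pi[k] * gaussian_pdf(X, mu[k], sigma[k]) for k in range(K)], axis=1)
    R = weighted / weighted.sum(axis=1, keepdims=True)
    # Step 2: responsibility-weighted parameter updates
    Nk = R.sum(axis=0)
    new_mu = (R.T @ X) / Nk[:, None]
    new_sigma = np.empty_like(sigma)
    for k in range(K):
        diff = X - new_mu[k]
        new_sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k]
    return R, Nk / N, new_mu, new_sigma

# Synthetic data: two well-separated 2-D clusters (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

pi = np.array([0.5, 0.5])
mu = X[[0, -1]].copy()                    # one starting mean from each cluster
sigma = np.stack([np.eye(2), np.eye(2)])
for _ in range(25):
    R, pi, mu, sigma = em_step(X, pi, mu, sigma)
```

In practice you would iterate until the log-likelihood stops improving rather than for a fixed number of steps.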

### 1.4 GMM's and HMM's
Now, the question is: How can we relate the concepts of GMM's back to what we already know about HMM's (specifically, to allow our HMM to deal with continuous data)? Remember that a Hidden Markov Model is defined by three things: $\pi$, $A$, and $B$. Because we are now working with continuous emissions, it seems that $B$ is what needs to change, since it used to be an $M \times V$ matrix, and there are no longer just $V$ possible symbols.

Take a moment to think about what $\pi$ and $A$ really represent. They both are entirely related to which of the $M$ hidden states we can expect to be in at a given time. Once we are in a specific hidden state, we know that $B$ determines the probability of observing a specific emission. This actually relates to gaussian mixture models quite nicely! As we saw in the GMM recap, $\pi_k = P(Z = k)$. In other words, we have the concept of being in a specific hidden state, $Z$, and then the GMM has a process of determining the probability of observation; this will take the place of $B$. 

How will this look in practice? Recall, that there are three things that we need for a Gaussian Mixture Model:

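* The mixture weights, the probability of a specific gaussian (called $\pi$ in the GMM recap; written $R$ here because $\pi$ already denotes the HMM's initial state distribution): $R$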
* For each individual gaussian, the mean: $\mu$
* For each individual gaussian, the covariance: $\Sigma$

So, we are replacing $B$ by three new parameters: $R$, $\mu$ and $\Sigma$. We will use the letter $K$ to be the number of gaussians. We will store $R$ as an $M \times K$ matrix, so there is one row for each hidden state, and then for each state there are $K$ different probabilities. Since this is a probability matrix, each row must sum to one.

```
R = (M x K)
```

Each individual $\mu$ is $D$ dimensional, and there are $K$ of them. We need that many for each of the $M$ states; hence, $\mu$ will be an $M \times K \times D$ array:

```
mu = (M x K x D)
```

Each individual $\Sigma$ is a $D \times D$ covariance matrix, so $\Sigma$ will be $M \times K \times D \times D$:

```
sigma = (M x K x D x D)
```
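
As a concrete sketch of these shapes, the following NumPy snippet allocates the three arrays. The function name `init_emission_params` and the particular initialization (random weights, random means, identity covariances) are assumptions made for illustration; other initializations are equally valid.

```python
import numpy as np

def init_emission_params(M, K, D, seed=0):
    """Allocate R, mu and sigma for M states, K gaussians, D dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.random((M, K))
    R /= R.sum(axis=1, keepdims=True)          # each row is a distribution
    mu = rng.normal(size=(M, K, D))            # one D-dim mean per (state, gaussian)
    sigma = np.tile(np.eye(D), (M, K, 1, 1))   # one D x D covariance per (state, gaussian)
    return R, mu, sigma

R, mu, sigma = init_emission_params(M=3, K=2, D=4)
```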

Once a state ($j$) has been chosen ($\pi$ and $A$ have done their job), we create a "$B$" observation probability matrix of size $M \times T$, where $T$ is the length of the observed sequence:

$$B(j,t) = \sum_{k=1}^K \overbrace{R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)}^\text{Gaussian Mixture Model}$$

Even though we don't technically have a $B$ matrix anymore, we can still make one by calculating it for each sequence. And we will store the individual mixture components as well:

$$Comp(j, k, t) = R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)$$
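
The two formulas above translate fairly directly into code. This is a sketch, assuming a single observation sequence `X` of shape $T \times D$; the helper names `gaussian_pdf` and `emission_probs` and the toy parameter values are made up for the example.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density at each row of X (shape T x D)."""
    D = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(exponent) / norm

def emission_probs(X, R, mu, sigma):
    """Return B (M x T) and the per-gaussian components Comp (M x K x T)."""
    M, K = R.shape
    T = X.shape[0]
    comp = np.zeros((M, K, T))
    for j in range(M):
        for k in range(K):
            comp[j, k] = R[j, k] * gaussian_pdf(X, mu[j, k], sigma[j, k])
    B = comp.sum(axis=1)   # sum over the K gaussians of each state
    return B, comp

# Toy example: M=2 states, K=2 gaussians, D=1 dimension, T=3 observations.
X = np.array([[0.0], [1.0], [5.0]])
R = np.array([[0.5, 0.5], [0.9, 0.1]])
mu = np.array([[[0.0], [1.0]], [[5.0], [6.0]]])
sigma = np.ones((2, 2, 1, 1))
B, comp = emission_probs(X, R, mu, sigma)
```

Storing `comp` alongside `B` avoids recomputing the gaussian densities when the per-gaussian posteriors are needed later.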

#### Expectation Step
We can then calculate a new $\gamma$, which is part of the expectation step:

$$\gamma(j,k,t) = \frac{\alpha(t, j) \beta(t, j)}{\sum_{j'=1}^M \alpha(t, j') \beta(t, j')} 
\frac{R(j,k) N\Big( x(t), \mu(j,k), \Sigma(j,k)\Big)}{\sum_{k'=1}^K R(j,k') N\Big( x(t), \mu(j,k'), \Sigma(j,k')\Big)} 
$$

$$\gamma(j,k,t) = \frac{\alpha(t, j) \beta(t, j)}{\sum_{j'=1}^M \alpha(t, j') \beta(t, j')} 
\frac{Comp(j, k, t)}{B(j,t)} 
$$
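
In array form, the second expression is a product of a state posterior and a within-state gaussian posterior. The sketch below assumes `alpha` and `beta` are stored as $T \times M$ arrays (matching the $\alpha(t, j)$ indexing), `comp` as $M \times K \times T$ and `B` as $M \times T$. The random inputs are placeholders only; in a real model `alpha` and `beta` come from the forward-backward algorithm.

```python
import numpy as np

def compute_gamma(alpha, beta, comp, B):
    """gamma[j, k, t] = P(state j at t | sequence) * P(gaussian k | state j, x(t))."""
    state_post = alpha * beta                              # T x M
    state_post = state_post / state_post.sum(axis=1, keepdims=True)
    comp_post = comp / B[:, None, :]                       # M x K x T
    return state_post.T[:, None, :] * comp_post            # M x K x T

# Placeholder inputs with the right shapes (NOT real forward-backward output).
M, K, T = 3, 2, 5
rng = np.random.default_rng(0)
alpha = rng.random((T, M)) + 0.1
beta = rng.random((T, M)) + 0.1
comp = rng.random((M, K, T)) + 0.1
B = comp.sum(axis=1)

gamma = compute_gamma(alpha, beta, comp, B)
```

At every time step, `gamma` sums to 1 over all (state, gaussian) pairs, which is a handy sanity check when debugging.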

#### Maximization Step
We can now define the updates for $R$, $\mu$ and $\Sigma$:

$$R(j, k) = \frac{\sum_{t=1}^T \gamma(j, k, t)}{\sum_{t=1}^T \sum_{k'=1}^K \gamma(j, k', t)}$$

$$\mu(j, k) = \frac{\sum_{t=1}^T \gamma(j,k,t)x(t)}{\sum_{t=1}^T \gamma(j,k,t)}$$

$$\Sigma(j,k) = \frac{\sum_{t=1}^T \gamma(j, k, t)\big(x(t) - \mu(j,k)\big) \big(x(t) - \mu(j, k) \big)^T}{\sum_{t=1}^T \gamma(j, k, t)}$$
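
These three updates can be sketched as follows for a single sequence. The function name `m_step` is hypothetical, and the covariance update here uses the freshly updated mean, which is one common reading of the formula. The toy `gamma` is a hard (one-hot) assignment chosen so the results are easy to verify by hand.

```python
import numpy as np

def m_step(gamma, X):
    """Update R, mu, sigma from gamma (M x K x T) and observations X (T x D)."""
    M, K, T = gamma.shape
    D = X.shape[1]
    gamma_sum = gamma.sum(axis=2)                            # M x K
    R = gamma_sum / gamma_sum.sum(axis=1, keepdims=True)     # rows sum to 1
    mu = (gamma @ X) / gamma_sum[:, :, None]                 # M x K x D
    sigma = np.zeros((M, K, D, D))
    for j in range(M):
        for k in range(K):
            diff = X - mu[j, k]                              # T x D
            weighted = gamma[j, k][:, None] * diff
            sigma[j, k] = weighted.T @ diff / gamma_sum[j, k]
    return R, mu, sigma

# Toy check: one state, two gaussians, four 1-D observations.
X = np.array([[0.0], [2.0], [10.0], [14.0]])
gamma = np.array([[[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]]])
R, mu, sigma = m_step(gamma, X)
# R -> [[0.5, 0.5]], mu -> [[[1.], [12.]]], sigma -> [[[[1.]], [[4.]]]]
```

With several training sequences, the numerators and denominators would be accumulated across sequences before dividing.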