# Expectation-Maximization Algorithm

Finally, we get to the most renowned optimization algorithm for HMMs: the Baum-Welch optimization/reparameterization algorithm. This algorithm is an early implementation of the relatively broad class of expectation-maximization algorithms.

In this notebook, we go through, in detail, the calculation of a single iteration of this procedure, and then generalize the process to iteratively improve estimates until convergence is achieved.

### Preliminaries

Before discussing the optimization procedure in too much detail, there are a few numerical quantities that we need in order to proceed with the EM algorithm. For the sake of conciseness they are just listed here, with a reference to the tutorial notebook that covers each quantity in more detail.

First, we will need the Bayesian estimate of the hidden state, $p(x_t | Y^T)$, which is the best estimate of the hidden state given information about all observations (past and future). This quantity is discussed in more detail in the notebook `03-slarge-hmm-filters.ipynb` as well as Ref. [1].

Second, we will need the values of the forward and backward trajectory probabilities, denoted (as is convention) by $\alpha_t(i) \equiv p(Y^{t} | x_t=i)$ and $\beta_t(j) \equiv p(Y^{[t:T]}| x_t = j)$. Further discussion of these quantities is contained in the notebook `04-slarge-alpha-beta.ipynb` as well as Ref. [2].

Now, the actual implementation of the EM algorithm comes through the iterative application of expectation (E) and maximization (M) steps. Along with this iteration there is a convergence criterion, which determines how long to repeat the iteration and tracks the convergence of the model towards a local maximum of the likelihood. To start, we discuss the calculation of the expectation and maximization steps in isolation.

What distinguishes this approach from the previous likelihood calculations is what we consider to be the likelihood function. Specifically, we previously used a likelihood function of the form $\mathcal{L}(\theta | Y^T)$, which effectively assumes that the *data* in question for this optimization are the observations. Conversely, we could take the stance that the true *data* for this problem is the set of all system states, $Y$ *and* $X$ (hence our previous references to the likelihood as a *partial* likelihood function). In this sense, the true likelihood (or, as it is often called, the *Complete Data Likelihood Function*) $\mathcal{L}(\theta | Y^T, X^T)$ contains both the observations and the hidden states. However, because we do not know what the hidden states are, we cannot directly optimize this function. This is exactly where the EM algorithm shines: each iteration is guaranteed not to decrease the likelihood of the observations, so iterative estimation (or re-estimation) of the model parameters converges towards a local (though not necessarily global) maximum.
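The iterate-until-converged structure described above can be sketched generically. This is only an illustration under stated assumptions: `run_em` and `em_step` are hypothetical names (not part of the `hidden` package), and `em_step` is assumed to perform one combined E and M update and return the new parameters together with the log-likelihood of the observations.

```python
import numpy as np

def run_em(em_step, params, tol=1e-6, max_iter=200):
    """Repeat `em_step` until the log-likelihood changes by less than `tol`.

    `em_step` is a hypothetical callable: params -> (new_params, log_likelihood).
    Returns the final parameters and the log-likelihood history.
    """
    history = []
    previous = -np.inf
    for _ in range(max_iter):
        params, loglik = em_step(params)
        history.append(loglik)
        # Convergence criterion: stop once the improvement is below tolerance
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    return params, history
```

Tracking the full history, rather than only the final value, makes it easy to plot the approach to the local maximum and to spot iterations where the likelihood unexpectedly decreases, which would indicate a bug in the update step.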

### Expectation Step

The expectation step is the simpler of the two: it amounts to calculating the expected value of the hidden state sequence, given the available observation data and an estimate (initially a guess) of the dynamics and observation matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, respectively. In other words, it is a calculation of the Bayesian state estimates $p(x_t | Y^T)$ using the current values of $\boldsymbol{A}$ and $\boldsymbol{B}$.
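As a rough illustration of the expectation step, the following is a minimal NumPy sketch of a scaled forward-backward pass. It does not use the `hidden` package, and its conventions are assumptions that may differ from the referenced notebooks: `A[i, j]` is the $i \to j$ transition probability, `B[k, j]` is the probability of observing symbol `k` in state `j`, `p0` is a hypothetical initial state distribution, and the backward variable here excludes the observation at time $t$ (the common textbook form).

```python
import numpy as np

def smoothed_posteriors(A, B, obs, p0):
    """Return gamma[t, i] = p(x_t = i | Y^T) for a discrete HMM.

    Assumed conventions (illustrative, not taken from `hidden`):
      A[i, j] = p(x_{t+1} = j | x_t = i)   (rows sum to 1)
      B[k, j] = p(y_t = k | x_t = j)       (columns sum to 1)
    """
    T, n = len(obs), A.shape[0]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))

    # Forward pass: normalized filter estimates p(x_t | Y^t)
    alpha[0] = p0 * B[obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = B[obs[t]] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()

    # Backward pass, rescaled at every step to avoid numerical underflow
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # The per-step scale factors drop out when each row is normalized
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Normalizing at each time step keeps the recursion stable for long sequences, at the cost of discarding the overall scale of the forward and backward variables; that scale is not needed for the posterior state estimates.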

### Maximization Step

Following the expectation step, the maximization step updates these matrices based on our best guesses of the transition probabilities between hidden states, as well as the expected observation errors. To start with the former, note that we can, in general, estimate the probability of the transition $i\to j$ in a Markov model by computing the ratio

$$ \hat{A}_{ij} = \frac{N_{i\to j}}{N_{i}} $$

where $N_{i\to j}$ is the number of observed transitions from $i \to j$ and $N_i$ is the number of times the system was observed to be in state $i$. For a hidden Markov model, we can write down a probabilistic version of this equation as

$$ \hat{A}_{ij} = \frac{\sum_{t=1}^{T-1} p(x_{t+1, j}, x_{t, i} | Y^T)}{\sum_{t=1}^{T-1} p(x_{t, i} | Y^T)} $$

where $p(x_{t+1, j}, x_{t, i} | Y^T)$ denotes the probability of transitioning from state $i\to j$ during the $t\to t+1$ timestep, conditioned upon all observations. To evaluate the numerator, we need an expression for this pairwise probability. In supporting documentation, we show that it can be written as

$$ p(x_{t+1, j}, x_{t, i} | Y^T) = \frac{\beta_{t+1}(j)\alpha_{t}(i)A_{ij}}{\sum_{j'} \beta_{t+1}(j')\alpha_{t}(i)A_{ij'}}p(x_{t, i} | Y^T) $$

which is just a combination of several terms that we already know how to calculate.

Using this result, we can *re-estimate* the parameters of the matrix $\boldsymbol{A}$.
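To make the transition-matrix update concrete, here is a hedged NumPy sketch. The function name `reestimate_A` and the array layouts are assumptions for illustration: `A[i, j]` is taken to be the $i \to j$ probability, `beta[t, j]` is proportional to $p(Y^{[t:T]} | x_t = j)$ (so it includes the observation at time $t$), and `gamma[t, i]` holds the Bayesian estimates $p(x_t = i | Y^T)$. The forward variable appears in both the numerator and the denominator of the pairwise probability and cancels, so it is not needed as an input here.

```python
import numpy as np

def reestimate_A(A, beta, gamma):
    """One re-estimate of the transition matrix from smoothed quantities.

    Assumed layouts (illustrative):
      A[i, j]     = current p(x_{t+1} = j | x_t = i)
      beta[t, j]  ~ p(y_t, ..., y_T | x_t = j), any positive per-step scaling
      gamma[t, i] = p(x_t = i | Y^T)
    """
    # weights[t, i, j] = A[i, j] * beta[t + 1, j]
    weights = A[None, :, :] * beta[1:, None, :]
    # Normalizing over j gives p(x_{t+1} = j | x_t = i, Y^T)
    conditional = weights / weights.sum(axis=2, keepdims=True)
    # Pairwise posterior p(x_{t+1} = j, x_t = i | Y^T)
    pairwise = gamma[:-1, :, None] * conditional
    # Expected transition counts over expected visits to state i
    return pairwise.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
```

Because the conditional term is normalized over the destination state, each row of the returned matrix sums to one by construction, and any per-time-step rescaling of `beta` has no effect on the result.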

The update to the matrix $\boldsymbol{B}$ follows from a similar (albeit much simpler) calculation. You can show (again, discussed in more detail in supporting documentation) that the update to the elements of the $\boldsymbol{B}$ matrix is

$$ \hat{B}_{ij} = \frac{\sum_t \delta_{y_t, i} p(x_{t, j} | Y^T)}{\sum_t p(x_{t, j} | Y^T)} $$
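The observation-matrix update above translates almost directly into array operations. The sketch below is illustrative only: `reestimate_B` is a hypothetical name, observations are assumed to be integer symbols in `0..n_symbols-1`, and `gamma[t, j]` is assumed to hold $p(x_{t, j} | Y^T)$.

```python
import numpy as np

def reestimate_B(obs, gamma, n_symbols):
    """Return B_hat[i, j]: estimated probability of observing symbol i in state j.

    Assumed inputs (illustrative):
      obs[t]      = integer observation at time t
      gamma[t, j] = p(x_t = j | Y^T)
    """
    obs = np.asarray(obs)
    # indicator[t, i] = 1 if y_t == i, else 0 (the Kronecker delta in the update)
    indicator = (obs[:, None] == np.arange(n_symbols)[None, :]).astype(float)
    # Numerator: expected count of symbol i emitted while in state j
    expected_counts = indicator.T @ gamma
    # Denominator: expected number of visits to state j
    return expected_counts / gamma.sum(axis=0)
```

Each column of the result sums to one, since every time step contributes its full posterior weight to exactly one observed symbol.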

Next, we build the computational tools needed to perform these updates to the dynamics and observation matrices, and show how this can lead to improved HMM optimization.

#### References
- [1] J. Bechhoefer, *Control Theory for Physicists*, Cambridge University Press, 2020, Cambridge, UK
- [2] *Numerical Recipes* 

In [None]:
# to start we need to import the necessary libraries
import os
import numpy as np
import matplotlib.pyplot as plt

from hidden import dynamics
from hidden import infer

# Declare sample (2D) HMM
hmm = dynamics.HMM(2, 2)
hmm.init_uniform_cycle(0.15, 0.2)
hmm.run_dynamics(500)
