## Training HMMs when hidden states are truly unobserved

In this case we use an expectation-maximization (EM) algorithm. EM is useful whenever the likelihood requires marginalizing over an unknown (hidden) variable. Here the likelihood of an observed sequence $x$ is $p(x) = \sum_{z} p(x, z)$, where the sum runs over all hidden state sequences $z$.

The E step computes the analogue of the counts used when the $z$s are observed. Because the $z$s are unobserved, we cannot count transitions and emissions directly; instead we compute *expected* counts, which depend on the current parameters $A, B, \pi$ through $\alpha$ and $\beta$ (so we need the forward-backward algorithm at this point).

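E step:
$$
\xi_t(i,j) = \frac{\alpha_t(i)A_{i,j}B_{j,x_{t+1}}\beta_{t+1}(j)}{\sum_{i=1}^M\sum_{j=1}^M\alpha_t(i)A_{i,j}B_{j,x_{t+1}}\beta_{t+1}(j)}\,,\\
\gamma_t(i) = \sum_{j=1}^M \xi_t(i,j) = \frac{\alpha_t(i)\beta_t(i)}{\sum_{k=1}^M\alpha_t(k)\beta_t(k)}\,.
$$

Here $\xi_t(i,j)$ is the posterior probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$, and $\gamma_t(i)$ is the posterior probability of being in state $i$ at time $t$. The sum over $\xi$ is defined for $t=1,\dots,T-1$; the expression in terms of $\alpha$ and $\beta$ also covers $t=T$.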
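As a rough illustration of the E step, the sketch below computes $\xi$ and $\gamma$ with NumPy. The function names `forward_backward` and `e_step` are hypothetical, symbols are assumed to be integers starting at 0, `A` is assumed to have shape `(M, M)` and `B` shape `(M, K)`. The recursions are left unscaled for readability, so this is only suitable for short sequences.

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Unscaled forward/backward recursions (short sequences only)."""
    x = np.asarray(x)
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        # alpha_t(j) = sum_i alpha_{t-1}(i) A_ij * B_{j, x_t}
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A_ij B_{j, x_{t+1}} beta_{t+1}(j)
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return alpha, beta

def e_step(x, pi, A, B):
    """Return xi with shape (T-1, M, M) and gamma with shape (T, M)."""
    x = np.asarray(x)
    alpha, beta = forward_backward(x, pi, A, B)
    # Numerator of xi_t(i, j): alpha_t(i) A_ij B_{j, x_{t+1}} beta_{t+1}(j)
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[:, x[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # gamma_t(i) is proportional to alpha_t(i) beta_t(i), for every t including the last
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return xi, gamma
```

Computing $\gamma$ from $\alpha\beta$ rather than by summing $\xi$ gives a value at the final time step as well, which the emission update needs.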

The M step then re-estimates $\pi$, $A$ and $B$ from the quantities $\xi$ and $\gamma$.

M step:

$$
\pi_i = \gamma_1(i)\,,\\
A_{i,j} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}\,,\\
B_{j,k} = \frac{\sum_{t=1}^{T}\gamma_t(j)\mathbb{1}(x_t=k)}{\sum_{t=1}^{T}\gamma_t(j)}\,.
$$
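The M-step updates translate almost line by line into array operations. The following is a minimal sketch, assuming `xi` has shape `(T-1, M, M)`, `gamma` has shape `(T, M)`, and `x` holds integer symbols in `0, ..., K-1`; the function name `m_step` is hypothetical.

```python
import numpy as np

def m_step(x, xi, gamma, K):
    """Re-estimate (pi, A, B) from the E-step quantities xi and gamma."""
    x = np.asarray(x)
    # pi_i = gamma_1(i)
    pi = gamma[0].copy()
    # A_ij = sum_{t<T} xi_t(i, j) / sum_{t<T} gamma_t(i)
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # one_hot[t, k] = 1 if x_t == k else 0, i.e. the indicator in the B update
    one_hot = np.eye(K)[x]
    # B_jk = sum_t gamma_t(j) 1(x_t = k) / sum_t gamma_t(j)
    B = (gamma.T @ one_hot) / gamma.sum(axis=0)[:, None]
    return pi, A, B
```

Each row of the returned `A` and `B` sums to one as long as `gamma[:-1]` equals `xi` summed over its last axis, which holds when both come from the same forward-backward pass.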

This is an iterative algorithm: it starts from guessed values for $A$, $B$ and $\pi$ and alternates E and M steps until convergence (for example, until the log-likelihood stops increasing). EM applied to HMMs, as described above, is known as the Baum-Welch algorithm. It only guarantees convergence to a *local* maximum of the likelihood, so it is common to train several times from different initial points in parameter space and keep the model with the highest final likelihood.
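The full loop, including random restarts, could be sketched as follows. This is one possible implementation under stated assumptions, not a reference one: the names `baum_welch` and `best_of_restarts`, the stopping tolerance `tol`, and the uniform random initialization are all choices made for this example. It uses the standard scaled forward-backward recursions (each $\alpha_t$ is normalized by a factor $c_t$) so that long sequences do not underflow, and the log-likelihood is then $\sum_t \log c_t$.

```python
import numpy as np

def _normalize(a):
    """Scale the last axis so that it sums to one."""
    return a / a.sum(axis=-1, keepdims=True)

def baum_welch(x, M, K, n_iter=50, tol=1e-6, seed=0):
    """Fit a discrete HMM to one sequence; returns (pi, A, B, log-likelihood history)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    T = len(x)
    # Random initial guess for the parameters
    pi = _normalize(rng.random(M))
    A = _normalize(rng.random((M, M)))
    B = _normalize(rng.random((M, K)))
    history = []
    for it in range(n_iter + 1):
        # E step: scaled forward-backward, c[t] = p(x_t | x_1, ..., x_{t-1})
        alpha, beta, c = np.zeros((T, M)), np.ones((T, M)), np.zeros(T)
        alpha[0] = pi * B[:, x[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1]) / c[t + 1]
        # Log-likelihood of the current parameters
        history.append(np.log(c).sum())
        # Stop before updating, so the returned parameters match history[-1]
        if it == n_iter or (it > 0 and history[-1] - history[-2] < tol):
            break
        gamma = alpha * beta
        xi = alpha[:-1, :, None] * A[None] * (B[:, x[1:]].T * beta[1:])[:, None, :]
        xi /= c[1:, None, None]
        # M step
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = (gamma.T @ np.eye(K)[x]) / gamma.sum(axis=0)[:, None]
    return pi, A, B, history

def best_of_restarts(x, M, K, n_restarts=5):
    """Run EM from several seeds and keep the fit with the highest final log-likelihood."""
    fits = [baum_welch(x, M, K, seed=s) for s in range(n_restarts)]
    return max(fits, key=lambda fit: fit[3][-1])
```

Since EM never decreases the likelihood, `history` should be non-decreasing up to rounding error, which makes it a convenient sanity check on an implementation.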

## Choosing the number of hidden states

When the states $z$ are truly unobserved, we do not know how many hidden states $M$ to use, so $M$ must be treated as a hyperparameter of the model. For example, we can choose the $M$ that maximizes the log-likelihood on held-out (validation) data; the training log-likelihood alone is not a good criterion, because it generally keeps increasing as $M$ grows.

Another possibility is to use the AIC or BIC, which penalize the number of parameters and are evaluated on the training set only (no validation set is needed):

$$ \mathrm{AIC} = 2p- 2 \log L\,,\\
\mathrm{BIC} = p\log N -2\log L\,,
$$
where $p$ is the number of free parameters, $L$ is the maximized likelihood and $N$ is the number of observations used to compute it. We train a model for each candidate $M$ and pick the one with the lowest (best) AIC or BIC.
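To make the criteria concrete, here is a small sketch for a discrete HMM with $M$ states and $K$ symbols. The helper names are hypothetical. The parameter count assumes that each row of $A$, each row of $B$ and $\pi$ lose one degree of freedom to the normalization constraint, and taking $N$ as the total number of observed symbols is a convention assumed here. The log-likelihood values in the usage part are made up purely for illustration.

```python
import numpy as np

def n_free_parameters(M, K):
    """Free parameters of a discrete HMM: rows of A, rows of B, and pi each sum to one."""
    return M * (M - 1) + M * (K - 1) + (M - 1)

def aic(log_likelihood, p):
    return 2 * p - 2 * log_likelihood

def bic(log_likelihood, p, n_obs):
    return p * np.log(n_obs) - 2 * log_likelihood

# Usage with made-up training log-likelihoods (illustrative values, not real results)
K, n_obs = 3, 400
log_likelihoods = {2: -520.0, 3: -498.0, 4: -495.0}  # keyed by M
scores = {
    M: (aic(ll, n_free_parameters(M, K)), bic(ll, n_free_parameters(M, K), n_obs))
    for M, ll in log_likelihoods.items()
}
best_M_aic = min(scores, key=lambda M: scores[M][0])
best_M_bic = min(scores, key=lambda M: scores[M][1])
```

Because $\log N > 2$ for $N \geq 8$, the BIC penalizes each extra parameter more heavily than the AIC on all but tiny data sets and therefore tends to select smaller $M$.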

## Baum-Welch algorithm for multiple observation sequences

We index each sequence in our training set by $n$, with $n=1,\dots,N$. For each of the $N$ training sequences we compute its own forward and backward variables $\alpha_n$ and $\beta_n$, its overall probability $P_n = p(x_n)$, and its length $T_n$, which can differ between sequences.

With these, the previous update formulas generalize as follows. They are now written explicitly in terms of $\alpha$ and $\beta$, with $\alpha_n(t,i)$ denoting $\alpha_t(i)$ for sequence $n$; the $A_{ij}$ and $B$ on the right-hand side are the current estimates:
$$
\pi_i = \frac{1}{N}\sum_{n=1}^N \frac{\alpha_n(1,i)\beta_n(1,i)}{P_n}\,,\\
A_{ij} = \frac{\sum_{n=1}^N \frac{1}{P_n} \sum_{t=1}^{T_n-1}\alpha_n(t,i)A_{ij}B_{j, x_n(t+1)}\beta_n(t+1,j)}{\sum_{n=1}^N \frac{1}{P_n} \sum_{t=1}^{T_n-1}\alpha_n(t,i)\beta_n(t,i)}\,,\\
B_{jk} = \frac{\sum_{n=1}^N \frac{1}{P_n} \sum_{t=1}^{T_n}\alpha_n(t,j)\beta_n(t,j) \mathbb{1}(x_n(t)=k)}{\sum_{n=1}^N \frac{1}{P_n} \sum_{t=1}^{T_n}\alpha_n(t,j)\beta_n(t,j)}\,.
$$
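One update over several sequences could look like the sketch below, which accumulates the numerators and denominators of the formulas above sequence by sequence. The names `forward_backward` and `baum_welch_update` are hypothetical, symbols are assumed to be integers starting at 0, and the recursions are unscaled to stay close to the formulas, so this is only meant for short sequences.

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Unscaled alpha, beta and sequence probability P (short sequences only)."""
    x = np.asarray(x)
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_update(sequences, pi, A, B):
    """One Baum-Welch update using all sequences; returns new (pi, A, B)."""
    M, K = B.shape
    pi_new = np.zeros(M)
    A_num, A_den = np.zeros((M, M)), np.zeros(M)
    B_num, B_den = np.zeros((M, K)), np.zeros(M)
    for x in sequences:
        x = np.asarray(x)
        alpha, beta, P = forward_backward(x, pi, A, B)
        # post[t, i] = alpha_n(t, i) beta_n(t, i) / P_n
        post = alpha * beta / P
        pi_new += post[0]
        # trans[t, i, j] = alpha_n(t, i) A_ij B_{j, x_n(t+1)} beta_n(t+1, j)
        trans = alpha[:-1, :, None] * A[None] * (B[:, x[1:]].T * beta[1:])[:, None, :]
        A_num += trans.sum(axis=0) / P
        A_den += post[:-1].sum(axis=0)
        B_num += post.T @ np.eye(K)[x]
        B_den += post.sum(axis=0)
    return pi_new / len(sequences), A_num / A_den[:, None], B_num / B_den[:, None]
```

Calling `baum_welch_update` repeatedly, feeding each result back in as the current parameters, gives the iterative procedure; the sequences may have different lengths because each contributes its own sums over $t$.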




```python
import numpy as np
import matplotlib.pyplot as plt
```