# Training an HMM when the hidden states are not really hidden (supervised version)

This is the case when the "hidden" states are actually known: they are part of the observations in the dataset. To understand this, consider part-of-speech tagging (POS tagging); speech recognition is another case where this would apply.

Our dataset consists of sentences as features (the $x$s), but this problem is supervised and so the target will be included in the dataset. This target variable is a label that, for each word, gives us its POS. For example:

    bananas -> Noun
    like -> Verb

These labels are the corresponding $z$ s or hidden states, but in this example they are not hidden! They are known for the training set (supervised problem), and their meaning is conceptually clear. Basically, this makes the problem boil down to a regular (non-hidden) Markov model.

So here we will be assuming that we do know which hidden states correspond to the observed states (at least for the training set). In other notebooks we will also consider the harder problem in which the $z$ s are actually hidden (unsupervised problem).


To train an HMM in this scenario, we just need to count. Training simply means finding the parameters $\pi$, $B$ and $A$ that are used to do inference (see notebooks on the forward, backward and Viterbi algorithms).

For example, let's start with the $z$ s and compute $A_{ij} = p(z_t=j|z_{t-1}=i)$. This is simple if we do have the states $z$:

$$ A_{ij} = \frac{count(i\rightarrow j)}{count(i)}\,, $$

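So $A_{ij}$ is the number of times that, in the dataset, state $i$ is followed by state $j$, divided by the number of times state $i$ appears with a successor (that is, not as the last state of a sequence), so that each row of $A$ sums to one. As for $\pi_i$: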

$$ \pi_i = \frac{count(z_1=i)}{N}\,,$$

where $N$ is the number of sequences in the dataset. This is the maximum likelihood estimator for the initial state probability. Finally, we estimate $B$ as:

$$B_{ik} = p(x_t=k|z_t=i) = \frac{count(z=i \wedge x=k)}{count(z=i)}\,,$$

where $\wedge$ is the logical `AND` operator.

And that's it. Training in this case consists of counting the number of times that certain events occur.
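The counting recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fit_supervised_hmm`, the toy corpus and the dictionary-based representation of $\pi$, $A$ and $B$ are all assumptions made for this example. Note that the denominator of $A$ counts only the occurrences of state $i$ that have a successor, so that each row sums to one.

```python
from collections import Counter

def fit_supervised_hmm(tagged_sentences):
    """Estimate pi, A and B by counting.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns three dicts: pi[tag], A[(tag_i, tag_j)], B[(tag, word)].
    """
    start = Counter()        # count(z_1 = i)
    trans = Counter()        # count(i -> j)
    trans_from = Counter()   # number of transitions leaving state i
    emit = Counter()         # count(z = i AND x = k)
    state_count = Counter()  # count(z = i)

    for sentence in tagged_sentences:
        if not sentence:
            continue
        tags = [tag for _, tag in sentence]
        start[tags[0]] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1
            trans_from[prev] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            state_count[tag] += 1

    n_sequences = sum(start.values())
    pi = {i: c / n_sequences for i, c in start.items()}
    A = {(i, j): c / trans_from[i] for (i, j), c in trans.items()}
    B = {(i, k): c / state_count[i] for (i, k), c in emit.items()}
    return pi, A, B

# Toy corpus (made up for illustration).
corpus = [
    [("bananas", "Noun"), ("like", "Verb"), ("sunshine", "Noun")],
    [("monkeys", "Noun"), ("like", "Verb"), ("bananas", "Noun")],
    [("run", "Verb")],
]
pi, A, B = fit_supervised_hmm(corpus)
```

Pairs that never occur in the training set are simply absent from the dictionaries, which corresponds to an estimated probability of zero. In practice one would probably add some smoothing, which is not covered here.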

## Derivation using Maximum Likelihood estimation

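The above formulas can be derived from maximum likelihood estimation. We write only the part of the likelihood that involves the states, which determines $\pi$ and $A$ (the emission factors $p(x_t|z_t)$ separate out and give $B$ in the same way). Using the Markov property on the joint distribution, it reads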

$$ 
L = \prod_{n=1}^N p(z^{(n)}_1) \prod_{t=2}^{T_n} p(z_{t}^{(n)}|z_{t-1}^{(n)})\,,
$$

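where $n$ labels each one of the $N$ sequences in our training set and $T_n$ is the length of sequence $n$. Now, we assume that the initial state follows a **categorical distribution** with probabilities $\pi_i$ (and, likewise, that each transition out of state $i$ is categorical with probabilities $A_{ij}$):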

$$ f(z; \pi) = \pi_1^{\mathbb{1}(z=1)} \pi_2^{\mathbb{1}(z=2)}\cdots\pi_M^{\mathbb{1}(z=M)} = \prod_{i=1}^M \pi_i^{\mathbb{1}(z=i)}\,,$$

where $\mathbb{1}(z=i)$ is a function that equals 1 if $z=i$ and zero otherwise, so that $f(z=k; \pi) =\pi_k$. Hence 

$$p(z^{(n)}_1) = \prod_{i=1}^M \pi_i^{\mathbb{1}(z^{(n)}_1=i)}$$

and 

$$ p(z_t|z_{t-1}) = \prod_{i=1}^M \prod_{j=1}^M A^{\mathbb{1}(z_{t-1}=i \wedge z_t=j)}_{ij} $$

Instead of the likelihood, we consider the log-likelihood

$$ l(\pi, A) = \log  L = \sum_{n=1}^N \left[ \sum_{i=1}^M \mathbb{1}(z^{(n)}_1=i) \log \pi_i  +\sum_{t=2}^{T_n}\sum_{i=1}^M\sum_{j=1}^M \mathbb{1}(z^{(n)}_{t-1}=i \wedge z^{(n)}_t=j) \log A_{ij}\right] $$
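As a small illustration, the log-likelihood can be evaluated numerically. In the sketch below, states are assumed to be encoded as integers `0..M-1`, `pi` is a list and `A` a nested list; the function name `state_log_likelihood` and the numbers are made up for this example. The indicator functions select exactly one term per time step, so the sums over $i$ and $j$ reduce to simple lookups.

```python
import math

def state_log_likelihood(sequences, pi, A):
    """Log-likelihood of fully observed state sequences under (pi, A)."""
    total = 0.0
    for z in sequences:
        # Initial-state term: log pi_{z_1}
        total += math.log(pi[z[0]])
        # Transition terms: log A_{z_{t-1}, z_t} for t = 2, ..., T_n
        for prev, cur in zip(z, z[1:]):
            total += math.log(A[prev][cur])
    return total

# Hypothetical parameters and two short state sequences.
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
sequences = [[0, 0, 1], [1, 1]]
ll = state_log_likelihood(sequences, pi, A)
```

Here `ll` equals $\log(0.5 \cdot 0.9 \cdot 0.1) + \log(0.5 \cdot 0.8)$, one factor per initial state and per transition.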

The maximization problem is clearly constrained, as we need to impose 

$$ \sum_{i=1}^M \pi_i = 1\,,\hspace{1cm} \sum_{j=1}^M A_{ij} =1\,, \forall\, i=1,\dots, M\,, \hspace{1cm} \pi_i \ge 0\,, \hspace{1cm} A_{ij} \ge 0$$

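The last two conditions need not be imposed explicitly: the logarithm is only defined for positive arguments, and the resulting estimates, being ratios of counts, are non-negative. As for the first two, we impose them by adding Lagrange multipliers, so that our Lagrangian reads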

$$ R = l(\pi, A) + \sum_{i=1}^M \alpha_i \left(1-\sum_{j=1}^M A_{ij}\right) + \beta \left(1-\sum_{i=1}^M \pi_i \right)$$

We finally have to solve for $\pi, A$ (and the Lagrange multipliers) by imposing the stationarity conditions $\frac{\partial R}{\partial \alpha_i} = 0$, $\frac{\partial R}{\partial \beta} = 0$, $\frac{\partial R}{\partial \pi_i} = 0$, $\frac{\partial R}{\partial A_{ij}} = 0$.

The result, after a little bit of algebra, is summarized in the equations above for $\pi, A$.
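
The algebra can be sketched as follows. The count symbols $n_i$ and $n_{ij}$ are shorthand introduced here for convenience; they are not part of the notation used elsewhere in this notebook. Define

$$ n_i = \sum_{n=1}^N \mathbb{1}(z^{(n)}_1=i)\,, \hspace{1cm} n_{ij} = \sum_{n=1}^N \sum_{t=2}^{T_n} \mathbb{1}(z^{(n)}_{t-1}=i \wedge z^{(n)}_t=j)\,, $$

so that $l(\pi, A) = \sum_i n_i \log \pi_i + \sum_{i,j} n_{ij} \log A_{ij}$. The derivatives with respect to the parameters give

$$ \frac{\partial R}{\partial \pi_i} = \frac{n_i}{\pi_i} - \beta = 0 \;\Rightarrow\; \pi_i = \frac{n_i}{\beta}\,, \hspace{1cm} \frac{\partial R}{\partial A_{ij}} = \frac{n_{ij}}{A_{ij}} - \alpha_i = 0 \;\Rightarrow\; A_{ij} = \frac{n_{ij}}{\alpha_i}\,, $$

while the derivatives with respect to the multipliers return the normalization constraints. Substituting into those constraints fixes the multipliers,

$$ \beta = \sum_{i=1}^M n_i = N\,, \hspace{1cm} \alpha_i = \sum_{j=1}^M n_{ij}\,, $$

and therefore

$$ \pi_i = \frac{n_i}{N}\,, \hspace{1cm} A_{ij} = \frac{n_{ij}}{\sum_{j'=1}^M n_{ij'}}\,. $$

Here $n_i$ is $count(z_1=i)$ and $n_{ij}$ is $count(i\rightarrow j)$. The denominator of $A_{ij}$ is the number of transitions leaving state $i$, and the expression assumes that state $i$ has at least one successor in the training set.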
