# Optimizing Hidden Markov Models

Definition: The Hidden Markov Model can be described by the following parameters

$$
\begin{array}{c}
H M M=\{N, M, A, B, \pi\} \\
O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\}
\end{array}
$$

where:
<br>
__N__ is the number of all the possible states (e.g. the words for speech recognition)
<br>
__M__ is the number of distinct observation symbols that can be emitted (e.g. spectra of acoustic signal for speech recognition)
<br>
__A__ is the state transition matrix - probability of moving from one state i to another state j
<br>
__B__ is the distribution of probabilities of seeing one of the observable symbols given that one is in a particular state (exists for each state)
<br>
$\pi$ is the initial state distribution: the probability of beginning in a particular state
<br>
__O__ is the observations made, where the index indicates at what point in time the observations were made
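
As a minimal sketch of how these parameters could be held in code, the block below sets up a toy model with $N=2$ states and $M=3$ symbols and checks that $\pi$ and every row of $A$ and $B$ are probability distributions. All numbers and the helper names `is_distribution` and `is_valid_hmm` are invented for illustration, and states and symbols are indexed from 0 rather than 1.

```python
# Toy HMM: N = 2 hidden states, M = 3 observable symbols.
# Every number here is made up purely for illustration.
N = 2
M = 3
A = [[0.7, 0.3],        # A[i][j] = P(next state j | current state i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],   # B[i][k] = P(symbol k | state i)
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]         # pi[i] = P(first state is i)
O = [0, 1, 2]           # an observation sequence of length T = 3


def is_distribution(p, tol=1e-9):
    """True if p is non-negative and sums to 1 (within tol)."""
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < tol


def is_valid_hmm(A, B, pi):
    """Check shapes and that pi and each row of A and B are distributions."""
    n = len(pi)
    return (len(A) == n and len(B) == n
            and all(len(row) == n for row in A)
            and is_distribution(pi)
            and all(is_distribution(row) for row in A)
            and all(is_distribution(row) for row in B))


print(is_valid_hmm(A, B, pi))
```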

## Three key problems to solve

(1) What is the probability that a model $\lambda=(A, B, \pi)$ generated a sequence of observations $O =(O_{1}, O_{2}, O_{3},...,O_T)$?

$$
P(O \mid \lambda)?
$$

e.g. which model best fits the observations - was a particular sentence written by a given author?

<br>
<br>
(2) Given a model, $\lambda=(A, B, \pi)$, what sequence of states, $Q=\left\{q_{1}, q_{2}, q_{3}, \ldots, q_{T}\right\}$,  best explains a sequence of observations $O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\}$ ?


e.g. what sequence of words best explains a series of sound spectra?

<br>
<br>
<br>

(3) Given a set of observation sequences, how do we learn the model probabilities that would generate them (how do we learn the parameters of the model)? 

$$O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\} \quad \lambda=(A, B, \pi) ?$$

## Forward-backward algorithm: what is the probability that a particular model generated a sequence of observations?



Let's start by imagining all possible state sequences
<br>
$$
Q=q_{1}, q_{2}, q_{3}, \ldots, q_{T}
$$

Given the model and the observations

<br>
$$O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\} \quad \lambda=(A, B, \pi)$$

__Probability of seeing observations given those states is__
$$
P(O \mid Q, \lambda)=\prod_{t=1}^{T} P\left(O_{t} \mid q_{t}, \lambda\right)
$$

Which can be written as 

<br>
$$
P(O \mid Q, \lambda)=b_{q_{1}}\left(O_{1}\right) \cdot b_{q_{2}}\left(O_{2}\right) \cdots b_{q_{T}}\left(O_{T}\right)
$$

<br>
where $b_{q_{t}}\left(O_{t}\right)$ is the probability of a particular observation in a particular state, e.g. the probability of a particular phone for a particular word

<br>

e.g. $b_{q_{1}}\left(O_{1}\right)$ tells us the probability of seeing observation $O_1$ when we are in state $q_1$

<br>

__Probability of seeing those state transitions, given the other parameters in the model__

<br>
$$P(Q \mid \lambda)=\pi_{q_{1}} a_{q_{1} q_{2}} a_{q_{2} q_{3}} \cdots a_{q_{T-1} q_{T}}$$
<br>
where $\pi_{q_{1}}$ is the probability of starting in state $q_1$ and $a_{q_{1} q_{2}}$ is the probability of transitioning from $q_1$ to $q_2$

We now have an expression for the probability of seeing a particular sequence of observations given a state sequence, and an expression for the probability of that state sequence itself.

__Joint probability of seeing those observations AND those state transitions is__
$$
P(O, Q \mid \lambda)=P(O \mid Q, \lambda) P(Q \mid \lambda)
$$

But we want the probability of the observations regardless of the particular state sequence, so we have to iterate over __all__ possible state sequences
$$
P(O \mid \lambda)=\sum_{\text {all } Q} P(O \mid Q, \lambda) P(Q \mid \lambda)
$$

$$
P(O \mid \lambda)=\sum_{q_{1}, q_{2}, \ldots, q_{T}} \pi_{q_{1}} b_{q_{1}}\left(O_{1}\right) a_{q_{1} q_{2}} b_{q_{2}}\left(O_{2}\right) a_{q_{2} q_{3}} \cdots a_{q_{T-1} q_{T}} b_{q_{T}}\left(O_{T}\right)
$$


i.e. summing over all possible state sequences
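
The sum over all state sequences can be written down directly, which is useful as a reference for small examples even though it does not scale. The sketch below assumes a toy model with invented numbers and a hypothetical function name `brute_force_likelihood`; states and symbols are 0-indexed.

```python
from itertools import product


def brute_force_likelihood(A, B, pi, obs):
    """P(O | lambda) by enumerating every one of the N**T state sequences."""
    N = len(pi)
    total = 0.0
    for Q in product(range(N), repeat=len(obs)):
        # pi_{q1} * b_{q1}(O_1)
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        # a_{q(t-1) q(t)} * b_{q(t)}(O_t) for the remaining time steps
        for t in range(1, len(obs)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        total += p
    return total


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

print(brute_force_likelihood(A, B, pi, [0, 1, 2]))
```

A quick sanity check on such a function is that the likelihoods of all possible observation sequences of a fixed length add up to 1.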

<br>


__Calculating this is infeasible__ 

How many state sequences are there? $N^{T}$
How many multiplications per state sequence?
$$
2 T-1
$$
Total number of operations (the multiplications above plus $N^{T}-1$ additions to sum over the sequences)?
$$
(2 T-1) N^{T}+\left(N^{T}-1\right)
$$


With $T=100$ and $N=5$, how many operations?
$$
\begin{array}{l}
(2 T-1) N^{T}+\left(N^{T}-1\right) \\
(2(100)-1) 5^{100}+\left(5^{100}-1\right) \\
199 \cdot 5^{100}+5^{100}-1 \\
200 \cdot 5^{100}-1 \\
\approx 5^{103} \\
\approx 10^{72}
\end{array}
$$
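
The arithmetic above can be checked exactly with Python's arbitrary-precision integers. This is only a sketch of the count as defined in this section; the function name `direct_operation_count` is made up for the example.

```python
def direct_operation_count(N, T):
    """(2T - 1) multiplications for each of N**T sequences, plus N**T - 1 additions."""
    return (2 * T - 1) * N ** T + (N ** T - 1)


ops = direct_operation_count(5, 100)

# The simplification used in the text: 199 * 5^100 + 5^100 - 1 = 200 * 5^100 - 1
print(ops == 200 * 5 ** 100 - 1)

# The number of decimal digits gives the order of magnitude.
print(len(str(ops)))
```

The result has 73 digits, i.e. it is a little above $10^{72}$, in line with the estimate.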

__Need a better way to calculate this, which is the forward-backward algorithm__

### Forward backward algorithm 
#### Forward Algorithm - Alpha helper function


Main motivation: the formula above for the likelihood of an observation sequence given a model repeats a lot of calculations. We can reduce the number of repeated calculations by introducing the helper function alpha (note that $\alpha$ is different from the transition probability $a$).

$$
\alpha_{t}(i)=P\left(O_{1}, O_{2}, O_{3}, \ldots, O_{t}, q_{t}=S_{i} \mid \lambda\right)
$$

$\alpha_{t}(i)$ is the probability of seeing observations $O_1, \ldots, O_t$ and ending up in state $S_i$ at time $t$, given our model.

The helper function only considers observations up to time $t$ (lower case $t$, not the final time $T$) and fixes the state at that time: it accounts for all possible state sequences up to time $t$ that end in state $S_i$, where $i$ is the index of that state.

__Calculate inductively (iteratively)__:

__(1) base case:__
$$\alpha_{1}(i)=\pi_{i} b_{i}\left(O_{1}\right) \quad 1 \leq i \leq N$$

__(2) inductive step:__
$$\alpha_{t+1}(j)=\left[\sum_{i=1}^{N} \alpha_{t}(i) a_{i j}\right] b_{j}\left(O_{t+1}\right) \quad \begin{array}{l}1 \leq t \leq T-1 \\ 1 \leq j \leq N\end{array}$$



<a href="http://drive.google.com/uc?export=view&id=1uFCxHKH072QOQ6BW66ZgYEgBDI8XhLSv"><img src="https://drive.google.com/uc?export=view&id=11erhIXSww_4BAa27mpWY7T_ZN7En5TM_" width="300px"></a>

__(3) final step:__
$$
P(O \mid \lambda)=\sum_{i=1}^{N} \alpha_{T}(i)
$$


Finally sum up over all the possible states that we could have ended up in
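
The three steps (base case, inductive step, final sum) can be sketched in a few lines. This is a minimal illustration on a toy model with invented numbers, not a production implementation: it uses 0-based indices and does no scaling, so it would underflow on long sequences. The names `forward` and `likelihood` are chosen for the example.

```python
def forward(A, B, pi, obs):
    """Return alpha as a list of T rows; alpha[t][i] matches alpha_{t+1}(i) in the text."""
    N = len(pi)
    # (1) base case: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # (2) inductive step: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    return alpha


def likelihood(alpha):
    """(3) final step: P(O | lambda) = sum_i alpha_T(i)."""
    return sum(alpha[-1])


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

alpha = forward(A, B, pi, obs)
print(likelihood(alpha))
```

Each row of `alpha` is one column of the lattice: computing it touches every pair of states once, which is where the $N^2$ cost per time step comes from.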

__We are effectively using a lattice of calculations__



<a href="http://drive.google.com/uc?export=view&id=14i6uKOVcSGOHXrlVyJl2s-qtVUo4RMTm"><img src="https://drive.google.com/uc?export=view&id=1zOuBNuZAfl3UfVVIXZn4rk5S-Xgmk8dr" width="300px"></a>

In the _final step_ we are summing over the final column.

At each time step we have to calculate the flow of probability from each of the N states to each of the next N states, hence $N^2$ operations. We have to do this for each of our T observations.

Hence the total number of calculations is roughly:


$$
\begin{array}{l}
O\left(N^{2} T\right) \\
N=5 \\
T=100 \\
\quad \approx 3000 \text { calculations }
\end{array}
$$

#### Backward Algorithm - Beta helper function

Moving backward step by step to answer the question: what is the probability of seeing the remaining (future) observations, given that we are in state $S_i$ at time $t$? (This cannot be done in real time, only once the whole observation sequence is known.)

$$
\beta_{t}(i)=P\left(O_{t+1}, O_{t+2}, \cdots O_{T} \mid q_{t}=S_i, \lambda\right)
$$

We solve this inductively


base case: $$\quad \beta_{T}(i)=1 \quad 1 \leq i \leq N$$

inductive (recursive) step:
$$
\begin{array}{r}
\beta_{t}(i)=\sum_{j=1}^{N} a_{i j} b_{j}\left(O_{t+1}\right) \beta_{t+1}(j) \\
t=T-1, T-2, \cdots, 1 \quad 1 \leq i \leq N
\end{array}
$$

$\beta_{t}(i)$ tells us the probability of the observations that are still to come, given that we are in state $S_i$ at time $t$. We first have to calculate $\beta_{t+1}$ and then move backwards step by step.
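
A minimal sketch of the backward recursion, on a toy model with invented numbers and 0-based indices. As a consistency check it recovers $P(O \mid \lambda)$ as $\sum_i \pi_i b_i(O_1) \beta_1(i)$, which should agree with the value obtained from the forward pass. The function name `backward` is chosen for the example.

```python
def backward(A, B, obs):
    """Return beta as a list of T rows; beta[t][i] matches beta_{t+1}(i) in the text."""
    N, T = len(A), len(obs)
    # base case: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # inductive step, moving from t = T-1 down to t = 1 (0-based: T-2 down to 0)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

beta = backward(A, B, obs)
p_obs = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(p_obs)
```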


<a href="http://drive.google.com/uc?export=view&id=1OVHjHsQEDGOWO1rftIqU-2fDW5CtxgjZ"><img src="https://drive.google.com/uc?export=view&id=1a1JPutEjOFYlbID3NbmVbXD4xbiAL1eu" width="300px"></a>



## The Viterbi Algorithm

(2) Given a model, $\lambda=(A, B, \pi)$, what sequence of states, $Q=\left\{q_{1}, q_{2}, q_{3}, \ldots, q_{T}\right\}$,  best explains a sequence of observations $O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\}$ ?

e.g. the observations are frequencies that we are hearing and hidden states are components of words that generated these sounds.

By best we mean the sequence of states that has the highest likelihood given the observations.



1) choose states that are __individually__ most likely
$$
\begin{array}{c}
\gamma_{t}(i)=P\left(q_{t}=S_{i} \mid O, \lambda\right) \\
\gamma_{t}(i)=\frac{\alpha_{t}(i) \beta_{t}(i)}{P(O \mid \lambda)}=\frac{\alpha_{t}(i) \beta_{t}(i)}{\sum_{j=1}^{N} \alpha_{t}(j) \beta_{t}(j)} \\
\sum_{i=1}^{N} \gamma_{t}(i)=1
\end{array}
$$

where we divide by the total sum to normalize and ensure that the probabilities add up to 1. And where gamma gives us the probability of being in a state at a particular point in time.

$$
q_{t}=\underset{1 \leq i \leq N}{\operatorname{argmax}}\left[\gamma_{t}(i)\right], \quad 1 \leq t \leq T
$$
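
The per-time-step choice can be sketched by combining a forward and a backward pass. The block below is an illustration on a toy model with invented numbers; `posterior_states` is a hypothetical helper name, indices are 0-based, and no scaling is applied.

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta


def posterior_states(A, B, pi, obs):
    """gamma[t][i] = P(q_t = S_i | O, lambda), normalised at each time step."""
    gamma = []
    for a_t, b_t in zip(forward(A, B, pi, obs), backward(A, B, obs)):
        weights = [a * b for a, b in zip(a_t, b_t)]
        norm = sum(weights)  # equals P(O | lambda) for every t
        gamma.append([w / norm for w in weights])
    return gamma


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

gamma = posterior_states(A, B, pi, obs)
# Individually most likely state at each time step (argmax over i).
path = [max(range(len(g)), key=g.__getitem__) for g in gamma]
print(path)
```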



<a href="http://drive.google.com/uc?export=view&id=1pqM6JrA-xR4SCHfZsMZve_v0MufOFwSB"><img src="https://drive.google.com/uc?export=view&id=15Ikc1GgaGk9zYz5dfmWu42P4yYfrBVcl" width="300px"></a>


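But consecutive states chosen this way may not be connected (the transition between them can have zero probability), so the result is not necessarily the best, or even a valid, sequence: we solved each time step independently.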

__Choose the sequence that maximises the probability - the Viterbi algorithm__

Choose the max at each time step given the max chosen at the previous time step.


What is the highest probability of any single path that accounts for the first $t$ observations and ends at state $S_{i}$?
$$
\delta_{t}(i)=\max _{q_{1}, q_{2}, \cdots, q_{t-1}} P\left(\left\{q_{1}, q_{2}, q_{3}, \cdots, q_{t}=S_{i}\right\},\left\{O_{1}, O_{2}, O_{3}, \cdots, O_{t}\right\} \mid \lambda\right)
$$
induction step
$$
\delta_{t+1}(j)=\left[\max _{i} \delta_{t}(i) a_{i j}\right] \cdot b_{j}\left(O_{t+1}\right)
$$
We need to keep track of which $i$ maximized the result at each time step, i.e. which state we came from. This is stored in $\psi_{t}(j)$.


Initialization 

$$\delta_{1}(i)=\pi_{i} b_{i}\left(O_{1}\right)$$
$$
\psi_{1}(i)=0
$$
Inductive step
$$
\begin{array}{rlr}
\delta_{t}(j) & =\max _{1 \leq i \leq N}\left[\delta_{t-1}(i) a_{i j}\right] \cdot b_{j}\left(O_{t}\right) & 2 \leq t \leq T \\
\psi_{t}(j) & =\underset{1 \leq i \leq N}{\operatorname{argmax}}\left[\delta_{t-1}(i) a_{i j}\right] & 1 \leq j \leq N
\end{array}
$$
Termination
$$
P^{*}=\max _{1 \leq i \leq N}\left[\delta_{T}(i)\right]
$$
$$
q_{T}^{*}=\underset{1 \leq i \leq N}{\operatorname{argmax}}\left[\delta_{T}(i)\right] \quad q_{t}^{*}=\psi_{t+1}\left(q_{t+1}^{*}\right), \quad t=T-1, T-2, \cdots, 1
$$
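
The initialization, inductive step, termination and backtracking can be sketched as follows. This is a minimal illustration on a toy model with invented numbers; it uses 0-based indices and plain probabilities, whereas practical implementations usually work with log-probabilities to avoid underflow. The function name `viterbi` is chosen for the example.

```python
def viterbi(A, B, pi, obs):
    """Return (best state path, its probability P*)."""
    N, T = len(pi), len(obs)
    # Initialization: delta_1(i) = pi_i * b_i(O_1), psi_1(i) = 0
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # Inductive step: remember the best predecessor for every state j
    for t in range(1, T):
        delta_t, psi_t = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi_t.append(best_i)
            delta_t.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(delta_t)
        psi.append(psi_t)
    # Termination: best final state and its probability
    last = max(range(N), key=lambda i: delta[-1][i])
    p_star = delta[-1][last]
    # Backtracking: q*_t = psi_{t+1}(q*_{t+1})
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, p_star


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

path, p_star = viterbi(A, B, pi, obs)
print(path, p_star)
```

Structurally this is the forward recursion with the sum over predecessors replaced by a max, plus the bookkeeping in `psi`.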

## The Baum-Welch Algorithm

__(3) Given a set of observation sequences, how do we learn the model probabilities that would generate them (how do we learn the parameters of the model)?__ 

$$O=\left\{O_{1}, O_{2}, O_{3}, \ldots, O_{T}\right\} \quad \lambda=(A, B, \pi) ?$$

- There is no known way to solve analytically for the globally optimal parameters of lambda
- We will search for a locally optimal result
- A result that converges to a stable good answer but isn't guaranteed to be the best answer (depends on initialization).



- This is an EM method
- Expectation-Maximization (EM)
- Iterative, converges to a local optimum
- Comparable to gradient-based hill climbing on the likelihood, although EM itself does not compute gradients


<a href="http://drive.google.com/uc?export=view&id=1_dNL0mOAfiEPzNpmaXkbNDHL8k9Br0pC"><img src="https://drive.google.com/uc?export=view&id=1fepHmt1pg4x_NShj5jC1DWLZTwW7U7Lf" width="300px"></a>


We need a new mathematical tool
$$
\xi_{t}(i, j)=P\left(q_{t}=S_{i}, q_{t+1}=S_{j} \mid O, \lambda\right)
$$

Probability of being in state $S_i$ at time $t$ and transitioning into state $S_j$ at time $t+1$, given the observations and the model.


<a href="http://drive.google.com/uc?export=view&id=1QfNEwmnTYR5CdlIFVam842W1lsTWLcmw"><img src="https://drive.google.com/uc?export=view&id=1pTZSg44rdktNxVtn60hJUEyupK-3lZxU" width="300px"></a>


$$
\begin{array}{l}
\xi_{t}(i, j)=\frac{\alpha_{t}(i) a_{i j} b_{j}\left(O_{t+1}\right) \beta_{t+1}(j)}{P(O \mid \lambda)} \\
\xi_{t}(i, j)=\frac{\alpha_{t}(i) a_{i j} b_{j}\left(O_{t+1}\right) \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{t}(i) a_{i j} b_{j}\left(O_{t+1}\right) \beta_{t+1}(j)}
\end{array}
$$

with gamma ($\gamma$) on the left and xi ($\xi$) on the right


<a href="http://drive.google.com/uc?export=view&id=1xYxaAwdX6_oBwsfI3M3AYDflhngjVpp8"><img src="https://drive.google.com/uc?export=view&id=1vTOsS7w-YaEzcMJg2DDMcFLTmVxEMGB9" width="300px"></a>

We can move visually between the two sides by doing the summation



$\xi_{t}(i, j)$ is related to $\gamma_{t}(i)$
$$
\gamma_{t}(i)=\sum_{j=1}^{N} \xi_{t}(i, j)
$$
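
The relation between $\xi$ and $\gamma$ can be checked numerically. The sketch below computes both from a forward and a backward pass on a toy model with invented numbers; `transition_posteriors` is a hypothetical helper name and indices are 0-based.

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta


def transition_posteriors(A, B, pi, obs):
    """Return (xi, gamma); xi[t][i][j] is defined for the first T-1 time steps."""
    N, T = len(pi), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])  # P(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    return xi, gamma


# Toy model; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

xi, gamma = transition_posteriors(A, B, pi, obs)
# Summing xi over the destination state j should reproduce gamma.
print(all(abs(sum(xi[t][i]) - gamma[t][i]) < 1e-12
          for t in range(len(xi)) for i in range(len(pi))))
```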



<br>
If we sum $\gamma_{t}(i)$ over all time steps $t$, we get a number that can be treated as the expected number of times $S_{i}$ is visited (an expected count, not a probability).

<br>
$\xi_{t}(i, j)$ is the probability of transitioning from $S_{i}$ to $S_{j}$ at the specific time $t$.

<br>
If we sum $\xi_{t}(i, j)$ over all $t$, we get a number that can be treated as the expected number of times $S_{i}$ transitions to $S_{j}$.

$$
\begin{array}{l}
\sum_{t=1}^{T-1} \gamma_{t}(i)=\text { expected number of transitions from } S_{i} \\
\sum_{t=1}^{T-1} \xi_{t}(i, j)=\text { expected number of transitions from } S_{i} \text { to } S_{j}
\end{array}
$$

How do we use these to improve our model (generate updates)?
$$
\bar{\lambda}=?
$$


$$
\bar{\pi}_{i}=\gamma_{1}(i)=\text { expected frequency in } S_{i} \text { at time }(t=1)
$$

$$
\begin{array}{c}
\bar{a}_{i j}=\frac{\text { expected number of transitions from } S_{i} \text { to } S_{j}}{\text { expected number of transitions from } S_{i}} \\
\bar{a}_{i j}=\frac{\sum_{t=1}^{T-1} \xi_{t}(i, j)}{\sum_{t=1}^{T-1} \gamma_{t}(i)}
\end{array}
$$


calculated from the previous lambda and our observations

$$
\bar{b}_{j}(k)=\frac{\text { expected number of times in state } j \text { and observing } v_{k}}{\text { expected number of times in state } \mathrm{j}}
$$

where $v_k$ is the $k$-th observable symbol, i.e. the particular symbol we are interested in

 
$$
\bar{b}_{j}(k)=\frac{\sum_{t=1,\ O_{t}=v_{k}}^{T} \gamma_{t}(j)}{\sum_{t=1}^{T} \gamma_{t}(j)}
$$

$$
\begin{aligned}
&\text { Given } \lambda=(A, B, \pi) \text { and } O \text { we can produce } \alpha_{t}(i), \beta_{t}(i), \gamma_{t}(i), \xi_{t}(i, j)\\
&\text { Given } \alpha_{t}(i), \beta_{t}(i), \gamma_{t}(i), \xi_{t}(i, j) \text { we can produce } \bar{\lambda}=(\bar{A}, \bar{B}, \bar{\pi})
\end{aligned}
$$
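
One full re-estimation step, following the update formulas above, can be sketched as below. This is an illustration under several assumptions: a single observation sequence, a toy model with invented numbers in which every probability is strictly positive (so no denominator is zero), 0-based indices, and no scaling. The function name `baum_welch_step` is chosen for the example.

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta


def baum_welch_step(A, B, pi, obs):
    """One update lambda -> lambda-bar; also returns P(O | lambda) for the input model."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # pi-bar_i = gamma_1(i)
    new_pi = gamma[0][:]
    # a-bar_ij = sum_t xi_t(i, j) / sum_t gamma_t(i), with t = 1 .. T-1
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    # b-bar_j(k) = sum of gamma_t(j) over the t where O_t = v_k / sum_t gamma_t(j)
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi, p_obs


# Toy starting model and observations; all numbers are made up for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2, 2, 0, 1, 0, 2]

history = []            # likelihood of obs under the model before each update
model = [A, B, pi]
for _ in range(5):
    *model, p_obs = baum_welch_step(*model, obs)
    history.append(p_obs)
A_new, B_new, pi_new = model
print(history)
```

The recorded likelihoods should never decrease from one iteration to the next, which is the sense in which the procedure climbs to a local optimum.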


<a href="http://drive.google.com/uc?export=view&id=1Qj5JDwCky7xxkSR2PHMR_92_UmdB98J0"><img src="https://drive.google.com/uc?export=view&id=1GMQsatf44KCuXpDxVGruQ7HPyW1i6VgF" width="300px"></a>