# Hidden Markov Model

## motivation

A Hidden Markov Model (HMM) is a statistical model that represents a system undergoing a sequence of hidden (latent, unobservable) states. 

An HMM can be viewed as a weighted finite-state transducer:

- transitions between the hidden states follow a **Markov chain**. transitions encode possible sequences of states, e.g., ADJ-NOUN, NOUN-VERB

- hidden states encode the most recent history and emit observable symbols (observations) according to state-specific probability distributions.

- observations are sequential data, i.e. a sequence of dependent random variables, e.g., text, weather reports, stock market numbers

## Markov chain

A Markov chain is a stochastic process that models a sequence of events.

Markov/memoryless property: the probability of transitioning from one state to another **depends only on the current state and not on the previous states**.


A Markov chain is characterized by:

1. A finite set of states: $S = \{s_1, s_2, \dots, s_N\}$

2. Transition probabilities: $P = \{p_{ij}\}$, where $p_{ij}$ is the probability of transitioning from state $s_i$ to state $s_j$, with $\sum_{j=1}^N p_{ij} = 1 \ \forall i$
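The two ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state weather chain, its probabilities, and the helper names `check_row_stochastic` and `simulate_chain` are all assumptions made up for the example.

```python
import random

# Hypothetical two-state chain; names and numbers are illustrative only.
states = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def check_row_stochastic(P, tol=1e-9):
    """Each row p_i. must be a probability distribution (sums to 1)."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in P.values())

def simulate_chain(P, start, steps, rng):
    """Sample a path; the next state is drawn from the current state's row only."""
    path = [start]
    for _ in range(steps):
        row = P[path[-1]]
        nxt = rng.choices(list(row.keys()), weights=list(row.values()), k=1)[0]
        path.append(nxt)
    return path

rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate_chain(P, "sunny", 10, rng)
print(check_row_stochastic(P), path)
```

The memoryless property shows up in `simulate_chain`: only `path[-1]` is consulted when sampling the next state.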

## definition

A hidden Markov model $\mu=(A, B, \Pi)$ is defined by initial state probabilities $\Pi$, state transition probabilities $A$, and symbol emission probabilities $B$.

- states $\mathcal{Q}=\left\{q_0, q_1, ..., q_f\right\}$: a finite set of $N$ states, where $q_0$ and $q_f$ are the initial state and the final state. (in the sections below, $q_t$ denotes the state occupied at time step $t$.)

- observations $O=\left\{o_1, o_2, ..., o_T\right\}$: a sequence of $T$ observations, each one drawn from a vocabulary $\mathcal{V}=\{v_1, ..., v_{|\mathcal{V}|}\}$

- state transition probabilities $A \in \mathbb{R}^{N \times N}$: a transition probability matrix. each $a_{ij}$ represents the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^N a_{ij}=1 \ \forall i$

- symbol emission probabilities $B=\left\{b_i(o_t)\right\}$: a sequence of observation likelihoods, where $b_i(o_t)$ is the probability of an observation $o_t$ being generated from state $i$, s.t. $\sum_{k=1}^{|\mathcal{V}|} b_i(v_k)=1 \ \forall i$

- initial state probabilities $\Pi = \left\{\pi_1, \pi_2, ..., \pi_N\right\}$: an initial probability distribution over states, s.t. $\sum_{i=1}^N \pi_i = 1$. $\pi_i$ is the probability that the Markov chain starts in state $i$. some states $j$ may have $\pi_j=0$, meaning that they can't be initial states.
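One possible way to hold $\mu=(A, B, \Pi)$ in code is a small container that checks the sum-to-one constraints listed above. This is only a sketch under assumptions: the class name `HMM`, the index-based layout (`A[i][j]`, `B[i][k]` for vocabulary item $v_k$), and the toy numbers are hypothetical, and the explicit states $q_0$ and $q_f$ are replaced by the initial distribution `pi` with no final state.

```python
from dataclasses import dataclass

def _is_distribution(row, tol=1e-9):
    """True if all entries are non-negative and sum to 1."""
    return all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol

@dataclass
class HMM:
    pi: list  # pi[i]   = P(chain starts in state i)
    A: list   # A[i][j] = P(next state is j | current state is i)
    B: list   # B[i][k] = P(emit vocabulary item k | state i)

    def validate(self):
        n = len(self.pi)
        if not _is_distribution(self.pi):
            raise ValueError("pi must be a distribution over N states")
        if len(self.A) != n or any(
            len(r) != n or not _is_distribution(r) for r in self.A
        ):
            raise ValueError("A must be N x N with each row summing to 1")
        if len(self.B) != n or any(not _is_distribution(r) for r in self.B):
            raise ValueError("B must hold one emission distribution per state")
        return True

# Hypothetical 2-state, 2-symbol model (e.g. states NOUN/VERB, words fish/swim).
toy = HMM(
    pi=[0.7, 0.3],
    A=[[0.4, 0.6], [0.8, 0.2]],
    B=[[0.9, 0.1], [0.3, 0.7]],
)
print(toy.validate())
```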

## generative algorithm

The generative algorithm is a procedure that generates a sequence of observations and hidden states from a given Hidden Markov Model (HMM) by iteratively sampling from the model's probability distributions.

1. Pick the start state: Sample the initial hidden state $q_1$ from the initial state probability distribution $\Pi$.

2. For each time step $t = 1, \dots, T$:

   a. Emit an observation: Based on the current state $q_t$ and the symbol emission probabilities $B$, sample the observation $o_t$.

   b. Move to the next state: If $t < T$, sample the next hidden state $q_{t+1}$ based on the current state $q_t$ and the state transition probabilities $A$.
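The sampling procedure can be sketched as follows. The function name `sample_hmm`, the index-based parameter layout, and the toy numbers are assumptions for illustration. The sketch draws each observation from the current state and skips the transition after the last step, since $q_{T+1}$ is never used.

```python
import random

def sample_hmm(pi, A, B, T, rng):
    """Draw T hidden states and T observations (as integer indices)."""
    n = len(pi)
    states, observations = [], []
    # Step 1: sample the initial state q_1 from Pi.
    q = rng.choices(range(n), weights=pi, k=1)[0]
    for t in range(T):
        # Emit o_t from the emission distribution of the current state.
        o = rng.choices(range(len(B[q])), weights=B[q], k=1)[0]
        states.append(q)
        observations.append(o)
        # Move to q_{t+1} using row q of A, except after the last step.
        if t < T - 1:
            q = rng.choices(range(n), weights=A[q], k=1)[0]
    return states, observations

# Hypothetical toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
states, observations = sample_hmm(pi, A, B, T=8, rng=random.Random(0))
print(states, observations)
```

Only `observations` would be visible in practice; `states` is the hidden part that inference algorithms try to recover.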

## application

- NLP: part of speech tagging, speech recognition

- Bioinfo: sequence alignment, gene prediction

- time series analysis: stock prediction

## language model

language model: estimate probability of observation sequence $P(O|\mu)$ given model $\mu=(A, B, \Pi)$


**Forward algorithm**: similar to the Viterbi algorithm, but uses **sum** instead of **max**

The Forward algorithm computes the joint probability of the partial observation sequence $o(1:t)$ up to time step $t$ and the current state $q_t$ under the HMM. It computes these probabilities recursively at each time step, so the joint probabilities of all possible state sequences up to time step $t$ are summed without enumerating the sequences.

$$
\alpha_t(q_t) = p(o_t | q_t) \cdot \sum_{q_{t-1}} p(q_t | q_{t-1}) \cdot \alpha_{t-1}(q_{t-1})
$$

Here, $\alpha_t(q_t)$ represents the joint probability of the observation sequence $o(1:t)$ and the current state $q_t$. 

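The emission probability is $p(o_t | q_t)$ (an entry of $B$), and the transition probability is $p(q_t | q_{t-1})$ (an entry of $A$).

The recursion starts from $\alpha_1(q_1) = \pi_{q_1} \cdot p(o_1 | q_1)$. When no explicit final state is modeled, the probability of the whole sequence is obtained by summing over the last state: $P(O|\mu) = \sum_{q_T} \alpha_T(q_T)$.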
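A minimal pure-Python sketch of the recursion is given below. The function name `forward` and the toy parameters are assumptions. It also assumes the usual initialization $\alpha_1(j) = \pi_j \, b_j(o_1)$ and, with no explicit final state, termination by summing the last row of $\alpha$.

```python
def forward(pi, A, B, obs):
    """Return (alpha, likelihood), with alpha[t][j] = P(o_1..o_{t+1}, state j).

    obs is a list of vocabulary indices; t is 0-based in the code.
    """
    n = len(pi)
    # Initialization: start in state j and emit the first observation.
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Recursion: emission * sum over previous states of transition * alpha.
        alpha.append([
            B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(n))
            for j in range(n)
        ])
    # Termination: marginalize over the last state.
    return alpha, sum(alpha[-1])

# Hypothetical toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
alpha, likelihood = forward(pi, A, B, [0, 1])
print(alpha, likelihood)
```

For this toy model the first row of `alpha` is `[0.3, 0.04]` and the likelihood is about `0.2156`. For long sequences the products underflow, so practical implementations usually work in log space or rescale $\alpha$ at each step.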

### Derivation

Given the goal of computing the joint probability of the observation sequence $o(1:t)$ and the current state $q_t$, we can derive the formula step-by-step:

1. Joint probability: $p(o(1:t), q_t)$

2. Apply the chain rule: $p(o_t | o(1:t-1), q_t) \cdot p(o(1:t-1), q_t)$

3. Assume that the current observation $o_t$ depends only on the current state $q_t$ (HMM assumption): $p(o_t | q_t) \cdot p(o(1:t-1), q_t)$

4. Marginalize the second term over the previous state $q_{t-1}$: $p(o_t | q_t) \cdot \sum_{q_{t-1}} p(o(1:t-1), q_{t-1}, q_t)$

5. Apply the chain rule to each summand: $p(o_t | q_t) \cdot \sum_{q_{t-1}} p(q_t | q_{t-1}, o(1:t-1)) \cdot p(o(1:t-1), q_{t-1})$

6. Assume that the current state $q_t$ depends only on the previous state $q_{t-1}$ (HMM assumption): $p(o_t | q_t) \cdot \sum_{q_{t-1}} p(q_t | q_{t-1}) \cdot p(o(1:t-1), q_{t-1})$

7. Define $\alpha_t(q_t) = p(o(1:t), q_t)$ and $\alpha_{t-1}(q_{t-1}) = p(o(1:t-1), q_{t-1})$

8. Final formula: $\alpha_t(q_t) = p(o_t | q_t) \cdot \sum_{q_{t-1}} p(q_t | q_{t-1}) \cdot \alpha_{t-1}(q_{t-1})$
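As a sanity check on the derivation, the recursion can be compared numerically with direct marginalization over every possible state sequence. The sketch below is an illustration under assumptions: the function names and toy parameters are made up, and it uses the usual initialization $\alpha_1(j) = \pi_j \, b_j(o_1)$ with no explicit final state.

```python
from itertools import product

def forward_likelihood(pi, A, B, obs):
    """P(O) via the recursion: O(N^2 * T) operations."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)

def brute_force_likelihood(pi, A, B, obs):
    """P(O) by summing the joint probability of all N^T state paths."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Hypothetical toy parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0]
fast = forward_likelihood(pi, A, B, obs)
slow = brute_force_likelihood(pi, A, B, obs)
print(fast, slow)
```

Both numbers agree up to floating-point rounding, but the brute-force version enumerates $N^T$ paths while the recursion needs on the order of $N^2 T$ operations.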


# HMM learning

- Supervised: Training sequences are labeled

- Unsupervised: Training sequences are unlabeled; the number of states is known

- Semi-supervised: Some training sequences are labeled

## Supervised

- Estimate the state transition probabilities using MLE: 

    $$
    a_{ij} = \frac{\text{Count}(q_t = s_i, q_{t+1} = s_j)}{\text{Count}(q_t = s_i)}
    $$

- Estimate the observation probabilities using MLE: 

    $$
    b_j(k) = \frac{\text{Count}(q_t = s_j, o_t = v_k)}{\text{Count}(q_t = s_j)}
    $$


- Estimating emission probabilities can be harder than estimating transition probabilities because:

  * Language is highly flexible and creative, leading to new word/POS combinations that may not have been seen in the training data.

  * Sparse data due to the large number of possible word/POS combinations.


- solution:

  * smoothing: Laplace smoothing, Good-Turing, or Kneser-Ney can be used to assign non-zero probabilities to unseen word/POS combinations.

  * Heuristics based on word features such as morphology, e.g., words ending with the suffix "-ing" might be more likely to have a POS tag of "Verb".
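The count-based estimates and the simplest smoothing option can be sketched together. Everything named here is an assumption for illustration: the tiny tagged corpus, the unseen word `runs`, and the helpers `count_events` and `normalize`. Transition rows are normalized by the number of outgoing transitions, so a tag in sentence-final position does not add to the denominator. Only add-k (Laplace for `k=1`) smoothing is shown; Good-Turing and Kneser-Ney are not implemented.

```python
from collections import Counter, defaultdict

def count_events(tagged_sequences):
    """Collect transition and emission counts from (word, tag) sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sequences:
        for word, tag in sent:
            emit[tag][word] += 1
        for (_, prev), (_, nxt) in zip(sent, sent[1:]):
            trans[prev][nxt] += 1
    return trans, emit

def normalize(counts, outcomes, k=0.0):
    """counts[context][outcome] -> probabilities with add-k smoothing.

    k=0 is the plain MLE; k=1 is Laplace smoothing.
    """
    probs = {}
    for context, row in counts.items():
        denom = sum(row[o] for o in outcomes) + k * len(outcomes)
        probs[context] = {o: (row[o] + k) / denom for o in outcomes}
    return probs

# Hypothetical labeled training data.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
tags = sorted({tag for sent in corpus for _, tag in sent})
vocab = sorted({word for sent in corpus for word, _ in sent} | {"runs"})

trans, emit = count_events(corpus)
A = normalize(trans, tags)              # MLE transitions
B_mle = normalize(emit, vocab)          # MLE emissions: unseen pairs get 0
B_smooth = normalize(emit, vocab, k=1)  # Laplace: unseen pairs get mass
print(B_mle["VERB"]["runs"], B_smooth["VERB"]["runs"])
```

With the MLE, the unseen pair (`runs`, VERB) has probability 0, which would zero out any sequence containing it; with `k=1` it receives $1/11$ here (3 VERB tokens plus 8 vocabulary items in the denominator).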

## Unsupervised

goal: given an observation sequence $O$ and the space of all possible models $\mu_1, ..., \mu_m$, find the HMM $\mu_i$ that best describes the observations, i.e. the one that maximizes $P(O|\mu_i)$

**Expectation-Maximization**: forward-backward (Baum-Welch) algorithm. 

- Baum-Welch finds an approximate (locally optimal) solution for the parameters $\mu$ that maximize $P(O|\mu)$.

- guarantees that at each iteration, the likelihood of the data $P(O|\mu)$ does not decrease.

- can be stopped at any point to provide a partial solution.

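- converges to a local maximum of the likelihood, which is not necessarily the global maximum, so the result depends on the initialization.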

Algorithm

1. Randomly set the parameters of the HMM.

2. Until the parameters converge, repeat:

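   - E step: With the current parameters, use the forward and backward probabilities to compute how likely each state and each transition is at every time step, given the observations.

   - M step: Reestimate the parameters from these probabilities, treating them as expected counts.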
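The loop above can be sketched for a single observation sequence in pure Python. This is a bare-bones illustration under assumptions, not a production implementation: all function names are made up, probabilities are not rescaled (so it only suits short sequences), a fixed number of iterations stands in for a convergence test, and the random start is seeded for repeatability.

```python
import random

def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha

def backward(A, B, obs):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta

def baum_welch_step(pi, A, B, obs):
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    lik = sum(alpha[-1])  # P(O | current parameters)
    # E step: posterior state (gamma) and transition (xi) probabilities.
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M step: re-estimate parameters from the expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, lik

def random_distribution(size, rng):
    weights = [rng.random() + 0.1 for _ in range(size)]  # keep entries away from 0
    total = sum(weights)
    return [w / total for w in weights]

def baum_welch(obs, n_states, n_symbols, iters=20, seed=0):
    rng = random.Random(seed)
    pi = random_distribution(n_states, rng)
    A = [random_distribution(n_states, rng) for _ in range(n_states)]
    B = [random_distribution(n_symbols, rng) for _ in range(n_states)]
    history = []  # likelihood before each update
    for _ in range(iters):
        pi, A, B, lik = baum_welch_step(pi, A, B, obs)
        history.append(lik)
    return pi, A, B, history

obs = [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical symbol indices
pi, A, B, history = baum_welch(obs, n_states=2, n_symbols=2)
print(history[0], history[-1])
```

The recorded likelihoods in `history` should never decrease from one iteration to the next, up to floating-point rounding, while different seeds may end at different local maxima.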


