# Hidden Markov Model

## overview

A Hidden Markov Model (HMM) is a statistical model that represents a system undergoing a sequence of hidden (latent, unobservable) states. 

- HMM can be considered an extension of the Naive Bayes classifier to sequences

- HMM is a weighted finite-state transducer

- the sequence of hidden states forms a **Markov chain** because it follows the **Markov assumption**.

- hidden states encode the most recent history and emit observable symbols (observations) according to state-specific probability distributions.

- observations are sequential data, i.e. a sequence of dependent random variables, e.g., text, weather reports, stock market numbers

- in POS tagging, hidden states are POS tags and observations are words in text sequence

## Markov assumption

first-order Markov assumption: current hidden state depends only on the previous hidden state and is conditionally independent of earlier hidden states. 

$$
P(q_t|q_{t-1}, ..., q_1)=P(q_t|q_{t-1})
$$

Where $q_t$ represents the hidden state at time step $t$. This assumption simplifies the modeling of dependencies in the hidden state sequence, making it more computationally tractable.
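
As a worked consequence (a standard result, written here with the symbols above), the chain rule combined with the first-order assumption factorizes the joint probability of a state sequence into pairwise terms:

$$
P(q_1, q_2, \dots, q_T) = P(q_1)\prod_{t=2}^{T} P(q_t \mid q_1, \dots, q_{t-1}) = P(q_1)\prod_{t=2}^{T} P(q_t \mid q_{t-1})
$$

Assuming the transition probabilities do not change over time, a model with $N$ states then needs only $N$ initial probabilities and an $N \times N$ transition table, rather than one probability for each of the $N^T$ possible sequences.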

## Markov chain

A Markov chain is a stochastic process that models a sequence of events.

**Markov property** (memorylessness): the future state depends only on the current state, not on earlier states.


A Markov chain is characterized by:

1. A finite set of states: $S = \{s_1, s_2, \dots, s_N\}$

2. Transition probabilities: $P = \{p_{ij}\}$, where $p_{ij}$ is the probability of transitioning from state $s_i$ to state $s_j$, with $\sum_{j=1}^N p_{ij} = 1 \ \forall i$
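
The two ingredients above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three weather states, the numbers in `P`, and the helper names `is_row_stochastic` and `simulate_chain` are all made up for the example.

```python
import random

# Hypothetical 3-state weather chain; states and numbers are illustrative only.
states = ["sunny", "cloudy", "rainy"]
P = [
    [0.7, 0.2, 0.1],  # transitions out of sunny
    [0.3, 0.4, 0.3],  # transitions out of cloudy
    [0.2, 0.3, 0.5],  # transitions out of rainy
]

def is_row_stochastic(matrix, tol=1e-9):
    """Check that every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def simulate_chain(P, start, steps, rng):
    """Return a list of steps + 1 state indices, beginning at `start`."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # The next state is drawn using only the current state's row.
        nxt = rng.choices(range(len(P)), weights=P[current], k=1)[0]
        path.append(nxt)
    return path

rng = random.Random(0)  # fixed seed so repeated runs give the same path
path = simulate_chain(P, start=0, steps=10, rng=rng)
print([states[i] for i in path])
```

Because each draw looks only at `P[current]`, the simulated path satisfies the Markov property by construction.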

other methods that use Markov chain:

- Markov Chain Monte Carlo (MCMC) Method: a class of algorithms for sampling from a probability distribution, e.g., Metropolis-Hastings algorithm and Gibbs sampling

- Markov Decision Process (MDP): in reinforcement learning, the states represent different situations, an agent chooses actions, and the transition probabilities describe how each action moves the system from one state to the next.

## definition (parameters of HMM)

Hidden Markov model $\mu=(A, B, \Pi)$ is defined by initial state probabilities $\Pi$, state transition probabilities $A$, and symbol emission probabilities $B$.

- states $\mathcal{Q}=\left\{q_0, q_1, ..., q_T\right\}$: a sequence of hidden states, each taking one of $N$ possible values.

    initial state $q_0$ doesn't emit any observation
    
    final state $q_T$ doesn't transition to any state. 

- observations $O=\left\{o_1, o_2, ..., o_T\right\}$: a sequence of $T$ observations, each one drawn from a vocabulary $\mathcal{V}=\{v_1, ..., v_{|\mathcal{V}|}\}$

- state transition probabilities $A \in \mathbb{R}^{N \times N}$: a transition probability matrix.

    $A_{ij}$: probability of moving from state $i$ to state $j$. s.t. $\sum_{j=1}^N A_{ij}=1 \ \forall i$

- symbol emission probabilities $B =\left\{\phi_i(v_k)\right\}\in \mathbb{R}^{N \times |\mathcal{V}|}$: an emission probability matrix.

    each row is a hidden state, each column is a vocabulary symbol.

    $B_{ik}$: probability of symbol $v_k$ being emitted from state $i$, s.t. $\sum_{k=1}^{|\mathcal{V}|} B_{ik}=1 \ \forall i$. For the observation at time $t$, $B_i(o_t)$ denotes the entry of row $i$ in the column of the symbol observed as $o_t$.


- initial state probabilities $\Pi = \left\{\pi_1, \pi_2, ..., \pi_N\right\}$: an initial probability distribution over states. 

    $\sum_{i=1}^N \pi_i = 1$, where $\pi_i$ is the probability that the Markov chain starts in state $i$.
    
    some states $j$ may have $\pi_j=0$, meaning that they can't be initial states.
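
The constraints on $A$, $B$ and $\Pi$ can be checked mechanically. The sketch below is one possible way to do it; the `HMM` container, the `validate` helper and the two-state toy numbers are assumptions made for illustration and are not part of the notes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HMM:
    """Container for mu = (A, B, Pi); field names are illustrative."""
    A: List[List[float]]  # A[i][j]: P(next state j | current state i), N x N
    B: List[List[float]]  # B[i][k]: P(symbol v_k | state i), N x |V|
    pi: List[float]       # pi[i]: P(first state is i), length N

def _is_distribution(p, tol=1e-9):
    return all(x >= 0 for x in p) and abs(sum(p) - 1.0) <= tol

def validate(hmm: HMM) -> List[str]:
    """Return a list of problems; an empty list means the parameters are consistent."""
    problems = []
    n = len(hmm.pi)
    if len(hmm.A) != n or any(len(row) != n for row in hmm.A):
        problems.append("A must be N x N")
    if len(hmm.B) != n or len({len(row) for row in hmm.B}) > 1:
        problems.append("B must have N rows of equal length")
    if not _is_distribution(hmm.pi):
        problems.append("pi must sum to 1")
    for i, row in enumerate(hmm.A):
        if not _is_distribution(row):
            problems.append(f"row {i} of A must sum to 1")
    for i, row in enumerate(hmm.B):
        if not _is_distribution(row):
            problems.append(f"row {i} of B must sum to 1")
    return problems

# Hypothetical two-state, three-symbol model.
toy = HMM(
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    pi=[0.6, 0.4],
)
print(validate(toy))
```

Note that this check treats every state as an ordinary emitting state. A model with explicit non-emitting START and END states has all-zero rows for them, which this simple validator would flag and which would need separate handling.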

### example of POS tagging

States (POS tags): 
Q = {q0 (START), q1 (Noun), q2 (Verb), q3 (Adjective), q4 (END)}

Vocabulary: 
V = {v1 (dog), v2 (cat), v3 (ran), v4 (quickly), v5 (jumped), v6 (slowly)}

Observations:
O = {o1, o2, o3, o4, o5, o6, o7, o8}

Transition probabilities A:
|        | START | Noun | Verb | Adjective | END |
|--------|-------|------|------|-----------|-----|
| START  | 0     | 0.2  | 0.3  | 0.5       | 0   |
| Noun   | 0     | 0.3  | 0.3  | 0         | 0.4 |
| Verb   | 0     | 0    | 0.4  | 0.6       | 0   |
| Adjective| 0   | 0.2  | 0    | 0.3       | 0.5 |
| END    | 0     | 0    | 0    | 0         | 0   |

Emission probabilities B:
|        | dog | cat | ran | quickly | jumped | slowly |
|--------|-----|-----|-----|---------|--------|--------|
| START  | 0   | 0   | 0   | 0       | 0      | 0      |
| Noun   | 0.1 | 0.3 | 0.2 | 0.1     | 0.1    | 0.2    |
| Verb   | 0.2 | 0.1 | 0.2 | 0.3     | 0.1    | 0.1    |
| Adjective| 0.1| 0.1 | 0.2 | 0.2     | 0.2    | 0.2    |
| END    | 0   | 0   | 0   | 0       | 0      | 0      |

Initial state probabilities $\Pi$: Adjective can't be initial state
| START | Noun | Verb | Adjective | END  |
|-------|------|------|-----------|------|
| 0.8   | 0.1  | 0.05 | 0         | 0.05 |
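
To see how the tables combine, the sketch below computes the joint probability of one tagged sentence. It uses only the transition and emission tables, and it assumes that the START row of A supplies the probability of the first tag and that the sentence must finish with a transition to END; the helper name `joint_probability` and the chosen sentence are illustrative.

```python
# Transition and emission tables copied from the example
# (the all-zero START column and the START/END rows of B are omitted).
A = {
    "START":     {"Noun": 0.2, "Verb": 0.3, "Adjective": 0.5, "END": 0.0},
    "Noun":      {"Noun": 0.3, "Verb": 0.3, "Adjective": 0.0, "END": 0.4},
    "Verb":      {"Noun": 0.0, "Verb": 0.4, "Adjective": 0.6, "END": 0.0},
    "Adjective": {"Noun": 0.2, "Verb": 0.0, "Adjective": 0.3, "END": 0.5},
}
B = {
    "Noun":      {"dog": 0.1, "cat": 0.3, "ran": 0.2, "quickly": 0.1, "jumped": 0.1, "slowly": 0.2},
    "Verb":      {"dog": 0.2, "cat": 0.1, "ran": 0.2, "quickly": 0.3, "jumped": 0.1, "slowly": 0.1},
    "Adjective": {"dog": 0.1, "cat": 0.1, "ran": 0.2, "quickly": 0.2, "jumped": 0.2, "slowly": 0.2},
}

def joint_probability(tags, words):
    """P(tags, words): product of transition and emission terms, padded with START and END."""
    assert len(tags) == len(words)
    prob = 1.0
    prev = "START"
    for tag, word in zip(tags, words):
        prob *= A[prev][tag] * B[tag][word]
        prev = tag
    return prob * A[prev]["END"]

# 0.2*0.1 * 0.3*0.2 * 0.6*0.2 * 0.5
p = joint_probability(["Noun", "Verb", "Adjective"], ["dog", "ran", "quickly"])
print(p)
```

Any path that uses a zero-probability transition, such as Verb followed by Noun, gets a joint probability of exactly 0.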


## algorithms

| Aspect | Forward-Backward | Viterbi | Baum-Welch (EM) |
|---|---|---|---|
| **Input** | HMM parameters $\mu=(A, B, \Pi)$ and Observed sequence $O$ | HMM parameters $\mu=(A, B, \Pi)$ and Observed sequence $O$ | Initial guess of HMM parameters $\mu_0=(A, B, \Pi)$ and Observed sequence $O$|
| **Output** | State probabilities $P(q_t \mid O, \mu)$ at each time step | Most probable state sequence $\hat {\mathcal{Q}}$| Estimated HMM parameters $\hat \mu$|
| **Algorithm** | Dynamic programming | Dynamic programming | Iterative optimization |
| **Applications** | POS tagging, Speech recognition, gene prediction | POS tagging, Speech recognition, Error detection & correction, sequence alignment | Speech recognition, gene prediction |
| **Dependencies** | Forward & backward probabilities | Maximum probability of the most likely path | Expectation-maximization approach |



### HMM generator

Generative process of an HMM: given the parameters $\mu=(A, B, \Pi)$, generate a sequence of hidden states and observations by iteratively sampling from the model's probability distributions.

1. Sample the first hidden state $q_1$ from the initial state probability distribution $\Pi$.

2. For each time step $t = 1, \dots, T$:
   
   a. Emit an observation: based on the current state $q_t$ and the symbol emission probabilities $B$, sample the observation $o_t$.
   
   b. Move to the next state: if $t < T$, sample the next hidden state $q_{t+1}$ based on the current state $q_t$ and the state transition probabilities $A$.
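
The sampling loop can be sketched as follows. This is a minimal version under stated assumptions: states and symbols are represented by integer indices, the first emitting state is drawn from `pi`, and the function name `sample_hmm` and the toy parameters are hypothetical.

```python
import random

def sample_hmm(A, B, pi, T, rng):
    """Sample T hidden states and T observations (both as indices) from an HMM."""
    n_states = len(pi)
    n_symbols = len(B[0])
    states, observations = [], []
    state = None
    for t in range(T):
        # First state comes from pi, later states from the previous state's row of A.
        weights = pi if t == 0 else A[state]
        state = rng.choices(range(n_states), weights=weights, k=1)[0]
        # Emit a symbol from the current state's row of B.
        symbol = rng.choices(range(n_symbols), weights=B[state], k=1)[0]
        states.append(state)
        observations.append(symbol)
    return states, observations

# Hypothetical two-state, three-symbol model (numbers are illustrative only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

states, observations = sample_hmm(A, B, pi, T=8, rng=random.Random(42))
print(states)
print(observations)
```

In a real application only `observations` would be visible; `states` is exactly the hidden sequence that Viterbi and forward-backward try to recover.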

<img src='https://ars.els-cdn.com/content/image/3-s2.0-B9780124077959000141-f14-09-9780124077959.jpg' />

### Baum-Welch algorithm

- unsupervised learning to train HMM model

- objective: maximum likelihood estimation via Expectation-Maximization (EM).

   estimate HMM parameters $\hat \mu=(\hat A, \hat B, \hat \Pi)$ by maximizing the likelihood of the observed sequence $O=\{o_1, o_2, ..., o_T\}$, starting from an initial guess of the parameters $\mu_0=(A, B, \Pi)$

   $$
   \hat \mu=\arg\max_{\mu}\log P(O|\mu)= \arg\max_{\mu}\log \sum_Q P(O, Q | \mu)
   $$

   The difficulty in directly maximizing this expression comes from the summation over all $N^T$ hidden state sequences $Q= \{q_1, q_2, ..., q_T\}$ inside the logarithm: the log of a sum does not decompose into per-parameter terms, so there is no closed-form maximizer.

   Baum-Welch (EM) algorithm overcomes this difficulty by iteratively performing E-step and M-step

Algorithm

1. initialization: Initialize the transition probabilities, emission probabilities, and initial state probabilities.

2. repeat until the parameters converge:

   - Expectation (E-step): Compute the expected count of transitions and emissions using the forward-backward algorithm.

   - Maximization (M-step):  Update the transition probabilities, emission probabilities, and initial state probabilities based on the expected counts.

```markdown
# pseudocode
initialize A, B and pi to some initial values
repeat until convergence:
    # E-step
    compute alpha and beta using the forward-backward algorithm
    compute gamma[t][i] and the expected transition and emission counts from alpha and beta
    # M-step
    for each state i:
        pi[i] = gamma[1][i]
        for each state j:
            A[i][j] = expected number of transitions from state i to state j / expected number of transitions from state i
        for each observation vk in V:
            B[i][vk] = expected number of times in state i and observing vk / expected number of times in state i
```


$\gamma[t][i] = P(q_t = i \mid O, \mu)$ is the posterior probability of being in state $i$ at time $t$ given the observation sequence and the model parameters.
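
A runnable sketch of the E-step and M-step for a single observation sequence is given below. It is deliberately simple: probabilities are not rescaled, so it is only suitable for short sequences, and the pairwise posterior `xi`, the function names and the toy numbers are assumptions introduced for the example.

```python
def forward(A, B, pi, obs):
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(N))
    return alpha

def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta

def baum_welch_step(A, B, pi, obs):
    """One EM iteration; returns updated (A, B, pi) and the likelihood before the update."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = sum(alpha[T - 1])
    # E-step: gamma[t][i] = P(q_t = i | O), xi[t][i][j] = P(q_t = i, q_{t+1} = j | O)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: ratios of expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) / sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi, likelihood

# Hypothetical initial guess and observation sequence (symbol indices).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2, 2, 1, 0, 0, 2]

likelihoods = []
for _ in range(5):
    A, B, pi, lik = baum_welch_step(A, B, pi, obs)
    likelihoods.append(lik)
print(likelihoods)
```

The printed likelihoods should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation. A practical version would work with scaled or log probabilities and pool counts over many sequences.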

### Viterbi algorithm

objective: find the most probable hidden state sequence $\hat{\mathcal{Q}}$ given observation sequence $O$ and HMM model $\mu$

$$
\hat{\mathcal{Q}}=\arg\max_{Q}P(Q|O, \mu)
$$

dynamic programming Algorithm

1. Initialization: Initialize the maximum path probability $v_1$ and the corresponding backpointer $b_1$ at the first time step.

    $$
    \begin{aligned}
    v_1(i) &= \pi_i B_i(o_1) \quad \forall i: 1 \leq i \leq N\\
    b_1(i) &= 0 \quad \forall i: 1 \leq i \leq N
    \end{aligned}
    $$

2. Recursion: For each state at each subsequent time step $t$ ($2 \leq t \leq T$), compute the maximum path probability $v_t$ that ends at this state, and update the backpointer $b_t$ (written in lowercase to avoid confusion with the emission matrix $B$).
   
   $$
   \begin{aligned}
   v_t(i) &= \max_{1 \leq j \leq N} v_{t-1}(j) A_{ji} B_i(o_t) \quad 1 \leq i \leq N \\
   b_t(i) &= \arg\max_{1 \leq j \leq N} v_{t-1}(j) A_{ji} B_i(o_t) \quad 1 \leq i \leq N
   \end{aligned}
   $$

3. Termination: Find the state with the maximum probability at the final time step $T$, and trace back the path using the backpointers to find the most likely sequence of states.

   Best score: $P^* = \max_{1 \leq i \leq N} v_T(i)$

   Start of backtrace: $q_T^* = \arg\max_{1 \leq i \leq N} v_T(i)$

   Backtrace: $q_t^* = b_{t+1}(q_{t+1}^*)$ for $t = T-1, \dots, 1$


Notation:

- A: state transition probability matrix. shape (N, N)
    A[i][j]: probability of transitioning from state i to state j

- B: emission probability matrix. shape (N, |V|)

- pi: initial state probability distribution. pi[i] is the probability of starting in state i

- O: sequence of observations. O[t] is the observation at time t.

- v:  Viterbi trellis matrix. shape (T, N)

    v[t][i] : probability of the most probable state sequence responsible for the first t observations that has i as its final state. 

- backpointer: a 2D array of shape (T, N). backpointer[t][i] stores the previous state j that maximizes the probability of reaching state i at time t.
            used to reconstruct the most probable state sequence once the algorithm has finished.

- bestpathprob: probability of the most probable state sequence.

- bestpathpointer: state which ends the most probable state sequence.

- bestpath: most probable state sequence.

Pseudocode

```markdown
initialize v[1][i] = pi[i] * B[i][O[1]] for all i
initialize backpointer[1][i] = 0 for all i
for t from 2 to T:
    for i from 1 to N:
        v[t][i] = max over all j (v[t-1][j] * A[j][i]) * B[i][O[t]]
        backpointer[t][i] = argmax over all j (v[t-1][j] * A[j][i])

bestpathprob = max over all i (v[T][i])
bestpathpointer = argmax over all i (v[T][i])

bestpath[T] = bestpathpointer
for t from T-1 down to 1:
    bestpath[t] = backpointer[t+1][bestpath[t+1]]
```
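
The pseudocode translates almost line by line into Python. The sketch below uses 0-based indices instead of the 1-based indices above, and the two-state Healthy/Fever model is a made-up toy example, not data from these notes.

```python
def viterbi(A, B, pi, obs):
    """Return (best_path, best_prob) for a sequence of observation indices."""
    N, T = len(pi), len(obs)
    v = [[0.0] * N for _ in range(T)]          # v[t][i], shape (T, N)
    backpointer = [[0] * N for _ in range(T)]  # best previous state for (t, i)
    for i in range(N):
        v[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for i in range(N):
            # Best previous state j for landing in state i at time t.
            best_j = max(range(N), key=lambda j: v[t - 1][j] * A[j][i])
            backpointer[t][i] = best_j
            v[t][i] = v[t - 1][best_j] * A[best_j][i] * B[i][obs[t]]
    # Termination: pick the best final state, then follow the backpointers.
    last = max(range(N), key=lambda i: v[T - 1][i])
    best_prob = v[T - 1][last]
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, best_prob

# Hypothetical toy model: two hidden states, three observable symbols.
states = ["Healthy", "Fever"]
symbols = ["normal", "cold", "dizzy"]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

obs = [symbols.index(s) for s in ["normal", "cold", "dizzy"]]
path, prob = viterbi(A, B, pi, obs)
print([states[i] for i in path], prob)
```

For this toy input the best path is Healthy, Healthy, Fever with probability 0.3 · 0.7 · 0.4 · 0.3 · 0.6 = 0.01512. For long sequences the products underflow, so implementations commonly switch to sums of log probabilities.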

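A trellis is a data structure used for dynamic programming. It is visualized as a grid or a graph with one column per time step and one node per state, and is represented as a 2D array in implementation.

Consider an HMM with three hidden states (H1, H2, H3) and three observations (O1, O2, O3). 

The Viterbi trellis for this case might look something like this (simplified: in the full trellis, every state at one time step connects to every state at the next):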

```markdown
        H1  ---->  H1 ----> H1
        |         |        |
        v         v        v
        H2  ---->  H2 ----> H2
        |         |        |
        v         v        v
        H3  ---->  H3 ----> H3

        |         |        |
        v         v        v
        O1        O2       O3
```

The goal of the Viterbi algorithm is to find the most probable path through the hidden states (H1, H2, H3) that leads to the observed sequence (O1, O2, O3).

For instance, the most probable path might be H1 -> H2 -> H3, meaning that observation O1 was most likely generated when the system was in state H1, O2 when the system was in state H2, and so on.

### forward and backward algorithm

**Forward Algorithm:**

1. Initialization: initialize the forward probability at the first time step for each possible state $i\in \{1, \dots, N\}$

    $$\alpha_1(i) = \pi_i \cdot B_i(o_1) \quad \forall i$$

2. Recursion: Start from the beginning of the sequence and move forward in time, computing the forward probabilities at each time step for each possible state.

    $$\alpha_t(i) = B_i(o_t)\left(\sum_{j=1}^N \alpha_{t-1}(j) \cdot A_{ji}\right)   \quad \text{for } t=2 \text{ to } T \text{ and } \forall i$$

3. Termination: summing the forward probabilities at the final time step gives the likelihood of the whole observation sequence.

    $$P(O \mid \mu) = \sum_{i=1}^N \alpha_T(i)$$


**Backward Algorithm:**

1. Initialization: initialize the backward probability at the final time step for each possible state $i\in \{1, \dots, N\}$ to be 1 

    $$\beta_T(i) = 1 \quad \forall i$$

2. Recursion: Start from the end of the sequence and move backward in time, computing the backward probabilities at each time step for each possible state.

    $$\beta_t(i) = \sum_{j=1}^N A_{ij} \cdot B_j(o_{t+1}) \cdot \beta_{t+1}(j) \quad \text{for } t=T-1 \text{ to } 1 \text{ and } \forall i$$

**Combine**: compute the posterior probability of each hidden state $i$ at each time step $t$ given the observed sequence $O$ from the forward probabilities ($\alpha$) and backward probabilities ($\beta$).

$$
P(q_t = i | O) = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \cdot \beta_t(j)}
$$

The denominator normalizes the probability, ensuring that the posterior probabilities over all states at time $t$ sum to 1. It sums the joint probabilities of the observation sequence $O$ and each possible hidden state $j$ at time step $t$, so it equals the likelihood $P(O \mid \mu)$ for every $t$.


Notation:

- $\alpha$: forward probability matrix, contains the joint probabilities of observing the partial sequence up to each time step and being in each state at that time step

    $\alpha_t(i) = P(o_1, \dots, o_t, q_t = i \mid \mu)$: joint probability of observing the sequence **up to time t** and being in state $i$ at time $t$, given the Hidden Markov Model parameters.

- $\beta$: backward probability matrix, contains the conditional probabilities of the ending partial sequence given each state at each time step.

    $\beta_t(i)$: probability of the observed sequence from time $t+1$ to the end $O(t+1:T)$, given that system is in state $i$ at time $t$.

```markdown
# forward pass
initialize alpha[1][i] = pi[i] * B[i][O[1]] for all i
for t from 2 to T:
    for i from 1 to N:
        alpha[t][i] = (sum over all j (alpha[t-1][j] * A[j][i])) * B[i][O[t]]

# backward pass
initialize beta[T][i] = 1 for all i
for t from T-1 down to 1:
    for i from 1 to N:
        beta[t][i] = sum over all j (A[i][j] * B[j][O[t+1]] * beta[t+1][j])

return: alpha, beta
```
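
A runnable counterpart of the two passes, extended with the combine step, might look like this. It is a sketch with 0-based indices and unscaled probabilities; the `posteriors` helper and the two-state toy model are assumptions made for the example.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: joint probability of obs[0..t] and being in state i at time t."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(N))
    return alpha

def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta

def posteriors(A, B, pi, obs):
    """Return (gamma, likelihood) with gamma[t][i] = P(q_t = i | O)."""
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    likelihood = sum(alpha[-1])  # P(O | mu)
    gamma = [
        [a * b / likelihood for a, b in zip(alpha[t], beta[t])]
        for t in range(len(obs))
    ]
    return gamma, likelihood

# Hypothetical toy model: two hidden states, three observable symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

gamma, likelihood = posteriors(A, B, pi, obs)
print(likelihood)
for row in gamma:
    print(row)
```

For this toy input the likelihood is 0.007696 + 0.028584 = 0.03628, and each row of `gamma` sums to 1. Unlike Viterbi, the per-step argmax of `gamma` maximizes each state individually and is not guaranteed to form the single most probable path.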

## Supervised

MLE (Maximum Likelihood Estimation) can be used to estimate the parameters of an HMM when the dataset is fully labeled, meaning that both the observation sequences and the corresponding hidden state sequences are known. 

The estimated parameters A and B can then be used as input for the Viterbi and Forward-Backward algorithms.

- Estimate the state transition probabilities

    $$
    A_{ij} = \frac{\text{Count}(q_t = s_i, q_{t+1} = s_j)}{\text{Count}(q_t = s_i)}
    $$

- Estimate the emission probabilities

    $$
    B_{ik} = \frac{\text{Count}(q_t = s_i, o_t = v_k)}{\text{Count}(q_t = s_i)}
    $$
    $$
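
Both estimates are plain relative frequencies, so they can be computed with counters. The sketch below assumes the corpus is a list of sentences, each a list of `(word, tag)` pairs; the function name `estimate_hmm` and the two-sentence corpus are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency (MLE) estimates of transition and emission probabilities.

    Returns (A, B) as nested dicts: A[tag][next_tag] and B[tag][word].
    """
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            emission_counts[tag][word] += 1
        for prev, nxt in zip(tags, tags[1:]):
            transition_counts[prev][nxt] += 1
    # Transition denominators count only positions that have a successor,
    # so each row of A sums to 1.
    A = {tag: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
         for tag, counts in transition_counts.items()}
    B = {tag: {word: c / sum(counts.values()) for word, c in counts.items()}
         for tag, counts in emission_counts.items()}
    return A, B

# Hypothetical two-sentence tagged corpus.
corpus = [
    [("dog", "Noun"), ("ran", "Verb"), ("quickly", "Adjective")],
    [("cat", "Noun"), ("jumped", "Verb")],
]
A, B = estimate_hmm(corpus)
print(A["Noun"])
print(B["Noun"])
```

Pairs that never occur in the corpus are simply missing from the dictionaries, which corresponds to a probability of 0. Real taggers usually add smoothing so that unseen words and transitions do not zero out an entire path.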