---
title: Markov Models
date: 2024-03-07
knitr:
  opts_chunk:
    echo: false
    message: false
    warning: false
    dev: ragg_png
format:
  html:
    mermaid:
      theme: neutral
filters:
  - codeblocklabel
categories:
  - compling
---
```{r}
library(reticulate)
library(tidyverse)
library(gt)
```
I started [the notes on ngram models](04_ngrams.qmd) saying that we can think of them like finite-state automata, but with probabilities on each transition between states. Another word for this is "Markov Model."
## Markov Chains
Any system that has a set of states that you can transition between probabilistically could be called a Markov Chain. They have applications well beyond ngram models, and even beyond linguistics.
### Working out ngram probabilities
Let's say we had a Markov Chain with "just" three states:
```{mermaid}
stateDiagram
direction LR
[*] --> A: 0.5
[*] --> B: 0.25
[*] --> C: 0.25
A --> [*]: 0.1
A --> A: 0.2
A --> B: 0.3
A --> C: 0.4
B --> [*]: 0.9
B --> C: 0.1
C --> [*]: 0.2
C --> C: 0.7
C --> B: 0.1
```
It's going to be easier to represent all of the transition probabilities from one state to the next as a "transition matrix":
```{r}
tibble(
from = c("start", "A", "B", "C", "end"),
start = 0,
A = c(0.5, 0.2, 0, 0, 0),
B = c(0.25, 0.3, 0, 0.1, 0),
C = c(0.25, 0.4, 0.1, 0.7, 0),
end = c(0, 0.1, 0.9, 0.2, 1)
) -> prob_table
prob_table |>
gt() |>
tab_spanner(
label = "to",
columns = start:end
)
```
### 👻 Matrix Multiplication 🤖
- What is the probability that we'll wind up in the "end state" in 3 steps?
Or, if we asked this question about sentences generated by an ngram model:
- What is the probability that a sentence will be 3 words long?
We can get at this with "matrix multiplication".
```{r}
prob_table |>
select(-1) |>
as.matrix()->
foo
```
```{python}
trans_mat = r.foo
```
```{python}
#| echo: true
import numpy as np
trans_mat
```
To get the probability of our state after the first step, we need to get that top row of probabilities. One way to do this is to represent our *current* state (the start state) like this:
```{python}
#| echo: true
start_state = np.array([
1, # start
0, # A
0, # B
0, # C
0, # end
])
start_state
```
The 1 represents the probability we're in the starting position, and the 0s represent that we're in no other position. If we treat this like a "row vector", we can get the probability of the next state with matrix multiplication.
$$
[1, 0, 0, 0, 0] \left[ \begin{array}{ccccc}
0 & 0.5 & 0.25 & 0.25 & 0\\
0 & 0.2 & 0.3 & 0.4 & 0.1 \\
0 & 0 & 0 & 0.1 & 0.9\\
0 & 0 & 0.1 & 0.7 & 0.2 \\
0 & 0 & 0 & 0 & 1 \end{array} \right]
$$
In matrix multiplication, we multiply each *row* from the left-hand side by each *column* on the right-hand side, and sum the results together. In numpy, there's an operator `@` for this.
```{python}
#| echo: true
first_step = start_state @ trans_mat
first_step
```
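For intuition, we can reproduce that first step without `@`, multiplying the row vector into each column by hand and summing. (This is just a sketch: it re-enters `trans_mat` explicitly so the chunk stands on its own, and `by_hand` is my own name.)

```{python}
#| echo: true
import numpy as np

start_state = np.array([1, 0, 0, 0, 0])

# state order: start, A, B, C, end; rows are "from", columns are "to"
trans_mat = np.array([
    [0.0, 0.5, 0.25, 0.25, 0.0],
    [0.0, 0.2, 0.30, 0.40, 0.1],
    [0.0, 0.0, 0.00, 0.10, 0.9],
    [0.0, 0.0, 0.10, 0.70, 0.2],
    [0.0, 0.0, 0.00, 0.00, 1.0],
])

# for each column j: multiply each row entry by the matching
# column entry, then sum — exactly what `@` does for us
by_hand = np.array([
    sum(start_state[i] * trans_mat[i, j] for i in range(5))
    for j in range(5)
])
by_hand
```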
Now, our `first_step` row vector represents the probability that we're in each state. How do we get the probabilities of our location after the second step? Just multiply again!
```{python}
#| echo: true
second_step = first_step @ trans_mat
second_step
```
And for the third step:
```{python}
#| echo: true
third_step = second_step @ trans_mat
third_step
```
::: {.callout-note collapse="true"}
## Full Matrix Multiplication
```{python}
#| echo: true
trans_mat @ trans_mat @ trans_mat
```
:::
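As a sanity check (this simulation isn't in the original notes, and the names are mine), we can also estimate the 3-step answer empirically: sample many random walks through the chain and count how often they've been absorbed into the end state.

```{python}
#| echo: true
import numpy as np

rng = np.random.default_rng(seed=3)

# same transition matrix as above, re-entered so this chunk is self-contained
# state order: start, A, B, C, end
trans_mat = np.array([
    [0.0, 0.5, 0.25, 0.25, 0.0],
    [0.0, 0.2, 0.30, 0.40, 0.1],
    [0.0, 0.0, 0.00, 0.10, 0.9],
    [0.0, 0.0, 0.10, 0.70, 0.2],
    [0.0, 0.0, 0.00, 0.00, 1.0],
])

def walk(n_steps):
    """Start in state 0 and take n_steps random transitions."""
    state = 0
    for _ in range(n_steps):
        state = rng.choice(5, p=trans_mat[state])
    return state

# fraction of walks sitting in the end state (index 4) after 3 steps
end_prob = np.mean([walk(3) == 4 for _ in range(20_000)])
end_prob
```

With this many samples, the estimate should land close to the exact value from the matrix multiplication above (about 0.57).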
This repeated multiplication also has an interesting long-run behavior. For a small two-state matrix like the one below, where every state can reach every other, the rows settle down toward the chain's stationary distribution as we keep multiplying:
```{python}
#| echo: true
net_mat = np.array([
    [0.9, 0.1],
    [0.3, 0.7]
])
# equivalent to chaining net_mat @ net_mat @ ... with 17 copies of net_mat
np.linalg.matrix_power(net_mat, 17)
```
## Hidden Markov Models
"Hidden" Markov Models, or HMMs, have had a lot of applications in acoustics, specifically in Automatic Speech Recognition and Forced Alignment. The core idea is that you can't directly observe the state, but you can observe something *about* the state. One analogy is to use the weather.
Let's say we are, for whatever reason, unable to directly observe the weather outside. But we *are* able to observe whether or not the sidewalk is wet. And we do know the following:
| | Wet Sidewalk | Dry Sidewalk |
|:---:|-------------:|-------------:|
| 🌧️ | 0.9 | 0.1 |
| 🌞 | 0.2 | 0.8 |
And we also know that perhaps the best predictor of today's weather is yesterday's weather:
| | 🌧️ | 🌞 |
|:---:|----:|----:|
| 🌧️ | 0.8 | 0.2 |
| 🌞 | 0.2 | 0.8 |
So we have our general Markov Model:
```{mermaid}
flowchart TB
subgraph hidden
direction LR
sunny --> rainy
sunny --> sunny
rainy --> rainy
rainy --> sunny
end
subgraph observable
sunny --> wet
sunny --> dry
rainy --> wet
rainy --> dry
end
```
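One way to get a feel for this model is to sample from it: pick a hidden weather state, emit a sidewalk observation from it, transition to the next day's weather, and repeat. (A sketch with my own variable names, using the two tables above; starting from a uniformly random weather state is my assumption, not something the tables specify.)

```{python}
#| echo: true
import numpy as np

rng = np.random.default_rng(seed=10)

states = ["rainy", "sunny"]
observations = ["wet", "dry"]

# P(tomorrow's weather | today's weather); rows/columns: rainy, sunny
trans = np.array([
    [0.8, 0.2],
    [0.2, 0.8],
])
# P(sidewalk | weather); rows: rainy, sunny; columns: wet, dry
emit = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

state = rng.choice(2)  # assumption: start in a random weather state
weather, sidewalk = [], []
for _ in range(8):
    weather.append(states[state])
    sidewalk.append(observations[rng.choice(2, p=emit[state])])
    state = rng.choice(2, p=trans[state])

print(weather)
print(sidewalk)
```

In the HMM setting, only the second list (the sidewalk) is observable; the first list (the weather) is the hidden sequence we'd like to recover.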
If we get the sequence
```
wet wet wet dry wet dry dry dry
```
we can actually work out the probability of the weather state at each point in time.
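One standard way to do this is the forward algorithm: carry a vector of probabilities over the hidden states, and at each time step push it through the transition matrix and weight it by how well each state explains the new observation. Here's a sketch for the sequence above (assuming, since the notes don't say, that rain and sun start out equally likely):

```{python}
#| echo: true
import numpy as np

# P(tomorrow's weather | today's weather); rows/columns: rainy, sunny
trans = np.array([
    [0.8, 0.2],
    [0.2, 0.8],
])
# P(sidewalk | weather); rows: rainy, sunny; columns: wet, dry
emit = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

obs = ["wet", "wet", "wet", "dry", "wet", "dry", "dry", "dry"]
obs_idx = [0 if o == "wet" else 1 for o in obs]

# belief[i] = P(weather today = i, all observations so far)
belief = np.array([0.5, 0.5]) * emit[:, obs_idx[0]]
for o in obs_idx[1:]:
    belief = (belief @ trans) * emit[:, o]

# normalize to get P(weather on the last day | the whole sequence)
posterior = belief / belief.sum()
posterior
```

After three dry days in a row, the sunny state should come out much more probable than the rainy one.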
### Application to ASR