# Hidden Markov Models

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import quantecon as qe

%matplotlib inline

## What is a hidden Markov model?

A hidden Markov model is a model in which there is a hidden state, $x_t$, that follows a Markov process and an observed state, $y_t$, that is a function of $x_t$ and randomness.

## Examples of hidden Markov models (and hidden states)

It isn't always obvious what a hidden state (i.e., a variable that one can't observe) is, or why modeling one would be useful.

Let's begin by presenting a few examples.

**Example 1: Cell phone location**

Contrary to how it may appear, your phone cannot directly measure your physical location.

The phone listens for radio signals from various satellites and uses the relative strengths (and time to receive the signal) to uncover where you are.

**Example 2: Stock market**

Some people assert that the stock market follows "animal spirits" with bear runs being periods of time in which the value of stocks (typically) declines and bull runs being periods of time in which the value of stocks (typically) rises.

The current animal spirit is not necessarily observable, but we can observe stock returns.

**Example 3: Animal behavior**

It can be difficult to directly observe what ocean creatures are doing at various moments in time.

In order to learn more about what these animals do, researchers often tag these animals with GPS trackers. They can then learn more about the different types of behavior animals might be exhibiting. For example:

* When the animal slows or stops moving, it is likely to be sleeping
* Short bursts of rapid movement may indicate hunting (or fleeing a predator!)

I'm admittedly no ecologist, so I refer any questions of how this works to [researchers in this field](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5383489/) and this [excellent podcast](https://www.learnbayesstats.com/episode/14-hidden-markov-models-statistical-ecology-with-vianey-leos-barajas).

**Example 4: Speech recognition**

Imagine that you were tasked with identifying whether a certain sports announcer led to people changing the channel. Hypothetically you could determine who was speaking at each moment of the broadcast, but collecting sufficient data to make a reliable inference would be difficult.

One alternative you could use is to make the speaker a hidden state and use audio data as the observed data to determine who was speaking at each moment in time.

(We have spoken with data scientists at large media companies who have been tasked with work that is very close to this!)

## HMM examples (with math!)

We next begin working with some simplified examples of hidden Markov models.

We will do a simple discrete state example and a simple continuous state example. In both cases, we will begin with a somewhat static model and then make it dynamic.

The theme in these examples is to "walk before you run".

## Discrete example

We begin with a version of the "canonical" HMM.

Imagine that we are a psychologist trying to learn whether an individual in our care is happy or unhappy. Every day the individual chooses to play one of two card games:

* Solitaire ($S$) or
* go fish ($G$)

We know when the individual is happy ($H$) that they play go fish with probability 0.80 and solitaire with probability 0.20. When the individual is unhappy ($U$), they play go fish with probability 0.40 and solitaire with probability 0.60.

Additionally, historically we have found that the individual is happy 60\% of the time and unhappy 40\% of the time.

Imagine that we see an individual choose to play go fish today. What is the probability that they are happy?

We can learn about this using conditional probabilities (aka Bayes' law)!

\begin{align*}
  \text{Prob}(H | G) &= \frac{\text{Prob}(H) \text{Prob}(G | H)}{\text{Prob}(G)} \\
  &= \frac{\text{Prob}(H) \text{Prob}(G | H)}{\text{Prob}(G | H) \text{Prob}(H) + \text{Prob}(G | U) \text{Prob}(U)} \\
\end{align*}

In [None]:
p_H1, p_U1 = 0.6, 0.4
p_GgH, p_SgH = 0.8, 0.2
p_GgU, p_SgU = 0.4, 0.6

(p_H1 * p_GgH) / (p_GgH*p_H1 + p_GgU*p_U1)

## Discrete example with dynamics

Now imagine that we're able to observe which game the individual plays on two consecutive days.

Let the games that the individual plays on each day be

* Day 1: Go fish
* Day 2: Solitaire

What is the probability that the individual is happy on each day?

One way to proceed is to simply calculate the day by day probabilities separately using the same rules as above.
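As a rough sketch of this day-by-day approach, the snippet below applies Bayes' law to each day in isolation. The helper name `posterior_happy` is hypothetical, and the sketch assumes that the same 60/40 prior is used on both days, which is exactly the simplification that the dynamic version relaxes.

```python
def posterior_happy(p_obs_given_H, p_obs_given_U, p_H=0.6):
    """Bayes' law for Prob(H | observation) with a two-state prior.

    Hypothetical helper: p_H is the prior probability of being happy.
    """
    p_U = 1.0 - p_H
    numerator = p_H * p_obs_given_H
    return numerator / (numerator + p_U * p_obs_given_U)


# Day 1: go fish (Prob(G | H) = 0.8, Prob(G | U) = 0.4)
p_H_day1 = posterior_happy(0.8, 0.4)

# Day 2: solitaire (Prob(S | H) = 0.2, Prob(S | U) = 0.6)
p_H_day2 = posterior_happy(0.2, 0.6)

print(f"Day 1 (go fish):   Prob(H | G) = {p_H_day1:.4f}")
print(f"Day 2 (solitaire): Prob(H | S) = {p_H_day2:.4f}")
```

Treated separately, each day's answer ignores everything we saw on the other day.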

What if, instead, we account for the fact that if someone was happy yesterday, then they are likely to be happy today?

More specifically, we will assume that an individual's mood follows a particular Markov chain where the states are $\{H, U\}$ and the transition matrix is

$$\begin{bmatrix} 0.95 & 0.05 \\ 0.2 & 0.8 \end{bmatrix}$$


In [None]:
p_HH, p_HU, p_UH, p_UU = 0.95, 0.05, 0.2, 0.8

Let's write down our conditional probabilities:

\begin{align*}
  \text{Prob}(H_1 | \{G_1, S_2\}) &= \frac{\text{Prob}(H_1) \text{Prob}(G_1, S_2 | H_1)}{\text{Prob}(G_1, S_2)} \\
  \text{Prob}(H_2 | \{G_1, S_2\}) &= \frac{\text{Prob}(H_2) \text{Prob}(G_1, S_2 | H_2)}{\text{Prob}(G_1, S_2)} \\
\end{align*}

Beginning with the components of the first equation,

\begin{align*}
  \text{Prob}(G_1, S_2) &= \text{Prob}(G_1, S_2 | H_1 H_2) \text{Prob}(H_1 H_2) \\
  &\quad + \text{Prob}(G_1, S_2 | H_1 U_2) \text{Prob}(H_1 U_2) \\
  &\quad + \text{Prob}(G_1, S_2 | U_1 H_2) \text{Prob}(U_1 H_2) \\
  &\quad + \text{Prob}(G_1, S_2 | U_1 U_2) \text{Prob}(U_1 U_2) \\
  &= \text{Prob}(G_1 | H_1) \text{Prob}(S_2 | H_2) \text{Prob}(H_2 | H_1) \text{Prob}(H_1) \\
  &\quad+ \text{Prob}(G_1 | H_1) \text{Prob}(S_2 | U_2) \text{Prob}(U_2 | H_1) \text{Prob}(H_1) \\
  &\quad+ \text{Prob}(G_1 | U_1) \text{Prob}(S_2 | H_2) \text{Prob}(H_2 | U_1) \text{Prob}(U_1) \\
  &\quad+ \text{Prob}(G_1 | U_1) \text{Prob}(S_2 | U_2) \text{Prob}(U_2 | U_1) \text{Prob}(U_1) \\
\end{align*}

and

\begin{align*}
  \text{Prob}(G_1, S_2 | H_1) &= \text{Prob}(G_1 | H_1) \text{Prob}(S_2 | H_1) \\
  &= \text{Prob}(G_1 | H_1) \left(\text{Prob}(S_2 | H_2) \text{Prob}(H_2 | H_1) + \text{Prob}(S_2 | U_2) \text{Prob}(U_2 | H_1)\right)
\end{align*}


In [None]:
# Prob(H1 | G1, S2)
p_G1S2 = (
    (p_GgH*p_SgH*p_HH*p_H1) +
    (p_GgH*p_SgU*p_HU*p_H1) +
    (p_GgU*p_SgH*p_UH*p_U1) +
    (p_GgU*p_SgU*p_UU*p_U1)
)
p_G1S2gH1 = p_GgH*(p_SgH*p_HH + p_SgU*p_HU)

p_H1gG1S2 = (p_H1 * p_G1S2gH1) / p_G1S2
p_H1gG1S2

Now onto the components of the second equation,

\begin{align*}
  \text{Prob}(H_2) &= \text{Prob}(H_2 | H_1) \text{Prob}(H_1) + \text{Prob}(H_2 | U_1) \text{Prob}(U_1)
\end{align*}

and


\begin{align*}
  \text{Prob}(H_1 | H_2) &= \frac{\text{Prob}(H_1) \text{Prob}(H_2 | H_1)}{\text{Prob}(H_2)} \\
  \text{Prob}(U_1 | H_2) &= \frac{\text{Prob}(U_1) \text{Prob}(H_2 | U_1)}{\text{Prob}(H_2)}
\end{align*}

and

\begin{align*}
  \text{Prob}(G_1 S_2 | H_2) &= \text{Prob}(G_1 S_2 | H_1 H_2) \text{Prob}(H_1 | H_2) \\
  &\quad + \text{Prob}(G_1 S_2 | U_1 H_2) \text{Prob}(U_1 | H_2) \\
  &= \text{Prob}(G_1 | H_1) \text{Prob}(S_2 | H_2) \text{Prob}(H_1 | H_2) \\
  &\quad + \text{Prob}(G_1 | U_1) \text{Prob}(S_2 | H_2)  \text{Prob}(U_1 | H_2) \\
\end{align*}


In [None]:
p_H2 = (p_HH*p_H1 + p_UH*p_U1)

p_H1gH2 = (p_H1*p_HH)/p_H2
p_U1gH2 = (p_U1*p_UH)/p_H2

p_G1S2gH2 = p_GgH*p_SgH*p_H1gH2 + p_GgU*p_SgH*p_U1gH2

p_H2gG1S2 = (p_H2 * p_G1S2gH2) / p_G1S2
p_H2gG1S2

Notice how with two observations we start to learn a little more about which states generated the observations... The observation from period 2 told us that it was less likely that the individual actually was happy yesterday!
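One way to cross-check the two posteriors is to enumerate all four hidden paths directly. The sketch below is only an illustration (the dictionary names are hypothetical): it builds the joint probability of each path together with the observations $\{G_1, S_2\}$ and then sums over the paths of interest.

```python
from itertools import product

# Prior on day 1, mood transitions, and game choice given mood
prior = {"H": 0.6, "U": 0.4}
trans = {("H", "H"): 0.95, ("H", "U"): 0.05,
         ("U", "H"): 0.2, ("U", "U"): 0.8}
emit = {("H", "G"): 0.8, ("H", "S"): 0.2,
        ("U", "G"): 0.4, ("U", "S"): 0.6}

observations = ("G", "S")

# Joint probability of each hidden path (x1, x2) and the observations
joint = {}
for x1, x2 in product("HU", repeat=2):
    joint[(x1, x2)] = (
        prior[x1] * emit[(x1, observations[0])]
        * trans[(x1, x2)] * emit[(x2, observations[1])]
    )

# Prob(G1, S2) is the sum over all hidden paths
p_obs = sum(joint.values())

# Marginalize over the other day's state and normalize
p_H1 = (joint[("H", "H")] + joint[("H", "U")]) / p_obs
p_H2 = (joint[("H", "H")] + joint[("U", "H")]) / p_obs

print(f"Prob(G1, S2)      = {p_obs:.4f}")
print(f"Prob(H1 | G1, S2) = {p_H1:.4f}")
print(f"Prob(H2 | G1, S2) = {p_H2:.4f}")
```

Enumeration is fine with two periods, but the number of paths doubles with every extra period, which is why the recursive approach matters for longer samples.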

Conditional probabilities are going to be at the center of EVERY HMM!

## Continuous example

Let $x_0$ be an $n \times 1$ random vector and $y_0$ be a $p \times 1$ random vector such that

\begin{align*}
  x_0 &\sim N(\bar{x}_0, \Sigma_0) \\
  y_0 &= G x_0 + v_0 \\
  v_0 &\sim N(0, R)
\end{align*}

where $v_0$ is orthogonal to $x_0$, $G$ is a $p \times n$ matrix, $R$ is a $p \times p$ positive definite matrix, and $\Sigma_0$ is an $n \times n$ positive definite matrix.

We will consider the problem of someone who observes $y_0$ but not $x_0$. Additionally, the individual knows $\bar{x}_0$, $\Sigma_0$, $G$, and $R$.

**What do we know?**

We know that

\begin{align*}
  \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \sim N \left(\mu, \Sigma \right)
\end{align*}

where

$$\mu = \begin{bmatrix} \bar{x}_0 \\ G \bar{x}_0 \end{bmatrix},\; \Sigma = \begin{bmatrix} \Sigma_0 & \Sigma_0 G' \\ G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix}$$


**Conditional normal equations**

Conditional on knowing $y_0$, what is the distribution of $x_0$?

$$x_0 | y_0 \sim N(\tilde{\mu}, \tilde{\Sigma})$$

where

$$\tilde{\mu} = \bar{x}_0 + \Sigma_0 G'(G \Sigma_0 G' + R)^{-1} (y_0 - G \bar{x}_0)$$

and

$$\tilde{\Sigma} = \Sigma_0 - \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0$$
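These two formulas translate almost line by line into NumPy. The function below is a minimal sketch, not a production implementation; the name `gaussian_update` and the example numbers (a two-dimensional state observed through a single noisy sum) are assumptions made purely for illustration.

```python
import numpy as np


def gaussian_update(x_bar, Sigma, G, R, y):
    """Mean and covariance of x | y when y = G x + v, v ~ N(0, R).

    Hypothetical helper: x_bar and y are 1-D arrays; Sigma, G, R are 2-D.
    """
    S = G @ Sigma @ G.T + R                 # covariance of y
    # K = Sigma G' S^{-1}; uses symmetry of Sigma and S
    K = np.linalg.solve(S, G @ Sigma).T
    mu_tilde = x_bar + K @ (y - G @ x_bar)
    Sigma_tilde = Sigma - K @ G @ Sigma
    return mu_tilde, Sigma_tilde


# Illustrative numbers only: n = 2 hidden states, p = 1 observation
x_bar0 = np.array([0.0, 0.0])
Sigma0 = np.eye(2)
G = np.array([[1.0, 1.0]])
R = np.array([[1.0]])
y0 = np.array([3.0])

mu_tilde, Sigma_tilde = gaussian_update(x_bar0, Sigma0, G, R, y0)
print(mu_tilde)
print(Sigma_tilde)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; the result is the same expression as in the formulas above.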

**What do we learn from these equations?**

* $R = \mathbb{0} \rightarrow$
  - $G \tilde{\mu} = y_0$
  - $G \tilde{\Sigma} G' = 0$ (and $\tilde{\Sigma} = 0$ when $G$ is square and invertible)
* $\tilde{\mu}$ effectively is scaling the difference between observed $y_0$ and expected $G \bar{x}_0$.
  - $y_0 - G \bar{x}_0 > 0$ implies either $\tilde{\mu} > \bar{x}_0$ or $\tilde{\mu} < \bar{x}_0$ based on the value of $\Sigma_0 G'(G \Sigma_0 G' + R)^{-1}$
  - $y_0 - G \bar{x}_0 < 0$ implies the opposite of the above
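
To see these points in the simplest possible setting, consider the scalar case $n = p = 1$. The lowercase symbols below ($\sigma^2 = \Sigma_0$, $g = G \neq 0$, $r = R$) are introduced here only for illustration.

\begin{align*}
  \tilde{\mu} &= \bar{x}_0 + \frac{\sigma^2 g}{g^2 \sigma^2 + r} (y_0 - g \bar{x}_0) \\
  \tilde{\Sigma} &= \sigma^2 - \frac{\sigma^4 g^2}{g^2 \sigma^2 + r} = \frac{\sigma^2 r}{g^2 \sigma^2 + r}
\end{align*}

The coefficient on the surprise $y_0 - g \bar{x}_0$ has the same sign as $g$, so a positive surprise raises $\tilde{\mu}$ when $g > 0$ and lowers it when $g < 0$. As $r \rightarrow 0$, the coefficient approaches $1/g$, so $g \tilde{\mu} \rightarrow y_0$ and $\tilde{\Sigma} \rightarrow 0$. As $r \rightarrow \infty$, the observation is uninformative and $\tilde{\mu} \rightarrow \bar{x}_0$, $\tilde{\Sigma} \rightarrow \sigma^2$.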

## Continuous with "dynamics"

Suppose that we have a two observation time series:

$$\{x_0, y_0, x_1, y_1\}$$

where

\begin{align*}
  x_0 &\sim N(\bar{x}_0, \Sigma_0) \\
  y_0 &= G x_0 + v_0 \\
  v_0 &\sim N(0, R) \\
  x_1 &= A x_0 + C w_1 \\
  w_1 &\sim N(0, I) \\
  y_1 &= G x_1 + v_1 \\
  v_1 &\sim N(0, R)
\end{align*}

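We will explore the probability distribution over $x_1$.

Using what we computed in the previous section (writing $\tilde{\mu}_0$ and $\tilde{\Sigma}_0$ for the conditional mean and variance of $x_0$ given $y_0$), we can determine that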

\begin{align*}
  x_1 | y_0 \sim N(A \tilde{\mu}_0, A \tilde{\Sigma}_0 A' + C C')
\end{align*}

Let

\begin{align*}
  \hat{\mu}_1 &= A \tilde{\mu}_0 \\
  \hat{\Sigma}_1 &= A \tilde{\Sigma}_0 A' + C C'
\end{align*}

Starting from here, we have a very similar problem to what we solved in the static component!

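We want to compute the distribution of $x_1$ conditional on both observations, $x_1 | y_0, y_1$. We can do this using the same formulas as in part 1, with $\hat{\mu}_1$ and $\hat{\Sigma}_1$ playing the roles of $\bar{x}_0$ and $\Sigma_0$, to get

$$x_1 | y_0, y_1 \sim N(\tilde{\mu}_1, \tilde{\Sigma}_1)$$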

where

$$\tilde{\mu}_1 = \hat{\mu}_1 + \hat{\Sigma}_1 G'(G \hat{\Sigma}_1 G' + R)^{-1} (y_1 - G \hat{\mu}_1)$$

and

$$\tilde{\Sigma}_1 = \hat{\Sigma}_1 - \hat{\Sigma}_1 G' (G \hat{\Sigma}_1 G' + R)^{-1} G \hat{\Sigma}_1$$
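The two-period calculation can be sketched as an "update, predict, update" sequence. The helper names `update` and `predict` and all of the numbers below are hypothetical, and the sketch assumes that $w_1$ has an identity covariance matrix so that the shock contributes $C C'$ to the predicted variance.

```python
import numpy as np


def update(mu, Sigma, G, R, y):
    """Condition a N(mu, Sigma) belief about x on y = G x + v, v ~ N(0, R)."""
    S = G @ Sigma @ G.T + R
    K = np.linalg.solve(S, G @ Sigma).T   # Sigma G' S^{-1}
    return mu + K @ (y - G @ mu), Sigma - K @ G @ Sigma


def predict(mu, Sigma, A, C):
    """Push a belief through x_1 = A x_0 + C w_1, assuming w_1 ~ N(0, I)."""
    return A @ mu, A @ Sigma @ A.T + C @ C.T


# Illustrative scalar model written with 1x1 arrays
x_bar0, Sigma0 = np.array([0.0]), np.array([[1.0]])
A, C = np.array([[1.0]]), np.array([[1.0]])
G, R = np.array([[1.0]]), np.array([[1.0]])
y0, y1 = np.array([2.0]), np.array([3.5])

mu_tilde0, Sigma_tilde0 = update(x_bar0, Sigma0, G, R, y0)   # x_0 | y_0
mu_hat1, Sigma_hat1 = predict(mu_tilde0, Sigma_tilde0, A, C)  # x_1 | y_0
mu_tilde1, Sigma_tilde1 = update(mu_hat1, Sigma_hat1, G, R, y1)  # x_1 | y_0, y_1

print(mu_tilde0, Sigma_tilde0)
print(mu_hat1, Sigma_hat1)
print(mu_tilde1, Sigma_tilde1)
```

Repeating the predict and update steps for every new observation is the basic pattern behind the Kalman filter.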

This dynamic example in a continuous state/observation equation is a preface to linear state space models and the Kalman filter.

We will explore these topics in more depth soon.

## Discrete state HMMs

Now that we've done some two-period examples, we're going to move on to a $T$-period example.

Consider the following setting:

The weekly returns for a particular stock alternate between bear and bull cycles according to a Markov chain. You have been told that the transition matrix that describes this Markov chain is given by:

\begin{align*}
  \begin{bmatrix} p_{\text{bear}} & 1 - p_{\text{bear}} \\ 1 - p_{\text{bull}} & p_{\text{bull}} \end{bmatrix}
\end{align*}

where $p_{\text{bear}} = 0.85$ and $p_{\text{bull}} = 0.7$.

Returns can either be negative ($N$), zero ($Z$), or positive ($P$).

The weekly returns that an individual earns are random and depend on whether the market is in a bear or bull cycle.

\begin{align*}
  r_{\text{bear}} = \begin{cases} N \text{ with probability } 0.2 \\ Z \text{ with probability } 0.75 \\ P \text{ with probability } 0.05 \end{cases} \\
  r_{\text{bull}} = \begin{cases} N \text{ with probability } 0.1 \\ Z \text{ with probability } 0.6 \\ P \text{ with probability } 0.3 \end{cases}
\end{align*}
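Before simulating, it can help to know what this model implies in the long run. The sketch below uses only NumPy (variable names are hypothetical) to solve for the stationary distribution of the bear/bull chain and the implied unconditional distribution of returns.

```python
import numpy as np

p_bear, p_bull = 0.85, 0.7
P = np.array([[p_bear, 1 - p_bear],
              [1 - p_bull, p_bull]])

# Rows: bear, bull. Columns: negative, zero, positive return
emission = np.array([[0.2, 0.75, 0.05],
                     [0.1, 0.6, 0.3]])

# Stationary distribution solves pi P = pi with sum(pi) = 1.
# Stack both conditions and solve the system by least squares.
n = P.shape[0]
lhs = np.vstack([P.T - np.eye(n), np.ones(n)])
rhs = np.append(np.zeros(n), 1.0)
stationary, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)

# Long-run frequency of each return outcome
long_run_returns = stationary @ emission

print(f"Long-run share of bear/bull weeks: {stationary}")
print(f"Long-run share of N/Z/P returns:   {long_run_returns}")
```

With these parameters the market spends roughly two thirds of weeks in the bear state, so zero returns dominate the observed data.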

**Simulate data**

We start by simulating the output of such a model.

In [None]:
# Two years of data
T = 104

p_bear = 0.85
p_bull = 0.7
P = np.array([[p_bear, 1 - p_bear], [1 - p_bull, p_bull]])

r_bear_probs = np.array([0.2, 0.75, 0.05])
r_bull_probs = np.array([0.1, 0.6, 0.3])

mc = qe.MarkovChain(P)


def simulate_bb_model(mc, r_bear_probs, r_bull_probs, T):
    # First simulate the bear/bull component
    bb_idx = mc.simulate_indices(T)

    realized_returns = np.zeros(T, dtype=int)
    for t, bb in enumerate(bb_idx):
        # Build the discrete random variable for each period
        if bb == 0:
            r_probs = qe.DiscreteRV(r_bear_probs)
        else:
            r_probs = qe.DiscreteRV(r_bull_probs)

        realized_returns[t] = r_probs.draw()[0]

    return bb_idx, realized_returns


**Examining the data**

In [None]:
bb_idx, realized_returns = simulate_bb_model(mc, r_bear_probs, r_bull_probs, T)

In [None]:
def plot_bb_model_output(bb_idx, realized_returns):
    # Relevant plotting stuff
    T = bb_idx.shape[0]
    tvalues = np.arange(T)

    fig, ax = plt.subplots(2, 1, figsize=(8, 10), sharex=True)
    ax0, ax1 = ax

    ax0.scatter(tvalues, bb_idx)
    ax0.set_yticks([0, 1])
    ax0.set_yticklabels(["Bear", "Bull"])
    ax0.spines["right"].set_visible(False)
    ax0.spines["top"].set_visible(False)

    ax1.scatter(tvalues, realized_returns)
    ax1.set_yticks([0, 1, 2])
    ax1.set_yticklabels(["Negative", "Zero", "Positive"])
    ax1.spines["right"].set_visible(False)
    ax1.spines["top"].set_visible(False)

    pass

plot_bb_model_output(bb_idx, realized_returns)

### Objects (mostly probabilities) that we might be interested in:

1. $P(x_t | y^t)$: Can we use the history of observed returns to identify whether we are currently in a bear or bull market -- This is known as the "filtering problem".
2. $P(x_\tau | y^t)$ where $\tau < t$: Can we use the history of observed returns to identify whether we were in a bear or bull market in the past -- This is known as the "smoothing problem"
3. $P(x_\tau | y^t)$ where $\tau > t$: Can we use the data we've observed until now to predict the state in the future -- This is known as the "forecasting (or prediction) problem"
4. $P(y^t)$: What is the likelihood of having observed the returns that we see -- This is known as the "likelihood problem"
5. $\hat{x}^t$: What is the most likely sequence of market conditions to have generated the data we see -- This is known as the "most likely hidden path"
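
As a small illustration of item 3, suppose we already had a filtered probability $P(x_t | y^t)$ in hand. Because the hidden state is a Markov chain, a $k$-step-ahead forecast of the state is that probability vector multiplied by the $k$-th power of the transition matrix. The function name `forecast_state` and the 50/50 starting belief below are assumptions for illustration only.

```python
import numpy as np

P = np.array([[0.85, 0.15],
              [0.30, 0.70]])


def forecast_state(filtered_probs, P, k):
    """P(x_{t+k} | y^t) given P(x_t | y^t) and transition matrix P."""
    return np.asarray(filtered_probs) @ np.linalg.matrix_power(P, k)


# Hypothetical current belief: equally likely to be in bear or bull
current_belief = np.array([0.5, 0.5])

one_week_ahead = forecast_state(current_belief, P, 1)
one_year_ahead = forecast_state(current_belief, P, 52)

print(f"1 week ahead:   {one_week_ahead}")
print(f"52 weeks ahead: {one_year_ahead}")
```

Far-ahead forecasts forget the current belief and settle on the chain's long-run distribution.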

#### Filtering problem

The filtering problem is about using the history of observed data, $y^t \equiv \{y_0, y_1, \dots, y_t\}$, to identify the current hidden state, i.e. $P(x_t | y^t)$.

The probabilities will be computed recursively.

Let

$$\alpha(x_t) \equiv P(x_t, y^{t})$$

Then $\alpha(x_0) = P(y_0 | x_0) P(x_0)$.

Recursively, if we have $\alpha(x_{t-1})$ then

\begin{align*}
  \alpha(x_t) &= P(x_t, y^{t}) \\
  &= \sum_{x_{t-1}} P(x_t, x_{t-1}, y^{t}) \\
  &= \sum_{x_{t-1}} P(y_t | x_{t-1}, x_{t}) P(y^{t-1} | x_{t-1}, x_{t}) P(x_{t}, x_{t-1}) \\
  &= P(y_t | x_{t}) \sum_{x_{t-1}} P(y^{t-1} | x_{t-1}) P(x_{t} | x_{t-1}) P(x_{t-1}) \\
  &= P(y_t | x_{t}) \sum_{x_{t-1}} P(y^{t-1}, x_{t-1}) P(x_{t} | x_{t-1}) \\
  &= P(y_t | x_{t}) \sum_{x_{t-1}} \alpha(x_{t-1}) P(x_{t} | x_{t-1}) \\
\end{align*}

Now notice that

\begin{align*}
  P(x_t | y^t) &= \frac{P(x_t, y^t)}{P(y^t)} \\
  &\propto P(x_t, y^t) \\
  &= \alpha(x_t)
\end{align*}

Let's compute the filtering probabilities for every period in our sample and compare the probability of being in a bear/bull market in the final period of our 104-week sample with the actual state:

In [None]:
# Allocate memory for our alphas
t_of_interest = 104
alphas = np.zeros((t_of_interest, 2))

# Solve for period 0 -- Equal probability of starting
# in bear/bull market
alphas[0, 0] = r_bear_probs[realized_returns[0]] * 0.5
alphas[0, 1] = r_bull_probs[realized_returns[0]] * 0.5

for t in range(1, t_of_interest):

    # Sum over  x_{t-1}
    predictor_bear = 0.0
    predictor_bull = 0.0
    for j in range(2):
        #            alpha(x_{t-1}) P(x_t | x_{t-1})
        predictor_bear += alphas[t-1, j]*mc.P[j, 0]
        predictor_bull += alphas[t-1, j]*mc.P[j, 1]

    alphas[t, 0] = r_bear_probs[realized_returns[t]]*predictor_bear
    alphas[t, 1] = r_bull_probs[realized_returns[t]]*predictor_bull

# Convert with normalizing factor!
filtering_probs = np.divide(alphas, alphas.sum(axis=1)[:, None])

print(f"Probability of bear/bull is {filtering_probs[-1, :]}")
print(f"Actual state is {bb_idx[t_of_interest-1]}")

In [None]:
tvalues = np.arange(bb_idx.shape[0])

fig, ax = plt.subplots(3, 1, sharex=True, figsize=(10, 8))

ax[0].scatter(tvalues, bb_idx)
ax[1].scatter(tvalues, realized_returns)
ax[2].plot(tvalues, filtering_probs[:, 1])
ax[2].set_ylim(0, 1)
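
The explicit loops above can also be written in matrix form. The sketch below is one possible variant rather than the method used above: it normalizes at every step, which keeps the numbers from shrinking toward zero in long samples, and it accumulates the normalizing constants to recover the log-likelihood $\log P(y^t)$. The function name `forward_filter`, the short observation sequence, and the 50/50 initial distribution are assumptions for illustration.

```python
import numpy as np


def forward_filter(obs, P, B, pi0):
    """Normalized forward recursion for a discrete HMM.

    obs: sequence of observation indices
    P:   transition matrix, P[i, j] = P(x_t = j | x_{t-1} = i)
    B:   emission matrix, B[i, k] = P(y_t = k | x_t = i)
    pi0: distribution of x_0
    Returns filtered probabilities P(x_t | y^t) and log P(y^t).
    """
    T = len(obs)
    filtered = np.zeros((T, P.shape[0]))
    log_likelihood = 0.0
    predicted = np.asarray(pi0, dtype=float)
    for t in range(T):
        unnormalized = B[:, obs[t]] * predicted   # P(x_t, y_t | y^{t-1})
        c = unnormalized.sum()                    # P(y_t | y^{t-1})
        filtered[t] = unnormalized / c
        log_likelihood += np.log(c)
        predicted = filtered[t] @ P               # P(x_{t+1} | y^t)
    return filtered, log_likelihood


P = np.array([[0.85, 0.15], [0.30, 0.70]])
B = np.array([[0.2, 0.75, 0.05], [0.1, 0.6, 0.3]])
pi0 = np.array([0.5, 0.5])
obs = [0, 1, 2, 2, 1]   # hypothetical returns: N, Z, P, P, Z

filtered, log_likelihood = forward_filter(obs, P, B, pi0)
print(filtered)
print(log_likelihood)
```

Calling `forward_filter(realized_returns, mc.P, np.vstack([r_bear_probs, r_bull_probs]), [0.5, 0.5])` should reproduce `filtering_probs` from the loop version, since the two differ only in when the normalization happens.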

**Useful References**

* [Blog post by Jonathan Hui](https://jonathan-hui.medium.com/machine-learning-hidden-markov-model-hmm-31660d217a61)
* [Slides by Martin Haugh @ Columbia](http://www.columbia.edu/~mh2078/MachineLearningORFE/HMMs_MasterSlides.pdf)