# CS486 - Artificial Intelligence
## Lesson 22 - Markov Chains

Today we will discuss how to build Markov chains that approximate the probability distribution over a set of random variables. Markov chains are the simplest form of Markov model and are useful for predictive modeling.

Here's an example of using Markov chains to generate English sentences:

```python
import helpers
from aima.text import *
from utils import open_data

# Train a 3-gram word model on flatland.txt, then sample 10 words from it.
text = open_data("EN-text/flatland.txt").read()
model = NgramWordModel(3, words(text))
print(model.samples(10), "...")
```

```
i were the maddest of the sides a square there ...
```


Markov chains are how your phone suggests the next word in a sentence you're typing and how spell checkers find the closest word to a misspelled one.

So how does it work? We'll need a few more probability tricks to understand.  

## Bayes' Rule

Bayes' rule is often called the most important equation in AI:  

> The essence of the Bayesian approach is to provide a mathematical rule explaining how you should change your existing beliefs in the light of new evidence. In other words, it allows scientists to combine new data with their existing knowledge or expertise. 
*<p>[In Praise of Bayes](https://www.economist.com/science-and-technology/2000/09/28/in-praise-of-bayes)</p>*

Suppose you have two coins: a fair coin and a coin with heads on both sides. If you pulled one of the two coins from a bag at random, the probability of pulling the fair coin would be 1 in 2. Now suppose someone else pulls a coin and flips heads ($y$). What is the probability, given this observation, that they pulled the fair coin ($x$)? The fair coin lands heads with probability $\frac{1}{2}$, and heads overall has probability $P(y) = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot 1 = \frac{3}{4}$. We can use Bayes' Rule, the first equality below, to compute the answer:

$$P(x|y) = \frac{P(y|x)}{P(y)}P(x) = \frac{\frac{1}{2}}{\frac{3}{4}}\frac{1}{2} = \frac{1}{3}$$
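The arithmetic above can be checked with exact fractions. This is a minimal sketch; the variable names are chosen here for illustration and are not part of the course code.

```python
from fractions import Fraction

# Prior: each coin is equally likely to be drawn from the bag.
p_fair = Fraction(1, 2)                  # P(x)
p_two_headed = 1 - p_fair

# Likelihood of observing heads under each hypothesis.
p_heads_given_fair = Fraction(1, 2)      # P(y | x)
p_heads_given_two_headed = Fraction(1)   # a two-headed coin always shows heads

# Marginal probability of heads, summed over both coins: P(y)
p_heads = p_heads_given_fair * p_fair + p_heads_given_two_headed * p_two_headed

# Bayes' rule: P(x | y) = P(y | x) P(x) / P(y)
posterior_fair = p_heads_given_fair * p_fair / p_heads

print(p_heads)         # 3/4
print(posterior_fair)  # 1/3
```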

Here's a short video that provides an intuition for Bayes' Rule using the coin example:

```python
from IPython.display import YouTubeVideo
YouTubeVideo('Zxm4Xxvzohk?rel=0&showinfo=0')
```

Bayes' Rule can also be thought of as:

```
                 likelihood * prior
    posterior = ----------------------
                 marginal likelihood
```                  

The denominator is just a normalizing constant that ensures the posterior sums to 1; it can be computed by summing the numerator over all possible values of $x$. If you skip the normalization, you can write Bayes' rule as a proportionality:

$$ P(x\mid{y}) \propto P(y\mid{x})P(x) $$
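To see the normalizing constant at work, the sketch below computes likelihood times prior for every hypothesis in the two-coin example and then divides by the sum. The `normalize` helper and the dictionary layout are assumptions made for this illustration.

```python
def normalize(unnormalized):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(unnormalized.values())
    return {hypothesis: w / total for hypothesis, w in unnormalized.items()}

prior = {"fair": 0.5, "two_headed": 0.5}             # P(x)
likelihood_heads = {"fair": 0.5, "two_headed": 1.0}  # P(y = heads | x)

# Numerator of Bayes' rule for each hypothesis: P(y | x) P(x)
unnormalized = {x: likelihood_heads[x] * prior[x] for x in prior}

# The sum of the numerators is the marginal likelihood P(y).
posterior = normalize(unnormalized)

print(unnormalized)  # {'fair': 0.25, 'two_headed': 0.5}
print(posterior)
```

The unnormalized weights already rank the hypotheses correctly; dividing by their sum (0.75) only rescales them into probabilities.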

## Independence

Two variables are independent if they have no effect on each other. Consider the probability distribution of a coin:

| Coin   |  P  |
|--------|-----|
| heads  | 0.5 | 
| tails  | 0.5 | 

Now consider the distribution of two coin flips:

| Coin1 | Coin2 | P |
|-------|-------|---|
| heads  | heads  | 0.25 | 
| heads  | tails  | 0.25 | 
| tails  | heads  | 0.25 | 
| tails  | tails  | 0.25 | 

The joint distribution is simply the product of the individual distributions, which is the definition of **independence**:

$$ X{\perp\!\!\!\perp}Y \iff P(X,Y) = P(X)P(Y) \iff \forall x,y:\ P(x,y) = P(x)P(y) $$

In practice, variables are seldom independent, but independence is a *modeling assumption* that can greatly simplify our model. 
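The product test in the definition can be run directly on a joint table. The sketch below checks the two-coin table from above and, for contrast, a hypothetical pair of coins that always land the same way; the helper names are illustrative assumptions.

```python
import math

def marginal(joint, index):
    """Sum a joint table over every variable except the one at `index`."""
    result = {}
    for outcome, p in joint.items():
        result[outcome[index]] = result.get(outcome[index], 0.0) + p
    return result

def is_independent(joint, tol=1e-9):
    """Check P(x, y) == P(x) P(y) for every pair of values."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x in p_x
        for y in p_y
    )

two_flips = {
    ("heads", "heads"): 0.25,
    ("heads", "tails"): 0.25,
    ("tails", "heads"): 0.25,
    ("tails", "tails"): 0.25,
}

# Hypothetical dependent case: the second coin always copies the first.
linked_flips = {("heads", "heads"): 0.5, ("tails", "tails"): 0.5}

print(is_independent(two_flips))     # True
print(is_independent(linked_flips))  # False
```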

## Conditional Independence 

**Conditional independence** is our most basic and robust form of knowledge about uncertain environments. A variable is conditionally independent of another if, once some evidence is given, the likelihood of one is not influenced by the other. For example, height and vocabulary are dependent, but they are conditionally independent given age. We would write that as:

$$ P(Vocabulary \mid{Height,Age}) = P(Vocabulary \mid{Age}) $$
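The sketch below illustrates the height and vocabulary example with a small joint table. All numbers are made up for illustration, and the table is deliberately built so that conditional independence given age holds; the point is only to show that height and vocabulary can still look dependent once age is ignored.

```python
import math
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
p_age = {"child": 0.5, "adult": 0.5}
p_height_given_age = {
    "child": {"short": 0.9, "tall": 0.1},
    "adult": {"short": 0.2, "tall": 0.8},
}
p_vocab_given_age = {
    "child": {"small": 0.8, "large": 0.2},
    "adult": {"small": 0.1, "large": 0.9},
}

# Joint built as P(age) P(height | age) P(vocab | age).
joint = {
    (a, h, v): p_age[a] * p_height_given_age[a][h] * p_vocab_given_age[a][v]
    for a, h, v in product(p_age, ["short", "tall"], ["small", "large"])
}

NAMES = ("age", "height", "vocab")

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(
        p for outcome, p in joint.items()
        if all(outcome[NAMES.index(name)] == value for name, value in fixed.items())
    )

def conditional(query, given):
    """P(query | given), both passed as dicts of variable -> value."""
    return prob(**query, **given) / prob(**given)

p_large = prob(vocab="large")
p_large_given_tall = conditional({"vocab": "large"}, {"height": "tall"})
p_large_given_child = conditional({"vocab": "large"}, {"age": "child"})
p_large_given_tall_child = conditional(
    {"vocab": "large"}, {"height": "tall", "age": "child"}
)

# Without age, height carries information about vocabulary.
print(p_large, p_large_given_tall)
# With age known, height adds nothing.
print(p_large_given_child, p_large_given_tall_child)
```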



## Markov Models

We want to model a joint distribution without computing the full joint table. Suppose $X_t$ is the **state** at time $t$. Each state encodes a joint assignment to all of our variables at that time. We connect states in a linear chain and assume that each state, given the state immediately before it, is independent of all earlier states. 

<center><img src="images/markov_chain.png" /></center>

We would write the independence as:

$$ X_t {\perp\!\!\!\perp} X_1,\ldots,X_{t-2}\mid{X_{t-1}} $$

The independence assumption is a strong one and probably doesn't hold in reality, but we're only looking for an approximation. Another assumption is that our transition model is stationary. In other words, $P(X_t\mid{X_{t-1}})$ doesn't change with $t$. 
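Combining the Markov assumption with the chain rule of probability shows why the model is so compact. The short derivation below uses $T$ for the sequence length, a symbol introduced here for illustration. The first equality is the chain rule; the second applies the conditional independence statement to each factor:

$$ P(X_1,\ldots,X_T) = P(X_1)\prod_{t=2}^{T} P(X_t\mid{X_1,\ldots,X_{t-1}}) = P(X_1)\prod_{t=2}^{T} P(X_t\mid{X_{t-1}}) $$

With a stationary transition model, every factor in the product is the same table, so the whole joint is specified by a prior and one transition model.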

So what does a transition model look like? Consider the partial transition model for a sentence generator trained on *I am Sam. Sam I am. I do not like green eggs and ham.*

<center><img src="images/unigrams.png" /></center>
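A transition model of this kind can be estimated by counting which word follows which. The sketch below is a plain-Python illustration, not the `aima` implementation used earlier; it lowercases the text and does not count transitions across sentence boundaries, which are choices made here and may differ from the figure.

```python
import random
import re
from collections import Counter, defaultdict

text = "I am Sam. Sam I am. I do not like green eggs and ham."

def build_transition_model(text):
    """Estimate P(next word | current word) from adjacent word pairs."""
    counts = defaultdict(Counter)
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    # Turn raw counts into conditional probabilities.
    return {
        current: {w: c / sum(followers.values()) for w, c in followers.items()}
        for current, followers in counts.items()
    }

def generate(model, start, max_words=10, seed=0):
    """Walk the chain from `start`, stopping at a word with no successor."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and words[-1] in model:
        options = model[words[-1]]
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

transitions = build_transition_model(text)
print(transitions["i"])
print(generate(transitions, "i"))
```

In this model "i" is followed by "am" two times out of three and by "do" one time out of three, while "ham" has no outgoing transitions because it only ever ends a sentence.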

## Mini-Forward Algorithm

Some Markov chains have a **stationary distribution**, $P(X_{\infty})$. Intuitively, the stationary distribution tells us the probability of being in each state after moving through our transition model an infinite number of times. Given an initial observation, we can approximate the stationary distribution by repeatedly applying the **Mini-Forward Algorithm**:

$$ P(x_1) = \text{prior observation} $$
$$ P(x_t) = \sum_{x_{t-1}}P(x_t\mid{x_{t-1}})P(x_{t-1})$$

This is essentially the Bellman equation of probability distributions. It gives us a way to iterate our values until they converge. 
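The update can be sketched in a few lines of Python. The two-state weather chain and its transition probabilities below are hypothetical values chosen for illustration, as are the function names and the convergence tolerance.

```python
def mini_forward_step(belief, transition):
    """One update: P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    new_belief = {state: 0.0 for state in belief}
    for previous, p_previous in belief.items():
        for current, p_transition in transition[previous].items():
            new_belief[current] += p_transition * p_previous
    return new_belief

def iterate_to_convergence(belief, transition, tol=1e-12, max_steps=10_000):
    """Repeat the update until the distribution stops changing."""
    for _ in range(max_steps):
        new_belief = mini_forward_step(belief, transition)
        if max(abs(new_belief[s] - belief[s]) for s in belief) < tol:
            return new_belief
        belief = new_belief
    return belief

# Hypothetical transition model: transition[previous][current]
transition = {
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}
initial = {"sun": 1.0, "rain": 0.0}  # we observed sun on day 1

stationary = iterate_to_convergence(initial, transition)
print(stationary)  # close to {'sun': 0.75, 'rain': 0.25}
```

For this chain, starting from `{"sun": 0.0, "rain": 1.0}` converges to the same values, which is what makes the result a property of the transition model rather than of the starting point.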

Stationary distributions are useful. PageRank, for example, is a stationary distribution of a Markov chain. Not all Markov chains have a stationary distribution. Our sentence generator, for example, does not. Why not?