# The Markov Property
What is the Markov property?
> The Markov property holds when tomorrow's weather depends only on today's weather, and not on yesterday's weather. It holds when the next word in a sentence depends only on the previous word, and not on any earlier words. It holds when tomorrow's stock price depends only on today's stock price.

The Markov property is also called the Markov assumption, because we can clearly see that this is a strong assumption! We are essentially throwing away all historical data except for the most recent observation.

In more general terms, what we have been referring to as `weather` or `stock price` can be thought of as a **state**. The Markov assumption says that the current state depends only on the previous state, or equivalently that the next state depends only on the current state. Another way of saying this is that the distribution of the state at time $t$, given the entire history, depends only on the state at time $t-1$:
### $$\text{State at time } t \rightarrow s(t)$$
### $$p\Big(s(t) \; | \; s(t-1), s(t-2), \ldots, s(0)\Big) = p\Big(s(t) \; | \; s(t-1) \Big)$$
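To make the assumption concrete, here is a minimal Python sketch of a sampler in which each new state is drawn by looking only at the current state. The two weather states, the probabilities in `TRANSITIONS`, and the function name `sample_chain` are all made up for illustration; they are not estimated from real data.

```python
import random

# Hypothetical transition probabilities p(s(t) | s(t-1)); the values are invented.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def sample_chain(start, length, seed=0):
    """Sample a sequence in which each state depends only on the previous one."""
    rng = random.Random(seed)
    states = [start]
    while len(states) < length:
        current = states[-1]  # the only piece of history we look at
        options = list(TRANSITIONS[current])
        weights = [TRANSITIONS[current][s] for s in options]
        states.append(rng.choices(options, weights=weights, k=1)[0])
    return states


print(sample_chain("sunny", 7))
```

Notice that the loop body never reads anything older than `states[-1]`; that is the whole Markov assumption expressed in code.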

Why do we want to do this? Well, the goal here is to model the joint probability; in other words, *the probability of seeing an entire specific sequence*.

For example, if we had a sequence of 4 states, then without the Markov property we would expand the joint probability with the chain rule, one conditional at a time:

#### $$p(s4, s3, s2, s1) = p(s4 \;|\; s3, s2, s1)p(s3, s2, s1)$$
#### $$p(s4, s3, s2, s1) = p(s4 \;|\; s3, s2, s1)p(s3\;|\; s2, s1)p(s2, s1)$$
#### $$p(s4, s3, s2, s1) = p(s4 \;|\; s3, s2, s1)p(s3\;|\; s2, s1)p(s2\;|\; s1)p(s1)$$

On the other hand, if we do use the Markov property, each conditional keeps only the most recent state, and the joint probability simplifies to:
#### $$p(s4, s3, s2, s1) = p(s4 \;|\; s3)p(s3 \;|\; s2)p(s2 \;|\;s1)p(s1)$$
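The factorization above can be sketched in a few lines of Python: multiply the probability of the first state by one transition probability per step. The initial distribution `INITIAL`, the table `TRANSITIONS`, and the helper `joint_probability` are hypothetical names with invented values, used only to show the arithmetic.

```python
# Hypothetical initial distribution p(s1) and transitions p(s(t) | s(t-1)).
# All numbers are invented for illustration.
INITIAL = {"sunny": 0.5, "rainy": 0.5}
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def joint_probability(sequence):
    """Return p(s1) * p(s2 | s1) * ... under a first-order Markov assumption."""
    prob = INITIAL[sequence[0]]
    # Each factor conditions only on the immediately preceding state.
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= TRANSITIONS[prev][curr]
    return prob


p = joint_probability(["sunny", "sunny", "rainy", "rainy"])
print(p)  # 0.5 * 0.8 * 0.2 * 0.6, i.e. about 0.048
```

With real data, `INITIAL` and `TRANSITIONS` would be estimated from observed sequences rather than written by hand.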


Think about the sequence $s1, s2, s3$. How often does that exact sequence occur in our data? If it rarely occurs, we have very few samples to work with, so how can we accurately measure $p(s4 \;|\; s3, s2, s1)$?

## Concrete Example
Notice that if we were to take the most general form, where the state at time $t$ depends on all of the previous states, it would be really hard to measure these probability distributions. For example, think of a Wikipedia article where we try to predict the next word. Let's say it is a 1000-word article. Now we have to estimate the distribution of the 1000th word given the previous 999 words:

### $$p(w_{1000} \; | \; w_{999}, ...,w_1)$$

However, we can imagine that this is the only Wikipedia article containing that exact sequence of 999 words. So our probability estimate is 1 out of 1: the word that actually followed gets probability 1, based on a single sample. That is not a great language model.

Conversely, if you are at the beginning of the article and you only have 1 previous word, say "the", then there is an enormous number of possible next words. So, you may want to do something in between, like training only on sequences of 3 or 4 words. In this case, the current word depends only on the 2 or 3 previous words. With 2 previous words, this is:

### $$p(w(t) \; | \; w(t-1), w(t-2))$$
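One simple way to estimate such a distribution is to count three-word windows and normalize the counts for each two-word context. The sketch below assumes a tiny invented corpus and a hypothetical helper `fit_second_order`; a real language model would need far more text and some form of smoothing for unseen contexts.

```python
from collections import Counter, defaultdict


def fit_second_order(words):
    """Estimate p(w(t) | w(t-1), w(t-2)) by counting three-word windows."""
    counts = defaultdict(Counter)
    # Slide a window of three words: (w(t-2), w(t-1)) is the context, w(t) the target.
    for w_prev2, w_prev1, w_curr in zip(words, words[1:], words[2:]):
        counts[(w_prev2, w_prev1)][w_curr] += 1

    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {w: c / total for w, c in next_counts.items()}
    return model


# Toy corpus, invented for illustration.
corpus = "the cat sat on the mat and the cat sat on the sofa".split()
model = fit_second_order(corpus)
print(model[("on", "the")])  # distribution over the words seen after "on the"
```

Because the context is only two words long, the same context shows up many times in a large corpus, which is exactly what makes the estimate more reliable than the 999-word case.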

## Generalize
To generalize the above concept, we have:


#### $$\text{First order Markov} \rightarrow p\Big(s(t) \; | \; s(t-1)\Big)$$
#### $$\text{Second order Markov} \rightarrow p\Big(s(t) \; | \; s(t-1), s(t-2)\Big)$$
#### $$\text{Third order Markov} \rightarrow p\Big(s(t) \; | \; s(t-1), s(t-2), s(t-3)\Big)$$
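Higher orders keep more history, but the number of probabilities to estimate grows quickly. As a rough sketch, assuming a discrete state space with `num_states` possible values, an order-$k$ model has `num_states ** k` possible contexts, and each context needs `num_states - 1` free probabilities (the last one is fixed because they sum to 1). The function names and the vocabulary size of 10,000 below are hypothetical choices for illustration.

```python
def num_contexts(num_states, order):
    """Number of distinct histories an order-`order` Markov model conditions on."""
    return num_states ** order


def num_free_parameters(num_states, order):
    """Free probabilities in the table p(s(t) | previous `order` states).

    Each context has its own distribution over num_states outcomes; since it
    sums to 1, it contributes num_states - 1 free values.
    """
    return num_contexts(num_states, order) * (num_states - 1)


# Hypothetical vocabulary of 10,000 words.
for order in (1, 2, 3):
    print(order, num_free_parameters(num_states=10_000, order=order))
```

This is the trade-off behind choosing an order: more context per prediction, but far fewer samples per context.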