# Introduction

When I started studying reinforcement learning, I could not find a good online resource that explained from the ground up what reinforcement learning really is. Most of the (very good) blogs out there focus on modern approaches (Deep Reinforcement Learning) and introduce the **Bellman equation** without a satisfying explanation. I turned my attention to books and found **Artificial Intelligence: A Modern Approach** by **Russell and Norvig**.

<img src="files/figures/artificial_intelligence_a_modern_approach.png" style="width: 500px;" />

This article is based on **chapter 17** of the second edition and can be considered an extended review of that chapter. I will use the same mathematical notation as the authors, so you can use the book to fill in anything missing here, or vice versa. In the next section I will introduce **Markov chains**; if you already know this concept you can skip it.

# In the Beginning was Andrey Markov

**Andrey Markov** was a Russian mathematician who studied stochastic processes. Markov was particularly interested in systems that follow a chain of linked events. In 1906 Markov produced interesting results about discrete processes that he called **chains**. A **Markov chain** has a set of **states** $S=\{ s_0, s_1, \ldots, s_m \}$ and a **process** that can move successively from one state to another. Each move is a single **step** and is based on a **transition model** $T$. It is worth making an effort to remember the keywords in bold, because we will use them extensively in the rest of the article. To summarise, a Markov chain is defined by:

1. Set of possible states: $S=\{ s_0, s_1, \ldots, s_m \}$
2. Initial state: $s_0$
3. Transition Model: $T(s, s')$
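The three ingredients above are enough to simulate a chain. The following is only a minimal sketch: the two-state transition values are illustrative placeholders, and `sample_trajectory` is a hypothetical helper name, not something taken from the book.

```python
import numpy as np

# Illustrative two-state chain (placeholder values, assumed for this sketch)
states = ["s0", "s1"]
T = np.array([[0.90, 0.10],
              [0.50, 0.50]])

def sample_trajectory(T, start, n_steps, rng):
    """Return the list of visited state indices, beginning with `start`."""
    trajectory = [start]
    current = start
    for _ in range(n_steps):
        # Row `current` of T holds the probabilities of each possible next state
        current = int(rng.choice(len(T), p=T[current]))
        trajectory.append(current)
    return trajectory

rng = np.random.default_rng(0)
trajectory = sample_trajectory(T, start=0, n_steps=10, rng=rng)
print([states[i] for i in trajectory])
```

Each call to `rng.choice` is one **step** of the process; changing the seed gives a different realisation of the same chain.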

There is something peculiar about a Markov chain that I have not mentioned yet: it is based on the **Markov property**. The Markov property states that **given the present, the future is conditionally independent of the past**. In other words, the state of the process at time $t$ depends only on the state it was in at $t-1$. An example makes Markov chains easier to digest. Let's suppose we have a chain with only two states $s_0$ and $s_1$, where $s_0$ is the initial state. When the process is in $s_0$ it stays there 90% of the time and moves to $s_1$ the remaining 10% of the time. When the process is in $s_1$ it remains there 50% of the time and moves back to $s_0$ the other 50%. Given this data we can create a **Transition Matrix** $T$ as follows:

\begin{equation}
T =
\begin{bmatrix}
   0.90 & 0.10 \\
   0.50 & 0.50
\end{bmatrix}
\end{equation}

The transition matrix is always a square matrix and, since each row is a probability distribution over the next state, all entries lie between 0 and 1 and each row sums to 1. **We can represent the Markov chain graphically**. In the following representation each state of the chain is a node and each transition probability is an edge. Higher probabilities are drawn with thicker edges:

<img src="files/figures/simple_markov_chain.png" style="width: 500px;" />
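The properties of a transition matrix listed above (square, entries between 0 and 1, rows summing to 1) are easy to check programmatically. Here is a small sketch; `is_transition_matrix` and its `tol` parameter are hypothetical names introduced only for this example.

```python
import numpy as np

def is_transition_matrix(T, tol=1e-9):
    """Check that T is square, has entries in [0, 1] and rows summing to 1."""
    T = np.asarray(T, dtype=float)
    if T.ndim != 2 or T.shape[0] != T.shape[1]:
        return False
    in_range = np.all((T >= 0.0) & (T <= 1.0))
    rows_sum_to_one = np.allclose(T.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and rows_sum_to_one)

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])
print(is_transition_matrix(T))                         # True
print(is_transition_matrix([[0.9, 0.2], [0.5, 0.5]]))  # False: first row sums to 1.1
```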

Until now we have not mentioned **time**, but we have to, because Markov chains are dynamical processes that evolve in time. Let's suppose we have to guess where the process will be after 3 steps and after 50 steps. How can we do it? We are interested in chains that have a finite number of states and are time-homogeneous, meaning that the transition matrix does not change over time. Given these assumptions **we can compute the k-step transition probability as the k-th power of the transition matrix**. Let's do it in NumPy:

```python
import numpy as np

# Declaring the Transition Matrix T
T = np.array([[0.90, 0.10],
             [0.50, 0.50]])

# Obtaining T after 3 steps
T_3 = np.linalg.matrix_power(T, 3)
# Obtaining T after 50 steps
T_50 = np.linalg.matrix_power(T, 50)
# Obtaining T after 100 steps
T_100 = np.linalg.matrix_power(T, 100)

# Printing the matrices
print("T: " + str(T))
print("T_3: " + str(T_3))
print("T_50: " + str(T_50))
print("T_100: " + str(T_100))
```

```
T: [[ 0.9  0.1]
    [ 0.5  0.5]]

T_3: [[ 0.844  0.156]
      [ 0.78   0.22 ]]

T_50: [[ 0.83333333  0.16666667]
       [ 0.83333333  0.16666667]]

T_100: [[ 0.83333333  0.16666667]
        [ 0.83333333  0.16666667]]
```

Now we define the **initial distribution**, which represents the state of the system at $k=0$. Our system is composed of two states, so we can model the initial distribution as a vector with two elements: the first element is the probability of being in state $s_0$ and the second element is the probability of being in state $s_1$. Let's suppose that we start from $s_0$. The vector $\boldsymbol{v}$ representing the initial distribution will have this form:

$$\boldsymbol{v} = (1, 0)$$

We can calculate **the probability of being in a specific state after k iterations** by multiplying the initial distribution by the k-th power of the transition matrix: $\boldsymbol{v}\cdot T^k$. Let's do it in NumPy:

```python
import numpy as np

# Declaring the initial distribution
v = np.array([[1.0, 0.0]])
# Declaring the Transition Matrix T
T = np.array([[0.90, 0.10],
              [0.50, 0.50]])

# Obtaining T after 3 steps
T_3 = np.linalg.matrix_power(T, 3)
# Obtaining T after 50 steps
T_50 = np.linalg.matrix_power(T, 50)
# Obtaining T after 100 steps
T_100 = np.linalg.matrix_power(T, 100)

# Printing the state distribution after 0, 1, 3, 50 and 100 steps
print("v: " + str(v))
print("v_1: " + str(np.dot(v, T)))
print("v_3: " + str(np.dot(v, T_3)))
print("v_50: " + str(np.dot(v, T_50)))
print("v_100: " + str(np.dot(v, T_100)))
```

```
v: [[ 1.  0.]]

v_1: [[ 0.9  0.1]]

v_3: [[ 0.844  0.156]]

v_50: [[ 0.83333333  0.16666667]]

v_100: [[ 0.83333333  0.16666667]]
```

**What's going on?** The process starts in $s_0$ and after one iteration we can be 90% sure it is still in that state. This is easy to grasp: our transition model says that the process stays in $s_0$ with 90% probability, nothing new. Looking at the state distribution at $k=3$ we notice something different. We are moving into the future and different branches are possible. If we want to find the probability of being in state $s_0$ after three iterations, we have to sum the probabilities of all the branches that lead to $s_0$. A picture is worth a thousand words:

<img src="files/figures/markov_chain_tree.png" style="width: 500px;" />
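The branches of the tree can also be enumerated in code. The sketch below lists every three-step path that starts and ends in $s_0$, multiplies the transition probabilities along each path, and sums the results; the variable names are chosen only for this illustration.

```python
import itertools
import numpy as np

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])

start, target, n_steps = 0, 0, 3
total = 0.0
# Enumerate every combination of intermediate states (steps 1 and 2)
for middle in itertools.product(range(len(T)), repeat=n_steps - 1):
    path = (start,) + middle + (target,)
    # The probability of a path is the product of its one-step transitions
    prob = float(np.prod([T[a, b] for a, b in zip(path, path[1:])]))
    print(path, round(prob, 3))
    total += prob

print("total:", round(total, 3))
```

The four path probabilities printed by the loop should add up to the first entry of `v_3`.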

The probability of being in $s_0$ at $k=3$ is given by $0.729 + 0.045 + 0.045 + 0.025 = 0.844$, the same result we obtained before. Now let's suppose that at the beginning we have some uncertainty about the starting state of our process. Let's define another starting vector as follows:

$$\boldsymbol{v} = (0.5, 0.5)$$

That is, the process starts from $s_0$ or from $s_1$ with a probability of 50% each. Running the Python script again, we print the results after 1, 3, 50 and 100 iterations:

```
v: [[ 0.5  0.5]]

v_1: [[ 0.7  0.3]]

v_3: [[ 0.812  0.188]]

v_50: [[ 0.83333333  0.16666667]]

v_100: [[ 0.83333333  0.16666667]]
```

This time the probability of being in $s_0$ at $k=3$ is lower (0.812), but in the long run we have the same outcome (0.8333333). **What is happening in the long run?** The results after 50 and 100 iterations are the same, and `v_50` is equal to `v_100` no matter which starting distribution we choose. The chain has **converged to equilibrium**, meaning that as time progresses it forgets its starting distribution. But we have to be careful: convergence is not always guaranteed. The dynamics of a Markov chain can be very complex; in particular, it is possible to have **transient and recurrent states**.
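The equilibrium can also be obtained directly, without iterating. A distribution $\boldsymbol{\pi}$ that no longer changes must satisfy $\boldsymbol{\pi} = \boldsymbol{\pi} \cdot T$, so it is a left eigenvector of $T$ with eigenvalue 1. The sketch below computes it with NumPy and then shows one simple way convergence can fail. The matrix `P` is a hypothetical chain that alternates deterministically between its two states; its failure to converge is due to periodicity, which is a different phenomenon from the transient and recurrent states mentioned above.

```python
import numpy as np

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])

# Left eigenvectors of T are right eigenvectors of its transpose
eigenvalues, eigenvectors = np.linalg.eig(T.T)
index = np.argmin(np.abs(eigenvalues - 1.0))  # pick the eigenvalue closest to 1
pi = np.real(eigenvectors[:, index])
pi = pi / pi.sum()  # normalise so the entries sum to 1
print("pi:", pi)

# Hypothetical counterexample: the process always jumps to the other state
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
v = np.array([1.0, 0.0])
v_even = v @ np.linalg.matrix_power(P, 50)
v_odd = v @ np.linalg.matrix_power(P, 51)
print("k=50:", v_even)
print("k=51:", v_odd)
```

For `T` the vector `pi` matches the long-run values printed earlier, while for `P` the distribution keeps flipping between the two states and never settles.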

# Initial Edit Date

June 4, 2018

# References

[1] https://mpatacchiola.github.io/blog/2016/12/09/dissecting-reinforcement-learning.html

[2] http://setosa.io/blog/2014/07/26/markov-chains/index.html