## n-step behavior in the grid world

In many RL algorithms, the core idea is to reach consistency between our estimate of the environment in its current state and our estimate after n transition steps, and to iterate until that consistency holds. It is therefore important to build a solid intuition for how an environment modeled as a Markov chain evolves over time. To this end, we will look at the n-step behavior of the grid world example.
![](img/robot_markov_chain.png)
State coordinates. States/cells are indexed so that (0,0): 1, (0,1): 2, ..., (2,2): 9.

Let's start by creating a 3×3 grid world with our robot in it.

In [1]:
import numpy as np
m = 3
m2 = m ** 2
q = np.zeros(m2)
q[m2 // 2] = 1
q # initial probability distribution with the robot being at the center

array([0., 0., 0., 0., 1., 0., 0., 0., 0.])
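The figure caption numbers the cells from 1, while the NumPy arrays in the code are indexed from 0, so the cell labeled k in the figure is entry k - 1 of `q`. The following is a minimal sketch of that mapping, assuming row-major ordering; the helper names `to_coords` and `to_index` are introduced here for illustration only.

```python
m = 3  # side length of the grid, as in the cell above


def to_coords(i, m=m):
    """Map a 0-based state index to its (row, col) cell, row-major."""
    return divmod(i, m)


def to_index(row, col, m=m):
    """Map a (row, col) cell back to its 0-based state index."""
    return row * m + col


center = to_index(1, 1)
# 0-based index 4 is cell (1, 1), which the figure labels as state 5
print(center, to_coords(center))  # prints: 4 (1, 1)
```

This is why `q[m2 // 2] = 1` places the robot at the center cell.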

In [2]:
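# Build the transition probability matrix of an m×m grid world.
# The matrix has shape (m*m, m*m); P[i, j] is the probability of moving from state i to state j,
# given the probabilities of going up, down, left, and right.
# "Down" decreases the row index, and a move into a wall leaves the robot in its current cell.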
def get_P(m, p_up, p_down, p_left, p_right):
    m2 = m ** 2
    P = np.zeros((m2, m2))
    ix_map = {i + 1: (i // m, i % m) for i in range(m2)}
    for i in range(m2):
        for j in range(m2):
            r1, c1 = ix_map[i + 1]
            r2, c2 = ix_map[j + 1]
            rdiff = r1 - r2
            cdiff = c1 - c2
            if rdiff == 0:
                if cdiff == 1:
                    P[i, j] = p_left
                elif cdiff == -1:
                    P[i, j] = p_right
                elif cdiff == 0:
                    if r1 == 0:
                        P[i, j] += p_down
                    elif r1 == m - 1:
                        P[i, j] += p_up
                    if c1 == 0:
                        P[i, j] += p_left
                    elif c1 == m - 1:
                        P[i, j] += p_right
            elif rdiff == 1:
                if cdiff == 0:
                    P[i, j] = p_down
            elif rdiff == -1:
                if cdiff == 0:
                    P[i, j] = p_up
    return P

In [4]:
P = get_P(3, 0.2, 0.3, 0.25, 0.25)
P

array([[0.55, 0.25, 0.  , 0.2 , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.25, 0.3 , 0.25, 0.  , 0.2 , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.25, 0.55, 0.  , 0.  , 0.2 , 0.  , 0.  , 0.  ],
       [0.3 , 0.  , 0.  , 0.25, 0.25, 0.  , 0.2 , 0.  , 0.  ],
       [0.  , 0.3 , 0.  , 0.25, 0.  , 0.25, 0.  , 0.2 , 0.  ],
       [0.  , 0.  , 0.3 , 0.  , 0.25, 0.25, 0.  , 0.  , 0.2 ],
       [0.  , 0.  , 0.  , 0.3 , 0.  , 0.  , 0.45, 0.25, 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.3 , 0.  , 0.25, 0.2 , 0.25],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.3 , 0.  , 0.25, 0.45]])
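A quick sanity check on a transition matrix is that every entry is non-negative and every row sums to 1, since each row is the distribution over next states. The sketch below is not part of the original notebook; it copies the matrix printed above into a literal so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np

# Transition matrix copied from the output printed above
P = np.array([
    [0.55, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.25, 0.30, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.55, 0.00, 0.00, 0.20, 0.00, 0.00, 0.00],
    [0.30, 0.00, 0.00, 0.25, 0.25, 0.00, 0.20, 0.00, 0.00],
    [0.00, 0.30, 0.00, 0.25, 0.00, 0.25, 0.00, 0.20, 0.00],
    [0.00, 0.00, 0.30, 0.00, 0.25, 0.25, 0.00, 0.00, 0.20],
    [0.00, 0.00, 0.00, 0.30, 0.00, 0.00, 0.45, 0.25, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.20, 0.25],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.45],
])

row_sums = P.sum(axis=1)
# Row-stochastic: non-negative entries and each row summing to 1
is_stochastic = bool(np.all(P >= 0) and np.allclose(row_sums, 1.0))
print(is_stochastic)  # prints: True
```

Note how the corner and edge cells carry a positive probability on the diagonal: that is the mass of the moves blocked by a wall.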

In [7]:
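# Calculate the n-step probabilities q P^n for n = 1, 10, and 100.
# A notebook cell displays only its last expression, so the output
# below is the distribution for n = 100.
# n=1
n = 1
Pn = np.linalg.matrix_power(P, n)
np.matmul(q, Pn)
# n=10
n = 10
Pn = np.linalg.matrix_power(P, n)
np.matmul(q, Pn)
# n=100
n = 100
Pn = np.linalg.matrix_power(P, n)
np.matmul(q, Pn)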

array([0.15789474, 0.15789474, 0.15789474, 0.10526316, 0.10526316,
       0.10526316, 0.07017544, 0.07017544, 0.07017544])

The probability distributions after 10 steps and after 100 steps are very similar, because the system has almost reached its steady state after a small number of steps. The chance of finding the robot in a specific state is therefore almost the same after 10, 100, or 1,000 steps. Notice also that we are more likely to find the robot in the bottom cells (the first three entries of the array), simply because we have p_down > p_up.
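The steady state can also be computed directly instead of raising P to a high power: it is the distribution π that satisfies π P = π, that is, a left eigenvector of P for eigenvalue 1, normalized to sum to 1. The sketch below is not part of the original notebook. It copies the matrix printed earlier, and the names `pi` and `tv`, as well as the choice of total variation distance to measure closeness, are assumptions made for illustration.

```python
import numpy as np

# Transition matrix copied from the output printed earlier
P = np.array([
    [0.55, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.25, 0.30, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.55, 0.00, 0.00, 0.20, 0.00, 0.00, 0.00],
    [0.30, 0.00, 0.00, 0.25, 0.25, 0.00, 0.20, 0.00, 0.00],
    [0.00, 0.30, 0.00, 0.25, 0.00, 0.25, 0.00, 0.20, 0.00],
    [0.00, 0.00, 0.30, 0.00, 0.25, 0.25, 0.00, 0.00, 0.20],
    [0.00, 0.00, 0.00, 0.30, 0.00, 0.00, 0.45, 0.25, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.20, 0.25],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.45],
])

# Left eigenvectors of P are right eigenvectors of P.T
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))  # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                    # normalize (also fixes the sign)

# Start at the center cell and track the distance to the steady state
q = np.zeros(9)
q[4] = 1.0
tv = {}
for n in (1, 10, 100):
    qn = q @ np.linalg.matrix_power(P, n)
    tv[n] = 0.5 * np.abs(qn - pi).sum()  # total variation distance
    print(n, tv[n])

print(pi)
```

The printed distances shrink as n grows, which is the convergence described above, and `pi` agrees with the distribution obtained for n = 100.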

### Example – a sample path in an ergodic Markov chain

If the Markov chain is ergodic, we can simply simulate it for a long time once and estimate
the steady state distribution of the states through the frequency of visits. This is especially
useful if we don't have access to the transition probabilities of the system, but we can
simulate it.

In [9]:
s = 4 # initial state: the center cell (0-based index)
n = 10 ** 6  # number of steps
visited = [s]
# simulate the env
for t in range(n):
    s = np.random.choice(m2, p=P[s, :])
    visited.append(s)

In [10]:
# scipy.stats.itemfreq is deprecated and absent from recent SciPy releases;
# np.unique(..., return_counts=True) yields the same (state, count) pairs
np.column_stack(np.unique(visited, return_counts=True))


array([[     0, 158430],
       [     1, 158452],
       [     2, 157387],
       [     3, 105867],
       [     4, 105292],
       [     5, 104527],
       [     6,  69955],
       [     7,  70141],
       [     8,  69950]])

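The results are indeed very much in line with the steady state probability distribution we
calculated: dividing the counts by the number of visited states gives frequencies of roughly
0.158, 0.106, and 0.070 for the states in the bottom, middle, and top rows, respectively.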
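To make the comparison explicit, the visit counts can be turned into frequencies and set against the steady state distribution. The following is a minimal sketch rather than the notebook's own code: it uses a shorter run (10^5 steps) and a seeded `np.random.default_rng` generator for reproducibility, both of which are assumptions, and it copies the transition matrix and the n = 100 distribution from the outputs shown earlier.

```python
import numpy as np

# Transition matrix copied from the output printed earlier
P = np.array([
    [0.55, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.25, 0.30, 0.25, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.55, 0.00, 0.00, 0.20, 0.00, 0.00, 0.00],
    [0.30, 0.00, 0.00, 0.25, 0.25, 0.00, 0.20, 0.00, 0.00],
    [0.00, 0.30, 0.00, 0.25, 0.00, 0.25, 0.00, 0.20, 0.00],
    [0.00, 0.00, 0.30, 0.00, 0.25, 0.25, 0.00, 0.00, 0.20],
    [0.00, 0.00, 0.00, 0.30, 0.00, 0.00, 0.45, 0.25, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.20, 0.25],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.00, 0.25, 0.45],
])
# Distribution printed earlier for n = 100
steady = np.array([0.15789474, 0.15789474, 0.15789474,
                   0.10526316, 0.10526316, 0.10526316,
                   0.07017544, 0.07017544, 0.07017544])

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)
n = 100_000                     # shorter than the 10**6 steps used above
s = 4                           # start at the center cell
counts = np.zeros(9, dtype=int)
counts[s] += 1
for _ in range(n):
    s = rng.choice(9, p=P[s])   # sample the next state from row s of P
    counts[s] += 1

freq = counts / counts.sum()    # empirical visit frequencies
max_gap = np.abs(freq - steady).max()
print(np.round(freq, 3))
print(max_gap)
```

The largest gap between the empirical frequencies and the steady state values is small, and it tends to shrink further as the simulation gets longer.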