# Bellman Equation for MRPs
In this exercise we will learn how to find the state values of a simple Markov reward process (MRP) using the SciPy library.

![MRP](../pictures/mrp.png)

In [2]:
import numpy as np
from scipy import linalg

Define the transition probability matrix

In [3]:
# define the Transition Probability Matrix
n_states = 3
P = np.zeros((n_states, n_states), dtype=float)  # np.float was removed in NumPy 1.24
P[0, 1] = 0.7
P[0, 2] = 0.3
P[1, 0] = 0.5
P[1, 2] = 0.5
P[2, 1] = 0.1
P[2, 2] = 0.9
P

array([[0. , 0.7, 0.3],
       [0.5, 0. , 0.5],
       [0. , 0.1, 0.9]])

Check that each row of P sums to 1, as it must for a transition probability matrix.

In [4]:
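# each row must sum to 1 (summing across its columns) for a valid probability matrix;
# np.allclose tolerates floating-point rounding, unlike an exact == comparison
assert np.allclose(np.sum(P, axis=1), 1.0)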
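The row-sum check can be extended to also reject negative entries, which a probability matrix must not contain. The sketch below is one possible way to do this; the helper name `is_row_stochastic` and its `atol` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def is_row_stochastic(matrix, atol=1e-8):
    """Return True if all entries are non-negative and every row sums to 1 within atol."""
    matrix = np.asarray(matrix, dtype=float)
    non_negative = bool(np.all(matrix >= 0))
    rows_sum_to_one = bool(np.allclose(matrix.sum(axis=1), 1.0, rtol=0.0, atol=atol))
    return non_negative and rows_sum_to_one

# Same transition matrix as in the cell above
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
P_ok = is_row_stochastic(P)

# A deliberately broken copy: the first row now sums to 0.9
P_bad = P.copy()
P_bad[0, 1] = 0.6
P_bad_ok = is_row_stochastic(P_bad)
```

Comparing with a tolerance rather than `==` keeps the check robust when the probabilities are not exactly representable as floats.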

We can calculate the expected immediate reward for each state using the reward matrix and the transition probability matrix.

In [5]:
# define the reward matrix
R = np.zeros((n_states, n_states), dtype=float)  # np.float was removed in NumPy 1.24
R[0, 1] = 1
R[0, 2] = 10
R[1, 0] = 0
R[1, 2] = 1
R[2, 1] = -1
R[2, 2] = 10

In [6]:
# expected reward per state: weight each transition reward by its probability, then sum over next states
R_expected = np.sum(P * R, axis=1, keepdims=True)

In [7]:
# The matrix R_expected
R_expected

array([[3.7],
       [0.5],
       [8.9]])

The R_expected vector holds the expected immediate reward for each state.
State 1 has an expected reward of 3.7, which is exactly 0.7 * 1 + 0.3 * 10.
Likewise, state 2 has 0.5 * 0 + 0.5 * 1 = 0.5 and state 3 has 0.1 * (-1) + 0.9 * 10 = 8.9.
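As a cross-check, the same numbers can be recomputed with an explicit sum over next states. This is a minimal sketch that rebuilds `P` and `R` so it runs on its own; the helper name `expected_reward` is hypothetical and not part of the original notebook.

```python
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
R = np.array([[0.0, 1.0, 10.0],
              [0.0, 0.0, 1.0],
              [0.0, -1.0, 10.0]])

def expected_reward(P, R, s):
    """Expected immediate reward of state s: sum over s_next of P[s, s_next] * R[s, s_next]."""
    return sum(P[s, s_next] * R[s, s_next] for s_next in range(P.shape[1]))

# One value per state, computed with the explicit sum
looped = np.array([expected_reward(P, R, s) for s in range(P.shape[0])])

# The vectorized form used in the notebook (without keepdims, so the shape is (3,))
vectorized = np.sum(P * R, axis=1)
```

Both arrays should agree up to floating-point rounding, which confirms that the elementwise product followed by a row sum is the per-state expectation.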

In [10]:
# define the discount factor
gamma = 0.9

We are ready to solve the Bellman equation in matrix form:

$$
(I - \gamma P)V = R_{\mathbb{E}}
$$

Casting this as a linear system, we have
$$
Ax = b
$$

Where
$$
A = (I - \gamma P)
$$
And
$$
b = R_{\mathbb{E}}
$$
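
For reference, the matrix form follows from the Bellman equation written per state. A short derivation, assuming $0 \le \gamma < 1$ and writing $R_{\mathbb{E}}(s)$ for the expected immediate reward of state $s$ and $P_{ss'}$ for the transition probability (notation introduced here for illustration):

$$
V(s) = R_{\mathbb{E}}(s) + \gamma \sum_{s'} P_{ss'} V(s')
$$

Stacking this over all states gives

$$
V = R_{\mathbb{E}} + \gamma P V \;\Rightarrow\; V - \gamma P V = R_{\mathbb{E}} \;\Rightarrow\; (I - \gamma P) V = R_{\mathbb{E}}
$$

Because every row of $P$ sums to 1, the eigenvalues of $\gamma P$ have magnitude at most $\gamma < 1$, so $I - \gamma P$ is invertible and the system has a unique solution.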

In [11]:
# Now it is possible to solve the Bellman Equation
A = np.eye(n_states) - gamma * P

In [12]:
B = R_expected

In [15]:
# solve using scipy linalg
V = linalg.solve(A, B)
V

array([[65.540732  ],
       [64.90791027],
       [77.5879575 ]])

The vector V holds the value of each state. State 3 has the highest value (77.59), so it offers the highest expected discounted return and is the best state in this MRP. Intuitively, this is because the self-transition of state 3 has high probability (0.9) and carries a high reward (+10), so the agent tends to stay in state 3 and keep collecting that reward.
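
The direct solution can be sanity-checked by iterating the Bellman equation until it stops changing. The sketch below is an illustration, not part of the original notebook: it uses `np.linalg.solve` in place of `scipy.linalg.solve` to stay NumPy-only, and the stopping tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
R_expected = np.array([[3.7], [0.5], [8.9]])
gamma = 0.9
n_states = P.shape[0]

# Fixed-point iteration: V <- R_expected + gamma * P V, starting from zeros
V_iter = np.zeros((n_states, 1))
for _ in range(10_000):
    V_next = R_expected + gamma * (P @ V_iter)
    converged = np.max(np.abs(V_next - V_iter)) < 1e-10
    V_iter = V_next
    if converged:
        break

# Direct solution of (I - gamma * P) V = R_expected
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P, R_expected)

# Residual of the linear system for the direct solution
residual = (np.eye(n_states) - gamma * P) @ V_direct - R_expected
```

The iteration converges because each update shrinks the error by at least a factor of `gamma`. For small MRPs like this one the direct solve is simpler, while the iterative form scales better to large state spaces.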