<h1 style='text-align:center'>Value iteration algorithm</h1>

This notebook implements the value iteration algorithm for the following simple Markov decision process (MDP):

![MDP Graph](pictures/MDPgraph.PNG)

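Only $S_0$ offers a choice of action ($a_1$ or $a_2$); every other state has the single action $a_0$. This transition model therefore admits two deterministic policies:  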

$\pi_1$:  
$S_0\mapsto a_1$  
$S_1\mapsto a_0$  
$S_2\mapsto a_0$  
$S_3\mapsto a_0$  

$\pi_2$:  
$S_0\mapsto a_2$  
$S_1\mapsto a_0$  
$S_2\mapsto a_0$  
$S_3\mapsto a_0$
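
The two policies can also be enumerated programmatically. The sketch below assumes the action sets read off the graph ($S_0$: $a_1$ or $a_2$; all other states: $a_0$ only). The `available_actions` dictionary is a hypothetical encoding introduced for illustration only.

```python
from itertools import product

# Assumed action sets per state, read off the MDP graph (hypothetical encoding):
# state index -> list of available action indices.
available_actions = {0: [1, 2], 1: [0], 2: [0], 3: [0]}

states = sorted(available_actions)

# A deterministic policy picks exactly one available action in every state,
# so the set of policies is the Cartesian product of the action sets.
policies = [
    dict(zip(states, choice))
    for choice in product(*(available_actions[s] for s in states))
]

for i, policy in enumerate(policies, start=1):
    print(f"pi_{i}: {policy}")
```

Because only $S_0$ has more than one action, the product contains exactly two policies.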

The optimal value function for a state $S$ satisfies the Bellman optimality equation:

<p style="text-align:center">$V^*(S)= R(S) + \gamma\:\underset{a}{\max}\sum_{S'}T(S, a, S')V^*(S')$</p>

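Applying this formula to $S_0$, with one row per action $a_0$, $a_1$, $a_2$:  
$V^*(S_0) = R(S_0) + \gamma\:\max
\begin{bmatrix}
T(S_0, a_0, S_0)V^*(S_0) + T(S_0, a_0, S_1)V^*(S_1) + T(S_0, a_0, S_2)V^*(S_2) + T(S_0, a_0, S_3)V^*(S_3)
\\ 
T(S_0, a_1, S_0)V^*(S_0) + T(S_0, a_1, S_1)V^*(S_1) + T(S_0, a_1, S_2)V^*(S_2) + T(S_0, a_1, S_3)V^*(S_3)
\\ 
T(S_0, a_2, S_0)V^*(S_0) + T(S_0, a_2, S_1)V^*(S_1) + T(S_0, a_2, S_2)V^*(S_2) + T(S_0, a_2, S_3)V^*(S_3)
\end{bmatrix}$  
  
$= \gamma\max[V^*(S_1), V^*(S_2)]$

The simplification uses $R(S_0) = 0$, $T(S_0, a_1, S_1) = 1$ and $T(S_0, a_2, S_2) = 1$. Action $a_0$ has no transition out of $S_0$, so its row is zero and never exceeds the other two, because all rewards (and hence all values) are non-negative.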

In the same way we find:  
  
$V^*(S_1) = \gamma[(1-x)V^*(S_1) + xV^*(S_3)]$  
  
$V^*(S_2) = 1 + \gamma[(1-y)V^*(S_0) + yV^*(S_3)]$  
  
$V^*(S_3) = 10 + \gamma V^*(S_0)$
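
For a fixed policy the maximum disappears and the four equations become linear, so each policy can be evaluated exactly. The following is a minimal sketch, assuming $x = y = 0.25$ and $\gamma = 0.9$ (the values used in the implementation further down). The helper `evaluate_policy` is a hypothetical name, not part of the original notebook.

```python
import numpy as np

x, y, gamma = 0.25, 0.25, 0.9          # assumed parameter values
R = np.array([0.0, 0.0, 1.0, 10.0])    # rewards R(S_0) .. R(S_3)

def evaluate_policy(s0_next):
    """Solve (I - gamma * T_pi) V = R for the policy that sends S0 to `s0_next`.

    `s0_next` is 1 for pi_1 (action a1) or 2 for pi_2 (action a2).
    """
    T_pi = np.zeros((4, 4))
    T_pi[0, s0_next] = 1.0                 # the only choice the policy makes
    T_pi[1, 1], T_pi[1, 3] = 1 - x, x      # S1 under a0
    T_pi[2, 0], T_pi[2, 3] = 1 - y, y      # S2 under a0
    T_pi[3, 0] = 1.0                       # S3 under a0
    return np.linalg.solve(np.eye(4) - gamma * T_pi, R)

V_pi1 = evaluate_policy(1)
V_pi2 = evaluate_policy(2)
print("V under pi_1:", V_pi1)
print("V under pi_2:", V_pi2)
```

Under these assumptions $\pi_1$ gives the larger value in $S_0$, so its value function is the one value iteration is expected to converge to.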

The implementation below uses $x = y = 0.25$ and $\gamma = 0.9$:

import numpy as np

# Markov decision process parameters
x = 0.25
y = 0.25

# Value function parameter
gamma = 0.9

# Values and policies for each state
Values = np.zeros(4)
Policies = np.zeros(4)

# Transition matrix for a0
T_a0 = np.zeros((4,4))
T_a0[1][1] = 1 - x
T_a0[1][3] = x
T_a0[2][0] = 1 - y
T_a0[2][3] = y
T_a0[3][0] = 1

# Transition matrix for a1
T_a1 = np.zeros((4,4))
T_a1[0][1] = 1

# Transition matrix for a2
T_a2 = np.zeros((4,4))
T_a2[0][2] = 1

# Rewards for each state
Rewards = np.zeros(4)
Rewards[3] = 10
Rewards[2] = 1

# Initialisation of a difference vector to detect values convergence
values_dif = np.array([1, 1, 1, 1])

# Repeat the Bellman update until two successive sweeps give exactly the same values
while np.any(values_dif != 0):
    Values_prev = Values
    new_values = []
    new_policies = []
    
    for state_i in range(4):
        # Expected next-state value for each action a0, a1, a2
        action_values = [np.dot(T_a0[state_i], Values),
                         np.dot(T_a1[state_i], Values),
                         np.dot(T_a2[state_i], Values)]

        # Bellman update and greedy action for this state
        new_values.append(Rewards[state_i] + gamma * max(action_values))
        new_policies.append(np.argmax(action_values))

    values_dif = np.array(new_values) - Values_prev
    Policies = new_policies
    Values = new_values
    
print('Optimal value for each state : {}'.format(Values))
print('Optimal policy for each state : {}'.format(Policies))

Optimal value for each state : [14.18563922942206, 15.761821366024511, 15.697898423817858, 22.767075306479853]
Optimal policy for each state : [1, 0, 0, 0]
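
The loop above stops only when two successive sweeps are exactly equal, which relies on the floating-point iterates becoming stationary. A common alternative is to stop once the largest change falls below a tolerance. The following is a minimal vectorised sketch of that variant, using the same transition model and parameters. The tolerance `tol` and the stacked array `T` are assumptions introduced here, not part of the original code.

```python
import numpy as np

x, y, gamma = 0.25, 0.25, 0.9
tol = 1e-10                      # assumed stopping threshold

# T[a, s, s'] stacks the three transition matrices T_a0, T_a1, T_a2.
T = np.zeros((3, 4, 4))
T[0, 1, 1], T[0, 1, 3] = 1 - x, x
T[0, 2, 0], T[0, 2, 3] = 1 - y, y
T[0, 3, 0] = 1.0
T[1, 0, 1] = 1.0
T[2, 0, 2] = 1.0

R = np.array([0.0, 0.0, 1.0, 10.0])
V = np.zeros(4)

while True:
    Q = T @ V                              # shape (3, 4): expected next value per (action, state)
    V_new = R + gamma * Q.max(axis=0)      # Bellman update for all states at once
    converged = np.max(np.abs(V_new - V)) < tol
    V = V_new
    if converged:
        break

policy = (T @ V).argmax(axis=0)            # greedy action index per state
print("Values:", V)
print("Policy:", policy)
```

With $\gamma = 0.9$ the update is a contraction, so this loop terminates, and the result should agree with the values printed above up to roughly the chosen tolerance.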
