# Unit 05 - Homework 05: Q-Value iteration



Consider a Markov Decision Process (MDP) with 6 states $s \in \{0,1,2,3,4,5\}$ and 2 actions $a \in \{C,M\}$, defined by the following transition probability functions.

For states 1, 2, and 3:

$T(s,M,s-1)=1$

$T(s,C,s+2)=0.7$

$T(s,C,s)=0.3$


For state 0:

$T(s,M,s)=1$

$T(s,C,s)=1$


For states 4 and 5:

$T(s,M,s-1)=1$

$T(s,C,s)=1$


Note that all transition probabilities not defined by the above are equal to 0.
The rewards R are defined by:

$R(s,a,s') = |s'-s|^{1/3}$ $\forall s \neq s'$,

and $R(s,a,s) = (s+4)^{-1/2}$ $\forall s \neq 0$.

$R(0,M,0) = R(0,C,0) = 0$. Also, the discount factor is $\gamma = 0.6$.

We initialize $Q_0(s,a) = 0$ $\forall s \in \{0,1,2,3,4,5\}$ and $\forall a \in \{C,M\}$.
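
For reference, the update that Q-value iteration applies at each step can be written in the notation above as follows (the iteration index $k$ and the primed action $a'$ are notation introduced here, not part of the problem statement):

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]$$

As a worked example, since $Q_0 = 0$ the first update for state 1 and action $C$ reduces to the expected immediate reward:

$$Q_1(1,C) = 0.7 \cdot |3-1|^{1/3} + 0.3 \cdot (1+4)^{-1/2} \approx 0.7 \cdot 1.2599 + 0.3 \cdot 0.4472 \approx 1.0161$$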


In [1]:
import numpy as np

In [2]:
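# define reward function R(s, a, s'); the reward does not depend on the action
def reward(state_1, state_2):
    if state_1 == 0 and state_2 == 0:
        # staying in state 0 yields no reward
        return 0
    elif state_1 != state_2:
        # moving between states: |s' - s|^(1/3)
        return np.abs(state_2 - state_1) ** (1 / 3)
    else:
        # staying in a state s != 0: (s + 4)^(-1/2)
        return (state_1 + 4) ** (-1 / 2)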

In [3]:
# define T-matrix (transition probabilities); axes: 0 = start state, 1 = action, 2 = target state
# indices for actions: 0 = "C", 1 = "M"
T = np.zeros((6,2,6))

# set values for states 1,2,3:
for i in [1,2,3]:
    T[i,0,i+2] = 0.7
    T[i,0,i] = 0.3
    T[i,1,i-1] = 1

# set values for state 0:
T[0,0,0] = 1
T[0,1,0] = 1

#set values for states 4,5:
for i in [4,5]:
    T[i,1,i-1] = 1
    T[i,0,i] = 1
    
T

array([[[1. , 0. , 0. , 0. , 0. , 0. ],
        [1. , 0. , 0. , 0. , 0. , 0. ]],

       [[0. , 0.3, 0. , 0.7, 0. , 0. ],
        [1. , 0. , 0. , 0. , 0. , 0. ]],

       [[0. , 0. , 0.3, 0. , 0.7, 0. ],
        [0. , 1. , 0. , 0. , 0. , 0. ]],

       [[0. , 0. , 0. , 0.3, 0. , 0.7],
        [0. , 0. , 1. , 0. , 0. , 0. ]],

       [[0. , 0. , 0. , 0. , 1. , 0. ],
        [0. , 0. , 0. , 1. , 0. , 0. ]],

       [[0. , 0. , 0. , 0. , 0. , 1. ],
        [0. , 0. , 0. , 0. , 1. , 0. ]]])
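
Since every `T[s, a, :]` should be a probability distribution over target states, a quick sanity check can catch indexing mistakes early. The following is a minimal sketch, not part of the original notebook: the helper name `is_valid_transition_tensor` and its tolerance `tol` are assumptions chosen for illustration, and `T` is rebuilt here so the snippet runs on its own.

```python
import numpy as np

# Rebuild the transition tensor as defined in the notebook
# (axes: start state, action, target state; action 0 = "C", 1 = "M").
T = np.zeros((6, 2, 6))
for i in [1, 2, 3]:
    T[i, 0, i + 2] = 0.7
    T[i, 0, i] = 0.3
    T[i, 1, i - 1] = 1
T[0, 0, 0] = 1
T[0, 1, 0] = 1
for i in [4, 5]:
    T[i, 1, i - 1] = 1
    T[i, 0, i] = 1


# Hypothetical helper (an assumption, not from the original notebook):
# checks that T[s, a, :] is a probability distribution for every (s, a).
def is_valid_transition_tensor(T, tol=1e-12):
    non_negative = np.all(T >= 0)
    rows_sum_to_one = np.allclose(T.sum(axis=2), 1.0, atol=tol)
    return bool(non_negative and rows_sum_to_one)


print(is_valid_transition_tensor(T))
```

Using `np.allclose` rather than an exact comparison avoids false alarms from floating-point rounding in sums such as `0.7 + 0.3`.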

In [4]:
# define reward matrix
R = np.zeros((6,2,6))

for s_start in range(len(R)):
    for a in range(R.shape[1]):
        for s_target in range(R.shape[2]):
            R[s_start, a, s_target] = reward(s_start, s_target)
R

array([[[0.        , 1.        , 1.25992105, 1.44224957, 1.58740105,
         1.70997595],
        [0.        , 1.        , 1.25992105, 1.44224957, 1.58740105,
         1.70997595]],

       [[1.        , 0.4472136 , 1.        , 1.25992105, 1.44224957,
         1.58740105],
        [1.        , 0.4472136 , 1.        , 1.25992105, 1.44224957,
         1.58740105]],

       [[1.25992105, 1.        , 0.40824829, 1.        , 1.25992105,
         1.44224957],
        [1.25992105, 1.        , 0.40824829, 1.        , 1.25992105,
         1.44224957]],

       [[1.44224957, 1.25992105, 1.        , 0.37796447, 1.        ,
         1.25992105],
        [1.44224957, 1.25992105, 1.        , 0.37796447, 1.        ,
         1.25992105]],

       [[1.58740105, 1.44224957, 1.25992105, 1.        , 0.35355339,
         1.        ],
        [1.58740105, 1.44224957, 1.25992105, 1.        , 0.35355339,
         1.        ]],

       [[1.70997595, 1.58740105, 1.44224957, 1.25992105, 1.        ,
         0.

In [7]:
# define gamma and initialization
gamma = 0.6
Q_0 = np.zeros((6, 2))
V_0 = np.zeros(6,)

The following code runs the Q-value iteration algorithm and prints the Q-values after a given number of iterations. Each row represents a state; column 0 holds the value of action "C" and column 1 the value of action "M".

## TODO:
- Implement a convergence criterion
- Use matplotlib for better visualisation

In [10]:
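# run Q-value iteration
num_iter = 1
Q = Q_0
V = V_0
for i in range(num_iter):
    # np.max(Q, axis=1) has shape (6,) and broadcasts over the target-state axis,
    # so each entry becomes R(s, a, s') + gamma * max_a' Q(s', a');
    # summing over axis 2 weights these terms by T(s, a, s')
    Q = np.sum(T * (R + gamma * np.max(Q, axis=1)), axis=2)
    V = np.max(Q, axis=1)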

print("Q ( iter ", num_iter, "): \n", Q)
print("V: \n", V)

Q ( iter  1 ): 
 [[0.         0.        ]
 [1.01610881 1.        ]
 [1.00441922 1.        ]
 [0.99533408 1.        ]
 [0.35355339 1.        ]
 [0.33333333 1.        ]]
V: 
 [0.         1.01610881 1.00441922 1.         1.         1.        ]
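
One possible way to address the convergence item from the TODO list is to stop once the largest change in any Q-value falls below a threshold. The sketch below is only an illustration under stated assumptions: the function names `build_mdp` and `q_value_iteration`, the tolerance `tol=1e-8`, and the cap `max_iter=1000` are choices made here and do not come from the original notebook. It rebuilds `T` and `R` so that it runs on its own.

```python
import numpy as np


def build_mdp():
    """Rebuild T and R as defined in the notebook (action 0 = "C", 1 = "M")."""
    T = np.zeros((6, 2, 6))
    R = np.zeros((6, 2, 6))
    for s in [1, 2, 3]:
        T[s, 0, s + 2] = 0.7
        T[s, 0, s] = 0.3
        T[s, 1, s - 1] = 1
    T[0, :, 0] = 1
    for s in [4, 5]:
        T[s, 1, s - 1] = 1
        T[s, 0, s] = 1
    for s in range(6):
        for s_next in range(6):
            if s != s_next:
                R[s, :, s_next] = abs(s_next - s) ** (1 / 3)
            elif s != 0:
                R[s, :, s_next] = (s + 4) ** (-1 / 2)
    return T, R


def q_value_iteration(T, R, gamma=0.6, tol=1e-8, max_iter=1000):
    """Iterate until the largest change in Q is below tol (assumed threshold)."""
    Q = np.zeros(T.shape[:2])
    for k in range(1, max_iter + 1):
        # Same vectorised update as in the notebook cell
        Q_new = np.sum(T * (R + gamma * np.max(Q, axis=1)), axis=2)
        delta = np.max(np.abs(Q_new - Q))
        Q = Q_new
        if delta < tol:
            break
    return Q, k


T, R = build_mdp()
Q, n_iter = q_value_iteration(T, R)
print("stopped after", n_iter, "iterations")
print(Q)
```

Because the update is a contraction with factor $\gamma = 0.6$, the change between successive iterates shrinks geometrically, so the loop should terminate well before `max_iter` for this tolerance.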
