# Student MDP

![student-mdp](../pictures/student-mdp.png)

Goal: compute the state value function for an undecided student, i.e. a student who follows a uniform random policy and picks each available action with equal probability in every state.

In [1]:
import numpy as np
from scipy import linalg

In [2]:
n_states = 6
P_pi = np.zeros((n_states, n_states)) # transition matrix together with policy
R = np.zeros_like(P_pi)

State encoding:

- 0: Class 1
- 1: Class 2
- 2: Class 3
- 3: Social
- 4: Pub
- 5: Bed

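Create the transition matrix under the uniform random policy: each of the two actions available in a state is taken with probability 0.5. Bed (state 5) is terminal and only transitions to itself.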

In [3]:
P_pi[0, 1] = 0.5
P_pi[0, 3] = 0.5
P_pi[1, 2] = 0.5
P_pi[1, 5] = 0.5
P_pi[2, 4] = 0.5
P_pi[2, 5] = 0.5
P_pi[4, 5] = 0.5
P_pi[4, 0] = 0.5
P_pi[3, 0] = 0.5
P_pi[3, 3] = 0.5
P_pi[5, 5] = 1

In [4]:
P_pi

array([[0. , 0.5, 0. , 0.5, 0. , 0. ],
       [0. , 0. , 0.5, 0. , 0. , 0.5],
       [0. , 0. , 0. , 0. , 0.5, 0.5],
       [0.5, 0. , 0. , 0.5, 0. , 0. ],
       [0.5, 0. , 0. , 0. , 0. , 0.5],
       [0. , 0. , 0. , 0. , 0. , 1. ]])

Create the reward matrix:

In [5]:
R[0, 1] = -2
R[0, 3] = -1
R[1, 2] = -2
R[1, 5] = 0
R[2, 4] = 15
R[2, 5] = 10
R[4, 5] = 10
R[4, 0] = -10
R[3, 3] = -1
R[3, 0] = -3

In [6]:
R

array([[  0.,  -2.,   0.,  -1.,   0.,   0.],
       [  0.,   0.,  -2.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,  15.,  10.],
       [ -3.,   0.,   0.,  -1.,   0.,   0.],
       [-10.,   0.,   0.,   0.,   0.,  10.],
       [  0.,   0.,   0.,   0.,   0.,   0.]])

In [7]:
# check that every row of P_pi sums to 1 (np.allclose avoids exact float comparison)
assert np.allclose(P_pi.sum(axis=1), 1)

In [8]:
# expected reward for each state
R_expected = np.sum(P_pi * R, axis=1, keepdims=True)

In [9]:
R_expected

array([[-1.5],
       [-1. ],
       [12.5],
       [-2. ],
       [ 0. ],
       [ 0. ]])

The vector R_expected contains the expected immediate reward for each state.

We are now ready to solve the Bellman equation to find the value of each state.
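For reference, this is the Bellman expectation equation in matrix form, written with the quantities defined above ($P_\pi$ for `P_pi`, $R_\pi$ for `R_expected`; $I$ is the identity matrix). Because the equation is linear in $v_\pi$, it can be rearranged into a linear system:

$$
v_\pi = R_\pi + \gamma P_\pi v_\pi
\quad\Longrightarrow\quad
(I - \gamma P_\pi)\, v_\pi = R_\pi
$$

The left-hand matrix and the right-hand vector are the `A` and `B` passed to the solver. Since $P_\pi$ is a stochastic matrix, $I - \gamma P_\pi$ is invertible for any $\gamma < 1$.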

In [15]:
# Now it is possible to solve the Bellman Equation
gamma = 0.9
A = np.eye(n_states, n_states) - gamma * P_pi
B = R_expected

In [16]:
# solve using scipy linalg
V = linalg.solve(A, B)

In [17]:
V

array([[-1.78587056],
       [ 4.46226255],
       [12.13836121],
       [-5.09753046],
       [-0.80364175],
       [ 0.        ]])

This is the vector of state values.

See the image below for a graphical representation of the state values.

![student-mdp](../pictures/student-mdp-solved-gamma-09.png)
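As a sanity check, the same values should also come out of iterative policy evaluation, which repeatedly applies the Bellman expectation backup instead of solving the linear system directly. The sketch below is a minimal version of that idea; the name `V_iter`, the iteration cap and the tolerance are assumptions chosen for illustration, and the matrices are copied from the cells above.

```python
import numpy as np

# Transition matrix and expected rewards under the uniform random policy,
# copied from the cells above.
P_pi = np.array([
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R_expected = np.array([[-1.5], [-1.0], [12.5], [-2.0], [0.0], [0.0]])
gamma = 0.9

# Repeatedly apply the backup V <- R + gamma * P V until the values stop changing.
V_iter = np.zeros((6, 1))
for _ in range(10_000):  # iteration cap and tolerance are arbitrary choices
    V_next = R_expected + gamma * P_pi @ V_iter
    converged = np.max(np.abs(V_next - V_iter)) < 1e-12
    V_iter = V_next
    if converged:
        break

print(V_iter.round(4).ravel())
```

The printed values should agree with the vector `V` obtained from `linalg.solve`, up to the chosen tolerance.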

Let's see how the results change with $\gamma = 0$, i.e., a myopic random student.

In [18]:
gamma = 0.
A = np.eye(n_states, n_states) - gamma * P_pi
B = R_expected
# solve using scipy linalg
V_gamma_zero = linalg.solve(A, B)
V_gamma_zero

array([[-1.5],
       [-1. ],
       [12.5],
       [-2. ],
       [ 0. ],
       [ 0. ]])

![student-mdp](../pictures/student-mdp-solved-gamma-0.png)

As we can see, with $\gamma=0$ the value of each state is exactly equal to the expected immediate reward under the policy: `A` reduces to the identity matrix, so the solution is `R_expected` itself.

## Q function calculation

<img src="../pictures/reward_matrix.png" alt="Drawing" style="width: 400px;"/>

In [19]:
R_sa = np.zeros(((n_states-1)*2, 1))
R_sa[0] = -2 # study in state 0
R_sa[1] = -1 # social in state 0
R_sa[2] = -2 # study in state 1
R_sa[3] = 0 # sleep in state 1
R_sa[4] = 10 # sleep in state 2
R_sa[5] = 15 # beer in state 2
R_sa[6] = -1 # social in state 3 (social)
R_sa[7] = -3 # study in state 3 (social)
R_sa[8] = 10 # sleep in state 4 (pub)
R_sa[9] = -10 # study in state 4 (pub)

In [20]:
R_sa.shape

(10, 1)

The transition matrix contains the probability of landing in a given state when starting from a given state and taking a given action. Each row corresponds to a (source state, action) pair, and each column to a landing state.

![transition_matrix](../pictures/transition_matrix.png)

In [21]:
P = np.zeros(((n_states-1)*2, n_states)) # Transition Matrix (states x action, states)
P[0, 1] = 1 # study in state 0 -> state 1
P[1, 3] = 1 # social in state 0 -> state 3
P[2, 2] = 1 # study in state 1 -> state 2
P[3, 5] = 1 # sleep in state 1 -> state 5 (bed)
P[4, 5] = 1 # sleep in state 2 -> state 5 (bed)
P[5, 4] = 1 # beer in state 2 -> state 4 (pub)
P[6, 3] = 1 # social in state 3 -> state 3 (social)
P[7, 0] = 1 # study in state 3 -> state 0 (class1)
P[8, 5] = 1 # sleep in state 4 -> state 5 (bed)
P[9, 0] = 1 # study in state 4 -> state 0 (class 1)

Calculate the action value function using $\gamma=0.9$
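For reference, the relation evaluated in the next cell is the standard expression of the action value in terms of the state value, written here with the names used in this notebook ($R_{sa}$ for `R_sa`, $P$ for the state-action transition matrix, $v_\pi$ for `V`):

$$
q_\pi(s, a) = R_{sa}(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_\pi(s')
$$

In this MDP every action leads to a single next state, so the sum has only one non-zero term for each row of `P`.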

In [22]:
gamma = 0.9
Q_sa_pi = R_sa + gamma * P @ V

In [23]:
Q_sa_pi

array([[  2.01603629],
       [ -5.58777741],
       [  8.92452509],
       [  0.        ],
       [ 10.        ],
       [ 14.27672242],
       [ -5.58777741],
       [ -4.60728351],
       [ 10.        ],
       [-11.60728351]])

Q_sa_pi is the action value vector: for each state-action pair it holds the value of taking that action in that state.

The action value function is represented in the figure below. Action values are represented with $q_\pi$.

![student-mdp-q](../pictures/student-mdp-q.png)
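One way to cross-check these numbers: under the uniform random policy, the value of a state should equal the average of the action values of its two actions. The sketch below tests this with the values printed above; the name `V_from_Q` is introduced here only for illustration.

```python
import numpy as np

# Values copied from the outputs above (gamma = 0.9).
V = np.array([-1.78587056, 4.46226255, 12.13836121,
              -5.09753046, -0.80364175, 0.0])
Q_sa_pi = np.array([2.01603629, -5.58777741, 8.92452509, 0.0, 10.0,
                    14.27672242, -5.58777741, -4.60728351, 10.0, -11.60728351])

# Two actions per non-terminal state, each chosen with probability 0.5,
# so the policy-weighted sum of action values is a plain mean per state.
V_from_Q = Q_sa_pi.reshape(-1, 2).mean(axis=1)

# Compare with the state values of the five non-terminal states.
print(np.allclose(V_from_Q, V[:5], atol=1e-6))
```

The terminal state Bed is left out of the comparison because no actions are defined for it in `R_sa`.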

In [24]:
# reshape the column vector into a matrix with shape (n_states - 1, n_actions): one row per non-terminal state
n_actions = 2
Q_sa_pi2 = np.reshape(Q_sa_pi, (-1, n_actions))
Q_sa_pi2

array([[  2.01603629,  -5.58777741],
       [  8.92452509,   0.        ],
       [ 10.        ,  14.27672242],
       [ -5.58777741,  -4.60728351],
       [ 10.        , -11.60728351]])

With this layout, taking the argmax along each row gives the index of the best action in each state.

In [25]:
best_actions = np.reshape(np.argmax(Q_sa_pi2, -1), (-1, 1))
best_actions

array([[0],
       [0],
       [1],
       [1],
       [0]])

![student-mdp-best-actions](../pictures/student-mdp-q-best-actions.png)

In the image, the green arrows are the best actions in each state. We can easily find them by looking at the action maximizing the q function in each state.
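Because the order of the two actions differs from state to state, the raw argmax indices are easier to read once mapped back to action names. The sketch below does this using the action order given in the comments of the `R_sa` cell; the names `action_names`, `state_names` and `best_by_name` are introduced here only for illustration.

```python
import numpy as np

# Action labels per state, in the order used when filling R_sa above.
action_names = [
    ["study", "social"],  # state 0: Class 1
    ["study", "sleep"],   # state 1: Class 2
    ["sleep", "beer"],    # state 2: Class 3
    ["social", "study"],  # state 3: Social
    ["sleep", "study"],   # state 4: Pub
]
state_names = ["Class 1", "Class 2", "Class 3", "Social", "Pub"]

# Argmax indices copied from the best_actions output above.
best_actions = np.array([0, 0, 1, 1, 0])

# Look up the name of the selected action for each non-terminal state.
best_by_name = {
    state: names[idx]
    for state, names, idx in zip(state_names, action_names, best_actions)
}
for state, action in best_by_name.items():
    print(f"{state}: {action}")
```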

From the action value calculation we can see that when $\gamma=0$ the term `gamma * P @ V` vanishes, so the action value function is equal to the expected immediate reward `R_sa`.

In [26]:
Q_sa_pi_gamma_zero = R_sa
Q_sa_pi_gamma_zero

array([[ -2.],
       [ -1.],
       [ -2.],
       [  0.],
       [ 10.],
       [ 15.],
       [ -1.],
       [ -3.],
       [ 10.],
       [-10.]])

In [27]:
n_actions = 2
Q_sa_pi_gamma_zero2 = np.reshape(Q_sa_pi_gamma_zero, (-1, n_actions))
Q_sa_pi_gamma_zero2

array([[ -2.,  -1.],
       [ -2.,   0.],
       [ 10.,  15.],
       [ -1.,  -3.],
       [ 10., -10.]])

In [28]:
best_actions_gamma_zero = np.reshape(np.argmax(Q_sa_pi_gamma_zero2, -1), (-1, 1))
best_actions_gamma_zero

array([[1],
       [1],
       [1],
       [0],
       [0]])

The result is visualized in the following figure.

![student-mdp-best-actions](../pictures/student-mdp-q-best-actions-gamma-zero.png)

It is interesting to notice how the best actions change when only the discount factor is modified. Here, the best action the agent can take starting from Class 1 is "Social", as it provides a bigger immediate reward than the action "Study". The action "Social" brings the agent to the state "Social". There, the best the agent can do is to repeat the action "Social", accumulating negative rewards.
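To explore this dependence on the discount factor, the whole pipeline can be wrapped in a function of $\gamma$. The sketch below is one possible way to do that, not the notebook's own code: the helper `greedy_actions` and the list `next_state` are hypothetical names, `np.linalg.solve` is used in place of `scipy.linalg.solve`, and the random-policy matrices are derived by averaging the two action rows of each state. The returned indices are greedy with respect to the action values of the random policy, as in the cells above.

```python
import numpy as np

def greedy_actions(gamma):
    """Greedy action index per non-terminal state w.r.t. q_pi of the random policy."""
    n_states = 6
    # Next state and reward for each (state, action) row, in the R_sa / P row order.
    next_state = [1, 3, 2, 5, 5, 4, 3, 0, 5, 0]
    R_sa = np.array([-2, -1, -2, 0, 10, 15, -1, -3, 10, -10], dtype=float)

    # Deterministic state-action transition matrix, shape (10, 6).
    P = np.zeros((len(next_state), n_states))
    P[np.arange(len(next_state)), next_state] = 1.0

    # Uniform random policy: average the two action rows of each state.
    P_pi = np.zeros((n_states, n_states))
    P_pi[:5] = P.reshape(5, 2, n_states).mean(axis=1)
    P_pi[5, 5] = 1.0  # Bed is absorbing
    R_pi = np.zeros(n_states)
    R_pi[:5] = R_sa.reshape(5, 2).mean(axis=1)

    # State values from the linear system, then action values and their argmax.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    Q = R_sa + gamma * P @ V
    return np.argmax(Q.reshape(5, 2), axis=1)

for g in (0.0, 0.5, 0.9):
    print(g, greedy_actions(g))
```

The sweep stops short of $\gamma = 1$ because the linear system becomes singular there: the absorbing Bed state makes $I - P_\pi$ non-invertible.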