# Markov Decision Process using Value Iteration

In this notebook, a simple dice-rolling game is modeled as a Markov Decision Process (MDP).  The mdptoolbox framework is used to
solve the game, with Value Iteration finding the optimal policy and the expected value of the given initial state.

### Imports
We will use mdptoolbox, a Python package for solving MDP problems, along with numpy, a library that provides
abstractions for matrix and array manipulation.

In [1]:
import mdptoolbox
import numpy as np

### Simple Coin Toss Example

In a simple coin toss game, you can choose to flip the coin or leave the game and take your earnings.  If you decide to
flip, you gain a reward of 1 on heads, and on tails you lose all of your earnings.

The figure below illustrates the states and transitions for this Markov decision process.

![markov example](./markov.png)

#### Initialization

In [2]:
# the number of sides of the dice/coin
n_sides = 2

# the number of runs to simulate
n_runs = 2

# the number of actions
n_actions = 2

# beginning and ending states
n_initial_terminal_states = 2

# the number of total states: the "- n_sides" term is there because two of the states
# end up in the terminal state, so we can subtract them from the number of overall states
n_states = n_runs * n_sides + n_initial_terminal_states  - n_sides

# the boolean mask indicating the sides on which you lose money
isBadSide = np.array([0, 0])
isGoodSide = [not i for i in isBadSide]

# the array which contains the values of the die
die = np.array([1,1])

# the total earnings given a die roll
earnings = die * isGoodSide  # [1, 1]

# Calculate probability for Input:
probability_dice = 1.0 / n_sides

[1 1]


#### Actions and Transitions
The transition matrix represents all of the possible transitions between all of the states.

There are two actions, so the first dimension of the transition matrix has size 2.  For each action, there is a probability of
transitioning between each pair of states.  Thus the transition matrix has shape (n_actions, n_states, n_states).

In [None]:
transition_matrix = np.zeros([n_actions, n_states, n_states])

print(transition_matrix)

In [1]:

# the number of sides of the dice
n_sides = 6

# the number of runs to simulate
n_runs = 2

# the number of actions
n_actions = 2

# beginning and ending states
n_initial_terminal_states = 2

# the number of total states
n_states = n_runs * n_sides + n_initial_terminal_states  # from 0 to 2N, plus quit

# the boolean mask indicating the sides on which you lose money
isBadSide = np.array([1, 1, 1, 0, 0, 0])
isGoodSide = [not i for i in isBadSide]

# the array which contains the values of the die
die = np.arange(1, n_sides + 1)  # [1, 2, 3, 4, 5, 6]

# the total earnings given a die roll
earnings = die * isGoodSide  # [0, 0, 0, 4, 5, 6]

# Calculate probability for Input:
probability_dice = 1.0 / n_sides

(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
(2.5833333333333335, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)


There are two possible actions: do not roll the dice and take the money (action 0), or roll the dice (action 1).

The transition probabilities are modeled as a 3D array.  The first index gives which action is taken.
Let T be the transition array; then T[0] is the matrix for the first action, T[1] is the matrix for the second action, etc.

For each action in T, a matrix gives the probability of transitioning from one state to another.  The row
indicates the starting state and the column gives the ending state.  The general form is T[action][start][end]: the first
index is the action, the second index is the initial state, and the third index is the resulting state.  For example, in
the matrices below, state 0 represents $0 in earnings and state 1 represents $4 in earnings, so the probability of moving
from $0 to $4 by rolling is T[1][0][1].

The reward matrix is similar to the transition probability matrix but instead of each matrix indicating the probability
from one state to the next, it gives the reward for that transition.
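
As a small illustration of how the two matrices combine, the expected immediate reward of an action in a given state is the sum over end states of probability times reward. The sketch below uses the first row of the roll matrices defined later in this notebook; the names `prob_row`, `reward_row`, and `expected_reward` are illustrative only.

```python
import numpy as np

n_sides = 6
p = 1.0 / n_sides

# first row of the roll action: transitions out of state 0 ($0 in earnings)
prob_row = np.array([0, p, p, p, 0, 0, 0, 0, 0, 0.5])
reward_row = np.array([0, 4, 5, 6, 0, 0, 0, 0, 0, 0])

# expected immediate reward = sum over end states of probability * reward
expected_reward = float(np.dot(prob_row, reward_row))
print(expected_reward)  # approximately (4 + 5 + 6) / 6 = 2.5
```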

The following gives the initialization of the parameters for this simulation.

#### Action[0]: Do not Roll the Dice

If you do not roll the dice, there is a 100% chance that you remain in your current state: you will not transition to
another state.

In [None]:
# the transition probabilities: prob[action][start state][end state]
prob = np.zeros((2, 10, 10))

# the probability matrix for the first action, if you do not roll
prob[0] = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
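
Since every state maps to itself with probability 1, this matrix is just the identity. As a compact alternative, assuming the same 10 states, it could be built with `np.eye`; the name `stay_put` is illustrative.

```python
import numpy as np

n_states = 10

# identity matrix: each state transitions to itself with probability 1
stay_put = np.eye(n_states)

# every row of a transition matrix must sum to 1
assert np.allclose(stay_put.sum(axis=1), 1.0)
```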

#### Action[1]: Roll the Dice

Rolling the dice is a little more complex.  First, the probability of transitioning to each state reachable by a single
good roll is p = 1/(number of sides of the dice).

The first row of the prob[1] matrix gives all of the possible transitions from the first state.
Thus, there is a chance p of rolling each of 4, 5, and 6, and a 1/2 chance of transitioning to the final state, which means
losing all of the money earned thus far (which is 0 at state 0).


The second row gives all of the possible transitions from the second state.  The same pattern of probabilities appears,
but shifted to the states that represent the new earnings totals.

In [None]:
# if roll
p = 1.0 / n_sides
# each good side has a 1/6 chance; the three bad sides together give a 0.5 chance of the terminal state
prob[1] = [[0, p, p, p, 0, 0, 0, 0, 0, 0.5],
           [0, 0, 0, 0, p, p, p, 0, 0, 0.5],
           [0, 0, 0, 0, 0, p, p, p, 0, 0.5],
           [0, 0, 0, 0, 0, 0, p, p, p, 0.5],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
# sanity check: every row of a transition matrix must sum to 1
print(np.sum(prob[0], axis=1))
print(np.sum(prob[1], axis=1))

rewards = np.zeros((2, 10, 10))
# if leave
rewards[0] = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
# if roll
rewards[1] = [[0, 0 + 4, 5, 6, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 4, 5, 6, 0, 0, -4],
              [0, 0, 0, 0, 0, 4, 5, 6, 0, -5],
              [0, 0, 0, 0, 0, 0, 4, 5, 6, -6],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -8],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -9],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -10],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -11],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, -12],
              [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

vi = mdptoolbox.mdp.ValueIteration(prob, rewards, 1)
vi.run()

optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)
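
To make the Value Iteration step less of a black box, here is a minimal hand-written sketch in plain numpy that can be used to cross-check the values. It rebuilds the same 10-state matrices with slicing and repeats the Bellman update until the values stop changing. The function name `value_iteration` and the parameters `tol` and `max_iter` are assumptions made for this illustration. They are not part of mdptoolbox, whose stopping rule may differ.

```python
import numpy as np

n_sides = 6
n_states = 10
p = 1.0 / n_sides

# P[action, start, end]: action 0 = do not roll, action 1 = roll
P = np.zeros((2, n_states, n_states))
P[0] = np.eye(n_states)      # not rolling keeps you in the same state
P[1, 0, 1:4] = p             # $0 -> $4, $5, $6
P[1, 1, 4:7] = p             # $4 -> $8, $9, $10
P[1, 2, 5:8] = p             # $5 -> $9, $10, $11
P[1, 3, 6:9] = p             # $6 -> $10, $11, $12
P[1, 0:4, 9] = 0.5           # bad side: go to the terminal state
P[1, 4:, 9] = 1.0            # rolling after two rolls always ends the game

# R[action, start, end]: reward received for that transition
R = np.zeros((2, n_states, n_states))
R[1, 0, 1:4] = [4, 5, 6]
R[1, 1, 4:7] = [4, 5, 6]
R[1, 2, 5:8] = [4, 5, 6]
R[1, 3, 6:9] = [4, 5, 6]
R[1, :9, 9] = [0, -4, -5, -6, -8, -9, -10, -11, -12]


def value_iteration(P, R, discount=1.0, tol=1e-9, max_iter=1000):
    """Illustrative value iteration; returns (greedy policy, state values)."""
    # expected immediate reward for each (action, state) pair
    expected_reward = (P * R).sum(axis=2)
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Bellman update: Q[a, s] = E[reward] + discount * sum_s' P[a, s, s'] * V[s']
        Q = expected_reward + discount * (P @ V)
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return Q.argmax(axis=0), V


policy, V = value_iteration(P, R)
print(policy)
print(V)
```

Under these assumptions the first two state values come out at roughly 2.583 and 0.5, matching the expected values printed earlier in this notebook. One caveat about the policy: because "do not roll" is a zero-reward self-loop and the discount is 1, both actions have the same value at convergence in states where rolling is worthwhile, so `argmax` reports action 0 there. The values in `V` are the more informative output.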