# Markov Decision Process using Value Iteration

In this notebook, a simple dice-rolling game is modeled as a Markov Decision Process (MDP).  The mdptoolbox framework is used to
solve the game, with Value Iteration finding the optimal policy and the expected value of the initial state.

### Model

#### States
The states will represent the earnings of the agent.  Thus, the agent starts at state 0 representing 0 initial earnings.
The agent can then transition to other states or earnings by rolling the die and gaining rewards.

#### Actions
An action moves the agent from one state to the next.  This is modeled by a transition matrix holding the probability of
moving from each state to every other state under a given action.  It is a 3D matrix: the first index selects the
action, the second index (the row) is the state the action is taken from, and the third index (the column) is the
state being transitioned to.  The value at those indices is the probability of that transition happening.
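
As a small illustration of that indexing, here is a sketch with made-up numbers rather than the game's actual matrices. The names `P`, `LEAVE`, and `FLIP` are assumptions introduced for this example only.

```python
import numpy as np

LEAVE, FLIP = 0, 1          # hypothetical action indices
n_actions, n_states = 2, 3

# P[action, from_state, to_state] = probability of that transition
P = np.zeros((n_actions, n_states, n_states))
P[LEAVE] = np.eye(n_states)            # leaving keeps the agent where it is
P[FLIP] = [[0.0, 0.5, 0.5],            # from state 0: half to state 1, half to state 2
           [0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0]]

print(P[FLIP, 0, 1])                   # probability of moving 0 -> 1 when flipping
print(P.sum(axis=2))                   # each row of each action should sum to 1
```

Every row must sum to 1 because, whatever the action, the agent has to end up in some state.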

#### Rewards
The reward is the immediate benefit of taking an action in a state.  In this case it comes from rolling the die and landing on a
valid number.  The reward is equal to the face value of the die.

### Bellman Equation
The Markov Decision Process is solved with the Bellman Equation, which gives the long-term value of a state and
the optimal policy, meaning which action the agent should take in each state.

The long-term value of a state is obtained by taking the action that maximizes the immediate reward plus the discounted sum
of the values of all possible next states, each weighted by the probability of
transitioning into it.  More formally:

$Q(s) = \max_{a_i} \left( R(s, a_i) + \gamma \sum_{j} T(s, a_i, s_j) \, Q(s_j) \right)$

The optimal policy picks, in each state, the action that attains this maximum (the arg max).


$Q(s)$ = the quality of a state, which is the long-term value of being in state $s$

$a_i$ = each action that could be taken from state $s$

$R(s, a_i)$ = the reward for taking action $a_i$ in state $s$

$T(s, a_i, s_j)$ = the transition matrix, which gives the probability of transitioning from state $s$ to $s_j$ when taking action $a_i$

$Q(s_j)$ = the quality of state $s_j$

$\gamma$ = the discount factor, < 1, which models the cost of transitioning from one state to the next.  If it were >= 1,
convergence would not be guaranteed and state values could grow without bound.
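
To make the difference between the value and the chosen action concrete, the following sketch evaluates the right-hand side of the equation for a single state. All numbers are made up for illustration, and the names `R_s`, `T_s`, and `V` are assumptions of this example, not part of mdptoolbox.

```python
import numpy as np

gamma = 0.9                          # discount factor (made-up value)
V = np.array([0.0, 2.0, 1.0])        # current estimates of Q(sj) for three states

# One entry (or row) per action ai, all for the same state s
R_s = np.array([0.0, 1.0])           # R(s, ai)
T_s = np.array([[1.0, 0.0, 0.0],     # T(s, a0, sj)
                [0.0, 0.5, 0.5]])    # T(s, a1, sj)

q_per_action = R_s + gamma * T_s @ V     # one candidate value per action
value = q_per_action.max()               # long-term value of s
best_action = q_per_action.argmax()      # action the policy picks in s

print(q_per_action, value, best_action)
```

The maximum gives the value of the state, while the arg max gives the policy entry for that state.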

### Value Iteration
Value iteration repeatedly applies the Bellman Equation to update the values and the optimal policy of an MDP model
until the difference between successive iterations falls below some threshold epsilon.  In mdptoolbox the default is epsilon = .01.

A detailed breakdown of the method is found below.
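
Before turning to the library, the loop itself can be sketched in a few lines of numpy. This is a simplified illustration and not mdptoolbox's implementation: the function name `value_iteration`, its arguments, and the maximum-absolute-change stopping test are assumptions made for this example.

```python
import numpy as np

def value_iteration(P, R, gamma, epsilon=0.01, max_iter=1000):
    """P and R have shape (A, S, S); returns (policy, values)."""
    expected_reward = (P * R).sum(axis=2)        # shape (A, S)
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = expected_reward + gamma * (P @ V)    # shape (A, S)
        V_new = Q.max(axis=0)                    # best value per state
        converged = np.abs(V_new - V).max() < epsilon
        V = V_new
        if converged:
            break
    return Q.argmax(axis=0), V

# Tiny two-state example: action 0 stays put, action 1 moves to state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[1, 0, 1] = 1.0                                 # reward for moving 0 -> 1

policy, values = value_iteration(P, R, gamma=0.9)
print(policy, values)
```

In the toy example the only reward is earned by moving from state 0 to state 1, so the policy moves in state 0 and stays in state 1.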

### Simple Coin Toss Example

In a simple coin toss game, you can choose to flip the coin or leave the game and take your earnings.  If you decide to
flip, you gain a reward of 1 on heads, and on tails you lose all of your earnings.

The diagram below illustrates the states and transitions for this Markov decision process.  p is the probability of the transition,
r is the reward for the transition, and a is the action taken.  a=1 is flipping the coin, a=0 is leaving the game.

There is a 100% chance that you will remain in the same state if you leave the game.  If you flip and win, you get
a reward of 1.  If you flip and lose, you lose what you have accumulated in the game and transition to the terminal state.

Not shown in the diagram: after you choose to leave the game and remain in the same state, you transition to the terminal state.

![markov example](./_markov_coin_flip.png)
### Initialization
We will use mdptoolbox, a Python package for solving MDP problems, together with numpy, a library that provides abstractions for matrix and array manipulation.

The actions are to flip the coin or to leave the game.  The states represent the earnings you have.  The reward represents the earnings gained by flipping the coin and landing on the desired side.

The discount factor is set close to 1, at .99999; it stays below 1 for the sake of convergence.

In [2]:
import mdptoolbox
import numpy as np

# the number of sides of the dice/coin
n_sides = 2

# the number of runs to simulate
n_runs = 2

# the number of actions
n_actions = 2

# beginning and ending states
n_initial_terminal_states = 2

# the number of total states: the - n_sides term is there because two of the states
# end up in the terminal state, so we can subtract them from the number of overall states
n_states = n_runs * n_sides + n_initial_terminal_states  - n_sides

# the boolean mask to indicate which sides you will lose money on
isBadSide = np.array([0] * n_sides)

isGoodSide = [not i for i in isBadSide]

# the array which contains the values of the die, one entry per side
die = np.array([1] * n_sides)

# the total earnings given a die roll
earnings = die * isGoodSide  # [1, 1]

# the probability of each side coming up
probability_dice = 1.0 / n_sides

#### Transitions
The transition matrix represents all of the possible transitions between all of the states.

There are two actions, so the first dimension of the transition matrix has size 2.

For each action, there is a probability of transitioning between each pair of states.  Thus the transition matrix has
shape n_actions * n_states * n_states.

In [3]:
transition_matrix = np.zeros([n_actions, n_states, n_states])
print(transition_matrix)

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]


### Action: Leave the Game 
These are the transition probabilities for leaving the game and keeping the same score.  There is a 100% chance that you remain
in the same state.

In [4]:
for i in range(len(transition_matrix[0])):
    transition_matrix[0][i][i] = 1
            
print(transition_matrix[0])

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


### Action: Roll the Dice
These are the transition probabilities for flipping the coin.  From the first state there is a .5 chance of getting heads and going to state 1 and a
.5 chance of getting tails and going to the terminal state.  The same split applies from the second state.  From the third and fourth states, a flip
leads to the terminal state with probability 1.

In [5]:
for i in range(len(transition_matrix[1]) -2):
    transition_matrix[1][i][i+1] = .5
    transition_matrix[1][i][-1] = .5

for i in range(len(transition_matrix[1])-2, len(transition_matrix[1])):
    transition_matrix[1][i][-1] = 1

print(transition_matrix[1])

[[0.  0.5 0.  0.5]
 [0.  0.  0.5 0.5]
 [0.  0.  0.  1. ]
 [0.  0.  0.  1. ]]


### Reward: Initialization and Leaving the Game
The reward matrix is the same dimension as the transition matrix.
There is no reward for leaving the game.  Rewards are only accumulated by rolling the die.

In [6]:
reward_matrix = np.zeros([n_actions, n_states, n_states])

### Reward: Rolling the Die
Rolling the die can result in gaining a reward if it was heads or losing everything.  The ending column describes
the result of rolling a tails and losing the accumulated reward.

In [7]:
reward_acc = 1
for i in range(len(transition_matrix[1])-1):
    reward_matrix[1][i][i+1] = reward_acc
    reward_matrix[1][i][-1] = (1- reward_acc)
    reward_acc+=1
    
reward_matrix[1][len(transition_matrix[1])-1][len(transition_matrix[1])-1] = (1 - reward_acc)

print(reward_matrix[1])

[[ 0.  1.  0.  0.]
 [ 0.  0.  2. -1.]
 [ 0.  0.  0. -2.]
 [ 0.  0.  0. -3.]]


### Initializing the MDP: Value Iteration Model
Using mdptoolbox, the Value Iteration model is initialized and run with the given transition and reward matrices.
The library provides a high-level abstraction of the MDP; see the detailed breakdown further below for how the
framework actually performs the value iteration.

In [8]:
discount_factor = .99999  # less than 1 for convergence
vi = mdptoolbox.mdp.ValueIteration(transition_matrix, reward_matrix, discount_factor)
vi.run()

The optimal policy indicates which action to take in each state.  The expected values give the
value of each state, that is, how many points we can expect by following the optimal policy from that state.

The optimal policy and expected values are as follows:

In [9]:
optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)

(1, 1, 0, 0)
(0.7499975, 0.5, 0.0, 0.0)


From the optimal policy (1, 1, 0, 0) we can see that we should choose action 1, flipping the coin, in the initial state
and in the second state, that is, for the first two flips.

The value of each state is (0.7499975, 0.5, 0.0, 0.0): about .75 in the initial state and .5 in the second state.
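
These values can be checked by hand from the matrices above. Writing $V(s)$ for the value that `vi.V` reports for state $s$ and using $\gamma = 0.99999$: in state 1, flipping has expected immediate reward $0.5 \cdot 2 + 0.5 \cdot (-1) = 0.5$, and in state 0 it has expected immediate reward $0.5 \cdot 1 + 0.5 \cdot 0 = 0.5$. With $V(2) = V(3) = 0$, this gives:

$$
\begin{aligned}
V(1) &= 0.5 + \gamma \, (0.5 \cdot V(2) + 0.5 \cdot V(3)) = 0.5 \\
V(0) &= 0.5 + \gamma \, (0.5 \cdot V(1) + 0.5 \cdot V(3)) = 0.5 + 0.25 \, \gamma = 0.7499975
\end{aligned}
$$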

## N-Die Roll Game
The same game is played with a die of N=6 sides.  Here isBadSide = [1, 1, 1, 0, 0, 0]; in other words,
the only rolls that are rewarded are the complement: a 4, 5, or 6.

### Initialization

In [10]:
# the number of sides of the dice
n_sides = 6

# the number of runs to simulate
n_runs = 2

# the number of actions
n_actions = 2

# beginning and ending states
n_initial_terminal_states = 2

# the number of total states
n_states = n_runs * n_sides + n_initial_terminal_states  # from 0 to 2N, plus quit

# the boolean mask to indicate which sides you will lose money on
isBadSide = np.array([1, 1, 1, 0, 0, 0])
isGoodSide = [not i for i in isBadSide]

# the array which contains the values of the die
die = np.arange(1, n_sides + 1)  # [1, 2, 3, 4, 5, 6]

# the total earnings given a die roll
earnings = die * isGoodSide  # [0, 0, 0, 4, 5, 6]

# the probability of each side coming up
probability_dice = 1.0 / n_sides

### Action: Do not Roll the Dice

There is a 100% chance that, if you choose not to roll the dice in a given state, you will not transition to
another state but will remain in the same one.

In [11]:
transition_matrix = np.zeros([n_actions, n_states, n_states])
# the probability matrix for the first action, if you do not roll
for i in range(len(transition_matrix[0])):
    transition_matrix[0][i][i] = 1
            
print(transition_matrix[0])

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]


### Action: Roll the Dice
The first row gives the probabilities of transitioning from the initial state to the states reached by valid dice
rolls and to the state reached by a game-ending roll, which is represented by the last column.  In this example, there
is a p = 1/6 chance each of rolling a 4, 5, or 6, and a 1/2 chance of transitioning to the final state, which means losing
all of the money earned thus far (which is 0 at state 0).

The second row gives all of the possible transitions given earnings of 1.

In [12]:
# if roll
p = 1.0 / n_sides
min_roll = 4       # smallest rewarded face value
n_min_states = 3   # number of rewarded faces (4, 5, 6)
# from each state, every rewarded face has a 1/6 chance of being rolled
for i in range(n_states -1):
    for j in range(min_roll, min(min_roll + n_min_states, n_states- i)):
        transition_matrix[1][i][i+j] = p
        transition_matrix[1][i][-1] = 1 - sum(transition_matrix[1][i][:-1])

for i in range(len(transition_matrix[1])-min_roll, len(transition_matrix[1])):
    transition_matrix[1][i][-1] = 1

print(transition_matrix[1])

np.sum(transition_matrix[0],axis=1)
np.sum(transition_matrix[1],axis=1)

[[0.         0.         0.         0.         0.16666667 0.16666667
  0.16666667 0.         0.         0.         0.         0.
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.16666667
  0.16666667 0.16666667 0.         0.         0.         0.
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.
  0.16666667 0.16666667 0.16666667 0.         0.         0.
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.
  0.         0.16666667 0.16666667 0.16666667 0.         0.
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.16666667 0.16666667 0.16666667 0.
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.16666667 0.16666667 0.16666667
  0.         0.5       ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
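
The row-sum check above can be wrapped in a small reusable helper. The sketch below is an assumption-laden illustration: `check_transitions` and its `tol` parameter are hypothetical names for this notebook, not an mdptoolbox function.

```python
import numpy as np

def check_transitions(P, tol=1e-9):
    """Raise ValueError unless every P[a] is a valid row-stochastic matrix."""
    P = np.asarray(P, dtype=float)
    if P.ndim != 3 or P.shape[1] != P.shape[2]:
        raise ValueError("expected shape (n_actions, n_states, n_states)")
    if (P < 0).any():
        raise ValueError("negative probability found")
    row_sums = P.sum(axis=2)
    if not np.allclose(row_sums, 1.0, atol=tol):
        bad = np.argwhere(~np.isclose(row_sums, 1.0, atol=tol))
        raise ValueError(f"rows not summing to 1 at (action, state): {bad.tolist()}")

# Two identity matrices form a valid (if trivial) transition model
check_transitions(np.stack([np.eye(3), np.eye(3)]))   # passes silently
```

Calling it as `check_transitions(transition_matrix)` before building the model would surface a row that does not sum to 1 early, rather than as a confusing result later.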

### Result: Transition Matrix
The first state with index 0 is the initial state before any roll of the die. The last state
or index 13 is the terminal state.  All the other states represent the amount of earnings accumulated.
![transition matrix](./markov.png)

### Reward: Initialization and Leaving the Game
The reward matrix is the same dimension as the transition matrix.
There is no reward for leaving the game.  Rewards are only accumulated by rolling the die.

In [13]:
reward_matrix = np.zeros([n_actions, n_states, n_states])


### Reward: Rolling the Die
Rolling the die can result in gaining a reward if a good side (4, 5, or 6) comes up, or losing everything otherwise.
The last column describes the result of rolling a bad side and losing the accumulated reward.

In [14]:
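# if roll
reward_acc = 4      # smallest rewarded face value
reward_tot = 0      # earnings held in state i, lost on a bad roll
n_min_states = 3    # number of rewarded faces (4, 5, 6)

for i in range(n_states-1):
    reward_curr = reward_acc
    # a good roll of face value j moves the agent from state i to state i + j
    for j in range(reward_curr, min(reward_curr + n_min_states, n_states - i)):
        reward_matrix[1][i][i + j] = reward_curr
        reward_curr +=1

    # a bad roll loses everything; this overwrites any reward already
    # written to the last (terminal) column
    reward_matrix[1][i][-1] = -reward_tot
    reward_tot+=1

reward_matrix[1][n_states -1][n_states -1] = -reward_tot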

print(reward_matrix[1])

[[  0.   0.   0.   0.   4.   5.   6.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   4.   5.   6.   0.   0.   0.   0.   0.  -1.]
 [  0.   0.   0.   0.   0.   0.   4.   5.   6.   0.   0.   0.   0.  -2.]
 [  0.   0.   0.   0.   0.   0.   0.   4.   5.   6.   0.   0.   0.  -3.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   4.   5.   6.   0.   0.  -4.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   5.   6.   0.  -5.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   5.   6.  -6.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   5.  -7.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.  -8.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.  -9.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0. -10.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0. -11.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0. -12.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.

### Results: Reward Matrix

![reward_matrix](./reward_matrix.png)

### Initializing the MDP: Value Iteration Model
Using mdptoolbox, the Value Iteration model is initialized and run with the given transition and reward matrices.
The library provides a high-level abstraction of the MDP; see the detailed breakdown below for how the
framework actually performs the value iteration.

In [15]:
discount_factor = .9999

vi = mdptoolbox.mdp.ValueIteration(transition_matrix, reward_matrix, discount_factor)
vi.run()

optimal_policy = vi.policy
expected_values = vi.V

print(optimal_policy)
print(expected_values)

(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
(2.583325, 2.0, 1.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)


### Results: N-Die Roll
According to the optimal policy, you should roll the die while your earnings are at or below \$4.

With 0 initial earnings you are expected to make \$2.58 using the optimal policy.
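
As an independent check that does not use mdptoolbox, the expected earnings of this threshold policy can be computed exactly with fractions. This is a sketch under stated assumptions: the function `expected_gain` and the constants are names invented for this example, and the computation ignores the discount factor, so it differs slightly from the 2.583325 reported above.

```python
from fractions import Fraction

GOOD_FACES = (4, 5, 6)      # faces that pay their face value
N_SIDES = 6
ROLL_WHILE_AT_MOST = 4      # threshold read off the optimal policy above

def expected_gain(earnings):
    """Exact expected additional gain from `earnings` under the threshold policy."""
    if earnings > ROLL_WHILE_AT_MOST:
        return Fraction(0)                       # leave the game
    p = Fraction(1, N_SIDES)
    p_bad = 1 - p * len(GOOD_FACES)
    gain = -p_bad * earnings                     # bad roll: lose what was accumulated
    for face in GOOD_FACES:
        gain += p * (face + expected_gain(earnings + face))
    return gain

value = expected_gain(0)
print(value, float(value))
```

The exact undiscounted value is 31/12, which rounds to the 2.58 quoted above.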

### Markov Decision Process: Toolbox: Detailed Breakdown
The following gives more information on how the value iteration actually works within the MDP: Toolbox library.

### MDP Class
The MDP class has a number of important parameters and attributes.  The following is the
documentation of the MDP class.

    A Markov Decision Problem.

    Let ``S`` = the number of states, and ``A`` = the number of acions.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape ``(A, S, S)``,
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length ``A``, where each element contains a numpy
        array or matrix that has the shape ``(S, S)``. This "list of matrices"
        form is useful when the transition matrices are sparse as
        ``scipy.sparse.csr_matrix`` matrices can be used. In summary, each
        action's transition matrix must be indexable like ``transitions[a]``
        where ``a`` ∈ {0, 1...A-1}, and ``transitions[a]`` returns an ``S`` ×
        ``S`` array-like object.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape ``(S, A)``, ``(S,)`` or ``(A, S, S)``. A list
        of lists can be used, where each inner list has length ``S`` and the
        outer list has length ``A``. A list of numpy arrays is possible where
        each inner array can be of the shape ``(S,)``, ``(S, 1)``, ``(1, S)``
        or ``(S, S)``. Also ``scipy.sparse.csr_matrix`` can be used instead of
        numpy arrays. In addition, the outer list can be replaced by any object
        that can be indexed like ``reward[a]`` such as a tuple or numpy object
        array of length ``A``.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 upto and including 1. If the discount
        factor is 1, then convergence is cannot be assumed and a warning will
        be displayed. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass ``None`` in
        the case where the algorithm does not use an epsilon-optimal stopping
        criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a maximum number of iterations.

    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function. Each element is a float corresponding to
        the expected value of being in that state assuming the optimal policy
        is followed.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.

    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.

### MDP: Rewards Matrix
The reward matrix is calculated by multiplying the transition probability matrix element-wise by the corresponding reward values.
Each row is then summed to get the overall expected reward from that state.  For example, from the initial state
we have a 1/6 chance of rolling a 4, a 1/6 chance of rolling a 5, a 1/6 chance of rolling a 6, and a 1/2 chance of earning 0.
The expected reward is then 1/6 * 4 + 1/6 * 5 + 1/6 * 6 + 1/2 * 0 = 2.5.  This will be used
as the reward in the Bellman Equation.

The fifth state, index 4, represents earnings of 4 and is reached by rolling a 4 on your first roll.
From this state, you can leave, in which case you stay at 4 and get 0 reward.  Or you can roll the die, in which case you
have a 1/2 chance of rolling a bad side and losing your earnings (a reward of -4), and a 1/6 chance each of rolling a 4, 5, or 6
and moving to earnings of 8, 9, or 10.  The expected reward for rolling is therefore 1/6 * (4 + 5 + 6) - 1/2 * 4 = 0.5.

See the reward matrix above for a visualization of these values.

The function is as follows:

    def _computeMatrixReward(self, reward, transition):
    
        if _sp.issparse(reward):
            return reward.multiply(transition).sum(1).A.reshape(self.S)
        elif  _sp.issparse(transition):
            return transition.multiply(reward).sum(1).A.reshape(self.S)
        else:
            tran_reward = _np.multiply(transition, reward)
            sum_tran_reward = tran_reward.sum(1)
            ret_val = sum_tran_reward.reshape(self.S)
            return ret_val
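
The dense branch of the function above can be reproduced for the two states discussed in this section. The sketch below copies only the relevant "roll" rows from the matrices printed earlier; the array names `transition` and `reward` mirror the function's arguments, but stacking just two rows is a simplification made for this example.

```python
import numpy as np

n_states = 14
p = 1.0 / 6

# Row 0: "roll" from the initial state; row 1: "roll" from earnings of 4
transition = np.zeros((2, n_states))
reward = np.zeros((2, n_states))

transition[0, [4, 5, 6]] = p
transition[0, -1] = 0.5
reward[0, [4, 5, 6]] = [4, 5, 6]       # a bad roll costs nothing at earnings 0

transition[1, [8, 9, 10]] = p
transition[1, -1] = 0.5
reward[1, [8, 9, 10]] = [4, 5, 6]
reward[1, -1] = -4                     # a bad roll loses the 4 already earned

# Same operations as the dense branch: element-wise product, then row sums
expected = np.multiply(transition, reward).sum(1)
print(expected)
```

The two entries correspond to the expected rewards of 2.5 and 0.5 worked out in the text.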


### MDP: Value Iteration Class

Value Iteration is a subclass of the MDP super class.  The run method actually executes the algorithm.

    A discounted MDP solved using the value iteration algorithm.

    Description
    -----------
    ValueIteration applies the value iteration algorithm to solve a
    discounted MDP. The algorithm consists of solving Bellman's equation
    iteratively.
    Iteration is stopped when an epsilon-optimal policy is found or after a
    specified number (``max_iter``) of iterations.
    This function uses verbose and silent modes. In verbose mode, the function
    displays the variation of ``V`` (the value function) for each iteration and
    the condition which stopped the iteration: epsilon-policy found or maximum
    number of iterations reached.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details.  Default: 0.01.
    max_iter : int, optional
        Maximum number of iterations. If the value given is greater than a
        computed bound, a warning informs that the computed bound will be used
        instead. By default, if ``discount`` is not equal to 1, a bound for
        ``max_iter`` is computed, otherwise ``max_iter`` = 1000. See the
        documentation for the ``MDP`` class for further details.
    initial_value : array, optional
        The starting value function. Default: a vector of zeros.

    Data Attributes
    ---------------
    V : tuple
        The optimal value function.
    policy : tuple
        The optimal policy function. Each element is an integer corresponding
        to an action which maximises the value function in that state.
    iter : int
        The number of iterations taken to complete the computation.
    time : float
        The amount of CPU time used to run the algorithm.

    Methods
    -------
    run()
        Do the algorithm iteration.
   
### MDP: Value Iteration Method

Once initialized, the run method starts the Value Iteration algorithm and calculates the optimal values and
optimal policy at each state using the Bellman operator until convergence, which is governed by epsilon.

At each iteration, the optimal values V (an attribute of the class) are updated by calling the
_bellmanOperator() function.  The variation between the new values and the old values is then computed and
compared to the stopping threshold (self.thresh in the code below).

        def run(self):
            # Run the value iteration algorithm.
    
            if self.verbose:
                print('  Iteration\t\tV-variation')
    
            self.time = _time.time()
            while True:
                self.iter += 1
    
                Vprev = self.V.copy()
    
                # Bellman Operator: compute policy and value functions
                self.policy, self.V = self._bellmanOperator()
    
                # The values, based on Q. For the function "max()": the option
                # "axis" means the axis along which to operate. In this case it
                # finds the maximum of the the rows. (Operates along the columns?)
                variation = _util.getSpan(self.V - Vprev)
    
                if self.verbose:
                    print(("    %s\t\t  %s" % (self.iter, variation)))
    
                if variation < self.thresh:
                    if self.verbose:
                        print(_MSG_STOP_EPSILON_OPTIMAL_POLICY)
                    break
                elif self.iter == self.max_iter:
                    if self.verbose:
                        print(_MSG_STOP_MAX_ITER)
                    break
    
            # store value and policy as tuples
            self.V = tuple(self.V.tolist())
            self.policy = tuple(self.policy.tolist())
    
            self.time = _time.time() - self.time
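
The helper `_util.getSpan` is not shown in this notebook. Assuming it returns the span of its argument, meaning the maximum minus the minimum, the stopping test in the loop can be sketched as follows. The function `span`, the sample vectors, and the value of `thresh` are stand-ins chosen for this example.

```python
import numpy as np

def span(x):
    """Span of a vector: largest element minus smallest element."""
    return x.max() - x.min()

# Values before and after one sweep (first six states of the die game)
V_prev = np.array([2.5, 2.0, 1.5, 1.0, 0.5, 0.0])
V_new = np.array([2.583325, 2.0, 1.5, 1.0, 0.5, 0.0])

variation = span(V_new - V_prev)
thresh = 0.01                      # stand-in threshold, not the library's value

print(variation, variation < thresh)
```

Under this assumption, a sweep that shifts every state's value by the same amount has a span of 0, so the test reacts only to changes that differ between states.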

### MDP: Bellman Operator Method
The initial values are 0 for each state.

For the first action, leaving the game, the reward is 0 and there is a 100% probability of staying in the current state.
Thus the new value is just the discounted old value.

For the second action, rolling the dice, the Bellman equation is more interesting:

    Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)

First, the expected reward for each state was precomputed as described in the Rewards Matrix section above.  The probability of each transition
has also been given as a parameter.  The current value is initially 0.
Thus the quality of a state given rolling the die is just the expected reward at that state.  For the initial state, the
reward is 2.5.

In the second iteration, the input values are the values from the first iteration, which equal the expected rewards for the states
where rolling is worthwhile.  The expected rewards themselves do not change.  The dot product
between the input values and the probabilities of the transitions gives the value of future transitions.  From the initial state, the only
non-zero term is the transition to the 5th state, a roll of 4.  The probability of rolling a 4
is 1/6 and the value of the 5th state is .5.  Thus the added value is 1/6 * .5 =
.08333 (before discounting), and the new value is 2.5833.  This value does not change further, as the value of the 5th state has already been
computed and factored into the update of the initial value.
 
The bellman operator method:

        def _bellmanOperator(self, V=None):
            # Apply the Bellman operator on the value function.
            #
            # Updates the value function and the Vprev-improving policy.
            #
            # Returns: (policy, value), tuple of new policy and its value
            #
            # If V hasn't been sent into the method, then we assume to be working
            # on the objects V attribute
            if V is None:
                # this V should be a reference to the data rather than a copy
                V = self.V
            else:
                # make sure the user supplied V is of the right shape
                try:
                    assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                        "right shape (Bellman operator)."
                except AttributeError:
                    raise TypeError("V must be a numpy array or matrix.")
            # Looping through each action the the Q-value matrix is calculated.
            # P and V can be any object that supports indexing, so it is important
            # that you know they define a valid MDP before calling the
            # _bellmanOperator method. Otherwise the results will be meaningless.
            Q = _np.empty((self.A, self.S))
            for aa in range(self.A):
                Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
    
            # Get the policy and value, for now it is being returned but...
            # Which way is better?
            # 1. Return, (policy, value)
    
            Qmax = (Q.argmax(axis=0), Q.max(axis=0))
            return Qmax
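
The two iterations described in this section can be traced outside the library. The sketch below rebuilds the die game's matrices in a compact form and applies the same update as `_bellmanOperator` twice. The construction loop and the variable names are assumptions of this example; rolls that would run past the last earnings state are folded into the terminal column, as in the matrices printed earlier.

```python
import numpy as np

n_states, gamma = 14, 0.9999
faces = [4, 5, 6]                      # rewarded faces, each with probability 1/6

P = np.zeros((2, n_states, n_states))
R = np.zeros((2, n_states, n_states))
P[0] = np.eye(n_states)                # action 0: leave, stay in place, no reward
for s in range(n_states):
    for f in faces:
        if s + f < n_states - 1:       # good roll that lands on an earnings state
            P[1, s, s + f] = 1.0 / 6
            R[1, s, s + f] = f
    P[1, s, -1] = 1.0 - P[1, s, :-1].sum()   # everything else ends the game
    R[1, s, -1] = -s                          # and forfeits the s earned so far

expected_reward = (P * R).sum(axis=2)  # shape (A, S), the role played by self.R
V = np.zeros(n_states)
for sweep in (1, 2):
    Q = expected_reward + gamma * (P @ V)    # same update as in _bellmanOperator
    V = Q.max(axis=0)
    print(sweep, round(V[0], 6))
```

The value of the initial state comes out as 2.5 after the first sweep and about 2.5833 after the second, matching the walkthrough.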
        

