# Introduction to Dynamic Programming

## CSCI E-82A
## Stephen Elston

In the previous lesson we explored the concepts of **Markov decision processes (MDP)** and **Markov reward processes (MRP)**. Now we will extend these concepts and apply them to finding **optimal solutions** for such systems. By an optimal solution, we mean a solution for which no other solution produces **greater reward**.

To understand how we can find optimal solutions we must introduce some new concepts:
1. An **action** causes a state transition. This transition may be to the same state. 
2. A **policy** specifies the **actions** given the state. In other words, the policy defines the actions to be taken by the agent. 
3. An **optimal policy** produces the greatest reward possible given the initial state of the system.
4. A **plan** is the sequence of actions leading to an optimal result. 

In this and subsequent lessons, we will explore some powerful methods for finding optimal policies. Broadly, these methods are known as **dynamic programming** and **reinforcement learning**. In this lesson we will focus on the representation and learning methods for dynamic programming. Dynamic programming algorithms can be the basis of effective and flexible intelligent agents, as shown below.

<img src="img/DPAgent.JPG" alt="Drawing" style="width:500px; height:300px"/>
<center> **Dynamic Programming Agent and Environment** </center>

## Value Functions and Policy Evaluation

Given a Markov random process, a reward function, and a policy of actions on the Markov process, how can we tell how good our policy is? We can perform **policy evaluation** in two ways.  

First, we can perform **policy evaluation** with a **value function**. The value function is the expected value of the **gain** achieved by following a policy, $\pi$, given the current state. We can express the state value function as follows:

$$
\begin{align}
v_{\pi}(s) &= \mathbb{E}_{\pi} [ G_t\ |\ S_t = s] \\
&= \mathbb{E}_{\pi} [R_{t+1} + \gamma G_{t+1}\ |\ S_t = s] \\
&= \mathbb{E}_{\pi} [R_{t+1} + \gamma v_{\pi} (S_{t+1})\ |\ S_t = s] \\
&= \sum_a \pi(a|s) \sum_{s',r} p(s',r | s,a) \big[ r + \gamma v_{\pi}(s') \big],\ \forall s 
\end{align}
$$

These relations are known as the **Bellman value equations**. They tell us how to compute the value of being in a particular state, $s$. There is one such equation for each state, $s \in \xi$, of the Markov process.

Examine the last line of the above and notice that this relationship can be viewed as a recursion.    
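
The recursion can be seen on a very small example. The sketch below is not the grid world used later; it assumes a hypothetical three-state chain (2 → 1 → 0, with 0 terminal) in which each state has a single action, so $\pi(a|s) = 1$ and there is one successor per state. The names `successor`, `reward` and `backup` are illustrative only.

```python
# Hypothetical chain 2 -> 1 -> 0 (terminal), one action per state.
successor = {0: 0, 1: 0, 2: 1}
reward = {0: 0.0, 1: 5.0, 2: -1.0}
gamma = 0.9

def backup(s, v):
    """One Bellman backup for state s: r + gamma * v(s')."""
    return reward[s] + gamma * v[successor[s]]

v = {0: 0.0, 1: 0.0, 2: 0.0}
for sweep in range(3):
    # Each sweep computes new values from the previous sweep's values.
    v = {s: backup(s, v) for s in v}
    print(sweep, v)
```

After the first sweep state 1 has its final value, and after the second sweep state 2 does as well, since its value depends only on the value of state 1.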

<img src="img/ValueBackup.JPG" alt="Drawing" style="width:300px; height:200px"/>
<center> **Backup diagram of the Bellman Value Function** </center>

As an alternative we can use the **Bellman action value function**:

$$\begin{align}
q_{\pi}(s,a) &= \mathbb{E}_{\pi} \big[G_{t}\ \big|\ S_t = s, A_t = a \big]  \\
&= \mathbb{E}_{\pi} \Big[ \sum_{k=0}^\infty \gamma^k\ R_{t+k+1}\ \big|\ S_t = s, A_t = a \Big]  \\
&= \sum_{s',r} p(s',r\ |\ s,a) \Big[ r + \gamma \sum_{a'} \pi(a'|s')\ q_{\pi}(s',a') \Big]
\end{align}$$

Whereas $v_{\pi}(s)$ is the value of being in a state, $q_{\pi}(s,a)$ tells us the value of taking a particular action, $a$, from a state, $s$, and following the policy $\pi$ thereafter. There is one such equation for each state-action pair, $(s,a)$.
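
The link between the two functions can be sketched numerically: the action value is a one-step look-ahead over successor states, and the state value is the policy-weighted average of the action values. The model, the successor values and the policy probabilities below are made-up numbers for a single hypothetical state, not part of the grid world.

```python
gamma = 0.9
# Hypothetical one-step model for one state s:
# action -> list of (probability, reward, successor) outcomes.
model = {
    'safe':  [(1.0, 1.0, 'A')],
    'risky': [(0.5, 4.0, 'A'), (0.5, -2.0, 'B')],
}
v = {'A': 2.0, 'B': 0.0}            # assumed values of the successor states
pi = {'safe': 0.5, 'risky': 0.5}    # assumed policy probabilities pi(a|s)

def action_value(outcomes, v, gamma):
    """q(s,a) = sum over outcomes of p * (r + gamma * v(s'))."""
    return sum(p * (r + gamma * v[s_next]) for p, r, s_next in outcomes)

q = {a: action_value(outcomes, v, gamma) for a, outcomes in model.items()}
v_s = sum(pi[a] * q[a] for a in q)  # state value as the policy-weighted average
print(q, v_s)
```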

<img src="img/ActionValueBackup.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Backup diagram of the Bellman Action Value Function** </center>

## Grid World Example

Let's try an example of computing the state values of a Markov process. In this example, we will work with both the **representations** and **learning** required. 

**Navigation** to a goal is a significant problem in robotics. Real-world navigation is rather complex. Therefore, in this example we will use a simple analog called a **grid world**. The grid world for this problem is shown below. 

<img src="img/GridWorld.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **A 4x4 Grid World with Terminal State** </center>

The grid world consists of a 4x4 set of positions the robot can occupy. Each position is considered a state. The goal is to navigate to state 0, the goal, in the minimum number of steps. We will explore methods to find policies which reach this goal and achieve maximum reward. 

Grid position 0 is a **terminal node**. There are no possible state transitions out of this position. The presence of a terminal node makes this an **episodic Markov random process**. In each episode the robot can start in any other random position, $\{ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \}$, and the episode terminates when the robot enters position (state) 0.  

In each state, there are four possible actions the robot can take:
- up, u
- down, d
- left, l
- right, r

We encode, or represent, these possibilities in a dictionary as shown in the code block below. We use a dictionary of dictionaries to perform the lookup. The keys of the outer dictionary are the identifiers (numbers) of the states. The keys of the inner dictionary are the possible actions and the values are the **successor state**, $s'$, for that transition.  

Notice that there are no allowed transitions out of the terminal state. Also, any transition that would take the robot off the grid leaves the state unchanged. 

In [23]:
## Import numpy for later
import numpy as np

## Define the transition dictionary of dictionaries:
## policy[state][action] is the successor state for that action.
policy = {0:{'u':0, 'd':0, 'l':0, 'r':0},
          1:{'u':1, 'd':5, 'l':0, 'r':2},
          2:{'u':2, 'd':6, 'l':1, 'r':3},
          3:{'u':3, 'd':7, 'l':2, 'r':3},
          4:{'u':0, 'd':8, 'l':4, 'r':5},
          5:{'u':1, 'd':9, 'l':4, 'r':6},
          6:{'u':2, 'd':10, 'l':5, 'r':7},
          7:{'u':3, 'd':11, 'l':6, 'r':7},
          8:{'u':4, 'd':12, 'l':8, 'r':9},
          9:{'u':5, 'd':13, 'l':8, 'r':10},
          10:{'u':6, 'd':14, 'l':9, 'r':11},
          11:{'u':7, 'd':15, 'l':10, 'r':11},
          12:{'u':8, 'd':12, 'l':12, 'r':13},
          13:{'u':9, 'd':13, 'l':12, 'r':14},
          14:{'u':10, 'd':14, 'l':13, 'r':15},
          15:{'u':11, 'd':15, 'l':14, 'r':15},}
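
Typing the table by hand is error-prone, so it can help to generate it and compare. The following is a minimal sketch, assuming the states are numbered in row-major order as in the figure; `build_transitions` is a hypothetical helper, not part of the lesson's code.

```python
def build_transitions(n_rows=4, n_cols=4, terminal=0):
    """Build {state: {action: successor}} for a row-major grid."""
    moves = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}
    transitions = {}
    for s in range(n_rows * n_cols):
        row, col = divmod(s, n_cols)
        transitions[s] = {}
        for action, (d_row, d_col) in moves.items():
            new_row, new_col = row + d_row, col + d_col
            on_grid = 0 <= new_row < n_rows and 0 <= new_col < n_cols
            if s == terminal or not on_grid:
                transitions[s][action] = s  # terminal or off-grid: stay put
            else:
                transitions[s][action] = new_row * n_cols + new_col
    return transitions

grid = build_transitions()
print(grid[1])
print(grid[15])
```

Comparing `grid` with the hand-written dictionary (for example with `==`) is a quick way to catch a mistyped successor.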

We need to define the initial transition probabilities for the Markov process. Initially, we set the probability of each action to be **uniform**. As there are 4 possible actions from each non-terminal state, and each action leads to exactly one successor state, all of these transition probabilities are 0.25. The probabilities for the terminal state are 0.0. In other words, this is a random policy which does not favor any particular transition. 

The initial uniform transition probabilities are encoded using a dictionary of dictionaries. The organization of this data structure is identical to the foregoing data structure. 

In [24]:
state_to_state_probs = {0:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0},
                        1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}, 
                        2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        6:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        7:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        8:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}
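
The same structure can be generated and checked in a few lines. This is a sketch under the assumptions stated in the text (four actions, terminal state 0); `uniform_policy` is a hypothetical helper name.

```python
def uniform_policy(n_states=16, actions=('u', 'd', 'l', 'r'), terminal=0):
    """Equal probability for every action, and 0.0 in the terminal state."""
    p = 1.0 / len(actions)
    return {s: {a: (0.0 if s == terminal else p) for a in actions}
            for s in range(n_states)}

probs = uniform_policy()

# Sanity check: probabilities sum to 1 in every non-terminal state.
for s, action_probs in probs.items():
    expected_total = 0.0 if s == 0 else 1.0
    assert abs(sum(action_probs.values()) - expected_total) < 1e-12
print(probs[0], probs[7])
```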


The robot receives the following rewards:
- 10 for entering position 0, 
- -1 for attempting to leave the grid; in other words, the robot is penalized for hitting the edge of the grid, 
- -0.1 for all other state transitions.

We encode this reward in the same type of dictionary structure used for the foregoing structures.  

In [25]:
rewards = {0:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0},
          1:{'u':-1, 'd':-0.1, 'l':10.0, 'r':-0.1},
          2:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          4:{'u':10.0, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          5:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          6:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          7:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          8:{'u':-0.1, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          9:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          10:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          11:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          12:{'u':-0.1, 'd':-1.0, 'l':-1.0, 'r':-0.1},
          13:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          14:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          15:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-1.0}}
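
To see how the rewards combine with the uniform probabilities, the expected one-step reward of a state can be worked out directly. The sketch below copies the reward entries for states 1 and 5 from the table; `expected_reward` is an illustrative helper.

```python
# Rewards for the four actions in states 1 and 5, copied from the rewards table.
rewards_state_1 = {'u': -1.0, 'd': -0.1, 'l': 10.0, 'r': -0.1}
rewards_state_5 = {'u': -0.1, 'd': -0.1, 'l': -0.1, 'r': -0.1}
action_prob = 0.25  # uniform random policy

def expected_reward(action_rewards, prob):
    """Expected one-step reward when every action has the same probability."""
    return sum(prob * r for r in action_rewards.values())

e1 = expected_reward(rewards_state_1, action_prob)  # 0.25 * 8.8, about 2.2
e5 = expected_reward(rewards_state_5, action_prob)  # about -0.1
print(e1, e5)
```

State 1 sits next to the goal, so even a random walk collects a positive expected reward on its first step from there, while an interior state such as 5 only pays the step cost.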

So far, we have constructed a rather poor policy. This policy is just a random walk around the grid. Still, we can measure the value of this policy. 

The function in the code below iterates over the Bellman value function to find the values of each state. The iteration continues until the convergence criterion is met.  

> **Note:** The code in this example takes advantage of the fact that there is only one possible successor state for each action. This means there is no need to sum over successor states.

In [26]:
def compute_state_value(pi, probs, reward, gamma = 1.0, theta = 0.01, display = False):
    '''Function for iterative policy evaluation.
    pi[s][a] is the successor state for action a in state s,
    probs[s][a] is the probability of taking action a in state s,
    and reward[s][a] is the reward for that transition.
    '''
    delta = theta
    values = np.zeros(len(probs)) # Initialize the value array
    while(delta >= theta):
        v = np.copy(values) ## Save the values for computing the difference later
        for s in probs.keys():
            temp_values = 0.0 ## Initialize the sum of values for this state
            for action in reward[s].keys():
                s_prime = pi[s][action]
                temp_values = temp_values + probs[s][action] * (reward[s][action] + gamma * values[s_prime])
            values[s] = temp_values
            
        ## Compute the total absolute change to see if convergence has been reached.    
        diffs = np.sum(np.abs(np.subtract(v, values)))
        delta = min([delta, diffs])
        if(display): 
            print('difference metric = ' + str(diffs))
            print(values.reshape(4,4))
    return values

compute_state_value(policy, state_to_state_probs, rewards, theta = 0.1, display = False).reshape(4,4)

array([[ 0.        ,  0.92461824, -3.93184927, -6.48887002],
       [ 0.92461824, -2.10995108, -4.95119803, -6.86735717],
       [-3.93184927, -4.95119803, -6.5102566 , -7.87700873],
       [-6.48887002, -6.86735717, -7.87700873, -8.96905736]])
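
Because the Bellman value equations are linear in the state values, the random policy can also be evaluated by solving them directly instead of iterating. The sketch below is one way to do this with `np.linalg.solve`; it rebuilds the grid from the rules in the text so that it is self-contained, and the names `N`, `step`, `P` and `r` are assumptions of this example. The terminal state is left out of the linear system and its value is fixed at 0.

```python
import numpy as np

N = 4  # grid side length
moves = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def step(s, action):
    """Return (successor, reward) using the rules described in the text."""
    row, col = divmod(s, N)
    new_row, new_col = row + moves[action][0], col + moves[action][1]
    if not (0 <= new_row < N and 0 <= new_col < N):
        return s, -1.0                      # attempted to leave the grid
    s_next = new_row * N + new_col
    return s_next, (10.0 if s_next == 0 else -0.1)

# Transition matrix P and expected reward r over non-terminal states 1..15
# under the uniform random policy.
states = list(range(1, N * N))
index = {s: i for i, s in enumerate(states)}
P = np.zeros((len(states), len(states)))
r = np.zeros(len(states))
for s in states:
    for action in moves:
        s_next, reward = step(s, action)
        r[index[s]] += 0.25 * reward
        if s_next != 0:                     # the terminal state has value 0
            P[index[s], index[s_next]] += 0.25

gamma = 1.0
v = np.linalg.solve(np.eye(len(states)) - gamma * P, r)  # solves v = r + gamma P v
values = np.concatenate(([0.0], v)).reshape(N, N)
print(np.round(values, 2))
```

The result need not match the iterative estimate printed above, since the iterative loop stops once the change in a sweep falls below `theta` rather than when the values are exact.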

## Overview of Dynamic Programming

What is **dynamic programming**? In the most general terms, dynamic programming is a **planning method**. A planning method is a means for an intelligent agent to gain improved autonomy through a sequence of actions to achieve a **goal**. 

Dynamic programming was developed in the 1950s by mathematician **Richard Bellman**. By *programming*, Bellman meant planning or optimization, in the same sense as in *linear programming*, rather than computer programming: here, an algorithm which optimizes the **value** of the states visited in a system represented by a Markov process. By *dynamic*, Bellman meant the algorithm solves the problem recursively by operating on smaller and simpler sub-problems. 

The key idea of dynamic programming is expressed by the **Bellman optimality equations**. We can express this optimal relationship in two different ways. First, we can find a relationship which is a solution to the **optimal value function**, $v_{*}(s)$: 

$$
\begin{align}
v_{*}(s) &= \max_a\ \mathbb{E} [ R_{t+1} + \gamma v_*(S_{t+1})\ |\ S_t = s, A_t = a] \\
&= \max_a\ \sum_{s',r} p(s',r\ |\ s,a) \big[ r + \gamma v_{*}(s') \big]
\end{align}
$$
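
A single application of this equation is just a maximum over one-step look-aheads. As a small worked sketch, the code below applies it to grid-world states 1 and 2, using the rewards and successors from the tables above and starting from all-zero values; `optimality_backup` is a hypothetical helper name.

```python
gamma = 1.0
values = {s: 0.0 for s in range(16)}

# action -> (reward, successor) for states 1 and 2, copied from the tables above.
state_1 = {'u': (-1.0, 1), 'd': (-0.1, 5), 'l': (10.0, 0), 'r': (-0.1, 2)}
state_2 = {'u': (-1.0, 2), 'd': (-0.1, 6), 'l': (-0.1, 1), 'r': (-0.1, 3)}

def optimality_backup(outcomes, values, gamma):
    """max over actions of r + gamma * v(s') for deterministic transitions."""
    return max(r + gamma * values[s_next] for r, s_next in outcomes.values())

values[1] = optimality_backup(state_1, values, gamma)  # moving left reaches the goal
values[2] = optimality_backup(state_2, values, gamma)  # moving left reaches state 1
print(values[1], values[2])
```

State 2 picks up its value from state 1 in the same pass because the values are updated in place.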

Second, we can use the other form of the Bellman optimality equations, the **optimal action value function**, $q_{*}(s,a)$:

$$\begin{align}
q_{*}(s,a) &= \mathbb{E} \big[R_{t+1} + \gamma \max_{a'}\ q_{*}(S_{t+1},a')\ 
\big|\ S_t = s, A_t = a \big]  \\
&= \sum_{s',r} p(s',r\ |\ s,a) \big[ r + \gamma\ \max_{a'}\ q_{*}(s',a') \big]
\end{align}$$
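
The action value form can be iterated in the same way as the state value form. The sketch below does so on a hypothetical three-state chain (state 0 terminal, actions `left` and `right`) with deterministic transitions; the model and its rewards are assumptions chosen to resemble the grid world, not taken from it.

```python
gamma = 1.0
# Hypothetical chain 0 <- 1 <-> 2 with terminal state 0.
# model[s][a] = (reward, successor)
model = {
    1: {'left': (10.0, 0), 'right': (-0.1, 2)},
    2: {'left': (-0.1, 1), 'right': (-1.0, 2)},
}
q = {s: {a: 0.0 for a in actions} for s, actions in model.items()}

def best_value(s, q):
    """max over a' of q(s, a'); the terminal state is worth 0."""
    return max(q[s].values()) if s in q else 0.0

for sweep in range(10):
    for s, actions in model.items():
        for a, (r, s_next) in actions.items():
            # Bellman optimality backup for one state-action pair.
            q[s][a] = r + gamma * best_value(s_next, q)
print(q)
```

Once the action values have converged, the greedy action in each state can be read off directly as the key with the largest value, without a further look-ahead.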

Like dynamic programming, **reinforcement learning** is a class of optimization algorithms using a sequence of actions for a system represented by a Markov process. Therefore, understanding dynamic programming is a good path to understanding reinforcement learning. 

## Policy Iteration for Grid World

In [27]:
import copy
def policy_iteration(pi, probs, reward, gamma = 1.0, theta = 0.1, output = False):
    '''Alternate policy improvement and policy evaluation until the
    state values stop changing by more than theta.
    '''
    delta = theta
    v = np.zeros(len(probs))
    state_values = np.zeros(len(probs))
    current_policy = copy.deepcopy(probs)
    while(delta >= theta):
        for s in probs.keys():
            temp_values = []
            ## Note: each action value is weighted by the current policy probability,
            ## so an action whose probability is 0.0 scores 0.0 here. A textbook
            ## greedy improvement step uses the unweighted action value.
            for action in reward[s].keys():
                s_prime = pi[s][action]
                temp_values.append(current_policy[s][action] * (reward[s][action] + gamma * state_values[s_prime]))
            
            ## Find the max value and update current policy
            max_index = np.where(np.array(temp_values) == max(temp_values))[0]
            prob_for_policy = 1.0/float(len(max_index))
            for i,action in enumerate(current_policy[s].keys()): 
                if(i in max_index): current_policy[s][action] = prob_for_policy
                else: current_policy[s][action] = 0.0
                
        
        ## Compute state values with new policy to determine if there is an improvement
        state_values = compute_state_value(pi, current_policy, reward, gamma = gamma, theta = .1)
        diff = np.sum(np.abs(np.subtract(v, state_values)))
        if(output): 
            print('\ndiff = ' + str(diff))
            print('Current policy')
            print(current_policy)
            print('With state values')
            print(state_values.reshape(4,4))
        
        delta = min([delta, diff])
        v = np.copy(state_values) 
    return current_policy

policy_iteration(policy, state_to_state_probs, rewards, gamma = 1.0, output = True)


diff = 128.1086520299247
Current policy
{0: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}, 1: {'u': 0.0, 'd': 0.0, 'l': 1.0, 'r': 0.0}, 2: {'u': 0.0, 'd': 0.3333333333333333, 'l': 0.3333333333333333, 'r': 0.3333333333333333}, 3: {'u': 0.0, 'd': 0.5, 'l': 0.5, 'r': 0.0}, 4: {'u': 1.0, 'd': 0.0, 'l': 0.0, 'r': 0.0}, 5: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}, 6: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}, 7: {'u': 0.3333333333333333, 'd': 0.3333333333333333, 'l': 0.3333333333333333, 'r': 0.0}, 8: {'u': 0.3333333333333333, 'd': 0.3333333333333333, 'l': 0.0, 'r': 0.3333333333333333}, 9: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}, 10: {'u': 0.25, 'd': 0.25, 'l': 0.25, 'r': 0.25}, 11: {'u': 0.3333333333333333, 'd': 0.3333333333333333, 'l': 0.3333333333333333, 'r': 0.0}, 12: {'u': 0.5, 'd': 0.0, 'l': 0.0, 'r': 0.5}, 13: {'u': 0.3333333333333333, 'd': 0.0, 'l': 0.3333333333333333, 'r': 0.3333333333333333}, 14: {'u': 0.3333333333333333, 'd': 0.0, 'l': 0.3333333333333333, 'r': 0.3333333

{0: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 1: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 2: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 3: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 4: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 5: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 6: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 7: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 8: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 9: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 10: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 11: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 12: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 13: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 14: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 15: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5}}
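
A dictionary of probabilities is hard to read as a route across the grid. The sketch below lays a policy out on the 4x4 grid, showing for each state the actions with non-zero probability. `policy_grid` is a hypothetical helper, and the `preferred` table is transcribed by hand from the policy returned above.

```python
def policy_grid(policy, n_cols=4, terminal=0):
    """Return rows of strings naming the actions with non-zero probability."""
    cells = []
    for s in sorted(policy):
        if s == terminal:
            cells.append('T')  # mark the terminal state
        else:
            cells.append(''.join(a for a in 'udlr' if policy[s][a] > 0.0))
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]

# Preferred actions per state, transcribed from the returned policy.
preferred = {0: 'udlr', 1: 'l', 2: 'l', 3: 'l', 4: 'u', 5: 'ul', 6: 'l', 7: 'l',
             8: 'u', 9: 'u', 10: 'ul', 11: 'u', 12: 'u', 13: 'u', 14: 'l', 15: 'ul'}
policy = {s: {a: (1.0 / len(acts) if a in acts else 0.0) for a in 'udlr'}
          for s, acts in preferred.items()}

rows = policy_grid(policy)
for row in rows:
    print(' '.join(cell.ljust(2) for cell in row))
```

Laid out this way, every state moves up or left, which is consistent with the goal being in the top-left corner.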

## Value Iteration for Grid World

In [29]:
def value_iteration(pi, probs, reward, gamma = 1.0, theta = 0.1, output = False):
    '''Sweep the Bellman optimality equation over all states until the state
    values stop changing by more than theta, then extract a policy.
    '''
    delta = theta
    v = np.zeros(len(probs))
    state_values = np.zeros(len(probs))
    current_policy = copy.deepcopy(probs)
    while(delta >= theta):
        for s in probs.keys(): # iterate over all states
            temp_values = []
            ## Find the values for all possible actions in the state.
            for action in reward[s].keys():
                s_prime = pi[s][action]
                temp_values.append((reward[s][action] + gamma * state_values[s_prime]))
            
            ## Find the max value and update the value for the state
            state_values[s] = max(temp_values)
        ## Determine if convergence is achieved
        diff = np.sum(np.abs(np.subtract(v, state_values)))
        delta = min([delta, diff])
        v = np.copy(state_values)
        if(output):
            print('Difference = ' + str(diff))
            print(state_values.reshape(4,4))
    
    ## Now we need to find the policy that makes max value state transitions
    for s in current_policy.keys(): # iterate over all states
        ## Find the indices of maximum state transition values.
        ## Note: actions are ranked by the value of the successor state only;
        ## the reward for the transition is not part of this comparison.
        temp_values = [state_values[pi[s][action]] for action in pi[s].keys()]
        max_index = np.where(np.array(temp_values) == max(temp_values))[0]    
        prob_for_policy = 1.0/float(len(max_index)) ## Probabilities of transition
        for i, key in enumerate(current_policy[s]): ## Update policy
            if(i in max_index): current_policy[s][key] = prob_for_policy
            else: current_policy[s][key] = 0.0    
    return current_policy

value_iteration(policy, state_to_state_probs, rewards, output = True)

Difference = 146.70000000000002
[[ 0.  10.   9.9  9.8]
 [10.   9.9  9.8  9.7]
 [ 9.9  9.8  9.7  9.6]
 [ 9.8  9.7  9.6  9.5]]
Difference = 0.0
[[ 0.  10.   9.9  9.8]
 [10.   9.9  9.8  9.7]
 [ 9.9  9.8  9.7  9.6]
 [ 9.8  9.7  9.6  9.5]]


{0: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 1: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 2: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 3: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 4: {'d': 0.0, 'l': 1.0, 'r': 0.0, 'u': 0.0},
 5: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 6: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 7: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 8: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 9: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 10: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 11: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 12: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 1.0},
 13: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 14: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5},
 15: {'d': 0.0, 'l': 0.5, 'r': 0.0, 'u': 0.5}}
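
In the policy printed above, state 1 chooses `u` and state 4 chooses `l`, both of which run into the edge of the grid instead of entering the goal. This happens because the extraction step compares successor state values only, and the terminal state has value 0. A minimal sketch of an extraction step that ranks actions by $r + \gamma v(s')$ instead is given below; `greedy_policy` is a hypothetical helper, and the inputs are the entries for states 1 and 4 together with the converged values printed above.

```python
import math

def greedy_policy(transitions, reward, state_values, gamma=1.0):
    """Split probability evenly among actions that maximize r + gamma * v(s')."""
    policy = {}
    for s, actions in transitions.items():
        q = {a: reward[s][a] + gamma * state_values[s_next]
             for a, s_next in actions.items()}
        best = max(q.values())
        winners = [a for a, value in q.items()
                   if math.isclose(value, best, abs_tol=1e-9)]
        policy[s] = {a: (1.0 / len(winners) if a in winners else 0.0)
                     for a in actions}
    return policy

# Entries for states 1 and 4, with the converged values of their neighbors.
transitions = {1: {'u': 1, 'd': 5, 'l': 0, 'r': 2},
               4: {'u': 0, 'd': 8, 'l': 4, 'r': 5}}
reward = {1: {'u': -1.0, 'd': -0.1, 'l': 10.0, 'r': -0.1},
          4: {'u': 10.0, 'd': -0.1, 'l': -1.0, 'r': -0.1}}
state_values = {0: 0.0, 1: 10.0, 2: 9.9, 4: 10.0, 5: 9.9, 8: 9.9}

new_policy = greedy_policy(transitions, reward, state_values)
print(new_policy)
```

With the reward included, state 1 moves left and state 4 moves up, so both enter the goal in one step.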

#### Copyright 2018, Stephen F Elston. All rights reserved. 