# Dynamic Programming

Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited practical use in reinforcement learning today, because they assume a perfect model and are computationally expensive, but they remain important theoretically and are the foundation for understanding more complex methods.

The key idea of DP is the use of value functions to organize and structure the search for good policies. 

### Grid World

We will use a grid world toy example, in which an agent can move up, down, left or right within a grid with a specified number of rows and columns. Each move results in a reward of -1. The top-left and bottom-right corners are terminal states: the episode ends once the agent reaches either of them, so no further reward is collected and the value of a terminal state is 0. In the implementation below, the move that enters a terminal state also yields -1. If the agent makes a move that would take it out of bounds, it stays in place instead and still receives a reward of -1.

In [217]:
import numpy as np

class GridWorld:
   def __init__(self, rows=4, cols=4):
      self.rows = rows
      self.cols = cols
      self.state = None
      self.is_done = False
      self.available_actions = ['up', 'down', 'left', 'right']
      self.reset()
         
   def reset(self):
      self.grid = np.zeros((self.rows, self.cols))
      for i in range(self.rows):
         self.grid[i, :] = np.arange(self.cols)+self.cols*i
      
      self.n_states = self.rows*self.cols
      # Start in a random non-terminal state (randint's upper bound is exclusive)
      self.state = np.random.randint(1, self.n_states-1)
      self.is_done = False
      
   def step(self, action, state=None, look_ahead=False):
      if self.is_done:
         print('Agent reached terminal state in the previous move, please reset the environment!')
         return
      if state is None: state = self.state
      
      action = action.lower()
      if action not in self.available_actions:
         raise ValueError(f'Invalid action: {action}')
      
      terminal = False
      reward = -1
      # States are numbered row by row, so one row up or down is +/- self.cols
      if action == 'up':
         new_state = state - self.cols if state not in self.grid[0, :] else state
      elif action == 'down':
         new_state = state + self.cols if state not in self.grid[self.rows-1, :] else state
      elif action == 'right':
         new_state = state + 1 if state not in self.grid[:, self.cols-1] else state
      elif action == 'left':
         new_state = state - 1 if state not in self.grid[:, 0] else state
         
      # Entering a terminal state ends the episode; the move itself still costs -1
      if self.is_terminal(new_state):
         terminal = True

      if not look_ahead:
         self.state = new_state
         self.is_done = terminal
         
      return new_state, reward, terminal
   
   def is_terminal(self, state):
      return state == 0 or state == (self.n_states-1)
   
   def __repr__(self):
      vis = np.zeros_like(self.grid)
      row = self.state // self.cols
      col = self.state % self.cols
      vis[row][col] = 1
      return str(vis)

In [221]:
g = GridWorld()
g

[[0. 0. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

In [222]:
g.step('down')

(7, -1, False)

In [223]:
g

[[0. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
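
The outputs above show state 3 in the top-right cell and state 7 directly below it, because `GridWorld` numbers its states row by row. A minimal sketch of that mapping follows; the helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of the class.

```python
def state_to_cell(state, cols):
    """Convert a row-major state index into a (row, col) pair."""
    return divmod(state, cols)


def cell_to_state(row, col, cols):
    """Convert a (row, col) pair back into a row-major state index."""
    return row * cols + col


# State 7 on a 4x4 grid sits in row 1, column 3
print(state_to_cell(7, cols=4))
print(cell_to_state(1, 3, cols=4))
```

Both directions depend only on the number of columns, which is why moving up or down changes the state index by `cols` rather than by `rows`.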

### Policy Evaluation

First, let's look into ways of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. We know that:

$$v_\pi(s)=E_\pi [G_t|S_t=s]=\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi (s')]$$

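where $\pi(a|s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$, $p(s',r|s,a)$ is the probability of moving to state $s'$ with reward $r$ after taking action $a$ in state $s$, and $\gamma$ is the discount factor. The expectation $E_\pi$ is conditional on following $\pi$, hence the subscript.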

We can use *iterative policy evaluation*, which applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$ and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. This is called an *expected update*. Each iteration of iterative policy evaluation updates the value of every state once to produce the new approximate value function $v_{k+1}$. The updates are called *expected* because they are based on an expectation over all possible next states rather than on a single sampled next state.
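
Written out, one sweep turns the Bellman equation above into an update rule, with the previous estimate $v_k$ on the right-hand side (this is the standard textbook form; $v_0$ is an arbitrary initial guess with terminal states fixed at 0):

$$v_{k+1}(s)=\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_k (s')]$$

As a small worked instance, assume the 4x4 grid world with states numbered row by row from 0, an equiprobable policy ($\pi(a|s)=0.25$), $\gamma=1$, $v_0=0$, and a reward of -1 for every move, including the one into a terminal state, as in `GridWorld.step`. From state 1, the moves up, down, left and right lead to states 1, 5, 0 and 2 respectively, so the first two sweeps give:

$$v_1(1)=0.25\,[(-1+0)+(-1+0)+(-1+0)+(-1+0)]=-1$$

$$v_2(1)=0.25\,[(-1-1)+(-1-1)+(-1+0)+(-1-1)]=-1.75$$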

<img src='resources/iterative-policy-evaluation.jpg' />


In [247]:
import numpy as np

def ipe(env, policy, theta=1e-5, gamma=1, n_steps=100, print_every=False):
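   '''
   Iterative Policy Evaluation
   Inputs:
    - env (GridWorld): environment providing n_states, available_actions and step()
    - policy (dict): policy to be evaluated, policy[state][action] = probability
    - theta (float): threshold > 0 determining accuracy of estimation
    - gamma (float): discount factor applied to rewards received in the future
    - n_steps (int): maximum number of sweeps through all states
    - print_every (list): iterations at which to print the value table (False to disable)
   Returns:
    - V (np.ndarray): estimated state values, shaped (rows, cols)
   '''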
   V = np.zeros(env.n_states)
   i_step = 0.
   while i_step < n_steps:
      delta = 0.
      V_tmp = V.copy()
      
      # Sweep over all non-terminal states, computing each expected update
      # from the values of the previous sweep (V)
      for s in range(1, env.n_states-1):
         v = 0.
         for a in env.available_actions:
            new_s, r, done = env.step(a, state=s, look_ahead=True)
            v += policy[s][a] * (r + gamma*V[new_s])
         delta = max(delta, np.abs(v-V[s]))
         V_tmp[s] = v
      
      # Print value estimation table
      if print_every:
         if i_step in print_every:
            print(f'Iteration: {i_step}')
            print(np.round(V.reshape(env.rows, env.cols), 1))
            print('-'*50 + '\n')
            
      V = V_tmp.copy()
      i_step += 1
      
      # Stop once a full sweep changes no state value by theta or more
      if delta < theta: break
      
   return V.reshape(env.rows, env.cols)

In [248]:
random_policy = {key: {val: 0.25 for val in ['up', 'down', 'left', 'right']} for key in range(0, 16)}

In [250]:
g.reset()
V = ipe(g, random_policy, n_steps=100, print_every=[0, 1, 2, 3, 10, 99])

Iteration: 0.0
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
--------------------------------------------------

Iteration: 1.0
[[ 0. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1.  0.]]
--------------------------------------------------

Iteration: 2.0
[[ 0.  -1.8 -2.  -2. ]
 [-1.8 -2.  -2.  -2. ]
 [-2.  -2.  -2.  -1.8]
 [-2.  -2.  -1.8  0. ]]
--------------------------------------------------

Iteration: 3.0
[[ 0.  -2.4 -2.9 -3. ]
 [-2.4 -2.9 -3.  -2.9]
 [-2.9 -3.  -2.9 -2.4]
 [-3.  -2.9 -2.4  0. ]]
--------------------------------------------------

Iteration: 10.0
[[ 0.  -6.1 -8.4 -9. ]
 [-6.1 -7.7 -8.4 -8.4]
 [-8.4 -8.4 -7.7 -6.1]
 [-9.  -8.4 -6.1  0. ]]
--------------------------------------------------

Iteration: 99.0
[[  0.  -13.9 -19.9 -21.9]
 [-13.9 -17.9 -19.9 -19.9]
 [-19.9 -19.9 -17.9 -13.9]
 [-21.9 -19.9 -13.9   0. ]]
--------------------------------------------------
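
As a cross-check on the iteration 99 table, note that for a fixed policy the Bellman equation is linear in $v_\pi$, so on a grid this small it can also be solved directly. The sketch below is not part of the original notebook: it rebuilds the 4x4 dynamics inline (assuming row-major state numbering, a reward of -1 per move, and $\gamma=1$), and the function name `solve_random_policy_values` is made up for illustration.

```python
import numpy as np


def solve_random_policy_values(rows=4, cols=4, gamma=1.0):
    """Solve v = r + gamma * P v exactly for the equiprobable random policy."""
    n_states = rows * cols
    terminal = {0, n_states - 1}
    non_terminal = [s for s in range(n_states) if s not in terminal]
    index = {s: i for i, s in enumerate(non_terminal)}

    # (row offset, col offset) for up, down, left, right
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    n = len(non_terminal)
    P = np.zeros((n, n))   # transitions between non-terminal states
    r = np.full(n, -1.0)   # assumption: every move yields a reward of -1

    for s in non_terminal:
        row, col = divmod(s, cols)
        for d_row, d_col in moves:
            new_row, new_col = row + d_row, col + d_col
            # Out-of-bounds moves leave the agent in place
            if not (0 <= new_row < rows and 0 <= new_col < cols):
                new_row, new_col = row, col
            new_s = new_row * cols + new_col
            # Terminal states have value 0, so they drop out of the system
            if new_s not in terminal:
                P[index[s], index[new_s]] += 0.25

    v = np.linalg.solve(np.eye(n) - gamma * P, r)

    V = np.zeros(n_states)
    V[non_terminal] = v
    return V.reshape(rows, cols)


V_exact = solve_random_policy_values()
print(np.round(V_exact, 1))
```

The exact solution contains only the values 0, -14, -18, -20 and -22, so the estimates after 100 sweeps are close to it but have not fully converged.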



In [144]:
random_policy

{0: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 1: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 2: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 3: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 4: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 5: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 6: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 7: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 8: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 9: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 10: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 11: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 12: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 13: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 14: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25},
 15: {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}}