# Dynamic Programming

The term dynamic programming (DP) refers to a collection of algorithms that can be  
used to compute optimal policies given a perfect model of the environment as a Markov  
decision process (MDP).

Classical DP algorithms are of limited utility in reinforcement  
learning both because of their assumption of a perfect model and because of their great  
computational expense, but they are still important theoretically.

We usually assume that the environment is a finite MDP. That is, we assume that its
state, action, and reward sets, $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$, are finite, and that its dynamics are given by a
set of probabilities $p(s', r \mid s, a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, $r \in \mathcal{R}$, and $s' \in \mathcal{S}^+$ ($\mathcal{S}^+$ is $\mathcal{S}$ plus a
terminal state if the problem is episodic).
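
As an illustration, one way to hold the dynamics of a small finite MDP in Python is a dictionary keyed by state-action pairs. The two-state MDP below and the names `STATES`, `ACTIONS`, `dynamics`, and `is_valid_dynamics` are made up for this sketch; they do not come from any library.

```python
# Hypothetical two-state MDP, used only for illustration.
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# dynamics[(s, a)] lists (probability, next_state, reward) triples,
# i.e. the non-zero entries of p(s', r | s, a).
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def is_valid_dynamics(dynamics, states, actions, tol=1e-9):
    """Check that p(., . | s, a) is a probability distribution for every (s, a)."""
    for s in states:
        for a in actions:
            outcomes = dynamics[(s, a)]
            if any(prob < 0 for prob, _, _ in outcomes):
                return False
            if abs(sum(prob for prob, _, _ in outcomes) - 1.0) > tol:
                return False
    return True


print(is_valid_dynamics(dynamics, STATES, ACTIONS))
```

Storing only the non-zero entries keeps the table small, which matters because every DP sweep iterates over it.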

The key idea of DP, and of reinforcement learning generally, is the use of value functions  
to organize and structure the search for good policies.

We can easily obtain optimal policies once we have found the optimal value functions, $v_*$ or
$q_*$, which satisfy the Bellman optimality equations:

$
\begin{aligned}
v_*(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
       &= \max_a \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_*(s')]
\end{aligned}
$

$
\begin{aligned}
q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a] \\
          &= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a'} q_*(s', a')]
\end{aligned}
$

As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments,  
that is, into update rules for improving approximations of the desired value functions.
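
To make the idea of an equation turned into an assignment concrete, here is a minimal sketch that repeatedly applies the Bellman optimality equation for the state-value function as an update, then reads a greedy policy off the result. The two-state MDP, the discount `GAMMA`, the threshold `theta`, and every function name are assumptions chosen for illustration.

```python
GAMMA = 0.9  # assumed discount factor
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# Hypothetical dynamics: (s, a) -> list of (probability, next_state, reward).
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def backup(V, s, a):
    """Sum over (s', r) of p(s', r | s, a) * [r + gamma * V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in dynamics[(s, a)])


def value_iteration(theta=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # The Bellman optimality equation, used as an assignment.
            v_new = max(backup(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def greedy_policy(V):
    """Pick, in each state, an action that maximizes the one-step backup."""
    return {s: max(ACTIONS, key=lambda a: backup(V, s, a)) for s in STATES}


V = value_iteration()
policy = greedy_policy(V)
print(V, policy)
```

The values are updated in place during a sweep, so later states in the same sweep already see the new estimates.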

## Policy Evaluation (Prediction)

First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$.
This is called **policy evaluation** in the DP literature; we also refer to it as the **prediction problem**.
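
A common way to do this is iterative policy evaluation: start from an arbitrary value estimate and sweep over the states, replacing each value by its expected one-step backup under the policy, until the largest change falls below a threshold. The sketch below assumes a hypothetical two-state MDP, an equiprobable policy, and made-up names (`dynamics`, `policy_evaluation`, `theta`).

```python
GAMMA = 0.9  # assumed discount factor
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# Hypothetical dynamics: (s, a) -> list of (probability, next_state, reward).
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}

# Equiprobable policy: policy[s][a] is the probability of taking a in s.
policy = {s: {a: 1 / len(ACTIONS) for a in ACTIONS} for s in STATES}


def expected_backup(V, s, policy):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        policy[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in dynamics[(s, a)])
        for a in ACTIONS
    )


def policy_evaluation(policy, theta=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = expected_backup(V, s, policy)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


V = policy_evaluation(policy)
print(V)
```

At convergence each value agrees with its own backup, which is exactly the Bellman equation for the policy being evaluated.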

## Policy Improvement



## Policy Iteration

In [None]:
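# Policy Iteration

from collections import defaultdict


class PolicyIteration:

    def __init__(self, action_space, theta):
        # `action_space` is taken to be the number of discrete actions,
        # `theta` the convergence threshold for policy evaluation.
        self.theta = theta

        # Value estimates start at 0 for every state.
        # (A defaultdict factory takes no arguments.)
        self.V = defaultdict(float)
        # The initial policy is uniform over the actions.
        self.policy = defaultdict(lambda: [1 / action_space] * action_space)

    def policy_evaluation(self, env):
        delta = self.theta + 1

        # Sweep over the states until the largest value change is at most theta.
        while delta > self.theta:
            raise NotImplementedError("the evaluation sweep is not written yet")

    def policy_improvement(self, env):
        raise NotImplementedError("policy improvement is not written yet")
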
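The class above leaves the evaluation and improvement steps unwritten. As a rough sketch of how the two steps could alternate, the block below runs policy iteration on a hypothetical two-state MDP with an explicit dynamics table rather than on the `env` argument used above. The MDP, `GAMMA`, and all function names are assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# Hypothetical dynamics: (s, a) -> list of (probability, next_state, reward).
dynamics = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "move"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, "stay"): [(1.0, 1, 2.0)],
    (1, "move"): [(1.0, 0, 0.0)],
}


def backup(V, s, a):
    """Expected one-step return of taking a in s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in dynamics[(s, a)])


def evaluate(policy, V, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy, in place."""
    while True:
        delta = 0.0
        for s in STATES:
            v_new = backup(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def improve(policy, V):
    """Make the policy greedy with respect to V; return True if unchanged."""
    stable = True
    for s in STATES:
        best = max(ACTIONS, key=lambda a: backup(V, s, a))
        if best != policy[s]:
            policy[s] = best
            stable = False
    return stable


def policy_iteration():
    V = {s: 0.0 for s in STATES}
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary starting policy
    while True:
        evaluate(policy, V)
        if improve(policy, V):
            return policy, V


policy, V = policy_iteration()
print(policy, V)
```

Ties are broken by the order of `ACTIONS`, which keeps the stability check from flipping between equally good actions.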

### Jack's Car Rental

In [11]:
# Jack's Car Rental Env
import numpy as np

import gymnasium as gym
from gymnasium import spaces


class JacksCarRental(gym.Env):
    metadata = { "render_modes": ["human", "ascii"] }

    def __init__(self, render_mode=None, nb_cars_allowed=20):
        self.nb_cars_allowed = nb_cars_allowed  # The maximum number of cars allowed at each location
        self._first_location = 0
        self._second_location = 0

        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the number of cars at each location.
        # Each count is an integer in {0, ..., `nb_cars_allowed`}.
        self.observation_space = spaces.Dict(
            {
                "first_location": spaces.Discrete(self.nb_cars_allowed + 1),
                "second_location": spaces.Discrete(self.nb_cars_allowed + 1),
            }
        )

        # We have 11 actions, one per number of cars moved overnight (-5, -4, ..., 4, 5)
        self.action_space = spaces.Discrete(11, start=-5)

        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode
    
    def _get_obs(self):
        return { "first_location": self._first_location, "second_location": self._second_location}
    
    def _get_info(self):
        return {}

    def step(self, action):

        cars_requested_at_first_location = self.np_random.poisson(3)
        cars_returned_at_first_location = self.np_random.poisson(4)

        cars_requested_at_second_location = self.np_random.poisson(3)
        cars_returned_at_second_location = self.np_random.poisson(2)

        nb_moved_cars = action

        # We use `np.clip` to keep each count between 0 and the maximum number of cars allowed
        self._first_location = np.clip(
            self._first_location + nb_moved_cars + cars_returned_at_first_location, 0, self.nb_cars_allowed
        )

        self._second_location = np.clip(
            self._second_location - nb_moved_cars + cars_returned_at_second_location, 0, self.nb_cars_allowed
        )

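        # Each location earns up to 10: the fraction of its requests it can serve, times 10.
        # `max(..., 1)` avoids a division by zero on days without requests.
        reward_from_first_location = np.clip(self._first_location / max(cars_requested_at_first_location, 1), 0, 1) * 10
        reward_from_second_location = np.clip(self._second_location / max(cars_requested_at_second_location, 1), 0, 1) * 10
        # Moving a car costs 2, whichever direction it goes
        reward_cost_from_moving_car = 2 * abs(nb_moved_cars)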
    
        # `_step` is never incremented, so `terminated` stays False and the task is continuing
        terminated = self._step >= 1
        reward = reward_from_first_location + reward_from_second_location - reward_cost_from_moving_car
        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, reward, terminated, False, info
    
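    def reset(self, seed=None, options=None):
        # Seed `self.np_random`, which `step` uses to draw the Poisson samples
        super().reset(seed=seed)

        self._step = 0

        # Both locations start empty
        self._first_location = 0
        self._second_location = 0

        return self._get_obs(), self._get_info()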

In [12]:
env = JacksCarRental()

env.reset()

observation, reward, terminated, truncated, info = env.step(2)
print(observation, reward, terminated, truncated, info)


{'first_location': np.int64(3), 'second_location': np.int64(2)} 16.0 False False {}
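
The environment above only samples requests and returns, whereas DP needs the probabilities themselves. A minimal sketch of how they could be tabulated follows. The helper names and the choice to fold the tail mass into the last entry are assumptions for illustration, not part of the environment.

```python
from math import exp, factorial


def poisson_pmf(n, lam):
    """Probability that a Poisson(lam) variable equals n."""
    return lam ** n * exp(-lam) / factorial(n)


def truncated_poisson(lam, max_n):
    """Probabilities for n = 0..max_n, with all mass above max_n folded into max_n."""
    probs = [poisson_pmf(n, lam) for n in range(max_n)]
    probs.append(1.0 - sum(probs))
    return probs


# Assumed example: requests with mean 3, truncated at 20 cars.
request_probs = truncated_poisson(3, 20)
print(len(request_probs), sum(request_probs))
```

Truncating at the lot capacity keeps the table finite without dropping probability mass, since demand beyond the cars available cannot be served anyway.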
