Attempting to solve Jack's Car Rental example from Sutton & Barto.

Since this is a finite MDP, I would like to store the full dynamics $p\left(s', r \, \middle| \, s, a\right)$ in arrays in memory. However, with $21^2 = 441$ states (0 to 20 cars at each location), $11$ possible actions (moving between $-5$ and $5$ cars overnight), and a large number $\left|\mathcal{R}\right|$ of distinct rewards, such an array would be quite large.
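
As a rough back-of-the-envelope check, the sketch below counts the entries of a dense table for $p(s' \mid s, a)$ alone, with the reward dimension left out. It assumes 0 to 20 cars per location and moves of $-5$ to $5$ cars, matching the `states` and `actions` dictionaries defined later; the names `n_states`, `n_actions`, and `n_entries` are illustrative only.

```python
# Assumed problem size: 0..20 cars at each of two locations, 11 possible moves.
n_states = 21 ** 2
n_actions = 11

# One float64 per (s, a, s') triple, ignoring the reward dimension entirely.
n_entries = n_states * n_actions * n_states
bytes_needed = n_entries * 8

print(f"{n_entries:,} entries, about {bytes_needed / 2**20:.1f} MiB")
```

That is roughly 2.1 million entries before rewards are considered; every distinct reward value tracked as its own axis would multiply the count again.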

The key difference between this problem and something like the previous Gridworld examples is that the dynamics here are stochastic; the reward, $r$, obtained for taking an action in a particular state as well as the resultant state, $s'$, both depend on the number of customers that walk into the two locations to rent or to return cars.

The dynamics of the problem are
\begin{align}
    p\left(s', r\, \middle| \, s, a\right) & = \sum_{n_1, n_2, m_1, m_2} p\left(s', r \, \middle| \, s, a, n_1, n_2, m_1, m_2 \right) p\left(n_1\right)p\left(n_2\right)p\left(m_1\right)p\left(m_2\right)
\end{align}

where $n_1$ and $n_2$ are the numbers of customers that come to rent a car at locations 1 and 2, respectively, and $m_1$ and $m_2$ are the numbers of customers that come to return cars there. These random variables are assumed to be independent and Poisson-distributed, which is why their joint probability factorizes in the sum above; this is the basis of our known model of the environment.
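
For reference, a Poisson random variable $N$ with rate $\lambda$ has the probability mass function

$$ p\left(N = n\right) = \frac{\lambda^{n}}{n!} e^{-\lambda}, \qquad n = 0, 1, 2, \ldots $$

The rates used in the code cells of this notebook are $\lambda = 3$ and $\lambda = 4$ for rental requests at locations 1 and 2, and $\lambda = 3$ and $\lambda = 2$ for returns.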

But what about that first distribution with all of the conditioning variables? Think about it: if we know the values of all of those variables, we can deterministically compute the next state and the reward obtained!
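
One way to write this down, as a sketch: let $f_r$ and $f_{s'}$ (names introduced here for illustration only) denote the deterministic reward and next-state maps. The conditional distribution then collapses to a product of indicators,

$$ p\left(s', r \, \middle| \, s, a, n_1, n_2, m_1, m_2 \right) = \mathbb{1}\left[ r = f_r\left(s, a, n_1, n_2\right) \right] \cdot \mathbb{1}\left[ s' = f_{s'}\left(s, a, n_1, n_2, m_1, m_2\right) \right] $$

so each term of the sum contributes its probability weight to exactly one $(s', r)$ pair.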

$$ r = 10 \cdot \min \left\{ n_1, s_1 - a \right\} + 10 \cdot \min \left\{ n_2, s_2 + a\right\} - 2 \cdot \left| a \right| $$

$$ \left[ \begin{array}{c} s_1' \\ s_2' \end{array}\right] = \left[ \begin{array}{c} \min \left\{ \max \left\{ s_1 - a - n_1, \, 0 \right\} + m_1, \, 20 \right\} \\ \min \left\{ \max \left\{ s_2 + a - n_2, \, 0 \right\} + m_2, \, 20 \right\} \end{array}\right] $$
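
To make the two formulas concrete, here is a small worked example in plain Python. It is only a sketch under a few assumptions: the action is feasible (no more cars are moved than are present), moving a car costs \$2 in either direction, and each lot is capped at 20 cars as in the code cells of this notebook. The function `step` and its `max_cars` parameter are hypothetical names, not part of the notebook.

```python
def step(s_1, s_2, a, n_1, n_2, m_1, m_2, max_cars=20):
    """Deterministic reward and next state once all four customer counts are known."""
    # Cars on hand after the overnight move (a > 0 moves cars from location 1 to 2).
    avail_1, avail_2 = s_1 - a, s_2 + a
    # A location cannot rent out more cars than it has.
    rented_1, rented_2 = min(n_1, avail_1), min(n_2, avail_2)
    # $10 per rental, $2 per car moved regardless of direction.
    r = 10 * (rented_1 + rented_2) - 2 * abs(a)
    # Returned cars are added, then each lot is capped at max_cars.
    sp_1 = min(avail_1 - rented_1 + m_1, max_cars)
    sp_2 = min(avail_2 - rented_2 + m_2, max_cars)
    return r, (sp_1, sp_2)


r, s_next = step(s_1=10, s_2=5, a=2, n_1=3, n_2=9, m_1=1, m_2=4)
print(r, s_next)  # 96 (6, 4)
```

Here location 2 receives 9 requests but only has 7 cars after the move, so 10 cars are rented in total and the reward is $10 \cdot 10 - 2 \cdot 2 = 96$.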

In [35]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson as poi

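# Truncate each Poisson distribution to the support {0, ..., 9} and renormalize
# so that every truncated pmf sums to 1.
x = range(0, 10)

# Rental requests at location 1 (mean 3)
n1_dist = poi.pmf(k=x, mu=3)
n1_dist /= n1_dist.sum()

# Rental requests at location 2 (mean 4)
n2_dist = poi.pmf(k=x, mu=4)
n2_dist /= n2_dist.sum()

# Returns at location 1 (mean 3)
m1_dist = poi.pmf(k=x, mu=3)
m1_dist /= m1_dist.sum()

# Returns at location 2 (mean 2)
m2_dist = poi.pmf(k=x, mu=2)
m2_dist /= m2_dist.sum()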
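
Since the distributions above are cut off at 9 customers, it may be worth checking how much probability mass the truncation discards. The sketch below does this with the standard library only, to stay dependency-free; `poisson_pmf` and `truncated_mass` are hypothetical helpers, not part of the notebook.

```python
from math import exp, factorial


def poisson_pmf(k, mu):
    """Poisson probability of observing exactly k events at rate mu."""
    return mu ** k * exp(-mu) / factorial(k)


def truncated_mass(mu, n_terms=10):
    """Total probability of the outcomes 0, 1, ..., n_terms - 1."""
    return sum(poisson_pmf(k, mu) for k in range(n_terms))


# Probability mass lost by keeping only 0..9 customers, for each rate used above.
for mu in (2, 3, 4):
    print(f"mu={mu}: discarded tail = {1.0 - truncated_mass(mu):.6f}")
```

The discarded tail grows with the mean, so the rate-4 distribution loses the most, a little under 1% of its mass before renormalization.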

In [59]:
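states = dict(enumerate([(i, j) for i in range(21) for j in range(21)]))
# Reverse map (cars at location 1, cars at location 2) -> index, for O(1) lookups.
state_idx = {tup: k for k, tup in states.items()}
def state_tup_to_idx(state_tup, state_idx=state_idx):
    try:
        return state_idx[tuple(state_tup)]
    except KeyError:
        raise ValueError(f"Tuple {state_tup} not found in states!") from None


actions = dict(enumerate(range(-5, 6)))
# Reverse map: number of cars moved from location 1 to location 2 -> index.
action_idx = {act: k for k, act in actions.items()}
def action_to_idx(action, action_idx=action_idx):
    try:
        return action_idx[action]
    except KeyError:
        raise ValueError(f"Action {action} not found in actions!") from None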

In [60]:
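def calc_r(s_idx, a_idx, n_1, n_2):
    s_1, s_2 = states[s_idx]
    a_val = actions[a_idx]  # a_val > 0 moves cars from location 1 to location 2
    # Cars on hand after the move, clipped at 0 so that an infeasible move
    # (more cars than are present) cannot produce negative rentals.
    avail_1, avail_2 = max(s_1 - a_val, 0), max(s_2 + a_val, 0)
    # $10 per car rented, $2 per car moved in either direction.
    return int(10 * min(n_1, avail_1) + 10 * min(n_2, avail_2) - 2 * abs(a_val))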


def calc_sp_idx(s_idx, a_idx, n_1, n_2, m_1, m_2) -> int:
    s_1, s_2 = states[s_idx]
    a_val = actions[a_idx]
    sp_tup = (
        np.min(
            [np.max([s_1 - a_val - n_1, 0]) + m_1, 20]
        ), 
        np.min(
            [np.max([s_2 + a_val - n_2, 0]) + m_2, 20]
        )
    )
    return state_tup_to_idx(sp_tup)

def calc_p(n_1, n_2, m_1, m_2):
    return np.exp(np.log([n1_dist[n_1], n2_dist[n_2], m1_dist[m_1], m2_dist[m_2]]).sum())

In [61]:
def dynamics(s_idx: int, a_idx: int):
    dynamics_list = []
    for n_1 in range(len(n1_dist)):
        for n_2 in range(len(n2_dist)):
            for m_1 in range(len(m1_dist)):
                for m_2 in range(len(m2_dist)):
                    # calc reward
                    r = calc_r(s_idx, a_idx, n_1, n_2)
                    # calc s'
                    sp_idx = calc_sp_idx(s_idx, a_idx, n_1, n_2, m_1, m_2)
                    # calc (lookup) prob
                    p = calc_p(n_1, n_2, m_1, m_2)
                    dynamics_list.append((sp_idx, r, p))
    return dynamics_list
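
The list returned above can contain many triples that share the same $(s', r)$ pair. To recover $p(s', r \mid s, a)$ itself, the weights of matching pairs have to be summed. Below is a minimal sketch of that aggregation; `aggregate_dynamics` is a hypothetical helper, and the toy list merely stands in for the output of `dynamics(s_idx, a_idx)`.

```python
from collections import defaultdict


def aggregate_dynamics(dynamics_list):
    """Collapse (s', r, p) triples into a dict {(s', r): total probability}."""
    table = defaultdict(float)
    for sp_idx, r, p in dynamics_list:
        table[(sp_idx, r)] += p
    return dict(table)


# Toy input with made-up numbers, standing in for dynamics(s_idx, a_idx).
toy = [(0, 10, 0.25), (0, 10, 0.25), (1, 10, 0.125), (1, 20, 0.375)]
table = aggregate_dynamics(toy)
total = sum(table.values())

print(table)
print(total)
```

A quick sanity check on real output is that the aggregated probabilities sum to 1 for every state-action pair, since the truncated distributions were renormalized.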
    

In [65]:
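def calc_update_for_v(s_idx: int, v: np.ndarray, policy: np.ndarray, gamma: float):
    # Bellman expectation backup for state s: sum over actions a and outcomes (s', r)
    # of pi(a | s) * p(s', r | s, a) * (r + gamma * v(s')).
    expected_return = 0.0
    for a_idx in actions:
        dl = dynamics(s_idx, a_idx)
        for sp, r, p in dl:
            # The policy is evaluated at the current state s, not at the successor s'.
            expected_return += policy[s_idx, a_idx] * p * (r + gamma * v[sp])
    return expected_return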
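
For reference, the quantity this function is meant to compute is the iterative policy evaluation update from Sutton & Barto, written here in the notation used earlier in this notebook:

$$ v_{k+1}\left(s\right) = \sum_{a} \pi\left(a \, \middle| \, s\right) \sum_{s', r} p\left(s', r \, \middle| \, s, a\right) \left[ r + \gamma \, v_k\left(s'\right) \right] $$

Note that the policy term depends on the current state $s$ and the action $a$, not on the successor state $s'$.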

In [66]:
v = np.zeros(len(states))
policy = np.ones((len(states), len(actions))) / len(actions)

Do one step of iterative policy evaluation. This constructs an approximation $v_1 \approx v_{\pi}$ which should improve upon our initial guess $v_0$. 

In [68]:
v_new = np.zeros_like(v)
for s_idx in states:
    v_new[s_idx] = calc_update_for_v(s_idx, v, policy, 0.9)
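
A single sweep only gives $v_1$; to approach $v_{\pi}$ the sweep is repeated until the values stop changing. The sketch below shows one way to structure that loop. It is self-contained and uses a toy two-state problem rather than the car rental dynamics; `iterative_policy_evaluation`, `toy_update`, `theta`, and `max_sweeps` are hypothetical names chosen for illustration.

```python
def iterative_policy_evaluation(update_fn, n_states, gamma=0.9, theta=1e-6, max_sweeps=1000):
    """Repeat full sweeps until the largest change in any state value is below theta."""
    v = [0.0] * n_states
    sweeps = 0
    for sweeps in range(1, max_sweeps + 1):
        # Synchronous update: every new value is computed from the old vector v.
        v_new = [update_fn(s, v, gamma) for s in range(n_states)]
        delta = max(abs(new - old) for new, old in zip(v_new, v))
        v = v_new
        if delta < theta:
            break
    return v, sweeps


def toy_update(s, v, gamma):
    # Toy problem: each of two states yields reward 1 and moves to the other state.
    return 1.0 + gamma * v[1 - s]


v_toy, n_sweeps = iterative_policy_evaluation(toy_update, n_states=2)
print(v_toy, n_sweeps)
```

For the toy problem the values converge towards $1 / (1 - \gamma) = 10$. In this notebook, `update_fn` could wrap the existing function, for example `lambda s, v, gamma: calc_update_for_v(s, v, policy, gamma)`, although with the nested Python loops in `dynamics` each sweep would likely be slow.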