![Logo](https://raw.githubusercontent.com/BartaZoltan/deep-reinforcement-learning-course/main/notebooks/shared_assets/logo.png)

# Practice 2 Homework: Jack's Car Rental (Policy Iteration)

**Developers:** Domonkos Nagy, Balazs Nagy, Zoltan Barta  
**Date:** 2026-02-25  
**Version:** 2025-26/2

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/BartaZoltan/deep-reinforcement-learning-course/blob/main/notebooks/sessions/session_02_mdp_dynamic_programming/session_02_mdp_dynamic_programming_homework.ipynb)

## Summary

In this homework you implement Policy Iteration for Jack's Car Rental (Sutton & Barto, Ch. 4).

Content outline:
- Poisson demand/return model,
- expected return computation for state-action pairs,
- iterative policy evaluation,
- greedy policy improvement,
- final policy/value visualization.


## Task Description: Jack's Car Rental

Jack manages two car rental locations. Every night he can move cars between locations (max 5 cars per night). During the next day, rentals and returns happen stochastically at both locations.

Your task is to solve this finite MDP with **Policy Iteration**:
1. implement expected return for a given state-action pair,
2. implement policy evaluation,
3. implement greedy policy improvement,
4. iterate until the policy becomes stable.

### Model details
- State: `(cars_at_A, cars_at_B)` with each value in `[0, 20]`.
- Action: net number of cars moved overnight from A to B, an integer in `[-5, 5]`; negative values move cars from B to A.
- Reward: `$10` per rented car, movement cost `$2` per moved car.
- Extension rules (same as in the lecture notebook):
  - If action > 0 (move A->B), one of the moved cars is free, so only `action - 1` cars are charged.
  - A parking cost of `$4` is charged at each location that holds more than 10 cars after moving.
- Discount factor: `gamma = 0.9`.

This assignment is aligned with Chapter 4 of Sutton & Barto {cite}`sutton2018`.
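
The action-dependent overnight cost implied by the model details above can be sketched as follows. This is only an illustration under stated assumptions, not the reference solution: `overnight_cost` and `PARKING_THRESHOLD` are hypothetical names that do not appear in the notebook, and the parking cost is assumed to be a flat fee per location rather than a per-car fee.

```python
MOVE_COST = 2
PARKING_COST = 4
PARKING_THRESHOLD = 10  # hypothetical name, not defined in the notebook


def overnight_cost(state, action):
    """Illustrative cost of one overnight move under the extension rules."""
    cars_a, cars_b = state
    # With action > 0 (A->B) one moved car is free; B->A moves are all charged.
    paid_moves = action - 1 if action > 0 else abs(action)
    cost = MOVE_COST * paid_moves
    # Assumption: flat parking fee per location that holds more than 10 cars.
    for cars_after_move in (cars_a - action, cars_b + action):
        if cars_after_move > PARKING_THRESHOLD:
            cost += PARKING_COST
    return cost


print(overnight_cost((12, 5), 2))   # 2: one paid move, no parking fee
print(overnight_cost((5, 12), -2))  # 4: two paid moves, no parking fee
```

The cost enters the expected return with a negative sign, since it reduces the reward earned from rentals.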


In [None]:
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display

np.set_printoptions(precision=1, suppress=True, linewidth=160)


## Environment Setup

The next cells define constants and helper functions used by all tasks.


In [None]:
# DO NOT MODIFY THIS CELL

MAX_CARS = 20
MAX_MOVE = 5
GAMMA = 0.9

RENT_REWARD = 10
MOVE_COST = 2
PARKING_COST = 4

LAMBDA_RENTALS_A = 3
LAMBDA_RENTALS_B = 4
LAMBDA_RETURNS_A = 3
LAMBDA_RETURNS_B = 2

POISSON_LIMIT = 11  # Poisson outcomes n >= POISSON_LIMIT are ignored (tail truncated)

N_STATES = MAX_CARS + 1
ACTIONS = np.arange(-MAX_MOVE, MAX_MOVE + 1)

V = np.zeros((N_STATES, N_STATES), dtype=float)
policy = np.zeros((N_STATES, N_STATES), dtype=int)


In [None]:
# DO NOT MODIFY THIS CELL

poisson_cache = {}

def poisson_prob(lmbda: int, n: int) -> float:
    key = (lmbda, n)
    if key not in poisson_cache:
        poisson_cache[key] = (lmbda ** n) * math.exp(-lmbda) / math.factorial(n)
    return poisson_cache[key]

for lam in [LAMBDA_RENTALS_A, LAMBDA_RENTALS_B, LAMBDA_RETURNS_A, LAMBDA_RETURNS_B]:
    for n in range(POISSON_LIMIT):
        poisson_prob(lam, n)

def is_action_legal(state, action) -> bool:
    cars_a, cars_b = state
    cars_a_after_move = cars_a - action
    cars_b_after_move = cars_b + action
    return (
        0 <= cars_a_after_move <= MAX_CARS
        and 0 <= cars_b_after_move <= MAX_CARS
    )


In [None]:
# DO NOT MODIFY THIS CELL

def plot_policy(policy_arr, title="Policy (cars moved A->B)"):
    plt.figure(figsize=(10, 7))
    sns.heatmap(
        np.flip(policy_arr, axis=0),
        cmap="vlag",
        linecolor="white",
        linewidths=0.05,
        square=True,
        yticklabels=np.arange(MAX_CARS, -1, -1),
        cbar_kws={"label": "Action"},
    )
    plt.title(title)
    plt.ylabel("Cars at A")
    plt.xlabel("Cars at B")
    plt.tight_layout()
    plt.show()

def plot_values(V_arr, title="State-value function"):
    plt.figure(figsize=(10, 7))
    sns.heatmap(
        np.flip(V_arr, axis=0),
        cmap="viridis",
        linecolor="white",
        linewidths=0.05,
        square=True,
        yticklabels=np.arange(MAX_CARS, -1, -1),
        cbar_kws={"label": "V(s)"},
    )
    plt.title(title)
    plt.ylabel("Cars at A")
    plt.xlabel("Cars at B")
    plt.tight_layout()
    plt.show()


## Task 1: Implement Expected Return for `(state, action)`

Complete `expected_return(state, action, V_table)` by looping over the truncated Poisson rental and return outcomes at both locations.

Requirements:
- illegal actions return `-np.inf`,
- apply the movement cost, including the extension rules (free transfer, parking cost),
- compute expected one-step reward + discounted next-state value.
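
To build intuition for the Poisson loops, the sketch below handles a single location and rentals only. It is a simplified illustration, not the full two-location computation: `poisson_pmf`, `truncated_mass`, and `expected_rental_reward` are hypothetical helper names, and returns and next-state values are deliberately left out.

```python
import math

RENT_REWARD = 10
POISSON_LIMIT = 11  # outcomes 0..10 are summed, the tail is dropped


def poisson_pmf(lam, n):
    """Poisson probability of exactly n events with rate lam."""
    return (lam ** n) * math.exp(-lam) / math.factorial(n)


def truncated_mass(lam):
    """Probability mass kept by the truncation at POISSON_LIMIT."""
    return sum(poisson_pmf(lam, n) for n in range(POISSON_LIMIT))


def expected_rental_reward(lam, cars_available):
    """Expected rental income at one location (rentals only, no returns)."""
    total = 0.0
    for requests in range(POISSON_LIMIT):
        # Requests beyond the available cars cannot be served.
        rented = min(requests, cars_available)
        total += poisson_pmf(lam, requests) * RENT_REWARD * rented
    return total


print(truncated_mass(3), truncated_mass(4))
print(expected_rental_reward(3, 20))
```

Because the tail is dropped, the kept probabilities sum to slightly less than 1, so the computed expectations are slightly below their exact values.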


In [None]:
def expected_return(state, action, V_table):
    ########################################################################
    # TODO: implement expected return for one state-action pair
    # Hint: follow the structure from lecture notebook: rentals -> returns loops
    # and aggregate prob * (reward + gamma * V(next_state)).

    # 1) Check legality
    # 2) Compute action-dependent immediate costs/rewards
    # 3) Sum over rentals and returns (Poisson)

    raise NotImplementedError("Task 1: implement expected_return")
    ########################################################################


## Task 2: Implement Policy Evaluation

Evaluate the current deterministic policy until `delta < theta`.

Use in-place updates over all states:
- `old_v = V[s]`
- `V[s] = expected_return(s, policy[s], V)`
- `delta = max(delta, abs(old_v - V[s]))`
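
The in-place sweep can be tried out on a tiny problem before applying it to the car-rental grid. The sketch below uses a hypothetical two-state deterministic chain (not part of the homework), where `toy_backup` stands in for `expected_return` under a fixed policy.

```python
import numpy as np

GAMMA = 0.9

# Hypothetical toy chain: state 0 yields reward 1 and moves to state 1,
# state 1 yields reward 0 and moves back to state 0.
REWARD = {0: 1.0, 1: 0.0}
NEXT_STATE = {0: 1, 1: 0}


def toy_backup(s, V):
    """One-step backup for the toy chain."""
    return REWARD[s] + GAMMA * V[NEXT_STATE[s]]


def toy_policy_evaluation(V, theta=1e-6):
    """Sweep all states in place until the largest change is below theta."""
    while True:
        delta = 0.0
        for s in range(len(V)):
            old_v = V[s]
            V[s] = toy_backup(s, V)  # later states already see this new value
            delta = max(delta, abs(old_v - V[s]))
        if delta < theta:
            return V


V_toy = toy_policy_evaluation(np.zeros(2))
print(V_toy)
```

For this chain the exact values solve `V0 = 1 + 0.9 * V1` and `V1 = 0.9 * V0`, giving `V0 = 1 / 0.19` and `V1 = 0.9 / 0.19`, which the sweep approaches as `theta` shrinks.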


In [None]:
def policy_evaluation(V_table, policy_table, theta=1e-2):
    ########################################################################
    # TODO: iterative policy evaluation

    raise NotImplementedError("Task 2: implement policy_evaluation")
    ########################################################################


## Task 3: Implement Policy Improvement and Full Policy Iteration

Improve the policy greedily with respect to the current `V`, then repeat evaluation and improvement until the policy is stable.

Tie-handling rule: if multiple actions are exactly best, selecting any of them is acceptable.
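
One possible way to pick a greedy action is `np.argmax` over the action values, which returns the first maximal entry and never selects an action whose value is `-np.inf`. The sketch below uses made-up action values (`toy_q`) and a hypothetical helper `greedy_action`; it is not the required implementation.

```python
import numpy as np

# Made-up action values for actions -2..2; illegal actions carry -inf.
toy_actions = np.arange(-2, 3)
toy_q = np.array([-np.inf, 3.0, 5.0, 5.0, -np.inf])


def greedy_action(actions, q_values):
    """Return the first action with maximal value (ties go to the lowest index)."""
    return int(actions[np.argmax(q_values)])


old_action = 1  # equally good as action 0 in this example
new_action = greedy_action(toy_actions, toy_q)
changed = new_action != old_action

print(new_action, changed)
```

Here the old action ties with the selected one, yet a plain action comparison still reports a change. If ties occur in practice, comparing the value of the old action with the best value is an alternative stability test.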


In [None]:
def policy_improvement(V_table, policy_table):
    ########################################################################
    # TODO: greedy policy improvement over all states
    # Return True if policy is stable, else False

    raise NotImplementedError("Task 3: implement policy_improvement")
    ########################################################################


# After completing the tasks above, run policy iteration below.
# You can reduce max_iterations while debugging.
max_iterations = 20

for it in range(1, max_iterations + 1):
    display.clear_output(wait=True)
    print(f"Policy Iteration outer loop: {it}")

    policy_evaluation(V, policy, theta=1e-2)
    stable = policy_improvement(V, policy)

    if stable:
        print("Policy converged.")
        break

plot_policy(policy, title="Optimal policy (cars moved A->B)")
plot_values(V, title="Estimated optimal state values")


## Optional checks

- Verify your policy heatmap has structured regions (not random noise).
- Try changing `theta` and compare runtime vs final policy changes.
- Try removing extension rules (free transfer / parking penalty) and compare policies.


## References

- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.), Chapter 4.
