<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/Ch3%20MPD/Ch3_Jack's_car_rental_problem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Jack's Car Rental problem

Sutton & Barto - Example 4.2

Jack manages two locations for a nationwide car rental company.  Each day, some number of customers arrive at each location to rent cars.


If Jack has a car available, he rents it out and is credited \$10 by the national company.  
If he is out of cars at that location, then the business is lost. 

**Cars become available for renting the day after they are returned.**  
To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of \$2 per car moved. 

We assume that the number of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number is $n$ is $\frac{\lambda^n}{n!}e^{-\lambda}$ where $\lambda$ is the expected number. 

Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. 

To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars
are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. 

We take the discount rate to be $\gamma = 0.9$ and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations
overnight. 

Figure 4.2 shows the sequence of policies found by policy iteration starting from the policy that never moves any cars.
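
For reference, the two steps that policy iteration alternates between can be written as below. The notation ($v_k$, $v_\pi$, $\pi'$, $p(s', r \mid s, a)$) follows the usual Sutton & Barto convention and is an assumption here, since this notebook does not define these symbols itself. Policy evaluation repeatedly applies

\begin{equation}
v_{k+1}(s) = \sum_{s', r} p(s', r \mid s, \pi(s)) \left[ r + \gamma v_k(s') \right]
\end{equation}

until the values stop changing, and policy improvement then picks the greedy action in every state:

\begin{equation}
\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
\end{equation}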


#### Create the gym environment

In [0]:
import sys
from io import StringIO  # needed by CarRentalBranch.render when mode == 'ansi'

import gym
from gym.envs.toy_text.discrete import DiscreteEnv, categorical_sample

In [0]:
# Define some key parameters
RENTAL_INCOME = 10
COST_PER_MOVE = 2
DISCOUNT_RATE = 0.9

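# Note: the problem statement allows up to 20 cars per location;
# the run in this notebook caps each branch at 15.
MAX_A_CARS = 15
MAX_B_CARS = 15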

MAX_CAR_MOVE_BETWEEN = 5

EXPECTED_RENTAL_REQUEST_A = 3
EXPECTED_RENTAL_REQUEST_B = 4

EXPECTED_RENTAL_RETURN_A = 3
EXPECTED_RENTAL_RETURN_B = 2

#### Define the problem in MDP context

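- State: The number of cars at each location at the end of the day. With up to 20 cars per branch this is a 21 x 21 grid (0 cars is a valid count); the code uses `MAX_A_CARS` and `MAX_B_CARS`, which give a 16 x 16 grid
- Action: The net number of cars moved between the two locations overnight
    - +ve represents moving from B to A (branch A gains cars), matching the sign used in the code
    - -ve represents moving from A to B
    - Action is an array of [-MAX_CAR_MOVE_BETWEEN, ..., MAX_CAR_MOVE_BETWEEN]
- Reward: \$10 for each car rented, minus \$2 for each car moved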
- Transition Probability matrix

##### Visualising the state table
A | **0** | **1** | **2** | **3** | **4** | **5** | **6** | **7** | **8** | **9** | **10** | **11** | **12** | **13** | **14** | **15** | **16** | **17** | **18** | **19** | **20**
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20
1|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41
2|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62
3|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83
4|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|101|102|103|104
5|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125
6|126|127|128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146
7|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167
8|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188
9|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209
10|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230
11|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251
12|252|253|254|255|256|257|258|259|260|261|262|263|264|265|266|267|268|269|270|271|272
13|273|274|275|276|277|278|279|280|281|282|283|284|285|286|287|288|289|290|291|292|293
14|294|295|296|297|298|299|300|301|302|303|304|305|306|307|308|309|310|311|312|313|314
15|315|316|317|318|319|320|321|322|323|324|325|326|327|328|329|330|331|332|333|334|335
16|336|337|338|339|340|341|342|343|344|345|346|347|348|349|350|351|352|353|354|355|356
17|357|358|359|360|361|362|363|364|365|366|367|368|369|370|371|372|373|374|375|376|377
18|378|379|380|381|382|383|384|385|386|387|388|389|390|391|392|393|394|395|396|397|398
19|399|400|401|402|403|404|405|406|407|408|409|410|411|412|413|414|415|416|417|418|419
20|420|421|422|423|424|425|426|427|428|429|430|431|432|433|434|435|436|437|438|439|440
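
The table above is a row-major numbering of the grid. A minimal sketch of converting between a flat state index and a (row, column) cell, assuming the 21 x 21 grid shown in the table; the helper names `to_index` and `to_cell` are illustrative and not part of the notebook:

```python
import numpy as np

GRID_SHAPE = (21, 21)  # assumption: 0..20 cars along each axis, as in the table


def to_index(row, col, shape=GRID_SHAPE):
    """Flat state index of a (row, col) cell in row-major order."""
    return int(np.ravel_multi_index((row, col), shape))


def to_cell(index, shape=GRID_SHAPE):
    """Inverse of to_index: the (row, col) cell of a flat state index."""
    row, col = np.unravel_index(index, shape)
    return int(row), int(col)


print(to_index(0, 19))  # state in the first row, column 19
print(to_cell(440))     # bottom-right cell of the grid
```

The same mapping works for a smaller grid by passing a different `shape`, for example `(16, 16)` when each branch is capped at 15 cars.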

Treating the problem as a continuing MDP, we can also build a transition dynamics table as follows:  

s | a | s' | r | p(s', r \| s, a) | Remarks
--- | --- | --- | --- | --- | ---
19 | 0 | 17 | +20 | poisson(2, 3) | Branch A rents out 2 cars, no cars moved

Note:  
Recall that the Poisson distribution models the number of discrete events or occurrences over a specified interval.

A Poisson random variable $x$ is the number of events in a given unit of time; it can take any non-negative integer value.

An event can occur 0, 1, 2, ... times in an interval. The average number of events in an interval is designated $\lambda$, the event rate (also called the rate parameter). The probability of observing $k$ events in an interval is given by the equation

\begin{equation}
\mathop{\mathbb{P}}(\text{k events in interval}) = \frac{\lambda^k}{k!}e^{-\lambda}
\end{equation}

where
- $\lambda$ is the average number of events per interval, i.e. the expected number of events in an interval
- $k$ takes values 0, 1, 2, ...

In [0]:
import math
import numpy as np

In [0]:
# Create a probability backup table
prob_backup = {}

In [0]:
def prob_from_poisson_dist(k, lam):
  """
  Probability of observing exactly k events under a Poisson distribution with rate lam
  """
  return (lam ** k / math.factorial(k)) * np.exp(-lam)

In [0]:
prob_from_poisson_dist(3, 3)

0.22404180765538775
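
The transition table built later in this notebook enumerates demands and returns only over `range(10)`, which truncates each Poisson distribution. A quick sketch of how much probability mass that cut-off leaves out, assuming the rates given in the problem statement; `poisson_pmf` and `tail_mass` are illustrative helpers, not part of the notebook:

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)


def tail_mass(lam, cutoff=10):
    """Probability that X >= cutoff, i.e. the mass dropped by range(cutoff)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(cutoff))


for lam in (2, 3, 4):
    print("lambda=%s: dropped mass %.4f" % (lam, tail_mass(lam)))
```

The dropped mass is small but non-zero, so the probabilities stored for a given state and action do not sum exactly to 1 unless the last bucket absorbs the tail or the values are renormalised.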

##### Define the MDP dynamics

In [0]:
# def check_valid(current_a, current_b, demand_a, demand_b):
#   """
#   Check if the transaction is valid.
#   Rule: Demand a <= current a; Demand b <= current b
#   Rule: current_a <= 20; current_b <= 20
#   """
  
#   if current_a > MAX_A_CARS:
#     return False
  
#   if current_b > MAX_B_CARS:
#     return False
  
#   return True

In [0]:
# print(check_valid(20, 20, 5, 5))
# print(check_valid(10, 10, 5, 5))
# print(check_valid(25, 10, 5, 5))
# print(check_valid(5, 10, 10, 5))

False
False
False
False


In [0]:
# P[s][a] == [(probability, nextstate, reward, done), ...]
P = {}
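
To make the `P[s][a]` layout concrete, here is a minimal sketch of how such a table can be consumed to compute a one-step expected return, which is the quantity policy evaluation backs up. The toy table and the helper `expected_backup` are made up for illustration and are not the car rental dynamics:

```python
# Toy two-state, one-action table in the P[s][a] layout (values are made up)
toy_P = {
    0: {0: [(1.0, 0, 0.0, True)]},  # absorbing state
    1: {0: [(0.5, 0, 10.0, True), (0.5, 1, 20.0, False)]},
}


def expected_backup(P, s, a, values, gamma=0.9):
    """One-step expected return: sum over outcomes of p * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r, _done in P[s][a])


values = {0: 0.0, 1: 0.0}
print(expected_backup(toy_P, 1, 0, values))  # 0.5 * 10 + 0.5 * 20 = 15.0
```

Entries with extra trailing fields, like the 8-tuples used further down, would need the unpacking adjusted (for example by slicing each tuple to its first four elements).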

In [0]:
states = np.arange(21 * 21).reshape(21, 21)

In [0]:
states[20][20]

440

In [0]:
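def car_rental_control(num_cars_available, demand, action, capacity):
  """
  Args:
    num_cars_available: Number of cars at the branch before the overnight move
    demand: Number of cars requested
    action: Net number of cars moved into the branch (negative = moved out)
    capacity: Maximum capacity of the branch
    
  Return:
    (Cars rented, Cars left)
  """
  # Cars on hand once the overnight move has been applied
  on_hand = num_cars_available + action
  net = on_hand - demand
  
  if net > capacity:
    return (demand, capacity)
  elif net < 0:
    # Demand exceeds stock: only the cars actually on hand can be rented
    return (max(on_hand, 0), 0)
  else:
    return (demand, net)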

In [0]:
def car_return_control(num_cars_available, num_cars_return, capacity):
  """
  Args:
    num_cars_available: Number of cars available
    num_cars_return: Number of cars return
    capacity: Maximum capacity of the branch
  """
  
  net = num_cars_available + num_cars_return
  
  if net > capacity:
    return capacity
  else:
    return net

In [0]:
car_return_control(3, 1, MAX_A_CARS)

4

In [0]:
def valid_action(action, current_a, current_b):
  """
  Check that the move leaves both branches within [0, capacity].
  A positive action adds cars to A and removes them from B.
  """
  
  tmp_a = current_a + action
  tmp_b = current_b - action
  
  return (0 <= tmp_a <= MAX_A_CARS) and (0 <= tmp_b <= MAX_B_CARS)

In [0]:
def state_transition_prob(states, actions):
  """
  Args:
    states: The complete states of the environment
    actions: The complete actions list
    
  Return:
    P[s][a] = [(prob, next_state, reward, done, demand_a, demand_b, return_a, return_b), ...]
  """
  
  P = {}
  
  def is_done(s):
    if s == 0:
      return True
    else: 
      return False
  
  it = np.nditer(states, flags=['multi_index'])
  
  # Demand list
  demand_list_a = [prob_from_poisson_dist(d, EXPECTED_RENTAL_REQUEST_A) for d in range(10)]
  demand_list_b = [prob_from_poisson_dist(d, EXPECTED_RENTAL_REQUEST_B) for d in range(10)]
  
  # Return list
  return_list_a = [prob_from_poisson_dist(r, EXPECTED_RENTAL_RETURN_A) for r in range(10)]
  return_list_b = [prob_from_poisson_dist(r, EXPECTED_RENTAL_RETURN_B) for r in range(10)]
  
  while not it.finished:
    s = it.iterindex
    current_cars_b, current_cars_a = it.multi_index
    
    print("Building state %s -> (%s, %s)" % (s, current_cars_a, current_cars_b))
    
    # Initialise P
    P[s] = {a: [] for a in actions}
    
    if s == 0:
      for a in actions:
        P[s][a] = [(1.0, s, 0, True, 0, 0, 0, 0)]
    else:
      # Calculate demand
      for demand_a, prob_demand_a in enumerate(demand_list_a):
        for demand_b, prob_demand_b in enumerate(demand_list_b):
          for return_a, prob_return_a in enumerate(return_list_a):
            for return_b, prob_return_b in enumerate(return_list_b):

              # Rules here
              # 1. Returned cars only become available on the next day
              # 2. Invalid movement will be denied

              for a in actions:
                
                if not valid_action(a, current_cars_a, current_cars_b):
                  # If action is not valid, Set the probability to 0
                  p = 0  
                else:
                  p = prob_demand_a * prob_demand_b * prob_return_a * prob_return_b

                # Calculate the cars left after current_cars + a - demand
                cars_rented_a, cars_left_a = \
                  car_rental_control(current_cars_a, demand_a, a, MAX_A_CARS)
                cars_rented_b, cars_left_b = \
                  car_rental_control(current_cars_b, demand_b, -a, MAX_B_CARS)
                
                # Return comes last
                # cars left = original cars left + cars returned
                cars_left_a = car_return_control(cars_left_a, return_a, MAX_A_CARS)
                cars_left_b = car_return_control(cars_left_b, return_b, MAX_B_CARS)

                # Rows index branch B and columns index branch A,
                # matching the unpacking of it.multi_index above
                next_state = states[cars_left_b][cars_left_a]

                P[s][a].append((p,
                                next_state, 
                                RENTAL_INCOME * (cars_rented_a + cars_rented_b)
                                  - COST_PER_MOVE * abs(a),  # rental income minus moving cost
                                is_done(next_state),
                                demand_a,
                                demand_b,
                                return_a,
                                return_b))
    
    it.iternext()
    
  return P

#### Wrap it up as a gym environment

In [0]:
class CarRentalBranch(DiscreteEnv):
  """
  Car Rental Branch
  """
  
  def __init__(self, branch_a_cars=MAX_A_CARS, branch_b_cars=MAX_B_CARS):
       
    self.branch_a_cars = branch_a_cars
    self.branch_b_cars = branch_b_cars
    
    def is_done(s):
      if s == 0:
        return True
      else: 
        return False
    
    nS = (self.branch_a_cars + 1) * (self.branch_b_cars + 1)
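    # NOTE: DiscreteEnv builds Discrete(nA), so sampled actions are 0..MAX_CAR_MOVE_BETWEEN;
    # the negative keys of P are never sampled with this nA.
    nA = MAX_CAR_MOVE_BETWEEN + 1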
    
    # Arrange the nS state indices in a 2D grid
    states = np.arange(nS).reshape((MAX_A_CARS + 1,
                                  MAX_B_CARS + 1))
    actions = np.arange(-MAX_CAR_MOVE_BETWEEN, MAX_CAR_MOVE_BETWEEN + 1)
    
    P = state_transition_prob(states, actions)
    
    isd = np.zeros(nS)
    isd[-1] = 1
    
    super().__init__(nS, nA, P, isd)
    
  def render(self, mode="human", close=False):
    """
    Render the environment
    """
    
    if close:
      return
    
    outfile = StringIO() if mode == 'ansi' else sys.stdout
    
    states = np.arange(self.nS).reshape((MAX_A_CARS + 1,
                                  MAX_B_CARS + 1))
    
    cars_b, cars_a = np.where(states == self.s)
    
    output = "Cars a: %s ; " % cars_a[0]
    output += "Cars b: %s \n" % cars_b[0]
    
    outfile.write(output)
    
  def step(self, a):
      transitions = self.P[self.s][a]
      i = categorical_sample([t[0] for t in transitions], self.np_random)
      p, s, r, d, da, db, ra, rb = transitions[i]
      self.s = s
      self.lastaction = a
      return (s, r, d, {"prob" : p, "da": da, "db": db, "ra": ra, "rb": rb})
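
The `step` method draws one transition according to the probabilities stored in `P[s][a]`. A minimal sketch of such a draw using inverse-CDF sampling, which is one common way to implement it; this is an assumption about the approach rather than a copy of gym's code, and `sample_index` is an illustrative name:

```python
import numpy as np


def sample_index(probs, rng):
    """Draw an index i with probability probs[i] via the cumulative sum."""
    cumulative = np.cumsum(np.asarray(probs, dtype=float))
    # First position whose cumulative probability exceeds a uniform draw
    return int((cumulative > rng.random()).argmax())


rng = np.random.default_rng(0)
print(sample_index([0.0, 1.0, 0.0], rng))  # all mass on index 1
print(sample_index([0.0, 0.0, 0.0], rng))  # degenerate case: falls back to index 0
```

The degenerate case matters here: the table assigns probability 0 to every outcome of an invalid action, so a draw of this kind would always return the first stored transition for that action instead of rejecting it.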

In [0]:
env = CarRentalBranch()

Building state 0 -> (0, 0)
Building state 1 -> (1, 0)
Building state 2 -> (2, 0)
Building state 3 -> (3, 0)
Building state 4 -> (4, 0)
Building state 5 -> (5, 0)
Building state 6 -> (6, 0)
Building state 7 -> (7, 0)
Building state 8 -> (8, 0)
Building state 9 -> (9, 0)
Building state 10 -> (10, 0)
Building state 11 -> (11, 0)
Building state 12 -> (12, 0)
Building state 13 -> (13, 0)
Building state 14 -> (14, 0)
Building state 15 -> (15, 0)
Building state 16 -> (0, 1)
Building state 17 -> (1, 1)
Building state 18 -> (2, 1)
Building state 19 -> (3, 1)
Building state 20 -> (4, 1)
Building state 21 -> (5, 1)
Building state 22 -> (6, 1)
Building state 23 -> (7, 1)
Building state 24 -> (8, 1)
Building state 25 -> (9, 1)
Building state 26 -> (10, 1)
Building state 27 -> (11, 1)
Building state 28 -> (12, 1)
Building state 29 -> (13, 1)
Building state 30 -> (14, 1)
Building state 31 -> (15, 1)
Building state 32 -> (0, 2)
Building state 33 -> (1, 2)
Building state 34 -> (2, 2)
Building state 35 

In [0]:
env.P[0]

{-5: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 -4: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 -3: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 -2: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 -1: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 0: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 1: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 2: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 3: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 4: [(1.0, 0, 0, True, 0, 0, 0, 0)],
 5: [(1.0, 0, 0, True, 0, 0, 0, 0)]}

In [0]:
total_reward = 0

for t in range(100):
  print("Step %s: " %  t)
  env.render()
#         print(observation)
  action = env.action_space.sample()
  observation, reward, done, info = env.step(action)
  print(info)
  print("Reward -> %s" % reward)
  print("End of day Action -> A %s; B %s" % (action, -action))
  total_reward += reward
  if done:
      print("Episode finished after {} timesteps".format(t+1))
      print(total_reward)
      break
env.close()

Cars a: 13 ; Cars b: 10 
Cars a: 6 ; Cars b: 15 
Cars a: 13 ; Cars b: 8 
Cars a: 3 ; Cars b: 15 
Cars a: 15 ; Cars b: 5 
Cars a: 3 ; Cars b: 15 
Cars a: 11 ; Cars b: 8 
Cars a: 5 ; Cars b: 11 
Cars a: 7 ; Cars b: 11 
Cars a: 9 ; Cars b: 7 
Cars a: 6 ; Cars b: 14 
Cars a: 11 ; Cars b: 6 
Cars a: 6 ; Cars b: 12 
Cars a: 9 ; Cars b: 15 
Cars a: 8 ; Cars b: 10 
Cars a: 7 ; Cars b: 9 
Cars a: 8 ; Cars b: 12 
Cars a: 15 ; Cars b: 8 
Cars a: 5 ; Cars b: 15 
Cars a: 8 ; Cars b: 12 
Cars a: 14 ; Cars b: 12 
Cars a: 10 ; Cars b: 15 
Cars a: 9 ; Cars b: 15 
Cars a: 15 ; Cars b: 9 
Cars a: 7 ; Cars b: 15 
Cars a: 8 ; Cars b: 12 
Cars a: 9 ; Cars b: 4 
Cars a: 8 ; Cars b: 14 
Cars a: 15 ; Cars b: 3 
Cars a: 1 ; Cars b: 15 
Cars a: 12 ; Cars b: 2 
Cars a: 0 ; Cars b: 15 
Cars a: 15 ; Cars b: 2 
Cars a: 1 ; Cars b: 15 
Cars a: 15 ; Cars b: 4 
Cars a: 3 ; Cars b: 15 
Cars a: 8 ; Cars b: 9 
Cars a: 7 ; Cars b: 15 
Cars a: 14 ; Cars b: 10 
Cars a: 9 ; Cars b: 13 
Cars a: 5 ; Cars b: 11 
Cars a: 12 ; Car

KeyboardInterrupt: ignored