<a href="https://colab.research.google.com/github/LarrySnyder/RLforInventory/blob/main/notebooks/Part_4b_Beer_Game_DQN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DQN for the Beer Game

---
> **Note:** This file is read-only. To work with it, you first need to save a copy to your Google Drive:
> 
> 1. Go to the File menu. (The File menu inside the notebook, right below the filename—not the File menu in your browser, at the top of your screen.)
> 2. Choose Save a copy in Drive. (Log in to your Google account, if necessary.) Feel free to move it to a different folder in your Drive, if you want.
> 3. Colab should open up a new browser tab with your copy of the notebook. 
> 4. Close the original read-only notebook in your browser.
---

---
> This notebook is part of the *Summer Bootcamp at Kellogg: RL in Operations* workshop at Northwestern University, August 2022. The notebooks are for Day 4, taught by Prof. Larry Snyder, Lehigh University.
---

Following Oroojlooyjadid, et al. (2022), we'll consider the following 4-node series system:

![beer game system](https://raw.githubusercontent.com/LarrySnyder/RLforInventory/main/images/beer-game-schematic.png)

The total systemwide cost over one play of the game is given by

$$\sum_{t=1}^T \sum_{i=1}^4 \left[ h^i(IL_t^i)^+ + p^i(-IL_t^i)^+ \right]$$

where $h^i$ and $p^i$ are the holding and stockout costs at node $i$, $IL_t^i$ is the inventory level at node $i$ at the end of period $t$, $T$ is the number of periods in one play of the game, and $z^+ \equiv \max\{0,z\}$.
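
To make the formula concrete, here is a minimal sketch that evaluates it for a given trajectory of inventory levels. The function name `systemwide_cost`, the two-period trajectory, and the cost values are illustrative assumptions, not part of the notebook's code.

```python
def systemwide_cost(inventory_levels, holding_costs, stockout_costs):
    """Sum h^i * (IL)^+ + p^i * (-IL)^+ over all periods t and nodes i.

    inventory_levels[t][i] is the end-of-period inventory level at node i.
    """
    total = 0.0
    for period_levels in inventory_levels:
        for il, h, p in zip(period_levels, holding_costs, stockout_costs):
            total += h * max(0, il) + p * max(0, -il)
    return total


# Made-up two-period trajectory for nodes 1..4 (assumption, for illustration).
example_levels = [
    [3, 0, 1, 2],    # period 1: only holding costs are incurred
    [-2, 1, 0, -1],  # period 2: nodes 1 and 4 are backordered
]
example_cost = systemwide_cost(example_levels, [2, 2, 2, 2], [2, 0, 0, 0])
print(example_cost)
```

Note that a negative inventory level at a node with $p^i = 0$ (node 4 in period 2 here) contributes nothing to the cost.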

The inventory levels $IL_t^i$ are complicated random functions of the decision variables (i.e., the ordering policies), so this cost is difficult to formulate, let alone to optimize. Under certain assumptions (e.g., no fixed costs, stationary demands, etc.), and if there is a centralized decision maker who can make all of the ordering decisions, then a base-stock policy is optimal (Clark and Scarf 1960), and the optimal base-stock levels can be found relatively easily by optimizing a sequence of single-variable, convex problems (Chen and Zheng 1994).

However, in the beer game, there is no centralized decision maker: Each node is controlled by a different player, each of whom makes independent decisions about their ordering policies. Moreover, each player only knows the values of the state variables at their own node, not at the other nodes. The goal of our RL agent is to **choose order quantities at a single node to minimize the total systemwide cost under incomplete information.**

Chess, Go, Atari, and other games that have successfully been solved by deep RL algorithms tend to have the following characteristics:

* Competitive
* Zero-sum
* Full information
* Instant reward signal (in some cases)

But the beer game differs along all of these dimensions:

* It is cooperative (the 4 players try to minimize their total cost)
* It is not zero-sum (when one player succeeds, the whole team succeeds)
* Players have only partial information (they have state information about only their own node)
* The reward signal is delayed until the end of the game (since costs at other nodes are unknown during the game)

Oroojlooyjadid, et al. (2022) propose a DQN-based algorithm they call the *shaped-reward DQN* (SRDQN). The SRDQN algorithm deals with the partial information by restricting the state variables that are available to the agent when making decisions. It deals with the delayed reward signal using **reward shaping,** which updates the reward information retroactively after the game ends. (We won't consider this in our simplified algorithm in this notebook, though.)

### Preliminary Python Stuff

First we'll install the Python packages we need that are not pre-installed in Colab. The `pip install` commands below worked for me; I hope they work for you. I recommend not modifying the version numbers in the commands. Once you start tinkering with the dependencies, things can get messy. (Take my word for it.) 

In [None]:
!pip install tensorflow==2.8.2
!pip install gym==0.23
!pip install keras==2.8.0
!pip install keras-rl2

In [None]:
!pip install stockpyl

Next, we'll import the packages we need.

In [None]:
import numpy as np
from gym import Env
from gym.spaces import Discrete
import random
import matplotlib.pyplot as plt

In [None]:
from stockpyl import sim
from stockpyl.supply_chain_network import serial_system

### Beer Game Environment

#### States 

Assume that the RL agent is the decision maker at node $i\in \{1,\ldots,4\}$. (For example, if the RL agent is playing the role of the wholesaler, then $i=2$.)
As in Oroojlooyjadid, et al., we assume that the **state space** has 4 components:

* $IL_t^i$, the inventory level at node $i$ in period $t$
* $OO_t^i$, the on-order quantity at node $i$ in period $t$
* $AO_t^i$, the arriving order (i.e., the demand received from the downstream neighbor) at node $i$ in period $t$
* $AS_t^i$, the arriving shipment (i.e., the units received from the upstream neighbor) at node $i$ in period $t$

In fact, the SRDQN algorithm assumes that we store the history of these state variables for the most recent 5 or 10 periods, but we will only use 1 period's worth of information in the algorithm below.

It's natural to store the state as a tuple $(IL, OO, AO, AS)$, but it can be tricky to handle a tuple-based state in TensorFlow. Therefore, we convert the state tuple to a unique integer. The state space is therefore of type `Discrete` (using `gym` space classes). We'll never use the state integer directly; we'll convert the state tuple to an integer for storage and indexing, and convert back to a tuple when we need to know the individual state components. The `tuple_to_int()` and `int_to_tuple()` methods in the `BeerGameEnv` class do these conversions.
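
The conversion is a mixed-radix encoding: each component is shifted so it starts at 0 and then treated as one "digit" whose base is the number of values that component can take. The sketch below shows the idea with standalone functions and small, made-up bounds (`toy_min_state` and `toy_state_size` are assumptions for illustration; the class methods read these values from `self`).

```python
def tuple_to_int(state, min_state, state_size):
    """Encode a state tuple as a single integer (mixed-radix, Horner's scheme)."""
    code = 0
    for value, lo, size in zip(state, min_state, state_size):
        code = code * size + (value - lo)
    return code


def int_to_tuple(code, min_state, state_size):
    """Decode an integer produced by tuple_to_int() back into a state tuple."""
    digits = []
    for size in reversed(state_size):
        digits.append(code % size)
        code //= size
    digits.reverse()
    return tuple(d + lo for d, lo in zip(digits, min_state))


# Toy bounds (assumption): IL in {-1, 0, 1}, OO in {0, 1, 2}, AO and AS in {0, 1}.
toy_min_state = (-1, 0, 0, 0)
toy_state_size = (3, 3, 2, 2)

encoded = tuple_to_int((0, 0, 0, 0), toy_min_state, toy_state_size)
decoded = int_to_tuple(encoded, toy_min_state, toy_state_size)
print(encoded, decoded)
```

Every tuple within the bounds maps to a distinct integer in $\{0, \ldots, 35\}$ here, and decoding recovers the original tuple.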

The constants below provide the indices of the state-space components, so we don't have to remember them.

In [None]:
# Shortcuts to indices of the various states in the state space tuple.
kIL = 0
kOO = 1
kAO = 2
kAS = 3

#### Actions

Actions represent order quantities. In theory, any nonnegative order quantity is allowed. However, to keep the action space manageable, we will require that the order quantity differs from the most recent demand by at most a fixed number (e.g., 5). In other words, if $AO$ is the most recent demand (arriving order), the order quantity is $AO+a$, where $a$ is constrained to be in some set such as $\{-5,\ldots,5\}$. (This is sometimes called a "$d+x$" rule.)

$a$ is the action, and can be different in different time periods.
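
As a small illustration of the $d+x$ rule, the sketch below maps an arriving order and an action to an order quantity. The helper name `order_quantity` is hypothetical, and truncating the result at zero is an assumption made here so that the order quantity is never negative.

```python
def order_quantity(arriving_order, action, min_action=-5, max_action=5):
    """Order quantity under a d+x rule: most recent demand plus the action."""
    if not min_action <= action <= max_action:
        raise ValueError(f"action {action} is outside [{min_action}, {max_action}]")
    # Assumption: a negative order makes no sense, so truncate at zero.
    return max(0, arriving_order + action)


print(order_quantity(3, 2))   # order 2 more than the demand just observed
print(order_quantity(3, -5))  # the offset exceeds the demand, so order nothing
```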

Next is the `BeerGameEnv` environment class. The code is missing some pieces. Your job is to fill in the missing pieces.

---
> **Note:** In the code below, the portions that you need to complete are marked with
> 
> ```python
> # #################
> # TODO:
> ```
> 
> In place of the missing code is a line that says 
> 
> ```python
> 	raise NotImplementedError
> ```
> 
> This is a way of telling Python to raise an exception (error) because there's something missing here. You should **delete (or comment out) this line** after you write your code.

---

In [None]:
class BeerGameEnv(Env):
    """Beer game problem environment. A state represents a tuple (IL, OO, AO, AS),
    where:
    
        * IL = inventory level at the agent at the end of the time period
        * OO = on-order quantity at the end of the time period (items the agent 
            has ordered but not yet received)
        * AO = arriving order, i.e., demand during the time period
        * AS = arriving shipment, i.e., units received during the time period

    However, this tuple is converted to an int via tuple_to_int() so that the
    observation space is a 1-dimensional array.
    
    Actions represent differences from the demand observed in the time period.
    That is, if the action is a, then the order quantity is AO + a. 
    a is restricted to be in a certain range, e.g., {-2, -1, 0, 1, 2}.
    (This is sometimes called a "d+x" rule.)

    Parameters
    ----------
    network : SupplyChainNetwork
        The network to simulate.
    episode_length : int
        The number of periods in one episode.
    agent_node_index : int
        Index of the node that the RL agent will play (e.g., 2 = wholesaler).
    min_state : tuple
        The minimum value of each state to consider: IL, OO, AO, AS.
    max_state : tuple
        The maximum value of each state to consider: IL, OO, AO, AS.
    min_action : int
        The minimum allowable action.
    max_action : int
        The maximum allowable action.
    """

    def __init__(self, network, episode_length: int, agent_node_index: int,
                 min_state: tuple, max_state: tuple, min_action: int, max_action: int):

        # Store problem data.
        self.network = network
        self.episode_length = episode_length
        self.agent_node_index = agent_node_index
        self.min_state = min_state
        self.max_state = max_state
        self.min_action = min_action
        self.max_action = max_action

        # #################
        # Set self.action_space to a gym Discrete space with elements
        # min_action, min_action + 1, ..., max_action.
        # (Hint: remember that you can use the `start` parameter; see
        # `MPNVEnv.__init__()` in the "MPNV DQN" notebook.)
        # Also set self.action_space_list to a list with the same elements.
        raise NotImplementedError

        # Determine the sizes of each component of the state space, and the
        # total number of states in integer form.
        self.state_size = [max_state[i] - min_state[i] + 1 for i in range(4)]
        self.num_int_states = self.tuple_to_int(tuple(max_state[i] for i in range(4))) + 1
        # Set the observation space as a Discrete space, as well as a list version.
        self.observation_space = Discrete(self.num_int_states)
        self.observation_space_list = list(range(self.num_int_states))

        # #################
        # Set self.initial_state assuming all components start at 0. That is,
        # use tuple_to_int() to set it to the integer version of the tuple (0, 0, 0, 0).
        raise NotImplementedError

        # Initialize current state info.
        self.state = None

        # Get shortcuts to the RL agent node (as a SupplyChainNode object)
        # and its predecessor and successor node indices.
        self.agent_node = network.get_node_from_index(self.agent_node_index)
        self.predecessor_index = self.agent_node.predecessor_indices(include_external=True)[0]
        self.successor_index = self.agent_node.successor_indices(include_external=True)[0]

    def tuple_to_int(self, the_tuple: tuple):
        """Convert a tuple (n_0, ..., n_{m-1}) to a unique integer, where element
        n_i can take one of self.state_size[i] values beginning at self.min_state[i]; that is, 
        n_i can be in {self.min_state[i], self.min_state[i] + 1, ..., self.min_state[i] + self.state_size[i] - 1}.
        """
        # Get length of tuple/lists.
        m = len(self.state_size)
        # Convert tuple to a tuple in which each element starts at 0.
        new_tuple = tuple(the_tuple[i] - self.min_state[i] for i in range(m))
        # Convert new_tuple to int.
        the_int = 0
        for i in range(m):
            the_int += int(np.prod([self.state_size[j] for j in range(i + 1, m)]) * new_tuple[i])
        return the_int

    def int_to_tuple(self, the_int: int):
        """Convert an integer to a unique tuple (n_0, ..., n_{m-1}), where element
        n_i can take one of self.state_size[i] values beginning at self.min_state[i]; that is, 
        n_i can be in {self.min_state[i], self.min_state[i] + 1, ..., self.min_state[i] + self.state_size[i] - 1}.
        """
        # Get length of tuple/lists.
        m = len(self.state_size)
        # Convert int to a tuple assuming each element starts at 0.
        the_list = []
        for i in range(m):
            base = int(np.prod([self.state_size[j] for j in range(i + 1, m)]))
            the_list.append(the_int // base)
            the_int = the_int % base
        # Convert list to new list accounting for min values.
        new_list = [the_list[i] + self.min_state[i] for i in range(m)]
        return tuple(new_list)
        
    def reset(self):
        """Reset the environment and the simulation. Set self.state to the
        initial state (self.initial_state) and return it."""

        # #################
        # Reset the environment, following the same steps as in the reset()
        # method of the `MPNVEnv` class in the "MPNV DQN" notebook.
        raise NotImplementedError

        return self.state

    def step(self, action):
        """Run one time step of the environment by taking the specified action.
        Update the environment state to the new state. 
        Return a tuple (new_state, reward, done, info)."""

        # #################
        # Convert self.state to a tuple.
        raise NotImplementedError

        # #################
        # Determine the order quantity.
        # Note: remember that the order quantity equals the most recent AO
        # (which is already stored in the state) plus the action.
        # Also: make sure to clip the order quantity so that it does not bring
        # the IL above its max value.
        raise NotImplementedError

        # #################
        # Build dict specifying order quantity to use in this time period.
        # (This will override the order quantities that the stockpyl simulation
        # would choose on its own.)
        # Note: the dict should contain only one entry, for the RL agent's
        # node; the other nodes are not included because we are not overriding
        # their order quantities.
        raise NotImplementedError

        # #################
        # Simulate one time period.
        raise NotImplementedError

        # Determine reward by querying the simulation's state variables.
        # NOTE: reward includes ALL nodes even though the agent only knows
        # its own information. This is a simplification of the assumptions in
        # Oroojlooyjadid et al. (2022).
        reward = -np.sum([n.state_vars_current.total_cost_incurred for \
                          n in self.network.nodes])

        # #################
        # If episode length has been reached, terminate.
        raise NotImplementedError

        # Get new state variables from simulation. (Round to int -- should 
        # already be integer but sometimes there are small rounding errors.)
        # Clip states to state-space bounds.
        IL = int(np.clip(self.agent_node.state_vars_current.inventory_level, \
                self.min_state[kIL], self.max_state[kIL]))
        OO = int(np.clip(self.agent_node.state_vars_current.on_order, \
                self.min_state[kOO], self.max_state[kOO]))
        AO = int(np.clip(self.agent_node.state_vars_current.inbound_order[self.successor_index], \
                self.min_state[kAO], self.max_state[kAO]))
        AS = int(np.clip(self.agent_node.state_vars_current.inbound_shipment[self.predecessor_index], \
                self.min_state[kAS], self.max_state[kAS]))

        # #################
        # Update the state: first determine the new state tuple, then convert
        # it to an integer and store it in self.state.
        raise NotImplementedError

        # Fill the demand into the info dict. (This repeats what's already in AO.)
        info = {'demand': self.agent_node.state_vars_current.inbound_order[self.successor_index]}

        return self.state, reward, done, info

    def render(self):
        """This function can contain code for drawing the environment to
        a graphics window, or printing it in ASCII format to the terminal.
        But we'll just do something very simple and print the state.
        (Feel free to add some nicer visualization code here if you want!)"""
        print(self.state)

    def play_episode(self, policy, messages=False):
        """Play one episode of the environment following the specified policy. 
        Return the total discounted reward over the episode.

        `policy` is a dict in which keys are states and values are actions.
        If `messages` is True, will print state and action in each time step.
        """
        
        # #################
        # Write this function, using the analogous function in `MPNVEnv` as a template.
        raise NotImplementedError


### Beer Game Instance

We'll use the following beer game instance. (This is similar to the "simple instance" in §4.1 of Oroojlooyjadid, et al. (2022).) The vectors below give the values for stages $1, ..., 4$, respectively. (Node 4 is upstream, node 1 is downstream.)

* $h = [2, 2, 2, 2]$ 
* $p = [2, 0, 0, 0]$
* $l^{tr} = [2, 2, 2, 2]$ (shipment lead time)
* $l^{in} = [2, 2, 2, 2]$ (order lead time)
* $D \sim \text{Poisson}(1)$ (demand drawn from a Poisson distribution with mean 1)
* Coplayers use base-stock policies with base-stock level 2

We'll restrict the spaces as follows:

* Action space: ${\mathcal A} = \{-2, -1, 0, 1, 2\}$ (remember that the order quantity equals the action plus the observed demand)
* State space: 
    * ${\mathcal S}_{IL} = \{-4, -3, ..., 4\}$
    * ${\mathcal S}_{OO} = \{0, 1, ..., 8\}$
    * ${\mathcal S}_{AO} = \{0, 1, ..., 8\}$
    * ${\mathcal S}_{AS} = \{0, 1, ..., 8\}$

And we'll use episodes of length 100.
    


In [None]:
# Build the network as a SupplyChainNetwork object.
network = serial_system(
    num_nodes=4,
    node_order_in_system=[4, 3, 2, 1],  # in the network, nodes go 4 > 3 > 2 > 1
    node_order_in_lists=[1, 2, 3, 4],   # in the lists below, nodes go 1 > 2 > 3 > 4
    local_holding_cost=[2, 2, 2, 2],
    stockout_cost=[2, 0, 0, 0],
    shipment_lead_time=[2, 2, 2, 2],
    order_lead_time=[2, 2, 2, 2],
    demand_type='P', 
    mean=1,                         
    policy_type='BS',                   
    base_stock_level=2             
)

In [None]:
min_state = (-4, 0, 0, 0)
max_state = (4, 8, 8, 8)
min_action = -2
max_action = 2
episode_length = 100
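
To get a feel for the size of the resulting integer state space, here is a quick sketch of the count, mirroring the `state_size` computation in `BeerGameEnv.__init__()`. The variables below are local stand-ins written for illustration, not the class attributes themselves.

```python
# Bounds copied from the cell above, in the order (IL, OO, AO, AS).
min_state = (-4, 0, 0, 0)
max_state = (4, 8, 8, 8)

# Number of distinct values each component can take.
state_size = [hi - lo + 1 for lo, hi in zip(min_state, max_state)]

# Total number of integer-encoded states: the product of the component sizes.
num_int_states = 1
for size in state_size:
    num_int_states *= size

print(state_size, num_int_states)
```

With these bounds, each observation the agent sees is a single integer between 0 and `num_int_states - 1`.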

Finally, let's build our `BeerGameEnv` environment.

Remember: This is now a well-defined `gym` environment. It's possible to "register" a custom environment to take advantage of the full `gym` API, but we won't need to do that here.

In [None]:
# Build BeerGameEnv object.
env = BeerGameEnv(
    network=network,
    episode_length=episode_length,
    agent_node_index=2, # wholesaler
    min_state=min_state,
    max_state=max_state,
    min_action=min_action,
    max_action=max_action
)

Let's give our new environment a quick spin. First, we'll create a base-stock policy with a base-stock level of 2 at every node. Then we'll ask our environment to play one episode of the beer game. In each time period, it will print the starting state, the action (order quantity), the demand, the new state, and the reward.

In [None]:
base_stock_policy = {}
for s in env.observation_space_list:
    state_tuple = env.int_to_tuple(s)
    base_stock_policy[s] = max(0, 2 - state_tuple[kIL])

env.play_episode(base_stock_policy, messages=True)
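
Because `BeerGameEnv` actions are offsets $a$ from the arriving order rather than order quantities, a base-stock rule has to be translated into that form before it can be used as an action. Below is a minimal sketch of one way to do this, under assumptions that are mine rather than the notebook's: the order-up-to decision is based on the inventory position $IL + OO$, and the resulting offset is clipped to the allowable action range. `base_stock_action` is a hypothetical helper, not part of the environment.

```python
def base_stock_action(il, oo, ao, base_stock_level=2, min_action=-2, max_action=2):
    """Translate a base-stock order into a d+x action (offset from AO)."""
    # Assumption: order up to the base-stock level using inventory position.
    inventory_position = il + oo
    target_order = max(0, base_stock_level - inventory_position)
    # The environment orders AO + a, so the offset is the target minus AO.
    action = target_order - ao
    # Clip to the allowable action range.
    return max(min_action, min(max_action, action))


print(base_stock_action(il=0, oo=0, ao=1))   # wants to order 2, demand was 1
print(base_stock_action(il=-4, oo=0, ao=0))  # wants to order 6, offset is clipped
print(base_stock_action(il=2, oo=0, ao=4))   # wants to order 0, offset is clipped
```

When the clipping binds, the agent cannot reproduce the base-stock order exactly, which is a side effect of restricting the action space.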

### Setting up TensorFlow

Next we'll set up our model in TensorFlow. First, some imports:

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [None]:
from rl.agents import DQNAgent
from rl.policy import EpsGreedyQPolicy 
from rl.memory import SequentialMemory

Then a helper function to build the TF **model:**

In [None]:
def build_model(num_states, num_actions):
    model = Sequential()    
    model.add(Dense(24, activation='relu', input_shape=(1,))) 
    model.add(Dense(24, activation='relu'))
    model.add(Dense(num_actions, activation='linear'))
    return model

Now we'll build the model itself:

In [None]:
# Get shortcut to size of observation and action spaces.
num_states = env.observation_space.n
#num_states = np.sum([sp.n for sp in env.observation_space.spaces])
num_actions = env.action_space.n
# Build the model.
# NOTE: This must happen *after* the `from rl.x` imports.
# (See https://stackoverflow.com/a/72438856/3453768)
model = build_model(num_states, num_actions)

Print a summary of the model:

In [None]:
model.summary()

Next, we need an RL **agent.** We'll use the `DQNAgent` class from the `keras-rl2` package (imported as `rl`), which builds on Keras and TensorFlow.

Our agent also needs a **policy.** We'll use the `EpsGreedyQPolicy`, again from `keras-rl2`. (Feel free to play around with different policies. You'll have to `import` them like we did for `EpsGreedyQPolicy` above. I haven't been able to find good documentation for these policies, but you can find different policies to try by looking at the [source code](https://github.com/keras-rl/keras-rl/blob/master/rl/policy.py).)

In [None]:
def build_agent(model, actions):
    policy = EpsGreedyQPolicy(eps=0.1) 
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, 
                  nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
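
For intuition about what an epsilon-greedy policy does with `eps=0.1`, here is a minimal pure-Python sketch of the general technique. It is not the library's source code; `eps_greedy_action` and the example Q-values are assumptions for illustration.

```python
import random


def eps_greedy_action(q_values, eps, rng=random):
    """Pick a random action index with probability eps, else the greedy one."""
    if rng.random() < eps:
        # Explore: choose uniformly among all action indices.
        return rng.randrange(len(q_values))
    # Exploit: choose the index with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


example_q_values = [0.1, 0.9, 0.3]  # made-up Q-values for three actions
greedy_choice = eps_greedy_action(example_q_values, eps=0.0)
print(greedy_choice)
```

With `eps=0.1`, roughly one decision in ten is exploratory, and the rest follow the current Q-value estimates.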

Next, build the DQN agent, store it in a variable called `dqn`, and "compile" it (a preprocessing step).

In [None]:
dqn = build_agent(model, num_actions)
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

### Training the DQN Agent

Now we're finally ready for the main step: training the DQN agent. The command below trains it for 60,000 steps (600 episodes of 100 periods each), which should take about 10 minutes and produce medium-good results. Feel free to change this number to do more or less training.

In [None]:
dqn.fit(env, nb_steps=60000, visualize=False, verbose=1)

Most likely, you'll see the `episode_reward` get gradually better as the training progresses (though not necessarily monotonically so).

### Exploring the Results

The DQN agent has a feature to test the learned policy by playing multiple episodes and printing the results. Let's play 50 of them.

In [None]:
results = dqn.test(env, nb_episodes=50, visualize=False)
print(f"Average reward per episode = {np.mean(results.history['episode_reward'])}")

My DQN resulted in an average reward per episode of $-9405.72$. (Your mileage may vary.) Since this is an undiscounted episode with 100 periods, the average cost per period is $94.06$.

Using a base-stock policy with a base-stock level of 2 at each node is a reasonable benchmark. `network` is already set up like this, so we can just simulate it.

In [None]:
avg_cost_per_period, _ = sim.run_multiple_trials(network, num_trials=50, num_periods=episode_length)
avg_cost_per_period

The average cost per period from my simulation is $26.25$. The DQN's cost of $94.06$ per period is roughly 3.6 times that, so it is not yet competitive with the base-stock policy, but it's at least in the same ballpark, confirming that we are on the right track. More intensive training should improve the results.
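
The comparison can be reproduced with a few lines of arithmetic. The sketch below hard-codes the figures reported above from my runs (an episode reward of about $-9405.72$ and a base-stock cost of $26.25$ per period); substitute your own results, since they will differ from run to run.

```python
# Figures reported in the text (from one particular run; yours will differ).
avg_episode_reward = -9405.72
episode_length = 100
base_stock_cost_per_period = 26.25

# Reward is the negative of cost, and the episode is undiscounted,
# so dividing by the episode length gives the average cost per period.
dqn_cost_per_period = -avg_episode_reward / episode_length

# How many times more expensive the DQN policy is than the benchmark.
cost_ratio = dqn_cost_per_period / base_stock_cost_per_period

print(round(dqn_cost_per_period, 2), round(cost_ratio, 1))
```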

### If You Have Extra Time

Try to improve the results using different hyperparameters, training agents, etc.

Or, try using DQN to optimize supply chain networks other than the beer game system.