<a href="https://colab.research.google.com/github/LarrySnyder/RLforInventory/blob/main/notebooks/Part_1_NV_as_MAB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Newsvendor Problem as a Multi-Armed Bandit (MAB)

This notebook contains code for an MAB implementation of the newsvendor problem.

---
> **Note:** This file is read-only. To work with it, you first need to save a copy to your Google Drive:
> 
> 1. Go to the File menu. (The File menu inside the notebook, right below the filename—not the File menu in your browser, at the top of your screen.)
> 2. Choose Save a copy in Drive. (Log in to your Google account, if necessary.) Feel free to move it to a different folder in your Drive, if you want.
> 3. Colab should open up a new browser tab with your copy of the notebook. 
> 4. Close the original read-only notebook in your browser.
---



---
> This notebook is part of the *Summer Bootcamp at Kellogg: RL in Operations* workshop at Northwestern University, August 2022. The notebooks are for Day 4, taught by Prof. Larry Snyder, Lehigh University.
---



In the **newsvendor problem**, the goal is to choose an order quantity $Q$ to use in each time period, in order to minimize the expected cost per period, given by

$$g(Q) = {\mathbb E}\left[h(Q-D)^+ + p(D-Q)^+\right],$$

where $h$ is the **holding cost** (aka overage cost, the cost per unit left over at the end of the day), $p$ is the **stockout cost** (aka underage cost, the cost per unit of unmet demand), $D$ is the **demand** (a random variable), and $z^+ \equiv \max\{0, z\}$.

We'll assume the demand has a discrete probability distribution with pmf $f(d)$ and cdf $F(d)$, in which case

$$g(Q) = h\sum_{d=0}^Q (Q-d)f(d) + p\sum_{d=Q}^\infty (d-Q)f(d).$$

This is the objective function for the newsvendor problem, which we wish to minimize. Equivalently, we can maximize the expected reward function (to stay consistent with RL and MAB terminology), which is the negative of the cost function:

$$r(Q) = -g(Q).$$

The optimal order quantity $Q^*$, which minimizes the expected cost or maximizes the expected reward, is the smallest $Q$ such that

$$F(Q) \ge \frac{p}{p+h}.$$
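
As a quick sanity check on the formulas above, here is a minimal sketch that evaluates $g(Q)$ numerically for Poisson demand and finds $Q^*$ from the critical-ratio condition. The function names `expected_cost` and `critical_ratio_quantity`, the truncation point `d_max`, and the instance values ($h=0.5$, $p=15$, $\mu=4$) are illustrative assumptions, not part of the workshop code.

```python
import numpy as np
from scipy.stats import poisson


def expected_cost(Q, h, p, mu, d_max=200):
    """Approximate g(Q) for Poisson(mu) demand by truncating the sum at d_max."""
    d = np.arange(d_max + 1)
    f = poisson.pmf(d, mu)
    overage = np.maximum(Q - d, 0)   # (Q - d)^+
    underage = np.maximum(d - Q, 0)  # (d - Q)^+
    return float(np.sum((h * overage + p * underage) * f))


def critical_ratio_quantity(h, p, mu):
    """Smallest Q such that F(Q) >= p / (p + h)."""
    Q = 0
    while poisson.cdf(Q, mu) < p / (p + h):
        Q += 1
    return Q


# Illustrative instance (assumed values).
h, p, mu = 0.5, 15, 4
Q_star = critical_ratio_quantity(h, p, mu)
costs = [expected_cost(Q, h, p, mu) for Q in range(20)]

print(f"Critical-ratio Q*: {Q_star}")
print(f"Minimizer of g(Q) over Q = 0..19: {int(np.argmin(costs))}")
```

The two printed quantities should agree, since the critical-ratio condition characterizes the minimizer of $g(Q)$.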

In this notebook, we will model the newsvendor problem as a **multi-armed bandit (MAB)**.

### Preliminary Python Stuff

In [None]:
# Import the packages we will need.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, poisson

In the code below, we'll use the `stockpyl` Python package (https://pypi.org/project/stockpyl/) for inventory optimization stuff. We have to install the `stockpyl` package ourselves. (It doesn't come pre-installed on Colab like `numpy`, etc. do.) You should only need to do this once per notebook.

If you get a message like

```
WARNING: The following packages were previously imported in this runtime:
  [sphinxcontrib]
You must restart the runtime in order to use newly installed versions.
```

you can ignore it.


In [None]:
!pip install stockpyl

In [None]:
from stockpyl import newsvendor

### Bandit Class

First, we'll define a `Bandit` class that implements a generic multi-armed bandit (MAB). 

The class is very simple. It has three attributes:

* `k`: the number of arms
* `mean`: a list of mean rewards, one per arm
* `sd`: a list of standard deviations of rewards, one per arm

And the class has two methods:

* `__init__()` initializes the class
* `pull()` takes an action and returns a randomly generated reward for that action

At its default values, the bandit has $k=5$ arms whose rewards have means $[0, \ldots, 4]$ (respectively) and standard deviation 1.

In [None]:
class Bandit(object):

    def __init__(self, k: int = 5, mean: list = list(range(5)), sd: list = [1]*5):
        """Initialize the attributes."""
        self.k = k
        self.mean = mean
        self.sd = sd

    def pull(self, action: int):
        """Get a random variate from a normal distribution with the mean and SD
        corresponding to the action."""
        return norm.rvs(loc=self.mean[action], scale=self.sd[action])


Let's give it a spin 🎰. 

(Sorry—dad joke.) 

In [None]:
bandit = Bandit(k=3, mean=[5, 3, 1], sd=[1, 1, 0.5])
for _ in range(10):
    a = np.random.randint(3)
    r = bandit.pull(a)
    print(f"Pulled arm {a}, got reward {r}")

### $\epsilon$-Greedy Class

Next we'll define an `EpsilonGreedyAgent` class that implements the $\epsilon$-greedy algorithm for generic MABs. The algorithm implementation is based on the discussions in Sutton and Barto (2nd edition, 2018).

In order to use the class, you need to provide it with an instance of the `Bandit` class defined above.

Feel free to explore the code if you want, but all that's required is for you to execute the cell.

In [None]:
class EpsilonGreedyAgent(object):

    def __init__(self, bandit: Bandit, epsilon: float = 0.1):
        # Initialize the attributes.
        self.bandit = bandit
        self.epsilon = epsilon

    def epsilon_greedy(self, num_time_steps: int = 1000, initial_Q: list = None):
        # Initialize Q-value estimates (copying `initial_Q`, if provided, so the
        # caller's list is not modified in place).
        Q = list(initial_Q) if initial_Q is not None else [0.0] * self.bandit.k
        # Initialize action counts.
        N = [0] * self.bandit.k

        # Main loop.
        for t in range(num_time_steps):

            # Choose action.
            if np.random.rand() < 1 - self.epsilon:
                A = np.argmax(Q)
            else:
                A = np.random.randint(self.bandit.k)

            # Get reward.
            R = self.bandit.pull(A)

            # Update stats.
            N[A] += 1
            Q[A] += (1 / N[A]) * (R - Q[A])

        # Return Q estimates as well as best guess for optimal action.
        return Q, np.argmax(Q)

Let's try it.

In [None]:
agent = EpsilonGreedyAgent(bandit)
Q, A = agent.epsilon_greedy()
print(f"Best guess for optimal action is {A}")
print("Estimates of action values:")
for a in range(bandit.k):
    print(f"  {a}: {Q[a]}")
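
The update `Q[A] += (1 / N[A]) * (R - Q[A])` in the agent is the incremental form of a sample average: after each pull, the estimate for that arm equals the mean of all rewards observed for it so far. The short sketch below checks this on a made-up reward sequence (the values in `rewards` are hypothetical).

```python
import numpy as np

# Hypothetical reward sequence observed for a single arm.
rewards = [4.0, 6.5, 5.0, 3.5, 6.0]

q = 0.0  # running estimate of the arm's value
n = 0    # number of times the arm has been pulled
for r in rewards:
    n += 1
    q += (1 / n) * (r - q)  # same update rule as in the agent

print(f"Incremental estimate: {q}")
print(f"Sample mean:          {np.mean(rewards)}")
```

Note that the first update (with `n = 1`) overwrites the starting value of `q`, so an initial estimate only influences which arms look attractive before they have been tried.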

### The Newsvendor Bandit

Now it's your turn. Your goal is to build a class called `NewsvendorBandit`. I started you off by building the structure of the class. (It's similar to the `Bandit` class declared earlier.) You need to fill in some details.

A few things to note:

* The `NewsvendorBandit` class takes a parameter `k`, like the `Bandit` class, that indicates the number of "arms". The arms will be indexed $a=0,\ldots,k-1$, and arm $a$ corresponds to using an order quantity of $a$.
* The class takes three parameters specifying the newsvendor problem instance: 

    * `h` and `p` are the holding and stockout costs
    * `mu` is the mean of the Poisson demand distribution
    
* "Pulling" an arm should return the **negative of the cost of one newsvendor period,** based on a randomly generated demand, rather than returning a random variate from a particular distribution.

---
> **Note:** In the code below, the portions that you need to complete are marked with
> 
> ```python
> # #################
> # TODO:
> ```
> 
> In place of the missing code is a line that says 
> 
> ```python
> 	raise NotImplementedError
> ```
> 
> This is a way of telling Python to raise an exception (error) because there's something missing here. You should **delete (or comment out) this line** after you write your code.

---

In [None]:
class NewsvendorBandit(object):

    def __init__(self, k: int = 10, h: float = 1, p: float = 10, mu: int = 5):
        """Initialize the attributes."""
        self.k = k

        # #################
        # TODO: store the attributes h, p, and mu in the object, too.
        raise NotImplementedError

    def pull(self, action: int):
        """Return a random newsvendor reward (negative cost) for the given action."""

        # Generate a Poisson(mu) random variate.
        d = poisson.rvs(self.mu)

        # #################
        # TODO: Calculate the cost for the chosen action and the random demand.
        # Set `reward` to the negative of this cost.
        raise NotImplementedError

        # Return the reward (the negative of the one-period cost).
        return reward

Let's try out your `NewsvendorBandit` class on a newsvendor instance with:

* $h=0.5$
* $p=15$
* $\mu=4$

We'll use 12 arms.

In [None]:
# Build the bandit.
num_arms = 12
bandit = NewsvendorBandit(k=num_arms, h=0.5, p=15, mu=4)

In [None]:
# Pull lever 5 a few times.
a = 5
for _ in range(10):
    r = bandit.pull(a)
    print(f"Pulled arm {a}, got reward {r}")

In [None]:
# Pull it a lot of times and get the average reward.
avg_reward = 0
a = 5
num_pulls = 10000
for _ in range(num_pulls):
    r = bandit.pull(a)
    avg_reward += r / num_pulls
print(f"Pulled arm {a} {num_pulls} times, got average reward {avg_reward}")

Let's validate this using `stockpyl` by calculating the expected cost of using an order quantity of 5 (`stockpyl` calls this `base_stock_level`) for the given newsvendor instance.

In [None]:
_, exp_cost = newsvendor.newsvendor_poisson(
    bandit.h,
    bandit.p,
    bandit.mu,
    base_stock_level=a
)
print(f"Exact expected cost for order quantity {a} is {exp_cost}")

And let's get the *optimal* order quantity using `stockpyl`.

In [None]:
opt_Q, opt_exp_cost = newsvendor.newsvendor_poisson(
    bandit.h,
    bandit.p,
    bandit.mu
)
print(f"Optimal order quantity is {opt_Q}, with expected cost {opt_exp_cost}")

---
**Note:** Before proceeding, you should make sure that the results from your bandit are similar to those returned by `stockpyl`.

---

### Training the $\epsilon$-Greedy Agent for the Newsvendor MAB

Now let's train the $\epsilon$-greedy agent on the newsvendor MAB. The `EpsilonGreedyAgent` class does not need any modifications—we built it to be very generic—so all you need to do is pass your newsvendor bandit to it.

In [None]:
# #################
# TODO: Train the epsilon-greedy agent on your `NewsvendorBandit` object.
# Print the agent's best guess for the optimal action, as well as its
# estimates of the action values. (Use the analogous cell above as a template.)
raise NotImplementedError

### Validating the Results

Let's compare your trained agent's action-value estimates (or really, their negatives) with the true expected costs of those actions as calculated by `stockpyl`. 

(The code below assumes that you have stored the agent's estimates of the action values in a variable called `Q` in the previous cell.)

In [None]:
action_list = list(range(bandit.k))

bandit_cost = [-Q[a] for a in action_list]
exp_cost = [newsvendor.newsvendor_poisson(
    bandit.h, 
    bandit.p, 
    bandit.mu, 
    base_stock_level=a
)[1] for a in action_list]

plt.scatter(action_list, bandit_cost, label='Bandit Cost Estimates')
plt.scatter(action_list, exp_cost, label='True Expected Cost')
plt.legend()
plt.xlabel('Action')
plt.ylabel('Cost');

How good a job did your bandit and $\epsilon$-greedy agent do of estimating the expected cost function for the newsvendor problem?

(If you're not happy with the results, you can try increasing the `num_time_steps` parameter passed to the `epsilon_greedy()` method.)
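
One reason the estimates for some arms may look rough: the $\epsilon$-greedy agent spends most of its time on the arm it currently believes is best, so the other arms are sampled mainly through exploration. In each time step, any given arm is chosen with probability at least $\epsilon/k$, so its expected number of pulls is at least `num_time_steps * epsilon / k`. The sketch below evaluates that bound for the default `epsilon=0.1` and the 12 arms used above; the helper name `min_expected_pulls` is hypothetical.

```python
def min_expected_pulls(num_time_steps: int, epsilon: float, k: int) -> float:
    """Lower bound on the expected number of pulls of any single arm
    under epsilon-greedy with uniform exploration over k arms."""
    return num_time_steps * epsilon / k


k = 12
epsilon = 0.1
for T in [1000, 10000, 100000]:
    bound = min_expected_pulls(T, epsilon, k)
    print(f"num_time_steps = {T:>6}: at least {bound:.1f} expected pulls per arm")
```

With the default 1000 time steps, an arm that is never the greedy choice is expected to be pulled only about 8 times, so its sample average can be quite noisy.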