# Chapter 1 Dynamic Programming and Bellman Equations

## Example: Exercising a Stock Option

In this example, we will solve the following stock option problem using stochastic dynamic programming (DP).

### Exercising a Stock Option

You hold a call option, which gives you the right to buy one share at the "striking price" $p$ (a positive integer) before the option expires. You have $N$ days (Day $0$ through Day $N-1$) to exercise it. If you exercise the option on day $k$ when the stock price is $x_k,$ you immediately make a profit of $x_k-p.$ If you have not exercised it by the end of the last day, the option expires and has no value. Suppose the price of a share is an integer and obeys the equation $x_{k+1}=x_k+w_k,$ where $\{w_k\}$ are i.i.d. random variables uniformly distributed on $\{-1, 0, 1\}$. Given an initial stock price $x_0$, the aim is to choose when to exercise the option so as to maximize the expected profit.

### Problem Formulation

This problem can be cast as a finite-horizon stochastic DP problem with the following components:

- Stage $k$: Day $k$, $k\in\{0,\ldots,N-1\}$; stage $N$ is the time after the last day.
    
- State $x_k$: the price of the stock at Day $k$. Define a terminal state $g$ such that $x_k=g$ means that the option has already been exercised.
    
- Action: $u_k$:

    If $x_k\neq g$, then

     - $u_k=1$: exercise the stock option;
     - $u_k=0$: do not exercise.

    If $x_k=g$, then the only available action is $u_k=0$ because the stock option has already been exercised.
    
- Reward: $r(x_k,u_k)$: 

    \begin{align*}
        r(x_k,u_k)=
        \begin{cases}
            x_k-p, & \mbox{if $x_k\neq g$ and $u_k=1$;}\\
            0, & \mbox{otherwise.}
        \end{cases}
    \end{align*}
    
- Transition: 
    For any $x\neq g$,

    \begin{align*}
        &\Pr(x_{k+1}=x-1|x_k=x,u_k=0)\\
        =&\Pr(x_{k+1}=x|x_k=x,u_k=0)\\
        =&\Pr(x_{k+1}=x+1|x_k=x,u_k=0)=\frac{1}{3},
    \end{align*}

    and

    \begin{align*}
        &\Pr(x_{k+1}=g|x_k=g,u_k=0)=1,\\
        &\Pr(x_{k+1}=g|x_k=x,u_k=1)=1.
    \end{align*}

- Goal: given $x_0,$ maximize the expected total reward:

    \begin{align*}
        \max_{\mu_0,\ldots,\mu_{N-1}} \mathbb{E}\left[\sum_{k=0}^{N-1}r(x_k,\mu_k(x_k))\,\middle|\, x_0\right].
    \end{align*}
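
The reward and transition rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `G`, `reward`, and `step` are hypothetical and do not appear in the exercise, and representing the terminal state $g$ by `None` is an assumption made here for convenience.

```python
import random

# Hypothetical sentinel standing in for the terminal state g (an assumption).
G = None


def reward(x, u, p):
    """Reward r(x, u): x - p when a live option is exercised, 0 otherwise."""
    if x is not G and u == 1:
        return x - p
    return 0


def step(x, u, rng=random):
    """Sample the next state x_{k+1} given the state x and the action u."""
    if x is G or u == 1:
        # Already exercised, or exercising now: move to / stay in g.
        return G
    # Not exercising: the price moves by -1, 0, or +1 with probability 1/3 each.
    return x + rng.choice((-1, 0, 1))


# Example: hold for one day at price 3, then exercise with striking price 2.
rng = random.Random(0)
x1 = step(3, 0, rng)          # one of 2, 3, 4
profit = reward(x1, 1, 2)     # x1 - 2
x2 = step(x1, 1, rng)         # terminal state
print(x1, profit, x2 is G)
```

Once the state is `G`, every later reward is $0$, which matches the "otherwise" branch of $r(x_k,u_k)$.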

### Value Function

Let $V_k(x_k)$ denote the optimal value function for state $x_k$ at stage $k$, defined by

\begin{align*}
    V_k(x_k)=\max_{\mu_k,\ldots,\mu_{N-1}} \mathbb{E}\left[\sum_{l=k}^{N-1}r(x_{l},\mu_l(x_l))\,\middle|\, x_k\right].
\end{align*}

$V_k(x_k)$ is the optimal expected profit given the stock price $x_k$ and that there are still $N-k$ days (Day $k$, ..., Day $N-1$) to go. 

### Bellman Equation

Based on the formulation, the Bellman equation can be written as

\begin{align*}
    V_k(x_k)=&\max\left\{x_k-p, \mathbb{E}[V_{k+1}(x_{k+1})]\right\}\nonumber\\
    =&\max\left\{x_k-p, \frac{1}{3} V_{k+1}(x_k)+ \frac{1}{3} V_{k+1}(x_k+1) + \frac{1}{3} V_{k+1}(x_k-1)\right\},
\end{align*}

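for any $k=0,\ldots,N-1$ and any $x_k\neq g$. For the terminal state, $V_k(g)=0$ because no further reward can be collected once the option has been exercised.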

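Since the option is worthless after it expires, $V_{N}(x)=0$ for every $x$. Because the price moves by at most $1$ per day, $x_k$ lies between $x_0-k$ and $x_0+k$, so every reachable price lies between $x_0-N$ and $x_0+N$. Starting from $V_N$, we can therefore compute $V_{N-1},\ldots,V_0$ backward on this finite set of prices using the Bellman equation. The optimal policy is then obtained by a forward pass: on day $k$, exercise the option if the immediate profit $x_k-p$ is at least the expected value of continuing, and hold otherwise.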
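
As a quick hand check of the recursion (a worked special case added for illustration, assuming $N\geq 2$), the last two stages can be evaluated directly. On the last day the continuation value is $V_N=0$, so

\begin{align*}
    V_{N-1}(x)=\max\left\{x-p,\ 0\right\}.
\end{align*}

Substituting this into the Bellman equation at stage $N-2$ with the price exactly at the striking price, $x=p$, gives

\begin{align*}
    V_{N-2}(p)=&\max\left\{0,\ \frac{1}{3} V_{N-1}(p)+ \frac{1}{3} V_{N-1}(p+1) + \frac{1}{3} V_{N-1}(p-1)\right\}\\
    =&\max\left\{0,\ \frac{1}{3}\cdot 0+ \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot 0\right\}=\frac{1}{3}.
\end{align*}

So with two days left and the price at $p$, holding the option is worth $1/3$ even though exercising immediately earns nothing.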


## Code

### Backward Search

We will calculate the optimal value function backward using the Bellman equation, i.e., compute the values of $V_N(x_N), V_{N-1}(x_{N-1}),\ldots,V_0(x_0)$ in that order.

The Python function `backward_cal` in the next cell takes the inputs `N`, `p`, `x_0`:

- `N`: the number of days

- `p`: the striking price

- `x_0`: initial stock price at Day $0$

The output `value_function`, a numpy array with shape `(N + 1, 2N + 1)`, is the value function.
`value_function[k, m]` means the value function $V_{k}(x_0-N+m)$.
For example, `value_function[4, 2]` is the value of $V_{4}(x_0-N+2)$.

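Note that given $x_0$, some entries of `value_function` are irrelevant to our decision: on Day $k$ the price can only lie between $x_0-k$ and $x_0+k$, so for $k<N$ only the columns $m=N-k,\ldots,N+k$ of row $k$ correspond to reachable prices. We will set all other entries to $-1$.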
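
The index convention can be illustrated with a small sketch. The helper names `price_to_index`, `index_to_price`, and `reachable_indices` are hypothetical and introduced only for this illustration; they are not part of the exercise.

```python
def price_to_index(x, x_0, N):
    """Column index m such that value_function[k, m] stores V_k(x)."""
    return x - x_0 + N


def index_to_price(m, x_0, N):
    """Price represented by column m, i.e. x_0 - N + m."""
    return x_0 - N + m


def reachable_indices(k, N):
    """Columns of row k that correspond to prices reachable by Day k."""
    return range(N - k, N + k + 1)


N, x_0 = 4, 3
print(price_to_index(3, x_0, N))       # 4: the initial price sits in the middle column
print(index_to_price(2, x_0, N))       # 1: column 2 stores the value at price 1
print(list(reachable_indices(1, N)))   # [3, 4, 5]: prices 2, 3, 4 on Day 1
```

With `N = 4` the array has `2 * N + 1 = 9` columns, and the initial price always maps to the middle column `m = N`.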

In [None]:
# Import packages. Run this cell.

import numpy as np

In [None]:
def backward_cal(N, p, x_0):
    """
    Calculate the optimal value function $V_{k}(x_k)$ using the Bellman equation.
    Args:
        N: the number of days
        p: the striking price
        x_0: initial stock price at Day $0$
    Returns:
        value_function: a numpy array with shape (N + 1, 2N + 1).
            value_function[k, m] means the value function $V_{k}(x_0-N+m)$.
            Entries for prices that cannot be reached by Day $k$ are left at -1.
    """
    # Start with -1 everywhere to mark unreachable (stage, price) pairs.
    value_function = -1 * np.ones((N + 1, 2 * N + 1))

    # Terminal condition: the option has expired, so V_N(x) = 0 for every x.
    value_function[N, :] = 0

    # Go backward from Day N - 1 to Day 0.
    for k in range(N - 1, -1, -1):
        # On Day k the price lies in [x_0 - k, x_0 + k], i.e. m in [N - k, N + k].
        for m in range(N - k, N + k + 1):
            value_continue = 1 / 3 * (value_function[k + 1, m] + value_function[k + 1, m + 1] + value_function[k + 1, m - 1])
            value_exercise = x_0 - N + m - p
            value_function[k, m] = max(value_exercise, value_continue)

    return value_function


In [None]:
# Sample Test, checking the output of the function backward_cal

# Sample input
N = 4
p = 2
x_0 = 3

# Sample output
value_function = np.array([[-1.0000, -1.0000, -1.0000, -1.0000,  1.1852, -1.0000, -1.0000, -1.0000, -1.0000],
                           [-1.0000, -1.0000, -1.0000,  0.4444,  1.1111,  2.0000, -1.0000, -1.0000, -1.0000],
                           [-1.0000, -1.0000,  0.0000,  0.3333,  1.0000,  2.0000,  3.0000, -1.0000, -1.0000],
                           [-1.0000,  0.0000,  0.0000,  0.0000,  1.0000,  2.0000,  3.0000,  4.0000, -1.0000],
                           [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000]])

# Sample test: compare every entry, including the last two columns
func_out = backward_cal(N, p, x_0)
for k in range(N + 1):
    for m in range(2 * N + 1):
        assert round(func_out[k, m], 4) == round(value_function[k, m], 4), "The sample test failed."
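
One way to cross-check the backward computation is to evaluate the same Bellman equation top-down with memoization and exact rational arithmetic. The sketch below is an independent check, not the method used in this exercise; the names `make_value` and `V` are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def make_value(N, p):
    """Return a memoized function V(k, x) that evaluates the Bellman equation."""

    @lru_cache(maxsize=None)
    def V(k, x):
        if k == N:
            # The option has expired.
            return Fraction(0)
        # Expected value of holding the option for one more day.
        value_continue = (V(k + 1, x - 1) + V(k + 1, x) + V(k + 1, x + 1)) / 3
        # Best of exercising now and continuing.
        return max(Fraction(x - p), value_continue)

    return V


V = make_value(N=4, p=2)
print(V(0, 3))           # 32/27
print(float(V(0, 3)))    # about 1.1852
```

For the sample input (`N = 4`, `p = 2`, `x_0 = 3`), the exact value $32/27\approx 1.1852$ agrees with the entry `value_function[0, 4]` in the sample output above.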

### Find the Optimal Policy

The following function `optimal_action` will output the optimal action `u_k` given the inputs:

- `N`: the number of days
- `p`: the striking price
- `x_0`: initial stock price at Day $0$
- `k`: Day $k$
- `x_k`: stock price at Day $k$
- `value_function`: the value function array returned by `backward_cal`


In [None]:
def optimal_action(N, p, x_0, k, x_k, value_function):
    """
    Calculate the optimal action $u_k$ at stage $k$.
    Args:
        N: the number of days
        p: the striking price
        x_0: initial stock price at Day $0$
        k: Day $k$, with 0 <= k <= N - 1
        x_k: stock price at Day $k$
        value_function: the value function array returned by backward_cal
    Returns:
        u_k: the optimal action, 1 means exercising the option, 0 means not exercising the option
    """
    # Column index of the price x_k in value_function.
    m = x_k - x_0 + N
    # Expected value of holding the option for one more day.
    value_continue = 1 / 3 * (value_function[k + 1, m] + value_function[k + 1, m + 1] + value_function[k + 1, m - 1])
    # Exercise when the immediate profit is at least the continuation value
    # (ties are broken in favor of exercising).
    if x_k - p >= value_continue:
        u_k = 1
    else:
        u_k = 0
    return u_k


In [None]:
# Sample Test, checking the output of the function optimal_action

# Sample input
N = 4
p = 2
x_0 = 3
k = 1
x_k = 3
value_function = np.array([[-1.0000, -1.0000, -1.0000, -1.0000,  1.1852, -1.0000, -1.0000, -1.0000, -1.0000],
                           [-1.0000, -1.0000, -1.0000,  0.4444,  1.1111,  2.0000, -1.0000, -1.0000, -1.0000],
                           [-1.0000, -1.0000,  0.0000,  0.3333,  1.0000,  2.0000,  3.0000, -1.0000, -1.0000],
                           [-1.0000,  0.0000,  0.0000,  0.0000,  1.0000,  2.0000,  3.0000,  4.0000, -1.0000],
                           [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000]])

# Sample output
u_k = 0

# Sample test
func_out = optimal_action(N, p, x_0, k, x_k, value_function)
assert func_out == u_k, "The sample test failed."
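
To see the forward pass in action, the sketch below simulates many price paths, applies the same exercise rule as `optimal_action` on each day, and compares the average realized profit with $V_0(x_0)$. It is a rough Monte Carlo illustration rather than part of the exercise: the names `backward_values` and `simulate_once`, the number of runs, and the seed are all assumptions, and the backward recursion is restated compactly so that the snippet runs on its own.

```python
import numpy as np


def backward_values(N, p, x_0):
    """Compact restatement of the backward recursion; V[k, m] = V_k(x_0 - N + m)."""
    V = -np.ones((N + 1, 2 * N + 1))
    V[N, :] = 0.0
    for k in range(N - 1, -1, -1):
        for m in range(N - k, N + k + 1):
            cont = (V[k + 1, m - 1] + V[k + 1, m] + V[k + 1, m + 1]) / 3
            V[k, m] = max(x_0 - N + m - p, cont)
    return V


def simulate_once(N, p, x_0, V, rng):
    """Follow the exercise rule along one random price path; return the profit."""
    x = x_0
    for k in range(N):
        m = x - x_0 + N
        cont = (V[k + 1, m - 1] + V[k + 1, m] + V[k + 1, m + 1]) / 3
        if x - p >= cont:
            # Exercise: collect x - p and stop.
            return x - p
        # Hold: the price moves by -1, 0, or +1 with equal probability.
        x += int(rng.integers(-1, 2))
    # Never exercised: the option expires worthless.
    return 0


N, p, x_0 = 4, 2, 3
V = backward_values(N, p, x_0)
rng = np.random.default_rng(0)   # seed chosen arbitrarily
n_runs = 20000                   # number of simulated paths (assumption)
mean_profit = np.mean([simulate_once(N, p, x_0, V, rng) for _ in range(n_runs)])
print("simulated average profit:", mean_profit)
print("V_0(x_0) from the recursion:", V[0, N])
```

The simulated average should be close to `V[0, N]` (about $1.1852$ for this input), up to Monte Carlo noise.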
