In [None]:
version = "REPLACE_PACKAGE_VERSION"

# Reinforcement Learning


## Assignment 1 Part 2: Secretary Problem

In this assignment, we will solve the secretary problem using stochastic dynamic programming (DP).

### Secretary Problem

Imagine an administrator who wants to hire the best secretary out of $n$ candidates for a position. The candidates are interviewed one by one in random order. After interviewing the $h^{\mathrm{th}}$ candidate, the administrator knows how that candidate ranks among the candidates interviewed so far, but knows nothing about the quality of the unseen candidates. The administrator must decide whether to hire or reject the $h^{\mathrm{th}}$ candidate immediately after the interview, and cannot go back to hire candidates $1,...,h-1$ afterward.

### Assumption

For each interview the administrator picks a candidate uniformly at random from the remaining candidates.

### Problem Formulation

This is a finite horizon stochastic problem, which can be formulated as the following stochastic DP:

- *Goal*:

    Maximize the probability of hiring the best candidate.


- *Stage $h$*: 

    The stage after interviewing the $h^{\mathrm{th}}$ candidate. $h\in\{1,...,n\}$.


- *State $s_h$*: 

    The ranking of the $h^{\mathrm{th}}$ candidate among the candidates interviewed so far. $s_h\in\{1,...,h\}$.


- *Action $a_h$*: 

    $a_h=1$ means hiring the $h^{\mathrm{th}}$ candidate;
    $a_h=0$ means rejecting the $h^{\mathrm{th}}$ candidate.


- *Reward $r_h(s_h,a_h)$*: 

    $r_h=1$ if $a_h=1$ and the $h^{\mathrm{th}}$ candidate is the best candidate among all the $n$ candidates. 
    Otherwise, $r_h=0$.

Note that once the administrator decides to hire the candidate after the interview, the process will terminate and all the future rewards will be 0.
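
To make the state definition concrete, here is a minimal sketch (not part of the assignment) that computes the relative ranks $s_1,...,s_n$ for a given interview order. The helper name `relative_ranks` is hypothetical, and the sketch assumes each candidate is identified by their true rank among all $n$ candidates, with 1 being the best.

```python
import random

def relative_ranks(order):
    """Return [s_1, ..., s_n] for an interview order given as true ranks (1 = best)."""
    ranks = []
    for h, current in enumerate(order, start=1):
        # s_h = 1 + number of earlier candidates who are better than the current one
        better_seen = sum(1 for earlier in order[:h - 1] if earlier < current)
        ranks.append(better_seen + 1)
    return ranks

# Fixed example: true ranks listed in interview order
print(relative_ranks([3, 1, 4, 2]))  # [1, 1, 3, 2]

# A random interview order for n = 5 candidates
order = random.sample(range(1, 6), 5)
print(order, relative_ranks(order))
```

In the fixed example, the second candidate is the overall best, but at stage 2 the administrator only observes $s_2=1$, exactly as at stage 1.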

### Value Function

Let $V_h(s_h)$ denote the optimal value function for state $s_h$ at stage $h$, defined by
$$
V_h(s_h) = \max_{\mu} \mathbb{E} \left[\sum_{k=h}^n r_k(s_k, \mu_k(s_k)) \;\middle|\; s_h\right]
$$

where $\mu=(\mu_k)$ is any sequence of state-dependent policies.
The optimal value function $V_h(s_h)$ can be interpreted as the probability that the best candidate will be hired under the optimal policy given $s_h$ at stage $h$. 

For example, let $n = 10$, $h = 5$, $s_5 = 1$. In this scenario, there are 10 candidates total. After interviewing the $5^{\mathrm{th}}$ candidate, we know that this candidate ranks first among the 5 candidates that have been interviewed. If we follow the optimal policies from now on, the probability that we can successfully hire the best candidate is $V_5(1)$.


In [None]:
# Import packages. Run this cell.

import numpy as np
import matplotlib.pyplot as plt

### Questions

Our ultimate goal in this assignment is to find the optimal sequence of policies that maximizes the probability of hiring the best candidate, i.e., we want to find the optimal sequence of policies $\mu^*$ such that
$$
\mu^* = \mathop{\mathrm{argmax}}_{\mu} \mathbb{E} \left[\sum_{h=1}^n r_h(s_h, \mu_h(s_h))\right]
$$

Please answer the following questions, which will guide you towards the goal step by step.

**1.** (2 pts) 

We start by calculating the optimal value function at the last stage $n$, i.e., the values of $V_n(s_n)$ for all $s_n\in\{1,...,n\}$.

Please complete the Python function `V_n` in the next cell. The inputs of the function are $n$ and $s_n$ with $s_n\in\{1,...,n\}$. The output of the function is the value of $V_n(s_n)$ given the inputs $n$ and $s_n$. 

For example, if the inputs are 10 and 1, the output of the function should be the value of $V_{10}(1)$.


In [None]:
def V_n(n, s_n):
    """
    Calculate the optimal value function $V_n(s_n)$
    Args:
        n: the total number of stages
        s_n: the state at stage n
    Returns:
        v_n_s_n: the value of the optimal value function $V_n(s_n)$
    """
    if s_n == 1:
        ### BEGIN SOLUTION
        v_n_s_n = 1
        ### END SOLUTION
    else:
        ### BEGIN SOLUTION
        v_n_s_n = 0
        ### END SOLUTION
    
    # Make sure that the output type of your function is int or float
    if not isinstance(v_n_s_n, (float, int)):
        raise ValueError("The output type of your function should be int or float.")
    
    return v_n_s_n


In [None]:
# Sample Test, checking the output of your function V_n

# Sample input
n = 2
s_n = 1

# Sample output
v_n_s_n = 1

# Sample test
func_out = V_n(n, s_n)
assert round(func_out, 4) == round(v_n_s_n, 4), "Question 1: The sample test failed."


In [None]:
# Hidden Test 1, checking the output of your function V_n if s_n = 1
### BEGIN HIDDEN TESTS
for n in range(1, 11):
    assert round(V_n(n, 1), 4) == 1.0000, "Question 1, Test 1, the output value of your function V_n does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 2, checking the output of your function V_n if s_n = 2,...,n
### BEGIN HIDDEN TESTS
for n in range(2, 11):
    for s_n in range(2, n + 1):
        assert round(V_n(n, s_n), 4) == 0.0000, "Question 1, Test 2, the output value of your function V_n does not match expected."
### END HIDDEN TESTS

**2.** (2 pts) 

The Bellman equation for $h=1,...,n-1$ is as follows:

\begin{equation}
    \begin{aligned}
    V_h(1) =& \max \biggl\{\mathbb{E}[r_h(1,1)], \mathbb{E}[V_{h+1}(s_{h+1})]\biggr\} = \max\left\{\frac{h}{n},\frac{1}{h+1}\sum_{s=1}^{h+1}V_{h+1}(s)\right\}\\
    V_h(s_{h}) =& \max \biggl\{0, \mathbb{E}[V_{h+1}(s_{h+1})]\biggr\} = \frac{1}{h+1}\sum_{s=1}^{h+1}V_{h+1}(s), s_h=2,...,h.
    \end{aligned}
\end{equation}

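For $V_h(1)$, the first term inside the maximum, $\mathbb{E}[r_h(1,1)]$, is the expected value of $a_h=1$, i.e., hiring the $h^{\mathrm{th}}$ candidate. It equals $\frac{h}{n}$, the probability that the best of the first $h$ candidates is also the best of all $n$ candidates; the total future reward is 0 since the process terminates. The second term, $\mathbb{E}[V_{h+1}(s_{h+1})]$, is the expected value of $a_h=0$, i.e., rejecting the $h^{\mathrm{th}}$ candidate: the current reward is 0, and the next state $s_{h+1}$ is uniformly distributed on $\{1,...,h+1\}$ regardless of $s_h$, which gives the average $\frac{1}{h+1}\sum_{s=1}^{h+1}V_{h+1}(s)$.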

For $V_h(s_{h})$ where $s_h=2,...,h$, the expected value of hiring the $h^{\mathrm{th}}$ candidate is 0, since at least one earlier candidate is better and so the $h^{\mathrm{th}}$ candidate cannot be the best of all $n$ candidates. As in the case of $V_h(1)$, the expected value of rejecting the $h^{\mathrm{th}}$ candidate is $\mathbb{E}[V_{h+1}(s_{h+1})]$, which is nonnegative and therefore attains the maximum.
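
The value $\frac{h}{n}$ of hiring a best-so-far candidate can be checked by brute force for small $n$. The sketch below is an illustration only, not part of the assignment: the function name `prob_best_given_best_so_far` is hypothetical, and candidates are assumed to be labeled by their true rank (1 is the best). It enumerates all interview orders and counts how often a candidate with $s_h=1$ is also the overall best.

```python
from fractions import Fraction
from itertools import permutations

def prob_best_given_best_so_far(n, h):
    """P(candidate h is the overall best | candidate h is the best of the first h)."""
    best_so_far = 0
    overall_best = 0
    for order in permutations(range(1, n + 1)):  # true ranks, 1 = best
        if order[h - 1] == min(order[:h]):       # s_h = 1
            best_so_far += 1
            if order[h - 1] == 1:                # also the best of all n
                overall_best += 1
    return Fraction(overall_best, best_so_far)

# For n = 5, each conditional probability should equal h / 5
for h in range(1, 6):
    print(h, prob_best_given_best_so_far(5, h))
```

Enumeration grows as $n!$, so this check is only practical for small $n$.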

Based on the results of $V_n(s_n)$ obtained in Question **1**, calculate $V_{n-1}(s_{n-1})$ for all $s_{n-1}\in\{1,...,n-1\}$ using the above Bellman equation.

Please complete the Python function `V_n_minus_1` in the next cell. The inputs of the function are $n$ and $s_{n-1}$ with $s_{n-1}\in\{1,...,n-1\}$. The output of the function is the value of $V_{n-1}(s_{n-1})$ given the inputs $n$ and $s_{n-1}$.

For example, if the inputs are 10 and 1, the output of the function should be the value of $V_{9}(1)$.

In [None]:
def V_n_minus_1(n, s_n_minus_1):
    """
    Calculate the optimal value function $V_{n-1}(s_{n-1})$ using the Bellman equation
    Args:
        n: the total number of stages
        s_n_minus_1: the state s_{n-1} at stage n-1
    Returns:
        output: the value of the optimal value function $V_{n-1}(s_{n-1})$
    """
    if s_n_minus_1 == 1:
        ### BEGIN SOLUTION
        # Hire now (wins with probability (n - 1) / n) or reject and average V_n, which gives 1 / n
        output = max((n - 1) / n, 1 / n)
        ### END SOLUTION
    else:
        ### BEGIN SOLUTION
        # Hiring earns 0, so the value is that of rejecting: the average of V_n, i.e. 1 / n
        output = 1 / n
        ### END SOLUTION
    
    # Make sure that the output type of your function is int or float
    if not isinstance(output, (float, int)):
        raise ValueError("The output type of your function should be int or float.")
    
    return output


In [None]:
# Sample Test, checking the output of your function V_n_minus_1

# Sample input
n = 3
s_n_minus_1 = 1

# Sample output
output = 2 / 3  # V_2(1)

# Sample test
func_out = V_n_minus_1(n, s_n_minus_1)
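# Compare after rounding so that equivalent float expressions (e.g. 1 - 1 / n) also pass
assert round(func_out, 4) == round(output, 4), "Question 2: The sample test failed."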


In [None]:
# Hidden Test 1, checking the output of your function V_n_minus_1 if s_n_minus_1 = 1
### BEGIN HIDDEN TESTS
for n in range(2, 11):
    assert round(V_n_minus_1(n, 1), 4) == round(max((n - 1) / n, 1 / n), 4), "Question 2, Test 1, the output value of your function V_n_minus_1 does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 2, checking the output of your function V_n_minus_1 if s_n_minus_1 = 2,...,n-1
### BEGIN HIDDEN TESTS
for n in range(2, 11):
    for s in range(2, n):
        assert round(V_n_minus_1(n, s), 4) == round(1 / n, 4), "Question 2, Test 2, the output value of your function V_n_minus_1 does not match expected."
### END HIDDEN TESTS

**3.** (4 pts)

We can now compute the value function backward from stage $n$, starting from the terminal values obtained in Question **1** and applying the Bellman equation in Question **2** one stage at a time.

Please complete the Python function `V_h` in the next cell. The inputs of the function are $n$, $h$, and $s_{h}$ with $s_{h}\in\{1,...,h\}$. The output of the function is the value of $V_{h}(s_{h})$ given the inputs. 

For example, if the inputs are 10, 5, and 1, the output of the function should be the value of $V_{5}(1)$ assuming that the number of stages is 10.
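
As an illustration of the backward computation, the sketch below builds the whole table of values at once using exact rational arithmetic, which sidesteps floating-point rounding when comparing results. It is one possible approach under stated assumptions, not the required solution: the name `value_table` and the dictionary layout are hypothetical.

```python
from fractions import Fraction

def value_table(n):
    """Return {h: [V_h(1), ..., V_h(h)]} for h = 1..n as exact fractions."""
    # Terminal stage: only a candidate with s_n = 1 is the overall best
    table = {n: [Fraction(1)] + [Fraction(0)] * (n - 1)}
    for h in range(n - 1, 0, -1):
        # Value of rejecting: average of V_{h+1} over its h + 1 equally likely states
        continue_value = sum(table[h + 1]) / (h + 1)
        # A best-so-far candidate is worth the better of hiring (h / n) and rejecting
        v_best = max(Fraction(h, n), continue_value)
        table[h] = [v_best] + [continue_value] * (h - 1)
    return table

table = value_table(3)
print(table[1][0])  # 1/2
print(table[2])     # [Fraction(2, 3), Fraction(1, 3)]
```

Converting an entry with `float(...)` gives a value that can be compared against a floating-point implementation.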


In [None]:
def V_h(n, h, s_h):
    """
    Calculate the optimal value function $V_h(s_h)$ using backward computation
    Args:
        n: the total number of stages
        h: stage h
        s_h: the state s_h at stage h
    Returns:
        v_h_s_h: the value of the optimal value function $V_h(s_h)$
    """
    ### BEGIN SOLUTION
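    # Terminal values at stage n: V_n(1) = 1 and V_n(s) = 0 for s >= 2
    v_next_stage_1 = 1
    v_next_stage_s = 0
    # Step backward from stage n - 1 down to stage h
    stage = n - 1
    while stage >= h:
        # Value of rejecting: average of V_{stage+1} over its stage + 1 equally likely states
        v_stage_s = (v_next_stage_1 + v_next_stage_s * stage) / (stage + 1)
        # A best-so-far candidate is worth the better of hiring (stage / n) and rejecting
        v_stage_1 = max(stage / n, v_stage_s)
        v_next_stage_1 = v_stage_1
        v_next_stage_s = v_stage_s
        stage = stage - 1
    if s_h == 1:
        v_h_s_h = v_next_stage_1
    else:
        v_h_s_h = v_next_stage_s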
    ### END SOLUTION
    
    # Make sure that the output type of your function is int or float
    if not isinstance(v_h_s_h, (float, int)):
        raise ValueError("The output type of your function should be int or float.")
    
    return v_h_s_h


In [None]:
# Sample Test, checking the output of your function V_h

# Sample input
n = 3
h = 1
s_h = 1

# Sample output
v_h_s_h = 1 / 2  # V_1(1)

# Sample test
func_out = V_h(n, h, s_h)
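# Compare after rounding so that small floating-point differences do not fail the test
assert round(func_out, 4) == round(v_h_s_h, 4), "Question 3: The sample test failed."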


In [None]:
# Hidden Test 1, checking the output of your function V_h if s_h = 1, n = 10
### BEGIN HIDDEN TESTS
n = 10
value_function = np.zeros((n, n))
for h in range(1, n):
    for s_h in range(1, h + 1):
        for h_star in range(2, n):
            left = sum([1 / i for i in range(h_star, n)])
            right = left + 1 / (h_star - 1)
            if left <= 1 and right > 1:
                break
        if h >= h_star:
            if s_h == 1:
                value_function[h - 1, s_h - 1] = max(h / n, h / n * sum([1 / i for i in range(h, n)]))
            else:
                value_function[h - 1, s_h - 1] = h / n * sum([1 / i for i in range(h, n)])
        else:
            value_function[h - 1, s_h - 1] = (h_star - 1) / n * sum([1 / i for i in range(h_star - 1, n)])
value_function[n - 1, 0] = 1

for h in range(1, n + 1):
    assert round(V_h(n, h, 1), 4) == round(value_function[h - 1, 0], 4), "Question 3, Test 1, the output value of your function V_h does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 2, checking the output of your function V_h if s_h = 2,...,h, n = 10
### BEGIN HIDDEN TESTS
for h in range(1, n + 1):
    for s_h in range(2, h + 1):
        assert round(V_h(n, h, s_h), 4) == round(value_function[h - 1, s_h - 1], 4), "Question 3, Test 2, the output value of your function V_h does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 3, checking the output of your function V_h if s_h = 1, n = 20
### BEGIN HIDDEN TESTS
n = 20
value_function = np.zeros((n, n))
for h in range(1, n):
    for s_h in range(1, h + 1):
        for h_star in range(2, n):
            left = sum([1 / i for i in range(h_star, n)])
            right = left + 1 / (h_star - 1)
            if left <= 1 and right > 1:
                break
        if h >= h_star:
            if s_h == 1:
                value_function[h - 1, s_h - 1] = max(h / n, h / n * sum([1 / i for i in range(h, n)]))
            else:
                value_function[h - 1, s_h - 1] = h / n * sum([1 / i for i in range(h, n)])
        else:
            value_function[h - 1, s_h - 1] = (h_star - 1) / n * sum([1 / i for i in range(h_star - 1, n)])
value_function[n - 1, 0] = 1

for h in range(1, n + 1):
    assert round(V_h(n, h, 1), 4) == round(value_function[h - 1, 0], 4), "Question 3, Test 3, the output value of your function V_h does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 4, checking the output of your function V_h if s_h = 2,...,h, n = 20
### BEGIN HIDDEN TESTS
for h in range(1, n + 1):
    for s_h in range(2, h + 1):
        assert round(V_h(n, h, s_h), 4) == round(value_function[h - 1, s_h - 1], 4), "Question 3, Test 4, the output value of your function V_h does not match expected."
### END HIDDEN TESTS

**4.** (1 pt)

Consider $n\ge 3$. Let $h^*$ be such that 
$\frac{1}{h^*}+\frac{1}{h^*+1}+...+\frac{1}{n-1} \le 1 < \frac{1}{h^*-1}+\frac{1}{h^*}+...+\frac{1}{n-1}$.

For example, if $n=10$, then $h^*=4$.
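
The threshold $h^*$ can be found by accumulating the harmonic tail from the last term backward. The following is a minimal sketch under the definition above; the function name `find_h_star` is hypothetical, and exact fractions are used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction

def find_h_star(n):
    """Smallest h with 1/h + 1/(h+1) + ... + 1/(n-1) <= 1, for n >= 3."""
    if n < 3:
        raise ValueError("h* is defined here only for n >= 3")
    tail = Fraction(0)  # holds 1/(h+1) + ... + 1/(n-1) at the start of each iteration
    for h in range(n - 1, 0, -1):
        if tail + Fraction(1, h) > 1:
            # Adding 1/h pushes the sum above 1, so the threshold is h + 1
            return h + 1
        tail += Fraction(1, h)
    # Not reached for n >= 3: at h = 1 the sum is at least 1 + 1/2 > 1

print(find_h_star(10))  # 4
print(find_h_star(5))   # 3
```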

Given $n$, $h$, and $s_h$, the value function $V_h(s_h)$ can also be calculated using the closed-form equations below.

For all $h^*-1 \le h \le n-1$,

\begin{equation}
    \begin{aligned}
        V_h(1) =& \max\left\{\frac{h}{n}, \frac{h}{n}\left(\frac{1}{h}+\frac{1}{h+1}+...+\frac{1}{n-1}\right)\right\}\\
        V_h(s_h) =& \frac{h}{n}\left(\frac{1}{h}+\frac{1}{h+1}+...+\frac{1}{n-1}\right), s_h=2,...,h
    \end{aligned}
\end{equation}

and for all $1 \le h\le h^*-1$,

\begin{align}
    V_h(s_h) =& \frac{h^*-1}{n}\left(\frac{1}{h^*-1}+\frac{1}{h^*}+...+\frac{1}{n-1}\right), \forall s_h=1,...,h.
\end{align}
    
We implemented this closed-form calculation as the Python function `V_h_closed_form`. To verify these closed-form equations, please select several sets of $n$, $h$, and $s_h$ as inputs and then compare the values of $V_h(s_h)$ obtained by the Python function `V_h` in Question **3** with those obtained by `V_h_closed_form`. Are they the same?
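
Because the two functions reach the same number through different floating-point operations, their outputs may differ in the last few digits. One way to compare them is with a tolerance, as in the sketch below. The helper `find_mismatches` is hypothetical and the tolerances are arbitrary choices; the two stand-in functions only demonstrate the idea and are not the value function.

```python
import math

def find_mismatches(f, g, n, rel_tol=1e-9, abs_tol=1e-12):
    """Return the (h, s_h) pairs where f(n, h, s_h) and g(n, h, s_h) disagree."""
    return [
        (h, s)
        for h in range(1, n + 1)
        for s in range(1, h + 1)
        if not math.isclose(f(n, h, s), g(n, h, s), rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Stand-in functions that compute h / n along two different routes
direct = lambda n, h, s: h / n
summed = lambda n, h, s: sum(1 / n for _ in range(h))

print(find_mismatches(direct, summed, 10))  # []
```

In the notebook, the same helper could be called as `find_mismatches(V_h, V_h_closed_form, 20)` once both functions are defined; an empty list means the two agree on every stage and state.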

**Note**: This question will be manually graded.

In [None]:
def V_h_closed_form(n, h, s_h):
    """
    Calculate the optimal value function $V_h(s_h)$ using the closed-form equations
    Args:
        n: the total number of stages
        h: stage h
        s_h: the state s_h at stage h
    Returns:
        v_h_s_h: the value of the optimal value function $V_h(s_h)$
    """
    if h == n:
        if s_h == 1:
            v_h_s_h = 1
        else:
            v_h_s_h = 0
    else:
        for h_star in range(2, n):
            left = sum([1 / i for i in range(h_star, n)])
            right = left + 1 / (h_star - 1)
            if left <= 1 and right > 1:
                break
        if h >= h_star:
            if s_h == 1:
                v_h_s_h = max(h / n, h / n * sum([1 / i for i in range(h, n)]))
            else:
                v_h_s_h = h / n * sum([1 / i for i in range(h, n)])
        else:
            v_h_s_h = (h_star - 1) / n * sum([1 / i for i in range(h_star - 1, n)])
    
    return v_h_s_h

In [None]:
# You may use this space to compare the values. Please print the values out.
### BEGIN SOLUTION
n = 20  # any integer n >= 3 (the closed form assumes n >= 3)
h = 9  # any integer between 1 and n
s_h = 2  # any integer between 1 and h
print(V_h(n, h, s_h), V_h_closed_form(n, h, s_h))
n = 20  # any integer n >= 3
h = 11  # any integer between 1 and n
s_h = 2  # any integer between 1 and h
print(V_h(n, h, s_h), V_h_closed_form(n, h, s_h))
n = 20  # any integer n >= 3
h = 11  # any integer between 1 and n
s_h = 1  # any integer between 1 and h
print(V_h(n, h, s_h), V_h_closed_form(n, h, s_h))
### END SOLUTION

Please give a short answer to the question in the next cell. (A yes/no will suffice.)

**5.** (2 pts)

Consider the following sequence of policies:

At stage $h$, hire the candidate if

- They are the best candidate so far ($s_h=1$), and
- $\frac{1}{h}+\frac{1}{h+1}+...+\frac{1}{n-1} \le 1$, i.e., $h\ge h^*$.
    
In other words, reject the first $h^*-1$ candidates, then hire the first candidate who is the best so far.

Let $n=5$. Then $h^*=3$. Please use the closed-form expressions of the value function in Question **4** to verify that the above sequence of policies is optimal.
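
As an independent sanity check (separate from the closed-form verification asked for here), the success probability of such a threshold policy can be computed exactly for small $n$ by enumerating every interview order. This is a sketch only: the function name `threshold_success_probability` is hypothetical, and candidates are assumed to be labeled by their true rank (1 is the best).

```python
from fractions import Fraction
from itertools import permutations

def threshold_success_probability(n, h_star):
    """Exact P(hire the best) when rejecting the first h_star - 1 candidates,
    then hiring the first candidate who is the best seen so far."""
    wins = 0
    total = 0
    for order in permutations(range(1, n + 1)):  # true ranks, 1 = best
        total += 1
        for h in range(h_star, n + 1):
            if order[h - 1] == min(order[:h]):   # s_h = 1, so hire and stop
                wins += order[h - 1] == 1        # success only if overall best
                break
    return Fraction(wins, total)

print(threshold_success_probability(5, 3))  # 13/30

# Compare against the other possible thresholds for n = 5
for h_star in range(1, 6):
    print(h_star, threshold_success_probability(5, h_star))
```

For $n=5$, the threshold $h^*=3$ gives $\frac{13}{30}$, which matches the closed-form value $\frac{h^*-1}{n}\left(\frac{1}{h^*-1}+...+\frac{1}{n-1}\right)=\frac{2}{5}\left(\frac{1}{2}+\frac{1}{3}+\frac{1}{4}\right)$.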

**Hint**: Remember the backward-forward algorithm in the lecture. Now we have value functions, so we can find the optimal action forward by comparing the terms inside the maximum function in the Bellman equation in Question **2**.

**Note**: This question will be manually graded.

In [None]:
# You may use this space to do the calculation and comparison for each stage.
### BEGIN SOLUTION
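n = 5
for h in range(1, n + 1):
    # h / n: value of hiring a best-so-far candidate at stage h.
    # V_h(h): value of rejecting at stage h (state s_h = h is never hired for h >= 2, and stage 1 < h*).
    print(h, h / n, V_h_closed_form(n, h, h))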
### END SOLUTION

Please provide reasoning for the optimality of the sequence of policies based on the results you obtained in the last cell.