In [None]:
version = "REPLACE_PACKAGE_VERSION"

# Reinforcement Learning


## Assignment 1 Part 1: Knapsack Problem

In this assignment, we will solve the 0-1 Knapsack problem using deterministic dynamic programming (DP).

### 0-1 Knapsack Problem

We are given $n$ items, where item $i$ has a weight $w_i > 0$ and a value $v_i$, $i=0,1,...,n-1$, and a knapsack with a limited weight capacity $W>0$. The task is to determine which items to include so that the total weight is at most $W$ and the total value is as large as possible. Assume that all weights $w_i, i=0,1,...,n-1$ and the capacity $W$ are integers. The values $v_i\in\mathbb{R}, i=0,1,...,n-1$ are not necessarily integers.


### Problem Formulation

The 0-1 Knapsack problem can be formulated as a finite horizon deterministic DP problem:

- *Stage $i$*: 

    We decide on the items one at a time. Stage $i$ is the time at which we decide whether to include item $i$, $i=0,1,...,n-1$. Stage $n$ is the final stage, after all decisions have been made.
    

- *State $s_i$*: 

    The remaining weight capacity of the knapsack at stage $i$, starting from $s_0 = W$. Because all weights and $W$ are integers, $s_i$ is an integer between $0$ and $W$.
    

- *Action $a_i$*:

    $a_i=1$: To include item $i$ in the knapsack;
    
    $a_i=0$: Not to include item $i$ in the knapsack.
    
    Note that if $w_i > s_i$, i.e., the weight of item $i$ is larger than the remaining weight capacity of the knapsack, then $a_i=0$ is the only feasible action.
    

- *Reward $r_i(s_i,a_i)$*: 

    $r_i(s_i,a_i)=v_i$ if $a_i=1$ and $w_i \le s_i$;
    
    Otherwise, $r_i(s_i,a_i)=0$.
    
    
- *State Transition function $f_i(s_i, a_i)$*:

    \begin{equation}
    \begin{aligned}
        s_{i+1} = f_i(s_i, a_i) = \begin{cases}
                s_i - w_i, & a_i=1, w_i \le s_i \\
                s_i, & a_i=0
                \end{cases}
    \end{aligned}
    \end{equation}

- *Goal*:

    Maximize the total reward $\sum_{i=0}^{n-1} r_i(s_i,a_i)$, i.e., the total value in the knapsack.
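
The formulation above can be sketched directly in code. The helper names `reward`, `transition`, and `rollout` are illustrative assumptions and are not part of the functions this assignment asks for. The sketch assumes the initial state is the full capacity $W$ and treats an infeasible $a_i=1$ as an error.

```python
def reward(i, s, a, weights, values):
    """r_i(s_i, a_i): v_i if item i is taken and fits, otherwise 0."""
    return values[i] if a == 1 and weights[i] <= s else 0.0


def transition(i, s, a, weights):
    """f_i(s_i, a_i): remaining capacity after acting at stage i."""
    if a == 1:
        if weights[i] > s:
            raise ValueError(f"item {i} does not fit in remaining capacity {s}")
        return s - weights[i]
    return s


def rollout(W, weights, values, actions):
    """Apply a full action sequence from s_0 = W; return (total reward, s_n)."""
    s, total = W, 0.0
    for i, a in enumerate(actions):
        total += reward(i, s, a, weights, values)
        s = transition(i, s, a, weights)
    return total, s
```

For instance, with `W = 3`, `weights = [2, 1, 3]`, and `values = [8.0, 9.0, 10.0]`, calling `rollout(3, weights, values, [1, 1, 0])` returns `(17.0, 0)`: items 0 and 1 are included and no capacity remains.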


### Value Function

Let $V_i(s_i)$ denote the optimal value function for state $s_i$ at stage $i$, defined by
$$
V_i(s_i) = \max_{a_i,...,a_{n-1}} \sum_{j=i}^{n-1} r_j(s_j, a_j)
$$

The optimal value function $V_i(s_i)$ can be interpreted as the largest total value we can obtain given remaining weight capacity $s_i$ and items $i,i+1,...,n-1$. Note that $V_{n}(s_{n}) = 0$ by definition.
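
For very small instances, the definition can be checked by brute force, enumerating every action sequence $a_i,...,a_{n-1}$ and keeping the best feasible one. The function name `brute_force_value` is a hypothetical helper introduced only for this illustration. Its running time grows as $2^{n-i}$, so it is a sanity check rather than a solution method.

```python
from itertools import product


def brute_force_value(i, s, weights, values):
    """V_i(s) by enumerating every feasible action sequence a_i, ..., a_{n-1}."""
    n = len(weights)
    best = 0.0  # the all-zeros sequence is always feasible and has total value 0
    for actions in product((0, 1), repeat=n - i):
        used = sum(weights[i + k] for k, a in enumerate(actions) if a == 1)
        # Weights are positive, so "total weight <= s" is equivalent to
        # every individual decision being feasible at its own stage.
        if used <= s:
            total = sum(values[i + k] for k, a in enumerate(actions) if a == 1)
            best = max(best, total)
    return best
```

With `weights = [2, 1, 3]` and `values = [8.0, 9.0, 10.0]`, `brute_force_value(0, 3, weights, values)` evaluates to `17.0` and `brute_force_value(3, 3, weights, values)` evaluates to `0.0`, in line with $V_n(s_n)=0$.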


In [None]:
# Import packages. Run this cell.

import numpy as np


### Questions

Our ultimate goal in this assignment is to find an optimal sequence of actions $a_0,a_1,...,a_{n-1}$ that maximizes the total value in the knapsack, i.e., to solve the following problem:
$$
\mathop{\mathrm{argmax}}_{a_0,...,a_{n-1}} \sum_{i=0}^{n-1} r_i(s_i, a_i)
$$

Please answer the following questions.

**1.** Bellman Equation (2 pts) 

Please write down the Bellman equation for the optimal value function $V_i(s_i),~i=0,...,n-1$. (Do not leave $f_i(s_i,a_i)$ in your final answer.)

**Note**: This question will be manually graded.

**2.** Backward Computation (3 pts) 

Calculate the optimal value function backward using the Bellman equation, i.e., compute the values of $V_{n}(s_{n}), V_{n-1}(s_{n-1}),...,V_0(s_0)$.

Please complete the Python function `backward_cal` in the next cell. The inputs of the function are `n`, `W`, `weights`, `values`:

   - `n`: the number of items, i.e., $n$.
 
   - `W`: the weight limit of the knapsack, i.e., $W$.

   - `weights`: the weights of the items. It is a numpy array with size $n$. `weights[i]` represents $w_i$.

   - `values`: the values of the items. It is a numpy array with size $n$. The precision is up to 4 decimal places. `values[i]` represents $v_i$.
 
The output `value_function` is the value function $V_i(s_i)$:

   - `value_function`: a numpy array with shape `(n + 1, W + 1)`. The precision is up to 4 decimal places. `value_function[i, s_i]` represents $V_i(s_i)$.
    
    For example, `value_function[0, 1]` is the value of $V_{0}(1)$ given the inputs.
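
As a cross-check on a loop-based implementation, one stage of the backward pass can also be written with NumPy slicing, updating all capacities at once. This is only a sketch under the input conventions listed above; the name `backward_cal_vectorized` is a hypothetical helper and not the function to be submitted.

```python
import numpy as np


def backward_cal_vectorized(n, W, weights, values):
    """Backward pass over stages, updating every capacity in one array operation."""
    V = np.zeros((n + 1, W + 1))  # row n stays zero: V_n(s) = 0 for all s
    for i in range(n - 1, -1, -1):
        w = int(weights[i])
        V[i] = V[i + 1]  # skipping item i is always feasible
        if w <= W:
            # For capacities s >= w, compare skipping with taking item i,
            # which earns values[i] and leaves capacity s - w.
            V[i, w:] = np.maximum(V[i + 1, w:], V[i + 1, : W + 1 - w] + values[i])
    return V
```

The guard `w <= W` matters: without it, an item heavier than the whole knapsack would produce a negative slice bound and mismatched array shapes.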


In [None]:
def backward_cal(n, W, weights, values):
    """
    Calculate the optimal value function $V_{i}(s_i)$ using the Bellman equation
    Args:
        n: the number of items.
        W: the weight limit of the knapsack.
        weights: the weights of the items. It is a numpy array with size $n$. weights[i] represents $w_i$.
        values: the values of the items. It is a numpy array with size $n$. values[i] represents $v_i$.
    Returns:
        value_function: a numpy array with shape (n + 1, W + 1). value_function[i, s_i] represents $V_i(s_i)$.
    """
    value_function = np.zeros((n + 1, W + 1))
    
    ### BEGIN SOLUTION
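    # V_n(s) = 0 for every s is already set by np.zeros, so start at stage n - 1.
    for i in range(n - 1, -1, -1):
        for j in range(W + 1):
            # Skipping item i is always feasible.
            value_function[i, j] = value_function[i + 1, j]
            # Taking item i is feasible only if it fits in the remaining capacity j.
            if weights[i] <= j:
                value_function[i, j] = max(value_function[i, j], value_function[i + 1, j - weights[i]] + values[i])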
    ### END SOLUTION
    
    return value_function


In [None]:
# Sample Test, checking the output of your function backward_cal

# Sample input
n = 3
W = 3
weights = np.array([2, 1, 3])
values = np.array([8.0, 9.0, 10.0])

# Sample output
value_function = np.array([[ 0.0, 9.0, 9.0, 17.0],
                           [ 0.0, 9.0, 9.0, 10.0],
                           [ 0.0, 0.0, 0.0, 10.0],
                           [ 0.0, 0.0, 0.0, 0.0]])

# Sample test
func_out = backward_cal(n, W, weights, values)
for i in range(n + 1):
    for j in range(W + 1):
        assert round(func_out[i, j], 4) == round(value_function[i, j], 4), "Question 2: The sample test failed."


In [None]:
# Hidden Test 1, checking the output of your function backward_cal
### BEGIN HIDDEN TESTS
# generate test samples
n = 1
W = 1
weights = np.array([1])
values = np.array([8.5])
tr_v_func_secr = np.array([[ 0.0, 8.5],
                           [ 0.0, 0.0]])

student = backward_cal(n, W, weights, values)
for i in range(n + 1):
    for j in range(W + 1):
        assert round(student[i, j], 4) == round(tr_v_func_secr[i, j], 4), "Question 2, Test 1, the output value of your function backward_cal does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 2, checking the output of your function backward_cal
### BEGIN HIDDEN TESTS
# generate test samples
n = 5
W = 10
weights = np.array([6, 1, 4, 4, 8])
values = np.array([4.2365, 6.4589, 4.3759, 8.9177, 9.6366])
tr_v_func_secr = np.array([[ 0.,      6.4589,  6.4589,  6.4589,  8.9177, 15.3766, 15.3766, 15.3766, 15.3766, 19.7525, 19.7525],
                           [ 0.,      6.4589,  6.4589,  6.4589,  8.9177, 15.3766, 15.3766, 15.3766, 15.3766, 19.7525, 19.7525],
                           [ 0.,      0.,      0.,      0.,      8.9177,  8.9177,  8.9177,  8.9177, 13.2936, 13.2936, 13.2936],
                           [ 0.,      0.,      0.,      0.,      8.9177,  8.9177,  8.9177,  8.9177,  9.6366, 9.6366,  9.6366 ],
                           [ 0.,      0.,      0.,      0.,      0.,      0.,      0.,      0.,      9.6366, 9.6366,  9.6366 ],
                           [ 0.,      0.,      0.,      0.,      0.,      0.,      0.,      0.,      0.,     0.,      0.     ]])
                
student = backward_cal(n, W, weights, values)
for i in range(n + 1):
    for j in range(W + 1):
        assert round(student[i, j], 4) == round(tr_v_func_secr[i, j], 4), "Question 2, Test 2, the output value of your function backward_cal does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 3, checking the output of your function backward_cal
### BEGIN HIDDEN TESTS
# generate test samples
np.random.seed(0)
n = 500
W = 100
weights = np.random.randint(low=1, high=W+1, size=n)
values = (np.random.rand(n) * 10).round(4)
tr_v_func_secr = np.zeros((n + 1, W + 1))
for i in range(n, -1, -1):
    for j in range(W + 1):
        if i != n and j != 0:
            if weights[i] <= j:
                tr_v_func_secr[i, j] = max(tr_v_func_secr[i + 1, j - weights[i]] + values[i], tr_v_func_secr[i + 1, j])
            else:
                tr_v_func_secr[i, j] = tr_v_func_secr[i + 1, j]

student = backward_cal(n, W, weights, values)
for i in range(n + 1):
    for j in range(W + 1):
        assert round(student[i, j], 4) == round(tr_v_func_secr[i, j], 4), "Question 2, Test 3, the output value of your function backward_cal does not match expected."
### END HIDDEN TESTS

**3.** Find the Optimal Actions (2 pts)

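Assume that we have obtained the optimal value function $V_{i}(s_{i})$ for all $i$ and $s_i$. Then we can recover an optimal sequence of actions $a_0,...,a_{n-1}$ with a forward pass: start from $s_0 = W$ and, at each stage $i$, choose the action that attains $V_i(s_i)$.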
 
Please complete the Python function `find_optimal_actions` in the next cell.

Inputs:
   - `n`: the number of items, i.e., $n$.
 
   - `W`: the weight limit of the knapsack, i.e., $W$.
   
   - `weights`: the weights of the items. It is a numpy array with size $n$. `weights[i]` represents $w_i$.

   - `values`: the values of the items. It is a numpy array with size $n$. The precision is up to 4 decimal places. `values[i]` represents $v_i$.

   - `value_function`: the optimal value function $V_{i}(s_{i})$. It is a numpy array with shape `(n + 1, W + 1)`. The precision is up to 4 decimal places. `value_function[i, s_i]` represents $V_i(s_i)$. For example, `value_function[0, 1]` is the value of $V_{0}(1)$.

Output:
   - `opt_actions`: the optimal actions $a_0, a_1,...,a_{n-1}$. It is a numpy array with size $n$. `opt_actions[i]` represents $a_i$. For example, `opt_actions[0]=1` means that item $0$ is included in the knapsack.
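
Because an optimal sequence of actions need not be unique, it can be more robust to verify a candidate by its properties than to compare it with one fixed answer. The sketch below checks that the selected items fit in the knapsack and that their total value equals $V_0(W)$. The name `check_actions` and the tolerance `tol` are assumptions made for this illustration; the tolerance is set to half of the 4-decimal resolution of the values.

```python
import numpy as np


def check_actions(W, weights, values, value_function, actions, tol=5e-5):
    """Return True if `actions` is feasible and attains V_0(W)."""
    actions = np.asarray(actions)
    total_weight = int(np.dot(actions, weights))   # weight of the selected items
    total_value = float(np.dot(actions, values))   # value of the selected items
    feasible = total_weight <= W
    optimal = abs(total_value - value_function[0, W]) < tol
    return bool(feasible and optimal)
```

On the sample instance with `W = 3`, `weights = np.array([2, 1, 3])`, and `values = np.array([8.0, 9.0, 10.0])`, the actions `[1, 1, 0]` pass the check, while `[0, 0, 1]` is feasible but reaches only a total value of 10 and therefore fails.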


In [None]:
def find_optimal_actions(n, W, weights, values, value_function):
    """
    Find the optimal actions $a_0,...,a_{n-1}$ using the optimal value function.
    Args:
        n: the number of items, i.e., $n$.
        W: the weight limit of the knapsack, i.e., $W$.
        weights: the weights of the items. It is a numpy array with size $n$. weights[i] represents $w_i$.
        values: the values of the items. It is a numpy array with size $n$. values[i] represents $v_i$.
        value_function: optimal value function, a numpy array with shape (n + 1, W + 1). value_function[i, s_i] represents $V_i(s_i)$.
    Returns:
        opt_actions: the optimal actions, a numpy array with size n. opt_actions[i] represents $a_i$.
    """
    opt_actions = np.zeros((n,), dtype=int)
    ### BEGIN SOLUTION
    s = W
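    for i in range(n):
        # Take item i only if it fits and taking it is strictly better than skipping it.
        if weights[i] <= s and value_function[i + 1, s - weights[i]] + values[i] > value_function[i + 1, s]:
            opt_actions[i] = 1
            s = s - weights[i]
        # Otherwise a_i stays 0 (the array's initial value); ties are broken toward skipping.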
    ### END SOLUTION
    
    return opt_actions


In [None]:
# Sample Test, checking the output of your function find_optimal_actions

# Sample input
n = 3
W = 3
weights = np.array([2, 1, 3])
values = np.array([8.0, 9.0, 10.0])
value_function = np.array([[ 0.0, 9.0, 9.0, 17.0],
                           [ 0.0, 9.0, 9.0, 10.0],
                           [ 0.0, 0.0, 0.0, 10.0],
                           [ 0.0, 0.0, 0.0, 0.0]])

# Sample output
opt_actions = np.array([1, 1, 0])

# Sample test
func_out = find_optimal_actions(n, W, weights, values, value_function)
for i in range(n):
    assert round(func_out[i]) == round(opt_actions[i]), "Question 3: The sample test failed."


In [None]:
# Hidden Test 1, checking the output of your function find_optimal_actions
### BEGIN HIDDEN TESTS
# generate test samples
n = 5
W = 10
weights = np.array([7, 6, 6, 8, 4])
values = np.array([4.0303, 7.4523, 5.2691, 4.8768, 0.0055])
tr_v_func_secr = np.array([[0.0000, 0.0000, 0.0000, 0.0000, 0.0055, 0.0055, 7.4523, 7.4523, 7.4523, 7.4523, 7.4578],
                           [0.0000, 0.0000, 0.0000, 0.0000, 0.0055, 0.0055, 7.4523, 7.4523, 7.4523, 7.4523, 7.4578],
                           [0.0000, 0.0000, 0.0000, 0.0000, 0.0055, 0.0055, 5.2691, 5.2691, 5.2691, 5.2691, 5.2746],
                           [0.0000, 0.0000, 0.0000, 0.0000, 0.0055, 0.0055, 0.0055, 0.0055, 4.8768, 4.8768, 4.8768],
                           [0.0000, 0.0000, 0.0000, 0.0000, 0.0055, 0.0055, 0.0055, 0.0055, 0.0055, 0.0055, 0.0055],
                           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
 
opt_act_secr = np.array([0, 1, 0, 0, 1])
student = find_optimal_actions(n, W, weights, values, tr_v_func_secr)
for i in range(n):
    assert round(student[i]) == round(opt_act_secr[i]), "Question 3, Test 1, the output value of your function find_optimal_actions does not match expected."
### END HIDDEN TESTS

In [None]:
# Hidden Test 2, checking the output of your function find_optimal_actions
### BEGIN HIDDEN TESTS
# generate test samples
np.random.seed(0)
n = 500
W = 100
weights = np.random.randint(low=1, high=W+1, size=n)
values = (np.random.rand(n) * 10).round(4)
tr_v_func_secr = np.zeros((n + 1, W + 1))
for i in range(n, -1, -1):
    for j in range(W + 1):
        if i != n and j != 0:
            if weights[i] <= j:
                tr_v_func_secr[i, j] = max(tr_v_func_secr[i + 1, j - weights[i]] + values[i], tr_v_func_secr[i + 1, j])
            else:
                tr_v_func_secr[i, j] = tr_v_func_secr[i + 1, j]

student = find_optimal_actions(n, W, weights, values, tr_v_func_secr.round(4))
sum_weights = 0
sum_values = 0.0
for i in range(n):
    if student[i] == 1:
        sum_weights = sum_weights + weights[i]
        sum_values = sum_values + values[i]
        
assert (sum_weights <= W) and (round(sum_values, 4) == round(tr_v_func_secr[0, W], 4)), "Question 3, Test 2, the output value of your function find_optimal_actions does not match expected."
### END HIDDEN TESTS