STA 220 Homework 2
===========

- Do not distribute
- Do the entire homework here in the notebook by adding cells below the given exercise.
- When turning in the homework, simply submit your notebook file to Canvas.
- Obviously, do not copy the code from some online source or the other students.

## Your Name: Chenghan Sun

## Your Student ID: 915030521

The multi-armed bandit framework assumes that you are in a two-player game in which you select a number from 1 to K (we call this number the arm that you pulled).  The other player then selects a reward, $r_{i,t}$, based on the arm $i$ that you selected at time $t$, and reveals it to you.  Your job is to come up with a policy that determines which arm to pull at a given time based on the past performance of the arms.

The name multi-armed bandit comes from the gambling world, in which a slot machine is called a one-armed bandit.  In that setting, you pull the arm and receive some reward.  In this fictional setting there are multiple arms for the slot machine, each paying out different rewards.  Because you can only pull one at a time, you only see the reward from the arm you pulled.  This partial observability puts you in the challenging position of needing to explore the arms, to see which performs better, before you start to exploit the best arm.  Below is a simulation from a simple multi-armed bandit.

In [1]:
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')

In [2]:
class SimpleBandit:
    def __init__(self):
        self._mu = np.array([-1.,-2.,1.5,0.5,-0.25,.75,.1,1.8,-3])  # log-odds of a reward for each arm
        self._p = 1 / (1 + np.exp(-self._mu))  # reward probability of each arm (logistic of _mu)
        self.num_arms = len(self._mu)  # number of arms
        self.total_rewards = np.zeros(len(self._mu))  # total reward each arm would have paid so far

    def pull(self, arms):
        # draw a Bernoulli reward for every arm, but reveal only the requested arm(s)
        self.current_rewards = np.random.binomial(1, self._p)
        self.total_rewards += self.current_rewards
        return self.current_rewards[arms]

In [3]:
band = SimpleBandit()  # fresh bandit
[band.pull(1) for t in range(10)]  # pull arm 1 ten times

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

The output above shows that pulling arm 1 ten times gave a total reward of 1 for that arm, because it returned a 1 once.  Our rewards are binary, only 0 or 1.  Because `pull` draws a reward for every arm on each call, you can also see the total reward that each arm would have earned below.

In [4]:
band.total_rewards  # for each arm 

array([ 4.,  1., 10.,  9.,  5.,  5.,  4.,  9.,  2.])

One simple policy is to always pull arm 2, and another is to always pull arm 1.  We can compare these with the following:

In [5]:
rewards = np.array([band.pull([1,2]) for t in range(100)])

`rewards` now holds the rewards for each policy, one column per policy (arm 1 first, then arm 2).  We can compare the total rewards of these policies below.

In [6]:
rewards.sum(axis=0)  # total reward of arms 1 and 2 over 100 time steps

array([18, 81])

It seems that 'always pull 2' is a better policy.  Another simple policy is to randomly select an arm and pull it.  This can be seen as a pure exploration policy.  All of your policies should have the **select_arm** method which tells you which arm to pull, and the **update_reward** method which updates any internal state information based on the observed reward.  In this case nothing needs to be updated.

In [7]:
class RandomPolicy:
    """
    Random policy, pure exploration
    """
    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.current_arm = None
        
    def select_arm(self):
        """
        choose which arm to pull
        """
        self.current_arm = np.random.randint(self.num_arms)
        return self.current_arm
    
    def update_reward(self, reward):
        """
        enter observed reward
        """
        return None

**Exercise 1.** The regret of a bandit policy is the difference between the total reward you would get from the "best" arm in hindsight and the total reward your policy received.  So if your policy selected arms $i_1,\ldots,i_T$ over a total of $T$ time steps, then the regret is 
$$\max_j \sum_{t=1}^T r_{j,t} - \sum_{t=1}^T r_{i_t,t}.$$
The SimpleBandit class keeps track of the reward you would have received by always pulling a given arm in the ``total_rewards`` attribute (`total_rewards[i]` is the total reward received if arm i is pulled every time).  Fill in the function below, which takes a list of policies to play over T time steps and returns a list of regrets, one per policy.  Test it on the simple bandit and `[rand]`, the random policy.
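
The regret formula can be checked by hand on a small reward table. The sketch below is only an illustration: the function name `regret` and the arrays `rewards` and `pulled` are made up for this example and are not part of the homework template.

```python
import numpy as np

def regret(rewards, pulled):
    """Regret of a single policy.

    rewards: (T, K) array, rewards[t, j] is the reward arm j paid at time t
    pulled:  length-T sequence, pulled[t] is the arm the policy chose at time t
    """
    rewards = np.asarray(rewards)
    pulled = np.asarray(pulled)
    # max_j sum_t r_{j,t}: total reward of the best single arm in hindsight
    best_in_hindsight = rewards.sum(axis=0).max()
    # sum_t r_{i_t,t}: reward actually collected by the policy
    collected = rewards[np.arange(len(pulled)), pulled].sum()
    return best_in_hindsight - collected

# Hypothetical rewards for K = 3 arms over T = 4 time steps
rewards = np.array([[1, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1],
                    [0, 0, 0]])
pulled = [0, 1, 2, 0]
print(regret(rewards, pulled))  # best arm earns 3, policy earns 1 + 0 + 1 + 0 = 2, so regret is 1
```

Here arm 2 is the best arm in hindsight, so a policy that always pulls arm 2 has regret 0.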

In [8]:
def run_trajectory(bandit, policies, T):
    """
    Run T steps of the bandit for each policy in turn
    
    Arguments: 
    bandit: 
        a fresh instance of a Bandit class, 
        in this homework you will always be using an instance of the SimpleBandit class
        (note: this implementation ignores the argument and creates a new
        SimpleBandit for every policy)
    
    policies:
        a list like [policy1, policy2, policy3 ...]
        each policy has a select_arm method and an update_reward method
    
    Output: 
        regret of each policy in the list, like
        [regret1, regret2, ...]
    """
    regrets_list = []  # one regret per policy
    for i, policy in enumerate(policies):  # i is the policy index
        bandit = SimpleBandit()  # reset the simulation for this policy
        selected_arms = []  # arm chosen at each time step
        arms_current_rewards = []  # reward observed at each time step
        for _ in range(T):
            arm = policy.select_arm()
            reward = bandit.pull(arm)  # reward of the chosen arm at this time step
            policy.update_reward(reward)  # let the policy learn from the observed reward
            selected_arms.append(arm)
            arms_current_rewards.append(reward)
        actual_rewards = sum(arms_current_rewards)
        print(f"The {i}th policy selected these arms: {selected_arms} under {T} time steps")
        print(f"The current binary rewards for {T} arms: {arms_current_rewards}")
        print(f"The actually rewards received from the {i}th pilicy: {actual_rewards}")
        print(f"The potential total rewards under the {i}th policy for each arm: {bandit.total_rewards}")
        regret_i = np.amax(bandit.total_rewards) - actual_rewards  # best arm in hindsight minus collected reward
        print(f"The regrets under the {i}th policy: {regret_i}")
        regrets_list.append(regret_i)
    return regrets_list

In [10]:
""" DEMO run_trajectory
"""
# RandomPolicy(k) only explores arms 0..k-1 of the 9-armed bandit
p1 = RandomPolicy(3)
p2 = RandomPolicy(4)
p3 = RandomPolicy(5)
regrets = [run_trajectory(SimpleBandit(), [p1, p2, p3], 15)]
print(regrets)
np.mean(regrets)

The 0th policy selected these arms: [0, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 1, 2, 0, 0] under 15 time steps
The current binary rewards for 15 arms: [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
The actually rewards received from the 0th pilicy: 6
The potential total rewards under the 0th policy for each arm: [ 5.  1. 11.  9.  5. 10.  7. 13.  0.]
The regrets under the 0th policy: 7.0
The 1th policy selected these arms: [3, 3, 2, 1, 1, 3, 2, 0, 2, 2, 2, 1, 3, 2, 1] under 15 time steps
The current binary rewards for 15 arms: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
The actually rewards received from the 1th pilicy: 9
The potential total rewards under the 1th policy for each arm: [ 4.  1. 12.  9.  6. 10.  7. 15.  1.]
The regrets under the 1th policy: 6.0
The 2th policy selected these arms: [2, 3, 2, 0, 3, 4, 3, 4, 3, 2, 1, 0, 2, 3, 4] under 15 time steps
The current binary rewards for 15 arms: [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
The actually rewards received from the 2th pilicy: 7

6.666666666666667

**Exercise 2.** Epsilon-greedy is a policy that has a few roughly equivalent variants, but the one we will use is the following:  at each time step, with probability epsilon, pull an arm at random; otherwise pull the current best arm.  The current best arm is the one which has the most total reward up to that time.  If there is a tie for the current best arm, break the tie randomly.

Implement epsilon-greedy in the following class.  Test it by having it compete once against the random policy with epsilon = 0.1 for T = 1000 time steps.
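
One possible reading of the rule above is sketched here; it is not an official solution. The class name `EpsilonGreedyPolicy`, the `totals` attribute, and the optional `rng` argument are assumptions made for this example. "Most total reward" is taken to mean the total reward this policy has itself observed from each arm.

```python
import numpy as np

class EpsilonGreedyPolicy:
    """Sketch of epsilon-greedy with random tie-breaking (illustrative only)."""

    def __init__(self, num_arms, epsilon=0.1, rng=None):
        self.num_arms = num_arms
        self.epsilon = epsilon
        # rng is a hypothetical convenience argument for reproducible runs
        self.rng = np.random.default_rng() if rng is None else rng
        self.totals = np.zeros(num_arms)  # total observed reward per arm
        self.current_arm = None

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            # explore: pick any arm uniformly at random
            self.current_arm = int(self.rng.integers(self.num_arms))
        else:
            # exploit: pick among the arms tied for the largest total reward
            best = np.flatnonzero(self.totals == self.totals.max())
            self.current_arm = int(self.rng.choice(best))
        return self.current_arm

    def update_reward(self, reward):
        # credit the observed reward to the arm that was just pulled
        self.totals[self.current_arm] += reward

# Usage: a few steps against hypothetical Bernoulli arms
rng = np.random.default_rng(0)
policy = EpsilonGreedyPolicy(num_arms=3, epsilon=0.1, rng=rng)
probs = np.array([0.2, 0.5, 0.8])  # made-up reward probabilities
for _ in range(100):
    arm = policy.select_arm()
    policy.update_reward(rng.binomial(1, probs[arm]))
print(policy.totals)
```

Setting `epsilon=0` turns this into a purely greedy policy, which makes the exploit branch easy to check in isolation.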

**Exercise 3.** Exp3 is a policy that is more nuanced.  The idea is that at each time point you pull an arm with some probability `pi[i]` which is updated based on the performance of the arm.  Look at the full algorithm in http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf, Algorithm 1.  

Implement this version of the Exp3 algorithm in the following class.  Test it by having it compete once against the random policy for T = 1000 time steps.
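
The sketch below shows the general shape of a loss-based Exp3 update with importance-weighted loss estimates. It is written from memory of the standard algorithm and should be checked against Algorithm 1 in the linked paper before being relied on. Three choices are assumptions made for this example: the loss is taken as `1 - reward`, the learning rate is `sqrt(log(K) / (t * K))`, and the class and attribute names (`Exp3Policy`, `cum_loss`, `rng`) are hypothetical.

```python
import numpy as np

class Exp3Policy:
    """Sketch of Exp3 with a decreasing learning rate (illustrative only)."""

    def __init__(self, num_arms, rng=None):
        self.num_arms = num_arms
        self.rng = np.random.default_rng() if rng is None else rng
        self.cum_loss = np.zeros(num_arms)           # importance-weighted cumulative loss estimates
        self.pi = np.full(num_arms, 1.0 / num_arms)  # current sampling probabilities pi[i]
        self.t = 0
        self.current_arm = None

    def select_arm(self):
        self.t += 1
        # assumed learning-rate schedule, not taken from the homework text
        eta = np.sqrt(np.log(self.num_arms) / (self.t * self.num_arms))
        logits = -eta * self.cum_loss
        logits -= logits.max()  # subtract the max for numerical stability
        weights = np.exp(logits)
        self.pi = weights / weights.sum()
        self.current_arm = int(self.rng.choice(self.num_arms, p=self.pi))
        return self.current_arm

    def update_reward(self, reward):
        loss = 1.0 - reward  # assumption: convert a reward in [0, 1] to a loss
        # only the pulled arm is updated, reweighted by its sampling probability
        self.cum_loss[self.current_arm] += loss / self.pi[self.current_arm]

# Usage: one step with a zero reward lowers the pulled arm's probability
policy = Exp3Policy(num_arms=3, rng=np.random.default_rng(0))
first_arm = policy.select_arm()
policy.update_reward(0)
policy.select_arm()
print(policy.pi)
```

Dividing the loss by `pi[i]` keeps the loss estimate unbiased even though only one arm is observed per step.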

**Exercise 4.** The Upper Confidence Bound (UCB) algorithm is intuitive.  You pull the arm that has the largest upper confidence bound for the mean reward.  An arm has a large upper bound either because it is performing well or because you have not pulled it much, which results in a wide confidence interval.  The UCB for arm i is then,
$$ U_{i,t} = \hat \mu_{i,t} + C \sqrt{\frac{\log(t)}{n_{i,t}}}$$
where $\hat \mu_{i,t}$ is the mean reward of arm $i$ up to time $t$ and $n_{i,t}$ is the number of times that arm has been pulled up to time $t$.  $C$ is a tuning parameter passed as an argument (it can be taken to be 2 by default).

Implement this version of the UCB algorithm in the following class.  Test it by having it compete once against the random policy for T = 1000 time steps with $C = 2$.
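
A minimal sketch of the UCB rule above follows; it is not an official solution. The formula is undefined while $n_{i,t} = 0$, so this example assumes each arm is pulled once before the bound is used. That initialisation, the class name `UCBPolicy`, and the `counts` and `totals` attributes are assumptions made for the example.

```python
import numpy as np

class UCBPolicy:
    """Sketch of UCB with exploration constant C (illustrative only)."""

    def __init__(self, num_arms, C=2.0):
        self.num_arms = num_arms
        self.C = C
        self.counts = np.zeros(num_arms)  # n_{i,t}: number of pulls of each arm
        self.totals = np.zeros(num_arms)  # total observed reward of each arm
        self.t = 0
        self.current_arm = None

    def select_arm(self):
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:
            # assumption: pull every arm once first so that n_{i,t} > 0
            self.current_arm = int(untried[0])
        else:
            means = self.totals / self.counts                       # mu_hat_{i,t}
            bonus = self.C * np.sqrt(np.log(self.t) / self.counts)  # confidence width
            self.current_arm = int(np.argmax(means + bonus))
        return self.current_arm

    def update_reward(self, reward):
        self.counts[self.current_arm] += 1
        self.totals[self.current_arm] += reward

# Usage: after one pull of each arm, the arm that paid out is preferred
policy = UCBPolicy(num_arms=3, C=2.0)
first_three = []
for reward in [0, 1, 0]:
    first_three.append(policy.select_arm())
    policy.update_reward(reward)
fourth = policy.select_arm()
print(first_three, fourth)
```

After the three initial pulls every arm has the same bonus, so the fourth choice is decided by the mean rewards alone.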

**Exercise 5.** 

1. Modify the run_trajectory function to output all of the regrets up to each time in a T x K array (one row per time step, one column for each of the K competing policies).
2. Try 4 different values of epsilon for epsilon-greedy: 0.01,0.05,0.1,0.2 and have them compete.  Plot the regrets as a function of t.  Note the best performing selection of epsilon at $T = 1000$.
3. Try 4 different values of C in UCB: 1,2,3,4 and have them compete.  Plot the regrets as a function of t.  Note the best performing selection of C at $T = 1000$.
4. Using the optimal values of C and epsilon, have all four methods compete. Plot the regrets as a function of t, and remark on whether the relative performance changes over time.  Is one algorithm always dominant?  Draw any other conclusions.
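
For part 1, the regret up to each time can be computed from cumulative sums once the per-step rewards and pulls are recorded. The sketch below assumes the rewards of all arms are stored in a (T, number of arms) array and the pulls of all policies in a (T, number of policies) array. The function name `cumulative_regret` and both array layouts are assumptions made for this example.

```python
import numpy as np

def cumulative_regret(rewards, pulls):
    """Regret of each policy up to each time step.

    rewards: (T, A) array, rewards[t, j] is the reward arm j paid at time t
    pulls:   (T, P) integer array, pulls[t, k] is the arm policy k chose at time t
    returns: (T, P) array, entry [t, k] is policy k's regret after t + 1 steps
    """
    rewards = np.asarray(rewards, dtype=float)
    pulls = np.asarray(pulls)
    T = rewards.shape[0]
    # best single arm in hindsight, re-evaluated at every time step
    best_so_far = rewards.cumsum(axis=0).max(axis=1)
    # reward each policy has collected so far
    collected = rewards[np.arange(T)[:, None], pulls].cumsum(axis=0)
    return best_so_far[:, None] - collected

# Hypothetical data: 3 arms, 2 policies, 4 time steps
rewards = np.array([[1, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1],
                    [0, 0, 0]])
pulls = np.array([[0, 2],
                  [1, 2],
                  [2, 2],
                  [0, 2]])
print(cumulative_regret(rewards, pulls))
```

Each column of the result is one regret curve, so passing the array to `plt.plot` would draw one line per policy.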