STA 220 Homework 2
===========

- Do not distribute
- Do the entire homework here in the notebook by adding cells below the given exercise.
- When turning the homework in, simply submit your notebook file to Canvas.
- Do not copy code from online sources or from other students.
- After finishing the homework, please select Kernel -> Restart & Run All, then save and submit.

The multi-armed bandit framework assumes that you are in a two-player game where you select a number from 1 to $K$ (we call this number the arm that you pulled).  The other player then selects a reward $r_{i,t}$, based on the arm $i$ that you selected at time $t$, and reveals it to you.  Your job is to come up with a policy that determines which arm to pull at a given time based on the past performance of the arms.

The name multi-armed bandit comes from the gambling world, in which a slot machine is called a one-armed bandit.  In that setting, you pull the arm and receive some reward.  In this fictional setting there are multiple arms for the slot machine, each paying out different rewards.  Because you can only pull one at a time, you only see the reward from the arm you pulled.  This partial observability puts you in the challenging position of needing to explore the arms, seeing which has better performance, before you start to exploit the best arm.  Below is a simulation from a simple multi-armed bandit.

In [None]:
import numpy as np
import time

In [None]:
class SimpleBandit:
    '''
    The bandit class you will use in this homework. DO NOT modify
    '''
    def __init__(self):
        self._mu = np.array([-1.,-2.,1.5,0.5,-0.25,.75,.1,1.8,-3])
        self._p = 1 / (1 + np.exp(-self._mu))
        self.num_arms = len(self._mu)
        self.total_rewards = np.zeros(len(self._mu))
        
    def pull(self,arms):
        self.current_rewards = np.random.binomial(1,self._p)
        self.total_rewards += self.current_rewards
        return self.current_rewards[arms]

In [None]:
np.random.seed(5)
band = SimpleBandit()
[band.pull(1) for t in range(10)]

In the above we see that if we had pulled arm 1 ten times, then our total reward for that arm would be 2, because it returned a 1 twice.  Our rewards are binary, only 0 or 1.  You can also see the total rewards from each arm below.

In [None]:
band.total_rewards

One simple policy is to always pull arm 2, and another is to always pull arm 1.  We can compare these with the following:

In [None]:
rewards = np.array([band.pull([1,2]) for t in range(100)])

`rewards` now holds the rewards for each policy, one policy per column.  We can compare the total rewards of these policies below.

In [None]:
rewards.sum(axis=0)

It seems that 'always pull 2' is a better policy.  Another simple policy is to randomly select an arm and pull it.  This can be seen as a pure exploration policy.  All of your policies should have the select_arm method which tells you which arm to pull, and the update_reward method which updates any internal state information based on the observed reward.  In this case nothing needs to be updated.

In [None]:
class RandomPolicy:
    """
    Random policy, pure exploration. DO NOT modify
    """
    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.current_arm = None
        
    def select_arm(self):
        """
        choose which arm to pull
        """
        self.current_arm = np.random.randint(self.num_arms)
        return self.current_arm
    
    def update_reward(self, reward):
        """
        enter observed reward
        """
        return None

**Exercise 1.** The regret of a bandit policy is the difference between the total reward you would get from the "best" arm in hindsight and the total reward your policy received.  So if your policy selected arms $i_1,\ldots,i_T$ over a total of $T$ time steps, then the regret is 
$$\max_j \sum_{t=1}^T r_{j,t} - \sum_{t=1}^T r_{i_t,t}.$$
The SimpleBandit class keeps track of the reward you would have received if you had just pulled a given arm in the ``total_rewards`` attribute (`total_rewards[i]` is the total reward received if arm i is pulled each time).  Fill in the function below, which takes a list of policies to play over T time steps and returns a list of regrets, one for each policy.  Test it on the simple bandit and random policy.
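
As a small worked illustration of the regret formula above, the following sketch computes the regret on a made-up reward table. The names `toy_rewards` and `toy_choices` are hypothetical and are not part of the homework code; this is only meant to show the arithmetic, not how `run_trajectory` should be written.

```python
import numpy as np

# Hypothetical reward table: rows are time steps t, columns are arms j.
toy_rewards = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Hypothetical arms i_1, ..., i_T chosen by some policy.
toy_choices = np.array([0, 1, 2, 0])

# First term of the regret: total reward of the best single arm in hindsight.
best_in_hindsight = toy_rewards.sum(axis=0).max()

# Second term: the reward the policy actually collected, entry (t, i_t) for each t.
policy_reward = toy_rewards[np.arange(len(toy_choices)), toy_choices].sum()

toy_regret = best_in_hindsight - policy_reward
print(best_in_hindsight, policy_reward, toy_regret)
```

Here the column sums are 2, 1, and 3, so the best arm in hindsight earns 3, the policy earns 1, and the regret is 2.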

In [None]:
def run_trajectory(bandit, policies, T):
    """
    Run T steps of bandit pulling each policy in each time step
    
    Arguments:
    bandit: 
        a fresh instance of a Bandit class, 
        in this homework you will always be using an instance of the SimpleBandit class
    
    policies:
        a list like [policy1, policy2, policy3 ...]
        each of the policy will have select_arm method and update_reward method
    
    Output: 
        regret of each policy in list, like
        [regret1, regret2, ...]
    """

In [None]:
# test code, do not change
t1 = time.time()
# RandomPolicy requires num_arms; SimpleBandit defines 9 arms
regrets = [run_trajectory(SimpleBandit(), [RandomPolicy(9)], 1000) for _ in range(10)]
print('Time Usage : {}'.format(time.time() - t1))
np.mean(regrets)

**Exercise 2.** Epsilon-greedy is a policy that has a few roughly equivalent variants, but the one we will use is the following:  At each time step, with probability epsilon, pull an arm at random; otherwise pull the current best arm.  The current best arm is the one which has the most total reward up to that time.  If there is a tie for the current best arm, break the tie randomly.
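
The selection rule described above, including random tie-breaking, might be sketched roughly as follows. The helper names `argmax_random_ties` and `epsilon_greedy_choice` are hypothetical, and the sketch assumes NumPy's `default_rng` generator rather than the `np.random.seed` style used elsewhere in this notebook. It only illustrates one decision step, not the full policy class.

```python
import numpy as np

def argmax_random_ties(values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    values = np.asarray(values)
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))

def epsilon_greedy_choice(total_rewards, eps, rng):
    """With probability eps explore uniformly; otherwise exploit the best arm."""
    if rng.random() < eps:
        return int(rng.integers(len(total_rewards)))
    return argmax_random_ties(total_rewards, rng)

rng = np.random.default_rng(0)

# Arms 1 and 2 are tied for the most reward, so both should get picked.
picks = {argmax_random_ties([3, 5, 5, 1], rng) for _ in range(200)}
print(picks)

# With eps = 0 the rule never explores and always returns the unique best arm.
print(epsilon_greedy_choice([0, 0, 7], 0.0, rng))
```

Note that a plain `np.argmax` would always return the first maximal index, which is why the ties are collected explicitly here.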

Implement epsilon-greedy in the following class.  Test it by having it compete once with the random policy for epsilon=0.1 for T = 1000 time points.

In [None]:
# change the class below
class Epsilon_Greedy_Policy:
    
    def __init__(self, num_arms, eps):
        self.num_arms = num_arms
        self.current_arm = None
        
    def select_arm(self):
        None
        
    def update_reward(self, reward):
        None
        
    # more methods if you need

In [None]:
# test code
t1 = time.time()
regrets = []

# Your code here, 
    # generate 10 regrets with simple bandit and epsilon-greedy policy, 
    # save results in regrets,
    # refer to second block in exercise 1

print('Time Usage : {}'.format(time.time() - t1))
np.mean(regrets)

**Exercise 3.** Exp3 is a policy that is more nuanced.  The idea is that at each time point you pull an arm with some probability `pi[i]` which is updated based on the performance of the arm.  Look at the full algorithm in http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf, Algorithm 1.  
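
To make the idea of performance-dependent probabilities concrete, here is a generic exponential-weights sketch. It is an assumption-laden illustration and is not claimed to match Algorithm 1 in the linked paper, which may be stated in terms of losses and may prescribe its own learning-rate schedule. The function names and the fixed `eta` are hypothetical.

```python
import numpy as np

def exp_weights_probs(cum_estimates, eta):
    """Turn cumulative reward estimates into sampling probabilities pi."""
    scaled = eta * np.asarray(cum_estimates, dtype=float)
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def importance_weighted_update(cum_estimates, arm, reward, probs):
    """Add reward / pi[arm] to the pulled arm only; other arms are unchanged."""
    updated = np.array(cum_estimates, dtype=float)
    updated[arm] += reward / probs[arm]
    return updated

estimates = np.zeros(3)
pi = exp_weights_probs(estimates, eta=0.1)  # uniform before any pulls
estimates = importance_weighted_update(estimates, arm=2, reward=1, probs=pi)
pi_next = exp_weights_probs(estimates, eta=0.1)
print(pi, pi_next)
```

Dividing the observed reward by the probability of pulling that arm compensates for the fact that only one arm is observed per step.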

Implement this version of the Exp3 algorithm in the following class.  Test it by having it compete once with the random policy for T = 1000 time points.

In [None]:
# change the class below
class Exp3:
    
    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.current_arm = None
        
    def select_arm(self):
        None
        
    def update_reward(self, reward):
        None
        
    # more methods if you need

In [None]:
# test code
t1 = time.time()
regrets = []

# Your code here, 
    # generate 10 regrets with simple bandit and Exp3 policy, 
    # save results in regrets,
    # refer to second block in exercise 1

print('Time Usage : {}'.format(time.time() - t1))
np.mean(regrets)

**Exercise 4.** The Upper Confidence Bound (UCB) algorithm is intuitive.  You pull the arm that has the largest upper confidence bound for the mean reward.  An arm has a large upper confidence bound either because it is performing well or because you have not pulled it much, which results in a wide confidence interval.  The UCB for arm i is then,
$$ U_{i,t} = \hat \mu_{i,t} + C \sqrt{\frac{\log(t)}{n_{i,t} + 1}}$$
where $\hat \mu_{i,t}$ is the mean reward of arm $i$ up to time $t$ and $n_{i,t}$ is the number of times that arm has been pulled.  $C$ is an argument (can be taken to be 2 by default).
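
The formula can be evaluated on a few made-up numbers to see how the exploration bonus behaves. The function `ucb_scores` and the example values below are hypothetical and only illustrate the arithmetic of $U_{i,t}$, not a full policy.

```python
import numpy as np

def ucb_scores(mean_rewards, pull_counts, t, C=2.0):
    """Evaluate U_{i,t} = mu_hat + C * sqrt(log(t) / (n + 1)) for every arm."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    pull_counts = np.asarray(pull_counts, dtype=float)
    return mean_rewards + C * np.sqrt(np.log(t) / (pull_counts + 1))

# Arm 0: pulled often with a high mean. Arm 1: pulled a few times. Arm 2: never pulled.
scores = ucb_scores([0.8, 0.5, 0.0], [50, 5, 0], t=56)
print(scores)
print(int(np.argmax(scores)))
```

In this example the never-pulled arm gets the largest score even though its estimated mean is 0, because its bonus term is the widest. The `+ 1` in the denominator keeps the score finite for arms with zero pulls.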

Implement this version of the UCB algorithm in the following class.  Test it by having it compete once with the random policy for T = 1000 time points with $C = 2$.

In [None]:
# change the class below
class UCB:
    
    def __init__(self, num_arms, C):
        self.num_arms = num_arms
        self.current_arm = None
        
    def select_arm(self):
        None
        
    def update_reward(self, reward):
        None
        
    # more methods if you need

In [None]:
# test code
t1 = time.time()
regrets = []

# Your code here, 
    # generate 10 regrets with simple bandit and UCB policy, 
    # save results in regrets,
    # refer to second block in exercise 1

print('Time Usage : {}'.format(time.time() - t1))
np.mean(regrets)

**Exercise 5.** 

1. Modify the `run_trajectory` function to output the regret of each policy at every time step, as a T x K array (for K policies).
2. Try 4 different values of $\epsilon$ for epsilon-greedy: 0.01, 0.05, 0.1, 0.2 and have them compete.  Plot the regrets as a function of t.  Note the best performing selection of $\epsilon$ at $T = 1000$.
3. Try 4 different values of $C$ in UCB: 0.5, 0.75, 1, 2 and have them compete.  Plot the regrets as a function of t.  Note the best performing selection of $C$ at $T = 1000$.
4. Using the optimal values of $C$ and $\epsilon$, have all four methods compete. Plot the regrets as a function of t, and remark on whether the relative performance changes over time.  Is one algorithm always dominant?  State any other conclusions.

**Note**: 
  + Please finish the 4 parts in 4 code blocks.
  + For parts 2, 3, and 4, you should call `run_trajectory` in each of them. At the end of each part, you need to write a markdown block to explain your findings and state your conclusion.
  + For each part, set the figure size to (17, 8), plot all lines in the same graph, and set a proper title. DO NOT create more than one subplot. Include a legend in your plots so that they are easy to read.
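
For part 1, one possible reading of "regret up to that time" is the regret formula from Exercise 1 applied to the first t steps only. Under that assumption, a running regret curve for a single policy can be obtained with cumulative sums, as in this sketch. The function `running_regret` and the toy arrays are hypothetical and are not part of the homework code.

```python
import numpy as np

def running_regret(all_rewards, chosen_arms):
    """Regret after each step: best arm so far minus reward collected so far."""
    all_rewards = np.asarray(all_rewards)
    chosen_arms = np.asarray(chosen_arms)
    steps = np.arange(len(chosen_arms))
    # Cumulative reward of every arm, then the best arm in hindsight at each t.
    best_so_far = all_rewards.cumsum(axis=0).max(axis=1)
    # Cumulative reward actually collected by the policy.
    collected = all_rewards[steps, chosen_arms].cumsum()
    return best_so_far - collected

# Hypothetical rewards (rows are time steps, columns are arms) and choices.
toy_rewards = np.array([[1, 0], [0, 1], [0, 1], [1, 1]])
toy_choices = np.array([1, 1, 0, 0])
curve = running_regret(toy_rewards, toy_choices)
print(curve)
```

Stacking one such curve per policy as columns would give an array of the requested T x K shape.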

In [None]:
# Part 1. Write run_trajectory function here again, do not modify the above block

In [None]:
# Part 2 code, output your plot

### Findings and conclusion for part 2

In [None]:
# Part 3 code, output your plot

### Findings and conclusion for part 3

In [None]:
# Part 4 code, output your plot

### Findings and conclusion for part 4