# Bandit Assignment

This assignment should be done in groups of 3 and consists of a number of implementation and theory problems based on the topics discussed in the lectures and the course literature (specifically, **version 5** on arXiv):

[Bandits] *Aleksandrs Slivkins, [Introduction to Multi-Armed Bandits](https://arxiv.org/pdf/1904.07272v5.pdf), Found. Trends Mach. Learn. 12(1-2): 1-286 (2019)*

In the implementation problems **(1, 2, 3 and 5)**, you will implement multi-armed bandit algorithms from the [Bandits] book and use them in a provided multi-armed bandit environment. These problems will be graded based on the correctness of the code.

In the theory problems **(4 and 6)**, you will derive some properties of the algorithms. These problems will be graded based on the correctness of the arguments.

You may use the Python libraries imported below (*numpy*, *scipy.stats* and *pandas*).

The assignment should be handed in as an updated notebook. Run the entire notebook before handing it in, so that the plots are visible. Ensure that it is completely runnable, in case we want to reproduce the results.

## Setup

The cell below contains imports. It may not be modified!

In [None]:
# DO NOT MODIFY
import pandas as pd
import numpy as np
import scipy.stats as st

SEED = 150
ITERATIONS = 5
K = 100
T = 10000

The cell below contains the bandit environment and may not be modified!

In [None]:
# DO NOT MODIFY
class Environment:
    def __init__(self, K=10, seed=0):
        self.random_state = np.random.RandomState(seed=seed)
        self.mu = st.beta.rvs(a=1, b=1, size=K, random_state=self.random_state)
        
    def expected_value(self, a):
        return self.mu[a]
        
    def perform_action(self, a):
        return st.bernoulli.rvs(self.mu[a], random_state=self.random_state)
        
    def optimal_action(self):
        return np.argmax(self.mu)

The cell below contains the bandit algorithm base class and may not be modified!

In [None]:
# DO NOT MODIFY
class BanditAlgorithmBase:
    def select_action(self):
        pass
    
    def update(self, action, reward):
        pass

The cell below contains the bandit experiment and may not be modified!

In [None]:
# DO NOT MODIFY
class Experiment:
    def __init__(self, environment, bandit_algorithm):
        self.environment = environment
        self.bandit_algorithm = bandit_algorithm
        
    def run_experiment(self, T=100):
        instant_regrets = np.zeros(T)
        for t in range(0, T):
            action = self.bandit_algorithm.select_action()
            reward = self.environment.perform_action(action)
            self.bandit_algorithm.update(action, reward)
            
            optimal_action = self.environment.optimal_action()
            instant_regret = self.environment.expected_value(optimal_action) - self.environment.expected_value(action)
            instant_regrets[t] = instant_regret
        cumulative_regrets = np.cumsum(instant_regrets)
        return (instant_regrets, cumulative_regrets)
            

The cell below contains a function for repeated experiments with a provided bandit algorithm, averaging regret over the runs. It may not be modified!

In [None]:
# DO NOT MODIFY
def run_repeated_experiments(bandit_algorithm_class, seed):
    instant_regrets = []
    cumulative_regrets = []
    for i in range(ITERATIONS):
        bandit_algorithm = bandit_algorithm_class(T, K)
        environment = Environment(K, seed+i+1)
        experiment = Experiment(environment, bandit_algorithm)

        instant_regrets_i, cumulative_regrets_i = experiment.run_experiment(T)
        instant_regrets.append(instant_regrets_i)
        cumulative_regrets.append(cumulative_regrets_i)
    return pd.DataFrame(data={'t': np.arange(1, T+1),
                             'instant_regret': np.mean(np.vstack(np.array(instant_regrets)), axis=0),
                             'regret': np.mean(np.vstack(np.array(cumulative_regrets)), axis=0)})

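To make the averaging step concrete, here is a minimal sketch of how per-run instantaneous regrets turn into the averaged `instant_regret` and `regret` columns. The two short regret arrays are made-up illustrative values, not output from the environment.

```python
import numpy as np

# Hypothetical instantaneous regrets from two runs of three rounds each.
runs = [np.array([0.3, 0.0, 0.1]), np.array([0.1, 0.2, 0.0])]

# Cumulative regret per run is the running sum of instantaneous regret.
cumulative = [np.cumsum(r) for r in runs]

# Average across runs, round by round (axis 0 indexes the runs).
mean_instant = np.mean(np.vstack(runs), axis=0)
mean_cumulative = np.mean(np.vstack(cumulative), axis=0)

print(mean_instant)
print(mean_cumulative)
```

Note that the regret recorded here is computed from the expected values `mu`, not from the sampled rewards, so it is non-negative in every round.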

## Stochastic Bandits (Chapter 1)

### Problem 1 
(3 points)

Implement the *Explore-First* algorithm (**Algorithm 1.1** in [Bandits]) within the provided bandit algorithm template below. Use $N = \left(\frac{T}{K}\right)^{2/3} \cdot \left( \log T \right)^{1/3}$.
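
As a quick sanity check on the length of the exploration phase, the sketch below evaluates this formula for the `T` and `K` from the setup cell. The helper name `exploration_rounds` is hypothetical, the natural logarithm is assumed, and rounding up to a whole number of pulls per arm is an assumption made here for illustration; the assignment does not prescribe a rounding rule.

```python
import math

def exploration_rounds(T, K):
    """Evaluate N = (T/K)^(2/3) * (log T)^(1/3), using the natural log."""
    return (T / K) ** (2 / 3) * math.log(T) ** (1 / 3)

N = exploration_rounds(T=10000, K=100)
N_int = math.ceil(N)  # assumption: round up to an integer number of pulls per arm

# N pulls for each of the K arms gives the total length of the exploration phase.
print(N, N_int, N_int * 100)
```

With these values each arm is tried roughly 45 times, so exploration takes a little under half of the horizon.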

In [None]:
class ExploreFirst(BanditAlgorithmBase):
    def __init__(self, T, K):
        """
        Constructor of the bandit algorithm

        Parameters
        ----------
        T : int
            Horizon
        K : int
            Number of actions
        """
        
        # FILL IN CODE HERE
        pass
    
    def select_action(self):
        """
        Select an action which will be performed in the environment in the 
        current time step

        Returns
        -------
        An action index (integer) in [0, K-1]
        """
        
        # FILL IN CODE HERE
        pass
    
    def update(self, action, reward):
        """
        Update the bandit algorithm with the reward received from the 
        environment for the action performed in the current time step

        Parameters
        ----------
        action : int
            An action index (integer) in [0, K-1]
        reward : int
            Reward (integer) in {0, 1} (Bernoulli rewards)

        """
        
        # FILL IN CODE HERE
        pass

Run the algorithm in the provided environment using the code below (averaging regret over 5 runs). The exploration and exploitation phases should be clearly visible in the plot.

In [None]:
# DO NOT MODIFY
np.random.seed(SEED)
ef_df = run_repeated_experiments(ExploreFirst, SEED)
ef_df.plot(x='t', y='regret', title='Explore-First')

### Problem 2
(3 points) 

Implement the $ \epsilon_t $-*Greedy* algorithm (**Algorithm 1.2** in [Bandits]) within the provided bandit algorithm template below. Use $\epsilon_t = \min \left\{1,\ t^{-1/3} \cdot (K \log t)^{1/3}\right\} $.
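
To get a feel for this schedule, the sketch below evaluates $\epsilon_t$ at a few rounds for $K = 100$. The helper name `epsilon_t` is hypothetical, and it assumes rounds are indexed from $t = 1$ and that the natural logarithm is used.

```python
import math

def epsilon_t(t, K):
    """epsilon_t = min(1, t^(-1/3) * (K log t)^(1/3)), assuming t >= 1."""
    return min(1.0, t ** (-1 / 3) * (K * math.log(t)) ** (1 / 3))

for t in (1, 100, 1000, 10000):
    print(t, epsilon_t(t, K=100))
```

Two things are worth noticing under these assumptions: at $t = 1$ the formula gives $\epsilon_1 = 0$ because $\log 1 = 0$, and for $K = 100$ the cap at 1 is active for the first several hundred rounds, so early rounds are almost pure exploration.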

In [None]:
class EpsilonTGreedy(BanditAlgorithmBase):
    def __init__(self, T, K):
        """
        Constructor of the bandit algorithm

        Parameters
        ----------
        T : int
            Horizon
        K : int
            Number of actions
        """
        
        # FILL IN CODE HERE
        pass
    
    def select_action(self):
        """
        Select an action which will be performed in the environment in the 
        current time step

        Returns
        -------
        An action index (integer) in [0, K-1]
        """
        
        # FILL IN CODE HERE
        pass
    
    def update(self, action, reward):
        """
        Update the bandit algorithm with the reward received from the 
        environment for the action performed in the current time step

        Parameters
        ----------
        action : int
            An action index (integer) in [0, K-1]
        reward : int
            Reward (integer) in {0, 1} (Bernoulli rewards)

        """
        
        # FILL IN CODE HERE
        pass

Run the algorithm in the provided environment using the code below (averaging regret over 5 runs). The plot should show sublinear regret with respect to $t$.

In [None]:
# DO NOT MODIFY
np.random.seed(SEED)
eg_df = run_repeated_experiments(EpsilonTGreedy, SEED)
eg_df.plot(x='t', y='regret', title='Epsilon_t-Greedy')

### Problem 3
(3 points) 

Implement the UCB1 algorithm (**Algorithm 1.5** in [Bandits]) within the provided bandit algorithm template below.
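
The sketch below illustrates how an upper confidence bound index can be computed from per-arm statistics. It assumes the confidence radius $r_t(a) = \sqrt{2 \log T / n_t(a)}$ stated in Problem 6 and assumes every arm has already been pulled at least once; the function name `ucb_indices` and the example counts are hypothetical, and this is not a complete algorithm.

```python
import numpy as np

def ucb_indices(reward_sums, pull_counts, T):
    """Return empirical mean + sqrt(2 log T / n) per arm (requires n >= 1)."""
    means = reward_sums / pull_counts
    radii = np.sqrt(2.0 * np.log(T) / pull_counts)
    return means + radii

# Made-up statistics for three arms: total reward and number of pulls.
reward_sums = np.array([3.0, 1.0, 8.0])
pull_counts = np.array([5.0, 2.0, 20.0])

indices = ucb_indices(reward_sums, pull_counts, T=10000)
print(indices, int(np.argmax(indices)))
```

In this example the arm with the fewest pulls gets the largest index even though its empirical mean is not the highest, which is the optimism that drives exploration.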

In [None]:
class UCB1(BanditAlgorithmBase):
    def __init__(self, T, K):
        """
        Constructor of the bandit algorithm

        Parameters
        ----------
        T : int
            Horizon
        K : int
            Number of actions
        """
        
        # FILL IN CODE HERE
        pass
    
    def select_action(self):
        """
        Select an action which will be performed in the environment in the 
        current time step

        Returns
        -------
        An action index (integer) in [0, K-1]
        """
        
        # FILL IN CODE HERE
        pass
    
    def update(self, action, reward):
        """
        Update the bandit algorithm with the reward received from the 
        environment for the action performed in the current time step

        Parameters
        ----------
        action : int
            An action index (integer) in [0, K-1]
        reward : int
            Reward (integer) in {0, 1} (Bernoulli rewards)

        """
        
        # FILL IN CODE HERE
        pass

Run the algorithm in the provided environment using the code below (averaging regret over 5 runs). The plot should show sublinear regret with respect to $t$.

In [None]:
# DO NOT MODIFY
np.random.seed(SEED)
ucb1_df = run_repeated_experiments(UCB1, SEED)
ucb1_df.plot(x='t', y='regret', title='UCB1')

### Problem 4
(6 points) 

This theory problem is based on **Exercise 1.1** in [Bandits]. The proofs in **Chapter 1** consider environments where the rewards are in the interval $[0,1]$. Consider the case where we have additional knowledge about the problem: we know that the rewards for each action are in the interval $\left[\frac{1}{2}, \frac{1}{2} + \epsilon\right]$ for some fixed $\epsilon \in \left(0, \frac{1}{2}\right)$.

Consider a version of $\text{UCB1}$ modified to utilize this knowledge (you do not need to specify the algorithm completely, just define the new confidence radius $r_t(a)$). For this algorithm and problem setting, prove that:

$\mathbb{E}\left[R(t)\right] \leq \frac{2 t}{T^2} + 2 \epsilon \sqrt{2 K t \log T}$

**Instructions:** Use the version of Hoeffding's Inequality with ranges (**Theorem A.2** in [Bandits]) to modify the confidence radius $r_t(a)$. Then follow the steps of the analysis leading up to **Theorem 1.14** in [Bandits] to derive the regret bound, but state the actual constants instead of using big-O notation:

1. Define the clean event, like in **Section 1.3.1**, and bound the probability of the event.
2. Start with the definition of the regret $\mathbb{E}\left[R(t)\right]$, and perform a regret decomposition like on **Page 11** of **Section 1.3.2**.
3. Bound the *gap* $\Delta (a)$, like in **Section 1.3.3**.
4. Complete the proof using the technique on **Page 12** of **Section 1.3.2**.

*Write the solution in the Markdown cell below (use LaTeX-math mode for equations, etc.).*

#### Solution

WRITE SOLUTION HERE

## Bayesian Bandits (Chapter 3)

### Problem 5
(3 points)

Implement the *Thompson Sampling* algorithm (**Algorithm 3.3** in [Bandits]) within the provided bandit algorithm template below. Assume independent priors and that the prior is $\mathbb{P} = \text{Beta}(\alpha_0, \beta_0)$ with $\alpha_0 = 1$ and $\beta_0 = 1$ (i.e. the **Beta-Bernoulli** setting, on **page 35** in [Bandits]).

**Note:** There is a typo in the expression for the posterior $\mathbb{P}_H$ in [Bandits]. It should be $\text{Beta}(\alpha_0 + \text{REW}_H,\ \beta_0 + t - \text{REW}_H)$.
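
A minimal sketch of the corrected posterior update for a single arm follows. It assumes that, with independent priors, $t$ in the expression above is read as the number of times that arm has been pulled; the names `posterior_params`, `reward_sum` and `pulls`, the example counts, and the seed are all hypothetical.

```python
import numpy as np

ALPHA_0, BETA_0 = 1.0, 1.0  # Beta(1, 1) prior from the problem statement

def posterior_params(reward_sum, pulls):
    """Beta(alpha_0 + REW, beta_0 + pulls - REW) for Bernoulli rewards."""
    return ALPHA_0 + reward_sum, BETA_0 + pulls - reward_sum

rng = np.random.RandomState(0)  # hypothetical seed, for reproducibility only

# Made-up history: 7 rewards of 1 in 10 pulls of this arm.
a, b = posterior_params(reward_sum=7, pulls=10)
sample = rng.beta(a, b)  # one draw from the posterior over the arm's mean

print(a, b, a / (a + b), sample)
```

The posterior mean $a / (a + b)$ moves toward the empirical success rate as the number of pulls grows, while a single draw still varies from call to call.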

In [None]:
class ThompsonSampling(BanditAlgorithmBase):
    def __init__(self, T, K):
        """
        Constructor of the bandit algorithm

        Parameters
        ----------
        T : int
            Horizon
        K : int
            Number of actions
        """
        
        # FILL IN CODE HERE
        pass
    
    def select_action(self):
        """
        Select an action which will be performed in the environment in the 
        current time step

        Returns
        -------
        An action index (integer) in [0, K-1]
        """
        
        # FILL IN CODE HERE
        pass
    
    def update(self, action, reward):
        """
        Update the bandit algorithm with the reward received from the 
        environment for the action performed in the current time step

        Parameters
        ----------
        action : int
            An action index (integer) in [0, K-1]
        reward : int
            Reward (integer) in {0, 1} (Bernoulli rewards)

        """
        
        # FILL IN CODE HERE
        pass

Run the algorithm in the provided environment using the code below (averaging regret over 5 runs). The plot should show sublinear regret with respect to $t$.

In [None]:
# DO NOT MODIFY
np.random.seed(SEED)
ts_df = run_repeated_experiments(ThompsonSampling, SEED)
ts_df.plot(x='t', y='regret', title='Thompson Sampling')

### Problem 6
(6 points)

In this theory problem, you will show an intermediate step in the proof of the Bayesian regret bound of *Thompson Sampling* in the [Bandits] book.

You are given a $K$-armed bandit problem with rewards in the interval $[0, 1]$. You can assume that $K \leq T$, where $T$ is the horizon. Additionally, you can assume that **Lemma 1.5** holds (i.e., for this assignment we define $r_t (a) := \sqrt{\frac{2  \log T}{ n_t (a)}}$, and then it holds that $\text{Pr}\left\{ \mathcal{E} \right\} \geq 1 - \frac{2}{T^2}$ with $\mathcal{E} := \left\{ \forall a \forall t \;\; \vert \bar{\mu}_t (a) - \mu (a) \vert \leq r_t (a) \right\}$). Then, with $\text{UCB}_t (a) := \bar{\mu}_t (a) + r_t (a)$, show that $\mathbb{E}\left[ \left[ \text{UCB}_t (a) - \mu (a) \right]^{-} \right] \leq \frac{2}{TK}$ (i.e., show that Equation 3.14 in [Bandits], with $\gamma = 2$, holds for all arms $a$ and rounds $t$).

**Note:** $[x]^{-}$ is the negative portion of $x$, i.e., $[x]^{-} = 0$ if $x \geq 0$ and $[x]^{-} = \vert x \vert$ otherwise.
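
The definition of the negative part can be written as a one-line function. The name `negative_part` is chosen here for illustration.

```python
def negative_part(x):
    """[x]^- : 0 when x >= 0, |x| otherwise."""
    return max(0.0, -x)

print(negative_part(0.3))   # non-negative input gives 0
print(negative_part(-0.2))  # negative input gives its absolute value
```

In particular, $[x]^{-}$ is never negative and never exceeds $\vert x \vert$.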

**Hint:** Remember that, given a random variable $X$, an event $\mathcal{E}$ (subset of the sample space) and its complement $\mathcal{E}^c$, by the tower rule, $\mathbb{E}\left[ X \right] = \mathbb{E}\left[ X \;\vert\; \mathcal{E} \right] \cdot \text{Pr}\left\{ \mathcal{E} \right\} + \mathbb{E}\left[ X \;\vert\; \mathcal{E}^c \right] \cdot \text{Pr}\left\{ \mathcal{E}^c \right\}$.
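
The identity in the hint can be checked exactly on a small example. The sketch below uses a fair six-sided die (an example chosen here, unrelated to the bandit setting) with $\mathcal{E}$ the event that the roll is even, and exact rational arithmetic so that both sides can be compared without rounding.

```python
from fractions import Fraction

outcomes = range(1, 7)  # fair six-sided die
p = Fraction(1, 6)      # probability of each outcome

event = [x for x in outcomes if x % 2 == 0]       # E: roll is even
complement = [x for x in outcomes if x % 2 == 1]  # E^c: roll is odd

pr_event = p * len(event)
pr_complement = p * len(complement)
mean_given_event = Fraction(sum(event), len(event))
mean_given_complement = Fraction(sum(complement), len(complement))

lhs = sum(p * x for x in outcomes)  # E[X]
rhs = mean_given_event * pr_event + mean_given_complement * pr_complement

print(lhs, rhs)
```

Both sides evaluate to the same fraction, as the decomposition requires whenever both events have positive probability.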

*Write the solution in the Markdown cell below (use LaTeX-math mode for equations, etc.).*

#### Solution

WRITE SOLUTION HERE
