# Multi-Armed Bandit 

The Multi-Armed Bandit (MAB) is a Reinforcement Learning (RL) problem with a single state and multiple actions in which, as in all RL problems, the goal is to maximize the reward over a time horizon.

In MAB, the agent is confronted with multiple actions, each of which yields a scalar reward drawn from an unknown distribution.

Selecting an action is costly, and the agent's goal is to maximize the long-term reward.
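
One common way to write this objective is sketched below. The notation is introduced here for illustration and is not taken from the notebook: $K$ is the number of actions, $T$ the time horizon, $a_t$ the action chosen at timestep $t$, and $\mathcal{D}_{a_t}$ the unknown reward distribution of that action.

```latex
\max_{a_1, \dots, a_T} \; \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \right],
\qquad r_t \sim \mathcal{D}_{a_t}, \quad a_t \in \{1, \dots, K\}
```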

The challenge is that the underlying distributions are unknown, so finding the best actions requires efficiently balancing exploration and exploitation.
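
To make the exploration/exploitation trade-off concrete, here is a minimal sketch of the epsilon-greedy strategy, one of several possible approaches. The function name `epsilon_greedy`, the arm probabilities, and the default `epsilon=0.1` are assumptions chosen for illustration; they are not part of this notebook's classes.

```python
import numpy as np

def epsilon_greedy(true_probs, n_steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return (estimates, counts, total_reward)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    estimates = np.zeros(n_arms)          # running mean reward per arm
    counts = np.zeros(n_arms, dtype=int)  # number of pulls per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: pick the best estimate so far
        reward = float(rng.random() < true_probs[arm])  # Bernoulli reward (0.0 or 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward

    return estimates, counts, total_reward

# Hypothetical arms; the agent never sees these probabilities directly
estimates, counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, np.round(estimates, 2), total_reward)
```

With probability `epsilon` the agent tries a random arm, otherwise it pulls the arm with the highest estimated reward. A larger `epsilon` means more exploration at the expense of short-term reward.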

## 0. Imports

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

## 1. Defining Bandit Game Class

In this section, we want to define the Bandit Game class. 

But before that, since each bandit machine follows a distribution, let's first implement two types of distributions: Normal and Bernoulli distributions.

### 1.1. Defining distributions

#### 1.1.1. Normal Distribution 

In [2]:
class Gaussian:
    def __init__(self, mean=0, std=1):
        """
        Define a Gaussian distribution

        Parameters
        ----------
        mean : float, optional
            The mean of the normal distribution. The default is 0.
        std : float, optional
            The standard deviation of the normal distribution. The default is 1.

        Returns
        -------
        None.

        """
        self.mean = mean
        self.std = std

    def draw(self):
        """
        Draw a single sample from the normal distribution

        Returns
        -------
        reward : float
            A sample from the distribution
        """
        reward = np.random.normal(self.mean, self.std)
        reward = np.round(reward, 1) # round the reward to one decimal for consistent results
        return reward
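
Rounding keeps the values short, but draws still differ from run to run because `np.random.normal` uses NumPy's global random state. If reproducible runs are wanted, one option is to give each arm its own seeded generator. The sketch below is an assumption about how that could look; `SeededGaussian` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

class SeededGaussian:
    """Hypothetical seedable variant of the Gaussian arm (not part of the notebook)."""

    def __init__(self, mean=0.0, std=1.0, seed=None):
        self.mean = mean
        self.std = std
        # private generator, independent of NumPy's global random state
        self.rng = np.random.default_rng(seed)

    def draw(self):
        reward = self.rng.normal(self.mean, self.std)
        return float(np.round(reward, 1))  # same one-decimal rounding as above

# Two arms built with the same seed produce the same sequence of rewards
arm_a = SeededGaussian(5, 3, seed=42)
arm_b = SeededGaussian(5, 3, seed=42)
samples_a = [arm_a.draw() for _ in range(5)]
samples_b = [arm_b.draw() for _ in range(5)]
print(samples_a == samples_b)  # True
```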

#### 1.1.2. Bernoulli Distribution

In [3]:
class Bernoulli:
    def __init__(self, p):
        """
        Define a Bernoulli distribution

        Parameters
        ----------
        p : float, between 0 and 1
            The probability of drawing 1.

        Returns
        -------
        None.

        """
        self.p = p
        
    def draw(self):
        """
        Draw a single sample from the Bernoulli distribution

        Returns
        -------
        reward : int, 0 or 1
            A sample from the distribution
        """
        reward = np.random.binomial(n=1, p=self.p)
        return reward
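
As a quick sanity check of the sampling call used in `draw`, the sketch below draws many samples at once and compares their mean to `p`. The value `p = 0.3`, the sample size, and the seeded generator are arbitrary choices for illustration.

```python
import numpy as np

p = 0.3  # illustrative success probability
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=p, size=10_000)  # 10,000 draws, each 0 or 1

print(np.unique(samples))  # the distinct values that occurred
print(samples.mean())      # should be close to p for a large sample
```

Because the mean of a Bernoulli distribution equals `p`, the sample mean gives a direct estimate of how rewarding a Bernoulli machine is.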

### 1.2. Defining Bandit Game class

Now, we can define the bandit game class which receives some distributions and allows the user to pull them.

In [4]:
class BanditGame:
    def __init__(self, bandits: list):
        """
        Parameters
        ----------
        bandits : list
            A list of distribution objects, one per bandit machine. Each object 
            must provide a draw() method; the underlying distribution is unknown 
            to the player and can be of any kind, such as Normal or Beta.

        Returns
        -------
        None.

        """
        
        self.bandits = bandits
        
        ## shuffle the list (in place) so the order of the machines is unknown to the player
        np.random.shuffle(self.bandits)


        self._reset()


    def _reset(self):
        """
        Define some variables to keep track of the reward and timestep

        Returns
        -------
        None.

        """
        self.rewards = []
        self.total_reward = 0
        self.avg_reward = 0 # the average reward received so far
        self.time_step = 0

    
    def _update(self, rew):
        """
        Updating the reward related variables and time_step in each timestep

        Parameters
        ----------
        rew : float
            a float number showing the received reward in the current timestep.

        Returns
        -------
        None.

        """
        self.rewards.append(rew)
        self.total_reward += rew
        self.avg_reward = np.mean(self.rewards)
        self.time_step += 1
        
        
    def _step(self, choice):
        """
        Pulling a machine according to the agent's choice, and updating the 
        reward related variables and time_step

        Parameters
        ----------
        choice : int
            an integer showing the agent's choice in the current timestep.

        Returns
        -------
        rew : float
            the reward that the pulled machine delivers.

        """
        # sampling a reward from the underlying distribution of the selected bandit
        rew = self.bandits[choice].draw() 
        
        self._update(rew)
        return rew 
    
    
    def render(self):
        """
        Printing the rewards

        Returns
        -------
        None.

        """
        print(f"\n- - - You pulled the arms for {self.time_step} times.")
        print(f"- - - Reward list: {self.rewards}")
        print(f"- - - Total reward: {self.total_reward}")
        print(f"- - - Average reward: {self.avg_reward}")
        

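    def play(self):
        """
        Run an interactive loop in which the user picks a machine at each 
        timestep; any other input ends the game.

        Returns
        -------
        None.

        """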
        print("#"*10 + "Interactivelly simulate the MAB problem" + "#"*10)
        
        while True:
            print(f" \n\n\n {'#'}*5 Timestep: {self.time_step}")
            
            try:
                choice = int(input(f"Enter a number between 1 to {len(self.bandits)} to select a machine or any other numbers to exit: ")) -1
                
                if choice in range(0, len(self.bandits)):
                    reward = self._step(choice)
                    print(f"--- Mahcine {choice} gave a reward of {reward}")
                    print(f"--- Average reward sofar is: {self.avg_reward}")
            
                else:
                    print("___ No machine exists with this ID! ")
                    break
                
            except ValueError: # raised by int() when the input is not a number
                print("You entered an invalid input, but your rewards were stored!")
                break
        
        print("_"*7 + "You terminated the game!" + "_"*7)
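
Note that `_update` recomputes `np.mean` over the whole reward list at every timestep. For long games, an incremental update gives the same average without revisiting earlier rewards. The helper below, `running_average`, is a hypothetical standalone sketch of that idea, not a method of `BanditGame`.

```python
import numpy as np

def running_average(rewards):
    """Return the running mean after each reward, using an incremental update."""
    avg = 0.0
    history = []
    for t, rew in enumerate(rewards, start=1):
        avg += (rew - avg) / t  # avg_t = avg_{t-1} + (rew_t - avg_{t-1}) / t
        history.append(avg)
    return history

# First five rewards from the sample session in section 2
rewards = [4.0, 4.8, 6.5, 7.7, 3.3]
history = running_average(rewards)
print(np.round(history, 2))
```

Each entry of `history` matches what `np.mean` would return for the rewards seen up to that timestep.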

## 2. Play with the Bandit machines

Now, we define some arbitrary normal distributions as the underlying distributions for each bandit machine, and then we instantiate the BanditGame class and play with it for a couple of timesteps. 

In [5]:
## Define bandit machines with Gaussian dist
g_slot_A = Gaussian(5, 3)
g_slot_B = Gaussian(6, 2)
g_slot_C = Gaussian(1, 5)

g_bandits = [g_slot_A, g_slot_B, g_slot_C]

In [6]:
## Instantiate the BanditGame
g_game = BanditGame(g_bandits)

In [7]:
## Play the game for some timesteps
g_game.play()

##########Interactivelly simulate the MAB problem##########
 


 #*5 Timestep: 0
Enter a number between 1 to 3 to select a machine or any other numbers to exit: 1
--- Mahcine 0 gave a reward of 4.0
--- Average reward sofar is: 4.0
 


 #*5 Timestep: 1
Enter a number between 1 to 3 to select a machine or any other numbers to exit: 2
--- Mahcine 1 gave a reward of 4.8
--- Average reward sofar is: 4.4
 


 #*5 Timestep: 2
Enter a number between 1 to 3 to select a machine or any other numbers to exit: 1
--- Mahcine 0 gave a reward of 6.5
--- Average reward sofar is: 5.1000000000000005
 


 #*5 Timestep: 3
Enter a number between 1 to 3 to select a machine or any other numbers to exit: 3
--- Mahcine 2 gave a reward of 7.7
--- Average reward sofar is: 5.75
 


 #*5 Timestep: 4
Enter a number between 1 to 3 to select a machine or any other numbers to exit: 3
--- Mahcine 2 gave a reward of 3.3
--- Average reward sofar is: 5.26
 


 #*5 Timestep: 5
Enter a number between 1 to 3 to select a machi

In [8]:
## Print the final results
g_game.render()


- - - You pulled the arms for 15 times.
- - - Reward list: [4.0, 4.8, 6.5, 7.7, 3.3, 4.1, 4.0, -1.6, 1.6, 0.3, 5.9, 8.9, -7.4, 6.1, 5.5]
- - - Total reward: 53.699999999999996
- - - Average reward: 3.5799999999999996
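
To put this result in context, the sketch below compares the total reward of the session with what the best machine would deliver on average over the same number of pulls. The reward list and the means (5, 6, 1) are copied from the cells above. The comparison mixes realized rewards with an expected value, so treat the shortfall as a rough indicator rather than a formal regret.

```python
import numpy as np

# Reward list copied from the session output above
rewards = [4.0, 4.8, 6.5, 7.7, 3.3, 4.1, 4.0, -1.6, 1.6, 0.3, 5.9, 8.9, -7.4, 6.1, 5.5]
arm_means = [5, 6, 1]  # means of g_slot_A, g_slot_B, g_slot_C

n_pulls = len(rewards)
best_mean = max(arm_means)                # expected reward per pull of the best machine
total_reward = float(np.sum(rewards))
gap = best_mean * n_pulls - total_reward  # shortfall vs. always pulling the best machine

print(f"pulls={n_pulls}, total={total_reward:.1f}, average={total_reward / n_pulls:.2f}")
print(f"expected total for best machine={best_mean * n_pulls}, shortfall={gap:.1f}")
```

Always pulling the machine with mean 6 would yield 90 in expectation over 15 pulls, compared with the 53.7 collected here.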
