# K-armed Bandit Problem

In this lab, we will implement a complete bandit algorithm using the <b> $ \large \epsilon$-greedy algorithm</b> on the K-armed bandit problem.
The $ \large \epsilon$-greedy algorithm trades off exploration against exploitation so that the agent can identify the optimal action. It is an exploration strategy in which the agent behaves greedily most of the time but, with probability $ \large \epsilon $, instead selects an action uniformly at random from all available actions.

     
     
<b>In the K-armed Bandit problem: </b>
- The agent is faced repeatedly with a choice among $K$ different options, or actions
- After each action, the agent receives a reward drawn from a stationary probability distribution that depends on the selected action
- The agent's objective is to maximize the expected total reward over some time period
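
The setting described in the list above can be sketched without any external environment. The class below is a hypothetical stand-in (the name `GaussianBandit`, its `pull` method, and the unit-variance Gaussian rewards are assumptions made for illustration, not the `gym_bandits` implementation): each arm has a fixed true mean, and every pull returns a noisy reward around that mean.

```python
import numpy as np


class GaussianBandit:
    """Hypothetical K-armed bandit with stationary Gaussian rewards."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True mean reward of each arm; fixed after creation, so the problem is stationary.
        self.true_means = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def pull(self, action):
        """Return a noisy reward drawn around the chosen arm's true mean."""
        return self.rng.normal(loc=self.true_means[action], scale=1.0)


testbed = GaussianBandit(k=10, seed=0)
rewards = [testbed.pull(3) for _ in range(5)]  # pull arm 3 five times
print(rewards)
```

Because the true means are hidden from the agent, the only way to learn which arm is best is to pull arms and average the rewards observed.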

<b> Why do we call it "K-armed Bandit"? </b>

It is named by analogy to a slot machine, or "one-armed bandit", except that it has K levers instead of one.

The slot machine is designed as follows: 

<img src  = "https://secure.img1-fg.wfcdn.com/im/00280517/resize-h800-w800%5Ecompr-r85/1548/15481915/Vegas+Slot+Machine+Cardboard+Standup.jpg" width = "50%"> 
 
- Each time we pull a lever, we get a reward drawn from a stationary probability distribution
- Our goal is to find out which lever gives us the maximum cumulative reward over a sequence of pulls

<br> 


The pseudocode for a complete bandit algorithm using incrementally computed sample
averages and the $ \large \epsilon$-greedy algorithm is shown in the box below.
<img src = "https://i.imgur.com/vdPAifG.png" > 
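
The incremental form of the sample average follows directly from its definition. The notation here is an assumption that follows the usual textbook convention: $Q_n$ is the estimate of an action's value after it has been selected $n-1$ times, and $R_i$ is the $i$-th reward received from that action.

$$
Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i
        = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big)
        = \frac{1}{n}\Big(R_n + (n-1)\,Q_n\Big)
        = Q_n + \frac{1}{n}\Big(R_n - Q_n\Big)
$$

Each update therefore needs only the previous estimate, the new reward, and the count $n$; the full reward history does not have to be stored.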

In [14]:
import gym 

import numpy as np 

import gym_bandits

In [15]:
env = gym.make("BanditTenArmedGaussian-v0") # Replace with relevant env

[2020-10-31 16:10:03,662] Making new env: BanditTenArmedGaussian-v0


In [17]:
#Explore our action space 
env.action_space

Discrete(10)

## Epsilon-greedy Algorithm 

Let's create a function for the epsilon-greedy part of the algorithm. The pseudocode is shown in the box below.

<img src = "https://i.imgur.com/Rsh8mrf.png" >

In [24]:
def epsilon_greedy(epsilon, Q):
    '''
    Select an action using the epsilon-greedy rule.

    Arguments:
      epsilon --> the probability of taking an exploratory (random) action
      Q --> a numpy array holding the estimated value of every action in the action space

    Returns:
      action --> the selected action for the current time step
    '''

    # draw a random number from a continuous uniform distribution over [0.0, 1.0)
    rand = np.random.random()

    if rand < epsilon:
        # explore: sample a random action from the global env's action space
        action = env.action_space.sample()
    else:
        # exploit: take the greedy action (ties go to the lowest index)
        action = np.argmax(Q)

    return action
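
One consequence of this rule is worth checking: the exploratory branch can also land on the greedy action, so with $K$ actions the greedy action is chosen with probability $(1 - \epsilon) + \epsilon / K$, not $1 - \epsilon$. The sketch below verifies this empirically. It uses a standalone variant, `epsilon_greedy_np` (a hypothetical name), that samples from `len(Q)` actions with a NumPy generator instead of the Gym action space, and the values in `Q` are made up.

```python
import numpy as np


def epsilon_greedy_np(epsilon, Q, rng):
    """Standalone variant: explores by sampling uniformly over len(Q) actions."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # exploratory action, may also hit the greedy one
    return int(np.argmax(Q))              # greedy action


rng = np.random.default_rng(0)
Q = np.array([0.1, 0.5, 0.2, 0.9])  # made-up value estimates; action 3 is greedy
epsilon = 0.2

picks = np.array([epsilon_greedy_np(epsilon, Q, rng) for _ in range(50_000)])
greedy_share = np.mean(picks == 3)

# Probability of picking the greedy action: (1 - epsilon) + epsilon / K
expected = (1 - epsilon) + epsilon / len(Q)
print(f"observed {greedy_share:.3f}, expected {expected:.3f}")
```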

## Bandit Algorithm 

In [37]:
def bandit(epsilon):
    '''
    Run the epsilon-greedy bandit algorithm and estimate the best action.

    Arguments:
      epsilon --> the probability of taking an exploratory action

    Returns:
      the action with the highest estimated value (the estimated optimal action)
    '''
    
    #initialize the number of iterations 
    num_iters = 20000
    
    #number of available actions, read from the environment instead of hard-coding 10
    n_actions = env.action_space.n

    #array of zeros holding the number of times each action has been taken
    count = np.zeros(n_actions)

    #array of zeros holding the sum of rewards received for each action
    sum_rewards = np.zeros(n_actions)

    #array of zeros holding the estimated value of each action
    Q = np.zeros(n_actions)
    
    
    for i in range(num_iters):
        
        # make an action selection
        action = epsilon_greedy(epsilon, Q) 
        
        # get the reward corresponding to this action
        observation, reward, done , info = env.step(action)
        
        # increment the count corresponding to this action 
        count[action] += 1
        
        # accumulate the rewards obtained from this action 
        sum_rewards[action] += reward 
        
        # estimate the value of this action 
        Q[action] = sum_rewards[action] / count[action]
        
        
    
    print(f"The optimal action is {np.argmax(Q)}")
    
    return np.argmax(Q)
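
The function above recomputes each estimate as `sum_rewards[action] / count[action]`. An incrementally computed sample average gives the same value without storing the reward sums. A minimal sketch, where `incremental_update` is a hypothetical helper name and the reward stream is made up:

```python
import numpy as np


def incremental_update(q, n, reward):
    """Return the new sample average after the n-th reward (n includes this reward)."""
    return q + (reward - q) / n


rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=1.0, size=1000)  # made-up reward stream for one action

q = 0.0
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, n, r)

# Both ways of computing the estimate agree up to floating-point error.
print(np.isclose(q, rewards.sum() / len(rewards)))
```

Inside the loop of `bandit`, the equivalent update would be `Q[action] += (reward - Q[action]) / count[action]`, applied after `count[action]` has been incremented.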

In [40]:
optimal_action = bandit(0.5)

The optimal action is 7
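
The run above uses $\epsilon = 0.5$, so about half of all pulls are random. One way to see how the choice of $\epsilon$ affects the reward collected is to repeat the experiment for several values. The sketch below does this on a synthetic Gaussian bandit instead of the `gym_bandits` environment; `run_epsilon_greedy` and the arm means in `true_means` are illustrative assumptions, not part of the lab.

```python
import numpy as np


def run_epsilon_greedy(epsilon, true_means, num_iters=5000, seed=0):
    """Run epsilon-greedy on a synthetic Gaussian bandit; return the mean reward per pull."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)      # estimated value of each action
    count = np.zeros(k)  # number of times each action was taken
    total = 0.0
    for _ in range(num_iters):
        if rng.random() < epsilon:
            action = int(rng.integers(k))   # explore
        else:
            action = int(np.argmax(Q))      # exploit
        reward = rng.normal(true_means[action], 1.0)
        count[action] += 1
        Q[action] += (reward - Q[action]) / count[action]  # incremental sample average
        total += reward
    return total / num_iters


true_means = np.linspace(-1.0, 1.0, 10)  # made-up arm means; the last arm is the best
for eps in (0.01, 0.1, 0.5, 1.0):
    print(f"epsilon={eps}: average reward {run_epsilon_greedy(eps, true_means):.3f}")
```

With $\epsilon = 1.0$ every pull is random, so the average reward should sit near the mean of all arms. Smaller values spend more pulls on the arm currently believed to be best, at the risk of taking longer to find it.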
