In [3]:
import numpy as np
import random

std=1
mean=1
'''
This class simply generates a bandit problem with number of bandit arms
being `num_bandits`, which itself is chosen randomly. The numbers being sampled
from the gaussian are the rewards associated with those arms.
For simplicity, I have restricted ourselved to nonnegative rewards and decimals upto 2 digits
to avoid any floating point calculation errors.
'''
class Bandit_arms:

    def __init__(self, num_bandits):
        self.bandits=np.round(np.absolute(np.random.normal(loc=mean, scale=std, size=num_bandits)), decimals=2)

    def return_array(self):
        return self.bandits


'''
Minimum: 5 bandit arms and maximum: 20 bandit arms.
'''

num_bandits=np.random.randint(low=5, high=20,size=1)[0]
bandit_problem=Bandit_arms(num_bandits).return_array()
print("True mean rewards",bandit_problem)

'''
This is technically cheating, you should not be aware of the true underlying 
reward through any means. It doesn't matter as beginners but keep it in mind
that this data is unavailable to you (no need for learning if we knew already!).
'''

'''
Now for each arm, when we pull it, we get modulus of number sampled
from a gaussian with mean defined in `bandit_problem` array and standard
deviation=std.
Keep in mind that first bandit arm would have index 0.
'''

def bandit_simulator(arm_index):
    reward=np.round(np.absolute(np.random.normal(loc=bandit_problem[arm_index], scale=std, size=1)), decimals=2)[0]
    return reward


'''
So, now we have an environment that generates a bandit problem
with a random number of arms and random mean rewards. We have a function to 
sample reward on "pulling" each of these arms. Now, try out algorithms that 
can come close to true mean rewards.
Any rule based method won't work, since means change everytime you run the python file.
So, your algorithm must truly be able to learn as good as it can in a single run of this file.
As you must have understood by now, learning is an iterative procedure. Typically, to represent limited
computational and time resources, upper limits on allowed learning iterations are imposed. So, I am setting 
a variable that defines number of allowed iterations. Of course, while building, you can play around with it.
'''

num_allowed_iterations=2000

'''
Except for the values of mean, std, num_iterations, I don't think you should
have the need to change any of the code written to this point.
'''

estimated_means=np.zeros(shape=num_bandits, dtype=float)
#--------------------------
# the part that truly matters
N=np.zeros(shape=num_bandits, dtype=int) #number of times each arm was pulled
c = 2 #degree of exploration
arr=np.arange(0,num_bandits) #array for the arms
for pull in range (1,num_allowed_iterations):
    r_pull=[] #all rewards for this iter
    ucb_Q=estimated_means+np.sqrt(c*np.log(pull)/N)
    a=np.argmax(ucb_Q)
    temp_r=np.random.normal(bandit_simulator(a),1)
    N[a]=N[a]+1
    estimated_means[a]=estimated_means[a]+(temp_r-estimated_means[a])/N[a]
print("Estimated mean rewards",estimated_means)
'''
YOUR ALGORITHM MUST NOT USE THE TRUE BANDIT MEAN REWARD ARRAY AT ANY STEP.
`num_bandits` VARIABLE PROVIDES YOU THE NUMBER OF BANDITS ARMS AND SIMPLY CALL THE
SIMULATOR FUNCTION. ANY ALGORITHM ACCESSING MEAN REWARD ARRAY IS OBVIOUSLY WRONG.
'''

'''
Underlying function checks how well your estimate is. Since, the learning is completed by this point
so accessing the true mean for checking is somewhat acceptable.
'''
def check(bandit_problem, estimate_means):
    errors=np.absolute(estimate_means-bandit_problem)
    print("Error in mean reward prediction for each arm:",errors)

check(bandit_problem, estimated_means)

True mean rewards [1.3  0.91 0.98 0.84 0.85 0.99 0.96 0.46 0.34 0.27]
Estimated mean rewards [ 1.42459994  1.16156388  0.89679398  1.04188793  0.99395848 -0.7205857
  0.74235244  0.28143069  1.07459855  0.77818449]
Error in mean reward prediction for each arm: [0.12459994 0.25156388 0.08320602 0.20188793 0.14395848 1.7105857
 0.21764756 0.17856931 0.73459855 0.50818449]


  ucb_Q=estimated_means+np.sqrt(c*np.log(pull)/N)
  ucb_Q=estimated_means+np.sqrt(c*np.log(pull)/N)
