<a href="https://colab.research.google.com/github/Junxia8221/sunshine/blob/main/Lesson_1_Multi_Armed_Bandit_with_OpenAi_Gym_ver_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 1, Exercise 1: The Multi-Armed Bandit Problem with OpenAI Gym

The purpose of this notebook is to:

1.   Understand the Gym environment,
2.   Implement epsilon-greedy on the bandit problem as discussed in the post,
3.   **Win at slot machines!**

Let's get started!



~

Some tips on using Colaboratory:

1.   First off, click **File** in the menu above, then **Save a copy in Drive**. This copies the notebook to your Google Drive so that you can start running it.

2.   Once the notebook has finished copying, go to your fresh copy and click **"Connect"** in the top right-hand corner of Colab. This connects your computer to a powerful virtual machine sitting in Google's cloud.

How to run it:


**For those new to Colaboratory**, there are two types of "cell blocks": **text** (like the one you are reading right now) and **code**. For code cell blocks, just click the little play button to run the code. The play button looks like this:

![alt text](https://image.ibb.co/i4sxHH/Screen_Shot_2018_04_10_at_3_04_50_pm.png)

Or you can simply click the "**Runtime**" menu button up above, click "**Run all**", then sit back & watch Colab go to work.

~~~~~


First we need to download & install the Gym Library so that it works in Colabs.

In [None]:
!pip install gym > /dev/null 2>&1

**Great!** Now for our first bit of code.


Let's import the Gym package and walk through a basic example of Gym code.

In [None]:
import gym

Gym's main purpose is to provide a large collection of "environments" that expose a common interface, with standardized inputs & outputs, for testing Reinforcement Learning models. You can list these environments with the code below:

In [None]:
from gym import envs
print(envs.registry.all())

Unfortunately, Gym does **not provide a bandit** environment, so we need to get one from elsewhere. Let's install one with the commands below:

In [None]:
!git clone https://github.com/JKCooper2/gym-bandits.git > /dev/null 2>&1
!pip install /content/gym-bandits/. > /dev/null 2>&1

And import the bandit library too

In [None]:
import gym_bandits

Unlike in the post, where there were only 2 bandits & we were trying to figure out which one paid out the most on average, this time around we are going to be dealing with **TEN** (10) bandits!!!

Each bandit will have a payout with a normal distribution (bell curve), but the average payout or, centre of the distribution, will be different for each bandit, like in the image below

![alt text](https://i.stack.imgur.com/SazYv.png)
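To make that picture concrete, here is a minimal sketch of ten slot machines with bell-curve payouts. It does not use `gym_bandits` at all: the names `true_means` and `pull`, the standard deviation of 1, and the way the average payouts are drawn are all assumptions made purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

NUM_BANDITS = 10

# Assumption: each bandit's average payout is itself drawn from a bell curve.
true_means = [random.gauss(0.0, 1.0) for _ in range(NUM_BANDITS)]

def pull(bandit):
    """Return one noisy payout from the chosen bandit (standard deviation 1)."""
    return random.gauss(true_means[bandit], 1.0)

# Pull every lever many times; the sample average approaches the true mean.
pulls_per_bandit = 5000
for bandit in range(NUM_BANDITS):
    average = sum(pull(bandit) for _ in range(pulls_per_bandit)) / pulls_per_bandit
    print(f"bandit {bandit}: true mean {true_means[bandit]:+.2f}, sample average {average:+.2f}")
```

A single pull tells you very little because of the noise, which is why we will need to average over many pulls to find the best bandit.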

.

  >** >>>> It will be our goal to try & determine which bandit, out of the 10, pays out the most!!!!!!! <<<<**

.

We want to randomly initialise our environment. Do this by running the code below:

In [None]:
# For this exercise we will be using a powerful Python array library known as numpy, so let's import that.
import numpy as np

In [None]:
# rerun this part of the code if you would like to "reset" or reinitialize your bandit environment
np.random.seed(42)

Let's make a variable called "env" to hold our freshly created 10-armed bandit environment.

In [None]:
#gaussian distribution is just another name for "normal distribution" or bell curve (so many different names for the same thing!)
env = gym.make('BanditTenArmedGaussian-v0')

.

We are now going to go over a basic example of how OpenAI Gym works.

Run the code below & then we will explain what is going on piece by piece.

In [None]:
observation = env.reset()

for i_episode in range(5):

    print("episode Number is", i_episode)

    action = env.action_space.sample() # sample the action space, which in this case contains only 10 "options" because there are 10 bandits

    print("action is", action)


    # here we take the next "step" in our environment by passing in the action we randomly selected above
    observation, reward, done, info = env.step(action)

    print("observation space is: ",observation)
    print("reward variable is: ",reward)
    print("done flag is: ",done)
    print("info variable is: ",info)



env.close()

.

**Reinforcement Learning** is an **extremely broad machine learning "framework"**, that looks like this:

.



![alt text](https://keon.io/images/deep-q-learning/rl.png)

Explaining the picture above, the RL framework goes something like this:



1.   You have an **AGENT** (machine learning algorithm). In the image above the agent is a human brain, lol.
2.   The agent takes an **ACTION**. In the image above the available actions are the joystick's up, down, left and right, plus the red button. So five (5) actions are available in total.
3.   A single ACTION is chosen (from our five available on the joystick) and fed to our **ENVIRONMENT**, which in our example above is the Atari game environment.
4.   At this point our ENVIRONMENT measures how good the action taken was and produces a **REWARD** signal. A positive number is usually good & a negative number is usually bad.
5.   The environment then produces an **OBSERVATION**. Again using our example above, think of the OBSERVATION as the next graphical "frame" of the game. In RL, we also call observations **STATES**.
6.   The new OBSERVATION & REWARD signal (produced by the old observation-action pair) are then fed back to the AGENT for it to decide what move to make next, and so on and so forth.
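The six steps above can be sketched as a tiny loop. This is only an illustration: `ToyBanditEnv` is a hypothetical two-lever stand-in written for this sketch (it is not part of Gym or `gym_bandits`), and the "agent" here just picks levers at random.

```python
import random

class ToyBanditEnv:
    """Hypothetical stand-in for an environment: two levers with fixed average payouts."""

    def __init__(self):
        self.means = [0.2, 0.8]  # made-up average payout of each lever

    def reset(self):
        # A bandit has no changing state, so the observation is always 0.
        return 0

    def step(self, action):
        reward = random.gauss(self.means[action], 1.0)  # noisy payout
        observation, done, info = 0, True, {}
        return observation, reward, done, info

random.seed(1)
toy_env = ToyBanditEnv()
observation = toy_env.reset()              # the first OBSERVATION
history = []
for t in range(3):
    action = random.randrange(2)           # the AGENT picks an ACTION (randomly here)
    observation, reward, done, info = toy_env.step(action)  # the ENVIRONMENT responds
    history.append((action, reward))       # what a learning AGENT would learn from
    print(f"step {t}: action={action}, reward={reward:.2f}, observation={observation}")
```

A real agent would use `history` (or a summary of it) to choose better actions over time, which is exactly what we build later in this notebook.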


However, let's now take the above framework and see how it is implemented on our example "toy problem" of the multi-armed bandit.

Taking the code apart piece by piece

We have already created our environment with this line of code:


```
env = gym.make('BanditTenArmedGaussian-v0')
```



but, we need to ask the environment to produce the first **OBSERVATION** (or as we also call it - state) so that we can feed it to our **AGENT** (the RL algorithm) to decide what to do next.


So next up we get the *first* OBSERVATION by calling the following code:



```
observation = env.reset()

```


Next up, we create a 'for loop' that loops 5 times. In the multi-armed bandit scenario an "episode" is just a single "play" of the game. So think of the for loop as playing the multi-armed bandit game 5 times.



```
for i_episode in range(5):
```



Each **ENVIRONMENT** is different. That is to say, each environment gives us different **ACTIONS** that are available, different **OBSERVATIONS** that are available, etc.

For the *10-armed bandit problem*, we should have *ten actions available* to us (as there should be 10 different slot machines, each with its own lever to pull).

We can confirm this is the case by running the code below:

In [None]:
print(env.action_space)

So, taking this back to our earlier example code, the next thing we do is take an action *by randomly sampling from the* **ACTION SPACE**. Again, don't worry. The ACTION SPACE is just a number assigned to each of our bandits, e.g. [0,1,2,3,4,5,6,7,8,9]. So for our example, if we randomly sample the ACTION SPACE and get back the number 8, all this means is that we will be pulling the "lever" on bandit 8.

It is also important to note that at this point we have not implemented any machine learning yet. We are only choosing actions at random. So let's choose one. This was done with this line of code:



```
action = env.action_space.sample()
```
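If you would like to see what "sampling the action space" amounts to without Gym, here is a small sketch. It uses Python's built-in `random.randrange` as a stand-in for `env.action_space.sample()`; treating the two as equivalent is an assumption made for illustration, and `NUM_ACTIONS` and `samples` are names invented here.

```python
import random

random.seed(3)
NUM_ACTIONS = 10  # one action per bandit

# Stand-in for env.action_space.sample(): pick one lever index uniformly at random.
samples = [random.randrange(NUM_ACTIONS) for _ in range(1000)]

print("lowest action seen:", min(samples))
print("highest action seen:", max(samples))
print("distinct actions seen:", sorted(set(samples)))
```

Every sampled action is a whole number from 0 to 9, so it can be used directly as the index of a bandit.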



We have now completed step 2 of our Reinforcement Learning framework discussed earlier. Let's now do step 3, feeding the action into the environment. This was achieved with this line of code:



```
observation, reward, done, info = env.step(action)
```



The above line of code is actually doing a couple of things. We are feeding in our selected action with this line:



```
env.step(action)
```



And getting back 4 new variables in return, in this part of the code:



```
observation, reward, done, info =
```



For now, we do not have to worry about the DONE and INFO variables. All we care about in this tutorial are the OBSERVATION & REWARD variables. Also note that in this single line of code we have achieved steps 4 & 5 of our RL framework discussed earlier in one go!

Where we now break away from the RL framework is that we are not feeding the **REWARD** & **OBSERVATION** variables back to any **AGENT** (aka step 6 from the framework) to do anything intelligent yet. This is because we have yet to create an **AGENT**!!





## Exercise 1: creating your first (very simple) Agent

We are now ready to create your very first agent, the epsilon-greedy algorithm.

We will need to keep track of which bandit is the best. We do this by creating a big table with 10 cells, one for each bandit. In computer terms this table is known as an array. We are creating this array to keep track of how well each bandit is doing every time we play a game. As before, we are using numpy to store our table.

In [None]:
import numpy as np

We also want to randomly initialise our environment. Do this by running the code below:

In [None]:
# rerun this part of the code if you would like to "reset" or reinitialize your bandit environment
np.random.seed(<seed>)
env.seed(34)

Let's also make a variable that holds the total number of bandits operating in our environment. **Complete the code below:**

In [None]:
numberOfBandits = #???? hint we mentioned this number above

We are going to call the array (remember, this is just a table) which keeps track of which bandit is the best a **Q TABLE**.

Q in this case just stands for quality. The idea is that the highest number in our table belongs to the highest-quality action, which is the one we would like to take. But we are getting a little ahead of ourselves. If you did not get that, don't worry. It will all become apparent in a couple of lines of reading!

We also want to initialize our table so that all of the values are ones at the start, so let's do that:

In [None]:
q_table = np.ones(numberOfBandits)

Remember, we would like to keep track of the bandit with the *HIGHEST AVERAGE PAYOUT*. In order to do this we need another table to keep track of the number of times each bandit has been "pulled". We also want this table to be filled with ones, to stop a divide by zero error later. **Complete the code below:**

In [None]:
n_table = #????

In the lecture we talked a lot about epsilon. Let's create a new variable called "epsilon" and initialize it to 0.9. **Complete the code below:**

In [None]:
epsilon = #????

Below is the pseudo code for the epsilon greedy algorithm. In this implementation we are not going to vary epsilon - it is going to be a fixed number.

This means that we will keep our exploration rate fixed.

We will now give you the pseudo code (code recipe) on how to implement the epsilon greedy algorithm. Look below at the "recipe". If you need help, all the ingredients on how to implement each part are below in the pseudo code.



```
create a for loop, to loop 1000 times

      (IF STATEMENT, inside the loop,) generate a random number between 0 and 1 , if this number is less than
      Epsilon enter "exploitation mode" aka use the best bandit we have discovered so far
      
            (inside the if statement) get the POSITION (index) in our array of the current max value within our
            table, this index is the bandit that is giving the best payout so far.
            
            (inside the if statement) set your ACTION variable equal to the index we discovered in the last
            statement
            
      (ELSE, otherwise..) if the number is greater than or equal to Epsilon, go into "exploration mode" and
      choose a bandit at random
      
          
            (inside the else statement) generate a random whole number from 0 up to (but not including) THE_NUMBER_OF_BANDITS
            
            (inside the else statement) set your ACTION variable to equal to the random number we just generated.
            
        
            
      (inside the loop) feed our ACTION variable, chosen by either of the branches above, into our
      environment by taking a step with it
      
      (inside the loop) now that we have gained some new information from our environment we want to update our
      Q_table. We do this using the formula: Q_n+1 = Q_n + (R - Q_n)/n or in simpler english:
      
      NewQvalue = OldQvalue + ((reward - OldQvalue)/numberOfTimesLeverHasBeenPulledForThisBandit)
      
      Let's think about the intuition of what this formula is doing. Implement the formula in code.
      
      
      (inside the loop) now that we have updated our Q table, we also need to update the table that is keeping
      track of how many times each bandit's lever has been pulled. Do this by adding +1 in the position
      of our currently selected bandit in the N_TABLE array
      
      
(OUTSIDE the loop) once everything is done, we would like to print the Bandit with the highest score! Using
a print statement, and numpy's argmax function, using our Q table, print the bandit with the highest
AVERAGE payout
      
   
   
  
```
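To build some intuition for the update formula in the recipe, here is a small worked sketch. It uses made-up payouts from a single bandit (the `rewards` list is invented for illustration) and shows that applying the formula one reward at a time ends up at the plain average of all the rewards seen so far.

```python
rewards = [4.0, 7.0, 1.0, 8.0]   # made-up payouts from one bandit

q = 1.0   # starting estimate (replaced by the first reward, because n starts at 1)
n = 1     # pull counter starts at 1, as in the notebook's n_table

for reward in rewards:
    q = q + (reward - q) / n   # NewQvalue = OldQvalue + (reward - OldQvalue) / n
    n = n + 1                  # count the pull after updating
    print(f"after reward {reward}: running average = {q}")

plain_average = sum(rewards) / len(rewards)
print("plain average:", plain_average)
```

The handy part is that we never need to store the full list of past rewards: the current estimate and the counter are enough to keep the average up to date.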



.

**Here are all the pieces required to build the above:**

How to make a loop that loops 10 times:

In [None]:
#notice the indentation of the print statement, this indicates that this function (the print statement), is "inside" the loop

for k in range(10):
  print("now in loop iteration number: ",k)

How to generate a random number between 0 & 1 using numpy:

In [None]:
#rerun this multiple times to generate a different random number
#remember np stands for "numpy" as declared above

randomNumber = np.random.random(1)[0]

print("random number is ", randomNumber)

How to create an if statement to see if our random number is less than epsilon

In [None]:
#epsilon should be set to 0.9 above...

if(randomNumber < epsilon):
  print("doing something less than epsilon")

How to create an if statement to see if our random number is less than epsilon and do something else if this is not the case

In [None]:
if(randomNumber < epsilon):
  print("doing something less than epsilon")
else:
  print("doing something more than epsilon")
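To see how often each branch of that if/else fires, here is a quick sketch that repeats the test many times and counts. It uses Python's built-in `random` module so that it runs on its own, and `epsilon_sketch`, `exploit_count` and `explore_count` are names invented for this illustration. Note that in this notebook's recipe the "less than epsilon" branch is the exploitation branch; many textbooks use the opposite convention, where epsilon is the chance of exploring.

```python
import random

random.seed(7)
epsilon_sketch = 0.9
trials = 10_000

exploit_count = 0
explore_count = 0
for _ in range(trials):
    if random.random() < epsilon_sketch:
        exploit_count += 1   # this notebook's recipe: use the best bandit so far
    else:
        explore_count += 1   # this notebook's recipe: try a random bandit

print("exploit fraction:", exploit_count / trials)
print("explore fraction:", explore_count / trials)
```

With epsilon set to 0.9, roughly nine plays out of ten go to the current best bandit and about one in ten is spent exploring.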

How to lookup the best bandit's *location* (index) in our Q table, again remember there are 10 values to choose from (10 bandits) & we need to look up the index (location) in our Q_table array of the bandit with the highest score. Numpy's argmax() function allows us to do this.


In [None]:
best_bandit = np.argmax(q_table)

print("best bandit is ",best_bandit)
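One detail worth knowing: when several entries share the highest value, numpy's `argmax` returns the first of them. The sketch below shows this with a freshly initialised table and then with one entry raised by hand; `fresh_q_table` and `updated_q_table` are names invented for this illustration.

```python
import numpy as np

fresh_q_table = np.ones(10)            # every bandit starts with the same score
print(np.argmax(fresh_q_table))        # ties resolve to the first index, so this prints 0

updated_q_table = fresh_q_table.copy()
updated_q_table[6] = 2.5               # pretend bandit 6 has paid out well
print(np.argmax(updated_q_table))      # prints 6
```

This means that before any learning has happened, "the best bandit" is simply bandit 0, which is one reason some random exploration is needed.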

How to assign the index of our best action so that OpenAI Gym can understand it:

In [None]:
action = best_bandit

How to generate a random bandit ACTION number from 0 to 9 to take:

In [None]:
#run this multiple times to see different results
random_number = np.random.randint(numberOfBandits)  # randint(n) already returns a whole number from 0 to n-1

print(random_number)

How to move OpenAI Gym one time step into the future:

In [None]:
observation, reward, done, info = env.step(action)

How to directly access a value in a numpy array. In this example we are accessing the payout so far of the best bandit:

In [None]:
example_table = np.array([2,3,1,0,5,7,9,8,6,4])

example_table[best_bandit]

To print out all the values of a numpy array for testing purposes, use this code

In [None]:
print(example_table[:])

After looping 1000 times we would like to present our result: the best bandit out of the 10. We do this using numpy's argmax function:

In [None]:
# think about the result after running this relative to our example table above:

print('and the best bandit is....', np.argmax(example_table))

## Code up Epsilon-greedy below

We now have everything required to code up epsilon greedy. Using our **recipe**, and all the **blocks above,** code up an implementation of epsilon greedy below:

In [None]:
import numpy as np

env.seed(34)

numberOfBandits = 10
q_table = np.zeros(numberOfBandits)  # running average payout per bandit
n_table = np.ones(numberOfBandits)   # pull counters start at 1 to avoid dividing by zero

epsilon = 0.9

## YOUR CODE GOES HERE
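If you get stuck, here is one possible shape of the loop, written against a stand-in environment rather than Gym so that it runs on its own. Everything about the stand-in is an assumption made for illustration: `fake_step` is a hypothetical replacement for `env.step(action)` that returns only a reward, `true_means` holds invented average payouts, and the `_sketch` names are used so that nothing above gets overwritten. Try the exercise yourself first!

```python
import numpy as np

rng = np.random.default_rng(0)

num_bandits = 10
true_means = rng.normal(0.0, 1.0, size=num_bandits)  # assumption: hidden average payouts

def fake_step(action):
    """Hypothetical replacement for env.step(action): returns only a reward."""
    return rng.normal(true_means[action], 1.0)

q_sketch = np.zeros(num_bandits)   # running average payout per bandit
n_sketch = np.ones(num_bandits)    # pull counters start at 1, as in the recipe
epsilon_sketch = 0.9

for _ in range(1000):
    if rng.random() < epsilon_sketch:
        action = int(np.argmax(q_sketch))        # exploit: best bandit so far
    else:
        action = int(rng.integers(num_bandits))  # explore: random bandit, 0 to 9
    reward = fake_step(action)
    # NewQvalue = OldQvalue + (reward - OldQvalue) / numberOfPulls
    q_sketch[action] += (reward - q_sketch[action]) / n_sketch[action]
    n_sketch[action] += 1

print("estimated best bandit:", np.argmax(q_sketch))
print("true best bandit:     ", np.argmax(true_means))
```

In your own version, swap `fake_step(action)` for `env.step(action)` and unpack the four values it returns. The two printed bandits will not always match, because the agent only sees noisy payouts and spends most of its time exploiting.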