
# Reinforcement learning with Q-learning

Reinforcement learning is a type of machine learning in which an agent learns by interacting with an environment rather than from a fixed dataset. It suits problems where good behavior has to be discovered through trial and error. Have you ever wondered how a dog learns tricks? Let us consider how we could train a dog.

The dog doesn't understand our language and needs to be taught how to do a certain trick. Since we can't tell the dog what to do, we follow a different strategy. We provide the dog with a prompt or a cue. For example, if we want the dog to sit, we point at the floor and say 'Sit!'. The dog then responds in some way, and depending on that response it may receive a reward. If the dog does nothing, we don't reward it. If the dog moves around, we don't reward it. Only if it sits do we reward it. The dog learns what to do from positive experiences.


Let us consider some key terms now:
1. The **agent** here is the dog.
2. The **environment** is us, since we respond to the dog's actions and decide the reward.
3. An **action** is what the dog does in response to the cue; it takes the dog from one state to another.
4. The **state** is the state of the dog. For example: sitting, standing, walking.
5. The **reward** is a value the dog receives for an action, here the number of snacks it gets.
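
The terms above can be sketched as a tiny interaction loop. This is only an illustration under simple assumptions: the action names, the one-treat reward, and the helper names `environment_reward` and `train_dog` are all hypothetical, and the "dog" here tries actions at random and then prefers whichever earned the most treats on average.

```python
import random

ACTIONS = ["do_nothing", "move_around", "sit"]  # hypothetical responses to the cue

def environment_reward(action):
    """The trainer (environment) gives one treat only when the dog sits."""
    return 1 if action == "sit" else 0

def train_dog(num_trials=200, seed=0):
    rng = random.Random(seed)
    treats = {a: 0 for a in ACTIONS}   # total reward earned per action
    tries = {a: 0 for a in ACTIONS}    # how often each action was tried
    for _ in range(num_trials):
        action = rng.choice(ACTIONS)          # the agent acts
        reward = environment_reward(action)   # the environment responds
        treats[action] += reward
        tries[action] += 1
    # The dog prefers the action with the highest average reward.
    return max(ACTIONS, key=lambda a: treats[a] / max(tries[a], 1))

best_action = train_dog()
print(best_action)
```

Only sitting ever earns a treat, so sitting ends up as the preferred action.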


We will now look at an example of reinforcement learning. Here is a game in which we need to pick up a passenger at one location and drop them off at another. How do we do that? Let's import a few libraries and get started.

## 1. Import libraries

In [8]:
from collections import defaultdict
import pickle
import random
from IPython.display import clear_output
import numpy as np

import click
import gym

We will be using the `gym.make()` function to make an environment and play the game. Run the code below and try it out.

In [9]:
env = gym.make("Taxi-v3").env



Let us learn a little about the gym environment that we are working with. Head over to [this link](https://gym.openai.com/envs/Taxi-v2/) to check out the source of the environment (the page is for Taxi-v2, while we are using Taxi-v3). See also [this tutorial](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/). We will try out a few functions with this environment and get started with setting it up.

Can you find out the function which can reset the environment? Run it in the code block below. What is the output like? What does the output represent?

In [15]:
env.reset()




388
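
One way to read this number: the integer packs the taxi position, the passenger location, and the destination into a single index. The sketch below assumes the encoding used by the Taxi environment, `((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination`; the helper name `decode_taxi_state` and the `LOCATIONS` list are introduced here for illustration. The starting state is random, so your own reset will usually return a different number.

```python
def decode_taxi_state(state):
    """Split a Taxi state index into (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)    # 4 possible destinations
    state, passenger_loc = divmod(state, 5)  # 4 pickup spots + "in the taxi"
    taxi_row, taxi_col = divmod(state, 5)    # 5 x 5 grid
    return taxi_row, taxi_col, passenger_loc, destination

LOCATIONS = ["R", "G", "Y", "B", "in taxi"]

row, col, passenger, destination = decode_taxi_state(388)
print(f"taxi at row {row}, column {col}; "
      f"passenger at {LOCATIONS[passenger]}; destination {LOCATIONS[destination]}")
```

Under that assumption, 388 means the taxi is at row 3, column 4, the passenger waits at Y, and the destination is R.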

The next step is to execute a step in the environment. We can choose between 6 actions for our agent.

0 = south  
1 = north  
2 = east  
3 = west  
4 = pickup  
5 = dropoff  
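
As a rough sketch of what the four movement actions do to the taxi position, the snippet below maps each action number to a row and column change. It is a simplification that ignores the walls inside the grid, and the names `ACTION_NAMES`, `MOVES`, and `move` are hypothetical helpers, not part of gym.

```python
ACTION_NAMES = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}  # (row change, column change)

def move(row, col, action, size=5):
    """Return the taxi position after an action, ignoring interior walls (a simplification)."""
    d_row, d_col = MOVES.get(action, (0, 0))   # pickup/dropoff do not move the taxi
    new_row = min(max(row + d_row, 0), size - 1)   # stay inside the grid
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col

position = move(3, 4, 0)   # south from row 3, column 4
print(ACTION_NAMES[0], position)
```

Row 0 is the top of the grid, so moving south increases the row number.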

You can use the `env.step()` function to run an action. Try it out below. 

In [11]:
env.step(0)
env.render()
env.step(3)
env.render()


+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | :[43m [0m| : |
|Y| : |[34;1mB[0m: |
+---------+
  (South)
+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|Y| : |[34;1mB[0m: |
+---------+
  (West)


Use the `env.render()` function on its own to display the current state of the environment.

In [12]:
env.render()

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|Y| : |[34;1mB[0m: |
+---------+
  (West)


### Task: Try to create an instance of the game using a while loop. 

The `env.step()` function returns four values. Play around with the game to see what each of them means.

In [16]:
done = False
while not done:
    # Render the environment in this line
    env.render()
    action = int(input())  # type an action number from 0 to 5
    clear_output(wait=True)
    # Run a step on the environment here
    obs, reward, done, info = env.step(action)
    print('Observation = ', obs, '\nreward = ', reward, '\ndone = ', done, '\ninformation = ', info)

Observation =  488 
reward =  -1 
done =  False 
information =  {'prob': 1.0}
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B:[43m [0m|
+---------+
  (South)


KeyboardInterrupt: ignored

Now that we have worked with the environment and understand the problem, let us define a few terms.

**State** - The state is given by the variable `obs` in the code above. It encodes the current situation of the environment.  
**Agent** - The agent here is the taxi.  
**Action** - The action is the value we pass to `env.step()`; the agent performs it in the environment.  
**Reward** - The reward is a number that tells how well the agent is doing. The fewer steps it takes to reach the `done` state, the higher the total reward.
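
To see why fewer steps is better, the sketch below adds up the rewards of one episode. It assumes the Taxi reward scheme of -1 per ordinary step, -10 for an illegal pickup or dropoff, and +20 for a successful dropoff; the function name `episode_return` is a hypothetical helper.

```python
STEP_REWARD = -1       # every ordinary time step
ILLEGAL_REWARD = -10   # illegal pickup or dropoff
DROPOFF_REWARD = 20    # successful dropoff, which ends the episode

def episode_return(num_steps, num_illegal=0):
    """Total reward for an episode of num_steps actions ending in a successful dropoff."""
    ordinary_steps = num_steps - 1 - num_illegal   # the last step is the dropoff
    return (ordinary_steps * STEP_REWARD
            + num_illegal * ILLEGAL_REWARD
            + DROPOFF_REWARD)

short_episode = episode_return(13)      # 12 ordinary steps, then the dropoff
long_episode = episode_return(30, 2)    # a longer episode with two illegal actions
print(short_episode, long_episode)
```

The short, clean episode ends with a positive total, while the longer one with mistakes ends well below zero.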

## 2. Q-Learning

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

To remember what has worked, the agent stores what it has learned in a table called the **Q-table**. This table maps each (state, action) pair to a Q-value, a number that estimates how beneficial it is to take that action in that state.

Here is what our Q-table should look like:

<img src="https://lrccd.instructure.com/files/30631391/download?download_frd=1">
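
A minimal sketch of the same idea in code, using a dictionary keyed by (state, action). The Q-values below are made up for illustration, and `best_action` is a hypothetical helper that looks up the highest-valued action for a state.

```python
from collections import defaultdict

# Hypothetical Q-values; unseen (state, action) pairs default to 0.0.
q_values = defaultdict(float)
q_values[(388, 0)] = 1.5    # state 388, action 0 (south)
q_values[(388, 3)] = -0.4   # state 388, action 3 (west)
q_values[(488, 1)] = 0.7    # state 488, action 1 (north)

def best_action(q_values, state, num_actions=6):
    """Return the action with the highest Q-value in the given state."""
    return max(range(num_actions), key=lambda action: q_values[(state, action)])

print(best_action(q_values, 388))
print(best_action(q_values, 488))
```

The notebook itself stores the table as a NumPy array with one row per state and one column per action, which holds the same information in a denser form.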


We need a few hyperparameters to implement the Q-learning algorithm effectively. During the learning process we can tune the following:

1. Alpha value. Alpha is a number between 0 and 1. It is the learning rate: it sets how much each update relies on the old Q-value and how much on the newly observed reward.
2. Gamma value. Gamma is the discount factor, which sets how much the algorithm values future rewards. A gamma value of 0 makes the learning algorithm short-sighted, since it then considers only the immediate reward.
3. Epsilon value. Epsilon is the exploration rate: the probability that the agent tries a random action instead of the best action it currently knows.
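
To make the effect of gamma concrete, here is a small sketch that weights a sequence of rewards by powers of gamma. The reward sequence is hypothetical and `discounted_return` is a helper introduced for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma ** k."""
    return sum(reward * gamma ** k for k, reward in enumerate(rewards))

# Hypothetical episode: three ordinary steps (-1 each), then a successful dropoff (+20).
rewards = [-1, -1, -1, 20]

for gamma in (0.0, 0.6, 1.0):
    print(gamma, round(discounted_return(rewards, gamma), 3))
```

With a gamma of 0 only the first reward counts, so the +20 at the end is invisible to the agent; larger values of gamma let the later reward influence the total.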

In [19]:

# The hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

NUM_EPISODES = 100000

q_table = np.zeros([env.observation_space.n, env.action_space.n])



#### Task: Implement the following update rule in Python

$$
Q(\text{state}, \text{action}) \leftarrow (1 - \alpha) \, Q(\text{state}, \text{action}) + \alpha \left( \text{reward} + \gamma \max_{a} Q(\text{next state}, a) \right)
$$
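
Before writing this against the real table, it may help to try the rule on plain numbers. The values passed in below are hypothetical, and `q_update` is a helper name introduced for this sketch; alpha and gamma match the hyperparameters set earlier.

```python
alpha = 0.1   # learning rate
gamma = 0.6   # discount factor

def q_update(old_value, reward, next_max, alpha=alpha, gamma=gamma):
    """Blend the old estimate with the new target reward + gamma * next_max."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# Hypothetical numbers: current estimate -2.0, step reward -1,
# best Q-value in the next state -1.5.
new_value = q_update(old_value=-2.0, reward=-1, next_max=-1.5)
print(round(new_value, 4))
```

Since alpha is small, the new value moves only a tenth of the way from the old estimate toward the target.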


In [20]:
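# In Python (state, action, reward and next_state come from the training loop below,
# so running this cell on its own raises a NameError):
old_value = q_table[state, action]
next_max = np.max(q_table[next_state])  # best Q-value available from the next state

new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

q_table[state, action] = new_value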




NameError: ignored

This is the code that updates the Q-value on each step. Let us now run it over `NUM_EPISODES` episodes and look at the results.

In [21]:
all_epochs = []
all_penalties = []

for i in range(1, NUM_EPISODES+1):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        # Epsilon-greedy: with probability epsilon (10%) explore a random action,
        # otherwise exploit the best action learned so far.
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)  # Perform the action

        # Q-learning update
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:  # illegal pickup or dropoff
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.
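
If gym is not available, the same training loop can be tried on a much smaller problem. The sketch below is not the Taxi environment: it assumes a made-up corridor of five cells with the goal at the right end, and the names `step`, `train`, and `greedy_policy` are hypothetical. The update rule and the epsilon-greedy choice are the ones used above.

```python
import random

N_STATES = 5          # corridor cells 0..4; the goal is cell 4
ACTIONS = (0, 1)      # 0 = left, 1 = right

def step(state, action):
    """Toy environment: -1 per move, +20 for reaching the goal cell."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 20 if done else -1
    return next_state, reward, done

def train(num_episodes=500, alpha=0.1, gamma=0.6, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                            # explore
            else:
                action = max(ACTIONS, key=lambda a: q_table[state][a])  # exploit
            next_state, reward, done = step(state, action)
            next_max = max(q_table[next_state])
            q_table[state][action] = ((1 - alpha) * q_table[state][action]
                                      + alpha * (reward + gamma * next_max))
            state = next_state
    return q_table

q_table = train()
# Best action in each non-goal cell according to the learned table.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy moves right in every cell, which is the shortest route to the goal.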



Congratulations! You have successfully trained a Q-learning model. With the supervised and unsupervised learning models, we saved the trained model in a model object, but what about this case? What is the model here, and how is it stored?

In [22]:
# The model is the Q-table itself, stored in the variable q_table
print(q_table)

[[  0.           0.           0.           0.           0.
    0.        ]
 [ -2.41837065  -2.3639511   -2.41837066  -2.3639511   -2.27325184
  -11.36395091]
 [ -1.87014398  -1.45024002  -1.870144    -1.45024001  -0.7504
  -10.45023974]
 ...
 [ -1.02440159   0.4159963   -1.0174437   -1.20640708  -2.75200302
   -5.26187208]
 [ -2.17023619  -2.12194197  -2.13112338  -2.1219448   -6.76730809
   -5.10303263]
 [  4.38587247   1.63684257   3.10500447  11.          -2.37460269
   -2.74344319]]
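
Since the table is the whole model, keeping it for later only requires saving that one object. A minimal sketch with `pickle`, which was imported at the top: the small list below is a hypothetical stand-in for the trained table, and the file name is an assumption. A NumPy array such as `q_table` can be pickled in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained table: 3 states x 2 actions.
trained_table = [[0.0, 0.0], [-2.4, -2.3], [4.4, 11.0]]

path = os.path.join(tempfile.mkdtemp(), "q_table.pkl")   # hypothetical file name

with open(path, "wb") as f:
    pickle.dump(trained_table, f)      # save the "model"

with open(path, "rb") as f:
    restored_table = pickle.load(f)    # load it back later

print(restored_table == trained_table)
```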


## 3. Evaluation

Let us now evaluate our Q-table. How can we do this? We run the same loop as in training, except that we always pick the best known action and do not update the Q-table. Try it yourself.

In [24]:

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    while not done:
        # Complete the following block
        action = np.argmax(q_table[state]) # Exploit learned values
        state, reward, done, info = env.step(action) # Performing the next step.

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 13.11
Average penalties per episode: 0.0
