<h3> Part 2 - Improving the baseline performance by tuning the hyperparameters </h3>

<h4> Blog post (written by me) </h4>

https://medium.com/@nancyjemi/level-up-understanding-q-learning-cf739867eb1d

This blog post summarizes my understanding of this notebook and how it is implemented. I recommend reading it before the notebook, since it introduces the concepts used below.

<h4> Importing the libraries </h4>

In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

<h4> Rendering the environment - Toy text </h4>

In [2]:
env = gym.make("Taxi-v3")
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Taxi: row 0, column 2. Passenger: Y. Destination: G.)



<h4> States, Actions and the Q-table </h4>

In [3]:
action_space_size = env.action_space.n
print("Action size ", action_space_size)

state_space_size = env.observation_space.n
print("State size ", state_space_size)


q_table = np.zeros((state_space_size, action_space_size))
print("The size of Q-table ", q_table.shape)
print(q_table)

Action size  6
State size  500
The size of Q-table  (500, 6)
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
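
The 500 states are not arbitrary: as far as I understand the Taxi environment, they enumerate 25 taxi positions on the 5x5 grid, 5 passenger locations (the four landmarks plus "inside the taxi") and 4 destinations. The sketch below reproduces that arithmetic in plain Python. `encode_state` and `decode_state` are hypothetical helper names written for this illustration; they are meant to mirror the environment's encoding, not to replace it.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer in the range [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(state):
    """Invert encode_state: recover (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# 5 rows * 5 columns * 5 passenger locations * 4 destinations
n_states = 5 * 5 * 5 * 4
print(n_states)                  # 500
print(encode_state(4, 4, 4, 3))  # 499, the largest state index
print(decode_state(499))         # (4, 4, 4, 3)
```

Each row of the Q-table therefore corresponds to one such combination, and each of the 6 columns to one action.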


<h4> Creating the hyperparameters </h4>

In [7]:
total_episodes = 5000
total_test_episodes = 100
max_steps = 99
alpha = 0.7 # Learning rate
gamma = 0.8 # Discount factor
epsilon = 1.0 # Exploration rate
max_exploration_rate_1 = 1
min_exploration_rate_1 = 0.01
decay_rate = 0.01 # Exponential decay rate
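
To get a feel for what `decay_rate` does, here is a small sketch of the exponential schedule used in the training loop, evaluated at a few episodes. `epsilon_at` is a hypothetical helper name, and it uses `math.exp` instead of `np.exp` only so that it runs on its own. The value 0.001 is the decay rate used in the tuned runs further down.

```python
import math


def epsilon_at(episode, decay_rate, eps_min=0.01, eps_max=1.0):
    """Exploration rate after `episode` episodes under exponential decay."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * episode)


for decay in (0.01, 0.001):
    schedule = {ep: round(epsilon_at(ep, decay), 3) for ep in (0, 100, 1000, 4999)}
    print("decay_rate =", decay, schedule)
```

With `decay_rate = 0.01` the agent is already mostly greedy after a few hundred episodes, while with `0.001` epsilon is still about 0.37 after 1000 episodes, so the early training rewards are expected to look much worse.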

In [8]:
rewards_all_episodes = []
epsilon_all_episodes = []
max_step_epsilon = []

# Q-learning algorithm
for episode in range(total_episodes):
    state = env.reset()
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps):       
        
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > epsilon:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] + alpha * (reward + gamma * 
                                    np.max(q_table[new_state, :]) - q_table[state, action])
        
        state = new_state
        rewards_current_episode += reward

        # Stop the episode once the environment reports termination
        if done:
            break
            
           
    # Exploration rate decay
    epsilon = min_exploration_rate_1 + \
        (max_exploration_rate_1 - min_exploration_rate_1) * np.exp(-decay_rate*episode) 
    
    # step is zero-based, so an episode that ran out of steps ends at max_steps - 1
    if step == max_steps - 1 and not done:
        max_step_epsilon.append(epsilon)
    
    rewards_all_episodes.append(rewards_current_episode)
    epsilon_all_episodes.append(epsilon)

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), total_episodes // 1000)
count = 1000
print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)

print("epsilon values for all episodes : \n",epsilon_all_episodes)

********Average reward per thousand episodes********

1000 :  -9.591000000000097
2000 :  7.58699999999997
3000 :  7.32099999999995
4000 :  7.612999999999964
5000 :  7.250999999999962


********Q-table********

[[  0.           0.           0.           0.           0.
    0.        ]
 [ -2.85914843  -2.31661225  -2.8746169   -2.31671641  -1.6445568
  -11.31569921]
 [  0.24287816   1.55359761   0.24287999   1.55359983   3.192
   -7.44647622]
 ...
 [  3.14628067   5.24         3.19085125   1.52134295  -5.80832239
   -5.80826069]
 [ -3.71336291  -1.21551381  -3.71336291  -3.54352334 -11.60054835
  -10.2396    ]
 [  6.38223404   7.79737362  -1.302       14.99650257  -7.
  -10.39248   ]]
epsilon values for all episodes : 
 [1.0, 0.9901493354116764, 0.9803966865736877, 0.970741078213023, 0.9611815447608, 0.9517171302557069, 0.9423468882484062, 0.9330698817068888, 0.9238851829227694, 0.9147918734185159, 0.9057890438555999, 0.896875793943563, 0.888051232349986, 0.8793144766113556, 0.8706646530
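
To make the update line in the training loop concrete, here is a single Q-learning update worked through on a toy 2-state, 2-action table. This is only a sketch: `q_update` is a hypothetical helper, and the table is a plain list of lists rather than the NumPy array used above.

```python
def q_update(q_table, state, action, reward, new_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q_table[new_state])      # value of the best action in the next state
    td_target = reward + gamma * best_next   # what Q(state, action) should move towards
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Toy table: state 0 is unexplored, state 1 already has some estimates
q = [[0.0, 0.0],
     [1.0, 3.0]]

new_value = q_update(q, state=0, action=1, reward=-1, new_state=1, alpha=0.7, gamma=0.8)
print(round(new_value, 2))  # 0.98 = 0 + 0.7 * (-1 + 0.8 * 3 - 0)
```

The learning rate `alpha` controls how far the old estimate moves towards the target, and `gamma` controls how much the best value of the next state counts.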

In [9]:
# Watch our agent play taxi-v3 by playing the best action 
# from each state according to the Q-table
reward_all_test_episode = []
total_test_steps_taken = []
for episode in range(total_test_episodes):
    state = env.reset()
    reward_current_test_episode = 0
    current_test_steps_taken = 0
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps):        
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state,:])        
        new_state, reward, done, info = env.step(action)
        print(done, reward)
        time.sleep(0.5)
        reward_current_test_episode += reward
        if done:
            clear_output(wait=True)
            env.render()
            print(step)
            if reward == 20:
                print("****You reached the goal!****")
                time.sleep(3)
            elif reward == -1:
                print(step)
                print("****You moved an action!****")
            else:
                print("****Illegal pickup/dropoff!****")
                time.sleep(3)
            clear_output(wait=True)
            break
        state = new_state
    reward_all_test_episode.append(reward_current_test_episode)
    # step is the zero-based index of the last step, so the episode took step + 1 steps
    total_test_steps_taken.append(step)
env.close()
print("Score over time: " + str(sum(reward_all_test_episode)/total_test_episodes))
print("Steps over time: " + str(sum(total_test_steps_taken)/total_test_episodes))

Score over time: 7.54
Steps over time: 12.46
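
The two numbers above are linked. Taxi gives -1 for every ordinary step and +20 for the successful drop-off, so an episode that finishes in n steps with no illegal pickup or drop-off scores 21 - n. The sketch below checks this against the printed averages, under the assumption that every test episode ended with a successful drop-off. `expected_score` is a hypothetical helper for this illustration. Note that the test loop appends the zero-based `step`, so the real step count is one higher than "Steps over time".

```python
def expected_score(steps_taken, step_reward=-1, dropoff_reward=20):
    """Total reward of an episode whose last step is a successful drop-off."""
    return (steps_taken - 1) * step_reward + dropoff_reward


mean_last_index = 12.46           # "Steps over time" printed above (zero-based)
mean_steps = mean_last_index + 1  # actual average number of steps per episode
print(round(expected_score(mean_steps), 2))  # 7.54, the printed "Score over time"
```

Because the relation is linear, it holds for the averages as well as for single episodes.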


<h3> Tuning the hyperparameters to improve the baseline </h3>

In [13]:
total_episodes_1 = 5000
total_test_episodes_1 = 100
max_steps_1 = 99
alpha_1 = 0.7 # Learning rate
gamma_1 = 0.65 # Discount factor
epsilon = 1.0 # Exploration rate
max_exploration_rate_1 = 1
min_exploration_rate_1 = 0.01
decay_rate = 0.001 # Exponential decay rate

In [14]:
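rewards_all_episodes = []
epsilon_all_episodes = []
max_step_epsilon = []

# Start from a fresh Q-table so this run does not reuse values learned in an earlier run
q_table = np.zeros((state_space_size, action_space_size))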

# Q-learning algorithm
for episode in range(total_episodes_1):
    state = env.reset()
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_1):       
        
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > epsilon:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] + alpha_1 * (reward + gamma_1 * 
                                    np.max(q_table[new_state, :]) - q_table[state, action])
        
        state = new_state
        rewards_current_episode += reward

        # Stop the episode once the environment reports termination
        if done:
            break
            
           
    # Exploration rate decay
    epsilon = min_exploration_rate_1 + \
        (max_exploration_rate_1 - min_exploration_rate_1) * np.exp(-decay_rate*episode) 
    
    # step is zero-based, so an episode that ran out of steps ends at max_steps_1 - 1
    if step == max_steps_1 - 1 and not done:
        max_step_epsilon.append(epsilon)
    
    rewards_all_episodes.append(rewards_current_episode)
    epsilon_all_episodes.append(epsilon)

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), total_episodes_1 // 1000)
count = 1000
print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)

print("epsilon values for all episodes : \n",epsilon_all_episodes)
len(epsilon_all_episodes)

********Average reward per thousand episodes********

1000 :  -135.3559999999997
2000 :  -9.326999999999996
3000 :  2.8519999999999937
4000 :  5.685999999999967
5000 :  6.884999999999962


********Q-table********

[[  0.           0.           0.           0.           0.
    0.        ]
 [ -2.65712496  -2.54942301  -2.65712496  -2.54942301  -2.38372771
  -11.54942301]
 [ -1.73663362  -1.1332825   -1.73663362  -1.1332825   -0.20505
  -10.1332825 ]
 ...
 [ -0.20504994   1.223       -0.20504995  -1.1332825   -9.20504987
   -9.20504918]
 [ -2.3837277   -2.12881186  -2.38372771  -2.12881186 -11.38372771
  -11.3837277 ]
 [  6.8          3.42         6.8         12.          -2.2
   -2.2       ]]
epsilon values for all episodes : 
 [1.0, 0.9990104948350412, 0.9980219786806598, 0.9970344505483393, 0.9960479094505515, 0.9950623544007555, 0.9940777844133959, 0.9930941985039028, 0.99211159568869, 0.9911299749851548, 0.9901493354116764, 0.9891696759876151, 0.9881909957333113, 0.9872132936700847, 

5000
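
A note on the per-thousand averages: `np.split` with an integer only works when the number of episodes divides evenly into that many sections, otherwise it raises an error. The sketch below shows the same block averaging in plain Python with an explicit check. `block_means` is a hypothetical helper name, not something the notebook defines.

```python
def block_means(values, block_size):
    """Average `values` over consecutive blocks of `block_size` items."""
    if len(values) % block_size != 0:
        raise ValueError("len(values) must be a multiple of block_size")
    return [
        sum(values[start:start + block_size]) / block_size
        for start in range(0, len(values), block_size)
    ]


# Tiny stand-in for rewards_all_episodes: 6 episodes, averaged in blocks of 3
toy_rewards = [-100, -50, -30, 5, 8, 11]
print(block_means(toy_rewards, 3))  # [-60.0, 8.0]
```

In the notebook this would correspond to something like `block_means(rewards_all_episodes, 1000)`.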

In [15]:
# Watch our agent play taxi-v3 by playing the best action 
# from each state according to the Q-table
reward_all_test_episode = []
total_test_steps_taken = []
for episode in range(total_test_episodes_1):
    state = env.reset()
    reward_current_test_episode = 0
    current_test_steps_taken = 0
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_1):        
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state,:])        
        new_state, reward, done, info = env.step(action)
        print(done, reward)
        time.sleep(0.5)
        reward_current_test_episode += reward
        if done:
            clear_output(wait=True)
            env.render()
            print(step)
            if reward == 20:
                print("****You reached the goal!****")
                time.sleep(3)
            elif reward == -1:
                print(step)
                print("****You moved an action!****")
            else:
                print("****Illegal pickup/dropoff!****")
                time.sleep(3)
            clear_output(wait=True)
            break
        state = new_state
    reward_all_test_episode.append(reward_current_test_episode)
    # step is the zero-based index of the last step, so the episode took step + 1 steps
    total_test_steps_taken.append(step)
env.close()
print("Score over time: " + str(sum(reward_all_test_episode)/total_test_episodes_1))
print("Steps over time: " + str(sum(total_test_steps_taken)/total_test_episodes_1))

Score over time: 8.3
Steps over time: 11.7

<h4> Trying a lower learning rate and discount factor </h4>


In [7]:
total_episodes_1 = 5000
total_test_episodes_1 = 100
max_steps_1 = 99
alpha_1 = 0.3 # Learning rate
gamma_1 = 0.3 # Discount factor
epsilon = 1.0 # Exploration rate
max_exploration_rate_1 = 1
min_exploration_rate_1 = 0.01
decay_rate = 0.001 # Exponential decay rate

In [8]:
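rewards_all_episodes = []
epsilon_all_episodes = []
max_step_epsilon = []

# Start from a fresh Q-table so this run does not reuse values learned in an earlier run
q_table = np.zeros((state_space_size, action_space_size))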

# Q-learning algorithm
for episode in range(total_episodes_1):
    state = env.reset()
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_1):       
        
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > epsilon:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] + alpha_1 * (reward + gamma_1 * 
                                    np.max(q_table[new_state, :]) - q_table[state, action])
        
        state = new_state
        rewards_current_episode += reward

        # Stop the episode once the environment reports termination
        if done:
            break
            
           
    # Exploration rate decay
    epsilon = min_exploration_rate_1 + \
        (max_exploration_rate_1 - min_exploration_rate_1) * np.exp(-decay_rate*episode) 
    
    # step is zero-based, so an episode that ran out of steps ends at max_steps_1 - 1
    if step == max_steps_1 - 1 and not done:
        max_step_epsilon.append(epsilon)
    
    rewards_all_episodes.append(rewards_current_episode)
    epsilon_all_episodes.append(epsilon)

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), total_episodes_1 // 1000)
count = 1000
print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)

print("epsilon values for all episodes : \n",epsilon_all_episodes)
len(epsilon_all_episodes)

********Average reward per thousand episodes********

1000 :  -185.81599999999975
2000 :  -59.10699999999978
3000 :  -33.31100000000003
4000 :  -24.882000000000012
5000 :  -26.34100000000002


********Q-table********

[[  0.           0.           0.           0.           0.
    0.        ]
 [ -1.4285735   -1.42839419  -1.42829673  -1.42840719  -1.42814965
  -10.42781971]
 [ -1.42381561  -1.41292862  -1.42382806  -1.41292865  -1.3765
  -10.41292889]
 ...
 [ -1.34395942  -1.255       -1.35178937  -1.4083495  -10.36614512
  -10.36288674]
 [ -1.78790855  -1.42520692  -1.55357532  -1.42575471 -11.31278584
  -10.45818441]
 [  0.5217411   -0.56204806   0.65239644   5.          -8.40245544
   -8.45163337]]
epsilon values for all episodes : 
 [1.0, 0.9990104948350412, 0.9980219786806598, 0.9970344505483393, 0.9960479094505515, 0.9950623544007555, 0.9940777844133959, 0.9930941985039028, 0.99211159568869, 0.9911299749851548, 0.9901493354116764, 0.9891696759876151, 0.9881909957333113, 0.98721329

5000
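
One way to read the Q-tables printed in this notebook is to compute the discounted return of an ideal path: -1 for every ordinary step and +20 on the final drop-off. The sketch below does that for the three discount factors used here. `optimal_value` is a hypothetical helper, and the interpretation that follows is my own reading of the numbers, not a guaranteed explanation.

```python
def optimal_value(steps_to_goal, gamma, step_reward=-1.0, goal_reward=20.0):
    """Discounted return when the drop-off is the last of `steps_to_goal` actions."""
    value = 0.0
    for k in range(steps_to_goal - 1):
        value += (gamma ** k) * step_reward                # ordinary steps cost -1 each
    value += (gamma ** (steps_to_goal - 1)) * goal_reward  # the drop-off pays +20
    return value


for gamma in (0.8, 0.65, 0.3):
    values = [round(optimal_value(n, gamma), 4) for n in (1, 2, 3, 6, 10)]
    print("gamma =", gamma, values)
```

Several entries of the printed tables match these values, for example 5 and -1.3765 for `gamma_1 = 0.3`, and 12 and 6.8 for `gamma_1 = 0.65`. With a discount factor of 0.3 the value of states far from the goal approaches -1 / (1 - 0.3), roughly -1.4286, so the actions in those states become almost indistinguishable. That may be part of why this run scores so much lower than the other two.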

In [None]:
# Watch our agent play taxi-v3 by playing the best action 
# from each state according to the Q-table
reward_all_test_episode = []
total_test_steps_taken = []
for episode in range(total_test_episodes_1):
    state = env.reset()
    reward_current_test_episode = 0
    current_test_steps_taken = 0
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_1):        
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state,:])        
        new_state, reward, done, info = env.step(action)
        print(done, reward)
        time.sleep(0.5)
        reward_current_test_episode += reward
        if done:
            clear_output(wait=True)
            env.render()
            print(step)
            if reward == 20:
                print("****You reached the goal!****")
                time.sleep(3)
            elif reward == -1:
                print(step)
                print("****You moved an action!****")
            else:
                print("****Illegal pickup/dropoff!****")
                time.sleep(3)
            clear_output(wait=True)
            break
        state = new_state
    reward_all_test_episode.append(reward_current_test_episode)
    # step is the zero-based index of the last step, so the episode took step + 1 steps
    total_test_steps_taken.append(step)
env.close()
print("Score over time: " + str(sum(reward_all_test_episode)/total_test_episodes_1))
print("Steps over time: " + str(sum(total_test_steps_taken)/total_test_episodes_1))

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
(Taxi: row 4, column 3. Passenger: R. Destination: G.)


<h3> License </h3>

MIT License

Copyright (c) 2020 Nancy Jemimah Packiyanathan

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.