# Training RL-agent to play FrozenLake
By: Akshara Shukla

In this script, I will be walking though the steps involved in training an agent to play the FrozenLake game by using Q-table and learning parameters of reinforcement learning. The tutorial I have followed is available on this [link](https://www.youtube.com/watch?v=HGeI30uATws&list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv&index=10&ab_channel=deeplizard).

### 1. Importing the required libraries

In [1]:
import numpy as np 
import gym 
import random 
import time 
from IPython.display import clear_output

In [2]:
!pip install openai-gym

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[31mERROR: Could not find a version that satisfies the requirement openai-gym (from versions: none)[0m
[31mERROR: No matching distribution found for openai-gym[0m


In [8]:
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"

In [9]:
pip install gym[toy_text]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pygame==2.1.0
  Downloading pygame-2.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
[K     |████████████████████████████████| 18.3 MB 95 kB/s 
Installing collected packages: pygame
Successfully installed pygame-2.1.0


import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
import pygame
pygame.init()
screen = pygame.display.set_mode((400, 300))

### 2. Importing the game using the gym library enviornment

In [10]:
env = gym.make("FrozenLake-v1")

  "Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
  "Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."


With this action, we can sample states and actions, retrieve rewards and have our agent navigate the frozen lake. 

### 3. Generating the Q -Table
The first step for building the Q - Table is to initiate all the key values to zero for each (state-action) pair.

The number of rows in the table is equivalent to the size of the state space in the environment and the number of columns is equivalent to the size of the action space.

We can get this information by using the env environment loaded above.

In [11]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size,action_space_size))
print(action_space_size,state_space_size)
print(q_table)

4 16
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### 4. Create and Initialize the parameters needed to implement the Q learning algorithm

If our trained agent hasn't received to its end goal i.e., the frizbee after the 100th step, then the model will terminate and the total reward of the model will be 0.

In [22]:
#Set 01
count = 1000
num_episodes = 4000   #Total episodes
max_steps_per_episode = 100

lr = 0.001
dr = 0.99

exploration_rate = .97   # Exploration rate
max_exploration_rate = 0.99   # Exploration probability at start
min_exploration_rate = 0.0001   # Minimum exploration probability 
exploration_decay_rate = 0.0001   # Exponential decay rate for exploration prob

Set 02
count = 1000
num_episodes = 10000
max_steps_per_episode = 100

lr = 0.01
dr = 0.999

exploration_rate = .95
max_exploration_rate = 0.95
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

In [23]:
# list to hold all the rewards we'll get from each episodes
rewards_all_episodes = []

for episode in range(num_episodes):
  state = env.reset()   #for each episode, we need to reset the state of the environment

  done = False #keeps track of whether or not the episode is finished
  rewards_in_current_episode = 0 #keeping track of the rewards within the current episode

  #for each time step within an episode
  for step in range(max_steps_per_episode):

    #exploration+exploitation trade off
    exploration_rate_threshold = random.uniform(0,1) #setting the random number
    if exploration_rate_threshold > exploration_rate:
      action = np.argmax(q_table[state,:]) #exploit and choose the highest value
    else:
      action = env.action_space.sample()  #agent will explore the environment and sample an action randomly
    #tuple
    new_state, reward, done, info = env.step(action) #take that action by calling step on env object and pass it through

    #Updating the q-table with new values 
    q_table[state,action] = q_table[state, action] * (1 - lr) + \
    lr * (reward + dr * np.argmax(q_table[new_state,:]))

    state = new_state
    rewards_in_current_episode += reward

    if done == True:
      break

  #Once episode is finished, we need to update our exploration rate using exponential decay
  exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
  
  rewards_all_episodes.append(rewards_in_current_episode)

# After all the episodes, we need to calculate the average reward per thousand episodes from our reward list during training
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes),num_episodes/1000)
count = 1000
print("*******************Average Reward Per Thousand Episodes***********************\n")
for reward_points in rewards_per_thousand_episodes:
  print(count, ": ", str(sum(reward_points/1000)))
  count += 1000

print("Total Score over time: " +  str(sum(reward_points)/num_episodes))
#Print the updated Q-table
print("\n\n**********Q-table***********\n")
print(q_table)

*******************Average Reward Per Thousand Episodes***********************

1000 :  0.011000000000000003
2000 :  0.01800000000000001
3000 :  0.010000000000000002
4000 :  0.01900000000000001
Score over time: 0.00475


**********Q-table***********

[[3.99724979e-01 2.83176000e+00 2.81932644e+00 2.85000000e+00]
 [2.84911724e+00 2.84912447e+00 2.75512277e+00 2.85000000e+00]
 [2.84911281e+00 2.84994551e+00 2.84975680e+00 2.85000000e+00]
 [2.29900584e+00 2.84558738e+00 2.29897666e+00 2.85000000e+00]
 [4.78653682e-01 3.64800020e-03 1.84472185e+00 3.07653886e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.99189211e+00 3.04000784e-02 3.26496973e-01 2.39016988e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.61168859e+00 1.52146896e+00 1.82522769e+00 4.01523842e-01]
 [3.22299185e-01 3.66381188e-01 4.50236009e-02 1.95651716e-03]
 [2.43937434e+00 1.25160604e-02 3.17036941e-01 4.04685302e-07]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.0000000

Our agent played 4000 episodes. At each time step within an episode the agent received a reward of 1 if it reached the frizbee and otherwise 0.

The agent did indeed reach the frizbee then the episode, then the episode finihsed at that time step.

So, for the first 1000 episodes we can interpret in this score as meaning 1.1% of the time, the agent received a reward of 1 and won the episode.

And by the end, the agent was winning 1.9% of the times. This means the agent increaed a bit it's performance.

### 5. Visualizing the trained agent play Frozen Lake
In this section, I'll be trying to visualize how the trained agent is playing in each episode. Therefore, we can see the final output of each episode. 

In [24]:
for episode in range(3):
  state = env.reset()
  step = 0
  done = False 
  print("***********EPISODE ", episode+1, "***********\n\n\n\n")
  time.sleep(1) #making the agent sleep for 1 second to get ready for the current or next episode

  for step in range(max_steps_per_episode):
    clear_output(wait=True) #clears the output from the current cell to avoid overrite
    env.render() #renders on the env object to render the current state of the environment to the display to visually see the game grid & where exactly our agent is on the grid
    time.sleep(0.3)

    action = np.argmax(q_table[state,:])  #setting action to the highest Q value from the Q-table for our current state
    new_state, reward, done, info = env.step(action) # updating the action's correspondng new_state, reward, done or not and important information

    if done:
      clear_output(wait=True)
      env.render()
      if reward == 1:
        print("*************You reached your goal!******")
      else:
        print("************You fell in a hole! Try Again :( ********")
        time.sleep(1)

      clear_output(wait=True)
    state= new_state

env.close()

************You fell in a hole! Try Again :( ********


From the above results, for the top 3 episodes, the agent mostly fell into the hole. The agent wasn't able to reach to the final goal. In order to combat this, I believe the next logical step would be fine-tune and experiment with different values of the learning parameters. 

### 6. Conclusions

From the above exercise, I was introduced to the process of training a reinforcement learning agent. I was able to understand the importance of learning rate, decay rate and the exploration rate. It has been really interesting to make an agent be trained to play the FrozenLake. By following the tutorial I was able to carry out this assignment. Although, the future recommendations for the assignment would be to fine - tune it a bit more and increase it's chances of reaching its goal. 