<h1>Setting up Frozen Lake in code</h1>
<h3><i>Libraries</i></h3>
<p>First, we import the libraries we’ll be using. There aren’t many: NumPy, Gym, random, time, and clear_output from IPython’s display module.</p>

In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

<h3><i>Creating the environment</i></h3>
<p>Next, to create our environment, we call gym.make() and pass it the name of the environment we want to set up as a string. We'll be using the environment FrozenLake-v0. The full list of available environments and their names is on Gym’s website.</p>

In [2]:
env = gym.make("FrozenLake-v0")

<p>With this <b><i>env</i></b> object, we’re able to query for information about the environment, sample states and actions, retrieve rewards, and have our agent navigate the frozen lake. That’s all made available to us conveniently with Gym.</p>

In [3]:
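# The number of actions and states sets the Q-table's dimensions
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

# One row per state, one column per action, every Q-value starting at zero
q_table = np.zeros((state_space_size, action_space_size))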

In [4]:
print(q_table.shape)
print(q_table)

(16, 4)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [5]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
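
<p>The exploration values above drive an exponential decay that the training loop further down applies once per episode. To give a rough feel for how quickly exploration tapers off with these settings, the sketch below evaluates that schedule at a few episode numbers. It uses only the standard library, and <code>exploration_rate_at</code> is a hypothetical helper name, not part of the notebook.</p>

```python
import math

# Same values as the hyperparameter cell above
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


def exploration_rate_at(episode):
    """Exploration rate produced by the decay formula for a given episode index."""
    return min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * math.exp(-exploration_decay_rate * episode)


# Print the schedule at a few points during training
for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 4))
```

<p>With these values the rate starts at 1 and drops to roughly 0.37 after 1,000 episodes, so most of the random exploration happens early in training, and the rate never goes below min_exploration_rate.</p>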

<h3>Q-learning Algorithm</h3>
<p>First, we create a list to hold the total reward from each episode, so we can see how our game score changes over time.</p>

In [6]:
rewards_all_episodes = []

<h3><i>Update the Q-value</i></h3>
\begin{equation*} q^{new}\left( s,a\right) =\left( 1-\alpha \right) \underbrace{q\left( s,a\right) }_{\text{old value}}+\alpha \overbrace{\left( R_{t+1}+\gamma \max_{a^{\prime }}q\left( s^{\prime },a^{\prime }\right) \right) }^{\text{learned value}} \end{equation*}
<i>Eq 1.0</i>
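
<p>Eq 1.0 is a weighted average of the old Q-value and the newly learned value, with the learning rate as the weight. As a small worked example, the sketch below applies it to two made-up transitions. The function name <code>updated_q_value</code> and the input numbers are hypothetical; the learning and discount rates match the hyperparameters set earlier.</p>

```python
learning_rate = 0.1   # alpha in Eq 1.0
discount_rate = 0.99  # gamma in Eq 1.0


def updated_q_value(old_q, reward, max_next_q):
    """Blend the old Q-value with the learned value, as in Eq 1.0."""
    learned_value = reward + discount_rate * max_next_q
    return (1 - learning_rate) * old_q + learning_rate * learned_value


# Hypothetical transition 1: reward of 1, old Q-value still 0,
# and a next state whose Q-values are all 0
print(updated_q_value(old_q=0.0, reward=1.0, max_next_q=0.0))

# Hypothetical transition 2: no reward, but the next state already
# has a best Q-value of 0.5 that gets discounted and blended in
print(updated_q_value(old_q=0.2, reward=0.0, max_next_q=0.5))
```

<p>In the first case the Q-value moves from 0 to 0.1, one tenth of the way toward the learned value of 1. In the second it moves from 0.2 to about 0.2295, pulled up only by the discounted value of the next state.</p>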

In [7]:
# Q-Learning algorithm
for episode in range(num_episodes):
    # initialize new episode params
    state = env.reset()
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        # Update Q-table (eq 1.0)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        # Set new state
        state = new_state
        
        # Add new reward
        rewards_current_episode += reward
        
        # Stop stepping once the environment reports the episode has ended
        if done:
            break
        
    # Exploration rate decay (using Exponential decay)
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    
    # Add current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)

print("****Training Complete!****")

****Training Complete!****
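
<p>The exploration-exploitation step inside the training loop can also be looked at in isolation. The sketch below is a minimal standalone version of that rule on a plain Python list, using only the standard library. The helper name <code>choose_action</code>, the example Q-values, and the fixed seed are all assumptions made for illustration.</p>

```python
import random


def choose_action(q_row, exploration_rate, rng=random):
    """Pick an action index for one state using the explore/exploit rule."""
    if rng.uniform(0, 1) > exploration_rate:
        # Exploit: index of the largest Q-value (the first one wins on ties)
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: any action, chosen uniformly at random
    return rng.randrange(len(q_row))


q_row = [0.1, 0.7, 0.3, 0.2]  # hypothetical Q-values for a single state
random.seed(0)                # fixed seed so repeated runs give the same share

# Share of picks that land on the best action (index 1) at a 20% exploration rate
trials = 10_000
best_share = sum(choose_action(q_row, 0.2) == 1 for _ in range(trials)) / trials
print(best_share)
```

<p>At an exploration rate of 0.2, the best action should be chosen about 85% of the time: 80% from exploiting, plus a quarter of the 20% random picks.</p>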


In [8]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.03200000000000002
2000 :  0.19600000000000015
3000 :  0.4030000000000003
4000 :  0.5520000000000004
5000 :  0.6500000000000005
6000 :  0.6620000000000005
7000 :  0.6710000000000005
8000 :  0.6740000000000005
9000 :  0.6830000000000005
10000 :  0.6790000000000005
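
<p>Because an episode's total reward is 1 when the agent reaches the goal and 0 otherwise, each average above is the share of episodes in that block that ended at the goal. The sketch below shows the same chunked averaging on a tiny made-up reward list, using only the standard library. The helper name <code>average_per_chunk</code> and the data are hypothetical.</p>

```python
def average_per_chunk(rewards, chunk_size):
    """Mean reward of each consecutive block of `chunk_size` episodes."""
    if len(rewards) % chunk_size != 0:
        # np.split is similarly strict: it raises when the split is uneven
        raise ValueError("len(rewards) must be a multiple of chunk_size")
    return [
        sum(rewards[start:start + chunk_size]) / chunk_size
        for start in range(0, len(rewards), chunk_size)
    ]


# Hypothetical rewards for 8 episodes (1 = reached the goal, 0 = did not)
rewards = [0, 0, 0, 1, 0, 1, 1, 1]
print(average_per_chunk(rewards, chunk_size=4))  # prints [0.25, 0.75]
```

<p>Read the same way, the final value of about 0.68 in the output above means the agent reached the goal in roughly 68% of the last thousand training episodes.</p>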


In [9]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[0.56193005 0.51904312 0.51133082 0.52203772]
 [0.27044836 0.26739031 0.31072775 0.50329229]
 [0.3879442  0.39202743 0.38405374 0.47641747]
 [0.3658019  0.38904336 0.37912939 0.46413361]
 [0.57738551 0.40079344 0.39364308 0.44860997]
 [0.         0.         0.         0.        ]
 [0.18362053 0.1287866  0.40630663 0.17050205]
 [0.         0.         0.         0.        ]
 [0.49886795 0.48726804 0.41940613 0.60020979]
 [0.44527007 0.65397234 0.47557004 0.23692219]
 [0.5568598  0.39871816 0.31294885 0.36951209]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.48380887 0.56428717 0.79754623 0.56847762]
 [0.74634644 0.9031563  0.71339461 0.71511925]
 [0.         0.         0.         0.        ]]
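
<p>The raw numbers are easier to read as a policy: for each state, which action has the highest Q-value? The sketch below copies the Q-table printed above into plain Python lists and lays out the best action per state on the 4x4 grid. It assumes Gym's FrozenLake action order of 0 = Left, 1 = Down, 2 = Right, 3 = Up. The names <code>q_values</code>, <code>ACTION_SYMBOLS</code>, and <code>greedy_symbol</code> are hypothetical.</p>

```python
# Q-values copied from the printed Q-table (rows = states 0..15,
# columns = actions, assumed to be ordered Left, Down, Right, Up)
q_values = [
    [0.56193005, 0.51904312, 0.51133082, 0.52203772],
    [0.27044836, 0.26739031, 0.31072775, 0.50329229],
    [0.3879442, 0.39202743, 0.38405374, 0.47641747],
    [0.3658019, 0.38904336, 0.37912939, 0.46413361],
    [0.57738551, 0.40079344, 0.39364308, 0.44860997],
    [0.0, 0.0, 0.0, 0.0],
    [0.18362053, 0.1287866, 0.40630663, 0.17050205],
    [0.0, 0.0, 0.0, 0.0],
    [0.49886795, 0.48726804, 0.41940613, 0.60020979],
    [0.44527007, 0.65397234, 0.47557004, 0.23692219],
    [0.5568598, 0.39871816, 0.31294885, 0.36951209],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.48380887, 0.56428717, 0.79754623, 0.56847762],
    [0.74634644, 0.9031563, 0.71339461, 0.71511925],
    [0.0, 0.0, 0.0, 0.0],
]

ACTION_SYMBOLS = "LDRU"  # index 0..3 -> Left, Down, Right, Up


def greedy_symbol(row):
    """Letter of the highest-valued action, or '.' for a row that was never updated."""
    if not any(row):
        return "."
    return ACTION_SYMBOLS[max(range(len(row)), key=lambda a: row[a])]


symbols = [greedy_symbol(row) for row in q_values]

# Print the policy in the same 4x4 layout as the lake
for start in range(0, len(symbols), 4):
    print(" ".join(symbols[start:start + 4]))
```

<p>For this table the grid reads <code>L U U U</code>, <code>L . R .</code>, <code>U D L .</code>, <code>. R D .</code>. The dots fall on the holes and the goal of the lake map rendered further down. These rows stay at zero because an episode ends as soon as the agent enters one of those states, so no action is ever taken from them.</p>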


<h3>The code to watch the agent play the game</h3>

In [10]:
# Watch our agent play Frozen Lake by playing the best action
# from each state according to the Q-table

for episode in range(3):
    # initialize new episode params
    state = env.reset()
    done = False
    print("*****EPISODE ", episode + 1, "*****\n\n\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        # Show current state of environment on screen
        clear_output(wait = True)
        env.render()
        time.sleep(0.3)
        
        # Choose action with highest Q-value for current state
        action = np.argmax(q_table[state, :])
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait = True)
            env.render()
            if reward == 1:
                # Agent reached the goal and won episode
                print("****You reached the goal!****")
                time.sleep(2)
                clear_output(wait = True)
            else:
                # Agent stepped in a hole and lost episode
                print("****You fell through a hole!****")
                time.sleep(2)
                clear_output(wait = True)
            break
                
        # Set new state
        state = new_state

env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG
****You reached the goal!****


<p><b>References</b></p>
<ul>
<li><a href="http://deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv">DeepLizard</a>  [Reinforcement Learning - Introducing Goal Oriented Intelligence]</li>

<li><a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a></li>
    
<li><a href="https://en.wikipedia.org/wiki/Exponential_Decay">Exponential Decay</a></li>
</ul>