# FrozenLake-v0

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

The surface is described using a grid like the following:

<pre>
SFFF
FHFH
FFFH
HFFG
</pre>

<pre>
<b>State    Description                    Reward</b>
S        Starting point - Safe          0
F        Frozen surface - Safe          0
H        Hole - Game over               0
G        Goal - Game over               1
</pre>
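
The grid and the slippery movement described above can be sketched in plain Python, without gym. This is an illustration under stated assumptions, not the environment's actual code: states are assumed to be numbered row by row (`state = row * 4 + col`), actions are assumed to be 0=Left, 1=Down, 2=Right, 3=Up, and a slippery step is assumed to go in the intended direction or one of the two perpendicular directions with probability 1/3 each. The helper names `to_state`, `move`, and `slippery_transitions` are hypothetical.

```python
# Illustrative sketch only; these helpers are hypothetical and not part of gym.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
NROW, NCOL = len(GRID), len(GRID[0])
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action encoding


def to_state(row, col):
    """Row-major state index (assumption: state = row * NCOL + col)."""
    return row * NCOL + col


def move(state, direction):
    """Move one tile in the given direction, staying in place at the grid edge."""
    row, col = divmod(state, NCOL)
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, NROW - 1)
    elif direction == RIGHT:
        col = min(col + 1, NCOL - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return to_state(row, col)


def slippery_transitions(state, action):
    """Return (probability, next_state, tile) for each possible slip outcome."""
    outcomes = []
    # Assumed slip model: intended direction plus the two perpendicular ones.
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        next_state = move(state, direction)
        row, col = divmod(next_state, NCOL)
        outcomes.append((1 / 3, next_state, GRID[row][col]))
    return outcomes


# Possible outcomes of choosing Down on the tile just left of the goal.
for prob, next_state, tile in slippery_transitions(14, DOWN):
    print(f"p={prob:.2f} -> state {next_state} ({tile})")
```

Under these assumptions, choosing Down from state 14 never leads into a hole: the agent either stays put, slides left, or slides right onto the goal.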

## Importing the libraries

In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

## Creating the Environment

In [2]:
env = gym.make("FrozenLake-v0")

## Creating the Q-Table

In [3]:
state_space_size = env.observation_space.n
action_space_size = env.action_space.n

q_table = np.zeros((state_space_size, action_space_size))
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## Initializing Q-Learning Parameters

In [4]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1     # alpha
discount_rate = 0.99    # gamma

exploration_rate = 1    # epsilon
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
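
To get a feel for these values, the sketch below evaluates the exponential decay formula that the training loop applies to `exploration_rate`, using the same three parameters. The function name `exploration_rate_at` is hypothetical and only used for illustration.

```python
import math

max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


def exploration_rate_at(episode):
    """Epsilon after the given episode index (hypothetical helper)."""
    diff = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + diff * math.exp(-exploration_decay_rate * episode)


# Print epsilon at a few points of a 10000-episode run.
for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 3))
```

With these settings the agent explores almost exclusively at first, still explores roughly a third of the time after 1000 episodes, and is close to the 0.01 floor by the end of training.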

## Q-Learning Algorithm

In [5]:
rewards_all_episodes = []

# Q-Learning Algorithm
for episode in range(num_episodes):
    # initialize new episode parameters
    state = env.reset()
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        
        # Exploration-Exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # exploit
            action = np.argmax(q_table[state, :])
        else:
            # explore
            action = env.action_space.sample()
            
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        # Update Q-Table for Q(state, action)
        a = (1-learning_rate)*q_table[state, action]
        # a = (1-alpha)*old_value
        b = learning_rate*(reward + discount_rate*np.max(q_table[new_state, :]))
        # b = alpha*learned_value
        q_table[state, action] = a + b
        
        # Set new state & update current episode reward
        state = new_state
        rewards_current_episode += reward
        
        # check for hole or goal
        if done:
            break
            
    # Decay the exploration rate (epsilon) after each episode
    diff = max_exploration_rate - min_exploration_rate
    exploration_rate = min_exploration_rate + diff*np.exp(-exploration_decay_rate*episode)
    
    # Add final current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)
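
The two core pieces of the loop above, epsilon-greedy action selection and the Q-value update, can be isolated and checked by hand. The following is a minimal sketch on a made-up transition, not a replacement for the loop: the function names `choose_action` and `q_update` are hypothetical, and a seeded NumPy generator is used in place of `random` and `env.action_space.sample()` so that the block runs without gym.

```python
import numpy as np

learning_rate = 0.1    # alpha
discount_rate = 0.99   # gamma


def choose_action(q_table, state, exploration_rate, rng):
    """Epsilon-greedy: exploit when the random draw exceeds epsilon, else explore."""
    if rng.uniform(0, 1) > exploration_rate:
        return int(np.argmax(q_table[state, :]))
    return int(rng.integers(q_table.shape[1]))


def q_update(q_table, state, action, reward, new_state):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    old_value = q_table[state, action]
    learned_value = reward + discount_rate * np.max(q_table[new_state, :])
    q_table[state, action] = (1 - learning_rate) * old_value + learning_rate * learned_value
    return q_table[state, action]


rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
q_table = np.zeros((16, 4))

# Made-up transition: from state 14, action 2 reaches state 15 with reward 1.
print(q_update(q_table, 14, 2, 1.0, 15))   # first update moves 10% of the way to 1
print(q_update(q_table, 14, 2, 1.0, 15))   # second update moves a further 10%
print(choose_action(q_table, 14, 0.0, rng))  # epsilon = 0: always the greedy action
```

Because the learning rate is 0.1, each update moves the stored value only a tenth of the way toward the new estimate, which is why many episodes are needed before the table settles.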

In [6]:
# Calculate & print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("***Average reward per thousand episodes***")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

***Average reward per thousand episodes***
1000 :  0.03200000000000002
2000 :  0.20400000000000015
3000 :  0.4030000000000003
4000 :  0.5490000000000004
5000 :  0.6470000000000005
6000 :  0.6230000000000004
7000 :  0.6660000000000005
8000 :  0.6760000000000005
9000 :  0.6780000000000005
10000 :  0.6850000000000005
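
The same per-block averages can also be computed by reshaping the reward array and taking row means, which avoids the accumulated floating-point noise visible in the trailing digits above. This is an alternative sketch, not the code that produced the output; the helper name `average_reward_per_block` is hypothetical and the rewards below are synthetic.

```python
import numpy as np


def average_reward_per_block(rewards, block_size=1000):
    """Mean reward over consecutive blocks of episodes (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=float)
    usable = (len(rewards) // block_size) * block_size  # drop an incomplete last block
    return rewards[:usable].reshape(-1, block_size).mean(axis=1)


# Synthetic rewards for illustration only, not results from this notebook:
# 1000 failures followed by 1000 episodes alternating success and failure.
demo_rewards = [0] * 1000 + [1, 0] * 500
for i, avg in enumerate(average_reward_per_block(demo_rewards), start=1):
    print(i * 1000, ":", avg)
```

With the notebook's own data, the call would be `average_reward_per_block(rewards_all_episodes)`.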


In [7]:
# Print Updated Q-Table
print("***Q-Table***")
print(q_table)

***Q-Table***
[[0.53655583 0.51845247 0.51081785 0.51127514]
 [0.38132598 0.22829275 0.39472007 0.51057823]
 [0.41214766 0.41086179 0.40675767 0.47795336]
 [0.20693785 0.24013831 0.36635295 0.4595482 ]
 [0.55665581 0.43697615 0.45059725 0.43911091]
 [0.         0.         0.         0.        ]
 [0.14240114 0.10859987 0.21434851 0.13822023]
 [0.         0.         0.         0.        ]
 [0.3371566  0.44241845 0.42375123 0.59138084]
 [0.32600981 0.63688327 0.47678581 0.29336225]
 [0.57797187 0.42539118 0.29216137 0.37807542]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.44134909 0.42568543 0.75750115 0.62405326]
 [0.72505339 0.85814899 0.80843333 0.76756864]
 [0.         0.         0.         0.        ]]
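
A table of numbers is hard to read as a strategy, so the sketch below turns the Q-table printed above into a grid of greedy actions. It assumes the action order 0=Left, 1=Down, 2=Right, 3=Up and row-by-row state numbering; the helper name `greedy_policy_grid` is hypothetical. Holes and the goal are shown as `H` and `G` because no action is ever taken from them, which is also why their rows stay at zero.

```python
import numpy as np

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = "<v>^"  # assumed action order: 0=Left, 1=Down, 2=Right, 3=Up

# Values copied from the Q-table output above.
q_table = np.array([
    [0.53655583, 0.51845247, 0.51081785, 0.51127514],
    [0.38132598, 0.22829275, 0.39472007, 0.51057823],
    [0.41214766, 0.41086179, 0.40675767, 0.47795336],
    [0.20693785, 0.24013831, 0.36635295, 0.4595482],
    [0.55665581, 0.43697615, 0.45059725, 0.43911091],
    [0.0, 0.0, 0.0, 0.0],
    [0.14240114, 0.10859987, 0.21434851, 0.13822023],
    [0.0, 0.0, 0.0, 0.0],
    [0.3371566, 0.44241845, 0.42375123, 0.59138084],
    [0.32600981, 0.63688327, 0.47678581, 0.29336225],
    [0.57797187, 0.42539118, 0.29216137, 0.37807542],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.44134909, 0.42568543, 0.75750115, 0.62405326],
    [0.72505339, 0.85814899, 0.80843333, 0.76756864],
    [0.0, 0.0, 0.0, 0.0],
])


def greedy_policy_grid(q_table, grid):
    """One string per grid row: an arrow for the best action, or the tile if terminal."""
    ncol = len(grid[0])
    rows = []
    for r, line in enumerate(grid):
        cells = []
        for c, tile in enumerate(line):
            if tile in "HG":
                cells.append(tile)  # terminal tile: the episode ends here
            else:
                cells.append(ARROWS[int(np.argmax(q_table[r * ncol + c]))])
        rows.append("".join(cells))
    return rows


print("\n".join(greedy_policy_grid(q_table, GRID)))
```

Some of the chosen directions point away from the goal or into a wall. On slippery ice this can be the safer choice: if slips only go perpendicular to the chosen direction, pushing against a wall rules out sliding into a neighbouring hole on the opposite side.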


In [8]:
# Watch the agent play by playing the best action
for episode in range(3):
    # initialize new episode parameters
    state = env.reset()
    done = False
    print("Episode ", episode+1, "\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        # show current state of environment on screen
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        # choose action with highest Q-value for current state i.e., exploit
        action = np.argmax(q_table[state, :])
        # take new action
        new_state, reward, done, info = env.step(action)
        
        # check for hole or goal
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                print("\nReached the goal!")
                time.sleep(3)
            else:
                print("\nFell through a hole!")
                time.sleep(3)
            clear_output(wait=True)
            break
        
        # set new state
        state = new_state
        
env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG

Reached the goal!
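
Watching three rendered episodes gives only an impression of how well the agent plays. A success rate over many greedy episodes is a more reliable check; the sketch below shows one way to measure it. The function name `evaluate_greedy` is hypothetical, it assumes the same `reset()` / four-value `step()` interface used throughout this notebook, and `CorridorEnv` is a made-up stand-in so the block runs without gym.

```python
import numpy as np


def evaluate_greedy(env, q_table, num_episodes=1000, max_steps=100):
    """Fraction of episodes in which the greedy policy ends with reward 1."""
    successes = 0
    for _ in range(num_episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state, :]))
            state, reward, done, info = env.step(action)
            if done:
                successes += int(reward == 1)
                break
    return successes / num_episodes


class CorridorEnv:
    """Made-up three-state corridor with the same reset()/step() shape."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, anything else stays put; state 2 is the goal.
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, (1.0 if done else 0.0), done, {}


demo_q = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
print(evaluate_greedy(CorridorEnv(), demo_q, num_episodes=10))
```

With the notebook's own objects the call would be `evaluate_greedy(env, q_table)`, made before `env.close()` or on a freshly created environment.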
