Author: Salma Elbess <br>
Email: s-salmahasanelemam@zewailcity.edu.eg


Sources: <br>
https://www.youtube.com/playlist?list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv <br>
https://reinforcement-learning4.fun/2019/06/16/gym-tutorial-frozen-lake/ <br>
https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake_unslippery%20(Deterministic%20version).ipynb <br>
https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py <br>
https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf <br>
https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake.ipynb<br>
https://deeplizard.com/learn/video/HGeI30uATws <br>
https://deeplizard.com/learn/video/mo96Nqlo1L8 <br>

#Frozen lake

##Problem description

   
Winter is here. You and your friends were tossing around a frisbee at the
park when you made a wild throw that left the frisbee out in the middle of
the lake. The water is mostly frozen, but there are a few holes where the
ice has melted. If you step into one of those holes, you'll fall into the
freezing water. At this time, there's an international frisbee shortage, so
it's absolutely imperative that you navigate across the lake and retrieve
the disc. However, the ice is slippery, so you won't always move in the
direction you intend.<br>
The surface is described using a grid like the following:<br>
       

>   **SFFF<br>
        FHFH<br>
        FFFH<br>
        HFFG**



*   S : starting point, safe
*   F : frozen surface, safe
*   H : hole, fall to your doom
*   G : goal, where the frisbee is located



The episode ends when you reach the goal or fall in a hole.<br>
<b>You receive a reward of 1 if you reach the goal, and zero otherwise.</b>
    

In [98]:
#import statements
import numpy as np
import gym
import random

##States and Actions

In [164]:
#create the environment 
env = gym.make("FrozenLake-v0")

**Available actions:** <br>


*   LEFT = 0
*   DOWN = 1
*   RIGHT = 2
*   UP = 3 <br>

**Available States:** The state is the player's position on the grid. The player may be on any of the 16 squares, so there are 16 states.
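The link between a state number and a square on the map can be sketched as below. This is a minimal illustration, assuming the states are numbered row by row (state 0 is `S` in the top-left corner, state 15 is `G`); the helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of gym.

```python
# Assumption: states are numbered row by row, so state = row * 4 + col.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])


def state_to_cell(state):
    """Return (row, col, tile letter) for a state index (hypothetical helper)."""
    row, col = divmod(state, N_COLS)
    return row, col, GRID[row][col]


def cell_to_state(row, col):
    """Inverse mapping: grid coordinates back to the state index."""
    return row * N_COLS + col


# List the states that are holes on this map.
holes = [s for s in range(16) if state_to_cell(s)[2] == "H"]

print(state_to_cell(0))
print(state_to_cell(15))
print("hole states:", holes)
```

Under this numbering the holes are the states whose Q-table rows are never updated, since an episode ends as soon as the agent enters one.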


In [165]:
n_actions = env.action_space.n #number of available actions
n_states = env.observation_space.n #number of possible states

print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)

Action space:  Discrete(4)
Observation space:  Discrete(16)


##Stochastic Vs. Deterministic Environments

###Stochastic Environment (Slippery)

In [101]:
env = gym.make("FrozenLake-v0")
#play without training
env.reset()
env.render()
for i in range(16):
    random_action = env.action_space.sample()
    new_state, reward, done, info = env.step(
       random_action)
    env.render()
    print("reward: ",reward)
    print(info)
    if done:
        break


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG
reward:  0.0
{'prob': 0.3333333333333333}
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
reward:  0.0
{'prob': 0.3333333333333333}
  (Up)
SFFF
F[41mH[0mFH
FFFH
HFFG
reward:  0.0
{'prob': 0.3333333333333333}
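The `prob` value of 1/3 in the output above suggests how the slippery version picks the actual move. The sketch below is a simplified re-implementation, not the gym code itself: it assumes the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3, and that moves into a wall leave the agent in place. The functions `move` and `outcome_distribution` are hypothetical names introduced for illustration.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
N = 4  # the map is 4x4


def move(state, action):
    """Deterministic move on the grid; walls keep the agent in place."""
    row, col = divmod(state, N)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N - 1)
    elif action == RIGHT:
        col = min(col + 1, N - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N + col


def outcome_distribution(state, action):
    """Assumed slippery dynamics: intended action or a perpendicular one, 1/3 each."""
    dist = {}
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        new_state = move(state, actual)
        dist[new_state] = dist.get(new_state, 0.0) + 1.0 / 3.0
    return dist


# Choosing RIGHT at the start can end in state 1 (right), 4 (down) or 0 (up, blocked).
print(outcome_distribution(0, RIGHT))
```

This is consistent with the first step of the run above, where the agent chose Right at the start but ended up one square down.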


###Deterministic Environment (Non Slippery)

In [102]:
env = gym.make("FrozenLake-v0",is_slippery=False)
#play without training
env.reset()
env.render()
for i in range(16):
    random_action = env.action_space.sample()
    new_state, reward, done, info = env.step(
       random_action)
    env.render()
    print("reward: ",reward)
    print(info)
    if done:
        break



[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
reward:  0.0
{'prob': 1.0}
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
reward:  0.0
{'prob': 1.0}
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
reward:  0.0
{'prob': 1.0}
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
reward:  0.0
{'prob': 1.0}
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
reward:  0.0
{'prob': 1.0}


##Q-Learning

This section solves the Frozen Lake 4×4 map using Q-learning. The agent is trained and <br>
tested in the deterministic environment, so the results are easier to interpret.
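Before the full training loop, it may help to see a single Q-table update in isolation. The sketch below applies the same rule the loop later in this section uses, new value = (1 - alpha) * old value + alpha * (reward + gamma * best value in the next state), with the learning rate 0.8 and discount rate 0.95 from this notebook. It uses plain Python lists instead of NumPy so it runs on its own, and `q_update` is a hypothetical helper name.

```python
def q_update(q_table, state, action, reward, new_state, alpha=0.8, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value."""
    target = reward + gamma * max(q_table[new_state])
    q_table[state][action] = (1 - alpha) * q_table[state][action] + alpha * target
    return q_table[state][action]


# 16 states x 4 actions, all zeros, as in the notebook.
q = [[0.0] * 4 for _ in range(16)]

# Moving RIGHT (2) from state 14 reaches the goal (state 15) with reward 1.
first = q_update(q, 14, 2, 1.0, 15)   # 0.8 * 1.0 = 0.8
second = q_update(q, 14, 2, 1.0, 15)  # moves closer to 1.0 (about 0.96)

# Moving RIGHT from state 13 gives no reward, but inherits value from state 14.
third = q_update(q, 13, 2, 0.0, 14)

print(first, second, third)
```

Repeating the update pulls the value for (14, RIGHT) toward 1, and the reward then propagates backwards to earlier states, discounted by gamma at each step.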

In [154]:
env = gym.make("FrozenLake-v0",is_slippery=False)
n_actions = env.action_space.n #number of available actions
n_states = env.observation_space.n #number of possible states

In [161]:
# Construct the Q-table: one row per state, one column per action
# (n rows = n states, n columns = n actions)

q_table = np.zeros((n_states,n_actions))
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [162]:
#initialize parameters

n_epis = 20000 #number of episodes the agent will play to learn

#max number of steps the agent can take in one episode; if it is exceeded before
#the agent reaches a terminal state, the episode ends and the agent receives 0 points
max_n_steps_per_episode = 99 

learning_rate = 0.8 #alpha
discount_rate = 0.95 #gamma 

# Exploration parameters
epsilon = 1.0                 # Exploration rate - probability of exploration
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.001             # Exponential decay rate for exploration prob

#For visualization: print n_show + 1 evenly spaced episodes from the first 4000
n_show = 5 #number of intervals between the visualized episodes
epis_step = 4000// (n_show)
epis_to_visualize = list(range(0,4000+1,epis_step))
print(epis_to_visualize)

[0, 800, 1600, 2400, 3200, 4000]


See how the agent changes its behavior from exploration to exploitation as it <br>is being trained.
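How quickly the agent shifts from exploring to exploiting depends on the exploration parameters defined above. The sketch below evaluates the exponential decay formula used at the end of each training episode for the visualized episodes; `epsilon_after` is a hypothetical helper, and it returns the exploration rate that applies once `epis` episodes have been completed.

```python
import math

max_epsilon = 1.0    # exploration probability at start
min_epsilon = 0.01   # minimum exploration probability
decay_rate = 0.001   # exponential decay rate


def epsilon_after(epis):
    """Exploration rate after `epis` completed episodes (same formula as the training loop)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * epis)


schedule = [(epis, epsilon_after(epis)) for epis in (0, 800, 1600, 2400, 3200, 4000)]
for epis, eps in schedule:
    print(f"after {epis:5d} episodes: epsilon = {eps:.3f}")
```

With these settings the exploration rate falls steadily and approaches `min_epsilon` well before the 20000 training episodes are over, so most of the later episodes are almost purely greedy.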

In [163]:
#List of rewards
all_epis_rewards = []

for epis in range(n_epis):

    #for each episode

    #reset state
    state = env.reset()
    done = False
    current_epis_reward = 0
    step = 0
    if epis in epis_to_visualize:
        print("----------------------")
        print("Episode ", epis +1)
        print("Exploration rate: ",epsilon)

    #iterate steps
    for step in range(max_n_steps_per_episode):
        
        
        
        #for each time step
        # Exploration-exploitation trade-off
        # Take new action
        # Update Q-table
        # Set new state
        # Add new reward   

        #1. Exploration-exploitation trade-off - choose an action

        #1.1 generate random number
        random_num = random.uniform(0,1)

        if random_num > epsilon:
            #1.2 if random number > epsilon -> exploitation
            # Highest q value for the state
            action = np.argmax(q_table[state,:])
        else:
            #1.3 if random number <= epsilon -> exploration
            # select random action 
            action = env.action_space.sample()
        if epis in epis_to_visualize:
            env.render()
            print("action with max q-value for state-action pair: ", np.argmax(q_table[state,:]))
            print("action Selected by agent: ",action)
        #2.Take action
        new_state, reward, done, info = env.step(action)

        #3. update Q table for the state action pair
        q_table[state,action] = q_table[state,action]*(1-learning_rate) + learning_rate*(reward + discount_rate*np.max(q_table[new_state,:]))

        #4. set the new state
        state = new_state

        #5. add new reward
        current_epis_reward += reward

        #check if terminating state
        if done:
            break #stop the episode - move to next episode
        
    #after each episode - update exploration rate

    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*epis) 
    all_epis_rewards.append(current_epis_reward)


----------------------
Episode  1
Exploration rate:  1.0

[41mS[0mFFF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  2
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  2
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  2
  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  2
  (Right)
SFF[41mF[0m
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  3
  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  0
  (Left)
SF[41mF[0mF
FHFH
FFFH
HFFG
action with max q-value for state-action pair:  0
action Selected by agent:  2
  (Right)
SFF

**Results**

In [158]:
# Policy Results After Training
print("Q-table")
print()
print(q_table)
print()

print("------------------------ ")

env.reset()
print(" Environment ")
env.render()
print()
print("------------------------ ")
print("Action selection for each state")
print()
print(np.argmax(q_table,axis=1).reshape(4,4))
print("LEFT = 0 DOWN = 1 RIGHT = 2 UP = 3")

Q-table

[[0.73509189 0.77378094 0.77378094 0.73509189]
 [0.73509189 0.         0.81450625 0.77378094]
 [0.77378094 0.857375   0.77378094 0.81450625]
 [0.81450625 0.         0.77376622 0.77375009]
 [0.77378094 0.81450625 0.         0.73509189]
 [0.         0.         0.         0.        ]
 [0.         0.9025     0.         0.81450625]
 [0.         0.         0.         0.        ]
 [0.81450625 0.         0.857375   0.77378094]
 [0.81450625 0.9025     0.9025     0.        ]
 [0.857375   0.95       0.         0.857375  ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.9025     0.95       0.857375  ]
 [0.9025     0.95       1.         0.9025    ]
 [0.         0.         0.         0.        ]]

------------------------ 
 Environment 

[41mS[0mFFF
FHFH
FFFH
HFFG

------------------------ 
Action selection for each state

[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
LEFT = 0 DOWN = 1 RIGHT = 2 UP = 3
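The non-zero entries of the Q-table above follow a simple pattern that can be checked by hand. Assuming the only reward is the 1 received on the final move into the goal, an action that starts a shortest path of k moves should converge to a value of `discount_rate ** (k - 1)`. The sketch below prints these powers so they can be compared with the table; `optimal_q` is a hypothetical helper name.

```python
discount_rate = 0.95  # gamma, as in the notebook


def optimal_q(k):
    """Value of an action that begins a shortest path of k moves to the goal.

    Assumption: reward 1 on the last move only, deterministic environment.
    """
    return discount_rate ** (k - 1)


for k in range(1, 8):
    print(k, round(optimal_q(k), 8))
```

For example, moving Right from state 14 reaches the goal in one move (value 1), and moving Down from the start begins a six-move path, which corresponds to the 0.77378094 entry in the first row of the table.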


**Testing the trained agent**

In [152]:
for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_n_steps_per_episode):
        
        # Take the action (index) that has the maximum expected future reward in the current state
        action = np.argmax(q_table[state,:])
        
        new_state, reward, done, info = env.step(action)
        env.render()

        if done:
            # Render the final state once more (to see whether the agent reached the goal or fell into a hole)
            env.render()
            
            # Print the index of the last step; step is zero-based, so the agent made step + 1 moves
            print("Number of steps", step)
            break
        state = new_state

****************************************************
EPISODE  0
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 5
****************************************************
EPISODE  1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 5
****************************************************
EPISODE  2
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
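The trajectory shown in these test episodes can be reproduced without gym by following the greedy policy on the map directly. The sketch below is a simplified stand-in for the deterministic environment: the policy is copied from the "Action selection for each state" output, and the `move` and `rollout` helpers are hypothetical names that assume row-by-row state numbering and that walls keep the agent in place.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N = 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# Greedy policy copied from the printed argmax of the trained Q-table.
POLICY = [1, 2, 1, 0,
          1, 0, 1, 0,
          2, 1, 1, 0,
          0, 2, 2, 0]


def move(state, action):
    """Deterministic (non-slippery) move; walls keep the agent in place."""
    row, col = divmod(state, N)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N - 1)
    elif action == RIGHT:
        col = min(col + 1, N - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N + col


def rollout(policy, max_steps=99):
    """Follow the policy from the start until a hole or the goal is reached."""
    state = 0
    path = [state]
    for _ in range(max_steps):
        state = move(state, policy[state])
        path.append(state)
        row, col = divmod(state, N)
        if GRID[row][col] in "HG":
            break
    return path


path = rollout(POLICY)
print("visited states:", path)
print("number of moves:", len(path) - 1)
```

The rollout makes six moves, which matches the printed episodes: the "Number of steps 5" line reports the zero-based index of the last step rather than the move count.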
 