<a href="https://colab.research.google.com/github/CompuScien/Deep-Reinforcement-Learning/blob/master/QLearning0-FrozenLake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:



import numpy as np
import pandas as pd
import gym
import random


The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

The surface is described using a grid like the following:

SFFF       (S: starting point, safe)

FHFH       (F: frozen surface, safe)

FFFH       (H: hole, fall to your doom)

HFFG       (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. 

You receive a reward of 1 if you reach the goal, and zero otherwise.
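
The slipperiness described above can be made concrete without gym. The sketch below is a stand-alone approximation, not the library's implementation: it assumes the standard slippery FrozenLake rule, in which the agent moves in the intended direction or in one of the two perpendicular directions with probability 1/3 each, and that a move off the grid leaves the agent in place. The names `LAKE`, `MOVES`, and `slip_step` are hypothetical helpers introduced for illustration.

```python
import random

# The 4x4 map from the description above.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # FrozenLake action encoding
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def slip_step(state, action, rng=random):
    """One slippery move (illustrative helper): returns (new_state, reward, done)."""
    # Assumption: intended direction or one of the two perpendicular
    # directions, each chosen with probability 1/3.
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[actual]
    # A move off the grid leaves the agent where it is.
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = LAKE[row][col]
    # Reward 1 only on the goal; the episode ends on a hole or the goal.
    return row * 4 + col, float(tile == "G"), tile in "HG"


rng = random.Random(0)
print(sorted({slip_step(0, LEFT, rng)[0] for _ in range(200)}))
```

Under these assumptions, choosing Left on the start tile either keeps the agent on S or slides it down to the tile below, so the set printed here can only contain states 0 and 4.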

In [None]:

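#-------------create environment for RL-------------#
# Written against the older gym API: newer gym/gymnasium releases register this
# environment as "FrozenLake-v1" and changed the return values of reset() and step().
environment = gym.make("FrozenLake-v0")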

#-------------initialization for Qtable-------------#
action_size = environment.action_space.n
state_size = environment.observation_space.n
# print(action_size, state_size)
# Qtable = np.zeros((state_size, action_size)) + 0.00000001
Qtable = np.random.random((state_size, action_size))/1000000
# print(Qtable)


#--------------hyper parameters set----------------#
episodes = 250000
lr = 0.5
max_steps = 100
gamma = 0.95


epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01


#------------implement Q-Learning------------#
# List of rewards
rewards = []

# 2. Loop over training episodes
for episode in range(episodes):
    # Reset the environment
    state = environment.reset()
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        # First draw a random number between 0 and 1
        exp_exp_tradeoff = random.uniform(0, 1)

        # If this number is greater than epsilon --> exploitation (take the action with the largest Q-value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(Qtable[state, :])

        # Otherwise make a random choice --> exploration
        else:
            action = environment.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = environment.step(action)

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
        # Qtable[new_state, :] holds the values of all actions available from the new state
        Qtable[state, action] = Qtable[state, action] + lr * (reward + gamma * np.max(Qtable[new_state, :]) - Qtable[state, action])

        total_rewards += reward

        # The new state becomes the current state
        state = new_state

        # If done (the agent fell in a hole or reached the goal): finish the episode
        if done:
            break

    # Reduce epsilon (because we need less and less exploration).
    # episode + 1 is the number of episodes completed so far.
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
    rewards.append(total_rewards)






print("Score over time: " + str(sum(rewards) / episodes))
print(Qtable)


Score over time: 0.534332
[[1.87367225e-01 1.10109139e-01 1.11012236e-01 1.12229358e-01]
 [4.66200689e-02 3.45890139e-02 5.37973660e-02 9.43830194e-02]
 [4.73729702e-02 5.39232512e-02 4.45348642e-02 5.59959032e-02]
 [3.87718050e-02 1.17455740e-02 1.04604812e-02 5.62096897e-02]
 [2.34774071e-01 5.75251043e-02 1.19338210e-01 6.17766650e-02]
 [2.71793350e-07 7.67745890e-09 7.01262639e-07 2.59824048e-07]
 [2.72084778e-02 1.43342047e-03 2.00466338e-03 2.07945748e-02]
 [7.78481061e-07 2.45926406e-07 7.09304494e-07 1.64703385e-07]
 [9.59067560e-02 1.54294145e-01 3.54874402e-02 2.75139333e-01]
 [1.21037850e-01 3.94063375e-01 6.40696575e-02 4.73496819e-02]
 [6.61518855e-01 5.20558155e-02 3.63073006e-03 4.79602012e-02]
 [9.10605005e-07 8.39605700e-07 8.29081844e-07 6.78281053e-08]
 [3.85527755e-07 8.65628852e-07 3.16740328e-07 3.47672591e-07]
 [6.04360686e-02 1.86545222e-01 5.61863838e-01 1.23172408e-01]
 [4.32360960e-01 9.18886804e-01 4.04859073e-01 4.42299972e-01]
 [4.70511835e-08 1.98317035e-
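
The update rule in the training loop can be checked by hand on a tiny example. The sketch below reuses `lr`, `gamma`, and the epsilon settings from the cell above; the helpers `q_update` and `epsilon_at` and the all-zero table are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

lr, gamma = 0.5, 0.95  # values from the training cell
min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 0.01


def q_update(q, state, action, reward, new_state):
    """Apply one tabular Q-learning update in place; return the TD error."""
    td_target = reward + gamma * np.max(q[new_state])
    td_error = td_target - q[state, action]
    q[state, action] += lr * td_error
    return td_error


def epsilon_at(episode):
    """Exploration rate after `episode` completed episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


# Toy check on an all-zero table (hypothetical, not the trained Qtable).
q = np.zeros((16, 4))

# Reaching the goal (state 15) from state 14 with reward 1:
# Q moves halfway toward the target 1.0, so it becomes about 0.5.
q_update(q, state=14, action=1, reward=1.0, new_state=15)
print(q[14, 1])

# One step earlier, the reward is 0 but the value propagates back:
# 0.5 * (0 + 0.95 * 0.5) is about 0.2375.
q_update(q, state=13, action=2, reward=0.0, new_state=14)
print(q[13, 2])

print(epsilon_at(0), epsilon_at(1000))
```

With `decay_rate = 0.01`, epsilon is already within about 0.5% of `min_epsilon` after 1,000 episodes, so under these settings nearly all of the 250,000 training episodes run with roughly 1% random actions.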

In [None]:

#--------------Run program after learning Q-Table-----------#

for episode in range(5):
    state = environment.reset()
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        environment.render()
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(Qtable[state, :])

        new_state, reward, done, info = environment.step(action)

        if done:
            break
        state = new_state
environment.close()
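
The greedy rule `np.argmax(Qtable[state, :])` used in the loop above can be read directly off the Q-table printed after training. The sketch below copies a few of those rows, rounded to four decimals, and maps each argmax to an action name. It assumes the usual FrozenLake action order (0 = Left, 1 = Down, 2 = Right, 3 = Up); `ACTION_NAMES`, `q_rows`, and `greedy_action` are illustrative names, not part of the notebook.

```python
import numpy as np

ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # assumed FrozenLake order 0..3

# Selected rows of the printed Q-table, rounded to four decimals.
q_rows = {
    0: [0.1874, 0.1101, 0.1110, 0.1122],
    4: [0.2348, 0.0575, 0.1193, 0.0618],
    8: [0.0959, 0.1543, 0.0355, 0.2751],
    9: [0.1210, 0.3941, 0.0641, 0.0473],
    13: [0.0604, 0.1865, 0.5619, 0.1232],
    14: [0.4324, 0.9189, 0.4049, 0.4423],
}


def greedy_action(q_values):
    """Index of the largest Q-value, i.e. the action the evaluation loop takes."""
    return int(np.argmax(q_values))


policy = {s: ACTION_NAMES[greedy_action(v)] for s, v in q_rows.items()}
for state, name in policy.items():
    row, col = divmod(state, 4)
    print(f"state {state:2d} (row {row}, col {col}): {name}")
```

For these rows the greedy choices are Left, Left, Up, Down, Right, and Down, which agree with the action labels in the rendered episodes. Choosing Left in the first column looks odd at first, but on slippery ice it can only keep the agent in place or move it up or down, never right toward the hole in row 1.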


****************************************************
EPISODE  0

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
****************************************************
EPISODE  1

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Le