This code is a commented version of the original code, which can be found at [this link](https://gist.github.com/simoninithomas/baafe42d1a665fb297ca669aa2fa6f92#file-q-learning-with-frozenlake-ipynb)

In [47]:
from numpy import *
import gym
import random

In [49]:
# Loading environment
env = gym.make("FrozenLake-v0")

# Getting environment data for the Q-table
# The Q-table will have dimensions (N x M)
# - N : number of possible states
# - M : number of possible actions
M = env.action_space.n
N = env.observation_space.n

In [50]:
qtable = zeros([N,M])

In [51]:
total_episodes = 1000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob
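
To get a feel for how quickly exploration fades with these settings, the sketch below evaluates the exponential decay formula that the training loop applies to `epsilon`. The helper name `epsilon_at` is an assumption made for illustration; the constants are copied from the cell above.

```python
import math

# Values copied from the parameter cell above
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01

def epsilon_at(episode):
    """Exploration probability for a given episode count (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Print the schedule at a few points of the 1000-episode run
for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values the agent starts fully random, explores roughly 37% of the time after 100 episodes, and is close to the `min_epsilon` floor by the end of training.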

Some of the theory:

Q-table values:
 * Each value is the expected cumulative reward of taking a given action in a given state
 * Future rewards are weighted by the discount rate $\gamma \in [0,1]$
   * If $\gamma$ is larger, the agent cares more about long-term reward
   * If $\gamma$ is smaller, the agent cares more about short-term reward
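
As a small illustration of the discount rate, the sketch below computes the discounted sum of rewards for one made-up episode under two values of $\gamma$. The function name `discounted_return` and the reward sequence are assumptions chosen for the example, not part of the original notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical episode: the only reward (1.0) arrives on the sixth step
rewards = [0, 0, 0, 0, 0, 1.0]

# A large gamma keeps most of the delayed reward, a small gamma almost none
for gamma in (0.95, 0.5):
    print(gamma, discounted_return(rewards, gamma))
```

With $\gamma = 0.95$ the delayed reward still counts for about 0.77, while with $\gamma = 0.5$ it shrinks to about 0.03, so the second agent has little incentive to plan for it.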
 
The Q-table values are defined by:
![title](./q-table_ecuation.png "ShowMyImage")

During learning, each Q-table value is updated as follows:
 * Start from the current Q-table value (the expected cumulative reward of the state-action pair)
 * Add the learning rate multiplied by the update term, which is:
   * the reward received on reaching the new state
   * plus $\gamma$ times the maximum expected cumulative reward over the actions of the new state
   * minus the current Q-table value

Update function of the Q-table:
![title](./update_function.png "ShowMyImage")
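
In case the image does not render, the rule that the training loop implements can be written out as follows. The notation is an assumption made here: $\alpha$ stands for `learning_rate`, $r$ for the reward, $s'$ for the new state and $a'$ for an action available in it.

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$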

The updating part can be seen as:
![title](./update_term.png "ShowMyImage")
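
A single update can also be worked through by hand. The sketch below applies the same arithmetic as the training loop to two made-up transitions; the helper `q_update` and the input numbers are hypothetical, while `learning_rate` and `gamma` use the values from the parameter cell.

```python
learning_rate = 0.8   # same value as in the parameter cell
gamma = 0.95          # same value as in the parameter cell

def q_update(q_sa, reward, max_q_next):
    """Return the new Q(s, a) after one Q-learning update (hypothetical helper)."""
    # Target: immediate reward plus the discounted best value of the new state
    td_target = reward + gamma * max_q_next
    # Move the old estimate toward the target by a fraction learning_rate
    return q_sa + learning_rate * (td_target - q_sa)

# Hypothetical transition: no reward yet, but the new state looks promising
print(q_update(q_sa=0.0, reward=0.0, max_q_next=1.0))

# Hypothetical transition that reaches the goal from a pair valued at 0.5
print(q_update(q_sa=0.5, reward=1.0, max_q_next=0.0))
```

The first call moves the estimate from 0 to about 0.76 even though no reward was received, which is how value propagates backwards from the goal. The second moves 0.5 to about 0.9.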

The algorithm:

In [53]:
# List of rewards
rewards = []

# Looping through different episodes
for episode in range(total_episodes):
    
    # Resetting environment for new episode
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    # Looping for actions in episode
    for step in range(max_steps):
        exploration_exploitation_tradeoff = random.uniform(0,1)
        
        # Decision
        # Determining next action to take
        if exploration_exploitation_tradeoff > epsilon:
            # Exploit: index of the best known action for the current state
            action = argmax(qtable[state,:])
        else:
            # Explore: sample a random action from the action space
            action = env.action_space.sample()
            
        # Observation
        # Taking the action and observing the new state and reward
        new_state, reward, dead, info = env.step(action)
            
        # Evaluation
        # Updating Q-table
        qtable[state, action] = qtable[state, action] + learning_rate * \
                (reward + gamma * max(qtable[new_state, :]) - qtable[state, action])
        
        # Cumulative reward for this episode
        total_rewards += reward
        
        # Updating the state
        state = new_state
        
        # If the episode has ended (hole or goal reached), stop stepping
        if dead:
            break
    
    # Use a 1-based episode count in the epsilon decay below
    # (the for loop reassigns `episode` at the start of the next iteration)
    episode += 1
    
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)* exp(-decay_rate*episode) 
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards)/total_episodes))
print(qtable)

Score over time: 0.309
[[1.03410633e-01 5.69054829e-02 6.18268817e-02 5.68786763e-02]
 [4.62253377e-03 7.71736000e-03 5.41749488e-04 6.06359617e-02]
 [9.39598041e-03 1.05926034e-02 4.61398199e-03 2.81459656e-02]
 [2.55523825e-03 4.60224124e-03 1.07003513e-02 2.05694241e-02]
 [1.59022101e-01 4.75562396e-03 2.44240361e-02 4.28948024e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.37910391e-04 0.00000000e+00 1.46771748e-01 4.25338786e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.50320218e-01 1.52737283e-02 6.01022675e-02 3.77884569e-01]
 [1.05838405e-03 4.51453832e-01 1.25794634e-01 4.85993017e-01]
 [7.73033775e-01 1.00487611e-03 3.26899909e-03 1.03105200e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.57491910e-01 1.88617038e-02 8.55935282e-01 1.55215294e-02]
 [2.15192306e-01 9.63024549e-01 2.22909032e-01 4.40999156e-01]
 [0.00000000e+00 0.00000000e+00 
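
One way to read the printed Q-table is as a greedy policy: for each state (row), pick the column with the largest value. The sketch below does this for three rows copied from the output above and rounded to four decimals. The helper `greedy_action` is hypothetical, and the action names follow FrozenLake's index order (0 = Left, 1 = Down, 2 = Right, 3 = Up).

```python
# FrozenLake action indices: 0 = Left, 1 = Down, 2 = Right, 3 = Up
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

# Rows for states 0, 4 and 8, copied from the printed Q-table and rounded
q_rows = {
    0: [0.1034, 0.0569, 0.0618, 0.0569],
    4: [0.1590, 0.0048, 0.0244, 0.0429],
    8: [0.1503, 0.0153, 0.0601, 0.3779],
}

def greedy_action(row):
    """Index of the largest Q-value in a row (the first index wins on ties)."""
    return max(range(len(row)), key=lambda a: row[a])

# Show the preferred action for each of the selected states
for state, row in q_rows.items():
    print(state, ACTION_NAMES[greedy_action(row)])
```

For these rows the preferred actions are Left, Left and Up. Because the lake is slippery, the move the agent actually makes can differ from the action it chose.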

Using our Q-table to play FrozenLake

In [55]:
# Playing
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False

    # Print steps
    print("****************************************************")
    print("EPISODE ", episode)
    
    for step in range(max_steps):

        env.render()
        
        # Take the action (index) that has the maximum expected future reward in the current state
        action = argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            break

        state = new_state

        
env.close()

Output (the agent's current cell is shown in brackets):

****************************************************
EPISODE  0

[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
FHFH
F[F]FH
HFFG
****************************************************
EPISODE  1

[S]FFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
FHFH
F[F]FH
HFFG
****************************************************
EPISODE  2

[S]FFF
FHFH
FFFH
HFFG
  