This code is a commented version of the original code, which can be found at [this link](https://gist.github.com/simoninithomas/baafe42d1a665fb297ca669aa2fa6f92#file-q-learning-with-frozenlake-ipynb)

In [1]:
from numpy import *
import gym
import random

In [2]:
# Loading environment
env = gym.make("FrozenLake-v0")

# Getting environment data for the Q-table
# The Q-table has dimensions N x M:
# - N : number of possible states
# - M : number of possible actions
M = env.action_space.n
N = env.observation_space.n

In [3]:
qtable = zeros([N,M])
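
To make the layout of the Q-table concrete, here is a minimal sketch of how it is indexed. It assumes the 4x4 FrozenLake map (16 states, 4 actions), which matches the table printed later in this notebook; the value written into the table and the names `n_states`, `n_actions` and `best_action` are hypothetical and only serve the illustration.

```python
import numpy as np

# Assumed sizes for the 4x4 FrozenLake map: 16 states, 4 actions.
n_states, n_actions = 16, 4

# One row per state, one column per action; all estimates start at zero.
qtable = np.zeros((n_states, n_actions))

# qtable[state, action] holds the estimate for taking `action` in `state`.
qtable[14, 1] = 0.9  # hypothetical value, for illustration only

# The greedy action for a state is the column index of the largest entry in its row.
best_action = int(np.argmax(qtable[14, :]))
print(qtable.shape, best_action)
```

Reading a row gives the estimates for every action in one state, which is why the training loop below uses `qtable[state, :]`.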

In [4]:
total_episodes = 10000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.01             # Exponential decay rate for exploration prob
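
The exploration parameters above only make sense together with the decay formula used in the training loop. As a rough sketch, the schedule can be inspected on its own; the helper name `epsilon_at` is an assumption made for this example and does not appear in the original code.

```python
import math

max_epsilon = 1.0   # exploration probability at start
min_epsilon = 0.01  # minimum exploration probability
decay_rate = 0.01   # exponential decay rate

def epsilon_at(episode):
    """Exploration probability after `episode` episodes (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Print the exploration probability at a few points of the training run.
for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
```

With these settings the agent explores almost always at the start and settles close to `min_epsilon` well before the 10000 episodes are over.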

Some of the theory:

Q-table values:
 * They are the expected cumulative reward of taking an action in a given state, that is:
   * The sum of the rewards of every action taken from step t onward, i.e. the rewards received at the following steps t+k (from t+1 to t+T+1). This cumulative reward differs between "paths", that is, between sequences of actions taken
![title](./images/cumulative_reward99.png "ShowMyImage")

 * They are weighted by the discount rate $\gamma \in [0,1]$
   * The discount rate models how much the agent cares about "immediate" or "close" rewards compared with "future" or "far" rewards. It answers the question: do you prefer 10 euros now or 100 euros tomorrow?
     * The larger $\gamma$ is, the more the agent cares about long-term reward
     * The smaller $\gamma$ is, the more the agent cares about short-term reward
![title](./images/discounted_cum_reward90.png "ShowMyImage")
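
The effect of $\gamma$ can be checked numerically. The sketch below assumes the usual convention that the k-th future reward is weighted by $\gamma^k$; the function name `discounted_return` and the reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total

# Hypothetical episode: no reward until the goal is reached on the 4th step.
rewards = [0.0, 0.0, 0.0, 1.0]

high = discounted_return(rewards, gamma=0.95)  # far-sighted agent
low = discounted_return(rewards, gamma=0.5)    # short-sighted agent
print(high, low)
```

The same delayed reward is worth much more to the agent with the larger $\gamma$, which is the behaviour described in the bullets above.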


The Q-table values are defined by:
![title](./images/q-table_ecuation.png "ShowMyImage")

During learning, each Q-table value is updated as follows:
 * Start from the current Q-table value (expected cumulative reward) for the current state and action
 * Add the learning rate multiplied by the following update term:
   * The reward received on entering the new state
   * Plus the discounted (by $\gamma$) maximum expected cumulative reward over the actions of the new state
   * Minus the current Q-table value for the current state and action

Update function of the Q-table:
![title](./images/update_function.png "ShowMyImage")

The update term can be seen as:
![title](./images/update_term.png "ShowMyImage")
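
A single update can be worked through by hand. The sketch below applies the same formula as the training loop, with the `learning_rate` and `gamma` values from this notebook; the helper `q_update`, the names `td_target` and `td_error`, and the two transitions are assumptions made for illustration.

```python
import numpy as np

learning_rate = 0.8
gamma = 0.95

def q_update(qtable, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the update term."""
    td_target = reward + gamma * np.max(qtable[new_state, :])
    td_error = td_target - qtable[state, action]
    qtable[state, action] += learning_rate * td_error
    return td_error

qtable = np.zeros((16, 4))

# Hypothetical transition: from state 14, action 2 reaches the goal (state 15) with reward 1.
# New value: 0 + 0.8 * (1 + 0.95 * 0 - 0) = 0.8
q_update(qtable, state=14, action=2, reward=1.0, new_state=15)

# Hypothetical transition: from state 13, action 2 reaches state 14 with no reward.
# New value: 0 + 0.8 * (0 + 0.95 * 0.8 - 0) = 0.608
q_update(qtable, state=13, action=2, reward=0.0, new_state=14)

print(qtable[14, 2], qtable[13, 2])
```

The second update shows how the value of the goal propagates backwards: state 13 receives no reward itself, but its estimate grows because the best action of the next state is already valued.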

### Training Algorithm

The training algorithm:

In [5]:
# List of rewards
rewards = []

# Looping through different episodes
for episode in range(total_episodes):
    
    # Resetting environment for new episode
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    # Looping for actions in episode
    for step in range(max_steps):
        exploration_exploitation_tradeoff = random.uniform(0,1)
        
        # Decision
        # Determining next action to take
        if exploration_exploitation_tradeoff > epsilon:
            # Exploitation: index of the best known action for the current state
            action = argmax(qtable[state,:])
        else:
            # Exploration: sample a random action from the action space
            action = env.action_space.sample()
            
        # Observation
        # Getting the new state or stepping into that state
        new_state, reward, dead, info = env.step(action)
            
        # Evaluation
        # Updating Q-table
        qtable[state, action] = qtable[state, action] + learning_rate * \
                (reward + gamma * max(qtable[new_state, :]) - qtable[state, action])
        
        # Cumulative reward for this episode
        total_rewards += reward
        
        # Updating the state
        state = new_state
        
        # If the episode has ended (hole or goal reached), stop stepping
        if dead:
            break
    
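    # Reduce epsilon (because we need less and less exploration).
    # The exponent uses episode + 1 so the first update already lowers epsilon;
    # the for loop itself takes care of advancing `episode`.
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * (episode + 1))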
    rewards.append(total_rewards)

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

Score over time: 0.446
[[1.17213572e-01 8.90896821e-02 2.05433254e-01 2.31589608e-02]
 [4.63079171e-03 1.96224811e-03 6.04218467e-03 1.81992427e-01]
 [8.26812370e-03 5.02196000e-03 5.02493860e-03 5.19320452e-02]
 [2.46089023e-03 1.91281106e-03 1.36980911e-04 3.08874972e-02]
 [2.70440625e-01 1.47026574e-02 2.46069322e-02 4.19618831e-06]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.38935535e-05 3.90605986e-07 2.28820979e-05 2.60552064e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.23263010e-02 1.59609442e-02 8.75174212e-03 4.68793140e-01]
 [5.70306382e-02 6.10070944e-01 2.43031911e-03 2.90563856e-03]
 [8.16996214e-01 1.83040391e-03 4.77686934e-03 2.03383127e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.97916073e-02 4.20735411e-03 5.28112899e-01 3.14808098e-02]
 [1.69696964e-01 9.07647768e-01 1.89174878e-01 2.40478258e-01]
 [0.00000000e+00 0.00000000e+00 

### Playing Frozen Lake

Using our Q-table to play Frozen Lake.
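
Playing with the Q-table amounts to following the greedy policy: in every state, take the action with the largest Q-value. As a minimal sketch, the policy can be read off a table directly. The action order (0 = Left, 1 = Down, 2 = Right, 3 = Up) is the one used by FrozenLake; the helper `greedy_policy` and the small `demo_qtable` are hypothetical.

```python
import numpy as np

# Action indices used by FrozenLake: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def greedy_policy(qtable):
    """Return the name of the highest-valued action for every state (row)."""
    return [ACTION_NAMES[int(a)] for a in np.argmax(qtable, axis=1)]

# Hypothetical Q-table with 3 states, for illustration only.
demo_qtable = np.array([
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.3, 0.0, 0.0, 0.0],
])

policy = greedy_policy(demo_qtable)
print(policy)
```

Note that a row of zeros (a hole or the goal, which are never updated) resolves to index 0, so it is reported as "Left" even though no action is ever taken from such a state.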

#### NOTE: To render the visualization while playing Frozen Lake, it is best to run the Python script version of this program

That version is the other program in this repo:

```sh
    $ python frozen-lake.py
```


In [6]:
# Playing
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False

    # Print steps
    print("EPISODE ", episode)
    
    for step in range(max_steps):

        env.render()
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            print("EPISODE ", episode)
            env.render()

            # Printing reports
            print(" ")
            print("Finish Report:")
            print("Steps:    " + str(step))
            print("Position: " + str(new_state))
            if new_state == 15:
                print("Success:  " + "GOAL REACHED! :)")
            else:
                print("Success:  " + "Game Over :(")
            print(" ")
            input("Press enter...")
            break

        state = new_state

        
env.close()

(In the frames below, the agent's current cell is shown in brackets.)

EPISODE  0

[S]FFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Right)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
 