# Q-Learning

Based on: https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

This notebook uses Taxi-v3; the Taxi-v2 environment used in the original code is deprecated.

For more information, check gym's site:

https://gym.openai.com/envs/Taxi-v3/

The Taxi Problem <br />
from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"<br />
by Tom Dietterich

In [None]:
import gym

import numpy as np

from IPython.display import clear_output
from time import sleep

import random

In [None]:
env = gym.make("Taxi-v3").env

In [None]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

## Understanding the environment

Rendering:
* The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
* The pipe ("|") represents a wall which the taxi cannot cross.
* R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

We have an action space of size 6 and a state space of size 500.

The actions are:

* 0 = south
* 1 = north
* 2 = east
* 3 = west
* 4 = pickup
* 5 = dropoff

There are 500 discrete states (25 × 5 × 4 = 500), since there are
* 25 taxi positions,
* 5 possible locations of the passenger (including the case when the passenger is in the taxi), and
* 4 destination locations.

Passenger locations:
* 0: R(ed)
* 1: G(reen)
* 2: Y(ellow)
* 3: B(lue)
* 4: in taxi

Destinations:
* 0: R(ed)
* 1: G(reen)
* 2: Y(ellow)
* 3: B(lue)

Each state is represented by the tuple:
* (taxi_row, taxi_col, passenger_location, destination)
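
To make the link between this tuple and the single state number concrete, here is a minimal sketch of a mixed-radix encoding. It assumes the components are packed in the order listed above, which is how the environment's own `encode`/`decode` methods are understood to work; the names `encode_state` and `decode_state` are hypothetical helpers, not part of gym.

```python
# Sketch of a mixed-radix state encoding, assuming a 5x5 grid,
# 5 passenger locations and 4 destinations (hypothetical helper names).

def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into a single integer in [0, 500)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_location
    state = state * 4 + destination
    return state


def decode_state(state):
    """Unpack a state index into (taxi_row, taxi_col, passenger_location, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


example = encode_state(3, 1, 2, 0)
print(example)                # 328
print(decode_state(example))  # (3, 1, 2, 0)
```

Under this assumption, state 328 (the illustration state used later in this notebook) corresponds to the taxi at row 3, column 1, the passenger waiting at Y, and the destination at R.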

Rewards:
* There is a default per-step reward of -1,
* except for delivering the passenger, which is +20,
* or executing "pickup" and "drop-off" actions illegally, which is -10.

The optimal action for each state is the action that has the highest cumulative long-term reward.
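
As a rough illustration of what "cumulative long-term reward" means under the reward rules above, the sketch below adds up the rewards of two made-up episodes. The function name `discounted_return` and both episodes are hypothetical; `gamma` is an optional discount factor, with 1.0 meaning a plain sum.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical episode: 12 ordinary moves (-1 each), then a successful drop-off (+20).
clean_episode = [-1] * 12 + [20]
# Same route, but with one illegal pickup attempt (-10) added at the start.
sloppy_episode = [-10] + clean_episode

print(discounted_return(clean_episode))   # 8.0
print(discounted_return(sloppy_episode))  # -2.0
```

A single illegal action costs as much as ten ordinary steps, which is why a good policy avoids them even when the route itself is short.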

In [None]:
# What is our current state? (row,column,passenger index,destination index)

env.render()
print("State:",env.s,list(env.decode(env.s)))

## The reward table

The reward depends on the current state and the action taken. The environment stores this in a reward table called `P`, a dictionary keyed by state in which each entry is itself a dictionary keyed by action; conceptually it is a states × actions matrix.

For a given state, the inner dictionary has the structure `{action: [(probability, nextstate, reward, done)]}`.
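
To show how such an entry can be read, here is a small sketch that walks over a hand-written dictionary of the same shape. The values are illustrative for a single state and are not read from the environment; `ACTION_NAMES` and `describe` are hypothetical helpers.

```python
# Hand-written example in the same shape as one entry of `P`
# (illustrative values for a single state, not read from the environment).
transitions = {
    0: [(1.0, 428, -1, False)],   # south
    1: [(1.0, 228, -1, False)],   # north
    2: [(1.0, 348, -1, False)],   # east
    3: [(1.0, 328, -1, False)],   # west (blocked by a wall, so the state is unchanged)
    4: [(1.0, 328, -10, False)],  # illegal pickup
    5: [(1.0, 328, -10, False)],  # illegal dropoff
}

ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def describe(transitions):
    """Return one human-readable line per (action, outcome) pair."""
    lines = []
    for action, outcomes in transitions.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, "
                f"next_state={next_state}, reward={reward}, done={done}"
            )
    return lines


for line in describe(transitions):
    print(line)
```

Each action maps to a list because an environment may have several possible outcomes per action; in this example every list holds a single outcome with probability 1.0.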

Since every state has an entry in `P`, we can look up the transitions and default rewards for our current state:

In [None]:
env.P[env.s]

## Solving with random policies

As a baseline, we can let the taxi navigate by choosing actions at random, without any learning. The environment uses the `P` table to determine the next state and reward for each action.

We'll loop until the passenger is delivered to the destination (one episode), that is, until the environment returns `done = True` together with the reward of +20. The `env.action_space.sample()` method selects one action at random from the set of all possible actions.

Let's see what happens:

In [None]:
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

In [None]:
def print_frames(frames):
    for i, frame in enumerate(frames[0:100]):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.04)
        
#print_frames(frames)

## Q-Learning
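
The training loop in this section applies the standard Q-learning update after every step. Written with the notebook's hyperparameter names ($\alpha$ for `alpha`, $\gamma$ for `gamma`), and assuming $s'$ denotes the state reached after taking action $a$ in state $s$ with reward $r$, the update is:

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$

For example, on the very first training step the table is all zeros, so with $\alpha = 0.1$, $\gamma = 0.6$ and an ordinary step reward $r = -1$ the updated entry is $(1 - 0.1) \cdot 0 + 0.1 \cdot (-1 + 0.6 \cdot 0) = -0.1$.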

In [None]:
## Q-Table of states x actions
q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [None]:
# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

## Episode iteration
for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        ## Epsilon-Greedy
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
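    # Record per-episode metrics for later plotting
    all_epochs.append(epochs)
    all_penalties.append(penalties)

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
# These counts come from the final training episode only
print("Last episode penalties:", penalties, "timesteps:", epochs)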

Now that the Q-table has been established over 100,000 episodes, let's see what the Q-values are at our illustration's state:

In [None]:
env.s = 328 # force state
env.render()

print(q_table[328])
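
The printed row holds one Q-value per action, in the action order listed earlier. As a minimal sketch of how to read such a row, the helper below returns the greedy action and its name. The row values here are made up for illustration, and `ACTION_NAMES` and `greedy_action` are hypothetical names.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def greedy_action(q_row):
    """Return (index, name) of the highest-valued action in one Q-table row."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return best, ACTION_NAMES[best]


# Hypothetical Q-values for one state, for illustration only.
example_row = [-2.3, -2.0, -2.4, -2.2, -10.5, -9.8]
print(greedy_action(example_row))  # (1, 'north')
```

The same function should also accept a row of the trained table, such as `q_table[328]`, since it only relies on indexing and `len`.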


## Let's try the agent

In [None]:
total_epochs, total_penalties = 0, 0
episodes = 100

for i in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        # print("Action:",action)
        state, reward, done, info = env.step(action)
        # print("Reward: {0}  done: {1}  info: {2}".format(reward,done,info))
        
        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs
    
    # Report progress every 10 episodes (i is zero-based here)
    if (i + 1) % 10 == 0:
        clear_output(wait=True)
        print(f"Episode: {i + 1}")

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

In [None]:
## TODO

## Let's plot the agent in action