# Q-Learning with OpenAI Gym Taxi-v3

This notebook is based on the Deep Reinforcement Learning course by Thomas Simonini. You can find the syllabus here: https://simoninithomas.github.io/Deep_reinforcement_learning_Course/

In this notebook I implement an agent that plays the OpenAI Gym environment Taxi-v3. The goal of the game is to pick up the passenger at one location and drop them off at another. There are 4 different locations, each marked by a different letter (R, G, Y and B). The point system for this environment works as follows:

- You receive +20 points for a successful drop-off.
- You lose 1 point for every timestep taken.
- You lose 10 points for every illegal pick-up or drop-off action (for example, trying to drop the passenger off somewhere that is not one of the marked locations).
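
To make the scoring concrete, the rules above can be sketched as a small function. The name `episode_return` and its arguments are hypothetical and not part of Gym. The sketch assumes that each step yields exactly one reward: +20 for the final successful drop-off, -10 for an illegal pick-up or drop-off, and -1 otherwise.

```python
def episode_return(total_steps, illegal_actions=0):
    """Total reward of one successful episode (hypothetical helper).

    Assumes the last step is the successful drop-off (+20), every illegal
    pick-up/drop-off costs 10 points, and every other step costs 1 point.
    """
    if total_steps < 1 or not 0 <= illegal_actions <= total_steps - 1:
        raise ValueError("inconsistent step counts")
    ordinary_steps = total_steps - 1 - illegal_actions
    return 20 - 10 * illegal_actions - ordinary_steps


print(episode_return(13))     # 12 ordinary steps + drop-off: 20 - 12 = 8
print(episode_return(13, 1))  # one illegal action: 20 - 10 - 11 = -1
```

Under these assumptions, a single illegal action costs as much as ten ordinary moves, which is why a trained agent should avoid them entirely.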

The libraries needed are:
- NumPy to create the Q-table
- Gym for the Taxi environment
- Random to generate random numbers

In [1]:
import numpy as np
import gym
import random

In [2]:
# Creating Taxi environment and rendering an example game state
env = gym.make("Taxi-v3")
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



In [3]:
# Obtaining number of possible actions (number of columns of the Q-table) and
# number of possible states (number of rows)
action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)

Action size  6
State size  500
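
The 500 states can be understood as a product of components: as described in the Gym documentation for Taxi, there are 25 taxi positions on the 5x5 grid, 5 passenger locations (the 4 marked locations plus "in the taxi") and 4 destinations. The sketch below is a standalone re-implementation of that encoding, written here for illustration; the function names `encode` and `decode` and the component order are assumptions meant to mirror the environment, not calls into Gym itself.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row                        # 5 possible rows
    state = state * 5 + taxi_col            # 5 possible columns
    state = state * 5 + passenger_location  # 4 marked locations + in taxi
    state = state * 4 + destination         # 4 marked locations
    return state


def decode(state):
    """Inverse of encode: recover the four components from the integer."""
    destination = state % 4
    state //= 4
    passenger_location = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_location, destination


print(5 * 5 * 5 * 4)         # 500 distinct states
print(encode(3, 1, 2, 0))    # 328
print(decode(328))           # (3, 1, 2, 0)
```

Each row of the Q-table built in the next cell corresponds to one such integer.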


In [4]:
# Creating the Q-table, initialised to zeros (one row per state, one column per action)
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


In [5]:
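# Setting hyperparameters
total_episodes = 50000       # Total training episodes
total_test_episodes = 1000   # Total test episodes
max_steps = 99               # Maximum number of steps per episode

learning_rate = 0.7          # Learning rate (alpha)
gamma = 0.618                # Discount factor for future rewards

# Setting exploration parameters
epsilon = 1.0                # Current exploration rate
max_epsilon = 1.0            # Exploration probability at the start
min_epsilon = 0.01           # Minimum exploration probability
decay_rate = 0.01            # Exponential decay rate of epsilon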
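
To get a feeling for the exploration parameters, here is a minimal sketch that evaluates the decay formula used at the end of each training episode, `min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)`. The helper name `epsilon_at` is made up for this illustration.

```python
import numpy as np

max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01


def epsilon_at(episode):
    """Exploration rate after the given episode (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


# Print the exploration rate at a few points of the training run
for episode in (0, 100, 500, 1000):
    print(episode, round(float(epsilon_at(episode)), 4))
```

With these values epsilon starts at 1.0 (purely random actions) and falls below 0.02 by episode 500, so the large majority of the 50000 training episodes are spent almost entirely exploiting the Q-table.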

In [6]:
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    
    # Executing up to max_steps (99) actions per episode
    for step in range(max_steps):
        # Drawing a random number between 0 and 1
        exp_exp_tradeoff = random.uniform(0,1)
        # If tradeoff > epsilon the action will be based on the biggest Q-value for this state (exploitation)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
        # else the action will be random (exploring the environment)
        else:
            action = env.action_space.sample()
        
        # Passing the action into the step method of the environment
        new_state, reward, done, info = env.step(action)
        
        # Updating the Q-table with new value based on the results of the last action
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma *
                                    np.max(qtable[new_state, :]) - qtable[state, action])
        
        # Overwriting old state to use new_state of the environment in next loop
        state = new_state
        
        # Break the loop if the agent finishes the game
        if done:
            break
            
    # Updating epsilon to decrease the ratio of exploration/exploitation over time
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
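
The update line in the loop above is the standard Q-learning rule, Q(s, a) <- Q(s, a) + learning_rate * (reward + gamma * max Q(s', a') - Q(s, a)). As a worked example, the sketch below applies a single update to a toy 2x2 table using the notebook's `learning_rate` and `gamma`. The table contents and the transition are invented purely for illustration.

```python
import numpy as np

learning_rate = 0.7
gamma = 0.618

# Toy Q-table with 2 states and 2 actions (the notebook's table is 500 x 6)
qtable = np.zeros((2, 2))
qtable[1] = [2.0, 5.0]  # pretend state 1 has already been learned

# One invented transition: action 1 in state 0 costs 1 point and leads to state 1
state, action, reward, new_state = 0, 1, -1, 1

td_target = reward + gamma * np.max(qtable[new_state, :])  # -1 + 0.618 * 5 = 2.09
td_error = td_target - qtable[state, action]               # 2.09 - 0 = 2.09
qtable[state, action] += learning_rate * td_error          # 0 + 0.7 * 2.09 = 1.463

print(qtable[state, action])
```

A learning rate of 0.7 moves the old value 70% of the way towards the target on every update, so estimates change quickly but stay noisy while rewards are still being discovered.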

In [7]:
env.reset()
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        total_rewards += reward
        
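        if done:
            break
        state = new_state

    # Record the return of every episode, including any that ran out of steps,
    # so that the average below divides by the number of episodes actually counted
    rewards.append(total_rewards)
env.close()
print("Score over time: " + str(sum(rewards) / total_test_episodes))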

Score over time: 7.89


In [8]:
import qlearning_functions

In [9]:
trained_qtable = qlearning_functions.agent_training(env, qtable)

In [10]:
qlearning_functions.agent_testing(env, trained_qtable)

'7.903'
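
The source of `qlearning_functions` is not shown in this notebook. The following is only a guess at what such a module could look like, assuming it wraps the training and test loops from the cells above with the same hyperparameters as defaults. The signatures beyond `(env, qtable)`, the keyword arguments and the string return value are assumptions inferred from the calls and output shown here. It also assumes the older Gym API used throughout this notebook, where `env.reset()` returns the state and `env.step()` returns four values.

```python
import random

import numpy as np


def agent_training(env, qtable, total_episodes=50000, max_steps=99,
                   learning_rate=0.7, gamma=0.618,
                   max_epsilon=1.0, min_epsilon=0.01, decay_rate=0.01):
    """Train the Q-table in place with epsilon-greedy Q-learning (sketch)."""
    epsilon = max_epsilon
    for episode in range(total_episodes):
        state = env.reset()
        for _ in range(max_steps):
            # Exploit the best known action, or explore with probability epsilon
            if random.uniform(0, 1) > epsilon:
                action = int(np.argmax(qtable[state, :]))
            else:
                action = env.action_space.sample()
            new_state, reward, done, info = env.step(action)
            # Q-learning update towards reward + discounted best next value
            qtable[state, action] += learning_rate * (
                reward + gamma * np.max(qtable[new_state, :])
                - qtable[state, action])
            state = new_state
            if done:
                break
        # Decay the exploration rate once per episode
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(
            -decay_rate * episode)
    return qtable


def agent_testing(env, qtable, total_test_episodes=1000, max_steps=99):
    """Run greedy episodes and return the mean return as a string (sketch)."""
    rewards = []
    for _ in range(total_test_episodes):
        state = env.reset()
        total_rewards = 0
        for _ in range(max_steps):
            action = int(np.argmax(qtable[state, :]))
            state, reward, done, info = env.step(action)
            total_rewards += reward
            if done:
                break
        # Unfinished episodes are counted too, so the mean is not biased
        rewards.append(total_rewards)
    return str(sum(rewards) / total_test_episodes)
```

Moving the loops into functions like these makes it easy to retrain with different hyperparameters, for example `agent_training(env, qtable, total_episodes=5000)`, without copying the cells.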