# OpenAI gym's Taxi-v3 task

For this coding exercise, we will use OpenAI Gym's [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment to design an algorithm that teaches a taxi agent to navigate a small gridworld. There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for each illegal pick-up or drop-off action.

The Taxi Problem   
from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" by Tom Dietterich

Description:  
There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

Observations:  
There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.
Note that there are 400 states that can actually be reached during an episode. The missing states correspond to situations in which the passenger is at the same location as their destination, as this typically signals the end of an episode.
Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination.
This gives a total of 404 reachable discrete states.
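
The counts above can be checked by brute-force enumeration. This is only a sketch: the `LOCS` coordinates are an assumption read off the rendered map (R top-left, G top-right, Y bottom-left, B in the bottom row, fourth column), and the variable names are illustrative rather than part of the Gym API.

```python
from itertools import product

# Assumed (row, col) of the four landmarks, indexed 0: R, 1: G, 2: Y, 3: B.
LOCS = [(0, 0), (0, 4), (4, 0), (4, 3)]

# Every (taxi_row, taxi_col, passenger_location, destination) combination.
# passenger_location 4 means "in taxi".
all_states = list(product(range(5), range(5), range(5), range(4)))

# States seen during an episode: the passenger is not already at the destination.
in_episode = [s for s in all_states if s[2] != s[3]]

# States seen right after a successful drop-off: the passenger is at the
# destination and the taxi is parked on that same landmark.
terminal = [s for s in all_states
            if s[2] == s[3] and (s[0], s[1]) == LOCS[s[3]]]

print(len(all_states), len(in_episode), len(terminal))  # 500 400 4
print(len(in_episode) + len(terminal))                  # 404
```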

Passenger locations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
- 4: in taxi

Destinations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)

Actions:  
There are 6 discrete deterministic actions:
- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger

Rewards:  
There is a default per-step reward of -1, except for delivering the passenger, which is +20, or executing "pickup" and "drop-off" actions illegally, which is -10.

Rendering:  
- blue: passenger
- magenta: destination
- yellow: empty taxi
- green: full taxi
- other letters (R, G, Y and B): locations for passengers and destinations

State space is represented by:  
(taxi_row, taxi_col, passenger_location, destination)
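
To make the link between this tuple and the single integer in `Discrete(500)` concrete, here is a minimal sketch of a mixed-radix encoding. The function names `encode_state` and `decode_state` are hypothetical, and the digit order (row, then column, then passenger, then destination) is an assumption about how the environment packs the tuple.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the 4-tuple into one integer in [0, 500), assuming bases 5, 5, 5, 4."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location
    index = index * 4 + destination
    return index


def decode_state(index):
    """Invert encode_state by peeling off digits from the least significant end."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(encode_state(3, 3, 2, 3))  # 371
print(decode_state(371))         # (3, 3, 2, 3)
```

Under this assumption, state 371 reads as: taxi at row 3, column 3, passenger waiting at Y(ellow), destination B(lue).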

### Import the Necessary Packages

In [1]:
import gym
import numpy as np
from collections import defaultdict, deque
import sys
import math
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output
from time import sleep

### Specify the Environment, and explore the state and action spaces
Create an instance of the [Taxi-v3](https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py) environment, which has discrete state and action spaces.

In [2]:
# Create the environment and set random seed
env = gym.make('Taxi-v3')
env.seed(505)
print('State space:', env.observation_space)
print('Action space:', env.action_space)

State space: Discrete(500)
Action space: Discrete(6)


It might be helpful to get some experience with the output that is returned as the agent interacts with the environment.

In [9]:
def print_frames(frames):
    """
    Replay an episode by printing each rendered frame in turn.

    Parameters
    ----------
    frames: array_like
        A sequence of dicts with keys 'frame', 'action', 'state' and
        'reward', one per timestep of agent-environment interaction.

    Returns
    -------
    None. The frames are printed as a text animation, one per second.
    """
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"Action: {frame['action']}")
        print(f"State: {frame['state']}")
        print(f"Reward: {frame['reward']}")
        sleep(1)

Play one episode of **Taxi-v3** with a random policy. The episode is cut off after 200 steps if the passenger has not been delivered by then.

In [6]:
with env:
    # begin the episode
    state = env.reset()
    # initialize the sampled reward & frames
    samp_reward = 0
    frames = []
    while True:
        # agent selects an action
        action = env.action_space.sample()
        # agent performs the selected action
        state, reward, done, _ = env.step(action)
        # Put each rendered frame into dict for animation
        # update the sampled reward & frames
        samp_reward += reward
        frames.append({
            'frame': env.render(mode='ansi'),
            'action': action,
            'state': state,
            'reward': reward
        })
        if done:
            break

print_frames(frames)
print(f"Final score: {samp_reward}")

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)

Timestep: 200
Action: 3
State: 371
Reward: -1
Final score: -713
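
The final score in the sample output can be sanity-checked against the reward rules. The sketch below assumes the episode ran for all 200 steps without a delivery, so every step earned either -1 (a legal action) or -10 (an illegal pick-up or drop-off). The helper name `count_illegal_actions` is hypothetical.

```python
STEP_REWARD = -1       # default per-step reward
ILLEGAL_REWARD = -10   # illegal "pickup" or "drop-off"


def count_illegal_actions(timesteps, final_score):
    """Solve final_score = STEP_REWARD * (timesteps - k) + ILLEGAL_REWARD * k for k."""
    k, remainder = divmod(final_score - STEP_REWARD * timesteps,
                          ILLEGAL_REWARD - STEP_REWARD)
    if remainder != 0 or not 0 <= k <= timesteps:
        raise ValueError("score is inconsistent with an undelivered episode")
    return k


print(count_illegal_actions(200, -713))  # 57
```

So a score of -713 over 200 steps corresponds to 57 illegal actions and 143 ordinary moves: 143 * (-1) + 57 * (-10) = -713.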


In [None]:
class Agent:
    
    def __init__(self, nA=6):
        """ Initialize agent.
        
        Params
        ======
        - nA: number of actions available to the agent
        """
        self.nA = nA
        self.Q = defaultdict(lambda: np.zeros(self.nA))
        
    def select_action(self, state):
        """ Given the state, select an action.
        
        Params
        ======
        - state: the current state of the environment
        
        Returns
        =======
        - action: an integer, compatible with the task's action space
        """       
        return np.random.choice(self.nA)
    
    def step(self, state, action, reward, next_state, done):
        """ Update the agent's knowledge, using the most recently sampled tuple.
        
        Params
        ======
        - state: the previous state of the environment
        - action: the agent's previous choice of action
        - reward: last reward received
        - next_state: the current state of the environment
        - done: whether the episode is complete (True or False)
        """
        # placeholder: only counts visits to (state, action); replace with a real value update
        self.Q[state][action] += 1

In [None]:
def interact(env, agent, num_episodes=20000, window=100):
    """ Monitor agent's performance.
    
    Params
    ======
    - env: instance of OpenAI Gym's Taxi-v3 environment
    - agent: instance of class Agent (see Agent for details)
    - num_episodes: number of episodes of agent-environment interaction
    - window: number of episodes to consider when calculating average rewards
    
    Returns
    =======
    - avg_rewards: deque containing average rewards
    - best_avg_reward: largest value in the avg_rewards deque
    """
    # initialize average rewards
    avg_rewards = deque(maxlen=num_episodes)
    # initialize best average reward
    best_avg_reward = -math.inf
    # initialize monitor for most recent rewards
    samp_rewards = deque(maxlen=window)
    # for each episode
    for i_episode in range(1, num_episodes + 1):
        # begin the episode
        state = env.reset()
        # initialize the sampled reward
        samp_reward = 0
        while True:
            # agent selects an action
            action = agent.select_action(state)
            # agent performs the selected action
            next_state, reward, done, _ = env.step(action)
            # agent performs internal updates based on sampled experience
            agent.step(state, action, reward, next_state, done)
            # update the sampled reward
            samp_reward += reward
            # update the state (s <- s') to next time step
            state = next_state
            if done:
                # save final sampled reward
                samp_rewards.append(samp_reward)
                break
        if (i_episode >= window):
            # get average reward from the last `window` episodes
            avg_reward = np.mean(samp_rewards)
            # append to deque
            avg_rewards.append(avg_reward)
            # update best average reward
            if avg_reward > best_avg_reward:
                best_avg_reward = avg_reward
        # monitor progress
        print('\rEpisode {}/{} || Best average reward {}'.format(
            i_episode, num_episodes, best_avg_reward), end="")
        sys.stdout.flush()
        # check if task is solved (according to OpenAI Gym)
        if best_avg_reward >= 9.7:
            print('\nEnvironment solved in {} episodes.'.format(
                i_episode), end="")
            break
        if i_episode == num_episodes:
            print('\n')
        if i_episode % 5000 == 0:
            env.render()
            print('Reward = {}'.format(reward))
    return avg_rewards, best_avg_reward
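
The monitoring logic in `interact` relies on a bounded `deque` to keep a moving average over the most recent episodes. The toy sketch below isolates that idea with a window of 3 and made-up episode returns. The numbers are illustrative only, not results from the environment.

```python
from collections import deque

window = 3
recent_returns = deque(maxlen=window)  # silently drops the oldest entry when full
moving_averages = []

# Hypothetical per-episode returns, for illustration only.
for episode_return in [-200, -100, -50, 10, 8]:
    recent_returns.append(episode_return)
    # Only start averaging once a full window of episodes is available.
    if len(recent_returns) == window:
        moving_averages.append(sum(recent_returns) / window)

best_average = max(moving_averages)
print(len(moving_averages))  # 3
```

Because the deque has `maxlen=window`, no explicit bookkeeping is needed to discard old episodes, which is why `interact` can simply append each episode's return.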

In [None]:
agent = Agent()
avg_rewards, best_avg_reward = interact(env, agent)
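
The `Agent` above acts at random and never learns. One possible way to fill in `select_action` and `step` is tabular Q-learning with an epsilon-greedy policy, sketched below. This is an assumption about how the exercise could be approached, not the notebook's prescribed solution: the class name `QLearningAgent` and the hyperparameters `alpha`, `gamma`, `epsilon` and `seed` are all hypothetical and untuned.

```python
from collections import defaultdict

import numpy as np


class QLearningAgent:
    """Hypothetical drop-in replacement for Agent using tabular Q-learning."""

    def __init__(self, nA=6, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
        self.nA = nA
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = np.random.default_rng(seed)
        self.Q = defaultdict(lambda: np.zeros(self.nA))

    def select_action(self, state):
        # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.nA))
        return int(np.argmax(self.Q[state]))

    def step(self, state, action, reward, next_state, done):
        # Q-learning target: bootstrap from the best next action unless terminal.
        target = reward
        if not done:
            target += self.gamma * np.max(self.Q[next_state])
        self.Q[state][action] += self.alpha * (target - self.Q[state][action])
```

Since it exposes the same `select_action` and `step` methods, an instance could be passed to `interact` in place of `Agent()`. Whether it reaches the 9.7 threshold depends on the hyperparameters chosen.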