# Introduction to OpenAI Gym

This notebook introduces the main API methods that users of the environment class (`gym.Env`) need to know:
        reset
        step
        render

### Taxi-v2 Environment
The Taxi Problem is from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" by Tom Dietterich (2000), published in the Journal of Artificial Intelligence Research.

We consider a 5 by 5 grid world inhabited by a Duckiebot (the taxi agent). The taxi problem is episodic: in each episode, a passenger is located at one of the 4 specially designated locations (R, Y, B, and G) and wishes to be transported to one of them. The Duckiebot starts in a given location and must go to the passenger's location, pick up the passenger, go to the destination location, and put down the passenger. The episode ends when the passenger is deposited at the destination location.

This tutorial is adapted from https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym

### Initialize Taxi-v2

In [6]:
import gym
import gym.spaces
import numpy as np

In [7]:
env = gym.make("Taxi-v2")

In [8]:
env.reset()

431

`env.reset()` resets the state of the environment and returns an initial observation (here, state 431).
When the end of an episode is reached, you are responsible for calling `reset()`
to reset the environment's state.

### observation_space: 
There are 500 states: 5 (grid x) x 5 (grid y) x 5 (passenger locations: R, Y, B, G, or in the taxi) x 4 (destinations: R, Y, B, or G).

In [9]:
env.observation_space.n

500
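The count of 500 follows from packing the four components into a single integer. A minimal sketch of such a mixed-radix encoding is shown below. The function names `encode_state` and `decode_state`, the constants, and the component order are assumptions made for this illustration rather than something stated in this tutorial.

```python
# Illustrative sketch: pack (row, col, passenger, destination) into one integer.
# The names and the component order are assumptions made for this example.
GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_LOCATIONS = 5   # R, Y, B, G, or in the taxi
DESTINATIONS = 4          # R, Y, B, or G


def encode_state(row, col, passenger, destination):
    """Combine the four components into a single index in [0, 500)."""
    index = row
    index = index * GRID_COLS + col
    index = index * PASSENGER_LOCATIONS + passenger
    index = index * DESTINATIONS + destination
    return index


def decode_state(index):
    """Invert encode_state and return (row, col, passenger, destination)."""
    index, destination = divmod(index, DESTINATIONS)
    index, passenger = divmod(index, PASSENGER_LOCATIONS)
    row, col = divmod(index, GRID_COLS)
    return row, col, passenger, destination


n_states = GRID_ROWS * GRID_COLS * PASSENGER_LOCATIONS * DESTINATIONS
print(n_states)                    # 500
print(encode_state(4, 4, 4, 3))    # 499, the largest index
print(decode_state(431))           # (4, 1, 2, 3)
```

Under this scheme every integer from 0 to 499 corresponds to exactly one combination of taxi position, passenger location, and destination, which is what allows a flat table to be indexed by state.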

`env.render()` renders the current state.

In the taxi problem, the colors mean:
* blue: passenger
* magenta: destination
* yellow: empty taxi
* green: full taxi

In [41]:
env.render()

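+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Colors are lost in plain text: R is magenta, B is blue, and the empty yellow taxi is the cell in the fourth row, fourth column.)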



Essentially, the empty taxi is supposed to:
* move toward the blue letter
* pick up the passenger (the taxi is now green)
* drive to the magenta letter
* drop off the passenger (the taxi is yellow again)

### action_space: 
There are 6 possible actions in the Taxi-v2 environment: down (0), up (1), right (2), left (3), pick-up (4), and drop-off (5).

In [10]:
env.action_space.n

6
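To make the action indices concrete, the sketch below maps each index to its name and applies the four movement actions to a grid position. It is a simplification written for this tutorial: the names `ACTION_NAMES`, `MOVES`, and `move` are hypothetical, row 0 is assumed to be the top row of the rendered grid, and the interior walls of the taxi world are ignored.

```python
# Illustrative sketch: action indices as listed above, plus a simplified
# movement rule that clamps to the 5 x 5 grid and ignores interior walls.
ACTION_NAMES = {0: "down", 1: "up", 2: "right", 3: "left", 4: "pick-up", 5: "drop-off"}
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}  # (row delta, col delta)
GRID_SIZE = 5


def move(row, col, action):
    """Return the taxi position after an action; non-movement actions leave it unchanged."""
    d_row, d_col = MOVES.get(action, (0, 0))
    new_row = min(max(row + d_row, 0), GRID_SIZE - 1)
    new_col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return new_row, new_col


print(ACTION_NAMES[1], move(3, 3, 1))  # up (2, 3)
print(ACTION_NAMES[1], move(0, 0, 1))  # up (0, 0), already at the top edge
print(ACTION_NAMES[4], move(3, 3, 4))  # pick-up (3, 3), position unchanged
```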

`env.step(action)` runs one timestep of the environment's dynamics.
It returns a tuple `(observation, reward, done, info)`:
* observation (object): agent's observation of the current environment
* reward (float) : amount of reward returned after previous action
* done (boolean): whether the episode has ended, in which case further step() calls will return undefined results
* info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)


In [18]:
env.step(1)

(16, -1, False, {'prob': 1.0})
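The `reset`/`step` contract described above can be exercised without Gym installed. The sketch below uses a hypothetical `CorridorEnv` class, written only for this example and not part of Gym, that mimics the four-element return value of `step`. Its reward values (-1 per step, 20 at the goal) are chosen for illustration.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment (not part of Gym): walk right to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the left end and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Action 0 moves left, action 1 moves right; positions are clamped to the corridor."""
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 20 if done else -1
        return self.position, reward, done, {}


random.seed(0)
env_sketch = CorridorEnv()
observation = env_sketch.reset()
total_reward, steps, done = 0, 0, False
while not done:
    action = random.choice([0, 1])                       # random policy
    observation, reward, done, info = env_sketch.step(action)
    total_reward += reward
    steps += 1

print(steps, total_reward)
```

The loop has the same shape as one written against a Gym environment: call `reset()` once, then call `step()` until `done` is true.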

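### Reward in Taxi-v2

Each step yields a reward of -1. Successfully dropping off the passenger at the destination yields +20, and an illegal pick-up or drop-off yields -10.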


In [14]:
env.render()



+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
(Colors are lost in plain text: the empty yellow taxi is on R, Y is magenta, and B is blue.)


In [15]:
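state = env.reset()
counter = 0
reward = None
# Take random actions until a successful drop-off returns the reward of 20
while reward != 20:
    state, reward, done, info = env.step(env.action_space.sample())
    counter += 1

# Number of random steps needed to deliver the passenger
print(counter)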


3142


In [16]:
Q = np.zeros([env.observation_space.n, env.action_space.n])  # one value per (state, action) pair
G = 0  # total reward accumulated in an episode
alpha = 0.618  # learning rate

In [17]:
for episode in range(1, 1001):
    done = False
    G, reward = 0, 0
    state = env.reset()
    while not done:
        action = np.argmax(Q[state])  # pick the greedy action for the current state
        state2, reward, done, info = env.step(action)  # apply it and observe the outcome
        # move Q[state, action] toward the one-step target (no discount factor)
        Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])
        G += reward
        state = state2
    if episode % 50 == 0:
        print('Episode {} Total Reward: {}'.format(episode, G))


Episode 50 Total Reward: -127
Episode 100 Total Reward: -35
Episode 150 Total Reward: 8
Episode 200 Total Reward: -60
Episode 250 Total Reward: 9
Episode 300 Total Reward: 5
Episode 350 Total Reward: 9
Episode 400 Total Reward: 11
Episode 450 Total Reward: 13
Episode 500 Total Reward: 11
Episode 550 Total Reward: 4
Episode 600 Total Reward: 7
Episode 650 Total Reward: 6
Episode 700 Total Reward: 9
Episode 750 Total Reward: 9
Episode 800 Total Reward: 9
Episode 850 Total Reward: 11
Episode 900 Total Reward: 9
Episode 950 Total Reward: 11
Episode 1000 Total Reward: 7
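The update rule in the training loop can be traced by hand. The sketch below applies it twice to a single table entry using plain Python, without Gym or NumPy. The helper name `q_update` and the reward sequence are made up for illustration. Only the formula and `alpha = 0.618` come from the loop above.

```python
# Worked example of the tabular update used in the training loop.
# The helper name and the reward sequence are assumptions for illustration.
alpha = 0.618


def q_update(q_sa, reward, max_q_next, alpha):
    """One update: move q_sa toward reward + max_q_next (no discount factor)."""
    return q_sa + alpha * (reward + max_q_next - q_sa)


q = 0.0
q = q_update(q, reward=-1, max_q_next=0.0, alpha=alpha)
print(round(q, 6))   # -0.618
q = q_update(q, reward=-1, max_q_next=0.0, alpha=alpha)
print(round(q, 6))   # -0.854076
```

Each update closes 61.8% of the gap between the current estimate and the target, so the entry approaches -1 here. Because every step costs -1 and the table starts at zero, actions that have been tried quickly look worse than untried ones, which is presumably why the purely greedy `np.argmax` choice still ends up exploring.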
