# Introduction to OpenAI Gym

The main API methods that users of the `gym.Env` class need to know are:
* `reset`
* `step`
* `render`

### Taxi-v2 Environment
The Taxi Problem is from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" by Tom Dietterich (2000), published in the Journal of Artificial Intelligence Research.

We consider a 5 by 5 grid world inhabited by a Duckiebot (the taxi agent). The taxi problem is episodic, and in each episode a passenger is located at one of the 4 specially designated locations (R, Y, B, and G). The Duckiebot starts at a given location and must go to the passenger's location, pick up the passenger, go to the destination location, and put down the passenger. The episode ends when the passenger is deposited at the destination, which is also one of the 4 designated locations.

Adapted from https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym


### Initialize Taxi-v2

In [1]:
import gym
import gym.spaces
import numpy as np

In [2]:
env = gym.make("Taxi-v2")

`reset()` resets the state of the environment and returns an initial observation.
When the end of an episode is reached, you are responsible for calling `reset()`
to reset the environment's state.

In [15]:
env.reset()

189

### observation_space: 
There are 500 states: 5 (taxi rows) x 5 (taxi columns) x 5 (passenger locations: R, Y, B, G, or in the taxi) x 4 (destinations: R, Y, B, or G)

In [16]:
env.observation_space.n

500

Show current state

In [17]:
env.env.s

189

In [18]:
for p in env.env.decode(env.env.s): print(p)

1
4
2
1


The state decodes into four values, in this order:
* taxi row
* taxi column
* passenger location index (shown in blue, or inside the taxi): 0 = R at (0,0), 1 = G at (0,4), 2 = Y at (4,0), 3 = B at (4,3), 4 = in the taxi
* destination index (shown in magenta): 0 = R at (0,0), 1 = G at (0,4), 2 = Y at (4,0), 3 = B at (4,3)
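The mapping between a state number and these four values can be sketched in plain Python. This is a minimal sketch that assumes the state is a mixed-radix number with bases 5, 5, 5 and 4, which is consistent with the decoded values shown in this notebook; the function names `encode_state` and `decode_state` are hypothetical and are not part of gym.

```python
def encode_state(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row
    state = state * 5 + taxi_col   # 5 columns
    state = state * 5 + pass_idx   # 5 passenger locations (4 = in the taxi)
    state = state * 4 + dest_idx   # 4 destinations
    return state


def decode_state(state):
    """Unpack a state integer into (taxi_row, taxi_col, pass_idx, dest_idx)."""
    state, dest_idx = divmod(state, 4)
    state, pass_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx


print(decode_state(189))         # (1, 4, 2, 1)
print(encode_state(1, 4, 2, 1))  # 189
```

State 189 therefore means: taxi at row 1, column 4, passenger waiting at Y (index 2), destination G (index 1).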

In [20]:
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(G is shown in magenta, Y in blue, and the yellow taxi is at row 1, column 4.)



`render()` displays the current state.

In the taxi problem, the colors mean:
* blue: passenger
* magenta: destination
* yellow: empty taxi
* green: full taxi

Let's set the state to 114.

In [29]:
env.env.s = 114
for p in env.env.decode(env.env.s): print(p)
env.render()

1
0
3
2
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
(Y is shown in magenta, B in blue, and the yellow taxi is at row 1, column 0.)


### action_space: 
There are 6 possible actions in the Taxi-v2 environment:
down (0), up (1), right (2), left (3), pick-up (4), and drop-off (5).

In [30]:
env.action_space.n

6
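To make the action indices concrete, here is a minimal sketch of how the four movement actions could change the taxi's (row, column) position. The names `MOVES` and `move` are hypothetical, the row and column deltas assume that row 0 is the top of the grid, and the sketch deliberately ignores the walls (`|`) that the real environment enforces.

```python
# Action indices as listed above; the (row, col) deltas are an assumption
# made for illustration (row 0 is the top row of the grid).
MOVES = {
    0: (1, 0),    # down
    1: (-1, 0),   # up
    2: (0, 1),    # right
    3: (0, -1),   # left
}


def move(position, action, size=5):
    """Return the new (row, col), clamped to the grid; walls are ignored."""
    row, col = position
    d_row, d_col = MOVES.get(action, (0, 0))  # pick-up and drop-off do not move
    new_row = min(max(row + d_row, 0), size - 1)
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col


print(move((1, 0), 1))  # up from row 1, column 0
print(move((0, 0), 2))  # right from row 0, column 0
```

Moving off the edge of the grid leaves the position unchanged in this sketch, for example `move((0, 0), 1)` stays at `(0, 0)`.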

Let's move up by calling `step(1)`; 1 is the index of the up action.

In [36]:
state, reward, done, info = env.step(1)
env.render()
reward

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
(Y is shown in magenta, B in blue, and the yellow taxi is at row 0, column 0, on top of R.)


-1

`step()` runs one timestep of the environment's dynamics.
It returns a tuple (observation, reward, done, info):
* observation (object): the agent's observation of the current environment
* reward (float): amount of reward returned after the previous action
* done (boolean): whether the episode has ended, in which case further `step()` calls will return undefined results
* info (dict): auxiliary diagnostic information (helpful for debugging, and sometimes learning)
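The same `reset`/`step`/`render` contract can be illustrated without gym. The following is a minimal sketch of a toy environment; the class `CorridorEnv`, its reward values, and its single-row layout are hypothetical and only mimic the shape of the API described above.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: the agent starts at cell 0, the goal is the last cell."""

    def __init__(self, length=4):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Action 1 moves right; any other action moves left (clamped at cell 0)."""
        if action == 1:
            self.position = min(self.position + 1, self.length - 1)
        else:
            self.position = max(self.position - 1, 0)
        done = self.position == self.length - 1
        reward = 20 if done else -1  # assumed values, chosen to resemble the taxi problem
        info = {}
        return self.position, reward, done, info

    def render(self):
        """Print the corridor, marking the agent's cell with 'T'."""
        print("".join("T" if i == self.position else "." for i in range(self.length)))


toy_env = CorridorEnv()
observation = toy_env.reset()
done = False
total_reward = 0
while not done:
    observation, reward, done, info = toy_env.step(1)  # always move right
    total_reward += reward
toy_env.render()
print(total_reward)
```

The loop stops as soon as `done` is true, which is the point at which a caller would normally call `reset()` again.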

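The `env.step(1)` call above returned a reward of -1.

The rewards in the taxi problem are as follows:
* -1 for every timestep
* -10 for an illegal pick-up or drop-off
* +20 for delivering the passenger to the destination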
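These three reward values can be tied to the state components with a small function. This is a sketch under stated assumptions, not gym's implementation: the name `taxi_reward` is hypothetical, and the rule that dropping the passenger at a wrong designated location only costs the usual -1 is an assumption about the environment's behavior.

```python
LOCS = [(0, 0), (0, 4), (4, 0), (4, 3)]  # R, G, Y, B as (row, col)
IN_TAXI = 4                              # passenger index meaning "in the taxi"
PICKUP, DROPOFF = 4, 5                   # action indices


def taxi_reward(taxi_pos, pass_idx, dest_idx, action):
    """Return (reward, done) for one action; a sketch, not gym's code."""
    if action == PICKUP:
        legal = pass_idx != IN_TAXI and taxi_pos == LOCS[pass_idx]
        return (-1, False) if legal else (-10, False)
    if action == DROPOFF:
        if pass_idx == IN_TAXI and taxi_pos == LOCS[dest_idx]:
            return 20, True    # successful delivery ends the episode
        if pass_idx == IN_TAXI and taxi_pos in LOCS:
            return -1, False   # assumed: passenger left at a wrong designated stop
        return -10, False      # illegal drop-off
    return -1, False           # every movement step costs one point


print(taxi_reward((1, 0), 3, 2, PICKUP))   # pick-up away from the passenger
print(taxi_reward((4, 0), 4, 2, DROPOFF))  # drop-off at the destination Y
```

The first call is an illegal pick-up because the taxi is not at B, and the second is a successful delivery at Y.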

Let's move right by calling `step(2)`.

In [27]:
env.step(2)
env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
(Y is shown in magenta, B in blue, and the yellow taxi is at row 0, column 1.)


Essentially, the empty taxi is supposed to:
* move toward the blue letter,
* pick up the passenger (now the taxi is green),
* drive to the magenta letter, and
* drop off the passenger (the taxi is yellow again).

### Reward in Taxi-v2


In [None]:
env.render()



In [None]:
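state = env.reset()
counter = 0
reward = None
# Take random actions until a drop-off at the destination earns the +20 reward.
while reward != 20:
    state, reward, done, info = env.step(env.action_space.sample())
    counter += 1

# Number of random steps it took to complete one delivery.
print(counter)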


In [None]:
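Q = np.zeros([env.observation_space.n, env.action_space.n])  # one row per state, one column per action
G = 0  # total reward accumulated in an episode
alpha = 0.618  # learning rate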

In [None]:
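for episode in range(1, 1001):
    done = False
    G, reward = 0, 0
    state = env.reset()
    while not done:
        action = np.argmax(Q[state])  # 1: pick the greedy action for this state
        state2, reward, done, info = env.step(action)  # 2: apply it and observe the outcome
        # 3: move Q[state, action] toward reward + best value of the next state
        Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])
        G += reward  # accumulate the episode's total reward
        state = state2
    if episode % 50 == 0:
        print('Episode {} Total Reward: {}'.format(episode, G))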
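A single update from the loop above can be worked through by hand. The sketch below uses plain Python lists instead of NumPy; the helper `q_update` and the table name `q_table` are hypothetical, and the `gamma` discount parameter is an addition that the loop above does not have (leaving it at 1.0 reproduces the update used there).

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.618, gamma=1.0):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q_table[next_state])    # value of the best action in the next state
    td_target = reward + gamma * best_next  # gamma=1.0 matches the loop above
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# 500 states x 6 actions, all zeros, like Q before training.
q_table = [[0.0] * 6 for _ in range(500)]

# Moving up (action 1) from state 114 leads to state 14 with reward -1.
new_value = q_update(q_table, state=114, action=1, reward=-1, next_state=14)
print(new_value)
```

Because the table starts at zero, the first update is just `alpha * reward`, so the entry for state 114 and action 1 becomes -0.618 while every other entry stays at 0.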
