# Introduction to OpenAI Gym

The main API methods that users of this class need to know are:
* reset
* step
* render

### DuckieNav-v0 Environment
The example is modified from the Taxi Problem in "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" by Tom Dietterich (2000), Journal of Artificial Intelligence Research.

<img style="float: right;" src="images/DuckieNav-v1.png"  width="240" height="240">


```
MAP = [
    "+-----------------+",
    "|O|O| : : : : :G: |",
    "|O|O| |O| |O| |O| |",
    "|O| : |O| |O| |O| |",
    "| : |O|O| : : : : |",
    "| |O|O|O|O|O| |O| |",
    "| : :R: : : : :O: |",
    "| |O|O|O| |O|O|O| |",
    "| |O| : : |O| : : |",
    "| |O| |O|O|O|B|O| |",
    "| : : : : : : |O| |",
    "| |O| |O| |O| |O| |",
    "| : : : : : : |O| |",
    "| |O| |O| |O|O|O| |",
    "| : : :Y: : : : : |",
    "+-----------------+",
]
```

We consider a 14 by 9 grid world modeled on a Duckietown, excluding the "service area." The taxi problem is episodic, and in each episode a passenger is located at one of the 4 specially designated locations (R, Y, B, and G). The Duckiebot (taxi agent) starts in a given location and must go to the passenger’s location, pick up the passenger, go to the destination location, and put down the passenger. The episode ends when the passenger is deposited at the destination, which is also one of the 4 designated locations.

Adapted from https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym
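
The grid size and the four designated locations can be read directly off the `MAP` listing above. The following is a minimal sketch, assuming the Taxi-style layout in which the border is decoration and every second character of a row is a cell. The helper names `grid_shape` and `find_locations` are hypothetical and are not part of `gym_duckienav`.

```python
MAP = [
    "+-----------------+",
    "|O|O| : : : : :G: |",
    "|O|O| |O| |O| |O| |",
    "|O| : |O| |O| |O| |",
    "| : |O|O| : : : : |",
    "| |O|O|O|O|O| |O| |",
    "| : :R: : : : :O: |",
    "| |O|O|O| |O|O|O| |",
    "| |O| : : |O| : : |",
    "| |O| |O|O|O|B|O| |",
    "| : : : : : : |O| |",
    "| |O| |O| |O| |O| |",
    "| : : : : : : |O| |",
    "| |O| |O| |O|O|O| |",
    "| : : :Y: : : : : |",
    "+-----------------+",
]


def grid_shape(grid):
    """Return (rows, cols), ignoring the border and the separator characters."""
    rows = len(grid) - 2
    cols = (len(grid[0]) - 1) // 2
    return rows, cols


def find_locations(grid, letters="RGYB"):
    """Map each designated letter to its (row, col) cell index."""
    found = {}
    for r, line in enumerate(grid[1:-1]):
        # Cell c sits at character index 2*c + 1 of the line.
        for c in range((len(line) - 1) // 2):
            ch = line[2 * c + 1]
            if ch in letters:
                found[ch] = (r, c)
    return found


print(grid_shape(MAP))
print(find_locations(MAP))
```

Rows are counted from the top of the map and columns from the left, both starting at 0.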


## 1. Initialize DuckieNav-v0

### Installation
You can obtain and install this customized gym environment as follows:
```
$ git clone https://github.com/ARG-NCTU/gym-duckienav.git
$ cd gym-duckienav
$ pip install -e . # you may need sudo depending on your setup
```

In [1]:
import gym
import gym_duckienav
import gym.spaces
import numpy as np

env = gym.make("DuckieNav-v0")

### How many 'states' in observation_space: 
There are 2520 states: 14 (rows) x 9 (columns) x 5 (passenger locations: R, Y, B, G, or in the taxi) x 4 (destinations: R, Y, B, or G).

In [2]:
env.observation_space.n

2520
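
The count reported above can be checked by hand. This is a small sketch that computes the product and also enumerates every (row, column, passenger, destination) combination. The constant names are chosen here for illustration only.

```python
from itertools import product

N_ROWS, N_COLS = 14, 9
N_PASSENGER_LOCATIONS = 5   # R, G, Y, B, or in the taxi
N_DESTINATIONS = 4          # R, G, Y, or B

# Closed-form count.
n_states = N_ROWS * N_COLS * N_PASSENGER_LOCATIONS * N_DESTINATIONS

# Brute-force check: list every (row, col, passenger, destination) tuple.
all_tuples = list(product(range(N_ROWS), range(N_COLS),
                          range(N_PASSENGER_LOCATIONS), range(N_DESTINATIONS)))

print(n_states, len(all_tuples))
```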

### action_space: 
There are 6 possible actions, the same as in the Taxi-v2 environment:
* down (0), up (1), right (2), left (3), pick-up (4), and drop-off (5)

In [3]:
env.action_space.n

6

## 2. States

`env.reset()` resets the state of the environment and returns an initial observation (state).

The current state is built from:
* current taxi row position
* current taxi column position
* passenger location (Blue or in taxi) from 0: R, 1: G, 2: Y, 3: B; 4: in taxi.
* destination location (Magenta) from 0: R, 1: G, 2: Y, 3: B
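
The four components above fit into a single integer as a mixed-radix number. The sketch below assumes the same packing order as the classic Taxi environment (row, then column, then passenger, then destination). That assumption agrees with the outputs in this tutorial, where state 1563 decodes to 8, 6, 0, 3. The function names `encode_state` and `decode_state` are hypothetical stand-ins for the environment's own methods.

```python
N_ROWS, N_COLS, N_PASS, N_DEST = 14, 9, 5, 4


def encode_state(row, col, passenger, destination):
    """Pack the four components into one integer in [0, 2520)."""
    index = row
    index = index * N_COLS + col
    index = index * N_PASS + passenger
    index = index * N_DEST + destination
    return index


def decode_state(index):
    """Inverse of encode_state: return (row, col, passenger, destination)."""
    index, destination = divmod(index, N_DEST)
    index, passenger = divmod(index, N_PASS)
    row, col = divmod(index, N_COLS)
    return row, col, passenger, destination


print(encode_state(8, 6, 0, 3))
print(decode_state(124))
```

Under this packing, state 124 corresponds to the taxi at row 0, column 6, with the passenger at G and the destination at R.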

In [47]:
env.reset()
print("Current state: " + str(env.s))
for p in env.decode(env.s): print(p)
env.render()

Current state: 1563
8
6
0
3
+-----------------+
|O|O| : : : : :G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[34;1mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|[35m[43mB[0m[0m|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+



Run the previous cell a few times to see different initial states.

In the taxi problem, the colors mean:
* blue: passenger's current position
* magenta: destination
* yellow: empty taxi
* green: full taxi

## 3. Actions

Remember that the taxi agent can perform the following actions:
* 0: "South", 
* 1: "North", 
* 2: "East", 
* 3: "West", 
* 4: "Pickup", 
* 5: "Dropoff"
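
The action indices above can be kept in a small lookup table. This is a sketch only: the `MOVES` table and the `move` helper are hypothetical, and they ignore walls and `O` cells, which the real environment may treat differently.

```python
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]

# (row change, column change) for the four movement actions.
# Row 0 is the top of the map, so "South" increases the row index.
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}


def move(row, col, action, n_rows=14, n_cols=9):
    """Apply a movement action and stay inside the grid.

    Walls and 'O' cells are deliberately ignored in this sketch.
    Pickup and Dropoff leave the position unchanged.
    """
    d_row, d_col = MOVES.get(action, (0, 0))
    new_row = min(max(row + d_row, 0), n_rows - 1)
    new_col = min(max(col + d_col, 0), n_cols - 1)
    return new_row, new_col


print(ACTION_NAMES[2], move(0, 6, 2))
```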

Let's set the state to 124, which places the taxi at row 0, column 6, with the passenger at G and the destination at R.
Then let the taxi agent perform some actions.

In [70]:
env.s = 124
env.render()

+-----------------+
|O|O| : : : :[43m [0m:[34;1mG[0m: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (Dropoff)


### `step()`

Run one timestep of the environment's dynamics. 
It returns a tuple (observation, reward, done, info):
* observation (object): agent's observation of the current environment
* reward (float) : amount of reward returned after previous action
* done (boolean): whether the episode has ended, in which case further step() calls will return undefined results
* info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
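
To make the four return values concrete without installing anything, here is a toy environment that follows the same `reset` / `step` / `render` pattern. `ToyLineEnv` is a hypothetical class written for this illustration. It is not a Gym environment and does not reproduce DuckieNav's dynamics.

```python
class ToyLineEnv:
    """A hypothetical 1-D world: start at cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.s = 0

    def reset(self):
        """Put the agent back at the start and return the initial observation."""
        self.s = 0
        return self.s

    def step(self, action):
        """Action 1 moves right, any other action moves left (clipped to the line)."""
        delta = 1 if action == 1 else -1
        self.s = min(max(self.s + delta, 0), self.length - 1)
        done = self.s == self.length - 1
        reward = 20 if done else -1
        info = {"position": self.s}
        return self.s, reward, done, info

    def render(self):
        """Print the line with 'T' marking the agent."""
        print("".join("T" if i == self.s else "." for i in range(self.length)))


toy = ToyLineEnv()
obs = toy.reset()
finished = False
total = 0
while not finished:
    obs, r, finished, extra = toy.step(1)   # always move right
    total += r
toy.render()
print(obs, total)
```

The loop stops as soon as `step` reports that the episode has ended, which is the usual way to consume the third return value.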

Essentially, the empty taxi is supposed to:
* move toward the blue letter,
* pick up the passenger (now the taxi is green),
* drive to the magenta letter, and
* drop off the passenger (the taxi is yellow again).

From state 124, the passenger (at G) is one cell to the east, so the first useful move is "East", env.step(2).
We will do the following:
* Perform "Pickup" step(4) (although the passenger is not here)
* Perform "East" step(2)
* Perform "Pickup" step(4)
* Perform "West" step(3)
* Perform "South" step(0) 5 times
* Perform "Dropoff" step(5) (although this is not the destination)
* Perform "West" step(3) 4 times
* Perform "Dropoff" step(5)

In [71]:
state, reward, done, info = env.step(4)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : :[43m [0m:[34;1mG[0m: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (Pickup)
reward: -10


In [72]:
state, reward, done, info = env.step(2)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : : :[34;1m[43mG[0m[0m: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (East)
reward: -1


In [73]:
state, reward, done, info = env.step(4)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : : :[42mG[0m: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (Pickup)
reward: -1


In [74]:
state, reward, done, info = env.step(3)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : :[42m_[0m:G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (West)
reward: -1


In [75]:
for i in range(0, 5):
    env.step(0)
env.render()

+-----------------+
|O|O| : : : : :G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : :[42m_[0m:O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (South)


In [76]:
state, reward, done, info = env.step(5)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : : :G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35mR[0m: : : :[42m_[0m:O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (Dropoff)
reward: -10


In [77]:
for i in range(0, 4):
    env.step(3)
env.render()

+-----------------+
|O|O| : : : : :G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35m[42mR[0m[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (West)


In [78]:
state, reward, done, info = env.step(5)
env.render()
print("reward: " + str(reward))

+-----------------+
|O|O| : : : : :G: |
|O|O| |O| |O| |O| |
|O| : |O| |O| |O| |
| : |O|O| : : : : |
| |O|O|O|O|O| |O| |
| : :[35m[42mR[0m[0m: : : : :O: |
| |O|O|O| |O|O|O| |
| |O| : : |O| : : |
| |O| |O|O|O|B|O| |
| : : : : : : |O| |
| |O| |O| |O| |O| |
| : : : : : : |O| |
| |O| |O| |O|O|O| |
| : : :Y: : : : : |
+-----------------+
  (Dropoff)
reward: 20


### Rewards

You have probably figured out the rewards:
* Any movement, or a successful pickup: -1
* Pick up or drop off at the wrong position: -10
* Drop off the passenger at the right position: 20 
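
The reward rules above can be sketched as a small step function and checked against the walkthrough from state 124. This is a simplified model written for this tutorial, not the environment's implementation: `simple_step` is a hypothetical name, walls and `O` cells are ignored, and any failed pickup or drop-off is assumed to cost -10.

```python
# Designated locations as (row, col), read off the MAP.
# Index order follows the tutorial: 0: R, 1: G, 2: Y, 3: B.
LOCS = [(5, 2), (0, 7), (13, 3), (8, 6)]
IN_TAXI = 4
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}  # South, North, East, West


def simple_step(state, action):
    """Return (next_state, reward, done) for state = (row, col, passenger, destination)."""
    row, col, passenger, destination = state
    reward, done = -1, False
    if action in MOVES:
        d_row, d_col = MOVES[action]
        row = min(max(row + d_row, 0), 13)   # 14 rows
        col = min(max(col + d_col, 0), 8)    # 9 columns
    elif action == 4:  # Pickup
        if passenger != IN_TAXI and (row, col) == LOCS[passenger]:
            passenger = IN_TAXI
        else:
            reward = -10
    elif action == 5:  # Dropoff
        if passenger == IN_TAXI and (row, col) == LOCS[destination]:
            passenger = destination
            reward, done = 20, True
        else:
            reward = -10
    return (row, col, passenger, destination), reward, done


# Replay the walkthrough: taxi at (0, 6), passenger at G, destination R.
state = (0, 6, 1, 0)
actions = [4, 2, 4, 3] + [0] * 5 + [5] + [3] * 4 + [5]
rewards = []
for a in actions:
    state, r, done = simple_step(state, a)
    rewards.append(r)

print(rewards, sum(rewards))
```

Under these assumptions the replay yields the same per-step rewards as the cells above: two -10 penalties, twelve -1 steps, and a final 20.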

## 4. Random Actions

We will use the function env.action_space.sample(); you can run the following cell a few times.

In [85]:
print(env.action_space.sample())

0
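
Conceptually, sampling from a discrete action space with 6 actions amounts to drawing a uniform random integer from 0 to 5. The class below is a hypothetical stand-in written to illustrate that idea. It is not Gym's implementation, and the `seed` argument is an assumption added here for reproducibility.

```python
import random


class DiscreteSketch:
    """Hypothetical stand-in for a discrete action space with actions 0..n-1."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        """Return one action index, each equally likely."""
        return self._rng.randrange(self.n)

    def contains(self, x):
        """Check that x is a valid action index."""
        return isinstance(x, int) and 0 <= x < self.n


space = DiscreteSketch(6, seed=0)
draws = [space.sample() for _ in range(1000)]
counts = [draws.count(a) for a in range(space.n)]
print(counts)
```

Over many draws each action should appear roughly equally often, which is why a purely random policy spends most of its time on moves that do not help.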


### How well does behaving completely at random do?

In [91]:
state = env.reset()
counter = 0
g = 0
reward = None
while reward != 20:
    state, reward, done, info = env.step(env.action_space.sample())
    counter += 1
    g += reward
print("Solved in {} Steps with a total reward of {}".format(counter,g))


Solved in 7410 Steps with a total reward of -30321


## 5. Basic Reinforcement Learning: Q-Learning

In [89]:
n_states = env.observation_space.n
n_actions = env.action_space.n
# Q-table: one row per state, one column per action, initialized to zero
Q = np.zeros([n_states, n_actions])

# This multidimensional array will keep a history of our Q-values for all states
Q_hist = np.zeros([n_states, n_actions, 0])


episodes = 1
alpha = 0.618  # learning rate

for episode in range(1, episodes + 1):
    done = False
    G, reward = 0, 0  # G accumulates the total reward of the episode
    state = env.reset()
    firstState = state
    print("Initial State = {}".format(state))
    while reward != 20:
        # Greedy choice: the action with the highest Q-value in this state
        action = np.argmax(Q[state])
        state2, reward, done, info = env.step(action)
        # Q-learning update (no discount factor)
        Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])
        G += reward
        state = state2

        # This will keep a history of Q-values in a multidimensional array
        Q_hist = np.dstack((Q_hist, Q))
finalState = state
print("Final State = {}".format(finalState))

Initial State = 1673
Final State = 157
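
The update inside the loop is the tabular Q-learning rule with the discount factor fixed at 1: Q(s, a) ← Q(s, a) + α (r + max over a' of Q(s', a') − Q(s, a)). The sketch below works through two updates by hand using the same α = 0.618. The helper name `q_update` is hypothetical, and plain floats are used instead of a NumPy table.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.618):
    """One tabular Q-learning update with discount factor 1."""
    return q_sa + alpha * (reward + max_q_next - q_sa)


# First visit: every Q-value is still zero and the move costs -1.
q = 0.0
q = q_update(q, reward=-1, max_q_next=0.0)
first = q

# Second visit to the same (state, action), next state still unexplored.
q = q_update(q, reward=-1, max_q_next=0.0)
second = q

# A successful drop-off from an untouched entry.
goal = q_update(0.0, reward=20, max_q_next=0.0)

print(first, second, goal)
```

Because the table starts at zero and most rewards are negative, untried actions keep a Q-value of 0 and look better than actions already tried. That is why the purely greedy `np.argmax` choice still ends up trying new actions.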
