# OpenAI Gym Tutorial

In [66]:
import gym
import numpy as np
import matplotlib.pyplot as plt

## Agent-Environment Loop

Each `step` the agent chooses an `action`, and the environment returns an `observation` and `reward`. The actual four returned value are:

- `observation` (**object**): an environment-specific object $o \in \mathcal O$ representing the observation of the environment.
- `reward` (**float**): each action returns a reward.
- `done` (**boolean**): when a terminal state is reached, time to `reset` the environment for the next episode.
- `info` (**dict**): diagnostic info useful for debugging.

## Actions

Our first example randomly samples from the action *space*. Each environment comes with an `action_space` and `observation_space`. A `Discrete` space is a fixed set of non-negative numbers. The `Box` space is an $\mathbf R^n$ dimensional vector of observations.

## Frozen Lake

Consider a $4 \times 4$ grid. The environment has 16 possible states $\mathcal S$, and four actions $\mathcal A$: `{0:LEFT, 1:DOWN, 2:RIGHT, 4:UP}`.

In [168]:
env = gym.make('FrozenLake-v1', map_name="4x4")
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


In [179]:
rewards = []
actions = {
    0: "LEFT",
    1: "DOWN",
    2: "RIGHT",
    3: "UP"
}

In [152]:
def value_iteration(env, gamma=1, theta=1e-8):
    r"""Policy evaluation function. Loop until state rewards are stable.
    
    Returns: 
        V (np.array): expected state value given an infinite horizon.
        policy (np.array): best action for each state.

    Args:
        env (gym.env): gym environment.
        gamma (float): future reward discount rate.
        theta (float): stopping criterion.
    """
    # Initialize state-value array
    S_n = env.observation_space.n
    V = np.zeros(S_n)
    policy = np.ones(S_n) * -1
    delta = np.zeros(S_n)
    i = 0
    while True:
        i += 1
        # Loop through states
        for s in env.P:
            Vs = np.zeros(len(env.P[s]))
            # Loop through available actions in each state
            for a in env.P[s]:
                # Loop though transition probabilities, next state, and rewards
                for prob, next_state, reward, done in env.P[s][a]:
                    Vs[a] = prob * (reward + gamma * V[next_state])
            delta[s] = np.abs(V[s] - Vs.max())
            V[s] = Vs.max()
            policy[s] = Vs.argmax()
        if np.all(delta < theta):
            print("Value iteration complete after {} steps".format(i))
            break
    return V, policy


In [183]:
env.P[14]

{0: [(0.3333333333333333, 10, 0.0, False),
  (0.3333333333333333, 13, 0.0, False),
  (0.3333333333333333, 14, 0.0, False)],
 1: [(0.3333333333333333, 13, 0.0, False),
  (0.3333333333333333, 14, 0.0, False),
  (0.3333333333333333, 15, 1.0, True)],
 2: [(0.3333333333333333, 14, 0.0, False),
  (0.3333333333333333, 15, 1.0, True),
  (0.3333333333333333, 10, 0.0, False)],
 3: [(0.3333333333333333, 15, 1.0, True),
  (0.3333333333333333, 10, 0.0, False),
  (0.3333333333333333, 13, 0.0, False)]}

In [175]:
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


In [180]:
V, policy = value_iteration(env, gamma=.9)

Value iteration complete after 7 steps


In [177]:
print(np.array2string(V.reshape(4,4), precision=3))

[[0.001 0.003 0.009 0.003]
 [0.003 0.    0.03  0.   ]
 [0.009 0.03  0.1   0.   ]
 [0.    0.1   0.333 0.   ]]


In [178]:
np.array([actions[a] for a in policy]).reshape(4,4)

array([['LEFT', 'DOWN', 'LEFT', 'UP'],
       ['LEFT', 'LEFT', 'LEFT', 'LEFT'],
       ['DOWN', 'LEFT', 'LEFT', 'LEFT'],
       ['LEFT', 'DOWN', 'DOWN', 'LEFT']], dtype='<U4')

In [156]:
goal = 15
for i_episode in range(20):
    obs = env.reset()
    for t in range(100):
        # env.render()
        # print(obs)
        action = policy[obs]
        obs, reward, done, info = env.step(action)  # take a random action
        if done:
            if obs == goal:
                print ("Made the goal!")
            else:
                print("Fell in the hole at ({},{})".format(obs // 4, obs%4))
            # print("Episode {} finished after {} timeteps. Reward {}.".format(
                # i_episode+1, t+1, reward))
            break
env.close()

Fell in the hole at (8,3)
Fell in the hole at (10,1)
Fell in the hole at (10,1)
Fell in the hole at (10,1)
Fell in the hole at (10,2)
Fell in the hole at (10,1)
Fell in the hole at (10,2)
Fell in the hole at (8,3)
Fell in the hole at (10,1)
Fell in the hole at (14,3)
Fell in the hole at (10,1)
Fell in the hole at (10,2)
Fell in the hole at (8,3)
Fell in the hole at (8,3)
Fell in the hole at (10,1)
Fell in the hole at (12,1)
Fell in the hole at (10,1)
Fell in the hole at (8,3)
Fell in the hole at (10,1)
Fell in the hole at (8,3)
