# Intro to OpenAI Gym -- the Mountain Car Example -- 20191004

In [1]:
import gym
import numpy as np

OpenAI's `gym` module provides users with a wide variety of environments (e.g., a video game world) in which an agent (e.g., a neural network) must learn to control some object. Reinforcement learning agents learn within a conceptual framework known as a **Markov decision process**.

This process discretizes time (which is what computers already do) into time steps. At each time step, the agent observes the state of the environment, takes an action based on that observation, and notes the immediate result/reward of taking that action. In a Markov decision process, each of these quantities at the next time step is a random variable, i.e., before it is "sampled"/observed/selected/received, the agent only knows rough probabilities of the outcome of the "sample." Therefore, these steps (observe, act, note results) are really just a sequence of random variables of the form: $\textrm{state}_0$, $\textrm{action}_0$, $\textrm{reward}_1$, $\textrm{state}_1$, $\textrm{action}_1$, $\textrm{reward}_2$, ..., $\textrm{state}_t$. In this sense, the sequence is a "process," a stochastic one specifically, and one over which the agent can exert some degree of control.

The agent exerts control over the process by making a decision: what $\textrm{action}_t$ should be. The notion of "should" necessitates an objective. In reinforcement learning, the objective is to maximize the reward. Given that most processes worth learning have multiple steps, the objective is, more accurately, to maximize the sum of the rewards from here on out: the cumulative reward. However, given that maximizing cumulative reward necessarily involves anticipating the results of future actions, and given that the results of future actions are random, there is no way for the agent to "know" the cumulative reward and thereby maximize it. Instead, the agent tries to do the next best thing: maximize the *expected* cumulative reward (*expected* as in *expected value*). For mathematical underpinning purposes (proofs of algorithmic convergence), and to model immediate rewards being preferred to future rewards (assuming the same reward), the expected cumulative reward is often present-value (PV) discounted. Mathematically, this can all be written:
$\sum_{i=0}^{end} \gamma^i E[r_{i+1} | s_i, a_i]$, where $\gamma \in [0,1)$ is the PV discount factor.
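
As a quick numeric illustration of the discounting, the sketch below computes the discounted sum for one already-observed sequence of rewards, i.e., a single sample of the quantity whose expectation the agent tries to maximize. The function name `discounted_return` and the example rewards are made up for illustration; nothing here comes from `gym`.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i], where rewards[i] plays the role of r_{i+1}."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must be in [0, 1)")
    total = 0.0
    # Work backwards through the rewards: G_i = r_{i+1} + gamma * G_{i+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps with a reward of -1 each (hypothetical values)
print(discounted_return([-1, -1, -1], gamma=0.5))  # -1 - 0.5 - 0.25 = -1.75
```

With $\gamma = 0.5$, the third reward only counts a quarter as much as the first, which is the "immediate rewards are preferred" effect described above.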

Oh, one more thing about MDPs: the action $a_i$ is decided upon by considering *only* the current state $s_i$, which is why there was no $s_{i-1}$ in the above expected value. This is what is known as the Markov assumption, which, more generally, can be stated as: the past provides no additional information beyond that which is provided by the present. Now, oftentimes, this assumption cannot strictly be met by considering only the environment at a single moment in time. Therefore, the "state at time $t$" is often relaxed to include some (but not all) prior moments' details.
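
One common way to relax the state like this is to bundle the last few observations together and treat the bundle as the state. Here is a minimal sketch of that idea, assuming a made-up environment that only reports a position; two consecutive positions then carry the velocity information that a single one lacks. The class `ObservationHistory` is hypothetical and not part of `gym`.

```python
from collections import deque


class ObservationHistory:
    """Keeps the k most recent observations and exposes them as one state."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def reset(self, first_observation):
        # Fill the whole history with copies of the first observation
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(tuple(first_observation))
        return self.state()

    def push(self, observation):
        # Appending to a full deque drops the oldest observation automatically
        self.frames.append(tuple(observation))
        return self.state()

    def state(self):
        # Oldest observation first, newest last
        return tuple(self.frames)


history = ObservationHistory(k=2)
print(history.reset([0.0]))  # ((0.0,), (0.0,))
print(history.push([0.1]))   # ((0.0,), (0.1,))
print(history.push([0.3]))   # ((0.1,), (0.3,))
```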

Reinforcement learning's place in all of this is in providing the agent with a set of instructions, an algorithm, by which it can learn to maximize the PV-discounted expected cumulative reward. There are many different reinforcement learning algorithms out there; check out the "Reinforcement Learning" Wikipedia page for a great start. Since this is an OpenAI Gym walkthrough, we can stop at this explanation of each piece. And, finally, we can get to the code.

## What Environments Are Available?

OpenAI `gym` *does not* provide agents. This is purposeful; `gym`'s mission statement can be said to be: to provide reinforcement learning researchers with a set of easy-to-use, standardized environments *so as to make developing reinforcement learning algorithms easier.* The environments are easy-to-use and standardized in the sense that they (oftentimes simple video games) don't require super powerful hardware, and they also tend to only require a few lines of code, as we'll see shortly. For now, though, let's see what environments `gym` contains.

In [2]:
# https://stackoverflow.com/a/48989130
# Strip the class name ('EnvSpec(') and the version info (everything from the first '-') to get each base name
gym_environments = np.unique([repr(e).replace('EnvSpec(', '').split('-')[0]
                              for e in gym.envs.registry.all()])
for idx, e in enumerate(gym_environments):
    print(idx + 1, 'of', len(gym_environments), '---', e)


1 of 256 --- Acrobot
2 of 256 --- AirRaid
3 of 256 --- AirRaidDeterministic
4 of 256 --- AirRaidNoFrameskip
5 of 256 --- Alien
6 of 256 --- AlienDeterministic
7 of 256 --- AlienNoFrameskip
8 of 256 --- Amidar
9 of 256 --- AmidarDeterministic
10 of 256 --- AmidarNoFrameskip
11 of 256 --- Ant
12 of 256 --- Assault
13 of 256 --- AssaultDeterministic
14 of 256 --- AssaultNoFrameskip
15 of 256 --- Asterix
16 of 256 --- AsterixDeterministic
17 of 256 --- AsterixNoFrameskip
18 of 256 --- Asteroids
19 of 256 --- AsteroidsDeterministic
20 of 256 --- AsteroidsNoFrameskip
21 of 256 --- Atlantis
22 of 256 --- AtlantisDeterministic
23 of 256 --- AtlantisNoFrameskip
24 of 256 --- BankHeist
25 of 256 --- BankHeistDeterministic
26 of 256 --- BankHeistNoFrameskip
27 of 256 --- BattleZone
28 of 256 --- BattleZoneDeterministic
29 of 256 --- BattleZoneNoFrameskip
30 of 256 --- BeamRider
31 of 256 --- BeamRiderDeterministic
32 of 256 --- BeamRiderNoFrameskip
33 of 256 --- Berzerk
34 of 256 --- BerzerkDeter

Safe to say, `gym` has a lot of environments.

The environment that we're going to sample is one that is posted on OpenAI's homepage: MountainCar. Without further ado, let's run an environment. The goal is to push a minecart up a hill that can only be surmounted by utilizing momentum. See https://github.com/openai/gym/wiki/MountainCar-v0 for the rest of the explanation, or just run the code below and see what I mean.

In [3]:
# https://gym.openai.com/
env = gym.make('MountainCar-v0')
observation = env.reset() # state_0
for i in range(10000):
    env.render()
    action = env.action_space.sample()  # action_i selected -- in this case it's randomly sampled
    # actions are ints in [0, 2]: 0 = push left, 1 = no push, 2 = push right
    # (see https://github.com/openai/gym/wiki/MountainCar-v0)

    observation, reward, done, info = env.step(action)
    # action_i taken; get observation_i+1, reward_i+1, done_i+1, and an info dict for metadata
    # done is True if the goal is achieved or the episode is otherwise over; else False
    # observation: (position [one dimensional -- just the x-axis], velocity [in that one dimension])
    # ^ https://github.com/openai/gym/blob/1d31c12437e8bd7f466139a479705819fff8c111/gym/envs/classic_control/mountain_car.py#L57
    
    if done:
        observation = env.reset()
        
env.close()  # close

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


Hopefully this simple example conveys how easy to use and standardized the environments are. All that we really have to do is specify an action selection function.
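
To make that separation concrete, here is a sketch of a generic episode loop that takes the action selection function as an argument. Both `run_episode` and `CountdownEnv` are hypothetical names; the stub environment only mimics the `reset()` / 4-tuple `step()` interface used above so that the snippet runs without `gym` installed, and the action selection function here takes the whole observation rather than separate arguments.

```python
class CountdownEnv:
    """Hypothetical stand-in for a gym environment (not part of gym).

    The state is an integer that starts at 3. Action 1 decrements it;
    any other action leaves it alone. Each step yields a reward of -1,
    and the episode is done once the state hits 0.
    """

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action):
        if action == 1:
            self.state -= 1
        done = self.state == 0
        return self.state, -1.0, done, {}


def run_episode(env, select_action, max_steps=200):
    """Run one episode; return (total reward, steps taken, whether done was reached)."""
    observation = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = select_action(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, step, True
    return total_reward, max_steps, False


print(run_episode(CountdownEnv(), lambda obs: 1))               # (-3.0, 3, True)
print(run_episode(CountdownEnv(), lambda obs: 0, max_steps=5))  # (-5.0, 5, False)
```

Swapping in a different agent then only means passing a different `select_action`; the loop itself never changes.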

In [4]:
def select_action(position, velocity):
    # https://github.com/openai/gym/wiki/MountainCar-v0 for actions
    push_left = 0
    no_push = 1
    push_right = 2
    position_min = -1.2
    position_max = 0.6
    velocity_min = -0.07
    velocity_max = 0.07
    
    if -0.9 <= position <= 0.0 and velocity <= 0:
        return push_left
    
    else:
        return push_right 
        

In [7]:
# https://gym.openai.com/
env = gym.make('MountainCar-v0')
observation = env.reset()
max_iter = 200
for i in range(max_iter):
    env.render()
    action = select_action(position=observation[0], velocity=observation[1])
    
    observation, reward, done, info = env.step(action) 
    
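    # `i < max_iter` is always True inside this loop, so check the position instead:
    # the goal for MountainCar-v0 is reached once position >= 0.5
    if done and observation[0] >= 0.5:
        print('Achieved the objective!')
        break
    elif done:  # episode ended (e.g., step limit) without reaching the goal
        print('Failed to achieve the objective!')
        break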
        
env.close()  # close

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Achieved the objective!


Now, of course, it isn't all that impressive for a human to hard-code a solution to this problem. This problem/environment was really created to test the capabilities of control algorithms that don't know anything about mine carts or mountains or the laws of physics and so on. For now, though, I'm going to leave off here. Hopefully by now you at least know how to use `gym` at a *very* basic level.