# OpenAI Gym tutorial walkthrough

## Testing the environment

In [1]:
import gym

In [None]:
env = gym.make('CartPole-v0')

In [None]:
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # Take a random action
    
env.close()

## Observations
To do better than taking random actions at each step, we need to know what our actions are doing to the environment.

The environment's `step()` function returns exactly what we need. In fact, `step()` returns four values:
- `observation` (object): An environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- `reward` (float): Amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- `done` (boolean): Whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- `info` (dict): Diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is the classic "agent-environment loop": at each timestep, the agent chooses an `action`, and the environment returns an `observation` and a `reward`.
![AE-Loop](./images/aeloop.svg)

The process gets started by calling `reset()`, which returns an initial `observation`. A better way to write the random-action loop from the first section is therefore to respect the `done` flag and reset the environment whenever an episode ends:

In [2]:
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()  # Randomly sample from the environment's action space
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after %s timesteps" % (t + 1))
            break
            
env.close()

[ 0.04351831 -0.02567718 -0.04385931  0.04388824]
[ 0.04300476  0.17004535 -0.04298155 -0.26230374]
[ 0.04640567  0.36575364 -0.04822762 -0.56822773]
[ 0.05372074  0.17134015 -0.05959217 -0.29111999]
[ 0.05714755  0.36725894 -0.06541457 -0.60198625]
[ 0.06449273  0.56323198 -0.0774543  -0.91453527]
[ 0.07575737  0.36923825 -0.09574501 -0.64716588]
[ 0.08314213  0.17557149 -0.10868832 -0.3861023 ]
[ 0.08665356 -0.0178532  -0.11641037 -0.12956878]
[ 0.0862965   0.17872726 -0.11900174 -0.45659118]
[ 0.08987104 -0.01452894 -0.12813357 -0.20366104]
[ 0.08958046  0.1821704  -0.13220679 -0.53385955]
[ 0.09322387 -0.01086873 -0.14288398 -0.28558338]
[ 0.0930065  -0.20369453 -0.14859565 -0.04115728]
[ 0.08893261 -0.3964078  -0.14941879  0.20120054]
[ 0.08100445 -0.58911203 -0.14539478  0.44327031]
[ 0.06922221 -0.78190972 -0.13652938  0.68681985]
[ 0.05358401 -0.58518328 -0.12279298  0.35446114]
[ 0.04188035 -0.7783648  -0.11570376  0.60604107]
[ 0.02631305 -0.97169499 -0.10358293  0.86015633]


[-0.02815735 -0.03319159  0.04913043  0.04792031]
[-0.02882118 -0.22898235  0.05008884  0.35569061]
[-0.03340083 -0.42477935  0.05720265  0.66373773]
[-0.04189642 -0.62064846  0.07047741  0.97386926]
[-0.05430938 -0.81664155  0.08995479  1.28783226]
[-0.07064222 -1.01278546  0.11571144  1.60726961]
[-0.09089793 -0.81920603  0.14785683  1.35278549]
[-0.10728205 -1.01584241  0.17491254  1.68783354]
[-0.12759889 -1.21250258  0.20866921  2.02948523]
Episode finished after 9 timesteps
[0.03865311 0.03749882 0.03055433 0.03356519]
[ 0.03940309  0.23216959  0.03122564 -0.24932309]
[0.04404648 0.03661595 0.02623917 0.05304322]
[ 0.0447788  -0.15887222  0.02730004  0.35388797]
[0.04160136 0.03585112 0.0343778  0.06993701]
[ 0.04231838 -0.15974639  0.03577654  0.37326501]
[ 0.03912345 -0.35535782  0.04324184  0.67701045]
[ 0.0320163  -0.55105306  0.05678205  0.98298805]
[ 0.02099523 -0.74688795  0.07644181  1.29295191]
[ 0.00605747 -0.94289373  0.10230085  1.60855401]
[-0.0128004  -0.74911879  0

[-0.10317333 -0.99728875  0.17457459  1.66729148]
[-0.12311911 -0.80457424  0.20792042  1.43367809]
Episode finished after 17 timesteps
[-0.01998234 -0.00104428  0.02405424 -0.00516771]
[-0.02000322  0.19372459  0.02395088 -0.29016523]
[-0.01612873 -0.00173055  0.01814758  0.00997428]
[-0.01616334  0.1931265   0.01834706 -0.27692807]
[-0.01230081 -0.00225233  0.0128085   0.02148458]
[-0.01234586  0.19268361  0.01323819 -0.26712975]
[-0.00849219  0.38761415  0.0078956  -0.55560803]
[-0.0007399   0.19238224 -0.00321656 -0.260448  ]
[ 0.00310774  0.38754996 -0.00842552 -0.55414374]
[ 0.01085874  0.5827892  -0.0195084  -0.84946928]
[ 0.02251452  0.38793865 -0.03649778 -0.56298417]
[ 0.0302733   0.19334734 -0.04775747 -0.28201956]
[ 0.03414024  0.38911681 -0.05339786 -0.58937431]
[ 0.04192258  0.19478163 -0.06518534 -0.31397854]
[ 0.04581821  0.00064589 -0.07146491 -0.04254432]
[ 0.04583113 -0.19338237 -0.0723158   0.22676238]
[ 0.04196348 -0.38740033 -0.06778055  0.49578637]
[ 0.03421548 -
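
The same loop can be exercised without a display or a Gym installation. The following is a minimal sketch under stated assumptions: `CountdownEnv` is a hypothetical toy environment, not part of Gym, that only mimics the `reset()`/`step()` interface described in this section, and `run_episode` is an illustrative helper that sums rewards until `done` is `True`.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment (not part of Gym) mimicking reset()/step().

    The state is an integer counter. Action 1 decrements it, action 0 leaves
    it unchanged. The episode ends when the counter reaches zero.
    """

    def __init__(self, start=5):
        self.start = start
        self.counter = start

    def reset(self):
        # Return the initial observation, as reset() does in this tutorial.
        self.counter = self.start
        return self.counter

    def step(self, action):
        # Apply the action, then report the four values described above.
        if action == 1:
            self.counter -= 1
        observation = self.counter
        reward = 1.0 if action == 1 else 0.0
        done = self.counter == 0
        info = {}  # no diagnostics in this toy example
        return observation, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    observation = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs behave the same
    total, steps = run_episode(CountdownEnv(), lambda obs: rng.choice([0, 1]))
    print("Episode finished after %s timesteps, return %.1f" % (steps, total))
```

Passing the policy as a function keeps the loop unchanged when the random choice is later swapped for something smarter.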

## Spaces

In the examples above, we've been sampling random actions from the environment's action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations:

In [3]:
import gym
env = gym.make('CartPole-v0')

In [4]:
print(env.action_space)

Discrete(2)


In [5]:
print(env.observation_space)

Box(4,)


The `Discrete` space allows a fixed range of non-negative integers, so in this case valid `action`s are either 0 or 1.  
The `Box` space represents an `n`-dimensional box, so valid observations will be an array of 4 numbers. We can also check the `Box`'s bounds:

In [6]:
print(env.observation_space.high)

[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


In [7]:
print(env.observation_space.low)

[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
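
The two arrays printed above are per-dimension upper and lower bounds. As a rough sketch of what membership in a `Box` amounts to, the hypothetical helper `in_box` below tests an observation copied from the earlier episode output against those bounds. It is not Gym's implementation (which, among other things, also checks the array shape), and it assumes plain Python lists rather than NumPy arrays.

```python
# Bounds copied from the printed output of env.observation_space.high / .low.
HIGH = [4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38]
LOW = [-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38]


def in_box(observation, low, high):
    """Return True if every component lies within its [low, high] interval."""
    if not (len(observation) == len(low) == len(high)):
        return False
    return all(lo <= x <= hi for x, lo, hi in zip(observation, low, high))


# First observation from the episode output earlier in this walkthrough.
first_observation = [0.04351831, -0.02567718, -0.04385931, 0.04388824]
print(in_box(first_observation, LOW, HIGH))  # True

# A third component of 0.5 exceeds its bound of about 0.419.
print(in_box([0.0, 0.0, 0.5, 0.0], LOW, HIGH))  # False
```

The second and fourth bounds, `3.4028235e+38`, are the largest finite 32-bit float, so those two components are effectively unbounded.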


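You can also use `Box` and `Discrete` directly, for example when defining the spaces of your own environment. A space can be sampled from, and it can check whether a value belongs to it: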

In [8]:
from gym import spaces
space = spaces.Discrete(8)
x = space.sample()

In [9]:
assert space.contains(x)
assert space.n == 8

In [20]:
# The action space has only two values, {0, 1}: 0 applies force to the left and 1 applies force to the right
print(env.action_space.sample()) 

1
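
Knowing that the two actions push the cart left (`0`) or right (`1`), a first step beyond random sampling is a hand-written rule. The sketch below is an illustrative heuristic, not a method from this walkthrough: `lean_policy` is a hypothetical name, and it assumes the third observation component is the pole angle, with positive values meaning the pole leans to the right.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning to.

    Assumes observation[2] is the pole angle (positive = leaning right)
    and that action 0 pushes left while action 1 pushes right.
    """
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0


# Observations copied from the episode output earlier in this walkthrough.
leaning_left = [0.04351831, -0.02567718, -0.04385931, 0.04388824]
leaning_right = [-0.02815735, -0.03319159, 0.04913043, 0.04792031]

print(lean_policy(leaning_left))   # 0
print(lean_policy(leaning_right))  # 1
```

In the episode loop, `action = lean_policy(observation)` would take the place of `env.action_space.sample()`.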


## Available Environments

Gym comes with a diverse suite of environments that range from easy to difficult and involve different kinds of data. 

- Classic control and toy text: Complete small-scale tasks, mostly from the RL literature
- Algorithmic: Perform computations such as adding multi-digit numbers and reversing sequences. One might object that these tasks are easy for a computer. The challenge is to learn these algorithms purely from examples. These tasks have the nice property that it's easy to vary the difficulty by varying the sequence length
- Atari: Play classic Atari games
- 2D and 3D robots: Control a robot in simulation. These tasks use the MuJoCo physics engine, which was designed for fast and accurate robot simulation. Included are some environments from a recent benchmark by UC Berkeley researchers.

## The registry
Gym's main purpose is to provide a large collection of environments that expose a common interface and are versioned to allow for comparisons. To list the environments available in your installation, just ask `gym.envs.registry`:

In [21]:
from gym import envs
print(envs.registry.all())

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v2), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(HalfCheetah-v3), EnvSpec(Hopper-v2), EnvSpec(Hopper-v3), EnvSpec(Swimmer-v2), EnvSp
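
Environment ids in the listing follow a `Name-vN` pattern, where the suffix is the version. As a small sketch of how that versioning can be used, the hypothetical helper `latest_versions` below (not part of Gym) takes a few ids copied from the output above and keeps the highest version seen for each name.

```python
import re

# A few ids copied from the registry listing above.
ENV_IDS = [
    "CartPole-v0", "CartPole-v1", "MountainCar-v0",
    "HalfCheetah-v2", "HalfCheetah-v3", "Hopper-v2", "Hopper-v3",
]

# Name, then "-v", then an integer version number.
ID_PATTERN = re.compile(r"^(?P<name>.+)-v(?P<version>\d+)$")


def latest_versions(env_ids):
    """Map each environment name to the highest version number seen."""
    latest = {}
    for env_id in env_ids:
        match = ID_PATTERN.match(env_id)
        if match is None:
            continue  # skip ids that do not follow the Name-vN pattern
        name = match.group("name")
        version = int(match.group("version"))
        latest[name] = max(version, latest.get(name, -1))
    return latest


for name, version in sorted(latest_versions(ENV_IDS).items()):
    print("%s-v%d" % (name, version))
```

Because results are only comparable within one version of an environment, it helps to record the full id, including the suffix, alongside any reported score.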

## Why Gym?

Reinforcement learning (RL) is the subfield of machine learning concerned with decision making and motor control. It studies how an agent can learn to achieve goals in a complex, uncertain environment. It's exciting for two reasons:
1. **RL is very general, encompassing all problems that involve making a sequence of decisions**: For example, controlling a robot's motors so that it's able to run and jump, making business decisions like pricing and inventory management, or playing video games and board games. RL can even be applied to supervised learning problems with sequential or structured outputs.

2. **RL algorithms have started to achieve good results in many difficult environments**: RL has a long history, but until recent advances in deep learning, it required lots of problem-specific engineering. DeepMind's Atari results, BRETT from Pieter Abbeel's group, and AlphaGo all used deep RL algorithms which did not make too many assumptions about their environment, and thus can be applied in other settings.

However, RL research is also slowed down by two factors:
1. **The need for better benchmarks.** In supervised learning, progress has been driven by large labeled datasets like ImageNet. In RL, the closest equivalent would be a large and diverse collection of environments. However, the existing open-source collections of RL environments don't have enough variety, and they are often difficult to even set up and use.

2. **Lack of standardization of environments used in publications.** Subtle differences in the problem definition, such as the reward function or the set of actions, can drastically alter a task's difficulty. This issue makes it difficult to reproduce published research and compare results from different papers.