## Reinforcement Learning

Start by installing the necessary packages:

In [None]:
!pip install gym
!pip install procgen
!pip install gym-retro
!pip install stable-baselines

### Explore Environments

**Gym**  
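Gym provides a standard interface for reinforcement learning environments. It ships with classic tasks such as CartPole and is also the usual starting point when creating new environments or games to train agents on. 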

In [2]:
import gym
env = gym.make('CartPole-v0')
print(f"The observation space of the environment is {env.observation_space}")
print(f"The action space of the environment is {env.action_space}")

The observation space of the environment is Box(4,)
The action space of the environment is Discrete(2)
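
Every Gym environment is driven through the same two calls, `reset()` and `step(action)`. As a rough illustration of that interface, here is a minimal sketch of a toy environment written in plain Python. `CorridorEnv`, `run_episode`, and the reward scheme are hypothetical and made up for this example; a real environment would subclass `gym.Env` and declare an `observation_space` and `action_space` like the ones printed above.

```python
import random


class CorridorEnv:
    """Toy environment (hypothetical): walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end and return the first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left; stay inside the corridor.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return the total reward collected."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward


env = CorridorEnv()
print(run_episode(env, lambda obs: 1))                      # always move right
print(run_episode(env, lambda obs: random.choice([0, 1])))  # random agent
```

The loop in `run_episode` has the same shape as the one used with a trained model later on: observe, choose an action, step, and stop when the episode is done.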


**Retro**  
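The retro environments are fun to use because you can train an agent on a classic console game directly from its ROM, as long as a Retro integration exists for that game (Airstriker-Genesis ships with the package). 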

In [3]:
import retro
env = retro.make(game='Airstriker-Genesis')
print(f"The observation space of the environment is {env.observation_space}")
print(f"The action space of the environment is {env.action_space}")

The observation space of the environment is Box(224, 320, 3)
The action space of the environment is MultiBinary(12)


**NOTE:** The observation space of this environment is an image (224 x 320 pixels with 3 color channels), so a convolutional neural network policy is usually the best fit. 
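
To get a feel for what a convolutional policy does with an image of this size, the sketch below traces the feature-map shape through three unpadded convolution layers. The layer sizes (8x8 with stride 4, 4x4 with stride 2, 3x3 with stride 1, ending in 64 channels) follow the well-known Atari DQN layout and are an assumption here, not something this notebook specifies; check your policy's actual architecture before relying on the numbers.

```python
def conv_output_size(size, kernel, stride):
    """Output length of an unpadded ("valid") convolution along one axis."""
    return (size - kernel) // stride + 1


# (kernel, stride) per layer; assumed values, not taken from this notebook.
ASSUMED_LAYERS = [(8, 4), (4, 2), (3, 1)]
ASSUMED_OUT_CHANNELS = 64


def feature_map_shape(height, width, layers=ASSUMED_LAYERS):
    """Trace an image's height and width through a stack of conv layers."""
    for kernel, stride in layers:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
    return height, width


h, w = feature_map_shape(224, 320)  # Retro observation: Box(224, 320, 3)
print(h, w, h * w * ASSUMED_OUT_CHANNELS)  # 24 36 55296
```

Under these assumptions the 224 x 320 frame shrinks to a 24 x 36 feature map, which is far smaller than the 215,040 raw pixel values a fully connected policy would have to consume directly.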

**Procgen**  
These environments are great because agents can be trained on them very quickly and they require very little setup overhead. 

In [4]:
import gym
param = {"num_levels": 1, "distribution_mode": "hard"}
env = gym.make("procgen:procgen-leaper-v0", **param)
print(f"The observation space of the environment is {env.observation_space}")
print(f"The action space of the environment is {env.action_space}")

The observation space of the environment is Box(64, 64, 3)
The action space of the environment is Discrete(15)


Again, note that the observation space is an image, so the policy should be adapted accordingly (for example, a convolutional policy instead of an MLP). 
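
One way to make that choice explicit is to pick the policy from the shape of the observation space. The helper below is a hypothetical sketch, not part of any library: it returns the policy names `"MlpPolicy"` and `"CnnPolicy"`, which stable-baselines accepts in place of a policy class, and it assumes that any three-dimensional observation is an image.

```python
def choose_policy_name(observation_shape):
    """Return a policy name based on the observation shape (hypothetical helper)."""
    # Assumption: (height, width, channels) means an image; anything else is a flat vector.
    if len(observation_shape) == 3:
        return "CnnPolicy"
    return "MlpPolicy"


# Shapes taken from the environments explored above.
for name, shape in [("CartPole-v0", (4,)),
                    ("Airstriker-Genesis", (224, 320, 3)),
                    ("procgen-leaper-v0", (64, 64, 3))]:
    print(name, choose_policy_name(shape))
```

With a real environment you would pass `env.observation_space.shape` instead of a hard-coded tuple.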

## Train PPO
Now it is simply a matter of selecting the environment and the corresponding algorithm and policy.

In [None]:
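import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Wrap a single CartPole instance in a vectorized environment.
env = DummyVecEnv([lambda: gym.make('CartPole-v0')])

# Train a PPO agent with a fully connected (MLP) policy.
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=50_000, log_interval=10)

# Watch the trained agent for a bounded number of steps.
# DummyVecEnv resets the environment automatically when an episode ends.
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
env.close()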

Note that this is a very simple example; training agents in more difficult environments typically takes tens of millions, and sometimes hundreds of millions, of steps. 
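
To put those step counts in perspective, the sketch below converts a timestep budget into a number of PPO updates. It assumes each update consumes `n_envs * n_steps` environment steps and uses `n_steps=128`, which is assumed here to match the `PPO2` default; `count_updates` is a hypothetical helper, so check both values against your installed version.

```python
def count_updates(total_timesteps, n_envs=1, n_steps=128):
    """Number of policy updates if each one consumes n_envs * n_steps steps."""
    steps_per_update = n_envs * n_steps
    return total_timesteps // steps_per_update


print(count_updates(50_000))                 # the CartPole run above: 390
print(count_updates(100_000_000))            # 100M steps, one environment: 781250
print(count_updates(100_000_000, n_envs=8))  # same budget, 8 parallel envs: 97656
```

Running several environments in parallel does not reduce the number of environment steps needed, but it collects them in larger batches, which is one common way to make long training runs practical.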