# Stable Baselines3 Tutorial - Creating a custom Gym environment

Github repo: https://github.com/araffin/rl-tutorial-jnrr19/tree/sb3/

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3-Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


## Introduction

In this notebook, you will learn how to use your own environment that follows the OpenAI Gym interface.
Once that is done, you can use any RL algorithm from Stable Baselines3 that is compatible with the environment's action space.

## Install Dependencies and Stable Baselines3 Using Pip



In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
# !pip install "stable-baselines3[extra]>=2.0.0a4"

## First steps with the gym interface

As you have noticed in the previous notebooks, an environment that follows the gym interface is quite simple to use.
It provides the user with three main methods, which have the following signatures (for gym versions >= 0.26):
- `reset()`: called at the beginning of an episode; it returns an observation and a dictionary with additional info (defaults to an empty dict)
- `step(action)`: called to take an action in the environment; it returns the next observation, the immediate reward, whether the new state is a terminal state (`terminated`: the episode is finished), whether the maximum number of timesteps has been reached (`truncated`: the episode is artificially finished), and additional information
- (Optional) `render()`: allows you to visualize the agent in action. Note that graphical interfaces do not work on Google Colab, so we cannot use it directly (we have to rely on `render_mode='rgb_array'` to retrieve an image of the scene).

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space`, which is also a gym space object and describes the type of action that can be taken

The best way to learn about [gym spaces](https://gymnasium.farama.org/api/spaces/) is to look at the [source code](https://github.com/Farama-Foundation/Gymnasium/tree/main/gymnasium/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.


[Documentation on custom env](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html)

Also keep in mind that Stable-Baselines3 internally uses the previous gym API (<0.26), so every VecEnv returns only the observation after resetting, and `step()` returns a 4-tuple instead of a 5-tuple (`terminated` and `truncated` are already combined into `done`).
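
The relation between the two step formats can be sketched in plain Python. The helper name `to_old_api` and the step results below are hypothetical and only illustrate how `terminated` and `truncated` collapse into a single `done` flag; a real VecEnv additionally returns batched arrays, one entry per sub-environment.

```python
def to_old_api(step_result):
    """Collapse a 5-tuple (new API) into a 4-tuple (old API)."""
    obs, reward, terminated, truncated, info = step_result
    # The old API has a single flag: the episode ended for any reason
    done = terminated or truncated
    return obs, reward, done, info


# Hypothetical step results, made up for illustration
goal_reached = ([0.0], 1.0, True, False, {})
time_limit_hit = ([3.0], 0.0, False, True, {})
still_running = ([2.0], 0.0, False, False, {})

for result in (goal_reached, time_limit_hit, still_running):
    obs, reward, done, info = to_old_api(result)
    print("obs=", obs, "reward=", reward, "done=", done)
```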

In [3]:
import gymnasium as gym
from reload.simplesim.env_gym import SimpleSimGym

STARTING_BUDGET = 2000
NUM_TARGETS = 1
PLAYER_FOV = 60

env = SimpleSimGym(starting_budget=STARTING_BUDGET, num_targets=NUM_TARGETS, player_fov=PLAYER_FOV, visualize=False)

# Box(3,) means that it is a vector with 3 components
print("Observation space:", env.observation_space)
print("Shape:", env.observation_space.shape)
# Discrete(5) means that there are five discrete actions
print("Action space:", env.action_space)

from gymnasium import spaces
spaces.utils.flatten(env.observation_space, env._get_obs())

# The reset method is called at the beginning of an episode
obs, info = env.reset()
# Sample a random action
action = env.action_space.sample()
print("Sampled action:", action)
obs, reward, terminated, truncated, info = env.step(action)
# Note the obs is a numpy array
# info is an empty dict for now but can contain any debugging info
# reward is a scalar
print(obs.shape, reward, terminated, truncated, info)

Observation space: Box([   0. -300. -300.], [359. 300. 300.], (3,), float32)
Shape: (3,)
Action space: Discrete(5)
Sampled action: 0
Reward: 0.00, Observation: [ 90. -79.  81.]

(3,) 0 False False {}


##  Gym env skeleton

In practice, this is what a gym environment looks like.
Here, we have implemented a simple grid world where the agent must learn to always go left.

In [4]:
# import numpy as np
# import gymnasium as gym
# from gymnasium import spaces


# class GoLeftEnv(gym.Env):
#     """
#     Custom Environment that follows gym interface.
#     This is a simple env where the agent must learn to go always left.
#     """

#     # Because of google colab, we cannot implement the GUI ('human' render mode)
#     metadata = {"render_modes": ["console"]}

#     # Define constants for clearer code
#     LEFT = 0
#     RIGHT = 1

#     def __init__(self, grid_size=10, render_mode="console"):
#         super(GoLeftEnv, self).__init__()
#         self.render_mode = render_mode

#         # Size of the 1D-grid
#         self.grid_size = grid_size
#         # Initialize the agent at the right of the grid
#         self.agent_pos = grid_size - 1

#         # Define action and observation space
#         # They must be gym.spaces objects
#         # Example when using discrete actions, we have two: left and right
#         n_actions = 2
#         self.action_space = spaces.Discrete(n_actions)
#         # The observation will be the coordinate of the agent
#         # this can be described both by Discrete and Box space
#         self.observation_space = spaces.Box(
#             low=0, high=self.grid_size, shape=(1,), dtype=np.float32
#         )

#     def reset(self, seed=None, options=None):
#         """
#         Important: the observation must be a numpy array
#         :return: (np.array)
#         """
#         super().reset(seed=seed, options=options)
#         # Initialize the agent at the right of the grid
#         self.agent_pos = self.grid_size - 1
#         # here we convert to float32 to make it more general (in case we want to use continuous actions)
#         return np.array([self.agent_pos]).astype(np.float32), {}  # empty info dict

#     def step(self, action):
#         if action == self.LEFT:
#             self.agent_pos -= 1
#         elif action == self.RIGHT:
#             self.agent_pos += 1
#         else:
#             raise ValueError(
#                 f"Received invalid action={action} which is not part of the action space"
#             )

#         # Account for the boundaries of the grid
#         self.agent_pos = np.clip(self.agent_pos, 0, self.grid_size)

#         # Are we at the left of the grid?
#         terminated = bool(self.agent_pos == 0)
#         truncated = False  # we do not limit the number of steps here

#         # Null reward everywhere except when reaching the goal (left of the grid)
#         reward = 1 if self.agent_pos == 0 else 0

#         # Optionally we can pass additional info, we are not using that for now
#         info = {}

#         return (
#             np.array([self.agent_pos]).astype(np.float32),
#             reward,
#             terminated,
#             truncated,
#             info,
#         )

#     def render(self):
#         # agent is represented as a cross, rest as a dot
#         if self.render_mode == "console":
#             print("." * self.agent_pos, end="")
#             print("x", end="")
#             print("." * (self.grid_size - self.agent_pos))

#     def close(self):
#         pass

### Validate the environment

Stable Baselines3 provides a [helper](https://stable-baselines3.readthedocs.io/en/master/common/env_checker.html) to check that your environment follows the Gym interface. It also optionally checks that the environment is compatible with Stable-Baselines3 (and emits warnings if necessary).

In [5]:
from stable_baselines3.common.env_checker import check_env

In [6]:
# env = GoLeftEnv()
# If the environment doesn't follow the interface, an error will be thrown
check_env(env, warn=True)

Reward: 0.00, Observation: [ 90. -80.  52.]

Reward: 0.00, Observation: [ 90. -65.  83.]

Reward: 0.00, Observation: [ 90. -85.  83.]

Reward: 0.00, Observation: [ 90. -65.  83.]

Reward: 0.00, Observation: [ 90. -45.  83.]

Reward: 0.00, Observation: [ 90. -45.  63.]

Reward: 0.00, Observation: [ 90. -65.  63.]

Reward: 0.00, Observation: [ 90. -85.  63.]

Reward: 0.00, Observation: [ 90. -65.  63.]

Reward: 0.00, Observation: [ 90. -65.  43.]

Reward: 0.00, Observation: [ 90. -45.  43.]



### Testing the environment

In [7]:
# env = GoLeftEnv(grid_size=10)

# obs, _ = env.reset()
# env.render()

# print(env.observation_space)
# print(env.action_space)
# print(env.action_space.sample())

# GO_LEFT = 0
# # Hardcoded best agent: always go left!
# n_steps = 20
# for step in range(n_steps):
#     print(f"Step {step + 1}")
#     obs, reward, terminated, truncated, info = env.step(GO_LEFT)
#     done = terminated or truncated
#     print("obs=", obs, "reward=", reward, "done=", done)
#     env.render()
#     if done:
#         print("Goal reached!", "reward=", reward)
#         break

### Try it with Stable-Baselines

Once your environment follows the gym interface, it is quite easy to plug in any algorithm from Stable-Baselines3.

In [8]:
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.env_util import make_vec_env

# Instantiate the env
# vec_env = make_vec_env(GoLeftEnv, n_envs=1, env_kwargs=dict(grid_size=10))
vec_env = make_vec_env(SimpleSimGym, n_envs=1, env_kwargs=dict(starting_budget=STARTING_BUDGET, num_targets=NUM_TARGETS, player_fov=PLAYER_FOV, visualize=False))

In [9]:
# Train the agent
eps_to_train = 500  # rough (over-)estimate of the number of episodes of experience we will train on
model = DQN("MlpPolicy", env, verbose=1).learn(200 * eps_to_train)  # = 100k timesteps (was 5k)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Reward: 0.00, Observation: [ 90. -71. -32.]

Reward: 0.00, Observation: [ 90. -51. -32.]

Reward: 0.00, Observation: [ 90. -31. -32.]

Reward: 0.00, Observation: [ 90. -31. -52.]

Reward: 0.00, Observation: [ 90. -31. -72.]

Reward: 0.00, Observation: [ 90. -31. -52.]

Reward: 0.00, Observation: [ 90. -31. -52.]

Reward: 0.00, Observation: [ 90. -31. -52.]

Reward: 0.00, Observation: [ 90. -31. -32.]

Reward: 100.00, Observation: [ 90. -31. -12.]

Reward: 0.00, Observation: [90. 75. 74.]

Reward: 0.00, Observation: [90. 75. 94.]

Reward: 0.00, Observation: [90. 75. 94.]

Reward: 0.00, Observation: [90. 75. 94.]

Reward: 0.00, Observation: [90. 75. 74.]

Reward: 0.00, Observation: [90. 75. 94.]

Reward: 0.00, Observation: [90. 55. 94.]

Reward: 0.00, Observation: [ 90.  55. 114.]

Reward: 0.00, Observation: [ 90.  35. 114.]

Reward: 0.00, Observation: [90. 35. 94.]

Reward: 0.00, Observation: 

KeyboardInterrupt: 


In [None]:
# Test the trained agent
# using the vecenv
obs = vec_env.reset()
n_steps = 1000
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=True)
    print(f"Step {step + 1}")
    print("Action: ", action)
    obs, reward, done, info = vec_env.step(action)
    print("obs=", obs, "reward=", reward, "done=", done)
    vec_env.render()
    if done:
        # Note that the VecEnv resets automatically
        # when a done signal is encountered
        print("Goal reached!", "reward=", reward)
        break
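
Note that `done` returned by a VecEnv is an array with one entry per sub-environment, so `if done:` in the loop above only works because `n_envs=1`. The sketch below, using made-up flags for two hypothetical sub-environments, shows why the same test fails with more than one environment and how `any()` or index lookups can be used instead.

```python
import numpy as np

# Hypothetical done flags, as a VecEnv with two sub-environments might return them
dones = np.array([False, True])

# Using a multi-element array directly in an `if` raises a ValueError
try:
    if dones:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Explicit alternatives: did any env finish, and which ones?
any_done = bool(dones.any())
finished_indices = np.flatnonzero(dones).tolist()

print("ambiguous truth value:", ambiguous)
print("any env done:", any_done)
print("finished env indices:", finished_indices)
```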