# Build a Gym Environment

This notebook is inspired by the Stable Baselines3 tutorial available at [https://github.com/araffin/rl-tutorial-jnrr19](https://github.com/araffin/rl-tutorial-jnrr19).


## Introduction

In this notebook, we will learn how to build a customized environment with **Gymnasium**.

### Links

Gymnasium GitHub: [https://github.com/Farama-Foundation/Gymnasium](https://github.com/Farama-Foundation/Gymnasium)

Gymnasium Documentation: [https://gymnasium.farama.org/index.html](https://gymnasium.farama.org/index.html#)

Stable Baselines3 GitHub: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Stable Baselines3 Documentation: [https://stable-baselines3.readthedocs.io/en/master/](https://stable-baselines3.readthedocs.io/en/master/)

## Install Gymnasium and Stable Baselines3 Using Pip

In [45]:
!pip install gymnasium
!pip install renderlab  # for rendering
!pip install "stable-baselines3[extra]"



In [46]:
import numpy as np

import gymnasium as gym
from gymnasium.spaces import Box
from gymnasium.envs.registration import register
import stable_baselines3
from stable_baselines3 import PPO, DDPG, SAC
from stable_baselines3.common.env_checker import check_env

print(gym.__version__)
print(stable_baselines3.__version__)

0.29.1
2.2.1


In [47]:
def evaluate(env, policy, gamma=1., num_episodes=100):
    """
    Evaluate an RL agent
    :param env: (Env object) the Gym environment
    :param policy: (BasePolicy object) the policy in stable_baselines3
    :param gamma: (float) the discount factor
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float, float) mean discounted return over num_episodes and
        its standard error
    """
    all_episode_rewards = []
    for i in range(num_episodes): # iterate over the episodes
        episode_rewards = []
        done = False
        discounter = 1.
        obs, _ = env.reset()
        while not done: # iterate over the steps until termination
            action, _ = policy.predict(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_rewards.append(reward * discounter) # compute discounted reward
            discounter *= gamma

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    # Standard error of the mean (np.std uses ddof=0, hence the n - 1)
    std_episode_reward = np.std(all_episode_rewards) / np.sqrt(num_episodes - 1)
    print("Mean reward:", mean_episode_reward,
          "Std reward:", std_episode_reward,
          "Num episodes:", num_episodes)

    return mean_episode_reward, std_episode_reward
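
The loop above accumulates one discounted return $\sum_t \gamma^t r_t$ per episode. It then reports the mean together with `np.std(...) / np.sqrt(num_episodes - 1)`, which is the standard error of the mean rather than a plain standard deviation. A minimal sketch of those two quantities, using only the standard library (the helper names `discounted_return` and `standard_error` are illustrative and not part of the notebook):

```python
import math
import statistics


def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over one episode."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


def standard_error(values):
    """Sample standard deviation (ddof=1) divided by sqrt(n).

    Algebraically the same as np.std(values) / sqrt(n - 1).
    """
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical episode: two short putts, then the ball overshoots the hole.
print(discounted_return([-1, -1, -100], gamma=0.5))  # -26.5
print(standard_error([1, 2, 3, 4]))
```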

## The Minigolf Environment

The `Minigolf` environment models a simple problem in which the agent has to hit a ball on a green with a putter so that it reaches the hole in as few moves as possible.

* The green is characterized by a **friction** $f$ that is sampled uniformly at random from the interval `[0.065, 0.196]` at the beginning of each episode and does not change during the episode.
* The **position** of the ball is represented by a one-dimensional variable $x_t$ that is initialized uniformly at random in the interval `[1,20]`. The observation is the pair $s_t = (x_t,f)$.
* The **action** $a_t$ is the force applied to the putter and is bounded in the interval `[1e-5,5]`. Before being applied, the action is perturbed by Gaussian noise, so that the actual action $u_t$ applied is given by:

$$
u_t = a_t + \epsilon \qquad \text{where} \qquad \epsilon \sim \mathcal{N}(0,\sigma^2),
$$

where $\sigma = 0.1$. The movement of the ball is governed by the kinematic law:

$$
x_{t+1} = x_{t} - v_t \tau_t + \frac{1}{2} d \tau_t^2
$$

where:
* $v_t$ is the velocity computed as $v_t = u_t l$,
* $d$ is the deceleration computed as $d = \frac{5}{7} fg$,
* $\tau_t$ is the time interval computed as $\tau_t = \frac{v_t}{d}$.
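
Substituting $\tau_t = v_t / d$ into the kinematic law shows that each putt moves the ball by $v_t^2 / (2d)$, i.e. $x_{t+1} = x_t - \frac{v_t^2}{2d}$. The sketch below checks this numerically. The function name `putt_distance` and the sample values are illustrative assumptions, while $l = 1$ and $g = 9.81$ are the constants used by the environment.

```python
G = 9.81             # gravitational acceleration g
PUTTER_LENGTH = 1.0  # putter length l


def putt_distance(u, friction):
    """Distance travelled by the ball for an applied (noisy) action u."""
    v = u * PUTTER_LENGTH                # initial speed
    d = 5.0 / 7.0 * friction * G         # deceleration
    tau = v / d                          # time until the ball stops
    return v * tau - 0.5 * d * tau ** 2  # equals v**2 / (2 * d)


u, f = 2.0, 0.1
d = 5.0 / 7.0 * f * G
print(putt_distance(u, f), u ** 2 / (2 * d))  # the two values agree
```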

The remaining constants are the putter length $l = 1$ and the gravitational acceleration $g=9.81$. The **episode** terminates when the ball enters the hole or surpasses it without entering. The **reward** is `-1` for every step in which the ball does not reach the hole and `-100` if the ball surpasses the hole (the reference solution assigns `0` to the step in which the ball enters the hole). To check whether the ball falls short of, enters, or surpasses the hole, use the following conditions:

\begin{align*}
&v_t < v_{\min} \implies \text{ball does not reach the hole} \\
&v_t > v_{\max} \implies \text{ball surpasses the hole} \\
&\text{otherwise} \implies \text{ball enters the hole}
\end{align*}

where

\begin{align*}
& v_{\min} = \sqrt{\frac{10}{7} fgx_t} \\
& v_{\max} = \sqrt{ \frac{g(2 h - \rho)^2}{2\rho} + v_{\min}^2},
\end{align*}

where $h = 0.1$ is the hole size and $\rho = 0.02135$ is the ball radius.
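
Note that $v_{\min}$ is exactly the speed for which the travelled distance $v_t^2/(2d)$ equals $x_t$, so the three cases can be decided from the speed alone. A minimal sketch of that check, using the constants stated above and the ball radius in the denominator of $v_{\max}$, as in the environment code (the names `speed_limits` and `classify_putt` and the returned labels are illustrative, not part of the notebook):

```python
import math

G = 9.81               # gravitational acceleration g
HOLE_SIZE = 0.10       # hole size h
BALL_RADIUS = 0.02135  # ball radius rho


def speed_limits(x, friction):
    """Return (v_min, v_max) for a ball at position x."""
    v_min = math.sqrt(10.0 / 7.0 * friction * G * x)
    v_max = math.sqrt(G * (2 * HOLE_SIZE - BALL_RADIUS) ** 2 / (2 * BALL_RADIUS)
                      + v_min ** 2)
    return v_min, v_max


def classify_putt(v, x, friction):
    """Label the outcome of a putt with initial speed v."""
    v_min, v_max = speed_limits(x, friction)
    if v < v_min:
        return "short"  # ball does not reach the hole
    if v > v_max:
        return "over"   # ball surpasses the hole
    return "in"         # ball enters the hole


for v in (1.0, 2.0, 4.0):
    print(v, classify_putt(v, x=2.0, friction=0.1))
```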


**References**

Penner, A. R. "The physics of putting." Canadian Journal of Physics 80.2 (2002): 83-96.

## Exercise 1

Complete the constructor `__init__`, methods `reset` and `step` based on the environment description provided above.

In [48]:
# The agent has to put the ball in the hole with the minimum number of moves.
# f: friction value (random at the beginning of the episode, then fixed).
# state = (position, friction); action = bounded force with Gaussian noise.
# The transition model is the kinematic law.

class Minigolf(gym.Env):
    """
    The Minigolf problem.

    """

    def __init__(self):
        super(Minigolf, self).__init__()

        # Constants
        self.min_pos, self.max_pos = 1.0, 20.0
        self.min_action, self.max_action = 1e-5, 5.0
        self.min_friction, self.max_friction = 0.065, 0.196
        self.putter_length = 1.0
        self.hole_size = 0.10
        self.sigma_noise = 0.1
        self.ball_radius = 0.02135


        # Instantiate the spaces
        low = np.array([self.min_pos, self.min_friction], dtype=np.float32)
        high = np.array([self.max_pos, self.max_friction], dtype=np.float32)

        self.action_space = Box(low=self.min_action,
                                high=self.max_action,
                                shape=(1,),
                                dtype=np.float32)

        self.observation_space = Box(low=low,
                                     high=high,
                                     shape=(2,),
                                     dtype=np.float32)


    def step(self, action):

        #Retrieve the state components
        x, friction = self.state

        # Clip the action within the allowed range
        action = np.clip(action, self.min_action, self.max_action)

        # TODO Add noise to the action
        u = action + np.random.normal(0, self.sigma_noise)
        # TODO Compute the speed
        v = self.putter_length * u
        v = np.array(v).ravel().item()  # convert to a Python scalar

        # Compute the speed limits
        v_min = np.sqrt(10 / 7 * friction * 9.81 * x)
        v_max = np.sqrt((2 * self.hole_size - self.ball_radius) ** 2 \
                        * (9.81 / (2 * self.ball_radius)) + v_min ** 2)

        # TODO Compute the deceleration
        d = 5 / 7 * 9.81 * friction
        # TODO Compute the time interval
        tau = v / d
        # TODO Update the position: x_{t+1} = x_t - v * tau + 1/2 * d * tau^2
        x = x - v * tau + 0.5 * d * tau ** 2
        # Clip the position
        x = np.clip(x, self.min_pos, self.max_pos)

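        # TODO Compute the reward and episode termination (done)
        reward = 0
        done = False
        if v < v_min:
            # Ball does not reach the hole: pay one move and keep playing
            reward = -1
        elif v > v_max:
            # Ball surpasses the hole: large penalty and termination
            reward = -100
            done = True
        else:
            # Ball enters the hole
            done = True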
        self.state = np.array([x, friction]).astype(np.float32)

        return self.state, reward, done, False, {}


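    def reset(self, seed=None, options=None):
        # Seed self.np_random, as required by the Gymnasium API
        super().reset(seed=seed)

        # TODO Random generation of initial position and friction
        x = self.np_random.uniform(self.min_pos, self.max_pos)
        friction = self.np_random.uniform(self.min_friction, self.max_friction)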
        self.state = np.array([x, friction]).astype(np.float32)

        return self.state, {}

### Solution
```python
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box

class Minigolf(gym.Env):
    """
    The Minigolf problem.

    """

    def __init__(self):
        super(Minigolf, self).__init__()

        # Constants
        self.min_pos, self.max_pos = 1.0, 20.0
        self.min_action, self.max_action = 1e-5, 5.0
        self.min_friction, self.max_friction = 0.065, 0.196
        self.putter_length = 1.0
        self.hole_size = 0.10
        self.sigma_noise = 0.1
        self.ball_radius = 0.02135


        # Instantiate the spaces
        low = np.array([self.min_pos, self.min_friction], dtype=np.float32)
        high = np.array([self.max_pos, self.max_friction], dtype=np.float32)

        self.action_space = Box(low=self.min_action,
                                high=self.max_action,
                                shape=(1,),
                                dtype=np.float32)

        self.observation_space = Box(low=low,
                                     high=high,
                                     shape=(2,),
                                     dtype=np.float32)


    def step(self, action):

        #Retrieve the state components
        x, friction = self.state

        # Clip the action within the allowed range
        action = np.clip(action, self.min_action, self.max_action)

        # Add noise to the action
        noisy_action = action + np.random.randn() * self.sigma_noise

        # Compute the speed
        v = noisy_action * self.putter_length
        v = np.array(v).ravel().item()

        # Compute the speed limits
        v_min = np.sqrt(10 / 7 * friction * 9.81 * x)
        v_max = np.sqrt((2 * self.hole_size - self.ball_radius) ** 2 \
                        * (9.81 / (2 * self.ball_radius)) + v_min ** 2)

        # Compute the deceleration
        deceleration = 5 / 7 * friction * 9.81

        # Compute the time interval
        t = v / deceleration

        # Update the state and clip
        x = x - v * t + 0.5 * deceleration * t ** 2
        x = np.clip(x, self.min_pos, self.max_pos)

        # Compute the reward and episode termination
        reward = 0.
        done = True

        if v < v_min:
            reward = -1.
            done = False
        elif v > v_max:
            reward = -100.

        self.state = np.array([x, friction]).astype(np.float32)

        return self.state, reward, done, False, {}


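    def reset(self, seed=None, options=None):
        # Seed self.np_random, as required by the Gymnasium API
        super().reset(seed=seed)

        # Random generation of initial position and friction
        x, friction = self.np_random.uniform(low=[self.min_pos, self.min_friction],
                                             high=[self.max_pos, self.max_friction])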

        self.state = np.array([x, friction]).astype(np.float32)

        return self.state, {}
```


To be able to instantiate the environment with `gym.make`, we first need to register it.

In [49]:
register(
    id="Minigolf-v1",
    entry_point="__main__:Minigolf",
    max_episode_steps=20,  # truncate every episode after 20 steps
    reward_threshold=0,
)

  logger.warn(f"Overriding environment {new_spec.id} already in registry.")



### Validate the environment

Stable Baselines3 provides a [helper](https://stable-baselines3.readthedocs.io/en/master/common/env_checker.html) to check that our environment complies with the Gym interface.

In [50]:
env = Minigolf()

# If the environment doesn't follow the interface, an error will be thrown
check_env(env, warn=True)




## Evaluate some simple Policies

* **Do-nothing policy**: a policy that plays the zero action.

$$
\pi(s) = 0
$$


* **Max-action policy**: a policy that plays the maximum available action.

$$
\pi(s) = +\infty
$$


* **Zero-mean Gaussian policy**: a policy that samples the action from a Gaussian distribution with zero mean and variance $\sigma^2=1$.

$$
\pi(a|s) = \mathcal{N}(0,\sigma^2)
$$
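
Because `step` clips every action to `[1e-5, 5]` before adding noise, the do-nothing and max-action policies effectively play the lower and upper bounds of the action range, and roughly half of the Gaussian samples (the negative ones) are mapped to `1e-5`. A small sketch of that clipping, using only the standard library (`clip_action` is a hypothetical helper mirroring `np.clip`, and the seed and sample size are arbitrary):

```python
import random

MIN_ACTION, MAX_ACTION = 1e-5, 5.0


def clip_action(action):
    """Clamp a scalar action to the range accepted by the environment."""
    return min(max(action, MIN_ACTION), MAX_ACTION)


print(clip_action(0.0))           # do-nothing policy -> 1e-05
print(clip_action(float("inf")))  # max-action policy -> 5.0

# Fraction of zero-mean Gaussian actions that end up at the lower bound.
rng = random.Random(0)
samples = [clip_action(rng.gauss(0.0, 1.0)) for _ in range(10_000)]
fraction_clipped = sum(s == MIN_ACTION for s in samples) / len(samples)
print(fraction_clipped)  # close to 0.5
```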

In [51]:
class DoNothingPolicy():

    def predict(self, obs):
        return 0, obs


class MaxActionPolicy():

    def predict(self, obs):
        return np.inf, obs


class ZeroMeanGaussianPolicy():

    def predict(self, obs):
        return np.random.randn(), obs

In [52]:
env = gym.make("Minigolf-v1")

do_nothing_policy = DoNothingPolicy()

max_action_policy = MaxActionPolicy()

gauss_policy = ZeroMeanGaussianPolicy()


do_nothing_mean, do_nothing_std = evaluate(env, do_nothing_policy)
max_action_mean, max_action_std = evaluate(env, max_action_policy)
gauss_policy_mean, gauss_policy_std = evaluate(env, gauss_policy)




Mean reward: -20.0 Std reward: 0.0 Num episodes: 100
Mean reward: -86.36 Std reward: 3.502250935635769 Num episodes: 100
Mean reward: -16.43 Std reward: 0.6080694737022118 Num episodes: 100
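
The do-nothing result of exactly `-20` can be sanity-checked by hand. The clipped action is `1e-5`, and the smallest possible $v_{\min}$ (at $x = 1$, $f = 0.065$) lies many noise standard deviations above it, so the ball essentially never reaches the hole and each episode collects `-1` for all 20 steps allowed by `max_episode_steps`. The arithmetic, as a sketch under those assumptions:

```python
import math

G = 9.81
MIN_POS, MIN_FRICTION = 1.0, 0.065  # smallest position and friction
MIN_ACTION, SIGMA = 1e-5, 0.1       # clipped action and noise std
MAX_EPISODE_STEPS = 20

# Smallest speed that can ever be enough to reach the hole.
lowest_v_min = math.sqrt(10.0 / 7.0 * MIN_FRICTION * G * MIN_POS)

# How many noise standard deviations separate the action from that speed.
z = (lowest_v_min - MIN_ACTION) / SIGMA

print(round(lowest_v_min, 3), round(z, 1))
print(-1 * MAX_EPISODE_STEPS)  # return of an episode that never reaches the hole
```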


## Train PPO, DDPG, and SAC

We now train three algorithms suitable for environments with continuous actions: [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html), [Deep Deterministic Policy Gradient](https://stable-baselines3.readthedocs.io/en/master/modules/ddpg.html), and [Soft Actor Critic](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html).

In [53]:
# Separate evaluation env
eval_env = gym.make('Minigolf-v1')

ppo = PPO("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))
ddpg = DDPG("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))
sac = SAC("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))

print('PPO')
ppo.learn(total_timesteps=50000, log_interval=4, progress_bar=True)

print('DDPG')
ddpg.learn(total_timesteps=50000, log_interval=1024, progress_bar=True)

print('SAC')
sac.learn(total_timesteps=50000, log_interval=2048, progress_bar=True)


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
PPO
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 13.1        |
|    ep_rew_mean          | -14.3       |
| time/                   |             |
|    fps                  | 386         |
|    iterations           | 4           |
|    time_elapsed         | 21          |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.013472047 |
|    clip_fraction        | 0.108       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.46       |
|    explained_variance   | 0.0387      |
|    learning_rate        | 0.0003      |
|    loss                 | 121        


DDPG
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.9      |
|    ep_rew_mean     | -1.9     |
| time/              |          |
|    episodes        | 1024     |
|    fps             | 153      |
|    time_elapsed    | 77       |
|    total_timesteps | 11920    |
| train/             |          |
|    actor_loss      | 17.1     |
|    critic_loss     | 3.89     |
|    learning_rate   | 0.001    |
|    n_updates       | 11822    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 4.39     |
|    ep_rew_mean     | -3.46    |
| time/              |          |
|    episodes        | 2048     |
|    fps             | 151      |
|    time_elapsed    | 135      |
|    total_timesteps | 20473    |
| train/             |          |
|    actor_loss      | 1.4      |
|    critic_loss     | 0.299    |
|    learning_rate   | 0.001    |
|    n_updates       | 20375    |
---------


SAC
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.9      |
|    ep_rew_mean     | -0.9     |
| time/              |          |
|    episodes        | 2048     |
|    fps             | 84       |
|    time_elapsed    | 200      |
|    total_timesteps | 17019    |
| train/             |          |
|    actor_loss      | 22.5     |
|    critic_loss     | 37.6     |
|    ent_coef        | 0.229    |
|    ent_coef_loss   | 0.677    |
|    learning_rate   | 0.0003   |
|    n_updates       | 16918    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 3.52     |
|    ep_rew_mean     | -2.52    |
| time/              |          |
|    episodes        | 4096     |
|    fps             | 84       |
|    time_elapsed    | 286      |
|    total_timesteps | 24282    |
| train/             |          |
|    actor_loss      | 3.86     |
|    critic_loss     | 2.25     |
|    ent_c

<stable_baselines3.sac.sac.SAC at 0x7be1a035a380>

Let us now evaluate the results of the training.

In [54]:
ppo_mean, ppo_std = evaluate(eval_env, ppo)
ddpg_mean, ddpg_std = evaluate(eval_env, ddpg)
sac_mean, sac_std = evaluate(eval_env, sac)

Mean reward: -2.23 Std reward: 1.0020035484523553 Num episodes: 100
Mean reward: -0.82 Std reward: 0.03861229196653691 Num episodes: 100
Mean reward: -1.01 Std reward: 0.06112580172368815 Num episodes: 100
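
Since the `Std reward` printed by `evaluate` is a standard error, a rough 95% confidence interval for each mean is `mean ± 1.96 * std`. This relies on a normal approximation, which is an assumption here, and the numbers below are rounded from the output above. A sketch that builds the intervals and checks whether two of them overlap (the helper names are illustrative):

```python
# (mean, standard error) rounded from the evaluation output above
results = {
    "PPO": (-2.23, 1.0020),
    "DDPG": (-0.82, 0.0386),
    "SAC": (-1.01, 0.0611),
}


def confidence_interval(mean, std_err, z=1.96):
    """Normal-approximation confidence interval around the mean."""
    return mean - z * std_err, mean + z * std_err


def overlap(a, b):
    """True if the closed intervals a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {name: confidence_interval(*r) for name, r in results.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
print(overlap(intervals["DDPG"], intervals["SAC"]))
```

Overlapping intervals, or results from a single training run per algorithm, make such a ranking only indicative.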
