## Activate GPU **(Colab only)**

When in Colab, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

In [1]:
#@title Notebook Setup

#@markdown In order to be able to run the code, we need to install the *eagerx_tutorials* package.

try:
    import eagerx_tutorials
except ImportError:
    !{"echo 'Installing eagerx-tutorials with pip.' && pip install eagerx-tutorials >> /tmp/eagerx_install.txt 2>&1"}
    
try:
    import huggingface_sb3
except ImportError:
    !{"echo 'Installing huggingface-sb3 with pip.' && pip install huggingface-sb3 >> /tmp/eagerx_huggingface.txt 2>&1"}
    !{"echo 'Installing pickle for loading policies.' && pip install --upgrade --quiet cloudpickle pickle5 >> /tmp/eagerx_pickle.txt 2>&1"}

# Setup interactive notebook
# Required in interactive notebooks only.
from eagerx_tutorials import helper
helper.setup_notebook()

# Import eagerx
import eagerx
eagerx.set_log_level(eagerx.FATAL)

Installing eagerx-tutorials with pip.
Installing huggingface-sb3 with pip.
Installing pickle for loading policies.
Running on CoLab.
Setting up virtual display for visualisation

Installing eagerx-gui dependencies

Installing eagerx-gui

Installing opencv-python-headless



## Getting Started with EAGERx

EAGERx: https://github.com/eager-dev/eagerx

Documentation: https://eagerx.readthedocs.io/en/master/


## Hands-on session

The goal of this tutorial is to train a policy for swinging up the famous Gym [Pendulum](https://www.gymlibrary.ml/environments/classic_control/pendulum/) and transfer this policy to a real pendulum system!

<img src="https://github.com/eager-dev/eagerx_tutorials/blob/master/tutorials/auth/figures/gym_pendulum.gif?raw=1" width="280" /> <img src="https://github.com/eager-dev/eagerx_tutorials/blob/master/tutorials/auth/figures/real_pendulum.gif?raw=1" width="280" />

**Left:** The classic OpenAI Gym pendulum.  **Right:** Real pendulum system.

The simulated and real environments have the following structure:

<img src="https://github.com/eager-dev/eagerx_tutorials/blob/master/tutorials/auth/figures/tutorial_1_gui.svg?raw=1" width=720 />

<!-- $\mathbf{x} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \\ \dot{\mathbf{x}} = \begin{bmatrix} \dot{\theta} \\ \frac{1}{J}(\frac{K}{R}u - mgl \sin{\theta} - b \dot{\theta} - \frac{K^2}{R}\dot{\theta})\end{bmatrix}$ -->

<!-- with $\theta$ the angle w.r.t. upright position, $\dot{\theta}$ the angular velocity, $u$ the input voltage, $J$ the inertia, $m$ the mass, $g$ the gravitational constant, $l$ the length of the pendulum, $b$ the motor viscous friction constant, $K$ the motor constant and $R$ the electric resistance. -->

In [2]:
#@title First, we download a pretrained policy from Hugging Face.

import sys
import stable_baselines3 as sb3
from huggingface_sb3 import load_from_hub

# Download pretrained policy from Hugging Face
newer_python_version = sys.version_info.major == 3 and sys.version_info.minor >= 8
custom_objects = {}
if newer_python_version:
    custom_objects = {
        "learning_rate": 0.0,
        "lr_schedule": lambda _: 0.0,
        "clip_range": lambda _: 0.0,
    }
checkpoint = load_from_hub(
    repo_id="sb3/ppo-Pendulum-v1",
    filename="ppo-Pendulum-v1.zip",
)

# Initialize model
pretrained_model = sb3.PPO.load(checkpoint, custom_objects=custom_objects, device="cpu")

Downloading:   0%|          | 0.00/139k [00:00<?, ?B/s]

In [3]:
#@title Then, we evaluate its performance on the environment it was trained on.

import gym

# Initialize pendulum environment
env = gym.make("Pendulum-v1")

# Evaluate policy and record video
helper.record_video(env=env, model=pretrained_model, prefix="pretrained")

# Show video
helper.show_video("pretrained-step-0-to-step-500")

Saving video to /content/videos/pretrained-step-0-to-step-500.mp4


In [4]:
#@title Next, we will test the performance on the real pendulum.

#@markdown We will use a fairly accurate model as a surrogate for the real system.

# Define rate (Hz)
rate = 20.0

# Initialize empty graph
graph = eagerx.Graph.create()

# Select sensors, actuators and states of Pendulum
sensors = ["theta", "theta_dot", "image"]
actuators = ["u"]
states = ["model_state", "mass", "length", "max_speed"]

# Make pendulum
from eagerx_tutorials.pendulum.objects import Pendulum
pendulum = Pendulum.make("pendulum", rate=rate, actuators=actuators, sensors=sensors, states=states, render_fn="disc_pendulum_render_fn")

# Decompose angle [cos(theta), sin(theta)]
from eagerx_tutorials.pendulum.processor import DecomposedAngle
pendulum.sensors.theta.processor = DecomposedAngle.make()
pendulum.sensors.theta.space.low = -1
pendulum.sensors.theta.space.high = 1
pendulum.sensors.theta.space.shape = [2]

# Add pendulum to the graph
graph.add(pendulum)

# Connect the pendulum to an action and observations
graph.connect(action="voltage", target=pendulum.actuators.u)
graph.connect(source=pendulum.sensors.theta, observation="angle")
graph.connect(source=pendulum.sensors.theta_dot, observation="angular_velocity")

# Render image
graph.render(source=pendulum.sensors.image, rate=rate)


# Define eagerx environment
from eagerx_ode.engine import OdeEngine
from eagerx.engines.openai_gym.engine import GymEngine
import eagerx_tutorials.pendulum.gym_implementation  # NOOP to register Gym implementation of the pendulum.
from typing import Dict, List
import numpy as np


class PendulumEnv(eagerx.BaseEnv):
    def __init__(self, name: str, rate: float, 
                 graph: eagerx.Graph, 
                 eval: bool = False,
                 mass_rng: List = [0.04, 0.04], length_rng: List = [0.04, 0.04]):
        """Initializes an environment with EAGERx dynamics.

        :param name: The name of the environment. Everything related to this environment
                     (parameters, topics, nodes, etc...) will be registered under namespace: "/[name]".
        :param rate: The rate (Hz) at which the environment will run.
        :param graph: The graph consisting of nodes and objects that describe the environment's dynamics.
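        :param eval: If True, we create an evaluation environment (OdeEngine) without domain randomization.
                     Otherwise, we create a training environment (GymEngine).
        :param mass_rng: The [min, max] range (kg) from which the pendulum mass is sampled during training.
        :param length_rng: The [min, max] range (m) from which the pendulum length is sampled during training.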
        """
        # Are we evaluating? Use Ode engine, else GymEngine.
        self.eval = eval
        engine = OdeEngine.make(rate=rate) if eval else GymEngine.make(rate=rate)
        
        # Make the backend specification
        from eagerx.backends.single_process import SingleProcess
        backend = SingleProcess.make()
        
        # Store ranges
        self.mass_rng = mass_rng
        self.length_rng = length_rng

        # Maximum episode length
        self.max_steps = 270 if eval else 100
        
        # Step counter
        self.steps = None
        super().__init__(name, rate, graph, engine, backend, force_start=True)
    
    def step(self, action: Dict):
        """A method that runs one timestep of the environment's dynamics.

        :param action: A dictionary of actions provided by the agent.
        :returns: A tuple (observation, reward, done, info).

            - observation: Dictionary of observations of the current timestep.

            - reward: amount of reward returned after previous action

            - done: whether the episode has ended, in which case further step() calls will return undefined results

            - info: contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
        """
        # Take step
        observation = self._step(action)
        self.steps += 1
        
        # Extract observations
        cos_th, sin_th = observation["angle"][0]
        thdot = observation["angular_velocity"][0]
        u = action["voltage"][0]

        # Calculate reward
        # We want to penalize the angle error, angular velocity and applied voltage
        th = np.arctan2(sin_th, cos_th)
        cost = th**2 + 0.1 * (thdot / (1 + 10 * abs(th))) ** 2 + 0.01 * u ** 2

        # Determine done flag
        done = self.steps > self.max_steps
        
        # Set info:
        info = {"TimeLimit.truncated": self.steps > self.max_steps}
        
        return observation, -cost, done, info
    
    def reset(self) -> Dict:
        """Resets the environment to an initial state and returns an initial observation.

        :returns: The initial observation.
        """
        # Determine reset states
        states = self.state_space.sample()
        
        if self.eval:
            theta = 3.14 * np.random.uniform(low=0.75, high=1.0) * [-1,1][np.random.randint(2)]
            states["pendulum/model_state"][:] = [theta, 0.0]
        else:
            # During training we want to vary the length and mass of the pendulum.
            # This will improve the robustness against model inaccuracies.
            # Randomly sample values for the mass and length of the pendulum.
            # Try to estimate the mass and length of the real pendulum system in Figure 1.
            # You can adjust the low and the high in the lines below to define the distributions for sampling.
            # Hint: the Gym pendulum is a rod, while the real pendulum is not.
            # They have different moments of inertia, therefore overestimating the length will help.

            # key = "[object_name]/[state_name]"
            # value should be of type np.ndarray
            
            # Sample mass (kg)
            min_mass, max_mass = self.mass_rng
            states["pendulum/mass"] = np.random.uniform(low=min_mass, high=max_mass, size=()).astype("float32")
            # Sample length (m)
            min_length, max_length = self.length_rng
            states["pendulum/length"] = np.random.uniform(low=min_length, high=max_length, size=()).astype("float32")
            
            # END OF YOUR CODE
            
        # Perform reset
        observation = self._reset(states)

        # Reset step counter
        self.steps = 0
        return observation

# Initialize environment
from eagerx.wrappers import Flatten
eval_env = PendulumEnv(name="eval", rate=rate, graph=graph, eval=True)

# Stable Baselines3 expects flattened actions & observations
eval_env = Flatten(eval_env)

# Check if the pretrained policy we downloaded at the beginning transfers to the simulated disc pendulum...
helper.evaluate(pretrained_model, eval_env, episode_length=270, video_rate=rate, video_prefix="pretrained_disc", n_eval_episodes=1)

Start evaluation episode 0 of 1


  0%|          | 0/270 [00:00<?, ?it/s]

100%|██████████| 270/270 [00:02<00:00, 104.83it/s]


Start video writer
Showing episode 0 with episodic reward: -2012.414734762492


Finished evaluation with mean episodic reward: -2012.414734762492
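
To help interpret the episodic reward above, here is a minimal standalone sketch of the per-step cost computed in `PendulumEnv.step`. The function name `swing_up_cost` and the two test states are illustrative assumptions; only the cost expression itself is taken from the environment code.

```python
import numpy as np


def swing_up_cost(cos_th: float, sin_th: float, thdot: float, u: float) -> float:
    """Per-step cost, mirroring the expression used in PendulumEnv.step."""
    # Recover the angle in [-pi, pi] from its decomposed form [cos(theta), sin(theta)].
    th = np.arctan2(sin_th, cos_th)
    # Penalize the angle error, the (scaled) angular velocity and the applied voltage.
    return float(th**2 + 0.1 * (thdot / (1 + 10 * abs(th))) ** 2 + 0.01 * u**2)


# Upright and at rest: nothing is penalized.
upright = swing_up_cost(cos_th=1.0, sin_th=0.0, thdot=0.0, u=0.0)
# Hanging straight down and at rest: only the angle term remains, i.e. pi squared.
hanging = swing_up_cost(cos_th=-1.0, sin_th=0.0, thdot=0.0, u=0.0)
print(f"upright: {upright:.2f}, hanging: {hanging:.2f}")
```

A pendulum that stays near the bottom accumulates a cost of roughly 9.87 per step, so an episodic reward on the order of -2000 over 270 steps is consistent with a policy that spends most of the episode far from upright.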


In [7]:
#@title The pretrained policy fails... why?

# This was to be expected, since the mass and length of the Gym pendulum are 1 kg and 1 m, respectively, which differ from those of the real pendulum.
# Therefore, we will again train a policy on the Gym pendulum, but now with different values for the mass and the length of the pendulum.
# There is only one problem: you don't know the exact mass and length of the real pendulum system.
# You can still train a successful policy however, by performing [domain randomization](https://sites.google.com/view/domainrandomization/).
# By varying over different values of $m$ and $l$, you can train a policy that is robust against model inaccuracies.
# In order to do this, you have to modify a few lines of code in the `reset` method of the `PendulumEnv` class.

# **NOTE: If you want to rerun code, we advise you to restart and run all code (in Colab there is the option Restart and run all under Runtime).**

#@markdown **Select randomization range for the mass:**
min_mass = 0.01  #@param {type:"slider", min:0.01, max:0.1, step:0.01}
max_mass = 0.01  #@param {type:"slider", min:0.01, max:0.1, step:0.01}
assert min_mass <= max_mass, "Minimum mass must not exceed the maximum mass."
mass_rng = [min_mass, max_mass]

#@markdown **Select randomization range for the length:**
min_length = 0.04  #@param {type:"slider", min:0.04, max:0.2, step:0.01}
max_length = 0.04  #@param {type:"slider", min:0.04, max:0.2, step:0.01}
assert min_length <= max_length, "Minimum length must not exceed the maximum length."
length_rng = [min_length, max_length]

# Initialize environment
from eagerx.wrappers import Flatten
train_env = PendulumEnv(name="train", rate=rate, graph=graph, eval=False, 
                        mass_rng=mass_rng, length_rng=length_rng)

# Stable Baselines3 expects flattened actions & observations
train_env = Flatten(train_env)

# Initialize learner
model = sb3.SAC("MlpPolicy", train_env, verbose=1, learning_rate=7e-4)

# Train for 4000 timesteps (about 40 episodes)
train_env.render("human")
model.learn(total_timesteps=int(4000))
train_env.close()

# Save model
model.save("pendulum")

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 101      |
|    ep_rew_mean     | -486     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 34       |
|    time_elapsed    | 11       |
|    total_timesteps | 404      |
| train/             |          |
|    actor_loss      | 9.14     |
|    critic_loss     | 11.7     |
|    ent_coef        | 0.809    |
|    ent_coef_loss   | -0.302   |
|    learning_rate   | 0.0007   |
|    n_updates       | 303      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 101      |
|    ep_rew_mean     | -441     |
| time/              |          |
|    episodes        | 8        |
|    fps             | 33       |
|    time_elapsed    | 24       |
|    total_timesteps | 808      |
| train/             |          |
|    actor_loss      | 14.6     |
|    critic_loss     | 23.7     |
|    ent_coef 
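
The domain randomization used during training can be sketched in isolation as follows. This is a minimal illustration, not the environment's implementation: the helper `sample_pendulum_params`, the seeded generator and the example ranges are assumptions chosen for the sketch, whereas the tutorial's `reset` method samples with `np.random.uniform` from the ranges selected with the sliders.

```python
import numpy as np


def sample_pendulum_params(mass_rng, length_rng, rng):
    """Draw one (mass, length) pair uniformly from the given [low, high] ranges."""
    min_mass, max_mass = mass_rng
    min_length, max_length = length_rng
    # Cast to float32, matching the dtype used for the states in the environment.
    mass = np.float32(rng.uniform(low=min_mass, high=max_mass))
    length = np.float32(rng.uniform(low=min_length, high=max_length))
    return mass, length


# Hypothetical ranges, for illustration only (not estimates of the real system).
mass_rng = [0.03, 0.07]    # kg
length_rng = [0.05, 0.15]  # m

# A seeded generator keeps the sketch reproducible.
rng = np.random.default_rng(seed=0)

# One draw per episode: every reset sees a slightly different pendulum.
samples = [sample_pendulum_params(mass_rng, length_rng, rng) for _ in range(5)]
for mass, length in samples:
    print(f"mass={mass:.3f} kg, length={length:.3f} m")
```

Setting the minimum equal to the maximum, as in the slider defaults, collapses each range to a single value, which disables randomization for that parameter.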

In [6]:
#@title Finally, we evaluate the policy trained with domain randomization. How does it perform?

# Create evaluation environment
eval_env = PendulumEnv(name="disc", rate=rate, graph=graph, eval=True)
eval_env = Flatten(eval_env)

helper.evaluate(model, eval_env, episode_length=270, video_rate=rate, video_prefix="trained_disc")

Start evaluation episode 0 of 3


  0%|          | 0/270 [00:00<?, ?it/s]

100%|██████████| 270/270 [00:02<00:00, 94.57it/s]


Start video writer
Showing episode 0 with episodic reward: -102.37445737320333


Start evaluation episode 1 of 3


100%|██████████| 270/270 [00:02<00:00, 90.96it/s]


Start video writer
Showing episode 1 with episodic reward: -136.1938315532574


Start evaluation episode 2 of 3


100%|██████████| 270/270 [00:02<00:00, 91.63it/s]


Start video writer
Showing episode 2 with episodic reward: -96.14934104164611


Finished evaluation with mean episodic reward: -111.57254332270229


## Send us your successful policy and we will test it on the real system!

- A successful policy should result in a mean episodic reward of at least -200.
- Click on `files` in the left sidebar.
- Download the `pendulum.zip` file (created by `model.save("pendulum")`).
- Send it to `eagerx.dev@gmail.com`.
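
As a quick check against the success criterion above, the following sketch averages the three episodic rewards printed by the evaluation cell and compares the result with the -200 threshold. The variable names are illustrative; the reward values are copied from the output shown earlier.

```python
# Episodic rewards reported by the evaluation run above.
episodic_rewards = [-102.37445737320333, -136.1938315532574, -96.14934104164611]

# Threshold stated in the success criterion.
SUCCESS_THRESHOLD = -200.0

mean_reward = sum(episodic_rewards) / len(episodic_rewards)
is_successful = mean_reward >= SUCCESS_THRESHOLD
print(f"mean episodic reward: {mean_reward:.2f}, successful: {is_successful}")
```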