## Notebook Setup

To run the code, we need to install the *eagerx_tutorials* package and ROS.

## Activate GPU **(Colab only)**

When in Colab, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

In [1]:
try:
    import eagerx_tutorials
except ImportError:
    !{"echo 'Installing eagerx-tutorials with pip.' && pip install eagerx-tutorials >> /tmp/eagerx_install.txt 2>&1"}
try:
    import huggingface_sb3
except ImportError:
    !{"echo 'Installing huggingface-sb3 with pip.' && pip install huggingface-sb3 >> /tmp/eagerx_huggingface.txt 2>&1"}
    !{"echo 'Installing pickle for loading policies.' && pip install --upgrade --quiet cloudpickle pickle5 >> /tmp/eagerx_pickle.txt 2>&1"}
if 'google.colab' in str(get_ipython()):
    !{"curl 'https://raw.githubusercontent.com/eager-dev/eagerx_tutorials/master/scripts/setup_colab.sh' > ~/setup_colab.sh"}
    !{"bash ~/setup_colab.sh"}
    # Set up fake display; otherwise rendering will fail
    import os
    !{"echo 'Setting up virtual display for visualisation' && apt-get install ffmpeg freeglut3-dev xvfb >> /tmp/eagerx_xvfb.txt 2>&1"}
    os.system("Xvfb :1 -screen 0 1024x768x24 &")
    os.environ["DISPLAY"] = ":1"

# Setup interactive notebook
# Required in interactive notebooks only.
from eagerx_tutorials import helper
helper.setup_notebook()

Not running on CoLab.
Execute ROS commands as "!...".
ROS noetic available.


# Getting Started with EAGERx

EAGERx: https://github.com/eager-dev/eagerx

Documentation: https://eagerx.readthedocs.io/en/master/


## Introduction

The goal of this tutorial is to train a policy for swinging up the famous Gym [Pendulum](https://www.gymlibrary.ml/environments/classic_control/pendulum/) and transfer this policy to a real pendulum system!
This will not work out of the box, since the mass and length of the Gym pendulum and the real pendulum are different.
To obtain a policy that can swing up the real pendulum system, we will perform [domain randomization](https://sites.google.com/view/domainrandomization/) by varying the mass and the length of the Gym pendulum.
In this way, we will be able to train a successful policy without knowing the exact mass and length of the real pendulum system.


<img src="../figures/gym_pendulum.gif" width="480" /> <img src="../figures/real_pendulum.gif" width="480" />

**Figure 1:** On the left we see the Gym pendulum and on the right the real pendulum system.
The real pendulum system consists of a mass attached to a rotating disc, whose dynamics are similar to those of a pendulum with a mass attached to the tip.


This tutorial covers:
- Constructing a [Graph](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html) and [Environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html) with [EAGERx](https://eagerx.readthedocs.io/en/master/).
- Switching between different [Engines](https://eagerx.readthedocs.io/en/master/guide/api_reference/engine/index.html)
- Performing domain randomization

In the remainder of this tutorial, we will go into more detail on these concepts.

Furthermore, you will be asked to add/modify a couple of lines of code, which are marked by

```python

# YOUR CODE HERE

# END OF YOUR CODE
```

## Pendulum Swing-up

We will create an environment for solving the classic control problem of swinging up an underactuated pendulum, i.e. the [Pendulum-v1 environment](https://www.gymlibrary.ml/environments/classic_control/pendulum/).
Our goal is to transfer the policy to the real system and swing up this pendulum to the upright position and keep it there, while minimizing the velocity of the pendulum and the input voltage.
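
The objective described above (upright, slow, low voltage) can be expressed as a cost on the angle, the angular velocity and the voltage. The sketch below mirrors the cost computed in the `step` method of the `PendulumEnv` class later in this tutorial; the function name `swing_up_cost` is only an illustrative choice and is not part of EAGERx.

```python
import numpy as np


def swing_up_cost(cos_th: float, sin_th: float, thdot: float, u: float) -> float:
    """Cost penalizing angle error, angular velocity and applied voltage."""
    # Recover the angle in (-pi, pi] from its [cos, sin] decomposition.
    th = np.arctan2(sin_th, cos_th)
    # The velocity penalty is scaled down when the pendulum is far from upright.
    return float(th**2 + 0.1 * (thdot / (1 + 10 * abs(th))) ** 2 + 0.01 * u**2)


# Upright and at rest: no cost.
print(swing_up_cost(1.0, 0.0, 0.0, 0.0))
# Hanging down and at rest: the cost equals pi squared.
print(swing_up_cost(-1.0, 0.0, 0.0, 0.0))
```

The reward returned to the agent is the negative of this cost, so a perfect episode approaches a reward of zero.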

Since the dynamics of a pendulum actuated by a DC motor are well known, we can simulate the pendulum by integrating the corresponding ordinary differential equations (ODEs):


$\mathbf{x} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \\ \dot{\mathbf{x}} = \begin{bmatrix} \dot{\theta} \\ \frac{1}{J}(\frac{K}{R}u + mgl \sin{\theta} - b \dot{\theta} - \frac{K^2}{R}\dot{\theta})\end{bmatrix}$

with $\theta$ the angle w.r.t. upright position, $\dot{\theta}$ the angular velocity, $u$ the input voltage, $J$ the inertia, $m$ the mass, $g$ the gravitational constant, $l$ the length of the pendulum, $b$ the motor viscous friction constant, $K$ the motor constant and $R$ the electric resistance.
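
To make the idea of integrating these ODEs concrete, here is a minimal sketch that uses a classic fourth-order Runge-Kutta step. It is not the integrator used by EAGERx, and the default parameter values are placeholders chosen for illustration, not the identified parameters of the real system. Because $\theta$ is measured from the upright position, gravity accelerates the pendulum away from $\theta = 0$, which the sketch encodes as $+ mgl \sin{\theta}$.

```python
import numpy as np

G = 9.81  # gravitational constant (m/s^2)


def pendulum_ode(x, u, J=2e-4, m=0.05, l=0.04, b=1e-4, K=0.05, R=10.0):
    """Time derivative of x = [theta, dtheta]; default parameters are placeholders."""
    theta, dtheta = x
    ddtheta = (K / R * u + m * G * l * np.sin(theta) - b * dtheta - K**2 / R * dtheta) / J
    return np.array([dtheta, ddtheta])


def rk4_step(x, u, dt):
    """Advance the state by dt with one Runge-Kutta 4 step, holding u constant."""
    k1 = pendulum_ode(x, u)
    k2 = pendulum_ode(x + 0.5 * dt * k1, u)
    k3 = pendulum_ode(x + 0.5 * dt * k2, u)
    k4 = pendulum_ode(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)


def simulate(x0, u, dt=0.05, steps=200):
    """Return the trajectory (steps + 1 rows) for a constant input voltage u."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        traj.append(rk4_step(traj[-1], u, dt))
    return np.stack(traj)


# Release the pendulum slightly away from hanging down, without any voltage.
trajectory = simulate(x0=[np.pi - 0.1, 0.0], u=0.0)
print(trajectory.shape)  # (201, 2)
```

With zero voltage, the friction terms damp the oscillation and the pendulum settles at the hanging position $\theta = \pi$.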

## Let's get started

First we will import EAGERx and initialize it.
EAGERx uses ROS for communication; during initialization, a ROS master is started if one is not already running.

In [2]:
import eagerx
# Initialize eagerx (starts roscore if not already started.)
eagerx.initialize("eagerx_core")

... logging to /home/jelle/.ros/log/f1d94976-dc4f-11ec-a29a-31e71ff4a2a0/roslaunch-jelle-Alienware-m15-R4-40090.log
started roslaunch server http://jelle-Alienware-m15-R4:44899/
ros_comm version 1.15.14


SUMMARY

PARAMETERS
 * /rosdistro: noetic
 * /rosversion: 1.15.14

NODES

auto-starting new master
process[master]: started with pid [40148]
ROS_MASTER_URI=http://localhost:11311
setting /run_id to f1d94976-dc4f-11ec-a29a-31e71ff4a2a0
process[rosout-1]: started with pid [40173]
started core service [/rosout]


<roslaunch.parent.ROSLaunchParent at 0x7f1f805b3520>

Next, we will download a pretrained policy in order to see what a successful policy looks like.

In [3]:
import sys
import stable_baselines3 as sb3
from huggingface_sb3 import load_from_hub

# Download pretrained policy from hugging face
newer_python_version = sys.version_info.major == 3 and sys.version_info.minor >= 8
custom_objects = {}
if newer_python_version:
    custom_objects = {
        "learning_rate": 0.0,
        "lr_schedule": lambda _: 0.0,
        "clip_range": lambda _: 0.0,
    }
checkpoint = load_from_hub(
    repo_id="sb3/ppo-Pendulum-v1",
    filename="ppo-Pendulum-v1.zip",
)

# Initialize model
pretrained_model = sb3.PPO.load(checkpoint, custom_objects=custom_objects, device="cpu")

We will create a standard *Pendulum-v1* Gym environment in order to evaluate the policy.

In [4]:
import gym

# Initialize pendulum environment
env = gym.make("Pendulum-v1")

# Evaluate policy and record video
helper.record_video(env=env, model=pretrained_model, prefix="pretrained")

# Show video
helper.show_video("pretrained-step-0-to-step-500")

Saving video to /home/jelle/eagerx_dev/eagerx_tutorials/tutorials/icra/solutions/videos/pretrained-step-0-to-step-500.mp4


We see that the pretrained policy is able to swing up the pendulum and stabilize it upright.
But will this policy also work on the real system?
Using EAGERx we can create environments that are engine agnostic, i.e. Gym environments that can be used with different [Engines](https://eagerx.readthedocs.io/en/master/guide/api_reference/engine/index.html?highlight=engine).
An engine could be a simulator, like PyBullet, but it could also be the real world.
Before creating the environment, we will create an [Object](https://eagerx.readthedocs.io/en/master/guide/api_reference/object/index.html?highlight=Object).
In EAGERx, an `Object` is an entity that has inputs (actuators), outputs (sensors) and states (that can be reset at the beginning of an episode).

We are going to create one object (the pendulum).
For this first tutorial, we will not go into much detail and will start from an existing object.
Note that we import the pendulum.
While this might look like an unused import, it is not.
During the import, the pendulum object is registered and we can therefore make it based on its ID, i.e. *Pendulum*.
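
The reason an apparently unused import matters can be illustrated with a small registry pattern. The sketch below is a generic illustration in plain Python; the names `_REGISTRY`, `register` and `make` are hypothetical and do not reflect how EAGERx implements registration internally.

```python
from typing import Callable, Dict, Type

# Maps an ID string to the class that was registered under it.
_REGISTRY: Dict[str, Type] = {}


def register(entity_id: str) -> Callable[[Type], Type]:
    """Class decorator that stores the class under the given ID."""
    def decorator(cls: Type) -> Type:
        _REGISTRY[entity_id] = cls
        return cls
    return decorator


def make(entity_id: str, name: str):
    """Look up a registered class by ID and instantiate it."""
    if entity_id not in _REGISTRY:
        raise KeyError(f"'{entity_id}' is not registered; was its module imported?")
    return _REGISTRY[entity_id](name)


# In a real package this class would live in its own module, so that
# importing that module runs the decorator below as a side effect.
@register("Pendulum")
class Pendulum:
    def __init__(self, name: str):
        self.name = name


pendulum = make("Pendulum", "pendulum")
print(type(pendulum).__name__, pendulum.name)  # Pendulum pendulum
```

Without the import, the decorator never runs, and a lookup by ID fails even though the class exists on disk.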

Before making the object, we will first obtain some information on the *Pendulum*, so that we know which arguments to pass when making it.


In [5]:
import eagerx_tutorials.pendulum  # Registers Pendulum
import eagerx_tutorials.pendulum.gym_implementation  # Registers Gym implementation of the pendulum

eagerx.Object.info("Pendulum")

Registered entity_id=`Pendulum`:
   entity_type: `Object`
   module: `eagerx_tutorials.pendulum.objects`
   file: `/home/jelle/eagerx_dev/eagerx_tutorials/eagerx_tutorials/pendulum/objects.py`

Supported engines:
 - OdeEngine
 - GymEngine

Make this spec with (use `entity_id: str = "Pendulum"`):
   spec = Object.make(entity_id: str, name: str, actuators: List[str] = None, sensors: List[str] = None, states: List[str] = None, rate: float = 30.0, render_shape: List[int] = None, render_fn: str = None)

class Pendulum(Object):
   spec(spec: eagerx.core.specs.ObjectSpec, name: str, actuators: List[str] = None, sensors: List[str] = None, states: List[str] = None, rate: float = 30.0, render_shape: List[int] = None, render_fn: str = None):
      docs:
         Object spec of Pendulum

   agnostic(spec: eagerx.core.specs.ObjectSpec, rate: float):
      config:
       - render_shape: [480, 480]
       - render_fn: pendulum_render_fn
      sensors:
       - theta: <class 'std_msgs.msg._Float32.Flo

We see that `eagerx.Object.info("Pendulum")` provides information on the *Pendulum* object.
It has four sensors (*theta*, *dtheta*, *image*, *u_applied*), one actuator (*u*) and a number of states.
Here *theta*, *dtheta* and *u* correspond to $\theta$, $\dot{\theta}$ and $u$, respectively.
We can make the *Pendulum* object with the `eagerx.Object.make` method with the required arguments *entity_id* and (a unique) *name* and add it to a [Graph](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html).

The graph describes the interconnection of nodes and objects.
In this way, the creation of an environment becomes modular.
This allows users to create an implementation for nodes and objects once, and easily create new environments by reusing these implementations.
It also makes it possible to construct complex environments using nodes and objects as basic building blocks.

After adding the pendulum to the graph, we will connect the actuator *u* to a new action called *voltage*.
We will connect the sensors *theta* and *dtheta* to the observations *angle* and *angular_velocity*, respectively.
In this way, the agent will be able to send actions to control $u$ of the pendulum and observe $\theta$ and $\dot{\theta}$.

Finally, we also render the *image* sensor in order to visualize the pendulum.

In [6]:
# Define rate (Hz)
rate = 20.0

# Initialize empty graph
graph = eagerx.Graph.create()


# Select sensors, actuators and states of Pendulum
sensors = ["theta", "dtheta", "image"]
actuators = ["u"]
states = ["model_state", "mass", "length", "max_speed"]

# Make pendulum
pendulum = eagerx.Object.make("Pendulum", "pendulum", rate=rate, actuators=actuators, sensors=sensors, states=states, render_fn="disc_pendulum_render_fn")

# Decompose angle [cos(theta), sin(theta)]
pendulum.sensors.theta.space_converter = eagerx.SpaceConverter.make("Space_DecomposedAngle", low=[-1, -1], high=[1, 1])

# Add pendulum to the graph
graph.add(pendulum)

# Connect the pendulum to an action and observations
graph.connect(action="voltage", target=pendulum.actuators.u)
graph.connect(source=pendulum.sensors.theta, observation="angle")
graph.connect(source=pendulum.sensors.dtheta, observation="angular_velocity")

# Render image
graph.render(source=pendulum.sensors.image, rate=rate)

It is also possible to inspect the graph using the eagerx-gui package.

It can be installed as follows:
```bash
pip3 install eagerx-gui
```
Colab has limited support for interactive applications, so we cannot open the GUI here.
But if we were to run
```python
graph.gui()
```
the output would be as follows:

<img src="../figures/tutorial_1_gui.svg" width=720>

Here we see that the actions of the agent are outputs of *env/actions* and that the observations of the agent are inputs of *env/observations*.
Also, we render output by connecting to *env/render*.
Note that *env/actions*, *env/observations* and *env/render* represent connections of the `Graph` to the environment.
They are split up in the GUI as nodes for visualization purposes.

Next, we will create the [Environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html).
Environment creation in EAGERx follows the same API as Gym, i.e. we have to define a [step()](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.BaseEnv.step) and [reset()](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.BaseEnv.reset) method.

In [7]:
from typing import Dict
import numpy as np


class PendulumEnv(eagerx.BaseEnv):
    def __init__(self, name: str, rate: float, graph: eagerx.Graph, engine: eagerx.Engine, eval: bool):
        """Initializes an environment with EAGERx dynamics.

        :param name: The name of the environment. Everything related to this environment
                     (parameters, topics, nodes, etc...) will be registered under namespace: "/[name]".
        :param rate: The rate (Hz) at which the environment will run.
        :param graph: The graph consisting of nodes and objects that describe the environment's dynamics.
        :param engine: The physics engine that will govern the environment's dynamics.
        :param eval: If True we will create an evaluation environment, i.e. not performing domain randomization.
        """
        self.eval = eval
        
        # Maximum episode length
        self.episode_length = 270 if eval else 100
        
        # Step counter
        self.steps = None
        super().__init__(name, rate, graph, engine, force_start=True)
    
    def step(self, action: Dict):
        """A method that runs one timestep of the environment's dynamics.

        :param action: A dictionary of actions provided by the agent.
        :returns: A tuple (observation, reward, done, info).

            - observation: Dictionary of observations of the current timestep.

            - reward: amount of reward returned after previous action

            - done: whether the episode has ended, in which case further step() calls will return undefined results

            - info: contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
        """
        # Take step
        obs = self._step(action)
        self.steps += 1
        
        # Extract observations
        cos_th, sin_th = obs["angle"][0]
        thdot = obs["angular_velocity"][0]
        u = action["voltage"][0]

        # Calculate reward
        # We want to penalize the angle error, angular velocity and applied voltage
        th = np.arctan2(sin_th, cos_th)
        cost = th**2 + 0.1 * (thdot / (1 + 10 * abs(th))) ** 2 + 0.01 * u ** 2

        # Determine done flag
        done = self.steps > self.episode_length
        
        # Set info:
        info = {"TimeLimit.truncated": done}
        
        return obs, -cost, done, info
    
    def reset(self) -> Dict:
        """Resets the environment to an initial state and returns an initial observation.

        :returns: The initial observation.
        """
        # Determine reset states
        states = self.state_space.sample()
        
        if self.eval:
            states["pendulum/model_state"] = np.array([np.pi * np.random.uniform(low=0.75, high=1.25), 0])
        else:
            # YOUR CODE HERE
            # TODO: 
            # During training we want to vary the length and mass of the pendulum.
            # This will improve the robustness against model inaccuracies.
            # Randomly sample values for the mass and length of the pendulum.
            # Try to estimate the mass and length of the real pendulum system in Figure 1.
            # You can adjust the low and the high in the lines below to define the distributions for sampling.
            # Hint: the Gym pendulum is a rod, while the real pendulum is not.
            # They have different moments of inertia, therefore overestimating the length will help.

            # key = "[object_name]/[state_name]"
            # value should be of type np.ndarray
            
            # Sample mass (kg)
            states["pendulum/mass"] = np.random.uniform(low=0.03, high=0.05, size=(1))
            # Sample length (m)
            states["pendulum/length"] = np.random.uniform(low=0.10, high=0.14, size=(1)) 
            
            # END OF YOUR CODE
            
        # Perform reset
        obs = self._reset(states)

        # Reset step counter
        self.steps = 0
        return obs
        

Next, we will create the [Engines](https://eagerx.readthedocs.io/en/master/guide/api_reference/engine/index.html) corresponding to the simulators we will use.
Here we will make use of the *GymEngine* to simulate the Gym pendulum and the *OdeEngine* to simulate the disc pendulum based on the ODE at the top of the page with identified parameters from the real system.
The *GymEngine* will be used for training and here we perform the domain randomization.
The *OdeEngine* will be used for evaluation in order to validate whether the resulting policy will work on the real system.
Switching between simulators or to the real world is only a matter of switching the engine in EAGERx.
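
The idea that only the engine changes while the rest of the environment stays the same can be sketched in plain Python. The classes below are hypothetical stand-ins, not EAGERx engines: `RodEngine` is modeled after the rod dynamics of Gym's pendulum (simplified, without torque or speed clipping), and `PointMassEngine` assumes an idealized point mass with placeholder parameters.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple

G = 9.81  # gravitational constant (m/s^2)
State = Tuple[float, float]  # (theta, dtheta), theta measured from upright


class Engine(Protocol):
    """Anything with a matching step() method can act as an engine."""
    def step(self, state: State, u: float, dt: float) -> State: ...


@dataclass
class RodEngine:
    """Uniform rod pivoting at one end (hypothetical, Gym-like)."""
    m: float = 1.0
    l: float = 1.0

    def step(self, state: State, u: float, dt: float) -> State:
        th, thdot = state
        thdot = thdot + (3 * G / (2 * self.l) * math.sin(th) + 3.0 / (self.m * self.l**2) * u) * dt
        return th + thdot * dt, thdot


@dataclass
class PointMassEngine:
    """Point mass on a massless arm (hypothetical placeholder parameters)."""
    m: float = 0.05
    l: float = 0.05

    def step(self, state: State, u: float, dt: float) -> State:
        th, thdot = state
        thdot = thdot + (G / self.l * math.sin(th) + u / (self.m * self.l**2)) * dt
        return th + thdot * dt, thdot


def rollout(engine: Engine, policy: Callable[[State], float], state: State,
            dt: float = 0.05, steps: int = 20) -> List[State]:
    """Run the same loop regardless of which engine is plugged in."""
    states = [state]
    for _ in range(steps):
        state = engine.step(state, policy(state), dt)
        states.append(state)
    return states


def zero_policy(state: State) -> float:
    return 0.0


start = (math.pi - 0.2, 0.0)
for engine in (RodEngine(), PointMassEngine()):
    print(type(engine).__name__, rollout(engine, zero_policy, start)[-1])
```

The `rollout` function never inspects the engine type, which is the property that makes the same policy and environment code reusable across simulators.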

In [8]:
# Initialize engines
gym_engine = eagerx.Engine.make("GymEngine", rate=rate)
ode_engine = eagerx.Engine.make("OdeEngine", rate=rate)

Now we are ready to make the environments! 
We will create one with the `gym_engine` for training and one with the `ode_engine` for evaluation.
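
The next cell also wraps both environments in `Flatten`, because Stable Baselines3 expects flat action and observation spaces rather than dictionaries. As a rough illustration of what flattening a dictionary observation means, consider the sketch below. The helper `flatten_observation` and its sorted-key ordering are assumptions made for this example and are not necessarily how `eagerx.wrappers.Flatten` orders its entries.

```python
from typing import Dict

import numpy as np


def flatten_observation(obs: Dict[str, np.ndarray]) -> np.ndarray:
    """Concatenate all entries into one 1-D float32 vector, in sorted key order."""
    return np.concatenate([np.asarray(obs[key], dtype=np.float32).ravel() for key in sorted(obs)])


obs = {
    # [cos(theta), sin(theta)] with a leading window dimension of size 1
    "angle": np.array([[1.0, 0.0]]),
    "angular_velocity": np.array([[0.0]]),
}
flat = flatten_observation(obs)
print(flat.shape)  # (3,)
```

A fixed key order matters: the policy network sees only positions in the vector, so the ordering must be identical during training and evaluation.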

In [9]:
from eagerx.wrappers import Flatten


# Initialize environments
train_env = PendulumEnv(name="train", rate=rate, graph=graph, engine=gym_engine, eval=False)
eval_env = PendulumEnv(name="eval", rate=rate, graph=graph, engine=ode_engine, eval=True)


# Stable Baselines3 expects flattened actions & observations
# Convert observation and action space from Dict() to Box()
train_env = Flatten(train_env)
eval_env = Flatten(eval_env)

[INFO] [1653499649.729925]: Node "/train/env/supervisor" initialized.
[INFO] [1653499649.862089]: Waiting for nodes "['engine']" to be initialized.
[INFO] [1653499651.398980]: Node "/train/environment" initialized.
[INFO] [1653499651.517824]: Node "/train/pendulum/theta" initialized.
[INFO] [1653499651.584983]: Node "/train/pendulum/dtheta" initialized.
[INFO] [1653499651.656253]: Node "/eval/env/supervisor" initialized.
[INFO] [1653499651.803073]: Node "/eval/engine" initialized.
[INFO] [1653499651.929835]: Node "/eval/environment" initialized.
[INFO] [1653499652.112594]: Node "/eval/pendulum/theta" initialized.
[INFO] [1653499652.139925]: Node "/eval/pendulum/dtheta" initialized.


Let's first check if the pretrained policy we downloaded at the beginning transfers to the simulated disc pendulum...

In [10]:
helper.evaluate(pretrained_model, eval_env, episode_length=270, video_rate=rate, video_prefix="pretrained_disc")

Start evaluation episode 0 of 3
[INFO] [1653499652.215022]: Adding object "pendulum" of type "Pendulum" to the simulator.
[INFO] [1653499652.231736]: Node "/eval/pendulum/x" initialized.
[INFO] [1653499652.249632]: Node "/eval/pendulum/image" initialized.
[INFO] [1653499652.263680]: Node "/eval/pendulum/u" initialized.
[INFO] [1653499652.277665]: Node "/eval/pendulum/u_applied" initialized.
[INFO] [1653499653.358406]: Nodes initialized.
[INFO] [1653499653.417277]: Pipelines initialized.


  0%|                                                                                                                                                                              | 0/270 [00:00<?, ?it/s]

[INFO] [1653499653.492071]: [pendulum/image] START RENDERING!


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 158.99it/s]


Start video writer
Showing episode 0 with episodic reward: -1734.5951758128779


Start evaluation episode 1 of 3


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 163.21it/s]


Start video writer
Showing episode 1 with episodic reward: -1736.4168247950158


Start evaluation episode 2 of 3


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 161.40it/s]


Start video writer
Showing episode 2 with episodic reward: -1733.5392102546193


Finished evaluation with mean episodic reward: -1734.8504036208378


We see that the pretrained policy fails...
This was to be expected, since the mass and length of the Gym pendulum are 1 kg and 1 m, respectively.
Therefore, we will train a new policy on the Gym pendulum, this time using different values for the mass and the length of the pendulum.
There is only one problem: you don't know the exact mass and length of the real pendulum system.
However, you can still train a successful policy by performing [domain randomization](https://sites.google.com/view/domainrandomization/).
By varying over different values of $m$ and $l$, you can train a policy that is robust against model inaccuracies.
In order to do this, you have to modify a few lines of code in the `reset` method of the `PendulumEnv` class.
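
As a minimal sketch of what this randomization amounts to, the snippet below draws a new mass and length for each episode, using the same ranges and state keys as the `reset` method above. It also illustrates the hint about overestimating the length: under the assumption that the real system behaves like an idealized point mass, a uniform rod needs to be 1.5 times longer to have the same gravity-to-inertia ratio. The helper names and the 0.08 m example value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def sample_pendulum_parameters(mass_range=(0.03, 0.05), length_range=(0.10, 0.14)):
    """Draw one mass (kg) and length (m); intended to be called once per episode."""
    mass = rng.uniform(*mass_range, size=1)
    length = rng.uniform(*length_range, size=1)
    return {"pendulum/mass": mass, "pendulum/length": length}


def equivalent_rod_length(point_mass_length: float) -> float:
    """Rod length whose gravity-to-inertia ratio matches a point mass at distance d.

    Uniform rod about its end: m*g*(l/2) / (m*l**2/3) = 3*g / (2*l).
    Point mass at distance d:  m*g*d / (m*d**2)       = g / d.
    Equating both ratios gives l = 1.5 * d.
    """
    return 1.5 * point_mass_length


for episode in range(3):
    print(episode, sample_pendulum_parameters())
print(equivalent_rod_length(0.08))
```

Wider ranges make the policy more robust to modeling errors but usually make training harder, so the ranges are best centered on a rough estimate of the real system.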

Once you have done this, you can train a policy as follows (this will take a couple of minutes in Colab).



**NOTE: If you want to rerun code, we advise you to restart and run all code (in Colab there is the option Restart and run all under Runtime).**

In [11]:
# Initialize learner
model = sb3.SAC("MlpPolicy", train_env, verbose=1, learning_rate=7e-4)

# Train for 40 episodes
train_env.render("human")
model.learn(total_timesteps=int(4000))
train_env.close()

# Save model
model.save("pendulum")

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
[INFO] [1653499663.647081]: Nodes initialized.
[INFO] [1653499663.793247]: Pipelines initialized.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 101      |
|    ep_rew_mean     | -460     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 48       |
|    time_elapsed    | 8        |
|    total_timesteps | 404      |
| train/             |          |
|    actor_loss      | 7.25     |
|    critic_loss     | 0.538    |
|    ent_coef        | 0.81     |
|    ent_coef_loss   | -0.32    |
|    learning_rate   | 0.0007   |
|    n_updates       | 303      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 101      |
|    ep_rew_mean     | -359     |
| time/              |          |
|    episodes        | 8        |
|    fps             | 48

Next, you can evaluate your policy again on the simulated disc pendulum.

In [12]:
# Create evaluation environment
eval_env = PendulumEnv(name="disc", rate=rate, graph=graph, engine=ode_engine, eval=True)
eval_env = Flatten(eval_env)

helper.evaluate(model, eval_env, episode_length=270, video_rate=rate, video_prefix="trained_disc")

[INFO] [1653499743.430460]: Node "/disc/env/supervisor" initialized.
[INFO] [1653499743.579346]: Node "/disc/engine" initialized.
[INFO] [1653499743.711216]: Node "/disc/environment" initialized.
[INFO] [1653499743.885394]: Node "/disc/pendulum/theta" initialized.
[INFO] [1653499743.920012]: Node "/disc/pendulum/dtheta" initialized.
Start evaluation episode 0 of 3
[INFO] [1653499743.992702]: Adding object "pendulum" of type "Pendulum" to the simulator.
[INFO] [1653499744.006905]: Node "/disc/pendulum/x" initialized.
[INFO] [1653499744.023462]: Node "/disc/pendulum/image" initialized.
[INFO] [1653499744.041535]: Node "/disc/pendulum/u" initialized.
[INFO] [1653499744.057389]: Node "/disc/pendulum/u_applied" initialized.
[INFO] [1653499745.149374]: Nodes initialized.
[INFO] [1653499745.313148]: Pipelines initialized.


  9%|██████████████▌                                                                                                                                                     | 24/270 [00:00<00:02, 122.40it/s]

[INFO] [1653499745.384422]: [pendulum/image] START RENDERING!


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 142.12it/s]


Start video writer
Showing episode 0 with episodic reward: -135.60065114097873


Start evaluation episode 1 of 3


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 149.80it/s]


Start video writer
Showing episode 1 with episodic reward: -96.33362055363183


Start evaluation episode 2 of 3


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:01<00:00, 149.22it/s]


Start video writer
Showing episode 2 with episodic reward: -88.49210323591969


Finished evaluation with mean episodic reward: -106.80879164351008


And... were you able to swing up the disc pendulum and stabilize it upright successfully?
Note that the mean episodic reward is printed.
A successful policy should result in a mean episodic reward of at least -200.
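
The success criterion can be checked by hand from the three episodic rewards printed in the evaluation output above. The snippet below is a small sketch of that check; the variable names are illustrative and not part of the `helper` module.

```python
import numpy as np

# Episodic rewards copied from the evaluation output of the trained policy.
episodic_rewards = [-135.60065114097873, -96.33362055363183, -88.49210323591969]
SUCCESS_THRESHOLD = -200.0

mean_reward = float(np.mean(episodic_rewards))
print(f"mean episodic reward: {mean_reward:.2f}")  # mean episodic reward: -106.81
print("successful" if mean_reward >= SUCCESS_THRESHOLD else "not successful")  # successful
```

For comparison, the pretrained policy scored a mean episodic reward of about -1735 on the same environment, far below the threshold.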