# Solving the Pick-and-Place Environment in Robosuite

## The Task

Our task is to solve the pick and place environment of the robosuite project with reinforcement learing.
Robosuite provides a simulation environment with a box containing the objects, a second box divided into segments for each object and a robotic arm with a gripper as endeffector. 
The task is considered successfull if the robot manages to place every object into their corresponding segment in the second box. 
To achieve this goal, the robot has to perform four intermediet tasks.

1. Reaching the nearest object.
2. Grasping the object.
3. Lifting the object out of the box.
4. Moving the object to the corresponding segment in the second box.

## Parameters of the Robosuite setup

todo
and reference papers

## Stable Baselines 3

todo

## Reward function

The reward function is essential to understand the behaviour of the robot. For our task the reward function is predefined by the environment. For each subtask an additional reẁard is added successively. 
In this documentation of the reward function describes the rewards for each subtask
todo add image
![title]("img/picture.png")

## The Repository
Clone the git repository. The jupiter notebook file alone will not provide full functionalities.
This jupiter notebook is the main executable and contains all the source code.

The ./models/ folder contains the trained models and tensorboard logs, after executing our training script.
The ./replay/ folder contains the models and recorded simulations for our later example demonstrations.
The ./optuna/ folder contains the optuna logs after executing our optuna training script.
The ./prepared_optuna/ folder contains the optuna logs from our 30 hour execution session. 

## Installation

Since robosuite shows complications with windows, a linux or mac computer is required.
On debian the non free cuda driver has to be installed as a kernel level module in order to use the GPU for calculations.
This change resulted in crashes of wayland DSP so a X11 has to be used as a fallback. 

Our code is writen for python3.11. The following python packages are needed:
numpy (below version 2), robosuite, stable-baselines3[extra], libhdf5, h5py

In [2]:
!python3 -m pip install ipywidgets
!TMPDIR='/var/tmp' python3 -m pip install -r requirements.txt

Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Collecting robosuite==1.4.1
  Using cached robosuite-1.4.1-py3-none-any.whl (193.5 MB)
Collecting stable-baselines3[extra]==2.3.2
  Using cached stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
Collecting tensorboard==2.17.0
  Using cached tensorboard-2.17.0-py3-none-any.whl (5.5 MB)
Collecting h5py==3.11.0
  Using cached h5py-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Collecting numba>=0.49.1
  Using cached numba-0.60.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.7 MB)
Collecting scipy>=1.2.3
  Using cached scipy-1.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.1 MB)
Collecting mujoco>=2.3.0
  Using cached mujoco-3.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting Pillow
  Using cached pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
Collecting openc

In [1]:
import numpy as np
import os
import robosuite as suite


from robosuite import load_controller_config
from robosuite.environments.base import register_env
from robosuite.controllers import load_controller_config
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.save_util import save_to_zip_file, load_from_zip_file
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from robosuite.wrappers import GymWrapper

from stable_baselines3 import PPO, DDPG



## Initial setup and testing of parameters

At the beginning of the task we trained a few models with different parameters to learn the relations between the parameters and the changes to the model. To do so we created the following script which defines a config dict of the tested parameters at the beginning of the file.  

In [2]:
parameters = dict(
    # Environment
    robot="Panda",
    gripper="default",
    controller="OSC_POSE",
    seed=12532135,
    control_freq=20,
    horizon=2048,
    camera_size=84,
    episodes=200,
    n_processes=6,
    # Algorithm 
    algortihm="PPO",
    gamma=0.99,
    learning_rate=1e-3,
    n_steps=2048,
)

test_name = str(parameters["robot"]) + "_freq" + str(parameters["robot"]) + "_hor" + str(parameters["horizon"]) + "_learn" + str(parameters["learning_rate"]) + "_episodes" + str(parameters["episodes"]) + "_control" + str(parameters["controller"])

### Training
The following script will train a model with the previously specified parameters.
The model and tensorboard logs will be stored in the "tests" folder named according to the specified parameters. 

In [3]:

controller_config = load_controller_config(default_controller=parameters["controller"])

def make_env(env_id, options, rank):
    def _init():
        env = GymWrapper(suite.make(env_id, **options))
        env.render_mode = 'mujoco'
        env = Monitor(env)
        env.reset(seed=parameters["seed"] + rank)
        return env
    set_random_seed(parameters["seed"])
    return _init


env = SubprocVecEnv([make_env(
    "PickPlace",
    dict(
        robots=[parameters["robot"]],                      
        gripper_types=parameters["gripper"],                
        controller_configs=controller_config,   
        has_renderer=False,                     
        has_offscreen_renderer=True,
        control_freq=parameters["control_freq"],
        horizon=parameters["horizon"],
        use_object_obs=False,                       # don't provide object observations to agent
        use_camera_obs=True,                        # provide image observations to agent
        camera_names="agentview",                   # use "agentview" camera for observations
        camera_heights=parameters["camera_size"],   # image height
        camera_widths=parameters["camera_size"],    # image width
        reward_shaping=True),                       # use a dense reward signal for learning
    i
    ) for i in range(parameters["n_processes"])])


env = VecNormalize(env)
model = PPO("MlpPolicy", env, verbose=1, gamma=parameters["gamma"], learning_rate=parameters["learning_rate"], n_steps=parameters["n_steps"], tensorboard_log=test_name, )

#env = VecNormalize.load('./' + test_name + '/env.pkl', env)
#model = PPO.load("./" + test_name + "/model.zip", env=env)

model.learn(total_timesteps=parameters["horizon"]*parameters["episodes"], progress_bar=True)
model.save("./" + test_name + "/model.zip")
env.save('./' + test_name + '/env.pkl')

env.close()
 

  return torch._C._cuda_getDeviceCount() > 0


Using cpu device
Logging to Panda_freqPanda_hor2048_learn0.001_episodes200_controlOSC_POSE/PPO_1


Output()

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 2.05e+03 |
|    ep_rew_mean     | 1        |
| time/              |          |
|    fps             | 132      |
|    iterations      | 1        |
|    time_elapsed    | 92       |
|    total_timesteps | 12288    |
---------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 1.32        |
| time/                   |             |
|    fps                  | 113         |
|    iterations           | 2           |
|    time_elapsed         | 216         |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.028386751 |
|    clip_fraction        | 0.284       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.9        |
|    explained_variance   | -0.192      |
|    learning_rate        | 0.001       |
|    loss                 | 0.0225      |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0127     |
|    std                  | 0.994       |
|    value_loss           | 0.0278      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 1.5         |
| time/                   |             |
|    fps                  | 110         |
|    iterations           | 3           |
|    time_elapsed         | 333         |
|    total_timesteps      | 36864       |
| train/                  |             |
|    approx_kl            | 0.038334385 |
|    clip_fraction        | 0.389       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.9        |
|    explained_variance   | 0.33        |
|    learning_rate        | 0.001       |
|    loss                 | -0.0156     |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.000468   |
|    std                  | 0.994       |
|    value_loss           | 0.00927     |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.64       |
| time/                   |            |
|    fps                  | 109        |
|    iterations           | 4          |
|    time_elapsed         | 447        |
|    total_timesteps      | 49152      |
| train/                  |            |
|    approx_kl            | 0.05035616 |
|    clip_fraction        | 0.446      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.87      |
|    explained_variance   | 0.318      |
|    learning_rate        | 0.001      |
|    loss                 | -0.012     |
|    n_updates            | 30         |
|    policy_gradient_loss | 0.00642    |
|    std                  | 0.991      |
|    value_loss           | 0.00633    |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.52       |
| time/                   |            |
|    fps                  | 109        |
|    iterations           | 5          |
|    time_elapsed         | 561        |
|    total_timesteps      | 61440      |
| train/                  |            |
|    approx_kl            | 0.05702697 |
|    clip_fraction        | 0.461      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.86      |
|    explained_variance   | 0.558      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0372    |
|    n_updates            | 40         |
|    policy_gradient_loss | 0.00306    |
|    std                  | 0.99       |
|    value_loss           | 0.00952    |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.72       |
| time/                   |            |
|    fps                  | 108        |
|    iterations           | 6          |
|    time_elapsed         | 681        |
|    total_timesteps      | 73728      |
| train/                  |            |
|    approx_kl            | 0.05938536 |
|    clip_fraction        | 0.473      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.86      |
|    explained_variance   | 0.517      |
|    learning_rate        | 0.001      |
|    loss                 | -0.012     |
|    n_updates            | 50         |
|    policy_gradient_loss | 0.00384    |
|    std                  | 0.99       |
|    value_loss           | 0.00556    |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.72       |
| time/                   |            |
|    fps                  | 108        |
|    iterations           | 7          |
|    time_elapsed         | 795        |
|    total_timesteps      | 86016      |
| train/                  |            |
|    approx_kl            | 0.06202428 |
|    clip_fraction        | 0.503      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.86      |
|    explained_variance   | 0.764      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00787    |
|    n_updates            | 60         |
|    policy_gradient_loss | 0.0105     |
|    std                  | 0.987      |
|    value_loss           | 0.0156     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.78       |
| time/                   |            |
|    fps                  | 107        |
|    iterations           | 8          |
|    time_elapsed         | 911        |
|    total_timesteps      | 98304      |
| train/                  |            |
|    approx_kl            | 0.06726036 |
|    clip_fraction        | 0.496      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.83      |
|    explained_variance   | 0.672      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00841    |
|    n_updates            | 70         |
|    policy_gradient_loss | 0.00263    |
|    std                  | 0.986      |
|    value_loss           | 0.0112     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 1.95       |
| time/                   |            |
|    fps                  | 107        |
|    iterations           | 9          |
|    time_elapsed         | 1031       |
|    total_timesteps      | 110592     |
| train/                  |            |
|    approx_kl            | 0.08512173 |
|    clip_fraction        | 0.535      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.81      |
|    explained_variance   | 0.762      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0114    |
|    n_updates            | 80         |
|    policy_gradient_loss | 0.00512    |
|    std                  | 0.983      |
|    value_loss           | 0.0118     |
----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 2.33        |
| time/                   |             |
|    fps                  | 106         |
|    iterations           | 10          |
|    time_elapsed         | 1148        |
|    total_timesteps      | 122880      |
| train/                  |             |
|    approx_kl            | 0.090729594 |
|    clip_fraction        | 0.55        |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.82       |
|    explained_variance   | 0.715       |
|    learning_rate        | 0.001       |
|    loss                 | 0.0137      |
|    n_updates            | 90          |
|    policy_gradient_loss | 0.00921     |
|    std                  | 0.984       |
|    value_loss           | 0.0321      |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 2.58       |
| time/                   |            |
|    fps                  | 106        |
|    iterations           | 11         |
|    time_elapsed         | 1271       |
|    total_timesteps      | 135168     |
| train/                  |            |
|    approx_kl            | 0.08466514 |
|    clip_fraction        | 0.529      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.85      |
|    explained_variance   | 0.306      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0245     |
|    n_updates            | 100        |
|    policy_gradient_loss | 0.0111     |
|    std                  | 0.989      |
|    value_loss           | 0.0399     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 2.91       |
| time/                   |            |
|    fps                  | 105        |
|    iterations           | 12         |
|    time_elapsed         | 1396       |
|    total_timesteps      | 147456     |
| train/                  |            |
|    approx_kl            | 0.08798125 |
|    clip_fraction        | 0.537      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.86      |
|    explained_variance   | 0.723      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00651   |
|    n_updates            | 110        |
|    policy_gradient_loss | 0.00817    |
|    std                  | 0.987      |
|    value_loss           | 0.0276     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 3.15       |
| time/                   |            |
|    fps                  | 105        |
|    iterations           | 13         |
|    time_elapsed         | 1519       |
|    total_timesteps      | 159744     |
| train/                  |            |
|    approx_kl            | 0.08787829 |
|    clip_fraction        | 0.539      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.84      |
|    explained_variance   | 0.32       |
|    learning_rate        | 0.001      |
|    loss                 | 0.002      |
|    n_updates            | 120        |
|    policy_gradient_loss | 0.0115     |
|    std                  | 0.983      |
|    value_loss           | 0.042      |
----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 3.26        |
| time/                   |             |
|    fps                  | 105         |
|    iterations           | 14          |
|    time_elapsed         | 1636        |
|    total_timesteps      | 172032      |
| train/                  |             |
|    approx_kl            | 0.101813555 |
|    clip_fraction        | 0.562       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.82       |
|    explained_variance   | 0.386       |
|    learning_rate        | 0.001       |
|    loss                 | 0.00245     |
|    n_updates            | 130         |
|    policy_gradient_loss | 0.00986     |
|    std                  | 0.988       |
|    value_loss           | 0.0299      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 3.26        |
| time/                   |             |
|    fps                  | 105         |
|    iterations           | 15          |
|    time_elapsed         | 1751        |
|    total_timesteps      | 184320      |
| train/                  |             |
|    approx_kl            | 0.100872494 |
|    clip_fraction        | 0.569       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.87       |
|    explained_variance   | 0.539       |
|    learning_rate        | 0.001       |
|    loss                 | -0.0352     |
|    n_updates            | 140         |
|    policy_gradient_loss | 0.0212      |
|    std                  | 0.992       |
|    value_loss           | 0.0215      |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 3.46       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 16         |
|    time_elapsed         | 1873       |
|    total_timesteps      | 196608     |
| train/                  |            |
|    approx_kl            | 0.09848369 |
|    clip_fraction        | 0.563      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.89      |
|    explained_variance   | 0.85       |
|    learning_rate        | 0.001      |
|    loss                 | -0.00329   |
|    n_updates            | 150        |
|    policy_gradient_loss | 0.0123     |
|    std                  | 0.995      |
|    value_loss           | 0.0151     |
----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 3.6         |
| time/                   |             |
|    fps                  | 104         |
|    iterations           | 17          |
|    time_elapsed         | 1993        |
|    total_timesteps      | 208896      |
| train/                  |             |
|    approx_kl            | 0.108122505 |
|    clip_fraction        | 0.547       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.89       |
|    explained_variance   | 0.393       |
|    learning_rate        | 0.001       |
|    loss                 | 0.0753      |
|    n_updates            | 160         |
|    policy_gradient_loss | 0.00602     |
|    std                  | 0.997       |
|    value_loss           | 0.105       |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 3.85       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 18         |
|    time_elapsed         | 2113       |
|    total_timesteps      | 221184     |
| train/                  |            |
|    approx_kl            | 0.10715013 |
|    clip_fraction        | 0.549      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.94      |
|    explained_variance   | 0.246      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0433     |
|    n_updates            | 170        |
|    policy_gradient_loss | 0.00678    |
|    std                  | 1          |
|    value_loss           | 0.0385     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 3.84       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 19         |
|    time_elapsed         | 2232       |
|    total_timesteps      | 233472     |
| train/                  |            |
|    approx_kl            | 0.12635954 |
|    clip_fraction        | 0.595      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10        |
|    explained_variance   | 0.64       |
|    learning_rate        | 0.001      |
|    loss                 | -0.00105   |
|    n_updates            | 180        |
|    policy_gradient_loss | 0.0185     |
|    std                  | 1.01       |
|    value_loss           | 0.0242     |
----------------------------------------


--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 2.05e+03 |
|    ep_rew_mean          | 3.98     |
| time/                   |          |
|    fps                  | 104      |
|    iterations           | 20       |
|    time_elapsed         | 2349     |
|    total_timesteps      | 245760   |
| train/                  |          |
|    approx_kl            | 0.12936  |
|    clip_fraction        | 0.601    |
|    clip_range           | 0.2      |
|    entropy_loss         | -10      |
|    explained_variance   | 0.753    |
|    learning_rate        | 0.001    |
|    loss                 | -0.0144  |
|    n_updates            | 190      |
|    policy_gradient_loss | 0.0231   |
|    std                  | 1.02     |
|    value_loss           | 0.0143   |
--------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 4.23       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 21         |
|    time_elapsed         | 2470       |
|    total_timesteps      | 258048     |
| train/                  |            |
|    approx_kl            | 0.12698981 |
|    clip_fraction        | 0.591      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.766      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0182     |
|    n_updates            | 200        |
|    policy_gradient_loss | 0.015      |
|    std                  | 1.02       |
|    value_loss           | 0.0174     |
----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 4.48        |
| time/                   |             |
|    fps                  | 104         |
|    iterations           | 22          |
|    time_elapsed         | 2589        |
|    total_timesteps      | 270336      |
| train/                  |             |
|    approx_kl            | 0.115501426 |
|    clip_fraction        | 0.565       |
|    clip_range           | 0.2         |
|    entropy_loss         | -10.1       |
|    explained_variance   | 0.633       |
|    learning_rate        | 0.001       |
|    loss                 | 0.0578      |
|    n_updates            | 210         |
|    policy_gradient_loss | 0.00927     |
|    std                  | 1.02        |
|    value_loss           | 0.0206      |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 4.68       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 23         |
|    time_elapsed         | 2709       |
|    total_timesteps      | 282624     |
| train/                  |            |
|    approx_kl            | 0.11089348 |
|    clip_fraction        | 0.569      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.626      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00713   |
|    n_updates            | 220        |
|    policy_gradient_loss | 0.00783    |
|    std                  | 1.03       |
|    value_loss           | 0.0239     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 4.79       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 24         |
|    time_elapsed         | 2831       |
|    total_timesteps      | 294912     |
| train/                  |            |
|    approx_kl            | 0.11024625 |
|    clip_fraction        | 0.558      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.146      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0032     |
|    n_updates            | 230        |
|    policy_gradient_loss | 0.00952    |
|    std                  | 1.03       |
|    value_loss           | 0.0209     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 4.91       |
| time/                   |            |
|    fps                  | 104        |
|    iterations           | 25         |
|    time_elapsed         | 2952       |
|    total_timesteps      | 307200     |
| train/                  |            |
|    approx_kl            | 0.12334037 |
|    clip_fraction        | 0.589      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.537      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0927     |
|    n_updates            | 240        |
|    policy_gradient_loss | 0.0133     |
|    std                  | 1.03       |
|    value_loss           | 0.0201     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.13       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 26         |
|    time_elapsed         | 3074       |
|    total_timesteps      | 319488     |
| train/                  |            |
|    approx_kl            | 0.13424887 |
|    clip_fraction        | 0.592      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.734      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0208    |
|    n_updates            | 250        |
|    policy_gradient_loss | 0.0143     |
|    std                  | 1.04       |
|    value_loss           | 0.0162     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.19       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 27         |
|    time_elapsed         | 3196       |
|    total_timesteps      | 331776     |
| train/                  |            |
|    approx_kl            | 0.12779556 |
|    clip_fraction        | 0.584      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.636      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0262     |
|    n_updates            | 260        |
|    policy_gradient_loss | 0.0148     |
|    std                  | 1.04       |
|    value_loss           | 0.0176     |
----------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 2.05e+03  |
|    ep_rew_mean          | 5.49      |
| time/                   |           |
|    fps                  | 103       |
|    iterations           | 28        |
|    time_elapsed         | 3321      |
|    total_timesteps      | 344064    |
| train/                  |           |
|    approx_kl            | 0.1258159 |
|    clip_fraction        | 0.594     |
|    clip_range           | 0.2       |
|    entropy_loss         | -10.2     |
|    explained_variance   | 0.287     |
|    learning_rate        | 0.001     |
|    loss                 | 0.0524    |
|    n_updates            | 270       |
|    policy_gradient_loss | 0.021     |
|    std                  | 1.04      |
|    value_loss           | 0.027     |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.33       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 29         |
|    time_elapsed         | 3439       |
|    total_timesteps      | 356352     |
| train/                  |            |
|    approx_kl            | 0.11889517 |
|    clip_fraction        | 0.579      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.282      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00403   |
|    n_updates            | 280        |
|    policy_gradient_loss | 0.0178     |
|    std                  | 1.04       |
|    value_loss           | 0.0387     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.29       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 30         |
|    time_elapsed         | 3555       |
|    total_timesteps      | 368640     |
| train/                  |            |
|    approx_kl            | 0.10799837 |
|    clip_fraction        | 0.561      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.605      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0202    |
|    n_updates            | 290        |
|    policy_gradient_loss | 0.0136     |
|    std                  | 1.04       |
|    value_loss           | 0.0231     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.34       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 31         |
|    time_elapsed         | 3672       |
|    total_timesteps      | 380928     |
| train/                  |            |
|    approx_kl            | 0.12684654 |
|    clip_fraction        | 0.589      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.731      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0479    |
|    n_updates            | 300        |
|    policy_gradient_loss | 0.017      |
|    std                  | 1.05       |
|    value_loss           | 0.014      |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 2.05e+03   |
|    ep_rew_mean          | 5.36       |
| time/                   |            |
|    fps                  | 103        |
|    iterations           | 32         |
|    time_elapsed         | 3791       |
|    total_timesteps      | 393216     |
| train/                  |            |
|    approx_kl            | 0.12102735 |
|    clip_fraction        | 0.585      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.726      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00966    |
|    n_updates            | 310        |
|    policy_gradient_loss | 0.017      |
|    std                  | 1.05       |
|    value_loss           | 0.0158     |
----------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 2.05e+03  |
|    ep_rew_mean          | 5.55      |
| time/                   |           |
|    fps                  | 103       |
|    iterations           | 33        |
|    time_elapsed         | 3909      |
|    total_timesteps      | 405504    |
| train/                  |           |
|    approx_kl            | 0.1149396 |
|    clip_fraction        | 0.585     |
|    clip_range           | 0.2       |
|    entropy_loss         | -10.3     |
|    explained_variance   | 0.734     |
|    learning_rate        | 0.001     |
|    loss                 | -0.03     |
|    n_updates            | 320       |
|    policy_gradient_loss | 0.0184    |
|    std                  | 1.06      |
|    value_loss           | 0.0122    |
---------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.05e+03    |
|    ep_rew_mean          | 5.54        |
| time/                   |             |
|    fps                  | 103         |
|    iterations           | 34          |
|    time_elapsed         | 4026        |
|    total_timesteps      | 417792      |
| train/                  |             |
|    approx_kl            | 0.097734936 |
|    clip_fraction        | 0.538       |
|    clip_range           | 0.2         |
|    entropy_loss         | -10.3       |
|    explained_variance   | 0.226       |
|    learning_rate        | 0.001       |
|    loss                 | 0.0322      |
|    n_updates            | 330         |
|    policy_gradient_loss | 0.00592     |
|    std                  | 1.05        |
|    value_loss           | 0.0901      |
-----------------------------------------


### Tensorboard
The following command will open a locally hosted http server for the tensorboard.
Navigate to http://localhost:6006 to view the data logged during training.

In [4]:
!python3 -m tensorboard.main --logdir={'./' + test_name + '/'}

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.17.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


### Apply the model
With the following script, the model defined by the specified parameters will be used for the task execution. 
To run a simulation the model with the specified parameters has to be trained first.
Therefore adjust the parameters dict and rerun the code block. Then execute the training script. After that the new model can be simulated with the following script.

In [6]:
print(np.version.version)

if not 'parameters' in vars() and not 'parameters' in globals():
    print("No parameters defined")
elif not os.path.isdir('./' + test_name):
    print("No model found for this configuration. Train a model first!")
else:
    print('Using model ' + test_name)

    controller_config = load_controller_config(default_controller=parameters["controller"])

    def make_env(env_id, options, rank):
        def _init():
            env = GymWrapper(suite.make(env_id, **options))
            env.render_mode = 'mujoco'
            env = Monitor(env)
            env.reset(seed=parameters["seed"] + rank)
            return env
        set_random_seed(parameters["seed"])
        return _init


    env = SubprocVecEnv([make_env(
        "PickPlace",
        dict(
            robots=[parameters["robot"]],                      
            gripper_types=parameters["gripper"],                
            controller_configs=controller_config,   
            has_renderer=True,
            has_offscreen_renderer=True,
            control_freq=parameters["control_freq"],
            horizon=parameters["horizon"],
            use_object_obs=False,                       # don't provide object observations to agent
            use_camera_obs=True,                        # provide image observations to agent
            camera_names="agentview",                   # use "agentview" camera for observations
            camera_heights=parameters["camera_size"],   # image height
            camera_widths=parameters["camera_size"],    # image width
            reward_shaping=True),                       # use a dense reward signal for learning
        0
        )])

    #env.render_mode = 'mujoco'
    env = VecNormalize.load('./' + test_name + '/env.pkl', env)
    print(env)
    model = PPO.load("./" + test_name + "/model.zip", env=env)

    def get_policy_action(obs):
        action, _states = model.predict(obs)
        return action

    # reset the environment to prepare for a rollout
    env.training = False
    env.norm_reward = False
    obs = env.reset()

    for i in range(parameters["horizon"]):
        action = get_policy_action(obs)         # use observation to decide on an action
        obs, reward, done, info = env.step(action) # play action
        env.render()

    env.close()

1.26.4
Using model Panda_freqPanda_hor2048_learn0.001_episodes200_controlOSC_POSE




<stable_baselines3.common.vec_env.vec_normalize.VecNormalize object at 0x7f25bb694890>


Qt: Session management error: Could not open network socket


In [None]:
!python3 -m pip install numpy==1.26.1

### Further developments

With these first tests we could test a variety of robot configurations and samples of parameters.
We could identify the Sawyer robot with its default gripper and the PPO algorithm as our most promising candidates.

Furthermore, we could optimize the first parameters. 
With the predefined control_freq of the environment of 20 the robot arm was not ä

We identified the lack of computing performance as a bottleneck.




todo some simulation examples with tensorboard graphs and everything:

In [None]:
# replay of panda beeing clumsy with its grippers head

In [None]:
# replay of IIWA bugging

In [None]:
# replay of 20hz control freq

In [None]:
# replay of 100hz control freq

In [None]:
# replay of 250hz control freq

## Optuna

It is easy to overlook configurations with better performance when multiple parameters correlate. Changes for each parameter on its own reduces performes but adapting parameters to each other easily outperforms the basis configuration. 
To further increase the performance of the model a wider field of parameters has to be tested. 
The success of trying random combinations of parameters manually is very limited. Since most of the now relevant parameters are flots, there is a very high number of possible configurations. 

Optuna is a tool that can automate this task. It samples a value of a range for each parameter for each run. Then it trains a model with this configuration and tests its performacne. This can be done for hundrets of runs unsupervised. The performance for each configuration is logged and can be analysed by the researcher. 

The 

### Setup

The following script setup and executes 200 runs.
To limit the number of tries we already know perform bad, the parameters are fixed values or sampled between boundaries.
We took the boundaries for the PPO Algoritm from http://cult...

We let optuna run for 30 hours. The logs are also uploaded to this repository. See the next section for the access to the dashboard and our analysis of the results. 

In [None]:
from robosuite.controllers import load_controller_config
from optuna.visualization import plot_optimization_history, plot_slice


#optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
#study_name = "PPO_Sawyer_OSC_POSE"  # Unique identifier of the study.
study_name = "study1"
storage_name = "sqlite:///{}.db".format(study_name)

# Define the environment setup
def make_env(env_id, options, rank, seed=0):
    def _init():
        env = GymWrapper(suite.make(env_id, **options))
        env.render_mode = 'mujoco'
        env = Monitor(env)
        env.reset(seed=seed + rank)
        return env
    set_random_seed(seed)
    return _init

def evaluate_policy(model, env, n_eval_episodes=5):
    all_episode_rewards = []
    for _ in range(n_eval_episodes):
        episode_rewards = []
        done = np.array([False])
        obs = env.reset()
        while not done.all():
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)
        all_episode_rewards.append(np.sum(episode_rewards))
    mean_reward = np.mean(all_episode_rewards)
    return mean_reward

def save_model(model_path, model, vec_env):
    model.save(model_path + ".zip")
    vec_env.save(model_path + ".env")

def objective(trial):
    # Suggest hyperparameters
    #learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    learning_rate = 0.001
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999])
    n_steps = trial.suggest_categorical('n_steps', [512, 1024, 2048])
    #horizon = trial.suggest_categorical('horizon', [512, 1024, 2048])
    horizon = 512
    control_freq = trial.suggest_uniform('control_freq', 100, 150)
    #total_timesteps = trial.suggest_categorical('total_timesteps', [1e5, 2e5, 5e5, 1e6, 2e6])
    total_timesteps = 3e5
    ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
    clip_range = trial.suggest_categorical("clip_range", [0.1, 0.2, 0.3, 0.4])
    #n_epochs = trial.suggest_categorical("n_epochs", [1, 5, 10, 20])
    gae_lambda = trial.suggest_categorical("gae_lambda", [0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0])
    max_grad_norm = trial.suggest_categorical("max_grad_norm", [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5])
    vf_coef = trial.suggest_float("vf_coef", 0, 1)
    net_arch_type = trial.suggest_categorical("net_arch", ["tiny", "small", "medium"])

    print(
        f"Learning rate: {learning_rate}, "
        f"Batch size: {batch_size}, "
        f"Gamma: {gamma}, "
        f"N steps: {n_steps}, "
        f"Horizon: {horizon}, "
        f"Control freq: {control_freq}, "
        f"Total timesteps: {total_timesteps}, "
        f"Entropy coefficient: {ent_coef}, "
        f"Clip range: {clip_range}, "
        f"GAE lambda: {gae_lambda}, "
        f"Max grad norm: {max_grad_norm}, "
        f"Value function coefficient: {vf_coef}, "
        f"Network architecture: {net_arch_type}")

    # Load configuration
    with open("config_hyperparams.yaml") as stream:
        config = yaml.safe_load(stream)

    controller_config = load_controller_config(default_controller=config["robot_controller"])

    env_options = {
        "robots": config["robot_name"],
        "controller_configs": controller_config,
        "has_renderer": False,
        "has_offscreen_renderer": True,
        "single_object_mode": 2,
        "object_type": "milk",
        "use_camera_obs": True,
        "use_object_obs": False,
        "camera_names": "agentview",
        "camera_heights": 128,
        "camera_widths": 128,
        "reward_shaping": True,
        "horizon": horizon,
        "control_freq": control_freq,
    }    
    
    # Setup environment
    if config["multiprocessing"]:
        env = SubprocVecEnv([make_env("PickPlace", env_options, i, config["seed"]) for i in range(config["num_envs"])], start_method='spawn')
        eval_env = SubprocVecEnv([make_env("PickPlace", env_options, i, config["seed"]) for i in range(config["num_eval_envs"])], start_method='spawn')
        eval_env = VecNormalize(eval_env)
        # TODO: account when using multiple envs
        if batch_size > n_steps:
            batch_size = n_steps
    else:
        env = DummyVecEnv(make_env("PickPlace", env_options, 0, config["seed"]))
        eval_env = DummyVecEnv([make_env("PickPlace", env_options, 0, config["seed"])])
        eval_env = VecNormalize(eval_env)

    if config["normalize"]:
        env = VecNormalize(env)

    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

    # Orthogonal initialization
    ortho_init = False

    activation_fn_name = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    # Independent networks usually work best
    # when not working with images
    net_arch = {
        "tiny": dict(pi=[64], vf=[64]),
        "small": dict(pi=[64, 64], vf=[64, 64]),
        "medium": dict(pi=[256, 256], vf=[256, 256]),
    }[net_arch_type]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU, "elu": nn.ELU, "leaky_relu": nn.LeakyReLU}[activation_fn_name]

    # Initialize model
    if config["algorithm"] == "PPO":
        model = PPO(config["policy"],
                    env,
                    learning_rate=learning_rate,
                    batch_size=batch_size,
                    gamma=gamma,
                    n_steps=n_steps,
                    ent_coef=ent_coef,
                    clip_range=clip_range,
                    gae_lambda=gae_lambda,
                    max_grad_norm=max_grad_norm,
                    vf_coef=vf_coef,
                    policy_kwargs=dict(
                                    net_arch=net_arch,
                                    activation_fn=activation_fn,
                                    ortho_init=ortho_init,
                                    ),
                    verbose=0,
                    tensorboard_log=None,
                    device=device
                    )
    elif config["algorithm"] == "DDPG":
        n_actions = env.action_space.shape[-1]
        action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
        model = DDPG(config["policy"], env, action_noise=action_noise, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, verbose=0, tensorboard_log=None, device=device)
    elif config["algorithm"] == "SAC":
        model = SAC(config["policy"], env, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, verbose=0, tensorboard_log=None, device=device)
    
    '''
    # Evaluation callback
    eval_callback = EvalCallback(eval_env, best_model_save_path="./logs/",
                                 log_path="./logs/", eval_freq=2048,
                                 deterministic=True, render=False)
    '''

    # Train the model
    model.learn(total_timesteps=total_timesteps, progress_bar=True)
    
    # Evaluate the model
    #mean_reward = eval_callback.last_mean_reward
    #mean_reward, _ = model.evaluate_policy(eval_env, n_eval_episodes=5, deterministic=True)
    mean_reward = evaluate_policy(model, eval_env, n_eval_episodes=5)
    print("Mean reward: ", mean_reward)
    
    trial.report(mean_reward, step=total_timesteps)

    #Handle pruning based on the intermediate value
    if trial.should_prune():
        raise optuna.exceptions.TrialPruned()

    return mean_reward

# Optimize hyperparameters
if __name__ == "__main__":
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50, timeout=90000)

    pruned_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
    complete_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]

    print('Study statistics: ')
    print('  Number of finished trials: ', len(study.trials))
    print('  Number of pruned trials: ', len(pruned_trials))
    print('  Number of complete trials: ', len(complete_trials))

    print('Best trial: ')
    trial = study.best_trial

    print('  Value: ', trial.value)
    print('  Params: ')
    for key, value in trial.params.items():
        print('    {}: {}'.format(key, value))

    print('Best hyperparameters: ', study.best_params)
    
    plot_optimization_history(study)

### Optuna Dashboard
Optuna dashboard visualizes the logged results of the optuna execution. 
The optuna dashboard can be accessed by executing the following command and open https://localhost:port in the browser.


In [None]:
# Dashboard command here

### Analysis

In the _ section we can see that the parameters _, _, _ are especially important for the earned rewards. 

### Testing Optimized Parameters

Let's see how our model with optimized parameters performs. The following script starts a replay of one of our most successfull runs. We can see ...

In [None]:
# Run replay

## Conclusion
