# Solving the Pick-and-Place Environment in Robosuite
<img src="https://robosuite.ai/docs/images/env_pick_place.png" align="middle" width="100%"/>

Welcome to the "Project Assignment: Solving the Pick-and-Place Environment in Robosuite" repository! It is intended to allow replication of our project results and documents our progress, including insights and tests. The repository holds the source code framework for training and evaluating a policy in the pick-and-place environment, as well as a configuration file to select the robosuite modules (robots, controllers, etc.) and tune hyperparameters.

## Table of Contents
- [Project Description](#project-description)
	- [Course Description](#course-description)
	- [Task Description](#task-description)
- [Installation and Setup](#installation-and-setup)
	- [Installing robosuite and stable baselines 3](#installing-robosuite-and-stable-baselines-3)
	- [Installing our repository](#installing-our-repository)
- [Getting Started](#getting-started)
	- [Initial parameters](#initial-parameters)
	- [Train an Agent](#train-an-agent)
	- [Employ an Agent](#employ-an-agent)
	- [Insights and further testing](#insights-and-further-testing)
- [Hyperparameter Tuning with Optuna](#hyperparameter-tuning-with-optuna)

## Project Description
### Course Description
**[Innovative Konzepte zur Programmierung von Industrierobotern](https://ipr.iar.kit.edu/lehrangebote_3804.php)** (Innovative Concepts for Programming Industrial Robots) is an interactive course at the Karlsruhe Institute of Technology, supervised by Prof. Björn Hein, that deals with new methods of programming industrial robots. The topics covered in the lecture include collision detection, collision-free path planning, path optimization and the emerging field of Reinforcement Learning. To conclude the lecture, a team of two course participants must implement a final project related to one of these topics.
### Task Description
Our team's task is to solve the **[Pick-and-Place Environment](https://robosuite.ai/docs/modules/environments.html#pick-and-place)** from Robosuite using Reinforcement Learning. In this simulated environment, a robot arm needs to place four objects from a bin into their designated containers. Each time the environment is initialized, the locations of the objects are randomized, and the task is considered successful if the robot arm manages to place every object into its corresponding container.

#### Subtasks:
The task (for each object) can be subdivided into the following subtasks:

 1. Reaching: Move to nearest object
 2. Grasping: Pick up the object
 3. Lifting: Carry object to container
 4. Hovering: Drop object into corresponding container
 5. Repeat starting at 1. until all objects are placed in their corresponding containers

#### Reward function:
The reward function is essential to understanding the robot's behaviour while it interacts with the environment. Each robosuite environment implements two kinds of reward function. The binary (sparse) reward rewards the robot only when an object is placed in its corresponding container. We used the dense reward function instead, which applies reward shaping and rewards the robot for progress on each subtask (such as reaching and grasping), with later subtasks yielding higher rewards. The image below, taken from the [python code for the pick-and-place task](https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/environments/manipulation/pick_place.py#L260), describes the additional rewards for each subtask:

![](https://github.com/TheOrzo/IKfIR/blob/main/.assets/img/reward_function.png)
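
The staged structure can be sketched in a few lines. The snippet below is a minimal illustration, not robosuite's implementation: the function name `shaped_reward` and the example stage values are hypothetical, reward scaling is omitted, and it assumes (as the linked source appears to do) that only the largest staged value is added to the number of objects already placed.

```python
def shaped_reward(objects_placed: int, staged_rewards: dict[str, float]) -> float:
    """Combine the sparse and the staged reward (illustrative sketch only)."""
    # Assumption: only the most advanced subtask contributes, i.e. the
    # largest staged value is added to the count of placed objects.
    return objects_placed + max(staged_rewards.values(), default=0.0)


# Hypothetical stage values for a robot that has already placed one object
# and has grasped, but not yet lifted, the next one.
stages = {"reach": 0.1, "grasp": 0.35, "lift": 0.0, "hover": 0.0}
print(shaped_reward(objects_placed=1, staged_rewards=stages))
```

Under this assumption, a robot that merely hovers near an object can never out-earn one that has grasped it, which is what pushes the policy through the subtasks in order.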

## Installation and Setup

### Installing robosuite and stable baselines 3
Running robosuite on Windows is possible (e.g. in a VM or WSL), but the installation is more complicated, so we highly recommend using a Linux or macOS machine. Before you can use our repository, install robosuite by following the [installation guide](https://robosuite.ai/docs/installation.html) in the robosuite documentation. We installed it from source:

In [None]:
% git clone https://github.com/ARISE-Initiative/robosuite.git
% cd robosuite
% pip3 install -r requirements.txt

Our repository uses the stable release of Stable-Baselines3 for its RL algorithm implementations. Install it by following the [installation guide](https://stable-baselines3.readthedocs.io/en/master/guide/install.html):

In [None]:
% pip install "stable-baselines3[extra]"

### Installing our repository
On Debian, the non-free CUDA driver has to be installed as a kernel-level module in order to use the GPU for computation. In our setup, this change caused the Wayland display server to crash, so X11 had to be used as a fallback.

Our code is written for Python 3.11. The following packages are needed: numpy (below version 2), robosuite, stable-baselines3[extra], libhdf5 and h5py.

In [1]:
!python3 -m pip install --upgrade pip
!TMPDIR='/var/tmp' python3 -m pip install -r requirements.txt
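
A quick check after installation can catch a wrong interpreter or a NumPy 2.x install early. The sketch below is an optional addition and not part of the repository; the helper name `numpy_major_ok` is hypothetical, and the check mirrors only the requirement stated above (NumPy below version 2).

```python
import sys
from importlib import metadata


def numpy_major_ok(version: str) -> bool:
    """Return True if a NumPy version string has a major version below 2."""
    return int(version.split(".")[0]) < 2


if __name__ == "__main__":
    # Report the interpreter version (the code is written for Python 3.11).
    print("Python:", sys.version.split()[0])
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("NumPy is not installed.")
    else:
        status = "ok" if numpy_major_ok(numpy_version) else "expected a version below 2"
        print(f"NumPy: {numpy_version} ({status})")
```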



In [2]:
import os
from sys import platform

import numpy as np
import torch

import robosuite as suite
from robosuite import load_controller_config
from robosuite.environments.base import register_env
from robosuite.wrappers import GymWrapper

from stable_baselines3 import PPO, DDPG, SAC
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.save_util import save_to_zip_file, load_from_zip_file
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, SubprocVecEnv



# Check whether CUDA (Linux) or MPS (macOS) is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Cuda backend is available.")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Mps backend is available.")
else:
    device = torch.device("cpu")
    print("Neither Cuda nor Mps backend is available, using CPU.")


Cuda backend is available.


## Getting Started
### Initial parameters
To get a feel for how different parameters affect model performance in a specific environment, we train the model several times in succession with different parameters. The following cell serves as a configuration file that defines all parameters that can be adjusted for these runs.

In [7]:
parameters = dict(
    # Environment
    robot="Sawyer",
    gripper="default",
    controller="OSC_POSE",
    seed=1837812,
    control_freq=20,
    horizon=1000,
    camera_size=84,                 # camera image height and width in pixels
    episodes=350,
    eval_episodes=1,
    n_processes=6,
    n_eval_processes=1,
    # Algorithm
    algorithm="PPO",                # PPO, DDPG, SAC
    policy="MlpPolicy",
    gamma=0.99,
    learning_rate=1e-2,             # PPO only; the run logged below reports 0.001
    n_steps=1000,                   # PPO only: rollout length per environment
    batch_size=100,                 # used by PPO and DDPG; SAC uses a fixed value in the training cell
)


if platform == "linux" or platform == "linux2":
    print('Recognized Linux. Setting start method to forkserver')
    parameters["start_method"] = 'forkserver'
elif platform == 'darwin':
    print('Recognized macOS. Setting start method to spawn')
    parameters["start_method"] = 'spawn'
elif platform == 'win32':
    print('Windows? Brave.')
    parameters["start_method"] = 'spawn'
else:
    print('Could not determine os platform. Set start method to forkserver')
    parameters["start_method"] = 'forkserver'

# Build the run name from the key parameters; it is used as the output folder name
test_name = (
    f'{parameters["robot"]}_freq{parameters["control_freq"]}_hor{parameters["horizon"]}'
    f'_learn{parameters["learning_rate"]}_episodes{parameters["episodes"]}_control{parameters["controller"]}'
)

Recognized Linux. Setting start method to forkserver


The initial parameters in this dict are taken from multiple sources (benchmarks, implementations and papers) listed under [Sources](#Sources). During initial exploration, we found that the robot model, `batch_size` and the learning rate have the greatest impact on model performance.
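
Several of these parameters interact. As a rough worked example, the training budget implied by the values above can be computed as follows. The variable names are chosen for illustration, and rounding the iteration count up is an assumption about how the on-policy training loop in Stable-Baselines3 always collects complete rollouts.

```python
import math

horizon, episodes = 1000, 350      # steps per episode, number of episodes
n_steps, n_processes = 1000, 6     # PPO rollout length per env, parallel envs
batch_size = 100                   # PPO minibatch size

total_timesteps = horizon * episodes                    # budget passed to learn()
rollout_size = n_steps * n_processes                    # transitions per PPO iteration
iterations = math.ceil(total_timesteps / rollout_size)  # assumes complete rollouts
minibatches_per_epoch = rollout_size // batch_size

print(total_timesteps, rollout_size, iterations, minibatches_per_epoch)
# 350000 6000 59 60
```

The rollout size of 6000 matches the step by which `total_timesteps` grows between iterations in the training log. Because `batch_size` divides the rollout size evenly, no truncated minibatch is produced.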

### Train an Agent
Run the following script to train a model with the previously specified parameters. The model and TensorBoard logs are stored in a folder named after the specified parameters (`test_name`).

In [8]:
#Set up TensorBoard logger
tensor_logger = "./" + test_name + "/tensorboard"
print("TensorBoard logging to", tensor_logger)

# Set controller configuration
controller_config = load_controller_config(default_controller=parameters["controller"])

# Define the environment setup
# Make robosuite environment into a gym environment as stable baselines only supports gym environments
def make_env(env_id, options, rank, seed=0):
    def _init():
        env = GymWrapper(suite.make(env_id, **options))
        env.render_mode = 'mujoco'
        env = Monitor(env)
        env.reset(seed=seed + rank)
        return env
    set_random_seed(seed)
    return _init

# Setup environment
# Define environment parameters for the specific environment "PickPlace"
env_options = dict(
    robots=[parameters["robot"]],
    gripper_types=parameters["gripper"],
    controller_configs=controller_config,
    has_renderer=False,
    has_offscreen_renderer=True,
    control_freq=parameters["control_freq"],
    horizon=parameters["horizon"],
    use_object_obs=False,                       # don't provide object observations to agent
    use_camera_obs=True,                        # provide image observations to agent
    camera_names="agentview",                   # use "agentview" camera for observations
    camera_heights=parameters["camera_size"],   # image height
    camera_widths=parameters["camera_size"],    # image width
    reward_shaping=True,                        # use a dense reward signal for learning
)

# One subprocess per environment; the rank i offsets the seed of each copy
env = SubprocVecEnv(
    [make_env("PickPlace", env_options, i, parameters["seed"]) for i in range(parameters["n_processes"])],
    start_method=parameters["start_method"],
)

# Normalize observations and rewards with running statistics
env = VecNormalize(env)

# Initialize model for training:
if parameters["algorithm"] == "PPO":
    model = PPO(
        parameters["policy"], env, verbose=1, batch_size=parameters["batch_size"],
        gamma=parameters["gamma"], learning_rate=parameters["learning_rate"],
        n_steps=parameters["n_steps"], tensorboard_log=tensor_logger, device=device,
    )
elif parameters["algorithm"] == "DDPG":
    n_actions = env.action_space.shape[-1]
    action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
    model = DDPG(
        parameters["policy"], env, action_noise=action_noise, verbose=1,
        batch_size=parameters["batch_size"], tensorboard_log=tensor_logger, device=device,
    )
elif parameters["algorithm"] == "SAC":
    model = SAC(
        parameters["policy"], env, verbose=1, batch_size=10, train_freq=(1, "episode"),
        learning_rate=0.001, gradient_steps=1000, learning_starts=3300,
        tensorboard_log=tensor_logger, device=device,
    )
else:
    raise ValueError("Invalid algorithm specified in the configuration.")
    
'''
# Load an existing model to continue training
# Comment out the above model initialization and uncomment the following code to load an existing model
# VecNormalize.load is a static method: it wraps the raw vectorized env with the saved statistics
env = VecNormalize.load("./" + test_name + "/env.pkl", env.venv)
if parameters["algorithm"] == "PPO":
    model = PPO.load("./" + test_name + "/model.zip", env=env, tensorboard_log=tensor_logger, device=device)
elif parameters["algorithm"] == "DDPG":
    model = DDPG.load("./" + test_name + "/model.zip", env=env, tensorboard_log=tensor_logger, device=device)
elif parameters["algorithm"] == "SAC":
    model = SAC.load("./" + test_name + "/model.zip", env=env, tensorboard_log=tensor_logger, device=device)
else:
    raise ValueError("Invalid algorithm specified in the configuration.")
'''

# Train the model and save it
model.learn(total_timesteps=parameters["horizon"]*parameters["episodes"], progress_bar=True)
model.save("./" + test_name + "/model.zip")
env.save('./' + test_name + '/env.pkl')

env.close()

TensorBoard logging to ./Sawyer_freq20_hor1000_learn0.001_episodes350_controlOSC_POSE/tensorboard


Using cuda device
Logging to ./Sawyer_freq20_hor1000_learn0.001_episodes350_controlOSC_POSE/tensorboard/PPO_1


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 0.759    |
| time/              |          |
|    fps             | 229      |
|    iterations      | 1        |
|    time_elapsed    | 26       |
|    total_timesteps | 6000     |
---------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 1e+03       |
|    ep_rew_mean          | 0.918       |
| time/                   |             |
|    fps                  | 215         |
|    iterations           | 2           |
|    time_elapsed         | 55          |
|    total_timesteps      | 12000       |
| train/                  |             |
|    approx_kl            | 0.026668975 |
|    clip_fraction        | 0.237       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.92       |
|    explained_variance   | -0.457      |
|    learning_rate        | 0.001       |
|    loss                 | -0.0448     |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0192     |
|    std                  | 1           |
|    value_loss           | 0.0394      |
-----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 0.85       |
| time/                   |            |
|    fps                  | 214        |
|    iterations           | 3          |
|    time_elapsed         | 83         |
|    total_timesteps      | 18000      |
| train/                  |            |
|    approx_kl            | 0.05703262 |
|    clip_fraction        | 0.461      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.92      |
|    explained_variance   | 0.436      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00243   |
|    n_updates            | 20         |
|    policy_gradient_loss | -0.0198    |
|    std                  | 0.997      |
|    value_loss           | 0.0279     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 0.882      |
| time/                   |            |
|    fps                  | 213        |
|    iterations           | 4          |
|    time_elapsed         | 112        |
|    total_timesteps      | 24000      |
| train/                  |            |
|    approx_kl            | 0.06548039 |
|    clip_fraction        | 0.495      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.9       |
|    explained_variance   | 0.728      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0279    |
|    n_updates            | 30         |
|    policy_gradient_loss | -0.00213   |
|    std                  | 0.992      |
|    value_loss           | 0.00882    |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 0.905      |
| time/                   |            |
|    fps                  | 212        |
|    iterations           | 5          |
|    time_elapsed         | 141        |
|    total_timesteps      | 30000      |
| train/                  |            |
|    approx_kl            | 0.08299292 |
|    clip_fraction        | 0.535      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.91      |
|    explained_variance   | 0.673      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0474    |
|    n_updates            | 40         |
|    policy_gradient_loss | -0.00296   |
|    std                  | 0.997      |
|    value_loss           | 0.0104     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.09       |
| time/                   |            |
|    fps                  | 210        |
|    iterations           | 6          |
|    time_elapsed         | 170        |
|    total_timesteps      | 36000      |
| train/                  |            |
|    approx_kl            | 0.09374873 |
|    clip_fraction        | 0.542      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.9       |
|    explained_variance   | 0.723      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0163    |
|    n_updates            | 50         |
|    policy_gradient_loss | 0.00216    |
|    std                  | 0.993      |
|    value_loss           | 0.0116     |
----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 1e+03       |
|    ep_rew_mean          | 1.21        |
| time/                   |             |
|    fps                  | 209         |
|    iterations           | 7           |
|    time_elapsed         | 200         |
|    total_timesteps      | 42000       |
| train/                  |             |
|    approx_kl            | 0.110649824 |
|    clip_fraction        | 0.586       |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.92       |
|    explained_variance   | 0.538       |
|    learning_rate        | 0.001       |
|    loss                 | -0.00671    |
|    n_updates            | 60          |
|    policy_gradient_loss | 0.00238     |
|    std                  | 0.997       |
|    value_loss           | 0.0191      |
-----------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | 1.3       |
| time/                   |           |
|    fps                  | 208       |
|    iterations           | 8         |
|    time_elapsed         | 230       |
|    total_timesteps      | 48000     |
| train/                  |           |
|    approx_kl            | 0.1082494 |
|    clip_fraction        | 0.582     |
|    clip_range           | 0.2       |
|    entropy_loss         | -9.91     |
|    explained_variance   | 0.725     |
|    learning_rate        | 0.001     |
|    loss                 | -0.0303   |
|    n_updates            | 70        |
|    policy_gradient_loss | -0.00337  |
|    std                  | 0.995     |
|    value_loss           | 0.0198    |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.32       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 9          |
|    time_elapsed         | 259        |
|    total_timesteps      | 54000      |
| train/                  |            |
|    approx_kl            | 0.12367248 |
|    clip_fraction        | 0.603      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.94      |
|    explained_variance   | 0.711      |
|    learning_rate        | 0.001      |
|    loss                 | -0.02      |
|    n_updates            | 80         |
|    policy_gradient_loss | 0.0131     |
|    std                  | 0.999      |
|    value_loss           | 0.0133     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.47       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 10         |
|    time_elapsed         | 289        |
|    total_timesteps      | 60000      |
| train/                  |            |
|    approx_kl            | 0.14444776 |
|    clip_fraction        | 0.621      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.93      |
|    explained_variance   | 0.611      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0279    |
|    n_updates            | 90         |
|    policy_gradient_loss | 0.00735    |
|    std                  | 0.998      |
|    value_loss           | 0.0178     |
----------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | 1.45      |
| time/                   |           |
|    fps                  | 207       |
|    iterations           | 11        |
|    time_elapsed         | 318       |
|    total_timesteps      | 66000     |
| train/                  |           |
|    approx_kl            | 0.1435966 |
|    clip_fraction        | 0.615     |
|    clip_range           | 0.2       |
|    entropy_loss         | -9.94     |
|    explained_variance   | 0.777     |
|    learning_rate        | 0.001     |
|    loss                 | 0.0123    |
|    n_updates            | 100       |
|    policy_gradient_loss | 0.00828   |
|    std                  | 0.999     |
|    value_loss           | 0.0171    |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.48       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 12         |
|    time_elapsed         | 347        |
|    total_timesteps      | 72000      |
| train/                  |            |
|    approx_kl            | 0.14146695 |
|    clip_fraction        | 0.617      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.98      |
|    explained_variance   | 0.637      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0347    |
|    n_updates            | 110        |
|    policy_gradient_loss | 0.000255   |
|    std                  | 1.01       |
|    value_loss           | 0.0141     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.54       |
| time/                   |            |
|    fps                  | 206        |
|    iterations           | 13         |
|    time_elapsed         | 377        |
|    total_timesteps      | 78000      |
| train/                  |            |
|    approx_kl            | 0.15675753 |
|    clip_fraction        | 0.638      |
|    clip_range           | 0.2        |
|    entropy_loss         | -9.97      |
|    explained_variance   | 0.626      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0184     |
|    n_updates            | 120        |
|    policy_gradient_loss | 0.011      |
|    std                  | 1          |
|    value_loss           | 0.0147     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.63       |
| time/                   |            |
|    fps                  | 206        |
|    iterations           | 14         |
|    time_elapsed         | 406        |
|    total_timesteps      | 84000      |
| train/                  |            |
|    approx_kl            | 0.13359417 |
|    clip_fraction        | 0.608      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10        |
|    explained_variance   | 0.564      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00761    |
|    n_updates            | 130        |
|    policy_gradient_loss | 0.00503    |
|    std                  | 1.01       |
|    value_loss           | 0.0157     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.69       |
| time/                   |            |
|    fps                  | 206        |
|    iterations           | 15         |
|    time_elapsed         | 436        |
|    total_timesteps      | 90000      |
| train/                  |            |
|    approx_kl            | 0.16338567 |
|    clip_fraction        | 0.629      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.499      |
|    learning_rate        | 0.001      |
|    loss                 | -0.048     |
|    n_updates            | 140        |
|    policy_gradient_loss | 0.0111     |
|    std                  | 1.02       |
|    value_loss           | 0.0189     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.8        |
| time/                   |            |
|    fps                  | 206        |
|    iterations           | 16         |
|    time_elapsed         | 465        |
|    total_timesteps      | 96000      |
| train/                  |            |
|    approx_kl            | 0.16205487 |
|    clip_fraction        | 0.626      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10        |
|    explained_variance   | 0.514      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0417     |
|    n_updates            | 150        |
|    policy_gradient_loss | 0.00831    |
|    std                  | 1.01       |
|    value_loss           | 0.0324     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 1.91       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 17         |
|    time_elapsed         | 495        |
|    total_timesteps      | 102000     |
| train/                  |            |
|    approx_kl            | 0.15438415 |
|    clip_fraction        | 0.616      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.336      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0311    |
|    n_updates            | 160        |
|    policy_gradient_loss | 0.0121     |
|    std                  | 1.02       |
|    value_loss           | 0.0448     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.11       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 18         |
|    time_elapsed         | 525        |
|    total_timesteps      | 108000     |
| train/                  |            |
|    approx_kl            | 0.16492032 |
|    clip_fraction        | 0.646      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10        |
|    explained_variance   | 0.329      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00265    |
|    n_updates            | 170        |
|    policy_gradient_loss | 0.0225     |
|    std                  | 1.02       |
|    value_loss           | 0.0223     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.21       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 19         |
|    time_elapsed         | 555        |
|    total_timesteps      | 114000     |
| train/                  |            |
|    approx_kl            | 0.12774369 |
|    clip_fraction        | 0.595      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.311      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0224     |
|    n_updates            | 180        |
|    policy_gradient_loss | 0.000625   |
|    std                  | 1.02       |
|    value_loss           | 0.0621     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.47       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 20         |
|    time_elapsed         | 585        |
|    total_timesteps      | 120000     |
| train/                  |            |
|    approx_kl            | 0.17995135 |
|    clip_fraction        | 0.637      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | 0.394      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0258     |
|    n_updates            | 190        |
|    policy_gradient_loss | 0.0134     |
|    std                  | 1.02       |
|    value_loss           | 0.0262     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.65       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 21         |
|    time_elapsed         | 615        |
|    total_timesteps      | 126000     |
| train/                  |            |
|    approx_kl            | 0.15699045 |
|    clip_fraction        | 0.633      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.1      |
|    explained_variance   | -0.263     |
|    learning_rate        | 0.001      |
|    loss                 | 0.0455     |
|    n_updates            | 200        |
|    policy_gradient_loss | 0.0339     |
|    std                  | 1.03       |
|    value_loss           | 0.026      |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.76       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 22         |
|    time_elapsed         | 644        |
|    total_timesteps      | 132000     |
| train/                  |            |
|    approx_kl            | 0.16729341 |
|    clip_fraction        | 0.634      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.297      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0335    |
|    n_updates            | 210        |
|    policy_gradient_loss | 0.0201     |
|    std                  | 1.04       |
|    value_loss           | 0.0232     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 2.94       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 23         |
|    time_elapsed         | 675        |
|    total_timesteps      | 138000     |
| train/                  |            |
|    approx_kl            | 0.12771124 |
|    clip_fraction        | 0.589      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.567      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0205    |
|    n_updates            | 220        |
|    policy_gradient_loss | -0.000665  |
|    std                  | 1.04       |
|    value_loss           | 0.0214     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.12       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 24         |
|    time_elapsed         | 705        |
|    total_timesteps      | 144000     |
| train/                  |            |
|    approx_kl            | 0.13611262 |
|    clip_fraction        | 0.583      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.64       |
|    learning_rate        | 0.001      |
|    loss                 | -0.0209    |
|    n_updates            | 230        |
|    policy_gradient_loss | 0.00431    |
|    std                  | 1.04       |
|    value_loss           | 0.0244     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.26       |
| time/                   |            |
|    fps                  | 203        |
|    iterations           | 25         |
|    time_elapsed         | 735        |
|    total_timesteps      | 150000     |
| train/                  |            |
|    approx_kl            | 0.15218262 |
|    clip_fraction        | 0.617      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.188      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0256     |
|    n_updates            | 240        |
|    policy_gradient_loss | 0.0228     |
|    std                  | 1.04       |
|    value_loss           | 0.0248     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.43       |
| time/                   |            |
|    fps                  | 203        |
|    iterations           | 26         |
|    time_elapsed         | 766        |
|    total_timesteps      | 156000     |
| train/                  |            |
|    approx_kl            | 0.14546765 |
|    clip_fraction        | 0.583      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.355      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0155     |
|    n_updates            | 250        |
|    policy_gradient_loss | 0.00317    |
|    std                  | 1.04       |
|    value_loss           | 0.0214     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.45       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 27         |
|    time_elapsed         | 793        |
|    total_timesteps      | 162000     |
| train/                  |            |
|    approx_kl            | 0.18506335 |
|    clip_fraction        | 0.621      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.255      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00532   |
|    n_updates            | 260        |
|    policy_gradient_loss | 0.00972    |
|    std                  | 1.04       |
|    value_loss           | 0.0237     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.57       |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 28         |
|    time_elapsed         | 821        |
|    total_timesteps      | 168000     |
| train/                  |            |
|    approx_kl            | 0.16220658 |
|    clip_fraction        | 0.617      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.538      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0581    |
|    n_updates            | 270        |
|    policy_gradient_loss | 0.00915    |
|    std                  | 1.04       |
|    value_loss           | 0.0198     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.69       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 29         |
|    time_elapsed         | 848        |
|    total_timesteps      | 174000     |
| train/                  |            |
|    approx_kl            | 0.20268957 |
|    clip_fraction        | 0.645      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.542      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0218    |
|    n_updates            | 280        |
|    policy_gradient_loss | 0.00897    |
|    std                  | 1.04       |
|    value_loss           | 0.0173     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.76       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 30         |
|    time_elapsed         | 875        |
|    total_timesteps      | 180000     |
| train/                  |            |
|    approx_kl            | 0.16515535 |
|    clip_fraction        | 0.615      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.365      |
|    learning_rate        | 0.001      |
|    loss                 | 0.028      |
|    n_updates            | 290        |
|    policy_gradient_loss | 0.00785    |
|    std                  | 1.04       |
|    value_loss           | 0.0205     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.84       |
| time/                   |            |
|    fps                  | 205        |
|    iterations           | 31         |
|    time_elapsed         | 903        |
|    total_timesteps      | 186000     |
| train/                  |            |
|    approx_kl            | 0.19777937 |
|    clip_fraction        | 0.651      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.2      |
|    explained_variance   | 0.304      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0264    |
|    n_updates            | 300        |
|    policy_gradient_loss | 0.0113     |
|    std                  | 1.04       |
|    value_loss           | 0.0346     |
----------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | 3.85      |
| time/                   |           |
|    fps                  | 206       |
|    iterations           | 32        |
|    time_elapsed         | 930       |
|    total_timesteps      | 192000    |
| train/                  |           |
|    approx_kl            | 0.1635237 |
|    clip_fraction        | 0.63      |
|    clip_range           | 0.2       |
|    entropy_loss         | -10.3     |
|    explained_variance   | 0.327     |
|    learning_rate        | 0.001     |
|    loss                 | 0.00206   |
|    n_updates            | 310       |
|    policy_gradient_loss | 0.00426   |
|    std                  | 1.05      |
|    value_loss           | 0.0249    |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.84       |
| time/                   |            |
|    fps                  | 206        |
|    iterations           | 33         |
|    time_elapsed         | 957        |
|    total_timesteps      | 198000     |
| train/                  |            |
|    approx_kl            | 0.16776043 |
|    clip_fraction        | 0.621      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.187      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0231     |
|    n_updates            | 320        |
|    policy_gradient_loss | 0.00514    |
|    std                  | 1.05       |
|    value_loss           | 0.021      |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.78       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 34         |
|    time_elapsed         | 984        |
|    total_timesteps      | 204000     |
| train/                  |            |
|    approx_kl            | 0.17489688 |
|    clip_fraction        | 0.63       |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.304      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0299    |
|    n_updates            | 330        |
|    policy_gradient_loss | -0.00605   |
|    std                  | 1.05       |
|    value_loss           | 0.0223     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.73       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 35         |
|    time_elapsed         | 1011       |
|    total_timesteps      | 210000     |
| train/                  |            |
|    approx_kl            | 0.18415803 |
|    clip_fraction        | 0.623      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.323      |
|    learning_rate        | 0.001      |
|    loss                 | -0.00414   |
|    n_updates            | 340        |
|    policy_gradient_loss | -0.00715   |
|    std                  | 1.05       |
|    value_loss           | 0.0193     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.77       |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 36         |
|    time_elapsed         | 1039       |
|    total_timesteps      | 216000     |
| train/                  |            |
|    approx_kl            | 0.24030466 |
|    clip_fraction        | 0.656      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.3      |
|    explained_variance   | 0.465      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00526    |
|    n_updates            | 350        |
|    policy_gradient_loss | 0.0027     |
|    std                  | 1.06       |
|    value_loss           | 0.0166     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.71       |
| time/                   |            |
|    fps                  | 208        |
|    iterations           | 37         |
|    time_elapsed         | 1066       |
|    total_timesteps      | 222000     |
| train/                  |            |
|    approx_kl            | 0.18562958 |
|    clip_fraction        | 0.636      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.151      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0413    |
|    n_updates            | 360        |
|    policy_gradient_loss | 0.00371    |
|    std                  | 1.06       |
|    value_loss           | 0.0221     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.66       |
| time/                   |            |
|    fps                  | 208        |
|    iterations           | 38         |
|    time_elapsed         | 1093       |
|    total_timesteps      | 228000     |
| train/                  |            |
|    approx_kl            | 0.24029727 |
|    clip_fraction        | 0.67       |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.16       |
|    learning_rate        | 0.001      |
|    loss                 | 0.0234     |
|    n_updates            | 370        |
|    policy_gradient_loss | 0.0214     |
|    std                  | 1.07       |
|    value_loss           | 0.0184     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.66       |
| time/                   |            |
|    fps                  | 208        |
|    iterations           | 39         |
|    time_elapsed         | 1120       |
|    total_timesteps      | 234000     |
| train/                  |            |
|    approx_kl            | 0.22140989 |
|    clip_fraction        | 0.637      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.62       |
|    learning_rate        | 0.001      |
|    loss                 | -0.0338    |
|    n_updates            | 380        |
|    policy_gradient_loss | 0.0132     |
|    std                  | 1.07       |
|    value_loss           | 0.0212     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.64       |
| time/                   |            |
|    fps                  | 209        |
|    iterations           | 40         |
|    time_elapsed         | 1148       |
|    total_timesteps      | 240000     |
| train/                  |            |
|    approx_kl            | 0.29255828 |
|    clip_fraction        | 0.635      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.591      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0634    |
|    n_updates            | 390        |
|    policy_gradient_loss | 0.00953    |
|    std                  | 1.07       |
|    value_loss           | 0.0184     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.5        |
| time/                   |            |
|    fps                  | 209        |
|    iterations           | 41         |
|    time_elapsed         | 1175       |
|    total_timesteps      | 246000     |
| train/                  |            |
|    approx_kl            | 0.22916292 |
|    clip_fraction        | 0.644      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.418      |
|    learning_rate        | 0.001      |
|    loss                 | -0.018     |
|    n_updates            | 400        |
|    policy_gradient_loss | 0.0131     |
|    std                  | 1.08       |
|    value_loss           | 0.0242     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.48       |
| time/                   |            |
|    fps                  | 209        |
|    iterations           | 42         |
|    time_elapsed         | 1202       |
|    total_timesteps      | 252000     |
| train/                  |            |
|    approx_kl            | 0.27839416 |
|    clip_fraction        | 0.659      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.634      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0158     |
|    n_updates            | 410        |
|    policy_gradient_loss | 0.00622    |
|    std                  | 1.08       |
|    value_loss           | 0.0204     |
----------------------------------------


--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 1e+03    |
|    ep_rew_mean          | 3.5      |
| time/                   |          |
|    fps                  | 209      |
|    iterations           | 43       |
|    time_elapsed         | 1229     |
|    total_timesteps      | 258000   |
| train/                  |          |
|    approx_kl            | 0.259545 |
|    clip_fraction        | 0.648    |
|    clip_range           | 0.2      |
|    entropy_loss         | -10.5    |
|    explained_variance   | 0.72     |
|    learning_rate        | 0.001    |
|    loss                 | -0.0727  |
|    n_updates            | 420      |
|    policy_gradient_loss | 0.0116   |
|    std                  | 1.08     |
|    value_loss           | 0.0182   |
--------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | 3.54      |
| time/                   |           |
|    fps                  | 210       |
|    iterations           | 44        |
|    time_elapsed         | 1256      |
|    total_timesteps      | 264000    |
| train/                  |           |
|    approx_kl            | 0.1997951 |
|    clip_fraction        | 0.642     |
|    clip_range           | 0.2       |
|    entropy_loss         | -10.5     |
|    explained_variance   | 0.424     |
|    learning_rate        | 0.001     |
|    loss                 | 0.0262    |
|    n_updates            | 430       |
|    policy_gradient_loss | 0.0135    |
|    std                  | 1.08      |
|    value_loss           | 0.0281    |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.53       |
| time/                   |            |
|    fps                  | 210        |
|    iterations           | 45         |
|    time_elapsed         | 1284       |
|    total_timesteps      | 270000     |
| train/                  |            |
|    approx_kl            | 0.35273615 |
|    clip_fraction        | 0.677      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.618      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00809    |
|    n_updates            | 440        |
|    policy_gradient_loss | 0.0207     |
|    std                  | 1.08       |
|    value_loss           | 0.026      |
----------------------------------------


--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 1e+03    |
|    ep_rew_mean          | 3.5      |
| time/                   |          |
|    fps                  | 210      |
|    iterations           | 46       |
|    time_elapsed         | 1312     |
|    total_timesteps      | 276000   |
| train/                  |          |
|    approx_kl            | 0.315245 |
|    clip_fraction        | 0.68     |
|    clip_range           | 0.2      |
|    entropy_loss         | -10.5    |
|    explained_variance   | 0.49     |
|    learning_rate        | 0.001    |
|    loss                 | 0.136    |
|    n_updates            | 450      |
|    policy_gradient_loss | 0.0125   |
|    std                  | 1.08     |
|    value_loss           | 0.019    |
--------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.46       |
| time/                   |            |
|    fps                  | 210        |
|    iterations           | 47         |
|    time_elapsed         | 1339       |
|    total_timesteps      | 282000     |
| train/                  |            |
|    approx_kl            | 0.21695186 |
|    clip_fraction        | 0.636      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.686      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0191    |
|    n_updates            | 460        |
|    policy_gradient_loss | 0.0131     |
|    std                  | 1.07       |
|    value_loss           | 0.0208     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.54       |
| time/                   |            |
|    fps                  | 210        |
|    iterations           | 48         |
|    time_elapsed         | 1367       |
|    total_timesteps      | 288000     |
| train/                  |            |
|    approx_kl            | 0.23236792 |
|    clip_fraction        | 0.649      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.754      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0329    |
|    n_updates            | 470        |
|    policy_gradient_loss | 0.0115     |
|    std                  | 1.08       |
|    value_loss           | 0.0216     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.53       |
| time/                   |            |
|    fps                  | 210        |
|    iterations           | 49         |
|    time_elapsed         | 1394       |
|    total_timesteps      | 294000     |
| train/                  |            |
|    approx_kl            | 0.30077288 |
|    clip_fraction        | 0.661      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.4      |
|    explained_variance   | 0.227      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0435     |
|    n_updates            | 480        |
|    policy_gradient_loss | 0.023      |
|    std                  | 1.08       |
|    value_loss           | 0.0255     |
----------------------------------------


--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 1e+03    |
|    ep_rew_mean          | 3.55     |
| time/                   |          |
|    fps                  | 211      |
|    iterations           | 50       |
|    time_elapsed         | 1421     |
|    total_timesteps      | 300000   |
| train/                  |          |
|    approx_kl            | 0.239525 |
|    clip_fraction        | 0.655    |
|    clip_range           | 0.2      |
|    entropy_loss         | -10.5    |
|    explained_variance   | 0.533    |
|    learning_rate        | 0.001    |
|    loss                 | 0.0392   |
|    n_updates            | 490      |
|    policy_gradient_loss | 0.00983  |
|    std                  | 1.08     |
|    value_loss           | 0.0256   |
--------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.59       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 51         |
|    time_elapsed         | 1448       |
|    total_timesteps      | 306000     |
| train/                  |            |
|    approx_kl            | 0.30628288 |
|    clip_fraction        | 0.683      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.475      |
|    learning_rate        | 0.001      |
|    loss                 | 0.00379    |
|    n_updates            | 500        |
|    policy_gradient_loss | 0.0263     |
|    std                  | 1.09       |
|    value_loss           | 0.017      |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.54       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 52         |
|    time_elapsed         | 1476       |
|    total_timesteps      | 312000     |
| train/                  |            |
|    approx_kl            | 0.24640726 |
|    clip_fraction        | 0.672      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.6      |
|    explained_variance   | 0.692      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0371     |
|    n_updates            | 510        |
|    policy_gradient_loss | 0.016      |
|    std                  | 1.1        |
|    value_loss           | 0.0188     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.58       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 53         |
|    time_elapsed         | 1503       |
|    total_timesteps      | 318000     |
| train/                  |            |
|    approx_kl            | 0.24295744 |
|    clip_fraction        | 0.654      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.6      |
|    explained_variance   | 0.773      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0128    |
|    n_updates            | 520        |
|    policy_gradient_loss | 0.0225     |
|    std                  | 1.09       |
|    value_loss           | 0.0176     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.74       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 54         |
|    time_elapsed         | 1531       |
|    total_timesteps      | 324000     |
| train/                  |            |
|    approx_kl            | 0.17424461 |
|    clip_fraction        | 0.608      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.656      |
|    learning_rate        | 0.001      |
|    loss                 | 0.0538     |
|    n_updates            | 530        |
|    policy_gradient_loss | 0.0158     |
|    std                  | 1.09       |
|    value_loss           | 0.0207     |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.78       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 55         |
|    time_elapsed         | 1558       |
|    total_timesteps      | 330000     |
| train/                  |            |
|    approx_kl            | 0.31925362 |
|    clip_fraction        | 0.654      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.5      |
|    explained_variance   | 0.333      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0279    |
|    n_updates            | 540        |
|    policy_gradient_loss | 0.02       |
|    std                  | 1.1        |
|    value_loss           | 0.104      |
----------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.79       |
| time/                   |            |
|    fps                  | 211        |
|    iterations           | 56         |
|    time_elapsed         | 1586       |
|    total_timesteps      | 336000     |
| train/                  |            |
|    approx_kl            | 0.18121004 |
|    clip_fraction        | 0.601      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.6      |
|    explained_variance   | 0.602      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0237    |
|    n_updates            | 550        |
|    policy_gradient_loss | -0.00358   |
|    std                  | 1.1        |
|    value_loss           | 0.0346     |
----------------------------------------


--------------------------------------
| rollout/                |          |
|    ep_len_mean          | 1e+03    |
|    ep_rew_mean          | 3.75     |
| time/                   |          |
|    fps                  | 211      |
|    iterations           | 57       |
|    time_elapsed         | 1614     |
|    total_timesteps      | 342000   |
| train/                  |          |
|    approx_kl            | 0.671455 |
|    clip_fraction        | 0.68     |
|    clip_range           | 0.2      |
|    entropy_loss         | -10.6    |
|    explained_variance   | 0.0391   |
|    learning_rate        | 0.001    |
|    loss                 | -0.0498  |
|    n_updates            | 560      |
|    policy_gradient_loss | 0.015    |
|    std                  | 1.09     |
|    value_loss           | 0.0318   |
--------------------------------------


---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 1e+03     |
|    ep_rew_mean          | 3.83      |
| time/                   |           |
|    fps                  | 211       |
|    iterations           | 58        |
|    time_elapsed         | 1641      |
|    total_timesteps      | 348000    |
| train/                  |           |
|    approx_kl            | 0.1779277 |
|    clip_fraction        | 0.615     |
|    clip_range           | 0.2       |
|    entropy_loss         | -10.6     |
|    explained_variance   | 0.829     |
|    learning_rate        | 0.001     |
|    loss                 | -0.0136   |
|    n_updates            | 570       |
|    policy_gradient_loss | 0.00712   |
|    std                  | 1.1       |
|    value_loss           | 0.0271    |
---------------------------------------


----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 1e+03      |
|    ep_rew_mean          | 3.77       |
| time/                   |            |
|    fps                  | 212        |
|    iterations           | 59         |
|    time_elapsed         | 1668       |
|    total_timesteps      | 354000     |
| train/                  |            |
|    approx_kl            | 0.30114806 |
|    clip_fraction        | 0.666      |
|    clip_range           | 0.2        |
|    entropy_loss         | -10.6      |
|    explained_variance   | 0.324      |
|    learning_rate        | 0.001      |
|    loss                 | -0.0173    |
|    n_updates            | 580        |
|    policy_gradient_loss | 0.0123     |
|    std                  | 1.11       |
|    value_loss           | 0.0433     |
----------------------------------------


### Tensorboard
The following command starts a locally hosted TensorBoard HTTP server.
Navigate to http://localhost:6006 to view the data logged during training.

In [9]:
!python3 -m tensorboard.main --logdir={'./' + test_name + '/'}

#% python -m tensorboard.main --logdir=tensor_logger

2024-08-18 14:53:13.772352: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-18 14:53:13.787492: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-18 14:53:13.792034: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-18 14:53:13.803108: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1723985595.611053    7126 cuda_executor.c

### Employ an Agent
The following script loads the trained model that belongs to the specified parameters and uses it to execute the task. If a trained agent exists for this configuration, it is run in the specified environment:


In [10]:
if not os.path.isdir('./' + test_name):
    print("No model found for this configuration. Train a model first!")
else:
    print('Using model ' + test_name)

    # Set controller configuration
    controller_config = load_controller_config(default_controller=parameters["controller"])

    # Define the environment setup
    # Make robosuite environment into a gym environment as stable baselines only supports gym environments
    def make_env(env_id, options, rank, seed=0):
        def _init():
            env = GymWrapper(suite.make(env_id, **options))
            env.render_mode = 'mujoco'
            env = Monitor(env)
            env.reset(seed=seed + rank)
            return env
        set_random_seed(seed)
        return _init

    # Setup environment
    # Define environment parameters for specific environment "PickPlace"
    env = SubprocVecEnv([make_env(
        "PickPlace",
        dict(
            robots=[parameters["robot"]],                      
            gripper_types=parameters["gripper"],                
            controller_configs=controller_config,   
            has_renderer=True,
            has_offscreen_renderer=True,
            control_freq=parameters["control_freq"],
            horizon=parameters["horizon"],
            use_object_obs=False,                       # don't provide object observations to agent
            use_camera_obs=True,                        # provide image observations to agent
            camera_names="agentview",                   # use "agentview" camera for observations
            camera_heights=parameters["camera_size"],   # image height
            camera_widths=parameters["camera_size"],    # image width
            reward_shaping=True),                       # use a dense reward signal for learning
            i,
            parameters["seed"]
            ) for i in range(parameters["n_eval_processes"])], start_method=parameters["start_method"])
    
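    # Restore the normalization statistics saved during training.
    # VecNormalize.load is a static method and returns the wrapped environment.
    env = VecNormalize.load("./" + test_name + '/env.pkl', env)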

    if parameters["algorithm"] == "PPO":
        model = PPO.load("./" + test_name + "/model.zip", env=env, device=device)
    elif parameters["algorithm"] == "DDPG":
        model = DDPG.load("./" + test_name + "/model.zip", env=env, device=device)
    elif parameters["algorithm"] == "SAC":
        model = SAC.load("./" + test_name + "/model.zip", env=env, device=device)
    else:
        raise ValueError("Invalid algorithm specified in the configuration.")
    
    def get_policy_action(obs):
        action, _states = model.predict(obs, deterministic=True)
        return action

    # reset the environment to prepare for a rollout
    env.training = False
    env.norm_reward = False
    episode_rewards = []
    eval_episodes = parameters["eval_episodes"]
    for i_episode in range(eval_episodes):
        obs = env.reset()
        total_reward = 0
        for t in range(parameters["horizon"]):
            env.render()
            action = get_policy_action(obs)   # use observation to decide on an action
            obs, reward, done, info = env.step(action) # play action
            total_reward += reward
            if done.all():
                print("Episode finished after {} timesteps".format(t+1))
                break
        episode_rewards.append(total_reward)
    average_reward_per_environment = sum(episode_rewards) / len(episode_rewards)
    average_reward = np.mean(average_reward_per_environment)
    print(f"Iteration {i_episode+1}/{eval_episodes}, Average Reward per Environment: {average_reward_per_environment}, Average Reward: {average_reward}")
    
    # Close environment
    env.close()

Using model Sawyer_freq20_hor1000_learn0.001_episodes350_controlOSC_POSE


2024-08-18 14:53:27.333270: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-18 14:53:27.348087: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-18 14:53:27.352591: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-18 14:53:27.363025: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Qt: Session management error: Could not open network 

Episode finished after 1000 timesteps
Iteration 1/1, Average Reward per Environment: [0.13878148], Average Reward: 0.1387814794874347


### Insights and further testing

In these initial tests, we tried a variety of robot configurations and parameters, evaluating them by visual inspection and by the total reward collected per episode.
We identified the Sawyer robot with its default gripper and the PPO algorithm as our most promising candidate. An additional insight is that the number of steps taken until a policy update, the horizon, and the control frequency of the robot each influence the performance of the agent significantly.
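
To make the interplay of these three quantities concrete, the following sketch derives them from the numbers in the training log above (`total_timesteps`, `iterations`, `fps`) and from the model name `Sawyer_freq20_hor1000_...`. It assumes that the name encodes a control frequency of 20 Hz and a horizon of 1000 steps, and that every iteration collects the same number of steps, so the results are rough estimates rather than exact values.

```python
# Values read off the training log and the model name shown above.
# Assumption: "freq20" and "hor1000" in the model name are the control
# frequency (actions per simulated second) and the horizon (steps per episode).
horizon = 1000
control_freq = 20
total_timesteps = 330_000   # logged at iteration 55
iterations = 55
fps = 211                   # environment steps per wall-clock second

# Environment steps collected (over all parallel environments) per policy update
steps_per_update = total_timesteps // iterations

# Number of full episodes that fit into one update
episodes_per_update = steps_per_update / horizon

# Simulated time covered by one episode, in seconds
sim_seconds_per_episode = horizon / control_freq

# Approximate wall-clock time needed to collect the data for one update
wall_seconds_per_update = steps_per_update / fps

print("steps per update:        ", steps_per_update)
print("episodes per update:     ", episodes_per_update)
print("simulated s per episode: ", sim_seconds_per_episode)
print("wall-clock s per update: ", round(wall_seconds_per_update, 1))
```

Under these assumptions, changing the horizon alters how many episodes fit into one policy update, while changing the control frequency alters how much simulated time each step covers. This is one reason why the three parameters cannot be tuned independently.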

Further tests were conducted, but the lack of computing performance and the parameters being highly correlated with each other served as a strong bottleneck in solving this high-level task. It is easy to overlook configurations which would enable better performance when the parameters correlate with each other. In most cases, changing a parameter requires adapting the other parameters, otherwise the agent might even perform worse. The success of trying random combinations of parameters manually is very limited, since there is a very high number of possible parameter configurations.
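
As a rough illustration of how large the space of configurations is, the sketch below counts only the categorical choices used in the Optuna script further down in this notebook. The time estimate is an assumption: it supposes that every run trains for 300,000 steps at roughly the 211 fps reported in the training log above, and it ignores the continuous parameters (`control_freq`, `ent_coef`, `vf_coef`), which enlarge the space further.

```python
import math

# Categorical choices copied from the Optuna objective in this notebook
categorical_choices = {
    "batch_size": [32, 64, 128, 256],
    "gamma": [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999],
    "n_steps": [512, 1024, 2048],
    "clip_range": [0.1, 0.2, 0.3, 0.4],
    "gae_lambda": [0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0],
    "max_grad_norm": [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5],
    "net_arch": ["tiny", "small", "medium"],
    "activation_fn": ["tanh", "relu"],
}

# Size of the full grid: product of the number of choices per parameter
n_combinations = math.prod(len(choices) for choices in categorical_choices.values())

# Assumed cost of a single training run (300,000 steps at ~211 steps per second)
seconds_per_run = 300_000 / 211
years = n_combinations * seconds_per_run / (3600 * 24 * 365)

print(f"{n_combinations} combinations, about {years:.1f} years of sequential training")
```

Even with this simplified count, an exhaustive manual search is clearly out of reach, which motivates a guided search over a limited number of trials.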

TODO: add some simulation examples with TensorBoard graphs:

In [None]:
# replay of Panda being clumsy with its gripper head

In [None]:
# replay of IIWA bugging

In [None]:
# replay of 20hz control freq

In [None]:
# replay of 100hz control freq

In [None]:
# replay of 250hz control freq

## Hyperparameter Tuning with Optuna
### Installing Optuna
To achieve a higher agent performance despite the correlated parameters, a wider range of parameter combinations needs to be evaluated.

The [Optuna hyperparameter optimization](https://optuna.org) framework makes this task feasible by automating the hyperparameter search. For each run, called a trial, Optuna samples a value for each parameter from a specified range and a model is trained with these parameters. The performance of the model is then evaluated based on the mean reward. After a specified number of trials, Optuna reports which hyperparameters led to the best performance, which have the highest influence on model performance, and how they correlate with each other.

Install it by running the following command:

In [None]:
%pip install optuna
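
The trial loop described above can be illustrated without Optuna itself. The following is a minimal sketch that uses plain random search from the Python standard library as a stand-in for Optuna's sampler. The search space, the function names, and the toy objective (which replaces "train a model and return its mean reward") are all hypothetical and only meant to show the sample, evaluate, and compare cycle.

```python
import random

# Hypothetical search space, loosely mirroring a few PPO parameters
SEARCH_SPACE = {
    "batch_size": [32, 64, 128, 256],
    "gamma": [0.9, 0.95, 0.99],
}


def sample_params(rng):
    """Draw one value per parameter, as a single trial would."""
    params = {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
    params["vf_coef"] = rng.uniform(0.0, 1.0)  # continuous range
    return params


def toy_objective(params):
    """Stand-in for training and evaluating a model.

    The score is 0 at batch_size=128, gamma=0.99, vf_coef=0.5
    and negative everywhere else.
    """
    score = -abs(params["batch_size"] - 128) / 256
    score -= abs(params["gamma"] - 0.99)
    score -= abs(params["vf_coef"] - 0.5)
    return score


def random_search(n_trials, seed=0):
    """Run n_trials independent trials and return the best one."""
    rng = random.Random(seed)
    trials = []
    for number in range(n_trials):
        params = sample_params(rng)
        trials.append((toy_objective(params), number, params))
    best_value, _, best_params = max(trials, key=lambda t: t[0])
    return best_value, best_params, trials


best_value, best_params, trials = random_search(n_trials=50)
print("Best value: ", best_value)
print("Best params:", best_params)
```

Optuna follows the same cycle but uses the results of earlier trials to decide which values to sample next, and it stores every trial so that the results can be analyzed afterwards.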

### Running Optuna
Running the following script executes 200 Optuna trials. Parameter ranges for the PPO algorithm are taken from the [RL Baselines3 Zoo repository](https://github.com/DLR-RM/rl-baselines3-zoo/blob/726e2f1d3f1a6ea58ad4ae61c02a4ba71d241e4b/rl_zoo3/hyperparams_opt.py#L11C5-L11C22). To reduce the hyperparameter search space, i.e. to limit the number of trials, we either kept certain parameters fixed or narrowed their range based on insights gathered from previous tests.

In [None]:
import yaml
from torch import nn as nn
import optuna
from optuna.visualization import plot_optimization_history


# Set name of study
study_name = "study_sawyer_pickplace"
storage_name = "sqlite:///{}.db".format(study_name)

# Load configuration
with open("config_hyperparams.yaml") as stream:
    config = yaml.safe_load(stream)

# Method to evaluate the policy
def evaluate_policy(model, env, n_eval_episodes=5):
    all_episode_rewards = []
    for _ in range(n_eval_episodes):
        episode_rewards = []
        done = np.array([False])
        obs = env.reset()
        while not done.all():
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)
        all_episode_rewards.append(np.sum(episode_rewards))
    mean_reward = np.mean(all_episode_rewards)
    return mean_reward

def objective(trial):
    # Suggest hyperparameters
    #learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    learning_rate = 0.001
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999])
    n_steps = trial.suggest_categorical('n_steps', [512, 1024, 2048])
    #horizon = trial.suggest_categorical('horizon', [512, 1024, 2048])
    horizon = 512
    # suggest_uniform is deprecated in Optuna 3; suggest_float samples the same uniform range
    control_freq = trial.suggest_float('control_freq', 100, 150)
    #total_timesteps = trial.suggest_categorical('total_timesteps', [1e5, 2e5, 5e5, 1e6, 2e6])
    total_timesteps = int(3e5)  # integer, since it is also used as the step in trial.report
    ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
    clip_range = trial.suggest_categorical("clip_range", [0.1, 0.2, 0.3, 0.4])
    #n_epochs = trial.suggest_categorical("n_epochs", [1, 5, 10, 20])
    gae_lambda = trial.suggest_categorical("gae_lambda", [0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0])
    max_grad_norm = trial.suggest_categorical("max_grad_norm", [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5])
    vf_coef = trial.suggest_float("vf_coef", 0, 1)
    net_arch_type = trial.suggest_categorical("net_arch", ["tiny", "small", "medium"])

    print(
        f"Learning rate: {learning_rate}, "
        f"Batch size: {batch_size}, "
        f"Gamma: {gamma}, "
        f"N steps: {n_steps}, "
        f"Horizon: {horizon}, "
        f"Control freq: {control_freq}, "
        f"Total timesteps: {total_timesteps}, "
        f"Entropy coefficient: {ent_coef}, "
        f"Clip range: {clip_range}, "
        f"GAE lambda: {gae_lambda}, "
        f"Max grad norm: {max_grad_norm}, "
        f"Value function coefficient: {vf_coef}, "
        f"Network architecture: {net_arch_type}")

    # Set controller configuration
    controller_config = load_controller_config(default_controller=config["controller"])

    # Setup environment
    # Define environment parameters for specific environment "PickPlace"
    env_options = {
        "robots": config["robot_name"],
        "controller_configs": controller_config,
        "gripper_types": config["gripper"],
        "has_renderer": False,
        "has_offscreen_renderer": True,
        "single_object_mode": 2,
        "object_type": "milk",
        "use_camera_obs": True,         # provide image observations to agent
        "use_object_obs": False,        # don't provide object observations to agent
        "camera_names": "agentview",    # use "agentview" camera for observations
        "camera_heights": 128,          # image height
        "camera_widths": 128,           # image width
        "reward_shaping": True,         # use a dense reward signal for learning
        "horizon": horizon,
        "control_freq": control_freq,
    }    
    
    # Setup environment
    env = SubprocVecEnv([make_env("PickPlace", env_options, i, config["seed"]) for i in range(config["num_envs"])], start_method='spawn') #remove start_method='spawn' if you are not training on MPS
    env = VecNormalize(env)

    # TODO: account when using multiple envs
    if batch_size > n_steps:
        batch_size = n_steps

    # Check if cuda(linux) or mps(mac) is available
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Cuda backend is available.")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Mps backend is available.")
    else:
        device = torch.device("cpu")
        print("Cuda backend is not available, using CPU.")

    # Orthogonal initialization
    ortho_init = False

    activation_fn_name = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    # Independent networks usually work best when not working with images
    net_arch = {
        "tiny": dict(pi=[64], vf=[64]),
        "small": dict(pi=[64, 64], vf=[64, 64]),
        "medium": dict(pi=[256, 256], vf=[256, 256]),
    }[net_arch_type]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU, "elu": nn.ELU, "leaky_relu": nn.LeakyReLU}[activation_fn_name]

    # Initialize model
    if config["algorithm"] == "PPO":
        model = PPO(config["policy"],
                    env,
                    learning_rate=learning_rate,
                    batch_size=batch_size,
                    gamma=gamma,
                    n_steps=n_steps,
                    ent_coef=ent_coef,
                    clip_range=clip_range,
                    gae_lambda=gae_lambda,
                    max_grad_norm=max_grad_norm,
                    vf_coef=vf_coef,
                    policy_kwargs=dict(
                                    net_arch=net_arch,
                                    activation_fn=activation_fn,
                                    ortho_init=ortho_init,
                                    ),
                    verbose=0,
                    tensorboard_log=None,
                    device=device
                    )
    elif config["algorithm"] == "DDPG":
        n_actions = env.action_space.shape[-1]
        action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
        model = DDPG(config["policy"], env, action_noise=action_noise, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, verbose=0, tensorboard_log=None, device=device)
    elif config["algorithm"] == "SAC":
        model = SAC(config["policy"], env, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, verbose=0, tensorboard_log=None, device=device)
    else:
        raise ValueError("Invalid algorithm specified in the configuration.")

    # Train the model
    model.learn(total_timesteps=total_timesteps, progress_bar=True)
    env.close()

    # Setup evaluation environment
    # Define environment parameters for specific environment "PickPlace"
    eval_env = SubprocVecEnv([make_env("PickPlace", env_options, i, config["seed"]) for i in range(config["num_eval_envs"])], start_method='spawn') #remove start_method='spawn' if you are not training on MPS
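    # Evaluate with frozen normalization statistics and unnormalized rewards
    eval_env = VecNormalize(eval_env, training=False, norm_reward=False)
    # Reuse the observation statistics gathered during training
    eval_env.obs_rms = env.obs_rms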

    # Evaluate the model
    mean_reward = evaluate_policy(model, eval_env, config["n_eval_episodes"])
    print("Mean reward: ", mean_reward)
    eval_env.close()

    trial.report(mean_reward, step=total_timesteps)

    #Handle pruning based on the intermediate value
    if trial.should_prune():
        raise optuna.exceptions.TrialPruned()

    return mean_reward

# Optimize hyperparameters
study = optuna.create_study(direction='maximize', study_name=study_name, storage=storage_name, load_if_exists=True)
study.optimize(objective, config["n_trials"])

pruned_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
complete_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]

print('Study statistics: ')
print('  Number of finished trials: ', len(study.trials))
print('  Number of pruned trials: ', len(pruned_trials))
print('  Number of complete trials: ', len(complete_trials))

print('Best trial: ')
trial = study.best_trial

print('  Value: ', trial.value)
print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

print('Best hyperparameters: ', study.best_params)

plot_optimization_history(study)

We let Optuna run for 48 hours. The logs are also uploaded to this repository. See the [Analysis](#analysis) section for access to the dashboard and our analysis of the results.


### Optuna Dashboard
The Optuna dashboard visualizes the logged results of the Optuna run.
Start it with the following command and open the URL it prints (by default http://127.0.0.1:8080) in your browser.

In [None]:
!optuna-dashboard sqlite:///study_sawyer_pickplace.db

### Analysis

In the _ section we can see that the parameters _, _, _ are especially important for the earned rewards. 

### Testing Optimized Parameters

Let's see how our model with optimized parameters performs. The following script starts a replay of one of our most successful runs. We can see ...

In [None]:
# Run replay

## Conclusion