<a href="https://colab.research.google.com/github/AI4Finance-LLC/ElegantRL/blob/master/Reacher_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reacher-v2 Example in ElegantRL**






# **Part 1: Problem Definition**

The goal of the [Reacher-v2](https://gym.openai.com/envs/Reacher-v2/) task is straightforward: a 2D robot arm must reach a randomly located target. This notebook trains on `ReacherBulletEnv-v0`, the PyBullet version of the task.

# **Part 2: Install all the packages through ElegantRL**

In [1]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-6febpv_h
  Running command git clone -q https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-6febpv_h
Collecting pybullet
  Downloading https://files.pythonhosted.org/packages/6b/b6/719c6e1741fe6126c99d9f3a96fbb9f024ec12a60e6718843f33c7cab1b0/pybullet-3.0.8-cp37-cp37m-manylinux1_x86_64.whl (76.6MB)
     |████████████████████████████████| 76.6MB 69kB/s 
Building wheels for collected packages: elegantrl
  Building wheel for elegantrl (setup.py) ... done
  Created wheel for elegantrl: filename=elegantrl-0.3.1-cp37-none-any.whl size=75050 sha256=7d646391383a02fe96e55c50d0aa328d226d5e4bcc2cf17589bc9353498c6eee
  Stored in directory: /tmp/pip-ephem-wheel-cache-fd0upk1_/wheels/d0/f4/2e/cec0c14b57c2094a2bcef3063f95d758ad1309a640ff100419
Successfully built elegantrl
Installing collected packages: pybullet, elegantrl
Suc

# **Part 3: Import packages**


*   elegantrl
*   OpenAI gym
*   PyBullet gym



In [4]:
from elegantrl.run import *
from elegantrl.env import prep_env
import elegantrl.agent as agent
import gym
import pybullet_envs  # importing this module registers the PyBullet environments with gym
dir(pybullet_envs)
gym.logger.set_level(40)  # 40 = ERROR level: suppress gym warnings

# **Part 4: Initialize agent and environment**

*   **break_step**: the maximum number of training steps, used as a cutoff if the target reward is not reached.
*   **eval_times1**: the number of evaluation runs if 'eval_reward > old_max_reward'.
*   **eval_times2**: the number of evaluation runs if 'eval_reward > target_reward'.
*   **rollout_num**: the number of rollout workers (more workers are not always faster).


> See Arguments() for more details about the adjustable hyperparameters. Users can also import a customized environment for their own task.
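The interaction between **break_step** and the target reward can be sketched as a simple stopping rule. This is a minimal illustration of the behavior described above, not ElegantRL's actual code: `should_stop` is a hypothetical helper, and the exact comparison operators used by the library are an assumption.

```python
def should_stop(total_step, max_reward, break_step, target_reward):
    """Hypothetical stopping rule (not part of ElegantRL's API).

    Training ends when the best evaluated reward exceeds the target,
    or when the step budget `break_step` is used up.
    """
    reached_target = max_reward > target_reward
    out_of_budget = total_step >= break_step
    return reached_target or out_of_budget


break_step = int(5e4 * 8)  # 400000 steps, the budget used in this notebook
target_reward = 18.0       # target reward reported for ReacherBulletEnv-v0

# Target not reached and budget not exhausted: keep training.
print(should_stop(1.78e4, 14.32, break_step, target_reward))
# Target exceeded well before the budget: stop early.
print(should_stop(2.62e4, 20.50, break_step, target_reward))
# Budget exhausted without reaching the target: stop anyway.
print(should_stop(4.0e5, 14.32, break_step, target_reward))
```

Under this reading, **break_step** only matters for runs that never reach the target reward.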




In [6]:
args = Arguments(if_on_policy=True)
args.agent_rl = agent.AgentGaePPO  # choose a DRL algorithm
args.env = prep_env(gym.make('ReacherBulletEnv-v0')) # create and preprocess the environment from gym

args.break_step = int(5e4 * 8)  # (5e4) 1e5, UsedTime: (300s) 800s
args.eval_times1 = 1
args.eval_times2 = 2
args.rollout_num = 4

| env_name: ReacherBulletEnv-v0, action space: Continuous
| state_dim: 9, action_dim: 2, action_max: 1.0, target_reward: 18.0
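The agent chosen above, `AgentGaePPO`, is (judging by its name) PPO combined with Generalized Advantage Estimation (GAE). For readers unfamiliar with GAE, here is a minimal pure-Python sketch of the standard recursion. It is not ElegantRL's implementation: `compute_gae`, its default `gamma` and `lam` values, and the single-segment assumption (no episode boundary inside the list) are all illustrative.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards[t] and values[t] cover steps t = 0..T-1 of a single segment;
    last_value is the critic's estimate for the state after the final
    step (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and a zero critic, GAE reduces to reward-to-go.
print(compute_gae([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
```

Setting `lam=0.0` instead yields the plain one-step TD errors, which shows how `lam` trades bias against variance.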


# **Part 5: Train and evaluate the model**

> To train and evaluate the model, the user only needs to pass the Arguments object to the training function, which builds the model automatically. The only work left to the user is to define a suitable environment for the agent to interact with.


*   **Step**: the total number of training steps.
*   **MaxR**: the maximum reward.
*   **avgR**: the average of the rewards.
*   **stdR**: the standard deviation of the rewards.
*   **objA**: the objective function value of the Actor Network (Policy Network).
*   **objC**: the objective function value (Q-value) of the Critic Network (Value Network).
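The reward columns can be illustrated with a short sketch. It assumes that **avgR** and **stdR** are the mean and population standard deviation over the evaluation episodes, and that **MaxR** tracks the best average seen so far. `summarize_returns` is a hypothetical helper written for this illustration, not an ElegantRL function.

```python
import statistics


def summarize_returns(episode_returns, old_max_reward):
    """Return (MaxR, avgR, stdR) for one evaluation round (illustrative only)."""
    avg_r = statistics.mean(episode_returns)
    # Population standard deviation: 0.0 when only one episode is evaluated.
    std_r = statistics.pstdev(episode_returns)
    max_r = max(old_max_reward, avg_r)
    return max_r, avg_r, std_r


# A single evaluation episode gives stdR = 0.0 and leaves MaxR unchanged
# when the new average is below the previous best.
print(summarize_returns([1.95], old_max_reward=14.32))
```

With `eval_times1 = 1`, as in this notebook, a single episode is evaluated per round under this reading, which would explain a **stdR** of 0.00 in those rows of the log.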



In [7]:
train_and_evaluate__multiprocessing(args)  # training stops once the target reward is reached (or at break_step)

| GPU id: 0, cwd: ./AgentGaePPO/ReacherBulletEnv-v0_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00     -8.48 |
0   1.05e+03     14.32 |
0   9.45e+03     14.32 |   -2.92      0.00      -0.56      2.65
0   1.78e+04     14.32 |    1.95      0.00      -0.62      1.66
0   2.52e+04     20.50 |
ID      Step   TargetR |    avgR      stdR   UsedTime  ########
0   2.62e+04     18.00 |   20.50      3.78        808  ########
0   2.62e+04     20.50 |   20.50      3.78      -0.67      1.02
| print_norm: state_avg, state_fix_std
| avg = np.array([ 0.22324559, -0.8252825 , -0.13812736,  0.65398854, -0.03838773,
        0.19503878,  0.8229074 , -0.18179317, -0.10579718], dtype=np.float32)
| std = np.array([0.11132354, 0.12903091, 0.13470675, 0.15849867, 0.518055  ,
       0.51017815, 0.2939634 , 0.55595165, 0.54980254], dtype=np.float32)
| SavedDir: ./AgentGaePPO/ReacherBulletEnv-v0_0
| UsedTime: 808
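The `state_avg` and `state_fix_std` arrays printed at the end of the log look like per-dimension statistics of the 9-dimensional state. Assuming they are meant for the usual `(state - avg) / std` standardization (an assumption, since the log does not say how they are used), they could be applied as in the sketch below. `normalize_state` is a hypothetical helper, written with plain lists so that it runs without NumPy.

```python
# Statistics copied from the training log above.
state_avg = [0.22324559, -0.8252825, -0.13812736, 0.65398854, -0.03838773,
             0.19503878, 0.8229074, -0.18179317, -0.10579718]
state_std = [0.11132354, 0.12903091, 0.13470675, 0.15849867, 0.518055,
             0.51017815, 0.2939634, 0.55595165, 0.54980254]


def normalize_state(state, avg=state_avg, std=state_std):
    """Standardize one state vector element-wise (illustrative helper)."""
    if len(state) != len(avg):
        raise ValueError("state must have %d dimensions" % len(avg))
    return [(s - a) / d for s, a, d in zip(state, avg, std)]


# A state equal to the recorded mean maps to the zero vector.
print(normalize_state(state_avg))
```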


