
# Demo: ElegantRL_HelloWorld_tutorial (DQN, DDPG, PPO)

![File_structure of ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/figs/File_structure.png)

One-sentence summary: an agent (agent.py) with Actor-Critic networks (net.py) is trained (run.py) by interacting with an environment (env.py).
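The summary above can be sketched as a minimal interaction loop. Everything in this block (`ToyEnv`, `RandomAgent`, `run_episode`) is a hypothetical stand-in written for illustration, not ElegantRL's API; it only shows the roles that the environment, the agent and the run loop play.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for env.py: a 1-D walk that ends after max_step steps."""

    def __init__(self, max_step=10):
        self.max_step = max_step
        self.position = 0
        self.step_count = 0

    def reset(self):
        self.position = 0
        self.step_count = 0
        return self.position  # initial state

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        self.position += 1 if action == 1 else -1
        self.step_count += 1
        reward = float(self.position)  # larger reward the further right we are
        done = self.step_count >= self.max_step
        return self.position, reward, done


class RandomAgent:
    """Hypothetical stand-in for agent.py: picks actions uniformly at random."""

    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)

    def select_action(self, state):
        return self.rng.randrange(self.action_dim)


def run_episode(env, agent):
    """Stand-in for the loop in run.py: collect the transitions of one episode."""
    state = env.reset()
    transitions = []
    done = False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return transitions


episode = run_episode(ToyEnv(max_step=10), RandomAgent(action_dim=2))
print(len(episode))  # one transition per environment step
```

A real agent would use the collected transitions to update its networks; that step is left out here on purpose.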

### The training env for RL

In [1]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args, PendulumEnv

Install `gym` to get the environments (`env`) used for DRL training.
- [Learn more about the **discrete action** space env `CartPole`](https://gym.openai.com/envs/CartPole-v1/)
- [Learn more about the **discrete action** space env `LunarLander`](https://gym.openai.com/envs/LunarLander-v2/)
- [Learn more about the **continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/)
- [Learn more about the **continuous action** space env `LunarLanderContinuous`](https://gym.openai.com/envs/LunarLanderContinuous-v2/)
- [Learn more about the **continuous action** space env `BipedalWalker`](https://gym.openai.com/envs/BipedalWalker-v2/)
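The two kinds of task in the list differ in what an action looks like. The block below is a minimal sketch, assuming a discrete action is an integer index in `range(action_dim)` and a continuous action is a vector of floats normalized to `[-1, 1]`; the helper `sample_random_action` is hypothetical and not part of `gym` or ElegantRL.

```python
import random


def sample_random_action(action_dim, if_discrete, rng):
    """Return a random action for either kind of action space (illustrative only)."""
    if if_discrete:
        # Discrete: the action is a single integer index, e.g. 0 or 1 when action_dim == 2.
        return rng.randrange(action_dim)
    # Continuous: the action is a vector of action_dim floats,
    # assumed here to be normalized to the range [-1, 1].
    return [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]


rng = random.Random(0)
discrete_action = sample_random_action(action_dim=2, if_discrete=True, rng=rng)
continuous_action = sample_random_action(action_dim=1, if_discrete=False, rng=rng)
print(discrete_action, continuous_action)
```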

In [2]:
!pip install gym
import gym



Install `Box2D` to run some `env` for DRL training.

Box2D is a 2D rigid body simulation library for games.

The following code installs `Box2D` for the tasks `LunarLander` and `BipedalWalker`.

In [3]:
!pip3 install Box2D
!pip3 install box2d-py
!pip3 install gym[box2d]

import gym
env = gym.make("LunarLander-v2")

Collecting Box2D
  Downloading Box2D-2.3.10-cp37-cp37m-manylinux1_x86_64.whl (1.3 MB)
Installing collected packages: Box2D
Successfully installed Box2D-2.3.10
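Before creating a Box2D-based env, it can help to confirm that the packages are visible to the current interpreter. The block below is a small sketch using only the standard library; the helper name `is_installed` is hypothetical, and the module names `gym` and `Box2D` are the import names assumed for the packages installed above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report whether each assumed import name resolves in this environment.
for name in ("gym", "Box2D"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, rerun the install cell in the same kernel before calling `gym.make`.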


## Train DQN on a discrete action space task.

In [4]:
from elegantrl_helloworld.agent import AgentDQN
agent_class = AgentDQN
env_name = ["CartPole-v0", "LunarLander-v2"][0]

if env_name == "CartPole-v0":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** 0
    args.gamma = 0.97

    '''network update'''
    args.target_step = args.max_step * 2
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 7
    args.repeat_times = 2 ** 0
    args.explore_rate = 0.25

    '''evaluate'''
    args.eval_gap = 2 ** 5
    args.eval_times = 2 ** 3
    args.break_step = int(1e5)
elif env_name == "LunarLander-v2":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** 0
    args.gamma = 0.99

    '''network update'''
    args.target_step = args.max_step
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 7
    args.repeat_times = 2 ** 0
    args.explore_noise = 0.125

    '''evaluate'''
    args.eval_gap = 2 ** 7
    args.eval_times = 2 ** 4
    args.break_step = int(4e5)
else:
    raise ValueError("env_name:", env_name)

env_args = {'env_num': 1,
            'env_name': 'CartPole-v0',
            'max_step': 200,
            'state_dim': 4,
            'action_dim': 2,
            'if_discrete': True}


`env_args = get_gym_env_args(env, if_print=True)` prints the information about the `env`, as in the output above.
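For a custom env, the same dictionary could be written by hand. The block below is a minimal sketch, assuming the six keys in the printed output are the ones that matter; `REQUIRED_KEYS` and `check_env_args` are hypothetical names for illustration and are not part of ElegantRL.

```python
# Keys taken from the printed env_args output; treating them as required is an assumption.
REQUIRED_KEYS = ("env_num", "env_name", "max_step", "state_dim", "action_dim", "if_discrete")


def check_env_args(env_args):
    """Raise KeyError if any of the keys shown in the printed output is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in env_args]
    if missing:
        raise KeyError(f"env_args is missing: {missing}")
    return env_args


# Values copied from the CartPole-v0 output.
cartpole_args = check_env_args({
    "env_num": 1,
    "env_name": "CartPole-v0",
    "max_step": 200,
    "state_dim": 4,
    "action_dim": 2,
    "if_discrete": True,
})
print(cartpole_args["state_dim"], cartpole_args["action_dim"])
```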

Then choose the GPU. If `learner_gpus` is set to `-1`, or no GPU is available, the CPU is used automatically.
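The fallback rule can be sketched in a few lines. This is an illustration of the behavior described, not ElegantRL's actual code; `choose_device` is a hypothetical helper, and in practice the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
def choose_device(gpu_id, cuda_available):
    """Return a device string: a GPU when requested and available, otherwise the CPU."""
    if gpu_id >= 0 and cuda_available:
        return f"cuda:{gpu_id}"
    return "cpu"


print(choose_device(0, cuda_available=True))    # GPU requested and available
print(choose_device(-1, cuda_available=True))   # CPU explicitly requested
print(choose_device(0, cuda_available=False))   # GPU requested but unavailable
```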

Finally, train and evaluate the DRL agent.

In [6]:
!conda activate 
args.learner_gpus = 0

train_agent(args)
evaluate_agent(args)

| Arguments Remove cwd: ./CartPole-v0_DQN_6 
| Step 4.51e+02  ExpR     1.00  | ObjC     0.26  ObjA     0.29 
| Step 4.14e+04  ExpR     1.00  | ObjC     0.25  ObjA    31.08 
| Step 6.81e+04  ExpR     1.00  | ObjC     0.71  ObjA    33.63 
| Step 8.92e+04  ExpR     1.00  | ObjC     0.39  ObjA    33.55 
| UsedTime: 234 | SavedDir: ./CartPole-v0_DQN_6 

| Arguments Keep cwd: ./CartPole-v0_DQN_6 
| Steps 4.51e+02  | Returns avg    89.125  std    13.364 
| Steps 4.14e+04  | Returns avg   194.625  std    11.124 
| Steps 6.81e+04  | Returns avg   199.125  std     2.315 
| Steps 8.92e+04  | Returns avg   200.000  std     0.000 
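Each evaluation row above summarizes `eval_times` episodes by the average and standard deviation of their returns. The block below is a sketch of that summary, assuming a population standard deviation (the library may compute it differently); `summarize_returns` is a hypothetical helper and the input values are illustrative.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of a list of episode returns."""
    avg = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return avg, std


# Illustrative values: eight evaluation episodes (eval_times = 2 ** 3) that all
# reach CartPole-v0's 200-step cap, consistent with the last row above.
avg, std = summarize_returns([200.0] * 8)
print(f"Returns avg {avg:9.3f}  std {std:9.3f}")
```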


## Train DDPG on a continuous action space task.

The following code is for tutorial purposes only.

DDPG is a simple DRL algorithm, but it has low sample efficiency and is unstable.

Remember to run the PPO cell below afterwards to experience how the **PPO algorithm is better than the DDPG algorithm**.

In [8]:
from elegantrl_helloworld.agent import AgentDDPG
agent_class = AgentDDPG
env_name = ["Pendulum-v1", "LunarLanderContinuous-v2", "BipedalWalker-v3"][0]
gpu_id = 0

if env_name == "Pendulum-v1":
    env = PendulumEnv()
    env_func = PendulumEnv
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
    args.gamma = 0.97

    '''network update'''
    args.target_step = args.max_step * 2
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 7
    args.repeat_times = 2 ** 0
    args.explore_noise = 0.1

    '''evaluate'''
    args.eval_gap = 2 ** 6
    args.eval_times = 2 ** 3
    args.break_step = int(1e5)
elif env_name == "LunarLanderContinuous-v2":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** 0
    args.gamma = 0.99

    '''network update'''
    args.target_step = args.max_step // 2
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 7
    args.repeat_times = 2 ** 0
    args.explore_noise = 0.1

    '''evaluate'''
    args.eval_gap = 2 ** 7
    args.eval_times = 2 ** 4
    args.break_step = int(4e5)
elif env_name == "BipedalWalker-v3":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** -1
    args.gamma = 0.98

    '''network update'''
    args.target_step = args.max_step // 2
    args.net_dim = 2 ** 8
    args.batch_size = 2 ** 8
    args.repeat_times = 2 ** 0
    args.explore_noise = 0.05

    '''evaluate'''
    args.eval_gap = 2 ** 7
    args.eval_times = 2 ** 3
    args.break_step = int(3e5)
else:
    raise ValueError("env_name:", env_name)

args.learner_gpus = gpu_id
train_agent(args)
evaluate_agent(args)

| Arguments Remove cwd: ./Pendulum-v1_DDPG_6 
|Step 4.00e+02  ExpR    -3.60  |ObjC     3.22  ObjA     0.15 
|Step 4.00e+04  ExpR    -1.98  |ObjC     0.86  ObjA   -82.97 
|Step 5.88e+04  ExpR    -1.58  |ObjC     0.82  ObjA   -66.20 
|Step 7.32e+04  ExpR    -0.60  |ObjC     0.62  ObjA   -45.66 
|Step 8.52e+04  ExpR    -0.47  |ObjC     0.36  ObjA   -33.60 
|Step 9.56e+04  ExpR    -0.33  |ObjC     0.35  ObjA   -28.88 
| UsedTime: 357 | SavedDir: ./Pendulum-v1_DDPG_6 

| Arguments Keep cwd: ./Pendulum-v1_DDPG_6 
|Steps          400  |Returns avg -1391.019  std   272.423 
|Steps        40000  |Returns avg  -822.530  std    77.746 
|Steps        58800  |Returns avg  -583.974  std    54.622 
|Steps        73200  |Returns avg  -199.278  std    83.178 
|Steps        85200  |Returns avg  -163.388  std    82.727 
|Steps        95600  |Returns avg  -211.675  std    72.861 


## Train PPO on a continuous action space task.

In [13]:
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO
env_name = ["Pendulum-v1", "LunarLanderContinuous-v2", "BipedalWalker-v3"][0]
gpu_id = 0

if env_name == "Pendulum-v1":
    env = PendulumEnv()
    env_func = PendulumEnv
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
    args.gamma = 0.97

    '''network update'''
    args.target_step = args.max_step * 8
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 8
    args.repeat_times = 2 ** 4

    '''evaluate'''
    args.eval_gap = 2 ** 6
    args.eval_times = 2 ** 3
    args.break_step = int(8e5)
elif env_name == "LunarLanderContinuous-v2":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** -2
    args.gamma = 0.99

    '''network update'''
    args.target_step = args.max_step * 2
    args.net_dim = 2 ** 7
    args.batch_size = 2 ** 8
    args.repeat_times = 2 ** 5

    '''evaluate'''
    args.eval_gap = 2 ** 6
    args.eval_times = 2 ** 5
    args.break_step = int(6e5)
elif env_name == "BipedalWalker-v3":
    import gym
    env = gym.make(env_name)
    env_func = gym.make
    env_args = get_gym_env_args(env, if_print=True)

    args = Arguments(agent_class, env_func, env_args)

    '''reward shaping'''
    args.reward_scale = 2 ** -1
    args.gamma = 0.98

    '''network update'''
    args.target_step = args.max_step
    args.net_dim = 2 ** 8
    args.batch_size = 2 ** 9
    args.repeat_times = 2 ** 4

    '''evaluate'''
    args.eval_gap = 2 ** 6
    args.eval_times = 2 ** 4
    args.break_step = int(6e5)
else:
    raise ValueError("env_name:", env_name)
    
args.learner_gpus = gpu_id
train_agent(args)
evaluate_agent(args)



### Authors
GitHub: [ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL)