<a href="https://colab.research.google.com/github/AI4Finance-Foundation/ElegantRL/blob/master/tutorial_helloworld_DQN_DDPG_PPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo: ElegantRL_HelloWorld_tutorial (DQN --> DDPG --> PPO)

We suggest following this order to learn RL quickly:
- DQN (Deep Q Network), a basic RL algorithm for discrete action spaces.
- DDPG (Deep Deterministic Policy Gradient), a basic RL algorithm for continuous action spaces.
- PPO (Proximal Policy Optimization), a widely used RL algorithm for continuous action spaces.
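
To make the discrete/continuous distinction concrete, here is a minimal sketch of how an action is typically chosen during exploration in each setting. It is not ElegantRL's code: both function names are hypothetical, and the default values only mirror the `explore_rate` and `explore_noise` hyperparameters set in the training cells of this tutorial.

```python
import random


def epsilon_greedy_action(q_values, explore_rate=0.25, rng=random):
    """Discrete action space (DQN-style): with probability `explore_rate` pick a
    random action index, otherwise pick the index of the largest Q-value."""
    if rng.random() < explore_rate:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def noisy_continuous_action(action, explore_noise=0.1, rng=random):
    """Continuous action space (DDPG-style): add Gaussian noise to each component
    of a deterministic action, then clip every component to [-1, 1]."""
    return [max(-1.0, min(1.0, a + rng.gauss(0.0, explore_noise))) for a in action]


if __name__ == "__main__":
    random.seed(0)
    print(epsilon_greedy_action([0.1, 0.9]))       # an action index: 0 or 1
    print(noisy_continuous_action([0.5, -0.99]))   # a list of floats in [-1, 1]
```

A discrete agent returns an index into a fixed set of actions, while a continuous agent returns a vector of real numbers, which is why the two settings need different algorithms.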

If you have any suggestions about the ElegantRL Helloworld, you can discuss them in [ElegantRL issues/135: Suggestions for elegant_helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/issues/135); we will keep an eye on this issue.
ElegantRL's code, especially the Helloworld, needs a lot of feedback to get better.

## **Part 1: Install ElegantRL**

In [2]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-s3u140ew
  Running command git clone -q https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-s3u140ew
Collecting pybullet
  Downloading pybullet-3.2.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (91.7 MB)
Collecting box2d-py
  Downloading box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448 kB)
Building wheels for collected packages: elegantrl
  Building wheel for elegantrl (setup.py) ... done
  Created wheel for elegantrl: filename=elegantrl-0.3.3-py3-none-any.whl size=239162 sha256=b45d23493b8f7bbe8db6c324e918a05b5930865c6b03dafda838ef3041dc2059
  Stored in directory: /tmp/pip-ephem-wheel-cache-rw5d1bev/wheels/52/9a/b3/08c8a0b5be22a65da0132538c05e7e961b1253c90d6845e0c6
Successfully built 


## **Part 2: Import ElegantRL helloworld**

We hope that the `ElegantRL Helloworld` helps people who want to learn about reinforcement learning to quickly run a few introductory examples.
- **Fewer lines of code**. (fewer than 1000 lines)
- **Fewer package requirements**. (only `torch` and `gym`)
- **A style consistent with the full version of ElegantRL**.

![File_structure of ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/figs/File_structure.png)

One-sentence summary: an agent (`agent.py`) with Actor-Critic networks (`net.py`) is trained (`run.py`) by interacting with an environment (`env.py`).
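
The interaction described in that summary can be sketched in a few lines. This is a toy illustration only: `ToyEnv`, `RandomAgent`, and `explore_one_episode` are hypothetical names, the environment is invented for the example, and a real agent would query its actor network instead of sampling at random.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the state is a position on a line and
    the episode ends after `max_step` steps."""

    def __init__(self, max_step=10):
        self.max_step = max_step
        self.step_count = 0
        self.position = 0.0

    def reset(self):
        self.step_count = 0
        self.position = 0.0
        return self.position

    def step(self, action):
        self.position += action
        self.step_count += 1
        reward = -abs(self.position)  # staying near the origin is rewarded
        done = self.step_count >= self.max_step
        return self.position, reward, done


class RandomAgent:
    """Stand-in for an agent; it ignores the state and samples an action in [-1, 1]."""

    def select_action(self, state):
        return random.uniform(-1.0, 1.0)


def explore_one_episode(env, agent):
    """Collect (state, action, reward, next_state, done) transitions for one episode."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return transitions


if __name__ == "__main__":
    episode = explore_one_episode(ToyEnv(max_step=5), RandomAgent())
    print(len(episode), sum(t[2] for t in episode))
```

A training loop alternates between collecting such transitions and using them to update the networks.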


In this tutorial, we only need to download the [elegantrl_helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld) directory using the following code.

The files in `elegantrl_helloworld` include:
`config.py`, `agent.py`, `net.py`, `env.py`, `run.py`

In [3]:
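!rm -r -f /content/elegantrl_helloworld  # remove the directory if it exists
# NOTE: this URL points to a directory page, so wget saves an HTML page (see `text/html` in the log below), not the Python files.
# If the import in the next cell fails, clone the repository and copy its `elegantrl_helloworld` directory to /content/ instead.
!wget https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/elegantrl_helloworld -P /content/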

--2022-04-20 08:04:30--  https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/elegantrl_helloworld
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld [following]
--2022-04-20 08:04:30--  https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/content/elegantrl_helloworld’

elegantrl_helloworl     [ <=>                ] 114.22K  --.-KB/s    in 0.04s   

2022-04-20 08:04:30 (2.54 MB/s) - ‘/content/elegantrl_helloworld’ saved [116960]



In [4]:
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.config import Arguments

## **Part 3: Train DQN on a discrete action space task.**

Train DQN on the [**discrete action** space task `CartPole`](https://gym.openai.com/envs/CartPole-v1/)

See [/elegantrl_helloworld/config.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/elegantrl_helloworld/config.py) for more information about the hyperparameters.

```
class Arguments:
    def __init__(self, agent_class, env_func=None, env_args=None):
        self.env_num = self.env_args['env_num']  # env_num = 1. In vector env, env_num > 1.
        self.max_step = self.env_args['max_step']  # the max step of an episode
        self.env_name = self.env_args['env_name']  # the env name. Be used to set 'cwd'.
        self.state_dim = self.env_args['state_dim']  # vector dimension (feature number) of state
        self.action_dim = self.env_args['action_dim']  # vector dimension (feature number) of action
        self.if_discrete = self.env_args['if_discrete']  # discrete or continuous action space
        ...
```
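
The excerpt above reads everything it needs to know about the environment from the `env_args` dictionary. As a rough sketch, this is how such a dictionary could be written by hand and checked before use. `check_env_args` is a hypothetical helper, not part of ElegantRL, and its key list is taken only from the excerpt, so the real class may read more keys.

```python
# Keys read by the `Arguments` excerpt above.
REQUIRED_KEYS = ("env_num", "max_step", "env_name", "state_dim", "action_dim", "if_discrete")


def check_env_args(env_args):
    """Raise KeyError if a key read by the excerpt is missing; otherwise return the dict."""
    missing = [key for key in REQUIRED_KEYS if key not in env_args]
    if missing:
        raise KeyError(f"env_args is missing: {missing}")
    return env_args


if __name__ == "__main__":
    cartpole_args = check_env_args({
        "env_num": 1,
        "env_name": "CartPole-v0",
        "max_step": 200,
        "state_dim": 4,
        "action_dim": 2,
        "if_discrete": True,
    })
    print(cartpole_args["state_dim"], cartpole_args["action_dim"])
```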

In [10]:
from elegantrl_helloworld.agent import AgentDQN
agent_class = AgentDQN
env_name = "CartPole-v0"

import gym
gym.logger.set_level(40)  # Block warning
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** 0  # reward scaling factor; the scaled target return is usually chosen to be close to 256
args.gamma = 0.97  # discount factor of future rewards

'''network update'''
args.target_step = args.max_step * 2  # collect target_step, then update network
args.net_dim = 2 ** 7  # the middle layer dimension of Fully Connected Network
args.num_layer = 3  # the layer number of MultiLayer Perceptron, `assert num_layer >= 2`
args.batch_size = 2 ** 7  # num of transitions sampled from replay buffer.
args.repeat_times = 2 ** 0  # repeatedly update network using ReplayBuffer to keep critic's loss small
args.explore_rate = 0.25  # epsilon-greedy for exploration.

'''evaluate'''
args.eval_gap = 2 ** 5  # evaluate the agent every `eval_gap` seconds
args.eval_times = 2 ** 3  # number of evaluation episodes used to estimate the episode return
args.break_step = int(8e4)  # break training if 'total_step > break_step'

env_args = {'env_num': 1,
            'env_name': 'CartPole-v0',
            'max_step': 200,
            'state_dim': 4,
            'action_dim': 2,
            'if_discrete': True}


Choose GPU id `0` using `args.learner_gpus = 0`. If it is set to `-1` or no GPU is available, the training program uses the CPU automatically.
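
The selection rule in the previous paragraph can be written as a small function. This is only a sketch of the rule as stated, not ElegantRL's implementation: `select_device_name` is a hypothetical name, and with PyTorch the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
def select_device_name(gpu_id, cuda_available):
    """Return 'cuda:<id>' when a non-negative GPU id is requested and CUDA is usable, else 'cpu'."""
    if cuda_available and gpu_id >= 0:
        return f"cuda:{gpu_id}"
    return "cpu"


if __name__ == "__main__":
    print(select_device_name(gpu_id=0, cuda_available=True))
    print(select_device_name(gpu_id=0, cuda_available=False))
    print(select_device_name(gpu_id=-1, cuda_available=True))
```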

- The cumulative return of CartPole-v0 lies in (0, (1, 195), 200).
- For any task, this notation means (min score, (score of random action, target score), max score).
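
One way to use this notation is to rescale a measured return so that a random policy scores 0 and the target score maps to 1. The helper below is a hypothetical convenience for reading training logs, not something the library provides.

```python
def normalized_score(episode_return, random_score, target_score):
    """Map a return to a scale where a random policy scores 0.0 and the target scores 1.0."""
    return (episode_return - random_score) / (target_score - random_score)


# (min score, (score of random action, target score), max score) for CartPole-v0
CARTPOLE_RANGE = (0, (1, 195), 200)

if __name__ == "__main__":
    _, (random_score, target_score), _ = CARTPOLE_RANGE
    print(normalized_score(98, random_score, target_score))
```

Values above 1.0 mean that the agent exceeded the target score.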

In [12]:
args.learner_gpus = -1

train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of CartPole-v0  is ∈ (0, (1, 195), 200)')

| Arguments Keep cwd: ./CartPole-v0_DQN_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 4.08e+02  ExpR     1.00  | ObjC     0.08  ObjA     0.89
| Steps 1.79e+04  ExpR     1.00  | ObjC     0.25  ObjA    16.98
| Steps 3.13e+04  ExpR     1.00  | ObjC     0.32  ObjA    26.03
| Steps 4.11e+04  ExpR     1.00  | ObjC     0.93  ObjA    29.85
| Steps 4.98e+04  ExpR     1.00  | ObjC     1.19  ObjA    31.70
| Steps 5.72e+04  ExpR     1.00  | ObjC     0.06  ObjA    30.98
| Steps 6.43e+04  ExpR     1.00  | ObjC     0.21  ObjA    31.57
| Steps 7.06e+04  ExpR     1.00  | ObjC     0.81  ObjA    31.66
| Steps 7.59e+04  ExpR     1.00  | ObjC     

Train DQN on the [**discrete action** space env `LunarLander`](https://gym.openai.com/envs/LunarLander-v2/)

**You can skip the code below**, because DQN takes over 1000 seconds to train on this task, which is too slow.

There are many other DQN variants that get higher cumulative returns and take less time to train. See [examples/demo_DQN_Dueling_Double_DQN.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/examples/demo_DQN_Dueling_Double_DQN.py)

In [None]:
from elegantrl_helloworld.agent import AgentDQN
agent_class = AgentDQN
env_name = "LunarLander-v2"

import gym
gym.logger.set_level(40)  # Block warning
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** 0
args.gamma = 0.99

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 7
args.num_layer = 3

args.batch_size = 2 ** 6

args.repeat_times = 2 ** 0
args.explore_noise = 0.125

'''evaluate'''
args.eval_gap = 2 ** 7
args.eval_times = 2 ** 4
args.break_step = int(4e5)  # LunarLander needs a larger `break_step`

args.learner_gpus = -1  # denotes use CPU
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of LunarLander-v2 is ∈ (-1800, (-600, 200), 340)')

env_args = {'env_num': 1,
            'env_name': 'LunarLander-v2',
            'max_step': 1000,
            'state_dim': 8,
            'action_dim': 4,
            'if_discrete': True}
| Arguments Remove cwd: ./LunarLander-v2_DQN_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 1.02e+03  ExpR    -2.63  | ObjC     1.95  ObjA    -2.92
| Steps 1.93e+04  ExpR    -0.52  | ObjC     1.15  ObjA    -4.78
| Steps 3.42e+04  ExpR    -0.18  | ObjC     2.60  ObjA   -20.33
| Steps 4.70e+04  ExpR    -0.12  | ObjC     2.14  ObjA   -87.57
| Steps 5.93e+04  ExpR    -0.04  | ObjC     7.10  ObjA   -14.11
| Steps 7.12e+04  ExpR    -0.05  | ObjC    

## **Part 4: Train DDPG on continuous action space task.**

Train DDPG on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/)

We show a custom env in [elegantrl_helloworld/env.py `class PendulumEnv`](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/elegantrl_helloworld/env.py#L19-L23)

The OpenAI Pendulum env sets its action space to (-2, +2). We consider this a poor choice and suggest adjusting the action space to (-1, +1) when designing your own env.
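
A common way to follow this suggestion is to let the agent output actions in (-1, +1) and rescale them inside the environment's `step()`. The sketch below shows only the rescaling. `scale_action` is a hypothetical helper, and the `PendulumEnv` in `env.py` may implement this differently.

```python
def scale_action(action, low=-2.0, high=2.0):
    """Map a normalized action in [-1, 1] linearly onto [low, high], clipping out-of-range input."""
    clipped = max(-1.0, min(1.0, action))
    return low + (clipped + 1.0) * 0.5 * (high - low)


if __name__ == "__main__":
    for normalized in (-1.0, 0.0, 0.5, 1.0, 3.0):
        print(normalized, scale_action(normalized))
```

Keeping the agent's output range fixed means the same network output layer and exploration noise settings can be reused across environments.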


In [None]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentDDPG
agent_class = AgentDDPG

from elegantrl_helloworld.env import PendulumEnv
env = PendulumEnv()
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 2
args.net_dim = 2 ** 7
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 0
args.explore_noise = 0.1

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(1e5)

args.learner_gpus = -1  # denotes use CPU
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of Pendulum-v1 is ∈ (-1600, (-1400, -200), 0)')

{'action_dim': 4,
 'env_name': 'BipedalWalker-v3',
 'env_num': 1,
 'if_discrete': False,
 'max_step': 1600,
 'state_dim': 24,
 'target_return': 300}

## **Part 5: Train PPO on continuous action space task.**

Train PPO on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/)
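
For orientation, PPO limits how far each update can move the policy by clipping the probability ratio between the new and the old policy. The sketch below evaluates the clipped surrogate term for a single sample. It is a plain-Python illustration rather than ElegantRL's implementation, the function name is hypothetical, and the clipping constant `0.2` is an assumed value.

```python
def ppo_clipped_objective(ratio, advantage, clip_ratio=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # With a positive advantage, a ratio above 1 + eps earns no extra objective.
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
    # With a negative advantage, a ratio above 1 + eps is still fully penalized.
    print(ppo_clipped_objective(ratio=1.5, advantage=-1.0))
```

In training, this term is averaged over a batch and maximized.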


In [None]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO

from elegantrl_helloworld.env import PendulumEnv
env = PendulumEnv()
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 8
args.net_dim = 2 ** 7
args.num_layer = 2
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 5

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(8e5)

gpu_id = 0  # GPU id; set to -1 to use the CPU
args.learner_gpus = gpu_id
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of Pendulum-v1 is ∈ (-1600, (-1400, -200), 0)')

Train PPO on [**Continuous action** space env `LunarLanderContinuous`](https://gym.openai.com/envs/LunarLanderContinuous-v2/)

In [None]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO
env_name = "LunarLanderContinuous-v2"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.gamma = 0.99
args.reward_scale = 2 ** -1

'''network update'''
args.target_step = args.max_step * 8
args.num_layer = 3
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 4
args.lambda_entropy = 0.04

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 5
args.break_step = int(4e5)

gpu_id = 0  # GPU id; set to -1 to use the CPU
args.learner_gpus = gpu_id
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of LunarLanderContinuous-v2 is ∈ (-1800, (-300, 200), 310+)')

Train PPO on [**Continuous action** space env `BipedalWalker`](https://gym.openai.com/envs/BipedalWalker-v2/)

In [None]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO
env_name = "BipedalWalker-v3"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1
args.gamma = 0.98

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 8
args.num_layer = 3
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 4

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 4
args.break_step = int(1e6)

gpu_id = 0  # GPU id; set to -1 to use the CPU
args.learner_gpus = gpu_id
args.random_seed += gpu_id
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of BipedalWalker-v3 is ∈ (-150, (-100, 280), 320+)')


| Arguments Remove cwd: ./BipedalWalker-v3_PPO_0
################################################################################
ID     Step    maxR |    avgR   stdR   avgS  stdS |    expR   objC   etc.
0  6.98e+03  -91.89 |
0  6.98e+03  -91.89 |  -91.89    0.0    109     2 |   -0.39 676.16   0.06  -0.50
0  9.49e+04  -21.05 |
0  9.49e+04  -21.05 |  -21.05    0.4   1600     0 |   -0.05   6.96   0.02  -0.50
0  1.59e+05  -21.05 |  -38.62    1.8   1600     0 |   -0.03   0.34  -0.01  -0.51
0  2.24e+05  -21.05 |  -34.80    3.4   1600     0 |   -0.02   0.31   0.05  -0.52
0  2.94e+05  133.03 |
0  2.94e+05  133.03 |  133.03    4.3   1600     0 |    0.01   0.59  -0.05  -0.53
0  3.65e+05  133.03 |  -95.17    0.2    121     7 |    0.04   0.75   0.05  -0.55
0  4.55e+05  133.03 | -125.18   13.9    268    68 |    0.07   5.88   0.03  -0.56
0  5.37e+05  133.03 |  -63.86   34.8    416   175 |    0.08   7.43  -0.01  -0.57
0  6.20e+05  152.64 |
0  6.20e+05  152.64 |  152.64  137.1   1152   451 |    0.14 