# Demo: ElegantRL_HelloWorld_tutorial (DQN --> DDPG --> PPO)

We suggest following this order to learn RL quickly:
- DQN (Deep Q Network), a basic RL algorithm for discrete action spaces.
- DDPG (Deep Deterministic Policy Gradient), a basic RL algorithm for continuous action spaces.
- PPO (Proximal Policy Optimization), a widely used RL algorithm for continuous action spaces.

If you have any suggestions about the ElegantRL Helloworld, you can discuss them in [ElegantRL issues/135: Suggestions for elegant_helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/issues/135); we will keep an eye on this issue.
ElegantRL's code, especially the Helloworld, needs a lot of feedback to get better.

## **Part 1: Install ElegantRL**

In [5]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-120vlmei
  Running command git clone -q https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-120vlmei



## **Part 2: Import ElegantRL helloworld**

We hope that the `ElegantRL Helloworld` helps people who want to learn reinforcement learning to quickly run a few introductory examples.
- **Fewer lines of code** (fewer than 1000 lines).
- **Fewer package requirements** (only `torch` and `gym`).
- **A style consistent with the full version of ElegantRL**.

![File_structure of ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/figs/File_structure.png)

One-sentence summary: an agent (`agent.py`) with Actor-Critic networks (`net.py`) is trained (`run.py`) by interacting with an environment (`env.py`).
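The one-sentence summary can be sketched as a toy interaction loop. This is only an illustration under stated assumptions: `ToyEnv`, `RandomAgent`, and `run_episode` are hypothetical names, not the classes in `elegantrl_helloworld`, and the task itself is made up.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for env.py: the episode ends after max_step steps."""

    def __init__(self, max_step=5):
        self.max_step = max_step
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return 0.0  # initial state

    def step(self, action):
        self.step_count += 1
        reward = 1.0 if action == 1 else 0.0  # only action 1 is rewarded
        done = self.step_count >= self.max_step
        return float(self.step_count), reward, done


class RandomAgent:
    """Hypothetical stand-in for agent.py: picks one of two actions at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def select_action(self, state):
        return self.rng.randint(0, 1)


def run_episode(env, agent):
    """Hypothetical stand-in for run.py: one episode of agent-env interaction."""
    state = env.reset()
    episode_return, done = 0.0, False
    while not done:
        action = agent.select_action(state)
        state, reward, done = env.step(action)
        episode_return += reward
    return episode_return


episode_return = run_episode(ToyEnv(max_step=5), RandomAgent(seed=0))
print(episode_return)  # cumulative return of one episode, between 0.0 and 5.0
```

A real agent would also store the transitions and update its networks between episodes; the sketch leaves that out to keep the loop visible.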


In this tutorial, we only need the [elegantrl_helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld) directory, which the following code downloads.

The files in `elegantrl_helloworld` are:
`config.py`, `agent.py`, `net.py`, `env.py`, `run.py`

In [6]:
!rm -r -f /content/elegantrl_helloworld  # remove if the directory exists
!wget https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/elegantrl_helloworld -P /content/

--2022-04-21 03:32:48--  https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/elegantrl_helloworld
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld [following]
--2022-04-21 03:32:49--  https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl_helloworld
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/content/elegantrl_helloworld’

elegantrl_helloworl     [ <=>                ] 125.96K  --.-KB/s    in 0.03s   

2022-04-21 03:32:49 (4.68 MB/s) - ‘/content/elegantrl_helloworld’ saved [128986]



In [7]:
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.config import Arguments

## **Part 3: Train DQN on a discrete action space task.**

Train DQN on the [**Discrete action** space task `CartPole`](https://gym.openai.com/envs/CartPole-v1/)

See [/elegantrl_helloworld/config.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/elegantrl_helloworld/config.py) for more information about the hyperparameters.

```
class Arguments:
    def __init__(self, agent_class, env_func=None, env_args=None):
        self.env_num = self.env_args['env_num']  # env_num = 1. In vector env, env_num > 1.
        self.max_step = self.env_args['max_step']  # the max step of an episode
        self.env_name = self.env_args['env_name']  # the env name. Be used to set 'cwd'.
        self.state_dim = self.env_args['state_dim']  # vector dimension (feature number) of state
        self.action_dim = self.env_args['action_dim']  # vector dimension (feature number) of action
        self.if_discrete = self.env_args['if_discrete']  # discrete or continuous action space
        ...
```
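To make the role of `env_args` concrete, here is a minimal sketch of a configuration object that unpacks the same keys as the excerpt. `ArgumentsSketch`, its `working_dir` method, and the placeholder `AgentDQN` class are hypothetical names used only for illustration; the directory naming is an assumption inferred from the `cwd` printed in the training logs, not taken from the library source.

```python
class AgentDQN:
    """Placeholder class so the sketch runs without ElegantRL installed."""


class ArgumentsSketch:
    """Hypothetical, simplified stand-in for the Arguments excerpt."""

    def __init__(self, agent_class, env_func=None, env_args=None):
        self.agent_class = agent_class
        self.env_func = env_func
        self.env_args = env_args  # store the dict first, then unpack it
        self.env_num = env_args['env_num']
        self.max_step = env_args['max_step']
        self.env_name = env_args['env_name']
        self.state_dim = env_args['state_dim']
        self.action_dim = env_args['action_dim']
        self.if_discrete = env_args['if_discrete']
        self.learner_gpus = -1  # -1 means CPU in this tutorial

    def working_dir(self):
        # Assumed pattern: ./<env_name>_<agent name without 'Agent'>_<gpu id>
        agent_name = self.agent_class.__name__.replace('Agent', '', 1)
        return f"./{self.env_name}_{agent_name}_{self.learner_gpus}"


env_args = {'env_num': 1,
            'env_name': 'CartPole-v0',
            'max_step': 200,
            'state_dim': 4,
            'action_dim': 2,
            'if_discrete': True}

args = ArgumentsSketch(AgentDQN, env_func=None, env_args=env_args)
print(args.state_dim, args.action_dim, args.if_discrete)  # 4 2 True
print(args.working_dir())  # ./CartPole-v0_DQN_-1
```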

In [None]:
from elegantrl_helloworld.agent import AgentDQN
agent_class = AgentDQN
env_name = "CartPole-v0"

import gym
gym.logger.set_level(40)  # Block warning
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** 0  # scale rewards so that the target return is roughly 256
args.gamma = 0.97  # discount factor of future rewards

'''network update'''
args.target_step = args.max_step * 2  # collect target_step, then update network
args.net_dim = 2 ** 7  # the middle layer dimension of Fully Connected Network
args.num_layer = 3  # the layer number of MultiLayer Perceptron, `assert num_layer >= 2`
args.batch_size = 2 ** 7  # num of transitions sampled from replay buffer.
args.repeat_times = 2 ** 0  # repeatedly update network using ReplayBuffer to keep critic's loss small
args.explore_rate = 0.25  # epsilon-greedy for exploration.

'''evaluate'''
args.eval_gap = 2 ** 5  # evaluate the agent every `eval_gap` seconds
args.eval_times = 2 ** 3  # number of times that get episode return
args.break_step = int(8e4)  # break training if 'total_step > break_step'

env_args = {'env_num': 1,
            'env_name': 'CartPole-v0',
            'max_step': 200,
            'state_dim': 4,
            'action_dim': 2,
            'if_discrete': True}
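
The `explore_rate` comment in the cell above mentions epsilon-greedy exploration. A minimal sketch of that rule, assuming two Q-values like CartPole's two actions; `epsilon_greedy` is a hypothetical helper, not the function used inside `AgentDQN`:

```python
import random


def epsilon_greedy(q_values, explore_rate, rng):
    """Pick a uniformly random action with probability explore_rate, else the greedy one."""
    if rng.random() < explore_rate:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5]  # made-up Q-values; action 1 is the greedy action
actions = [epsilon_greedy(q_values, 0.25, rng) for _ in range(1000)]
greedy_fraction = actions.count(1) / len(actions)
print(greedy_fraction)  # in expectation 0.75 + 0.25 / 2 = 0.875
```

With `explore_rate = 0.25` and two actions, the greedy action is still chosen about 87.5% of the time, because a random draw also lands on it half the time.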


To use GPU `0`, set `args.learner_gpus = 0`. If it is set to `-1` or no GPU is available, the training program automatically falls back to the CPU.
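
The fallback rule can be sketched as a small function. `select_device` is a hypothetical helper, not part of ElegantRL; with PyTorch one would typically pass `torch.cuda.is_available()` as `cuda_available` and wrap the returned string in `torch.device(...)`.

```python
def select_device(gpu_id: int, cuda_available: bool) -> str:
    """Return a device string: the requested GPU if usable, otherwise the CPU."""
    if gpu_id >= 0 and cuda_available:
        return f"cuda:{gpu_id}"
    return "cpu"


print(select_device(0, cuda_available=True))    # cuda:0
print(select_device(0, cuda_available=False))   # cpu
print(select_device(-1, cuda_available=True))   # cpu
```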

- The cumulative return of CartPole-v0 lies in (0, (1, 195), 200).
- In general, the cumulative return of a task is written as (min score, (score of random actions, target score), max score).
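
One way to read this tuple format is to rescale an episode return so that random actions map to 0 and the target score maps to 1. `normalized_score` is a hypothetical helper introduced here for illustration only:

```python
def normalized_score(episode_return, random_score, target_score):
    """Map the random-action score to 0.0 and the target score to 1.0."""
    return (episode_return - random_score) / (target_score - random_score)


# Tuple for CartPole-v0: (min score, (random score, target score), max score)
cartpole_scores = (0, (1, 195), 200)
min_score, (random_score, target_score), max_score = cartpole_scores

print(normalized_score(1, random_score, target_score))    # 0.0
print(normalized_score(195, random_score, target_score))  # 1.0
```

A value above 1.0 means the agent exceeded the target score; a value near 0.0 means it is doing no better than random actions.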

In [None]:
args.learner_gpus = -1

train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of CartPole-v0  is ∈ (0, (1, 195), 200)')

| Arguments Keep cwd: ./CartPole-v0_DQN_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 4.08e+02  ExpR     1.00  | ObjC     0.08  ObjA     0.89
| Steps 1.79e+04  ExpR     1.00  | ObjC     0.25  ObjA    16.98
| Steps 3.13e+04  ExpR     1.00  | ObjC     0.32  ObjA    26.03
| Steps 4.11e+04  ExpR     1.00  | ObjC     0.93  ObjA    29.85
| Steps 4.98e+04  ExpR     1.00  | ObjC     1.19  ObjA    31.70
| Steps 5.72e+04  ExpR     1.00  | ObjC     0.06  ObjA    30.98
| Steps 6.43e+04  ExpR     1.00  | ObjC     0.21  ObjA    31.57
| Steps 7.06e+04  ExpR     1.00  | ObjC     0.81  ObjA    31.66
| Steps 7.59e+04  ExpR     1.00  | ObjC     

Train DQN on the [**Discrete action** space env `LunarLander`](https://gym.openai.com/envs/LunarLander-v2/)

**You can skip the code below**, because DQN takes over 6000 seconds to train on this task, which is too slow. (DuelingDoubleDQN takes less than 1000 seconds to train on the LunarLander-v2 task.)

There are many other DQN variants that reach higher cumulative returns and take less time to train. See [examples/demo_DQN_Dueling_Double_DQN.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/examples/demo_DQN_Dueling_Double_DQN.py)

In [8]:
from elegantrl_helloworld.agent import AgentDQN
agent_class = AgentDQN
env_name = "LunarLander-v2"

import gym
gym.logger.set_level(40)  # Block warning
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** 0
args.gamma = 0.99

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 7
args.num_layer = 3

args.batch_size = 2 ** 6

args.repeat_times = 2 ** 0
args.explore_noise = 0.125

'''evaluate'''
args.eval_gap = 2 ** 7
args.eval_times = 2 ** 4
args.break_step = int(4e5)  # LunarLander needs a larger `break_step`

args.learner_gpus = -1  # -1 means use the CPU
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of LunarLander-v2 is ∈ (-1800, (-600, 200), 340)')

env_args = {'env_num': 1,
            'env_name': 'LunarLander-v2',
            'max_step': 1000,
            'state_dim': 8,
            'action_dim': 4,
            'if_discrete': True}
| Arguments Remove cwd: ./LunarLander-v2_DQN_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 1.12e+03  ExpR    -2.70  | ObjC     3.05  ObjA    -4.95
| Steps 1.67e+04  ExpR    -0.34  | ObjC    11.37  ObjA   -49.77
| Steps 3.32e+04  ExpR    -0.39  | ObjC    17.30  ObjA   -46.68
| Steps 4.42e+04  ExpR    -0.11  | ObjC     1.10  ObjA   -27.90
| Steps 5.71e+04  ExpR    -0.06  | ObjC     1.55  ObjA   -33.38
| Steps 6.81e+04  ExpR     0.04  | ObjC    

## **Part 4: Train DDPG on continuous action space task.**

Train DDPG on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/)

We show a custom env in [elegantrl_helloworld/env.py `class PendulumEnv`](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/elegantrl_helloworld/env.py#L19-L23)

The OpenAI Pendulum env sets its action space to (-2, +2). This is inconvenient, because actor networks commonly end in a `tanh` whose output lies in (-1, +1). We suggest using an action space of (-1, +1) when designing your own env.
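
One way to follow this suggestion is to wrap an existing env so that the agent acts in (-1, +1) and the wrapper rescales the action before passing it on. The sketch below is an assumption-laden illustration: `ScaledActionEnv` and `EchoEnv` are hypothetical classes, and this is not necessarily how `PendulumEnv` in `env.py` is written.

```python
class ScaledActionEnv:
    """Hypothetical wrapper: agent actions in (-1, +1) are scaled to (-max_action, +max_action)."""

    def __init__(self, env, max_action=2.0):
        self.env = env
        self.max_action = max_action

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Clip to the agent-side range first, then scale to the env-side range.
        clipped = [max(-1.0, min(1.0, a)) for a in action]
        scaled = [a * self.max_action for a in clipped]
        return self.env.step(scaled)


class EchoEnv:
    """Toy env for the demo: it only remembers the last action it received."""

    def __init__(self):
        self.last_action = None

    def reset(self):
        return [0.0]

    def step(self, action):
        self.last_action = action
        return [0.0], 0.0, False, {}


inner_env = EchoEnv()
wrapped_env = ScaledActionEnv(inner_env, max_action=2.0)

wrapped_env.step([0.5])
print(inner_env.last_action)  # [1.0]

wrapped_env.step([3.0])  # out-of-range action is clipped to 1.0, then scaled
print(inner_env.last_action)  # [2.0]
```

Keeping the agent-side range fixed at (-1, +1) means the same network and exploration-noise settings can be reused across envs with different physical action limits.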


In [9]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentDDPG
agent_class = AgentDDPG

from elegantrl_helloworld.env import PendulumEnv
env = PendulumEnv('Pendulum-v0')  # PendulumEnv('Pendulum-v1')
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 2
args.net_dim = 2 ** 7
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 0
args.explore_noise = 0.1

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(1e5)

args.learner_gpus = -1  # -1 means use the CPU
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of Pendulum-v1 is ∈ (-1600, (-1400, -200), 0)')

env_args = {'env_num': 1,
            'env_name': 'Pendulum-v0',
            'max_step': 200,
            'state_dim': 3,
            'action_dim': 1,
            'if_discrete': False}
| Arguments Remove cwd: ./Pendulum-v0_DDPG_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 4.00e+02  ExpR    -3.71  | ObjC     1.13  ObjA    -0.83
| Steps 1.12e+04  ExpR    -0.32  | ObjC     0.32  ObjA   -39.47
| Steps 1.96e+04  ExpR    -0.65  | ObjC     0.57  ObjA   -40.92
| Steps 2.68e+04  ExpR    -0.16  | ObjC     0.32  ObjA   -27.83
| Steps 3.32e+04  ExpR    -0.31  | ObjC     0.14  ObjA   -25.19
| Steps 3.88e+04  ExpR    -0.32  | ObjC     1.16

## **Part 5: Train PPO on continuous action space task.**

Train PPO on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/). 


In [11]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO

from elegantrl_helloworld.env import PendulumEnv
env = PendulumEnv()
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 8
args.net_dim = 2 ** 7
args.num_layer = 2
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 5

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(8e5)

args.learner_gpus = -1
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of Pendulum-v1 is ∈ (-1600, (-1400, -200), 0)')

env_args = {'env_num': 1,
            'env_name': 'Pendulum-v0',
            'max_step': 200,
            'state_dim': 3,
            'action_dim': 1,
            'if_discrete': False}
| Arguments Remove cwd: ./Pendulum-v0_PPO_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 1.60e+03  ExpR    -3.34  | ObjC    93.37  ObjA     0.02
| Steps 9.76e+04  ExpR    -2.69  | ObjC    26.47  ObjA     0.13
| Steps 1.95e+05  ExpR    -2.37  | ObjC    14.92  ObjA     0.12
| Steps 2.94e+05  ExpR    -1.95  | ObjC    10.23  ObjA     0.03
| Steps 3.94e+05  ExpR    -1.75  | ObjC     7.16  ObjA    -0.01
| Steps 4.93e+05  ExpR    -0.87  | ObjC     6.48 

Train PPO on [**Continuous action** space env `LunarLanderContinuous`](https://gym.openai.com/envs/LunarLanderContinuous-v2/)

In [12]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO
env_name = "LunarLanderContinuous-v2"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.gamma = 0.99
args.reward_scale = 2 ** -1

'''network update'''
args.target_step = args.max_step * 8
args.num_layer = 3
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 4
args.lambda_entropy = 0.04

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 5
args.break_step = int(4e5)

args.learner_gpus = -1
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of LunarLanderContinuous-v2 is ∈ (-1800, (-300, 200), 310+)')

env_args = {'env_num': 1,
            'env_name': 'LunarLanderContinuous-v2',
            'max_step': 1000,
            'state_dim': 8,
            'action_dim': 2,
            'if_discrete': False}
| Arguments Remove cwd: ./LunarLanderContinuous-v2_PPO_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 8.10e+03  ExpR    -1.06  | ObjC    26.53  ObjA     0.02
| Steps 5.63e+04  ExpR    -0.21  | ObjC     9.94  ObjA     0.01
| Steps 1.05e+05  ExpR    -0.05  | ObjC    10.62  ObjA    -0.02
| Steps 1.30e+05  ExpR    -0.01  | ObjC     8.52  ObjA     0.02
| Steps 1.54e+05  ExpR     0.02  | ObjC     6.58  ObjA    -0.10
| Steps 1.79e+05  ExpR

Train PPO on [**Continuous action** space env `BipedalWalker`](https://gym.openai.com/envs/BipedalWalker-v2/)

In [13]:
from elegantrl_helloworld.config import Arguments
from elegantrl_helloworld.run import train_agent, evaluate_agent
from elegantrl_helloworld.env import get_gym_env_args
from elegantrl_helloworld.agent import AgentPPO
agent_class = AgentPPO
env_name = "BipedalWalker-v3"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Arguments(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1
args.gamma = 0.98

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 8
args.num_layer = 3
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 4

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 4
args.break_step = int(1e6)

args.learner_gpus = -1
train_agent(args)
evaluate_agent(args)
print('| The cumulative returns of BipedalWalker-v3 is ∈ (-150, (-100, 280), 320+)')


env_args = {'env_num': 1,
            'env_name': 'BipedalWalker-v3',
            'max_step': 1600,
            'state_dim': 24,
            'action_dim': 4,
            'if_discrete': False}
| Arguments Remove cwd: ./BipedalWalker-v3_PPO_-1

| `Steps` denotes the number of samples, or the total training step, or the running times of `env.step()`.
| `ExpR` denotes average rewards during exploration. The agent gets this rewards with noisy action.
| `ObjC` denotes the objective of Critic network. Or call it loss function of critic network.
| `ObjA` denotes the objective of Actor network. It is the average Q value of the critic network.

| Steps 1.60e+03  ExpR    -0.02  | ObjC     0.10  ObjA     0.05
| Steps 3.78e+04  ExpR    -0.05  | ObjC     1.02  ObjA     0.01
| Steps 7.37e+04  ExpR    -0.05  | ObjC     0.80  ObjA     0.04
| Steps 1.10e+05  ExpR    -0.02  | ObjC     0.03  ObjA    -0.01
| Steps 1.45e+05  ExpR    -0.00  | ObjC     0.05  ObjA     0.01
| Steps 1.82e+05  ExpR     0.01  | Ob