## The tutorial version and formal version
You can find a simple version in `elegantrl/tutorial/run.py`
You can also find demo 1~3 in `elegantrl/run.py` (advanced version)

elegantrl/tutorial <1000 lines 
```
agent.py # 530 lines
net.py   # 160 lines
run.py   # 320 lines
env.py   # 160 lines (not necessary)
```
The structure of the formal version is similar to that of the tutorial version.

In [1]:
from elegantrl.tutorial.run import Arguments, train_and_evaluate
from elegantrl.tutorial.env import PreprocessEnv
import gym
gym.logger.set_level(40)  # Block warning

## Demo 1: Discrete action space

In [2]:
'''choose a DRL algorithm'''
from elegantrl.tutorial.agent import AgentDoubleDQN  # AgentDQN

args = Arguments(agent=None, env=None, gpu_id=None)
args.agent = AgentDoubleDQN()

In [3]:
'''choose environment'''
args.env = PreprocessEnv(env=gym.make('CartPole-v0'))
args.net_dim = 2 ** 7  # change a default hyper-parameter
args.batch_size = 2 ** 7
"TotalStep: 2e3, TargetReward: , UsedTime: 10s"

# args.env = PreprocessEnv(env=gym.make('LunarLander-v2'))
# args.net_dim = 2 ** 8
# args.batch_size = 2 ** 8
# "TotalStep: 6e4, TargetReward: 200, UsedTime: 600s"


| env_name: CartPole-v0, action space if_discrete: True
| state_dim: 4, action_dim: 2, action_max: 1
| max_step: 200 target_reward: 195.0


'TotalStep: 2e3, TargetReward: , UsedTime: 10s'

In [4]:
'''train and evaluate'''
train_and_evaluate(args)

| GPU id: 0, cwd: ./CartPole-v0_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00    200.00 |
ID      Step   TargetR |    avgR      stdR   UsedTime  ########
0   1.02e+03    195.00 |  200.00      0.00         12  ########


## Demo 2: Continuous action space

In [5]:
'''DEMO 2.1: choose an off-policy DRL algorithm'''
from elegantrl.agent import AgentSAC  # AgentTD3, AgentDDPG
args = Arguments(if_on_policy=False)
args.agent = AgentSAC()

In [6]:
'''DEMO 2.2: choose an on-policy DRL algorithm'''
from elegantrl.tutorial.agent import AgentPPO 
args = Arguments(if_on_policy=True)  # on-policy hyper-parameters differ from off-policy ones
args.agent = AgentPPO()

In [7]:
'''choose environment'''
env = gym.make('Pendulum-v0')
env.target_reward = -200  # set target_reward manually for env 'Pendulum-v0'
args.env = PreprocessEnv(env=env)
args.reward_scale = 2 ** -3  # RewardRange: -1800 < -200 < -50 < 0
args.net_dim = 2 ** 7
args.batch_size = 2 ** 7
"TotalStep: 3e5, TargetReward: -200, UsedTime: 300s"

# args.env = PreprocessEnv(env=gym.make('LunarLanderContinuous-v2'))
# args.reward_scale = 2 ** 0  # RewardRange: -800 < -200 < 200 < 302
# "TotalStep: 9e4, TargetReward: 200, UsedTime: 2500s"

# args.env = PreprocessEnv(env=gym.make('BipedalWalker-v3'))
# args.reward_scale = 2 ** 0  # RewardRange: -200 < -150 < 300 < 334
# args.break_step = int(2e5)
# args.if_allow_break = False
# "TotalStep: 2e5, TargetReward: 300, UsedTime: 5000s"


| env_name: Pendulum-v0, action space if_discrete: False
| state_dim: 3, action_dim: 1, action_max: 2.0
| max_step: 200 target_reward: -200


'TotalStep: 2e5, TargetReward: 300, UsedTime: 5000s'

## Demo 3: Custom Env from AI4Finance

In [12]:
args = Arguments(if_on_policy=True)
'''choose a DRL algorithm'''
from elegantrl.tutorial.agent import AgentPPO
args.agent = AgentPPO()

from elegantrl.tutorial.env import FinanceMultiStockEnv  # a standard env for ElegantRL, does not need PreprocessEnv()
args.env = FinanceMultiStockEnv(if_train=True)
args.env_eval = FinanceMultiStockEnv(if_train=False)  # eva_len = 1699 - train_len
args.reward_scale = 2 ** 0  # RewardRange: 0 < 1.0 < 1.25 <
args.break_step = int(5e6)
args.max_step = args.env.max_step
args.max_memo = (args.max_step - 1) * 8
args.batch_size = 2 ** 11
"TotalStep:  2e5, TargetReward: 1.25, UsedTime:  200s"

'TotalStep: 10e5, TargetReward: 1.62, UsedTime: 1000s'

In [13]:
'''train and evaluate'''
train_and_evaluate(args)
# args.rollout_num = 8
# train_and_evaluate__multiprocessing(args)  # try multiprocessing in formal version

| GPU id: 0, cwd: ./FinanceStock-v2_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00      1.06 |
0   5.12e+03      1.10 |
0   1.02e+04      1.22 |
0   2.56e+04      1.29 |
ID      Step   TargetR |    avgR      stdR   UsedTime  ########
0   3.07e+04      1.25 |    1.29      0.03         29  ########


## Demo 4: train in PyBullet (MuJoCo) (wait for adding)

In [3]:
from elegantrl.run import Arguments, train_and_evaluate__multiprocessing
from elegantrl.env import PreprocessEnv

import gym  # don't worry about 'WARN: Box bound precision lowered by casting to float32'
import pybullet_envs  # free PyBullet envs as an alternative to paid MuJoCo

In [3]:
'''DEMO 4.1: choose an off-policy DRL algorithm'''
args = Arguments(if_on_policy=False)

from elegantrl.agent import AgentModSAC  # AgentSAC, AgentTD3, AgentDDPG
args.agent = AgentModSAC()  # AgentSAC(), AgentTD3(), AgentDDPG()
args.agent.if_use_dn = True
args.net_dim = 2 ** 7  # the default 2 ** 8 is too large when if_use_dn = True

env_name = 'AntBulletEnv-v0'
assert env_name in {"AntBulletEnv-v0",
                    "Walker2DBulletEnv-v0", 
                    "HalfCheetahBulletEnv-v0",
                    "HumanoidBulletEnv-v0", 
                    "HumanoidFlagrunBulletEnv-v0", 
                    "HumanoidFlagrunHarderBulletEnv-v0",}
args.env = PreprocessEnv(env=gym.make(env_name))
args.env.max_step = 2 ** 10
args.env.target_reward = 1500


| env_name:  AntBulletEnv-v0, action space if_discrete: False
| state_dim:   28, action_dim: 8, action_max: 1.0
| max_step:  1000, target_reward: 2500.0


In [4]:
args.break_step = int(1e6 * 8)  # (5e5) 1e6, UsedTime: (15,000s) 30,000s
args.reward_scale = 2 ** -2  # RewardRange: -50 < 0 < 2500 < 3340
args.max_memo = 2 ** 20
args.batch_size = 2 ** 9
args.show_gap = 2 ** 8  # for Recorder
args.eva_size1 = 2 ** 1  # for Recorder
args.eva_size2 = 2 ** 3  # for Recorder
"TotalStep: 3e5, TargetReward: 1500, UsedTime:  8ks"
"TotalStep: 6e5, TargetReward: 2500, UsedTime: 20ks"

args.rollout_num = 4
train_and_evaluate__multiprocessing(args)

| multiprocessing, act_workers: 4
| multiprocessing, None:
| GPU id: 0, cwd: ./AgentModSAC/AntBulletEnv-v0_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00    453.28 |
0   1.02e+04    453.28 |  409.85    103.88       0.01      0.04
0   2.36e+04    453.28 |  382.60    110.55       0.04      0.29
0   3.28e+04    471.39 |
0   3.58e+04    491.33 |
0   3.69e+04    491.33 |  491.33     76.77       0.12      0.34
0   5.22e+04    491.33 |  389.91    131.47       0.08      0.38
0   6.45e+04    491.33 |  381.27     93.70       0.08      0.34
0   6.86e+04    497.62 |
0   7.58e+04    497.62 |  467.76    108.42       0.06      0.25
0   9.01e+04    497.62 |  491.25     57.79       0.05      0.24
0   9.42e+04    502.69 |
0   9.83e+04    502.69 |   25.88     17.32       0.04      0.19
0   1.09e+05    502.69 |  457.20     82.28       0.03      0.15
0   1.20e+05    502.69 |  465.17    131.43       0.03      0.11
0   1.22e+05    517.77 |
0   1.23e+05    603.

In [7]:
'''DEMO 4.2: choose an on-policy DRL algorithm'''
from elegantrl.run import Arguments, train_and_evaluate__multiprocessing
from elegantrl.env import PreprocessEnv

In [8]:
args = Arguments(if_on_policy=True)  # AgentPPO is an on-policy algorithm

from elegantrl.agent import AgentPPO
args.agent = AgentPPO()
args.agent.if_use_gae = True


import gym  # don't worry about 'WARN: Box bound precision lowered by casting to float32'
import pybullet_envs  # free PyBullet envs as an alternative to paid MuJoCo
env_name = 'AntBulletEnv-v0'
assert env_name in {"AntBulletEnv-v0",
                    "Walker2DBulletEnv-v0", 
                    "HalfCheetahBulletEnv-v0",
                    "HumanoidBulletEnv-v0", 
                    "HumanoidFlagrunBulletEnv-v0", 
                    "HumanoidFlagrunHarderBulletEnv-v0",}
args.env = PreprocessEnv(env=gym.make(env_name))
args.env.max_step = 2 ** 10
args.env.target_reward = 1500


| env_name:  AntBulletEnv-v0, action space if_discrete: False
| state_dim:   28, action_dim: 8, action_max: 1.0
| max_step:  1000, target_reward: 2500.0


In [9]:
args.break_step = int(2e6 * 8)  # (5e5) 1e6, UsedTime: (15,000s) 30,000s
args.reward_scale = 2 ** -2  # (-50) 0 ~ 2500 (3340)
args.max_memo = 2 ** 11
args.repeat_times = 2 ** 3
args.batch_size = 2 ** 10
args.net_dim = 2 ** 9
args.show_gap = 2 ** 8  # for Recorder
args.eva_size1 = 2 ** 1  # for Recorder
args.eva_size2 = 2 ** 3  # for Recorder
"TotalStep:  2e6, TargetReward: 1500, UsedTime:  3ks"
"TotalStep: 13e6, TargetReward: 2400, UsedTime: 21ks"

args.rollout_num = 4
train_and_evaluate__multiprocessing(args)

| multiprocessing, act_workers: 4
| multiprocessing, None:
| GPU id: 0, cwd: ./AgentPPO/AntBulletEnv-v0_0
| Remove history
ID      Step      MaxR |    avgR      stdR       objA      objC
0   0.00e+00      9.95 |
0   2.90e+03     10.07 |
0   1.31e+04     11.03 |
0   1.71e+04     38.62 |
0   3.12e+04    449.95 |
0   4.28e+04    556.44 |
0   4.66e+04    680.85 |
0   5.66e+04    767.35 |
0   1.21e+05    767.35 |  727.64     34.92      -0.50      1.31
0   1.43e+05    822.35 |
0   1.59e+05    854.50 |
0   1.67e+05    925.34 |
0   2.15e+05    925.34 |  738.72     70.94      -0.51      1.35
0   3.59e+05    925.34 |  729.24     42.17      -0.52      1.16
0   5.03e+05    925.34 |  804.23     50.80      -0.53      1.44
0   6.47e+05    925.34 |  810.85      2.67      -0.54      1.53
0   7.59e+05    955.70 |
0   7.63e+05    955.70 |  955.70    139.78      -0.55      1.45
0   7.63e+05   1007.63 |
0   7.75e+05   1076.46 |
0   7.83e+05   1105.29 |
0   8.23e+05   1120.53 |
0   8.31e+05   1120.53 | 1077

## API of ElegantRL

### level 0 API: training pipeline (User-oriented)

- `class Arguments` 
- `class AgentXXX`
- `class PreprocessEnv`
- `def train_and_evaluate(args)`
- `def train_and_evaluate__multiprocessing(args)`

---
`class Arguments` 
Save the hyper-parameters of DRL algorithms and initialize the training settings.
The user should set the DRL algorithm and the training environment. The other hyperparameters are set to their defaults.

---
`class AgentXXX`
The built-in DRL algorithms of ElegantRL are named `AgentXXX`:
- Discrete action space: DQN, DuelingDQN, DoubleDQN, ...
- Continuous action space (off-policy): DDPG, TD3, SAC, ...
- Continuous action space (on-policy): PPO, GaePPO ...

---
`class PreprocessEnv`
Preprocess an OpenAI gym standard environment for DRL.
- A DRL algorithm needs to know the env information to create its networks automatically. We find and assign these variables: `env.state_dim, env.action_dim, ...`
- Some OpenAI gym standard environments are not standard enough, so we rescale some continuous action ranges into (-1, +1). We also convert the data type of state from `float64` to `float32`.
- Some OpenAI gym standard environments have bad state design, so we normalize the state. See more detail in `def get_avg_std__for_state_norm` in `env.py`.
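
The last two points can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names `rescale_action` and `normalize_state` are hypothetical, and the formula `(state + neg_state_avg) * div_state_std` is inferred from the attribute names `neg_state_avg` and `div_state_std` listed for `PreprocessEnv`, not copied from `env.py`.

```python
def rescale_action(action, action_max):
    """Map a policy output in (-1, +1) to the env's range (-action_max, +action_max)."""
    return [a * action_max for a in action]


def normalize_state(state, neg_state_avg, div_state_std):
    """Assumed form of state normalization: (state - avg) / std,
    written with a pre-negated mean and a pre-inverted std."""
    return [(s + n) * d for s, n, d in zip(state, neg_state_avg, div_state_std)]


# Pendulum-v0 reports action_max = 2.0, so a policy output of 0.5 becomes 1.0.
env_action = rescale_action([0.5], action_max=2.0)

# Hypothetical statistics: avg = [1.0, -2.0], std = [2.0, 4.0]
neg_state_avg = [-1.0, 2.0]         # -avg
div_state_std = [1 / 2.0, 1 / 4.0]  # 1 / std
norm_state = normalize_state([3.0, 2.0], neg_state_avg, div_state_std)
print(env_action, norm_state)  # [1.0] [1.0, 1.0]
```

Storing the negated mean and the inverted std turns the per-step normalization into one addition and one multiplication.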

---
`def train_and_evaluate(args)`
Run DRL training in a single process.


`def train_and_evaluate__multiprocessing(args)`
Run DRL training with multiple processes.

---
For example (see more in `def demo***`):

In [None]:
# set hyperparameters
from elegantrl.run import Arguments, train_and_evaluate
args = Arguments(agent=None, env=None, gpu_id=None)

# set DRL algorithms
from elegantrl.agent import AgentPPO
args.agent = AgentPPO()
args.agent.if_use_gae = True  # change some hyperparameters of DRL algorithms

# set training environment
import gym
from elegantrl.env import PreprocessEnv
env = gym.make('Pendulum-v0')
env.target_reward = -200  # set target_reward manually for env 'Pendulum-v0'
args.env = PreprocessEnv(env=env)

# start training
train_and_evaluate(args)  # or train_and_evaluate__multiprocessing(args)

### level 0 API: training pipeline 

---
`class AgentBase` is the base class of all the DRL algorithms (both DQN variants and Actor-critic Methods).
- `class AgentDQN(AgentBase)`
- `class AgentDDPG(AgentBase)`
- `class AgentTD3(AgentBase)`
- `class AgentSAC(AgentBase)`
- `class AgentPPO(AgentBase)`

This base class should have the following attributes:
- `init(net_dim, state_dim, action_dim)` initializes the neural networks, optimizers, criterion and other objects for training. We explicitly call `init(...)` for multiprocessing instead of doing this work in `__init__()`.
    - `int net_dim`: the dimension of networks (the width of neural networks)
    - `int state_dim`: the dimension of state (the length of the state vector)
    - `int action_dim`: the dimension of action (or the number of discrete actions)
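
One way to read the `init(...)` convention: `__init__()` stays cheap so the agent object can be handed to a worker process, and the expensive parts are built later, inside that process. The sketch below illustrates the pattern with a hypothetical `ToyAgent`; nested lists stand in for the networks, and none of it is ElegantRL's actual code.

```python
class ToyAgent:
    def __init__(self):
        # Cheap: only placeholders, nothing large is allocated here.
        self.weights = None
        self.learning_rate = 1e-4

    def init(self, net_dim, state_dim, action_dim):
        # Expensive: build the "network" (two weight matrices as nested lists).
        self.weights = [
            [[0.0] * net_dim for _ in range(state_dim)],   # state -> hidden
            [[0.0] * action_dim for _ in range(net_dim)],  # hidden -> action
        ]


agent = ToyAgent()
print(agent.weights is None)  # True: nothing has been built yet

agent.init(net_dim=2 ** 7, state_dim=4, action_dim=2)
print(len(agent.weights[0]), len(agent.weights[0][0]))  # 4 128
```

In this sketch `init(...)` modifies the agent in place and returns nothing, so its result is not assigned back to `agent`.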



In [None]:
# AgentXXX
agent = AgentPPO()  # take AgentPPO as an example
agent.init(net_dim, state_dim, action_dim)  # build the networks explicitly

### level 1 API: RAM management

---
`class ReplayBuffer` Experience replay buffer. We save the env transitions `(state, action, reward, ... )` as a trajectory in contiguous memory for high-performance training.
`class ReplayBufferMP` for multiprocessing.
- `__init__(self, max_len, state_dim, action_dim, if_on_policy, if_gpu)`
    - `max_len` the maximum capacity. First In First Out. We don't use random out because we save the environment transitions as an ordered trajectory.
    - `if_on_policy` switch between on-policy and off-policy DRL mode.
    - `if_gpu` switch between `torch.tensor` (GPU) mode and `numpy.ndarray` (CPU) mode.
- `append_buffer(state, other)` same as `list.append()` in Python. `state, other` are `numpy.ndarray` or `torch.tensor`.
- `extend_buffer(state, other)` same as `list.extend()` in Python.
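
To make the First-In-First-Out behaviour concrete, here is a minimal sketch of a fixed-capacity buffer that overwrites its oldest rows. `ToyBuffer` is a hypothetical stand-in: it uses Python lists rather than a preallocated `numpy` or `torch` array, ignores `if_on_policy` and `if_gpu`, and its method names merely mirror the ones listed above.

```python
import random


class ToyBuffer:
    def __init__(self, max_len):
        self.max_len = max_len
        self.rows = [None] * max_len  # preallocated storage, reused in place
        self.next_idx = 0             # where the next row will be written
        self.now_len = 0              # number of valid rows

    def append_buffer(self, state, other):
        self.rows[self.next_idx] = (state, other)
        self.next_idx = (self.next_idx + 1) % self.max_len  # wrap: oldest row is overwritten
        self.now_len = min(self.now_len + 1, self.max_len)

    def extend_buffer(self, states, others):
        for state, other in zip(states, others):
            self.append_buffer(state, other)

    def sample_batch(self, batch_size):
        # Uniform sampling with replacement over the valid rows.
        indices = [random.randrange(self.now_len) for _ in range(batch_size)]
        return [self.rows[i] for i in indices]


buffer = ToyBuffer(max_len=3)
buffer.extend_buffer(states=[[0.0], [1.0], [2.0], [3.0]],
                     others=[[0.1], [0.2], [0.3], [0.4]])
stored_states = sorted(row[0][0] for row in buffer.rows)
print(buffer.now_len, stored_states)  # 3 [1.0, 2.0, 3.0]
```

The fourth transition lands on index 0 and replaces the oldest one, which is why the state `0.0` is no longer stored.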

In [None]:
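# ReplayBuffer
import numpy as np

state = np.zeros(state_dim, dtype=np.float32)  # shape (state_dim, )
other = np.zeros(other_dim, dtype=np.float32)  # shape (other_dim, ): action, reward, ...

buffer = ReplayBuffer(max_len, state_dim, action_dim, if_on_policy, if_gpu)
buffer.append_buffer(state, other)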

## Name list of ElegantRL

environment
- `str env_name`: the environment name, such as `'CartPole-v0', 'LunarLander-v2'`
- `int state_dim`: the dimension of state (the length of the state vector)
- `int action_dim`: the dimension of action (or the number of discrete actions)
- `int max_step`: the max step of an episode. The actor will break this episode of environment exploration when `done=True` or `steps > max_step`.
- `bool if_discrete`: discrete or continuous action space

training 
- `int net_dim`: the dimension of networks (the width of neural networks)

other:
- `bool if_xxx`: a Boolean value (a flag).
- `bool if_on_policy`: whether the algorithm is on-policy.

# API List

---
## `class Arguments()` 
set hyperparameters

### `init_before_training(self, if_main)` 
prepare training environment.

- `bool if_main`: build current work directory

---
### `train_and_evaluate(args)` 
single processing, `args=Arguments()`

### `train_and_evaluate__multiprocessing(args)` 
multiprocessing, `args=Arguments()`

---
## `class AgentBase()` 

### `__init__()` 
default initialization


### `init(self, net_dim, state_dim, action_dim)` 
explicitly call `init()` to build networks

- `int net_dim`: the dimension of networks 

- `int state_dim`: the dimension of state 

- `int action_dim`: the dimension of action (or the number of discrete actions)


### `select_action(self, state)` 
select action for exploration

- `array state`: the shape of state is `(state_dim, )`


### `store_transition(self, env, buffer, target_step, reward_scale, gamma)`
store transition (state, action, reward, ...) to ReplayBuffer

- `env`: DRL training environment, it has `.reset()` and `.step()`

- `buffer`: experience replay buffer. ReplayBuffer has `.append_buffer()` 

- `int target_step`: number of steps we plan to collect in env

- `float reward_scale`: scale the reward size

- `float gamma`: discount factor

We plan to move `reward_scale` and `gamma` to `__init__()`.
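
A rough sketch of what such an exploration loop can look like is given below. It is an assumption-laden illustration rather than ElegantRL's implementation: `ToyEnv`, `explore` and the random policy are hypothetical, the buffer is a plain list, and storing `gamma` (or `0.0` at episode end) as a `mask` next to the scaled reward is one common convention that the `reward` and `mask` names in `sample_for_ppo` suggest.

```python
import random


class ToyEnv:
    """Hypothetical 1-D env: every episode ends after `max_step` steps."""
    max_step = 4

    def reset(self):
        self.steps = 0
        return [0.0]

    def step(self, action):
        self.steps += 1
        next_state = [float(self.steps)]
        reward = 1.0
        done = self.steps >= self.max_step
        return next_state, reward, done, {}


def explore(env, buffer, target_step, reward_scale, gamma):
    state = env.reset()
    for _ in range(target_step):
        action = random.uniform(-1.0, 1.0)  # stand-in for select_action(state)
        next_state, reward, done, _ = env.step(action)
        mask = 0.0 if done else gamma  # 0.0 cuts the bootstrap at episode end
        buffer.append((state, (reward * reward_scale, mask, action)))
        state = env.reset() if done else next_state


buffer = []
explore(ToyEnv(), buffer, target_step=6, reward_scale=2 ** -3, gamma=0.99)
print([other[1] for _, other in buffer])  # [0.99, 0.99, 0.99, 0.0, 0.99, 0.99]
```

The loop keeps collecting across episode boundaries until `target_step` transitions are stored, resetting the env whenever `done` is true.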


### `update_net(self, buffer, target_step, batch_size, repeat_times)`
update networks using data sampled from ReplayBuffer

- `buffer`: experience replay buffer. ReplayBuffer has `.sample_batch()`

- `int target_step`: number of target steps added to ReplayBuffer

- `int batch_size`: number of samples for stochastic gradient descent


### `save_load_model(self, cwd, if_save)`
save the neural networks to `cwd`, or load them from it

- `str cwd`: current working directory

- `bool if_save` save or load model


### `soft_update(self, target_net, current_net)` 
soft target update using `self.tau` (`2**-8` by default)

- `target_net`: updated via soft target update from `current_net`

- `current_net`: updated via gradient descent
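
The update rule itself is the usual Polyak average, `target = tau * current + (1 - tau) * target`. The sketch below applies it to plain lists of floats; a real implementation would update network parameters in place, and the standalone function shown here is illustrative only.

```python
def soft_update(target_params, current_params, tau=2 ** -8):
    """Move each target parameter a fraction `tau` of the way toward the current one."""
    return [tau * cur + (1.0 - tau) * tar
            for tar, cur in zip(target_params, current_params)]


target = [0.0, 1.0]
current = [1.0, 1.0]
target = soft_update(target, current)
print(target)  # [0.00390625, 1.0]
```

With `tau = 2**-8` the target moves only about 0.4% of the gap per call, so it trails the current network slowly.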

---
## `class ReplayBuffer`
experience replay buffer

### `__init__(self, max_len, state_dim, action_dim, if_on_policy, if_gpu)`
create a contiguous memory space to store data

- `max_len`: maximum capacity, First In First Out.

- `int state_dim`: the dimension of state 

- `int action_dim`: the dimension of action (`action_dim=1` for discrete action)

- `bool if_on_policy`: on-policy or off-policy

- `if_gpu`: create the memory space in CPU RAM or GPU RAM

### `append_buffer(self, state, other)`
append to ReplayBuffer, same as `list.append()`

- `array state`: the shape is `(state_dim, )`

- `array other`: the shape is `(other_dim, )`, including action, reward, ...


### `extend_buffer(self, state, other)`
extend to ReplayBuffer, same as `list.extend()`

- `array state`: the shape is `(-1, state_dim)`

- `array other`: the shape is `(-1, other_dim)`, including action, reward, ...


### `sample_batch(self, batch_size)`
sample a batch of data from ReplayBuffer randomly for stochastic gradient descent

- `int batch_size`: number of data in a batch


### `sample_for_ppo(self)`
sample all the data from ReplayBuffer.

- return `float reward`

- return `float mask`

- return `array action`

- return `array noise`

- return `array state`
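
How the `reward` and `mask` columns might be consumed: if `mask` holds the discount factor for ordinary steps and `0.0` for the last step of an episode (an assumption, not something this page states), discounted returns can be accumulated backwards over the ordered trajectory. The helper name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, masks):
    """R_t = r_t + mask_t * R_{t+1}, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    next_return = 0.0
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + masks[t] * next_return
        returns[t] = next_return
    return returns


# Two episodes stored back to back; mask 0.0 marks each episode's final step.
rewards = [1.0, 1.0, 1.0, 2.0, 2.0]
masks = [0.5, 0.5, 0.0, 0.5, 0.0]
print(discounted_returns(rewards, masks))  # [1.75, 1.5, 1.0, 3.0, 2.0]
```

Because the zero mask stops the accumulation, the return of the first episode is unaffected by the rewards of the second one.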

### `update__now_len__before_sample(self)`
update `now_len` (pointer) before sampling data from ReplayBuffer

### `empty_memories__before_explore(self)`
empty the memories of ReplayBuffer before exploring for on-policy

### `print_state_norm(self)`
print the `avg` and `std` for state normalization, computed from the states in ReplayBuffer after the training pipeline finishes
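
As a small illustration of the statistics involved, the per-dimension `avg` and `std` of a batch of stored states can be computed as below. The sample states are made up, and the choice of the population standard deviation (`pstdev`) is an assumption; this page does not say which estimator ElegantRL uses.

```python
from statistics import fmean, pstdev

# Hypothetical stored states, one row per step, state_dim = 2.
states = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]

columns = list(zip(*states))  # regroup the values by state dimension
state_avg = [fmean(col) for col in columns]
state_std = [pstdev(col) for col in columns]
print(state_avg)  # [2.0, 30.0]
```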

## `class PreprocessEnv(gym.Wrapper)`

### `__init__(self, env)` 
get environment information

- `reset(self)` return `state`

- `step(self, action)` return `(next_state, reward, done, dict)`


- `str env_name`: for example LunarLander-v2

- `int net_dim`: the dimension of networks 

- `int state_dim`: the dimension of state 

- `int action_dim`: the dimension of action (or the number of discrete actions)

- `int action_max`: the max range of continuous action. `action_max=1` for a discrete action space

- `int max_step`: the max step of an episode 

- `bool if_discrete`: discrete or continuous action space

- `float target_reward`: the goal score of this environment

- `array neg_state_avg`: for state normalization

- `array div_state_std`: for state normalization