<a href="https://colab.research.google.com/github/kuds/ElegantRL/blob/master/tutorial_helloworld_DQN_DDPG_PPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo: ElegantRL_HelloWorld_tutorial (DQN --> DDPG --> PPO)

We suggest following this order to learn RL quickly:
- DQN (Deep Q Network), a basic RL algorithm for discrete action spaces.
- DDPG (Deep Deterministic Policy Gradient), a basic RL algorithm for continuous action spaces.
- PPO (Proximal Policy Optimization), a widely used RL algorithm for continuous action spaces.

If you have suggestions about the ElegantRL Helloworld, please discuss them in [ElegantRL issues/135: Suggestions for elegant_helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/issues/135); we will keep an eye on this issue.
ElegantRL's code, especially the Helloworld, needs a lot of feedback to improve.

## **Part 1: Install ElegantRL**

In [1]:
# install elegantrl library
!pip install git+https://github.com/AI4Finance-LLC/ElegantRL.git

Collecting git+https://github.com/AI4Finance-LLC/ElegantRL.git
  Cloning https://github.com/AI4Finance-LLC/ElegantRL.git to /tmp/pip-req-build-pe3o7570
  Running command git clone --filter=blob:none --quiet https://github.com/AI4Finance-LLC/ElegantRL.git /tmp/pip-req-build-pe3o7570
  Resolved https://github.com/AI4Finance-LLC/ElegantRL.git to commit 37aac1f592e1add9f9fd37ae8db1094656009b76
  Preparing metadata (setup.py) ... done



## **Part 2: Import ElegantRL helloworld**

We hope the `ElegantRL Helloworld` helps people who want to learn reinforcement learning to quickly run a few introductory examples.
- **Fewer lines of code** (fewer than 1000 lines).
- **Fewer package requirements** (only `torch` and `gym`).
- **A style consistent with the full version of ElegantRL**.

![File_structure of ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL/raw/master/figs/File_structure.png)

One-sentence summary: an agent (`agent.py`) with Actor-Critic networks (`net.py`) is trained (`run.py`) by interacting with an environment (`env.py`).
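
The one-sentence summary above can be sketched as a tiny interaction loop. This is not ElegantRL code: `ToyEnv`, `RandomAgent`, and `run_episode` are hypothetical stand-ins for the roles of `env.py`, `agent.py`, and `run.py`, and a real agent would also update its networks from the collected transitions.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: start at 0 and try to walk to +3."""

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reached_goal = self.position >= 3
        done = reached_goal or self.steps >= 20  # cap the episode at 20 steps
        reward = 1.0 if reached_goal else -0.1
        return self.position, reward, done


class RandomAgent:
    """Stand-in for a learning agent: picks actions uniformly at random."""

    def __init__(self, action_dim, seed=0):
        self.action_dim = action_dim
        self.rng = random.Random(seed)

    def select_action(self, state):
        return self.rng.randrange(self.action_dim)


def run_episode(env, agent):
    """Roll out one episode and return (cumulative reward, number of steps)."""
    state = env.reset()
    episode_return, steps, done = 0.0, 0, False
    while not done:
        action = agent.select_action(state)
        state, reward, done = env.step(action)
        episode_return += reward
        steps += 1
    return episode_return, steps


episode_return, steps = run_episode(ToyEnv(), RandomAgent(action_dim=2))
print(f"return={episode_return:.1f}, steps={steps}")
```

The training loop in the real `run.py` has the same shape, with the random choice replaced by a policy network and a learning step added after data collection.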


In this tutorial, we only need the [helloworld](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/helloworld) directory, which the following code downloads.

The files in `elegantrl_helloworld` include:
`config.py`, `agent.py`, `net.py`, `env.py`, `run.py`
(the import cells in Parts 2 to 4 use `erl_`-prefixed module names such as `erl_config`, `erl_agent`, `erl_env`, and `erl_run`).

In [2]:
!rm -r -f /content/ElegantRL/
!rm -r -f /content/elegantrl_helloworld
!git clone https://github.com/kuds/ElegantRL.git
!mv /content/ElegantRL/helloworld /content/elegantrl_helloworld

Cloning into 'ElegantRL'...
remote: Enumerating objects: 17322, done.
remote: Counting objects: 100% (2421/2421), done.
remote: Compressing objects: 100% (913/913), done.
remote: Total 17322 (delta 1575), reused 2038 (delta 1430), pack-reused 14901 (from 4)
Receiving objects: 100% (17322/17322), 115.62 MiB | 24.01 MiB/s, done.
Resolving deltas: 100% (11179/11179), done.


In [3]:
from elegantrl_helloworld.erl_run import train_agent, valid_agent
from elegantrl_helloworld.erl_config import Config, get_gym_env_args

## **Part 3: Train DQN on a discrete action space task.**

Train DQN on [**Discrete action** space task `CartPole`](https://gym.openai.com/envs/CartPole-v1/)

See [/helloworld/erl_config.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/helloworld/erl_config.py) for more information about the hyperparameters.

```
class Arguments:
    def __init__(self, agent_class, env_func=None, env_args=None):
        self.env_num = self.env_args['env_num']  # env_num = 1. In vector env, env_num > 1.
        self.max_step = self.env_args['max_step']  # the max step of an episode
        self.env_name = self.env_args['env_name']  # the env name. Be used to set 'cwd'.
        self.state_dim = self.env_args['state_dim']  # vector dimension (feature number) of state
        self.action_dim = self.env_args['action_dim']  # vector dimension (feature number) of action
        self.if_discrete = self.env_args['if_discrete']  # discrete or continuous action space
        ...
```

In [4]:
from elegantrl_helloworld.erl_config import Config
from elegantrl_helloworld.erl_agent import AgentDQN
agent_class = AgentDQN
env_name = "CartPole-v1"

import gymnasium as gym
# gym.logger.min_level(40)  # Block warning
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)

# Set attributes from env_args on the args object
args.state_dim = env_args['state_dim']
args.action_dim = env_args['action_dim']
args.if_discrete = env_args['if_discrete']
# args.env_num = env_args['env_num'] # Assuming env_num is in env_args based on LunarLander example output


'''reward shaping'''
args.reward_scale = 2 ** 0  # scale rewards so that the target return is roughly 256
args.gamma = 0.97  # discount factor of future rewards

'''network update'''
args.net_dim = 2 ** 7  # width of the hidden layers of the fully connected network
args.num_layer = 3  # number of layers of the multilayer perceptron, `assert num_layer >= 2`
args.batch_size = 2 ** 7  # number of transitions sampled from the replay buffer
args.repeat_times = 2 ** 0  # how many times to repeat network updates on the replay buffer, to keep the critic's loss small
args.explore_rate = 0.25  # epsilon of epsilon-greedy exploration

'''evaluate'''
args.eval_gap = 2 ** 5  # interval between two evaluations
args.eval_times = 2 ** 3  # number of evaluation episodes used to estimate the episode return
args.break_step = int(8e4)  # stop training when `total_step > break_step`

env_args = {'env_name': 'CartPole-v1',
            'state_dim': 4,
            'action_dim': np.int64(2),
            'if_discrete': True}
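
`args.explore_rate = 0.25` in the cell above controls epsilon-greedy exploration. Here is a minimal sketch of that rule in its textbook form; how `AgentDQN` implements it internally is an assumption, and the Q-values are made up for illustration.

```python
import random


def epsilon_greedy(q_values, explore_rate, rng):
    """Pick a random action with probability `explore_rate`, else the greedy one."""
    if rng.random() < explore_rate:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5]  # hypothetical Q-values for CartPole's two actions
actions = [epsilon_greedy(q_values, explore_rate=0.25, rng=rng) for _ in range(10_000)]
greedy_fraction = actions.count(1) / len(actions)
print(f"greedy action chosen {greedy_fraction:.1%} of the time")
```

With two actions, the greedy action is expected about 1 - 0.25 + 0.25 / 2 = 87.5% of the time, because a random draw also lands on it half the time.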


Select GPU `0` with `args.learner_gpus = 0`. If it is set to `-1` or no GPU is available, the training program falls back to the CPU automatically.

- The cumulative return of CartPole-v0 is ∈ (0, (1, 195), 200). The cell below trains on CartPole-v1, whose episodes are capped at 500 steps with a target score of 475.
- The cumulative return of task_name is ∈ (min score, (score of random action, target score), max score).
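
The cumulative return quoted above is the plain sum of rewards in an episode, while training discounts future rewards with `args.gamma`. The sketch below contrasts the two for a hypothetical 200-step episode that earns +1 per step, using the `gamma = 0.97` value from the CartPole cell.

```python
def cumulative_return(rewards):
    """Undiscounted sum of rewards in one episode."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0] * 200  # hypothetical episode: +1 reward on each of 200 steps
print(cumulative_return(rewards))
print(round(discounted_return(rewards, gamma=0.97), 2))
```

With constant rewards of 1, the discounted return can never exceed 1 / (1 - 0.97) ≈ 33.3, which gives a feel for the scale of Q-values the critic has to fit.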

In [5]:
# args.learner_gpus = -1

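train_agent(args)
# NOTE: this call fails with the TypeError shown below. In this helloworld version,
# valid_agent() also requires `env_args`, `net_dims`, `agent_class`, and `actor_path`.
valid_agent(env, args)
print('| The cumulative returns of CartPole-v0 is ∈ (0, (1, 195), 200)')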

| Arguments Remove cwd: ./CartPole-v1_DQN_0
| Evaluator:
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
|     step      time  |     avgR    stdR    avgS  |     objC      objA
| 1.02e+04        12  |     9.38    0.48       9  |     0.15      2.99
| 2.05e+04        23  |    11.25    1.20      11  |     1.12     17.66
| 3.07e+04        39  |     9.75    0.43      10  |     1.53     26.96
| 4.10e+04        57  |     9.62    0.48      10  |     4.91     74.76
| 5.12e+04        80  |    10.12    0.60     

TypeError: valid_agent() missing 4 required positional arguments: 'env_args', 'net_dims', 'agent_class', and 'actor_path'
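
The `avgR`, `stdR`, and `avgS` columns of the evaluator log above summarize a batch of evaluation episodes. Here is a minimal sketch of how such a row could be computed. The episode data is hypothetical, and using the population standard deviation is an assumption about the evaluator.

```python
import statistics


def summarize_episodes(episode_returns, episode_steps):
    """Return (avgR, stdR, avgS) for a batch of evaluation episodes."""
    avg_r = statistics.fmean(episode_returns)
    std_r = statistics.pstdev(episode_returns)  # population std (assumption)
    avg_s = statistics.fmean(episode_steps)
    return avg_r, std_r, avg_s


# Hypothetical results of 8 evaluation episodes (eval_times = 2 ** 3).
returns = [9.0, 10.0, 9.0, 9.0, 10.0, 9.0, 10.0, 9.0]
steps = [9, 10, 9, 9, 10, 9, 10, 9]
avg_r, std_r, avg_s = summarize_episodes(returns, steps)
print(f"avgR={avg_r:.2f} stdR={std_r:.2f} avgS={avg_s:.0f}")
```

In CartPole every step earns +1, so the return of an episode equals its length. This is why `avgR` and `avgS` track each other in the log.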

Train DQN on [**Discrete action** space env `LunarLander`](https://gym.openai.com/envs/LunarLander-v2/)

**You can skip the cell below**, because DQN takes over 6000 seconds to train, which is too slow. (DuelingDoubleDQN takes less than 1000 seconds to train on the LunarLander-v2 task.)

There are many other DQN variants that reach higher cumulative returns in less training time. See [examples/demo_DQN_Dueling_Double_DQN.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/examples/demo_DQN_Dueling_Double_DQN.py)
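
To illustrate how one such variant differs from plain DQN, the sketch below compares the textbook DQN target with the textbook Double DQN target for a single transition. The Q-values are hypothetical, and ElegantRL's own Double DQN agent may be implemented differently.

```python
def dqn_target(reward, done, next_q_target, gamma):
    """Vanilla DQN: bootstrap from the max of the target network's Q-values."""
    return reward + gamma * (0.0 if done else max(next_q_target))


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


next_q_online = [1.0, 0.5]  # hypothetical online-network Q-values at the next state
next_q_target = [0.8, 2.0]  # hypothetical target-network Q-values; action 1 is overestimated
y_dqn = dqn_target(1.0, False, next_q_target, gamma=0.99)
y_double = double_dqn_target(1.0, False, next_q_online, next_q_target, gamma=0.99)
print(y_dqn, y_double)
```

Decoupling action selection from action evaluation keeps a single overestimated Q-value from dominating the target. This is one reason such variants can train faster than plain DQN.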

In [None]:
from elegantrl_helloworld.erl_agent import AgentDQN
agent_class = AgentDQN
env_name = "LunarLander-v3"

import gymnasium as gym  # same package as in the CartPole cell; it registers `LunarLander-v3`
gym.logger.min_level = 40  # block warnings
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)  # `Config` is the class imported from `erl_config`

'''reward shaping'''
args.reward_scale = 2 ** 0
args.gamma = 0.99

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 7
args.num_layer = 3

args.batch_size = 2 ** 6

args.repeat_times = 2 ** 0
args.explore_noise = 0.125

'''evaluate'''
args.eval_gap = 2 ** 7
args.eval_times = 2 ** 4
args.break_step = int(4e5)  # LunarLander needs a larger `break_step`

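args.learner_gpus = -1  # use the CPU
train_agent(args)
# In this helloworld version, valid_agent() also requires `env_args`, `net_dims`,
# `agent_class`, and `actor_path` (see the TypeError in Part 3), so `valid_agent(args)` would fail.
# valid_agent(args)
print(f'| The cumulative returns of {env_name} is ∈ (-1800, (-600, 200), 340)')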

## **Part 4: Train DDPG on a continuous action space task.**

Train DDPG on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/)

We show a custom env, [`class PendulumEnv` in helloworld/erl_env.py](https://github.com/AI4Finance-Foundation/ElegantRL/blob/master/helloworld/erl_env.py#L19-L23).

The OpenAI Pendulum env sets its action space to (-2, +2), which is awkward because policy networks typically output actions in (-1, +1). We suggest scaling the action space to (-1, +1) when designing your own env.
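
A minimal sketch of the rescaling idea: the policy works in (-1, +1) and the environment wrapper maps that to the native range. `scale_action` is a hypothetical helper, not the code of `PendulumEnv`, which may handle this differently.

```python
def scale_action(normalized_action, low, high):
    """Map an action from [-1, +1] to the environment's native range [low, high]."""
    clipped = max(-1.0, min(1.0, normalized_action))
    return low + (clipped + 1.0) * 0.5 * (high - low)


# Pendulum's native torque range is (-2, +2), as described above.
torque = scale_action(0.5, low=-2.0, high=2.0)
print(torque)
```

Keeping the agent's side of the interface at (-1, +1) means the same actor network and exploration-noise settings can be reused across environments.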


In [None]:
from elegantrl_helloworld.erl_config import Config, get_gym_env_args
from elegantrl_helloworld.erl_run import train_agent, valid_agent
from elegantrl_helloworld.erl_env import PendulumEnv
from elegantrl_helloworld.erl_agent import AgentDDPG
agent_class = AgentDDPG

env_name = "Pendulum-v1"
env = PendulumEnv(env_name)  # PendulumEnv('Pendulum-v1')
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)  # `Config` is the class imported above

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 2
args.net_dim = 2 ** 7
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 0
args.explore_noise = 0.1

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(1e5)

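args.learner_gpus = -1  # use the CPU
train_agent(args)
# In this helloworld version, valid_agent() also requires `env_args`, `net_dims`,
# `agent_class`, and `actor_path` (see the TypeError in Part 3), so `valid_agent(args)` would fail.
# valid_agent(args)
print(f'| The cumulative returns of {env_name} is ∈ (-1600, (-1400, -200), 0)')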
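
`args.explore_noise = 0.1` in the cell above sets the scale of the exploration noise. A common choice for DDPG-style agents is to add zero-mean Gaussian noise to the deterministic action and clip the result to (-1, +1). That choice is assumed here rather than taken from `AgentDDPG`, and `noisy_action` is a hypothetical helper.

```python
import random


def noisy_action(policy_action, explore_noise, rng):
    """Add zero-mean Gaussian noise to a deterministic action and clip to [-1, +1]."""
    noise = rng.gauss(0.0, explore_noise)
    return max(-1.0, min(1.0, policy_action + noise))


rng = random.Random(0)
# Hypothetical actor output of 0.95, perturbed 1000 times.
samples = [noisy_action(0.95, explore_noise=0.1, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Clipping matters near the bounds: with an actor output of 0.95, a noticeable share of the noisy samples would otherwise leave the valid action range.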

## **Part 5: Train PPO on a continuous action space task.**

Train PPO on [**Continuous action** space env `Pendulum`](https://gym.openai.com/envs/Pendulum-v0/).


In [None]:
from elegantrl_helloworld.erl_config import Config, get_gym_env_args
from elegantrl_helloworld.erl_run import train_agent, valid_agent
from elegantrl_helloworld.erl_env import PendulumEnv
from elegantrl_helloworld.erl_agent import AgentPPO
agent_class = AgentPPO

env = PendulumEnv()
env_func = PendulumEnv
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.97

'''network update'''
args.target_step = args.max_step * 8
args.net_dim = 2 ** 7
args.num_layer = 2
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 5

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 3
args.break_step = int(8e5)

args.learner_gpus = -1  # use the CPU
train_agent(args)
# In this helloworld version, valid_agent() also requires `env_args`, `net_dims`,
# `agent_class`, and `actor_path` (see the TypeError in Part 3), so `valid_agent(args)` would fail.
# valid_agent(args)
print('| The cumulative returns of Pendulum-v1 is ∈ (-1600, (-1400, -200), 0)')
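
PPO reuses each batch of collected transitions for several update passes (`args.repeat_times` in the cell above), and its clipped surrogate objective is what keeps those repeated updates from moving the policy too far. The sketch below evaluates that objective on a hypothetical batch. The clip range of 0.2 comes from the PPO paper and is an assumption about `AgentPPO`'s default.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


# Hypothetical probability ratios pi_new(a|s) / pi_old(a|s) and advantage estimates.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
objective = ppo_clip_objective(ratios, advantages)
print(objective)
```

The first sample is capped at 1.2 × 1.0, so raising the probability of that action further earns nothing. The second takes the more pessimistic clipped value of 0.8 × (-1.0).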

Train PPO on [**Continuous action** space env `LunarLanderContinuous`](https://gym.openai.com/envs/LunarLanderContinuous-v2/)

In [None]:
from elegantrl_helloworld.erl_config import Config, get_gym_env_args
from elegantrl_helloworld.erl_run import train_agent, valid_agent
from elegantrl_helloworld.erl_agent import AgentPPO
agent_class = AgentPPO
env_name = "LunarLanderContinuous-v2"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)

'''reward shaping'''
args.gamma = 0.99
args.reward_scale = 2 ** -1

'''network update'''
args.target_step = args.max_step * 8
args.num_layer = 3
args.batch_size = 2 ** 7
args.repeat_times = 2 ** 4
args.lambda_entropy = 0.04

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 5
args.break_step = int(4e5)

args.learner_gpus = -1  # use the CPU
train_agent(args)
# In this helloworld version, valid_agent() also requires `env_args`, `net_dims`,
# `agent_class`, and `actor_path` (see the TypeError in Part 3), so `valid_agent(args)` would fail.
# valid_agent(args)
print('| The cumulative returns of LunarLanderContinuous-v2 is ∈ (-1800, (-300, 200), 310+)')

Train PPO on [**Continuous action** space env `BipedalWalker`](https://gym.openai.com/envs/BipedalWalker-v2/)

In [None]:
from elegantrl_helloworld.erl_config import Config, get_gym_env_args
from elegantrl_helloworld.erl_run import train_agent, valid_agent
from elegantrl_helloworld.erl_agent import AgentPPO
agent_class = AgentPPO
env_name = "BipedalWalker-v3"

import gym
env = gym.make(env_name)
env_func = gym.make
env_args = get_gym_env_args(env, if_print=True)

args = Config(agent_class, env_func, env_args)

'''reward shaping'''
args.reward_scale = 2 ** -1
args.gamma = 0.98

'''network update'''
args.target_step = args.max_step
args.net_dim = 2 ** 8
args.num_layer = 3
args.batch_size = 2 ** 8
args.repeat_times = 2 ** 4

'''evaluate'''
args.eval_gap = 2 ** 6
args.eval_times = 2 ** 4
args.break_step = int(1e6)

args.learner_gpus = -1  # use the CPU
train_agent(args)
# In this helloworld version, valid_agent() also requires `env_args`, `net_dims`,
# `agent_class`, and `actor_path` (see the TypeError in Part 3), so `valid_agent(args)` would fail.
# valid_agent(args)
print('| The cumulative returns of BipedalWalker-v3 is ∈ (-150, (-100, 280), 320+)')
