# **CartPole-v1 Example in ElegantRL-HelloWorld**

# **Part 1: Install ElegantRL**

# **Part 2: Specify Environment and Agent**

*   **agent**: chooses an agent (DRL algorithm) from the set of agents in the [agents directory](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl/agents).
*   **env**: creates an environment for your agent.


In [1]:
import gymnasium as gym
import os

In [2]:
from elegantrl.train.config import Config
from elegantrl.agents.AgentPPO import AgentDiscretePPO

env = gym.make

env_args = {
    "id": "CartPole-v1",
    "env_name": "CartPole-v1",
    "num_envs": 1,
    "max_step": 1000,
    "state_dim": 4,
    "action_dim": 2,
    "if_discrete": True,
    "reward_scale": 2**-1,
}
args = Config(AgentDiscretePPO, env_class=env, env_args=env_args)
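The values `state_dim`, `action_dim`, and `if_discrete` in `env_args` mirror the environment's observation and action spaces. The sketch below shows one way they could be read off an environment instead of being typed by hand. The helper name `infer_dims` is hypothetical and not part of ElegantRL, and it assumes Gymnasium-style spaces where a `Discrete` action space exposes `n` and a `Box` space exposes `shape`. A stand-in object is used so the snippet runs without creating a real environment.

```python
from types import SimpleNamespace


def infer_dims(env):
    """Return (state_dim, action_dim, if_discrete) for an env with Gymnasium-style spaces.

    Hypothetical helper: handles only flat Box observations and
    Discrete or flat Box action spaces.
    """
    state_dim = env.observation_space.shape[0]
    # Assumption: only Discrete-like action spaces expose an `n` attribute.
    if_discrete = hasattr(env.action_space, "n")
    if if_discrete:
        action_dim = int(env.action_space.n)
    else:
        action_dim = env.action_space.shape[0]
    return state_dim, action_dim, if_discrete


# Stand-in mimicking CartPole-v1: 4 observation values, 2 discrete actions.
fake_cartpole = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(4,)),
    action_space=SimpleNamespace(n=2),
)

# Stand-in mimicking a continuous task: 3 observation values, 1 continuous action.
fake_continuous = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(3,)),
    action_space=SimpleNamespace(shape=(1,)),
)

cartpole_dims = infer_dims(fake_cartpole)
continuous_dims = infer_dims(fake_continuous)
print(cartpole_dims, continuous_dims)
```

With a real environment, the same helper could be called as `infer_dims(gym.make("CartPole-v1"))`, and the result compared against the hand-written entries in `env_args`.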

# **Part 3: Specify hyper-parameters**
A list of hyper-parameters is available [here](https://elegantrl.readthedocs.io/en/latest/api/config.html).

In [3]:
args.max_step = 1000
args.reward_scale = 2**-1  # scale rewards by 0.5; CartPole-v1 gives a reward of +1 per step
args.gamma = 0.97
args.target_step = args.max_step
args.eval_times = 2**3
args.num_workers = 16  # number of rollout workers; more workers improve GPU utilization
args.num_threads = 2
args.gpu_id = 1
args.break_step = 1e6  # stop training after 1M steps
args.net_dims = [64, 64]  # hidden-layer widths of the MLP (multilayer perceptron)
args.learning_rate = 3e-4  # learning rate for network updates

# args.learner_gpu_ids=[0,1]
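To give a feel for what `gamma = 0.97` and `reward_scale = 2**-1` mean together, the sketch below computes a discounted return for an illustrative 500-step episode with a reward of +1 per step. The function `discounted_return` is a hypothetical helper, and applying the scale to each reward before discounting is an assumption about how the two settings interact, not a statement about ElegantRL's internals.

```python
def discounted_return(rewards, gamma, reward_scale=1.0):
    """Sum of scaled rewards, each discounted by gamma per step into the future."""
    total = 0.0
    # Walk backwards so each step's return is r_t + gamma * (return of step t+1).
    for r in reversed(rewards):
        total = r * reward_scale + gamma * total
    return total


gamma = 0.97
reward_scale = 2 ** -1
rewards = [1.0] * 500  # illustrative episode: +1 reward at every step

scaled_return = discounted_return(rewards, gamma, reward_scale)
effective_horizon = 1.0 / (1.0 - gamma)  # rough number of steps that matter

print(f"discounted return: {scaled_return:.2f}")
print(f"effective horizon: {effective_horizon:.1f} steps")
```

With these settings the agent effectively looks about 33 steps ahead, so rewards far beyond that contribute very little to the return it optimizes.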

# **Part 4: Train and Evaluate the Agent**
In [4]:
from elegantrl.train.run import train_agent

train_agent(args)

| train_agent_multiprocessing() with GPU_ID 1
| Arguments Remove cwd: runs/CartPole-v1_DiscretePPO_0
| Evaluator:
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
################################################################################
ID     Step    Time |    avgR   stdR   avgS  stdS |    expR   objC   objA   etc.
1  6.55e+04     162 |  396.00  133.5    396   134 |   -0.68   8.50  -0.18   0.59 0.5915502104908228
1  1.31e+05     208 |  404.12  107.1    404   107 |   -0.53   3.57  -0.11   0.48 

Understanding the above results:
*   **Step**: the total number of training steps, i.e. the number of calls to `env.step()`.
*   **MaxR**: the maximum cumulative reward; this column is not printed in the log above.
*   **avgR**: the average cumulative reward per episode (the episode return).
*   **stdR**: the standard deviation of the cumulative reward per episode.
*   **objA**: the objective function value of the Actor Network (Policy Network).
*   **objC**: the objective function value (loss) of the Critic Network (Value Network).
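As a rough illustration of how columns such as `avgR`, `stdR`, `avgS`, and `stdS` could be derived, the sketch below summarizes a handful of evaluation episodes. The episode data is made up for illustration, `summarize` is a hypothetical helper, and using the population standard deviation is an assumption rather than a confirmed detail of ElegantRL's evaluator.

```python
import statistics


def summarize(values):
    """Return (mean, population standard deviation) of a sequence of numbers."""
    return statistics.fmean(values), statistics.pstdev(values)


# Hypothetical evaluation episodes. In CartPole-v1 every step yields +1,
# so an episode's cumulative reward equals its number of steps.
episode_steps = [500, 312, 441, 287]
episode_returns = [float(s) for s in episode_steps]

avgR, stdR = summarize(episode_returns)
avgS, stdS = summarize(episode_steps)

print(f"avgR={avgR:.2f} stdR={stdR:.1f} avgS={avgS:.0f} stdS={stdS:.0f}")
```

This also explains why `avgR` and `avgS` coincide in the log above (396.00 and 396): with a reward of +1 per step, the return of an episode is simply its length.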