# **Pendulum-v1 Example in ElegantRL-HelloWorld**

# **Part 1: Install ElegantRL**

# **Part 2: Specify Environment and Agent**

*   **agent**: chooses an agent (DRL algorithm) from the set of agents in the [directory](https://github.com/AI4Finance-Foundation/ElegantRL/tree/master/elegantrl/agents).
*   **env**: creates an environment for your agent.


In [1]:
import gymnasium as gym

In [2]:
from elegantrl.train.config import Config
from elegantrl.agents.AgentSAC import AgentSAC

env = gym.make

env_args = {
    "id": "Pendulum-v1",
    "env_name": "Pendulum-v1",
    "num_envs": 1,
    "max_step": 1000,
    "state_dim": 3,
    "action_dim": 1,
    "if_discrete": False,
    "reward_scale": 2**-1,
    "gpu_id": 0,  # index of the GPU to use, if you have one
}
args = Config(AgentSAC, env_class=env, env_args=env_args)
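
The values `state_dim=3`, `action_dim=1`, and `if_discrete=False` above mirror Pendulum-v1's observation and action spaces. As a minimal sketch, the hypothetical helper `infer_env_args` below (not part of ElegantRL) reads the same fields off any object that exposes Gymnasium-style `observation_space` and `action_space` attributes. A small stand-in object is used so the snippet runs without Gymnasium installed; its shapes are an assumption chosen to match the dictionary above.

```python
from types import SimpleNamespace


def infer_env_args(env, env_name, max_step, num_envs=1):
    """Hypothetical helper (not an ElegantRL API): build an env_args-style dict."""
    action_space = env.action_space
    # Gymnasium's Discrete space exposes `.n`; Box spaces expose `.shape` instead.
    if_discrete = hasattr(action_space, "n")
    action_dim = action_space.n if if_discrete else action_space.shape[0]
    return {
        "id": env_name,
        "env_name": env_name,
        "num_envs": num_envs,
        "max_step": max_step,
        "state_dim": env.observation_space.shape[0],
        "action_dim": action_dim,
        "if_discrete": if_discrete,
    }


# Stand-in for gym.make("Pendulum-v1"): only the two space attributes are mimicked.
fake_pendulum = SimpleNamespace(
    observation_space=SimpleNamespace(shape=(3,)),
    action_space=SimpleNamespace(shape=(1,)),
)
inferred = infer_env_args(fake_pendulum, "Pendulum-v1", max_step=1000)
print(inferred)
```

With Gymnasium installed, passing `gym.make("Pendulum-v1")` in place of the stand-in should yield the same dimensions.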

# **Part 3: Specify hyper-parameters**
A list of hyper-parameters is available [here](https://elegantrl.readthedocs.io/en/latest/api/config.html).

In [3]:
args.max_step = 1000
args.reward_scale = 2**-1  # RewardRange: -1800 < -200 < -50 < 0
args.gamma = 0.99
args.target_step = args.max_step
args.eval_times = 2**3
args.num_workers = 16  # rollout workers; improves GPU utilization
args.num_threads = 2
# args.learner_gpu_ids = [0, 1]
args.net_dims = [64, 64]  # the hidden layer dimensions of the MLP (multilayer perceptron)
args.learning_rate = 3e-4  # the learning rate for network updates
args.use_tensorboard = True
args.scheduler_name = 'WarmupCosineLR'
args.scheduler_args = {
    'warmup_steps': args.break_step * 0.1,
    'max_steps': args.break_step,
    'warmup_start_factor': 0.01,
    'lr_min': 0.001 * args.learning_rate,
}
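
The `scheduler_args` above describe a learning-rate schedule with a linear warmup followed by cosine decay. The sketch below is a generic warmup-plus-cosine formula, not ElegantRL's `WarmupCosineLR` implementation, which may differ in detail. The function name `warmup_cosine_lr` and the value `break_step = 100_000` are assumptions for illustration; the remaining numbers follow the cell above.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, max_steps,
                     warmup_start_factor, lr_min):
    """Generic schedule: linear warmup to base_lr, then cosine decay to lr_min."""
    if step < warmup_steps:
        # Scale factor grows linearly from warmup_start_factor to 1.
        factor = warmup_start_factor + (1.0 - warmup_start_factor) * step / warmup_steps
        return base_lr * factor
    # Fraction of the decay phase completed, clamped to 1 past max_steps.
    progress = min(1.0, (step - warmup_steps) / max(1.0, max_steps - warmup_steps))
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))


break_step = 100_000    # assumed value; the notebook uses args.break_step
learning_rate = 3e-4    # matches args.learning_rate above
schedule = dict(
    base_lr=learning_rate,
    warmup_steps=break_step * 0.1,
    max_steps=break_step,
    warmup_start_factor=0.01,
    lr_min=0.001 * learning_rate,
)

for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, warmup_cosine_lr(step, **schedule))
```

Under these assumptions the rate starts at 1% of `learning_rate`, reaches `learning_rate` when warmup ends, and decays to `lr_min` at `max_steps`.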

# **Part 4: Train and Evaluate the Agent**

In [4]:
from elegantrl.train.run import train_agent

train_agent(args)

| train_agent_multiprocessing() with GPU_ID 0
| Arguments Remove cwd: runs/Pendulum-v1_SAC_5
| Evaluator:
| `step`: Number of samples, or total training steps, or running times of `env.step()`.
| `time`: Time spent from the start of training to this moment.
| `avgR`: Average value of cumulative rewards, which is the sum of rewards in an episode.
| `stdR`: Standard dev of cumulative rewards, which is the sum of rewards in an episode.
| `avgS`: Average of steps in an episode.
| `objC`: Objective of Critic network. Or call it loss function of critic network.
| `objA`: Objective of Actor network. It is the average Q value of the critic network.
################################################################################
ID     Step    Time |    avgR   stdR   avgS  stdS |    expR   objC   objA   etc.
0  1.64e+04      86 |-1230.53  275.9    200     0 |   -3.28  13.84   0.01   0.37   0.00 
0  5.73e+04     100 |-1252.14  383.3    200     0 |   -3.32  13.19  -0.00   0.35   0.00 
0  8.19e+04

KeyboardInterrupt: 

0  5.98e+05     807 |-1160.05   78.7    200     0 |   -2.77   5.48 -191.44   1.12   0.00 



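Understanding the above results:
*   **Step**: the total number of training steps, i.e. the number of calls to `env.step()`.
*   **Time**: the time elapsed since the start of training.
*   **MaxR**: the maximum reward (not a column in the log above).
*   **avgR**: the average cumulative reward, where the cumulative reward is the sum of rewards in an episode.
*   **stdR**: the standard deviation of the cumulative reward.
*   **avgS**: the average number of steps in an episode.
*   **objC**: the objective function value (loss) of the Critic Network (Value Network).
*   **objA**: the objective function value of the Actor Network (Policy Network), which is the average Q-value given by the critic network.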
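
To work with the evaluator table programmatically, each row can be split into the columns described above. The sketch below is a hypothetical parser (`parse_log_row` is not an ElegantRL function); it assumes the column order of the header line `ID Step Time | avgR stdR avgS stdS | expR objC objA etc.` and uses a regular expression so that values printed without a separating space, such as `5.48-191.44`, are still read as two numbers.

```python
import re

# Column names taken from the header line of the evaluator table.
COLUMNS = ["ID", "Step", "Time", "avgR", "stdR", "avgS", "stdS",
           "expR", "objC", "objA"]
# Optional sign, digits, optional fraction, optional exponent (e.g. 5.98e+05).
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:e[+-]?\d+)?")


def parse_log_row(line):
    """Hypothetical helper: map one table row to a dict of floats."""
    values = [float(tok) for tok in NUMBER.findall(line)]
    if len(values) < len(COLUMNS):
        raise ValueError("incomplete row: expected at least %d values" % len(COLUMNS))
    row = dict(zip(COLUMNS, values))
    row["etc"] = values[len(COLUMNS):]  # trailing, unnamed columns
    return row


line = ("0  5.98e+05     807 |-1160.05   78.7    200     0 |"
        "   -2.77   5.48-191.44   1.12   0.00")
row = parse_log_row(line)
print(row["Step"], row["avgR"], row["objC"], row["objA"])
```

A row that was cut off mid-print, such as `0  8.19e+04`, raises `ValueError` instead of producing a partially filled dictionary.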