# Stable Baselines3 Tutorial - Callbacks

This tutorial is based on, and partly translated from, the official tutorials and example code for the Stable Baselines3 library.

Source: [https://github.com/Stable-Baselines-Team/rl-colab-notebooks](https://github.com/Stable-Baselines-Team/rl-colab-notebooks)

Stable-Baselines3 GitHub: https://github.com/DLR-RM/stable-baselines3

Official documentation: https://stable-baselines3.readthedocs.io/en/master/

Official documentation (Callbacks): https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html


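## What Is a Callback?

In general, a callback is a function that is called when a particular event or condition occurs.

In Stable Baselines3, callbacks are called at specific points during training and give you access to the internal state of the RL model while it trains.

This makes it possible to implement monitoring, automatic saving, model manipulation, progress bars, and more.

As with wrappers, you can also write your own custom callbacks.

This tutorial walks through the most commonly used callbacks.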

In [1]:
import stable_baselines3

## 1. CheckpointCallback

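`CheckpointCallback` saves the model periodically during training.

- The model is saved every `save_freq` steps. The count is in calls to the callback, so with a vectorized environment of `n_envs` copies this corresponds to `save_freq * n_envs` timesteps.
- `save_replay_buffer=True` also saves the replay buffer (relevant for off-policy algorithms such as SAC).
- `save_vecnormalize=True` also saves the `VecNormalize` statistics used to normalize observations and rewards, if the environment is wrapped in `VecNormalize`.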
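The effect of `save_freq` can be sketched in plain Python. The helper `checkpoint_files` below is hypothetical and is not part of Stable Baselines3. It assumes that the callback fires every `save_freq` calls, that each call advances training by `n_envs` timesteps, and that model files are named `<name_prefix>_<num_timesteps>_steps.zip`.

```python
def checkpoint_files(save_freq, total_timesteps, n_envs=1, name_prefix="rl_model"):
    """Sketch: list the model files a periodic checkpoint would write.

    Simplified model of the training loop, not library code.
    """
    files = []
    n_calls = 0        # number of times the callback has been called
    num_timesteps = 0  # environment steps summed over all parallel envs
    while num_timesteps < total_timesteps:
        n_calls += 1
        num_timesteps += n_envs
        if n_calls % save_freq == 0:
            files.append(f"{name_prefix}_{num_timesteps}_steps.zip")
    return files


# Single environment, as in this tutorial's example
print(checkpoint_files(save_freq=1000, total_timesteps=2000))
# ['rl_model_1000_steps.zip', 'rl_model_2000_steps.zip']

# Four parallel environments: the same save_freq now spans 4000 timesteps
print(checkpoint_files(save_freq=1000, total_timesteps=8000, n_envs=4))
# ['rl_model_4000_steps.zip', 'rl_model_8000_steps.zip']
```

To keep checkpoints roughly `save_freq` timesteps apart with several environments, a common adjustment is to pass `max(save_freq // n_envs, 1)` instead.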

In [2]:
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

log_dir = "./tmp/logs/checkpoint"

# Save a checkpoint every 1000 steps
checkpoint_callback = CheckpointCallback(
    save_freq=1000,
    save_path=log_dir,
    name_prefix="rl_model",
    save_replay_buffer=True,
    save_vecnormalize=True,
)

model = SAC("MlpPolicy", "Pendulum-v1")
model.learn(2000, callback=checkpoint_callback)

<stable_baselines3.sac.sac.SAC at 0x1340bb5d0>

## 2. EvalCallback

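`EvalCallback` evaluates the model periodically during training, using a separate evaluation environment.

- The model is evaluated every `eval_freq` steps.
- `best_model_save_path` sets the folder where the best-performing model so far is saved.
- `log_path` sets the folder where the evaluation results are saved.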

In [3]:
import gymnasium as gym

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.logger import configure


train_env = gym.make("Pendulum-v1")
# Separate evaluation env
eval_env = gym.make("Pendulum-v1")

model = SAC("MlpPolicy", train_env, verbose=1)

log_dir = "./tmp/logs/eval"

# View the logs with: tensorboard --logdir ./tmp/logs/eval (path relative to where this code was run)
new_logger = configure(log_dir, ["stdout", "csv", "tensorboard"])
model.set_logger(new_logger)

# Use deterministic actions for evaluation
eval_callback = EvalCallback(eval_env, best_model_save_path=log_dir,
                             log_path=log_dir, eval_freq=500, n_eval_episodes=5,
                             deterministic=True, render=False)

model.learn(total_timesteps=10000, callback=eval_callback)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to ./tmp/logs/eval
Eval num_timesteps=500, episode_reward=-1573.86 +/- 143.16
Episode length: 200.00 +/- 0.00
----------------------------------
| eval/              |           |
|    mean_ep_length  | 200       |
|    mean_reward     | -1.57e+03 |
| time/              |           |
|    total_timesteps | 500       |
| train/             |           |
|    actor_loss      | 15.5      |
|    critic_loss     | 0.454     |
|    ent_coef        | 0.888     |
|    ent_coef_loss   | -0.19     |
|    learning_rate   | 0.0003    |
|    n_updates       | 399       |
----------------------------------
New best mean reward!
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.53e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 176       |
|    time_elapsed    | 4       

<stable_baselines3.sac.sac.SAC at 0x158612910>

## 3. CallbackList

`CallbackList` combines several callbacks into one; they are called sequentially, in the order given.
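How a list of callbacks is combined can be modeled in a few lines of plain Python. This is a toy stand-in, not the library's implementation: `run_callbacks` and the three example functions are hypothetical names. The sketch assumes that every callback in the list is called at each step and that training stops as soon as at least one of them returns `False`.

```python
from typing import Callable, List


def run_callbacks(callbacks: List[Callable[[], bool]]) -> bool:
    """Call every callback in order; return True only if all returned True."""
    continue_training = True
    for callback in callbacks:
        # Call first, then combine, so later callbacks still run
        # even after an earlier one has requested a stop.
        continue_training = callback() and continue_training
    return continue_training


calls = []


def save_checkpoint() -> bool:
    calls.append("checkpoint")
    return True


def request_stop() -> bool:
    calls.append("stop")
    return False


def evaluate() -> bool:
    calls.append("eval")
    return True


print(run_callbacks([save_checkpoint, request_stop, evaluate]))  # False
print(calls)  # ['checkpoint', 'stop', 'eval']
```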

In [4]:
import gymnasium as gym

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

log_dir = "./tmp/logs/callback_list"

checkpoint_callback = CheckpointCallback(save_freq=1000, save_path=log_dir)
# Separate evaluation env
eval_env = gym.make("Pendulum-v1")
eval_callback = EvalCallback(eval_env, best_model_save_path=log_dir,
                             log_path=log_dir, eval_freq=500)
# Create the callback list
callback = CallbackList([checkpoint_callback, eval_callback])

model = SAC("MlpPolicy", "Pendulum-v1")
# Equivalent to:
# model.learn(5000, callback=[checkpoint_callback, eval_callback])
model.learn(5000, callback=callback)

Eval num_timesteps=500, episode_reward=-1562.98 +/- 187.27
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1000, episode_reward=-1413.46 +/- 51.22
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1500, episode_reward=-1299.74 +/- 63.15
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=2000, episode_reward=-1066.46 +/- 77.20
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=2500, episode_reward=-1237.29 +/- 39.78
Episode length: 200.00 +/- 0.00
Eval num_timesteps=3000, episode_reward=-292.39 +/- 136.51
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=3500, episode_reward=-148.00 +/- 90.11
Episode length: 200.00 +/- 0.00
New best mean reward!
Eval num_timesteps=4000, episode_reward=-166.98 +/- 56.91
Episode length: 200.00 +/- 0.00
Eval num_timesteps=4500, episode_reward=-269.00 +/- 93.46
Episode length: 200.00 +/- 0.00
Eval num_timesteps=5000, episode_reward=-163.59 +/- 

<stable_baselines3.sac.sac.SAC at 0x146a0e110>

## 4. StopTrainingOnNoModelImprovement

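`StopTrainingOnNoModelImprovement` stops training early when the model stops improving, that is, when the best mean evaluation reward has not increased for more than `max_no_improvement_evals` consecutive evaluations. It is used together with `EvalCallback`, to which it is passed via `callback_after_eval`.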
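The stopping rule can be illustrated with a small plain-Python sketch. It is a simplified model of the behavior, not the library's code: the helper `first_stop_evaluation` and the reward values are hypothetical. It assumes that evaluations are only checked once more than `min_evals` of them have run, and that training stops when the count of consecutive evaluations without a new best mean reward exceeds `max_no_improvement_evals`.

```python
def first_stop_evaluation(mean_rewards, max_no_improvement_evals, min_evals=0):
    """Return the 1-based index of the evaluation that stops training,
    or None if the rule never triggers."""
    best = float("-inf")       # best mean reward seen so far
    last_best = float("-inf")  # best mean reward as of the previous evaluation
    no_improvement = 0         # consecutive evaluations without a new best
    for n_evals, reward in enumerate(mean_rewards, start=1):
        best = max(best, reward)
        if n_evals > min_evals:
            if best > last_best:
                no_improvement = 0
            else:
                no_improvement += 1
                if no_improvement > max_no_improvement_evals:
                    return n_evals
        last_best = best
    return None


# Hypothetical mean rewards, one per evaluation (not measured results)
rewards = [-1500, -1200, -900, -950, -920, -910, -930, -940, -905, -990]

print(first_stop_evaluation(rewards, max_no_improvement_evals=3, min_evals=5))  # 9
print(first_stop_evaluation(rewards, max_no_improvement_evals=3))               # 7
```

With `min_evals=5`, the first five evaluations are never counted against the model, so the stop happens two evaluations later than without the warm-up.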

In [5]:
import gymnasium as gym

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnNoModelImprovement

# Separate evaluation env
eval_env = gym.make("Pendulum-v1")
# Stop training if there is no improvement after more than 3 evaluations
stop_train_callback = StopTrainingOnNoModelImprovement(max_no_improvement_evals=3, min_evals=5, verbose=1)
eval_callback = EvalCallback(eval_env, eval_freq=1000, callback_after_eval=stop_train_callback, verbose=1)

model = SAC("MlpPolicy", "Pendulum-v1", learning_rate=1e-3, verbose=1)
# Almost infinite number of timesteps, but the training will stop early
# as soon as the number of consecutive evaluations without model
# improvement is greater than 3
model.learn(int(1e10), callback=eval_callback)

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.57e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 217       |
|    time_elapsed    | 3         |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 27.2      |
|    critic_loss     | 0.0582    |
|    ent_coef        | 0.503     |
|    ent_coef_loss   | -1.08     |
|    learning_rate   | 0.001     |
|    n_updates       | 699       |
----------------------------------
Eval num_timesteps=1000, episode_reward=-1726.12 +/- 90.27
Episode length: 200.00 +/- 0.00
----------------------------------
| eval/              |           |
|    mean_ep_length  | 200       |
|    mean_reward     | -1.73e+03 |
| time/              |   

<stable_baselines3.sac.sac.SAC at 0x143fbf910>