# Training an SB3 Agent

© Crown-owned copyright 2025, Defence Science and Technology Laboratory UK

This notebook demonstrates how to use PrimAITE to create and train a Stable-Baselines3 (SB3) PPO agent from a pre-defined configuration file for the [UC2 scenario](./Data-Manipulation-E2E-Demonstration.ipynb).

#### First, we run the PrimAITE setup, import the initial packages, and read in our configuration file.

In [1]:
!primaite setup

2025-03-24 09:53:20,481: Performing the PrimAITE first-time setup...
2025-03-24 09:53:20,482: Building the PrimAITE app directories...
2025-03-24 09:53:20,482: Building primaite_config.yaml...
2025-03-24 09:53:20,482: Rebuilding the demo notebooks...
2025-03-24 09:53:20,504: Rebuilding the example notebooks...
2025-03-24 09:53:20,506: PrimAITE setup complete!


In [2]:
from primaite.session.environment import PrimaiteGymEnv
from primaite.config.load import data_manipulation_config_path
import yaml

In [3]:
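# Load the data manipulation (UC2) scenario configuration into a dictionary
with open(data_manipulation_config_path(), 'r') as f:
    cfg = yaml.safe_load(f)

# Enable flattened observations for the defender agent only
for agent in cfg['agents']:
    if agent['ref'] == 'defender':
        agent['agent_settings']['flatten_obs'] = True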
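
To make the effect of this loop concrete, here is a minimal sketch that runs it on a hand-written stand-in for the loaded configuration. The dictionary is hypothetical: the real file contains many more keys, and the `attacker` entry is invented purely for illustration. Only the `agents`, `ref`, `agent_settings`, and `flatten_obs` keys come from the cell above. The flag is presumably set so that the defender's observation arrives as a flat vector, which is the kind of input SB3's `MlpPolicy` is designed for.

```python
# Hypothetical, heavily trimmed stand-in for the dict returned by yaml.safe_load.
# The "attacker" entry is invented for illustration only.
toy_cfg = {
    "agents": [
        {"ref": "attacker", "agent_settings": {}},
        {"ref": "defender", "agent_settings": {"flatten_obs": False}},
    ]
}

# Same loop as in the notebook cell: only the agent whose ref is "defender" is changed.
for agent in toy_cfg["agents"]:
    if agent["ref"] == "defender":
        agent["agent_settings"]["flatten_obs"] = True

# Look the two agents up again to confirm what was (and was not) modified.
defender = next(a for a in toy_cfg["agents"] if a["ref"] == "defender")
attacker = next(a for a in toy_cfg["agents"] if a["ref"] == "attacker")

print(defender["agent_settings"])  # {'flatten_obs': True}
print(attacker["agent_settings"])  # {}
```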

Using the given configuration, we generate the environment our agent will train in.

In [4]:
gym = PrimaiteGymEnv(env_config=cfg)

2025-03-24 09:53:24,082: PrimaiteGymEnv RNG seed = None


Let's define the training parameters for the agent.

In [5]:
from stable_baselines3 import PPO

EPISODE_LEN = 128
NUM_EPISODES = 5
NO_STEPS = EPISODE_LEN * NUM_EPISODES
BATCH_SIZE = 32
LEARNING_RATE = 3e-4


In [6]:
model = PPO(
    'MlpPolicy',
    gym,
    learning_rate=LEARNING_RATE,
    n_steps=NO_STEPS,
    batch_size=BATCH_SIZE,
    verbose=0,
    tensorboard_log="./PPO_UC2/",
)
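
It can help to work through what these settings imply. Because `n_steps` is set to `NO_STEPS`, PPO fills a rollout buffer of 640 transitions before each policy update, then splits it into minibatches of `BATCH_SIZE`. The sketch below does the arithmetic. `N_ENVS = 1` and `N_EPOCHS = 10` are assumptions: the notebook passes a single environment and does not set `n_epochs`, so the value used here is SB3's documented default for PPO.

```python
EPISODE_LEN = 128
NUM_EPISODES = 5
NO_STEPS = EPISODE_LEN * NUM_EPISODES
BATCH_SIZE = 32

N_ENVS = 1     # assumption: a single, non-vectorised environment
N_EPOCHS = 10  # assumption: SB3's default PPO n_epochs (not set in the notebook)

# Size of the rollout buffer collected before each policy update.
buffer_size = NO_STEPS * N_ENVS

# Number of full minibatches per pass over the buffer, and any leftover samples.
minibatches_per_epoch, remainder = divmod(buffer_size, BATCH_SIZE)

# Gradient steps performed for each policy update.
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS

print(buffer_size)                       # 640
print(minibatches_per_epoch, remainder)  # 20 0
print(gradient_steps_per_update)         # 200
```

A remainder of zero means the buffer divides evenly into minibatches. Since the next cell trains for `NO_STEPS` timesteps in total, under these assumptions the whole run amounts to a single rollout followed by a single policy update.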

With the agent configured, let's train for our defined number of episodes.

In [7]:
model.learn(total_timesteps=NO_STEPS)

2025-03-24 09:53:27,615: Resetting environment, episode 0, avg. reward: 0.0


2025-03-24 09:53:27,617: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_0.json


2025-03-24 09:53:28,877: Resetting environment, episode 1, avg. reward: -28.5


2025-03-24 09:53:28,879: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_1.json


2025-03-24 09:53:30,061: Resetting environment, episode 2, avg. reward: -24.499999999999947


2025-03-24 09:53:30,063: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_2.json


2025-03-24 09:53:31,222: Resetting environment, episode 3, avg. reward: -24.49999999999998


2025-03-24 09:53:31,224: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_3.json


2025-03-24 09:53:32,364: Resetting environment, episode 4, avg. reward: -21.499999999999957


2025-03-24 09:53:32,366: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_4.json


2025-03-24 09:53:33,541: Resetting environment, episode 5, avg. reward: -55.85000000000007


2025-03-24 09:53:33,543: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_5.json


<stable_baselines3.ppo.ppo.PPO at 0x7f350cab6ad0>
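
If you want the logged rewards as numbers rather than text, the reset messages above are regular enough to parse. The following is a minimal sketch using only the standard library; the regular expression is written against the message format shown in this output and is an assumption about that format, not a PrimAITE API.

```python
import re

# Reset messages copied from the training output above.
log_text = """\
2025-03-24 09:53:27,615: Resetting environment, episode 0, avg. reward: 0.0
2025-03-24 09:53:28,877: Resetting environment, episode 1, avg. reward: -28.5
2025-03-24 09:53:30,061: Resetting environment, episode 2, avg. reward: -24.499999999999947
2025-03-24 09:53:31,222: Resetting environment, episode 3, avg. reward: -24.49999999999998
2025-03-24 09:53:32,364: Resetting environment, episode 4, avg. reward: -21.499999999999957
2025-03-24 09:53:33,541: Resetting environment, episode 5, avg. reward: -55.85000000000007
"""

# Capture the episode number and the reward value from each reset message.
pattern = re.compile(
    r"Resetting environment, episode (\d+), avg\. reward: (-?\d+(?:\.\d+)?)"
)

# Map the episode number in each message to the reward printed alongside it.
rewards = {int(ep): float(value) for ep, value in pattern.findall(log_text)}

for ep, value in rewards.items():
    print(ep, round(value, 2))
```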

Next, let's save the agent to a zip file that can be used in future evaluation.

In [8]:
model.save("PrimAITE-PPO-example-agent")

Now, we load the saved agent so that it can be evaluated.

In [9]:
# PPO.load is a class method that returns a new model, so no prior PPO(...) instance is needed
eval_model = PPO.load("PrimAITE-PPO-example-agent", env=gym)

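Finally, evaluate the agent. `evaluate_policy` returns the mean and standard deviation of the total reward per episode, computed over `n_eval_episodes` episodes.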

In [10]:
from stable_baselines3.common.evaluation import evaluate_policy

evaluate_policy(eval_model, gym, n_eval_episodes=10)

2025-03-24 09:53:34,629: Resetting environment, episode 6, avg. reward: 0.0


2025-03-24 09:53:34,630: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_6.json


2025-03-24 09:53:36,260: Resetting environment, episode 7, avg. reward: -57.39999999999994


2025-03-24 09:53:36,261: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_7.json


2025-03-24 09:53:37,774: Resetting environment, episode 8, avg. reward: -22.999999999999986


2025-03-24 09:53:37,775: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_8.json


2025-03-24 09:53:39,251: Resetting environment, episode 9, avg. reward: 13.550000000000097


2025-03-24 09:53:39,252: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_9.json


2025-03-24 09:53:40,745: Resetting environment, episode 10, avg. reward: -58.54999999999994


2025-03-24 09:53:40,747: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_10.json


2025-03-24 09:53:42,580: Resetting environment, episode 11, avg. reward: -57.04999999999994


2025-03-24 09:53:42,581: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_11.json


2025-03-24 09:53:44,088: Resetting environment, episode 12, avg. reward: 22.250000000000124


2025-03-24 09:53:44,089: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_12.json


2025-03-24 09:53:45,593: Resetting environment, episode 13, avg. reward: -14.599999999999982


2025-03-24 09:53:45,594: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_13.json


2025-03-24 09:53:47,103: Resetting environment, episode 14, avg. reward: -22.399999999999988


2025-03-24 09:53:47,104: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_14.json


2025-03-24 09:53:48,617: Resetting environment, episode 15, avg. reward: -56.29999999999995


2025-03-24 09:53:48,618: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_15.json


2025-03-24 09:53:50,491: Resetting environment, episode 16, avg. reward: -19.699999999999974


2025-03-24 09:53:50,493: Saving agent action log to /home/runner/primaite/4.0.0/sessions/2025-03-24/09-53-21/agent_actions/episode_16.json


(-27.320001342892645, 28.273850241009196)
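
The tuple above can be reproduced from the log. The ten "avg. reward" values printed at the resets for episodes 7 to 16 average to the first element, and their population standard deviation (dividing by N rather than N - 1) gives the second. The sketch below checks this with the standard library. The values are copied from the log and rounded to two decimal places. Reading each logged value as the total reward of the episode that has just finished is an inference from the numbers matching, not something the notebook states.

```python
from statistics import fmean, pstdev

# "avg. reward" values logged at the resets for episodes 7-16 above,
# rounded to two decimal places.
logged_rewards = [
    -57.40, -23.00, 13.55, -58.55, -57.05,
    22.25, -14.60, -22.40, -56.30, -19.70,
]

mean_reward = fmean(logged_rewards)
std_reward = pstdev(logged_rewards)  # population standard deviation (divide by N)

print(round(mean_reward, 2), round(std_reward, 2))  # -27.32 28.27
```

The large standard deviation relative to the mean reflects how short the training run was: episode rewards range from roughly -58 to +22 across the ten evaluation episodes.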