training with fixed population #7
Conversation
charliemolony commented on Nov 13, 2025
- Training with a fixed population
- Separating co-player logging and ego logging
- Implemented co-player conditioning
Greptile Overview

Greptile Summary

Implemented population-based training where ego agents (trainable) interact with co-players (fixed-policy agents) in shared driving scenarios. The PR separates agent roles, adds co-player policy loading/inference, implements separate logging systems, and extends the C/Python bindings to support dynamic agent allocation across worlds. Key changes:
Critical issues found:
Confidence Score: 2/5
Important Files Changed

File Analysis

Sequence Diagram

sequenceDiagram
participant Config as drive.ini
participant Vector as vector.py
participant PufferRL as pufferl.py
participant DriveEnv as drive.py
participant Binding as binding.c/h
participant DriveC as drive.h (C)
Config->>Vector: load config with population_play=True
Vector->>Vector: load co-player policy from checkpoint
Vector->>DriveEnv: create env with co_player_policy
DriveEnv->>Binding: call shared() to allocate worlds
Binding->>Binding: check population_play flag
alt population_play enabled
Binding->>Binding: my_shared_population_play()
Binding->>Binding: shuffle agent roles (ego/co-player)
Binding->>Binding: allocate worlds ensuring >= 1 ego per world
Binding-->>DriveEnv: return (offsets, map_ids, ego_ids, co_player_ids)
else self-play mode
Binding->>Binding: my_shared_self_play()
Binding-->>DriveEnv: return (offsets, map_ids)
end
DriveEnv->>Binding: env_init() for each world
Binding->>DriveC: allocate ego_agent_ids and co_player_ids arrays
Binding->>DriveC: call init()
DriveC->>DriveC: assign_ego_and_coplayer_roles()
DriveC->>DriveC: allocate co_player_logs
loop training step
PufferRL->>DriveEnv: step(ego_actions)
DriveEnv->>DriveEnv: get_co_player_actions() via policy inference
DriveEnv->>DriveEnv: merge ego + co-player actions
DriveEnv->>DriveC: vec_step() with all actions
DriveC->>DriveC: move agents and compute metrics
DriveC->>DriveC: update separate logs for ego vs co-players
DriveC-->>DriveEnv: observations, rewards, terminals
DriveEnv->>DriveEnv: slice out only ego agent data
DriveEnv-->>PufferRL: ego observations/rewards
end
DriveC->>DriveC: c_close() cleanup
Note over DriveC: MEMORY LEAK: ego_agent_ids, co_player_ids not freed
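The per-step flow in the diagram above can be summarized in a short Python sketch. This is illustrative only: names such as self.ego_ids, self.co_player_ids, self.c_env, and get_co_player_actions are assumptions about naming and shape, not the actual drive.py implementation.

```python
import numpy as np

def step(self, ego_actions):
    # Run fixed co-player policy inference on the co-player slice of the
    # last observation buffer (hypothetical helper name).
    co_player_actions = self.get_co_player_actions()

    # Merge ego and co-player actions into one flat buffer, in the agent
    # order the C side expects.
    all_actions = np.empty(self.num_agents, dtype=ego_actions.dtype)
    all_actions[self.ego_ids] = ego_actions
    all_actions[self.co_player_ids] = co_player_actions

    # Step every world in C with the full action set.
    obs, rewards, terminals, truncations, infos = self.c_env.step(all_actions)

    # Only ego data goes back to the learner; co-player metrics stay in
    # their own logs.
    return (obs[self.ego_ids], rewards[self.ego_ids],
            terminals[self.ego_ids], truncations[self.ego_ids], infos)
```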
| env_k["co_player_condition_type"] = condition_type
| torch.set_num_threads(
|     1
| )  # NOTE this is the only way I could get co-player policies to work inside environment evaluation
what do you mean by this btw? Does this affect anything else?
As far as I know it doesn't, but I just wanted to flag it in case issues come up downstream.
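For context, a minimal sketch of where such a call might sit when loading a fixed co-player policy inside an environment worker. The helper name and the assumption that the checkpoint stores a complete policy module are hypothetical; the point is pinning intra-op threads before inference so the policy doesn't contend with the vectorized environment workers.

```python
import torch

def load_co_player_policy(checkpoint_path, device="cpu"):
    # Pin PyTorch to a single intra-op thread inside the env worker so
    # co-player inference does not oversubscribe CPUs shared with other workers.
    torch.set_num_threads(1)
    # Assumes the checkpoint stores a full policy module (hypothetical layout).
    policy = torch.load(checkpoint_path, map_location=device)
    policy.eval()  # fixed policy: inference only, no gradient updates
    return policy
```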
| elif isinstance(w_slice, range):
|     w_indices = list(w_slice)
| else:
|     # covers lists, tuples, numpy arrays, etc.
do you need support for lists, tuples, etc.?
Yeah, when batch_size = 1 the data types are different.
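A sketch of what that normalization could look like, using a hypothetical helper (to_world_indices is not from the PR): slice, range, bare int (the batch_size = 1 case), list, tuple, and numpy array all get converted to a plain list of world indices.

```python
import numpy as np

def to_world_indices(w_slice, num_worlds: int):
    if isinstance(w_slice, slice):
        return list(range(*w_slice.indices(num_worlds)))
    elif isinstance(w_slice, range):
        return list(w_slice)
    elif isinstance(w_slice, (int, np.integer)):
        # batch_size = 1 can hand over a bare index
        return [int(w_slice)]
    else:
        # covers lists, tuples, numpy arrays, etc.
        return [int(i) for i in w_slice]
```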
| if self.population_play:
|     co_player_actions = self.get_co_player_actions()
I'm wondering if there's an easier way to do this from the C side, because every time you get actions you call _add_co_player_conditioning, which requires you to reshape the big obs array. I'm not sure if this is possible though, so feel free to ignore.
I think it would do more harm than good, because you would have to separate the co-player and ego observations on the C side and it would get complicated. However, it might make sense to vectorise this, especially when num_co_players >> num_ego_agents.
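If that vectorisation were done, it could look roughly like the sketch below. The shapes, the conditioning layout, and the assumption that the policy returns (logits, value) are all hypothetical, not the PR's actual get_co_player_actions; the idea is a single batched forward pass over every co-player rather than per-agent calls.

```python
import torch

@torch.no_grad()
def get_co_player_actions(policy, obs, co_player_ids, condition):
    # Gather all co-player observations in one batch.
    co_obs = torch.as_tensor(obs[co_player_ids], dtype=torch.float32)
    # Broadcast the conditioning value across co-players and append it as
    # an extra observation feature (hypothetical conditioning scheme).
    cond = torch.full((co_obs.shape[0], 1), float(condition))
    co_obs = torch.cat([co_obs, cond], dim=-1)
    logits, _ = policy(co_obs)  # assumes the policy returns (logits, value)
    return torch.argmax(logits, dim=-1).cpu().numpy()
```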
@greptile review
Greptile Summary
- Implemented fixed population training mode where ego agents train against a fixed co-player policy
- Separated ego and co-player logging systems with distinct metrics tracking
- Refactored adaptive agent logging from struct arrays to individual float arrays for better memory management
Confidence Score: 3/5
- This PR introduces complex memory management and multi-agent coordination that requires thorough testing
- Previous comments identified memory leaks that were partially fixed, but the C code has complex pointer arithmetic and memory management across ego/co-player systems that needs verification
- Pay close attention to pufferlib/ocean/drive/binding.c for memory allocation/deallocation and pufferlib/ocean/drive/drive.h for proper ego/co-player role assignment logic
Important Files Changed
| Filename | Overview |
|---|---|
| pufferlib/ocean/drive/binding.c | Refactored shared function to support population play, added ego/co-player ID allocation with memory leak fixes from previous review |
| pufferlib/ocean/drive/drive.h | Added Co_Player_Log struct, refactored Adaptive_Agent_Log to use float arrays, implemented ego/co-player role assignment and separate logging |
| pufferlib/ocean/drive/drive.py | Added population play mode with co-player policy inference, conditioning support, and separated ego/co-player action handling |
| pufferlib/pufferl.py | Added population play support to training loop with ego/co-player separation, LSTM state handling, and batch size calculations |
Sequence Diagram
sequenceDiagram
participant User
participant PuffeRL
participant Drive_Env
participant Binding
participant Co_Player_Policy
participant Ego_Policy
User->>PuffeRL: "train()"
PuffeRL->>Drive_Env: "__init__(population_play=True)"
Drive_Env->>Binding: "shared(population_play=True)"
Binding->>Binding: "my_shared_population_play()"
Binding-->>Drive_Env: "agent_offsets, map_ids, ego_ids, co_player_ids"
Drive_Env->>Drive_Env: "_set_co_player_state()"
Drive_Env-->>PuffeRL: "vecenv ready"
loop Training Loop
PuffeRL->>Drive_Env: "step(ego_actions)"
Drive_Env->>Co_Player_Policy: "get_co_player_actions(observations[co_player_ids])"
Co_Player_Policy-->>Drive_Env: "co_player_actions"
Drive_Env->>Binding: "vec_step(actions[ego_ids + co_player_ids])"
Binding->>Binding: "c_step() with ego/co-player logging"
Binding-->>Drive_Env: "observations, rewards, terminals"
Drive_Env-->>PuffeRL: "ego observations, ego rewards, ego terminals"
PuffeRL->>Ego_Policy: "forward(ego_observations)"
Ego_Policy-->>PuffeRL: "logits, value"
PuffeRL->>PuffeRL: "compute_loss() and update()"
end
7 files reviewed, no comments
| vtrace_c_clip = 1
| vtrace_rho_clip = 1
| checkpoint_interval = 1000
| checkpoint_interval = 50
Yes, the batch sizes are bigger so there are far fewer epochs/checkpoints; this makes sure we actually save some policies.
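For a sense of scale (illustrative numbers, not from this PR): with 100M total timesteps and a 2M-step batch there are only about 50 updates in the whole run, so checkpoint_interval = 1000 would never trigger, while 50 saves at least one checkpoint.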
| int is_ego = env->entities[agent_idx].is_ego;
| int is_co_player = env->entities[agent_idx].is_co_player;
| // Handle collisions - SAME REWARD for both ego and co-players
Do we want co-players to get rewards? Or how does this work?
They receive rewards, but only for logging purposes.
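A minimal sketch of that split, with hypothetical names (rewards, ego_ids, co_player_ids, logs are illustrative, not the PR's actual structures): co-players get the same reward computation, but it only feeds their log, never the training batch.

```python
import numpy as np

def split_rewards(rewards: np.ndarray, ego_ids, co_player_ids, logs: dict):
    # Co-player rewards are accumulated into a separate log entry...
    logs["co_player/episode_return"] = (
        logs.get("co_player/episode_return", 0.0) + float(rewards[co_player_ids].sum())
    )
    logs["ego/episode_return"] = (
        logs.get("ego/episode_return", 0.0) + float(rewards[ego_ids].sum())
    )
    # ...while only the ego rewards are returned to the learner.
    return rewards[ego_ids]
```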
| backend = "Serial"
| args["vec"] = dict(backend=backend, num_envs=1)
| args["env"]["num_agents"] = args["wosac"]["num_total_wosac_agents"] if wosac_enabled else 1
We should add back the wosac stuff, I guess.