
training with fixed population #7

Merged
charliemolony merged 9 commits into main from fixed_population_play on Nov 17, 2025

Conversation

@charliemolony
Collaborator

  • Training with a fixed population
  • Separating co-player logging and ego logging
  • Implemented co-player conditioning


greptile-apps Bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

Implemented population-based training where ego agents (trainable) interact with co-players (fixed policy agents) in shared driving scenarios. The PR separates agent roles, adds co-player policy loading/inference, implements separate logging systems, and extends the C/Python bindings to support dynamic agent allocation across worlds.

Key changes:

  • Added population_play mode with configurable ego/co-player split (512 ego agents by default)
  • Implemented co-player policy loading from checkpoint with conditioning support (reward, entropy, discount)
  • Created separate logging structs (Co_Player_Log, refactored Adaptive_Agent_Log from nested to flat arrays)
  • Extended C bindings with my_shared_population_play() for world allocation with agent role shuffling
  • Modified Python env to slice observations/rewards to only return ego agent data to the trainer (see the sketch after this list)
  • Increased batch_size from 2 to 16 and num_maps from 1 to 1000
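
A minimal sketch of that ego/co-player split at step time, to make the last bullet concrete. All names here (step_population_play, c_step, the id arrays) are hypothetical stand-ins rather than the PR's actual API, and it assumes flat 1-D agent id arrays and a co-player policy callable that maps observations to actions:

import numpy as np

def step_population_play(ego_actions, observations, ego_ids, co_player_ids,
                         co_player_policy, c_step):
    # Fixed policy picks actions for the co-players; the trainer supplies ego actions.
    co_player_actions = co_player_policy(observations[co_player_ids])
    all_actions = np.empty(len(ego_ids) + len(co_player_ids), dtype=ego_actions.dtype)
    all_actions[ego_ids] = ego_actions
    all_actions[co_player_ids] = co_player_actions
    # The C side steps every agent in the world together.
    obs, rewards, terminals = c_step(all_actions)
    # Only ego rows go back to the trainer; co-player data stays behind for logging.
    return obs[ego_ids], rewards[ego_ids], terminals[ego_ids]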

Critical issues found:

  • Memory leak: ego_agent_ids and co_player_ids arrays allocated but never freed in c_close()
  • Hardcoded path: /scratch/kj2676/gpudrive/data/processed/training breaks portability
  • Buffer overflow risk: num_co_players read before validation against actual array size

Confidence Score: 2/5

  • Not safe to merge - contains critical memory leaks and portability issues
  • Score reflects critical bugs that will cause memory leaks in production (ego_agent_ids, co_player_ids never freed), a hardcoded absolute path that breaks on other systems, and potential buffer overflow from unchecked array sizes
  • pufferlib/ocean/drive/drive.h (memory leak in c_close), pufferlib/ocean/drive/binding.c (buffer overflow risk), pufferlib/ocean/drive/drive.py (hardcoded path)

Important Files Changed

File Analysis

Filename Score Overview
pufferlib/ocean/drive/drive.h 2/5 added population play with co-player logging and adaptive agent tracking, but has critical memory leak for ego_agent_ids and co_player_ids arrays
pufferlib/ocean/drive/binding.c 2/5 refactored to support population play mode with separate self-play and population-play paths, buffer overflow risk in co_player_ids allocation
pufferlib/ocean/drive/drive.py 1/5 added population play with co-player policy loading and conditioning, has hardcoded absolute path breaking portability

Sequence Diagram

sequenceDiagram
    participant Config as drive.ini
    participant Vector as vector.py
    participant PufferRL as pufferl.py
    participant DriveEnv as drive.py
    participant Binding as binding.c/h
    participant DriveC as drive.h (C)

    Config->>Vector: load config with population_play=True
    Vector->>Vector: load co-player policy from checkpoint
    Vector->>DriveEnv: create env with co_player_policy
    
    DriveEnv->>Binding: call shared() to allocate worlds
    Binding->>Binding: check population_play flag
    alt population_play enabled
        Binding->>Binding: my_shared_population_play()
        Binding->>Binding: shuffle agent roles (ego/co-player)
        Binding->>Binding: allocate worlds ensuring >= 1 ego per world
        Binding-->>DriveEnv: return (offsets, map_ids, ego_ids, co_player_ids)
    else self-play mode
        Binding->>Binding: my_shared_self_play()
        Binding-->>DriveEnv: return (offsets, map_ids)
    end
    
    DriveEnv->>Binding: env_init() for each world
    Binding->>DriveC: allocate ego_agent_ids and co_player_ids arrays
    Binding->>DriveC: call init()
    DriveC->>DriveC: assign_ego_and_coplayer_roles()
    DriveC->>DriveC: allocate co_player_logs
    
    loop training step
        PufferRL->>DriveEnv: step(ego_actions)
        DriveEnv->>DriveEnv: get_co_player_actions() via policy inference
        DriveEnv->>DriveEnv: merge ego + co-player actions
        DriveEnv->>DriveC: vec_step() with all actions
        DriveC->>DriveC: move agents and compute metrics
        DriveC->>DriveC: update separate logs for ego vs co-players
        DriveC-->>DriveEnv: observations, rewards, terminals
        DriveEnv->>DriveEnv: slice out only ego agent data
        DriveEnv-->>PufferRL: ego observations/rewards
    end
    
    DriveC->>DriveC: c_close() cleanup
    Note over DriveC: MEMORY LEAK: ego_agent_ids, co_player_ids not freed


greptile-apps Bot left a comment


9 files reviewed, 3 comments


Comment thread pufferlib/ocean/drive/drive.h
Comment thread pufferlib/ocean/drive/drive.py Outdated
Comment thread pufferlib/ocean/drive/binding.c
Comment thread pufferlib/vector.py
env_k["co_player_condition_type"] = condition_type
torch.set_num_threads(
1
) # NOTE this is the only way I could get co-player policies to work inside environment evaluation

What do you mean by this, btw? Does this affect anything else?

Collaborator Author

As far as I know it doesn't, but I just wanted to flag it in case issues came up downstream.
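
For context, a common pattern behind this kind of workaround (a sketch, not the PR's exact code) is to cap torch's intra-op threads before running the frozen co-player policy inside environment workers, so each worker does not spawn its own pool of BLAS/OpenMP threads on top of the already-parallel workers. The loader below and the (logits, value) output shape are assumptions:

import torch

def load_co_player_policy(checkpoint_path):
    # One intra-op thread per process; the env workers provide the parallelism.
    torch.set_num_threads(1)
    policy = torch.load(checkpoint_path, map_location="cpu")  # assumes a full module was saved
    policy.eval()
    return policy

@torch.no_grad()
def co_player_actions(policy, observations):
    logits, _value = policy(observations)  # assumed (logits, value) output
    return logits.argmax(dim=-1)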

Comment thread pufferlib/vector.py
elif isinstance(w_slice, range):
    w_indices = list(w_slice)
else:
    # covers lists, tuples, numpy arrays, etc.

Do you need support for lists, tuples, etc.?

Collaborator Author

Yeah, when batch_size = 1 the data types are different.
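
A sketch of what that normalisation could look like if the goal is simply a flat list of world indices; normalize_world_indices and num_worlds are hypothetical names, not the code under review:

import numpy as np

def normalize_world_indices(w_slice, num_worlds):
    # slice objects come from batched vectorisation, bare ints can show up when
    # batch_size = 1, and lists/tuples/ndarrays cover everything else.
    if isinstance(w_slice, slice):
        return list(range(*w_slice.indices(num_worlds)))
    if isinstance(w_slice, range):
        return list(w_slice)
    if isinstance(w_slice, (int, np.integer)):
        return [int(w_slice)]
    return [int(i) for i in w_slice]  # lists, tuples, numpy arrays, etc.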


if self.population_play:
    co_player_actions = self.get_co_player_actions()

I'm wondering if there's an easier way to do this from the C side, because every time you get actions you call _add_co_player_conditioning, which requires reshaping the big obs array. I'm not sure if this is possible though, so feel free to ignore.

Collaborator Author

I think it would do more harm than good, because you would have to separate the co-player and ego observations on the C side and it would get complicated. However, it might make sense to vectorise this, especially when num_co_players >> num_ego_agents.


makes sense. thanks
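
On the vectorisation idea: a sketch of a single broadcast write, assuming the conditioning features (reward, entropy, discount) occupy a fixed column range of each agent's observation row. condition_slice, condition_values, and the flat-buffer layout are assumptions, not the PR's actual code:

import numpy as np

def add_co_player_conditioning_vectorised(obs, num_agents, co_player_ids,
                                          condition_values, condition_slice):
    # One fancy-indexed write over all co-player rows instead of a per-agent loop.
    obs_2d = obs.reshape(num_agents, -1)  # a view (no copy) of the flat buffer, one row per agent
    obs_2d[co_player_ids, condition_slice] = condition_values
    return obs

The cost per step is then one vectorised write that scales with num_co_players, which is exactly the regime flagged above.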

charliemolony changed the title from training with fixed populaiton to training with fixed population on Nov 17, 2025
@m2kulkarni

@greptile review


greptile-apps Bot left a comment


Greptile Summary

  • Implemented fixed population training mode where ego agents train against a fixed co-player policy
  • Separated ego and co-player logging systems with distinct metrics tracking
  • Refactored adaptive agent logging from struct arrays to individual float arrays for better memory management

Confidence Score: 3/5

  • This PR introduces complex memory management and multi-agent coordination that requires thorough testing
  • Previous comments identified memory leaks that were partially fixed, but the C code has complex pointer arithmetic and memory management across ego/co-player systems that needs verification
  • Pay close attention to pufferlib/ocean/drive/binding.c for memory allocation/deallocation and pufferlib/ocean/drive/drive.h for proper ego/co-player role assignment logic

Important Files Changed

Filename Overview
pufferlib/ocean/drive/binding.c Refactored shared function to support population play, added ego/co-player ID allocation with memory leak fixes from previous review
pufferlib/ocean/drive/drive.h Added Co_Player_Log struct, refactored Adaptive_Agent_Log to use float arrays, implemented ego/co-player role assignment and separate logging
pufferlib/ocean/drive/drive.py Added population play mode with co-player policy inference, conditioning support, and separated ego/co-player action handling
pufferlib/pufferl.py Added population play support to training loop with ego/co-player separation, LSTM state handling, and batch size calculations

Sequence Diagram

sequenceDiagram
    participant User
    participant PuffeRL
    participant Drive_Env
    participant Binding
    participant Co_Player_Policy
    participant Ego_Policy

    User->>PuffeRL: "train()"
    PuffeRL->>Drive_Env: "__init__(population_play=True)"
    Drive_Env->>Binding: "shared(population_play=True)"
    Binding->>Binding: "my_shared_population_play()"
    Binding-->>Drive_Env: "agent_offsets, map_ids, ego_ids, co_player_ids"
    Drive_Env->>Drive_Env: "_set_co_player_state()"
    Drive_Env-->>PuffeRL: "vecenv ready"
    
    loop Training Loop
        PuffeRL->>Drive_Env: "step(ego_actions)"
        Drive_Env->>Co_Player_Policy: "get_co_player_actions(observations[co_player_ids])"
        Co_Player_Policy-->>Drive_Env: "co_player_actions"
        Drive_Env->>Binding: "vec_step(actions[ego_ids + co_player_ids])"
        Binding->>Binding: "c_step() with ego/co-player logging"
        Binding-->>Drive_Env: "observations, rewards, terminals"
        Drive_Env-->>PuffeRL: "ego observations, ego rewards, ego terminals"
        PuffeRL->>Ego_Policy: "forward(ego_observations)"
        Ego_Policy-->>PuffeRL: "logits, value"
        PuffeRL->>PuffeRL: "compute_loss() and update()"
    end

7 files reviewed, no comments


vtrace_c_clip = 1
vtrace_rho_clip = 1
-checkpoint_interval = 1000
+checkpoint_interval = 50

is this intentional?

Collaborator Author

Yes, the batch sizes are bigger so there are far fewer epochs/checkpoints; this makes sure we actually save some policies.

int is_ego = env->entities[agent_idx].is_ego;
int is_co_player = env->entities[agent_idx].is_co_player;

// Handle collisions - SAME REWARD for both ego and co-players

Do we want co-players to get rewards? Or how does this work?

Collaborator Author

They receive rewards, but only for logging purposes.



Comment thread pufferlib/pufferl.py
backend = "Serial"

args["vec"] = dict(backend=backend, num_envs=1)
args["env"]["num_agents"] = args["wosac"]["num_total_wosac_agents"] if wosac_enabled else 1

We should add back the WOSAC stuff, I guess.

charliemolony merged commit 766e505 into main on Nov 17, 2025
13 checks passed
charliemolony deleted the fixed_population_play branch on November 17, 2025 at 17:57