training with fixed population #7
Conversation
charliemolony commented on Nov 13, 2025
- Training with a fixed population
- Separating co-player logging and ego logging
- Implemented co-player conditioning
Greptile Overview

Greptile Summary

Implemented population-based training where ego agents (trainable) interact with co-players (fixed-policy agents) in shared driving scenarios. The PR separates agent roles, adds co-player policy loading/inference, implements separate logging systems, and extends the C/Python bindings to support dynamic agent allocation across worlds. Key changes:
Critical issues found:
Confidence Score: 2/5
Important Files Changed

File Analysis

Sequence Diagram

sequenceDiagram
participant Config as drive.ini
participant Vector as vector.py
participant PufferRL as pufferl.py
participant DriveEnv as drive.py
participant Binding as binding.c/h
participant DriveC as drive.h (C)
Config->>Vector: load config with population_play=True
Vector->>Vector: load co-player policy from checkpoint
Vector->>DriveEnv: create env with co_player_policy
DriveEnv->>Binding: call shared() to allocate worlds
Binding->>Binding: check population_play flag
alt population_play enabled
Binding->>Binding: my_shared_population_play()
Binding->>Binding: shuffle agent roles (ego/co-player)
Binding->>Binding: allocate worlds ensuring >= 1 ego per world
Binding-->>DriveEnv: return (offsets, map_ids, ego_ids, co_player_ids)
else self-play mode
Binding->>Binding: my_shared_self_play()
Binding-->>DriveEnv: return (offsets, map_ids)
end
DriveEnv->>Binding: env_init() for each world
Binding->>DriveC: allocate ego_agent_ids and co_player_ids arrays
Binding->>DriveC: call init()
DriveC->>DriveC: assign_ego_and_coplayer_roles()
DriveC->>DriveC: allocate co_player_logs
loop training step
PufferRL->>DriveEnv: step(ego_actions)
DriveEnv->>DriveEnv: get_co_player_actions() via policy inference
DriveEnv->>DriveEnv: merge ego + co-player actions
DriveEnv->>DriveC: vec_step() with all actions
DriveC->>DriveC: move agents and compute metrics
DriveC->>DriveC: update separate logs for ego vs co-players
DriveC-->>DriveEnv: observations, rewards, terminals
DriveEnv->>DriveEnv: slice out only ego agent data
DriveEnv-->>PufferRL: ego observations/rewards
end
DriveC->>DriveC: c_close() cleanup
Note over DriveC: MEMORY LEAK: ego_agent_ids, co_player_ids not freed
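The per-step flow in the diagram above can be summarized in a short Python sketch. This is illustrative only: names such as self.ego_ids, self.co_player_ids, self.c_env, and get_co_player_actions are assumptions about naming and shape, not the actual drive.py implementation.

```python
import numpy as np

def step(self, ego_actions):
    # Run fixed co-player policy inference on the co-player slice of the
    # last observation buffer (hypothetical helper name).
    co_player_actions = self.get_co_player_actions()

    # Merge ego and co-player actions into one flat buffer, in the agent
    # order the C side expects.
    all_actions = np.empty(self.num_agents, dtype=ego_actions.dtype)
    all_actions[self.ego_ids] = ego_actions
    all_actions[self.co_player_ids] = co_player_actions

    # Step every world in C with the full action set.
    obs, rewards, terminals, truncations, infos = self.c_env.step(all_actions)

    # Only ego data goes back to the learner; co-player metrics stay in
    # their own logs.
    return (obs[self.ego_ids], rewards[self.ego_ids],
            terminals[self.ego_ids], truncations[self.ego_ids], infos)
```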
| env_k["co_player_condition_type"] = condition_type
| torch.set_num_threads(
|     1
| )  # NOTE this is the only way I could get co-player policies to work inside environment evaluation
what do you mean by this btw? Does this affect anything else?
As far as I know it doesn't, but I just wanted to flag it in case issues come up downstream.
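For context, a minimal sketch of where such a call might sit when loading a fixed co-player policy inside an environment worker. The helper name and the assumption that the checkpoint stores a complete policy module are hypothetical; the point is pinning intra-op threads before inference so the policy doesn't contend with the vectorized environment workers.

```python
import torch

def load_co_player_policy(checkpoint_path, device="cpu"):
    # Pin PyTorch to a single intra-op thread inside the env worker so
    # co-player inference does not oversubscribe CPUs shared with other workers.
    torch.set_num_threads(1)
    # Assumes the checkpoint stores a full policy module (hypothetical layout).
    policy = torch.load(checkpoint_path, map_location=device)
    policy.eval()  # fixed policy: inference only, no gradient updates
    return policy
```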
| elif isinstance(w_slice, range):
|     w_indices = list(w_slice)
| else:
|     # covers lists, tuples, numpy arrays, etc.
do you need support for lists, tuples, etc.?
Yeah, when batch_size = 1 the data types are different.
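A sketch of what that normalization could look like, using a hypothetical helper (to_world_indices is not from the PR): slice, range, bare int (the batch_size = 1 case), list, tuple, and numpy array all get converted to a plain list of world indices.

```python
import numpy as np

def to_world_indices(w_slice, num_worlds: int):
    if isinstance(w_slice, slice):
        return list(range(*w_slice.indices(num_worlds)))
    elif isinstance(w_slice, range):
        return list(w_slice)
    elif isinstance(w_slice, (int, np.integer)):
        # batch_size = 1 can hand over a bare index
        return [int(w_slice)]
    else:
        # covers lists, tuples, numpy arrays, etc.
        return [int(i) for i in w_slice]
```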
| if self.population_play:
|     co_player_actions = self.get_co_player_actions()
I'm wondering if there's an easier way to do this from the C side, because every time you get actions you call _add_co_player_conditioning, which requires you to reshape the big obs array. I'm not sure if this is possible though, so feel free to ignore.
I think it would do more harm than good, because you would have to separate the co-player and ego observations on the C side and it would get complicated. However, it might make sense to vectorise this, especially when num_co_players >> num_ego_agents.
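If that vectorisation were done, it could look roughly like the sketch below. The shapes, the conditioning layout, and the assumption that the policy returns (logits, value) are all hypothetical, not the PR's actual get_co_player_actions; the idea is a single batched forward pass over every co-player rather than per-agent calls.

```python
import torch

@torch.no_grad()
def get_co_player_actions(policy, obs, co_player_ids, condition):
    # Gather all co-player observations in one batch.
    co_obs = torch.as_tensor(obs[co_player_ids], dtype=torch.float32)
    # Broadcast the conditioning value across co-players and append it as
    # an extra observation feature (hypothetical conditioning scheme).
    cond = torch.full((co_obs.shape[0], 1), float(condition))
    co_obs = torch.cat([co_obs, cond], dim=-1)
    logits, _ = policy(co_obs)  # assumes the policy returns (logits, value)
    return torch.argmax(logits, dim=-1).cpu().numpy()
```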
@greptile review
Greptile Summary
- Implemented fixed population training mode where ego agents train against a fixed co-player policy
- Separated ego and co-player logging systems with distinct metrics tracking
- Refactored adaptive agent logging from struct arrays to individual float arrays for better memory management
Confidence Score: 3/5
- This PR introduces complex memory management and multi-agent coordination that requires thorough testing
- Previous comments identified memory leaks that were partially fixed, but the C code has complex pointer arithmetic and memory management across ego/co-player systems that needs verification
- Pay close attention to pufferlib/ocean/drive/binding.c for memory allocation/deallocation and pufferlib/ocean/drive/drive.h for proper ego/co-player role assignment logic
Important Files Changed
| Filename | Overview |
|---|---|
| pufferlib/ocean/drive/binding.c | Refactored shared function to support population play, added ego/co-player ID allocation with memory leak fixes from previous review |
| pufferlib/ocean/drive/drive.h | Added Co_Player_Log struct, refactored Adaptive_Agent_Log to use float arrays, implemented ego/co-player role assignment and separate logging |
| pufferlib/ocean/drive/drive.py | Added population play mode with co-player policy inference, conditioning support, and separated ego/co-player action handling |
| pufferlib/pufferl.py | Added population play support to training loop with ego/co-player separation, LSTM state handling, and batch size calculations |
Sequence Diagram
sequenceDiagram
participant User
participant PuffeRL
participant Drive_Env
participant Binding
participant Co_Player_Policy
participant Ego_Policy
User->>PuffeRL: "train()"
PuffeRL->>Drive_Env: "__init__(population_play=True)"
Drive_Env->>Binding: "shared(population_play=True)"
Binding->>Binding: "my_shared_population_play()"
Binding-->>Drive_Env: "agent_offsets, map_ids, ego_ids, co_player_ids"
Drive_Env->>Drive_Env: "_set_co_player_state()"
Drive_Env-->>PuffeRL: "vecenv ready"
loop Training Loop
PuffeRL->>Drive_Env: "step(ego_actions)"
Drive_Env->>Co_Player_Policy: "get_co_player_actions(observations[co_player_ids])"
Co_Player_Policy-->>Drive_Env: "co_player_actions"
Drive_Env->>Binding: "vec_step(actions[ego_ids + co_player_ids])"
Binding->>Binding: "c_step() with ego/co-player logging"
Binding-->>Drive_Env: "observations, rewards, terminals"
Drive_Env-->>PuffeRL: "ego observations, ego rewards, ego terminals"
PuffeRL->>Ego_Policy: "forward(ego_observations)"
Ego_Policy-->>PuffeRL: "logits, value"
PuffeRL->>PuffeRL: "compute_loss() and update()"
end
7 files reviewed, no comments
| vtrace_c_clip = 1
| vtrace_rho_clip = 1
| checkpoint_interval = 1000
| checkpoint_interval = 50
Yes, the batch sizes are bigger so there are far fewer epochs/checkpoints; this makes sure we actually save some policies.
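For a sense of scale (illustrative numbers, not from this PR): with 100M total timesteps and a 2M-step batch there are only about 50 updates in the whole run, so checkpoint_interval = 1000 would never trigger, while 50 saves at least one checkpoint.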
| int is_ego = env->entities[agent_idx].is_ego;
| int is_co_player = env->entities[agent_idx].is_co_player;
| // Handle collisions - SAME REWARD for both ego and co-players
Do we want co-players to get rewards? Or how does this work?
They receive rewards, but only for logging purposes.
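A minimal sketch of that split, with hypothetical names (rewards, ego_ids, co_player_ids, logs are illustrative, not the PR's actual structures): co-players get the same reward computation, but it only feeds their log, never the training batch.

```python
import numpy as np

def split_rewards(rewards: np.ndarray, ego_ids, co_player_ids, logs: dict):
    # Co-player rewards are accumulated into a separate log entry...
    logs["co_player/episode_return"] = (
        logs.get("co_player/episode_return", 0.0) + float(rewards[co_player_ids].sum())
    )
    logs["ego/episode_return"] = (
        logs.get("ego/episode_return", 0.0) + float(rewards[ego_ids].sum())
    )
    # ...while only the ego rewards are returned to the learner.
    return rewards[ego_ids]
```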
| backend = "Serial"
| args["vec"] = dict(backend=backend, num_envs=1)
| args["env"]["num_agents"] = args["wosac"]["num_total_wosac_agents"] if wosac_enabled else 1
We should add back the wosac stuff, I guess.