Add changes from vcha/stable#436
… clarity
- Added 'amp' option to default.ini for automatic mixed precision support.
- Introduced 'resume_state_path' in default.ini for state restoration.
- Updated compilation settings in default.ini for better compatibility.
- Refined Waypoint structure in datatypes.h for clarity.
- Modified Drive class in drive.h to improve collision handling and agent initialization.
- Enhanced observation handling in drive.py, including padded observations and traffic control features.
- Implemented utility functions in pufferl.py for better device management and state handling.
- Improved training state loading and saving mechanisms in PuffeRL class.
- Adjusted training logic to support advanced features like mixed precision and dynamic batching.
…d training evaluation
…resource management
Pull request overview
This PR appears to merge in “stable” changes that extend PufferDrive’s training loop with improved checkpoint/resume support, additional evaluation utilities (multi-scenario evaluation + CSV export), and several Drive environment/config updates.
Changes:
- Extend `PuffeRL` with precision/AMP handling, state dict key cleaning, richer checkpoint state, and resume-from-state support.
- Add standalone multi-scenario evaluation helpers (config merging, overrides, CSV export, coverage verification, logging).
- Update Drive env observation construction/padding and configs (including new INI defaults and new weight config YAMLs).
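
A rough illustration of the state dict key cleaning mentioned in the list above: checkpoints saved from a `torch.compile`-wrapped module carry an `_orig_mod.` key prefix, and `DistributedDataParallel` adds `module.`. The sketch below strips such prefixes from a plain dict. The function name `clean_state_dict_keys` and the prefix list are assumptions made for illustration, not the PR's actual code.

```python
def clean_state_dict_keys(state_dict, prefixes=("_orig_mod.", "module.")):
    """Return a copy of state_dict with wrapper prefixes removed from its keys.

    Hypothetical helper: the prefixes are the ones commonly added by
    torch.compile ("_orig_mod.") and DistributedDataParallel ("module.").
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key
        stripped = True
        # Repeat until no prefix matches, so nested wrappers are handled too.
        while stripped:
            stripped = False
            for prefix in prefixes:
                if new_key.startswith(prefix):
                    new_key = new_key[len(prefix):]
                    stripped = True
        if new_key in cleaned:
            raise KeyError(f"duplicate key after cleaning: {new_key}")
        cleaned[new_key] = value
    return cleaned
```

Cleaning keys at load time would let a checkpoint written by a compiled model be loaded into an uncompiled one, which is one plausible reason for the change.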
Reviewed changes
Copilot reviewed 9 out of 45 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| weigths/tomate/config.yaml | Adds a new experiment/config preset for training/eval. |
| weigths/salade/config.yaml | Adds another experiment/config preset for training/eval. |
| pufferlib/pufferl.py | Major training/eval refactor: AMP/precision validation, compile tweaks, checkpoint state v2 + RNG capture/restore, resume, and new multi-scenario eval utilities. |
| pufferlib/ocean/torch.py | Refactors encoder+pooling and aligns one-hot dtypes with continuous features. |
| pufferlib/ocean/drive/drive.py | Adjusts control_mode error message text. |
| pufferlib/ocean/drive/drive.h | Changes observation padding strategy and removes a zero-drivable-cells guard; minor control logic tweak. |
| pufferlib/ocean/drive/datatypes.h | Edits a struct field comment. |
| pufferlib/config/ocean/drive.ini | Updates map_dir and adds an [eval] section with multi-scenario eval config. |
| pufferlib/config/default.ini | Adds amp and resume_state_path defaults; changes torch.compile defaults. |
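
To make the "RNG capture/restore" item in the table concrete, here is a minimal sketch using only Python's standard `random` module. The helper names are hypothetical. A real training checkpoint would presumably also need the NumPy and PyTorch generator states, which this sketch deliberately leaves out.

```python
import random


def capture_rng_state():
    """Snapshot the stdlib RNG so a resumed run can replay the same sequence."""
    # Assumption: only the Python generator is captured here; NumPy and
    # PyTorch generators would need their own entries in a real checkpoint.
    return {"python": random.getstate()}


def restore_rng_state(state):
    """Restore a snapshot produced by capture_rng_state."""
    random.setstate(state["python"])


random.seed(0)
saved = capture_rng_state()
first_draws = [random.random() for _ in range(3)]

restore_rng_state(saved)
replayed_draws = [random.random() for _ in range(3)]
print(first_draws == replayed_draws)
```

Restoring the generator state before continuing is what makes a resumed run draw the same random numbers it would have drawn without the interruption.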

[eval]
; Set to True to enable periodic multi-scenario evaluation during training
multi_scenario_eval = False
; Frequency of evaluation during training (in epochs)
eval_interval = 25
num_agents = 512
; Batch size for eval_multi_scenarios (number of scenarios per batch)
; Path to dataset used for evaluation
map_dir = "pufferlib/resources/drive/binaries/eval"
; Simulation mode for evaluation: "gigaflow" or "replay"
multi_scenario_simulation_mode = "replay"
; Total number of scenarios to evaluate
multi_scenario_num_scenarios = 250
backend = PufferEnv
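
As a hedged sketch of how the `[eval]` section above could be read and validated, the example below uses the standard-library `configparser`. PufferLib has its own config loading, so this is an illustration only, and `load_eval_section` is a hypothetical name.

```python
import configparser

EVAL_INI = """
[eval]
; Set to True to enable periodic multi-scenario evaluation during training
multi_scenario_eval = False
eval_interval = 25
num_agents = 512
map_dir = "pufferlib/resources/drive/binaries/eval"
multi_scenario_simulation_mode = "replay"
multi_scenario_num_scenarios = 250
"""


def load_eval_section(text):
    """Parse the [eval] section and convert values to Python types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["eval"]

    # Quoted INI values keep their quotes in configparser, so strip them.
    mode = section.get("multi_scenario_simulation_mode").strip('"')
    if mode not in ("gigaflow", "replay"):
        raise ValueError(f"unsupported simulation mode: {mode!r}")

    return {
        "multi_scenario_eval": section.getboolean("multi_scenario_eval"),
        "eval_interval": section.getint("eval_interval"),
        "num_agents": section.getint("num_agents"),
        "map_dir": section.get("map_dir").strip('"'),
        "multi_scenario_simulation_mode": mode,
        "multi_scenario_num_scenarios": section.getint("multi_scenario_num_scenarios"),
    }
```

Validating the simulation mode at load time surfaces a typo in the INI file before any environment is created.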

else:
    raise ValueError(
        f"control_mode must be one of 'control_vehicles', 'control_agents', 'control_wosac', or 'control_sdc_only'. Got: {self.control_mode_str}"
        f"control_mode must be one of 'control_vehicles', 'control_wosac', or 'control_agents'. Got: {self.control_mode_str}"
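
One way to keep an error message like the one above from drifting out of sync with the accepted values is to build it from a single tuple. The sketch below assumes the four mode names that appear in this thread; `VALID_CONTROL_MODES` and `validate_control_mode` are hypothetical names, not part of the Drive environment.

```python
# Assumption: these are the four accepted modes named in the review thread.
VALID_CONTROL_MODES = (
    "control_vehicles",
    "control_agents",
    "control_wosac",
    "control_sdc_only",
)


def validate_control_mode(control_mode_str):
    """Return the mode unchanged, or raise with a message listing every valid mode."""
    if control_mode_str not in VALID_CONTROL_MODES:
        options = ", ".join(repr(mode) for mode in VALID_CONTROL_MODES)
        raise ValueError(
            f"control_mode must be one of {options}. Got: {control_mode_str!r}"
        )
    return control_mode_str
```

Because the message is generated from the tuple, adding or removing a mode updates the validation and the error text together.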

float sin_heading; // Cached sinf(heading) - set in build_path
float kappa; // Curvature at this point
int lane_idx; // Index of the lane this waypoint belongs to (for GT path) or closest to (for expert path)
int lane_idx; // Index of the lane this waypoint

if model_path:
    experiment_dir = os.path.dirname(os.path.dirname(model_path))
    config_yaml_path = os.path.join(experiment_dir, "config.yaml")
    EXCLUDE_KEYS = eval_overrides["env"].keys()
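
The excerpt above derives the experiment directory by walking two levels up from the model path and collects the keys of the eval overrides. The sketch below shows that path arithmetic and one possible merge rule in which override keys replace the saved values. The merge semantics, the function names, and the example path are assumptions, since the rest of the function is not shown here.

```python
import os


def experiment_dir_from_model_path(model_path):
    """Two dirname calls: drop the file name, then the directory that holds it."""
    return os.path.dirname(os.path.dirname(model_path))


def merge_env_config(saved_env, eval_env_overrides):
    """Keep saved values except for keys the eval overrides define.

    Assumed semantics: override keys are excluded from the saved config
    and then supplied by the overrides themselves.
    """
    exclude_keys = set(eval_env_overrides)
    merged = {k: v for k, v in saved_env.items() if k not in exclude_keys}
    merged.update(eval_env_overrides)
    return merged


# Hypothetical layout: <experiment_dir>/models/<checkpoint file>
model_path = "experiments/run_01/models/model_000100.pt"
config_yaml_path = os.path.join(experiment_dir_from_model_path(model_path), "config.yaml")
```

Under this rule a training-time setting such as `map_dir` never leaks into evaluation when the eval section defines its own value.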

# Multi-worker backend returns infos as list of lists (one per worker)
if infos and infos[0]:
    for sub_env in infos:
        for env_idx, summary in enumerate(sub_env):
            env_map_name = summary["map_name"].split("/")[-1].split(".")[0]
            summary["episode_id"] = env_idx
            summary["map_name"] = env_map_name
            scenarios_processed += 1
            pbar.update(1)

            for k, v in summary.items():
                if k not in global_infos:
                    global_infos[k] = []
                global_infos[k].append(v)
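
The aggregation in the excerpt above can be reproduced on plain data to see what it yields. This is a minimal sketch with a hypothetical `collect_infos` helper. It mirrors the excerpt's map-name expression and per-worker `enumerate`, which means `episode_id` restarts at 0 for each worker's list.

```python
from collections import defaultdict


def collect_infos(infos):
    """Flatten a list of per-worker summary lists into a dict of columns."""
    global_infos = defaultdict(list)
    scenarios_processed = 0
    for sub_env in infos:
        for env_idx, summary in enumerate(sub_env):
            row = dict(summary)  # copy so the caller's dicts are not mutated
            # Same expression as the excerpt: file name without directory,
            # cut at the first dot.
            row["map_name"] = summary["map_name"].split("/")[-1].split(".")[0]
            row["episode_id"] = env_idx
            for key, value in row.items():
                global_infos[key].append(value)
            scenarios_processed += 1
    return dict(global_infos), scenarios_processed


infos = [
    [{"map_name": "maps/a.bin", "score": 1.0}, {"map_name": "maps/b.bin", "score": 0.5}],
    [{"map_name": "maps/c.bin", "score": 0.0}],
]
columns, count = collect_infos(infos)
```

If summaries ever carried different key sets, the columns would end up with different lengths, so a length check before building a table may be worth considering.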

try:
    df_episodes = pd.DataFrame(global_infos)
    first_cols = ["episode_id", "map_name"]
    other_cols = [col for col in df_episodes.columns if col not in first_cols]
    new_col_order = first_cols + other_cols
    df_episodes = df_episodes[new_col_order]
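
The column reordering above puts `episode_id` and `map_name` first and keeps the remaining columns in their existing order. The sketch below applies the same ordering with the standard-library `csv` module instead of pandas, purely so it runs without third-party packages. `episodes_to_csv` is a hypothetical name.

```python
import csv
import io


def episodes_to_csv(global_infos, first_cols=("episode_id", "map_name")):
    """Serialize a dict of equal-length columns to CSV text, key columns first."""
    other_cols = [col for col in global_infos if col not in first_cols]
    fieldnames = list(first_cols) + other_cols

    n_rows = len(global_infos[fieldnames[0]])
    if any(len(global_infos[col]) != n_rows for col in fieldnames):
        raise ValueError("all columns must have the same length")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fieldnames)
    for i in range(n_rows):
        writer.writerow([global_infos[col][i] for col in fieldnames])
    return buffer.getvalue()
```

The explicit length check plays the role of the error pandas raises when a DataFrame is built from columns of unequal length.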

return;
}
int num_agents_to_create = env->num_controllable_agents;

static inline void fill_padded_observation_rows(float *obs, int rows, int features) {
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < features; c++) {
            obs[r * features + c] = PADDED_OBSERVATION_VALUE;
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add one-line comments to fill_padded_observation_rows / fill_padded_traffic_control_rows, and pull the road-edge heading fold into a reusable wrap_heading(angle) helper (folds a heading into [-pi/2, pi/2] so opposite directions map to one orientation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The message omitted control_sdc_only (a valid mode → control_mode=3); list all four accepted values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The folded range is [-pi/2, pi/2], not the [-pi, pi] that "wrap" implies, so the helper name was misleading. Inline it back at the road-edge block and replace it with a comment that states why the fold exists.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
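
For readers unfamiliar with the fold described in these commits, here is a small sketch of one way to map a heading onto a half-turn range so that opposite directions share one orientation. It uses modular arithmetic and returns values in the half-open interval [-pi/2, pi/2). The function name and this particular formula are assumptions, and the C code in drive.h may handle the endpoints differently.

```python
import math


def fold_heading(angle):
    """Fold an angle in radians into [-pi/2, pi/2), identifying theta with theta + pi."""
    # Shift by pi/2, reduce modulo pi (Python's % is non-negative for a
    # positive modulus), then shift back.
    return (angle + math.pi / 2) % math.pi - math.pi / 2


# A heading pointing backwards (pi) folds onto the forward orientation (0).
print(fold_heading(math.pi))
```

A road edge has no preferred travel direction, so treating theta and theta + pi as the same orientation is the property the fold provides, and it is also why "wrap" was a misleading name for it.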

logits, value = self.policy.forward_eval(o_device.to(self.observations.dtype), state)
logits = logits_to_float(logits)
value = value.float()
@vcharraut why are we doing a cast here?
To support bfloat16 training
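
The reply above gives bfloat16 support as the reason for the cast without further detail. One plausible motivation, offered here as an interpretation rather than the author's stated reasoning, is precision: bfloat16 keeps only 8 significant bits, so values near 1.0 are spaced 2^-7 apart. The sketch below emulates bfloat16 rounding in pure Python. `round_to_bfloat16` is a hypothetical helper and does not handle NaN or infinity.

```python
import struct


def round_to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    # bfloat16 is the top 16 bits of an IEEE float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A small PPO-style probability ratio collapses to exactly 1.0 in bfloat16.
print(round_to_bfloat16(1.001))
# A larger deviation survives, but only on the coarse 2**-7 grid.
print(round_to_bfloat16(1.01))
```

Casting logits and values to float32 before computing ratios and losses would avoid this kind of collapse while the forward pass still runs in the lower precision.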

clipfrac = ((ratio - 1.0).abs() > config["clip_coef"]).float().mean()

mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std(unbiased=unbiased_std) + 1e-8)
mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std(unbiased=False) + 1e-8)
The value was False for PPO with advantage filtering and True with advantage sampling; while refactoring I set it to False by default, without much thought behind the choice.
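
To show what the `unbiased` flag changes, the sketch below normalizes a list of advantages with either the population standard deviation (divide by N, matching `unbiased=False`) or the sample standard deviation (divide by N - 1, matching `unbiased=True`). It uses the standard-library `statistics` module in place of tensors, and `normalize_advantages` is a hypothetical name.

```python
import statistics


def normalize_advantages(advantages, unbiased=False, eps=1e-8):
    """Center advantages and scale them by their standard deviation."""
    mean = statistics.fmean(advantages)
    # pstdev divides by N (biased); stdev divides by N - 1 (unbiased).
    if unbiased:
        std = statistics.stdev(advantages)
    else:
        std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


advantages = [1.0, 2.0, 3.0, 4.0]
biased = normalize_advantages(advantages, unbiased=False)
unbiased = normalize_advantages(advantages, unbiased=True)
# The two results differ by roughly sqrt((N - 1) / N) in scale.
print(biased[0], unbiased[0])
```

For minibatches with thousands of samples that scale factor is very close to 1, so the choice of default mostly matters for very small minibatches.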