Amanley/reward overrides#869
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces numeric metric-error sentinel with a singleton sentinel type and MetricValue alias; introduces a nested RewardOverrides model on agent configs and threads it into CloudAIGymEnv; simplifies BaseGym.step signature and env.step calls; updates many report-strategy return types and tests to use the new sentinel/type. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_cloudaigym.py (1)
161-189: 🛠️ Refactor suggestion | 🟠 MajorAdd an env-level test for
metric_failure.This PR’s new behavior is overriding failed-metric observation slots, but the suite only checks config parsing and
constraint_failure. Please add a focusedCloudAIGymEnvtest that hitsget_observation()/step()with a missing reporter orMETRIC_ERRORmetric and asserts theRewardOverrides(metric_failure=...)value is emitted instead of-1.0.Based on learnings, prefer expressing behavioral documentation through tests rather than docstrings.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cloudaigym.py` around lines 161 - 189, Add a new unit test mirroring test_constraint_failure that verifies metric_failure behavior: create a CloudAIGymEnv with a RewardOverrides(metric_failure=VALUE) and a TestRun/NeMoRunTestDefinition, then simulate a missing reporter or a METRIC_ERROR by configuring the runner/system or observation inputs so get_observation()/step() produces a failed metric; call env.step(...) and assert the observation slot and returned reward equal VALUE (and done/info behave like the constraint test). Reference CloudAIGymEnv, get_observation(), step(), RewardOverrides, METRIC_ERROR, TestRun and reuse the test structure of test_constraint_failure to keep setup consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 148-152: The single-sbatch DSE path constructs CloudAIGymEnv as
CloudAIGymEnv(next_tr, self) so its get_observation()/compute_reward() use
hard-coded defaults and never see agent_config.rewards; update the
single_sbatch_runner code that creates CloudAIGymEnv to thread the same rewards
config (agent_config.rewards) into the CloudAIGymEnv constructor and ensure any
subsequent calls to get_observation() / compute_reward() operate on that
instance so metric_failure and constraint_failure are computed from the provided
rewards instead of staying at -1.0; locate the constructor call in
single_sbatch_runner.py (currently CloudAIGymEnv(next_tr, self)) and pass the
rewards argument consistent with the other call site where CloudAIGymEnv(...,
rewards=agent_config.rewards) is used.
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 224-234: The float equality check v ==
self.test_run.get_metric_error_value() is fragile; change to use a unique error
sentinel (or an explicit error flag) instead of a float: update
get_metric_error_value to return a unique object (e.g., METRIC_ERROR = object())
and ensure get_metric_value returns that sentinel when a metric is an error,
then replace the equality check with an identity check (v is
self.test_run.get_metric_error_value()) in the code that builds observation (the
loop using self.test_run.get_metric_value and
self.test_run.get_metric_error_value), or alternatively have get_metric_value
return (value, is_error) and check the is_error flag before appending the
observation.
---
Outside diff comments:
In `@tests/test_cloudaigym.py`:
- Around line 161-189: Add a new unit test mirroring test_constraint_failure
that verifies metric_failure behavior: create a CloudAIGymEnv with a
RewardOverrides(metric_failure=VALUE) and a TestRun/NeMoRunTestDefinition, then
simulate a missing reporter or a METRIC_ERROR by configuring the runner/system
or observation inputs so get_observation()/step() produces a failed metric; call
env.step(...) and assert the observation slot and returned reward equal VALUE
(and done/info behave like the constraint test). Reference CloudAIGymEnv,
get_observation(), step(), RewardOverrides, METRIC_ERROR, TestRun and reuse the
test structure of test_constraint_failure to keep setup consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: cc4dbd81-6c28-4ccf-8d84-386e0aa1c053
📒 Files selected for processing (9)
doc/USER_GUIDE.rstsrc/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/base_gym.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/core.pytests/test_cloudaigym.pytests/test_handlers.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/_core/test_scenario.py`:
- Line 23: Update the typing imports to use built-in generics instead of
deprecated typing aliases: remove List, Set, and Type from the typing import
line (keep TYPE_CHECKING, Any, Optional, TypeAlias, Union) and then update all
usages in this module accordingly (e.g. change List[str] to list[str], change
Set[Type[ReportGenerationStrategy]] to set[type[ReportGenerationStrategy]], and
change Type[ReportGenerationStrategy] to type[ReportGenerationStrategy]) so the
code uses modern built-in type hints.
In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 213-217: If a test supplies agent_config but
registry.agents_map.get(next_tr.test.agent) returns None, fail fast instead of
silently ignoring overrides: check for next_tr.test.agent_config (or truthy
overrides) and if agent_class is None raise a clear error (e.g.,
ValueError/RuntimeError) referencing next_tr.test.agent and that agent_config
cannot be applied; otherwise continue to instantiate agent_config via
agent_class.get_config_class() and read agent_config.rewards as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 14a0b18a-6bd2-4780-92c7-177940e17dc8
📒 Files selected for processing (17)
doc/USER_GUIDE.rstsrc/cloudai/_core/report_generation_strategy.pysrc/cloudai/_core/test_scenario.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/core.pysrc/cloudai/systems/slurm/single_sbatch_runner.pysrc/cloudai/workloads/ai_dynamo/report_generation_strategy.pysrc/cloudai/workloads/aiconfig/report_generation_strategy.pysrc/cloudai/workloads/common/llm_serving.pysrc/cloudai/workloads/megatron_bridge/report_generation_strategy.pysrc/cloudai/workloads/megatron_run/report_generation_strategy.pysrc/cloudai/workloads/nccl_test/performance_report_generation_strategy.pysrc/cloudai/workloads/nemo_run/report_generation_strategy.pysrc/cloudai/workloads/nixl_bench/report_generation_strategy.pysrc/cloudai/workloads/nixl_ep/report_generation_strategy.pysrc/cloudai/workloads/ucc_test/report_generation_strategy.pytests/workloads/megatron_run/test_report_gen_strategy.py
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/_core/report_generation_strategy.py`:
- Around line 33-34: The base implementation of get_metric currently returns 0.0
which falsely signals a successful metric; change get_metric(self, metric: str)
-> MetricValue to return the metric-error sentinel from the MetricValue contract
(i.e., return METRIC_ERROR or MetricValue.METRIC_ERROR depending on how the
sentinel is exposed) so inheritors propagate metric failures correctly instead
of emitting a bogus zero.
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 214-222: The None checks around rewards are redundant: replace the
block that initializes obs_replace (currently setting -1.0 then checking
self.rewards and self.rewards.metric_failure) with a direct assignment
obs_replace = self.rewards.metric_failure (since RewardOverrides and
metric_failure are always set via CloudAIGymEnv.__init__), and leave the
subsequent test_run.get_metric_value(...)/METRIC_ERROR handling unchanged.
In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 219-221: The reconstruction path in single_sbatch_runner.py is
bypassing CloudAIGymEnv.step(), so when unroll_dse() skips combinations and
handle_dse() writes a trajectory it calls gym.get_observation(...) and
gym.compute_reward(...) which ignores rewards.constraint_failure; change
handle_dse()/this reconstruction block to detect a failed constraint (the same
check used by CloudAIGymEnv.step()) and, when that failure is true or metrics
are missing, set the reward to rewards.constraint_failure instead of calling
compute_reward, or alternatively invoke CloudAIGymEnv.step() for that transition
so the existing constraint application in CloudAIGymEnv.step() is honored
(reference CloudAIGymEnv.step, get_observation, compute_reward, unroll_dse,
handle_dse, and rewards.constraint_failure).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 967377dc-aa26-4f34-9f60-a35cd632462d
📒 Files selected for processing (14)
doc/USER_GUIDE.rstsrc/cloudai/_core/report_generation_strategy.pysrc/cloudai/_core/test_scenario.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pysrc/cloudai/systems/slurm/single_sbatch_runner.pysrc/cloudai/workloads/aiconfig/report_generation_strategy.pysrc/cloudai/workloads/common/llm_serving.pysrc/cloudai/workloads/nemo_run/report_generation_strategy.pysrc/cloudai/workloads/nixl_bench/report_generation_strategy.pysrc/cloudai/workloads/nixl_ep/report_generation_strategy.pysrc/cloudai/workloads/ucc_test/report_generation_strategy.pytests/test_cloudaigym.pytests/workloads/megatron_run/test_report_gen_strategy.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cloudai/_core/report_generation_strategy.py (1)
33-33:⚠️ Potential issue | 🟡 MinorMark intentionally unused parameter to satisfy lint.
metricis unused on Line 33 (Ruff ARG002). Rename it to_metricto make intent explicit and keep lint clean.🔧 Suggested change
- def get_metric(self, metric: str) -> MetricValue: + def get_metric(self, _metric: str) -> MetricValue: return METRIC_ERROR🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/_core/report_generation_strategy.py` at line 33, Rename the unused parameter in the method get_metric (MetricValue get_metric) from metric to _metric to satisfy the Ruff ARG002 lint; update the function signature in report_generation_strategy.py to use _metric and adjust any internal references or overrides (if any) to match the new parameter name so callers and subclass implementations remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/cloudai/_core/report_generation_strategy.py`:
- Line 33: Rename the unused parameter in the method get_metric (MetricValue
get_metric) from metric to _metric to satisfy the Ruff ARG002 lint; update the
function signature in report_generation_strategy.py to use _metric and adjust
any internal references or overrides (if any) to match the new parameter name so
callers and subclass implementations remain consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 1ee7f536-967d-4bf3-b0b1-c0a75c0610f2
📒 Files selected for processing (2)
src/cloudai/_core/report_generation_strategy.pysrc/cloudai/configurator/cloudai_gym.py
Summary
Refines the agent config overrides for rewards. Originally only constraint check failures' reward was possible to override, now metric errors can be modified. Also, leaves room for future overrides.
New TOML flags:
Test Plan
Added new tests.
Keeps similar behavior as #865, but improves the interface.