Conversation
…stry log…" This reverts commit f067dac.
There was a problem hiding this comment.
Pull request overview
This PR primarily reverts PR #106 by removing the centralized auth module and reintroducing distributed credential-loading / registry-login logic in orchestrators and execution components. Alongside that revert, it also includes a wide set of behavioral changes across error types, Docker command construction, deployment status handling, and reporting utilities.
Changes:
- Remove
madengine.core.auth(and its unit tests) and re-add credential loading / registry login logic directly in orchestrators,DockerBuilder, andContainerRunner. - Rename/alias several error types (
NetworkError→ConnectionError,DeploymentTimeoutError→TimeoutError, andRuntimeErroralias) and update unit/integration tests accordingly. - Adjust multiple operational behaviors (Docker arg quoting, SLURM completion fallback behavior, perf.csv update logic, session tracking).
Reviewed changes
Copilot reviewed 39 out of 39 changed files in this pull request and generated 34 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_error_handling.py | Updates tests for renamed error classes and backward-compat alias. |
| tests/unit/test_deployment.py | Removes cache-clearing hooks for is_rocprofv3_available. |
| tests/unit/test_auth.py | Deletes tests for removed madengine.core.auth. |
| tests/integration/test_errors.py | Updates integration tests to renamed/aliased error classes. |
| tests/integration/test_docker_integration.py | Updates expected quoting in generated Docker build-args. |
| src/madengine/utils/session_tracker.py | Changes row counting logic; adds marker loading + getter. |
| src/madengine/utils/rocm_tool_manager.py | Adjusts fallback error logging/raising in command execution. |
| src/madengine/utils/ops.py | Adds dict substring utilities. |
| src/madengine/utils/log_formatting.py | Fixes truncation message text; adds log-section helper. |
| src/madengine/utils/gpu_validator.py | Expands typing imports. |
| src/madengine/utils/gpu_config.py | Replaces logger calls with print messaging. |
| src/madengine/utils/discover_models.py | Changes MODEL_DIR copy command construction; switches to assert checks for CustomModel. |
| src/madengine/utils/config_parser.py | Modifies directory-walk condition for locating configs. |
| src/madengine/scripts/common/tools/rocm_smi_utils.py | Broadens exception handling in ROCm SMI init path. |
| src/madengine/scripts/common/tools/amd_smi_utils.py | Broadens exception handling in AMD SMI power query path. |
| src/madengine/reporting/update_perf_csv.py | Alters tag flattening type check; removes dropping of non-scalar fields before DataFrame creation. |
| src/madengine/reporting/csv_to_email.py | Changes function return annotation (Optional→str) and typing imports. |
| src/madengine/orchestration/run_orchestrator.py | Reintroduces inline credential loading; adds dockerfile-context filtering helper; adds handle_error import. |
| src/madengine/orchestration/build_orchestrator.py | Reintroduces inline credential loading; changes config error handling (SystemExit); adds handle_error import. |
| src/madengine/execution/docker_builder.py | Removes shared auth dependency; changes build-arg quoting; adds embedded registry login logic. |
| src/madengine/execution/container_runner.py | Removes shared auth dependency; adds embedded registry login logic; changes env var quoting/logging. |
| src/madengine/deployment/slurm.py | Changes “unknown status” handling to assume success on sacct issues. |
| src/madengine/deployment/kubernetes.py | Updates imports/constants related to deployment common/secrets. |
| src/madengine/deployment/factory.py | Silences Kubernetes ImportError warning behavior. |
| src/madengine/deployment/config_loader.py | Adds os import. |
| src/madengine/deployment/common.py | Removes lru_cache usage on is_rocprofv3_available. |
| src/madengine/deployment/base.py | Removes UNKNOWN deployment status and adjusts perf.csv writing logic. |
| src/madengine/core/timeout.py | Adds typing import. |
| src/madengine/core/errors.py | Renames/aliases error classes; adds traceback/Rich Table import. |
| src/madengine/core/docker.py | Changes default console arg + removes quoting around shell-invoked docker commands. |
| src/madengine/core/dataprovider.py | Adds debug prints in data source checks. |
| src/madengine/core/context.py | Adds GPU validator imports; adds stderr debug prints; changes get_numa_balancing type signature. |
| src/madengine/core/auth.py | Removes centralized auth utilities module. |
| src/madengine/cli/validators.py | Removes initialization of dockerfile_matched in batch-manifest processing. |
| src/madengine/cli/constants.py | Changes ExitCode from IntEnum to plain class. |
| src/madengine/cli/commands/run.py | Removes timeout==0 → None mapping while help text still claims “0 = no timeout”. |
| src/madengine/cli/app.py | Hardcodes --version output rather than reading package metadata. |
| pyproject.toml | Adds dependencies (setuptools-rust, pytest). |
| CLAUDE.md | Removes repository development guidance doc. |
Comments suppressed due to low confidence (2)
src/madengine/core/context.py:463
get_numa_balancingis annotated/ documented as returningbool, but it actually returns the raw stdout fromcat(a string like'0'/'1') when the path exists. Either convert the command output to a boolean, or keep the return type asUnion[str, bool]/strto match actual behavior.
def get_numa_balancing(self) -> bool:
"""Get NUMA balancing.
Returns:
bool: The output of the shell command.
Raises:
RuntimeError: If the NUMA balancing is not enabled or disabled.
Note:
NUMA balancing is enabled if the output is '1', and disabled if the output is '0'.
What is NUMA balancing?
Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing,
where the memory access time depends on the memory location relative to the processor.
"""
# Check if NUMA balancing is enabled or disabled.
path = "/proc/sys/kernel/numa_balancing"
if os.path.exists(path):
return self.console.sh("cat /proc/sys/kernel/numa_balancing || true")
else:
return False
src/madengine/execution/docker_builder.py:75
get_build_arguses a mutable default (run_build_arg: typing.Dict = {}), which can leak state across calls/tests. UseNoneas the default and create a new dict inside the function instead.
def get_build_arg(self, run_build_arg: typing.Dict = {}) -> str:
"""Get the build arguments.
Args:
run_build_arg: The run build arguments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -395,8 +395,6 @@ def process_batch_manifest_entries( | |||
|
|
|||
| # If the model was not built (build_new=false), create an entry for it | |||
| if not build_new: | |||
There was a problem hiding this comment.
dockerfile_matched was previously initialized before the try, but it’s still referenced later in the except fallback path (synthetic_image_name = f"ci-{model_name}_{dockerfile_matched}"). If an exception occurs before dockerfile_matched is assigned, this will raise UnboundLocalError and skip creating the minimal entry. Reintroduce a safe initialization (e.g., dockerfile_matched = 'unknown') before the try.
| if not build_new: | |
| if not build_new: | |
| # Default value used by the fallback path if discovery fails early. | |
| dockerfile_matched = "unknown" |
| @@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg | |||
| ) | |||
|
|
|||
|
|
|||
| # Backward compatibility alias | |||
| RuntimeError = ExecutionError | |||
|
|
|||
There was a problem hiding this comment.
Defining ConnectionError, TimeoutError, and especially RuntimeError = ExecutionError in this module collides with Python’s built-in exception names. This can lead to very confusing except RuntimeError/TimeoutError/ConnectionError behavior once callers import from madengine.core.errors. Consider using distinct names (e.g., MadEngineConnectionError) and providing backward-compatibility aliases with non-builtin names (or alias the old names like NetworkError).
| """ | ||
| # flatten tags to a string, if tags is a list. | ||
| if isinstance(perf_entry["tags"], list): | ||
| if type(perf_entry["tags"]) == list: |
There was a problem hiding this comment.
flatten_tags uses type(x) == list instead of isinstance(x, list), which will miss list subclasses (and is inconsistent with the rest of the codebase’s type checks). Prefer isinstance(perf_entry.get('tags'), list).
| if type(perf_entry["tags"]) == list: | |
| if isinstance(perf_entry.get("tags"), list): |
| if version: | ||
| try: | ||
| _version = pkg_version("madengine") | ||
| except PackageNotFoundError: | ||
| _version = "unknown" | ||
| # You might want to get the actual version from your package | ||
| console.print( | ||
| f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]" | ||
| "🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]" | ||
| ) | ||
| raise typer.Exit() |
There was a problem hiding this comment.
--version output is now hardcoded to 2.0.0, which will become incorrect as soon as the package version changes and breaks automation that relies on accurate version reporting. Consider restoring runtime version lookup via importlib.metadata.version('madengine') (with a fallback when not installed).
| def _load_credentials(self) -> Optional[Dict]: | ||
| """Load credentials from credential.json and environment.""" | ||
| credentials = None | ||
|
|
||
| credential_file = "credential.json" | ||
| if os.path.exists(credential_file): | ||
| try: | ||
| with open(credential_file) as f: | ||
| credentials = json.load(f) | ||
| except Exception as e: | ||
| print(f"Warning: Could not load credentials: {e}") | ||
|
|
There was a problem hiding this comment.
The new _load_credentials catches all exceptions and only prints a warning. This diverges from the structured error handling used elsewhere (handle_error/ConfigurationError) and can lead to silent misconfiguration (e.g., malformed JSON) later in the run. Consider using handle_error with a ConfigurationError (or at least include actionable suggestions) instead of a bare print.
| AssertionError: If the number of columns in the performance csv DataFrame is not equal | ||
| """ | ||
| single_result_json = read_json(single_result) | ||
| # Remove non-scalar fields that are not perf.csv columns (e.g. configs list). | ||
| # See handle_exception_result for rationale. | ||
| single_result_json.pop("configs", None) | ||
| perf_entry_dict_to_csv(single_result_json) | ||
| single_result_df = pd.DataFrame(single_result_json, index=[0]) | ||
| if perf_csv_df.empty: |
There was a problem hiding this comment.
handle_single_result / handle_exception_result now build a one-row DataFrame directly from the JSON dict. If the result contains non-scalar fields like configs (often a list of dicts from ConfigParser.parse_and_load()), pd.DataFrame(dict_with_list, index=[0]) can raise ValueError (length mismatch) or produce an invalid row. Consider dropping/serializing non-scalar fields (e.g., configs) before constructing the DataFrame, as the previous code did.
| except Exception as e: | ||
| self.console.print( | ||
| f"[yellow]Warning: Exception checking job {job_id} status: {e}. " | ||
| f"Status cannot be verified.[/yellow]" | ||
| ) | ||
| self.console.print(f"[dim yellow]Warning: Exception checking job {job_id}: {e}[/dim yellow]") | ||
| return DeploymentResult( | ||
| status=DeploymentStatus.UNKNOWN, | ||
| status=DeploymentStatus.SUCCESS, | ||
| deployment_id=job_id, | ||
| message=f"Job {job_id} status unknown: {e}", | ||
| message=f"Job {job_id} completed (status unavailable)", | ||
| ) |
There was a problem hiding this comment.
Catching all exceptions here and returning DeploymentStatus.SUCCESS has the same risk as the non-zero sacct fallback: failures in status checks will be reported as success. Return an explicit unknown/failed status (or rethrow) so callers can react appropriately.
| # Debug: log to stderr so SLURM node .err captures where we are if killed | ||
| print(f"[DEBUG] get_gpu_vendor: trying amd-smi at {amd_smi_path}", file=sys.stderr, flush=True) | ||
| # Verify amd-smi actually works (180s timeout for slow GPU initialization) | ||
| result = self.console.sh(f"{amd_smi_path} list > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180) | ||
| print(f"[DEBUG] get_gpu_vendor: amd-smi returned", file=sys.stderr, flush=True) | ||
| if result and result.strip() == "AMD": |
There was a problem hiding this comment.
get_gpu_vendor now prints debug messages to stderr unconditionally. This will add noisy output in normal operation and can interfere with log parsing. Consider guarding these prints behind a debug/verbose flag or using the project’s logging facilities with an appropriate log level.
| # add envVars | ||
| if envVars is not None: | ||
| for evar in envVars.keys(): | ||
| command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " " | ||
| command += "-e " + evar + "=" + envVars[evar] + " " | ||
|
|
There was a problem hiding this comment.
Environment variable values are appended to the docker run command without any quoting/escaping (... -e KEY=VALUE ...). Values containing spaces/quotes will break the command, and unescaped interpolation combined with shell=True is injection-prone. Quote values (e.g., shlex.quote(str(...))) or build the docker invocation as an argument list.
| # Fallback - assume completed | ||
| self.console.print(f"[dim yellow]Warning: Could not get status for job {job_id}, assuming success[/dim yellow]") | ||
| return DeploymentResult( | ||
| status=DeploymentStatus.UNKNOWN, | ||
| status=DeploymentStatus.SUCCESS, | ||
| deployment_id=job_id, | ||
| message=f"Job {job_id} status unknown: sacct exited with code {result.returncode}", | ||
| message=f"Job {job_id} completed (assumed)", | ||
| ) |
There was a problem hiding this comment.
When sacct fails (non-zero return code), _check_job_completion now assumes job success and returns DeploymentStatus.SUCCESS. This can silently mark failed or cancelled jobs as successful and lead to incorrect reporting/automation decisions. Prefer returning a distinct “unknown” state (or treat as failure) and surface the inability to verify status in the result message.
Reverts #106
For ROCM-23419