Revert "refactor(auth): centralize credential loading and Docker registry login" by gargrahul · Pull Request #107 · ROCm/madengine

gargrahul · 2026-04-21T05:16:32Z

Reverts #106

For ROCM-23419

…stry log…" This reverts commit f067dac.

Copilot

Pull request overview

This PR primarily reverts PR #106 by removing the centralized auth module and reintroducing distributed credential-loading / registry-login logic in orchestrators and execution components. Alongside that revert, it also includes a wide set of behavioral changes across error types, Docker command construction, deployment status handling, and reporting utilities.

Changes:

Remove madengine.core.auth (and its unit tests) and re-add credential loading / registry login logic directly in orchestrators, DockerBuilder, and ContainerRunner.
Rename/alias several error types (NetworkError→ConnectionError, DeploymentTimeoutError→TimeoutError, and RuntimeError alias) and update unit/integration tests accordingly.
Adjust multiple operational behaviors (Docker arg quoting, SLURM completion fallback behavior, perf.csv update logic, session tracking).

Reviewed changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated 34 comments.

Show a summary per file

File	Description
tests/unit/test_error_handling.py	Updates tests for renamed error classes and backward-compat alias.
tests/unit/test_deployment.py	Removes cache-clearing hooks for `is_rocprofv3_available`.
tests/unit/test_auth.py	Deletes tests for removed `madengine.core.auth`.
tests/integration/test_errors.py	Updates integration tests to renamed/aliased error classes.
tests/integration/test_docker_integration.py	Updates expected quoting in generated Docker build-args.
src/madengine/utils/session_tracker.py	Changes row counting logic; adds marker loading + getter.
src/madengine/utils/rocm_tool_manager.py	Adjusts fallback error logging/raising in command execution.
src/madengine/utils/ops.py	Adds dict substring utilities.
src/madengine/utils/log_formatting.py	Fixes truncation message text; adds log-section helper.
src/madengine/utils/gpu_validator.py	Expands typing imports.
src/madengine/utils/gpu_config.py	Replaces logger calls with `print` messaging.
src/madengine/utils/discover_models.py	Changes MODEL_DIR copy command construction; switches to `assert` checks for CustomModel.
src/madengine/utils/config_parser.py	Modifies directory-walk condition for locating configs.
src/madengine/scripts/common/tools/rocm_smi_utils.py	Broadens exception handling in ROCm SMI init path.
src/madengine/scripts/common/tools/amd_smi_utils.py	Broadens exception handling in AMD SMI power query path.
src/madengine/reporting/update_perf_csv.py	Alters tag flattening type check; removes dropping of non-scalar fields before DataFrame creation.
src/madengine/reporting/csv_to_email.py	Changes function return annotation (Optional→str) and typing imports.
src/madengine/orchestration/run_orchestrator.py	Reintroduces inline credential loading; adds dockerfile-context filtering helper; adds `handle_error` import.
src/madengine/orchestration/build_orchestrator.py	Reintroduces inline credential loading; changes config error handling (SystemExit); adds `handle_error` import.
src/madengine/execution/docker_builder.py	Removes shared auth dependency; changes build-arg quoting; adds embedded registry login logic.
src/madengine/execution/container_runner.py	Removes shared auth dependency; adds embedded registry login logic; changes env var quoting/logging.
src/madengine/deployment/slurm.py	Changes “unknown status” handling to assume success on `sacct` issues.
src/madengine/deployment/kubernetes.py	Updates imports/constants related to deployment common/secrets.
src/madengine/deployment/factory.py	Silences Kubernetes ImportError warning behavior.
src/madengine/deployment/config_loader.py	Adds `os` import.
src/madengine/deployment/common.py	Removes `lru_cache` usage on `is_rocprofv3_available`.
src/madengine/deployment/base.py	Removes `UNKNOWN` deployment status and adjusts perf.csv writing logic.
src/madengine/core/timeout.py	Adds typing import.
src/madengine/core/errors.py	Renames/aliases error classes; adds `traceback`/Rich `Table` import.
src/madengine/core/docker.py	Changes default console arg + removes quoting around shell-invoked docker commands.
src/madengine/core/dataprovider.py	Adds debug prints in data source checks.
src/madengine/core/context.py	Adds GPU validator imports; adds stderr debug prints; changes `get_numa_balancing` type signature.
src/madengine/core/auth.py	Removes centralized auth utilities module.
src/madengine/cli/validators.py	Removes initialization of `dockerfile_matched` in batch-manifest processing.
src/madengine/cli/constants.py	Changes `ExitCode` from `IntEnum` to plain class.
src/madengine/cli/commands/run.py	Removes `timeout==0` → `None` mapping while help text still claims “0 = no timeout”.
src/madengine/cli/app.py	Hardcodes `--version` output rather than reading package metadata.
pyproject.toml	Adds dependencies (`setuptools-rust`, `pytest`).
CLAUDE.md	Removes repository development guidance doc.

Comments suppressed due to low confidence (2)

src/madengine/core/context.py:463

get_numa_balancing is annotated/ documented as returning bool, but it actually returns the raw stdout from cat (a string like '0'/'1') when the path exists. Either convert the command output to a boolean, or keep the return type as Union[str, bool] / str to match actual behavior.

    def get_numa_balancing(self) -> bool:
        """Get NUMA balancing.

        Returns:
            bool: The output of the shell command.

        Raises:
            RuntimeError: If the NUMA balancing is not enabled or disabled.

        Note:
            NUMA balancing is enabled if the output is '1', and disabled if the output is '0'.

            What is NUMA balancing?
            Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing,
            where the memory access time depends on the memory location relative to the processor.
        """
        # Check if NUMA balancing is enabled or disabled.
        path = "/proc/sys/kernel/numa_balancing"
        if os.path.exists(path):
            return self.console.sh("cat /proc/sys/kernel/numa_balancing || true")
        else:
            return False

src/madengine/execution/docker_builder.py:75

get_build_arg uses a mutable default (run_build_arg: typing.Dict = {}), which can leak state across calls/tests. Use None as the default and create a new dict inside the function instead.

    def get_build_arg(self, run_build_arg: typing.Dict = {}) -> str:
        """Get the build arguments.

        Args:
            run_build_arg: The run build arguments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-21T05:28:15Z

@@ -395,8 +395,6 @@ def process_batch_manifest_entries(

        # If the model was not built (build_new=false), create an entry for it
        if not build_new:


dockerfile_matched was previously initialized before the try, but it’s still referenced later in the except fallback path (synthetic_image_name = f"ci-{model_name}_{dockerfile_matched}"). If an exception occurs before dockerfile_matched is assigned, this will raise UnboundLocalError and skip creating the minimal entry. Reintroduce a safe initialization (e.g., dockerfile_matched = 'unknown') before the try.

Suggested change

if not build_new:

if not build_new:

# Default value used by the fallback path if discovery fails early.

dockerfile_matched = "unknown"

Copilot · 2026-04-21T05:28:16Z

@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
        )


+# Backward compatibility alias
+RuntimeError = ExecutionError
+


Defining ConnectionError, TimeoutError, and especially RuntimeError = ExecutionError in this module collides with Python’s built-in exception names. This can lead to very confusing except RuntimeError/TimeoutError/ConnectionError behavior once callers import from madengine.core.errors. Consider using distinct names (e.g., MadEngineConnectionError) and providing backward-compatibility aliases with non-builtin names (or alias the old names like NetworkError).

Copilot · 2026-04-21T05:28:16Z

    """
    # flatten tags to a string, if tags is a list.
-    if isinstance(perf_entry["tags"], list):
+    if type(perf_entry["tags"]) == list:


flatten_tags uses type(x) == list instead of isinstance(x, list), which will miss list subclasses (and is inconsistent with the rest of the codebase’s type checks). Prefer isinstance(perf_entry.get('tags'), list).

Suggested change

if type(perf_entry["tags"]) == list:

if isinstance(perf_entry.get("tags"), list):

Copilot · 2026-04-21T05:28:16Z

    if version:
-        try:
-            _version = pkg_version("madengine")
-        except PackageNotFoundError:
-            _version = "unknown"
+        # You might want to get the actual version from your package
        console.print(
-            f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
+            "🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
        )
        raise typer.Exit()


--version output is now hardcoded to 2.0.0, which will become incorrect as soon as the package version changes and breaks automation that relies on accurate version reporting. Consider restoring runtime version lookup via importlib.metadata.version('madengine') (with a fallback when not installed).

Copilot · 2026-04-21T05:28:17Z

+    def _load_credentials(self) -> Optional[Dict]:
+        """Load credentials from credential.json and environment."""
+        credentials = None
+
+        credential_file = "credential.json"
+        if os.path.exists(credential_file):
+            try:
+                with open(credential_file) as f:
+                    credentials = json.load(f)
+            except Exception as e:
+                print(f"Warning: Could not load credentials: {e}")
+


The new _load_credentials catches all exceptions and only prints a warning. This diverges from the structured error handling used elsewhere (handle_error/ConfigurationError) and can lead to silent misconfiguration (e.g., malformed JSON) later in the run. Consider using handle_error with a ConfigurationError (or at least include actionable suggestions) instead of a bare print.

Copilot · 2026-04-21T05:28:23Z

        AssertionError: If the number of columns in the performance csv DataFrame is not equal
    """
    single_result_json = read_json(single_result)
-    # Remove non-scalar fields that are not perf.csv columns (e.g. configs list).
-    # See handle_exception_result for rationale.
-    single_result_json.pop("configs", None)
    perf_entry_dict_to_csv(single_result_json)
    single_result_df = pd.DataFrame(single_result_json, index=[0])
    if perf_csv_df.empty:


handle_single_result / handle_exception_result now build a one-row DataFrame directly from the JSON dict. If the result contains non-scalar fields like configs (often a list of dicts from ConfigParser.parse_and_load()), pd.DataFrame(dict_with_list, index=[0]) can raise ValueError (length mismatch) or produce an invalid row. Consider dropping/serializing non-scalar fields (e.g., configs) before constructing the DataFrame, as the previous code did.

Copilot · 2026-04-21T05:28:23Z

        except Exception as e:
-            self.console.print(
-                f"[yellow]Warning: Exception checking job {job_id} status: {e}. "
-                f"Status cannot be verified.[/yellow]"
-            )
+            self.console.print(f"[dim yellow]Warning: Exception checking job {job_id}: {e}[/dim yellow]")
            return DeploymentResult(
-                status=DeploymentStatus.UNKNOWN,
+                status=DeploymentStatus.SUCCESS,
                deployment_id=job_id,
-                message=f"Job {job_id} status unknown: {e}",
+                message=f"Job {job_id} completed (status unavailable)",
            )


Catching all exceptions here and returning DeploymentStatus.SUCCESS has the same risk as the non-zero sacct fallback: failures in status checks will be reported as success. Return an explicit unknown/failed status (or rethrow) so callers can react appropriately.

Copilot · 2026-04-21T05:28:23Z

+                    # Debug: log to stderr so SLURM node .err captures where we are if killed
+                    print(f"[DEBUG] get_gpu_vendor: trying amd-smi at {amd_smi_path}", file=sys.stderr, flush=True)
                    # Verify amd-smi actually works (180s timeout for slow GPU initialization)
                    result = self.console.sh(f"{amd_smi_path} list > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
+                    print(f"[DEBUG] get_gpu_vendor: amd-smi returned", file=sys.stderr, flush=True)
                    if result and result.strip() == "AMD":


get_gpu_vendor now prints debug messages to stderr unconditionally. This will add noisy output in normal operation and can interfere with log parsing. Consider guarding these prints behind a debug/verbose flag or using the project’s logging facilities with an appropriate log level.

Copilot · 2026-04-21T05:28:24Z

        # add envVars
        if envVars is not None:
            for evar in envVars.keys():
-                command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "
+                command += "-e " + evar + "=" + envVars[evar] + " "



Environment variable values are appended to the docker run command without any quoting/escaping (... -e KEY=VALUE ...). Values containing spaces/quotes will break the command, and unescaped interpolation combined with shell=True is injection-prone. Quote values (e.g., shlex.quote(str(...))) or build the docker invocation as an argument list.

Copilot · 2026-04-21T05:28:24Z

+            # Fallback - assume completed
+            self.console.print(f"[dim yellow]Warning: Could not get status for job {job_id}, assuming success[/dim yellow]")
            return DeploymentResult(
-                status=DeploymentStatus.UNKNOWN,
+                status=DeploymentStatus.SUCCESS,
                deployment_id=job_id,
-                message=f"Job {job_id} status unknown: sacct exited with code {result.returncode}",
+                message=f"Job {job_id} completed (assumed)",
            )


When sacct fails (non-zero return code), _check_job_completion now assumes job success and returns DeploymentStatus.SUCCESS. This can silently mark failed or cancelled jobs as successful and lead to incorrect reporting/automation decisions. Prefer returning a distinct “unknown” state (or treat as failure) and surface the inability to verify status in the result message.

Revert "refactor(auth): centralize credential loading and Docker regi…

f3eda0a

…stry log…" This reverts commit f067dac.

Copilot AI review requested due to automatic review settings April 21, 2026 05:16

gargrahul requested review from Cemberk, Rohan138, coketaste and leconcio as code owners April 21, 2026 05:16

gargrahul merged commit 5e8781f into main Apr 21, 2026
2 checks passed

Copilot started reviewing on behalf of gargrahul April 21, 2026 05:17 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

srinivamd mentioned this pull request Apr 23, 2026

fix(timeout): guard signal.alarm() against None; centralize auth; fix security issues #111

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "refactor(auth): centralize credential loading and Docker registry login"#107

Revert "refactor(auth): centralize credential loading and Docker registry login"#107
gargrahul merged 1 commit intomainfrom
revert-106-coketaste/v2-review

gargrahul commented Apr 21, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -395,8 +395,6 @@ def process_batch_manifest_entries(

		# If the model was not built (build_new=false), create an entry for it
		if not build_new:

	if type(perf_entry["tags"]) == list:
	if isinstance(perf_entry.get("tags"), list):

Conversation

gargrahul commented Apr 21, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants