Skip to content

Revert "refactor(auth): centralize credential loading and Docker registry login"#107

Merged
gargrahul merged 1 commit intomainfrom
revert-106-coketaste/v2-review
Apr 21, 2026
Merged

Revert "refactor(auth): centralize credential loading and Docker registry login"#107
gargrahul merged 1 commit intomainfrom
revert-106-coketaste/v2-review

Conversation

@gargrahul
Copy link
Copy Markdown
Collaborator

Reverts #106

For ROCM-23419

Copilot AI review requested due to automatic review settings April 21, 2026 05:16
@gargrahul gargrahul merged commit 5e8781f into main Apr 21, 2026
2 checks passed
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR primarily reverts PR #106 by removing the centralized auth module and reintroducing distributed credential-loading / registry-login logic in orchestrators and execution components. Alongside that revert, it also includes a wide set of behavioral changes across error types, Docker command construction, deployment status handling, and reporting utilities.

Changes:

  • Remove madengine.core.auth (and its unit tests) and re-add credential loading / registry login logic directly in orchestrators, DockerBuilder, and ContainerRunner.
  • Rename/alias several error types (NetworkErrorConnectionError, DeploymentTimeoutErrorTimeoutError, and RuntimeError alias) and update unit/integration tests accordingly.
  • Adjust multiple operational behaviors (Docker arg quoting, SLURM completion fallback behavior, perf.csv update logic, session tracking).

Reviewed changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated 34 comments.

Show a summary per file
File Description
tests/unit/test_error_handling.py Updates tests for renamed error classes and backward-compat alias.
tests/unit/test_deployment.py Removes cache-clearing hooks for is_rocprofv3_available.
tests/unit/test_auth.py Deletes tests for removed madengine.core.auth.
tests/integration/test_errors.py Updates integration tests to renamed/aliased error classes.
tests/integration/test_docker_integration.py Updates expected quoting in generated Docker build-args.
src/madengine/utils/session_tracker.py Changes row counting logic; adds marker loading + getter.
src/madengine/utils/rocm_tool_manager.py Adjusts fallback error logging/raising in command execution.
src/madengine/utils/ops.py Adds dict substring utilities.
src/madengine/utils/log_formatting.py Fixes truncation message text; adds log-section helper.
src/madengine/utils/gpu_validator.py Expands typing imports.
src/madengine/utils/gpu_config.py Replaces logger calls with print messaging.
src/madengine/utils/discover_models.py Changes MODEL_DIR copy command construction; switches to assert checks for CustomModel.
src/madengine/utils/config_parser.py Modifies directory-walk condition for locating configs.
src/madengine/scripts/common/tools/rocm_smi_utils.py Broadens exception handling in ROCm SMI init path.
src/madengine/scripts/common/tools/amd_smi_utils.py Broadens exception handling in AMD SMI power query path.
src/madengine/reporting/update_perf_csv.py Alters tag flattening type check; removes dropping of non-scalar fields before DataFrame creation.
src/madengine/reporting/csv_to_email.py Changes function return annotation (Optional→str) and typing imports.
src/madengine/orchestration/run_orchestrator.py Reintroduces inline credential loading; adds dockerfile-context filtering helper; adds handle_error import.
src/madengine/orchestration/build_orchestrator.py Reintroduces inline credential loading; changes config error handling (SystemExit); adds handle_error import.
src/madengine/execution/docker_builder.py Removes shared auth dependency; changes build-arg quoting; adds embedded registry login logic.
src/madengine/execution/container_runner.py Removes shared auth dependency; adds embedded registry login logic; changes env var quoting/logging.
src/madengine/deployment/slurm.py Changes “unknown status” handling to assume success on sacct issues.
src/madengine/deployment/kubernetes.py Updates imports/constants related to deployment common/secrets.
src/madengine/deployment/factory.py Silences Kubernetes ImportError warning behavior.
src/madengine/deployment/config_loader.py Adds os import.
src/madengine/deployment/common.py Removes lru_cache usage on is_rocprofv3_available.
src/madengine/deployment/base.py Removes UNKNOWN deployment status and adjusts perf.csv writing logic.
src/madengine/core/timeout.py Adds typing import.
src/madengine/core/errors.py Renames/aliases error classes; adds traceback/Rich Table import.
src/madengine/core/docker.py Changes default console arg + removes quoting around shell-invoked docker commands.
src/madengine/core/dataprovider.py Adds debug prints in data source checks.
src/madengine/core/context.py Adds GPU validator imports; adds stderr debug prints; changes get_numa_balancing type signature.
src/madengine/core/auth.py Removes centralized auth utilities module.
src/madengine/cli/validators.py Removes initialization of dockerfile_matched in batch-manifest processing.
src/madengine/cli/constants.py Changes ExitCode from IntEnum to plain class.
src/madengine/cli/commands/run.py Removes timeout==0None mapping while help text still claims “0 = no timeout”.
src/madengine/cli/app.py Hardcodes --version output rather than reading package metadata.
pyproject.toml Adds dependencies (setuptools-rust, pytest).
CLAUDE.md Removes repository development guidance doc.
Comments suppressed due to low confidence (2)

src/madengine/core/context.py:463

  • get_numa_balancing is annotated/ documented as returning bool, but it actually returns the raw stdout from cat (a string like '0'/'1') when the path exists. Either convert the command output to a boolean, or keep the return type as Union[str, bool] / str to match actual behavior.
    def get_numa_balancing(self) -> bool:
        """Get NUMA balancing.

        Returns:
            bool: The output of the shell command.

        Raises:
            RuntimeError: If the NUMA balancing is not enabled or disabled.

        Note:
            NUMA balancing is enabled if the output is '1', and disabled if the output is '0'.

            What is NUMA balancing?
            Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing,
            where the memory access time depends on the memory location relative to the processor.
        """
        # Check if NUMA balancing is enabled or disabled.
        path = "/proc/sys/kernel/numa_balancing"
        if os.path.exists(path):
            return self.console.sh("cat /proc/sys/kernel/numa_balancing || true")
        else:
            return False

src/madengine/execution/docker_builder.py:75

  • get_build_arg uses a mutable default (run_build_arg: typing.Dict = {}), which can leak state across calls/tests. Use None as the default and create a new dict inside the function instead.
    def get_build_arg(self, run_build_arg: typing.Dict = {}) -> str:
        """Get the build arguments.

        Args:
            run_build_arg: The run build arguments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -395,8 +395,6 @@ def process_batch_manifest_entries(

# If the model was not built (build_new=false), create an entry for it
if not build_new:
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dockerfile_matched was previously initialized before the try, but it’s still referenced later in the except fallback path (synthetic_image_name = f"ci-{model_name}_{dockerfile_matched}"). If an exception occurs before dockerfile_matched is assigned, this will raise UnboundLocalError and skip creating the minimal entry. Reintroduce a safe initialization (e.g., dockerfile_matched = 'unknown') before the try.

Suggested change
if not build_new:
if not build_new:
# Default value used by the fallback path if discovery fails early.
dockerfile_matched = "unknown"

Copilot uses AI. Check for mistakes.
Comment on lines 86 to +127
@@ -120,6 +122,10 @@ def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwarg
)


# Backward compatibility alias
RuntimeError = ExecutionError

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining ConnectionError, TimeoutError, and especially RuntimeError = ExecutionError in this module collides with Python’s built-in exception names. This can lead to very confusing except RuntimeError/TimeoutError/ConnectionError behavior once callers import from madengine.core.errors. Consider using distinct names (e.g., MadEngineConnectionError) and providing backward-compatibility aliases with non-builtin names (or alias the old names like NetworkError).

Copilot uses AI. Check for mistakes.
"""
# flatten tags to a string, if tags is a list.
if isinstance(perf_entry["tags"], list):
if type(perf_entry["tags"]) == list:
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flatten_tags uses type(x) == list instead of isinstance(x, list), which will miss list subclasses (and is inconsistent with the rest of the codebase’s type checks). Prefer isinstance(perf_entry.get('tags'), list).

Suggested change
if type(perf_entry["tags"]) == list:
if isinstance(perf_entry.get("tags"), list):

Copilot uses AI. Check for mistakes.
Comment thread src/madengine/cli/app.py
Comment on lines 57 to 62
if version:
try:
_version = pkg_version("madengine")
except PackageNotFoundError:
_version = "unknown"
# You might want to get the actual version from your package
console.print(
f"🚀 [bold cyan]madengine[/bold cyan] version [green]{_version}[/green]"
"🚀 [bold cyan]madengine[/bold cyan] version [green]2.0.0[/green]"
)
raise typer.Exit()
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--version output is now hardcoded to 2.0.0, which will become incorrect as soon as the package version changes and breaks automation that relies on accurate version reporting. Consider restoring runtime version lookup via importlib.metadata.version('madengine') (with a fallback when not installed).

Copilot uses AI. Check for mistakes.
Comment on lines +995 to +1006
def _load_credentials(self) -> Optional[Dict]:
"""Load credentials from credential.json and environment."""
credentials = None

credential_file = "credential.json"
if os.path.exists(credential_file):
try:
with open(credential_file) as f:
credentials = json.load(f)
except Exception as e:
print(f"Warning: Could not load credentials: {e}")

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new _load_credentials catches all exceptions and only prints a warning. This diverges from the structured error handling used elsewhere (handle_error/ConfigurationError) and can lead to silent misconfiguration (e.g., malformed JSON) later in the run. Consider using handle_error with a ConfigurationError (or at least include actionable suggestions) instead of a bare print.

Copilot uses AI. Check for mistakes.
Comment on lines 192 to 197
AssertionError: If the number of columns in the performance csv DataFrame is not equal
"""
single_result_json = read_json(single_result)
# Remove non-scalar fields that are not perf.csv columns (e.g. configs list).
# See handle_exception_result for rationale.
single_result_json.pop("configs", None)
perf_entry_dict_to_csv(single_result_json)
single_result_df = pd.DataFrame(single_result_json, index=[0])
if perf_csv_df.empty:
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_single_result / handle_exception_result now build a one-row DataFrame directly from the JSON dict. If the result contains non-scalar fields like configs (often a list of dicts from ConfigParser.parse_and_load()), pd.DataFrame(dict_with_list, index=[0]) can raise ValueError (length mismatch) or produce an invalid row. Consider dropping/serializing non-scalar fields (e.g., configs) before constructing the DataFrame, as the previous code did.

Copilot uses AI. Check for mistakes.
Comment on lines 1030 to 1036
except Exception as e:
self.console.print(
f"[yellow]Warning: Exception checking job {job_id} status: {e}. "
f"Status cannot be verified.[/yellow]"
)
self.console.print(f"[dim yellow]Warning: Exception checking job {job_id}: {e}[/dim yellow]")
return DeploymentResult(
status=DeploymentStatus.UNKNOWN,
status=DeploymentStatus.SUCCESS,
deployment_id=job_id,
message=f"Job {job_id} status unknown: {e}",
message=f"Job {job_id} completed (status unavailable)",
)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching all exceptions here and returning DeploymentStatus.SUCCESS has the same risk as the non-zero sacct fallback: failures in status checks will be reported as success. Return an explicit unknown/failed status (or rethrow) so callers can react appropriately.

Copilot uses AI. Check for mistakes.
Comment on lines +398 to 403
# Debug: log to stderr so SLURM node .err captures where we are if killed
print(f"[DEBUG] get_gpu_vendor: trying amd-smi at {amd_smi_path}", file=sys.stderr, flush=True)
# Verify amd-smi actually works (180s timeout for slow GPU initialization)
result = self.console.sh(f"{amd_smi_path} list > /dev/null 2>&1 && echo 'AMD' || echo ''", timeout=180)
print(f"[DEBUG] get_gpu_vendor: amd-smi returned", file=sys.stderr, flush=True)
if result and result.strip() == "AMD":
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_gpu_vendor now prints debug messages to stderr unconditionally. This will add noisy output in normal operation and can interfere with log parsing. Consider guarding these prints behind a debug/verbose flag or using the project’s logging facilities with an appropriate log level.

Copilot uses AI. Check for mistakes.
Comment on lines 93 to 97
# add envVars
if envVars is not None:
for evar in envVars.keys():
command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "
command += "-e " + evar + "=" + envVars[evar] + " "

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment variable values are appended to the docker run command without any quoting/escaping (... -e KEY=VALUE ...). Values containing spaces/quotes will break the command, and unescaped interpolation combined with shell=True is injection-prone. Quote values (e.g., shlex.quote(str(...))) or build the docker invocation as an argument list.

Copilot uses AI. Check for mistakes.
Comment on lines +1022 to 1028
# Fallback - assume completed
self.console.print(f"[dim yellow]Warning: Could not get status for job {job_id}, assuming success[/dim yellow]")
return DeploymentResult(
status=DeploymentStatus.UNKNOWN,
status=DeploymentStatus.SUCCESS,
deployment_id=job_id,
message=f"Job {job_id} status unknown: sacct exited with code {result.returncode}",
message=f"Job {job_id} completed (assumed)",
)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When sacct fails (non-zero return code), _check_job_completion now assumes job success and returns DeploymentStatus.SUCCESS. This can silently mark failed or cancelled jobs as successful and lead to incorrect reporting/automation decisions. Prefer returning a distinct “unknown” state (or treat as failure) and surface the inability to verify status in the result message.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants