Skip to content

Add azure.ai.training extension#8130

Merged
saanikaguptamicrosoft merged 74 commits into
Azure:mainfrom
saanikaguptamicrosoft:foundry-training-dev
May 21, 2026
Merged

Add azure.ai.training extension#8130
saanikaguptamicrosoft merged 74 commits into
Azure:mainfrom
saanikaguptamicrosoft:foundry-training-dev

Conversation

achauhan-scc and others added 28 commits March 18, 2026 12:31
* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands
…to show

- Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK)
- Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com)
- Rename 'job get' to 'job show' to match models/finetune extensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API
…rt (Azure#7203)

- Add --skip-token flag for pagination with next-page UX message
- Add --tag and --properties flags for server-side filtering
- Add --include-archived flag for listViewType control
- Add SystemData (createdBy, createdAt) to job list output
- Update doDataPlane() to support variadic query params

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… code, and input resolution (Azure#7205)

- Rename job create command to job submit for consistency with finetune extension
- Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver
- Add JobResolver orchestrator that resolves all references in JobDefinition
- Wire resolver into submit flow before buildJobResource()
- Stub implementations guide users to provide full ARM IDs / remote URIs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Custom training (Azure#7180)

* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API

* adding -e -s override

* fixing asset resolution

* custom training: enhance job show, fix asset resolution, add full resource config support

- Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info
- Added client APIs for run history, metrics, and artifacts endpoints
- Fixed dataset version field: json:dataType -> json:type
- Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount
- Added full resource config support: instanceType, shmSize, dockerArgs, properties
- Added ResourceDefinition YAML struct with AISuperComputer properties pass-through
- Backward compatible: flat instance_count still works

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* custom training: add spinner progress to job show command

Show animated spinner with progress text while fetching job details.
Updates text as each parallel fetch (run history, metrics, artifacts)
completes, showing remaining items until all data is loaded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 11, 2026 11:35
@JeffreyCA JeffreyCA requested a review from wbreza May 19, 2026 21:02
Copy link
Copy Markdown
Contributor

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review against the 16 new commits (3106077f1ab8f3)

Thanks for the rapid turnaround, @saanikaguptamicrosoft — a lot of prior feedback is cleanly resolved. Summary below covers (a) verification of prior findings and (b) new findings introduced by the delta.

✅ Prior findings — resolved

  • azcopy silent parse errors — fixed in runner.go; first-occurrence warnings via warnedUnparseable / warnedProgressFormat 👍
  • Zero-test critical paths (hash, upload_service)hash_test.go and upload_service_test.go added with strong coverage (determinism, order-independence, zombie recovery, hash collision, missing SAS, sentinel tagging) 👍
  • Empty PR description — now populated with related PRs and screenshots 👍
  • job get vs job show namingjob_get.go renamed to job_show.go 👍
  • proxyEndpointPattern allowing %%%% escape applied for OpenSSH ProxyCommand expansion 👍
  • Design/ placement — renamed to lowercase design/, README references updated 👍

⚠️ Prior findings — partial

  • ServiceEndpoint consolidation — new internal/utils/services.go ServiceEndpoint() helper exists and stream_service.go uses it, but internal/cmd/job_show_services.go:47 still calls its local extractServiceEndpointStr. One call site missed.
  • Untested paths — coverage added for hash/upload_service, but pkg/client/download.go still has no tests.

❌ Prior findings — not addressed

1. --debug flag still prints request URLs and bodies unredacted
pkg/client/client.go:91–103. SAS URIs in query strings and credential fields in GetDatasetCredentials / GetModelCredentials request bodies remain printable.
Fix: strip query string from the URL print; redact sasUri, Authorization, sig, signature in the body print; or gate behind a developer-only env var and document the leak risk.

2. SSH proxy stdin goroutine still untracked
internal/cmd/job_ssh_proxy.go:167–182. No wg.Add, no os.Stdin close on cancel; relies on process exit. Acceptable for the short-lived ProxyCommand model but wg.Wait() at the end of runProxy cannot guarantee clean shutdown.
Fix: at minimum, close os.Stdin from the cancellation watcher so the upstream Read() returns.

🟥 New findings from the delta

Critical — DeleteJob long-running-operation polling is single-shot
pkg/client/jobs.go (pollDeleteOperationOnce / DeleteJob). Issues a single GET against the LRO Location URL; on 202 Accepted returns DeleteJobInProgress rather than continuing to poll. This violates Azure LRO semantics, gives users the "delete accepted but unverified" UX, and creates a race window for concurrent delete attempts.
Fix: loop with bounded retries + backoff respecting Retry-After, or adopt runtime.NewPoller from azcore.

High — Artifact contentinfo fan-out drops per-root errors
internal/cmd/job_download.go ~lines 688–720. Per-root goroutines record rootResults[idx].err, but after wg.Wait() the results are consumed without surfacing the per-root error; a permission-denied or 404 on a single root prefix is silently dropped → partial downloads reported as success.
Fix: after wg.Wait(), aggregate errors with errors.Join and either fail the command or print a loud per-root warning.

High — Debug URL logging extended to APIM data-plane path
pkg/client/client.go getAbsoluteDataPlane() adds a new fmt.Fprintf(os.Stderr, "[DEBUG] GET %s\n", parsed.String()) site. Same class of leak as the unresolved prior finding above but extended.
Fix: strip query string before printing.

Medium — redactSAS may pass URLs through unredacted
internal/azcopy/runner.go ~lines 388–397. If ? is absent the input URL is returned unchanged. Defense-in-depth: also check for known SAS markers (sv=, sig=, se=) and redact any token-like substring.

Medium — mapServiceType() silently passes unknown types
internal/cmd/job_submit.go ~lines 1188–1203. Returns input unchanged on unknown service type, relying on ValidateJobOffline() always running. If validation is bypassed or drifts, server returns a generic 400 instead of a clear CLI error.
Fix: return an error on unknown service type — keep offline validation as the primary check, this as a backstop.

Medium — SSH ProxyCommand shell quoting
internal/cmd/job_connect_ssh.go ~lines 434–440. %-escape handles OpenSSH expansion, but the rendered ProxyCommand isn't single-quoted, so shell metacharacters in job names / endpoints ($, backticks, !) could be interpreted by the parent shell.
Fix: wrap the constructed proxy command in single quotes when emitting it.

Medium — Distribution Ray fields lack client-side validation
pkg/models/common.go Distribution + buildDistribution() ~lines 1208–1227. *int Port/DashboardPort and *bool IncludeDashboard pointer copies are correct, but out-of-range/negative ports aren't validated → cryptic server error.

Medium — Root-folder fan-out semaphore leak on context cancel
internal/cmd/job_download.go. sem <- struct{}{} isn't context-aware; cancellation while blocked on acquire leaks the goroutine.
Fix: select { case sem <- struct{}{}: ... case <-ctx.Done(): return ctx.Err() }.

Medium — azcopy end-of-job diagnostics truncation
internal/azcopy/runner.go printEndOfJobDiagnostics() caps to first 5 failed transfers + 16 KB stderr; may hide root cause on batch failures. Consider writing full diagnostics to a side-file when truncation kicks in.

✅ Cleanly verified

  • "Temp: Print API request response" (6d7f2d7) was fully reverted by 61c438f — no leftover prints.
  • go fix modernization commit (16cff27) is mechanical only — no behavioral changes.
  • Metrics path fix (305bdcc) consistent with APIM schema.
  • New tests (hash_test.go, upload_service_test.go) exercise the failure paths called out previously — exactly the right targets.

Posted by Copilot CLI on behalf of the reviewer.

Copy link
Copy Markdown
Contributor

@JeffreyCA JeffreyCA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above comments can be addressed in a follow-up PR before the first release I'm good with that too.

Would like to at least get release-ext-azure-ai-training.yml merged soon so @vhvb1989 can help with creating the pipelines.

@saanikaguptamicrosoft
Copy link
Copy Markdown
Collaborator Author

saanikaguptamicrosoft commented May 20, 2026

@wbreza

⚠️ Prior findings — partial

ServiceEndpoint consolidation — new internal/utils/services.go ServiceEndpoint() helper exists and stream_service.go uses it, but internal/cmd/job_show_services.go:47 still calls its local extractServiceEndpointStr. One call site missed.

This comment is no longer applicable. We don't have any function extractServiceEndpointStr post we moved to APIM APIs.

Untested paths — coverage added for hash/upload_service, but pkg/client/download.go still has no tests.

Acknowledged — tracking as follow-up. These are thin HTTP wrappers around the shared doDataPlane transport and are exercised end-to-end by model download, dataset download, and job download. Will add focused httptest coverage in a follow-up so this PR stays scoped.

❌ Prior findings — not addressed

--debug flag still prints request URLs and bodies unredacted

Done

SSH proxy stdin goroutine still untracked

Intentional and consistent with existing AML CLI extension (_ssh_connector.py uses the same pattern: a thread-pool os.read(0, ...) that's not awaited and is reaped by process exit). Closing os.Stdin from a watcher would not actually unblock the in-flight read(2) syscall on either Linux pipes or Windows console handles, so the suggested "minimum fix" doesn't deliver the determinism it implies — SSH terminating the ProxyCommand process is the cleanup mechanism by design.

🟥 New findings from the delta

Critical — DeleteJob long-running-operation polling is single-shot

Addressed. Also added a --no-wait flag for single peek.

High — Artifact contentinfo fan-out drops per-root errors

Done

High — Debug URL logging extended to APIM data-plane path

Done

Medium — redactSAS may pass URLs through unredacted

Done

Medium — mapServiceType() silently passes unknown types

Tested empirically — backend silently drops unknown service types (no 400), and the case is already caught upstream by ValidateJobOffline() (job_validator.go:154-180) which rejects them with a clear field-scoped error before submit. Keeping the mapper as a pure translator and the validator as the single source of truth for the allow-list (duplicating it here would invert the drift risk, not reduce it).

Medium — SSH ProxyCommand shell quoting

The only non-constant inputs to the ProxyCommand string are (a) selfPath from os.Executable() and (b) the proxy endpoint URL. The endpoint is already character-class-validated by job_connect_ssh.go to exclude all shell metacharacters ($, backticks, ;, |, &, <, >, (, ), ', ", spaces, etc.), and the job name doesn't appear in the ProxyCommand string at all (destination is the fixed alias azureml-job). The remaining input selfPath would require an attacker-controlled install path of the extension binary, which already implies arbitrary code execution.

Also, single-quoting the whole ProxyCommand value would break it — when ssh spawns it via /bin/sh -c, the shell would try to execute the single-quoted string as a single binary name. The endpoint validation is the right layer for this defense and is already in place.

Medium — Distribution Ray fields lack client-side validation

Addressed

Medium — Root-folder fan-out semaphore leak on context cancel

Addressed

Medium — azcopy end-of-job diagnostics truncation

Done — bumped the stderr buffer cap to 64 KB and added maybeWriteFullDiagnostics that dumps the full EndOfJob payload (as JSON) plus captured stderr to a temp file (%TEMP%/azd-azcopy-diag-*.log) whenever either the failed-transfers list or the stderr buffer was truncated for terminal display; the path is printed to stderr so users can inspect everything for batch failures.

@JeffreyCA JeffreyCA added the ext-training azure.ai.training extension label May 20, 2026
@vhvb1989
Copy link
Copy Markdown
Member

/check-enforcer evaluate

@saanikaguptamicrosoft saanikaguptamicrosoft merged commit f5b95f3 into Azure:main May 21, 2026
10 checks passed
Copilot AI pushed a commit that referenced this pull request May 21, 2026
* Custom training (#7125)

* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* feat: add job name auto-generation, fix endpoint URL, rename job get to show

- Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK)
- Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com)
- Rename 'job get' to 'job show' to match models/finetune extensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Custom training (#7180)

* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API

* feat: enhance job list with pagination, filters, and systemData support (#7203)

- Add --skip-token flag for pagination with next-page UX message
- Add --tag and --properties flags for server-side filtering
- Add --include-archived flag for listViewType control
- Add SystemData (createdBy, createdAt) to job list output
- Update doDataPlane() to support variadic query params

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: add CODEOWNERS for azure.ai.customtraining extension (#7204)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: rename job create to submit and add resolver layer for compute, code, and input resolution (#7205)

- Rename job create command to job submit for consistency with finetune extension
- Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver
- Add JobResolver orchestrator that resolves all references in JobDefinition
- Wire resolver into submit flow before buildJobResource()
- Stub implementations guide users to provide full ARM IDs / remote URIs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add artifact resolution for code and inputs in job create (#7153)

* Implement cancel job command for custom training (#7272)

* Impelement delete job command for custom training (#7273)

* Custom training clean (#7454)

* Custom training (#7180)

* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API

* adding -e -s override

* fixing asset resolution

* custom training: enhance job show, fix asset resolution, add full resource config support

- Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info
- Added client APIs for run history, metrics, and artifacts endpoints
- Fixed dataset version field: json:dataType -> json:type
- Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount
- Added full resource config support: instanceType, shmSize, dockerArgs, properties
- Added ResourceDefinition YAML struct with AISuperComputer properties pass-through
- Backward compatible: flat instance_count still works

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* custom training: add spinner progress to job show command

Show animated spinner with progress text while fetching job details.
Updates text as each parallel fetch (run history, metrics, artifacts)
completes, showing remaining items until all data is loaded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Implement download command for custom training (#7453)

* Add validate command for custom training (#7407)

* Rename EnvironmentID to EnvironmentImageReference (#7891)

* Revert "Implement download command for custom training (#7453)" (#7892)

This reverts commit 5216202.

* Add userAssignedIdentityId support for command jobs (#7927)

* Implement stream command for training job (#7939)

* Add support for experiment name (#7961)

* Implement connect-ssh to job node (#7964)

* Add support for gpuCount for partial SKU scenario (#8067)

* Share offline validation between job validate and job submit commands (#8068)

* fix(azure.ai.customtraining): honor --subscription and --project-endpoint flags over stored env values (#8093)

* Implement download command for training job (#8029)

* Rename azure.ai.customtraining to azure.ai.training (#8106)

* Implement show services command (#8121)

* Add validations for UAMI requirement (#8122)

* Add template flag in init command (#8123)

* Bump armcognitiveservices SDK from v1.8.0 to v2.0.0 to fix NetworkInjections unmarshal error

* Bump go directive to 1.26.1 to align with repo standard and other extensions

* Use Subscription.UserTenantId for credential tenant to support guest/multi-tenant users

* Pin azd module to semver v1.24.3 instead of pseudo-version for stable dependency

* Add UTs

* Consolidate doUpload and doUploadWithTag into single method with optional tags parameter

* Use PromptSubscriptionResource for interactive Foundry project selection in init

* Surface azcopy scanner errors so truncated stdout doesn't mask upload failures

* Add schema header and requiredAzdVersion >=1.25.1 to extension.yaml

* Cap error and service-instance response body reads to prevent unbounded memory use

* Add retry policy (429/502/503/504 + net errors) to Foundry data plane client

* Use azd Confirm prompt and honor --no-prompt in job delete

* Escape user-supplied IDs in client URL paths

* Fix JSON tag mismatch for DataType

* Route client debug prints to stderr to keep stdout JSON parseable

* Adopt azdext.NewExtensionRootCommand and remove reserved-flag conflicts

* chore(ext/azure.ai.training): add CHANGELOG, README, cspell, golangci config and CI lint + release pipeline

* chore(ext/azure.ai.training): extend cspell dictionary to fix CI lint

* chore(ext/azure.ai.training): apply go fix modernization (interface{}→any, CutPrefix, drop loopvar capture)

* chore(ext/azure.ai.training): fix golangci-lint issues

* fix(ext/azure.ai.training): surface azcopy failure diagnostics

* refactor(ext/azure.ai.training): consolidate ServiceEndpoint helper in internal/utils

* test(ext/azure.ai.training): add hash + upload_service unit tests

* refactor(ext/azure.ai.training): rename job_get.go to job_show.go

* chore(ext/azure.ai.training): rename Design/ to design/ and link from README

* chore(ext/azure.ai.training): address PR feedback (ssh ProxyCommand % escape, design/ rename)

* Fix cspell error

* fix: apply go fix modernizations

* Move to APIM APIs and update API paths as per latest Typespec

* Temp: Print API request response for testing

* Fix API paths for metrics

* fix(ai.training): use delete operation result url to surface accurate job delete outcome

* fix(ai.training): fan out artifact contentinfo per unique root folder for job download

* Revert "Temp: Print API request response for testing"

This reverts commit 6d7f2d7.

* Support attaching remaining service types (jupyter_lab, tensor_board, vs_code, custom) to job

* Support distribution type (pytorch, tensorflow, mpi, ray) in job YAML

* Add polling for job deletion | Add a --no-wait flag

* Add polling for job cancel | Add a --no-wait flag | Refactor to share same poller as delete

* fix: surface all per-root errors in artifact contentinfo fan-out

* fix: redact query string from --debug URL logs to avoid leaking SAS tokens

* fix: harden redactSAS with SAS-marker fallback for URLs without '?'

* fix: validate ray distribution port and dashboard_port ranges in job validator

* fix: make root-folder fan-out semaphore acquire context-aware in job download

* fix: write full azcopy diagnostics to side-file when terminal output is truncated

* fix: resolve cspell (skoid/sktid) and gosec G104 (LRO body close) lint failures

* fix: redact request body for /credentials endpoints under --debug

---------

Co-authored-by: Amit Chauhan <70937115+achauhan-scc@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: therealjohn <1501196+therealjohn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ext-training azure.ai.training extension

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add extension for job training

8 participants