Add azure.ai.training extension#8130
Conversation
* adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands
…to show
- Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK)
- Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com)
- Rename 'job get' to 'job show' to match models/finetune extensions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API
…rt (Azure#7203) - Add --skip-token flag for pagination with next-page UX message - Add --tag and --properties flags for server-side filtering - Add --include-archived flag for listViewType control - Add SystemData (createdBy, createdAt) to job list output - Update doDataPlane() to support variadic query params Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… code, and input resolution (Azure#7205) - Rename job create command to job submit for consistency with finetune extension - Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver - Add JobResolver orchestrator that resolves all references in JobDefinition - Wire resolver into submit flow before buildJobResource() - Stub implementations guide users to provide full ARM IDs / remote URIs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Custom training (Azure#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * adding -e -s override * fixing asset resolution * custom training: enhance job show, fix asset resolution, add full resource config support - Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info - Added client APIs for run history, metrics, and artifacts endpoints - Fixed dataset version field: json:dataType -> json:type - Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount - Added full resource config support: instanceType, shmSize, dockerArgs, properties - Added ResourceDefinition YAML struct with AISuperComputer properties pass-through - Backward compatible: flat instance_count still works Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * custom training: add spinner progress to job show command Show animated spinner with progress text while fetching job details. Updates text as each parallel fetch (run history, metrics, artifacts) completes, showing remaining items until all data is loaded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…zure#7892) This reverts commit 5216202.
…oint flags over stored env values (Azure#8093)
wbreza
left a comment
There was a problem hiding this comment.
Re-review against the 16 new commits (3106077 → f1ab8f3)
Thanks for the rapid turnaround, @saanikaguptamicrosoft — a lot of prior feedback is cleanly resolved. Summary below covers (a) verification of prior findings and (b) new findings introduced by the delta.
✅ Prior findings — resolved
- azcopy silent parse errors — fixed in
runner.go; first-occurrence warnings viawarnedUnparseable/warnedProgressFormat👍 - Zero-test critical paths (hash, upload_service) —
hash_test.goandupload_service_test.goadded with strong coverage (determinism, order-independence, zombie recovery, hash collision, missing SAS, sentinel tagging) 👍 - Empty PR description — now populated with related PRs and screenshots 👍
job getvsjob shownaming —job_get.gorenamed tojob_show.go👍proxyEndpointPatternallowing%—%→%%escape applied for OpenSSH ProxyCommand expansion 👍Design/placement — renamed to lowercasedesign/, README references updated 👍
⚠️ Prior findings — partial
- ServiceEndpoint consolidation — new
internal/utils/services.goServiceEndpoint()helper exists andstream_service.gouses it, butinternal/cmd/job_show_services.go:47still calls its localextractServiceEndpointStr. One call site missed. - Untested paths — coverage added for hash/upload_service, but
pkg/client/download.gostill has no tests.
❌ Prior findings — not addressed
1. --debug flag still prints request URLs and bodies unredacted
pkg/client/client.go:91–103. SAS URIs in query strings and credential fields in GetDatasetCredentials / GetModelCredentials request bodies remain printable.
Fix: strip query string from the URL print; redact sasUri, Authorization, sig, signature in the body print; or gate behind a developer-only env var and document the leak risk.
2. SSH proxy stdin goroutine still untracked
internal/cmd/job_ssh_proxy.go:167–182. No wg.Add, no os.Stdin close on cancel; relies on process exit. Acceptable for the short-lived ProxyCommand model but wg.Wait() at the end of runProxy cannot guarantee clean shutdown.
Fix: at minimum, close os.Stdin from the cancellation watcher so the upstream Read() returns.
🟥 New findings from the delta
Critical — DeleteJob long-running-operation polling is single-shot
pkg/client/jobs.go (pollDeleteOperationOnce / DeleteJob). Issues a single GET against the LRO Location URL; on 202 Accepted returns DeleteJobInProgress rather than continuing to poll. This violates Azure LRO semantics, gives users the "delete accepted but unverified" UX, and creates a race window for concurrent delete attempts.
Fix: loop with bounded retries + backoff respecting Retry-After, or adopt runtime.NewPoller from azcore.
High — Artifact contentinfo fan-out drops per-root errors
internal/cmd/job_download.go ~lines 688–720. Per-root goroutines record rootResults[idx].err, but after wg.Wait() the results are consumed without surfacing the per-root error; a permission-denied or 404 on a single root prefix is silently dropped → partial downloads reported as success.
Fix: after wg.Wait(), aggregate errors with errors.Join and either fail the command or print a loud per-root warning.
High — Debug URL logging extended to APIM data-plane path
pkg/client/client.go getAbsoluteDataPlane() adds a new fmt.Fprintf(os.Stderr, "[DEBUG] GET %s\n", parsed.String()) site. Same class of leak as the unresolved prior finding above but extended.
Fix: strip query string before printing.
Medium — redactSAS may pass URLs through unredacted
internal/azcopy/runner.go ~lines 388–397. If ? is absent the input URL is returned unchanged. Defense-in-depth: also check for known SAS markers (sv=, sig=, se=) and redact any token-like substring.
Medium — mapServiceType() silently passes unknown types
internal/cmd/job_submit.go ~lines 1188–1203. Returns input unchanged on unknown service type, relying on ValidateJobOffline() always running. If validation is bypassed or drifts, server returns a generic 400 instead of a clear CLI error.
Fix: return an error on unknown service type — keep offline validation as the primary check, this as a backstop.
Medium — SSH ProxyCommand shell quoting
internal/cmd/job_connect_ssh.go ~lines 434–440. %-escape handles OpenSSH expansion, but the rendered ProxyCommand isn't single-quoted, so shell metacharacters in job names / endpoints ($, backticks, !) could be interpreted by the parent shell.
Fix: wrap the constructed proxy command in single quotes when emitting it.
Medium — Distribution Ray fields lack client-side validation
pkg/models/common.go Distribution + buildDistribution() ~lines 1208–1227. *int Port/DashboardPort and *bool IncludeDashboard pointer copies are correct, but out-of-range/negative ports aren't validated → cryptic server error.
Medium — Root-folder fan-out semaphore leak on context cancel
internal/cmd/job_download.go. sem <- struct{}{} isn't context-aware; cancellation while blocked on acquire leaks the goroutine.
Fix: select { case sem <- struct{}{}: ... case <-ctx.Done(): return ctx.Err() }.
Medium — azcopy end-of-job diagnostics truncation
internal/azcopy/runner.go printEndOfJobDiagnostics() caps to first 5 failed transfers + 16 KB stderr; may hide root cause on batch failures. Consider writing full diagnostics to a side-file when truncation kicks in.
✅ Cleanly verified
- "Temp: Print API request response" (6d7f2d7) was fully reverted by 61c438f — no leftover prints.
go fixmodernization commit (16cff27) is mechanical only — no behavioral changes.- Metrics path fix (305bdcc) consistent with APIM schema.
- New tests (
hash_test.go,upload_service_test.go) exercise the failure paths called out previously — exactly the right targets.
Posted by Copilot CLI on behalf of the reviewer.
JeffreyCA
left a comment
There was a problem hiding this comment.
If the above comments can be addressed in a follow-up PR before the first release I'm good with that too.
Would like to at least get release-ext-azure-ai-training.yml merged soon so @vhvb1989 can help with creating the pipelines.
|
|
/check-enforcer evaluate |
* Custom training (#7125) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * feat: add job name auto-generation, fix endpoint URL, rename job get to show - Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK) - Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com) - Rename 'job get' to 'job show' to match models/finetune extensions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Custom training (#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * feat: enhance job list with pagination, filters, and systemData support (#7203) - Add --skip-token flag for pagination with next-page UX message - Add --tag and --properties flags for server-side filtering - Add --include-archived flag for listViewType control - Add SystemData (createdBy, createdAt) to job list output - Update doDataPlane() to support variadic query params Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: add CODEOWNERS for azure.ai.customtraining extension (#7204) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: rename job create to submit and add resolver layer for compute, code, and input resolution (#7205) - Rename job create command to job submit for consistency with finetune extension - Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver - Add JobResolver orchestrator that resolves all references in JobDefinition - Wire resolver into submit flow before buildJobResource() - Stub implementations guide users to provide full ARM IDs / remote URIs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add artifact resolution for code and inputs in job create (#7153) * Implement cancel job command for custom training (#7272) * Impelement delete job command for custom training (#7273) * Custom training clean (#7454) * Custom training (#7180) * adding design detaiils for command job CLI * adding more details * adding dedup details * adding api details * adding execution plan * adding draft version of custom training commands * integrating with API * adding -e -s override * fixing asset resolution * custom training: enhance job show, fix asset resolution, add full resource config support - Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info - Added client APIs for run history, metrics, and artifacts endpoints - Fixed dataset version field: json:dataType -> json:type - Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount - Added full resource config support: instanceType, shmSize, dockerArgs, properties - Added ResourceDefinition YAML struct with AISuperComputer properties pass-through - Backward compatible: flat instance_count still works Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * custom training: add spinner progress to job show command Show animated spinner with progress text while fetching job details. Updates text as each parallel fetch (run history, metrics, artifacts) completes, showing remaining items until all data is loaded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Implement download command for custom training (#7453) * Add validate command for custom training (#7407) * Rename EnvironmentID to EnvironmentImageReference (#7891) * Revert "Implement download command for custom training (#7453)" (#7892) This reverts commit 5216202. * Add userAssignedIdentityId support for command jobs (#7927) * Implement stream command for training job (#7939) * Add support for experiment name (#7961) * Implement connect-ssh to job node (#7964) * Add support for gpuCount for partial SKU scenario (#8067) * Share offline validation between job validate and job submit commands (#8068) * fix(azure.ai.customtraining): honor --subscription and --project-endpoint flags over stored env values (#8093) * Implement download command for training job (#8029) * Rename azure.ai.customtraining to azure.ai.training (#8106) * Implement show services command (#8121) * Add validations for UAMI requirement (#8122) * Add template flag in init command (#8123) * Bump armcognitiveservices SDK from v1.8.0 to v2.0.0 to fix NetworkInjections unmarshal error * Bump go directive to 1.26.1 to align with repo standard and other extensions * Use Subscription.UserTenantId for credential tenant to support guest/multi-tenant users * Pin azd module to semver v1.24.3 instead of pseudo-version for stable dependency * Add UTs * Consolidate doUpload and doUploadWithTag into single method with optional tags parameter * Use PromptSubscriptionResource for interactive Foundry project selection in init * Surface azcopy scanner errors so truncated stdout doesn't mask upload failures * Add schema header and requiredAzdVersion >=1.25.1 to extension.yaml * Cap error and service-instance response body reads to prevent unbounded memory use * Add retry policy (429/502/503/504 + net errors) to Foundry data plane client * Use azd Confirm prompt and honor --no-prompt in job delete * Escape user-supplied IDs in client URL paths * Fix JSON tag mismatch for DataType * Route client debug prints to stderr to keep stdout JSON parseable * Adopt azdext.NewExtensionRootCommand and remove reserved-flag conflicts * chore(ext/azure.ai.training): add CHANGELOG, README, cspell, golangci config and CI lint + release pipeline * chore(ext/azure.ai.training): extend cspell dictionary to fix CI lint * chore(ext/azure.ai.training): apply go fix modernization (interface{}→any, CutPrefix, drop loopvar capture) * chore(ext/azure.ai.training): fix golangci-lint issues * fix(ext/azure.ai.training): surface azcopy failure diagnostics * refactor(ext/azure.ai.training): consolidate ServiceEndpoint helper in internal/utils * test(ext/azure.ai.training): add hash + upload_service unit tests * refactor(ext/azure.ai.training): rename job_get.go to job_show.go * chore(ext/azure.ai.training): rename Design/ to design/ and link from README * chore(ext/azure.ai.training): address PR feedback (ssh ProxyCommand % escape, design/ rename) * Fix cspell error * fix: apply go fix modernizations * Move to APIM APIs and update API paths as per latest Typespec * Temp: Print API request response for testing * Fix API paths for metrics * fix(ai.training): use delete operation result url to surface accurate job delete outcome * fix(ai.training): fan out artifact contentinfo per unique root folder for job download * Revert "Temp: Print API request response for testing" This reverts commit 6d7f2d7. * Support attaching remaining service types (jupyter_lab, tensor_board, vs_code, custom) to job * Support distribution type (pytorch, tensorflow, mpi, ray) in job YAML * Add polling for job deletion | Add a --no-wait flag * Add polling for job cancel | Add a --no-wait flag | Refactor to share same poller as delete * fix: surface all per-root errors in artifact contentinfo fan-out * fix: redact query string from --debug URL logs to avoid leaking SAS tokens * fix: harden redactSAS with SAS-marker fallback for URLs without '?' * fix: validate ray distribution port and dashboard_port ranges in job validator * fix: make root-folder fan-out semaphore acquire context-aware in job download * fix: write full azcopy diagnostics to side-file when terminal output is truncated * fix: resolve cspell (skoid/sktid) and gosec G104 (LRO body close) lint failures * fix: redact request body for /credentials endpoints under --debug --------- Co-authored-by: Amit Chauhan <70937115+achauhan-scc@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: therealjohn <1501196+therealjohn@users.noreply.github.com>
Notes
Testing
Other operations like validate, download, stream, delete, cancel, show-services, connect-ssh are also supported.