
[DRAFT] adding generated and custom code for custom training #45951

Open

jayesh-tanna wants to merge 69 commits into main from jatanna/trainingv1

Conversation


@jayesh-tanna jayesh-tanna commented Mar 27, 2026

Description

TypeSpec pull request: Azure/azure-rest-api-specs#41619

Add Training Jobs support to azure-ai-projects SDK

Overview

This PR introduces CommandJob support under client.beta.training.jobs (sync) and
async_client.beta.training.jobs (async), enabling users to create, get, list, update,
cancel, and delete training jobs from the Azure AI Projects SDK without wrapping boilerplate.

Many of our customers currently use azure-ai-ml, so this surface is designed to feel familiar to them: same patterns, same mental model. That way, when they are ready to move to Azure AI Foundry, the migration is a small step rather than a full rewrite.


Design Choices

1. Flat CommandJob surface — no envelope required
Callers pass CommandJob directly to create_or_update and receive CommandJob back from
get/list. The SDK wraps/unwraps the Job(properties=...) wire envelope transparently.

2. Custom CommandJob subclass (model patch)
CommandJob extends the auto-generated _RestCommandJob and exposes read-only name and id
properties promoted from the outer Job envelope returned by the service.
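The promotion can be sketched like this; the base-class body and attribute names are illustrative stand-ins, and only the read-only-property shape reflects the design:

```python
from typing import Any, Optional

class _RestCommandJob:  # stand-in for the generated model
    def __init__(self, **kwargs: Any) -> None:
        self.command = kwargs.get("command", "")

class CommandJob(_RestCommandJob):
    """Generated model plus read-only identity promoted from the Job envelope."""

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self._name: Optional[str] = None  # populated from the envelope by the SDK
        self._id: Optional[str] = None

    @property
    def name(self) -> Optional[str]:  # property without a setter: read-only
        return self._name

    @property
    def id(self) -> Optional[str]:
        return self._id

job = CommandJob(command="python train.py")
job._name = "my-job"  # what the SDK would do internally after a service call
assert job.name == "my-job"
```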

3. _from_rest_object factory method
A classmethod on CommandJob constructs the flat model from any service response object,
with explicit ValueError/TypeError on unexpected shapes rather than silent None fields.
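A sketch of the factory's error discipline; `SimpleNamespace` stands in for the raw service response, and the field names (`job_type`, `properties`) are assumptions for illustration:

```python
from types import SimpleNamespace
from typing import Any

class CommandJob:
    def __init__(self, command: str = "") -> None:
        self.command = command
        self.name = None

    @classmethod
    def _from_rest_object(cls, obj: Any) -> "CommandJob":
        """Build a flat CommandJob from a service envelope, failing loudly."""
        props = getattr(obj, "properties", None)
        if props is None:
            raise ValueError("Service response has no 'properties' envelope")
        if getattr(props, "job_type", None) != "Command":
            raise TypeError(f"Expected a Command job, got {getattr(props, 'job_type', None)!r}")
        job = cls(command=props.command)
        job.name = getattr(obj, "name", None)  # promoted from the envelope
        return job

envelope = SimpleNamespace(
    name="run-1",
    properties=SimpleNamespace(job_type="Command", command="python train.py"),
)
job = CommandJob._from_rest_object(envelope)
assert (job.name, job.command) == ("run-1", "python train.py")
```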

4. CommandJobLimits.timeout accepts int, float, or timedelta
The patched CommandJobLimits.__init__ converts plain numeric seconds to timedelta before
forwarding to the generated model, eliminating a common serialization foot-gun.
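The coercion amounts to a few lines; this is a self-contained sketch, not the actual patched class, which forwards to the generated model's `__init__`:

```python
from datetime import timedelta
from typing import Optional, Union

class CommandJobLimits:
    """Sketch of the patched __init__: numeric seconds become a timedelta."""

    def __init__(self, *, timeout: Union[int, float, timedelta, None] = None) -> None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(timeout, (int, float)) and not isinstance(timeout, bool):
            timeout = timedelta(seconds=timeout)
        # The real patch forwards this to the generated model.
        self.timeout: Optional[timedelta] = timeout

assert CommandJobLimits(timeout=7200).timeout == timedelta(hours=2)
assert CommandJobLimits(timeout=1.5).timeout == timedelta(seconds=1.5)
assert CommandJobLimits(timeout=timedelta(minutes=30)).timeout == timedelta(minutes=30)
```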

5. Auto-injection of Foundry-Features preview header
Every operation (list, get, create_or_update, begin_delete, begin_cancel) automatically injects
Foundry-Features: Jobs=V1Preview so callers never need to pass it manually as a custom header.

6. Automatic local-path resolution for code and inputs
If code or an input path is a local file or folder, the SDK transparently uploads it as a
dataset asset and swaps in the returned datastore URI before the request is sent.

7. Input validation before every create/update
create_or_update validates name, command, environment_image_reference, and compute
are non-empty upfront, surfacing clear ValueErrors instead of opaque HTTP 400 responses.

8. Full async mirror (_patch_jobs_async.py)
All sync customizations are mirrored in the async operations class using async/await and
distributed_trace_async, including async dataset upload resolution for code and inputs.
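The async mirror follows the same logic with an awaited transport; `begin_cancel` and `fake_send` here are illustrative stand-ins, not the actual operation signatures:

```python
import asyncio
from typing import Awaitable, Callable

async def begin_cancel(name: str, send: Callable[[str], Awaitable[str]]) -> str:
    """Async mirror of the sync operation: same logic, awaited transport.

    In the real patch this is decorated with @distributed_trace_async.
    """
    return await send(f"jobs/{name}:cancel")

async def _demo() -> str:
    async def fake_send(path: str) -> str:  # stands in for the async HTTP pipeline
        return f"cancelled via {path}"
    return await begin_cancel("run-1", fake_send)

assert asyncio.run(_demo()) == "cancelled via jobs/run-1:cancel"
```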


Customizations Summary

| Customization | What it does |
| --- | --- |
| Flat CommandJob model with name and id properties | The service returns jobs wrapped in an outer Job envelope. We subclass the generated model to surface name and id directly on the object so callers never need to unwrap job.properties.name. |
| CommandJob._from_rest_object factory | Converts a raw service Job response into a flat CommandJob in one place, with typed error messages if the response shape is unexpected (missing properties, wrong job type). |
| Job envelope wrapping in create_or_update | The service wire format requires Job(properties=CommandJob(...)). The patch wraps the caller's flat CommandJob into the envelope automatically before the HTTP call, keeping the public API clean. |
| CommandJobLimits timeout coercion | Overrides __init__ to accept plain int/float seconds in addition to timedelta, converting them automatically. Removes a class of runtime serialization errors when callers pass numeric timeouts. |
| Foundry-Features preview header injection | Injects Foundry-Features: Jobs=V1Preview into every request from _inject_preview_header, so the preview feature flag is always active without callers needing to know about it. |
| Local-path auto-upload for code and inputs | Before sending a job, any local file or folder in code or an input path is uploaded to a new dataset via DatasetsOperations and the field is replaced with the returned datastore URI transparently. |
| Dataset name:version short-form resolution | An input URI in name:version or azureai:name:version form is resolved to a full datastore URI by fetching the existing dataset, removing the need for callers to look up URIs manually. |
| Pre-flight _validate guard | Checks name, command, environment_image_reference, and compute are non-empty before any network call, giving callers an immediate ValueError with a clear message instead of a cryptic HTTP 400. |
| Async mirror of all sync customizations | Every sync customization (envelope wrap/unwrap, validation, path resolution, header injection) is duplicated with async/await in _patch_jobs_async.py so the async client has identical behaviour. |

Pending / Future Work

  • command() factory function — Following the same pattern as azure-ai-ml's top-level
    command() function (see azure.ai.ml.entities._builders.command_func), a standalone
    command(*, command, environment, compute, inputs, outputs, ...) helper will be added so users
    can write job = command(...); client.beta.training.jobs.create_or_update(name, job) without
    constructing CommandJob directly.
  • Unit & live test coverage — Tests for the patch layer (validation, local-path resolution,
    header injection, _from_rest_object error paths, async equivalents) will be added to this PR
    in a follow-up commit.

Sample code

```python
job = CommandJob(
    command="python train.py --epochs 10 --lr 0.001 --output $AZUREML_MODEL_DIR/outputs",
    environment_image_reference="mcr.microsoft.com/azureml/minimal-ubuntu22.04-py39-cuda11.8-gpu-inference",
    compute=compute_id,
    display_name="Sample Command Job - Full",
    description="A sample job created via the Azure AI Projects SDK.",
    tags={"framework": "pytorch", "priority": "low", "team": "ai-platform"},
    properties={"experiment_id": "exp-42", "model_version": "1.0"},
    code="./src",
    environment_variables={
        "NCCL_DEBUG": "INFO",
        "PYTHONPATH": "/opt/conda/lib/python3.9/site-packages",
    },
    inputs={
        "training_data": Input(
            type=AssetTypes.URI_FILE,
            path="./data/train.csv",
            mode=InputOutputModes.READ_ONLY_MOUNT,
            description="CIFAR-10 training split",
        ),
    },
    outputs={
        "model_output": Output(
            type=AssetTypes.URI_FOLDER,
            path="azureai://datastores/workspaceblobstore/paths/outputs/cifar10-model/",
            mode=InputOutputModes.UPLOAD,
            asset_name="cifar10-trained-model",
            description="Trained CIFAR-10 model",
        ),
    },
    resources=JobResourceConfiguration(
        instance_count=2,
        instance_type="Standard_NC6s_v3",
        shm_size="8g",
        docker_args="--ipc=host",
        properties={"AISuperComputer": {"slaTier": "Premium", "priority": "high"}},
    ),
    distribution=PyTorchDistribution(process_count_per_instance=1),
    limits=CommandJobLimits(timeout=7200),
    queue_settings=QueueSettings(job_tier="Spot"),
    is_archived=False,
)
job = project_client.beta.training.jobs.create_or_update(name="job_name", body=job)
print(job)
```

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

dargilco and others added 30 commits March 6, 2026 10:29
…#45611)

* marking finetuning pause and resume operations as live extended tests

* updating recording

---------

Co-authored-by: Jayesh Tanna <jatanna@microsoft.com>
* rename env vars

* rename env var

* resolved comments

* remove chat completion

* resolved comment
* Add CSV and synthetic data generation evaluation samples

Add two new evaluation samples under sdk/ai/azure-ai-projects/samples/evaluations/:

- sample_evaluations_builtin_with_csv.py: Demonstrates evaluating pre-computed
  responses from a CSV file using the csv data source type. Uploads a CSV file
  via the datasets API, runs coherence/violence/f1 evaluators, and polls results.

- sample_synthetic_data_evaluation.py: Demonstrates synthetic data evaluation
  (preview) that generates test queries from a prompt, sends them to a model
  target, and evaluates responses with coherence/violence evaluators.

Also adds:
- data_folder/sample_data_evaluation.csv: Sample CSV data file with 3 rows
- README.md: Updated sample index with both new samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update synthetic eval sample: agent target + cleaner dataset ID retrieval

- Switch from model target to agent target (azure_ai_agent)
- Create agent version via agents.create_version() before evaluation
- Simplify output_dataset_id retrieval using getattr instead of nested hasattr/isinstance checks
- Add AZURE_AI_AGENT_NAME env var requirement
- Remove input_messages (not needed for agent target)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add model target synthetic eval sample, cross-reference both

- Add sample_synthetic_data_model_evaluation.py for model target with
  input_messages system prompt
- Update sample_synthetic_data_evaluation.py docstring with cross-reference
- Update README.md with both synthetic samples (agent and model)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename synthetic agent sample, clarify README, add prompt/files comments

- Rename sample_synthetic_data_evaluation.py to sample_synthetic_data_agent_evaluation.py
- Clarify README: JSONL dataset vs CSV dataset descriptions
- Remove (preview) from synthetic sample descriptions in README
- Add comments about prompt and reference_files options in both synthetic samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Skip new eval samples in recording tests

Add sample_evaluations_builtin_with_csv.py, sample_synthetic_data_agent_evaluation.py,
and sample_synthetic_data_model_evaluation.py to samples_to_skip list since they
require file upload prerequisites or are long-running preview features.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename env vars per PR review: FOUNDRY_PROJECT_ENDPOINT, FOUNDRY_MODEL_NAME

Address review comments from howieleung:
- AZURE_AI_PROJECT_ENDPOINT -> FOUNDRY_PROJECT_ENDPOINT
- AZURE_AI_MODEL_DEPLOYMENT_NAME -> FOUNDRY_MODEL_NAME
Updated in all 3 new samples and README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename AZURE_AI_AGENT_NAME to FOUNDRY_AGENT_NAME per review

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update changelog with new sample entries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* New samples

* resolved comments
* LLM validation use 5.2 and chat completion

* change log

* Resolved comments
* Adding-Upload-Evaluator

* Adding-Upload-Evaluator

* Adding-Upload-Evaluator

* Adding-Upload-Evaluator-aio

* rename

* added - eval and eval run

* fix

* adding tests

* updated as per review
@dargilco

Please hold off on merging the PR. I want to get Johan's feedback on introducing nested sub-clients like .beta.training.jobs. I'll start a thread with him and you.

dargilco and others added 18 commits March 30, 2026 20:51
* Sample-Fix

* fix samples
* Instruction now not provided in test function.

* fix test
…r upload (#46063)

* fix(azure-ai-projects): skip all dot-prefixed directories in evaluator upload

Change skip_dirs filter to exclude any directory starting with '.' instead
of only '.git' and '.venv'. This covers .venv, .git, .mypy_cache, .tox,
.pytest_cache, and any other hidden/tool directories.

Applied to both sync and async upload functions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(azure-ai-projects): skip dot-prefixed files in evaluator upload

Extend the existing skip logic to also exclude dot-prefixed files
(e.g. .env, .DS_Store, .gitignore) from evaluator uploads, matching
the treatment already applied to dot-prefixed directories.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(azure-ai-projects): add file_pattern and folder_exclusions_pattern to evaluators.upload

Add optional regex-based filtering parameters to _upload_folder_to_blob
and upload methods, consistent with datasets.upload_folder pattern:
- file_pattern: filter which files to upload by name
- folder_exclusions_pattern: exclude directories by name pattern

Applied to both sync and async implementations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: remove hardcoded skip lists, let customer control via patterns

Remove hardcoded skip_dirs and skip_extensions. Filtering is now
fully controlled by the optional file_pattern and
folder_exclusions_pattern parameters. Docstrings include recommended
excludes for typical Python evaluator projects.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: restore sample_eval_upload_friendly_evaluator.py accidentally emptied

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: update upload tests for customer-controlled pattern filtering

Replace test_upload_skips_pycache_and_pyc_files with two new tests:
- test_upload_skips_pycache_and_pyc_files_with_patterns: verifies
  filtering works when patterns are provided
- test_upload_uploads_all_files_without_patterns: verifies all files
  are uploaded when no patterns are given

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: fix Sphinx docstring continuation line alignment

Align continuation lines under the directive name (e.g. 'p' in :param)
instead of using deeper indentation, per Sphinx requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Apr 6, 2026

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure-ai-projects

@howieleung howieleung force-pushed the feature/azure-ai-projects/2.0.2 branch from eac227e to 223cb73 Compare April 16, 2026 06:39
Base automatically changed from feature/azure-ai-projects/2.0.2 to main April 17, 2026 15:34
Co-authored-by: Copilot <copilot@github.com>
Copilot AI review requested due to automatic review settings April 30, 2026 07:37

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Jayesh Tanna and others added 4 commits April 30, 2026 18:17


8 participants