[2.8] Fix tracking recipe integration test #4583
Pull request overview
This PR fixes the experiment tracking recipe integration smoke tests by aligning them with the documented Recipe API model configuration and ensuring required example code is packaged into the exported job. It also introduces a dedicated tracking integration backend and hooks it into the existing integration test runner and PyTorch CI path.
Changes:
- Update experiment tracking recipe tests to pass a dict-based model config (`class_path`/`args`) instead of a dynamically imported `nn.Module` instance, avoiding export-time source inspection failures.
- Bundle the example `model.py` into the server app so `model.SimpleNetwork` can be resolved at job runtime.
- Add a new `tracking` integration backend (pytest-file driven) and route it through `ci/run_integration.sh`’s PyTorch integration setup; refresh backend list docs.
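To illustrate why the bundled `model.py` matters for the dict-based config, here is a minimal sketch of how a `{"class_path": ..., "args": ...}` entry could be resolved at runtime. The helper `instantiate_from_config`, the temporary directory standing in for the app's custom directory, and the plain-Python `SimpleNetwork` are all assumptions for illustration; this is not the recipe API's actual resolution code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def instantiate_from_config(config):
    """Hypothetical helper: resolve {"class_path": "mod.Class", "args": {...}}."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    module = importlib.import_module(module_name)  # needs model.py on sys.path
    cls = getattr(module, class_name)
    return cls(**config.get("args", {}))


# Stand-in for the example model.py (a plain class, not a real nn.Module).
MODEL_SOURCE = (
    "class SimpleNetwork:\n"
    "    def __init__(self, hidden=8):\n"
    "        self.hidden = hidden\n"
)

with tempfile.TemporaryDirectory() as custom_dir:
    # Simulate bundling model.py into a directory that is importable at runtime.
    Path(custom_dir, "model.py").write_text(MODEL_SOURCE)
    sys.path.insert(0, custom_dir)
    importlib.invalidate_caches()
    try:
        net = instantiate_from_config(
            {"class_path": "model.SimpleNetwork", "args": {}}
        )
    finally:
        sys.path.remove(custom_dir)

print(type(net).__name__, net.hidden)
```

If `model.py` is not on the import path when the config is resolved, `importlib.import_module("model")` raises `ModuleNotFoundError`, which is the kind of runtime failure that bundling the file is meant to prevent.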
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/integration_test/test_configs.yml | Adds tracking to pytest_files so it can run dedicated pytest modules. |
| tests/integration_test/run_integration_tests.sh | Registers tracking as a selectable backend in the integration runner. |
| tests/integration_test/README.md | Updates documented backend list to include tracking. |
| tests/integration_test/experiment_tracking_recipes_test.py | Switches to dict model config + bundles model.py into the job’s server app. |
| ci/run_integration.sh | Routes tracking to the PyTorch integration runner path in CI. |
Comments suppressed due to low confidence (1)
tests/integration_test/README.md:29
- README states that only `standalone` runs explicit pytest files from `pytest_files`, but this PR also introduces a `tracking` backend that runs pytest files via the same mechanism. Please update this section to mention `tracking` as well to avoid misleading instructions.
The backend options are:
`numpy`, `tensorflow`, `pytorch`, `auth`, `preflight`, `cifar`, `stats`, `xgboost`,
`client_api`, `client_api_qa`, `model_controller_api`, `tracking`, and `standalone`.
`preflight` has its own entry file. Most backend options run through
`tests/integration_test/system_test.py`, and `standalone` runs explicit pytest files listed in
`pytest_files` in `tests/integration_test/test_configs.yml`.
Greptile Summary
This PR fixes a
Confidence Score: 5/5
Safe to merge: the fix correctly distributes `model.py` to both server and all client sites, resolving the runtime import failure.
The change is narrow and targeted: it swaps an unreliable importlib trick for a documented dict config, and ensures `model.py` is copied into every site's custom directory (server via `add_file_to_server`, clients via `add_file_to_clients` using `ALL_SITES`). Both the server-side model instantiation path and the client-side `from model import SimpleNetwork` import are now satisfied. No logic changes outside the test file.
No files require special attention.
Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
Scope update after review: this PR has been narrowed to
Current fix:
Validated locally:
## Summary
- Replace the dynamic `_exp_tracking_model` import in the experiment
tracking recipe smoke tests with the documented dict model config.
- Bundle the example `model.py` into the server app so
`model.SimpleNetwork` resolves when the job runs.
- Add a dedicated `tracking` integration backend and route it through
the PyTorch CI setup.
- Refresh the integration test README/backend list and stale manual
pytest command.
## Why
The tests loaded the example model under a synthetic module name and
then passed that live `nn.Module` instance into `FedAvgRecipe`. During
job export, Python source inspection could not resolve the synthetic
module and raised:
```text
TypeError: <class '_exp_tracking_model.SimpleNetwork'> is a built-in class
```
Using `model={"class_path": "model.SimpleNetwork", "args": {}}` matches
the recipe API and avoids relying on importlib state.
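The failure mode can be reproduced with the standard library alone. The sketch below shows one way this `TypeError` can arise: a file is loaded under a synthetic module name without being registered in `sys.modules`, so `inspect.getfile` cannot map the class back to a source file. The loader helper and the plain-Python `SimpleNetwork` are assumptions for illustration; the original tests may have loaded the module differently.

```python
import importlib.util
import inspect
import tempfile
from pathlib import Path

# Stand-in for the example model.py (a plain class, not a real nn.Module).
MODEL_SOURCE = "class SimpleNetwork:\n    pass\n"


def load_under_synthetic_name(path, name):
    """Hypothetical loader: execute a file as module `name`, skipping sys.modules."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp, "model.py")
    model_file.write_text(MODEL_SOURCE)
    module = load_under_synthetic_name(str(model_file), "_exp_tracking_model")
    try:
        # inspect looks up cls.__module__ in sys.modules to find the source file.
        inspect.getfile(module.SimpleNetwork)
        error_message = None
    except TypeError as exc:
        error_message = str(exc)

print(error_message)
```

A dict config sidesteps this because only strings cross the export boundary; no live class object has to be traced back to its source.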
Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>
(cherry picked from commit 5bade21)
## Summary
Port the selected 2.8 fixes back to `main` in 2.8 merge order:
- #4528 Add warnings for missing study data mappings
- #4538 Update deploy prepare launcher docs
- #4550 Align `Run.get_result()` with the `clean_up` parameter spelling
- #4561 Clarify `remove_client` token cleanup semantics
- #4563 Respect `CUDA_VISIBLE_DEVICES` in the GPU resource manager
- #4574 Fix Docker SJ workspace tmpfs permissions
- #4576 Narrow client failure reporting for generic launcher execution errors
- #4583 Fix tracking recipe integration test
---------
Signed-off-by: YuanTingHsieh <yuantingh@nvidia.com>