integration: harden XGBoost stability and add sovereign quantum-proof observability profile by rwilliamspbg-ops · Pull Request #4374 · NVIDIA/NVFlare

rwilliamspbg-ops · 2026-03-27T21:21:01Z

Description
This PR introduces a new operational profile for quantum-safe federated learning observability and includes system hardening fixes for XGBoost integrations and recipe APIs.

🔐 Sovereign Quantum-Proof Observability
New Component: Added QuantumProofMetricsCollector to emit PQC posture and proof-lifecycle metrics (verification success, latencies, etc.).

Readiness Gate: Introduced nvflare_quantum_readiness_gate.py to automate Prometheus and Grafana health validation.

Dashboards: Added a pre-provisioned Grafana dashboard for "Sovereign Quantum Ops."

🛠 XGBoost & Integration Hardening
Data Resilience: Updated prepare_data.sh to support synthetic data fallback when the HIGGS dataset is unavailable.

API Compatibility: Fixed XGBVerticalRecipe and FedAvgRecipe to allow flexible initialization (omitting per_site_config), restoring backward compatibility.

Test Hygiene: Added stale-process cleanup to integration tests to prevent environment contamination.
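
The stale-process cleanup is described elsewhere on this page as sending `SIGTERM` and then `SIGKILL`. A minimal sketch of that terminate-then-kill pattern follows, using only the standard library. The helper name `stop_process` and the grace period are assumptions for illustration; this is not the helper added to `tests/integration_test/src/utils.py`.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a process to exit, then force-kill it if the grace period expires."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Stand-in for a process left over from an earlier test run (hypothetical).
stale = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(stale)
```

Operating on a known `Popen` handle sidesteps the pattern-matching risk raised in the review, at the cost of not catching processes started by a previous interpreter.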

📝 Documentation
Updated release notes and monitoring guides for the new sovereign profile.

Type of Changes
[x] Non-breaking change (fix or new feature)

[x] In-line docstrings updated

[x] Documentation updated

[ ] New tests added

Summary
-------
Improves integration test stability and xgboost backend validation reliability through process hygiene, fail-fast error detection, and deterministic data preparation with synthetic fallback support.

Changes
-------
- SimEnv: Fixed client parameter conflict (clients vs. n_clients) in deploy()
- FedAvg recipe: Restored backward compatibility for script-driven flows
- XGBoost vertical recipe: Made per_site_config optional, defaulting to {}
- Integration harness:
  * Added stale-process cleanup before test runs
  * Added fail-fast return code checks for setup/teardown commands
  * Exported cleanup helper from integration utils package
- XGBoost data prep (prepare_data.sh):
  * Hardened shell options (set -euo pipefail)
  * Added synthetic data generation as CI/dev fallback
  * Adaptive data sizing based on available rows
- Integration test configs (xgb_histogram.yml, xgb_tree.yml):
  * Use explicit 'env' invocation for subprocess compatibility
  * Enable synthetic-data fallback for deterministic CI runs

Validation
----------
- Backend xgboost integration: 2/2 tests pass
- Performance recipe suites: 11/11 tests pass
- License check: passed

Related to NVFlare integration reliability tracking in 2.7.2.

…adiness gate

Summary
-------
Introduces production-grade quantum-aware observability for federated learning, inspired by Sovereign-Mohawk-Proto. Enables operational visibility of proof lifecycle events, PQC control posture, and deterministic readiness validation for high-security FL deployments.

Changes
-------
Core Components:
- nvflare/metrics/quantum_proof_metrics_collector.py: Lightweight FLComponent for emitting proof-lifecycle and PQC posture metrics to StatsD-compatible monitoring stacks.
- examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py: Automated PASS/FAIL validation script for Prometheus/Grafana health and quantum-path metric availability.
- examples/advanced/monitoring/sovereign/README.md: Integration guide with configuration examples and deployment instructions.

Monitoring Stack Integration:
- Grafana dashboard provisioning: NVFlare Sovereign Quantum Ops dashboard with 6 panels (quantum path status, PQC controls, proof success rate, verify latency, aggregation throughput, timeline).
- Grafana provisioning config (dashboards.yaml) for auto-discovery.
- examples/advanced/monitoring/README.md: Added reference to sovereign profile.

Tests:
- tests/unit_test/metrics/quantum_proof_metrics_collector_test.py: 3 tests covering posture publication, proof flow metrics, aggregation flow.
- tests/unit_test/metrics/quantum_readiness_gate_test.py: 4 tests covering target health checks, metric presence validation, quantum-path queries, and report generation.

Validation
----------
- Unit tests: 7/7 pass
- Collector import: successful
- Grafana dashboard JSON: valid, 6 panels configured
- License check: passed

Metric Contract
---------------
The collector emits:
- quantum_path_ready, quantum_pqc_controls_migration_enabled, quantum_pqc_controls_legacy_lock_enabled (gauges for posture)
- quantum_proof_verify_count, _success_count, _failure_count, _time_taken (counters/gauges for request lifecycle)
- quantum_proof_aggregation_count, _success_count, _time_taken (aggregation metrics)

Part of NVFlare 2.7.2 Sovereign Quantum profile for operational compliance and auditable proof-path validation.
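
Since the commit message targets StatsD-compatible stacks, the sketch below shows how a few names from the metric contract would look in the plain StatsD text format (`name:value|type`, with `c` for counters and `g` for gauges). The helper `statsd_line` is hypothetical, and treating `quantum_proof_verify_time_taken` as a gauge is an assumption; the collector's real emission path is not shown in this PR page.

```python
def statsd_line(name: str, value: float, kind: str) -> str:
    """Format one metric in the plain StatsD text protocol: name:value|type."""
    type_codes = {"counter": "c", "gauge": "g"}
    return f"{name}:{value:g}|{type_codes[kind]}"


# Metric names come from the contract; the values are made up for illustration.
lines = [
    statsd_line("quantum_path_ready", 1, "gauge"),
    statsd_line("quantum_proof_verify_count", 1, "counter"),
    statsd_line("quantum_proof_verify_time_taken", 0.042, "gauge"),
]

for line in lines:
    print(line)
```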

Copilot

Pull request overview

Adds a new “sovereign quantum” monitoring profile (metrics + readiness gate + Grafana dashboard) and hardens XGBoost/integration workflows to be more stable and backward-compatible across common CI/dev environments.

Changes:

Introduces QuantumProofMetricsCollector, a readiness-gate script, and a pre-provisioned Grafana dashboard for the new observability profile.
Improves XGBoost example/integration stability via synthetic dataset fallback and deterministic dataset sizing.
Hardens integration test harness and adjusts recipe APIs (SimEnv, FedAvg, XGBVerticalRecipe) for better compatibility.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

File	Description
tests/unit_test/metrics/quantum_readiness_gate_test.py	Adds unit coverage for readiness gate behavior and report generation.
tests/unit_test/metrics/quantum_proof_metrics_collector_test.py	Adds unit coverage for quantum metrics collector event handling and metric emission.
tests/integration_test/system_test.py	Adds stale-process cleanup + checks setup/teardown command exit codes.
tests/integration_test/src/utils.py	Adds stale-process cleanup helper for integration runs.
tests/integration_test/src/__init__.py	Exports the new cleanup helper from the integration test utils package.
tests/integration_test/data/test_configs/standalone_job/xgb_tree.yml	Enables synthetic-data fallback flag when preparing XGBoost data in integration runs.
tests/integration_test/data/test_configs/standalone_job/xgb_histogram.yml	Enables synthetic-data fallback flag when preparing XGBoost data in integration runs.
nvflare/recipe/sim_env.py	Adjusts SimEnv deployment to pass explicit client lists to simulator runs.
nvflare/recipe/fedavg.py	Relaxes model-source validation to preserve script-driven/backward-compatible flows.
nvflare/metrics/quantum_proof_metrics_collector.py	New collector emitting “quantum path” and proof-lifecycle metrics.
nvflare/metrics/__init__.py	Exposes `QuantumProofMetricsCollector` at package level.
nvflare/app_opt/xgboost/recipes/vertical.py	Allows `per_site_config` to be omitted (defaults to `{}`).
examples/advanced/xgboost/fedxgb/prepare_data.sh	Adds strict bash mode, dynamic dataset sizing, and synthetic dataset fallback.
examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py	New Prometheus/Grafana readiness gate script producing PASS/FAIL report.
examples/advanced/monitoring/sovereign/README.md	Docs for integrating the collector, stack startup, and readiness gate usage.
examples/advanced/monitoring/setup/grafana/provisioning/dashboards/nvflare_sovereign_quantum_ops.json	New provisioned Grafana dashboard for the profile.
examples/advanced/monitoring/setup/grafana/provisioning/dashboards/dashboards.yaml	Adds Grafana provisioning provider for sovereign dashboards.
examples/advanced/monitoring/README.md	Links to the new sovereign profile docs.
docs/release_notes/flare_272.rst	Release notes entries for the new profile and hardening/compatibility changes.

Comments suppressed due to low confidence (1)

nvflare/app_opt/xgboost/recipes/vertical.py:212

per_site_config now defaults to {}. In configure(), the job only adds client executors/components inside the for site_name, site_config in self.per_site_config.items() loop, so an empty dict yields a job with no client components (and no clear error). If omitting per_site_config is meant to be supported for backward compatibility, consider either (a) raising a clear ValueError when per_site_config is empty, or (b) documenting/handling an explicit "I will add clients later" mode to avoid generating an invalid job silently.

        self.metrics_writer_id = v.metrics_writer_id
        self.in_process = v.in_process
        self.model_file_name = v.model_file_name
        self.per_site_config = per_site_config or {}

        # Configure the job
        self.job = self.configure()
        Recipe.__init__(self, self.job)
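
Option (a) from the comment above can be sketched as a small guard that rejects both `None` and an empty dict before any job is configured. The function name `require_per_site_config` and the error wording are assumptions for illustration, not the recipe's actual code.

```python
from typing import Any, Dict, Optional


def require_per_site_config(per_site_config: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the config, or fail loudly when no sites are given."""
    if not per_site_config:  # catches both None and {}
        raise ValueError(
            "per_site_config must map at least one site name to its settings; "
            "an empty value would produce a job with no client components"
        )
    return dict(per_site_config)


# Hypothetical site name and settings, for illustration only.
config = require_per_site_config({"site-1": {"data_split_mode": 1}})
```

Failing at construction time keeps the error close to the caller's mistake instead of surfacing later as a job that silently has no clients.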

greptile-apps · 2026-03-27T21:26:39Z

Greptile Summary

This PR bundles three distinct areas of work: a new "Sovereign Quantum-Proof Observability" profile (QuantumProofMetricsCollector, readiness gate script, autotune helper, and a pre-provisioned Grafana dashboard), hardening fixes for XGBoost and recipe APIs (SimEnv, FedAvgRecipe, XGBVerticalRecipe), and integration test hygiene (stale-process cleanup, synthetic-data fallback for HIGGS).

The functional correctness of the recipe and integration-test changes is solid. However, the new observability module contains a blocking defect and the Docker Compose update introduces a breaking change for existing users:

nvflare/metrics/quantum_proof_metrics_collector.py — A duplicated if event == EventType.ABORT_TASK: at lines 72–73 (both at the same indentation, with no body between them) is a Python IndentationError. The module cannot be imported, which also breaks nvflare/metrics/__init__.py and causes all four unit tests in quantum_proof_metrics_collector_test.py to fail immediately with an ImportError.
examples/advanced/monitoring/setup/docker-compose.yml — The nginx password-file volume mount uses the :? expansion operator with no default, so any user who pulls this change without exporting the new environment variable will get a hard docker compose up failure. The Grafana password has a sensible :-admin fallback; the nginx credential path should be treated consistently.
nvflare/recipe/sim_env.py — The n_clients/clients conflict fix is correct; always generating explicit site names avoids the ambiguous state that caused earlier integration failures.
prepare_data.sh — The synthetic-data fallback and dynamic size_total/size_valid derivation are well-guarded and a clear CI improvement.

Confidence Score: 3/5

Not safe to merge: the new metrics module contains a Python IndentationError that prevents it from loading, and the Docker Compose change is a breaking update for existing monitoring users.

Two concrete defects block merge: (1) a P0 IndentationError in quantum_proof_metrics_collector.py that prevents the module and its __init__.py re-export from loading at all, causing all related unit tests to fail; (2) a P1 breaking change in docker-compose.yml that requires a new environment variable with no default, causing docker compose up to fail for all existing monitoring users. The recipe and integration-test changes are correct and well-tested, but the observability feature that forms the headline of this PR cannot ship in its current state.

nvflare/metrics/quantum_proof_metrics_collector.py (duplicate if IndentationError) and examples/advanced/monitoring/setup/docker-compose.yml (missing default for required env variable).

Important Files Changed

Filename	Overview
nvflare/metrics/quantum_proof_metrics_collector.py	New `QuantumProofMetricsCollector` FLComponent — contains a duplicate `if event == EventType.ABORT_TASK:` at lines 72–73 that is a Python `IndentationError`, preventing the module (and `nvflare.metrics`) from loading at all.
nvflare/metrics/__init__.py	Exports `QuantumProofMetricsCollector`; inherits the P0 `IndentationError` from the collector module, so this `__init__.py` import will also fail at runtime until the duplicate `if` is fixed.
examples/advanced/monitoring/setup/docker-compose.yml	Adds nginx authentication proxy for Prometheus and binds all ports to localhost; requires a new env variable with no default, breaking existing deployments that do not set it.
nvflare/recipe/sim_env.py	Always generates explicit client names instead of passing `n_clients`; cleanly avoids the `clients`/`n_clients` conflict for jobs that already specify targets via `job.to()`.
nvflare/recipe/fedavg.py	Cosmetic-only changes: reorders the three `None` conditions in the model-source guard and removes a blank line; both `ValueError` guards remain intact and functional.
nvflare/app_opt/xgboost/recipes/vertical.py	Changes `per_site_config is None` guard to `not per_site_config`; `ValueError` is still raised for `None` input (confirmed by new test), with the minor side effect that an empty dict `{}` now also raises instead of producing a silently misconfigured recipe.
examples/advanced/xgboost/fedxgb/prepare_data.sh	Adds `set -euo pipefail`, synthetic-data fallback via `NVFLARE_XGB_ALLOW_SYNTHETIC_DATA`, and dynamic `size_total`/`size_valid` derivation from actual dataset row count; well-guarded against edge cases.
tests/integration_test/src/utils.py	Adds `cleanup_stale_integration_processes()` which sends `SIGTERM` then `SIGKILL` to NVFlare process patterns; path-fragment patterns could match unrelated processes in shared CI environments.
tests/integration_test/system_test.py	Calls `cleanup_stale_integration_processes()` before each test run and adds non-zero exit code checks for setup/teardown commands; both are solid test hygiene improvements.
examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py	New standalone readiness-gate script with Prometheus/Grafana health checks, metric presence validation, and consecutive-breach congestion detection; logic is correct and well-tested.
tests/unit_test/metrics/quantum_proof_metrics_collector_test.py	Comprehensive unit tests for `QuantumProofMetricsCollector` — all four tests will fail with an `ImportError` until the duplicate `if` syntax error in the source module is fixed.
examples/advanced/monitoring/sovereign/nvflare_quantum_autotune.py	New standalone auto-tune script that derives congestion thresholds from Prometheus range queries; logic is sound with good handling of low-traffic samples and clear profile recommendations.
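
The "consecutive-breach congestion detection" mentioned for the readiness gate can be illustrated with a short streak counter. This is a sketch under assumptions: the function name, the strict `>` comparison, and the sample values are hypothetical, and the gate script's real logic is not reproduced on this page.

```python
from typing import Iterable


def has_consecutive_breaches(samples: Iterable[float], threshold: float, required: int) -> bool:
    """Return True once `required` samples in a row exceed `threshold`."""
    if required < 1:
        raise ValueError("required must be at least 1")
    streak = 0
    for value in samples:
        # Any sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= required:
            return True
    return False


# Hypothetical latency samples in seconds.
isolated_spikes = has_consecutive_breaches([0.2, 0.9, 0.3, 0.9, 0.1], threshold=0.5, required=2)
sustained = has_consecutive_breaches([0.2, 0.9, 0.8, 0.7, 0.1], threshold=0.5, required=2)
```

Requiring a run of breaches rather than a single one keeps one slow sample from flipping the gate to FAIL.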

Sequence Diagram

sequenceDiagram
    participant FL as FL Runtime
    participant QPMC as QuantumProofMetricsCollector
    participant CB as collect_metrics / DataBus
    participant PM as Prometheus (via StatsD)

    FL->>QPMC: START_RUN
    QPMC->>CB: quantum_path{ready=1, attestation_mode, kex_mode}
    QPMC->>CB: quantum_pqc_controls{migration_enabled, legacy_lock_enabled}

    FL->>QPMC: BEFORE_TASK_EXECUTION
    QPMC->>QPMC: _proof_verify_start_time = now()
    QPMC->>CB: quantum_proof_verify (counter)

    alt Task completes
        FL->>QPMC: AFTER_TASK_EXECUTION
        QPMC->>CB: quantum_proof_verify_success (counter)
        QPMC->>CB: quantum_proof_verify (elapsed gauge)
        QPMC->>QPMC: clear _proof_verify_start_time
    else Task aborted
        FL->>QPMC: ABORT_TASK
        QPMC->>CB: quantum_proof_verify_failure (counter)
        QPMC->>QPMC: clear _proof_verify_start_time
    end

    FL->>QPMC: BEFORE_AGGREGATION
    QPMC->>QPMC: _aggregation_start_time = now()
    QPMC->>CB: quantum_proof_aggregation (counter)

    FL->>QPMC: END_AGGREGATION
    QPMC->>CB: quantum_proof_aggregation_success (counter)
    QPMC->>CB: quantum_proof_aggregation (elapsed gauge)

    CB-->>PM: StatsD/Prometheus scrape
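
The task-execution branch of the diagram can be sketched as a small event handler. Event names and metric names are taken from the diagram; the class `ProofFlowSketch`, the `emit` callback, and the injectable clock are assumptions made so the example runs on its own, and none of this is the actual `QuantumProofMetricsCollector` code.

```python
import time
from typing import Callable, List, Optional, Tuple

Emit = Callable[[str, float, str], None]


class ProofFlowSketch:
    """Tracks one in-flight proof verification and emits lifecycle metrics."""

    def __init__(self, emit: Emit, clock: Callable[[], float] = time.monotonic):
        self._emit = emit
        self._clock = clock
        self._verify_start: Optional[float] = None

    def handle(self, event: str) -> None:
        if event == "BEFORE_TASK_EXECUTION":
            self._verify_start = self._clock()
            self._emit("quantum_proof_verify", 1, "counter")
        elif event == "AFTER_TASK_EXECUTION":
            self._emit("quantum_proof_verify_success", 1, "counter")
            if self._verify_start is not None:
                elapsed = self._clock() - self._verify_start
                self._emit("quantum_proof_verify", elapsed, "gauge")
            self._verify_start = None  # clear the timer on completion
        elif event == "ABORT_TASK":
            self._emit("quantum_proof_verify_failure", 1, "counter")
            self._verify_start = None  # clear the timer on abort


recorded: List[Tuple[str, float, str]] = []
fake_clock = iter([10.0, 10.5]).__next__  # deterministic timestamps for the demo
flow = ProofFlowSketch(lambda n, v, k: recorded.append((n, v, k)), clock=fake_clock)
flow.handle("BEFORE_TASK_EXECUTION")
flow.handle("AFTER_TASK_EXECUTION")
```

Using `elif` for each event keeps every branch with exactly one body, which avoids the duplicated-guard problem discussed in the review.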

Comments Outside Diff (2)

nvflare/metrics/quantum_proof_metrics_collector.py, line 72-76 (link)

Duplicate if causes IndentationError — module cannot be imported

Lines 72–73 contain two consecutive if event == EventType.ABORT_TASK: guards at the same indentation level. Python requires an indented block after an if statement; an immediately following statement at the same indentation is an IndentationError. This means the entire nvflare.metrics.quantum_proof_metrics_collector module (and by extension nvflare.metrics via its __init__.py) will fail to import at runtime. All four unit tests in quantum_proof_metrics_collector_test.py will also fail with an import error.
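
The failure mode can be reproduced in isolation by compiling a tiny function that repeats the guard. The source string below is a stand-in written for this sketch, with a plain string in place of `EventType.ABORT_TASK`; it is not the collector's code.

```python
# Two `if` headers at the same indentation, with no body under the first one.
broken_source = (
    "def handle_event(event):\n"
    "    if event == 'ABORT_TASK':\n"
    "    if event == 'ABORT_TASK':\n"
    "        return 'failure'\n"
)

try:
    compile(broken_source, "<sketch>", "exec")
    error = None
except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
    error = exc

print(type(error).__name__)
```

Because the error is raised while the file is being compiled, it appears on import before any code in the module runs, and the exception that propagates is the `IndentationError` itself (a `SyntaxError` subclass).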
examples/advanced/monitoring/setup/docker-compose.yml, line 126-127 (link)

Missing default for required env variable breaks existing monitoring setups

The :? operator on this volume mount causes Docker Compose to abort startup with an error if the environment variable is unset or empty. Unlike GF_SECURITY_ADMIN_PASSWORD on line 140 which ships with an admin default, the nginx password-file path has no fallback. Existing users of the monitoring stack who pull this change will see a hard failure on docker compose up until they create the password file and export the variable.

Consider using the :- operator with a path to the bundled example file as a fallback, and update the README to document this as a new required setup step before merging.
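
To make the difference between the two operators concrete, the sketch below emulates their semantics in Python: `${VAR:?message}` fails when the variable is unset or empty, while `${VAR:-default}` falls back to the default in the same cases. The function names, variable names, and the fallback path are hypothetical; the actual variable used in `docker-compose.yml` is not named on this page.

```python
from typing import Mapping


def expand_required(env: Mapping[str, str], name: str, message: str) -> str:
    """Emulate ${NAME:?message}: error out if NAME is unset or empty."""
    value = env.get(name, "")
    if value == "":
        raise ValueError(f"{name}: {message}")
    return value


def expand_with_default(env: Mapping[str, str], name: str, default: str) -> str:
    """Emulate ${NAME:-default}: use default if NAME is unset or empty."""
    value = env.get(name, "")
    return value if value != "" else default


env_without_override: Mapping[str, str] = {}  # a user who has not exported anything

# Hypothetical variable name and bundled example path.
fallback_path = expand_with_default(
    env_without_override, "EXAMPLE_PASSWORD_FILE", "./nginx/example.htpasswd"
)

try:
    expand_required(env_without_override, "EXAMPLE_PASSWORD_FILE", "password file is required")
    startup_failed = False
except ValueError:
    startup_failed = True
```

A default that points at bundled example credentials keeps existing setups starting, though it trades a hard failure for a weaker out-of-the-box credential, so the README note suggested above still matters.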

Reviews (15): Last reviewed commit: "Merge branch 'NVIDIA:main' into main"

rwilliamspbg-ops · 2026-03-28T00:32:28Z

ready to test out

chesterxgchen · 2026-03-28T23:58:04Z

@greptileai

chesterxgchen · 2026-03-29T00:46:03Z

@rwilliamspbg-ops looks like your code style check failed. You can check this offline with:

./runtest.sh -s
./runtest.sh -f    # force code style format

chesterxgchen · 2026-03-29T00:46:30Z

please address the review comments

- Apply Black/isort formatting to quantum autotune and collector tests
- Replace PEP 604 union annotations with Optional[...] in readiness gate for Python 3.9 compatibility
- Validate with black, isort, flake8, and metrics unit tests

Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>

Notes:
- Reworked hello-lr README to follow hello-pt section structure (installation, data, code structure, client code, server side, job recipe, run job, result).
- Standardized code-structure formatting around client.py / job.py and noted model.py is not present in this example.
- Fixed isort ordering in tests/integration_test/system_test.py to satisfy style checks.
- Verified style checks with ./runtest.sh -s (black, isort, flake8 all passed).

rwilliamspbg-ops · 2026-03-29T15:25:06Z

Should be ready now.

Copilot

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

rwilliamspbg-ops

Ready for review

chesterxgchen · 2026-03-31T02:34:44Z

Please address the comments from Greptile. Once they are addressed, you can resolve the conversations.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

rwilliamspbg-ops added 2 commits March 27, 2026 21:07

Copilot AI review requested due to automatic review settings March 27, 2026 21:21

Copilot started reviewing on behalf of rwilliamspbg-ops March 27, 2026 21:21

Copilot AI reviewed Mar 27, 2026

greptile-apps Bot reviewed Mar 27, 2026

Comment thread nvflare/app_opt/xgboost/recipes/vertical.py Outdated

Comment thread nvflare/recipe/fedavg.py Outdated

Comment thread nvflare/metrics/quantum_proof_metrics_collector.py

Comment thread tests/integration_test/src/utils.py Outdated

rwilliamspbg-ops added 11 commits March 27, 2026 23:40

recipe: restore constructor guards and harden cleanup paths

25b98ed

monitoring: fix sovereign PQC controls promql

a34fe57

monitoring: fix proof/aggregation latency panel promql

020e98b

monitoring: make proof verify latency stat deterministic

010515b

monitoring: harden local stack ports and prometheus access

8d4854c

monitoring: add congestion risk checks and indicators

6935033

monitoring: add threshold auto-tuner and failure ratio fallback

20a96cf

monitoring: add profile presets and consecutive congestion checks

11e5f54

monitoring: emit profile recommendation and apply commands

46acc52

monitoring: add shell export mode for auto-tuner

a4f60e9

monitoring: apply P2 lint and credential hardening cleanups

3038552

chesterxgchen requested a review from ZiyueXu77 March 28, 2026 23:55

chesterxgchen reviewed Mar 29, 2026

Comment thread nvflare/recipe/fedavg.py

rwilliamspbg-ops added 2 commits March 29, 2026 08:35

rwilliamspbg-ops requested a review from Copilot March 30, 2026 04:06

Copilot started reviewing on behalf of rwilliamspbg-ops March 30, 2026 04:06

Copilot AI reviewed Mar 30, 2026

rwilliamspbg-ops and others added 2 commits March 29, 2026 21:16

Update examples/hello-world/hello-lr/README.md

aca3ec0

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update tests/integration_test/src/utils.py

8a8b00a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

rwilliamspbg-ops commented Mar 30, 2026

rwilliamspbg-ops and others added 3 commits March 30, 2026 12:26

Merge branch 'NVIDIA:main' into main

c827c45

Merge branch 'NVIDIA:main' into main

195eafc

Merge branch 'main' into main

4900aa1

rwilliamspbg-ops and others added 2 commits March 30, 2026 19:47

Update nvflare/metrics/quantum_proof_metrics_collector.py

1e98a8d

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Merge branch 'NVIDIA:main' into main

fefb00c

rwilliamspbg-ops closed this Apr 1, 2026


3 participants
