
integration: harden XGBoost stability and add sovereign quantum-proof observability profile #4374

Closed
rwilliamspbg-ops wants to merge 22 commits into NVIDIA:main from rwilliamspbg-ops:main


@rwilliamspbg-ops

Description
This PR introduces a new operational profile for quantum-safe federated learning observability and includes system hardening fixes for XGBoost integrations and recipe APIs.

🔐 Sovereign Quantum-Proof Observability
New Component: Added QuantumProofMetricsCollector to emit PQC posture and proof-lifecycle metrics (verification success, latencies, etc.).

Readiness Gate: Introduced nvflare_quantum_readiness_gate.py to automate Prometheus and Grafana health validation.

Dashboards: Added a pre-provisioned Grafana dashboard for "Sovereign Quantum Ops."

🛠 XGBoost & Integration Hardening
Data Resilience: Updated prepare_data.sh to support synthetic data fallback when the HIGGS dataset is unavailable.

API Compatibility: Fixed XGBVerticalRecipe and FedAvgRecipe to allow flexible initialization (omitting per_site_config), restoring backward compatibility.

Test Hygiene: Added stale-process cleanup to integration tests to prevent environment contamination.

📝 Documentation
Updated release notes and monitoring guides for the new sovereign profile.

Type of Changes
[x] Non-breaking change (fix or new feature)

[x] In-line docstrings updated

[x] Documentation updated

[ ] New tests added

Summary
-------
Improves integration test stability and xgboost backend validation reliability
through process hygiene, fail-fast error detection, and deterministic data
preparation with synthetic fallback support.

Changes
-------
- SimEnv: Fixed client parameter conflict (clients vs. n_clients) in deploy()
- FedAvg recipe: Restored backward compatibility for script-driven flows
- XGBoost vertical recipe: Made per_site_config optional, defaulting to {}
- Integration harness:
  * Added stale-process cleanup before test runs
  * Added fail-fast return code checks for setup/teardown commands
  * Exported cleanup helper from integration utils package
- XGBoost data prep (prepare_data.sh):
  * Hardened shell options (set -euo pipefail)
  * Added synthetic data generation as CI/dev fallback
  * Adaptive data sizing based on available rows
- Integration test configs (xgb_histogram.yml, xgb_tree.yml):
  * Use explicit 'env' invocation for subprocess compatibility
  * Enable synthetic-data fallback for deterministic CI runs

Validation
----------
- Backend xgboost integration: 2/2 tests pass
- Performance recipe suites: 11/11 tests pass
- License check: passed

Related to NVFlare integration reliability tracking in 2.7.2.
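
The fail-fast return code checks listed for the integration harness can be sketched as follows. This is a minimal illustration, not the harness code from this PR: the helper name `run_or_fail` and the `stage` label are hypothetical, and it assumes setup/teardown commands are launched as subprocesses.

```python
import subprocess
import sys
from typing import Sequence


def run_or_fail(command: Sequence[str], stage: str) -> None:
    """Run one setup/teardown command and raise if it exits non-zero.

    `run_or_fail` and `stage` are hypothetical names used for illustration.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{stage} command {list(command)!r} failed with exit code "
            f"{result.returncode}: {result.stderr.strip()}"
        )


# A command that succeeds passes silently.
run_or_fail([sys.executable, "-c", "print('setup ok')"], stage="setup")
```

Raising immediately keeps a failed setup step from surfacing later as an unrelated test failure.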
…adiness gate

Summary
-------
Introduces production-grade quantum-aware observability for federated learning,
inspired by Sovereign-Mohawk-Proto. Enables operational visibility of proof
lifecycle events, PQC control posture, and deterministic readiness validation
for high-security FL deployments.

Changes
-------
Core Components:
- nvflare/metrics/quantum_proof_metrics_collector.py: Lightweight FLComponent
  for emitting proof-lifecycle and PQC posture metrics to StatsD-compatible
  monitoring stacks.
- examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py:
  Automated PASS/FAIL validation script for Prometheus/Grafana health and
  quantum-path metric availability.
- examples/advanced/monitoring/sovereign/README.md: Integration guide with
  configuration examples and deployment instructions.

Monitoring Stack Integration:
- Grafana dashboard provisioning: NVFlare Sovereign Quantum Ops dashboard
  with 6 panels (quantum path status, PQC controls, proof success rate,
  verify latency, aggregation throughput, timeline).
- Grafana provisioning config (dashboards.yaml) for auto-discovery.
- examples/advanced/monitoring/README.md: Added reference to sovereign profile.

Tests:
- tests/unit_test/metrics/quantum_proof_metrics_collector_test.py: 3 tests
  covering posture publication, proof flow metrics, aggregation flow.
- tests/unit_test/metrics/quantum_readiness_gate_test.py: 4 tests covering
  target health checks, metric presence validation, quantum-path queries,
  and report generation.

Validation
----------
- Unit tests: 7/7 pass
- Collector import: successful
- Grafana dashboard JSON: valid, 6 panels configured
- License check: passed

Metric Contract
---------------
The collector emits:
- quantum_path_ready, quantum_pqc_controls_migration_enabled,
  quantum_pqc_controls_legacy_lock_enabled (gauges for posture)
- quantum_proof_verify_count, _success_count, _failure_count,
  _time_taken (counters/gauges for request lifecycle)
- quantum_proof_aggregation_count, _success_count, _time_taken
  (aggregation metrics)

Part of NVFlare 2.7.2 Sovereign Quantum profile for operational
compliance and auditable proof-path validation.
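
The metric-presence part of a readiness gate might look like the sketch below. It assumes the gate receives already-fetched Prometheus instant-query responses (the standard `{"status": ..., "data": {"result": [...]}}` JSON shape) and that the metric names reach Prometheus unchanged, which depends on the StatsD exporter mapping. The function names and the choice of required metrics are hypothetical and not taken from `nvflare_quantum_readiness_gate.py`.

```python
from typing import Any, Dict, List

# Two names from the metric contract above; which metrics the real gate
# requires is an assumption made for this sketch.
REQUIRED_METRICS: List[str] = ["quantum_path_ready", "quantum_proof_verify_count"]


def metric_present(response: Dict[str, Any]) -> bool:
    """Return True if an instant-query response holds at least one sample."""
    if response.get("status") != "success":
        return False
    return len(response.get("data", {}).get("result", [])) > 0


def evaluate_gate(responses: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build a PASS/FAIL report from per-metric query responses."""
    missing = [
        name for name in REQUIRED_METRICS
        if not metric_present(responses.get(name, {}))
    ]
    return {"status": "PASS" if not missing else "FAIL", "missing": missing}
```

Keeping the evaluation separate from the HTTP calls makes the PASS/FAIL logic testable without a running Prometheus instance.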
Copilot AI review requested due to automatic review settings March 27, 2026 21:21
Contributor

Copilot AI left a comment


Pull request overview

Adds a new “sovereign quantum” monitoring profile (metrics + readiness gate + Grafana dashboard) and hardens XGBoost/integration workflows to be more stable and backward-compatible across common CI/dev environments.

Changes:

  • Introduces QuantumProofMetricsCollector, a readiness-gate script, and a pre-provisioned Grafana dashboard for the new observability profile.
  • Improves XGBoost example/integration stability via synthetic dataset fallback and deterministic dataset sizing.
  • Hardens integration test harness and adjusts recipe APIs (SimEnv, FedAvg, XGBVerticalRecipe) for better compatibility.
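
As a rough illustration of the synthetic fallback and deterministic sizing, the sketch below writes a seeded HIGGS-shaped CSV (one label column followed by 28 features) and derives split sizes from the row count. The PR implements this in `prepare_data.sh`; the Python function names, the 10% validation share, and the Gaussian feature values here are assumptions for illustration only.

```python
import csv
import random
from pathlib import Path
from typing import Tuple

NUM_FEATURES = 28  # HIGGS rows are: label, then 28 feature values


def write_synthetic_higgs(path: Path, rows: int, seed: int = 0) -> None:
    """Write a deterministic, HIGGS-shaped CSV (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed gives identical files across runs
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for _ in range(rows):
            label = float(rng.randint(0, 1))
            features = [round(rng.gauss(0.0, 1.0), 6) for _ in range(NUM_FEATURES)]
            writer.writerow([label] + features)


def derive_sizes(available_rows: int, valid_percent: int = 10) -> Tuple[int, int]:
    """Return (size_total, size_valid) from the rows actually available.

    The 10% validation share is an assumption, not the script's rule.
    """
    if available_rows < 2:
        raise ValueError("need at least 2 rows to split into train and validation")
    size_valid = max(1, available_rows * valid_percent // 100)
    return available_rows, size_valid
```

Seeding the generator is what makes the fallback suitable for CI: two runs produce byte-identical data, so test results do not drift.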

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/unit_test/metrics/quantum_readiness_gate_test.py | Adds unit coverage for readiness gate behavior and report generation. |
| tests/unit_test/metrics/quantum_proof_metrics_collector_test.py | Adds unit coverage for quantum metrics collector event handling and metric emission. |
| tests/integration_test/system_test.py | Adds stale-process cleanup + checks setup/teardown command exit codes. |
| tests/integration_test/src/utils.py | Adds stale-process cleanup helper for integration runs. |
| tests/integration_test/src/__init__.py | Exports the new cleanup helper from the integration test utils package. |
| tests/integration_test/data/test_configs/standalone_job/xgb_tree.yml | Enables synthetic-data fallback flag when preparing XGBoost data in integration runs. |
| tests/integration_test/data/test_configs/standalone_job/xgb_histogram.yml | Enables synthetic-data fallback flag when preparing XGBoost data in integration runs. |
| nvflare/recipe/sim_env.py | Adjusts SimEnv deployment to pass explicit client lists to simulator runs. |
| nvflare/recipe/fedavg.py | Relaxes model-source validation to preserve script-driven/backward-compatible flows. |
| nvflare/metrics/quantum_proof_metrics_collector.py | New collector emitting “quantum path” and proof-lifecycle metrics. |
| nvflare/metrics/__init__.py | Exposes QuantumProofMetricsCollector at package level. |
| nvflare/app_opt/xgboost/recipes/vertical.py | Allows per_site_config to be omitted (defaults to {}). |
| examples/advanced/xgboost/fedxgb/prepare_data.sh | Adds strict bash mode, dynamic dataset sizing, and synthetic dataset fallback. |
| examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py | New Prometheus/Grafana readiness gate script producing PASS/FAIL report. |
| examples/advanced/monitoring/sovereign/README.md | Docs for integrating the collector, stack startup, and readiness gate usage. |
| examples/advanced/monitoring/setup/grafana/provisioning/dashboards/nvflare_sovereign_quantum_ops.json | New provisioned Grafana dashboard for the profile. |
| examples/advanced/monitoring/setup/grafana/provisioning/dashboards/dashboards.yaml | Adds Grafana provisioning provider for sovereign dashboards. |
| examples/advanced/monitoring/README.md | Links to the new sovereign profile docs. |
| docs/release_notes/flare_272.rst | Release notes entries for the new profile and hardening/compatibility changes. |
Comments suppressed due to low confidence (1)

nvflare/app_opt/xgboost/recipes/vertical.py:212

  • per_site_config now defaults to {}. In configure(), the job only adds client executors/components inside the for site_name, site_config in self.per_site_config.items() loop, so an empty dict yields a job with no client components (and no clear error). If omitting per_site_config is meant to be supported for backward compatibility, consider either (a) raising a clear ValueError when per_site_config is empty, or (b) documenting/handling an explicit "I will add clients later" mode to avoid generating an invalid job silently.
        self.metrics_writer_id = v.metrics_writer_id
        self.in_process = v.in_process
        self.model_file_name = v.model_file_name
        self.per_site_config = per_site_config or {}

        # Configure the job
        self.job = self.configure()
        Recipe.__init__(self, self.job)
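
Option (a) from the comment above, failing loudly on an empty config, could be sketched like this. The helper name and the example config key are hypothetical; this is not the code in `vertical.py`.

```python
from typing import Any, Dict, Optional

SiteConfig = Dict[str, Dict[str, Any]]


def require_per_site_config(per_site_config: Optional[SiteConfig]) -> SiteConfig:
    """Reject None and {} so no job is built without client components.

    Hypothetical helper illustrating option (a) from the review comment.
    """
    if not per_site_config:
        raise ValueError(
            "per_site_config must map at least one site name to its settings, "
            f"got {per_site_config!r}"
        )
    return dict(per_site_config)


# Example with a made-up site name and setting key.
config = require_per_site_config({"site-1": {"data_path": "/tmp/site-1"}})
```

Note that `not per_site_config` treats `None` and `{}` alike, which is the behavior a later review in this thread describes for the updated guard.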


Comment thread tests/integration_test/src/__init__.py Outdated
Comment thread nvflare/recipe/sim_env.py
Comment thread examples/advanced/xgboost/fedxgb/prepare_data.sh
Comment thread nvflare/metrics/quantum_proof_metrics_collector.py
Comment thread examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py Outdated
Comment thread tests/integration_test/src/utils.py Outdated
@greptile-apps
Contributor

greptile-apps Bot commented Mar 27, 2026

Greptile Summary

This PR bundles three distinct areas of work: a new "Sovereign Quantum-Proof Observability" profile (QuantumProofMetricsCollector, readiness gate script, autotune helper, and a pre-provisioned Grafana dashboard), hardening fixes for XGBoost and recipe APIs (SimEnv, FedAvgRecipe, XGBVerticalRecipe), and integration test hygiene (stale-process cleanup, synthetic-data fallback for HIGGS).

The functional correctness of the recipe and integration-test changes is solid. However, the new observability module contains a blocking defect and the Docker Compose update introduces a breaking change for existing users:

  • nvflare/metrics/quantum_proof_metrics_collector.py — A duplicated if event == EventType.ABORT_TASK: at lines 72–73 (both at the same indentation, with no body between them) is a Python IndentationError. The module cannot be imported, which also breaks nvflare/metrics/__init__.py and causes all four unit tests in quantum_proof_metrics_collector_test.py to fail immediately with an ImportError.
  • examples/advanced/monitoring/setup/docker-compose.yml — The nginx password-file volume mount uses the :? expansion operator with no default, so any user who pulls this change without exporting the new environment variable will get a hard docker compose up failure. The Grafana password has a sensible :-admin fallback; the nginx credential path should be treated consistently.
  • nvflare/recipe/sim_env.py — The n_clients/clients conflict fix is correct; always generating explicit site names avoids the ambiguous state that caused earlier integration failures.
  • prepare_data.sh — The synthetic-data fallback and dynamic size_total/size_valid derivation are well-guarded and a clear CI improvement.
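
The SimEnv point above, always resolving to an explicit list of site names rather than forwarding both `n_clients` and `clients`, can be illustrated with the sketch below. The function name is hypothetical, and the `site-N` naming pattern is an assumption about the simulator's defaults rather than something stated in this PR.

```python
from typing import List, Optional


def resolve_client_names(n_clients: Optional[int], clients: Optional[List[str]]) -> List[str]:
    """Collapse the two inputs into one explicit client list (hypothetical helper)."""
    if clients:
        if n_clients is not None and n_clients != len(clients):
            raise ValueError(
                f"n_clients={n_clients} conflicts with {len(clients)} explicit clients"
            )
        return list(clients)
    if n_clients is None or n_clients < 1:
        raise ValueError("provide either a client list or a positive n_clients")
    # Assumed naming convention: site-1, site-2, ...
    return [f"site-{i}" for i in range(1, n_clients + 1)]
```

With a single resolved list, downstream code never has to decide which of the two parameters wins.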

Confidence Score: 3/5

Not safe to merge: the new metrics module contains a Python IndentationError that prevents it from loading, and the Docker Compose change is a breaking update for existing monitoring users.

Two concrete defects block merge: (1) a P0 IndentationError in quantum_proof_metrics_collector.py that prevents the module and its __init__.py re-export from loading at all, causing all related unit tests to fail; (2) a P1 breaking change in docker-compose.yml that requires a new environment variable with no default, causing docker compose up to fail for all existing monitoring users. The recipe and integration-test changes are correct and well-tested, but the observability feature that forms the headline of this PR cannot ship in its current state.

nvflare/metrics/quantum_proof_metrics_collector.py (duplicate if IndentationError) and examples/advanced/monitoring/setup/docker-compose.yml (missing default for required env variable).
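
The duplicate-`if` defect can be reproduced in isolation. The snippet below is a stand-in for the reported pattern, not the collector's actual source; it compiles a string so the error can be observed without breaking the example itself. One detail worth noting: the exception Python raises for this pattern is `IndentationError`, a subclass of `SyntaxError`, and it is raised when the module is compiled at import time.

```python
# Stand-in for the reported pattern: two `if` lines at the same indentation,
# with no body after the first one.
BROKEN_SOURCE = (
    "def handle(event):\n"
    "    if event == 'ABORT_TASK':\n"
    "    if event == 'ABORT_TASK':\n"
    "        return 'failure'\n"
)

FIXED_SOURCE = (
    "def handle(event):\n"
    "    if event == 'ABORT_TASK':\n"
    "        return 'failure'\n"
)


def compile_error_name(source: str) -> str:
    """Return the exception class name raised by compile(), or 'ok'."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError as err:  # IndentationError is a SyntaxError subclass
        return type(err).__name__
    return "ok"


print(compile_error_name(BROKEN_SOURCE))
print(compile_error_name(FIXED_SOURCE))
```

A check like this in CI (or simply importing the module in a smoke test) catches the defect before any unit test runs.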

Important Files Changed

| Filename | Overview |
| --- | --- |
| nvflare/metrics/quantum_proof_metrics_collector.py | New QuantumProofMetricsCollector FLComponent — contains a duplicate if event == EventType.ABORT_TASK: at lines 72–73 that is a Python IndentationError, preventing the module (and nvflare.metrics) from loading at all. |
| nvflare/metrics/__init__.py | Exports QuantumProofMetricsCollector; inherits the P0 IndentationError from the collector module — this __init__.py import will also fail at runtime until the duplicate if is fixed. |
| examples/advanced/monitoring/setup/docker-compose.yml | Adds nginx authentication proxy for Prometheus and binds all ports to localhost; requires a new env variable with no default, breaking existing deployments that do not set it. |
| nvflare/recipe/sim_env.py | Always generates explicit client names instead of passing n_clients; cleanly avoids the clients/n_clients conflict for jobs that already specify targets via job.to(). |
| nvflare/recipe/fedavg.py | Cosmetic-only changes: reorders the three None conditions in the model-source guard and removes a blank line; both ValueError guards remain intact and functional. |
| nvflare/app_opt/xgboost/recipes/vertical.py | Changes per_site_config is None guard to not per_site_config; ValueError is still raised for None input (confirmed by new test), with the minor side effect that an empty dict {} now also raises instead of producing a silently misconfigured recipe. |
| examples/advanced/xgboost/fedxgb/prepare_data.sh | Adds set -euo pipefail, synthetic-data fallback via NVFLARE_XGB_ALLOW_SYNTHETIC_DATA, and dynamic size_total/size_valid derivation from actual dataset row count; well-guarded against edge cases. |
| tests/integration_test/src/utils.py | Adds cleanup_stale_integration_processes() which sends SIGTERM then SIGKILL to NVFlare process patterns; path-fragment patterns could match unrelated processes in shared CI environments. |
| tests/integration_test/system_test.py | Calls cleanup_stale_integration_processes() before each test run and adds non-zero exit code checks for setup/teardown commands; both are solid test hygiene improvements. |
| examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py | New standalone readiness-gate script with Prometheus/Grafana health checks, metric presence validation, and consecutive-breach congestion detection; logic is correct and well-tested. |
| tests/unit_test/metrics/quantum_proof_metrics_collector_test.py | Comprehensive unit tests for QuantumProofMetricsCollector — all four tests will fail with an ImportError until the duplicate if syntax error in the source module is fixed. |
| examples/advanced/monitoring/sovereign/nvflare_quantum_autotune.py | New standalone auto-tune script that derives congestion thresholds from Prometheus range queries; logic is sound with good handling of low-traffic samples and clear profile recommendations. |
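
The SIGTERM-then-SIGKILL escalation described for the cleanup helper can be sketched on a single process handle. This is deliberately narrower than `cleanup_stale_integration_processes()`: it operates on a `subprocess.Popen` object the caller already owns instead of matching command-line patterns, which sidesteps the shared-CI false-match concern noted above. The function name and grace period are hypothetical.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Request a graceful exit, then force-kill if the process lingers.

    On POSIX, terminate() sends SIGTERM and kill() sends SIGKILL.
    """
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Usage: start a long-running child and stop it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child)
```

The grace period gives well-behaved processes a chance to flush state before the hard kill.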

Sequence Diagram

sequenceDiagram
    participant FL as FL Runtime
    participant QPMC as QuantumProofMetricsCollector
    participant CB as collect_metrics / DataBus
    participant PM as Prometheus (via StatsD)

    FL->>QPMC: START_RUN
    QPMC->>CB: quantum_path{ready=1, attestation_mode, kex_mode}
    QPMC->>CB: quantum_pqc_controls{migration_enabled, legacy_lock_enabled}

    FL->>QPMC: BEFORE_TASK_EXECUTION
    QPMC->>QPMC: _proof_verify_start_time = now()
    QPMC->>CB: quantum_proof_verify (counter)

    alt Task completes
        FL->>QPMC: AFTER_TASK_EXECUTION
        QPMC->>CB: quantum_proof_verify_success (counter)
        QPMC->>CB: quantum_proof_verify (elapsed gauge)
        QPMC->>QPMC: clear _proof_verify_start_time
    else Task aborted
        FL->>QPMC: ABORT_TASK
        QPMC->>CB: quantum_proof_verify_failure (counter)
        QPMC->>QPMC: clear _proof_verify_start_time
    end

    FL->>QPMC: BEFORE_AGGREGATION
    QPMC->>QPMC: _aggregation_start_time = now()
    QPMC->>CB: quantum_proof_aggregation (counter)

    FL->>QPMC: END_AGGREGATION
    QPMC->>CB: quantum_proof_aggregation_success (counter)
    QPMC->>CB: quantum_proof_aggregation (elapsed gauge)

    CB-->>PM: StatsD/Prometheus scrape
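
The proof-verify branch of the diagram above can be expressed as a small standalone state machine. This is a sketch only: it does not subclass FLComponent or call any NVFlare API, event types are plain strings, the class name is hypothetical, and metrics are appended to a list instead of being published. Metric names follow the contract listed in the PR description.

```python
import time
from typing import Callable, List, Optional, Tuple

Metric = Tuple[str, str, float]  # (name, kind, value)


class ProofLifecycleSketch:
    """Hypothetical stand-in for the collector's proof-verify handling."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._verify_start: Optional[float] = None
        self.emitted: List[Metric] = []

    def handle_event(self, event: str) -> None:
        if event == "BEFORE_TASK_EXECUTION":
            self._verify_start = self._clock()
            self.emitted.append(("quantum_proof_verify_count", "counter", 1.0))
        elif event == "AFTER_TASK_EXECUTION":
            self.emitted.append(("quantum_proof_verify_success_count", "counter", 1.0))
            if self._verify_start is not None:
                elapsed = self._clock() - self._verify_start
                self.emitted.append(("quantum_proof_verify_time_taken", "gauge", elapsed))
            self._verify_start = None  # clear so a later event cannot reuse it
        elif event == "ABORT_TASK":  # handled exactly once
            self.emitted.append(("quantum_proof_verify_failure_count", "counter", 1.0))
            self._verify_start = None
```

Injecting the clock keeps the elapsed-time gauge deterministic in tests; the aggregation branch would follow the same start/stop pattern.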

Comments Outside Diff (2)

  1. nvflare/metrics/quantum_proof_metrics_collector.py, line 72-76 (link)

    Duplicate if causes IndentationError — module cannot be imported

    Lines 72–73 contain two consecutive if event == EventType.ABORT_TASK: guards at the same indentation level. Python requires an indented block after an if statement; an immediately following statement at the same indentation is an IndentationError. This means the entire nvflare.metrics.quantum_proof_metrics_collector module (and by extension nvflare.metrics via its __init__.py) will fail to import at runtime. All four unit tests in quantum_proof_metrics_collector_test.py will also fail with an import error.

  2. examples/advanced/monitoring/setup/docker-compose.yml, line 126-127 (link)

    Missing default for required env variable breaks existing monitoring setups

    The :? operator on this volume mount causes Docker Compose to abort startup with an error if the environment variable is unset or empty. Unlike GF_SECURITY_ADMIN_PASSWORD on line 140 which ships with an admin default, the nginx password-file path has no fallback. Existing users of the monitoring stack who pull this change will see a hard failure on docker compose up until they create the password file and export the variable.

    Consider using the :- operator with a path to the bundled example file as a fallback, and update the README to document this as a new required setup step before merging.
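
    To make the difference between the two operators concrete, the sketch below mimics their semantics in Python: `${NAME:?message}` aborts when the variable is unset or empty, while `${NAME:-default}` substitutes a fallback. The variable name `NGINX_HTPASSWD_FILE` and the example path are hypothetical, since the actual names in docker-compose.yml are not quoted in this thread.

```python
from typing import Mapping


def expand_required(env: Mapping[str, str], name: str, message: str) -> str:
    """Mimic ${NAME:?message}: fail when NAME is unset or empty."""
    value = env.get(name, "")
    if value == "":
        raise ValueError(f"{name}: {message}")
    return value


def expand_default(env: Mapping[str, str], name: str, default: str) -> str:
    """Mimic ${NAME:-default}: fall back when NAME is unset or empty."""
    value = env.get(name, "")
    return value if value != "" else default


# Hypothetical variable name and fallback path, for illustration only.
print(expand_default({}, "NGINX_HTPASSWD_FILE", "./nginx/htpasswd.example"))
```

    With the fallback form, an existing user who has not exported the variable still gets a working stack, at the cost of running with the example credentials until they override them.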

Reviews (15): Last reviewed commit: "Merge branch 'NVIDIA:main' into main" | Re-trigger Greptile

Comment thread nvflare/app_opt/xgboost/recipes/vertical.py Outdated
Comment thread nvflare/recipe/fedavg.py Outdated
Comment thread nvflare/metrics/quantum_proof_metrics_collector.py
Comment thread tests/integration_test/src/utils.py Outdated
@rwilliamspbg-ops
Author

ready to test out

@chesterxgchen chesterxgchen requested a review from ZiyueXu77 March 28, 2026 23:55
@chesterxgchen
Collaborator

@greptileai

Comment thread nvflare/recipe/fedavg.py
@chesterxgchen
Collaborator

@rwilliamspbg-ops looks like your code style check failed. You can check this offline with:

./runtest.sh -s
./runtest.sh -f --- force code style format

@chesterxgchen
Collaborator

please address the review comments

- Apply Black/isort formatting to quantum autotune and collector tests
- Replace PEP 604 union annotations with Optional[...] in readiness gate for Python 3.9 compatibility
- Validate with black, isort, flake8, and metrics unit tests

Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>
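
For context on the annotation change in the commit above: on Python 3.9, evaluating `str | None` at function definition time raises a `TypeError`, whereas `Optional[str]` works on every supported version. The function below is a hypothetical example of the compatible spelling, not code from the readiness gate.

```python
from typing import Optional


def parse_threshold(raw: Optional[str] = None) -> Optional[float]:
    """Return the threshold as a float, or None when no value was supplied.

    `parse_threshold` is a made-up name; only the annotation style matters.
    """
    if raw is None or raw.strip() == "":
        return None
    return float(raw)
```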
Notes:

- Reworked hello-lr README to follow hello-pt section structure (installation, data, code structure, client code, server side, job recipe, run job, result).

- Standardized code-structure formatting around client.py / job.py and noted model.py is not present in this example.

- Fixed isort ordering in tests/integration_test/system_test.py to satisfy style checks.

- Verified style checks with ./runtest.sh -s (black, isort, flake8 all passed).
@rwilliamspbg-ops
Author

Should be ready now.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.



Comment thread nvflare/recipe/sim_env.py
Comment thread nvflare/app_opt/xgboost/recipes/vertical.py
Comment thread tests/integration_test/test_xgb_vertical_recipe.py
Comment thread docs/release_notes/flare_272.rst
Comment thread examples/hello-world/hello-lr/README.md Outdated
Comment thread tests/integration_test/src/utils.py
Comment thread tests/integration_test/src/utils.py Outdated
rwilliamspbg-ops and others added 2 commits March 29, 2026 21:16
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Author

@rwilliamspbg-ops rwilliamspbg-ops left a comment


Ready for review

@chesterxgchen
Collaborator

Please address the comments from Greptile; once they are addressed, you can resolve the conversation.

rwilliamspbg-ops and others added 2 commits March 30, 2026 19:47
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
3 participants