integration: harden XGBoost stability and add sovereign quantum-proof observability profile #4374
rwilliamspbg-ops wants to merge 22 commits into NVIDIA:main
Summary
-------
Improves integration test stability and xgboost backend validation reliability
through process hygiene, fail-fast error detection, and deterministic data
preparation with synthetic fallback support.
Changes
-------
- SimEnv: Fixed client parameter conflict (clients vs. n_clients) in deploy()
- FedAvg recipe: Restored backward compatibility for script-driven flows
- XGBoost vertical recipe: Made per_site_config optional, defaulting to {}
- Integration harness:
* Added stale-process cleanup before test runs
* Added fail-fast return code checks for setup/teardown commands
* Exported cleanup helper from integration utils package
- XGBoost data prep (prepare_data.sh):
* Hardened shell options (set -euo pipefail)
* Added synthetic data generation as CI/dev fallback
* Adaptive data sizing based on available rows
- Integration test configs (xgb_histogram.yml, xgb_tree.yml):
* Use explicit 'env' invocation for subprocess compatibility
* Enable synthetic-data fallback for deterministic CI runs
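The fail-fast return-code check listed for setup/teardown commands could look roughly like the sketch below. The helper name `run_checked` is hypothetical and not taken from the integration harness; the sketch relies only on the standard-library `subprocess` module.

```python
import subprocess
from typing import List


def run_checked(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first non-zero exit code.

    Hypothetical helper: the real harness may structure this differently.
    """
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface the failing command and its stderr instead of
            # continuing with a half-prepared test environment.
            raise RuntimeError(
                f"command {cmd!r} failed with exit code {result.returncode}: "
                f"{result.stderr.strip()}"
            )
```

Raising on the first failure means a broken setup step is reported directly, rather than showing up later as an unrelated test failure.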
Validation
----------
- Backend xgboost integration: 2/2 tests pass
- Performance recipe suites: 11/11 tests pass
- License check: passed
Related to NVFlare integration reliability tracking in 2.7.2.
…adiness gate
Summary
-------
Introduces production-grade quantum-aware observability for federated learning,
inspired by Sovereign-Mohawk-Proto. Enables operational visibility of proof
lifecycle events, PQC control posture, and deterministic readiness validation
for high-security FL deployments.
Changes
-------
Core Components:
- nvflare/metrics/quantum_proof_metrics_collector.py: Lightweight FLComponent for emitting proof-lifecycle and PQC posture metrics to StatsD-compatible monitoring stacks.
- examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py: Automated PASS/FAIL validation script for Prometheus/Grafana health and quantum-path metric availability.
- examples/advanced/monitoring/sovereign/README.md: Integration guide with configuration examples and deployment instructions.
Monitoring Stack Integration:
- Grafana dashboard provisioning: NVFlare Sovereign Quantum Ops dashboard with 6 panels (quantum path status, PQC controls, proof success rate, verify latency, aggregation throughput, timeline).
- Grafana provisioning config (dashboards.yaml) for auto-discovery.
- examples/advanced/monitoring/README.md: Added reference to sovereign profile.
Tests:
- tests/unit_test/metrics/quantum_proof_metrics_collector_test.py: 3 tests covering posture publication, proof flow metrics, aggregation flow.
- tests/unit_test/metrics/quantum_readiness_gate_test.py: 4 tests covering target health checks, metric presence validation, quantum-path queries, and report generation.
Validation
----------
- Unit tests: 7/7 pass
- Collector import: successful
- Grafana dashboard JSON: valid, 6 panels configured
- License check: passed
Metric Contract
---------------
The collector emits:
- quantum_path_ready, quantum_pqc_controls_migration_enabled, quantum_pqc_controls_legacy_lock_enabled (gauges for posture)
- quantum_proof_verify_count, _success_count, _failure_count, _time_taken (counters/gauges for request lifecycle)
- quantum_proof_aggregation_count, _success_count, _time_taken (aggregation metrics)
Part of NVFlare 2.7.2 Sovereign Quantum profile for operational compliance and auditable proof-path validation.
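A metric-presence check against the contract above might be sketched as follows. The full metric names are an assumption: the abbreviated entries (`_success_count` and so on) are expanded here by prefixing them with the preceding metric family, and `missing_metrics` is a hypothetical helper, not part of the readiness gate script.

```python
from typing import Iterable, List

# Names expanded from the metric contract; the expansion of the
# abbreviated "_success_count"-style entries is an assumption.
EXPECTED_METRICS: List[str] = [
    "quantum_path_ready",
    "quantum_pqc_controls_migration_enabled",
    "quantum_pqc_controls_legacy_lock_enabled",
    "quantum_proof_verify_count",
    "quantum_proof_verify_success_count",
    "quantum_proof_verify_failure_count",
    "quantum_proof_verify_time_taken",
    "quantum_proof_aggregation_count",
    "quantum_proof_aggregation_success_count",
    "quantum_proof_aggregation_time_taken",
]


def missing_metrics(observed: Iterable[str]) -> List[str]:
    """Return contract metrics that were not observed, in contract order."""
    seen = set(observed)
    return [name for name in EXPECTED_METRICS if name not in seen]
```

A gate built on this idea would report FAIL whenever `missing_metrics` returns a non-empty list.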
Pull request overview
Adds a new “sovereign quantum” monitoring profile (metrics + readiness gate + Grafana dashboard) and hardens XGBoost/integration workflows to be more stable and backward-compatible across common CI/dev environments.
Changes:
- Introduces QuantumProofMetricsCollector, a readiness-gate script, and a pre-provisioned Grafana dashboard for the new observability profile.
- Improves XGBoost example/integration stability via synthetic dataset fallback and deterministic dataset sizing.
- Hardens integration test harness and adjusts recipe APIs (SimEnv, FedAvg, XGBVerticalRecipe) for better compatibility.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit_test/metrics/quantum_readiness_gate_test.py | Adds unit coverage for readiness gate behavior and report generation. |
| tests/unit_test/metrics/quantum_proof_metrics_collector_test.py | Adds unit coverage for quantum metrics collector event handling and metric emission. |
| tests/integration_test/system_test.py | Adds stale-process cleanup + checks setup/teardown command exit codes. |
| tests/integration_test/src/utils.py | Adds stale-process cleanup helper for integration runs. |
| tests/integration_test/src/__init__.py | Exports the new cleanup helper from the integration test utils package. |
| tests/integration_test/data/test_configs/standalone_job/xgb_tree.yml | Enables synthetic-data fallback flag when preparing XGBoost data in integration runs. |
| tests/integration_test/data/test_configs/standalone_job/xgb_histogram.yml | Enables synthetic-data fallback flag when preparing XGBoost data in integration runs. |
| nvflare/recipe/sim_env.py | Adjusts SimEnv deployment to pass explicit client lists to simulator runs. |
| nvflare/recipe/fedavg.py | Relaxes model-source validation to preserve script-driven/backward-compatible flows. |
| nvflare/metrics/quantum_proof_metrics_collector.py | New collector emitting “quantum path” and proof-lifecycle metrics. |
| nvflare/metrics/__init__.py | Exposes QuantumProofMetricsCollector at package level. |
| nvflare/app_opt/xgboost/recipes/vertical.py | Allows per_site_config to be omitted (defaults to {}). |
| examples/advanced/xgboost/fedxgb/prepare_data.sh | Adds strict bash mode, dynamic dataset sizing, and synthetic dataset fallback. |
| examples/advanced/monitoring/sovereign/nvflare_quantum_readiness_gate.py | New Prometheus/Grafana readiness gate script producing PASS/FAIL report. |
| examples/advanced/monitoring/sovereign/README.md | Docs for integrating the collector, stack startup, and readiness gate usage. |
| examples/advanced/monitoring/setup/grafana/provisioning/dashboards/nvflare_sovereign_quantum_ops.json | New provisioned Grafana dashboard for the profile. |
| examples/advanced/monitoring/setup/grafana/provisioning/dashboards/dashboards.yaml | Adds Grafana provisioning provider for sovereign dashboards. |
| examples/advanced/monitoring/README.md | Links to the new sovereign profile docs. |
| docs/release_notes/flare_272.rst | Release notes entries for the new profile and hardening/compatibility changes. |
Comments suppressed due to low confidence (1)
nvflare/app_opt/xgboost/recipes/vertical.py:212
per_site_config now defaults to {}. In configure(), the job only adds client executors/components inside the for site_name, site_config in self.per_site_config.items() loop, so an empty dict yields a job with no client components (and no clear error). If omitting per_site_config is meant to be supported for backward compatibility, consider either (a) raising a clear ValueError when per_site_config is empty, or (b) documenting/handling an explicit "I will add clients later" mode to avoid generating an invalid job silently.
self.metrics_writer_id = v.metrics_writer_id
self.in_process = v.in_process
self.model_file_name = v.model_file_name
self.per_site_config = per_site_config or {}
# Configure the job
self.job = self.configure()
Recipe.__init__(self, self.job)
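Option (a) from the comment above, raising a clear error on an empty config, could be sketched like this. `resolve_per_site_config` is a hypothetical standalone helper for illustration and is not code from vertical.py.

```python
from typing import Any, Dict, Optional


def resolve_per_site_config(
    per_site_config: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Dict[str, Any]]:
    """Return a non-empty per-site config or raise a clear error.

    Hypothetical helper: rejects both None and {} so that a job with
    no client components is never built silently.
    """
    if not per_site_config:
        raise ValueError(
            "per_site_config is empty: no client executors would be added "
            "to the job. Provide at least one site entry."
        )
    # Copy so later mutation by the caller does not affect the recipe.
    return dict(per_site_config)
```

Option (b) would instead keep the empty default and require the "add clients later" mode to be requested explicitly; which of the two fits the recipe is a design decision for the PR author.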
Greptile Summary
This PR bundles three distinct areas of work: a new "Sovereign Quantum-Proof Observability" profile ( The functional correctness of the recipe and integration-test changes is solid. However, the new observability module contains a blocking defect and the Docker Compose update introduces a breaking change for existing users:
Confidence Score: 3/5
Not safe to merge: the new metrics module contains a Python IndentationError that prevents it from loading, and the Docker Compose change is a breaking update for existing monitoring users. Two concrete defects block merge: (1) a P0 IndentationError in
Important Files Changed
Sequence Diagram
sequenceDiagram
participant FL as FL Runtime
participant QPMC as QuantumProofMetricsCollector
participant CB as collect_metrics / DataBus
participant PM as Prometheus (via StatsD)
FL->>QPMC: START_RUN
QPMC->>CB: quantum_path{ready=1, attestation_mode, kex_mode}
QPMC->>CB: quantum_pqc_controls{migration_enabled, legacy_lock_enabled}
FL->>QPMC: BEFORE_TASK_EXECUTION
QPMC->>QPMC: _proof_verify_start_time = now()
QPMC->>CB: quantum_proof_verify (counter)
alt Task completes
FL->>QPMC: AFTER_TASK_EXECUTION
QPMC->>CB: quantum_proof_verify_success (counter)
QPMC->>CB: quantum_proof_verify (elapsed gauge)
QPMC->>QPMC: clear _proof_verify_start_time
else Task aborted
FL->>QPMC: ABORT_TASK
QPMC->>CB: quantum_proof_verify_failure (counter)
QPMC->>QPMC: clear _proof_verify_start_time
end
FL->>QPMC: BEFORE_AGGREGATION
QPMC->>QPMC: _aggregation_start_time = now()
QPMC->>CB: quantum_proof_aggregation (counter)
FL->>QPMC: END_AGGREGATION
QPMC->>CB: quantum_proof_aggregation_success (counter)
QPMC->>CB: quantum_proof_aggregation (elapsed gauge)
CB-->>PM: StatsD/Prometheus scrape
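The verify branch of the sequence above can be sketched as a small event-driven tracker. This is a toy stand-in under stated assumptions, not the QuantumProofMetricsCollector itself: event names are plain strings rather than NVFlare event constants, metric names follow the metric contract from the commit message, and metrics are kept in memory instead of being published to the DataBus.

```python
import time
from collections import Counter
from typing import Callable, Dict, Optional


class ProofLifecycleTracker:
    """Toy model of the verify-path bookkeeping (hypothetical class)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._verify_start: Optional[float] = None
        self.counters: Counter = Counter()
        self.gauges: Dict[str, float] = {}

    def handle_event(self, event: str) -> None:
        if event == "BEFORE_TASK_EXECUTION":
            # Record the start time and count the verification attempt.
            self._verify_start = self._clock()
            self.counters["quantum_proof_verify_count"] += 1
        elif event == "AFTER_TASK_EXECUTION":
            self.counters["quantum_proof_verify_success_count"] += 1
            if self._verify_start is not None:
                elapsed = self._clock() - self._verify_start
                self.gauges["quantum_proof_verify_time_taken"] = elapsed
            self._verify_start = None
        elif event == "ABORT_TASK":
            # Failure path: count it and clear the timer without a gauge.
            self.counters["quantum_proof_verify_failure_count"] += 1
            self._verify_start = None
```

Injecting the clock keeps the elapsed-time gauge testable without sleeping; the aggregation branch would follow the same pattern with its own start time.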
ready to test out
@rwilliamspbg-ops looks like your code style check failed; you can check this offline with ./runtest.sh -s
please address the review comments
- Apply Black/isort formatting to quantum autotune and collector tests
- Replace PEP 604 union annotations with Optional[...] in readiness gate for Python 3.9 compatibility
- Validate with black, isort, flake8, and metrics unit tests
Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>
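The annotation change mentioned in this commit can be illustrated with a minimal sketch. The function `parse_timeout` is hypothetical and only demonstrates the pattern; it is not taken from the readiness gate script.

```python
from typing import Optional


def parse_timeout(raw: Optional[str], default: float = 5.0) -> float:
    """Parse an optional timeout string into seconds."""
    # Writing the annotation as `str | None` (PEP 604) is evaluated when the
    # function is defined and raises TypeError on Python 3.9, unless
    # `from __future__ import annotations` postpones evaluation.
    # `Optional[str]` works on every supported version.
    if raw is None:
        return default
    return float(raw)
```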
Notes:
- Reworked hello-lr README to follow hello-pt section structure (installation, data, code structure, client code, server side, job recipe, run job, result).
- Standardized code-structure formatting around client.py / job.py and noted model.py is not present in this example.
- Fixed isort ordering in tests/integration_test/system_test.py to satisfy style checks.
- Verified style checks with ./runtest.sh -s (black, isort, flake8 all passed).
Should be ready now.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Please address the comments from Greptile; once addressed, you can resolve the conversation.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Description
This PR introduces a new operational profile for quantum-safe federated learning observability and includes system hardening fixes for XGBoost integrations and recipe APIs.
🔐 Sovereign Quantum-Proof Observability
New Component: Added QuantumProofMetricsCollector to emit PQC posture and proof-lifecycle metrics (verification success, latencies, etc.).
Readiness Gate: Introduced nvflare_quantum_readiness_gate.py to automate Prometheus and Grafana health validation.
Dashboards: Added a pre-provisioned Grafana dashboard for "Sovereign Quantum Ops."
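The PASS/FAIL reporting step of a readiness gate might be sketched as below. `CheckResult` and `build_report` are hypothetical names; the sketch takes already-collected check outcomes as input and does not query Prometheus or Grafana itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def build_report(results: List[CheckResult]) -> str:
    """Summarize check outcomes; an empty result list counts as FAIL."""
    status = "PASS" if results and all(r.passed for r in results) else "FAIL"
    lines = [f"Quantum readiness gate: {status}"]
    for r in results:
        mark = "ok" if r.passed else "FAILED"
        suffix = f" ({r.detail})" if r.detail else ""
        lines.append(f"- {r.name}: {mark}{suffix}")
    return "\n".join(lines)
```

Treating an empty result list as FAIL is a deliberate choice in this sketch: a gate that ran no checks should not report readiness.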
🛠 XGBoost & Integration Hardening
Data Resilience: Updated prepare_data.sh to support synthetic data fallback when the HIGGS dataset is unavailable.
API Compatibility: Made per_site_config optional in XGBVerticalRecipe and relaxed model-source validation in FedAvgRecipe, restoring backward compatibility for script-driven flows.
Test Hygiene: Added stale-process cleanup to integration tests to prevent environment contamination.
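The synthetic-data fallback idea can be sketched in Python as follows. The PR implements it in prepare_data.sh, so this is an illustration of the technique rather than the script's logic; the function names, the seed, and the 10% validation share are assumptions. The row layout (label first, then 28 features) mirrors the HIGGS CSV format.

```python
import csv
import os
import random
from typing import List, Tuple


def load_or_synthesize(
    path: str, n_rows: int = 1000, n_features: int = 28, seed: int = 0
) -> List[List[float]]:
    """Load rows from `path` if present, else build a deterministic synthetic set."""
    if os.path.isfile(path):
        with open(path, newline="") as f:
            return [[float(x) for x in row] for row in csv.reader(f) if row]
    # Fixed seed so CI runs see identical data on every invocation.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = float(rng.random() < 0.5)
        features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        rows.append([label] + features)
    return rows


def split_sizes(total_rows: int, valid_percent: int = 10) -> Tuple[int, int]:
    """Size the train/validation split from the rows actually available."""
    valid = max(1, total_rows * valid_percent // 100) if total_rows > 1 else 0
    return total_rows - valid, valid
```

Deriving split sizes from the row count, rather than hard-coding them, is what lets the same pipeline run on both the full dataset and a small synthetic one.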
📝 Documentation
Updated release notes and monitoring guides for the new sovereign profile.
Type of Changes
[x] Non-breaking change (fix or new feature)
[x] In-line docstrings updated
[x] Documentation updated
[ ] New tests added