Add hello-numpy-robust example with optional median aggregation and poisoned-client simulation#4373
Conversation
Greptile SummaryThis PR adds a Key observations:
Confidence Score: 4/5Merging is blocked by the previously-flagged but still-unresolved P1 defect where --poison_client_name "" breaks argument parsing in the documented disable poisoning example. Three P1 issues from prior review rounds remain unaddressed in the current code (silent empty return in aggregate_model, missing state reset in aggregate_model, and broken empty-string argument passing in job.py). These are real defects in the example paths documented by the README. Because they are unresolved, a score of 4/5 is appropriate rather than the default 5/5 for P2-only findings. examples/hello-world/hello-numpy-robust/job.py (empty poison_client_name argument handling) and examples/hello-world/hello-numpy-robust/custom_aggregators.py (aggregate_model silent return and missing state reset). Important Files Changed
Sequence DiagramsequenceDiagram
participant U as User (job.py)
participant R as NumpyFedAvgRecipe
participant S as Server (ScatterAndGather)
participant A as Aggregator (Default or MedianAggregator)
participant C as Clients (client.py x N)
U->>R: NumpyFedAvgRecipe(aggregator=...)
U->>R: recipe.execute(SimEnv)
loop Each round
S->>C: Broadcast global model
C->>C: train(params) → maybe_poison_update()
C->>S: Send FLModel (FULL or DIFF)
S->>A: accept_model(client_model) x N
S->>A: aggregate_model()
A-->>S: FLModel (median or weighted-avg)
S->>S: reset() → reset_stats()
end
S-->>U: run.get_result() / run.get_status()
Reviews (2): Last reviewed commit: "Merge branch 'main' into feat/hello-nump..." | Re-trigger Greptile |
| def aggregate_model(self) -> FLModel: | ||
| if not self._client_models: | ||
| self.error("No client models to aggregate.") | ||
| return FLModel(params={}) |
There was a problem hiding this comment.
Silent empty return masks aggregation data preparation failures
When _client_models is empty, the method logs an error and returns FLModel(params={}). This silent empty return can propagate downstream — the server persistor or controller receives an empty model dict and may overwrite a previously-good checkpoint with empty params, or silently skip aggregation without surfacing the root cause.
Per the project's error-handling convention, an empty aggregation input is an exceptional state that should fail loudly with a descriptive ValueError:
| def aggregate_model(self) -> FLModel: | |
| if not self._client_models: | |
| self.error("No client models to aggregate.") | |
| return FLModel(params={}) | |
| def aggregate_model(self) -> FLModel: | |
| if not self._client_models: | |
| raise ValueError( | |
| "MedianAggregator.aggregate_model called with no accepted client models. " | |
| "Ensure at least one client submitted a model before aggregation." | |
| ) |
This surfaces the problem immediately at the aggregation step rather than letting an empty model silently propagate.
Rule Used: When data_loader might be empty during loss comput... (source)
Learnt From
NVIDIA/NVFlare#4001
| def aggregate_model(self) -> FLModel: | ||
| if not self._client_models: | ||
| self.error("No client models to aggregate.") | ||
| return FLModel(params={}) | ||
|
|
||
| aggregated_params = {} | ||
| for key in self._client_models[0].keys(): | ||
| stacked = np.stack([m[key] for m in self._client_models], axis=0) | ||
| aggregated_params[key] = np.median(stacked, axis=0) | ||
|
|
||
| return FLModel(params=aggregated_params, params_type=self._params_type) |
There was a problem hiding this comment.
aggregate_model() does not reset internal state
The ModelAggregator base class docstring for aggregate_model states it "needs to implement aggregation logic and reset any internal stats". The current implementation aggregates but leaves _client_models populated, so a second call to aggregate_model() (without an intervening reset_stats()) would re-aggregate the same client updates.
While the framework's ScatterAndGather controller calls reset() → reset_stats() between rounds, custom orchestration or direct unit-test usage could hit a double-aggregate bug. Consider clearing state at the end of aggregate_model():
def aggregate_model(self) -> FLModel:
...
result = FLModel(params=aggregated_params, params_type=self._params_type)
self.reset_stats() # clear accumulated state per base-class contract
return result| train_args = ( | ||
| f"--update_type {args.update_type} " | ||
| f"--poison_client_name {args.poison_client_name} " | ||
| f"--poison_scale {args.poison_scale}" | ||
| ) |
There was a problem hiding this comment.
Empty
poison_client_name breaks client argument parsing
When --poison_client_name "" is passed to job.py (the documented way to disable poisoning), args.poison_client_name is an empty string. The f-string produces a train_args string where --poison_client_name is followed by nothing before --poison_scale. When the script runner tokenizes that string, --poison_scale becomes the next positional token and argparse in client.py will raise an error because it expects a value for --poison_client_name, not another flag.
The documented "disable poisoning" invocation from the README therefore fails at runtime.
To properly pass an empty string as an argument token, wrap the value with shlex.quote:
import shlex
train_args = (
f"--update_type {args.update_type} "
f"--poison_client_name {shlex.quote(args.poison_client_name)} "
f"--poison_scale {args.poison_scale}"
)shlex.quote("") produces '', which the shell/tokenizer passes correctly as an empty-string argument to the client.
There was a problem hiding this comment.
Pull request overview
Adds a new hello-world NumPy example demonstrating robustness to poisoned/outlier clients by optionally switching from default FedAvg aggregation to an element-wise median aggregator, with accompanying docs and a small unit test.
Changes:
- Introduce
hello-numpy-robustexample (job launcher, client script, custom median aggregator) with optional poisoned-client simulation. - Add example README and register the example in
examples/hello-world/README.md. - Add targeted unit tests for the custom median aggregator.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/hello-world/hello-numpy-robust/custom_aggregators.py | Adds MedianAggregator implementation for robust aggregation. |
| examples/hello-world/hello-numpy-robust/job.py | Adds simulator job entrypoint with --aggregator and poisoning controls. |
| examples/hello-world/hello-numpy-robust/client.py | Adds client training loop with optional poisoned-update scaling and metrics logging. |
| examples/hello-world/hello-numpy-robust/requirements.txt | Declares dependencies for running the new example. |
| examples/hello-world/hello-numpy-robust/README.md | Documents running baseline vs median aggregation and poisoning toggles. |
| examples/hello-world/hello-numpy-robust/tests/test_custom_aggregators.py | Adds unit tests for median aggregation behavior and error handling. |
| examples/hello-world/README.md | Registers and documents the new example in the hello-world index. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from custom_aggregators import MedianAggregator | ||
|
|
||
| from nvflare.app_common.abstract.fl_model import FLModel, ParamsType | ||
| from nvflare.app_common.np.constants import NPConstants |
There was a problem hiding this comment.
The test imports MedianAggregator via from custom_aggregators import MedianAggregator, but when running pytest from the repo root (default), the example directory isn’t on sys.path, so this will raise ModuleNotFoundError. Add a small sys.path adjustment in the test (e.g., insert the example root directory), or package the example so the module can be imported reliably under the repo’s test runner.
| Disable poisoning: | ||
|
|
||
| ```bash | ||
| python job.py --aggregator median --poison_client_name "" |
There was a problem hiding this comment.
The “Disable poisoning” command uses --poison_client_name "", but job.py forwards args to the client by building a whitespace-split string; an empty value will be dropped, causing --poison_client_name to have no argument and the client script to fail argparse parsing. Update the instructions (and corresponding code) to use a safe disable mechanism (e.g., --poison_client_name none or --no-poison).
| python job.py --aggregator median --poison_client_name "" | |
| python job.py --aggregator median --poison_client_name none |
| train_args = ( | ||
| f"--update_type {args.update_type} " | ||
| f"--poison_client_name {args.poison_client_name} " | ||
| f"--poison_scale {args.poison_scale}" | ||
| ) |
There was a problem hiding this comment.
--poison_client_name is documented as “use empty string to disable”, but train_args is built via string concatenation. When poison_client_name is empty, TaskScriptRunner later splits args on whitespace, producing --poison_client_name --poison_scale ... (missing value) and the client argparse will fail. Use a boolean flag (e.g., --poison/--no-poison) or a sentinel value (e.g., none) and only include --poison_client_name <value> in train_args when a non-empty value is intended; update client defaults accordingly.
| train_args = ( | |
| f"--update_type {args.update_type} " | |
| f"--poison_client_name {args.poison_client_name} " | |
| f"--poison_scale {args.poison_scale}" | |
| ) | |
| train_args_parts = [ | |
| f"--update_type {args.update_type}", | |
| f"--poison_scale {args.poison_scale}", | |
| ] | |
| # Only include poison_client_name if non-empty; empty string disables poisoning. | |
| if args.poison_client_name != "": | |
| train_args_parts.insert(1, f"--poison_client_name {args.poison_client_name}") | |
| train_args = " ".join(train_args_parts) |
| @@ -0,0 +1,67 @@ | |||
| # Hello NumPy Robust Aggregation | |||
There was a problem hiding this comment.
Please follow the readme.md from hello-pt for readme structure (https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-pt/README.md)
installation
data
code structure
client code
server side
Job Recipe
Run Job
Result
etc.
For the code structure,
please follow
client.py
model.py ( if exists)
job.py
format when possible
|
@rwilliamspbg-ops Numpy Recipe was a misstep we made, We did not have time to fix in 2.7.2 as it was discovered late close to the release date. As the FedAvg recipe itself should cover Numpy, there should no need for a separate numpyfedavg recipe. We were plan to remove it and fold into FedAvg or simply remove NumPy Example all togather. If you think your example is important to expand the FedAvg algorithms, please consider add it to the PyTorch FedAvg example such as hello-pt or hello-lightning for pytorch lighnining, or in the examples/advanced examples |
|
sounds good |
Summary
Scope
Validation