fix(service): race condition in force_reload()#98
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to ensure runtime state/task behavior updates correctly when elections-related configuration changes, and improves CI artifact collection for nodectl e2e runs.
Changes:
- Add an info log when a forced runtime cache reload is triggered.
- Change elections include/exclude REST handlers to signal
config_changedinstead of directly restarting the elections task. - Update the nodectl e2e GitHub Actions workflow to always upload logs and set artifact retention/behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/node-control/service/src/runtime_config.rs |
Adds logging around force_reload() to improve observability of cache reload triggers. |
src/node-control/service/src/http/http_server_task.rs |
Routes elections include/exclude mutations through the central config_changed notification mechanism. |
.github/workflows/nodectl-e2e.yml |
Adjusts e2e artifact upload to run always and adds retention/handling options. |
Comments suppressed due to low confidence (1)
.github/workflows/nodectl-e2e.yml:78
- The e2e runner (
run_singlehost_nodectl.py) writes the bootstrap/script log tosrc/node/tests/test_run_net_py/singlehost-bootstrap.logby default (not undertmp/). This workflow removed the bootstrap log from the uploaded artifact, so failures/successes may be harder to diagnose. Consider addingsrc/node/tests/test_run_net_py/singlehost-bootstrap.logto thepathlist (or otherwise ensure the script log is included).
if-no-files-found: warn
path: |
src/node/tests/test_run_net_py/tmp/nodectl-config.json
src/node/tests/test_run_net_py/tmp/nodectl-service.log
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Use after REST mutations that change structural config (entities, endpoints). | ||
| pub async fn force_reload(&self) -> anyhow::Result<()> { | ||
| let config = (*self.get()).clone(); | ||
| tracing::info!("force reloading is triggered"); |
There was a problem hiding this comment.
The log message reads a bit ungrammatically ("force reloading is triggered") and is likely to become a frequently-emitted info log. Consider rewording to something clearer/consistent with other logs (e.g., "force reload triggered") and/or lowering to debug if this is primarily for troubleshooting.
| tracing::info!("force reloading is triggered"); | |
| tracing::info!("force reload triggered"); |
| } | ||
| }) | ||
| .map_err(|e| AppError::internal(e.to_string()))?; | ||
|
|
There was a problem hiding this comment.
Switching from directly restarting elections_task to config_changed.notify_one() means applying include/exclude now depends on the service loop successfully completing runtime_cfg.force_reload() before tasks are restarted. Since force_reload() reopens the vault and reconnects RPC, a transient vault/RPC failure can prevent elections include/exclude from taking effect even though the config update succeeded. Consider keeping a direct elections-task restart for this endpoint (or using a lighter signal that restarts tasks without a full cache rebuild) so enable/disable can still apply when reload fails for unrelated reasons.
| if let Err(e) = state.elections_task.restart().await { | |
| tracing::warn!( | |
| "failed to directly restart elections task after exclude update: {e}" | |
| ); | |
| } |
|
|
||
| let task = state.elections_task.clone(); | ||
| tokio::spawn(async move { | ||
| let _ = task.restart().await; | ||
| }); | ||
| state.config_changed.notify_one(); | ||
|
|
There was a problem hiding this comment.
Same concern as in the exclude handler: using config_changed.notify_one() makes the include operation’s effect contingent on a successful runtime_cfg.force_reload() cycle (vault/RPC reconnect). If reload fails, tasks won’t restart and the new bindings enablement may not take effect until a later reload succeeds. Consider restarting the elections task directly for this change (or separating “restart tasks” from “rebuild caches”).
Summary
force_reload()where config snapshot was captured beforereload(), but by the timereload()finished rebuilding runtime state (vault, RPC, wallets, pools), the config could have been mutated by another concurrent REST API request — causing the runtime state to be rebuilt from a stale configreloadinto two clear methods:reload_state()(reads current config from store, used byforce_reload) andreload(config)(accepts an external config, used for file-change reloads)