Feature request: Failure-tolerant unites for fanout branches
Summary
Add a setting to let unites proceed even if one or more fanout branches end in ERRORED, with clear controls for when to release and which outputs to aggregate.
Context
Per the docs, unites synchronizes parallel paths by waiting on all states with a given identifier, and fingerprinting ensures a single unification state. The state lifecycle includes EXECUTED and ERRORED. Today the behavior is strict and assumes all branches complete before the unify node can run. In practice, large fanouts often have a few failures due to transient errors or bad inputs. We want the unify step to still run using the successful branches. ([Exosphere Docs][1])
Problem
When a fanout produces many branches and a subset fails, the unify node either never runs or must be retried after manual cleanup. This blocks downstream work and increases cost and latency for otherwise healthy runs.
Proposed API
Extend the unites block with optional fields. Defaults preserve current behavior.
{
  "node_name": "ResultMergerNode",
  "identifier": "result_merger",
  "inputs": {
    "x_processed": "${{ processor_1.outputs.processed_data }}"
  },
  "unites": {
    "identifier": "data_splitter",
    "release_when": "all_done",   // "all_done" or "threshold_met"
    "min_success": null,          // integer >= 1, used only if release_when = "threshold_met"
    "success_ratio": null,        // 0.0 to 1.0, optional alternative to min_success
    "aggregate": "success_only",  // "success_only" or "all"
    "error_strategy": "ignore"    // "ignore" or "attach"
  },
  "next_nodes": []
}
- release_when
  - all_done keeps current barrier behavior and releases once all branches reach a terminal state.
  - threshold_met releases early when min_success or success_ratio is satisfied.
- aggregate
  - success_only passes only outputs from successful branches to inputs of the unify node.
  - all also includes placeholders for failed branches.
- error_strategy
  - ignore drops failed branch payloads from aggregation.
  - attach provides a structured list of {branch_id, error} in self.meta["unites_errors"] for observability.
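The release rule described by release_when can be sketched as a small predicate. This is only an illustration under stated assumptions: UnitesConfig and should_release are hypothetical names, the fanout size is assumed to be known, and success_ratio is assumed to be measured against the full fanout size. The proposal does not say what happens if every branch is terminal and the threshold was never met, so the sketch simply returns False in that case and leaves the decision to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitesConfig:
    """Hypothetical in-memory form of the proposed `unites` block."""
    identifier: str
    release_when: str = "all_done"        # "all_done" or "threshold_met"
    min_success: Optional[int] = None     # used only with "threshold_met"
    success_ratio: Optional[float] = None # alternative to min_success


def should_release(cfg: UnitesConfig, total: int, succeeded: int, failed: int) -> bool:
    """Return True once the unify node may run.

    total:     number of branches in the fanout group (assumed known)
    succeeded: branches that reached EXECUTED
    failed:    branches that reached ERRORED
    """
    if total <= 0:
        return False
    terminal = succeeded + failed
    if cfg.release_when == "all_done":
        # Barrier behavior: wait until every branch is terminal.
        return terminal >= total
    if cfg.release_when == "threshold_met":
        if cfg.min_success is not None:
            return succeeded >= cfg.min_success
        if cfg.success_ratio is not None:
            # Assumption: ratio is successes over the full fanout size.
            return succeeded / total >= cfg.success_ratio
        return False  # no threshold configured; validation should catch this
    raise ValueError(f"unknown release_when: {cfg.release_when!r}")
```

With the default configuration, a group of 5 with 4 successes and 1 failure releases only once all 5 are terminal. With release_when = "threshold_met" and success_ratio = 0.5, a group of 10 releases as soon as 5 successes are counted.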
Examples
- Wait for all, ignore failures at aggregation time
  "unites": {
    "identifier": "data_splitter",
    "release_when": "all_done",
    "aggregate": "success_only",
    "error_strategy": "attach"
  }
- Release early once a quorum of successes is reached
  "unites": {
    "identifier": "data_splitter",
    "release_when": "threshold_met",
    "min_success": 3,
    "aggregate": "success_only"
  }
Acceptance criteria
- Given a fanout of 5 branches where 1 fails
  - With release_when = "all_done", the unify node runs after all 5 are terminal and receives an array of 4 successful outputs. The error list is exposed in node metadata when error_strategy = "attach".
- Given a fanout of 10 branches where 4 are slow and 1 fails
  - With release_when = "threshold_met" and success_ratio = 0.5, the unify node runs once 5 successes are present, without waiting for slow or failed branches.
- Graph validation rejects configs where both min_success and success_ratio are set at the same time, or where the threshold exceeds the theoretical fanout size when known.
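The validation criterion above could look roughly like the following sketch. validate_unites is a hypothetical helper, not an existing Exosphere function. It takes the unites block as a plain dict and returns a list of error messages. The check that threshold_met requires one of the two thresholds is an added assumption that the criteria above do not state.

```python
from typing import Any, Dict, List, Optional


def validate_unites(unites: Dict[str, Any], fanout_size: Optional[int] = None) -> List[str]:
    """Return validation errors for a proposed `unites` block (empty list if valid)."""
    errors: List[str] = []
    release_when = unites.get("release_when", "all_done")
    min_success = unites.get("min_success")
    success_ratio = unites.get("success_ratio")

    if release_when not in ("all_done", "threshold_met"):
        errors.append("release_when must be 'all_done' or 'threshold_met'")
    if unites.get("aggregate", "success_only") not in ("success_only", "all"):
        errors.append("aggregate must be 'success_only' or 'all'")
    if unites.get("error_strategy", "ignore") not in ("ignore", "attach"):
        errors.append("error_strategy must be 'ignore' or 'attach'")

    # From the acceptance criteria: the two thresholds are mutually exclusive.
    if min_success is not None and success_ratio is not None:
        errors.append("min_success and success_ratio cannot both be set")

    if min_success is not None:
        is_int = isinstance(min_success, int) and not isinstance(min_success, bool)
        if not is_int or min_success < 1:
            errors.append("min_success must be an integer >= 1")
        elif fanout_size is not None and min_success > fanout_size:
            # From the acceptance criteria: only checkable when the size is known.
            errors.append("min_success exceeds the known fanout size")

    if success_ratio is not None:
        is_num = isinstance(success_ratio, (int, float)) and not isinstance(success_ratio, bool)
        if not is_num or not 0.0 <= success_ratio <= 1.0:
            errors.append("success_ratio must be between 0.0 and 1.0")

    # Assumption (not stated in the proposal): threshold_met needs a threshold.
    if release_when == "threshold_met" and min_success is None and success_ratio is None:
        errors.append("threshold_met requires min_success or success_ratio")

    return errors
```

Returning a list rather than raising on the first problem lets graph validation report every issue in one pass.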
Design notes
Observability
- Emit metrics unites_total, unites_successful, unites_failed, unites_release_reason with labels for graph, identifier, and attempt.
- Dashboard
  - Badge on the unify node indicating partial when failures were ignored or early release occurred, plus a drawer listing failed branches.
Open questions
- Should early release cancel still-running branches to save compute, or allow them to finish in the background for fuller results?
- For aggregate = "all", do we want positional alignment with nulls for failed branches, or a list of objects with explicit branch_id fields?
- Is there any need for a per-node override on retry policy before counting a branch as failed for quorum calculations?
Design notes (continued)
Backward compatible defaults
- All new fields are optional, and their defaults preserve current behavior.
Scheduler logic
- Today, unites defers until all states with the identifier are complete, and fingerprinting ensures only one unify state. Add counters for successful and terminal counts within the group to evaluate thresholds. Keep fingerprinting unchanged. ([Exosphere Docs][1])
Aggregation
- For an input such as ${{ processor_1.outputs.processed_data }}, the engine already collects outputs across the group. With aggregate = "success_only", include only successful branch payloads. With error_strategy = "attach", store error summaries in unify state metadata for downstream inspection and dashboards.
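The aggregation behavior described above might be sketched as follows. BranchResult and aggregate_group are hypothetical names used only for illustration, and the real engine's data model may differ. For aggregate = "all", the sketch assumes positional None placeholders for failed branches, which is one of the two options still listed as an open question.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class BranchResult:
    """Hypothetical record of one fanout branch in a terminal state."""
    branch_id: str
    status: str                  # "EXECUTED" or "ERRORED"
    output: Any = None
    error: Optional[str] = None


def aggregate_group(
    results: List[BranchResult],
    mode: str = "success_only",
    error_strategy: str = "ignore",
) -> Tuple[List[Any], Dict[str, Any]]:
    """Return (outputs for the unify node, metadata for the unify state)."""
    if mode == "success_only":
        outputs = [r.output for r in results if r.status == "EXECUTED"]
    elif mode == "all":
        # Assumption: failed branches keep their position as None placeholders.
        outputs = [r.output if r.status == "EXECUTED" else None for r in results]
    else:
        raise ValueError(f"unknown aggregate mode: {mode!r}")

    meta: Dict[str, Any] = {}
    if error_strategy == "attach":
        # Structured list of {branch_id, error}, as described for "attach".
        meta["unites_errors"] = [
            {"branch_id": r.branch_id, "error": r.error}
            for r in results
            if r.status == "ERRORED"
        ]
    return outputs, meta
```

For a fanout of 5 where 1 branch fails, success_only with attach yields 4 outputs and a single entry in meta["unites_errors"].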