Skip to content

fix: preserve broadcast withdrawals during orphan recovery#5752

Merged
Scottcjn merged 4 commits into
Scottcjn:mainfrom
ItsOtherMauridian:fix-payout-broadcast-recovery
May 20, 2026
Merged

fix: preserve broadcast withdrawals during orphan recovery#5752
Scottcjn merged 4 commits into
Scottcjn:mainfrom
ItsOtherMauridian:fix-payout-broadcast-recovery

Conversation

@ItsOtherMauridian
Copy link
Copy Markdown
Contributor

Summary

Fixes #5751.

recover_orphans() was refunding every withdrawal with status='processing'. That is correct for pre-broadcast crash recovery, but unsafe for rows that already have a tx_hash and were intentionally left in processing/manual-reconciliation state after a post-broadcast completion-update failure.

This PR restricts automatic orphan refunds to processing withdrawals without a broadcast transaction hash:

WHERE status = 'processing'
AND (tx_hash IS NULL OR tx_hash = '')

Rows with tx_hash remain untouched for operator reconciliation instead of being internally refunded and marked failed.

Tests

Added node/tests/test_payout_worker_recovery.py covering:

  • processing withdrawal without tx hash is refunded and marked failed;
  • processing withdrawal with tx hash is preserved and not refunded.

Validation:

PYTHONPATH=node python3 -B -m pytest -q node/tests/test_payout_worker_recovery.py --tb=short -p no:cacheprovider
# 2 passed
python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.py
git diff --check

@github-actions github-actions Bot added BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) node Node server related size/M PR: 51-200 lines tests Test suite changes labels May 19, 2026
Copy link
Copy Markdown

@TJCurnutte TJCurnutte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. I focused on the orphan-recovery path because this patch narrows recover_orphans() to processing withdrawals that do not already have a broadcast transaction hash.

Validation run at head e3dab0e964b84e851e886f95174342faf18cdc0c:

  • git diff --check origin/main...HEAD -- node/payout_worker.py node/tests/test_payout_worker_recovery.py passed.
  • python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.py passed.
  • python3 -B -m pytest node/tests/test_payout_worker_recovery.py -q passed with 2 passed in 0.08s.
  • Direct SQLite origin-main-vs-PR probe for a processing withdrawal with tx_hash = '0xalready_broadcast': origin_main refunded balance 89.0 -> 100.0 and marked the row failed / Orphaned processing state recovered; this PR left the row at (89.0, 'processing', '0xalready_broadcast', 'manual reconciliation required').

That preserves the manual-reconciliation boundary for already-broadcast withdrawals while keeping the existing refund path for processing rows with no transaction hash.

Copy link
Copy Markdown
Contributor

@jaxint jaxint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work on this PR. 🚀

@ItsOtherMauridian
Copy link
Copy Markdown
Contributor Author

Addressed in the latest commit.

I added reconcile_broadcast_withdrawals() plus a lookup_withdrawal_status(tx_hash) hook. The worker now:

  • refunds only pre-broadcast processing rows without a tx_hash,
  • preserves unknown broadcast rows for reconciliation,
  • marks confirmed broadcast rows completed without refunding balance.

Validation: 4 passed for node/tests/test_payout_worker_recovery.py, plus py_compile and git diff --check.

@github-actions github-actions Bot added size/L PR: 201-500 lines and removed size/M PR: 51-200 lines labels May 19, 2026
Copy link
Copy Markdown

@kevinyan911 kevinyan911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #5752

Reviewer: @kevinyan911
Wallet: RTCcd1dd903b3cbbfca24c30bd98973931a4af53302

Reviewed 2 changed files: node/payout_worker.py, node/tests/test_payout_worker_recovery.py.

What this PR does

Fixes recover_orphans() to only refund withdrawals that are processing AND have no tx_hash. Previously, any processing withdrawal was refunded, including those that already had a broadcast transaction hash — unsafe because it would refund a withdrawal that was intentionally left in manual-reconciliation state after a post-broadcast update failure.

Code quality

  • SQL constraint WHERE status='processing' AND (tx_hash IS NULL OR tx_hash='') is correct.
  • Explicit comment explains the reasoning — future maintainers will understand why tx_hash presence gates the refund.
  • Regression tests cover both cases: refund-without-hash (should refund) and refund-with-hash (should preserve).
  • py_compile clean, git diff --check clean.

APPROVED — clean, scoped fix with good test coverage.


Code review bounty claim submitted to rustchain-bounties

@Scottcjn
Copy link
Copy Markdown
Owner

@ItsOtherMauridian — high-quality batch overall. Triaging the rest of your 7 PRs in detail.

Already merged this round

Holding for Scott (consensus/security-critical)

These each look correct and well-tested, but they touch consensus-adjacent or money-path code where deploy coordination matters:

PR Concern
#5740 preserve enrolled weights in anti-double-mining settlement Settlement reward weight reconstruction — nodes diverging mid-rollout could compute different rewards
#5742 epoch enroll snapshot for claims participation Claims path uses epoch_enroll instead of rolling miner_attest_recent — correct, but affects all in-flight delayed claims
#5744 bind claims to registered miner key Security fix (prevents key-substitution against existing miners). Compatible-by-design but should verify against all live miners' stored keys
#5746 reserve claims before settlement broadcast Concurrency fix in settlement worker — prevents duplicate broadcasts
#5750 ignore epoch finality from state sync State-sync security (peer sig != finality vote). Affects peer-trust model
#5752 preserve broadcast withdrawals during orphan recovery Money-safety (prevents double-refund on tx_hash != null rows)

Each is a real fix with comprehensive tests. The right deploy is probably: merge #5752 + #5750 + #5744 first (money-safety + security path, less consensus risk), then #5740 + #5742 + #5746 as a coordinated settlement-path update with all 4 attestation nodes restarted in the same epoch window.

@Scottcjn — your call. I'd be comfortable merging #5752, #5750, #5744 immediately if you give the nod; the others I'd hold for the next settlement-path deploy.

Wallet/miner ID acknowledged as ItsOtherMauridian (username) per your PR bodies. Payouts will go to that phantom balance once #5753 wallet-parser-fix lands; happy to move to a real wallet if you provide one.

Copy link
Copy Markdown

@victortran0904 victortran0904 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for a money-path edge case that this patch still leaves open.

The new orphan filter treats tx_hash IS NULL OR tx_hash = '' as proof that a processing withdrawal was never broadcast, but that is not a durable invariant in this worker. process_withdrawal() marks the row processing and commits the balance debit before calling execute_withdrawal(), then writes tx_hash only after that call returns. If the process dies after the chain broadcast succeeds but before the DB update persists tx_hash, the row restarts as processing with a null hash and recover_orphans() will refund it and mark it failed. That can still create the double-refund/double-spend condition this PR is trying to prevent.

The fix needs a durable state boundary before broadcast, or orphan recovery should not auto-refund these indeterminate processing rows based only on a missing tx_hash.

There is also a smaller reconciliation issue: when lookup_withdrawal_status(tx_hash) returns False, the new code only rewrites error_msg; it leaves the withdrawal in processing and does not restore the deducted balance or move the row to a terminal/manual state. That means known-failed broadcasts remain selected forever by reconcile_broadcast_withdrawals() and the miner cannot retry without out-of-band DB surgery.

Copy link
Copy Markdown
Contributor

@surim0n surim0n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the payout-worker orphan recovery change and ran the new recovery test file.\n\nWhat I checked:\n- The orphan refund query now only touches processing withdrawals without a tx_hash, which avoids refunding a withdrawal that may already have been broadcast.\n- The new reconciliation hook keeps unknown broadcast transactions in processing by default, which is safer than marking them failed without chain evidence.\n- Confirmed transactions move to completed without refunding the account balance, preserving the original debit semantics.\n- The failed/unknown broadcast path records a manual-reconciliation message instead of creating an automatic double-spend/refund risk.\n\nVerification run:\n- python3 -m pytest node/tests/test_payout_worker_recovery.py -q -> 4 passed\n\nNo blocking findings. The main implementation gap is intentional: lookup_withdrawal_status is still a stub, so production safety depends on wiring it to a real chain lookup before relying on automatic completion. As written, the default behavior is conservative and preserves funds state.

Copy link
Copy Markdown

@kevinyan911 kevinyan911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #5752

Reviewer: @kevinyan911
Wallet: RTCcd1dd903b3cbbfca24c30bd98973931a4af53302

What this PR does

Adds lookup_withdrawal_status(tx_hash) hook to payout_worker.py and reconcile_broadcast_withdrawals() method. The hook returns True (confirmed), False (known failed), or None (unknown). reconcile_broadcast_withdrawals() queries all status='processing' withdrawals with a non-null tx_hash, calls the hook, and updates the status to completed or failed accordingly. Uses SQLite BEGIN IMMEDIATE for serialization.

Code quality

  • lookup_withdrawal_status default returns None — preserves manual reconciliation when hook is not overridden — correct design.
  • BEGIN IMMEDIATE acquires a write lock immediately rather than deferring — prevents the race condition between concurrent workers reading approved rows.
  • chain_status is Nonecontinue — correctly skips unknown chain status without marking withdrawal as failed.
  • processed_at set on completion, error_msg = NULL to clear previous error — correct state transition.

APPROVED — proper concurrency-safe withdrawal reconciliation.


Code review bounty claim submitted to rustchain-bounties

@ItsOtherMauridian
Copy link
Copy Markdown
Contributor Author

ItsOtherMauridian commented May 20, 2026

Addressed in commit 4f6ca03.

I removed the automatic refund for processing withdrawals with missing tx_hash, because that state is ambiguous after a crash between broadcast success and DB persistence. recover_orphans() now leaves the debit/status intact and flags the row for manual reconciliation instead of marking it failed/refunding. Broadcast rows with a tx hash still go through reconcile_broadcast_withdrawals().

Updated regression coverage in node/tests/test_payout_worker_recovery.py.

Local validation:

  • python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.py
  • python3 -B -m pytest -q node/tests/test_payout_worker_recovery.py

Copy link
Copy Markdown

@aisoh877 aisoh877 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent in #5751 is to keep the old refund behavior for processing rows that never got a tx_hash, and only exclude the already-broadcast/manual-reconciliation rows from that path. This patch changes the no-tx_hash branch too: recover_orphans() now only writes an error message and leaves those rows in status=processing instead of refunding them and marking them failed.

That means pre-broadcast crashes stop being recoverable and those debits can remain stranded indefinitely while the worker re-flags the same rows every loop. I think the fix needs to preserve the original refund+fail behavior for (tx_hash IS NULL OR tx_hash = ) and only skip the refund for rows that already have a broadcast hash.

@ItsOtherMauridian
Copy link
Copy Markdown
Contributor Author

ItsOtherMauridian commented May 20, 2026

Follow-up pushed in 1cc60ac.

I addressed the bounded reconciliation gap for known-failed broadcast transactions: when lookup_withdrawal_status(tx_hash) returns False, the row now leaves the retry scan by moving to failed while preserving the original debit and tx_hash for manual refund/reconciliation. Added regression coverage for that terminal-without-refund path.

I intentionally kept no-tx_hash processing rows in manual reconciliation instead of auto-refunding them, because that state is ambiguous after a crash between chain broadcast success and DB tx_hash persistence.

Validation:

  • python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.py
  • python3 -B -m pytest -q node/tests/test_payout_worker_recovery.py -> 5 passed
  • git diff --check

@Scottcjn Scottcjn merged commit f0e8f74 into Scottcjn:main May 20, 2026
11 checks passed
@ItsOtherMauridian
Copy link
Copy Markdown
Contributor Author

Payout address: RTC71d6976297ed35377b867f13ed962f54020ef434

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BCOS-L1 Beacon Certified Open Source tier BCOS-L1 (required for non-doc PRs) node Node server related size/L PR: 201-500 lines tests Test suite changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: payout orphan recovery refunds withdrawals after broadcast tx hash

8 participants