fix: preserve broadcast withdrawals during orphan recovery#5752
Conversation
TJCurnutte
left a comment
There was a problem hiding this comment.
Approved. I focused on the orphan-recovery path because this patch narrows recover_orphans() to processing withdrawals that do not already have a broadcast transaction hash.
Validation run at head e3dab0e964b84e851e886f95174342faf18cdc0c:
git diff --check origin/main...HEAD -- node/payout_worker.py node/tests/test_payout_worker_recovery.pypassed.python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.pypassed.python3 -B -m pytest node/tests/test_payout_worker_recovery.py -qpassed with2 passed in 0.08s.- Direct SQLite origin-main-vs-PR probe for a
processingwithdrawal withtx_hash = '0xalready_broadcast':origin_mainrefunded balance89.0 -> 100.0and marked the rowfailed / Orphaned processing state recovered; this PR left the row at(89.0, 'processing', '0xalready_broadcast', 'manual reconciliation required').
That preserves the manual-reconciliation boundary for already-broadcast withdrawals while keeping the existing refund path for processing rows with no transaction hash.
jaxint
left a comment
There was a problem hiding this comment.
LGTM! Great work on this PR. 🚀
|
Addressed in the latest commit. I added
Validation: |
kevinyan911
left a comment
There was a problem hiding this comment.
Code Review — PR #5752
Reviewer: @kevinyan911
Wallet: RTCcd1dd903b3cbbfca24c30bd98973931a4af53302
Reviewed 2 changed files: node/payout_worker.py, node/tests/test_payout_worker_recovery.py.
What this PR does
Fixes recover_orphans() to only refund withdrawals that are processing AND have no tx_hash. Previously, any processing withdrawal was refunded, including those that already had a broadcast transaction hash — unsafe because it would refund a withdrawal that was intentionally left in manual-reconciliation state after a post-broadcast update failure.
Code quality
- SQL constraint
WHERE status='processing' AND (tx_hash IS NULL OR tx_hash='')is correct. - Explicit comment explains the reasoning — future maintainers will understand why
tx_hashpresence gates the refund. - Regression tests cover both cases: refund-without-hash (should refund) and refund-with-hash (should preserve).
py_compileclean,git diff --checkclean.
APPROVED — clean, scoped fix with good test coverage.
Code review bounty claim submitted to rustchain-bounties
|
@ItsOtherMauridian — high-quality batch overall. Triaging the rest of your 7 PRs in detail. Already merged this round
Holding for Scott (consensus/security-critical)These each look correct and well-tested, but they touch consensus-adjacent or money-path code where deploy coordination matters:
Each is a real fix with comprehensive tests. The right deploy is probably: merge #5752 + #5750 + #5744 first (money-safety + security path, less consensus risk), then #5740 + #5742 + #5746 as a coordinated settlement-path update with all 4 attestation nodes restarted in the same epoch window. @Scottcjn — your call. I'd be comfortable merging #5752, #5750, #5744 immediately if you give the nod; the others I'd hold for the next settlement-path deploy. Wallet/miner ID acknowledged as |
victortran0904
left a comment
There was a problem hiding this comment.
Requesting changes for a money-path edge case that this patch still leaves open.
The new orphan filter treats tx_hash IS NULL OR tx_hash = '' as proof that a processing withdrawal was never broadcast, but that is not a durable invariant in this worker. process_withdrawal() marks the row processing and commits the balance debit before calling execute_withdrawal(), then writes tx_hash only after that call returns. If the process dies after the chain broadcast succeeds but before the DB update persists tx_hash, the row restarts as processing with a null hash and recover_orphans() will refund it and mark it failed. That can still create the double-refund/double-spend condition this PR is trying to prevent.
The fix needs a durable state boundary before broadcast, or orphan recovery should not auto-refund these indeterminate processing rows based only on a missing tx_hash.
There is also a smaller reconciliation issue: when lookup_withdrawal_status(tx_hash) returns False, the new code only rewrites error_msg; it leaves the withdrawal in processing and does not restore the deducted balance or move the row to a terminal/manual state. That means known-failed broadcasts remain selected forever by reconcile_broadcast_withdrawals() and the miner cannot retry without out-of-band DB surgery.
surim0n
left a comment
There was a problem hiding this comment.
Reviewed the payout-worker orphan recovery change and ran the new recovery test file.\n\nWhat I checked:\n- The orphan refund query now only touches processing withdrawals without a tx_hash, which avoids refunding a withdrawal that may already have been broadcast.\n- The new reconciliation hook keeps unknown broadcast transactions in processing by default, which is safer than marking them failed without chain evidence.\n- Confirmed transactions move to completed without refunding the account balance, preserving the original debit semantics.\n- The failed/unknown broadcast path records a manual-reconciliation message instead of creating an automatic double-spend/refund risk.\n\nVerification run:\n- python3 -m pytest node/tests/test_payout_worker_recovery.py -q -> 4 passed\n\nNo blocking findings. The main implementation gap is intentional: lookup_withdrawal_status is still a stub, so production safety depends on wiring it to a real chain lookup before relying on automatic completion. As written, the default behavior is conservative and preserves funds state.
kevinyan911
left a comment
There was a problem hiding this comment.
Code Review — PR #5752
Reviewer: @kevinyan911
Wallet: RTCcd1dd903b3cbbfca24c30bd98973931a4af53302
What this PR does
Adds lookup_withdrawal_status(tx_hash) hook to payout_worker.py and reconcile_broadcast_withdrawals() method. The hook returns True (confirmed), False (known failed), or None (unknown). reconcile_broadcast_withdrawals() queries all status='processing' withdrawals with a non-null tx_hash, calls the hook, and updates the status to completed or failed accordingly. Uses SQLite BEGIN IMMEDIATE for serialization.
Code quality
lookup_withdrawal_statusdefault returnsNone— preserves manual reconciliation when hook is not overridden — correct design.BEGIN IMMEDIATEacquires a write lock immediately rather than deferring — prevents the race condition between concurrent workers readingapprovedrows.chain_status is None→continue— correctly skips unknown chain status without marking withdrawal as failed.processed_atset on completion,error_msg = NULLto clear previous error — correct state transition.
APPROVED — proper concurrency-safe withdrawal reconciliation.
Code review bounty claim submitted to rustchain-bounties
|
Addressed in commit 4f6ca03. I removed the automatic refund for Updated regression coverage in Local validation:
|
aisoh877
left a comment
There was a problem hiding this comment.
The intent in #5751 is to keep the old refund behavior for processing rows that never got a tx_hash, and only exclude the already-broadcast/manual-reconciliation rows from that path. This patch changes the no-tx_hash branch too: recover_orphans() now only writes an error message and leaves those rows in status=processing instead of refunding them and marking them failed.
That means pre-broadcast crashes stop being recoverable and those debits can remain stranded indefinitely while the worker re-flags the same rows every loop. I think the fix needs to preserve the original refund+fail behavior for (tx_hash IS NULL OR tx_hash = ) and only skip the refund for rows that already have a broadcast hash.
|
Follow-up pushed in 1cc60ac. I addressed the bounded reconciliation gap for known-failed broadcast transactions: when I intentionally kept no- Validation:
|
|
Payout address: RTC71d6976297ed35377b867f13ed962f54020ef434 |
Summary
Fixes #5751.
recover_orphans()was refunding every withdrawal withstatus='processing'. That is correct for pre-broadcast crash recovery, but unsafe for rows that already have atx_hashand were intentionally left in processing/manual-reconciliation state after a post-broadcast completion-update failure.This PR restricts automatic orphan refunds to processing withdrawals without a broadcast transaction hash:
Rows with
tx_hashremain untouched for operator reconciliation instead of being internally refunded and marked failed.Tests
Added
node/tests/test_payout_worker_recovery.pycovering:Validation:
PYTHONPATH=node python3 -B -m pytest -q node/tests/test_payout_worker_recovery.py --tb=short -p no:cacheprovider # 2 passed python3 -B -m py_compile node/payout_worker.py node/tests/test_payout_worker_recovery.py git diff --check