
Preserve original exception when sendCancel fails during INSERT#97339

Merged
alexey-milovidov merged 6 commits into master from fix-client-exception-loss-on-insert on Feb 28, 2026

Conversation

@alexey-milovidov
Member

Summary

  • When the client-side parallel parser throws a parsing exception during INSERT (e.g., with "(at row N)"), the catch block in ClientBase::processInsertQuery calls sendCancel() to notify the server
  • sendCancel() performs socket I/O via Connection::sendCancel() which can throw NETWORK_ERROR if the server already closed the connection
  • This new exception escapes the catch block, permanently replacing the original parsing exception — the throw; is never reached
  • The fix wraps cleanup (sendCancel + receiveEndOfQueryForInsert) in an inner try-catch so connection errors cannot replace the original exception

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b841c75e5133904d06b5d02f79485aceff0af3b6&name_0=MasterCI&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix the client reporting NETWORK_ERROR instead of the actual parsing error (with the correct row number) when an INSERT with parallel parsing encounters invalid data.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

In `ClientBase::processInsertQuery`, when the client-side parallel
parser throws a parsing exception (e.g., with the correct row number),
the catch block calls `sendCancel()` to notify the server. But
`sendCancel()` performs socket I/O via `Connection::sendCancel()` which
can throw `NETWORK_ERROR` if the server already closed the connection.
This new exception escapes the catch block, permanently replacing the
original parsing exception: the `throw;` is never reached.

This was observed as a flaky failure in
`01583_parallel_parsing_exception_with_offset`: the test expects
"(at row 2000001)" in the error, but the client reports
"Connection reset by peer" instead.

The fix wraps `sendCancel()` and `receiveEndOfQueryForInsert()` in an
inner try-catch so connection errors during cleanup cannot replace the
original exception.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=b841c75e5133904d06b5d02f79485aceff0af3b6&name_0=MasterCI&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 19, 2026

Workflow [PR], commit [039ec6e]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Feb 19, 2026
@alexey-milovidov
Member Author

Flaky check shows that the test is slow. But it's okay for this test.

@Fgrtue Fgrtue self-assigned this Feb 20, 2026
@alexey-milovidov alexey-milovidov force-pushed the fix-client-exception-loss-on-insert branch from 7a5f7fb to db96399 on February 22, 2026 22:46
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Feb 28, 2026
Contributor

@Fgrtue Fgrtue left a comment


I was not able to reproduce the bug, though I can see that the changed code is error-prone by nature and that this change might fix it.

In the server logs of failed cases I can see the following pattern: 1) the client sends some data, 2) the server tries to send a profile event to the client, 3) the server receives a network error, Connection reset by peer, while writing to the socket. The error happens after about 5 minutes (300 seconds), which is our default receive_timeout.

Claude suggests that there can be a deadlock scenario:

  - Client is sending massive amounts of data and fills the server's TCP receive buffer
  - Server periodically tries to send profile events back
  - Client's TCP receive buffer fills up (client busy sending, not reading responses)
  - Server's send() blocks waiting for buffer space (up to 300 seconds)
  - During this time, the server stops reading from the client
  - Client's TCP send buffer also fills up
  - Both sides are blocked on send operations → effective deadlock

From there I could not find any evidence that the flakiness is related to the overridden exception. However, we can see whether the fix does the job; maybe it will work. It is a valid change anyway.

Merged via the queue into master with commit 8dd47a6 Feb 28, 2026
148 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-client-exception-loss-on-insert branch February 28, 2026 16:21
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 28, 2026
alexey-milovidov added a commit that referenced this pull request Mar 1, 2026
…n-insert

Preserve original exception when sendCancel fails during INSERT
