
Fix misleading decompression errors for missing URLs in glob patterns#99034

Merged
alexey-milovidov merged 3 commits into master from fix-inflate-error-message-for-missing-urls on Mar 16, 2026

Conversation

@alexey-milovidov (Member) commented Mar 9, 2026

Summary

  • When a glob URL pattern expands to URLs that return 404, http_skip_not_found_url_for_globs (enabled by default) silently swallows the HTTP error. But the decompression wrappers (Zlib, Brotli, LZMA) didn't handle a completely empty input stream — they attempted decompression with zero bytes and produced misleading errors like "inflate failed: buffer error" instead of simply returning no rows.
  • Fixed ZlibInflatingReadBuffer, BrotliReadBuffer, and LZMAInflatingReadBuffer to detect empty streams (total_in == 0 and inner buffer at EOF) and return EOF. Zstd and Lz4 already handled this correctly.
  • Added a test covering all 5 compression formats (gzip, zstd, brotli, lz4, lzma).
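The guard described in the second bullet can be sketched with a minimal, self-contained mock — `MockSource` and `InflatingWrapper` are illustrative stand-ins here, not the real ClickHouse `ReadBuffer` classes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal mock of an inner source buffer: for a 404 URL skipped by
// http_skip_not_found_url_for_globs, the stream is empty from the first read.
struct MockSource
{
    std::string data;                 // empty for a skipped 404
    size_t pos = 0;
    bool eof() const { return pos >= data.size(); }
};

// Simplified shape of the fixed nextImpl() in ZlibInflatingReadBuffer
// (BrotliReadBuffer and LZMAInflatingReadBuffer gained the same check).
// total_in mirrors z_stream.total_in: compressed bytes consumed so far.
struct InflatingWrapper
{
    MockSource & in;
    size_t total_in = 0;

    explicit InflatingWrapper(MockSource & in_) : in(in_) {}

    // Returns false on EOF, true when output was produced.
    bool next()
    {
        // The fix: if the stream ended before delivering a single byte,
        // report EOF instead of handing zero bytes to the decompressor.
        if (total_in == 0 && in.eof())
            return false;
        if (in.eof())
            return false;             // normal end of stream
        // ... real code would run inflate() over the pending bytes here ...
        total_in += in.data.size() - in.pos;
        in.pos = in.data.size();
        return true;
    }
};
```

Pre-fix, the first `if` was absent, so the empty stream fell through to the codec call with zero input bytes, producing the "buffer error".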

#49231 (comment)

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix misleading "inflate failed: buffer error" when reading non-existent compressed files via url() table function with glob patterns. Now returns empty result as expected when http_skip_not_found_url_for_globs is enabled.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code


Note

Medium Risk
Touches core decompression read paths used by url() and other inputs; a small logic change could affect EOF/error handling for truncated or unusual streams across multiple codecs.

Overview
Fixes misleading decompression failures when a globbed url() input is skipped due to 404s (via http_skip_not_found_url_for_globs) by treating a completely empty underlying stream as EOF instead of attempting to decompress.

Adds empty-stream detection to ZlibInflatingReadBuffer, LZMAInflatingReadBuffer, and BrotliReadBuffer (tracking total_in for brotli) and introduces a stateless test that queries missing .gz/.zst/.br/.lz4/.xz URLs and expects zero rows without errors.

Written by Cursor Bugbot for commit 350d87b. This will update automatically on new commits.

When a glob URL pattern like `{00..59}` expands to URLs that return 404,
`http_skip_not_found_url_for_globs` (enabled by default) silently swallows
the error. But the decompression wrappers (Zlib, Brotli, LZMA) didn't
handle a completely empty input stream - they attempted decompression
with zero bytes and produced misleading errors like "inflate failed:
buffer error" instead of simply returning no rows.

Fix by detecting empty streams (`total_in == 0` and inner buffer at EOF)
in `ZlibInflatingReadBuffer`, `BrotliReadBuffer`, and
`LZMAInflatingReadBuffer`. Zstd and Lz4 already handled this correctly.

#49231 (comment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
clickhouse-gh bot (Contributor) commented Mar 9, 2026

Workflow [PR], commit [5ff912f]

AI Review

Summary

This PR fixes misleading decompression errors for empty streams returned by missing URLs in glob-based url reads by adding explicit empty-input handling in Brotli, LZMA, and Zlib inflating buffers, and adds a stateless regression test that verifies zero-row behavior across compressed URL suffixes. I did not find any high-confidence correctness, safety, or performance regressions in the touched code.

Missing context

  • ⚠️ No CI run artifacts/logs were provided in this review context.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsHistory.cpp
  • Safe rollout
  • Compilation time

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 9, 2026
@alexey-milovidov (Member, Author) left a comment


But why are we even trying to decompress on a 404?

@alexey-milovidov (Member, Author)

But why are we even trying to decompress on a 404?

We are not decompressing the 404 response body. The flow is:

  1. ReadWriteBufferFromHTTP is created with delayed initialization (delay_initialization = true because current_uri_options.size() == 1 in StorageURLSource).
  2. FormatFactory::getInput wraps it with wrapReadBufferWithCompressionMethod (gzip decompression based on the .gz URL extension). At this point, no HTTP request has been made yet.
  3. When the format reader tries to read, the chain is: Format → GzipDecompression → ReadWriteBufferFromHTTP.
  4. The decompression buffer calls in->nextIfAtEnd(), which triggers ReadWriteBufferFromHTTP::nextImpl() → initialize() → HTTP request → 404.
  5. Because http_skip_not_found_url_for_globs is true, the 404 is caught at ReadWriteBufferFromHTTP.cpp:482 and the buffer returns EOF (empty).
  6. The decompression buffer now has zero bytes of input and tries to run inflate() / lzma_code() / BrotliDecoderDecompressStream() with empty input → fails with a misleading error.

So the decompressor sees an empty stream, not the 404 error page. The fix handles this edge case in the decompression buffers. The alternative would be to propagate the "not found" status through the buffer chain so the decompression wrapper can be skipped entirely, but that would be more invasive — it would require changes to CompressedReadBufferWrapper, FormatFactory, and StorageURLSource.
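The six-step flow above can be modeled with a small self-contained sketch — `LazyHttpBuffer` and `read_all` are illustrative names, not the real ClickHouse classes:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Steps 1 and 4-5: the HTTP buffer initializes lazily, so the request (and
// the 404) only happens on the first read; a skipped 404 means immediate EOF.
struct LazyHttpBuffer
{
    bool requested = false;
    bool not_found = true;            // this URL will return 404
    std::string body;                 // response body when the URL exists

    bool read(std::string & out)      // false = EOF
    {
        if (!requested)
        {
            requested = true;         // HTTP request happens here, not at construction
            if (not_found)
                return false;         // swallowed by http_skip_not_found_url_for_globs
        }
        if (body.empty())
            return false;
        out = std::move(body);
        body.clear();
        return true;
    }
};

// Step 6: what the decompression wrapper effectively does. Pre-fix, a stream
// that ended at zero consumed bytes still reached the codec, which reported
// the confusing "buffer error"; the fixed path returns an empty result.
std::string read_all(LazyHttpBuffer & in, bool guarded)
{
    std::string out, chunk;
    size_t total_in = 0;
    while (in.read(chunk))
    {
        total_in += chunk.size();
        out += chunk;                 // real code would decompress the chunk here
    }
    if (total_in == 0 && !guarded)
        throw std::runtime_error("inflate failed: buffer error");  // pre-fix symptom
    return out;                       // guarded: empty stream -> empty result, 0 rows
}
```

The wrapper cannot know about the 404 at construction time precisely because of the delayed initialization in step 1, which is why the guard belongs inside the read path.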

@alexey-milovidov (Member, Author)

LGTM

clickhouse-gh bot (Contributor) commented Mar 16, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.70% 83.80% +0.10%
Functions 23.90% 23.90% +0.00%
Branches 76.20% 76.30% +0.10%

PR changed-lines coverage: 91.67% (33/36, 0 noise lines excluded)
Diff coverage report
Uncovered code

@alexey-milovidov alexey-milovidov self-assigned this Mar 16, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Mar 16, 2026
Merged via the queue into master with commit a8c86cb Mar 16, 2026
163 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-inflate-error-message-for-missing-urls branch March 16, 2026 21:17
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 16, 2026
zlareb1 added a commit to zlareb1/ClickHouse that referenced this pull request Mar 17, 2026
PR ClickHouse#99034 fixed misleading decompression errors for missing URLs in glob
patterns for gzip, zstd, brotli, lz4, and lzma, but missed bzip2.

Bzip2ReadBuffer::nextImpl() has no empty-stream guard: when the inner
stream is empty (404 URL with http_skip_not_found_url_for_globs=1), it
reaches the eof check at line 126 with ret=BZ_OK and throws
UNEXPECTED_END_OF_FILE instead of returning an empty result.

The other formats handle this via an early check before calling the
decompressor (e.g. ZlibInflatingReadBuffer checks total_in==0 && eof).

This test should fail until the same guard is added to Bzip2ReadBuffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
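The missing bzip2 guard described in that commit can be sketched as follows — the function name and parameters are illustrative only, not the real Bzip2ReadBuffer::nextImpl() signature:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative shape of the proposed guard for Bzip2ReadBuffer::nextImpl():
// bail out with EOF before handing an empty stream to BZ2_bzDecompress.
bool bzip2_next(size_t total_in, bool inner_eof)
{
    // Same early check the other formats use (e.g. ZlibInflatingReadBuffer's
    // total_in == 0 && eof): the stream ended before any compressed byte arrived.
    if (total_in == 0 && inner_eof)
        return false;   // plain EOF -> empty result, no UNEXPECTED_END_OF_FILE
    // ... otherwise feed the pending bytes to BZ2_bzDecompress() ...
    return true;
}
```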
@zlareb1
Copy link
Member

zlareb1 commented Mar 17, 2026

@alexey-milovidov the fix should be made for bz2 as well: #99806

mango766 pushed a commit to mango766/ClickHouse that referenced this pull request Mar 18, 2026
When http_skip_not_found_url_for_globs is enabled and a URL returns 404,
all other compression formats (gzip, zstd, brotli, lz4, lzma) gracefully
return 0 rows. Bzip2 was the only format missing this check, causing it
to throw UNEXPECTED_END_OF_FILE instead.

This adds the same empty-stream guard that was added to the other formats
in ClickHouse#99034, and extends the existing test to cover bzip2.

Closes ClickHouse#99806
alexey-milovidov pushed a commit to mango766/ClickHouse that referenced this pull request Mar 19, 2026
