Fix getting file format from globs in file() function #88947

vitlibar · 2025-10-24T12:11:26Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix getting file format from globs in file() function. Resolves #88920

clickhouse-gh · 2025-10-24T12:11:55Z

Workflow [PR], commit [8e30ae1]

Summary: ❌

job_name	test_name	status	info
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (amd_debug, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (amd_ubsan, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (amd_debug, distributed plan, s3 storage, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stateless tests (arm_binary, parallel)		failure
	00900_long_parquet_load_2	FAIL	cidb
Stress test (arm_asan)		failure
	Server died	FAIL	cidb
	Hung check failed, possible deadlock found (see hung_check.log)	FAIL	cidb
	Killed by signal (in clickhouse-server.log)	FAIL	cidb
	Fatal message in clickhouse-server.log (see fatal_messages.txt)	FAIL	cidb
	Killed by signal (output files)	FAIL	cidb

vitlibar · 2025-10-24T12:17:51Z

tests/queries/0_stateless/03701_parquet_conversion_to_datetime64.sh

+# import pyarrow as pa
+# id = pa.array(["1"], type=pa.string())
+# ts = pa.array(["2020-01-01 14:00:00"], type=pa.string())
+# ts_plus_tz = pa.array(["2020-01-01 14:00:00+00:00"], type=pa.string())


2020-01-01 14:00:00+00:00 is not a valid format for ClickHouse columns of type DateTime64, so the command

INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.parquet')

fails, as expected, with the error message Cannot parse string '2020-01-01 14:00:00+00:00' as DateTime64(9, 'UTC').

However, before this fix, the following command, which differs only in using a glob in the file name,

INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.par?uet')

succeeded and inserted 1970-01-01 00:00:00.000000000 into the table.

With this PR, such commands using globs also fail with the Cannot parse error, as expected.
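
The idea of deriving one format from the files a glob matches can be illustrated with a small standalone sketch. It is a simplified variant, not the PR's code: formatFromFileName and commonFormat are hypothetical helpers written for this example, the extension table is made up, and the sketch is stricter than the change under review in that any file with an unknown extension yields no format at all.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a format lookup by file extension.
// Only two extensions are recognized here, purely for illustration.
std::optional<std::string> formatFromFileName(const std::string & path)
{
    const auto dot = path.rfind('.');
    if (dot == std::string::npos)
        return std::nullopt;

    const std::string extension = path.substr(dot + 1);
    if (extension == "parquet")
        return std::string("Parquet");
    if (extension == "csv")
        return std::string("CSV");
    return std::nullopt;
}

// Returns the format shared by all matched paths, or nullopt if the list is
// empty, any path has an unknown format, or the paths disagree.
std::optional<std::string> commonFormat(const std::vector<std::string> & paths)
{
    std::optional<std::string> result;
    for (const auto & path : paths)
    {
        const auto format = formatFromFileName(path);
        if (!format)
            return std::nullopt;
        if (!result)
            result = format;
        else if (result != format)
            return std::nullopt;
    }
    return result;
}
```

In this sketch the pattern string itself (for example one ending in .par?uet) is never inspected; only the concrete paths it expands to are, which is why a glob matching a single .parquet file resolves to the same format as the literal file name.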

PedroTadim · 2025-10-24T12:42:12Z

Does it fix #88920? If so, please mention it to close it.

vitlibar · 2025-10-24T13:42:28Z

> Does it fix #88920? If so, please mention it to close it.

Yes. Thanks, I didn't notice they created an issue.

jkartseva

LGTM

src/TableFunctions/TableFunctionFile.cpp

jkartseva · 2025-10-29T02:34:04Z

src/Storages/StorageFile.cpp

+    if (!path_to_archive.empty())
+        res.archive_info = getArchiveInfo(path_to_archive, filename, user_files_path, context, res.total_bytes_to_read);
+    else
+        res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read);
+
+    res.with_globs = res.paths.size() > 1;
+
+    if (res.archive_info)
+    {
+        res.format_from_filenames = FormatFactory::instance().tryGetFormatFromFileName(res.archive_info->path_in_archive);
+    }
+    else
+    {
+        for (const String & path : res.paths)
+        {
+            auto single_file_format = FormatFactory::instance().tryGetFormatFromFileName(path);
+            if (!res.format_from_filenames)
+            {
+                res.format_from_filenames = single_file_format;
+            }
+            else if (res.format_from_filenames != single_file_format)
+            {
+                res.format_from_filenames = {};
+                break;
+            }
+        }
+    }
+
+    if (!res.paths.empty())
+        res.path_for_partitioned_write = res.paths.front();
+    else
+        res.path_for_partitioned_write = filename;
+
+    return res;


As a suggestion, instead of setting the fields manually, pass them to a private FileSource constructor:

FileSource(String filename_, Strings paths_, std::optional<ArchiveInfo> archive_info_, size_t total_bytes_to_read_)
    : filename(std::move(filename_))
    , paths(std::move(paths_))
    , archive_info(std::move(archive_info_))
    , total_bytes_to_read(total_bytes_to_read_)
{
}

vitlibar · 2025-10-31

Honestly, I don't see any benefit in adding such a constructor: these fields are public and we have to calculate them anyway, so the constructor looks like just an extra step.
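
For readers weighing the two positions, here is a minimal sketch of what the private-constructor variant could look like. Everything in it is an assumption made for illustration: ArchiveInfo is reduced to a single field, the factory is named parse, glob expansion is replaced by treating the file name as the only matched path, and standard library types stand in for the String and Strings aliases.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the archive information in the diff.
struct ArchiveInfo
{
    std::string path_in_archive;
};

class FileSource
{
public:
    // Factory: computes the inputs first, then builds the object in one step.
    static FileSource parse(std::string filename)
    {
        // Assumption: real code would expand globs here; this sketch treats
        // the file name as the single matched path.
        std::vector<std::string> paths{filename};
        return FileSource(std::move(filename), std::move(paths), std::nullopt, 0);
    }

    // Fields stay public, as in the code under review.
    std::string filename;
    std::vector<std::string> paths;
    std::optional<ArchiveInfo> archive_info;
    std::size_t total_bytes_to_read = 0;

private:
    // Only the factory can call this, so every instance goes through parse().
    FileSource(
        std::string filename_,
        std::vector<std::string> paths_,
        std::optional<ArchiveInfo> archive_info_,
        std::size_t total_bytes_to_read_)
        : filename(std::move(filename_))
        , paths(std::move(paths_))
        , archive_info(std::move(archive_info_))
        , total_bytes_to_read(total_bytes_to_read_)
    {
    }
};
```

The constructor mainly buys a single construction path; because the fields remain public, it does not prevent later modification, which is consistent with the objection that it adds a step without adding protection.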

jkartseva · 2025-10-29T02:36:22Z

src/Storages/StorageFile.cpp

+    else
+        res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read);
+
+    res.with_globs = res.paths.size() > 1;


with_globs is derived from another field and is cheap to compute, so it could be a getter instead:

bool withGlobs() const { return paths.size() > 1; }

vitlibar · 2025-10-31

Yes, this calculation is simple. However, it is clear why with_globs = (paths.size() > 1) only at the point where we parse FileSource from a string; if we made this calculation later, it would look strange. For this reason I'd prefer to keep this code as is.
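
The two options in this thread can be put side by side in a short sketch. The struct names FileSourceStored and FileSourceDerived and the function parseStored are hypothetical and exist only for this comparison; neither is the code in the PR.

```cpp
#include <string>
#include <utility>
#include <vector>

// Variant A: the flag is a stored field, set once while parsing.
struct FileSourceStored
{
    std::vector<std::string> paths;
    bool with_globs = false;
};

FileSourceStored parseStored(std::vector<std::string> matched_paths)
{
    FileSourceStored res;
    res.paths = std::move(matched_paths);
    // Computed right where the paths are produced, next to their origin.
    res.with_globs = res.paths.size() > 1;
    return res;
}

// Variant B: the flag is derived from paths every time it is asked for.
struct FileSourceDerived
{
    std::vector<std::string> paths;

    bool withGlobs() const { return paths.size() > 1; }
};
```

Both give the same answer immediately after parsing. They diverge only if paths is modified afterwards: the stored flag keeps the value it had at parse time, while the getter follows the current contents.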

jkartseva · 2025-10-29T03:08:39Z

src/Storages/StorageFile.cpp

+    if (!res.paths.empty())
+        res.path_for_partitioned_write = res.paths.front();
+    else
+        res.path_for_partitioned_write = filename;


path_for_partitioned_write could also be a method.

vitlibar · 2025-11-01

Yes, this calculation is simple. However, it is clear why path_for_partitioned_write = res.paths.front() (or filename) only at the point where we parse FileSource from a string; if we made this calculation later, it would look strange. For this reason I'd prefer to keep this code as is.

src/Storages/StorageFile.cpp

vitlibar · 2025-11-01T14:01:09Z

CI failures are unrelated:

00900_long_parquet_load_2 - test is flaky.
Disk does not support WriteMode::Append. (NOT_IMPLEMENTED) - see issue #84669

azat · 2025-11-02T21:34:57Z

This change actually breaks 00900_long_parquet_load_2, since you added a new file. It will be fixed in #89383.

clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Oct 24, 2025

vitlibar force-pushed the fix-getting-file-format-from-globs branch 5 times, most recently from ca9347f to 09000da Compare October 24, 2025 17:27

jkartseva self-assigned this Oct 24, 2025

vitlibar force-pushed the fix-getting-file-format-from-globs branch 2 times, most recently from 87d27c3 to 6cba59f Compare October 26, 2025 14:01

vitlibar added 3 commits October 26, 2025 15:38

Fix getting file formats from globs.

abc1b95

Add test.

5bd59c8

Fix test 03023_invalid_format_detection.

5a67ff5

vitlibar force-pushed the fix-getting-file-format-from-globs branch from 6cba59f to 5a67ff5 Compare October 26, 2025 14:39

jkartseva approved these changes Oct 29, 2025

Corrections after review.

8e30ae1

vitlibar enabled auto-merge November 1, 2025 14:03

vitlibar added this pull request to the merge queue Nov 1, 2025

Merged via the queue into ClickHouse:master with commit 99a5825 Nov 1, 2025
115 of 124 checks passed

vitlibar deleted the fix-getting-file-format-from-globs branch November 1, 2025 14:19

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 1, 2025

azat mentioned this pull request Nov 2, 2025

tests: fix 00900_long_parquet_load_2 #89383

Merged
