@vitlibar
Member

@vitlibar vitlibar commented Oct 24, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix getting the file format from globs in the file() function. Resolves #88920.

@clickhouse-gh

clickhouse-gh bot commented Oct 24, 2025

Workflow [PR], commit [8e30ae1]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (amd_debug, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (amd_ubsan, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (amd_debug, distributed plan, s3 storage, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stateless tests (arm_binary, parallel) | | failure | | |
| | 00900_long_parquet_load_2 | FAIL | cidb | |
| Stress test (arm_asan) | | failure | | |
| | Server died | FAIL | cidb | |
| | Hung check failed, possible deadlock found (see hung_check.log) | FAIL | cidb | |
| | Killed by signal (in clickhouse-server.log) | FAIL | cidb | |
| | Fatal message in clickhouse-server.log (see fatal_messages.txt) | FAIL | cidb | |
| | Killed by signal (output files) | FAIL | cidb | |

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Oct 24, 2025
# import pyarrow as pa
# id = pa.array(["1"], type=pa.string())
# ts = pa.array(["2020-01-01 14:00:00"], type=pa.string())
# ts_plus_tz = pa.array(["2020-01-01 14:00:00+00:00"], type=pa.string())
Member Author

@vitlibar vitlibar Oct 24, 2025

2020-01-01 14:00:00+00:00 is not a valid format for ClickHouse columns of type DateTime64. So the command

INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.parquet')

fails, as expected, with the error message Cannot parse string '2020-01-01 14:00:00+00:00' as DateTime64(9, 'UTC').


However, before this fix, the following command

INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.par?uet')

succeeded and inserted 1970-01-01 00:00:00.000000000 into the table.

With this PR, such commands using globs also fail with the Cannot parse error message, as expected.
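
The diff reviewed later in this conversation derives the format from the list of paths that the glob expands to, and keeps it only while every path agrees. A minimal sketch of that rule follows. guessFormatFromFileName is a hypothetical stand-in for FormatFactory::tryGetFormatFromFileName that knows just two extensions, and commonFormat is an illustrative name that does not appear in the PR. Under that assumption the pattern par?uet yields no format while the expanded path does, which is consistent with the behaviour described above; the PR text itself does not spell out the old code path.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for FormatFactory::tryGetFormatFromFileName.
// It recognises only two extensions; the real function knows many more.
std::optional<std::string> guessFormatFromFileName(const std::string & name)
{
    const auto dot = name.rfind('.');
    if (dot == std::string::npos)
        return std::nullopt;
    const std::string ext = name.substr(dot + 1);
    if (ext == "parquet")
        return std::string{"Parquet"};
    if (ext == "csv")
        return std::string{"CSV"};
    return std::nullopt;
}

// Mirrors the loop in the diff: keep a format only while all paths agree on it.
std::optional<std::string> commonFormat(const std::vector<std::string> & paths)
{
    std::optional<std::string> result;
    for (const auto & path : paths)
    {
        auto single = guessFormatFromFileName(path);
        if (!result)
            result = single;
        else if (result != single)
            return std::nullopt;
    }
    return result;
}

// Self-checking usage example.
void demoCommonFormat()
{
    // The glob pattern itself has no extension this sketch recognises ...
    assert(!guessFormatFromFileName("conversion_to_datetime64_test.par?uet"));
    // ... but the path it expands to does.
    assert(commonFormat({"conversion_to_datetime64_test.parquet"}) == "Parquet");
    // Mixed extensions give no common format.
    assert(!commonFormat({"a.parquet", "b.csv"}));
}
```

With an empty path list the sketch returns no format, so a caller would have to fall back to another way of choosing it.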

@PedroTadim
Member

Does it fix #88920? If so, please mention it to close it.

@vitlibar
Member Author

Does it fix #88920? If so, please mention it to close it.

Yes. Thanks, I didn't notice they created an issue.

@vitlibar vitlibar force-pushed the fix-getting-file-format-from-globs branch 5 times, most recently from ca9347f to 09000da Compare October 24, 2025 17:27
@jkartseva jkartseva self-assigned this Oct 24, 2025
@vitlibar vitlibar force-pushed the fix-getting-file-format-from-globs branch 2 times, most recently from 87d27c3 to 6cba59f Compare October 26, 2025 14:01
@vitlibar vitlibar force-pushed the fix-getting-file-format-from-globs branch from 6cba59f to 5a67ff5 Compare October 26, 2025 14:39
Member

@jkartseva jkartseva left a comment

LGTM

Comment on lines +993 to +1026
    if (!path_to_archive.empty())
        res.archive_info = getArchiveInfo(path_to_archive, filename, user_files_path, context, res.total_bytes_to_read);
    else
        res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read);

    res.with_globs = res.paths.size() > 1;

    if (res.archive_info)
    {
        res.format_from_filenames = FormatFactory::instance().tryGetFormatFromFileName(res.archive_info->path_in_archive);
    }
    else
    {
        for (const String & path : res.paths)
        {
            auto single_file_format = FormatFactory::instance().tryGetFormatFromFileName(path);
            if (!res.format_from_filenames)
            {
                res.format_from_filenames = single_file_format;
            }
            else if (res.format_from_filenames != single_file_format)
            {
                res.format_from_filenames = {};
                break;
            }
        }
    }

    if (!res.paths.empty())
        res.path_for_partitioned_write = res.paths.front();
    else
        res.path_for_partitioned_write = filename;

    return res;
Member

@jkartseva jkartseva Oct 29, 2025

As a suggestion, instead of setting the fields manually, pass them to a private FileSource constructor:

    FileSource(String filename_, Strings paths_, std::optional<ArchiveInfo> archive_info_, size_t total_bytes_to_read_)
        : filename(std::move(filename_)), paths(std::move(paths_)), archive_info(std::move(archive_info_)), total_bytes_to_read(total_bytes_to_read_)
    {
    }

Member Author

Honestly, I don't see much benefit in adding such a constructor: these fields are public and we need to calculate them anyway, so the constructor looks like just an extra step.
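
For readers weighing the two positions, here is a rough sketch of what the suggested shape might look like when paired with a static factory. ArchiveInfo is a simplified stand-in, and fromExpandedPaths is a hypothetical name; in the PR the paths come from getPathsList(), whereas here the caller passes them in so that the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in; only path_in_archive is taken from the diff.
struct ArchiveInfo
{
    std::string path_to_archive;
    std::string path_in_archive;
};

class FileSource
{
public:
    std::string filename;
    std::vector<std::string> paths;
    std::optional<ArchiveInfo> archive_info;
    std::size_t total_bytes_to_read = 0;

    // Hypothetical factory: the only public way to build a FileSource in this sketch.
    static FileSource fromExpandedPaths(std::string filename_, std::vector<std::string> paths_, std::size_t total_bytes_)
    {
        return FileSource(std::move(filename_), std::move(paths_), std::nullopt, total_bytes_);
    }

private:
    // Constructor in the shape proposed in the review comment.
    FileSource(std::string filename_, std::vector<std::string> paths_, std::optional<ArchiveInfo> archive_info_, std::size_t total_bytes_to_read_)
        : filename(std::move(filename_))
        , paths(std::move(paths_))
        , archive_info(std::move(archive_info_))
        , total_bytes_to_read(total_bytes_to_read_)
    {
    }
};

// Self-checking usage example.
void demoFactory()
{
    FileSource src = FileSource::fromExpandedPaths("data_*.csv", {"data_1.csv", "data_2.csv"}, 42);
    assert(src.paths.size() == 2);
    assert(!src.archive_info);
}
```

Because the fields stay public, callers can still modify them after construction, which is in line with the author's point that the constructor adds a step without adding protection.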

    else
        res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read);

    res.with_globs = res.paths.size() > 1;
Member

with_globs is derived from another field and is cheap to compute, so it could be a getter instead:

bool withGlobs() const { return paths.size() > 1; }

Member Author

@vitlibar vitlibar Oct 31, 2025

Yes, this calculation is simple. However, it is clear why with_globs = (paths.size() > 1) only at the point where we parse FileSource from a string; if we did this calculation later, it would look strange. For this reason I'd prefer to keep this code as is.

Comment on lines +1021 to +1024
    if (!res.paths.empty())
        res.path_for_partitioned_write = res.paths.front();
    else
        res.path_for_partitioned_write = filename;
Member

path_for_partitioned_write could also be a method.

Member Author

Yes, this calculation is simple. However, it is clear why path_for_partitioned_write = res.paths.front() (or filename) only at the point where we parse FileSource from a string; if we did this calculation later, it would look strange. For this reason I'd prefer to keep this code as is.
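
To make the trade-off in this thread and the with_globs thread concrete, here is a sketch of the reviewer's alternative, with both values computed on demand. MiniFileSource and pathForPartitionedWrite are illustrative names that do not appear in the PR; only withGlobs comes from the review comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative reduction of FileSource to the members these threads discuss.
struct MiniFileSource
{
    std::string filename;
    std::vector<std::string> paths;

    // Reviewer's suggestion: derive the flag on demand.
    bool withGlobs() const { return paths.size() > 1; }

    // Same rule as the diff: first expanded path, else the original name.
    const std::string & pathForPartitionedWrite() const
    {
        return paths.empty() ? filename : paths.front();
    }
};

// Self-checking usage example.
void demoAccessors()
{
    MiniFileSource src{"data_*.csv", {"data_1.csv", "data_2.csv"}};
    assert(src.withGlobs());
    assert(src.pathForPartitionedWrite() == "data_1.csv");

    // Unlike a field filled in at parse time, a getter follows later edits to paths.
    src.paths.resize(1);
    assert(!src.withGlobs());
}
```

The getter version tracks later changes to paths, whereas a stored field records what was true when the string was parsed. That second property appears to be what the author wants to keep explicit.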

@vitlibar
Member Author

vitlibar commented Nov 1, 2025

@vitlibar vitlibar enabled auto-merge November 1, 2025 14:03
@vitlibar vitlibar added this pull request to the merge queue Nov 1, 2025
Merged via the queue into ClickHouse:master with commit 99a5825 Nov 1, 2025
115 of 124 checks passed
@vitlibar vitlibar deleted the fix-getting-file-format-from-globs branch November 1, 2025 14:19
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 1, 2025
@azat
Member

azat commented Nov 2, 2025

This change actually breaks 00900_long_parquet_load_2, since you added a new file; it will be fixed in #89383.

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Unexpected exception behavior with brace expansion and table function

5 participants