-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix getting file format from globs in file() function #88947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix getting file format from globs in file() function #88947
Conversation
|
Workflow [PR], commit [8e30ae1] Summary: ❌
|
| # import pyarrow as pa | ||
| # id = pa.array(["1"], type=pa.string()) | ||
| # ts = pa.array(["2020-01-01 14:00:00"], type=pa.string()) | ||
| # ts_plus_tz = pa.array(["2020-01-01 14:00:00+00:00"], type=pa.string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2020-01-01 14:00:00+00:00 is a wrong format for ClickHouse columns of type DateTime64. So the command
INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.parquet')
fails with error message Cannot parse string '2020-01-01 14:00:00+00:00' as DateTime64(9, 'UTC') as expected.
However before this fix the following command
INSERT INTO test (ts) SELECT ts_plus_tz FROM file('conversion_to_datetime64_test.par?uet')
was successful and inserted 1970-01-01 00:00:00.000000000 to the table.
This PR makes such commands using globs fail with error message Cannot parse as expected.
|
Does it fix #88920 ? If so, please mention it to close it. |
Yes. Thanks, I didn't notice they created an issue. |
ca9347f to
09000da
Compare
87d27c3 to
6cba59f
Compare
6cba59f to
5a67ff5
Compare
jkartseva
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| if (!path_to_archive.empty()) | ||
| res.archive_info = getArchiveInfo(path_to_archive, filename, user_files_path, context, res.total_bytes_to_read); | ||
| else | ||
| res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read); | ||
|
|
||
| res.with_globs = res.paths.size() > 1; | ||
|
|
||
| if (res.archive_info) | ||
| { | ||
| res.format_from_filenames = FormatFactory::instance().tryGetFormatFromFileName(res.archive_info->path_in_archive); | ||
| } | ||
| else | ||
| { | ||
| for (const String & path : res.paths) | ||
| { | ||
| auto single_file_format = FormatFactory::instance().tryGetFormatFromFileName(path); | ||
| if (!res.format_from_filenames) | ||
| { | ||
| res.format_from_filenames = single_file_format; | ||
| } | ||
| else if (res.format_from_filenames != single_file_format) | ||
| { | ||
| res.format_from_filenames = {}; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (!res.paths.empty()) | ||
| res.path_for_partitioned_write = res.paths.front(); | ||
| else | ||
| res.path_for_partitioned_write = filename; | ||
|
|
||
| return res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a suggestion, instead of setting fields manually, pass them to a private FileSource c-tor:
FileSource(String filename_, Strings paths_, std::optional<ArchiveInfo> archive_info_, size_t total_bytes_to_read_)
: filename(std::move(filename_)), paths(std::move(paths_)), archive_info(std::move(archive_info_)), total_bytes_to_read(total_bytes_to_read_)
{
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I don't really see any benefits of adding such a constructor because since these fields are public and we need to calculate these fields anyway, the constructor looks like just an extra step.
| else | ||
| res.paths = getPathsList(filename, user_files_path, context, res.total_bytes_to_read); | ||
|
|
||
| res.with_globs = res.paths.size() > 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with_globs is derived from another field and it's cheap so that it can be in the form of a getter instead:
bool withGlobs() const { return paths.size() > 1; }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this calculation is simple, however it looks clear why with_globs = (paths.size() > 1) only when we parse FileSource from string. If we made this calculation later it would look strange. For this reason I'd prefer to keep this code as is.
| if (!res.paths.empty()) | ||
| res.path_for_partitioned_write = res.paths.front(); | ||
| else | ||
| res.path_for_partitioned_write = filename; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path_for_partitioned_write can also be in the form of a method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this calculation is simple, however it looks clear why path_for_partitioned_write = res.paths.front() (or filename) only when we parse FileSource from string. If we made this calculation later it would look strange. For this reason I'd prefer to keep this code as is.
|
CI failures are unrelated: |
99a5825
|
This change actually breaks |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix getting file format from globs in file() function. Resolves #88920