Extend timestamp variable parameters in File name format #344

Open
BulgakovKD opened this issue Feb 29, 2024 · 1 comment

@BulgakovKD

Scenario Overview

We use the S3 file name template prefix/%Y_%m_%d__%H_%M_%S_%f so that file names sort alphabetically.
Each new file is guaranteed to receive a name that comes after all previous names in alphabetical order.
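
To illustrate why this works, here is a minimal sketch, not connector code. It assumes the strftime-style template maps to the Java pattern `yyyy_MM_dd__HH_mm_ss_SSSSSS` and that timestamps are rendered in UTC; the class and method names are made up for the example. Because every field is zero-padded and ordered from most to least significant, a plain string sort matches chronological order.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only.
final class SortableNameSketch {
    // Assumed Java equivalent of %Y_%m_%d__%H_%M_%S_%f (microsecond precision).
    static final DateTimeFormatter NAME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy_MM_dd__HH_mm_ss_SSSSSS").withZone(ZoneOffset.UTC);

    static String fileName(String prefix, Instant timestamp) {
        return prefix + "/" + NAME_FORMAT.format(timestamp);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add(fileName("topic_name", Instant.parse("2024-02-29T10:15:30.000250Z")));
        names.add(fileName("topic_name", Instant.parse("2024-03-01T00:00:00Z")));
        names.add(fileName("topic_name", Instant.parse("2024-02-29T09:59:59.999999Z")));

        // A plain lexicographic sort puts the names in chronological order.
        Collections.sort(names);
        names.forEach(System.out::println);
    }
}
```

This only holds while all names share one prefix and every field keeps a fixed width.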

In Kafka, we have several partitions of one topic, and each of them must be written in order under the same prefix (prefix=topic_name).
With this template, file order can be ensured by running no more than one connector task.

Issue:

The timestamp variable currently supports the following parameters:

unit parameter values:
yyyy - year, e.g. 2020 (please note that YYYY is deprecated and is interpreted as yyyy)
MM - month, e.g. 03
dd - day, e.g. 01
HH - hour, e.g. 23

Consequences:

With these parameters, files written within the same hour get the same name.
Adding the partition number and offset to the file name in the template can solve this problem, but it makes working with the root prefix more difficult.
Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.
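
The collision can be reproduced with a small sketch. It reuses the four formatters listed above; joining the units with "-" and the `fileName` helper are assumptions made for the example, not the connector's actual template rendering.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class HourlyCollisionSketch {
    static final Map<String, DateTimeFormatter> TIMESTAMP_FORMATTERS =
            Map.of(
                    "yyyy", DateTimeFormatter.ofPattern("yyyy"),
                    "MM", DateTimeFormatter.ofPattern("MM"),
                    "dd", DateTimeFormatter.ofPattern("dd"),
                    "HH", DateTimeFormatter.ofPattern("HH")
            );

    // Units in the order they would appear in the file name (assumed layout).
    static final List<String> UNITS = List.of("yyyy", "MM", "dd", "HH");

    static String fileName(String prefix, ZonedDateTime timestamp) {
        return prefix + "/" + UNITS.stream()
                .map(unit -> TIMESTAMP_FORMATTERS.get(unit).format(timestamp))
                .collect(Collectors.joining("-"));
    }

    public static void main(String[] args) {
        ZonedDateTime first = ZonedDateTime.of(2024, 2, 29, 10, 5, 0, 0, ZoneOffset.UTC);
        ZonedDateTime second = ZonedDateTime.of(2024, 2, 29, 10, 55, 0, 0, ZoneOffset.UTC);

        // Both timestamps fall in the same hour, so the two names are identical.
        System.out.println(fileName("topic_name", first));
        System.out.println(fileName("topic_name", second));
    }
}
```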

Details:

It looks like it would be enough to extend the following map:

    private static final Map<String, DateTimeFormatter> TIMESTAMP_FORMATTERS =
            Map.of(
                    "yyyy", DateTimeFormatter.ofPattern("yyyy"),
                    "MM", DateTimeFormatter.ofPattern("MM"),
                    "dd", DateTimeFormatter.ofPattern("dd"),
                    "HH", DateTimeFormatter.ofPattern("HH")
            );

with the following parameters:

"%M" - Minutes in two-digit format.
"%S" - Seconds in two-digit format.
"%f" - Microseconds.
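
A possible shape of the extended map, as a sketch only. The new unit names `mm`, `ss` and `SSSSSS` are assumptions: they follow Java's `DateTimeFormatter` pattern letters rather than the strftime-style `%M`, `%S`, `%f` above, so that the keys stay consistent with the existing `yyyy`/`MM`/`dd`/`HH` units. The `format` helper is hypothetical.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;

// Hypothetical sketch of the proposed extension, not the connector's code.
final class ExtendedTimestampFormattersSketch {
    static final Map<String, DateTimeFormatter> TIMESTAMP_FORMATTERS =
            Map.of(
                    "yyyy", DateTimeFormatter.ofPattern("yyyy"),
                    "MM", DateTimeFormatter.ofPattern("MM"),
                    "dd", DateTimeFormatter.ofPattern("dd"),
                    "HH", DateTimeFormatter.ofPattern("HH"),
                    // Proposed additions (unit names are assumptions):
                    "mm", DateTimeFormatter.ofPattern("mm"),          // minutes, two digits
                    "ss", DateTimeFormatter.ofPattern("ss"),          // seconds, two digits
                    "SSSSSS", DateTimeFormatter.ofPattern("SSSSSS")   // microseconds, six digits
            );

    static String format(String unit, ZonedDateTime timestamp) {
        DateTimeFormatter formatter = TIMESTAMP_FORMATTERS.get(unit);
        if (formatter == null) {
            throw new IllegalArgumentException("Unsupported timestamp unit: " + unit);
        }
        return formatter.format(timestamp);
    }

    public static void main(String[] args) {
        ZonedDateTime ts = ZonedDateTime.of(2024, 2, 29, 10, 5, 7, 123_456_000, ZoneOffset.UTC);
        System.out.println(format("mm", ts));
        System.out.println(format("ss", ts));
        System.out.println(format("SSSSSS", ts));
    }
}
```

Note that in Java patterns `MM` is the month and `mm` is the minute, so the case of the unit name matters.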

jeqo self-assigned this on Mar 15, 2024

jeqo commented Mar 15, 2024

@BulgakovKD thanks for reporting this.

> Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.

I'm not sure this is the case. At least, it won't be guaranteed, as there's still a chance for messages from different partitions to end up in the same file (if I'm understanding your case correctly).
By reducing the time unit we are just hoping that within a shorter time window there are only messages from one partition.
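
A rough sketch of that concern, under the assumption that records are grouped into files purely by the rendered timestamp. The `Rec` class, the minute-level pattern and `partitionsPerFile` are hypothetical and only model the grouping, not the connector itself.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of timestamp-only file grouping, for illustration.
final class MixedPartitionSketch {
    static final class Rec {
        final int partition;
        final long offset;
        final Instant timestamp;

        Rec(int partition, long offset, Instant timestamp) {
            this.partition = partition;
            this.offset = offset;
            this.timestamp = timestamp;
        }
    }

    // Assumed minute-level file key; any finer unit has the same property.
    static final DateTimeFormatter MINUTE_KEY =
            DateTimeFormatter.ofPattern("yyyy_MM_dd__HH_mm").withZone(ZoneOffset.UTC);

    // Maps each file key to the set of partitions whose records land in it.
    static Map<String, Set<Integer>> partitionsPerFile(List<Rec> records) {
        Map<String, Set<Integer>> result = new TreeMap<>();
        for (Rec rec : records) {
            result.computeIfAbsent(MINUTE_KEY.format(rec.timestamp), key -> new TreeSet<>())
                    .add(rec.partition);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rec> records = List.of(
                new Rec(0, 100L, Instant.parse("2024-02-29T10:15:01Z")),
                new Rec(1, 7L, Instant.parse("2024-02-29T10:15:40Z")));
        // Both records share one key, so one file would contain two partitions.
        System.out.println(partitionsPerFile(records));
    }
}
```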

I guess adding minutes is a valid request, as some users may want to have messages rotated more frequently, but I'm not sure going down to microseconds is the right way to proceed. Moreover, this would add pressure on the connector task, as it would lead to a large number of keys being kept in memory before all messages are flushed to S3.
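
To get a feel for the growth in keys, here is a back-of-the-envelope sketch. It assumes one record per second for an hour and counts the distinct rendered keys per granularity; the patterns and the `distinctKeys` helper are made up for the example and say nothing about how the connector actually buffers data.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

// Hypothetical estimate of distinct file keys, for illustration only.
final class KeyCountSketch {
    // Counts distinct keys for one record per second, starting at `start`.
    static int distinctKeys(String pattern, Instant start, int seconds) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < seconds; i++) {
            keys.add(formatter.format(start.plusSeconds(i)));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-02-29T10:00:00Z");
        System.out.println("hour:   " + distinctKeys("yyyy_MM_dd__HH", start, 3600));
        System.out.println("minute: " + distinctKeys("yyyy_MM_dd__HH_mm", start, 3600));
        System.out.println("second: " + distinctKeys("yyyy_MM_dd__HH_mm_ss", start, 3600));
    }
}
```

With sub-second units, the number of keys approaches the number of records, which is the extreme case of the concern above.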

Do you require ordering only between files, or also ordering of the messages within each file?
Your proposed approach may solve the first kind of ordering, but I don't think the second is guaranteed, as it mixes messages from different partitions in the same file.

Not sure yet how to handle this, but let's see if we can find a workaround with the existing configurations before considering changing the connector.
