Extend timestamp variable parameters in File name format #344

BulgakovKD · 2024-02-29T14:12:03Z

Scenario Overview

We use the S3 filename template prefix/%Y_%m_%d__%H_%M_%S_%f so that filenames sort alphabetically.
Each new file is guaranteed to get a name that sorts after all previous ones.

In Kafka, we have several partitions of one topic, and each of them must be written in order under the same prefix (prefix=topic_name).
With this template, the file order can be ensured by running no more than 1 connector task.
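
To illustrate why such a template keeps files in order, here is a minimal sketch. It assumes the Java pattern `yyyy_MM_dd__HH_mm_ss_SSSSSS` is an acceptable stand-in for the strftime-style template above; the class and method names are made up for the example and are not part of the connector. Because every field is zero-padded and ordered from most to least significant, sorting the names as strings gives chronological order.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortableNames {
    // Assumed Java equivalent of %Y_%m_%d__%H_%M_%S_%f (all fields zero-padded).
    static final DateTimeFormatter NAME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy_MM_dd__HH_mm_ss_SSSSSS");

    // Hypothetical helper: builds "<prefix>/<timestamp>".
    static String fileName(String prefix, ZonedDateTime ts) {
        return prefix + "/" + NAME_FORMAT.format(ts);
    }

    public static void main(String[] args) {
        ZonedDateTime first = ZonedDateTime.of(2024, 2, 29, 14, 12, 3, 0, ZoneOffset.UTC);
        List<String> names = new ArrayList<>();
        // Added out of chronological order on purpose.
        names.add(fileName("topic_name", first.plusSeconds(90)));
        names.add(fileName("topic_name", first));
        names.add(fileName("topic_name", first.plusNanos(1_000)));
        // A plain string sort restores chronological order.
        Collections.sort(names);
        names.forEach(System.out::println);
    }
}
```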

Issue:

The timestamp variable has the following parameters:

unit parameter values:
yyyy - year, e.g. 2020 (please note that YYYY is deprecated and is interpreted as yyyy)
MM - month, e.g. 03
dd - day, e.g. 01
HH - hour, e.g. 24

Consequences:

With these parameters, files recorded within 1 hour will not differ in name.
Adding the partition number and offset to the file name in the template can solve this problem, but it makes working with the root prefix more difficult.
Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.
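
The collision can be reproduced with a small sketch. The `-` separators and the names `HourCollision` and `hourlyName` are assumptions for illustration only; the point is that with nothing finer than `HH`, any two timestamps in the same hour format to the same string.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

final class HourCollision {
    // Only the four currently supported units: year, month, day, hour.
    static final DateTimeFormatter HOURLY = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH");

    // Hypothetical helper: builds "<prefix>/<timestamp>" at hour granularity.
    static String hourlyName(String prefix, ZonedDateTime ts) {
        return prefix + "/" + HOURLY.format(ts);
    }

    public static void main(String[] args) {
        ZonedDateTime a = ZonedDateTime.of(2024, 2, 29, 14, 5, 0, 0, ZoneOffset.UTC);
        ZonedDateTime b = a.plusMinutes(40); // still inside the 14:00 hour
        System.out.println(hourlyName("topic_name", a));
        System.out.println(hourlyName("topic_name", b));
        System.out.println(hourlyName("topic_name", a).equals(hourlyName("topic_name", b)));
    }
}
```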

Details:

It looks like it would be enough to extend the following functionality:

    private static final Map<String, DateTimeFormatter> TIMESTAMP_FORMATTERS =
            Map.of(
                    "yyyy", DateTimeFormatter.ofPattern("yyyy"),
                    "MM", DateTimeFormatter.ofPattern("MM"),
                    "dd", DateTimeFormatter.ofPattern("dd"),
                    "HH", DateTimeFormatter.ofPattern("HH")
            );

with the following parameters:

"%M" - Minutes in two-digit format.
"%S" - Seconds in two-digit format.
"%f" - Microseconds.
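
A rough sketch of what the extended map could look like is below. It is not the connector's actual code: the keys `mm`, `ss` and `SSSSSS` are an assumption that follows the Java pattern-letter naming of the existing entries rather than the strftime-style `%M`, `%S`, `%f` names, and the `format` helper is hypothetical.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;

final class ExtendedTimestampFormatters {
    // The four existing units plus three assumed ones (minutes, seconds, microseconds).
    static final Map<String, DateTimeFormatter> TIMESTAMP_FORMATTERS =
            Map.of(
                    "yyyy", DateTimeFormatter.ofPattern("yyyy"),
                    "MM", DateTimeFormatter.ofPattern("MM"),
                    "dd", DateTimeFormatter.ofPattern("dd"),
                    "HH", DateTimeFormatter.ofPattern("HH"),
                    "mm", DateTimeFormatter.ofPattern("mm"),         // minutes, two digits
                    "ss", DateTimeFormatter.ofPattern("ss"),         // seconds, two digits
                    "SSSSSS", DateTimeFormatter.ofPattern("SSSSSS")  // fraction of second, six digits
            );

    // Hypothetical lookup helper: rejects unknown units instead of returning null.
    static String format(String unit, ZonedDateTime ts) {
        DateTimeFormatter formatter = TIMESTAMP_FORMATTERS.get(unit);
        if (formatter == null) {
            throw new IllegalArgumentException("Unsupported unit: " + unit);
        }
        return formatter.format(ts);
    }

    public static void main(String[] args) {
        ZonedDateTime ts = ZonedDateTime.of(2024, 2, 29, 14, 12, 3, 123_456_000, ZoneOffset.UTC);
        System.out.println(format("mm", ts));
        System.out.println(format("ss", ts));
        System.out.println(format("SSSSSS", ts));
    }
}
```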

jeqo · 2024-03-15T13:33:00Z

@BulgakovKD thanks for reporting this.

> Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.

Not sure this is the case. At least it won't be guaranteed as there's still a chance for messages from different partitions to be in the same file (if I'm understanding your case correctly).
By reducing the time unit we are just hoping that within a shorter time window there are only messages from one partition.

I guess adding minutes is a valid request -- as some users may want to have messages rotated more frequently -- but I'm not sure going down to microseconds is the right way to proceed, especially since it will add pressure to the connector task: it will lead to a large number of keys to keep in memory before all messages are flushed to S3.
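
To give a feel for the memory concern, here is a back-of-the-envelope sketch. It is not how the connector tracks keys internally; it only counts how many distinct timestamp strings the same stream of records would produce at different granularities. The class name, the patterns and the record spacing are all assumptions.

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.Set;

final class KeyCardinality {
    // Counts distinct formatted timestamps for `records` records spaced `stepMillis` apart.
    static int distinctKeys(String pattern, ZonedDateTime start, int records, long stepMillis) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < records; i++) {
            keys.add(formatter.format(start.plus(Duration.ofMillis(i * stepMillis))));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        ZonedDateTime start = ZonedDateTime.of(2024, 3, 15, 14, 0, 0, 0, ZoneOffset.UTC);
        // 1000 records, one every 250 ms (about 250 seconds of traffic).
        String[] patterns = {
            "yyyy-MM-dd-HH",
            "yyyy-MM-dd-HH-mm",
            "yyyy-MM-dd-HH-mm-ss",
            "yyyy-MM-dd-HH-mm-ss-SSSSSS"
        };
        for (String pattern : patterns) {
            System.out.println(pattern + " -> " + distinctKeys(pattern, start, 1000, 250));
        }
    }
}
```

Under these assumptions, the number of keys grows from one per hour to one per record once the fraction of a second is included.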

Do you require ordering only between files, or also ordering of the messages within each file?
You may be solving the first ordering with your proposed approach, but I don't think the second ordering is guaranteed as it's mixing messages from different partitions in the same file.
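
A toy sketch of the mixing problem, assuming records are grouped purely by a timestamp-derived file name. `Rec` and `groupByFileName` are hypothetical stand-ins, not connector classes. When the file name carries no partition, records from different partitions that share a timestamp key end up in the same file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MixedPartitions {
    // Hypothetical stand-in for a sink record: partition, offset, and its timestamp-based file name.
    static final class Rec {
        final int partition;
        final long offset;
        final String fileName;

        Rec(int partition, long offset, String fileName) {
            this.partition = partition;
            this.offset = offset;
            this.fileName = fileName;
        }
    }

    // Groups records by file name only; the partition plays no role in the key.
    static Map<String, List<Rec>> groupByFileName(List<Rec> records) {
        Map<String, List<Rec>> files = new TreeMap<>();
        for (Rec r : records) {
            files.computeIfAbsent(r.fileName, k -> new ArrayList<>()).add(r);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Rec> records = List.of(
                new Rec(0, 100, "topic_name/2024-03-15-13-33"),
                new Rec(1, 7, "topic_name/2024-03-15-13-33"),
                new Rec(0, 101, "topic_name/2024-03-15-13-34"));
        groupByFileName(records).forEach((name, recs) -> {
            StringBuilder partitions = new StringBuilder();
            for (Rec r : recs) {
                partitions.append(r.partition).append(' ');
            }
            System.out.println(name + " <- partitions " + partitions.toString().trim());
        });
    }
}
```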

Not sure yet how to handle this, but let's see if we can find a workaround with the existing configurations before considering changing the connector.

jeqo self-assigned this Mar 15, 2024
