Skip to content

fix(transforms): DayTransform #423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 5, 2025
Merged

fix(transforms): DayTransform #423

merged 5 commits into from
Jun 5, 2025

Conversation

jhump
Copy link
Contributor

@jhump jhump commented May 7, 2025

Its transform function returns an int32, but its ResultType method indicated it's a date. Per the spec, the transform function is correct: the result is an int32 number of days since epoch.

Its transform function returns an `int32`, but its `ResultType` method indicated it's a date. Per the spec, the transform function is correct: the result is an int32 number of days since epoch.
@zeroshade
Copy link
Member

The date type is represented exactly as an int32 number of days since the epoch though. The reason for using the date type as the result is so that the transformation to a string happens properly etc.

@jhump
Copy link
Contributor Author

jhump commented May 7, 2025

Hmm, ok. The spec states that the result type of the day transform is int. And, per the spec for representing these in Avro, that the resulting Avro schema type (for the partition type in a manifest file) should just be int, without the date logical type.

so that the transformation to a string happens properly

Where does it need to be transformed to a string? The implementations of DayTransform.Transformer, DayTransform.Apply, and DayTransform.ToHumanStr all return/require int32 values, not Date. So this feels very inconsistent.

I ran into an issue with this when using Transformer to compute a partition value and storing that in a manifest entry: it was an int32 result, and encoding to Avro was unhappy because the computed partition schema wanted a Date instead. (At least I think that's what happened; I had to make this change in a fork to get day transforms working, but my recollection of these specifics could be incorrect.) I suppose an alternate fix in my own code base would be to special-case this transform and explicitly convert the result value to Date? But that seems problematic, that it requires special-casing that way to correctly build a manifest.

@zeroshade
Copy link
Member

Where does it need to be transformed to a string? The implementations of DayTransform.Transformer, DayTransform.Apply, and DayTransform.ToHumanStr all return/require int32 values, not Date. So this feels very inconsistent.

That's a good point, you're right. I forgot about the ToHumanStr function which handles the only case where we'd need to do that conversion to string. So you're probably right, we can safely shift the result type to be int32.

@zeroshade
Copy link
Member

Can you create a test case that hits the problem you found and is solved by this change?

@jhump
Copy link
Contributor Author

jhump commented May 7, 2025

Can you create a test case that hits the problem you found and is solved by this change?

Yep, will do.

@@ -720,7 +720,7 @@ func (t DayTransform) MarshalText() ([]byte, error) {

func (DayTransform) String() string { return "day" }

func (DayTransform) ResultType(Type) Type { return PrimitiveTypes.Date }
func (DayTransform) ResultType(Type) Type { return PrimitiveTypes.Int32 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://iceberg.apache.org/spec/#partition-transforms

day transform result type should be int

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably where I got it from 😄

Then we're all good here once @jhump adds a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably get to adding a test early next week. Sorry for the delays. I got side-tracked with some other urgent stuff that popped up this week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to apologize :P
that's just how OSS contributions go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup yup! No rush nor need for apologizing

Copy link
Contributor Author

@jhump jhump May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I've pushed a test. But, unfortunately, it did not reproduce the issue I originally encountered.

I stepped through everything in this test with a debugger to figure out what was going on, since I was surprised it passed. It turns out that the avro encoder will happily accept an int32 value (and also accept an iceberg.Date, since its underlying kind is int32) even for a field with the "date" logical type. It is a little odd that it doesn't survive a round-trip: the tests writes an int32 but gets back an iceberg.Date when reading that just-written file. But I didn't add an assertion to verify that all values round-trip correctly because (sigh 😢) none of the int32 values do: while this one gets deserialized as a time.Time (and then converted to iceberg.Date in dataFile.initializeMapData), all of the others get deserialized as int (instead of int32). ☹️

I did a little spelunking in my own codebase, to try to figure out what error I saw when I originally encountered an issue. It turns out that my application has extra checks before writing a manifest, which verify that the partition data for each data file has the correct/expected types of values. The map had an int32 value, but because DayTransform.ResultType() returned Date instead of Int32 the checks was expecting a time.Time (what would have been loaded directly from Avro, due to the "date" logical type in the schema) or an iceberg.Date. So it was my own code that was returning an error, not code in this library.

I still think this change makes sense, despite the fact that things do actually work without it. It is possible that integrating with something that tries to read the manifest could fail since the type in the manifest doesn't quite match the expected type per the Iceberg spec. 🤷

@zeroshade zeroshade changed the title Fix DayTransform fix(transforms): DayTransform May 9, 2025
@jhump
Copy link
Contributor Author

jhump commented May 19, 2025

Hmmm, looks like a semantic merge conflict with #427...

@jhump
Copy link
Contributor Author

jhump commented Jun 4, 2025

@zeroshade, I finally had a second to look at this. It was a very simple/tiny fix, but I just didn't have a chance to even look at it before now. Tests and lint are green for me locally now.

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zeroshade zeroshade merged commit 0ca80e0 into apache:main Jun 5, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants