Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS: Change S3FileIO to use SHA1 based checksums #10293

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

muddyfish
Copy link

Issue

Using S3 Express for Iceberg was previously failing multipart upload integration tests as well as when s3.checksum-enabled was set. This was because S3 Express doesn't support using MD5 based checksums.

Fix

Changes S3FileIO to use SHA1 based checksums when checksumming is enabled.

Testing

Ran S3FileIO integration tests against both a regular S3 bucket as well as an S3 Express bucket. For regular S3, all tests pass, and for S3 Express, the checksumming errors are now resolved.

@github-actions github-actions bot added the AWS label May 9, 2024
Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of questions :

  1. Are there scenarios in which we would prefer checksum to be MD5 instead of SHA1 from S3 perspective ?
  2. Was reading MD5 is faster than SHA1 so does it makes sense to expose the algorithm selection as a parameter, rather than making it default ? And specify in iceberg docs that when using express use SHA1 ?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have similar questions to @singhpk234. I think changing the the checksum default to a more expensive SHA1 validation may be more expensive without any significant benefits. TLS already would have some more modern HMAC integrity check anyways.

I also think we should avoid adding any net new configuration to S3FileIO just to keep things simpler.

One possible path may be to deprecate the checksumEnabled path, and add a new configuration for checksumAlgorithm where the possible values are none, md5, sha1 etc. The default would be none to preserve the existing behavior and that way the number of configurations remain the same in the long run, and we could also support the S3 express case.

What do you think @muddyfish @singhpk234?

@muddyfish
Copy link
Author

I think that's a reasonable comment, and I'd be happy going for that path forward.

@singhpk234
Copy link
Contributor

What do you think @muddyfish @singhpk234?

Sounds good, @amogh-jahagirdar !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants