Add utility to merge datasets together #190
for chunk in input_dir_file_content["chunks"]:  # type: ignore
    assert isinstance(chunk, dict)
    old_filename = chunk["filename"]
    new_filename = f"chunk-0-{counter}.bin"
Nice work! This is exactly what I needed.
I noticed that this filename is incorrect when compression is zstd.
In that case the correct filename is "chunk-0-{counter}.{compression}.bin".
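A minimal sketch of the naming rule described in this comment, assuming the compression name is simply inserted before the `.bin` extension as stated above. The helper `chunk_filename` and its parameters are hypothetical and are not part of the PR:

```python
from typing import Optional


def chunk_filename(counter: int, compression: Optional[str] = None) -> str:
    """Build a chunk filename, adding the compression name when one is set.

    Hypothetical helper: the "chunk-0-" prefix mirrors the reviewed diff,
    and the compressed form follows the pattern given in the comment.
    """
    if compression:
        return f"chunk-0-{counter}.{compression}.bin"
    return f"chunk-0-{counter}.bin"


print(chunk_filename(3))
print(chunk_filename(3, "zstd"))
```

With a helper like this, the loop in the diff could build `new_filename` without hard-coding the uncompressed form.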
Hey @ouj. You are right! Do you want to contribute a fix?
Sure, will do when I get a chance. It should be a two-line change.
BTW @ouj, you can join our Discord to follow development on Litdata: https://discord.gg/BH765hvQ
Before submitting
This PR enables merging optimized datasets together.
# Create 2 different datasets
# Merged into a third one
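As a rough illustration of the renaming step in the reviewed diff, the sketch below concatenates the `"chunks"` lists of several index dictionaries and renumbers the filenames with one running counter. The function `merge_chunk_lists`, the plain-dict index layout, and the uncompressed `.bin` naming are assumptions made for this example; it does not touch files on disk and is not the PR's implementation:

```python
from typing import Any, Dict, List, Tuple


def merge_chunk_lists(
    indexes: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[int, str, str]]]:
    """Merge chunk entries from several indexes, renaming them sequentially.

    Returns the merged chunk entries and a list of
    (source position, old filename, new filename) renames.
    """
    merged: List[Dict[str, Any]] = []
    renames: List[Tuple[int, str, str]] = []
    counter = 0
    for source_id, index in enumerate(indexes):
        for chunk in index["chunks"]:
            old_filename = chunk["filename"]
            # Assumption: uncompressed chunks, named as in the diff.
            new_filename = f"chunk-0-{counter}.bin"
            # Copy the entry so the input index is left unmodified.
            merged.append({**chunk, "filename": new_filename})
            renames.append((source_id, old_filename, new_filename))
            counter += 1
    return merged, renames


first = {"chunks": [{"filename": "chunk-0-0.bin"}, {"filename": "chunk-0-1.bin"}]}
second = {"chunks": [{"filename": "chunk-0-0.bin"}]}
merged, renames = merge_chunk_lists([first, second])
print([c["filename"] for c in merged])
```

Keeping the rename list separate from the merged entries lets a caller copy or move the underlying files in a later step.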
What does this PR do?
Fixes # (issue).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃