Collaborator

@tchaton tchaton commented Jun 27, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

This PR adds `merge_datasets`, which merges several optimized datasets into a single one.

# Create 2 different datasets

from litdata import optimize, StreamingDataset

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="/teamspace/s3_connections/laoin-400m/folder_1",
        chunk_bytes="64MB",
    )

from litdata import optimize, StreamingDataset

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="/teamspace/s3_connections/laoin-400m/folder_2",
        chunk_bytes="64MB",
    )

# Merge them into a third one

from litdata import merge_datasets

merge_datasets(
    input_dirs=[
        "/teamspace/s3_connections/laoin-400m/folder_1",
        "/teamspace/s3_connections/laoin-400m/folder_2",
    ],
    output_dir="/teamspace/s3_connections/laoin-400m/folder_3"
)
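
To illustrate what merging means at the index level, here is a minimal sketch, not the code in this PR. It assumes each optimized dataset is described by an index dict with a `chunks` list whose entries carry a `filename`, as the loop in the diff suggests. The helper name `merge_indexes` and the `chunk_size` field are hypothetical and used only for illustration.

```python
from typing import Dict, List, Tuple


def merge_indexes(indexes: List[dict]) -> Tuple[dict, List[Dict[str, str]]]:
    """Concatenate the chunk lists of several index dicts (hypothetical helper).

    Returns the merged index and, for each input index, a mapping from the
    old chunk filename to the new one so the files can be copied or renamed.
    """
    merged_chunks = []
    renames = []
    counter = 0  # global chunk counter across all input datasets
    for index in indexes:
        mapping = {}
        for chunk in index["chunks"]:
            # Naming used in the diff; this covers uncompressed chunks only.
            new_filename = f"chunk-0-{counter}.bin"
            mapping[chunk["filename"]] = new_filename
            # Copy the chunk entry so the input index is left untouched.
            merged_chunks.append({**chunk, "filename": new_filename})
            counter += 1
        renames.append(mapping)
    return {"chunks": merged_chunks}, renames


# Two toy indexes whose chunk filenames collide, as they would for
# folder_1 and folder_2 above.
index_1 = {"chunks": [{"filename": "chunk-0-0.bin", "chunk_size": 100}]}
index_2 = {"chunks": [{"filename": "chunk-0-0.bin", "chunk_size": 100}]}

merged, renames = merge_indexes([index_1, index_2])
print([c["filename"] for c in merged["chunks"]])
# → ['chunk-0-0.bin', 'chunk-0-1.bin']
```

The point of the global counter is that both inputs start at `chunk-0-0.bin`, so the second dataset's chunks have to be renumbered before they can share an output directory.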

What does this PR do?

Fixes # (issue).

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton requested a review from awaelchli as a code owner June 27, 2024 15:00
@tchaton tchaton merged commit f2c5a7b into main Jun 27, 2024
@tchaton tchaton deleted the add_support_for_merging_datasets branch June 27, 2024 16:35
for chunk in input_dir_file_content["chunks"]:  # type: ignore
    assert isinstance(chunk, dict)
    old_filename = chunk["filename"]
    new_filename = f"chunk-0-{counter}.bin"
Contributor

Nice work! This is exactly what I needed.

I noticed that this filename is incorrect when compression is zstd.

The correct filename is "chunk-0-{counter}.{compression}.bin"
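
A minimal sketch of that naming rule, assuming the compression name (for example `zstd`) is available as a string or `None`. The helper `chunk_filename` is hypothetical and is not part of litdata.

```python
from typing import Optional


def chunk_filename(counter: int, compression: Optional[str] = None) -> str:
    """Build a chunk filename, inserting the compression name when one is set."""
    if compression:
        return f"chunk-0-{counter}.{compression}.bin"
    return f"chunk-0-{counter}.bin"


print(chunk_filename(3))          # → chunk-0-3.bin
print(chunk_filename(3, "zstd"))  # → chunk-0-3.zstd.bin
```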

Collaborator Author

Hey @ouj. You are right! Do you want to contribute a fix?

Contributor

@ouj ouj Jul 28, 2024

Sure. Will do when I get a chance. Should be a two-line change.

Collaborator Author

BTW @ouj, you can join our Discord to follow development on LitData: https://discord.gg/BH765hvQ


3 participants