Evaluate use of mosaicml-streaming for data pipeline #165

Closed
yellowcap opened this issue Feb 26, 2024 · 2 comments
Labels
data-pipeline Pull Requests about the data pipeline

Comments

@yellowcap
Member

Streaming is a library for very large-scale, multi-node-ready data pipelines that is fully integrated with PyTorch.

We should evaluate this library because, for v1, the scale of the data will no longer allow the previous approach of downloading all training data to block storage.

@yellowcap yellowcap self-assigned this Feb 26, 2024
@yellowcap
Member Author

yellowcap commented Feb 26, 2024

I was able to generate mosaicml-streaming MDS files from a sample of our current data. I think I understand how the library works, and I think we can update the pipeline to output MDS files instead of TIFF files.

We can generate one set of MDS shards with an index for each MGRS tile. Then we can use the merge_index function to combine those into one main index that the dataloader can use.
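To make the per-tile layout concrete, here is a rough sketch of the idea in plain Python. It does not call mosaicml-streaming at all: it only mimics the "one index per MGRS tile, then one main index" structure with small JSON files. The field names (`shards`, `basename`, `samples`), the tile IDs, and the helper functions are all assumptions made for illustration, and the library's real index files carry more metadata than this.

```python
import json
import tempfile
from pathlib import Path


def write_tile_index(root: Path, tile: str, shard_sizes: list[int]) -> Path:
    """Write a toy per-tile index listing that tile's shard files (hypothetical layout)."""
    tile_dir = root / tile
    tile_dir.mkdir(parents=True, exist_ok=True)
    shards = [
        {"basename": f"shard.{i:05d}.mds", "samples": n}
        for i, n in enumerate(shard_sizes)
    ]
    index_path = tile_dir / "index.json"
    index_path.write_text(json.dumps({"version": 2, "shards": shards}))
    return index_path


def merge_tile_indexes(root: Path) -> dict:
    """Combine every <tile>/index.json under root into one main index at the root."""
    merged = []
    for index_path in sorted(root.glob("*/index.json")):
        tile = index_path.parent.name
        for shard in json.loads(index_path.read_text())["shards"]:
            # Prefix each shard path with its tile directory so the
            # main index can locate it relative to the root.
            merged.append({**shard, "basename": f"{tile}/{shard['basename']}"})
    main_index = {"version": 2, "shards": merged}
    (root / "index.json").write_text(json.dumps(main_index))
    return main_index


def demo() -> dict:
    """Build two toy tile directories and merge their indexes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_tile_index(root, "32TNS", [100, 40])  # hypothetical tile ID
        write_tile_index(root, "33UUP", [75])       # hypothetical tile ID
        return merge_tile_indexes(root)
```

In the real pipeline, the library's own writer would produce the shards and `merge_index` would do the combining step; the sketch only shows why per-tile outputs can be generated independently and joined afterwards.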

So I propose to go ahead and use this for the v0.2 run as a testbed for v1.

@yellowcap
Member Author

Initial tests have not resulted in speed improvements. Also, we are no longer planning to create prefabricated tiles, but will assume a streaming approach instead. This kind of dataset is not usable for the dynamic chipping scenario.
