MosaicML Streaming is a library for very large-scale, multi-node data pipelines that is fully integrated with PyTorch.
We should evaluate this library, because for v1 the scale of the data will no longer allow the previous approach of downloading all training data to block storage.
I was able to generate MosaicML Streaming MDS files from a sample of our current data. I think I understand how the library works, and I think we can update the pipeline to output MDS files instead of TIFF files.
We can generate one set of MDS shards with an index for each MGRS tile. Then we can use the merge_index function to combine those into one main index that the dataloader can use.
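To illustrate the idea of combining per-tile indexes, here is a minimal standard-library sketch. It does not call the library's merge_index and is not its implementation. It assumes a simplified, hypothetical index layout (one index.json per MGRS tile directory, holding a "shards" list whose entries have "basename" and "samples" fields); the names merge_tile_indexes and demo and the tile IDs are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def merge_tile_indexes(root: Path) -> dict:
    """Combine every <root>/<tile>/index.json into one <root>/index.json.

    Assumes a simplified, hypothetical index layout; the real MDS
    index format carries more fields per shard.
    """
    merged = []
    for index_path in sorted(root.glob("*/index.json")):
        tile = index_path.parent.name
        for shard in json.loads(index_path.read_text())["shards"]:
            shard = dict(shard)
            # Prefix the shard file name with its tile directory so the
            # top-level index can still locate the file.
            shard["basename"] = f"{tile}/{shard['basename']}"
            merged.append(shard)
    top = {"version": 2, "shards": merged}
    (root / "index.json").write_text(json.dumps(top))
    return top


def demo() -> dict:
    """Build two fake tile directories and merge their indexes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for tile, n_shards in [("32TNS", 2), ("10SEG", 1)]:
            tile_dir = root / tile
            tile_dir.mkdir()
            shards = [
                {"basename": f"shard.{i:05d}.mds", "samples": 100}
                for i in range(n_shards)
            ]
            (tile_dir / "index.json").write_text(
                json.dumps({"version": 2, "shards": shards})
            )
        return merge_tile_indexes(root)
```

The point of the layout is that each tile can be written independently (and in parallel), while the dataloader only needs to read the single top-level index.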
So I propose to go ahead and use this for the v0.2 run as a testbed for v1.
Initial tests did not show speed improvements. Also, we no longer plan to create prefabricated tiles and will assume a streaming approach instead. An MDS dataset of this kind is not usable for the dynamic chipping scenario.