
Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets#505

Merged
tchaton merged 35 commits into Lightning-AI:main from bhimrazy:feat/add-streaming-support-to-hf
Mar 20, 2025
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Mar 10, 2025

What does this PR do?

This PR introduces the following improvements:

  • Updates HFDownloader to use hf_hub_download, which can be combined with hf_transfer for faster downloads.
  • Enhances the parquet dataset indexing process to retrieve metadata and build the index without downloading entire files.
  • Adds a low-memory option to the ParquetLoader item loader.
  • Updates test cases to reflect these changes.
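The hf_hub_download path from the first bullet can be sketched as follows. This is only an illustrative sketch, not the actual HFDownloader code; the repo and file names are examples taken from the benchmarks below:

```python
import os

# hf_transfer is a Rust-based downloader; it must be installed separately
# (`pip install hf_transfer`) and enabled via this env var before downloading.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

def download_parquet_shard(repo_id: str, filename: str, local_dir: str) -> str:
    """Download one parquet shard from a HF dataset repo (requires network)."""
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )

# Illustrative call (not run here):
# download_parquet_shard(
#     "open-thoughts/OpenThoughts-114k",
#     "data/train-00000-of-00006.parquet",
#     "/tmp/shards",
# )
```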

Fixes #502.

Example usage:

from tqdm import tqdm
from litdata import StreamingDataLoader, StreamingDataset
from litdata.streaming.item_loader import ParquetLoader

# Create a StreamingDataset object
dataset = StreamingDataset(
    input_dir="hf://datasets/open-thoughts/OpenThoughts-114k/data",
    item_loader=ParquetLoader(low_memory=True),  # low_memory=True is the default
)

data_loader = StreamingDataLoader(dataset=dataset, batch_size=8, num_workers=4, shuffle=False, drop_last=True)

for i, batch in enumerate(tqdm(data_loader, desc="Streaming data")):
    pass

Benchmarks

Using LitData

These benchmarks were generated using this script with the following settings: batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

For more detailed logs (older), please check this comment.

Note: times in parentheses are the wall-clock time to complete that epoch.


Dataset: OpenThoughts-114k (3.55 GB)

| Case | Low memory | Pre-load chunk | Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|------|------------|----------------|---------|-------------------------|-------------------------|-------------|
| a | yes | no | no | 14,329 (7.95s) | 15,948 (7.14s) | ~13 GB |
| b | yes | no | yes | 14,093 (8.08s) | 16,026 (7.11s) | ~13 GB |
| c | no | no | no | 12,145 (9.38s) | 12,779 (8.91s) | ~23 GB |
| d | no | no | yes | 11,696 (9.74s) | 12,747 (8.93s) | ~23 GB |
| e | no | yes | no | 11,707 (9.73s) | 10,602 (10.75s) | ~33 GB |
| f | no | yes | yes | 11,296 (10.08s) | 10,554 (10.79s) | ~34 GB |

Dataset: fineweb-edu (10BT Sample) (~26 GB)

| Case | Low memory | Pre-load chunk | Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|------|------------|----------------|---------|-------------------------|-------------------------|-------------|
| a | yes | no | no | 36,918 (261.98s) | 44,803 (215.87s) | ~18 GB |
| b | yes | no | yes | 34,983 (276.47s) | 39,966 (242s) | ~58 GB |
| c | no | no | no | - (crashed) | - | ~120 GB |

Using huggingface datasets streaming

These benchmarks were generated using this script with the following settings: batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

For num_workers, it doesn't seem to accept a value greater than dataset.num_shards:
Warning: Too many dataloader workers: 32 (max is dataset.num_shards=6). Stopping 26 dataloader workers.
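A minimal sketch of the comparison setup (the actual benchmark script isn't reproduced here). The worker cap mirrors the warning above, and the dataset name is taken from these benchmarks:

```python
def effective_num_workers(num_workers: int, num_shards: int) -> int:
    """datasets caps DataLoader workers at the number of shards."""
    return min(num_workers, num_shards)

# Illustrative streaming setup (requires network and the `datasets` package):
# from datasets import load_dataset
# from torch.utils.data import DataLoader
#
# ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train", streaming=True)
# loader = DataLoader(ds, batch_size=256,
#                     num_workers=effective_num_workers(32, ds.num_shards))

print(effective_num_workers(32, 6))  # 6
```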

Dataset: OpenThoughts-114k (3.55 GB)
| Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|---------|-------------------------|-------------------------|-------------|
| no | 14,834 (7.66s) | 15,115 (7.51s) | ~9 GB |
| yes | 12,173 (9.33s) | 12,288 (9.23s) | ~9 GB |

Dataset: fineweb-edu (10BT Sample) (~26 GB)

| Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|---------|-------------------------|-------------------------|-------------|
| no | 44,519 (217.23s) | 44,178 (218.90s) | ~55 GB |
| yes | 33,147 (291.74s) | 33,900 (285.27s) | ~55 GB |

PR Review

Community members are welcome to review this PR once tests have passed.
If your PR was not previously discussed in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Yes, I did. 😊

@bhimrazy bhimrazy added the enhancement New feature or request label Mar 10, 2025
@bhimrazy bhimrazy self-assigned this Mar 10, 2025
@tchaton
Collaborator

tchaton commented Mar 11, 2025

@bhimrazy I wonder if we could benchmark pyarrow vs polars for streaming the data

@bhimrazy
Collaborator Author

bhimrazy commented Mar 11, 2025

@bhimrazy I wonder if we could benchmark pyarrow vs polars for streaming the data

Sure @tchaton , will check that.

duckdb also seems to be another option.
But none of these options seem to provide an efficient way of reading particular rows.

@codecov

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 91.28205% with 17 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (a2b2570) to head (7c87dea).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #505   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        39     39           
  Lines      5859   5887   +28     
===================================
+ Hits       4621   4648   +27     
- Misses     1238   1239    +1     

@bhimrazy bhimrazy changed the title [wip]: Feat/add streaming support to hf Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets Mar 18, 2025
@bhimrazy bhimrazy marked this pull request as ready for review March 18, 2025 17:58
@bhimrazy bhimrazy requested a review from tchaton March 18, 2025 17:58
@bhimrazy
Collaborator Author

bhimrazy commented Mar 18, 2025

Benchmarks for open-thoughts/OpenThoughts-114k

  • batch_size = 256, num_workers=32, machine=A10G
a) Results for this PR with low_memory=True:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1 python stream_hf_dataset.py 
Seed set to 42
Shuffle: False, Preload: False, Low Memory: True
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 48.63step/s]
Total number of samples in the dataset: 113957
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 218MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 193MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 207MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 189MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 135MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 51.68it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 8.631164073944092 or 13202.965949617279 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:07<00:00, 61.00it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 7.311708211898804 or 15585.54473275472 samples/sec.
Finished benchmarking.
b) Results for this PR with low_memory=False:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 23.32step/s]
Total number of samples in the dataset: 113957
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 234MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 232MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 227MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 205MB/s]
train-00001-of-00006.parquet: 100%|████████████████████████████████████████████████████████████████████████████| 175M/175M [00:02<00:00, 72.0MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 47.54it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.382962703704834 or 12145.093821165492 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 50.02it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 8.917301416397095 or 12779.310479351509 samples/sec.
Finished benchmarking.
c) Results for this PR with low_memory=False and pre_load_chunk=True:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: False, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 42.90step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 216MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 224MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 219MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 203MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 233MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 45.82it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.733589887619019 or 11707.597777877443 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.50it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.748220205307007 or 10602.402993807767 samples/sec.
Finished benchmarking.
d) Results for this PR with low_memory=False, pre_load_chunk=True and shuffle=True:
main ~/litdata-benchmark SHUFFLE=1 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: True, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 12.96step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 218MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 191MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 217MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 192MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 214MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 44.21it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 10.087934494018555 or 11296.361705695894 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.31it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.79666018486023 or 10554.834578638292 samples/sec.
Finished benchmarking.


Benchmarks for HuggingFaceFW/fineweb-edu 10BT sample

  • batch_size = 256, num_workers=32, machine=A10G
a) Results for this PR with low_memory=True:
main ~/litdata-benchmark DATASET=1 SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1  python stream_hf_dataset.py 

Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: True 
Dataset: hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT
Indexing HF dataset from hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba5
7c5a17f304dc97347a732/index.json.
Indexing progress:   0%|                                                                                                 | 0/14 [00:00<?, ?step/s]'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 5c7d2c3a-6572-46bf-9d9e-7177ec4138e3)')' thrown while requesting GET https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/resolve/main/sample/10BT/004_00000.parquet
Retrying in 1s [Retry 1/5].
Indexing progress:  36%|███████████████████████████████▊                                                         | 5/14 [00:11<00:20,  2.24s/step]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba57c5a17f304dc97347a732/index.json
Indexing progress: 100%|████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:11<00:00,  1.25step/s]
Total number of samples in the dataset: 9672101
013_00000.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████| 541M/541M [00:02<00:00, 228MB/s]
011_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 203MB/s]
002_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
003_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
004_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 201MB/s]
001_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 199MB/s]
000_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 193MB/s]
012_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 191MB/s]
009_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 186MB/s]
005_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 182MB/s]
008_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 179MB/s]
010_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 177MB/s]
006_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 176MB/s]
007_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 172MB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [04:25<00:00, 142.39it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 9672101 samples in 265.3391079902649 or 36451.84810564034 samples/sec.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [03:38<00:00, 172.91it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 9672101 samples in 218.5037910938263 or 44265.1395731683 samples/sec.


Collaborator

@tchaton tchaton left a comment


Simply put, fantastic work!

@tchaton tchaton merged commit 6e4a409 into Lightning-AI:main Mar 20, 2025
29 checks passed
@bhimrazy bhimrazy deleted the feat/add-streaming-support-to-hf branch March 20, 2025 08:19

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add streaming support for huggingface parquet dataset similar to chunk stream

2 participants