
Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets#505

Merged
tchaton merged 35 commits into Lightning-AI:main from bhimrazy:feat/add-streaming-support-to-hf
Mar 20, 2025
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Mar 10, 2025

What does this PR do?

This PR introduces the following improvements:

  • Updates HFDownloader to use hf_hub_download, which can be combined with hf_transfer for faster downloads.
  • Enhances the parquet dataset indexing process to retrieve metadata and build the index without downloading entire files.
  • Adds a low-memory option to the ParquetLoader item loader.
  • Updates test cases to reflect these changes.
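The hf_hub_download path from the first bullet can be sketched as follows. This is only an illustrative sketch, not the actual HFDownloader code; the repo and file names are examples taken from the benchmarks below:

```python
import os

# hf_transfer is a Rust-based downloader; it must be installed separately
# (`pip install hf_transfer`) and enabled via this env var before downloading.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

def download_parquet_shard(repo_id: str, filename: str, local_dir: str) -> str:
    """Download one parquet shard from a HF dataset repo (requires network)."""
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )

# Illustrative call (not run here):
# download_parquet_shard(
#     "open-thoughts/OpenThoughts-114k",
#     "data/train-00000-of-00006.parquet",
#     "/tmp/shards",
# )
```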

Fixes #502.

Example usage:

from tqdm import tqdm
from litdata import StreamingDataLoader, StreamingDataset
from litdata.streaming.item_loader import ParquetLoader

# Create a StreamingDataset object
dataset = StreamingDataset(
    input_dir="hf://datasets/open-thoughts/OpenThoughts-114k/data",
    item_loader=ParquetLoader(low_memory=True),  # low_memory=True is the default
)

data_loader = StreamingDataLoader(dataset=dataset, batch_size=8, num_workers=4, shuffle=False, drop_last=True)

for i, batch in enumerate(tqdm(data_loader, desc="Streaming data")):
    pass

Benchmarks

Using LitData

These benchmarks were generated using this script with the following settings: batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

For more detailed logs (older), please check this comment.

Note: times in parentheses are the wall-clock time to complete that epoch.


Dataset: OpenThoughts-114k (3.55 GB)

| Case | Low memory | Pre-load chunk | Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|------|------------|----------------|---------|-------------------------|-------------------------|-------------|
| a | yes | no | no | 14,329 (7.95s) | 15,948 (7.14s) | ~13 GB |
| b | yes | no | yes | 14,093 (8.08s) | 16,026 (7.11s) | ~13 GB |
| c | no | no | no | 12,145 (9.38s) | 12,779 (8.91s) | ~23 GB |
| d | no | no | yes | 11,696 (9.74s) | 12,747 (8.93s) | ~23 GB |
| e | no | yes | no | 11,707 (9.73s) | 10,602 (10.75s) | ~33 GB |
| f | no | yes | yes | 11,296 (10.08s) | 10,554 (10.79s) | ~34 GB |

Dataset: fineweb-edu (10BT Sample) (~26 GB)

| Case | Low memory | Pre-load chunk | Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|------|------------|----------------|---------|-------------------------|-------------------------|-------------|
| a | yes | no | no | 36,918 (261.98s) | 44,803 (215.87s) | ~18 GB |
| b | yes | no | yes | 34,983 (276.47s) | 39,966 (242s) | ~58 GB |
| c | no | no | no | - (crashed) | - | ~120 GB |

Using huggingface datasets streaming

These benchmarks were generated using this script with the following settings: batch_size = 256, num_workers = 32, and machine = A10G. The results may vary slightly across different runs.

For num_workers, it doesn't seem to accept a value greater than dataset.num_shards:
Warning: Too many dataloader workers: 32 (max is dataset.num_shards=6). Stopping 26 dataloader workers.
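A minimal sketch of the comparison setup (the actual benchmark script isn't reproduced here). The worker cap mirrors the warning above, and the dataset name is taken from these benchmarks:

```python
def effective_num_workers(num_workers: int, num_shards: int) -> int:
    """datasets caps DataLoader workers at the number of shards."""
    return min(num_workers, num_shards)

# Illustrative streaming setup (requires network and the `datasets` package):
# from datasets import load_dataset
# from torch.utils.data import DataLoader
#
# ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train", streaming=True)
# loader = DataLoader(ds, batch_size=256,
#                     num_workers=effective_num_workers(32, ds.num_shards))

print(effective_num_workers(32, 6))  # 6
```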

Dataset: OpenThoughts-114k (3.55 GB)
| Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|---------|-------------------------|-------------------------|-------------|
| no | 14,834 (7.66s) | 15,115 (7.51s) | ~9 GB |
| yes | 12,173 (9.33s) | 12,288 (9.23s) | ~9 GB |

Dataset: fineweb-edu (10BT Sample) (~26 GB)

| Shuffle | Samples/sec (1st epoch) | Samples/sec (2nd epoch) | Peak memory |
|---------|-------------------------|-------------------------|-------------|
| no | 44,519 (217.23s) | 44,178 (218.90s) | ~55 GB |
| yes | 33,147 (291.74s) | 33,900 (285.27s) | ~55 GB |

PR Review

Community members are welcome to review this PR once tests have passed.
If your PR was not previously discussed in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Yes, I did. 😊

@bhimrazy bhimrazy added the enhancement New feature or request label Mar 10, 2025
@bhimrazy bhimrazy self-assigned this Mar 10, 2025
@tchaton
Collaborator

tchaton commented Mar 11, 2025

@bhimrazy I wonder if we could benchmark pyarrow vs polars for streaming the data

@bhimrazy
Collaborator Author

bhimrazy commented Mar 11, 2025

@bhimrazy I wonder if we could benchmark pyarrow vs polars for streaming the data

Sure @tchaton , will check that.

duckdb also seems to be another option.
But none of these options seem to provide an efficient way of reading particular rows.

@codecov

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 91.28205% with 17 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (a2b2570) to head (7c87dea).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #505   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        39     39           
  Lines      5859   5887   +28     
===================================
+ Hits       4621   4648   +27     
- Misses     1238   1239    +1     

@bhimrazy bhimrazy changed the title [wip]: Feat/add streaming support to hf Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets Mar 18, 2025
@bhimrazy bhimrazy marked this pull request as ready for review March 18, 2025 17:58
@bhimrazy bhimrazy requested a review from tchaton March 18, 2025 17:58
@bhimrazy
Collaborator Author

bhimrazy commented Mar 18, 2025

Benchmarks for open-thoughts/OpenThoughts-114k

  • batch_size = 256, num_workers=32, machine=A10G
a) Results for this PR with low_memory=True:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1 python stream_hf_dataset.py 
Seed set to 42
Shuffle: False, Preload: False, Low Memory: True
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 48.63step/s]
Total number of samples in the dataset: 113957
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 218MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 193MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 207MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 189MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 135MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 51.68it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 8.631164073944092 or 13202.965949617279 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:07<00:00, 61.00it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 7.311708211898804 or 15585.54473275472 samples/sec.
Finished benchmarking.
b) Results for this PR with low_memory=False:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=0 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 23.32step/s]
Total number of samples in the dataset: 113957
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 234MB/s]
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 232MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 226MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 227MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 205MB/s]
train-00001-of-00006.parquet: 100%|████████████████████████████████████████████████████████████████████████████| 175M/175M [00:02<00:00, 72.0MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 47.54it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.382962703704834 or 12145.093821165492 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:08<00:00, 50.02it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 8.917301416397095 or 12779.310479351509 samples/sec.
Finished benchmarking.
c) Results for this PR with low_memory=False and pre_load_chunk=True:
main ~/litdata-benchmark SHUFFLE=0 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: False, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 42.90step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 216MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 224MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 219MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 203MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 233MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:09<00:00, 45.82it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 9.733589887619019 or 11707.597777877443 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.50it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.748220205307007 or 10602.402993807767 samples/sec.
Finished benchmarking.
d) Results for this PR with low_memory=False, pre_load_chunk=True and shuffle=True:
main ~/litdata-benchmark SHUFFLE=1 PRELOAD=1 LOW_MEMORY=0 python stream_hf_dataset.py
Seed set to 42
Shuffle: True, Preload: True, Low Memory: False
Indexing HF dataset from hf://datasets/open-thoughts/OpenThoughts-114k/data into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json.
Indexing progress:   0%|                                                                                                  | 0/6 [00:00<?, ?step/s]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/0c2f2da21e97421c865837df5d13b4af5511701940e3c2d18f3659ae37907f91/index.json
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 12.96step/s]
Total number of samples in the dataset: 113957
train-00004-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 154M/154M [00:00<00:00, 221MB/s]
train-00002-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 173M/173M [00:00<00:00, 218MB/s]
train-00005-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 152M/152M [00:00<00:00, 191MB/s]
train-00001-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 175M/175M [00:00<00:00, 217MB/s]
train-00003-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 174M/174M [00:00<00:00, 192MB/s]
train-00000-of-00006.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 250M/250M [00:01<00:00, 214MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 44.21it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 113957 samples in 10.087934494018555 or 11296.361705695894 samples/sec.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 446/446 [00:10<00:00, 41.31it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 113957 samples in 10.79666018486023 or 10554.834578638292 samples/sec.
Finished benchmarking.


Benchmarks for HuggingFaceFW/fineweb-edu 10BT sample

  • batch_size = 256, num_workers=32, machine=A10G
a) Results for this PR with low_memory=True:
main ~/litdata-benchmark DATASET=1 SHUFFLE=0 PRELOAD=0 LOW_MEMORY=1  python stream_hf_dataset.py 

Seed set to 42
Benchmarking using litdata version: 0.2.42
Shuffle: False, Preload: False, Low Memory: True 
Dataset: hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT
Indexing HF dataset from hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT into /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba5
7c5a17f304dc97347a732/index.json.
Indexing progress:   0%|                                                                                                 | 0/14 [00:00<?, ?step/s]'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 5c7d2c3a-6572-46bf-9d9e-7177ec4138e3)')' thrown while requesting GET https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/resolve/main/sample/10BT/004_00000.parquet
Retrying in 1s [Retry 1/5].
Indexing progress:  36%|███████████████████████████████▊                                                         | 5/14 [00:11<00:20,  2.24s/step]
Index file successfully written to: /teamspace/studios/this_studio/.cache/litdata-cache-index-pq/137494f2a7a4a992c32c1f0beab95c60c7933474ba57c5a17f304dc97347a732/index.json
Indexing progress: 100%|████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:11<00:00,  1.25step/s]
Total number of samples in the dataset: 9672101
013_00000.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████| 541M/541M [00:02<00:00, 228MB/s]
011_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 203MB/s]
002_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
003_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 202MB/s]
004_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 201MB/s]
001_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:10<00:00, 199MB/s]
000_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 193MB/s]
012_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 191MB/s]
009_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 186MB/s]
005_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:11<00:00, 182MB/s]
008_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 179MB/s]
010_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 177MB/s]
006_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 176MB/s]
007_00000.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2.15G/2.15G [00:12<00:00, 172MB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [04:25<00:00, 142.39it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 0, streamed over 9672101 samples in 265.3391079902649 or 36451.84810564034 samples/sec.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 37782/37782 [03:38<00:00, 172.91it/s]
For /teamspace/studios/this_studio/litdata-benchmark/stream_hf_dataset.py on 1, streamed over 9672101 samples in 218.5037910938263 or 44265.1395731683 samples/sec.


Collaborator

@tchaton tchaton left a comment


Simply put, fantastic work!

@tchaton tchaton merged commit 6e4a409 into Lightning-AI:main Mar 20, 2025
29 checks passed
@bhimrazy bhimrazy deleted the feat/add-streaming-support-to-hf branch March 20, 2025 08:19

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add streaming support for huggingface parquet dataset similar to chunk stream

2 participants