
feat: add consecutive batch shard sampler for pytorch #3886

Merged: 1 commit into lancedb:main on Jun 19, 2025

Conversation

@Jay-ju (Contributor) commented on May 27, 2025

No description provided.

@github-actions bot added the enhancement (New feature or request) and python labels on May 27, 2025
@Jay-ju changed the title from "feat: supporting shard sampler enables the output of consecutive batc…" to "feat: supporting shard sampler enables the output of consecutive batches and can be directly connected to pytorch dataloader at the same time" on May 27, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Jay-ju changed the title from "feat: supporting shard sampler enables the output of consecutive batches and can be directly connected to pytorch dataloader at the same time" to "feat: add consecutive batch shard sampler (PyTorch)" on May 27, 2025
@Jay-ju changed the title from "feat: add consecutive batch shard sampler (PyTorch)" to "feat: add consecutive batch shard sampler for pytorch" on May 27, 2025
@Jay-ju force-pushed the add_consecutive_batches branch 2 times, most recently from 2f44073 to 2f9194c on May 27, 2025 at 12:22
@Jay-ju (Contributor, Author) commented on Jun 2, 2025

@westonpace @wjones127 When you have time, could you please review this PR?

@yanghua (Collaborator) left a comment

Left some comments.

self._len = self._compute_length()
self._epoch = 0

# can't have filter
Collaborator: What does this mean?

Contributor (Author): The sampler is implemented so that the rows within each batch are all adjacent, so we don't want to use a filter that would break this adjacency.

Collaborator: Understood; IMO you should make the comment clearer.

@@ -90,3 +90,14 @@ def pytest_collection_modifyitems(config, items):
disable_items_with_mark(items, "torch", reason)
disable_items_with_mark(items, "cuda", reason)
disable_items_with_mark(items, "gpu", reason)


def has_cuda():
Collaborator: What about renaming it to is_cuda_available?

Contributor (Author): It was unused, so I deleted it.

Comment on lines 425 to 431
Parameters:
rank (int): Process ID in distributed cluster
world_size (int): Total processes in cluster
total_num_rows (int): [Index Mode] Total dataset rows
batch_size (int): [Index Mode] Rows per batch
randomize (bool): Enable batch order randomization
seed (int): Random seed for reproducibility
Collaborator: These param comments may not follow the code style?

Contributor (Author): This doesn't seem to have triggered a code style error?

Collaborator: I mean that you listed and added parameters for __init__ here; if so, you can add a docstring for the __init__ method and use a doc style like this:

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        index_type : str
            The type of the index.  One of ``"BTREE"``, ``"BITMAP"``,
            ``"LABEL_LIST"``, ``"NGRAM"``, ``"FTS"`` or ``"INVERTED"``.
        name : str, optional
            The index name. If not provided, it will be generated from the
            column name.
        replace : bool, default True
            Replace the existing index if it exists.

Contributor (Author): Got it.

@Jay-ju force-pushed the add_consecutive_batches branch 4 times, most recently from 74d51d2 to 3110969 on June 16, 2025 at 13:56
@Jay-ju force-pushed the add_consecutive_batches branch 4 times, most recently from 370aaf0 to 1a79609 on June 18, 2025 at 10:12
…hes and can be directly connected to pytorch dataloader at the same time

Signed-off-by: jukejian <jukejian@bytedance.com>
@Jay-ju force-pushed the add_consecutive_batches branch from 1a79609 to ad7a855 on June 18, 2025 at 11:32
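
Since the commit message above mentions connecting the sampler directly to a PyTorch DataLoader, here is a hedged usage sketch. ToyRows is a made-up stand-in for a Lance-backed dataset, and ConsecutiveBatchShardSampler refers to the hypothetical sketch earlier in this conversation, not Lance's actual classes.

    import torch
    from torch.utils.data import DataLoader, Dataset


    class ToyRows(Dataset):
        # Stand-in for a Lance-backed map-style dataset; purely illustrative.
        def __len__(self):
            return 100

        def __getitem__(self, idx):
            return torch.tensor(idx)


    # The sampler already yields whole batches of adjacent indices, so it is
    # passed as batch_sampler and the DataLoader performs no extra batching.
    sampler = ConsecutiveBatchShardSampler(
        rank=0, world_size=2, total_num_rows=100, batch_size=8
    )
    loader = DataLoader(ToyRows(), batch_sampler=sampler)

    for batch in loader:
        print(batch)  # tensors of consecutive row ids from this rank's shard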
@jackye1995 (Contributor) left a comment

looks good to me!

@jackye1995 (Contributor): @yanghua there is a requested change from you; has it been addressed?

@yanghua (Collaborator) left a comment

+1

@yanghua merged commit b7fb848 into lancedb:main on Jun 19, 2025
13 checks passed
Labels: enhancement (New feature or request), python
Projects: None yet
3 participants