
feat: add consecutive batch shard sampler for pytorch #3886

Merged: 1 commit into lancedb:main on Jun 19, 2025

Conversation

@Jay-ju (Contributor) commented on May 27, 2025

No description provided.

@github-actions bot added the enhancement (New feature or request) and python labels on May 27, 2025
@Jay-ju changed the title from "feat: supporting shard sampler enables the output of consecutive batc…" to "feat: supporting shard sampler enables the output of consecutive batches and can be directly connected to pytorch dataloader at the same time" on May 27, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Jay-ju changed the title from "feat: supporting shard sampler enables the output of consecutive batches and can be directly connected to pytorch dataloader at the same time" to "feat: add consecutive batch shard sampler (PyTorch)" on May 27, 2025
@Jay-ju changed the title from "feat: add consecutive batch shard sampler (PyTorch)" to "feat: add consecutive batch shard sampler for pytorch" on May 27, 2025
@Jay-ju force-pushed the add_consecutive_batches branch 2 times, most recently from 2f44073 to 2f9194c on May 27, 2025 at 12:22
@Jay-ju (Contributor, Author) commented on Jun 2, 2025

@westonpace @wjones127 When you have time, could you please review this PR?

@yanghua (Collaborator) left a comment

Left some comments.

self._len = self._compute_length()
self._epoch = 0

# can't have filter
Collaborator: What does this mean?

Contributor (Author): The sampler is implemented so that the rows within each batch are all adjacent, so we don't want to use a filter that would break this adjacency.

Collaborator: Understood; IMO you should make the comment clearer.

@@ -90,3 +90,14 @@ def pytest_collection_modifyitems(config, items):
disable_items_with_mark(items, "torch", reason)
disable_items_with_mark(items, "cuda", reason)
disable_items_with_mark(items, "gpu", reason)


def has_cuda():
Collaborator: What about renaming it to is_cuda_available?

Contributor (Author): It was unused, so I deleted it.

Comment on lines 425 to 431
Parameters:
rank (int): Process ID in distributed cluster
world_size (int): Total processes in cluster
total_num_rows (int): [Index Mode] Total dataset rows
batch_size (int): [Index Mode] Rows per batch
randomize (bool): Enable batch order randomization
seed (int): Random seed for reproducibility
Collaborator: These param comments may not follow the code style?

Contributor (Author): This doesn't seem to have triggered a code style error?

Collaborator: I mean that you listed and added parameters for __init__ here; if so, you can add a docstring for the __init__ method and use a doc style like this:

        Parameters
        ----------
        column : str
            The column to be indexed.  Must be a boolean, integer, float,
            or string column.
        index_type : str
            The type of the index.  One of ``"BTREE"``, ``"BITMAP"``,
            ``"LABEL_LIST"``, ``"NGRAM"``, ``"FTS"`` or ``"INVERTED"``.
        name : str, optional
            The index name. If not provided, it will be generated from the
            column name.
        replace : bool, default True
            Replace the existing index if it exists.

Contributor (Author): Got it.

@Jay-ju force-pushed the add_consecutive_batches branch 4 times, most recently from 74d51d2 to 3110969 on June 16, 2025 at 13:56
@Jay-ju force-pushed the add_consecutive_batches branch 4 times, most recently from 370aaf0 to 1a79609 on June 18, 2025 at 10:12
…hes and can be directly connected to pytorch dataloader at the same time

Signed-off-by: jukejian <jukejian@bytedance.com>
@Jay-ju force-pushed the add_consecutive_batches branch from 1a79609 to ad7a855 on June 18, 2025 at 11:32
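
Since the commit message above mentions connecting the sampler directly to a PyTorch DataLoader, here is a hedged usage sketch. ToyRows is a made-up stand-in for a Lance-backed dataset, and ConsecutiveBatchShardSampler refers to the hypothetical sketch earlier in this conversation, not Lance's actual classes.

    import torch
    from torch.utils.data import DataLoader, Dataset


    class ToyRows(Dataset):
        # Stand-in for a Lance-backed map-style dataset; purely illustrative.
        def __len__(self):
            return 100

        def __getitem__(self, idx):
            return torch.tensor(idx)


    # The sampler already yields whole batches of adjacent indices, so it is
    # passed as batch_sampler and the DataLoader performs no extra batching.
    sampler = ConsecutiveBatchShardSampler(
        rank=0, world_size=2, total_num_rows=100, batch_size=8
    )
    loader = DataLoader(ToyRows(), batch_sampler=sampler)

    for batch in loader:
        print(batch)  # tensors of consecutive row ids from this rank's shard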
@jackye1995 (Contributor) left a comment

looks good to me!

@jackye1995 (Contributor): @yanghua there is a requested change from you; has it been addressed?

@yanghua (Collaborator) left a comment

+1

@yanghua merged commit b7fb848 into lancedb:main on Jun 19, 2025
13 checks passed
Labels: enhancement (New feature or request), python
Projects: None yet
3 participants