feat: add consecutive batch shard sampler for pytorch #3886
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Force-pushed from 2f44073 to 2f9194c
@westonpace @wjones127 When you have time, could you please review this PR?
Left some comments.
python/python/lance/sampler.py
Outdated
self._len = self._compute_length()
self._epoch = 0

# can't have filter
What does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sampler here is implemented with the expectation that the rows within each batch of batch_size are all adjacent, so we don't want to apply a filter that would break this adjacency.
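To make the adjacency idea concrete, here is a minimal hypothetical sketch (the class name and details are assumptions, not the actual Lance implementation) of a sampler that shards consecutive batches across ranks and, when randomizing, shuffles only the *order* of batches while keeping the rows inside each batch contiguous:

```python
import math
import random


class ShardedBatchSampler:
    """Yield batches of consecutive row indices, sharded across ranks.

    Hypothetical sketch mirroring the parameters discussed in this PR
    (rank, world_size, total_num_rows, batch_size, randomize, seed).
    """

    def __init__(self, rank, world_size, total_num_rows, batch_size,
                 randomize=False, seed=0):
        self._rank = rank
        self._world_size = world_size
        self._total_num_rows = total_num_rows
        self._batch_size = batch_size
        self._randomize = randomize
        self._seed = seed
        self._epoch = 0

    def __len__(self):
        # Each rank takes every world_size-th batch, starting at its rank.
        total_batches = math.ceil(self._total_num_rows / self._batch_size)
        return len(range(self._rank, total_batches, self._world_size))

    def __iter__(self):
        total_batches = math.ceil(self._total_num_rows / self._batch_size)
        batch_ids = list(range(self._rank, total_batches, self._world_size))
        if self._randomize:
            # Shuffle the batch order only; rows inside a batch stay
            # contiguous, so no filter breaks the adjacency.
            random.Random(self._seed + self._epoch).shuffle(batch_ids)
        for b in batch_ids:
            start = b * self._batch_size
            end = min(start + self._batch_size, self._total_num_rows)
            yield list(range(start, end))

    def set_epoch(self, epoch):
        # Re-seed the shuffle each epoch for reproducible variation.
        self._epoch = epoch
```

Because every batch is a contiguous index range, each batch maps to a single sequential read of the underlying dataset, which is the whole point of avoiding filters here.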
Understood. In that case, IMO you should make the comment clearer.
python/python/tests/conftest.py
Outdated
@@ -90,3 +90,14 @@ def pytest_collection_modifyitems(config, items):
    disable_items_with_mark(items, "torch", reason)
    disable_items_with_mark(items, "cuda", reason)
    disable_items_with_mark(items, "gpu", reason)


def has_cuda():
What about renaming it to is_cuda_available?
It was unused, so I deleted it.
python/python/lance/sampler.py
Outdated
Parameters:
rank (int): Process ID in distributed cluster
world_size (int): Total processes in cluster
total_num_rows (int): [Index Mode] Total dataset rows
batch_size (int): [Index Mode] Rows per batch
randomize (bool): Enable batch order randomization
seed (int): Random seed for reproducibility
These param comments may not follow the code style?
This doesn't seem to have triggered a code style error?
I mean, here you listed the parameters you added for __init__. If so, you can add the documentation to the __init__ method itself, using a doc style like this:
Parameters
----------
column : str
The column to be indexed. Must be a boolean, integer, float,
or string column.
index_type : str
The type of the index. One of ``"BTREE"``, ``"BITMAP"``,
``"LABEL_LIST"``, ``"NGRAM"``, ``"FTS"`` or ``"INVERTED"``.
name : str, optional
The index name. If not provided, it will be generated from the
column name.
replace : bool, default True
Replace the existing index if it exists.
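Applied to the sampler parameters discussed in this PR, the __init__ docstring in the suggested numpydoc style might look like the sketch below (the class name and signature are assumptions for illustration, not the final code):

```python
class ShardedBatchSampler:
    def __init__(self, rank, world_size, total_num_rows=None,
                 batch_size=None, randomize=False, seed=0):
        """
        Parameters
        ----------
        rank : int
            Process ID in the distributed cluster.
        world_size : int
            Total number of processes in the cluster.
        total_num_rows : int, optional
            [Index Mode] Total number of rows in the dataset.
        batch_size : int, optional
            [Index Mode] Number of rows per batch.
        randomize : bool, default False
            Enable randomization of the batch order.
        seed : int, default 0
            Random seed for reproducibility.
        """
        self._rank = rank
        self._world_size = world_size
        self._total_num_rows = total_num_rows
        self._batch_size = batch_size
        self._randomize = randomize
        self._seed = seed
```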
Got it.
Force-pushed from 74d51d2 to 3110969
Force-pushed from 370aaf0 to 1a79609
…hes and can be directly connected to pytorch dataloader at the same time Signed-off-by: jukejian <jukejian@bytedance.com>
Force-pushed from 1a79609 to ad7a855
looks good to me!
@yanghua there is a requested change from you; has it been addressed?
+1
No description provided.