Fix sharding in Caffe reader #5172
Conversation
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
def sample_id(sample):
    return sample.as_array().sum()
...
return {sample_id(p.run()[0]) for _ in range(size * num_shards)}
Suggested change:
- return {sample_id(p.run()[0]) for _ in range(size * num_shards)}
+ return {sample_id(p.run()[0]) for _ in range(size / num_shards)}
Here I want to go through all the shards. I thought that meta['epoch_size_padded'] would give me the size of a single shard, which is why I added * num_shards. With / num_shards I'd process only one shard, so that's not enough. But just size is what I want :) Thanks for pointing this out!
Fixed in 8517557
👍
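The exchange above turns on what epoch_size_padded means: it is the padded size of the whole dataset, not of a single shard, so iterating size samples already covers every shard. A minimal Python sketch of the shard arithmetic (the function name and padding scheme are illustrative, not DALI's actual API):

```python
def shard_bounds(dataset_size, shard_id, num_shards):
    """Compute [begin, end) sample indices for one shard, giving every
    shard the same padded size (illustrative sketch only)."""
    samples_per_shard = -(-dataset_size // num_shards)  # ceiling division
    begin = shard_id * samples_per_shard
    end = min(begin + samples_per_shard, dataset_size)
    return begin, end

# With 10 samples split into 4 shards, the shards tile the dataset:
# (0, 3), (3, 6), (6, 9), (9, 10) -- iterating dataset_size samples
# (not dataset_size * num_shards) visits each sample exactly once.
bounds = [shard_bounds(10, s, 4) for s in range(4)]
print(bounds)
```

This is why the test loop only needs range(size): the padded epoch size already spans all num_shards shards.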
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
MDB_val tmp_key, tmp_value;
mdb_cursor_get(mdb_cursor_, &tmp_key, &tmp_value, MDB_FIRST);
Can you add a comment here explaining why this is needed?
Added, thank you
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
!build
CI MESSAGE: [10866228]: BUILD STARTED
CI MESSAGE: [10866228]: BUILD PASSED
Category:
Bug fix (non-breaking change which fixes an issue)
Description:
The problem

This PR fixes a problem with sharding in the Caffe reader. The observed problem was that shards with shard_id > 0 contained samples they shouldn't; in particular, there was an overlap between subsequent shards.

The root cause

Our LMDB wrapper (IndexedLMDB) assumed that the cursor points at index 0 when created, which turned out to be false - the cursor position actually seems to be initialized to -1. I found no mention in the docs of the initial cursor position being defined or undefined, but their getting-started guide suggests that the cursor first needs to be positioned at the initial position with MDB_FIRST.

The solution

I now reset the cursor with MDB_FIRST in Open.

Why did it work with one shard?

When in the first shard, we reset to index = 0, causing SeekByIndex to use the absolute MDB_FIRST instead of the relative MDB_NEXT.

DALI/dali/operators/reader/loader/lmdb.h
Lines 80 to 90 in b350581
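The off-by-one behavior described above can be illustrated with a toy Python model (this is a sketch of the logic, not DALI's implementation; the class and function names are hypothetical):

```python
class ToyCursor:
    """Toy stand-in for an LMDB cursor; per the PR, the real cursor's
    position is observed to start at -1, not 0 (assumption)."""
    def __init__(self):
        self.pos = -1

    def first(self):   # models an absolute MDB_FIRST seek
        self.pos = 0

    def next(self):    # models a relative MDB_NEXT step
        self.pos += 1


def seek_by_index(cursor, assumed_pos, target):
    """Simplified model of SeekByIndex: absolute seek for index 0,
    relative MDB_NEXT steps otherwise."""
    if target == 0:
        cursor.first()
    else:
        for _ in range(target - assumed_pos):
            cursor.next()
    return cursor.pos


# Without the fix: Open() assumes the cursor sits at 0, but it sits at -1,
# so a shard with shard_id > 0 starts one sample early, overlapping the
# previous shard.
broken = ToyCursor()
print(seek_by_index(broken, 0, 5))   # lands at 4, one sample short

# With the fix: Open() first resets the cursor with MDB_FIRST, so the
# assumed and actual positions agree and relative seeks land correctly.
fixed = ToyCursor()
fixed.first()
print(seek_by_index(fixed, 0, 5))    # lands at 5 as intended
```

Shard 0 is unaffected because its seek targets index 0 and therefore takes the absolute MDB_FIRST path, which resets the cursor regardless of its starting position.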
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A