Fix sharding in Caffe reader #5172

Merged: 3 commits, Nov 17, 2023

@szkarpinski szkarpinski commented Nov 16, 2023

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

The problem

This PR fixes sharding in the Caffe reader. Shards with shard_id > 0 contained samples that did not belong to them; in particular, consecutive shards overlapped.
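A minimal sketch of how such an overlap can be detected, assuming each shard has already been read into a set of sample IDs. The helper name overlapping_shards and the toy ID sets are hypothetical; they are not taken from the DALI test suite.

```python
from itertools import combinations


def overlapping_shards(shards):
    """Return (i, j, shared_ids) for every pair of shards that share a sample."""
    return [
        (i, j, shards[i] & shards[j])
        for i, j in combinations(range(len(shards)), 2)
        if shards[i] & shards[j]
    ]


# Hypothetical sample IDs, one set per shard.
healthy = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]
broken = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}]  # neighbouring shards share one sample

healthy_overlaps = overlapping_shards(healthy)
broken_overlaps = overlapping_shards(broken)
```

With correct sharding the result is an empty list; any entry names the two shards involved and the samples they share.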

The root cause

Our LMDB wrapper (IndexedLMDB) assumed that a newly created cursor points at index 0. That turned out to be false: the cursor position seems to be initialized to -1.

I found nothing in the LMDB docs that says whether the initial cursor position is defined, but their Getting Started guide states:

For example, to list all key/value pairs in a database, use operation MDB_FIRST for the first call to mdb_cursor_get(), and MDB_NEXT on subsequent calls, until the end is hit.

which suggests that the cursor must first be positioned explicitly with MDB_FIRST.

The solution

I now reset the cursor with MDB_FIRST in Open.

Why did it work with one shard?

In the first shard, we reset to index = 0, so SeekByIndex uses the absolute MDB_FIRST instead of the relative MDB_NEXT:

if (index == 0) {
  CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_FIRST), db_path_);
} else if (index == mdb_size_ - 1) {
  CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_LAST), db_path_);
} else if (index == mdb_index_) {
  CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_GET_CURRENT), db_path_);
} else if (index == mdb_index_ - 1) {
  CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_PREV), db_path_);
} else if (index == mdb_index_ + 1) {
  CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_NEXT), db_path_);
} else if (index > mdb_index_) {
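
The effect can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: ToyCursor and ToyIndexedDB are hypothetical stand-ins, not the LMDB or DALI API; the cursor is assumed to start at position -1 as described above; and forward seeks are assumed to be carried out with repeated relative steps, since the snippet above is cut off before that branch.

```python
class ToyCursor:
    """Toy stand-in for an LMDB cursor (hypothetical, not the real API)."""

    def __init__(self, records):
        self.records = records
        self.pos = -1  # assumption: a fresh cursor sits before record 0

    def first(self):  # models the absolute MDB_FIRST
        self.pos = 0
        return self.records[self.pos]

    def next(self):  # models the relative MDB_NEXT
        self.pos += 1
        return self.records[self.pos]

    def current(self):  # models MDB_GET_CURRENT
        return self.records[self.pos]


class ToyIndexedDB:
    """Hypothetical wrapper that tracks where it believes the cursor is."""

    def __init__(self, records, reset_on_open):
        self.cursor = ToyCursor(records)
        self.index = 0  # the wrapper's belief about the cursor position
        if reset_on_open:
            self.cursor.first()  # the fix: make that belief true

    def seek(self, index):
        if index == 0:
            value = self.cursor.first()
        elif index == self.index:
            value = self.cursor.current()
        elif index > self.index:
            # Assumed forward branch: walk there with relative steps.
            for _ in range(index - self.index):
                value = self.cursor.next()
        else:
            raise NotImplementedError("backward seeks are left out of this sketch")
        self.index = index
        return value


records = list(range(10))
buggy_value = ToyIndexedDB(records, reset_on_open=False).seek(5)
fixed_value = ToyIndexedDB(records, reset_on_open=True).seek(5)
first_shard_value = ToyIndexedDB(records, reset_on_open=False).seek(0)
print(buggy_value, fixed_value, first_shard_value)  # prints: 4 5 0
```

In this model a shard that starts at index 5 begins one record early unless the cursor is reset on open, while a shard that starts at index 0 is unaffected because it always takes the absolute branch.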

Additional information:

Affected modules and functionalities:

  • LMDB loader, Caffe(2) readers

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
def sample_id(sample):
    return sample.as_array().sum()

return {sample_id(p.run()[0]) for _ in range(size * num_shards)}
Contributor

Suggested change
- return {sample_id(p.run()[0]) for _ in range(size * num_shards)}
+ return {sample_id(p.run()[0]) for _ in range(size / num_shards)}

Collaborator Author

Here I want to go through all the shards. I thought that meta['epoch_size_padded'] would give me the size of a single shard, which is why I added * num_shards. With / num_shards I would process only one shard, so that's not enough. But just size is what I want :) Thanks for pointing this out!
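
A rough sketch of the three iteration counts discussed here, on a toy model rather than a DALI pipeline. It assumes that size is the size of the whole dataset, that each run yields one sample, and that the reader starts at its shard's first sample and wraps around at the end; ids_seen and the sample IDs are hypothetical.

```python
def ids_seen(dataset, num_shards, shard_id, iterations):
    """Toy reader: start at the shard's first sample, read one sample per run."""
    size = len(dataset)
    start = shard_id * size // num_shards
    return {dataset[(start + i) % size] for i in range(iterations)}


dataset = list(range(100, 112))  # 12 hypothetical sample IDs
num_shards = 3
size = len(dataset)

one_shard = ids_seen(dataset, num_shards, 1, size // num_shards)
everything = ids_seen(dataset, num_shards, 1, size)
redundant = ids_seen(dataset, num_shards, 1, size * num_shards)
```

Under these assumptions size // num_shards runs cover a single shard, size runs cover every sample once, and size * num_shards runs produce the same set while doing num_shards times the work.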

Collaborator Author

Fixed in 8517557

Contributor

👍

@JanuszL JanuszL self-assigned this Nov 16, 2023
Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
Comment on lines +63 to +64
MDB_val tmp_key, tmp_value;
mdb_cursor_get(mdb_cursor_, &tmp_key, &tmp_value, MDB_FIRST);
Contributor

Can you add a comment here explaining why this is needed?

Collaborator Author

Added, thank you

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>
@szkarpinski
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [10866228]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [10866228]: BUILD PASSED

@szkarpinski szkarpinski merged commit b462beb into NVIDIA:main Nov 17, 2023
5 checks passed
@JanuszL JanuszL added the important-fix Fixes an important issue in the software or development environment. label Jan 29, 2024
4 participants