Fix sharding in Caffe reader #5172

szkarpinski · 2023-11-16T14:01:01Z

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

The problem

This PR fixes sharding in the Caffe reader. Shards with shard_id > 0 contained samples that did not belong to them; in particular, consecutive shards overlapped.

The root cause

Our LMDB wrapper (IndexedLMDB) assumed that a newly created cursor points at index 0. This turned out to be false: the cursor position appears to be initialized to -1.

I found no mention in the LMDB docs of whether the initial cursor position is defined, but their Getting Started guide states:

For example, to list all key/value pairs in a database, use operation MDB_FIRST for the first call to mdb_cursor_get(), and MDB_NEXT on subsequent calls, until the end is hit.

which suggests that the cursor first needs to be positioned at the first entry with MDB_FIRST.

The solution

I now reset the cursor with MDB_FIRST in Open.

Why did it work with one shard?

In the first shard we reset to index = 0, which makes SeekByIndex use the absolute MDB_FIRST instead of the relative MDB_NEXT.

DALI/dali/operators/reader/loader/lmdb.h

Lines 80 to 90 in b350581

    
    if (index == 0) {
      CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_FIRST), db_path_);
    } else if (index == mdb_size_ - 1) {
      CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_LAST), db_path_);
    } else if (index == mdb_index_) {
      CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_GET_CURRENT), db_path_);
    } else if (index == mdb_index_ - 1) {
      CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_PREV), db_path_);
    } else if (index == mdb_index_ + 1) {
      CHECK_LMDB(mdb_cursor_get(mdb_cursor_, key, value, MDB_NEXT), db_path_);
    } else if (index > mdb_index_) {
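
The interaction described above can be illustrated with a small toy model. This is only a sketch: ToyCursor and ToyIndexedReader are hypothetical stand-ins written for this illustration, not the LMDB API or the IndexedLMDB class, and the starting position of -1 is the assumption stated in the description. Only seeking to index 0 and forward seeks are modelled.

```python
class ToyCursor:
    """Toy stand-in for an LMDB cursor (hypothetical, not the real API)."""

    def __init__(self, items):
        self._items = items
        # Assumption taken from the PR description: a fresh cursor
        # sits "before" the first entry.
        self._pos = -1

    def first(self):
        # Models the absolute MDB_FIRST operation.
        self._pos = 0
        return self._items[self._pos]

    def next(self):
        # Models the relative MDB_NEXT operation.
        self._pos += 1
        return self._items[self._pos]


class ToyIndexedReader:
    """Hypothetical wrapper that tracks where it believes the cursor is."""

    def __init__(self, items, reset_on_open):
        self._cursor = ToyCursor(items)
        self._index = 0  # the wrapper assumes the cursor starts at entry 0
        if reset_on_open:
            # The fix: position the cursor so the assumption holds.
            self._cursor.first()

    def seek(self, index):
        if index == 0:
            value = self._cursor.first()  # absolute, always correct
        else:
            assert index > self._index, "only forward seeks are modelled"
            for _ in range(index - self._index):
                value = self._cursor.next()  # relative to the real position
        self._index = index
        return value


data = ["s0", "s1", "s2", "s3"]
without_reset = ToyIndexedReader(data, reset_on_open=False).seek(2)
with_reset = ToyIndexedReader(data, reset_on_open=True).seek(2)
first_shard = ToyIndexedReader(data, reset_on_open=False).seek(0)
print(without_reset, with_reset, first_shard)
```

In this model, a reader that starts at a non-zero index without the reset lands one entry too early, while a reader that starts at index 0 is unaffected because it never relies on the initial cursor position.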

Additional information:

Affected modules and functionalities:

LMDB loader, Caffe(2) readers

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

JanuszL · 2023-11-16T14:06:48Z

dali/test/python/reader/test_caffe.py

+        def sample_id(sample):
+            return sample.as_array().sum()
+
+        return {sample_id(p.run()[0]) for _ in range(size * num_shards)}


Suggested change

-        return {sample_id(p.run()[0]) for _ in range(size * num_shards)}
+        return {sample_id(p.run()[0]) for _ in range(size / num_shards)}

Here I want to go through all the shards. I thought that meta['epoch_size_padded'] would give me the size of a single shard, which is why I added * num_shards. With / num_shards I would process only one shard, so that's not enough. Just size is what I want :) Thanks for pointing this out!

Fixed in 8517557
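
For context, the property such a sharding test is after can be sketched independently of DALI. The helpers below (shard_range, overlapping_pairs, missing_samples) are hypothetical names introduced for this illustration, and the contiguous split in shard_range is an assumption about how shard boundaries are computed, not something taken from this PR. The "shifted" shards simulate a hypothetical off-by-one start in every shard with shard_id > 0.

```python
from itertools import combinations


def shard_range(shard_id, num_shards, size):
    # Assumed contiguous split of `size` samples into `num_shards` shards.
    start = size * shard_id // num_shards
    end = size * (shard_id + 1) // num_shards
    return range(start, end)


def overlapping_pairs(shards):
    # Pairs of shard indices whose sample-id sets share at least one element.
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(shards), 2)
        if a & b
    ]


def missing_samples(shards, size):
    # Sample ids that no shard returned.
    return set(range(size)) - set().union(*shards)


size, num_shards = 10, 3
expected = [set(shard_range(i, num_shards, size)) for i in range(num_shards)]

# Hypothetical faulty result: every shard except the first starts one
# sample too early.
shifted = [expected[0]] + [{i - 1 for i in shard} for shard in expected[1:]]

print(overlapping_pairs(expected), missing_samples(expected, size))
print(overlapping_pairs(shifted), missing_samples(shifted, size))
```

A check of this shape only needs each shard's full set of sample ids, which is why iterating over one epoch per shard is enough.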

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

klecki · 2023-11-17T11:29:20Z

dali/operators/reader/loader/lmdb.h

+    MDB_val tmp_key, tmp_value;
+    mdb_cursor_get(mdb_cursor_, &tmp_key, &tmp_value, MDB_FIRST);


Can you add a comment here explaining why it is needed?

Added, thank you

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

szkarpinski · 2023-11-17T11:41:45Z

!build

dali-automaton · 2023-11-17T11:45:44Z

CI MESSAGE: [10866228]: BUILD STARTED

dali-automaton · 2023-11-17T14:11:28Z

CI MESSAGE: [10866228]: BUILD PASSED

Reset LMDB cursor to the first entry

4cbbf07

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

JanuszL reviewed Nov 16, 2023

JanuszL approved these changes Nov 16, 2023

JanuszL self-assigned this Nov 16, 2023

Reduce number of iterations in the test

8517557

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

JanuszL approved these changes Nov 16, 2023

dali-automaton assigned klecki Nov 17, 2023

klecki approved these changes Nov 17, 2023

Add comment

b038033

Signed-off-by: Szymon Karpiński <skarpinski@nvidia.com>

klecki approved these changes Nov 17, 2023

szkarpinski merged commit b462beb into NVIDIA:main Nov 17, 2023
5 checks passed

JanuszL added the important-fix label (Fixes an important issue in the software or development environment.) Jan 29, 2024
