This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[NodeStorage] Expose the ability to set custom rocksdb::Options #671

Closed
gdanezis opened this issue Aug 3, 2022 · 7 comments · Fixed by MystenLabs/mysten-infra#123

Comments

@gdanezis
Contributor

gdanezis commented Aug 3, 2022

Currently we open the NodeStorage database with None as options, which does not allow callers (like Sui) to set limits on the size of memtables. We should allow opening with custom DB Options:

narwhal/node/src/lib.rs

Lines 57 to 70 in 442d853

pub fn reopen<Path: AsRef<std::path::Path>>(store_path: Path) -> Self {
    let rocksdb = open_cf(
        store_path,
        None,
        &[
            Self::HEADERS_CF,
            Self::CERTIFICATES_CF,
            Self::PAYLOAD_CF,
            Self::BATCHES_CF,
            Self::LAST_COMMITTED_CF,
            Self::SEQUENCE_CF,
        ],
    )
    .expect("Cannot open database");

@mwtian
Member

mwtian commented Aug 4, 2022

From reading the RocksDB docs, I believe the memtable size for each column family is by default bounded to 64MB (write_buffer_size). But there can be a large number of column families, across 1 db in Narwhal and >5 dbs in Sui, so the upper bound on total memtable size can be unexpectedly large. This matches the observation that the stress tool's memory usage stabilizes at ~3.5GB.

[Image: jemalloc heap profile of the stress tool at ~3.5GB memory usage]

This jemalloc heap profile (open it in a new tab to view it in full) was taken from stress when it was at ~3.5GB. There is a bit of weirdness, but I believe rocksdb memtables are using >2GB of memory here.

Having observability into the memtable sizes of individual rocksdb instances and column families could be useful in deciding an appropriate memtable size limit, as @gdanezis mentioned elsewhere.
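
As a rough sketch of what that observability could look like with rust-rocksdb (assuming direct access to the underlying DB handle; "rocksdb.cur-size-all-mem-tables" is a standard RocksDB property):

use rocksdb::DB;

// Sketch: dump per-column-family memtable sizes for one open DB.
// `cf_names` would be the same constants NodeStorage uses to open the DB.
fn log_memtable_sizes(db: &DB, cf_names: &[&str]) {
    for name in cf_names {
        if let Some(cf) = db.cf_handle(name) {
            // Active + unflushed immutable memtables for this CF, in bytes.
            let bytes = db
                .property_int_value_cf(cf, "rocksdb.cur-size-all-mem-tables")
                .ok()
                .flatten()
                .unwrap_or(0);
            println!("{name}: {bytes} bytes in memtables");
        }
    }
}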

In the short term we can try limiting memtable size per db to e.g. 512MB, and make sure it reduces total memtable size across dbs in the common case.
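
A minimal sketch of that short-term cap, assuming the options can be passed through when the store is opened (512MB is just the example figure above):

use rocksdb::Options;

// Sketch: cap total memtable memory for one DB instance at ~512MB.
// Per-CF write_buffer_size stays at its 64MB default unless also overridden.
fn bounded_db_options() -> Options {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    // Upper bound on the sum of all memtables across this DB's column families.
    opts.set_db_write_buffer_size(512 * 1024 * 1024);
    opts
}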

Longer term, I would like to understand better the rationale for using many rocksdb instances, instead of only relying on column families for isolation. Then maybe we can come up with a sensible way to limit total memory usage.

@mwtian
Member

mwtian commented Aug 5, 2022

I was investigating out of curiosity, because the memory usage here is much less than the peaks we see on devnet (e.g. 40GB). I learned that in Sui, each struct deriving DBMapUtils puts each field in a separate column family. This can explain the peak of the stress test: # of column families * 64MB > 2GB.

With MystenLabs/mysten-infra#109, stress memory usage seems to grow slower.

At the current scale, how column families are configured likely does not matter much, except when we want to set cross-DB or cross-CF limits. In the future, for read and write perf, we can try to group data that is read or written together into the same column families.

@huitseeker
Contributor

@oxade, to close out the above issue, would you be up for porting MystenLabs/mysten-infra#110 to Narwhal once it's done?

@oxade

oxade commented Aug 8, 2022

@oxade, to close out the above issue, would you be up for porting MystenLabs/mysten-infra#110 to Narwhal once it's done?

@huitseeker no problem.

@oxade

oxade commented Aug 9, 2022

See this PR for one way we could do it.
It basically gives the caller a way to specify the options for each table before opening.
We can port this to Narwhal.

The previous approach allows configuring the options, but only in one place.

Let me know your thoughts, folks.
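
For illustration, a rough sketch of the per-table shape (the names and CF strings here are made up for this comment, not the actual API in the PR):

use std::collections::BTreeMap;
use rocksdb::Options;

// Sketch: callers describe options per table (column family) up front,
// and the store applies them when opening. All names are hypothetical.
type TableOptionsMap = BTreeMap<&'static str, Options>;

fn example_table_options() -> TableOptionsMap {
    let mut per_table = TableOptionsMap::new();

    // A table read mostly by exact key can be tuned for point lookups.
    let mut certificates = Options::default();
    certificates.optimize_for_point_lookup(64); // ~64MB block cache
    per_table.insert("certificates", certificates);

    // A table with large values can get a bigger memtable.
    let mut batches = Options::default();
    batches.set_write_buffer_size(128 * 1024 * 1024);
    per_table.insert("batches", batches);

    per_table
}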

@mwtian
Member

mwtian commented Aug 10, 2022

Great @oxade! I'm taking a look.

Btw, two things I want to make a case for, maybe out of scope for this issue and the related PRs:

  1. Maybe we should not expose rocksdb options to typed store callers. Instead, we can stay at the high level, e.g. total write buffer size, "point read" for column family etc. (rough sketch after this list).
  • Sui will need to configure rocksdb instances to limit overall memory and disk usage, or even build or switch to a distributed db. Doing any of the above tasks with rocksdb options populated all over the place will be hard.
  2. We should use a single rocksdb instance per process. Rocksdb does not seem built for running multiple instances in the same process / container. It has very basic support for limiting memory across multiple instances in the same process, but no way to limit total WAL or cache size across multiple instances. It will be unrealistic for us to develop a framework to manage multiple rocksdb instances.
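
To make point 1 concrete, a minimal sketch of the kind of high-level knobs I mean, translated into rocksdb options in exactly one place (all names and fields here are hypothetical):

use rocksdb::Options;

// Sketch: callers express intent at a high level; the typed store translates
// it into rocksdb::Options in a single place.
pub struct StoreConfig {
    /// Upper bound on total memtable memory for the whole DB, in bytes.
    pub total_write_buffer_bytes: usize,
    /// Column families that are mostly read by exact key.
    pub point_lookup_cfs: Vec<&'static str>,
}

impl StoreConfig {
    pub fn db_options(&self) -> Options {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);
        opts.set_db_write_buffer_size(self.total_write_buffer_bytes);
        opts
    }

    pub fn cf_options(&self, cf_name: &str) -> Options {
        let mut opts = Options::default();
        if self.point_lookup_cfs.iter().any(|cf| *cf == cf_name) {
            // Tunes bloom filters / block cache for get-by-key workloads.
            opts.optimize_for_point_lookup(64);
        }
        opts
    }
}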

@oxade

oxade commented Aug 10, 2022

  1. Maybe we should not expose rocksdb options to typed store callers. Instead, we can stay at the high level, e.g. total write buffer size, "point read" for column family etc.
  • Sui will need to configure rocksdb instances to limit overall memory and disk usage, or even build or switch to a distributed db. Doing any of the above tasks with rocksdb options populated all over the place will be hard.

I understand your points. This simplification is something I wanted to move toward, but as @gdanezis pointed out, it compromises controllability, which is a higher priority right now. We can simplify after. But I'll let @gdanezis give his feedback.

  2. We should use a single rocksdb instance per process. Rocksdb does not seem built for running multiple instances in the same process / container. It has very basic support for limiting memory across multiple instances in the same process, but no way to limit total WAL or cache size across multiple instances. It will be unrealistic for us to develop a framework to manage multiple rocksdb instances.

Interesting. I think the struct abstraction of the DBs made it easy to fragment and forget that we're dealing with DBs.
For validators, we can start by condensing the CheckpointStoreTables, EpochStoreTables, and AuthorityStoreTables.
