This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[NodeStorage] Expose the ability to set custom rocksdb::Options #671

Closed
gdanezis opened this issue Aug 3, 2022 · 7 comments · Fixed by MystenLabs/mysten-infra#123

Comments

@gdanezis
Contributor

gdanezis commented Aug 3, 2022

Currently we open the NodeStorage database with None as options, which does not allow callers (like Sui) to set limits on the size of memtables. We should allow opening with custom DB Options:

narwhal/node/src/lib.rs

Lines 57 to 70 in 442d853

pub fn reopen<Path: AsRef<std::path::Path>>(store_path: Path) -> Self {
    let rocksdb = open_cf(
        store_path,
        None,
        &[
            Self::HEADERS_CF,
            Self::CERTIFICATES_CF,
            Self::PAYLOAD_CF,
            Self::BATCHES_CF,
            Self::LAST_COMMITTED_CF,
            Self::SEQUENCE_CF,
        ],
    )
    .expect("Cannot open database");

@mwtian
Member

mwtian commented Aug 4, 2022

From reading the RocksDB docs, I believe the memtable size for each column family is by default bounded to 64MB (write_buffer_size). But there can be a large number of column families, across 1 db in Narwhal and >5 dbs in Sui, so the upper bound on total memtable size can be unexpectedly large. This matches the observation that the stress tool's memory usage stabilizes at ~3.5GB.

[Image: jemalloc heap profile of the stress tool at ~3.5GB memory usage]

This jemalloc heap profile (open it in a new tab to view it in full) was taken from stress when it was at ~3.5GB. There is a bit of weirdness, but I believe rocksdb memtables are using >2GB of memory here.

Having observability into the memtable sizes of individual rocksdb instances and column families could be useful in deciding an appropriate memtable size limit, as @gdanezis mentioned elsewhere.
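
As a rough sketch of what that observability could look like with rust-rocksdb (assuming direct access to the underlying DB handle; "rocksdb.cur-size-all-mem-tables" is a standard RocksDB property):

use rocksdb::DB;

// Sketch: dump per-column-family memtable sizes for one open DB.
// `cf_names` would be the same constants NodeStorage uses to open the DB.
fn log_memtable_sizes(db: &DB, cf_names: &[&str]) {
    for name in cf_names {
        if let Some(cf) = db.cf_handle(name) {
            // Active + unflushed immutable memtables for this CF, in bytes.
            let bytes = db
                .property_int_value_cf(cf, "rocksdb.cur-size-all-mem-tables")
                .ok()
                .flatten()
                .unwrap_or(0);
            println!("{name}: {bytes} bytes in memtables");
        }
    }
}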

In the short term we can try limiting memtable size per db to e.g. 512MB, and make sure it reduces total memtable size across dbs in the common case.
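
A minimal sketch of that short-term cap, assuming the options can be passed through when the store is opened (512MB is just the example figure above):

use rocksdb::Options;

// Sketch: cap total memtable memory for one DB instance at ~512MB.
// Per-CF write_buffer_size stays at its 64MB default unless also overridden.
fn bounded_db_options() -> Options {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    // Upper bound on the sum of all memtables across this DB's column families.
    opts.set_db_write_buffer_size(512 * 1024 * 1024);
    opts
}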

Longer term, I would like to understand better the rationale for using many rocksdb instances, instead of only relying on column families for isolation. Then maybe we can come up with a sensible way to limit total memory usage.

@mwtian
Member

mwtian commented Aug 5, 2022

I was investigating out of curiosity, because the memory usage here is much less than the peaks we see on devnet (e.g. 40GB). I learned that in Sui, each struct deriving DBMapUtils puts each field in a separate column family. This can explain the peak of the stress test: # of column families * 64MB > 2GB.

With MystenLabs/mysten-infra#109, stress memory usage seems to grow slower.

At the current scale, how column families are configured likely does not matter much, except when we want to set cross-DB or cross-CF limits. In the future, for read and write perf, we can try to group data that is read or written together into the same column families.

@huitseeker
Contributor

@oxade, to close out the above issue, would you be up for porting MystenLabs/mysten-infra#110 to Narwhal once it's done?

@oxade

oxade commented Aug 8, 2022

@oxade, to close out the above issue, would you be up for porting MystenLabs/mysten-infra#110 to Narwhal once it's done?

@huitseeker no problem.

@oxade

oxade commented Aug 9, 2022

See this PR for one way we could do it.
It basically gives the caller a way to specify the options for each table before opening.
We can port this to Narwhal.

The previous approach allows configuring the options, but only in one place.

Let me know your thoughts, folks.
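
For illustration, a rough sketch of the per-table shape (the names and CF strings here are made up for this comment, not the actual API in the PR):

use std::collections::BTreeMap;
use rocksdb::Options;

// Sketch: callers describe options per table (column family) up front,
// and the store applies them when opening. All names are hypothetical.
type TableOptionsMap = BTreeMap<&'static str, Options>;

fn example_table_options() -> TableOptionsMap {
    let mut per_table = TableOptionsMap::new();

    // A table read mostly by exact key can be tuned for point lookups.
    let mut certificates = Options::default();
    certificates.optimize_for_point_lookup(64); // ~64MB block cache
    per_table.insert("certificates", certificates);

    // A table with large values can get a bigger memtable.
    let mut batches = Options::default();
    batches.set_write_buffer_size(128 * 1024 * 1024);
    per_table.insert("batches", batches);

    per_table
}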

@mwtian
Member

mwtian commented Aug 10, 2022

Great @oxade! I'm taking a look.

Btw, two things I want to make a case for, maybe out of scope for this issue and the related PRs:

  1. Maybe we should not expose rocksdb options to typed store callers. Instead, we can stay at the high level, e.g. total write buffer size, "point read" for column family etc. (rough sketch after this list).
  • Sui will need to configure rocksdb instances to limit overall memory and disk usage, or even build or switch to a distributed db. Doing any of the above tasks with rocksdb options populated all over the place will be hard.
  2. We should use a single rocksdb instance per process. Rocksdb does not seem built for running multiple instances in the same process / container. It has very basic support for limiting memory across multiple instances in the same process, but no way to limit total WAL or cache size across multiple instances. It will be unrealistic for us to develop a framework to manage multiple rocksdb instances.
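
To make point 1 concrete, a minimal sketch of the kind of high-level knobs I mean, translated into rocksdb options in exactly one place (all names and fields here are hypothetical):

use rocksdb::Options;

// Sketch: callers express intent at a high level; the typed store translates
// it into rocksdb::Options in a single place.
pub struct StoreConfig {
    /// Upper bound on total memtable memory for the whole DB, in bytes.
    pub total_write_buffer_bytes: usize,
    /// Column families that are mostly read by exact key.
    pub point_lookup_cfs: Vec<&'static str>,
}

impl StoreConfig {
    pub fn db_options(&self) -> Options {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);
        opts.set_db_write_buffer_size(self.total_write_buffer_bytes);
        opts
    }

    pub fn cf_options(&self, cf_name: &str) -> Options {
        let mut opts = Options::default();
        if self.point_lookup_cfs.iter().any(|cf| *cf == cf_name) {
            // Tunes bloom filters / block cache for get-by-key workloads.
            opts.optimize_for_point_lookup(64);
        }
        opts
    }
}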

@oxade

oxade commented Aug 10, 2022

  1. Maybe we should not expose rocksdb options to typed store callers. Instead, we can stay at the high level, e.g. total write buffer size, "point read" for column family etc.
  • Sui will need to configure rocksdb instances to limit overall memory and disk usage, or even build or switch to a distributed db. Doing any of the above tasks with rocksdb options populated all over the place will be hard.

I understand your points. This simplification is something I wanted to move toward, but as @gdanezis pointed out, it compromises controllability, which is a higher priority right now. We can simplify after. But I'll let @gdanezis give his feedback.

  2. We should use a single rocksdb instance per process. Rocksdb does not seem built for running multiple instances in the same process / container. It has very basic support for limiting memory across multiple instances in the same process, but no way to limit total WAL or cache size across multiple instances. It will be unrealistic for us to develop a framework to manage multiple rocksdb instances.

Interesting. I think the struct abstraction of the DBs made it easy to fragment and forget that we're dealing with DBs.
For validators, we can start by condensing the CheckpointStoreTables, EpochStoreTables, and AuthorityStoreTables.
