Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshots #486

Closed
frankboehmer opened this issue Jun 19, 2018 · 23 comments
Closed

Snapshots #486

frankboehmer opened this issue Jun 19, 2018 · 23 comments
Labels
discussion Discussion enhancement New feature or request

Comments

@frankboehmer
Copy link

frankboehmer commented Jun 19, 2018

Leveldb exposes a feature, called "Snapshots" that comes in handy to get consistent reads on a database. LevelDOWN makes use of snapshots to provide consistent ReadStreams. However, sometimes one needs to combine this with additional read operations - e.g. multiple ReadStreams or multiple get calls. Those won't be consistent because they do not share a common snapshot.
Here's a fork that exposes snapshots to node.js. db.snapshot() creates a snapshot which provides the normal read operations known from the LevelDOWN interface: https://github.com/frankboehmer/leveldown
It's important to close the snapshot when no longer needed because it may suck up resources.

Since the implementation of a snpashot interface would also need to be discussed with LevelUP, I did not create pull request. Even worse: the fork reintegrates parts of LevelUP.

I think, snapshots would be a nice addition to LevelDOWN / LevelUP. And I feel it's quite easy to achieve. So please feel free to look at the fork or get in touch to join forces to make this happen.

Cheers, Frank

@vweevers
Copy link
Member

Thanks for reaching out! Previous discussions:

TLDR: there hasn't been a compelling use case for exposing snapshots. Can you share yours? :)

Regarding "multiple get calls": we're working to expose iterators and iterator.seek, which allows multiple reads on one snapshot. Note that leveldown already has iterator.seek (and it works great), it just isn't exposed in levelup or implemented in other stores.

Even worse: the fork reintegrates parts of LevelUP.

In what way and why was it necessary?

@vweevers
Copy link
Member

See also Level/levelup#491 (comment)

@frankboehmer
Copy link
Author

Exposing iterators in LevelUP would do the trick in some use cases but then you can't have multiple ReadStreams (they cannot share the same iterator concurrently).
I'm not sure why there's no strong demand, nobody sees a compelling use case... think e. g. about scanning an index and getting records.

db.createValueStream({start:'index c', end:'index d'})
.on('data', key => {
    db.get('record ' + key, (err, value) => {
        ...
    })
)

@frankboehmer
Copy link
Author

In what way and why was it necessary?
Just to make the project usable for our use case without putting too much work into changing all the dependent projects like levelup, abstract-leveldown etc. - definitly not the right way to do it in the long term ;-) I'd rather help to make this a core feature of the leveldb ecosystem

@peakji
Copy link
Member

peakji commented Jun 19, 2018

Multiple get calls on one single snapshot can be achieved by creating multiple iterators at the exact same time.

@frankboehmer have you tried this approach? I think the performance should be identical.

@frankboehmer
Copy link
Author

@peakji Actually, I wasn't aware of this behavior. This would do the trick :). It's a weird interface, though. Having a let snap = db.snapshot() and then snap.get(), snap.createReadStream() etc., snap.close() on LevelUP would seem much more intuitive to me. And the implementation is probably quite simple, at least for the LeveDOWN backend.

@frankboehmer
Copy link
Author

Also you'd need to create all required iterators up-front which might be ugly in some use cases.

@vweevers
Copy link
Member

And the implementation is probably quite simple, at least for the LeveDOWN backend.

at least for the LeveDOWN backend is exactly right 😉

@peakji
Copy link
Member

peakji commented Jun 19, 2018

@frankboehmer This is just a workaround 😉. It works because each iterator has their own internal snapshot, and since Node.js is single-threaded in the JS land, all the snapshots we created synchronously should capture the exact same status.

But after reading the index-retrieve use case above, I'm starting to agree that it's necessary to expose raw snapshot, otherwise it's pretty hard to achieve the consistency requirement.

However, I don't think this feature should be added in levelup and making it REQUIRED in the level-ecosystem. snapshot is a leveldb concept and I'm not sure with the implementation of other backends. AFAIK, we didn't mention snapshot/consistency in levelup.

+1 for implementing in leveldown first.

@vweevers
Copy link
Member

otherwise it's pretty hard to achieve the consistency requirement

I'd like to hear why the consistency is required. Often it's perfectly fine to do:

db.createValueStream().on('data', key => {
    db.get(key, (err, value) => {
        if (err.notFound) {
          // Just forget about this key and move on
        }
    })
)

@frankboehmer
Copy link
Author

frankboehmer commented Jun 19, 2018

@vweevers think about the index as a reverse index of a data field that's also stored in the record. If the record gets updated in between, you get inconsistencies - the index doesn't reflect the current state of the record. You might find all sorts of workarounds to handle such a situation - however, it stays an inconsistent view.
The most compelling argument for exposing snapshots might be that many users won't actually spend too much thinking on possible inconsistencies when seeing code like the one I posted above. They might be using batch() to get atomic writes - not beeing aware of the fact that they still can end up with inconsistent reads.

@peakji
Copy link
Member

peakji commented Jun 19, 2018

@vweevers Suppose we are creating a simple inverted index, and we put the index keys and documents in one single database by using different prefixes:

key value
word:cat [1,2,3]
word:dog [2,3]
word:spider [1]
word:spiderman [3]
doc:1 lorem ipsum dolor cat spider
doc:2 ...
doc:3 ...

We can issue a prefix query for docs with words starting with spider* by scanning keys after word:spider, but the documents might change while we are iterating. For example, if we update doc:3 to a sentence without spiderman, the word: iterator still reads the stale snapshot, and will eventually result in irrelevant search hits.

@vweevers
Copy link
Member

The most compelling argument for exposing snapshots might be that many users won't actually spend too much thinking on possible inconsistencies when seeing code like the one I posted above.

I see your point and want to add the nuance that many people dó think about this and decide they don't care about such inconsistencies. Taking @peakji's example, which is excellent, the end result might be that an application sometimes shows a search result for "spiderman" that isn't relevant anymore. That can be acceptable.

That said, I understand the use case. I'm not sure yet about:

However, I don't think this feature should be added in levelup and making it REQUIRED in the level-ecosystem.

It definitely shouldn't be a required backend feature, but that doesn't mean levelup can't wrap it. It could handle open state, read streams (like it wraps iterators on a regular db) and promises.

@vweevers
Copy link
Member

@peakji just curious: why did you make this edit to your comment:

I think the performance should be identical.

@peakji
Copy link
Member

peakji commented Jun 19, 2018

why did you make this edit to your comment

Just theoretically, I didn't run the benchmark yet:

In the inverted index example, we can have another iterator and use iterator.seek() + iterator.next() to replace db.get(). Theoretically this should not make much difference, but in leveldown we are crossing the JS/C++ boundary twice.

@vweevers
Copy link
Member

@peakji: AFAIK, we didn't mention snapshot/consistency in levelup.

=> https://github.com/Level/levelup/issues/590

@vweevers vweevers added enhancement New feature or request discussion Discussion labels Dec 31, 2018
@CMCDragonkai
Copy link

Exposing snapshots can make it easier to build a transaction layer. There was some work on this back in 2015 that was never merged. PR #152 https://ongardie.net/blog/node-leveldb-transactions/

@CMCDragonkai
Copy link

CMCDragonkai commented Mar 14, 2022

Multiple get calls on one single snapshot can be achieved by creating multiple iterators at the exact same time.

@frankboehmer have you tried this approach? I think the performance should be identical.

Is this still true if one of the iterators is done through the stream creation?

const i = db.iterator();
for await (const o of db.createReadStream()) {
  // use i here
  // can we expect consistency between o and i?
}

@vweevers
Copy link
Member

@CMCDragonkai Yes, because a stream is backed by an iterator too, which is created when you call createReadStream().

@CMCDragonkai
Copy link

Ok that's useful, I was wondering that iterators have the db property.

Does this mean that iterator.db is operating from the same snapshot, or is just a reference to the DB that created the iterator?

@vweevers
Copy link
Member

Just a reference.

@CMCDragonkai
Copy link

I've been attempting to implement snapshot isolation in leveldb, and although iterators provide a snapshot, one problem is the ability to derive further iterators from the same snapshot. Right now this can only be done by creating all the iterators up-front and at the same time. This limits the expressiveness of iterators.

What would it take to bring snapshots directly to the leveldown interface (if not the abstract-level` interface?).

vweevers added a commit to Level/abstract-level that referenced this issue Sep 25, 2022
At the moment, this is a documentation-only PR, acting as an RFC. In
particular I'd like review of my choice to use a token-based
approach (as first suggested by Rod Vagg in
Level/community#45) as opposed to a
dedicated snapshot API surface (as suggested by Julian Gruber in
Level/community#47). Main reasons for not
choosing the latter:

- It would be a third API surface. We have the main database API,
  the sublevel API which is equal to that except for implementation-
  specific additional methods, and would now add another API which
  only has read methods (get, iterator, etc, but not put and batch).
  If the API was reusable, meaning you could pass around a snapshot
  as if it was a regular database (like you can with sublevels) then
  I'd be cool with it. But for that to happen, we'd have to
  implement transactions as well, which I consider to be out of
  scope although transactions are in fact a use case of snapshots.
- It would have a higher complexity once you factor in sublevels.
  I.e. to make `db.sublevel().snapshot().get(key)` read from the
  snapshot but also prefix the given key. By instead doing
  `db.sublevel().get(key, { snapshot })`, the sublevel can just
  forward that snapshot option to its parent database.
- Furthermore, with the token approach, you can pass a snapshot to
  multiple sublevels, which cleanly solves the main use case of
  retrieving data from an index (I included an example in the docs).

I renamed the existing snapshot mechanism to "implicit snapshots"
and attempted to clarify the behavior of those as well.

If accepted, some related issues can be closed, because this PR:

- Includes (a more complete write-up than)
  Level/classic-level#40.
- Mentions Level/leveldown#796 (which
  should be moved rather than closed).
- Answers Level/classic-level#28.
- Solves the main use case as described in
  Level/leveldown#486.
- Effectively completes Level/awesome#19.
@vweevers
Copy link
Member

Tracked in Level/community#118. It won't be added to leveldown, only its successor classic-level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants