Indexing Domains and DB Levels for Faster & Flexible Lookup #188

joshuakarp · 2021-06-24T00:59:51Z

Specification

The sigchain currently uses a linear search to locate a particular claim (for example, the claim relating to a specific identity's cryptolink). This O(n) lookup should be improved.

For example, we could incorporate a map of cryptolink ID (or the equivalent) to the sequence number in the chain of the most recent change to this cryptolink.
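A minimal sketch of that idea, assuming an in-memory chain. The `LinkClaim` shape, the `IndexedChain` class and its method names are hypothetical and only illustrate the map from cryptolink ID to latest sequence number; they are not the actual sigchain API.

```typescript
// Hypothetical claim shape; the real sigchain claim type will differ.
type LinkClaim = {
  seq: number; // sequence number: 1-based position in the chain
  cryptolinkId: string; // identifier of the cryptolink this claim concerns
  payload: string;
};

class IndexedChain {
  private claims: LinkClaim[] = [];
  // cryptolink ID -> sequence number of the most recent claim for that link
  private latestByLink = new Map<string, number>();

  append(cryptolinkId: string, payload: string): LinkClaim {
    const claim: LinkClaim = {
      seq: this.claims.length + 1,
      cryptolinkId,
      payload,
    };
    this.claims.push(claim);
    // Keep the index in step with every append
    this.latestByLink.set(cryptolinkId, claim.seq);
    return claim;
  }

  // Map lookup plus array access, instead of scanning the whole chain
  latestFor(cryptolinkId: string): LinkClaim | undefined {
    const seq = this.latestByLink.get(cryptolinkId);
    return seq === undefined ? undefined : this.claims[seq - 1];
  }
}
```

Because the chain is append-only, the index only ever needs to be updated on `append`; with a persistent store the map would have to be written alongside the claim.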

Additional context

See:

joshuakarp · 2021-07-22T01:08:10Z

This is related to #189. It would make sense that these be completed together (as it will be essential to have a more efficient claim look-up time when we have different claim types in the chain).

joshuakarp · 2021-07-22T01:08:45Z

A relevant comment from the gestalt discovery MR https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/195#note_623220960. This is referring to how we are currently storing a node's entire chain of claims in the GestaltGraph database (when discovering a user's gestalt):

The new problem is: are we meant to load the entire sigchain every time we call into any Node? If the sigchain were small static data, this would not be a problem, but if the sigchain is going to grow, it can become one.

In the near future we would have to optimize this and make the sigchain closer to an indexed stream that we can explore piecemeal, rather than having to go through the entire sigchain. Ideally we would only crawl from the most recent claims and ignore older, revoked or replaced claims. We would also need to filter the claims to focus on ownership claims only, since the sigchain database may contain other audit logging data.
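The filtering described in this quote can be sketched as a newest-first walk. This is only an illustration under assumed types: `ChainEntry`, its `kind`, `subject` and `revoked` fields, and `currentOwnershipClaims` are hypothetical names, not part of the sigchain.

```typescript
// Hypothetical entry shape for illustration only.
type ChainEntry = {
  seq: number;
  kind: "ownership" | "audit"; // assumed claim categories
  subject: string; // what the claim is about, e.g. a linked identity
  revoked: boolean; // true if this entry revokes the subject's claim
};

// Walk from newest to oldest. The first ownership entry seen for a subject
// is its current state; anything older for that subject has been replaced.
function currentOwnershipClaims(chain: ChainEntry[]): ChainEntry[] {
  const seen = new Set<string>();
  const result: ChainEntry[] = [];
  for (let i = chain.length - 1; i >= 0; i--) {
    const entry = chain[i];
    if (entry.kind !== "ownership") continue; // skip audit logging data
    if (seen.has(entry.subject)) continue; // older claim, already replaced
    seen.add(entry.subject);
    if (!entry.revoked) result.push(entry);
  }
  return result;
}
```

This sketch still visits every entry; it shows the selection logic only. Avoiding the full walk is what an index over the chain would add.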

joshuakarp · 2021-08-20T05:09:47Z

After resolving some issues regarding "self-discovery" in discoverGestalt https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/202#note_655726653 (where a node discovers nodes in its own gestalt, with claims that point back to itself), some extra points regarding the usage of the sigchain in Discovery were brought up.

An aspect of discoverGestalt will eventually be its ability to "update" the state of the stored gestalt information. In order to make this as efficient as possible, we could do the following:

the sigchain is an append-only set of chronologically-ordered claims
it would make sense to be able to request only the claims after the last claim that we looked at on a particular node, from whenever our previous discoverGestalt call was (no longer having to iterate through the whole sigchain)
would require being able to iterate over the sigchain from some given claim ID (sequence number) onwards
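The points above can be sketched as a per-node cursor over an append-only chain. Everything here is an assumption for illustration: `SeqClaim`, `AppendOnlyChain`, `claimsAfter`, `fetchNewClaims` and the in-memory `cursors` map are hypothetical names, and a real implementation would read from the DB rather than an array.

```typescript
type SeqClaim = { seq: number; data: string };

class AppendOnlyChain {
  private claims: SeqClaim[] = [];

  append(data: string): number {
    const seq = this.claims.length + 1; // sequence numbers start at 1
    this.claims.push({ seq, data });
    return seq;
  }

  // Returns the claims with seq strictly greater than lastSeen, in order.
  // claims[i] holds seq i + 1, so slicing at lastSeen starts at lastSeen + 1.
  claimsAfter(lastSeen: number): SeqClaim[] {
    return this.claims.slice(Math.max(lastSeen, 0));
  }
}

// Discovery side: remember, per node, the last sequence number processed.
const cursors = new Map<string, number>();

function fetchNewClaims(nodeId: string, chain: AppendOnlyChain): SeqClaim[] {
  const lastSeen = cursors.get(nodeId) ?? 0;
  const fresh = chain.claimsAfter(lastSeen);
  if (fresh.length > 0) {
    cursors.set(nodeId, fresh[fresh.length - 1].seq);
  }
  return fresh;
}
```

A second call to `fetchNewClaims` for the same node only returns what was appended since the first call, which is the "update" behaviour described for discoverGestalt.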

Making the discoverGestalt process more efficient is inherently linked to the work done here in improving claim look-up time (see previous comment).

However, given that we're eventually going to be using the sigchain for auditing and provenance use cases, it would make sense for these efficiencies to be implemented directly into the sigchain domain as much as possible.

CMCDragonkai · 2021-09-03T05:26:10Z

There are several situations where you are using leveldb to store a stream of key-value entries. While the main key is the primary way of looking up an entry, you may also want to look the entry up via other fields. This is basically DB indexing.

I've gotten around this right now by creating additional sublevels, and just duplicating those keys. See things like the ACL database where I did it with PermId.

However, a more general solution is a better idea: something that can be used by all domains that have indexing needs. I can see that sigchain, notifications, gestalts and acl all need something like this.

As discussed here: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/209#note_668300560 with respect to notification invitation search, there are some existing libraries that we can consider and "port" over:

https://github.com/hypermodules/level-auto-index
https://github.com/hypermodules/level-idx (which actually uses level-auto-index)

The basic idea is sound, however it seems to be lacking in the garbage collection department.

For example: bcomnes/level-idx#19 with no answer, and looking at the source code shows no maintenance of indexes when entries are updated or entries deleted.

It should be easy to reimplement their indexing libraries with proper maintenance. Combined with the transaction system coming from the EFS work, it can then all be embedded into the DB class.
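As a rough sketch of what "proper maintenance" means, the following in-memory store removes stale index entries on update and delete. The names `Invite`, `IndexedStore`, `bySender` and `findBySender` are hypothetical; this is neither the level-auto-index API nor the `@matrixai/db` API.

```typescript
type Invite = { id: string; sender: string; body: string };

class IndexedStore {
  private primary = new Map<string, Invite>();
  // Secondary index: sender -> set of primary keys
  private bySender = new Map<string, Set<string>>();

  put(record: Invite): void {
    const previous = this.primary.get(record.id);
    // On update, drop the index entry derived from the old value
    if (previous !== undefined) this.unindex(previous);
    this.primary.set(record.id, record);
    let ids = this.bySender.get(record.sender);
    if (ids === undefined) {
      ids = new Set<string>();
      this.bySender.set(record.sender, ids);
    }
    ids.add(record.id);
  }

  del(id: string): void {
    const previous = this.primary.get(id);
    if (previous === undefined) return;
    this.unindex(previous); // on delete, the index entry goes too
    this.primary.delete(id);
  }

  findBySender(sender: string): Invite[] {
    const ids = this.bySender.get(sender);
    const out: Invite[] = [];
    if (ids === undefined) return out;
    ids.forEach((id) => {
      const record = this.primary.get(id);
      if (record !== undefined) out.push(record);
    });
    return out;
  }

  private unindex(record: Invite): void {
    const ids = this.bySender.get(record.sender);
    if (ids === undefined) return;
    ids.delete(record.id);
    if (ids.size === 0) this.bySender.delete(record.sender);
  }
}
```

In a persistent DB the primary write and the index write would need to happen atomically, which is where a transaction system comes in; the in-memory sketch sidesteps that.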

@emmacasolin in that case, I think you should not bother with indexing just yet. It's a more general problem.

CMCDragonkai · 2021-09-26T04:49:40Z

The implementation of secondary indexing at @matrixai/db level can automate this. The relevant issue is here: MatrixAI/js-db#1

This issue can be kept as a separate issue representing the integration work of secondary indexing into:

acl - perm id, vault id, node id
gestalts - discovery, adjacency list indexing
nodes - possibly needed in the future for all sorts of optimal routing needs (Kademlia indexing?)
notifications - to be able to filter notifications
sigchain - to be able to filter sigchain
vaults - for vault tagging and vault names

CMCDragonkai · 2021-09-26T04:55:12Z

This will lead to #197.

CMCDragonkai · 2021-09-28T06:27:04Z

The discussion in MatrixAI/js-db#1 means that we will only have simple indexing functionality for now.

More complex indexes will require restructuring the underlying DB, potentially switching to sqlite3 to avoid rebuilding such low-level structures ourselves, but we might be too deep into leveldb for now. Not sure; this needs more requirements analysis.

I imagine later in the future we're going to need full text indexing too to help with performance.

CMCDragonkai · 2022-02-16T03:20:31Z

Made this issue more general to the concept of indexing across PK.

CMCDragonkai · 2022-11-21T02:09:09Z

All indexing is manually done per-domain.

joshuakarp changed the title ~~Faster claim look-up in sigchain~~ Improve claim look-up time complexity in sigchain Jun 24, 2021

joshuakarp self-assigned this Jul 5, 2021

joshuakarp added the development (Standard development) and enhancement (New feature or request) labels Jul 22, 2021

CMCDragonkai changed the title ~~Improve claim look-up time complexity in sigchain~~ Indexing Sigchain, Notifications for Faster & Flexible Lookup Sep 3, 2021

CMCDragonkai mentioned this issue Sep 26, 2021

Integrate Automatic Indexing into DB MatrixAI/js-db#1

Closed

2 tasks

emmacasolin mentioned this issue Jan 10, 2022

Refactor Notifications Reading for Better CLI Usage #315

Open

CMCDragonkai mentioned this issue Jan 24, 2022

Implement Social Discovery to add found DIs to the Gestalt Graph #316

Closed

This was referenced Feb 15, 2022

Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329

Closed

Redesign usage of CDSS with VaultInternal #305

Closed

CMCDragonkai assigned emmacasolin and unassigned joshuakarp Feb 16, 2022

CMCDragonkai changed the title ~~Indexing Sigchain, Notifications for Faster & Flexible Lookup~~ Indexing Domains and DB Levels for Faster & Flexible Lookup Feb 16, 2022

CMCDragonkai added the epic (Big issue with multiple subissues) label Feb 16, 2022

CMCDragonkai mentioned this issue Mar 10, 2022

Testnet Deployment #326

Closed

29 tasks

CMCDragonkai mentioned this issue Mar 28, 2022

WIP: Upgrading @matrixai/async-init, @matrixai/async-locks, @matrixai/db, @matrixai/errors, @matrixai/workers, Node.js and integrating @matrixai/resources #366

Closed

33 tasks

tegefaulkes mentioned this issue May 20, 2022

Upgrading lib dependencies and node.js version #374

Merged

40 tasks

CMCDragonkai added the r&d:polykey:core activity 1 (Secret Vault Sharing and Secret History Management) and r&d:polykey:core activity 3 (Peer to Peer Federated Hierarchy) labels Jul 24, 2022

CMCDragonkai assigned CMCDragonkai and unassigned emmacasolin Oct 17, 2022

CMCDragonkai closed this as not planned Nov 21, 2022
