Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing Domains and DB Levels for Faster & Flexible Lookup #188

Closed
joshuakarp opened this issue Jun 24, 2021 · 9 comments
Closed

Indexing Domains and DB Levels for Faster & Flexible Lookup #188

joshuakarp opened this issue Jun 24, 2021 · 9 comments
Assignees
Labels
development Standard development enhancement New feature or request epic Big issue with multiple subissues r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@joshuakarp
Copy link
Contributor

joshuakarp commented Jun 24, 2021

Specification

The sigchain currently uses a linear search to locate a particular claim (for example, relating to a specific identity's cryptolink). This should be extended to improve this O(n) lookup.

For example, we could incorporate a map of cryptolink ID (or the equivalent) to the sequence number in the chain of the most recent change to this cryptolink.

Additional context

See:

@joshuakarp joshuakarp changed the title Faster claim look-up in sigchain Improve claim look-up time complexity in sigchain Jun 24, 2021
@joshuakarp joshuakarp self-assigned this Jul 5, 2021
@joshuakarp joshuakarp added development Standard development enhancement New feature or request labels Jul 22, 2021
@joshuakarp
Copy link
Contributor Author

This is related to #189. It would make sense that these be completed together (as it will be essential to have a more efficient claim look-up time when we have different claim types in the chain).

@joshuakarp
Copy link
Contributor Author

A relevant comment from the gestalt discovery MR https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/195#note_623220960. This is referring to how we are currently storing a node's entire chain of claims in the GestaltGraph database (when discovering a user's gestalt):

The new problem is that are we meant to load the entire sigchain every time we call into any Node? If the sigchain were small static data, this would not be a problem, but if the sigchain is going to grow, this can be a problem.

In the near future we would have to optimize this and make the sigchain closer to a indexed stream where we can explore in piece meal rather than having to go through the entire sigchain. Ideally we would only crawl from most recent claims and ignore older/revoked or replaced claims. And we would need to filter the claims to focus on ownership claims, and no other claims. Since the sigchain database may contain other audit logging data.

@joshuakarp
Copy link
Contributor Author

After resolving some issues regarding "self-discovery" in discoverGestalt https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/202#note_655726653 (where a node discovers nodes in its own gestalt, with claims that point back to itself), some extra points regarding the usage of the sigchain in Discovery were brought up.

An aspect of discoverGestalt will eventually be its ability to "update" the state of the stored gestalt information. In order to make this as efficient as possible, we could do the following:

  • the sigchain is an append-only set of chronologically-ordered claims
  • it would make sense to be able to request only the claims after the last claim that we looked at on a particular node, from whenever our previous discoverGestalt call was (no longer having to iterate through the whole sigchain)
  • would require being able to iterate over the sigchain from some given claim ID (sequence number) onwards

Making the discoverGestalt process more efficient is inherently linked to the work done here in improving claim look-up time (see previous comment).

However, given that we're eventually going to be using the sigchain for auditing and provenance use cases, it would make sense for these efficiencies to be implemented directly into the sigchain domain as much as possible.

@CMCDragonkai CMCDragonkai changed the title Improve claim look-up time complexity in sigchain Indexing Sigchain, Notifications for Faster & Flexible Lookup Sep 3, 2021
@CMCDragonkai
Copy link
Member

There are several situations where you are using leveldb to store a stream of things key to value. While the main key is the primary way of looking up an entry, you may also want to look the entry up via other fields. This is basically DB indexing.

I've gotten around this right now by creating additional sublevels, and just duplicating those keys. See things like the ACL database where I did it with PermId.

However a more general solution is better idea. Something that can be used by all domains that have indexing needs. I can see that sigchain, notifications, gestalts and acl all need something like this.

As discussed here: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/209#note_668300560 with respect to notification invitation search, there are some existing libraries that we can consider and "port" over:

The basic idea is sound, however it seems to be lacking in the garbage collection department.

For example: bcomnes/level-idx#19 with no answer, and looking at the source code shows no maintenance of indexes when entries are updated or entries deleted.

It should be easy to reimplement their indexing libraries with proper maintenance, and combined with the transaction system coming from EFS work, then it can all be embedded into the DB class.

@emmacasolin in that case, I think you should not bother with indexing just yet. It's more general problem.

@CMCDragonkai
Copy link
Member

The implementation of secondary indexing at @matrixai/db level can automate this. The relevant issue is here: MatrixAI/js-db#1

This issue can be kept as separate issue representing the integration work of secondary indexing into:

  • acl - perm id, vault id, node id
  • gestalts - discovery, adjacency list indexing
  • nodes - possibly needed in the future for all sorts of optimal routing needs (kadmelia indexing?)
  • notifications - to be able to filter notifications
  • sigchain - to be able to filter sigchain
  • vaults - for vault tagging and vault names

@CMCDragonkai
Copy link
Member

This will lead to #197.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Sep 28, 2021

The discussion in MatrixAI/js-db#1 means that we will only have simple indexing functionality for now.

More complex indexes will require a restructure of the underlying DB, to potentially just use sqlite3 to avoid rebuilding such low level structures ourselves, but we might be too deep into leveldb for now. Not sure, requires more requirements analysis.

I imagine later in the future we're going to need full text indexing too to help with performance.

@CMCDragonkai CMCDragonkai changed the title Indexing Sigchain, Notifications for Faster & Flexible Lookup Indexing Domains and DB Levels for Faster & Flexible Lookup Feb 16, 2022
@CMCDragonkai CMCDragonkai added the epic Big issue with multiple subissues label Feb 16, 2022
@CMCDragonkai
Copy link
Member

Made this issue more general to the concept of indexing across PK.

@CMCDragonkai
Copy link
Member

All indexing is manually done per-domain.

@CMCDragonkai CMCDragonkai closed this as not planned Won't fix, can't repro, duplicate, stale Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development enhancement New feature or request epic Big issue with multiple subissues r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy
Development

No branches or pull requests

3 participants