WIP: Integrating Automatic Indexing into DB #2
Conversation
One use case from the sigchain is finding the latest claim for a given node id. You want an O(1) lookup by node id that returns a list of claim ids or claim values, and from that list you want the latest claim. If you use auto-indexing to automatically resolve the claim ids, you end up fetching a lot of claim values; if we only want the latest, that isn't efficient. The claim id itself is already ordered. Additionally, the index itself can preserve that order on write, which is probably better and easier, so I need to prototype these choices. For the vault use case, it's about knowing whether a vault name is unique or has already been used, so it's a uniqueness constraint.
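For example (a rough sketch, assuming a levelup-style sublevel whose keys are the ordered claim ids; the sublevel itself is an assumption), the latest claim could then be read with a single reverse iteration instead of resolving every claim value through the auto-index:

```ts
import type { LevelUp } from 'levelup';

// Read only the newest entry from an order-preserving claims sublevel
async function getLatestClaim(claims: LevelUp): Promise<Buffer | undefined> {
  return new Promise((resolve, reject) => {
    let latest: Buffer | undefined;
    claims
      .createReadStream({ reverse: true, limit: 1 })
      .on('data', ({ value }) => {
        latest = value;
      })
      .on('error', reject)
      .on('end', () => resolve(latest));
  });
}
```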
Realised I forgot to bring in some dependencies. We also need them as explicit dependencies.
Those dependencies may not be needed in PK unless they are explicitly used outside of js-db @joshuakarp.
Some things I recently realised:
LevelDB interface:

```ts
interface LevelDB<K = any, V = any> extends LevelUp<EncodingDown<K, V>> {
  errors: typeof errors;
}
```

LevelUp interface:

```ts
export interface LevelUp<DB = AbstractLevelDOWN, Iterator = AbstractIterator<any, any>> extends EventEmitter {
open(): Promise<void>;
open(callback?: ErrorCallback): void;
close(): Promise<void>;
close(callback?: ErrorCallback): void;
put: InferDBPut<DB>;
get: InferDBGet<DB>;
del: InferDBDel<DB>;
clear: InferDBClear<DB>;
batch(array: AbstractBatch[], options?: any): Promise<void>;
batch(array: AbstractBatch[], options: any, callback: (err?: any) => any): void;
batch(array: AbstractBatch[], callback: (err?: any) => any): void;
batch(): LevelUpChain;
iterator(options?: AbstractIteratorOptions): Iterator;
isOpen(): boolean;
isClosed(): boolean;
createReadStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
createKeyStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
createValueStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
/*
emitted when a new value is 'put'
*/
on(event: 'put', cb: (key: any, value: any) => void): this;
/*
emitted when a value is deleted
*/
on(event: 'del', cb: (key: any) => void): this;
/*
emitted when a batch operation has executed
*/
on(event: 'batch', cb: (ary: any[]) => void): this;
/*
emitted when clear is called
*/
on(event: 'clear', cb: (opts: any) => void): this;
/*
emitted on given event
*/
on(event: 'open' | 'ready' | 'closed' | 'opening' | 'closing', cb: () => void): this;
}
```

AbstractLevelDOWN interface:

```ts
export interface AbstractLevelDOWN<K = any, V = any> extends AbstractOptions {
open(cb: ErrorCallback): void;
open(options: AbstractOpenOptions, cb: ErrorCallback): void;
close(cb: ErrorCallback): void;
get(key: K, cb: ErrorValueCallback<V>): void;
get(key: K, options: AbstractGetOptions, cb: ErrorValueCallback<V>): void;
put(key: K, value: V, cb: ErrorCallback): void;
put(key: K, value: V, options: AbstractOptions, cb: ErrorCallback): void;
del(key: K, cb: ErrorCallback): void;
del(key: K, options: AbstractOptions, cb: ErrorCallback): void;
batch(): AbstractChainedBatch<K, V>;
batch(array: ReadonlyArray<AbstractBatch<K, V>>, cb: ErrorCallback): AbstractChainedBatch<K, V>;
batch(
array: ReadonlyArray<AbstractBatch<K, V>>,
options: AbstractOptions,
cb: ErrorCallback,
): AbstractChainedBatch<K, V>;
iterator(options?: AbstractIteratorOptions<K>): AbstractIterator<K, V>;
}
```

Seems like if we want to hook into the events, we would at the very least need the `LevelUp` interface, since only it extends `EventEmitter`.
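As a rough sketch of what hooking into those events could look like for indexing (the names here are illustrative, not existing code, and note the events only fire after the write):

```ts
import type { LevelUp } from 'levelup';

// `db` holds the primary records, `index` maps an indexed property back to
// the primary key; `indexKeyFor` is a hypothetical derivation function.
function autoIndex(db: LevelUp, index: LevelUp, indexKeyFor: (value: any) => Buffer): void {
  db.on('put', (key: any, value: any) => {
    // runs post-write: fine for indexing, useless for encryption since the
    // stored value can no longer be transformed here
    index.put(indexKeyFor(value), key).catch((e) => {
      console.error('index update failed', e);
    });
  });
  db.on('del', (key: any) => {
    // a deletion only gives us the key, so removing the stale index entry
    // would require a reverse lookup maintained elsewhere
  });
}
```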
Need to test if the hooks are allowed to mutate the operation records; if that's possible, maybe one could do encryption with a fork. These are the types I worked out for the hookdown:

```ts
type Callback<P extends Array<any> = [], R = any, E extends Error = Error> = {
(e: E, ...params: Partial<P>): R;
(e?: null | undefined, ...params: P): R;
};
type HookOp<K = any, V = any> = {
type: 'put';
key: K;
value: V;
opts: AbstractOptions;
} | {
type: 'del';
key: K;
opts: AbstractOptions;
} | {
type: 'batch',
array: Array<HookOp<K, V>>;
opts: AbstractOptions;
};
interface LevelDBHooked<K = any, V = any> extends LevelDB<K, V> {
prehooks: Array<(op: HookOp<K, V>, cb: Callback) => void>;
posthooks: Array<(op: HookOp<K, V>, cb: Callback) => void>;
}
```
The hook system seems similar to providing a generic way of layering on leveldb. But the awesome list https://github.com/Level/awesome#layers indicates that the preferred way of extending leveldb is to write an abstract-leveldown layer.
If that's the case, then perhaps it's better to just go with an abstract-leveldown wrapper instead, so we don't have to re-invent the wheel around the eventemitter in order to hook in indexing and then encryption on writes, and decryption on reads. It turns out the hook system is incapable of mutating the operations, therefore the conclusion is that the hookdown is not useful for doing any kind of encryption/decryption:

```ts
const hookdb = hookdown(db) as LevelDBHooked;
const prehook1 = (op: HookOp, cb: Callback) => {
console.log('pre1', op);
if (op.type == 'put') {
op.key = Buffer.from('changed');
op.value = Buffer.from('changed');
}
cb();
};
hookdb.prehooks.push(prehook1);
await db.put(Buffer.from('beep'), Buffer.from('boop'));
console.log(await db.get('changed')); // NotFoundError: Key not found in database [changed]
```

But we already knew this because of the lack of mutation support in the hooks. Therefore, if we wanted to reuse the hookdown, it would only be for post-write behaviour such as indexing, not for encryption.
I believe the eventemitter of levelup only emits events post-action. See https://github.com/Level/levelup#events: the events only correspond to after the action has occurred. So the event emitter isn't sufficient for maintaining the ability to do any kind of encryption. The hookdown doesn't follow the same structure as other leveldown layers: https://github.com/Level/awesome#layers. See examples of such layers here: https://github.com/adorsys/encrypt-down/blob/master/src/index.js and https://github.com/Level/encoding-down/blob/master/index.js. They all make use of the abstract-leveldown base class. One thing to note is that the underlying store we pass to levelup is an `EncodingDown` instance: `interface EncodingDown<K = any, V = any> extends AbstractLevelDOWN<K, V> { ... }`. Subleveldown can work at this level because it uses reachdown to reach the underlying store.
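To illustrate what such a layer looks like, here is a conceptual sketch (hedged: this is not how encrypt-down is actually structured, and a real layer would extend the abstract-leveldown base class and implement `_put`/`_get`; the encrypt/decrypt functions are placeholders):

```ts
import type { AbstractLevelDOWN, ErrorCallback, ErrorValueCallback } from 'abstract-leveldown';

// Conceptual pre-write/post-read layer: values are transformed before they
// reach the wrapped store and after they are read back.
class TransformDown<K = any, V = any> {
  constructor(
    protected down: AbstractLevelDOWN<K, V>,
    protected encrypt: (value: V) => V,
    protected decrypt: (value: V) => V,
  ) {}

  put(key: K, value: V, cb: ErrorCallback): void {
    // pre-write hook point: the value can be transformed before storage
    this.down.put(key, this.encrypt(value), cb);
  }

  get(key: K, cb: ErrorValueCallback<V>): void {
    this.down.get(key, (err, value) => {
      if (err) {
        cb(err, value);
        return;
      }
      // post-read hook point: the value is transformed before the caller sees it
      cb(undefined, this.decrypt(value));
    });
  }
}
```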
Going down the rabbit hole of option 1, Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers): for indexing, levelup's eventemitter interface is enough to implement it, just like using hookdown. However, both are not sufficient to implement encryption (because you need to intercept gets for decryption and puts for encryption); encryption would require an abstract-leveldown wrapper like encrypt-down. If one were to use hookdown or eventemitter for indexing, this would only work if encryption is moved into such a wrapper. To simplify things, suppose we ignore the eventemitter and hookdown systems. It is also possible that indexing and encryption/decryption are both implemented as abstract-leveldown layers. This would have strict ordering requirements, with indexing wrapping an encrypted layer. If the encrypted layer isn't using reachdown, then it is essential that the db instance uses a binary encoding-down, which is what we are already doing. If encryption/decryption were to use reachdown, then it may sit below encoding-down, which would mean an order like the one sketched below.
The encryption layer would only work against strings or buffers, as this is what leveldown expects, and it would rely on the encoding layer to encode anything else. Would indexing work prior to encoding, so that it has access to the raw types, and also use encoding-down when storing the indexes? Anyway, we need to do some more prototyping with the existing autoindex.
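To make that ordering concrete, here is one possible composition (hedged: `indexdown` and `encryptdown` are hypothetical layers named only for illustration, not packages we have). Indexing sits above encoding-down so it sees raw application values, while encryption sits below it and only ever sees binary:

```ts
import levelup from 'levelup';
import leveldown from 'leveldown';
import encode from 'encoding-down';

// Hypothetical layers: each wraps a down and returns another
// abstract-leveldown-compatible down.
declare function indexdown(down: any): any;
declare function encryptdown(down: any): any;

const db = levelup(
  indexdown(              // sees decoded keys/values, maintains index sublevels
    encode(
      encryptdown(        // sees only encoded buffers, encrypts on put / decrypts on get
        leveldown('./db'),
      ),
      { keyEncoding: 'binary', valueEncoding: 'binary' },
    ),
  ),
);
```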
Reference these for prototyping:
I'm leaning towards option 2, Building on top of DB, instead. This is because with option 1, Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers), we would have to move the encryption layer, and the existing indexing libraries may not be sufficient for our use case, necessitating a derivative implementation of indexing anyway. That is three portions of work:
This would be fine except that at the end of the day we still wouldn't have solved #5. The amount of work above would be shared with doing 2. Building on top of DB, in that:
And it doesn't have the wasted work of a derivative of level-auto-index. It may also be faster because we understand our own code better than how the abstract-leveldown prototype works. And doubling down on leveldb peculiarities may not be important if we later want to use sqlite3 or rocksdb. Going down route 2, there are additional reasons why a derivative indexing implementation based on level-idx / level-auto-index is required:
With respect to creating dynamic sublevels: the prefix names must not contain the sublevel separator character. So far none of our static and dynamic sublevel creations would use it in their prefixes.
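For illustration, a minimal sketch of creating a dynamic sublevel with subleveldown (the function and naming here are hypothetical); the key point is that the prefix must not contain the separator:

```ts
import type { LevelUp } from 'levelup';
import sub from 'subleveldown';

// vaultId is assumed to be separator-free (e.g. a base-encoded id),
// since subleveldown prefixes must not contain the separator character
function vaultSublevel(db: LevelUp, vaultId: string): LevelUp {
  return sub(db, vaultId, { keyEncoding: 'binary', valueEncoding: 'binary' });
}
```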
I wonder if it's possible, given that the keys are all binary encoded, to use the indexed values directly. A foolproof way would be base-encoding the values being indexed, so that the separator character can never appear in them. If base-encoding is needed regardless of hashing, then the chosen base algorithm should preserve lexicographic order; otherwise the ordering would be corrupted before hashing occurs, and the reason not to hash was to preserve the lexicographic order. Base64 is not lexicographic-preserving, which is why this library exists: https://github.com/deanlandolt/base64-lex. Not sure about base58btc, and maybe there's an encoding among the multibase codecs that does preserve lexicographic order? I believe encoding to hex would be lexicographic-preserving, because that's how UUIDs work too and hex is just a fixed-width per-byte encoding with an ordered alphabet (see the sanity check below). I've asked which ones preserve lexicographic order in multiformats/js-multiformats#124. We can then make a simple decision based on that.
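As a quick sanity check of the hex claim (a standalone sketch, separate from the base58btc test further below):

```ts
import crypto from 'crypto';

// hex maps each byte to a fixed-width pair from an ASCII-ordered alphabet,
// so sorting the hex strings should give the same order as sorting the bytes
const bufs = Array.from({ length: 1000 }, () => crypto.randomBytes(8));
bufs.sort(Buffer.compare);
const hexes = bufs.map((b) => b.toString('hex'));
const resorted = [...hexes].sort();
console.log(hexes.every((h, i) => h === resorted[i])); // expected: true
```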
This led to an issue in js-id: MatrixAI/js-id#7. Fixing the bugs in that issue resulted in experiments demonstrating that base58btc preserved lexicographic order. Further testing established which encodings do preserve order.
Furthermore, JS binary strings also preserved order. We should update the js-id tests accordingly. We won't need padding because we aren't concatenating encoded ids here, and there's structure around them. Here is the empirical test demonstrating this:

```ts
import type { Codec } from 'multiformats/bases/base';
import crypto from 'crypto';
import { bases } from 'multiformats/basics';
function randomBytes(size: number): Uint8Array {
return crypto.randomBytes(size);
}
type MultibaseFormats = keyof typeof bases;
const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
const codec = bases[k];
basesByPrefix[codec.prefix] = codec;
}
function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
const codec = bases[format];
return codec.encode(id);
}
function fromMultibase(idString: string): Uint8Array | undefined {
const prefix = idString[0];
const codec = basesByPrefix[prefix];
if (codec == null) {
return;
}
const buffer = codec.decode(idString);
return buffer;
}
const originalList: Array<Uint8Array> = [];
const total = 100000;
let count = total;
while (count) {
originalList.push(randomBytes(36));
count--;
}
originalList.sort(Buffer.compare);
const encodedList = originalList.map(
(bs) => toMultibase(bs, 'base58btc')
);
const encodedList_ = encodedList.slice();
encodedList_.sort();
// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order
for (let i = 0; i < total; i++) {
if (encodedList[i] !== encodedList_[i]) {
console.log('Does not match on:', i);
console.log('original order', encodedList[i]);
console.log('encoded order', encodedList_[i]);
break;
}
}
const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
// @ts-ignore
if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
console.log('bug in the code');
break;
}
}
```
Closing this as won't fix, because our usage of PK means we actually index the DB in a number of ways not conducive to automatic indexing at this point in time. In fact, we would need a structured schema first before any kind of automatic indexing can take place.
Description
Adding automatic indexing, inspired by level-auto-index. But it has to be adapted to DB due to our encryption and level needs.
Because indexing is so complicated, we are not going to integrate all forms of indexing. If we need these, we would be better off swapping to using sqlite3 which is a much bigger change.
So for now the only type of indexing is:
The way this works is that it takes a property of the value, which has to be a POJO. This property must be "hashable", meaning it must be convertible to a string or a primitive value. I imagine something that has the `toString` and `valueOf` methods; there's already a `ToString` interface that could be used. Alternatively, one could instead ask for the ability to JSON-encode whatever the property is and hash that. Note that we would have to use a canonical JSON encoding to ensure that it is deterministic.

Once we have this stringified value, we can proceed to hash it. We can use a cryptographically strong hash. In fact, the reason to use a cryptographically secure hash is simply to prevent preimage attacks. If we aren't interested in preventing preimage attacks, then hashing is just an unnecessary step and we could put the value as a plaintext key. Therefore, if we want to protect the value, it makes sense to hash it.
To prevent any chance of collision, we would have to ensure that the value we are hashing is smaller than the hash size; of course, collisions are very unlikely anyway. So we should just use SHA256 or SHA512 (note that SHA512 is faster than SHA256 on 64-bit platforms but uses double the space, so one could use SHA-512/256, but why bother for now).
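A minimal sketch of deriving such an index key (assuming Node's crypto, with plain `JSON.stringify` standing in for a canonical JSON encoding):

```ts
import crypto from 'crypto';

// Derive an index key from a value's property: stringify it (a canonical
// JSON encoding should be used in practice so the result is deterministic),
// then hash it so the plaintext property never appears as a key.
function indexKey(property: unknown): Buffer {
  const stringified = JSON.stringify(property);
  return crypto.createHash('sha256').update(stringified).digest();
}

// usage (hypothetical): indexSublevel.put(indexKey(vault.name), vaultId)
```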
Note that until Order Preserving Encryption becomes easier to use, such indexes lose their order. So they can only be used as a "hashtable" style index; there's no order to the indexes.
If users want order, they will need to bypass the hashing and manually create their index by creating their own sublevels.
Having a unique value index (hashtable style) should be sufficient for a lot of use cases.
One problem is dealing with non-unique values, for example vault tags: each vault can have multiple tags, and a tag can point to multiple primary ids. In that case we may at the very least generalise our index values to contain multiple ids, as sketched below.
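A rough sketch of what a multi-id index entry could look like, assuming a levelup-style index sublevel and JSON-encoded arrays of ids (the read-modify-write here is not atomic, so a real implementation would need batching or locking):

```ts
import type { LevelUp } from 'levelup';

// Append a primary id to the list stored under an index key (e.g. a vault tag)
async function addToIndex(index: LevelUp, indexKey: Buffer, id: string): Promise<void> {
  let ids: Array<string> = [];
  try {
    ids = JSON.parse((await index.get(indexKey)).toString());
  } catch (e: any) {
    if (e.notFound !== true) throw e; // first id under this index key
  }
  if (!ids.includes(id)) {
    ids.push(id);
  }
  await index.put(indexKey, Buffer.from(JSON.stringify(ids)));
}
```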
The level-auto-index also supports compound indexes by concatenation. We will avoid this for now. So this means what will be missing are:
If we find ourselves really needing any of the above, it is better we use sqlite3. And for encryption, it would be nice if we could put sqlite3 or leveldb on top of a block mapping system, with encryption applied to the block mapping system rather than to the sqlite3/leveldb keys & values. I think this should be possible since others have implemented sqlite3 on top of IndexedDB: https://nolanlawson.com/2021/08/22/speeding-up-indexeddb-reads-and-writes/
So an ideal future situation might be: sqlite3/leveldb on top of an encrypted blockdb (indexeddb) so that encryption happens at a lower level. This would eliminate all the encryption problems and allow keys and values to be encrypted. But this will be challenging for cross-platform scenarios. #5
Issues Fixed
Tasks
- [ ] Integrate hooks/events into the get/put/batch - concluded that hooks and eventemitter are not sufficient
- [ ] Prototype embedding indexing and encryption into the abstract-leveldown layers of the leveldb ecosystem - not doing this because it would be an inefficient stop-gap
- [ ] Prototype indexing on top of `DB` as an alternative outside of the leveldb ecosystem

Final checklist