Composite identifiers #147

kirillt · 2021-11-22T13:53:07Z

Right now, during indexing ids are assigned to each resource (initially known only by path). Tags, in its turn, are assigned to resources using these ids in order to be path-independent. We can't use random ids (like UUID) since for any resource it's id must be computable on other devices (see content-addressing).

Unfortunately, cryptographic hash functions consume significantly more resources than CRC32, which we are using at the moment due to performance reasons. Using weak hash functions can lead to potential conflicts of identifiers (situation when several resources have the same id). Probability of it is open question, but in practice I suspect that several resources out of 5K+ files collection do conflict by id.

These collisions aren't harmful so far. Even in case of destructive operations being applied to an id which has several resources attached to it, only one of them is deleted (the resource which is visible to the user, see #140). This state of things can change in future.

One of potential solutions would be usage of composite identifiers: a stack of functions, where every next function is used when previous function produces a conflict in some given collection. This way, we could use our CRC32 for fast indexing and, in case of detected conflicts, falling back to a stronger function like SHA256 or MD5 applying it only for particular resources involved in the conflict.

Example:

Resource A has id crc32: 1.
Resource B has id crc32: 2.
Resource C has id crc32: 1.

Now we index A and B additionally with SHA-256.
Resource C isn't needed to re-index here (but after implementation of #23 we might put such a task into the queue).

Result:

Resource A has id crc32: 1, sha256: 17.
Resource B has id crc32: 2, sha256: 5.
Resource C has id crc32: 1.

Maybe we can replace old id with the newer one, but for this we need to update tags storage with newer id.

If such feature is possible, we can go further and also significantly increase indexing speed by using file size as first (the weakest) hash function in the stack and calculating CRC-32 only for resources of the same size.

The text was updated successfully, but these errors were encountered:

kirillt added concept performance UX labels Nov 22, 2021

kirillt closed this as completed Nov 22, 2021

kirillt reopened this Nov 22, 2021

kirillt added the level up label Nov 22, 2021

kirillt mentioned this issue Nov 16, 2023

Better handling of _id_ collisions ARK-Builders/arklib#58

Open

kirillt removed the level up label Jan 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Composite identifiers #147

Composite identifiers #147

kirillt commented Nov 22, 2021 •

edited

Loading

Composite identifiers #147

Composite identifiers #147

Comments

kirillt commented Nov 22, 2021 • edited Loading

kirillt commented Nov 22, 2021 •

edited

Loading