Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composite identifiers #147

Open
kirillt opened this issue Nov 22, 2021 · 0 comments
Open

Composite identifiers #147

kirillt opened this issue Nov 22, 2021 · 0 comments

Comments

@kirillt
Copy link
Member

kirillt commented Nov 22, 2021

Right now, during indexing ids are assigned to each resource (initially known only by path). Tags, in its turn, are assigned to resources using these ids in order to be path-independent. We can't use random ids (like UUID) since for any resource it's id must be computable on other devices (see content-addressing).

Unfortunately, cryptographic hash functions consume significantly more resources than CRC32, which we are using at the moment due to performance reasons. Using weak hash functions can lead to potential conflicts of identifiers (situation when several resources have the same id). Probability of it is open question, but in practice I suspect that several resources out of 5K+ files collection do conflict by id.

These collisions aren't harmful so far. Even in case of destructive operations being applied to an id which has several resources attached to it, only one of them is deleted (the resource which is visible to the user, see #140). This state of things can change in future.

One of potential solutions would be usage of composite identifiers: a stack of functions, where every next function is used when previous function produces a conflict in some given collection. This way, we could use our CRC32 for fast indexing and, in case of detected conflicts, falling back to a stronger function like SHA256 or MD5 applying it only for particular resources involved in the conflict.

Example:

  • Resource A has id crc32: 1.
  • Resource B has id crc32: 2.
  • Resource C has id crc32: 1.

Now we index A and B additionally with SHA-256.
Resource C isn't needed to re-index here (but after implementation of #23 we might put such a task into the queue).

Result:

  • Resource A has id crc32: 1, sha256: 17.
  • Resource B has id crc32: 2, sha256: 5.
  • Resource C has id crc32: 1.

Maybe we can replace old id with the newer one, but for this we need to update tags storage with newer id.

If such feature is possible, we can go further and also significantly increase indexing speed by using file size as first (the weakest) hash function in the stack and calculating CRC-32 only for resources of the same size.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant