-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore using timebased decentralized UUID for autogenerated IDs #5941
Comments
What we are proposing is since a machine may have several es instances running in parallel. We will generate a secure xor of the mac address, prefixed by a 64bit timestamp and postfixed by an atomic counter. timestamp64bit-macAddr-counter |
+1 for prefixing the ID with timestamp, or anything "predictable": this is better for Lucene's term dictionary since sometimes it will be able to short-circuit a lookup and know the ID cannot reside in a given segment. It's also important that the IDs are fixed-length, so that all terms happen in the "leaf blocks" of the dictionary. Finally, something like base36 is best, since we tell the terms dict writing to put 25 - 48 terms per block (so up to base 48 should be good too). |
@mikemccand why does the encoding matter ? Or do you mean between 36 and 48 bytes long ? |
@GaelTadh what I mean is each digit should have no more than 48 values (and lower would be better: less scanning within a block). This is because the terms dictionary assigns terms into blocks of at most 48 terms at once. However ... thinking about this more, I think this matters less, since the IDs won't be sequential here. |
Ahh ok that makes sense. |
Using SecureRandom as a UUID generator is slow and doesn't allow us to take adavantage of some lucene optimizations around ids with common prefixes. This commit will allow us to use a timestamp64bit-macAddr-counter UUID. Since the macAddr may be shared among several nodes running on the same hardware we use an xor of the macaddr with a SecureRandom number generated on startup. See elastic#5941
I used jmh (http://openjdk.java.net/projects/code-tools/jmh/) to do some micro benchmarks on the following code : ,,,,
,,, Which gave the following output :
|
I'm gonna run these again and remove the new call from insecureRandom. |
|
…hout an incoming id. Wire up the timestampUUID generator to indexing. See elastic#5941
Incorporate some of the changes from @kimchy and @s1monw. Move the UUID generators into their own classes and provide a common interface as a first step to moving them under a singleton. Use a better method of getting the mac address and fall back to a secure random address if it fails. Add tests to test conccurency and shared prefix integrity of UUIDs. Use PaddedAtomicLongs to hold the sequence number and lasttime. Check to see if a time slip has occured as described by @s1monw in a CAS loop. Next step is to move the impls under a singleton. See elastic#5941
Reduce number of time bytes to 6 reducing total number of bytes to 20. Validate that we have a mac address that contains data to avoid getting addresses that are just 00:00:00:00:00:00 which can happen on virtualized machines. Remove use of ByteBuffer on puts to reduce overhead. Add code to attempt to prevent time slips. See elastic#5941
Today we use UUIDs which are very costly to compute since they use
SecureRandom
and they have some properties that might not play very well / or as well as they could with the Lucene internal datastructures since they don't share a common prefix. With a time based ID we can potentially remove the need for bloom fitlers on the ID fields since the shared prefixes across the IDs might be all in memory and that would speedup updates dramatically. Something along the lines of flake ids would be very interesting since it has this propertyThe text was updated successfully, but these errors were encountered: