
Explore using timebased decentralized UUID for autogenerated IDs #5941

Closed
s1monw opened this issue Apr 25, 2014 · 8 comments

s1monw commented Apr 25, 2014

Today we use UUIDs that are very costly to compute, since they use SecureRandom, and they have properties that don't play as well as they could with Lucene's internal data structures, since they share no common prefix. With a time-based ID we could potentially remove the need for bloom filters on the ID field, since the shared prefixes across IDs might all fit in memory, and that would speed up updates dramatically. Something along the lines of flake IDs would be very interesting since it has exactly this property.

@GaelTadh GaelTadh self-assigned this Apr 25, 2014
@GaelTadh

What we are proposing: since a machine may have several ES instances running in parallel, we will generate a secure XOR of the MAC address, prefixed by a 64-bit timestamp and suffixed by an atomic counter.

timestamp64bit-macAddr-counter
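The layout above could be sketched roughly like this (a minimal, hypothetical illustration, not the actual Elasticsearch implementation; here the node bytes are a purely random stand-in for the MAC-xor value, and `TimeBasedIdSketch` is an invented name):

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

public final class TimeBasedIdSketch {
    // Node bytes fixed at startup; a stand-in for MAC-xor-random.
    private static final byte[] NODE = randomNodeBytes();
    // Counter disambiguates IDs minted within the same millisecond.
    private static final AtomicInteger COUNTER =
            new AtomicInteger(new SecureRandom().nextInt());

    private static byte[] randomNodeBytes() {
        byte[] node = new byte[6];
        new SecureRandom().nextBytes(node);
        return node;
    }

    public static String next() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 6 + 4);
        buf.putLong(System.currentTimeMillis()); // shared prefix across IDs
        buf.put(NODE);                           // per-node bytes
        buf.putInt(COUNTER.incrementAndGet());   // monotonically increasing suffix
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buf.array());
    }
}
```

Because the timestamp leads, IDs generated close together in time share a prefix, and every ID is the same fixed length (18 bytes, 24 base64 characters).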

@mikemccand

+1 for prefixing the ID with a timestamp, or anything "predictable": this is better for Lucene's term dictionary since it can sometimes short-circuit a lookup, knowing the ID cannot reside in a given segment. It's also important that the IDs are fixed-length, so that all terms land in the "leaf blocks" of the dictionary. Finally, something like base 36 is best, since we tell the terms dict writer to put 25 to 48 terms per block (so up to base 48 should be fine too).
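One way to read the fixed-length point is to pad every encoded value to a constant width, so all IDs are equal-length terms. A toy sketch (illustrative only; `fixedBase36` is not a real Elasticsearch helper):

```java
public final class FixedWidthBase36 {
    // Zero-pad a base-36 encoding to a fixed width so every generated ID
    // has the same length, keeping all terms in the dictionary's leaf blocks.
    static String fixedBase36(long value, int width) {
        String s = Long.toUnsignedString(value, 36);
        StringBuilder sb = new StringBuilder(width);
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }
}
```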

@GaelTadh

@mikemccand why does the encoding matter? Or do you mean between 36 and 48 bytes long?

@mikemccand

@GaelTadh what I mean is that each digit should have no more than 48 possible values (and fewer would be better: less scanning within a block). This is because the terms dictionary assigns terms to blocks of at most 48 at a time.

However ... thinking about this more, I think this matters less, since the IDs won't be sequential here.

@GaelTadh

Ahh ok that makes sense.

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue Apr 28, 2014
Using SecureRandom as a UUID generator is slow and doesn't allow us
to take advantage of some Lucene optimizations around IDs with common
prefixes. This commit will allow us to use a
timestamp64bit-macAddr-counter UUID. Since the MAC address may be shared
among several nodes running on the same hardware, we use an XOR of the
MAC address with a SecureRandom number generated on startup.

See elastic#5941
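The MAC-xor idea from the commit message might look something like this (a sketch under the stated assumptions; the class and method names are hypothetical, not the actual commit's code):

```java
import java.security.SecureRandom;

public final class NodeIdSketch {
    // XOR the MAC address with a random salt generated once at startup, so
    // several nodes sharing the same hardware still get distinct node bytes.
    static byte[] xorWithSalt(byte[] mac, byte[] salt) {
        byte[] out = new byte[mac.length];
        for (int i = 0; i < mac.length; i++) {
            out[i] = (byte) (mac[i] ^ salt[i]);
        }
        return out;
    }

    // Generated once per process; the same salt is reused for every ID.
    static byte[] startupSalt(int length) {
        byte[] salt = new byte[length];
        new SecureRandom().nextBytes(salt);
        return salt;
    }
}
```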
@GaelTadh

I used jmh (http://openjdk.java.net/projects/code-tools/jmh/) to run some micro-benchmarks on the following code:

```java
@GenerateMicroBenchmark
public String secureRandomUUID() {
    // Return the value so JMH consumes it (avoids dead-code elimination).
    return randomBase64UUID(SecureRandomHolder.INSTANCE);
}

@GenerateMicroBenchmark
public String insecureRandomUUID() {
    return randomBase64UUID(new Random());
}

@GenerateMicroBenchmark
public String timestampUUID() {
    return timestampBase64UUID();
}
```

Which gave the following output:

| Benchmark | Mode | Samples | Mean | Mean error | Units |
| --- | --- | --- | --- | --- | --- |
| o.s.UUIDBenchmark.insecureRandomUUID | thrpt | 200 | 5155.120 | 16.137 | ops/ms |
| o.s.UUIDBenchmark.secureRandomUUID | thrpt | 200 | 469.264 | 3.287 | ops/ms |
| o.s.UUIDBenchmark.timestampUUID | thrpt | 200 | 5703.520 | 13.350 | ops/ms |

@GaelTadh

I'm going to run these again with the per-invocation `new Random()` allocation removed from insecureRandomUUID.

@GaelTadh

| Benchmark | Mode | Samples | Mean | Mean error | Units |
| --- | --- | --- | --- | --- | --- |
| o.s.UUIDBenchmark.insecureRandomUUID | thrpt | 200 | 7348.913 | 16.807 | ops/ms |
| o.s.UUIDBenchmark.secureRandomUUID | thrpt | 200 | 475.986 | 1.489 | ops/ms |
| o.s.UUIDBenchmark.timestampUUID | thrpt | 200 | 5698.841 | 23.908 | ops/ms |

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 1, 2014
…hout an incoming id.

Wire up the timestampUUID generator to indexing.

See elastic#5941
GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 2, 2014
Incorporate some of the changes from @kimchy and @s1monw.
Move the UUID generators into their own classes and provide a common
interface as a first step toward moving them under a singleton. Use a better
method of getting the MAC address, and fall back to a secure random address
if that fails. Add tests for concurrency and shared-prefix integrity of
UUIDs. Use PaddedAtomicLongs to hold the sequence number and last time.
Check whether a time slip has occurred, as described by @s1monw, in a CAS loop.
Next step is to move the impls under a singleton.

See elastic#5941
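The "time slip" guard described in the commit message could be sketched as a CAS loop over the last timestamp handed out (a plain `AtomicLong` here rather than the `PaddedAtomicLong` mentioned; the sequence counter, not shown, would disambiguate IDs minted in the same millisecond):

```java
import java.util.concurrent.atomic.AtomicLong;

public final class MonotonicClockSketch {
    private static final AtomicLong LAST = new AtomicLong();

    // Never return a timestamp smaller than one already handed out, even if
    // the wall clock jumps backwards (a "time slip"). Retries on CAS failure
    // when another thread updated LAST concurrently.
    static long monotonicMillis() {
        while (true) {
            long last = LAST.get();
            long now = Math.max(last, System.currentTimeMillis());
            if (LAST.compareAndSet(last, now)) {
                return now;
            }
        }
    }
}
```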
GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 6, 2014
Reduce the number of time bytes to 6, reducing the total number of bytes to 20.
Validate that the MAC address actually contains data, to avoid addresses
that are just 00:00:00:00:00:00, which can happen on virtualized
machines. Remove use of ByteBuffer on puts to reduce overhead. Add code to
attempt to prevent time slips.

See elastic#5941
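The all-zero MAC check from this commit could be as simple as the following (hypothetical helper name; the actual Elasticsearch check may differ):

```java
public final class MacAddressCheck {
    // Reject null, wrong-length, or all-zero addresses: virtualized machines
    // sometimes report 00:00:00:00:00:00, which would defeat the node bytes.
    static boolean isValidAddress(byte[] address) {
        if (address == null || address.length != 6) {
            return false;
        }
        for (byte b : address) {
            if (b != 0) {
                return true; // at least one non-zero byte
            }
        }
        return false;
    }
}
```

A caller failing this check would fall back to a SecureRandom-generated address, as the earlier commit describes.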
@kevinkluge kevinkluge added v1.3.0 and removed v1.2.0 labels May 12, 2014
mikemccand added a commit that referenced this issue Sep 2, 2014
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
mikemccand added a commit that referenced this issue Sep 8, 2014
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
5 participants