
Explore using timebased decentralized UUID for autogenerated IDs #5941

Closed
s1monw opened this issue Apr 25, 2014 · 8 comments

s1monw commented Apr 25, 2014

Today we use UUIDs that are very costly to compute, since they use SecureRandom, and they have properties that don't play as well as they could with Lucene's internal data structures, since they share no common prefix. With a time-based ID we could potentially remove the need for bloom filters on the ID field, since the shared prefixes across IDs might all fit in memory, and that would speed up updates dramatically. Something along the lines of flake IDs would be very interesting since it has exactly this property.

@GaelTadh GaelTadh self-assigned this Apr 25, 2014
@GaelTadh

What we are proposing: since a machine may have several ES instances running in parallel, we will generate a secure XOR of the MAC address, prefixed by a 64-bit timestamp and suffixed by an atomic counter.

timestamp64bit-macAddr-counter
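The layout above could be sketched roughly like this (a minimal, hypothetical illustration, not the actual Elasticsearch implementation; here the node bytes are a purely random stand-in for the MAC-xor value, and `TimeBasedIdSketch` is an invented name):

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

public final class TimeBasedIdSketch {
    // Node bytes fixed at startup; a stand-in for MAC-xor-random.
    private static final byte[] NODE = randomNodeBytes();
    // Counter disambiguates IDs minted within the same millisecond.
    private static final AtomicInteger COUNTER =
            new AtomicInteger(new SecureRandom().nextInt());

    private static byte[] randomNodeBytes() {
        byte[] node = new byte[6];
        new SecureRandom().nextBytes(node);
        return node;
    }

    public static String next() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 6 + 4);
        buf.putLong(System.currentTimeMillis()); // shared prefix across IDs
        buf.put(NODE);                           // per-node bytes
        buf.putInt(COUNTER.incrementAndGet());   // monotonically increasing suffix
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buf.array());
    }
}
```

Because the timestamp leads, IDs generated close together in time share a prefix, and every ID is the same fixed length (18 bytes, 24 base64 characters).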

@mikemccand

+1 for prefixing the ID with a timestamp, or anything "predictable": this is better for Lucene's term dictionary since it can sometimes short-circuit a lookup, knowing the ID cannot reside in a given segment. It's also important that the IDs are fixed-length, so that all terms land in the "leaf blocks" of the dictionary. Finally, something like base 36 is best, since we tell the terms dict writer to put 25 to 48 terms per block (so up to base 48 should be fine too).
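One way to read the fixed-length point is to pad every encoded value to a constant width, so all IDs are equal-length terms. A toy sketch (illustrative only; `fixedBase36` is not a real Elasticsearch helper):

```java
public final class FixedWidthBase36 {
    // Zero-pad a base-36 encoding to a fixed width so every generated ID
    // has the same length, keeping all terms in the dictionary's leaf blocks.
    static String fixedBase36(long value, int width) {
        String s = Long.toUnsignedString(value, 36);
        StringBuilder sb = new StringBuilder(width);
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }
}
```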

@GaelTadh

@mikemccand why does the encoding matter? Or do you mean between 36 and 48 bytes long?

@mikemccand

@GaelTadh what I mean is that each digit should have no more than 48 possible values (and fewer would be better: less scanning within a block). This is because the terms dictionary assigns terms to blocks of at most 48 at a time.

However ... thinking about this more, I think this matters less, since the IDs won't be sequential here.

@GaelTadh

Ahh ok that makes sense.

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue Apr 28, 2014
Using SecureRandom as a UUID generator is slow and doesn't allow us
to take advantage of some Lucene optimizations around IDs with common
prefixes. This commit will allow us to use a
timestamp64bit-macAddr-counter UUID. Since the MAC address may be shared
among several nodes running on the same hardware, we use an XOR of the
MAC address with a SecureRandom number generated on startup.

See elastic#5941
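The MAC-xor idea from the commit message might look something like this (a sketch under the stated assumptions; the class and method names are hypothetical, not the actual commit's code):

```java
import java.security.SecureRandom;

public final class NodeIdSketch {
    // XOR the MAC address with a random salt generated once at startup, so
    // several nodes sharing the same hardware still get distinct node bytes.
    static byte[] xorWithSalt(byte[] mac, byte[] salt) {
        byte[] out = new byte[mac.length];
        for (int i = 0; i < mac.length; i++) {
            out[i] = (byte) (mac[i] ^ salt[i]);
        }
        return out;
    }

    // Generated once per process; the same salt is reused for every ID.
    static byte[] startupSalt(int length) {
        byte[] salt = new byte[length];
        new SecureRandom().nextBytes(salt);
        return salt;
    }
}
```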
@GaelTadh

I used jmh (http://openjdk.java.net/projects/code-tools/jmh/) to run some micro-benchmarks on the following code:

```java
@GenerateMicroBenchmark
public String secureRandomUUID() {
    // Return the value so JMH consumes it (avoids dead-code elimination).
    return randomBase64UUID(SecureRandomHolder.INSTANCE);
}

@GenerateMicroBenchmark
public String insecureRandomUUID() {
    return randomBase64UUID(new Random());
}

@GenerateMicroBenchmark
public String timestampUUID() {
    return timestampBase64UUID();
}
```

Which gave the following output:

| Benchmark | Mode | Samples | Mean | Mean error | Units |
| --- | --- | --- | --- | --- | --- |
| o.s.UUIDBenchmark.insecureRandomUUID | thrpt | 200 | 5155.120 | 16.137 | ops/ms |
| o.s.UUIDBenchmark.secureRandomUUID | thrpt | 200 | 469.264 | 3.287 | ops/ms |
| o.s.UUIDBenchmark.timestampUUID | thrpt | 200 | 5703.520 | 13.350 | ops/ms |

@GaelTadh

I'm going to run these again with the per-invocation `new Random()` allocation removed from insecureRandomUUID.

@GaelTadh

| Benchmark | Mode | Samples | Mean | Mean error | Units |
| --- | --- | --- | --- | --- | --- |
| o.s.UUIDBenchmark.insecureRandomUUID | thrpt | 200 | 7348.913 | 16.807 | ops/ms |
| o.s.UUIDBenchmark.secureRandomUUID | thrpt | 200 | 475.986 | 1.489 | ops/ms |
| o.s.UUIDBenchmark.timestampUUID | thrpt | 200 | 5698.841 | 23.908 | ops/ms |

GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 1, 2014
…hout an incoming id.

Wire up the timestampUUID generator to indexing.

See elastic#5941
GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 2, 2014
Incorporate some of the changes from @kimchy and @s1monw.
Move the UUID generators into their own classes and provide a common
interface as a first step toward moving them under a singleton. Use a better
method of getting the MAC address, and fall back to a secure random address
if that fails. Add tests for concurrency and shared-prefix integrity of
UUIDs. Use PaddedAtomicLongs to hold the sequence number and last time.
Check whether a time slip has occurred, as described by @s1monw, in a CAS loop.
Next step is to move the impls under a singleton.

See elastic#5941
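The "time slip" guard described in the commit message could be sketched as a CAS loop over the last timestamp handed out (a plain `AtomicLong` here rather than the `PaddedAtomicLong` mentioned; the sequence counter, not shown, would disambiguate IDs minted in the same millisecond):

```java
import java.util.concurrent.atomic.AtomicLong;

public final class MonotonicClockSketch {
    private static final AtomicLong LAST = new AtomicLong();

    // Never return a timestamp smaller than one already handed out, even if
    // the wall clock jumps backwards (a "time slip"). Retries on CAS failure
    // when another thread updated LAST concurrently.
    static long monotonicMillis() {
        while (true) {
            long last = LAST.get();
            long now = Math.max(last, System.currentTimeMillis());
            if (LAST.compareAndSet(last, now)) {
                return now;
            }
        }
    }
}
```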
GaelTadh added a commit to GaelTadh/elasticsearch that referenced this issue May 6, 2014
Reduce the number of time bytes to 6, reducing the total number of bytes to 20.
Validate that the MAC address actually contains data, to avoid addresses
that are just 00:00:00:00:00:00, which can happen on virtualized
machines. Remove use of ByteBuffer on puts to reduce overhead. Add code to
attempt to prevent time slips.

See elastic#5941
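The all-zero MAC check from this commit could be as simple as the following (hypothetical helper name; the actual Elasticsearch check may differ):

```java
public final class MacAddressCheck {
    // Reject null, wrong-length, or all-zero addresses: virtualized machines
    // sometimes report 00:00:00:00:00:00, which would defeat the node bytes.
    static boolean isValidAddress(byte[] address) {
        if (address == null || address.length != 6) {
            return false;
        }
        for (byte b : address) {
            if (b != 0) {
                return true; // at least one non-zero byte
            }
        }
        return false;
    }
}
```

A caller failing this check would fall back to a SecureRandom-generated address, as the earlier commit describes.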
@kevinkluge kevinkluge added v1.3.0 and removed v1.2.0 labels May 12, 2014
mikemccand added a commit that referenced this issue Sep 2, 2014
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
mikemccand added a commit that referenced this issue Sep 8, 2014
Flake IDs give better lookup performance in Lucene since they share
predictable prefixes (timestamp).

Closes #7531

Closes #6004

Closes #5941
5 participants