
Disable auto gen id optimization #9468

Closed · wants to merge 3 commits

Conversation

brwe (Contributor) commented Jan 28, 2015

This PR removes the optimization for auto-generated ids.
Previously, when ids were auto-generated by Elasticsearch, there was no
check whether a document with the same id already existed; the new
document was simply appended. However, due to Lucene improvements this
optimization no longer adds much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connection loss, node shutdown, etc.),
a flag 'canHaveDuplicates' is set to true on the indexing request
that is sent the second time. This was meant to ensure that when
an indexing request for a document with an auto-generated id comes in,
we only append and do not have to update unless this flag is set.

However, it might happen that, for a retry or for replication, the
indexing request that has canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that has it set to false.
In this case both requests append a document and we end up with a duplicate.
This commit adds a workaround: remove the optimization for auto-generated
ids and always update the document.
The assumption is that this will not slow down indexing by more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
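
For context, here is a minimal sketch of the fast path this PR removes (not the actual InternalEngine code; the method and variable names are illustrative). When the id was auto-generated and canHaveDuplicates was false, the old code appended blindly; after this change every request goes through the uid-based update path:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class AutoGeneratedIdSketch {
    // Illustrative only: the gist of the optimization removed by this PR.
    static void writeDoc(IndexWriter writer, Term uid, Document doc,
                         boolean autoGeneratedId, boolean canHaveDuplicates) throws IOException {
        if (autoGeneratedId && !canHaveDuplicates) {
            // Old fast path: trust that the auto-generated id is new and append
            // without a uid lookup. A retried request racing with the original
            // can slip through here and index the same document twice.
            writer.addDocument(doc);
        } else {
            // Safe path, now taken unconditionally: replace any existing document
            // with the same uid, so a replayed request cannot create a duplicate.
            writer.updateDocument(uid, doc);
        }
    }
}

With the fast path gone every create pays for the uid lookup, which is where the "not more than 10 percent" slowdown estimate above comes from.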

@@ -401,11 +401,11 @@ private void innerCreateNoLock(Create create, IndexWriter writer, long currentVe
if ((versionValue != null && versionValue.delete() == false) || (versionValue == null && currentVersion != Versions.NOT_FOUND)) {
if (create.origin() == Operation.Origin.RECOVERY) {
return;
} else if (create.origin() == Operation.Origin.REPLICA) {
Contributor

do we need this change here? I mean removing the optimization should be enough?

Contributor Author

Without the change there are two scenarios:

  1. First and second create requests arrive in order on the replica: on the replica 'doUpdate' is set but the version is not set to 1 -> versions on primary and replica are out of sync.
  2. First and second create requests arrive out of order on the primary: we get a 'DocumentAlreadyExistsException' because none of the other criteria match.

The tests I added in InternalEngineTests fail that way.
It seems to me we have to change something to make this work but I might be missing something.

@bleskes I agree we do not have to update the doc at all if we find it already exists and create.autoGeneratedId() is true. We can just return. I added a commit; tests pass.

Contributor Author

ok, ignore the comment for now, might just have gotten the versioning wrong...

Contributor Author

yes, that was wrong. I removed the change and fixed the test instead

s1monw (Contributor) commented Jan 29, 2015

left some comments

@@ -401,11 +401,11 @@ private void innerCreateNoLock(Create create, IndexWriter writer, long currentVe
if ((versionValue != null && versionValue.delete() == false) || (versionValue == null && currentVersion != Versions.NOT_FOUND)) {
if (create.origin() == Operation.Origin.RECOVERY) {
return;
-} else if (create.origin() == Operation.Origin.REPLICA) {
+} else if (create.origin() == Operation.Origin.REPLICA && !create.autoGeneratedId()) {
Contributor

I think this is wrong? If this is an indexing request on a replica and we're here, that means there is already a doc with this id. In this case we want to just ignore the request. +1 on what Simon said regarding not changing this code.

Contributor Author

indeed, I removed the change and fixed the test instead

brwe (Contributor, Author) commented Jan 29, 2015

thanks for the comments, I got it all wrong the first time. please have another look!

assertThat(topDocs.totalHits, equalTo(1));

index = new Engine.Create(null, analyzer, newUid("1"), doc, index.version(), index.versionType().versionTypeForReplicationAndRecovery(), REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
try {
Contributor

we can use ElasticsearchAssertions.assertThrows. Slightly cleaner code.

Contributor Author

assertThrows currently only accepts ActionFuture. We can change that, but I think that should probably be a different PR.
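
For reference, a minimal sketch of the plain try/catch idiom kept in the test instead of assertThrows (the engine call, variable names, and the expected exception type are illustrative placeholders, not copied from the actual test; it assumes JUnit's static fail import as used in the test class):

// Illustrative only: expect the create for an already-existing id to be rejected.
try {
    engine.create(firstCreate);  // the original (non-retried) request arriving second
    fail("expected DocumentAlreadyExistsException");
} catch (DocumentAlreadyExistsException expected) {
    // expected: a document with this id was already indexed by the retried request
}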

bleskes (Contributor) commented Jan 29, 2015

LGTM. Left a minor suggestion.

s1monw (Contributor) commented Jan 29, 2015

LGTM too

brwe changed the title from "core: fix duplicate docs with autogenerated ids" to "core: disable auto gen id optimization" on Jan 29, 2015
brwe added a commit that referenced this pull request Jan 29, 2015
brwe added a commit that referenced this pull request Jan 29, 2015
brwe closed this in 0a07ce8 on Jan 29, 2015
brwe added a commit that referenced this pull request Feb 4, 2015
clintongormley changed the title from "core: disable auto gen id optimization" to "Core: disable auto gen id optimization" on Feb 10, 2015
clintongormley changed the title from "Core: disable auto gen id optimization" to "Core: Disable auto gen id optimization" on Feb 10, 2015
clintongormley changed the title from "Core: Disable auto gen id optimization" to "Disable auto gen id optimization" on Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Successfully merging this pull request may close these issues.

Duplicate id in index
4 participants