[CI] resource_already_exists_exception failures #45605

Closed
droberts195 opened this issue Aug 15, 2019 · 12 comments
Labels
:Delivery/Build Build or test infrastructure :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

This problem occurred while running indices.update_aliases/10_basic/Basic test for multiple aliases in the qa/smoke-test-multinode directory. However, I do not think that particular test is to blame for the failure - it seems more like a very rare problem with index management in multi-node clusters.

The build where this problem was seen was https://gradle-enterprise.elastic.co/s/fknt4cque3p4e

The client-side problem that made the test fail was "index already exists":

  1> [2019-08-15T18:29:39,851][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] before test
  1> [2019-08-15T18:30:09,895][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>             "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>             "index" : "test_index",
  1>             "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [test_index/3APvix2mRpuc5B52QMFNbA] already exists",
  1>         "index_uuid" : "3APvix2mRpuc5B52QMFNbA",
  1>         "index" : "test_index",
  1>         "stack_trace" : "[test_index/3APvix2mRpuc5B52QMFNbA] ResourceAlreadyExistsException[index [test_index/3APvix2mRpuc5B52QMFNbA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] There are still tasks running after this test that might break subsequent tests [internal:index/shard/recovery/start_recovery].
  1> [2019-08-15T18:30:29,541][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.update_aliases/10_basic/Basic test for multiple aliases] after test

The underlying reason, from the server-side logs, was that a previous test's index with the same name could not be deleted:

[2019-08-15T08:28:37,221][WARN ][o.e.i.IndicesService     ] [integTest-0] [test_index/THgD8CcvQyGEf_ADVnVoCw] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [test_index][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
[2019-08-15T08:28:37,240][WARN ][o.e.i.s.RetentionLeaseSyncAction] [integTest-0] unexpected error during the primary phase for action [indices:admin/seq_no/retention_lease_sync], request [RetentionLeaseSyncAction.Request{retentionLeases=RetentionLeases{primaryTerm=1, version=2, leases={peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA=RetentionLease{id='peer_recovery/QZ3ZxWTXT5WjQjs7JuH_sA', retainingSequenceNumber=0, timestamp=1565857716949, source='peer recovery'}, peer_recovery/bnbl33BbQveq_mL2_DzTcg=RetentionLease{id='peer_recovery/bnbl33BbQveq_mL2_DzTcg', retainingSequenceNumber=0, timestamp=1565857717149, source='peer recovery'}}}, shardId=[test_index][0], timeout=1m, index='test_index', waitForActiveShards=0}]
org.elasticsearch.index.IndexNotFoundException: no such index [test_index]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:190) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:116) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteSingleIndex(IndexNameExpressionResolver.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.concreteIndex(TransportReplicationAction.java:234) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:651) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onNewClusterState(TransportReplicationAction.java:795) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.postAdded(ClusterStateObserver.java:220) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$1.run(ClusterApplierService.java:299) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The repro command is:

./gradlew :qa:smoke-test-multinode:integTestRunner --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" \
  -Dtests.method="test {yaml=indices.update_aliases/10_basic/Basic test for multiple aliases}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-ML \
  -Dtests.timezone=Australia/ACT \
  -Dcompiler.java=12 \
  -Druntime.java=11

This did not reproduce for me.

The other thing that might be relevant is that, around the time this was failing, a completely different test against a different test cluster failed on the same CI worker. It was:

./gradlew :docs:integTestRunner --tests "org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT.test {yaml=reference/docs/update-by-query/line_253}" \
  -Dtests.seed=8C4DDB5E411196EE \
  -Dtests.security.manager=true \
  -Dtests.locale=dyo \
  -Dtests.timezone=Asia/Jerusalem \
  -Dcompiler.java=12 \
  -Druntime.java=11

The client-side error for that was:

  1> [2019-08-15T11:29:39,823][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] before test
  1> [2019-08-15T11:30:09,868][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "error" : {
  1>         "root_cause" : [
  1>           {
  1>             "type" : "resource_already_exists_exception",
  1>             "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>             "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>             "index" : "twitter",
  1>             "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>           }
  1>         ],
  1>         "type" : "resource_already_exists_exception",
  1>         "reason" : "index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists",
  1>         "index_uuid" : "I-JDSTD4S6KmKfAEkI95iA",
  1>         "index" : "twitter",
  1>         "stack_trace" : "[twitter/I-JDSTD4S6KmKfAEkI95iA] ResourceAlreadyExistsException[index [twitter/I-JDSTD4S6KmKfAEkI95iA] already exists]
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:148)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:607)
  1> 	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:285)
  1> 	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
  1> 	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1> 	at java.base/java.lang.Thread.run(Thread.java:834)
  1> "
  1>       },
  1>       "status" : 400
  1>     }
  1>   }
  1> }]
  1> [2019-08-15T11:30:36,334][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/docs/update-by-query/line_253] after test

The server-side logs for that docs test failure show nothing apart from a big gap:

[2019-08-15T08:29:39,832][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [twitter] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-08-15T08:30:24,229][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node-0] [twitter/I-JDSTD4S6KmKfAEkI95iA] deleting index

These two test clusters were separate - it is not that the docs tests and the multi-node smoke tests were simultaneously talking to the same test cluster. They were, however, running at the same time and failed at around the same time.

It may be that these failures happened because the CI worker was completely overloaded by multiple test clusters running simultaneously around the time of the failures.

But there may still be something interesting in the ShardLockObtainFailedException that is worth considering, in case users see it on machines that are under heavy load.
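
To make that warning more concrete, here is a minimal, self-contained illustration (plain Java with hypothetical names, not Elasticsearch code) of the mechanism suggested by the stack traces: the per-shard lock is effectively a single-permit semaphore, and the index-deletion path asks for it with a zero timeout, so if shard creation still holds the permit - plausible on an overloaded worker - the delete gives up immediately and is only logged as a warning.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal illustration of "obtaining shard lock timed out after 0ms".
public class ShardLockTimeoutDemo {
    public static void main(String[] args) throws Exception {
        Semaphore shardLock = new Semaphore(1);

        // "shard creation" takes the lock and is slow to release it,
        // e.g. because the CI worker is overloaded.
        shardLock.acquire();

        // "deleting index directory" asks for the same lock with a 0ms timeout,
        // mirroring the timeout reported in the warning above.
        boolean acquired = shardLock.tryAcquire(0, TimeUnit.MILLISECONDS);
        if (acquired == false) {
            // This is the situation that surfaces as ShardLockObtainFailedException:
            // the attempt to delete the on-disk index directory is abandoned and
            // only a warning is logged.
            System.out.println("obtaining shard lock timed out after 0ms");
        }
    }
}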

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Aug 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@droberts195
Contributor Author

droberts195 commented Aug 15, 2019

I just noticed this is a more detailed investigation of one specific case of #45600, so this may not be a rare problem at all.

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor

alpar-t commented Aug 15, 2019

Could it be that somehow the clusters unite, so that the tests are in fact running against one larger cluster rather than two separate ones? I think that would explain the behavior in the logs.
I remember @ywelsch suggested we use unique cluster names for tests to make sure this doesn't happen, even though I don't think it should with the current configuration.

@droberts195
Contributor Author

Could it be that somehow the clusters unite

Maybe that has happened in other CI runs, but in the CI run described in this issue's description the two test clusters did not unite. The server-side logs for the docs cluster show that it had only 1 node (as expected) and the server-side logs for the qa/smoke-test-multinode cluster show that it had only 2 nodes (as expected).

@droberts195
Contributor Author

The two examples we have so far, where tests for qa/smoke-test-multinode and docs failed around the same time, may be a coincidence.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2016/71/console is an example where the failure to delete an index between tests occurs and just a single test fails in the CI run:

  1> [2019-08-15T04:48:49,931][WARN ][o.e.i.IndicesService     ] [node_t1] [test/y12jwHyUSiS70AhJ_iAD3g] failed to delete index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [test][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]

The failure that follows on from that is:

    java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.

The repro command for this failure was:

gradlew :server:integTest --tests "org.elasticsearch.cluster.allocation.ClusterRerouteIT.testRerouteExplain" \
  -Dtests.seed=13BCAC0C1240CA7E \
  -Dtests.security.manager=true \
  -Dtests.locale=fi \
  -Dtests.timezone=Canada/Yukon \
  -Dcompiler.java=12 \
  -Druntime.java=8

Given that the problem occurs in the cleanup between tests, though, I'm not sure individual repro commands are that helpful.

@alpar-t
Contributor

alpar-t commented Aug 15, 2019

As a side note, it seems that we should fail if the cleanup fails, instead of carrying on and attempting to run the remaining tests.

@alpar-t
Contributor

alpar-t commented Aug 22, 2019

We looked at this failure on Slack a couple of days ago with @original-brownbear and did not update here; doing so now in case someone finds it useful.

There were some CCR test failures that looked similar (index not found) which came down to caching test tasks when we shouldn't have. CCR tests are usually made up of one test that sets CCR up and another that checks on it, so if part of the pair gets cached the index never replicates. That was fixed in #45659 and is unrelated to this issue, since here we are dealing with tests that run with testclusters, which handles caching correctly.

It also seems that this fails on tests that run against a multi-node cluster.

The following exception shows up after the test when the test framework tries to clean up all indices:

[2019-08-19T12:37:44,585][WARN ][o.e.i.IndicesService     ] [integTest-0] [goodbad/29xCvTM8T9GIAsfnTPWONQ] failed to delete index
org.elasticsearch.env.ShardLockObtainFailedException: [goodbad][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory]
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:860) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:775) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:718) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:670) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStoreIfDeletionAllowed(IndicesService.java:800) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:787) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:691) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:375) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:508) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:505) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

In this case we had two tests that used the same index.
The first one failed, saying the index already exists; the cleanup then failed (silently) again when trying to delete that existing index, yet the next test that used that index name passed.

There is no evidence that the supposedly existing index was ever created. No tests that use that index seem to have run before this one, and the test is connected to the right cluster, with the right nodes, which rules out cross-talk with other tests.
This also seems to happen in bwc tests that do not run in parallel.
checkPart1 seems to fail most often, so running ./gradlew checkPart1 --parallel on a CI worker is likely to reproduce this.

@dnhatn suggested this could be related to #45409 but we seem to have seen failures after that was merged as well.

@original-brownbear original-brownbear self-assigned this Aug 22, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 22, 2019
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates elastic#45605 which might be caused by unacked deletes that were just slow.
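
For context, the idea in that change can be sketched roughly as below. This is a hypothetical helper, not the code from the referenced PR: the class and method names are made up, only the low-level REST client types (Request, Response, RestClient) and EntityUtils are real APIs, and the real test framework's wipe logic is more involved.

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: fail loudly when wiping all indices between REST tests is not acknowledged,
// instead of silently carrying on and letting a later test hit
// resource_already_exists_exception.
public final class WipeIndicesSketch {
    private WipeIndicesSketch() {}

    static void wipeAllIndices(RestClient adminClient) throws Exception {
        Request delete = new Request("DELETE", "/*");
        Response response = adminClient.performRequest(delete);
        String body = EntityUtils.toString(response.getEntity());
        // An unacknowledged delete means the cluster state change may not have been
        // applied everywhere yet - the "unacked deletes that were just slow" suspicion
        // mentioned in the commit message.
        if (body.contains("\"acknowledged\":true") == false) {
            throw new AssertionError("wiping all indices was not acknowledged: " + body);
        }
    }
}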
@williamrandolph
Contributor

On this intake build, OpenCloseIndexIT failed with symptoms that were similar to what @atorok described above: mysterious test failure, then ShardLockObtainFailedException at test cleanup time. https://gradle-enterprise.elastic.co/s/t26bxyvqgurkq/tests/kyv2y2z3r4v7m-rg5rz5gnpfyvq

original-brownbear added a commit that referenced this issue Aug 22, 2019
In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 22, 2019

In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates elastic#45605 which might be caused by unacked deletes that were just slow.
original-brownbear added a commit that referenced this issue Aug 22, 2019
…45842)

In internal test clusters tests we check that wiping all indices was acknowledged
but in REST tests we didn't.
This aligns the behavior in both kinds of tests.
Relates #45605 which might be caused by unacked deletes that were just slow.
@DaveCTurner
Contributor

As noted in #45956 (comment) I do not think that org.elasticsearch.env.ShardLockObtainFailedException: [goodbad][0]: obtaining shard lock timed out after 0ms, previous lock details: [shard creation] trying to lock for [deleting index directory] is necessarily a problem. But looking at the build failure linked in #45605 (comment) (i.e. https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2016/71/consoleText) I see this:

  1> [2019-08-15T04:48:59,949][WARN ][o.e.i.c.IndicesClusterStateService] [node_t0] [[test/y12jwHyUSiS70AhJ_iAD3g]] failed to complete pending deletion for index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [test][0]: thread interrupted while trying to obtain shard lock
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:775) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:684) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:627) ~[main/:?]
  1> 	at org.elasticsearch.indices.IndicesService.processPendingDeletes(IndicesService.java:1077) ~[main/:?]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService$2.doRun(IndicesClusterStateService.java:356) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> Caused by: java.lang.InterruptedException
  1> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) ~[?:1.8.0_221]
  1> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) ~[?:1.8.0_221]
  1> 	at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409) ~[?:1.8.0_221]
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:766) ~[main/:?]
  1> 	... 9 more

Interrupting this wait for the shard lock while the node is closing does seem like it might be problematic.
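
To illustrate that failure mode in isolation (plain Java with hypothetical names, not the NodeEnvironment code): the waiting thread holds no lock yet, is blocked in a timed semaphore acquire, and an interrupt - for example from node close - makes it give up the pending deletion.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal illustration of "thread interrupted while trying to obtain shard lock".
public class InterruptedShardLockDemo {
    public static void main(String[] args) throws Exception {
        Semaphore shardLock = new Semaphore(1);
        shardLock.acquire(); // another actor, e.g. shard creation, holds the lock

        Thread pendingDeletes = new Thread(() -> {
            try {
                // Wait for the lock with a long timeout rather than failing fast,
                // as the pending-deletes path in the trace above does.
                shardLock.tryAcquire(30, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                System.out.println("thread interrupted while trying to obtain shard lock");
            }
        });
        pendingDeletes.start();

        Thread.sleep(100);          // let the deletion thread start waiting
        pendingDeletes.interrupt(); // node close interrupts the waiting thread
        pendingDeletes.join();
    }
}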

@ywelsch ywelsch changed the title [CI] "failed to delete index" due to ShardLockObtainFailedException [CI] resource_already_exists_exception failures Sep 5, 2019
@ywelsch
Contributor

ywelsch commented Sep 5, 2019

I've edited the title here. The ShardLockObtainFailedException and IndexNotFoundException warnings are harmless and also happen on successful runs. I've created #46368 to quieten the ShardLockObtainFailedException down, since these warnings seem to cause confusion (as any warning that we log does).

We still need to figure out what's causing the resource_already_exists_exception.

@droberts195
Contributor Author

Subsequent updates moved to #46091, so closing this one.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020