index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

rayward · 2016-09-08T04:19:24Z

Elasticsearch version: 2.3.4 / 2.4.0

Plugins installed: kopf

JVM version:

version: "1.8.0_102",
vm_name: "OpenJDK 64-Bit Server VM",
vm_version: "25.102-b14"",
vm_vendor: "Oracle Corporation",

OS version:
Debian 8.5
Linux 8e884c0e66a2 4.4.15-moby # 1 SMP Thu Jul 28 22:03:07 UTC 2016 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

ES arbitrarily accepts any integer for index.merge.scheduler.max_thread_count even if it exceeds internal limits which then causes various scheduled tasks to throw exceptions and eventually break indexing.

Steps to reproduce:

We experienced this on our production cluster but I was able to reproduce it using a couple of docker containers.

Create an index:

POST /test_index
{
"number_of_shards": 4,
"number_of_replicas": 1
}

{
"acknowledged": true
}

Update the max thread count:

PUT /test_index/_settings
{
"index.merge.scheduler.max_thread_count": 12
}

{
"acknowledged": true
}

Alternate creating documents ...

POST /test_index/test_type
{
"name": "bob"
}

{
"_index": "test_index",
"_type": "test_type",
"_id": "AVcH80xPYp3inMh6kFUL",
"_version": 1,
"_shards": {
  "total": 2,
  "successful": 2,
  "failed": 0
},
"created": true
}

And optimizing the index (to force a flush):

GET /test_index/_optimize

This will shortly lead to translog failures:

POST /test_index/test_type
{
"name": "bob"
}

{
"error": {
  "root_cause": [
    {
      "type": "create_failed_engine_exception",
      "reason": "Create failed for [test_type#AVcH7buoYp3inMh6kFT4]",
      "shard": "3",
      "index": "test_index"
    }
  ],
  "type": "create_failed_engine_exception",
  "reason": "Create failed for [test_type#AVcH7buoYp3inMh6kFT4]",
  "shard": "3",
  "index": "test_index",
  "caused_by": {
    "type": "already_closed_exception",
    "reason": "translog is already closed"
  }
},
"status": 500
}

GET /test_index/_optimize
{
"_shards": {
  "total": 8,
  "successful": 2,
  "failed": 6,
  "failures": [
    {
      "shard": 3,
      "index": "test_index",
      "status": "INTERNAL_SERVER_ERROR",
      "reason": {
        "type": "flush_failed_engine_exception",
        "reason": "Flush failed",
        "shard": "3",
        "index": "test_index",
        "caused_by": {
          "type": "already_closed_exception",
          "reason": "translog is already closed"
        }
      }
    }
  ]
}
}

Provide logs (if relevant):

The invalid max_thread_count causes all sorts of exceptions:

Updating the index settings:

[2016-09-08 03:50:48,676][INFO ][index.shard              ] [Abe Brown] [test_index][3] updating [index.merge.scheduler.max_thread_count] from [7] to [12]
[2016-09-08 03:50:48,677][WARN ][index.settings           ] [Abe Brown] [test_index] failed to refresh settings for [org.elasticsearch.index.shard.IndexShard$ApplyRefreshSettings@27098528]
java.lang.IllegalArgumentException: maxThreadCount should be <= maxMergeCount (= 7)
    at org.apache.lucene.index.ConcurrentMergeScheduler.setMaxMergesAndThreads(ConcurrentMergeScheduler.java:154)
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.refreshConfig(ElasticsearchConcurrentMergeScheduler.java:176)
    at org.elasticsearch.index.engine.InternalEngine.onSettingsChanged(InternalEngine.java:1268)
    at org.elasticsearch.index.shard.IndexShard$ApplyRefreshSettings.onRefreshSettings(IndexShard.java:1325)
    at org.elasticsearch.index.settings.IndexSettingsService.refreshSettings(IndexSettingsService.java:54)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applySettings(IndicesClusterStateService.java:341)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:180)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Various errors seen while indexing and optimizing:

[2016-09-08 03:58:16,128][WARN ][index.shard              ] [Abe Brown] [test_index][3] Failed to perform scheduled engine refresh
java.lang.IllegalArgumentException: maxThreadCount should be <= maxMergeCount (= 7)
    at org.apache.lucene.index.ConcurrentMergeScheduler.setMaxMergesAndThreads(ConcurrentMergeScheduler.java:154)
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.refreshConfig(ElasticsearchConcurrentMergeScheduler.java:176)
    at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:685)
    at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:661)
    at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1343)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-09-08 03:58:36,218][WARN ][indices.memory           ] [Abe Brown] failed to set shard [test_index][1] index buffer to [24.7mb]
org.apache.lucene.store.AlreadyClosedException: translog [2] is already closed
    at org.elasticsearch.index.translog.TranslogWriter.ensureOpen(TranslogWriter.java:329)
    at org.elasticsearch.index.translog.BufferingTranslogWriter.updateBufferSize(BufferingTranslogWriter.java:143)
    at org.elasticsearch.index.translog.Translog.updateBuffer(Translog.java:406)
    at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1164)
    at org.elasticsearch.indices.memory.IndexingMemoryController.updateShardBuffers(IndexingMemoryController.java:232)
    at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:286)
    at org.elasticsearch.threadpool.ThreadPool$ReschedulingRunnable.doRun(ThreadPool.java:1057)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:668)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[2016-09-08 03:53:05,512][DEBUG][action.admin.indices.forcemerge] [Abe Brown] [indices:admin/optimize] failed to execute operation for shard [[test_index][1], node[6jC3WEUrQWifFvTn-twoBw], [P], v[3], s[STARTED], a[id=Fa-2IEiESai0wk2csElq0Q]]
[test_index][[test_index][1]] FlushFailedEngineException[Flush failed]; nested: IllegalStateException[already committing a translog with generation: 1];
    at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:762)
    at org.elasticsearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:863)
    at org.elasticsearch.index.shard.IndexShard.forceMerge(IndexShard.java:793)
    at org.elasticsearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:79)
    at org.elasticsearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:46)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:436)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:415)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:402)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: already committing a translog with generation: 1
    at org.elasticsearch.index.translog.Translog.prepareCommit(Translog.java:1785)
    at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:753)
    ... 14 more

[2016-09-08 03:59:27,335][WARN ][rest.suppressed          ] path: //test_index/test_type, params: {index=test_index, type=test_type}
RemoteTransportException[[Abe Brown][172.19.0.2:9300][indices:data/write/index[p]]]; nested: CreateFailedEngineException[Create failed for [test_type#AVcH80pjYp3inMh6kFUK]]; nested: AlreadyClosedException[translog is already closed];
Caused by: [test_index][[test_index][1]] CreateFailedEngineException[Create failed for [test_type#AVcH80pjYp3inMh6kFUK]]; nested: AlreadyClosedException[translog is already closed];
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:351)
    at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:545)
    at org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:236)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:157)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:648)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1888)
    at org.elasticsearch.index.translog.Translog.add(Translog.java:544)
    at org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:437)
    at org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:375)
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:346)
    ... 15 more

Interestingly, closing and reopening the index will restore things to working order with the configured thread count.

Returning the max_thread_count to 1 shows that it has increased the thread count above what was previously set:

PUT /test_index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}

[2016-09-08 04:14:22,483][INFO ][index.shard              ] [Abe Brown] [test_index][1] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][2] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][3] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][0] updating [index.merge.scheduler.max_thread_count] from [17] to [1]

The text was updated successfully, but these errors were encountered:

jimczi · 2016-09-08T08:55:03Z

Thanks for reporting @rayward
I opened #20383 which fails the update (and the index creation) if the merge settings are not valid

Manual backport of #20383 Since settings are not validate in 2.x, this change catch the exception if the merge settings are not valid and do not apply them in the current CMS. The downside is that we'll try to apply the new settings if the index is reopened or if the node is restarted, this will fail and the only solution for the user is to re-update the merge settings back to valid values. Relates #20380

rayward · 2016-09-08T23:15:52Z

Thanks for the quick fix!

jimczi mentioned this issue Sep 8, 2016

Validate max thread/merge settings #20383

Merged

jasontedor mentioned this issue Sep 8, 2016

Need to validate index.merge.scheduler.max_thread_count input? #19444

Closed

jimczi closed this as completed in #20383 Sep 8, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

rayward commented Sep 8, 2016 •

edited

jimczi commented Sep 8, 2016

rayward commented Sep 8, 2016

index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

Comments

rayward commented Sep 8, 2016 • edited

jimczi commented Sep 8, 2016

rayward commented Sep 8, 2016

rayward commented Sep 8, 2016 •

edited