Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index.merge.scheduler.max_thread_count is not validated correctly, can break indexing #20380

Closed
rayward opened this issue Sep 8, 2016 · 2 comments

Comments

@rayward
Copy link

rayward commented Sep 8, 2016

Elasticsearch version: 2.3.4 / 2.4.0

Plugins installed: kopf

JVM version:

version: "1.8.0_102",
vm_name: "OpenJDK 64-Bit Server VM",
vm_version: "25.102-b14"",
vm_vendor: "Oracle Corporation",

OS version:
Debian 8.5
Linux 8e884c0e66a2 4.4.15-moby # 1 SMP Thu Jul 28 22:03:07 UTC 2016 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

ES arbitrarily accepts any integer for index.merge.scheduler.max_thread_count even if it exceeds internal limits which then causes various scheduled tasks to throw exceptions and eventually break indexing.

Steps to reproduce:

We experienced this on our production cluster but I was able to reproduce it using a couple of docker containers.

  1. Create an index:

    POST /test_index
    {
    "number_of_shards": 4,
    "number_of_replicas": 1
    }
    
    {
    "acknowledged": true
    }
    
  2. Update the max thread count:

    PUT /test_index/_settings
    {
    "index.merge.scheduler.max_thread_count": 12
    }
    
    {
    "acknowledged": true
    }
    
  3. Alternate creating documents ...

    POST /test_index/test_type
    {
    "name": "bob"
    }
    
    {
    "_index": "test_index",
    "_type": "test_type",
    "_id": "AVcH80xPYp3inMh6kFUL",
    "_version": 1,
    "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
    },
    "created": true
    }
    

    And optimizing the index (to force a flush):

    GET /test_index/_optimize
    
  4. This will shortly lead to translog failures:

    POST /test_index/test_type
    {
    "name": "bob"
    }
    
    {
    "error": {
      "root_cause": [
        {
          "type": "create_failed_engine_exception",
          "reason": "Create failed for [test_type#AVcH7buoYp3inMh6kFT4]",
          "shard": "3",
          "index": "test_index"
        }
      ],
      "type": "create_failed_engine_exception",
      "reason": "Create failed for [test_type#AVcH7buoYp3inMh6kFT4]",
      "shard": "3",
      "index": "test_index",
      "caused_by": {
        "type": "already_closed_exception",
        "reason": "translog is already closed"
      }
    },
    "status": 500
    }
    
    GET /test_index/_optimize
    {
    "_shards": {
      "total": 8,
      "successful": 2,
      "failed": 6,
      "failures": [
        {
          "shard": 3,
          "index": "test_index",
          "status": "INTERNAL_SERVER_ERROR",
          "reason": {
            "type": "flush_failed_engine_exception",
            "reason": "Flush failed",
            "shard": "3",
            "index": "test_index",
            "caused_by": {
              "type": "already_closed_exception",
              "reason": "translog is already closed"
            }
          }
        }
      ]
    }
    }
    

Provide logs (if relevant):

The invalid max_thread_count causes all sorts of exceptions:

Updating the index settings:

[2016-09-08 03:50:48,676][INFO ][index.shard              ] [Abe Brown] [test_index][3] updating [index.merge.scheduler.max_thread_count] from [7] to [12]
[2016-09-08 03:50:48,677][WARN ][index.settings           ] [Abe Brown] [test_index] failed to refresh settings for [org.elasticsearch.index.shard.IndexShard$ApplyRefreshSettings@27098528]
java.lang.IllegalArgumentException: maxThreadCount should be <= maxMergeCount (= 7)
    at org.apache.lucene.index.ConcurrentMergeScheduler.setMaxMergesAndThreads(ConcurrentMergeScheduler.java:154)
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.refreshConfig(ElasticsearchConcurrentMergeScheduler.java:176)
    at org.elasticsearch.index.engine.InternalEngine.onSettingsChanged(InternalEngine.java:1268)
    at org.elasticsearch.index.shard.IndexShard$ApplyRefreshSettings.onRefreshSettings(IndexShard.java:1325)
    at org.elasticsearch.index.settings.IndexSettingsService.refreshSettings(IndexSettingsService.java:54)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applySettings(IndicesClusterStateService.java:341)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:180)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Various errors seen while indexing and optimizing:

[2016-09-08 03:58:16,128][WARN ][index.shard              ] [Abe Brown] [test_index][3] Failed to perform scheduled engine refresh
java.lang.IllegalArgumentException: maxThreadCount should be <= maxMergeCount (= 7)
    at org.apache.lucene.index.ConcurrentMergeScheduler.setMaxMergesAndThreads(ConcurrentMergeScheduler.java:154)
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.refreshConfig(ElasticsearchConcurrentMergeScheduler.java:176)
    at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:685)
    at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:661)
    at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1343)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-09-08 03:58:36,218][WARN ][indices.memory           ] [Abe Brown] failed to set shard [test_index][1] index buffer to [24.7mb]
org.apache.lucene.store.AlreadyClosedException: translog [2] is already closed
    at org.elasticsearch.index.translog.TranslogWriter.ensureOpen(TranslogWriter.java:329)
    at org.elasticsearch.index.translog.BufferingTranslogWriter.updateBufferSize(BufferingTranslogWriter.java:143)
    at org.elasticsearch.index.translog.Translog.updateBuffer(Translog.java:406)
    at org.elasticsearch.index.shard.IndexShard.updateBufferSize(IndexShard.java:1164)
    at org.elasticsearch.indices.memory.IndexingMemoryController.updateShardBuffers(IndexingMemoryController.java:232)
    at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:286)
    at org.elasticsearch.threadpool.ThreadPool$ReschedulingRunnable.doRun(ThreadPool.java:1057)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:668)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-09-08 03:53:05,512][DEBUG][action.admin.indices.forcemerge] [Abe Brown] [indices:admin/optimize] failed to execute operation for shard [[test_index][1], node[6jC3WEUrQWifFvTn-twoBw], [P], v[3], s[STARTED], a[id=Fa-2IEiESai0wk2csElq0Q]]
[test_index][[test_index][1]] FlushFailedEngineException[Flush failed]; nested: IllegalStateException[already committing a translog with generation: 1];
    at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:762)
    at org.elasticsearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:863)
    at org.elasticsearch.index.shard.IndexShard.forceMerge(IndexShard.java:793)
    at org.elasticsearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:79)
    at org.elasticsearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:46)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:436)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:415)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:402)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: already committing a translog with generation: 1
    at org.elasticsearch.index.translog.Translog.prepareCommit(Translog.java:1785)
    at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:753)
    ... 14 more
[2016-09-08 03:59:27,335][WARN ][rest.suppressed          ] path: //test_index/test_type, params: {index=test_index, type=test_type}
RemoteTransportException[[Abe Brown][172.19.0.2:9300][indices:data/write/index[p]]]; nested: CreateFailedEngineException[Create failed for [test_type#AVcH80pjYp3inMh6kFUK]]; nested: AlreadyClosedException[translog is already closed];
Caused by: [test_index][[test_index][1]] CreateFailedEngineException[Create failed for [test_type#AVcH80pjYp3inMh6kFUK]]; nested: AlreadyClosedException[translog is already closed];
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:351)
    at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:545)
    at org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:236)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:157)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:648)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1888)
    at org.elasticsearch.index.translog.Translog.add(Translog.java:544)
    at org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:437)
    at org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:375)
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:346)
    ... 15 more

Interestingly, closing and reopening the index will restore things to working order with the configured thread count.

Returning the max_thread_count to 1 shows that it has increased the thread count above what was previously set:

PUT /test_index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
[2016-09-08 04:14:22,483][INFO ][index.shard              ] [Abe Brown] [test_index][1] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][2] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][3] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
[2016-09-08 04:14:22,484][INFO ][index.shard              ] [Abe Brown] [test_index][0] updating [index.merge.scheduler.max_thread_count] from [17] to [1]
@jimczi
Copy link
Contributor

jimczi commented Sep 8, 2016

Thanks for reporting @rayward
I opened #20383 which fails the update (and the index creation) if the merge settings are not valid

jimczi added a commit that referenced this issue Sep 8, 2016
Manual backport of #20383
Since settings are not validate in 2.x, this change catch the exception if the merge settings are not valid and do not apply them in the current CMS.
The downside is that we'll try to apply the new settings if the index is reopened or if the node is restarted, this will fail and the only solution for the user is to re-update the merge settings back to valid values.

Relates #20380
@rayward
Copy link
Author

rayward commented Sep 8, 2016

Thanks for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants