
Translog size not effectively limited by index.translog.flush_threshold_size? #15814

Closed
aquarapid opened this issue Jan 7, 2016 · 4 comments

@aquarapid

I have a logstash index in my ES 2.1.1 cluster (2 nodes):

{
  "logstash-2016.01.07" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "30s",
        "number_of_shards" : "8",
        "translog" : {
          "flush_threshold_ops" : "100000",
          "flush_threshold_size" : "1gb",
          "sync_interval" : "5s",
          "durability" : "async"
        },
        "query" : {
          "default_field" : "body"
        },
        "creation_date" : "1452124822948",
        "store" : {
          "compress" : {
            "stored" : "true"
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "XXXXX",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
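
For reference, the settings above can be retrieved with the index settings API; assuming the default HTTP port, something like:

curl -s 'localhost:9200/logstash-2016.01.07/_settings?pretty'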

So, the translogs should be limited in size. However:

On one node:

# du -sh /disk*/logstash/nodes/0/indices/logstash-2016.01.07/*/translog/*.tlog
3.1G    /disk10/logstash/nodes/0/indices/logstash-2016.01.07/2/translog/translog-1.tlog
40M /disk11/logstash/nodes/0/indices/logstash-2016.01.07/1/translog/translog-35.tlog
44M /disk12/logstash/nodes/0/indices/logstash-2016.01.07/3/translog/translog-35.tlog
44M /disk1/logstash/nodes/0/indices/logstash-2016.01.07/0/translog/translog-35.tlog
3.0G    /disk2/logstash/nodes/0/indices/logstash-2016.01.07/7/translog/translog-1.tlog
3.0G    /disk3/logstash/nodes/0/indices/logstash-2016.01.07/4/translog/translog-1.tlog
39M /disk4/logstash/nodes/0/indices/logstash-2016.01.07/6/translog/translog-35.tlog
50M /disk5/logstash/nodes/0/indices/logstash-2016.01.07/5/translog/translog-35.tlog

and the other:

# du -sh /disk*/logstash/nodes/0/indices/logstash-2016.01.07/*/translog/*.tlog
2.7G    /disk11/logstash/nodes/0/indices/logstash-2016.01.07/6/translog/translog-1.tlog
3.1G    /disk12/logstash/nodes/0/indices/logstash-2016.01.07/0/translog/translog-1.tlog
30M /disk1/logstash/nodes/0/indices/logstash-2016.01.07/7/translog/translog-35.tlog
38M /disk2/logstash/nodes/0/indices/logstash-2016.01.07/4/translog/translog-35.tlog
2.9G    /disk3/logstash/nodes/0/indices/logstash-2016.01.07/5/translog/translog-1.tlog
45M /disk6/logstash/nodes/0/indices/logstash-2016.01.07/2/translog/translog-35.tlog
3.0G    /disk7/logstash/nodes/0/indices/logstash-2016.01.07/3/translog/translog-1.tlog
3.0G    /disk9/logstash/nodes/0/indices/logstash-2016.01.07/1/translog/translog-1.tlog

So it seems the translog size is not being limited to the 1GB set in flush_threshold_size? Or do I misunderstand something here?

@s1monw
Contributor

s1monw commented Jan 7, 2016

hey @aquarapid, I am curious, something doesn't look right here. Can I get the output of localhost:9200/_cat/shards by any chance? And can you also paste localhost:9200/_stats/merge,refresh,store,translog,indexing?
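
Something along these lines should do it (adjust the host/port to your setup):

curl -s 'localhost:9200/_cat/shards?v'
curl -s 'localhost:9200/_stats/merge,refresh,store,translog,indexing?pretty'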

@bleskes
Contributor

bleskes commented Jan 7, 2016

also, do you see this in the logs anywhere?

logger.warn("failed to flush shard on translog threshold", e);
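
Something like this should find it, assuming logs are in the default location (adjust the path to your install):

grep -r "failed to flush shard on translog threshold" /var/log/elasticsearch/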

bleskes added a commit to bleskes/elasticsearch that referenced this issue Jan 7, 2016
… slow recovery

elastic#10624 decoupled translog flush from ongoing recoveries. In the process, the translog creation was delayed to the moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, which is in charge of translog-based flushes, starts a background checker as soon as the shard is allocated. That checker performs its first check after 5s, expecting the translog to be there. However, if the file-copying phase of the recovery takes more than 5s (likely!) or local recovery is slow, the check runs into an exception and never recovers. The end result is that the translog-based flush is completely disabled.

Note that this is mitigated by shard inactivity, which triggers a synced flush after 5m of no indexing.

Closes elastic#15814
bleskes added a commit that referenced this issue Jan 7, 2016
… slow recovery

Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing.

Closes #15814
Closes #15830
@aquarapid
Author

I did not see the translog flush failure message in the logs.

I do have an ongoing recovery of shards, which I pause during the day so it does not interfere with indexing. The cluster contains 10TB or so of data (plus replication). At the moment there are 36 unassigned shards remaining.

shards and stats attached (slightly obfuscated):

stats.txt
shards.txt

@bleskes
Contributor

bleskes commented Jan 7, 2016

thanks @aquarapid. We spent some time on this and believe we have found the issue (see #15830). That PR explains the issue you are seeing if you don't have short periods of 5m without indexing. I'm going to close this for now; please let me know if that's not the case.

Thanks so much for reporting this. It led us to find an important bug. You can use the _flush API to force the translog to be trimmed.
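
For example, assuming the default HTTP port:

curl -XPOST 'localhost:9200/logstash-2016.01.07/_flush'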

@bleskes bleskes closed this as completed Jan 7, 2016