Add logging around gateway shard allocation #9562

bleskes · 2015-02-04T15:13:51Z

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in WARN. Shard info loading time is logged under DEBUG. Also, we log a WARN message if an exception forces a full checksum check during reading the store metadata.

Examples:

[2015-02-04 17:15:31,632][WARN ][gateway.local            ] [Danger] [index][0]: failed to list shard state on node [9tAoCoTLTIiXfJz88S6Sqg], reason [Failed node [9tAoCoTLTIiXfJz88S6Sqg][[Tiger Shark][inet[/192.168.1.248:9301]][internal:gateway/local/started_shards[n]] request_id [14] timed out after [30003ms]]]
[2015-02-04 17:15:31,632][WARN ][gateway.local            ] [Danger] [index][0]: [1] failures when trying to list shard state on nodes.

[2015-02-04 15:13:42,183][DEBUG][indices.store             ] [Fault Zone] loaded store meta data for [index][0] (took [28ms])

[2015-02-04 15:13:31,358][DEBUG][gateway.local.state.shards] [Edwin Jarvis] [index][0] shard state info found: [version [18], primary [true]]

bleskes · 2015-02-04T15:17:44Z

src/main/java/org/elasticsearch/gateway/local/LocalGatewayAllocator.java

@@ -79,7 +81,7 @@ public LocalGatewayAllocator(Settings settings,
        this.listGatewayStartedShards = listGatewayStartedShards;
        this.listShardStoreMetaData = listShardStoreMetaData;

-        this.listTimeout = componentSettings.getAsTime("list_timeout", TimeValue.timeValueSeconds(30));
+        this.listTimeout = componentSettings.getAsTime("list_timeout", TimeValue.timeValueSeconds(10));


Grr. fixing.

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

kimchy · 2015-02-04T15:20:47Z

src/main/java/org/elasticsearch/action/support/nodes/TransportNodesOperationAction.java

@@ -203,7 +203,7 @@ private void onFailure(int idx, String nodeId, Throwable t) {
                logger.debug("failed to execute on node [{}]", t, nodeId);
            }
            if (accumulateExceptions()) {
-                responses.set(idx, new FailedNodeException(nodeId, "Failed node [" + nodeId + "]", t));
+                responses.set(idx, new FailedNodeException(nodeId, "Failed node [" + nodeId + "][" + t.getMessage() + "]", t));


Should we use ExceptionsHelper#detailedMessage to now miss when this get wrapped in inner exceptions?

My goal here was to keep it brief. We already get a full trace in the log message above.

I Am concerned that we will miss the actual message because its wrapped, my vote is the detailed message one

If that's the case we change the log message on logListActionFailures to log the entire exception. It's the difference between detailedMessage and just a message. But putting the detail in the message the choice is gone. Do agree with solving this on the logging side?

yea, in that case, let's not add the addition message, and solve it on the logging side m

bleskes · 2015-02-04T16:17:44Z

@kimchy pushed another commit

s1monw · 2015-02-04T16:35:20Z

src/main/java/org/elasticsearch/gateway/local/LocalGatewayAllocator.java

@@ -421,6 +413,25 @@ public boolean apply(DiscoveryNode node) {
        return shardStates;
    }

+    private void logListActionFailures(MutableShardRouting shard, String actionType, FailedNodeException[] failures) {
+        if (failures == null || failures.length == 0) {


can we require this to be non-null and just iterate?

I'll double check. Tried to be defensive here.

It's OK. removed the check.

s1monw · 2015-02-04T16:37:27Z

left some minor things otherwise looks good to me

kimchy · 2015-02-04T16:39:52Z

src/main/java/org/elasticsearch/gateway/local/LocalGatewayAllocator.java

+        }
+        if (totalWarnFailures > 0) {
+            logger.warn("{}: [{}] failures when trying to list shard {} on nodes.", shard.shardId(), totalWarnFailures, actionType);
+        }


I think this total message is not needed, since the previous ones are in WARN, so we have an indication already, we just repeat it...

kimchy · 2015-02-04T16:42:29Z

left more comments

bleskes · 2015-02-04T17:35:54Z

@s1monw @kimchy pushed another commit.

@kimchy I updated the code based on my suggestion. An error now looks as follows. Let me know if it doesn't work/need to be extended:

[2015-02-04 18:32:37,030][DEBUG][indices.store            ] [Thin Man] failed to execute on node [KH0QQnZhQg-aRgQLNjgjMQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
[2015-02-04 18:32:37,031][WARN ][gateway.local            ] [Thin Man] [index][0]: failed to list shard stores on node [KH0QQnZhQg-aRgQLNjgjMQ]
org.elasticsearch.action.FailedNodeException: Failed node [KH0QQnZhQg-aRgQLNjgjMQ][[Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]
    ... 4 more

bleskes · 2015-02-05T08:29:54Z

@s1monw the last commit I pushed add special handling in the store for IndexNotFoundException (https://github.com/elasticsearch/elasticsearch/pull/9562/files#diff-cdc68fb92150d348c209021260e9c50dR626 ) . It happens when we start a replica on an empty folder (the exception is caught and ignored higher up). Can you double check me on that?

s1monw · 2015-02-05T13:47:15Z

The IndexNotFoundException is caught on a higher level since depending on the method you cal you care / don't care if it's there ie. #getMetadataOrEmpty vs. #getMetadata I think we should not catch it on that level to be honest. Maybe we should just add the logging on the higher level?

bleskes · 2015-02-05T13:55:09Z

@s1monw I don't catch it, but rethrow it without doing Lucene.checkSegmentInfoIntegrity(directory);

s1monw · 2015-02-05T15:17:09Z

you are right! sorry LGTM

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata Closes #9562

kimchy · 2015-02-06T17:33:43Z

src/main/java/org/elasticsearch/indices/store/TransportNodesListShardStoreMetaData.java

+        } finally {
+            TimeValue took = new TimeValue(System.currentTimeMillis() - startTime);
+            if (exists) {
+                logger.debug("loaded store meta data for {} (took [{}])", shardId, took);


in this log message, and the next one, we are not consistent with our logging messages, when we log something in the context of a shard, its the first thing in the log line, can we please fix it and do: logger.debug("{} loaded store meta data for, took [{}]", shardId, took);

kimchy · 2015-02-06T17:34:17Z

@bleskes sorry for the late comment, but it would be great if we can be consistent with the way we log

bleskes · 2015-02-06T17:42:13Z

@kimchy no need to be sorry. It's a valid point.

kimchy · 2015-02-06T17:49:04Z

src/main/java/org/elasticsearch/indices/store/TransportNodesListShardStoreMetaData.java

-                    return new StoreFilesMetaData(true, shardId, store.getMetadataOrEmpty().asMap());
-                } finally {
-                    store.decRef();
+        logger.trace("listing store meta data for {}", shardId);


here as well :)

Closes #9603

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata Closes #9562

Closes #9603

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata Closes elastic#9562

Closes elastic#9603

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata Closes elastic#9562

Closes elastic#9603

bleskes added :Core/Infra/Logging Log management and logging utilities v1.4.3 v1.5.0 v2.0.0-beta1 labels Feb 4, 2015

bleskes reviewed Feb 4, 2015
View reviewed changes

bleskes force-pushed the logging_store_listing_on_allocation branch from a5c6771 to 4673de9 Compare February 4, 2015 15:19

kimchy reviewed Feb 4, 2015
View reviewed changes

feedback

0f7be8a

feedback 2

35b28de

s1monw reviewed Feb 4, 2015
View reviewed changes

kimchy reviewed Feb 4, 2015
View reviewed changes

feedback 3

56aedf6

last tweak + not logging on IndexNotFoundException

6e93ed2

bleskes closed this in 9362ba2 Feb 5, 2015

bleskes deleted the logging_store_listing_on_allocation branch February 5, 2015 17:07

kimchy reviewed Feb 6, 2015
View reviewed changes

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Feb 6, 2015

Logging: improve logging messages added in elastic#9562 & elastic#9562

a126d00

bleskes added a commit that referenced this pull request Feb 6, 2015

Logging: improve logging messages added in #9562 & #9562

332dc79

Closes #9603

bleskes added a commit that referenced this pull request Feb 6, 2015

Logging: improve logging messages added in #9562 & #9562

a27b681

Closes #9603

bleskes added a commit that referenced this pull request Feb 6, 2015

Logging: improve logging messages added in #9562 & #9562

e684d7f

Closes #9603

bleskes added a commit that referenced this pull request Feb 9, 2015

Logging: improve logging messages added in #9562 & #9562

28072c5

Closes #9603

clintongormley added the >enhancement label Feb 10, 2015

clintongormley changed the title ~~Gateway: add logging around gateway shard allocation~~ Logging: add logging around gateway shard allocation Feb 11, 2015

bleskes mentioned this pull request Jun 7, 2015

Logging: improve logging messages added in #9562 #9603

Closed

clintongormley changed the title ~~Logging: add logging around gateway shard allocation~~ Add logging around gateway shard allocation Jun 7, 2015

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Logging: improve logging messages added in elastic#9562 & elastic#9562

ebe03be

Closes elastic#9603

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Logging: improve logging messages added in elastic#9562 & elastic#9562

2a39af3

Closes elastic#9603

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add logging around gateway shard allocation #9562

Add logging around gateway shard allocation #9562

bleskes commented Feb 4, 2015

bleskes Feb 4, 2015

kimchy Feb 4, 2015

bleskes Feb 4, 2015

kimchy Feb 4, 2015

bleskes Feb 4, 2015

kimchy Feb 4, 2015

bleskes commented Feb 4, 2015

s1monw Feb 4, 2015

bleskes Feb 4, 2015

bleskes Feb 4, 2015

s1monw commented Feb 4, 2015

kimchy Feb 4, 2015

bleskes Feb 4, 2015

kimchy commented Feb 4, 2015

bleskes commented Feb 4, 2015

bleskes commented Feb 5, 2015

s1monw commented Feb 5, 2015

bleskes commented Feb 5, 2015

s1monw commented Feb 5, 2015

kimchy Feb 6, 2015

kimchy commented Feb 6, 2015

bleskes commented Feb 6, 2015

kimchy Feb 6, 2015

Add logging around gateway shard allocation #9562

Add logging around gateway shard allocation #9562

Conversation

bleskes commented Feb 4, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bleskes commented Feb 4, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

s1monw commented Feb 4, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kimchy commented Feb 4, 2015

bleskes commented Feb 4, 2015

bleskes commented Feb 5, 2015

s1monw commented Feb 5, 2015

bleskes commented Feb 5, 2015

s1monw commented Feb 5, 2015

Choose a reason for hiding this comment

kimchy commented Feb 6, 2015

bleskes commented Feb 6, 2015

Choose a reason for hiding this comment