Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logging around gateway shard allocation #9562

Closed

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Feb 4, 2015

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in WARN. Shard info loading time is logged under DEBUG. Also, we log a WARN message if an exception forces a full checksum check during reading the store metadata.

Examples:

[2015-02-04 17:15:31,632][WARN ][gateway.local            ] [Danger] [index][0]: failed to list shard state on node [9tAoCoTLTIiXfJz88S6Sqg], reason [Failed node [9tAoCoTLTIiXfJz88S6Sqg][[Tiger Shark][inet[/192.168.1.248:9301]][internal:gateway/local/started_shards[n]] request_id [14] timed out after [30003ms]]]
[2015-02-04 17:15:31,632][WARN ][gateway.local            ] [Danger] [index][0]: [1] failures when trying to list shard state on nodes.
[2015-02-04 15:13:42,183][DEBUG][indices.store             ] [Fault Zone] loaded store meta data for [index][0] (took [28ms])
[2015-02-04 15:13:31,358][DEBUG][gateway.local.state.shards] [Edwin Jarvis] [index][0] shard state info found: [version [18], primary [true]]

@bleskes bleskes added :Core/Infra/Logging Log management and logging utilities v1.4.3 v1.5.0 v2.0.0-beta1 labels Feb 4, 2015
@@ -79,7 +81,7 @@ public LocalGatewayAllocator(Settings settings,
this.listGatewayStartedShards = listGatewayStartedShards;
this.listShardStoreMetaData = listShardStoreMetaData;

this.listTimeout = componentSettings.getAsTime("list_timeout", TimeValue.timeValueSeconds(30));
this.listTimeout = componentSettings.getAsTime("list_timeout", TimeValue.timeValueSeconds(10));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grr. fixing.

This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata
@bleskes bleskes force-pushed the logging_store_listing_on_allocation branch from a5c6771 to 4673de9 Compare February 4, 2015 15:19
@@ -203,7 +203,7 @@ private void onFailure(int idx, String nodeId, Throwable t) {
logger.debug("failed to execute on node [{}]", t, nodeId);
}
if (accumulateExceptions()) {
responses.set(idx, new FailedNodeException(nodeId, "Failed node [" + nodeId + "]", t));
responses.set(idx, new FailedNodeException(nodeId, "Failed node [" + nodeId + "][" + t.getMessage() + "]", t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use ExceptionsHelper#detailedMessage to now miss when this get wrapped in inner exceptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal here was to keep it brief. We already get a full trace in the log message above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Am concerned that we will miss the actual message because its wrapped, my vote is the detailed message one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case we change the log message on logListActionFailures to log the entire exception. It's the difference between detailedMessage and just a message. But putting the detail in the message the choice is gone. Do agree with solving this on the logging side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, in that case, let's not add the addition message, and solve it on the logging side m

@bleskes
Copy link
Contributor Author

bleskes commented Feb 4, 2015

@kimchy pushed another commit

@@ -421,6 +413,25 @@ public boolean apply(DiscoveryNode node) {
return shardStates;
}

private void logListActionFailures(MutableShardRouting shard, String actionType, FailedNodeException[] failures) {
if (failures == null || failures.length == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we require this to be non-null and just iterate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check. Tried to be defensive here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK. removed the check.

@s1monw
Copy link
Contributor

s1monw commented Feb 4, 2015

left some minor things otherwise looks good to me

}
if (totalWarnFailures > 0) {
logger.warn("{}: [{}] failures when trying to list shard {} on nodes.", shard.shardId(), totalWarnFailures, actionType);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this total message is not needed, since the previous ones are in WARN, so we have an indication already, we just repeat it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@kimchy
Copy link
Member

kimchy commented Feb 4, 2015

left more comments

@bleskes
Copy link
Contributor Author

bleskes commented Feb 4, 2015

@s1monw @kimchy pushed another commit.

@kimchy I updated the code based on my suggestion. An error now looks as follows. Let me know if it doesn't work/need to be extended:

[2015-02-04 18:32:37,030][DEBUG][indices.store            ] [Thin Man] failed to execute on node [KH0QQnZhQg-aRgQLNjgjMQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
[2015-02-04 18:32:37,031][WARN ][gateway.local            ] [Thin Man] [index][0]: failed to list shard stores on node [KH0QQnZhQg-aRgQLNjgjMQ]
org.elasticsearch.action.FailedNodeException: Failed node [KH0QQnZhQg-aRgQLNjgjMQ][[Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [Savage Steel][inet[/192.168.1.248:9301]][internal:cluster/nodes/indices/shard/store[n]] request_id [21] timed out after [30004ms]
    ... 4 more

@bleskes
Copy link
Contributor Author

bleskes commented Feb 5, 2015

@s1monw the last commit I pushed add special handling in the store for IndexNotFoundException (https://github.com/elasticsearch/elasticsearch/pull/9562/files#diff-cdc68fb92150d348c209021260e9c50dR626 ) . It happens when we start a replica on an empty folder (the exception is caught and ignored higher up). Can you double check me on that?

@s1monw
Copy link
Contributor

s1monw commented Feb 5, 2015

The IndexNotFoundException is caught on a higher level since depending on the method you cal you care / don't care if it's there ie. #getMetadataOrEmpty vs. #getMetadata I think we should not catch it on that level to be honest. Maybe we should just add the logging on the higher level?

@bleskes
Copy link
Contributor Author

bleskes commented Feb 5, 2015

@s1monw I don't catch it, but rethrow it without doing Lucene.checkSegmentInfoIntegrity(directory);

@s1monw
Copy link
Contributor

s1monw commented Feb 5, 2015

you are right! sorry LGTM

bleskes added a commit that referenced this pull request Feb 5, 2015
This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

Closes #9562
bleskes added a commit that referenced this pull request Feb 5, 2015
This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

Closes #9562
@bleskes bleskes closed this in 9362ba2 Feb 5, 2015
@bleskes bleskes deleted the logging_store_listing_on_allocation branch February 5, 2015 17:07
} finally {
TimeValue took = new TimeValue(System.currentTimeMillis() - startTime);
if (exists) {
logger.debug("loaded store meta data for {} (took [{}])", shardId, took);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this log message, and the next one, we are not consistent with our logging messages, when we log something in the context of a shard, its the first thing in the log line, can we please fix it and do: logger.debug("{} loaded store meta data for, took [{}]", shardId, took);

@kimchy
Copy link
Member

kimchy commented Feb 6, 2015

@bleskes sorry for the late comment, but it would be great if we can be consistent with the way we log

@bleskes
Copy link
Contributor Author

bleskes commented Feb 6, 2015

@kimchy no need to be sorry. It's a valid point.

return new StoreFilesMetaData(true, shardId, store.getMetadataOrEmpty().asMap());
} finally {
store.decRef();
logger.trace("listing store meta data for {}", shardId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as well :)

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Feb 6, 2015
bleskes added a commit that referenced this pull request Feb 6, 2015
bleskes added a commit that referenced this pull request Feb 6, 2015
bleskes added a commit that referenced this pull request Feb 6, 2015
bleskes added a commit that referenced this pull request Feb 9, 2015
This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

Closes #9562
bleskes added a commit that referenced this pull request Feb 9, 2015
@clintongormley clintongormley changed the title Gateway: add logging around gateway shard allocation Logging: add logging around gateway shard allocation Feb 11, 2015
@clintongormley clintongormley changed the title Logging: add logging around gateway shard allocation Add logging around gateway shard allocation Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

Closes elastic#9562
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
This commit adds more logs around the gateway shard allocation. Any errors while reaching out to nodes to list the local shards are logged in `WARN`. Shard info loading time is logged under DEBUG. Also, we log a `WARN` message if an exception forces a full checksum check during reading the store metadata

Closes elastic#9562
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants