Suggest reducing tcp_retries2
Adds documentation suggesting reducing `tcp_retries2` on Linux to detect
network partitions more quickly.

Relates elastic#34405
DaveCTurner committed Jul 8, 2020
1 parent 3515909 commit cdec0f8
Showing 2 changed files with 52 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/reference/setup/sysconfig.asciidoc
@@ -14,6 +14,7 @@ The following settings *must* be considered before going to production:
* <<max-number-of-threads,Ensure sufficient threads>>
* <<networkaddress-cache-ttl,JVM DNS cache settings>>
* <<executable-jna-tmpdir,Temporary directory not mounted with `noexec`>>
* <<system-config-tcpretries,TCP retransmission timeout>>

[[dev-vs-prod]]
[float]
@@ -43,3 +44,5 @@ include::sysconfig/threads.asciidoc[]
include::sysconfig/dns-cache.asciidoc[]

include::sysconfig/executable-jna-tmpdir.asciidoc[]

include::sysconfig/tcpretries.asciidoc[]
49 changes: 49 additions & 0 deletions docs/reference/setup/sysconfig/tcpretries.asciidoc
@@ -0,0 +1,49 @@
[[system-config-tcpretries]]
=== TCP retransmission timeout

Each pair of nodes in a cluster communicates via a number of TCP connections
which remain open until one of the nodes shuts down or communication between
the nodes is disrupted by a failure in the underlying infrastructure.

TCP provides reliable communication over occasionally-unreliable networks by
hiding temporary network disruptions from the communicating applications. Your
operating system will retransmit any lost messages a number of times before
informing the sender of any problem. Most Linux distributions default to
retransmitting any lost packets 15 times. Retransmissions back off
exponentially, so these 15 retransmissions take over 900 seconds to complete.
This means it takes Linux many minutes to detect a network partition or a
failed node with this method. Windows defaults to just 5 retransmissions, which
corresponds to a timeout of around 6 seconds.
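
The figure of over 900 seconds can be reproduced with a small idealized model.
The sketch below is illustrative only: `model_timeout_ms` is a hypothetical
helper, and the 200 ms starting timeout and 120 s cap are assumptions about
the Linux defaults that this document does not state. It sums the wait before
each of the `retries + 1` timeouts, doubling each time up to the cap. Timeouts
observed on a real connection depend on its measured round-trip time and can
differ from this model, particularly for small retry counts.

```shell
# Idealized model of the total time spent before TCP gives up.
# Assumptions (not from this document): the first timeout is 200 ms,
# each subsequent timeout doubles, and no timeout exceeds 120 s.
model_timeout_ms() {
    retries=$1
    rto_min_ms=${2:-200}
    rto_max_ms=${3:-120000}
    total=0
    rto=$rto_min_ms
    i=0
    # One wait for the original transmission plus one per retransmission.
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt "$rto_max_ms" ]; then
            rto=$rto_max_ms
        fi
        i=$((i + 1))
    done
    echo "$total"
}
```

Under these assumptions, `model_timeout_ms 15` prints `924600`, that is,
roughly 924.6 seconds.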

The Linux default allows for communication over networks that may experience
very long periods of packet loss, but this default is excessive for production
networks within a single data centre, as is the case for most {es} clusters.
Highly-available clusters must be able to detect node failures quickly so that
they can react promptly by reallocating lost shards, rerouting searches and
perhaps electing a new master node. Linux users should therefore reduce the
maximum number of TCP retransmissions.

You can decrease the maximum number of TCP retransmissions to `5` by running
the following command as `root`. Five retransmissions correspond to a
timeout of around 6 seconds.

[source,sh]
-------------------------------------
sysctl -w net.ipv4.tcp_retries2=5
-------------------------------------

To set this value permanently, update the `net.ipv4.tcp_retries2` setting in
`/etc/sysctl.conf`. To verify after rebooting, run `sysctl
net.ipv4.tcp_retries2`.
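
One possible way to make the permanent change is sketched below. The helper
`set_sysctl_conf` is hypothetical (it is not part of {es} or of any standard
tool); it assumes a sysctl-style file of `key = value` lines, removes any
existing uncommented assignment of the given key, and appends the new value.

```shell
# Hypothetical helper: set KEY to VALUE in a sysctl-style config file,
# replacing any existing uncommented assignment of the same key.
set_sysctl_conf() {
    conf_file=$1
    key=$2
    value=$3
    tmp_file=$(mktemp) || return 1
    touch "$conf_file"
    # Copy every line except existing assignments of the key.
    awk -v key="$key" '{
        line = $0
        sub(/^[ \t]+/, "", line)
        split(line, parts, /[ \t]*=/)
        if (parts[1] != key) print
    }' "$conf_file" > "$tmp_file"
    printf '%s = %s\n' "$key" "$value" >> "$tmp_file"
    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}
```

Run as `root`, `set_sysctl_conf /etc/sysctl.conf net.ipv4.tcp_retries2 5`
followed by `sysctl -p` should apply the setting from the file without waiting
for a reboot.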

{es} also implements its own health checks with timeouts that are much shorter
than the default retransmission timeout on Linux. However, these health checks
must allow for application-level effects such as garbage collection pauses. We
do not recommend reducing any timeouts related to these application-level
health checks.

IMPORTANT: This setting applies to all TCP connections and will affect the
reliability of communication with systems outside your cluster too. If your
cluster communicates with external systems over an unreliable network, then you
may need to select a higher value for `net.ipv4.tcp_retries2`. For this reason,
{es} does not adjust this setting automatically.
