Update cluster docs with a note on zookeeper session timeouts

The ZooKeeper client does session tracking and validation (pings) on
separate threads. This can make session timeouts likely under high
CPU utilization. With multi-threaded backups, we have seen consistent
session timeouts while backing up a cluster with 1 billion files, which
causes the leader to change and the backup to be interrupted (see
#9023 for details). This PR adds a note to the related doc section.

pr-link: #9088
change-id: cid-4935b8b8910f263edf9cc0bbf69f10577b0d76bf
ggezer authored and alluxio-bot committed May 14, 2019
1 parent 055c3ff commit 0917c51c9f624024618fdc2b48db2719529e15fa
Showing with 8 additions and 0 deletions.
  1. +8 −0 docs/en/deploy/Running-Alluxio-On-a-Cluster.md
@@ -196,6 +196,14 @@ The configuration parameters which must be set are:
accessible by all master nodes.
- Examples: `alluxio.master.journal.folder=hdfs://1.2.3.4:9000/alluxio/journal/`

For clusters with large namespaces, increased CPU overhead on the leader master can delay ZooKeeper client heartbeats.
For this reason, we recommend setting the ZooKeeper client session timeout to at least 2 minutes on large clusters whose
namespace holds more than several hundred million files.
- `alluxio.zookeeper.session.timeout=120s`
- The ZooKeeper server's tick time must also be configured to allow
this timeout. The current implementation requires the timeout to be at least 2 times the `tickTime` (as set in the server configuration)
and at most 20 times the `tickTime`.
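
The tick-time constraint above comes down to simple arithmetic. The Python sketch below is illustrative only and is not part of Alluxio or ZooKeeper: the function names are hypothetical, and the factors 2 and 20 are taken from the note above. It computes which server `tickTime` values (in milliseconds) would permit a given session timeout.

```python
def tick_time_bounds_ms(session_timeout_ms, min_factor=2, max_factor=20):
    """Return (lowest, highest) tickTime in ms that permits the session timeout.

    Hypothetical helper. Assumes the server accepts a timeout only if
    min_factor * tickTime <= timeout <= max_factor * tickTime.
    """
    # timeout <= max_factor * tick  =>  tick >= timeout / max_factor (rounded up)
    lowest = -(-session_timeout_ms // max_factor)
    # timeout >= min_factor * tick  =>  tick <= timeout / min_factor (rounded down)
    highest = session_timeout_ms // min_factor
    return lowest, highest


def timeout_allowed(session_timeout_ms, tick_time_ms, min_factor=2, max_factor=20):
    """Check whether a given tickTime permits the requested session timeout."""
    return min_factor * tick_time_ms <= session_timeout_ms <= max_factor * tick_time_ms


if __name__ == "__main__":
    timeout_ms = 120_000  # alluxio.zookeeper.session.timeout=120s
    low, high = tick_time_bounds_ms(timeout_ms)
    print(f"tickTime must be between {low} ms and {high} ms")
    print("tickTime=2000 ms allows 120s:", timeout_allowed(timeout_ms, 2000))
    print("tickTime=6000 ms allows 120s:", timeout_allowed(timeout_ms, 6000))
```

Under these assumptions, a 120s session timeout needs a `tickTime` between 6000 ms and 60000 ms. A server running with a `tickTime` of 2000 ms, for example, would cap the timeout at 40s and so would not allow it.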

Make sure all master nodes and all worker nodes have configured their respective
`conf/alluxio-site.properties` configuration file appropriately.
