We had a minor production incident caused by Singularity's connection to ZK entering a zombie state.
INFO [2015-07-14 07:59:13,127] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@54f4a7f0{/,null,AVAILABLE}
INFO [2015-07-14 07:59:13,129] io.dropwizard.setup.AdminEnvironment: tasks =
POST /tasks/gc (io.dropwizard.servlets.tasks.GarbageCollectionTask)
INFO [2015-07-14 07:59:13,133] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@408e96d9{/admin,null,AVAILABLE}
INFO [2015-07-14 07:59:13,152] org.eclipse.jetty.server.ServerConnector: Started SingularityService@12f8b1d8{HTTP/1.1}{0.0.0.0:7099}
INFO [2015-07-14 07:59:32,309] org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x24e8b9313780002, likely server has closed socket, closing socket connection and attempting reconnect
INFO [2015-07-14 07:59:32,413] org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
INFO [2015-07-14 07:59:32,481] org.apache.zookeeper.ClientCnxn: Opening socket connection to server mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181. Will not attempt to authenticate using SASL (unknown error)
INFO [2015-07-14 07:59:32,482] org.apache.zookeeper.ClientCnxn: Socket connection established to mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181, initiating session
INFO [2015-07-14 07:59:32,785] org.apache.zookeeper.ClientCnxn: Session establishment complete on server mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181, sessionid = 0x24e8b9313780002, negotiated timeout = 40000
INFO [2015-07-14 07:59:32,786] org.apache.curator.framework.state.ConnectionStateManager: State change: RECONNECTED
ERROR [2015-07-14 07:59:33,428] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
! java.lang.IllegalArgumentException: Path must start with / character
! at org.apache.curator.utils.PathUtils.validatePath(PathUtils.java:53) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.utils.ZKPaths.getNodeFromPath(ZKPaths.java:56) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:478) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch.access$500(LeaderLatch.java:60) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch$6.processResult(LeaderLatch.java:535) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:686) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:485) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:587) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
INFO [2015-07-14 10:30:17,357] org.eclipse.jetty.server.ServerConnector: Stopped SingularityService@12f8b1d8{HTTP/1.1}{0.0.0.0:7099}
INFO [2015-07-14 10:30:17,358] org.eclipse.jetty.server.handler.ContextHandler: Stopped i.d.j.MutableServletContextHandler@408e96d9{/admin,null,UNAVAILABLE}
INFO [2015-07-14 10:30:17,359] org.eclipse.jetty.server.handler.ContextHandler: Stopped i.d.j.MutableServletContextHandler@54f4a7f0{/,null,UNAVAILABLE}
INFO [2015-07-14 10:30:17,378] org.apache.zookeeper.ClientCnxn: EventThread shut down
INFO [2015-07-14 10:30:17,378] org.apache.zookeeper.ZooKeeper: Session: 0x24e8b9313780002 closed
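The gap of roughly two and a half hours in the log (from the exception at 07:59 to the shutdown at 10:30) is how long Singularity sat in this state until an administrator was paged and came in to restart it manually.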
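One possible mitigation, as a sketch only and not something Singularity does today as far as this report shows: treat an unrecoverable background error as fatal, so that a process supervisor restarts the service instead of waiting for a human. The class below is hypothetical (`FailFastOnBackgroundError` and its `abortAction` are names made up for illustration). Its `unhandledError(String, Throwable)` method is shaped to match Curator's `UnhandledErrorListener`, on the assumption that Curator notifies such listeners for the "Background exception was not retry-able or retry gave up" path.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper (not part of Singularity or Curator): runs an abort
 * action the first time an unrecoverable background error is reported.
 */
final class FailFastOnBackgroundError {
    // Ensures the abort action runs at most once, even if errors keep arriving.
    private final AtomicBoolean tripped = new AtomicBoolean(false);
    private final Runnable abortAction;

    FailFastOnBackgroundError(Runnable abortAction) {
        this.abortAction = abortAction;
    }

    /** Same shape as Curator's UnhandledErrorListener.unhandledError(String, Throwable). */
    void unhandledError(String message, Throwable cause) {
        if (tripped.compareAndSet(false, true)) {
            System.err.println("Unrecoverable background error, aborting: "
                    + message + " (" + cause + ")");
            abortAction.run();
        }
    }

    boolean isTripped() {
        return tripped.get();
    }

    public static void main(String[] args) {
        // Demo with a harmless abort action; a real service might exit instead.
        FailFastOnBackgroundError handler =
                new FailFastOnBackgroundError(() -> System.err.println("abort requested"));
        handler.unhandledError("Background exception was not retry-able or retry gave up",
                new IllegalArgumentException("Path must start with / character"));
    }
}
```

With a `CuratorFramework` instance named `client` (assumed), the wiring would look something like `client.getUnhandledErrorListenable().addListener(handler::unhandledError)`, with `() -> System.exit(1)` as the abort action. Whether exiting is the right reaction, rather than recreating the `LeaderLatch`, is a separate design question.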
This is 0.4.3-SNAPSHOT as of a week ago, slightly patched: the only additions are my PRs that are not merged yet, which should not be relevant here.
Two issues here: