"Path must start with / character" leads to Singularity hanging forever #620

@stevenschlansker

We had a minor production incident: after a reconnect, Singularity's connection to ZooKeeper (ZK) entered a zombie state and the service hung until it was restarted. The relevant log:

INFO  [2015-07-14 07:59:13,127] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@54f4a7f0{/,null,AVAILABLE}
INFO  [2015-07-14 07:59:13,129] io.dropwizard.setup.AdminEnvironment: tasks = 

    POST    /tasks/gc (io.dropwizard.servlets.tasks.GarbageCollectionTask)

INFO  [2015-07-14 07:59:13,133] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@408e96d9{/admin,null,AVAILABLE}
INFO  [2015-07-14 07:59:13,152] org.eclipse.jetty.server.ServerConnector: Started SingularityService@12f8b1d8{HTTP/1.1}{0.0.0.0:7099}
INFO  [2015-07-14 07:59:32,309] org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x24e8b9313780002, likely server has closed socket, closing socket connection and attempting reconnect
INFO  [2015-07-14 07:59:32,413] org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
INFO  [2015-07-14 07:59:32,481] org.apache.zookeeper.ClientCnxn: Opening socket connection to server mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181. Will not attempt to authenticate using SASL (unknown error)
INFO  [2015-07-14 07:59:32,482] org.apache.zookeeper.ClientCnxn: Socket connection established to mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181, initiating session
INFO  [2015-07-14 07:59:32,785] org.apache.zookeeper.ClientCnxn: Session establishment complete on server mesos3-prod-sc.otsql.opentable.com/10.20.16.252:2181, sessionid = 0x24e8b9313780002, negotiated timeout = 40000
INFO  [2015-07-14 07:59:32,786] org.apache.curator.framework.state.ConnectionStateManager: State change: RECONNECTED
ERROR [2015-07-14 07:59:33,428] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
! java.lang.IllegalArgumentException: Path must start with / character
! at org.apache.curator.utils.PathUtils.validatePath(PathUtils.java:53) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.utils.ZKPaths.getNodeFromPath(ZKPaths.java:56) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:478) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch.access$500(LeaderLatch.java:60) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.recipes.leader.LeaderLatch$6.processResult(LeaderLatch.java:535) ~[singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:686) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:485) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:587) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
! at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) [singularity-20150708-2.jar:0.4.3-SNAPSHOT]
INFO  [2015-07-14 10:30:17,357] org.eclipse.jetty.server.ServerConnector: Stopped SingularityService@12f8b1d8{HTTP/1.1}{0.0.0.0:7099}
INFO  [2015-07-14 10:30:17,358] org.eclipse.jetty.server.handler.ContextHandler: Stopped i.d.j.MutableServletContextHandler@408e96d9{/admin,null,UNAVAILABLE}
INFO  [2015-07-14 10:30:17,359] org.eclipse.jetty.server.handler.ContextHandler: Stopped i.d.j.MutableServletContextHandler@54f4a7f0{/,null,UNAVAILABLE}
INFO  [2015-07-14 10:30:17,378] org.apache.zookeeper.ClientCnxn: EventThread shut down
INFO  [2015-07-14 10:30:17,378] org.apache.zookeeper.ZooKeeper: Session: 0x24e8b9313780002 closed

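The gap of roughly two and a half hours between the exception (07:59) and the shutdown (10:30) is how long it took for an administrator to be paged and come in to restart Singularity manually.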

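Two issues here:

  • What is this IllegalArgumentException (IAE) about the path being invalid, thrown from LeaderLatch.checkLeadership right after the reconnect?
  • Why didn't the exception in the ZK background callback cause Singularity to abort instead of hanging?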

This is 0.4.3-SNAPSHOT as of about a week ago, lightly patched with my own PRs that are not merged yet. Those patches should not be relevant here.
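To make the second question concrete, here is a minimal, self-contained sketch of the fail-fast behaviour being asked for. It does not use Curator or Singularity code: `FailFastSketch`, `validatePath`, `runCallback`, and the node name `latch-0000000001` are hypothetical stand-ins, and `validatePath` only mimics the message seen in the log. The idea is that an exception escaping a background callback is routed to a fatal-error handler rather than only being logged. In a real deployment, Curator's unhandled-error listener (`CuratorFramework.getUnhandledErrorListenable()`) might play the role of `onFatal`, but whether that is the right fix for this incident is an assumption, not something established above.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

final class FailFastSketch {

    /**
     * Hypothetical stand-in for a path check. It reproduces the message from
     * the log above but is NOT Curator's implementation.
     */
    static void validatePath(String path) {
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("Path must start with / character");
        }
    }

    /**
     * Runs a background callback. Any RuntimeException is passed to onFatal
     * instead of being swallowed, so the caller can decide to abort.
     * Returns true if the callback completed normally.
     */
    static boolean runCallback(Runnable callback, Consumer<Throwable> onFatal) {
        try {
            callback.run();
            return true;
        } catch (RuntimeException e) {
            onFatal.accept(e);
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        // A relative node name (assumed for illustration) fails validation.
        // A real service might log and call System.exit(1) in the handler so
        // that a supervisor restarts the process; here we only record it.
        boolean ok = runCallback(() -> validatePath("latch-0000000001"), fatal::set);

        System.out.println("callback ok: " + ok + ", fatal: " + fatal.get());
    }
}
```

The handler is passed in as a `Consumer<Throwable>` so the abort policy (exit, relinquish leadership, alert) stays separate from the callback plumbing.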
