This repository has been archived by the owner on Feb 27, 2020. It is now read-only.

Homestead neither terminates nor recovers on error #49

Closed
plwhite opened this issue Nov 9, 2016 · 3 comments
plwhite (Collaborator) commented Nov 9, 2016

Symptoms

Homestead failed. It turns out that this was because of #48. Homestead reported that it was terminating but failed to do so, and continued listening on its port. Homer appeared to be in a similar state.

Logs are below.

    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:181: Configuring store connection
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:182:   Hostname:  localhost
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:183:   Port:      9160
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:211: Configuring store worker pool
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:212:   Threads:   10
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:213:   Max Queue: 0
    09-11-2016 11:20:23.577 UTC Error main.cpp:745: Failed to initialize the Cassandra cache with error code 3.
    09-11-2016 11:20:23.577 UTC Status main.cpp:746: Homestead is shutting down

Impact

Deployment fails completely.

This is because the orchestration (Docker Compose / Kubernetes) cannot terminate and restart the failed container or report that the container has failed.

Correct behaviour is for the container to exit when the service has failed and is shutting down. Wrong but possibly acceptable behaviour (i.e. a code workaround) might be for it to stop listening on well-known ports so that the orchestration can detect the failure.
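To illustrate the gap (a sketch, not the project's actual deployment files; the service and image names are assumptions): a Docker Compose restart policy only fires when the container's main process exits, so a supervisor that keeps running inside the container hides the failure from it.

    # Hypothetical Compose fragment; service/image names are assumptions.
    # `restart: on-failure` only triggers when the container's PID 1 exits
    # non-zero, so an in-container supervisor that stays up masks the failure.
    services:
      homestead:
        image: clearwater/homestead
        restart: on-failure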

Release and environment

Current master release.

Steps to reproduce

Happens at the same time as #48. Misconfiguring the Cassandra address should reproduce it once #48 is fixed.

richardwhiuk (Contributor) commented
Homestead is run under supervisord. When Homestead emits the above log, we immediately exit with a process exit code of 2, so I doubt the process was stuck.

Homestead's supervisord configuration restarts it (up to 5000 times!) if the process fails within a second, and restarts it indefinitely if the process dies after more than a second.
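For reference, that policy would look roughly like the following supervisord entry (a sketch of the semantics just described, not the shipped Clearwater config; the command path is assumed):

    ; Sketch only -- not the actual Clearwater supervisord config.
    [program:homestead]
    command=/usr/share/clearwater/bin/homestead  ; assumed path
    autorestart=true   ; always restart after a successful start
    startsecs=1        ; exiting within 1s counts as a failed start...
    startretries=5000  ; ...retried up to 5000 times, then marked FATAL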

Homestead doesn't open its ports (HTTP signalling, HTTP management or the Diameter stack) until after it has confirmed it can connect to Cassandra, so I don't know what port you are talking about, except possibly 22.

Given that this problem was clearly terminal, it seems unlikely that we'd have been in any better position if the entire container had stopped.

Can you comment any further on what you saw here and what behaviour you would expect?

plwhite (Collaborator, Author) commented Nov 21, 2016

@richardwhiuk I would expect that if my deployment is DOA and an individual component in it is broken, then I'd know. If the container died, then my orchestration could handle the issue by recreating it, and I'd be getting alarms. You can argue that supervisord is restarting the process, and that's fine as far as it goes, but it seems that supervisord is giving up, perhaps after 5000 retries in a few seconds.

I think the right answer is for the orchestration to add a test on the various ports: if the container does not start listening on the HTTP ports within (say) 30 seconds, kill it and recreate it, which the Kubernetes infrastructure can do in this case. I'll implement that in the Kubernetes branch shortly, at which point I can close this issue down.
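For the record, that check can be written as a Kubernetes TCP liveness probe along these lines (a sketch: the port number and timings are illustrative, not values taken from the Kubernetes branch):

    # Illustrative container-level probe; port and timings are assumptions.
    # The kubelet kills and recreates the container if the port never opens.
    livenessProbe:
      tcpSocket:
        port: 8888
      initialDelaySeconds: 30  # allow time to connect to Cassandra
      periodSeconds: 10
      failureThreshold: 3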

plwhite (Collaborator, Author) commented Nov 22, 2016

OK, so I have extended the Kubernetes orchestration to nuke Homestead if it fails and then restart it. Hence closing this.

plwhite closed this as completed Nov 22, 2016