This repository has been archived by the owner on Feb 27, 2020. It is now read-only.

Homestead neither terminates nor recovers on error #49

Closed
plwhite opened this issue Nov 9, 2016 · 3 comments
plwhite (Collaborator) commented Nov 9, 2016

Symptoms

Homestead failed. It turns out that this was because of #48. Homestead reported that it was terminating but failed to do so, and continued listening on its port. Homer appeared to be in a similar state.

Logs are below.

    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:181: Configuring store connection
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:182:   Hostname:  localhost
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:183:   Port:      9160
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:211: Configuring store worker pool
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:212:   Threads:   10
    09-11-2016 11:20:23.576 UTC Status cassandra_store.cpp:213:   Max Queue: 0
    09-11-2016 11:20:23.577 UTC Error main.cpp:745: Failed to initialize the Cassandra cache with error code 3.
    09-11-2016 11:20:23.577 UTC Status main.cpp:746: Homestead is shutting down

Impact

Deployment fails completely.

This is because the orchestration (Docker Compose / Kubernetes) cannot terminate and restart the failed container or report that the container has failed.

Correct behaviour is for the container to exit when the service has failed and is shutting down. Wrong but possibly acceptable behaviour (i.e. a code workaround) might be for it to stop listening on well-known ports so that the orchestration can detect the failure.
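To illustrate the gap (a sketch, not the project's actual deployment files; the service and image names are assumptions): a Docker Compose restart policy only fires when the container's main process exits, so a supervisor that keeps running inside the container hides the failure from it.

    # Hypothetical Compose fragment; service/image names are assumptions.
    # `restart: on-failure` only triggers when the container's PID 1 exits
    # non-zero, so an in-container supervisor that stays up masks the failure.
    services:
      homestead:
        image: clearwater/homestead
        restart: on-failure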

Release and environment

Current master release.

Steps to reproduce

Happens at the same time as #48. Misconfiguring the Cassandra address should reproduce it once #48 is fixed.

richardwhiuk (Contributor) commented
Homestead is run under supervisord. When Homestead emits the above log, we immediately exit with a process exit code of 2, so I doubt the process was stuck.

Homestead's supervisord configuration restarts it (up to 5000 times!) if the process fails within a second, and restarts it indefinitely if the process dies after more than a second.
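For reference, that policy would look roughly like the following supervisord entry (a sketch of the semantics just described, not the shipped Clearwater config; the command path is assumed):

    ; Sketch only -- not the actual Clearwater supervisord config.
    [program:homestead]
    command=/usr/share/clearwater/bin/homestead  ; assumed path
    autorestart=true   ; always restart after a successful start
    startsecs=1        ; exiting within 1s counts as a failed start...
    startretries=5000  ; ...retried up to 5000 times, then marked FATAL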

Homestead doesn't open its ports (HTTP signalling, HTTP management or the Diameter stack) until after it has confirmed it can connect to Cassandra, so I don't know what port you are talking about, except possibly 22.

Given that this problem was clearly terminal, it seems unlikely that we'd have been in any better position if the entire container had stopped.

Can you comment any further on what you saw here and what behaviour you would expect?

plwhite (Collaborator, Author) commented Nov 21, 2016

@richardwhiuk I would expect that if my deployment is DOA and an individual component in it is broken, then I'd know. If the container died, then my orchestration could handle the issue by recreating it, and I'd be getting alarms. You can argue that supervisord is restarting the process, and that's fine as far as it goes, but it seems that supervisord is giving up, perhaps after 5000 retries in a few seconds.

I think the right answer is for the orchestration to add a test on the various ports: if the container does not start listening on the HTTP ports within (say) 30 seconds, kill it and recreate it, which the Kubernetes infrastructure can do in this case. I'll implement that in the Kubernetes branch shortly, at which point I can close this issue down.
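For the record, that check can be written as a Kubernetes TCP liveness probe along these lines (a sketch: the port number and timings are illustrative, not values taken from the Kubernetes branch):

    # Illustrative container-level probe; port and timings are assumptions.
    # The kubelet kills and recreates the container if the port never opens.
    livenessProbe:
      tcpSocket:
        port: 8888
      initialDelaySeconds: 30  # allow time to connect to Cassandra
      periodSeconds: 10
      failureThreshold: 3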

plwhite (Collaborator, Author) commented Nov 22, 2016

OK, so I have extended the Kubernetes orchestration to nuke Homestead if it fails and then restart it. Hence closing this.

plwhite closed this as completed Nov 22, 2016