bamboo stalls during deployment once haproxy config is invalid. #136
Comments
Thanks for reporting this. Are you using the multi-port support configuration? |
No, this is just a single port (the commented out 'service port' example is mine). I've disabled one of the gateway servers on the environment in question today so I'll run |
Thanks for the update. It sounds like an edge case we need to filter out. If you can reproduce the problem with specific steps, the fix might be easy to find. |
The isolated gateway broke yesterday afternoon. I'm still not 100% sure how it got into this state (I was offsite at the time, annoyingly), but here's how it looks now. The haproxy config is certainly invalid, e.g. we have:
(where 2.2.3-rc1 is the version they've just deployed, and 'f48cea4' is the old scaled-down one). It's not being 'healed' because the bamboo state is invalid (from a curl to 0:8000/api/state):
The logs show webhooks arriving every 30 seconds as normal. Now that it's in a broken state, I've traced it (using sysdig with the echo_fds chisel) and can see
Just double-checked Marathon's state: although there is no longer an entry for the old app under /v2/apps, the old task is still showing up at /v2/tasks:
(Notice that there's no servicePort on that first 'f48c...' task, and the appId is I don't know if this is a Marathon bug (I'm running 0.7.5)? |
The first thing I suggest is upgrading to Marathon 0.8.1 (this is what we are currently running). We upgraded Marathon from the 0.7.x releases to 0.8.0 and then 0.8.1 smoothly. I also agree that it might be better to check that the associated apps exist. |
It's on my todo list, don't worry :) If I were to send a patch for this, I can see a few ways to tackle it. Which would you prefer? In reverse order of effort:
I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons. |
The second option might be a better alternative. Your proposed fix doesn't affect the future roadmap. Internally, we are redesigning how bamboo works, not just with haproxy but with any number of exports. Essentially, it will be possible to say that MyApp can be load-balanced via HAProxy, DNS services (for UDP, long TCP connections), or a specialised proxy configured for Redis clusters.
OK, I'll go with the second option then. The redesign sounds interesting too. Expect a PR sometime tomorrow; it's probably quicker for me to develop against a known-bad Marathon deployment than to try to reproduce it here. |
Should have added - managed to reproduce my original problem on a set of test VMs, and can confirm this resolves the issue when there are 'floating' tasks in the marathon REST responses. |
Sorry to nag, but are there any objections to merging this in? This version of the code has been running well for us all week, whereas the previous release crashes each time we deploy. |
I had a quick look over the code. One minor issue: the m_app variable name should be camelCase. We are moving office at the moment, so please expect a slow response from me.
Thanks, I'll have another crack at it then. |
Tweaked that commit to camelCase the new variables. Thanks! |
add apps, then assign tasks. closes #136 .
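The idea in that commit message (register the apps first, then attach tasks only to apps that exist) can be sketched roughly as below. This is a minimal illustration, not the merged patch: the `Task` and `App` types, the `assignTasks` function, and the app IDs are all hypothetical names made up for the example, and fetching /v2/apps and /v2/tasks from Marathon is left out.

```go
package main

import "fmt"

// Task holds the few fields of a Marathon task entry used in this sketch.
// The type is hypothetical, not bamboo's own.
type Task struct {
	ID    string
	AppID string
	Host  string
	Ports []int
}

// App is a hypothetical in-memory view of one Marathon app.
type App struct {
	ID    string
	Tasks []Task
}

// assignTasks registers every known app first, then attaches each task to
// its app. Tasks whose AppID matches no known app ("floating" tasks) are
// returned separately instead of creating a phantom app entry.
func assignTasks(appIDs []string, tasks []Task) (map[string]*App, []Task) {
	apps := make(map[string]*App, len(appIDs))
	for _, id := range appIDs {
		apps[id] = &App{ID: id}
	}

	var orphans []Task
	for _, t := range tasks {
		app, ok := apps[t.AppID]
		if !ok {
			orphans = append(orphans, t)
			continue
		}
		app.Tasks = append(app.Tasks, t)
	}
	return apps, orphans
}

func main() {
	// Made-up data: one live app, plus a leftover task from a destroyed app.
	appIDs := []string{"/web-new"}
	tasks := []Task{
		{ID: "task-1", AppID: "/web-new", Host: "10.0.0.1", Ports: []int{31000}},
		{ID: "task-2", AppID: "/web-old", Host: "10.0.0.2", Ports: []int{31001}},
	}

	apps, orphans := assignTasks(appIDs, tasks)
	fmt.Println("tasks attached to /web-new:", len(apps["/web-new"].Tasks))
	for _, t := range orphans {
		fmt.Println("skipping floating task:", t.ID, "for missing app", t.AppID)
	}
}
```

Keeping the orphans as a separate return value, rather than silently dropping them, makes it easy to log them when debugging a Marathon instance that still reports tasks for a destroyed app.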
our ops do something a bit - odd - during deploys;
When the old app is destroyed, bamboo thinks the app has a port of '0' and so generates an invalid haproxy config. At this point bamboo stops responding to webhooks, so the config is never regenerated and we lose service.
Arguably this is pilot error but I'm having a hell of a job figuring out how bamboo becomes unresponsive.
Wondering if anyone's seen anything similar? I'm about to break out the sysdig on a test instance, but some pointers would help.
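As a rough illustration of the failure described above, a guard of the following shape would keep an entry with a port of 0 out of the generated config. It is only a sketch under assumptions: `Service`, `validPort`, and `splitByPort` are hypothetical names, and this is not how bamboo itself builds its haproxy template input.

```go
package main

import "fmt"

// Service is a hypothetical stand-in for one entry that would become a
// frontend in the generated haproxy config.
type Service struct {
	ID          string
	ServicePort int
}

// validPort reports whether p is usable as a TCP listen port.
func validPort(p int) bool {
	return p >= 1 && p <= 65535
}

// splitByPort separates services that can safely be rendered from those
// whose port is missing or out of range (for example the port '0' case).
func splitByPort(services []Service) (renderable, skipped []Service) {
	for _, s := range services {
		if validPort(s.ServicePort) {
			renderable = append(renderable, s)
		} else {
			skipped = append(skipped, s)
		}
	}
	return renderable, skipped
}

func main() {
	// Made-up data: one healthy service and one with no service port.
	services := []Service{
		{ID: "/web-new", ServicePort: 10000},
		{ID: "/web-old", ServicePort: 0},
	}

	renderable, skipped := splitByPort(services)
	fmt.Println("services to render:", len(renderable))
	for _, s := range skipped {
		fmt.Println("not rendering", s.ID, "with invalid port", s.ServicePort)
	}
}
```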