Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bamboo stalls during deployment once haproxy config is invalid. #136

Closed
rasputnik opened this issue Jun 15, 2015 · 14 comments
Closed

bamboo stalls during deployment once haproxy config is invalid. #136

rasputnik opened this issue Jun 15, 2015 · 14 comments

Comments

@rasputnik
Copy link
Contributor

our ops do something a bit - odd - during deploys;

  • they scale existing app to 1,
  • deploy a new app at the same port
  • destroy the old app

when the old app is destroyed, bamboo thinks the app has a port of '0' and so generates an invalid haproxy config. at this point bamboo stops responding to webhooks so the config is never regenerated and we lose service.

Arguably this is pilot error but I'm having a hell of a job figuring out how bamboo becomes unresponsive.

Wondering if anyone's seen anything similar? I'm about to break out the sysdig on a test instance, but some pointers would help.

@j1n6
Copy link
Contributor

j1n6 commented Jun 15, 2015

Thanks for reporting this. Are you using the multi-port support configuration?

@rasputnik
Copy link
Contributor Author

No, this is just a single port (the commented out 'service port' example is mine).
Seems like at some point in the dance, the serviceports start coming back from bamboo as '0'
(which I'm guessing is the Go 'nil value' for integers). Marathon is available during this
deployment as far as I can make out.

I've disabled one of the gateway servers on the environment in question today so I'll run
bamboo on there and wait for it to break, will report back if that turns up any more information.

@j1n6
Copy link
Contributor

j1n6 commented Jun 16, 2015

Thanks for the update. It sounds like an edge case we need to filter out. If you can reproduce the problem with specific steps, the fix might be easy to find.

@rasputnik
Copy link
Contributor Author

The isolated gateway broke yesterday afternoon. I'm still not 100% sure how it got into this state (was offsite at the time annoyingly) but here's how it looks now.

the haproxy config is certainly invalid e.g. we have :

    listen ::appname-2.2.3-rc1_18005
  bind *:18005
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-2.2.3-rc1-slavename-31911 slavename:31911

listen ::appname-f48cea4_0
  bind *:0
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-f48cea4-slavename-31687 slavename:31687

(where 2.2.3-rc1 is the version they've just deployed, and 'f48cea4' is the old scaled down one).

It's not being 'healed' because bamboo state is invalid (from a curl to 0:8000/api/state ):

  {
    "Env": {},
    "ServicePorts": [
      18007
    ],
    "ServicePort": 18007,
    "Tasks": [
      {
        "Ports": [
          31340
        ],
        "Port": 31340,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-2.2.3-rc1",
    "Id": "/appname-2.2.3-rc1"
  },
  {
    "Env": null,
    "ServicePorts": null,
    "ServicePort": 0,
    "Tasks": [
      {
        "Ports": [
          31710
        ],
        "Port": 31710,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-f48cea4",
    "Id": "/appname-f48cea4"
  },

The logs show webhooks arriving every 30 seconds as normal,
but the state remains invalid.

Now it's in a broken state I've traced it (using sysdig with the echo_fds chisel) and can see
this happening:

  • POST comes in from marathon master
  • writes to its log "Queuing an haproxy update"
  • increments statsd counter
  • responds with '200 OK' to marathon
  • reads current /etc/haproxy/haproxy.cfg contents
  • reads its haproxy_template.cfg
  • makes a GET to master for /v2/tasks
  • at this point, more webhooks fire (marathon sends one for each task)
  • the /v2/tasks response comes back
  • make a GET to master for /v2/apps
  • more webhooks are still arriving
  • the /v2/apps response arrives
  • updates /bamboo/state in zookeeper

@rasputnik
Copy link
Contributor Author

Just doublechecked marathons state - although there is no longer an entry for the old app under /v2/apps, the old task is still showing up at /v2/tasks:

...
  {
    "version": "2015-04-09T07:59:25.530Z",
    "stagedAt": "2015-04-13T14:40:40.584Z",
    "startedAt": "2015-04-13T14:52:32.956Z",
    "ports": [
      31687
    ],
    "host": "slavename",
    "id": "appname-f48cea4.0e8f164d-e1eb-11e4-bf47-001a4aa850cf",
    "appId": "/appname-f48cea4"
  },
...
    {
      "servicePorts": [
        18005
      ],
      "version": "2015-06-15T13:58:54.000Z",
      "stagedAt": "2015-06-15T13:59:21.431Z",
      "startedAt": "2015-06-18T11:20:51.431Z",
      "ports": [
        31911
      ],
      "host": "slavename",
      "id": "appname-2.2.3-rc1.b8e586c3-1366-11e5-bf47-001a4aa850cf",
      "appId": "/appname-2.2.3-rc1"
    },


(Notice that there's no servicePort on that first 'f48c...' task, and the appId is
no longer present in /v2/apps)

I don't know if this is a Marathon bug (I'm running 0.7.5)?
Either way, would it be hard to only add tasks into bamboo if the associated app exists?

@j1n6
Copy link
Contributor

j1n6 commented Jun 18, 2015

First thing I suggest is upgrading to marathon 0.8.1 (This is currently what we are running). We upgraded Marathon from the 0.7.x releases to 0.8.0 then 0.8.1 smoothly.

I also agree that it might be better to check associated apps exists.

@rasputnik
Copy link
Contributor Author

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

  • a check in the template for service port == 0
  • some sanity checking in services/marathon/marathon.go to only return tasks with associated apps
  • a rework to get apps and tasks in one request using a GET to /v2/apps/?embed=apps.tasks

I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.

@j1n6
Copy link
Contributor

j1n6 commented Jun 18, 2015

The second option might be a better alternative. Your proposed fix doesn't affect future roadmap.

We are internally re-designing a new way of how bamboo works not just with haproxy, but any number of exports. Essentially, it's possible to say MyApp can be loadbalacned via HAProxy, DNS services (for udp, long TCP connection), or configuring specialised proxy for Redis clusters.

On 18 Jun 2015, at 15:01, Dick Davies notifications@github.com wrote:

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

a check in the template for service port == 0
some sanity checking in services/marathon/marathon.go to only return tasks with associated apps
a rework to get apps and tasks in one request using a GET to /v2/apps/?embed=apps.tasks
I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.


Reply to this email directly or view it on GitHub.

@rasputnik
Copy link
Contributor Author

ok, i'll go with the second option then - the redesign sounds interesting too - expect a PR tomorrow sometime, it's probably quicker for me to develop against a known-bad marathon deployment than try to reproduce it here.

@rasputnik
Copy link
Contributor Author

Should have added - managed to reproduce my original problem on a set of test VMs, and can confirm this resolves the issue when there are 'floating' tasks in the marathon REST responses.

@rasputnik
Copy link
Contributor Author

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.

@j1n6
Copy link
Contributor

j1n6 commented Jun 26, 2015

I had a quickly look over the code. One minor issue with the m_app variable name should be camalCase.

We are moving office at the moment, please expect my slow response..

On 26 Jun 2015, at 10:45, Dick Davies notifications@github.com wrote:

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.


Reply to this email directly or view it on GitHub.

@rasputnik
Copy link
Contributor Author

Thanks, I'll have another crack at it then.

@rasputnik
Copy link
Contributor Author

Tweaked that commit to camelCase the new variables. Thanks!

@j1n6 j1n6 closed this as completed in 2ab5167 Jul 23, 2015
j1n6 pushed a commit that referenced this issue Jul 23, 2015
add apps, then assign tasks. closes #136 .
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants