bamboo stalls during deployment once haproxy config is invalid. #136

rasputnik · 2015-06-15T11:27:47Z

our ops do something a bit - odd - during deploys;

they scale existing app to 1,
deploy a new app at the same port
destroy the old app

when the old app is destroyed, bamboo thinks the app has a port of '0' and so generates an invalid haproxy config. at this point bamboo stops responding to webhooks so the config is never regenerated and we lose service.

Arguably this is pilot error but I'm having a hell of a job figuring out how bamboo becomes unresponsive.

Wondering if anyone's seen anything similar? I'm about to break out the sysdig on a test instance, but some pointers would help.

j1n6 · 2015-06-15T14:58:05Z

Thanks for reporting this. Are you using the multi-port support configuration?

rasputnik · 2015-06-16T10:16:29Z

No, this is just a single port (the commented out 'service port' example is mine).
Seems like at some point in the dance, the serviceports start coming back from bamboo as '0'
(which I'm guessing is the Go 'nil value' for integers). Marathon is available during this
deployment as far as I can make out.

I've disabled one of the gateway servers on the environment in question today so I'll run
bamboo on there and wait for it to break, will report back if that turns up any more information.

j1n6 · 2015-06-16T10:23:17Z

Thanks for the update. It sounds like an edge case we need to filter out. If you can reproduce the problem with specific steps, the fix might be easy to find.

rasputnik · 2015-06-18T10:41:47Z

The isolated gateway broke yesterday afternoon. I'm still not 100% sure how it got into this state (was offsite at the time annoyingly) but here's how it looks now.

the haproxy config is certainly invalid e.g. we have :

    listen ::appname-2.2.3-rc1_18005
  bind *:18005
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-2.2.3-rc1-slavename-31911 slavename:31911

listen ::appname-f48cea4_0
  bind *:0
  mode http

  balance leastconn
  capture request header X-Forwarded-For len 15

    server ::appname-f48cea4-slavename-31687 slavename:31687

(where 2.2.3-rc1 is the version they've just deployed, and 'f48cea4' is the old scaled down one).

It's not being 'healed' because bamboo state is invalid (from a curl to 0:8000/api/state ):

  {
    "Env": {},
    "ServicePorts": [
      18007
    ],
    "ServicePort": 18007,
    "Tasks": [
      {
        "Ports": [
          31340
        ],
        "Port": 31340,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-2.2.3-rc1",
    "Id": "/appname-2.2.3-rc1"
  },
  {
    "Env": null,
    "ServicePorts": null,
    "ServicePort": 0,
    "Tasks": [
      {
        "Ports": [
          31710
        ],
        "Port": 31710,
        "Host": "slavename"
      }
    ],
    "HealthCheckPath": "",
    "EscapedId": "::appname-f48cea4",
    "Id": "/appname-f48cea4"
  },

The logs show webhooks arriving every 30 seconds as normal,
but the state remains invalid.

Now it's in a broken state I've traced it (using sysdig with the echo_fds chisel) and can see
this happening:

POST comes in from marathon master
writes to its log "Queuing an haproxy update"
increments statsd counter
responds with '200 OK' to marathon
reads current /etc/haproxy/haproxy.cfg contents
reads its haproxy_template.cfg
makes a GET to master for /v2/tasks
at this point, more webhooks fire (marathon sends one for each task)
the /v2/tasks response comes back
make a GET to master for /v2/apps
more webhooks are still arriving
the /v2/apps response arrives
updates /bamboo/state in zookeeper

rasputnik · 2015-06-18T11:25:45Z

Just doublechecked marathons state - although there is no longer an entry for the old app under /v2/apps, the old task is still showing up at /v2/tasks:

...
  {
    "version": "2015-04-09T07:59:25.530Z",
    "stagedAt": "2015-04-13T14:40:40.584Z",
    "startedAt": "2015-04-13T14:52:32.956Z",
    "ports": [
      31687
    ],
    "host": "slavename",
    "id": "appname-f48cea4.0e8f164d-e1eb-11e4-bf47-001a4aa850cf",
    "appId": "/appname-f48cea4"
  },
...
    {
      "servicePorts": [
        18005
      ],
      "version": "2015-06-15T13:58:54.000Z",
      "stagedAt": "2015-06-15T13:59:21.431Z",
      "startedAt": "2015-06-18T11:20:51.431Z",
      "ports": [
        31911
      ],
      "host": "slavename",
      "id": "appname-2.2.3-rc1.b8e586c3-1366-11e5-bf47-001a4aa850cf",
      "appId": "/appname-2.2.3-rc1"
    },

(Notice that there's no servicePort on that first 'f48c...' task, and the appId is
no longer present in /v2/apps)

I don't know if this is a Marathon bug (I'm running 0.7.5)?
Either way, would it be hard to only add tasks into bamboo if the associated app exists?

j1n6 · 2015-06-18T13:43:41Z

First thing I suggest is upgrading to marathon 0.8.1 (This is currently what we are running). We upgraded Marathon from the 0.7.x releases to 0.8.0 then 0.8.1 smoothly.

I also agree that it might be better to check associated apps exists.

rasputnik · 2015-06-18T14:01:49Z

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

a check in the template for service port == 0
some sanity checking in services/marathon/marathon.go to only return tasks with associated apps
a rework to get apps and tasks in one request using a GET to /v2/apps/?embed=apps.tasks

I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.

j1n6 · 2015-06-18T18:31:55Z

The second option might be a better alternative. Your proposed fix doesn't affect future roadmap.

We are internally re-designing a new way of how bamboo works not just with haproxy, but any number of exports. Essentially, it's possible to say MyApp can be loadbalacned via HAProxy, DNS services (for udp, long TCP connection), or configuring specialised proxy for Redis clusters.

On 18 Jun 2015, at 15:01, Dick Davies notifications@github.com wrote:

It's on my todo list, don't worry :)

If I was to send a patch for this I can see a few ways to tackle this, which would you prefer?

in reverse order of effort :

a check in the template for service port == 0
some sanity checking in services/marathon/marathon.go to only return tasks with associated apps
a rework to get apps and tasks in one request using a GET to /v2/apps/?embed=apps.tasks
I'd prefer the last option, with a caveat that I'm not 100% sure when that query param was added - so it may not work on older marathons.

—
Reply to this email directly or view it on GitHub.

rasputnik · 2015-06-18T18:41:14Z

ok, i'll go with the second option then - the redesign sounds interesting too - expect a PR tomorrow sometime, it's probably quicker for me to develop against a known-bad marathon deployment than try to reproduce it here.

rasputnik · 2015-06-19T16:11:28Z

Should have added - managed to reproduce my original problem on a set of test VMs, and can confirm this resolves the issue when there are 'floating' tasks in the marathon REST responses.

rasputnik · 2015-06-26T09:45:34Z

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.

j1n6 · 2015-06-26T15:27:35Z

I had a quickly look over the code. One minor issue with the m_app variable name should be camalCase.

We are moving office at the moment, please expect my slow response..

On 26 Jun 2015, at 10:45, Dick Davies notifications@github.com wrote:

Sorry to nag but any objections to merging this in? This version of the code has been running well for us all week whereas previous release crashes each time we deploy.

—
Reply to this email directly or view it on GitHub.

rasputnik · 2015-06-26T19:14:46Z

Thanks, I'll have another crack at it then.

rasputnik · 2015-06-27T08:50:48Z

Tweaked that commit to camelCase the new variables. Thanks!

add apps, then assign tasks. closes #136 .

rasputnik mentioned this issue Jun 23, 2015

Allow automatic configuration of Host based routing #28

Open

j1n6 closed this as completed in 2ab5167 Jul 23, 2015

j1n6 pushed a commit that referenced this issue Jul 23, 2015

Merge pull request #139 from rasputnik/apps_then_tasks

5fb80e0

add apps, then assign tasks. closes #136 .

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bamboo stalls during deployment once haproxy config is invalid. #136

bamboo stalls during deployment once haproxy config is invalid. #136

rasputnik commented Jun 15, 2015

j1n6 commented Jun 15, 2015

rasputnik commented Jun 16, 2015

j1n6 commented Jun 16, 2015

rasputnik commented Jun 18, 2015

rasputnik commented Jun 18, 2015

j1n6 commented Jun 18, 2015

rasputnik commented Jun 18, 2015

j1n6 commented Jun 18, 2015

rasputnik commented Jun 18, 2015

rasputnik commented Jun 19, 2015

rasputnik commented Jun 26, 2015

j1n6 commented Jun 26, 2015

rasputnik commented Jun 26, 2015

rasputnik commented Jun 27, 2015

bamboo stalls during deployment once haproxy config is invalid. #136

bamboo stalls during deployment once haproxy config is invalid. #136

Comments

rasputnik commented Jun 15, 2015

j1n6 commented Jun 15, 2015

rasputnik commented Jun 16, 2015

j1n6 commented Jun 16, 2015

rasputnik commented Jun 18, 2015

rasputnik commented Jun 18, 2015

j1n6 commented Jun 18, 2015

rasputnik commented Jun 18, 2015

j1n6 commented Jun 18, 2015

rasputnik commented Jun 18, 2015

rasputnik commented Jun 19, 2015

rasputnik commented Jun 26, 2015

j1n6 commented Jun 26, 2015

rasputnik commented Jun 26, 2015

rasputnik commented Jun 27, 2015