Standalone buggy restart with SIGUSR2 #125

Closed
ybogdanov opened this Issue Jul 18, 2011 · 13 comments

Hi,

I'm using the examples/standalone.js example to test this.

/**
 * Module dependencies.
 */

var cluster = require('../');

// Spawn four workers in standalone mode: .start() runs the cluster
// without binding a server socket.
var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .start();

if (proc.isWorker) {
  // CLUSTER_WORKER is the worker id assigned by the master.
  var id = process.env.CLUSTER_WORKER;
  console.log('  worker #%d started', id);
  setInterval(function(){
    console.log('  processing job from worker #%d', id);
  }, 1000);
}

When you start the cluster, everything goes fine:

$ node examples/standalone.js 
  info - master started
  info - worker 0 spawned
  info - worker 1 spawned
  info - worker 2 spawned
  info - worker 3 spawned
  info - listening for connections
  worker #0 started
  info - worker 0 connected
  worker #1 started
  info - worker 1 connected
  worker #2 started
  info - worker 2 connected
  worker #3 started
  info - worker 3 connected

Node processes:

$ ps aux | grep node
   11537 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11538 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11539 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11540 /opt/local/bin/node ~/node-cluster/examples/standalone.js #worker
   11536 node examples/standalone.js #master

SIGUSR2 spawns new workers but doesn't kill the old ones

$ kill -USR2 11536
$ ps aux | grep node
   11537 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11538 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11539 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11540 /opt/local/bin/node ~/node-cluster/examples/standalone.js #old worker
   11536 node examples/standalone.js #old master
   11625 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11626 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11627 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11628 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new worker
   11624 /opt/local/bin/node ~/node-cluster/examples/standalone.js #new master

OK, so this problem appears only if you use cluster.start() (i.e. without binding a socket). If you edit standalone.js and change .start() to .listen('/path/to/socket'), the restart works correctly:

var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .listen('/tmp/cluster.sock');
Member
tj commented Jul 18, 2011

interesting, I'll have a look at this. Thanks for the report.

Member
tj commented Jul 18, 2011

should be fixed in the most recent release, forgot to tag this issue

ybogdanov commented Jul 18, 2011

nope, it doesn't :)
I'll try to write a test for it.

With 0.6.6, even listen() behaves the same as start():

var proc = cluster()
  .set('workers', 4)
  .use(cluster.debug())
  .listen('/tmp/cluster.sock');

(SIGUSR2 doesn't kill the old processes.)

felixge added a commit to felixge/cluster that referenced this issue Jul 19, 2011:
Provide test case for #125 (6da2f0e)
Contributor
felixge commented Jul 19, 2011

I have a test case:

felixge@6da2f0e

It's not perfect, but in most cases it should exit with 1 right now. The problem is a race condition here:

https://github.com/LearnBoost/cluster/blob/master/lib/master.js#L720

There is no guarantee that 'listening' has not already fired by the time the new master receives the 'connectMaster' instruction from its parent. This race condition probably affects both regular and standalone clusters, but standalone ones are more likely to run into it because their 'listening' event fires earlier:

https://github.com/LearnBoost/cluster/blob/master/lib/master.js#L275

I'm not entirely sure which fix to go for, but one idea is to keep track of whether 'listening' has already fired, and if so, not wait for it in the connectMaster() method.
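
Something along these lines, for instance (a minimal sketch of the idea; the whenListening() helper and emitter names are hypothetical, not cluster's actual code):

var EventEmitter = require('events').EventEmitter;

var master = new EventEmitter();
var listening = false;

// Record that the event fired so late subscribers can check.
master.on('listening', function(){
  listening = true;
});

// Hypothetical helper: invoke fn once 'listening' has fired, even if
// that already happened before anyone asked to wait for it.
function whenListening(fn) {
  if (listening) return fn();
  master.once('listening', fn);
}

// connectMaster() could use whenListening() instead of subscribing
// directly, which closes the race:
master.emit('listening');
whenListening(function(){
  console.log('safe to proceed, the master is already listening');
});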

--fg

Member
tj commented Jul 19, 2011

yeah, I think we can get away with simply tracking whether it has happened yet. It's not strictly required, but I think it's best, both now and for some future functionality, that the parent sticks around until the child is fully booted and ready to serve.
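
For example, the handoff could look roughly like this (a hypothetical sketch using a child_process IPC message, not what cluster actually does):

var fork = require('child_process').fork;

if (process.send) {
  // We are the replacement master: finish booting, then tell the
  // parent we are ready to serve.
  process.send('ready');
} else {
  // We are the old master: spawn the replacement and stick around
  // until it reports ready, so there is no gap in service.
  var child = fork(__filename);
  child.on('message', function(msg){
    if (msg === 'ready') process.exit(0);
  });
}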

tj added a commit that referenced this issue Jul 19, 2011:
Provide test case for #125 (fb651b0, by felixge and tj)
tj closed this in eed4d36 Jul 19, 2011
Member
tj commented Jul 19, 2011

thanks for the test case! I didn't realize until yesterday that I had none for stand-alone.

Contributor
felixge commented Jul 19, 2011

Thanks for the fix! I didn't realize that cluster had a standalone mode until I saw this ticket : )! That's really awesome for running worker scripts!

Member
tj commented Jul 19, 2011

totally! It works great with Kue

ybogdanov commented Jul 19, 2011

Thank you, guys!

I'm working on node-cloud, queue-based job processing in the cloud. Cluster's standalone feature makes it multicore-scalable in a few lines of code!

Great job!

Member
tj commented Jul 19, 2011

@0ctave cool man!

Contributor
felixge commented Jul 19, 2011

> totally! It works great with Kue

Wow, that looks really rad. I'll check it out, if I can use it I'll send patches : )
