Timeout param not working properly #128

Open
iwiznia opened this Issue Mar 27, 2017 · 23 comments

@iwiznia
Contributor

iwiznia commented Mar 27, 2017

I don't know if this is something particular to the Jobs plugin or something that affects all of Bedrock.
The timeout param is not working. I tested this by:

  • Clearing the jobs table.
  • Starting a GetJob request with a timeout of 60 seconds and Connection: wait.
  • Creating a new job to be processed now.
  • The job is not returned; the GetJob request always times out.
  • Making the same GetJob query again: the job is returned.

/cc @tylerkaraszewski @quinthar

@tylerkaraszewski
Contributor

tylerkaraszewski commented Mar 27, 2017

Let's fix this in multi-write bedrock if it's not urgent.

@iwiznia iwiznia removed their assignment Mar 27, 2017

@quinthar
Contributor

quinthar commented Mar 28, 2017

@iwiznia Can you share exactly what test you ran? For example, copy/paste your terminal session when testing with netcat.

@iwiznia
Contributor

iwiznia commented Mar 29, 2017

Steps I took:

  • Clear the jobs table: delete from jobs
  • Using psysh I did:
$b = new Expensify\Bedrock\Client();
$j = new Expensify\Bedrock\Jobs($b);
$j->getJob('www-prod/*', 60000);
  • Using a different psysh session, I did:
$b = new Expensify\Bedrock\Client();
$j = new Expensify\Bedrock\Jobs($b);
$j->createJob('www-prod/SampleWorker');
  • The 1st psysh session didn't immediately return a result, and when it did return, it didn't include the newly created job.
@quinthar
Contributor

quinthar commented Mar 29, 2017

@iwiznia
Contributor

iwiznia commented Mar 29, 2017

It's the same thing, but here you go:

  1. On one nc instance:
```
GetJob
name:www-prod/*
Connection:wait
timeout:60000
```
  2. On another one:
```
CreateJob
name:www-prod/SampleWorker
data:[]
firstRun:2017-03-28
```
  3. Back to the first one, the command keeps waiting till it hits the timeout and returns:
```
303 Timeout
commitCount: 997
nodeName: bedrock
processingTime: 1
totalTime: 60105
waitTime: 60103
Content-Length: 0
```
  4. Retried the same command immediately:
```
GetJob
name:www-prod/*
Connection:wait
timeout:60000

200 OK
commitCount: 997
nodeName: bedrock
processingTime: 5
totalTime: 5
waitTime: 0
Content-Length: 55

{"data":[],"jobID":426,"name":"www-prod\/SampleWorker"}
```
@quinthar
Contributor

quinthar commented Mar 29, 2017

@tylerkaraszewski
Contributor

tylerkaraszewski commented Mar 29, 2017

Are you sure BWM completely and totally depends on this behavior? I built the version currently in production 954c75063b232bd3d39f56c81e2acce75d380a69 (according to @coleaeason), and tested this with the same results that @iwiznia saw.

@quinthar
Contributor

quinthar commented Mar 29, 2017

@tylerkaraszewski
Contributor

tylerkaraszewski commented Mar 29, 2017

Being that things are working (except that bedrock jobs are delayed by up to a minute if the queue is completely empty), I think we should stick with the original plan and fix it in multi-write bedrock. Then we won't have to fix it twice in two wildly diverging codebases.

@iwiznia
Contributor

iwiznia commented Mar 30, 2017

Yeah, it's not a big deal, I agree we can fix it after multi-write (fun fact: I discovered this by really zooming in on a Grafana graph).
BTW, this happens a lot, since jobs are not backed up most of the time.

@tylerkaraszewski
Contributor

tylerkaraszewski commented Aug 21, 2017

Re-visiting this, I'd like to remove the feature. We're not relying on it for anything now, and it requires big changes to make it work (bedrockServer would need to notify plugins when it receives new commands, etc), which is a waste of effort if we don't need it.

I propose that instead, whatever's calling this just receives a message that there are currently no matching jobs, and it can try again in a minute if it wants.

@quinthar
Contributor

quinthar commented Aug 24, 2017

@iwiznia
Contributor

iwiznia commented Aug 24, 2017

> Granted, we're running in an environment where there are always jobs, but that's also because we're consistently under-provisioned from a hardware perspective

I don't think this is true; this happens a lot even now... Having said that, if the effort to fix it is big, I think we can still punt this. Given that we have several different servers processing jobs, in practice there's probably a very short span between when a job is picked up and when it's processed. As a first step, we can stop sending the timeout in BWM and see how it behaves (it would reduce the wait from 2 minutes to 1, because we do a sleep in PHP after the Bedrock timeout expires).

@quinthar
Contributor

quinthar commented Aug 25, 2017

Ugh, I just hate the idea that we're committing ourselves to waiting up to 60s to process each job. I understand we have bigger problems to fix, and right now we're using jobs in a way where a minute of delay isn't a big deal.

But for example, that's up to an extra minute per SmartScan -- in fact, multiple times over, because the first job creates more jobs (so you pay the delay cost at least twice per receipt). Granted, SmartScan is slow now, but it sure won't get faster if we keep building in extra mandatory delays.

Same goes for Concierge: when a webhook fires from Intercom, this adds yet more delay to every single conversation. When sending email through Bedrock, more delay. We had a solution that didn't have this delay, and we lost it when moving to multi-write. I'm fine not prioritizing this right now, but I really don't want to let this one die.

@iwiznia
Contributor

iwiznia commented Aug 25, 2017

> We had a solution that didn't have this delay, and we lost it when moving to multi-write.

Not that it matters, but are you sure this worked before? I thought it was broken since the beginning...

Also, I think the chance of all servers hitting this condition at the same time is low, so in reality I doubt we actually wait the full minute: when one server encounters this (and waits a minute), any of the other servers can take the job.

So, yes I agree it's not ideal, but I still don't think this is a priority right now.

@quinthar
Contributor

quinthar commented Aug 25, 2017

@tylerkaraszewski
Contributor

tylerkaraszewski commented Aug 25, 2017

We can mitigate this by changing the 60s in the following code to 5s, right?

try {
    // Query the server for a job
    $response = $jobs->getJob($jobName, 60 * 1000); // Wait up to 60s
} catch (Exception $e) {
    // Try again in 60 seconds
    $logger->info('Problem getting job, retrying in 60s', ['message' => $e->getMessage()]);
    sleep(60);
}

So when there are no jobs (i.e., no outstanding load), we make up to 11 extra requests to bedrock per minute, which it can easily handle.

We totally can make this work, I'm just not sure about the return on investment with it.
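
For illustration, a minimal sketch of that change -- just the snippet above with both 60s values dropped to 5s. It assumes the same $jobs client, $jobName, and $logger objects as above, that this sits inside BWM's existing worker loop, and that a 303 Timeout with no job surfaces here as an exception; none of that is confirmed in this thread.

```
try {
    // Query the server for a job, but only wait up to 5s (was 60s)
    $response = $jobs->getJob($jobName, 5 * 1000);
} catch (Exception $e) {
    // No job yet (or some other problem); retry in 5 seconds (was 60).
    // An idle worker now polls every few seconds instead of roughly once a
    // minute -- at most the "up to 11 extra requests per minute" noted above.
    $logger->info('Problem getting job, retrying in 5s', ['message' => $e->getMessage()]);
    sleep(5);
}
```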

@iwiznia
Contributor

iwiznia commented Aug 28, 2017

> We can mitigate this by changing the 60s in the following code to 5s, right?

Well, if we don't want to fix this now, then I say we remove the timeout altogether from the getJob call, since we know that it runs the query and if there are no results, it will just sleep, same as the PHP code does.
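
For comparison, a minimal sketch of that alternative -- dropping the server-side wait entirely and relying only on the existing PHP sleep. This assumes the timeout argument to getJob() is optional in the PHP client and that an empty queue still surfaces here as an exception; neither is confirmed in this thread.

```
try {
    // No server-side wait: if there's no matching job, this comes back
    // right away instead of holding the connection open for 60s.
    $response = $jobs->getJob($jobName);
} catch (Exception $e) {
    // Nothing to do; sleep in PHP and poll again, as the existing code does.
    // Worst-case pickup delay is now ~60s (the PHP sleep) rather than the
    // ~2 minutes you get when a 60s server-side wait stacks on top of it.
    $logger->info('No job available, retrying in 60s', ['message' => $e->getMessage()]);
    sleep(60);
}
```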

@quinthar
Contributor

quinthar commented Sep 5, 2017

@iwiznia
Contributor

iwiznia commented Sep 5, 2017

One thing to note: a while ago we allowed ourselves to read/write the jobs table directly (using the DB plugin, not the Jobs plugin). So if someone updates a job that way, the queue of held commands will not be cleared. This is probably still fine since we don't update jobs via the DB plugin that often, but I wanted to point it out anyway.

@tylerkaraszewski
Contributor

tylerkaraszewski commented Sep 5, 2017

We have to fix this first before this can work:
https://github.com/Expensify/Bedrock/blob/master/BedrockServer.cpp#L977

Currently, disconnecting doesn't do anything with the command queue; the command just gets processed and a warning is issued when the socket to respond to is gone. This is unavoidable (it will always be possible that the socket disconnects after processing is complete but before the response is sent), and since we generally respond immediately anyway, it doesn't change much. For this, we'd need to make a disconnection explicitly cancel any outstanding requests. This also has to propagate correctly up from slaves to master.

Steps 1 and 3 in David's suggestion above introduce a bunch of co-dependency between BedrockServer and the Jobs plugin. The Jobs plugin needs access to all of BedrockServer's queue state, and the mutexes that protect it (the sync thread doesn't dequeue commands in multi-write bedrock). CreateJob can't just run through everything in the queue and un-mark it as held; there are 16 (or more) other threads that might currently be looking at commands that specify wait.

If thread 1 is peeking a GetJob request with wait set, and thread 2 gets a CreateJob request, thread 2 can un-set held on all the commands currently in the queue, which doesn't include the one currently being handled by thread 1 (and may include none at all). So when thread 1 finishes, it can still end up putting its GetJob command back in the queue to wait, where it will sit until it times out, even though there's now a job available.

If we're going to do this, I'd rather let the Jobs plugin handle almost all of it, rather than having BedrockServer implement a bunch of special cases just for this plugin. Then the plugin could handle its own mutex-protected queues and such.

  1. Fix the propagation of disconnected sockets so that this cancels commands.
  2. Change the plugin API to let peekCommand take ownership of the command. Instead of peekCommand marking a request as held, it just returns an indication that it now owns the command.
  3. Jobs::processCommand, when handling a CreateJob command, can lock its own list of held commands. This can prevent any other getJob commands from completing until it's finished.
  4. Upon finishing, it can re-queue one of the commands it had taken ownership of previously.
  5. None of this happens on slaves, they escalate any getJob command when no jobs are found and wait is set, and master will eventually respond with a job.

This lets all the special jobs-specific logic live in the Jobs plugin, which I think is way more maintainable in the long run, and keeps BedrockServer relatively simple, without mucking it up with special cases for different plugins. It's complex enough as it is with all the multi-threaded logic.

Anyway, we can do this if it's worth the effort.

@quinthar
Contributor

quinthar commented Sep 6, 2017

> Currently, disconnecting doesn't do anything with the command queue; the command just gets processed and a warning is issued when the socket to respond to is gone. This is unavoidable (it will always be possible that the socket disconnects after processing is complete but before the response is sent), and since we generally respond immediately anyway, it doesn't change much. For this, we'd need to make a disconnection explicitly cancel any outstanding requests. This also has to propagate correctly up from slaves to master.

Hm, didn't it handle this correctly before? I believe it previously detected failed client connections, sent a CANCEL_ESCALATE (or something like that), and cancelled the request. Does it not do that anymore? If so, we should add that back in.

As for the plan overall, I'm still not sure how this is better or simpler than the original plan -- which was actually very simple. I think the concept of a plugin "holding" a request isn't necessarily unique to a job queue: it could be interesting to "hold" waiting for someone to push a value into a cache, for example. Auth could even have someone "hold" waiting for a report action (e.g., have a web socket subscribe to a report's history changes, without Pusher). Or hold on a receipt's SmartScan status changing.

Regardless, I think it's a powerful feature that we could build upon, and isn't necessarily unique to Jobs.

@iwiznia
Contributor

iwiznia commented Sep 7, 2017

I agree it's a powerful feature we could build upon, but given that we don't have a compelling use case right now, does it make sense to build this now?
