Working around abysmal google.cloud.pubsub.Subscription.pull() delays? #24

Closed
wlongabaugh opened this Issue Mar 14, 2017 · 10 comments

wlongabaugh commented Mar 14, 2017

So I'm using psq in a light-demand environment, and am encountering abysmal latency on the workers pulling jobs off the queue. As in 10 minutes when things are "speedy", and up to 25 minutes in some cases. I see that the code depends on the behaviour of google.cloud.pubsub.Subscription.pull() in blocking mode. Would modifying the psq code to do its own polling on the unblocked pull() improve that, or would the underlying subscription service still be taking its own sweet time to deliver messages even if we were banging on it every 10 seconds? I can dig into this, but thought you might already have some insights on this problem.

Member

theacodes commented Mar 15, 2017

I think this is a backend issue, but I'm not sure. We're using pull() which should be streaming, but even if it wasn't it shouldn't have that much latency.

@tmatsuo @lukesneeringer any ideas?

wlongabaugh commented Mar 15, 2017

So if I regularly ping (every ~10 sec) a no-op task for the Worker to execute, I can get the one genuine task I need done delivered in a reasonable time (< 1 minute). It sure looks like PubSub is configured to handle high-throughput messages really well, but is falling down with very high latency on low throughput tasks.
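The periodic no-op ping described above could be automated along these lines. This is only a sketch of the workaround, not something psq provides: `noop` and `start_keepalive` are hypothetical names, and `enqueue` is assumed to behave like psq's `Queue.enqueue` (it accepts a callable and schedules it for a worker).

```python
import threading


def noop():
    """Hypothetical no-op task; it exists only to keep messages flowing."""
    return None


def start_keepalive(enqueue, interval_seconds=10.0):
    """Call enqueue(noop) every interval_seconds on a daemon thread.

    `enqueue` is an assumption: any callable that accepts a task function,
    for example the enqueue method of a psq queue.
    Returns a threading.Event; set it to stop the pinger.
    """
    stop = threading.Event()

    def run():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop pings on every timeout and exits promptly on stop.
        while not stop.wait(interval_seconds):
            enqueue(noop)

    thread = threading.Thread(target=run, name="keepalive-ping", daemon=True)
    thread.start()
    return stop
```

Something like `stop = start_keepalive(q.enqueue)` would start the pings, and `stop.set()` would end them. The trade-off is a steady trickle of throwaway messages, so the interval is worth tuning against whatever latency is acceptable.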

lukesneeringer commented Mar 15, 2017

> We're using pull() which should be streaming, but even if it wasn't it shouldn't have that much latency.

It is not streaming. (A new streaming_pull method is planned.)

lukesneeringer commented Mar 15, 2017

> So I'm using psq in a light-demand environment, and am encountering abysmal latency on the workers pulling jobs off the queue. As in 10 minutes when things are "speedy", and up to 25 minutes in some cases. I see that the code depends on the behaviour of google.cloud.pubsub.Subscription.pull() in blocking mode. Would modifying the psq code to do its own polling on the unblocked pull() improve that, or would the underlying subscription service still be taking its own sweet time to deliver messages even if we were banging on it every 10 seconds? I can dig into this, but thought you might already have some insights on this problem.

First, thanks for reporting this. I am hoping to dig into improving the PubSub client. Among other things, moving to non-blocking methods is a high priority for the PubSub team.

I expect your diagnosis is pretty much accurate. You are not finding out about the task being done because you are not asking the server for it, and there is no automated background ping-and-run-callback at this time.

Can you share with me what an ideal fix would be for your use case? I am imagining something like having the subscription accept a callback callable which gets called as messages come in (and of course, we would ask for messages on a cadence with a non-blocking call).
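A rough sketch of the callback idea floated above, for illustration only. Nothing here is an existing client API: `poll_with_callback` is a hypothetical helper, and it assumes `pull()` returns immediately with a (possibly empty) list of `(ack_id, message)` pairs and that `ack(ack_ids)` acknowledges them. How those two callables map onto the installed google-cloud-pubsub version would need to be checked against its documentation.

```python
import time


def poll_with_callback(pull, ack, callback, interval_seconds=10.0,
                       max_polls=None, sleep=time.sleep):
    """Poll a non-blocking pull on a cadence and pass messages to callback.

    Assumptions (not taken from the thread): pull() never blocks and
    returns a list of (ack_id, message) pairs; ack(ack_ids) acknowledges.
    max_polls=None polls forever; a number bounds the loop (handy in tests).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        received = pull()
        for ack_id, message in received:
            callback(message)
            # Acknowledge only after the callback returned without raising,
            # so a failed message stays eligible for redelivery.
            ack([ack_id])
        polls += 1
        if not received:
            # Nothing was waiting; rest until the next scheduled poll.
            sleep(interval_seconds)
```

With this shape the worst-case pickup delay on the client side is roughly `interval_seconds`. Whether the service itself holds messages back, as the original question asks, is a separate matter this loop cannot answer.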

wlongabaugh commented Mar 15, 2017

Hmm... what you are suggesting sounds like more than I need. I'm just using psq on top of the underlying PubSub layer, and I don't think I need a callback. While it is my understanding that PubSub simply guarantees a message is delivered once within some bounded time frame, I have no idea what that time frame is. My specific use case is that I need a distributed task queue. I have a job that is kicked off from a web form submission that can take a very long time, but these calls are pretty infrequent (even once an hour is unlikely). But some jobs can be short, and the user has a reasonable expectation that such jobs will complete pretty quickly (e.g. in 3 minutes, not 30).

So, at the moment, I have no visibility on whether there might be messages queued up for delivery, but which are perhaps being held off waiting for messages to accumulate. And how long I might have to wait for that to happen. I have only briefly dug down in the psq code to look at the PubSub layer, but I'm not seeing a way to e.g. configure whether I value throughput (not so much) versus latency (a lot) in a topic.

Thanks!

bsinnottkabam commented Mar 30, 2017

Just want to chime in and say I'm facing the same issue as @wlongabaugh. I'm using PubSub+psq as a task queue on flexible environment for rare, long-duration tasks and am often finding it takes 10+ mins for workers to pull new tasks off the queue.

wlongabaugh commented Mar 30, 2017

Glad to hear I'm not alone, @bsinnottkabam. My workaround is that I stuff a pile (10) of no-op tasks on the queue, then the one I want, followed by another pile of no-op tasks. This flushes the queue immediately. Still hoping for more timely delivery of occasional messages.

ajpharrington commented Apr 24, 2017

I am also facing this issue. It takes 2-10 minutes for the worker to start processing the task from the queue. I will try the workaround mentioned by @wlongabaugh.

wlongabaugh commented Apr 25, 2017

@ajpharrington: After further testing, I find that the arrival of a second message will force the first one out. So it may be enough to follow the desired message immediately with a single empty "ping" job.
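For anyone trying this, the "follow the real job with an empty ping" workaround might look roughly like the sketch below. The names `ping` and `enqueue_with_ping` are hypothetical, and `enqueue` is again assumed to behave like psq's `Queue.enqueue` (a callable plus its arguments).

```python
def ping():
    """Hypothetical empty job whose only purpose is to follow the real one."""
    return None


def enqueue_with_ping(enqueue, task, *args, pings_after=1, **kwargs):
    """Enqueue task, then pings_after empty jobs right behind it.

    `enqueue` is an assumption: a callable taking (func, *args, **kwargs).
    Returns whatever enqueue returned for the real task.
    """
    result = enqueue(task, *args, **kwargs)
    for _ in range(pings_after):
        # Each trailing ping is a second message meant to push the
        # real one out of the queue.
        enqueue(ping)
    return result
```

A call such as `enqueue_with_ping(q.enqueue, long_job, form_data)` (where `long_job` and `form_data` are placeholders) sends one real task plus one ping. Raising `pings_after` would approximate the earlier "pile of no-op tasks" variant.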

Member

theacodes commented Dec 6, 2017

We've been working hard to improve the underlying pubsub library. Hopefully this issue is resolved with #34. But, if it's not, feel free to comment.

@theacodes theacodes closed this Dec 6, 2017
