pubsub: Receive doesn't call the callback function upon new unacknowledged messages #740
Please roll back to v0.11.0 while we investigate. |
Rolling back to v0.11.0. Thank you for the quick answer. |
Unless it gives you pleasure, I wouldn't worry about the git bisect. It's almost certainly the change from Pull to StreamingPull that causes this. I will try to reproduce. |
I suspect there is some network issue that is disconnecting your stream, and the client is not properly recovering from that. But I can't reproduce. I ran this program over the weekend. I published 800 messages at 1-hour intervals and received them in a separate process. I also tried a 2-hour interval. In both cases I was still receiving messages after many hours. My 2-hour run looked like:
and in another shell:
Can you try to reproduce with my code? Also, I would like to see if any errors are being returned from the low-level stream calls. Maybe you could instrument your system to log them? The relevant places in the v0.12.0 code are:
Thanks in advance for your help. |
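For anyone attempting the same long-interval repro, the publishing half of such a test can be sketched roughly as follows (placeholder project/topic names and counts, written against the current cloud.google.com/go/pubsub API; this is not jba's actual test program):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// Placeholder project and topic names.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	topic := client.Topic("test-topic")
	defer topic.Stop()

	// Publish a batch of messages, then sleep for a long interval, repeatedly.
	// The long idle gap between batches is what seems to trigger the stalled stream.
	for {
		for i := 0; i < 800; i++ {
			res := topic.Publish(ctx, &pubsub.Message{
				Data: []byte(fmt.Sprintf("message %d at %v", i, time.Now())),
			})
			if _, err := res.Get(ctx); err != nil {
				log.Fatal(err)
			}
		}
		log.Println("published batch; sleeping")
		time.Sleep(2 * time.Hour)
	}
}
```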
It looks like we also get the error with our setup: one of our GKE containers stops receiving Pub/Sub messages after a while, and a restart of the container resumes message processing, until it stops again. We suspected that the |
@yanickbelanger It would be great if you could log the low-level Send and Recv calls as I described above. |
@jba I've just pushed a test service to our cluster with some fmt.Printf calls around the lines you mentioned above. It should receive a message every minute from another service. We'll see how it goes. |
We cannot reproduce the issue in our test. We suspect the relatively short interval between the test messages may prevent our service from reaching the conditions that lead to the issue. We've added a subscription to another topic that publishes messages every 60-180 minutes; we will report back later. BTW, it's probably normal, but even though our test went well, we still see errors in the logs I added as you suggested above:
|
I see the same with a service that pulls from many subscriptions. The ones with frequent messages keep going fine but the subscriptions that only have messages every ~15 minutes stop receiving after they initially drain the queue. |
We've seen the issue with our last test. The service successfully received a few messages within a couple of hours, then stopped receiving the next messages. Once restarted, the service processed the pending messages. Between the last received message and the service restart, we got the following logs (added in
|
@yanickbelanger Just to make sure I understand the logs correctly, are the sends and recvs succeeding? Or is it that the sends and recvs themselves never return? |
@jba I think I know what's going on. On L427 we unlock before Send or Recv so that both can happen at the same time. gRPC allows one Send and one Recv to happen concurrently, but not multiple Sends or multiple Recvs. This is usually OK, since each stream only has one goroutine pulling messages and one sending acks. Looking at the logs, I think it's possible that a Recv fails, so we attempt to reopen the stream (L433). But reopening a stream causes a Send (L392). Now we have two Sends, one from reopening and one from acking, and the two can happen concurrently since the latter happens outside the lock. Please let me know if you think I'm on the right track. EDIT: |
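To make the constraint concrete, here is a minimal sketch of the pattern that avoids the race: serialize all Sends behind a mutex so that an ack and a stream reopen can never write to the stream at the same time. This is a hedged illustration of the idea, not the actual fix in the library; the wrapper type and names are hypothetical.

```go
package pubsubguard

import (
	"sync"

	pb "google.golang.org/genproto/googleapis/pubsub/v1"
)

// streamGuard is a hypothetical wrapper that serializes Send calls on a
// bidirectional gRPC stream. gRPC permits one in-flight Send and one
// in-flight Recv per stream, but never two concurrent Sends (or Recvs).
type streamGuard struct {
	sendMu sync.Mutex
	stream pb.Subscriber_StreamingPullClient
}

// send is safe to call from multiple goroutines (e.g. the ack path and a
// stream-reopen path) because the mutex allows only one Send at a time.
func (g *streamGuard) send(req *pb.StreamingPullRequest) error {
	g.sendMu.Lock()
	defer g.sendMu.Unlock()
	return g.stream.Send(req)
}

// recv may stay unguarded as long as exactly one goroutine calls it.
func (g *streamGuard) recv() (*pb.StreamingPullResponse, error) {
	return g.stream.Recv()
}
```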
I bring good news! I'm finally able to repro. I modified the fetch code to log around the stream call:

```go
log.Println("fetch calling")
err := p.call(func(spc pb.Subscriber_StreamingPullClient) error {
	var err error
	log.Println("inner fetch calling")
	res, err = spc.Recv()
	log.Println("inner fetch returned", res, err)
	return err
})
log.Println("fetch returned", err)
```

and made a similar modification to the send code. I made the publishing half publish 2 messages every 15 minutes. I got these logs:
The call to Recv never returns. @yanickbelanger @dansiemon ~~Could you let us know which gRPC version you're on?~~ Are you running on GCE as well? EDIT: Tested with gRPC v1.6.0 and encountered the same problem. I'll try to repro this from a raw gRPC client as well. |
Yes, I was running in GCE (GKE) w/ gRPC 1.6. |
I can repro this in both Go and Java using only gRPC-generated code. I think the root cause is probably on the server and not the client, especially because Java and Go have different gRPC implementations. @jba I'll open an internal bug for this and CC you. |
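For reference, a StreamingPull client built from only the gRPC-generated stubs looks roughly like the sketch below (endpoint, credentials, and subscription names are placeholders; this is not the exact repro program from the comment above):

```go
package main

import (
	"context"
	"log"

	pb "google.golang.org/genproto/googleapis/pubsub/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/oauth"
)

func main() {
	ctx := context.Background()
	// Application-default credentials with the Pub/Sub scope (placeholder setup).
	perRPC, err := oauth.NewApplicationDefault(ctx, "https://www.googleapis.com/auth/pubsub")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := grpc.Dial("pubsub.googleapis.com:443",
		grpc.WithTransportCredentials(credentials.NewClientTLSFromCert(nil, "")),
		grpc.WithPerRPCCredentials(perRPC))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	stream, err := pb.NewSubscriberClient(conn).StreamingPull(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// The first request on the stream names the subscription.
	if err := stream.Send(&pb.StreamingPullRequest{
		Subscription:             "projects/my-project/subscriptions/my-sub",
		StreamAckDeadlineSeconds: 60,
	}); err != nil {
		log.Fatal(err)
	}
	for {
		// Block waiting for the next batch of messages from the server.
		res, err := stream.Recv()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("received %d messages", len(res.ReceivedMessages))
		var ackIDs []string
		for _, m := range res.ReceivedMessages {
			ackIDs = append(ackIDs, m.AckId)
		}
		// Ack the batch on the same stream.
		if err := stream.Send(&pb.StreamingPullRequest{AckIds: ackIDs}); err != nil {
			log.Fatal(err)
		}
	}
}
```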
@pongad Yes I'm running on GKE. |
@pongad what is the best way to track progress of the internal bug? |
We'll update this issue when we have more information. |
I inadvertently updated to the affected version and ran into this as well. As a quick fix, I changed the clusters to CPU-based autoscaling, which worked for a few hours and then broke again, as predicted by a previous comment on this issue (i.e., the problem is that messages stop being consumed at all after a few hours, regardless of the autoscaling type). If any further problems crop up, I will post them here. |
We had the exact same issue in our production. |
Version: v0.12.0

We have discovered that one of our Go programs, which should consume ~800 messages once every hour, is not correctly receiving the Pub/Sub messages that should be handed to the callback function passed to Receive. At startup, it correctly gets the messages that are already available and unacknowledged. However, one hour later, when the ~800 new messages arrive, nothing happens. After restarting a new instance, it works again, then blocks again an hour later when the next messages arrive, and so on.

By adding debug Print* statements, we discovered that it's actually blocked at this select from iterator.go.

What could be the issue?
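For context, the consuming side looks roughly like the following minimal sketch (placeholder project and subscription names, written against the current cloud.google.com/go/pubsub API; not the reporter's actual program):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// Placeholder project and subscription names.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("my-sub")

	// Receive blocks, invoking the callback for each message.
	// The symptom in this issue: after an idle period, the callback
	// stops being invoked even though new messages are pending.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		log.Printf("got message: %s", m.Data)
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```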