Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PubSub: set_exception can only be called once. can still occur "erroneously" #4463

Closed
dhermes opened this issue Nov 27, 2017 · 9 comments
Closed
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release.

Comments

@dhermes
Copy link
Contributor

dhermes commented Nov 27, 2017

For example, If I add

_LOGGER.debug('thread.Policy.on_exception: %s', exception)

just before future.set_exception(), then, when forcing a failure by:

from google.cloud.pubsub_v1.subscriber.policy import base
# Monkey-patch to intentionally miss errors.
base.BasePolicy._RETRYABLE_STREAM_ERRORS = ()

I see (focus on 00117233 and 00185066):

...
00116110:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :The current p99 value is 10 seconds.
00116111:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Renewing lease for 0 ack IDs.
00116111:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Snoozing lease management for 2.555879 seconds.
00117233:DEBUG:google.cloud.pubsub_v1.subscriber.policy.thread:Consumer helper: consume bidirectional stream:thread.Policy.on_exception: 503 The service was unable to fulfill your request. Please try agai
n. [code=8a75]
00117236:DEBUG:google.cloud.pubsub_v1.subscriber._consumer:Thread-23             :Sending initial request: subscription: "projects/precise-truck-742/subscriptions/s-djh-local-1511822800630"
stream_ack_deadline_seconds: 10

00118670:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :The current p99 value is 10 seconds.
00118670:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Renewing lease for 0 ack IDs.
00118670:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Snoozing lease management for 1.883336 seconds.
...
00181929:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :The current p99 value is 10 seconds.
00181929:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Renewing lease for 0 ack IDs.
00181929:DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Thread-8              :Snoozing lease management for 7.864327 seconds.
00185066:DEBUG:google.cloud.pubsub_v1.subscriber.policy.thread:Consumer helper: consume bidirectional stream:thread.Policy.on_exception: 503 The service was unable to fulfill your request. Please try again. [code=8a75]
Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File ".../google/api_core/grpc_helpers.py", line 51, in error_remapped_callable
    return callable_(*args, **kwargs)
  File ".../grpc/_channel.py", line 342, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../google/cloud/pubsub_v1/subscriber/_consumer.py", line 235, in _blocking_consume
    for response in response_generator:
  File ".../grpc/_channel.py", line 348, in __next__
    return self._next()
  File ".../google/api_core/grpc_helpers.py", line 53, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 The service was unable to fulfill your request. Please try again. [code=8a75]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "${HOME}/.pyenv/versions/3.6.3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "${HOME}/.pyenv/versions/3.6.3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File ".../google/cloud/pubsub_v1/subscriber/_consumer.py", line 249, in _blocking_consume
    self._policy.on_exception(exc)
  File ".../google/cloud/pubsub_v1/subscriber/policy/thread.py", line 163, in on_exception
    self._future.set_exception(exception)
  File ".../google/cloud/pubsub_v1/futures.py", line 159, in set_exception
    raise RuntimeError('set_exception can only be called once.')
RuntimeError: set_exception can only be called once.

This is PROGRESS! It shows 67.833 seconds after the first UNAVAILABLE, a second one occurs and also tries to call set_exception().

This reveals the real bug: the subscriber doesn't exit after the first call to set_exception().


  • FOOTNOTE 1: This is what remains after Pub/Sub Subscriber does not catch & retry UNAVAILABLE errors #4234.
  • FOOTNOTE 2: The 67 second gap between UNAVAILABLE agrees with what we've previously seen. That is, after ~60 seconds of inactivity, a gRPC client will raise a local UNAVAILABLE (i.e. no round trip to the backend) and then it will become magically usable immediately after raising.
@dhermes dhermes added api: pubsub Issues related to the Pub/Sub API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Nov 27, 2017
@dhermes dhermes self-assigned this Nov 27, 2017
@theacodes
Copy link
Contributor

I also got this if I try to listen on a subscription that doesn't exist, FYI.

Listening for messages on projects/python-docs-samples-tests/subscriptions/my-subs
Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 51, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/grpc/_channel.py", line 342, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.NOT_FOUND, Resource not found (resource=my-subs).)>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 235, in _blocking_consume
    for response in response_generator:
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/grpc/_channel.py", line 348, in __next__
    return self._next()
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 53, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.NotFound: 404 Resource not found (resource=my-subs).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jonwayne/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/jonwayne/.pyenv/versions/3.6.0/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py", line 249, in _blocking_consume
    self._policy.on_exception(exc)
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 162, in on_exception
    self._future.set_exception(exception)
  File "/Users/jonwayne/workspace/python-docs-samples/pubsub/cloud-client/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/futures.py", line 159, in set_exception
    raise RuntimeError('set_exception can only be called once.')
RuntimeError: set_exception can only be called once.

@dhermes
Copy link
Contributor Author

dhermes commented Nov 28, 2017

Good data, thanks Jon!

@dhermes
Copy link
Contributor Author

dhermes commented Nov 28, 2017

@jonparrott your example will trigger two NotFound exceptions ~100ms apart. More importantly, this gap is too small for the active = False switch to take place and so we still have the "set_exception can only be called once" issue. (Thanks for providing such an easy to use case!)

@chemelnucfin chemelnucfin changed the title set_exception can only be called once. can still occur "erroneously" PubSub: set_exception can only be called once. can still occur "erroneously" Nov 29, 2017
dhermes added a commit to dhermes/google-cloud-python that referenced this issue Nov 29, 2017
@shileiyu
Copy link

I think if there's any none retry-able exception occurred, the helper loop should just exit immediately by calling stop_consume.

@dhermes
Copy link
Contributor Author

dhermes commented Nov 29, 2017

@shileiyu I totally agree with you. I am currently wrestling this issue. The problem is that if it calls stop_comsuming(), then the current thread will try to join() itself. So I am trying to figure out the right thing to do (i.e. set self.active = False, set the self._exiting event and then signal to some other thread that this one is a zombie)

@shileiyu
Copy link

@dhermes
Maybe just break out of the loop?

BTW: I don't think the KeyboardInterrupt would ever happen on the helper thread, not only because only main thread would receive the system signals, but also if KeyboardInterrupt ever occurred before, the thread should stuck as well.

@dhermes
Copy link
Contributor Author

dhermes commented Nov 29, 2017

Maybe just break out of the loop?

We can effectively do this by setting self.active = False and calling self._exiting.set(), but that still leaves a zombie thread.

BTW: I don't think the KeyboardInterrupt would ever happen

I haven't been able to trigger the KeyboardInterrupt branch, so that doesn't surprise me.

@dhermes
Copy link
Contributor Author

dhermes commented Nov 30, 2017

OK I am going to cut 0.29.2 with this (I'm going to quickly edit #4480 and merge it first)

@dhermes
Copy link
Contributor Author

dhermes commented Nov 30, 2017

OK https://pypi.org/project/google-cloud-pubsub/0.29.2/ is officially released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release.
Projects
None yet
Development

No branches or pull requests

3 participants