
fix race in mock publisher #1560

Merged: pongad merged 2 commits into googleapis:pubsub-hp from pongad:sync-reply-pub on Jan 25, 2017
Conversation

@pongad (Contributor) commented Jan 24, 2017

FakePublisherServiceImpl::publish has a race.
If the call to publish happens before a response has been placed,
the server will try to respond with a null object.

The fix is to either

  • always set the response before calling, or
  • make publish wait for the response.

For good measure, this commit does both; a minimal sketch follows below.
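A minimal sketch of the "wait for the response" half of the fix, assuming the fake keeps its canned replies in a BlockingQueue; the class, field, and message types here are placeholders, not the actual FakePublisherServiceImpl code:

```java
import io.grpc.stub.StreamObserver;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Placeholder fake: canned replies (responses or transient errors) are queued
// by the test, and publish blocks until one is available instead of reading null.
class FakePublisherSketch {
  private final BlockingQueue<Object> replies = new LinkedBlockingQueue<>();

  void addResponse(Object response) {
    replies.add(response);
  }

  void addError(Throwable error) {
    replies.add(error);
  }

  void publish(Object request, StreamObserver<Object> responseObserver) {
    final Object reply;
    try {
      // take() waits for the test thread to supply a reply, eliminating the race.
      reply = replies.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      responseObserver.onError(e);
      return;
    }
    if (reply instanceof Throwable) {
      responseObserver.onError((Throwable) reply);
    } else {
      responseObserver.onNext(reply);
      responseObserver.onCompleted();
    }
  }
}
```

The "set the response before calling" half of the fix then makes take() return immediately in the common case.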

This fix reveals another flake.
Publisher uses exponential backoff with jitter:
the jitter randomly picks a delay between 0 and a maximum.
If low values are picked too many times in a row,
the publisher retries too quickly and the server runs out of canned
transient errors to respond with.
The test still passed because it expected any Throwable.
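A hedged sketch of the kind of backoff-with-jitter loop described above; this is illustrative only, not the actual gax retry code, and tryPublish is a stand-in for the real RPC:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Illustrative exponential backoff with full jitter; not the gax implementation.
class BackoffSketch {
  private final Random random = new Random();

  void publishWithRetry() throws InterruptedException {
    long delayMillis = 100;              // current backoff ceiling
    final long maxDelayMillis = 10_000;  // never grow the ceiling beyond this

    for (int attempt = 0; attempt < 10; attempt++) {
      if (tryPublish()) {
        return;
      }
      // Jitter: sleep a random amount in [0, delayMillis). A run of low draws
      // retries almost immediately, which is what exhausted the fake server's
      // canned transient errors in the test.
      long jittered = (long) (random.nextDouble() * delayMillis);
      TimeUnit.MILLISECONDS.sleep(jittered);
      delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
    }
  }

  // Stand-in for the actual publish RPC.
  private boolean tryPublish() {
    return false;
  }
}
```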

This commit fixes the test to expect FakeException
and uses a fake "random" number generator to ensure
we don't run out of return values.
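A sketch of that fake "random" source, assuming the jitter ultimately reads from a java.util.Random that the test can replace; the injection point is hypothetical, and the real test may use a different seam:

```java
import java.util.Random;

// Deterministic "random" source for tests. Every draw comes back near the top
// of the range, so each retry waits close to the full backoff and the fake
// server's canned errors are not exhausted.
class AlwaysHighRandom extends Random {
  @Override
  public double nextDouble() {
    return 0.99;
  }
}
```

A test would hand an instance of this in wherever the jitter draws its values, instead of a real Random.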

Retrying can still cause random test failures,
independently of the changes above.
If a request fails due to DEADLINE_EXCEEDED,
the future is completed with a corresponding error,
but the last RPC might not have been successfully cancelled.
When a new test starts, it gives canned responses to the server,
and the server might use some of them to respond to
RPCs left over from previous tests.
Consequently, a misbehaving test can fail every test that comes after it.
This commit changes the test setup code so that it
creates a new fake server for every test, avoiding this problem.
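A hedged sketch of per-test server setup using grpc-java's in-process transport; the harness in this repo may be structured differently, and newFakePublisherService() is a placeholder for constructing the actual fake:

```java
import io.grpc.BindableService;
import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.inprocess.InProcessChannelBuilder;
import io.grpc.inprocess.InProcessServerBuilder;
import java.util.UUID;
import org.junit.After;
import org.junit.Before;

// Each test gets its own in-process server, so canned responses or unfinished
// RPCs from one test can never bleed into the next.
public class PublisherTestSketch {
  private Server server;
  private ManagedChannel channel;

  @Before
  public void setUp() throws Exception {
    String name = "publisher-test-" + UUID.randomUUID();
    server = InProcessServerBuilder.forName(name)
        .addService(newFakePublisherService()) // fresh fake per test
        .directExecutor()
        .build()
        .start();
    channel = InProcessChannelBuilder.forName(name).directExecutor().build();
  }

  @After
  public void tearDown() {
    channel.shutdownNow();
    server.shutdownNow();
  }

  // Placeholder: the real tests would construct the repo's fake publisher service here.
  private BindableService newFakePublisherService() {
    throw new UnsupportedOperationException("replace with the actual fake service");
  }
}
```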

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jan 24, 2017
@pongad (Contributor, Author) commented Jan 24, 2017

cc @davidtorres, we still seem to have flakes in SubscriberImplTest. I'll take a look tomorrow.
@garrettjonesgoogle Do we want to make retry jitter in gax (max/2, max) instead of (0, max)? Please see the motivation in the paragraph below the bullets.
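For reference, a small illustration of the two jitter ranges under discussion; this is illustrative arithmetic only, not the gax code, and all names are invented for the example:

```java
import java.util.Random;

// Illustrative comparison of the two jitter ranges; not the gax implementation.
class JitterRanges {
  public static void main(String[] args) {
    Random random = new Random();
    long maxDelayMillis = 1_000;

    // (0, max): a run of low draws can retry almost immediately.
    long fullJitter = (long) (random.nextDouble() * maxDelayMillis);

    // (max/2, max): every retry waits at least half the current ceiling.
    long boundedJitter = maxDelayMillis / 2
        + (long) (random.nextDouble() * (maxDelayMillis / 2));

    System.out.println(fullJitter + "ms vs " + boundedJitter + "ms");
  }
}
```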

@coveralls commented

Coverage Status: Coverage remained the same at 83.189% when pulling 0707971 on pongad:sync-reply-pub into c68968b on GoogleCloudPlatform:pubsub-hp.

@garrettjonesgoogle (Member) commented

Couldn't you also mock the jitter, or the random number provider?

@pongad (Contributor, Author) commented Jan 25, 2017

@garrettjonesgoogle Good idea. PTAL

@coveralls commented

Coverage Status: Coverage decreased (-0.1%) to 83.081% when pulling 8661957 on pongad:sync-reply-pub into c68968b on GoogleCloudPlatform:pubsub-hp.

@garrettjonesgoogle (Member) commented

LGTM, after you update the PR description to match the latest logic.

@pongad pongad merged commit 10bedac into googleapis:pubsub-hp Jan 25, 2017
@pongad pongad deleted the sync-reply-pub branch January 25, 2017 22:29
@pongad (Contributor, Author) commented Jan 25, 2017

Description fixed in PR and commit
