Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix global failed status when multiple fatal checks have mixed results #50

Merged
merged 4 commits into from
Oct 4, 2018

Conversation

endorama
Copy link
Contributor

@endorama endorama commented Oct 2, 2018

Hello, first of all thank you for this library!

I really like the effective and simple implementation.

I found a bug that I suspect is a race condition in the global status when there are multiple fatal checks with mixed order.

Given the presence of multiple fatal checks, some failing and some passing, if the order of execution makes the passing check end last the global status was reported as ok even if other fatal check were in a failed state.

Looking at the code, each goroutine was updating the h.failed "thread-global" variable at the end of it's check cycle. This makes the h.failed variable to have the last value set by a goroutine in order of exectuion time.
If a successful fatal test run after a failing one, the h.failed variable was reporting true.

To reproduce the bug I updated the test cases

TestFailed/
  Should_return_false_if_a_fatally_configured_check_hasn't_errored

and

TestState/
  When_a_fatally-configured_check_fails_and_recovers,_state_should_get_updated_accordingly

to use two checkers instead of one, both fatal but one failing and one passing, ordered such that the passing check was executed after the failing one.
( You can still test this at commit 1fa23aa )

This PR adds a little feature and implements a fix for this behaviour.

Feature

Add Fatal field in the State struct.
This information is really helpful when multiple checks (mixed fatal and not fatal) are present and the global status is failed.
With this information exposed is clear which check is making the global status to fail.

The bugfix is based upon the Fatal field exposed in the State struct, this could be reworked but I found the Fatal to be useful there anyway.

Bugfix

Remove the h.failed boolean and rely only on State information to evaluate the condition.
With this implementation the h.Failed() function relies on safeGetStates to reported last known status.

Thank you for looking into this!

Fatal information is useful in output to determine which check is
affecting the global status.
Using a global failed boolean updated by each goroutine was producing
a race condition when a fatal failing check was finishing before
a fatal passing check.

With the previous implementation the last fatal check value was always
considered as the global status, even if some other fatal check was
failing.

The new implementation uses the safeGetState in the Failed() function
to check if one of the fatal check is failing and reporting the
failed status based on this condition.
@codecov
Copy link

codecov bot commented Oct 4, 2018

Codecov Report

Merging #50 into master will decrease coverage by 1.16%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #50      +/-   ##
==========================================
- Coverage     100%   98.83%   -1.17%     
==========================================
  Files           6        6              
  Lines         343      344       +1     
==========================================
- Hits          343      340       -3     
- Misses          0        4       +4
Impacted Files Coverage Δ
health.go 100% <100%> (ø) ⬆️
safe.go 81.81% <0%> (-18.19%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3410a5...aefe268. Read the comment docs.

@dselans
Copy link
Contributor

dselans commented Oct 4, 2018

Thanks for your contribution!

On second look, this is pretty obvious... thanks for fixing this bug :)

It's unfortunate that we have to traverse the entire map to determine whether the healthcheck has fatally failed, but it's certainly better than what we have right now.

Thanks again!

@dselans dselans merged commit 73a22a3 into InVisionApp:master Oct 4, 2018
@endorama endorama deleted the fix-global-failed branch October 4, 2018 15:34
@endorama endorama restored the fix-global-failed branch October 4, 2018 15:34
@endorama
Copy link
Contributor Author

endorama commented Oct 4, 2018

Thank you for merging this!

@endorama endorama deleted the fix-global-failed branch October 4, 2018 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants