This adds support for variations of Jenks/Fisher natural breaks for splitting peaks, as an alternative to or replacement for the local-minimum prominence algorithm.
You can find some background info on Wikipedia and here. In computer vision, when applied to an image intensity histogram, this is known as Otsu's method.
Natural breaks finds a point in the waveform that maximizes the 'goodness of split':
Here s is the (weighted) sum of squared deviations from the (weighted) mean, and 'left' and 'right' refer to the left and right parts of the waveform after the split. To be fully precise, for a waveform w(t), the (weighted) mean is
m = sum[ t w(t) ] / sum[ w(t) ]
and
s = sum[ w(t) (t - m)^2 ].
Values near 1 can be interpreted as strong advice to split the peak; values near 0 as strong advice not to do so.
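As a rough illustration of the definitions above, here is a minimal NumPy sketch. It assumes the goodness of split takes the usual natural-breaks form, 1 - (s_left + s_right) / s_total, which matches the "near 1 / near 0" interpretation but is not spelled out in the text here. The function names `sum_sq_dev` and `goodness_of_split` are hypothetical and not taken from this PR's code.

```python
import numpy as np


def sum_sq_dev(w, t):
    """Weighted sum of squared deviations s = sum[w(t) (t - m)^2]."""
    total = w.sum()
    if total <= 0:
        return 0.0  # empty or non-positive segment: no spread defined
    m = (t * w).sum() / total  # weighted mean
    return float((w * (t - m) ** 2).sum())


def goodness_of_split(w):
    """Return (best_index, best_gos) with left = w[:i], right = w[i:].

    Assumed form: gos = 1 - (s_left + s_right) / s_total.
    """
    w = np.asarray(w, dtype=float)
    t = np.arange(len(w), dtype=float)
    s_total = sum_sq_dev(w, t)
    if s_total == 0:
        return 0, 0.0  # nothing to split
    best_i, best_gos = 0, 0.0
    for i in range(1, len(w)):
        s_left = sum_sq_dev(w[:i], t[:i])
        s_right = sum_sq_dev(w[i:], t[i:])
        gos = 1.0 - (s_left + s_right) / s_total
        if gos > best_gos:
            best_i, best_gos = i, gos
    return best_i, best_gos
```

Under this assumed form, two well-separated spikes give a value of 1, while a flat-topped single peak still scores about 0.75 at its midpoint, which is why a threshold (or a modified statistic) is needed before accepting a split.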
In this implementation, the user supplies a threshold function that depends on the peak area. The algorithm evaluates it to decide whether to accept the best split and divide the peak in two. If it does, we recurse on the two parts until they drop below some minimum area, have no acceptable splits left, or we reach a configurable recursion limit.
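The recursion described above could be sketched as follows. This is an illustrative outline only, not the code in this PR: the names `best_split`, `split_peak`, `threshold`, `min_area` and `max_depth` are hypothetical, and the goodness of split is again assumed to be 1 - (s_left + s_right) / s_total.

```python
import numpy as np


def _ssd(w, t):
    # Weighted sum of squared deviations from the weighted mean
    total = w.sum()
    if total <= 0:
        return 0.0
    m = (t * w).sum() / total
    return float((w * (t - m) ** 2).sum())


def best_split(w):
    """Return (index, gos) of the best split; index 0 means 'no split'."""
    t = np.arange(len(w), dtype=float)
    s_total = _ssd(w, t)
    best_i, best_gos = 0, 0.0
    if s_total == 0:
        return best_i, best_gos
    for i in range(1, len(w)):
        gos = 1.0 - (_ssd(w[:i], t[:i]) + _ssd(w[i:], t[i:])) / s_total
        if gos > best_gos:
            best_i, best_gos = i, gos
    return best_i, best_gos


def split_peak(w, threshold, min_area=0.0, max_depth=3, offset=0, depth=0):
    """Recursively split w; return a list of (start, stop) sample ranges.

    threshold: callable mapping peak area -> minimum gos needed to split.
    """
    w = np.asarray(w, dtype=float)
    whole = [(offset, offset + len(w))]
    area = w.sum()
    # Stop on the recursion limit, too-small peaks, or trivial waveforms
    if depth >= max_depth or area < min_area or len(w) < 2:
        return whole
    i, gos = best_split(w)
    if i == 0 or gos < threshold(area):
        return whole  # no acceptable split
    left = split_peak(w[:i], threshold, min_area, max_depth, offset, depth + 1)
    right = split_peak(w[i:], threshold, min_area, max_depth, offset + i, depth + 1)
    return left + right
```

For example, `split_peak([0, 1, 5, 1, 0, 0, 0, 0, 0, 0, 1, 5, 1, 0], lambda area: 0.9)` separates the two bumps and then stops, because neither part has a split that clears the threshold.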
The code also supports two modified goodness of split functions:
s = sum[ w(t) (t - m)^2 ] / sum[w(t)]
This is a kind of F-statistic, and has been proposed as a test for bimodality in the past (Larkin, 1979).
Below you can see how these perform on a few prototypical peaks in XENON1T data at high energies (so the features are clear enough to see by eye). These were found with our current default local minimum clustering.
You can see that the ordinary natural breaks algorithm reaches quite high 'goodness of split' values in the middle of a normal Gaussian-ish peak. Low Split and Normalized Variance both show a much larger difference between good and bad cases (their values are just lower overall). Normalized Variance, however, has trouble recognizing long tails, whereas Low Split again does quite well here. Thus I'd lean towards using Low Split for now.
Well-resolved peaks
Sticky tails
Multiple modes