Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Natural breaks splitting #225

Merged
merged 12 commits into from Jan 16, 2020
Merged

Conversation

JelleAalbers
Copy link
Member

@JelleAalbers JelleAalbers commented Dec 27, 2019

This adds support for variations of Jenks/Fischer natural breaks for splitting peaks, as an alternative to or replacement for the local minimum prominence finding algorithm method.

You can find some background info on wikipedia and here. In computer vision, when used on an image intensity histogram, this is known as Otsu's method.

Natural breaks finds a point in the waveform that maximizes the 'goodness of split':

      s(left)  +  s(right)
1  -  --------------------
           s(original)     

Here s is the (weighted) sum squared deviation from the (weighted) mean, and 'left' and 'right' mean the left and right part of the waveform after the split. To be fully precise, for a waveform w(t), the (weighted) mean is m = sum[t w(t)]/sum[w(t)] and s = sum[ w(t) (t - m)^2 ].

Values near 1 can be interpreted as a strong advice to split the peak; values near 0 as strong advice not to do so.

In this implementation, the user supplies a threshold function that depends on the peak area. The algorithm probes this to decide whether to actually accept the best split and split the peak in two halves. If we do, we recurse on the split halves, until they either drop below some minimum area, have no acceptable splits anymore, or we reach a configurable recursion limit.

The code also supports two modified goodness of split functions:

  • Normalized variance: This replaces s with the usual, i.e normalized, variance: s = sum[ w(t) (t - m)^2 ] / sum[w(t)]. This is a kind of F-statistic, and has been proposed as a test for bimodality in the past (Larkin, 1979).
  • Low Split: Supress splits at high parts of the waveform, by multipling the goodness of split by [the sum waveform value at sample where we split] divided by [the maximum sum waveform amplitude]. To make this work for jagged low-energy waveforms, we apply a ~150-ns square filter, i.e. a moving average, to the waveform.

Below you can see how these perform on a few prototypical peaks in XENON1T data at high-energies (so features are clear enough to see by eye). These were found with our current default local minimum clustering.

You can see ordinary natural breaks algorithm reaches quite high 'goodness of split' values in the middle of a normal Gaussian-ish peak. Low Split and Normalized Variance both show a much larger difference between good and bad cases (their values are just lower overall). Normalized Variance, however, has trouble recognizing long tails; Low Split again does quite well here. Thus I'd lean towards using Low Split for now.

Well-resolved peaks

natbreaks_example_82
natbreaks_example_1794

Sticky tails

natbreaks_example_1910
natbreaks_example_1096

Multiple modes

natbreaks_example_3756
natbreaks_example_381

@JelleAalbers JelleAalbers force-pushed the natbreaks branch 2 times, most recently from 678c68a to 43aef02 Compare December 30, 2019 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant