Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change binning #219

Merged
merged 5 commits into from
Mar 22, 2017
Merged

Change binning #219

merged 5 commits into from
Mar 22, 2017

Conversation

JoostJM
Copy link
Collaborator

@JoostJM JoostJM commented Mar 10, 2017

This PR changes the way the images are binned. This does not affect the values in the test cases, it only affects the binning for specific cases:

When the maximum gray level inside the ROI is equally dividable, it will be equal to the rightmost bin edge.
numpy.histogram includes these values inside the topmost bin (i.e. the rightmost bin is closed, whereas the others are half-open). However, in numpy.digitize, the rightmoste 'bin' is also half open, and these values will then be set to the number of bins + 1. For image discretization this is better if you consider the following example:

range 5 (unique values 1, 2, 3, 4, 5), binWidth 1: bin edges: (1, 2, 3, 4, 5)

This PR will do binning in such a way, that the values are not changed. Before this PR, voxels with value 5 would be set to 4.

Additionally, set the default value for voxelArrayShift to 0, to prevent users from accidentally using the fairly large array shift which is the current default. This PR also changes the documentation to make users more aware of this, with updates on the definition of the parameter, and incorporation of the parameter in the feature formulas that are affected by this parameter.

@pieper
Copy link
Contributor

pieper commented Mar 14, 2017

I don't completely understand your commit message - I think you are saying that you are changing the behavior to be consistent with numpy.histogram, is that correct? If so LGTM.

@JoostJM
Copy link
Collaborator Author

JoostJM commented Mar 14, 2017

I don't completely understand your commit message - I think you are saying that you are changing the behavior to be consistent with numpy.histogram, is that correct? If so LGTM.

In fact, the opposite. numpy.histogram has a closed interval for the rightmost bin (i.e. if you have gray values in range 1 - 5, with bin width 1, numpy.histogram will assign values 4 and 5 to bin 4, whereas the lower gray values are assigned to their own bin: 1 -> 1, 2 -> 2, 3 -> 3 (half-open interval)). numpy.digitize does not do this (i.e. the gray level 5 would be assigned to len(bins) + 1 -> 4 + 1 = 5).

The reason I removed the line of code which made the numpy.digitize behaviour consistent with numpy.histogram is that it prevents 'switching off' binning by specifying binWidth = 1.

In it's current form, the behavior of numpy.histogram and numpy.digitize differs for the voxels with max(gray level), where max(gray level) mod 0 = 0 (so it's just in very specific cases). An alternative would be to change the behavior of numpy.histogram to match numpy.digitize (i.e. make sure that gray level 5 would be binned to a 5th bin in the example above).

- voxelArrayShift [0]: Integer, This amount is added to the gray level intensity in features Energy, Total Energy and
RMS, this is to prevent negative values. __If using CT data, or data normalized with mean 0, consider setting this
parameter to a fixed value that ensures non-negative numbers in the image. Bear in mind however, that the larger the
value, the larger the volume confounding effect will be.__
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a specific note regarding using 2000 for CT should be included somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the suggestion at the definition of voxelArrayShift parameter in the documentation (at the start of firstorder feature class)

@@ -13,42 +13,64 @@
logger = logging.getLogger(__name__)


def getHistogram(binwidth, parameterValues):
"""
def getBinEdges(binwidth, parameterValues):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just now realized that numpy has its own set of strategies for calculating bins. I think it would be helpful to give user an option to access those numpy strategies. I don't think this is possible right now.

Related question - should we raise the question of bin calculation withe radiomics standardization effort group? It does not look like their document discusses this at the moment.

Copy link
Collaborator Author

@JoostJM JoostJM Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just now realized that numpy has its own set of strategies for calculating bins. I think it would be helpful to give user an option to access those numpy strategies. I don't think this is possible right now.

It's a good question. We discussed this in one of the hangouts some months ago (where I proposed adding the possibility for a fixed bin count). We decided then not to incorporate it to prevent confusion. Additionally, do the other options yield equally spaced bins? This is required for numpy.digitize. If not, we have to decide whether or not to enable this for just histogram related features (Entropy and Uniformity in firstorder)

Related question - should we raise the question of bin calculation withe radiomics standardization effort group? It does not look like their document discusses this at the moment.

They already took it up, it's in their latest workversion (but not available online yet I think).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the other options yield equally spaced bins?

I didn't test, but looking at the code, it seems that the binning strategy just calculates the width of the bin: https://github.com/numpy/numpy/blob/v1.12.0/numpy/lib/function_base.py#L695

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test, but looking at the code, it seems that the binning strategy just calculates the width of the bin:

It looks like that to me too. Still raises the question on whether to enable this. Main issue remains that any of these functions calculate bin widths on the input range, meaning that the comparability of gray values may be limited...

@pieper
Copy link
Contributor

pieper commented Mar 14, 2017

Okay, that's clearer now.

@JoostJM JoostJM force-pushed the change-binning branch 2 times, most recently from fe8c8ef to 880fc94 Compare March 20, 2017 09:09
Remove the +1 to `binEdges` in `imageoperations.binImage()`. This produces unexpected behaviour in case of low values for binWidth. This means that the number of unique discretized values in the ROI can be 1 larger than number of bins computed by the histogram. This is due to the fact that in `numpy.histogram`, all bins are half-open, with the exception of the rightmost bin. If the maximum value in ROI is equally dividable by binWidth (max(X) mod binWidth = 0), voxel with this maximum value will be assigned to the topmost bin by the histogram, but assigned (numBins + 1) by `numpy.digitize`.

Additionally, change the implementation of the functions in `imageoperations`. Instead of a function that calculates the histogram, use a function that calculates the bin edges. The histogram is then calculated using these edges where necessary (only in firstorder), whereas the the calculated bin edges can also be passed directly to `numpy.digitize` , removing unnecessary calculation of the histogram for texture classes.

Use the maximum gray value inside the ROI after binning to determine number of gray values. This is due to the fact that this is either equal to the number of bins, or equal to the number of bins + 1 (when the max value is equal to the rightmost bin edge). This maximum gray value will always be the integer indicating the maximum number of unique gray values inside the ROI due to the binning, which is always applied.

Update documentation to reflect changes in implementation.
Set the default value to 0, which corresponds to most implementations of Energy, Total Energy and Root Mean Squared (RMS). This prevents users from accidentally using a too large array shift.

Update the documentation, to better define the usage of this parameter, with additional suggestions as to how to use it.

Update the formulas of features that incorporate this parameter to illustrate the effect this parameter has on the value.

Add documentation to explicitly mention the volume-confounded nature of these features, and indicate that the voxelArrayShift parameter influences the size of the effect of volume on the feature value.
Add a math formula describing how image discretization is performed by `binImage` function.

Add extra documentation and example on formation of bin edges by `getBinEdges` function.
Add '2000' as a suggested value for the `voxelArrayShift` parameter in firstorder when using CT data.

Additionally, fix small bug in documentation of `getBinEdges` (correct last bin edge).
@JoostJM JoostJM merged commit 5b906ea into AIM-Harvard:master Mar 22, 2017
@JoostJM JoostJM deleted the change-binning branch March 22, 2017 17:50
JoostJM added a commit that referenced this pull request Feb 20, 2019
Related to #219. Ensure numpy.histogram and numpy.digitize yield comparable results. The difference is caused by the fact that numpy.histogram treats the rightmost bin as a closed interval (i.e. values equal to the rightmost edge are included in the last bin). On the other hand, numpy.digitize treats all bins as half open (including the rightmost). In this latter case, values equal to the rightmost bin edge are given value `len(bins) + 1`.

Ensure consistent behaviour (with half open bins for all bins) by adding an extra bin edge when using fixed bin width (i.e. the rightmost edge is > max(targetValues)).
When using a fixed bin count, correct behaviour was already ensured by adding +1 to the last bin (which is guaranteed to be equal to max(targetValues). In this case, the +1 ensures nummpy.digitize considers these maximum values as part of the last bin, seeing as the edges are arranged such, that a specific number of bins is obtained).

On relation to #219: In that PR, the +1 to the rightmost bin was removed to allow 'switching off' binning by specifying bin-width 1. However, this changed did return the inconsistency between numpy.histogram and numpy.digitize, which is now corrected by this commit (by ensuring numpy.histogram is consistent with numpy.digitize, instead of the other way around).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants