-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change binning #219
Change binning #219
Conversation
I don't completely understand your commit message - I think you are saying that you are changing the behavior to be consistent with numpy.histogram, is that correct? If so LGTM. |
In fact, the opposite. The reason I removed the line of code which made the In it's current form, the behavior of |
radiomics/firstorder.py
Outdated
- voxelArrayShift [0]: Integer, This amount is added to the gray level intensity in features Energy, Total Energy and | ||
RMS, this is to prevent negative values. __If using CT data, or data normalized with mean 0, consider setting this | ||
parameter to a fixed value that ensures non-negative numbers in the image. Bear in mind however, that the larger the | ||
value, the larger the volume confounding effect will be.__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a specific note regarding using 2000 for CT should be included somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the suggestion at the definition of voxelArrayShift
parameter in the documentation (at the start of firstorder feature class)
@@ -13,42 +13,64 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def getHistogram(binwidth, parameterValues): | |||
""" | |||
def getBinEdges(binwidth, parameterValues): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just now realized that numpy has its own set of strategies for calculating bins. I think it would be helpful to give user an option to access those numpy strategies. I don't think this is possible right now.
Related question - should we raise the question of bin calculation withe radiomics standardization effort group? It does not look like their document discusses this at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just now realized that numpy has its own set of strategies for calculating bins. I think it would be helpful to give user an option to access those numpy strategies. I don't think this is possible right now.
It's a good question. We discussed this in one of the hangouts some months ago (where I proposed adding the possibility for a fixed bin count). We decided then not to incorporate it to prevent confusion. Additionally, do the other options yield equally spaced bins? This is required for numpy.digitize
. If not, we have to decide whether or not to enable this for just histogram related features (Entropy and Uniformity in firstorder)
Related question - should we raise the question of bin calculation withe radiomics standardization effort group? It does not look like their document discusses this at the moment.
They already took it up, it's in their latest workversion (but not available online yet I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the other options yield equally spaced bins?
I didn't test, but looking at the code, it seems that the binning strategy just calculates the width of the bin: https://github.com/numpy/numpy/blob/v1.12.0/numpy/lib/function_base.py#L695
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't test, but looking at the code, it seems that the binning strategy just calculates the width of the bin:
It looks like that to me too. Still raises the question on whether to enable this. Main issue remains that any of these functions calculate bin widths on the input range, meaning that the comparability of gray values may be limited...
Okay, that's clearer now. |
fe8c8ef
to
880fc94
Compare
Remove the +1 to `binEdges` in `imageoperations.binImage()`. This produces unexpected behaviour in case of low values for binWidth. This means that the number of unique discretized values in the ROI can be 1 larger than number of bins computed by the histogram. This is due to the fact that in `numpy.histogram`, all bins are half-open, with the exception of the rightmost bin. If the maximum value in ROI is equally dividable by binWidth (max(X) mod binWidth = 0), voxel with this maximum value will be assigned to the topmost bin by the histogram, but assigned (numBins + 1) by `numpy.digitize`. Additionally, change the implementation of the functions in `imageoperations`. Instead of a function that calculates the histogram, use a function that calculates the bin edges. The histogram is then calculated using these edges where necessary (only in firstorder), whereas the the calculated bin edges can also be passed directly to `numpy.digitize` , removing unnecessary calculation of the histogram for texture classes. Use the maximum gray value inside the ROI after binning to determine number of gray values. This is due to the fact that this is either equal to the number of bins, or equal to the number of bins + 1 (when the max value is equal to the rightmost bin edge). This maximum gray value will always be the integer indicating the maximum number of unique gray values inside the ROI due to the binning, which is always applied. Update documentation to reflect changes in implementation.
Set the default value to 0, which corresponds to most implementations of Energy, Total Energy and Root Mean Squared (RMS). This prevents users from accidentally using a too large array shift. Update the documentation, to better define the usage of this parameter, with additional suggestions as to how to use it. Update the formulas of features that incorporate this parameter to illustrate the effect this parameter has on the value. Add documentation to explicitly mention the volume-confounded nature of these features, and indicate that the voxelArrayShift parameter influences the size of the effect of volume on the feature value.
Add a math formula describing how image discretization is performed by `binImage` function. Add extra documentation and example on formation of bin edges by `getBinEdges` function.
Add '2000' as a suggested value for the `voxelArrayShift` parameter in firstorder when using CT data. Additionally, fix small bug in documentation of `getBinEdges` (correct last bin edge).
Related to #219. Ensure numpy.histogram and numpy.digitize yield comparable results. The difference is caused by the fact that numpy.histogram treats the rightmost bin as a closed interval (i.e. values equal to the rightmost edge are included in the last bin). On the other hand, numpy.digitize treats all bins as half open (including the rightmost). In this latter case, values equal to the rightmost bin edge are given value `len(bins) + 1`. Ensure consistent behaviour (with half open bins for all bins) by adding an extra bin edge when using fixed bin width (i.e. the rightmost edge is > max(targetValues)). When using a fixed bin count, correct behaviour was already ensured by adding +1 to the last bin (which is guaranteed to be equal to max(targetValues). In this case, the +1 ensures nummpy.digitize considers these maximum values as part of the last bin, seeing as the edges are arranged such, that a specific number of bins is obtained). On relation to #219: In that PR, the +1 to the rightmost bin was removed to allow 'switching off' binning by specifying bin-width 1. However, this changed did return the inconsistency between numpy.histogram and numpy.digitize, which is now corrected by this commit (by ensuring numpy.histogram is consistent with numpy.digitize, instead of the other way around).
This PR changes the way the images are binned. This does not affect the values in the test cases, it only affects the binning for specific cases:
When the maximum gray level inside the ROI is equally dividable, it will be equal to the rightmost bin edge.
numpy.histogram
includes these values inside the topmost bin (i.e. the rightmost bin is closed, whereas the others are half-open). However, innumpy.digitize
, the rightmoste 'bin' is also half open, and these values will then be set to the number of bins + 1. For image discretization this is better if you consider the following example:range 5 (unique values 1, 2, 3, 4, 5), binWidth 1: bin edges: (1, 2, 3, 4, 5)
This PR will do binning in such a way, that the values are not changed. Before this PR, voxels with value 5 would be set to 4.
Additionally, set the default value for
voxelArrayShift
to 0, to prevent users from accidentally using the fairly large array shift which is the current default. This PR also changes the documentation to make users more aware of this, with updates on the definition of the parameter, and incorporation of the parameter in the feature formulas that are affected by this parameter.