
MNT _weighted_percentile supports np.nan values #29034

Open
wants to merge 11 commits into base: main

Conversation

@StefanieSenger (Contributor) commented on May 16, 2024

What does this implement/fix? Explain your changes.

This PR adds support for nan inputs to _weighted_percentile, which is used within SplineTransformer, for instance.

Any other comments?

This is for @ogrisel since we had talked about that:

This does not deliver results equivalent to np.nanpercentile in the current dev version, because the two already differed before: _weighted_percentile finds the next lower value, while np.nanpercentile seems to locate percentiles differently (nearest?). The more the sample_weights vary, the more pronounced this effect becomes.

Here is an example without any nans and with uniform weights:

import numpy as np
from sklearn.utils.stats import _weighted_percentile

percentile = 66
arr2D = np.array([[3, 30], [1, 20], [3, 10]])
weights2D = np.array([[1, 1], [1, 1], [1, 1]])
print(_weighted_percentile(arr2D, weights2D, percentile))
print(np.nanpercentile(arr2D, percentile, weights=weights2D, axis=0, method="inverted_cdf"))

output:

[3 20]
[3 30]

Setting percentile = 67, on the other hand, leads to the same results.

Edit: What I wrote above is actually not true. The differing output was due to the sample weights of nan values not being set to 0 beforehand. Now that this is done, we get the same results, it seems.


github-actions bot commented May 16, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 63dc932.

@StefanieSenger changed the title from "ENH _weighted_percentile supports np.nan values" to "MNT _weighted_percentile supports np.nan values" on May 17, 2024
@ogrisel (Member) left a comment

Some early feedback.

@@ -41,6 +41,8 @@ def _weighted_percentile(array, sample_weight, percentile=50):
sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)

# Find index of median prediction for each sample
# nan_mask = np.isnan(array) #???
# weight_cdf[nan_mask] = 0 #
Member

I think we need to zero the weights of nan values in the tiled version of sample_weight, before computing the cumsum so that missing entries in a given column do not contribute to the empirical CDF estimated for that column.
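
A minimal sketch of this suggestion (variable names follow the diff hunk above; the data is made up for illustration):

import numpy as np

array = np.array([[np.nan, 30.0], [1.0, 20.0], [3.0, 10.0]])
sample_weight = np.ones_like(array)  # assume already tiled to 2d

sorted_idx = np.argsort(array, axis=0)  # nans sort to the end of each column
sorted_array = np.take_along_axis(array, sorted_idx, axis=0)
sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)

# zero the weights of nan entries so they do not contribute to the empirical CDF
sorted_weights[np.isnan(sorted_array)] = 0
weight_cdf = np.cumsum(sorted_weights, axis=0)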

Member

Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user provided input and instead make a copy of the weights before setting some of the entries to zero (I think).
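
If the zeroing were done on sample_weight itself rather than on a freshly created sorted copy, a guard along these lines could avoid the mutation (hypothetical helper name and structure, not code from the PR):

import numpy as np


def _zero_nan_weights(array, sample_weight):
    """Hypothetical helper: zero the weights of nan entries without mutating the input."""
    if sample_weight.ndim == 2:
        # user-provided 2d weights: copy before modifying
        sample_weight = sample_weight.copy()
    else:
        # 1d weights are tiled to 2d, which already creates a new array
        sample_weight = np.tile(sample_weight, (array.shape[1], 1)).T
    sample_weight[np.isnan(array)] = 0
    return sample_weight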

Contributor Author

Yes, this will eliminate the issue that varying weights for nan values could influence _weighted_percentile to deliver a different result.

Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user provided input and instead make a copy of the weights before setting some of the entries to zero (I think).

I decided to set the weights of nan values to 0 on sorted_weights (not on sample_weight), and I believe that means I don't need to copy, correct?

Comment on lines 105 to 106
array[np.random.rand(*array.shape) < 0.8] = np.nan
weights = np.random.randint(1, 6, size=(100, 3))
Member

Use the rand and randint methods of the rng instance instead of the methods from np.random, which rely on a shared global RandomState instance.

This is necessary to isolate the behavior of this test from any side effect on the shared global numpy RandomState instance. Otherwise it makes it really hard to debug randomly failing tests.
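
For example (shapes and nan fraction taken from the snippet above; the seed is arbitrary):

import numpy as np

rng = np.random.RandomState(42)
array = rng.rand(100, 3)
array[rng.rand(*array.shape) < 0.8] = np.nan
weights = rng.randint(1, 6, size=(100, 3))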

Contributor Author

Oh thank you for that explanation. It makes total sense now.



def test_weighted_percentile_nan():
"""Test that np.nan is not returned as a value."""
Member

We need to test something stronger: we need to check that calling _weighted_percentile on an array with nan values returns the same results as calling it on a filtered version of the data, where all missing values and their corresponding weights have been removed (see the sketch after the list below).

This is easy to test for the following cases:

  • 1d data and 1d weights;
  • 2d data and 2d weights;

but it's a bit more challenging to implement and test for the:

  • 2d data with shared 1d weights.
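
A sketch of such a check for the 1d data / 1d weights case (it assumes the nan-aware behaviour this PR adds; the actual test may look different):

import numpy as np
import pytest

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_ignores_nan_1d():
    """nan values should behave as if they and their weights had been removed."""
    rng = np.random.RandomState(0)
    array = rng.rand(50)
    array[rng.rand(50) < 0.3] = np.nan
    weights = rng.randint(1, 6, size=50).astype(np.float64)

    nan_mask = np.isnan(array)
    expected = _weighted_percentile(array[~nan_mask], weights[~nan_mask], 30)
    assert _weighted_percentile(array, weights, 30) == pytest.approx(expected)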

"One feature contains too many NaN values. This error should "
"actually either raise within the API, or there needs to be some "
"validation here before to make sure it cannot happen."
)
Member

I think it's fine to return an np.nan-valued percentile (maybe with a warning?) if some array is entirely nan-valued.

Contributor Author

I agree and have removed this from here (and also repaired the buggy loop).

But here are some further thoughts; I am not sure how important they are:

The reason I had put it here was that I had tested this on the branch from the nan-handling SplineTransformer PR, where we add test_spline_transformer_handles_all_nans (quite a special case anyway), and this made it raise in a not very informative way.

I think the error needs to be handled somehow, but within SplineTransformer, so that _weighted_percentile can stay flexible in case it's used differently in another context at some point.

For the other cases, we could raise a warning similar to how it's done in numpy:

/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1623: RuntimeWarning: All-NaN slice encountered return fnb._ureduce(a, ...

But should we add a warning within a private method? I'm a bit hesitant, as this could be confusing for users who might not be aware that _weighted_percentile is used internally. Maybe this warning should instead be raised within the realm of whatever API method is using it, if need be. Candidates might be AbsoluteError.fit_intercept_only(), PinballLoss.fit_intercept_only(), HuberLoss.fit_intercept_only() and KBinsDiscretizer.fit(), and maybe I should add similar all-nan-column tests and see whether they raise or whether we want to add a warning. Does that make sense?
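
For illustration only, such a check in a caller could look roughly like this (the placement and the message are hypothetical):

import warnings

import numpy as np

X = np.array([[np.nan, 1.0], [np.nan, 2.0]])

# hypothetical validation before delegating to _weighted_percentile
all_nan_columns = np.flatnonzero(np.isnan(X).all(axis=0))
if all_nan_columns.size:
    warnings.warn(
        f"Features {all_nan_columns.tolist()} contain only nan values; "
        "their computed percentiles will be nan.",
        RuntimeWarning,
    )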

Contributor Author

I have looked at this again and went through all the places where _weighted_percentile gets used.

The losses I mentioned above (only used internally), as well as set_huber_delta() and median_absolute_error(), find a percentile within y_true or y_true - some_other_1darray, where I think it would be pretty obvious to the user if all values were nan. This is why I think it's okay without a warning.

KBinsDiscretizer and SplineTransformer use it on the whole X within their fit methods, and I think we could add a warning there for all-nan columns if this kind of validation is not already done somewhere else.

Comment on lines 73 to 74
while bool(np.isnan(percentile).any()) and (
percentile_idx[np.isnan(percentile)].all() > 0
Contributor Author

Should bool(np.isnan(percentile).any()) be an if-condition inside the loop for better readability? It doesn't really constitute the looping condition anyway.

Edit: I accidentally published that review comment too early, but to answer my own question: yes, I would think so.

Member

I don't understand the percentile_idx[np.isnan(percentile)].all() > 0 condition. percentile_idx[np.isnan(percentile)] is an array of (non-binary) integers. So calling .all() on it should be equivalent to 0 not in percentile_idx[np.isnan(percentile)] but then I do not see the need to compare that boolean to 0 with > 0.

Maybe you meant

(percentile_idx[np.isnan(percentile)] > 0).all()

instead?

Contributor Author

Oh yes, that was odd code, sorry about that.

What I meant in words is "while we're still above the lowest possible index for all the indices we're about to manipulate, enter the while loop...", which would be (percentile_idx[np.isnan(percentile)] > 0).all().

I will change that and modify the comment.

The idea of excluding bool(np.isnan(percentile).any()) from the loop condition does not work (it enters an infinite loop when no nans are present).

Edit: Actually, I realized it should be (percentile_idx[np.isnan(percentile)] > 0).any().
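
A toy sketch of the resulting loop shape (variable names and shapes are assumed, not copied from the PR):

import numpy as np

# one column whose largest entries are nan after sorting along axis 0
sorted_array = np.array([[1.0], [2.0], [np.nan], [np.nan]])
percentile_idx = np.array([3])  # suppose searchsorted landed on a nan entry
col_idx = np.arange(sorted_array.shape[1])
percentile = sorted_array[percentile_idx, col_idx]

nan_picked = np.isnan(percentile)
while nan_picked.any() and (percentile_idx[nan_picked] > 0).any():
    # step the nan-valued picks one row down and re-read the values
    percentile_idx[nan_picked] -= 1
    percentile = sorted_array[percentile_idx, col_idx]
    nan_picked = np.isnan(percentile)

print(percentile)  # [2.], the largest non-nan value in that column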

@StefanieSenger (Contributor Author) left a comment

I am addressing your comments in my latest push, @ogrisel, thank you.

That test now fails because it is an actual test. Through it I became aware that we also need to mask out the nans when calculating percentile_idx; simply ignoring nans within the indexing steps is not enough. I am still trying to figure out how to do that using masked arrays and am pushing the current state of my attempts.

Edit: I now found out what I did wrong before: I had created the nan mask on the input array, but it should have been created on the sorted one. With the corrected nan mask we can indeed calculate percentile_idx, and using a masked array is not necessary anymore, it seems. I fixed the errors below and all the tests pass.
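
A small illustration of that point (made-up data; the helper's actual variable names may differ):

import numpy as np

array = np.array([[np.nan, 3.0], [1.0, np.nan], [2.0, 5.0]])

sorted_idx = np.argsort(array, axis=0)  # nans sort to the end of each column
sorted_array = np.take_along_axis(array, sorted_idx, axis=0)

# a mask taken on the unsorted input would not line up with the sorted
# weights and values; it has to be taken on the sorted view before
# computing percentile_idx
nan_mask = np.isnan(sorted_array)
print(nan_mask)
# [[False False]
#  [False False]
#  [ True  True]]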


@StefanieSenger marked this pull request as ready for review on May 29, 2024, 13:22