Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multimethods for statistical functions #70

Merged

Conversation

joaosferreira
Copy link
Collaborator

This pull request adds multimethods for statistical functions. The multimethods added are the following:

Order statistics

  • percentile
  • nanpercentile
  • quantile
  • nanquantile

Averages and variances

  • median
  • average
  • mean
  • nanmedian
  • nanmean
  • nanstd
  • nanvar

Correlating

  • corrcoef
  • correlate
  • cov

Histograms

  • histogram
  • histogram2d
  • histogramdd
  • bincount
  • histogram_bin_edges
  • digitize

A few things to discuss:

  1. Why is dtype not being dispatched in std and var? Other multimethods that use _reduce_argreplacer also don't dispatch dtype.
  2. The previously added min and max seem to be equal to amin and amax that this PR intends to add:
In [2]: onp.amin                                                                
Out[2]: <function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>

In [3]: onp.min                                                                 
Out[3]: <function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>
  1. I'm looking into implementing some defaults as well. Multimethods like median that reduce array slices along an axis might be the easier ones for now. As I understand these require a for loop over the given axis, so in terms of complexity it would be O(n) where n is the length of that dimension.

@hameerabbasi
Copy link
Collaborator

  1. Why is dtype not being dispatched in std and var? Other multimethods that use _reduce_argreplacer also don't dispatch dtype.

That's a bug. We should probably add that.

  1. The previously added min and max seem to be equal to amin and amax that this PR intends to add:

Just do amin = min and amax = max.

  1. I'm looking into implementing some defaults as well. Multimethods like median that reduce array slices along an axis might be the easier ones for now. As I understand these require a for loop over the given axis, so in terms of complexity it would be O(n) where n is the length of that dimension.

That seems perfect to me.

@joaosferreira
Copy link
Collaborator Author

I've implemented the default for median with respect to apply_along_axis. I favoured this approach over the one I previously mentioned for a few reasons:

  1. Since axis can be a sequence of integers, it would be necessary to iterate over the various dimensions given by the axis argument (nested for loop), so the complexity would be greater than previously thought.
  2. It makes it easier for read-only arrays.
  3. More simple implementation.

If this is okay I will implement other defaults similarly.

@hameerabbasi
Copy link
Collaborator

Unfortunately, not. apply_along_axis uses a Python for-loop, so it's undesirable to use such an approach. If they're too hard, you can skip the defaults.

@joaosferreira
Copy link
Collaborator Author

Not entirely sure how to do the defaults. I'm guessing that they have to be implemented with respect to some other method since item assignment is undesirable. Regarding the traversing of the dimensions, I think it could be done moving the given axes with swapaxes or transpose so that the for loop goes over them. Any tips here are welcome and could help me unblock. Nevertheless, I'll move on to work on other stuff as well.

@hameerabbasi
Copy link
Collaborator

hameerabbasi commented Aug 11, 2020

In general, I'd follow the following pattern:

# transpose with (*non_selected_axes, *selected_axes)
# reshape with `(prod_of_nonselected_axes, prod_of_selected_axes)`
# Apply reduction over second axis. If `count` is needed, use `prod_of_selected_axes`.
# Reshape back to `non_selected_axes`.

For median, sort and then apply the selection.

For mean, it's just the sum(axis=axis) / count where count is the product of selected axes' shapes.

For var, it's mean(x ** 2) - mean(x) ** 2 (of course, passing the appropriate axes in).
For std, it's sqrt(var(x)).

For the nan-reductions, replace sum by nansum, and count by sum(~isnan(x), axes=nonselected_axes).

@joaosferreira
Copy link
Collaborator Author

When you mention reshape with (prod_of_nonselected_axes, prod_of_selected_axes) I imagine you mean the product of the dimensions' lengths.

@hameerabbasi
Copy link
Collaborator

When you mention reshape with (prod_of_nonselected_axes, prod_of_selected_axes) I imagine you mean the product of the dimensions' lengths.

That's correct.

@joaosferreira
Copy link
Collaborator Author

That's a bug. We should probably add that.

The last commit fixes this.

Just do amin = min and amax = max.

This was added as well.

@joaosferreira
Copy link
Collaborator Author

@hameerabbasi The last commit refactors _median_default by following the pattern you suggested.

The implementation is also a mashup of these two functions. Also, I think eventually the if/else part can be refactored further into a helper function that works for other reduce methods such as mean and std.

If this seems alright to you I can start implementing the other defaults.

@hameerabbasi
Copy link
Collaborator

LGTM so far, just one minor comment.

@joaosferreira joaosferreira marked this pull request as ready for review August 18, 2020 15:53
@joaosferreira
Copy link
Collaborator Author

There's a few issues with the defaults:

  1. Sometimes median's default behaves badly when handling nans. This happens because of the use of sort in the default. Here is a simple demonstration:
In [2]: a = onp.asarray([[10, onp.nan, 4], [3, 2, 1]])                          

In [3]: onp.sort(a, axis=1)                                                     
Out[3]: 
array([[ 4., 10., nan],
       [ 1.,  2.,  3.]])

In [4]: onp.median(a, axis=1)                                                   
Out[4]: array([nan,  2.])

In this example, the nan value goes to the end of the first row after sorting. Since the default then slices the middle values along the given axis (2nd) the median values for the sorted array become array([10., 2.]) which is incorrect as showed by NumPy's median. Any slice along a given axis with a nan should also reduce to a nan median.

  1. In the default implementations for var and nanvar the case where ddof is different than zero produces wrong results. It seems that it's not as simple as just changing the divisor in the mean formula as stated in the docs. Any tips?

@hameerabbasi
Copy link
Collaborator

For 1: Use ret = np.where(any(isnan(x), axis=axis), nan, ret).
For 2: Do you have an idea of how big the difference is?

@joaosferreira
Copy link
Collaborator Author

For 2: Do you have an idea of how big the difference is?

The default's results are commented:

In [2]: a = onp.asarray([[1, 2], [3, 4]])                                       

In [3]: onp.var(a, ddof=1)                                                      
Out[3]: 1.6666666666666667 # -1.1111111111111125

In [4]: onp.var(a, axis=0, ddof=1)                                              
Out[4]: array([2., 2.]) # array([ -6. -16.])

In [5]: onp.var(a, axis=1, ddof=1)                                              
Out[5]: array([0.5, 0.5]) # array([ -4. -24.])

@hameerabbasi
Copy link
Collaborator

Can you merge master and resolve any merge conflicts that may arise?

@hameerabbasi
Copy link
Collaborator

Is anything remaining here?

@joaosferreira
Copy link
Collaborator Author

No, I think I added all that I wanted. You can merge if you think everything is okay. 😄

dims[ax] = 1
dims = tuple(dims)

a = transpose(a, unselected_axis + axis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you transpose and call functions with axis=-1 when you could just call the basic functions with axis=axis?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested that during the meeting to make the implementation simpler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken it was necessary for median's default which led me to using it on the other reduce methods as well. But now that you point that out it's possible that the only method that needs it is median.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove _reduce from the methods where it's not needed.

unumpy/_multimethods.py Outdated Show resolved Hide resolved
unumpy/_multimethods.py Outdated Show resolved Hide resolved
mask = any(isnan(a), axis=-1).reshape(a.shape[0:-1] + (1,))
a = where(mask, nan, a)

a = sort(a, axis=-1)
Copy link
Contributor

@peterbell10 peterbell10 Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use partition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's mostly for simplicity. Is partition more efficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be slightly more efficient as it doesn't completely sort the array.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced sort with partition as you suggested.

unumpy/_multimethods.py Outdated Show resolved Hide resolved
unumpy/_multimethods.py Show resolved Hide resolved
Comment on lines 983 to 985
N = 0
for ax in axis:
N += a.shape[ax]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
N = 0
for ax in axis:
N += a.shape[ax]
N = 1
for ax in axis:
N *= a.shape[ax]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you might want a _prod helper so you can do:

N = _prod(a.shape[ax] for ax in axis)

Since this comes up in a few different places.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed now.

Comment on lines 1028 to 1030
N = 0
for ax in axis:
N += a.shape[ax]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here.

@joaosferreira
Copy link
Collaborator Author

I think this can be merged now.

@hameerabbasi hameerabbasi merged commit 1607d9b into Quansight-Labs:master Aug 24, 2020
@hameerabbasi
Copy link
Collaborator

Thanks for the excellent work, @joaosferreira!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants