Multimethods for statistical functions #70

joaosferreira · 2020-08-08T15:01:11Z

This pull request adds multimethods for statistical functions. The multimethods added are the following:

Order statistics

percentile
nanpercentile
quantile
nanquantile

Averages and variances

median
average
mean
nanmedian
nanmean
nanstd
nanvar

Correlating

corrcoef
correlate
cov

Histograms

histogram
histogram2d
histogramdd
bincount
histogram_bin_edges
digitize

A few things to discuss:

Why is dtype not being dispatched in std and var? Other multimethods that use _reduce_argreplacer also don't dispatch dtype.
The previously added min and max seem to be equal to amin and amax that this PR intends to add:

In [2]: onp.amin                                                                
Out[2]: <function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>

In [3]: onp.min                                                                 
Out[3]: <function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>

I'm looking into implementing some defaults as well. Multimethods like median that reduce array slices along an axis might be the easier ones for now. As I understand these require a for loop over the given axis, so in terms of complexity it would be O(n) where n is the length of that dimension.

hameerabbasi · 2020-08-08T16:02:59Z

> Why is dtype not being dispatched in std and var? Other multimethods that use _reduce_argreplacer also don't dispatch dtype.

That's a bug. We should probably add that.

> The previously added min and max seem to be equal to amin and amax that this PR intends to add:

Just do amin = min and amax = max.

> I'm looking into implementing some defaults as well. Multimethods like median that reduce array slices along an axis might be the easier ones for now. As I understand these require a for loop over the given axis, so in terms of complexity it would be O(n) where n is the length of that dimension.

That seems perfect to me.

joaosferreira · 2020-08-10T17:12:52Z

I've implemented the default for median in terms of apply_along_axis. I favoured this approach over the one I previously mentioned for a few reasons:

Since axis can be a sequence of integers, it would be necessary to iterate over every dimension given by the axis argument (a nested for loop), so the complexity would be greater than I previously thought.
It makes read-only arrays easier to handle.
The implementation is simpler.

If this is okay I will implement other defaults similarly.

hameerabbasi · 2020-08-10T17:15:28Z

Unfortunately, no. apply_along_axis uses a Python for loop, so that approach is undesirable. If the defaults are too hard, you can skip them.

joaosferreira · 2020-08-11T11:00:52Z

I'm not entirely sure how to do the defaults. I'm guessing that they have to be implemented in terms of some other method, since item assignment is undesirable. Regarding traversing the dimensions, I think it could be done by moving the given axes with swapaxes or transpose so that the for loop goes over them. Any tips here are welcome and would help unblock me. Nevertheless, I'll move on to work on other stuff as well.

hameerabbasi · 2020-08-11T12:19:22Z

In general, I'd use the following pattern:

# transpose with (*non_selected_axes, *selected_axes)
# reshape with `(prod_of_nonselected_axes, prod_of_selected_axes)`
# Apply reduction over second axis. If `count` is needed, use `prod_of_selected_axes`.
# Reshape back to `non_selected_axes`.
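The four steps above can be sketched with plain NumPy as follows. This is only an illustration, not the code from this PR: the helper name `reduce_over_axes` and its signature are hypothetical, and it assumes `axis` is an int or a tuple of distinct ints.

```python
import math

import numpy as np


def reduce_over_axes(a, axis, reduction):
    """Apply `reduction(flat, axis=-1)` over the axes listed in `axis`.

    Hypothetical helper for illustration. Returns the reduced array and
    the number of elements that were reduced per output entry.
    """
    a = np.asarray(a)
    if isinstance(axis, int):
        axis = (axis,)
    # Normalise negative axes so they can be compared with range(a.ndim).
    selected = tuple(ax % a.ndim for ax in axis)
    unselected = tuple(ax for ax in range(a.ndim) if ax not in selected)

    # 1. Transpose with (*non_selected_axes, *selected_axes).
    a = np.transpose(a, unselected + selected)

    # 2. Reshape to (prod_of_nonselected_axes, prod_of_selected_axes).
    kept_shape = a.shape[: len(unselected)]
    count = math.prod(a.shape[len(unselected):])
    a = a.reshape(math.prod(kept_shape), count)

    # 3. Apply the reduction over the second axis.
    out = reduction(a, axis=-1)

    # 4. Reshape back to the non-selected axes.
    return out.reshape(kept_shape), count


x = np.arange(24.0).reshape(2, 3, 4)
total, count = reduce_over_axes(x, (0, 2), np.sum)
mean = total / count  # agrees with np.mean(x, axis=(0, 2))
```

Returning `count` alongside the result is a design choice of this sketch, so that reductions such as mean can divide by it without recomputing the product.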

For median, sort and then apply the selection.

For mean, it's just the sum(axis=axis) / count where count is the product of selected axes' shapes.

For var, it's mean(x ** 2) - mean(x) ** 2 (of course, passing the appropriate axes in).
For std, it's sqrt(var(x)).

For the nan-reductions, replace sum by nansum, and count by sum(~isnan(x), axes=nonselected_axes).
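The formulas above can be written directly with NumPy reductions. The following is a minimal sketch, not the PR's implementation: the function names are hypothetical, `axis` is assumed to be an int or a tuple of ints, and only the ddof=0 case is covered. In this sketch the non-NaN count is summed over the axes being reduced.

```python
import math

import numpy as np


def mean_default(x, axis):
    # count is the product of the selected axes' lengths.
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    count = math.prod(x.shape[ax] for ax in axes)
    return np.sum(x, axis=axis) / count


def var_default(x, axis):
    # var = mean(x ** 2) - mean(x) ** 2 (population variance, ddof=0).
    return mean_default(x ** 2, axis) - mean_default(x, axis) ** 2


def std_default(x, axis):
    # std = sqrt(var(x)).
    return np.sqrt(var_default(x, axis))


def nanmean_default(x, axis):
    # Replace sum by nansum and count by the number of non-NaN entries.
    count = np.sum(~np.isnan(x), axis=axis)
    return np.nansum(x, axis=axis) / count
```

One caveat worth keeping in mind: the `mean(x ** 2) - mean(x) ** 2` form can lose precision through cancellation when the mean is large compared with the spread of the data.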

joaosferreira · 2020-08-11T15:56:47Z

When you mention reshape with (prod_of_nonselected_axes, prod_of_selected_axes) I imagine you mean the product of the dimensions' lengths.

hameerabbasi · 2020-08-11T15:57:40Z

> When you mention reshape with (prod_of_nonselected_axes, prod_of_selected_axes) I imagine you mean the product of the dimensions' lengths.

That's correct.

joaosferreira · 2020-08-12T14:18:59Z

> That's a bug. We should probably add that.

The last commit fixes this.

> Just do amin = min and amax = max.

This was added as well.

joaosferreira · 2020-08-12T17:09:25Z

@hameerabbasi The last commit refactors _median_default by following the pattern you suggested.

The implementation is also a mashup of these two functions. Also, I think eventually the if/else part can be refactored further into a helper function that works for other reduce methods such as mean and std.

If this seems alright to you I can start implementing the other defaults.

hameerabbasi · 2020-08-14T07:48:08Z

LGTM so far, just one minor comment.

joaosferreira · 2020-08-18T15:54:33Z

There are a few issues with the defaults:

First, median's default sometimes behaves badly when handling NaNs. This happens because the default uses sort. Here is a simple demonstration:

In [2]: a = onp.asarray([[10, onp.nan, 4], [3, 2, 1]])                          

In [3]: onp.sort(a, axis=1)                                                     
Out[3]: 
array([[ 4., 10., nan],
       [ 1.,  2.,  3.]])

In [4]: onp.median(a, axis=1)                                                   
Out[4]: array([nan,  2.])

In this example, the nan value goes to the end of the first row after sorting. Since the default then takes the middle values along the given axis (the second), the medians computed from the sorted array become array([10., 2.]), which is incorrect, as shown by NumPy's median. Any slice along the given axis that contains a nan should also reduce to a nan median.

Second, in the default implementations for var and nanvar, a ddof different from zero produces wrong results. It seems that it's not as simple as changing the divisor in the mean formula, as stated in the docs. Any tips?

hameerabbasi · 2020-08-18T15:57:36Z

For 1: Use ret = np.where(any(isnan(x), axis=axis), nan, ret).
For 2: Do you have an idea of how big the difference is?
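The suggestion for the first issue can be illustrated with plain NumPy. This is a minimal sketch for a single integer axis; the function name `median_by_sort` is hypothetical and the code is not taken from the PR.

```python
import numpy as np


def median_by_sort(x, axis=-1):
    """Median along one axis via sort, with NaN propagation (illustrative)."""
    x = np.asarray(x, dtype=float)
    s = np.sort(x, axis=axis)
    n = s.shape[axis]
    # Average the two middle elements (they coincide when n is odd).
    lo = np.take(s, (n - 1) // 2, axis=axis)
    hi = np.take(s, n // 2, axis=axis)
    ret = (lo + hi) / 2
    # Any slice containing a NaN must reduce to NaN.
    return np.where(np.any(np.isnan(x), axis=axis), np.nan, ret)


a = np.asarray([[10, np.nan, 4], [3, 2, 1]])
result = median_by_sort(a, axis=1)
```

For the array from the demonstration above, the first row now reduces to nan and the second to 2., in agreement with NumPy's median.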

joaosferreira · 2020-08-18T16:14:21Z

> For 2: Do you have an idea of how big the difference is?

The default's results are commented:

In [2]: a = onp.asarray([[1, 2], [3, 4]])                                       

In [3]: onp.var(a, ddof=1)                                                      
Out[3]: 1.6666666666666667 # -1.1111111111111125

In [4]: onp.var(a, axis=0, ddof=1)                                              
Out[4]: array([2., 2.]) # array([ -6. -16.])

In [5]: onp.var(a, axis=1, ddof=1)                                              
Out[5]: array([0.5, 0.5]) # array([ -4. -24.])
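The commented values can be reproduced by dividing by N - ddof inside both means, so that may be what the default was doing; this is an inference from the numbers, not something stated in the thread. Because mean(x ** 2) - mean(x) ** 2 is the ddof=0 variance, one way to obtain the ddof-corrected value is to rescale it by N / (N - ddof). A minimal sketch for a single integer axis, with hypothetical function names:

```python
import numpy as np


def var_naive(x, axis, ddof):
    # Divisor N - ddof used inside both means: reproduces the commented values.
    n = x.shape[axis]
    mean_of_squares = np.sum(x ** 2, axis=axis) / (n - ddof)
    mean = np.sum(x, axis=axis) / (n - ddof)
    return mean_of_squares - mean ** 2


def var_rescaled(x, axis, ddof):
    # Compute the ddof=0 variance first, then rescale by N / (N - ddof).
    n = x.shape[axis]
    var0 = np.mean(x ** 2, axis=axis) - np.mean(x, axis=axis) ** 2
    return var0 * n / (n - ddof)


a = np.asarray([[1, 2], [3, 4]])
wrong = var_naive(a, axis=0, ddof=1)     # the negative values quoted above
right = var_rescaled(a, axis=0, ddof=1)  # agrees with onp.var(a, axis=0, ddof=1)
```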

hameerabbasi · 2020-08-19T11:26:58Z

Can you merge master and resolve any merge conflicts that may arise?

hameerabbasi · 2020-08-19T15:13:40Z

Is anything remaining here?

joaosferreira · 2020-08-19T15:17:29Z

No, I think I added all that I wanted. You can merge if you think everything is okay. 😄

peterbell10 · 2020-08-19T15:30:09Z

unumpy/_multimethods.py

+            dims[ax] = 1
+        dims = tuple(dims)
+
+        a = transpose(a, unselected_axis + axis)


Why do you transpose and call functions with axis=-1 when you could just call the basic functions with axis=axis?

hameerabbasi · 2020-08-19

I suggested that during the meeting to make the implementation simpler.

joaosferreira · 2020-08-19

If I'm not mistaken it was necessary for median's default, which led me to use it in the other reduce methods as well. But now that you point that out, it's possible that the only method that needs it is median.

joaosferreira · 2020-08-19

I'll remove _reduce from the methods where it's not needed.

peterbell10 · 2020-08-19T16:21:37Z

unumpy/_multimethods.py

+    mask = any(isnan(a), axis=-1).reshape(a.shape[0:-1] + (1,))
+    a = where(mask, nan, a)
+
+    a = sort(a, axis=-1)


Any reason not to use partition?

joaosferreira · 2020-08-19

I guess it's mostly for simplicity. Is partition more efficient?

peterbell10 · 2020-08-19

It should be slightly more efficient as it doesn't completely sort the array.

joaosferreira · 2020-08-24

I replaced sort with partition as you suggested.
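To illustrate the difference, a median over the last axis can be taken from a partition instead of a full sort. This is a minimal sketch with a hypothetical function name, not the PR's code; it ignores NaN handling, which would still need the masking step discussed earlier.

```python
import numpy as np


def median_by_partition(x):
    """Median over the last axis using partition instead of a full sort."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    lo, hi = (n - 1) // 2, n // 2
    kth = [lo] if lo == hi else [lo, hi]
    # After partitioning, positions lo and hi hold the values a full sort
    # would place there; the remaining elements are only partially ordered.
    p = np.partition(x, kth, axis=-1)
    return (p[..., lo] + p[..., hi]) / 2


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 7))
medians = median_by_partition(x)  # agrees with np.median(x, axis=-1)
```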

peterbell10 · 2020-08-19T18:27:11Z

unumpy/_multimethods.py

+    N = 0
+    for ax in axis:
+        N += a.shape[ax]


Suggested change

-    N = 0
-    for ax in axis:
-        N += a.shape[ax]
+    N = 1
+    for ax in axis:
+        N *= a.shape[ax]

Also, you might want a _prod helper so you can do:

N = _prod(a.shape[ax] for ax in axis)

Since this comes up in a few different places.

joaosferreira · 2020-08-24

This is fixed now.

peterbell10 · 2020-08-19T18:30:05Z

unumpy/_multimethods.py

+    N = 0
+    for ax in axis:
+        N += a.shape[ax]


Same issue here.

joaosferreira · 2020-08-24T14:11:04Z

I think this can be merged now.

hameerabbasi · 2020-08-24T14:12:01Z

Thanks for the excellent work, @joaosferreira!

Add multimethods for statistical functions

c2b61fb

Add default implementation for median

7275f20

peterbell10 reviewed Aug 11, 2020

Remove _reduce_argreplacer

c7149fe

Refactor default implementation for median

586a9e1

Add default implementation for mean

ddde1fd

hameerabbasi reviewed Aug 14, 2020

joaosferreira force-pushed the statistical-functions branch from 2d2ca56 to 4d87111 Compare August 16, 2020 12:01

joaosferreira added 2 commits August 18, 2020 15:46

Add helper function for reduction methods

764d78f

Add more default implementations for reduction methods

3b39aa7

joaosferreira force-pushed the statistical-functions branch from 91b7cfa to 3b39aa7 Compare August 18, 2020 15:05

joaosferreira marked this pull request as ready for review August 18, 2020 15:53

hameerabbasi reviewed Aug 18, 2020

Fix some default implementations

3059246

hameerabbasi reviewed Aug 19, 2020

hameerabbasi reviewed Aug 19, 2020

Merge branch 'master' and resolve conflicts with Quansight-Labs#67

8fa8b52

peterbell10 reviewed Aug 19, 2020

peterbell10 reviewed Aug 19, 2020

Remove _reduce from most default implementations

3f10876

peterbell10 reviewed Aug 19, 2020

Refactor median's default in terms of partition

f35a274

hameerabbasi merged commit 1607d9b into Quansight-Labs:master Aug 24, 2020
