mean function on dimension #2265

Closed
dfagnan opened this Issue Feb 11, 2013 · 58 comments

Comments

Projects
None yet

dfagnan commented Feb 11, 2013

I believe in the recent shuffle that mean on dimensions of a matrix does not work.

Eg. mean([2 3 4; 1 2 3],2)

Member

pao commented Feb 11, 2013

#2266 didn't appear to address this.

Member

andreasnoack commented Feb 11, 2013

#2200 indirectly discussed this. I asked for mean, median and var treating this the same way and I guess the result is that functions along dimensions is delgated to new amap style functions (which I think is fine), but I am not sure. @StefanKarpinski may have something to add here. One problem with the Matlab way is that var(Matrix,Number) is used when the mean is known.

Owner

StefanKarpinski commented Feb 11, 2013

This was intentionally removed. @JeffBezanson may have some suggestions of ways to accomplish this instead.

Owner

JeffBezanson commented Feb 11, 2013

mean(A, dim) can always be added, implemented in terms of whatever higher-order function we prefer. Conflicts with things like var(Matrix,Number) are the real problem. That just requires an API decision.

Contributor

blakejohnson commented Feb 21, 2013

I think it is best to have mean, var, etc work not just along an axis but on regions (vectors of dimensions), because it is common when working on high-dimensional data sets to need to do reductions along multiple axes. I have an open pull request out on Stats.jl, but this really ought to be in base.

Owner

StefanKarpinski commented Feb 21, 2013

The reason it's not in base is that it was unclear how these calls should look. The std and var functions had known-mean variants that were completely at odds with the dimension argument for mean. Until that is all straightened out with a coherent story, it's not going into base.

Contributor

blakejohnson commented Feb 21, 2013

Fair enough. I'll take a look at what other languages do for std and var with known-mean to see what the options are.

Owner

StefanKarpinski commented Feb 21, 2013

There's also a point where you want to stop adding dimension arguments to every function and start using higher order functions for that kind of thing. I feel like this may be that point.

Contributor

blakejohnson commented Feb 21, 2013

Do you feel the same way about min, max, and sum?

Owner

StefanKarpinski commented Feb 21, 2013

Yeah, that's the problem. Those are heavily used in Matlab style code and they are pretty handy to have a dimension argument for. So I don't know.

Member

kmsquire commented Feb 21, 2013

Keyword arguments (#485) would probably be useful here.

Owner

timholy commented Feb 22, 2013

How about adding a Dim type? For example, B = mean(A, Dim(2)). It would make the meaning of the 2 unambiguous. For std one could support B = std(A, Dim(2)) and B = std(A, Dim(2), m) and B = std(A, m, Dim(2)) so the order of arguments wouldn't matter. This gives keyword-like behavior without requiring them in general.

In a simple implementation Dim would simply wrap a single integer. To include @blakejohnson's multidimensional request, the implementation of Dim might be

type Dim{A<:AbstractVector{Int}}
    index::A
end
Dim(i::Int) = Dim([i])

Note it stores an AbstractArray, but should almost surely not be a subtype of it.

I'm playing with this right now in the context of image processing, adding an img[Seq(14)] syntax to allow one to generically access the 14th image in a sequence no matter whether your images are two- or three-dimensional, grayscale/color, etc. It's too early to tell how much I like it, but if I do I will probably adopt this and finally push out an Image package.

Owner

ViralBShah commented Feb 22, 2013

Using Dim does make the code more readable, but also a bit more verbose to type. That may actually be ok, since we do not yet have keyword args.

Owner

ViralBShah commented Feb 22, 2013

In case of max etc., we could just support both max(A, d) and max(A, Dim(d)).

Owner

timholy commented Feb 22, 2013

What if you want to support max([2,-3,5],0) -> [2,0,5]?

Owner

ViralBShah commented Feb 22, 2013

I would prefer to force the use of Dim() if we adopt it. The behaviour you propose is also perhaps desirable, but it will confuse a lot of matlab users, and more so, it has the potential to break existing programs silently. I don't really have a good solution here.

Owner

timholy commented Feb 22, 2013

I was simply responding to your proposal to support both max(A, d) and max(A, Dim(d)), which from the fact that both used d I interpreted to mean as automatically converting d->Dim(d). I don't think we should do that, and the example I gave was intended to illustrate the pitfall of doing so. I support the idea of forcing Dim.

FYI Matlab behaves exactly as in the example I just gave you:

>> a = [2, -3, 5]

a =

     2    -3     5

>> max(a, 0)

ans =

     2     0     5

Matlab uses the ugly max(a, [], d) to handle the maximum over a dimension.

If we go with the Dim approach, we could also add max!(a, 0) as a sweet syntax for truncating arrays to nonnegative values.

Owner

ViralBShah commented Feb 22, 2013

Oh yes - it has been so long since I used matlab that I actually forgot how I hated that syntax.

Contributor

blakejohnson commented Feb 22, 2013

The Dim syntax is pretty clear. I do have to ask: this whole problem seems to exist because of these variants with known mean. Do other languages provide similar methods? It seems like var(A - m) is just as good as var(A, m) if I know the mean of A is m.

Contributor

blakejohnson commented Feb 22, 2013

I take that back, var(A - m) is only just as good if var always assumes that the mean is zero. Which is not a great assumption to bake in.

Member

andreasnoack commented Feb 22, 2013

I like det Dim type and maybe the idea could be used generally instead of keywords. It is kind of the same thing I have done with the Hermitian and Triangular types in #2308.

dfagnan commented Feb 22, 2013

I'm fairly confident that the known-mean variant is not widely used in
MATLAB and pales in comparison to the regular use of mean along a given
dimension, so it seems a shame to complicate the usage for this reason. To
be honest, I was quite shocked when this simple functionality was broken -
so much so that I stopped updating julia. The forced Dim argument seems
quite reasonable, although I personally feel even this is unnecessary.

On Fri, Feb 22, 2013 at 9:26 AM, Andreas Noack Jensen <
notifications@github.com> wrote:

I like det Dim type and maybe the idea could be used generally instead of
keywords. It is kind of the same thing I have done with the Hermitian and
Triangular types in #2308 #2308
.


Reply to this email directly or view it on GitHubhttps://github.com/JuliaLang/julia/issues/2265#issuecomment-13945372.

Owner

timholy commented Feb 22, 2013

Agreed. It's actually a more serious issue for max(A, i::Int). Is that i a dimension index, or is it the comparison value?

blakejohnson added a commit to blakejohnson/julia that referenced this issue Feb 26, 2013

mean, max, min, sum, prod, all, any on Dim.
First pass at implemented the Dim type for #2265. For an array A, you can now
call, e.g.: max(A, Dim(1)). The Dim type can be constructed with integers,
arrays, ranges, or tuples, so all of the following are valid:
Dim(1)
Dim(1, 2)
Dim([1])
Dim([1,2])
Dim([1 2])
Dim(1:2)
Owner

JeffBezanson commented Feb 27, 2013

I strongly feel that the sanest thing is to use a higher-order function apply_to_dims(mean, A, 2), with whatever name is deemed good. The syntax mean(A, Dim(2)) is reasonable by itself, but it means you don't generally know what mean(A, x) does (x might be a Dim). I've always been ok with sum(A,2), but the situation it leaves you in for max and min is just awful. There are also tons of vector->scalar and vector->vector functions, and you can't expect all of them to be written to handle Dim arguments.

Owner

stevengj commented Feb 27, 2013

Note that fft(A, dim) and fft(A, (dim1, dim2, ...)) follow the old convention (which used to have a type union called Dimspec but is now duck-typed as anything that acts like an iterable sequence of integers). I have to admit I like the old syntax and find Dim(d) to be an unnecessary complication in most cases.

@JeffBezanson, the fft example also shows that this kind of thing should not necessarily be a high-order function, for performance reasons.

Contributor

blakejohnson commented Feb 27, 2013

An apply_to_dims method would certainly be useful, but I also feel that, for convenience, certain basic methods ought to have a dimension argument. I find @timholy's last comment persuasive that max(A, b) should not mean the max of A along dimension b, but really something like max(max(A), b).

Owner

JeffBezanson commented Feb 27, 2013

The performance issue is a good point; the higher-order function can't do the fastest thing unless it has lots of special cases inside it anyway. I would also not be fully comfortable telling people to write something much longer than sum(A,2). So it seems there will have to be two approaches in play. In that case I'd vote for using func(A, dim) in as many cases as possible, for conciseness and familiarity. Perhaps the known-mean versions of var and std should just be renamed.
max is the most annoying case. I think max(A, (), d) has to go, and we can use the higher-order function for that.

Owner

timholy commented Feb 27, 2013

If the performance problem could be solved, I'd agree with @JeffBezanson. But right now the performance gap would be too big. Simple example:

julia> A = randn(1000,1000,10);

julia> @time aA = abs(A);
elapsed time: 0.10193109512329102 seconds

julia> @time mA = map(abs, A);
elapsed time: 1.2647428512573242 seconds

I'm not really excited about that kind of price, so I think we need a short-term solution.

BTW, nice work @blakejohnson for putting this together so we can all see what it would look like in practice.

Owner

JeffBezanson commented Feb 27, 2013

There are actually two performance problems: the one @timholy cites can be solved just by me getting my act together, and then cases like fft where the dimension argument needs to be passed through to the planner, instead of being used only by the higher-order function.

Contributor

blakejohnson commented Feb 27, 2013

I agree, max(A, (), d) is as ugly as sin.

I would also be on board with renaming the known mean functions.

Owner

timholy commented Feb 27, 2013

"me just getting my act together" hah. The problem is there are too many "deep" things yet to do and only one Jeff. Thank goodness we have so many other awesome heavy-lifters.

Member

toivoh commented Feb 27, 2013

Regarding the max/min confusion: I think that the upshot of this discussion was to have separate maxof etc. functions for taking the max between arguments.

Owner

ViralBShah commented Feb 27, 2013

I do like the suggestion of maxof, and then not messing around with Dim.

Member

toivoh commented Feb 27, 2013

Me too.

Owner

JeffBezanson commented Feb 27, 2013

I know this discussion has been going on for a while and consensus is emerging, but I really feel like max(a,b) giving the larger of a and b is an absolutely standard mathematical function, as standard as sin. It also appears in that form in every language from C++ to Excel. If you google "max function", every result agrees.

Owner

StefanKarpinski commented Feb 27, 2013

I think Jeff makes a very valid point here. Personally, I'm not sold on the maxof thing.

Contributor

blakejohnson commented Feb 28, 2013

@JeffBezanson Not every language. Python/NumPy uses maximum for element-wise comparison, and max along a dimension.

Contributor

blakejohnson commented Feb 28, 2013

So, what's the solution here? The options suggested so far:

  1. Drop support for dimension arguments entirely and rely on some kind of apply_to_dims method. As @stevengj points out, however, there are certain cases like fft where this would lead to a large performance penalty.
  2. Use a Dim type to distinguish the comparison element from the dimension arguments (see #2409).
  3. Change the method name to distinguish the binary and unary forms of max (see #2419).
  4. Use a keyword argument... when Julia supports keyword arguments.
Member

wlbksy commented Feb 28, 2013

I prefer No.4

Owner

JeffBezanson commented Feb 28, 2013

True, but non-numpy python also has max(a,b) for scalars with the usual meaning. So I'd only insist that max(1,2)==2.
From there I'd prefer but not insist that max vectorize normally and be elementwise if a and/or b are arrays.
@nolta 's branch shows that we could add keyword arguments fairly quickly, even if not with great performance at first.

Owner

ViralBShah commented Feb 28, 2013

Do we know how much performance we would lose with keyword arguments? If it is something like 10-25%, and we have a plan for regaining the lost performance, it would certainly be the right way to go. In general, it would be nice if people's codes do not slow down across releases.

One exercise could be to implement max with keywords on @nolta 's branch and measure the performance difference.

Owner

ViralBShah commented Feb 28, 2013

Of the four suggestions above outlined by @blakejohnson, my preferences are in the order: 4, 3, 2, 1.

Contributor

blakejohnson commented Feb 28, 2013

I think everyone likes option 4, when it becomes available. Option 2 is the closest we can get right now, and it looks pretty clean in use. It's also no more keystrokes than a keyword argument...

Owner

timholy commented Feb 28, 2013

It also leverages type dispatch better than a keyword argument.

Owner

timholy commented Mar 1, 2013

I should explain the last comment more thoroughly. From a user's perspective, max(A,dim = 2) is at least as nice as max(A,Dim(2)). From a function author's perspective it may not be so nice. To me, the role of keyword arguments is to rationalize optional inputs. So when I consider writing this function with keyword arguments, depending on which implementation of keywords gets adopted I get something like this:

function max(kw::Keywords, A, B = nothing, dim = nothing)
  ...
end

Here, B is the comparision value, as in max(3,5)->5. Now, it seems extremely unlikely that we'd want to simultaneously support comparing A vs B and reducing over a dimension, because it makes the implementation of max quite complicated, with lots of branching for different cases. But you kind of have to, because you've declared that this is a 3-input function.

In contrast, the Dim solution is nice and clean:

function max(A, d::Dim)
  ...
end
function max(A, B)
  ...
end

The point is, the type system is actually quite good for eliminating ambiguity, which is really what we're talking about here: is that integer a comparison value or a dimension index? This is getting even better with the impending arrival of immutable types, for which the whole point (as far as I can tell) is to provide interpretive context for sequences of bits.

Keyword arguments seem to be best for "overriding" behavior that you'd rather not have to specify on a regular basis, but that's really quite different from the conceptual challenge here.

Contributor

blakejohnson commented Mar 1, 2013

@timholy that's a very good point, and bumps option 2 (Dim type) to my preferred choice.

If we go with a Dim type, where do we use it? Everywhere we want a dimension argument? Or only where we need to distinguish different call signatures?

Owner

JeffBezanson commented Mar 1, 2013

With the implementations we're considering, you could define:

function max(A, dim = nothing)
  ...
end

function max(A)
  ...
end

function max(A,B)
  ...
end

So while not entirely dispatching on keyword args, still possible to dispatch on whether any are present at all.

Member

toivoh commented Mar 1, 2013

Aha, the first case in favor of dispatch on keyword arguments! :)

Member

tshort commented Mar 1, 2013

I played around with some timings on Jeff's jb/kwargs branch. I couldn't measure any difference between max(x, dim) and max(x, dim = somedim). The problem is that my max along a dimension was slow enough to swamp any overhead from keyword args. With a simpler function involving scalars, the overhead can be sizeable.

cmey commented Mar 5, 2013

So, please, what's the current alternative to doing:

mean([1; 2; 3], 1) == 2

since it disapeared in release-0.1 and broke my code ?

Contributor

blakejohnson commented Mar 5, 2013

Install the Stats package and then that line will work.

On Tuesday, March 5, 2013 at 9:40 AM, Christophe MEYER wrote:

So, please, what's the current alternative to doing:
mean([1; 2; 3], 1)
since it disapeared in release-0.1 and broke my code ?


Reply to this email directly or view it on GitHub (#2265 (comment)).

Member

nolta commented Mar 13, 2013

I think we made the wrong call on std and var. Neither R nor Matlab accepts a known mean as an argument. The fact that our std is much more complicated than R's sd should be a huge red flag ;-)

I mean, look at this help entry:

Base.std(v[, corrected])

   Compute the sample standard deviation of a vector "v". If the
   optional argument "corrected" is either left unspecified or is
   explicitly set to the default value of "true", then the algorithm
   will return an estimator of the generative distribution's standard
   deviation under the assumption that each entry of "v" is an IID
   draw from that generative distribution. This computation is
   equivalent to calculating "sqrt(sum((v .- mean(v)).^2) /
   (length(v) - 1))" and involves an implicit correction term
   sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- mean(v)).^2) /
   length(v))", which is the empirical standard deviation of the
   sample.

Base.std(v, m[, corrected])

   Compute the sample standard deviation of a vector "v" with known
   mean "m". If the optional argument "corrected" is either left
   unspecified or is explicitly set to the default value of "true",
   then the algorithm will return an estimator of the generative
   distribution's standard deviation under the assumption that each
   entry of "v" is an IID draw from that generative distribution.
   This computation is equivalent to calculating "sqrt(sum((v .-
   m).^2) / (length(v) - 1))" and involves an implicit correction
   term sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- m).^2) / length(v))",
   which is the empirical standard deviation of the sample.

I vote for std(x[, dim]), var(x[, dim], and mean(x[, dim]).

Owner

JeffBezanson commented Mar 13, 2013

Not my specialty and not my call, but I agree with @nolta . The known-mean version is exotic enough to be renamed. Or we can add a mean=x keyword arg eventually.

Owner

StefanKarpinski commented Mar 13, 2013

Agreed. The proposed known-mean alternatives are stdm(m,v) and varm(m,x).

Member

johnmyleswhite commented Mar 13, 2013

I still think the names stdm(m,v) and varm(m,x) are the way to go.

As for the help docs (which I wrote), I think their verbosity reflects the continued absence of a viable mechanism for splitting documentation into appropriate components. The current docs are an attempt to provide something like a condensed version of what R documentation would call the "Details" section. For things like variances which have different definitions for different communities, I think it's really important that the exact object being computed is described in detail somewhere in our documentation.

Contributor

blakejohnson commented Mar 13, 2013

@nolta So, are you also suggesting to drop support for the 'corrected' flag to choose between N and N-1 normalization?

Member

nolta commented Mar 13, 2013

@blakejohnson Yes, for two reasons: 1) R's sd doesn't support it, and 2) i think something like central_moment(2, x) would be clearer.

Contributor

blakejohnson commented Mar 13, 2013

I'm on board with that.

@nolta nolta closed this in 7f2a69c Mar 14, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment