Stats functions handle a Series of completely missing values #259

Closed
casbby opened this issue Aug 18, 2014 · 10 comments
Comments

@casbby

casbby commented Aug 18, 2014

The following code returns 0 rather than MISSING:

Series.ofObservations [ 1 => Double.NaN; 3 => Double.NaN; 6 => Double.NaN ] |> Stats.sum

Is this by design? Does it mean Stats.mean actually treats MISSING as zero rather than skipping it? Zero is different from null. Can there be a flag that allows MISSING to be returned?

Thanks
Casbby

@adamklein
Contributor

It does state this behavior in the docs:

/// Returns the sum of the values in a series. The function skips over missing values
/// and `NaN` values. When there are no available values, the result is 0.
///

Whether this makes sense is up for debate. Stats.mean treats NA properly. Tomas - was there a good reason here, or is it just because we leverage Seq.sum?

@tpetricek
Member

The Stats functions treat missing values as missing, which means that they skip over them. Thus Stats.sum [NA; NA] is 0, because we are calculating the sum of no elements. Stats.mean [NA] is NaN because we are calculating 0/0. So I think what Deedle currently does is reasonably consistent behaviour.
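The skip-over semantics can be sketched in Python (hypothetical helper names, an analogue rather than Deedle's actual F# implementation):

```python
import math

def stats_sum(xs):
    # Skip-missing semantics: drop NaN/missing values, then fold with (+).
    # An all-missing input leaves an empty list, and the sum of no elements is 0.
    clean = [x for x in xs if not math.isnan(x)]
    return float(sum(clean))

def stats_mean(xs):
    # Mean is sum/count over the remaining values; with nothing left this
    # is 0/0, rendered here as NaN, matching Stats.mean [NA] = NaN above.
    clean = [x for x in xs if not math.isnan(x)]
    return sum(clean) / len(clean) if clean else float('nan')

print(stats_sum([float('nan'), float('nan')]))  # 0.0
```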

That said, we could have an option to treat "missing" values as not-a-numbers - in that case, having any missing value anywhere in the series would make the result NaN.

@evilpepperman

I'm with @adamklein on this. In the world of finance, and data cleaning generally, we just want the NAs out of our way. It's not an error condition, which forcing NaN would be. Reminds me of the signalling/non-signalling NaN concept. If it seems there is strong support for "NaN anywhere" (like SQL NULL), it should be an optional behavior, since skipping is really so useful.

@tpetricek
Member

@evilpepperman If I'm reading you correctly, then you agree with what I wrote too - or am I wrong?

Skipping over NAs gives us:

Stats.sum [NA; 2.0; 1.0] = 3.0
Stats.sum [NA; NA] = 0.0

Treating them as NaNs gives us:

Stats.sum [NaN; 2.0; 1.0] = NaN
Stats.sum [NaN; NaN] = NaN

We are currently doing the first thing - and I think it is more useful (that said, R distinguishes between NA and NaN and we could too - but it would be a bigger breaking change).
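As a side note, the second behaviour needs no special casing at the fold level: in IEEE 754 arithmetic NaN propagates through addition, as a quick Python check illustrates:

```python
import math

nan = float('nan')

# A plain fold over the raw values gives "NaN anywhere poisons the result" for free:
print(math.isnan(sum([nan, 2.0, 1.0])))  # True
print(math.isnan(sum([nan, nan])))       # True
```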

@evilpepperman

@tpetricek So, I was lumping NA and NaN together as I wrote that. I would need to consider carefully the appearance of NaN in a calculation, maybe with checking code. I have two use cases, one where strict is good (i.e. clean data, processing), the other where robust is good (i.e. dirty data, exploration).

If a Frame/Series could bear a mode flag (NaNisNA, default=true), either behavior could be elicited without caller changes. Of course, it's messier inside - maybe just an additional decorator or filter (e.g. Series.dropNaN) inline to remove the strict behavior.

But usually, I want robust behavior when exploring data.

@tpetricek
Member

@evilpepperman Sorry, my comment was confusing. Deedle does not distinguish between NAs and NaNs. When you load a CSV file with missing values, or when you add a series with float values that are NaN, it always converts them to the internal <missing> representation.

Then it always implements the first behavior, that is:

Stats.sum [<missing>; 2.0; 1.0] = 3.0
Stats.sum [<missing>; <missing>] = 0.0

And when you try to create a series containing NaN, it treats those as <missing>, so you cannot have a series that would have some missing values and some NaNs.
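A rough Python analogue of that normalisation step (the None marker here stands in for Deedle's internal <missing> representation; this is an illustration, not the actual code):

```python
import math

def of_values(xs):
    # Any float NaN is converted to a single internal "missing" marker at
    # construction time, so NA and NaN can never coexist in one series.
    return [None if isinstance(x, float) and math.isnan(x) else x for x in xs]

print(of_values([1.0, float('nan'), 3.0]))  # [1.0, None, 3.0]
```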

@adamklein
Contributor

This is inconsistent if you come from the world of pandas, where Series([NaN, NaN]).sum() is NaN. I would propose to stay consistent with pandas, even though I think our current implementation is defensible - but the pandas version seems to be well accepted, so the behavior will surprise people.

@tpetricek
Member

Ah, I see - I wouldn't mind doing a change for this case. I'll send a PR - being consistent with pandas is a good thing.

[some more rambling]
It feels a bit odd, as 0 is the unit element for "+" (0 + x = x) and NaN is the zero (absorbing) element (NaN + x = NaN), so I thought the unit would be the natural thing to return for an empty list. For min/max we avoid this by just returning an option (the units would be +Infinity / -Infinity).
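The algebraic point has a direct Python analogue: folding an empty sequence yields the unit of the operation, while NaN acts as the absorbing element:

```python
import math

# Folding an empty sequence returns the unit of the operation:
print(sum([]))                          # 0, the unit of +
print(min([], default=float('inf')))    # inf, the unit of min
print(max([], default=float('-inf')))   # -inf, the unit of max

# NaN, by contrast, absorbs: NaN + x = NaN for any x.
print(math.isnan(float('nan') + 42.0))  # True
```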

@adamklein
Contributor

I agree w/you @tpetricek, algebraically, we are probably doing the right thing, and it feels natural if you are used to thinking in terms of folds...

@tpetricek
Member

Good :-) I don't mind breaking algebraic laws, but I like to understand why!

tpetricek added a commit to tpetricek/Deedle that referenced this issue Feb 5, 2015