Stats functions handle a Series of completely missing values #259

Closed
casbby opened this issue Aug 18, 2014 · 10 comments
Comments

@casbby

casbby commented Aug 18, 2014

The following code returns 0 rather than MISSING:

Series.ofObservations [ 1 => Double.NaN; 3 => Double.NaN; 6 => Double.NaN ] |> Stats.sum

Is this by design? Does it mean Stats.mean actually treats MISSING as zero rather than skipping it? Zero is different from null. Can there be a flag that allows MISSING to be returned?

Thanks
Casbby

@adamklein
Contributor

It does state this behavior in the docs:

/// Returns the sum of the values in a series. The function skips over missing values
/// and `NaN` values. When there are no available values, the result is 0.
///

Whether this makes sense is up for debate. Stats.mean treats NA properly. Tomas - was there a good reason here, or is it just because we leverage Seq.sum?

@tpetricek
Member

The Stats functions treat missing values as missing, which means that they skip over them. Thus Stats.sum [NA; NA] is 0, because we are calculating the sum of no elements. Stats.mean [NA] is NaN because we are calculating 0/0. So I think what Deedle currently does is reasonably consistent behaviour.
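The skip-over semantics can be sketched in Python (hypothetical helper names, an analogue rather than Deedle's actual F# implementation):

```python
import math

def stats_sum(xs):
    # Skip-missing semantics: drop NaN/missing values, then fold with (+).
    # An all-missing input leaves an empty list, and the sum of no elements is 0.
    clean = [x for x in xs if not math.isnan(x)]
    return float(sum(clean))

def stats_mean(xs):
    # Mean is sum/count over the remaining values; with nothing left this
    # is 0/0, rendered here as NaN, matching Stats.mean [NA] = NaN above.
    clean = [x for x in xs if not math.isnan(x)]
    return sum(clean) / len(clean) if clean else float('nan')

print(stats_sum([float('nan'), float('nan')]))  # 0.0
```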

That said, we could have an option to treat "missing" values as not-a-numbers - in that case, having any missing value anywhere in the series would make the result NaN.

@evilpepperman

I'm with @adamklein on this. In the world of finance, and data cleaning generally, we just want the NAs out of our way. It's not an error condition, which forcing NaN would be. Reminds me of the signalling/non-signalling NaN concept. If it seems there is strong support for "NaN anywhere" (like SQL NULL), it should be an optional behavior, since skipping is really so useful.

@tpetricek
Member

@evilpepperman If I'm reading you correctly, then you agree with what I wrote too - or am I wrong?

Skipping over NAs gives us:

Stats.sum [NA; 2.0; 1.0] = 3.0
Stats.sum [NA; NA] = 0.0

Treating them as NaNs gives us:

Stats.sum [NaN; 2.0; 1.0] = NaN
Stats.sum [NaN; NaN] = NaN

We are currently doing the first thing - and I think it is more useful (that said, R distinguishes between NA and NaN and we could too - but it would be a bigger breaking change).
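As a side note, the second behaviour needs no special casing at the fold level: in IEEE 754 arithmetic NaN propagates through addition, as a quick Python check illustrates:

```python
import math

nan = float('nan')

# A plain fold over the raw values gives "NaN anywhere poisons the result" for free:
print(math.isnan(sum([nan, 2.0, 1.0])))  # True
print(math.isnan(sum([nan, nan])))       # True
```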

@evilpepperman

@tpetricek So, I was lumping NA and NaN together as I wrote that. I would need to consider carefully the appearance of NaN in a calculation, maybe with checking code. I have two use cases, one where strict is good (i.e. clean data, processing), the other where robust is good (i.e. dirty data, exploration).

If a Frame/Series could bear a mode flag (NaNisNA, default=true), either behavior could be elicited without caller changes. Of course, it's messier inside - maybe just an additional decorator or filter (e.g. Series.dropNaN) inline to remove the strict behavior.

But usually, I want robust behavior when exploring data.

@tpetricek
Member

@evilpepperman Sorry, my comment was confusing. Deedle does not distinguish between NAs and NaNs. When you load a CSV file with missing values, or when you add a series with float values that are NaN, it always converts them to the internal <missing> representation.

Then it always implements the first behavior, that is:

Stats.sum [<missing>; 2.0; 1.0] = 3.0
Stats.sum [<missing>; <missing>] = 0.0

And when you try to create a series containing NaN, it treats those as <missing>, so you cannot have a series that would have some missing values and some NaNs.
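A rough Python analogue of that normalisation step (the None marker here stands in for Deedle's internal <missing> representation; this is an illustration, not the actual code):

```python
import math

def of_values(xs):
    # Any float NaN is converted to a single internal "missing" marker at
    # construction time, so NA and NaN can never coexist in one series.
    return [None if isinstance(x, float) and math.isnan(x) else x for x in xs]

print(of_values([1.0, float('nan'), 3.0]))  # [1.0, None, 3.0]
```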

@adamklein
Contributor

This is inconsistent if you come from the world of pandas, where Series([NaN, NaN]).sum() is NaN. I would propose to stay consistent with pandas, even though I think our current implementation is defensible - but the pandas version seems to be well accepted, so the behavior will surprise people.

@tpetricek
Member

Ah, I see - I wouldn't mind doing a change for this case. I'll send a PR - being consistent with pandas is a good thing.

[some more rambling]
It feels a bit odd, as 0 is the unit element for "+" (0 + x = x) and NaN is the zero (absorbing) element (NaN + x = NaN), so I thought the unit would be the natural thing to return for an empty list. For min/max we avoid this by just returning an option (the units would be +Infinity / -Infinity).
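The algebraic point has a direct Python analogue: folding an empty sequence yields the unit of the operation, while NaN acts as the absorbing element:

```python
import math

# Folding an empty sequence returns the unit of the operation:
print(sum([]))                          # 0, the unit of +
print(min([], default=float('inf')))    # inf, the unit of min
print(max([], default=float('-inf')))   # -inf, the unit of max

# NaN, by contrast, absorbs: NaN + x = NaN for any x.
print(math.isnan(float('nan') + 42.0))  # True
```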

@adamklein
Contributor

I agree w/you @tpetricek, algebraically, we are probably doing the right thing, and it feels natural if you are used to thinking in terms of folds...

@tpetricek
Member

Good :-) I don't mind breaking algebraic laws, but I like to understand why!

tpetricek added a commit to tpetricek/Deedle that referenced this issue Feb 5, 2015