Stats function handle a Series of completely missing value #259
Comments
It does state this behavior in the docs:
Whether this makes sense is up for debate. Stats.mean treats NA properly. Tomas - was there a good reason here, or is it just because we leverage Seq.sum?
That said, we could have an option to treat "missing" values as not-a-numbers - in that case, having any missing value anywhere in the series would make the result
I'm with @adamklein on this. In the world of finance, and data cleaning generally, we just want the NAs out of our way. It's not an error condition, which forcing NaN would be. Reminds me of the signalling/non-signalling NaN concept. If it seems there is strong support for "NaN anywhere" (like SQL NULL), it should be an optional behavior, since skipping is really so useful.
@evilpepperman If I read correctly what you're saying, then you agree with what I wrote too, or am I wrong? Skipping over NAs gives us:
Treating them as NaNs gives us:
We are currently doing the first thing - and I think it is more useful (that said, R distinguishes between NA and NaN and we could too - but it would be a bigger breaking change).
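To make the two options concrete, here is a minimal Python sketch rather than Deedle code. The function `total`, its `skip_missing` flag, and the use of `None` as the missing marker are all hypothetical names chosen for illustration; nothing here is Deedle's API.

```python
import math

def total(values, skip_missing=True):
    """Sum floats; None marks a missing value (a hypothetical stand-in for NA)."""
    if skip_missing:
        # Robust mode: drop missing entries, then sum whatever is left.
        return math.fsum(v for v in values if v is not None)
    # Strict mode: any missing entry poisons the result, like NaN or SQL NULL.
    if any(v is None for v in values):
        return math.nan
    return math.fsum(values)

data = [1.0, None, 3.0]
print(total(data))                      # 4.0
print(total(data, skip_missing=False))  # nan
print(total([None, None]))              # 0.0
```

The last call is the case this issue is about: once every value is skipped, the sum runs over nothing and comes back as zero.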
@tpetricek So, I was lumping NA and NaN together as I wrote that. I would need to consider carefully the appearance of NaN in a calculation, maybe with checking code. I have two use cases, one where strict is good (i.e. clean data, processing), the other where robust is good (i.e. dirty data, exploration). If a Frame/Series could bear a mode flag (NaNisNA, default=true), either behavior could be elicited w/o caller changes. Of course, it's messier inside-- maybe just an additional decorator or filter (e.g. Series.dropNaN) inline to remove the strict behavior. But usually, I want robust behavior when exploring data.
@evilpepperman Sorry, my comment was confusing. Deedle does not distinguish between NAs and NaNs. When you load a CSV file with missing values or when you add a series with float values that are NaN, it always converts them to internal Then it always implements the first behavior, that is:
And when you try to create a series containing NaN, it treats those as
This is inconsistent if you come from the world of pandas, where
Ah, I see - I wouldn't mind doing a change for this case. I'll send a PR - being consistent with pandas is a good thing. [some more rambling]
I agree w/you @tpetricek, algebraically, we are probably doing the right thing, and it feels natural if you are used to thinking in terms of folds...
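The fold argument can be sketched in a few lines of Python. This is an illustration under assumptions, not how Deedle implements its statistics: `fold_sum` and `fold_mean` are hypothetical helpers, and the empty list stands for a series whose values were all skipped as missing.

```python
from functools import reduce
import math

def fold_sum(values):
    # A fold needs a starting value; for addition that is the identity 0.0,
    # so folding over no values at all yields 0.0.
    return reduce(lambda acc, v: acc + v, values, 0.0)

def fold_mean(values):
    # The mean divides the sum by the count, and with no values that is 0/0.
    # Python raises ZeroDivisionError for 0.0 / 0, so return NaN explicitly.
    values = list(values)
    return fold_sum(values) / len(values) if values else math.nan

present = []  # every entry was missing, so nothing survives the skip
print(fold_sum(present))   # 0.0
print(fold_mean(present))  # nan
```

A sum has a natural answer for the empty case while a mean does not, which is why the two can disagree on an all-missing series.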
Good :-) I don't mind breaking algebraic laws, but I like to understand why!
The following code returns 0 rather than MISSING
Is this by design? Does it mean Stats.mean actually treats MISSING as zero rather than skipping it?
Zero is different than null. Can there be a flag to allow MISSING to be returned?
Thanks
Casbby