unstack error message for missing values #3339

Closed
jariji opened this issue Jun 1, 2023 · 8 comments · Fixed by #3344

Comments

@jariji
Contributor

jariji commented Jun 1, 2023

julia> @chain begin
           DataFrame(x=rand(1:100, 10), y=rand(['a','b', missing], 10), z=rand(10))
           unstack(:x, :y, :z)
       end
ERROR: ArgumentError: Missing value in variable :y. Pass `allowmissing=true` to skip missings.

The message says "Pass allowmissing=true to skip missings". Is that an accurate description of what that kwarg does? The docs say "if true then a column referring to missing value is created", which seems like a different thing.

bkamins added the bug label Jun 1, 2023
bkamins added this to the 1.6 milestone Jun 1, 2023
@bkamins
Member

bkamins commented Jun 1, 2023

Yes, it should be fixed. But while fixing it let us decide how this case should be handled:

julia> unstack(DataFrame(x=rand(1:100, 10), y=rand(["missing", missing], 10), z=rand(10)), :x, :y, :z, allowmissing=true)
ERROR: ArgumentError: Duplicate variable names: :missing. Pass makeunique=true to make them unique using a suffix automatically.

Maybe we should allow passing the name of the column to which missing should be mapped?

@bkamins
Member

bkamins commented Jun 1, 2023

Also we have the following issue:

julia> unstack(DataFrame(x=rand(1:100, 10), y=rand(['a', "a"], 10), z=rand(10)), :x, :y, :z, allowmissing=true)
ERROR: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

I am not sure if we want to error in this case, or also allow some other behavior.

@jariji
Contributor Author

jariji commented Jun 1, 2023

Why does missing need special handling? Why not just treat it like a normal value/column?

@bkamins
Member

bkamins commented Jun 1, 2023

It does not. I am just saying that I noticed that we need to decide what to do if we have two distinct values that map to the same column name, e.g. "missing" and missing, or 'a' and "a". Maybe we need a separate kwarg.
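
The collision described here can be reproduced without DataFrames.jl. The following is a minimal sketch in plain Julia, assuming that column names are derived from the string form of each key value (the two error messages above suggest this, but the thread does not show the implementation). The helper `collides` is hypothetical and exists only for illustration.

```julia
# Hypothetical helper: true when two key values are distinct
# but would produce the same column name via `string`.
collides(a, b) = !isequal(a, b) && string(a) == string(b)

# A Char and a String are different values with the same string form.
char_vs_string = collides('a', "a")

# `missing` and the string "missing" also share a string form.
missing_vs_string = collides(missing, "missing")

# Values with different string forms do not collide.
different_keys = collides('a', 'b')
```

Any rule for resolving such collisions (erroring, a suffix, or a dedicated kwarg) would have to decide which of the colliding values keeps the plain name.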

@jariji
Contributor Author

jariji commented Jun 1, 2023

I mean what is the point of allowmissing=false at all - why not just do it?

@bkamins
Member

bkamins commented Jun 1, 2023

Because, conceptually, missing can be any value, so we do not know which value it should be mapped to.
If allowmissing=true, we say that missing should be treated as a valid value (and not as a sentinel for a value that we do not know).
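
The difference between "treating missing as a valid value" and "skipping missings" can be made concrete. The sketch below uses a small hand-made table (the data and the variable names `df`, `wide`, and `skipped` are illustrative, not from the issue) and assumes the behavior quoted from the docs earlier in this thread: with `allowmissing=true` a column referring to the missing value is created. Actually skipping the rows is a separate step, done here with `dropmissing`.

```julia
using DataFrames

# Illustrative data: unique row keys in :x, one missing key in :y.
df = DataFrame(x = 1:4, y = ['a', 'b', missing, 'a'], z = [0.1, 0.2, 0.3, 0.4])

# Default behavior: a missing value in the column-key variable is an error.
default_errors = try
    unstack(df, :x, :y, :z)
    false
catch err
    err isa ArgumentError
end

# allowmissing=true keeps the row and creates a column for the missing key.
wide = unstack(df, :x, :y, :z; allowmissing = true)

# Skipping missings means dropping those rows before reshaping.
skipped = unstack(dropmissing(df, :y), :x, :y, :z)
```

Comparing `names(wide)` with `names(skipped)` shows whether a column for the missing key is present, which is the distinction the original error message blurred.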

@jariji
Contributor Author

jariji commented Jun 1, 2023

Can you distinguish that reasoning from sort([1,2,missing]) which doesn't require an allowmissing=true kwarg? missing can appear in any position, so first(sort(xs)) could give the wrong answer but Julia says that's okay.
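
For reference, a short sketch of the `sort` behavior being compared against, using only Base Julia. The variable names are illustrative. `sort` accepts `missing` without any opt-in and places it after all other values, so `first` returns the smallest known value even though the missing entry could, conceptually, have been smaller.

```julia
xs = [3, missing, 1]

# No kwarg is needed; missing is ordered after every other value.
sorted = sort(xs)

# The smallest *known* value, not necessarily the true minimum.
smallest_known = first(sorted)

# An explicit alternative: drop missings before sorting.
strict = sort(collect(skipmissing(xs)))
```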

@bkamins
Member

bkamins commented Jun 2, 2023

The design of sort is not missing-aware, while the design of DataFrames.jl is.
