Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improwe workflows with filtered DataFrame #2354

Open
bkamins opened this issue Aug 5, 2020 · 21 comments
Open

Improwe workflows with filtered DataFrame #2354

bkamins opened this issue Aug 5, 2020 · 21 comments
Labels
feature non-breaking The proposed change is not breaking
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Aug 5, 2020

This has been discussed in several places I create a separate issue for this to keep track of it as it is an important functionality I think. What we want is filter(predicate, df, view=true) to return a view (so then we can conveniently update this view for example).

@bkamins bkamins added this to the 1.0 milestone Aug 5, 2020
@bkamins bkamins added feature non-breaking The proposed change is not breaking labels Aug 5, 2020
@bkamins
Copy link
Member Author

bkamins commented Aug 6, 2020

Now given #2211 (comment) and #2211 (comment)

my question is if we should not also add dropmissing keyword argument that would:

  • if nothing do what we have now
  • if true then missings would be excluded from the selection
  • if false then missings would be included in the selection

This is essentially as adding coalesce wrapper to the filtering function (so maybe we feel that it is not needed), but maybe you will judge that:

filter(:x => >(1), df, dropmissing=yes)

reads better than:

filter(:x => x -> coalesce(x > 1, false), df)

we do not save typing here, but maybe you will find it more readable?

CC @nalimilan @matthieugomez @pdeffebach

@matthieugomez
Copy link
Contributor

matthieugomez commented Aug 6, 2020

I think I would prefer a kwarg skipmissing, where true corresponds to dropmissing = true, and false corresponds to dropmissing = nothing. I am not sure it is worth introducing a new kwarg dropmissing just to get the behavior dropmissing = false.

More generally, I think that the decision (and also the default of the kwarg option) should be taken in conjonction with the discussion for transform/select etc... #2314

@bkamins
Copy link
Member Author

bkamins commented Aug 6, 2020

A good point. But I would not alter the defaults (sorry for this position here but I do want to avoid breaking changes of this kind till 1.0).

@bkamins
Copy link
Member Author

bkamins commented Aug 6, 2020

The only issue is that skipmissing in select etc. will have a bit different behavior than skipmissing here.

@matthieugomez
Copy link
Contributor

matthieugomez commented Aug 6, 2020

Yes. Personally, I prefer an argument that always have the same name (skipmissing), even if it has slightly different meanings in different contexts, rather than a different keyword argument everytime. I am not a fan on how R has na.rm, na.exclude, na.omit etc.

@bkamins
Copy link
Member Author

bkamins commented Aug 6, 2020

(ok - and I would appreciate a comment in #2314 what you think should be the results on the cases I have listed there 😄)

@bkamins bkamins changed the title allow filter to return a view Improwe workflows with filtered DataFrame Aug 8, 2020
@bkamins
Copy link
Member Author

bkamins commented Aug 8, 2020

I was thinking about these things and I would close #2211, #2314 and #2258 in favour of this issue which I would aim to solve all the issues raised in one consistent design.

So my proposal would be the following:

  1. in filter and dropmissing functions if they take DataFrame or SubDataFrame with index Index (not just any AbstractDataFrame, and not for filter! and dropmissing!) add view::Bool=flase keyword argument. Of course view also allows to create it
  2. if view=true then SubDataFrame is created with Index
  3. We would add support for ! functions for such SubDataFrame (now it is an error) (here is a big task to list all that we feel it makes sense to support) and several more special cases:
    • setproperty!, setindex!, select!, transform!, sort!, filter!, dropmissing! (where the schema of the parent is allowed to be updated by the operation with the rules that rows excluded by view are left untouched and if new columns are created they are filled with missing); if such functions return a source data frame then SubDataFrame does not get unwrapped and calling parent to require explicit unwrapping is required (this is because otherwise it is problematic in my opinion to ensure consistent and intuitive for entry-level users behaviour between transform and transform!) - this is what I would call "sticky behaviour"
    • in particular it is allowed to do groupby on SubDataFrame with Index and it will have the same rules for select! and transform! as described above

So the only thing that does not go as you want is "sticky behavior", which I think we should have because of two reasons:

  • you can then chain operations and you do not have to filter rows again
  • we get a consistent behavior between ! and no-! functions; which I believe will be easier to understand for users

The decision to be made is if we want automatic column promotion for such SubDataFrame, but I think we do not want it and an error should be thrown in cases where promotion would be needed (this is what we currently do in other places).

A side benefit is that it will be much easier to implement it than doing all new stuff for WhereDataFrame and users will not have to learn one more type, which we already have many.

Now how it addresses the issues I mention:

What do you think?

@nalimilan
Copy link
Member

Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values. Recommending that pattern as the standard way to skip missing values could even be dangerous, as people would be likely to do transform(dropmissing(df, view=true), :a => ... => :a_new) when they mean transform(dropmissing(df, :a, view=true), :a => ... => :a_new).

@pdeffebach
Copy link
Contributor

Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values.

I agree with @nalimilan. I think I have been confusing two distinct issues in my comments, which I apologize for.

  1. Operations like "create a variable that is the average income tax payed by people, but only for women". For this you would use a view with all of the operations you described above.
  2. Operations on data with many missing values, where you don't want to use passmissing and skipmissing everywhere. Actually doing a view here would require a lot of repeating which variables are the problem. For this you would use the skipmissing = true keyword argument, potentially.

@bkamins
Copy link
Member Author

bkamins commented Aug 8, 2020

Ah - OK. So for #2258 and #2314 it would be a "partial fix" (via a more general mechanism, which has it as a special case, but I agree we can think of more convenient patterns).

So do you think what I propose is OK for resolving #2211 (and having a partial solution for the other issues does not hurt, but of course let us discuss more convenient options)?

@matthieugomez
Copy link
Contributor

matthieugomez commented Aug 9, 2020

If the goal is to be able to make it easier to replace certain rows of a column if a condition is satisfied, I am not sure it is worth it.

That is because I do not think that the pattern

parent(transform!(filter(:x => >(1), df, view = true), :x => 1)))

is simpler than

transform!(df, :x => x-> ifelse.(x .>= 1, 1, x))

@bkamins
Copy link
Member Author

bkamins commented Aug 9, 2020

Yes, but all that was asked for in #2211 can be handled by ifelse and this is the currently (i.e. before adding views) recommended pattern.

@pdeffebach - given the discussion we have what use cases of #2211 do you see?

(even if we do not add #2211 functionality, still we can add view kwarg to filter and dropmissing as this is useful in its own rights, as it is a memory efficient way to filter a data frame)

@pdeffebach
Copy link
Contributor

@pdeffebach - given the discussion we have what use cases of #2211 do you see?

I agree with you both. i don't think this is that important.

I do like the way stata reads gen x if y > z. But that convenience can be handled by DataFramesMeta potentially.

Therefore I agree that we should continue thinking about better ways to make skipmissing work easily inside transform.

@bkamins
Copy link
Member Author

bkamins commented Aug 9, 2020

OK - so do we close #2211, or keep it open to get back to it in the future? (for me it is easier not to have #2211 as it will complicate codebase because it requires special handling in many functions, but if you feel we can go back to it some day then let us keep it open)

we should continue thinking about better ways to make skipmissing work easily inside transform.

Agreed - and thank you for spending your time on this.

@pdeffebach
Copy link
Contributor

Let's close it if it's easiest. I think DataFramesMeta can handle a lot of it's points.

@bkamins
Copy link
Member Author

bkamins commented Aug 9, 2020

Now thinking of it I realized that the issue also is that ifelse is limited in what you can express with it, and filter can take a quite complex predicate. But maybe this does not justify having it.

Also ifelse is eager, so it evaluates left and right hand side therefore the following fails:

julia> df = DataFrame(x = -3:3)
7×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ -3    │
│ 2   │ -2    │
│ 3   │ -1    │
│ 4   │ 0     │
│ 5   │ 1     │
│ 6   │ 2     │
│ 7   │ 3     │

julia> transform(df, :x => x-> ifelse.(x .> 0, log.(x), missing))
ERROR: DomainError with -3.0:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).

@pdeffebach
Copy link
Contributor

Yes that is one problem. I've had that before and been frustrated by it.

We really need a lazy version of this in Base to be honest...

Another issue is that that "generate a standardized income index, among women, for all women" is tough.

ifelse.(woman == 1, standardized_index(income, wealth), missing)

The income and wealth aren't subsetted by gender above, whereas a View of the data frame would fix that problem.

Then again, I also am not sure what

gen income_index = ... if woman == 1

does in stata. I don't know how the variables are subsetted.

@matthieugomez
Copy link
Contributor

Right, I did not think about that . The ifelse pattern can only be used in transform for functions that are elementwise, not functions that compute reductions.

@pdeffebach
Copy link
Contributor

But ultimately I agree that this kind of ifelse pattern could be handled by a macro or some sort of struct with lazy broadcasting. It's a tool that has general convenience beyond dataframes and thus doesn't have to live here.

@bkamins
Copy link
Member Author

bkamins commented Aug 10, 2020

Yes, but it should not be ifelse but something else. ifelse is eager on purpose, as it makes it fast, because then it avoids branching in generated native code.

@bkamins
Copy link
Member Author

bkamins commented Aug 22, 2020

Given the decision in #2314 do we need the functionality described in #2211 or not?

Or maybe it is enough if we have this functionality for setindex!, insertcols! and broadcasted assignment (this would be relatively easy) but keep disallowing it in transform! and select! (this will be more complex both for SubDataFrame and GroupedDataFrame{SubDataFrame} if SubDataFrame is based on Index)?

(i.e. do we need mutating views - note that I recently noticed that we already had this for rename! but it was not intentional - now it is documented)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

4 participants