Improve workflows with filtered DataFrame #2354

bkamins · 2020-08-05T18:34:59Z

This has been discussed in several places. I am creating a separate issue to keep track of it, as I think it is important functionality. What we want is for filter(predicate, df, view=true) to return a view (so that we can then conveniently update this view, for example).


bkamins · 2020-08-06T07:06:37Z

Now given #2211 (comment) and #2211 (comment)

my question is whether we should also add a dropmissing keyword argument that would:

- if nothing, do what we have now
- if true, exclude missings from the selection
- if false, include missings in the selection

This is essentially the same as adding a coalesce wrapper to the filtering function (so maybe we feel that it is not needed), but maybe you will judge that:

filter(:x => >(1), df, dropmissing=true)

reads better than:

filter(:x => x -> coalesce(x > 1, false), df)

we do not save typing here, but maybe you will find it more readable?
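As a minimal sketch of what the coalesce wrapper does, here is the same predicate applied to a plain vector using only Base Julia. The data and the name keep_gt1 are made up for illustration; the dropmissing keyword discussed above is only a proposal and is not used here.

```julia
# Plain vector standing in for column :x (illustrative data, not from the issue).
x = [0, missing, 2, 3]

# `v > 1` evaluates to `missing` for a missing entry; coalesce turns that
# into `false`, so the predicate always returns a Bool.
keep_gt1 = v -> coalesce(v > 1, false)

# Keeps only the entries that are known to be greater than 1.
kept = filter(keep_gt1, x)
```

Without the coalesce wrapper the predicate would return missing for the second entry, and filter would throw because missing cannot be used in a boolean context.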

CC @nalimilan @matthieugomez @pdeffebach

matthieugomez · 2020-08-06T14:24:23Z

I think I would prefer a kwarg skipmissing, where true corresponds to dropmissing = true, and false corresponds to dropmissing = nothing. I am not sure it is worth introducing a new kwarg dropmissing just to get the behavior dropmissing = false.

More generally, I think that the decision (and also the default of the kwarg option) should be taken in conjunction with the discussion for transform/select etc. in #2314

bkamins · 2020-08-06T16:32:36Z

A good point. But I would not alter the defaults (sorry for this position here but I do want to avoid breaking changes of this kind till 1.0).

bkamins · 2020-08-06T16:38:41Z

The only issue is that skipmissing in select etc. will have a bit different behavior than skipmissing here.

matthieugomez · 2020-08-06T16:45:43Z

Yes. Personally, I prefer an argument that always has the same name (skipmissing), even if it has slightly different meanings in different contexts, rather than a different keyword argument every time. I am not a fan of how R has na.rm, na.exclude, na.omit etc.

bkamins · 2020-08-06T16:47:17Z

(ok - and I would appreciate a comment in #2314 what you think should be the results on the cases I have listed there 😄)

bkamins · 2020-08-08T09:07:23Z

I was thinking about these things and I would close #2211, #2314 and #2258 in favour of this issue, in which I would aim to solve all the issues raised with one consistent design.

So my proposal would be the following:

- In the filter and dropmissing functions, if they take a DataFrame or a SubDataFrame with index Index (not just any AbstractDataFrame, and not for filter! and dropmissing!), add a view::Bool=false keyword argument. Of course the view function also allows creating such a SubDataFrame.
- If view=true, then a SubDataFrame is created with Index.
- We would add support for ! functions for such a SubDataFrame (now it is an error) (the big task here is to list everything that we feel it makes sense to support), plus several more special cases:
  - setproperty!, setindex!, select!, transform!, sort!, filter!, dropmissing! (where the schema of the parent is allowed to be updated by the operation, with the rules that rows excluded by the view are left untouched and that new columns, if created, are filled with missing); if such functions return a source data frame then the SubDataFrame does not get unwrapped, and explicit unwrapping by calling parent is required (this is because otherwise it is problematic, in my opinion, to ensure behaviour between transform and transform! that is consistent and intuitive for entry-level users). This is what I would call "sticky behaviour".
  - In particular, it is allowed to do groupby on a SubDataFrame with Index, and it will have the same rules for select! and transform! as described above.

So the only thing that does not go as you want is "sticky behavior", which I think we should have for two reasons:

- you can then chain operations and you do not have to filter rows again
- we get consistent behavior between ! and no-! functions, which I believe will be easier for users to understand

The decision to be made is if we want automatic column promotion for such SubDataFrame, but I think we do not want it and an error should be thrown in cases where promotion would be needed (this is what we currently do in other places).

A side benefit is that it will be much easier to implement than doing all the new stuff for WhereDataFrame, and users will not have to learn one more type, of which we already have many.

Now, how it addresses the issues I mention:

- Creating new columns on a view should fill in missings everywhere else. #2211: it will just be what is requested there
- Skipping missing values more easily #2314: using dropmissing with view=true will just do this, and it would be sticky as requested by @nalimilan
- Add a skipmissing kwarg to select/transform/combine #2258: it will just be what is requested there, except that unwrapping will require calling parent after the operation, and by default we are sticky (I think it is not a great problem)

What do you think?

nalimilan · 2020-08-08T20:38:05Z

Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values. Recommending that pattern as the standard way to skip missing values could even be dangerous, as people would be likely to do transform(dropmissing(df, view=true), :a => ... => :a_new) when they mean transform(dropmissing(df, :a, view=true), :a => ... => :a_new).

pdeffebach · 2020-08-08T22:06:12Z

> Makes sense. But I'm not sure it would really fix #2258 and #2314, as repeating dropmissing(df, view=true) everywhere is still quite verbose, and you'd have to repeat the columns for which you want to drop missing values.

I agree with @nalimilan. I think I have been confusing two distinct issues in my comments, which I apologize for.

- Operations like "create a variable that is the average income tax paid by people, but only for women". For this you would use a view with all of the operations you described above.
- Operations on data with many missing values, where you don't want to use passmissing and skipmissing everywhere. Actually doing a view here would require a lot of repetition of which variables are the problem. For this you would use the skipmissing = true keyword argument, potentially.

bkamins · 2020-08-08T22:17:35Z

Ah - OK. So for #2258 and #2314 it would be a "partial fix" (via a more general mechanism, which has it as a special case, but I agree we can think of more convenient patterns).

So do you think what I propose is OK for resolving #2211 (and having a partial solution for the other issues does not hurt, but of course let us discuss more convenient options)?

matthieugomez · 2020-08-09T16:25:12Z

If the goal is to make it easier to replace certain rows of a column when a condition is satisfied, I am not sure it is worth it.

That is because I do not think that the pattern

parent(transform!(filter(:x => >(1), df, view = true), :x => 1))

is simpler than

transform!(df, :x => x -> ifelse.(x .> 1, 1, x))
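To make the comparison concrete, here is a sketch of the two patterns on a plain Base vector instead of a data frame, so that it does not depend on the proposed view=true keyword. The variable names and data are illustrative assumptions, not code from the thread.

```julia
x = [0, 1, 2, 3]

# Pattern 1: elementwise replacement with ifelse, allocating a new vector.
replaced = ifelse.(x .> 1, 1, x)

# Pattern 2: select the rows once, then assign through the view.
# The parent vector `y` is updated in place; unselected rows are untouched.
y = copy(x)
selected = view(y, y .> 1)
selected .= 1
```

Both patterns give the same values here. The view-based one keeps the row selection around as an object, which is the property the "sticky" proposal builds on.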

bkamins · 2020-08-09T17:28:42Z

Yes, but all that was asked for in #2211 can be handled by ifelse and this is the currently (i.e. before adding views) recommended pattern.

@pdeffebach - given the discussion we have what use cases of #2211 do you see?

(even if we do not add the #2211 functionality, we can still add the view kwarg to filter and dropmissing, as this is useful in its own right: it is a memory-efficient way to filter a data frame)

pdeffebach · 2020-08-09T17:38:01Z

> @pdeffebach - given the discussion we have what use cases of #2211 do you see?

I agree with you both. I don't think this is that important.

I do like the way Stata reads gen x if y > z. But that convenience can be handled by DataFramesMeta potentially.

Therefore I agree that we should continue thinking about better ways to make skipmissing work easily inside transform.

bkamins · 2020-08-09T17:43:25Z

OK - so do we close #2211, or keep it open to get back to it in the future? (for me it is easier not to have #2211 as it will complicate codebase because it requires special handling in many functions, but if you feel we can go back to it some day then let us keep it open)

> we should continue thinking about better ways to make skipmissing work easily inside transform.

Agreed - and thank you for spending your time on this.

pdeffebach · 2020-08-09T17:50:16Z

Let's close it if it's easiest. I think DataFramesMeta can handle a lot of its points.

bkamins · 2020-08-09T17:55:50Z

Now thinking of it I realized that the issue also is that ifelse is limited in what you can express with it, and filter can take a quite complex predicate. But maybe this does not justify having it.

Also, ifelse is eager, so it evaluates both the left- and right-hand sides; therefore the following fails:

julia> df = DataFrame(x = -3:3)
7×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ -3    │
│ 2   │ -2    │
│ 3   │ -1    │
│ 4   │ 0     │
│ 5   │ 1     │
│ 6   │ 2     │
│ 7   │ 3     │

julia> transform(df, :x => x-> ifelse.(x .> 0, log.(x), missing))
ERROR: DomainError with -3.0:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).
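For contrast, a minimal sketch of a lazy elementwise alternative in Base Julia: the ternary operator evaluates only the branch it takes. This works on a plain range rather than a data frame column, and safe_log is an illustrative name.

```julia
xs = -3:3

# The ternary operator only evaluates the branch that is selected,
# so log is never called on a non-positive value.
safe_log = map(x -> x > 0 ? log(x) : missing, xs)
```

This avoids the DomainError at the cost of branching per element, which is the trade-off that ifelse is designed to avoid.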

pdeffebach · 2020-08-09T18:05:12Z

Yes that is one problem. I've had that before and been frustrated by it.

We really need a lazy version of this in Base to be honest...

Another issue is that "generate a standardized income index, among women, for all women" is tough.

ifelse.(woman .== 1, standardized_index(income, wealth), missing)

The income and wealth aren't subsetted by gender above, whereas a View of the data frame would fix that problem.

Then again, I also am not sure what

gen income_index = ... if woman == 1

does in Stata. I don't know how the variables are subsetted.

matthieugomez · 2020-08-09T18:45:57Z

Right, I did not think about that. The ifelse pattern can only be used in transform for functions that are elementwise, not functions that compute reductions.
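A small sketch of why reductions behave differently, using Base Julia plus the Statistics standard library. The data and the standardize helper are hypothetical stand-ins for the standardized_index function mentioned above.

```julia
using Statistics

income = [10.0, 20.0, 30.0, 40.0]
woman  = [1, 0, 1, 0]

# Hypothetical helper: z-score of v relative to its own mean and std.
standardize(v) = (v .- mean(v)) ./ std(v)

# ifelse pattern: the reduction (mean, std) sees all rows,
# and rows for non-women are only masked afterwards.
masked = ifelse.(woman .== 1, standardize(income), missing)

# view pattern: the reduction sees only the selected rows;
# the remaining rows stay missing.
rows = woman .== 1
index = Vector{Union{Missing, Float64}}(missing, length(income))
index[rows] = standardize(view(income, rows))
```

The two results differ for the selected rows because masked is standardized against everyone, while index is standardized among women only.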

pdeffebach · 2020-08-10T17:57:15Z

But ultimately I agree that this kind of ifelse pattern could be handled by a macro or some sort of struct with lazy broadcasting. It's a tool that has general convenience beyond dataframes and thus doesn't have to live here.

bkamins · 2020-08-10T18:13:43Z

Yes, but it should not be ifelse but something else. ifelse is eager on purpose, as it makes it fast, because then it avoids branching in generated native code.

bkamins · 2020-08-22T22:32:18Z

Given the decision in #2314 do we need the functionality described in #2211 or not?

Or maybe it is enough if we have this functionality for setindex!, insertcols! and broadcasted assignment (this would be relatively easy) but keep disallowing it in transform! and select! (this will be more complex both for SubDataFrame and GroupedDataFrame{SubDataFrame} if SubDataFrame is based on Index)?

(i.e. do we need mutating views - note that I recently noticed that we already had this for rename! but it was not intentional - now it is documented)

bkamins added this to the 1.0 milestone Aug 5, 2020

bkamins added the feature and non-breaking (the proposed change is not breaking) labels Aug 5, 2020

bkamins changed the title ~~allow filter to return a view~~ Improve workflows with filtered DataFrame Aug 8, 2020

matthieugomez mentioned this issue Oct 3, 2020

add WhereDataFrame #2467

Closed

bkamins mentioned this issue Mar 4, 2021

Release 1.0 tracking #2640

Closed

19 tasks

pdeffebach mentioned this issue Mar 7, 2021

@replace for in-place conditional updating JuliaData/DataFramesMeta.jl#144

Open

bkamins modified the milestones: 1.0, 1.x Mar 25, 2021
