replace/filter interfaces for DataFrames #43

HarlanH · 2012-07-26T22:41:29Z

The existing nafilter/Filter and similar methods, and their flags in the DataVecs, seem limited when it comes to working with DataFrames. I like the way that different columns have different behaviors (replace/filter modes), but it's not clear how to combine them. For example, if building a model matrix for an OLS model, do you do a complete_cases operation? The naFilter iterator generator doesn't really work usefully in that context.

One option would be to have a filter_nas() method that generates a SubDataFrame without any rows that contained an NA in a column with filtering mode set. The result could then be iterated over row-wise, with NAs being replaced in any columns in replace mode. Other options and variations are certainly possible.

See also #4.

tshort · 2012-07-26T23:46:40Z

I like the SubDataFrame idea. We could use nafilter and nareplace for that. Or, maybe those should generate a new df, and nafilter_sub and nareplace_sub could return SubDataFrames. complete_cases(df) could return the row index of complete cases. Maybe that's all we need. Then, the user could do sub(df, complete_cases(df)) or df[complete_cases(df),:].
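To make the complete_cases(df) idea concrete, here is a minimal sketch in plain Julia. It is an illustration under stated assumptions, not the package's API: the "data frame" is a hypothetical NamedTuple of equal-length columns, `missing` stands in for NA, and `complete_cases` is written from scratch.

```julia
# Hypothetical stand-in for a DataFrame: a NamedTuple of equal-length columns.
# `missing` plays the role of NA; none of this is the DataFrames API.
cols = (x = [1.0, missing, 3.0, 4.0],
        y = ["a", "b", missing, "d"])

# Return the indices of the rows that have no missing value in any column.
function complete_cases(cols)
    n = length(first(cols))
    return [i for i in 1:n if !any(col -> ismissing(col[i]), values(cols))]
end

idx = complete_cases(cols)

# Indexing each column with the row index copies the selected rows,
# analogous to df[complete_cases(df), :].
complete = map(col -> col[idx], cols)
```

Returning a row index rather than a new object leaves the caller free to choose between a copy and a view, which is the flexibility the sub(df, complete_cases(df)) versus df[complete_cases(df),:] distinction relies on.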

ViralBShah · 2013-06-28T13:41:55Z

I often find myself wanting filter for DataFrames. I basically want to create a new DataFrame (but it could even be a SubDataFrame) by filtering rows of an existing DataFrame. Currently, I roll out my own code, but a filter interface would be really handy.

johnmyleswhite · 2013-06-28T13:51:33Z

This is basically what subset does. We could rename it to filter.

ViralBShah · 2013-06-28T16:46:39Z

I wonder how I missed subset. I now see that it is not in the function reference. When I come across such things, should I just go ahead and add to the existing function reference documentation? I am hoping that we will be able to convert it to the helpdb format and have it accessible with help soon.

It would be nice to rename it to filter, with the slight caveat that the behaviour is different from matrices. However, it still seems like it is the right name for this operation.

johnmyleswhite · 2013-06-28T18:46:56Z

We need to have a big conversation about documentation formats next week.

tshort · 2013-06-28T20:30:57Z

I like the name subset better than filter.

Also, just plain row indexing gives you a copy of a subset of a DataFrame.

ViralBShah · 2013-06-29T04:10:50Z

Now that I know about subset, I am ok with that. Perhaps just mark this as a doc issue?

johnmyleswhite · 2013-06-29T11:40:45Z

We could fix the docs.

I do kind of like only having filter: one of the things I like Julia is the possibility that multiple dispatch can shrink the language's vocabulary to a very small number of basic abstractions that apply in all domains.

StefanKarpinski · 2013-06-29T15:16:27Z

one of the things I like Julia is the possibility that multiple dispatch can shrink the language's vocabulary to a very small number of basic abstractions that apply in all domains.

THIS. We should work that into our manual / philosophy somewhere.

johnmyleswhite · 2013-06-29T15:19:04Z

Hopefully we can fix the typo before we do.

StefanKarpinski · 2013-06-29T15:22:39Z

I can't even spot the typo after reading it multiple times...

johnmyleswhite · 2013-06-29T15:23:16Z

"one of the things I like Julia" -> "one of the things I like ABOUT Julia"

sbromberger · 2014-12-13T22:37:34Z

I think this is related, but I can't figure out how to use filter (or sub) on a DataFrame/DataArray:

filter on a regular DataArray fails (though I don't know why this method doesn't apply: filter(f::Function,As::AbstractArray{T,N}) at array.jl:1209)

julia> df[:net]
5-element DataArray{IPv4net,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"4.5.6.7",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"4.5.6.7",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

julia> filter(x->IPnetwork.contains(x,a), df[:net])
ERROR: type: typeassert: expected AbstractArray{Bool,N}, got DataArray{Any,1}
 in filter at array.jl:1209

Changing the DataArray to an Array works:

julia> z = [x for x in df[:net]]
5-element Array{Any,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"4.5.6.7",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"4.5.6.7",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

julia> filter(x->IPnetwork.contains(x,a), z)
3-element Array{Any,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

sub apparently doesn't accept functions:

julia> sub(x->IPnetwork.contains(x,a), df[:net])
ERROR: `sub` has no method matching sub(::Function, ::DataArray{IPv4net,1})

My recommendation would be to add a filter method that accepts DataArrays.

sbromberger mentioned this issue Dec 13, 2014

Inconsistency between docs and observed DataArray behavior with respect to Array. #741

Closed

nalimilan mentioned this issue Dec 23, 2017

Add filter() and filter!() methods #1330

Merged

ararslan closed this as completed in #1330 Dec 26, 2017
