replace/filter interfaces for DataFrames #43

Closed
HarlanH opened this issue Jul 26, 2012 · 13 comments

@HarlanH
Contributor

HarlanH commented Jul 26, 2012

The existing nafilter/Filter and similar methods, and their flags in the DataVecs, seem limited when it comes to working with DataFrames. I like the way that different columns have different behaviors (replace/filter modes), but it's not clear how to combine them. For example, if building a model matrix for an OLS model, do you do a complete_cases operation? The naFilter iterator generator doesn't really work usefully in that context.

One option would be to have a filter_nas() method that generates a SubDataFrame without any rows that contained an NA in a column with filtering mode set. The result could then be iterated over row-wise, with NAs being replaced in any columns in replace mode. Other options and variations are certainly possible.
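The proposal above can be sketched in plain Julia. This is only an illustration under stated assumptions: `missing` stands in for NA, a NamedTuple of columns stands in for a DataFrame, and the `mode`/`value` fields plus the names `keep_row`, `filter_nas`, and `row` are hypothetical, not part of any DataFrames API.

```julia
# Hypothetical stand-in for a DataFrame whose columns carry an NA-handling mode.
# `missing` plays the role of NA; `mode` and `value` are illustrative fields.
cols = (
    age    = (data = [30, missing, 41, 25], mode = :filter),
    income = (data = [missing, 52.0, 48.0, 61.0], mode = :replace, value = 0.0),
)

# A row is kept unless some column in filter mode has an NA in that row.
keep_row(cols, i) = !any(c -> c.mode == :filter && ismissing(c.data[i]), cols)

# Return the indices of the surviving rows (what a SubDataFrame would point at).
filter_nas(cols) = [i for i in 1:length(first(cols).data) if keep_row(cols, i)]

# Build one row, substituting the replacement value in replace-mode columns.
function row(cols, i)
    return map(cols) do c
        v = c.data[i]
        (c.mode == :replace && ismissing(v)) ? c.value : v
    end
end

kept = filter_nas(cols)               # row 2 is dropped: `age` is NA there
rows = [row(cols, i) for i in kept]   # row-wise iteration with replacement
```

Returning row indices rather than copied data mirrors the SubDataFrame idea: the filtering step stays cheap, and replacement happens only as each row is read.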

See also #4.

@tshort
Contributor

tshort commented Jul 26, 2012

I like the SubDataFrame idea. We could use nafilter and nareplace for that. Or, maybe those should generate a new df, and nafilter_sub and nareplace_sub could return SubDataFrames. complete_cases(df) could return the row index of complete cases. Maybe that's all we need. Then, the user could do sub(df, complete_cases(df)) or df[complete_cases(df),:].
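A minimal sketch of the `complete_cases` idea, assuming `missing` in place of NA and a NamedTuple of column vectors in place of a DataFrame. The helper `take_rows` is a hypothetical name standing in for `df[rows, :]`; none of this is the package's actual implementation.

```julia
# Hypothetical stand-in for a DataFrame: a NamedTuple of equal-length columns.
cols = (x = [1, missing, 3, 4], y = [1.0, 2.0, missing, 4.0])

# Return the indices of rows that have no missing value in any column.
function complete_cases(cols)
    n = length(first(cols))
    return [i for i in 1:n if !any(c -> ismissing(c[i]), cols)]
end

# Select the given rows from every column, producing a copy (like df[rows, :]).
take_rows(cols, rows) = map(c -> c[rows], cols)

rows = complete_cases(cols)
complete = take_rows(cols, rows)
```

Because `complete_cases` returns only indices, the caller decides whether to copy the rows or to build a view over them.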

@ViralBShah
Contributor

I often find myself wanting filter for DataFrames. I basically want to create a new DataFrame (but it could even be a SubDataFrame) by filtering rows of an existing DataFrame. Currently, I roll out my own code, but a filter interface would be really handy.

@johnmyleswhite
Contributor

This is basically what subset does. We could rename it to filter.

-- John

On Jun 28, 2013, at 9:41 AM, "Viral B. Shah" wrote:

> I often find myself wanting filter for DataFrames. I basically want to create a new DataFrame (but it could even be a SubDataFrame) by filtering rows of an existing DataFrame. Currently, I roll out my own code, but a filter interface would be really handy.

@ViralBShah
Contributor

I wonder how I missed subset. I now see that it is not in the function reference. When I come across such things, should I just go ahead and add to the existing function reference documentation? I am hoping that we will be able to convert it to the helpdb format and have it accessible with help soon.

It would be nice to rename it to filter, with the slight caveat that the behaviour is different from matrices. However, it still seems like it is the right name for this operation.

@johnmyleswhite
Contributor

We need to have a big conversation about documentation formats next week.

@tshort
Contributor

tshort commented Jun 28, 2013

I like the name subset better than filter.

Also, just plain row indexing gives you a copy of a subset of a DataFrame.

On Fri, Jun 28, 2013 at 2:46 PM, John Myles White wrote:

> We need to have a big conversation about documentation formats next week.

@ViralBShah
Contributor

Now that I know about subset, I am ok with that. Perhaps just mark this as a doc issue?

@johnmyleswhite
Contributor

We could fix the docs.

I do kind of like only having filter: one of the things I like Julia is the possibility that multiple dispatch can shrink the language's vocabulary to a very small number of basic abstractions that apply in all domains.

@StefanKarpinski
Member

one of the things I like Julia is the possibility that multiple dispatch can shrink the language's vocabulary to a very small number of basic abstractions that apply in all domains.

THIS. We should work that into our manual / philosophy somewhere.

@johnmyleswhite
Contributor

Hopefully we can fix the typo before we do.

@StefanKarpinski
Member

I can't even spot the typo after reading it multiple times...

@johnmyleswhite
Contributor

"one of the things I like Julia" -> "one of the things I like ABOUT Julia"

@sbromberger

I think this is related, but I can't figure out how to use filter (or sub) on a DataFrame/DataArray:

filter on a regular DataArray fails (though I don't know why the existing method doesn't apply: filter(f::Function,As::AbstractArray{T,N}) at array.jl:1209):

julia> df[:net]
5-element DataArray{IPv4net,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"4.5.6.7",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"4.5.6.7",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

julia> filter(x->IPnetwork.contains(x,a), df[:net])
ERROR: type: typeassert: expected AbstractArray{Bool,N}, got DataArray{Any,1}
 in filter at array.jl:1209

Changing the DataArray to an Array works:

julia> z = [x for x in df[:net]]
5-element Array{Any,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"4.5.6.7",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"4.5.6.7",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

julia> filter(x->IPnetwork.contains(x,a), z)
3-element Array{Any,1}:
 IPv4net(ip"1.2.3.0",ip"255.255.255.0")
 IPv4net(ip"1.2.3.0",ip"255.255.0.0")
 IPv4net(ip"1.2.3.0",ip"255.0.0.0")

sub apparently doesn't accept functions:

julia> sub(x->IPnetwork.contains(x,a), df[:net])
ERROR: `sub` has no method matching sub(::Function, ::DataArray{IPv4net,1})

My recommendation would be to add a method to filter that accepts DataArrays.
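As a rough sketch of what such an addition to `filter` might look like, the following uses a hypothetical `NAVector` wrapper (values plus an NA mask) in place of the real DataArray type. It is written in current Julia syntax and is an assumption about one possible design, not the package's code.

```julia
# Hypothetical wrapper standing in for a DataArray: values plus an NA mask.
struct NAVector{T}
    data::Vector{T}
    na::BitVector
end

# Extend Base.filter for the wrapper. NA entries are dropped, and the
# predicate is only ever called on non-NA values, so it always returns a Bool.
function Base.filter(f::Function, v::NAVector{T}) where {T}
    keep = [!v.na[i] && f(v.data[i]) for i in eachindex(v.data)]
    return NAVector{T}(v.data[keep], falses(count(keep)))
end

# The second entry is marked NA, so its stored value is ignored.
v = NAVector([10, 25, 7, 40], BitVector([false, true, false, false]))
big = filter(x -> x > 8, v)
```

Dropping NA entries is only one choice; a stricter variant could raise an error when it meets an NA, which is closer to the failure shown in the transcript above.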
