Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

set operations for DataArrays #129

Open
WestleyArgentum opened this issue Nov 15, 2014 · 4 comments
Open

set operations for DataArrays #129

WestleyArgentum opened this issue Nov 15, 2014 · 4 comments

Comments

@WestleyArgentum
Copy link

It would be nice to have versions of the standard set operations (union, setdiff, intersect, etc) defined for DataArrays so that dealing with NAs was easier. Right now the functions try to output regular arrays of the same type as the DataArray, and run into problems when they can't convert NA to said type:

julia> setdiff(@data([1,2,3,4]), @data([3,4,5,NA]))
ERROR: `convert` has no method matching convert(::Type{Int64}, ::NAtype)
 in setindex! at dict.jl:545
 in push! at set.jl:18
 in union! at set.jl:23
 in setdiff at array.jl:1358
@garborg
Copy link
Member

garborg commented Nov 15, 2014

To fit with how NA is treated conceptually throughout the package, when either argument has any NAs, setdiff pretty much has to throw an error. (Set(DataVector[..., NA, ...]) is unknown, and while defining Set(DataVector) would be convenient, I think the principle that has guided design here and in DataFrames favors forcing explicit handling of NAs to avoid introducing subtle bugs).

Possible options -- friendlier error message, or a more convenient syntax than setdiff(each_dropna(dv1), each_dropna(dv2))

@nalimilan
Copy link
Member

Probably setdiff(dv1, dv2, skipna=true) would be nicer and more consistent with e.g. mapreduce?

@garborg
Copy link
Member

garborg commented Nov 15, 2014

That's nicer, for sure, and defining the three methods for each function would allow for tuning. On the other hand, it sounds like the quantity of methods that need to be reimplemented has slowed down development of data/nullable/categorical/etc. container types, so maybe slicker iterators would give more mileage:
e.g.

setdiff(v, dv, skipna = true) # =>
setdiff(v, skipna(dv)) # or
setdiff(v, SkipNA(dv))

setdiff(dv1, dv2, skipna = true) # =>
setdiff(skipna(dv1), skipna(dv2)) # or
setdiff(SkipNA(dv1), SkipNA(dv2))

@johnmyleswhite
Copy link
Member

One thing I'd like to do now that Nullable is in Base is to define iterators like skipnulls that live outside of DataArrays (since you might need to apply them to Array{Any}, for example) and lean on those iterators more heavily.

In general, we could do a much better job of making clear when functions can accept arbitrary iterators and when they need a fully materialized input to work.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants