Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map applies function to unused pool levels #36

Open
nalimilan opened this issue Jun 25, 2020 · 6 comments
Open

map applies function to unused pool levels #36

nalimilan opened this issue Jun 25, 2020 · 6 comments

Comments

@nalimilan
Copy link
Member

map applies the function to all values in the pool, whether they are used or not. This is notably a problem if you replace missing values:

julia> x = PooledArray([missing, 1])
2-element PooledArray{Union{Missing, Int64},UInt8,1,Array{UInt8,1}}:
  missing
 1

julia> x[1] = 0
0

julia> map(Int, x)
ERROR: MethodError: no method matching Int64(::Missing)

This is a difficult problem to fix, as it would require going over all values to check which ones are not used. One possibility would be to wrap the call in try... catch, store a special value in case of error, and check at the end whether that value was used. If so, call the function again to trigger the error.

@quinnj
Copy link
Member

quinnj commented Jun 27, 2020

A couple of thoughts:

  • I didn't even know we tried this optimization; one solution is just dropping it in favor of generic indexing + map
  • It's really only an issue when you have a Union eltype, replace all entries of one of the Union types, and then map; so not terribly likely, but this is probably the most common situation you'd run into this. And in that regard, it'd probably be good to support dropmissing! and maybe replace! here which would excise Missing from the eltype
  • A slightly less expensive solution than the try...catch proposal would be to put a anychanged::Bool flag in the PooledArray type; on any delete! or setindex! or other destructive operation, we'd set the flag; if the flag is set, we would fallback to map via indexing, only using the optimization if anychanged === false.

@nalimilan
Copy link
Member Author

Yes at least we could avoid adding the try... catch to concrete types. But why do you say the try... catch approach would be expensive? If no error happens (most common case) the overhead should be quite small. And even if there's an error, it should be faster than falling back to the generic map.

@quinnj
Copy link
Member

quinnj commented Jul 4, 2020

But won't we run into weird clashes if they have a legit error thrown in their mapping function? Like, it should fail, but now it silently succeeds and they get an empty array or something?

@nalimilan
Copy link
Member Author

Well the approach I propose would assume the function is pure. So if it fails that's not an issue as we won't use its result.

@msavael
Copy link

msavael commented Nov 3, 2020

This tripped me up in a slightly different situation: I had taken a slice of a PooledArray, then I wanted to map its values through a dictionary that (due to the program's logic) would only have values for entries in the slice. It errors because it is trying to look up values that were in the original PooledArray but aren't in the slice.

Perhaps a function to "trim"/ remove unused refs from a PooledArray would be generally useful?

@bkamins
Copy link
Member

bkamins commented Nov 3, 2020

Yes - this function would be needed in many other situations also. Also a function to reorder the pool would be useful. Both are supported in CategoricalArrays.jl and it would be best if we had them exposed in DataAPI.jl so that we can write generic code.

quinnj added a commit that referenced this issue Nov 21, 2020
Addresses #36. This PR proposes ways to address the issue in #36, namely
that unused pool levels can cause unexpected results when mapping (and
possibly other operations). It introduces a new exported
`dropunusedlevels!` function that will scan a `PooledArray` and remove
any pooled values not being reference in the ref array any more. It also
updates the `map` implementation to allow a keyword argument
`dropunusedlevels=true` to call this before mapping. We could perhaps
consider making that `true` by default to avoid the original issue
happening, though it comes with a performance cost of checking for
unused levels.

This PR also updates the implementation of `map`; it was previously
doing quite a bit of extra work: mapping over keys twice for some
reason, in addition to checking the resulting mapped values for
duplicates. As far as I can tell/understand, having duplicates in the
pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting
code which I admit I haven't look at very closely.

Anyway, still need to add tests here, but figured I would put up the PR
as-is for discussion on the approach.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants