`map` applies function to unused pool levels #36

nalimilan · 2020-06-25T20:24:35Z

map applies the function to all values in the pool, whether they are used or not. This is notably a problem if you replace missing values:

julia> x = PooledArray([missing, 1])
2-element PooledArray{Union{Missing, Int64},UInt8,1,Array{UInt8,1}}:
  missing
 1

julia> x[1] = 0
0

julia> map(Int, x)
ERROR: MethodError: no method matching Int64(::Missing)

This is a difficult problem to fix, as it would require going over all values to check which ones are not used. One possibility would be to wrap the call in try... catch, store a special value in case of error, and check at the end whether that value was used. If so, call the function again to trigger the error.

The text was updated successfully, but these errors were encountered:

quinnj · 2020-06-27T18:05:16Z

A couple of thoughts:

I didn't even know we tried this optimization; one solution is just dropping it in favor of generic indexing + map
It's really only an issue when you have a Union eltype, replace all entries of one of the Union types, and then map; so not terribly likely, but this is probably the most common situation you'd run into this. And in that regard, it'd probably be good to support dropmissing! and maybe replace! here which would excise Missing from the eltype
A slightly less expensive solution than the try...catch proposal would be to put a anychanged::Bool flag in the PooledArray type; on any delete! or setindex! or other destructive operation, we'd set the flag; if the flag is set, we would fallback to map via indexing, only using the optimization if anychanged === false.

nalimilan · 2020-07-03T07:53:24Z

Yes at least we could avoid adding the try... catch to concrete types. But why do you say the try... catch approach would be expensive? If no error happens (most common case) the overhead should be quite small. And even if there's an error, it should be faster than falling back to the generic map.

quinnj · 2020-07-04T22:37:30Z

But won't we run into weird clashes if they have a legit error thrown in their mapping function? Like, it should fail, but now it silently succeeds and they get an empty array or something?

nalimilan · 2020-07-05T09:51:51Z

Well the approach I propose would assume the function is pure. So if it fails that's not an issue as we won't use its result.

msavael · 2020-11-03T14:50:29Z

This tripped me up in a slightly different situation: I had taken a slice of a PooledArray, then I wanted to map its values through a dictionary that (due to the program's logic) would only have values for entries in the slice. It errors because it is trying to look up values that were in the original PooledArray but aren't in the slice.

Perhaps a function to "trim"/ remove unused refs from a PooledArray would be generally useful?

bkamins · 2020-11-03T15:31:27Z

Yes - this function would be needed in many other situations also. Also a function to reorder the pool would be useful. Both are supported in CategoricalArrays.jl and it would be best if we had them exposed in DataAPI.jl so that we can write generic code.

Addresses #36. This PR proposes ways to address the issue in #36, namely that unused pool levels can cause unexpected results when mapping (and possibly other operations). It introduces a new exported `dropunusedlevels!` function that will scan a `PooledArray` and remove any pooled values not being reference in the ref array any more. It also updates the `map` implementation to allow a keyword argument `dropunusedlevels=true` to call this before mapping. We could perhaps consider making that `true` by default to avoid the original issue happening, though it comes with a performance cost of checking for unused levels. This PR also updates the implementation of `map`; it was previously doing quite a bit of extra work: mapping over keys twice for some reason, in addition to checking the resulting mapped values for duplicates. As far as I can tell/understand, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can comment more on that. The only place I can think is maybe in the sorting code which I admit I haven't look at very closely. Anyway, still need to add tests here, but figured I would put up the PR as-is for discussion on the approach.

nalimilan mentioned this issue Jun 27, 2020

Add optimized constructors/convert/copy!/copyto! using DataAPI.refarray JuliaData/CategoricalArrays.jl#281

Merged

quinnj mentioned this issue Nov 21, 2020

Introduce new dropunusedlevels! function and allow using from map #43

Open

bkamins mentioned this issue Apr 1, 2021

view of PooledArray #44

Open

oxinabox mentioned this issue Apr 1, 2021

Purity assumption of map #63

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`map` applies function to unused pool levels #36

`map` applies function to unused pool levels #36

nalimilan commented Jun 25, 2020

quinnj commented Jun 27, 2020

nalimilan commented Jul 3, 2020

quinnj commented Jul 4, 2020

nalimilan commented Jul 5, 2020

msavael commented Nov 3, 2020

bkamins commented Nov 3, 2020

map applies function to unused pool levels #36

map applies function to unused pool levels #36

Comments

nalimilan commented Jun 25, 2020

quinnj commented Jun 27, 2020

nalimilan commented Jul 3, 2020

quinnj commented Jul 4, 2020

nalimilan commented Jul 5, 2020

msavael commented Nov 3, 2020

bkamins commented Nov 3, 2020

`map` applies function to unused pool levels #36

`map` applies function to unused pool levels #36