Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scale.xdiscrete levels #1130

Open
Mattriks opened this issue Apr 5, 2018 · 18 comments
Open

Scale.xdiscrete levels #1130

Mattriks opened this issue Apr 5, 2018 · 18 comments

Comments

@Mattriks
Copy link
Member

Mattriks commented Apr 5, 2018

According to the docs, the levels argument of Scale.x-discrete behaves such that "Order will be respected and anything in the data that's not represented in levels will be set to NA".

  • This is what I get on Gadfly 0.6.5:
mpg = dataset("ggplot2","mpg")
ux = sort(unique(mpg[:Class]))
pa = plot(mpg, x=:Class, y=:Hwy, Geom.boxplot, color=:Class, Scale.x_discrete(levels=ux[1:4]))
pb = plot(mpg, x=:Class, y=:Hwy, Geom.boxplot, Scale.x_discrete(levels=ux[1:4]))
vstack(pa,pb)

boxps2
so it works as the docs suggest, but with obvious issues in the top plot, and maybe users would prefer for the NA column not to appear (or to appear only if NA is added to the levels vector)?

  • And on master (for both plots):
ArgumentError: values not in levels encountered
Stacktrace:
[1]discretize_make_ia(::CategoricalArrays.CategoricalArray{String,1,UInt8,String,CategoricalArrays.CategoricalString{UInt8},Union{}}, ::Array{String,1}) 
@andreasnoack
Copy link
Contributor

Thanks for capturing this. We actually discussed this behavior that results in the error above and we weren't sure what the right solution was. The handling of missing values is often not easy to spot in the code and I wouldn't be surprised if there were other issues with missing values. It probably isn't too hard to get the old behavior back (if we want it) but, as mentioned previously, it would probably be great if the handling of missing values was more centralized if possible.

@Mattriks
Copy link
Member Author

Mattriks commented Apr 5, 2018

With a dataset containing 7 categories, if I use levels to specify 4 categories e.g. Scale.x_discrete(levels=ux[1:4]) (as above) then the minimum behaviour I would expect is a plot with 4 levels, which is currently not what happens on master.

The NA label is added (by Gadfly) to the 3 levels I didn't specify (its not part of the raw data), which then becomes the first boxplot in the plots above. I'm also undecided whether that should be part of the default behaviour. But as mentioned, the minimum behaviour should be a plot with the 4 levels I specified.

@bjarthur
Copy link
Member

bjarthur commented Apr 5, 2018

perhaps related, and sorry to hijack this issue if not, but the Geom.segment example is throwing the following error on master now:

ERROR: MethodError: no method matching atan2(::Array{Union{Float64, Missings.Missing},1}, ::Array{Union{Float64, Missings.Missing},1})

@andreasnoack
Copy link
Contributor

Let's file a separate issue for the atan2+Missings issue. There might be a few more of these showing up as we battle test the new version.

@nalimilan
Copy link
Contributor

So what would be the best behavior? Show a "missing" category or hide it? I'm not familiar with Gadfly, but it looks weird to group all values which do not appear in the specified levels in a "missing" category, while missing values are not represented by default.

@Mattriks
Copy link
Member Author

Mattriks commented Apr 7, 2018

I think that showing the "Other" boxplot (i.e. the categories I didn't specify in levels), should be an optional behaviour. There are cases where having an "Other" boxplot could be useful.

But for now, I would like to see the minimum behaviour (as described above) work on master.

@Mattriks
Copy link
Member Author

Another issue I encountered on Gadfly master is a 255 "group" limit, I don't recall hitting a limit with PooledDataArray. This is potentially an issue for polygon-based maps:

using DataFrames, VoronoiDelaunay

srand(123)
tess = DelaunayTessellation()
width = max_coord - min_coord
a= Point.(min_coord+rand(130)*width, min_coord+rand(130)*width) 
push!(tess, a)

# In the next line, change 256 to 257 and the plots fail
triangles = [geta.(tess._trigs) getb.(tess._trigs) getc.(tess._trigs)][2:256,:]
n = size(triangles, 1)
D = DataFrame(x=vec(getx.(triangles)'), y=vec(gety.(triangles)'), id=string.(vec([1:n 1:n 1:n]')))

pa = plot(D, x=:x, y=:y, Geom.polygon(fill=true, preserve_order=true), color=:id, Theme(key_position=:none))
 pb = plot(D, x=:x, y=:y, Geom.polygon(preserve_order=true), group=:id) 
hstack(pa,pb)

delaunay

@nalimilan
Copy link
Contributor

@Mattriks Mattriks mentioned this issue Apr 28, 2018
@bjarthur
Copy link
Member

@andreasnoack it'd be nice to tag 0.7 at some point, but this issue needs resolved first. it's not clear to me what the best solution is to the NA problem above, but at least the path is clear for the UInt8 problem. since it's related to your work on IndirectArrays, would you mind taking care of at least the latter? the only suggestion i'd make here is that perhaps UInt16 would suffice. not sure of the performance implications.

@tlnagy
Copy link
Member

tlnagy commented May 3, 2018

Bump. @andreasnoack, it would be great to have this fixed!

If there's anything we can do to help, let us know.

@andreasnoack
Copy link
Contributor

Indeed. I think I know how to fix these issues but I need the time and, unfortunately, other tasks take priority right now.

@bjarthur
Copy link
Member

we should think about whether we want to get these dataframes related regressions fixed before or after we merge any breaking changes. ideally it'd be nice to have a 0.7.x which completely worked. on the otherhand, no sense delaying progress just because of semantic versioning.

@tlnagy
Copy link
Member

tlnagy commented May 30, 2018

Using @Mattriks' code from above, I was able to verify that master fails to plot when using 257 in the code above

ERROR: LoadError: InexactError()
Stacktrace:
 [1] copy!(::IndexLinear, ::Array{UInt8,1}, ::IndexLinear, ::Array{Int64,1}) at ./abstractarray.jl:656
 [2] discretize_make_ia(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/misc.jl:397
 [3] discretize(::Array{String,1}, ::Array{String,1}, ::Void, ::Bool) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:269
 [4] discretize(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:258
 [5] apply_scale(::Gadfly.Scale.DiscreteColorScale, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:477
 [6] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:34
 [7] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Gadfly.Data) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:53
 [8] render_prepare(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:677
 [9] render(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:757
 [10] collect(::Base.Generator{Tuple{Gadfly.Plot,Gadfly.Plot},Gadfly.##110#111}) at ./array.jl:475
 [11] hstack(::Gadfly.Plot, ::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:923
 [12] include_from_node1(::String) at ./loading.jl:576
 [13] include(::String) at ./sysimg.jl:14

and that @nalimilan's solution seems to fix that error.

 discretize_make_ia(values::AbstractVector, levels) =
-    IndirectArray(Array{UInt8}(indexin(values, levels)), levels)
+    IndirectArray(Array{UInt}(indexin(values, levels)), levels)

image

But there's still the issue of the tiling breaking at the bottom right of the plot. Not sure what is going on there.

@bjarthur
Copy link
Member

@andreasnoack julia 0.7 is just around the corner and it'd be nice to have this issue resolved before dropping support for 0.6. will you have time soon? i could try to take a stab, but you are much more familiar with the changes you've made to support DataFrames. thanks.

@andreasnoack
Copy link
Contributor

I'll try to take a look during July but can't promise anything.

@bjarthur
Copy link
Member

bjarthur commented Aug 2, 2018

@andreasnoack just a friendly bump. i'd like to start working on 0.7, but don't want to drop 0.6 support before this issue is resolved.

@bjarthur
Copy link
Member

would be nice to resolve this regression now that gadfly dependencies work on julia 0.7. anyone want to volunteer? :)

@nalimilan
Copy link
Contributor

As I noted above, it shouldn't be hard to fix at least some/most of the issues by using UInt explicitly everywhere IndirectArray is used. It should be trivial for somebody familiar with the Gadfly code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants