
Refactor unstack #1309

Merged · 2 commits merged into JuliaData:master from bkamins:improve_unstack on Dec 15, 2017

Conversation

bkamins
Member

@bkamins bkamins commented Dec 5, 2017

Changes:

  1. allow multiple columns in rowkeys
  2. remove the incorrect and slower implementation for a single-column rowkey

@bkamins
Member Author

bkamins commented Dec 5, 2017

@nalimilan @cjprybol
Just for the sake of completeness of the unstack refactoring, let me mention that we currently do not handle the following case 100% correctly:

julia> y = DataFrame(variable=["x","x"], value=[missing,missing], id=[1,1])
2×3 DataFrames.DataFrame
│ Row │ variable │ value   │ id │
├─────┼──────────┼─────────┼────┤
│ 1   │ x        │ missing │ 1  │
│ 2   │ x        │ missing │ 1  │

julia> unstack(y)
1×2 DataFrames.DataFrame
│ Row │ id │ x       │
├─────┼────┼─────────┤
│ 1   │ 1  │ missing │

There is a duplicate in y that is not reported, because the value is missing (and missing is also used to indicate not-yet-filled data).

The second thing is that unstack makes all columns accept missing even when we are not sure that we have to (which means that a stack-unstack round trip changes the types of the columns).

The solution would be to use a temporary data structure holding binary indicators of whether a position has been filled (so the duplicate-missing issue goes away). Later, those indicators could also be used to decide whether a column has to accept Missing or not.

But I am not sure you would feel it is worth the effort, as it will introduce some memory and computational overhead.
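
For concreteness, a minimal sketch of that indicator idea (hypothetical helper name and layout, not the PR's final code): a Boolean mask kept next to the payload makes "already filled" independent of whether the stored value is missing.

```julia
# Hypothetical sketch only. `filled` records whether a cell was already written,
# so a duplicate is detected even when the stored value is `missing`; at the end
# the mask could also tell which columns never needed to accept Missing.
function fill_cell!(payload::Vector, filled::BitMatrix, i::Int, j::Int, v)
    filled[i, j] && warn("Duplicate entries in unstack.")  # 0.6-era warn()
    payload[j][i] = v
    filled[i, j] = true
end

# usage sketch: filled = falses(Nrow, Ncol); then fill_cell!(payload, filled, i, j, value)
```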

@bkamins
Member Author

bkamins commented Dec 5, 2017

Tests fail because the fix for allowmissing! is not tagged yet, so we have to rerun CI after the dependencies are updated.

@nalimilan
Member

> There is a duplicate in y that is not reported, because the value is missing (and missing is also used to indicate not-yet-filled data).
>
> The second thing is that unstack makes all columns accept missing even when we are not sure that we have to (which means that a stack-unstack round trip changes the types of the columns).
>
> The solution would be to use a temporary data structure holding binary indicators of whether a position has been filled (so the duplicate-missing issue goes away). Later, those indicators could also be used to decide whether a column has to accept Missing or not.
>
> But I am not sure you would feel it is worth the effort, as it will introduce some memory and computational overhead.

Why not, I guess the overhead could be kept relatively low. With compiler improvements we should be able to convert an Array{Union{Missing, T}} to an Array{T} very efficiently, and keeping track of whether an entry is missing or not shouldn't be too costly, if you're willing to do the work. Among other things, ismissing(payload[j][i]) is probably quite slow currently (waiting for more compiler optimizations), and could be much faster if we used a boolean array.

It would make sense to check what other implementations do, though.

@@ -159,7 +159,7 @@ unstack(df::AbstractDataFrame)

* `df` : the AbstractDataFrame to be unstacked

* `rowkey` : the column with a unique key for each row, if not given,
* `rowkeys` : the column or columns with a unique key for each row, if not given,
Member

Just "column(s)".

Member Author

fixed

@@ -150,7 +150,7 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol=
Unstacks a DataFrame; convert from a long to wide format

```julia
unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, rowkeys, colkey, value)
Member

It would be useful to mention the accepted types too. I guess you could have one line with ::Union{Symbol, Integer}, and another one with ::AbstractVector{<:Union{Symbol, Integer}}.

Member Author

fixed (although a bit messy)

insert!(payload, 1, copy!(col, levs), _names(df)[rowkey])
end
unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int) =
unstack(df, [rowkey], colkey, value)
unstack(df::AbstractDataFrame, rowkey::ColumnIndex,
Member

Add a line break since you do the same below.

Member Author

fixed

```
Note that there are some differences between the widened results above.

"""
function unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int)
Member

Can you explain why you find it natural that the groupby approach is more efficient than the approach used here? AFAIK the approach used here should be quite a bit faster, especially if the original column is a CategoricalArray (since we don't have an optimized groupby implementation for CategoricalArray at the moment, and it's not trivial to do).

Since the code already exists, I'm reluctant to remove it: we have sometimes found that trivial optimizations could be applied which made a dramatic difference.

Member Author

See the benchmark:

julia> x = DataFrame(rand(1000, 10));

julia> x[:id] = 1:1000;

julia> y = stack(x);

julia> z = copy(y);

julia> categorical!(z, :id);

julia> eltypes(y)
3-element Array{Type,1}:
 Symbol
 Float64
 Int64

julia> eltypes(z)
3-element Array{Type,1}:
 Symbol
 Float64
 CategoricalArrays.CategoricalValue{Int64,UInt32}

julia> @benchmark unstack(y)
BenchmarkTools.Trial:
  memory estimate:  2.01 MiB
  allocs estimate:  78752
  --------------
  minimum time:     7.898 ms (0.00% GC)
  median time:      8.700 ms (0.00% GC)
  mean time:        8.998 ms (3.72% GC)
  maximum time:     59.264 ms (84.15% GC)
  --------------
  samples:          556
  evals/sample:     1

julia> @benchmark unstack(y, :variable, :value)
BenchmarkTools.Trial:
  memory estimate:  1.85 MiB
  allocs estimate:  57636
  --------------
  minimum time:     6.008 ms (0.00% GC)
  median time:      6.769 ms (0.00% GC)
  mean time:        6.965 ms (3.64% GC)
  maximum time:     58.657 ms (87.87% GC)
  --------------
  samples:          717
  evals/sample:     1

julia> @benchmark unstack(z)
BenchmarkTools.Trial:
  memory estimate:  1.65 MiB
  allocs estimate:  68695
  --------------
  minimum time:     10.947 ms (0.00% GC)
  median time:      11.326 ms (0.00% GC)
  mean time:        11.677 ms (2.53% GC)
  maximum time:     64.046 ms (81.62% GC)
  --------------
  samples:          428
  evals/sample:     1

julia> @benchmark unstack(z, :variable, :value)
BenchmarkTools.Trial:
  memory estimate:  1.92 MiB
  allocs estimate:  56270
  --------------
  minimum time:     9.071 ms (0.00% GC)
  median time:      9.890 ms (0.00% GC)
  mean time:        10.179 ms (2.94% GC)
  maximum time:     65.845 ms (84.51% GC)
  --------------
  samples:          491
  evals/sample:     1

The difference is because of the lines:

i = Int(CategoricalArrays.order(refkeycol.pool)[refkeycol.refs[k]]) # more expensive

and

i = rowkey[k] # cheaper

Do you see any area for optimization? If the single-column version is not significantly faster, I would remove it, as now:

  1. we have a duplicate implementation of the same feature (which increases package complexity and maintenance cost);
  2. the single-column version has a bug when unstacking on an id column that has missing values.

Member

Yeah, the second line is cheaper, but it only comes after a call to groupby which should take much more time. So it looks like something is going wrong with the other method. Have you tried profiling it?

I agree with you that removing duplicated methods is generally a good idea. But I don't want to do that until we are certain we understand the situation, and here it looks like something doesn't work as I would expect it to. For example, we recently realized that building a CategoricalArray was much slower than it should have been. If we could fix a similar issue here, lots of places would benefit from the fix (even if in the end we remove the method).

A possibility here is that constructing a CategoricalArray is relatively slow since IDs are all unique, and CategoricalArray isn't efficient for this kind of data. Or maybe the compiler is not able to move the CategoricalArrays.order(keycol.pool) call outside of the loop, i.e. we should manually store the order in a variable?
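
For illustration, a minimal sketch of the hoisting idea (it reuses the internal refs/pool fields and the CategoricalArrays.order call that appear elsewhere in this thread; variable names are assumptions, not the PR's final code):

```julia
using CategoricalArrays

refkeycol = categorical([1, 2, 2, 3])
# Hoist the pool-order lookup out of the hot loop instead of calling order() per iteration.
refkeycol_order = Vector{Int}(CategoricalArrays.order(refkeycol.pool))
row_index = similar(refkeycol.refs, Int)
for k in 1:length(refkeycol)
    refkref = refkeycol.refs[k]
    row_index[k] = refkeycol_order[refkref]  # no CategoricalArrays.order(...) call here
end
```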

Member Author

I did the profiling - I will post it below.

But actually, observe that having all-unique IDs is exactly the case a working stack/unstack is meant for (i.e. they are unique after unstack but not unique after stack).

@@ -319,6 +319,27 @@ module TestDataFrame
df4[1,:Mass] = missing
@test df2 ≅ df4

# test empty set of grouping variables
@testthrows ArgumentError unstack(df, Int[], :Key, :Value)
Member

@test_throws.

Member Author

fixed

@bkamins
Member Author

bkamins commented Dec 7, 2017

I have fixed the handling of missing so it is correct. I have decided to keep all columns accepting missing - controlling for that would introduce a performance overhead, and I think it is not crucially problematic.

@bkamins
Member Author

bkamins commented Dec 7, 2017

First part - timing of the new vs the old code (the new code uses the Boolean mask test):

# Old code
julia> @benchmark unstack(y)
  memory estimate:  2.01 MiB
  mean time:        8.176 ms (3.61% GC)
julia> @benchmark unstack(y, :variable, :value)
  memory estimate:  1.85 MiB
  mean time:        6.263 ms (3.67% GC)
julia> @benchmark unstack(z)
  memory estimate:  1.65 MiB
  mean time:        11.022 ms (2.53% GC)
julia> @benchmark unstack(z, :variable, :value)
  memory estimate:  1.92 MiB
  mean time:        10.190 ms (2.86% GC)

# The proposed code with mask array
julia> @benchmark unstack(y)
  memory estimate:  1.85 MiB
  mean time:        6.001 ms (4.54% GC)
julia> @benchmark unstack(y, :variable, :value)
  memory estimate:  1.85 MiB
  mean time:        5.891 ms (3.69% GC)
julia> @benchmark unstack(z)
  memory estimate:  1.92 MiB
  mean time:        9.367 ms (3.08% GC)
julia> @benchmark unstack(z, :variable, :value)
  memory estimate:  1.92 MiB
  mean time:        9.391 ms (2.90% GC)

EDIT: under Julia 0.7

@bkamins
Member Author

bkamins commented Dec 7, 2017

I performed the profiling, and it actually seems that creating the groupby:

g = groupby(df, setdiff(_names(df), _names(df)[[colkey, value]]), sort=true)

is faster than:

keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])

if they have to be run. This means that unstack without groupby would be faster if the colkey and rowkey columns were already categorical.

In consequence, the question is whether we want to make stack turn the :variable column into a CategoricalArray (it would make stack slower but unstack faster).
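
For reference, a tiny sketch of what that would mean, using the 0.6-era column syntax used elsewhere in this thread (hypothetical; not a change made in this PR):

```julia
using DataFrames, CategoricalArrays

long = DataFrame(variable = [:x1, :x1, :x2, :x2], value = [1.0, 2.0, 3.0, 4.0])
# Hypothetical: stack could hand back :variable already converted like this,
# trading a slightly slower stack for a faster unstack.
long[:variable] = categorical(long[:variable])
```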

@nalimilan
Member

> I performed the profiling, and it actually seems that creating the groupby:
>
> g = groupby(df, setdiff(_names(df), _names(df)[[colkey, value]]), sort=true)
>
> is faster than:
>
> keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
> refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
>
> if they have to be run.

Why not g = groupby(df, rowkeys, sort=true) as in the code? Though in this case it should be equivalent?

Also, it's not fair to include keycol, since it needs to be created in both cases, right?

> This means that unstack without groupby would be faster if the colkey and rowkey columns were already categorical.

But is that indeed the case in benchmarks? In my quick tests it wasn't, which is really weird.

Anyway, if groupby is faster than creating a CategoricalArray, that means that hashing the column first and only then performing dictionary lookups based on those hashes is faster than hashing and doing the lookups on the fly at the same time. Indeed, that's the main difference I can think of between the two approaches.

@bkamins
Member Author

bkamins commented Dec 7, 2017

I was testing on the implementation before my changes (that is the reason for the code differences). I used @profile - those are the lines that came up (and actually the single CategoricalArray creation was slower).

However, it is indeed weird - I will investigate it further. What bothers me is:

using BenchmarkTools

df = DataFrame(rand(1000,10))
df[:id] = string.("a", 1:1000)
sdf = stack(df)
sdf = sdf[randperm(nrow(sdf)), :]

f1(s) = groupby(s, [:id], sort=true)
f2(s) = CategoricalArray{Union{eltype(s[:id]), Missing}}(s[:id])

@benchmark f1($sdf)
@benchmark f2($sdf)

gives:

julia> @benchmark f1($sdf)
BenchmarkTools.Trial:
  memory estimate:  601.81 KiB
  allocs estimate:  12377
  --------------
  minimum time:     2.283 ms (0.00% GC)
  median time:      2.422 ms (0.00% GC)
  mean time:        2.528 ms (3.56% GC)
  maximum time:     55.176 ms (94.73% GC)
  --------------
  samples:          1975
  evals/sample:     1

julia> @benchmark f2($sdf)
BenchmarkTools.Trial:
  memory estimate:  756.02 KiB
  allocs estimate:  15795
  --------------
  minimum time:     1.448 ms (0.00% GC)
  median time:      1.549 ms (0.00% GC)
  mean time:        1.693 ms (7.64% GC)
  maximum time:     55.446 ms (97.12% GC)
  --------------
  samples:          2940
  evals/sample:     1

which is in line with your intuition and against the results I report above.

@bkamins
Member Author

bkamins commented Dec 7, 2017

Actually, running @code_warntype on both implementations shows everything red (type-unstable). I will check if we can improve something here.

@bkamins
Member Author

bkamins commented Dec 7, 2017

OK. I have it. We will leave separate implementations for single and multiple columns.
The problem is solved in the standard way, with a function barrier:

function unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int)
    refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
    keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
    _unstack(df, rowkey, colkey, value, refkeycol, keycol)
end

(this is a quick writeup - I will push a clean solution when I have it, but roughly it gives a 2x speedup on the test I report above)
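
For readers unfamiliar with the pattern, a generic, runnable illustration of a function barrier (not DataFrames code): the outer function builds a value whose concrete type the compiler cannot predict, and the inner function is specialized once it receives it.

```julia
# Generic function-barrier illustration.
kernel(v::AbstractVector) = sum(x -> x + 1, v)   # compiled per concrete vector type

function outer(use_float::Bool)
    v = use_float ? rand(10) : rand(1:10, 10)    # type-unstable construction
    return kernel(v)                             # barrier: the hot work runs on a known type
end

outer(true), outer(false)
```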

@@ -150,16 +150,20 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol=
Unstacks a DataFrame; convert from a long to wide format

```julia
unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
Contributor

Union{Symbol, Integer} -> ColumnIndex. Currently, ColumnIndex is actually Union{Symbol, Real}, but I've had a PR on my to-do list for a while to change it to Union{Symbol, Integer}. Let's stay consistent, and hopefully we can further restrict the types of all column-accepting functions at once by just changing ColumnIndex to Union{Symbol, Integer}.

Member

The problem is that it's not exported currently, and even if it were I'm not sure it's a good idea to use it in docs, since it's less explicit than the Union. Ideally we could interpolate it using $ColumnIndex.

Member Author

I have added Union{Symbol, Integer} everywhere in my local repo. I will probably push this during the weekend when I finish tuning the performance.

Contributor
@cjprybol cjprybol Dec 8, 2017

If we use 4-space indents for code blocks rather than backticks, the variable interpolation works - I just tried it out.

@bkamins
Member Author

bkamins commented Dec 8, 2017

In this commit I have concentrated on functionality (some testing would be appreciated):

  1. the speedup of execution should be around 2x;
  2. missing is handled correctly and consistently in the :id column - in both the groupby and the CategoricalArray implementations;
  3. a situation with missing in the :variable column is handled correctly (it now prints a warning - I can change it to an error if you feel that would be a better approach).

@bkamins bkamins force-pushed the improve_unstack branch 4 times, most recently from db59b39 to 5b9bad6 on December 8, 2017 23:49
@coveralls

Coverage Status

Coverage increased (+1.5%) to 74.71% when pulling 5b9bad6 on bkamins:improve_unstack into 83323bb on JuliaData:master.

@bkamins
Member Author

bkamins commented Dec 9, 2017

Here are the benchmarks (I have cut the output to leave what is important - timing is taken after unstack has been compiled for the given arguments).
The conclusion is that this implementation is about 2x faster, and that 0.7 is faster than 0.6 (keep in mind that the new implementation does more work, as it has to handle missing values in rowkeys and colkey).

Data setup code:

x = DataFrame(rand(10000, 1000))
x[:id] = 1:10000
x[:id2] = string.("a", x[:id])
x[:s] = [randstring() for i in 1:10000]
y = melt(x, [:id, :id2])

Summary of benchmark results:

Julia 0.6

OLD
julia> @time unstack(y);
 10.974269 seconds (121.05 M allocations: 2.254 GiB, 22.68% gc time)
julia> @time unstack(y, :variable, :value);
 10.412161 seconds (73.60 M allocations: 1.859 GiB, 30.93% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
 10.393829 seconds (111.06 M allocations: 1.919 GiB, 22.76% gc time)
julia> @time unstack(y, :variable, :value);
 11.879770 seconds (83.64 M allocations: 2.010 GiB, 31.39% gc time)

NEW 
julia> @time unstack(y);
  4.848957 seconds (34.63 M allocations: 1.006 GiB, 42.45% gc time)
julia> @time unstack(y, :variable, :value);
  8.192743 seconds (24.48 M allocations: 1.128 GiB, 56.92% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
  4.182978 seconds (24.64 M allocations: 687.415 MiB, 46.12% gc time)
julia> @time unstack(y, :variable, :value);
  9.793470 seconds (34.52 M allocations: 1.280 GiB, 55.00% gc time)

Julia 0.7

OLD
julia> @time unstack(y);
 11.376244 seconds (121.07 M allocations: 2.254 GiB, 11.02% gc time)
julia> @time unstack(y, :variable, :value);
  9.536922 seconds (73.59 M allocations: 1.859 GiB, 22.01% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
 10.165981 seconds (111.06 M allocations: 1.919 GiB, 11.03% gc time)
julia> @time unstack(y, :variable, :value);
 10.698634 seconds (83.64 M allocations: 2.010 GiB, 22.48% gc time)

NEW
julia> @time unstack(y);
  4.928271 seconds (34.56 M allocations: 1.005 GiB, 41.12% gc time)
julia>  @time unstack(y, :variable, :value);
  5.809069 seconds (24.48 M allocations: 1.128 GiB, 42.01% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
  3.382329 seconds (24.55 M allocations: 686.091 MiB, 28.77% gc time)
julia> @time unstack(y, :variable, :value);
  6.865709 seconds (34.53 M allocations: 1.280 GiB, 40.00% gc time)

@JuliaData JuliaData deleted a comment from coveralls Dec 9, 2017 (×6)
Member

@nalimilan nalimilan left a comment

Great! Though, why is unstack(y) slightly faster than unstack(y, :variable, :value)?

You should be able to make it dramatically more efficient by passing a tuple of columns to _unstack rather than a DataFrame. See what I did at JuliaData/DataTables.jl#79.
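
A hedged sketch of the tuple-of-columns idea (illustrative names, not the PR's code): pulling a column out of a DataFrame is type-unstable, while a tuple preserves each column's concrete type, so a kernel taking the tuple is fully specialized.

```julia
using DataFrames

process_cols(cols::Tuple) = map(sum, cols)   # specialized for the concrete column types

df = DataFrame(a = 1:3, b = [1.0, 2.0, 3.0])
process_cols((df[:a], df[:b]))               # 0.6-era column access, as elsewhere in this thread
```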

@@ -171,6 +175,10 @@ unstack(df::AbstractDataFrame)

* `::DataFrame` : the wide-format DataFrame

If `colkey` contains `missing` values then they will be skipped and warning will be printed.
Member

"a warning". Same below.

Member Author

fixed

colkey::Int, value::Int, keycol, valuecol, df2, refkeycol)
Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
hadmissing = false # have we encounered missing in refkeycol
Member

"encountered".

Member Author

fixed

mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled?
nowarning = true # do we print duplicate entries warning
nowarning2 = true # do we print missing in keycol
keycol_pool = Vector{Int}(CategoricalArrays.order(keycol.pool))
Member

Better call this keycol_order. Same below.

Member Author

fixed

hadmissing = false # have we encounered missing in refkeycol
mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled?
nowarning = true # do we print duplicate entries warning
nowarning2 = true # do we print missing in keycol
Member

Better call this nowarning_missing, or maybe warned_missing (and invert it). Same for nowarning.

Member Author

fixed

if nowarning && !ismissing(payload[j][i])
warn("Duplicate entries in unstack.")
nowarning = false
keycol_refs = keycol.refs[k]
Member

Plural is weird since it's a single value. Same for refkeycol_ref. Could almost be kref and rkref.

Member Author

fixed (actually I had similar names originally, but thought them to be less informative)

end
j = keycol_pool[keycol_refs]
refkeycol_refs = refkeycol.refs[k]
if refkeycol_refs == 0 # we have found missing in rowkeys
Member

Use <= 0, I have possible plans to use negative values to support multiple types of missing values.

Member Author

fixed both for kref and refkref.

end
if nowarning && mask_filled[j, i]
warn("Duplicate entries in unstack at row $k.")
Member

It would also be nice to print the value and the name of the column.

Member Author

I skip this. It gets messy, as I would have to say something like

duplicate value for index "something which can be multiple columns" and variable "something"

I feel it is easy enough to check the contents of the reported column anyway.

Member

Sure, you can check it manually, but I find it really makes the difference between software which works and software which is pleasant and efficient to work with. For example, we print the index and the dimensions of the indexed array with BoundsError, and it makes a big difference compared to R, where you need to start the debugger to find it out.

Don't do it if that's too much work, but if it's possible it would be really nice, even if the output isn't completely pretty (e.g. it's fine to print a one-element tuple when there's only one column, though join's last argument is very useful to format this properly).

Member Author

done!

unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex, value::ColumnIndex) =
unstack(df, rowkeys, index(df)[colkey], index(df)[value])

unstack(df::AbstractDataFrame, rowkeys::AbstractVector{T}, colkey::Int, value::Int) where T<:Real =
Member

Use AbstractVector{<:Real}.

Member Author

OK. Out of curiosity - why is this the preferred style?

Member

I think because it's shorter, you can read it directly instead of looking for T at the end of the line, and it makes it clear that T isn't used in the body.

Member Author

👍

Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
payload = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol)))
nowarning = true
df2 = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol)))
Member

It would be more natural to build the returned data frame inside _unstack.

Member Author

fixed (and it improves performance, as reported below)

@@ -377,6 +398,36 @@ module TestDataFrame
a = unstack(df, :id, :variable, :value)
b = unstack(df, :variable, :value)
@test a ≅ b ≅ DataFrame(id = [1, 2], a = [3, missing], b = [missing, 4])

df = DataFrame(variable=["x","x"], value=[missing,missing], id=[1,1])
Member

Use spaces around == and after commas for consistency.

@bkamins
Member Author

bkamins commented Dec 9, 2017

> Great! Though, why is unstack(y) slightly faster than unstack(y, :variable, :value)?

Because unstack(y) uses the single column :id (and does not use groupby), while unstack(y, :variable, :value) works on [:id, :id2] (and uses groupby).

> You should be able to make it dramatically more efficient by passing a tuple of columns to _unstack rather than a DataFrame. See what I did at JuliaData/DataTables.jl#79.

This is essentially what I do. I pass df only to get the row names. But I can create df2 later (and work on an array when assigning values) - I will check if it helps.

@bkamins
Member Author

bkamins commented Dec 9, 2017

@nalimilan good advice - delaying the creation of df2 gives another 2x speedup.

@nalimilan
Member

> This is essentially what I do. I pass df only to get the row names. But I can create df2 later (and work on an array when assigning values) - I will check if it helps.

Oh right, it's for df2 that type instability still hurts. Anyway, it makes sense to create that data frame inside _unstack.

@bkamins bkamins force-pushed the improve_unstack branch 2 times, most recently from 2b9d2ee to 81be94b on December 9, 2017 23:00
refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
# make sure we report only levels actually present in rowkey column
# this is consistent with how the other unstack method based on groupby works
refkeycol = copy(CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey]))
Member

I don't think you need copy; IIRC I ensured that CategoricalArray constructors always make a copy. BTW, that means that if you don't need a copy for refkeycol, better use convert to avoid it if the input is already a CategoricalArray.

Member Author

I need a copy, as I call droplevels! in the next line. Done. I added a test to check that we actually make a copy.

Member

But do you need a copy for keycol too?

Member Author

We do (and there was another related bug, which I have fixed, when keycol has nonexistent levels).

However, I have a question here - why do we want keycol to allow Missing even if it originally does not contain missings? (This is the original design and I respect it, but I do not understand the reason for such an approach.)

Member

The original design was probably based on DataArray everywhere, so it makes sense to change it to return arrays which do not allow for missing where possible.

Member Author

OK. I removed allowing missing when it is not needed (I hope it is ok - you know Missings best 😄):

  • I create the categorical columns using categorical;
  • I do not force col in the single-index case to allow missings;
  • I do not force df1 to allow missings.

valuecol = df[value]
Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
df2m = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
Member

Why create this here? Don't you have all the needed information inside _unstack?

Member Author

Because it is ~5% faster this way. I guess Julia does not like functions with large bodies, and _unstack is already heavy. I do not know why in detail - those are the benchmark results.

Member

5% doesn't sound worth it to me. And this kind of difference can change with compiler improvements. Better prioritize the clearer design.

Member

BTW, better not call it df since it's not a data frame.

Member Author

Moved, and changed the name to unstacked_val.

nowarning = true
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
warned_missing_1 = false # do we print duplicate entries warning
Member

My suggestion was to call this warned_dup or something like that, to get rid of 1 and 2. :-)

And the comment would be "have we already printed..."

Member Author

fixed

kref = keycol.refs[k]
if kref <= 0 # we have found missing in colkey
if !warned_missing_2
warn("Missing value in '$(_names(df)[colkey])' variable at row $k. Skipping.")
Member

I've just checked, the convention used in the DataFrame constructor is to print column names without quotes. Maybe clearer with "variable X" (rather than "X variable").

Member Author

fixed

Member Author

although now we print "Missing value in variable variable at row"

hadmissing = true
# we use the fact that missing is greater than anything
for i in eachindex(df2m)
push!(df2m[i], missing)
Member

How common do we expect this to happen? Unless that reflects malformed data, we could allocate Nrow+1 entries and resize to Nrow when creating df2m, so that this line does not have to make any copy.

Member Author

I guess this should not happen normally.
This will take place only if we have missing in the index column (I have even considered throwing an error in this case, but in the end I think we can accept missing as an index).
Therefore I wanted to keep the typical case clean and use push! only as an exception.
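
A tiny sketch of the preallocate-and-shrink suggestion (illustrative only; it relies on resize! typically keeping the vector's buffer so a later push! does not reallocate):

```julia
Nrow = 3
col = Vector{Union{Float64, Missing}}(undef, Nrow + 1)  # one spare slot (drop `undef` on 0.6)
resize!(col, Nrow)         # shrink to Nrow; the spare capacity usually remains
# only if a missing row key is encountered later:
push!(col, missing)        # typically no copy, the slot is already allocated
```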

else
i = refkeycol_order[refkref]
end
if (!warned_missing_1) && mask_filled[j, i]
Member

No need for parentheses.

Member Author

I would leave them - I have learned from my students 😄 that they are often confused about operator precedence.
But if you feel that it is cleaner to remove them I will do it.

Member

I don't think we use parentheses with ! in any of the Julia codebases I know. I agree precedence can be confusing at times, but in this case it's pretty clear, so please remove it.

Member Author

fixed

end
levs = levels(refkeycol)
# we have to handle a case with missings in refkeycol as levs will skip missing
col = similar_missing(df[rowkey], length(levs))
Member

Use length(levs) + hadmissing for the length; then you can do hadmissing && (col[end] = missing) below. That way you avoid a copy.

Member Author

fixed

end
df2m[j][i] = valuecol[k]
mask_filled[j, i] = true
Member

[j, i] is less natural than [i, j]; is this to optimize memory access patterns? Do we expect the data to be sorted (and therefore filled) by rows or by columns? Given the layout of data frames, it would be optimal to fill one column at a time. I admit I don't really understand whether that's the case currently or not.

We should also ensure that's consistent with what stack outputs.

Member Author

stack sorts by variable, not by index. So if we unstack something that was stacked, j changes faster than i, so I guess this layout is better. Yes?

Member

If that's the case (I haven't checked), then it's not optimal: we need i to change the fastest, since accessing one column at a time is more efficient for data frames. Maybe we should change stack? Though we should check what R/dplyr/Pandas do first.

Member Author

Two points (contradictory though):

  • I was wrong - we fill the data column-by-column (so i changes faster).
  • However, both the result of stack and a random permutation of the result of stack are minimally faster with the current order (i.e. mask_filled = falses(Ncol, Nrow)). I do not understand why.

Anyway, this does not matter much - it is again a case of a minimal performance difference. So I would leave it as is.

Member

OK, thanks for checking. That's really weird, but well... For maximal performance we could also use an Array{Bool}, but that's probably not worth the increased memory use.
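
As background to the i/j ordering discussed above, a small generic illustration (not the PR's code): Julia arrays are column-major, so loops in which the row index i changes fastest write contiguous memory.

```julia
A = zeros(1000, 1000)

function fill_by_column!(A)           # i changes fastest: contiguous writes
    for j in 1:size(A, 2), i in 1:size(A, 1)
        A[i, j] = i + j
    end
end

function fill_by_row!(A)              # j changes fastest: strided writes
    for i in 1:size(A, 1), j in 1:size(A, 2)
        A[i, j] = i + j
    end
end
```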

Member

BTW, have you tried adding @inbounds before indexing expressions where we are sure the indices are correct?

Member Author

The code is complex and it is easy to make a mistake in the reasoning. I would leave @inbounds out until the code is well tested by users. I will add a TODO comment.

Member

No need for a TODO: either you know it's OK or you don't.

At least it should be fine when you index refkeycol and keycol.

Member Author

I know 😄, but only conditional on what you have written, so I would not risk it. I have checked that the gains from @inbounds are minimal (<1%).
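
For completeness, a generic sketch of what @inbounds would look like where it is provably safe (illustrative only; the PR deliberately leaves it out):

```julia
# Skip bounds checks only where every index is known to be valid.
function lookup_rows(refs::Vector{UInt32}, order::Vector{Int})
    out = Vector{Int}(undef, length(refs))   # drop `undef` on Julia 0.6
    @inbounds for k in eachindex(refs)
        out[k] = order[refs[k]]   # safe only if all refs are valid indices into order
    end
    return out
end
```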

end

@testset "missing values in colkey" begin
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2],
Member

Spaces around =.

Member Author

I thought the style without spaces was recommended by the Julia manual. Also, you do not have spaces in this very file (see e.g. its beginning - most of the calls to DataFrame do not use spaces here).
I can fix it - but let us decide: do we want spaces or no spaces everywhere?
(Adding spaces would change many more lines.)

Member

OK, it's already inconsistent, so let's leave it that way.

@@ -296,12 +296,14 @@ module TestDataFrame

#Check the output of unstack
df = DataFrame(Fish = CategoricalArray{Union{String, Missing}}(["Bob", "Bob", "Batman", "Batman"]),
Key = Union{String, Missing}["Mass", "Color", "Mass", "Color"],
Key = CategoricalArray{Union{String, Missing}}["Mass", "Color", "Mass", "Color"],
Member

Do we still have similar tests with non-categorical columns?

Member Author

added

@bkamins
Member Author

bkamins commented Dec 10, 2017

@nalimilan This is probably a bug in CategoricalArrays which causes me problems with unstack 😄:

julia> x = categorical([1,2,3])
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> levels!(x, [4,3,2,1])
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical(x)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> droplevels!(y)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> x
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 3
 2
 1

The reason is that categorical creates a new pool, but the refs vector is shared between x and y.

@bkamins
Member Author

bkamins commented Dec 10, 2017

@nalimilan Temporarily, I have wrapped categorical in deepcopy to make the tests pass, but this should be removed when categorical is fixed or some other decision in JuliaData/CategoricalArrays.jl#107 is made.
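
A sketch of that temporary workaround, continuing the x from the example above (illustrative; the actual call site is inside unstack in the PR):

```julia
# Hypothetical workaround: deepcopy gives y its own refs buffer, so
# droplevels!(y) can no longer mutate x through shared storage.
y = deepcopy(categorical(x))
droplevels!(y)
levels(x)   # x keeps its custom level order, unaffected by y
```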

unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::Union{AbstractVector{<:Union{Symbol, Integer}}},
Member

One Union too many.

Member Author

fixed

unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
warned_dup = false # hawe we already printed duplicate entries warning?
Member

"have"

Member Author

fixed

nowarning = true
unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
Member

Really, that col, row order bugs me, and will probably bug many people. Unless the performance difference is really significant, better use the more natural order.

Member Author

ok, fixed (the difference is minimal)

@@ -347,7 +396,7 @@ module TestDataFrame
@test udf == DataFrame(Any[Union{Int, Missing}[1, 2], Union{Int, Missing}[1, 5],
Union{Int, Missing}[2, 6], Union{Int, Missing}[3, 7],
Union{Int, Missing}[4, 8]], [:id, :a, :b, :c, :d])
@test all(isa.(udf.columns, Vector{Union{Int, Missing}}))
@test all(isa.(udf.columns[2:end], Vector{Union{Int, Missing}}))
Member

Also test the type of the first column. Same below.

Member Author

fixed


@testset "missing values in colkey" begin
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2],
variable=["a", "b", missing, "a", "b", "missing", "a", "b", "missing"],
Member

Why "missing"? It doesn't seem to be tested specifically here.

Member Author

It is intentional. What I test:

  • if I pass missing, a warning should be printed;
  • if I pass "missing", a column should be created with that name.

I added lines to test this.
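
A hedged sketch of what those added lines could look like (hypothetical test data and assertions; the real tests are in the PR diff):

```julia
df = DataFrame(id = [1, 1, 2, 2],
               variable = ["a", missing, "a", "missing"],
               value = [1, 2, 3, 4])
wide = unstack(df, :id, :variable, :value)  # expected to print a warning for the missing colkey
@test :a in names(wide)
@test :missing in names(wide)               # the string "missing" becomes an ordinary column
```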

@bkamins bkamins force-pushed the improve_unstack branch 3 times, most recently from fd496dc to f699fb8 on December 11, 2017 23:21
@bkamins
Member Author

bkamins commented Dec 13, 2017

build errors seem unrelated

Member

@nalimilan nalimilan left a comment

So in the end @inbounds wasn't worth it?

@bkamins
Member Author

bkamins commented Dec 14, 2017

@inbounds - minimal benefit and the code is complex.

@nalimilan
Member

I've added a dependency on CategoricalArrays 0.3.2 since tests wouldn't pass with older versions.

@nalimilan
Member

Can you suggest a commit message, or squash and add a message to the commit?

1) better performance, 2) rowkey(s) are not converted to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
@bkamins
Member Author

bkamins commented Dec 15, 2017

I have squashed and rebased. It ended up quite big - many things required fixing.

@nalimilan nalimilan merged commit 740978e into JuliaData:master Dec 15, 2017
@nalimilan
Member

Thanks!

nalimilan pushed a commit that referenced this pull request Dec 23, 2017
1) better performance, 2) rowkey(s) are not converted to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
@bkamins bkamins deleted the improve_unstack branch December 15, 2020 19:46