
Refactor unstack #1309

Merged · 2 commits merged into JuliaData:master from bkamins:improve_unstack on Dec 15, 2017

Conversation

bkamins
Member

@bkamins bkamins commented Dec 5, 2017

Changes:

  1. allow multiple columns in rowkeys
  2. remove the incorrect and slower implementation for a single-column rowkey

@bkamins
Member Author

bkamins commented Dec 5, 2017

@nalimilan @cjprybol
Just for the sake of completeness of the unstack refactoring, let me mention that we currently do not handle the following case 100% correctly:

julia> y = DataFrame(variable=["x","x"], value=[missing,missing], id=[1,1])
2×3 DataFrames.DataFrame
│ Row │ variable │ value   │ id │
├─────┼──────────┼─────────┼────┤
│ 1   │ x        │ missing │ 1  │
│ 2   │ x        │ missing │ 1  │

julia> unstack(y)
1×2 DataFrames.DataFrame
│ Row │ id │ x       │
├─────┼────┼─────────┤
│ 1   │ 1  │ missing │

There is a duplicate in y that is not reported, because the value is missing (and missing is also used to indicate not-yet-filled data).

The second thing is that unstack makes all columns accept missing even when we are not sure that we have to (which means that a stack-unstack round trip changes the types of the columns).

The solution would be to use a temporary data structure holding binary indicators of whether a position has been filled (so the duplicate-missing issue goes away). Later, those indicators could also be used to decide whether a column has to accept Missing or not.

But I am not sure you would feel it is worth the effort, as it will introduce some memory and computational overhead.
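
For concreteness, a minimal sketch of that indicator idea (hypothetical helper name and layout, not the PR's final code): a Boolean mask kept next to the payload makes "already filled" independent of whether the stored value is missing.

```julia
# Hypothetical sketch only. `filled` records whether a cell was already written,
# so a duplicate is detected even when the stored value is `missing`; at the end
# the mask could also tell which columns never needed to accept Missing.
function fill_cell!(payload::Vector, filled::BitMatrix, i::Int, j::Int, v)
    filled[i, j] && warn("Duplicate entries in unstack.")  # 0.6-era warn()
    payload[j][i] = v
    filled[i, j] = true
end

# usage sketch: filled = falses(Nrow, Ncol); then fill_cell!(payload, filled, i, j, value)
```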

@bkamins
Member Author

bkamins commented Dec 5, 2017

Tests fail because the fix for allowmissing! is not tagged yet, so we have to rerun CI after the dependencies are updated.

@nalimilan
Member

> There is a duplicate in y that is not reported, because the value is missing (and missing is also used to indicate not-yet-filled data).
>
> The second thing is that unstack makes all columns accept missing even when we are not sure that we have to (which means that a stack-unstack round trip changes the types of the columns).
>
> The solution would be to use a temporary data structure holding binary indicators of whether a position has been filled (so the duplicate-missing issue goes away). Later, those indicators could also be used to decide whether a column has to accept Missing or not.
>
> But I am not sure you would feel it is worth the effort, as it will introduce some memory and computational overhead.

Why not, I guess the overhead could be kept relatively low. With compiler improvements we should be able to convert an Array{Union{Missing, T}} to an Array{T} very efficiently, and keeping track of whether an entry is missing or not shouldn't be too costly, if you're willing to do the work. Among other things, ismissing(payload[j][i]) is probably quite slow currently (waiting for more compiler optimizations), and could be much faster if we used a boolean array.

It would make sense to check what other implementations do, though.

@@ -159,7 +159,7 @@ unstack(df::AbstractDataFrame)

* `df` : the AbstractDataFrame to be unstacked

* `rowkey` : the column with a unique key for each row, if not given,
* `rowkeys` : the column or columns with a unique key for each row, if not given,
Member

Just "column(s)".

Member Author

fixed

@@ -150,7 +150,7 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol=
Unstacks a DataFrame; convert from a long to wide format

```julia
unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, rowkeys, colkey, value)
Member

It would be useful to mention the accepted types too. I guess you could have one line with ::Union{Symbol, Integer}, and another one with ::AbstractVector{<:Union{Symbol, Integer}}.

Member Author

fixed (although a bit messy)

insert!(payload, 1, copy!(col, levs), _names(df)[rowkey])
end
unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int) =
unstack(df, [rowkey], colkey, value)
unstack(df::AbstractDataFrame, rowkey::ColumnIndex,
Member

Add a line break since you do the same below.

Member Author

fixed

```
Note that there are some differences between the widened results above.

"""
function unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int)
Member

Can you explain why you find it natural that the groupby approach is more efficient than the approach used here? AFAIK the approach used here should be quite a bit faster, especially if the original column is a CategoricalArray (since we don't have an optimized groupby implementation for CategoricalArray at the moment, and it's not trivial to do).

Since the code already exists, I'm reluctant to remove it: we have sometimes found that trivial optimizations could be applied which made a dramatic difference.

Member Author

See the benchmark:

julia> x = DataFrame(rand(1000, 10));

julia> x[:id] = 1:1000;

julia> y = stack(x);

julia> z = copy(y);

julia> categorical!(z, :id);

julia> eltypes(y)
3-element Array{Type,1}:
 Symbol
 Float64
 Int64

julia> eltypes(z)
3-element Array{Type,1}:
 Symbol
 Float64
 CategoricalArrays.CategoricalValue{Int64,UInt32}

julia> @benchmark unstack(y)
BenchmarkTools.Trial:
  memory estimate:  2.01 MiB
  allocs estimate:  78752
  --------------
  minimum time:     7.898 ms (0.00% GC)
  median time:      8.700 ms (0.00% GC)
  mean time:        8.998 ms (3.72% GC)
  maximum time:     59.264 ms (84.15% GC)
  --------------
  samples:          556
  evals/sample:     1

julia> @benchmark unstack(y, :variable, :value)
BenchmarkTools.Trial:
  memory estimate:  1.85 MiB
  allocs estimate:  57636
  --------------
  minimum time:     6.008 ms (0.00% GC)
  median time:      6.769 ms (0.00% GC)
  mean time:        6.965 ms (3.64% GC)
  maximum time:     58.657 ms (87.87% GC)
  --------------
  samples:          717
  evals/sample:     1

julia> @benchmark unstack(z)
BenchmarkTools.Trial:
  memory estimate:  1.65 MiB
  allocs estimate:  68695
  --------------
  minimum time:     10.947 ms (0.00% GC)
  median time:      11.326 ms (0.00% GC)
  mean time:        11.677 ms (2.53% GC)
  maximum time:     64.046 ms (81.62% GC)
  --------------
  samples:          428
  evals/sample:     1

julia> @benchmark unstack(z, :variable, :value)
BenchmarkTools.Trial:
  memory estimate:  1.92 MiB
  allocs estimate:  56270
  --------------
  minimum time:     9.071 ms (0.00% GC)
  median time:      9.890 ms (0.00% GC)
  mean time:        10.179 ms (2.94% GC)
  maximum time:     65.845 ms (84.51% GC)
  --------------
  samples:          491
  evals/sample:     1

The difference is because of the lines:

i = Int(CategoricalArrays.order(refkeycol.pool)[refkeycol.refs[k]]) # more expensive

and

i = rowkey[k] # cheaper

Do you see any area for optimization? If the single-column version is not significantly faster, I would remove it, as now:

  1. we have a duplicate implementation of the same feature (which increases package complexity and maintenance cost);
  2. the single-column version has a bug when unstacking on an id column that has missing values.

Member

Yeah, the second line is cheaper, but it only comes after a call to groupby which should take much more time. So it looks like something is going wrong with the other method. Have you tried profiling it?

I agree with you that removing duplicated methods is generally a good idea. But I don't want to do that until we are certain we understand the situation, and here it looks like something doesn't work as I would expect it to. For example, we recently realized that building a CategoricalArray was much slower than it should have been. If we could fix a similar issue here, lots of places would benefit from the fix (even if in the end we remove the method).

A possibility here is that constructing a CategoricalArray is relatively slow since IDs are all unique, and CategoricalArray isn't efficient for this kind of data. Or maybe the compiler is not able to move the CategoricalArrays.order(keycol.pool) call outside of the loop, i.e. we should manually store the order in a variable?
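
For illustration, a minimal sketch of the hoisting idea (it reuses the internal refs/pool fields and the CategoricalArrays.order call that appear elsewhere in this thread; variable names are assumptions, not the PR's final code):

```julia
using CategoricalArrays

refkeycol = categorical([1, 2, 2, 3])
# Hoist the pool-order lookup out of the hot loop instead of calling order() per iteration.
refkeycol_order = Vector{Int}(CategoricalArrays.order(refkeycol.pool))
row_index = similar(refkeycol.refs, Int)
for k in 1:length(refkeycol)
    refkref = refkeycol.refs[k]
    row_index[k] = refkeycol_order[refkref]  # no CategoricalArrays.order(...) call here
end
```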

Member Author

I did the profiling - I will post it below.

But actually, observe that having all-unique IDs is exactly the case a working stack/unstack is meant for (i.e. they are unique after unstack but not unique after stack).

@@ -319,6 +319,27 @@ module TestDataFrame
df4[1,:Mass] = missing
@test df2 ≅ df4

# test empty set of grouping variables
@testthrows ArgumentError unstack(df, Int[], :Key, :Value)
Member

@test_throws.

Member Author

fixed

@bkamins
Member Author

bkamins commented Dec 7, 2017

I have fixed the handling of missing so it is correct. I have decided to keep all columns accepting missing - controlling for that would introduce a performance overhead, and I think it is not crucially problematic.

@bkamins
Member Author

bkamins commented Dec 7, 2017

First part - timing of the new vs the old code (the new code uses the Boolean mask test):

# Old code
julia> @benchmark unstack(y)
  memory estimate:  2.01 MiB
  mean time:        8.176 ms (3.61% GC)
julia> @benchmark unstack(y, :variable, :value)
  memory estimate:  1.85 MiB
  mean time:        6.263 ms (3.67% GC)
julia> @benchmark unstack(z)
  memory estimate:  1.65 MiB
  mean time:        11.022 ms (2.53% GC)
julia> @benchmark unstack(z, :variable, :value)
  memory estimate:  1.92 MiB
  mean time:        10.190 ms (2.86% GC)

# The proposed code with mask array
julia> @benchmark unstack(y)
  memory estimate:  1.85 MiB
  mean time:        6.001 ms (4.54% GC)
julia> @benchmark unstack(y, :variable, :value)
  memory estimate:  1.85 MiB
  mean time:        5.891 ms (3.69% GC)
julia> @benchmark unstack(z)
  memory estimate:  1.92 MiB
  mean time:        9.367 ms (3.08% GC)
julia> @benchmark unstack(z, :variable, :value)
  memory estimate:  1.92 MiB
  mean time:        9.391 ms (2.90% GC)

EDIT: under Julia 0.7

@bkamins
Member Author

bkamins commented Dec 7, 2017

I performed the profiling, and it actually seems that creating the groupby:

g = groupby(df, setdiff(_names(df), _names(df)[[colkey, value]]), sort=true)

is faster than:

keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])

if they have to be run. This means that unstack without groupby would be faster if the colkey and rowkey columns were already categorical.

In consequence, the question is whether we want to make stack turn the :variable column into a CategoricalArray (it would make stack slower but unstack faster).
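
For reference, a tiny sketch of what that would mean, using the 0.6-era column syntax used elsewhere in this thread (hypothetical; not a change made in this PR):

```julia
using DataFrames, CategoricalArrays

long = DataFrame(variable = [:x1, :x1, :x2, :x2], value = [1.0, 2.0, 3.0, 4.0])
# Hypothetical: stack could hand back :variable already converted like this,
# trading a slightly slower stack for a faster unstack.
long[:variable] = categorical(long[:variable])
```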

@nalimilan
Member

> I performed the profiling, and it actually seems that creating the groupby:
>
> g = groupby(df, setdiff(_names(df), _names(df)[[colkey, value]]), sort=true)
>
> is faster than:
>
> keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
> refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
>
> if they have to be run.

Why not g = groupby(df, rowkeys, sort=true) as in the code? Though in this case it should be equivalent?

Also, it's not fair to include keycol, since it needs to be created in both cases, right?

> This means that unstack without groupby would be faster if the colkey and rowkey columns were already categorical.

But is that indeed the case in benchmarks? In my quick tests it wasn't, which is really weird.

Anyway, if groupby is faster than creating a CategoricalArray, that means that hashing the column first and only then performing dictionary lookups based on those hashes is faster than hashing and doing the lookups on the fly at the same time. Indeed, that's the main difference I can think of between the two approaches.

@bkamins
Member Author

bkamins commented Dec 7, 2017

I was testing on the implementation before my changes (that is the reason for the code differences). I used @profile - those are the lines that came up (and actually the single CategoricalArray creation was slower).

However, it is indeed weird - I will investigate it further. What bothers me is:

using BenchmarkTools

df = DataFrame(rand(1000,10))
df[:id] = string.("a", 1:1000)
sdf = stack(df)
sdf = sdf[randperm(nrow(sdf)), :]

f1(s) = groupby(s, [:id], sort=true)
f2(s) = CategoricalArray{Union{eltype(s[:id]), Missing}}(s[:id])

@benchmark f1($sdf)
@benchmark f2($sdf)

gives:

julia> @benchmark f1($sdf)
BenchmarkTools.Trial:
  memory estimate:  601.81 KiB
  allocs estimate:  12377
  --------------
  minimum time:     2.283 ms (0.00% GC)
  median time:      2.422 ms (0.00% GC)
  mean time:        2.528 ms (3.56% GC)
  maximum time:     55.176 ms (94.73% GC)
  --------------
  samples:          1975
  evals/sample:     1

julia> @benchmark f2($sdf)
BenchmarkTools.Trial:
  memory estimate:  756.02 KiB
  allocs estimate:  15795
  --------------
  minimum time:     1.448 ms (0.00% GC)
  median time:      1.549 ms (0.00% GC)
  mean time:        1.693 ms (7.64% GC)
  maximum time:     55.446 ms (97.12% GC)
  --------------
  samples:          2940
  evals/sample:     1

which is in line with your intuition and against the results I report above.

@bkamins
Member Author

bkamins commented Dec 7, 2017

Actually, running @code_warntype on both implementations shows everything red (type-unstable). I will check if we can improve something here.

@bkamins
Member Author

bkamins commented Dec 7, 2017

OK. I have it. We will leave separate implementations for single and multiple columns.
The problem is solved in the standard way, with a function barrier:

function unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int)
    refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
    keycol = CategoricalArray{Union{eltype(df[colkey]), Missing}}(df[colkey])
    _unstack(df, rowkey, colkey, value, refkeycol, keycol)
end

(this is a quick writeup - I will push a clean solution when I have it, but roughly it gives a 2x speedup on the test I report above)
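
For readers unfamiliar with the pattern, a generic, runnable illustration of a function barrier (not DataFrames code): the outer function builds a value whose concrete type the compiler cannot predict, and the inner function is specialized once it receives it.

```julia
# Generic function-barrier illustration.
kernel(v::AbstractVector) = sum(x -> x + 1, v)   # compiled per concrete vector type

function outer(use_float::Bool)
    v = use_float ? rand(10) : rand(1:10, 10)    # type-unstable construction
    return kernel(v)                             # barrier: the hot work runs on a known type
end

outer(true), outer(false)
```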

@@ -150,16 +150,20 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol=
Unstacks a DataFrame; convert from a long to wide format

```julia
unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
Contributor

Union{Symbol, Integer} -> ColumnIndex. Currently, ColumnIndex is actually Union{Symbol, Real}, but I've had a PR on my to-do list for a while to change it to Union{Symbol, Integer}. Let's stay consistent, and hopefully we can further restrict the types of all column-accepting functions at once by just changing ColumnIndex to Union{Symbol, Integer}.

Member

The problem is that it's not exported currently, and even if it were I'm not sure it's a good idea to use it in docs, since it's less explicit than the Union. Ideally we could interpolate it using $ColumnIndex.

Member Author

I have added Union{Symbol, Integer} everywhere in my local repo. I will probably push this during the weekend when I finish tuning the performance.

Contributor
@cjprybol cjprybol Dec 8, 2017

If we use 4-space indents for code blocks rather than backticks, the variable interpolation works - I just tried it out.

@bkamins
Member Author

bkamins commented Dec 8, 2017

In this commit I have concentrated on functionality (some testing would be appreciated):

  1. the speedup of execution should be around 2x;
  2. missing is handled correctly and consistently in the :id column - in both the groupby and the CategoricalArray implementations;
  3. a situation with missing in the :variable column is handled correctly (it now prints a warning - I can change it to an error if you feel that would be a better approach).

@bkamins bkamins force-pushed the improve_unstack branch 4 times, most recently from db59b39 to 5b9bad6 on December 8, 2017 23:49
@coveralls

Coverage Status

Coverage increased (+1.5%) to 74.71% when pulling 5b9bad6 on bkamins:improve_unstack into 83323bb on JuliaData:master.

@bkamins
Member Author

bkamins commented Dec 9, 2017

Here are the benchmarks (I have cut the output to leave what is important - timing is taken after unstack has been compiled for the given arguments).
The conclusion is that this implementation is about 2x faster, and that 0.7 is faster than 0.6 (keep in mind that the new implementation does more work, as it has to handle missing values in rowkeys and colkey).

Data setup code:

x = DataFrame(rand(10000, 1000))
x[:id] = 1:10000
x[:id2] = string.("a", x[:id])
x[:s] = [randstring() for i in 1:10000]
y = melt(x, [:id, :id2])

Summary of benchmark results:

Julia 0.6

OLD
julia> @time unstack(y);
 10.974269 seconds (121.05 M allocations: 2.254 GiB, 22.68% gc time)
julia> @time unstack(y, :variable, :value);
 10.412161 seconds (73.60 M allocations: 1.859 GiB, 30.93% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
 10.393829 seconds (111.06 M allocations: 1.919 GiB, 22.76% gc time)
julia> @time unstack(y, :variable, :value);
 11.879770 seconds (83.64 M allocations: 2.010 GiB, 31.39% gc time)

NEW 
julia> @time unstack(y);
  4.848957 seconds (34.63 M allocations: 1.006 GiB, 42.45% gc time)
julia> @time unstack(y, :variable, :value);
  8.192743 seconds (24.48 M allocations: 1.128 GiB, 56.92% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
  4.182978 seconds (24.64 M allocations: 687.415 MiB, 46.12% gc time)
julia> @time unstack(y, :variable, :value);
  9.793470 seconds (34.52 M allocations: 1.280 GiB, 55.00% gc time)

Julia 0.7

OLD
julia> @time unstack(y);
 11.376244 seconds (121.07 M allocations: 2.254 GiB, 11.02% gc time)
julia> @time unstack(y, :variable, :value);
  9.536922 seconds (73.59 M allocations: 1.859 GiB, 22.01% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
 10.165981 seconds (111.06 M allocations: 1.919 GiB, 11.03% gc time)
julia> @time unstack(y, :variable, :value);
 10.698634 seconds (83.64 M allocations: 2.010 GiB, 22.48% gc time)

NEW
julia> @time unstack(y);
  4.928271 seconds (34.56 M allocations: 1.005 GiB, 41.12% gc time)
julia>  @time unstack(y, :variable, :value);
  5.809069 seconds (24.48 M allocations: 1.128 GiB, 42.01% gc time)

 categorical!(y, [:id, :id2])
julia> @time unstack(y);
  3.382329 seconds (24.55 M allocations: 686.091 MiB, 28.77% gc time)
julia> @time unstack(y, :variable, :value);
  6.865709 seconds (34.53 M allocations: 1.280 GiB, 40.00% gc time)

@JuliaData JuliaData deleted a comment from coveralls Dec 9, 2017 (×6)
Member

@nalimilan nalimilan left a comment

Great! Though, why is unstack(y) slightly faster than unstack(y, :variable, :value)?

You should be able to make it dramatically more efficient by passing a tuple of columns to _unstack rather than a DataFrame. See what I did at JuliaData/DataTables.jl#79.
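
A hedged sketch of the tuple-of-columns idea (illustrative names, not the PR's code): pulling a column out of a DataFrame is type-unstable, while a tuple preserves each column's concrete type, so a kernel taking the tuple is fully specialized.

```julia
using DataFrames

process_cols(cols::Tuple) = map(sum, cols)   # specialized for the concrete column types

df = DataFrame(a = 1:3, b = [1.0, 2.0, 3.0])
process_cols((df[:a], df[:b]))               # 0.6-era column access, as elsewhere in this thread
```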

@@ -171,6 +175,10 @@ unstack(df::AbstractDataFrame)

* `::DataFrame` : the wide-format DataFrame

If `colkey` contains `missing` values then they will be skipped and warning will be printed.
Member

"a warning". Same below.

Member Author

fixed

colkey::Int, value::Int, keycol, valuecol, df2, refkeycol)
Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
hadmissing = false # have we encounered missing in refkeycol
Member

"encountered".

Member Author

fixed

mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled?
nowarning = true # do we print duplicate entries warning
nowarning2 = true # do we print missing in keycol
keycol_pool = Vector{Int}(CategoricalArrays.order(keycol.pool))
Member

Better call this keycol_order. Same below.

Member Author

fixed

hadmissing = false # have we encounered missing in refkeycol
mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled?
nowarning = true # do we print duplicate entries warning
nowarning2 = true # do we print missing in keycol
Member

Better call this nowarning_missing, or maybe warned_missing (and invert it). Same for nowarning.

Member Author

fixed

if nowarning && !ismissing(payload[j][i])
warn("Duplicate entries in unstack.")
nowarning = false
keycol_refs = keycol.refs[k]
Member

Plural is weird since it's a single value. Same for refkeycol_ref. Could almost be kref and rkref.

Member Author

fixed (actually I had similar names originally, but thought them to be less informative)

end
j = keycol_pool[keycol_refs]
refkeycol_refs = refkeycol.refs[k]
if refkeycol_refs == 0 # we have found missing in rowkeys
Member

Use <= 0, I have possible plans to use negative values to support multiple types of missing values.

Member Author

fixed both for kref and refkref.

end
if nowarning && mask_filled[j, i]
warn("Duplicate entries in unstack at row $k.")
Member

It would also be nice to print the value and the name of the column.

Member Author

I skip this. It gets messy, as I would have to say something like

duplicate value for index "something which can be multiple columns" and variable "something"

I feel it is easy enough to check the contents of the reported column anyway.

Member

Sure, you can check it manually, but I find it really makes the difference between software which works and software which is pleasant and efficient to work with. For example, we print the index and the dimensions of the indexed array with BoundsError, and it makes a big difference compared to R, where you need to start the debugger to find it out.

Don't do it if that's too much work, but if it's possible it would be really nice, even if the output isn't completely pretty (e.g. it's fine to print a one-element tuple when there's only one column, though join's last argument is very useful to format this properly).

Member Author

done!

unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex, value::ColumnIndex) =
unstack(df, rowkeys, index(df)[colkey], index(df)[value])

unstack(df::AbstractDataFrame, rowkeys::AbstractVector{T}, colkey::Int, value::Int) where T<:Real =
Member

Use AbstractVector{<:Real}.

Member Author

OK. Out of curiosity - why is this the preferred style?

Member

I think because it's shorter, you can read it directly instead of looking for T at the end of the line, and it makes it clear that T isn't used in the body.

Member Author

👍

Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
payload = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol)))
nowarning = true
df2 = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol)))
Member

It would be more natural to build the returned data frame inside _unstack.

Member Author

fixed (and it improves performance, as reported below)

@@ -377,6 +398,36 @@ module TestDataFrame
a = unstack(df, :id, :variable, :value)
b = unstack(df, :variable, :value)
@test a ≅ b ≅ DataFrame(id = [1, 2], a = [3, missing], b = [missing, 4])

df = DataFrame(variable=["x","x"], value=[missing,missing], id=[1,1])
Member

Use spaces around == and after commas for consistency.

@bkamins
Member Author

bkamins commented Dec 9, 2017

> Great! Though, why is unstack(y) slightly faster than unstack(y, :variable, :value)?

Because unstack(y) uses the single column :id (and does not use groupby), while unstack(y, :variable, :value) works on [:id, :id2] (and uses groupby).

> You should be able to make it dramatically more efficient by passing a tuple of columns to _unstack rather than a DataFrame. See what I did at JuliaData/DataTables.jl#79.

This is essentially what I do. I pass df only to get the row names. But I can create df2 later (and work on an array when assigning values) - I will check if it helps.

@bkamins
Member Author

bkamins commented Dec 9, 2017

@nalimilan good advice - delaying the creation of df2 gives another 2x speedup.

@nalimilan
Member

> This is essentially what I do. I pass df only to get the row names. But I can create df2 later (and work on an array when assigning values) - I will check if it helps.

Oh right, it's for df2 that type instability still hurts. Anyway, it makes sense to create that data frame inside _unstack.

@bkamins bkamins force-pushed the improve_unstack branch 2 times, most recently from 2b9d2ee to 81be94b on December 9, 2017 23:00
refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])
# make sure we report only levels actually present in rowkey column
# this is consistent with how the other unstack method based on groupby works
refkeycol = copy(CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey]))
Member

I don't think you need copy; IIRC I ensured that CategoricalArray constructors always make a copy. BTW, that means that if you don't need a copy for refkeycol, better use convert to avoid it if the input is already a CategoricalArray.

Member Author

I need a copy, as I call droplevels! in the next line. Done. I added a test to check that we actually make a copy.

Member

But do you need a copy for keycol too?

Member Author

We do (and there was another related bug, which I have fixed, when keycol has nonexistent levels).

However, I have a question here - why do we want keycol to allow Missing even if it originally does not contain missings? (This is the original design and I respect it, but I do not understand the reason for such an approach.)

Member

The original design was probably based on DataArray everywhere, so it makes sense to change it to return arrays which do not allow for missing where possible.

Member Author

OK. I removed allowing missing when it is not needed (I hope it is ok - you know Missings best 😄):

  • I create the categorical columns using categorical;
  • I do not force col in the single-index case to allow missings;
  • I do not force df1 to allow missings.

valuecol = df[value]
Nrow = length(refkeycol.pool)
Ncol = length(keycol.pool)
df2m = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
Member

Why create this here? Don't you have all the needed information inside _unstack?

Member Author

Because it is ~5% faster this way. I guess Julia does not like functions with large bodies, and _unstack is already heavy. I do not know why in detail - those are the benchmark results.

Member

5% doesn't sound worth it to me. And this kind of difference can change with compiler improvements. Better prioritize the clearer design.

Member

BTW, better not call it df since it's not a data frame.

Member Author

Moved, and changed the name to unstacked_val.

nowarning = true
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
warned_missing_1 = false # do we print duplicate entries warning
Member

My suggestion was to call this warned_dup or something like that, to get rid of 1 and 2. :-)

And the comment would be "have we already printed..."

Member Author

fixed

kref = keycol.refs[k]
if kref <= 0 # we have found missing in colkey
if !warned_missing_2
warn("Missing value in '$(_names(df)[colkey])' variable at row $k. Skipping.")
Member

I've just checked, the convention used in the DataFrame constructor is to print column names without quotes. Maybe clearer with "variable X" (rather than "X variable").

Member Author

fixed

Member Author

although now we print "Missing value in variable variable at row"

hadmissing = true
# we use the fact that missing is greater than anything
for i in eachindex(df2m)
push!(df2m[i], missing)
Member

How common do we expect this to happen? Unless that reflects malformed data, we could allocate Nrow+1 entries and resize to Nrow when creating df2m, so that this line does not have to make any copy.

Member Author

I guess this should not happen normally.
This will take place only if we have missing in the index column (I have even considered throwing an error in this case, but in the end I think we can accept missing as an index).
Therefore I wanted to keep the typical case clean and use push! only as an exception.
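
A tiny sketch of the preallocate-and-shrink suggestion (illustrative only; it relies on resize! typically keeping the vector's buffer so a later push! does not reallocate):

```julia
Nrow = 3
col = Vector{Union{Float64, Missing}}(undef, Nrow + 1)  # one spare slot (drop `undef` on 0.6)
resize!(col, Nrow)         # shrink to Nrow; the spare capacity usually remains
# only if a missing row key is encountered later:
push!(col, missing)        # typically no copy, the slot is already allocated
```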

else
i = refkeycol_order[refkref]
end
if (!warned_missing_1) && mask_filled[j, i]
Member

No need for parentheses.

Member Author

I would leave them - I have learned from my students 😄 that they are often confused about operator precedence.
But if you feel that it is cleaner to remove them I will do it.

Member

I don't think we use parentheses with ! in any of the Julia codebases I know. I agree precedence can be confusing at times, but in this case it's pretty clear, so please remove it.

Member Author

fixed

end
levs = levels(refkeycol)
# we have to handle a case with missings in refkeycol as levs will skip missing
col = similar_missing(df[rowkey], length(levs))
Member

Use length(levs) + hadmissing for the length; then you can do hadmissing && (col[end] = missing) below. That way you avoid a copy.

Member Author

fixed

end
df2m[j][i] = valuecol[k]
mask_filled[j, i] = true
Member

[j, i] is less natural than [i, j]; is this to optimize memory access patterns? Do we expect the data to be sorted (and therefore filled) by rows or by columns? Given the layout of data frames, it would be optimal to fill one column at a time. I admit I don't really understand whether that's the case currently or not.

We should also ensure that's consistent with what stack outputs.

Member Author

stack sorts by variable, not by index. So if we unstack something that was stacked, j changes faster than i, so I guess this layout is better. Yes?

Member

If that's the case (I haven't checked), then it's not optimal: we need i to change the fastest, since accessing one column at a time is more efficient for data frames. Maybe we should change stack? Though we should check what R/dplyr/Pandas do first.

Member Author

Two points (contradictory though):

  • I was wrong - we fill the data column-by-column (so i changes faster).
  • However, both the result of stack and a random permutation of the result of stack are minimally faster with the current order (i.e. mask_filled = falses(Ncol, Nrow)). I do not understand why.

Anyway, this does not matter much - it is again a case of a minimal performance difference. So I would leave it as is.

Member

OK, thanks for checking. That's really weird, but well... For maximal performance we could also use an Array{Bool}, but that's probably not worth the increased memory use.
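
As background to the i/j ordering discussed above, a small generic illustration (not the PR's code): Julia arrays are column-major, so loops in which the row index i changes fastest write contiguous memory.

```julia
A = zeros(1000, 1000)

function fill_by_column!(A)           # i changes fastest: contiguous writes
    for j in 1:size(A, 2), i in 1:size(A, 1)
        A[i, j] = i + j
    end
end

function fill_by_row!(A)              # j changes fastest: strided writes
    for i in 1:size(A, 1), j in 1:size(A, 2)
        A[i, j] = i + j
    end
end
```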

Member

BTW, have you tried adding @inbounds before indexing expressions where we are sure the indices are correct?

Member Author

The code is complex and it is easy to make a mistake in the reasoning. I would leave @inbounds out until the code is well tested by users. I will add a TODO comment.

Member

No need for a TODO: either you know it's OK or you don't.

At least it should be fine when you index refkeycol and keycol.

Member Author

I know 😄, but only conditional on what you have written, so I would not risk it. I have checked that the gains from @inbounds are minimal (<1%).
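
For completeness, a generic sketch of what @inbounds would look like where it is provably safe (illustrative only; the PR deliberately leaves it out):

```julia
# Skip bounds checks only where every index is known to be valid.
function lookup_rows(refs::Vector{UInt32}, order::Vector{Int})
    out = Vector{Int}(undef, length(refs))   # drop `undef` on Julia 0.6
    @inbounds for k in eachindex(refs)
        out[k] = order[refs[k]]   # safe only if all refs are valid indices into order
    end
    return out
end
```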

end

@testset "missing values in colkey" begin
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2],
Member

Spaces around =.

Member Author

I thought the style without spaces was recommended by the Julia manual. Also, you do not have spaces in this very file (see e.g. its beginning - most of the calls to DataFrame do not use spaces here).
I can fix it - but let us decide: do we want spaces or no spaces everywhere?
(Adding spaces would change many more lines.)

Member

OK, it's already inconsistent, so let's leave it that way.

@@ -296,12 +296,14 @@ module TestDataFrame

#Check the output of unstack
df = DataFrame(Fish = CategoricalArray{Union{String, Missing}}(["Bob", "Bob", "Batman", "Batman"]),
Key = Union{String, Missing}["Mass", "Color", "Mass", "Color"],
Key = CategoricalArray{Union{String, Missing}}["Mass", "Color", "Mass", "Color"],
Member

Do we still have similar tests with non-categorical columns?

Member Author

added

@bkamins
Member Author

bkamins commented Dec 10, 2017

@nalimilan This is probably a bug in CategoricalArrays which causes me problems with unstack 😄:

julia> x = categorical([1,2,3])
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> levels!(x, [4,3,2,1])
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> y = categorical(x)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> droplevels!(y)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

julia> x
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 3
 2
 1

The reason is that categorical creates a new pool, but the refs vector is shared between x and y.

@bkamins
Member Author

bkamins commented Dec 10, 2017

@nalimilan Temporarily, I have wrapped categorical in deepcopy to make the tests pass, but this should be removed when categorical is fixed or some other decision in JuliaData/CategoricalArrays.jl#107 is made.
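
A sketch of that temporary workaround, continuing the x from the example above (illustrative; the actual call site is inside unstack in the PR):

```julia
# Hypothetical workaround: deepcopy gives y its own refs buffer, so
# droplevels!(y) can no longer mutate x through shared storage.
y = deepcopy(categorical(x))
droplevels!(y)
levels(x)   # x keeps its custom level order, unaffected by y
```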

unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer},
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer})
unstack(df::AbstractDataFrame, rowkeys::Union{AbstractVector{<:Union{Symbol, Integer}}},
Member

One Union too many.

Member Author

fixed

unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
warned_dup = false # hawe we already printed duplicate entries warning?
Member

"have"

Member Author

fixed

nowarning = true
unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol]
hadmissing = false # have we encountered missing in refkeycol
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled?
Member

Really, that col, row order bugs me, and will probably bug many people. Unless the performance difference is really significant, better use the more natural order.

Member Author

ok, fixed (the difference is minimal)

@@ -347,7 +396,7 @@ module TestDataFrame
@test udf == DataFrame(Any[Union{Int, Missing}[1, 2], Union{Int, Missing}[1, 5],
Union{Int, Missing}[2, 6], Union{Int, Missing}[3, 7],
Union{Int, Missing}[4, 8]], [:id, :a, :b, :c, :d])
@test all(isa.(udf.columns, Vector{Union{Int, Missing}}))
@test all(isa.(udf.columns[2:end], Vector{Union{Int, Missing}}))
Member

Also test the type of the first column. Same below.

Member Author

fixed


@testset "missing values in colkey" begin
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2],
variable=["a", "b", missing, "a", "b", "missing", "a", "b", "missing"],
Member

Why "missing"? It doesn't seem to be tested specifically here.

Member Author

It is intentional. What I test:

  • if I pass missing, a warning should be printed;
  • if I pass "missing", a column should be created with that name.

I added lines to test this.
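
A hedged sketch of what those added lines could look like (hypothetical test data and assertions; the real tests are in the PR diff):

```julia
df = DataFrame(id = [1, 1, 2, 2],
               variable = ["a", missing, "a", "missing"],
               value = [1, 2, 3, 4])
wide = unstack(df, :id, :variable, :value)  # expected to print a warning for the missing colkey
@test :a in names(wide)
@test :missing in names(wide)               # the string "missing" becomes an ordinary column
```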

@bkamins bkamins force-pushed the improve_unstack branch 3 times, most recently from fd496dc to f699fb8 on December 11, 2017 23:21
@bkamins
Member Author

bkamins commented Dec 13, 2017

build errors seem unrelated

Member

@nalimilan nalimilan left a comment

So in the end @inbounds wasn't worth it?

@bkamins
Member Author

bkamins commented Dec 14, 2017

@inbounds - minimal benefit and the code is complex.

@nalimilan
Member

I've added a dependency on CategoricalArrays 0.3.2 since tests wouldn't pass with older versions.

@nalimilan
Member

Can you suggest a commit message, or squash and add a message to the commit?

1) better performance, 2) rowkey(s) are not converted to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
@bkamins
Member Author

bkamins commented Dec 15, 2017

I have squashed and rebased. It ended up quite big - many things required fixing.

@nalimilan nalimilan merged commit 740978e into JuliaData:master Dec 15, 2017
@nalimilan
Member

Thanks!

nalimilan pushed a commit that referenced this pull request Dec 23, 2017
1) better performance, 2) rowkey(s) are not converted to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
@bkamins bkamins deleted the improve_unstack branch December 15, 2020 19:46