New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor unstack #1309
Refactor unstack #1309
Conversation
@nalimilan @cjprybol
There is a duplicate in The second thing is that The solution would be to use a temporary data structure that would hold binary indicators if a position is filled (so we have no duplicate But I am not sure if you would feel it is worth the effort as it will introduce some memory and computational overhead. |
tests fail because the fix of |
Why not, I guess the overhead could be kept relatively low. With compiler improvements we should be able to convert an It would make sense to check what other implementations do, though. |
src/abstractdataframe/reshape.jl
Outdated
@@ -159,7 +159,7 @@ unstack(df::AbstractDataFrame) | |||
|
|||
* `df` : the AbstractDataFrame to be unstacked | |||
|
|||
* `rowkey` : the column with a unique key for each row, if not given, | |||
* `rowkeys` : the column or columns with a unique key for each row, if not given, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just "column(s)".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
@@ -150,7 +150,7 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol= | |||
Unstacks a DataFrame; convert from a long to wide format | |||
|
|||
```julia | |||
unstack(df::AbstractDataFrame, rowkey, colkey, value) | |||
unstack(df::AbstractDataFrame, rowkeys, colkey, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be useful to mention accepted types too. I guess you could have one line with ::Union{Symbol, Integer}
, and another one with ::AbstractVector{<:Union{Symbol, Integer}}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (although a bit messy)
src/abstractdataframe/reshape.jl
Outdated
insert!(payload, 1, copy!(col, levs), _names(df)[rowkey]) | ||
end | ||
unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int) = | ||
unstack(df, [rowkey], colkey, value) | ||
unstack(df::AbstractDataFrame, rowkey::ColumnIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a line break since you do the same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
``` | ||
Note that there are some differences between the widened results above. | ||
|
||
""" | ||
function unstack(df::AbstractDataFrame, rowkey::Int, colkey::Int, value::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why you find it natural that the groupby
approach is more efficient than the approach used here? AFAIK it should be quite faster, especially if the original column is a CategoricalArray
(since we don't have an optimized groupby
implementation for CategoricalArray
at the moment, and it's not trivial to do).
Since the code exists, I'm reluctant to remove it since we have sometimes found out that some trivial optimizations could be applied which made a dramatic difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the benchmark:
julia> x = DataFrame(rand(1000, 10));
julia> x[:id] = 1:1000;
julia> y = stack(x);
julia> z = copy(y);
julia> categorical!(z, :id);
julia> eltypes(y)
3-element Array{Type,1}:
Symbol
Float64
Int64
julia> eltypes(z)
3-element Array{Type,1}:
Symbol
Float64
CategoricalArrays.CategoricalValue{Int64,UInt32}
julia> @benchmark unstack(y)
BenchmarkTools.Trial:
memory estimate: 2.01 MiB
allocs estimate: 78752
--------------
minimum time: 7.898 ms (0.00% GC)
median time: 8.700 ms (0.00% GC)
mean time: 8.998 ms (3.72% GC)
maximum time: 59.264 ms (84.15% GC)
--------------
samples: 556
evals/sample: 1
julia> @benchmark unstack(y, :variable, :value)
BenchmarkTools.Trial:
memory estimate: 1.85 MiB
allocs estimate: 57636
--------------
minimum time: 6.008 ms (0.00% GC)
median time: 6.769 ms (0.00% GC)
mean time: 6.965 ms (3.64% GC)
maximum time: 58.657 ms (87.87% GC)
--------------
samples: 717
evals/sample: 1
julia> @benchmark unstack(z)
BenchmarkTools.Trial:
memory estimate: 1.65 MiB
allocs estimate: 68695
--------------
minimum time: 10.947 ms (0.00% GC)
median time: 11.326 ms (0.00% GC)
mean time: 11.677 ms (2.53% GC)
maximum time: 64.046 ms (81.62% GC)
--------------
samples: 428
evals/sample: 1
julia> @benchmark unstack(z, :variable, :value)
BenchmarkTools.Trial:
memory estimate: 1.92 MiB
allocs estimate: 56270
--------------
minimum time: 9.071 ms (0.00% GC)
median time: 9.890 ms (0.00% GC)
mean time: 10.179 ms (2.94% GC)
maximum time: 65.845 ms (84.51% GC)
--------------
samples: 491
evals/sample: 1
The difference is because of the lines:
i = Int(CategoricalArrays.order(refkeycol.pool)[refkeycol.refs[k]]) # more expensive
and
i = rowkey[k] # cheaper
Do you see any area for optimization? If single-column version is not significantly faster I would remove it as now:
- we have a duplicate implementation of the same feature (which increases package complexity and maintenance cost);
- the single-column version has a bug when unstacking on id-column that has missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the second line is cheaper, but it only comes after a call to groupby
which should take much more time. So it looks like something is going wrong with the other method. Have you tried profiling it?
I agree with you that removing duplicated methods is generally a good idea. But I don't want to do that until we are certain to understand the situation, and here it looks like something doesn't work as I would expect it to. For example, we have recently realized that building a CategoricalArray
was much slower than it should have been. If we could fix a similar issue here, lots of places would benefit from the fix (even if in the end we remove the method).
A possibility here is that constructing a CategoricalArray
is relatively slow since IDs are all unique, and CategoricalArray
isn't efficient for this kind of data. Or maybe the compiler is not able to move the CategoricalArrays.order(keycol.pool)
call outside of the loop, i.e. we should manually store the order in a variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the profiling - I will post it below.
But actually observe having all unique IDs is a point of working stack
/unstack
(i.e. they are unique after unstack
but are not unique after stack
).
test/dataframe.jl
Outdated
@@ -319,6 +319,27 @@ module TestDataFrame | |||
df4[1,:Mass] = missing | |||
@test df2 ≅ df4 | |||
|
|||
# test empty set of grouping variables | |||
@testthrows ArgumentError unstack(df, Int[], :Key, :Value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@test_throws
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
I have fixed correct handling of missing. I have decided to retain that all columns accept missing - this would introduce performance overhead to control for it and I think it is not crucially problematic. |
First part - timing of new vs old code (new code uses
EDIT: under Julia 0.7 |
I performed the profiling and actually it seems that creation of
is faster than:
if they have to be run. This means that In consequence the question is if we want to make |
617cabf
to
9e418fa
Compare
Why not Also, it's not fair to include
But is that indeed the case in benchmarks? In my quick tests, it wasn't, which is really weird. Anyway, if |
I was testing on the implementation before my changes (that is the reason for code differences). I have used However, it is weird indeed - I will investigate more into it. What bothers me is:
gives:
which is in line with your intuition and against the results I report above. |
Actually running |
OK. I have it. We will leave separate implementations for single and multiple columns.
(it is a fast writeup - I will push a clean solution when I have it but roughly it is 2x speedup on the test I report above) |
@@ -150,16 +150,20 @@ melt(df::AbstractDataFrame; variable_name::Symbol=:variable, value_name::Symbol= | |||
Unstacks a DataFrame; convert from a long to wide format | |||
|
|||
```julia | |||
unstack(df::AbstractDataFrame, rowkey, colkey, value) | |||
unstack(df::AbstractDataFrame, colkey, value) | |||
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Union{Symbol, Integer}
-> ColumnIndex
. Currently, ColumnIndex
is actually Union{Symbol, Real}
, but I've had a PR on my to-do list for a while to change it to Union{Symbol, Integer}
. Let's stay consistent and hopefully we can further restrict the types of all column-accepting functions at once by just changing ColumnIndex
to Union{Symbol, Integer}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that it's not exported currently, and even if it was I'm not sure it's a good idea to use it in docs since it's less explicit than Union
. Ideally we could interpolate it using $ColumnIndex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have Union{Symbol, Integer}
everywhere in my local repo added. I will probably push this during the weekend when I finish tuning the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use 4-space indent for code blocks rather than backticks the variable interpolation works, just tried it out
bf797e1
to
46bcf5b
Compare
In this commit I have concentrated on functionality (some testing appreciated):
|
db59b39
to
5b9bad6
Compare
Here are the benchmarks (I have cut the output to leave what is important - timing is given after Data setup code:
Summary of benchmark results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Though, why is unstack(y)
slightly faster than unstack(y, :variable, :value)
?
You should be able to make it even dramatically more efficient by passing a tuple of columns to _unstack
rather than a DataFrame
. See what I did at JuliaData/DataTables.jl#79.
src/abstractdataframe/reshape.jl
Outdated
@@ -171,6 +175,10 @@ unstack(df::AbstractDataFrame) | |||
|
|||
* `::DataFrame` : the wide-format DataFrame | |||
|
|||
If `colkey` contains `missing` values then they will be skipped and warning will be printed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"a warning". Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
colkey::Int, value::Int, keycol, valuecol, df2, refkeycol) | ||
Nrow = length(refkeycol.pool) | ||
Ncol = length(keycol.pool) | ||
hadmissing = false # have we encounered missing in refkeycol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"encountered".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled? | ||
nowarning = true # do we print duplicate entries warning | ||
nowarning2 = true # do we print missing in keycol | ||
keycol_pool = Vector{Int}(CategoricalArrays.order(keycol.pool)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better call this keycol_order
. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
hadmissing = false # have we encounered missing in refkeycol | ||
mask_filled = falses(Ncol, Nrow) # has a given [col,row] entry been filled? | ||
nowarning = true # do we print duplicate entries warning | ||
nowarning2 = true # do we print missing in keycol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better call this nowarning_missing
, or maybe warned_missing
(and revert it). Same for nowarning
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
if nowarning && !ismissing(payload[j][i]) | ||
warn("Duplicate entries in unstack.") | ||
nowarning = false | ||
keycol_refs = keycol.refs[k] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plural is weird since it's a single value. Same for refkeycol_ref
. Could almost be kref
and rkref
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (actually I had similar names originally, but thought them to be less informative)
src/abstractdataframe/reshape.jl
Outdated
end | ||
j = keycol_pool[keycol_refs] | ||
refkeycol_refs = refkeycol.refs[k] | ||
if refkeycol_refs == 0 # we have found missing in rowkeys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use <= 0
, I have possible plans to use negative values to support multiple types of missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed both for kref
and refkref
.
src/abstractdataframe/reshape.jl
Outdated
end | ||
if nowarning && mask_filled[j, i] | ||
warn("Duplicate entries in unstack at row $k.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would also be nice to print the value and the name of the column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skip this. It gets messy as I would have to say something like
duplicate value for index "something which can be multiple columns" and variable "something"
I feel it is easy enough to to check the contents of the reported column anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, you can check it manually, but I find it really makes the difference between software which works and software which is pleasant and efficient to work with. For example, we print the index and the dimensions of the indexed array with BoundsError
, and it makes a big difference from R where you need to start the debugger to find it out.
Don't do it if that's too much work, but if that's possible it would be really nice, even if the output isn't completely pretty (e.g. it's fine to print a one-element tuple when there's only one column, though join
's last
argument is very useful to format this properly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
src/abstractdataframe/reshape.jl
Outdated
unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex, value::ColumnIndex) = | ||
unstack(df, rowkeys, index(df)[colkey], index(df)[value]) | ||
|
||
unstack(df::AbstractDataFrame, rowkeys::AbstractVector{T}, colkey::Int, value::Int) where T<:Real = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AbstractVector{<:Real}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Out of curiosity - why it is a preferred style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because it's shorter, you can read it directly instead of looking for T
at the end of the line, and it makes it clear T
isn't used in the body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/abstractdataframe/reshape.jl
Outdated
Nrow = length(refkeycol.pool) | ||
Ncol = length(keycol.pool) | ||
payload = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol))) | ||
nowarning = true | ||
df2 = DataFrame(Any[similar_missing(valuecol, Nrow) for i in 1:Ncol], map(Symbol, levels(keycol))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more natural to build the returned data frame inside _unstack
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (and improving performance as reported below)
test/dataframe.jl
Outdated
@@ -377,6 +398,36 @@ module TestDataFrame | |||
a = unstack(df, :id, :variable, :value) | |||
b = unstack(df, :variable, :value) | |||
@test a ≅ b ≅ DataFrame(id = [1, 2], a = [3, missing], b = [missing, 4]) | |||
|
|||
df = DataFrame(variable=["x","x"], value=[missing,missing], id=[1,1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use spaces around ==
and after commas for consistency.
Because
This is essentially what I do. I pass |
@nalimilan good advice - delaying creation of |
Oh, right, that's for |
2b9d2ee
to
81be94b
Compare
src/abstractdataframe/reshape.jl
Outdated
refkeycol = CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey]) | ||
# make sure we report only levels actually present in rowkey column | ||
# this is consistent with how the other unstack method based on groupby works | ||
refkeycol = copy(CategoricalArray{Union{eltype(df[rowkey]), Missing}}(df[rowkey])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need copy
, IIRC I ensured that CategoricalArray
constructors always make a copy. BTW, that means that if you don't need a copy for refkeycol
, better use convert
to avoid it if the input is already a CategoricalArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a copy as I droplevels!
in the next line. Done. I add test to check that we actually make a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do you need a copy for keycol
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do (and there was another bug related to this that I have fixed if keycol
has nonexistent levels).
However, I have a question here - why we want keycol
to allow Missing
even if originally it does not have missing
s? (this is the original design and I respect it, but I do not understand the reason for such an approach)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original design was probably based on DataArray
everywhere, so it makes sense to change it to return arrays which do not allow for missing where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I removed allowing missing if not needed (I hope it is ok - you know Missings
best 😄):
- I create categorical using
categorical
; - I do not force
col
in single index case to allow missings; - I do not force
df1
to allow missings;
src/abstractdataframe/reshape.jl
Outdated
valuecol = df[value] | ||
Nrow = length(refkeycol.pool) | ||
Ncol = length(keycol.pool) | ||
df2m = [similar_missing(valuecol, Nrow) for i in 1:Ncol] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why create this here? Don't you have all the needed information inside _unstack
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bacause it is ~5% faster this way. I guess Julia does not like functions with large bodies and _unstack
is already heavy. I do not know why in detail - those are benchmark results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5% doesn't sound worth it to me. And this kind of difference can change with compiler improvements. Better prioritize the clearer design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, better not call it df
since it's not a data frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved and changed name to unstacked_val
src/abstractdataframe/reshape.jl
Outdated
nowarning = true | ||
hadmissing = false # have we encountered missing in refkeycol | ||
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled? | ||
warned_missing_1 = false # do we print duplicate entries warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion was to call this warned_dup
or something like that, to get rid of 1 and 2. :-)
And the comment would be "have we already printed..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
kref = keycol.refs[k] | ||
if kref <= 0 # we have found missing in colkey | ||
if !warned_missing_2 | ||
warn("Missing value in '$(_names(df)[colkey])' variable at row $k. Skipping.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just checked, the convention used in the DataFrame
constructor is to print column names without quotes. Maybe clearer with "variable X" (rather than "X variable").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although now we print "Missing value in variable variable at row"
src/abstractdataframe/reshape.jl
Outdated
hadmissing = true | ||
# we use the fact that missing is greater than anything | ||
for i in eachindex(df2m) | ||
push!(df2m[i], missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How common do we expect this to happen? Unless that reflects malformed data, we could allocate Nrow+1
entries and resize to Nrow
when creating df2m
, so that this line does not have to make any copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this should not happen normally.
This will take place only if we have missing
in index column (I have even considered throwing an error in this case but in the end I think we can accept missing
as index).
Therefore I wanted to have a typical-case clean and use push!
only as an exception.
src/abstractdataframe/reshape.jl
Outdated
else | ||
i = refkeycol_order[refkref] | ||
end | ||
if (!warned_missing_1) && mask_filled[j, i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for parentheses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave them - I have learned from my students 😄 that they are often confused about operator precedence.
But if you feel that it is cleaner to remove them I will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we use parentheses with !
in any of the Julia codebases I know. I agree precedence can be confusing at times, but in this case it's pretty clear, so please remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
end | ||
levs = levels(refkeycol) | ||
# we have to handle a case with missings in refkeycol as levs will skip missing | ||
col = similar_missing(df[rowkey], length(levs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use length(levs) + hadmissing
for the length, then you can do hadmissing && col[end] = missing
below. That way you avoid a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
end | ||
df2m[j][i] = valuecol[k] | ||
mask_filled[j, i] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[j, i]
is less natural than [i, j]
, is this to optimize memory access patterns? Do we expect the data to be sorted (and therefore filled) by rows or by columns? Given the layout of data frames, it would be optimal to fill one column at a time. I admit I don't really understand whether that's the case currently or not.
We should also ensure that's consistent with what stack
outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stack
sorts by variable not by index. So if we unstack
something that was stacked j
is changing faster than i
so I guess this layout is better. Yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case (haven't checked), then it's not optimal: we need i
to change the fastest, since accessing one column at a time is more efficient for data frames. Maybe we should change stack
? Though we should check what R/dplyr/Pandas do first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two points (contradictory though):
- I was wrong we fill the data column-by-column (so
i
is changing faster) - However, both - result of
stack
and then random permutation of result ofstack
are minimally faster when we have the current order (i.e.mask_filled = falses(Ncol, Nrow)
). I do not understand why.
Anyway this does not matter much - it is again the case of minimal difference in performance. So I would leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for checking. That's really weird, but well... For maximal performance, we could also use an Array{Bool}
, but probably not worth the increased memory use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, have you tried adding @inbounds
before indexing expressions where we are sure indices are correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code is complex and easy to make mistake in reasoning. I would leave out @inbounds
until it is well tested by users. I will add a TODO in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for a TODO: either you know it's OK or you don't.
At least it should be fine when you index refkeycol
and keycol
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know 😄, but conditional on what you have written, so I would not risk. I have checked that the gains form @inbounds
are minimal <1%.
test/dataframe.jl
Outdated
end | ||
|
||
@testset "missing values in colkey" begin | ||
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces around =
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the style without spaces is recommended by Julia Manual. Also you do not have spaces in this very file (see e.g. at its beginning - most of calls to DataFrame
do not use spaces here).
I can fix it - but let us decide - do we want spaces or no spaces everywhere?
(adding spaces will change much more lines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it's already inconsistent, so let's leave it that way.
497000e
to
7de450d
Compare
test/dataframe.jl
Outdated
@@ -296,12 +296,14 @@ module TestDataFrame | |||
|
|||
#Check the output of unstack | |||
df = DataFrame(Fish = CategoricalArray{Union{String, Missing}}(["Bob", "Bob", "Batman", "Batman"]), | |||
Key = Union{String, Missing}["Mass", "Color", "Mass", "Color"], | |||
Key = CategoricalArray{Union{String, Missing}}["Mass", "Color", "Mass", "Color"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still have similar tests with non-categorical columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
@nalimilan This is probably a bug in
The reason is that |
152183f
to
fbfe182
Compare
@nalimilan Temporarily I have wrapped |
src/abstractdataframe/reshape.jl
Outdated
unstack(df::AbstractDataFrame, colkey, value) | ||
unstack(df::AbstractDataFrame, rowkeys::Union{Symbol, Integer}, | ||
colkey::Union{Symbol, Integer}, value::Union{Symbol, Integer}) | ||
unstack(df::AbstractDataFrame, rowkeys::Union{AbstractVector{<:Union{Symbol, Integer}}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One too much Union
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol] | ||
hadmissing = false # have we encountered missing in refkeycol | ||
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled? | ||
warned_dup = false # hawe we already printed duplicate entries warning? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"have"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/abstractdataframe/reshape.jl
Outdated
nowarning = true | ||
unstacked_val = [similar_missing(valuecol, Nrow) for i in 1:Ncol] | ||
hadmissing = false # have we encountered missing in refkeycol | ||
mask_filled = falses(Ncol, Nrow+1) # has a given [col,row] entry been filled? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really, that col, row
order bugs me, and will probably bug many people. Unless the performance difference is really significant, better use the more natural order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed (the difference is minimal)
test/dataframe.jl
Outdated
@@ -347,7 +396,7 @@ module TestDataFrame | |||
@test udf == DataFrame(Any[Union{Int, Missing}[1, 2], Union{Int, Missing}[1, 5], | |||
Union{Int, Missing}[2, 6], Union{Int, Missing}[3, 7], | |||
Union{Int, Missing}[4, 8]], [:id, :a, :b, :c, :d]) | |||
@test all(isa.(udf.columns, Vector{Union{Int, Missing}})) | |||
@test all(isa.(udf.columns[2:end], Vector{Union{Int, Missing}})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test the type of the first column. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
@testset "missing values in colkey" begin | ||
df = DataFrame(id=[1, 1, 1, missing, missing, missing, 2, 2, 2], | ||
variable=["a", "b", missing, "a", "b", "missing", "a", "b", "missing"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "missing"
? Doesn't seem to be tested specifically here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is intentional. What I test:
- if I pass
missing
a warning should be printed; - if I pass
"missing"
a column should be created with that name;
I added lines to test this.
fd496dc
to
f699fb8
Compare
build errors seem unrelated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the end @inbounds
wasn't worth it?
|
I've added a dependency on CategoricalArrays 0.3.2 since tests wouldn't pass with older versions. |
Can you suggest a commit message, or squash and add a message to the commit? |
…d to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
13f3886
to
a96a422
Compare
I have squashed and rebased. It ended up quite big - many things required fixing. |
Thanks! |
1) better performance, 2) rowkey(s) are not converted to allow Missing if not needed, 3) proper handling of missing values in rowkey(s) and colkey, 4) only existing levels in colkey and rowkey if they are CategoricalArrays are used in the output
Changes: