adjust outer join behavior (types and right outer join bug) #44

cjprybol · 2017-03-31T01:51:28Z

Separated from #30 as part of a PR refactor. Fixes a bug found in #34 (comment). Also fixes the issue that in the current implementation, resize! has the ability to introduce #undef values where Nullables should be placed.

cjprybol · 2017-03-31T04:05:46Z

src/abstractdatatable/join.jl

    NullableCategoricalArray(T, dims)

-similar_nullable(dt::AbstractDataTable, dims::Int) =


similar_nullable is only used by compose_joined_table and it does not use this call form, so this is not needed.

So say it in the commit message.

nalimilan · 2017-04-01T20:03:16Z

src/abstractdatatable/join.jl

    NullableCategoricalArray(T, dims)

-similar_nullable(dt::AbstractDataTable, dims::Int) =


So say it in the commit message.

nalimilan · 2017-04-01T20:04:02Z

src/abstractdatatable/join.jl

    NullableCategoricalArray(T, dims)

-similar_nullable(dt::AbstractDataTable, dims::Int) =
-    DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt)))
+similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =


R isn't needed.

nalimilan · 2017-04-01T20:04:36Z

src/abstractdatatable/join.jl

-similar_nullable(dt::AbstractDataTable, dims::Int) =
-    DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt)))
+similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =
+    NullableCategoricalArray(T, dims)


Use the NullableCategoricalArray{T}(dims) construct, this one is deprecated (or should be, if you didn't get a warning...).

nalimilan · 2017-04-01T20:12:05Z

src/abstractdatatable/join.jl

+    cols = Vector{Any}(ncleft + ncol(dtr_noon))
+    for (i, col) in enumerate(columns(joiner.dtl))
+        cols[i] = kind == :inner ? col[all_orig_left_ixs] :
+                                   copy!(similar_nullable(col, nrow), col[all_orig_left_ixs])


col[all_orig_left_ixs] creates a temporary copy. Here it is worth doing:

    newcol = similar_nullable(col, nrow)
    @inbounds for (j, k) in enumerate(all_orig_left_ixs)
        newcol[j] = col[k]
    end

Same below, handling right_perm inside the loop to avoid a second allocation (even when kind == :inner).

nalimilan · 2017-04-01T20:14:12Z

src/abstractdatatable/join.jl

+    @assert nrow == length(all_orig_right_ixs) + loil
+    ncleft = ncol(joiner.dtl)
+    cols = Vector{Any}(ncleft + ncol(dtr_noon))
+    for (i, col) in enumerate(columns(joiner.dtl))


Can you add a comment explaining why we create nullable arrays when kind != :inner?

nalimilan · 2017-04-01T20:14:41Z

src/abstractdatatable/join.jl

@@ -207,7 +217,8 @@ join(dt1::AbstractDataTable,
  - `:cross` : a full Cartesian product of the key combinations; every
    row of `dt1` is matched with every row of `dt2`

-Null values are filled in where needed to complete joins.
+For the three join operations that may introduce missing values, `:outer`, `:left`,


Parentheses rather than comma before the list of join types.

nalimilan · 2017-04-01T20:16:16Z

test/join.jl

@@ -111,10 +111,63 @@ module TestJoin
                                                           Mass = [1.5, 2.2, 1.1, 1.5])

    # Test that join works when mixing Array and NullableArray (#1151)
-    dt = DataTable([collect(1:10), collect(2:11)], [:x, :y])
+    dt = DataTable([NullableArray(1:10), NullableArray(2:11)], [:x, :y])


This goes against the comment above. Why change it?

nalimilan · 2017-04-01T20:20:33Z

Thanks. The commit message should explain what the fixed bug was. Also, I don't think this really "type-stabilizes" join; it just avoids returning #undef entries by making them null entries of a nullable array.

cjprybol · 2017-04-01T21:23:36Z

test/join.jl

@@ -112,9 +112,62 @@ module TestJoin

    # Test that join works when mixing Array and NullableArray (#1151)
    dt = DataTable([collect(1:10), collect(2:11)], [:x, :y])
-    dtnull = DataTable(x = 1:10, z = 3:12)
+    dtnull = DataTable(x = NullableArray(1:10), z = NullableArray(3:12))


It'll be useful to make this explicit now so the test will still work when columns are not NullableArrays by default.

No, this change isn't related to this PR so it should be made elsewhere (and anyway the comment should be updated if the code changes).

nalimilan · 2017-04-02T09:51:07Z

src/abstractdatatable/join.jl


    if length(rightonly_ixs.join) > 0
        # some left rows are nulls, so the values of the "on" columns
        # need to be taken from the right
        for (on_col_ix, on_col) in enumerate(joiner.on_cols)
            # fix the result of the rightjoin by taking the nonnull values from the right table
-            res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
+            # end-length(rightonly_ixs.orig)+1:end was rightonly_ixs.join. Try and FIXME
+            res[on_col][end-length(rightonly_ixs.orig)+1:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]


Do you need this change in this PR? Better make it separately.

This bug was found as part of this PR, I don't see any substantial benefit to making a separate PR for it, and it's tested by the full set of joins included in this PR. I've added more detail about the bug to the commit message. Let me know if there's any additional information you'd like added.

What's annoying is that you're leaving a FIXME, which seems to mean it's going to hold back the PR until we're sure it's the right fix. But if you say this is tested, then the fix is probably correct and no FIXME should be left?

Also, end-length(rightonly_ixs.orig)+1:end would be clearer with parentheses around end-length(rightonly_ixs.orig)+1 IMHO.

The FIXME refers to rightonly_ixs.join: I would like to change this line back to res[on_col][rightonly_ixs.join] = ... once we figure out why rightonly_ixs.join doesn't contain the expected values. I could remove the FIXME and open an issue, or remove it and just keep a personal note to come back to this. Not sure what the preferred way of tracking this would be.
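On the parenthesization point in this thread: arithmetic operators bind tighter than the range colon in Julia, so both spellings select the same elements and the parentheses only help the reader. A minimal check on a plain vector, where v and n are illustrative names rather than anything from the PR:

```julia
# Illustrative values only; `v` and `n` are not taken from the PR.
v = collect(10:10:100)
n = 3

# `:` has lower precedence than `+` and `-`, so both forms
# select the last n elements of v.
tail_plain  = v[end-n+1:end]
tail_parens = v[(end-n+1):end]
```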

cjprybol · 2017-04-04T04:43:50Z

FIXME removed and determining the offset has been cleaned up.

nalimilan

A few more comments, but looks mostly good.

nalimilan · 2017-05-02T21:03:36Z

src/abstractdatatable/join.jl

+    @assert nrow == length(all_orig_right_ixs) + loil
+    ncleft = ncol(joiner.dtl)
+    cols = Vector{Any}(ncleft + ncol(dtr_noon))
+    _similar_type = kind == :inner ? similar : similar_nullable


Just call this _similar, "type" is more confusing than anything here.

nalimilan · 2017-05-02T21:05:35Z

src/abstractdatatable/join.jl

@@ -210,7 +223,8 @@ join(dt1::AbstractDataTable,
  - `:cross` : a full Cartesian product of the key combinations; every
    row of `dt1` is matched with every row of `dt2`

-Null values are filled in where needed to complete joins.
+For the three join operations that may introduce missing values (`:outer`, `:left`,
+and `:right`), all columns of the returned datatable will be nullable.


"data table"

nalimilan · 2017-05-02T21:08:35Z

src/abstractdatatable/join.jl


    if length(rightonly_ixs.join) > 0
        # some left rows are nulls, so the values of the "on" columns
        # need to be taken from the right
        for (on_col_ix, on_col) in enumerate(joiner.on_cols)
            # fix the result of the rightjoin by taking the nonnull values from the right table
-            res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
+            offset = nrow - length(rightonly_ixs.orig) + 1
+            res[on_col][offset:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]


Using a loop as above, you will be able to avoid making a copy for each index in rightonly_ixs.orig.
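The loop suggested here might look roughly like the following. This is only a sketch in current Julia syntax: res_col, right_col and right_only_orig are hypothetical stand-ins for res[on_col], the right table's "on" column and rightonly_ixs.orig, and missing stands in for the null entries of a NullableArray.

```julia
# Hypothetical stand-ins for res[on_col], the right "on" column,
# and rightonly_ixs.orig (assumptions, not the PR's actual data).
res_col = Union{Int,Missing}[1, 2, 3, missing, missing]
right_col = [10, 20, 30, 40]
right_only_orig = [2, 4]

# First destination row: right-only values occupy the tail of the result.
offset = length(res_col) - length(right_only_orig) + 1

# Element-wise copy; no temporary `right_col[right_only_orig]` is allocated.
for (j, k) in enumerate(right_only_orig)
    res_col[offset + j - 1] = right_col[k]
end
```

Indexing with a vector on the right-hand side would first materialize a new array and then copy it into place; the loop writes each value directly.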

nalimilan · 2017-05-02T21:11:32Z

test/join.jl

+        large = DataTable(Any[[0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 3.0, 4.0]], [:id, :fid])
+        N = Nullable()
+
+        @test join(small, large, kind=:cross) == DataTable(Any[repeat([1, 3, 5], inner=5),


Break the line after == to avoid going beyond 92 chars.

Also, it would be good to store the result of the join and check the column types to make sure we get NullableArray and not Array{Nullable}. Should there also be tests for categorical arrays?

cjprybol · 2017-05-03T06:40:41Z

Didn't see anything in the coverage drop that looks related.

nalimilan · 2017-05-03T09:47:12Z

src/abstractdatatable/join.jl

+    _similar = kind == :inner ? similar : similar_nullable
+    for (i, col) in enumerate(columns(joiner.dtl))
+        cols[i] = _similar(col, nrow)
+        @inbounds for (j, k) in enumerate(all_orig_left_ixs)


I've just realized this is likely to be quite slow due to the type-instability. You should be able to move this loop to a separate kernel function which will take col, cols[i] and all_orig_left_ixs, and which will be specialized on the column type. It could also take an optional offset, so that the same function could be used for the two similar loops below. I imagine this should make a noticeable speed difference on a simple benchmark.
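A minimal sketch of such a kernel function (a "function barrier"), written in current Julia syntax. The name copyto_rows! and its signature are hypothetical, and plain vectors stand in for the table columns:

```julia
# Hypothetical kernel; the name and signature are illustrative only.
# Julia compiles a specialized method for each concrete (dest, src) type,
# so the inner loop runs on concretely typed arrays even though the
# caller stores its columns in a Vector{Any}.
function copyto_rows!(dest::AbstractVector, src::AbstractVector,
                      ixs::AbstractVector{Int}, offset::Int=0)
    @inbounds for (j, k) in enumerate(ixs)
        dest[offset + j] = src[k]
    end
    return dest
end

# Type-unstable container of columns, as in a table with mixed column types.
columns = Any[[1, 3, 5], [1.0, 3.0, 5.0]]
left_ixs = [3, 1, 1, 2]

cols = Vector{Any}(undef, length(columns))
for (i, col) in enumerate(columns)
    cols[i] = similar(col, length(left_ixs))
    copyto_rows!(cols[i], col, left_ixs)  # one dynamic dispatch per column
end

# The optional offset lets the same kernel fill a later block of rows.
tail = zeros(Int, 5)
copyto_rows!(tail, [7, 8, 9], [3, 1], 3)
```

The dynamic dispatch cost is paid once per column at the call to the kernel, rather than once per element inside the loop.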

nalimilan · 2017-05-03T09:56:12Z

test/join.jl

+        DataTable([NullableArray(1:10), NullableArray(3:12), collect(2:11)], [:x, :z, :y])
+
+    @testset "all joins" begin
+        small = DataTable(Any[[1, 3, 5], [1.0, 3.0, 5.0]], [:id, :fid])


"small" and "large" really don't apply now. How about "dt1" and "dt2"?

nalimilan · 2017-05-03T09:57:01Z

test/join.jl

+              s([:id, :fid]) == DataTable(Any[[1, 3], [1, 3]], [:id, :fid])
+        @test typeof.(s(:id).columns) ==
+              typeof.(s(:fid).columns) ==
+              typeof.(s([:id, :fid]).columns) == [CategoricalVector{Int, UInt32},


As in the other PR, leave UInt32 out.

Hmm, or replace it with CategoricalArrays.DefaultRefType since you need it here.

Outer joins need to return nullable tables because they may introduce missing data. similar_nullable on DataTables has been removed (it was unused) and replaced with a similar_nullable method for NullableCategoricalArrays, which supports the new join behavior. The three outer joins share a function with inner joins (compose_joined_table); this function now checks whether the join type is :inner and, if so, returns the same column types as the parent tables rather than promoting to nullable columns. A bug was found in right outer join behavior: values unique to the right table were written to the wrong rows, overwriting data and leaving nulls where they shouldn't be. This bug, caused by incorrect values in rightonly_ixs.join, was fixed by filling the last n rows of the data table, where n = length(rightonly_ixs.join). Expected test results were checked for accuracy against pandas.
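The allocation choice described in this message can be sketched in current Julia, with Union{T, Missing} standing in for the NullableArray columns used at the time. similar_missing and alloc_for are hypothetical names for illustration, not functions from DataTables:

```julia
# Hypothetical stand-in for similar_nullable: allocate a column that can
# hold missing values (an assumption; the PR used NullableArray instead).
similar_missing(col::AbstractVector, n::Int) =
    similar(col, Union{eltype(col), Missing}, n)

# Inner joins cannot introduce missing rows, so they keep the parent
# column type; every other join kind gets a missing-capable column.
alloc_for(kind::Symbol) = kind == :inner ? similar : similar_missing

col = [1.5, 2.2, 1.1]
inner_col = alloc_for(:inner)(col, 4)
left_col  = alloc_for(:left)(col, 4)
```

Only the element types and lengths of the results are meaningful here; the contents are uninitialized until the join fills them.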

nalimilan · 2017-05-12T13:11:44Z

Thanks! Out of curiosity, have you checked whether the new code is measurably faster than the old one?

cjprybol · 2017-05-13T20:05:14Z

Thanks here too! I don't remember benchmarking any of these changes, but let's save runtime testing for after #62.

cjprybol changed the title from "type stabilize join, fix right outer join bug" to "adjust outer join behavior (types and right outer join bug)" on Apr 1, 2017

cjprybol mentioned this pull request Apr 3, 2017

Stop auto-promoting column-types #30

Closed

cjprybol mentioned this pull request Apr 16, 2017

Stack should use similar_nullable, not NullableArray #54

Merged

nalimilan merged commit 12443f4 into JuliaData:master May 12, 2017

cjprybol deleted the cjp/join branch May 13, 2017 20:05

nalimilan mentioned this pull request Oct 13, 2017

join() doesn't preserve categorical levels order JuliaData/DataFrames.jl#1257

Closed
