Skip to content
This repository has been archived by the owner on May 5, 2019. It is now read-only.

Stop auto-promoting column-types #30

Closed
wants to merge 46 commits into from
Closed

Stop auto-promoting column-types #30

wants to merge 46 commits into from

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Mar 10, 2017

I was hitting issues during the rebase and thought it would be smarter to just branch from master, dump the edits in, and push to overwrite the old branch. Looks like that was a bad idea and the prior PR was automatically closed.

To summarize some of the features discussed earlier and still present in this PR:

  • Automatic promotion of columns to nullable arrays no longer occurs
  • all constructors will try and recycle inputs as a byproduct of consolidating the constructors
  • new functions nullify!, nullify, denullify!, and denullify have been added for converting DataTables back and forth from nullable and nullfree states.

#24

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to review the rest of the tests, but this should cover the most substantive parts.

if length(uniqueheaders) == 0
return DataTable()
elseif length(unique(map(length, uniqueheaders))) > 1
throw(ArgumentError("not all DataTables have the same number of columns. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Resolve" might not be completely clear. Maybe "unmatched" (sorry, not a native speaker)?

Also it would still be interesting to print the number of columns of each table, since that's what the message says.

elseif length(unique(map(length, uniqueheaders))) > 1
throw(ArgumentError("not all DataTables have the same number of columns. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))
elseif length(uniqueheaders) > 1
throw(ArgumentError("Column names do not match. Use `rename!` or `names!` to adjust columns names. Resolve column(s): $(setdiff(union(allheaders...), intersect(allheaders...)))"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think errors should start with a lowercase letter.

Int64
```

See also [`denullify`](@ref) & [`nullify!`](@ref).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use "and" than "&".

function denullify!(dt::AbstractDataTable)
for i in 1:size(dt,2)
if !anynull(dt[i])
dt[i] = dropnull(dt[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use dropnull!? That will avoid a copy for NullableArray, and it won't make a difference for other arrays. Since there a no nulls, the original arrays won't be modified.

denullify!(dt::AbstractDataTable)

Convert columns with a `Nullable` element type without any null values
to a non-`Nullable` equivalent array type. The table `dt` is modified in place.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that new columns may alias the old ones, even when they were converted. Same for denullify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should denullify switch to deepcopy? I'm not sure I like this alias behavior for denullify.

julia> using DataTables

julia> dt = DataTable(A = 1:3, B = NullableArray(1:3))
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 222 │
│ 333 │

julia> ddt = denullify(dt)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 222 │
│ 333 │

julia> dt[:A] === ddt[:A]
true

julia> ddt[:A] = 1
1

julia> ddt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 212 │
│ 313 │

julia> dt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 212 │
│ 313

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and nullify...

julia> dt = DataTable(A = 1:3, B = NullableArray(1:3))
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 222 │
│ 333 │

julia> ndt = nullify(dt)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 111 │
│ 222 │
│ 333 │

julia> dt[:B] === ndt[:B]
true

julia> ndt[:B] = 3
3

julia> dt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 113 │
│ 223 │
│ 333 │

julia> ndt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 113 │
│ 223 │
│ 333

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, we would need concrete use cases to decide. In both cases people can easily make a copy manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed these back to using copy as you suggested and added a note for nullify and denullify that if users want fully alias-free copies, they should use nullify!(deepcopy(dt)) and denullify!(deepcopy(dt)). Hopefully if anyone hits this issue and doesn't know why columns are changed across multiple tables, they'll open the help docstrings and see the note.

x2 = [1.0, 1.0, 1.0]))
@test isequal(dt, DataTable(x1 = [0.0, 0.0, 0.0],
x2 = [1.0, 1.0, 1.0],
@test isequal(dt, DataTable(x1 = NullableCategoricalVector([0.0, 0.0, 0.0]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullableVector rather.


dt = DataTable([Int, Float64], [:x1, :x2], 2)
@test size(dt) == (2, 2)
@test eltypes(dt) == [Nullable{Int}, Nullable{Float64}]

@test isequal(dt, DataTable([Int, Float64], 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor still seems legitimate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor is legitimate but the test is asking if two datatables with uninitialized columns contain the same values.

by the same logic these two vectors should be equal

julia> Vector{Float64}(10)
10-element Array{Float64,1}:
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314
 2.29759e-314

julia> Vector{Float64}(10)
10-element Array{Float64,1}:
 2.29514e-314
 9.88131e-324
 2.25711e-314
 2.29432e-314
 0.0
 2.29469e-314
 2.29469e-314
 2.29514e-314
 2.29469e-314
 5.45361e-312

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. So use Nullable eltypes, which should compare as isequal even with nulls values.

@test DataTable(a=1, b=1:2) == DataTable(a=[1,1], b=[1,2])

@testset "associative" begin
dt = DataTable(Dict(k => v for (k,v) in zip([:A, :B], [1:3, 4:6])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just Dict(A => 1:3, B => 4:6).


@testset "associative" begin
dt = DataTable(Dict(k => v for (k,v) in zip([:A, :B], [1:3, 4:6])))
@test dt == DataTable(A = 1:3, B = 4:6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test the type of the columns (would maybe make sense elsewhere too).

if isa(columns[i], Range)
columns[i] = collect(columns[i])
end
repeats = div(maxlen, length(columns[i]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this recycling behavior which feels too magical. JuliaData/DataFrames.jl#882 only recycles single-element arrays, and that feels safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructors seem like they are already pretty forgiving given that recycling is attempted at all. It would be an interesting duality of stances: forgiving and friendly during the construction of a DataTable but enforce strict behavior rules during use. We've taken a pretty clear stance on the latter by deciding no to allowing dt[row, :col] = Nullabe() to autopromote to a NullableArray and accept the assignment. I think there's precedent for the recycling behavior too. I'd like some additional perspectives here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of this mainly because I saw it in the tests

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

d1 = DataTable(a = repeat([1:3;], inner = [4]),

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

dt = DataTable(a = repeat([1, 2, 3, 4], outer=[2]),

d1 = DataTable(a = repeat([1:3;], inner = [4]),

cols = Any[[repeat(c, inner=r2) for c in columns(dt1)];

Of course, the shortcoming here is that we decide for the user which is the default, inner or outer repeat. But at least one of these DataTable constructor behaviors will have a smaller declaration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructors seem like they are already pretty forgiving given that recycling is attempted at all. It would be an interesting duality of stances: forgiving and friendly during the construction of a DataTable but enforce strict behavior rules during use.

What do you mean? Constructors do not support any recycling currently. I'm just suggesting we extend to constructors the current behavior of setindex!, i.e. recycle scalars (and maybe on-element arrays, but that's up for discussion).

The tests/examples are not really representative of work with real data, where you more rarely need to generate vectors by recycling.

@cjprybol
Copy link
Contributor Author

cjprybol commented Mar 14, 2017

New, potentially disruptive (but probably not?) change I have added to this. stack/unstack & meltdt/unstackdt were all performing lots of Array type conversions, the first to NullableCategoricalArrays/NullableArrays and the later to the internal AbstractVector types of RepeatedArray/StackedArray. Removing the automatic array type promotion of stack/unstack wasn't too bad, but removing it from meltdt/unstackdt is essentially impossible without a rewrite of how those work (and the custom backend array types). RepeatedArrays and StackedArrays don't support all indexing types (Bool/BitArray were the two I cared about) and also, they are both another case of auto promotion of Array type that we're trying to get rid of. They don't appear to be commonly used (and are definitely not well tested), otherwise I'm sure someone would have caught these basic usability issues already.


This saves memory. To create the view, several AbstractVectors are defined:

`:variable` column -- `EachRepeatedVector`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't any EachRepeatedVectors left in the code so this is probably overdue for removal

@cjprybol cjprybol changed the title Use whatever column-type you want 2.0 Stop auto-promoting column-types Mar 14, 2017
@nalimilan
Copy link
Member

nalimilan commented Mar 15, 2017 via email

@nalimilan
Copy link
Member

nalimilan commented Mar 15, 2017 via email

@nalimilan
Copy link
Member

nalimilan commented Mar 15, 2017 via email

@nalimilan
Copy link
Member

nalimilan commented Mar 15, 2017 via email

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay!

@@ -76,9 +76,9 @@ function stack(dt::AbstractDataTable, measure_vars::Vector{Int},
cnames = names(dt)[id_vars]
insert!(cnames, 1, value_name)
insert!(cnames, 1, variable_name)
DataTable(Any[Compat.repeat(_names(dt)[measure_vars], inner=nrow(dt)), # variable
DataTable(Any[repeat(_names(dt)[measure_vars], inner=nrow(dt)), # variable
vcat([dt[c] for c in measure_vars]...), # value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should align this comment too.

NullableArray(eltype(T), dims)

similar_nullable{T,R}(dv::CategoricalArray{T,R}, dims::@compat(Union{Int, Tuple{Vararg{Int}}})) =
similar_nullable{T,R}(dv::CategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =
NullableCategoricalArray(T, dims)

similar_nullable(dt::AbstractDataTable, dims::Int) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this definition still used somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, only the Array ones are still used. I'll take this one out.

@@ -45,7 +45,7 @@ function printtable(io::IO,
if !isnull(dt[j][i])
if ! (etypes[j] <: Real)
print(io, quotemark)
x = isa(dt[i, j], Nullable) ? _unsafe_get(dt[i, j]) : dt[i, j]
x = isa(dt[i, j], Nullable) ? NullableArrays.unsafe_get(dt[i, j]) : dt[i, j]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not rely on NullableArrays internals for this. Also in Julia 0.6 `unsafe_get exists in Julia Base, so better not make it look like a NullableArrays-specific function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was working on 0.5 when I wrote this and couldn't find the Base.unsafe_get. Will use @compat(unsafeget)

similar_nullable{T}(dv::AbstractVector{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
NullableVector{T}(dims)
similar_nullable{T}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
NullableArray(T, dims)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do NullableArrays not support the proper type parameter convention? I thought that was just DataArrays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they do. This is reverting back to the current master syntax. Now that I've added tests for CategoricalArrays in unstack (or maybe just ran a Pkg.update?) I caught a new deprecation warning.

julia> DataTables.similar_nullable(CategoricalArray([1, 1, 1]), 2)
WARNING: NullableCategoricalArray{T}(::Type{T}, dims::Int...; ordered=false) is deprecated, use NullableCategoricalArray{T}(dims, ordered=ordered) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] #NullableCategoricalArray#137(::Bool, ::Type{T} where T, ::Type{Int64}, ::Int64, ::Vararg{Int64,N} where N) at ./deprecated.jl:51
 [3] CategoricalArrays.NullableCategoricalArray(::Type{Int64}, ::Int64) at ./deprecated.jl:51
 [4] similar_nullable(::CategoricalArrays.CategoricalArray{Int64,1,UInt32}, ::Int64) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:12
 [5] eval(::Module, ::Any) at ./boot.jl:235
 [6] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [7] macro expansion at ./REPL.jl:97 [inlined]
 [8] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
2-element CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}:
 #NULL
 #NULL

julia> DataTables.similar_nullable(NullableCategoricalArray([1, 1, 1]), 2)
WARNING: NullableCategoricalArray{T}(::Type{T}, dims::Int...; ordered=false) is deprecated, use NullableCategoricalArray{T}(dims, ordered=ordered) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] #NullableCategoricalArray#137(::Bool, ::Type{T} where T, ::Type{Int64}, ::Int64, ::Vararg{Int64,N} where N) at ./deprecated.jl:51
 [3] CategoricalArrays.NullableCategoricalArray(::Type{Int64}, ::Int64) at ./deprecated.jl:51
 [4] similar_nullable(::CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}, ::Int64) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/join.jl:15
 [5] eval(::Module, ::Any) at ./boot.jl:235
 [6] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [7] macro expansion at ./REPL.jl:97 [inlined]
 [8] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
2-element CategoricalArrays.NullableCategoricalArray{Int64,1,UInt32}:
 #NULL
 #NULL

I changed all of these to the new syntax but @nalimilan asked if I would revert and save the changes for another PR, too many changes at once.

hcat!(dt::DataTable, x::NullableVector) = hcat!(dt, DataTable(Any[x]))
hcat!(dt::DataTable, x::Vector) = hcat!(dt, DataTable(Any[NullableArray(x)]))
hcat!(dt::DataTable, x) = hcat!(dt, DataTable(Any[NullableArray([x])]))
hcat!(dt::DataTable, x) = hcat!(dt, DataTable(Any[x]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the scalar recycling in this PR these can all be consolidated into a single function

unionunique = union(uniqueheaders...)
coldiff = setdiff(unionunique, intersect(uniqueheaders...))
if !isempty(coldiff)
# if any datatables are a full superset of names, skip them
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for each unique set of column names I'm throwing the error to tell which columns are missing from each of the sets. If any of the inputs to vcat have all of the column names then we can't show which are missing, so they're dropped from the error output

@cjprybol
Copy link
Contributor Author

cjprybol commented Mar 18, 2017

  • CategoricalArrays issue has been filed NullableCategoricalArrays constructor and levels ordering CategoricalArrays.jl#62 fixed
  • vcat errors are extensively tested, both for error type and error message
  • joins are fully type stabilized
  • unstack is type stabilized and order stabilized except for when columns are NullableCategoricalArrays, but that will fix itself when NullableCategoricalArrays constructor and levels ordering CategoricalArrays.jl#62 is fixed (and we'll know it's fixed because some of the unstack tests here will fail, and it will be a rare moment when we are excited for failures!)
  • everything else unrelated has been reverted (except for incorrect spacing and compat removals, which I see no reason to revert).
  • I don't see any more issues when reviewing the diff by eye.
  • Let's ignore the coverage drop? I've added many tests here and caught dozens of behavioral bugs that are now corrected. We're also still up 2% on DataFrames and I'll increase coverage a bit more with Remove unused functions and increase test coverage #31.
  • constructors are all consolidated in behavior and return the correct column type, and recycle properly.
  • right join errors are patched

pending no major issues during reviews, I think this is ready to merge.

@@ -18,5 +18,5 @@ import Base: keys, values, insert!

@deprecate sub(dt::AbstractDataTable, rows) view(dt, rows)

@deprecate stackdf stackdt
@deprecate meltdf meltdt
@deprecate stackdf stack
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be considered an unrelated change but given the issues with the stackdt/meltdt maybe we should be steering people towards the non-view versions? I'm happy to revert this if others would like it reverted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd revert it for now, as it's unrelated. We can discuss steering people toward stack separately.

@@ -61,13 +61,13 @@ d = stackdt(iris)

This saves memory. To create the view, several AbstractVectors are defined:

`:variable` column -- `EachRepeatedVector`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, these probably are important. 2 spaces == newline. I forgot this is markdown interpreter dependant

dt2 = DataTable(A = 1:3)
# right missing 1 column
err = @test_throws ArgumentError vcat(dt1, dt2)
@test err.value.msg == "column(s) B are missing from argument(s) 2"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that this is possible (catching the error output in a variable to test the string). Thanks for the suggestion here

valuecol = dt[value]
keycol = NullableCategoricalArray(dt[colkey])
levels!(keycol, unique(dt[colkey]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupidxs = [g.idx[g.starts[i]:g.ends[i]] for i in 1:length(g.starts)]
rowkey = zeros(Int, size(dt, 1))
for i in 1:length(groupidxs)
rowkey[groupidxs[i]] = i
end
keycol = NullableCategoricalArray(dt[colkey])
levels!(keycol, unique(dt[colkey]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, couldn't find the time to review the new changes.

After discussing this with others, I think we should adopt the following strategy to merge this:

  • I hadn't really thought about the stackdt/meltdt issue. I turns out it's not OK to remove these. The fact that nobody noticed they were semi-broken doesn't mean they aren't useful, since DataFrames master hasn't been used widely before forking to DataTables. So these need to be restored and adapted like you did for stack and melt.

  • The changes to tests implied by the removal of automatic promotion to NullableArray are somewhat problematic. Indeed, that means by default data tables won't be using NullableArray, so we won't test that they are still supported by all functions. Ideally, we would test both Array and NullableArray, and their combination, but for now to ensure we don't regress (now or in the future) I'm afraid all tests will have to be converted to use NullableArray columns, except of course for specific cases where it's justified.

In order to make this less messy, could you refactor this PR in (at least) two parts? The first part can contain most changes to the code (but not to tests), except that the promotion to NullableArray would still happen. That way, tests do not need to be adapted too much. The second part can then change the default, and adapt tests to use NullableArray columns. Since that's going to touch a lot of code, it's better to have it separate so that more fundamental changes do not get lost in the noise.

Thanks!

matchingloci = find(h -> u == h, allheaders)
headerdiff = filter(x -> !in(x, u), coldiff)
headerdiff = length(headerdiff) > 1 ?
join(string.(headerdiff[1:end-1]), ", ") * " and " * string(headerdiff[end]) :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather align everything on length, without further indentation. Anyway if you keep the indentation, it should be four spaces. Same below.

Anyway, no need for this length check nor to handle the last element manually: as I said, just use join's last argument.

# multiple missing 1 column
err = @test_throws ArgumentError vcat(dt1, dt2, dt2, dt2, dt2, dt2)
@test err.value.msg == "column(s) B are missing from argument(s) 2, 3, 4, 5 and 6"
# argument missing >1columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space.

string(matchingloci[end])
estrings[i] = "column(s) $headerdiff are missing from argument(s) $matchingloci"
end
throw(ArgumentError(join(estrings, ", and ")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also use last here for "and".

test/io.jl Outdated
@@ -49,6 +49,6 @@ module TestIO
F = NullableArray(fill(Nullable(), 3)),
G = fill(Nullable(), 3))

answer = Sys.WORD_SIZE == 64 ? 0xd4b5a035796ad770 : 0x1950ccd7
answer = Sys.WORD_SIZE == 64 ? 0x937e94e70d642cce : 0x2b8864d8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually while this trick saves some space, I'd prefer copying the full output (using triple quotes). That way the logs show the actual and expected output on failure, which makes it clearer what doesn't work.

@testset "null conversions" begin
dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
nullfree = Any[Array{Int,1},Array{Int,1},Array{Int,1}]
nullified = convert(Vector{Any}, fill(NullableArray{Int,1}, 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this conversion? Why use a different approach from the line above? Also I don't think you need the Any above, and Vector{Int} is shorter.

@@ -271,13 +272,6 @@ module TestData
v1 = randn(10)
)

dt2 = DataTable(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, please keep this for now.

@@ -35,8 +35,6 @@ module TestConversions
@test isa(ai, Matrix{Int})
@test ai == convert(Matrix{Int}, dt)

dt[1,1] = Nullable()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tested somewhere else? If not, should be kept (after adapting of course).

@testset "associative" begin
dt = DataTable(Dict(:A => 1:3, :B => 4:6))
@test dt == DataTable(A = 1:3, B = 4:6)
@test all(e -> e <: Int, eltypes(dt))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just eltypes(dt) == [Int, Int].


@test isequal(dt, DataTable([Int, Float64], 2))
@test isequal(dt, DataTable([Nullable{Int}, Nullable{Float64}], 2))
@test all(isnull, (dt[:x1], dt[:x2]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just @test isnull(dt[:x1]); @test isnull(dt[:x2]) (on a new line). That also makes the tests more precise about what's wrong if they fail.

@@ -18,8 +18,6 @@ module TestConstructors

@test isequal(dt, DataTable(Any[NullableCategoricalVector(zeros(3)),
NullableCategoricalVector(ones(3))]))
@test isequal(dt, DataTable(x1 = [0.0, 0.0, 0.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test still applies: it checks that comparisons work across types. You can convert it to NullableArray like below.

@cjprybol
Copy link
Contributor Author

cjprybol commented Mar 27, 2017

That sounds like a great plan, @nalimilan, I can definitely do that!

I think the logical units present in this PR that are necessary to stop autopromotion or otherwise assist with the transition away from autopromotion are:

I agree with all of your specific code comments/reviews as well, so I'll incorporate those where appropriate. Thanks for the feedback, and please don't feel the need to apologize for any delays! I'm sure you're busy enough before you start volunteering your time to maintain and develop this package, and DataTables isn't even the only package you maintain and develop. @nalimilan and @ararslan both deserve 🥇🥇s, as do the maintainers of many other packages and the language itself.

I'll leave this PR open as a reference, and now that we're going to refactor this, I'll update #31 to not be dependant on this PR and try and get those changes merged soon too.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants