This repository has been archived by the owner on May 5, 2019. It is now read-only.

Stop auto-promoting column-types #30

Closed
wants to merge 46 commits into from
Changes from 33 commits
46 commits
8db2821
add changes
cjprybol Mar 10, 2017
4a939fe
make vcat error more informative
cjprybol Mar 13, 2017
f5a53a1
add docstring for vcat
cjprybol Mar 13, 2017
2c95f13
incorporate edits suggested during review
cjprybol Mar 13, 2017
412ceaa
_unsafe_get -> NullableArrays.unsafe_get
cjprybol Mar 13, 2017
f142df5
Merge branch 'master' into cjp/rebaseretaintype
cjprybol Mar 13, 2017
cc95658
fix new tests from master
cjprybol Mar 13, 2017
06dc914
remove RepeatedVector, StackedVector, unstackdt, meltdt
cjprybol Mar 14, 2017
c4e218e
DataFrames doensn't reshape 2d Arrays -> Vectors so don't do it here
cjprybol Mar 14, 2017
e954226
minor cleanup
cjprybol Mar 14, 2017
ed8a515
change (de)nullify back to copy and cleanup docstrings
cjprybol Mar 15, 2017
1636a0c
NullableArrays.unsafe_get -> compat(unsafe_get)
cjprybol Mar 15, 2017
91233d3
default to NullableArray for joins that may introduce missing data
cjprybol Mar 15, 2017
7462612
align comments
cjprybol Mar 15, 2017
9b65533
lots of edits
cjprybol Mar 15, 2017
b643ff8
tests and no need for compat
cjprybol Mar 15, 2017
4c68452
spacing mistakes
cjprybol Mar 15, 2017
7310681
throw errors on 1-d matrices and change confusing variable name
cjprybol Mar 15, 2017
de280ba
add back check to differentiate scalars from AbstractArrays
cjprybol Mar 15, 2017
88b20ca
save work
cjprybol Mar 16, 2017
be1cacd
save progress, switch to test master
cjprybol Mar 16, 2017
19ffb58
join is ready and tests in place. right join still broken
cjprybol Mar 16, 2017
3f2cd63
fix right join
cjprybol Mar 17, 2017
9c3ad21
update join help message and add note about temp fix
cjprybol Mar 17, 2017
1e7d26e
indentation
cjprybol Mar 17, 2017
e39ba63
changes
cjprybol Mar 17, 2017
04cb9ee
spacing
cjprybol Mar 17, 2017
5d70685
put old unstack back and stabilize types, ordering
cjprybol Mar 18, 2017
7859132
fix bad copy and paste spacing and condense scalar recycling code
cjprybol Mar 18, 2017
6496acf
update vcat error
cjprybol Mar 18, 2017
f47810f
unused function, another test, remove unused variable
cjprybol Mar 18, 2017
259ceef
revert function removal to appease new code failures?
cjprybol Mar 18, 2017
26e87ac
fix v0.5 issue
cjprybol Mar 18, 2017
e0f7982
update vcat testing and change similar_nullable constructor call
cjprybol Mar 18, 2017
d65385e
:Merge branch 'cjp/retaintype' of github.com:cjprybol/DataTables.jl i…
cjprybol Mar 18, 2017
b0c29b4
remove old error message from docstring
cjprybol Mar 18, 2017
95a6f31
and change docstring to doctest
cjprybol Mar 18, 2017
7df712f
change similar_nullable back and fix unrelated copy paste space removal
cjprybol Mar 18, 2017
27da644
add missing rightperm reordering and properly unify hcat! functions
cjprybol Mar 18, 2017
5fa8fa0
accidental spacing changes
cjprybol Mar 18, 2017
a1d58f9
forgot one spacing change
cjprybol Mar 18, 2017
db87443
change deprecations
cjprybol Mar 18, 2017
9c66a1e
add back extra spaces
cjprybol Mar 18, 2017
887346b
bump catarrays version, remove manual resetting of levels in unstack
cjprybol Mar 20, 2017
00c08cc
Merge branch 'master' into cjp/retaintype
cjprybol Mar 24, 2017
020c88e
only use "and" when joining the last estring
cjprybol Mar 24, 2017
6 changes: 3 additions & 3 deletions docs/src/man/reshaping_and_pivoting.md
@@ -61,13 +61,13 @@ d = stackdt(iris)

This saves memory. To create the view, several AbstractVectors are defined:

`:variable` column -- `EachRepeatedVector`
Contributor Author

Actually, these are probably important: two trailing spaces produce a line break. I forgot that this depends on the Markdown interpreter.

`:variable` column -- `EachRepeatedVector`
Contributor Author

There aren't any EachRepeatedVectors left in the code so this is probably overdue for removal

This repeats the variables N times where N is the number of rows of the original AbstractDataTable.

`:value` column -- `StackedVector`
`:value` column -- `StackedVector`
This is provides a view of the original columns stacked together.

Id columns -- `RepeatedVector`
Id columns -- `RepeatedVector`
This repeats the original columns N times where N is the number of columns stacked.
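
The three vector types listed above can be pictured as lazy views that compute each element from the parent column on access. The following is a minimal sketch of that idea, written for Julia 1.x rather than the Julia 0.5/0.6 syntax this PR targets. `RepeatedView` and its fields `inner` and `outer` are hypothetical names used for illustration; this is not the package's `RepeatedVector` implementation.

```julia
# Hypothetical read-only view: each element of `parent` is repeated `inner`
# times in a row, and the whole sequence is repeated `outer` times.
struct RepeatedView{T} <: AbstractVector{T}
    parent::Vector{T}
    inner::Int
    outer::Int
end

Base.size(r::RepeatedView) = (length(r.parent) * r.inner * r.outer,)
Base.IndexStyle(::Type{<:RepeatedView}) = IndexLinear()

function Base.getindex(r::RepeatedView, i::Int)
    @boundscheck checkbounds(r, i)
    # Map the flat index back to a position in the parent; nothing is copied.
    block = fld1(i, r.inner)
    return r.parent[mod1(block, length(r.parent))]
end

ids = RepeatedView([10, 20, 30], 1, 2)   # like an id column stacked over 2 variables
vars = RepeatedView([:x, :y], 3, 1)      # like a :variable column for 3 rows
```

Only the parent vector and two integers are stored, which is where the memory saving described above would come from.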

For more details on the storage representation, see:
4 changes: 4 additions & 0 deletions src/DataTables.jl
@@ -47,6 +47,8 @@ export @~,
combine,
completecases,
deleterows!,
denullify!,
denullify,
describe,
dropnull,
dropnull!,
@@ -61,6 +63,8 @@ export @~,
nonunique,
nrow,
nullable!,
nullify!,
nullify,
order,
printtable,
rename!,
312 changes: 243 additions & 69 deletions src/abstractdatatable/abstractdatatable.jl
@@ -31,6 +31,10 @@ The following are normally implemented for AbstractDataTables:
* [`nonunique`](@ref) : indexes of duplicate rows
* [`unique!`](@ref) : remove duplicate rows
* `similar` : a DataTable with similar columns as `d`
* `denullify` : unwrap `Nullable` columns
* `denullify!` : unwrap `Nullable` columns in-place
* `nullify` : convert all columns to NullableArrays
* `nullify!` : convert all columns to NullableArrays in-place

**Indexing**

@@ -706,83 +710,79 @@ Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt[:, :], dt2)
Base.hcat(dt::AbstractDataTable, x, y...) = hcat!(hcat(dt, x), y...)
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable, dtn::AbstractDataTable...) = hcat!(hcat(dt1, dt2), dtn...)

# vcat only accepts DataTables. Finds union of columns, maintaining order
# of first dt. Missing data become null values.
"""
vcat(dts::AbstractDataTable...)

Base.vcat(dt::AbstractDataTable) = dt
Vertically concatenate `AbstractDataTables` that have the same column names in
the same order.

Base.vcat(dts::AbstractDataTable...) = vcat(AbstractDataTable[dts...])
```julia
julia> dt1 = DataTable(A=1:3, B=1:3);

function Base.vcat{T<:AbstractDataTable}(dts::Vector{T})
isempty(dts) && return DataTable()
coltyps, colnams, similars = _colinfo(dts)

res = DataTable()
Nrow = sum(nrow, dts)
for j in 1:length(colnams)
colnam = colnams[j]
col = similar(similars[j], coltyps[j], Nrow)

i = 1
for dt in dts
if haskey(dt, colnam)
copy!(col, i, dt[colnam])
end
i += size(dt, 1)
end
julia> dt2 = DataTable(A=4:6, B=4:6);

res[colnam] = col
end
res
end
julia> dt3 = DataTable(A=7:9, B=7:9, C=7:9);

_isnullable{T}(::AbstractArray{T}) = T <: Nullable
const EMPTY_DATA = NullableArray(Void, 0)

function _colinfo{T<:AbstractDataTable}(dts::Vector{T})
dt1 = dts[1]
colindex = copy(index(dt1))
coltyps = eltypes(dt1)
similars = collect(columns(dt1))
nonnull_ct = Int[_isnullable(c) for c in columns(dt1)]

for i in 2:length(dts)
dt = dts[i]
for j in 1:size(dt, 2)
col = dt[j]
cn, ct = _names(dt)[j], eltype(col)
if haskey(colindex, cn)
idx = colindex[cn]

oldtyp = coltyps[idx]
if !(ct <: oldtyp)
coltyps[idx] = promote_type(oldtyp, ct)
# Needed on Julia 0.4 since e.g.
# promote_type(Nullable{Int}, Nullable{Float64}) gives Nullable{T},
# which is not a usable type: fall back to Nullable{Any}
if VERSION < v"0.5.0-dev" &&
coltyps[idx] <: Nullable && !isa(coltyps[idx].types[2], DataType)
coltyps[idx] = Nullable{Any}
end
end
nonnull_ct[idx] += !_isnullable(col)
else # new column
push!(colindex, cn)
push!(coltyps, ct)
push!(similars, col)
push!(nonnull_ct, !_isnullable(col))
end
end
end
julia> vcat(dt1, dt2)
6×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │

for j in 1:length(colindex)
if nonnull_ct[j] < length(dts) && !_isnullable(similars[j])
similars[j] = EMPTY_DATA
julia> vcat(dt1, dt2, dt3)
ERROR: ArgumentError: columns (A, B) of input(s) (1, 2) != columns (A, B, C) of input(s) (3)
Member

Actually, the error message could be confusing: it seems to mean that the problem is that columns are different. Probably clearer: "column names of input(s) X != column names of input(s) Y: (A, B) != (A, B, C)".

Contributor Author

Yea, you asked me to change this before too and I'm still struggling to think of a better way. This will handle all conditions for any number of inputs. Ideally, we would present the differences in a format like a git diff where only differences are shown and (bonus feature:) they would be shown colorized (red for missing, green for extra columns).

julia> dt1 = DataTable(A = 1, B = 1)
1×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> dt2 = DataTable(B = 1, A = 1)
1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │

julia> dt3 = DataTable(B = 1, A = 1, C = 1)
1×3 DataTables.DataTable
│ Row │ B │ A │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │

julia> vcat(dt1, dt2)
ERROR: ArgumentError: columns (A, B) of input(s) (1) != columns (B, A) of input(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:756

julia> vcat(dt2, dt3)
ERROR: ArgumentError: columns (B, A) of input(s) (1) != columns (B, A, C) of input(s) (2)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:756

julia> vcat(dt1, dt2, dt3)
ERROR: ArgumentError: columns (A, B) of input(s) (1) != columns (B, A) of input(s) (2) != columns (B, A, C) of input(s) (3)
Stacktrace:
 [1] vcat(::DataTables.DataTable, ::DataTables.DataTable, ::DataTables.DataTable, ::Vararg{DataTables.DataTable,N} where N) at /Users/Cameron/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:756

Contributor Author

It stays compact as long as the inputs are mostly correct, because it simply extends the list of inputs that match each set of columns.

julia> dt3, dt4, dt5, dt6, dt7, dt8 = dt2, dt2, dt2, dt2, dt2, dt2
(1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │, 1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │, 1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │, 1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │, 1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │, 1×2 DataTables.DataTable
│ Row │ B │ A │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │)

julia> vcat(dt1, dt2, dt3, dt4, dt5, dt6, dt7, dt8)
ERROR: ArgumentError: columns (A, B) of input(s) (1) != columns (B, A) of input(s) (2, 3, 4, 5, 6, 7, 8)

Member

In a previous comment I had proposed a solution to make this simpler: first check that the number of columns match, and if not just print the names of columns which are missing somewhere. That way you don't need to care about the order at that point. Then, if the number of columns is the same but the names are different, print the non matching names and their position. Finally, if names are the same but not in the same order, just say so, possibly giving the number of the first problematic column and the corresponding names.

What really matters here is not flooding the output with 500 variables for large datasets.

```
"""
Base.vcat(dt::AbstractDataTable) = dt
function Base.vcat(dts::AbstractDataTable...)
isempty(dts) && return DataTable()
allheaders = map(names, dts)
# don't vcat empty DataTables
notempty = find(x -> length(x) > 0, allheaders)
uniqueheaders = unique(allheaders[notempty])
if length(uniqueheaders) == 0
return DataTable()
end
if length(uniqueheaders) > 1
unionunique = union(uniqueheaders...)
coldiff = setdiff(unionunique, intersect(uniqueheaders...))
if !isempty(coldiff)
# if any datatables are a full superset of names, skip them
Contributor Author

For each unique set of column names, I throw an error that says which columns are missing from that set. If any input to vcat has all of the column names, there is nothing missing to report for it, so it is dropped from the error output.

filter!(u -> Set(u) != Set(unionunique), uniqueheaders)
estrings = Vector{String}(length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
matchingloci = find(h -> u == h, allheaders)
headerdiff = filter(x -> !in(x, u), coldiff)
headerdiff = length(headerdiff) > 1 ?
join(string.(headerdiff[1:end-1]), ", ") * " and " * string(headerdiff[end]) :
Member

I'd rather align everything on length, without further indentation. Anyway if you keep the indentation, it should be four spaces. Same below.

Anyway, no need for this length check nor to handle the last element manually: as I said, just use join's last argument.

string(headerdiff[end])
matchingloci = length(matchingloci) > 1 ?
join(string.(matchingloci[1:end-1]), ", ") * " and " * string(matchingloci[end]) :
string(matchingloci[end])
estrings[i] = "column(s) $headerdiff are missing from argument(s) $matchingloci"
end
throw(ArgumentError(join(estrings, ", and ")))
Member

Also use last here for "and".
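
For reference, Julia's `join` accepts an optional third argument that is used only between the final two items, which is what the two review comments above point to. A minimal sketch, independent of the PR's code (the variable names are illustrative only):

```julia
# `join(iterable, delim, last)` uses `last` only before the final item.
three = join([:A, :B, :C], ", ", " and ")   # "A, B and C"
two   = join([:A, :B], ", ", " and ")       # "A and B"
one   = join([:A], ", ", " and ")           # "A"
```

With this form, neither a length check nor manual handling of the last element is needed.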

else
estrings = Vector{String}(length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
indices = find(a -> a == u, allheaders)
indices = length(indices) > 1 ?
join(string.(indices[1:end-1]), ", ") * " and " * string(indices[end]) :
string(indices[end])
estrings[i] = "column order of argument(s) $indices"
end
throw(ArgumentError(join(estrings, " != ")))
end
else
header = uniqueheaders[1]
dts_to_vcat = dts[notempty]
return DataTable(Any[vcat(map(dt -> dt[col], dts_to_vcat)...) for col in header], header)
end
colnams = _names(colindex)

coltyps, colnams, similars
end
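
The tiered error reporting proposed in the review thread above could look roughly like the sketch below. It is a simplified two-tier variant (missing names first, then column order) written for Julia 1.x, not the code that was merged. `header_mismatch` is a hypothetical helper name, and the sketch assumes column names are unique within each header.

```julia
# Hypothetical helper: explain why headers cannot be vertically concatenated.
# Returns `nothing` when every header equals the first one.
function header_mismatch(headers::AbstractVector{<:AbstractVector{Symbol}})
    isempty(headers) && return nothing
    ref = first(headers)
    for (i, h) in enumerate(headers)
        h == ref && continue
        # Tier 1: the sets of names differ, so report only the missing names.
        missing_in_h = setdiff(ref, h)
        missing_in_ref = setdiff(h, ref)
        if !isempty(missing_in_h) || !isempty(missing_in_ref)
            parts = String[]
            isempty(missing_in_h) ||
                push!(parts, "argument $i is missing " * join(missing_in_h, ", ", " and "))
            isempty(missing_in_ref) ||
                push!(parts, "argument 1 is missing " * join(missing_in_ref, ", ", " and "))
            return join(parts, "; ")
        end
        # Tier 2: same names in a different order, so report the first differing position.
        pos = findfirst(h .!= ref)
        return "column order of argument $i differs from argument 1 at position $pos ($(h[pos]) != $(ref[pos]))"
    end
    return nothing
end
```

Because only the differing names are printed, the message stays short even when the tables have hundreds of columns.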

##############################################################################
@@ -801,6 +801,180 @@ function Base.hash(dt::AbstractDataTable)
return @compat UInt(h)
end

"""
denullify!(dt::AbstractDataTable)

Convert columns with a `Nullable` element type without any null values
to a non-`Nullable` equivalent array type. The table `dt` is modified in place.
Member

Mention that new columns may alias the old ones, even when they were converted. Same for denullify.

Contributor Author

Should denullify switch to deepcopy? I'm not sure I like this alias behavior for denullify.

julia> using DataTables

julia> dt = DataTable(A = 1:3, B = NullableArray(1:3))
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │

julia> ddt = denullify(dt)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │

julia> dt[:A] === ddt[:A]
true

julia> ddt[:A] = 1
1

julia> ddt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 2 │
│ 3   │ 1 │ 3 │

julia> dt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 2 │
│ 3   │ 1 │ 3 │

Contributor Author

and nullify...

julia> dt = DataTable(A = 1:3, B = NullableArray(1:3))
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │

julia> ndt = nullify(dt)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │

julia> dt[:B] === ndt[:B]
true

julia> ndt[:B] = 3
3

julia> dt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 3 │
│ 2   │ 2 │ 3 │
│ 3   │ 3 │ 3 │

julia> ndt
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 3 │
│ 2   │ 2 │ 3 │
│ 3   │ 3 │ 3 │

Member

I'm not sure, we would need concrete use cases to decide. In both cases people can easily make a copy manually.

Contributor Author

I've changed these back to using copy as you suggested and added a note for nullify and denullify that if users want fully alias-free copies, they should use nullify!(deepcopy(dt)) and denullify!(deepcopy(dt)). Hopefully if anyone hits this issue and doesn't know why columns are changed across multiple tables, they'll open the help docstrings and see the note.
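
The difference between `copy` and `deepcopy` discussed in this thread can be reproduced without DataTables. The sketch below uses a plain `Dict` of column vectors as a stand-in for a table; that stand-in is an assumption made for illustration and is not the `DataTable` type, though the aliasing mechanism it shows is the same one Base's `copy` and `deepcopy` exhibit for any container of arrays.

```julia
# Stand-in for a table: column name => column vector (not the DataTables API).
table = Dict(:A => [1, 2, 3], :B => [4, 5, 6])

shallow = copy(table)       # new Dict, but the same column vectors
deep = deepcopy(table)      # new Dict and new column vectors

shallow[:A][1] = 100        # mutates the vector that `table` also holds
aliased = table[:A][1]      # 100: the change is visible through `table`
isolated = deep[:A][1]      # 1: the deep copy is unaffected

shallow[:B] = [7, 8, 9]     # rebinding a column only changes `shallow`
untouched = table[:B]       # [4, 5, 6]
```

In-place writes to a shared column are what propagate between tables, while replacing a column with a new array does not.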


Columns in the returned `AbstractDataTable` may alias the columns of the
input `dt`.

# Examples

```jldoctest
julia> dt = DataTable(A = NullableArray(1:3), B = [Nullable(i) for i=1:3])
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │

julia> eltypes(dt)
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}

julia> eltypes(denullify!(dt))
2-element Array{Type,1}:
Int64
Int64

julia> eltypes(dt)
2-element Array{Type,1}:
Int64
Int64
```

See also [`denullify`](@ref) and [`nullify!`](@ref).
"""
function denullify!(dt::AbstractDataTable)
    for i in 1:size(dt,2)
        if !anynull(dt[i])
            dt[i] = dropnull!(dt[i])
        end
    end
    dt
end

"""
denullify(dt::AbstractDataTable)

Return a copy of `dt` where columns with a `Nullable` element type without any
null values have been converted to a non-`Nullable` equivalent array type.

Columns in the returned `AbstractDataTable` may alias the columns of the
input `dt`. If no aliasing is desired, use `denullify!(deepcopy(dt))`.

# Examples

```jldoctest
julia> dt = DataTable(A = NullableArray(1:3), B = [Nullable(i) for i=1:3])
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │

julia> eltypes(dt)
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}

julia> eltypes(denullify(dt))
2-element Array{Type,1}:
Int64
Int64

julia> eltypes(dt)
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}
```

See also [`denullify!`](@ref) and [`nullify`](@ref).
"""
denullify(dt::AbstractDataTable) = denullify!(copy(dt))

"""
nullify!(dt::AbstractDataTable)

Convert all columns of `dt` to nullable arrays. The table `dt` is modified in place.

Columns in the returned `AbstractDataTable` may alias the columns of the
input `dt`.

# Examples

```jldoctest
julia> dt = DataTable(A = 1:3, B = 1:3)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │

julia> eltypes(dt)
2-element Array{Type,1}:
Int64
Int64

julia> eltypes(nullify!(dt))
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}

julia> eltypes(dt)
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}
```

See also [`nullify`](@ref) and [`denullify!`](@ref).
"""
function nullify!(dt::AbstractDataTable)
    for i in 1:size(dt,2)
        dt[i] = nullify(dt[i])
    end
    dt
end

nullify(x::AbstractArray) = convert(NullableArray, x)
nullify(x::AbstractCategoricalArray) = convert(NullableCategoricalArray, x)

"""
nullify(dt::AbstractDataTable)

Return a copy of `dt` with all columns converted to nullable arrays.

Columns in the returned `AbstractDataTable` may alias the columns of the
input `dt`. If no aliasing is desired, use `nullify!(deepcopy(dt))`.

# Examples

```jldoctest
julia> dt = DataTable(A = 1:3, B = 1:3)
3×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │

julia> eltypes(dt)
2-element Array{Type,1}:
Int64
Int64

julia> eltypes(nullify(dt))
2-element Array{Type,1}:
Nullable{Int64}
Nullable{Int64}

julia> eltypes(dt)
2-element Array{Type,1}:
Int64
Int64
```

See also [`nullify!`](@ref) and [`denullify`](@ref).
"""
function nullify(dt::AbstractDataTable)
    nullify!(copy(dt))
end
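
`Nullable` is no longer part of Base in Julia 1.x, so the functions above cannot be run there as written. As a rough analogue only, the sketch below shows the same two conversions on plain vectors using `Union{Missing, T}` element types instead of `Nullable`. This swaps in a different missing-data representation from the one in this PR, and `drop_missing_type` and `allow_missing` are hypothetical names.

```julia
# Analogue of `denullify`: narrow the element type only when no value is missing.
function drop_missing_type(col::AbstractVector)
    any(ismissing, col) && return col    # columns holding missing values are kept as-is
    return convert(Vector{nonmissingtype(eltype(col))}, col)
end

# Analogue of `nullify`: widen the element type so missing values can be stored.
allow_missing(col::AbstractVector) = convert(Vector{Union{Missing, eltype(col)}}, col)

clean = Union{Int, Missing}[1, 2, 3]
holey = [1, missing, 3]
plain = [1, 2, 3]

narrowed = drop_missing_type(clean)   # element type becomes Int
kept = drop_missing_type(holey)       # returned unchanged
same = drop_missing_type(plain)       # already Vector{Int}: `convert` returns the input itself
widened = allow_missing(plain)        # element type becomes Union{Missing, Int}
```

The `same` case illustrates the kind of aliasing the docstrings warn about: when no conversion is needed, `convert` hands back the original array rather than a copy.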

## Documentation for methods defined elsewhere
