Skip to content
This repository has been archived by the owner on May 5, 2019. It is now read-only.

WIP: Consolidate hcat, append, merge, vcat and split functional overlaps #70

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

cjprybol
Copy link
Contributor

removes hcat and hcat!, append(!) now adds columns to end of DataTable, merge(!)
concatenates DataTables horizontally, and vcat still concatenates
vertically. append! and vcat no longer perform the same function. hcat(!)
and merge no longer perform the same function.

Aims to resolve #52

removes hcat, append now adds columns to end of DataTable, merge
concatenates DataTables horizontally, and vcat still concatenates
vertically. append! and vcat no longer perform the same function. hcat
and merge no longer perform the same function.
@@ -201,7 +201,11 @@ function combine(ga::GroupApplied)
idx[j + (1:n)] = gd.idx[start]
j += n
end
hcat!(gd.parent[idx, gd.cols], valscat)
if isa(valscat, DataTable)
Copy link
Contributor Author

@cjprybol cjprybol May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I broke this during my changes it helped me understand how combine and my changes to aggregate in #65 differ. This hcat! call was handling both DataTables and Vectors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand. valscat = vcat(ga.vals) and ga.vals is necessarily a Vector{<:AbstractDataTable}. So valscat is always an AbstractDataTable. Why do you need to call append! when it's not a DataTable?

@@ -109,7 +109,7 @@ module TestDataTable
dt = DataTable(a=[1, 2], b=[3., 4.])
dt2 = DataTable(b=["a", "b"], c=[:c, :d])
@test isequal(merge!(dt, dt2), dt)
@test isequal(dt, DataTable(a=[1, 2], b=["a", "b"], c=[:c, :d]))
@test isequal(dt, DataTable(a=[1, 2], b=[3., 4.], b_1=["a", "b"], c=[:c, :d]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was another point of confusion. Depending on which method you used to horizontally concatenate, you could either have columns overwrite one another or have column names updated so duplicates are unique. Is everyone ok with defaulting to not overwriting column names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic merge(!) methods use the value from the last argument in case of conflict, so I think we should respect this. It could make sense to also support passing a combine function.

Maybe that's an argument to keep hcat, with a different behavior. For that function, the situation is very similar to that faced by NamedArray or AxisArray: duplicate column names are not allowed, so either an error must be thrown, or names have to be made unique. I would suggest throwing an error by default, with a boolean keyword argument to make names unique when needed; but other solutions can be considered (we could discuss that with the authors of the two named array packages).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Could you add docstrings explaining in a few words the DataTable-specific behavior? I.e. that it operates on columns (AFAICT that's the only precision which is needed beyond the generic docstring from Base). Also, if that's possible a few deprecations wouldn't hurt. Finally, I see you made some methods more generic by accepting any AbstractDataTable as first argument. This probably needs to throw an error for SubDataTable (and be tested)?


# catch-all to cover cases where indexing returns a DataTable and copy doesn't
Base.hcat(dt::AbstractDataTable, x) = hcat!(dt[:, :], x)
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt1[:, :], dt2)
Base.merge(dt::AbstractDataTable, dtn::AbstractDataTable...) = merge!(dt[:, :], dtn...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deepcopy would be more natural, even if the result is the same?

Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt1[:, :], dt2)
Base.merge(dt::AbstractDataTable, dtn::AbstractDataTable...) = merge!(dt[:, :], dtn...)

function Base.append!(dt1::AbstractDataTable, x::AbstractVector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a one-liner here and below.

@@ -39,6 +39,7 @@ export @~,
SubDataTable,

aggregate,
append,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I can see the point of implementing generic functions which exist in Base, I don't like the idea of creating append, with such a general name and so similar to append!. Especially since merge does the same thing.

@@ -109,7 +109,7 @@ module TestDataTable
dt = DataTable(a=[1, 2], b=[3., 4.])
dt2 = DataTable(b=["a", "b"], c=[:c, :d])
@test isequal(merge!(dt, dt2), dt)
@test isequal(dt, DataTable(a=[1, 2], b=["a", "b"], c=[:c, :d]))
@test isequal(dt, DataTable(a=[1, 2], b=[3., 4.], b_1=["a", "b"], c=[:c, :d]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic merge(!) methods use the value from the last argument in case of conflict, so I think we should respect this. It could make sense to also support passing a combine function.

Maybe that's an argument to keep hcat, with a different behavior. For that function, the situation is very similar to that faced by NamedArray or AxisArray: duplicate column names are not allowed, so either an error must be thrown, or names have to be made unique. I would suggest throwing an error by default, with a boolean keyword argument to make names unique when needed; but other solutions can be considered (we could discuss that with the authors of the two named array packages).

@@ -201,7 +201,11 @@ function combine(ga::GroupApplied)
idx[j + (1:n)] = gd.idx[start]
j += n
end
hcat!(gd.parent[idx, gd.cols], valscat)
if isa(valscat, DataTable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand. valscat = vcat(ga.vals) and ga.vals is necessarily a Vector{<:AbstractDataTable}. So valscat is always an AbstractDataTable. Why do you need to call append! when it's not a DataTable?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

duplicate functionality in merge! and hcat!
2 participants