Merge pull request #39 from FugroRoames/ajf/select-and-calc
RFC: Property interface via macros `@Select` and `@Compute`
andyferris committed Nov 30, 2018
2 parents a354034 + 65a9ac2 commit a67883a
Showing 12 changed files with 686 additions and 55 deletions.
1 change: 0 additions & 1 deletion docs/src/man/group.md
@@ -4,7 +4,6 @@ It is frequently useful to break data apart into different *groups* for process

In a powerful environment such as Julia, that fully supports nested containers, it makes sense to represent each group as distinct containers, with an outer container acting as a "dictionary" of the groups. This is in contrast to environments with a less rich system of containers, such as SQL, which has popularized a slightly different notion of grouping data into a single flat tabular structure, where one (or more) columns act as the grouping key. Here we focus on the former approach.


## Using the `group` function

*SplitApplyCombine* provides a `group` function, which can operate on arbitrary Julia objects. The function has the signature `group(by, f, iter)` where `iter` is a container that can be iterated, `by` is a function from the elements of `iter` to the grouping *key*, and the optional argument `f` is a mapping applied to the grouped elements (by default, `f = identity`, the identity function).
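
As a rough illustration of these semantics, the following Base-only sketch mimics `group(by, f, iter)` with a plain `Dict`. The name `group_sketch` is hypothetical, and this is not how *SplitApplyCombine* implements `group`; it only shows how `by` chooses the key and how `f` transforms the elements stored under it.

```julia
# Hypothetical stand-in for `group(by, f, iter)`, using only Base.
function group_sketch(by, f, iter)
    groups = Dict{Any, Vector{Any}}()
    for x in iter
        # `by(x)` picks the group; `f(x)` is what gets stored in it.
        push!(get!(() -> Any[], groups, by(x)), f(x))
    end
    return groups
end

# With `f` omitted, elements are stored unchanged (f = identity).
group_sketch(by, iter) = group_sketch(by, identity, iter)

people = ["Alice", "Bob", "Anna", "Charlie"]
by_initial = group_sketch(first, people)                  # keyed by first letter
lengths_by_initial = group_sketch(first, length, people)  # stores name lengths instead
```

Here `first` plays the role of `by`, and `length` the role of the optional mapping `f`.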
9 changes: 9 additions & 0 deletions docs/src/man/reference.md
@@ -19,3 +19,12 @@ TypedTables.FlexTable
TypedTables.columns
TypedTables.columnnames
```

## Convenience macros

These macros return *functions* that can be applied to tables and rows.

```@docs
TypedTables.@Compute
TypedTables.@Select
```
59 changes: 46 additions & 13 deletions docs/src/man/tutorial.md
@@ -327,22 +327,55 @@ Table with 2 columns and 3 rows:
3 │ C false
```

It is worth being aware of a special function `getproperty`, which is Julia's function for
the `.` operator - that is `a.b` is just convenient shorthand syntax for
`getproperty(a, :b)`. The function `getproperty(:b)` returns *another function* such that
`getproperty(:b)(a)` is the same as `a.b`. If you wish to programmatically select a column
of a Table, you can use `getproperty` to do so.
Writing anonymous functions can become laborious when dealing with many rows, so the
convenience macros `@Select` and `@Compute` are provided to aid in their construction.

The `@Select` macro returns a function that can map a row to a new row (or a table to a
new table) by defining a functional mapping for each output column. The above example can
alternatively be written as:

```julia
julia> map(@Select(initial = first($name), is_old = $age > 40), t)
Table with 2 columns and 3 rows:
     initial  is_old
   ┌────────────────
 1 │ A        false
 2 │ B        true
 3 │ C        false
```
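
For comparison, the per-row mapping that this `@Select` call describes can be written by hand as an ordinary anonymous function. The sketch below uses a plain vector of named tuples rather than a `Table` (an assumption made so that it runs with Base alone), and the names `people` and `select_by_hand` are illustrative only. Each `$name` in the macro call corresponds to `row.name` here.

```julia
# Rows as plain named tuples; no table package is needed for this comparison.
people = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# Each output column is written as a function of the input row's fields.
select_by_hand = row -> (initial = first(row.name), is_old = row.age > 40)

result = map(select_by_hand, people)
```

The hand-written closure is opaque to the table machinery, whereas the macro form records which columns each output depends on.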

As shorthand, the `= ...` can be omitted to simply extract a column. For example, we can
reorder the columns via

```julia
julia> @Select(age, name)(t)
Table with 2 columns and 3 rows:
     age  name
   ┌─────────────
 1 │ 25   Alice
 2 │ 42   Bob
 3 │ 37   Charlie
```
(Note that here we "select" columns directly, rather than using `map` to select the fields
of each row.)
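
To make the distinction concrete, here is a minimal Base-only sketch of selecting columns directly, assuming the columns are held as a named tuple of vectors. The variable names are illustrative, and this is not the macro's implementation; it only shows that reordering columns need not visit any rows.

```julia
# Columns stored as a named tuple of vectors (the columnar layout).
cols = (name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Pick fields by name, in the requested order; the vectors are not copied.
reordered = NamedTuple{(:age, :name)}(cols)
```

In this sketch `reordered.age` is the very same array as `cols.age`, so the cost does not depend on the number of rows.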

The `@Compute` macro returns a function that maps a row to a value. As with `@Select`, the
input column names are prefixed with `$`, for example:

```julia
julia> map(getproperty(:name), t)
julia> map(@Compute($name), t)
3-element Array{String,1}:
"Alice"
"Bob"
"Charlie"
```
In fact, `Table` will know that getting a certain field of every row via `map` is the same
as simply extracting the column `name`, and this operation will be fast. This will be most
useful in the operations below.

Unlike an anonymous function, these two macros create an introspectable function that allows
computations to take advantage of columnar storage and advanced features like acceleration
indices. Calculations written with the macros may therefore run faster across a wide
variety of functions, such as `map`, `broadcast`, `filter`, `findall`, `reduce`, `group` and
`innerjoin`. For instance, the example above simply extracts the `name` column from `t`,
without performing an explicit map.
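
What "introspectable" buys can be sketched with a small callable struct that carries the names of the columns it reads in its type. The type `ComputeSketch` and the helper `map_columns` below are hypothetical names for illustration, not the package's actual definitions. The idea is only that, once the required column names are known, a mapping can be applied to those columns directly instead of to whole rows.

```julia
# Hypothetical analogue of a `@Compute` result: a callable that also records,
# in its type, the column names it reads.
struct ComputeSketch{names, F}
    f::F
end
ComputeSketch{names}(f::F) where {names, F} = ComputeSketch{names, F}(f)

# Applied to a row: fetch the named fields and pass them to `f`.
(c::ComputeSketch{names})(row) where {names} =
    c.f(map(n -> getproperty(row, n), names)...)

# Applied to columns: only the named columns are touched, never whole rows.
map_columns(c::ComputeSketch{names}, cols::NamedTuple) where {names} =
    map(c.f, map(n -> getproperty(cols, n), names)...)

is_old = ComputeSketch{(:age,)}(age -> age > 40)
cols = (name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

row_result = is_old((name = "Bob", age = 42))   # row-wise call
column_result = map_columns(is_old, cols)       # column-wise call
```

An anonymous function such as `row -> row.age > 40` gives the same row-wise answer, but nothing about it reveals that only `age` is needed, so the column-wise shortcut is unavailable.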

## Grouping data

@@ -392,7 +425,7 @@ Sometimes you may want to transform the grouped data - you can do so by passing
mapping function. For example, we may want to group firstnames by lastname.

```julia
julia> group(getproperty(:lastname), getproperty(:firstname), t2)
julia> group(@Compute($lastname), @Compute($firstname), t2)
Dict{String,Array{String,1}} with 4 entries:
"King" => ["Arthur"]
"Williams" => ["Adam", "Eve"]
@@ -406,7 +439,7 @@ If instead, our group elements are rows (named tuples), each group will itself b
For example, we can keep the entire row by dropping the second function.

```julia
julia> families = group(getproperty(:lastname), t2)
julia> families = group(@Compute($lastname), t2)
Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,String,Int64}},1,NamedTuple{(:firstname, :lastname, :age),Tuple{Array{String,1},Array{String,1},Array{Int64,1}}}},Dict{String,Array{Int64,1}}} with 4 entries:
"King" => Table with 3 columns and 1 row:
"Williams" => Table with 3 columns and 2 rows:
@@ -417,7 +450,7 @@ Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,St
The results are only summarized above (for compactness), but can be easily accessed.

```julia
julia> familes["Smith"]
julia> families["Smith"]
Table with 3 columns and 3 rows:
firstname lastname age
┌─────────────────────────
@@ -465,7 +498,7 @@ function expects two functions, to describe the joining key of the first table a
joining key of the second table. We will use `getproperty` to select the columns.

```julia
julia> innerjoin(getproperty(:id), getproperty(:customer_id), customers, orders)
julia> innerjoin(@Compute($id), @Compute($customer_id), customers, orders)
Table with 5 columns and 4 rows:
id name address customer_id items
┌─────────────────────────────────────────────────────
43 changes: 43 additions & 0 deletions src/FlexTable.jl
@@ -62,6 +62,8 @@ function Base.setproperty!(t::FlexTable, name::Symbol, ::Nothing)
return t
end

propertytype(::FlexTable{N}) where {N} = FlexTable{N}

"""
columnnames(table)
@@ -265,4 +267,45 @@ function Base.vec(t::FlexTable)
return FlexTable{1}(map(vec, columns(t)))
end

# "Bulk" operations on FlexTables should generally first unwrap to Tables
_flex(t::Table{<:Any, N}) where {N} = FlexTable(columns(t))
_flex(t) = t

Broadcast.broadcastable(t::FlexTable) = Table(t)

Base.map(f, t::FlexTable{N}) where {N} = _flex(map(f, rows(t)))::AbstractArray{<:Any, N}
Base.map(f, t::FlexTable{N}, t2) where {N} = _flex(map(f, rows(t), t2))::AbstractArray{<:Any, N}
Base.map(f, t, t2::FlexTable{N}) where {N} = _flex(map(f, t, rows(t2)))::AbstractArray{<:Any, N}
Base.map(f, t::FlexTable{N}, t2::FlexTable{N}) where {N} = _flex(map(f, rows(t), rows(t2)))::AbstractArray{<:Any, N}

Base.mapreduce(f, op, t::FlexTable; kwargs...) = mapreduce(f, op, rows(t); kwargs...)

Base.filter(f, t::FlexTable{N}) where {N} = FlexTable(filter(f, rows(t)))::FlexTable{N}

SplitApplyCombine.mapview(f, t::FlexTable{N}) where {N} = _flex(mapview(f, rows(t)))::AbstractArray{<:Any, N}
SplitApplyCombine.mapview(f, t::FlexTable{N}, t2) where {N} = _flex(mapview(f, rows(t), t2))::AbstractArray{<:Any, N}
SplitApplyCombine.mapview(f, t, t2::FlexTable{N}) where {N} = _flex(mapview(f, t, rows(t2)))::AbstractArray{<:Any, N}
SplitApplyCombine.mapview(f, t::FlexTable{N}, t2::FlexTable{N}) where {N} = _flex(mapview(f, rows(t), rows(t2)))::AbstractArray{<:Any, N}

SplitApplyCombine.group(by, f, t::FlexTable) = group(by, f, rows(t))
SplitApplyCombine.groupview(by, f, t::FlexTable) = groupview(by, f, rows(t))
SplitApplyCombine.groupinds(by, t::FlexTable) = groupinds(by, rows(t))
SplitApplyCombine.groupreduce(by, f, op, t::FlexTable; kwargs...) = groupreduce(by, f, op, rows(t); kwargs...)

SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1::FlexTable, t2) = _flex(innerjoin(lkey, rkey, f, cmp, rows(t1), t2))
SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1, t2::FlexTable) = _flex(innerjoin(lkey, rkey, f, cmp, t1, rows(t2)))
SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1::FlexTable, t2::FlexTable) = _flex(innerjoin(lkey, rkey, f, cmp, rows(t1), rows(t2)))

Base.:(==)(t1::FlexTable{N}, t2::AbstractArray{<:Any,N}) where {N} = (rows(t1) == t2)
Base.:(==)(t1::AbstractArray{<:Any,N}, t2::FlexTable{N}) where {N} = (t1 == rows(t2))
Base.:(==)(t1::FlexTable{N}, t2::FlexTable{N}) where {N} = (rows(t1) == rows(t2))

Base.isequal(t1::FlexTable{N}, t2::AbstractArray{<:Any,N}) where {N} = isequal(rows(t1), t2)
Base.isequal(t1::AbstractArray{<:Any,N}, t2::FlexTable{N}) where {N} = isequal(t1, rows(t2))
Base.isequal(t1::FlexTable{N}, t2::FlexTable{N}) where {N} = isequal(rows(t1), rows(t2))

Base.isless(t1::FlexTable{1}, t2::AbstractVector) = isless(rows(t1), t2)
Base.isless(t1::AbstractVector, t2::FlexTable{1}) = isless(t1, rows(t2))
Base.isless(t1::FlexTable{1}, t2::FlexTable{1}) = isless(rows(t1), rows(t2))

Base.hash(t::FlexTable, h::UInt) = hash(rows(t), h)
2 changes: 2 additions & 0 deletions src/Table.jl
@@ -88,6 +88,8 @@ function Base.setproperty!(t::Table, name::Symbol, a)
error("type Table is immutable. Set the values of an existing column with the `.=` operator, e.g. `table.name .= array`.")
end

propertytype(::Table) = Table

"""
columnnames(table)
13 changes: 2 additions & 11 deletions src/TypedTables.jl
@@ -7,19 +7,9 @@ using SplitApplyCombine
using Base: @propagate_inbounds, @pure, OneTo, Fix2
import Tables.columns, Tables.rows

export @Compute, @Select
export Table, FlexTable, columns, rows, columnnames, showtable

# GetProperty
struct GetProperty{name}
end
@inline GetProperty(name::Symbol) = GetProperty{name}()

@inline function Base.getproperty(sym::Symbol)
return GetProperty(sym)
end

@inline (::GetProperty{name})(x) where {name} = getproperty(x, name)

# Resultant element type of given column arrays
@generated function _eltypes(a::NamedTuple{names, T}) where {names, T <: Tuple{Vararg{AbstractArray}}}
Ts = []
@@ -44,6 +34,7 @@ let
end
end

include("properties.jl")
include("Table.jl")
include("FlexTable.jl")
include("columnops.jl")
150 changes: 134 additions & 16 deletions src/columnops.jl
@@ -1,33 +1,77 @@
# Column-based operations: Some operations on rows are faster when considering columns

# In `map`, the output shouldn't alias inputs, so copies are made
Base.map(::typeof(identity), t::Union{FlexTable, Table}) = copy(t)
Base.map(::typeof(identity), t::Table) = copy(t)

Base.map(::typeof(merge), t::Union{FlexTable, Table}) = copy(t)
Base.map(::typeof(merge), t::Table) = copy(t)

function Base.map(::typeof(merge), t1::Table, t2::Table)
return copy(Table(merge(columns(t1), columns(t2))))
end

function Base.map(::typeof(merge), df1::Union{Table{<:Any, N}, FlexTable{N}}, df2::Union{Table{<:Any, N}, FlexTable{N}}) where {N}
return copy(FlexTable{N}(merge(columns(df1), columns(df2))))
function Base.map(f::GetProperty, t::Table)
return copy(f(t))
end

function Base.map(::GetProperty{name}, t::Union{Table{<:Any, N}, FlexTable{N}}) where {name, N}
return copy(getproperty(t, name::Symbol))::AbstractArray{<:Any, N}
@inline function Base.map(f::GetProperties, t::Table)
return copy(f(t))
end

@inline function Base.map(f::Compute{names}, t::Table) where {names}
# minimize number of columns before iterating over the rows
map(f, GetProperties(names)(t))
end

@inline function Base.map(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
# efficient to iterate over rows with a minimal number of columns
if length(names) == 1 # unwrap in the simple cases
return map(f.f, getproperty(names[1])(t))
elseif length(names) == 2
return map(f.f, getproperty(names[1])(t), getproperty(names[2])(t))
end

invoke(map, Tuple{Function, typeof(t)}, f, t)
end

@generated function Base.map(s::Select{names}, t::Table) where {names}
exprs = [:($(names[i]) = map(s.fs[$i], t)) for i in 1:length(names)]

return :(Table($(Expr(:tuple, exprs...))))
end

# In `mapview`, the output should alias the inputs
SplitApplyCombine.mapview(::typeof(merge), t::Table) = t

function SplitApplyCombine.mapview(::typeof(merge), t1::Table, t2::Table)
return Table(merge(columns(t1), columns(t2)))
end

function SplitApplyCombine.mapview(::typeof(merge), df1::Union{Table{<:Any, N}, FlexTable{N}}, df2::Union{Table{<:Any, N}, FlexTable{N}}) where {N}
return FlexTable{N}(merge(columns(df1), columns(df2)))
@inline function SplitApplyCombine.mapview(f::GetProperty, t::Table)
return f(t)
end

@inline function SplitApplyCombine.mapview(f::GetProperties, t::Table)
return f(t)
end

@inline function SplitApplyCombine.mapview(f::GetProperty{name}, t::Union{Table{<:Any, N}, FlexTable{N}}) where {name, N}
return getproperty(t, name::Symbol)::AbstractArray{<:Any, N}
@inline function SplitApplyCombine.mapview(f::Compute{names}, t::Table) where {names}
# minimize number of columns before iterating over the rows
mapview(f, GetProperties(names)(t))
end

@inline function SplitApplyCombine.mapview(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
# efficient to iterate over rows with a minimal number of columns
if length(names) == 1 # unwrap in the simple cases (consider 2-argument version)
return mapview(f.f, getproperty(names[1])(t))
end

invoke(mapview, Tuple{Function, typeof(t)}, f, t)
end

@generated function SplitApplyCombine.mapview(s::Select{names}, t::Table) where {names}
exprs = [:($(names[i]) = mapview(s.fs[$i], t)) for i in 1:length(names)]

return :(Table($(Expr(:tuple, exprs...))))
end

# broadcast
@@ -36,14 +80,88 @@ end
Table(merge(map(columns, ts)...))
end

@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, ::typeof(merge), ts::Union{Table{<:Any, N},FlexTable{N}}...) where {N}
FlexTable{N}(merge(map(columns, ts)...))
@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty, t::Table{<:Any, N}) where {N}
return f(t)
end

@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty{names}, t::Table{<:Any, N}) where {N, name}
return getproperty(t, name::Symbol)
@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperties, t::Table{<:Any, N}) where {N}
return f(t)
end

@inline function Broadcast.broadcasted(style::Broadcast.DefaultArrayStyle{N}, f::Compute{names}, t::Table{<:NamedTuple, N}) where {N, names}
# minimize number of columns before iterating over the rows
return Broadcast.broadcasted(style, f, GetProperties(names)(t))
end

@inline function Broadcast.broadcasted(style::Broadcast.DefaultArrayStyle{N}, f::Compute{names}, t::Table{<:NamedTuple{names}, N}) where {N, names}
# efficient to iterate over rows with a minimal number of columns
if length(names) == 1 # unwrap in the simple cases
return Broadcast.broadcasted(f.f, getproperty(names[1])(t))
elseif length(names) == 2
return Broadcast.broadcasted(f.f, getproperty(names[1])(t), getproperty(names[2])(t))
end

invoke(Broadcast.broadcasted, Tuple{typeof(style), Function, typeof(t)}, style, f, t)
end

@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty{names}, t::FlexTable{N}) where {N, name}
return getproperty(t, name::Symbol)::AbstractArray{<:Any, N}
@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::Select, t::Table{<:Any, N}) where {N}
return mapview(f, t)
end

# I'm not 100% sure how wise this pattern is...
Broadcast.materialize(t::Table) = Table(map(_materialize, columns(t)))
_materialize(x) = Broadcast.materialize(x)
_materialize(x::MappedArray) = copy(x)

# mapreduce

function Base.mapreduce(f::GetProperty, op, t::Table; kwargs...)
return mapreduce(identity, op, f(t); kwargs...)
end

function Base.mapreduce(f::GetProperties, op, t::Table; kwargs...)
return mapreduce(identity, op, f(t); kwargs...)
end

function Base.mapreduce(f::Compute{names}, op, t::Table; kwargs...) where {names}
# minimize number of columns before iterating over the rows
t2 = GetProperties(names)(t)
return mapreduce(f, op, t2; kwargs...)
end

function Base.mapreduce(f::Compute{names}, op, t::Table{<:NamedTuple{names}}; kwargs...) where {names}
# efficient to iterate over rows with a minimal number of columns
if length(names) == 1 # unwrap in the simple cases
return mapreduce(f.f, op, getproperty(names[1])(t); kwargs...) # forward keyword arguments such as `init`
elseif length(names) == 2
return mapreduce(f.f, op, getproperty(names[1])(t), getproperty(names[2])(t); kwargs...)
end

invoke(mapreduce, Tuple{Function, typeof(op), typeof(t)}, f, op, t; kwargs...)
end

# `filter(f, t)` defaults to `t[map(f, t)]`

function Base.filter(f::GetProperty, t::Table)
return @inbounds t[f(t)::AbstractArray{Bool}]
end

# findall

function Base.findall(f::GetProperty, t::Table)
return findall(identity, f(t))
end

function Base.findall(f::Compute{names}, t::Table) where {names}
# minimize number of columns before iterating over the rows
return findall(f, GetProperties(names)(t))
end

function Base.findall(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
# efficient to iterate over rows with a minimal number of columns
if length(names) == 1 # unwrap in the simple cases
return findall(f.f, getproperty(names[1])(t))
end

invoke(findall, Tuple{Function, typeof(t)}, f, t)
end
