Merge pull request #39 from FugroRoames/ajf/select-and-calc

RFC: Property interface via macros `@Select` and `@Compute`
JuliaData · Nov 30, 2018 · a67883a
2 parents a354034 + 65a9ac2
commit a67883a

Showing 12 changed files with 686 additions and 55 deletions.
diff --git a/docs/src/man/group.md b/docs/src/man/group.md
@@ -4,7 +4,6 @@ It is frequently useful to break data apart into different *groups* for process
 
 In a powerful environment such as Julia, that fully supports nested containers, it makes sense to represent each group as distinct containers, with an outer container acting as a "dictionary" of the groups. This is in contrast to environments with a less rich system of containers, such as SQL, which has popularized a slightly different notion of grouping data into a single flat tabular structure, where one (or more) columns act as the grouping key. Here we focus on the former approach.
 
-
 ## Using the `group` function
 
 *SplitApplyCombine* provides a `group` function, which can operate on arbitrary Julia objects. The function has the signature `group(by, f, iter)` where `iter` is a container that can be iterated, `by` is a function from the elements of `iter` to the grouping *key*, and the optional argument `f` is a mapping applied to the grouped elements (by default, `f = identity`, the identity function).
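
Reviewer note: the `group(by, f, iter)` semantics described in this context line can be illustrated with a Base-only sketch. The name `simple_group` and the untyped `Dict{Any, Vector{Any}}` container are assumptions made for illustration; this is not how SplitApplyCombine implements `group`.

```julia
# Base-only sketch of the `group(by, f, iter)` semantics.
# `simple_group` is a hypothetical name, not part of SplitApplyCombine.
function simple_group(by, f, iter)
    groups = Dict{Any, Vector{Any}}()
    for x in iter
        key = by(x)                                   # grouping key of this element
        push!(get!(() -> Any[], groups, key), f(x))   # map the element, then collect it
    end
    return groups
end

# `f` defaults to `identity`, mirroring the optional argument
simple_group(by, iter) = simple_group(by, identity, iter)

people = ["Alice", "Adam", "Bob"]
by_initial = simple_group(first, people)            # group names by first letter
lengths    = simple_group(first, length, people)    # same keys, but store name lengths
```

Each distinct key maps to its own inner container, which is the nested-container notion of grouping that the surrounding text contrasts with the flat SQL style.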

diff --git a/docs/src/man/reference.md b/docs/src/man/reference.md
@@ -19,3 +19,12 @@ TypedTables.FlexTable
 TypedTables.columns
 TypedTables.columnnames
 ```
+
+## Convenience macros
+
+These macros return *functions* that can be applied to tables and rows.
+
+```@docs
+TypedTables.@Compute
+TypedTables.@Select
+```
diff --git a/docs/src/man/tutorial.md b/docs/src/man/tutorial.md
@@ -327,22 +327,55 @@ Table with 2 columns and 3 rows:
  3 │ C        false
 ```
 
-It is worth being aware of a special function `getproperty`, which is Julia's function for
-the `.` operator - that is `a.b` is just convenient shorthand syntax for
-`getproperty(a, :b)`. The function `getproperty(:b)` returns *another function* such that
-`getproperty(:b)(a)` is the same as `a.b`. If you wish to programmatically select a column
-of a Table, you can use `getproperty` to do so.
+Writing anonymous functions can become laborious when dealing with many rows, so the
+convenience macros `@Select` and `@Compute` are provided to aid in their construction.
+
+The `@Select` macro returns a function that can map a row to a new row (or a table to a
+new table) by defining a functional mapping for each output column. The above example can
+alternatively be written as:
+
+```julia
+julia> map(@Select(initial = first($name), is_old = $age > 40), t)
+Table with 2 columns and 3 rows:
+     initial  is_old
+   ┌────────────────
+ 1 │ A        false
+ 2 │ B        true
+ 3 │ C        false
+```
+
+For shorthand, the `= ...` can be omitted to simply extract a column. For example, we can
+reorder the columns via
+
+```julia
+julia> @Select(age, name)(t)
+Table with 2 columns and 3 rows:
+     age  name
+   ┌─────────────
+ 1 │ 25   Alice
+ 2 │ 42   Bob
+ 3 │ 37   Charlie
+```
+(Note that here we "select" columns directly, rather than using `map` to select the fields
+of each row.)
+
+The `@Compute` macro returns a function that maps a row to a value. As with `@Select`, the
+input column names are prepended with `$`, for example:
 
 ```julia
-julia> map(getproperty(:name), t)
+julia> map(@Compute($name), t)
 3-element Array{String,1}:
  "Alice"  
  "Bob"    
  "Charlie"
 ```
-In fact, `Table` will know that getting a certain field of every row via `map` is the same
-as simply extracting the column `name`, and this operation will be fast. This will be most
-useful in the operations below.
+
+Unlike an anonymous function, these two macros create an introspectable function that allows
+computations to take advantage of columnar storage and advanced features like acceleration
+indices. Calculations may therefore run faster with the macros across a wide variety of
+functions, such as `map`, `broadcast`, `filter`, `findall`, `reduce`, `group` and
+`innerjoin`. For instance, the example above simply extracts the `name` column from `t`,
+without performing an explicit map.
 
 ## Grouping data
 
@@ -392,7 +425,7 @@ Sometimes you may want to transform the grouped data - you can do so by passing
 mapping function. For example, we may want to group firstnames by lastname.
 
 ```julia
-julia> group(getproperty(:lastname), getproperty(:firstname), t2)
+julia> group(@Compute($lastname), @Compute($firstname), t2)
 Dict{String,Array{String,1}} with 4 entries:
   "King"     => ["Arthur"]
   "Williams" => ["Adam", "Eve"]
@@ -406,7 +439,7 @@ If instead, our group elements are rows (named tuples), each group will itself b
 For example, we can keep the entire row by dropping the second function.
 
 ```julia
-julia> families = group(getproperty(:lastname), t2)
+julia> families = group(@Compute($lastname), t2)
 Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,String,Int64}},1,NamedTuple{(:firstname, :lastname, :age),Tuple{Array{String,1},Array{String,1},Array{Int64,1}}}},Dict{String,Array{Int64,1}}} with 4 entries:
   "King"     => Table with 3 columns and 1 row:…
   "Williams" => Table with 3 columns and 2 rows:…
@@ -417,7 +450,7 @@ Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,St
 The results are only summarized above (for compactness), but can be easily accessed.
 
 ```julia
-julia> familes["Smith"]
+julia> families["Smith"]
 Table with 3 columns and 3 rows:
      firstname  lastname  age
    ┌─────────────────────────
@@ -465,7 +498,7 @@ function expects two functions, to describe the joining key of the first table a
 joining key of the second table. We will use `getproperty` to select the columns.
 
 ```julia
-julia> innerjoin(getproperty(:id), getproperty(:customer_id), customers, orders)
+julia> innerjoin(@Compute($id), @Compute($customer_id), customers, orders)
 Table with 5 columns and 4 rows:
      id  name     address          customer_id  items
    ┌─────────────────────────────────────────────────────
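
Reviewer note: for readers new to the macros, the per-row behaviour documented in this file can be approximated with ordinary anonymous functions over named tuples. This is only a sketch of the documented semantics, assuming rows are `NamedTuple`s held in a plain `Vector`; it does not use TypedTables and does not reproduce the introspection the macros provide.

```julia
# Plain-Julia sketch: each row is a NamedTuple, the "table" is a Vector of rows.
rows = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# Roughly what `@Compute($name)` computes for each row: one value per row
compute_name = row -> row.name

# Roughly what `@Select(initial = first($name), is_old = $age > 40)` computes
# for each row: a new row with one field per output column
select_row = row -> (initial = first(row.name), is_old = row.age > 40)

names_out = map(compute_name, rows)
selected  = map(select_row, rows)
```

The difference in the real macros is that the referenced column names are recorded alongside the function, so operations such as `map` can inspect them instead of treating the function as opaque.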

diff --git a/src/FlexTable.jl b/src/FlexTable.jl
@@ -62,6 +62,8 @@ function Base.setproperty!(t::FlexTable, name::Symbol, ::Nothing)
     return t
 end
 
+propertytype(::FlexTable{N}) where {N} = FlexTable{N}
+
 """
     columnnames(table)
 
@@ -265,4 +267,45 @@ function Base.vec(t::FlexTable)
     return FlexTable{1}(map(vec, columns(t)))
 end
 
+# "Bulk" operations on FlexTables should generally first unwrap to Tables
+_flex(t::Table{<:Any, N}) where {N} = FlexTable(columns(t))
+_flex(t) = t
+
+Broadcast.broadcastable(t::FlexTable) = Table(t)
+
+Base.map(f, t::FlexTable{N}) where {N} = _flex(map(f, rows(t)))::AbstractArray{<:Any, N}
+Base.map(f, t::FlexTable{N}, t2) where {N} = _flex(map(f, rows(t), t2))::AbstractArray{<:Any, N}
+Base.map(f, t, t2::FlexTable{N}) where {N} = _flex(map(f, t, rows(t2)))::AbstractArray{<:Any, N}
+Base.map(f, t::FlexTable{N}, t2::FlexTable{N}) where {N} = _flex(map(f, rows(t), rows(t2)))::AbstractArray{<:Any, N}
+
+Base.mapreduce(f, op, t::FlexTable; kwargs...) = mapreduce(f, op, rows(t); kwargs...)
+
+Base.filter(f, t::FlexTable{N}) where {N} = FlexTable(filter(f, rows(t)))::FlexTable{N}
+
+SplitApplyCombine.mapview(f, t::FlexTable{N}) where {N} = _flex(mapview(f, rows(t)))::AbstractArray{<:Any, N}
+SplitApplyCombine.mapview(f, t::FlexTable{N}, t2) where {N} = _flex(mapview(f, rows(t), t2))::AbstractArray{<:Any, N}
+SplitApplyCombine.mapview(f, t, t2::FlexTable{N}) where {N} = _flex(mapview(f, t, rows(t2)))::AbstractArray{<:Any, N}
+SplitApplyCombine.mapview(f, t::FlexTable{N}, t2::FlexTable{N}) where {N} = _flex(mapview(f, rows(t), rows(t2)))::AbstractArray{<:Any, N}
+
 SplitApplyCombine.group(by, f, t::FlexTable) = group(by, f, rows(t))
+SplitApplyCombine.groupview(by, f, t::FlexTable) = groupview(by, f, rows(t))
+SplitApplyCombine.groupinds(by, t::FlexTable) = groupinds(by, rows(t))
+SplitApplyCombine.groupreduce(by, f, op, t::FlexTable; kwargs...) = groupreduce(by, f, op, rows(t); kwargs...)
+
+SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1::FlexTable, t2) = _flex(innerjoin(lkey, rkey, f, cmp, rows(t1), t2))
+SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1, t2::FlexTable) = _flex(innerjoin(lkey, rkey, f, cmp, t1, rows(t2)))
+SplitApplyCombine.innerjoin(lkey, rkey, f, cmp, t1::FlexTable, t2::FlexTable) = _flex(innerjoin(lkey, rkey, f, cmp, rows(t1), rows(t2)))
+
+Base.:(==)(t1::FlexTable{N}, t2::AbstractArray{<:Any,N}) where {N} = (rows(t1) == t2)
+Base.:(==)(t1::AbstractArray{<:Any,N}, t2::FlexTable{N}) where {N} = (t1 == rows(t2))
+Base.:(==)(t1::FlexTable{N}, t2::FlexTable{N}) where {N} = (rows(t1) == rows(t2))
+
+Base.isequal(t1::FlexTable{N}, t2::AbstractArray{<:Any,N}) where {N} = isequal(rows(t1), t2)
+Base.isequal(t1::AbstractArray{<:Any,N}, t2::FlexTable{N}) where {N} = isequal(t1, rows(t2))
+Base.isequal(t1::FlexTable{N}, t2::FlexTable{N}) where {N} = isequal(rows(t1), rows(t2))
+
+Base.isless(t1::FlexTable{1}, t2::AbstractVector) = isless(rows(t1), t2)
+Base.isless(t1::AbstractVector, t2::FlexTable{1}) = isless(t1, rows(t2))
+Base.isless(t1::FlexTable{1}, t2::FlexTable{1}) = isless(rows(t1), rows(t2))
+
+Base.hash(t::FlexTable, h::UInt) = hash(rows(t), h)
diff --git a/src/Table.jl b/src/Table.jl
@@ -88,6 +88,8 @@ function Base.setproperty!(t::Table, name::Symbol, a)
     error("type Table is immutable. Set the values of an existing column with the `.=` operator, e.g. `table.name .= array`.")
 end
 
+propertytype(::Table) = Table
+
 """
     columnnames(table)
 

diff --git a/src/TypedTables.jl b/src/TypedTables.jl
@@ -7,19 +7,9 @@ using SplitApplyCombine
 using Base: @propagate_inbounds, @pure, OneTo, Fix2
 import Tables.columns, Tables.rows
 
+export @Compute, @Select
 export Table, FlexTable, columns, rows, columnnames, showtable
 
-# GetProperty
-struct GetProperty{name}
-end
-@inline GetProperty(name::Symbol) = GetProperty{name}()
-
-@inline function Base.getproperty(sym::Symbol)
-	return GetProperty(sym)
-end
-
-@inline (::GetProperty{name})(x) where {name} = getproperty(x, name)
-
 # Resultant element type of given column arrays
 @generated function _eltypes(a::NamedTuple{names, T}) where {names, T <: Tuple{Vararg{AbstractArray}}}
     Ts = []
@@ -44,6 +34,7 @@ let
     end
 end
 
+include("properties.jl")
 include("Table.jl")
 include("FlexTable.jl")
 include("columnops.jl")
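
Reviewer note: the `GetProperty` definition removed from this file (presumably relocated to `properties.jl`, which is not shown here) follows a callable-singleton pattern that is worth spelling out. The sketch below uses the hypothetical names `FieldGetter` and `getter_name` so that it does not clash with the package; it is a standalone illustration of the pattern, not the package's code.

```julia
# A callable singleton whose type parameter carries the property name.
# `FieldGetter` and `getter_name` are hypothetical names for illustration.
struct FieldGetter{name} end

# Convenience constructor: lift a runtime Symbol into the type parameter
FieldGetter(name::Symbol) = FieldGetter{name}()

# Calling the object extracts the named property from its argument
(::FieldGetter{name})(x) where {name} = getproperty(x, name)

# Unlike an anonymous function, the name can be recovered from the type
getter_name(::FieldGetter{name}) where {name} = name

g = FieldGetter(:age)
row = (name = "Bob", age = 42)
value = g(row)               # same as `row.age`
recovered = getter_name(g)   # the Symbol stored in the type
```

Because the name lives in the type, methods can dispatch on it, which is what lets specialised `map` or `mapview` methods shortcut to a direct column lookup.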

diff --git a/src/columnops.jl b/src/columnops.jl
@@ -1,33 +1,77 @@
 # Column-based operations: Some operations on rows are faster when considering columns
 
 # In `map`, the output shouldn't alias inputs, so copies are made
-Base.map(::typeof(identity), t::Union{FlexTable, Table}) = copy(t)
+Base.map(::typeof(identity), t::Table) = copy(t)
 
-Base.map(::typeof(merge), t::Union{FlexTable, Table}) = copy(t)
+Base.map(::typeof(merge), t::Table) = copy(t)
 
 function Base.map(::typeof(merge), t1::Table, t2::Table)
     return copy(Table(merge(columns(t1), columns(t2))))
 end
 
-function Base.map(::typeof(merge), df1::Union{Table{<:Any, N}, FlexTable{N}}, df2::Union{Table{<:Any, N}, FlexTable{N}}) where {N}
-    return copy(FlexTable{N}(merge(columns(df1), columns(df2))))
+function Base.map(f::GetProperty, t::Table)
+    return copy(f(t))
 end
 
-function Base.map(::GetProperty{name}, t::Union{Table{<:Any, N}, FlexTable{N}}) where {name, N}
-    return copy(getproperty(t, name::Symbol))::AbstractArray{<:Any, N}
+@inline function Base.map(f::GetProperties, t::Table)
+    return copy(f(t))
+end
+
+@inline function Base.map(f::Compute{names}, t::Table) where {names}
+    # minimize number of columns before iterating over the rows
+    map(f, GetProperties(names)(t))
+end
+
+@inline function Base.map(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
+    # efficient to iterate over rows with a minimal number of columns
+    if length(names) == 1 # unwrap in the simple cases
+        return map(f.f, getproperty(names[1])(t))
+    elseif length(names) == 2
+        return map(f.f, getproperty(names[1])(t), getproperty(names[2])(t))
+    end
+
+    invoke(map, Tuple{Function, typeof(t)}, f, t)
+end
+
+@generated function Base.map(s::Select{names}, t::Table) where {names}
+    exprs = [:($(names[i]) = map(s.fs[$i], t)) for i in 1:length(names)]
+
+    return :(Table($(Expr(:tuple, exprs...))))
 end
 
 # In `mapview`, the output should alias the inputs
+SplitApplyCombine.mapview(::typeof(merge), t::Table) = t
+
 function SplitApplyCombine.mapview(::typeof(merge), t1::Table, t2::Table)
     return Table(merge(columns(t1), columns(t2)))
 end
 
-function SplitApplyCombine.mapview(::typeof(merge), df1::Union{Table{<:Any, N}, FlexTable{N}}, df2::Union{Table{<:Any, N}, FlexTable{N}}) where {N}
-    return FlexTable{N}(merge(columns(df1), columns(df2)))
+@inline function SplitApplyCombine.mapview(f::GetProperty, t::Table)
+    return f(t)
+end
+
+@inline function SplitApplyCombine.mapview(f::GetProperties, t::Table)
+    return f(t)
 end
 
-@inline function SplitApplyCombine.mapview(f::GetProperty{name}, t::Union{Table{<:Any, N}, FlexTable{N}}) where {name,  N}
-    return getproperty(t, name::Symbol)::AbstractArray{<:Any, N}
+@inline function SplitApplyCombine.mapview(f::Compute{names}, t::Table) where {names}
+    # minimize number of columns before iterating over the rows
+    mapview(f, GetProperties(names)(t))
+end
+
+@inline function SplitApplyCombine.mapview(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
+    # efficient to iterate over rows with a minimal number of columns
+    if length(names) == 1 # unwrap in the simple cases (consider 2-argument version)
+        return mapview(f.f, getproperty(names[1])(t))
+    end
+
+    invoke(mapview, Tuple{Function, typeof(t)}, f, t)
+end
+
+@generated function SplitApplyCombine.mapview(s::Select{names}, t::Table) where {names}
+    exprs = [:($(names[i]) = mapview(s.fs[$i], t)) for i in 1:length(names)]
+
+    return :(Table($(Expr(:tuple, exprs...))))
 end
 
 # broadcast
@@ -36,14 +80,88 @@ end
 	Table(merge(map(columns, ts)...))
 end
 
-@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, ::typeof(merge), ts::Union{Table{<:Any, N},FlexTable{N}}...) where {N}
-	FlexTable{N}(merge(map(columns, ts)...))
+@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty, t::Table{<:Any, N}) where {N}
+	return f(t)
 end
 
-@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty{names}, t::Table{<:Any, N}) where {N, name}
-	return getproperty(t, name::Symbol)
+@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperties, t::Table{<:Any, N}) where {N}
+    return f(t)
+end
+
+@inline function Broadcast.broadcasted(style::Broadcast.DefaultArrayStyle{N}, f::Compute{names}, t::Table{<:NamedTuple, N}) where {N, names}
+    # minimize number of columns before iterating over the rows
+    return Broadcast.broadcasted(style, f, GetProperties(names)(t))
+end
+
+@inline function Broadcast.broadcasted(style::Broadcast.DefaultArrayStyle{N}, f::Compute{names}, t::Table{<:NamedTuple{names}, N}) where {N, names}
+    # efficient to iterate over rows with a minimal number of columns
+    if length(names) == 1 # unwrap in the simple cases
+        return Broadcast.broadcasted(f.f, getproperty(names[1])(t))
+    elseif length(names) == 2
+        return Broadcast.broadcasted(f.f, getproperty(names[1])(t), getproperty(names[2])(t))
+    end
+
+    invoke(Broadcast.broadcasted, Tuple{typeof(style), Function, typeof(t)}, style, f, t)
 end
 
-@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::GetProperty{names}, t::FlexTable{N}) where {N, name}
-	return getproperty(t, name::Symbol)::AbstractArray{<:Any, N}
+@inline function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{N}, f::Select, t::Table{<:Any, N}) where {N}
+    return mapview(f, t)
+end
+
+# I'm not 100% sure how wise this pattern is...
+Broadcast.materialize(t::Table) = Table(map(_materialize, columns(t)))
+_materialize(x) = Broadcast.materialize(x)
+_materialize(x::MappedArray) = copy(x)
+
+# mapreduce
+
+function Base.mapreduce(f::GetProperty, op, t::Table; kwargs...)
+    return mapreduce(identity, op, f(t); kwargs...)
+end
+
+function Base.mapreduce(f::GetProperties, op, t::Table; kwargs...)
+    return mapreduce(identity, op, f(t); kwargs...)
+end
+
+function Base.mapreduce(f::Compute{names}, op, t::Table; kwargs...) where {names}
+    # minimize number of columns before iterating over the rows
+    t2 = GetProperties(names)(t)
+    return mapreduce(f, op, t2; kwargs...)
+end
+
+function Base.mapreduce(f::Compute{names}, op, t::Table{<:NamedTuple{names}}; kwargs...) where {names}
+    # efficient to iterate over rows with a minimal number of columns
+    if length(names) == 1 # unwrap in the simple cases
+        return mapreduce(f.f, op, getproperty(names[1])(t); kwargs...)
+    elseif length(names) == 2
+        return mapreduce(f.f, op, getproperty(names[1])(t), getproperty(names[2])(t); kwargs...)
+    end
+
+    invoke(mapreduce, Tuple{Function, typeof(op), typeof(t)}, f, op, t; kwargs...)
+end
+
+# `filter(f, t)` defaults to `t[map(f, t)]`
+
+function Base.filter(f::GetProperty, t::Table)
+    return @inbounds t[f(t)::AbstractArray{Bool}]
+end
+
+# findall
+
+function Base.findall(f::GetProperty, t::Table)
+    return findall(identity, f(t))
+end
+
+function Base.findall(f::Compute{names}, t::Table) where {names}
+    # minimize number of columns before iterating over the rows
+    return findall(f, GetProperties(names)(t))
+end
+
+function Base.findall(f::Compute{names}, t::Table{<:NamedTuple{names}}) where {names}
+    # efficient to iterate over rows with a minimal number of columns
+    if length(names) == 1 # unwrap in the simple cases
+        return findall(f.f, getproperty(names[1])(t))
+    end
+
+    invoke(findall, Tuple{Function, typeof(t)}, f, t)
 end
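
Reviewer note: the recurring "minimize number of columns" and "unwrap" comments in `columnops.jl` describe one idea, namely that a function which knows its input column names can be mapped over just those columns rather than over full rows. The sketch below shows that idea with hypothetical `MiniTable` and `MiniCompute` types; the field layout and constructor are assumptions and only loosely mirror `Table` and `Compute`.

```julia
# Hypothetical column-stored table: a NamedTuple of equal-length vectors
struct MiniTable{C <: NamedTuple}
    columns::C
end

# Hypothetical stand-in for `Compute`: a function plus the names of its inputs
struct MiniCompute{names, F}
    f::F
end
MiniCompute{names}(f::F) where {names, F} = MiniCompute{names, F}(f)

# Column-aware `map`: fetch only the named columns and map `f` over them
# directly, so unused columns are never touched and no row tuples are built.
function Base.map(c::MiniCompute{names}, t::MiniTable) where {names}
    cols = map(n -> getfield(t.columns, n), names)   # tuple of the needed columns
    return map(c.f, cols...)
end

t = MiniTable((name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37], city = ["X", "Y", "Z"]))

is_old = MiniCompute{(:age,)}(age -> age > 40)
label  = MiniCompute{(:name, :age)}((name, age) -> string(first(name), age))

flags  = map(is_old, t)   # reads only the `age` column
labels = map(label, t)    # reads `name` and `age`, never `city`
```

In this sketch the one- and two-column cases need no special branches because splatting handles any arity; the diff instead special-cases small arities and falls back to row iteration otherwise.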