RFC: Property interface via macros `@Select` and `@Compute` #39

andyferris · 2018-11-24T14:32:50Z

OK, here is a preview of my solution to the properties interface. This replaces #38 and I think this branch will remove the plural-getproperties stuff.

For user-facing tools, there are macros @Select and ~~@calc~~ @Compute which return functions. That's right - these aren't direct operations, let's call them "higher-order macros" :)

They are designed to act on any container that supports getproperty. The @Compute macro is more-or-less convenient syntax for building a simple anonymous function. You use $ to indicate any input property and all other parts of the expression are evaluated as written.

julia> @Compute($a + $b)((a=1, b=2.0))
3.0

~~I'd like to think of a better name for @calc, so ideas very welcome.~~ In the backend this creates a Compute object which is a type of Function that knows what property names it requires (useful info for columnar-storage optimizations, ~~still WIP~~ now done).
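To make the `$` rewriting concrete, here is a minimal sketch of how such a macro could be built. It is not the code in this PR: `ComputeSketch`, `replace_dollars!` and `@compute_sketch` are hypothetical names, and the sketch assumes every `$` is followed by a plain symbol.

```julia
# Hypothetical stand-in for the PR's Compute type: a callable that also
# records which property names it reads.
struct ComputeSketch{F} <: Function
    names::Tuple{Vararg{Symbol}}
    f::F
end
(c::ComputeSketch)(row) = c.f(row)

# Walk an expression, replacing each `$name` with `getproperty(row, :name)`
# and collecting the names that were seen.
function replace_dollars!(names, ex, row)
    if ex isa Expr && ex.head === :$
        name = ex.args[1]
        name in names || push!(names, name)
        return :(getproperty($row, $(QuoteNode(name))))
    elseif ex isa Expr
        return Expr(ex.head, (replace_dollars!(names, arg, row) for arg in ex.args)...)
    else
        return ex
    end
end

macro compute_sketch(ex)
    row = gensym(:row)
    names = Symbol[]
    body = replace_dollars!(names, ex, row)
    nametuple = Expr(:tuple, QuoteNode.(names)...)
    # Escaped so that everything else in the expression is evaluated as written.
    return esc(:(ComputeSketch($nametuple, $row -> $body)))
end

c = @compute_sketch($a + $b)
c((a = 1, b = 2.0))   # 3.0
c.names               # (:a, :b)
```

Because the required names are stored on the object, a table implementation could inspect `c.names` and touch only those columns.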

The @Select macro returns a function that produces an object with a number of properties; some are simply copied from the input and others are computed from it. Here's a preview:

julia> @Select(a, b = $b, sum = $a + $b)((a=1, b=2.0))
(a = 1, b = 2.0, sum = 3.0)

Generally it's a name = function_expression pair but you can just nominate a symbol to replicate. This creates a Select object which is a type of Function that generally contains GetProperty or Compute objects (again, column names are known for ~~potential~~ implemented columnar-storage optimizations).
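As a rough illustration of the "name = function" idea, the following sketch models a selection as a NamedTuple of row functions. `select_sketch` is a hypothetical helper, not the `Select` type from this PR, and it does not track column names.

```julia
# Hypothetical stand-in for a Select object: apply each named function to the
# row and keep the results under the same names.
select_sketch(fs::NamedTuple) = row -> map(f -> f(row), fs)

sel = select_sketch((
    a   = (r -> r.a),          # replicate a
    b   = (r -> r.b),          # replicate b
    sum = (r -> r.a + r.b),    # computed property
))

sel((a = 1, b = 2.0))   # (a = 1, b = 2.0, sum = 3.0)
```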

This PR does contain columnar-storage optimizations for GetProperties in the columnops.jl file (carried over from #38), as well as for Compute and Select (we automatically pre-project tables so that iteration works on fewer columns). I don't think we'll need getproperties for anything in the end, so I will probably ~~delete~~ not export that. But right now I gotta go to bed.

Now - how to use on a Table? Well, you have two options, you can manipulate the table directly, as in @Select(...)(t), which performs a transformation on columns as entire arrays. Or you can broadcast this over the rows, as in @Select(...).(t) or map(@Select(...), t), and the result can be globbed back into a table (~~the former is still WIP~~).
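The difference between the two options can be sketched with Base types only. Here a NamedTuple of vectors stands in for column storage and a vector of NamedTuples for row storage; `f` is a hand-written stand-in for something like `@Select(a, sum = $a + $b)`, so none of this exercises the actual Table type.

```julia
# Hypothetical stand-in for `@Select(a, sum = $a + $b)`.
f(t) = (a = t.a, sum = t.a + t.b)

cols = (a = [1, 2, 3], b = [2.0, 4.0, 6.0])                  # column storage
rows = [(a = a, b = b) for (a, b) in zip(cols.a, cols.b)]    # row storage

whole   = f(cols)    # one call; `+` acts on entire column arrays
rowwise = f.(rows)   # one call per row, same as map(f, rows)
```

The two agree here only because `+` on equal-length vectors happens to be elementwise. An expression such as `$a * $b` would mean something different (or fail) on whole columns, which is one reason to prefer the broadcast form for per-row logic.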

cc @quinnj compared to what I see in TableOperations.jl, I see this as being more generic/fundamental about properties rather than tables, but still preserving the information critical for columnar-based storage optimizations.

Todos:

- We should ideally add some columnar accelerations to mapreduce, and maybe to filter (and findall).
- Tutorial documentation
- Test coverage
- Make @select not clash with Query.jl (renamed to @Select)

One can pick more than one column with `getproperties`. The output type is configurable - defaulting to NamedTuple. This begins to form the basis of a "properties" interface. We will need a few more convenience functions yet, like a generic "select" that is friendly for columnar storage.

andyferris · 2018-11-24T14:37:09Z

The relevant details are more visible in the second commit

coveralls · 2018-11-24T14:38:52Z

Coverage decreased (-2.9%) to 57.267% when pulling e571970 on ajf/select-and-calc into c006e67 on master.

coveralls · 2018-11-24T14:38:55Z

Coverage increased (+1.5%) to 61.587% when pulling 65a9ac2 on ajf/select-and-calc into c006e67 on master.

codecov-io · 2018-11-24T14:40:55Z

Codecov Report

Merging #39 into master will increase coverage by 1.45%.
The diff coverage is 63.88%.

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
+ Coverage   60.12%   61.58%   +1.45%     
==========================================
  Files           5        6       +1     
  Lines         311      479     +168     
==========================================
+ Hits          187      295     +108     
- Misses        124      184      +60

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/TypedTables.jl | `100% <ø> (ø)` | ⬆️ |
| src/Table.jl | `74.72% <0%> (-0.84%)` | ⬇️ |
| src/columnops.jl | `49.01% <45.83%> (-9.32%)` | ⬇️ |
| src/properties.jl | `68.86% <68.86%> (ø)` | |
| src/FlexTable.jl | `75.45% <80%> (+1.33%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

andyferris · 2018-11-27T12:11:44Z

OK, I've now implemented automatic wrapping and unwrapping of functions in the @compute macro and Compute methods, including creations of Base.Fix1 and Base.Fix2.

What this means is that AcceleratedArrays.jl accelerations will now fire by default. For example

filter(@compute(isequal($a, 100)), table)
@select(a, b, isless($a, $b)).(table)
findall(@compute($position ∈ Sphere(centre, radius)), pointcloud)

may all potentially use secondary acceleration indices. (Which means once the docs for this are done I can get back to finishing the implementation of accelerations for SortIndex and UniqueSortIndex - yay!)
CC @c42f in case you find the third example above interesting :) Of course you could write findall(in(Sphere(centre, radius)), pointcloud.position) anyway but it's nice that using the tabular macros doesn't lead to shooting yourself in the foot. And you can use filter directly instead of findall, in the same way.
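For readers wondering why producing Base.Fix2 matters: a Fix2 is an ordinary struct, so methods can dispatch on it and read the fixed value, whereas an anonymous function is opaque. The sketch below uses only Base; `describe` is a hypothetical function standing in for whatever an acceleration-aware `filter` or `findall` method would do.

```julia
# `isequal(100)` returns a Base.Fix2 wrapping `isequal` and the value 100.
pred = isequal(100)

# Hypothetical dispatch target: a specialised method can recognise the
# predicate and read the fixed value from the `x` field.
describe(f::Base.Fix2{typeof(isequal)}) = "equality test against $(f.x)"
describe(f) = "opaque function"

describe(pred)                    # "equality test against 100"
describe(x -> isequal(x, 100))    # "opaque function"
```

Both predicates behave identically when called, but only the first exposes enough structure for an index lookup to replace a linear scan.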

andyferris · 2018-11-27T12:13:52Z

Of course the worst thing about this PR is the clash with Query.@select. Not sure how to handle that... potentially @Select and @Compute?

andyferris · 2018-11-27T22:56:59Z

@Select and @Compute are the new spellings. They are (almost) constructors for Selects and Computes and this naming doesn't clash with Query.jl.

c42f · 2018-11-28T11:06:06Z

Oh, this is very interesting, I like it. The use of $ for extracting fields is very natural.

One thing I've been finding slightly frustrating is the syntax of SplitApplyCombine.innerjoin is rather... disjointed. In particular, there are several functions in a call to innerjoin which work together to produce the result. But without suggestive syntax, reading an invocation of innerjoin is rather confusing.

What you've got here looks tantalizingly like it could almost solve those problems. Is this part of the plan?

I must admit I'm not super keen on @Compute because it's both generic and nonstandard. Of course it's hard. It seems to be about mapping properties through a function... maybe @MapProps, @withprops @useprops @mapprops? Haha... matlab style @propfun :-D

c42f · 2018-11-28T11:07:11Z

@bsxpropfun heh heh 🤢

andyferris · 2018-11-28T11:49:50Z

Thanks for the feedback.

I too am happier with @Select than with @Compute. The $ seems to work well. However, I do foresee that a good solution to using _ in Julia to create closures could make @Compute mostly (or completely) unnecessary; this would be the ideal outcome. I think we could already use _. instead of $ inside the macros if we really wanted to.

Regarding innerjoin... well, there are a few things to say about it as a function signature:

- I more-or-less copied Microsoft here (the C# LINQ Method signatures). A static language and good IDE would make this easier for users; and in fact the special LINQ syntax (C# lowering magic) means a lot of C# users wouldn't use the signatures directly.
- I use positional function arguments much like the Base functions, and yet I have the same issue you are having, but with mapreduce. So far I have managed to move init to a keyword argument, which helps considerably, but maybe I just want reduce with a map keyword argument. (Incidentally, Microsoft similarly provides a final function, which can e.g. divide for the mean.) Not sure if the community would agree.
- Edit: I did at first want to reduce the number of arguments compared to LINQ, but upon implementation it seems necessary to have all of them to make a clean, generic implementation that could possibly do things like take advantage of columnar storage or acceleration indices.

Finally... what's the "plan" for joins? OK, here's a wide-open space. We could use some keyword arguments to clean it up. We could defer to Query.jl macros (or similar) to call innerjoin, so users mostly don't deal with the burden. Other syntax approaches could be using generators or filtering an outer product of tables. Ideas welcome.

But I do have one secret dream. I like the idea of using the "full" power of relational algebra. Let me paint a picture.

Imagine a RelationalAlgebra package. This would define the join operator ⨝ to mean natural join. It would do something sensible with Table and Vector{<:NamedTuple} by default. It would also define "abstract relations". Remember a relation is both a set of named tuples and is isomorphic to a function that asks "is this particular named tuple in the set?". I'm thinking a relation might be defined with another macro

@Relate($a == $b)

This set is infinite but can be joined with one or more tables that have columns a and b to get a finite output.

Users could put together powerful queries like this

table ⨝ @Relate($a > 100)   # filter for table.a .> 100
table ⨝ @Relate($a == $b) ⨝ table2  # join table with table2, matching the table.a column with the table2.b column
pointcloud ⨝ polygon   # return points inside polygon
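For reference, the natural join that ⨝ would denote can be sketched over plain vectors of NamedTuples. `natural_join_sketch` is a hypothetical helper written for this illustration; it is a naive nested loop with no acceleration, and it does not cover the "abstract relation" case.

```julia
# Hypothetical helper: keep every pair of rows that agree on all the column
# names they share, and merge each matching pair into one row.
function natural_join_sketch(left, right)
    out = []
    for l in left, r in right
        shared = filter(k -> haskey(r, k), keys(l))   # names present in both rows
        if all(k -> isequal(l[k], r[k]), shared)
            push!(out, merge(l, r))
        end
    end
    return out
end

people = [(id = 1, name = "Ann"), (id = 2, name = "Bob")]
orders = [(id = 1, item = "pen"), (id = 1, item = "ink"), (id = 3, item = "cup")]

joined = natural_join_sketch(people, orders)
# two rows: Ann with "pen" and Ann with "ink"
```

When the inputs share no column names, `shared` is empty and the result is the full outer product, which is the usual convention for natural joins.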

There is also a dual operator to ⨝ which kind-of appends relations together (if a relation is one-to-one with a predicate, you can either && two predicates or || two predicates to get a new relation - the && case is a natural join / filter). The best place to see someone implement something like this is the Python "Dee" package, which follows the admittedly ranty "Third Manifesto" on how to design relational database systems (which IMO makes a couple of good points).

Not sure how practical any of that would be to actually use, but damn, things like pointcloud ⨝ polygon seem so cool.

c42f · 2018-11-28T13:14:17Z

Cool, well stated. That's a lot to think about.

> I think we could already use _. instead of $ inside the macros if we really wanted to

True, I think $ is arguably a nicer little DSL for "working in the context of a named tuple". Though one that requires writing the macro name.

> table ⨝ @Relate($a == $b) ⨝ table2

It would be beautiful to be able to write this. Though I'm not sure about some annoying practicalities like two tables having the same column names for different things etc. Having the macro does allow some insight into the expression as presumably required for accelerating the join.

andyferris · 2018-11-29T01:44:23Z

> Though I'm not sure about some annoying practicalities like two tables having the same column names for different things etc.

Yeah, this is the bit I was referring to - the user probably needs to buy into natural joins before developing the data model.

> Having the macro does allow some insight into the expression as presumably required for accelerating the join.

Exactly. Introspection of @code_lowered for anonymous functions may also be possible - but this is getting into Cassette.jl territory, even I haven't been game enough to go there (yet...).

c42f · 2018-11-30T03:01:31Z

> the user probably needs to buy into natural joins before developing the data model

Oh right. So columns need always to have consistent names in tables to be joined. I feel like this would simply be too inflexible to be practical, even if you name all columns with a prefix of the entity name (so that foreign keys can naturally match the key in the source table). I wonder whether there's some middle ground of partially automated column renaming which would make this work neatly.

andyferris · 2018-11-30T11:11:12Z

To me, it would be ideal if the join operator ⨝ would detect the matching column names and then call innerjoin. In the general case, people can use innerjoin or a convenience interface along the lines of Query.jl that calls it for you.

andyferris · 2018-11-30T11:19:17Z

Let's try this out on master. Feedback from users very welcome.

Andy Ferris added 2 commits November 24, 2018 22:31

Property interface via macros @select and @calc

e571970

andyferris mentioned this pull request Nov 24, 2018

Add a getproperties conenience function #38

Merged

Andy Ferris added 3 commits November 25, 2018 16:05

Do most of the columnar optimizations

9583d77

Add filter, findall, etc

2d0152b

Better @calc and @select for simple cases, more tests and columnar ops

2bf47c7

andyferris force-pushed the ajf/select-and-calc branch from 765597f to 32d05e6 Compare November 27, 2018 02:03

Rename @calc to @compute

4bee895

andyferris force-pushed the ajf/select-and-calc branch from 32d05e6 to 4bee895 Compare November 27, 2018 02:05

Wrap and unwrap @compute functions to enable AcceleratedArrays optimizations

e81a82c

andyferris changed the title ~~WIP/RFC: Property interface via macros @select and @calc~~ RFC: Property interface via macros @select and @compute Nov 27, 2018

Rename @select to @Select, and @compute to @Compute.

9d57a2b

Update documentation

65a9ac2

andyferris mentioned this pull request Nov 30, 2018

Support Tables.materializer #41

Merged

andyferris changed the title ~~RFC: Property interface via macros @select and @compute~~ RFC: Property interface via macros @Select and @Compute Nov 30, 2018

andyferris merged commit a67883a into master Nov 30, 2018

andyferris deleted the ajf/select-and-calc branch November 30, 2018 11:20
