RFC: Property interface via macros @Select and @Compute #39

Merged
merged 9 commits into from
Nov 30, 2018

andyferris
Member

@andyferris andyferris commented Nov 24, 2018

OK, here is a preview of my solution to the properties interface. This replaces #38 and I think this branch will remove the plural-getproperties stuff.

For user-facing tools, there are macros @Select and @Compute which return functions. That's right - these aren't direct operations, let's call them "higher-order macros" :)

They are designed to act on any container that supports getproperty. The @Compute macro is more-or-less convenient syntax for building a simple anonymous function. You use $ to indicate any input property and all other parts of the expression are evaluated as written.

julia> @Compute($a + $b)((a=1, b=2.0))
3.0
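To make the idea concrete, here is a minimal sketch in plain Base Julia of a callable that behaves like the example above and also remembers which properties it needs. The names `ComputeSketch` and `required_properties` are hypothetical and chosen for illustration; this is not the package's actual definition.

```julia
# Hypothetical stand-in for the Compute object described above.
# `names` is a tuple of Symbols stored in the type, `f` is the wrapped function.
struct ComputeSketch{names, F} <: Function
    f::F
end
ComputeSketch{names}(f::F) where {names, F} = ComputeSketch{names, F}(f)

# Calling it extracts the named properties from the input and passes them to `f`.
(c::ComputeSketch{names})(x) where {names} = c.f(map(n -> getproperty(x, n), names)...)

# The required property names can be read back from the type alone.
required_properties(::ComputeSketch{names}) where {names} = names

add_ab = ComputeSketch{(:a, :b)}(+)
add_ab((a = 1, b = 2.0))        # 3.0
required_properties(add_ab)     # (:a, :b)
```

Because the names live in the type, a columnar container could in principle look only at columns `a` and `b` before iterating.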

I'd like to think of a better name for @Compute (originally @calc), so ideas are very welcome. In the backend this creates a Compute object which is a type of Function that knows what property names it requires (useful info for columnar-storage optimizations, now done).

The @Select macro returns an object with a number of properties, some simply copied from the input and others computed from it. Here's a preview:

julia> @Select(a, b = $b, sum = $a + $b)((a=1, b=2.0))
(a = 1, b = 2.0, sum = 3.0)

Generally it's a name = function_expression pair, but you can just nominate a symbol to replicate. This creates a Select object which is a type of Function that generally contains GetProperty or Compute objects (again, column names are known for the implemented columnar-storage optimizations).
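As a rough illustration of the "name = function" pairing, the following sketch builds the same output with a NamedTuple of ordinary closures. The helper name `select_sketch` is an assumption made for this example and is not part of the PR.

```julia
# Each output property is paired with a function of the input row.
# Replicating a property is just a function that returns it unchanged.
fs = (a = r -> r.a, b = r -> r.b, sum = r -> r.a + r.b)

# Hypothetical helper: apply every function to the same row.
# `map` over a NamedTuple keeps the property names.
select_sketch(fs, row) = map(f -> f(row), fs)

select_sketch(fs, (a = 1, b = 2.0))   # (a = 1, b = 2.0, sum = 3.0)
```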

This PR does contain columnar-storage optimizations for GetProperties in the columnops.jl file, from #38, as well as for Compute and Select (we automatically pre-project tables so that iteration works on fewer columns). I don't think we'll need getproperties for anything in the end, so I will probably not export that. But right now I gotta go to bed.

Now - how to use on a Table? Well, you have two options, you can manipulate the table directly, as in @Select(...)(t), which performs a transformation on columns as entire arrays. Or you can broadcast this over the rows, as in @Select(...).(t) or map(@Select(...), t), and the result can be globbed back into a table (the former is still WIP).
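The difference between the two options can be sketched without the package, using a NamedTuple of equal-length vectors as a stand-in for a column-oriented table (an assumption for illustration; a real Table has more behaviour than this).

```julia
# Stand-in for a columnar table: one vector per column.
cols = (a = [1, 2, 3], b = [2.0, 4.0, 6.0])

# Option 1: transform whole columns as arrays.
whole = (a = cols.a, b = cols.b, sum = cols.a .+ cols.b)

# Option 2: apply a row-wise function to every row, then regroup into columns.
rowfun(r) = (a = r.a, b = r.b, sum = r.a + r.b)
rows = [rowfun((a = a, b = b)) for (a, b) in zip(cols.a, cols.b)]
regrouped = (a = [r.a for r in rows], b = [r.b for r in rows], sum = [r.sum for r in rows])

whole == regrouped   # true
```

Both routes give the same columns; the first never materialises individual rows.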

cc @quinnj compared to what I see in TableOperations.jl, I see this as being more generic/fundamental about properties rather than tables, but still preserving the information critical for columnar-based storage optimizations.

Todos:

  • We should ideally add some columnar accelerations to mapreduce, and maybe to filter (and findall).
  • Tutorial documentation
  • Test coverage
  • Make @select not clash with Query.jl (renamed to @Select)

Andy Ferris added 2 commits November 24, 2018 22:31
One can pick more than one column with `getproperties`. The output type
is configurable - defaulting to NamedTuple.

This begins to form the basis of a "properties" interface. We will need
a few more convenience functions yet, like a generic "select" that is
friendly for columnar storage.
@andyferris
Member Author

The relevant details are more visible in the second commit

@coveralls

Coverage Status

Coverage decreased (-2.9%) to 57.267% when pulling e571970 on ajf/select-and-calc into c006e67 on master.

@coveralls

coveralls commented Nov 24, 2018

Coverage Status

Coverage increased (+1.5%) to 61.587% when pulling 65a9ac2 on ajf/select-and-calc into c006e67 on master.

@codecov-io

codecov-io commented Nov 24, 2018

Codecov Report

Merging #39 into master will increase coverage by 1.45%.
The diff coverage is 63.88%.

@@            Coverage Diff             @@
##           master      #39      +/-   ##
==========================================
+ Coverage   60.12%   61.58%   +1.45%     
==========================================
  Files           5        6       +1     
  Lines         311      479     +168     
==========================================
+ Hits          187      295     +108     
- Misses        124      184      +60
Impacted Files        Coverage Δ
src/TypedTables.jl    100% <ø> (ø) ⬆️
src/Table.jl          74.72% <0%> (-0.84%) ⬇️
src/columnops.jl      49.01% <45.83%> (-9.32%) ⬇️
src/properties.jl     68.86% <68.86%> (ø)
src/FlexTable.jl      75.45% <80%> (+1.33%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c006e67...65a9ac2. Read the comment docs.

@andyferris andyferris changed the title WIP/RFC: Property interface via macros @select and @calc RFC: Property interface via macros @select and @compute Nov 27, 2018
@andyferris
Member Author

andyferris commented Nov 27, 2018

OK, I've now implemented automatic wrapping and unwrapping of functions in the @compute macro and Compute methods, including creations of Base.Fix1 and Base.Fix2.

What this means is that AcceleratedArrays.jl accelerations will now fire by default. For example

filter(@compute(isequal($a, 100)), table)
@select(a, b, isless($a, $b)).(table)
findall(@compute($position ∈ Sphere(centre, radius)), pointcloud)

may all potentially use secondary acceleration indices. (Which means once the docs for this are done I can get back to finishing the implementation of accelerations for SortIndex and UniqueSortIndex - yay!)
CC @c42f in case you find the third example above interesting :) Of course you could write findall(in(Sphere(centre, radius)), pointcloud.position) anyway, but it's nice that using the tabular macros doesn't lead to shooting yourself in the foot. And you can use filter directly instead of findall, in the same way.
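For readers unfamiliar with why producing Base.Fix2 matters: a Fix2 carries the function and the fixed argument as data, so a method can dispatch on it, which is not possible with an opaque anonymous function. The sketch below uses only Base; `SortedVec` and `count_matches` are hypothetical names for illustration and are not the AcceleratedArrays.jl API.

```julia
# One-argument forms in Base already return Fix2 objects.
isequal(100) isa Base.Fix2     # true
in(1:10) isa Base.Fix2         # true
lt100 = Base.Fix2(isless, 100) # behaves like x -> isless(x, 100)

# Hypothetical container that promises its data is sorted.
struct SortedVec{T}
    data::Vector{T}
end

# Generic fallback: scan every element.
count_matches(pred, v::SortedVec) = count(pred, v.data)

# Specialised method: the predicate's structure is visible, so use binary search.
count_matches(pred::Base.Fix2{typeof(isless)}, v::SortedVec) =
    searchsortedfirst(v.data, pred.x) - 1

v = SortedVec([1, 50, 99, 100, 200])
count_matches(lt100, v)          # 3, via binary search
count_matches(x -> x < 100, v)   # 3, via the linear scan
```

An anonymous function gives the same answer but can only take the slow path, which is the foot-gun the macros avoid.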

@andyferris
Member Author

Of course the worst thing about this PR is the clash with Query.@select. Not sure how to handle that... potentially @Select and @Compute?

@andyferris
Member Author

@Select and @Compute are the new spellings. They are (almost) constructors for Selects and Computes and this naming doesn't clash with Query.jl.

@c42f
Contributor

c42f commented Nov 28, 2018

Oh, this is very interesting, I like it. The use of $ for extracting fields is very natural.

One thing I've been finding slightly frustrating is the syntax of SplitApplyCombine.innerjoin is rather... disjointed. In particular, there are several functions in a call to innerjoin which work together to produce the result. But without suggestive syntax, reading an invocation of innerjoin is rather confusing.

What you've got here looks tantalizingly like it could almost solve those problems. Is this part of the plan?

I must admit I'm not super keen on @Compute because it's both generic and nonstandard. Of course it's hard. It seems to be about mapping properties through a function... maybe @MapProps, @withprops @useprops @mapprops? Haha... matlab style @propfun :-D

@c42f
Contributor

c42f commented Nov 28, 2018

@bsxpropfun heh heh 🤢

@andyferris
Member Author

andyferris commented Nov 28, 2018

Thanks for the feedback.

I too am happier with @Select than with @Compute. The $ seems to work well. However, I do foresee that a good solution to using _ in Julia to create closures could make @Compute mostly (or completely) unnecessary; this would be the ideal outcome. I think we could already use _. instead of $ inside the macros if we really wanted to.

Regarding innerjoin... well there's a few things here about this as a function signature

  • I more-or-less copied Microsoft here (the C# LINQ Method signatures). A static language and good IDE would make this easier for users; and in fact the special LINQ syntax (C# lowering magic) means a lot of C# users wouldn't use the signatures directly.
  • I use positional functional arguments much like the Base functions, and yet have the same issue you are having but with mapreduce. So far I managed to move init to a keyword argument, which helps considerably, but maybe I just want reduce with a map keyword argument. (Incidentally, Microsoft similarly provide a final function, which can e.g. divide for the mean.) Not sure if the community would agree.
  • Edit: I did at first want to reduce the number of arguments compared to LINQ, but upon implementation it seems necessary to have all of them to make a clean, generic implementation that could possibly do things like take advantage of columnar storage or acceleration indices.
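As a small illustration of the mapreduce point in the list above, this is how Base's mapreduce reads with positional function arguments and init as a keyword, followed by a "final" step done by hand (Base has no such argument; the division here is just ordinary code).

```julia
xs = [1, 2, 3, 4]

# Two positional functions (map, then reduce) and `init` as a keyword.
total = mapreduce(abs2, +, xs; init = 0)   # 30

# A "final" function in the LINQ sense would be applied afterwards,
# e.g. dividing by the count to get a mean.
mean_sq = total / length(xs)               # 7.5
```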

Finally... what's the "plan" for joins? OK, here's a wide-open space. We could use some keyword arguments to clean it up. We could defer to Query.jl macros (or similar) to call innerjoin, so users mostly don't deal with the burden. Other syntax approaches could be using generators or filtering an outer product of tables. Ideas welcome.

But I do have one secret dream. I like the idea of using the "full" power of relational algebra. Let me paint a picture.

Imagine a RelationalAlgebra package. This would define the join operator to mean natural join. It would do something sensible with Table and Vector{<:NamedTuple} by default. It would also define "abstract relations". Remember a relation is both a set of named tuples and is isomorphic to a function that asks "is this particular named tuple in the set?". I'm thinking a relation might be defined with another macro

@Relate($a == $b)

This set is infinite but can be joined on one or more tables with columns a and b to get a finite output.

Users could put together powerful queries like this

table ⨝ @Relate($a > 100)   # filter for table.a .> 100
table ⨝ @Relate($a == $b) ⨝ table2  # join table with table2, matching the table.a column with the table2.b column
pointcloud ⨝ polygon   # return points inside polygon

There is also a dual operator to ⨝ which kind-of appends relations together (if a relation is one-to-one to a predicate, you can either && two predicates or || two predicates to get a new relation - the && case is a natural join / filter). The best place to see someone implement something like this is the Python "Dee" package, which follows the admittedly ranty "Third Manifesto" on how to design relational database systems (which IMO makes a couple of good points).

Not sure how practical any of that would be to actually use, but damn, things like pointcloud ⨝ polygon seem so cool.
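A minimal sketch of the "relation as predicate" idea, assuming a table is just a Vector of NamedTuples and an abstract relation is a function from a row to Bool. All names here (`join_with_relation`, `both`, `either`) are hypothetical; no RelationalAlgebra package or @Relate macro is implied to exist.

```julia
# Abstract (possibly infinite) relations, modelled as predicates on rows.
rel_a_gt_100 = row -> row.a > 100
rel_a_eq_b   = row -> row.a == row.b

# Joining a finite table with an abstract relation reduces to a filter.
join_with_relation(table, rel) = filter(rel, table)

# Combining two abstract relations: && for the join-like case, || for its dual.
both(r1, r2)   = row -> r1(row) && r2(row)
either(r1, r2) = row -> r1(row) || r2(row)

table = [(a = 50, b = 50), (a = 150, b = 150), (a = 200, b = 1)]

join_with_relation(table, both(rel_a_gt_100, rel_a_eq_b))    # [(a = 150, b = 150)]
join_with_relation(table, either(rel_a_gt_100, rel_a_eq_b))  # all three rows
```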

@c42f
Contributor

c42f commented Nov 28, 2018

Cool, well stated. That's a lot to think about.

I think we could already use _. instead of $ inside the macros if we really wanted to

True, I think $ is arguably a nicer little DSL for "working in the context of a named tuple". Though one that requires writing the macro name.

table ⨝ @Relate($a == $b) ⨝ table2

It would be beautiful to be able to write this. Though I'm not sure about some annoying practicalities like two tables having the same column names for different things etc. Having the macro does allow some insight into the expression as presumably required for accelerating the join.

@andyferris
Member Author

Though I'm not sure about some annoying practicalities like two tables having the same column names for different things etc.

Yeah, this is the bit I was referring to - the user probably needs to buy into natural joins before developing the data model.

Having the macro does allow some insight into the expression as presumably required for accelerating the join.

Exactly. Introspection of @code_lowered for anonymous functions may also be possible - but this is getting into Cassette.jl territory, even I haven't been game enough to go there (yet...).

@c42f
Contributor

c42f commented Nov 30, 2018

the user probably needs to buy into natural joins before developing the data model

Oh right. So columns need always to have consistent names in tables to be joined. I feel like this would simply be too inflexible to be practical, even if you name all columns with a prefix of the entity name (so that foreign keys can naturally match the key in the source table). I wonder whether there's some middle ground of partially automated column renaming which would make this work neatly.

@andyferris
Member Author

To me, it would be ideal if the join operator would detect the matching column names and then call innerjoin. In the general case, people can use innerjoin or a convenience interface along the lines of Query.jl that calls it for you.
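To show what "detect the matching column names" could look like, here is a naive natural-join sketch over Vectors of NamedTuples. It is a nested loop written for clarity, not a call to innerjoin, and `natural_join_sketch` is a hypothetical name. It assumes both inputs are non-empty and that every row in an input has the same property names.

```julia
function natural_join_sketch(left, right)
    # Column names present in both inputs (read from the first row of each).
    shared = filter(in(keys(first(right))), keys(first(left)))
    # Values of the shared columns for one row, as a tuple.
    key(row) = map(n -> getproperty(row, n), shared)
    # Keep every pair of rows that agree on all shared columns, merged into one row.
    return [merge(l, r) for l in left for r in right if key(l) == key(r)]
end

people = [(id = 1, name = "Ann"), (id = 2, name = "Bob")]
orders = [(id = 1, item = "pen"), (id = 1, item = "ink"), (id = 3, item = "cup")]

natural_join_sketch(people, orders)
# [(id = 1, name = "Ann", item = "pen"), (id = 1, name = "Ann", item = "ink")]
```

With no shared names the key is an empty tuple for every row, so the result is the full cross product, which matches the usual definition of a natural join.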

@andyferris andyferris changed the title RFC: Property interface via macros @select and @compute RFC: Property interface via macros @Select and @Compute Nov 30, 2018
@andyferris andyferris merged commit a67883a into master Nov 30, 2018
@andyferris
Member Author

Let's try this out on master. Feedback from users very welcome.

@andyferris andyferris deleted the ajf/select-and-calc branch November 30, 2018 11:20
4 participants