-
Notifications
You must be signed in to change notification settings - Fork 320
Add SortValues #479
Add SortValues #479
Conversation
A contrib module that provides a PTransform which performs local(non-distributed) sorting. It will sort in memory until the buffer is full, then flush to disk and use external sorting. Consumes a PCollection of KVs from primary key to iterable of secondary key and value KVs and sorts the iterables. Would probably be called after a GroupByKey. Uses coders to convert secondary keys and values into byte arrays and does a lexicographical comparison on the secondary keys. Uses Hadoop as an external sorting library. Backport of apache/beam#1199
|
Note: Everything is a pretty much direct backport except for pom.xml. The beam and dataflow poms are different enough that it was easier to take the original pom.xml Marian had written for the earlier version of the sort library and manually make the necessary changes to dependencies and shading. So aside from the CR done on Marian's version of the file months ago, that file hasn't been code reviewed. Ran mvn clean verify successfully on the sorter contrib module and the parent (looks like the parent pom doesn't do builds of the contrib modules in dataflow sdk). |
davorbonaci
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Just a few comments from me, but will wait until @dhalperi takes a quick look too.
contrib/sorter/AUTHORS.md
Outdated
| @@ -0,0 +1,5 @@ | |||
| # Authors of 'sorter' module | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed unless there's somebody other than Google to add here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
contrib/sorter/README.md
Outdated
| This module provides the SortValues transform, which takes a `PCollection<KV<K, Iterable<KV<K2, V>>>>` and produces a `PCollection<KV<K, Iterable<KV<K2, V>>>>` where, for each primary key `K` the paired `Iterable<KV<K2, V>>` has been sorted by the byte encoding of secondary key (`K2`). It will efficiently and scalably sort the iterables, even if they are large (do not fit in memory). | ||
|
|
||
| ##Caveats | ||
| * This transform performs value-only sorting; the iterable accompanying each key is sorted, but *there is no relationship between different keys*, as Beam does not support any defined relationship between different elements in a PCollection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beam -> Dataflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realizing I should have done a case insensitive grep for beam in the first place... Have done that now and the only two instances were the ones you found. Fixed
| } | ||
| } | ||
|
|
||
| /** Matcher for KVs. Forked from Beam's com.google.cloud.dataflow/sdk/TestUtils.java */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beam?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
LGTM, thanks. |
A contrib module that provides a PTransform which performs
local(non-distributed) sorting. It will sort in memory until the buffer
is full, then flush to disk and use external sorting.
Consumes a PCollection of KVs from primary key to iterable of secondary
key and value KVs and sorts the iterables. Would probably be called
after a GroupByKey. Uses coders to convert secondary keys and values
into byte arrays and does a lexicographical comparison on the secondary
keys.
Uses Hadoop as an external sorting library.
Backport of apache/beam#1199