This repository was archived by the owner on Nov 11, 2022. It is now read-only.

Conversation

@mizitch mizitch commented Nov 4, 2016

A contrib module that provides a PTransform which performs
local (non-distributed) sorting. It sorts in memory until the buffer
is full, then flushes to disk and uses external sorting.
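
To illustrate the buffer-then-spill idea in isolation, here is a minimal plain-Java sketch. It is not the module's code: the class name `SpillingSorterSketch`, the record-count limit `maxBuffered`, and the text-file run format are all assumptions made for illustration, and it uses a simple k-way merge of sorted run files as a stand-in for the Hadoop-backed external sort.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: buffer in memory, spill sorted runs to disk, then merge. */
final class SpillingSorterSketch {
  /** One pending record from a run file, plus the reader it came from. */
  private static final class Head implements Comparable<Head> {
    final String value;
    final BufferedReader source;

    Head(String value, BufferedReader source) {
      this.value = value;
      this.source = source;
    }

    @Override
    public int compareTo(Head other) {
      return value.compareTo(other.value);
    }
  }

  private final int maxBuffered; // assumed limit, counted in records rather than bytes
  private final List<String> buffer = new ArrayList<>();
  private final List<Path> runs = new ArrayList<>();

  SpillingSorterSketch(int maxBuffered) {
    this.maxBuffered = maxBuffered;
  }

  /** Adds a record (must not contain line breaks); spills once the buffer is full. */
  void add(String record) {
    buffer.add(record);
    if (buffer.size() >= maxBuffered) {
      spill();
    }
  }

  /** Number of sorted run files written to disk so far. */
  int spilledRuns() {
    return runs.size();
  }

  /** Sorts the buffer, writes it to a temporary run file, and empties the buffer. */
  private void spill() {
    try {
      Collections.sort(buffer);
      Path run = Files.createTempFile("sorter-run-", ".txt");
      run.toFile().deleteOnExit();
      Files.write(run, buffer, StandardCharsets.UTF_8);
      runs.add(run);
      buffer.clear();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns all records in sorted order (as a list, to keep the sketch short). */
  List<String> sort() {
    if (runs.isEmpty()) {
      // Everything fit in the buffer: a plain in-memory sort is enough.
      Collections.sort(buffer);
      return new ArrayList<>(buffer);
    }
    if (!buffer.isEmpty()) {
      spill();
    }
    List<String> out = new ArrayList<>();
    PriorityQueue<Head> heads = new PriorityQueue<>();
    try {
      for (Path run : runs) {
        advance(Files.newBufferedReader(run, StandardCharsets.UTF_8), heads);
      }
      // k-way merge: repeatedly take the smallest pending record across all runs.
      while (!heads.isEmpty()) {
        Head smallest = heads.poll();
        out.add(smallest.value);
        advance(smallest.source, heads);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  /** Reads the next record of a run into the heap, or closes the run if exhausted. */
  private static void advance(BufferedReader reader, PriorityQueue<Head> heads)
      throws IOException {
    String line = reader.readLine();
    if (line == null) {
      reader.close();
    } else {
      heads.add(new Head(line, reader));
    }
  }
}
```

With `maxBuffered` set to 2, adding five records writes three run files before the merge; with a limit larger than the input, nothing touches disk. The sketch compares Java strings directly, which is a simplification of comparing encoded bytes.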

Consumes a PCollection of KVs that map each primary key to an iterable
of (secondary key, value) KVs, and sorts each iterable by secondary key.
Typically applied after a GroupByKey. Uses coders to encode secondary
keys and values as byte arrays and compares the encoded secondary keys
lexicographically.
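
Because the order is defined on encoded bytes, it can differ from the natural order of the keys. The following plain-Java sketch shows the effect. It is illustrative only: `EncodedKeyOrderSketch`, `encodeString`, and `encodeInt` are hypothetical stand-ins for real coders (UTF-8 bytes and 4-byte big-endian integers), and it assumes bytes are compared as unsigned values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of ordering keys by their encoded bytes. */
final class EncodedKeyOrderSketch {

  /** Compares byte arrays lexicographically, treating each byte as unsigned (assumption). */
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // All shared bytes are equal: the shorter array sorts first.
    return Integer.compare(a.length, b.length);
  }

  /** Stand-in for a string coder: plain UTF-8 bytes. */
  static byte[] encodeString(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Stand-in for a fixed-width integer coder: 4 bytes, big-endian. */
  static byte[] encodeInt(int v) {
    return ByteBuffer.allocate(4).putInt(v).array();
  }

  /** Returns a copy of the keys sorted by their encoded bytes. */
  static List<String> sortStrings(List<String> keys) {
    List<String> out = new ArrayList<>(keys);
    out.sort((x, y) -> compareBytes(encodeString(x), encodeString(y)));
    return out;
  }

  /** Returns a copy of the keys sorted by their encoded bytes. */
  static List<Integer> sortInts(List<Integer> keys) {
    List<Integer> out = new ArrayList<>(keys);
    out.sort((x, y) -> compareBytes(encodeInt(x), encodeInt(y)));
    return out;
  }
}
```

Under these assumptions, the strings `"b", "B", "a", "ab"` come back as `"B", "a", "ab", "b"`, and the integers `3, -1, 0, 2` come back as `0, 2, 3, -1`: the two's-complement encoding of `-1` starts with `0xFF`, so it sorts after every non-negative value. Whether a given key type sorts as expected depends on the coder in use.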

Uses Hadoop as an external sorting library.

Backport of apache/beam#1199

mizitch commented Nov 4, 2016

Note: Everything is pretty much a direct backport except for pom.xml. The Beam and Dataflow poms are different enough that it was easier to take the original pom.xml Marian had written for the earlier version of the sort library and manually make the necessary changes to dependencies and shading. So, aside from the code review done on Marian's version of the file months ago, that file hasn't been code reviewed.

Ran `mvn clean verify` successfully on the sorter contrib module and on the parent (it looks like the parent pom doesn't build the contrib modules in the Dataflow SDK).

@davorbonaci davorbonaci left a comment

Nice! Just a few comments from me, but will wait until @dhalperi takes a quick look too.

@@ -0,0 +1,5 @@
# Authors of 'sorter' module
Contributor

Not needed unless there's somebody other than Google to add here.

Contributor Author

Removed.

This module provides the SortValues transform, which takes a `PCollection<KV<K, Iterable<KV<K2, V>>>>` and produces a `PCollection<KV<K, Iterable<KV<K2, V>>>>` where, for each primary key `K` the paired `Iterable<KV<K2, V>>` has been sorted by the byte encoding of secondary key (`K2`). It will efficiently and scalably sort the iterables, even if they are large (do not fit in memory).

##Caveats
* This transform performs value-only sorting; the iterable accompanying each key is sorted, but *there is no relationship between different keys*, as Beam does not support any defined relationship between different elements in a PCollection.
Contributor

Beam -> Dataflow

Contributor Author

Realizing I should have done a case-insensitive grep for "beam" in the first place... I have done that now, and the only two instances were the ones you found. Fixed.

}
}

/** Matcher for KVs. Forked from Beam's com.google.cloud.dataflow/sdk/TestUtils.java */
Contributor

Beam?

Contributor Author

Fixed

dhalperi commented Nov 7, 2016

LGTM, thanks.

@dhalperi dhalperi merged commit 1b7954e into GoogleCloudPlatform:master Nov 7, 2016