Implement MapReducer.stream() #19

tyrasd · 2018-09-18T16:04:01Z

Implements the idea from #17

Works in the JDBC backends (single-threaded and multi-threaded) as well as the "AffinityCall" Ignite backend; other backends fall back to `.collect().stream()`. For the other Ignite backends, streaming is not straightforward to implement. Maybe it can be implemented at some later stage, or the OSHDB could (by default) fall back to another backend if it finds that the current one doesn't implement streaming?
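The fallback described above can be pictured roughly as follows. This is only a minimal sketch: `Backend`, `CollectOnlyBackend` and `NativeStreamingBackend` are hypothetical names made up for illustration, not classes from the OSHDB API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class StreamFallbackSketch {
  /** Hypothetical stand-in for a backend; not the actual OSHDB class hierarchy. */
  public interface Backend<X> {
    List<X> collect();

    /** Fallback: materialize all results first, then expose them as a stream. */
    default Stream<X> stream() {
      return collect().stream();
    }
  }

  /** A backend without native streaming support simply inherits the fallback. */
  public static class CollectOnlyBackend implements Backend<Integer> {
    @Override
    public List<Integer> collect() {
      return Arrays.asList(1, 2, 3);
    }
  }

  /** A backend with native streaming support overrides stream() directly. */
  public static class NativeStreamingBackend extends CollectOnlyBackend {
    @Override
    public Stream<Integer> stream() {
      // Assumption: a real backend would produce results cell by cell here.
      return Stream.of(1, 2, 3);
    }
  }
}
```

With this shape, callers can always call `stream()`; only the memory behaviour differs between backends.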

which is similar to performing something like `MapAggregator.collect().entrySet().stream()`, with the difference that the latter would return all individual results for a particular index value as a list, while the stream returns them individually. The current `MapAggregator.collect()` could be mimicked by doing `MapAggregator.stream().collect(Collectors.groupingBy(Entry::getKey, Collectors.mapping(Entry::getValue, Collectors.toList())))`
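To illustrate the regrouping mentioned above, here is a small self-contained sketch using only the JDK. The class name, `regroup` and `sampleEntries` are hypothetical helpers; the sample stream merely stands in for what a streamed aggregation might emit.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GroupStreamedEntries {
  /** Regroups a flat stream of (index, value) entries into index -> list of values. */
  public static <K, V> Map<K, List<V>> regroup(Stream<Entry<K, V>> entries) {
    return entries.collect(Collectors.groupingBy(
        Entry::getKey,
        Collectors.mapping(Entry::getValue, Collectors.toList())));
  }

  /** Hypothetical sample data: one entry per individual result. */
  public static Stream<Entry<String, Integer>> sampleEntries() {
    return Stream.<Entry<String, Integer>>of(
        new SimpleEntry<>("a", 1),
        new SimpleEntry<>("a", 2),
        new SimpleEntry<>("b", 3));
  }
}
```

Calling `GroupStreamedEntries.regroup(GroupStreamedEntries.sampleEntries())` yields a map where `"a"` maps to the list `[1, 2]` and `"b"` to `[3]`, i.e. the per-index lists that a non-streaming collect would return.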

these are the methods that process the contents of the individual osh cells (cellIterator.*, grouping results by entity id, reducing)

tyrasd · 2018-09-19T17:14:24Z

oshdb-api/src/main/java/org/heigit/bigspatialdata/oshdb/api/mapreducer/backend/Kernels.java

+    return (oshEntityCell, cellIterator) -> {
+      AtomicReference<S> accInternal = new AtomicReference<>(identitySupplier.get());
+      // iterate over the history of all OSM objects in the current cell
+      List<OSMContribution> contributions = new ArrayList<>();


btw: while testing, I found that here (and in a few places below) re-using this list for multiple consecutive entities can cause a problem when one directly uses these lists as output, e.g. the following two queries would produce different results:

….groupByEntity().collect() // <- produces wrong result
….groupByEntity().map(ArrayList::new).collect() // works fine

This is kind of a corner case, since (at first glance) it only happens when using groupByEntity and one does not specify any map function and one does not aggregate the results. But it could still be confusing to run into this and (I think) it doesn't matter too much from a performance point of view. I guess we should just not re-use the list here, or what do you think?
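The pitfall can be reproduced outside the OSHDB with plain lists. The sketch below assumes the shared list is cleared and refilled for each entity, which is a guess about the surrounding kernel code; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedListPitfall {
  /** Buggy variant: emits the same list instance for every group. */
  public static List<List<Integer>> groupWithSharedBuffer(List<List<Integer>> groups) {
    List<List<Integer>> out = new ArrayList<>();
    List<Integer> buffer = new ArrayList<>();
    for (List<Integer> group : groups) {
      buffer.clear();
      buffer.addAll(group);
      out.add(buffer); // every element of out aliases the same buffer
    }
    return out;
  }

  /** Fixed variant: each group gets its own list, so later groups cannot overwrite it. */
  public static List<List<Integer>> groupWithFreshLists(List<List<Integer>> groups) {
    List<List<Integer>> out = new ArrayList<>();
    for (List<Integer> group : groups) {
      out.add(new ArrayList<>(group));
    }
    return out;
  }
}
```

For the input `[[1, 2], [3]]`, the shared-buffer variant returns `[[3], [3]]`, while the fresh-list variant returns `[[1, 2], [3]]`. Copying via `map(ArrayList::new)` has the same effect as the second variant.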

new generic mapreducer tests must also fail when there are missing tables/caches

streaming kernels now return collections of the individual cell results (instead of streams of individual cell results), because streams can't be serialized to be used in a remote "peer class loading" environment like ignite
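A quick way to see the serialization problem is to try plain Java serialization on a stream and on a list. This is only an illustration: standard Java serialization is used here as a stand-in for Ignite's own marshalling, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SerializableResultsSketch {
  /** Returns true if the object can be written with standard Java serialization. */
  public static boolean canSerialize(Object o) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(o);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Materializes per-cell results into a serializable collection before returning them. */
  public static <T> ArrayList<T> materialize(Stream<T> cellResults) {
    return cellResults.collect(Collectors.toCollection(ArrayList::new));
  }
}
```

Here `canSerialize(Stream.of(1, 2, 3))` is false, whereas `canSerialize(materialize(Stream.of(1, 2, 3)))` is true, which is the reason for returning collections from the kernels and turning them back into a stream on the caller's side.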

makes the onClose actually work as expected/described in #30 (comment): the old implementation didn't run the callback in the same job / broadcast runnable as the main map-reduce routine, thus triggering another serialization-deserialization of the remote object for the onClose callback, which would then see a different (new) object that cannot be used to close a db connection, for example.
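The "different (new) object" effect can be demonstrated with a serialization round trip. The sketch below uses standard Java serialization and a hypothetical `Resource` class as a stand-in for whatever remote object holds the db connection; it is not taken from the OSHDB code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializedCopyPitfall {
  /** Hypothetical resource with an open/closed state. */
  public static class Resource implements Serializable {
    private static final long serialVersionUID = 1L;
    private boolean open = true;

    public void close() {
      open = false;
    }

    public boolean isOpen() {
      return open;
    }
  }

  /** Serializes and deserializes an object, returning an independent copy. */
  @SuppressWarnings("unchecked")
  public static <T extends Serializable> T roundTrip(T obj) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(obj);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Closing the result of `roundTrip(resource)` leaves the original `resource` open, which mirrors why a callback that runs in a separately serialized job cannot close the connection that the main routine opened.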

tyrasd added 3 commits September 18, 2018 16:20

implement streaming of non-aggregated results for Ignite "AffinityCal…

00cecbc

…l" backend

implement streaming accessors for jdbc backends

260014e

add basic tests for streaming queries

1af7870

tyrasd added the enhancement label Sep 18, 2018

tyrasd added 3 commits September 18, 2018 18:15

update/tweak some javadoc deprecation comments

bb47f7a

de-duplicate commonly used code in various backends

d3a44a8

tyrasd force-pushed the streams branch from aff5c3c to d3a44a8 Compare September 19, 2018 15:50

tyrasd commented Sep 19, 2018

tyrasd requested review from rtroilo and sfendrich September 19, 2018 17:16

tyrasd added 3 commits September 27, 2018 14:32

Merge branch 'master' into streams

6a2feac

update tests to master branch

9cdeda0

Merge branch 'master' into streams

a7ff5dc

tyrasd force-pushed the streams branch from 217fc64 to a7ff5dc Compare October 8, 2018 09:00

tyrasd added 3 commits October 8, 2018 16:57

switch stream kernels to return collections

3668619

call onClose callback also when using streaming backend

5cb51b9

rtroilo approved these changes Oct 9, 2018

add test for MapAggregator::stream

a32189c

tyrasd merged commit 9b902ad into master Oct 10, 2018

tyrasd deleted the streams branch October 10, 2018 08:50

tyrasd added this to the release 0.5.0 milestone Dec 3, 2020
