Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate de-dup step when joining RDDs partitioned using a Quad tree #131

Merged
merged 1 commit into from Oct 6, 2017

Conversation

@mbasmanova
Copy link

commented Oct 4, 2017

De-dup step at the end of the join appears to be quite expensive as it involves a shuffle. There are two sources of duplication which require de-dup:

  1. rectangles defining individual partitions may overlap (this is the case with R-Tree partitioning),
  2. individual polygons may overlap multiple partitions.

(1) can be eliminated by choosing partitioning method which generates non-overlapping partition rectangles (e.g. quad-tree or Hilbert).

(2) can be eliminated by enhancing the join procedure for individual partitions to check whether reference point of the intersection of the geometries belongs to the current partition's extent.

This PR modifies the join algorithm to eliminate de-dup step when joining RDDs partitioned using a Quad tree. The set of changes is as follows:

  • Use RDD boundary padded at the top and on the right when building a Quad tree from samples;
  • Treat Quad tree zones as rectangles open at the top and on the right when partitioning points; this guarantees that partitioning of points never generates copies; before this PR, points on the Quad tree grid lines were copied into multiple partitions;
  • For each pair of matching geometries where neither geometry is a point, check whether reference point of the intersection of these geometries belongs to this partition's extent.

Also, I ran into OutOfMemory error while partitioning particularly large dataset due to Quad tree being too big for broadcasting. To reduce the size of the Quad tree used for partitioning I added logic to remove samples stored in the Quad tree nodes. Partitioning logic uses only Quad tree zones add therefore doesn't need any elements stored in the tree. (See QuadTreePartitioner.java).

This PR could be extended to support RDDs partitioning using Equal and Hilbert grids. However, at the moment this is not possible because spatialPartitioning(otherGrids) API doesn't specify the grid type and therefore makes it impossible to distinguish between RDDs partitioned using grids which support inline de-dup within the partition-level join and grids which do not.

@mbasmanova

This comment has been minimized.

Copy link
Author

commented Oct 4, 2017

@jiayuasu , the build failed with "The job exceeded the maximum time limit for jobs, and has been terminated." How can I trigger a re-run?

@jiayuasu

This comment has been minimized.

Copy link
Member

commented Oct 4, 2017

@mbasmanova Hi Masha, I restarted the build but this build failed again. It seems this PR significantly slows down GeoSpark performance because it takes more than 48 min still cannot finish. Before, the time was 20-25 mins.

Can you please double check what's going on?

@mbasmanova

This comment has been minimized.

Copy link
Author

commented Oct 4, 2017

@jiayuasu , the testing time increased because I modified {Point,LineString,Polygon,Rectangle}JoinTest suites to test 3 grid types (QuadTree, RTree and Hilbert) instead of just one. I did this to increase test coverage. Before, only one grid type was tested, the RTree.

Can we increase the allowed build time? If not, do you think I should reduce test coverage back?

@mbasmanova mbasmanova force-pushed the mbasmanova:inline-dedup branch from 923bfdc to e09e0b8 Oct 4, 2017

@mbasmanova

This comment has been minimized.

Copy link
Author

commented Oct 4, 2017

@jiayuasu , *JoinTest variations using Hilbert partitioning are particularly slow. I removed these. Let's see if this helps.

@jiayuasu

This comment has been minimized.

Copy link
Member

commented Oct 5, 2017

@mbasmanova

For the questions in removing duplicate parts,

(1) Yes, I agree. use non-overlapped partitioning method such as Quad-Tree and Voronoi diagram can solve this problem.

(2) I don't quite understand how you use reference point to avoid the duplicate removal part. Are you following any literatures?

@mbasmanova

This comment has been minimized.

Copy link
Author

commented Oct 5, 2017

@jiayuasu , re: 2 - I'm building on the idea from "4.3.5 Avoiding Duplicate Results" section of http://www.cs.umd.edu/~hjs/pubs/jacoxtrjoin07.pdf

Duplicates appear if intersecting geometries A and B appear in multiple partitions. Assuming that partitions themselves don't intersect and in aggregate cover full extent of the dataset, reference point of the intersection of A and B belongs to exactly one partition. If reference point is selected in such a way that it belongs to the intersection itself (e.g. choosing centroid for reference point would not work as centroid may be outside of the geometry), then it is guaranteed that the partition containing reference point of the intersection contains geometries A and B as well. Therefore if a matching pair A and B is reported only when the reference point of the intersection of A and B belongs to the partition's extent it is guaranteed that each matching pair is reported exactly once.

In this PR I define reference point as follows:

  • RefPoint(point) = point
  • RefPoint(linestring) = a vertex with smallest value of y among vertexes with smallest value of x
  • RefPoint(polygon) = RefPoint(exteriorRing)
@jiayuasu
Copy link
Member

left a comment

This PR implements the referencing point idea proposed by Dr.Jens-Peter Dittrich.

Dittrich, J-P., and Bernhard Seeger. "Data redundancy and duplicate detection in spatial join processing." Data Engineering, 2000. Proceedings. 16th International Conference on. IEEE, 2000.

@jiayuasu jiayuasu merged commit 92891a0 into DataSystemsLab:master Oct 6, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.