Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue #297: added an overload for GeoJsonReader.readToGeometryRDD wit… #298

Merged
merged 2 commits into from Dec 2, 2018

Conversation

@AntonPeniaziev
Copy link
Contributor

commented Nov 24, 2018

…h additional parameter to skip invalid geometries in result RDD

Is this PR related to a proposed Issue?

#297

What changes were proposed in this PR?

Add an option to create an RDD from geoJson, skipping invalid geometries

How was this patch tested?

Added a test with both valid and invalid geometries.

Did this PR include necessary documentation updates?

No

issue #297: added an overload for GeoJsonReader.readToGeometryRDD wit…
…h additional parameter to skip invalid geometries in result RDD
@jiayuasu

This comment has been minimized.

Copy link
Member

commented Nov 25, 2018

Hi @AntonPeniaziev ,

Thank you for the patch!

First, the patch cannot pass Travis-CI test. Please check.

Second, I am not sure why you want to validate invalid geometry (see the definition below) in the first place. This is something should be handled by "ST_IsValid" after the reader successfully loads GeoJSON data. I just tried, the original GeoSparkGeoJSON reader can actually read the json file in your patch without any error.

Third, I think we probably need to clarify two concepts:

(1) Malformed geojson format (string with syntax error): I think you actually want to solve this kind of format, isn't it?

(2) Invalid geometry: No syntax error but actually makes no sense in terms of topology: see PostGIS def https://postgis.net/docs/ST_IsValid.html

If you want to actually solve "Malformed geojson format ", you probably should simply add a try catch to capture the exception (if the user allows this capture), return null in the format mapper and add a Spark's log4j warning. "[GeoSpark][FormatMapper] Catch a malformed geojson geometry. Produce null."

Please correct me if I am wrong.

Thank you again for your help!

Cheers,
Jia

@AntonPeniaziev

This comment has been minimized.

Copy link
Contributor Author

commented Nov 25, 2018

Hi @jiayuasu ,

  1. CI fail resolved

  2. The objective is not to validate the invalid geometries rather give to user an option to build and RDD from geoJson that will contain only topologically valid geometries. Note that this cannot be done with ST_Isvalid expression, since it receives single geometry, when GeoJsonReader.readToGeometryRDD receives a path to file. Theoretically there is an option to make a filter, somehow with lambda in java when using
    val isvalidop = new IsValidOp(geometry)
    isvalidop.isValid

but I haven't succeed to do so due to Java version issues. I think public API should be more friendly in that way.

  1. Another topic I haven't covere here is huge geoJsons with invalid syntax geometry.
    I've seen @netanel246 's issue #291 relates it.
    We actually have a use case with a big GeoJson (ten's of millions rows) that contains few geometries with invlid syntax, for example - a polygon that is not closed or a polygon with zero area, like
    POLYGON((a, b, a))
    Of course we are not talking about total syntax mess, but there are GEO processing tools in the market that can create those kind of syntactically invalid geometries that I've mentioned above and we need to deal with them.

I've managed to solve it with some try-catch warappers around each readGeometry. Do you think it's worse another issue opened or I can just edit a title a bit?

Regards,

Anton

@jiayuasu

This comment has been minimized.

Copy link
Member

commented Nov 27, 2018

@AntonPeniaziev
Got your point.

Since GeoSpark has a bunch of different format readers: GeoJSON, Shapefile, WKT, WKB, HDF...

So I recommend that, don't do validation for GeoJSON.

Instead, please create a new function in SpatialRDD:
https://github.com/DataSystemsLab/GeoSpark/blob/master/core/src/main/java/org/datasyslab/geospark/spatialRDD/SpatialRDD.java#L118

called RemoveInvalidGeometry(). It checks rawSpatialRDD and filters out all geometries whose isvalidop.isValid is false.

This provides a very generic API for all input formats. So the user first loads all geometries to rawSpatialRDD and then call "RemoveInvalidGeometry" to remove invalid geometries.

Most importantly, this will help all other users.

Regarding your Point3, we probably need to have a separate PR to discuss.

Thank you again for your effort!

Jia

@AntonPeniaziev

This comment has been minimized.

Copy link
Contributor Author

commented Dec 2, 2018

@jiayuasu ,
Regarding RemoveInvalidGeometry function
I've looked at SpatialRdd code and I don't see any proper way to implement it since after filtering invalid geometries index becomes obsolete and calculating new one could sometimes be against users intentions. Like it was mentioned in Geospark paper there are cases when absence of index improved the performance.

The only aternative I can think about is implemeting Reader for each type like GeoJsonReader,
and use the functionality I've already implemented in FormatMapper (use allowInvalidGeometries flag)

Tell me please what do you think is better approach.

Regards,

Anton

@jiayuasu

This comment has been minimized.

Copy link
Member

commented Dec 2, 2018

@AntonPeniaziev OK. Understand, it makes sense. Let me first accept this PR and then figure out how to do the same check for other formats.

In your another PR for Issue #299, please make sure you differentiate invalidTopologyGeom and invalidSyntaxGeom in the overload API, otherwise the user may feel confused.

@jiayuasu jiayuasu self-requested a review Dec 2, 2018

@jiayuasu jiayuasu merged commit 746ddb4 into DataSystemsLab:master Dec 2, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
jiayuasu added a commit that referenced this pull request Dec 2, 2018
issue #297: added an overload for GeoJsonReader.readToGeometryRDD wit… (
#298)

* issue #297: added an overload for GeoJsonReader.readToGeometryRDD with additional parameter to skip invalid geometries in result RDD

* issue #297, fixed location of the test resource file
jiayuasu added a commit that referenced this pull request Dec 2, 2018
issue #297: added an overload for GeoJsonReader.readToGeometryRDD wit… (
#298)

* issue #297: added an overload for GeoJsonReader.readToGeometryRDD with additional parameter to skip invalid geometries in result RDD

* issue #297, fixed location of the test resource file

@AntonPeniaziev AntonPeniaziev deleted the AntonPeniaziev:AllowInvalidGeos branch Dec 2, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.