issue #297: added an overload for GeoJsonReader.readToGeometryRDD with additional parameter to skip invalid geometries in result RDD #298
Is this PR related to a proposed Issue?
What changes were proposed in this PR?
Add an option to create an RDD from geoJson, skipping invalid geometries
How was this patch tested?
Added a test with both valid and invalid geometries.
Did this PR include necessary documentation updates?
Hi @AntonPeniaziev ,
Thank you for the patch!
First, the patch does not pass the Travis CI tests. Please check.
Second, I am not sure why you want to validate invalid geometries (see the definition below) in the reader in the first place. This is something that should be handled by "ST_IsValid" after the reader has successfully loaded the GeoJSON data. I just tried it: the original GeoSpark GeoJSON reader can actually read the JSON file in your patch without any error.
Third, I think we probably need to clarify two concepts:
(1) Malformed GeoJSON (a string with a syntax error): I think this is the case you actually want to handle, isn't it?
(2) Invalid geometry: no syntax error, but the geometry makes no sense in terms of topology; see the PostGIS definition: https://postgis.net/docs/ST_IsValid.html
If you actually want to handle malformed GeoJSON, you should probably just add a try-catch to capture the exception (if the user allows this capture), return null in the format mapper, and log a warning through Spark's log4j: "[GeoSpark][FormatMapper] Catch a malformed geojson geometry. Produce null."
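A minimal sketch of that try-catch idea, to make the suggestion concrete. Everything named here is hypothetical and not GeoSpark's actual code: `LenientFormatMapper`, the `skipSyntacticallyInvalid` flag, and the `parser` function, which stands in for the real GeoJSON geometry parser. It uses `java.util.logging` instead of log4j so that it runs on a bare JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical wrapper, not part of GeoSpark: catches parse failures per record.
final class LenientFormatMapper<T> {
    private static final Logger LOG =
            Logger.getLogger(LenientFormatMapper.class.getName());

    // Assumption: stands in for the real GeoJSON parser, which is expected
    // to throw a RuntimeException on malformed input.
    private final Function<String, T> parser;
    private final boolean skipSyntacticallyInvalid;

    LenientFormatMapper(Function<String, T> parser, boolean skipSyntacticallyInvalid) {
        this.parser = parser;
        this.skipSyntacticallyInvalid = skipSyntacticallyInvalid;
    }

    // Returns the parsed geometry, or null if the record is malformed
    // and the caller opted in to skipping such records.
    T readGeometry(String line) {
        try {
            return parser.apply(line);
        } catch (RuntimeException e) {
            if (!skipSyntacticallyInvalid) {
                throw e; // default behaviour: fail loudly
            }
            LOG.warning("[GeoSpark][FormatMapper] Catch a malformed geojson geometry. Produce null.");
            return null;
        }
    }

    // Parses every line and drops the nulls produced for malformed records.
    List<T> readAll(List<String> lines) {
        List<T> result = new ArrayList<>();
        for (String line : lines) {
            T geometry = readGeometry(line);
            if (geometry != null) {
                result.add(geometry);
            }
        }
        return result;
    }
}
```

Keeping the flag off by default preserves the existing fail-fast behaviour; only callers who explicitly allow the capture get nulls filtered out.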
Please correct me if I am wrong.
Thank you again for your help!
Hi @jiayuasu ,
but I haven't succeeded in doing so due to Java version issues. I think the public API should be friendlier in that respect.
I've managed to solve it with some try-catch wrappers around each readGeometry. Do you think it's worth opening another issue, or can I just edit the title a bit?
GeoSpark has a bunch of different format readers: GeoJSON, Shapefile, WKT, WKB, HDF...
So I recommend not doing the validation in the GeoJSON reader.
Instead, please create a new function in SpatialRDD called RemoveInvalidGeometry(). It checks rawSpatialRDD and filters out all geometries for which IsValidOp.isValid is false.
This provides a very generic API for all input formats: the user first loads all geometries into rawSpatialRDD and then calls "RemoveInvalidGeometry" to remove the invalid ones.
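A rough sketch of the load-then-filter flow proposed here, kept independent of any input format. In GeoSpark the filter would run over rawSpatialRDD with the JTS validity check; neither is reproduced in this sketch. `InvalidGeometryFilter`, `removeInvalidGeometry` and `isSimpleClosedRing` are hypothetical names, a plain `List` stands in for the RDD, and the ring check is a deliberately simplified stand-in for a real validity test (it ignores touching and collinear-overlap cases).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not GeoSpark code.
final class InvalidGeometryFilter {

    // Format-agnostic filter: keeps only the geometries that pass the check.
    static <T> List<T> removeInvalidGeometry(List<T> raw, Predicate<? super T> isValid) {
        List<T> kept = new ArrayList<>();
        for (T geometry : raw) {
            if (isValid.test(geometry)) {
                kept.add(geometry);
            }
        }
        return kept;
    }

    // Simplified stand-in for a topology check on a polygon ring given as
    // [x, y] pairs: at least 4 points, closed, and no two edges properly cross.
    static boolean isSimpleClosedRing(double[][] ring) {
        int n = ring.length;
        if (n < 4) return false;
        if (ring[0][0] != ring[n - 1][0] || ring[0][1] != ring[n - 1][1]) return false;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n - 1; j++) {
                if (properlyCross(ring[i], ring[i + 1], ring[j], ring[j + 1])) {
                    return false;
                }
            }
        }
        return true;
    }

    // Signed area of the triangle (a, b, c); the sign gives the orientation.
    private static double cross(double[] a, double[] b, double[] c) {
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);
    }

    // True only for a strict interior crossing; edges sharing an endpoint do not count.
    private static boolean properlyCross(double[] p1, double[] p2, double[] p3, double[] p4) {
        double d1 = cross(p3, p4, p1);
        double d2 = cross(p3, p4, p2);
        double d3 = cross(p1, p2, p3);
        double d4 = cross(p1, p2, p4);
        return d1 * d2 < 0 && d3 * d4 < 0;
    }
}
```

A "bow-tie" ring such as (0,0), (1,1), (1,0), (0,1), (0,0) is a useful test case: it is syntactically fine, so any reader can load it, yet its edges cross, so the filter removes it.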
Most importantly, this will help all other users.
Regarding your point 3, we probably need a separate PR to discuss it.
Thank you again for your effort!
The only alternative I can think of is implementing a Reader for each type, like GeoJsonReader,
Please tell me which approach you think is better.
@AntonPeniaziev OK, understood, it makes sense. Let me first accept this PR and then figure out how to do the same check for other formats.
In your other PR for Issue #299, please make sure you differentiate invalidTopologyGeom and invalidSyntaxGeom in the overload API; otherwise users may be confused.
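One way such a distinction could look, sketched with two independent flags. This is an illustration only, not the signature that was merged: `TwoFlagReader`, `allowInvalidTopology` and `skipInvalidSyntax` are hypothetical names, and the `parser` and `isTopologicallyValid` arguments stand in for the real GeoJSON parser and validity check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical reader, not GeoSpark code: one flag per kind of "invalid".
final class TwoFlagReader {

    static <T> List<T> read(List<String> lines,
                            Function<String, T> parser,
                            Predicate<? super T> isTopologicallyValid,
                            boolean allowInvalidTopology,
                            boolean skipInvalidSyntax) {
        List<T> result = new ArrayList<>();
        for (String line : lines) {
            T geometry;
            try {
                geometry = parser.apply(line);
            } catch (RuntimeException e) {
                // Syntax problem: the record could not be parsed at all.
                if (skipInvalidSyntax) {
                    continue;
                }
                throw e;
            }
            // Topology problem: the record parsed, but the geometry is not valid.
            if (!allowInvalidTopology && !isTopologicallyValid.test(geometry)) {
                continue;
            }
            result.add(geometry);
        }
        return result;
    }
}
```

Separating the flags lets a caller keep topologically invalid geometries for later repair while still dropping unparseable records, or the reverse.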