
Issues while loading GeoJson file obtained from ESRI ArcMap tool #224

Closed
SrinivasRIL opened this issue Apr 26, 2018 · 6 comments


@SrinivasRIL

commented Apr 26, 2018

Hi @jiayuasu,
Thanks in advance for your help with this issue.

We are using ESRI ArcMap to convert our feature classes to GeoJSON. Our GeoJSON data looks like this:

{"type":"FeatureCollection","crs":{"type":"name","properties":{"name":"EPSG:4326"}},"features":[{"type":"Feature","id":1,"geometry":{"type":"Polygon","coordinates":[[[74.479138523000074,15.23637526400006],[74.477792431000069,15.224213660000032],[74.488047980000033,15.228113961000076],[74.479138523000074,15.23637526400006]]]},"properties":{"OBJECTID":1,"SHAPE_Length":0.035358255765521845,"SHAPE_Area":5.9736880883337009e-05}},{"type":"Feature","id":2,"geometry":{"type":"Polygon","coordinates":[[[74.462140491000071,15.209399061000056],[74.462573410000061,15.196601632000068],[74.476942429000076,15.19804866000004],[74.474797396000042,15.210276462000024],[74.462140491000071,15.209399061000056]]]},"properties":{"OBJECTID":2,"SHAPE_Length":0.052348246074740742,"SHAPE_Area":0.00017058056436916421}}]}

As you can see, when we try to load this GeoJSON we get the following error:

org.wololo.geojson.FeatureCollection cannot be cast to org.wololo.geojson.Feature

So we removed the first line and, since we had two records, put the two records on two separate lines (otherwise the count showed as 1 only). We were then able to load the GeoJSON file.
We followed the sample GeoJSON on your GitHub page.
After removing the first line, our GeoJSON now looks like this:

{"type":"Feature","id":1,"geometry":{"type":"Polygon","coordinates":[[[74.479138523000074,15.23637526400006],[74.477792431000069,15.224213660000032],[74.488047980000033,15.228113961000076],[74.479138523000074,15.23637526400006]]]},"properties":{"OBJECTID":1,"SHAPE_Length":0.035358255765521845,"SHAPE_Area":5.9736880883337009e-05}}, {"type":"Feature","id":2,"geometry":{"type":"Polygon","coordinates":[[[74.462140491000071,15.209399061000056],[74.462573410000061,15.196601632000068],[74.476942429000076,15.19804866000004],[74.474797396000042,15.210276462000024],[74.462140491000071,15.209399061000056]]]},"properties":{"OBJECTID":2,"SHAPE_Length":0.052348246074740742,"SHAPE_Area":0.00017058056436916421}}

with each record on its own line, exactly like the GeoJSON sample on your GitHub page.

Since we will be working with huge datasets (shapefiles are not feasible because our geodatabases can reach 22.5 GB, while .shp files have a 2 GB limit), can you help us out? I am not sure that removing the first line and the corresponding braces and brackets with a Python script will solve this issue either.

Can you add support for the first line (up to the features array) in the next fix, or is there another workaround for this problem?
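For what it's worth, the header/footer removal we have in mind would look roughly like this (a minimal Python sketch using only the standard library; the function name and file paths are ours, not part of any tool). It reads the whole FeatureCollection and rewrites it with one Feature per line, which is the single-line-per-record layout GeoSpark's loader seems to expect:

```python
import json

def feature_collection_to_lines(src_path, dst_path):
    """Rewrite a GeoJSON FeatureCollection file as one Feature per line.

    This drops the surrounding FeatureCollection wrapper and emits each
    feature as a single line of JSON.
    """
    with open(src_path) as src:
        collection = json.load(src)
    with open(dst_path, "w") as dst:
        for feature in collection["features"]:
            # json.dumps with default settings produces a single line
            dst.write(json.dumps(feature) + "\n")
```

Note this still loads the whole file into memory, so for very large collections a streaming JSON parser would be needed instead.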

Thanks a lot

Settings

GeoSpark version = 1.1.2

Apache Spark version = 2.1

JRE version = 1.8?

API type = Scala

@jiayuasu

Member

commented Apr 29, 2018

@SrinivasRIL We will try to fix this in the next major release, which is 1.2.0. It won't come out soon, though, because it will contain many new functions and API changes. To work around the issue for now, I have two suggestions:

  1. Convert your data to WKT or WKB. Both are single-line formats that are fully supported by GeoSpark RDD and SQL, and they have no size limitation like Shapefile does.
  2. Keep GeoJSON, but remove the header "FeatureCollections {" and the footer "}" using a small script. Currently, this is the only way to solve it.
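Suggestion 1 can be prototyped without any GIS library at all; here is a minimal sketch (the function name is ours, not part of GeoSpark) that turns a GeoJSON Polygon geometry into a WKT string:

```python
def polygon_geojson_to_wkt(geometry):
    """Convert a GeoJSON Polygon geometry dict to a WKT string.

    GeoJSON stores a polygon as a list of rings, each ring a list of
    [x, y] pairs; WKT writes each ring as "(x y, x y, ...)".
    """
    if geometry["type"] != "Polygon":
        raise ValueError("only Polygon geometries are handled here")
    rings = []
    for ring in geometry["coordinates"]:
        points = ", ".join(f"{x} {y}" for x, y in ring)
        rings.append(f"({points})")
    return "POLYGON (" + ", ".join(rings) + ")"
```

In practice a library such as shapely (shape(...).wkt) would cover all geometry types, but the sketch shows how little is involved for simple polygons.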

I know this is very annoying, but the design of GeoJSON really complicates data parsing. To fix this seemingly trivial issue properly, we would have to write a large amount of code to implement a custom Spark input reader, like the GeoSpark shapefile reader.

@SrinivasRIL

Author

commented May 2, 2018

Thanks @jiayuasu
As you suggested, we are developing a Python script to remove the header and footer. It suits our needs but may not be the best solution, so hopefully you can release the new version soon.
We are also staying with the current GeoSpark version and GeoJSON for now.
When we use ST_GeomFromGeoJSON, only the geometry/shape is read. Can you tell us how to read the other attributes from the GeoJSON (those included in the properties object, such as statename or cityname) into the DataFrame?

We also tried to create a spatial RDD with the file data splitter set to GEOJSON and carryOtherAttributes set to true, and then to convert the resulting RDD into a DataFrame, but we get this error:

    val polyRDDInputLocation = "hdfs:///gis/Mumbai11195.json"
    val polyRDDSplitter = FileDataSplitter.GEOJSON
    val carryOtherAttributes = true // Carry Column 1 (hotel, gas, bar...)
    var objectRDD = new PolygonRDD(sc, polyRDDInputLocation, polyRDDSplitter, carryOtherAttributes)
    var polydf = Adapter.toDf(rddWithOtherAttributes, sparkSession)
    polydf.createOrReplaceTempView("polydf")
    polydf.printSchema()

    error: overloaded method value toDf with alternatives:
      (spatialPairRDD: org.apache.spark.api.java.JavaPairRDD[com.vividsolutions.jts.geom.Geometry,com.vividsolutions.jts.geom.Geometry],sparkSession: org.apache.spark.sql.SparkSession)org.apache.spark.sql.DataFrame
      (spatialRDD: org.datasyslab.geospark.spatialRDD.SpatialRDD[com.vividsolutions.jts.geom.Geometry],sparkSession: org.apache.spark.sql.SparkSession)org.apache.spark.sql.DataFrame
    cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.SparkSession)

@jiayuasu

Member

commented May 4, 2018

@SrinivasRIL If you use the RDD/SQL API to load GeoJSON, the other attributes are stored in the geometry's UserData field. Use "myDf.map(...)" or "myRDD.map(...)" to access them.

GeoJSON support in GeoSpark is limited. I will fix this issue in 1.2.0 (which will be out in late May or early June). For now, WKT and WKB are preferred.

@SrinivasRIL

Author

commented May 5, 2018

@jiayuasu
Thanks a lot, we appreciate your help. We are awaiting the new release to test it against our huge dataset; it should work.
In the meantime we are working with WKT, which will hopefully do the job.

@jiayuasu jiayuasu added this to the 1.2.0 milestone Jun 13, 2018

@kalvinnchau


commented Jun 22, 2018

@jiayuasu
Does the SQL API read the other attributes correctly? I can get the RDD API to load the attributes, but when I use the SQL API it fails to load them.

Code:

    val geoJsonFile = "file:///data/test.geojson"

    val pointRDDSplitter = FileDataSplitter.GEOJSON
    val carryAttributes  = true
    val rdd              = new PolygonRDD(spark.sparkContext, geoJsonFile, pointRDDSplitter, carryAttributes)

    rdd.rawSpatialRDD.take(1).asScala.foreach(println)
    val rddWithOtherAttributes =
      rdd.rawSpatialRDD.rdd.map[String](f => f.getUserData.asInstanceOf[String])
    rddWithOtherAttributes.take(1).foreach(println)

    var df = spark.read
      .format("csv")
      .option("delimiter", "\t")
      .option("header", "false")
      .load(geoJsonFile)

    df.show(1, false)
    df.createOrReplaceTempView("tabblock")

    val converted = spark.sql("""
        | SELECT ST_GeomFromGeoJSON(tabblock._c0) AS shape
        | FROM tabblock
      """.stripMargin)

    converted.show(1, false)
    converted.createOrReplaceTempView("poly_coords")

RDD Output

POLYGON ((-135.275376 56.883444,
 -135.275037 56.884441, 
-135.27467 56.885352, 
-135.274355 56.885938, ...))
	{tabblock_id=02220000100191}

SQL show() output (the attributes are not split into separate _c* columns)

truncated to save space
|_c0|
|{"type":"Feature",
"properties":{ "tabblock_id" : "022200001001962"},
"geometry":{"type":"MultiPolygon",
"coordinates":[[[[-135.275376,56.883444]
,[-135.275037,56.884441]
,[-135.27467,56.885352]....[-135.275376,56.883444]]]]}}|

SQL Show() after running SELECT ST_GeomFromGeoJSON(tabblock._c0) AS shape

truncated to save space
|shape|
---------
POLYGON ((-135.275376 56.883444,
 -135.275037 56.884441, 
-135.27467 56.885352, 
-135.274355 56.885938, 
-135.274264 56.886466, 
-135.273923 56.887044,....))

Do we have to map over the entire _c0 column as a string and manually extract the attributes we want in order to get them into the DataFrame?

@jiayuasu

This comment has been minimized.

Copy link
Member

commented Jun 22, 2018

@kalvinnchau Unfortunately, yes, you have to manually extract the attributes. This will be fixed in 1.2.0.
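For reference, the manual extraction amounts to parsing each _c0 string as JSON and pulling the wanted keys out of the properties object. A minimal Python sketch (the function name is ours; the tabblock_id key is taken from the example output above):

```python
import json

def extract_attributes(feature_json, keys):
    """Parse a single GeoJSON Feature string and return the requested
    property values, in order, so they can become DataFrame columns."""
    feature = json.loads(feature_json)
    properties = feature.get("properties", {})
    # Missing keys come back as None rather than raising
    return [properties.get(k) for k in keys]
```

In Spark this logic would run inside a map over the _c0 column (or a UDF); the same parse-then-lookup approach applies in Scala with a JSON library.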
