
geospark geomesa interoperation #253

Closed

geoHeil opened this issue Jul 11, 2018 · 15 comments

geoHeil commented Jul 11, 2018

As a data scientist I want to be able to mix and match spatial libraries for Spark. Currently, it is an either/or choice: the libraries do not integrate with each other and have overlapping classes and UDF function names.

In particular, I want to be able to integrate GeoSpark and GeoMesa easily.

One possibility could be to write my own UDF registrator: https://github.com/DataSystemsLab/GeoSpark/blob/master/sql/src/main/scala/org/datasyslab/geosparksql/UDF/UdfRegistrator.scala

def registerAll(sparkSession: SparkSession): Unit = {
  Catalog.expressions.foreach(f => FunctionRegistry.builtin.registerFunction("geospark_" + f.getClass.getSimpleName.dropRight(1), f))
  Catalog.aggregateExpressions.foreach(f => sparkSession.udf.register("geospark_" + f.getClass.getSimpleName, f))
}

However, this still does not handle the overlapping classes (JTS, GeoTools).
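For illustration, wiring this up could look roughly as follows (a sketch: SQLTypes.init is GeoMesa's spark-sql entry point, GeoSparkUdfRegistrator is a hypothetical object wrapping the registerAll above, and the DECIMAL cast mirrors GeoSpark's documented ST_Point usage):

import org.apache.spark.sql.{SQLTypes, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("geomesa-geospark").getOrCreate()
SQLTypes.init(spark.sqlContext)            // GeoMesa: JTS UDTs plus the st_* UDFs
GeoSparkUdfRegistrator.registerAll(spark)  // hypothetical wrapper around registerAll above

// GeoMesa's st_point and GeoSpark's prefixed ST_Point could then coexist
spark.sql("SELECT st_point(1.0, 2.0)").show()
spark.sql("SELECT geospark_ST_Point(CAST(1.0 AS DECIMAL(24,20)), CAST(2.0 AS DECIMAL(24,20)))").show()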


geoHeil commented Jul 12, 2018

The suggested workaround seems to work only partially. When the UDFs are not renamed, the calls simply fall back to GeoMesa's functions.

But when the UDFs are renamed as well (I actually want the speedup of GeoSpark), the functions do not seem to be registered properly:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Undefined function: 'geospark_ST_Point'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
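One possible explanation, assuming Spark 2.x internals: FunctionRegistry.builtin is cloned into the session state when the SparkSession is created, so functions registered into the builtin registry afterwards never reach an already-running session. A sketch that targets the session's own registry instead (createOrReplaceTempFunction is the Spark 2.3 API):

import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.UDF.Catalog

// Register the prefixed GeoSpark expressions into the session's own function
// registry rather than the static FunctionRegistry.builtin.
def registerPrefixed(sparkSession: SparkSession): Unit = {
  Catalog.expressions.foreach { f =>
    val name = "geospark_" + f.getClass.getSimpleName.dropRight(1)
    sparkSession.sessionState.functionRegistry.createOrReplaceTempFunction(name, f)
  }
  Catalog.aggregateExpressions.foreach { f =>
    sparkSession.udf.register("geospark_" + f.getClass.getSimpleName, f)
  }
}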


geoHeil commented Jul 12, 2018

A reproducible example can be found at https://github.com/geoHeil/geomesa-geospark


geoHeil commented Jul 17, 2018

I get the following problems: a clash of classes,

18-07-17 21:36:03 WARN UDTRegistration: Cannot register UDT for com.vividsolutions.jts.geom.Geometry, which is already registered.

when changing the scope from compileOnly to compile and executing in IDEA. Executing the fat jar produced by the build tool in a shell fails with a timeout.

@jiayuasu (Member)

@geoHeil This is probably because GeoMesa also has its own custom Geometry Kryo serializer, similar to GeoSpark's. GeoSpark has a fair amount of code that puts spatial indexes and geometries into an array. Since both projects use JTS geometries, this could be a conflict.
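For context, Spark accepts only a single Kryo registrator per session, so the two registrators end up competing for the same config key (a sketch; both class names are the documented ones, but worth double-checking):

import org.apache.spark.sql.SparkSession

// GeoMesa's org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator would have
// to go into the same spark.kryo.registrator key, so only one of the two can win.
val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator")
  .getOrCreate()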


geoHeil commented Jul 18, 2018

See the latest updates to https://github.com/geoHeil/geomesa-geospark

one problem remains:

  • `18/07/18 21:13:33 WARN UDTRegistration: Cannot register UDT for com.vividsolutions.jts.geom.Geometry, which is already registered.` How can this be fixed easily? Shading JTS and the registrator does not seem like a maintainable idea.
  • understand why the ordering is important, and why, when GeoMesa is registered first and GeoSpark second, the error is:
    Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
  • query plans are impacted. GeoSpark's optimizations are only applied when it is not used in conjunction with GeoMesa; compare make runGeosparkSolo below.

comparison

geospark & geomesa

regular join

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#120L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#124L])
      +- *Project
         +- BroadcastNestedLoopJoin BuildRight, Inner, org.apache.spark.sql.geosparksql.expressions.ST_Contains$
            :- LocalTableScan [geom_polygons#72]
            +- BroadcastExchange IdentityBroadcastMode
               +- LocalTableScan [geom_points#60]

geospark solo

optimized range join

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#81L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#85L])
      +- *Project
         +- RangeJoin geom_polygons#43: geometry, geom_points#31: geometry, false
            :- LocalTableScan [geom_polygons#43]
            +- LocalTableScan [geom_points#31]


geoHeil commented Jul 18, 2018

@jiayuasu do you believe this conflict is what causes Spark to fall back to regular, i.e. no longer optimized, joins?

@jiayuasu (Member)

@geoHeil You can probably try registering the GeoSpark join strategy manually: https://github.com/DataSystemsLab/GeoSpark/blob/master/sql/src/main/scala/org/datasyslab/geosparksql/utils/GeoSparkSQLRegistrator.scala

In other words, add the following line:

sparkSession.experimental.extraStrategies = JoinQueryDetector :: Nil
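(JoinQueryDetector lives under org.apache.spark.sql.geosparksql.strategy.join in GeoSpark 1.x, so the complete snippet would be roughly:)

import org.apache.spark.sql.geosparksql.strategy.join.JoinQueryDetector

// must run before the first query so the planner can pick the optimized range join
sparkSession.experimental.extraStrategies = JoinQueryDetector :: Nil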


geoHeil commented Jul 18, 2018

@jiayuasu thanks a lot. That was indeed missing from my registrator; now the optimized range joins are used as well.

Do you have an opinion regarding the UDT registration and clashing class names? Or a better idea than my shading suggestion above?

geospark

import com.vividsolutions.jts.geom.Geometry
import com.vividsolutions.jts.index.SpatialIndex
UDTRegistration.register(classOf[Geometry].getName, classOf[GeometryUDT].getName)
UDTRegistration.register(classOf[SpatialIndex].getName, classOf[IndexUDT].getName)

geomesa

import com.vividsolutions.jts.geom._
val typeMap: Map[Class[_], Class[_ <: UserDefinedType[_]]] = Map(
    classOf[Geometry]            -> classOf[GeometryUDT],
    classOf[Point]               -> classOf[PointUDT],
    classOf[LineString]          -> classOf[LineStringUDT],
    classOf[Polygon]             -> classOf[PolygonUDT],
    classOf[MultiPoint]          -> classOf[MultiPointUDT],
    classOf[MultiLineString]     -> classOf[MultiLineStringUDT],
    classOf[MultiPolygon]        -> classOf[MultiPolygonUDT],
    classOf[GeometryCollection]  -> classOf[GeometryCollectionUDT]
  )

  1. Is there a good way to merge the registrations, i.e. prevent double registrations (or can these simply be ignored)? See the sketch below.
  2. Is GeoSpark adding some custom code (https://github.com/jiayuasu/JTSplus) under the com.vividsolutions.jts.* namespace? I.e., if the registration order is
     • geospark
     • geomesa
     is no functionality lost, so that any warning about double registration of the types could simply be ignored?
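Regarding question 1, a minimal sketch of an idempotent registration, using Spark's UDTRegistration.exists (note that UDTRegistration is private[spark], so this helper has to live under an org.apache.spark.sql package, just like both libraries' UDT code; the GeoSpark GeometryUDT is only an example):

import com.vividsolutions.jts.geom.Geometry
import org.apache.spark.sql.geosparksql.UDT.GeometryUDT
import org.apache.spark.sql.types.UDTRegistration

// only register a UDT mapping if no other library has claimed the class already
def registerIfAbsent(userClass: String, udtClass: String): Unit = {
  if (!UDTRegistration.exists(userClass)) {
    UDTRegistration.register(userClass, udtClass)
  }
}

registerIfAbsent(classOf[Geometry].getName, classOf[GeometryUDT].getName)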


geoHeil commented Jul 19, 2018

GeoMesa serializes using JTS (as WKB):

override def serialize(obj: T): InternalRow = {
    new GenericInternalRow(Array[Any](WKBUtils.write(obj)))
  }

  override def sqlType: DataType = StructType(Seq(
    StructField("wkb", DataTypes.BinaryType)
  ))
  override def deserialize(datum: Any): T = {
    val ir = datum.asInstanceOf[InternalRow]
    WKBUtils.read(ir.getBinary(0)).asInstanceOf[T]
  }

GeoSpark uses:

def serialize(geometry: Geometry): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val kryo = new Kryo()
    val geometrySerde = new GeometrySerde()
    val output = new Output(out)
    geometrySerde.write(kryo, output, geometry)
    output.close()
    return out.toByteArray
  }

  def deserialize(values: ArrayData): Geometry = {
    val in = new ByteArrayInputStream(values.toByteArray())
    val kryo = new Kryo()
    val geometrySerde = new GeometrySerde()
    val input = new Input(in)
    val geometry = geometrySerde.read(kryo, input, classOf[Geometry])
    input.close()
    return geometry.asInstanceOf[Geometry]
  }

Is there any problem if JTS geometries (from GeoMesa) are serialized via the GeoSpark serializer? Any concerns regarding efficiency?
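To get a feel for the difference, one could round-trip a single geometry through both paths (a sketch; GeometrySerde's package path is taken from GeoSpark 1.x and worth verifying):

import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}
import com.vividsolutions.jts.io.{WKBReader, WKBWriter}
import org.datasyslab.geospark.geometryObjects.GeometrySerde

val point = new GeometryFactory().createPoint(new Coordinate(13.4, 52.5))

// GeoMesa-style path: well-known binary
val wkb = new WKBWriter().write(point)

// GeoSpark-style path: Kryo plus GeometrySerde, as in the snippet above
val out = new ByteArrayOutputStream()
val output = new Output(out)
new GeometrySerde().write(new Kryo(), output, point)
output.close()

println(s"WKB: ${wkb.length} bytes, Kryo: ${out.toByteArray.length} bytes")
assert(new WKBReader().read(wkb).equalsExact(point)) // geometry survives the WKB round-trip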


geoHeil commented Jul 19, 2018

According to James (from the geomesa gitter chat):

One strategy that might work would be for GeoSpark and GeoMesa to agree on the classnames for UDT registrations
and then for end users to register ONE AND ONLY ONE set of the UDTs...
the UDFs could be based on those classnames, and there's a fighting chance that'd let someone 'mix and match' (as well as combine UDFs between packages)

Is there some interest from both projects to collaborate here?


geoHeil commented Jul 31, 2018

cannot resolve 'CAST(`hw_aggreagtion_area` AS ARRAY<TINYINT>)' due to data type mismatch: cannot cast org.apache.spark.sql.jts.PointUDT@449554e8 to org.apache.spark.sql.geosparksql.UDT.GeometryUDT@3f2c1eb5;

This is the clashing-UDT problem again. How could this be resolved (quickly)?
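One conceivable quick bridge is to round-trip through WKT, so that each library parses the value into its own UDT (a sketch; it assumes GeoMesa's st_asText and a prefixed geospark_ST_GeomFromWKT are both registered, and df and the alias are illustrative):

// serialize with GeoMesa's UDF to WKT, re-parse with GeoSpark's, so the column
// ends up as GeoSpark's GeometryUDT instead of the clashing GeoMesa PointUDT
val bridged = df.selectExpr(
  "geospark_ST_GeomFromWKT(st_asText(hw_aggreagtion_area)) AS area_as_geospark"
)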


jnh5y commented Jun 3, 2019

@geoHeil your blog post about this work is great!

@jiayuasu any thoughts on integration points?


geoHeil commented Jan 22, 2020

With the upgrade to GeoMesa 2.4.x, GeoTools was upgraded to version 21, which also internally switches to the LocationTech-based JTS.

#410

Unfortunately, having two versions of GeoTools on the classpath is causing trouble for me.


geoHeil commented Jan 24, 2020

Instead of scattered discussions in the gitter channels of GeoSpark and GeoMesa, perhaps this issue (#253) is the better place to continue the discussion.

Tasks to be done:

Short term

  • produce an additional jar that contains only the GeoSpark code, without transitive dependencies
  • for the currently existing fat jar, consider shading all transitive dependencies. An effort to shade GeoTools has already started in maven jar is a fat jar including geotools #410

Long term

  • align the versions of JTS used; upgrade GeoSpark's com.vividsolutions JTS to org.locationtech
  • align the version of GeoTools being used
  • potentially merge GeoSpark's JTSplus additions over to JTS; some discussion is needed

@jiayuasu , @jnh5y what do you think about this?
@jnh5y I believe in some of the gitter discussions you (or maybe Emilio, at least someone) mentioned possibly having some interns work on this. Do you know of any progress there? I believe this was in the geomesa gitter channel.

@elahrvivaz

Just as a clarification, shading refers to packaging the transitive dependencies in an uber-jar, but what you are referring to is shading + relocation, which will hide the transitive dependencies from everything else on the classpath. That seems like a good medium-term solution, although I will note that you may run into issues in your end project if you try to use the shade plugin there, while having a dependency that is also shaded/relocated (i.e. 2 levels of shading will likely not work).
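For reference, shading plus relocation would look roughly like this with sbt-assembly (a sketch; GeoSpark's own build uses Maven, see #410, and the rename target is illustrative):

// build.sbt: bundle GeoTools into the fat jar and relocate its packages so they
// cannot clash with another GeoTools version on the classpath
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.geotools.**" -> "geospark.shaded.org.geotools.@1").inAll
)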
