Spark SQL 1.3 - Exception in Elasticsearch when executing JOIN with DataFrame created from Oracle's table #449


Hi,

Environment:
Spark: 1.3.0
Elasticsearch: 1.4.4
Elasticsearch-Hadoop: 2.1.0.Beta4
Oracle: Express 11g XE

I am trying to run a join between two DataFrames, one loaded from Elasticsearch and the other from Oracle, and I am getting the exception below.
The issue only occurs when the Elasticsearch data is loaded through Spark SQL's load function; it does not occur when loading with the esDF function. With esDF, however, the mapping of values to column names is broken (see issue #451).
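
For context, the two loading paths look roughly like this (a minimal sketch; summary/hours is the same index/type used in the full code below, and esDF comes from the org.elasticsearch.spark.sql package import):

    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._ // adds esDF to SQLContext

    val sqlContext = new SQLContext(sc)

    // Generic Spark SQL data source API -- this path hits the exception on join
    val viaLoad = sqlContext.load("summary/hours", "org.elasticsearch.spark.sql")

    // esDF helper -- no exception on join, but column name mapping is broken (#451)
    val viaEsDF = sqlContext.esDF("summary/hours")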

com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): org.elasticsearch.spark.sql.ScalaEsRow
    at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
    at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
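
Kryo's FieldSerializer instantiates objects through a no-arg constructor, which org.elasticsearch.spark.sql.ScalaEsRow apparently lacks; that would explain why the per-DataFrame foreach calls succeed and the failure only surfaces once the join forces a shuffle that deserializes the rows. If Kryo is enabled via spark.serializer in the job configuration, one unverified thing to try is falling back to Java serialization, which does not require a no-arg constructor; note this may not help if Spark SQL's internal shuffle serializer is the one invoking Kryo:

    import org.apache.spark.SparkConf

    // Unverified workaround sketch: switch the job back to Java serialization.
    // "es-oracle-join" is a placeholder app name.
    val conf = new SparkConf()
      .setAppName("es-oracle-join")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")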

Code:

    //=======================================================================
    // Preparing Elasticsearch DataFrame
    //=======================================================================

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val hours = sqlContext.load("summary/hours", "org.elasticsearch.spark.sql")
    hours.foreach (x => println(x))

    //=======================================================================
    // Preparing Oracle DataFrame
    //=======================================================================

    // dbConnectionString is the Oracle JDBC URL, defined elsewhere
    val users = sqlContext.load("jdbc", Map("url"     -> dbConnectionString,
                                            "dbtable" -> "sats.users",
                                            "driver"  -> "oracle.jdbc.driver.OracleDriver"))
    users.foreach (x => println(x))

    //=======================================================================
    // Joining ES and Oracle DataFrames
    //=======================================================================

    val hoursAug = hours.join(users, hours("User") === users("USERNAME"))
    hoursAug.foreach (x => println(x))

The elements of both the Elasticsearch and Oracle DataFrames print successfully (the first two foreach calls); the exception is thrown only when the join is executed.
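
One unverified workaround sketch I can think of: copy the Elasticsearch rows into generic Rows before joining, so that no ScalaEsRow instances reach the shuffle (hoursPlain and hoursAug2 are names introduced here for illustration):

    import org.apache.spark.sql.Row

    // Unverified sketch: rebuild the ES DataFrame from plain Rows so the
    // shuffle serializes generic Rows instead of ScalaEsRow instances
    val hoursRows  = hours.map(r => Row.fromSeq(r.toSeq))
    val hoursPlain = sqlContext.createDataFrame(hoursRows, hours.schema)
    val hoursAug2  = hoursPlain.join(users, hoursPlain("User") === users("USERNAME"))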

Can you please advise whether this is a bug and, if so, whether there is a workaround?

Thanks,
Dmitriy Fingerman
