ArrayIndexOutOfBoundsException on Spark SQL with 2.1.0.rc1 #482
Comments
@sjoerdmulder I'm having trouble following the report. Can you please provide an actual reproducible example?
Oh, and please turn on logging (see the docs) to …
@costin Thanks for the quick response; ignore the initial mapping story :) I can't provide a full example, but I have investigated the problem: in our data the field in ES was always set to an empty array. This doesn't update the mapping in ES, but the field does exist in the document, so ES Spark then tries to detect the mapping for a field that has no mapping in ES. After adding a value to the array, the mapping is updated and the ES Spark query works.
I see - so if I understand correctly, you have a field that only ever holds an empty array, so no mapping is created for it. I'd still like to understand how you ended up in this situation, so even if you cannot replicate your environment, could you provide a sample Spark script just so I can understand the situation better?
Here you go:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object ESMappingIssue {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ESMappingIssue").setMaster("local[1]")
    val sc = new SparkContext(conf)
    // "nested.bar" is an empty array, so Elasticsearch never creates a mapping for it
    val json = """{"foo" : 5, "nested": { "bar" : [], "what": "now" } }"""
    sc.makeRDD(Seq(json)).saveJsonToEs("spark/mappingtest")
    // Reading the index back fails while resolving the mapping for "nested.bar"
    val df = new SQLContext(sc).read.format("org.elasticsearch.spark.sql").load("spark/mappingtest")
    df.collect().foreach(println)
    sc.stop()
  }
}
```
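For clarity: the save itself succeeds; it is reading the index back into a DataFrame that blows up, since the empty nested.bar array never gets a mapping for ES Spark to resolve.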
The `java.lang.IndexOutOfBoundsException` also occurs if the "bar" array contains valid JSON objects:

```scala
val json = """{"foo" : 5, "nested": { "bar" : [{"test":"1", "age":20},{"test":"2", "age":21}], "what": "now" } }"""
```

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-array-type.html
Hi Costin, we are getting the same `java.lang.ArrayIndexOutOfBoundsException`. We are using elasticsearch-hadoop-2.1.0.jar along with Spark 1.3 and Elasticsearch 1.4.4. It turned out that the value of one of the keys inside a nested object in the JSON document was an empty string, and that caused the issue. For example, the document was something like {"User": {"UserGroup": "", ...}}.
Note that User.UserGroup was empty. If it is non-empty, the Spark SQL query works fine. This is similar to the empty array (bar) in the example given by sjoerdmulder. As a workaround we are keeping a default value for all the keys, but my question is: is there any known solution or workaround for this?
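For illustration, our workaround amounts to something like this (index name and document shape are made up, not our real data):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object DefaultValueWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DefaultValueWorkaround").setMaster("local[1]"))
    // Give every key a non-empty default before indexing, so Elasticsearch
    // always creates a mapping entry for it.
    val doc = """{"User": {"UserGroup": "default-group"}}"""
    sc.makeRDD(Seq(doc)).saveJsonToEs("spark/users")
    sc.stop()
  }
}
```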
Checking in to see if this was replicated. We've been unable to use …
Is this related to #484?
It looks like it might be, yes. Have you tried the latest dev builds?
Just tried the latest dev build. The issue is still present.
Here is the Scala test that I'm running: …
btw, this works: …
HOWEVER, this can be broken as well. Fully qualified field names using dot notation (similar to the field-name filters) would be more robust and would avoid ambiguous field names, since the following breaks it: …
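To sketch what I mean by fully qualified names (assuming the es.read.field.include setting from the es-hadoop configuration docs accepts dot-notation paths; the index is the one from the earlier example):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object FieldIncludeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("FieldIncludeSketch").setMaster("local[1]"))
    // Read only explicitly named fields, using dot notation for nested ones,
    // so ambiguous or unmapped fields never enter schema detection.
    val df = new SQLContext(sc).read
      .format("org.elasticsearch.spark.sql")
      .option("es.read.field.include", "foo,nested.what")
      .load("spark/mappingtest")
    df.collect().foreach(println)
    sc.stop()
  }
}
```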
@jeffsteinmetz I've refactored the logic, pushed the updates to master, and published new dev builds.
Confirmed. Thank you. This worked (I tried multiple array fields): …
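A sketch of the kind of test I mean, with several array fields, one of them empty (names are illustrative, not the exact snippet I ran):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object MultiArrayTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("MultiArrayTest").setMaster("local[1]"))
    // Several array fields, including an empty one that gets no ES mapping.
    val json = """{"foo": 5, "nested": {"bar": [], "tags": ["a", "b"], "what": "now"}}"""
    sc.makeRDD(Seq(json)).saveJsonToEs("spark/multiarray")
    val df = new SQLContext(sc).read
      .format("org.elasticsearch.spark.sql")
      .load("spark/multiarray")
    df.collect().foreach(println) // succeeds on the refactored dev build
    sc.stop()
  }
}
```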
Closing the issue... |
Using ES-hadoop 2.1.0.rc1, Spark 1.4.0, Elasticsearch 1.6.0.
The ES index that we use contains various events with a variety of fields, but the (custom) schema that we defined holds only the "common" fields that the SQL query will use. Somehow it still tries to map a field that is neither in the schema nor used in the SQL, causing an `ArrayIndexOutOfBoundsException` since the `indexOf` is returning `-1`.
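Not the actual es-hadoop internals, just a minimal sketch of that failure mode, where the result of indexOf is used as an array index without a -1 check:

```scala
// Minimal sketch of the failure mode, not the actual es-hadoop code:
// indexOf's result is used as an array index without checking for -1.
val schemaFields = Array("foo", "what") // fields defined in the custom schema
val rowValues    = Array("5", "now")    // values aligned with the schema
val pos = schemaFields.indexOf("bar")   // -1: "bar" is not in the schema
println(rowValues(pos))                 // throws ArrayIndexOutOfBoundsException
```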