SELECT * FROM tabletest WHERE col1 IN (0,10,5,27 ) #615

Closed
hdominguez-stratio opened this issue Nov 25, 2015 · 6 comments

@hdominguez-stratio

When I try to execute a query with an IN operator on a column of type LONG, the datasource throws the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.util.TaskCompletionListenerException: SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed; shardFailures {[W61yWz-NRv6PzH1kPzisiA][databasetest][0]: SearchParseException[[databasetest][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"filtered":{ "query":{"match_all":{}},"filter": { "and" : [ {"query":{"match":{"ident":"0 10 5 27"}}} ] } }}}]]]; nested: NumberFormatException[For input string: "0 10 5 27"]; }]
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:90)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
        at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
        at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
        at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)

The column is of type LONG.

The executed test is the following (Cucumber format):

  Scenario: SELECT * FROM tabletest WHERE ident IN (0,10,5,27);
    When I execute 'SELECT * FROM tabletest WHERE ident IN (0,10,5,27)'
    Then The result has to have '2' rows ignoring the order:
      | ident-long | name-string   | money-double  |  new-boolean  | date-date  |
      |    0       | name_0        | 10.2          |  true         | 1999-11-30 |
      |    5       | name_5        | 15.2          |  true         | 2005-05-05 |
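
The stack trace suggests the connector serializes the pushed-down IN predicate as a single match query whose value is the space-joined list of numbers. A minimal Scala sketch of that serialization, reconstructed from the exception rather than taken from the connector code, assuming the datasource receives Spark's standard In filter:

    import org.apache.spark.sql.sources.In

    // Spark pushes the predicate down to the connector as an In filter:
    val filter = In("ident", Array(0L, 10L, 5L, 27L))

    // Joining the values with spaces yields one string instead of a list:
    val joined = filter.values.mkString(" ")   // "0 10 5 27"
    val query  = s"""{"query":{"match":{"${filter.attribute}":"$joined"}}}"""
    // => {"query":{"match":{"ident":"0 10 5 27"}}}
    // Elasticsearch then tries to parse "0 10 5 27" as a single long,
    // which produces the NumberFormatException seen above.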
@hdominguez-stratio

When I try the same with a column of type DOUBLE or DATE, a similar exception is thrown.

@costin

costin commented Nov 25, 2015

What version of es-hadoop are you using?

@hdominguez-stratio

I'm using version 2.1.1 of es-hadoop and version 1.7.3 of Elasticsearch.

@costin

costin commented Nov 26, 2015

Please use 2.1.2 as it likely fixed your issue.

@hdominguez-stratio

Thanks, this is fixed in version 2.1.2, but it still does not work with the DATE type.

ES MAPPING:

{"databasetest":{"mappings":{"tabletest":{"properties":{"date":{"type":"date","format":"dateOptionalTime"},"ident":{"type":"long"},"money":{"type":"double"},"name":{"type":"string"},"new":{"type":"boolean"}}}}}}

CUCUMBER TEST:

 Scenario: [ES] SELECT date FROM tabletest WHERE date IN ('1999-11-30','1998-12-25','2005-05-05','2008-2-27');
    When I execute 'SELECT date FROM tabletest WHERE date IN ('1999-11-30','1998-12-25','2005-05-05','2008-2-27')'
    Then The result has to have '2' rows ignoring the order:
       | date-date  |
       | 1999-11-30 |
       | 2005-05-05 |
EXCEPTION:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 85, localhost): org.apache.spark.util.TaskCompletionListenerException: SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed; shardFailures {[OPZ3P8qmTFSGSYiqg_Z5VA][databasetest][0]: SearchParseException[[databasetest][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"filtered":{ "query":{"match_all":{}},"filter": { "and" : [ {"or":{"filters":[{"query":{"match":{"date":""1999-11-30T00:00:00+01:00" "1998-12-25T00:00:00+01:00" "2005-05-05T00:00:00+02:00" "2008-02-27T00:00:00+01:00""}}}]}} ] } }}}]]]; nested: QueryParsingException[[databasetest] Failed to parse]; nested: JsonParseException[Unexpected character ('1' (code 49)): was expecting comma to separate OBJECT entries
 at [Source: [B@3a91aa51; line: 1, column: 118]]; }]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:90)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)

I do not know whether this could be a Spark issue.
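
For what it's worth, the source string in the exception shows why the DATE case fails at the JSON level: each date value is rendered with its own quotes, and the match-query template wraps the whole joined string in quotes again, which points at the generated query rather than Spark. A minimal sketch of that double quoting (my own reconstruction, not the connector code):

    val dates  = Array("1999-11-30T00:00:00+01:00", "2005-05-05T00:00:00+02:00")
    // Each value already carries its own quotes, then everything is space-joined...
    val joined = dates.map(d => "\"" + d + "\"").mkString(" ")
    // ...and the template wraps the result in quotes once more:
    val broken = s"""{"query":{"match":{"date":"$joined"}}}"""
    // => {"query":{"match":{"date":""1999-11-30T00:00:00+01:00" "2005-05-05T00:00:00+02:00""}}}
    // Nested, unescaped quotes => JsonParseException ("was expecting comma to
    // separate OBJECT entries").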

@ebuildy

ebuildy commented Dec 28, 2015

Maybe a comma is missing here:

https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-13/src/main/scala/org/elasticsearch/spark/sql/DefaultSource.scala#L261

strings.mkString("\"", " ", "\"") ==> strings.mkString("\"", ",", "\"")

?
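
A comma separator alone would still produce a single quoted string rather than a JSON array, so a match query against a numeric or date field would most likely keep failing; illustrative only:

    Array(0L, 10L, 5L, 27L).mkString("\"", ",", "\"")   // "\"0,10,5,27\""
    // => {"query":{"match":{"ident":"0,10,5,27"}}} — still one string for ES to parse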

costin added a commit that referenced this issue Dec 29, 2015
When using Date types with Spark IN filter, apply a terms query instead of match

relates #615
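
For reference, a terms-based rendering along the lines of the commit message above would emit a proper JSON array. A minimal sketch (a hypothetical helper for illustration, not the actual DefaultSource code), assuming string-rendered values:

    // Hypothetical helper, for illustration only.
    def termsFilter(attribute: String, values: Array[Any]): String = {
      val rendered = values.map(v => "\"" + v + "\"").mkString(",")
      s"""{"terms":{"$attribute":[$rendered]}}"""
    }

    termsFilter("date", Array("1999-11-30", "1998-12-25", "2005-05-05", "2008-2-27"))
    // => {"terms":{"date":["1999-11-30","1998-12-25","2005-05-05","2008-2-27"]}}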
@costin costin closed this as completed Jan 8, 2016
costin added a commit that referenced this issue Jan 16, 2016
When using Date types with Spark IN filter, apply a terms query instead of match

relates #615

(cherry picked from commit 6b212e7)