SELECT * FROM tabletest WHERE col1 IN (0,10,5,27 ) #615

Closed
hdominguez-stratio opened this issue Nov 25, 2015 · 6 comments

@hdominguez-stratio

When I try to execute a query with an IN operator on a column of type LONG, the datasource throws the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.util.TaskCompletionListenerException: SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed; shardFailures {[W61yWz-NRv6PzH1kPzisiA][databasetest][0]: SearchParseException[[databasetest][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"filtered":{ "query":{"match_all":{}},"filter": { "and" : [ {"query":{"match":{"ident":"0 10 5 27"}}} ] } }}}]]]; nested: NumberFormatException[For input string: "0 10 5 27"]; }]
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:90)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
        at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
        at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
        at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
        at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)

The column is of type LONG.

The executed test is the following (Cucumber format):

  Scenario: SELECT * FROM tabletest WHERE ident IN (0,10,5,27);
    When I execute 'SELECT * FROM tabletest WHERE ident IN (0,10,5,27)'
    Then The result has to have '2' rows ignoring the order:
      | ident-long | name-string   | money-double  |  new-boolean  | date-date  |
      |    0       | name_0        | 10.2          |  true         | 1999-11-30 |
      |    5       | name_5        | 15.2          |  true         | 2005-05-05 |
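
The stack trace suggests the connector serializes the pushed-down IN predicate as a single match query whose value is the space-joined list of numbers. A minimal Scala sketch of that serialization, reconstructed from the exception rather than taken from the connector code, assuming the datasource receives Spark's standard In filter:

    import org.apache.spark.sql.sources.In

    // Spark pushes the predicate down to the connector as an In filter:
    val filter = In("ident", Array(0L, 10L, 5L, 27L))

    // Joining the values with spaces yields one string instead of a list:
    val joined = filter.values.mkString(" ")   // "0 10 5 27"
    val query  = s"""{"query":{"match":{"${filter.attribute}":"$joined"}}}"""
    // => {"query":{"match":{"ident":"0 10 5 27"}}}
    // Elasticsearch then tries to parse "0 10 5 27" as a single long,
    // which produces the NumberFormatException seen above.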
@hdominguez-stratio

When I try the same with a column of type DOUBLE or DATE, a similar exception is thrown.

@costin

costin commented Nov 25, 2015

What version of es-hadoop are you using?

@hdominguez-stratio

I'm using version 2.1.1 of es-hadoop and version 1.7.3 of Elasticsearch.

@costin

costin commented Nov 26, 2015

Please use 2.1.2 as it likely fixed your issue.

@hdominguez-stratio

Thanks, this is fixed in version 2.1.2, but it still does not work with the DATE type.

ES MAPPING:

{"databasetest":{"mappings":{"tabletest":{"properties":{"date":{"type":"date","format":"dateOptionalTime"},"ident":{"type":"long"},"money":{"type":"double"},"name":{"type":"string"},"new":{"type":"boolean"}}}}}}

CUCUMBER TEST:

 Scenario: [ES] SELECT date FROM tabletest WHERE date IN ('1999-11-30','1998-12-25','2005-05-05','2008-2-27');
    When I execute 'SELECT date FROM tabletest WHERE date IN ('1999-11-30','1998-12-25','2005-05-05','2008-2-27')'
    Then The result has to have '2' rows ignoring the order:
       | date-date  |
       | 1999-11-30 |
       | 2005-05-05 |
EXCEPTION:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 85, localhost): org.apache.spark.util.TaskCompletionListenerException: SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed; shardFailures {[OPZ3P8qmTFSGSYiqg_Z5VA][databasetest][0]: SearchParseException[[databasetest][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"filtered":{ "query":{"match_all":{}},"filter": { "and" : [ {"or":{"filters":[{"query":{"match":{"date":""1999-11-30T00:00:00+01:00" "1998-12-25T00:00:00+01:00" "2005-05-05T00:00:00+02:00" "2008-02-27T00:00:00+01:00""}}}]}} ] } }}}]]]; nested: QueryParsingException[[databasetest] Failed to parse]; nested: JsonParseException[Unexpected character ('1' (code 49)): was expecting comma to separate OBJECT entries
 at [Source: [B@3a91aa51; line: 1, column: 118]]; }]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:90)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)

I do not know whether this could be a Spark issue.
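
For what it's worth, the source string in the exception shows why the DATE case fails at the JSON level: each date value is rendered with its own quotes, and the match-query template wraps the whole joined string in quotes again, which points at the generated query rather than Spark. A minimal sketch of that double quoting (my own reconstruction, not the connector code):

    val dates  = Array("1999-11-30T00:00:00+01:00", "2005-05-05T00:00:00+02:00")
    // Each value already carries its own quotes, then everything is space-joined...
    val joined = dates.map(d => "\"" + d + "\"").mkString(" ")
    // ...and the template wraps the result in quotes once more:
    val broken = s"""{"query":{"match":{"date":"$joined"}}}"""
    // => {"query":{"match":{"date":""1999-11-30T00:00:00+01:00" "2005-05-05T00:00:00+02:00""}}}
    // Nested, unescaped quotes => JsonParseException ("was expecting comma to
    // separate OBJECT entries").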

@ebuildy

ebuildy commented Dec 28, 2015

Maybe a comma is missing here:

https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-13/src/main/scala/org/elasticsearch/spark/sql/DefaultSource.scala#L261

strings.mkString("\"", " ", "\"") ==> strings.mkString("\"", ",", "\"")

?
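
A comma separator alone would still produce a single quoted string rather than a JSON array, so a match query against a numeric or date field would most likely keep failing; illustrative only:

    Array(0L, 10L, 5L, 27L).mkString("\"", ",", "\"")   // "\"0,10,5,27\""
    // => {"query":{"match":{"ident":"0,10,5,27"}}} — still one string for ES to parse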

costin added a commit that referenced this issue Dec 29, 2015
When using Date types with Spark IN filter, apply a terms query instead of match

relates #615
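
For reference, a terms-based rendering along the lines of the commit message above would emit a proper JSON array. A minimal sketch (a hypothetical helper for illustration, not the actual DefaultSource code), assuming string-rendered values:

    // Hypothetical helper, for illustration only.
    def termsFilter(attribute: String, values: Array[Any]): String = {
      val rendered = values.map(v => "\"" + v + "\"").mkString(",")
      s"""{"terms":{"$attribute":[$rendered]}}"""
    }

    termsFilter("date", Array("1999-11-30", "1998-12-25", "2005-05-05", "2008-2-27"))
    // => {"terms":{"date":["1999-11-30","1998-12-25","2005-05-05","2008-2-27"]}}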
@costin costin closed this as completed Jan 8, 2016
costin added a commit that referenced this issue Jan 16, 2016
When using Date types with Spark IN filter, apply a terms query instead of match

relates #615

(cherry picked from commit 6b212e7)