
Reading boolean 0/1 values from ES into Spark does not work (although false/true is ok) #795

Closed
mathieu-rossignol opened this issue Jun 29, 2016 · 1 comment

Comments


mathieu-rossignol commented Jun 29, 2016

What kind of issue is this?

  • [x] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.
  • Feature Request. Start by telling us what problem you’re trying to solve.
    Often a solution already exists! Don’t send pull requests to implement new features without
    first getting our support. Sometimes we leave features out on purpose to keep the project small.

Issue description

Define some boolean fields in ES (using a mapping). Load the values 0/1 into those fields, then try to read them from Spark. This throws an exception: only false/true values are supported.

I understand that custom dates also don't work (see issue #624). But for booleans, it seems fairly easy to support the full set of boolean values specified here: https://www.elastic.co/guide/en/elasticsearch/reference/current/boolean.html. That is:
False values: false, "false", "off", "no", "0", "" (empty string), 0, 0.0
True values: anything that isn't false.
Mapping these to the correct java.lang.Boolean value should be easy.
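The rules above can be sketched as a small helper. This is an illustrative sketch of the Elasticsearch semantics only, not the connector's actual code; the class and method names are made up:

```java
public class EsBooleanSemantics {
    // Per the ES boolean mapping docs: false, "false", "off", "no", "0",
    // "" (empty string), 0 and 0.0 are false; everything else is true.
    // This sketch handles the string forms only.
    public static boolean parse(String s) {
        if (s == null) return false;
        switch (s.trim().toLowerCase()) {
            case "false":
            case "off":
            case "no":
            case "0":
            case "0.0":
            case "":
                return false;
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1"));   // true
        System.out.println(parse("0"));   // false
        System.out.println(parse("off")); // false
    }
}
```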

Possible workarounds:

I think supporting these values makes sense, but if I am wrong, please tell me, no problem.
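One hedged workaround sketch: read the index as raw JSON (for example via `JavaEsSpark.esJsonRDD`, which returns (id, `_source` JSON) pairs without typed parsing) and coerce the quoted 0/1 values into real JSON booleans yourself before handing the documents to Spark SQL. The coercion step below is a deliberately naive, regex-based sketch tied to this toy index; the field name `boolField` and the class name are assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoolFieldFixer {
    // Hypothetical per-document post-processing: rewrite "boolField": "0"/"1"
    // into real JSON booleans so the string never reaches the strict parser.
    private static final Pattern BOOL_FIELD =
            Pattern.compile("\"boolField\"\\s*:\\s*\"([01])\"");

    public static String fix(String sourceJson) {
        Matcher m = BOOL_FIELD.matcher(sourceJson);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String literal = "1".equals(m.group(1)) ? "true" : "false";
            m.appendReplacement(sb, "\"boolField\": " + literal);
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("{\"boolField\" : \"1\"}")); // {"boolField": true}
    }
}
```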

Steps to reproduce

Code:

package bug.reproduce;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

/**
 * Reproduces the bug where reading 0/1 boolean values from ES fails.
 * Prepare the ES index with test data using the following curl or Sense commands:

DELETE /spark

POST /spark
{
  "mappings": {
    "data": {
      "dynamic_templates": [
        {
          "startingWithBool": {
            "match": "bool*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "boolean"
            }
          }
        }
      ]
    }
  }
}

GET /spark/_mapping


POST /spark/data/1
{
  "id": "1",
  "boolField" : "false"
}

POST /spark/data/2
{
  "id": "2",
  "boolField" : "true"
}

POST /spark/data/3
{
  "id": "3",
  "boolField" : "0"
}

POST /spark/data/4
{
  "id": "4",
  "boolField" : "1"
}

GET /spark/data/1
GET /spark/data/2
GET /spark/data/3
GET /spark/data/4

 */
public class SparkEsBooleanBug {

    public static void main(String[] args)
    {
        /**
         * Setup Spark and ES connection
         */
        SparkConf sparkConf = new SparkConf().setAppName("Spark-ES Reproduce Bug").setMaster("local");

        // Point to ES local instance
        sparkConf.set("spark.es.nodes", "localhost");
        sparkConf.set("spark.es.port", "9200");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Setup table wrapping ES index
        DataFrame esDataFrame = JavaEsSparkSQL.esDF(sqlContext, "spark/data");
        esDataFrame.registerTempTable("TEST_TABLE");

        /**
         * Read documents with false/true values-> no problem
         */

        System.out.println("################# Reading documents with false/true values, this should be ok...");

        DataFrame resultDataFrame = sqlContext.sql("SELECT * FROM TEST_TABLE WHERE id <= '2'");
        resultDataFrame.show();

        /**
         * Read documents with 0/1 values-> exception:
         * org.elasticsearch.hadoop.rest.EsHadoopParsingException: Cannot parse value [1] for field [boolField]
         */

        System.out.println("################# Reading documents with 0/1 values, this raises an exception...");

        resultDataFrame = sqlContext.sql("SELECT * FROM TEST_TABLE WHERE id >= '3'");
        resultDataFrame.show();
    }
}

Stack trace:

org.elasticsearch.hadoop.rest.EsHadoopParsingException: Cannot parse value [1] for field [boolField]
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:713)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:704)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:458)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:383)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:278)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:251)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:456)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:86)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: For input string: "1"
    at scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:238)
    at scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:226)
    at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:31)
    at org.elasticsearch.spark.serialization.ScalaValueReader.parseBoolean(ScalaValueReader.scala:112)
    at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$booleanValue$1.apply(ScalaValueReader.scala:111)
    at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$booleanValue$1.apply(ScalaValueReader.scala:111)
    at org.elasticsearch.spark.serialization.ScalaValueReader.checkNull(ScalaValueReader.scala:81)
    at org.elasticsearch.spark.serialization.ScalaValueReader.booleanValue(ScalaValueReader.scala:111)
    at org.elasticsearch.spark.serialization.ScalaValueReader.readValue(ScalaValueReader.scala:67)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.readValue(ScalaEsRowValueReader.scala:28)
    at org.elasticsearch.hadoop.serialization.ScrollReader.parseValue(ScrollReader.java:726)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:711)
    ... 34 more
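For context, the `Caused by` above comes from Scala's `StringLike#toBoolean`, to which `ScalaValueReader.parseBoolean` delegates, and which accepts only the literals "true"/"false" (case-insensitive). A minimal Java illustration of that strict behavior (`StrictBoolean` is a made-up name for illustration, not the connector's code):

```java
public class StrictBoolean {
    // Mirrors the strictness of Scala's StringLike#toBoolean: only the
    // literals "true"/"false" (case-insensitive) are accepted; anything
    // else, including "1", raises IllegalArgumentException.
    public static boolean toBoolean(String s) {
        if ("true".equalsIgnoreCase(s)) return true;
        if ("false".equalsIgnoreCase(s)) return false;
        throw new IllegalArgumentException("For input string: \"" + s + "\"");
    }

    public static void main(String[] args) {
        System.out.println(toBoolean("true")); // true
        System.out.println(toBoolean("1"));    // throws IllegalArgumentException
    }
}
```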

Version Info

OS: Linux
JVM: 7
Hadoop/Spark: spark-sql_2.10, version 1.6.1
ES-Hadoop: elasticsearch-spark_2.10, version 2.3.2
ES: 2.3.2 (also 2.3.2 for the external instance)

Member

jbaiera commented Jun 29, 2016

Thanks for raising this issue! After combing through the code, I have opened a related issue (#798).
