
Spark: UpdateScriptParams: JSON serialization error #351

Closed
jakrol opened this issue Jan 7, 2015 · 4 comments

jakrol commented Jan 7, 2015

I'm using Spark and JavaEsSpark.saveJsonToEs for an upsert to Elasticsearch.
When I try to pass two parameters of type String, only the second one ends up quoted in the generated request. It's quite hard to explain, so maybe a code snippet will clarify things:

...
Map<String, String> settings = new HashMap<String, String>();
settings.put(ConfigurationOptions.ES_RESOURCE_WRITE, "test/pojo");
settings.put(ConfigurationOptions.ES_WRITE_OPERATION, ConfigurationOptions.ES_OPERATION_UPSERT);
settings.put(ConfigurationOptions.ES_UPDATE_SCRIPT, "ctx._source.s1=s1;ctx._source.s1=s1;");
// each entry maps a script param name to a document field (param:field)
settings.put(ConfigurationOptions.ES_UPDATE_SCRIPT_PARAMS, "s1:s1,s2:s2");
settings.put(ConfigurationOptions.ES_MAPPING_ID, "id");

// pojo.toString() generates {"id":1,"s1":"val1","s2":"val2"}
JavaRDD<String> json = sc.parallelize(Arrays.asList(pojo)).map(p -> p.toString());
JavaEsSpark.saveJsonToEs(json, settings);

The execution results in:

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: JsonParseException[Unrecognized token 'val1': was expecting ('true', 'false' or 'null')
 at [Source: [B@6055dec5; line: 1, column: 22]]; fragment[e":{"_id":"1"}}
{"pa]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:322)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:299)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:149)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:199)
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:223)
    at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
    at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
    at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Wireshark shows the request body:

{"update":{"_id":"1"}}
{"params":{"s1":val1,"s2":"val2"},"script":"ctx._source.s1=s1;ctx._source.s1=s1;","upsert":{"id":1,"s1":"val1","s2":"val2"}}

Note that the first param value (val1) is emitted without quotes, so the params line is not valid JSON; that is exactly what the JsonParseException above complains about.

I'm using Elasticsearch 1.4.2 and have tried both elasticsearch-hadoop 2.1.0.Beta3 and the latest SNAPSHOT.

Any ideas / workarounds are appreciated.

jakrol commented Jan 9, 2015

I did some source-code digging and finally was able to fix it (at least for my use cases). What I did:

  • ParsingUtils: values() returns List<Object> instead of List<String>
  • AbstractDefaultParamsExtractor: access modifiers for some properties changed from private to protected
  • UpdateScriptParamsExtractor added
  • JsonFieldExtractors: uses UpdateScriptParamsExtractor instead of an anonymous AbstractDefaultParamsExtractor instance
  • AbstractBulkFactory.FieldWriter: write(Object object): list handling changed so that all values are written using doWrite(value, false) instead of doWrite(value, true) for the last list entry (see the sketch below)

It has not been thoroughly tested! You can find the changes at https://github.com/jakrol/elasticsearch-hadoop
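To make the last bullet concrete, here is a minimal sketch of the changed list handling. Only write(Object) and doWrite(...) are named above; the wrapper class, the exact doWrite signature and the meaning of its boolean flag are assumptions:

import java.util.List;

// Sketch only: a stand-in for the list handling in AbstractBulkFactory.FieldWriter
abstract class FieldWriterSketch {
    void write(Object object) {
        if (object instanceof List) {
            for (Object value : (List<?>) object) {
                // before the fix, the last entry went through doWrite(value, true);
                // now every entry takes the same doWrite(value, false) path
                doWrite(value, false);
            }
        } else {
            doWrite(object, false);
        }
    }

    // assumed signature, based on the calls named in the list above
    abstract void doWrite(Object value, boolean flag);
}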

costin commented Jan 9, 2015

Hi @jakrol,

Thanks a lot for the bug report and for trying to fix the issue. I've posted a different type of fix in master (I've reworked the JSON templating engine a bit) to address not only your issue but also to handle JSON values better (in particular, to not quote null, boolean and number types; see the sketch below).
It would be great if you could try the latest dev snapshot (already published to Maven) and report back.
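The gist of the quoting rule, sketched with illustrative names (not the actual es-hadoop internals):

// Hypothetical sketch of type-aware quoting: null, Boolean and Number values
// are written as raw JSON literals, everything else is quoted as a string.
class JsonValues {
    static String asJsonValue(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Boolean || value instanceof Number) {
            return value.toString();
        }
        // the real code would also escape quotes and backslashes in the value
        return "\"" + value + "\"";
    }
}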

P.S. Again, thanks for working on a fix. I hope to see more in the future 👍
A couple of remarks if I may: first, consider using pull requests (see the contributing guidelines that appear when you open one); second, configure your editor to format only the parts that you modified or added. Eclipse supports this, and I suspect IntelliJ and the rest of the Java IDEs do as well.
This significantly reduces the noise; otherwise a single modified character can reformat an entire file and make the diff huge and barely readable.

Cheers,

jakrol commented Jan 12, 2015

Hi @costin,

thanks for your response. I just tried 2.1.0.BUILD-SNAPSHOT and everything works fine.

Regards,

jakrol closed this as completed Jan 12, 2015

costin commented Jan 12, 2015

Great! Thanks again for everything!
