
Spark: UpdateScriptParams: JSON serialization error #351

Closed
jakrol opened this issue Jan 7, 2015 · 4 comments

jakrol commented Jan 7, 2015

I'm using Spark and JavaEsSpark.saveJsonToEs for an upsert to Elasticsearch.
When I try to pass two parameters of type String, only the second one ends up quoted in the generated request. It's quite hard to explain, so maybe a code snippet will clarify things:

...
Map<String, String> settings = new HashMap<String, String>();
settings.put(ConfigurationOptions.ES_RESOURCE_WRITE, "test/pojo");
settings.put(ConfigurationOptions.ES_WRITE_OPERATION, ConfigurationOptions.ES_OPERATION_UPSERT);
settings.put(ConfigurationOptions.ES_UPDATE_SCRIPT, "ctx._source.s1=s1;ctx._source.s1=s1;");
// each entry maps a script param name to a document field (param:field)
settings.put(ConfigurationOptions.ES_UPDATE_SCRIPT_PARAMS, "s1:s1,s2:s2");
settings.put(ConfigurationOptions.ES_MAPPING_ID, "id");

// pojo.toString() generates {"id":1,"s1":"val1","s2":"val2"}
JavaRDD<String> json = sc.parallelize(Arrays.asList(pojo)).map(p -> p.toString());
JavaEsSpark.saveJsonToEs(json, settings);

The execution results in:

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: JsonParseException[Unrecognized token 'val1': was expecting ('true', 'false' or 'null')
 at [Source: [B@6055dec5; line: 1, column: 22]]; fragment[e":{"_id":"1"}}
{"pa]
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:322)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:299)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:149)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:199)
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:223)
    at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
    at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
    at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Wireshark shows the request body:

{"update":{"_id":"1"}}
{"params":{"s1":val1,"s2":"val2"},"script":"ctx._source.s1=s1;ctx._source.s1=s1;","upsert":{"id":1,"s1":"val1","s2":"val2"}}

Note that the first param value (val1) is emitted without quotes, so the params line is not valid JSON; that is exactly what the JsonParseException above complains about.

I'm using Elasticsearch 1.4.2 and have tried both elasticsearch-hadoop 2.1.0.Beta3 and the latest SNAPSHOT.

Any ideas / workarounds are appreciated.

jakrol commented Jan 9, 2015

I did some source-code digging and finally was able to fix it (at least for my use cases). What I did:

  • ParsingUtils: values() returns List<Object> instead of List<String>
  • AbstractDefaultParamsExtractor: access modifiers for some properties changed from private to protected
  • UpdateScriptParamsExtractor added
  • JsonFieldExtractors: uses UpdateScriptParamsExtractor instead of an anonymous AbstractDefaultParamsExtractor instance
  • AbstractBulkFactory.FieldWriter: write(Object object): list handling changed so that all values are written using doWrite(value, false) instead of doWrite(value, true) for the last list entry (see the sketch below)

It has not been thoroughly tested! You can find the changes at https://github.com/jakrol/elasticsearch-hadoop
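To make the last bullet concrete, here is a minimal sketch of the changed list handling. Only write(Object) and doWrite(...) are named above; the wrapper class, the exact doWrite signature and the meaning of its boolean flag are assumptions:

import java.util.List;

// Sketch only: a stand-in for the list handling in AbstractBulkFactory.FieldWriter
abstract class FieldWriterSketch {
    void write(Object object) {
        if (object instanceof List) {
            for (Object value : (List<?>) object) {
                // before the fix, the last entry went through doWrite(value, true);
                // now every entry takes the same doWrite(value, false) path
                doWrite(value, false);
            }
        } else {
            doWrite(object, false);
        }
    }

    // assumed signature, based on the calls named in the list above
    abstract void doWrite(Object value, boolean flag);
}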

costin commented Jan 9, 2015

Hi @jakrol,

Thanks a lot for the bug report and for trying to fix the issue. I've posted a different type of fix in master (I've reworked the JSON templating engine a bit) to address not only your issue but also to handle JSON values better (in particular, to not quote null, boolean and number types; see the sketch below).
It would be great if you could try the latest dev snapshot (already published to Maven) and report back.
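The gist of the quoting rule, sketched with illustrative names (not the actual es-hadoop internals):

// Hypothetical sketch of type-aware quoting: null, Boolean and Number values
// are written as raw JSON literals, everything else is quoted as a string.
class JsonValues {
    static String asJsonValue(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Boolean || value instanceof Number) {
            return value.toString();
        }
        // the real code would also escape quotes and backslashes in the value
        return "\"" + value + "\"";
    }
}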

P.S. Again, thanks for working on a fix. I hope to see more in the future 👍
A couple of remarks if I may: first, consider using pull requests (see the contributing guidelines that appear when you open one); second, configure your editor to format only the parts that you modified or added. Eclipse supports this, and I suspect IntelliJ and the rest of the Java IDEs do as well.
This significantly reduces the noise; otherwise a single modified character can reformat an entire file and make the diff huge and barely readable.

Cheers,

jakrol commented Jan 12, 2015

Hi @costin,

thanks for your response. I just tried 2.1.0.BUILD-SNAPSHOT and everything works fine.

Regards,

jakrol closed this as completed Jan 12, 2015

costin commented Jan 12, 2015

Great! Thanks again for everything!
