JSON serialization error #311
Comments
Also this happens, although I cannot find the records that cause it:
Btw, I use …
After some investigation, it appears that without … . Is it possible to add an ability to write …?
I have a similar problem while using Pig and Elasticsearch. Indexing works just fine up to 45M records, but somehow it breaks after that. I tried without 'es.mapping.id=key' as suggested and the execution went just fine. The exception is:
I need to have 'es.mapping.id=key' as part of my Hadoop job, so this is still an open issue for me.
Just to add more to this: for some time I thought that my dataset (which is pretty huge) might contain some wrong characters or something that would not qualify as an _id, but I bulk indexed the data from a standalone Java application and it worked fine without complaining.
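For readers unfamiliar with the setting under discussion, here is a minimal Scala sketch of what `es.mapping.id` does, using the Spark API from the original report rather than Pig. The index name, host, and field names are illustrative assumptions, not taken from anyone's job in this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Minimal sketch of what es.mapping.id does. Index, host and field
// names are illustrative assumptions.
object MappingIdSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-mapping-id-sketch")
      .set("es.nodes", "localhost:9200") // assumed Elasticsearch endpoint

    val sc = new SparkContext(conf)

    val docs = sc.makeRDD(Seq(
      Map("key" -> "doc-1", "body" -> "first document"),
      Map("key" -> "doc-2", "body" -> "second document")
    ))

    // With es.mapping.id, es-hadoop sends each document's "key" field as its
    // _id; dropping the setting (as tried above) lets ES auto-generate ids.
    docs.saveToEs("test/docs", Map("es.mapping.id" -> "key"))

    sc.stop()
  }
}
```

This explains why removing the setting made the jobs pass: with auto-generated ids, the raw field values never had to travel in the bulk request's metadata, so any problematic characters in them mattered less.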
When constructing documents with fields from JSON data, properly escape characters (according to the JSON spec) to avoid invalid documents. Relates #311
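As a rough illustration of the kind of escaping that commit describes (this is not the actual es-hadoop implementation): per the JSON spec, `"`, `\`, and every control character below U+0020 must be escaped inside string values.

```scala
// Rough illustration of JSON-spec escaping, not the actual es-hadoop code:
// '"', '\' and all control characters below U+0020 must be escaped
// inside string values.
object JsonEscapeSketch {
  def escape(raw: String): String = {
    val sb = new StringBuilder(raw.length)
    raw.foreach {
      case '"'          => sb.append("\\\"")
      case '\\'         => sb.append("\\\\")
      case '\b'         => sb.append("\\b")
      case '\f'         => sb.append("\\f")
      case '\n'         => sb.append("\\n")
      case '\r'         => sb.append("\\r")
      case '\t'         => sb.append("\\t")
      case c if c < ' ' => sb.append(f"\\u${c.toInt}%04x") // e.g. \u0001
      case c            => sb.append(c)
    }
    sb.toString()
  }

  def main(args: Array[String]): Unit = {
    // A raw newline or 0x01 byte inside a field value would otherwise
    // produce an invalid document, the failure seen in this thread.
    println(escape("line1\nline2\u0001"))
  }
}
```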
We have been having the same issues with Hive 0.13. Jobs fail with errors similar to those mentioned in this thread: either an illegal control character (which is not present in our file) or an unexpected control character. I compiled the jar from master after the fix.
@bobbych do you have a sample that I could use to reproduce the error? If you could post it in a gist or on Dropbox somewhere (potentially zipped, to make sure the special chars are not replaced), that would help a lot. Thanks!
@costin I managed to avoid the problem by hashing the id field.
Sorry to hear that (ideally it shouldn't have been a problem in the first place). P.S. A hashing function is a good idea anyway to 'slim' the id field; however, pay attention to its collision rate.
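A minimal sketch of that workaround, deriving a fixed-size, JSON-safe _id by hashing the raw key. The field name and the choice of SHA-1 are assumptions for illustration:

```scala
import java.security.MessageDigest

// Sketch of the workaround mentioned above: derive a fixed-size, JSON-safe
// _id by hashing the raw key. Field name and algorithm are assumptions.
object HashedIdSketch {
  def hashedId(rawKey: String): String =
    MessageDigest.getInstance("SHA-1")
      .digest(rawKey.getBytes("UTF-8"))
      .map(b => f"$b%02x") // hex-encode each byte
      .mkString

  def main(args: Array[String]): Unit = {
    // Whatever bytes the raw key contains, the _id is now 40 hex characters.
    println(hashedId("some\u0001weird\nkey"))
  }
}
```

With a 160-bit hash, accidental collisions are negligible even at tens of millions of records, but note that a collision silently overwrites a document, which is the caveat raised above.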
@costin I think these strange CTRL characters are being added at the network level. When using the ES client directly from Hive, the job fails, but when I set up a proxy (nginx) and route traffic via the proxy, the job runs successfully. I'm not 100% sure what is going on, but I think the issue is related to the network, especially AWS enhanced networking. Our Hadoop cluster has enhanced networking while ES doesn't.
@costin I tried the 2.1.0.BUILD-SNAPSHOT version you mentioned on Dec 13. Unfortunately, it did not work for me. If there has been a change since then, I will try again this week.
@sdubey I'm not sure how you obtained the version, but tools like Maven and/or Gradle automatically download the latest snapshot, so if you use them, running an update should be enough (you should see the new jar being downloaded in the console, along with its suffix, which is typically a timestamp; additionally, when you run es-hadoop, the console shows a git hash that is useful for identifying the exact source it was compiled from). Thanks!
Though it is not clear whether this issue has been fixed, as there hasn't been any update I'm closing it down. If the problem persists, please try the latest beta (currently Beta4) and, if it is not fixed, create a new issue, potentially linking to this one. Thanks!
I have some docs that look like this:
They are actually garbage that some malware generated, but that's not the point.
The point is that es-hadoop on top of Spark + Mesos breaks when I index these docs:
Indexing them one by one with curl does not produce any errors:
There is probably some error in the bulk indexing code.
The errors look like this:
and Spark hangs after that (it should exit, I suppose).
This also reminds me of #217, because it is very hard to tell what is wrong by looking at the error messages.
I'm on Spark 1.1.0 and es-hadoop 2.1.Beta2.
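For what it's worth, these symptoms line up with the escaping fix referenced earlier in the thread: the _bulk API is newline-delimited, one action line followed by one source line, so a control character copied verbatim into the request body breaks the action/source framing. A minimal sketch of the failure mode; the index, type, and field names are made up:

```scala
// Sketch of why an unescaped control character corrupts a _bulk request:
// the bulk API is newline-delimited, one action line then one source line.
// Index, type and field names here are made up for illustration.
object BulkFramingSketch {
  def main(args: Array[String]): Unit = {
    val badValue = "line1\nline2" // raw newline copied verbatim, not "\\n"

    val bulkBody =
      "{\"index\":{\"_index\":\"test\",\"_type\":\"docs\",\"_id\":\"1\"}}\n" +
      "{\"field\":\"" + badValue + "\"}\n"

    // The source document now spans two physical lines, so the bulk parser
    // sees a truncated JSON object followed by a stray, unparsable line.
    bulkBody.split("\n").zipWithIndex.foreach { case (line, i) =>
      println(s"bulk line $i: $line")
    }
  }
}
```

This would also explain why indexing the same docs one by one worked: a single-document request has no line-based framing to corrupt, and a pre-escaped JSON file sent via curl never hits the serialization path at all.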