Add support for Byte and BytesRef to the XContentBuilder #6127
Although the change with |
Afaik BytesRef are only used inside Lucene and there BytesRef are UTF-8 encoded. I don't think that there is a use case for BytesRef with bytes that aren't UTF-8 encoded. |
On Mon, May 12, 2014 at 11:38 AM, Mathias Fußenegger <
Nik |
When indexing text with Lucene, terms will indeed be indexed as UTF-8, but you could as well index something that is not text, and in that case it could be anything. For example, binary fields can have doc values today, and the terms are binary. Similarly, the index terms for numeric fields are not UTF-8.
If you know that the |
thanks for the feedback. I've updated the PR to just add the Byte type handling to the XContentBuilder. If the PR is otherwise okay I'll squash the commits and reword the commit message. And I'll try to see if the |
One more thing. The XContentBuilder already has a method
Isn't that also wrong? |
This might be dangerous indeed if not called with bytes which are encoded in UTF-8. |
Maybe one way to make this less trappy and to address your needs would be to make the method name explicit about the fact that it expects the argument to be UTF-8 bytes. For example, the method could be called |
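To illustrate why the UTF-8 assumption is trappy, here is a minimal sketch using only the JDK. The class and method names (`Utf8AssumptionSketch`, `survivesUtf8RoundTrip`, `isValidUtf8`) are made up for this example and are not part of Elasticsearch or Lucene. It shows that decoding arbitrary bytes as UTF-8 silently replaces malformed input, so the original bytes cannot be recovered afterwards.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8AssumptionSketch {

    // Decodes the bytes as UTF-8 and encodes them again.
    // new String(..., UTF_8) replaces malformed input with U+FFFD,
    // so non-UTF-8 input does not come back unchanged.
    static boolean survivesUtf8RoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    // Strict check: a decoder configured with REPORT throws on
    // malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE}; // not a valid UTF-8 sequence

        System.out.println(survivesUtf8RoundTrip(text));   // true
        System.out.println(survivesUtf8RoundTrip(binary)); // false
        System.out.println(isValidUtf8(binary));           // false
    }
}
```

A method whose name states the UTF-8 expectation at least makes this contract visible to the caller, even if it does not validate the input.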
They are UTF-8, but we convert them to UTF-32, which is then an IntsRef. I think that is what you are referring to.
I agree this is dangerous, and I think we should just name the method accordingly, as @jpountz suggested. But I guess we treat a lot of stuff in Elasticsearch as UTF-8, so I wonder if there are more places? |
I think it's not too bad in general. For example the method that @mfussenegger pointed out is only used by aggregations on string terms. There is at least one issue open related to the serialization of search responses: #6077: when sorting everything works fine even if you're not working with UTF-8 bytes until the serialization of the final response where it assumes that bytes are UTF-8 encoded. |
What does this now mean for this PR? Should I also change the |
@mfussenegger In my opinion, |
I've added the utf8Field method and deprecated the other one. I also rebased it against current master. Unfortunately, this doesn't quite solve my use case, as I use field(Map..) with BytesRef inside the Map, but I can work around that for now. Is there anything else stopping this PR from getting merged? |
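The rename-and-deprecate pattern described here could look roughly like the sketch below. `TinyJsonBuilder` is a made-up stand-in, not the real XContentBuilder, and its signatures are assumptions for illustration; only the `utf8Field` name comes from the discussion. It also skips JSON escaping to stay short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified builder; NOT the real XContentBuilder API.
final class TinyJsonBuilder {
    private final StringBuilder out = new StringBuilder("{");
    private boolean first = true;

    /**
     * @deprecated the name hides the fact that the bytes must be UTF-8;
     *             use {@link #utf8Field(String, byte[])} instead.
     */
    @Deprecated
    TinyJsonBuilder field(String name, byte[] utf8Bytes) {
        return utf8Field(name, utf8Bytes);
    }

    // Explicit name: the caller promises that the bytes are UTF-8 encoded text.
    // No JSON escaping is done here, to keep the sketch small.
    TinyJsonBuilder utf8Field(String name, byte[] utf8Bytes) {
        if (!first) {
            out.append(',');
        }
        first = false;
        out.append('"').append(name).append("\":\"")
           .append(new String(utf8Bytes, StandardCharsets.UTF_8))
           .append('"');
        return this;
    }

    String build() {
        return out + "}";
    }

    public static void main(String[] args) {
        byte[] term = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(new TinyJsonBuilder().utf8Field("term", term).build());
    }
}
```

Keeping the old method as a thin deprecated delegate lets existing callers compile while steering new code toward the explicit name.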
The commit looks great. Could you please just move the |
Of course.. I've updated the PR as suggested. |
@mfussenegger Merged, thanks! |
Hi all,
Currently the XContentBuilder calls toString() on an object as a fallback if its type isn't known.
This causes Byte instances to be written as a String instead of staying a number, and BytesRef values end up as some kind of hex representation, which is wrong.
We could of course work around this by converting the values beforehand, but for performance reasons it would be nice to have the XContentBuilder handle these cases correctly.
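The Byte half of the problem can be reproduced with a small self-contained sketch. The class below is a made-up stand-in for the builder's value dispatch (the names `FallbackSketch`, `writeWithFallback`, and `writeWithByteSupport` are assumptions, not the actual XContentBuilder code); it only shows how a toString() fallback turns a number into a quoted string, and how an explicit Byte branch avoids that.

```java
// Hypothetical stand-in for a builder's value dispatch; illustration only.
final class FallbackSketch {

    // Known numeric types are written as bare JSON numbers; anything else
    // falls back to toString() and is written as a quoted JSON string.
    static String writeWithFallback(Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return value.toString();
        }
        return "\"" + value + "\"";
    }

    // Same dispatch with an explicit branch for Byte, so it stays a number.
    static String writeWithByteSupport(Object value) {
        if (value instanceof Byte) {
            return Byte.toString((Byte) value);
        }
        return writeWithFallback(value);
    }

    public static void main(String[] args) {
        Byte b = Byte.valueOf((byte) 42);
        System.out.println(writeWithFallback(b));    // prints "42" (quoted string)
        System.out.println(writeWithByteSupport(b)); // prints 42 (number)
    }
}
```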