Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Byte and BytesRef to the XContentBuilder #6127

Closed
wants to merge 1 commit into from
Closed

Add support for Byte and BytesRef to the XContentBuilder #6127

wants to merge 1 commit into from

Conversation

mfussenegger
Copy link
Contributor

Hi all,

currently the XContentBuilder will call .ToString() on the object as fallback if it isn't known.
This will cause Byte instances to become a String instead of staying a number and BytesRef have some kind of hex representation which is wrong.

We could work around this if course by converting the values beforehand, but for performance reasons it would be nice to have the XContentBuilder handling the cases correctly.

@jpountz
Copy link
Contributor

jpountz commented May 12, 2014

Although the change with Byte looks good to me, I don't think XContentBuilder should expect all BytesRef's bytes to be UTF-8 bytes?

@mfussenegger
Copy link
Contributor Author

Afaik BytesRef are only used inside Lucene and there BytesRef are UTF-8 encoded.
Our use case is that we get these values directly from lucene and we want to keep working with them without doing the utf16/utf8 encoding step again until really needed.

I don't think that there is a use case for BytesRef with bytes that aren't UTF-8 encoded.

@nik9000
Copy link
Member

nik9000 commented May 12, 2014

On Mon, May 12, 2014 at 11:38 AM, Mathias Fußenegger <
notifications@github.com> wrote:

Afaik BytesRef are only used inside Lucene and there BytesRef are UTF-8
encoded.
Our use case is that we get these values directly from lucene and we want
to keep working with them without doing the utf16/utf8 encoding step again
until really needed.

I don't think that there is a use case for BytesRef with bytes that aren't
UTF-8 encoded.

The BytesRefs used by the fuzzy terms enum are utf-32 I believe. Its
reasonably rare that you'd encounter one but it happens.

Nik

@jpountz
Copy link
Contributor

jpountz commented May 13, 2014

Afaik BytesRef are only used inside Lucene and there BytesRef are UTF-8 encoded.

When indexing text with Lucene, terms will indeed be indexed as UTF-8, but you could as well index something that is not text, and in that case it could be anything. For example, binary fields can have doc values today, and the terms are binary. Similarly, the index terms for numeric fields are not UTF-8.

Our use case is that we get these values directly from lucene and we want to keep working with them without doing the utf16/utf8 encoding step again until really needed.

If you know that the BytesRef contains UTF-8 bytes, something that you can do would be to wrap the term bytes into an org.elasticsearch.common.Text instance. Elasticsearch will then know that it contains text and that the bytes it gets are encoded in UTF-8.

@mfussenegger
Copy link
Contributor Author

thanks for the feedback.

I've updated the PR to just add the Byte type handling to the XContentBuilder. If the PR is otherwise okay I'll squash the commits and reword the commit message.

And I'll try to see if the org.elasticsearch.common.Text class is usable for us.

@mfussenegger
Copy link
Contributor Author

One more thing. The XContentBuilder already has a method

public XContentBuilder field(XContentBuilderString name, BytesRef value) throws IOException {
    field(name);
    generator.writeUTF8String(value.bytes, value.offset, value.length);
    return this;
}

Isn't that also wrong?

@jpountz
Copy link
Contributor

jpountz commented May 13, 2014

This might be dangerous indeed if not called with bytes which are encoded in UTF-8.

@jpountz jpountz self-assigned this May 13, 2014
@jpountz
Copy link
Contributor

jpountz commented May 13, 2014

Maybe one way to make this less trappy and to address your needs would be to make the method name explicit about the fact that it expects the argument to be UTF-8 bytes. For example, the method could be called utf8Field (in the same way that there are some rawField methods)?

@s1monw
Copy link
Contributor

s1monw commented May 13, 2014

@nik9000

The BytesRefs used by the fuzzy terms enum are utf-32 I believe. Its reasonably rare that you'd encounter one but it happens.

They are UTF-8 but we convert them to UTF-32 but that is then and IntsRef I think that is what you are referring to.

Maybe one way to make this less trappy and to address your needs would be to make the method name explicit about the fact that it expects the argument to be UTF-8 bytes. For example, the method could be called utf8Field (in the same way that there are some rawField methods)?

I agree this is dangerous and I think we should just name the method accordingly as @jpountz suggested. But I guess we tread a lot of stuff in Elasticsearch as UTF-8 so I wonder if there are more places?

@jpountz
Copy link
Contributor

jpountz commented May 13, 2014

I think it's not too bad in general. For example the method that @mfussenegger pointed out is only used by aggregations on string terms. There is at least one issue open related to the serialization of search responses: #6077: when sorting everything works fine even if you're not working with UTF-8 bytes until the serialization of the final response where it assumes that bytes are UTF-8 encoded.

@mfussenegger
Copy link
Contributor Author

What does this now mean for this PR? Should I also change the field(XContentBuilderString name, BytesRef value) method name to something else or can the Byte change already get merged?

@jpountz
Copy link
Contributor

jpountz commented May 20, 2014

@mfussenegger In my opinion, field(XContentBuilderString name, BytesRef value) should be marked with @Deprecated and replaced with utf8Field(XContentBuilderString name, BytesRef utf8Bytes) that does exactly the same thing but makes it explicit that the bytes of the input will be interpreted as utf-8 bytes. Regarding the change that you initially proposed, the Byte change is good and I think we could have utf8Field(String name, BytesRef utf8Bytes) to address your concern to not keep converting back and forth between utf-8 and utf-16?

@mfussenegger
Copy link
Contributor Author

I've added the utf8Field method and deprecated the other one. I also rebased it against current master.

Unfortunately my use case isn't quite solved with this as I use field(Map..) with BytesRef inside the Map. But for the moment that is okay I can work around that for now.

Is there anything else stopping this PR from getting merged?

@jpountz
Copy link
Contributor

jpountz commented May 22, 2014

The commit looks great. Could you please just move the Byte check above with the other numeric wrapper classes (Double, Short, etc.) and use the same way to check the class of the value (type == Byte.class)?

@mfussenegger
Copy link
Contributor Author

Of course.. I've updated the PR as suggested.

@jpountz jpountz closed this in 82e9a4e May 28, 2014
@jpountz
Copy link
Contributor

jpountz commented May 28, 2014

@mfussenegger Merged, thanks!

@jpountz jpountz changed the title add support for Byte and BytesRef to the XContentBuilder Serialization: Add support for Byte and BytesRef to the XContentBuilder May 28, 2014
@clintongormley clintongormley changed the title Serialization: Add support for Byte and BytesRef to the XContentBuilder Internal: Add support for Byte and BytesRef to the XContentBuilder Jul 16, 2014
@clintongormley clintongormley changed the title Internal: Add support for Byte and BytesRef to the XContentBuilder Add support for Byte and BytesRef to the XContentBuilder Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants