New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc values for binary field #5669
Conversation
I think it is a great idea to use doc values for this use-case indeed! Your patch looks like a good start, I think there are two things to do before being able to merge it in:
The issue with binary doc values is that they only support one value per document. So you need to encode all values of the same document into a single field instance. You can look at
Field data support is going to be useful in order to get those values at search time. For example, you can look at |
@jpountz I've updated the PR. The default field data format for binary data is |
byte[] value = new byte[length]; | ||
System.arraycopy(bytes.bytes, bytes.offset + offset + 4, value, 0, length); | ||
offset += 4 + length; | ||
return new BytesRef(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating a new BytesRef on every call to nextValue
, it should modify in-place the scratch
member that it inherits from BytesValues
(potentially resizing the array if it is too small).
This makes sense. I left comments to improve the PR but it already looks really great! |
byte[] bytes = new byte[size]; | ||
for (int i = 0; i < size; i ++) { | ||
bytes[i] = randomByte(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use getRandom().nextBytes(bytes)
here
@jpountz I've updated the PR as per your comment |
private int compare(byte[] left, byte[] right) { | ||
for (int i = 0, j = 0; i < left.length && j < right.length; i++, j++) { | ||
int a = left[i]; | ||
int b = right[j]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should do an unsigned comparison:
int a = left[i] & 0xFF;
int b = right[i] & 0xFF;
Thanks @kzwang , I just gave it another look! It's getting close! |
@jpountz updated PR with those changes |
Merged, thanks @kzwang ! |
@kzwang Noticed this PR. One question is why do we need to store data in doc values, instead of just putting them in memory (e.g. like an encoded string). If all data is in memory, access speed should be much faster than doc value who sits on disk. Thanks! |
Storing data on disk does not necessarily mean that access will be slower thanks to the filesystem cache. |
add doc values support for binary field
in my image plugin, the score function need to get the binary field for all documents, using doc values will be much faster than using store field