Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc values for binary field #5669

Closed
wants to merge 1 commit into from

Conversation

kzwang
Copy link
Contributor

@kzwang kzwang commented Apr 2, 2014

add doc values support for binary field

in my image plugin, the score function need to get the binary field for all documents, using doc values will be much faster than using store field

@jpountz jpountz self-assigned this Apr 4, 2014
@jpountz
Copy link
Contributor

jpountz commented Apr 4, 2014

I think it is a great idea to use doc values for this use-case indeed!

Your patch looks like a good start, I think there are two things to do before being able to merge it in:

  1. support multiple values per document
  2. add field data support

The issue with binary doc values is that they only support one value per document. So you need to encode all values of the same document into a single field instance. You can look at NumberFieldMapper.CustomLongNumericDocValuesField to see how we do it for numeric fields. Very likely, field values should be encoded using length encoding:

[length1(vInt), value1 (bytes), length2(vInt), value2(bytes), ... ]

Field data support is going to be useful in order to get those values at search time. For example, you can look at BinaryDVNumericIndexFieldData and BinaryDVNumericAtomicFieldData that implement numeric field data on top of binary doc values. One major difference though is that you would need to implement BytesValues instead of Long or DoubleValues.

@kzwang
Copy link
Contributor Author

kzwang commented Apr 5, 2014

@jpountz I've updated the PR.
The values are encoded as [total num of values, length1, value1, length2, value2 ...]

The default field data format for binary data is disabled as I think normally it won't be used for sorting, faceting etc.

byte[] value = new byte[length];
System.arraycopy(bytes.bytes, bytes.offset + offset + 4, value, 0, length);
offset += 4 + length;
return new BytesRef(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a new BytesRef on every call to nextValue, it should modify in-place the scratch member that it inherits from BytesValues (potentially resizing the array if it is too small).

@jpountz
Copy link
Contributor

jpountz commented Apr 5, 2014

The values are encoded as [total num of values, length1, value1, length2, value2 ...]
The default field data format for binary data is disabled as I think normally it won't be used for sorting, faceting etc.

This makes sense. I left comments to improve the PR but it already looks really great!

byte[] bytes = new byte[size];
for (int i = 0; i < size; i ++) {
bytes[i] = randomByte();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use getRandom().nextBytes(bytes) here

@kzwang
Copy link
Contributor Author

kzwang commented Apr 5, 2014

@jpountz I've updated the PR as per your comment

private int compare(byte[] left, byte[] right) {
for (int i = 0, j = 0; i < left.length && j < right.length; i++, j++) {
int a = left[i];
int b = right[j];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should do an unsigned comparison:

int a = left[i] & 0xFF;
int b = right[i] & 0xFF;

@jpountz
Copy link
Contributor

jpountz commented Apr 5, 2014

Thanks @kzwang , I just gave it another look! It's getting close!

@kzwang
Copy link
Contributor Author

kzwang commented Apr 7, 2014

@jpountz updated PR with those changes

@jpountz jpountz closed this in 866c520 Apr 7, 2014
jpountz pushed a commit that referenced this pull request Apr 7, 2014
@jpountz
Copy link
Contributor

jpountz commented Apr 7, 2014

Merged, thanks @kzwang !

@kzwang kzwang deleted the feature/binary_doc_value branch April 7, 2014 08:26
@clintongormley clintongormley added the :Search/Mapping Index mappings, including merging and defining field types label Jun 6, 2015
@clintongormley clintongormley changed the title add doc value for binary field Add doc values for binary field Jun 6, 2015
@YunchaoGongSC
Copy link

@kzwang Noticed this PR. One question is why do we need to store data in doc values, instead of just putting them in memory (e.g. like an encoded string). If all data is in memory, access speed should be much faster than doc value who sits on disk. Thanks!

@jpountz
Copy link
Contributor

jpountz commented Feb 8, 2016

Storing data on disk does not necessarily mean that access will be slower thanks to the filesystem cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Mapping Index mappings, including merging and defining field types v1.2.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants