Using binary type and doc_values=true makes doc() lookups completely unusable #14469

Closed
jrots opened this issue Nov 3, 2015 · 3 comments
Labels
discuss :Search/Search Search-related issues that do not fall into other categories

Comments

jrots commented Nov 3, 2015

(ES 2.0)
The documentation states that binary fields support doc values when mapped with "doc_values": true: https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html.
When I try to read such a field in a Java plugin, I always get an unsupported_operation_exception (a java.lang.UnsupportedOperationException thrown from BytesBinaryDVAtomicFieldData, line 104).
You can reproduce it by creating a simple index and adding some documents with binary data:

curl -s -XPUT "http://localhost:9200/testindex/" -d '{
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 0
    },
    "mappings": {
        "test": {
            "_all": {"enabled": false},
            "properties": {
                "qa_data": {"type": "binary", "doc_values": true}
            }
        }
    }
}'

curl -s -XPUT 'http://localhost:9200/testindex/test/3'  -d '{
            "qa_data" : "AB4BNINAgBoQgAQAABQAJIAAEAAAAAAAAAAAACBAAAgAAAAAAIAAABAAgAACIAAQAAAICAAAAAAAgAAgAgAAAAAAAAAAAJAAAHACAAAAAQAAAAAAAAAAQAAAAAAAAAAIEEAAAAAACAAAAAAAAAAAAgIAIAgwICBAAAAAAAAAAAAAAAAAAAAAQAQAAAAgAAAAAAAAIAAAAAAAACAACAAAAAAAAAACAAEAAAEAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAQAAAAAACAAAAAAAAAAAAAAAAAgAAAAAAEAAAAAAAAAAAAHgAAPEgAAAAAAAAAIAAAAAAKAAEABAAAABcAAgAEAAAAHAADAAQAAAAhAAEABAAAACMAAQAEAAAAJAABAAQAAAAvAAIACQADADYAAgAEAAAAOAAEAAQAAAA5AAQADgACAD8AAQABAAIATAACAAQAAABfAAEABAAAAGIAAgAJAAMAZQABAAQAAAByAAEAAQACAHQAAgAJAAIA4wACAAQAAAD2AAEABAAAAP0AAgAEAAABDwACAAQAAAEcAAEAAQACATcAAgAEAAABQwACAAkAAgFLAAEAAwADAWQAAwANAAEBdQADAAQAAAF5AAEABAAAAYUAAQAEAAABlwABAAQAAAH5AAEAAgACAgkAAgADAAICFAABAAEAAgIVAAEAAgACAhYAAgAEAAACLAACAAQAAAIvAAIABAAAAmgAAgAEAAACpgACAAIAAQLWAAQADwACAtwAAwAHAAIC4wACAAMAAgMrAAMAAwACA0MAAQAEAAADTQADAAQAAANZAAMABAAAA2EAAwAEAAADpgADAAMAAQOtAAEAAQABA7UAAgAJAAIDvAABAAIAAwO9AAIABAAABBoAAgAEAAAEJgACAAMAAgRFAAIABAAABH0AAgADAAIEjQACAAMAAgT7AAMABAAABRAABAAPAAIFKAACAAQAAAU5AAQADwACBYcAAQABAAMFzAACAAQAAAYfAAIACQACBooAAgADAAIGtwABAAEAAgcbAAEABAAABx4AAgAEAAAHIgACAAQAAAcjAAMABAAAByQAAQAEAAAHJQADAAQAAAc5AAEABAAABzoAAQAEAAAHOwABAAQAAAc8AAEAAQACB10AAwADAAI="
}'
curl -s -XPUT 'http://localhost:9200/testindex/test/4'  -d '{
             "qa_data" : "AB4CtINAgxqQmBw2ABQAJKFCGAAwkAAAAAEAICBAAIgQAABAAKAAhBAQgCASIAAYAIAKKgAAAgAAgGQgAoAAAABAAAAAAgAAAHACAACAESAAAAQAAAAAQAAAAAgAAAAIEEBAABECCQAAAARAQAEAAgIEIAg0ISDAAAAAABSAAAAAAAQAAAAIQRQAAAIhAAAAAAABoAAAAAAIACAACIAAAAAAgUAiAAEAAAGAgCAAAAAAAAAAAAAAAAAAAIAAAAAABgEQAAgAgACEAgAAAFAAAAgBAIAAgAAAAAAAAAQAAAAAAAAAXgCAPGgAAAAAAAAAADAQAAABAAIACQABAAIAAgAJAAIABAABAAkAAgAFAAIACQABAAoAAgAEAAAACwABAAQAAAAMAAIABAAAABMAAwAKAAIAFAADAA0AAgAXAAIABAAAABwABAAOAAIAHwABAAEAAgAhAAEAAgABACMAAgAEAAAAJAACAAQAAAAoAAIACQACACkAAQAEAAAALwACAAkAAwA2AAIACwADADgABAAOAAIAOQAEAA4AAwA/AAEAAQACAEsAAQACAAIATAACAAIAAQBRAAMABAAAAFYAAQABAAIAWAABAAQAAABdAAEAAgACAF8AAQAEAAAAYgACAAkAAgBlAAEABAAAAHIAAQAEAAAAdAACAAkAAwCFAAIABAAAAJAAAQAJAAEAtAABAAQAAAC3AAMADQACALwAAQAEAAAAvQACAAkAAgDGAAEABAAAANwAAgAJAAEA4wABAAQAAADnAAEACQABAPYAAgAJAAEA/QACAAkAAgEFAAEADQADAQ8AAgACAAIBFAABAAkAAQEcAAIABAAAASIAAgAEAAABJwABAAQAAAE1AAIACQABATcAAQACAAIBQQACAAoAAgFDAAEACQACAUUAAwADAAEBSQADAAoAAQFLAAMAAwACAVcABAADAAEBYwACAAsAAwFkAAMABAAAAXUAAwANAAIBeQABAAQAAAF8AAEACQACAYUAAQAEAAABigABAAIAAgGNAAIABAAAAY4AAQAEAAABlwABAAkAAQGpAAIABAAAAdYAAwAOAAIB9wADAAQAAAH5AAEACQABAgkAAgANAAICFAABAAkAAgIVAAEAAgADAhYAAQAEAAACMQABAAQAAAJKAAEABQADAmUAAQAEAAACaAACAAEAAgJsAAIACgABAncAAgAEAAACgwADAA0AAwKmAAIABAAAAs4AAQACAAIC1gAEAA8AAgLcAAEABwACAuMAAgADAAIDBgACAAoAAQMKAAIABAAAAygAAQAEAAADKwADAAQAAAMxAAEABAAAAzgAAgACAAIDPAABAAEAAgNDAAMAAwABA00AAwADAAADUgADAAQAAANZAAMABAAAA2EAAgAKAAIDcAABAAEAAQN+AAMACwABA6YAAwAJAAEDpwABAAkAAgOtAAIACQABA7AAAwAEAAADtQACAAkAAwO6AAEAAgACA7wAAgACAAIDvQABAAMAAQPKAAIACQABA/cAAQACAAID+gABAAIAAAP8AAEAAQABBAEAAgAEAAAEGgACAAQAAAQcAAIABAAABCAABAAEAAAEJgACAAkAAgQrAAEAAgACBEUAAgAKAAIERwADAAsAAgRIAAEAAgACBHgAAgAJAAIEfQACAAMAAgSNAAIAAwACBJsABAAIAAEExgAEAA8AAQTIAAEAAQABBM8AAwAKAAEE9wACAAIAAQT7AAIABAAABQcAAgADAAIFDwABAAIAAgUQAAQADAACBSgAAgACAAIFOQAEAA8AAgU9AAEABAAABX0AAgAJAAIFhwABAAYAAgXMAAEAAgABBdAAAwAKAAEF2QABAAEAAgXaAAMABAAABhEAAQAEAAAGGgABAAQAAAYfAAIACQACBi8AAwAKAAIGOwAEAA4AAgZHAAEACQACBlAAAgAJAAEGWwACAAIAAQZ0AAEAAgACBnYAAg
ADAAEGtwABAAQAAAb6AAEABQACBxsAAQAEAAAHHQABAAQAAAceAAIACQACByIAAQACAAEHIwACAAMAAgckAAEABAAAByUAAgAEAAAHLwACAAQAAAc5AAEABAAABzoAAgAJAAMHOwACAAQAAAc8AAEAAgACBz4AAQAEAAAHTAABAAIAAQdUAAMABAAAB1UAAgAEAAA="
}'

A simple plugin that just calls doc() or doc().get("qa_data") throws the exception immediately:

    @Override
    public float runAsFloat() {
        float finalScore = 0;
        // The next line throws java.lang.UnsupportedOperationException
        LeafDocLookup doc = doc();
        return finalScore;
    }
Contributor

jpountz commented Nov 3, 2015

I'm curious what your use case is. Until now, the only use case I knew of for doc values on binary fields was an image plugin for Elasticsearch (#5669), where doc values were consumed directly through the plugin, bypassing the scripting layer.

Author

jrots commented Nov 3, 2015

Well, I have an index that contains roughly 100M documents. The indexed data are "persons", and a typical search is:
find users that are around me, but with the sorting done dynamically.
Each "person" has answered x questions with some answers.
I need to calculate a "matchscore" on the fly for the persons I find: basically the overlap of my questions with theirs (plus some additional checks on whether the answers to those questions are the same). The "matchscore" is a percentage between 0 and 100.
A person can answer anywhere from 1 to about 1800 questions.
I store the question data packed in 64-bit longs, so I only need a bitwise operation to find the "matching" questions.
This is pretty fast in my benchmarks.
my questionids from 0 to 64 : 100000000000000000000000000000000000010001000000010000000000001 &
the other person questionids from 0 to 64 : 100000000000000000000000000000000000000001000000010000000000001
result will only contain the matching "questionids"
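
As an illustration of the bitwise overlap described above, here is a minimal sketch, not the plugin's actual code. The class and method names are hypothetical, and the scoring formula (common questions divided by my answered questions) is an assumption, since the thread does not spell out how the percentage is computed or how the answer checks work.

```java
class MatchOverlap {
    /** Counts question ids present in both bitsets; each long packs 64 question ids. */
    static int countCommon(long[] mine, long[] theirs) {
        int n = Math.min(mine.length, theirs.length);
        int common = 0;
        for (int i = 0; i < n; i++) {
            // AND keeps only the bits set on both sides; bitCount counts them.
            common += Long.bitCount(mine[i] & theirs[i]);
        }
        return common;
    }

    /** Assumed formula: share of my answered questions the other person also answered, 0 to 100. */
    static float matchScore(long[] mine, long[] theirs) {
        int answered = 0;
        for (long word : mine) {
            answered += Long.bitCount(word);
        }
        return answered == 0 ? 0f : 100f * countCommon(mine, theirs) / answered;
    }

    public static void main(String[] args) {
        long[] mine = {0b1011L, 0L};
        long[] theirs = {0b0011L, 1L};
        System.out.println(countCommon(mine, theirs)); // prints 2
        System.out.println(matchScore(mine, theirs));
    }
}
```

Long.bitCount avoids looping over individual bits, so the cost is one AND and one population count per 64 questions.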

An array of all the questions I have answered looks like this (encoded as 64-bit integers):

[-8989044006797179904,5629656300523520,0,2323857442082914304,36028797287432192,153122456050075656,8388640,144115188075855872,158329681740288,1099511627776,274877906944,34632368128,8796093022208,8623497224,3467807172325277696,0,274945015808,2305843009213693984,8192,576460752303423488,144116287587549184,0,128,4096,2147483648,0,36028797018964992,0,2161728080043835392,536870912
]

And for another person:

[-8989040706113233866,5629656858499072,3499296910466940960,2323857992107163712,45036563478904864,1306043995025050154,2199031669792,180143985099014144,562949960761856,36047626155590656,274877906952,34632384512,1225551944202847296,4611967502027857928,3756319573209513984,1477180677777523712,9075601440770,2377900603251622304,134225920,612489549322420544,2449959296801276032,2305843009213693952,128,100732928,576601492006502400,22517998271135872,36028797018963968,288230376151711744,6773554836496449536,3149824
]

I first stored the data as a long array in Elasticsearch, but you cannot rely on the order, because doc values are sorted from low to high.
So I created a byte array that I Base64-encode and store in ES as such, and decode afterwards:

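        // Needs java.nio.ByteBuffer, java.nio.ByteOrder, java.nio.LongBuffer and java.util.Base64
        byte[] decodeString = Base64.getDecoder().decode(encodedString);
        ByteBuffer byteBuf = ByteBuffer.wrap(decodeString);
        byteBuf.order(ByteOrder.BIG_ENDIAN); // explicit, although big-endian is the ByteBuffer default
        LongBuffer longBuf = byteBuf.asLongBuffer();
        // numberOfQuestions must not exceed longBuf.remaining(), or get() throws BufferUnderflowException
        long[] questions = new long[numberOfQuestions];
        longBuf.get(questions);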
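
For completeness, the encoding side could look roughly like the sketch below, paired with a matching decoder to show that array order survives the round trip. The class and method names are hypothetical, and the sketch assumes the payload holds nothing but the packed longs, which may not match the actual layout of the qa_data values shown earlier.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Base64;

class QuestionCodec {
    /** Packs the longs big-endian, in array order, and Base64-encodes the bytes. */
    static String encode(long[] questions) {
        ByteBuffer buf = ByteBuffer.allocate(questions.length * Long.BYTES)
                .order(ByteOrder.BIG_ENDIAN);
        buf.asLongBuffer().put(questions);
        return Base64.getEncoder().encodeToString(buf.array());
    }

    /** Reverses encode(); the array length is derived from the payload size. */
    static long[] decode(String encoded) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(encoded))
                .order(ByteOrder.BIG_ENDIAN);
        long[] questions = new long[buf.remaining() / Long.BYTES];
        buf.asLongBuffer().get(questions);
        return questions;
    }

    public static void main(String[] args) {
        long[] original = {-8989044006797179904L, 5629656300523520L, 0L};
        long[] roundTrip = decode(encode(original));
        System.out.println(Arrays.equals(original, roundTrip)); // prints "true"
    }
}
```

Unlike a multi-valued long field, the byte array keeps positions intact, so index i still refers to question ids 64*i to 64*i+63 after decoding.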

For now I found a workaround: store it as a string instead of binary, with:
"qa_data":{"type":"string", "doc_values": true, "index": "no", "store" : "no"}
That seems to work for now.

Contributor

jpountz commented Nov 4, 2015

Thanks for explaining the use-case.

Unrelated to binary doc values, but I wonder whether storing the question ids directly could be a better option in terms of both storage and runtime. You could take ids from the shorter array and then use galloping search to find the common ids in the other array?
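
A rough sketch of that idea, assuming both id arrays are sorted ascending and contain no duplicates. The class and method names are hypothetical and this is not code from Elasticsearch or Lucene.

```java
import java.util.Arrays;

class GallopingIntersect {
    /** Returns the first index >= from whose value is >= target, or a.length if none. */
    static int gallop(int[] a, int from, int target) {
        int n = a.length;
        if (from >= n || a[from] >= target) {
            return from;
        }
        int lo = from; // invariant: a[lo] < target
        int step = 1;
        // Double the step until we overshoot the target or the array end.
        while (lo + step < n && a[lo + step] < target) {
            lo += step;
            step *= 2;
        }
        int hi = Math.min(lo + step, n); // hi == n or a[hi] >= target
        // Binary search inside the bracket (lo, hi].
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi;
    }

    /** Walks the shorter array and gallops through the longer one for each id. */
    static int[] intersect(int[] shorter, int[] longer) {
        int[] out = new int[shorter.length];
        int count = 0;
        int pos = 0;
        for (int id : shorter) {
            pos = gallop(longer, pos, id);
            if (pos == longer.length) {
                break; // every remaining id is larger than anything in the longer array
            }
            if (longer[pos] == id) {
                out[count++] = id;
            }
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        int[] mine = {3, 17, 42, 900};
        int[] theirs = {1, 3, 5, 8, 17, 20, 41, 43, 900, 1500};
        System.out.println(Arrays.toString(intersect(mine, theirs))); // prints [3, 17, 900]
    }
}
```

Galloping pays off when one array is much shorter than the other, because each lookup costs time logarithmic in the distance skipped rather than linear.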

Otherwise I agree that we should either document the limitations with doc values on binary fields or add support so that you can at least use them in scripts.
