Utilize SIMD to find expected end of token #499
Comments
I think it may be possible to find the closing double quote or escape character of a JSON string this way. JSON numbers, however, seem much harder: the syntax is more complicated, and RapidJSON currently needs to determine whether the value is a 32/64-bit signed/unsigned integer or a double during the single parsing pass. |
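To make the point about numbers concrete, here is a simplified sketch of classifying a number token while scanning it once. The names `NumberKind` and `ClassifyNumber` are hypothetical and this is not RapidJSON's code; it also skips full validation of fractions, exponents and leading zeros, which a real parser would need.

```cpp
#include <cstdint>

// Hypothetical categories for illustration only.
enum class NumberKind { Int, Uint, Int64, Uint64, Double, Invalid };

// Classifies the number starting at p in a single pass over its digits.
// Simplification: anything with '.', 'e' or 'E' after the integer part is
// reported as Double without validating the rest of the token.
inline NumberKind ClassifyNumber(const char* p, const char* end) {
    bool negative = false;
    if (p != end && *p == '-') { negative = true; ++p; }
    if (p == end || *p < '0' || *p > '9') return NumberKind::Invalid;

    std::uint64_t value = 0;
    bool overflow = false;
    for (; p != end && *p >= '0' && *p <= '9'; ++p) {
        const unsigned digit = static_cast<unsigned>(*p - '0');
        // value * 10 + digit would exceed UINT64_MAX
        if (value > (UINT64_MAX - digit) / 10) overflow = true;
        else value = value * 10 + digit;
    }
    if (p != end && (*p == '.' || *p == 'e' || *p == 'E')) return NumberKind::Double;
    if (overflow) return NumberKind::Double;  // too large for any 64-bit integer

    if (negative) {
        if (value <= 2147483648ull) return NumberKind::Int;             // >= INT32_MIN
        if (value <= 9223372036854775808ull) return NumberKind::Int64;  // >= INT64_MIN
        return NumberKind::Double;
    }
    if (value <= 2147483647ull) return NumberKind::Int;
    if (value <= 4294967295ull) return NumberKind::Uint;
    if (value <= 9223372036854775807ull) return NumberKind::Int64;
    return NumberKind::Uint64;
}
```

The category depends on the accumulated value, not only on where the token ends, which is why knowing the end position alone does not remove the per-digit work.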
With any data type that is not a string, you need to be able to find either ',', a newline, '}' or ']', unless I'm missing something. |
I think the types under consideration should be JSON string and number only. I think knowing where the token should end does not help much. For number, you may check https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/reader.h#L824 and see how it validates the input, and at the same time detects the actual data type and decodes it. It does not (need to) check for |
If we know where the number ends couldn't we use what you wrote in https://github.com/miloyip/itoa-benchmark/blob/master/src/sse2.cpp |
|
Oh right. My bad. |
It may be possible to use SIMD for scanning the same kind of characters, such as |
So let's focus on strings for now. I'll open issues for other data types as I come up with algorithms that may or may not fit for parsing them in parallel.
One concern is malformed JSON such as:
{
"k": "v, # no quotation mark
"k2": "v2"
}
This can cause erroneous parsing if we decide that the quotation mark that starts the "k2" key is the end of the string value "v,". |
The malformed JSON you quoted is invalid at To optimize string parsing (https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/reader.h#L737), I think it can scan the first appearance of the characters |
I know we can find the start of the string but how do we find the end of the string? |
I've committed the optimizations, which use SSE2 instructions to scan (and copy) unescaped characters during string parsing. Unfortunately I mistakenly committed them to master directly instead of making a pull request. Anyway, here are the results.
The likely/unlikely and unsafe push are also committed. This should improve performance in #275 . @xpol can try verifying it with |
Well done! Can you link to the code? |
The scan/copy thing is at https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/reader.h#L796 |
Couldn't we refactor the code a bit to repeat less? |
Regarding numbers, I have something in mind that might work. It's still a bit raw though.
Unfortunately this requires the users to have CPUs that support AVX2 but this should speed up parsing. |
Because of the page-boundary issue, SIMD loads need to be aligned. I think that will be quite difficult to arrange for numbers. You may try to do a simple |
As for true/false/null we can employ a slightly different method from picohttpparser's findchar_first().
In this case I'm not sure it's worth the trouble but I figured I should mention it anyway. |
As I said, because of the page-boundary issue, only aligned data can be accessed. The preceding unaligned bytes need to be processed in a non-SIMD way. In other words, only data of 16 bytes or more benefits from SIMD, so true, false and null should not be able to gain a performance boost from it. |
In the SIMD functions |
In an early version of RapidJSON, an issue reported that the In Intel® 64 and IA-32 Architectures Optimization Reference Manual
This is not feasible, as RapidJSON should not enforce such a requirement. To fix this issue, the routine currently processes bytes one at a time up to the next aligned address. After that, it uses aligned reads to perform SIMD processing. Also see #85. |
I'm closing this issue now. If @thedrow has new ideas on optimization, please open a new issue. You may also do experiments on your fork and we can discuss there as well. |
Thanks :) 👍 |
picohttpparser uses SIMD to scan a string for a given character.
We can use it or something similar to be able to determine the end of the token.
Say we have the following JSON:
Instead of scanning each character, we can find the boundaries of the integer value (that is, ':' and ','), which lets us take the existing buffer sliced from start to end and convert that slice to an integer.
There can be multiple characters to look ahead for, such as a newline or ", depending on the type being parsed and on the contents of the JSON file.
If we'd be able to optimize for the common cases I think it can be useful.