
Rewritten parsers - ~3x performance increase #2

Closed

@Salakar wants to merge 8 commits
@Salakar (Member) commented May 5, 2016

Rewrote the JavaScript parser; it averages around 14k ops/s more and parses ~2-3x faster.

Thoughts?

Benchmarks

[benchmark results screenshot]

I also slightly tweaked the hiredis parser: a ~1-2% change, not really noticeable, but every little helps.

@luin @BridgeAR tagging you both in.

@BridgeAR (Member) commented May 5, 2016

@Salakar nice! But the optimization for big chunks is gone.

@Salakar (Member, Author) commented May 5, 2016

Ooh yeah, let me look into that - whoops, too good to be true 😆

Guess it needs some kind of size check to switch between chunked and standard parsing? Not sure of the best way to approach it, tbh.

@BridgeAR (Member) commented May 5, 2016

Yep. If a bulk reply is bigger than the current chunk, cache all following chunks until the bulk reply is complete, then concat all the cached chunks.
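
Something along these lines, maybe (just a rough sketch - the chunkCache / cachedBytes / expectedLength bookkeeping and the parseChunk hook are illustrative names, not the actual parser internals):

// Rough sketch of caching chunks for an oversized bulk reply.
// All names here (chunkCache, cachedBytes, expectedLength, parseChunk) are illustrative.
function handleChunk (parser, chunk) {
  if (parser.expectedLength === 0) {
    // Normal path: the chunk can be parsed directly.
    parseChunk(parser, chunk)
    return
  }
  // A bulk reply bigger than a single chunk is in flight: cache this chunk.
  parser.chunkCache.push(chunk)
  parser.cachedBytes += chunk.length
  if (parser.cachedBytes >= parser.expectedLength) {
    // The bulk reply is complete: concat all cached chunks and parse them as one buffer.
    var full = Buffer.concat(parser.chunkCache, parser.cachedBytes)
    parser.chunkCache = []
    parser.cachedBytes = 0
    parseChunk(parser, full)
  }
}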

@Salakar (Member, Author) commented May 8, 2016

Maybe we could persuade either @mscdex or @andrasq to work their performance wizardry on this / give the v2.0 branch a once-over. @ both - this is the redis response parser that the two main redis clients for node use, based on: http://redis.io/topics/protocol
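
For quick reference, the reply types the parser has to handle look like this on the wire (per the protocol doc linked above):

+OK\r\n                              simple string
-ERR unknown command\r\n             error
:1000\r\n                            integer
$6\r\nfoobar\r\n                     bulk string (length-prefixed)
$-1\r\n                              null bulk string
*2\r\n$3\r\nfoo\r\n$3\r\nbar\r\n     array of two bulk strings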

@BridgeAR referenced this pull request May 8, 2016: v.2.0-proposal #3 (merged)

@mscdex commented May 9, 2016

I decided to take a crack at this this evening (writing one from scratch) for fun. So far the results are promising, but I'm far from finished. There are some more optimizations to be had and type implementations to complete. Example results as the code stands right now:

multiple chunks in a bulk string
================================
old x 50,943 ops/sec ±0.36% (92 runs sampled)
hiredis x 1,075,063 ops/sec ±0.39% (95 runs sampled)
new x 460,955 ops/sec ±0.82% (94 runs sampled)
mscdex x 1,334,433 ops/sec ±1.63% (93 runs sampled)

Fastest is mscdex

4mb bulk string
===============
old x 329 ops/sec ±0.47% (83 runs sampled)
hiredis x 469 ops/sec ±1.05% (79 runs sampled)
new x 327 ops/sec ±0.48% (65 runs sampled)
mscdex x 1,742 ops/sec ±1.30% (85 runs sampled)

Fastest is mscdex

simple string
=============
old x 2,920,941 ops/sec ±0.90% (93 runs sampled)
hiredis x 2,283,245 ops/sec ±0.44% (92 runs sampled)
new x 4,734,593 ops/sec ±0.37% (93 runs sampled)
mscdex x 5,090,149 ops/sec ±0.42% (91 runs sampled)

Fastest is mscdex

integer
=======
old x 2,688,599 ops/sec ±0.69% (88 runs sampled)
hiredis x 2,185,333 ops/sec ±0.24% (94 runs sampled)
new x 6,014,210 ops/sec ±0.48% (92 runs sampled)
mscdex x 19,786,130 ops/sec ±0.38% (92 runs sampled)

Fastest is mscdex

Some initial comments:

  • I suspect that the simple string parsing speed will end up about the same as the "new" parser, because IIRC most of the time is spent converting Buffers to strings, and there is not much that can be done about that outside of node core.
  • I removed the try-catch from the hiredis js interface, since hiredis could also be modified to not throw; removing the try-catch makes the benchmark comparison more realistic IMHO. This removal is reflected in the results above.
  • I found a bug with startBigBuffer: it appends lorem, which causes all of the parsers to fail (due to an incorrect string length). Removing the + lorem at the end fixes this.

@Salakar (Member, Author) commented May 9, 2016

@mscdex this looks impressive! It'll be interesting to see how you achieved this.

As for hiredis - I think we were planning to deprecate support, as in most cases the JS parser is way faster than the native one.

Also, any thoughts on @luin's big-array perf on the v2.0 proposal PR?

@mscdex commented May 9, 2016

@Salakar I actually pulled from the v.2.0 branch, so the "new" parser is the one from #3 as of this writing.

I will post the code once it is feature complete and I'm happy with the optimizations. I'll try to work on it this week.

I have not worked on array, error, or null implementations at all yet.

@andrasq commented May 10, 2016

@Salakar I only got a chance to look at this tonight; I wrote an alternate number parser, then got derailed when I tested with 2^64 - 1.

My approach was to gather the digits numerically into two numeric vars n and m, and return either the numeric value of n if there is no overflow, or the string concatenation of n and m otherwise. @mscdex may have done the same; my results are similar to his (once normalized).

However, when I tested this approach I found that parsing a big number causes the throughput of all the other test runs to drop by a large amount (close to half), and I then spent the rest of the time chasing that glitch. Never did figure out how to avoid it.

old: + integer x 1,882,301 ops/sec ±0.57% (90 runs sampled)
hiredis: + integer x 1,289,520 ops/sec ±0.15% (95 runs sampled)
new: + integer x 3,270,845 ops/sec ±0.80% (90 runs sampled)
andras: + integer x 11,643,973 ops/sec ±0.52% (94 runs sampled)

My code, if you want to play with it:

// gather digits numerically, return string if Number overflow
// 15 digits will not cause overflow, 16 can
function parseSimpleNumbers (parser) {
  var buf = parser.buffer, offset = parser.offset, length = buf.length
  var v, n = 0, m = 1, suffix = '', sign = ''  // m starts at 1 so leading zeros of the overflow tail survive the slice(1) below

  if (buf[offset] === 0x2d) { sign = '-'; offset++ }
  var start = offset

  while (offset < length) {
    if (buf[offset] >= 0x30 && buf[offset] <= 0x39) { // '0'..'9'
      if (offset - start < 15) n = (n * 10) + buf[offset++] - 0x30
      else m = (m * 10) + buf[offset++] - 0x30
    }
    else if (buf[offset++] === 0x0a) break; // '\n'
  }
  parser.offset = offset
  if (offset - start > 15 && m !== 1) return sign + n + ('' + m).slice(1)  // overflowed past 15 digits: return the full value as a string
  return sign ? -n : n
}
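
A quick usage sketch, if you want to try it (the parser-like object below is just for illustration; a real parser tracks buffer and offset itself):

var small = { buffer: new Buffer(':1000\r\n'), offset: 1 }  // offset 1 skips the ':' type byte
console.log(parseSimpleNumbers(small))   // 1000 (a number - fits in 15 digits)

var large = { buffer: new Buffer(':18446744073709551615\r\n'), offset: 1 }
console.log(parseSimpleNumbers(large))   // '18446744073709551615' (a string - 2^64 - 1 overflows)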

@Salakar (Member, Author) commented May 10, 2016

@andrasq that's an interesting way of doing that - incredibly performant. The benchmarks look identical to what @mscdex showed, so I assume he did something similar. Didn't think about using SMIs; learn something new every day!

There were two unused vars, v and suffix; removing those ups it by 2 mil ops/s here.

Benchmark on my machine:
[benchmark screenshot]

On a side note, a lot of the redis commands (such as SET, SETNX, etc.) return either a 1 or 0 integer response, so when benchmarked against this the increase is pretty insane:
[benchmark screenshot]

I can't seem to replicate the issue you had with the tests dropping after benchmarking a big number, though; running on OS X and Node v6.1 here.

Also, apologies for randomly tagging you both here; I've just been keeping an eye on all the nodejs PRs tagged performance, as they're always interesting to review, and your names pop up a lot =]

Here's the improvement log for Node Redis with the new parser changes:

                 before -> after
  PING:          19385 ->  18797 ops/sec (∆ -    588 -   3.03%)
  PING:         422711 -> 474790 ops/sec (∆ +  52079 +  12.32%)
  SET 4B str:    15976 ->  19191 ops/sec (∆ +   3215 +  20.12%)
  SET 4B str:   301140 -> 326050 ops/sec (∆ +  24910 +   8.27%)
  SET 4B buf:     9931 ->  10834 ops/sec (∆ +    903 +   9.09%)
  SET 4B buf:   145362 -> 155378 ops/sec (∆ +  10016 +   6.89%)
  GET 4B str:    16820 ->  21078 ops/sec (∆ +   4258 +  25.32%)
  GET 4B str:   307837 -> 408237 ops/sec (∆ + 100400 +  32.61%)
  GET 4B buf:    18249 ->  21424 ops/sec (∆ +   3175 +  17.40%)
  GET 4B buf:   319952 -> 409016 ops/sec (∆ +  89064 +  27.84%)
  SET 4KiB str:  16750 ->  19096 ops/sec (∆ +   2346 +  14.01%)
  SET 4KiB str: 118992 -> 123890 ops/sec (∆ +   4898 +   4.12%)
  SET 4KiB buf:  10766 ->  11951 ops/sec (∆ +   1185 +  11.01%)
  SET 4KiB buf: 113974 -> 119713 ops/sec (∆ +   5739 +   5.04%)
  GET 4KiB str:  13368 ->  14038 ops/sec (∆ +    670 +   5.01%)
  GET 4KiB str: 117673 -> 145822 ops/sec (∆ +  28149 +  23.92%)
  GET 4KiB buf:  12539 ->  13920 ops/sec (∆ +   1381 +  11.01%)
  GET 4KiB buf: 118573 -> 146122 ops/sec (∆ +  27549 +  23.23%)
  INCR:          18747 ->  21059 ops/sec (∆ +   2312 +  12.33%)
  INCR:         337285 -> 390544 ops/sec (∆ +  53259 +  15.79%)
  LPUSH:         18854 ->  20717 ops/sec (∆ +   1863 +   9.88%)
  LPUSH:        309936 -> 372811 ops/sec (∆ +  62875 +  20.29%)
  LRANGE 10:     16185 ->  19563 ops/sec (∆ +   3378 +  20.87%)
  LRANGE 10:    118493 -> 181427 ops/sec (∆ +  62934 +  53.11%)
  LRANGE 100:    10287 ->  14496 ops/sec (∆ +   4209 +  40.92%)
  LRANGE 100:    16667 ->  25839 ops/sec (∆ +   9172 +  55.03%)
  SET 4MiB str:    177 ->    177 ops/sec (∆ -      0 -   0.00%)
  SET 4MiB str:    178 ->    183 ops/sec (∆ +      5 +   2.81%)
  SET 4MiB buf:    732 ->    744 ops/sec (∆ +     12 +   1.64%)
  SET 4MiB buf:    773 ->    666 ops/sec (∆ -    107 -  13.84%)
  GET 4MiB str:    161 ->    160 ops/sec (∆ -      1 -   0.62%)
  GET 4MiB str:    163 ->    159 ops/sec (∆ -      4 -   2.45%)
  GET 4MiB buf:    163 ->    159 ops/sec (∆ -      4 -   2.45%)
  GET 4MiB buf:    167 ->    164 ops/sec (∆ -      3 -   1.80%)
  :              85332 ->  85170 ops/sec (∆ -    162 -   0.19%)

  Mean difference in ops/sec:  +14712.8

@andrasq commented May 10, 2016

nice!

Whoops - the v and suffix were left over from experiments. The glitch on my machine is there with both node 5 and 6 (though node 6 is less affected):

node-v5.10.1:
andras: + integer x 11,653,357 ops/sec ±0.43% (87 runs sampled)
andras: + largeInteger x 1,347,323 ops/sec ±0.84% (93 runs sampled)
andras: + integer x 6,224,194 ops/sec ±0.82% (91 runs sampled)

node-v6.0.0:
andras: + integer x 12,459,959 ops/sec ±0.33% (89 runs sampled)
andras: + largeInteger x 1,516,039 ops/sec ±1.75% (90 runs sampled)
andras: + integer x 8,359,377 ops/sec ±0.88% (91 runs sampled)

The large integer used was:
var largeIntegerBuffer = new Buffer(':18446744073709551615\r\n'); // (2 ^ 64) - 1
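
If anyone wants to reproduce the interleaving effect, a minimal suite along these lines should do it (this assumes the benchmark module and an already-constructed parser instance exposing execute(buffer), as used elsewhere in this thread):

var Benchmark = require('benchmark')

var integerBuffer = new Buffer(':100\r\n')
var largeIntegerBuffer = new Buffer(':18446744073709551615\r\n') // (2 ^ 64) - 1

// `parser` is assumed to be a parser instance created elsewhere (not shown here).
new Benchmark.Suite()
  .add('integer (before)', function () { parser.execute(integerBuffer) })
  .add('largeInteger', function () { parser.execute(largeIntegerBuffer) })
  .add('integer (after)', function () { parser.execute(integerBuffer) })
  .on('cycle', function (event) { console.log(String(event.target)) })
  .run()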

@Salakar (Member, Author) commented May 10, 2016

@andrasq oh OK, this happens here for me too:

andras: + integer x 19,008,076 ops/sec ±1.24% (80 runs sampled)
andras: + largeInteger x 2,486,108 ops/sec ±1.02% (81 runs sampled)
andras: + integer x 15,435,560 ops/sec ±1.07% (84 runs sampled)

Weirdly it only affects integers; if you run a string, then a large int, then back to a string, it remains consistent:

andras: + string x 4,615,010 ops/sec ±1.55% (87 runs sampled)
andras: + largeInteger x 2,527,927 ops/sec ±1.03% (87 runs sampled)
andras: + string x 4,612,367 ops/sec ±1.25% (85 runs sampled)

If you create a separate instance of the parser to parse the large int, it still affects the other parser's integer parsing performance, so it doesn't seem to be the parser buffer or offsets etc.

It's a one-off degradation too: if you stack large ints, the small-int parsing stays at roughly the same lower perf rate, but large-int perf doesn't drop:

andras: + integer x 19,056,384 ops/sec ±1.05% (84 runs sampled)
andras: + largeInteger x 2,536,531 ops/sec ±0.95% (87 runs sampled)
andras: + largeInteger x 2,521,160 ops/sec ±1.19% (85 runs sampled)
andras: + largeInteger x 2,549,584 ops/sec ±1.15% (84 runs sampled)
andras: + largeInteger x 2,567,975 ops/sec ±1.09% (86 runs sampled)
andras: + integer x 15,601,798 ops/sec ±1.05% (87 runs sampled)

Tried it with your qtimeit lib - same issue there too; perf drops after large ints:

"function (){ x = parser.execute(integerBuffer); }": 100000000 loops in 4.3300 sec: 23094428.57 / sec, 0.000043 ms each
"function (){ x = parser.execute(largeIntegerBuffer); }": 100000000 loops in 38.1861 sec: 2618753.99 / sec, 0.000382 ms each
"function (){ x = parser.execute(integerBuffer); }": 100000000 loops in 5.2384 sec: 19089640.45 / sec, 0.000052 ms each

@andrasq commented May 10, 2016

It's weird. Last night I tried node --trace-gc and the parse of integerBuffer shows zero garbage collection before the largeInteger, lots of gc during largeInteger (as expected, all those strings), but then lots of gc scavenging after largeInteger too, even when back to parsing integerBuffer again. 50-60 gc scavenge runs / second (every 15-20ms) and a couple of mark-and-sweeps too.
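
(For anyone wanting to repeat that: it's just the normal benchmark run under the flag, with bench.js below standing in for whatever entry point you use.)

node --trace-gc bench.js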

It's possible the benchmark fits in the cpu cache without garbage collection, but once gc kicks in it becomes a cache buster, with the cache contents bouncing between gc and the script, slowing it down. (Which would suggest that the lower value is the more realistic figure to expect in production.) Still not bad though. Good work that you're doing!

qtimeit is a node port of my php tool that I used for years before I started coding node.
I write lots of micro benchmarks, and qtimeit is great for one-liners.

@Salakar (Member, Author) commented May 10, 2016

@andrasq could the issue be related to nodejs/node#1671, I wonder? It looks GC-related, and there are a lot of similar benchmarks there.

@andrasq commented May 10, 2016

@Salakar I wouldn't think so - that issue is specific to gc-ing Buffers. Also, the same test run behaves differently before and after an intervening largeInteger, so it's not gc specifically but more the state of the system.

@Salakar (Member, Author) commented May 10, 2016

@andrasq true - I guess I was thinking that lots of buffers being passed to .execute() wasn't helping gc; it's been a long day, hah. The lower benchmark is still impressive anyway.

Got any tips on how I could improve the buffer caching / chunking? Would some kind of setup with a stream be more efficient? As you can see in the improvement log above, the ops/s on 4mb strings/buffers is quite low, which is understandable given the size.

qtimeit is deffo handy though, only saw it today after stalking your github repos 😆 will be using this more for quick tests. 👍 - also some good tips in your node-docs repo

@andrasq commented May 10, 2016

I'm meaning to look at buffer chunking too, just haven't had time yet. Until then: a while back I wrote a package, qbuffers, to coalesce and re-split buffers for a beanstalkd driver - you might find ideas in there. (qbuffers will coalesce buffers, can re-split on user-defined delimiters, and can return either strings or buffer slices.)

@Salakar (Member, Author) commented May 10, 2016

Closing in favour of #3 and #4 - #4 has the latest code changes for v2.0

@Salakar closed this May 10, 2016

@Salakar referenced this pull request May 10, 2016: v2.0 changes #4 (closed)

@andrasq commented May 11, 2016

For the 4mb bulk string, I don't see an obvious path to a 5x speedup. On my system (an older AMD box, with much lower memory bandwidth than newer Intel systems), 1000 runs of the 4mb bulk string test take about 12 seconds: 7.5 sec for the buffer concat and 4.5 for the utf8 conversion. For buffer returns the utf8 conversion is avoided but the concat is needed; for string responses the utf8 conversion is necessary. If the api requires that the data be combined and returned in a contiguous block, the merge/convert overheads are unavoidable.

Baseline, 1000 calls:
11.68 - total
 7.26 - buffer concat (coalesce)
 4.40 - utf8 conversion of concated buffer (toString)
 0.02 - other

For string responses, concatenating strings appears twice as fast as concatenating buffers, suggesting a possible 45% gain. Streams might help with this, to ease the handling of char fragments split across buffer boundaries.

Pre-converted sub-strings, 1000 calls:
 8.08 - utf8-convert each chunk separately, then concat the results
 4.12 - utf8 conversion of the 64 64k chunks (toString)
 3.93 - the concat of the 64 64k sub-strings (coalesce)

For buffers, copying the chunks into a preallocated merge buffer hints that a gain similar to the above may be possible without pre-converting, just by avoiding the Buffer.concat.

Copy chunks into merge buffer, 100 calls:
 0.825 - total
 0.371 - buffer.copy (coalesce)
 0.409 - utf8 convert (toString)
 0.038 - other
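
Roughly what I mean, as a sketch (chunks is assumed to be the array of cached Buffer chunks for one bulk reply and totalLength their combined byte length; none of this is the actual parser code):

// 1. Current approach: concat all chunks, then convert the result to a string.
function concatThenConvert (chunks, totalLength) {
  return Buffer.concat(chunks, totalLength).toString('utf8')
}

// 2. Pre-convert each chunk to a string, then join the strings.
//    (Needs care with multi-byte utf8 characters split across chunk boundaries,
//    e.g. via the string_decoder module.)
function convertThenJoin (chunks) {
  var parts = new Array(chunks.length)
  for (var i = 0; i < chunks.length; i++) parts[i] = chunks[i].toString('utf8')
  return parts.join('')
}

// 3. For buffer responses: copy the chunks into a preallocated merge buffer
//    instead of going through Buffer.concat.
function copyIntoMergeBuffer (chunks, totalLength) {
  var merged = new Buffer(totalLength)
  var offset = 0
  for (var i = 0; i < chunks.length; i++) {
    chunks[i].copy(merged, offset)
    offset += chunks[i].length
  }
  return merged
}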

@Salakar (Member, Author) commented May 17, 2016

@andrasq thanks for pointing us in the right direction - we've got some decent increases in perf now on the 4mb bulk strings and the other benchmarks; results here if you're interested: #3

Not sure if there's much left to improve on that PR now.
