Optimize SIMD impl of lines_fwd and lines_bwd
#535
@microsoft-github-policy-service agree company="Loongson"
src/simd/lines_fwd.rs
Outdated
let off = beg.align_offset(32);
if off != 0 && off < remaining {
    (beg, line) = lines_fwd_fallback(beg, beg.add(off), line, line_stop);
    remaining -= off;
}
We can use the SIMD code to do this instead of doing a byte-wise search here. This will hurt the throughput of the lines_fwd/1 case, but that doesn't matter much, because latency is much more important there (and it should stay within a few ns). Basically, modify the SIMD loop to this C pseudo-code:
beg = (beg + 128) & ~31;
remaining = /* some equivalent of this, but I haven't thought it through */
I'm curious what difference this would make.
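To make the arithmetic in the pseudo-code concrete, here is a minimal sketch of one possible equivalent for the `remaining` update. It works on plain `usize` addresses rather than raw pointers, and `advance_aligned` is a hypothetical helper name, not something from this PR. It assumes at least 128 bytes remain, so that the first (unaligned) 128-byte scan stays in bounds.

```rust
/// Hypothetical helper (not part of the PR): model of `beg = (beg + 128) & ~31`.
/// Takes the current address and the number of bytes left, and returns the
/// next 32-byte aligned address plus the updated remaining byte count.
fn advance_aligned(beg: usize, remaining: usize) -> (usize, usize) {
    // Assumption: the caller only takes this path when a full 128-byte
    // unaligned scan starting at `beg` is in bounds.
    debug_assert!(remaining >= 128);

    // Step one chunk forward, then round down to a 32-byte boundary.
    let next = (beg + 128) & !31;

    // The distance actually covered is 128 minus the initial misalignment,
    // i.e. somewhere in 97..=128.
    let consumed = next - beg;

    (next, remaining - consumed)
}
```

For an address that is 5 bytes past a 32-byte boundary, the step covers 123 bytes instead of 128, which is why `remaining` cannot simply be decremented by 128.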
I see what you mean. In the first iteration, we scan 128 bytes, and beg moves forward to the next 32-byte aligned point. However, the actual distance covered may be less than 128 bytes. The tricky part here is figuring out how to discard the unnecessary entries in line_next.
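One way to think about the discarding step, sketched under heavy assumptions: if the newline positions of the first 128-byte chunk are available as a bitmask, the bits at or beyond the aligned restart point can be cleared so those bytes are only counted in the next iteration. The code below is a scalar stand-in for a SIMD compare plus movemask; `newline_mask` and `discard_overlap` are hypothetical names, and this does not reflect how line_next is actually represented in the PR.

```rust
/// Scalar stand-in for a SIMD compare + movemask over a 128-byte chunk:
/// bit `i` of the result is set when `chunk[i]` is a newline.
fn newline_mask(chunk: &[u8; 128]) -> u128 {
    let mut mask = 0u128;
    for (i, &b) in chunk.iter().enumerate() {
        if b == b'\n' {
            mask |= 1u128 << i;
        }
    }
    mask
}

/// Hypothetical helper: keep only the bits for the first `consumed` bytes.
/// Bytes at index `consumed` and beyond will be rescanned by the next,
/// aligned iteration, so their newlines must not be counted here.
fn discard_overlap(mask: u128, consumed: u32) -> u128 {
    if consumed >= 128 {
        // Nothing overlaps; a shift by 128 would overflow, so handle it here.
        mask
    } else {
        mask & ((1u128 << consumed) - 1)
    }
}
```

Calling `count_ones()` on the trimmed mask then gives the number of newlines that belong to the first iteration alone.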
For a leading segment shorter than 16 or 31 bytes, the byte-wise search is more latency-friendly.
Benchmark results on AMD Zen4:
```
simd/lines_fwd/1          time:   [1.4801 ns 1.4803 ns 1.4804 ns]
                          thrpt:  [644.19 MiB/s 644.26 MiB/s 644.34 MiB/s]
                   change:
                          time:   [−38.958% −38.933% −38.898%] (p = 0.00 < 0.05)
                          thrpt:  [+63.661% +63.755% +63.821%]
                          Performance has improved.
simd/lines_fwd/8          time:   [3.8571 ns 3.8607 ns 3.8641 ns]
                          thrpt:  [1.9282 GiB/s 1.9299 GiB/s 1.9316 GiB/s]
                   change:
                          time:   [−19.458% −19.423% −19.394%] (p = 0.00 < 0.05)
                          thrpt:  [+24.060% +24.105% +24.159%]
                          Performance has improved.
simd/lines_fwd/128        time:   [14.064 ns 14.092 ns 14.120 ns]
                          thrpt:  [8.4426 GiB/s 8.4592 GiB/s 8.4764 GiB/s]
                   change:
                          time:   [−6.0955% −5.8911% −5.6706%] (p = 0.00 < 0.05)
                          thrpt:  [+6.0115% +6.2599% +6.4912%]
                          Performance has improved.
simd/lines_fwd/1024       time:   [18.160 ns 18.178 ns 18.195 ns]
                          thrpt:  [52.415 GiB/s 52.462 GiB/s 52.516 GiB/s]
                   change:
                          time:   [−4.1174% −3.9973% −3.8859%] (p = 0.00 < 0.05)
                          thrpt:  [+4.0430% +4.1637% +4.2942%]
                          Performance has improved.
simd/lines_fwd/131072     time:   [871.09 ns 871.14 ns 871.21 ns]
                          thrpt:  [140.12 GiB/s 140.13 GiB/s 140.14 GiB/s]
                   change:
                          time:   [−10.451% −10.405% −10.358%] (p = 0.00 < 0.05)
                          thrpt:  [+11.555% +11.613% +11.670%]
                          Performance has improved.
simd/lines_fwd/134217728  time:   [1.7326 ms 1.7332 ms 1.7338 ms]
                          thrpt:  [72.094 GiB/s 72.120 GiB/s 72.146 GiB/s]
                   change:
                          time:   [−0.4091% −0.2348% −0.0670%] (p = 0.00 < 0.05)
                          thrpt:  [+0.0671% +0.2353% +0.4108%]
                          Change within noise threshold.
```
Benchmark results on Intel i7-7700k (e20e006):
Benchmark results on AMD Zen4: