Optimize SIMD impl of `lines_fwd` and `lines_bwd` #535

heiher · 2025-06-26T15:17:39Z

Benchmark results on AMD Zen4:

 simd/lines_fwd/1        time:   [1.4801 ns 1.4803 ns 1.4804 ns]
                         thrpt:  [644.19 MiB/s 644.26 MiB/s 644.34 MiB/s]
                  change:
                         time:   [−38.958% −38.933% −38.898%] (p = 0.00 < 0.05)
                         thrpt:  [+63.661% +63.755% +63.821%]
                         Performance has improved.

 simd/lines_fwd/8        time:   [3.8571 ns 3.8607 ns 3.8641 ns]
                         thrpt:  [1.9282 GiB/s 1.9299 GiB/s 1.9316 GiB/s]
                  change:
                         time:   [−19.458% −19.423% −19.394%] (p = 0.00 < 0.05)
                         thrpt:  [+24.060% +24.105% +24.159%]
                         Performance has improved.

 simd/lines_fwd/128      time:   [14.064 ns 14.092 ns 14.120 ns]
                         thrpt:  [8.4426 GiB/s 8.4592 GiB/s 8.4764 GiB/s]
                  change:
                         time:   [−6.0955% −5.8911% −5.6706%] (p = 0.00 < 0.05)
                         thrpt:  [+6.0115% +6.2599% +6.4912%]
                         Performance has improved.

 simd/lines_fwd/1024     time:   [18.160 ns 18.178 ns 18.195 ns]
                         thrpt:  [52.415 GiB/s 52.462 GiB/s 52.516 GiB/s]
                  change:
                         time:   [−4.1174% −3.9973% −3.8859%] (p = 0.00 < 0.05)
                         thrpt:  [+4.0430% +4.1637% +4.2942%]
                         Performance has improved.

 simd/lines_fwd/131072   time:   [871.09 ns 871.14 ns 871.21 ns]
                         thrpt:  [140.12 GiB/s 140.13 GiB/s 140.14 GiB/s]
                  change:
                         time:   [−10.451% −10.405% −10.358%] (p = 0.00 < 0.05)
                         thrpt:  [+11.555% +11.613% +11.670%]
                         Performance has improved.

 simd/lines_fwd/134217728
                         time:   [1.7326 ms 1.7332 ms 1.7338 ms]
                         thrpt:  [72.094 GiB/s 72.120 GiB/s 72.146 GiB/s]
                  change:
                         time:   [−0.4091% −0.2348% −0.0670%] (p = 0.00 < 0.05)
                         thrpt:  [+0.0671% +0.2353% +0.4108%]
                         Change within noise threshold.
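
As a reading aid for the numbers above, here is a minimal sketch of how the `thrpt` and `change` figures relate to the measured times. It assumes the benchmark name suffix (for example `131072`) is the input size in bytes, and the helper names `gib_per_s` and `thrpt_change` are made up for illustration.

```rust
/// Throughput in GiB/s for `bytes` processed in `nanos` nanoseconds.
/// Assumption: the benchmark parameter is the input size in bytes.
fn gib_per_s(bytes: f64, nanos: f64) -> f64 {
    bytes / (nanos * 1e-9) / (1u64 << 30) as f64
}

/// Relative throughput change implied by a relative time change.
/// Both values are fractions, e.g. -0.38933 for a time change of -38.933%.
fn thrpt_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

With the `lines_fwd/131072` median of 871.14 ns, `gib_per_s(131072.0, 871.14)` lands at roughly 140.13 GiB/s, and `thrpt_change(-0.38933)` gives roughly +63.755%, matching the `lines_fwd/1` report. This is why a time reduction of about 39% shows up as a throughput gain of about 64%.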

heiher · 2025-06-26T15:23:04Z

@microsoft-github-policy-service agree company="Loongson"

lhecker · 2025-06-26T15:27:21Z

src/simd/lines_fwd.rs

+        let off = beg.align_offset(32);
+        if off != 0 && off < remaining {
+            (beg, line) = lines_fwd_fallback(beg, beg.add(off), line, line_stop);
+            remaining -= off;
+        }
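
For readers following the hunk above, here is a minimal, safe-Rust sketch of the same idea: peel off the unaligned head so the remainder starts on a 32-byte boundary. It works on slices rather than the raw pointers used in the PR. `split_at_alignment` and `count_newlines_scalar` are hypothetical names, and the scalar helper only stands in for `lines_fwd_fallback`, whose real signature is different.

```rust
/// Hypothetical scalar helper: counts b'\n' bytes one at a time.
/// Stands in for the PR's `lines_fwd_fallback` (not its real signature).
fn count_newlines_scalar(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'\n').count()
}

/// Splits `haystack` into a head to be handled byte-wise and a tail that
/// starts on a 32-byte boundary whenever the head is non-empty.
fn split_at_alignment(haystack: &[u8]) -> (&[u8], &[u8]) {
    // `align_offset` may return `usize::MAX` if it cannot compute an offset;
    // the `off < len` guard below also covers that case.
    let off = haystack.as_ptr().align_offset(32);
    // Same guard as in the diff: peel the head only when it is non-empty
    // and shorter than the remaining input.
    if off != 0 && off < haystack.len() {
        haystack.split_at(off)
    } else {
        (&haystack[..0], haystack)
    }
}
```

A caller would run `count_newlines_scalar` on the head and hand the tail to the vectorized loop. When the guard fails (already aligned, or input too short), the head is empty and the tail is the whole input, which is not guaranteed to be aligned.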


We can use the SIMD code for this instead of doing a byte-wise search here. That will hurt the throughput of the lines_fwd/1 case, but that doesn't matter much, because latency is what counts there, and it should stay within a few ns. Basically, modify the SIMD loop along the lines of this C pseudo-code:

beg = (beg + 128) & ~31; remaining = /* some equivalent of this, but I haven't thought it through */

I'm curious what difference this would make.

I see what you mean. In the first iteration, we scan 128 bytes, and beg moves forward to the next 32-byte aligned point. However, the actual distance covered may be less than 128 bytes. The tricky part here is figuring out how to discard the unnecessary entries in line_next.
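
The arithmetic behind that observation can be sketched as follows. This is only an illustration of the `(beg + 128) & ~31` expression on plain integer addresses, not code from the PR. The function names are hypothetical, and reading the rescanned bytes as the part whose matches would need discarding is an interpretation of the comment above.

```rust
/// Address reached after one 128-byte step when the result is rounded
/// down to a 32-byte boundary, as in `beg = (beg + 128) & ~31`.
fn next_aligned(addr: usize) -> usize {
    (addr + 128) & !31
}

/// Bytes by which the cursor actually advances: 128 minus the misalignment.
fn advance(addr: usize) -> usize {
    next_aligned(addr) - addr
}

/// Bytes at the end of the scanned 128-byte window that lie past the new
/// cursor and would therefore be scanned again by the next iteration.
fn overlap(addr: usize) -> usize {
    128 - advance(addr)
}
```

For an already aligned address the cursor moves the full 128 bytes and nothing is rescanned. For an address that is 31 bytes past a boundary it moves only 97 bytes, so the last 31 bytes of the first window are covered twice.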

For a leading segment shorter than 16 or 31 bytes, the byte-wise search is more latency-friendly.


heiher · 2025-06-26T16:38:47Z

Benchmark results on Intel i7-7700k (e20e006):

simd/lines_fwd/1        time:   [2.9242 ns 2.9293 ns 2.9350 ns]
                        thrpt:  [324.93 MiB/s 325.56 MiB/s 326.13 MiB/s]
                 change:
                        time:   [−39.147% −38.969% −38.816%] (p = 0.00 < 0.05)
                        thrpt:  [+63.443% +63.850% +64.331%]
                        Performance has improved.

simd/lines_fwd/8        time:   [7.7873 ns 7.7982 ns 7.8092 ns]
                        thrpt:  [976.98 MiB/s 978.35 MiB/s 979.72 MiB/s]
                 change:
                        time:   [−17.301% −17.160% −17.018%] (p = 0.00 < 0.05)
                        thrpt:  [+20.508% +20.714% +20.921%]
                        Performance has improved.

simd/lines_fwd/128      time:   [27.227 ns 27.327 ns 27.488 ns]
                        thrpt:  [4.3368 GiB/s 4.3623 GiB/s 4.3783 GiB/s]
                 change:
                        time:   [−4.5703% −4.0676% −3.4233%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5446% +4.2401% +4.7892%]
                        Performance has improved.

simd/lines_fwd/1024     time:   [38.485 ns 38.561 ns 38.646 ns]
                        thrpt:  [24.677 GiB/s 24.732 GiB/s 24.780 GiB/s]
                 change:
                        time:   [−5.3727% −4.8970% −4.2502%] (p = 0.00 < 0.05)
                        thrpt:  [+4.4389% +5.1492% +5.6777%]
                        Performance has improved.

simd/lines_fwd/131072   time:   [1.8623 µs 1.8700 µs 1.8804 µs]
                        thrpt:  [64.917 GiB/s 65.279 GiB/s 65.547 GiB/s]
                 change:
                        time:   [−14.964% −14.690% −14.387%] (p = 0.00 < 0.05)
                        thrpt:  [+16.804% +17.220% +17.598%]
                        Performance has improved.

simd/lines_fwd/134217728
                        time:   [5.6005 ms 5.6123 ms 5.6264 ms]
                        thrpt:  [22.217 GiB/s 22.272 GiB/s 22.319 GiB/s]
                 change:
                        time:   [−3.4977% −3.2437% −2.9432%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0324% +3.3524% +3.6244%]
                        Performance has improved.

lhecker reviewed Jun 26, 2025


heiher force-pushed the opt-simd-lines branch from 0856566 to e20e006 · June 26, 2025 16:21
