Optimize SIMD impl of lines_fwd and lines_bwd #535

Open
wants to merge 1 commit into
base: main


heiher commented Jun 26, 2025

Benchmark results on AMD Zen4:

 simd/lines_fwd/1        time:   [1.4801 ns 1.4803 ns 1.4804 ns]
                         thrpt:  [644.19 MiB/s 644.26 MiB/s 644.34 MiB/s]
                  change:
                         time:   [−38.958% −38.933% −38.898%] (p = 0.00 < 0.05)
                         thrpt:  [+63.661% +63.755% +63.821%]
                         Performance has improved.

 simd/lines_fwd/8        time:   [3.8571 ns 3.8607 ns 3.8641 ns]
                         thrpt:  [1.9282 GiB/s 1.9299 GiB/s 1.9316 GiB/s]
                  change:
                         time:   [−19.458% −19.423% −19.394%] (p = 0.00 < 0.05)
                         thrpt:  [+24.060% +24.105% +24.159%]
                         Performance has improved.

 simd/lines_fwd/128      time:   [14.064 ns 14.092 ns 14.120 ns]
                         thrpt:  [8.4426 GiB/s 8.4592 GiB/s 8.4764 GiB/s]
                  change:
                         time:   [−6.0955% −5.8911% −5.6706%] (p = 0.00 < 0.05)
                         thrpt:  [+6.0115% +6.2599% +6.4912%]
                         Performance has improved.

 simd/lines_fwd/1024     time:   [18.160 ns 18.178 ns 18.195 ns]
                         thrpt:  [52.415 GiB/s 52.462 GiB/s 52.516 GiB/s]
                  change:
                         time:   [−4.1174% −3.9973% −3.8859%] (p = 0.00 < 0.05)
                         thrpt:  [+4.0430% +4.1637% +4.2942%]
                         Performance has improved.

 simd/lines_fwd/131072   time:   [871.09 ns 871.14 ns 871.21 ns]
                         thrpt:  [140.12 GiB/s 140.13 GiB/s 140.14 GiB/s]
                  change:
                         time:   [−10.451% −10.405% −10.358%] (p = 0.00 < 0.05)
                         thrpt:  [+11.555% +11.613% +11.670%]
                         Performance has improved.

 simd/lines_fwd/134217728
                         time:   [1.7326 ms 1.7332 ms 1.7338 ms]
                         thrpt:  [72.094 GiB/s 72.120 GiB/s 72.146 GiB/s]
                  change:
                         time:   [−0.4091% −0.2348% −0.0670%] (p = 0.00 < 0.05)
                         thrpt:  [+0.0671% +0.2353% +0.4108%]
                         Change within noise threshold.
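As a reading aid for the numbers above: the benchmark suffix appears to be the input size in bytes, so throughput is size divided by time, and a relative time change of t implies a throughput change of 1/(1 + t) - 1. The following is a minimal sketch of that arithmetic; the function names are hypothetical and not part of the PR or of the benchmark harness.

```rust
/// Throughput in MiB/s for `bytes` processed in `nanos` nanoseconds.
/// (Hypothetical helper, only for checking the reported figures.)
fn throughput_mib_s(bytes: f64, nanos: f64) -> f64 {
    bytes / (nanos * 1e-9) / (1024.0 * 1024.0)
}

/// Relative throughput change implied by a relative time change.
/// Both are fractions: -0.38933 means the run took 38.933% less time.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

fn main() {
    // simd/lines_fwd/1: assumed to be 1 byte per iteration, 1.4803 ns median.
    println!("{:.2} MiB/s", throughput_mib_s(1.0, 1.4803));
    // The reported median time change for the same benchmark.
    println!("{:+.3}%", throughput_change(-0.38933) * 100.0);
}
```

The results land close to the reported medians; small differences are expected because the printed times are rounded.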

heiher commented Jun 26, 2025

@microsoft-github-policy-service agree company="Loongson"

Comment on lines 113 to 115
let off = beg.align_offset(32);
if off != 0 && off < remaining {
    (beg, line) = lines_fwd_fallback(beg, beg.add(off), line, line_stop);
    remaining -= off;
}
Member

We can use the SIMD code for this instead of doing a byte-wise search here. That will hurt the throughput of the lines_fwd/1 case, but that doesn't matter much: what matters in that case is latency, which should stay within a few ns. Basically, modify the SIMD loop along the lines of this C pseudo-code:

beg = (beg + 128) & ~31;
remaining = /* some equivalent of this, but I haven't thought it through */

I'm curious what difference this would make.
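
To make the pseudo-code concrete, here is a minimal sketch of the address arithmetic in Rust, using plain `usize` addresses instead of pointers. The helper name `advance_aligned` is hypothetical, and updating `remaining` by the returned step is only one possible reading of the "some equivalent" left open above, not something the reviewer specified.

```rust
/// Step 128 bytes ahead, then round down to a 32-byte boundary, as in
/// `beg = (beg + 128) & ~31`. Returns the new address and the number of
/// bytes actually advanced. (Hypothetical helper for illustration.)
fn advance_aligned(beg: usize) -> (usize, usize) {
    let next = (beg + 128) & !31;
    (next, next - beg)
}

fn main() {
    for beg in [0x1000usize, 0x1001, 0x101f] {
        let (next, step) = advance_aligned(beg);
        // An assumed update would be `remaining -= step`.
        println!("beg={beg:#x} next={next:#x} step={step}");
    }
}
```

With this arithmetic the step is 128 bytes for an already aligned `beg` and between 97 and 127 bytes otherwise, so every later iteration starts on a 32-byte boundary.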

Author

I see what you mean. In the first iteration, we scan 128 bytes, and beg moves forward to the next 32-byte aligned point. However, the actual distance covered may be less than 128 bytes. The tricky part here is figuring out how to discard the unnecessary entries in line_next.
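
One way to picture the "discard" step is with a plain bitmask of newline positions, where bits for bytes past the aligned point are cleared so they are not counted twice when the next iteration rescans them. This is only an illustrative sketch: the representation of `line_next` is not shown in this conversation, and `newline_mask` and `keep_low` are hypothetical scalar stand-ins for a SIMD compare plus movemask.

```rust
/// Scalar stand-in for a SIMD compare + movemask: bit i is set when
/// chunk[i] is a newline. Only the first 128 bytes are considered.
fn newline_mask(chunk: &[u8]) -> u128 {
    chunk
        .iter()
        .take(128)
        .enumerate()
        .fold(0u128, |m, (i, &b)| m | (((b == b'\n') as u128) << i))
}

/// Keep only the low `keep` bits of the mask, dropping matches that lie
/// beyond the point the next iteration will start from.
fn keep_low(mask: u128, keep: u32) -> u128 {
    if keep >= 128 { mask } else { mask & ((1u128 << keep) - 1) }
}

fn main() {
    let mut chunk = [b'a'; 128];
    chunk[5] = b'\n';
    chunk[100] = b'\n';
    let mask = newline_mask(&chunk);
    // Suppose only 97 of the 128 scanned bytes are actually consumed.
    let kept = keep_low(mask, 97);
    println!("{} of {} newlines kept", kept.count_ones(), mask.count_ones());
}
```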

Author

For a leading segment shorter than 16 or 31 bytes, the byte-wise search is more latency-friendly.
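
For readers who want to try the byte-wise prologue outside the PR, here is a minimal safe-Rust sketch on slices. It is not the PR's code: `split_unaligned_head` and `count_newlines_bytewise` are hypothetical names, and the PR's `off < remaining` check is replaced here by clamping the offset to the slice length.

```rust
/// Split `hay` into a head that ends at the first 32-byte-aligned address
/// and the (aligned) tail that a SIMD loop could then process.
fn split_unaligned_head(hay: &[u8]) -> (&[u8], &[u8]) {
    let off = hay.as_ptr().align_offset(32);
    // `align_offset` is allowed to return usize::MAX, and the input may be
    // shorter than the offset; clamping covers both cases.
    hay.split_at(off.min(hay.len()))
}

/// Plain byte-wise newline count, used for the short leading segment.
fn count_newlines_bytewise(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == b'\n').count()
}

fn main() {
    let text = b"one\ntwo\nthree\nfour\n".repeat(8);
    let hay = &text[1..]; // start one byte in, so the start is likely unaligned
    let (head, tail) = split_unaligned_head(hay);
    // The tail is counted byte-wise here only to keep the sketch short.
    let total = count_newlines_bytewise(head) + count_newlines_bytewise(tail);
    println!("head={} tail={} newlines={}", head.len(), tail.len(), total);
}
```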

heiher commented Jun 26, 2025

Benchmark results on Intel i7-7700k (e20e006):

simd/lines_fwd/1        time:   [2.9242 ns 2.9293 ns 2.9350 ns]
                        thrpt:  [324.93 MiB/s 325.56 MiB/s 326.13 MiB/s]
                 change:
                        time:   [−39.147% −38.969% −38.816%] (p = 0.00 < 0.05)
                        thrpt:  [+63.443% +63.850% +64.331%]
                        Performance has improved.

simd/lines_fwd/8        time:   [7.7873 ns 7.7982 ns 7.8092 ns]
                        thrpt:  [976.98 MiB/s 978.35 MiB/s 979.72 MiB/s]
                 change:
                        time:   [−17.301% −17.160% −17.018%] (p = 0.00 < 0.05)
                        thrpt:  [+20.508% +20.714% +20.921%]
                        Performance has improved.

simd/lines_fwd/128      time:   [27.227 ns 27.327 ns 27.488 ns]
                        thrpt:  [4.3368 GiB/s 4.3623 GiB/s 4.3783 GiB/s]
                 change:
                        time:   [−4.5703% −4.0676% −3.4233%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5446% +4.2401% +4.7892%]
                        Performance has improved.

simd/lines_fwd/1024     time:   [38.485 ns 38.561 ns 38.646 ns]
                        thrpt:  [24.677 GiB/s 24.732 GiB/s 24.780 GiB/s]
                 change:
                        time:   [−5.3727% −4.8970% −4.2502%] (p = 0.00 < 0.05)
                        thrpt:  [+4.4389% +5.1492% +5.6777%]
                        Performance has improved.

simd/lines_fwd/131072   time:   [1.8623 µs 1.8700 µs 1.8804 µs]
                        thrpt:  [64.917 GiB/s 65.279 GiB/s 65.547 GiB/s]
                 change:
                        time:   [−14.964% −14.690% −14.387%] (p = 0.00 < 0.05)
                        thrpt:  [+16.804% +17.220% +17.598%]
                        Performance has improved.

simd/lines_fwd/134217728
                        time:   [5.6005 ms 5.6123 ms 5.6264 ms]
                        thrpt:  [22.217 GiB/s 22.272 GiB/s 22.319 GiB/s]
                 change:
                        time:   [−3.4977% −3.2437% −2.9432%] (p = 0.00 < 0.05)
                        thrpt:  [+3.0324% +3.3524% +3.6244%]
                        Performance has improved.
