Skip to content

[pull] master from rust-lang:master#2169

Merged
pull[bot] merged 21 commits intoMu-L:masterfrom
rust-lang:master
Nov 17, 2022
Merged

[pull] master from rust-lang:master#2169
pull[bot] merged 21 commits intoMu-L:masterfrom
rust-lang:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull bot commented Nov 17, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

cjgillot and others added 21 commits November 13, 2022 14:02
Co-authored-by: joboet <jonas.boettiger@icloud.com>
Based on Wojciech Muła's "SIMD-friendly algorithms for substring searching"[0]

The two-way algorithm is Big-O efficient but it needs to preprocess the needle
to find a "criticla factorization" of it. This additional work is significant
for short needles. Additionally it mostly advances needle.len() bytes at a time.

The SIMD-based approach used here on the other hand can advance based on its
vector width, which can exceed the needle length. Except for pathological cases,
but due to being limited to small needles the worst case blowup is also small.

benchmarks taken on a Zen2:

```
16CGU, OLD:
test str::bench_contains_short_short                     ... bench:          27 ns/iter (+/- 1)
test str::bench_contains_short_long                      ... bench:         667 ns/iter (+/- 29)
test str::bench_contains_bad_naive                       ... bench:         131 ns/iter (+/- 2)
test str::bench_contains_bad_simd                        ... bench:         130 ns/iter (+/- 2)
test str::bench_contains_equal                           ... bench:         148 ns/iter (+/- 4)


16CGU, NEW:
test str::bench_contains_short_short                     ... bench:           8 ns/iter (+/- 0)
test str::bench_contains_short_long                      ... bench:         135 ns/iter (+/- 4)
test str::bench_contains_bad_naive                       ... bench:         130 ns/iter (+/- 2)
test str::bench_contains_bad_simd                        ... bench:         292 ns/iter (+/- 1)
test str::bench_contains_equal                           ... bench:           3 ns/iter (+/- 0)


1CGU, OLD:
test str::bench_contains_short_short                     ... bench:          30 ns/iter (+/- 0)
test str::bench_contains_short_long                      ... bench:         713 ns/iter (+/- 17)
test str::bench_contains_bad_naive                       ... bench:         131 ns/iter (+/- 3)
test str::bench_contains_bad_simd                        ... bench:         130 ns/iter (+/- 3)
test str::bench_contains_equal                           ... bench:         148 ns/iter (+/- 6)

1CGU, NEW:
test str::bench_contains_short_short                     ... bench:          10 ns/iter (+/- 0)
test str::bench_contains_short_long                      ... bench:         111 ns/iter (+/- 0)
test str::bench_contains_bad_naive                       ... bench:         135 ns/iter (+/- 3)
test str::bench_contains_bad_simd                        ... bench:         274 ns/iter (+/- 2)
test str::bench_contains_equal                           ... bench:           4 ns/iter (+/- 0)
```


[0] http://0x80.pl/articles/simd-strfind.html#sse-avx2
The Big-O is cubic, but this is only called with ~70 chars so it's still fast enough
- bump simd compare to 32bytes
- import small slice compare code from memmem crate
- try a few different probe bytes to avoid degenerate cases
  - but special-case 2-byte needles
For the next commit, `FunctionCx::codegen_*_terminator` need to take a
`&mut Bx` instead of consuming a `Bx`. This triggers a cascade of
similar changes across multiple functions. The resulting code is more
concise and replaces many `&mut bx` expressions with `bx`.
In `codegen_assert_terminator` we decide if a BB's successor is a
candidate for merging, which requires that it be the only successor, and
that it only have one predecessor. That result then gets passed down,
and if it reaches `funclet_br` with the appropriate BB characteristics,
then no `br` instruction is issued, a `MergingSucc::True` result is
passed back, and the merging proceeds in `codegen_block`.

The commit also adds `CachedLlbb`, a new type to help keep track of
each BB that has been merged into its predecessor.
Merge basic blocks where possible when generating LLVM IR.

r? `@ghost`
x86_64 SSE2 fast-path for str.contains(&str) and short needles

Based on Wojciech Muła's [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#sse-avx2)

The two-way algorithm is Big-O efficient but it needs to preprocess the needle
to find a "critical factorization" of it. This additional work is significant
for short needles. Additionally it mostly advances needle.len() bytes at a time.

The SIMD-based approach used here on the other hand can advance based on its
vector width, which can exceed the needle length. Except for pathological cases,
but due to being limited to small needles the worst case blowup is also small.

benchmarks taken on a Zen2, compiled with `-Ccodegen-units=1`:

```
OLD:
test str::bench_contains_16b_in_long                     ... bench:         504 ns/iter (+/- 14) = 5061 MB/s
test str::bench_contains_2b_repeated_long                ... bench:         948 ns/iter (+/- 175) = 2690 MB/s
test str::bench_contains_32b_in_long                     ... bench:         445 ns/iter (+/- 6) = 5732 MB/s
test str::bench_contains_bad_naive                       ... bench:         130 ns/iter (+/- 1) = 569 MB/s
test str::bench_contains_bad_simd                        ... bench:          84 ns/iter (+/- 8) = 880 MB/s
test str::bench_contains_equal                           ... bench:         142 ns/iter (+/- 7) = 394 MB/s
test str::bench_contains_short_long                      ... bench:         677 ns/iter (+/- 25) = 3768 MB/s
test str::bench_contains_short_short                     ... bench:          27 ns/iter (+/- 2) = 2074 MB/s

NEW:
test str::bench_contains_16b_in_long                     ... bench:          82 ns/iter (+/- 0) = 31109 MB/s
test str::bench_contains_2b_repeated_long                ... bench:          73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_32b_in_long                     ... bench:          71 ns/iter (+/- 1) = 35929 MB/s
test str::bench_contains_bad_naive                       ... bench:           7 ns/iter (+/- 0) = 10571 MB/s
test str::bench_contains_bad_simd                        ... bench:          97 ns/iter (+/- 41) = 762 MB/s
test str::bench_contains_equal                           ... bench:           4 ns/iter (+/- 0) = 14000 MB/s
test str::bench_contains_short_long                      ... bench:          73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_short_short                     ... bench:          12 ns/iter (+/- 0) = 4666 MB/s
```
Record `LocalDefId` in HIR nodes instead of a side table

This is part of an attempt to remove the `HirId -> LocalDefId` table from HIR.
This attempt is a prerequisite to creation of `LocalDefId` after HIR lowering (#96840), by controlling how `def_id` information is accessed.

This first part adds the information to HIR nodes themselves instead of a table.
The second part is #103902
The third part will be to make `hir::Visitor::visit_fn` take a `LocalDefId` as last parameter.
The fourth part will be to completely remove the side table.
Attempt to reuse `Vec<T>` backing storage for `Rc/Arc<[T]>`

If a `Vec<T>` has sufficient capacity to store the inner `RcBox<[T]>`, we can just reuse the existing allocation and shift the elements up, instead of making a new allocation.
@pull pull bot added the ⤵️ pull label Nov 17, 2022
@pull pull bot merged commit 36db030 into Mu-L:master Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants