
SIMD code generator #60

Merged: 15 commits into BioJulia:master on Mar 20, 2021
Conversation

@jakobnissen (Member) commented Nov 4, 2020

Introduction

This PR introduces the :simd code generator. At this time, it is very similar to the :goto code generator. In fact, the only current difference from the :goto generator is that it uses SIMD loop unrolling instead of the manual Automa-generated loop unrolling.

In future PRs, I would like to change the large-scale structure of the code generated by the :simd generator (see issue #53), if it turns out to be possible and performant. But for now, this change only pertains to SIMD loop unrolling.

Benefits

Speed. That's it, really. We can use an "empty" format validator (see #55) on a byte array to most accurately time Automa's parsing, since neither the Machine's attached actions nor file I/O will influence the result.

The times below are median runtimes measured with BenchmarkTools.jl:

6 MiB FASTA with 60 chars per line:
Before: 2.255 ms
After:  1.283 ms

1.3 GiB 4-line FASTQ with around 100 nt/read:
Before: 368.734 ms
After:  132.312 ms

These are, of course, optimal conditions. Even on the extremely simple biofast FASTQ benchmark, processing a file is approximately 2.5x slower than processing a byte vector in RAM. So a doubling of the speed of the actual Automa parsing realistically yields only about a 20 % overall performance increase. Of course, this does mean that we are moving the performance bottleneck elsewhere, so that e.g. TranscodingStreams now has more room for optimization.
Or maybe it turns out to be more than that. We'll know for certain once all the tests are in place and the behaviour is verified to be correct.
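The ~20 % figure above can be sanity-checked with a quick back-of-the-envelope calculation. A minimal sketch (the 2.5x file-vs-RAM factor and the 2x parsing speedup are the rough numbers quoted above, not new measurements; the function name is made up for illustration):

```python
# Hedged sketch: why a 2x parsing speedup yields only ~20% overall,
# given the biofast observation that processing a file is ~2.5x
# slower than parsing the same bytes in RAM.

def overall_time_saved(parse_time=1.0, file_factor=2.5, parse_speedup=2.0):
    """Fraction of total runtime saved when only parsing gets faster."""
    total_before = parse_time * file_factor       # parsing + I/O and other overhead
    overhead = total_before - parse_time          # the non-parsing part, unchanged
    total_after = parse_time / parse_speedup + overhead
    return 1 - total_after / total_before

print(f"{overall_time_saved():.0%}")  # -> 20%
```

This is just Amdahl's law: only the ~40 % of runtime spent in parsing is sped up, so halving it saves 20 % of the total.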

Drawbacks of using the :simd generator over the :goto generator:

Unfortunately, SIMD code generation comes with a few limitations:

  • getbyte can no longer be a custom function in a :simd generator; it has to be Base.getindex. I don't think anyone ever used that functionality, though.
  • Unlike the :goto generator, the generated code is not pure Base code. That means the "consumer" of the code must have imported or be using Automa, else they will get an UndefVarError.
  • The code generation uses information from LLVM to determine which CPU instructions are available. If a :simd generator is used to generate Julia code on a modern CPU (an x86 CPU less than ~10 years old), but the Julia code is then executed on another CPU (one ~10 years old, an AMD CPU more than ~5 years old, or a RISC CPU), it will likely error catastrophically. This may happen if cached Julia code is used on another computer.

Drawbacks of merging this code

  • Larger code base and more maintenance, obviously.
  • See the simd.jl file: it checks the available instructions, but there is, as far as I know, no truly stable and officially supported way of doing this, so it may break sometime in the future. Since this check happens at import time, even users not using the :simd generator will be affected if it fails on some systems or in the future.
  • Ensuring correct behaviour will be harder, since different CPUs will generate different code. It should not be too hard to test, though, since almost all modern CPUs support all the relevant instruction sets and can therefore test the code generated for every CPU.

To-do list

  • Still fails FASTQ parsing, so there are probably a few kinks to be ironed out
  • Needs way more tests, ideally 100% coverage
  • Too much code duplication between the :simd and :goto generators. This needs to be factored out (or does it? It depends on how much progress I make on reorganizing the :simd codegen down the road)

Closes issue #54

@jakobnissen (Member, Author)

Note for the future: On a FASTA file, the SIMD generator executes 8.7x fewer instructions and 10x fewer branches, while only being 2.5x faster. Part of the reason may be increased cache misses (2.5k -> 4k / 192k), but the major factor is probably the level of instruction-level parallelism. Looking at the assembly, the SIMD loop has serial memory dependencies, preventing superscalar execution. On my computer, manually unrolling the SIMD loop does not make it use different registers, so the serial dependency remains, and unrolling is in fact slower.
It appears the main bottleneck may be that the CPU simply can't parallelize the vector instructions to the same degree.
It might be worth looking into prefetching, though.

@kescobo (Member) commented Dec 18, 2020

> Unlike the :goto generator, the generated code is not pure Base code. That means the "consumer" of the code must have imported or be using Automa, else they will get an UndefVarError.

Might be worth checking how much this increases load time for some consumer package. Automa is pretty light in terms of dependencies, and it sounds like it would be worth it, but it might be nice to have that information at hand.

@jakobnissen (Member, Author)

That's a good idea. The new SIMD.jl dependency has historically been devastating for load times, although this should have improved quite significantly with 1.6, and I don't plan on merging this before then.

@jakobnissen (Member, Author)

Turns out compile times are affected, but not by much. On version 1.5, compiling and running a FASTA reader takes about 2.3 seconds without this PR, and 2.55 seconds with this PR. On 1.6, it's 1.85 seconds without and 1.95 with.

@jakobnissen (Member, Author)

I think this is done now and can be merged. Code coverage is a little misleading; it's actually almost 100%. I have no coverage on the branches that activate on big-endian CPUs and on CPUs without SIMD capabilities, because I don't have those kinds of CPUs. It's just simple if-statements, though, so not much can go wrong.

@jakobnissen jakobnissen changed the title SIMD code generator [WIP, do not merge] SIMD code generator Dec 19, 2020
@kescobo (Member) left a comment


This is not the kind of code I'm fit to review technically. One thing I'll note is that I don't see much associated documentation. There are some useful (I gather) comments throughout, and I recognize that this may not affect the user-facing API, but it seems complicated enough (and helpful enough) that it might be worth spending a bit more time fleshing out the human-readable bits.

Unless you think it would all be self-evident to someone familiar with SIMD operations generally. Basically, what would happen to that code if you were hit by the metaphorical bus? Would it be interpretable?

@jakobnissen (Member, Author)

Okay, now I've added a whole bunch of comments and also mitigated some of the 1.6 regressions. You're right, it's not very clear what it does. Hopefully the comments help.
In the worst-case scenario, if, say, this falls apart in the future, the whole SIMD machinery can simply be removed, and the SIMD CodeGenContext can fall back to returning a goto CodeGenContext.

Although I do think that this whole SIMD code could probably be split out into its own package. I can see how having SIMD byte matching could be generally useful. It is essentially like https://docs.rs/memchr/2.3.4/memchr/, except this generalizes to any collection of bytes. But that's a PR for another time :)
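The generalized-memchr idea can be sketched in a few lines. A minimal scalar sketch of the semantics only (in Python for illustration; the actual SIMD version, as in ScanByte, checks 16 or 32 bytes per instruction rather than one at a time, and `memchr_set` is a made-up name, not an Automa or ScanByte API):

```python
# Scalar sketch of generalized memchr semantics: find the first index
# whose byte belongs to an arbitrary set of bytes. A SIMD implementation
# performs the same membership test on a whole vector of bytes at once;
# the 256-entry table below mirrors that lookup idea in plain Python.

def memchr_set(data: bytes, byteset: bytes):
    """Return the index of the first byte of `data` in `byteset`, or None."""
    table = bytearray(256)          # membership table over all byte values
    for b in byteset:
        table[b] = 1
    for i, b in enumerate(data):
        if table[b]:
            return i
    return None

# E.g. scanning for a line break in sequence data:
print(memchr_set(b"ACGT\nACGT", b"\r\n"))  # -> 4
```

Classic memchr is the special case where the byte set has exactly one element.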

@jakobnissen jakobnissen merged commit 8f90ba4 into BioJulia:master Mar 20, 2021
@jakobnissen (Member, Author)

Okay, I've been running this branch in my daily work for a few months now, refactored it to use ScanByte, and added non-SIMD fallbacks. So merging now.

@jakobnissen jakobnissen deleted the simd branch April 30, 2022 10:59