
SIMD code generator #60

Merged: 15 commits into BioJulia:master on Mar 20, 2021
Conversation

@jakobnissen (Member) commented Nov 4, 2020

Introduction

This PR introduces the :simd code generator. At this time, it is very similar to the :goto code generator. In fact, the only current difference from the :goto generator is that it uses SIMD loop unrolling instead of the manual Automa-generated loop unrolling.

In future PRs, I would like to change the large-scale structure of the code generated by the :simd generator (see issue #53), if it turns out to be possible and performant. But for now, this change only pertains to SIMD loop unrolling.

Benefits

Speed. That's it, really. We can use an "empty" format validator (see #55) on a byte array to most accurately time Automa's parsing, since neither the Machine's attached actions nor file I/O will influence the result.

The times below are median runtimes measured with BenchmarkTools.jl:

6 MiB FASTA with 60 chars per line:
Before: 2.255 ms
After:  1.283 ms

1.3 GiB 4-line FASTQ with around 100 nt/read:
Before: 368.734 ms
After:  132.312 ms

These are, of course, optimal conditions. Even on the extremely simple biofast FASTQ benchmark, processing a file is approximately 2.5x slower than processing a byte vector in RAM. So a doubling of the speed of the actual Automa parsing realistically yields only about a 20 % overall performance increase. Of course, this does mean that we are moving the performance bottleneck elsewhere, so that e.g. TranscodingStreams now has more room for optimization.
Or maybe it turns out to be more than that. We'll know for certain once all the tests are in place and the behaviour is verified to be correct.
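The ~20 % figure above can be sanity-checked with a quick back-of-the-envelope calculation. A minimal sketch (the 2.5x file-vs-RAM factor and the 2x parsing speedup are the rough numbers quoted above, not new measurements; the function name is made up for illustration):

```python
# Hedged sketch: why a 2x parsing speedup yields only ~20% overall,
# given the biofast observation that processing a file is ~2.5x
# slower than parsing the same bytes in RAM.

def overall_time_saved(parse_time=1.0, file_factor=2.5, parse_speedup=2.0):
    """Fraction of total runtime saved when only parsing gets faster."""
    total_before = parse_time * file_factor       # parsing + I/O and other overhead
    overhead = total_before - parse_time          # the non-parsing part, unchanged
    total_after = parse_time / parse_speedup + overhead
    return 1 - total_after / total_before

print(f"{overall_time_saved():.0%}")  # -> 20%
```

This is just Amdahl's law: only the ~40 % of runtime spent in parsing is sped up, so halving it saves 20 % of the total.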

Drawbacks of using the :simd generator over the :goto generator:

Unfortunately, SIMD code generation comes with a few limitations:

  • getbyte can no longer be a custom function in a :simd generator; it has to be Base.getindex. I don't think anyone ever used that functionality, though.
  • Unlike the :goto generator, the generated code is not pure Base code. That means the "consumer" of the code must have imported or be using Automa, else they will get an UndefVarError.
  • The code generation uses information from LLVM to determine which CPU instructions are available. If a :simd generator is used to generate Julia code on a modern CPU (an x86 CPU less than ~10 years old), but the Julia code is then executed on another CPU (one ~10 years old, an AMD CPU more than ~5 years old, or a RISC CPU), it will likely error catastrophically. This may happen if cached Julia code is used on another computer.

Drawbacks of merging this code

  • Larger code base and more maintenance, obviously.
  • See the simd.jl file: it checks the available instructions, but there is, as far as I know, no truly stable and officially supported way of doing this, so it may break sometime in the future. Since this check happens at import time, even users not using the :simd generator will be affected if it fails on some systems or in the future.
  • Ensuring correct behaviour will be harder, since different CPUs will generate different code. It should not be too hard to test, though, since almost all modern CPUs support all the relevant instruction sets and can therefore test the code generated for every CPU.

To-do list

  • Still fails FASTQ parsing, so there are probably a few kinks to be ironed out
  • Needs way more tests, ideally 100% coverage
  • Too much code duplication between the :simd and :goto generators. This needs to be factored out (or does it? It depends on how much progress I make on reorganizing the :simd codegen down the road)

Closes issue #54

@jakobnissen (Member, Author)

Note for the future: On a FASTA file, the SIMD generator executes 8.7x fewer instructions and 10x fewer branches, while only being 2.5x faster. Part of the reason may be increased cache misses (2.5k -> 4k / 192k), but the major factor is probably the level of instruction-level parallelism. Looking at the assembly, the SIMD loop has serial memory dependencies, preventing superscalar execution. On my computer, manually unrolling the SIMD loop does not make it use different registers, so the serial dependency remains, and unrolling is in fact slower.
It appears the main bottleneck may be that the CPU simply can't parallelize the vector instructions to the same degree.
It might be worth looking into prefetching, though.

@kescobo (Member) commented Dec 18, 2020

> Unlike the :goto generator, the generated code is not pure Base code. That means the "consumer" of the code must have imported or be using Automa, else they will get an UndefVarError.

Might be worth checking how much this increases load time for some consumer package. Automa is pretty light in terms of dependencies, and it sounds like it would be worth it, but it might be nice to have that information at hand.

@jakobnissen (Member, Author)

That's a good idea. The new SIMD.jl dependency has historically been devastating for load times, although this should have improved quite significantly with 1.6, and I don't plan on merging this before then.

@jakobnissen (Member, Author)

Turns out compile times are affected, but not by much. On version 1.5, compiling and running a FASTA reader takes about 2.3 seconds without this PR, and 2.55 seconds with this PR. On 1.6, it's 1.85 seconds without and 1.95 with.

@jakobnissen (Member, Author)

I think this is done now and can be merged. Code coverage is a little misleading; it's actually almost 100%. I have no coverage on the branches that activate on big-endian CPUs and on CPUs without SIMD capabilities, because I don't have those kinds of CPUs. It's just simple if-statements, though, so not much can go wrong.

@jakobnissen jakobnissen changed the title SIMD code generator [WIP, do not merge] SIMD code generator Dec 19, 2020
@kescobo (Member) left a comment


This is not the kind of code I'm fit to review technically. One thing I'll note is that I don't see much associated documentation. There are some useful (I gather) comments throughout, and I recognize that this may not affect the user-facing API, but it seems complicated enough (and helpful enough) that it might be worth spending a bit more time fleshing out the human-readable bits.

Unless you think it would all be self-evident to someone familiar with SIMD operations generally. Basically, what would happen to that code if you were hit by the metaphorical bus? Would it be interpretable?

@jakobnissen (Member, Author)

Okay, now I've added a whole bunch of comments and also mitigated some of the 1.6 regressions. You're right, it's not very clear what it does. Hopefully the comments help.
In the worst-case scenario, if, say, this falls apart in the future, the whole SIMD machinery can simply be removed, and the SIMD CodeGenContext can fall back to returning a goto CodeGenContext.

Although I do think that this whole SIMD code could probably be split out into its own package. I can see how having SIMD byte matching could be generally useful. It is essentially like https://docs.rs/memchr/2.3.4/memchr/, except this generalizes to any collection of bytes. But that's a PR for another time :)
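The generalized-memchr idea can be sketched in a few lines. A minimal scalar sketch of the semantics only (in Python for illustration; the actual SIMD version, as in ScanByte, checks 16 or 32 bytes per instruction rather than one at a time, and `memchr_set` is a made-up name, not an Automa or ScanByte API):

```python
# Scalar sketch of generalized memchr semantics: find the first index
# whose byte belongs to an arbitrary set of bytes. A SIMD implementation
# performs the same membership test on a whole vector of bytes at once;
# the 256-entry table below mirrors that lookup idea in plain Python.

def memchr_set(data: bytes, byteset: bytes):
    """Return the index of the first byte of `data` in `byteset`, or None."""
    table = bytearray(256)          # membership table over all byte values
    for b in byteset:
        table[b] = 1
    for i, b in enumerate(data):
        if table[b]:
            return i
    return None

# E.g. scanning for a line break in sequence data:
print(memchr_set(b"ACGT\nACGT", b"\r\n"))  # -> 4
```

Classic memchr is the special case where the byte set has exactly one element.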

@jakobnissen jakobnissen merged commit 8f90ba4 into BioJulia:master Mar 20, 2021
@jakobnissen (Member, Author)

Okay, I've been running this branch in my daily work for a few months now, refactored it to use ScanByte, and added non-SIMD fallbacks. So merging now.

@jakobnissen jakobnissen deleted the simd branch April 30, 2022 10:59