
VanityGPG is slow #2

Closed
nwalfield opened this issue Sep 29, 2020 · 26 comments

Comments

@nwalfield

An OpenPGP fingerprint is a hash not only over the public key material, but also over the key's metadata. Thus, if you vary the key's creation time, you'll get a different fingerprint. Given a certificate, that should be pretty easy to do: extract the primary key (let pk = cert.primary_key().key().clone()), get the v4 variant (let mut pk = match pk { Key::V4(k) => k, _ => unreachable!() }), modify the creation time (pk.set_creation_time(pk.creation_time() - delta)), and check the resulting fingerprint (pk.fingerprint()). When you find a match, you'll need to generate new self-signatures. See CertBuilder's code for how to do that. Depending on your requirements, you can vary from a few hours in the past to weeks or months in the past. This should speed up the search by multiple orders of magnitude.
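The search loop this suggests might look roughly like the sketch below. This is a self-contained illustration, not sequoia's API: mock_fingerprint stands in for SHA-1 over the serialized key packet, and all names here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real fingerprint (SHA-1 over the serialized key packet);
// DefaultHasher is used only so this sketch is self-contained.
fn mock_fingerprint(key_material: &[u8], creation_time: u32) -> u64 {
    let mut h = DefaultHasher::new();
    key_material.hash(&mut h);
    creation_time.hash(&mut h);
    h.finish()
}

// Walk the creation time backwards one second at a time, and stop at the
// first timestamp whose fingerprint (in hex) ends with `suffix`.
fn search(key_material: &[u8], now: u32, max_delta: u32, suffix: &str) -> Option<u32> {
    (0..max_delta)
        .map(|delta| now - delta)
        .find(|&t| format!("{:016X}", mock_fingerprint(key_material, t)).ends_with(suffix))
}
```

The key point is that only the timestamp changes between tries, so the expensive key generation happens once per key, while each try is just a hash and a suffix check.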

@RedL0tus
Owner

Thank you for the suggestion. If I haven't misunderstood, what you're describing is similar to Scallion's approach. I'm not familiar with either OpenPGP or cryptography, so I took the simplest approach.

I'll try to do that, or if you are willing to help, PRs are extremely welcome.

By the way, I'm so glad to see suggestions like this. To be honest, I started this as a meme project while joking about it with my friends; I never thought I would receive such a useful suggestion. I just want to say thank you again!

@RedL0tus
Owner

I just wrote a working PoC; the updated version should come soon.

@nwalfield
Author

Thank you for the suggestion.

No problem!

If I haven't misunderstood, what you're describing is similar to Scallion's approach.

How scallion works is described here. If I understand correctly, scallion increments the public key's exponent. I don't know how safe that is... generating RSA keys is a tricky business. Letting a cryptographic library generate the key pair, and then varying the creation time seems safer. But, I'm not a cryptographer. Perhaps the approach is sound.

I'm not familiar with either OpenPGP or cryptography, so I took the simplest approach.

No worries!

By the way, I'm so glad to see suggestions like this. To be honest, I started this as a meme project while joking about it with my friends; I never thought I would receive such a useful suggestion. I just want to say thank you again!

:)

RedL0tus added a commit that referenced this issue Oct 3, 2020
Signed-off-by: Kay Lin <i@v2bv.net>
@RedL0tus
Owner

RedL0tus commented Oct 3, 2020

@nwalfield Please check out the 0.3 branch!

On my Jetson Nano I got an average speed of over 850,000 fingerprints per second (it was ~1,000/s before), which is, just as you described, several orders of magnitude faster than my original approach. Thank you!

@nwalfield
Author

Very cool!

A few more ideas for improving performance:

  • You create 100k keys (by default) up front. I'd create keys on demand: in your main loop, have each worker thread create a key, then search that key's search space in a tight inner loop. This should increase cache locality a lot, and avoid the expensive startup time.
  • Avoid locking as much as possible. For instance:
    • Don't push the progress bar for every try, but every 1000 tries, say.
  • Default to ECC, not RSA. RSA keys are longer (more bytes to hash), and much more computationally intensive to generate (lots of sanity checks) than ECC keys (pretty much any 32 byte value is a valid ECC key).
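The first bullet (keys on demand, each searched in a tight inner loop) could be structured roughly like this. All names are hypothetical: generate_key_stub and matches_stub are placeholders for the real key generation and fingerprint check.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Placeholder for expensive key generation.
fn generate_key_stub() -> u64 {
    0xC0FFEE
}

// Placeholder for the fingerprint check; here it matches at the end of
// each key's timestamp space so the sketch terminates.
fn matches_stub(_key: u64, delta: u32, space: u32) -> bool {
    delta == space - 1
}

// Each worker generates its own key on demand, then scans that key's
// timestamp space in a tight inner loop, so the hot data stays in cache.
// Returns the total number of tries across all workers.
fn run_workers(n_workers: usize, space_per_key: u32) -> u64 {
    let found = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let found = Arc::clone(&found);
            thread::spawn(move || {
                let mut tries = 0u64;
                while !found.load(Ordering::Relaxed) {
                    let key = generate_key_stub(); // expensive, per worker
                    for delta in 0..space_per_key { // cheap inner loop
                        tries += 1;
                        if matches_stub(key, delta, space_per_key) {
                            found.store(true, Ordering::Relaxed);
                            break;
                        }
                    }
                }
                tries
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Because each worker owns its key and its inner loop, no locking is needed until a match is found.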

Some general comments:

  • When saving a key, only generate the primary key, not the general-purpose key. The subkeys don't change the fingerprint, and the user can add them later by importing the key into, say, gpg and adding the subkeys there (--quick-add-key).
  • Allow the user to limit the search space. A vanity key that was created in 1992 is not that interesting. By default, I'd search up to say a month in the past. That gives you over 2.5 million possibilities, and probably amortizes the key creation cost.

FWIW, on my Xeon E3-1275 v6 (4 cores, 8 threads), I see about 800k tries per second using the following configuration: --pattern 'BADCAFE$' --pool_size 10000 -c cv25519. Interestingly, the "tries per second" reading seems to increase over time. That's a bit strange.

@nwalfield
Author

I just realized I didn't compile with --release. With that, I get about 8M tries per second.

@RedL0tus
Owner

RedL0tus commented Oct 3, 2020

Very cool!

Thanks!

A few more ideas for improving performance:

  • You create 100k keys (by default) up front. I'd create keys on demand: in your main loop, have each worker thread create a key, then search that key's search space in a tight inner loop. This should increase cache locality a lot, and avoid the expensive startup time.

I cannot find a proper way to do that. The main loop is controlled by Rayon for parallel processing and I cannot come up with a clean enough solution to do that.

  • Avoid locking as much as possible. For instance:

    • Don't push the progress bar for every try, but every 1000 tries, say.

The progress bar is controlled by a separate thread, and the counter uses atomic operations, so I don't think this is a big problem...

  • Default to ECC, not RSA. RSA keys are longer (more bytes to hash), and much more computationally intensive to generate (lots of sanity checks) than ECC keys (pretty much any 32 byte value is a valid ECC key).

Will do. I always use Ed25519 personally, but GPG defaults to RSA, so I just used RSA here.

Some general comments:

  • When saving a key, only generate the primary key, not the general-purpose key. The subkeys don't change the fingerprint, and the user can add them later by importing the key into, say, gpg and adding the subkeys there (--quick-add-key).

I chose to generate the full general-purpose key because some older versions of GPG that do not support adding ECC subkeys are still in use. That function is also executed only when a match has been found. Maybe I can add an option to toggle subkey generation.

  • Allow the user to limit the search space. A vanity key that was created in 1992 is not that interesting. By default, I'd search up to say a month in the past. That gives you over 2.5 million possibilities, and probably amortizes the key creation cost.

Will do.

FWIW, on my Xeon E3-1275 v6 (4 cores, 8 threads), I see about 800k tries per second using the following configuration: --pattern 'BADCAFE$' --pool_size 10000 -c cv25519. Interestingly, the "tries per second" reading seems to increase over time. That's a bit strange.

That's probably because the status thread was spawned before initialization finished, and the "tries per second" reading is simply the total number of fingerprints divided by the number of seconds since the thread was spawned. I'll fix that.

RedL0tus added a commit that referenced this issue Oct 3, 2020
RedL0tus added a commit that referenced this issue Oct 3, 2020
And list possible values.

Signed-off-by: Kay Lin <i@v2bv.net>
nwalfield pushed a commit to nwalfield/VanityGPG that referenced this issue Oct 3, 2020
@nwalfield
Author

Take a look at https://github.com/nwalfield/VanityGPG/tree/ideas for some ideas of how to speed it up in the way that I discussed.

@nwalfield
Author

The progress bar is controlled by a separate thread, and the counter uses atomic operations, so I don't think this is a big problem...

Atomic operations create a memory fence and lock the cache line. They are better than a lock, but they are still expensive, especially when contended.
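One common mitigation is to batch the updates: keep a thread-local counter and only flush it into the shared atomic every N tries. A minimal sketch (names are illustrative, not VanityGPG's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Flush the thread-local counter into the shared atomic only every
// BATCH tries, instead of one fetch_add per try.
const BATCH: u64 = 1_000;

fn count_tries(total: &AtomicU64, tries: u64) {
    let mut local = 0u64;
    for _ in 0..tries {
        local += 1; // in the real loop, one fingerprint try happens here
        if local == BATCH {
            total.fetch_add(local, Ordering::Relaxed);
            local = 0;
        }
    }
    total.fetch_add(local, Ordering::Relaxed); // flush the remainder
}
```

The progress display becomes slightly less fresh (up to BATCH tries behind per thread), but the contended atomic is touched a thousand times less often.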

RedL0tus added a commit that referenced this issue Oct 3, 2020
RedL0tus added a commit that referenced this issue Oct 3, 2020
RedL0tus added a commit that referenced this issue Oct 3, 2020
Signed-off-by: Kay Lin <i@v2bv.net>
@RedL0tus
Owner

RedL0tus commented Oct 3, 2020

Now it shows a reading of over 2 million tries per second on my Jetson Nano 🤣

@nwalfield
Author

Now it shows a reading of over 2 million tries per second on my Jetson Nano rofl

Nice! I suspect that most people will want to choose a short key id. It might make sense to provide matching against a fixed suffix (perhaps even a list taken either from the command line or a file) in addition to matching a regex.

@RedL0tus
Owner

RedL0tus commented Oct 4, 2020 via email

@nwalfield
Author

I think it would be interesting to benchmark it. When I replaced the regex with a fixed string comparison, I saw a 5-10% improvement.

@nwalfield
Author

I bet you could get a factor of 4-5 improvement in throughput if you computed the fingerprint manually and adjusted the timestamp in place.

I'd serialize the key into a buffer:

        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;

To compute the hash, you could use the sha1 crate.

@nwalfield
Author

FWIW, there is also https://crates.io/crates/sha-1 :D

@RedL0tus
Owner

RedL0tus commented Oct 5, 2020

I think it would be interesting to benchmark it. When I replaced the regex with a fixed string comparison, I saw a 5-10% improvement.

Of course it would be faster with ends_with if I only have one rule, but that lacks flexibility, and if we make it accept multiple rules, it will get messy and the efficiency may not be ideal. One way of improving this may be to use a SIMD-accelerated regex implementation, and it seems the Rust team has been working on this for a while.
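For comparison, a multi-suffix check without regex is short; this is an illustrative sketch (the function name is hypothetical), and whether it beats a compiled regex for many suffixes is exactly what a benchmark would settle:

```rust
// Check a fingerprint against a list of fixed suffixes with ends_with;
// an alternative to compiling all the rules into one regex.
fn matches_any_suffix(fingerprint: &str, suffixes: &[&str]) -> bool {
    suffixes.iter().any(|s| fingerprint.ends_with(s))
}
```

This scans the suffix list linearly per try, so its cost grows with the number of rules, whereas a compiled regex or aho-corasick-style automaton pays a roughly constant cost per try.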

I bet you could get a factor of 4-5 improvement in throughput if you computed the fingerprint manually and adjusted the timestamp in place.

I'd serialize the key into a buffer:

        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;

To compute the hash, you could use the sha1 crate.

Check out the newest commit on the 0.3 branch; I introduced a new rPGP backend. It is implemented purely in Rust (so libclang is no longer needed) and uses the strategy you mentioned here. I know sequoia may be the better choice overall, but it is way too complicated, and one of my servers runs FreeBSD, which does not offer libclang by default.

Edit: The performance only doubled on my Jetson Nano, though, and I don't know why.

@nwalfield
Author

but it is way too complicated

Could you provide some concrete feedback about that? sequoia-openpgp provides a low-level API, but we've tried to make it as easy to use as possible with lots of documentation and examples. What were the issues that you ran into? What did you find to be too complicated?

@RedL0tus
Owner

RedL0tus commented Oct 5, 2020

but it is way too complicated

Could you provide some concrete feedback about that? sequoia-openpgp provides a low-level API, but we've tried to make it as easy to use as possible with lots of documentation and examples. What were the issues that you ran into? What did you find to be too complicated?

Hmmm... Let me clarify. The sequoia-openpgp crate itself is user-friendly enough and full of features, but for this project I only need the key generation functionality. As I mentioned above, nettle-sys requires libclang to be present on the system, which can be a pain in certain situations (e.g. FreeBSD does not offer a binary for it, and compiling LLVM+clang is kind of painful).

Don't get me wrong, it's not that I don't like sequoia; it's just heavier than ideal for this project. If I start any new project that needs anything related to OpenPGP, sequoia is definitely my first choice. It is easier to use and may be faster than rPGP.

Back to VanityGPG: I will maintain these two backends in parallel, since rPGP lacks a security audit and does not support the NIST ECC curves, while sequoia requires FFI. The two backends are feature-gated and can be selected at compile time.

@nwalfield
Author

Thanks for the feedback. To be clear, I'm not offended if you don't like something about Sequoia. In fact, I'd appreciate any kind of feedback.

I know there is a set of low-level APIs, but there's barely any documentation for them, and sequoia is a large project.

What APIs were you thinking about here?

FWIW, sequoia should be in the FreeBSD ports tree these days. I'm not a FreeBSD user, so I don't know how one would go about using it.

@RedL0tus
Owner

RedL0tus commented Oct 5, 2020

Thanks for the feedback. To be clear, I'm not offended, if you don't like something about Sequoia. In fact, I'd appreciate any type of feedback.

I hope sequoia can get better and better.

Actually, I have another small question. I see there is a struct named KeyBlueprint, but it is private, and the actual generation work is left to the CertBuilder struct's impl blocks. Would it be a good idea to move the key generation part to KeyBlueprint and make it public?

Also, is it a good idea to make the generate method of CipherSuite public?

In VanityGPG, I somehow have to recreate these two methods since they are private.

I know there is a set of low-level APIs, but there's barely any documentation for them, and sequoia is a large project.

What APIs were you thinking about here?

Actually, no... I got confused back then; sequoia's documentation is good. It seems I messed up my memory. Sorry about that.

FWIW, sequoia should be in the FreeBSD ports tree these days. I'm not a FreeBSD user, so I don't know how one would go about using it.

Yes, it's in the FreeBSD ports, but as you know, Rust does not have a stable ABI, and installing nettle-sys from cargo requires libclang, so...

Actually, the people to blame here are the FreeBSD maintainers. According to a message on the mailing list, they don't do binary builds of libclang because they think building that .so would consume too many resources, and I think that's kind of ridiculous.

@RedL0tus
Owner

RedL0tus commented Oct 6, 2020

I bet you could get a factor of 4-5 improvement in throughput if you computed the fingerprint manually and adjusted the timestamp in place.

I'd serialize the key into a buffer:

        let pk = Key::V4(self.primary_key.clone().take_secret().0);
        // Serialize them leaving space for the header.
        let mut data = [ 0u8; 2048 ];
        let len = pk.serialize_into(&mut data[3..]).unwrap();
        let data = &mut data[..len + 3];
        // Add the header: 0x99, big endian 16-bit length.
        data[0] = 0x99;
        data[1] = ((len >> 8) & 0xFF) as u8;
        data[2] = (len & 0xFF) as u8;

To compute the hash, you could use the sha1 crate.

I've just pushed a commit to let the sequoia backend use this strategy.

@RedL0tus
Owner

RedL0tus commented Oct 6, 2020

Do you think this issue can be closed now?

@nwalfield
Author

Looking good!

FWIW, my hacked up version was at 29.7 Mhashes/sec on my computer. This version is at 21.0 Mhashes/sec.

These are the things that could yield some more improvement, I suspect:

  • Use Nettle's SHA-1 instead of the sha1 crate; +1 Mhashes/sec for me.
  • Avoid allocating on the heap. Try changing Backend::fingerprint to place the fingerprint into a buffer passed by &mut instead of allocating a String on the heap.
  • https://crates.io/crates/faster-hex claims to be 10x faster than the hex crate (although it encodes the data as lower case).
  • Try and benchmark the suffix list vs. regex approach :)
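The second bullet (fingerprint into a caller-provided buffer) might look like this std-only sketch; the function name is illustrative, not VanityGPG's actual Backend API:

```rust
// Hex digits for upper-case output.
const HEX: &[u8; 16] = b"0123456789ABCDEF";

// Encode a 20-byte SHA-1 digest into a caller-provided 40-byte buffer,
// avoiding a heap-allocated String on every try.
fn fingerprint_hex(digest: &[u8; 20], out: &mut [u8; 40]) {
    for (i, &b) in digest.iter().enumerate() {
        out[2 * i] = HEX[(b >> 4) as usize];
        out[2 * i + 1] = HEX[(b & 0x0F) as usize];
    }
}
```

The caller can keep the 40-byte buffer on the stack and reuse it across tries, so the hot loop does no allocation at all.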

RedL0tus added a commit that referenced this issue Nov 21, 2020
Since the performance of the program has increased
dramatically (thanks to #2), the reporting rate can be further
reduced to improve performance a little (ideally).

Signed-off-by: Kay Lin <i@v2bv.net>
@RedL0tus
Owner

RedL0tus commented Nov 21, 2020

Sorry, I was kind of busy in the last month and did not reply in time.

I've just switched to nettle for the sequoia backend. The rpgp one still uses SHA-1 as it is intended to be pure-rust.

For the second part, I'm actually not sure about how to improve on that, because we need the string anyway. Do you mean to pass a &mut [u8] to the main program and encode the string there?

As for faster-hex, I'm still hesitant to switch to it, because it looks like it is no longer maintained.

As for regex, I wrote a simple "benchmark", and the result is that, when there is a relatively large number of suffixes, regex can be a lot faster than using starts_with.

My "benchmark" code: https://termbin.com/6tnlv

And the result on my Ryzen 3 2200G is:

>>> Result: regex_average=3928807.5, startswith_average=9898233

The unit is microseconds.

Another advantage of regex is that, it is just easier to write complex search patterns.

@RedL0tus
Owner

@nwalfield I've implemented custom binary-to-hex conversion with AVX2 and SSE4.1, based on code from faster-hex. Because it's only used for converting the SHA-1 digest to hex and the data length is fixed, I made some extra "optimizations". This change almost doubled the hashrate on my 3700X, achieving over 100,000,000 hash/s with 16 threads.

Also, I've added a feature to use jemalloc as the default allocator, which seems to give an improvement of around 1M hash/s.

I've just gathered some perf data and generated this flamegraph:

[flamegraph image]

So... it seems there's nothing more that can be easily optimized. And I guess it may be time to consider closing this issue...?

@nwalfield
Copy link
Author

Sounds cool! I haven't taken a look at the changes, but they make sense.
