cipher: block cipher trait inefficiencies #332
Comments
Yes, @newpavlov and I have discussed this in the past and generally agree with you, I think. Among other things, it would be more efficient on modern Intel architectures to process blocks 4-at-a-time. See also: #289. One general reason for requiring implementations to specify the granularity of parallelism is it may be hardcoded directly into the implementation. This is particularly true of AVX2/AVX512 implementations, especially when combined with a universal hash function.
I think there still needs to be some sort of associated type/constant advertising the ideal buffer size, i.e. the number of blocks to act on in parallel, but there could be a simple method with a default implementation which accepts an arbitrary number of blocks and maps them onto the underlying parallel buffer processing.
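The comment above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `cipher` crate API: the trait, method names, and the `XorCipher` toy are all invented for the example, and the "cipher" is just an XOR so the default method can be exercised without pulling in a real AES implementation.

```rust
/// Hypothetical sketch of a block-cipher trait with a default method that
/// maps an arbitrary slice of blocks onto a fixed-width parallel primitive.
type Block = [u8; 16];

trait BlockCipherSketch {
    /// Width of the underlying parallel implementation
    /// (e.g. 8 for bitsliced aes-soft or pipelined aesni).
    const PAR_BLOCKS: usize;

    /// Encrypt exactly `Self::PAR_BLOCKS` blocks at once.
    fn encrypt_par_blocks(&self, blocks: &mut [Block]);

    /// Encrypt a single block.
    fn encrypt_block(&self, block: &mut Block);

    /// Default implementation: split an arbitrary slice into full-width
    /// chunks, then handle the remainder one block at a time. Individual
    /// implementations could override this with something smarter.
    fn encrypt_blocks(&self, blocks: &mut [Block]) {
        let mut chunks = blocks.chunks_exact_mut(Self::PAR_BLOCKS);
        for chunk in &mut chunks {
            self.encrypt_par_blocks(chunk);
        }
        for block in chunks.into_remainder() {
            self.encrypt_block(block);
        }
    }
}

/// Toy "cipher" (XOR with a constant) purely to exercise the default method.
struct XorCipher(u8);

impl BlockCipherSketch for XorCipher {
    const PAR_BLOCKS: usize = 8;

    fn encrypt_par_blocks(&self, blocks: &mut [Block]) {
        for b in blocks {
            for byte in b.iter_mut() {
                *byte ^= self.0;
            }
        }
    }

    fn encrypt_block(&self, block: &mut Block) {
        for byte in block.iter_mut() {
            *byte ^= self.0;
        }
    }
}
```

With this shape, a caller holding 11 blocks gets one 8-wide call plus three serial calls without having to know `PAR_BLOCKS` at all.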
Partially addresses #332. Adds methods which work over an arbitrarily sized slice of blocks to encrypt/decrypt. Provides a default implementation, but can be potentially further optimized by individual implementations.
@jack-signal take a look at #351 and see if it addresses some of your concerns.
#354 contains some explorations of how to abstract over parallelism in a way that encrypting a slice of blocks can be (blanket) impl'd both efficiently and automatically, while eliminating it as a higher-level API concern.
I think the main concerns brought up in this issue have been addressed in the current v0.2 releases. I'm going to close this issue, noting that there's ongoing discussion of efficiency improvements in #444.
Followup to RustCrypto#255. This fixes a clippy error and bumps the version to v0.11.0-pre as RustCrypto#255 contained a breaking change.
tl;dr The block cipher trait exposes parallelism (this is good!). The block cipher trait only exposes parallelism for specific widths (this is bad!)
The aes-soft and aesni crates both implement an 8x-wide AES operation, due to bitslicing and pipelining respectively. However, the interface mandates that you batch process in exactly the width of the underlying implementation. This means that if you have, say, 4 or 6 blocks which could be processed at once, your options are to either process the blocks serially, or to set up some extra dummy blocks which are encrypted and then thrown away. It turns out, at least for aes-soft on my machine, that always processing 8 blocks with dummy inputs is faster, even when processing just 2 real blocks.
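The dummy-block workaround described above looks roughly like this. The names here are invented for illustration; `encrypt8` is a toy XOR stand-in for a real 8-wide primitive such as aes-soft's bitsliced kernel.

```rust
type Block = [u8; 16];

/// Stand-in for a fixed-width 8x kernel (toy XOR "cipher" for illustration).
fn encrypt8(key: u8, blocks: &mut [Block; 8]) {
    for b in blocks.iter_mut() {
        for byte in b.iter_mut() {
            *byte ^= key;
        }
    }
}

/// Encrypt up to 8 blocks by padding with throwaway dummy blocks so the
/// fixed-width call can be used, then discarding the dummy outputs.
fn encrypt_padded(key: u8, blocks: &mut [Block]) {
    assert!(blocks.len() <= 8);
    let mut buf = [[0u8; 16]; 8]; // dummy blocks fill the tail
    buf[..blocks.len()].copy_from_slice(blocks);
    encrypt8(key, &mut buf); // one fixed-width call
    blocks.copy_from_slice(&buf[..blocks.len()]); // dummy outputs thrown away
}
```

The wasted work is the encryption of the dummy tail, which is why it is surprising (but plausible, given bitslicing's fixed per-batch cost) that this still beats serial processing for as few as 2 real blocks.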
This design also restricts the possible implementation approaches. There is no reason that, for example, AES-NI couldn't have several loops: one unrolled 8x, another 2x, followed by a final 1x pass to handle a trailing block. But since callers will only ever provide input in multiples of 8 (or else between 1 and 7 serial blocks), there is no possibility for intermediate unrolling.
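The intermediate-unrolling dispatch described above could be sketched like this. Everything here is hypothetical: `encrypt_n` is a toy XOR stand-in for width-N unrolled kernels, and the point is only the 8x / 2x / 1x dispatch structure an implementation would be free to choose if the trait accepted arbitrary slices.

```rust
type Block = [u8; 16];

/// Stand-in for a width-N unrolled kernel (toy XOR "cipher").
fn encrypt_n<const N: usize>(key: u8, blocks: &mut [Block]) {
    debug_assert_eq!(blocks.len(), N);
    for b in blocks {
        for byte in b.iter_mut() {
            *byte ^= key;
        }
    }
}

/// Dispatch over an arbitrary number of blocks: 8-wide while possible,
/// then 2-wide, then a single trailing block.
fn encrypt_blocks(key: u8, blocks: &mut [Block]) {
    let mut i = 0;
    while blocks.len() - i >= 8 {
        encrypt_n::<8>(key, &mut blocks[i..i + 8]);
        i += 8;
    }
    while blocks.len() - i >= 2 {
        encrypt_n::<2>(key, &mut blocks[i..i + 2]);
        i += 2;
    }
    if i < blocks.len() {
        encrypt_n::<1>(key, &mut blocks[i..]);
    }
}
```

For 13 blocks this runs one 8x call, two 2x calls, and one 1x call, rather than forcing the caller to choose between padding to 16 or falling back to 13 serial blocks.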
In theory everyone could just perform this batching in the higher-level crates which use this trait. In practice, effectively nobody will, and as a result everything built on the block cipher traits is less efficient than it could be. The trait should instead accept any number of blocks and process them as efficiently as possible, while advertising its preferred parallelism so that higher-level algorithms can tune their buffer sizes properly.