# Comparative Performance Review of Most of the SHA-3 Second-Round Candidates

Thomas Pornin, <thomas.pornin@cryptolog.com>

May 7, 2010

#### Abstract

We reimplemented several round-2 SHA-3 candidates, in both C and Java. Implementations were made from scratch, not reusing the code provided with the NIST submission packages. Similar optimization techniques and efforts were applied to each function; the resulting performance differences should then be considered as intrinsic to the functions themselves. We benchmarked our code on a variety of platforms, including embedded low-power platforms; we present here our measures, and what conclusions can be inferred on what makes a hash function fast.<sup>1</sup>

# 1 Introduction

For the SHA-3 competition[1], NIST defined a "Reference Platform and Compiler", namely a PC with an "Intel Core 2 Duo processor", clocked at 2.4 GHz, running the Windows operating system and using Microsoft Visual Studio 2005 Professional Edition as compiler. The NIST notice introducing the competition also states that "the public" is encouraged to performs efficiency analysis on "other platforms".

Quite predicibly, most submitters concentrated on performance on PC systems, for a variety of reasons:

- The NIST had plainly announced that they would run tests and an efficiency analysis on their "reference platform".
- Cryptographers develop and optimize for what they can test on, and desktop systems are, by far, the easiest to use, and cheapest, development platform nowadays.
- Modern x86-compatible processors tend to have some advanced features (SSE2 unit with 128-bit registers, hard-coded AES implementation for the most recent versions) which appear to have a high potential for cryptographic applications.
- The PC is the *visible* computer. Few people are fully aware that most computing is now done on *hidden*, embedded systems, which have percolated by dozens within the most mundane-looking appliances. Embedded systems have a huge industrial significance which is widely underestimated.

Therefore, we feared that high performance on the reference platform could cloud away implementation issues on other systems, resulting in a seemingly fast but actually suboptimal function selected as SHA-3. We also feared that some candidates could be unfairly dismissed as slow, despite their cryptographic merits, if their designers could not come up with decently optimized implementations: optimization is a difficult craft and a distinct skill set from what good cryptographers master.

We thus wanted to have a fair comparison between hash functions, on a variety of software platform types, with an emphasis on the non-PC-with-x86-assembly architectures. In order to smooth out discrepancies between implementer skills, we decided that all functions ought to be implemented with similar efforts and

<sup>&</sup>lt;sup>1</sup>This work was partially supported by the French Agence Nationale de la Recherche through the SAPHIR2 project under Contract ANR-08-VERS-014.

techniques by the *same* developer. Hopefully, resulting speed differences should then be representative of the intrinsic characteristics of the hash functions themselves.

This approach complements the eBASH[2] measures: our implementations focus on more portable code (to try to abstract away special architecture characteristics such as SIMD instructions), we benchmarked them on embedded systems which are not covered by eBASH (mostly due to automation issues; it is hard to run automatic benchmarks on a pocket calculator), and we added the "unique developer" rule.

Our work is the continuation of the development of sphlib[3], an open-source library of hash function implementations, in C and Java. sphlib was written by the same developer along the same lines, and included many pre-SHA-3 hash functions, including MD5, SHA-1, the SHA-2 family and Whirlpool. We thus added implementations for some of the SHA-3 "round 2" candidates; currently, 9 out of the 14 candidates have been implemented. This is an ongoing work and the five remaining candidates shall be added within the next months. sphlib-2.0, the version which we describe here, shall be released for downloads shortly, on the www.crypto-hash.fr Web site.

In the following sections, we describe our target architectures and development rules. We also define what we measure and how we measure it. We then give our measures, both numerically and as graphical charts. Finally, we describe what these numbers tell us, with regards to what impacts hash function performance.

# 2 Target Architectures

## 2.1 Large Desktop Systems

A large desktop system is about any visible computer. This covers personal computers and workstations, but also big servers, laptops, netbooks and even smartphones. For our purposes, a large desktop system is any system with a 32-bit or more CPU, with a superscalar architecture (the hardware can execute several instructions simultaneously) and a large level-1 cache for instructions (32 kB or more). We will see later on (section 4.1) that level-1 cache size turns out to be the most important factor for performance.

As representative of that category, we used the two following systems:

- A PC with an Intel Core2 Quad Q6600 processor, clocked at 2.4 GHz ("Kentsfield" core). This is similar to the NIST reference platform<sup>2</sup>. RAM size and operating system are not relevant for our tests (everything important happens in level-1 cache anyway). The system runs in 64-bit mode (known as "x86-64", "AMD64", "x64", "Intel 64", "EM64T" and a few other names). The C compiler is GCC[4], version 4.4.1.
- The very same PC, but this time used in 32-bit mode. Registers have size 32 bits instead of 64, and there are fewer of them. The C compiler is still GCC-4.4.1.

Those two platforms shall be representative of large desktop systems because our test code is a portable C implementation, which does not use the specific features of the Intel architecture (MMX, SSE2...). Thus, they should behave similarly to other, large architectures such as the XScale and Cortex ARM-compatible processors which are often found in smartphones. On "big" processors, the circuitry responsible for decoding opcodes into elementary instructions (which can then be reordered and executed concurrently or even speculatively) happens to be a very small part of the total CPU size. This is why RISC instruction sets did not supplant the older CISC sets such as x86: there is no significant gain anymore in a reduced instruction set, at least on the "large" processors that we consider here.

# 2.2 Embedded Systems

We consider here systems with a "small" 32-bit processor. The CPU still offers 32-bit registers, but with a reduced instruction set and no superscalar ability. The level-1 cache is much smaller, typically 8 to 16 kB.

<sup>&</sup>lt;sup>2</sup>Actually, the NIST reference platform is described a bit vaguely, since "Core 2 Duo" covers a wide range of processors from Intel, with distinct timing characteristics.

The clock rate is also much lower, counted in hundreds or even dozens of megahertz, not gigahertz. Of particular interest to us are:

- WiFi access points and routers. These are small, cheap appliances which must nevertheless route high-bandwidth data all day long. Routers are ideally located to run VPN software.
- Portable media players (MP3, small videos...). Cryptography, in particular signatures, is used to enforce digital rights. Such systems must be small and light, which implies drastic constraints on available CPU power (the battery is small, so the CPU is underclocked).
- Mobile phones (not the bigger so-called smartphones). Most can be used as media players, and there is a big market for data and applications running on those platforms. There again, DRM requires efficient cryptography. Also, mobile phones are inherently networked, and this implies cryptography as well (e.g. SSL connections).
- Payment terminals. These are the portable terminals which handle card-based transactions in many countries. Internally, those systems are similar to glorified smartcards, with tamper-resistant CPU and RAM. The physical and electromagnetic armor prevents these systems from using the big desktop processors, notably because of cooling issues. Processors are clocked at no more than 30 MHz or so. Yet those systems are in front line in valuable security systems and must rely on robust cryptography. Hardware Security Modules (HSM) are similar in that respect: despite their high price, their tamper-resistance implies drastically low CPU performance, with simple cores.

In order to represent this category, we used the two following architectures:

- A Linksys WRT54GS router, refurbished with the OpenWrt operating system[5] (a Linux clone for embedded systems). The core CPU is a Broadcom BCM3302, a MIPS-compatible 32-bit processor clocked at 200 MHz. The level-1 cache size is 8 kB for code, and another 8 kB for instructions. The C compiler is GCC, version 4.2.4.
- A Hewlett-Packard HP-50g scientific calculator. This system uses an ARM920T core (ARMv4 architecture), clocked at 12 MHz, but which can be programmatically boosted to 75 MHz (which we did for our tests). This system can be programmed in C (with a much reduced, non-standard C library), using HPGCC[6], a derivative from GCC 4.1.1.

ARM and MIPS architectures together cover the quasi-totality of 32-bit embedded systems nowadays. Our two systems should then be representative of this category of platforms.

#### 2.3 Virtual Machines

A newer trend in computing is the use of virtual machines. Instead of compiling code for an existing, physical family of processors, a virtual CPU is defined, with its own instruction set (usually called bytecode). The compiled program is executed through a Virtual Machine. The VM interprets the bytecode, and uses JIT ("Just In Time") compilation techniques to dynamically convert the bytecode of the most used code parts into native instructions for the host CPU.

Virtual Machines are not a new concept, but widespread industrial application really began with the Java programming language, in the mid-90's. VM have a number of worthwhile characteristics:

- The virtual CPU has the same characteristics "everywhere", such as the size of integer types; representation details such as endianness are abstracted away. This provides a great deal of portability.
- Since there is a single bytecode format, software distribution is greatly eased.
- Bytecode allows a priori verification of the code with regards to data types. The VM can enforce boundary checks upon array accesses. Program safety is greatly enhanced, to the point that potentially hostile code can be run in a sandbox, which supports the Applet model.

• The VM model (theoretically) allows for some performance improvements: the JIT compiler can optimize code for the specific CPU on which it happens to run, and the type safety allows for the more efficient types of garbage collection algorithms. In practice, these improvements compensate some of the overhead implied by the interpreter and JIT compilation process.

VM are extremely popular, and VM-based languages are said to be more used in servers than classical compiled languages. VM provide an API to interact with "native code" (e.g. compiled C code), but native code nullifies the advantages of VM which we described above. It thus makes sense to study the performance of hash functions when implemented in bytecode.

We use the following two architectures:

- Our PC (Intel Core 2 Q6600, 2.4 GHz, 64-bit mode), running the Java VM edited by Sun Microsystems (now Oracle), version 1.6.0 20[7].
- The same PC, in 32-bit mode, with the 32-bit version of the Java VM from Sun.

The Java VM is the most widely used VM. Its more recent competitor (the .NET VM from Microsoft, coupled with the C# programming language) should provide similar runtime characteristics. As will be explained in section 4, the VM and JIT compiler have their own specific overhead, which impacts hash function performance in various ways, which depend on the hash function itself.

# 2.4 Compilation Options

All the C compilers for our test architectures are variants of GCC. Optimization in GCC is triggered with the -On command line flags, where n is a digit from 1 to 9. Theoretically, a higher n implies better performance (and longer compilation time), but in practice things are not that clear. Code optimization relies on a number of heuristics, which are not necessarily valid for all codes, and especially not for hash function implementations which tend to rely on heavy loop unrolling and a plethora of local variables.

In practice, the -01 compilation flag often offers better performance than -02, sometimes dramatically so (e.g. +50% speed with SHA-1). Thus, the default compilation flags for sphlib are set, in the build script, to "-01 -fomit-frame-pointer". Nevertheless, for some hash functions, -02 does a better job than -01, so we also compiled all functions with "-02 -fomit-frame-pointer" and we kept the best of the two measured performances. This was done individually for each function and each hash output length.

Another tunable option of sphlib is the selection of the "small footprint" implementations. These are activated by defining the SPH\_SMALL\_FOOTPRINT macro. Small footprint implementations are optimized to fit within the 8 kB of level-1 cache of our MIPS-based test architecture; they offer much better performance on that system, but worse performance on large desktop systems. There again, for each platform, each optimization level, each function and each hash output length, we tried the "normal" and the "small footprint" implementations, and kept the best measure among the two.

An additional parameter is found on the ARM platform: the ARMv4 architecture supports two instruction sets, the "normal" set (inherited from older ARM versions) and the "thumb" mode. The thumb mode features simpler instructions, and some of the registers are not available through thumb mode; but instructions are also twice smaller, which makes the code more compact and helps it fit within the level-1 cache. Normal and thumb code can be freely mixed within a given application (the CPU can switch conventions in a single clock cycle, when calling a function). Hence, we doubled our tests, once in normal mode and once in thumb mode.

For the Java architectures, the VM does not offer many tunable options. We ran our tests with the -server command-line option, which should encourage the JIT compiler to think harder. Our test procedure includes a bit of "warm-up" so that measures are taken after the JIT compiler has acted.

### 2.5 Out of Scope Architectures

Our work does not cover the following kinds of architectures, which are nonetheless important with regards to hash function performance:

- 8-bit and 16-bit architectures. These are found mostly in smartcards. They are usually programmed in assembly, not C, and their characteristics vary quite widely. A comparison effort such as the one we describe here would require the use of many test platforms, each with its own development and optimization effort.
- FPGA and ASIC. Hardware has its own rules, in particular parallelism, and circuit size is at least as important as resulting bandwidth. sphlib is a *software* library.
- GPU. The most recent graphics accelerator, normally used for rendering 3D images and animations, actually use arrays of processors in SIMD mode (single instruction, multiple data), which can be programmed "generically". The market for GPU is still too new to define "representative" systems. Also, GPU programming has its own rules, with languages which are similar, but not identical to C. We did not investigate GPU implementation of hash functions (yet).

Also, we have not devoted any effort to the production of compact implementations. We always targetted the size of the level-1 cache for instructions. In practical situations (as opposed to ideal benchmark conditions), the hash function code must share the caches with the other elements in the applicative data processing path. Very small implementations (e.g. less than 1 kB of compiled code) are also very important in some specific situations where overall code size is constrained (for instance, in a boot ROM). sphlib implementations do not cover that kind of optimization, although the cache size issues illustrate how hash function performance is impacted by constraints on code size.

# 3 Measures

For each function, the hash function performance was measured on a *long message*. Basically, this means that we hash a single message of length bn bytes, where b is the block size and n is the number of blocks. b is equal to 8192 (this is the "traditional" buffer size for I/O and it fits within the level-1 cache for data on all our test architectures). n is dynamically adjusted so that the processing time exceeds two seconds. Elapsed time is measured with the clock() function on an otherwise idle machine (System.currentTimeMillis() in Java). Achieved precision is about 2% (in the tables below we give more digits, but only the first two or three are really significant).

Before that measure, the function is "warmed-up" by computing many digests on messages of various lengths. The point is to allow every part of the hardware and software to reach its top speed (cache memory, jump prediction tables in the processor, JIT compilation).

This measure does not take into account the initialization and finalization times. Although sphlib code tries not to do anything stupid, initialization and finalization procedures have not been optimized with as much care as the main loop. Thus, performance of sphlib might not be really representative of hash function performance on that matter. Not all functions are equivalent in that respect; but sphlib is not (yet) the right tool to evaluate hash function performance on very short messages.

The tables below (tables 1 to 6) contain our measures, expressed in megabytes per second (one megabyte is  $10^6$  bytes), cycles per byte, and speed relative to SHA-1. Apart from the SHA-3 round 2 candidates, we also list SHA-1, SHA-2 (SHA-256 and SHA-512) and Whirlpool. We omit speed measures for functions for which the exact same performance is obtained with a bigger output size; for instance, SHA-384 and SHA-512 are the same function save for the initial value and the final truncation, so we list only SHA-512. The charts (figures 1 to 6) list the SHA-3 round 2 candidates, and SHA-2, sorted by bandwidth (in megabytes per second), for 256-bit and 512-bit output sizes.

It shall be noted that whenever applicable, sphlib performance was compared with the submitted optimized code and the performance figures reported in the submission packages; sphlib was always on par with those figures. This validates the effectiveness of the optimization strategy of sphlib.

Note: for symmetry of notation, we call BLAKE-256 (respectively BLAKE-512) what the BLAKE specification names BLAKE-32 (respectively BLAKE-64).

| Function   | $oxed{f bandwidth (MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|----------------------------|-------------|---------------|
| SHA-1      | 333.46                     | 7.20        | 1.000         |
| SHA-256    | 145.10                     | 16.54       | 0.435         |
| SHA-512    | 187.72                     | 12.78       | 0.563         |
| Whirlpool  | 59.13                      | 40.59       | 0.177         |
| BLAKE-256  | 230.42                     | 10.42       | 0.691         |
| BLAKE-512  | 341.96                     | 7.02        | 1.025         |
| BMW-256    | 238.61                     | 10.06       | 0.716         |
| BMW-512    | 458.86                     | 5.23        | 1.376         |
| ECHO-256   | 62.43                      | 38.44       | 0.187         |
| ECHO-512   | 33.39                      | 71.88       | 0.100         |
| Fugue-256  | 79.42                      | 30.22       | 0.238         |
| Fugue-384  | 54.12                      | 44.35       | 0.162         |
| Fugue-512  | 41.30                      | 58.11       | 0.124         |
| JH-512     | 59.92                      | 40.05       | 0.180         |
| Luffa-256  | 97.26                      | 24.68       | 0.292         |
| Luffa-384  | 74.15                      | 32.37       | 0.222         |
| Luffa-512  | 54.78                      | 43.81       | 0.164         |
| Shabal-512 | 319.57                     | 7.51        | 0.958         |
| SIMD-256   | 46.77                      | 51.31       | 0.140         |
| SIMD-512   | 41.17                      | 58.29       | 0.123         |
| Skein-256  | 287.10                     | 8.36        | 0.861         |
| Skein-512  | 365.22                     | 6.57        | 1.095         |

Table 1: Performance on Intel x86 Q6600, 2.4 GHz, in 64-bit mode.





Figure 1: Bandwidth on Intel x86 Q6600,  $2.4~\mathrm{GHz}$ , in 64-bit mode (megabytes per second).

| Function   | $oxed{f bandwidth (MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|----------------------------|-------------|---------------|
| SHA-1      | 320.52                     | 7.49        | 1.000         |
| SHA-256    | 132.89                     | 18.06       | 0.415         |
| SHA-512    | 47.43                      | 50.60       | 0.148         |
| Whirlpool  | 30.37                      | 79.03       | 0.095         |
| BLAKE-256  | 192.43                     | 12.47       | 0.600         |
| BLAKE-512  | 45.04                      | 53.29       | 0.141         |
| BMW-256    | 205.70                     | 11.67       | 0.642         |
| BMW-512    | 151.23                     | 15.87       | 0.472         |
| ECHO-256   | 50.08                      | 47.92       | 0.156         |
| ECHO-512   | 26.95                      | 89.05       | 0.084         |
| Fugue-256  | 55.92                      | 42.92       | 0.174         |
| Fugue-384  | 43.44                      | 55.25       | 0.136         |
| Fugue-512  | 36.87                      | 65.09       | 0.115         |
| JH-512     | 19.85                      | 120.91      | 0.062         |
| Luffa-256  | 82.34                      | 29.15       | 0.257         |
| Luffa-384  | 61.29                      | 39.16       | 0.191         |
| Luffa-512  | 47.09                      | 50.97       | 0.147         |
| Shabal-512 | 248.55                     | 9.66        | 0.775         |
| SIMD-256   | 32.26                      | 74.40       | 0.101         |
| SIMD-512   | 26.84                      | 89.42       | 0.084         |
| Skein-256  | 63.91                      | 37.55       | 0.199         |
| Skein-512  | 62.14                      | 38.62       | 0.194         |

Table 2: Performance on Intel x86 Q6600,  $2.4~\mathrm{GHz}$ , in 32-bit mode.





Figure 2: Bandwidth on Intel x86 Q6600, 2.4 GHz, in 32-bit mode (megabytes per second).

| Function   | ${\bf bandwidth~(MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|--------------------------|-------------|---------------|
| SHA-1      | 5.93                     | 33.73       | 1.000         |
| SHA-256    | 2.82                     | 70.92       | 0.476         |
| SHA-512    | 1.47                     | 136.05      | 0.248         |
| Whirlpool  | 0.36                     | 555.56      | 0.061         |
| BLAKE-256  | 2.96                     | 67.57       | 0.499         |
| BLAKE-512  | 1.60                     | 125.00      | 0.270         |
| BMW-256    | 4.79                     | 41.75       | 0.808         |
| BMW-512    | 2.65                     | 75.47       | 0.447         |
| ECHO-256   | 1.09                     | 183.49      | 0.184         |
| ECHO-512   | 0.59                     | 338.98      | 0.099         |
| Fugue-256  | 1.82                     | 109.89      | 0.307         |
| Fugue-384  | 1.23                     | 162.60      | 0.207         |
| Fugue-512  | 0.95                     | 210.53      | 0.160         |
| JH-512     | 0.64                     | 312.50      | 0.108         |
| Luffa-256  | 1.78                     | 112.36      | 0.300         |
| Luffa-384  | 1.34                     | 149.25      | 0.226         |
| Luffa-512  | 1.03                     | 194.17      | 0.174         |
| Shabal-512 | 6.71                     | 29.81       | 1.132         |
| SIMD-256   | 0.81                     | 246.91      | 0.137         |
| SIMD-512   | 0.47                     | 425.53      | 0.079         |
| Skein-256  | 1.94                     | 103.09      | 0.327         |
| Skein-512  | 1.57                     | 127.39      | 0.265         |

Table 3: Performance on Broadcom BCM3302 (MIPS, 32-bit, 200 MHz).





Figure 3: Bandwidth on Broadcom BCM3302 (MIPS, 32-bit, 200 MHz) (megabytes per second).

| Function   | ${\bf bandwidth~(MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|--------------------------|-------------|---------------|
| SHA-1      | 3.17                     | 23.66       | 1.000         |
| SHA-256    | 1.82                     | 41.21       | 0.574         |
| SHA-512    | 0.55                     | 136.36      | 0.174         |
| Whirlpool  | 0.18                     | 416.67      | 0.057         |
| BLAKE-256  | 1.76                     | 42.61       | 0.555         |
| BLAKE-512  | 0.76                     | 98.68       | 0.240         |
| BMW-256    | 2.46                     | 30.49       | 0.776         |
| BMW-512    | 1.08                     | 69.44       | 0.341         |
| ECHO-256   | 0.45                     | 166.67      | 0.142         |
| ECHO-512   | 0.24                     | 312.50      | 0.076         |
| Fugue-256  | 0.73                     | 102.74      | 0.230         |
| Fugue-384  | 0.51                     | 147.06      | 0.161         |
| Fugue-512  | 0.39                     | 192.31      | 0.123         |
| JH-512     | 0.23                     | 326.09      | 0.073         |
| Luffa-256  | 1.02                     | 73.53       | 0.322         |
| Luffa-384  | 0.72                     | 104.17      | 0.227         |
| Luffa-512  | 0.56                     | 133.93      | 0.177         |
| Shabal-512 | 3.22                     | 23.29       | 1.016         |
| SIMD-256   | 0.34                     | 220.59      | 0.107         |
| SIMD-512   | 0.27                     | 277.78      | 0.085         |
| Skein-256  | 0.90                     | 83.33       | 0.284         |
| Skein-512  | 0.58                     | 129.31      | 0.183         |

Table 4: Performance on ARM920T (ARMv4, 32-bit, 75 MHz).





Figure 4: Bandwidth on ARM920T (ARMv4, 32-bit, 75 MHz) (megabytes per second).

| Function   | $oxed{f bandwidth (MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|----------------------------|-------------|---------------|
| SHA-1      | 129.37                     | 18.55       | 1.000         |
| SHA-256    | 82.52                      | 29.08       | 0.638         |
| SHA-512    | 77.54                      | 30.95       | 0.599         |
| Whirlpool  | 28.90                      | 83.04       | 0.223         |
| BLAKE-256  | 72.01                      | 33.33       | 0.557         |
| BLAKE-512  | 92.37                      | 25.98       | 0.714         |
| BMW-256    | 79.16                      | 30.32       | 0.612         |
| BMW-512    | 185.90                     | 12.91       | 1.437         |
| ECHO-256   | 22.36                      | 107.33      | 0.173         |
| ECHO-512   | 12.18                      | 197.04      | 0.094         |
| Fugue-256  | 41.94                      | 57.22       | 0.324         |
| Fugue-384  | 28.56                      | 84.03       | 0.221         |
| Fugue-512  | 21.39                      | 112.20      | 0.165         |
| JH-512     | 31.82                      | 75.42       | 0.246         |
| Luffa-256  | 58.89                      | 40.75       | 0.455         |
| Luffa-384  | 44.34                      | 54.13       | 0.343         |
| Luffa-512  | 34.89                      | 68.79       | 0.270         |
| Shabal-512 | 131.78                     | 18.21       | 1.019         |
| SIMD-256   | 19.43                      | 123.52      | 0.150         |
| SIMD-512   | 1.71                       | 1403.51     | 0.013         |
| Skein-256  | 106.18                     | 22.60       | 0.821         |
| Skein-512  | 103.40                     | 23.21       | 0.799         |

Table 5: Performance with Java (Intel x86 Q6600, 2.4 GHz, 64-bit mode).





Figure 5: Bandwidth with Java (Intel x86 Q6600, 2.4 GHz, 64-bit mode) (megabytes per second).

| Function   | ${\bf bandwidth~(MB/s)}$ | cycles/byte | rel. to SHA-1 |
|------------|--------------------------|-------------|---------------|
| SHA-1      | 122.91                   | 19.53       | 1.000         |
| SHA-256    | 66.51                    | 36.08       | 0.541         |
| SHA-512    | 26.68                    | 89.96       | 0.217         |
| Whirlpool  | 12.77                    | 187.94      | 0.104         |
| BLAKE-256  | 65.22                    | 36.80       | 0.531         |
| BLAKE-512  | 44.37                    | 54.09       | 0.361         |
| BMW-256    | 75.17                    | 31.93       | 0.612         |
| BMW-512    | 62.05                    | 38.68       | 0.505         |
| ECHO-256   | 18.22                    | 131.72      | 0.148         |
| ECHO-512   | 9.88                     | 242.91      | 0.080         |
| Fugue-256  | 35.85                    | 66.95       | 0.292         |
| Fugue-384  | 19.32                    | 124.22      | 0.157         |
| Fugue-512  | 18.27                    | 131.36      | 0.149         |
| JH-512     | 10.60                    | 226.42      | 0.086         |
| Luffa-256  | 42.01                    | 57.13       | 0.342         |
| Luffa-384  | 31.08                    | 77.22       | 0.253         |
| Luffa-512  | 24.77                    | 96.89       | 0.202         |
| Shabal-512 | 128.50                   | 18.68       | 1.045         |
| SIMD-256   | 16.71                    | 143.63      | 0.136         |
| SIMD-512   | 1.58                     | 1518.99     | 0.013         |
| Skein-256  | 44.40                    | 54.05       | 0.361         |
| Skein-512  | 39.23                    | 61.18       | 0.319         |

Table 6: Performance with Java (Intel x86 Q6600, 2.4 GHz, 32-bit mode).





Figure 6: Bandwidth with Java (Intel x86 Q6600, 2.4 GHz, 32-bit mode) (megabytes per second).

# 4 What Makes a Hash Function Fast

#### 4.1 Cache Size

The most important parameter for hash function performance is cache size, precisely the size of the level-1 cache for instructions. The level-1 cache for data is much less important, except for the functions which use tables in their implementation (e.g. ECHO).

Failure to fit the core algorithm loop within the level-1 cache implies severe performance degradation, usually by a factor of 2 or 3, possibly much more (we witnessed a degradation of a factor of 22 on a Skein implementation!). This impacts loop unrolling.

Loop unrolling is a technique which consists in duplicating the code for one function "round"; several successive rounds are thus converted to executable opcodes, presented successively in RAM. The main point of loop unrolling is that it nullifies data routing cost. For instance, consider SHA-256: at each round, the eight state words are "rotated". If you unroll eight successive rounds, then each round may use the proper variables, and the rotation becomes virtual (i.e. free: no more data copying between variables). If you unroll 64 successive rounds of SHA-256, then round constants can be hardcoded instead of being obtained from a table, which saves one memory indirection per round.

Thus, unrolling can be viewed as a generic tool for suppressing data routing cost (at runtime), at the expense of a larger code footprint<sup>3</sup>. We say that the implementation is *fully unrolled* when there is nothing more to gain, with regards to data routing, by unrolling more. For instance, a fully unrolled SHA-256 has 64 copies of the round code.

If the fully unrolled function fits in level-1 cache, then everything is fine. Otherwise, the function must be partially re-rolled; thus, some routing operations again imply some runtime cost, through data copying or indirections. In the implemented functions, only Shabal, Luffa and Fugue fit in the 8 kB level-1 cache of our MIPS-based test architecture, when fully unrolled. All other functions must be only partially unrolled, which explains their poorer performance on that platform (with regards to what they can do on large desktop systems).

### 4.2 64-bit Types

64-bit integers are efficient on systems which have native 64-bit registers. On 32-bit systems without such registers, 64-bit integer types are very inefficient. Any 64-bit value, on a 32-bit system, requires two registers; most operations require two or more opcodes; additions need manual carry propagation between the lower and upper words. Rotations are especially expensive, since they cannot use the rotation opcodes provided by the CPU for 32-bit registers (at least on the architectures which have such opcodes).

This is easily seen on the charts about the functions with 512-bit output. On the 64-bit architectures (either in C or Java), BMW-512 is the fastest hash function; Skein, BLAKE-512 and SHA-512 are also quite fast. On 32-bit systems, these functions lose much ground. Comparatively, Shabal efficiency remains high on 32-bit systems, because that function uses only operations on 32-bit values.

On the x86 architecture, the lack of 64-bit integer type is often compensated, in hash function implementations, with the use of special units which offer 64-bit registers (MMX, SSE...) but this requires inline assembly or non-portable compiler intrinsics; this does not work on other architectures, in particular the Java VM.

We also note that using 64-bit types makes porting to some architectures quite difficult. The "long long" integer type has been added to C in the 1999 standard, but in the previous standard (C89, aka "ANSI C"), the biggest available type is not guaranteed to exceed 32 bits. Fortunately, most embedded architectures nowadays use development kits derived from some version of GCC, which has offered the "long long" type for more than 15 years.

<sup>&</sup>lt;sup>3</sup>Compilation time is also greatly enlarged.

### 4.3 Endianness

Endianness appears to be a non-issue. Sticking to the native endianness of the machine can yield an improvement, but only a very slight one. For MD4 (which is extremely fast, more than twice faster than SHA-1), on big-endian Sparc v9 systems, using the special assembly opcodes for little-endian access may speed up the function by about 30%, with regards to the generic code (which must byte-swap the input words). The relative speedup decreases correspondingly with the function performance.

In some corner cases, e.g. when implementing under strict code and RAM size constraints, using the native endianness can help a bit, mostly because it avoids having to store decoded words in local variables; hence it makes sense to define I/O with the little-endian convention, which is now prevalent<sup>4</sup>. Most of the SHA-3 round 2 candidates use little-endian. Nevertheless, at least BLAKE uses big-endian and still offers very decent performance. Generally speaking, endianness is *not* an important factor for hash function performance.

# 4.4 Java Specific Issues

It appears that the JIT compiler produces very fat code. This is due to the constraints in which the JIT compiler operates: it must work fast, and produce code which the garbage collector can inspect, and which is amenable to runtime patching, based on code which may be dynamically loaded afterwards. The net effect is as if the level-1 cache was greatly reduced. In practice, one must implement partial unrolling in a way very similar to what is done for the C implementation for our MIPS test platform.

Another Java-specific overhead is about table accesses. Java implements an abstraction layer over memory; one cannot access RAM directly. Instead, tables use Java arrays, where every access is checked with regards to the actual array size. The JIT compiler tries to merge such checks, but array access are nevertheless more expensive than the already costly table accesses which are used in C. Functions which use many table accesses thus incur an extra slowdown on the Java VM.

# 4.5 Complexity

CPU do not fear executing code with thousands of distinct opcodes. Programmers, however, do. When a function consists of several distinct stages, each with many subtle and changing details, then the intellectual resources that the programmer can devote to optimization get diluted. The prime example is SIMD. This is the function which took the most time to implement, about twice more development time than any other of the candidates. In the end, in order to comply with our stated rule of "similar optimization efforts", we add to give up with making SIMD-512 fast in Java. This is why performance of SIMD-512 appears to be abysmal on Java platforms: we simply lacked the time to solve its severe L1-cache issues.

Regular designs are much easier to implement. For instance, BLAKE was very easy<sup>5</sup>. This yields more time to tinker with optimization. Design regularity should also be a valuable asset when trying to optimize for code size instead of raw bandwidth.

#### 4.6 Consistency

Most of the round 2 candidates are actually pairs of functions, one "small" function for 224-bit and 256-bit outputs, and one "big" function for 384-bit and 512-bit outputs. Some functions (e.g. Fugue and Luffa) are actually *three* functions, the third one being used for the 384-bit output.

Such a dichotomy has some negative effects:

• Implementing the whole family of functions requires more resources from the programmer, reducing correspondingly the resources allocated to actual optimization.

<sup>&</sup>lt;sup>4</sup>The little-endian convention makes it easier to port non-endian-neutral code from the PC/Windows world, where such code is common. Also, the little-endian convention makes it simpler to *simulate* the target system on a PC.

<sup>&</sup>lt;sup>5</sup>The clarity of the function specification considerably helped, too.

- Code size increases, when all the functions must be implemented together (e.g. for interoperability reasons).
- Performance issues and security cease to be orthogonal: the decision of which function output size to use, nominally a security-related issue, can no longer be taken independently of performance.

We note that SHA-2 already featured such a dichotomy. Some of the SHA-3 candidates (e.g. JH and Shabal) avoid that dichotomy and provide consistent performance for all output sizes.

## 4.7 On the Importance of Speed

It shall be noted that many of the candidates, and SHA-2 as well, offer good enough performance on large desktop systems. A good hard disk, or a gigabit network interface, in ideal conditions, tops at about 100 MB/s. Any hash function which offers at least that bandwidth will be sufficient for most purposes, in particular since we are talking about hash function bandwidth using a single core, and a PC which processes 100 MB of data per second is likely to have at least four cores. In brief, it is quite unlikely that even a not-that-fast hash function like SHA-2 becomes a bottleneck in any practical situation involving a large desktop system.

On the other hand, our MIPS router has 100baseT connectivity, and thus may receive data at 10 MB/s speed. But the fastest SHA-3 candidate on that platform is Shabal (which is faster than SHA-1), and its bandwidth is below 7 MB/s. This means that hash function performance is a very real issue on that kind of system.

This is why we consider performance measures on embedded systems to be much more relevant than those on large desktop systems.

## 5 Conclusion

Our measures show that most of the round 2 candidates were designed for maximum performance on large systems, preferably 64-bit architectures. For some of them, adequate performance is achieved only with the help of the special x86 units, such as the SSE2 unit (with its 128-bit registers and parallel execution features), or the AES-NI instructions which compute an AES round in two clock cycles on the newest Intel processors (all of them being newer than the NIST reference platform). In our tests, such extra units are not available, since we use portable code and try to obtain measures which can be interpolated to many distinct architectures in the same category.

Some candidates achieve very good performance on 64-bit platforms through the use of 64-bit integer operations, in particular BMW-512 (BMW-256 uses only 32-bit words). BMW-512 is the fastest function, by far, on 64-bit architectures. Unfortunately, its performance decreases quite a lot on 32-bit systems. In our view, BMW breaks records precisely on those architectures where breaking records matters the least.

It would be natural to suspect us of having chosen the specific test platforms on which Shabal would shine; after all, we are part of the Shabal design team, and this is a competition. However, it is really the other way round. When Shabal was designed, we had already conceived the idea that performance was most important on embedded 32-bit systems, and we made sure that Shabal was fast on those. It seems that few candidates followed the same path.

Development of sphlib will keep on, at least to include the five round 2 candidates which are not implemented yet. Other areas where performance issues should be investigated include:

- performance on hashing very small messages (less than one "elementary block");
- compact implementations, for constrained environments;
- script languages, in particular Javascript (which has almost nothing in common with Java).

# References

- [1] Cryptographic Algorithm Hash Competition, http://csrc.nist.gov/groups/ST/hash/sha-3/
- [2] eBASH: ECRYPT Benchmarking of All Submitted Hashes, http://bench.cr.yp.to/ebash.html
- [3] sphlib 1.1, http://www.crypto-hash.fr/modules/wfdownloads/singlefile.php?cid=13&lid=10
- [4] GCC, the GNU Compiler Collection, http://gcc.gnu.org/
- [5] OpenWrt, http://openwrt.org/
- [6] HPGCC, http://sourceforge.net/projects/hpgcc/
- [7] Java SE Downloads, http://java.sun.com/javase/downloads/index.jsp