



Cryptographic Library for FPGAs using SME

A Bachelor Project Defense

Jacob Herbst (mrw148) and Jonas Flach-Jensen (sjm233) Institute of Computer Science (DIKU)



# Agenda

- Why is this project interesting?
- Presentation of MD5 and AES
- Results
- Suggestions for future work



# Motivation - Cryptography

- Is ubiquitous
- Often Hardware-focused
- This project has focused on implementing a variety of algorithms rather than fine-tuning a single one.



## Motivation - Why use FPGAs?

- Architecture
- Configurable not only in computation but also in interface
- Often fast as the overhead from generality is omitted
- Low power consumption (fixed precision)





#### MD5

- Cryptographic hash function
- Merkle-Damgård construction
- Four different compression stages (F, G, H, I) of 16 rounds each, 64 rounds in total
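To make the structure above concrete, here is a minimal software sketch of MD5 in Python, following RFC 1321. It is not the SME implementation from this project; the function names (`pad`, `compress`, `md5`) are chosen for illustration only. The loop in `md5` is the Merkle-Damgård chaining, and the four branches in `compress` correspond to the F, G, H and I stages.

```python
import math
import struct

# Per-round left-rotation amounts from RFC 1321.
SHIFTS = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
# Round constants: floor(2^32 * |sin(i + 1)|).
K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]
# Initial chaining value.
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def pad(message):
    """Preprocessing: append 0x80, zero bytes, and the 64-bit message length in bits."""
    bit_len = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    return padded + struct.pack("<Q", bit_len)


def compress(state, block):
    """Run the 64 rounds (16 each of F, G, H, I) on one 64-byte block."""
    m = struct.unpack("<16I", block)
    a, b, c, d = state
    for i in range(64):
        if i < 16:    # F
            f, g = (b & c) | (~b & d), i
        elif i < 32:  # G
            f, g = (d & b) | (~d & c), (5 * i + 1) % 16
        elif i < 48:  # H
            f, g = b ^ c ^ d, (3 * i + 5) % 16
        else:         # I
            f, g = c ^ (b | ~d), (7 * i) % 16
        f = (f + a + K[i] + m[g]) & 0xFFFFFFFF
        a, d, c, b = d, c, b, (b + rotl(f, SHIFTS[i])) & 0xFFFFFFFF
    # Final combine step: add the result to the incoming chaining value.
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d)))


def md5(message):
    """Merkle-Damgård iteration: each block's output state feeds the next block."""
    state = IV
    padded = pad(message)
    for offset in range(0, len(padded), 64):
        state = compress(state, padded[offset:offset + 64])
    return struct.pack("<4I", *state).hex()
```

The dependency of each `compress` call on the previous `state` is the data dependency that matters once the algorithm is pipelined.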



Figure: MD5 round



Figure: Merkle-Damgaard construction



### MD5 - Optimizations

- Naive: one simple process, four buses
- Pipelined version: a clocked process for preprocessing and for each compression stage
- Same idea for SHA-2



Figure: MD5 pipeline



# MD5 - Pipeline

- Stalls on large messages due to the data dependency between consecutive blocks
- The solution is to accept multiple independent inputs

|       | Independent message blocks |       |       |       |       |       |       |       |
|-------|----------------------------|-------|-------|-------|-------|-------|-------|-------|
| clock | 0                          | 1     | 2     | 3     | 4     | 5     | 6     | 7     |
|       | $P_1$                      | $M_1$ | $F_1$ | $G_1$ | $H_1$ | $I_1$ | $C_1$ |       |
|       |                            | $P_2$ | $M_2$ | $F_2$ | $G_2$ | $H_2$ | $I_2$ | $C_2$ |

|       | Dependent message blocks |       |       |       |       |       |       |       |       |       |       |       |
|-------|--------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| clock | 0                        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    |
|       | $P_1$                    | $M_1$ | $F_1$ | $G_1$ | $H_1$ | $I_1$ | $C_1$ |       |       |       |       |       |
|       |                          | $P_2$ | $M_2$ | -     | -     | -     | -     | $F_2$ | $G_2$ | $H_2$ | $I_2$ | $C_2$ |
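The two schedules above can be reproduced with a small Python model. This is a simplified sketch, not the SME design: it assumes one stage per clock, that one new block may enter preprocessing every clock, and that in the dependent case stage F of a block must wait until stage C of the previous block has finished. The names `STAGES` and `schedule` are hypothetical.

```python
# Pipeline stages as labelled in the tables: preprocessing (P), M,
# the four compression stages (F, G, H, I) and the combine stage (C).
STAGES = ["P", "M", "F", "G", "H", "I", "C"]


def schedule(num_blocks, dependent):
    """Return, for every block, the clock at which each stage executes."""
    timeline = []
    for n in range(num_blocks):
        clocks = {}
        clock = n  # assumption: one new block can enter the pipeline per clock
        for stage in STAGES:
            if dependent and stage == "F" and n > 0:
                # F needs the chaining value produced by C of the previous block.
                clock = max(clock, timeline[n - 1]["C"] + 1)
            clocks[stage] = clock
            clock += 1
        timeline.append(clocks)
    return timeline


if __name__ == "__main__":
    for dependent in (False, True):
        label = "dependent" if dependent else "independent"
        for n, clocks in enumerate(schedule(2, dependent), start=1):
            print(label, f"block {n}:", clocks)
```

Under these assumptions the second block finishes at clock 7 when the blocks are independent and at clock 11 when they belong to the same message, matching the tables.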



#### **AES**

- The algorithm is Rijndael, a block cipher built as a substitution-permutation (SP) network.
- Four steps, or lookups:

$$W_i = \begin{cases} K_i & \text{if } i < N \\ W_{i-N} \oplus \mathsf{SubWords}(W_{i-1} \lll 8) \oplus \mathsf{rcon}_{i/N} & \text{if } i \geq N \text{ and } i \equiv 0 \pmod{N} \\ W_{i-N} \oplus \mathsf{SubWords}(W_{i-1}) & \text{if } i \geq N, \ N > 6, \ \text{and } i \equiv 4 \pmod{N} \\ W_{i-N} \oplus W_{i-1} & \text{otherwise} \end{cases}$$
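The key schedule can be sketched in Python as follows. This is an illustrative software version assuming the standard FIPS 197 definitions, not the SME code of this project; the helper names (`gmul`, `make_sbox`, `sub_word`, `expand_key`) are hypothetical. The S-box is computed from its definition (inverse in GF(2^8) followed by the affine map) instead of being typed in as a table.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return product


def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):  # value^254 is the inverse in GF(2^8)
                inv = gmul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def sub_word(word):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rotl32(word, n):
    """Rotate a 32-bit word left by n bits."""
    return ((word << n) | (word >> (32 - n))) & 0xFFFFFFFF


def expand_key(key):
    """Expand a 16, 24 or 32 byte key into the round key words W_i."""
    n = len(key) // 4                 # N: key length in 32-bit words
    rounds = n + 6                    # 10, 12 or 14 rounds
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(n)]
    rcon = 1
    for i in range(n, 4 * (rounds + 1)):
        temp = w[i - 1]
        if i % n == 0:
            temp = sub_word(rotl32(temp, 8)) ^ (rcon << 24)
            rcon = gmul(rcon, 2)
        elif n > 6 and i % n == 4:
            temp = sub_word(temp)
        w.append(w[i - n] ^ temp)
    return w
```

For the FIPS 197 example key `2b7e151628aed2a6abf7158809cf4f3c`, the first derived word `w[4]` is `0xa0fafe17`.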

$$T_{0}[a] = \begin{bmatrix} S[a] \cdot 02_{16} \\ S[a] \\ S[a] \\ S[a] \cdot 03_{16} \end{bmatrix} \quad T_{1}[a] = \begin{bmatrix} S[a] \cdot 03_{16} \\ S[a] \cdot 02_{16} \\ S[a] \\ S[a] \end{bmatrix} \quad T_{2}[a] = \begin{bmatrix} S[a] \\ S[a] \cdot 03_{16} \\ S[a] \cdot 02_{16} \\ S[a] \end{bmatrix} \quad T_{3}[a] = \begin{bmatrix} S[a] \\ S[a] \\ S[a] \cdot 03_{16} \\ S[a] \cdot 02_{16} \end{bmatrix}$$

$$e_{j} = T_{0}[a_{0,3}] \oplus T_{1}[a_{1,2}] \oplus T_{2}[a_{2,1}] \oplus T_{3}[a_{3,0}] \oplus k_{j}$$
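A T-box round can be sketched in Python as shown here. This is an illustration under assumptions, not the project's SME code: it uses the FIPS 197 state layout (`state[r][c]` is the byte in row `r`, column `c`) and the FIPS 197 ShiftRows convention, so output column `j` reads row `r` from column `(j + r) mod 4`. The names `T` and `round_column` are hypothetical.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return product


def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for value in range(256):
        inv = 0
        if value:
            inv = 1
            for _ in range(254):
                inv = gmul(inv, value)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def rotr32(word, n):
    """Rotate a 32-bit word right by n bits."""
    return ((word >> n) | (word << (32 - n))) & 0xFFFFFFFF


SBOX = make_sbox()
# T0 packs the column (02*S, S, S, 03*S) into one word, row 0 in the top byte.
T0 = [(gmul(s, 2) << 24) | (s << 16) | (s << 8) | gmul(s, 3) for s in SBOX]
# T1..T3 are byte rotations of T0, matching the rotated MixColumns columns.
T = [T0] + [[rotr32(t, 8 * k) for t in T0] for k in (1, 2, 3)]


def round_column(state, j, round_key_word):
    """SubBytes, ShiftRows, MixColumns and AddRoundKey for output column j."""
    return (T[0][state[0][j]]
            ^ T[1][state[1][(j + 1) % 4]]
            ^ T[2][state[2][(j + 2) % 4]]
            ^ T[3][state[3][(j + 3) % 4]]
            ^ round_key_word)


if __name__ == "__main__":
    # Start-of-round-1 state and round key from the FIPS 197 cipher example.
    block = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
    state = [[block[r + 4 * c] for c in range(4)] for r in range(4)]
    print(hex(round_column(state, 0, 0xA0FAFE17)))
```

Each output column costs four table lookups and four XORs, which is why the four separate AES steps collapse into lookups.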



## AES - Optimization

- Fast naive version
- Pipelined by splitting up rounds
- No data dependency



#### Results from pipelining

$$C(x) = x + 2 \cdot \text{blocks}$$

Zedboard bus: 2132 MBps

MD5: Easily optimised beyond the memory limit.

SHA: Worst performer; hard to optimise because of dependencies.

AES: Reached limit with current approach? Potentially other approaches can reach higher.

ChaCha: Easily reaches limit.

#### MD5

| Version            | $f_{max}$ (MHz) | clocks<sub>hi</sub> | TP<sub>hi</sub> (MBps) | clocks<sub>lo</sub> | TP<sub>lo</sub> (MBps) | LUT   | FF    |
|--------------------|-----------------|---------------------|------------------------|---------------------|------------------------|-------|-------|
| Naive              | 2.38            | b                   | 152                    | b                   | 152                    | 11607 | 2304  |
| Proc<sub>4</sub>   | 9.50            | hi(6)               | 266                    | lo(6)               | 101                    | 10247 | 5226  |
| Proc<sub>8</sub>   | 19.00           | hi(10)              | 532                    | lo(10)              | 122                    | 10087 | 7538  |
| Proc<sub>16</sub>  | 33.50           | hi(18)              | 937                    | lo(18)              | 119                    | 10206 | 12162 |
| Proc<sub>32</sub>  | 65.00           | hi(34)              | 1817                   | lo(34)              | 123                    | 10149 | 21347 |
| Proc<sub>64</sub>  | 115.00          | hi(66)              | 3209                   | lo(66)              | 112                    | 10350 | 39718 |

#### SHA

| Version            | $f_{max}$ (MHz) | clocks<sub>hi</sub> | TP<sub>hi</sub> (MBps) | clocks<sub>lo</sub> | TP<sub>lo</sub> (MBps) | LUT   | FF    |
|--------------------|-----------------|---------------------|------------------------|---------------------|------------------------|-------|-------|
| Naive              |                 | b                   | 134.4                  | b                   | 134.4                  | 24330 | 2560  |
| Proc<sub>4</sub>   | 8.0             | hi(6)               |                        | lo(6)               | 85.3                   | 24466 | 8938  |
|                    | 8.0             | hi(10)              |                        |                     | 51.2                   | 24756 | 14066 |

#### AES

| Version            | $f_{max}$ (MHz) | clocks | TP (MBps) | LUT   | FF    |
|--------------------|-----------------|--------|-----------|-------|-------|
| Naive              |                 | b      | 352       | 10612 | 3195  |
| TBox               | 25              | b      | 400       | 16458 | 3195  |
| Proc<sub>4</sub>   | 68              |        | 544       | 16474 | 2817  |
| Proc<sub>11</sub>  | 208             | C(12)  | 1663      | 15659 | 4383  |
| Proc<sub>22</sub>  |                 | C(24)  | 1662      | 15454 | 7401  |
| BRAM<sub>11</sub>  | 195             |        | 1556      | 10012 | 10398 |

#### ChaCha

| Version            | $f_{max}$ (MHz) | clocks | TP (MBps) | LUT   | FF    |
|--------------------|----------------|--------|----------|-------|-------|
| Naive              | 1.25           | b      | 80       | 14670 | 3457  |
| Proc <sub>11</sub> | 40.00          | C(9)   | 1279     | 14736 | 16898 |
| Procos             | 82.00          | C(20)  | 2557     | 17565 | 32420 |



#### Results compared to CPU

Why compare to a CPU rather than a GPU?

- A GPU is the standard approach for hardware acceleration.
- A CPU is closer in TDP.
- A CPU is more approachable, and making a GPU version would require more development time.
- We have already reached some board limitations.
- MD5: 4.5 times faster than any of the compared CPU implementations.
- AES: Close to the C# version, but cannot compete with AES-NI.
- SHA: Only half the speed of the i5, but faster than the ARM processor. Potential for improvement.
- ChaCha: Reaches the bandwidth limit of the Zynq board. Not quite i5 speed. Doubling the number of processes gives 2714 MBps, only about 150 MBps more than the previous version.

| MD5 | Naive | Proc<sub>64</sub> | C#  | C   | OpenSSL<sub>low</sub> | OpenSSL<sub>high</sub> |
|-----|-------|-------------------|-----|-----|-----------------------|------------------------|
| Pi  | 152   | 3210              | 287 | 256 | 42                    | 293                    |
| i5  | 152   | 3210              | 604 | 622 | 81                    | 691                    |

| AES | Naive | Proc<sub>11</sub> | C#   | C   | OpenSSL<sub>low</sub> | OpenSSL<sub>high</sub> |
|-----|-------|-------------------|------|-----|-----------------------|------------------------|
| Pi  | 400   | 1963              | 70   | 198 | 72                    | 89                     |
| i5  | 400   | 1699              | 1963 | 340 | 847                   | 5722                   |

| SHA | Naive | Proc<sub>4</sub> | C#  | OpenSSL<sub>low</sub> | OpenSSL<sub>high</sub> |
|-----|-------|------------------|-----|-----------------------|------------------------|
| Pi  | 134   | 224              | 163 | 42                    | 165                    |
| i5  | 134   | 224              | 438 | 61                    | 461                    |

| ChaCha | Naive | Proc | OpenSSL<sub>low</sub> | OpenSSL<sub>high</sub> |
|--------|-------|------|-----------------------|------------------------|
| Pi     | 80    | 2557 | 84                    | 307                    |
| i5     |       | 2557 | 388                   | 3092                   |



# Power usage

Why use TDP?

- Selling point for FPGAs
- Only possibility at the current stage.
- Time-consuming to do actual tests.

All of our versions are significantly more power efficient than the CPUs:

| Platform | TDP      |
|----------|----------|
| FPGA     | 1.765 W  |
| i5       | 65.000 W |
| Pi       | 7.500 W  |
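As a rough worked example, the Python snippet below divides throughput by TDP for each platform. It assumes the best MD5 figures reported in the CPU comparison (Proc<sub>64</sub> on the FPGA, the fastest OpenSSL run on each CPU) and treats TDP as a stand-in for actual power draw, which has not been measured. The names `TDP_WATTS`, `MD5_MBPS` and `mbps_per_watt` are illustrative.

```python
# TDP figures from the slide, in watts.
TDP_WATTS = {"FPGA": 1.765, "i5": 65.0, "Pi": 7.5}

# Best MD5 throughput per platform from the CPU comparison, in MBps
# (assumption: the fastest reported version is the fair one to compare).
MD5_MBPS = {"FPGA": 3210, "i5": 691, "Pi": 293}


def mbps_per_watt(throughput, tdp):
    """Throughput per watt of TDP for every platform present in both tables."""
    return {name: throughput[name] / tdp[name] for name in throughput}


efficiency = mbps_per_watt(MD5_MBPS, TDP_WATTS)

if __name__ == "__main__":
    for name, value in sorted(efficiency.items(), key=lambda item: -item[1]):
        print(f"{name}: {value:.1f} MBps/W")
```

Because TDP is an upper bound on sustained power rather than a measurement, these ratios are only an estimate of the real efficiency gap.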



#### Future work

#### Critical work

- The dependency routing needs to be fixed in the hashing functions.
- Make hashes able to switch between messages to circumvent the stalling.
- Make a useful interface to expose our implementations.
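The idea of switching between messages can be estimated with a small Python model. It is a sketch under simplifying assumptions, not a design for the SME implementation: the compression loop is treated as a pipeline of `depth` stages (five here, for F, G, H, I and C), one block may be issued per clock, and the next block of a message may only be issued once its previous block has left the pipeline. The function name `total_clocks` is hypothetical.

```python
def total_clocks(blocks_per_message, depth=5):
    """Clocks needed to hash all messages when blocks are issued greedily.

    blocks_per_message: number of blocks in each message.
    depth: pipeline stages a block passes through before its result is ready.
    """
    remaining = list(blocks_per_message)
    ready_at = [0] * len(remaining)  # earliest clock each message may issue again
    clock = 0
    last_finish = 0
    while any(remaining):
        for m in range(len(remaining)):
            if remaining[m] and ready_at[m] <= clock:
                # Issue the next block of message m in this clock.
                remaining[m] -= 1
                ready_at[m] = clock + depth
                last_finish = clock + depth
                break
        clock += 1  # at most one block enters the pipeline per clock
    return last_finish


if __name__ == "__main__":
    print("one message, 4 blocks:   ", total_clocks([4]))
    print("five messages, 4 blocks: ", total_clocks([4] * 5))
```

In this model a single four-block message takes 20 clocks, while five such messages interleaved take 24 clocks in total, so the pipeline stays full once there are at least as many active messages as pipeline stages.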

#### Optimizing work

- Investigate SHA, hopefully bringing its performance up to reasonable levels.
- Test whether different approaches to AES could yield better performance. Suggestions:
  1. Naive but pipelined.
  2. Stateful BRAM.

#### Comparing work

- Compare to results from other research papers, which are often written in HDL, to see if SME can actually provide comparable results. This would require a better FPGA.
- Test against a GPU.



Questions?







































































































































