Add Intel SHA extension for SHA1 #807

Closed
noloader opened this issue Jan 4, 2017 · 17 comments
@noloader
Contributor

noloader commented Jan 4, 2017

This patch adds SHA extension support for SHA1. It is a hack because of my limited knowledge of Botan: I don't know how to cut in a new ISA or CPU feature, so I replaced the SHA-1 SSE2 code with the SHA extension code for cut-in and testing. Someone more familiar with Botan needs to take it further.

Hopefully it can serve as a starting point for SHA1 using Intel SHA extensions. It would be nice to see it make it into Botan 2.0.

Credit should go to Sean Gulley of Intel. He wrote the article New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors. Later, I found his reference implementation in mitls/hacl-star under experimental/hash to fill in the missing pieces from the Intel blog. We also had to use unaligned loads and stores to avoid SIGBUS on unaligned buffers.

Be careful with the ISA name: ARMv8 also has AES and SHA extensions, so the names could collide if you are not mindful. Here are the CPU feature flags from a Goldmont board running Linux (a runtime-detection sketch follows the dump); notice the kernel calls it sha_ni.

goldmont$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 92
model name      : Intel(R) Celeron(R) CPU J3455 @ 1.50GHz
stepping        : 9
microcode       : 0x1a
cpu MHz         : 799.987
cache size      : 1024 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                  dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
                  art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf
                  eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr
                  pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm
                  3dnowprefetch intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep
                  erms mpx rdseed smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida
                  arat pln pts
bugs            : monitor
bogomips        : 2995.20
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
...
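
For anyone wiring this up, here is a minimal runtime-detection sketch (my illustration, not part of the patch), assuming a GCC/Clang toolchain with <cpuid.h>. The SHA extensions are reported in CPUID leaf 7, subleaf 0, EBX bit 29, which is the bit the kernel surfaces as sha_ni above.

#include <cpuid.h>

static bool cpu_has_intel_sha()
   {
   unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
   if(__get_cpuid_max(0, nullptr) < 7)
      return false;                          // CPUID leaf 7 not available
   __cpuid_count(7, 0, eax, ebx, ecx, edx);  // leaf 7, subleaf 0
   return (ebx & (1u << 29)) != 0;           // EBX.SHA [bit 29]
   }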

$ cat sha1.diff 
diff --git a/src/lib/hash/sha1/sha1_sse2/sha1_sse2.cpp b/src/lib/hash/sha1/sha1_sse2/sha1_sse2.cpp
index 8c77850..9195b75 100644
--- a/src/lib/hash/sha1/sha1_sse2/sha1_sse2.cpp
+++ b/src/lib/hash/sha1/sha1_sse2/sha1_sse2.cpp
@@ -1,153 +1,17 @@
 /*
-* SHA-1 using SSE2
-* Based on public domain code by Dean Gaudet
-*    (http://arctic.org/~dean/crypto/sha1.html)
+* SHA-1 using Intel SHA intrinsic
+* Based on public domain code by Sean Gulley
+*    (https://github.com/mitls/hacl-star/tree/master/experimental/hash)
 * (C) 2009-2011 Jack Lloyd
 *
 * Botan is released under the Simplified BSD License (see license.txt)
 */
 
 #include <botan/sha160.h>
-#include <emmintrin.h>
+#include <immintrin.h>
 
 namespace Botan {
 
-namespace SHA1_SSE2_F {
-
-namespace {
-
-/*
-* First 16 bytes just need byte swapping. Preparing just means
-* adding in the round constants.
-*/
-
-#define prep00_15(P, W)                                      \
-   do {                                                      \
-      W = _mm_shufflehi_epi16(W, _MM_SHUFFLE(2, 3, 0, 1));   \
-      W = _mm_shufflelo_epi16(W, _MM_SHUFFLE(2, 3, 0, 1));   \
-      W = _mm_or_si128(_mm_slli_epi16(W, 8),                 \
-                       _mm_srli_epi16(W, 8));                \
-      P.u128 = _mm_add_epi32(W, K00_19);                     \
-   } while(0)
-
-/*
-For each multiple of 4, t, we want to calculate this:
-
-W[t+0] = rol(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-W[t+1] = rol(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1);
-W[t+2] = rol(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1);
-W[t+3] = rol(W[t]   ^ W[t-5] ^ W[t-11] ^ W[t-13], 1);
-
-we'll actually calculate this:
-
-W[t+0] = rol(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1);
-W[t+1] = rol(W[t-2] ^ W[t-7] ^ W[t-13] ^ W[t-15], 1);
-W[t+2] = rol(W[t-1] ^ W[t-6] ^ W[t-12] ^ W[t-14], 1);
-W[t+3] = rol(  0    ^ W[t-5] ^ W[t-11] ^ W[t-13], 1);
-W[t+3] ^= rol(W[t+0], 1);
-
-the parameters are:
-
-W0 = &W[t-16];
-W1 = &W[t-12];
-W2 = &W[t- 8];
-W3 = &W[t- 4];
-
-and on output:
-prepared = W0 + K
-W0 = W[t]..W[t+3]
-*/
-
-/* note that there is a step here where i want to do a rol by 1, which
-* normally would look like this:
-*
-* r1 = psrld r0,$31
-* r0 = pslld r0,$1
-* r0 = por r0,r1
-*
-* but instead i do this:
-*
-* r1 = pcmpltd r0,zero
-* r0 = paddd r0,r0
-* r0 = psub r0,r1
-*
-* because pcmpltd and paddd are availabe in both MMX units on
-* efficeon, pentium-m, and opteron but shifts are available in
-* only one unit.
-*/
-#define prep(prep, XW0, XW1, XW2, XW3, K)                               \
-   do {                                                                 \
-      __m128i r0, r1, r2, r3;                                           \
-                                                                        \
-      /* load W[t-4] 16-byte aligned, and shift */                      \
-      r3 = _mm_srli_si128((XW3), 4);                                    \
-      r0 = (XW0);                                                       \
-      /* get high 64-bits of XW0 into low 64-bits */                    \
-      r1 = _mm_shuffle_epi32((XW0), _MM_SHUFFLE(1,0,3,2));              \
-      /* load high 64-bits of r1 */                                     \
-      r1 = _mm_unpacklo_epi64(r1, (XW1));                               \
-      r2 = (XW2);                                                       \
-                                                                        \
-      r0 = _mm_xor_si128(r1, r0);                                       \
-      r2 = _mm_xor_si128(r3, r2);                                       \
-      r0 = _mm_xor_si128(r2, r0);                                       \
-      /* unrotated W[t]..W[t+2] in r0 ... still need W[t+3] */          \
-                                                                        \
-      r2 = _mm_slli_si128(r0, 12);                                      \
-      r1 = _mm_cmplt_epi32(r0, _mm_setzero_si128());                    \
-      r0 = _mm_add_epi32(r0, r0);   /* shift left by 1 */               \
-      r0 = _mm_sub_epi32(r0, r1);   /* r0 has W[t]..W[t+2] */           \
-                                                                        \
-      r3 = _mm_srli_epi32(r2, 30);                                      \
-      r2 = _mm_slli_epi32(r2, 2);                                       \
-                                                                        \
-      r0 = _mm_xor_si128(r0, r3);                                       \
-      r0 = _mm_xor_si128(r0, r2);   /* r0 now has W[t+3] */             \
-                                                                        \
-      (XW0) = r0;                                                       \
-      (prep).u128 = _mm_add_epi32(r0, K);                               \
-   } while(0)
-
-/*
-* SHA-160 F1 Function
-*/
-inline void F1(uint32_t A, uint32_t& B, uint32_t C, uint32_t D, uint32_t& E, uint32_t msg)
-   {
-   E += (D ^ (B & (C ^ D))) + msg + rotate_left(A, 5);
-   B  = rotate_left(B, 30);
-   }
-
-/*
-* SHA-160 F2 Function
-*/
-inline void F2(uint32_t A, uint32_t& B, uint32_t C, uint32_t D, uint32_t& E, uint32_t msg)
-   {
-   E += (B ^ C ^ D) + msg + rotate_left(A, 5);
-   B  = rotate_left(B, 30);
-   }
-
-/*
-* SHA-160 F3 Function
-*/
-inline void F3(uint32_t A, uint32_t& B, uint32_t C, uint32_t D, uint32_t& E, uint32_t msg)
-   {
-   E += ((B & C) | ((B | C) & D)) + msg + rotate_left(A, 5);
-   B  = rotate_left(B, 30);
-   }
-
-/*
-* SHA-160 F4 Function
-*/
-inline void F4(uint32_t A, uint32_t& B, uint32_t C, uint32_t D, uint32_t& E, uint32_t msg)
-   {
-   E += (B ^ C ^ D) + msg + rotate_left(A, 5);
-   B  = rotate_left(B, 30);
-   }
-
-}
-
-}
-
 /*
 * SHA-160 Compression Function using SSE for message expansion
 */
@@ -155,181 +19,192 @@ inline void F4(uint32_t A, uint32_t& B, uint32_t C, uint32_t D, uint32_t& E, uin
 BOTAN_FUNC_ISA("sse2")
 void SHA_160::sse2_compress_n(secure_vector<uint32_t>& digest, const uint8_t input[], size_t blocks)
    {
-   using namespace SHA1_SSE2_F;
-
-   const __m128i K00_19 = _mm_set1_epi32(0x5A827999);
-   const __m128i K20_39 = _mm_set1_epi32(0x6ED9EBA1);
-   const __m128i K40_59 = _mm_set1_epi32(0x8F1BBCDC);
-   const __m128i K60_79 = _mm_set1_epi32(0xCA62C1D6);
-
-   uint32_t A = digest[0],
-          B = digest[1],
-          C = digest[2],
-          D = digest[3],
-          E = digest[4];
-
-   const __m128i* input_mm = reinterpret_cast<const __m128i*>(input);
-
-   for(size_t i = 0; i != blocks; ++i)
-      {
-      union v4si {
-         uint32_t u32[4];
-         __m128i u128;
-         };
-
-      v4si P0, P1, P2, P3;
-
-      __m128i W0 = _mm_loadu_si128(&input_mm[0]);
-      prep00_15(P0, W0);
-
-      __m128i W1 = _mm_loadu_si128(&input_mm[1]);
-      prep00_15(P1, W1);
-
-      __m128i W2 = _mm_loadu_si128(&input_mm[2]);
-      prep00_15(P2, W2);
-
-      __m128i W3 = _mm_loadu_si128(&input_mm[3]);
-      prep00_15(P3, W3);
-
-      /*
-      Using SSE4; slower on Core2 and Nehalem
-      #define GET_P_32(P, i) _mm_extract_epi32(P.u128, i)
-
-      Much slower on all tested platforms
-      #define GET_P_32(P,i) _mm_cvtsi128_si32(_mm_srli_si128(P.u128, i*4))
-      */
-
-#define GET_P_32(P, i) P.u32[i]
-
-      F1(A, B, C, D, E, GET_P_32(P0, 0));
-      F1(E, A, B, C, D, GET_P_32(P0, 1));
-      F1(D, E, A, B, C, GET_P_32(P0, 2));
-      F1(C, D, E, A, B, GET_P_32(P0, 3));
-      prep(P0, W0, W1, W2, W3, K00_19);
-
-      F1(B, C, D, E, A, GET_P_32(P1, 0));
-      F1(A, B, C, D, E, GET_P_32(P1, 1));
-      F1(E, A, B, C, D, GET_P_32(P1, 2));
-      F1(D, E, A, B, C, GET_P_32(P1, 3));
-      prep(P1, W1, W2, W3, W0, K20_39);
-
-      F1(C, D, E, A, B, GET_P_32(P2, 0));
-      F1(B, C, D, E, A, GET_P_32(P2, 1));
-      F1(A, B, C, D, E, GET_P_32(P2, 2));
-      F1(E, A, B, C, D, GET_P_32(P2, 3));
-      prep(P2, W2, W3, W0, W1, K20_39);
-
-      F1(D, E, A, B, C, GET_P_32(P3, 0));
-      F1(C, D, E, A, B, GET_P_32(P3, 1));
-      F1(B, C, D, E, A, GET_P_32(P3, 2));
-      F1(A, B, C, D, E, GET_P_32(P3, 3));
-      prep(P3, W3, W0, W1, W2, K20_39);
-
-      F1(E, A, B, C, D, GET_P_32(P0, 0));
-      F1(D, E, A, B, C, GET_P_32(P0, 1));
-      F1(C, D, E, A, B, GET_P_32(P0, 2));
-      F1(B, C, D, E, A, GET_P_32(P0, 3));
-      prep(P0, W0, W1, W2, W3, K20_39);
-
-      F2(A, B, C, D, E, GET_P_32(P1, 0));
-      F2(E, A, B, C, D, GET_P_32(P1, 1));
-      F2(D, E, A, B, C, GET_P_32(P1, 2));
-      F2(C, D, E, A, B, GET_P_32(P1, 3));
-      prep(P1, W1, W2, W3, W0, K20_39);
-
-      F2(B, C, D, E, A, GET_P_32(P2, 0));
-      F2(A, B, C, D, E, GET_P_32(P2, 1));
-      F2(E, A, B, C, D, GET_P_32(P2, 2));
-      F2(D, E, A, B, C, GET_P_32(P2, 3));
-      prep(P2, W2, W3, W0, W1, K40_59);
-
-      F2(C, D, E, A, B, GET_P_32(P3, 0));
-      F2(B, C, D, E, A, GET_P_32(P3, 1));
-      F2(A, B, C, D, E, GET_P_32(P3, 2));
-      F2(E, A, B, C, D, GET_P_32(P3, 3));
-      prep(P3, W3, W0, W1, W2, K40_59);
-
-      F2(D, E, A, B, C, GET_P_32(P0, 0));
-      F2(C, D, E, A, B, GET_P_32(P0, 1));
-      F2(B, C, D, E, A, GET_P_32(P0, 2));
-      F2(A, B, C, D, E, GET_P_32(P0, 3));
-      prep(P0, W0, W1, W2, W3, K40_59);
-
-      F2(E, A, B, C, D, GET_P_32(P1, 0));
-      F2(D, E, A, B, C, GET_P_32(P1, 1));
-      F2(C, D, E, A, B, GET_P_32(P1, 2));
-      F2(B, C, D, E, A, GET_P_32(P1, 3));
-      prep(P1, W1, W2, W3, W0, K40_59);
-
-      F3(A, B, C, D, E, GET_P_32(P2, 0));
-      F3(E, A, B, C, D, GET_P_32(P2, 1));
-      F3(D, E, A, B, C, GET_P_32(P2, 2));
-      F3(C, D, E, A, B, GET_P_32(P2, 3));
-      prep(P2, W2, W3, W0, W1, K40_59);
-
-      F3(B, C, D, E, A, GET_P_32(P3, 0));
-      F3(A, B, C, D, E, GET_P_32(P3, 1));
-      F3(E, A, B, C, D, GET_P_32(P3, 2));
-      F3(D, E, A, B, C, GET_P_32(P3, 3));
-      prep(P3, W3, W0, W1, W2, K60_79);
-
-      F3(C, D, E, A, B, GET_P_32(P0, 0));
-      F3(B, C, D, E, A, GET_P_32(P0, 1));
-      F3(A, B, C, D, E, GET_P_32(P0, 2));
-      F3(E, A, B, C, D, GET_P_32(P0, 3));
-      prep(P0, W0, W1, W2, W3, K60_79);
-
-      F3(D, E, A, B, C, GET_P_32(P1, 0));
-      F3(C, D, E, A, B, GET_P_32(P1, 1));
-      F3(B, C, D, E, A, GET_P_32(P1, 2));
-      F3(A, B, C, D, E, GET_P_32(P1, 3));
-      prep(P1, W1, W2, W3, W0, K60_79);
-
-      F3(E, A, B, C, D, GET_P_32(P2, 0));
-      F3(D, E, A, B, C, GET_P_32(P2, 1));
-      F3(C, D, E, A, B, GET_P_32(P2, 2));
-      F3(B, C, D, E, A, GET_P_32(P2, 3));
-      prep(P2, W2, W3, W0, W1, K60_79);
-
-      F4(A, B, C, D, E, GET_P_32(P3, 0));
-      F4(E, A, B, C, D, GET_P_32(P3, 1));
-      F4(D, E, A, B, C, GET_P_32(P3, 2));
-      F4(C, D, E, A, B, GET_P_32(P3, 3));
-      prep(P3, W3, W0, W1, W2, K60_79);
-
-      F4(B, C, D, E, A, GET_P_32(P0, 0));
-      F4(A, B, C, D, E, GET_P_32(P0, 1));
-      F4(E, A, B, C, D, GET_P_32(P0, 2));
-      F4(D, E, A, B, C, GET_P_32(P0, 3));
-
-      F4(C, D, E, A, B, GET_P_32(P1, 0));
-      F4(B, C, D, E, A, GET_P_32(P1, 1));
-      F4(A, B, C, D, E, GET_P_32(P1, 2));
-      F4(E, A, B, C, D, GET_P_32(P1, 3));
-
-      F4(D, E, A, B, C, GET_P_32(P2, 0));
-      F4(C, D, E, A, B, GET_P_32(P2, 1));
-      F4(B, C, D, E, A, GET_P_32(P2, 2));
-      F4(A, B, C, D, E, GET_P_32(P2, 3));
-
-      F4(E, A, B, C, D, GET_P_32(P3, 0));
-      F4(D, E, A, B, C, GET_P_32(P3, 1));
-      F4(C, D, E, A, B, GET_P_32(P3, 2));
-      F4(B, C, D, E, A, GET_P_32(P3, 3));
-
-      A = (digest[0] += A);
-      B = (digest[1] += B);
-      C = (digest[2] += C);
-      D = (digest[3] += D);
-      E = (digest[4] += E);
-
-      input_mm += (64 / 16);
-      }
-
-#undef GET_P_32
+    __m128i ABCD, ABCD_SAVE, E0, E0_SAVE, E1;
+    __m128i MASK, MSG0, MSG1, MSG2, MSG3;
+
+    uint32_t* state = &digest[0];
+
+    // Load initial values
+    ABCD = _mm_loadu_si128((__m128i*) state);
+    E0 = _mm_set_epi32(state[4], 0, 0, 0);
+    ABCD = _mm_shuffle_epi32(ABCD, 0x1B);
+    MASK = _mm_set_epi64x(0x0001020304050607ULL, 0x08090a0b0c0d0e0fULL);
+
+    while (blocks)
+    {
+        const __m128i* input_mm = reinterpret_cast<const __m128i*>(input);
+
+        // Save current hash
+        ABCD_SAVE = ABCD;
+        E0_SAVE = E0;
+
+        // Rounds 0-3
+        MSG0 = _mm_loadu_si128(input_mm+0);
+        MSG0 = _mm_shuffle_epi8(MSG0, MASK);
+        E0 = _mm_add_epi32(E0, MSG0);
+        E1 = ABCD;
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);
+
+        // Rounds 4-7
+        MSG1 = _mm_loadu_si128(input_mm+1);
+        MSG1 = _mm_shuffle_epi8(MSG1, MASK);
+        E1 = _mm_sha1nexte_epu32(E1, MSG1);
+        E0 = ABCD;
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 0);
+        MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
+
+        // Rounds 8-11
+        MSG2 = _mm_loadu_si128(input_mm+2);
+        MSG2 = _mm_shuffle_epi8(MSG2, MASK);
+        E0 = _mm_sha1nexte_epu32(E0, MSG2);
+        E1 = ABCD;
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);
+        MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
+        MSG0 = _mm_xor_si128(MSG0, MSG2);
+
+        // Rounds 12-15
+        MSG3 = _mm_loadu_si128(input_mm+3);
+        MSG3 = _mm_shuffle_epi8(MSG3, MASK);
+        E1 = _mm_sha1nexte_epu32(E1, MSG3);
+        E0 = ABCD;
+        MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 0);
+        MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
+        MSG1 = _mm_xor_si128(MSG1, MSG3);
+
+        // Rounds 16-19
+        E0 = _mm_sha1nexte_epu32(E0, MSG0);
+        E1 = ABCD;
+        MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 0);
+        MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
+        MSG2 = _mm_xor_si128(MSG2, MSG0);
+
+        // Rounds 20-23
+        E1 = _mm_sha1nexte_epu32(E1, MSG1);
+        E0 = ABCD;
+        MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
+        MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
+        MSG3 = _mm_xor_si128(MSG3, MSG1);
+
+        // Rounds 24-27
+        E0 = _mm_sha1nexte_epu32(E0, MSG2);
+        E1 = ABCD;
+        MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 1);
+        MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
+        MSG0 = _mm_xor_si128(MSG0, MSG2);
+
+        // Rounds 28-31
+        E1 = _mm_sha1nexte_epu32(E1, MSG3);
+        E0 = ABCD;
+        MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
+        MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
+        MSG1 = _mm_xor_si128(MSG1, MSG3);
+
+        // Rounds 32-35
+        E0 = _mm_sha1nexte_epu32(E0, MSG0);
+        E1 = ABCD;
+        MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 1);
+        MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
+        MSG2 = _mm_xor_si128(MSG2, MSG0);
+
+        // Rounds 36-39
+        E1 = _mm_sha1nexte_epu32(E1, MSG1);
+        E0 = ABCD;
+        MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 1);
+        MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
+        MSG3 = _mm_xor_si128(MSG3, MSG1);
+
+        // Rounds 40-43
+        E0 = _mm_sha1nexte_epu32(E0, MSG2);
+        E1 = ABCD;
+        MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
+        MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
+        MSG0 = _mm_xor_si128(MSG0, MSG2);
+
+        // Rounds 44-47
+        E1 = _mm_sha1nexte_epu32(E1, MSG3);
+        E0 = ABCD;
+        MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 2);
+        MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
+        MSG1 = _mm_xor_si128(MSG1, MSG3);
+
+        // Rounds 48-51
+        E0 = _mm_sha1nexte_epu32(E0, MSG0);
+        E1 = ABCD;
+        MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
+        MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
+        MSG2 = _mm_xor_si128(MSG2, MSG0);
+
+        // Rounds 52-55
+        E1 = _mm_sha1nexte_epu32(E1, MSG1);
+        E0 = ABCD;
+        MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 2);
+        MSG0 = _mm_sha1msg1_epu32(MSG0, MSG1);
+        MSG3 = _mm_xor_si128(MSG3, MSG1);
+
+        // Rounds 56-59
+        E0 = _mm_sha1nexte_epu32(E0, MSG2);
+        E1 = ABCD;
+        MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 2);
+        MSG1 = _mm_sha1msg1_epu32(MSG1, MSG2);
+        MSG0 = _mm_xor_si128(MSG0, MSG2);
+
+        // Rounds 60-63
+        E1 = _mm_sha1nexte_epu32(E1, MSG3);
+        E0 = ABCD;
+        MSG0 = _mm_sha1msg2_epu32(MSG0, MSG3);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);
+        MSG2 = _mm_sha1msg1_epu32(MSG2, MSG3);
+        MSG1 = _mm_xor_si128(MSG1, MSG3);
+
+        // Rounds 64-67
+        E0 = _mm_sha1nexte_epu32(E0, MSG0);
+        E1 = ABCD;
+        MSG1 = _mm_sha1msg2_epu32(MSG1, MSG0);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 3);
+        MSG3 = _mm_sha1msg1_epu32(MSG3, MSG0);
+        MSG2 = _mm_xor_si128(MSG2, MSG0);
+
+        // Rounds 68-71
+        E1 = _mm_sha1nexte_epu32(E1, MSG1);
+        E0 = ABCD;
+        MSG2 = _mm_sha1msg2_epu32(MSG2, MSG1);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);
+        MSG3 = _mm_xor_si128(MSG3, MSG1);
+
+        // Rounds 72-75
+        E0 = _mm_sha1nexte_epu32(E0, MSG2);
+        E1 = ABCD;
+        MSG3 = _mm_sha1msg2_epu32(MSG3, MSG2);
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E0, 3);
+
+        // Rounds 76-79
+        E1 = _mm_sha1nexte_epu32(E1, MSG3);
+        E0 = ABCD;
+        ABCD = _mm_sha1rnds4_epu32(ABCD, E1, 3);
+
+        // Add values back to state
+        E0 = _mm_sha1nexte_epu32(E0, E0_SAVE);
+        ABCD = _mm_add_epi32(ABCD, ABCD_SAVE);
+
+        input += 64;
+        blocks--;
+    }
+
+    // Save state
+    ABCD = _mm_shuffle_epi32(ABCD, 0x1B);
+    _mm_storeu_si128((__m128i*) state, ABCD);
+    state[4] = _mm_extract_epi32(E0, 3);
    }
-
-#undef prep00_15
-#undef prep
-
 }

Here is the updated sha1_sse2.cpp and the diff, packaged as a ZIP file.

sha1_sse2_updated.zip
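
For readers unfamiliar with the intrinsics, here is a hedged scalar model of what one _mm_sha1rnds4_epu32 call computes (FIPS 180-4 restated, not Botan code; the intrinsic itself operates on packed ABCD with E folded into its second operand). The immediate selects the round function and constant, and each call advances four rounds.

#include <cstdint>

static inline uint32_t rol32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// s = {A,B,C,D,E}, w = four message words, imm = 0..3 as in the intrinsic
void sha1_rounds4_model(uint32_t s[5], const uint32_t w[4], int imm)
   {
   static const uint32_t K[4] = { 0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6 };
   for(int i = 0; i < 4; ++i)
      {
      const uint32_t b = s[1], c = s[2], d = s[3];
      uint32_t f;
      if(imm == 0)      f = (b & c) | (~b & d);           // Ch, rounds 0-19
      else if(imm == 2) f = (b & c) | (b & d) | (c & d);  // Maj, rounds 40-59
      else              f = b ^ c ^ d;                    // Parity, 20-39 and 60-79
      const uint32_t t = rol32(s[0], 5) + f + s[4] + K[imm] + w[i];
      s[4] = d; s[3] = c; s[2] = rol32(b, 30); s[1] = s[0]; s[0] = t;
      }
   }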

@neverhub
Contributor

neverhub commented Jan 5, 2017

I am very interested in speed measurements with native instructions.
Could you run the following commands for me? Thanks in advance.

./botan speed --msec=1000 SHA-1
./botan speed --msec=1000 SHA-256
./botan speed --msec=1000 SHA-224

@securitykernel
Collaborator

Indeed, it will be interesting to see the output of botan speed. weidai11/cryptopp#139 indicates a quite impressive improvement. Is it true that the SHA extensions are only available on the Goldmont microarchitecture? That would somewhat limit the impact of this implementation.

However, 2.0.0 is in feature freeze and likely to be released this week or next, so this would be a candidate for 2.1. The same goes for #808.

@noloader
Contributor Author

noloader commented Jan 5, 2017

@randombit, @neverhub, @cordney,

All measurements were taken on the Celeron J3455 running at 1.5 GHz (burst at 2.3 GHz). The library was configured with ./configure.py --cc={gcc|clang} --cc-abi="-march=native -msse4 -msha".

  • Botan without Intel SHA extensions:
$ ./botan speed --msec=3000 SHA-1 SHA-224 SHA-256
SHA-160 [base] hash 274.826 MiB/sec (824.480 MiB in 3000.009 ms)
SHA-224 [base] hash 92.349 MiB/sec (277.051 MiB in 3000.027 ms)
SHA-256 [base] hash 92.364 MiB/sec (277.094 MiB in 3000.027 ms)
  • Botan with Intel SHA extensions, GCC 6.2:
$ ./botan speed --msec=3000 SHA-1 SHA-224 SHA-256
SHA-160 [base] hash 1195.907 MiB/sec (3587.723 MiB in 3000.000 ms)
SHA-224 [base] hash 535.740 MiB/sec (1607.219 MiB in 3000.000 ms)
SHA-256 [base] hash 535.970 MiB/sec (1607.914 MiB in 3000.005 ms)
  • Botan with Intel SHA extensions, Clang 3.8:
$ ./botan speed --msec=3000 SHA-1 SHA-224 SHA-256
SHA-160 [base] hash 1251.347 MiB/sec (3754.043 MiB in 3000.001 ms)
SHA-224 [base] hash 535.835 MiB/sec (1607.508 MiB in 3000.006 ms)
SHA-256 [base] hash 535.859 MiB/sec (1607.578 MiB in 3000.003 ms)

@randombit
Owner

Thank you! It will require some work to handle the ISA support properly, but I (or maybe someone else if interested, any takers?) can certainly handle it from here based on your patch. The timing is not right to target this for 2.0, but it is a great candidate for 2.1.

How would you like your code to be attributed? Normally in Botan all copyright is held by the individual authors, and we collectively distribute all of it under the BSD-2 license. Alternatively, it can be released in files copyrighted by me, distributed under BSD, with a note explaining that it is based on public domain code by you (and, either way, referencing the public domain code you reference). I was not sure if you had a personal preference for non-copyright.

@securitykernel
Collaborator

The numbers are quite impressive, that's about a 5x increase. Can you say more about the microarchitectures on which the SHA extensions are available today and maybe will be in the future? ark.intel.com does not list these yet, even for the Celeron you say you tested on. Will this be in Kaby Lake processors?

@noloader
Contributor Author

noloader commented Jan 5, 2017

@randombit,

How would you like your code to be attributed? Normally in Botan all copyright is held by individual authors...

I prefer to assign any rights to Botan. Botan can release it under any terms it prefers. (I know some countries endow copyright even if it's unwanted.)


@cordney,

Can you say more on the microarchitectures on which the SHA extensions are available today and maybe will be in the future?

The SHA extensions were originally supposed to be part of Apollo Lake (IIRC). In Q3 2016 we saw them surface under Goldmont (and not Apollo Lake). At this point in time, I am aware of 6 processors with the SHA extensions, all of them Goldmont:

  • Pentium J4205 (desktop)
  • Pentium N4200 (mobile)
  • Celeron J3455 (desktop)
  • Celeron J3355 (desktop)
  • Celeron N3450 (mobile)
  • Celeron N3350 (mobile)

I'm guessing SHA will proliferate a lot like AES-NI in Intel CPUs. Eventually it will be ubiquitous.

I was never able to use ARK to identify the processors with SHA extensions. I used Wikipedia's page on Goldmont, and then worked backwards using the part numbers.

When I sourced the motherboard I needed, I basically asked the same question ("does this CPU have SHA?"). A person answered and stated ARK provided the information. But like I said, I was never able to locate it in ARK.

@securitykernel
Collaborator

Ok, from http://www.legitreviews.com/intel-cannonlake-added-to-llvms-clang_179210 it seems that the SHA extensions will be available in common desktop processors starting with Cannonlake later this year.

Regarding testing, I did not know that Intel has a Software Development Emulator (SDE) that, among other things, emulates AES-NI and the SHA extensions. So one does not strictly need a real processor to test this.
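
Assuming the 64-bit kit installs as sde64 and -glm is the Goldmont preset (both my assumptions about the SDE command line, worth double-checking against its documentation), running the tests under emulation would look something like:

$ sde64 -glm -- ./botan-test
$ sde64 -glm -- ./botan speed --msec=1000 SHA-1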

@noloader
Contributor Author

noloader commented Jan 5, 2017

@randombit, @cordney, @neverhub,

One last piece of helpful information: the intrinsics are available with the following compilers (a compile-time guard sketch follows the list).

  • LLVM Clang 3.4 and above
  • Apple Clang 5.0 (the Clang distributed with Xcode 5)
  • GCC 4.9 and above
  • Intel ICC 14.0
  • MSVC 19.00 (Visual Studio 2015)
  • SunCC: unknown; not available in Sun Studio 12.5
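
A minimal compile-time guard matching that table might look like the following (my sketch: GCC and Clang define __SHA__ when -msha is in effect, and ICC follows GCC on Linux; MSVC has no dedicated macro, so I fall back on a _MSC_VER check for VS2015):

// Guard SHA intrinsic code paths at compile time
#if defined(__SHA__) || (defined(_MSC_VER) && _MSC_VER >= 1900)
#  include <immintrin.h>   // _mm_sha1rnds4_epu32 and friends
#  define HAS_SHA_INTRINSICS 1
#else
#  define HAS_SHA_INTRINSICS 0
#endif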

@randombit
Owner

@cordney I had forgotten about this tool, but it is very useful; I used it in the past to test AES-NI and AVX2 support well before I had the hardware.

@randombit
Owner

This extension might dramatically increase the usability of XMSS, at least with SHA-256. Right now signatures are ... quite slow (last time I tried, generating a H16 signature took over an hour on an i7-6700k, and even H10 is on the order of several seconds).

@noloader
Contributor Author

noloader commented Jan 6, 2017

@randombit,

This extension might dramatically increase the usability of XMSS, at least with SHA-256. Right now signatures are ... quite slow (last time I tried, generating a H16 signature took over an hour on an i7-6700k, and even H10 is on the order of several seconds).

I'm happy to benchmark it for you. Can you supply a sample program?

@randombit
Owner

The times for H10 signatures are reported by ./botan speed XMSS; we can probably extrapolate H16 results from that. An H10 SHA-256 signature takes ~1.5 seconds on an i7-6700k.

@noloader
Contributor Author

noloader commented Jan 6, 2017

@randombit,

Again, it's a four-core Celeron J3455 running at 1.5 GHz (burst at 2.3 GHz). It's not as impressive as a 6th gen i5 or i7. For example, according to the CPU feature flags from /proc/cpuinfo above, the Celeron lacks AVX and BMI. I wish I could provide you numbers for an 8-core machine running at 3.5 GHz with SHA+AVX2+BMI2.

  • Botan without Intel SHA extensions:
$ ./botan speed XMSS
XMSS_SHA2-256_W16_H10 0 keygen/sec; 5797.47 ms/op (1 op in 5797.47 ms)
XMSS_SHA2-256_W16_H10 0  sign/sec; 5771.97 ms/op (1 op in 5771.97 ms)
XMSS_SHA2-256_W16_H10 288  verify/sec; 3.47 ms/op (88 ops in 305.27 ms)
XMSS_SHA2-512_W16_H10 0 keygen/sec; 14983.22 ms/op (1 op in 14983.22 ms)
XMSS_SHA2-512_W16_H10 0  sign/sec; 15015.88 ms/op (1 op in 15015.88 ms)
XMSS_SHA2-512_W16_H10 121  verify/sec; 8.24 ms/op (38 ops in 313.28 ms)
XMSS_SHAKE128_W16_H10 0 keygen/sec; 6012.83 ms/op (1 op in 6012.83 ms)
XMSS_SHAKE128_W16_H10 0  sign/sec; 6006.14 ms/op (1 op in 6006.14 ms)
XMSS_SHAKE128_W16_H10 283  verify/sec; 3.52 ms/op (86 ops in 302.97 ms)
XMSS_SHAKE256_W16_H10 0 keygen/sec; 22089.44 ms/op (1 op in 22089.44 ms)
XMSS_SHAKE256_W16_H10 0  sign/sec; 22134.43 ms/op (1 op in 22134.43 ms)
XMSS_SHAKE256_W16_H10 85  verify/sec; 11.69 ms/op (26 ops in 303.87 ms)
  • Botan with Intel SHA extensions, GCC 6.2:
$ ./botan speed XMSS
XMSS_SHA2-256_W16_H10 0 keygen/sec; 1947.36 ms/op (1 op in 1947.36 ms)
XMSS_SHA2-256_W16_H10 0  sign/sec; 1919.45 ms/op (1 op in 1919.45 ms)
XMSS_SHA2-256_W16_H10 989  verify/sec; 1.01 ms/op (298 ops in 301.26 ms)
XMSS_SHA2-512_W16_H10 0 keygen/sec; 15089.58 ms/op (1 op in 15089.58 ms)
XMSS_SHA2-512_W16_H10 0  sign/sec; 15101.98 ms/op (1 op in 15101.98 ms)
XMSS_SHA2-512_W16_H10 119  verify/sec; 8.39 ms/op (36 ops in 302.09 ms)
XMSS_SHAKE128_W16_H10 0 keygen/sec; 6024.96 ms/op (1 op in 6024.96 ms)
XMSS_SHAKE128_W16_H10 0  sign/sec; 6017.33 ms/op (1 op in 6017.33 ms)
XMSS_SHAKE128_W16_H10 303  verify/sec; 3.29 ms/op (92 ops in 302.63 ms)
XMSS_SHAKE256_W16_H10 0 keygen/sec; 22058.96 ms/op (1 op in 22058.96 ms)
XMSS_SHAKE256_W16_H10 0  sign/sec; 22134.83 ms/op (1 op in 22134.83 ms)
XMSS_SHAKE256_W16_H10 82  verify/sec; 12.11 ms/op (26 ops in 314.88 ms)
  • Botan with Intel SHA extensions, Clang 3.8:
$ ./botan speed  XMSS
XMSS_SHA2-256_W16_H10 0 keygen/sec; 1879.58 ms/op (1 op in 1879.58 ms)
XMSS_SHA2-256_W16_H10 0  sign/sec; 1882.61 ms/op (1 op in 1882.61 ms)
XMSS_SHA2-256_W16_H10 906  verify/sec; 1.10 ms/op (272 ops in 300.22 ms)
XMSS_SHA2-256_W16_H10 0 keygen/sec; 1877.62 ms/op (1 op in 1877.62 ms)
XMSS_SHA2-256_W16_H10 0  sign/sec; 1892.15 ms/op (1 op in 1892.15 ms)
XMSS_SHA2-256_W16_H10 918  verify/sec; 1.09 ms/op (276 ops in 300.61 ms)
XMSS_SHAKE128_W16_H10 0 keygen/sec; 5751.41 ms/op (1 op in 5751.41 ms)
XMSS_SHAKE128_W16_H10 0  sign/sec; 5693.70 ms/op (1 op in 5693.70 ms)
XMSS_SHAKE128_W16_H10 349  verify/sec; 2.86 ms/op (106 ops in 302.87 ms)
XMSS_SHAKE256_W16_H10 0 keygen/sec; 20715.56 ms/op (1 op in 20715.56 ms)
XMSS_SHAKE256_W16_H10 0  sign/sec; 20993.81 ms/op (1 op in 20993.81 ms)
XMSS_SHAKE256_W16_H10 90  verify/sec; 11.10 ms/op (28 ops in 310.90 ms)

@randombit
Owner

A 3x improvement for a very computationally intensive problem out of the gate is nothing to sneeze at. For XMSS we can probably do much better with multithreaded and/or SIMD execution of many inflight SHA-256 operations in order to expose more ILP to the CPU.
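
To illustrate the threading half of that idea (a sketch against Botan's public HashFunction API; the leaf-hashing framing is mine, not how XMSS is structured internally), independent SHA-256 computations can already run concurrently:

#include <botan/hash.h>
#include <cstdint>
#include <future>
#include <vector>

// Hash each independent input on its own thread; results come back in order.
std::vector<Botan::secure_vector<uint8_t>>
hash_leaves_parallel(const std::vector<std::vector<uint8_t>>& leaves)
   {
   std::vector<std::future<Botan::secure_vector<uint8_t>>> jobs;
   for(const auto& leaf : leaves)
      {
      jobs.push_back(std::async(std::launch::async, [&leaf]() {
         auto sha256 = Botan::HashFunction::create("SHA-256");
         sha256->update(leaf.data(), leaf.size());
         return sha256->final();
         }));
      }
   std::vector<Botan::secure_vector<uint8_t>> out;
   for(auto& j : jobs)
      out.push_back(j.get());
   return out;
   }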

@randombit
Owner

I looked into this a little bit; the code to handle the SHA extensions is already in master. I must have done this at some point. It is currently only enabled for x86-32 through some oversight, but everything needed to set the -msha flag for GCC, read the SHA CPUID bit, and so on is already there.

@noloader Can you post the output of ./botan cpuid on this machine?

randombit added a commit that referenced this issue Jan 7, 2017
Need to install SDE to test this. But it compiles at least. :)

Based on GH #807
@noloader
Contributor Author

noloader commented Jan 7, 2017

@randombit,

Can you post the output of ./botan cpuid on this machine?

Yes sir:

$ ./botan cpuid
CPUID flags: sse2 ssse3 sse41 sse42 rdtsc clmul aes_ni rdrand rdseed intel_sha

randombit added a commit that referenced this issue May 18, 2017
Need to install SDE to test this. But it compiles at least. :)

Based on GH #807
randombit added a commit that referenced this issue May 19, 2017
@randombit
Owner

Merged to master now, thank you!
