New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide clear_internal_memory() callback, document the grave performance impact of not using the callbacks #231

Open
solardiz opened this Issue Nov 11, 2017 · 2 comments

Comments

Projects
None yet
2 participants
@solardiz

solardiz commented Nov 11, 2017

Here are some benchmarks of Argon2d 1.3 t=1 p=1 as implemented in phc-winner-argon2 as of today for the password hashing use case (not KDF) on 2x E5-2650v4 (24 cores, 48 threads) with 8x DDR4-2400 (a server kindly provided by Packet) running CentOS 7, built with that distro's default gcc, using AVX2 (I verified with objdump -d), in cumulative hashes per second after warm-up (CPUs at max turbo for all cores in use):

32 MiB, 48 threads, as-is: 520
32 MiB, 48 threads, malloc()/free() out of loop: 650
32 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 1070

Per /proc/meminfo, the memory was allocated as AnonHugePages, so I don't expect further speedup from explicit use of huge pages on this Linux kernel (but there might be on CentOS 6, which per my testing wouldn't allocate huge pages transparently). 24 threads were consistently slightly slower across all of these, e.g. 980 hashes/s for the last one.

Here are some more, for 2 MiB (high request rate capacity, targeting L3 cache):

2 MiB, 48 threads, as-is: 15.5k
2 MiB, 48 threads, malloc()/free() out of loop: 15.7k
2 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 27.4k
2 MiB, 24 threads, as-is: 20.7k
2 MiB, 24 threads, malloc()/free() out of loop: 20.5k
2 MiB, 24 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 23.3k

This shows that memory (de)allocation and cleansing has a combined performance cost of over 2x at least on this platform for 32 MiB, but apparently glibc is being smart at 2 MiB allocations so only the cleansing costs us at that size.

The inclusion of such overhead in default builds and its potential extent should be documented prominently (perhaps including right in the argon2 program output, not only in some documentation file), and a way for applications to override not only the memory (de)allocation functions, but also clear_internal_memory() should be provided.

BTW, while the code mostly uses the clear_internal_memory() wrapper function, in two places it calls secure_wipe_memory() directly - intentional (why?) or a bug?

As a sanity check, current yescrypt t=0 p=1 (built for plain SSE2, as that still works best even on Broadwell-EP with yescrypt's default 128-bit S-box lookups, which I like for them being so numerous) computes 780 hashes/s for 32 MiB, 48 threads at default settings, 815 at pwxform rounds reduced from 6 to 3. Since it has another 1/3 of a pass over the memory (on top of Argon2's t=1), this would translate to 780*4/3 = 1040 to 815*4/3 = 1087 hashes/s for just the memory filling phase at same memory access speed, which is similar to the 1070 we see for Argon2d with the artificial slowdowns removed. That's at the same 1 KiB block size that Argon2 uses. Higher memory bandwidth is usable on this system at 2 KiB (925*4/3 = 1233) and 4 KiB (1005*4/3 = 1340, 1005*32*2*4/3*2^20/10^9 = 90 GB/s), but these are not an option for Argon2 (block size not configurable).

@solardiz

This comment has been minimized.

solardiz commented Nov 11, 2017

FWIW, for these tests I patched the phc-winner-argon2 tree itself as follows:

diff --git a/src/core.c b/src/core.c
index 8781852..9c3a27c 100644
--- a/src/core.c
+++ b/src/core.c
@@ -99,7 +99,18 @@ int allocate_memory(const argon2_context *context, uint8_t **memory,
     if (context->allocate_cbk) {
         (context->allocate_cbk)(memory, memory_size);
     } else {
+#undef MALFRE
+#ifdef MALFRE
         *memory = malloc(memory_size);
+#else
+        static __thread void *m = NULL;
+        static __thread size_t ms = 0;
+        if (!m || memory_size > ms) {
+            printf("Allocating %zu bytes\n", memory_size);
+            m = realloc(m, ms = memory_size);
+        }
+        *memory = m;
+#endif
     }
 
     if (*memory == NULL) {
@@ -116,11 +127,16 @@ void free_memory(const argon2_context *context, uint8_t *memory,
     if (context->free_cbk) {
         (context->free_cbk)(memory, memory_size);
     } else {
+#ifdef MALFRE
         free(memory);
+#endif
     }
 }
 
 void NOT_OPTIMIZED secure_wipe_memory(void *v, size_t n) {
+#if 1
+    return;
+#endif
 #if defined(_MSC_VER) && VC_GE_2005(_MSC_VER)
     SecureZeroMemory(v, n);
 #elif defined memset_s
diff --git a/src/run.c b/src/run.c
index b21b9d7..c8bb5f8 100644
--- a/src/run.c
+++ b/src/run.c
@@ -131,11 +131,14 @@ static void run(uint32_t outlen, char *pwd, size_t pwdlen, char *salt, uint32_t
         fatal("could not allocate memory for hash");
     }
 
+int i;
+for (i = 0; i < 1000; i++) {
     result = argon2_hash(t_cost, m_cost, threads, pwd, pwdlen, salt, saltlen,
                          out, outlen, encoded, encodedlen, type,
                          version);
     if (result != ARGON2_OK)
         fatal(argon2_error_message(result));
+}
 
     stop_time = clock();
 

(making code edits between the tests), and I used this script:

#!/bin/sh
M=47
MEM=32768
killall argon2
for n in `seq 0 $M`; do
        echo password | taskset -c $n ./argon2 saltsalt -d -t 1 -k $MEM | fgrep seconds &
done
wait
echo
for n in `seq 0 $M`; do
        echo password | taskset -c $n ./argon2 saltsalt -d -t 1 -k $MEM | fgrep seconds &
done
wait
echo
for n in `seq 0 $M`; do
        echo password | taskset -c $n ./argon2 saltsalt -d -t 1 -k $MEM | fgrep seconds &
done
wait

Because of issue #204, I ran the script under time, to sanity-check its real time as well.

@sneves

This comment has been minimized.

Contributor

sneves commented Dec 25, 2017

BTW, while the code mostly uses the clear_internal_memory() wrapper function, in two places it calls secure_wipe_memory() directly - intentional (why?) or a bug?

The only uses of secure_wipe_memory are after checking the flags in the context structure, so it seems correct. But I'm less enthused about FLAG_clear_internal_memory, which is a global variable without adequate namespacing. I don't know when it got there, but it's completely unrelated to the already existent flags dealing with conditional memory wiping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment