DsaTest.testTiming() could use a warmup #93

Open
ascarpino opened this issue Jul 11, 2023 · 3 comments

@ascarpino
Copy link

I observed a lot of variability in the results from this DSA timing test. Sigma values on the same machine could vary by over 2 sigma. When looking at the individual timing results over multiple runs, it was clear there is a startup cost that is interfering with the check.

The test starts off with getCurrentThreadCpuTime() differences of:
3496000
1320000
1220000
1171000
1054000
1065000
1009000
968000
987000
1094000
1173000
1171000

After the first 100 iterations, we see the differences have narrowed:
387000
381000
362000
338000
348000
334000
348000

After 260 iterations, the values finally reach their lowest level:
175000
177000
165000
167000
171000
164000
162000
163000
171000
177000
173000
164000
166000

Now we are getting the true timing of the operation. Removing just the first 300 iterations halved the sigma values returned by the test, and the results were more consistent run over run. My belief is that the variability of the sigma values comes from larger K values happening to be chosen more often during the early, slower iterations, which skews the sigma. These timing differences are likely caused by the transition from Java code to math intrinsics and/or the large allocation/GC pressure from the 50k BigIntegers and 50k longs allocated just before the timing loop. Like many performance tests, this one needs a warmup sequence to remove startup costs.
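For context, a standalone probe along the following lines can reproduce the kind of per-iteration numbers shown above. This is only a sketch, not the test itself: the class name, key size, message, and iteration count are arbitrary illustrative choices, and the loop merely mirrors the structure of the test's timing loop.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class DsaWarmupProbe {
  public static void main(String[] args) throws Exception {
    // Generate a throwaway DSA key; the key size and message are illustrative.
    KeyPairGenerator kpg = KeyPairGenerator.getInstance("DSA");
    kpg.initialize(2048);
    KeyPair keyPair = kpg.generateKeyPair();

    Signature signer = Signature.getInstance("SHA256withDSA");
    signer.initSign(keyPair.getPrivate());
    byte[] messageBytes = "Hello".getBytes(StandardCharsets.UTF_8);

    // Time each signing operation with per-thread CPU time, as the test does.
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    for (int i = 0; i < 500; i++) {
      long start = bean.getCurrentThreadCpuTime();
      signer.update(messageBytes);
      signer.sign();
      long end = bean.getCurrentThreadCpuTime();
      // Print every 20th measurement to see the cost drop as the JVM warms up.
      if (i % 20 == 0) {
        System.out.println(i + ": " + (end - start) + " ns");
      }
    }
  }
}
```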

The simplest fix for this is to throw away the first 300 results.

The change starts at line 516 of wycheproof/java/com/google/security/wycheproof/testcases/DsaTest.java:

-    // The timings below are quite noisy. Thus we need a large number of samples.
+    // To reduce noise in this test, a large sample is taken and a warmup is
+    // used to eliminate startup costs.
+    int loops = 50300;
     int samples = 50000;
+    int j = 0;
     long[] timing = new long[samples];
     BigInteger[] k = new BigInteger[samples];
-    for (int i = 0; i < samples; i++) {
+    for (int i = 0; i < loops; i++) {
       long start = bean.getCurrentThreadCpuTime();
       signer.update(messageBytes);
       byte[] signature = signer.sign();
-      timing[i] = bean.getCurrentThreadCpuTime() - start;
-      k[i] = extractK(signature, h, priv, false);
+      long end = bean.getCurrentThreadCpuTime();
+      // Skip the first 300 iterations as warmup.
+      if (i >= 300) {
+        timing[j] = end - start;
+        k[j++] = extractK(signature, h, priv, false);
+      }
     }

@bleichenbacher-daniel

I don't think any changes are necessary here.

What the test does is generate signatures, select a subset of those signatures based on timing information, and then check whether the k's used to generate those signatures are biased. If it is possible to select DSA signatures with small k's by selecting signatures that were generated faster than others, then the implementation has a weakness.

If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should be a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.

If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.

There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also help to improve the test.
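To illustrate the claim that noise independent of k cannot produce a false positive, here is a standalone simulation sketch (not part of Wycheproof; the class name and all constants are illustrative). It models k as uniform and the timing as noisy, including a large simulated warmup spike, but independent of k; the standardized mean of the k's in the fastest half then stays close to a standard normal variable regardless of the noise.

```java
import java.security.SecureRandom;
import java.util.Arrays;

public class SelectionSimulation {
  public static void main(String[] args) {
    // Null hypothesis: the nonce k is uniformly distributed (modeled here as a
    // uniform double in [0, 1)) and the measured timing is noisy but
    // independent of k.
    SecureRandom rand = new SecureRandom();
    int samples = 50000;
    double[] k = new double[samples];
    long[] timing = new long[samples];
    for (int i = 0; i < samples; i++) {
      k[i] = rand.nextDouble();
      // Base cost plus random jitter plus a big startup penalty for the first
      // 300 iterations; none of it depends on k.
      timing[i] = 300_000 + rand.nextInt(100_000) + (i < 300 ? 1_000_000 : 0);
    }

    // Select roughly the fastest half of the signatures.
    long[] sorted = timing.clone();
    Arrays.sort(sorted);
    long cutoff = sorted[samples / 2];
    double sum = 0;
    int count = 0;
    for (int i = 0; i < samples; i++) {
      if (timing[i] < cutoff) {
        sum += k[i];
        count++;
      }
    }

    // Under the null hypothesis the selected k's are still uniform, so their
    // mean is about 0.5 with standard error sqrt(1/12) / sqrt(count), and the
    // standardized value below behaves like a draw from N(0, 1).
    double mean = sum / count;
    double sigma = (mean - 0.5) / (Math.sqrt(1.0 / 12.0) / Math.sqrt(count));
    System.out.println("selected " + count + " signatures, sigma = " + sigma);
  }
}
```

A timing side channel would only show up here if the timing depended on k; in that case the selected subset would no longer have mean 0.5 and sigma would grow with the sample size.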

@ascarpino
Author

> If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should be a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.

I would absolutely disagree with the premise that external factors like warmup, GC, server load, etc. do not change the distribution. With the randomness of K, noise could be introduced at unfortunate times. That does not show a weakness in the implementation; it shows a weakness in the test. The test does try to mitigate some of this with a large allowance for sigma. But as the results above show, it can be hard for that allowance to overcome a 10x performance difference if certain lengths of K occur at the wrong time. For example, there may be 100 small K values in a test run, or 1000. Many of those small K's may fall in the first half of the test or in the latter half.

The lack of a warmup also fails to take intrinsics into consideration. Once the C2 compiler decides the method is hot, the intrinsic changes the timings and disrupts the distribution of the results. That is not a weakness in the implementation; it is a failure to test under normal system operation.

> If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.

The results were generated with the 50,000 iterations that the Wycheproof test uses. It's true that more iterations would reduce the influence of noise, but a warmup would remove the biggest source of noise without making the test run significantly longer.

> There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also help to improve the test.

Accept what I suggested or not, that is your decision.

@bleichenbacher-daniel

The point I wanted to make is that there cannot be a test failure because of noise. If the implementation is correct, then the expected result will be close to a normal distribution with variance 1.

Too much noise can of course hide timing leaks. By selecting the signatures with the shortest timing, the test eliminates the biggest influences of noise without the need to examine the environment. Slow signatures during startup or during garbage collection are most likely not used. As long as their number is small, they don't have a significant influence. Also, if the test becomes slower in the middle because of heavy other load on the machine, then the result will be computed just from the 25'000 signatures generated during the quiet time. This can miss a bias, but it can't lead to false positives.

The current setup of the test is for continuous testing. If a randomized test is repeated many times then it is important that the probability of false positives is small. Hence the large threshold. For other usages it might be reasonable to use a smaller threshold.
