
Optimization #24

Open
ghost opened this issue May 8, 2020 · 39 comments

Comments

@ghost

ghost commented May 8, 2020

I changed the starting points and limited the number of kangaroos. I am currently working on the CPU version and got the following results:

CPU i3, 534.2K j/s, 64-bit key, 100 tests - mean runtime ~ 3 hours

I should explain how the tests were conducted so that there are no questions about it. Each time a key was found, a new random 64-bit key was generated and the search algorithm was run again.

Now I'm busy porting the code to OpenCL; this is a big and complicated job for me.

I think the key-finding time can still be improved by calculating the correct jump size for each kangaroo.

Any ideas?

@CatfishCrypt

How did you do this:
"Each time a key was found, a new random 64-bit key was created and the search algorithm was run again"
The random part, more specifically?

@ghost
Author

ghost commented May 8, 2020

the random part more specifically?

  1. Generate a random number in the range 2^64:2^65
  2. Create a secp256k1 point from the generated number
  3. Run the kangaroo search
  4. After the key is found and verified, repeat

I did this to understand how much faster the algorithm works with my changes. The standard algorithm completes on average in 5 hours.

@CatfishCrypt

Interesting... do you have to manually load each new key and then run Kangaroo, or do you have this automated somehow?

@ghost
Author

ghost commented May 9, 2020

I have automated this. Everything is very simple there: I added a loop and removed the command-line arguments.

@CatfishCrypt

Simple for someone who knows their way around programming :)
Do you feel you are finding keys faster this way?

@JeanLucPons
Owner

Hi,

Please be more precise:
Did you translate (+2^64) the start range of the Tame or the Wild?
Are your 100 keys the same or uniformly distributed?
To compute an average it is important to have keys uniformly distributed in the range.

You can speed up the search by spreading the wild over [-N/8, N/8], or even [-N/16, N/16]:
see the CreateHerd() function in Kangaroo.cpp.
This improves the overlap between tame and wild; by compressing the wild you can achieve up to 1.25sqrt(N), but this has side effects when the key to search is near the border of the range, and it also increases collisions between wilds...

@CatfishCrypt

Jean Luc,

Using the Div8 in CreateHerd?

@JeanLucPons
Owner

Yes, replace the code at Kangaroo.cpp:574-579

with:

    if((j + firstType) % 2 == TAME) {
      // Tame in [0..N]
      d[j].Rand(rangePower);
    } else {
      // Wild in [-N/8..N/8]
      d[j].Rand(rangePower-2);
      d[j].ModSubK1order(&rangeWidthDiv8);
    }

You win a lot for keys around the center, but keys on the border will take longer to solve; on average you still win.

@CatfishCrypt

Can it be put at lines 581:586? Lines 574:579 are grayed out; that's the symmetry code.

@CatfishCrypt

Current code:

#ifdef USE_SYMMETRY

		// Tame in [0..N/2]
		d[j].Rand(rangePower - 1);
		if ((j + firstType) % 2 == WILD) {
			// Wild in [-N/4..N/4]
			d[j].ModSubK1order(&rangeWidthDiv4);
		}

#else

		// Tame in [0..N]
		d[j].Rand(rangePower);
		if ((j + firstType) % 2 == WILD) {
			// Wild in [-N/2..N/2]
			d[j].ModSubK1order(&rangeWidthDiv2);
		}

#endif

@JeanLucPons
Owner

JeanLucPons commented May 9, 2020

It is indeed in the second part, 574:579; USE_SYMMETRY is not defined.
If in your code it is at 581:586, you may have added lines of code?

  // Choose random starting distance
  if(lock) LOCK(ghMutex);

  for(uint64_t j = 0; j<nbKangaroo; j++) {

#ifdef USE_SYMMETRY

    // Tame in [0..N/2]
    d[j].Rand(rangePower - 1);
    if((j+ firstType) % 2 == WILD) {
      // Wild in [-N/4..N/4]
      d[j].ModSubK1order(&rangeWidthDiv4);
    }

#else

    if((j + firstType) % 2 == TAME) {
      // Tame in [0..N]
      d[j].Rand(rangePower);
    } else {
      // Wild in [-N/8..N/8]
      d[j].Rand(rangePower-2);
      d[j].ModSubK1order(&rangeWidthDiv8);
    }

#endif

    pk.push_back(d[j]);

  }

@ghost
Author

ghost commented May 9, 2020

Do you feel you are finding keys faster this way?

Yes

@ghost
Author

ghost commented May 9, 2020

Hi @JeanLucPons

Are your 100 keys the same or uniformly distributed?
Did you translate (+2^64) the start range of the Tame or the Wild?
You can speed up the search by spreading the wild over [-N/8, N/8], or even [-N/16, N/16]
This improves the overlap between tame and wild; by compressing the wild you can achieve up to 1.25sqrt(N), but this has side effects when the key to search is near the border of the range, and it also increases collisions between wilds...

I generated the keys randomly; duplicates were excluded. I also saved the keys to a file and ran the test again to make sure there were no duplicate keys and the results were correct.
The results show a significant increase in key-finding speed.

I also tested keys in smaller ranges (2^32 - 2^63); in all ranges the key-finding speed increases.

I use the four-kangaroo method. The problem is that the more kangaroos you use, the slower a key is found; the optimum is 4 kangaroos, 2T and 2W. You can find research on this.
Now, using 4 kangaroos, we can parallelize their work by creating groups (herds) in which 4 kangaroos work.
For maximum speed, all herds should work on the same key, but with different initial parameters.
I'm working on the parallelization right now.

@CatfishCrypt

@AndrewBrz

when you say 4 kangaroos, do you mean 4 herds of kangaroos? I've read something to that effect: T1, T2, W1, W2. How are you implementing that?

@ghost
Author

ghost commented May 9, 2020

@CatfishCrypt
4 kangaroos is 1 herd: T1, T2, W1 and W2

@CatfishCrypt

Wow... that doesn't seem right, mathematically speaking. A huge space, with only 4 kangaroos versus thousands or millions.

@JeanLucPons
Owner

Ok

Run tests on a 40- or 48-bit range with a large number of trials to get a good average estimate. Note that the error is proportional to 1/sqrt(trials). Rather than a time comparison, give the ratio to sqrt(N): the standard method is 2.08sqrt(N); for 4 kangaroos (non-parallel) it is around 1.78sqrt(N).

Also take into consideration that the GPU handles a large number of kangaroos: some users with several GPUs use 2^24 or more kangaroos with a large distinguished-bit number on a large range.
GPU performance depends heavily on access to global memory, so the fewer global-memory accesses you have, the better.

I will soon work on a distributed version (client/server) where the server will handle DP and collision checks.

@ghost
Author

ghost commented May 9, 2020

Wow... that doesn't seem right, mathematically speaking. A huge space, with only 4 kangaroos versus thousands or millions.

You can create many kangaroo herds, but they should all work independently of each other; you could try adding communication between them (I still have no idea how to do this without sacrificing speed). There should be 4 kangaroos in one herd. This is mathematically correct.

@ghost
Author

ghost commented May 9, 2020

Run tests on a 40- or 48-bit range with a large number of trials to get a good average estimate. Note that the error is proportional to 1/sqrt(trials). Rather than a time comparison, give the ratio to sqrt(N): the standard method is 2.08sqrt(N); for 4 kangaroos (non-parallel) it is around 1.78sqrt(N).

I will do all the tests when I finish the OpenCL version.

@CatfishCrypt

@JeanLucPons

The server version is exactly what I was thinking! That would be awesome... and I believe it would reduce solution time by Eleventy Gabillion sqrt(N) :)

@JeanLucPons
Owner

Run tests on a 40- or 48-bit range with a large number of trials to get a good average estimate. Note that the error is proportional to 1/sqrt(trials). Rather than a time comparison, give the ratio to sqrt(N): the standard method is 2.08sqrt(N); for 4 kangaroos (non-parallel) it is around 1.78sqrt(N).

I will do all the tests when I finish the OpenCL version.

Are you writing the OpenCL version for the standard method or for the 4-kangaroo method?

@JeanLucPons
Owner

JeanLucPons commented May 9, 2020

I tested the Wild in [-N/8..N/8]: 1000 trials, 40-bit search.
2.139 is the estimate of the expected average, taking into account the DP overhead.

Original code:
-t 2 -d 5 in40_1000.txt
[999] 2^21.926 Dead:5 Avg:2^21.147 DeadAvg:1.4 (2.214sqrt(N) 2.139sqrt(N))

With the modification above:
-t 2 -d 5 in40_1000.txt
[999] 2^20.585 Dead:3 Avg:2^20.939 DeadAvg:2.6 (1.917sqrt(N) 2.139sqrt(N))

Gain: 15%.
2 times more dead kangaroos due to the compression of the wild.
This trick works only for the parallel version with enough kangaroos.

@ghost
Author

ghost commented May 9, 2020

Are you writing the OpenCL version for the standard method or for the 4-kangaroo method?

Standard, plus the three-kangaroo and four-kangaroo methods. There are no big changes in the code, so you can add and remove kangaroos for tests.

@ghost
Author

ghost commented May 9, 2020

Gain: 15%.
2 times more dead kangaroos due to the compression of the wild.

How many kangaroos were used, T and W?

@ghost
Author

ghost commented May 9, 2020

@JeanLucPons
Change this line:
bool Point::equals(Point &p) { return x.IsEqual(&p.x) && y.IsEqual(&p.y) && z.IsEqual(&p.z); }
to
bool Point::equals(Point &p) { return x.IsEqual(&p.x); }

This change checks only the X coordinate. Another optimization is to compute only the X coordinate, which speeds up the calculations (there is published research on this that you can find and read).
There are many more optimizations for jumps and starting points.

@CatfishCrypt

What changes have you experimented with for starting and jump points? I'm more interested in how you changed your start points: do you change them to start somewhere other than around the center?

@JeanLucPons
Owner

Thanks Andrew for your work ;)

  • There were 1024T and 1024W.
  • Concerning equals(), I prefer to keep it as it is: if you hit a symmetric point (even without using symmetry), the check against keyToSearch and keyToSearchNeg will fail and a point can be missed. If you want to make such an optimization, do it locally using Int::IsEqual(). For instance, in the hashTable this optimization is done locally and only a part of x is tested.
  • Concerning jumps, take care to keep compatibility with work files. Of course, depending on the method used they probably won't be compatible, but it would be great if 2 users using the same method could share work files. There is a version field that can be used for this. I noticed that I missed the version check in 1.4, F.ck!

@kpot87

kpot87 commented May 10, 2020

@AndrewBrz if you make AMD support, please share it with others

@DvR4

DvR4 commented May 13, 2020

if you make AMD support, please share it with others

I don't think he will share it. He has not responded since your proposal.

@JeanLucPons
Owner

That's too bad. I hope he will share his work.

@ghost
Author

ghost commented May 14, 2020

When I finish working on the algorithm, I will share the code.

@dem10

dem10 commented May 14, 2020

When I finish working on the algorithm, I will share the code.

That's great. I'm glad you took the opportunity to make a tool for red (AMD) cards.

@kpot87

kpot87 commented Jun 1, 2020

@AndrewBrz hi! Any news about AMD support? Thanks

@ghost
Author

ghost commented Jun 1, 2020

@kpot87 hi!
Yes, I am testing and fixing bugs. The OpenCL kernel shows excellent results; I implemented it in the official Bitcoin library. But there are still bugs that need to be fixed.
I don't have much free time to finish it quickly.

These are the results of the kernel testing; they indicate that the kernel is optimized as much as possible:
[screenshot: kernel test results]

@JeanLucPons
Owner

Hi Andrew,
Great news.
Keep up the good work !

@RB61

RB61 commented Jun 3, 2020

I changed the starting points and limited the number of kangaroos. I am currently working on the CPU version and got the following results:

CPU i3, 534.2K j/s, 64-bit key, 100 tests - mean runtime ~ 3 hours

I should explain how the tests were conducted so that there are no questions about it. Each time a key was found, a new random 64-bit key was generated and the search algorithm was run again.

Now I'm busy porting the code to OpenCL; this is a big and complicated job for me.

I think the key-finding time can still be improved by calculating the correct jump size for each kangaroo.

Any ideas?

Hi... any progress on the OpenCL version?
The framework seems to be ready (brichard19/BitCrack), and the necessary basic functions are ready (clMath/Secp256k1.cl).

@ghost
Author

ghost commented Jun 3, 2020

Hi @RB61,
I wanted to use the BitCrack OpenCL code, but after testing I decided against it. Their code is not stable; my code shows better results. I'm testing and making some changes right now. Some work remains.

This is the BitCrack OpenCL kernel:
[screenshot: BitCrack OpenCL kernel results]

@RB61

RB61 commented Jun 8, 2020

Hi @RB61,
I wanted to use the BitCrack OpenCL code, but after testing I decided against it. Their code is not stable; my code shows better results. I'm testing and making some changes right now. Some work remains.

This is the BitCrack OpenCL kernel:
[screenshot: BitCrack OpenCL kernel results]

The OpenCL code has a huge bug that brichard19 has not fixed. The CUDA version works fine.
See here:
brichard19/BitCrack#223

@RB61

RB61 commented Jun 21, 2020

Hi @AndrewBrz ... any progress?


6 participants