Optimized for some nvidian cards #228

Closed

davilizh

@davilizh davilizh commented Jun 20, 2017

This code is optimized for the GTX1060: it improves performance by 15% on a GTX1060 with 2 GPCs and by more than 30% on a GTX1060 with 1 GPC. It also increases performance by 3% on the GTX1070 and by 2% on the Tesla M60, and should benefit other chips as well.

  1. ethash_cuda_miner_kernel.cu

We have commented out __launch_bounds__ in the code. Launch bounds are discussed in detail in http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4fzSzZc9p.

  2. dagger_shuffle.cuh
     1. We moved variable definitions around and reduced them to the minimum required. The compiler should have been able to do this analysis itself, but it never hurts to help it out.
        The state in compute_hash of dagger_shuffle.cuh is modified.
     2. We simplified the nested if/else blocks into a switch statement.
     3. We simplified the control flow: the conditional is removed from the inner loop so that all threads calculate the value, and then all threads use a __shfl to read thread t's value (throwing away the values calculated by the other threads).
     4. We increased the total number of LDGs (global loads) to increase occupancy. We define PARALLEL_HASH so that each warp has PARALLEL_HASH LDGs in flight at a time, rather than 1 at a time as in the original code.
        In the original code, every thread is the master for calculating one hash value. Each thread initializes its version of state using keccak_f1600_init. Then, in the main loop: when i=0, threads 0-7 copy the values of thread 0's state[0-7] into each thread's shuffle[0-7], do the main computation, and then thread 0 captures the result of shuffle[0-3] into state[8-11]. On the next iteration, when i=1, threads 0-7 copy the values of thread 1's state[0-7] into each thread's shuffle[0-7], do the main computation, and then thread 1 captures the result of shuffle[0-3] into state[8-11].
        With the modification and PARALLEL_HASH=2, this changes as follows: when i=0, threads 0-7 copy the values of thread 0's state[0-7] into each thread's shuffle[0][0-7] and thread 1's state[0-7] into each thread's shuffle[1][0-7]. They do the main computation on these 2 shuffle vectors in parallel. Then thread 0 captures the result of shuffle[0][0-3] into its state[8-11] and thread 1 captures the result of shuffle[1][0-3] into its state[8-11].
  3. keccak.cuh

Since the input argument uint2 *s is changed in dagger_shuffle.cuh, we have to modify keccak_f1600_init and keccak_f1600_final in keccak.cuh accordingly.
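
The scheduling change described for PARALLEL_HASH can be modeled on the CPU. The Python sketch below is not the kernel code: `mix` is a made-up stand-in for the real DAG load and mixing step, the XOR reduction is a placeholder, and the names `GROUP` and `run_group` are illustrative only. It simulates one group of 8 threads and checks that each thread ends up with the same result whatever the PARALLEL_HASH value, while the number of independent loads per step grows with it.

```python
GROUP = 8  # threads that cooperate on one hash (threads 0-7 in the text)


def mix(seed, lane):
    """Stand-in for the per-lane DAG load plus mixing (NOT the real Ethash math)."""
    return ((seed * 0x01000193) ^ (lane + 1)) & 0xFFFFFFFF


def run_group(states, parallel_hash):
    """Simulate one 8-thread group.

    states[t] is thread t's private state word. Returns the per-thread
    results and the peak number of independent loads counted per step.
    """
    assert GROUP % parallel_hash == 0
    results = [None] * GROUP
    peak_loads = 0
    for i in range(0, GROUP, parallel_hash):
        # Broadcast step (the __shfl in the kernel): every lane copies the
        # state of threads i .. i+parallel_hash-1 into its own shuffle[p].
        shuffle = [[states[i + p]] * GROUP for p in range(parallel_hash)]
        # Main computation: each lane processes all parallel_hash vectors,
        # which the model counts as parallel_hash independent loads per lane.
        mixed = [[mix(shuffle[p][lane], lane) for lane in range(GROUP)]
                 for p in range(parallel_hash)]
        peak_loads = max(peak_loads, parallel_hash)
        # Capture step: thread i+p keeps the (placeholder) reduction of vector p.
        for p in range(parallel_hash):
            acc = 0
            for lane in range(GROUP):
                acc ^= mixed[p][lane]
            results[i + p] = acc
    return results, peak_loads


states = [0x1000 + t for t in range(GROUP)]
serial, loads_1 = run_group(states, parallel_hash=1)
paired, loads_2 = run_group(states, parallel_hash=2)
print(serial == paired, loads_1, loads_2)
```

The point of the model is only that grouping iterations changes how many loads are outstanding at once, not what each thread computes.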

…formance by 15%, and GTX1060 with 1 GPC performance by more than 30%. Meanwhile, it also increases performance on GTX1070 by 3%, on Tesla M60 by 2%, and should also benefit other chips. However, we also found a 5% decrease on the Nvidia GRID K520.
@davilizh
Author

The first commit is wrong: I added the 3 files to the root directory, "cpp-ethereum", whereas the correct path is "cpp-ethereum/libethash-cuda". The second commit fixes this issue.

@zachgrayio

@davilizh I'm curious what the CUDA hashrate looks like for the GRID K520 with current builds, got any benchmarks to share?

@davilizh
Author

I do not have a K520 at hand; it was tested by someone in another community.
The result is as follows.
However, I don't think the result is entirely reliable, as the min value is 0. I will ask him to test again if I can get in contact with him.

"min/mean/max: 0/4858402/7689557 H/s
inner mean: 5534151 H/s"

@FUNtasticOne

I would love to test this build with my GTX1060 graphics cards. I'm not a developer or programmer, and I'm completely new to GitHub, so I hope my comment here is not too annoying.
I tried to build this repository as described in the instructions. There is a "getstuff.bat", which should download some required files. Unfortunately, this isn't working. I looked into the batch file, and I can't even download the files manually, because the site is not reachable.
So I would like to ask, cautiously, whether the compiled release can be downloaded from somewhere?
Again, if my question is out of place, I beg your pardon.
Regards

@Cyclenerd

Hi,

compiled it successfully under Ubuntu 16.04.2 with CUDA 8 and NVIDIA drivers 381.22:

apt-get install -y cuda-command-line-tools-8-0
git clone "https://github.com/davilizh/cpp-ethereum.git"
cd cpp-ethereum
git checkout optimized_for_some_nvidian_cards
mkdir build
cd build
cmake -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0 -DBUNDLE=cudaminer ..
make -j8

I have 2 GTX 1060 graphics cards:

name, power.draw [W], fan.speed [%], temperature.gpu, clocks.current.video [MHz], clocks.current.memory [MHz]
GeForce GTX 1060 6GB, 70.14 W, 70 %, 61, 1328 MHz, 4201 MHz
GeForce GTX 1060 6GB, 70.79 W, 70 %, 53, 1417 MHz, 4201 MHz

Previously, each line showed 41.94 MH/s. With your patch, it more often shows 47.19 MH/s, so it works for me. Do I need to adjust anything on the graphics or memory clock to make 47 stable?

MANY THANKS!

Cyclenerd added a commit to Cyclenerd/ethereum_nvidia_miner that referenced this pull request Jun 25, 2017
With this update Genoil/cpp-ethereum#228,
ethminer will need the CUDA8 tools. We already have the 381.22 drivers.
CUDA 8.0 comes with a driver version 375... that doesn't support the
GTX 1080 Ti. As a result, installing CUDA from apt-get doesn't work
since it installs this driver version. Thus, you have to install only
the `cuda-command-line-tools-8-0` to opt-out of installing the driver.
@Genoil
Owner

Genoil commented Jun 25, 2017

Wow nice one! I don't have any Nvidia cards anymore to test, but it is tempting to merge! I suggest you submit this to @chfast's fork too, if you haven't already.

@davilizh
Author

davilizh commented Jun 26, 2017

@Cyclenerd Thank you very much, that is good to know. I don't think you need to adjust the graphics or memory clock; the chip boosts to its maximum working frequency automatically when the workload is large. But remember not to let it overheat. We find that performance degrades if the temperature is too high.

@FUNtasticOne Which operating system are you using? On Windows, you can download the exe from here: https://ci.appveyor.com/project/ethereum-mining/ethminer/build/93/job/ss7k95dsy1kly4vl/artifacts. On Linux, follow Cyclenerd's steps.

@Genoil It would be great if my code could be merged into the master branch.
Thank you for your advice. Actually, I have already created a pull request to chfast's fork: ethereum-mining/ethminer#18 (comment).

@FUNtasticOne

FUNtasticOne commented Jun 26, 2017

Thank you @davilizh, I'm on Windows and just downloaded the exe.

For me there is no improvement in the hash rate.
I tested with just one Palit GTX1060 with 6GB RAM.
I first tested with NVIDIA driver version 376.53 on Windows 10 x64.
Even after upgrading the driver to the latest version (382.53) there is no improvement.
The hash rate is almost exactly 20 MH/s with the slightly overclocked card, both before and after changing the exe file.
Of course I'm using the CUDA interface for mining.

Please let me know if you're interested in additional information.

Thank you again!

@davilizh
Author

@FUNtasticOne Can you check the runtime working frequency of your DRAM? Mine runs at 4.5 GHz, and I guess that yours is 4.0 GHz (from https://www.lelong.com.my/9734-palit-gtx1060-jetstream-6gb-ddr5-192bit-gtx1060-cwchoo85-188799203-2017-04-Sale-P.htm).

@FUNtasticOne

FUNtasticOne commented Jun 26, 2017

@davilizh My DRAM frequency is 4.0 GHz at stock, overclocked to 4.1 GHz. GPU-Z reports 2052 MHz, which has to be doubled.

@seedlord

seedlord commented Jun 26, 2017

Will this update make it into the releases section?
I can't compile it myself and would like to try it out on a GB GTX1070 Gaming G1.
Currently I am at 30.9 MH/s with 88% power load and an 8808 MHz memory clock.

@chfast

chfast commented Jun 26, 2017

@seedlord, for Windows you can test this build: https://ci.appveyor.com/project/ethereum-mining/ethminer/build/93/job/ss7k95dsy1kly4vl/artifacts.

@seedlord

Thanks for the link.
No difference for me. Still 30.9 MH/s.

@FUNtasticOne

FUNtasticOne commented Jun 26, 2017

As I wrote above, for me there isn't any difference with the downloadable exe either.
davilizh wrote that the first commit was wrong. Could it be that the downloadable build is the faulty one?
Maybe someone who compiled it themselves and saw better performance can check the downloadable file on Windows?

@seedlord

The Windows exe is dated 9th June. The optimized code came out 6 days ago, so shouldn't it be dated 20th June?

@chfast

chfast commented Jun 26, 2017

The build comes from a PR to ethminer: ethereum-mining/ethminer#18, but I'm guessing this one here implements the same optimization.

@ghost

ghost commented Jun 26, 2017

Okay, I've done some testing -- 6x1060 3GB:
Claymore 9.5 -> 136 MH/s on screen + constant devfee DAG swap
Your optimized version -> 150 MH/s
I used the Windows build downloaded from the posts above to test, as the included .bat file doesn't work (connection timeout to the download page).
Good job mate!

@ggilyeat

@deadgray What are your OC settings on your 3Gs?

I've gone from ~19.5 (~117) with claymore to just under 23 (137) with the enhancements.

Very, very impressed, sirs.

@ghost

ghost commented Jun 26, 2017

@ggilyeat +100 core, +860 mem; the cards have Samsung memory.
Ah, and power is set to 67%; the whole rig uses just 540W from the wall. AMD miners can only dream of that power efficiency. :-D

@juliotec

juliotec commented Jun 26, 2017

How can I display the average hashrate, to see how much it increases with this patch? I'm testing with my Asus 1060 Dual OC 3GB.

@juliotec

juliotec commented Jun 26, 2017

I tested my Asus GTX 1060 Dual OC 3GB

Specs:

GPU Power: only 80%

GPU Clock: 2000 MHz (OC Stable)

Memory Clock: 9300 MHz (OC Stable)

Ethminer

before

ethminer -U -M

Trial 1... 22953902
Trial 2... 22976764
Trial 3... 23022626
Trial 4... 23045626
Trial 5... 22999672

min/mean/max: 22953902/22999718/23045626 H/s
inner mean: 22999687 H/s

after

ethminer -U -M

Trial 1... 24377389
Trial 2... 24385488
Trial 3... 25032318
Trial 4... 24361208
Trial 5... 24417937

min/mean/max: 24361208/24514868/25032318 H/s

inner mean: 24393604 H/s

Increase: 6%

Claymore's v9.4

Same specs

22 MH/s

So ...

Ethminer: 24.393604 MH/s

Claymore: 22 MH/s without dev fee (using a proxy named nofee 5.0)

Increase: 10.88%

Ethminer wins
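
The summary lines and percentages above can be reproduced from the ten trial values. This is a small Python sketch under two assumptions that are not stated in the thread but happen to match the printed numbers: "inner mean" is taken as the mean after dropping the single lowest and highest trial, and division truncates to an integer. The function name `summarize` is illustrative.

```python
def summarize(trials):
    """Return (min, mean, max, inner_mean) in H/s, using truncating division."""
    ordered = sorted(trials)
    mean = sum(ordered) // len(ordered)
    inner = ordered[1:-1]  # assumption: drop the lowest and the highest trial
    inner_mean = sum(inner) // len(inner)
    return ordered[0], mean, ordered[-1], inner_mean


# Trial values copied from the two benchmark runs above.
before = [22953902, 22976764, 23022626, 23045626, 22999672]
after = [24377389, 24385488, 25032318, 24361208, 24417937]

b = summarize(before)
a = summarize(after)

# Relative gains based on the inner means; 22 MH/s is the Claymore figure quoted above.
gain_vs_before = (a[3] / b[3] - 1) * 100
gain_vs_claymore = (a[3] / 22_000_000 - 1) * 100

print(b)
print(a)
print(f"{gain_vs_before:.2f}% {gain_vs_claymore:.2f}%")
```

Comparing inner means rather than plain means keeps a single outlier, such as trial 3 of the second run, from inflating the gain.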

@marvykkio

marvykkio commented Jun 26, 2017

Which version of Genoil's miner should I use? I just replaced the .exe file in version
1.1.7 and it does not work.
Please help.
4x 1070 Gigabyte G1
Windows 10 64-bit

@juliotec

Use this one: https://ci.appveyor.com/api/buildjobs/ss7k95dsy1kly4vl/artifacts/build%2Fethminer-0.11.0.dev0-Windows.zip

I think the changes have not been merged into master yet.

@marvykkio

I have downloaded it, but I am missing the ethminer version to go with it.
Which one should I take: 1.1.6, 1.1.7, or Genoil's 0.9.41?
Which one do I have to use?

@marvykkio

Error

VCruntime is also missing

@ghost

ghost commented Jun 26, 2017

@marvykkio you can find those files in your system, or in Visual Studio

@Genoil Genoil added the wontfix label Jun 26, 2017
@davilizh
Author

@deadgray Terribly sorry, I hope this does not affect your use of the code. It seems that someone else has helped me fix it and created a new pull request.

@davilizh
Author

@Genoil Hi, Genoil. Is there any possibility that this code could be merged into your master branch? I have added a switch named "--cuda-parallel-hash" to enable and disable my optimization, and the code is now merged into chfast's master branch. This switch lets people scale parallel-hash from 1 to 8 to find the best value for their card.
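
Finding the best value for a given card could be automated along these lines. This is a hedged Python sketch, not part of ethminer: it assumes that `--cuda-parallel-hash` can be combined with the `-U -M` benchmark invocation used earlier in this thread, that the benchmark prints an "inner mean: N H/s" line as shown above, and that 1, 2, 4, and 8 are sensible candidates (the comment only says "from 1 to 8"). The helper names are hypothetical.

```python
import re
import subprocess

INNER_MEAN = re.compile(r"inner mean:\s*(\d+)\s*H/s")


def parse_inner_mean(output):
    """Extract the 'inner mean' hashrate (H/s) from benchmark output, or None."""
    match = INNER_MEAN.search(output)
    return int(match.group(1)) if match else None


def benchmark_command(parallel_hash, binary="ethminer"):
    # -U (CUDA) and -M (benchmark) appear elsewhere in this thread; combining
    # them with --cuda-parallel-hash in one call is an assumption.
    return [binary, "-U", "-M", "--cuda-parallel-hash", str(parallel_hash)]


def sweep(values=(1, 2, 4, 8), run=None):
    """Run the benchmark once per value and return {value: inner mean or None}."""
    if run is None:
        def run(cmd):
            done = subprocess.run(cmd, capture_output=True, text=True)
            return done.stdout + done.stderr
    return {v: parse_inner_mean(run(benchmark_command(v))) for v in values}


def best(results):
    """Pick the setting with the highest measured hashrate."""
    measured = {k: v for k, v in results.items() if v is not None}
    return max(measured, key=measured.get)


# Example (needs ethminer on PATH and a CUDA card): print(best(sweep()))
```

The `run` parameter exists so the parsing and selection logic can be exercised with canned output, without a GPU.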

@ghost

ghost commented Jun 28, 2017

@davilizh good joke :-)
Anyway, I tested chfast's version with your great patch included. On the 1060 it seems a little slower than the older compiled version from the AppVeyor link above, and on the 1080 there is almost no difference. Of course I tried the new switch.

@ghost

ghost commented Jun 28, 2017

@davilizh
Author

@deadgray What's your DRAM frequency?

@ghost

ghost commented Jun 28, 2017

@davilizh +860 MHz, so 4665 MHz.

@Genoil
Owner

Genoil commented Jun 28, 2017

@davilizh Hi David, I have chosen to cease further development of the fork, so it's unlikely that the patch will be applied. I don't think you have to worry that people won't be able to find the new ethminer fork by Pavel.

That said, I am looking at your code and find the term PARALLEL_HASH slightly confusing. As far as I'm concerned the hashes were already done in parallel in the first place. The fundamental difference is that rather than doing a single coalesced global read of 128 bytes (8 threads * uint4) for a single hash, you do 4 in series of 32 bytes (2 threads * uint4), for 2 different hashes. I guess it's the reduced memory bus width of GP106 that makes this more efficient. Nevertheless I like it very much.

@davilizh
Author

@deadgray I do not know why. Maybe you have a different DRAM type. BTW, if you are already reaching the maximum DRAM bandwidth, my code cannot push performance above that limit.
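
The bandwidth ceiling mentioned here can be estimated roughly. The Python sketch below rests on assumptions that are not confirmed in this thread: a 192-bit memory bus (the figure in the product link above), two transfers per clock at the DRAM frequencies quoted in this discussion, and no overhead of any kind. The Ethash side uses 64 DAG accesses of 128 bytes per hash. The function name is illustrative.

```python
ACCESSES_PER_HASH = 64   # Ethash DAG lookups per hash
BYTES_PER_ACCESS = 128   # bytes read per lookup
BUS_WIDTH_BITS = 192     # assumption: GTX1060 bus width from the product link


def bandwidth_bound_mhs(dram_ghz, bus_bits=BUS_WIDTH_BITS, transfers_per_clock=2):
    """Upper bound on hashrate (MH/s) if DRAM bandwidth is the only limit."""
    # assumption: data rate = quoted DRAM frequency x transfers_per_clock
    bytes_per_second = dram_ghz * 1e9 * transfers_per_clock * bus_bits / 8
    hashes_per_second = bytes_per_second / (ACCESSES_PER_HASH * BYTES_PER_ACCESS)
    return hashes_per_second / 1e6


for ghz in (4.0, 4.5):
    print(f"{ghz} GHz -> {bandwidth_bound_mhs(ghz):.1f} MH/s")
```

Under these assumptions a card already close to the computed bound has little left to gain from issuing more loads, which would be consistent with the mixed results reported in this thread.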

@ghost

ghost commented Jun 28, 2017

@davilizh Even if I can't get more, I'm quite impressed with your patch. Good work: my total mining hashrate is now up 30 MH/s without any investment, which is worth a new 1070 card :-)

@azazhu

azazhu commented Jun 28, 2017

@Genoil, we just need to make sure that each 32B access is coalesced. From the texture unit's point of view, both use the full texture bandwidth. For the downstream units, we can issue more load instructions to saturate the memory.

@Genoil
Owner

Genoil commented Jun 28, 2017

@azazhu I know and I might have even implemented something similar in the last few years, but either in the wrong way or the hardware I used didn't benefit from this. Nice to see such a dramatic improvement 2 years later.

@azazhu

azazhu commented Jun 28, 2017

@Genoil The arch has been changed a lot :)

@Genoil
Owner

Genoil commented Jun 28, 2017

@azazhu @davilizh is 'we' you two or is that 'us' in general? Just curious :)

@diversuss

diversuss commented Jun 28, 2017

Hi, @davilizh! Is there any way to apply this to Claymore on Windows?
I tried it on ethermine with ethminer. The console showed 24.5 MH/s, but the dashboard on ethermine.org showed only 17 MH/s. If I use Claymore with the same overclocking, I see 23.3 MH/s in the console and the same on the dashboard.

@vitoth

vitoth commented Jun 28, 2017

Tested on windows 10 with 1060 6g, OCed +130 +870
not much difference from claymore, i still need to see 6hr average on nanopool to compare.

Should I try on Linux?
What are the best settings for bat file?
Is nanopool ok?
Are there any instructions/tutorial how to set it up for best performance?

@davilizh
Author

@diversuss Sorry, I do not know, because I do not know the details of Claymore.

@justchil

I'm seeing an improvement over claymore on my 1070/1060 machine.

This is a bit OT and I am still searching, but is there any way to have the reported hashrate show up with this miner? The calculated hashrate is looking better on nanopool :)

@chavvdarrr

Is there any way to monitor the miner and take action if it stops?
It's not stable, but I can't find a switch for auto-monitoring, or a way to do it via the API.

@MichaelA2014

Hey guys. I am trying to run it on 4x1070 cards in Windows 10. I get "Application was not able to start correctly (0xc000007b). Click OK to close application"
How do I fix that?
My bat file is as follows:
`setx GPU_FORCE_64BIT_PTR 0
setx GPU_MAX_HEAP_SIZE 100
setx GPU_USE_SYNC_OBJECTS 1
setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_SINGLE_ALLOC_PERCENT 100

ethminer.exe --farm-recheck 200 -U -S eth-us-west1.nanopool.org:9999 -FS eth-us-east1.nanopool.org:9999 -O mywallet.rigname`

Thank you.
https://s18.postimg.org/5c7cbh3d5/2017-06-29_7-08-21.jpg

@chavvdarrr

Is your Windows 32 or 64-bit? (64b is a must)
Also I had to download&install vc++ 2015 x64 runtime

@MichaelA2014

MichaelA2014 commented Jun 29, 2017

My Windows is 10 x64.
I installed the VC++ 2015 x64 redistributable, but the error is still there.

@MichaelA2014

Do I have to download and install Visual Studio in addition to the VC++ 2015 x64 redistributable?

@chavvdarrr

no
Also check if your video driver is recent

@MichaelA2014

my driver is the latest 382.53, windows 10 x64. visual c++ redistributable is installed

@MichaelA2014

Got it working. For those who are having the same problem as I did (missing files and the 0xc error): download and install the all-in-one runtimes package from http://www.pcgameshardware.de/Windows-Software-122001/Downloads/All-in-One-Runtimes-Download-1164729/

@Genoil Genoil closed this Jun 30, 2017
@azazhu

azazhu commented Jul 5, 2017

@Genoil :)

@Klintistwood

Hello all,

I'm completely new to mining and don't have a lot of stats to compare with, but I started with the previous version of ethminer, where I could get a stable 31 MH/s with my GTX 1070 (with only a memory OC and power brought down to 80%). With this new version I get above 32 MH/s, so there is definitely a very nice improvement!

I do have another question, however. When running the new version, I got a warning from my antivirus saying that there is a Trojan/Win64.BitCoinMiner inside the executable. That was not the case with the previous version. I downloaded the file from here: https://github.com/ethereum-mining/ethminer/releases/tag/v0.11.0rc1

Is there any reason to be worried?

Thanks!

@ghost

ghost commented Jul 5, 2017

@Klintistwood Antivirus software often flags miners because 'viruses' / 'ransomware' (or whatever you want to call it) currently deploy this kind of software on victims' computers to mine coins for the attacker.

This means it's a false-positive alarm - just ignore it. The miner is safe.

If you worry about it you can try to compile the miner yourself - but this is very technical and should be discussed on e.g. the gitter chat and not on this issue bug tracker.

And yes you are right - the newest version has a very good performance improvement,
thanks to the last patches. 👍

Btw: The project moved to a new git repository here: https://github.com/ethereum-mining/ethminer/
