add Tesla V100 benchmark, update gtx1060 75w benchmark

+ minor other readme changes The benchmark shows a few big conclusions : - batch size 16 is a very efficient batch size on Tesla V100, providing an average 900 simulations per move - As of February 2019, PhoenixGo does not benefit from CPU thread number higher than 2 cores / 4 threads on Tesla V100 These numbers may change if PhoenixGo supports newer Tensorflow versions, as well as newer tensorRT versions too
Tencent · Feb 18, 2019 · 04c41e4 · 04c41e4
1 parent 0ee090e
commit 04c41e4
Show file tree

Hide file tree

Showing 3 changed files with 213 additions and 15 deletions.
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -158,8 +158,11 @@ it only if your computation device can handle it
 Some independent speed benchmarks have been run, they are available 
 in the docs :
 
-- for GTX 1060 : 
-[benchmark testing batch size from 4 to 64, tree size up to 2000M, max children up to 512, with tensorRT ON and OFF](/docs/benchmark-gtx1060.md)
+- for GTX 1060 75W (75w power limit) : 
+[benchmark testing batch size from 4 to 64, tree size up to 2000M, max children up to 512, with tensorRT ON and OFF](/docs/benchmark-gtx1060-75w.md)
+
+- for Tesla V100 :
+[benchmark testing batch size from 4 to 128, 4 to 12 vcpu, no tensorrt](/docs/benchmark-teslaV100.md)
 
 #### A10. GTP command `time_settings` doesn't work.
 
@@ -259,6 +262,9 @@ model on V100.
 See: [#75](https://github.com/Tencent/PhoenixGo/issues/75) for 
 how to build TensorRT model.
 
+You can find a speed benchmark for Tesla V100 in 
+[FAQ question](/docs/FAQ.md#a9-what-is-the-speed-of-the-engine--how-can-i-make-the-engine-think-faster-)
+
 ### Specific questions : bazel issues (linux and mac)
 
 #### B0. It is too hard to install bazel or start bazel

diff --git a/docs/benchmark-gtx1060.md → docs/benchmark-gtx1060-75w.md b/docs/benchmark-gtx1060.md → docs/benchmark-gtx1060-75w.md
@@ -1,8 +1,8 @@
 # Benchmark setup :
 
 ## setup
-- hardware : gtx 1060 6gb (1gpu, power limit set to 75W), ryzen r7 1700, 
-16gb ram
+- hardware : gtx 1060 6gb 75W (1gpu, power limit set to 75W), 
+ryzen r7 1700, 16gb ram
 - software : ubuntu 16.04 LTS, nvidia 384, cuda 9.0, cudnn 7.1.4, 
 tensorrt 3.0.4, bazel 0.11.1
 - engine settings: unlimited time per move, all time management settings 
@@ -354,21 +354,27 @@ for example speed gain +12% = 12% less time to calculate a move as compared
 to batch size 4 = 27.5 seconds vs 30.5 seconds per move = 4 seconds 
 difference out of 30.5 seconds 
 
-# CONCLUSIONS for GTX 1060 : 
+# CONCLUSIONS for GTX 1060 75W : 
 
-- TensorRT can increase speed by arround 15%-20% on a GTX 1060 with batch size 4 
-(for bigger batch size with tensorRT, see 
+- TensorRT can increase speed by arround 15%-20% on a GTX 1060 75W with 
+batch size 4 (for bigger batch size with tensorRT, see 
 [#75](https://github.com/Tencent/PhoenixGo/issues/75))
-- bigger batch size significantly increases speed of the engine on a GTX 1060 :
-     -> for batch size 8 , gain = +12%
-     -> for batch size 16 , gain = +33%
-     -> for batch size 24 , gain = +31%
-     -> for batch size 32 , gain = +47%
+- bigger batch size significantly increases speed of the engine on a 
+GTX 1060 75W :
+     -> for batch size 4 to 8 , gain = +12%
+     -> for batch size 4 to 16 , gain = +33%
+     -> for batch size 4 to 24 , gain = +31%
+     -> for batch size 4 to 32 , gain = +47%
+- batch sizes higher than 16 bring significant small increase speed 
+on gtx 1060 75W, but considering the loss of computing accuracy, 
+this is not an efficient choice
+- therefore, the most efficient batch size seems to be 16, providing 
+**an average 210 simulations per second on gtx1060 75W**
 - Compute device (GPU or CPU) utilization is higher with higher batching
-- gtx 1060 is too weak to benefit higher batch sizes than 32
+- gtx 1060 75W is too weak to benefit higher batch sizes than 32
 - number of threads does not significantly change speed of the engine
 - tree size does not significantly change the speed of the engine 
 (arround 5% more time with max tree size)
 
-# TO DO : 
-- i will try higher batch sizes with a Tesla V100 on windows 10
+For comparison, you can refer to 
+[tesla-V100-benchmark](benchmark-teslaV100.md)
diff --git a/docs/benchmark-teslaV100.md b/docs/benchmark-teslaV100.md
@@ -0,0 +1,186 @@
+# Benchmark setup :
+
+## setup
+- hardware : google cloud machine with Tesla V100-SXM2-16GB, 4 and 
+12vcpu (skylake server or later, support avx512, google cloud platform
+does not allow more than 8 vcpu per die, maximum for 1 GPU is 12 vcpu), 
+16gb system ram, 40 gb hdd
+- software : ubuntu 18.04 LTS, nvidia 410, cuda 10.0, cudnn 7.4.2, 
+no tensorrt, bazel 0.11.1
+- engine settings: limited time 60 seconds per move, all other time 
+management settings disabled in config file, all the rest is default 
+settings
+
+## methodology : 
+- most moves come from the same game played using 
+[gtp2ogs](https://github.com/online-go/gtp2ogs), for few moves moves, 
+copy paste stderr output
+- tensorRT is not used with the V100 here, because it would need to 
+build our own tensor model, which was not done here, see 
+[FAQ question](#a13-i-have-a-nvidia-rtx-card-turing-or-tesla-v100titan-v-volta-is-it-compatible-)
+for details
+
+## credits :
+- credit for doing this tests go to 
+[wonderingabout](https://github.com/wonderingabout)
+- credit for providing the hardware goes to google cloud 
+platform
+
+# BATCH SIZE 4 
+
+batch size 4
+tensorrt : OFF
+8 threads
+children : 64
+400M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 4th move(w): pp, winrate=56.125683%, N=20226, Q=0.122514, p=0.728064, v=0.114009, cost 60014.109375ms, sims=22976, height=46, avg_height=12.719517, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 2th move(w): pd, winrate=56.061207%, N=6544, Q=0.121224, p=0.212260, v=0.106201, cost 60021.523438ms, sims=23096, height=30, avg_height=9.461098, global_step=639200
+```
+
+# BATCH SIZE 8
+
+batch size 8
+tensorrt : OFF
+16 threads
+children : 96
+2000M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 8th move(w): nq, winrate=56.569016%, N=27010, Q=0.131380, p=0.591054, v=0.117058, cost 60036.335938ms, sims=32152, height=59, avg_height=14.057215, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 4th move(w): pp, winrate=56.120705%, N=28938, Q=0.122414, p=0.715722, v=0.114932, cost 60016.148438ms, sims=32728, height=54, avg_height=13.301727, global_step=639200
+```
+
+# BATCH SIZE 16 
+
+batch size 16
+tensorrt : OFF
+32 threads
+children : 128
+2000M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 2th move(w): dp, winrate=56.057503%, N=15696, Q=0.121150, p=0.207913, v=0.105841, cost 60048.324219ms, sims=53568, height=34, avg_height=9.826570, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 6th move(w): qn, winrate=56.170525%, N=29628, Q=0.123410, p=0.306212, v=0.111480, cost 60020.058594ms, sims=53968, height=71, avg_height=14.943110, global_step=639200
+```
+
+# BATCH SIZE 32 
+
+batch size 32
+tensorrt : OFF
+64 threads
+children : 128
+2000M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 10th move(w): cp, winrate=65.475777%, N=58821, Q=0.309516, p=0.886150, v=0.196629, cost 60111.078125ms, sims=59444, height=53, avg_height=10.118464, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 8th move(w): qf, winrate=56.714546%, N=55717, Q=0.134291, p=0.613282, v=0.114160, cost 60048.957031ms, sims=62368, height=64, avg_height=13.618808, global_step=639200
+```
+
+# BATCH SIZE 64 
+
+batch size 64
+tensorrt : OFF
+32 threads
+children : 128
+2000M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 12th move(w): bo, winrate=65.815079%, N=64431, Q=0.316302, p=0.884180, v=0.226788, cost 60263.683594ms, sims=64960, height=54, avg_height=12.711725, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 10th move(w): pc, winrate=65.360603%, N=67943, Q=0.307212, p=0.887373, v=0.175454, cost 60165.914062ms, sims=69031, height=63, avg_height=7.840148, global_step=639200
+```
+
+# BATCH SIZE 128 
+
+batch size 128
+tensorrt : OFF
+256 threads
+children : 128
+2000M tree size 
+unlimited sims
+60s per move
+
+### 4 vcpu (2 physical cores/ 4 cpu threads) :
+
+```
+stderr: 16th move(w): rf, winrate=65.895859%, N=66225, Q=0.317917, p=0.937327, v=0.202470, cost 60232.250000ms, sims=67328, height=44, avg_height=10.560142, global_step=639200
+```
+
+### 12 vcpu (6 physical cores/ 12 cpu threads) :
+
+```
+stderr: 12th move(w): ob, winrate=65.983253%, N=70697, Q=0.319665, p=0.920173, v=0.223816, cost 60312.035156ms, sims=71664, height=49, avg_height=6.786881, global_step=639200
+```
+
+# CONCLUSIONS for GTX 1060 : 
+
+- all the conclusions below are without tensorRT optimization, which 
+is known to bring 15-30% extra computation performance depending on 
+hardware and settings
+     -> for batch size 4 to 8 , gain = +43%
+     -> for batch size 8 to 16 , gain = +60%
+     -> for batch size 16 to 32 , gain = +14%
+     -> for batch size 32 to 64 , gain = +10%
+     -> for batch size 64 to 128 , gain = +3.7%
+- batch sizes 8 and 16 significant great increases speed on Tesla 
+V100 with 6 cores / 12 cpu threads or less
+- batch sizes higher 16 to 64 bring significant small increase speed 
+on Tesla V100 with 6 cores / 12 cpu threads or less, but considering 
+the loss of computing accuracy, this is not an efficient choice
+- therefore, the most efficient batch size seems to be 16, providing 
+**an average 900 simulations per second on Tesla V100**
+- batch size higher than 64 do not bring significant speed increases on 
+Tesla V100 with 6 cores / 12 cpu threads or less
+
+- on the CPU side, as of February 2019, PhoenixGo engine does not 
+significantly benefit from a number of cpu threads higher than 2 
+cpu cores/ 4 cpu threads, even on Tesla V100
+
+For comparison, you can refer to 
+[gtx-1060-75w-benchmark](benchmark-gtx1060-75w.md)