add Tesla V100 benchmark, update gtx1060 75w benchmark
+ minor other readme changes

The benchmark shows a few big conclusions :

- batch size 16 is a very efficient batch size on Tesla V100,
providing an average of 900 simulations per second
- As of February 2019, PhoenixGo does not benefit from more CPU
resources than 2 cores / 4 threads on Tesla V100

These numbers may change if PhoenixGo supports newer TensorFlow
and TensorRT versions
wonderingabout committed Feb 18, 2019
1 parent 0ee090e commit 04c41e4
Showing 3 changed files with 213 additions and 15 deletions.
10 changes: 8 additions & 2 deletions docs/FAQ.md
@@ -158,8 +158,11 @@ it only if your computation device can handle it
Some independent speed benchmarks have been run; they are available
in the docs :

- for GTX 1060 :
[benchmark testing batch size from 4 to 64, tree size up to 2000M, max children up to 512, with tensorRT ON and OFF](/docs/benchmark-gtx1060.md)
- for GTX 1060 75W (75w power limit) :
[benchmark testing batch size from 4 to 64, tree size up to 2000M, max children up to 512, with tensorRT ON and OFF](/docs/benchmark-gtx1060-75w.md)

- for Tesla V100 :
[benchmark testing batch size from 4 to 128, 4 to 12 vcpu, no tensorrt](/docs/benchmark-teslaV100.md)

#### A10. GTP command `time_settings` doesn't work.

@@ -259,6 +262,9 @@ model on V100.
See: [#75](https://github.com/Tencent/PhoenixGo/issues/75) for
how to build TensorRT model.

You can find a speed benchmark for Tesla V100 in
[FAQ question](/docs/FAQ.md#a9-what-is-the-speed-of-the-engine--how-can-i-make-the-engine-think-faster-)

### Specific questions : bazel issues (linux and mac)

#### B0. It is too hard to install bazel or start bazel
32 changes: 19 additions & 13 deletions docs/benchmark-gtx1060.md → docs/benchmark-gtx1060-75w.md
@@ -1,8 +1,8 @@
# Benchmark setup :

## setup
- hardware : gtx 1060 6gb (1gpu, power limit set to 75W), ryzen r7 1700,
16gb ram
- hardware : gtx 1060 6gb 75W (1gpu, power limit set to 75W),
ryzen r7 1700, 16gb ram
- software : ubuntu 16.04 LTS, nvidia 384, cuda 9.0, cudnn 7.1.4,
tensorrt 3.0.4, bazel 0.11.1
- engine settings: unlimited time per move, all time management settings
@@ -354,21 +354,27 @@ for example speed gain +12% = 12% less time to calculate a move as compared
to batch size 4 = 27.5 seconds vs 30.5 seconds per move = 4 seconds
difference out of 30.5 seconds

# CONCLUSIONS for GTX 1060 :
# CONCLUSIONS for GTX 1060 75W :

- TensorRT can increase speed by around 15%-20% on a GTX 1060 with batch size 4
(for bigger batch size with tensorRT, see
- TensorRT can increase speed by around 15%-20% on a GTX 1060 75W with
batch size 4 (for bigger batch size with tensorRT, see
[#75](https://github.com/Tencent/PhoenixGo/issues/75))
- bigger batch size significantly increases speed of the engine on a GTX 1060 :
-> for batch size 8 , gain = +12%
-> for batch size 16 , gain = +33%
-> for batch size 24 , gain = +31%
-> for batch size 32 , gain = +47%
- bigger batch size significantly increases speed of the engine on a
GTX 1060 75W :
-> for batch size 4 to 8 , gain = +12%
-> for batch size 4 to 16 , gain = +33%
-> for batch size 4 to 24 , gain = +31%
-> for batch size 4 to 32 , gain = +47%
- batch sizes higher than 16 bring only a small speed increase
on gtx 1060 75W; considering the loss of computing accuracy,
this is not an efficient choice
- therefore, the most efficient batch size seems to be 16, providing
**an average 210 simulations per second on gtx1060 75W**
- Compute device (GPU or CPU) utilization is higher with higher batching
- gtx 1060 is too weak to benefit from batch sizes higher than 32
- gtx 1060 75W is too weak to benefit from batch sizes higher than 32
- number of threads does not significantly change speed of the engine
- tree size does not significantly change the speed of the engine
(around 5% more time with max tree size)
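
The gains listed above can be read as throughput ratios. The following is a minimal sketch, assuming the gains follow the definition given earlier in this file (a gain of +12% means 12% less time per move compared to batch size 4). The function name `throughput_ratio` and the `GAINS` table are hypothetical helpers written for this illustration, not part of PhoenixGo.

```python
def throughput_ratio(time_reduction_percent: float) -> float:
    """Convert 'X% less time per move' into a speed multiplier.

    Assumption: the gain is a reduction of the time needed per move,
    so the same work is done in (1 - X/100) of the original time.
    """
    if not 0 <= time_reduction_percent < 100:
        raise ValueError("time reduction must be in [0, 100)")
    return 1.0 / (1.0 - time_reduction_percent / 100.0)


# Gains relative to batch size 4, copied from the list above.
GAINS = {8: 12, 16: 33, 24: 31, 32: 47}

for batch_size, gain in GAINS.items():
    print(f"batch size {batch_size}: {throughput_ratio(gain):.2f}x")
```

Under this reading, a time reduction and a throughput increase are not the same number: 47% less time per move corresponds to a larger relative increase in simulations per second.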

# TO DO :
- i will try higher batch sizes with a Tesla V100 on windows 10
For comparison, you can refer to
[tesla-V100-benchmark](benchmark-teslaV100.md)
186 changes: 186 additions & 0 deletions docs/benchmark-teslaV100.md
@@ -0,0 +1,186 @@
# Benchmark setup :

## setup
- hardware : google cloud machine with Tesla V100-SXM2-16GB, 4 and
12vcpu (skylake server or later, support avx512, google cloud platform
does not allow more than 8 vcpu per die, maximum for 1 GPU is 12 vcpu),
16gb system ram, 40 gb hdd
- software : ubuntu 18.04 LTS, nvidia 410, cuda 10.0, cudnn 7.4.2,
no tensorrt, bazel 0.11.1
- engine settings: limited time 60 seconds per move, all other time
management settings disabled in config file, all the rest is default
settings

## methodology :
- most moves come from the same game played using
[gtp2ogs](https://github.com/online-go/gtp2ogs); for a few moves,
the stderr output was copied and pasted
- tensorRT is not used with the V100 here, because that would require
building our own TensorRT model, which was not done here, see
[FAQ question](#a13-i-have-a-nvidia-rtx-card-turing-or-tesla-v100titan-v-volta-is-it-compatible-)
for details
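
A speed figure can be derived from each copied stderr line by dividing `sims` by `cost`. The following is a minimal sketch, assuming every line keeps the `cost ...ms, sims=...` layout shown in this file; `LINE_RE` and `sims_per_second` are hypothetical names chosen for this illustration and are not part of PhoenixGo or gtp2ogs.

```python
import re

# Hypothetical helper: matches the "cost <ms>ms, sims=<n>" part of a
# stderr line in the format shown in this benchmark.
LINE_RE = re.compile(r"cost (?P<cost_ms>[\d.]+)ms, sims=(?P<sims>\d+)")


def sims_per_second(stderr_line: str) -> float:
    """Return simulations per second for one 'Nth move' stderr line."""
    match = LINE_RE.search(stderr_line)
    if match is None:
        raise ValueError("line does not contain 'cost ...ms, sims=...'")
    cost_seconds = float(match.group("cost_ms")) / 1000.0
    return int(match.group("sims")) / cost_seconds


# Sample taken from the batch size 16, 4 vcpu run in this file.
sample = (
    "stderr: 2th move(w): dp, winrate=56.057503%, N=15696, Q=0.121150, "
    "p=0.207913, v=0.105841, cost 60048.324219ms, sims=53568, height=34, "
    "avg_height=9.826570, global_step=639200"
)
print(round(sims_per_second(sample)))  # 892
```

Each figure comes from a single move, so it is only a rough indication of the average speed over a whole game.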

## credits :
- credit for running these tests goes to
[wonderingabout](https://github.com/wonderingabout)
- credit for providing the hardware goes to google cloud
platform

# BATCH SIZE 4

batch size 4
tensorrt : OFF
8 threads
children : 64
400M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 4th move(w): pp, winrate=56.125683%, N=20226, Q=0.122514, p=0.728064, v=0.114009, cost 60014.109375ms, sims=22976, height=46, avg_height=12.719517, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 2th move(w): pd, winrate=56.061207%, N=6544, Q=0.121224, p=0.212260, v=0.106201, cost 60021.523438ms, sims=23096, height=30, avg_height=9.461098, global_step=639200
```

# BATCH SIZE 8

batch size 8
tensorrt : OFF
16 threads
children : 96
2000M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 8th move(w): nq, winrate=56.569016%, N=27010, Q=0.131380, p=0.591054, v=0.117058, cost 60036.335938ms, sims=32152, height=59, avg_height=14.057215, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 4th move(w): pp, winrate=56.120705%, N=28938, Q=0.122414, p=0.715722, v=0.114932, cost 60016.148438ms, sims=32728, height=54, avg_height=13.301727, global_step=639200
```

# BATCH SIZE 16

batch size 16
tensorrt : OFF
32 threads
children : 128
2000M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 2th move(w): dp, winrate=56.057503%, N=15696, Q=0.121150, p=0.207913, v=0.105841, cost 60048.324219ms, sims=53568, height=34, avg_height=9.826570, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 6th move(w): qn, winrate=56.170525%, N=29628, Q=0.123410, p=0.306212, v=0.111480, cost 60020.058594ms, sims=53968, height=71, avg_height=14.943110, global_step=639200
```

# BATCH SIZE 32

batch size 32
tensorrt : OFF
64 threads
children : 128
2000M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 10th move(w): cp, winrate=65.475777%, N=58821, Q=0.309516, p=0.886150, v=0.196629, cost 60111.078125ms, sims=59444, height=53, avg_height=10.118464, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 8th move(w): qf, winrate=56.714546%, N=55717, Q=0.134291, p=0.613282, v=0.114160, cost 60048.957031ms, sims=62368, height=64, avg_height=13.618808, global_step=639200
```

# BATCH SIZE 64

batch size 64
tensorrt : OFF
32 threads
children : 128
2000M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 12th move(w): bo, winrate=65.815079%, N=64431, Q=0.316302, p=0.884180, v=0.226788, cost 60263.683594ms, sims=64960, height=54, avg_height=12.711725, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 10th move(w): pc, winrate=65.360603%, N=67943, Q=0.307212, p=0.887373, v=0.175454, cost 60165.914062ms, sims=69031, height=63, avg_height=7.840148, global_step=639200
```

# BATCH SIZE 128

batch size 128
tensorrt : OFF
256 threads
children : 128
2000M tree size
unlimited sims
60s per move

### 4 vcpu (2 physical cores/ 4 cpu threads) :

```
stderr: 16th move(w): rf, winrate=65.895859%, N=66225, Q=0.317917, p=0.937327, v=0.202470, cost 60232.250000ms, sims=67328, height=44, avg_height=10.560142, global_step=639200
```

### 12 vcpu (6 physical cores/ 12 cpu threads) :

```
stderr: 12th move(w): ob, winrate=65.983253%, N=70697, Q=0.319665, p=0.920173, v=0.223816, cost 60312.035156ms, sims=71664, height=49, avg_height=6.786881, global_step=639200
```

# CONCLUSIONS for Tesla V100 :

- all the conclusions below are without tensorRT optimization, which
is known to bring 15-30% extra computation performance depending on
hardware and settings
- bigger batch size increases speed of the engine on Tesla V100 :
-> for batch size 4 to 8 , gain = +43%
-> for batch size 8 to 16 , gain = +60%
-> for batch size 16 to 32 , gain = +14%
-> for batch size 32 to 64 , gain = +10%
-> for batch size 64 to 128 , gain = +3.7%
- batch sizes 8 and 16 significantly increase speed on Tesla
V100 with 6 cores / 12 cpu threads or less
- batch sizes higher than 16, up to 64, bring only a small speed increase
on Tesla V100 with 6 cores / 12 cpu threads or less; considering
the loss of computing accuracy, this is not an efficient choice
- therefore, the most efficient batch size seems to be 16, providing
**an average 900 simulations per second on Tesla V100**
- batch sizes higher than 64 do not bring significant speed increases on
Tesla V100 with 6 cores / 12 cpu threads or less

- on the CPU side, as of February 2019, PhoenixGo engine does not
significantly benefit from a number of cpu threads higher than 2
cpu cores/ 4 cpu threads, even on Tesla V100
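
As a rough cross-check, step-to-step gains can be recomputed from the `sims` values in the stderr lines of this file. The following is a minimal sketch, assuming the 4 vcpu and 12 vcpu runs are simply averaged; the `SIMS` table and `step_gains` function are hypothetical names for this illustration, and the method is not necessarily the one used to produce the rounded figures in the list above. Each entry comes from a single move, so the results need not match that list exactly.

```python
# sims values copied from the stderr lines in this benchmark,
# stored as (4 vcpu run, 12 vcpu run) for each batch size.
SIMS = {
    4: (22976, 23096),
    8: (32152, 32728),
    16: (53568, 53968),
    32: (59444, 62368),
    64: (64960, 69031),
    128: (67328, 71664),
}


def step_gains(sims_by_batch):
    """Percentage gain in average sims between consecutive batch sizes."""
    sizes = sorted(sims_by_batch)
    avg = {b: sum(sims_by_batch[b]) / len(sims_by_batch[b]) for b in sizes}
    return {
        (small, big): 100.0 * (avg[big] / avg[small] - 1.0)
        for small, big in zip(sizes, sizes[1:])
    }


for (small, big), gain in step_gains(SIMS).items():
    print(f"batch size {small} to {big}: {gain:+.1f}%")
```

Comparing the two columns of `SIMS` for a given batch size also shows how little the extra vcpus change the number of simulations.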

For comparison, you can refer to
[gtx-1060-75w-benchmark](benchmark-gtx1060-75w.md)
