Baseline: 1 CPU thread model throughput = (11.391 x 25 + 11.330 x 50) / 75 = 11.350
Baseline: 1 GPU thread model throughput = 20.436
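The CPU baseline is the per-model weighted average of the two single-thread runs (25 and 50 models). A quick sanity check of that arithmetic in Python:

```python
# Weighted average of the two single-thread CPU runs above:
# (seconds/model, number of models trained) per run.
cpu_runs = [(11.391, 25), (11.330, 50)]

total_seconds = sum(sec * n for sec, n in cpu_runs)
total_models = sum(n for _, n in cpu_runs)
cpu_baseline = total_seconds / total_models

print(round(cpu_baseline, 3))  # 11.35
```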
For reference:

- Parallel threads = processes/threads used in parallel to run R (multiprocessing through sockets)
- Model threads = threads used by xgboost to train a single model (multithreading)
- Parallel GPUs = number of GPUs used across the parallel R processes/threads
- Parallel GPU threads = number of processes running on a single GPU
- Models = total number of models to train
- Seconds / Model = average training time for 1 model, in seconds
- Boost vs Baseline = performance gain of the given row versus a single CPU (or single GPU, for GPU runs) process/thread
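The "Boost vs Baseline" column is simply the baseline seconds-per-model divided by the row's seconds-per-model. Reproducing a couple of CPU rows from the table below:

```python
# Baseline seconds/model for a single CPU thread (computed above).
CPU_BASELINE = 11.350

def boost(seconds_per_model, baseline=CPU_BASELINE):
    """Speedup of a run relative to the single-thread baseline."""
    return round(baseline / seconds_per_model, 2)

print(boost(1.458))  # 7.78  (run 10: 9 parallel threads)
print(boost(0.797))  # 14.24 (run 11: 18 parallel threads)
```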
CPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|----:|-----------------:|--------------:|--------------:|------------:|-------:|----------------:|------------------:|
| 9   | 1                | 1             | 0             | 0           | 25     | 11.391          | ~1x               |
| 10  | 9                | 1             | 0             | 0           | 50     | 1.458           | 7.78x             |
| 11  | 18               | 1             | 0             | 0           | 100    | 0.797           | 14.24x            |
| 12  | 35               | 1             | 0             | 0           | 250    | 0.474           | 23.95x            |
| 13  | 70               | 1             | 0             | 0           | 500    | 0.440           | 25.79x            |
| 14  | 1                | 1             | 0             | 0           | 50     | 11.330          | ~1x               |
| 15  | 1                | 9             | 0             | 0           | 50     | 6.287           | 1.81x             |
| 16  | 1                | 18            | 0             | 0           | 50     | 6.283           | 1.81x             |
| 17  | 1                | 35            | 0             | 0           | 50     | 24.907          | 0.46x             |
| 18  | 1                | 70            | 0             | 0           | 50     | 165.522         | 0.07x             |
GPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|----:|-----------------:|--------------:|--------------:|------------:|-------:|----------------:|------------------:|
| 1   | 1                | 1             | 1             | 1           | 25     | 20.436          | ~1x               |
| 2   | 2                | 1             | 2             | 1           | 50     | 10.666          | 1.91x             |
| 3   | 3                | 1             | 3             | 1           | 100    | 6.999           | 2.92x             |
| 4   | 4                | 1             | 4             | 1           | 250    | 5.182           | 3.94x             |
| 5   | 4                | 1             | 1             | 4           | 50     | 20.602          | 0.99x             |
| 6   | 8                | 1             | 2             | 4           | 100    | 10.495          | 1.95x             |
| 7   | 12               | 1             | 3             | 4           | 250    | 6.909           | 2.96x             |
| 8   | 16               | 1             | 4             | 4           | 500    | 5.222           | 3.91x             |
Conclusions:

- Too many CPU cores used poorly (multithreading) actually work AGAINST you by making training slower.
- Many CPU cores used appropriately (parallel processes) provide a huge performance boost and still benefit from hyperthreading (23.95x at 35 parallel threads vs 25.79x at 70 parallel threads).
- More GPUs in parallel provide a moderate performance increase, considering we are using NVIDIA Quadro P1000 GPUs, while a single Tesla V100 is not cheap (more expensive than the CPUs on my server!).
- Overallocating GPU threads did not help, because the GPUs were already at 100% usage for the 0.1M-row data due to our sparse data.
- Going from 18 to 35 parallel threads does not scale linearly; I suspect this is because my server has only 4 DIMMs (80 GBps) instead of the full 12 possible DIMMs (240 GBps) => could be checked with Intel VTune, but I'm too lazy because it's a long process (it is doable though).
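The winning pattern above — one process per model, one thread per process — can be sketched with stdlib Python (the actual benchmarks were driven from R over socket clusters; `train_one` below is a hypothetical CPU-bound stand-in for a single-threaded xgboost fit):

```python
# Farm whole models out to worker processes, one model per task,
# instead of oversubscribing threads inside a single model fit.
from multiprocessing import Pool

def train_one(seed):
    # Hypothetical placeholder for one single-threaded model training run.
    total = 0
    for i in range(100_000):
        total += (i * seed) % 7
    return total

if __name__ == "__main__":
    n_models = 8
    # 4 parallel workers, each running 1 model at a time with 1 thread,
    # mirroring the "Parallel Threads" dimension of the CPU table.
    with Pool(processes=4) as pool:
        results = pool.map(train_one, range(n_models))
    print(len(results))  # 8
```

The design point is that independent model fits share nothing, so process-level parallelism scales with cores, whereas splitting one fit across too many threads (runs 17-18) drowns in synchronization overhead.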
TODO: try with V100 later
Following #3 with results.