xgboost in parallel: demo results #4

Laurae2 (Owner) commented May 13, 2019

Following up on #3, here are the results.

Baseline: 1 CPU thread model throughput = (11.391 x 25 + 11.330 x 50) / 75 = 11.350 seconds per model
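As a sanity check, the CPU baseline is just the models-weighted average of the two single-thread CPU runs (runs 9 and 14 in the CPU table below), e.g. in R:

```r
# CPU baseline: average seconds/model, weighted by the number of models per run
(11.391 * 25 + 11.330 * 50) / 75
#> 11.3503
```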

Baseline: 1 GPU thread model throughput = 20.436 seconds per model

For reference:

- Parallel threads = processes/threads used to run R in parallel (multiprocessing through sockets; see the sketch after this list)
- Model threads = threads used to run xgboost (multithreading)
- Parallel GPUs = number of GPUs used across the parallel R processes/threads
- Parallel GPU threads = number of processes running on a single GPU
- Models = total number of models to train
- Seconds / Model = average time to train 1 model, in seconds
- Boost vs Baseline = performance gain of the given row compared to using only 1 CPU (or 1 GPU, for GPU rows) process/thread per model
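To make the "parallel threads" vs "model threads" distinction concrete, here is a minimal sketch of the kind of setup assumed above: a PSOCK socket cluster of R workers, each training its own xgboost model, with `nthread` controlling the model threads. The toy data, parameter values, and `nrounds` are illustrative only, not the actual benchmark code.

```r
# Minimal sketch (assumed, not the exact benchmark code):
# "Parallel Threads" = PSOCK workers, "Model Threads" = nthread passed to xgboost.
library(parallel)
library(xgboost)

# Hypothetical toy data standing in for the real dataset.
set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- as.numeric(rowSums(x[, 1:3]) > 0)

n_parallel <- 9   # "Parallel Threads" column
n_model    <- 1   # "Model Threads" column
n_models   <- 50  # "Models" column

cl <- makeCluster(n_parallel, type = "PSOCK")  # multiprocessing through sockets
clusterExport(cl, c("x", "y", "n_model"))

timings <- parLapply(cl, seq_len(n_models), function(i) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x, label = y)
  # Train one model and return its wall-clock time
  system.time(
    xgb.train(params = list(objective = "binary:logistic",
                            tree_method = "hist",
                            nthread = n_model),
              data = dtrain,
              nrounds = 100)
  )[["elapsed"]]
})
stopCluster(cl)

mean(unlist(timings))  # "Seconds / Model"
```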

CPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|-----|------------------|---------------|---------------|-------------|--------|-----------------|-------------------|
| 9   | 1                | 1             | 0             | 0           | 25     | 11.391          | ~1x               |
| 10  | 9                | 1             | 0             | 0           | 50     | 1.458           | 7.78x             |
| 11  | 18               | 1             | 0             | 0           | 100    | 0.797           | 14.24x            |
| 12  | 35               | 1             | 0             | 0           | 250    | 0.474           | 23.95x            |
| 13  | 70               | 1             | 0             | 0           | 500    | 0.440           | 25.79x            |
| 14  | 1                | 1             | 0             | 0           | 50     | 11.330          | ~1x               |
| 15  | 1                | 9             | 0             | 0           | 50     | 6.287           | 1.81x             |
| 16  | 1                | 18            | 0             | 0           | 50     | 6.283           | 1.81x             |
| 17  | 1                | 35            | 0             | 0           | 50     | 24.907          | 0.46x             |
| 18  | 1                | 70            | 0             | 0           | 50     | 165.522         | 0.07x             |

GPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|-----|------------------|---------------|---------------|-------------|--------|-----------------|-------------------|
| 1   | 1                | 1             | 1             | 1           | 25     | 20.436          | ~1x               |
| 2   | 2                | 1             | 2             | 1           | 50     | 10.666          | 1.91x             |
| 3   | 3                | 1             | 3             | 1           | 100    | 6.999           | 2.92x             |
| 4   | 4                | 1             | 4             | 1           | 250    | 5.182           | 3.94x             |
| 5   | 4                | 1             | 1             | 4           | 50     | 20.602          | 0.99x             |
| 6   | 8                | 1             | 2             | 4           | 100    | 10.495          | 1.95x             |
| 7   | 12               | 1             | 3             | 4           | 250    | 6.909           | 2.96x             |
| 8   | 16               | 1             | 4             | 4           | 500    | 5.222           | 3.91x             |

Conclusions:

- Too many CPU cores used poorly (multithreading a single model) actually work AGAINST you by making the learning slower
- Many CPU cores used appropriately (parallel processes, 1 thread per model) provide a huge performance boost and benefit significantly from hyperthreading (+200% from 35 threads to 70 threads)
- More GPUs in parallel provide a moderate performance increase, given that we are using NVIDIA Quadro P1000 GPUs, while 1x Tesla V100 is not cheap (more expensive than the CPUs on my server!)
- Overallocating GPU threads did not help, because the GPUs were already at 100% GPU usage for the 0.1M data due to our sparse data (the assumed per-GPU scheduling is sketched below)
- Going from 18 to 35 parallel threads does not scale linearly; I expect this is because I have only 4 DIMMs on my server (80 GB/s) instead of the full 12 possible DIMMs (240 GB/s) => this could be checked with Intel VTune, but I'm too lazy because it's a long process (it is doable though)
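For the GPU runs, the scheduling is assumed to look roughly like the sketch below: each parallel R worker is pinned to one GPU through xgboost's `gpu_id` parameter (round-robin over the available devices) and trains with `tree_method = "gpu_hist"`. This needs a GPU-enabled xgboost build (recent versions express the same thing via `device = "cuda:N"`); the data and parameter values are illustrative only, not the actual benchmark code.

```r
# Hedged sketch of the assumed multi-GPU scheduling (requires a GPU build of xgboost).
# Each PSOCK worker is pinned to one gpu_id, round-robin over "Parallel GPUs".
library(parallel)
library(xgboost)

set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- as.numeric(rowSums(x[, 1:3]) > 0)

n_gpus    <- 4                 # "Parallel GPUs" column
n_per_gpu <- 1                 # "GPU Threads" column (processes sharing one GPU)
n_workers <- n_gpus * n_per_gpu
n_models  <- 50                # "Models" column (illustrative)

cl <- makeCluster(n_workers, type = "PSOCK")
clusterExport(cl, c("x", "y", "n_gpus"))

# Assign every worker a fixed GPU before training starts (one element per worker).
clusterApply(cl, 0:(n_workers - 1), function(w) {
  assign("my_gpu", w %% n_gpus, envir = .GlobalEnv)
  NULL
})

timings <- parLapply(cl, seq_len(n_models), function(i) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x, label = y)
  # gpu_id/gpu_hist is the pre-2.0 spelling; newer xgboost uses device = "cuda:N".
  system.time(
    xgb.train(params = list(objective = "binary:logistic",
                            tree_method = "gpu_hist",
                            gpu_id = my_gpu,
                            nthread = 1),
              data = dtrain,
              nrounds = 100)
  )[["elapsed"]]
})
stopCluster(cl)

mean(unlist(timings))  # "Seconds / Model"
```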

TODO: try with V100 later
