xgboost in parallel: demo results #4

Laurae2 (Owner) commented May 13, 2019

Following up on #3, here are the results.

Baseline: 1 CPU thread model throughput = (11.391 x 25 + 11.330 x 50) / 75 = 11.350 seconds per model
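As a sanity check, the CPU baseline is just the models-weighted average of the two single-thread CPU runs (runs 9 and 14 in the CPU table below), e.g. in R:

```r
# CPU baseline: average seconds/model, weighted by the number of models per run
(11.391 * 25 + 11.330 * 50) / 75
#> 11.3503
```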

Baseline: 1 GPU thread model throughput = 20.436 seconds per model

For reference:

- Parallel threads = processes/threads used to run R in parallel (multiprocessing through sockets; see the sketch after this list)
- Model threads = threads used to run xgboost (multithreading)
- Parallel GPUs = number of GPUs used across the parallel R processes/threads
- Parallel GPU threads = number of processes running on a single GPU
- Models = total number of models to train
- Seconds / Model = average time to train 1 model, in seconds
- Boost vs Baseline = performance gain of the given row compared to using only 1 CPU (or 1 GPU, for GPU rows) process/thread per model
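To make the "parallel threads" vs "model threads" distinction concrete, here is a minimal sketch of the kind of setup assumed above: a PSOCK socket cluster of R workers, each training its own xgboost model, with `nthread` controlling the model threads. The toy data, parameter values, and `nrounds` are illustrative only, not the actual benchmark code.

```r
# Minimal sketch (assumed, not the exact benchmark code):
# "Parallel Threads" = PSOCK workers, "Model Threads" = nthread passed to xgboost.
library(parallel)
library(xgboost)

# Hypothetical toy data standing in for the real dataset.
set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- as.numeric(rowSums(x[, 1:3]) > 0)

n_parallel <- 9   # "Parallel Threads" column
n_model    <- 1   # "Model Threads" column
n_models   <- 50  # "Models" column

cl <- makeCluster(n_parallel, type = "PSOCK")  # multiprocessing through sockets
clusterExport(cl, c("x", "y", "n_model"))

timings <- parLapply(cl, seq_len(n_models), function(i) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x, label = y)
  # Train one model and return its wall-clock time
  system.time(
    xgb.train(params = list(objective = "binary:logistic",
                            tree_method = "hist",
                            nthread = n_model),
              data = dtrain,
              nrounds = 100)
  )[["elapsed"]]
})
stopCluster(cl)

mean(unlist(timings))  # "Seconds / Model"
```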

CPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|-----|------------------|---------------|---------------|-------------|--------|-----------------|-------------------|
| 9   | 1                | 1             | 0             | 0           | 25     | 11.391          | ~1x               |
| 10  | 9                | 1             | 0             | 0           | 50     | 1.458           | 7.78x             |
| 11  | 18               | 1             | 0             | 0           | 100    | 0.797           | 14.24x            |
| 12  | 35               | 1             | 0             | 0           | 250    | 0.474           | 23.95x            |
| 13  | 70               | 1             | 0             | 0           | 500    | 0.440           | 25.79x            |
| 14  | 1                | 1             | 0             | 0           | 50     | 11.330          | ~1x               |
| 15  | 1                | 9             | 0             | 0           | 50     | 6.287           | 1.81x             |
| 16  | 1                | 18            | 0             | 0           | 50     | 6.283           | 1.81x             |
| 17  | 1                | 35            | 0             | 0           | 50     | 24.907          | 0.46x             |
| 18  | 1                | 70            | 0             | 0           | 50     | 165.522         | 0.07x             |

GPU:

| Run | Parallel Threads | Model Threads | Parallel GPUs | GPU Threads | Models | Seconds / Model | Boost vs Baseline |
|-----|------------------|---------------|---------------|-------------|--------|-----------------|-------------------|
| 1   | 1                | 1             | 1             | 1           | 25     | 20.436          | ~1x               |
| 2   | 2                | 1             | 2             | 1           | 50     | 10.666          | 1.91x             |
| 3   | 3                | 1             | 3             | 1           | 100    | 6.999           | 2.92x             |
| 4   | 4                | 1             | 4             | 1           | 250    | 5.182           | 3.94x             |
| 5   | 4                | 1             | 1             | 4           | 50     | 20.602          | 0.99x             |
| 6   | 8                | 1             | 2             | 4           | 100    | 10.495          | 1.95x             |
| 7   | 12               | 1             | 3             | 4           | 250    | 6.909           | 2.96x             |
| 8   | 16               | 1             | 4             | 4           | 500    | 5.222           | 3.91x             |

Conclusions:

- Too many CPU cores used poorly (multithreading a single model) actually work AGAINST you by making the learning slower
- Many CPU cores used appropriately (parallel processes, 1 thread per model) provide a huge performance boost and benefit significantly from hyperthreading (+200% from 35 threads to 70 threads)
- More GPUs in parallel provide a moderate performance increase, given that we are using NVIDIA Quadro P1000 GPUs, while 1x Tesla V100 is not cheap (more expensive than the CPUs on my server!)
- Overallocating GPU threads did not help, because the GPUs were already at 100% GPU usage for the 0.1M data due to our sparse data (the assumed per-GPU scheduling is sketched below)
- Going from 18 to 35 parallel threads does not scale linearly; I expect this is because I have only 4 DIMMs on my server (80 GB/s) instead of the full 12 possible DIMMs (240 GB/s) => this could be checked with Intel VTune, but I'm too lazy because it's a long process (it is doable though)
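For the GPU runs, the scheduling is assumed to look roughly like the sketch below: each parallel R worker is pinned to one GPU through xgboost's `gpu_id` parameter (round-robin over the available devices) and trains with `tree_method = "gpu_hist"`. This needs a GPU-enabled xgboost build (recent versions express the same thing via `device = "cuda:N"`); the data and parameter values are illustrative only, not the actual benchmark code.

```r
# Hedged sketch of the assumed multi-GPU scheduling (requires a GPU build of xgboost).
# Each PSOCK worker is pinned to one gpu_id, round-robin over "Parallel GPUs".
library(parallel)
library(xgboost)

set.seed(1)
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- as.numeric(rowSums(x[, 1:3]) > 0)

n_gpus    <- 4                 # "Parallel GPUs" column
n_per_gpu <- 1                 # "GPU Threads" column (processes sharing one GPU)
n_workers <- n_gpus * n_per_gpu
n_models  <- 50                # "Models" column (illustrative)

cl <- makeCluster(n_workers, type = "PSOCK")
clusterExport(cl, c("x", "y", "n_gpus"))

# Assign every worker a fixed GPU before training starts (one element per worker).
clusterApply(cl, 0:(n_workers - 1), function(w) {
  assign("my_gpu", w %% n_gpus, envir = .GlobalEnv)
  NULL
})

timings <- parLapply(cl, seq_len(n_models), function(i) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x, label = y)
  # gpu_id/gpu_hist is the pre-2.0 spelling; newer xgboost uses device = "cuda:N".
  system.time(
    xgb.train(params = list(objective = "binary:logistic",
                            tree_method = "gpu_hist",
                            gpu_id = my_gpu,
                            nthread = 1),
              data = dtrain,
              nrounds = 100)
  )[["elapsed"]]
})
stopCluster(cl)

mean(unlist(timings))  # "Seconds / Model"
```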

TODO: try with V100 later
