[R-package] Provide recommendation for mnative? #348

Closed · Laurae2 opened this issue Mar 15, 2017 · 9 comments

Laurae2 (Contributor) commented Mar 15, 2017

@guolinke I am just wondering whether recommending -march=native could yield better performance for those installing directly via install_github (the default in R is -mtune=core2).

For instance, here is the installation log when installing via install_github on Windows: we can see it is tuned for the Core 2 architecture:

c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm-all.cpp -o lightgbm-all.o
c:/Rtools/mingw_64/bin/g++ -m64 -std=c++0x -I"C:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/include" -DNDEBUG -I../..//include -DUSE_SOCKET      -I"C:/swarm/workspace/External-R-3.3.2/vendor/extsoft/include"  -fopenmp -pthread -std=c++11   -O2 -Wall  -mtune=core2 -c lightgbm_R.cpp -o lightgbm_R.o
c:/Rtools/mingw_64/bin/g++ -m64 -shared -s -static-libgcc -o lightgbm.dll tmp.def ./lightgbm-all.o ./lightgbm_R.o -fopenmp -pthread -lws2_32 -liphlpapi -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib/x64 -LC:/swarm/workspace/External-R-3.3.2/vendor/extsoft/lib -LC:/PROGRA~1/MIE74D~1/RCLIEN~1/R_SERVER/bin/x64 -lR

This would require adding a note to the R-package README.md saying that, to maximize performance, -march=native should be added, but that it might break some packages.

Regarding -O3 (if we wanted to push even further), I know CRAN refuses it for compatibility reasons (some packages break with -O3).

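For reference, a minimal sketch of what such a README note could suggest. This assumes the user is willing to override the compiler flags R applies to packages built from source (the ~/.R/Makevars mechanism is standard, but the exact variables picked up can differ across R versions and toolchains):

    # Append an opt-in override to the user-level R Makevars
    # (with Rtools on Windows, the file is ~/.R/Makevars.win).
    # This affects every package compiled from source afterwards and
    # may break some of them, so consider removing it after installing.
    mkdir -p ~/.R
    echo 'CXXFLAGS = -O2 -Wall -march=native' >> ~/.R/Makevars
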
chivee (Collaborator) commented Mar 16, 2017

@Laurae2, does that mean we should alter the C++ build rather than just the R library? I think we can make this a suggestion rather than a compulsory step.

guolinke (Collaborator) commented:

@Laurae2
I remember the difference between O2 and O3 in LightGBM being very small.
You could run some benchmarks on this.

Laurae2 (Contributor, Author) commented Mar 17, 2017

@chivee No, this would just be a suggestion for users who want better local training speed. I'm not sure it has a major impact, though; I'll test all of that thoroughly before I make a PR. As @guolinke said, the differences from the O2 vs O3 flag alone are very small.

@guolinke When I get time on my server, I'll try O3 and -march=native to see what happens to the speed. Since last month I have been collecting a lot of (long) benchmarks on xgboost and LightGBM to understand how their performance (ranking quality (AUC) and speed) behaves depending on the parameters.

I'll get back here once my new benchmarks are done.

Laurae2 (Contributor, Author) commented Apr 1, 2017

@guolinke Some results here. I am not posting the exact benchmark details because there will be more at a mini-conference I am giving next month.

Settings:

  • v1 is LightGBM v1
  • v2 is LightGBM v2 @1bf7bbd
  • default means compiled with -O2 -mtune=core2
  • O3 means compiled with -O3 -march=native
  • O3-fmath means compiled with -O3 -ffast-math -march=native
  • O2 means compiled with -O2 -march=native
  • Os means compiled with -Os

Best means the compilation flags giving maximum speed, with default overriding all the others whenever the difference is not significant (<~1%) and not consistent (similar flags giving diverging results).

  • CPU: i7-3930K
  • R + gcc 4.9

Summary (tl;dr)

We notice that LightGBM v2 benefits from -O3 -march=native (specifically from -O3). LightGBM v1 currently shows no visible benefit from any flags other than the defaults. Depending on the model parameters, different flags give different performance (for instance, the -march=native boost for LightGBM v2 kicks in when building deeper trees, or depending on whether the threading overhead is low or large, as in 1-thread runs).

Therefore, the following recommendations could be made:

  • -O2 -mtune=core2 for LightGBM v1 for maximum performance.
  • -O3 -march=native for LightGBM v2 for maximum performance.
  • When cross-validating models, it is always faster to run several single-threaded processes in parallel (e.g., 4 processes with 1 thread each) than one multithreaded process sequentially (e.g., 1 process with 4 threads), even though your RAM usage might explode (see the sketch below this list).

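As an illustration of the last point, a minimal shell sketch of the "several single-threaded processes" pattern using the LightGBM CLI; the per-fold config file names are hypothetical placeholders:

    # Train 4 cross-validation folds as independent 1-thread processes.
    # train_fold1.conf ... train_fold4.conf are hypothetical per-fold configs.
    for fold in 1 2 3 4; do
        ./lightgbm config=train_fold${fold}.conf num_threads=1 &
    done
    wait  # block until every fold has finished
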
I will follow up with more in the next month.


Bosch, 12 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 724.18s | 903.17s | 725.38s | 729.89s | 723.23s | default |
| depth=6 | 579.29s | 685.88s | 584.64s | 584.59s | 583.89s | default |
| depth=9 | 395.23s | 454.56s | 398.25s | 400.50s | 398.93s | default |
| depth=12 | 596.55s | 654.80s | 596.90s | 608.39s | 604.25s | default |

Bosch, 12 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 873.08s | 1104.39s | 861.57s | 861.99s | 872.17s | O2 |
| depth=6 | 730.06s | 872.77s | 724.59s | 722.88s | 724.98s | O3 |
| depth=9 | 567.59s | 634.52s | 570.66s | 556.12s | 614.80s | O3 |
| depth=12 | 854.97s | 923.84s | 845.12s | 834.60s | 847.38s | O3 |

Bosch, 6 threads, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 913.44s | 1208.02s | 903.01s | 921.13s | 915.41s | O2 |
| depth=6 | 718.29s | 885.44s | 722.16s | 723.94s | 726.72s | default |
| depth=9 | 449.03s | 533.58s | 451.60s | 455.08s | 452.59s | default |
| depth=12 | 622.24s | 704.10s | 623.36s | 618.28s | 619.96s | O3 |

Bosch, 6 threads, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 956.25s | 1248.24s | 965.32s | 969.56s | 975.95s | default |
| depth=6 | 787.95s | 952.82s | 795.35s | 782.70s | 788.41s | ??? |
| depth=9 | 548.84s | 639.46s | 546.65s | 547.61s | 547.05s | ??? |
| depth=12 | 770.47s | 862.75s | 766.49s | 773.30s | 762.61s | ??? |

Bosch, 1 thread, LightGBM v1:

| Parameters | v1 + default | v1 + Os | v1 + O2 | v1 + O3 | v1 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 2360.10s | 3314.84s | 2389.20s | 2406.67s | 2337.28s | O3-fmath |
| depth=6 | 1757.84s | 2335.01s | 1810.60s | 1816.25s | 1769.16s | default |
| depth=9 | 968.05s | 1250.17s | 994.99s | 1007.10s | 975.83s | default |
| depth=12 | 1202.59s | 1468.61s | 1238.31s | 1246.01s | 1216.62s | default |

Bosch, 1 thread, LightGBM v2:

| Parameters | v2 + default | v2 + Os | v2 + O2 | v2 + O3 | v2 + O3-fmath | Best |
| --- | --- | --- | --- | --- | --- | --- |
| depth=3 | 2477.49s | 3316.81s | 2437.84s | 2342.69s | 2412.35s | O3 |
| depth=6 | 1850.66s | 2334.77s | 1830.01s | 1745.34s | 1799.20s | O3 |
| depth=9 | 1003.35s | 1243.15s | 990.65s | 954.06s | 970.39s | O3 |
| depth=12 | 1236.83s | 1469.03s | 1216.49s | 1159.22s | 1191.33s | O3 |

guolinke (Collaborator) commented Apr 1, 2017

@Laurae2 Thanks for your benchmarks 👍.
If changing to O3 is needed, you can create a PR for it.

Laurae2 (Contributor, Author) commented Apr 3, 2017

@guolinke I'll open a PR adding a recommendation once I have some good charts ready and the mini-conference material is done (early next month); I'll link to it in the PR.

I also have xgboost benchmarks for comparison; do you want to see them? (I also have results for nthread={1, 2, 3, 4, 5, 6, 12} and depth={3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, but that gets very large for GitHub, so I plan to make a blog post out of it instead.)

guolinke (Collaborator) commented Apr 3, 2017

Sure, comparison benchmarks are always welcome. They can help us find out which parts we can further improve.

Laurae2 (Contributor, Author) commented Apr 4, 2017

@guolinke Here are the results for xgboost:

  • xgboost is at commit b4d97d3
  • default means compiled with -O2 -mtune=core2
  • O3 means compiled with -O3 -march=native -funroll-loops
  • O3-fmath means compiled with -O3 -ffast-math -march=native -funroll-loops
  • -funroll-loops is added because it is xgboost's default (in practice, I do not even see a difference with or without it)

Since xgboost was "slow", I skipped -O2 -march=native and -Os (each full benchmark took 2 days per thread count; the single-threaded run took very long).

To compare xgboost and LightGBM, it is best to copy & paste the results into Excel (or anything similar) and make charts. See the end of this comment for an Excel table example.

Default run: [chart]

Default flag: [2 charts]

-O3 flag: [2 charts]

-O3 -ffast-math flag: [2 charts]


Summary (tl;dr)

Configuration to choose (the difference might be large depending on the case):

  • Deep trees and multithreading: -O2 -mtune=core2
  • Small trees and multithreading: -O3 -ffast-math -march=native -funroll-loops
  • No multithreading: -O3 -march=native -funroll-loops

See dmlc/xgboost#1950 for more details on xgboost's implementation.

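As a side note, to check what -march=native actually resolves to on a given machine, gcc can print its target options (the output is machine-dependent):

    # Show the -march/-mtune values that -march=native selects on this CPU
    gcc -march=native -Q --help=target | grep -E 'march=|mtune='
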
More to come next month (on 10 May).


Bosch, 12 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 1049.86s | 1037.48s | 1026.85s | O3-fmath |
| depth=6 | 832.13s | 843.74s | 789.30s | O3-fmath |
| depth=9 | 790.78s | 799.14s | 788.94s | default |
| depth=12 | 1288.12s | 1303.58s | 1323.37s | default |

Bosch, 12 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 1047.75s | 1042.41s | 1030.32s | O3-fmath |
| depth=6 | 844.80s | 841.92s | 838.87s | O3-fmath |
| depth=9 | 799.60s | 802.58s | 797.94s | default |
| depth=12 | 1263.58s | 1292.64s | 1330.31s | default |

Bosch, 6 threads, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 1222.31s | 1194.40s | 1171.52s | O3-fmath |
| depth=6 | 865.96s | 866.79s | 833.08s | O3-fmath |
| depth=9 | 696.18s | 710.25s | 703.25s | default |
| depth=12 | 1036.29s | 1062.12s | 1070.23s | default |

Bosch, 6 threads, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 1215.27s | 1194.47s | 1176.07s | O3-fmath |
| depth=6 | 871.79s | 860.68s | 855.88s | O3-fmath |
| depth=9 | 717.43s | 714.81s | 705.16s | O3-fmath |
| depth=12 | 1061.09s | 1077.32s | 1089.91s | default |

Bosch, 1 thread, xgboost depth-wise at b4d97d3:

| Parameters | dw + default | dw + O3 | dw + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 3122.58s | 2719.62s | 2885.43s | O3 |
| depth=6 | 2076.36s | 1909.22s | 1967.32s | O3 |
| depth=9 | 1296.96s | 1215.27s | 1260.41s | O3 |
| depth=12 | 1684.07s | 1520.32s | 1577.45s | O3 |

Bosch, 1 thread, xgboost loss guide at b4d97d3:

| Parameters | lg + default | lg + O3 | lg + O3-fmath | Best |
| --- | --- | --- | --- | --- |
| depth=3 | 3032.19s | 2771.35s | 2944.40s | O3 |
| depth=6 | 2049.57s | 1941.74s | 1934.76s | O3-fmath |
| depth=9 | 1304.50s | 1208.21s | 1265.47s | O3 |
| depth=12 | 1571.86s | 1503.40s | 1615.36s | O3 |

Excel table example:

Copy & paste:

  • LightGBM v1 table with header: at A1
  • LightGBM v2 table with header: at I1
  • xgboost-depthwise table with header: at Q1
  • xgboost-lossguide table with header: at W1
  • Paste the whole table below at A7
  • Paste the formula =INDEX($A$1:$AA$5,F8,G8) into E8 (it pulls the timing string located at row F8, column G8 of the pasted tables), then double-click the small box at the bottom right of the cell to fill it down
  • Paste the formula =NUMBERVALUE(LEFT(E8, LEN(E8)-1)) into D8 (it strips the trailing "s" and converts the timing to a number), then double-click the small box at the bottom right of the cell to fill it down
  • Make the charts you want (even a pivot chart if you like)
Model Flag Depth Speed CellVal Row Column
LightGBM v1 default 3 2360.1 2 2
LightGBM v1 default 6 1757.84 3 2
LightGBM v1 default 9 968.05 4 2
LightGBM v1 default 12 1202.59 5 2
LightGBM v1 Os 3 3314.84 2 3
LightGBM v1 Os 6 2335.01 3 3
LightGBM v1 Os 9 1250.17 4 3
LightGBM v1 Os 12 1468.61 5 3
LightGBM v1 O2 3 2389.2 2 4
LightGBM v1 O2 6 1810.6 3 4
LightGBM v1 O2 9 994.99 4 4
LightGBM v1 O2 12 1238.31 5 4
LightGBM v1 O3 3 2406.67 2 5
LightGBM v1 O3 6 1816.25 3 5
LightGBM v1 O3 9 1007.1 4 5
LightGBM v1 O3 12 1246.01 5 5
LightGBM v1 O3-fmath 3 2337.28 2 6
LightGBM v1 O3-fmath 6 1769.16 3 6
LightGBM v1 O3-fmath 9 975.83 4 6
LightGBM v1 O3-fmath 12 1216.62 5 6
LightGBM v2 default 3 2477.49 2 10
LightGBM v2 default 6 1850.66 3 10
LightGBM v2 default 9 1003.35 4 10
LightGBM v2 default 12 1236.83 5 10
LightGBM v2 Os 3 3316.81 2 11
LightGBM v2 Os 6 2334.77 3 11
LightGBM v2 Os 9 1243.15 4 11
LightGBM v2 Os 12 1469.03 5 11
LightGBM v2 O2 3 2437.84 2 12
LightGBM v2 O2 6 1830.01 3 12
LightGBM v2 O2 9 990.65 4 12
LightGBM v2 O2 12 1216.49 5 12
LightGBM v2 O3 3 2342.69 2 13
LightGBM v2 O3 6 1745.34 3 13
LightGBM v2 O3 9 954.06 4 13
LightGBM v2 O3 12 1159.22 5 13
LightGBM v2 O3-fmath 3 2412.35 2 14
LightGBM v2 O3-fmath 6 1799.2 3 14
LightGBM v2 O3-fmath 9 970.39 4 14
LightGBM v2 O3-fmath 12 1191.33 5 14
xgboost-depthwise default 3 3122.58 2 18
xgboost-depthwise default 6 2076.36 3 18
xgboost-depthwise default 9 1296.96 4 18
xgboost-depthwise default 12 1684.07 5 18
xgboost-depthwise O3 3 2719.62 2 19
xgboost-depthwise O3 6 1909.22 3 19
xgboost-depthwise O3 9 1215.27 4 19
xgboost-depthwise O3 12 1520.32 5 19
xgboost-depthwise O3-fmath 3 2885.43 2 20
xgboost-depthwise O3-fmath 6 1967.32 3 20
xgboost-depthwise O3-fmath 9 1260.41 4 20
xgboost-depthwise O3-fmath 12 1577.45 5 20
xgboost-lossguide default 3 3032.19 2 24
xgboost-lossguide default 6 2049.57 3 24
xgboost-lossguide default 9 1304.5 4 24
xgboost-lossguide default 12 1571.86 5 24
xgboost-lossguide O3 3 2771.35 2 25
xgboost-lossguide O3 6 1941.74 3 25
xgboost-lossguide O3 9 1208.21 4 25
xgboost-lossguide O3 12 1503.4 5 25
xgboost-lossguide O3-fmath 3 2944.4 2 26
xgboost-lossguide O3-fmath 6 1934.76 3 26
xgboost-lossguide O3-fmath 9 1265.47 4 26
xgboost-lossguide O3-fmath 12 1615.36 5 26

Laurae2 closed this as completed May 21, 2017
lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020