compared with xgboost new histogram based algorithm #211
xgboost just adapted the histogram-based idea from LightGBM; see dmlc/xgboost#1950.
CPU: E5-2670 v3 * 2
The gap is now much smaller: LightGBM is about 1x faster (about 2x in total).
ndcg at Yahoo LTR:
ndcg at MS LTR:
xgboost's approx / hist methods currently scale very poorly with multithreading. This is one of my test benches using Bosch (I can test other sets); I can't even keep the CPU busier than 60% (sparse 1Mx1K is too small in this case):
Worse is the approx method, which can't even reach 25% usage.
Did you check the CPU usage while running the approx and hist methods? For single-threading, I found xgboost (hist method) to be faster than LightGBM. But with multithreading, LightGBM always wins, as xgboost doesn't scale linearly with histograms.
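As a rough way to reason about why CPU usage caps well below 100%, Amdahl's law gives the maximum speedup when only a fraction of the work parallelizes. This is an illustrative sketch with an assumed parallel fraction, not a claim about xgboost internals:

```python
def amdahl_speedup(p, n):
    """Max speedup with n threads when fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# If only ~60% of the work parallelizes (an assumed figure), 12 threads
# give barely over 2x, consistent with cores sitting mostly idle.
for n in (1, 6, 12):
    print(n, round(amdahl_speedup(0.6, n), 2))
```

Under that assumption, going from 6 to 12 threads buys almost nothing, which matches the sub-linear scaling reported above.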
@guolinke xgboost only 30% CPU usage on Higgs.
I'll run 12 threads, 6 threads, and 1 thread to compare all this on Higgs.
It seems to run much faster than your benchmark (for 12 threads, other results incoming soon). Here is a sample for Higgs:
Which CPU do you use? (I use i7-3930K in my case)
This is yours (did I make a mistake in the file name?):
@guolinke I also use DDR3 1600 MHz (64GB in my case).
My benchmarks on Higgs:
Something I don't understand is this when I use xgboost:
While you have:
I'm using Python 3.5, so the original Python script to create the libsvm files does not work. Instead I'm using this:
```python
input_filename = "HIGGS.csv"
output_train = "higgs.train"
output_test = "higgs.test"
num_train = 10500000
read_num = 0

input = open(input_filename, "r")
train = open(output_train, "w")
test = open(output_test, "w")

def WriteOneLine(tokens, output):
    # First token is the label; the rest become 0-indexed feature:value pairs.
    label = int(float(tokens[0]))
    output.write(str(label))
    for i in range(1, len(tokens)):
        feature_value = float(tokens[i])
        output.write(' ' + str(i - 1) + ':' + str(feature_value))
    output.write('\n')

line = input.readline()
while line:
    tokens = line.split(',')
    if read_num < num_train:
        WriteOneLine(tokens, train)
    else:
        WriteOneLine(tokens, test)
    read_num += 1
    if read_num % 1000 == 0:
        print(read_num)
    line = input.readline()

input.close()
train.close()
test.close()
```
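The per-line conversion can be sanity-checked on a tiny in-memory sample. This is a minimal sketch of the same label/feature formatting as a pure function (the helper name is my own, not from the script above):

```python
def csv_line_to_libsvm(line):
    # First column is the label; remaining columns become 0-indexed
    # feature:value pairs, matching the libsvm text format.
    tokens = line.strip().split(',')
    label = int(float(tokens[0]))
    feats = ' '.join('%d:%s' % (i - 1, float(tokens[i]))
                     for i in range(1, len(tokens)))
    return '%d %s' % (label, feats)

print(csv_line_to_libsvm("1.0,0.5,2.0"))  # -> 1 0:0.5 1:2.0
```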
It does go through all 11M lines:
```
$ wc -l HIGGS.csv
11000000 HIGGS.csv
$ wc -l HIGGS.train
10500000 HIGGS.train
```
Is this normal behavior? My higgs.train is 6,082,744,083 bytes, and HIGGS.csv is 8,035,497,980 bytes. I downloaded the data and created the libsvm files 3 times to triple check, with the same result.
First line of my HIGGS.train and HIGGS.csv:
@Laurae2 what is the data information output by your lightGBM? https://github.com/guolinke/boosting_tree_benchmarks/blob/master/lightgbm/lightgbm_higgs_speed.log#L4
@guolinke I have the same exact line as yours.
My params are also identical to yours, except I changed:
Default bagging in xgboost is 1.00 (use all data). I compiled xgboost from source. When I run your sh code as is (after removing all the things I can't run), I get the same issue.
Lines 3086061 to 3086063 of my higgs.train do not seem malformed, so I don't understand why xgboost refuses to go any further; it's really strange:
I ended up creating the matrix (binary format) for xgboost using R. All my attempts using a libsvm format ended with nearly the same issue (only the row count reported by xgboost changed; I tried Python 2.7 and 3.5).
Now xgboost works properly and matches your runs for AUC. I have to test for speed after.
My new results.
Running depthwise soon.
xgboost's algorithm is better for sparse data, and LightGBM is better for dense data.
I will try to reduce the time cost for sparse features in LightGBM as well.
@guolinke Will test for Bosch dataset.
@wxchan I got reversed results for depthwise (I could test on Bosch too if needed). See the table below for Higgs:
Just chiming in to note that, although comparing performance with XGBoost set at
@Allardvm The most time-consuming part of the histogram algorithm is building the histograms.
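To illustrate what "histogram building" means here: each feature's values are pre-binned, and per-bin gradient statistics are accumulated before the best split is searched over bins. This is a simplified stdlib sketch of the idea, not LightGBM's actual implementation:

```python
def build_histogram(bin_indices, gradients, num_bins):
    """Accumulate the gradient sum and sample count for each feature bin."""
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for b, g in zip(bin_indices, gradients):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count

# Four samples binned into 3 bins, with toy gradients:
grads, counts = build_histogram([0, 2, 1, 2], [0.5, -1.0, 0.25, 0.75], 3)
print(grads, counts)  # -> [0.5, 0.25, -0.25] [1, 1, 2]
```

This inner loop runs once per sample per feature on every tree level, which is why it dominates the runtime and why its multithreaded scaling matters so much in the benchmarks above.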