
feat: add singleton dataset mode for faster performance and use old sparse dataset create method to reduce memory usage #1066

Merged
merged 21 commits into microsoft:master on Jul 12, 2021

Conversation

imatiach-msft
Contributor

@imatiach-msft commented Jun 1, 2021

This PR adds a single (or "singleton") dataset mode to the LightGBM learners.
Users can enable the new mode by setting the parameter useSingleDatasetMode=True (it is false by default).
In this mode, each executor creates a single LightGBMDataset. Currently, by default, each task within an executor creates its own dataset:

[diagram: default mode, one dataset created per task within each executor]

With this PR, the new mode instead creates only one dataset per executor:

[diagram: new mode, one dataset created per executor]

This reduces network communication overhead, since fewer LightGBM network nodes are initialized, and more of the parallelism happens in the native code within each machine (with the default number of threads). It also seems to reduce memory usage significantly for some datasets.
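To make the per-executor coordination more concrete, below is a simplified, hypothetical sketch of the idea in Scala. It is not the PR's actual SharedState/DatasetAggregator code, and all names in it are made up for illustration: tasks on one executor funnel their rows into a shared JVM-wide buffer, and exactly one task per executor builds the single native dataset once every task has contributed.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch only, not the actual SharedState/DatasetAggregator code.
// All Spark tasks running on one executor share this JVM-wide object.
object ExecutorSharedState {
  private val rows = ArrayBuffer.empty[Array[Double]]
  private val finishedTasks = new AtomicInteger(0)

  // Every task on the executor appends its partition's rows to the shared buffer.
  def addRows(partition: Iterator[Array[Double]]): Unit = rows.synchronized {
    rows ++= partition
  }

  // Returns true for exactly one task per executor (the last one to finish copying).
  // That task would then create the single native LightGBM dataset and register
  // this machine as one network node, instead of one node per task.
  def isAggregatingTask(tasksOnExecutor: Int): Boolean =
    finishedTasks.incrementAndGet() == tasksOnExecutor
}
```

In this sketch the single aggregating task plays the role that every task plays in the default mode, which is why fewer LightGBM network nodes need to be initialized.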

Note that in most cluster configurations there is usually only one executor per machine anyway.

In performance tests we have found that this mode outperforms the default in some scenarios, both in memory usage and in execution time.

On a 9 GB sparse dataset with large parameter values (num_leaves=768, num_trees=1000, min_data_in_leaf=15000, max_bin=512), running on 5 machines with 8 cores and 28 GB of RAM each, the new mode finished in 17.54 minutes. When specifying tasks=5 the run took 106 minutes, and the default mode failed with an OOM error.

However, in other scenarios the default mode is much faster.
On the dense Higgs dataset (4 GB) with default parameters and 8 workers (14 GB of memory and 4 cores each), the default run took 54 seconds while the new single dataset mode took 1.1 minutes, which was a bit slower (it used to take 2 minutes; a recent optimization of the dataset-to-native conversion code sped it up considerably).

For this reason we will keep this mode off by default for now while we continue benchmarking and experimenting.
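For anyone who wants to experiment with the mode, enabling it from the Scala API might look like the following minimal sketch. The setUseSingleDatasetMode setter name is assumed from the useSingleDatasetMode parameter described above, and the column names and other setters are only illustrative.

```scala
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

// Minimal sketch; the setter name is assumed from the useSingleDatasetMode parameter above.
val classifier = new LightGBMClassifier()
  .setUseSingleDatasetMode(true) // one LightGBMDataset per executor (default: false)
  .setNumLeaves(768)
  .setNumIterations(1000)
  .setLabelCol("label")
  .setFeaturesCol("features")

// val model = classifier.fit(trainDF) // trainDF: a DataFrame with "label" and "features" columns
```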

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented Jun 1, 2021

Codecov Report

Merging #1066 (607679a) into master (fe70f31) will decrease coverage by 0.20%.
The diff coverage is 90.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1066      +/-   ##
==========================================
- Coverage   85.74%   85.54%   -0.21%     
==========================================
  Files         252      254       +2     
  Lines       11605    11801     +196     
  Branches      599      619      +20     
==========================================
+ Hits         9951    10095     +144     
- Misses       1654     1706      +52     
Impacted Files Coverage Δ
...rosoft/ml/spark/stages/PartitionConsolidator.scala 95.74% <ø> (ø)
...crosoft/ml/spark/lightgbm/params/TrainParams.scala 100.00% <ø> (ø)
...om/microsoft/ml/spark/lightgbm/LightGBMUtils.scala 74.50% <20.00%> (-18.29%) ⬇️
...m/microsoft/ml/spark/lightgbm/LightGBMRanker.scala 64.17% <80.00%> (+0.54%) ⬆️
...osoft/ml/spark/lightgbm/dataset/DatasetUtils.scala 61.11% <81.81%> (-21.86%) ⬇️
.../ml/spark/lightgbm/dataset/DatasetAggregator.scala 87.30% <87.30%> (ø)
.../com/microsoft/ml/spark/lightgbm/SharedState.scala 88.88% <88.88%> (ø)
...m/microsoft/ml/spark/lightgbm/swig/SwigUtils.scala 91.66% <90.90%> (-8.34%) ⬇️
...com/microsoft/ml/spark/lightgbm/LightGBMBase.scala 94.88% <97.36%> (+2.02%) ⬆️
...om/microsoft/ml/spark/core/utils/ClusterUtil.scala 68.57% <100.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe70f31...607679a.

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 merged commit 0f69cf5 into microsoft:master on Jul 12, 2021
@pfung

pfung commented Jul 13, 2021

Hello, how can I get the latest snapshot jar with this feature please? Thank you.
