
fix: LightGBMRanker should repartition by grouping column #778

Merged: 13 commits merged into microsoft:master from JoanFM:repartition on Feb 11, 2020

Conversation

@JoanFM (Contributor) commented Jan 21, 2020

fix: Repartition by groupingColumn to guarantee proper LightGBMRanker results.
Repartition by the grouping column: at the moment it is only guaranteed that rows of a group are kept together within whatever partition they happen to land in, but for correctness all rows of a group must end up in the same partition.

A comment in #682 seems to agree with this statement.
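A minimal sketch of the idea, assuming a Spark DataFrame and a grouping column; the names `df`, `groupCol`, and `repartitionByGroup` are hypothetical and not taken from the actual PR code:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: hash-partition by the grouping column so that all rows of a
// query group land in the same partition before LightGBM builds its
// per-partition datasets. Within a partition, rows of a group stay together.
def repartitionByGroup(df: DataFrame, groupCol: String, numPartitions: Int): DataFrame =
  df.repartition(numPartitions, df(groupCol))
```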

welcome bot commented Jan 21, 2020

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!
Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

Make sure to check out the developer guide for guidance on testing your change.

@JoanFM changed the title from "LightGBMRanker should repartition by grouping column" to "fix: LightGBMRanker should repartition by grouping column" on Jan 21, 2020
@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

codecov bot commented Jan 21, 2020

Codecov Report

Merging #778 into master will increase coverage by 8.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #778      +/-   ##
==========================================
+ Coverage   41.41%   49.44%   +8.02%     
==========================================
  Files         241      185      -56     
  Lines        9701     8515    -1186     
  Branches      525      518       -7     
==========================================
+ Hits         4018     4210     +192     
+ Misses       5683     4305    -1378     
Impacted Files Coverage Δ
...cala/com/microsoft/ml/spark/lime/BreezeUtils.scala 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/spark/ml/LimeNamespaceInjections.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/com/microsoft/ml/spark/stages/Batchers.scala 3.70% <0.00%> (-92.60%) ⬇️
...osoft/ml/spark/io/http/PartitionConsolidator.scala 95.55% <0.00%> (ø) ⬆️
...scala/com/microsoft/ml/spark/lime/Superpixel.scala 1.77% <0.00%> (-86.99%) ⬇️
...la/com/microsoft/ml/spark/stages/DropColumns.scala 13.33% <0.00%> (-80.00%) ⬇️
.../com/microsoft/ml/spark/stages/SelectColumns.scala 13.33% <0.00%> (-80.00%) ⬇️
...m/microsoft/ml/spark/stages/TextPreprocessor.scala 12.96% <0.00%> (-79.63%) ⬇️
.../main/scala/com/microsoft/ml/spark/lime/LIME.scala 18.26% <0.00%> (-75.97%) ⬇️
.../com/microsoft/ml/spark/stages/EnsembleByKey.scala 13.82% <0.00%> (-75.54%) ⬇️
... and 142 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f702921...5ba8333.

@imatiach-msft (Contributor) previously approved these changes on Jan 21, 2020 and left a comment:

LGTM

@mhamilton723 (Collaborator): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@JoanFM (Contributor, Author) commented Jan 23, 2020

Can I get some help with the tests? This PR is expected to cause diffs, but with the class structure in the tests I am having real trouble seeing what I need to check and update. Can someone give me a hand with this?

Thank you very much

@imatiach-msft (Contributor) commented:

@JoanFM it looks like some lightgbm ranker tests are failing, not sure why yet. Will take a look when I get a chance.

@JoanFM (Contributor, Author) commented Jan 23, 2020

Yes, I think it is expected: I observed that the model result depends on how the data is partitioned, but I cannot track down where the expected output is taken from. The complete fuzzing class structure is a little too complex if you are not familiar with it, I guess.

@JoanFM (Contributor, Author) commented Jan 28, 2020

> @JoanFM it looks like some lightgbm ranker tests are failing, not sure why yet. Will take a look when I get a chance.

Could you point me in some direction so I can fix it on my own, if you do not have the time?

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor) commented:

@JoanFM yes, I will take a look, sorry. I am not sure where else to point you, as I am not sure myself why it is failing. I was just going to get your branch and run the tests locally to see if I could reproduce the test failures.

@JoanFM (Contributor, Author) commented Jan 29, 2020

@imatiach-msft I could reproduce the issue locally. I think it is natural to have differences, because now every LightGBM instance sees a different set of samples.
But I do not know how to change the expected output, or how to compare it against the obtained one; I guess you will see it much more easily.

JoanFM added a commit: Ignore estimators in VerifyLightGBMRanker. Since repartition is not deterministic, we cannot guarantee the same model after fit with the same parameters and fitDF.
@JoanFM (Contributor, Author) commented Jan 31, 2020

Hello @imatiach-msft,

I have found the source of the discrepancy.

At some point we test that when the Ranker is saved (which, as I understand it, is just the configuration) and then loaded again, it produces the same prediction.

But since repartition is, I think, non-deterministic, we cannot guarantee that we will get the same LightGBM model with the same trees and gains. Therefore there are some discrepancies.

I think this test may be meaningless in this case. Let me know if this last change is okay.
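One way to address this, which the commit list later in this thread describes as "Add option to turn ON and OFF repartitioning, and setting it off for test", could look roughly like the sketch below. The trait and parameter name are hypothetical, not the actual code merged in this PR:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical sketch of a toggle for the repartitioning behaviour, so that
// tests which require a deterministic model can switch it off.
trait HasRepartitionByGroupingColumn extends Params {
  val repartitionByGroupingColumn = new BooleanParam(
    this, "repartitionByGroupingColumn",
    "Whether to repartition the dataset by the grouping column before training")
  setDefault(repartitionByGroupingColumn -> true)

  def getRepartitionByGroupingColumn: Boolean = $(repartitionByGroupingColumn)

  def setRepartitionByGroupingColumn(value: Boolean): this.type =
    set(repartitionByGroupingColumn, value)
}
```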

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@JoanFM (Contributor, Author) commented Feb 11, 2020

Any idea what is failing? Is there some test that is failing due to this change? Is there a way to see the cause of the failure?

@imatiach-msft (Contributor) commented:

@JoanFM yes it looks like one of the notebook tests timed out:
Array(Failure(java.util.concurrent.TimeoutException: Notebook /MMLSparkBuild/build_1.0.0-rc1-61-6de883c7-SNAPSHOT/Classification - Adult Census with Vowpal Wabbit.ipynb timed out after 2400000 ms, job in state RUNNING, For more information check the run page: https://eastus2.azuredatabricks.net/?o=4652563148747856#job/5249/run/1 )) was not empty

I see this error when I click "details" on the build. I'm guessing you don't have access because the devops account is still not public. I mentioned to @mhamilton723 that the devops account needs to be made public so external developers can view it, but he hasn't had the cycles to work on this.

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor) commented:

It looks like the datasets failed to download, which caused all of the LightGBM tests to fail. Weird. Hoping it's a transient issue; I think it might be related to the change @eisber recently made to update the datasets zip file.

@imatiach-msft (Contributor) commented:

Hmm, it looks like that was a transient networking failure when downloading the files, @eisber; the test seems to be passing now. Maybe we need to add some retry logic there in the future in case the download/unzip fails.
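A minimal sketch of such retry logic, assuming a generic by-name action for the download/unzip step; the helper and the example function name are hypothetical, and the actual dataset-fetching code in the repository may look quite different:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper for retrying a flaky download/unzip step.
def withRetries[T](attemptsLeft: Int, delayMillis: Long)(action: => T): T =
  Try(action) match {
    case Success(result) => result
    case Failure(_) if attemptsLeft > 1 =>
      // Wait a bit, then retry the failed download or unzip.
      Thread.sleep(delayMillis)
      withRetries(attemptsLeft - 1, delayMillis)(action)
    case Failure(e) => throw e
  }

// Example usage (hypothetical function name):
// val datasets = withRetries(attemptsLeft = 3, delayMillis = 5000)(downloadAndUnzipDatasets())
```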

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft merged commit e745784 into microsoft:master on Feb 11, 2020
@JoanFM deleted the repartition branch on February 11, 2020
ocworld pushed a commit to AhnLab-OSS/mmlspark that referenced this pull request Mar 24, 2020

* LightGBMRanker should repartition by grouping column

* Ignore estimators in VerifyLightGBMRanker. Since repartition is not deterministic, we cannot guarantee same model after fit with same parameters and fitDF

* Add option to turn ON and OFF repartitioning, and setting it off for test

Co-authored-by: Ilya Matiach <ilmat@microsoft.com>
Co-authored-by: Mark Hamilton <mhamilton723@gmail.com>
@liyzcj (Contributor) commented Dec 30, 2021

@JoanFM Why is forcing materialization needed here?

@JoanFM (Contributor, Author) commented Dec 30, 2021

Hey @liyzcj,

I am a little disconnected from this project, but I guess it may have had to do with the laziness of the operation?

Not 100% sure at this moment; it may not even be needed.
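For context, Spark transformations such as repartition are lazy, so one common way to force materialization is to cache the repartitioned DataFrame and run an action on it. A minimal sketch with hypothetical names (`df`, `groupCol`, `repartitionAndMaterialize`), not the actual PR code:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: trigger the shuffle eagerly so later stages see the
// repartitioned, cached data instead of recomputing it lazily.
def repartitionAndMaterialize(df: DataFrame, groupCol: String): DataFrame = {
  val repartitioned = df.repartition(df(groupCol)).cache()
  repartitioned.count() // action: runs the shuffle and populates the cache
  repartitioned
}
```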

@liyzcj (Contributor) commented Dec 30, 2021

@JoanFM Thanks for the reply, I may test this later.
