
fix: LightGBMRanker should repartition by grouping column #778

Merged: 13 commits merged into microsoft:master from JoanFM:repartition on Feb 11, 2020

Conversation

@JoanFM (Contributor) commented Jan 21, 2020

fix: Repartition by groupingColumn to guarantee proper LightGBMRanker results.
Repartition by the grouping column: at the moment it is only guaranteed that rows of a group are kept together within whatever partition they happen to land in, but for correctness all rows of a group must end up in the same partition.

A comment in #682 seems to agree with this statement.
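A minimal sketch of the idea, assuming a Spark DataFrame and a grouping column; the names `df`, `groupCol`, and `repartitionByGroup` are hypothetical and not taken from the actual PR code:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: hash-partition by the grouping column so that all rows of a
// query group land in the same partition before LightGBM builds its
// per-partition datasets. Within a partition, rows of a group stay together.
def repartitionByGroup(df: DataFrame, groupCol: String, numPartitions: Int): DataFrame =
  df.repartition(numPartitions, df(groupCol))
```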

welcome bot commented Jan 21, 2020

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!
Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

Make sure to check out the developer guide for guidance on testing your change.

@JoanFM changed the title from "LightGBMRanker should repartition by grouping column" to "fix: LightGBMRanker should repartition by grouping column" on Jan 21, 2020
@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

codecov bot commented Jan 21, 2020

Codecov Report

Merging #778 into master will increase coverage by 8.02%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #778      +/-   ##
==========================================
+ Coverage   41.41%   49.44%   +8.02%     
==========================================
  Files         241      185      -56     
  Lines        9701     8515    -1186     
  Branches      525      518       -7     
==========================================
+ Hits         4018     4210     +192     
+ Misses       5683     4305    -1378     
Impacted Files Coverage Δ
...cala/com/microsoft/ml/spark/lime/BreezeUtils.scala 0.00% <0.00%> (-100.00%) ⬇️
.../org/apache/spark/ml/LimeNamespaceInjections.scala 0.00% <0.00%> (-100.00%) ⬇️
...scala/com/microsoft/ml/spark/stages/Batchers.scala 3.70% <0.00%> (-92.60%) ⬇️
...osoft/ml/spark/io/http/PartitionConsolidator.scala 95.55% <0.00%> (ø) ⬆️
...scala/com/microsoft/ml/spark/lime/Superpixel.scala 1.77% <0.00%> (-86.99%) ⬇️
...la/com/microsoft/ml/spark/stages/DropColumns.scala 13.33% <0.00%> (-80.00%) ⬇️
.../com/microsoft/ml/spark/stages/SelectColumns.scala 13.33% <0.00%> (-80.00%) ⬇️
...m/microsoft/ml/spark/stages/TextPreprocessor.scala 12.96% <0.00%> (-79.63%) ⬇️
.../main/scala/com/microsoft/ml/spark/lime/LIME.scala 18.26% <0.00%> (-75.97%) ⬇️
.../com/microsoft/ml/spark/stages/EnsembleByKey.scala 13.82% <0.00%> (-75.54%) ⬇️
... and 142 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f702921...5ba8333.

@imatiach-msft (Contributor) previously approved these changes on Jan 21, 2020 and left a comment:

LGTM

@mhamilton723 (Collaborator): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@JoanFM (Contributor, Author) commented Jan 23, 2020

Can I get some help with the tests? This PR is expected to cause diffs, but with the class structure in the tests I am having real trouble seeing what I need to check and update. Can someone give me a hand with this?

Thank you very much

@imatiach-msft (Contributor) commented:

@JoanFM it looks like some lightgbm ranker tests are failing, not sure why yet. Will take a look when I get a chance.

@JoanFM (Contributor, Author) commented Jan 23, 2020

Yes, I think it is expected: I observed that the model result depends on how the data is partitioned, but I cannot track down where the expected output is taken from. The complete fuzzing class structure is a little too complex if you are not familiar with it, I guess.

@JoanFM (Contributor, Author) commented Jan 28, 2020

> @JoanFM it looks like some lightgbm ranker tests are failing, not sure why yet. Will take a look when I get a chance.

Could you point me in some direction so I can fix it on my own, if you do not have the time?

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor) commented:

@JoanFM yes, I will take a look, sorry. I am not sure where else to point you, as I am not sure myself why it is failing. I was just going to get your branch and run the tests locally to see if I could reproduce the test failures.

@JoanFM (Contributor, Author) commented Jan 29, 2020

@imatiach-msft I could reproduce the issue locally. I think it is natural to have differences, because now every LightGBM instance sees a different set of samples.
But I do not know how to change the expected output, or how to compare it against the obtained one; I guess you will see it much more easily.

JoanFM added a commit: Ignore estimators in VerifyLightGBMRanker. Since repartition is not deterministic, we cannot guarantee the same model after fit with the same parameters and fitDF.
@JoanFM (Contributor, Author) commented Jan 31, 2020

Hello @imatiach-msft,

I have found the source of the discrepancy.

At some point we test that when the Ranker is saved (which, as I understand it, is just the configuration) and then loaded again, it produces the same prediction.

But since repartition is, I think, non-deterministic, we cannot guarantee that we will get the same LightGBM model with the same trees and gains. Therefore there are some discrepancies.

I think this test may be meaningless in this case. Let me know if this last change is okay.
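One way to address this, which the commit list later in this thread describes as "Add option to turn ON and OFF repartitioning, and setting it off for test", could look roughly like the sketch below. The trait and parameter name are hypothetical, not the actual code merged in this PR:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical sketch of a toggle for the repartitioning behaviour, so that
// tests which require a deterministic model can switch it off.
trait HasRepartitionByGroupingColumn extends Params {
  val repartitionByGroupingColumn = new BooleanParam(
    this, "repartitionByGroupingColumn",
    "Whether to repartition the dataset by the grouping column before training")
  setDefault(repartitionByGroupingColumn -> true)

  def getRepartitionByGroupingColumn: Boolean = $(repartitionByGroupingColumn)

  def setRepartitionByGroupingColumn(value: Boolean): this.type =
    set(repartitionByGroupingColumn, value)
}
```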

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@JoanFM (Contributor, Author) commented Feb 11, 2020

Any idea what is failing? Is there some test that is failing due to this change? Is there a way to see the cause of the failure?

@imatiach-msft (Contributor) commented:

@JoanFM yes it looks like one of the notebook tests timed out:
Array(Failure(java.util.concurrent.TimeoutException: Notebook /MMLSparkBuild/build_1.0.0-rc1-61-6de883c7-SNAPSHOT/Classification - Adult Census with Vowpal Wabbit.ipynb timed out after 2400000 ms, job in state RUNNING, For more information check the run page: https://eastus2.azuredatabricks.net/?o=4652563148747856#job/5249/run/1 )) was not empty

I see this error when I click "details" on the build. I'm guessing you don't have access because the devops account is still not public. I mentioned to @mhamilton723 that the devops account needs to be made public so external developers can view it, but he hasn't had the cycles to work on this.

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor) commented:

It looks like the datasets failed to download, which caused all of the LightGBM tests to fail. Weird. Hoping it's a transient issue; I think it might be related to the change @eisber recently made to update the datasets zip file.

@imatiach-msft (Contributor) commented:

Hmm, it looks like that was a transient networking failure when downloading the files, @eisber; the test seems to be passing now. Maybe we need to add some retry logic there in the future in case the download/unzip fails.
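A minimal sketch of such retry logic, assuming a generic by-name action for the download/unzip step; the helper and the example function name are hypothetical, and the actual dataset-fetching code in the repository may look quite different:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper for retrying a flaky download/unzip step.
def withRetries[T](attemptsLeft: Int, delayMillis: Long)(action: => T): T =
  Try(action) match {
    case Success(result) => result
    case Failure(_) if attemptsLeft > 1 =>
      // Wait a bit, then retry the failed download or unzip.
      Thread.sleep(delayMillis)
      withRetries(attemptsLeft - 1, delayMillis)(action)
    case Failure(e) => throw e
  }

// Example usage (hypothetical function name):
// val datasets = withRetries(attemptsLeft = 3, delayMillis = 5000)(downloadAndUnzipDatasets())
```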

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft (Contributor): /azp run

azure-pipelines bot: Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft merged commit e745784 into microsoft:master on Feb 11, 2020
@JoanFM deleted the repartition branch on February 11, 2020
ocworld pushed a commit to AhnLab-OSS/mmlspark that referenced this pull request Mar 24, 2020

* LightGBMRanker should repartition by grouping column

* Ignore estimators in VerifyLightGBMRanker. Since repartition is not deterministic, we cannot guarantee same model after fit with same parameters and fitDF

* Add option to turn ON and OFF repartitioning, and setting it off for test

Co-authored-by: Ilya Matiach <ilmat@microsoft.com>
Co-authored-by: Mark Hamilton <mhamilton723@gmail.com>
@liyzcj (Contributor) commented Dec 30, 2021

@JoanFM Why is forcing materialization needed here?

@JoanFM (Contributor, Author) commented Dec 30, 2021

Hey @liyzcj,

I am a little disconnected from this project, but I guess it may have had to do with the laziness of the operation?

Not 100% sure at this moment; it may not even be needed.
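For context, Spark transformations such as repartition are lazy, so one common way to force materialization is to cache the repartitioned DataFrame and run an action on it. A minimal sketch with hypothetical names (`df`, `groupCol`, `repartitionAndMaterialize`), not the actual PR code:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: trigger the shuffle eagerly so later stages see the
// repartitioned, cached data instead of recomputing it lazily.
def repartitionAndMaterialize(df: DataFrame, groupCol: String): DataFrame = {
  val repartitioned = df.repartition(df(groupCol)).cache()
  repartitioned.count() // action: runs the shuffle and populates the cache
  repartitioned
}
```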

@liyzcj (Contributor) commented Dec 30, 2021

@JoanFM Thanks for the reply, I may test this later.
