fix lightgbm stuck in multiclass scenario and added stratified repartition transformer #618

Merged

imatiach-msft
Contributor

Fixes issues #609 and #569.

@imatiach-msft
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-io

codecov-io commented Jul 14, 2019

Codecov Report

Merging #618 into master will increase coverage by 0.09%.
The diff coverage is 96.22%.

@@            Coverage Diff            @@
##           master    #618      +/-   ##
=========================================
+ Coverage    79.7%   79.8%   +0.09%     
=========================================
  Files         224     225       +1     
  Lines        8965    9016      +51     
  Branches      473     474       +1     
=========================================
+ Hits         7146    7195      +49     
- Misses       1819    1821       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| .../com/microsoft/ml/spark/lightgbm/TrainParams.scala | 100% <ø> (ø) | ⬆️ |
| ...a/com/microsoft/ml/spark/lightgbm/TrainUtils.scala | 90.62% <100%> (+0.67%) | ⬆️ |
| ...crosoft/ml/spark/lightgbm/LightGBMClassifier.scala | 88.09% <100%> (+0.75%) | ⬆️ |
| ...icrosoft/ml/spark/lightgbm/LightGBMConstants.scala | 100% <100%> (ø) | ⬆️ |
| ...rosoft/ml/spark/stages/StratifiedRepartition.scala | 93.1% <93.1%> (ø) | |
| ...osoft/ml/spark/io/http/PartitionConsolidator.scala | 93.33% <0%> (-2.23%) | ⬇️ |
| src/main/python/mmlspark/stages/__init__.py | 100% <0%> (ø) | ⬆️ |
| ...om/microsoft/ml/spark/lightgbm/LightGBMUtils.scala | 96.47% <0%> (+1.17%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 258cafb...4ff5a12.

labelToCount.map(lc => (lc._1, 1.0)).toMap
}

val spdata = dataset.toDF().rdd.keyBy(row => row.getInt(row.schema.fieldIndex(getLabelCol)))
Collaborator

Does this have an equivalent in the DataFrame API?

Contributor Author

hmm, it seems not
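
For readers unfamiliar with `keyBy`, the idea in the snippet under review can be illustrated without Spark. The sketch below is not the code from this PR: `StratifiedAssignSketch` and `assign` are hypothetical names, and it uses plain Scala collections rather than RDDs. It pairs each row with its label (the role `keyBy` plays above) and then deals each label's rows round-robin across a fixed number of buckets, so that every bucket receives every label that has at least as many rows as there are buckets.

```scala
object StratifiedAssignSketch {
  // Hypothetical helper, for illustration only (not part of the PR).
  // Deals the rows of each label round-robin across `numPartitions` buckets.
  def assign[T](rows: Seq[T], numPartitions: Int)(label: T => Int): Vector[Vector[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")

    // Plain-collection analogue of keyBy: pair each row with its label.
    val keyed: Seq[(Int, T)] = rows.map(r => (label(r), r))

    // Within each label group, the i-th row goes to bucket i % numPartitions.
    val placed: Seq[(Int, T)] = keyed
      .groupBy(_._1)
      .values
      .flatMap(_.zipWithIndex.map { case ((_, row), i) => (i % numPartitions, row) })
      .toSeq

    // Collect the rows assigned to each bucket.
    Vector.tabulate(numPartitions)(p => placed.filter(_._1 == p).map(_._2).toVector)
  }
}
```

Calling `StratifiedAssignSketch.assign(rows, 2)(_._1)` on rows shaped like `(label, payload)` returns two buckets. A label with fewer rows than buckets cannot reach every bucket, which is the situation the rest of this thread is concerned with.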

@mhamilton723
Collaborator

/app run

@mhamilton723
Collaborator

Needs a fuzzer, and the tests also fail on the build machine.

@imatiach-msft force-pushed the ilmat/lgbm-multiclass-stuck branch 2 times, most recently from 614dfd6 to 0cad2fd (July 19, 2019 03:59)

@imatiach-msft force-pushed the ilmat/lgbm-multiclass-stuck branch 2 times, most recently from 185d1e9 to 77f1b0f (August 19, 2019 05:53)

@mhamilton723 left a comment
Collaborator

We should talk about this in person. I don't think it makes sense to add dummy rows to people's datasets, as this will change the computation and hurt the classifier. Instead, consider throwing a helpful error message that points them in the direction of stratified sampling. Also, this makes the LightGBM code more complex and less maintainable.

@imatiach-msft
Contributor Author

@mhamilton723 As discussed, that mode was added for debugging user issues and is off by default. By default, we just fail if the user does not have all labels on all partitions for classification.
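
The default behavior described here, failing when some partition lacks a label, could look roughly like the following. This is a sketch under stated assumptions, not the check that this PR adds to the LightGBM code: `LabelCoverageSketch`, `missingLabels`, and `requireAllLabels` are hypothetical names, each partition is modeled as a plain sequence of integer labels, and the error text is illustrative.

```scala
object LabelCoverageSketch {
  // Hypothetical helper: for each partition index, the labels that appear
  // somewhere in the data but not in that partition. Partitions that
  // contain every label are omitted from the result.
  def missingLabels(partitions: Seq[Seq[Int]]): Map[Int, Set[Int]] = {
    val allLabels = partitions.flatten.toSet
    partitions.zipWithIndex
      .map { case (labels, idx) => idx -> (allLabels -- labels.toSet) }
      .filter(_._2.nonEmpty)
      .toMap
  }

  // Hypothetical check: fail fast with a message that points the user
  // toward stratified repartitioning, as suggested in the review.
  def requireAllLabels(partitions: Seq[Seq[Int]]): Unit = {
    val missing = missingLabels(partitions)
    if (missing.nonEmpty) {
      val details = missing.toSeq
        .sortBy(_._1)
        .map { case (idx, ls) => s"partition $idx is missing labels ${ls.toSeq.sorted.mkString(", ")}" }
        .mkString("; ")
      throw new IllegalArgumentException(
        s"Not every partition contains every label: $details. " +
          "Consider a stratified repartition of the data before training.")
    }
  }
}
```

Failing with a descriptive message leaves the user's data untouched, which addresses the concern above about dummy rows changing the computation.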

@mhamilton723 merged commit d518b8a into microsoft:master on Aug 20, 2019