Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix validation data creation for useSingleDataset mode #1527

Merged
merged 6 commits into from
Jun 14, 2022

Conversation

svotaw
Copy link
Collaborator

@svotaw svotaw commented Jun 12, 2022

Summary

Fix validation Dataset creation in useSingleDataset mode. Due to shared code with the regular training Dataset, every partition tries to merge its data with the "single" executor Dataset. But for validation data, there is only 1 array of data, so this ends up duplicating it. This causes 2 problems:

  1. Extra pressure for OOM errors
  2. If not every executor has the same # of partitions, then the validation Dataset is different length on each executor, causing errors.

Tests

The existing validation Dataset tests still pass.

Dependency changes

If you needed to make any changes to dependencies of this project, please describe them here.

AB#1828018

imatiach-msft
imatiach-msft previously approved these changes Jun 13, 2022
Copy link
Contributor

@imatiach-msft imatiach-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@svotaw
Copy link
Collaborator Author

svotaw commented Jun 13, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter
Copy link

codecov-commenter commented Jun 13, 2022

Codecov Report

Merging #1527 (21b0107) into master (0a6a728) will increase coverage by 1.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1527      +/-   ##
==========================================
+ Coverage   82.84%   84.27%   +1.42%     
==========================================
  Files         297      290       -7     
  Lines       14942    14819     -123     
  Branches      728      719       -9     
==========================================
+ Hits        12379    12489     +110     
+ Misses       2563     2330     -233     
Impacted Files Coverage Δ
...rosoft/azure/synapse/ml/lightgbm/SharedState.scala 91.17% <100.00%> (+1.89%) ⬆️
...ynapse/ml/lightgbm/dataset/DatasetAggregator.scala 96.18% <100.00%> (ø)
...org/apache/spark/ml/param/JsonEncodableParam.scala 57.89% <0.00%> (-26.32%) ⬇️
...g/apache/spark/ml/param/PythonWrappableParam.scala 66.66% <0.00%> (-8.34%) ⬇️
...re/src/main/python/synapse/ml/core/schema/Utils.py 67.10% <0.00%> (-5.27%) ⬇️
...soft/azure/synapse/ml/cognitive/TextToSpeech.scala 84.84% <0.00%> (-3.04%) ⬇️
...oft/azure/synapse/ml/cognitive/TextAnalytics.scala 86.59% <0.00%> (-2.69%) ⬇️
.../azure/synapse/ml/cognitive/TextAnalyticsSDK.scala 86.01% <0.00%> (-1.40%) ⬇️
...ft/azure/synapse/ml/cognitive/FormRecognizer.scala 73.40% <0.00%> (-1.07%) ⬇️
...t/azure/synapse/ml/cognitive/BingImageSearch.scala 89.28% <0.00%> (-0.90%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a6a728...21b0107. Read the comment docs.

@svotaw
Copy link
Collaborator Author

svotaw commented Jun 14, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@svotaw svotaw enabled auto-merge (squash) June 14, 2022 14:16
@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@svotaw svotaw merged commit d0bc785 into microsoft:master Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants