Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix partitioning logic for uneven division #318

Merged
merged 4 commits into from
May 6, 2021

Conversation

Shelnutt2
Copy link
Member

There was a bug in the logic of partitioning in that we used ceil not floor when computing the number of element per partition. This meant that if we rounded up and used up all the elements before the requested user partitioning count we would run out of samples/regions and end up with empty partitions. The empty partitions were not handled gracefully and caused the library to incorrectly fetch all headers. Depending on the exact scenario it might have also resulted in a segfault when attempting to set empty ranges on the core library due to accessing non-existent vector elements.

The fix for this issue is to use floor when computing element per partition and to just assign the left overs to the first N partitions,
where one is given to each partition until we run out of "left over" elements.

@Shelnutt2 Shelnutt2 requested a review from aaronwolen May 5, 2021 22:43
@shortcut-integration
Copy link

This pull request has been linked to Clubhouse Story #7184: Invalid partition created.

Shelnutt2 added a commit that referenced this pull request May 6, 2021
Sample partitions should never be empty after
[#318](#318). However
incase it is, this safety check will exit early instead of running over
the entire dataset.
There was a bug in the logic of partitioning in that we used ceil not
floor when computing the number of element per partition. This meant
that if we rounded up and used up all the elements before the requested
user partitioning count we would run out of samples/regions and end up
with empty partitions. The empty partitions were not handled gracefully
and caused the library to incorrectly fetch all headers. Depending on
the exact scenario it might have also resulted in a segfault when
attempting to set empty ranges on the core library due to accessing
non-existent vector elements.

The fix for this issue is to use floor when computing element per
partition and to just assign the left overs to the first N partitions,
where one is given to each partition until we run out of "left over"
elements.
@aaronwolen aaronwolen force-pushed the sethshelnutt/ch7184/invalid-partition-created branch from 75f5652 to 9d6e491 Compare May 6, 2021 18:10
This verifies uneven division does not produce one or more empty
partitions that would previously result in full exports
This verifies that samples are not assigned to more than one
partition
This adjusts the index min/max offsets as a function of the
current partition index
@Shelnutt2 Shelnutt2 merged commit de4c96e into master May 6, 2021
@Shelnutt2 Shelnutt2 deleted the sethshelnutt/ch7184/invalid-partition-created branch May 6, 2021 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants