Improvements to parallelization and related issues #27
With the changes in d36ff29, the number of chunks should now always be an exact multiple of the number of workers, if that number is available.
Note that I've chosen to adjust the number of chunks upward, to the smallest multiple of n_workers that is at least 100. So, for example, if n_workers = 60, then the number of chunks will be 120. This also means that if n_workers is greater than 100, then n_chunks will simply be set to n_workers, however large that may be. It might make more sense instead to adjust the number of chunks downward, to the largest multiple of n_workers that is less than or equal to 100. But in that case you'd need a special case for n_workers > 100, setting n_chunks to 100 directly, since otherwise the largest such multiple would be 0.
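For illustration, here is a sketch of both rounding strategies described above. This is a Python sketch of the arithmetic only (the actual PR code is R); the function names and the `target` parameter are hypothetical, with `target=100` standing in for the default chunk count:

```python
import math

def choose_n_chunks_up(n_workers, target=100):
    """Round up: smallest multiple of n_workers that is >= target.
    When n_workers > target, this degenerates to n_workers itself."""
    return math.ceil(target / n_workers) * n_workers

def choose_n_chunks_down(n_workers, target=100):
    """Round down: largest multiple of n_workers that is <= target,
    with the special case for n_workers > target mentioned above."""
    if n_workers > target:
        return target
    return (target // n_workers) * n_workers

print(choose_n_chunks_up(60))     # 120: next multiple of 60 above 100
print(choose_n_chunks_up(120))    # 120: n_workers > 100, so n_chunks = n_workers
print(choose_n_chunks_down(60))   # 60: largest multiple of 60 below 100
print(choose_n_chunks_down(120))  # 100: special case for n_workers > 100
```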
Another note: I'm not sure whether you chose 100 chunks for reasons relating to memory usage or other concerns, which is why I implemented things as described above. However, it might make sense to simply set n_chunks = n_workers whenever n_workers is available. The code for that would obviously be much simpler, and I'd be happy to implement it if you think that would be better.
Lastly, I've also adjusted the logic for avoiding excessive splitting of small objects: instead of just setting n_chunks to 1 if the object is small enough, the code now enforces a minimum chunk size and adjusts n_chunks accordingly. For example, with min_chunk_size = 20 and nrow = 100, n_chunks will be capped at 5, regardless of all other options. I've set min_chunk_size to 20, but that's an arbitrary and untested choice, so if you have a better intuition for this value, feel free to change it.
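The minimum-chunk-size cap works out to roughly the following (again a Python sketch of the arithmetic, not the R implementation; min_chunk_size = 20 matches the value mentioned above, the function name is hypothetical):

```python
def cap_n_chunks(n_chunks, nrow, min_chunk_size=20):
    """Cap n_chunks so every chunk gets at least min_chunk_size rows,
    rather than collapsing small objects straight to a single chunk."""
    max_chunks = max(1, nrow // min_chunk_size)  # at least one chunk
    return min(n_chunks, max_chunks)

print(cap_n_chunks(100, nrow=100))  # 5: 100 rows / 20 rows per chunk
print(cap_n_chunks(2, nrow=100))    # 2: requested count already small enough
print(cap_n_chunks(10, nrow=5))     # 1: object smaller than one minimum chunk
```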
As for the other commits: I've wrapped the conditions of `if` statements in `isTRUE()` to ensure that conditionals never evaluate to `NA`.