Description
Problem statement
As seen in #666, the globally used CuPy memory pool is prone to fragmentation. This can lead to allocation failures when running the pipeline. The following example illustrates the problem:
- Let the pipeline run on a single GPU with 8 GiB of available memory.
- Method "A" runs. First it allocates its output array of 1 GiB, then a 4 GiB array for temporary calculations.
- After method "A" returns, the memory pool holds 5 GiB overall: a 1 GiB chunk holding the output of method "A", and a 4 GiB chunk that is unused at this point. 3 GiB is still free on the device.
- Method "B" runs. For the sake of the example, it just allocates an output array of 1 GiB. To avoid a new device allocation, the pool takes the unused 4 GiB chunk, bumps a pointer, and returns 1 GiB of it for the output array.
- After method "B" returns, there are still 6 GiB theoretically free: 3 GiB remaining in the pool's 4 GiB chunk, and 3 GiB free on the device.
- Method "A" runs again. This should be possible, because it needs 5 GiB in total, and as shown above, 6 GiB is free overall. However, while the 1 GiB output allocation can be served by the remaining 3 GiB in the pool, the subsequent allocation of the temporary 4 GiB array fails, as no contiguous 4 GiB region is available either in the pool or on the device.
This is analogous to what happens in #666: the smaller output of darks/flats takes hold of a larger backing chunk in the memory pool, effectively limiting the maximum size of subsequent allocations.
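The arithmetic above can be sketched with a toy caching allocator. This is an illustrative simulation only (the `ToyPool` class is made up and much simpler than CuPy's actual allocator), but it reproduces why the final 4 GiB request fails even though 5 GiB is theoretically free:

```python
# Toy model of the scenario above; all sizes are in GiB.

DEVICE_TOTAL = 8

class ToyPool:
    """Caching allocator: serves requests from a cached chunk when one has
    room, otherwise asks the device for a new contiguous chunk."""
    def __init__(self, device_total):
        self.device_free = device_total
        self.chunks = []  # (capacity, used) pairs; each chunk is contiguous

    def alloc(self, size):
        # First try to carve the request out of an existing cached chunk.
        for i, (cap, used) in enumerate(self.chunks):
            if cap - used >= size:
                self.chunks[i] = (cap, used + size)
                return True
        # Otherwise request a new contiguous chunk from the device.
        if self.device_free >= size:
            self.device_free -= size
            self.chunks.append((size, size))
            return True
        return False  # no contiguous region large enough anywhere

pool = ToyPool(DEVICE_TOTAL)
assert pool.alloc(1)        # method A: 1 GiB output
assert pool.alloc(4)        # method A: 4 GiB temporary
# A returns; the temporary goes back *into the pool*, not to the device:
pool.chunks[1] = (4, 0)
assert pool.alloc(1)        # method B: output carved from the cached 4 GiB chunk
assert pool.alloc(1)        # A again: output fits in the cached chunk...
assert not pool.alloc(4)    # ...but the 4 GiB temporary fails: only 2 GiB
                            # left in the chunk, 3 GiB on the device.
```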
Possible solutions
A: Avoid pool fragmentation on a case-by-case basis
By reorganizing the code, or by invoking free_all_blocks on the global pool, these problems can be solved one by one, as was done in #666. However, this does not guarantee that new cases will not be introduced unexpectedly. Also, free_all_blocks can introduce an execution bottleneck and reduce performance.
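As a sketch of option A: the call below is CuPy's real API for returning unused cached chunks to the device, but the wrapper name is made up, and it only does anything useful on a machine with CuPy and a CUDA device:

```python
import importlib.util

def release_cached_blocks():
    """Return the pool's unused cached chunks to the device.

    Illustrative helper (the function name is made up). Calling this between
    pipeline methods trades allocation speed for a defragmented device.
    Returns False when CuPy is not installed.
    """
    if importlib.util.find_spec("cupy") is None:
        return False
    import cupy
    # Frees only the chunks with no live arrays in them; held outputs survive.
    cupy.get_default_memory_pool().free_all_blocks()
    return True
```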
B: Make sure that auxiliary data is loaded to the GPU only on demand
A more general, preventive solution is to make sure that auxiliary data, such as darks and flats, is only loaded to the device when the next method requires it. After the method returns, the auxiliary data would be freed from the device and kept in host memory until it is needed again. This would guarantee that no unneeded data occupies the GPU and causes a method to run out of memory.
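Option B could be shaped as a context manager that owns the host copy and materializes a device copy only inside the with-block. The sketch below simulates the transfer with plain Python objects (the class name is made up; in the real pipeline the transfer would be a cupy.asarray call and the release would drop the device array):

```python
class OnDemandAuxData:
    """Keeps auxiliary data (e.g. darks/flats) on the host and exposes a
    device copy only for the duration of a method's execution.
    Hypothetical sketch; the device transfer is simulated here."""

    def __init__(self, host_data):
        self.host_data = host_data   # always resident on the host
        self.device_data = None      # resident on the GPU only on demand

    def __enter__(self):
        # Real code would do: self.device_data = cupy.asarray(self.host_data)
        self.device_data = list(self.host_data)  # simulated host->device copy
        return self.device_data

    def __exit__(self, *exc):
        # Drop the device copy so its backing chunk can be reused later.
        self.device_data = None
        return False

darks = OnDemandAuxData([0.1, 0.2, 0.3])
with darks as d:
    result = sum(d)               # the method runs with the device copy
assert darks.device_data is None  # freed as soon as the method returns
```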
C: Isolate the method's execution by using a separate memory pool
Even if the auxiliary data is loaded only on demand, the output of one method can still hold a backing chunk in the pool that is larger than itself, thereby fragmenting the memory. A preventive solution would be to allocate the method's inputs/outputs up front and use a separate memory pool for each method in the pipeline, released after the method returns. This would require more changes to the program structure, but would also enable certain optimizations, such as pre-allocating memory based on the memory estimator.
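Option C could be sketched with CuPy's per-pool allocator API (cupy.cuda.MemoryPool and cupy.cuda.using_allocator are real CuPy APIs, but the wrapper name is made up, and a CUDA device is needed for it to do real work):

```python
import importlib.util

def run_isolated(method, *args):
    """Run `method` with its own private memory pool, then release the pool.

    Illustrative wrapper (the name is made up). As described above, the
    method's inputs/outputs would be allocated up front, outside the private
    pool, so they survive its release; only temporaries land in the pool.
    Returns None when CuPy is not installed.
    """
    if importlib.util.find_spec("cupy") is None:
        return None
    import cupy
    pool = cupy.cuda.MemoryPool()
    with cupy.cuda.using_allocator(pool.malloc):
        out = method(*args)   # temporaries are served from the private pool
    pool.free_all_blocks()    # scratch memory goes straight back to the device
    return out
```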