GPU memory error doing inference on GPU with some version combination #77
The key to this error is `RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[5,192,512,512]`. A tensor of this size is not requested at any point in the network architecture, which suggests that some memory leakage is occurring in the inference for loop across batches.
For large datasets, the expected use of the `.predict` function in TensorFlow is to feed in the entire dataset and let it loop through the data internally, creating its own batches. However, since our datasets can be exceedingly large (60 GB or more), we can't rely on everyone having 100 GB or more of RAM just to run inference, so I initially broke inference down into batches. See here for the function call:
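As a rough illustration, a manual batching loop of that kind might look like the sketch below. This is only a minimal sketch: the function name, the `model` and `volume` variables, and the batch size are placeholders for illustration, not the project's actual code.

```python
import numpy as np

def predict_in_batches(model, volume, batch_size=4):
    """Run inference batch by batch instead of passing the whole array to .predict."""
    outputs = []
    for start in range(0, volume.shape[0], batch_size):
        batch = volume[start:start + batch_size]
        # model.predict sets up its own internal data pipeline on every call,
        # which is where the per-iteration memory growth was observed.
        outputs.append(model.predict(batch))
    return np.concatenate(outputs, axis=0)
```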
It turns out TensorFlow now has a `predict_on_batch` function, which is intended for exactly these cases. I found that simply dropping this function in at the exact line mentioned above, with TensorFlow 2.7, removes this memory leak error.
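For illustration, the same hypothetical loop with the swap applied could look like this; only the change from `predict` to `predict_on_batch` is the point, the surrounding names are again placeholders:

```python
import numpy as np

def predict_in_batches_fixed(model, volume, batch_size=4):
    """Same batching loop, but predict_on_batch runs the model directly on the
    given batch without rebuilding an internal dataset on every iteration."""
    outputs = []
    for start in range(0, volume.shape[0], batch_size):
        batch = volume[start:start + batch_size]
        outputs.append(model.predict_on_batch(batch))
    return np.concatenate(outputs, axis=0)
```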
I will make a PR with this fix and merge it into the main branch. A package release should follow once it proves to be compatible with older TensorFlow versions.
I ran into the same issue (Windows 10, Python 3.7, TensorFlow 2.4.4). My GPU is not great, though, and could be the actual culprit. When I changed to `predict_on_batch` I got a different error. I don't know whether it is related:

2021-12-07 01:04:37.560685: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
Function call stack:
2021-12-07 01:14:23.500441: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
That error seems to be related to something else; it looks like CUDA has trouble initializing the GPU at all. It may be worth opening a separate issue.
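One quick way to check whether TensorFlow can see and initialize the GPU at all, independent of this project's code, is the standard device listing call:

```python
import tensorflow as tf

# If this prints an empty list, the problem is in the CUDA/driver setup
# rather than in the inference loop.
print(tf.config.list_physical_devices('GPU'))
```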
Might be, sorry if that's not helpful.
Doing long inference with TensorFlow 2.7 and Python 3.9 can cause GPU out-of-memory errors like so: