Unload Models on GPU on Run #93
@rohanakut please add anything missing and assign this to whoever is best suited for it.
For now, using smaller models seems to be the only option for reducing the memory requirements, and as mentioned previously, this would come at the cost of accuracy. For example, the performance table on the following page compares the accuracy of the (smaller) MobileNet models with other, larger models: https://github.com/CSAILVision/semantic-segmentation-pytorch.
So you're saying that unloading after each run won't free enough memory? Most of these models won't be running concurrently, at least for now.
Unloading these models will free up the GPU memory, but it will increase the latency of generating outputs for future queries. Testing that latency would make sense once the models are integrated into the general pipeline of the project; the standalone latency of these models has already been tested and is reported at the link in my previous message.
The semseg and object detection models are already integrated into the pipeline. Which models are currently blocking?
Also assigning to @SiddharthRaoA based on today's tech-arch meeting, since he has experience doing this. Again, this is probably a solution in the "bandaid" category: in production, with simultaneous requests from different users, the models will probably have to stay loaded all the time. But this could be a long-term solution for test servers and local testing, so it is still useful even if it cannot be used in production.
I tried to reduce the memory consumption of the chart processor by loading the models only when they are needed and unloading them after processing.

(For context, the chart pipeline has 5 models: 1 chart-type classifier and 4 type-specific models. The line chart category has 2 type-specific models, while the bar and pie chart categories have 1 each, so line charts require the heaviest processing.)

The chart-type classifier is always kept loaded, while the type-specific models are loaded only when required. This ensures that a type-specific model occupies GPU memory only when the request contains a chart of that category; in the idle state, only the chart classifier is resident on the GPU. I tried the following 2 options for this:
Note: both methods clear the GPU cache after a model is done with it, to prevent PyTorch from holding on to the memory after use. Below are the processing time and memory usage for both options, tested on the line chart category (the category requiring the most processing time). These results were obtained on Unicorn; the models are bound to be much faster on Bach.
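The load-on-demand pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual chart-processor code: the class and loader names are hypothetical, and the only PyTorch-specific call assumed is `torch.cuda.empty_cache()` (guarded so the sketch also runs without a GPU).

```python
import gc

class LazyModelSlot:
    """Holds a loader for a type-specific model; materializes the model
    only when a matching request arrives, and releases it afterwards."""

    def __init__(self, loader):
        self._loader = loader   # callable returning a ready-to-use model
        self._model = None

    def acquire(self):
        # Load lazily on first use (e.g. load weights and move to "cuda").
        if self._model is None:
            self._model = self._loader()
        return self._model

    def release(self):
        # Drop our reference so Python can reclaim the object, then ask
        # PyTorch to return its cached GPU blocks to the driver.
        self._model = None
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # sketch still works in a CPU-only / torch-less environment
```

A request handler would then call `slot.acquire()` when its chart category matches, run inference, and call `slot.release()` before returning, while the always-loaded chart-type classifier is kept outside any slot.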
Assigning to @rianadutta for consideration. |
This is a possible approach to #85. To avoid constantly high memory use, models should free their allocated GPU memory after preprocessing. Since preprocessors currently don't run in parallel, this should avoid OOM problems. This is currently necessary for the following models:
The latency added by loading/unloading the models on request should be measured as well.
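As a starting point for that measurement, a small wall-clock helper could time the load and unload steps (a sketch; `torch.cuda.synchronize()` is called, when available, so that asynchronous CUDA work is included in the timing, and the helper degrades gracefully when PyTorch is absent):

```python
import time

def _cuda_sync():
    # Ensure pending GPU work is finished before reading the clock,
    # otherwise asynchronous CUDA ops would be missed by the timer.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    except ImportError:
        pass

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    _cuda_sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    _cuda_sync()
    return result, time.perf_counter() - start
```

Wrapping the model-load call and the unload call in `timed(...)` for each request would give the per-request latency overhead this approach adds.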