
Unload Models on GPU on Run #93

Open

JRegimbal opened this issue Aug 9, 2021 · 8 comments

@JRegimbal (Collaborator)

This is a possible approach to #85. To avoid constantly high memory use, models should free their allocated GPU memory after preprocessing. Since preprocessors currently don't run in parallel, this should avoid OOM problems. This is currently necessary for the following models:

  • Object Detection
  • Semantic Segmentation

The latency added by loading/unloading the models on request should be measured as well.
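
As a rough sketch of what "free the allocated GPU memory after preprocessing" could look like for a PyTorch-based preprocessor (the `build_model`/`run_preprocessor` names are hypothetical, not actual project code):

```python
import torch

def run_preprocessor(build_model, batch, device="cuda"):
    """Load a model onto the GPU, run it once, then release the GPU memory."""
    model = build_model().to(device)      # load weights onto the GPU on demand
    model.eval()
    with torch.no_grad():
        output = model(batch.to(device))  # single (non-parallel) preprocessing run
    output = output.cpu()                 # keep only a CPU copy of the result
    model.to("cpu")                       # move the weights off the GPU
    del model
    torch.cuda.empty_cache()              # release PyTorch's cached GPU blocks
    return output
```

The `del` plus `torch.cuda.empty_cache()` step matters because PyTorch's caching allocator otherwise keeps the freed blocks reserved, which is what shows up as used memory in `nvidia-smi`.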

@JRegimbal (Collaborator, Author)

@rohanakut please add anything missing and assign this to whoever is best suited for it.

@gp1702 (Contributor) commented Aug 12, 2021

For now, using smaller models seems to be the only option for reducing the memory requirements, and as mentioned previously, this would come at the cost of accuracy.

For example, the performance table on the following page compares the accuracy of the (smaller) MobileNet models with other, larger models: https://github.com/CSAILVision/semantic-segmentation-pytorch.

@JRegimbal (Collaborator, Author)

So you're saying that unloading after a run won't free enough memory? Most of these models won't be running concurrently, at least for now.

@gp1702 (Contributor) commented Aug 13, 2021

Unloading these models will free up GPU memory, but it will increase the latency of generating outputs for future queries. Testing the latency would make sense once the models are integrated into the general pipeline of the project; the standalone latency of these models has already been tested and is reported at the link in my previous message.
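
Once that integration point exists, a minimal timing sketch (assuming the preprocessors are ordinary PyTorch modules; the function name is illustrative) could wrap the whole load-run-unload cycle so the added latency is captured per request:

```python
import time
import torch

def timed_request(model, sample, device="cuda"):
    """Wall-clock time of one request, including the load/unload overhead."""
    start = time.perf_counter()
    model.to(device)                      # "load": move weights onto the GPU
    with torch.no_grad():
        _ = model(sample.to(device))
    torch.cuda.synchronize()              # make sure the GPU work has actually finished
    model.to("cpu")                       # "unload": move weights back to host RAM
    torch.cuda.empty_cache()
    return time.perf_counter() - start
```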

@JRegimbal (Collaborator, Author)

The semseg and object detection models are already integrated into the pipeline. What models are currently blocking?

@jeffbl (Member) commented Aug 16, 2021

Assigning also to @SiddharthRaoA based on the tech-arch meeting today, since he has experience doing this. Again, this probably falls in the "bandaid" category, since in production, with simultaneous requests from different users, models will probably have to be loaded all the time. But this could be a long-term solution for test servers and local testing, so it is still useful even if it cannot be used in production.

@SiddharthRaoA (Contributor)

I tried to reduce the memory consumption of the chart processor by loading the models only when they are needed, and unloading them after processing.

(For context, the chart pipeline has 5 models: 1 chart-type classifier and 4 type-specific models. The line chart category has 2 type-specific models, while the bar and pie chart categories have 1 each, so line charts require the heaviest processing.)

The chart-type classifier is always kept loaded, while the type-specific models are loaded only when required. This ensures that the type-specific models occupy GPU memory only when the request contains a chart of that category. In the idle state, only the chart classifier is on the GPU. I tried the following two options for this:

  1. The type-specific models reside on disk as .pt or .pkl files and are loaded from these files when needed. They don't take up any RAM or GPU memory when not being used. However, loading models from these files is quite slow, which increases the processing time.
  2. The type-specific models reside in RAM when not being used and are simply moved from RAM to the GPU when needed. This is much faster than loading from .pth or .pkl files, but also consumes more RAM. There also seems to be a memory leak issue with this option, where the RAM usage keeps building up until everything is exhausted. I haven't found a fix for this yet, but we'll need to find one soon if we're going ahead with this option.

Note: Both these methods clear the GPU cache after a model is done using it, to prevent PyTorch from holding on to the memory even after use.
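
A rough sketch of the two options, assuming the type-specific models are ordinary PyTorch modules saved as whole-model checkpoints (paths and function names are hypothetical):

```python
import torch

# Option 1: the model lives on disk as a checkpoint and is loaded straight to the GPU per request.
def run_from_disk(checkpoint_path, sample, device="cuda"):
    model = torch.load(checkpoint_path, map_location=device)  # slow: disk -> GPU every time
    model.eval()
    with torch.no_grad():
        out = model(sample.to(device)).cpu()
    del model
    torch.cuda.empty_cache()              # release the cached GPU blocks after use
    return out

# Option 2: the model stays resident in CPU RAM and is shuttled to the GPU per request.
def run_from_ram(model, sample, device="cuda"):
    model.to(device)                      # fast: host RAM -> GPU copy
    with torch.no_grad():
        out = model(sample.to(device)).cpu()
    model.to("cpu")                       # return the weights to host memory
    torch.cuda.empty_cache()
    return out
```

For option 2, one common source of steadily growing RAM is holding references to per-request outputs or intermediate tensors across requests, so checking that results are dropped after each call may help when chasing the leak described above.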

Below are the processing time and memory usage for both options, tested on the line chart category (the category requiring the most processing time). These results were obtained on Unicorn, and the models are bound to be much faster on Bach.

| Mode | Response time | RAM usage | Idle GPU usage | Peak GPU usage (less than a few ms) |
|------|---------------|-----------|----------------|-------------------------------------|
| 1    | 8-9 s         | 4 GB      | 1.5 GB         | 3.2 GB                              |
| 2    | 6 s           | ---       | 1.5 GB         | 3.2 GB                              |

@jeffbl (Member) commented Apr 21, 2022

Assigning to @rianadutta for consideration.
