
Memory not being returned to OS on calling C.LGBM_BoosterFree #6421


Description

@alzio2607

How we are using LightGBM:

We are using the LightGBM C API (c_api) in our model hosting service, which is written in Go. We've written a CGO wrapper around the C API and link against the "lib_lightgbm.so" library file provided on GitHub.
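For context, the wrapper looks roughly like the sketch below. This is a minimal illustration, not our exact code: package layout, build flags, and error handling are simplified, and the Go function names are ours.

```go
package lgbm

/*
#cgo LDFLAGS: -l_lightgbm
#include <stdlib.h>
#include <LightGBM/c_api.h>
*/
import "C"

import (
	"errors"
	"unsafe"
)

// Booster wraps a LightGBM BoosterHandle obtained from the C API.
type Booster struct {
	handle C.BoosterHandle
}

// LoadFromString creates a booster from a model string using
// LGBM_BoosterLoadModelFromString.
func LoadFromString(model string) (*Booster, error) {
	cModel := C.CString(model)
	defer C.free(unsafe.Pointer(cModel))

	var numIterations C.int
	var handle C.BoosterHandle
	if C.LGBM_BoosterLoadModelFromString(cModel, &numIterations, &handle) != 0 {
		return nil, errors.New(C.GoString(C.LGBM_GetLastError()))
	}
	return &Booster{handle: handle}, nil
}

// Free releases the underlying booster with LGBM_BoosterFree.
func (b *Booster) Free() error {
	if b.handle == nil {
		return nil
	}
	if C.LGBM_BoosterFree(b.handle) != 0 {
		return errors.New(C.GoString(C.LGBM_GetLastError()))
	}
	b.handle = nil
	return nil
}
```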

Version used:
3.1.1

Environment info:

Operating System: Observed on both Linux and MacOS

Architecture: x86_64

CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)

CMake version: 3.9.0

GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28

Context:

We load a few LightGBM models in our model hosting service and refresh them as soon as new versions are available. Models are loaded via LGBM_BoosterLoadModelFromString, and the older models are released via LGBM_BoosterFree.
We host this service on GKE pods that have a fixed amount of memory.
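The refresh path is essentially "load new, swap, free old". A minimal sketch, assuming the hypothetical wrapper above (the real service has more locking, error handling, and metrics around this):

```go
package lgbm

import "sync"

// ModelStore holds the currently served boosters, keyed by model name.
type ModelStore struct {
	mu       sync.Mutex
	boosters map[string]*Booster
}

// Refresh loads the newly published model string, swaps it in, and
// frees the previous booster via LGBM_BoosterFree.
func (s *ModelStore) Refresh(name, modelStr string) error {
	newBooster, err := LoadFromString(modelStr)
	if err != nil {
		return err
	}

	s.mu.Lock()
	if s.boosters == nil {
		s.boosters = make(map[string]*Booster)
	}
	old := s.boosters[name]
	s.boosters[name] = newBooster
	s.mu.Unlock()

	if old != nil {
		return old.Free()
	}
	return nil
}
```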

Issue:

We're seeing a gradual uptick in the RSS (Resident Set Size) of the service every time a model is refreshed. We measure RSS via Prometheus, which exposes process_resident_memory_bytes. This indicates that not all of the memory is freed when LGBM_BoosterFree is called. As a result, our service pods are going down due to OOM, pod lifetimes have shortened, and overall service health has taken a hit.
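For reference, the metric comes from the standard process collector in the Prometheus Go client; we do not compute RSS ourselves. Roughly (a minimal sketch, assuming github.com/prometheus/client_golang):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry already includes the process collector on Linux,
	// so process_resident_memory_bytes is exported without any extra code.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```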

To rule out the Go side of the code as the source of the growing RSS, we looked at the Go heap, and it returns to its pre-load value after a model is released.
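To compare the two numbers, we log the Go heap next to the process RSS, along these lines (a sketch; the VmRSS parsing is Linux-specific and the helper name is illustrative):

```go
package main

import (
	"log"
	"os"
	"runtime"
	"strings"
)

// reportMemory prints the Go heap (HeapAlloc from runtime.MemStats) next to
// the process RSS (VmRSS from /proc/self/status), which is roughly what
// process_resident_memory_bytes reports.
func reportMemory(label string) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	rss := "unknown"
	if data, err := os.ReadFile("/proc/self/status"); err == nil {
		for _, line := range strings.Split(string(data), "\n") {
			if strings.HasPrefix(line, "VmRSS:") {
				rss = strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:"))
				break
			}
		}
	}
	log.Printf("%s: go_heap_alloc=%d bytes, rss=%s", label, ms.HeapAlloc, rss)
}
```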

To further confirm that Go is not the issue, we ran an experiment that continuously frees and reloads the same model over and over. Nothing else is loaded in the service except the model file (no metadata or anything of that sort).
We observed a staircase pattern in the RSS metric:

[Screenshot: RSS metric showing a staircase pattern]

For this experiment, the heap looked like this:

[Screenshot: Go heap for the same experiment, returning to its pre-load value]

This confirms that something is going on in the C API when it frees the memory used by a model. I started digging into the code, but the memory appears to be managed appropriately using unique pointers and destructors.

Then I stripped the situation down to its bare minimum: load a model string (read from disk), record the RSS, release the model, and record the RSS again. I repeat this multiple times, forcing the GC to run after every action and waiting 20 seconds for the RSS to settle so that I get stable values. The results are a bit weird: in some iterations the RSS comes back down to the earlier value, while in others it does not, and the overall result is a gradual increase. A sketch of the loop is below, followed by the recorded values.
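One iteration of that loop, using the hypothetical wrapper and reportMemory helper sketched above (the labels match the output that follows; needs the runtime and time imports):

```go
// runIteration performs one load/release cycle of the experiment, forcing a
// GC and waiting 20 seconds after each step so the RSS can settle.
func runIteration(modelStr string) error {
	reportMemory("Initial")

	booster, err := LoadFromString(modelStr)
	if err != nil {
		return err
	}
	runtime.GC()
	time.Sleep(20 * time.Second)
	reportMemory("C_API Model Loaded")

	if err := booster.Free(); err != nil {
		return err
	}
	runtime.GC()
	time.Sleep(20 * time.Second)
	reportMemory("Model Released")
	return nil
}
```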

Initial: 1524
C_API Model Loaded: 4576
Model Released: 1988

Initial: 1988
C_API Model Loaded: 5659
Model Released: 4012

Initial: 4012
C_API Model Loaded: 7263
Model Released: 5509

Initial: 5509
C_API Model Loaded: 9209
Model Released: 7561

Initial: 7561
C_API Model Loaded: 9281
Model Released: 7603

Initial: 7603
C_API Model Loaded: 9336
Model Released: 7658

Initial: 7658
C_API Model Loaded: 11329
Model Released: 9681

Initial: 9681
C_API Model Loaded: 11372
Model Released: 9724

Initial: 9724
C_API Model Loaded: 13395
Model Released: 11748

Initial: 11748
C_API Model Loaded: 13476
Model Released: 11828

I am at a loss as to how to figure this out. Is it something about Go that I am missing?

Minimal Reproducible Example:

Attaching a minimal reproducible example. Since this example cannot be a simple snippet of code, I have linked it from my Git repo.
