Memory leak in LogisticRegression #28993

Closed
beijbom opened this issue May 10, 2024 · 10 comments
@beijbom

beijbom commented May 10, 2024

Describe the bug

repro

  • Repeatedly call LogisticRegression().fit(X, y) on a feature matrix of the same size, or run the attached repro script.

expected

  • The maximum memory allocated stays steady regardless of the number of training calls.

actual

  • Memory usage grows linearly with the number of calls, suggesting a memory leak.

notes

  • I'm using scikit-learn==1.2.2 and Python 3.10.11.
  • The repro steps work on macOS (M1 MacBook) and on an AWS t3.large running the "Amazon Linux" OS.
  • Based on this thread on Stack Overflow, I tried using SGDClassifier instead of LogisticRegression. The memory usage behaves as expected when using SGDClassifier. I therefore suspect this has to do with libsvm not releasing memory correctly after training completes.
  • This ticket seems related to Memory problem with SVM #217

Steps/Code to Reproduce

import tracemalloc
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier


def do_train(train_type: str) -> None:
    np.random.seed(42)
    X = np.random.rand(sample_count, feature_dim)
    y = np.random.randint(2, size=sample_count)
    if train_type == "LR":
        clf = LogisticRegression(max_iter=20)
    elif train_type == "SGD":
        clf = SGDClassifier(loss="log_loss", max_iter=20)
    clf.fit(X, y)


def run(reps, train_type):
    tracemalloc.start()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for _ in range(reps):
            do_train(train_type)

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps} | {train_type} | {base/ 1024**2:.2f} | {peak/ 1024**2:.2f} | {(peak-base)/ 1024**2:.2f}")


sample_count = 5000
feature_dim = 200

expected_memory_usage_features = sample_count * feature_dim * 8
expected_memory_usage_classifier = 2 * feature_dim * 8
expected_memory_usage_mb = (expected_memory_usage_features + expected_memory_usage_classifier) / 1024**2

if __name__ == "__main__":
    print(f"Expected memory usage: {expected_memory_usage_mb:.2f} MB")
    print("| # trainings | solver | base (MB) | peak (MB) | diff (MB) |")
    print("| ---- | ---- | ---- | ---- | ---- |")

    run(1, "LR")
    run(2, "LR")
    run(4, "LR")
    run(1, "SGD")
    run(2, "SGD")
    run(4, "SGD")

Expected Results

Expected memory usage: 7.63 MB

| # trainings | solver | base (MB) | peak (MB) | diff (MB) |
| ---- | ---- | ---- | ---- | ---- |
| 1 | LR | 0.02 | 7.83 | 7.81 |
| 2 | LR | 0.00 | 7.82 | 7.81 |
| 4 | LR | 0.00 | 7.82 | 7.81 |
| 1 | SGD | 0.02 | 7.83 | 7.81 |
| 2 | SGD | 0.00 | 7.82 | 7.81 |
| 4 | SGD | 0.00 | 7.82 | 7.81 |

Actual Results

Expected memory usage: 7.63 MB

| # trainings | solver | base (MB) | peak (MB) | diff (MB) |
| ---- | ---- | ---- | ---- | ---- |
| 1 | LR | 7.83 | 8.06 | 0.23 |
| 2 | LR | 15.45 | 15.69 | 0.24 |
| 4 | LR | 30.88 | 31.12 | 0.24 |
| 1 | SGD | 0.02 | 7.83 | 7.81 |
| 2 | SGD | 0.00 | 7.82 | 7.81 |
| 4 | SGD | 0.00 | 7.82 | 7.81 |

Versions

System:
    python: 3.10.11 (v3.10.11:7d4cc5aa85, Apr  4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/beijbom/.virtualenvs/trainer/bin/python
   machine: macOS-14.1.2-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 24.0
   setuptools: 68.0.0
        numpy: 1.25.1
        scipy: 1.11.1
       Cython: None
       pandas: 2.1.0
   matplotlib: None
       joblib: 1.3.1
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
beijbom added the Bug and Needs Triage labels on May 10, 2024
glemaitre removed the Needs Triage label on May 13, 2024
@glemaitre
Member

> I therefore suspect this has to do with libsvm not releasing memory correctly after training completes.

The culprit is not libsvm, because we don't use it here. But this is indeed in the lbfgs branch. I manually added a gc.collect() and the leak disappears. So we should check that we don't have an unnecessary cyclic reference in the lbfgs solver branch. For the other solvers, we don't have the issue.
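
For illustration, a minimal sketch of that workaround applied to the reporter's loop (it reuses do_train and the tracemalloc bookkeeping from the repro script above); with the extra gc.collect() after each fit, the base memory no longer grows with the number of repetitions:

import gc
import tracemalloc


def run_with_gc(reps, train_type):
    # Same loop as run() in the repro script, but force a collection after
    # every fit so that objects kept alive only by reference cycles are freed.
    tracemalloc.start()
    for _ in range(reps):
        do_train(train_type)
        gc.collect()
    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps} | {train_type} | {base / 1024**2:.2f} | {peak / 1024**2:.2f} |")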

@ogrisel
Member

ogrisel commented May 13, 2024

I confirm that I can reproduce the problem on macOS with the current state of the main branch, so this problem likely spans at least from scikit-learn 1.2.2 to the current development version.

However, I also confirm that inserting a manual call to gc.collect() after each call to fit in the for loop fixes the problem. So this is not a real memory leak, but rather delayed garbage collection. This can indeed be caused by cyclic references involving data structures that also hold references to large numpy arrays.

Since running gc.collect() can be quite expensive, we don't want to call it defensively in the scikit-learn code base itself. Instead we should identify which objects are involved in this reference cycle and either:

  • change the code to avoid creating unnecessary cycles in the first place (if they are not needed);
  • or change the code to break the cycle before the end of fit by calling del something.some_attribute or setting something.some_attribute = None, and then let the gc cheaply collect the cycle-free temporary data structures used in fit (a toy illustration follows below this list).
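
As a toy illustration of that mechanism (Holder is a made-up stand-in, not scikit-learn's actual internals): an object that holds a large numpy array and sits in a reference cycle is not reclaimed by reference counting when it goes out of scope, only once the cyclic collector runs, whereas breaking the cycle first frees it immediately:

import gc
import tracemalloc

import numpy as np


class Holder:
    """Toy stand-in for a temporary structure created during fit."""

    def __init__(self):
        self.data = np.ones((5000, 200))  # ~7.6 MB, same size as X in the repro
        self.cycle = self                 # reference cycle: Holder -> Holder


def work(break_cycle):
    h = Holder()
    if break_cycle:
        h.cycle = None  # break the cycle before h goes out of scope
    # without the cycle, reference counting frees h.data right here


gc.disable()  # keep the automatic collector out of the measurement
tracemalloc.start()

for _ in range(4):
    work(break_cycle=False)
print("with cycle, before collect:", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")
gc.collect()  # only now are the cyclic Holder instances and their arrays freed
print("with cycle, after collect :", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")

for _ in range(4):
    work(break_cycle=True)
print("cycle broken              :", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")

tracemalloc.stop()
gc.enable()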

I have not tried to diagnose the root cause of the problem yet, but it's possible that the following references might help for anyone with a desire to look into the problem:

@glemaitre
Member

> So this is not a real memory leak, but rather delayed garbage collection. This can indeed be caused by cyclic references involving data structures that also hold references to large numpy arrays.

Looking at the numbers when putting a gc.collect() before or after the call, I am almost sure that the issue happens when calling the optimize.minimize function. So my guess would be to look at the C function loss_gradient to make sure that we don't have something going on there.
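
A rough sketch of that kind of measurement, stripped down to a bare optimize.minimize call (loss and fit_once are made up for the illustration, not scikit-learn code): on a scipy version affected by the cycle, the memory still held after the call is roughly the size of the locally created X, and it only drops after gc.collect():

import gc
import tracemalloc

import numpy as np
from scipy import optimize


def loss(w, X, y):
    # placeholder objective; only the measurement pattern matters here
    return float(np.mean((X @ w - y) ** 2))


def fit_once():
    X = np.random.rand(5000, 200)  # created locally, like X inside fit
    y = np.random.rand(5000)
    optimize.minimize(loss, np.zeros(200), args=(X, y),
                      method="L-BFGS-B", options={"maxiter": 5})


tracemalloc.start()
fit_once()
held = tracemalloc.get_traced_memory()[0]
gc.collect()
left = tracemalloc.get_traced_memory()[0]
tracemalloc.stop()

print(f"held after the call: {held / 1024**2:.2f} MB")  # ~7.6 MB on affected versions
print(f"left after collect : {left / 1024**2:.2f} MB")  # close to 0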

@ogrisel
Member

ogrisel commented May 13, 2024

@Tialo
Contributor

Tialo commented May 22, 2024

I used this code to test loss_gradient:

import tracemalloc
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression


def do_train() -> None:
    np.random.seed(42)
    X = np.random.rand(2000, 3000)
    y = np.random.randint(2, size=2000)
    clf = LogisticRegression(max_iter=20)
    clf.fit(X, y)


def run(reps):
    tracemalloc.start()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for _ in range(reps):
            do_train()

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps:^6} | {base/ 1024**2:^9.2f} | {peak/ 1024**2:^9.2f} | {(peak-base)/ 1024**2:^9.2f} |")


if __name__ == "__main__":
    print("| # runs | base (MB) | peak (MB) | diff (MB) |")
    print("| ------ | --------- | --------- | --------- |")

    run(1)
    run(2)
    run(4)
    run(8)
    run(16)

I made this change here:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_logistic.py#L455

def f(*args, **kwargs):
    return 0.0
opt_res = optimize.minimize(
    # func,
    f,
    w0,
    method="L-BFGS-B",
    jac=False,
    args=(X, target, sample_weight, l2_reg_strength, n_threads),
    options={
        "maxiter": max_iter,
        "maxls": 50,  # default is 20
        "iprint": iprint,
        "gtol": tol,
        "ftol": 64 * np.finfo(float).eps,
    },
)

and the leak still exists:

| # runs | base (MB) | peak (MB) | diff (MB) |
| ------ | --------- | --------- | --------- |
|   1    |   45.94   |  114.90   |   68.96   |
|   2    |   91.69   |  160.66   |   68.97   |
|   4    |  183.38   |  252.35   |   68.97   |
|   8    |  366.76   |  435.73   |   68.97   |
|   16   |  320.91   |  481.58   |  160.67   |

I also tried this script in a new venv:

import tracemalloc

from scipy.optimize import minimize
import numpy as np


def do_something(*args, **kwargs):
    return np.random.rand()


def do_train():
    minimize(
        do_something,
        np.random.rand(300),
        args=(np.random.rand(2000, 3000),),  # one-element args tuple holding the big array
        method="L-BFGS-B"
    )


def run(reps):
    tracemalloc.start()
    for _ in range(reps):
        do_train()

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps:^6} | {base/ 1024**2:^9.2f} | {peak/ 1024**2:^9.2f} | {(peak-base)/ 1024**2:^9.2f} |")


if __name__ == "__main__":
    print("| # runs | base (MB) | peak (MB) | diff (MB) |")
    print("| ------ | --------- | --------- | --------- |")

    run(1)
    run(2)
    run(3)
    run(4)
    run(5)
    run(6)
    run(7)
    run(8)
    run(9)
    run(14)
    run(16)
    run(8)

The result is:

| # runs | base (MB) | peak (MB) | diff (MB) |
| ------ | --------- | --------- | --------- |
|   1    |   45.79   |   46.59   |   0.80    |
|   2    |   91.57   |   92.37   |   0.80    |
|   3    |  137.36   |  138.16   |   0.80    |
|   4    |  183.15   |  183.94   |   0.80    |
|   5    |  228.93   |  229.73   |   0.80    |
|   6    |  274.72   |  275.52   |   0.80    |
|   7    |   45.79   |  321.21   |  275.42   |
|   8    |  366.29   |  367.08   |   0.80    |
|   9    |  412.07   |  412.87   |   0.80    |
|   14   |  641.00   |  641.80   |   0.80    |
|   16   |  366.29   |  412.07   |   45.78   |
|   8    |  366.29   |  367.08   |   0.80    |

Looks like the problem occurs because of scipy.optimize.minimize.
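
A complementary way to confirm this (check_release and big are made-up names for the sketch): take a weak reference to the array passed via args and check whether it survives the calling scope; on the affected scipy versions it is only released once the cyclic garbage collector runs:

import gc
import weakref

import numpy as np
from scipy.optimize import minimize


def check_release():
    big = np.random.rand(2000, 3000)  # ~45.8 MB, created locally as in the script above
    minimize(lambda w, X: 0.0, np.random.rand(300), args=(big,), method="L-BFGS-B")
    return weakref.ref(big)  # a weak reference does not keep `big` alive by itself


ref = check_release()
print("array alive after return :", ref() is not None)  # True on affected scipy versions
gc.collect()
print("array alive after collect:", ref() is not None)  # False: only the cyclic collector freed it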

@lorentzenchr
Member

@Tialo Thanks for the further investigation. I propose to open an issue upstream in scipy and close this one here.
Note that I do not consider it a memory leak, just the way garbage collection works in combination with minimize/lbfgs.

@lorentzenchr
Member

I opened scipy/scipy#20768.

@andyfaff

FYI, we discovered that there were circular references in a helper class (ScalarFunction) that prevented instances of that class from being collected when the calling scope exited. There is a PR underway to remove those circular references, which should fix this issue.

@glemaitre
Member

Thanks @andyfaff for the heads-up.

@glemaitre
Member

A special "thank you" from my side, @Tialo, for crafting the minimal example that I did not have time to write on my first pass ;)
