Memory leak in LogisticRegression #28993

Closed
beijbom opened this issue May 10, 2024 · 10 comments
@beijbom

beijbom commented May 10, 2024

Describe the bug

repro

  • Repeatedly call LogisticRegression().fit(X, y) on a feature matrix of the same size, or run the attached repro script.

expected

  • The maximum memory allocated stays steady regardless of the number of training calls.

actual

  • Memory usage grows linearly with the number of calls, suggesting a memory leak.

notes

  • I'm using scikit-learn==1.2.2 and Python 3.10.11.
  • The repro steps work on macOS (M1 MacBook) and on an AWS t3.large running the "Amazon Linux" OS.
  • Based on this thread on Stack Overflow, I tried using SGDClassifier instead of LogisticRegression. The memory usage behaves as expected when using SGDClassifier. I therefore suspect this has to do with libsvm not releasing memory correctly after training completes.
  • This ticket seems related to Memory problem with SVM #217

Steps/Code to Reproduce

import tracemalloc
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier


def do_train(train_type: str) -> None:
    np.random.seed(42)
    X = np.random.rand(sample_count, feature_dim)
    y = np.random.randint(2, size=sample_count)
    if train_type == "LR":
        clf = LogisticRegression(max_iter=20)
    elif train_type == "SGD":
        clf = SGDClassifier(loss="log_loss", max_iter=20)
    clf.fit(X, y)


def run(reps, train_type):
    tracemalloc.start()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for _ in range(reps):
            do_train(train_type)

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps} | {train_type} | {base/ 1024**2:.2f} | {peak/ 1024**2:.2f} | {(peak-base)/ 1024**2:.2f}")


sample_count = 5000
feature_dim = 200

expected_memory_usage_features = sample_count * feature_dim * 8
expected_memory_usage_classifier = 2 * feature_dim * 8
expected_memory_usage_mb = (expected_memory_usage_features + expected_memory_usage_classifier) / 1024**2

if __name__ == "__main__":
    print(f"Expected memory usage: {expected_memory_usage_mb:.2f} MB")
    print("| # trainings | solver | base (MB) | peak (MB) | diff (MB) |")
    print("| ---- | ---- | ---- | ---- | ---- |")

    run(1, "LR")
    run(2, "LR")
    run(4, "LR")
    run(1, "SGD")
    run(2, "SGD")
    run(4, "SGD")

Expected Results

Expected memory usage: 7.63 MB

| # trainings | solver | base (MB) | peak (MB) | diff (MB) |
| ---- | ---- | ---- | ---- | ---- |
| 1 | LR | 0.02 | 7.83 | 7.81 |
| 2 | LR | 0.00 | 7.82 | 7.81 |
| 4 | LR | 0.00 | 7.82 | 7.81 |
| 1 | SGD | 0.02 | 7.83 | 7.81 |
| 2 | SGD | 0.00 | 7.82 | 7.81 |
| 4 | SGD | 0.00 | 7.82 | 7.81 |

Actual Results

Expected memory usage: 7.63 MB

| # trainings | solver | base (MB) | peak (MB) | diff (MB) |
| ---- | ---- | ---- | ---- | ---- |
| 1 | LR | 7.83 | 8.06 | 0.23 |
| 2 | LR | 15.45 | 15.69 | 0.24 |
| 4 | LR | 30.88 | 31.12 | 0.24 |
| 1 | SGD | 0.02 | 7.83 | 7.81 |
| 2 | SGD | 0.00 | 7.82 | 7.81 |
| 4 | SGD | 0.00 | 7.82 | 7.81 |

Versions

System:
    python: 3.10.11 (v3.10.11:7d4cc5aa85, Apr  4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/beijbom/.virtualenvs/trainer/bin/python
   machine: macOS-14.1.2-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 24.0
   setuptools: 68.0.0
        numpy: 1.25.1
        scipy: 1.11.1
       Cython: None
       pandas: 2.1.0
   matplotlib: None
       joblib: 1.3.1
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/beijbom/.virtualenvs/trainer/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
beijbom added the Bug and Needs Triage labels on May 10, 2024
glemaitre removed the Needs Triage label on May 13, 2024
@glemaitre
Member

> I therefore suspect this has to do with libsvm not releasing memory correctly after training completes.

The culprit is not libsvm, because we don't use it here. But this is indeed in the lbfgs branch. I manually added a gc.collect() and the leak disappears. So we should check that we don't have an unnecessary cyclic reference in the lbfgs solver branch. For the other solvers, we don't have the issue.
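
For illustration, a minimal sketch of that workaround applied to the reporter's loop (it reuses do_train and the tracemalloc bookkeeping from the repro script above); with the extra gc.collect() after each fit, the base memory no longer grows with the number of repetitions:

import gc
import tracemalloc


def run_with_gc(reps, train_type):
    # Same loop as run() in the repro script, but force a collection after
    # every fit so that objects kept alive only by reference cycles are freed.
    tracemalloc.start()
    for _ in range(reps):
        do_train(train_type)
        gc.collect()
    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps} | {train_type} | {base / 1024**2:.2f} | {peak / 1024**2:.2f} |")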

@ogrisel
Member

ogrisel commented May 13, 2024

I confirm that I can reproduce the problem on macOS with the current state of the main branch, so this problem likely spans at least from scikit-learn 1.2.2 to the current development version.

However, I also confirm that inserting a manual call to gc.collect() after each call to fit in the for loop fixes the problem. So this is not a real memory leak, but rather delayed garbage collection. This can indeed be caused by cyclic references involving data structures that also hold references to large numpy arrays.

Since running gc.collect() can be quite expensive, we don't want to call it defensively in the scikit-learn code base itself. Instead we should identify which objects are involved in this reference cycle and either:

  • change the code to avoid creating unnecessary cycles in the first place (if they are not needed);
  • or change the code to break the cycle before the end of fit by calling del something.some_attribute or setting something.some_attribute = None, and then let the gc cheaply collect the cycle-free temporary data structures used in fit (a toy illustration follows below this list).
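
As a toy illustration of that mechanism (Holder is a made-up stand-in, not scikit-learn's actual internals): an object that holds a large numpy array and sits in a reference cycle is not reclaimed by reference counting when it goes out of scope, only once the cyclic collector runs, whereas breaking the cycle first frees it immediately:

import gc
import tracemalloc

import numpy as np


class Holder:
    """Toy stand-in for a temporary structure created during fit."""

    def __init__(self):
        self.data = np.ones((5000, 200))  # ~7.6 MB, same size as X in the repro
        self.cycle = self                 # reference cycle: Holder -> Holder


def work(break_cycle):
    h = Holder()
    if break_cycle:
        h.cycle = None  # break the cycle before h goes out of scope
    # without the cycle, reference counting frees h.data right here


gc.disable()  # keep the automatic collector out of the measurement
tracemalloc.start()

for _ in range(4):
    work(break_cycle=False)
print("with cycle, before collect:", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")
gc.collect()  # only now are the cyclic Holder instances and their arrays freed
print("with cycle, after collect :", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")

for _ in range(4):
    work(break_cycle=True)
print("cycle broken              :", tracemalloc.get_traced_memory()[0] // 1024**2, "MB")

tracemalloc.stop()
gc.enable()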

I have not tried to diagnose the root cause of the problem yet, but it's possible that the following references might help for anyone with a desire to look into the problem:

@glemaitre
Member

> So this is not a real memory leak, but rather delayed garbage collection. This can indeed be caused by cyclic references involving data structures that also hold references to large numpy arrays.

Looking at the numbers when putting a gc.collect() before or after the call, I am almost sure that the issue happens when calling the optimize.minimize function. So my guess would be to look at the C function loss_gradient to make sure that we don't have something going on there.
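
A rough sketch of that kind of measurement, stripped down to a bare optimize.minimize call (loss and fit_once are made up for the illustration, not scikit-learn code): on a scipy version affected by the cycle, the memory still held after the call is roughly the size of the locally created X, and it only drops after gc.collect():

import gc
import tracemalloc

import numpy as np
from scipy import optimize


def loss(w, X, y):
    # placeholder objective; only the measurement pattern matters here
    return float(np.mean((X @ w - y) ** 2))


def fit_once():
    X = np.random.rand(5000, 200)  # created locally, like X inside fit
    y = np.random.rand(5000)
    optimize.minimize(loss, np.zeros(200), args=(X, y),
                      method="L-BFGS-B", options={"maxiter": 5})


tracemalloc.start()
fit_once()
held = tracemalloc.get_traced_memory()[0]
gc.collect()
left = tracemalloc.get_traced_memory()[0]
tracemalloc.stop()

print(f"held after the call: {held / 1024**2:.2f} MB")  # ~7.6 MB on affected versions
print(f"left after collect : {left / 1024**2:.2f} MB")  # close to 0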

@ogrisel
Member

ogrisel commented May 13, 2024

@Tialo
Contributor

Tialo commented May 22, 2024

I used this code to test loss_gradient:

import tracemalloc
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression


def do_train() -> None:
    np.random.seed(42)
    X = np.random.rand(2000, 3000)
    y = np.random.randint(2, size=2000)
    clf = LogisticRegression(max_iter=20)
    clf.fit(X, y)


def run(reps):
    tracemalloc.start()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for _ in range(reps):
            do_train()

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps:^6} | {base/ 1024**2:^9.2f} | {peak/ 1024**2:^9.2f} | {(peak-base)/ 1024**2:^9.2f} |")


if __name__ == "__main__":
    print("| # runs | base (MB) | peak (MB) | diff (MB) |")
    print("| ------ | --------- | --------- | --------- |")

    run(1)
    run(2)
    run(4)
    run(8)
    run(16)

I made this change here:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_logistic.py#L455

def f(*args, **kwargs):
    return 0.0
opt_res = optimize.minimize(
    # func,
    f,
    w0,
    method="L-BFGS-B",
    jac=False,
    args=(X, target, sample_weight, l2_reg_strength, n_threads),
    options={
        "maxiter": max_iter,
        "maxls": 50,  # default is 20
        "iprint": iprint,
        "gtol": tol,
        "ftol": 64 * np.finfo(float).eps,
    },
)

and the leak still exists:

| # runs | base (MB) | peak (MB) | diff (MB) |
| ------ | --------- | --------- | --------- |
|   1    |   45.94   |  114.90   |   68.96   |
|   2    |   91.69   |  160.66   |   68.97   |
|   4    |  183.38   |  252.35   |   68.97   |
|   8    |  366.76   |  435.73   |   68.97   |
|   16   |  320.91   |  481.58   |  160.67   |

I also tried this script in a new venv:

import tracemalloc

from scipy.optimize import minimize
import numpy as np


def do_something(*args, **kwargs):
    return np.random.rand()


def do_train():
    minimize(
        do_something,
        np.random.rand(300),
        args=(np.random.rand(2000, 3000),),  # one-element args tuple holding the big array
        method="L-BFGS-B"
    )


def run(reps):
    tracemalloc.start()
    for _ in range(reps):
        do_train()

    base, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"| {reps:^6} | {base/ 1024**2:^9.2f} | {peak/ 1024**2:^9.2f} | {(peak-base)/ 1024**2:^9.2f} |")


if __name__ == "__main__":
    print("| # runs | base (MB) | peak (MB) | diff (MB) |")
    print("| ------ | --------- | --------- | --------- |")

    run(1)
    run(2)
    run(3)
    run(4)
    run(5)
    run(6)
    run(7)
    run(8)
    run(9)
    run(14)
    run(16)
    run(8)

The result is:

| # runs | base (MB) | peak (MB) | diff (MB) |
| ------ | --------- | --------- | --------- |
|   1    |   45.79   |   46.59   |   0.80    |
|   2    |   91.57   |   92.37   |   0.80    |
|   3    |  137.36   |  138.16   |   0.80    |
|   4    |  183.15   |  183.94   |   0.80    |
|   5    |  228.93   |  229.73   |   0.80    |
|   6    |  274.72   |  275.52   |   0.80    |
|   7    |   45.79   |  321.21   |  275.42   |
|   8    |  366.29   |  367.08   |   0.80    |
|   9    |  412.07   |  412.87   |   0.80    |
|   14   |  641.00   |  641.80   |   0.80    |
|   16   |  366.29   |  412.07   |   45.78   |
|   8    |  366.29   |  367.08   |   0.80    |

Looks like the problem occurs because of scipy.optimize.minimize.
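
A complementary way to confirm this (check_release and big are made-up names for the sketch): take a weak reference to the array passed via args and check whether it survives the calling scope; on the affected scipy versions it is only released once the cyclic garbage collector runs:

import gc
import weakref

import numpy as np
from scipy.optimize import minimize


def check_release():
    big = np.random.rand(2000, 3000)  # ~45.8 MB, created locally as in the script above
    minimize(lambda w, X: 0.0, np.random.rand(300), args=(big,), method="L-BFGS-B")
    return weakref.ref(big)  # a weak reference does not keep `big` alive by itself


ref = check_release()
print("array alive after return :", ref() is not None)  # True on affected scipy versions
gc.collect()
print("array alive after collect:", ref() is not None)  # False: only the cyclic collector freed it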

@lorentzenchr
Member

@Tialo Thanks for the further investigation. I propose to open an issue upstream in scipy and close this one here.
Note that I do not consider it a memory leak, just the way garbage collection works in combination with minimize/lbfgs.

@lorentzenchr
Member

I opened scipy/scipy#20768.

@andyfaff

FYI, we discovered that there were circular references in a helper class (ScalarFunction) that prevented instances of that class from being collected when the calling scope exited. There is a PR underway to remove those circular references, which should fix this issue.

@glemaitre
Member

Thanks @andyfaff for the heads-up.

@glemaitre
Member

A special "thank you" from my side, @Tialo, for crafting the minimal example that I did not have time to write on my first pass ;)
