
[Memory Leak?] Memory usage keeps going up when generating superpoints, leading the system to kill the process #10

Closed
MEIXuYan opened this issue Mar 1, 2024 · 3 comments

MEIXuYan commented Mar 1, 2024

Thank you so much for open-sourcing such great work!
When I run the superpoint generation process with parallel_cut_pursuit, I notice that memory usage keeps going up as multiple files are processed; once it reaches the maximum memory capacity, the Python program is killed by the system.
This blocks me from training SSP (supervised superpoint, CVPR 2019) for more epochs after transplanting parallel_cut_pursuit as the optimization backend, even though training speeds up a lot.
Does parallel_cut_pursuit really have a memory leak? How can I fix it?

#--- error training log
...
...
Epoch 2/50 (results_partition/xx/nsp800_f1000):
 75%|█████████████████████████████████████████████▉               | 469/622 [28:23<11:04,  4.34s/it]
[1]    735656 killed     python supervised_partition/train.py --config
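As a possible workaround, I am considering running each file's partition in a short-lived worker process, so that whatever memory accumulates is returned to the OS when the worker exits. A minimal sketch (partition_one_file is a hypothetical helper wrapping the cut-pursuit call for a single file, not something that exists in the repo):

from multiprocessing import get_context

def partition_all(file_list, cfg):
    # maxtasksperchild=1 restarts the worker after every file, so any memory
    # it accumulated is released back to the OS between files.
    # partition_one_file must be defined at module level so it can be pickled.
    ctx = get_context("spawn")
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.starmap(partition_one_file, [(f, cfg) for f in file_list])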

Looking forward to your reply, thanks again~

1a7r0ch3 (Owner) commented Mar 2, 2024 via email

MEIXuYan (Author) commented Mar 3, 2024

Thank you for your reply. I checked the memory usage right before and after calling cut-pursuit, and it turns out that the leak is not within parallel_cut_pursuit. Here is the Python script and its output.

import psutil

#?--- print memory usage before cutp
process = psutil.Process()
memory_info = process.memory_full_info().rss / (1024 * 1024)
print(f"\ncurrent step memory usage (before cutp) {memory_info} MB")

#--- parallel cutpursuit 2019 - cp_kmpp_d0_dist
if cfg.pcp_type == 'cp_kmpp_d0_dist':
    # from sls_partition.pcutp_2019.python.wrappers.cp_kmpp_d0_dist import cp_kmpp_d0_dist
    from sls_partition.pcutp_2023.python.wrappers.cp_kmpp_d0_dist import cp_kmpp_d0_dist
    pred_in_component, x_c, pred_components, edges, times = cp_kmpp_d0_dist(
        1,
        ver_value,
        source_csr,
        target,
        edge_weights=edge_weights,
        vert_weights=node_size,
        coor_weights=coor_weights,
        min_comp_weight=cfg.cp_cutoff,
        cp_dif_tol=1e-2,
        cp_it_max=cfg.cp_iterations,
        split_damp_ratio=0.7,
        verbose=cfg.cp_verbose,
        max_num_threads=cfg.cp_num_threads,
        balance_parallel_split=True,
        compute_Time=True,
        compute_List=True,
        compute_Graph=True)
    #!--- free RAM
    del cp_kmpp_d0_dist
elif cfg.pcp_type == 'cp_d0_dist':
    #--- parallel cutpursuit 2024 - cp_d0_dist
    #- coor_weights = coor_weights | None
    from sls_partition.pcutp_2024.python.wrappers.cp_d0_dist import cp_d0_dist
    pred_in_component, x_c, pred_components, edges, times = cp_d0_dist(
        ver_value.shape[0],
        ver_value,
        source_csr,
        target,
        edge_weights=edge_weights,
        vert_weights=node_size,
        coor_weights=None,
        min_comp_weight=cfg.cp_cutoff,
        cp_dif_tol=1e-2,
        cp_it_max=cfg.cp_iterations,
        split_damp_ratio=0.7,
        verbose=cfg.cp_verbose,
        max_num_threads=cfg.cp_num_threads,
        balance_parallel_split=True,
        compute_Time=True,
        compute_List=True,
        compute_Graph=True)
    #!--- free RAM
    del cp_d0_dist
else:
    raise NotImplementedError('unknown pcutp type ' + cfg.pcp_type)

#?--- print memory usage after cutp
process = psutil.Process()
memory_info = process.memory_full_info().rss / (1024 * 1024)
print(f"current step memory usage (after cutp) {memory_info} MB")
...
...
current step memory usage (before collate) 1314.43359375 MB
current step memory usage (before train data load) 1330.3515625 MB
current step memory usage (before cutp) 1330.3515625 MB
current step memory usage (after cutp) 1332.97265625 MB
 11%|███████▏                                                        | 7/62 [00:04<00:37,  1.48it/s]
current step memory usage (before collate) 1334.26171875 MB
current step memory usage (before train data load) 1334.51953125 MB
current step memory usage (before cutp) 1334.51953125 MB
current step memory usage (after cutp) 1334.40625 MB
 13%|████████▎                                                       | 8/62 [00:05<00:34,  1.58it/s]
current step memory usage (before collate) 1335.6953125 MB
current step memory usage (before train data load) 1338.53125 MB
current step memory usage (before cutp) 1338.53125 MB
current step memory usage (after cutp) 1339.1015625 MB
...
...
current step memory usage (before collate) 1594.14453125 MB
current step memory usage (before train data load) 1594.14453125 MB
current step memory usage (before cutp) 1594.14453125 MB
current step memory usage (after cutp) 1594.4921875 MB
 55%|██████████████████████████████████▌                            | 34/62 [00:21<00:17,  1.58it/s]
current step memory usage (before collate) 1595.78125 MB
current step memory usage (before train data load) 1596.296875 MB
current step memory usage (before cutp) 1596.296875 MB
current step memory usage (after cutp) 1600.8046875 MB
...
...

To find what causes the memory leak, I further converted the SSP backbone into a semantic segmentation network (backbone + MLP with a BCE loss), and the leak disappeared. So I think the leak is caused by the loss-computing part of SSP, not by the cut-pursuit part (the C++ part).

Further, I tried to del every variable used in the loss-computing part, but the leak stays. So I think maybe the non-end-to-end character of the pipeline is why PyTorch cannot release memory properly.
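For what it's worth, a check along the following lines can show whether tensors from the loss computation survive across steps (the live_tensor_report helper is only illustrative, not part of SSP or parallel_cut_pursuit); a typical culprit is accumulating the running loss without .item()/.detach(), which keeps the whole autograd graph alive:

import gc
import torch

def live_tensor_report():
    # Count live torch tensors and their total size; call this once per
    # training step to see whether the count keeps growing.
    count, total_bytes = 0, 0
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            count += 1
            total_bytes += obj.element_size() * obj.nelement()
    print(f"live tensors: {count}, total {total_bytes / (1024 * 1024):.1f} MB")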

Thank you again for your quick reply!

1a7r0ch3 (Owner) commented Mar 3, 2024 via email
