
[Memory Leak?] Memory usage keeps going up when generating superpoints, leading the system to kill the process #10

Closed
MEIXuYan opened this issue Mar 1, 2024 · 3 comments

MEIXuYan commented Mar 1, 2024

Thank you so much for open-sourcing such great work!
When I run the superpoint generation process with parallel_cut_pursuit, I notice that memory usage keeps going up as multiple files are processed; once it reaches the maximum memory capacity, the Python program is killed by the system.
This blocks me from training SSP (supervised superpoint, CVPR 2019) for more epochs after transplanting parallel_cut_pursuit as the optimization backend, even though training speeds up a lot.
Does parallel_cut_pursuit really have a memory leak? How can I fix it?

#--- error training log
...
...
Epoch 2/50 (results_partition/xx/nsp800_f1000):
 75%|█████████████████████████████████████████████▉               | 469/622 [28:23<11:04,  4.34s/it]
[1]    735656 killed     python supervised_partition/train.py --config
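As a possible workaround, I am considering running each file's partition in a short-lived worker process, so that whatever memory accumulates is returned to the OS when the worker exits. A minimal sketch (partition_one_file is a hypothetical helper wrapping the cut-pursuit call for a single file, not something that exists in the repo):

from multiprocessing import get_context

def partition_all(file_list, cfg):
    # maxtasksperchild=1 restarts the worker after every file, so any memory
    # it accumulated is released back to the OS between files.
    # partition_one_file must be defined at module level so it can be pickled.
    ctx = get_context("spawn")
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.starmap(partition_one_file, [(f, cfg) for f in file_list])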

Looking forward to your reply, thanks again~

1a7r0ch3 (Owner) commented Mar 2, 2024 via email

MEIXuYan (Author) commented Mar 3, 2024

Thank you for your reply. I checked the memory usage right before and after calling cut-pursuit, and it turns out that the leak is not within parallel_cut_pursuit. Here is the Python script and its output.

import psutil

#?--- print memory usage before cutp
process = psutil.Process()
memory_info = process.memory_full_info().rss / (1024 * 1024)
print(f"\ncurrent step memory usage (before cutp) {memory_info} MB")

#--- parallel cutpursuit 2019 - cp_kmpp_d0_dist
if cfg.pcp_type == 'cp_kmpp_d0_dist':
    # from sls_partition.pcutp_2019.python.wrappers.cp_kmpp_d0_dist import cp_kmpp_d0_dist
    from sls_partition.pcutp_2023.python.wrappers.cp_kmpp_d0_dist import cp_kmpp_d0_dist
    pred_in_component, x_c, pred_components, edges, times = cp_kmpp_d0_dist(
        1,
        ver_value,
        source_csr,
        target,
        edge_weights=edge_weights,
        vert_weights=node_size,
        coor_weights=coor_weights,
        min_comp_weight=cfg.cp_cutoff,
        cp_dif_tol=1e-2,
        cp_it_max=cfg.cp_iterations,
        split_damp_ratio=0.7,
        verbose=cfg.cp_verbose,
        max_num_threads=cfg.cp_num_threads,
        balance_parallel_split=True,
        compute_Time=True,
        compute_List=True,
        compute_Graph=True)
    #!--- free RAM
    del cp_kmpp_d0_dist
elif cfg.pcp_type == 'cp_d0_dist':
    #--- parallel cutpursuit 2024 - cp_d0_dist
    #- coor_weights = coor_weights | None
    from sls_partition.pcutp_2024.python.wrappers.cp_d0_dist import cp_d0_dist
    pred_in_component, x_c, pred_components, edges, times = cp_d0_dist(
        ver_value.shape[0],
        ver_value,
        source_csr,
        target,
        edge_weights=edge_weights,
        vert_weights=node_size,
        coor_weights=None,
        min_comp_weight=cfg.cp_cutoff,
        cp_dif_tol=1e-2,
        cp_it_max=cfg.cp_iterations,
        split_damp_ratio=0.7,
        verbose=cfg.cp_verbose,
        max_num_threads=cfg.cp_num_threads,
        balance_parallel_split=True,
        compute_Time=True,
        compute_List=True,
        compute_Graph=True)
    #!--- free RAM
    del cp_d0_dist
else:
    raise NotImplementedError('unknown pcutp type ' + cfg.pcp_type)

#?--- print memory usage after cutp
process = psutil.Process()
memory_info = process.memory_full_info().rss / (1024 * 1024)
print(f"current step memory usage (after cutp) {memory_info} MB")
...
...
current step memory usage (before collate) 1314.43359375 MB
current step memory usage (before train data load) 1330.3515625 MB
current step memory usage (before cutp) 1330.3515625 MB
current step memory usage (after cutp) 1332.97265625 MB
 11%|███████▏                                                        | 7/62 [00:04<00:37,  1.48it/s]
current step memory usage (before collate) 1334.26171875 MB
current step memory usage (before train data load) 1334.51953125 MB
current step memory usage (before cutp) 1334.51953125 MB
current step memory usage (after cutp) 1334.40625 MB
 13%|████████▎                                                       | 8/62 [00:05<00:34,  1.58it/s]
current step memory usage (before collate) 1335.6953125 MB
current step memory usage (before train data load) 1338.53125 MB
current step memory usage (before cutp) 1338.53125 MB
current step memory usage (after cutp) 1339.1015625 MB
...
...
current step memory usage (before collate) 1594.14453125 MB
current step memory usage (before train data load) 1594.14453125 MB
current step memory usage (before cutp) 1594.14453125 MB
current step memory usage (after cutp) 1594.4921875 MB
 55%|██████████████████████████████████▌                            | 34/62 [00:21<00:17,  1.58it/s]
current step memory usage (before collate) 1595.78125 MB
current step memory usage (before train data load) 1596.296875 MB
current step memory usage (before cutp) 1596.296875 MB
current step memory usage (after cutp) 1600.8046875 MB
...
...

To find what causes the memory leak, I further converted the SSP backbone into a semantic segmentation network (backbone + MLP with a BCE loss), and the leak disappeared. So I think the leak is caused by the loss-computing part of SSP, not by the cut-pursuit part (the C++ part).

Further, I tried to del every variable used in the loss-computing part, but the leak stays. So I think maybe the non-end-to-end character of the pipeline is why PyTorch cannot release memory properly.
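For what it's worth, a check along the following lines can show whether tensors from the loss computation survive across steps (the live_tensor_report helper is only illustrative, not part of SSP or parallel_cut_pursuit); a typical culprit is accumulating the running loss without .item()/.detach(), which keeps the whole autograd graph alive:

import gc
import torch

def live_tensor_report():
    # Count live torch tensors and their total size; call this once per
    # training step to see whether the count keeps growing.
    count, total_bytes = 0, 0
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            count += 1
            total_bytes += obj.element_size() * obj.nelement()
    print(f"live tensors: {count}, total {total_bytes / (1024 * 1024):.1f} MB")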

Thank you again for your quick reply!

1a7r0ch3 (Owner) commented Mar 3, 2024 via email
