
OOM with reddit dataset #6

Open
xbxiong opened this issue Mar 28, 2024 · 1 comment


xbxiong commented Mar 28, 2024

Could you share details of the hardware you used in your experiments? I ran into an out-of-memory (OOM) error while running on the Reddit dataset. Is this expected?

(pyg) xiongxunbin@graph-1:~/SGDD$ python train_SGDD.py --dataset reddit --nlayers=2 --beta 0.1 --r=0.5 --gpu_id=0
WARNING:root:The OGB package is out of date. Your version is 1.3.5, while the latest version is 1.3.6.
Namespace(beta=0.1, dataset='reddit', debug=0, dis_metric='ours', dropout=0.0, ep_ratio=0.5, epochs=2000, gpu_id=0, hidden=256, ignr_epochs=400, inner=0, keep_ratio=1.0, lr_adj=0.0001, lr_feat=0.0001, lr_model=0.01, mode='disabled', nlayers=2, normalize_features=True, one_step=0, opt_scale=1e-10, option=0, outer=20, reduction_rate=0.5, save=1, seed=15, sgc=1, sinkhorn_iter=5, weight_decay=0.0)
adj_syn: (76966, 76966) feat_syn: torch.Size([76966, 602])
  0%|                                                                                                                                 | 0/2001 [03:00<?, ?it/s]
Traceback (most recent call last):
  File "train_SGDD.py", line 79, in <module>
    agent.train()
  File "/home/xiongxunbin/SGDD/SGDD_agent.py", line 218, in train
    adj_syn, opt_loss = IGNR(self.feat_syn, Lx=adj[random_nodes].to_dense()[:, random_nodes])
  File "/home/xiongxunbin/miniconda3/envs/pyg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xiongxunbin/SGDD/models/IGNR.py", line 93, in forward
    c = torch.cat([c[self.edge_index[0]],
RuntimeError: CUDA out of memory. Tried to allocate 44.14 GiB (GPU 0; 79.20 GiB total capacity; 69.65 GiB already allocated; 8.24 GiB free; 69.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
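
For what it's worth, the allocation that fails here appears to scale with the number of edges in the sampled subgraph: the torch.cat at models/IGNR.py line 93 gathers one hidden vector per edge endpoint. A rough back-of-the-envelope sketch (the edge count and the number of concatenated blocks are illustrative assumptions, not values read from this run; hidden=256 matches the Namespace above):

def cat_gib(num_edges, hidden=256, blocks=2, bytes_per_elem=4):
    """Approximate size in GiB of concatenating `blocks` edge-indexed feature matrices."""
    return num_edges * hidden * blocks * bytes_per_elem / 1024 ** 3

# ~20M sampled edges would already need tens of GiB for this single tensor
print(f"{cat_gib(20_000_000):.1f} GiB")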

@Suchun-sv (Collaborator) commented

Dear xbxiong,

Training on Reddit can indeed present challenges with our method. You might consider decreasing the mx_size to 2000 or 1000 and giving it another try. The issue at hand is likely due to the variability in the number of edges when sampling from Reddit, which could potentially be mitigated by altering the random seed. We're eager to learn about the outcomes of your experiments.
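
For illustration, here is a minimal sketch of the kind of change meant above, assuming mx_size caps the number of nodes sampled for the dense Lx block passed to IGNR. mx_size does not appear in the Namespace printed above, so depending on the repository version it may need to be set inside SGDD_agent.py rather than on the command line; all names and sizes below are illustrative, not taken from the codebase.

import torch

def sample_block(adj, mx_size):
    """Sample mx_size random nodes and return the dense adjacency block among them."""
    random_nodes = torch.randperm(adj.shape[0])[: min(mx_size, adj.shape[0])]
    return adj[random_nodes][:, random_nodes]

# toy dense adjacency standing in for the (much larger) Reddit graph
num_nodes = 5_000
adj = (torch.rand(num_nodes, num_nodes) < 1e-3).float()

Lx_small = sample_block(adj, mx_size=1000)  # 1000 x 1000 target instead of ~77k x 77k
print(Lx_small.shape, int(Lx_small.sum()), "edges in the sampled block")

Fewer sampled nodes means fewer sampled edges, which directly shrinks the edge-indexed tensor built inside IGNR's forward pass.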

Our compute resources are a cluster of mixed A100 and V100 GPUs.

Best regards
