
Error raised: samgraph/common/cpu/cpu_device.cc:39 Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: os call failed or operation not supported on this OS when running /gnn_lab/example/samgraph/train_gcn.py on papers100M dataset #15

Open
weihai-98 opened this issue Sep 18, 2023 · 6 comments

Comments

@weihai-98

Hey, I got an error "samgraph/common/cpu/cpu_device.cc:39 Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: os call failed or operation not supported on this OS" when running /gnn_lab/example/samgraph/train_gcn.py on the papers100M dataset. However, this script runs successfully on other datasets such as ogbn-products and Reddit. How can I fix it? Looking forward to your help. Thanks!

@weihai-98
Author

I fixed this bug by decreasing the cache_percentage from 0.21 to 0.001. However, I am curious how to achieve the cache ratio of 0.21 reported in your paper. Should I increase the pin-memory limit, or do something else?
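A minimal sketch for sanity-checking the pin-memory angle. It assumes the failing call in cpu_device.cc is a host-memory pinning operation (e.g. cudaHostAlloc/cudaHostRegister) subject to the locked-memory limit, and it assumes papers100M's usual shape of roughly 111M nodes with 128-dim float32 features; none of that is confirmed in this thread, so treat it only as a quick check, not a diagnosis.

```python
# Sketch only: estimate the pinned footprint and compare it with RLIMIT_MEMLOCK.
# Assumptions: ~111M nodes, 128-dim float32 features, and a pin-memory root cause.
import resource

NUM_NODES = 111_059_956   # ogbn-papers100M node count (assumed)
FEAT_DIM = 128            # feature dimension (assumed)
BYTES_PER_FEAT = 4        # float32

feature_bytes = NUM_NODES * FEAT_DIM * BYTES_PER_FEAT
cache_bytes = int(feature_bytes * 0.21)
print(f"full feature table: {feature_bytes / 2**30:.1f} GiB")
print(f"21% cache:          {cache_bytes / 2**30:.1f} GiB")

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"RLIMIT_MEMLOCK: soft={soft}, hard={hard}")
if soft != resource.RLIM_INFINITY and soft < feature_bytes:
    print("locked-memory limit is smaller than the feature table; "
          "try `ulimit -l unlimited` (or `--ulimit memlock=-1` for Docker) and rerun")
```

If the limit is already unlimited, the pin-memory theory is probably not the culprit and the call stack suggested later in the thread is the better lead.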

@molamooo
Contributor

Can you provide more information about your setup? E.g., the number of GPUs, GPU memory, batch size, the script and parameters you use, and how you generated the dataset.

@weihai-98
Author

Yeah, I use two RTX 3090 GPUs with 24GB of device memory each, one for sampling and one for training. The training batch size is 8000, the hidden dim is 64, and I generated the dataset with the code you provided (gnn_lab/utility/data-process/dataset/papers100M.ipynb). The script I run is /gnn_lab/example/samgraph/train_gcn.py.

@weihai-98
Author

My papers100M dataset was downloaded from https://snap.stanford.edu/ogb/data/nodeproppred/.

@molamooo
Contributor

Then 24GB of GPU memory should be enough for a 21% cache rate, since the entire feature table is around 55GB. We need more information to identify the root cause, e.g., the call stack when the error is raised. You could sleep for several seconds right after the trainer process launches, print its PID, and attach GDB to it.
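A minimal sketch of the suggested debugging hook. The placement (near the top of train_gcn.py, or wherever the trainer process starts) and the 30-second delay are arbitrary choices for illustration, not something the script already provides.

```python
# Debug hook sketch: print the trainer PID and pause so GDB can attach before the crash.
import os
import time

pid = os.getpid()
print(f"[trainer] pid={pid}; attach with: gdb -p {pid}", flush=True)
time.sleep(30)  # attach from another shell, set breakpoints, then `continue`
```

Once attached, a backtrace (`bt`) at the failed check should reveal which CUDA call in cpu_device.cc actually returns the error.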

@weihai-98
Author

Thanks! I will try to print more detailed information.
