
Problems about ninja #10

Open
Robot-2020 opened this issue Apr 25, 2022 · 5 comments
Robot-2020 commented Apr 25, 2022

Hi, I'm running into some problems when I run the code on Linux.
I'd really appreciate your help; this has been troubling me a lot.

15:43:32   Preprocess training set
15:43:36   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
15:43:36   Epoch 0 begin
Traceback (most recent call last):
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1666, in _run_ninja_build
    subprocess.run(
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "script/run.py", line 62, in <module>
    train_and_validate(cfg, solver)
  File "script/run.py", line 27, in train_and_validate
    solver.train(**kwargs)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/core/engine.py", line 143, in train
    loss, metric = model(batch)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/tasks/reasoning.py", line 85, in forward
    pred = self.predict(batch, all_loss, metric)
  File "/data1/home/wza/nbfnet/nbfnet/task.py", line 288, in predict
    pred = self.model(graph, h_index, t_index, r_index, all_loss=all_loss, metric=metric)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/home/wza/nbfnet/nbfnet/model.py", line 149, in forward
    output = self.bellmanford(graph, h_index[:, 0], r_index[:, 0])
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/utils/decorator.py", line 56, in wrapper
    return forward(self, *args, **kwargs)
  File "/data1/home/wza/nbfnet/nbfnet/model.py", line 115, in bellmanford
    hidden = layer(step_graph, layer_input)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/conv.py", line 91, in forward
    update = self.message_and_aggregate(graph, input)
  File "/data1/home/wza/nbfnet/nbfnet/layer.py", line 140, in message_and_aggregate
    sum = functional.generalized_rspmm(adjacency, relation_input, input, sum="add", mul=mul)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/spmm.py", line 378, in generalized_rspmm
    return Function.apply(sparse.coalesce(), relation, input)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/spmm.py", line 172, in forward
    forward = spmm.rspmm_add_mul_forward_cuda
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/utils/torch.py", line 27, in __getattr__
    return getattr(self.module, key)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/utils/decorator.py", line 21, in __get__
    result = self.func(obj)
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/utils/torch.py", line 31, in module
    return cpp_extension.load(self.name, self.sources, self.extra_cflags, self.extra_cuda_cflags,
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1080, in load
    return _jit_compile(
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1293, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1405, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1682, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'spmm': [1/3] /usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=spmm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/TH -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /data1/home/wza/.conda/envs/linkp/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -std=c++14 -c /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu -o rspmm.cuda.o
FAILED: rspmm.cuda.o
/usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=spmm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/TH -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /data1/home/wza/.conda/envs/linkp/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -std=c++14 -c /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu -o rspmm.cuda.o
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu: In instantiation of ‘at::rspmm_forward_cuda(const SparseTensor&, const at::Tensor&, const at::Tensor&)::<lambda()>::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]’:
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:246:600:   required from ‘struct at::rspmm_forward_cuda(const SparseTensor&, const at::Tensor&, const at::Tensor&)::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]::<lambda()>’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:246:608:   required from ‘at::rspmm_forward_cuda(const SparseTensor&, const at::Tensor&, const at::Tensor&)::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:246:607:   required from ‘struct at::rspmm_forward_cuda(const SparseTensor&, const at::Tensor&, const at::Tensor&) [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul; at::sparse::SparseTensor = at::Tensor]::<lambda()>’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:246:28:   required from ‘at::Tensor at::rspmm_forward_cuda(const SparseTensor&, const at::Tensor&, const at::Tensor&) [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul; at::sparse::SparseTensor = at::Tensor]’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:356:193:   required from here
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/rspmm.cu:244:37: internal compiler error: in tsubst_copy, at cp/pt.c:13189
     const int num_row_block = (num_row + row_per_block - 1) / row_per_block;
                                     ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
[2/3] /usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=spmm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/TH -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /data1/home/wza/.conda/envs/linkp/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -std=c++14 -c /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu -o spmm.cuda.o
FAILED: spmm.cuda.o
/usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=spmm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/TH -isystem /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /data1/home/wza/.conda/envs/linkp/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 -std=c++14 -c /data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu -o spmm.cuda.o
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu: In instantiation of ‘at::spmm_forward_cuda(const SparseTensor&, const at::Tensor&)::<lambda()>::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]’:
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:219:506:   required from ‘struct at::spmm_forward_cuda(const SparseTensor&, const at::Tensor&)::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]::<lambda()>’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:219:514:   required from ‘at::spmm_forward_cuda(const SparseTensor&, const at::Tensor&)::<lambda()> [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul]’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:219:512:   required from ‘struct at::spmm_forward_cuda(const SparseTensor&, const at::Tensor&) [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul; at::sparse::SparseTensor = at::Tensor]::<lambda()>’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:219:28:   required from ‘at::Tensor at::spmm_forward_cuda(const SparseTensor&, const at::Tensor&) [with NaryOp = at::NaryAdd; BinaryOp = at::BinaryMul; at::sparse::SparseTensor = at::Tensor]’
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:315:157:   required from here
/data1/home/wza/.conda/envs/linkp/lib/python3.8/site-packages/torchdrug/layers/functional/extension/spmm.cu:217:37: internal compiler error: in tsubst_copy, at cp/pt.c:13189
     const int num_row_block = (num_row + row_per_block - 1) / row_per_block;
                                     ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.
ninja: build stopped: subcommand failed.
@Robot-2020 (Author)

When I evaluate ogbl-biokg on the CPU, I get the following error:

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:75] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3917082528 bytes. Buy new RAM!

How can I resolve it?

KiddoZhu (Contributor) commented Apr 26, 2022

Hi! I haven't encountered the ninja issue you posted here. It looks like the build is failing with an internal compiler error from gcc?

For the memory error, it's trying to allocate around 4 GB of memory. I think that's expected behavior for large datasets like ogbl-biokg. Maybe your machine doesn't have enough memory?
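For reference, a quick sanity check (a minimal sketch) of the byte count the allocator reports:

```python
# The error reports the exact number of bytes the CPU allocator requested.
# Converting to gibibytes/gigabytes shows why a low-memory machine fails here.
bytes_requested = 3_917_082_528  # from the RuntimeError message

gib = bytes_requested / 2**30   # 1 GiB = 2**30 bytes
gb = bytes_requested / 10**9    # 1 GB  = 10**9 bytes

print(f"{gib:.2f} GiB  ({gb:.2f} GB)")  # → 3.65 GiB  (3.92 GB)
```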

@Robot-2020 (Author)

I'm wondering whether the ninja issue is caused by the gcc version.
Unfortunately, I'm just a student at a Chinese university and can only borrow the school's server, so I don't have permission to update gcc.
When I run the code on Windows instead, I can train the model successfully, but during testing it reports an out-of-memory error.
I've checked, and the machine still has about 10 GB of memory free, so I don't know why it reports this error.
By the way, how can I evaluate on only part of the test set to get a result?
Thank you so much.

@chandeler

I also ran into a ninja problem on a Windows 10 + Python 3.7 + PyTorch 1.8 machine.


nxznm commented Feb 23, 2023

@Robot-2020 Hi~ I recently ran into the same issue as yours, and I think it's caused by the gcc version.
The original gcc on my lab server is 5.3.
I used Anaconda to install gcc & g++ (version 7.3), and the issue is gone :)
Maybe you can refer to this website.
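For anyone hitting the same gcc 5 internal compiler error on a shared server, here is a sketch of the conda-based workaround (assuming a conda environment; the package names are the conda-forge ones, the version pin is illustrative, and the compiler binary paths vary by platform triplet):

```shell
# Install a newer gcc/g++ inside the conda env (no root access needed).
conda install -c conda-forge gcc_linux-64=7 gxx_linux-64=7

# PyTorch's torch.utils.cpp_extension honors the CC/CXX environment
# variables when JIT-compiling extensions; point them at the new toolchain.
# The exact binary name depends on the triplet conda installs.
export CC="$CONDA_PREFIX/bin/x86_64-conda_cos6-linux-gnu-gcc"
export CXX="$CONDA_PREFIX/bin/x86_64-conda_cos6-linux-gnu-g++"

# Remove the stale JIT build cache so the spmm extension recompiles cleanly.
rm -rf ~/.cache/torch_extensions
```

After this, rerunning `script/run.py` should rebuild the extension with the newer compiler instead of the system gcc 5.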
