
merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5

Open
Nishikata97 opened this issue Apr 6, 2022 · 3 comments

Comments

@Nishikata97

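Hello, I followed the environment versions given in README.md exactly, but I hit an error I cannot resolve. It is reported as an illegal memory access and may be a thread synchronization problem, because the code runs fine when I step through it in a debugger but fails with the following error when run directly: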
Train 0: 0%| | 0/358 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nishikata/Downloads/NCL-master/main.py", line 71, in <module>
    run_single_model(args)
  File "/home/nishikata/Downloads/NCL-master/main.py", line 45, in run_single_model
    train_data, valid_data, saved=True, show_progress=config['show_progress']
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 47, in fit
    train_loss = self._train_epoch(train_data, epoch_idx, show_progress=show_progress)
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 133, in _train_epoch
    losses = loss_func(interaction)
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 217, in calculate_loss
    user_all_embeddings, item_all_embeddings, embeddings_list = self.forward()
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 139, in forward
    all_embeddings = torch.sparse.mm(self.norm_adj_mat, all_embeddings)
  File "/home/nishikata/anaconda3/envs/NCL/lib/python3.7/site-packages/torch/sparse/__init__.py", line 84, in mm
    return torch._sparse_mm(mat1, mat2)
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Did you run into this problem while debugging the code? Thanks!
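A note for readers hitting the same symptom: CUDA kernels are launched asynchronously, so an illegal access is often reported at a later synchronization point rather than at the call that caused it, which is consistent with the code behaving differently under a debugger. The sketch below is not from this thread. It assumes PyTorch may or may not be importable, sets the standard `CUDA_LAUNCH_BLOCKING` environment variable before CUDA is initialized, and collects the version details that are useful to paste into a report. The helper name `describe_cuda_env` is hypothetical.

```python
import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# so that kernel launches run synchronously and the traceback points at the
# call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def describe_cuda_env():
    """Collect basic CUDA/PyTorch details for a bug report (hypothetical helper)."""
    info = {"CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING")}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report that and stop.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_env().items():
        print(f"{key}: {value}")
```

Running with synchronous launches is slower, so it is only meant for locating the failing call, not for training.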

@hyp1231
Member

hyp1231 commented Apr 6, 2022

Hello! Thanks for pointing this out. I have indeed run into this error. In my early experiments everything ran under cudatoolkit==10.1, but later, under cudatoolkit==11.3, the same error appeared.

Once we pin down the exact cause, we will update the code and reply in this issue. For now, we suggest testing under cudatoolkit==10.1.

@Nishikata97
Author

Nishikata97 commented Apr 11, 2022

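Hello, the existing environment runs without problems on a 2080 Ti. Because of lab constraints, our 3090 cannot use cudatoolkit versions earlier than 11.1. My initial suspicion is a version compatibility problem between cudatoolkit and faiss-gpu.
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install faiss-gpu cudatoolkit=11.3 -c pytorch
The commands above resolve the problem.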

Requirements

python==3.8.13
cudatoolkit==11.3.1
pytorch==1.11.0
faiss-gpu==1.7.2
recbole==1.0.1
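
For anyone comparing their own environment against these pins, here is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). It is an illustration, not part of the NCL repository. The distribution names `torch` and `faiss-gpu` are assumptions about how the packages register themselves, and a conda install may register them under different names. The `python` and `cudatoolkit` pins are conda-level and are not checked here. The function name `compare_versions` is hypothetical.

```python
from importlib import metadata

# Pins copied from the Requirements list; the keys are ASSUMED distribution
# names (e.g. the conda package "pytorch" usually registers as "torch").
PINNED = {"torch": "1.11.0", "faiss-gpu": "1.7.2", "recbole": "1.0.1"}


def compare_versions(pinned, lookup=metadata.version):
    """Return {name: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed, or registered under another name
        # Ignore a local build suffix such as "+cu113" when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, found, matches) in compare_versions(PINNED).items():
        status = "ok" if matches else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

A mismatch here only says the installed version differs from the pin; it does not by itself confirm the cause of the error.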

Below are the results on the Yelp dataset:

11 Apr 12:19    INFO  best valid : OrderedDict([('recall@10', 0.0901), ('recall@20', 0.1339), ('recall@50', 0.2138), ('ndcg@10', 0.0666), ('ndcg@20', 0.08), ('ndcg@50', 0.1013)])
11 Apr 12:19    INFO  test result: OrderedDict([('recall@10', 0.0912), ('recall@20', 0.1358), ('recall@50', 0.2171), ('ndcg@10', 0.0679), ('ndcg@20', 0.0815), ('ndcg@50', 0.103)])

@hyp1231
Member

hyp1231 commented Apr 11, 2022

Thanks for the feedback!

@hyp1231 hyp1231 mentioned this issue Jun 9, 2022