merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5

Nishikata97 · 2022-04-06T13:47:25Z

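Hello, I strictly followed the environment versions given in README.md, but I ran into an error I cannot resolve. It may be a thread synchronization problem: an illegal memory access is reported. Single-step debugging runs fine, but running the script directly produces the following error: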
Train 0: 0%| | 0/358 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nishikata/Downloads/NCL-master/main.py", line 71, in <module>
    run_single_model(args)
  File "/home/nishikata/Downloads/NCL-master/main.py", line 45, in run_single_model
    train_data, valid_data, saved=True, show_progress=config['show_progress']
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 47, in fit
    train_loss = self._train_epoch(train_data, epoch_idx, show_progress=show_progress)
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 133, in _train_epoch
    losses = loss_func(interaction)
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 217, in calculate_loss
    user_all_embeddings, item_all_embeddings, embeddings_list = self.forward()
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 139, in forward
    all_embeddings = torch.sparse.mm(self.norm_adj_mat, all_embeddings)
  File "/home/nishikata/anaconda3/envs/NCL/lib/python3.7/site-packages/torch/sparse/__init__.py", line 84, in mm
    return torch._sparse_mm(mat1, mat2)
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Have you run into this problem while debugging the code? Thanks.
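Because CUDA kernels launch asynchronously, the Python frame shown in a traceback like the one above is not necessarily where the illegal access happened, which would also be consistent with the error disappearing under single-step debugging. A minimal sketch for narrowing it down, assuming the standard `CUDA_LAUNCH_BLOCKING` environment variable; the helper name `enable_blocking_launches` is hypothetical and not part of NCL:

```python
import os


def enable_blocking_launches(env=None):
    """Request synchronous CUDA kernel launches for debugging.

    Must run before the first CUDA call (simplest: before importing torch).
    `enable_blocking_launches` is a hypothetical helper, not part of NCL.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    enable_blocking_launches()
    # Import torch and start training only after the variable is set, e.g.:
    #   import torch
    #   run_single_model(args)   # entry point seen in the traceback
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable in the shell before launching `main.py` has the same effect. Blocking launches slow training considerably, so this is only useful while debugging.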

hyp1231 · 2022-04-06T13:58:41Z

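Hello! Thanks for pointing this out. I have indeed run into this error. In my early experiments the code ran fine with cudatoolkit==10.1; later, with cudatoolkit==11.3, it raised the same error as this one.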

Once we find the specific cause, we will update the code and reply in this issue. For now, we suggest testing with cudatoolkit==10.1.

Nishikata97 · 2022-04-11T04:21:19Z

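Hello, the current environment runs on a 2080 Ti without problems. Due to lab constraints, our 3090 is not compatible with cudatoolkit versions earlier than 11.1. My initial suspicion is a version compatibility problem between cudatoolkit and faiss-gpu.
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install faiss-gpu cudatoolkit=11.3 -c pytorch
The commands above resolve the problem.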

Requirements

python==3.8.13
cudatoolkit==11.3.1
pytorch==1.11.0
faiss-gpu==1.7.2
recbole==1.0.1
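To confirm that an environment actually matches these pins, a small check along the following lines may help. This is a sketch under assumptions: it relies on the conda package `pytorch` registering as `torch` in Python metadata, it skips `python` and `cudatoolkit` because they are not Python distributions, and distribution names under conda may differ from the ones used here (a result of `None` only means "not found under this name"). `REQUIRED`, `installed_version`, and `find_mismatches` are illustrative names.

```python
from importlib import metadata

# Pins from the list above that are visible as Python distributions.
# Assumption: the conda package `pytorch` registers as `torch` in Python
# metadata; `cudatoolkit` is not a Python distribution and is skipped here.
REQUIRED = {
    "torch": "1.11.0",
    "faiss-gpu": "1.7.2",
    "recbole": "1.0.1",
}


def installed_version(name):
    """Return the installed version string, or None if not found."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Return {name: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        # Compare on prefix so build tags such as "1.11.0+cu113" still match.
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed.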

Below are the results on the Yelp dataset:

11 Apr 12:19    INFO  best valid : OrderedDict([('recall@10', 0.0901), ('recall@20', 0.1339), ('recall@50', 0.2138), ('ndcg@10', 0.0666), ('ndcg@20', 0.08), ('ndcg@50', 0.1013)])
11 Apr 12:19    INFO  test result: OrderedDict([('recall@10', 0.0912), ('recall@20', 0.1358), ('recall@50', 0.2171), ('ndcg@10', 0.0679), ('ndcg@20', 0.0815), ('ndcg@50', 0.103)])
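The 2080 Ti versus 3090 difference mentioned in this comment can be checked directly by comparing the GPU's compute capability with the architectures the installed PyTorch build was compiled for. A rough sketch, assuming PyTorch's `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`; the helper `arch_supported` is hypothetical, and it does an exact match only, ignoring any PTX JIT fallback:

```python
def arch_supported(capability, arch_list):
    """True if the build ships kernels for the GPU's compute capability.

    capability: (major, minor) tuple, e.g. (8, 6)
    arch_list:  strings such as "sm_75" or "sm_86"
    Exact match only; a PTX (compute_XX) fallback is not considered.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("CUDA version of this build:", torch.version.cuda)
        print("device capability:", cap, "supported:", arch_supported(cap, archs))
    else:
        print("PyTorch with a visible CUDA device is required for this check")
```

A `False` result would suggest the installed build predates the card, which is one situation where reinstalling with a newer cudatoolkit, as done above, is the usual remedy.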

hyp1231 · 2022-04-11T06:42:34Z

Thanks for the feedback!

hyp1231 mentioned this issue Jun 9, 2022

代码问题 ("Code problem") #13

Closed
