
Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex #323

Merged
mcarilli merged 1 commit into master from error_mismatch on May 22, 2019

Conversation

@mcarilli (Contributor) commented May 22, 2019

The warning message was too subtle and too easy to overlook in the output of setup.py; it really should be a hard error.

Making it a hard error should also assist with cuda driver version errors like #314. #314 resulted because the cuda driver (libcuda.so) version was 10.0, the cuda toolkit version used to compile the Pytorch binaries was 10.0 (which was fine), but the cuda toolkit version used to compile Apex was 10.1** (which triggered a PTX JIT compilation error at runtime because the 10.0 libcuda.so couldn't handle the PTX produced by the 10.1 nvcc). The PTX JIT compilation error message was cryptic and unhelpful.

However, if the toolkit version that was used to compile Pytorch binaries is too recent for the system's cuda driver version, Pytorch will raise a much more helpful error, something like

"AssertionError: 
The NVIDIA driver on your system is too old (found version 10000)..."

If we hard-enforce that the cuda toolkit version used to compile Apex == the cuda toolkit version used to compile Pytorch, we also ensure that if the toolkit version used to compile Apex is too new for the driver, the toolkit version used to compile Pytorch must also be too new for the driver, and therefore in such cases we will receive the helpful Pytorch error instead of the bizarre PTX JIT error.

**A warning about the mismatch between torch.version.cuda and the toolkit (nvcc) version had likely been issued by setup.py while compiling Apex, but it was overlooked, so what ended up surfacing was the PTX JIT error, which gave no clear indication of what had actually gone wrong.
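For reference, here is a minimal sketch of the check this PR turns into a hard error: setup.py asks the bare-metal nvcc for its version and compares it against torch.version.cuda. The nvcc output parsing below is approximate; the real code is check_cuda_torch_binary_vs_bare_metal in setup.py (excerpted in the diff below).

import subprocess
import torch

def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    # Toolkit version of the bare-metal nvcc that will compile the Apex extensions
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    output = raw_output.split()
    release = output[output.index("release") + 1]        # e.g. "10.1,"
    bare_metal_major = release.split(".")[0]
    bare_metal_minor = release.split(".")[1][0]

    # Toolkit version the installed Pytorch binaries were compiled with
    torch_binary_major = torch.version.cuda.split(".")[0]
    torch_binary_minor = torch.version.cuda.split(".")[1]

    print(raw_output + "from " + cuda_dir + "/bin\n")

    if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does "
                           "not match the version used to compile Pytorch binaries.  "
                           "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda))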

@mcarilli mcarilli changed the title Hard error on Pytorch Cuda + Cuda toolkit version mismatch Hard error on mismatch between torch.version.cuda and the Cuda toolkit version being used to compile Apex May 22, 2019
@mcarilli mcarilli merged commit 50689f6 into master May 22, 2019
@mcarilli mcarilli deleted the error_mismatch branch May 22, 2019 23:34
@@ -47,10 +47,9 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
print(raw_output + "from " + cuda_dir + "/bin\n")

if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):


Pytorch supports Cuda 9.1. I'm not sure that minor versions must also be compared.

@mcarilli (Contributor, Author):

True, and 9.2 as well. Nvidia also ships Docker containers where we build Pytorch from source with cuda 10.1. In practice, I do find that sometimes minor version mismatches can cause errors.


In my case it did not. I had to edit the file and remove this check.

@mcarilli (Contributor, Author):

I think it's good to keep this check for safety. I will note what you said in the error message, telling people they can comment it out if necessary, but at their own risk.

@proszx:

Maybe the condition (I mean the if check) could use other, more specific criteria rather than just the CUDA version.

Contributor:

@proszx I'm not sure what you mean by "other specific ones". Could you explain it a bit?


That helps me a lot. Thanks.

@stas00 (Contributor) commented Oct 17, 2022

This is outdated, here is an updated diff:

diff --git a/setup.py b/setup.py
index e3063be..ed88abd 100644
--- a/setup.py
+++ b/setup.py
@@ -30,6 +30,11 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_version != torch_binary_version):
+
+        # allow minor version mismatch
+        if bare_metal_version.major == torch_binary_version.major and bare_metal_version.minor != torch_binary_version.minor:
+            return
+
         raise RuntimeError(
             "Cuda extensions are being compiled with a version of Cuda that does "
             "not match the version used to compile Pytorch binaries.  "

@ptrblck, perhaps this could be added to setup.py in commented-out fashion so it's trivial for the user to activate? This is safer than an outright return, since it checks that the major versions match.

I have this patch pasted all over, since almost all projects I'm involved in have this mismatch and it works just fine. Most of the time we have no control over the environment provided by the system (e.g. HPC), so it's not by choice.

@stas00 (Contributor) commented Oct 17, 2022

Or even better, let's have an env var that activates this exception, so that the assert message could say: run

CUDA_MINOR_VERSION_MISMATCH_OK=1 python setup.py ...

so then the user doesn't have to change the source code at all. I'd be happy to submit a PR if it resonates.
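A rough sketch of what that escape hatch could look like, assuming the packaging-based version objects from the updated diff above (the env var name CUDA_MINOR_VERSION_MISMATCH_OK and the exact wording are just a proposal, not something apex implements today):

import os
from packaging.version import Version

def check_versions(bare_metal_version: Version, torch_binary_version: Version) -> None:
    if bare_metal_version == torch_binary_version:
        return
    # Same major, different minor: tolerate it only if the user opts in explicitly
    if (os.environ.get("CUDA_MINOR_VERSION_MISMATCH_OK") == "1"
            and bare_metal_version.major == torch_binary_version.major):
        print("Allowing CUDA minor-version mismatch ({} vs {}) because "
              "CUDA_MINOR_VERSION_MISMATCH_OK=1 is set.".format(bare_metal_version, torch_binary_version))
        return
    raise RuntimeError(
        "Cuda extensions are being compiled with a version of Cuda that does "
        "not match the version used to compile Pytorch binaries ({} vs {}). "
        "Re-run with CUDA_MINOR_VERSION_MISMATCH_OK=1 to allow a minor-version "
        "mismatch at your own risk.".format(bare_metal_version, torch_binary_version))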


I installed CUDA 10.2 and PyTorch 1.7.1 in my conda virtual env, but my Cuda compilation tools are 11.1; it didn't work for me.

@Yorkking commented Aug 5, 2019

Hi @mcarilli, I'm hitting this problem too.
My nvcc version is CUDA 10.1, but the latest PyTorch from the website is built with 10.0, so I downloaded that. Now this problem occurs on Windows. Could you help me?

RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  
Pytorch binaries were compiled with Cuda 10.0.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. 
 You can try commenting out this check (at your own risk).

Sorry, I just tried deleting the code that detects the version problem, and it seems to work!

@ptrblck (Contributor) commented Aug 5, 2019

@Yorkking This error is just raised as an additional check, as the version mismatch could raise some errors. If you don't encounter any issues, you could stick to your current setup.
However, the recommended way would be to install a matching CUDA version.

@Yorkking commented Aug 7, 2019

OK, I got it. Thanks very much! @ptrblck

@JoyHuYY1412:

@Yorkking This error is just raised as an additional check, as the version mismatch could raise some errors. If you don't encounter any issues, you could stick to your current setup.
However, the recommended way would be to install a matching CUDA version.

Hi, I have CUDA 8.0 and 9.0 on my machine, but the cuda symlink points to cuda-8.0. How can I use the path to CUDA 9.0 to install Apex? Thanks.

@ptrblck (Contributor) commented Aug 21, 2019

Could you try to export the cuda folder corresponding to the version you would like to use via e.g.

export CUDA_HOME=/usr/local/cuda-9.0

@Zacharias030 commented Sep 17, 2019

Could you try to export the cuda folder corresponding to the version you would like to use via e.g.

export CUDA_HOME=/usr/local/cuda-9.0

This was very helpful, thank you!

Why doesn't the apex installation script look for a matching version of cuda and gcc by itself? It was smart enough to send me here (and to require gcc-7 as gcc, which another symlink was able to solve).

@mcarilli (Contributor, Author):
Mismatched cuda and pytorch versions are a potential issue that's not specific to Apex. The underlying Pytorch extension builder is what's responsible for finding Cuda, and in general it's hard to know where to look on an arbitrary system: the cuda runtime is just a bunch of binaries and shared objects, so it could be present literally anywhere.
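If you want to check which toolkit the extension builder has found, and whether it matches the toolkit your Pytorch binary was compiled with, a quick sanity check before building Apex is something like:

import torch
from torch.utils.cpp_extension import CUDA_HOME

# Toolkit version the installed Pytorch binary was compiled with
print("torch.version.cuda =", torch.version.cuda)
# Toolkit directory the extension builder will use to compile Apex (None if nothing was found)
print("CUDA_HOME =", CUDA_HOME)

If CUDA_HOME points at the wrong installation, exporting CUDA_HOME as suggested above redirects the build.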

@zhuhui1214:
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

@ptrblck (Contributor) commented Dec 23, 2019

@zhuhui1214 This will not install the CUDA extensions and thus will run slower.
While this is a workaround for this issue, it's not a solution and you should e.g. make sure your local CUDA version matches the one shipped with PyTorch.

@GraphGrailAi:
I am still confused about how to solve this (I read the full thread): I'm not able to install Apex with full functionality (the Python-only install works fine), either on my PC or on Google Colab (CUDA 10.1).
The error on the PC when I try to use my code is:
No module named 'fused_layer_norm_cuda'

@cosen1024:
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

very useful!

@leungi mentioned this pull request May 22, 2020
@sofzh commented Jun 3, 2020

I'm hitting this error too. My CUDA version is 10.0.130 and my PyTorch version is 1.0.0.
The error is: "RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 9.0.176."

@ptrblck (Contributor) commented Jun 5, 2020

@sofzh
We recommend using the native amp implementation as described here: #818

Alternatively, you would have to either install the matching CUDA version (9.0) locally, build PyTorch from source using your current CUDA installation, or install a PyTorch binary with a matching CUDA version.

imsky added a commit to MeetElise/apex that referenced this pull request Jun 19, 2020
@griff4692:
My nvcc version is CUDA 11.0, but the latest PyTorch from the website is built with 10.2.

As a result I can't properly install Apex. Is it safe to comment out the exception, and where in the code do I comment?

@gauravghati commented Jul 1, 2020

Commenting it out, as said in #323 (comment), worked for me too.

# if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
#     raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
#                        "not match the version used to compile Pytorch binaries.  " +
#                        "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
#                        "In some cases, a minor-version mismatch will not cause later errors:  " +
#                        "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
#                        "You can try commenting out this check (at your own risk).")

Just comment out this if block in setup.py.

I have CUDA 10.1 and torch 1.5.1 and don't face any problems as such.

@MuruganR96:
My nvcc version is cuda 11.0, but I found the pytorch latest version from this website is 10.2

As a result I can't properly install apex. is it safe to comment out the exception and where in the code do I comment?

@griff4692 Have you resolved the issue? I'm facing the same problem.

Software Versions pre-installed:

Nvidia Driver: 450.51v
CUDA: 11v
cuDNN: 8.0v
Python: 3.8
Docker: 19.03.12v
Nvidia-docker: 2.0v
NGC(Nvidia GPU Cloud) CLI: 1.15.0v
Traceback (most recent call last):
  File "train.py", line 188, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 83, in train
    from apex import amp
ImportError: cannot import name 'amp'
pip uninstall apex
cd apex
rm -rf build 
python setup.py install --cuda_ext --cpp_ext 
Traceback (most recent call last):
  File "setup.py", line 152, in <module>
    check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
  File "setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal
    "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 9.0.176.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

Please help me to solve this issue @ptrblck sir. :)

@potipot commented Jan 24, 2021

To proceed with the installation you can edit the symbolic link under /usr/local.
In my case I already had different CUDA versions installed, so I just changed the symbolic link
from
from

cuda -> cuda-10.0/

to

cuda -> cuda-10.2/

and the installation proceeded. I guess you can achieve the same by setting the appropriate environment variable.

@vadimkantorov commented Feb 15, 2021

Is there a way to disable this check without modifying the code?

My current installation command is pip install cxxfilt && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex (my PyTorch is compiled with CUDA10.2, and machine has CUDA11.2)



    torch.__version__  = 1.7.1


    /tmp/pip-req-build-pd7maz9h/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Nov_30_19:08:53_PST_2020
    Cuda compilation tools, release 11.2, V11.2.67
    Build cuda_11.2.r11.2/compiler.29373293_0
    from /usr/local/cuda/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-pd7maz9h/setup.py", line 171, in <module>
        check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
      File "/tmp/pip-req-build-pd7maz9h/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 10.2.
    In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
    Running setup.py install for apex ... error

@vvuonghn commented Jul 7, 2021

In my case, I commented out these lines:

if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
    raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                       "not match the version used to compile Pytorch binaries.  " +
                       "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +
                       "In some cases, a minor-version mismatch will not cause later errors:  " +
                       "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  "
                       "You can try commenting out this check (at your own risk).")

It worked for me.

@AlexBlack2202:
do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

This works for me. Maybe the author should add this command to the README file.

@joeyslv commented Mar 16, 2022

do not use "python setup.py install --cuda_ext --cpp_ext"
use "python setup.py install" instead

This works for me. Maybe the author should add this command to the README file.

But it will report a "module not found" error when using the CUDA extensions.

@MHketbi commented May 3, 2023

This still exists in 2023; it shouldn't.
We are on CUDA 12.1.
