
ImportError - fast_transformers/causal_product undefined symbol - unable to train or finetune #6

Open · kevingreenman opened this issue Apr 20, 2023 · 16 comments



kevingreenman commented Apr 20, 2023

After downloading the data, I go to run bash run_finetune_h298.sh and get the following error:

Traceback (most recent call last):
  File "finetune_pubchem_light.py", line 14, in <module>
    from rotate_attention.rotate_builder import RotateEncoderBuilder as rotate_builder
  File "/home/kpg/molformer/finetune/rotate_attention/rotate_builder.py", line 3, in <module>
    from .attention_layer import RotateAttentionLayer
  File "/home/kpg/molformer/finetune/rotate_attention/attention_layer.py", line 8, in <module>
    from fast_transformers.attention import AttentionLayer
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/attention/__init__.py", line 13, in <module>
    from .causal_linear_attention import CausalLinearAttention
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/attention/causal_linear_attention.py", line 15, in <module>
    from ..causal_product import causal_dot_product
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
ImportError: /home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/causal_product/causal_product_cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c107complexIfEEEEPKNS_6detail12TypeMetaDataEv

I get a similar error when running bash run_pubchem_light.sh:

Traceback (most recent call last):
  File "train_pubchem_light.py", line 18, in <module>
    from fast_transformers.builders import TransformerEncoderBuilder
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/builders/__init__.py", line 42, in <module>
    from ..attention import \
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/attention/__init__.py", line 13, in <module>
    from .causal_linear_attention import CausalLinearAttention
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/attention/causal_linear_attention.py", line 15, in <module>
    from ..causal_product import causal_dot_product
  File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/causal_product/__init__.py", line 9, in <module>
    from .causal_product_cpu import causal_dot_product as causal_dot_product_cpu, \
ImportError: /home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/fast_transformers/causal_product/causal_product_cpu.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c107complexIfEEEEPKNS_6detail12TypeMetaDataEv

I set up my environment based on the instructions in environment.md as follows:

conda create --name MolTran_CUDA11 python=3.8.10
conda activate MolTran_CUDA11

conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge
conda install rdkit==2021.03.2 pandas=1.2.4 scikit-learn=0.24.2 scipy=1.6.3 -c conda-forge

pip install transformers==4.6.0 pytorch-lightning==1.1.5 pytorch-fast-transformers==0.4.0 datasets==1.6.2 jupyterlab==3.4.0 ipywidgets==7.7.0 bertviz==1.4.0

git clone https://github.com/NVIDIA/apex
cd apex
export CUDA_HOME='/usr'
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The differences between the above and the original instructions were:

  1. added -c conda-forge to the 2nd conda install command (it couldn't find the packages otherwise)
  2. export CUDA_HOME='/usr' (the actual location on my system, found using which nvcc, which gave the output /usr/bin/nvcc)
  3. changed the pytorch and cudatoolkit versions to match the nvcc version I have installed, which is 11.6 (compiling Apex failed otherwise). I used the oldest pytorch version that supported cudatoolkit=11.6 (based on instructions here) to maximize likelihood of compatibility since this repo was created using pytorch==1.7.1 cudatoolkit=11.0.
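Regarding point 3, the usual compatibility rule is that the driver's CUDA version (reported by nvidia-smi) must be greater than or equal to the toolkit version PyTorch was built against. A minimal sketch of that check (the helper name and the version strings are mine, purely illustrative):

```python
def driver_supports_toolkit(driver_cuda: str, toolkit_cuda: str) -> bool:
    """Illustrative helper: a CUDA driver can generally run binaries built
    with any toolkit whose major.minor version does not exceed the driver's."""
    def parse(v: str):
        major, minor = v.split(".")[:2]
        return (int(major), int(minor))
    return parse(driver_cuda) >= parse(toolkit_cuda)

# Driver reports CUDA 11.7 (nvidia-smi), PyTorch built against cudatoolkit 11.6:
print(driver_supports_toolkit("11.7", "11.6"))  # True
# The reverse pairing (driver 11.0, toolkit 11.6) would not be expected to work:
print(driver_supports_toolkit("11.0", "11.6"))  # False
```

So the driver on this machine is not the suspect here; the toolkit is within what the driver supports.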

Additional information that may be useful:

nvidia-smi: NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7

(MolTran_CUDA11) ~/molformer/finetune$ conda list | grep 'torch\|cuda'
cudatoolkit               11.6.0              hecad31d_10    conda-forge
ffmpeg                    4.3                  hf484d3e_0    pytorch
pytorch                   1.12.0          py3.8_cuda11.6_cudnn8.3.2_0    pytorch
pytorch-fast-transformers 0.4.0                    pypi_0    pypi
pytorch-lightning         1.1.5                    pypi_0    pypi
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.12.0               py38_cu116    pytorch
torchvision               0.13.0               py38_cu116    pytorch
(MolTran_CUDA11) ~/molformer/finetune$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

Based on similar errors people have gotten with other repos (e.g. here, here), it seems that the problem is related to my version of PyTorch, but I'm not sure how to resolve this while still allowing Apex to compile on my system. Is it possible to run this repo on a system using nvcc 11.6 / CUDA 11.7?
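For what it's worth, the mangled name in the ImportError can be decoded with c++filt (part of binutils) to see which libtorch symbol the prebuilt extension expects. It resolves to a caffe2::TypeMeta template instantiation, which is consistent with the extension having been compiled against a different PyTorch version than the one installed (the .so filename below is taken from the traceback above):

```shell
# Demangle the undefined symbol reported in the ImportError
echo '_ZN6caffe28TypeMeta21_typeMetaDataInstanceIN3c107complexIfEEEEPKNS_6detail12TypeMetaDataEv' | c++filt

# Optionally, list every symbol the extension expects the host process to
# provide, to see the full set of unresolved PyTorch symbols:
# nm -D --undefined-only causal_product_cpu.cpython-38-x86_64-linux-gnu.so | c++filt | grep caffe2
```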

@kevingreenman (Author)

Update:

I tried starting from scratch on a different machine that's running nvcc 11.0 / CUDA 11.0. So the only changes I had to make to the installation instructions were:

  1. added -c conda-forge to the 2nd conda install command (it couldn't find the packages otherwise)
  2. export CUDA_HOME='/usr/local/cuda-11.0' (the actual location on my system, found using which nvcc, which gave the output /usr/local/cuda-11.0/bin/nvcc)

But this time, the Apex compilation fails with the following error:

ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
      subprocess.run(
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/subprocess.py", line 516, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/kpg/apex/setup.py", line 762, in <module>
      setup(
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/command/install.py", line 68, in run
      return orig.install.run(self)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/install.py", line 698, in run
      self.run_command('build')
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/build.py", line 132, in run
      self.run_command(cmd_name)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
      self.build_extensions()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 670, in build_extensions
      build_ext.build_extensions(self)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 468, in build_extensions
      self._build_extensions_serial()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 494, in _build_extensions_serial
      self.build_extension(ext)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
      objects = self.compiler.compile(
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 491, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1250, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error

  × Running setup.py install for apex did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/kpg/miniconda3/envs/MolTran_CUDA11/bin/python -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

  __file__ = %r
  sys.argv[0] = __file__

  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"

  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/home/kpg/apex/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' --cpp_ext --cuda_ext install --record /tmp/pip-record-8z9sz80y/install-record.txt --single-version-externally-managed --compile --install-headers /home/kpg/miniconda3/envs/MolTran_CUDA11/include/python3.8/apex
  cwd: /home/kpg/apex/
  Running setup.py install for apex ... error
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> apex

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Here are the details for this system:
nvidia-smi: NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0

(MolTran_CUDA11) ~/apex$ conda list | grep 'torch\|cuda'
cudatoolkit               11.0.221             h6bb024c_0
pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
pytorch-fast-transformers 0.4.0                    pypi_0    pypi
pytorch-lightning         1.1.5                    pypi_0    pypi
(MolTran_CUDA11) ~/apex$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

Thank you in advance for any advice you can provide!

@philspence

Not sure if this will solve your issue, but I found that apex compilation worked when I switched to the 22.04-dev branch (git checkout 22.04-dev); I had loads of different issues compiling the latest branch.

@kevingreenman (Author)

Thanks for sharing, but unfortunately this didn't solve my problem. On the first machine I mentioned above, I was able to compile apex but then got an error while running the molformer code. On the second machine, I wasn't able to compile apex. After running git checkout 22.04-dev on both machines, I got similar errors to before. It looks like there are many other people having compilation issues with apex.

@kevingreenman (Author)

On the first machine referenced above, I used the following installation this time (replacing conda with mamba for speed):

conda create --name MolTran_CUDA11
conda activate MolTran_CUDA11

mamba install python==3.8.10
mamba install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge
mamba install rdkit==2021.03.2 pandas=1.2.4 scikit-learn=0.24.2 scipy=1.6.3 -c conda-forge

pip install transformers==4.6.0 pytorch-lightning==1.1.5 pytorch-fast-transformers==0.4.0 datasets==1.6.2 jupyterlab==3.4.0 ipywidgets==7.7.0 bertviz==1.4.0

git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
export CUDA_HOME='/usr'
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Everything in nvidia-smi and nvcc --version is identical to before. The only difference in conda list | grep 'torch\|cuda' is that I now have cudatoolkit 11.6.2 hfc3e2af_12 conda-forge instead of 11.6.0.

The error when running bash run_finetune_h298.sh is identical to before.


philspence commented Jul 25, 2023

It doesn't seem like the authors are replying to anyone so I'll try and help. Here is the env that works for me:

conda create molformer_env python==3.8.10 -y
conda activate molformer_env
conda config --add channels conda-forge

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch -y # this seems to be different version to yours
conda install rdkit==2021.03.2 pandas=1.2.4 scikit-learn=0.24.2 scipy=1.6.3 -y

pip install transformers==4.6.0 pytorch-lightning==1.1.5 pytorch-fast-transformers==0.4.0 datasets==1.6.2 jupyterlab==3.4.0 ipywidgets==7.7.0 bertviz==1.4.0 packaging

git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
export CUDA_HOME=/usr/local/cuda-11.0 # location of my cuda install
export TORCH_CUDA_ARCH_LIST="8.0"  # I'm compiling for use on A100s

python -m pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Looks like the torch version you used is different from mine. Hopefully this helps!

Edit: I used cuda 11.0 in the conda env & system install through nvidia repos

@kevingreenman (Author)

On the second machine, I ran:

conda create --name MolTran_CUDA11
conda activate MolTran_CUDA11

mamba install python==3.8.10
mamba install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge
mamba install rdkit==2021.03.2 pandas=1.2.4 scikit-learn=0.24.2 scipy=1.6.3 -c conda-forge

pip install transformers==4.6.0 pytorch-lightning==1.1.5 pytorch-fast-transformers==0.4.0 datasets==1.6.2 jupyterlab==3.4.0 ipywidgets==7.7.0 bertviz==1.4.0

git clone https://github.com/NVIDIA/apex
cd apex
git checkout 22.04-dev
export CUDA_HOME='/usr/local/cuda-11.0'
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

I've upgraded CUDA on the second machine since my last post, so now nvidia-smi gives NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1.

nvcc --version is the same as before.

$ conda list | grep 'torch\|cuda' now gives:

cudatoolkit               11.6.2              hfc3e2af_12    conda-forge
ffmpeg                    4.3                  hf484d3e_0    pytorch
pytorch                   1.12.0          py3.8_cuda11.6_cudnn8.3.2_0    pytorch
pytorch-fast-transformers 0.4.0                    pypi_0    pypi
pytorch-lightning         1.1.5                    pypi_0    pypi
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.12.0               py38_cu116    pytorch
torchvision               0.13.0               py38_cu116    pytorch

My new error is

Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/kpg/apex/setup.py", line 1, in <module>
      import torch
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/__init__.py", line 201, in <module>
      _load_global_deps()
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/__init__.py", line 154, in _load_global_deps
      ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    File "/home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/ctypes/__init__.py", line 373, in __init__
      self._handle = _dlopen(self._name, mode)
  OSError: /home/kpg/miniconda3/envs/MolTran_CUDA11/lib/python3.8/site-packages/torch/lib/../../../../libcublas.so.11: undefined symbol: cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/kpg/miniconda3/envs/MolTran_CUDA11/bin/python -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

  __file__ = %r
  sys.argv[0] = __file__

  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"

  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/home/kpg/apex/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' egg_info --egg-base /tmp/pip-pip-egg-info-czvr4cr3
  cwd: /home/kpg/apex/
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@kevingreenman (Author)

@philspence thanks for sharing your installation process. It doesn't look like the different version of pytorch fixes it for me. Can you confirm whether, after compiling apex successfully, you are also able to run bash run_finetune_h298.sh from the molformer/finetune/ directory without issues?

@philspence

I haven't run that particular benchmark before, but the other benchmarks completed without issue. I've started the h298 benchmark and it seems to be running okay (only one epoch has finished so far). Have you tried to replicate my env? I notice that my compile command uses the --no-build-isolation flag. It took me a while to get a working environment; I tried a lot of different things before I got this working.


kevingreenman commented Jul 25, 2023

Yeah, I tried to replicate yours, still wasn't able to get it working. It didn't work regardless of whether or not I used the --no-build-isolation flag. If the h298 benchmark started running, then it got further than it did for me, so no need to run it to completion.

@philspence

> Yeah, I tried to replicate yours, still wasn't able to get it working. It didn't work regardless of whether or not I used the --no-build-isolation flag. If the h298 benchmark started running, then it got further than it did for me, so no need to run it to completion.

Did you also change your system cuda version to 11.0.3?

Here is my conda list | grep 'torch\|cuda':

cudatoolkit               11.0.3              h7761cd4_12    conda-forge
pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
pytorch-fast-transformers 0.4.0                    pypi_0    pypi
pytorch-lightning         1.1.5                    pypi_0    pypi
torchaudio                0.7.2                      py38    pytorch
torchvision               0.8.2                py38_cu110    pytorch

And my system cuda install:

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-rhel7-11-0-local-11.0.3_450.51.06-1.x86_64.rpm
rpm -i cuda-repo-rhel7-11-0-local-11.0.3_450.51.06-1.x86_64.rpm
yum clean all
yum -y install cuda-toolkit-11-0

@kevingreenman (Author)

Unfortunately I can't change my system CUDA version because this machine is a group workstation, and changing the system version would be disruptive for the other users. But when I first encountered this problem, my system was on cuda 11.0, so I don't think that's the problem.


philspence commented Jul 25, 2023

Ah, in which case, install cudatoolkit-dev:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 cudatoolkit-dev=11.0 -c pytorch -c conda-forge

and set your cuda home to be the base of your conda env:
export CUDA_HOME=/path/to/conda/env # normally /home/user/conda/env/name_of_env

That worked for me on a different system. If that doesn't work for you then I'm out of ideas!


BlenderWang9487 commented Jul 31, 2023

I see that many people are getting frustrated because of cuda version issues with Apex.

My solution: if you only want to fine-tune the model, you can avoid using Apex entirely. I replaced the optimizer in the Python script in the finetune folder with one from the standard torch.optim module. At least, I was able to successfully execute the run_finetune_classification_hiv.sh script by modifying finetune/finetune_pubchem_light_classification.py:

...
# from apex import optimizers
from torch.optim import Adam
...

    def configure_optimizers(self):
        ...
        # optimizer = optimizers.FusedLAMB(optim_groups, lr=learning_rate, betas=betas)
        optimizer = Adam(optim_groups, lr=learning_rate, betas=betas)
        ...
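For anyone who wants to try that swap outside the Lightning module first, here is a self-contained sketch; the model, optim_groups, learning_rate, and betas values below are placeholders of my own, not values from the repo, and only the Adam call mirrors the change above:

```python
import torch
from torch.optim import Adam  # stands in for apex.optimizers.FusedLAMB

# Placeholder stand-ins for the names used in finetune_pubchem_light_classification.py
model = torch.nn.Linear(8, 2)
optim_groups = [{"params": model.parameters(), "weight_decay": 0.0}]
learning_rate, betas = 1e-4, (0.9, 0.999)

optimizer = Adam(optim_groups, lr=learning_rate, betas=betas)

# One dummy training step to confirm the optimizer is wired up correctly
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()
```

Note that Adam is not numerically equivalent to FusedLAMB (LAMB adds layer-wise trust-ratio scaling), so fine-tuning results may differ slightly from those reported with apex.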

I only installed these modules in my conda env:

conda create --name molformer python=3.8.10 -y
conda activate molformer

conda install \
    pytorch==1.7.1 \
    rdkit==2021.03.2 \
    pandas=1.2.4 \
    scikit-learn=0.24.2 \
    scipy=1.6.3 \
    -c pytorch -y

pip install \
    transformers==4.6.0 \
    pytorch-lightning==1.1.5 \
    pytorch-fast-transformers==0.4.0 \
    datasets==1.6.2 \
    jupyterlab==3.4.0 \
    ipywidgets==7.7.0 \
    bertviz==1.4.0

I'm running inside an Ubuntu 22.04 Docker container with an RTX 3090 GPU. It works for me; hope this helps.


Dondonn commented Sep 8, 2023

I'm able to extract the embeddings from frozen_embeddings_classification.ipynb

For people who are facing the same problem, I created the env with:

conda create -n molformer -y
conda activate molformer
conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch -y
python -m pip install pytorch-fast-transformers
conda install -c conda-forge pytorch-lightning=2.0.7 -y

python -m pip install rdkit datasets regex
python -m pip install jupyterlab

Then adopt the fix from @BlenderWang9487: change

from apex import optimizers

to

from torch.optim import Adam

def configure_optimizers(self):
...
# optimizer = optimizers.FusedLAMB(optim_groups, lr=learning_rate, betas=betas)
optimizer = Adam(optim_groups, lr=learning_rate, betas=betas)
...

I hope this also helps.


mwillia6 commented Oct 25, 2023

This solution to the "fast_transformers/causal_product undefined symbol" error worked for me. I used the install from BlenderWang9487 with the following additional steps:

  1. Did not install Apex.
  2. Replaced the optimizer with Adam in finetune/finetune_pubchem_light_classification.py.
  3. Fixed issues with the pytorch-fast-transformers==0.4.0 install: only some of the files were being installed. Check https://github.com/idiap/fast-transformers/tree/master/fast_transformers to see whether all files were included; the fast_transformers folder was missing files in my install. To fix this, I first ran pip install pytorch-fast-transformers==0.4.0, then cloned the fast-transformers repository and copied any missing files into ~/anaconda3/envs/molformer/lib/python3.8/site-packages/fast_transformers. Make sure the following files are present in the fast_transformers/causal_product folder, along with any other files missing relative to the GitHub repo:
     causal_product_cpu.cpp
     causal_product_cuda.cu
     causal_product_cpu.cpython-38-x86_64-linux-gnu.so
     This should fix the "fast_transformers/causal_product undefined symbol" error. There's probably a more elegant way to do this, but this is what worked for me.
  4. Included the following statements in my script:

export CUDA_HOME=/home/user/anaconda3/envs/molformer
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/anaconda3/envs/molformer/lib

Edit: jackl-o-o-l had a better solution (idiap/fast-transformers#125):

pip uninstall pytorch-fast-transformers
pip install --no-cache-dir pytorch-fast-transformers
