GPU-enabled run failed #203

haochenz96 · 2023-03-18T17:49:16Z

Hi,

Thanks for developing this tool!

I just ran SigProfilerExtractor with the GPU enabled, but it failed. The same torch/CUDA/GPU setup works for other programs, so I think this might be related to SigProfilerExtractor. Any help would be appreciated.

To replicate:
0. environment setup:

mamba create -n sig python==3.9 -y
mamba activate sig

# install torch
mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y

pip install -U pip

cd /home/zhangh5/work/software
git clone git@github.com:AlexandrovLab/SigProfilerMatrixGenerator
git clone git@github.com:AlexandrovLab/SigProfilerExtractor.git

pip install -e /home/zhangh5/work/software/SigProfilerMatrixGenerator
pip install -e /home/zhangh5/work/software/SigProfilerExtractor
python /home/zhangh5/work/bulk_analysis/mut_signature/setup/download_genome.py

1. Run the test with the BRCA dataset:

from SigProfilerExtractor import sigpro as sig
sig.sigProfilerExtractor(
    "vcf", 
    "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/test_results_VCF", 
    "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/21BRCA/21BRCA_vcf", 
    reference_genome="GRCh37", 
    minimum_signatures=1, 
    maximum_signatures=10, 
    nmf_replicates=100,
    gpu=True,
    )

Relevant logs are:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_device_name(0)
'NVIDIA A40'
>>> torch.cuda.get_device_name()
'NVIDIA A40'
>>> from SigProfilerExtractor import sigpro as sig
>>> sig.sigProfilerExtractor(
...     "vcf",
...     "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/test_results_VCF",
...     "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/21BRCA/21BRCA_vcf",
...     reference_genome="GRCh37",
...     minimum_signatures=1,
...     maximum_signatures=10,
...     nmf_replicates=100,
...     gpu=True,
...     )

************** Reported Current Memory Use: 0.37 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 10.17 seconds.
Matrices generated for 21 samples with 0 errors. Total of 183916 SNVs, 911 DINUCs, and 0 INDELs were successfully analyzed.
Extracting signature 1 for mutation type 96
The matrix normalizing cutoff is 11723


process 1 continues please wait...
execution time: 37 seconds

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/lila/data/iacobuzc/haochen/mambaforge/envs/sig/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/lila/data/iacobuzc/haochen/mambaforge/envs/sig/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/lila/data/iacobuzc/haochen/software/SigProfilerExtractor/SigProfilerExtractor/subroutines.py", line 385, in pnmf
    W, H, Conv = nmf_fn(g, totalProcesses, init=init, execution_parameters=execution_parameters, generator=rand_rng)
  File "/lila/data/iacobuzc/haochen/software/SigProfilerExtractor/SigProfilerExtractor/subroutines.py", line 296, in nnmf_gpu
    genomes = torch.from_numpy(genomes).float().cuda(gpu_id)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

"""

And the output of nvidia-smi is:

(sig) [ln10 20230318-13:41:34]$ nvidia-smi
Sat Mar 18 13:47:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:65:00.0 Off |                    0 |
|  0%   34C    P0    62W / 300W |      0MiB / 46068MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


mdbarnesUCSD · 2023-03-20T18:02:02Z

Hi @haochenz96,

From the error message, it looks like your GPU was busy or unavailable at the time of the crash, which could be due to a library installation issue. Could you please verify that your installation is correct?
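One thing that may be worth checking alongside the installation: in the nvidia-smi table above, the Compute M. column reads "E. Process", which is how nvidia-smi abbreviates the Exclusive_Process compute mode. In that mode only one process at a time can hold a CUDA context on the device, so a multiprocessing pool in which several workers call `.cuda()` could plausibly produce a "busy or unavailable" error. That is an assumption about the cause, not something confirmed in this thread. A minimal sketch for reading the compute mode from Python follows; the helper names `parse_compute_modes` and `query_compute_modes` are made up for illustration.

```python
import shutil
import subprocess


def parse_compute_modes(csv_text):
    """Return one compute-mode string per GPU from nvidia-smi CSV output."""
    return [line.strip() for line in csv_text.splitlines() if line.strip()]


def query_compute_modes():
    """Ask nvidia-smi for each GPU's compute mode.

    Returns None if nvidia-smi is not on PATH or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    modes = query_compute_modes()
    if modes is None:
        print("nvidia-smi not found or failed; cannot read the compute mode")
    else:
        for index, mode in enumerate(modes):
            # Assumption: an exclusive mode is what limits the GPU to one process.
            note = " (only one process may use this GPU)" if "exclusive" in mode.lower() else ""
            print(f"GPU {index}: {mode}{note}")
```

If the mode is exclusive, changing it normally requires administrator rights on the node, so on a shared cluster this is likely a question for the system administrators.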

haochenz96 · 2023-03-21T00:19:29Z

Hi @mdbarnesUCSD!

Thanks for your response. How can I verify that my installation is correct? Do you mean the pytorch-cuda driver?

mdbarnesUCSD · 2023-03-21T22:00:47Z

Yes, it would be good to verify that the pytorch-cuda driver was installed correctly. In particular, it would be great to confirm that you can initialize two matrices, move them to the GPU, and multiply them together without error. This code runs successfully on other NVIDIA GPUs (M60, V100, and A100), so I suspect there is an issue with your GPU's configuration.
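The check described above could look roughly like the sketch below. It is only an illustration: the function name `gpu_matmul_check`, the matrix size, and the tolerances are arbitrary choices, not part of SigProfilerExtractor. It also runs in a single process, whereas the crash in the traceback happened inside a multiprocessing worker, so a pass here does not rule out a problem that only appears with several processes.

```python
try:
    import torch
except ImportError:  # report the problem instead of crashing on import
    torch = None


def gpu_matmul_check(device_index=0, size=256):
    """Create two matrices, move them to the GPU, multiply them,
    and compare the result with the same product computed on the CPU."""
    if torch is None:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"

    device = torch.device(f"cuda:{device_index}")
    a = torch.rand(size, size)
    b = torch.rand(size, size)
    try:
        gpu_product = (a.to(device) @ b.to(device)).cpu()
    except RuntimeError as err:
        # CUDA failures such as "device(s) is/are busy or unavailable"
        # surface as RuntimeError in PyTorch.
        return f"cuda-error: {err}"

    # Loose tolerances: GPU and CPU may accumulate in a different order.
    if torch.allclose(gpu_product, a @ b, rtol=1e-2, atol=1e-2):
        return "ok"
    return "mismatch"


if __name__ == "__main__":
    print(gpu_matmul_check())
```

A result of "ok" suggests the PyTorch and CUDA installation can at least allocate memory and run a kernel on the device from one process.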

mdbarnesUCSD · 2023-03-31T17:02:16Z

Please reopen this issue if it persists.

marcos-diazg assigned mdbarnesUCSD Mar 21, 2023

mdbarnesUCSD closed this as completed Mar 31, 2023
