GPU-enabled run failed #203

Closed
haochenz96 opened this issue Mar 18, 2023 · 4 comments

haochenz96 commented Mar 18, 2023

Hi,

Thanks for developing this tool!

I just ran SigProfilerExtractor with the GPU enabled, but it failed. The same torch/CUDA/GPU setup works for other programs, so I think this might be related to SigProfilerExtractor. Any help would be appreciated.

To replicate:
0. environment setup:

mamba create -n sig python==3.9 -y
mamba activate sig

# install torch
mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y

pip install -U pip

cd /home/zhangh5/work/software
git clone git@github.com:AlexandrovLab/SigProfilerMatrixGenerator
git clone git@github.com:AlexandrovLab/SigProfilerExtractor.git

pip install -e /home/zhangh5/work/software/SigProfilerMatrixGenerator
pip install -e /home/zhangh5/work/software/SigProfilerExtractor
python /home/zhangh5/work/bulk_analysis/mut_signature/setup/download_genome.py
1. run the test with the BRCA dataset:
from SigProfilerExtractor import sigpro as sig
sig.sigProfilerExtractor(
    "vcf", 
    "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/test_results_VCF", 
    "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/21BRCA/21BRCA_vcf", 
    reference_genome="GRCh37", 
    minimum_signatures=1, 
    maximum_signatures=10, 
    nmf_replicates=100,
    gpu = True,
    )

Relevant logs are:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_device_name(0)
'NVIDIA A40'
>>> torch.cuda.get_device_name()
'NVIDIA A40'
>>> from SigProfilerExtractor import sigpro as sig
>>> sig.sigProfilerExtractor(
...     "vcf",
...     "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/test_results_VCF",
...     "/home/zhangh5/work/bulk_analysis/mut_signature/test_run_BRCA/21BRCA/21BRCA_vcf",
...     reference_genome="GRCh37",
...     minimum_signatures=1,
...     maximum_signatures=10,
...     nmf_replicates=100,
...     gpu=True,
...     )

************** Reported Current Memory Use: 0.37 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 10.17 seconds.
Matrices generated for 21 samples with 0 errors. Total of 183916 SNVs, 911 DINUCs, and 0 INDELs were successfully analyzed.
Extracting signature 1 for mutation type 96
The matrix normalizing cutoff is 11723


process 1 continues please wait...
execution time: 37 seconds

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/lila/data/iacobuzc/haochen/mambaforge/envs/sig/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/lila/data/iacobuzc/haochen/mambaforge/envs/sig/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/lila/data/iacobuzc/haochen/software/SigProfilerExtractor/SigProfilerExtractor/subroutines.py", line 385, in pnmf
    W, H, Conv = nmf_fn(g, totalProcesses, init=init, execution_parameters=execution_parameters, generator=rand_rng)
  File "/lila/data/iacobuzc/haochen/software/SigProfilerExtractor/SigProfilerExtractor/subroutines.py", line 296, in nnmf_gpu
    genomes = torch.from_numpy(genomes).float().cuda(gpu_id)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

"""

And the output of nvidia-smi is:

(sig) [ln10 20230318-13:41:34]$ nvidia-smi
Sat Mar 18 13:47:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:65:00.0 Off |                    0 |
|  0%   34C    P0    62W / 300W |      0MiB / 46068MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
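
One detail in the table above may be relevant: the Compute M. column reads `E. Process`, i.e. exclusive-process compute mode, in which only one process at a time may hold a CUDA context on the device. The traceback shows the failure inside a `multiprocessing` pool worker, so it is possible (an assumption, not something confirmed in this thread) that the worker could not create a context because another process already held one. A minimal sketch for reading the compute mode from Python follows; the helper name `query_compute_modes` is hypothetical, and it relies only on the `nvidia-smi --query-gpu=compute_mode` query.

```python
import shutil
import subprocess


def query_compute_modes():
    """Return the compute mode of each visible GPU as a list of strings.

    Returns None when nvidia-smi is missing or the query fails.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # One line per GPU; drop empty lines and surrounding whitespace.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


modes = query_compute_modes()
print(modes)
```

If the mode is an exclusive one, changing it normally requires administrator rights on the node, so on a shared cluster this is likely a question for the system administrators.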
@mdbarnesUCSD
Collaborator

mdbarnesUCSD commented Mar 20, 2023

Hi @haochenz96,

From the error message, it looks like your GPU was busy at the time of the crash, which could be due to library installation issues. Could you please verify that your installation is correct?

@haochenz96
Author

Hi @mdbarnesUCSD!

Thanks for your response. How can I verify that my installation is correct? Do you mean the pytorch-cuda driver?

@mdbarnesUCSD
Collaborator

Yes, it would be good to verify that the pytorch-cuda driver was installed correctly. It would be great to confirm that you can initialize two matrices, move them to the GPU, and multiply them together without error. This code runs successfully on other NVIDIA GPUs (M60, V100, and A100), so I suspect there is some issue with your GPU's configuration.
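
The check described above could look roughly like the following. This is a minimal sketch, not code from SigProfilerExtractor: the function name `check_gpu_matmul` and the matrix size are arbitrary choices, and device index 0 is assumed because the logs show a single GPU. It uses the same `.cuda(...)` transfer that appears in the traceback.

```python
import importlib.util


def check_gpu_matmul(size=256):
    """Create two matrices, move them to GPU 0, and multiply them.

    Returns (ok, message) instead of raising, so the outcome is easy to report.
    """
    if importlib.util.find_spec("torch") is None:
        return False, "torch is not installed in this environment"

    import torch

    if not torch.cuda.is_available():
        return False, "torch is installed but CUDA is not available"

    try:
        a = torch.rand(size, size).cuda(0)
        b = torch.rand(size, size).cuda(0)
        c = a @ b
        # Wait for the kernel to finish so asynchronous errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return False, f"CUDA error: {exc}"

    name = torch.cuda.get_device_name(0)
    return True, f"matmul succeeded on {name}, result shape {tuple(c.shape)}"


ok, message = check_gpu_matmul()
print(ok, message)
```

Running it in the same `sig` environment, and ideally while nothing else is using the GPU, keeps the result comparable to the failing run.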

@mdbarnesUCSD
Collaborator

Please reopen this issue if it persists.
