Can't call loops on SLURM #178

Open
wkc1986 opened this issue Feb 9, 2023 · 5 comments


wkc1986 commented Feb 9, 2023

Describe the bug
call-hiccups_input_hic failed, apparently because GPU resources were not requested. The situation is similar for call-delta.

OS/Platform

  • OS/Platform: Red Hat Enterprise Linux 8.6
  • Conda version: used Singularity
  • Pipeline version: v1.15.1
  • Caper version: 2.2.3

Caper configuration file

backend=slurm

# SLURM partition. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford Sherlock.
#slurm-partition=large-mem
slurm-partition=gpu

# SLURM account. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford SCG.
slurm-account=

# Local directory for localized files and Cromwell's intermediate files.
# If not defined then Caper will make .caper_tmp/ on CWD or `local-out-dir`.
# /tmp is not recommended since Caper stores localized data files here.
local-loc-dir=

cromwell=/gs/gsfs0/users/kuchang/.caper/cromwell_jar/cromwell-82.jar
womtool=/gs/gsfs0/users/kuchang/.caper/womtool_jar/womtool-82.jar

# following parts added by me
#
# SLURM resource parameters
slurm-leader-job-resource-param=-t 48:00:00 --mem 4G

# This parameter defines resource parameters for submitting WDL task to job engine.
# It is for HPC backends only (slurm, sge, pbs and lsf).
# It is not recommended to change it unless your cluster has custom resource settings.
# See https://github.com/ENCODE-DCC/caper/blob/master/docs/resource_param.md for details.
slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu} --time=28-0
#slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=10000M

Input JSON file

{
  "hic.assembly_name": "mm10",
  "hic.chrsz": "../data/mm10/encode/mm10_no_alt.chrom.sizes.tsv",
  "hic.input_hic": "hic/70f45f73-c0c0-42a4-95e0-8242ca9eef03/call-add_norm/shard-1/execution/inter_30.hic",
  "hic.reference_index": "/gs/gsfs0/user/kuchang/data/mm10/encode/ENCFF018NEO.tar.gz",
  "hic.restriction_enzymes": [
    "none"
  ],
  "hic.restriction_sites": "/gs/gsfs0/user/kuchang/data/mm10/ftp-arimagenomics.sdsc.edu/pub/JUICER_CUTSITE_FILES/mm10_GATC_GANTC.txt.gz",
  "hic.create_accessibility_track_ram_gb": 64
}

call-hiccups_input_hic/execution/stderr ends with

GPU/CUDA Installation Not Detected
Exiting HiCCUPS

Looking at call-hiccups_input_hic/execution/script.submit, the sbatch call doesn't have --gres=gpu:1, which I'm guessing would be necessary. The same is true for call-delta/execution/script.submit. The slurm-partition I specified should in fact have GPUs.
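
For reference, the slurm-resource-param in my Caper config only emits the --gres clause when a gpu value is defined for the task; with illustrative values cpu=1, memory_mb=16000 and time=24 (the real values come from each task's runtime block), the rendered resource parameters would look roughly like:

# gpu undefined for the task (no --gres emitted):
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=16000M --time=1440  --time=28-0

# gpu set to 1 for the task:
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=16000M --time=1440 --gres=gpu:1 --time=28-0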

In addition, call-delta/execution/stderr contains /usr/bin/python: can't find '__main__' module in ''


leepc12 commented Feb 9, 2023

Please open up hic.wdl and manually add a gpu attribute (not gpuCount) to the runtime block of the two hiccups tasks:

gpuType: "nvidia-tesla-p100"

gpuCount: 1

runtime {
	...
	gpu: 1
	...
}
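
For context, this is roughly what the edited runtime block in each hiccups task would end up looking like; the non-gpu attributes below are illustrative placeholders, not the exact lines in hic.wdl. gpuType/gpuCount are presumably consumed by the cloud backends, while the plain gpu attribute is what Caper substitutes into the ${gpu} placeholder of slurm-resource-param:

runtime {
	cpu: "1"                          # illustrative
	memory: "64 GB"                   # illustrative
	docker: docker_image              # illustrative placeholder
	gpuType: "nvidia-tesla-p100"      # already present
	gpuCount: 1                       # already present
	gpu: 1                            # new: picked up by Caper's ${gpu} for SLURM
}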

The call-delta error looks like a Singularity issue. Please post your call-delta/execution/stderr, and stdout too if possible.


wkc1986 commented Feb 9, 2023

Hi Jin-wook, thanks for the quick reply. I edited hic.wdl to put gpu: 1 in both hiccups and hiccups_2, and indeed the sbatch command now has --gres=gpu:1; however, the task still fails in the same way. Here's call-hiccups_input_hic/execution/stderr:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae
Warning Hi-C map may be too sparse to find many loops via HiCCUPS.
jcuda.CudaException: Could not prepare PTX for source file '/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae/temp_JCuda_3956590174754731503.cu'
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:389)
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:321)
	at jcuda.utils.KernelLauncher.compile(KernelLauncher.java:270)
	at juicebox.tools.utils.juicer.hiccups.GPUController.<init>(GPUController.java:72)
	at juicebox.tools.clt.juicer.HiCCUPS.buildGPUController(HiCCUPS.java:558)
	at juicebox.tools.clt.juicer.HiCCUPS.runCoreCodeForHiCCUPS(HiCCUPS.java:485)
	at juicebox.tools.clt.juicer.HiCCUPS.access$200(HiCCUPS.java:158)
	at juicebox.tools.clt.juicer.HiCCUPS$1.run(HiCCUPS.java:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Cannot run program "nvcc": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at java.base/java.lang.Runtime.exec(Runtime.java:592)
	at java.base/java.lang.Runtime.exec(Runtime.java:416)
	at java.base/java.lang.Runtime.exec(Runtime.java:313)
	at jcuda.utils.KernelLauncher.preparePtxFile(KernelLauncher.java:1113)
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:385)
	... 10 more
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
	... 16 more
GPU/CUDA Installation Not Detected
Exiting HiCCUPS

The call-delta/execution/stderr is just the line from my first post. The stdout is empty.


wkc1986 commented Feb 10, 2023

Looking more at this, I believe the issue is that on our HPC, CUDA has to be loaded via the module system, otherwise nvcc can't be found. But neither module nor adding the CUDA directory to the PATH works inside the container. Also, according to docker/hiccups/Dockerfile, shouldn't the task be using an NVIDIA image that already has nvcc?

How does one get nvcc in the container if it isn't already there?
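
(A quick way to check whether nvcc is actually baked into an image is to run it through Singularity directly; the image URI below is just an example, substitute whatever image the hiccups task actually runs with:

singularity exec --nv docker://encodedcc/hic-pipeline:1.15.1_hiccups which nvcc
singularity exec --nv docker://encodedcc/hic-pipeline:1.15.1_hiccups nvcc --version

Note that --nv only binds the host NVIDIA driver libraries into the container; it does not provide the CUDA toolkit, so nvcc has to come from the image itself.)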


shengqh commented Feb 24, 2023

I ran into this issue too.


wkc1986 commented Mar 1, 2023

Possibly solved. The hiccups and delta tasks have their own Docker images specified in hic.wdl, but their singularity settings point to the main pipeline image, which does not include the GPU tooling. So in hic.wdl, inside the input { } block of workflow hic, I copied the hiccups_docker line and added this line:

String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"

and changed this in hiccups_runtime_environment:

"singularity": hiccups_singularity

and successfully ran hiccups. I assume the same will work for delta.
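
Putting the edit together, the relevant part of hic.wdl ends up looking roughly like this; the exact surrounding declarations may differ, and the hiccups_docker default shown here is inferred from the image tag above:

workflow hic {
	input {
		...
		String hiccups_docker = "encodedcc/hic-pipeline:1.15.1_hiccups"
		# new input, copied from the hiccups_docker line
		String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"
		...
	}

	# point the hiccups runtime environment at the GPU-enabled image
	RuntimeEnvironment hiccups_runtime_environment = {
		"docker": hiccups_docker,
		"singularity": hiccups_singularity
	}
	...
}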
