Can't call loops on SLURM #178

Open
wkc1986 opened this issue Feb 9, 2023 · 5 comments


wkc1986 commented Feb 9, 2023

Describe the bug
call-hiccups_input_hic failed, apparently because GPU resources were not requested. The situation is similar for call-delta.

OS/Platform

  • OS/Platform: Red Hat Enterprise Linux 8.6
  • Conda version: used Singularity
  • Pipeline version: v1.15.1
  • Caper version: 2.2.3

Caper configuration file

backend=slurm

# SLURM partition. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford Sherlock.
#slurm-partition=large-mem
slurm-partition=gpu

# SLURM account. DEFINE ONLY IF REQUIRED BY YOUR CLUSTER'S POLICY.
# You must define it for Stanford SCG.
slurm-account=

# Local directory for localized files and Cromwell's intermediate files.
# If not defined then Caper will make .caper_tmp/ on CWD or `local-out-dir`.
# /tmp is not recommended since Caper stores localized data files here.
local-loc-dir=

cromwell=/gs/gsfs0/users/kuchang/.caper/cromwell_jar/cromwell-82.jar
womtool=/gs/gsfs0/users/kuchang/.caper/womtool_jar/womtool-82.jar

# following parts added by me
#
# SLURM resource parameters
slurm-leader-job-resource-param=-t 48:00:00 --mem 4G

# This parameter defines resource parameters for submitting WDL task to job engine.
# It is for HPC backends only (slurm, sge, pbs and lsf).
# It is not recommended to change it unless your cluster has custom resource settings.
# See https://github.com/ENCODE-DCC/caper/blob/master/docs/resource_param.md for details.
slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=${cpu} ${if defined(memory_mb) then "--mem=" else ""}${memory_mb}${if defined(memory_mb) then "M" else ""} ${if defined(time) then "--time=" else ""}${time*60} ${if defined(gpu) then "--gres=gpu:" else ""}${gpu} --time=28-0
#slurm-resource-param=-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=10000M

Input JSON file

{
  "hic.assembly_name": "mm10",
  "hic.chrsz": "../data/mm10/encode/mm10_no_alt.chrom.sizes.tsv",
  "hic.input_hic": "hic/70f45f73-c0c0-42a4-95e0-8242ca9eef03/call-add_norm/shard-1/execution/inter_30.hic",
  "hic.reference_index": "/gs/gsfs0/user/kuchang/data/mm10/encode/ENCFF018NEO.tar.gz",
  "hic.restriction_enzymes": [
    "none"
  ],
  "hic.restriction_sites": "/gs/gsfs0/user/kuchang/data/mm10/ftp-arimagenomics.sdsc.edu/pub/JUICER_CUTSITE_FILES/mm10_GATC_GANTC.txt.gz",
  "hic.create_accessibility_track_ram_gb": 64
}

call-hiccups_input_hic/execution/stderr ends with

GPU/CUDA Installation Not Detected
Exiting HiCCUPS

Looking at call-hiccups_input_hic/execution/script.submit, the sbatch call doesn't have --gres=gpu:1, which I'm guessing would be necessary. The same is true for call-delta/execution/script.submit. The slurm-partition I specified should in fact have GPUs.
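
For reference, the slurm-resource-param in my Caper config only emits the --gres clause when a gpu value is defined for the task; with illustrative values cpu=1, memory_mb=16000 and time=24 (the real values come from each task's runtime block), the rendered resource parameters would look roughly like:

# gpu undefined for the task (no --gres emitted):
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=16000M --time=1440  --time=28-0

# gpu set to 1 for the task:
-n 1 --ntasks-per-node=1 --cpus-per-task=1 --mem=16000M --time=1440 --gres=gpu:1 --time=28-0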

In addition, call-delta/execution/stderr contains /usr/bin/python: can't find '__main__' module in ''


leepc12 commented Feb 9, 2023

Please open up hic.wdl and manually add a gpu attribute (not gpuCount) to the runtime block of the two hiccups tasks:

gpuType: "nvidia-tesla-p100"

gpuCount: 1

runtime {
	...
	gpu: 1
	...
}
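
For context, this is roughly what the edited runtime block in each hiccups task would end up looking like; the non-gpu attributes below are illustrative placeholders, not the exact lines in hic.wdl. gpuType/gpuCount are presumably consumed by the cloud backends, while the plain gpu attribute is what Caper substitutes into the ${gpu} placeholder of slurm-resource-param:

runtime {
	cpu: "1"                          # illustrative
	memory: "64 GB"                   # illustrative
	docker: docker_image              # illustrative placeholder
	gpuType: "nvidia-tesla-p100"      # already present
	gpuCount: 1                       # already present
	gpu: 1                            # new: picked up by Caper's ${gpu} for SLURM
}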

The call-delta error looks like a Singularity issue. Please post your call-delta/execution/stderr, and stdout too if possible.


wkc1986 commented Feb 9, 2023

Hi Jin-wook, thanks for the quick reply. I edited hic.wdl to put gpu: 1 in both hiccups and hiccups_2, and indeed the sbatch command now has --gres=gpu:1; however, the task still fails in the same way. Here's call-hiccups_input_hic/execution/stderr:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae
Warning Hi-C map may be too sparse to find many loops via HiCCUPS.
jcuda.CudaException: Could not prepare PTX for source file '/mnt/gsfs0/shared-collab/gecollab/hic/encode_hic-pipeline/hic/085f84e0-0790-4387-af97-b74e34b74f2f/call-hiccups_input_hic/tmp.97dfeaae/temp_JCuda_3956590174754731503.cu'
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:389)
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:321)
	at jcuda.utils.KernelLauncher.compile(KernelLauncher.java:270)
	at juicebox.tools.utils.juicer.hiccups.GPUController.<init>(GPUController.java:72)
	at juicebox.tools.clt.juicer.HiCCUPS.buildGPUController(HiCCUPS.java:558)
	at juicebox.tools.clt.juicer.HiCCUPS.runCoreCodeForHiCCUPS(HiCCUPS.java:485)
	at juicebox.tools.clt.juicer.HiCCUPS.access$200(HiCCUPS.java:158)
	at juicebox.tools.clt.juicer.HiCCUPS$1.run(HiCCUPS.java:414)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Cannot run program "nvcc": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at java.base/java.lang.Runtime.exec(Runtime.java:592)
	at java.base/java.lang.Runtime.exec(Runtime.java:416)
	at java.base/java.lang.Runtime.exec(Runtime.java:313)
	at jcuda.utils.KernelLauncher.preparePtxFile(KernelLauncher.java:1113)
	at jcuda.utils.KernelLauncher.create(KernelLauncher.java:385)
	... 10 more
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
	... 16 more
GPU/CUDA Installation Not Detected
Exiting HiCCUPS

The call-delta/execution/stderr is just the line from my first post. The stdout is empty.


wkc1986 commented Feb 10, 2023

Looking more at this, I believe the issue is that on our HPC, CUDA has to be loaded via the module system, otherwise nvcc can't be found. But neither module nor adding the CUDA directory to the PATH works inside the container. Also, according to docker/hiccups/Dockerfile, shouldn't the task be using an NVIDIA image that already has nvcc?

How does one get nvcc in the container if it isn't already there?
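
(A quick way to check whether nvcc is actually baked into an image is to run it through Singularity directly; the image URI below is just an example, substitute whatever image the hiccups task actually runs with:

singularity exec --nv docker://encodedcc/hic-pipeline:1.15.1_hiccups which nvcc
singularity exec --nv docker://encodedcc/hic-pipeline:1.15.1_hiccups nvcc --version

Note that --nv only binds the host NVIDIA driver libraries into the container; it does not provide the CUDA toolkit, so nvcc has to come from the image itself.)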


shengqh commented Feb 24, 2023

I ran into this issue too.


wkc1986 commented Mar 1, 2023

Possibly solved. The hiccups and delta tasks have their own Docker images specified in hic.wdl, but their singularity settings point to the main pipeline image, which does not include the GPU tooling. So in hic.wdl, inside the input { } block of workflow hic, I copied the hiccups_docker line and added this line:

String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"

and changed this in hiccups_runtime_environment:

"singularity": hiccups_singularity

and successfully ran hiccups. I assume the same will work for delta.
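
Putting the edit together, the relevant part of hic.wdl ends up looking roughly like this; the exact surrounding declarations may differ, and the hiccups_docker default shown here is inferred from the image tag above:

workflow hic {
	input {
		...
		String hiccups_docker = "encodedcc/hic-pipeline:1.15.1_hiccups"
		# new input, copied from the hiccups_docker line
		String hiccups_singularity = "docker://encodedcc/hic-pipeline:1.15.1_hiccups"
		...
	}

	# point the hiccups runtime environment at the GPU-enabled image
	RuntimeEnvironment hiccups_runtime_environment = {
		"docker": hiccups_docker,
		"singularity": hiccups_singularity
	}
	...
}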
