
large run failing at mafft step #44

Closed
splaisan opened this issue Dec 13, 2023 · 4 comments


splaisan commented Dec 13, 2023

Hi,
I ran the analysis of 4 SMRT Cell runs after renaming the FASTQ files to make them globally unique. It seemed to go well, but after 3 days of computing (44 threads, 292 GB RAM) it died, apparently at the mafft step, and I do not have the final HTML report and probably not all the data either. I re-ran nextflow with -resume with the same result, and also ran the experiment on a stronger server (84 threads, 512 GB RAM) with an identical outcome.

I attach a zip of the nextflow log; can you please help me fix this? I would rather run the 4 runs in one go to deliver merged data to the end user, since merging the individual runs (each of which succeeded on its own) is not documented and would in the best case not generate the nice HTML report.

Thanks in advance

dot.nextflow.log.zip
dot.nextflow_auto.log.zip

# all parameters are standard and rarefaction auto gives 8361 in the second run and was set to 10000 in the first run
# (rarefaction='' in the auto-run and ="--rarefaction_depth 10000" in the first run)
nextflow run main.nf \
  --input "${outfolder}/${outpfx}_samples.tsv" \
  --metadata "${outfolder}/${outpfx}_metadata.tsv" \
  --outdir "${outfolder}" \
  --dada2_cpu "${cpu}" \
  --vsearch_cpu "${cpu}" \
  --cutadapt_cpu "${cpu}" \
  "${rarefaction}" \
  --min_asv_totalfreq "${min_asv_totalfreq}" \
  --min_asv_sample "${min_asv_sample}" \
  --colorby "${colorby}" \
  -profile docker 2>&1 | tee ${outfolder}/run_log.txt
splaisan changed the title from "large run failing at maff step" to "large run failing at mafft step" on Dec 13, 2023
@proteinosome
Collaborator

@splaisan Are you running this locally on a single node without a job scheduler? In that case, would you be able to find the error file in the tmp directory /tmp/qiime2-q2cli-err-adqu7ohp.log as indicated in the log? That would give us a better idea of what happened.

Do you also know how many ASVs were discovered post-DADA2? You should be able to find dada2-ccs_stats.qza in the DADA2 output folder and open that in QIIME 2 View to investigate the stats.

Are you also doing separate denoising or pooled denoising? See here for a description.


splaisan commented Dec 14, 2023

Yes, I looked for the file, but it is not in my /tmp.
Is it possible that the /tmp is the Docker container's rather than my server's (and so not available after the run completes or crashes)?
To test this, I plan to mount /tmp to a local workfolder/tmp by editing the Docker config file.
We also wonder whether the mafft job is simply too heavy with 624 samples. To test that, we added a config block with much more RAM, specifically for that job.
I hope -resume will allow quick debugging.
I am OOO for a few days but will post on my return.
Cheers
S
PS: will look at your other points then too

@proteinosome
Collaborator

@splaisan Yes, the /tmp directory is the default in Docker. You can set the TMPDIR variable and mount that directory so the tools use a specific temporary directory. See this issue for an example.
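One way to do that mounting, assuming a Docker-based Nextflow setup, is a small config fragment using Nextflow's standard docker scope (`docker.runOptions` is a real Nextflow option; the host path below is a placeholder, not taken from the thread):

```groovy
// Hypothetical fragment for nextflow.config -- the host path is a placeholder.
docker {
    enabled = true
    // Bind a host folder over the container's /tmp so temporary files
    // (including the qiime2-q2cli-err-*.log files) survive the run.
    runOptions = '-v /path/on/host/tmp:/tmp'
}
```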


splaisan commented Dec 21, 2023

We (thanks to Kobe on our team) finally got the run to finish happily by giving two critical steps in the workflow more room to work.

Our fix had two parts:

  • add a custom config file extra.config, saved in the repo folder, with the following content:
process {
  // more RAM for the diversity job
  withName: qiime2_phylogeny_diversity {
    cpus = 8
    memory = 240.GB
  }
  // more RAM for the report building
  withName: html_rep {
    cpus = 8
    memory = 128.GB
  }
}

// correct bug in path for reports
// Generate report
report {
  enabled = true
  overwrite = true
  file = "$params.outdir/report/report.html"
}
// Timeline
timeline {
  enabled = true
  overwrite = true
  file = "$params.outdir/report/timeline.html"
}
// DAG
dag {
  enabled = true
  file = "$params.outdir/report/dag.html"
  overwrite = true
}

Note: The amount of extra RAM is probably excessive, but at least this ran without dying.
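If the fixed 240 GB turns out to be more than needed, a common Nextflow pattern is to start lower and grow the request on retry (a sketch only: the process name is taken from the thread above, but the starting size and retry count are guesses, not tested values):

```groovy
// Hypothetical alternative for extra.config: grow memory on each attempt
// instead of pinning a large fixed amount.
process {
    withName: qiime2_phylogeny_diversity {
        cpus          = 8
        memory        = { 64.GB * task.attempt }   // 64 GB, 128 GB, 192 GB ...
        errorStrategy = 'retry'
        maxRetries    = 3
    }
}
```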

Then we ran the nextflow command with minor edits:

# create tmp folder in output folder
mkdir -p ${outfolder}/tmp

# run edited nextflow command
TMPDIR="${outfolder}/tmp" nextflow run main.nf \
  <... more command arguments ...> \
  -profile docker \
  -c extra.config  2>&1 | tee ${outfolder}/run_log.txt

Declaring TMPDIR just before running the nextflow command ensures that /tmp (normally located inside the Docker image) is remapped to a local folder that remains visible after the run ends, so error report files can be read when things go wrong (as discussed in #42).
