🍏 Change Nextflow I/O behavior #218

evanroyrees · 2022-01-18T23:26:47Z

The Nextflow output structure now resembles what was discussed in #160. Note that metagenomes are not enumerated to generate their output directory names; instead their meta.id is used, which is the metagenome.simpleName (groovy method) of the respective input metagenome. This truncates filenames that contain multiple `.` characters. For example, the meta.id of my.example.metagenome.fasta would be my. Moving forward (looking at #186), altering the input s.t. a sample sheet is provided would allow the values in this table to be checked for unique sample IDs, so that each sample is written to its respective sample ID results directory.
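To make the truncation concrete, here is a minimal Python sketch of the behavior described above. It is a stand-in written for illustration, not the pipeline's code: `simple_name`, `base_name`, and `find_id_collisions` are hypothetical helpers, and `simple_name` only assumes that simpleName keeps the part of the file name before the first `.`.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def simple_name(path: str) -> str:
    """Assumed simpleName behavior: keep the file name up to the first '.'."""
    return PurePosixPath(path).name.split(".", 1)[0]


def base_name(path: str) -> str:
    """For comparison: strip only the final extension."""
    return PurePosixPath(path).stem


def find_id_collisions(paths):
    """Group input paths by their derived ID and return IDs shared by 2+ inputs."""
    groups = defaultdict(list)
    for path in paths:
        groups[simple_name(path)].append(path)
    return {key: value for key, value in groups.items() if len(value) > 1}


if __name__ == "__main__":
    inputs = ["my.example.metagenome.fasta", "my.other.fasta", "sample2.fna"]
    for path in inputs:
        print(path, "->", simple_name(path))
    print("collisions:", find_id_collisions(inputs))
```

Two inputs that differ only after the first `.` would end up with the same ID and therefore the same output directory, which is the case a sample sheet with unique sample IDs would guard against.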

🍏 fixes Do you think it would be worth changing the I/O behavior here s.t. each input metagenome has its own directory? #160
📝 Update running Autometa documentation
- 🔥 Remove trailing whitespaces
- 🎨 Change some of the note card headers to attention and caution cards
🔥 Remove redundancies in main autometa workflow
🍏 Add storeDir for fetching mock_data genomes
🍏 🛠️ 🐛 hmmsearch nf processes still need to be fixed
🔥 Remove unused parameters in nextflow_schema.json
🔥🎨 Fix nextflow.config nf-core settings to silence linter warnings

- Change publishDir to write process files to their respective metagenome output directory
- By default traceDir will now be written to params.outdir/trace
- Change prodigal nf-core module output filenames
- 🐛 Fix nsplits logic to only merge files when nsplits is greater than 1 (not 0)
- 🔥 Remove internal directories (outdir and interim with hashes prepended)
- 🔥 Remove interim directory parameter
- 🍏 Change reduce_lca.nf label to process_medium
- 🍏📝 Update schema to reflect removed parameters
- 🍏📝 Change the results directory reported to the user to show only params.outdir and not the removed params.interim_dir

evanroyrees · 2022-01-18T23:48:53Z

Will be merging by EOTD tomorrow unless an issue is raised

chasemc · 2022-01-19T15:16:24Z

Few things to address, I'll probably edit this with more notes...

- storeDir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

  Autometa/modules/local/prepare_lca.nf

  Line 21 in 3c65a57

  storeDir 'db/lca'
- because outdir defaults to baseDir and the output directory name is dynamic, there is no way to gitignore the output directory, would have to individually gitignore all output file patterns which seems a bit dangerous

  Autometa/nextflow.config

  Line 40 in 0cba818

  outdir = "${baseDir}"

evanroyrees · 2022-01-19T15:27:44Z

> because outdir defaults to baseDir and the output directory name is dynamic, there is no way to gitignore the output directory, would have to individual gitignore all output file patterns which seems a bit dangerous
>
> Autometa/nextflow.config
>
> Line 40 in 0cba818
>
> outdir = "${baseDir}"

Would you suggest hardcoding a default output directory here? Or making this a required parameter for the end user? I think the typical end-user behavior will be to fill in this parameter as they will want to place their analyses in a specific location.

Hardcoding the outdir would be a simple fix, right? Should we go this route?

For example:
outdir = "nf-output"

evanroyrees · 2022-01-19T15:30:18Z

> storeDir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

chasemc · 2022-01-19T15:35:57Z

> Hardcoding the outdir would be a simple fix, right? Should we go this route?

Yeah I think that would be fine.

> I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded, so it will always be created relative to wherever nextflow is run from.
It should be a parameter, and could maybe default to a folder under params.outdir.

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes
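The file-hash idea raised here could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `sha256_of`, `needs_rebuild`, `record_build`, and the sidecar "stamp" file are hypothetical names, and nothing here reflects how Nextflow or PREPARE_LCA actually track their inputs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a (possibly very large) file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(source, stamp):
    """True when no hash was recorded yet or the source file has changed since."""
    stamp = Path(stamp)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != sha256_of(source)


def record_build(source, stamp):
    """Store the hash of the source the derived database was built from."""
    Path(stamp).write_text(sha256_of(source) + "\n")
```

A caller would check `needs_rebuild("nr.gz", "db/lca/nr.sha256")` before reusing a cached LCA database and call `record_build` after regenerating it; both paths are placeholders.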

chasemc · 2022-01-19T16:06:00Z

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?
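A fail-fast check of this kind might be sketched as below. This is a Python stand-in for illustration only; `assert_outdir_is_fresh` is a hypothetical helper, and the real pipeline would need an equivalent check in its own entry point.

```python
from pathlib import Path


def assert_outdir_is_fresh(outdir):
    """Raise before any work starts if outdir already holds files."""
    path = Path(outdir)
    if path.exists() and not path.is_dir():
        raise NotADirectoryError(f"{path} exists and is not a directory")
    if path.is_dir() and any(path.iterdir()):
        raise FileExistsError(f"{path} is not empty; refusing to overwrite results")
    return path
```

A missing or empty directory passes; anything else stops the run before results from a previous run can be overwritten.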

evanroyrees · 2022-01-19T22:41:47Z

> Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

evanroyrees · 2022-01-19T22:44:45Z

> > I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?
>
> Right now it's hardcoded and so will always be created based on where nextflow is run from
> Should be a parameter, could maybe default to a folder under params.outdir
>
> As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes

I think this is as intended for wherever nextflow is run b/c these LCA dbs can be used across runs of different datasets. If I am recalling correctly, if the ncbi databases change, the cached LCA databases will be regenerated (I think nextflow is already doing some of this file hash tracking behind the scenes here).

🍏 Add autometa-nxf-output to .gitignore

chasemc · 2022-01-20T00:03:09Z

> > Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?
>
> I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

IMO, because this PR removes the provenance (run ID) from the output, I think either the pipeline should fail if files already exist or the provenance should be written as an output. This may be a larger issue, but this PR does take a step backwards in data provenance.
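As a rough illustration of the second option (writing the provenance as an output), a Python sketch could look like this. The function name, the file naming scheme, and the record fields are all assumptions made for the example and are not part of Autometa or Nextflow.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(outdir, run_id, params):
    """Write a small JSON record tying the results in outdir to a run ID."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    record = {
        "run_id": run_id,
        "written_at": datetime.now(timezone.utc).isoformat(),
        "params": dict(params),
    }
    # One file per run ID, so repeated runs into the same outdir stay traceable.
    target = outdir / f"provenance_{run_id}.json"
    target.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return target
```

Keeping one record per run ID means the output directory name can stay stable while each run that wrote into it remains identifiable.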

evanroyrees · 2022-01-20T16:02:33Z

Happy to add that in for the 2.1.0 release 👍

evanroyrees added 2 commits January 12, 2022 18:40

🎨 WIP

d948312

evanroyrees requested a review from chasemc January 18, 2022 23:26

evanroyrees self-assigned this Jan 18, 2022

evanroyrees added 2 commits January 18, 2022 17:35

📝 Change coverage link to point to coverage-calculations section in step-by-step tutorial

7109cd6

📝🐛 Add link to coverage calculation

0cba818

evanroyrees changed the title ~~Change Nextflow I/O so each input metagenome has its own output directory~~ 🍏 Change Nextflow I/O behavior Jan 18, 2022

evanroyrees linked an issue Jan 18, 2022 that may be closed by this pull request

Do you think it would be worth changing the I/O behavior here s.t. each input metagenome has its own directory? #160

Closed

🍏🎨 Replace default params.outdir of baseDir with autometa-nxf-output

6347565

🍏 Add autometa-nxf-output to .gitignore

evanroyrees merged commit bdbffda into dev Jan 19, 2022

evanroyrees deleted the issue-160 branch January 19, 2022 22:56
