Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🍏 Change Nextflow I/O behavior #218

Merged
merged 5 commits into from
Jan 19, 2022
Merged

🍏 Change Nextflow I/O behavior #218

merged 5 commits into from
Jan 19, 2022

Conversation

evanroyrees
Copy link
Collaborator

@evanroyrees evanroyrees commented Jan 18, 2022

Nextflow output structure now resembles what was discussed in #160. Note, metagenomes are not enumerated for generation of their output directory name, their meta.id is used which is the metagenome.simpleName (groovy method) from the respective input metagenome. This clobbers filenames with multiple .. For example, the meta.id of my.example.metagenome.fasta would be my. Moving forward (looking at #186), altering the input s.t. a sample sheet is provided would use values in this table to check for unique sample IDs to write each sample to its respective sample ID results directory.

  • 🍏 fixes Do you think it would be worth changing the I/O behavior here s.t. each input metagenome has its own directory?Β #160
  • πŸ“ Update running Autometa documentation
    • πŸ”₯ Remove trailing whitespaces
    • 🎨 Change some of the note card headers to attention and caution cards
  • πŸ”₯ Remove redundancies in main autometa workflow
  • 🍏 Add storeDir for fetching mock_data genomes
  • 🍏 πŸ› οΈ πŸ› hmmsearch nf processes still need to be fixed
  • πŸ”₯ Remove unused parameters in nextflow_schema.json
  • πŸ”₯🎨 Fix nextflow.config nf-core settings to silence linter warnings

Change publishDir to write process files to their respective metagenome output directory

- By default traceDir will now be written to params.outdir/trace
- Change prodigal nf-core module output filenames
- πŸ› Fix nsplits logic to only merge files when nsplits is greater than 1 (not 0)
- πŸ”₯ Remove internal directories (outdir and interim with hashes prepended)
- πŸ”₯ Remove interim directory parameter
- 🍏 Change reduce_lca.nf label to process_medium
- πŸπŸ“ Update schema to reflect removed parameters
- πŸπŸ“ Change emitted results directory to user to only show params.outdir and not the removed params.interim_dir
@evanroyrees evanroyrees self-assigned this Jan 18, 2022
@evanroyrees
Copy link
Collaborator Author

Will be merging by EOTD tomorrow unless an issue is raised

@evanroyrees evanroyrees changed the title Change Nextflow I/O so each input metagenome has its own output directory 🍏 Change Nextflow I/O behavior Jan 18, 2022
@chasemc
Copy link
Member

chasemc commented Jan 19, 2022

Few things to address, I'll probably edit this with more notes...

  • storedir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

  • because outdir defaults to baseDir and the output directory name is dyanmic, there is no way to gitignore the output directory, would have to individual gitignore all output file patterns which seems a bit dangerous
    -

    outdir = "${baseDir}"

@evanroyrees
Copy link
Collaborator Author

  • because outdir defaults to baseDir and the output directory name is dynamic, there is no way to gitignore the output directory, would have to individual gitignore all output file patterns which seems a bit dangerous
    outdir = "${baseDir}"

Would you suggest hardcoding a default output directory here? Or making this a required parameter for the end user? I think the typical end-user behavior will be to fill in this parameter as they will want to place their analyses in a specific location.

Hardcoding the outdir would be a simple fix, right? Should we go this route?

for example

outdir = nf-output

@evanroyrees
Copy link
Collaborator Author

  • storedir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

@chasemc
Copy link
Member

chasemc commented Jan 19, 2022

Hardcoding the outdir would be a simple fix, right? Should we go this route?

Yeah I think that would be fine.

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded and so will always be created based on where nextflow is run from
Should be a parameter, could maybe default to a folder under params.outdir

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes

@chasemc
Copy link
Member

chasemc commented Jan 19, 2022

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

@evanroyrees
Copy link
Collaborator Author

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

@evanroyrees
Copy link
Collaborator Author

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded and so will always be created based on where nextflow is run from Should be a parameter, could maybe default to a folder under params.outdir

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes

I think this is as intended for wherever nextflow is run b/c these LCA dbs can be used across runs of different datasets. If I am recalling correctly, if the ncbi databases change, the cached LCA databases will be regenerated (I think nextflow is already doing some of this file hash tracking behind the scenes here).

@evanroyrees evanroyrees merged commit bdbffda into dev Jan 19, 2022
@evanroyrees evanroyrees deleted the issue-160 branch January 19, 2022 22:56
@chasemc
Copy link
Member

chasemc commented Jan 20, 2022

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

IMO- Because this PR removes the provenance (run ID) from the output I think either the pipeline should fail if files already exist or the provenance should be written as an output. This may be a larger issue but this PR does take a step backwards in data provenance

@evanroyrees
Copy link
Collaborator Author

Happy to add that in for the 2.1.0 release πŸ‘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants