Nextflow implementation template (#118)

* 🎨 Add entrypoints for taxon assignment workflow
  🎨 Add Autometa nextflow implementation template.
  🎨 Update majority_vote.py parameters to more easily construct taxon assignment workflow.
* 🎨 Comment out container directives
  🎨🔥 Update optional arguments to mandatory arguments (metagenome, interim, processed)
  🎨 Prefix output files with metagenome.simpleName in their respective output directories.
  🎨 Name main workflow AUTOMETA and call with channel
  🔥 Remove handling of coverage outside of SPAdes assembly (TODO: Incorporate separate COVERAGE workflow to pass into AUTOMETA)
  🐛 Fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers)
* 🎨 Add nextflow config with slurm executor configuration and nextflow project details
* 🐛 Add end of file newline
* 🐛 Add missing line continuation in MARKERS command.
* 🐛 Fix incorrect keyword argument in lca.py main call
  🐛 Fix incorrect flag in entrypoint (MARKERS process)
* 🎨 Keep hmmscan output file in MARKERS
* Update gitignore with paths to ignore nextflow generated files
* 🐛 Fix broken paths in SPLIT_KINGDOMS
  🎨 Add parameter '--outdir' to autometa-taxonomy entrypoint.
* 🐛 Fix missing line continuation in BINNING
* 🎨 Update output paths so only binning results are in processed directory
  🎨 Add completeness and purity parameters to autometa.nf
* 🎨 Add completeness and purity parameters to log at beginning of run
* 🐛 Handle case where archaea are not recovered from metagenome
* 🎨 Add config file for autometa input parameters
  🔥 Remove copy mode from all publishDir settings for all processes in autometa workflow
  🎨 Update autometa.taxonomy.vote entrypoint parameters
  💚 Update mocked args to be compatible with new autometa.taxonomy.vote parameters
  🎨 Add type hints to ncbi.py
  🔥 Remove most of the redundant logic from vote.py so that the entrypoint is now only responsible for adding canonical ranks to voted taxids and writing out ranks split by provided rank
  🎨🔥 Remove hardcoded parameters and add additional parameters to allow user finer control of entire autometa workflow
  🎨 Add HTCondor executor profile with comments
* 💚🐛🔥 Remove keyword argument 'out' from vote.add_ranks(...) func
* 🎨 Add params.cpus to initial info log
* 🔥🐛🎨 Remove unnecessary autometa prodigal wrapper.
  🔥 Remove GNU parallel functionality from ORFs process. This was removed because the number of ORF sequences recovered using GNU parallel was non-deterministic. This will take a hit on performance as a trade-off for determinism.
* 🎨 Update nextflow scripts to use jason-c-kwan/autometa:dev docker image
  🎨 Add dockerignore to prevent unnecessary context bloat and image bloat.
  🔥 Remove Makeflow autometa template
  🎨 Move autometa.nf containing AUTOMETA workflow to nextflow directory
  ⬆️ Add minimum pandas version of 1.1.
  📝 Update link to references in normalize(...) func in kmers.py
  🎨 Update parameters.config to reflect updated nextflow parameters
  🎨 Update Dockerfile with entrypoint checks autometa-taxonomy-lca and autometa-taxonomy-majority-vote
  🎨 Add main.nf for use with manifest as a prerequisite for nextflow pipeline sharing through GitHub.
  🎨 Update manifest in nextflow.config to reflect change in mainScript
  🎨 Add fixOwnership to docker scope in nextflow.config
* 🎨 Update manifest with 'doi' and 'defaultBranch'
* 🎨 Update arguments for entrypoints autometa-binning and autometa-unclustered-recruitment
  🎨 Propagate these argument changes to nextflow processes
  💚 Update tests to accommodate updated arguments
* 🔥 Remove unused/unnecessary configuration scripts
  🎨 Move code in config/__init__.py to config/utilities.py and update respective imports to point to this file
  🎨 Split autometa-configure entrypoint into two entrypoints autometa-config and autometa-update-databases
  🐛 Change default markers directory to look inside default.config instead of source directory
  🔥 Remove __main__.py and autometa.py wrapper to __main__.py in exchange for using nextflow files.
  ⬆️ Add diamond to requirements.txt
  🐛 Modify config to point to autometa/databases after installation in Docker build
  🎨📝 Add type hints across config scripts
* 🎨 Apply black formatting
* ✅🎨 Update call to parse_args from config.parse_args(...) to config.utilities.parse_args(...)
* ✅🐛 Update config.parse_args(...) to autometa.config.utilities.parse_args(...)
* ✅ Alias config.utilities imports to configutils. Provides access to parse_args attribute while avoiding confusion with autometa.common.utilities functions
* 🎨 Update default databases retrieval logic
  🐛 Remove issue of redundant executable versions being written in default.config
  🐛 Fix automatically updating autometa home_dir configuration in default.config
  🎨 Add exception handling in parse_argparse.py to provide more debugging information
* ✅📝 Fix error when parsing databases argparse.
  🎨 Remove any indentation in written argparse blocks for retrieving argparse usage
* 🎨 Add EOF line in dockerignore
* 🐛 Fix default path to markers database in MARKERS process
* 🐛 Fix incorrect option when attempting to download missing ncbi files
* 🐛 Fix clean command in Makefile so it actually removes provided directories
* 🎨 Replace only first ftp in ncbi ftp filepaths
* 🎨 Remove orfs filepath dependency in LCA and majority vote
  🎨 Change entrypoint arguments for autometa-taxonomy-lca and autometa-taxonomy-majority-vote
* 🎨 Change entrypoint parameters for autometa-length-filter.
  🔥 Remove unused methods in metagenome.py
  🎨✅ Remove unused tests in test_metagenome. Update MockedParser to reflect new entrypoint args
  🎨 Update nextflow LENGTH_FILTER process to accommodate new parameters. Now uses named emits (fasta, stats, gc_content)
  🎨📝 Add new binning metrics into parameters.config (gc_stddev_limit, cov_stddev_limit)
  📝🎨 Add type hints into metagenome.py
* 📝 Update log with added parameters
* 🐛 Fix incorrect path to default markers database in nf pipeline (location in docker image is currently hardcoded in MARKERS process).
  🎨 Next step is for default to point to absolute path in docker image instead of relative path
* 🔥 Remove --dbdir hardcoded parameter in MARKERS process. This is now being appropriately configured in the docker image that is utilized by nextflow
  🐛 Add conda channels conda-forge and bioconda to create_environment command
  🎨 Update Dockerfile to configure autometa databases with the DB_DIR environment variable as an absolute path (relative path may cause bugs)
* Update autometa/common/metagenome.py
* 🐛 Replace 'orfs' tags with the respective single input path tag
* 🐛🔥 Remove --multiprocess flag from autometa-kmers command in KMERS process
* 🔥 Remove duplicate dependencies
* 🐛 Fix cryptic bug where imports do not work when explicit python interpreter is used in Makefile commands
  🎨 Add functionality to handle gzipped orfs for autometa-markers entrypoint
* 🔥 Remove Makefile from .dockerignore
  🎨 Use make commands from Makefile for autometa directory cleanup and install
  🐛⬆️ Set samtools minimum version in requirements.txt. Otherwise samtools command would not work properly
* 🎨 Change --output parameter to --output-binning in recursive_dbscan.py
  🎨 Add '--output-master' parameter to autometa-binning entrypoint
  ✅ Update MockArgs to account for updated entrypoint parameters
  ✅🎨 Add args check to autometa-binning entrypoint for embed_dimensions and embed_pca_dimensions inputs
  🎨 Fix typo in kmers embed docstring
  🎨 Standardize output columns from kmers.embed(...) to 1-indexed 'x_1' to 'x_{embed_dimensions}' instead of x,y,z...
  🐛 Add coverage and gc_content std.dev. limits to drop columns in run_hdbscan(...)
  🎨 Column drops in run_hdbscan(...) and run_dbscan(...) are now performed on one line, and if the df does not contain any of the columns in dropcols, the error is ignored
* 🔥 Remove conda install using py2.7
  🔥🎨 Rename references from master to main throughout nf and autometa binning scripts
  📝 Format notes in parameters.config
* ⬆️ Add minimum version of diamond 2.*
  💚 Add output_main to MockedArgs
* 📝🎨 Add copyright and short script description to all unit test files
* 🎨 Add autometa-parse-bed entrypoint
  🎨 Add READ_COVERAGE workflow in common-tasks to compute coverage from read alignments instead of SPAdes headers
* 📝 Replace 2020 copyright with 2021 copyright
  📝🔥 Remove note on ORF calling warning and replace with contig cutoff warning
  📝 Update help text for --binning argument in unclustered_recruitment
* 🔥 Remove --do-pca argument from kmers.py
  📝 Fix help string in --norm-method in kmers.py
  🎨 Change --normalized to --norm-output in kmers.py
  🎨 Change --embedded to --embedding-output in kmers.py
  🎨 Change --embed-dimensions to --embedding-dimensions in kmers.py
  🎨 Change --embed-method to --embedding-method in kmers.py
  🎨 Update KMERS in common-tasks.nf to account for updated parameters
  💚 Update test_kmers.py MockedArgs to account for updated arguments
* 🔥💚 Remove references to removed do_pca parameter
  🐛 Update marker databases checksums so they correspond to md5sum
  🎨 Sort main file output columns in autometa-binning entrypoint
* 🔥🎨 Remove 'string' metavar for clustering-method arg
* 🔥 Remove kmer embedding args from autometa-binning entrypoint
  🎨 Change KMERS.out.normalized as input for binning to KMERS.out.embedded
  💚 Update test_recursive_dbscan kmers fixture and mocked args to account for removed kmer parameters
  🎨 Add convert_dtypes method call to load(...) func for markers dataframe
  🔥🎨 Remove parameters for kmers in binning-tasks and update parameters to correspond to kmers args
  🎨 Unclustered recruitment now writes output-binning with contig, cluster and recruited_cluster columns
* 🎨 Add autometa-binning-summary entrypoint
  🎨 Unclustered recruitment now writes out binning with columns 'cluster' and 'recruited_cluster'
  🐛💚 Fix duplicate mocks in test_recursive_dbscan(...)
  🎨 Add BINNING_SUMMARY process in autometa.nf workflow
  🎨 Define BINNING_SUMMARY process in binning-tasks.nf
* 💚🐛 Change broken variable main to main_df
* 💚🔥 Remove kmer embedding dimensions test
* 🐛🔥 Remove assembly argument in get_metabin_stats(...)
  💚🔥 Remove unused mocked dependencies in test_kmers.py
  🔥💚 Remove tests corresponding to old summary.py functionality
* 💚 Add gc_content column to bin_df fixture in test_summary
* 📝 Add docstrings and explanation within vote.py
  🎨 Change vote.py argument from --input to --votes and add metavars to parser args
  💚 Change make_test_data.py summary data to create gc_content column instead of GC column
  💚 Update MockedArgs in vote.py to correspond to updated --votes parameter
  🎨 Replace --input argument in autometa-taxonomy for SPLIT_KINGDOMS process with --votes
* 🐛 Fix arg passed in pd.read_csv(...) for autometa.taxonomy.vote
* 🐎 Add autometa/databases to dockerignore
* 🎨 Update autometa-orfs entrypoint arguments
  📝 Add type hints to autometa.common.external.prodigal funcs
  🔥🎨 Remove --parallel parameter from autometa-orfs. Parallel is now inferred from --cpus arg
* 🐎 Ignore the ignore for autometa/databases/markers
  Add test of autometa-binning-summary entrypoint
* 🐛 Replace incorrect variable (orfs) in BINNING_SUMMARY tag
* 📝 Replace old kmer parameters in log info with new parameters
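Two of the smaller items in the message above are easy to misread, so here is a minimal Python sketch of what they likely amount to. Only the 1-indexed `x_1` to `x_{embed_dimensions}` naming and the "replace only the first ftp" behaviour come from the commit message. The helper names (`embedding_columns`, `swap_first_ftp`) and the `rsync` replacement value are hypothetical, chosen for illustration, and are not taken from the Autometa source.

```python
from typing import List


def embedding_columns(embed_dimensions: int) -> List[str]:
    """Return 1-indexed embedding column names: x_1 ... x_{embed_dimensions}."""
    return [f"x_{i}" for i in range(1, embed_dimensions + 1)]


def swap_first_ftp(filepath: str, replacement: str = "rsync") -> str:
    """Replace only the first occurrence of 'ftp' in an NCBI-style path.

    ASSUMPTION: `replacement="rsync"` is illustrative only. The point is the
    count argument of str.replace: without it, the 'ftp' inside the host name
    (ftp.ncbi.nlm.nih.gov) would be rewritten as well.
    """
    return filepath.replace("ftp", replacement, 1)


if __name__ == "__main__":
    print(embedding_columns(3))
    print(swap_first_ftp("ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"))
```

Numbered column names scale to any embedding dimensionality, which fixed names such as x, y, z cannot do beyond three dimensions.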
KwanLab · Apr 17, 2021 · 50f7a60
1 parent 5cccf75
commit 50f7a60
Showing 57 changed files with 2,217 additions and 2,745 deletions.
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,22 @@
+# Ignore some root directory files unnecessarily expanding image (and context) size
+.git
+docs
+autometa.mf
+MANIFEST.in
+LICENSE.txt
+meta.yaml
+# Ignore tests related files
+pytest.ini
+tests
+make_test_data.py
+# Ignore nextflow related files
+autometa.nf
+nextflow
+nextflow.config
+pipeline_info
+# Ignore any files/directories built from source
+dist
+Autometa.egg-info
+# Ignore databases
+autometa/databases
+!autometa/databases/markers
diff --git a/.gitignore b/.gitignore
@@ -154,3 +154,9 @@ tests/data/*
 
 # Mac
 .DS_Store
+
+# nextflow
+.nextflow
+.nextflow.log*
+pipeline_info
+work
diff --git a/Dockerfile b/Dockerfile
@@ -20,7 +20,7 @@ LABEL maintainer="jason.kwan@wisc.edu"
 # along with Autometa. If not, see <http://www.gnu.org/licenses/>.
 
 RUN apt-get update \
-    && apt install -y procps g++ \
+    && apt-get install -y procps \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
 
@@ -29,21 +29,32 @@ RUN conda install -c bioconda -c conda-forge --file=requirements.txt \
     && conda clean --all -y
 
 COPY . .
-RUN python setup.py install
+RUN make install && make clean
 
-RUN hmmpress autometa/databases/markers/bacteria.single_copy.hmm \
-    && hmmpress autometa/databases/markers/archaea.single_copy.hmm
+# NOTE: DB_DIR must be an absolute path (not a relative path)
+ENV DB_DIR="/scratch/dbs"
+RUN hmmpress -f autometa/databases/markers/bacteria.single_copy.hmm \
+    && hmmpress -f autometa/databases/markers/archaea.single_copy.hmm \
+    && mkdir -p $DB_DIR \
+    && mv autometa/databases/* ${DB_DIR}/. \
+    && autometa-config --section databases --option base --value ${DB_DIR} \
+    && echo "databases base directory set in ${DB_DIR}/"
 
 RUN echo "Testing autometa import" \
     && python -c "import autometa"
 
 # Check entrypoints are available
 RUN echo "Checking autometa entrypoints" \
+    && autometa-config -h > /dev/null \
+    && autometa-update-databases -h > /dev/null \
     && autometa-length-filter -h > /dev/null \
     && autometa-orfs -h > /dev/null  \
     && autometa-coverage -h > /dev/null  \
     && autometa-kmers -h > /dev/null \
     && autometa-markers -h > /dev/null \
     && autometa-taxonomy -h > /dev/null \
+    && autometa-taxonomy-lca -h > /dev/null \
+    && autometa-taxonomy-majority-vote -h > /dev/null \
     && autometa-binning -h > /dev/null \
-    && autometa-unclustered-recruitment -h > /dev/null 
+    && autometa-unclustered-recruitment -h > /dev/null \
+    && autometa-binning-summary -h > /dev/null
diff --git a/Makefile b/Makefile
@@ -22,12 +22,12 @@ endif
 
 ## Delete all compiled Python files
 clean:
-	find . -type f -name "*.py[co]" -delete
-	find . -type d -name "__pycache__" -delete
-	find . -type d -name "htmlcov" -delete
-	find . -type d -name "Autometa.egg-info" -delete
-	find . -type d -name "dist" -delete
-	find . -type d -name "build" -delete
+	find . -type f -name "*.py[co]" -exec rm -r {} +
+	find . -type d -name "__pycache__" -exec rm -r {} +
+	find . -type d -name "htmlcov" -exec rm -r {} +
+	find . -type d -name "Autometa.egg-info" -exec rm -r {} +
+	find . -type d -name "dist" -exec rm -r {} +
+	find . -type d -name "build" -exec rm -r {} +
 
 ## Apply black formatting
 black:
@@ -38,9 +38,9 @@ create_environment: requirements.txt
 ifeq (True,$(HAS_CONDA))
 		@echo ">>> Detected conda, creating conda environment."
 ifeq (3,$(findstring 3,$(PYTHON_INTERPRETER)))
-	conda create --name $(PROJECT_NAME) python=3 --file=requirements.txt
+	conda create -c conda-forge -c bioconda --name $(PROJECT_NAME) python=3 --file=requirements.txt
 else
-	conda create --name $(PROJECT_NAME) python=2.7 --file=requirements.txt
+	@echo "It looks like you are not using python 3. Autometa is only compatible with python 3. Please upgrade."
 endif
 	@echo ">>> New conda env created. Activate with:\nsource activate $(PROJECT_NAME)"
 else
@@ -56,12 +56,12 @@ endif
 #################################################################################
 
 ## Install autometa from source
-install:
-	$(PYTHON_INTERPRETER) setup.py install
+install: setup.py
+	python setup.py install
 
 ## Install dependencies for test environment
 test_environment: tests/requirements.txt
-	$(PYTHON_INTERPRETER) -m pip install --requirement=tests/requirements.txt
+	python -m pip install --requirement=tests/requirements.txt
 
 ## Build docker image from Dockerfile (auto-taggged as jason-c-kwan/autometa:<current-branch>)
 image: Dockerfile
@@ -78,19 +78,19 @@ unit_test_data_download:
 
 ## Build test_data.json file for unit testing (requires all files from https://drive.google.com/open?id=189C6do0Xw-X813gspsafR9r8m-YfbhTS be downloaded into tests/data/)
 unit_test_data_build: tests/data/records.fna
-	$(PYTHON_INTERPRETER) make_test_data.py
+	python make_test_data.py
 
 ## Run all unit tests
 unit_test: tests/data/test_data.json test_environment
-	$(PYTHON_INTERPRETER) -m pytest --durations=0 --cov=autometa --emoji --cov-report=html tests
+	python -m pytest --durations=0 --cov=autometa --emoji --cov-report=html tests
 
 ## Run unit tests marked with WIP
 unit_test_wip: tests/data/test_data.json test_environment
-	$(PYTHON_INTERPRETER) -m pytest -m "wip" --durations=0 --cov=autometa --emoji --cov-report=html tests
+	python -m pytest -m "wip" --durations=0 --cov=autometa --emoji --cov-report=html tests
 
 ## Run unit tests marked with entrypoint
 unit_test_entrypoints: tests/data/test_data.json test_environment
-	$(PYTHON_INTERPRETER) -m pytest -m "entrypoint" --durations=0 --cov=autometa --emoji --cov-report=html tests
+	python -m pytest -m "entrypoint" --durations=0 --cov=autometa --emoji --cov-report=html tests
 
 
 #################################################################################
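
Note on the `clean` change in this Makefile: `find ... -delete` removes a directory only when it is already empty, so the old rules could not remove a populated `__pycache__` or `build` directory. Switching to `-exec rm -r {} +` removes the whole tree. The same distinction exists in Python between `Path.rmdir()` and `shutil.rmtree()`. The sketch below illustrates that difference only; the helper `remove_dirs_named` is hypothetical and is not part of Autometa.

```python
import shutil
import tempfile
from pathlib import Path


def remove_dirs_named(root: Path, name: str) -> int:
    """Recursively remove every directory called `name` under `root`.

    Returns the number of directory trees removed.
    """
    # Collect matches first so we never iterate over a tree while deleting it.
    targets = [p for p in root.rglob(name) if p.is_dir()]
    removed = 0
    for path in targets:
        if path.exists():  # may already be gone if nested inside an earlier match
            shutil.rmtree(path)
            removed += 1
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "pkg" / "__pycache__"
        cache.mkdir(parents=True)
        (cache / "mod.cpython-38.pyc").write_bytes(b"")

        # rmdir() refuses non-empty directories, much like `find ... -delete`.
        try:
            cache.rmdir()
        except OSError as err:
            print(f"rmdir failed: {type(err).__name__}")

        # rmtree() removes the directory and its contents, like `rm -r`.
        print(remove_dirs_named(root, "__pycache__"), cache.exists())
```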

diff --git a/autometa.mf b/autometa.mf
diff --git a/autometa.py b/autometa.py