Nextflow implementation template (#118)
* 🎨 Add entrypoints for taxon assignment workflow
🎨 Add Autometa nextflow implementation template.
🎨 Update majority_vote.py parameters to more easily construct taxon assignment workflow.

* 🎨 Comment out container directives
🎨🔥 Update optional arguments to mandatory arguments (metagenome, interim, processed)
🎨 Prefix output files with metagenome.simpleName in their respective output directories.
🎨 Name main workflow AUTOMETA and call with channel
🔥 Remove handling of coverage outside of SPAdes assembly (TODO: Incorporate separate COVERAGE workflow to pass into AUTOMETA)
🐛 fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers)

* 🎨 Add nextflow config with slurm executor configuration and nextflow project details

* 🐛 Add end of file newline

* 🐛 Add missing line continuation in MARKERS command.

* 🐛 Fix incorrect keyword argument in lca.py main call
🐛 Fix incorrect flag  in  entrypoint (MARKERS process)

* 🎨 Keep hmmscan output file in MARKERS

* Update gitignore with paths to ignore nextflow generated files

* 🐛 Fix broken paths in SPLIT_KINGDOMS
🎨 Add parameter '--outdir' to autometa-taxonomy entrypoint.

* 🐛 Fix missing line continuation in BINNING

* 🎨 Update output paths so only binning results are in processed directory
🎨 Add completeness and purity parameters to autometa.nf

* 🎨 Add completeness and purity parameters to log at beginning of run

* 🐛 Handle case where archaea are not recovered from metagenome

* 🎨 Add config file for autometa input parameters
🔥 Remove copy mode from all publishDir settings for all processes in autometa workflow
🎨 Update autometa.taxonomy.vote entrypoint parameters
💚 Update mocked args to be compatible with new autometa.taxonomy.vote parameters
🎨 Add type hints to ncbi.py
🔥 Remove most of redundant logic from vote.py s.t. entrypoint now is only responsible for adding canonical ranks to voted taxids and writing out ranks split by provided rank
🎨🔥 Remove hardcoded parameters and add additional parameters to allow user finer control of entire autometa workflow
🎨 Add HTCondor executor profile with comments

* 💚🐛🔥 Remove keyword argument 'out' from vote.add_ranks(...) func

* 🎨 Add params.cpus to initial info log

* 🔥🐛🎨 Remove unnecessary autometa prodigal wrapper.
🔥 Removes GNU parallel functionality from ORFs process. This was removed because the number of ORF sequences recovered using GNU parallel was non-deterministic
This will take a hit on performance as a trade-off for determinism.

* 🎨 Update nextflow scripts to use jason-c-kwan/autometa:dev docker image
🎨 Add dockerignore to prevent unnecessary context and image bloat.
🔥 Remove Makeflow autometa template
🎨 Move autometa.nf containing AUTOMETA workflow to nextflow directory
⬆️ Add minimum pandas version of 1.1.
📝 Update link to references in normalize(...) func in kmers.py
🎨 Update parameters.config to reflect updated nextflow parameters
🎨 Update Dockerfile with entrypoint checks autometa-taxonomy-lca and autometa-taxonomy-majority-vote
🎨 Add main.nf for use with manifest as a pre-requisite for nextflow pipeline sharing through GitHub.
🎨 Update manifest in nextflow.config to reflect change in mainScript
🎨 Add fixOwnership to docker scope in nextflow.config

* 🎨 Update manifest with 'doi' and 'defaultBranch'

* 🎨 Update arguments for entrypoints autometa-binning and autometa-unclustered-recruitment
🎨 Propagate these argument changes to nextflow processes
💚 Update tests to accommodate updated arguments

* 🔥 Remove unused/unnecessary configuration scripts
🎨 Move code in config/__init__.py to config/utilities.py and update respective imports to point to this file
🎨 Split autometa-configure entrypoint into two entrypoints autometa-config and autometa-update-databases
🐛 Change default markers directory to look inside default.config instead of source directory
🔥 Remove __main__.py and autometa.py wrapper to __main__.py in exchange for using nextflow files.
⬆️ Add diamond to requirements.txt
🐛 Modify config to point to autometa/databases after installation in Docker build
🎨📝 Add typehints across config scripts

* 🎨 Apply black formatting

* ✅🎨 Update call to parse_args from config.parse_args(...) to config.utilities.parse_args(...)

* ✅🐛 Update config.parse_args(...) to autometa.config.utilities.parse_args(...)

* ✅ Alias config.utilities imports to configutils. Provides access to parse_args attribute while avoiding confusion with autometa.common.utilities functions

* 🎨 Update default databases retrieval logic
🐛 Remove issue of redundant executable versions being written in default.config
🐛 Fix automatically updating autometa home_dir configuration in default.config
🎨 Add exception handling in parse_argparse.py to provide more debugging information

* ✅📝 Fix error when parsing databases argparse.
🎨 Remove any indentation in written argparse blocks for retrieving argparse usage

* 🎨 add EOF line in dockerignore

* 🐛 Fix default path to markers database in MARKERS process

* 🐛 Fix incorrect option when attempting to download missing ncbi files

* 🐛 Fix clean command in Makefile so it actually removes provided directories

* 🎨 replace only first ftp in ncbi ftp filepaths

* 🎨 Remove orfs filepath dependency in LCA and majority vote
🎨 Change entrypoint arguments for autometa-taxonomy-lca and autometa-taxonomy-majority-vote

* 🎨 Changed entrypoint parameters for autometa-length-filter.
🔥 Remove unused methods in metagenome.py
🎨✅ Remove unused tests in test_metagenome. Update MockedParser to reflect new entrypoint args
🎨 Update nextflow LENGTH_FILTER process to accommodate new parameters. Now uses named emits (fasta, stats, gc_content)
🎨📝 Add new binning metrics into parameters.config (gc_stddev_limit,cov_stddev_limit)
📝🎨 Add type hints into metagenome.py

* 📝 Update log with added parameters

* 🐛 Fix incorrect path to default markers database in nf pipeline (location in docker image is currently hardcoded in MARKERS process).
🎨 Next step is for default to point to absolute path in docker image instead of relative path

* 🔥 Remove --dbdir hardcoded parameter in MARKERS process.
This is now being appropriately configured in the docker image that is utilized by nextflow
🐛 Add conda channels conda-forge and bioconda to create_environment command
🎨 Update Dockerfile to configure autometa databases with the DB_DIR environment variable as an absolute path (relative path may cause bugs)

* Update autometa/common/metagenome.py

* 🐛 replace 'orfs' tags with the respective single input path tag

* 🐛🔥 Remove --multiprocess flag from autometa-kmers command in KMERS process

* 🔥 Remove duplicate dependencies

* 🐛 Fix cryptic bug where imports do not work when explicit python interpreter is used in Makefile commands
🎨 Add functionality to handle gzipped ORFs for autometa-markers entrypoint

* 🔥 Remove Makefile from .dockerignore
🎨 Use make commands from the Makefile for autometa directory cleanup and install
🐛⬆️ Set samtools minimum version in requirements.txt. Otherwise samtools command would not work properly

* 🎨 Change --output parameter to --output-binning in recursive_dbscan.py
🎨 Add '--output-master' parameter to autometa-binning entrypoint
✅ Update MockArgs to account for updated entrypoint parameters
✅🎨 Add args check to autometa-binning entrypoint for embed_dimensions and embed_pca_dimensions inputs
🎨 Fix typo in kmers embed docstring
🎨 Standardize output columns from kmers.embed(...) to 1-indexed 'x_1' to 'x_{embed_dimensions}' instead of x,y,z...
🐛 Add coverage and gc_content std.dev. limits to drop columns in run_hdbscan(...)
🎨 Column drops in run_hdbscan(...) and run_dbscan(...) are now performed on one line; if the df does not contain any of the columns in dropcols, the error is ignored

* 🔥 Remove conda install using py2.7
🔥🎨 Rename references from master to main throughout nf and autometa binning scripts
📝 Format notes in parameters.config

* ⬆️ Add minimum version of diamond 2.*
💚 Add output_main to MockedArgs

* 📝🎨 Add copyright and short script description to all unit test files

* 🎨 Add autometa-parse-bed entrypoint
🎨 Add READ_COVERAGE workflow in common-tasks to compute coverage from read alignments instead of SPAdes headers

* 📝 Replace 2020 copyright with 2021 copyright
📝🔥 Remove note on ORF calling warning and replace with contig cutoff warning
📝 Update help text for --binning argument in unclustered_recruitment

* 🔥 Remove --do-pca argument from kmers.py
📝 Fix help string in --norm-method in kmers.py
🎨 Change --normalized to --norm-output in kmers.py
🎨 Change --embedded to --embedding-output in kmers.py
🎨 Change --embed-dimensions to --embedding-dimensions in kmers.py
🎨 Change --embed-method to --embedding-method in kmers.py
🎨 Update KMERS in common-tasks.nf to account for updated parameters
💚 Update test_kmers.py MockedArgs to account for updated arguments

* 🔥💚 Remove references to removed do_pca parameter
🐛 Update marker databases checksums so they correspond to md5sum
🎨 sort main file output columns in autometa-binning entrypoint

* 🔥🎨 Remove 'string' metavar for clustering-method arg

* 🔥 Remove kmer embedding args from autometa-binning entrypoint
🎨 Change KMERS.out.normalized as input for binning to KMERS.out.embedded
💚 Update test_recursive_dbscan kmers fixture and mocked args to account for removed kmer parameters
🎨 Add convert_dtypes method call to load(...) func for markers dataframe
🔥🎨 Remove parameters for kmers in binning-tasks and update parameters to correspond to kmers args
🎨 unclustered recruitment now writes output-binning with contig, cluster and recruited_cluster columns

* 🎨 Add autometa-binning-summary entrypoint
🎨 unclustered recruitment now writes out binning with columns 'cluster' and 'recruited_cluster'
🐛💚 Fix duplicate mocks in test_recursive_dbscan(...)
🎨 Add BINNING_SUMMARY process in autometa.nf workflow
🎨 Define BINNING_SUMMARY process in binning-tasks.nf

* 💚🐛 Change broken variable main to main_df

* 💚🔥 Remove kmer embedding dimensions test

* 🐛🔥 Remove assembly argument in get_metabin_stats(...)
💚🔥 Remove unused mocked dependencies in test_kmers.py
🔥💚 Remove tests corresponding to old summary.py functionality

* 💚 Add gc_content column to bin_df fixture in test_summary

* 📝 Add docstrings and explanation within vote.py
🎨 Change vote.py argument from --input to --votes and add metavars to parser args
💚 Change make_test_data.py summary data to create gc_content column instead of GC column
💚 Update MockedArgs in vote.py to correspond to updated --votes parameter
🎨 Replace --input argument in autometa-taxonomy for SPLIT_KINGDOMS process to --votes

* 🐛 Fix arg passed in pd.read_csv(...) for autometa.taxonomy.vote

* 🐎 Add autometa/databases to dockerignore

* 🎨 Update autometa-orfs entrypoint arguments
📝 Add type hints to autometa.common.external.prodigal funcs
🔥🎨 Remove --parallel parameter from autometa-orfs. Parallel is now inferred from --cpus arg

* 🐎 ignore the ignore for autometa/databases/markers
Add test of autometa-binning-summary entrypoint

* 🐛 Replace incorrect variable (orfs) in BINNING_SUMMARY tag

* 📝 Replace old kmer parameters in log info with new parameters
evanroyrees committed Apr 17, 2021
1 parent 5cccf75 commit 50f7a60
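
The commit message states that `kmers.embed(...)` now emits 1-indexed columns named 'x_1' to 'x_{embed_dimensions}' instead of x, y, z. A minimal sketch of that naming scheme follows. The helper name `embedding_columns` is hypothetical and is not taken from the Autometa source; it only illustrates the convention described above.

```python
from typing import List


def embedding_columns(embed_dimensions: int) -> List[str]:
    """Return 1-indexed embedding column names: x_1 ... x_{embed_dimensions}.

    Hypothetical helper for illustration only (not Autometa's actual code).
    """
    if embed_dimensions < 1:
        raise ValueError("embed_dimensions must be at least 1")
    # range(1, n + 1) makes the labels 1-indexed rather than 0-indexed
    return [f"x_{i}" for i in range(1, embed_dimensions + 1)]


if __name__ == "__main__":
    print(embedding_columns(3))
```

Numbered names scale to any number of dimensions, whereas letter names such as x, y, z stop being meaningful beyond three.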
Showing 57 changed files with 2,217 additions and 2,745 deletions.
22 changes: 22 additions & 0 deletions .dockerignore
@@ -0,0 +1,22 @@
+# Ignore some root directory files unnecessarily expanding image (and context) size
+.git
+docs
+autometa.mf
+MANIFEST.in
+LICENSE.txt
+meta.yaml
+# Ignore tests related files
+pytest.ini
+tests
+make_test_data.py
+# Ignore nextflow related files
+autometa.nf
+nextflow
+nextflow.config
+pipeline_info
+# Ignore any files/directories built from source
+dist
+Autometa.egg-info
+# Ignore databases
+autometa/databases
+!autometa/databases/markers
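
The last two `.dockerignore` lines exclude `autometa/databases` and then re-include `autometa/databases/markers` (the commit message calls this "ignore the ignore"). The sketch below is a deliberately simplified model of the "last matching pattern wins" rule, not Docker's actual matcher. The function name `is_ignored` and the example path under `ncbi/` are assumptions made for illustration.

```python
from fnmatch import fnmatch
from typing import Iterable


def is_ignored(path: str, patterns: Iterable[str]) -> bool:
    """Simplified model of .dockerignore: the last matching pattern decides.

    Assumption: a pattern matches a path if it equals the path, is a parent
    directory of it, or matches it as a glob. Docker's real matcher has more
    rules than this sketch covers.
    """
    ignored = False
    for raw in patterns:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue  # skip blank lines and comments
        negated = pattern.startswith("!")
        if negated:
            pattern = pattern[1:]
        pattern = pattern.rstrip("/")
        matches = (
            path == pattern
            or path.startswith(pattern + "/")
            or fnmatch(path, pattern)
        )
        if matches:
            # A '!' pattern re-includes the path; a plain pattern excludes it
            ignored = not negated
    return ignored


PATTERNS = ["autometa/databases", "!autometa/databases/markers"]

if __name__ == "__main__":
    for candidate in (
        "autometa/databases/ncbi/nodes.dmp",  # hypothetical path
        "autometa/databases/markers/bacteria.single_copy.hmm",
    ):
        print(candidate, "->", "ignored" if is_ignored(candidate, PATTERNS) else "kept")
```

Under this model the marker HMM files stay in the build context, which the Dockerfile needs because it runs `hmmpress` on them.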
6 changes: 6 additions & 0 deletions .gitignore
@@ -154,3 +154,9 @@ tests/data/*

 # Mac
 .DS_Store
+
+# nextflow
+.nextflow
+.nextflow.log*
+pipeline_info
+work
21 changes: 16 additions & 5 deletions Dockerfile
@@ -20,7 +20,7 @@ LABEL maintainer="jason.kwan@wisc.edu"
 # along with Autometa. If not, see <http://www.gnu.org/licenses/>.

 RUN apt-get update \
-&& apt install -y procps g++ \
+&& apt-get install -y procps \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

@@ -29,21 +29,32 @@ RUN conda install -c bioconda -c conda-forge --file=requirements.txt \
 && conda clean --all -y

 COPY . .
-RUN python setup.py install
+RUN make install && make clean

-RUN hmmpress autometa/databases/markers/bacteria.single_copy.hmm \
-&& hmmpress autometa/databases/markers/archaea.single_copy.hmm
+# NOTE: DB_DIR must be an absolute path (not a relative path)
+ENV DB_DIR="/scratch/dbs"
+RUN hmmpress -f autometa/databases/markers/bacteria.single_copy.hmm \
+&& hmmpress -f autometa/databases/markers/archaea.single_copy.hmm \
+&& mkdir -p $DB_DIR \
+&& mv autometa/databases/* ${DB_DIR}/. \
+&& autometa-config --section databases --option base --value ${DB_DIR} \
+&& echo "databases base directory set in ${DB_DIR}/"

 RUN echo "Testing autometa import" \
 && python -c "import autometa"

 # Check entrypoints are available
 RUN echo "Checking autometa entrypoints" \
+&& autometa-config -h > /dev/null \
+&& autometa-update-databases -h > /dev/null \
 && autometa-length-filter -h > /dev/null \
 && autometa-orfs -h > /dev/null \
 && autometa-coverage -h > /dev/null \
 && autometa-kmers -h > /dev/null \
 && autometa-markers -h > /dev/null \
 && autometa-taxonomy -h > /dev/null \
+&& autometa-taxonomy-lca -h > /dev/null \
+&& autometa-taxonomy-majority-vote -h > /dev/null \
 && autometa-binning -h > /dev/null \
-&& autometa-unclustered-recruitment -h > /dev/null
+&& autometa-unclustered-recruitment -h > /dev/null \
+&& autometa-binning-summary -h > /dev/null
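
The final `RUN` step in the Dockerfile checks that every entrypoint responds to `-h`. A rough Python counterpart is sketched below for checking an installation outside Docker. The entrypoint names are copied from the Dockerfile above; the function name `missing_entrypoints` is hypothetical. This sketch only confirms that each command is on `PATH`. It does not run the command, so it is a weaker check than the `-h` calls in the image build.

```python
import shutil
from typing import Iterable, List

# Entrypoint names as listed in the Dockerfile check above
ENTRYPOINTS = [
    "autometa-config",
    "autometa-update-databases",
    "autometa-length-filter",
    "autometa-orfs",
    "autometa-coverage",
    "autometa-kmers",
    "autometa-markers",
    "autometa-taxonomy",
    "autometa-taxonomy-lca",
    "autometa-taxonomy-majority-vote",
    "autometa-binning",
    "autometa-unclustered-recruitment",
    "autometa-binning-summary",
]


def missing_entrypoints(names: Iterable[str]) -> List[str]:
    """Return the commands from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_entrypoints(ENTRYPOINTS)
    if missing:
        print("Missing entrypoints:", ", ".join(missing))
    else:
        print("All autometa entrypoints found on PATH")
```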
30 changes: 15 additions & 15 deletions Makefile
@@ -22,12 +22,12 @@ endif

 ## Delete all compiled Python files
 clean:
-find . -type f -name "*.py[co]" -delete
-find . -type d -name "__pycache__" -delete
-find . -type d -name "htmlcov" -delete
-find . -type d -name "Autometa.egg-info" -delete
-find . -type d -name "dist" -delete
-find . -type d -name "build" -delete
+find . -type f -name "*.py[co]" -exec rm -r {} +
+find . -type d -name "__pycache__" -exec rm -r {} +
+find . -type d -name "htmlcov" -exec rm -r {} +
+find . -type d -name "Autometa.egg-info" -exec rm -r {} +
+find . -type d -name "dist" -exec rm -r {} +
+find . -type d -name "build" -exec rm -r {} +

 ## Apply black formatting
 black:
@@ -38,9 +38,9 @@ create_environment: requirements.txt
 ifeq (True,$(HAS_CONDA))
 @echo ">>> Detected conda, creating conda environment."
 ifeq (3,$(findstring 3,$(PYTHON_INTERPRETER)))
-conda create --name $(PROJECT_NAME) python=3 --file=requirements.txt
+conda create -c conda-forge -c bioconda --name $(PROJECT_NAME) python=3 --file=requirements.txt
 else
-conda create --name $(PROJECT_NAME) python=2.7 --file=requirements.txt
+@echo "It looks like you are not using python 3. Autometa is only compatible with python 3. Please upgrade."
 endif
 @echo ">>> New conda env created. Activate with:\nsource activate $(PROJECT_NAME)"
 else
@@ -56,12 +56,12 @@ endif
 #################################################################################

 ## Install autometa from source
-install:
-$(PYTHON_INTERPRETER) setup.py install
+install: setup.py
+python setup.py install

 ## Install dependencies for test environment
 test_environment: tests/requirements.txt
-$(PYTHON_INTERPRETER) -m pip install --requirement=tests/requirements.txt
+python -m pip install --requirement=tests/requirements.txt

 ## Build docker image from Dockerfile (auto-taggged as jason-c-kwan/autometa:<current-branch>)
 image: Dockerfile
@@ -78,19 +78,19 @@ unit_test_data_download:

 ## Build test_data.json file for unit testing (requires all files from https://drive.google.com/open?id=189C6do0Xw-X813gspsafR9r8m-YfbhTS be downloaded into tests/data/)
 unit_test_data_build: tests/data/records.fna
-$(PYTHON_INTERPRETER) make_test_data.py
+python make_test_data.py

 ## Run all unit tests
 unit_test: tests/data/test_data.json test_environment
-$(PYTHON_INTERPRETER) -m pytest --durations=0 --cov=autometa --emoji --cov-report=html tests
+python -m pytest --durations=0 --cov=autometa --emoji --cov-report=html tests

 ## Run unit tests marked with WIP
 unit_test_wip: tests/data/test_data.json test_environment
-$(PYTHON_INTERPRETER) -m pytest -m "wip" --durations=0 --cov=autometa --emoji --cov-report=html tests
+python -m pytest -m "wip" --durations=0 --cov=autometa --emoji --cov-report=html tests

 ## Run unit tests marked with entrypoint
 unit_test_entrypoints: tests/data/test_data.json test_environment
-$(PYTHON_INTERPRETER) -m pytest -m "entrypoint" --durations=0 --cov=autometa --emoji --cov-report=html tests
+python -m pytest -m "entrypoint" --durations=0 --cov=autometa --emoji --cov-report=html tests


 #################################################################################
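
The `clean` target changes from `find ... -delete` to `find ... -exec rm -r {} +`, which matches the commit note that the old command did not actually remove the directories. A likely reason is that `-delete` on a directory behaves like `rmdir` and fails when the directory still has contents. The sketch below shows the same distinction in Python, where `os.rmdir` refuses a non-empty directory and `shutil.rmtree` removes it recursively. The function name `clean` and the temporary layout are assumptions for illustration, not part of the Autometa Makefile.

```python
import os
import shutil
import tempfile
from typing import Iterable, List

BUILD_DIRS = ("__pycache__", "htmlcov", "Autometa.egg-info", "dist", "build")


def clean(root: str, dirnames: Iterable[str] = BUILD_DIRS) -> List[str]:
    """Recursively remove every directory under `root` whose name is in `dirnames`."""
    targets = set(dirnames)
    removed = []
    for current, subdirs, _files in os.walk(root):
        for name in list(subdirs):
            if name in targets:
                path = os.path.join(current, name)
                shutil.rmtree(path)   # recursive, like `rm -r`
                subdirs.remove(name)  # do not descend into the removed directory
                removed.append(path)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "pkg", "__pycache__")
        os.makedirs(cache)
        open(os.path.join(cache, "module.pyc"), "w").close()
        try:
            os.rmdir(cache)  # rmdir semantics: fails on a non-empty directory
        except OSError as err:
            print("rmdir failed:", err.__class__.__name__)
        print("removed:", [os.path.relpath(p, root) for p in clean(root)])
```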
43 changes: 0 additions & 43 deletions autometa.mf

This file was deleted.

31 changes: 0 additions & 31 deletions autometa.py

This file was deleted.
