Update documentation #121
📝 Updated install for version 2
🔥 Remove Rest API
🎨 Add docs/source/_build to .gitignore
📝 Update autometa install guidelines. Added docker to it.
📝 Add benchmarking page
📝 Add Automappa to examining results
🎨 Replaced shell with bash in parse_argparse.py
📝 Add packages to install for developers in contributing guidelines
Codecov Report
@@           Coverage Diff           @@
##              dev     #121   +/-   ##
=======================================
  Coverage   48.67%   48.67%
=======================================
  Files          22       22
  Lines        3022     3022
=======================================
  Hits         1471     1471
  Misses       1551     1551
⬇️ Remove dependency on sphinx.ext.paramout
I ran into so many problems here. My approach was to go through the documentation (at least "running autometa") and try out the commands. Details are in the comments. I would recommend writing it as a walkthrough using one of the simulated datasets as an example and, as you write it, actually running the commands yourself (with copy/paste). This will expose any errors in the commands and also errors in the data files (some of which I think I uncovered). Users will be really disheartened if they can't even complete the example in the documentation, which, frankly, is how I felt.
docs/source/install.rst
Outdated
requests \
umap-learn \
hdbscan
Conda installation
Can you add instructions on how to install from source? I feel this is something I would like to refer to myself when testing various branches etc., and people will also want this if they want to run the dev branch.
Also might be useful to have a section on building the docs
(for contributors perhaps - could go under contributing guidelines)
Just to add - the building from source part (with dependencies) will be doubly useful until the conda instructions here actually work.
For developers
==============

If you are wanting to help develop autometa, you will need these additional dependencies:
Please double check that these instructions work. I was just having a lot of trouble downloading Autometa into a fresh Docker container and then getting, for example, unit tests to work. See my comments on PR #120
docs/source/running-autometa.rst
Outdated
#. Accept clusters that are estimated to be over 20% complete and 90% pure based on single-copy marker genes. These are default papameteres and can be altered to suit your needs.
#. Unclustered contigs leftover will be re-clustered until no more acceptable clusters are yielded

If you include a taxonomy table in the, Autometa will attempt to further partition the data based on ascending taxonomic specificity (i.e. in the order phylum, class, order, family, genus, species) when clustering unclustered contigs from a previous attempt. We found that this is mainly useful if you have a highly complex metagenome (lots of species), or you have several related species at similar coverage level.
In the what? In the command?
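To make the acceptance rule in the quoted hunk concrete, here is a minimal Python sketch. It is not Autometa's implementation: the `Cluster` class and `accept_clusters` function are hypothetical names, the default thresholds are simply the 20% and 90% figures from the quoted text, and treating both thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical container for one cluster's marker-based estimates."""

    name: str
    completeness: float  # percent, 0-100
    purity: float  # percent, 0-100


def accept_clusters(clusters, min_completeness=20.0, min_purity=90.0):
    """Split clusters into (accepted, rejected) using both thresholds.

    Assumption: a cluster sitting exactly on a threshold is accepted.
    """
    accepted, rejected = [], []
    for cluster in clusters:
        passes = (
            cluster.completeness >= min_completeness
            and cluster.purity >= min_purity
        )
        (accepted if passes else rejected).append(cluster)
    return accepted, rejected


if __name__ == "__main__":
    example = [
        Cluster("bin_1", completeness=95.0, purity=99.0),
        Cluster("bin_2", completeness=10.0, purity=99.0),
        Cluster("bin_3", completeness=50.0, purity=80.0),
    ]
    kept, leftover = accept_clusters(example)
    print([c.name for c in kept])      # ['bin_1']
    print([c.name for c in leftover])  # ['bin_2', 'bin_3']
```

In the workflow the quoted text describes, the contigs of the rejected clusters would go back into the next clustering round; that loop is left out of the sketch.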
docs/source/running-autometa.rst
Outdated
.. code-block:: bash

    # Archaeal binning
    autometa-binning <path/to/kmers_norm.tsv> \
        <path/to/coverage_table.tsv> <path/to/archaea.markers.tsv> \
        <path/to/archaea_binning.tsv> --embedded-kmers <path/to/embedded_kmers.tsv> \
        --taxonomy <path/to/taxonomy.tsv> --clustering-method <dbscan or hdbscan> --domain archaea

    # Bacterial binning
    autometa-binning <path/to/kmers_norm.tsv> \
        <path/to/coverage_table.tsv> <path/to/bacterial.markers.tsv> \
        <path/to/bacteria_binning.tsv> --embedded-kmers <path/to/embedded_kmers.tsv> \
        --taxonomy <path/to/taxonomy.tsv> --clustering-method <dbscan or hdbscan> --domain bacteria
I initially couldn't run this (see Issue #133)
Evan commented on Slack and thinks the command is wrong:
Looks like you’ve provided the kmers table where you should have provided the markers table. The sacc column is from the markers table
The kmer embedding is loaded already, so if you’ve copied and pasted from the docs, then the docs have the input paths wrong. Otherwise, this is not a bug: the kmers table is loaded in, and then loading the markers table fails because the incorrect file was provided.
Apparently the positional arguments have to be after the optional ones.
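A small pre-flight check would have turned the failure discussed above into a clearer message. The sketch below is hypothetical and not part of Autometa: `missing_columns` is an invented helper, and the only column name taken from this thread is `sacc`, which Evan says belongs to the markers table. The other headers in the usage lines are placeholders.

```python
import csv
import io


def missing_columns(handle, required):
    """Return the required column names absent from a TSV header row.

    `handle` is any open text file (or file-like object) positioned at
    the start of a tab-delimited table.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [column for column in required if column not in header]


if __name__ == "__main__":
    # Placeholder tables standing in for real input files.
    markers_like = io.StringIO("contig\tsacc\ncontig_1\tPF00001\n")
    kmers_like = io.StringIO("contig\tAAAAA\tAAAAC\ncontig_1\t3\t7\n")

    print(missing_columns(markers_like, ["sacc"]))  # []
    print(missing_columns(kmers_like, ["sacc"]))    # ['sacc']
```

Running a check like this on the file passed in the markers position would report the missing `sacc` column up front, which points at swapped input paths rather than a bug in the binning code.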
docs/source/running-autometa.rst
Outdated
.. code-block:: bash

    # Archaea
    autometa-unclustered-recruitment <path/to/kmers_norm.tsv> \
        <path/to/coverage_table.tsv> <path/to/archaea_binning.tsv> \
        <path/to/archaea.markers.tsv> <path/to/arachaea_unclustered_recruitment.tsv> \
        --taxonomy <path/to/taxonomy.tsv> --classifier decision_tree

    # Bacteria
    autometa-unclustered-recruitment <path/to/kmers_norm.tsv> \
        <path/to/coverage_table.tsv> <path/to/bacteria_binning.tsv> \
        <path/to/bacteria.markers.tsv> <path/to/bacteria_unclustered_recruitment.tsv> \
        --taxonomy <path/to/taxonomy.tsv> --classifier decision_tree
I didn't even get to trying this - I just didn't have much hope after the last five or so commands failed.
Just a note, pre-commit hooks should also be addressed in contributing. Not just |
@@ -5,3 +5,19 @@ Contributing Guidelines
"Autometa is an open-source project developed on
GitHub. If you would like to help develop Autometa or
have ideas for new features please see our `contributing guidelines <https://github.com/KwanLab/Autometa/blob/master/.github/CONTRIBUTING.md>`__
Note: We have two different versions of CONTRIBUTING.md. The dev branch refers to using black formatting whereas the master branch does not.
docs/source/examining-results.rst
Outdated
library("ggplot2")
fpath="master.tsv"
data = read.table(fpath, header=TRUE, sep='\t')
ggplot(data, aes(x=x, y=y, color=cluster, group=cluster)) + geom_point(size=(sqrt(data$length))/100, shape=20, alpha=0.5) + theme_classic() + xlab('BH-tSNE X') + ylab('BH-tSNE Y')
Perhaps this should be formatted with line breaks?
docs/source/examining-results.rst
Outdated
library("ggplot2")
fpath="master.tsv"
ggplot(data, aes(x=x, y=y, color=phylum, group=phylum)) + geom_point(size=(sqrt(data$length))/100, shape=20, alpha=0.5) + theme_classic() + xlab('BH-tSNE X') + ylab('BH-tSNE Y')
Perhaps this should be formatted with line breaks?
docs/source/install.rst
Outdated
#. Create a new environment ``conda create -n autometa "python>=3.7"``
#. Install autometa ``conda install -c conda-forge -c bioconda autometa --yes``
#. Actiavate autometa envoirnonment ``conda activate autometa``
NOTE: We have requirements.txt in our root directory. So the environment can be created and installed with one command:

Install from source (using make)

    cd $HOME
    git clone https://github.com/KwanLab/Autometa.git
    cd $HOME/Autometa
    # create autometa conda environment
    make create_environment
    # activate env
    conda activate autometa
    # install autometa source code
    make install

Install from source (full commands)

    cd $HOME
    git clone https://github.com/KwanLab/Autometa.git
    cd $HOME/Autometa
    # Construct the environment from the listed requirements.
    conda create -n autometa --file=requirements.txt
    # Install the autometa code base from source
    python setup.py install
Build the Docker image from source

Also note, you can build a Docker image from your clone of the Autometa repo.

    cd $HOME
    git clone https://github.com/KwanLab/Autometa.git
    cd $HOME/Autometa
    # This will tag the image as jason-c-kwan/autometa:<your current branch>
    make image
docs/source/install.rst
Outdated
You can also run Autometa using a prebuild Docker image.

#. Install Docker_
#. Run the following commands

.. code-block:: bash

    git clone https://github.com/KwanLab/Autometa.git
    docker pull jasonkwan/autometa:latest
- There is no need to clone the repository if the user just wants to use the Docker image.
- There is also no need to run docker pull, as Docker recognizes when an image is not available locally.

The user should be able to just run

    docker run jason-c-kwan/autometa:dev
    docker run jason-c-kwan/autometa:latest
    # ... whatever other tags we plan on having available

and Docker will retrieve the image if they do not already have it.
docs/source/install.rst
Outdated
python autometa.py --check-dependencies

``git clone https://github.com/KwanLab/Autometa.git``
# If any of the checks return False, you can check which failed using
python autometa.py --check-dependencies --debug
🐛 These commands are no longer available
docs/source/install.rst
Outdated
# Install packages for testing
conda install -n autometa -c conda-forge \
    black pre_commit pytest pytest-cov pytest-html pytest-repeat pytest-variables gdown --yes
Not sure the --yes should be here.
@@ -0,0 +1,9 @@
Community,Num. Genomes,Num. Control Sequences
📝 Add section on using these links. Or simply make some of these commands available to the user.
See issue-#110 (assigned @ajlail98). May be worth a discussion b/w @Sidduppal and @ajlail98.

How to download simulated communities

Example for the 78Mbp community

    gdown --id 15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y

You can get the file ID by navigating to any of the files, right clicking, and selecting the "get link" option. This will show a "copy link" button that you should use.
The link will look like this: https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing
The file ID is the part between file/d/ and the next forward slash, e.g.

    # Pasted from copy link button:
    https://drive.google.com/file/d/15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing
    # begin file ID                 ^ ------------------------------^ end file ID

Now that we have the file ID, you can specify the ID or use the drive.google.com prefix

    file_id="15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y"
    gdown https://drive.google.com/uc?id=${file_id} -O metagenome.fna.gz
    # or
    gdown --id 15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y

Either should work
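If the docs end up listing share links rather than bare IDs, the extraction step could be scripted. The following is a rough sketch, not something that exists in Autometa: `drive_file_id` and `gdown_url` are hypothetical helper names, and the pattern assumes links of the `file/d/<id>/` shape shown in this comment.

```python
import re

# Matches the path segment that follows "file/d/" in a Google Drive share link.
FILE_ID_PATTERN = re.compile(r"/file/d/([^/]+)/")


def drive_file_id(share_url):
    """Extract the file ID from a link of the form .../file/d/<id>/view?..."""
    match = FILE_ID_PATTERN.search(share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    return match.group(1)


def gdown_url(file_id):
    """Build the uc?id= style URL used with gdown in the comment above."""
    return f"https://drive.google.com/uc?id={file_id}"


if __name__ == "__main__":
    link = (
        "https://drive.google.com/file/d/"
        "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y/view?usp=sharing"
    )
    file_id = drive_file_id(link)
    print(file_id)             # 15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y
    print(gdown_url(file_id))  # https://drive.google.com/uc?id=15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y
```

The resulting URL can then be passed to gdown as in the commands above. Links in any other shape raise a `ValueError` instead of silently returning a wrong ID.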
I haven't gone through the entire documentation, but there are a number of commands that need to be updated. I've also placed some replies to help fill out some of the docs. Also, #120 (comment) should likely be added within the "unit tests" section
Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
@WiscEvan If you eventually add that comment to the docs, you may want to change the |
📝 Update install using Makefile
⬆️ Create a new file for step-by-step instructions on how to run Autometa
📝 Update benchmarking to add steps on how to download datasets
📝 Update contributing guidelines on how to install dependencies for unit tests and docs
📝 Add tutorial on how to run nextflow
📝 Add binning figures in examining results sections
🎨 📝 Correct installation steps. Now uses make for everything
📝 Improved contribution guidelines
📝 Add channels when using requirements.txt for autometa install
Wow, you have put a ton of work into this. Wonderful start and I think we are getting quite close to being ready for dev and a 2.0 release. Most of my comments are in the step-by-step tutorial section. I think it would be useful to add some "advanced usage" sections for the more involved scripts (binning, kmers, unclustered-recruitment). A few minor formatting details. I think it would also be helpful to have a nextflow tutorial that corresponds to the step-by-step so a user can follow along and familiarize themselves with both approaches. I had a few questions regarding some entrypoints and output files that I can change if you think it would be more appropriate. Most of the comments should be easy fixes. 👍 Once again, well done 🎊 👷 🔧
We had talked about my reformatting the nextflow code to more closely follow nf-core's guidelines (to be able to use their linter, etc.) and organizing the nextflow code to allow organized addition of more modules in the future (i.e. NSF aims). I've been waiting until there was a working Nextflow branch to do all that, so maybe we should hold off on writing the Nextflow documentation until then?
That sounds like a nice idea. I can start addressing the comments that do not concern nextflow until @chasemc commits his changes.
Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
📝 Add another column (optional or required) to the usage table of each step
*params.completeness* : Minimum completeness needed to keep a cluster (default is atleast 20% complete). See :ref:`advanced-usage-binning` section for deails

*params.purity* : Minimum purity needed to keep a cluster (default is atleast 95% pure). See :ref:`advanced-usage-binning` section for deails
"atleast"
"at least"
*params.classification_method* : Which clustering method to use for unclustered recruitment step. Choices are "decision_tree" and "random_forest" (default is "decision_tree"). See :ref:`advanced-usage-unclustered-recruitment` section for deails

*params.completeness* : Minimum completeness needed to keep a cluster (default is atleast 20% complete). See :ref:`advanced-usage-binning` section for deails
"atleast"
"at least"
Will edit this comment later, putting this here for now:
@Sidduppal @WiscEvan @jason-c-kwan Can this be merged? And can the nextflow documentation be added in a new, smaller PR? @Sidduppal I know you mentioned just writing the Nextflow stuff in a comment here and that you'd transfer it, but that requires a lot of duplicated effort between us and is probably harder for me than just writing in the .rst itself, especially since some of the file content needs to stay and some needs to go.
If @Sidduppal changes this PR to "ready for review", I can go through it, add my comments and maybe approve it for merging.
I have updated the PR; however, I think it'll be better to review it after I have incorporated the changes mentioned in PR #157, as the documentation would change after that merge.
Still need to edit/add more
* init autometa 2 * added template classes from autometa class diagram discussion * autometa 1.0 refactored. New cli and beginnings of User API. markers cutoffs reformatted. configurations added to handle executable and database dependencies. k-mer counting (multiprocessing capable), normalization and embedding (multiple methods via TSNE and UMAP). external dependencies handled in external directory. utilities for archiving, unzipping, etc added in common directory. metagenome and mag classes to handle respective data. binning directory for multiple binning algorithms. docs directory containing jupyter notebooks with information about autometa as well as template python script for writing new modules ot plug in to autometa. Added projects folder as default location where autometa will place metagenome binning jobs. Added taxonomy folder for taxon assignment algorithms/utilities. * updated markers links and link to test_metagenome.config * removed unneeded class diagram doc and edit test config to display required options. updates to database handling and added timeit to main function calls. * bugfix to get_kingdoms assigning self.taxonomy using self.assign_taxonomy() and changed logger for diamond to debug. utilities timeit now is INFO level logging. some 'f' string formatting for kmers and diamond logs (added comma thousand separators). * updates to suppress parallel warnings when running prodigal. updated noparallel arg to parallel so does GNU parallel disabled by default. updated config sections to reflect parameter change * added coverage calculation handling for reads,sam,bam and bed files. Updated coverage.py args and metagenome.py and autometa.py to reflect new args. added respective files under [files] section in default.config and metagenome.config files. * updated logger to DEBUG for majority_vote.py updated argparse help for --out in coverage.py and moved return statement in taxon assignment in metagenome.py to reduce redundancy. 
* updates to autometa configuration. Added kingdom arg to tune for only binning respective to selected kingdom. Choices are bacteria and archaea. bugfix where environs were being placed under database section in config. added samtools and bedtools to environ checks (used in coverage calculations). Updated default config files respective to coverage calculations * suppressed parallel warnings when running hmmscan * bugfix to parallel warning supression for hmmscan * bugfix to orf calling in metagenome.py * upadted default config to handle coverage calculation files * update files in default.config * Add django-related files, clean up structure * Update ignore, force recache of un-ignored files * Update ignore, force recache of un-ignored files * Add vscode ignorance * Add vscode ignorance * Begin django website dev * Begin django website dev * Restart website, add first app * Restart website, add first app * Make startpage * Make startpage * Delete website; confusing naming * Delete website; confusing naming * Start new django website * Start new django website * Create startpage * Create startpage * bugfix to hmmer marker filtering (filepath handling). minor logging edits to lca and majority vote. removed unneccessary comment in metagenome. Comment in kmers to silence UMAP warnings. * Add startpage * Add startpage * Add template, css, bootstrap * Add template, css, bootstrap * Fix links, add nav-bar, css * Fix links, add nav-bar, css * Change blog template to autometa related terms * Change blog template to autometa related terms * Add Projects and Jobs as Models This is a temporary setup and could change later. For now, each user (class User read here: https://docs.djangoproject.com/en/3.0/ref/contrib/auth/) has projects (class Project) and each project has jobs (class Job). Each class creates a data table and is mapped to the related table using foreign keys. 
Please read more here: https://docs.djangoproject.com/en/3.0/ref/models/fields/#foreignkey * Add Projects and Jobs as Models This is a temporary setup and could change later. For now, each user (class User read here: https://docs.djangoproject.com/en/3.0/ref/contrib/auth/) has projects (class Project) and each project has jobs (class Job). Each class creates a data table and is mapped to the related table using foreign keys. Please read more here: https://docs.djangoproject.com/en/3.0/ref/models/fields/#foreignkey * Add login logout profile pages * Add login logout profile pages * Use conda env instead of pip * Use conda env instead of pip * Revert "Use conda env instead of pip" This reverts commit cbb42d481d1c6938928d3b505520ba20bb7e67ee. * Revert "Add login logout profile pages" This reverts commit 4ec643de3a14d7afa84dc920933b5c7410cb1a97. * Revert "Add Projects and Jobs as Models" This reverts commit cc2d5958e62e476b56cb4e5e3c9c06964710d659. * Revert "Change blog template to autometa related terms" This reverts commit 4d7c87bee2f9eca4c2d8eb09ea5e21e0001489f8. * Revert "Fix links, add nav-bar, css" This reverts commit 0270bf1f78f715f76e768faaba7500d127c26e8f. * Revert "Add template, css, bootstrap" This reverts commit c2c520218659faeb963968d68ca3a5412d57ffea. * Revert "Merge branch 'dev' of github.com:WiscEvan/Autometa into dev" This reverts commit 1ef067b5fb6411f55259c68f942b6ea37e25d987, reversing changes made to aa1dbc179eca5bcc8d2ee6b4f946c42fb49aec7a. * Revert "Add startpage" This reverts commit aa1dbc179eca5bcc8d2ee6b4f946c42fb49aec7a. * removed website dir and files (secrets published) * Revert "Use conda env instead of pip" This reverts commit a14c37c141f210c92d3e0cae11bfd51ed948b085. * Revert "Add login logout profile pages" This reverts commit 8252fbab5b1ae5046edb56dac05a1efe7d3e92ed. * Revert "Add Projects and Jobs as Models" This reverts commit 335b563262469fb6bc354f99f6fbc4755025c134. 
* Revert "Change blog template to autometa related terms" This reverts commit 619baf82786fd08f051c65d0724bf30ede119d1b. * Revert "Fix links, add nav-bar, css" This reverts commit ecb1476c47a070ce3aeb40b7f7f41af100635947. * Revert "Add template, css, bootstrap" This reverts commit 54f9e2abb9747b3f7479485e0289c20ac8a76b02. * Revert "Merge branch 'dev' of github.com:WiscEvan/Autometa into dev" This reverts commit de620c8f02bc204678e73e8c500b34d9388ddc87, reversing changes made to 28fabf9eaa0d2b8e7a6d37fa05b0f3a1bf14a7e5. * Revert "Add startpage" This reverts commit 28fabf9eaa0d2b8e7a6d37fa05b0f3a1bf14a7e5. * removed website dir and files (secrets published) * updated directory structure, README.md, moved tests to their own directory with 78Mbp simulated community. updated config files. py2.7 bhsne for kmers in its own script to run py2.7 version. removed shebang specifically specifying python3 to avoid cryptic errors where user defined python env is not selected when run. Added .gitignore to ncbi dir under databases to keep empty directory. Post-processing are in their own directory under validation. * Added checkpointing functionality in utilities.py. updated config to reflect checkpointing. Renamed MAG class to Mag to follow python conventions. Added prodigal parsing to lca.py. Reflected in majority_vote.py. removed superfluous attributes for DiamondResult object. Updated metagenome.config with checkpoints.tsv file. * updated prodigal parsing into marker retrieval algorithm (markers.py and hmmer.py). By default will pass in ORFs retrieved from Mag object. * resolved #10 Contributors added and copyright year updated to 2020. * resolved #10 Contributors added and copyright year updated to 2020. * Resolves KwanLab/Autometa#16, Resolves KwanLab/Autometa#17 and simplified config parsing. Renamed 'projects' to 'workspace' to avoid confusion with 'project'. test metagenome.config file has been updated with respective files & parameters. 
Reconfigured logger to stream info and write debug level to timestamped log file. Added exceptions. to be used across autometa pipeline. * updates to project configuration handling metagenome numbering. Now retrieves versions from each executable dependency in environ.py. This is used in prodigal to parse corresponding to the prodigal version. I.e. 2.5 differs from version >=2.6. Prodigal now will parse ORF headers and convert contigs to ORF headers according to version available. Default config now has versions section and generated config files now contain versions section related to executable dependencies. Renamed 'new_workspace' in user.py to 'new_project' as this is more appropriate. * significant simplification in API. Created Databases class in databases.py for handling databases config. Default behavior is to download and format required databases. Changed flag to flag to be more clear. autometa will print an issue request to the user upon any exceptions being encountered (NOT KeyboardInterrupt.. Although this will also be logged). Logging behavior changed slightly, where user can specify level (default is INFO) and path to log file. binning call has been moved to user.py. autometa.config imports in user.py have been removed and general autometa.config module is imported via to perform respective func call. * updates to check dependencies and control of debugging information when checking dependencies. Executable versions are now logged in debug info. log is now only written when flag is supplied. Timestamped log has been commented out. In the future, this could be a flag to log each run via a timestamped log. in databases now only returns the config and the method of databases is used when checking dependencies. * updated 'get_versions' function to return the version string if a program is provided as input. Updated respective files using this function. 
This should be clearer than returning a dict of the program passed in as key and removes redundant calls to pass in the program as input and then again as a key to retrieve the version value. * hotfix to case where new project does not contain any metagenomes. skip performing check to place appropriate metagenome number and just return 1. * Changed OSError to subclass ChildProcessError in prodigal.py. This is a bug fix related to exception hierarchy. changed timeit logging message format. Respective exception handling updatedin metagenome.py * mostly resolves KwanLab/Autometa#21 and resolves KwanLab/Autometa#18. * resolved #19 added docstring, fixed nproc and removed depth function * Revert "resolved #19, did not add the copyright and liscence information" This reverts commit ca64a2fa5032b62517fa57b49973c7967c8ccf0c. * resolved #19 added docstring (and liscence), fixed nproc and removed depth function * resolved #19 made the improvements as suggested by Evan * resolved #19 made the improvements_2 as suggested by Evan * resolved #19 made change bam.file to alignment.bam file * resolved #19 improved the cmd function * fix to extract contigs from orf_ids using specific prodigal version. Note: entire pipeline currently assumes orf calling was performed using prodigal. Update to template.py where ArgumentParser now has default description, where previously this was by default usage. (Which the usage by default should be the name of the script). Updates to respective files where ORF to contig translations are necessary. * resolved #19 removed run function, removed intermedediate files and renamed stderr and stdout * updated pandas numpy module call for nan to pd.NA from pandas version 1.0. in kmers and recursive_dbscan. Updated main function for recursive dbscan with required coverage table input and subsetting taxonomy by the provided domain. Datatype conversion in pandas dataframes are now performed to optimize space in mag.py and recursive_dbscan.py. 
Added script description to coverage.py and removed unused exception handling in docstring. Renamed bedtools column 'breadth' to 'depth_fraction' and 'total_breadth' to 'depth_product'. Added KmerFormatError in docstring in kmers.load() func. Updated docstring in autometa.config.environ.find_executables() * resolved #19 all intermediate files are now being deleted, added additional line in the end, and used o.path.dirname(os.path.abspath(otput.bam)) * resolved #19 removed 'tail' from variable name, raise TypeError is nproc not int * update to docstrings added new file key in config and comma-delimited list handling for multiple reads files in input. Added fasta format check and simple fasta parser from Biopython for performance and Exception handling. Docstrings noting where discussions should be placed on readthedocs relating to specific autometa functionality. * returning from main rather than unnecessary sys import. * resolved #19 Temporary directory will now be delted under any circumstance * resolved #19 added FileNotFoundError, addressed other variable name issues * resolved #19 added docstring, fixed nproc and removed depth function * Revert "resolved #19, did not add the copyright and liscence information" This reverts commit ca64a2fa5032b62517fa57b49973c7967c8ccf0c. 
* resolved #19 added docstring (and liscence), fixed nproc and removed depth function * resolved #19 made the improvements as suggested by Evan * resolved #19 made the improvements_2 as suggested by Evan * resolved #19 made change bam.file to alignment.bam file * resolved #19 improved the cmd function * resolved #19 removed run function, removed intermedediate files and renamed stderr and stdout * resolved #19 all intermediate files are now being deleted, added additional line in the end, and used o.path.dirname(os.path.abspath(otput.bam)) * resolved #19 removed 'tail' from variable name, raise TypeError is nproc not int * resolved #19 Temporary directory will now be delted under any circumstance * resolved #19 added FileNotFoundError, addressed other variable name issues * Documentation (#34) * initial commit of documentation for readthedocs format * first commit * Commiting all files in Scripts folder * Modification to COPYRIGHT and config.py * stable build * Removed scripts_2, can now be fetched and run by any user * Create README.md File to explain how the documentation can be installed and used by anyone * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * automatic argparse, autosummary, automatic copyright update, last updated, can be run by anyone * Update README.md * Update README.md * modified all copyrights added usage and autodoc for all scripts * changed autometa to run_autometa * initial commit of documentation for readthedocs format * first commit * Commiting all files in Scripts folder * Modification to COPYRIGHT and config.py * stable build * Removed scripts_2, can now be fetched and run by any user * Create README.md File to explain how the documentation can be installed and used by anyone * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * automatic argparse, autosummary, automatic copyright update, last updated, can be run by anyone * Update README.md * 
Update README.md * modified all copyrights added usage and autodoc for all scripts * changed autometa to run_autometa * Applied changes to scripts and docs source files to remove warnings emitted by Sphinx. * fixes PR Review comments Sidduppal/Autometa#1 * environment.yaml and .readthedocs.yaml files for readthedocs integration * attempt to reduce memory consumption in readthedocs.org. Removed packages already available in readthedocs docker image * removed most dependencies from conda env and have moved to docs/requirements.txt * changed conda file to conda environment. * added pip in environment.yaml dependencies. * removed numba from docs/requirements.txt * fixes Sidduppal/documentation#1. minor changes in template.py to reflect changes to overall source code. Added work_queue.py to remove warning for readthedocs.org. Fixed suggestions from Sidduppal/documentation#1 PR review. * addressed Jason's comments for merge to dev * todo box added, function to automatically input modules, sidebar, and other comments by evan * Updated markers docstring (fixed incorrect f-string) to allow parameters/attributes/methods to be imported * addressed Jason's comments for merge to dev * todo box added, function to automatically input modules, sidebar, and other comments by evan * final changes, added Ian to copyright, removed hardcoded copyright * added reference for todo.py Co-authored-by: EvanRees <erees@wisc.edu> * Issue #5 Working conda recipe (#38) * init conda recipe files. * initial steps for setup.py and running autometa as its own installed application in any directory. changed structure to fit setup.py distutils and moving formatting for conda recipe * removed numpy from setup.py and removed script from build in meta.yaml * Added to numpy and removed pip * Updates to code structure to reflect proper setup for packaging/installation. Updated meta.yaml to reflect dependencies. 
Added autometa-configure to entrypoints as console script for database/environment configuration prior to binning runs. * reduce disk memory requirements for overall package size reduction * Working conda recipe for linux and osx. Removed uneeded ipynb in docs and unused build scripts. Moved databases under autometa package and updated default.config to reflect this. markers pointer to database updated in markers.py and added recursive directory construction within databases.py. * Updated <default> metagenome.config and removed (unused) WORKSPACE constant in config. * Updated parser descriptions * Updated version to pre-alpha changed main to __main__. Updated meta.yaml with jinja templating for version, home, license from setup.py * included description in meta.yaml * updated version to 2.0a0 and description in meta.yaml * Added doc url and dev url * updated gitignore and conda arc to reflect database dir change and added erees channel * updated argparse help information. Added COPYRIGHT tags to config/__init__.py. * Added copyright to autometa.py * Updated Dockerfile fixing issue-#3. Note: docker image will need to be updated when tsne is updated. * Added py3 compatible tsne to Dockerfile * updated --log parameter with user-friendly help description * bug found in logger message within func where args was being passed. (#49) * fixes #2 (#47) * fixes #2. Note: This currently operates using tsne hosted under channel for conda. * updated choices list to set for better membership checking and updated log message to join choices with comma-delimiter * updated default from umap to bhsne. * Contributing Guidelines (#50) * :memo: Add feature: contribution guidelines * :memo: :art: fix table of contents bulleted list * :memo: Add suggesting enhancements and notifying the team. * :memo: :art: reformat mention tags in teams table * :memo: Add ref for PR instructions in contributing code * Add entrypoint functionality. Update docstrings. 
* resolves issue-#54 * :memo: Update docstrings missing for functions. * :racehorse: Add cache for properties reading assembly sequences. * :art: Add Entrypoint functionality for incorporation to packaging/distribution. * :art::bug: Update main() to handle updated functions. * Update MAG class to MetaBin. * :memo: Add docstrings to methods and class * :art::fire: Remove split_nucleotides method. * :art: Update get_binning function to handle coverages as filepath not pd.DataFrame * :art: Change mag.py to metabin.py * :racehorse: fixed from PR-#66. Add cache functionality to time-consuming property methods. * :memo: Add COMBAK comment for checkpointing. Return to this when implemented in utilities.py * :art::memo::green_heart: Add Markers class documentation. (#62) * :art::green_heart: Add argparse formatter_class to show defaults without requiring f-strings. Helps doc builds from PR-#45 and issue-#22 * :art::memo: Rename --debug to --verbose flag (#63) * :art: functionality to increase verbosity with additional -v flags * :memo: Add docstrings to and * :art::memo: Add functionality to bin without taxonomy. Update docstrings. (#65) - Add parameter do_taxonomy to metagenome.config - MAG class now imported in user.py for binning without taxonomy. - resolves issue-#57 - resoves issue-#58 Note: Add COMBAK comments where checksum functionality should be added. This should be implemented in utilities.py and imported from there. * Add Documentation. 
Add readthedocs.org integration (#45) * :art::memo::green_heart: Add Makefile for running sphinx-build * :art::memo::green_heart: Add "parse_argparse.py" to generate usage information from autometa package modules * :art::memo::green_heart: Add ".readthedocs.yaml" readthedocs.org configuration * :art::memo::green_heart: Add "conf.py" for main build functionality for readthedocs.org * :art::memo::green_heart: Add "docs/requirements.txt" for installation in docs build env * :art: Add rst files in ".gitignore" to avoid committing docs to git history. Co-authored-by: EvanRees <erees@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Remove merge conflict resolution lines (Fixes #68) (#69) - :art: Update logger format to reflect template. - :bug: Removed lines generated from GUI merge conflict resolution. * Add mock import of modules and link to contribution guidelines (fixes #22) (#70) * :memo: Add link to contribution guidelines on KwanLab repo https://github.com/KwanLab/Autometa/blob/master/.github/CONTRIBUTING.md * :art::bug::green_heart: Add `autodoc_mock_imports = ["Bio", "hdbscan", "tsne", "sklearn", "umap", "tqdm"]` * :art: Add `formatter_class=argparse.ArgumentDefaultsHelpFormatter` to markers.py. To resolve dependencies needed by apidoc while building the docs, imports can be mocked by listing them in `autodoc_mock_imports`. This eliminates the need for `docs/requirements.txt`, thereby reducing the build time and removing unexpected behavior from pip installs.
Co-authored-by: EvanRees <erees@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Add docs badge Add badge for autometa.readthedocs.org * Update README.md Move autometa.rtfd.io badge above Autometa header * :art: :racehorse: Resolves #55 (#76) * Resolves #43 code currently in the if __name__ == '__main__': block moved to main. Output filehandle writes lines directly as they are being read from output file. Replaced manual creation of temporary files with tempfile module * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Revert "Resolves #43" Done so that future commits are not added on the same PR relating to hmmer.py This reverts commit e2c8bc1cae94e97582d1ed35a812db0562a5ab6c. * Revert "Resolves Issue #55" Done to make sure that future commits are not added to the PR relating to hmmer.py This reverts commit fc582de91db9a6e76e5a38291f858e0c4a7a3daf. * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Resolves comments for #55 Added shutil.which(). This replaces the which function with a single line of code * Resolves #55 Removed function and added shutil.which wherever function was being called * :memo: Add docstrings to class properties. * :memo::art: Add defaults formatter_class to template.py * :memo: Update describe property GC statement. * :art: Change os.stat(fpath).st_size to os.path.getsize(fpath) * fixes-#54 Metagenome (#66) * Add entrypoint functionality. Update docstrings. * resolves issue-#54 * :memo: Update docstrings missing for functions. * :racehorse: Add cache for properties reading assembly sequences. * :art: Add Entrypoint functionality for incorporation to packaging/distribution. * :art::bug: Update main() to handle updated functions. * :memo: Add docstrings to class properties.
* :memo::art: Add defaults formatter_class to template.py * hmmer (#72) * Resolves #43 * code currently in the if __name__ == '__main__': block moved to main * Output filehandle writes lines directly as they are being read from output file * Replaced manual creation of temporary files with tempfile module * Add the functionality to filter the results in case the user already has the hmmscan table * Change os.stat(fpath).st_size to path.getsize(fpath) Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * :memo::art::fire: Add docstrings to LCA class. (#78) * :fire: Remove func aggregate_lcas as this is available in prodigal.py * :art: func blast2lca input parameter blast is now required instead of optional. * :racehorse: Remove redundant reading of nodes.dmp in prepare_tree * :memo: Add links to RMQ/LCA datastructures within prepare_tree docstring. * Fix writing (#82) :bug: Fix writing by resetting counts * Update majority_vote (#81) * :memo::art::racehorse: Update majority_vote.py * :memo: Add/Update docstrings in functions. * :art::racehorse: Change ncbi_dir parameter to ncbi in rank_taxids to take ncbi instance rather than instantiating in function. * :art::memo: Rename ctg_lcas parameter to rank_counts where dict corresponds to only one contig. * :memo: Update docstring of majority_vote func. * pre-commit hooks (#92) * Add pre-commit hooks. 
- :art: Add hook to remove any trailing-whitespace - :art: Add hook to check executables have shebangs - :art: Add hook to fix end of files with only a newline - :art: Add hook to check whether debug statements are in files - :art: Add hook to check whether merge conflict strings are in files - :art: Add hook to run black formatter on all files - :art: Skip autometa/validation as these are py2.7 specific (deprecated) * Update Contributing python style guide to black * :memo: Add information about contributing to dev or master branch * :memo: Update conda install instructions for pre-commit * prodigal and hmmer verbose bug (#90) - :fire: Remove verbose flags. - :fire: Remove log flag from hmmer.py - :art: Change prodigal.run and hmmer.hmmscan to annotate with respective funcs parallel and sequential. - :art: Aggregation of ORFs in prodigal performed with aggregrate_orfs func. - :art: Change subprocess.call to subprocess.run - :art: when GNU parallel called, use shell=True for subprocess.run otherwise use default shell=False - :art::green_heart: Isolate main block and move if name == main to bottom. - :art::memo::racehorse: Add hmmer serial/parallel modes. - :art: Add gnu-parallel arg to parameters. - :art: Default hmmscan from standalone module now runs in serial mode. - :art::fire: Remove unnecessary variable assignment. - :art::fire: Remove unnecessary proc.check_returncode(). - :fire::memo: Remove unused log parameter in hmmscan func. - :fire::memo: Add note to docstring. * Recursive DBSCAN (#84) * :art::memo::racehorse: Update docstrings * Update median completeness calculation to not incorporate all contigs but rather just cluster values. * :art: Change RecursiveDBSCANError exception to BinningError * :bug: Update metabin import in user.py from mag.py * :art::fire: Remove default 'z' column in run_dbscan function. * :art: add_metrics func now returns 2-tuple with cluster_metrics dataframe for median completeness lookup.
* :art: Change default domain marker count lookup in add_metrics to raise error if domain does not match. * :art::racehorse: Add naive HDBSCAN implementation as clustering method. :memo: Add comments for break conditions within get_clusters function. * :art::memo::racehorse: hdbscan implementation now scans across min_samples and min_cluster_size. * databases and utilities (#77) * Fixes database checking issues. - fixes issue-#59 - fixes issue-#40 - Update .gitignore - Add md5 checksums for markers database files - Update default.config with md5 checksum urls - :art: Update file_length functionality with approximate parameter * :memo::art::fire: Write checksums for all database files * :memo: Add documentation to dunder init * :art: Write checksum after formatting/downloading database files * :fire: Remove redundant .condarc.yaml config file * :art: Update .gitignore for .vscode dir and everything within * :art: Add checksum writing to gunzip * :art::racehorse: Update downloading behavior. - :art: Update downloading behavior so corresponding remote checksums are immediately compared after file download. - :memo::fire: Remove taxdump tarball deletion routine in 'extract_taxdump' - :art: Format using 'black' formatter * :memo::art::fire: fix flake8 problems * :fire::art::racehorse: Swap ncbi download protocol to rsync from FTP. - :art: diamond database written checksum is re-written if diamond database hash does not match written checksum. * :memo: Add note format in docstring :art::fire: Remove overwriting of md5 :art: Check current md5 against remote md5 as well as current database. 
* :art: Add specific checksum checking variables in format_nr * Rank-specific binning (#96) * :art: Add taxonomy specific splitting control :art::racehorse: Add reverse-ranks parameter :art::racehorse: Add starting-rank parameter * :memo: Update --reverse-ranks parameter help text * :memo: Update help text for --reverse-ranks parameter * diamond.py (#87) * Resolves #43 code currently in the if __name__ == '__main__': block moved to main. Output filehandle writes lines directly as they are being read from output file. Replaced manual creation of temporary files with tempfile module * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Revert "Resolves #43" Done so that future commits are not added on the same PR relating to hmmer.py This reverts commit e2c8bc1cae94e97582d1ed35a812db0562a5ab6c. * Revert "Resolves Issue #55" Done to make sure that future commits are not added to the PR relating to hmmer.py This reverts commit fc582de91db9a6e76e5a38291f858e0c4a7a3daf. * Resolves #43 code currently in the if __name__ == '__main__': block moved to main. Output filehandle writes lines directly as they are being read from output file. Replaced manual creation of temporary files with tempfile module * Resolves #43 * Resolves #43 Added the functionality to filter the results in case the user already has the hmmscan table * changed st.size to path.getsize Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Resolves #43 changed st.size to path.getsize * Enabled the running of 'Magic functions' i.e.
function with double underscores. Formatted some docstrings in parse_argparse * Resolves #36 Renamed top_pct to bitscr_thresh which translates to bitscore threshold. Uses all the available cpus as default. Removed hardcoding of BLASTP. Enabled the searching of merged.dmp in case the taxid is not found in nodes.dmp. Gunzipped database now opens with 'rt' mode. Formatted docstrings * Resolves #36 Formatted docstrings. Raising KeyError in __sub__. Blast now uses mkstemp instead of os.curdir * Resolves #36 __sub__ now removes all the keys and not just the first one. top_pct renamed to top_percentile * Added import tempfile as a default import * Resolves #36 Removed default value of tempdir. Raises AssertionError * Removed import tempfile * changes being done by pre-hooks * Resolves #36 temp dir will now only be added if specified by user. Diamond default output directory will be used if no temp dir specified when running the script as a module * Resolves #36 tmpdir=None in the blast function parameter. Formatted help texts * Resolves #36 Renamed top_percentile to bitscore_filter. Formatted docstrings * Apply suggestions from Jason's code review Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Format docstrings Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Format docstrings Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Add support request issue template. (#97) * :memo: Add support request template. * :memo::green_heart: Add comments to files section. :memo::green_heart: Add log details section :memo::green_heart: Add Suggestion to link to code for potential code contributors. * :fire::memo: Remove files section * ncbi.py (#83) - Resolves #33 🎨 `convert_taxid_dtype` to check if taxid is positive integer and in nodes.dmp and names.dmp and converts with merged.dmp, if needed.
:art: Enabled the rank and parent functions to also search through 'merged.dmp' :art: Removed def(main) function :memo: Formatted docstrings :memo: replaced tar archive with tarball in databases.rst :art: added DatabaseOutOfSyncError custom exception :memo: formatted docstrings for exceptions.py :art: Moved issue request to entry point in __main__ :art: DatabaseOutOfSyncError is raised if accession id from nr is not found in prot2accession2taxid.gz Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Samtools (#103) * Resolves #43 - code currently in the if __name__ == '__main__': block moved to main :art: Uses subprocess.run and check=True. :fire: Remove unused imports * Update autometa/common/external/samtools.py Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Binning stats/taxonomy summary (#99) * :art: Add kingdom-specific parameters/handling. * Add binning summary to handle 2.0 output * :art::racehorse: Add seqrecord handling in metabin init * :racehorse::art: Update methods for MetaBin init :art: Change run_binning method in user.py to account for MetaBin init :art: Change get_kingdoms method in metagenome.py to account for MetaBin init :racehorse: Perform any file parsing once rather than per func. * :art::bug: Update output summary columns. :bug: Change NCBI.CANONICAL_RANKS list to instance list :art: Update recursive_dbscan.py func add_metrics marker lookup methods for better readability :art: Remove output summary columns markers, duplicate_markers :art: Calculate completeness and purity within metabin_stats func * :bug: Ensure canonical_ranks always removes root in get_lineage_dataframe * :racehorse: calculate stats from dataframes instead of from metabins * :art::memo: Update docstrings. Add cov and GC std() to stats dataframe * decision tree classifier (#100) * Add unclustered recruitment implementation.
* :art: ML_recruitment.py named unclustered_recruitment.py * Fix config and setup of user project (#104) * Fixes to config, setup and tests. * :art: Add entrypoints for length-filter, taxonomy, unclustered-recruitment, binning. * :fire: Remove unecessary method in AutometaUser class to set home directory on first use. * :art: Change config func to more clear name of set_home_dir. * :fire::art: Remove unneeded exceptions and replace with TableFormatError * :art: Replace os.stat(fpath).st_size with os.path.getsize(fpath) * Add makeflow script in tests to run autometa through cctools Makeflow system. * :art: Add checkpoint logic within utilities.py and used when starting/updating new/existing project * :fire: remove is_checkpoint func. :memo: Update docstring for metagenome.length_filter(...) * :art: Add Exception handling in get_versions(prog) and lca(tid,tid) :art: Rename upgrade to update in Databases class init string. :art: Add update db functionality argument in __main__.py * :art: Add binning parameters to configuration * :art: refactor user.run_binning to __main__.run_autometa(...) * :art: Root entrypoints to core functionalities across pipeline * :fire: Remove unnecessary Markers class. * :fire: Remove Markers from relevant scripts * :art: Change type checks to use isinstance func. * :art: Add corresponding parameters to config parser type converter. * Update project docstrings (#108) * :memo: Add class and method docstrings. * :bug: save project config after adding metagenome to project dir * :art::memo: Add two methods: new_metagenome_directory() and setup_checkpoints_and_files() * CI/CD (#101) * :fire: Remove test_autometa.py * Add unit tests directory for tests. 
:white_check_mark: Add tests for metagenome and metabin * :art::green_heart::memo: Add pytest.ini :green_heart: Add Makefile :art::green_heart: Add tests for coverage, markers, kmers :art: Add make_test_data.py to generate test_data.json for use with pytest fixtures :fire: Remove test_metabin.py :fire: Remove test_metagenome.py * :art: Update Makefile and install.rst :fire: Remove metagenome class methods (orfs, prots, nucls) :green_heart: Add metagenome tests :bug: Fix import error in recursive dbscan. :bug: Fix incorrect keyword arg fasta -> assembly in vote.assign(...) * :art: Change test_data path in pytest.ini to tests/data dir. :fire::art: Move tests/make_test_data.py -> make_test_data.py :art: diamond.py add type hints :art: Add type hints :fire: Move Makeflow to base directory :fire: Remove test metagenome.config file :green_heart: Add taxonomy.vote tests * :memo::white_check_mark: Add tests for markers, metagenome and vote. :art: Update test_data.json generation (smaller) for markers, metagenome and kmers. :art: Add docs command to Makefile :art: Update clean command in Makefile to incorporate docs. :bug: Minor fixes to metagenome, markers, hmmer, lca, diamond, vote :white_check_mark: Add entrypoints mark in pytest.ini * :bug: Fix raising exception when full steps are required and assembly is _not_ specified * :fire::art::white_check_mark: Change naming of test data respective to testing area. :white_check_mark: Add recursive_dbscan.py tests. :white_check_mark::art: Rename variables key in test_kmers to correspond to change in make_test_data.py * :white_check_mark: Add conftest.py and test_summary.py :memo::art: Move NCBI fixtures to conftest.py to be used throughout test session. :bug: Fix bug in summary.py when accessing marker counts. :fire: Remove unnecessary metabin.py file. Update __main__.py and vote.py to account for difference. :white_check_mark::green_heart::racehorse::fire: Change conflicting session-scoped fixture names. 
* :fire: Remove wip marks * :white_check_mark::green_heart: Add entrypoints mark to entrypoints * :art::memo::white_check_mark::green_heart: Add unclustered recruitment test :art: Add entrypoint marks to entrypoints. :bug: Bug fixes and increasing coverage for current tests. :green_heart::white_check_mark: Subset fixtures (and test_data.json) using pd.DataFrame.sample(...) method. * CI/CD (#1) * :green_heart: add .travis.yaml file * Included sam, bam and bed files in make_test_data.py * Updated .pre-commit-config.yaml file to make sure the commit hooks work on all versions of python and not just 3.7 * :art: Updated parsing of alignment files to * :art: Removed bug from bowtie.py * :art: updated make_test_data.py to use fwd and rev reads * :green_heart: Added unit tests for coverage.py * :green_heart: test for argpase block of coverage.py, metagenome.py and kmers.py * :bug: Resolved a bug in metagenome.py * :green_heart: :art: miniconda update has renamed the default installation path to miniconda2 * See [:link:](https://stackoverflow.com/a/34257781/12671809) * Made changes as per the official documentation * official documentation [link](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/use-conda-with-travis-ci.html#the-travis-yml-file) Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * :art::white_check_mark: Update conda install command. 
:bug: Fix reads param bug in bowtie.py :white_check_mark: Fix test fixture in metagenome path.join -> path.joinpath :white_check_mark::racehorse: skip slow tests in test_vote.py :green_heart: Add requirements.txt for conda install * :green_heart: change conda install command to install to base for worker * :white_check_mark::green_heart: Add pytest plugins for running tests with coverage * :green_heart: installing gdown and pointing to hosted test_data.json file in shared autometa_test_data google drive * :art: Change kmers and vote entrypoint logic :art: Change vote outdir parameter to cache * :racehorse::art: Remove unnecessary gzip decompression and add autometa-orfs entrypoint. * :art: Use utilities.gunzip(...) methods instead of gzip with extra lines. * :art::racehorse: Add parallelization options to markers entrypoint. * :art::bug: Change raised DatabaseOutOfSyncError with incorrect error message. :bug: Now will issue a warning for sseqid and respective qseqid. * :bug: Add ranks in vote entrypoint output * :bug::racehorse: Fix bug in lca exception (wrong indentation level causing LCA taxids to be root). :fire: Removed DiamondResult class as this was causing unnecessary memory consumption. This was in place for algos proposed in NSF. Better to implement these data structures, when the approaches are being implemented. Removed many unneccessary parameters b/w lca.py and majority_vote.py. Removed main logic for diamond.py * :art::fire: Propagate funcs. arg. changes from lca.py, majority_vote.py to vote.py * :bug: Fix dict comprehension where int generated instead of list * :bug: missed dict.get call in list comprehension * :art: Change sparse.pkl.gz filename to precomputed_lcas.pkl.gz * :bug: Change ranked name retrieval to emit unclassified when not found rather than root * :art: Add type hints * binning.py fix keyword argument call to `get_clusters` function :bug: broken keyword argument call to `get_clusters` function. 
✅ Add filepaths to test data class Now can handle multiple fastq files respective to forward, reverse or single-end reads * :bug: Add read arguments to bowtie as strings only when they are provided as a list otherwise :bug: fix edge case where user provides binning without any unclustered contigs :bug: Fix edge case where no bins are recovered from dataset :racehorse: Remove tempfile dependency for cluster taxon assignment :fire: Remove unused library (shutil) from coverage.py :art: Add binning exceptions for unclustered_recruitment.py and recursive_dbscan.py * :bug: Fix temp line handling in binning/summary.py Co-authored-by: Siddharth Uppal <suppal3@wisc.edu> * :bug: Change > to >= when calculating N50 (#119) * Update README.md * Update README.md * Fix Dockerfile (#123) * :art::bug::green_heart::penguin: Fix Dockerfile. Now capable of running autometa entrypoints * :bug: Add procps to prevent the nextflow error 'Command 'ps' required by nextflow to collect task metrics cannot be found' * :bug: hmmpress markers so they are pre-formatted for pipeline execution * :fire::down_arrow: Remove ndcctools dependency :art: Instead of cloning repo, copy branches current contents into build env for docker image creation :art: Point to requirements.txt during conda install instead of explicitly listing in docker build (remove redundancy) :bug: Add extra ampersand s.t. if an entrypoint is unavailable, the build will fail. :art: Redirect help text to /dev/null to clean up build log * :art: Add support for gzipped assemblies (#129) resolves KwanLab/Autometa/issue#125 * :memo: Update bug report template (#130) * Remove --multiprocess from autometa-kmers entrypoint (#127) * :art: Remove --multiprocess flag. Now performs multiprocessing if user provides cpus > 1. * :green_heart::fire: Remove multiprocess arg from test_kmers.py * Add GC content std.dev. limit and coverage std. dev. limit Binning metrics (#120) * :green_heart: Fix test_data.json creation. 
Update filepaths :art: Add issue-#46 feature of coverage std.dev. and GC content std.dev. binning metrics :green_heart: Fix mocked input args for additional binning metrics parameters. :art: Update Makefile for auto-documenting help messages and add make commands * :memo: Add help text to unlisted commands (now listed) :fire: Remove test_environment command :art: Change test_data command to unit_test_data :art: Change test_entrypoints command to unit_test_entrypoints. * :green_heart: Add command to construct unit test environment :memo: Update install documentation to reflect environment for building the docs as well as unit tests :fire: Remove ndcctools dependency and add nextflow dependency in requiremtents.txt * :fire::art: Remove most defaults in get_clusters(...) function * :art: Rename test_environment command to unit_test_environment to avoid confusion * :art: Add logic to handle exceptions when clusters are not recovered and we are attempting to add clustering metrics * :arrow_up: Add minimum pandas version 1.1 * :art: Add 'image' command to build docker image from current branch :art: Update commands associated with unit_test s.t. they are grouped together * :art: Add dirs to delete in clean command. :art: Create and install env libs with create_environment command * :art: Add test_environment command to Makefile. :white_check_mark: Add tests/requirements.txt file * :bug: fix path to requirements in test_environment command * :bug::white_check_mark: Add make into requirements.txt and update path to tests for unit tests * Nextflow implementation template (#118) * :art: Add entrypoints for taxon assignment workflow :art: Add Autometa nextflow implementation template. :art: Update majority_vote.py parameters to more easily construct taxon assignment workflow. 
* :art: Comment out container directives :art::fire: Update optional arguments to mandatory arguments (metagenome, interim, processed) :art: Prefix output files with metagenome.simpleName in their respective output directories. :art: Name main workflow AUTOMETA and call with channel :fire: Remove handling of coverage outside of SPAdes assembly (TODO: Incorporate separate COVERAGE workflow to pass into AUTOMETA) :bug: fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers) * :art: Add nextflow config with slurm executor configuration and nextflow project details * :bug: Add end of file newline * :bug: Add missing line continuation in MARKERS command. * :bug: Fix incorrect keyword argument in lca.py main call :bug: Fix incorrect flag in entrypoint (MARKERS process) * :art: Keep hmmscan output file in MARKERS * Update gitignore with paths to ignore nextflow generated files * :bug: Fix broken paths in SPLIT_KINGDOMS :art: Add parameter '--outdir' to autometa-taxonomy entrypoint. * :bug: Fix missing line continuation in BINNING * :art: Update output paths so only binning results are in processed directory :art: Add completeness and purity parameters to autometa.nf * :art: Add completeness and purity parameters to log at beginning of run * :bug: Handle for case where archaea are not recovered from metagenome * :art: Add config file for autometa input parameters :fire: Remove copy mode from all publishDir settings for all processes in autometa workflow :art: Update autometa.taxonomy.vote entrypoint parameters :green_heart: Update mocked args to be compatible with new autometa.taxonomy.vote parameters :art: Add type hints to ncbi.py :fire: Remove most of redundant logic from vote.py s.t.
entrypoint now is only responsible for adding canonical ranks to voted taxids and writing out ranks split by provided rank :art::fire: Remove hardcoded parameters and add additional parameters to allow user finer control of entire autometa workflow :art: Add HTCondor executor profile with comments * :green_heart::bug::fire: Remove keyword argument 'out' from vote.add_ranks(...) func * :art: Add params.cpus to initial info log * :fire::bug::art: Remove unnecessary autometa prodigal wrapper. :fire: Removes GNU parallel functionality from ORFs process. This was removed because the number of ORF sequences recovered using GNU parallel was non-deterministic This will take a hit on performance as a trade-off for determinism. * :art: Update nextflow scripts to use jason-c-kwan/autometa:dev docker image :art: Add dockerignore prevent unnecessary context bloat and image bloat. :fire: Remove Makeflow autometa template :art: Move autometa.nf containing AUTOMETA workflow to nextflow directory :up_arrow: Add minimum pandas version of 1.1. :memo: Update link to references in normalize(...) func in kmers.py :art: Update parameters.config to reflect updated nextflow parameters :art: Update Dockerfile with entrypoint checks autometa-taxonomy-lca and autometa-taxonomy-majority-vote :art: Add main.nf for use with manifest as a pre-requisite for nextflow pipeline sharing through GitHub. 
  :art: Update manifest in nextflow.config to reflect change in mainScript
  :art: Add fixOwnership to docker scope in nextflow.config
* :art: Update manifest with 'doi' and 'defaultBranch'
* :art: Update arguments for entrypoints autometa-binning and autometa-unclustered-recruitment
  :art: Propagate these argument changes to nextflow processes
  :green_heart: Update tests to accommodate updated arguments
* :fire: Remove unused/unnecessary configuration scripts
  :art: Move code in config/__init__.py to config/utilities.py and update respective imports to point to this file
  :art: Split autometa-configure entrypoint into two entrypoints autometa-config and autometa-update-databases
  :bug: Change default markers directory to look inside default.config instead of source directory
  :fire: Remove __main__.py and autometa.py wrapper to __main__.py in exchange for using nextflow files.
  :arrow_up: Add diamond to requirements.txt
  :bug: Modify config to point to autometa/databases after installation in Docker build
  :art::memo: Add typehints across config scripts
* :art: Apply black formatting
* :white_check_mark::art: Update call to parse_args from config.parse_args(...) to config.utilities.parse_args(...)
* :white_check_mark::bug: Update config.parse_args(...) to autometa.config.utilities.parse_args(...)
* :white_check_mark: Alias config.utilities imports to configutils. Provides access to parse_args attribute while avoiding confusion with autometa.common.utilities functions
* :art: Update default databases retrieval logic
  :bug: Remove issue of redundant executable versions being written in default.config
  :bug: Fix automatically updating autometa home_dir configuration in default.config
  :art: Add exception handling in parse_argparse.py to provide more debugging information
* :white_check_mark::memo: Fix error when parsing databases argparse.
  :art: Remove any indentation in written argparse blocks for retrieving argparse usage
* :art: add EOF line in dockerignore
* :bug: Fix default path to markers database in MARKERS process
* :bug: Fix incorrect option when attempting to download missing ncbi files
* :bug: Fix clean command in Makefile so it actually removes provided directories
* :art: replace only first ftp in ncbi ftp filepaths
* :art: Remove orfs filepath dependency in LCA and majority vote
  :art: Change entrypoint arguments for autometa-taxonomy-lca and autometa-taxonomy-majority-vote
* :art: Changed entrypoint parameters for autometa-length-filter.
  :fire: Remove unused methods in metagenome.py
  :art::white_check_mark: Remove unused tests in test_metagenome. Update MockedParser to reflect new entrypoint args
  :art: Update nextflow LENGTH_FILTER process to accommodate new parameters. Now uses named emits (fasta, stats, gc_content)
  :art::memo: Add new binning metrics into parameters.config (gc_stddev_limit,cov_stddev_limit)
  :memo::art: Add type hints into metagenome.py
* :memo: Update log with added parameters
* :bug: Fix incorrect path to default markers database in nf pipeline (location in docker image is currently hardcoded in MARKERS process).
  :art: Next step is for default to point to absolute path in docker image instead of relative path
* :fire: Remove --dbdir hardcoded parameter in MARKERS process.
  This is now being appropriately configured in the docker image that is utilized by nextflow
  :bug: Add conda channels conda-forge and bioconda to create_environment command
  :art: Update Dockerfile to configure autometa databases with the DB_DIR environment variable as an absolute path (relative path may cause bugs)
* Update autometa/common/metagenome.py
* :bug: replace 'orfs' tags with the respective single input path tag
* :bug::fire: Remove --multiprocess flag from autometa-kmers command in KMERS process
* :fire: Remove duplicate dependencies
* :bug: Fix cryptic bug where imports do not work when explicit python interpreter is used in Makefile commands
  :art: Add functionality to handle for gzipped orfs for autometa-markers entrypoint
* :fire: Remove Makefile from .dockerignore
  :art: use of make commands from Makefile for autometa directory cleanup and install
  :bug::arrow_up: Set samtools minimum version in requirements.txt. Otherwise samtools command would not work properly
* :art: Change --output parameter to --output-binning in recursive_dbscan.py
  > :art: Add '--output-master' parameter to autometa-binning entrypoint
  > :white_check_mark: Update MockArgs to account for updated entrypoint parameters
  > :white_check_mark::art: Add args check to autometa-binning entrypoint for embed_dimensions and embed_pca_dimensions inputs
  > :art: Fix typo in kmers embed docstring
  > :art: Standardize output columns from kmers.embed(...) to 1-indexed 'x_1' to 'x_{embed_dimensions}' instead of x,y,z...
  > :bug: Add coverage and gc_content std.dev. limits to drop columns in run_hdbscan(...)
  > :art: drop columns in run_hdbscan(...) and run_dbscan(...)
  are now performed on one line and if the df does not contain any of the columns in dropcols, the error is ignored
* :fire: Remove conda install using py2.7
  :fire::art: Rename references from master to main throughout nf and autometa binning scripts
  :memo: Format notes in parameters.config
* :arrow_up: Add minimum version of diamond 2.*
  :green_heart: Add output_main to MockedArgs
* :memo::art: Add copyright and short script description to all unit test files
* :art: Add autometa-parse-bed entrypoint
  :art: Add READ_COVERAGE workflow in common-tasks to compute coverage from read alignments instead of SPAdes headers
* :memo: Replace 2020 copyright with 2021 copyright
  :memo::fire: Remove note on ORF calling warning and replace with contig cutoff warning
  :memo: Update help text for --binning argument in unclustered_recruitment
* :fire: Remove --do-pca argument from kmers.py
  :memo: Fix help string in --norm-method in kmers.py
  :art: Change --normalized to --norm-output in kmers.py
  :art: Change --embedded to --embedding-output in kmers.py
  :art: Change --embed-dimensions to --embedding-dimensions in kmers.py
  :art: Change --embed-method to --embedding-method in kmers.py
  :art: Update KMERS in common-tasks.nf to account for updated parameters
  :green_heart: Update test_kmers.py MockedArgs to account for updated arguments
* :fire::green_heart: Remove references to removed do_pca parameter
  :bug: Update marker databases checksums so they correspond to md5sum
  :art: sort main file output columns in autometa-binning entrypoint
* :fire::art: Remove 'string' metavar for clustering-method arg
* :fire: Remove kmer embedding args from autometa-binning entrypoint
  :art: Change KMERS.out.normalized as input for binning to KMERS.out.embedded
  :green_heart: Update test_recursive_dbscan kmers fixture and mocked args to account for removed kmer parameters
  :art: Add convert_dtypes method call to load(...)
  func for markers dataframe
  :fire::art: Remove parameters for kmers in binning-tasks and update parameters to correspond to kmers args
  :art: unclustered recruitment now writes output-binning with contig, cluster and recruited_cluster columns
* :art: Add autometa-binning-summary entrypoint
  :art: unclustered recruitment now writes out binning with columns 'cluster' and 'recruited_cluster'
  :bug::green_heart: Fix duplicate mocks in test_recursive_dbscan(...)
  :art: Add BINNING_SUMMARY process in autometa.nf workflow
  :art: Define BINNING_SUMMARY process in binning-tasks.nf
* :green_heart::bug: Change broken variable main to main_df
* :green_heart::fire: Remove kmer embedding dimensions test
* :bug::fire: Remove assembly argument in get_metabin_stats(...)
  :green_heart::fire: Remove unused mocked dependencies in test_kmers.py
  :fire::green_heart: Remove tests corresponding to old summary.py functionality
* :green_heart: Add gc_content column to bin_df fixture in test_summary
* :memo: Add docstrings and explanation within vote.py
  :art: Change vote.py argument from --input to --votes and add metavars to parser args
  :green_heart: Change make_test_data.py summary data to create gc_content column instead of GC column
  :green_heart: Update MockedArgs in vote.py to correspond to updated --votes parameter
  :art: Replace --input argument in autometa-taxonomy for SPLIT_KINGDOMS process to --votes
* :bug: Fix arg passed in pd.read_csv(...) for autometa.taxonomy.vote
* :racehorse: Add autometa/databases to dockerignore
* :art: Update autometa-orfs entrypoint arguments
  :memo: Add type hints to autometa.common.external.prodigal funcs
  :fire::art: Remove --parallel parameter from autometa-orfs.
  Parallel is now inferred from --cpus arg
* :racehorse: ignore the ignore for autometa/databases/markers
  Add test of autometa-binning-summary entrypoint
* :bug: Replace incorrect variable (orfs) in BINNING_SUMMARY tag
* :memo: Replace old kmer parameters in log info with new parameters
* Update documentation (#121)
* :art: Added link to Automappa in examining results
  :memo: Updated install for version 2
* :memo: Add step-by-step tutorial on how to run Autometa
  :fire: Remove Rest API
  :art: Add docs/source/_build to .gitignore
  :memo: Update autometa install guidelines. Added docker to it.
  :memo: Add benchmarking page
  :memo: Add Automappa to examining results
  :art: Replaced shell with bash in parse_argparse.py
  :memo: Add packages to install for developers in contributing guidelines
* :memo: Add information regarding test datasets
  :arrow_down: Remove dependency on sphinx.ext.paramout
* :memo: Added python and R script in examining results
* Apply suggestions from code review
  Co-authored-by: Jason Kwan <jason.kwan@wisc.edu>
  Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* :memo: Added nextflow tutorial
  :memo: Update install using Makefile
  :up-arrow: Create a new file for step-by-step instructions on how to run Autometa
  :memo: Update benchmarking to add steps on how to download datasets
  :memo: Update contributing guidelines on how to install dependencies for unit tests and docs
* :art: :memo: Remove Quickstart from index.rst
* :memo: Add step-by-step tutorial on how to run autometa using entrypoints
  :memo: Add tutorial on how to run nextflow
  :memo: Add binning figures in examining results sections
  :art: :memo: Correct installation steps.
  Now uses make for everything
  :memo: Improved contribution guidelines
* :memo: Fix table in tutorial
  :memo: Add channels when using requirements.txt for autometa install
* :art: :memo: Incorporated Evan's comments
* Apply suggestions from code review
  Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* :memo: Add Advanced usage in step-by-step-tutorial
  :memo: Add another column of optional or required in usage table of each step
* First pass on nextflow documentation
  Still need to edit/add more
  Co-authored-by: Jason Kwan <jason.kwan@wisc.edu>
  Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
  Co-authored-by: chasemc <18691127+chasemc@users.noreply.github.com>
* Add feature to download google drive datasets (#138)
* Add feature to download google drive datasets
  Issue #110
* Add gdown to requirements.txt
* :art: Formatted script according to template, renamed variables
* :art: Changed permissions
* :art: Added unique filenames for each file size
* :art: Moved to external folder
* Moved script to validation and renamed
* Rename function and add type hints
* Add file containing fileIDs to reference
* Add user input options for files/folders
* Reformat with black
* Change targets variable name
* Change "folder" to "dataset"
* Update column names
* Condense logic into one function
* Change logic to input multiple files and multiple output dirs
* Add logger warnings
* Add datasets.py info to setup.py
* Change internet_is_connected into an import
* Add internet connection checker and error message
* Directory structure to organize downloads
* Change variable names and clean up extra bits
* Add __init__.py to validation
* Add error for non-existent dir_path
* Add detail to internet_is_connected failure
* Added NotImplementedError
* Only read csv once
* Change strategy for filtering df
* Using df.loc to retrieve file_id
* Argparse
  and var name refinements
* Add ability to ping custom IP
* Reformatting
* Hardcode fileID csv hosted on google drive
* Reformatting
* Remove gdown_fileIDs.csv
* Add verbose error message and dockerfile entrypoint
* Add densmap embed method and fix binning-summary cluster column bug (#176)
* :memo: Update bug report template
* :snake: Add densmap --embed-method to autometa-kmers
  :memo: Add TODO comments for easy addition of denSNE when it is easily available through conda or pip installation
  :bug: Change hardcoded 'cluster' column in autometa-binning-summary to cluster_col variable
* :art: Add trimap as embedding method
  :memo: Update installation instructions to use trimap
  :white_check_mark: Add trimap to kmer tests
  :whale: Add trimap installation to Dockerfile
  :arrow_up: Add trimap requirements to requirements.txt
* :green_apple::memo: Update parameter comments for embedding_method
* :arrow_up: pinned umap-learn and prodigal in requirements.txt
  :memo: Add comment for trimap requirement in requirements.txt
* :fire: Remove TODO comment of densne import
  :memo: Change densmap hyperlink to point to umap-learn readthedocs
  :memo: Add comments on densmap and trimap in step-by-step tutorial
* :memo: Add newline to note in advanced kmer usage b/w sksne and bhsne
* Classification and Clustering Benchmarking (#141)
* :art: Add entrypoints for taxon assignment workflow
  :art: Add Autometa nextflow implementation template.
  :art: Update majority_vote.py parameters to more easily construct taxon assignment workflow.
* :art: Comment out container directives
  :art::fire: Update optional arguments to mandatory arguments (metagenome, interim, processed)
  :art: Prefix output files with metagenome.simpleName in their respective output directories.
  :art: Name main workflow AUTOMETA and call with channel
  :fire: Remove handling of coverage outside of SPAdes assembly (TODO: Incorporate separate COVERAGE workflow to pass into AUTOMETA)
  :bug: fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers)
* :art: Add nextflow config with slurm executor configuration and nextflow project details
* :bug: Add end of file newline
* :bug: Add missing line continuation in MARKERS command.
* :bug: Fix incorrect keyword argument in lca.py main call
  :bug: Fix incorrect flag in entrypoint (MARKERS process)
* :art: Keep hmmscan output file in MARKERS
* Update gitignore with paths to ignore nextflow generated files
* :bug: Fix broken paths in SPLIT_KINGDOMS
  :art: Add parameter '--outdir' to autometa-taxonomy entrypoint.
* :bug: Fix missing line continuation in BINNING
* :art: Update output paths so only binning results are in processed directory
  :art: Add completeness and purity parameters to autometa.nf
* :art: Add completeness and purity parameters to log at beginning of run
* :bug: Handle for case where archaea are not recovered from metagenome
* :art: Add config file for autometa input parameters
  :fire: Remove copy mode from all publishDir settings for all processes in autometa workflow
  :art: Update autometa.t…
📝 Add step-by-step tutorial on how to run Autometa
🔥 Remove Rest API
🎨 Add docs/source/_build to .gitignore
📝 Update autometa install guidelines. Added docker to it.
📝 Add benchmarking page
📝 Add Automappa to examining results
🎨 Replaced shell with bash in parse_argparse.py
📝 Add packages to install for developers in contributing guidelines"