databases and utilities #77
Conversation
evanroyrees commented May 13, 2020
- 🎨📝 Fixes #59: [dev branch] Issues with autometa/config/databases.py
- 🎨📝🐎 Resolves #40: [dev branch] Issues with autometa/common/utilities.py
- 🎨 Update .gitignore
- 🎨🐎 Add md5 checksums for markers database files
- 🎨 Update default.config with md5 checksum urls
- 🐎 Update file_length functionality with approximate parameter
autometa/config/databases.py
Outdated
db_outfpath = db_infpath.replace('.gz','.dmnd')
checksums_match = False
if os.path.exists(db_infpath_md5) and os.path.exists(db_outfpath):
    # If nr.dmnd.md5 exists, then we will check if nr.gz.md5 is matching
The naming of these files is very confusing. Shouldn't it be nr.dmnd.md5? Wait, I see what is going on. Yes, I agree we should make an md5 file of nr.gz, so that it can be checked against the FTP site later. However, I think we also should make an md5 file of nr.dmnd so that we can check later that that file is not corrupted.
The reason I'm only doing this for nr.gz.md5 is that we have a reference md5 for nr.gz, but we do not have a reference for nr.dmnd.md5 a priori. I think this is a subtle difference that may be more specific to issue-#71? I'm not sure.
Yes, perhaps. I just think that since the conversion to a .dmnd file takes a long time, it is important here to get a checksum at the end (provided, of course, that the diamond command finished without error).
After thinking over the checksum writing this weekend, I agree with you on writing checksums after each successful download of each of the database files, as well as writing checksums for formatted files like nr.dmnd that take a long time to create. I'm not sure we need to checksum the formatted markers or the files within taxdump. Although maybe we write a database summary file and add in the checksums from the downloaded and formatted databases? In this case we would calculate checksums for all of the formatted files and write them to one database summary file. If any of the corresponding database checksums are updated, this would be reflected in the summary checksum. I'm thinking of logic similar to a Makefile here. This way we can update the formatted databases (*.h3*, nr.dmnd, nodes.dmp, names.dmp, merged.dmp) when any of their respective parent database files are updated, while also ensuring file integrity later with a lookup of the database summary during any Autometa run.
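As a concrete sketch of the helpers this thread keeps referring to, and of the single summary file floated above (a minimal sketch only; the names calculate_checksum, write_checksum and write_summary are illustrative, not necessarily the actual utilities.py API):

import hashlib
import os

def calculate_checksum(fpath: str) -> str:
    """md5 hexdigest of fpath, hashed in chunks so large databases fit in memory."""
    md5 = hashlib.md5()
    with open(fpath, 'rb') as fh:
        for chunk in iter(lambda: fh.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()

def write_checksum(fpath: str, md5_fpath: str) -> None:
    """Write an md5sum-style line ('<hash>  <filename>') to md5_fpath."""
    with open(md5_fpath, 'w') as out:
        out.write(f'{calculate_checksum(fpath)}  {os.path.basename(fpath)}\n')

def write_summary(db_fpaths, summary_fpath='databases.md5') -> None:
    """One line per formatted database, so a single lookup validates them all."""
    with open(summary_fpath, 'w') as out:
        for fpath in db_fpaths:
            out.write(f'{calculate_checksum(fpath)}  {os.path.basename(fpath)}\n')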
Yes, I think a single file for checksums would work. My argument for doing a checksum for every output file is that the alternative, devising all sorts of tests for the format of each file to check it is OK and not corrupted, is much more work. Checksums are both (1) foolproof/accurate and (2) easy.
Just looked at the code changes in this function, and they look good. The one problem is that we are now checking that the nr.dmnd.md5 file exists, but not that the checksum matches that of the file as it now exists on the disk.
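A sketch of the check being asked for here, reusing calculate_checksum from the sketch above: recompute the hash of the file as it currently sits on disk and compare it to the stored value, rather than only testing that the .md5 file exists.

def checksum_matches(fpath: str, md5_fpath: str) -> bool:
    """True only when fpath's current contents hash to the value stored in md5_fpath."""
    with open(md5_fpath) as fh:
        stored_hash = fh.read().split()[0]  # md5sum convention: '<hash>  <filename>'
    return calculate_checksum(fpath) == stored_hash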
autometa/config/databases.py
Outdated
no_update_db = os.path.exists(db_outfpath) and not self.do_update
if self.dryrun or no_update_db or checksums_match:
    self.config.set('ncbi', 'nr', db_outfpath)
    logger.debug(f'set ncbi nr: {db_outfpath}')
    return
After this point in the code we are here because checksums_match == False. So shouldn't we redownload the database? Or at least have some sort of option to do so. It seems weird that we check whether the checksums match and then don't do anything about it. Also, couldn't checksums_match == True but we still need to make the nr.dmnd file? It seems that in that scenario, the function will return and not format the database.
- self.dryrun is a user setting telling autometa not to run the formatting/database downloading.
- no_update_db (I've renamed to do_not_upgrade... struggling to determine an appropriate variable name) checks that nr.dmnd exists and the user does not want to upgrade the database, even if the checksums do not match.
- To toggle checksums_match to True, os.path.exists(db_outfpath) must also be True. So this ensures that if the checksums match, nr.dmnd will also exist.
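A hypothetical condensation of the three guards, just to make the invariants above explicit (this is not the actual databases.py code):

import os

def should_skip_format(dryrun: bool, upgrade: bool, db_outfpath: str, checksums_match: bool) -> bool:
    """True when the nr.dmnd formatting step can safely be skipped.

    checksums_match can only be True when db_outfpath already exists, so a
    match also guarantees nr.dmnd is on disk; do_not_upgrade honors the
    user's choice to keep a stale-but-intact database.
    """
    do_not_upgrade = os.path.exists(db_outfpath) and not upgrade
    return dryrun or do_not_upgrade or checksums_match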
autometa/config/databases.py
Outdated
if self.do_update and os.path.exists(outfpath):
    os.remove(outfpath)
Here, wouldn't we need to also check that the taxdump file exists, because previous runs of this function would have deleted it? The below code assumes it exists.
autometa/config/databases.py
Outdated
taxdump_md5 = f'{taxdump_fpath}.md5'
write_checksum(taxdump_fpath, taxdump_md5)
This part is a little confusing to me. I get that making an md5 file of the taxdump file before deleting it allows you to later check whether the version we have is the same as is found on the FTP server. But why is this not done when the file is downloaded (I mean calculated, to verify the download)? Wouldn't it make sense to also make checksums of the extracted files, so that later we can verify that they are not corrupted?
Note: The taxdump.tar.gz download is fairly lightweight. I've also noticed this file gets updated multiple times daily.
All files are (now) checksummed after they are downloaded. I'll write checksums out for any of the formatted/extracted files. See #77 (comment) for details.
autometa/config/databases.py
Outdated
with FTP(host) as ftp, open(outfpath, 'wb') as fp:
    ftp.login()
    logger.debug(f'starting {option} download')
    result = ftp.retrbinary(f'RETR {ftp_fpath}', fp.write)
    if not result.startswith('226 Transfer complete'):
        raise ConnectionError(f'{option} download failed')
    ftp.quit()
Here after download, it would make sense to then calculate the md5 of the file and check it against the corresponding md5 URL on the FTP server. Writing the md5 to a file could then signify to the pipeline that the download was successful, and checking the file later against the md5 file would ensure that the file is not corrupted.
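A minimal sketch of that flow, assuming the server publishes a <file>.md5 alongside each database file (as NCBI's FTP site does for nr.gz); download_and_verify is a hypothetical name:

import hashlib
from ftplib import FTP

def download_and_verify(host: str, ftp_fpath: str, outfpath: str) -> None:
    """Download ftp_fpath, hash it while streaming, and verify against the remote .md5."""
    md5 = hashlib.md5()
    with FTP(host) as ftp, open(outfpath, 'wb') as fp:
        ftp.login()
        def writer(chunk: bytes) -> None:
            md5.update(chunk)  # hash while writing so the file is only read once
            fp.write(chunk)
        ftp.retrbinary(f'RETR {ftp_fpath}', writer)
        remote_md5_lines = []
        ftp.retrlines(f'RETR {ftp_fpath}.md5', remote_md5_lines.append)
    remote_hash = remote_md5_lines[0].split()[0]
    if md5.hexdigest() != remote_hash:
        raise ConnectionError(f'{outfpath} failed md5 validation after download')
    with open(f'{outfpath}.md5', 'w') as out:
        out.write(f'{remote_hash}  {outfpath}\n')  # written only after a verified download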
I saw the addition of write_checksum - this is not quite right yet, because if the download was corrupted then the checksum will probably match but be wrong. So we should additionally check that the written checksum file matches the remote checksum at this point.
autometa/config/databases.py
Outdated
hmms = (os.path.join(self.markers_dir, fn) for fn in os.listdir(self.markers_dir) if fn.endswith('.hmm'))
hmm_search_str = os.path.join('autometa/databases/markers/','*.h3*')
pressed_hmms = glob(hmm_search_str)
pressed_hmms = set(os.path.realpath(os.path.splitext(fp)[0]) for fp in pressed_hmms)
for fp in hmms:
    if fp in pressed_hmms:
        continue
    hmmer.hmmpress(fp)
Here it would make sense to write checksum files for the pressed databases when the command has finished successfully (or within hmmer.py, I guess). When these files are to be used, then we can check the checksum.
Do you mean write a checksum for each of the four files that are created by hmmpress?
e.g. markers.hmm.h3p, markers.hmm.h3i, markers.hmm.h3m, markers.hmm.h3f
creating markers.hmm.h3p.md5, markers.hmm.h3i.md5, markers.hmm.h3m.md5, markers.hmm.h3f.md5
Yes. If we have a convention that we always make a checksum (either as a separate file or in some special checksums file) after a file is created, then the pipeline can always check whether a checksum is available for a particular file, and we can be confident that the file is good if its current checksum (i.e. recalculated) matches the one we have.
This is being done now in press_hmms.
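For reference, a sketch of what press_hmms with this convention could look like (a sketch only, reusing write_checksum from the earlier sketch; the real implementation goes through hmmer.hmmpress):

import glob
import os
import subprocess

def press_hmms(markers_dir: str) -> None:
    """Press each markers .hmm file, then checksum the four resulting h3* files."""
    for hmm_fpath in glob.glob(os.path.join(markers_dir, '*.hmm')):
        if glob.glob(f'{hmm_fpath}.h3?'):
            continue  # h3f/h3i/h3m/h3p already exist; nothing to press
        subprocess.run(['hmmpress', '-f', hmm_fpath], check=True)
        for pressed_fpath in glob.glob(f'{hmm_fpath}.h3?'):
            write_checksum(pressed_fpath, f'{pressed_fpath}.md5')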
autometa/config/databases.py
Outdated
If the database file already exists, an md5 checksum comparison will be
performed between the current and the file's remote md5 to ensure file integrity
prior to checking the respective file as valid.
This function doesn't seem to do this, though. If the file exists and validate == True then it returns True. If validate == False then it seems to just return what files do not currently exist. So we need to actually add the md5 comparison. Also, as written right now I think this will fail for the marker files at least, because an md5 file is not made when they are downloaded (see other comment).
I'm going to try returning an empty dict instead of mixed dtypes (bool and dict) to try to make this more readable.
Right now though, if a user were to specify validate = True, then the if validate: return True block will only be encountered if the if os.path.exists(fpath): continue check fails. Meaning there is a missing file, so return True. Otherwise, if validate = True and no missing files are encountered, the function will return False.
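One possible shape for the dict-returning version (get_missing and the required mapping are hypothetical names, sketched only to show the uniform return type):

import os

def get_missing(required: dict) -> dict:
    """Map each config section to its missing files, e.g. {'ncbi': ['nr.dmnd']}.

    Returning a dict in every case (empty when nothing is missing) avoids the
    mixed bool/dict returns discussed above.
    """
    missing = {}
    for section, fpaths in required.items():
        absent = [fpath for fpath in fpaths if not os.path.exists(fpath)]
        if absent:
            missing[section] = absent
    return missing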
autometa/config/databases.py
Outdated
# Skip user added options not required by Autometa
continue
# nodes.dmp, names.dmp and merged.dmp are all in taxdump.tar.gz
option = 'taxdump' if option in {'nodes','names','merged'} else option
I made another comment elsewhere but here we are just checking that the parent taxdump file is OK but not the untarred component files.
autometa/config/databases.py
Outdated
|
def update_missing(self):
    """Download and format databases for all options in each section.
def compare_checksums(self, section=None, validate=False):
I think we need a related function which looks at the checksum stored for a local file, then recalculates the file's checksum to see if they are the same. This could be used for files created by the pipeline itself to ensure the file is OK. If the md5 file is not there then that would signify that the step didn't finish properly.
For example, this would be needed for the individual files that are untarred from taxdump. But it could also be applied to all other files made in other parts of the pipeline.
I'm not sure how we would do these comparisons without having the remote checksum available. Otherwise, we can perform the checksum comparisons like I am doing with the database files. A general checksum comparison for files in the pipeline is likely to be a bit different because there will not be a remote checksum. compare_checksums here is specific to databases. Another checksum comparison function can be added to utilities to aid in comparing pipeline-generated files.
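Such a utility could be as small as the following sketch (validate_local is a hypothetical name, reusing checksum_matches from the earlier sketch):

import os

def validate_local(fpath: str) -> bool:
    """Validate a pipeline-generated file against its own previously written .md5.

    No remote checksum is needed: a missing .md5 means the step that creates
    fpath never finished, and a hash mismatch means the file was changed or
    corrupted after it was written.
    """
    md5_fpath = f'{fpath}.md5'
    return os.path.exists(md5_fpath) and checksum_matches(fpath, md5_fpath)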
autometa/config/databases.py
Outdated
parser.add_argument('--dryrun', help='Log configuration actions but do not perform them.',
                    action='store_true', default=False)
    '--no-checksum',
    help='Do not perform checksum comparisons to validate integrity of database files',
The description says this is to check the validity of the files. I think this is a little misleading for the user, because it could also trigger if the ftp site has newer databases even though the local files are fine. Perhaps modify the wording?
If the related function mentioned in another comment is implemented, then this wording makes a bit more sense.
This has been addressed in the new commits
I had a number of comments. The principal problem is that the behavior here is a bit different to what I was expecting. I think that the behavior should be:
- Whenever a file is downloaded, if the download task finishes successfully, we calculate an md5 checksum and write a .md5 file. Immediately we check whether this result is the same as on the remote server (if available). If not, that shows that the download didn't work. We could retry or throw an error.
- Whenever we make a new file, we calculate and write a new .md5 file if the task finished without error. That way other parts of the pipeline can always verify that that task finished (because of the presence of the .md5 file), and when we want to use the file we can re-calculate the md5 and check that it matches the checksum in the file written previously. This avoids all the "file exists and is non-zero" business.
Note: The tasks here are slightly different, and I think this should also be communicated clearly to the user.
* 📝 Add documentation to dunder init
* 🎨 Write checksum after formatting/downloading database files
* 🔥 Remove redundant .condarc.yaml config file
* 🎨 Update .gitignore for .vscode dir and everything within
* 🎨 Add checksum writing to gunzip
I had some comments. As a general request - when you change function names, can you edit the changelog to say what happened to them and why? That way it is easier for the reviewer to connect issues related to functions that don't exist any more to the current code (see checklists in linked issues). Also, I would appreciate it if you could respond to all the threads so that I can parse what has happened (I think only the new code shows up in the view changes section, so it is a bit confusing).
autometa/config/databases.py
Outdated
logger.debug(f'nr checksums match, skipping...')
checksums_match = True
dmnd_md5 = f'{db_outfpath}.md5'
do_not_upgrade = os.path.exists(db_outfpath) and not self.upgrade
Surely do_not_upgrade should be set based on whether the checksum in f'{db_outfpath}.md5' matches that calculated from the file on the disk?
The checksums_match variable is checked with do_not_upgrade in line 250. This way, the user can specify to keep the database even if the checksums do not match, i.e. they do not mind a database that is a few weeks out-of-date. do_not_upgrade will only evaluate to True if nr.dmnd exists and the user specifies not to upgrade the database to the most up-to-date version.
My point was that if f'{db_outfpath}.md5' does not match the current checksum of db_outfpath then it should be remade, and perhaps that should be checked here if it would not be caught later?
Can you explain the reasoning behind lines 261-266? It looks like if the md5 file exists, then it recalculates the md5 from the diamond file. But then if they don't match the old one is re-written?? Wouldn't this cover up a corrupted file?
This I think is the only outstanding issue that has not been resolved yet.
autometa/config/databases.py
Outdated
After successful extraction of the files, a checksum will be written of the
archive for future checking and then the archive will be removed to save user disk space.
I think this is wrong - I couldn't see where the .tar file is deleted in this function, so the comment should be changed. For what it is worth, I think we should leave the parent .tar.gz file on the disk.
I've removed the lines that discard taxdump.tar.gz. This routine was at the end of the function (lines 307-309).
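A sketch of the resulting extract_taxdump shape: extract only the needed members, checksum each one, and leave the parent tarball on disk (write_checksum is the helper from the first sketch; member names follow NCBI's taxdump layout):

import os
import tarfile

def extract_taxdump(archive_fpath: str, outdir: str) -> None:
    """Extract the taxdump files Autometa needs and checksum each extracted file."""
    wanted = {'nodes.dmp', 'names.dmp', 'merged.dmp'}
    with tarfile.open(archive_fpath) as tar:
        for member in tar.getmembers():
            if member.name in wanted:
                tar.extract(member, path=outdir)
                extracted_fpath = os.path.join(outdir, member.name)
                write_checksum(extracted_fpath, f'{extracted_fpath}.md5')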
autometa/config/databases.py
Outdated
with FTP(host) as ftp, open(outfpath, 'wb') as fp: | ||
ftp.login() | ||
logger.debug(f'starting {option} download') | ||
result = ftp.retrbinary(f'RETR {ftp_fpath}', fp.write) | ||
if not result.startswith('226 Transfer complete'): | ||
raise ConnectionError(f'{option} download failed') | ||
ftp.quit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw the addition of write_checksum - this is not quite right yet, because if the download was corrupted then the checksum will probably match but be wrong. So we should additionally check that the written checksum file matches the remote checksum at this point.
Regarding #77 (comment). This will
- 🎨 Update downloading behavior so corresponding remote checksums are immediately compared after file download.
- 📝🔥 Remove taxdump tarball deletion routine in 'extract_taxdump'
- 🎨 Format using 'black' formatter
This is pretty much finished, but I just had 2 comments to address.
- 🎨 diamond database written checksum is re-written if diamond database hash does not match written checksum.
This was probably lost in the weeds above, but I had an outstanding comment on: "Can you explain the reasoning behind lines 261-266? It looks like if the md5 file exists, then it recalculates the md5 from the diamond file. But then if they don't match the old one is re-written?? Wouldn't this cover up a corrupted file?"
I added this after you mentioned wanting to overwrite the existing file if the current database checksum did not match. See #77 (comment)
I think I've resolved this. I interpreted the comment above differently than intended, I think. Now we are checking all of the bases.
- 🎨🔥 Remove overwriting of md5
- 🎨 Check current md5 against remote md5 as well as current database.
I found just one logic error within databases.py
autometa/config/databases.py
Outdated
if os.path.exists(dmnd_md5) and db_outfpath_exists:
    dmnd_md5_hash, __ = read_checksum(dmnd_md5).split()
    if db_outfpath_hash == dmnd_md5_hash:
        checksums_match = True
checksums_match should be set to False before this block, because it is likely to have been set to True further up. Therefore, in the failure case where db_outfpath_hash != dmnd_md5_hash, it will still be True.
Looks great!
* init autometa 2 * added template classes from autometa class diagram discussion * autometa 1.0 refactored. New cli and beginnings of User API. markers cutoffs reformatted. configurations added to handle executable and database dependencies. k-mer counting (multiprocessing capable), normalization and embedding (multiple methods via TSNE and UMAP). external dependencies handled in external directory. utilities for archiving, unzipping, etc added in common directory. metagenome and mag classes to handle respective data. binning directory for multiple binning algorithms. docs directory containing jupyter notebooks with information about autometa as well as template python script for writing new modules ot plug in to autometa. Added projects folder as default location where autometa will place metagenome binning jobs. Added taxonomy folder for taxon assignment algorithms/utilities. * updated markers links and link to test_metagenome.config * removed unneeded class diagram doc and edit test config to display required options. updates to database handling and added timeit to main function calls. * bugfix to get_kingdoms assigning self.taxonomy using self.assign_taxonomy() and changed logger for diamond to debug. utilities timeit now is INFO level logging. some 'f' string formatting for kmers and diamond logs (added comma thousand separators). * updates to suppress parallel warnings when running prodigal. updated noparallel arg to parallel so does GNU parallel disabled by default. updated config sections to reflect parameter change * added coverage calculation handling for reads,sam,bam and bed files. Updated coverage.py args and metagenome.py and autometa.py to reflect new args. added respective files under [files] section in default.config and metagenome.config files. * updated logger to DEBUG for majority_vote.py updated argparse help for --out in coverage.py and moved return statement in taxon assignment in metagenome.py to reduce redundancy. * updates to autometa configuration. Added kingdom arg to tune for only binning respective to selected kingdom. Choices are bacteria and archaea. bugfix where environs were being placed under database section in config. added samtools and bedtools to environ checks (used in coverage calculations). Updated default config files respective to coverage calculations * suppressed parallel warnings when running hmmscan * bugfix to parallel warning supression for hmmscan * bugfix to orf calling in metagenome.py * upadted default config to handle coverage calculation files * update files in default.config * Add django-related files, clean up structure * Update ignore, force recache of un-ignored files * Update ignore, force recache of un-ignored files * Add vscode ignorance * Add vscode ignorance * Begin django website dev * Begin django website dev * Restart website, add first app * Restart website, add first app * Make startpage * Make startpage * Delete website; confusing naming * Delete website; confusing naming * Start new django website * Start new django website * Create startpage * Create startpage * bugfix to hmmer marker filtering (filepath handling). minor logging edits to lca and majority vote. removed unneccessary comment in metagenome. Comment in kmers to silence UMAP warnings. 
* Add startpage * Add startpage * Add template, css, bootstrap * Add template, css, bootstrap * Fix links, add nav-bar, css * Fix links, add nav-bar, css * Change blog template to autometa related terms * Change blog template to autometa related terms * Add Projects and Jobs as Models This is a temporary setup and could change later. For now, each user (class User read here: https://docs.djangoproject.com/en/3.0/ref/contrib/auth/) has projects (class Project) and each project has jobs (class Job). Each class creates a data table and is mapped to the related table using foreign keys. Please read more here: https://docs.djangoproject.com/en/3.0/ref/models/fields/#foreignkey * Add Projects and Jobs as Models This is a temporary setup and could change later. For now, each user (class User read here: https://docs.djangoproject.com/en/3.0/ref/contrib/auth/) has projects (class Project) and each project has jobs (class Job). Each class creates a data table and is mapped to the related table using foreign keys. Please read more here: https://docs.djangoproject.com/en/3.0/ref/models/fields/#foreignkey * Add login logout profile pages * Add login logout profile pages * Use conda env instead of pip * Use conda env instead of pip * Revert "Use conda env instead of pip" This reverts commit cbb42d481d1c6938928d3b505520ba20bb7e67ee. * Revert "Add login logout profile pages" This reverts commit 4ec643de3a14d7afa84dc920933b5c7410cb1a97. * Revert "Add Projects and Jobs as Models" This reverts commit cc2d5958e62e476b56cb4e5e3c9c06964710d659. * Revert "Change blog template to autometa related terms" This reverts commit 4d7c87bee2f9eca4c2d8eb09ea5e21e0001489f8. * Revert "Fix links, add nav-bar, css" This reverts commit 0270bf1f78f715f76e768faaba7500d127c26e8f. * Revert "Add template, css, bootstrap" This reverts commit c2c520218659faeb963968d68ca3a5412d57ffea. * Revert "Merge branch 'dev' of github.com:WiscEvan/Autometa into dev" This reverts commit 1ef067b5fb6411f55259c68f942b6ea37e25d987, reversing changes made to aa1dbc179eca5bcc8d2ee6b4f946c42fb49aec7a. * Revert "Add startpage" This reverts commit aa1dbc179eca5bcc8d2ee6b4f946c42fb49aec7a. * removed website dir and files (secrets published) * Revert "Use conda env instead of pip" This reverts commit a14c37c141f210c92d3e0cae11bfd51ed948b085. * Revert "Add login logout profile pages" This reverts commit 8252fbab5b1ae5046edb56dac05a1efe7d3e92ed. * Revert "Add Projects and Jobs as Models" This reverts commit 335b563262469fb6bc354f99f6fbc4755025c134. * Revert "Change blog template to autometa related terms" This reverts commit 619baf82786fd08f051c65d0724bf30ede119d1b. * Revert "Fix links, add nav-bar, css" This reverts commit ecb1476c47a070ce3aeb40b7f7f41af100635947. * Revert "Add template, css, bootstrap" This reverts commit 54f9e2abb9747b3f7479485e0289c20ac8a76b02. * Revert "Merge branch 'dev' of github.com:WiscEvan/Autometa into dev" This reverts commit de620c8f02bc204678e73e8c500b34d9388ddc87, reversing changes made to 28fabf9eaa0d2b8e7a6d37fa05b0f3a1bf14a7e5. * Revert "Add startpage" This reverts commit 28fabf9eaa0d2b8e7a6d37fa05b0f3a1bf14a7e5. * removed website dir and files (secrets published) * updated directory structure, README.md, moved tests to their own directory with 78Mbp simulated community. updated config files. py2.7 bhsne for kmers in its own script to run py2.7 version. removed shebang specifically specifying python3 to avoid cryptic errors where user defined python env is not selected when run. 
Added .gitignore to ncbi dir under databases to keep empty directory. Post-processing are in their own directory under validation. * Added checkpointing functionality in utilities.py. updated config to reflect checkpointing. Renamed MAG class to Mag to follow python conventions. Added prodigal parsing to lca.py. Reflected in majority_vote.py. removed superfluous attributes for DiamondResult object. Updated metagenome.config with checkpoints.tsv file. * updated prodigal parsing into marker retrieval algorithm (markers.py and hmmer.py). By default will pass in ORFs retrieved from Mag object. * resolved #10 Contributors added and copyright year updated to 2020. * resolved #10 Contributors added and copyright year updated to 2020. * Resolves KwanLab/Autometa#16, Resolves KwanLab/Autometa#17 and simplified config parsing. Renamed 'projects' to 'workspace' to avoid confusion with 'project'. test metagenome.config file has been updated with respective files & parameters. Reconfigured logger to stream info and write debug level to timestamped log file. Added exceptions. to be used across autometa pipeline. * updates to project configuration handling metagenome numbering. Now retrieves versions from each executable dependency in environ.py. This is used in prodigal to parse corresponding to the prodigal version. I.e. 2.5 differs from version >=2.6. Prodigal now will parse ORF headers and convert contigs to ORF headers according to version available. Default config now has versions section and generated config files now contain versions section related to executable dependencies. Renamed 'new_workspace' in user.py to 'new_project' as this is more appropriate. * significant simplification in API. Created Databases class in databases.py for handling databases config. Default behavior is to download and format required databases. Changed flag to flag to be more clear. autometa will print an issue request to the user upon any exceptions being encountered (NOT KeyboardInterrupt.. Although this will also be logged). Logging behavior changed slightly, where user can specify level (default is INFO) and path to log file. binning call has been moved to user.py. autometa.config imports in user.py have been removed and general autometa.config module is imported via to perform respective func call. * updates to check dependencies and control of debugging information when checking dependencies. Executable versions are now logged in debug info. log is now only written when flag is supplied. Timestamped log has been commented out. In the future, this could be a flag to log each run via a timestamped log. in databases now only returns the config and the method of databases is used when checking dependencies. * updated 'get_versions' function to return the version string if a program is provided as input. Updated respective files using this function. This should be clearer than returning a dict of the program passed in as key and removes redundant calls to pass in the program as input and then again as a key to retrieve the version value. * hotfix to case where new project does not contain any metagenomes. skip performing check to place appropriate metagenome number and just return 1. * Changed OSError to subclass ChildProcessError in prodigal.py. This is a bug fix related to exception hierarchy. changed timeit logging message format. Respective exception handling updatedin metagenome.py * mostly resolves KwanLab/Autometa#21 and resolves KwanLab/Autometa#18. 
* resolved #19 added docstring, fixed nproc and removed depth function * Revert "resolved #19, did not add the copyright and liscence information" This reverts commit ca64a2fa5032b62517fa57b49973c7967c8ccf0c. * resolved #19 added docstring (and liscence), fixed nproc and removed depth function * resolved #19 made the improvements as suggested by Evan * resolved #19 made the improvements_2 as suggested by Evan * resolved #19 made change bam.file to alignment.bam file * resolved #19 improved the cmd function * fix to extract contigs from orf_ids using specific prodigal version. Note: entire pipeline currently assumes orf calling was performed using prodigal. Update to template.py where ArgumentParser now has default description, where previously this was by default usage. (Which the usage by default should be the name of the script). Updates to respective files where ORF to contig translations are necessary. * resolved #19 removed run function, removed intermedediate files and renamed stderr and stdout * updated pandas numpy module call for nan to pd.NA from pandas version 1.0. in kmers and recursive_dbscan. Updated main function for recursive dbscan with required coverage table input and subsetting taxonomy by the provided domain. Datatype conversion in pandas dataframes are now performed to optimize space in mag.py and recursive_dbscan.py. Added script description to coverage.py and removed unused exception handling in docstring. Renamed bedtools column 'breadth' to 'depth_fraction' and 'total_breadth' to 'depth_product'. Added KmerFormatError in docstring in kmers.load() func. Updated docstring in autometa.config.environ.find_executables() * resolved #19 all intermediate files are now being deleted, added additional line in the end, and used o.path.dirname(os.path.abspath(otput.bam)) * resolved #19 removed 'tail' from variable name, raise TypeError is nproc not int * update to docstrings added new file key in config and comma-delimited list handling for multiple reads files in input. Added fasta format check and simple fasta parser from Biopython for performance and Exception handling. Docstrings noting where discussions should be placed on readthedocs relating to specific autometa functionality. * returning from main rather than unnecessary sys import. * resolved #19 Temporary directory will now be delted under any circumstance * resolved #19 added FileNotFoundError, addressed other variable name issues * resolved #19 added docstring, fixed nproc and removed depth function * Revert "resolved #19, did not add the copyright and liscence information" This reverts commit ca64a2fa5032b62517fa57b49973c7967c8ccf0c. 
* resolved #19 added docstring (and liscence), fixed nproc and removed depth function * resolved #19 made the improvements as suggested by Evan * resolved #19 made the improvements_2 as suggested by Evan * resolved #19 made change bam.file to alignment.bam file * resolved #19 improved the cmd function * resolved #19 removed run function, removed intermedediate files and renamed stderr and stdout * resolved #19 all intermediate files are now being deleted, added additional line in the end, and used o.path.dirname(os.path.abspath(otput.bam)) * resolved #19 removed 'tail' from variable name, raise TypeError is nproc not int * resolved #19 Temporary directory will now be delted under any circumstance * resolved #19 added FileNotFoundError, addressed other variable name issues * Documentation (#34) * initial commit of documentation for readthedocs format * first commit * Commiting all files in Scripts folder * Modification to COPYRIGHT and config.py * stable build * Removed scripts_2, can now be fetched and run by any user * Create README.md File to explain how the documentation can be installed and used by anyone * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * automatic argparse, autosummary, automatic copyright update, last updated, can be run by anyone * Update README.md * Update README.md * modified all copyrights added usage and autodoc for all scripts * changed autometa to run_autometa * initial commit of documentation for readthedocs format * first commit * Commiting all files in Scripts folder * Modification to COPYRIGHT and config.py * stable build * Removed scripts_2, can now be fetched and run by any user * Create README.md File to explain how the documentation can be installed and used by anyone * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * automatic argparse, autosummary, automatic copyright update, last updated, can be run by anyone * Update README.md * Update README.md * modified all copyrights added usage and autodoc for all scripts * changed autometa to run_autometa * Applied changes to scripts and docs source files to remove warnings emitted by Sphinx. * fixes PR Review comments Sidduppal/Autometa#1 * environment.yaml and .readthedocs.yaml files for readthedocs integration * attempt to reduce memory consumption in readthedocs.org. Removed packages already available in readthedocs docker image * removed most dependencies from conda env and have moved to docs/requirements.txt * changed conda file to conda environment. * added pip in environment.yaml dependencies. * removed numba from docs/requirements.txt * fixes Sidduppal/documentation#1. minor changes in template.py to reflect changes to overall source code. Added work_queue.py to remove warning for readthedocs.org. Fixed suggestions from Sidduppal/documentation#1 PR review. * addressed Jason's comments for merge to dev * todo box added, function to automatically input modules, sidebar, and other comments by evan * Updated markers docstring (fixed incorrect f-string) to allow parameters/attributes/methods to be imported * addressed Jason's comments for merge to dev * todo box added, function to automatically input modules, sidebar, and other comments by evan * final changes, added Ian to copyright, removed hardcoded copyright * added reference for todo.py Co-authored-by: EvanRees <erees@wisc.edu> * Issue #5 Working conda recipe (#38) * init conda recipe files. 
* initial steps for setup.py and running autometa as its own installed application in any directory. changed structure to fit setup.py distutils and moving formatting for conda recipe * removed numpy from setup.py and removed script from build in meta.yaml * Added to numpy and removed pip * Updates to code structure to reflect proper setup for packaging/installation. Updated meta.yaml to reflect dependencies. Added autometa-configure to entrypoints as console script for database/environment configuration prior to binning runs. * reduce disk memory requirements for overall package size reduction * Working conda recipe for linux and osx. Removed uneeded ipynb in docs and unused build scripts. Moved databases under autometa package and updated default.config to reflect this. markers pointer to database updated in markers.py and added recursive directory construction within databases.py. * Updated <default> metagenome.config and removed (unused) WORKSPACE constant in config. * Updated parser descriptions * Updated version to pre-alpha changed main to __main__. Updated meta.yaml with jinja templating for version, home, license from setup.py * included description in meta.yaml * updated version to 2.0a0 and description in meta.yaml * Added doc url and dev url * updated gitignore and conda arc to reflect database dir change and added erees channel * updated argparse help information. Added COPYRIGHT tags to config/__init__.py. * Added copyright to autometa.py * Updated Dockerfile fixing issue-#3. Note: docker image will need to be updated when tsne is updated. * Added py3 compatible tsne to Dockerfile * updated --log parameter with user-friendly help description * bug found in logger message within func where args was being passed. (#49) * fixes #2 (#47) * fixes #2. Note: This currently operates using tsne hosted under channel for conda. * updated choices list to set for better membership checking and updated log message to join choices with comma-delimiter * updated default from umap to bhsne. * Contributing Guidelines (#50) * :memo: Add feature: contribution guidelines * :memo: :art: fix table of contents bulleted list * :memo: Add suggesting enhancements and notifying the team. * :memo: :art: reformat mention tags in teams table * :memo: Add ref for PR instructions in contributing code * Add entrypoint functionality. Update docstrings. * resolves issue-#54 * :memo: Update docstrings missing for functions. * :racehorse: Add cache for properties reading assembly sequences. * :art: Add Entrypoint functionality for incorporation to packaging/distribution. * :art::bug: Update main() to handle updated functions. * Update MAG class to MetaBin. * :memo: Add docstrings to methods and class * :art::fire: Remove split_nucleotides method. * :art: Update get_binning function to handle coverages as filepath not pd.DataFrame * :art: Change mag.py to metabin.py * :racehorse: fixed from PR-#66. Add cache functionality to time-consuming property methods. * :memo: Add COMBAK comment for checkpointing. Return to this when implemented in utilities.py * :art::memo::green_heart: Add Markers class documentation. (#62) * :art::green_heart: Add argparse formatter_class to show defaults without requiring f-strings. Helps doc builds from PR-#45 and issue-#22 * :art::memo: Rename --debug to --verbose flag (#63) * :art: functionality to increase verbosity with additional -v flags * :memo: Add docstrings to and * :art::memo: Add functionality to bin without taxonomy. Update docstrings. 
(#65) - Add parameter do_taxonomy to metagenome.config - MAG class now imported in user.py for binning without taxonomy. - resolves issue-#57 - resoves issue-#58 Note: Add COMBAK comments where checksum functionality should be added. This should be implemented in utilities.py and imported from there. * Add Documentation. Add readthedocs.org integration (#45) * :art::memo::green_heart: Add Makefile for running sphinx-build * :art::memo::green_heart: Add "parse_argparse.py" to generate usage information from autometa package modules * :art::memo::green_heart: Add ".readthedocs.yaml" readthedocs.org configuration * :art::memo::green_heart: Add "conf.py" for main build functionality for readthedocs.org * :art::memo::green_heart: Add "docs/requirements.txt" for installation in docs build env * :art: Add rst files in ".gitignore" to avoid committing docs to git history. Co-authored-by: EvanRees <erees@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Remove merge conflict resolution lines (Fixes #68) (#69) - :art: Update logger format to reflect template. - :bug: Removed lines generated from GUI merge conflict resolution. * Add mock import of modules and link to contribution guidelines (fixes #22) (#70) * :memo: Add link to contribution guidelines on KwanLab repo https://github.com/KwanLab/Autometa/blob/master/.github/CONTRIBUTING.md * :art::bug::green_heart: Add `autodoc_mock_imports = ["Bio", "hdbscan", "tsne", "sklearn", "umap", "tqdm"] * :art: Add `formatter_class=argparse.ArgumentDefaultsHelpFormatter)` to markers.py To resolve dependencies needed by apidoc during building the docs imports can be mocked by specifying the import in `autodoc_mock_imports`. This eliminates the need for `docs/requirements.txt` thereby reducing the build time and removing unexpected behavior from pip installs. Co-authored-by: EvanRees <erees@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Add docs badge Add badge for autometa.readthedocs.org * Update README.md Move autometa.rtfd.io badge above Autometa header * :art: :racehorse: Resolves #55 (#76) * Resolves #43 code currently in the if __name__ == '__main__': block moved to main Output filehandle writes lines directly as they are being read from output file Replaced manual creatio of temporary files with tempfile module * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Revert "Resolves #43" Done so that future commits are not added on the same PR relating to hmmer.py This reverts commit e2c8bc1cae94e97582d1ed35a812db0562a5ab6c. * Revert "Resolves Issue #55" Done to make sure that future commits are not added to the PR relating to hmmer.py This reverts commit fc582de91db9a6e76e5a38291f858e0c4a7a3daf. * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Resolves comments for #55 Added shutil.which(). This replaces the which finction with a single line of code * Resolves #55 Removed function and added shutil.which wherever function was being called * :memo: Add docstrings to class properties. * :memo::art: Add defaults formatter_class to template.py * :memo: Update describe property GC statement. * :art: Change os.stat(fpath).st_size to os.path.getsize(fpath) * fixes-#54 Metagenome (#66) * Add entrypoint functionality. Update docstrings. * resolves issue-#54 * :memo: Update docstrings missing for functions. 
* :racehorse: Add cache for properties reading assembly sequences. * :art: Add Entrypoint functionality for incorporation to packaging/distribution. * :art::bug: Update main() to handle updated functions. * :memo: Add docstrings to class properties. * :memo::art: Add defaults formatter_class to template.py * hmmer (#72) * Resolves #43 * code currently in the if __name__ == '__main__': block moved to main * Output filehandle writes lines directly as they are being read from output file * Replaced manual creation of temporary files with tempfile module * Add the functionality to filter the results in case the user already has the hmmscan table * Change os.stat(fpath).st_size to path.getsize(fpath) Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * :memo::art::fire: Add docstrings to LCA class. (#78) * :fire: Remove func aggregate_lcas as this is available in prodigal.py * :art: func blast2lca input parameter blast is now required instead of optional. * :racehorse: Remove redundant reading of nodes.dmp in prepare_tree * :memo: Add links to RMQ/LCA datastructures within prepare_tree docstring. * Fix writing (#82) :bug: Fix writing by resetting counts * Update majority_vote (#81) * :memo::art::racehorse: Update majority_vote.py * :memo: Add/Update docstrings in functions. * :art::racehorse: Change ncbi_dir parameter to ncbi in rank_taxids to take ncbi instance rather than instantiating in function. * :art::memo: Rename ctg_lcas parameter to rank_counts where dict corresponds to only one contig. * :memo: Update docstring of majority_vote func. * pre-commit hooks (#92) * Add pre-commit hooks. dquote> - :art: Add hook to remove any trailing-whitespace dquote> - :art: Add hook to check executables have shebangs dquote> - :art: Add hook to fix end of files with only a newline dquote> - :art: Add hook to check whether debug statements are in files - :art: Add hook to check whether merge conflict strings are in files - :art: Add hook to run black formatter on all files - :art: Skip autometa/validation as these are py2.7 specific (deprecated) * Update Contributing python style guide to black * :memo: Add information about contributing to dev or master branch * :memo: Update conda install instructions for pre-commit * prodigal and hmmer verbose bug (#90) - :fire: Remove verbose flags. - :fire: Remove log flag from hmmer.py - :art: Change prodigal.run and hmmer.hmmscan to annotate with respective funcs parallel and sequential. - :art: Aggregation of ORFs in prodigal performed with aggregrate_orfs func. - :art: Change subprocess.call to subprocess.run - :art: when GNU parallel called, use shell=True for subprocess.run otherwise use default shell=False - :art::green_heart: Isolate main block and move if name == main to bottom. - :art::memo::racehorse: Add hmmer serial/parallel modes. - :art: Add gnu-parallel arg to parameters. - :art: Default hmmscan from standalone module now runs in serial mode. - :art::fire: Remove unnecessary variable assignment. - :art::fire: Remove unnecessary proc.check_returncode(). - :fire::memo: Remove unused log parameter in hmmscan func. - :fire::memo: Add note to docstring. * Recursive DBSCAN (#84) * :art::memo::racehorse: Update docstrings * Update median completeness calculation to not incorporate all contigs but rather just cluster values. * :art: Change RecursiveDBSCANError exception to BinningError * :bug: Update metabin import in user.py from mag.py * :art::fire: Remove default 'z' column in run_dbscan function. 
* :art: add_metrics func now returns 2-tuple with cluster_metrics dataframe for median completeness lookup. * :art: Change default domain marker count lookup in add_metrics to raise error if domain does not match. * :art::racehorse: Add naive HDBSCAN implementation as clustering method. :memo: Add comments for break conditions within get_clusters function. * :art::memo::racehorse: hdbscan implementation now scans across min_samples and min_cluster_size. * databases and utilities (#77) * Fixes database checking issues. - fixes issue-#59 - fixes issue-#40 - Update .gitignore - Add md5 checksums for markers database files - Update default.config with md5 checksum urls - :art: Update file_length functionality with approximate parameter * :memo::art::fire: Write checksums for all database files * :memo: Add documentation to dunder init * :art: Write checksum after formatting/downloading database files * :fire: Remove redundant .condarc.yaml config file * :art: Update .gitignore for .vscode dir and everything within * :art: Add checksum writing to gunzip * :art::racehorse: Update downloading behavior. - :art: Update downloading behavior so corresponding remote checksums are immediately compared after file download. - :memo::fire: Remove taxdump tarball deletion routine in 'extract_taxdump' - :art: Format using 'black' formatter * :memo::art::fire: fix flake8 problems * :fire::art::racehorse: Swap ncbi download protocol to rsync from FTP. - :art: diamond database written checksum is re-written if diamond database hash does not match written checksum. * :memo: Add note format in docstring :art::fire: Remove overwriting of md5 :art: Check current md5 against remote md5 as well as current database. * :art: Add specific checksum checking variables in format_nr * Rank-specific binning (#96) * :art: Add taxonomy specific splitting control :art::racehorse: Add reverse-ranks parameter :art::racehorse: Add starting-rank parameter * :memo: Update --reverse-ranks parameter help text * :memo: Update help text for --reverse-ranks parameter * diamond.py (#87) * Resolves #43 code currently in the if __name__ == '__main__': block moved to main Output filehandle writes lines directly as they are being read from output file Replaced manual creatio of temporary files with tempfile module * Resolves Issue #55 - Removed dispatcher dictionary by using globals() - Changed the location of checking TypeError in get_versions() function * Revert "Resolves #43" Done so that future commits are not added on the same PR relating to hmmer.py This reverts commit e2c8bc1cae94e97582d1ed35a812db0562a5ab6c. * Revert "Resolves Issue #55" Done to make sure that future commits are not added to the PR relating to hmmer.py This reverts commit fc582de91db9a6e76e5a38291f858e0c4a7a3daf. * Resolves #43 code currently in the if __name__ == '__main__': block moved to main Output filehandle writes lines directly as they are being read from output file Replaced manual creatio of temporary files with tempfile module * Resolves #43 * Resolves #43 Added the funtionality to filter the results in case the user already has the hmmscan table * changed st.size to path.getsize Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Resolves #43 changed st.size to path.getsize * Enabled the running of 'Magic functions' i.e. 
function with double underscores Formatted some docstrings in parse_argparse * Resolves #36 Renamed top_pct to bitscr_thresh which translates to bitscore threshold Uses all the available cpus as default removed hardcoding of BLASTP Enabled the searching of merged.dmp in case the taxid is not found in nodes.dmp gunzipped database now opens with 'rt' mode Formatted docstrings * Resolves #36 Formatted doc strings Raising KeyError in __sub__ Blast now uses msktemp instead of os.curdir * Resolves #36 __sub__ now removes all the keys and not just the first one top_pct renamed to top_percentile * Added import temfile as a default import * Resolves #36 Removed default value of tempdir Raises AssertionError * Removed import tempfile * changes being done by pre-hooks * Resolves 36 temdir will now only be added if specified by user Diamond dedault output directory will be used if no temp dir specified when running the script as a module * Resolves #36 tmpdir=None in the blast function parameter Fomratted help texts * Resolves #36 Renamed top_percentile to bitscore_filter formatted docstrings * Apply suggestions from Jason's code review Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Format docstrings Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Format docstrings Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> * Add support request issue template. (#97) * :memo: Add support request template. * :memo::green_heart: Add comments to files section. :memo::green_heart: Add log details section :memo::green_heart: Add Suggestion to link to code for potential code contributors. * :fire::memo: Remove files section * ncbi.py (#83) - Resolves #33 🎨 `convert_taxid_dtype` to check if taxid is positive integer and in nodes.dmp and names.dmp and converts with merged.dmp, if needed. :art: Enabled the rank and parent functions to also search through 'merged.dmp' :art: Removed def(main) function :memo: Formatted docstrings :memo: replaced tar archive with tarball in databases.rst :art: added DatabaseOutOfSyncError custom exception :memo: formatted docstrings for exceptions.py :art: Moved issue request to entry point in __main__ :art: DatabaseOutofSyncError is raised is accession id from nr is not found in prot2accession2taxid.gz Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Samtools (#103) * Resolves #43 - code currently in the if __name__ == '__main__': block moved to main :art: Uses subprocess.run and check=True. :fire: Remove unused imports * Update autometa/common/external/samtools.py Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> * Binning stats/taxonomy summary (#99) * :art: Add kingdom-specific parameters/handling. * Add binning summary to handle 2.0 output * :art::racehorse: Add seqrecord handling in metabin init * :racehorse::art: Update methods for MetaBin init :art: Change run_binning method in user.py to account for MetaBin init :art: Change get_kingdoms method in metagenome.py to account for MetaBin init :racehorse: Perform any file parsing once rather than per func. * :art::bug: Update output summary columns. 
* :art::bug: Update output summary columns. :bug: Change NCBI.CANONICAL_RANKS list to instance list. :art: Update recursive_dbscan.py func add_metrics marker lookup methods for better readability. :art: Remove output summary columns markers, duplicate_markers. :art: Calculate completeness and purity within metabin_stats func.
* :bug: Ensure canonical_ranks always removes root in get_lineage_dataframe
* :racehorse: calculate stats from dataframes instead of from metabins
* :art::memo: Update docstrings. Add cov and GC std() to stats dataframe
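The completeness and purity computed in metabin_stats are marker-based estimates. A sketch of one common formulation from per-bin marker copy counts; treat the expected single-copy marker totals below as assumptions about the marker sets rather than confirmed values:

```python
import pandas as pd

# Assumed expected single-copy marker totals per domain (illustrative).
EXPECTED_MARKERS = {"bacteria": 139, "archaea": 162}

def completeness_purity(copy_counts: pd.Series, domain: str = "bacteria"):
    """Estimate bin completeness/purity from a marker-id -> copy-count mapping."""
    if domain not in EXPECTED_MARKERS:
        # cf. the add_metrics change above: raise if the domain does not match
        raise KeyError(f"{domain} not in {list(EXPECTED_MARKERS)}")
    present = int((copy_counts >= 1).sum())      # unique markers found at least once
    single_copy = int((copy_counts == 1).sum())  # markers found exactly once
    completeness = 100 * present / EXPECTED_MARKERS[domain]
    purity = 100 * single_copy / present if present else 0.0
    return completeness, purity
```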
* decision tree classifier (#100)
* Add unclustered recruitment implementation.
* :art: ML_recruitment.py renamed unclustered_recruitment.py
* Fix config and setup of user project (#104)
* Fixes to config, setup and tests.
* :art: Add entrypoints for length-filter, taxonomy, unclustered-recruitment, binning.
* :fire: Remove unnecessary method in AutometaUser class to set home directory on first use.
* :art: Change config func to clearer name of set_home_dir.
* :fire::art: Remove unneeded exceptions and replace with TableFormatError
* :art: Replace os.stat(fpath).st_size with os.path.getsize(fpath)
* Add makeflow script in tests to run autometa through cctools Makeflow system.
* :art: Add checkpoint logic within utilities.py, used when starting/updating a new/existing project
* :fire: remove is_checkpoint func. :memo: Update docstring for metagenome.length_filter(...)
* :art: Add exception handling in get_versions(prog) and lca(tid,tid). :art: Rename upgrade to update in Databases class init string. :art: Add update db functionality argument in __main__.py.
* :art: Add binning parameters to configuration
* :art: refactor user.run_binning to __main__.run_autometa(...)
* :art: Root entrypoints to core functionalities across pipeline
* :fire: Remove unnecessary Markers class.
* :fire: Remove Markers from relevant scripts
* :art: Change type checks to use isinstance func.
* :art: Add corresponding parameters to config parser type converter.
* Update project docstrings (#108)
* :memo: Add class and method docstrings.
* :bug: save project config after adding metagenome to project dir
* :art::memo: Add two methods: new_metagenome_directory() and setup_checkpoints_and_files()
* CI/CD (#101)
* :fire: Remove test_autometa.py
* Add unit tests directory for tests. :white_check_mark: Add tests for metagenome and metabin.
* :art::green_heart::memo: Add pytest.ini. :green_heart: Add Makefile. :art::green_heart: Add tests for coverage, markers, kmers. :art: Add make_test_data.py to generate test_data.json for use with pytest fixtures. :fire: Remove test_metabin.py. :fire: Remove test_metagenome.py.
* :art: Update Makefile and install.rst. :fire: Remove metagenome class methods (orfs, prots, nucls). :green_heart: Add metagenome tests. :bug: Fix import error in recursive dbscan. :bug: Fix incorrect keyword arg fasta -> assembly in vote.assign(...).
* :art: Change test_data path in pytest.ini to tests/data dir. :fire::art: Move tests/make_test_data.py -> make_test_data.py. :art: diamond.py add type hints. :art: Add type hints. :fire: Move Makeflow to base directory. :fire: Remove test metagenome.config file. :green_heart: Add taxonomy.vote tests.
* :memo::white_check_mark: Add tests for markers, metagenome and vote. :art: Update test_data.json generation (smaller) for markers, metagenome and kmers. :art: Add docs command to Makefile. :art: Update clean command in Makefile to incorporate docs. :bug: Minor fixes to metagenome, markers, hmmer, lca, diamond, vote. :white_check_mark: Add entrypoints mark in pytest.ini.
* :bug: Fix raising exception when full steps are required and assembly is _not_ specified
* :fire::art::white_check_mark: Change naming of test data respective to testing area. :white_check_mark: Add recursive_dbscan.py tests. :white_check_mark::art: Rename variables key in test_kmers to correspond to change in make_test_data.py.
* :white_check_mark: Add conftest.py and test_summary.py. :memo::art: Move NCBI fixtures to conftest.py to be used throughout test session. :bug: Fix bug in summary.py when accessing marker counts. :fire: Remove unnecessary metabin.py file; update __main__.py and vote.py to account for the difference. :white_check_mark::green_heart::racehorse::fire: Change conflicting session-scoped fixture names.
* :fire: Remove wip marks
* :white_check_mark::green_heart: Add entrypoints mark to entrypoints
* :art::memo::white_check_mark::green_heart: Add unclustered recruitment test. :art: Add entrypoint marks to entrypoints. :bug: Bug fixes and increasing coverage for current tests. :green_heart::white_check_mark: Subset fixtures (and test_data.json) using pd.DataFrame.sample(...) method.
* CI/CD (#1)
* :green_heart: add .travis.yaml file
* Included sam, bam and bed files in make_test_data.py
* Updated .pre-commit-config.yaml file to make sure the commit hooks work on all versions of python and not just 3.7
* :art: Updated parsing of alignment files to
* :art: Removed bug from bowtie.py
* :art: updated make_test_data.py to use fwd and rev reads
* :green_heart: Added unit tests for coverage.py
* :green_heart: test for argparse block of coverage.py, metagenome.py and kmers.py
* :bug: Resolved a bug in metagenome.py
* :green_heart: :art: miniconda update has renamed the default installation path to miniconda2
* See [:link:](https://stackoverflow.com/a/34257781/12671809)
* Made changes as per the official documentation
* official documentation [link](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/use-conda-with-travis-ci.html#the-travis-yml-file) Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* :art::white_check_mark: Update conda install command. :bug: Fix reads param bug in bowtie.py. :white_check_mark: Fix test fixture in metagenome path.join -> path.joinpath. :white_check_mark::racehorse: skip slow tests in test_vote.py. :green_heart: Add requirements.txt for conda install.
* :green_heart: change conda install command to install to base for worker
* :white_check_mark::green_heart: Add pytest plugins for running tests with coverage
* :green_heart: installing gdown and pointing to hosted test_data.json file in shared autometa_test_data google drive
* :art: Change kmers and vote entrypoint logic. :art: Change vote outdir parameter to cache.
* :racehorse::art: Remove unnecessary gzip decompression and add autometa-orfs entrypoint.
* :art: Use utilities.gunzip(...) methods instead of gzip with extra lines.
* :art::racehorse: Add parallelization options to markers entrypoint.
* :art::bug: Change raised DatabaseOutOfSyncError with incorrect error message. :bug: Now will issue a warning for sseqid and respective qseqid.
* :bug: Add ranks in vote entrypoint output
* :bug::racehorse: Fix bug in lca exception (wrong indentation level causing LCA taxids to be root). :fire: Removed DiamondResult class as this was causing unnecessary memory consumption. This was in place for algos proposed in NSF; better to implement these data structures when the approaches are being implemented. Removed many unnecessary parameters b/w lca.py and majority_vote.py. Removed main logic for diamond.py.
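The `utilities.gunzip(...)` helper referenced above replaces ad hoc gzip boilerplate, and per the PR #77 commits earlier it also writes a checksum for the decompressed file. A self-contained sketch under those assumptions (the function body is illustrative, not Autometa's implementation):

```python
import gzip
import hashlib
import shutil

def gunzip(infpath: str, outfpath: str) -> str:
    """Stream-decompress infpath (.gz) to outfpath, then record its md5."""
    with gzip.open(infpath, "rb") as fin, open(outfpath, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # constant-memory streaming copy
    md5 = hashlib.md5()
    with open(outfpath, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    # "<hash>  <filename>" mirrors `md5sum` output format.
    with open(f"{outfpath}.md5", "w") as fh:
        fh.write(f"{md5.hexdigest()}  {outfpath}\n")
    return outfpath
```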
* :art::fire: Propagate funcs. arg. changes from lca.py, majority_vote.py to vote.py
* :bug: Fix dict comprehension where int generated instead of list
* :bug: missed dict.get call in list comprehension
* :art: Change sparse.pkl.gz filename to precomputed_lcas.pkl.gz
* :bug: Change ranked name retrieval to emit unclassified when not found rather than root
* :art: Add type hints
* binning.py: fix keyword argument call to `get_clusters` function. :bug: broken keyword argument call to `get_clusters` function. ✅ Add filepaths to test data class. Now can handle multiple fastq files respective to forward, reverse or single-end reads.
* :bug: Add read arguments to bowtie as strings only when they are provided as a list otherwise. :bug: fix edge case where user provides binning without any unclustered contigs. :bug: Fix edge case where no bins are recovered from dataset. :racehorse: Remove tempfile dependency for cluster taxon assignment. :fire: Remove unused library (shutil) from coverage.py. :art: Add binning exceptions for unclustered_recruitment.py and recursive_dbscan.py.
* :bug: Fix temp line handling in binning/summary.py Co-authored-by: Siddharth Uppal <suppal3@wisc.edu>
* :bug: Change > to >= when calculating N50 (#119)
* Update README.md
* Update README.md
* Fix Dockerfile (#123)
* :art::bug::green_heart::penguin: Fix Dockerfile. Now capable of running autometa entrypoints.
* :bug: Add procps to prevent the nextflow error "Command 'ps' required by nextflow to collect task metrics cannot be found"
* :bug: hmmpress markers so they are pre-formatted for pipeline execution
* :fire::down_arrow: Remove ndcctools dependency. :art: Instead of cloning the repo, copy the branch's current contents into the build env for docker image creation. :art: Point to requirements.txt during conda install instead of explicitly listing in docker build (remove redundancy). :bug: Add extra ampersand s.t. if an entrypoint is unavailable, the build will fail. :art: Redirect help text to /dev/null to clean up build log.
* :art: Add support for gzipped assemblies (#129) resolves KwanLab/Autometa/issue#125
* :memo: Update bug report template (#130)
* Remove --multiprocess from autometa-kmers entrypoint (#127)
* :art: Remove --multiprocess flag. Now performs multiprocessing if user provides cpus > 1.
* :green_heart::fire: Remove multiprocess arg from test_kmers.py
* Add GC content std. dev. limit and coverage std. dev. limit binning metrics (#120)
* :green_heart: Fix test_data.json creation; update filepaths. :art: Add issue-#46 feature of coverage std. dev. and GC content std. dev. binning metrics. :green_heart: Fix mocked input args for additional binning metrics parameters. :art: Update Makefile for auto-documenting help messages and add make commands.
* :memo: Add help text to unlisted commands (now listed). :fire: Remove test_environment command. :art: Change test_data command to unit_test_data. :art: Change test_entrypoints command to unit_test_entrypoints.
* :green_heart: Add command to construct unit test environment. :memo: Update install documentation to reflect environment for building the docs as well as unit tests. :fire: Remove ndcctools dependency and add nextflow dependency in requirements.txt.
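On the N50 fix (#119) above: N50 is the contig length L such that contigs of length >= L cover at least half of the total assembly. The comparison matters at the boundary; with a strict >, a cumulative sum landing exactly on the midpoint skips to the next (smaller) contig and underestimates N50. A sketch of the calculation:

```python
def calculate_n50(lengths: list) -> int:
    """Return N50 of contig lengths, using >= at the midpoint comparison."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:  # the > vs >= off-by-one fixed in #119
            return length
    return 0

# Example: lengths [6, 4, 2] -> total 12, midpoint 6.
# With >=, N50 is 6; with >, the loop would wrongly continue and return 4.
```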
* :fire::art: Remove most defaults in get_clusters(...) function
* :art: Rename test_environment command to unit_test_environment to avoid confusion
* :art: Add logic to handle exceptions when clusters are not recovered and we are attempting to add clustering metrics
* :arrow_up: Add minimum pandas version 1.1
* :art: Add 'image' command to build docker image from current branch. :art: Update commands associated with unit_test s.t. they are grouped together.
* :art: Add dirs to delete in clean command. :art: Create and install env libs with create_environment command.
* :art: Add test_environment command to Makefile. :white_check_mark: Add tests/requirements.txt file.
* :bug: fix path to requirements in test_environment command
* :bug::white_check_mark: Add make into requirements.txt and update path to tests for unit tests
* Nextflow implementation template (#118)
* :art: Add entrypoints for taxon assignment workflow. :art: Add Autometa nextflow implementation template. :art: Update majority_vote.py parameters to more easily construct taxon assignment workflow.
* :art: Comment out container directives. :art::fire: Update optional arguments to mandatory arguments (metagenome, interim, processed). :art: Prefix output files with metagenome.simpleName in their respective output directories. :art: Name main workflow AUTOMETA and call with channel. :fire: Remove handling of coverage outside of SPAdes assembly (TODO: incorporate separate COVERAGE workflow to pass into AUTOMETA). :bug: fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers).
* :art: Add nextflow config with slurm executor configuration and nextflow project details
* :bug: Add end-of-file newline
* :bug: Add missing line continuation in MARKERS command.
* :bug: Fix incorrect keyword argument in lca.py main call. :bug: Fix incorrect flag in entrypoint (MARKERS process).
* :art: Keep hmmscan output file in MARKERS
* Update gitignore with paths to ignore nextflow generated files
* :bug: Fix broken paths in SPLIT_KINGDOMS. :art: Add parameter '--outdir' to autometa-taxonomy entrypoint.
* :bug: Fix missing line continuation in BINNING
* :art: Update output paths so only binning results are in processed directory. :art: Add completeness and purity parameters to autometa.nf.
* :art: Add completeness and purity parameters to log at beginning of run
* :bug: Handle case where archaea are not recovered from metagenome
* :art: Add config file for autometa input parameters. :fire: Remove copy mode from all publishDir settings for all processes in autometa workflow. :art: Update autometa.taxonomy.vote entrypoint parameters. :green_heart: Update mocked args to be compatible with new autometa.taxonomy.vote parameters. :art: Add type hints to ncbi.py. :fire: Remove most of the redundant logic from vote.py s.t. the entrypoint is now only responsible for adding canonical ranks to voted taxids and writing out ranks split by provided rank. :art::fire: Remove hardcoded parameters and add additional parameters to allow user finer control of entire autometa workflow. :art: Add HTCondor executor profile with comments.
* :green_heart::bug::fire: Remove keyword argument 'out' from vote.add_ranks(...) func
* :art: Add params.cpus to initial info log
* :fire::bug::art: Remove unnecessary autometa prodigal wrapper. :fire: Remove GNU parallel functionality from ORFs process. This was removed because the number of ORF sequences recovered using GNU parallel was non-deterministic. This will take a hit on performance as a trade-off for determinism.
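For context on the completeness/purity parameters and the GC/coverage std. dev. limits from #120: a candidate cluster is only accepted as a bin if it passes all of the metric cutoffs. A hypothetical check under those assumptions (the cutoff defaults here are illustrative, not Autometa's):

```python
import pandas as pd

def is_valid_cluster(
    cluster_df: pd.DataFrame,  # one row per contig: 'gc_content', 'coverage'
    completeness: float,
    purity: float,
    completeness_cutoff: float = 20.0,
    purity_cutoff: float = 95.0,
    gc_stddev_limit: float = 5.0,
    cov_stddev_limit: float = 25.0,
) -> bool:
    """True only if the candidate bin passes all four metric cutoffs."""
    return (
        completeness >= completeness_cutoff
        and purity >= purity_cutoff
        and cluster_df["gc_content"].std() <= gc_stddev_limit
        and cluster_df["coverage"].std() <= cov_stddev_limit
    )
```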
* :art: Update nextflow scripts to use jason-c-kwan/autometa:dev docker image. :art: Add dockerignore to prevent unnecessary context bloat and image bloat. :fire: Remove Makeflow autometa template. :art: Move autometa.nf containing AUTOMETA workflow to nextflow directory. :arrow_up: Add minimum pandas version of 1.1. :memo: Update link to references in normalize(...) func in kmers.py. :art: Update parameters.config to reflect updated nextflow parameters. :art: Update Dockerfile with entrypoint checks autometa-taxonomy-lca and autometa-taxonomy-majority-vote. :art: Add main.nf for use with manifest as a pre-requisite for nextflow pipeline sharing through GitHub. :art: Update manifest in nextflow.config to reflect change in mainScript. :art: Add fixOwnership to docker scope in nextflow.config.
* :art: Update manifest with 'doi' and 'defaultBranch'
* :art: Update arguments for entrypoints autometa-binning and autometa-unclustered-recruitment. :art: Propagate these argument changes to nextflow processes. :green_heart: Update tests to accommodate updated arguments.
* :fire: Remove unused/unnecessary configuration scripts. :art: Move code in config/__init__.py to config/utilities.py and update respective imports to point to this file. :art: Split autometa-configure entrypoint into two entrypoints: autometa-config and autometa-update-databases. :bug: Change default markers directory to look inside default.config instead of source directory. :fire: Remove __main__.py and the autometa.py wrapper to __main__.py in exchange for using nextflow files. :arrow_up: Add diamond to requirements.txt. :bug: Modify config to point to autometa/databases after installation in Docker build. :art::memo: Add type hints across config scripts.
* :art: Apply black formatting
* :white_check_mark::art: Update call to parse_args from config.parse_args(...) to config.utilities.parse_args(...)
* :white_check_mark::bug: Update config.parse_args(...) to autometa.config.utilities.parse_args(...)
* :white_check_mark: Alias config.utilities imports to configutils. Provides access to parse_args attribute while avoiding confusion with autometa.common.utilities functions.
* :art: Update default databases retrieval logic. :bug: Remove issue of redundant executable versions being written in default.config. :bug: Fix automatically updating autometa home_dir configuration in default.config. :art: Add exception handling in parse_argparse.py to provide more debugging information.
* :white_check_mark::memo: Fix error when parsing databases argparse. :art: Remove any indentation in written argparse blocks for retrieving argparse usage.
* :art: add EOF line in dockerignore
* :bug: Fix default path to markers database in MARKERS process
* :bug: Fix incorrect option when attempting to download missing ncbi files
* :bug: Fix clean command in Makefile so it actually removes provided directories
* :art: replace only first ftp in ncbi ftp filepaths
* :art: Remove orfs filepath dependency in LCA and majority vote. :art: Change entrypoint arguments for autometa-taxonomy-lca and autometa-taxonomy-majority-vote.
* :art: Changed entrypoint parameters for autometa-length-filter. :fire: Remove unused methods in metagenome.py. :art::white_check_mark: Remove unused tests in test_metagenome; update MockedParser to reflect new entrypoint args. :art: Update nextflow LENGTH_FILTER process to accommodate new parameters. Now uses named emits (fasta, stats, gc_content). :art::memo: Add new binning metrics into parameters.config (gc_stddev_limit, cov_stddev_limit). :memo::art: Add type hints into metagenome.py.
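The "replace only first ftp" fix above is subtle because NCBI urls contain "ftp" twice, as the scheme and as the hostname. Python's `str.replace` takes a count argument for exactly this; the url below is only an illustration of the FTP-to-rsync swap described in PR #77:

```python
url = "ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"
# count=1 rewrites only the scheme, leaving the ftp.ncbi... hostname intact.
rsync_url = url.replace("ftp", "rsync", 1)
assert rsync_url == "rsync://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"
```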
* :memo: Update log with added parameters
* :bug: Fix incorrect path to default markers database in nf pipeline (location in docker image is currently hardcoded in MARKERS process). :art: Next step is for default to point to absolute path in docker image instead of relative path.
* :fire: Remove --dbdir hardcoded parameter in MARKERS process; this is now being appropriately configured in the docker image that is utilized by nextflow. :bug: Add conda channels conda-forge and bioconda to create_environment command. :art: Update Dockerfile to configure autometa databases with the DB_DIR environment variable as an absolute path (relative path may cause bugs).
* Update autometa/common/metagenome.py
* :bug: replace 'orfs' tags with the respective single input path tag
* :bug::fire: Remove --multiprocess flag from autometa-kmers command in KMERS process
* :fire: Remove duplicate dependencies
* :bug: Fix cryptic bug where imports do not work when explicit python interpreter is used in Makefile commands. :art: Add functionality to handle gzipped orfs for autometa-markers entrypoint.
* :fire: Remove Makefile from .dockerignore. :art: Use make commands from Makefile for autometa directory cleanup and install. :bug::arrow_up: Set samtools minimum version in requirements.txt; otherwise the samtools command would not work properly.
* :art: Change --output parameter to --output-binning in recursive_dbscan.py
  - :art: Add '--output-master' parameter to autometa-binning entrypoint
  - :white_check_mark: Update MockArgs to account for updated entrypoint parameters
  - :white_check_mark::art: Add args check to autometa-binning entrypoint for embed_dimensions and embed_pca_dimensions inputs
  - :art: Fix typo in kmers embed docstring
  - :art: Standardize output columns from kmers.embed(...) to 1-indexed 'x_1' to 'x_{embed_dimensions}' instead of x,y,z...
  - :bug: Add coverage and gc_content std. dev. limits to drop columns in run_hdbscan(...)
  - :art: drop columns in run_hdbscan(...) and run_dbscan(...) are now performed on one line; if the df does not contain any of the columns in dropcols, the error is ignored
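The one-line drop described in the last item relies on pandas' `errors="ignore"`: every column in dropcols that exists is dropped, and the rest are skipped silently, so no per-column existence checks are needed. For example (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x_1": [0.1, 0.2], "x_2": [0.3, 0.4], "coverage": [10.0, 12.0]})
dropcols = ["cluster", "completeness", "purity", "coverage", "gc_content"]
# Drops 'coverage'; silently ignores the dropcols entries not present in df.
features = df.drop(columns=dropcols, errors="ignore")
```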
* :fire: Remove conda install using py2.7. :fire::art: Rename references from master to main throughout nf and autometa binning scripts. :memo: Format notes in parameters.config.
* :arrow_up: Add minimum version of diamond 2.*. :green_heart: Add output_main to MockedArgs.
* :memo::art: Add copyright and short script description to all unit test files
* :art: Add autometa-parse-bed entrypoint. :art: Add READ_COVERAGE workflow in common-tasks to compute coverage from read alignments instead of SPAdes headers.
* :memo: Replace 2020 copyright with 2021 copyright. :memo::fire: Remove note on ORF calling warning and replace with contig cutoff warning. :memo: Update help text for --binning argument in unclustered_recruitment.
* :fire: Remove --do-pca argument from kmers.py. :memo: Fix help string in --norm-method in kmers.py. :art: Change --normalized to --norm-output in kmers.py. :art: Change --embedded to --embedding-output in kmers.py. :art: Change --embed-dimensions to --embedding-dimensions in kmers.py. :art: Change --embed-method to --embedding-method in kmers.py. :art: Update KMERS in common-tasks.nf to account for updated parameters. :green_heart: Update test_kmers.py MockedArgs to account for updated arguments.
* :fire::green_heart: Remove references to removed do_pca parameter. :bug: Update marker databases checksums so they correspond to md5sum. :art: sort main file output columns in autometa-binning entrypoint.
* :fire::art: Remove 'string' metavar for clustering-method arg
* :fire: Remove kmer embedding args from autometa-binning entrypoint. :art: Change KMERS.out.normalized as input for binning to KMERS.out.embedded. :green_heart: Update test_recursive_dbscan kmers fixture and mocked args to account for removed kmer parameters. :art: Add convert_dtypes method call to load(...) func for markers dataframe. :fire::art: Remove parameters for kmers in binning-tasks and update parameters to correspond to kmers args. :art: unclustered recruitment now writes output-binning with contig, cluster and recruited_cluster columns.
* :art: Add autometa-binning-summary entrypoint. :art: unclustered recruitment now writes out binning with columns 'cluster' and 'recruited_cluster'. :bug::green_heart: Fix duplicate mocks in test_recursive_dbscan(...). :art: Add BINNING_SUMMARY process in autometa.nf workflow. :art: Define BINNING_SUMMARY process in binning-tasks.nf.
* :green_heart::bug: Change broken variable main to main_df
* :green_heart::fire: Remove kmer embedding dimensions test
* :bug::fire: Remove assembly argument in get_metabin_stats(...). :green_heart::fire: Remove unused mocked dependencies in test_kmers.py. :fire::green_heart: Remove tests corresponding to old summary.py functionality.
* :green_heart: Add gc_content column to bin_df fixture in test_summary
* :memo: Add docstrings and explanation within vote.py. :art: Change vote.py argument from --input to --votes and add metavars to parser args. :green_heart: Change make_test_data.py summary data to create gc_content column instead of GC column. :green_heart: Update MockedArgs in vote.py to correspond to updated --votes parameter. :art: Replace --input argument in autometa-taxonomy for SPLIT_KINGDOMS process with --votes.
* :bug: Fix arg passed in pd.read_csv(...) for autometa.taxonomy.vote
* :racehorse: Add autometa/databases to dockerignore
* :art: Update autometa-orfs entrypoint arguments. :memo: Add type hints to autometa.common.external.prodigal funcs. :fire::art: Remove --parallel parameter from autometa-orfs; parallelism is now inferred from the --cpus arg.
* :racehorse: Ignore the ignore for autometa/databases/markers. Add test of autometa-binning-summary entrypoint.
* :bug: Replace incorrect variable (orfs) in BINNING_SUMMARY tag
* :memo: Replace old kmer parameters in log info with new parameters
* Update documentation (#121)
* :art: Added link to Automappa in examining results. :memo: Updated install for version 2.
* :memo: Add step-by-step tutorial on how to run Autometa. :fire: Remove Rest API. :art: Add docs/source/_build to .gitignore. :memo: Update autometa install guidelines; added docker to it. :memo: Add benchmarking page. :memo: Add Automappa to examining results. :art: Replaced shell with bash in parse_argparse.py. :memo: Add packages to install for developers in contributing guidelines.
* :memo: Add information regarding test datasets. :arrow_down: Remove dependency on sphinx.ext.paramout.
* :memo: Added python and R script in examining results
* Apply suggestions from code review Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* :memo: Added nextflow tutorial. :memo: Update install using Makefile. :arrow_up: Create a new file for step-by-step instructions on how to run Autometa. :memo: Update benchmarking to add steps on how to download datasets. :memo: Update contributing guidelines on how to install dependencies for unit tests and docs.
* :art: :memo: Remove Quickstart from index.rst
* :memo: Add step-by-step tutorial on how to run autometa using entrypoints. :memo: Add tutorial on how to run nextflow. :memo: Add binning figures in examining results sections. :art: :memo: Correct installation steps; now uses make for everything. :memo: Improved contribution guidelines.
* :memo: Fix table in tutorial. :memo: Add channels when using requirements.txt for autometa install.
* :art: :memo: Incorporated Evan's comments
* Apply suggestions from code review Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
* :memo: Add Advanced usage in step-by-step-tutorial. :memo: Add another column of optional or required in usage table of each step.
* First pass on nextflow documentation. Still need to edit/add more. Co-authored-by: Jason Kwan <jason.kwan@wisc.edu> Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com> Co-authored-by: chasemc <18691127+chasemc@users.noreply.github.com>
* Add feature to download google drive datasets (#138)
* Add feature to download google drive datasets Issue #110
* Add gdown to requirements.txt
* :art: Formatted script according to template, renamed variables
* :art: Changed permissions
* :art: Added unique filenames for each file size
* :art: Moved to external folder
* Moved script to validation and renamed
* Rename function and add type hints
* Add file containing fileIDs to reference
* Add user input options for files/folders
* Reformat with black
* Change targets variable name
* Change "folder" to "dataset"
* Update column names
* Condense logic into one function
* Change logic to input multiple files and multiple output dirs
* Add logger warnings
* Add datasets.py info to setup.py
* Change internet_is_connected into an import
* Add internet connection checker and error message
* Directory structure to organize downloads
* Change variable names and clean up extra bits
* Add __init__.py to validation
* Add error for non-existent dir_path
* Add detail to internet_is_connected failure
* Added NotImplementedError
* Only read csv once
* Change strategy for filtering df
* Using df.loc to retrieve file_id
* Argparse and var name refinements
* Add ability to ping custom IP
* Reformatting
* Hardcode fileID csv hosted on google drive
* Reformatting
* Remove gdown_fileIDs.csv
* Add verbose error message and dockerfile entrypoint
* Add densmap embed method and fix binning-summary cluster column bug (#176)
* :memo: Update bug report template
* :snake: Add densmap --embed-method to autometa-kmers. :memo: Add TODO comments for easy addition of denSNE when it is easily available through conda or pip installation. :bug: Change hardcoded 'cluster' column in autometa-binning-summary to cluster_col variable.
* :art: Add trimap as embedding method. :memo: Update installation instructions to use trimap. :white_check_mark: Add trimap to kmer tests. :whale: Add trimap installation to Dockerfile. :arrow_up: Add trimap requirements to requirements.txt.
* :green_apple::memo: Update parameter comments for embedding_method
* :arrow_up: Pinned umap-learn and prodigal in requirements.txt. :memo: Add comment for trimap requirement in requirements.txt.
* :fire: Remove TODO comment of densne import. :memo: Change densmap hyperlink to point to umap-learn readthedocs. :memo: Add comments on densmap and trimap in step-by-step tutorial.
* :memo: Add newline to note in advanced kmer usage b/w sksne and bhsne
* Classification and Clustering Benchmarking (#141)
* :art: Add entrypoints for taxon assignment workflow. :art: Add Autometa nextflow implementation template. :art: Update majority_vote.py parameters to more easily construct taxon assignment workflow.
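The dataset-download feature (#138) pairs a connectivity probe with gdown. A sketch of the two pieces; `internet_is_connected` here is a generic TCP probe with an overridable host (the "ping custom IP" option), and the google drive file id is a placeholder for one looked up in the fileIDs csv, not a real id:

```python
import socket

import gdown  # pip install gdown

def internet_is_connected(host: str = "8.8.8.8", port: int = 53, timeout: float = 2.0) -> bool:
    """Cheap connectivity check: attempt a TCP connection to a DNS server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if internet_is_connected():
    # The file id would come from the hardcoded fileIDs csv (placeholder below).
    gdown.download("https://drive.google.com/uc?id=<file_id>", "dataset.fna.gz", quiet=False)
else:
    raise ConnectionError("No internet connection; cannot download test datasets.")
```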
* :art: Comment out container directives. :art::fire: Update optional arguments to mandatory arguments (metagenome, interim, processed). :art: Prefix output files with metagenome.simpleName in their respective output directories. :art: Name main workflow AUTOMETA and call with channel. :fire: Remove handling of coverage outside of SPAdes assembly (TODO: incorporate separate COVERAGE workflow to pass into AUTOMETA). :bug: fix AUTOMETA:UNCLUSTERED_RECRUITMENT input where BINNING.out was emitting two outputs instead of the binning results (was also emitting embedded kmers).
* :art: Add nextflow config with slurm executor configuration and nextflow project details
* :bug: Add end-of-file newline
* :bug: Add missing line continuation in MARKERS command.
* :bug: Fix incorrect keyword argument in lca.py main call. :bug: Fix incorrect flag in entrypoint (MARKERS process).
* :art: Keep hmmscan output file in MARKERS
* Update gitignore with paths to ignore nextflow generated files
* :bug: Fix broken paths in SPLIT_KINGDOMS. :art: Add parameter '--outdir' to autometa-taxonomy entrypoint.
* :bug: Fix missing line continuation in BINNING
* :art: Update output paths so only binning results are in processed directory. :art: Add completeness and purity parameters to autometa.nf.
* :art: Add completeness and purity parameters to log at beginning of run
* :bug: Handle case where archaea are not recovered from metagenome
* :art: Add config file for autometa input parameters. :fire: Remove copy mode from all publishDir settings for all processes in autometa workflow. :art: Update autometa.t…