🐛 Fix GTDB database setup #329

Sidduppal · 2023-06-16T15:49:13Z

🐛 Fix bug with running GTDB taxonomic workflow
🐛 Fix bug with setting up gtdb database. issue328

* Pin sphinx to version 6
* readthedocs build now requires installing autometa using `pip` in .readthedocs.yml
* Add mocks for gdown, attrs, numpy, pandas, scipy, numba, skbio, trimap
* Pin docutils between 0.18 and 0.20
* Pin sphinx_rtd_theme to 1.2

- `autometa-binning` parameter explanation is now in the same order as the commands are input
- deprecated `--domain` has been replaced with `--rank-filter-name`

chasemc · 2023-06-16T16:25:41Z

Why do the files need to be decompressed?

Sidduppal · 2023-06-16T16:41:19Z

Decompression is required to modify the fasta headers and then concatenate the sequences together for the final database.

chasemc · 2023-06-16T17:02:52Z

Is that the .sh files / grep that were also modified?

Sidduppal · 2023-06-16T17:09:51Z

Yes, that's an independent bug fix that I found during some testing. It adds an underscore after the orf ID, preventing any partial matches.
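
To illustrate the partial-match problem described above, here is a minimal Python sketch. The actual fix lives in the workflow `.sh` files; the function names and the `<contig>_<orf number>` ID layout below are assumptions made for illustration only.

```python
from typing import List


def orfs_naive(contig_id: str, orf_ids: List[str]) -> List[str]:
    # Matches on the bare contig ID, so "contig_1" also matches "contig_10_1".
    return [orf for orf in orf_ids if orf.startswith(contig_id)]


def orfs_delimited(contig_id: str, orf_ids: List[str], delimiter: str = "_") -> List[str]:
    # Appending the delimiter anchors the match at the end of the contig ID.
    pattern = f"{contig_id}{delimiter}"
    return [orf for orf in orf_ids if orf.startswith(pattern)]


# Hypothetical IDs in "<contig>_<orf number>" form.
orf_ids = ["contig_1_1", "contig_1_2", "contig_10_1", "contig_11_3"]

print(orfs_naive("contig_1", orf_ids))      # all four IDs, two of them wrong
print(orfs_delimited("contig_1", orf_ids))  # only contig_1_1 and contig_1_2
```

The same idea applies to a grep pattern: terminating the ID with the delimiter stops a shorter ID from matching as a prefix of a longer one.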

chasemc · 2023-06-16T17:14:59Z

Possible to use zgrep on still gzip'ed files instead?

jason-c-kwan · 2023-06-16T18:44:05Z

Or pipe the output from zcat, or use the gzip module in Python

Sidduppal · 2023-06-16T20:15:02Z

@chasemc

> Possible to use zgrep on still gzip'ed files instead?

We are modifying the files to change the FASTA headers. zgrep would just get the header, but modifying it would be hard if the file is not unzipped. It would also require external subprocesses rather than internal Python modules.

@jason-c-kwan

> Or pipe the output from zcat, or use the gzip module in Python

I am currently using the gzip module in Python for file manipulation.
One possible approach would be to use zcat, modify the header, and then write it back out, but I don't know how efficient that would be compared to unzipping the files, which takes around 5-10 min.

chasemc · 2023-06-16T21:37:34Z

My confusion and suggestion of zgrep came from thinking the edits were related (in the future, try to fix only a single thing in a PR, or at least separate out the commits).

My question is then the same as Jason's: is there a reason not to just read the files using the gzip module, rather than decompress, write, and then read back in?

Is the following code's single purpose to read some fasta files, edit the identifier and then concatenate into a single file?
https://github.com/KwanLab/Autometa/blob/255066a2cdd9ed9371a2b68a344a269adee56554/autometa/taxonomy/gtdb.py#L57C2-L103

"Single purpose" meaning no other code relies on any of the extracted files.

jason-c-kwan · 2023-06-16T21:53:53Z

Yeah, seems like there are too many steps.

1. Get protein accession from filepath (can be done on gz)
2. Open combined gz file for writing with gzip module
3. Open each component file with gzip, write line to output gzip, change header line as appropriate
4. Close output file.

Note: above, all files can remain gzipped, but you are effectively copying all the input faa files into a combined file. Is this necessary?
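
The steps listed above could be sketched roughly as follows. This is not the code in `autometa/taxonomy/gtdb.py`; the function names and the `<accession>_protein.faa.gz` filename layout are assumptions for illustration.

```python
import gzip
from pathlib import Path
from typing import Iterable


def accession_from_path(path: Path) -> str:
    # Assumed (hypothetical) naming scheme: "<accession>_protein.faa.gz".
    return path.name.replace("_protein.faa.gz", "")


def combine_gzipped_faa(faa_paths: Iterable[Path], combined: Path) -> int:
    """Stream gzipped FASTA files into one gzipped file.

    Each header is prefixed with the accession taken from the filename.
    Returns the number of sequences written.
    """
    n_seqs = 0
    with gzip.open(combined, "wt") as f_out:
        for path in faa_paths:
            acc = accession_from_path(path)
            with gzip.open(path, "rt") as f_in:
                for line in f_in:
                    if line.startswith(">"):
                        seqheader = line.lstrip(">").strip()
                        line = f">{acc} {seqheader}\n"
                        n_seqs += 1
                    else:
                        # Guard against an input file with no final newline.
                        line = line.rstrip("\n") + "\n"
                    f_out.write(line)
    return n_seqs
```

Nothing is decompressed to disk here: each input is read line by line through the `gzip` module and written straight into the combined gzip file.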

chasemc · 2023-06-16T21:58:58Z

Just to comment before I leave for the weekend...
If the answer is yes, then, if possible, it is probably best to read the desired files (filename match) directly from the tar, edit the header/id while reading, and write directly into the concatenated file. Note: I'm not familiar with what this section of code does and I don't know the structure of the tar file, so this may or may not be a good suggestion.
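
A rough sketch of the read-directly-from-the-tar idea, using only the standard `tarfile` and `gzip` modules. The structure of the real GTDB tarball is not described in this thread, so the member suffix and the function name below are assumptions, not the project's implementation.

```python
import gzip
import tarfile
from pathlib import Path


def combine_faa_from_tar(
    tar_path: Path, combined: Path, suffix: str = "_protein.faa.gz"
) -> int:
    """Copy matching gzipped FASTA members of a tar into one gzipped file.

    The member suffix is an assumed naming scheme. Returns the number of
    sequences written.
    """
    n_seqs = 0
    with tarfile.open(tar_path, "r:*") as tar, gzip.open(combined, "wt") as f_out:
        for member in tar:
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            acc = Path(member.name).name[: -len(suffix)]
            # extractfile() returns a file-like object; nothing is written to disk.
            raw = tar.extractfile(member)
            with gzip.open(raw, "rt") as f_in:
                for line in f_in:
                    if line.startswith(">"):
                        line = f">{acc} {line.lstrip('>').strip()}\n"
                        n_seqs += 1
                    else:
                        line = line.rstrip("\n") + "\n"
                    f_out.write(line)
    return n_seqs
```

Whether this is faster than extracting first depends on the archive layout and size, which would need to be measured.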

chasemc · 2023-06-16T21:59:55Z

Hit submit before seeing @jason-c-kwan responded

evanroyrees

It looks like some formatting should be done prior to this being merged (details in the comments). Otherwise I had just a few questions & suggestions.

autometa/taxonomy/gtdb.py

workflows/autometa-large-data-mode.sh

autometa/taxonomy/gtdb.py

workflows/autometa.sh

@Sidduppal

commit @Sidduppal additions and @WiscEvan suggestions minus changes moved to separate PR Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>

…into sidd/gtdb-hotfix

evanroyrees

👍

@sidd

* implements changes by @sidd in issue #329 in a separate new PR
* add pre commit hook to remove unused imports
* 🎨💚 removed sed/cut changes that belong to another PR

- 💚🐳🔥⬆️ Remove pins for scipy, scikit-learn and joblib
- 💚 🐳 Add build schedule for Autometa docker images
  > This will help to more quickly identify when builds begin failing
  > Add `nightly` tag for scheduled build
- 🐳 change user workdir to `/Autometa`

bheimbu · 2023-08-23T07:37:28Z

Hi,

I'd like to use the gtdb database, but I'm not able to build it. Any news when this will be fixed?

Cheers Bastian

evanroyrees · 2023-08-23T15:00:57Z

Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600)

see: #329 (comment) scikit-learn-contrib/hdbscan#607

See: #329 (comment) scikit-learn-contrib/hdbscan#607

chasemc · 2023-08-23T18:37:13Z

Rolling back cython as suggested by some comments in

> Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600)

didn't work.

evanroyrees · 2023-08-24T15:23:59Z

autometa/taxonomy/gtdb.py

                    if line.startswith(">"):
                        seqheader = line.lstrip(">")
-                        line = f"\n>{acc} {seqheader}"
-                    f_out.write(line)
+                        outline = f"\n>{acc} {seqheader}"


Just a minor quibble. It is convention to put the newline character at the end of the line.
The newline character at the beginning looks kind of odd 🤷

Suggested change

-outline = f"\n>{acc} {seqheader}"
+outline = f">{acc} {seqheader}\n"

This would also require changing

# from
seqheader = line.lstrip(">")
# to
seqheader = line.lstrip(">").strip()
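
A small sketch of why the `.strip()` is needed once the newline moves to the end of the header. The function name, accession and header text are made up for illustration.

```python
def rewrite_header(line: str, acc: str) -> str:
    """Prefix a FASTA header line with an accession, ending in one newline."""
    # .strip() drops the newline that lstrip(">") leaves behind, so the
    # f-string below is the only place a newline is added.
    seqheader = line.lstrip(">").strip()
    return f">{acc} {seqheader}\n"


# Hypothetical accession and header, for illustration only.
print(repr(rewrite_header(">WP_000001.1 hypothetical protein\n", "GCF_000001.1")))
```

Without the `.strip()`, the original trailing newline would stay in `seqheader` and every rewritten header would be followed by a blank line.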

Append underscore to contig id to prevent partial matches See also: #329 (comment)

Sidduppal and others added 3 commits April 11, 2023 17:14

🧑‍🔧 📝 Fix docs (#323)

148f490

Reorder autometa-binning parameters in step-by-step tutorial (#314)

32f44d0

🐛 Fix bug with setting up gtdb database

255066a

🐛 Fix bug with running GTDB taxonomic workflow

Sidduppal added the bug Something isn't working label Jun 16, 2023

Sidduppal requested review from jason-c-kwan, chasemc and evanroyrees June 16, 2023 15:49

Sidduppal self-assigned this Jun 16, 2023

Sidduppal mentioned this pull request Jun 16, 2023

autometa-update-databases, error building GTDB diamond database #328

Closed

Sidduppal added 2 commits July 7, 2023 11:34

Created database without unzipping the files

0c46f35

test

3ddb0f3

evanroyrees changed the title ~~🐛 Fix bug #Issue 328~~ 🐛 Fix GTDB database setup and taxon-binning workflow Jul 8, 2023

evanroyrees linked an issue Jul 8, 2023 that may be closed by this pull request

autometa-update-databases, error building GTDB diamond database #328

Closed

evanroyrees requested changes Jul 8, 2023

shaneroesemann and others added 3 commits July 31, 2023 13:46

Apply suggestions from code review

1134acc

commit @Sidduppal additions and @WiscEvan suggestions minus changes moved to separate PR Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>

revert cut/sed edits

40be5a3

Merge branch 'sidd/gtdb-hotfix' of https://github.com/KwanLab/Autometa into sidd/gtdb-hotfix

f1ae6e2

shaneroesemann added a commit that referenced this pull request Aug 1, 2023

sed/cut changes mentioned in issue #329

5e0ee9f

evanroyrees self-requested a review August 1, 2023 21:47

🐍🔥🎨 Remove unused imports, apply black formatting

c35d0ab

evanroyrees approved these changes Aug 2, 2023

shaneroesemann added a commit that referenced this pull request Aug 4, 2023

implements changes by @sidd in issue #329 in a separate new PR

a2ae5d7

shaneroesemann mentioned this pull request Aug 9, 2023

🐛🐚 Fix GTDB taxon-binning workflow #339

Merged

shaneroesemann and others added 2 commits August 10, 2023 11:03

🎨 🍏 Issue 330 redo (#338)

f13ee91

evanroyrees added 2 commits August 23, 2023 09:25

Merge branch 'dev' of github.com:KwanLab/Autometa into sidd/gtdb-hotfix

883eeed

💚⬆️🔥 Remove pinned dependencies in test env

972ddb5

chasemc added a commit that referenced this pull request Aug 23, 2023

attempt to fix tests by pinning cython

c25a8b0

see: #329 (comment) scikit-learn-contrib/hdbscan#607

chasemc added a commit that referenced this pull request Aug 23, 2023

Pin cython until hdbscan is fixed

1bb702d

See: #329 (comment) scikit-learn-contrib/hdbscan#607

chasemc force-pushed the sidd/gtdb-hotfix branch from a78f0ce to 972ddb5 Compare August 23, 2023 18:34

evanroyrees changed the title ~~🐛 Fix GTDB database setup and taxon-binning workflow~~ 🐛 Fix GTDB database setup Aug 23, 2023

evanroyrees reviewed Aug 24, 2023

Address @evan's comments

58b1b5a

Sidduppal requested a review from evanroyrees August 24, 2023 15:56

🔥 Revert dev branch changes

b653e52

evanroyrees merged commit 737fa70 into main Aug 24, 2023
1 of 4 checks passed

evanroyrees deleted the sidd/gtdb-hotfix branch August 24, 2023 16:26

evanroyrees pushed a commit that referenced this pull request Aug 24, 2023

🐛🐚 Fix GTDB taxon-binning workflow (#339)

c8f142c

Append underscore to contig id to prevent partial matches See also: #329 (comment)
