Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fix GTDB database setup #329

Merged
merged 15 commits into from Aug 24, 2023
Merged

🐛 Fix GTDB database setup #329

merged 15 commits into from Aug 24, 2023

Conversation

Sidduppal
Copy link
Collaborator

🐛 Fix bug with running GTDB taxonomic workflow
🐛 Fix bug with setting up gtdb database. issue328

Sidduppal and others added 3 commits April 11, 2023 17:14
* Pin sphinx to version 6
* readthedocs build now requires installing autometa using `pip` in .readthedocs.yml
* Add mocks for gdown, attrs, numpy, pandas, scipy, numba, skbio, trimap
* Pin docutils between 0.18 and 0.20
* Pin sphinx_rtd_theme to 1.2
- `autometa-binning` parameter explanation is now in the same order as the commands are input
- deprecated `--domain` has been replaced with `--rank-filter-name`
🐛 Fix bug with running GTDB taxonomic workflow
@Sidduppal Sidduppal added the bug Something isn't working label Jun 16, 2023
@Sidduppal Sidduppal self-assigned this Jun 16, 2023
@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

Why do the files need to be decompressed?

@Sidduppal
Copy link
Collaborator Author

Decompression is required to modify the fasta headers and then concatenate the sequences together for the final database.

@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

That's the .sh files /grep that was also modified?

@Sidduppal
Copy link
Collaborator Author

Sidduppal commented Jun 16, 2023

Yes, that's an independent bug fix that I found during some testing. It adds an underscore after the orf ID, preventing any partial matches.

@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

Possible to use zgrep on still gzip'ed files instead?

@jason-c-kwan
Copy link
Collaborator

Or pipe the output from zcat, or use the gzip module in Python

@Sidduppal
Copy link
Collaborator Author

@chasemc

Possible to use zgrep on still gzip'ed files instead?

We are doing file modification on the files to change the FASTA headers. zgrep would just get the header but modifying it would be hard if it's not unzipped. It was also require needing to use external subprocesses and not internal python modules.

@jason-c-kwan

Or pipe the output from zcat, or use the gzip module in Python

I am currently using the gzip module in python for file manipulation.
I believe a possible scenario could be there where you use zcat modify the header and then it, but I don't think how efficient it would be as compared to unzipping the files which takes around 5-10 min.

@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

My confusion and suggestion of zgrep was because I thought the edits were related (in the future try to only fix a single thing in a PR or at least separate out the commits)

My question is then the same as Jason's- is there a reason not to just read the files using the gzip module rather than decompress, write and then read back in

Is the following code's single purpose to read some fasta files, edit the identifier and then concatenate into a single file?
https://github.com/KwanLab/Autometa/blob/255066a2cdd9ed9371a2b68a344a269adee56554/autometa/taxonomy/gtdb.py#L57C2-L103

" single purpose" meaning no other code relies on any of the extracted files

@jason-c-kwan
Copy link
Collaborator

Yeah seems like there are too many steps.

  1. Get protein accession from filepath (can be done on gz)
  2. Open combined gz file for writing with gzip module
  3. Open each component file with gzip, write line to output gzip, change header line as appropriate
  4. Close output file.
    Not above, all files can remain gzipped, but you are effectively copying all the input faa files into a combined file. Is this necessary?

@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

Just to comment before I leave for the weekend...
If the answer is yes, if possible, probably best to read the desired files (filename match) directly from the tar, edit the header/id while reading and write directly into the concatenated file. Note: I'm not familiar with this section of code function and I don't know the structure of the tar file so this may or may not be a good suggestion

@chasemc
Copy link
Member

chasemc commented Jun 16, 2023

Hit submit before seeing @jason-c-kwan responded

@evanroyrees evanroyrees changed the title 🐛 Fix bug #Issue 328 🐛 Fix GTDB database setup and taxon-binning workflow Jul 8, 2023
@evanroyrees evanroyrees linked an issue Jul 8, 2023 that may be closed by this pull request
Copy link
Collaborator

@evanroyrees evanroyrees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like some formatting should be done prior to this being merged (details in the comments). Otherwise I had just a few questions & suggestions.

autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
workflows/autometa-large-data-mode.sh Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
autometa/taxonomy/gtdb.py Outdated Show resolved Hide resolved
workflows/autometa.sh Outdated Show resolved Hide resolved
shaneroesemann and others added 3 commits July 31, 2023 13:46
commit @Sidduppal  additions and @WiscEvan  suggestions minus changes moved to separate PR

Co-authored-by: Evan Rees <25933122+WiscEvan@users.noreply.github.com>
shaneroesemann added a commit that referenced this pull request Aug 1, 2023
@evanroyrees evanroyrees self-requested a review August 1, 2023 21:47
Copy link
Collaborator

@evanroyrees evanroyrees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

shaneroesemann and others added 2 commits August 10, 2023 11:03
* implements changes by @sidd in issue #329 in a separate new PR

* add pre commit hook to remove unused imports

* 🎨💚 removed sed/cut changes that belong to another PR
- 💚🐳🔥⬆️   Remove pins for scipy, scikit-learn and joblib
- 💚 🐳 Add build schedule for Autometa docker images
    > This will help to more quickly identify when builds begin failing
    > Add `nightly` tag for scheduled build
- 🐳 change user workdir to `/Autometa`
@bheimbu
Copy link

bheimbu commented Aug 23, 2023

Hi,

I'd like to use the gtdb database, but I'm not able to build it. Any news when this will be fixed?

Cheers Bastian

@evanroyrees
Copy link
Collaborator

Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600)

@chasemc
Copy link
Member

chasemc commented Aug 23, 2023

Rolling back cython as suggested by some comments in

Looks like the tests are failing due to a recent issue with hdbscan and cython (scikit-learn-contrib/hdbscan#600)

didn't work. .

@evanroyrees evanroyrees changed the title 🐛 Fix GTDB database setup and taxon-binning workflow 🐛 Fix GTDB database setup Aug 23, 2023
if line.startswith(">"):
seqheader = line.lstrip(">")
line = f"\n>{acc} {seqheader}"
f_out.write(line)
outline = f"\n>{acc} {seqheader}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor quibble. It is convention to put the newline character at the end of the line.
The newline character at the beginning looks kind of odd 🤷

Suggested change
outline = f"\n>{acc} {seqheader}"
outline = f">{acc} {seqheader}\n"

This would also require changing

# from
seqheader = line.lstrip(">")
# to
seqheader = line.lstrip(">").strip()

@evanroyrees evanroyrees merged commit 737fa70 into main Aug 24, 2023
1 of 4 checks passed
@evanroyrees evanroyrees deleted the sidd/gtdb-hotfix branch August 24, 2023 16:26
evanroyrees pushed a commit that referenced this pull request Aug 24, 2023
Append underscore to contig id to prevent partial matches

See also: #329 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

autometa-update-databases, error building GTDB diamond database
7 participants