Merge pull request #504 from Ecogenomics/staging
2.3.0
pchaumeil committed May 9, 2023
2 parents e41c38d + 2aec9ac commit c3597ba
Showing 31 changed files with 709 additions and 471 deletions.
7 changes: 0 additions & 7 deletions .github/workflows/master-push.yml
@@ -33,13 +33,6 @@ jobs:
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
- name: Extract config values
working-directory: ${{ github.workspace }}/master/gtdbtk/config
run: |
grep AF_THRESHOLD config.py > config2.py
grep PPLACER_MIN_RAM_BAC_FULL config.py >> config2.py
mv config2.py config.py
- name: Build documentation
working-directory: ${{ github.workspace }}/master/docs
run: make html
10 changes: 3 additions & 7 deletions README.md
@@ -37,13 +37,9 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB

## ✨ New Features

GTDB-Tk v2.2.0+ includes the following new features:
- GTDB-TK `classify` and `classify_wf` have changed in version 2.2.0+. There is now an ANI classification stage (`ANI screen`) that precedes classification by placement in a reference tree.
- **This is now the default behavior for `classify` and `classify_wf`.**
- In `classify`, user genomes are first compared against a Mash database comprised of all GTDB representative genomes and genome pairs of sufficient similarity processed by FastANI. User genomes classified to a GTDB representative based on FastANI results are not run through pplacer.
- In the `classify_wf` workflow, genomes are classified using Mash and FastANI before executing the identify step. User genomes classified with FastANI are not run through the remainder of the pipeline (identify, align, classify).
- `classify_wf` and `classify` have now **an extra mutually exclusive required argument**: You can either pick `--skip_ani_screen` (to skip the ani_screening step to classify genomes using mash and FastANI) or `--mash_db` path to save/read (if exists) the Mash reference sketch database.
- To classify genomes without the additional `ani_screen` step, use the `--skip_ani_screen` flag.
GTDB-Tk v2.3.0+ includes the following new features:
- New `convert_to_species` function to convert GTDB genome IDs to GTDB species names


## 📈 Performance
Using the ANI screen can reduce computation by more than 50%, although the benefit depends on the set of input genomes. A set consisting primarily of new species will not benefit as much as a set of genomes that are largely assigned to GTDB species clusters. In the latter case, the ANI screen reduces the number of genomes that need to be classified by pplacer, which reduces computation time substantially (between 25% and 60% in our testing).
19 changes: 19 additions & 0 deletions docs/src/announcements.rst
@@ -1,6 +1,25 @@
Announcements
=============

GTDB-Tk 2.3.0 available
-----------------------

*May 09, 2023*

* GTDB-Tk version ``2.3.0`` is now available.
* This version of GTDB-Tk **does not** require a new version of the GTDB-Tk reference package.


GTDB R214 available
-------------------

*April 28, 2023*

* GTDB Release 214 is now available and will be used from version ``2.3.0`` and up.
* This version of GTDB-Tk is compatible with both release207 and release214 of the GTDB-Tk reference package.
`gtdbtk_r214_data.tar.gz <https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/>`_.


GTDB-Tk 2.2.0 available
-----------------------

15 changes: 15 additions & 0 deletions docs/src/changelog.rst
@@ -2,6 +2,21 @@
Change log
==========

2.3.0
-----

Bug Fixes:

* (`#508 <https://github.com/Ecogenomics/GTDBTk/issues/508>`_) (`#509 <https://github.com/Ecogenomics/GTDBTk/issues/509>`_) If **ALL** genomes for a specific domain are either filtered out or classified with ANI they are now reported in the summary file.

Minor changes:

* (`#491 <https://github.com/Ecogenomics/GTDBTk/issues/491>`_) (`#498 <https://github.com/Ecogenomics/GTDBTk/issues/498>`_) Allow GTDB-Tk to show ``--help`` and ``-v`` without ``GTDBTK_DATA_PATH`` being set.
* WARNING: This is a breaking change if you import GTDB-Tk as a library and read values from ``gtdbtk.config.config``. Instead, import with ``from gtdbtk.config.common import CONFIG`` and access values via ``CONFIG.<var>``.
* (`#508 <https://github.com/Ecogenomics/GTDBTk/issues/508>`_) The Mash distance threshold is changed from 0.1 to 0.15. This will increase the number of FastANI comparisons but will cover cases where genomes have a larger Mash distance but a small ANI.
* (`#497 <https://github.com/Ecogenomics/GTDBTk/issues/497>`_) Add a ``convert_to_species`` function in GTDB-Tk to replace GCA/GCF ids with their GTDB species name
* Add ``--db_version`` flag to ``check_install`` to check the version of previous GTDB-Tk packages.
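
The breaking change above can be illustrated with a minimal sketch of a configuration object whose values are resolved only when accessed. This is an assumption about the general pattern, not the actual GTDB-Tk implementation: the class ``LazyConfig``, the attribute ``GENERIC_PATH``, the ``AF_THRESHOLD`` value, and the taxonomy file location are all hypothetical. Only the import paths in the comments come from the changelog.

```python
import os

# Per the changelog, library users would migrate from
#   from gtdbtk.config.config import TAXONOMY_FILE
# to
#   from gtdbtk.config.common import CONFIG   # then use CONFIG.TAXONOMY_FILE


class LazyConfig:
    """Hypothetical stand-in: values are resolved on access, not at import."""

    # Illustrative constant that needs no reference data.
    AF_THRESHOLD = 0.5

    @property
    def GENERIC_PATH(self):
        # The environment is read only when a caller needs the path,
        # so importing this module never fails on a missing variable.
        try:
            return os.environ["GTDBTK_DATA_PATH"]
        except KeyError:
            raise RuntimeError("GTDBTK_DATA_PATH is not set") from None

    @property
    def TAXONOMY_FILE(self):
        # Illustrative location inside the reference package.
        return os.path.join(self.GENERIC_PATH, "taxonomy", "gtdb_taxonomy.tsv")


CONFIG = LazyConfig()
```

With this shape, a command such as ``--help`` can import the module and read constants without touching the reference data, while any code path that needs a file fails with a clear error if the variable is missing.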

2.2.6
-----

20 changes: 12 additions & 8 deletions docs/src/installing/index.rst
@@ -33,12 +33,12 @@ Hardware requirements
- Storage
- Time
* - Archaea
- ~34 GB
- ~65 GB
- ~45 GB
- ~85 GB
- ~1 hour / 1,000 genomes @ 64 CPUs
* - Bacteria
- ~55GB (320 GB when using --full_tree)
- ~65 GB
- ~65GB (410 GB when using --full_tree)
- ~85 GB
- ~1 hour / 1,000 genomes @ 64 CPUs

.. note::
@@ -117,13 +117,13 @@ Please cite these tools if you use GTDB-Tk in your work.
GTDB-Tk reference data
----------------------

GTDB-Tk requires ~66G of external data that needs to be downloaded and unarchived:
GTDB-Tk requires ~84G of external data that needs to be downloaded and unarchived:

.. code-block:: bash
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz (or, mirror)
tar xvzf gtdbtk_v2_data.tar.gz
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_data.tar.gz (or, mirror)
tar xvzf gtdbtk_data.tar.gz
.. note:: Different versions of the GTDB release data may not run on all versions of GTDB-Tk; check the supported versions.
@@ -137,6 +137,10 @@ GTDB-Tk requires ~66G of external data that needs to be downloaded and unarchive
- Minimum version
- Maximum version
- MD5
* - `R214 <https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz>`_
- 2.1.0
- Current
- 630745840850c532546996b22da14c27
* - `R207_v2 <https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz>`_
- 2.1.0
- Current
2 changes: 1 addition & 1 deletion gtdbtk/__init__.py
@@ -29,4 +29,4 @@
__status__ = 'Production'
__title__ = 'GTDB-Tk'
__url__ = 'https://github.com/Ecogenomics/GTDBTk'
__version__ = '2.2.6'
__version__ = '2.3.0'
13 changes: 7 additions & 6 deletions gtdbtk/__main__.py
@@ -49,12 +49,13 @@ def print_help():
decorate -> Decorate tree with GTDB taxonomy
Tools:
infer_ranks -> Establish taxonomic ranks of internal nodes using RED
ani_rep -> Calculates ANI to GTDB representative genomes
trim_msa -> Trim an untrimmed MSA file based on a mask
export_msa -> Export the untrimmed archaeal or bacterial MSA file
remove_labels -> Remove labels (bootstrap values, node labels) from an Newick tree
convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree
infer_ranks -> Establish taxonomic ranks of internal nodes using RED
ani_rep -> Calculates ANI to GTDB representative genomes
trim_msa -> Trim an untrimmed MSA file based on a mask
export_msa -> Export the untrimmed archaeal or bacterial MSA file
remove_labels -> Remove labels (bootstrap values, node labels) from a Newick tree
convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree
convert_to_species -> Convert GTDB genome IDs to GTDB species names
Testing:
7 changes: 3 additions & 4 deletions gtdbtk/ani_rep.py
@@ -6,8 +6,7 @@
from gtdbtk.biolib_lite.common import canonical_gid
from gtdbtk.biolib_lite.execute import check_dependencies
from gtdbtk.biolib_lite.taxonomy import Taxonomy
from gtdbtk.config.config import (TAXONOMY_FILE,
AF_THRESHOLD)
from gtdbtk.config.common import CONFIG
from gtdbtk.config.output import DIR_ANI_REP_INT_MASH
from gtdbtk.exceptions import GTDBTkExit
from gtdbtk.external.fastani import FastANI
@@ -76,7 +75,7 @@ def run(self, genomes, no_mash, mash_d, out_dir, prefix, mash_k, mash_v, mash_s,
prefix, mash_k, mash_v,
mash_s, max_mash_dist, mash_db=mash_db)

taxonomy = Taxonomy().read(TAXONOMY_FILE, canonical_ids=True)
taxonomy = Taxonomy().read(CONFIG.TAXONOMY_FILE, canonical_ids=True)
ani_summary_file = ANISummaryFile(out_dir, prefix, fastani_results, taxonomy)
ani_summary_file.write()
ANIClosestFile(out_dir,
@@ -269,7 +268,7 @@ def _write(self):
fh.write(f'{gid}\t{ref_gid}')
fh.write(f'\t{closest_ani}\t{closest_af}')
fh.write(f'\t{taxonomy_str}')
fh.write(f'\t{closest_ani >= gtdb_ani_radius and closest_af >= AF_THRESHOLD}\n')
fh.write(f'\t{closest_ani >= gtdb_ani_radius and closest_af >= CONFIG.AF_THRESHOLD}\n')
else:
fh.write(f'{gid}\tno result\tno result\tno result\tno result\tno result\n')
else:
9 changes: 4 additions & 5 deletions gtdbtk/ani_screen.py
@@ -8,9 +8,8 @@
from gtdbtk.classify import Classify
from gtdbtk.config.output import DIR_ANISCREEN

from gtdbtk.config.config import (TAXONOMY_FILE, MASH_SKETCH_FILE, AF_THRESHOLD)
from gtdbtk.files.gtdb_radii import GTDBRadiiFile

from gtdbtk.config.common import CONFIG

class ANIScreener(object):
"""Computes a list of genomes to a list of representatives."""
@@ -39,7 +38,7 @@ def run_aniscreen(self,genomes, no_mash,out_dir,prefix, mash_k, mash_v, mash_s,
if mash_db.endswith('/'):
make_sure_path_exists(mash_db)
if os.path.isdir(mash_db):
mash_db = os.path.join(mash_db, MASH_SKETCH_FILE)
mash_db = os.path.join(mash_db, CONFIG.MASH_SKETCH_FILE)

# set mash_d to mash_max_dist so the user cannot run Mash with impossible values
mash_d = mash_max_dist
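
The role of the Mash distance cutoff set here can be sketched on toy data. The changelog in this commit states that the cutoff moves from 0.1 to 0.15, which sends more genome pairs on to FastANI. The helper ``pairs_for_fastani``, the genome names, and the distances below are hypothetical, and the inclusive comparison is an assumption rather than a statement about how GTDB-Tk or Mash applies the cutoff.

```python
def pairs_for_fastani(mash_distances, max_dist):
    """Keep (query, reference) pairs whose Mash distance is within max_dist."""
    return [pair for pair, dist in mash_distances.items() if dist <= max_dist]


# Made-up Mash distances between user genomes and GTDB representatives.
mash_distances = {
    ("genome_1", "ref_A"): 0.04,
    ("genome_1", "ref_B"): 0.12,
    ("genome_2", "ref_A"): 0.14,
    ("genome_2", "ref_C"): 0.21,
}

old = pairs_for_fastani(mash_distances, 0.10)  # cutoff before 2.3.0
new = pairs_for_fastani(mash_distances, 0.15)  # cutoff from 2.3.0

print(len(old), len(new))  # 1 3
```

In this toy set the wider cutoff triples the number of pairs passed to the more expensive ANI step, which is the trade-off the changelog describes.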
@@ -49,7 +48,7 @@ def run_aniscreen(self,genomes, no_mash,out_dir,prefix, mash_k, mash_v, mash_s,
fastani_results = ani_rep.run_mash_fastani(genomes, no_mash, mash_d, os.path.join(out_dir, DIR_ANISCREEN),
prefix, mash_k, mash_v, mash_s, mash_max_dist, mash_db)

taxonomy = Taxonomy().read(TAXONOMY_FILE, canonical_ids=True)
taxonomy = Taxonomy().read(CONFIG.TAXONOMY_FILE, canonical_ids=True)

mash_classified_user_genomes = self.sort_fastani_ani_screen(
fastani_results,taxonomy)
@@ -88,7 +87,7 @@ def sort_fastani_ani_screen(self,fastani_results,taxonomy,bac_ar_diff=None):
# sort the dictionary by ani then af
for gid in fastani_results.keys():
thresh_results = [(ref_gid, hit) for (ref_gid, hit) in fastani_results[gid].items() if
hit['af'] >= AF_THRESHOLD and hit['ani'] >= self.gtdb_radii.get_rep_ani(
hit['af'] >= CONFIG.AF_THRESHOLD and hit['ani'] >= self.gtdb_radii.get_rep_ani(
canonical_gid(ref_gid))]
all_results = [(ref_gid, hit) for (ref_gid, hit) in fastani_results[gid].items()]
closest = sorted(thresh_results, key=lambda x: (-x[1]['ani'], -x[1]['af']))
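
The filter-then-sort step in this hunk can be tried in isolation. The sketch below mirrors the two conditions (alignment fraction at or above a threshold, ANI at or above the representative's radius) and the descending sort on ANI and then AF. The function ``closest_hits``, the threshold value, the radii, and the hit data are all illustrative assumptions; the real code reads the threshold from ``CONFIG.AF_THRESHOLD`` and the radii from ``self.gtdb_radii``.

```python
AF_THRESHOLD = 0.5  # illustrative value, not read from GTDB-Tk


def closest_hits(hits, radii, af_threshold=AF_THRESHOLD):
    """Return (ref_gid, hit) pairs that pass both cutoffs, best hit first."""
    passing = [
        (ref_gid, hit)
        for ref_gid, hit in hits.items()
        if hit["af"] >= af_threshold and hit["ani"] >= radii[ref_gid]
    ]
    # Negating both keys sorts by highest ANI, with ties broken by highest AF.
    return sorted(passing, key=lambda x: (-x[1]["ani"], -x[1]["af"]))


# Made-up FastANI results for one user genome.
hits = {
    "GCF_A": {"ani": 97.2, "af": 0.80},
    "GCF_B": {"ani": 98.1, "af": 0.40},  # fails the AF threshold
    "GCF_C": {"ani": 97.2, "af": 0.90},
    "GCF_D": {"ani": 94.0, "af": 0.95},  # fails the ANI radius
}
radii = {gid: 95.0 for gid in hits}

ranked = closest_hits(hits, radii)
print([gid for gid, _ in ranked])  # ['GCF_C', 'GCF_A']
```

``GCF_B`` has the highest ANI but is excluded because its alignment fraction is too low, which shows why both conditions are checked before ranking.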
6 changes: 3 additions & 3 deletions gtdbtk/biolib_lite/logger.py
@@ -23,7 +23,7 @@

from tqdm import tqdm

from gtdbtk.config.config import LOG_TASK
from gtdbtk.config.common import CONFIG
from .common import make_sure_path_exists


@@ -128,7 +128,7 @@ class SpecialFormatter(logging.Formatter):
datefmt="%Y-%m-%d %H:%M:%S")

def format(self, record):
if record.levelno == LOG_TASK:
if record.levelno == CONFIG.LOG_TASK:
return self.task_fmt.format(record)
if record.levelno >= logging.ERROR:
return self.err_fmt.format(record)
@@ -162,7 +162,7 @@ class ColourlessFormatter(SpecialFormatter):

def format(self, record):
record.msg = self.ansi_escape.sub('', record.msg)
if record.levelno == LOG_TASK:
if record.levelno == CONFIG.LOG_TASK:
return self.task_fmt.format(record)
if record.levelno >= logging.ERROR:
return self.err_fmt.format(record)