Fixing lost ref edits
akrinos committed Dec 17, 2020
1 parent aa11d56 commit 756b33c
Showing 2 changed files with 18 additions and 18 deletions.
30 changes: 15 additions & 15 deletions paper/paper.bib
@@ -319,9 +319,9 @@ @misc{hillebrand2018climate
}

@article{kent2002blat,
-title={BLAT—the BLAST-like alignment tool},
+title={{BLAT—the BLAST-like alignment tool}},
author={Kent, W James},
-journal={Genome research},
+journal={{Genome Research}},
doi={10.1101/gr.229202},
volume={12},
number={4},
@@ -331,9 +331,9 @@ @article{kent2002blat
}

@article{zhang2015tara,
-title={The Tara Oceans project: new opportunities and greater challenges ahead},
+title={{The Tara Oceans project: new opportunities and greater challenges ahead}},
author={Zhang, Houjin and Ning, Kang},
-journal={Genomics, proteomics \& bioinformatics},
+journal={{Genomics, Proteomics \& Bioinformatics}},
doi={10.1016/j.gpb.2015.08.003},
volume={13},
number={5},
@@ -345,7 +345,7 @@ @article{zhang2015tara
@article{tully2018reconstruction,
title={The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans},
author={Tully, Benjamin J and Graham, Elaina D and Heidelberg, John F},
-journal={Scientific data},
+journal={{Scientific Data}},
doi={10.1038/sdata.2017.203},
volume={5},
pages={170203},
@@ -366,9 +366,9 @@ @article{vaser2016sword
}

@article{buchfink2015fast,
-title={Fast and sensitive protein alignment using DIAMOND},
+title={{Fast and sensitive protein alignment using DIAMOND}},
author={Buchfink, Benjamin and Xie, Chao and Huson, Daniel H},
-journal={Nature methods},
+journal={{Nature Methods}},
doi={10.1038/nmeth.3176},
volume={12},
number={1},
@@ -378,28 +378,28 @@ @article{buchfink2015fast
}

@article{richter2020eukprot,
-title={EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life},
+title={{EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life}},
author={Richter, Daniel J and Berney, Cedric and Strassert, Jurgen FH and Burki, Fabien and De Vargas, Colomban},
-journal={BioRxiv},
+journal={{BioRxiv}},
doi={10.1101/2020.06.30.180687},
year={2020},
publisher={Cold Spring Harbor Laboratory}
}

@article{fell1992partial,
-title={Partial rRNA sequences in marine yeasts: a model for identification of marine eukaryotes.},
+title={{Partial rRNA sequences in marine yeasts: a model for identification of marine eukaryotes.}},
author={Fell, JW and Statzell-Tallman, A and Lutz, MJ and Kurtzman, CP},
-journal={Molecular marine biology and biotechnology},
+journal={{Molecular Marine Biology and Biotechnology}},
volume={1},
number={3},
pages={175--186},
year={1992}
}

@article{keeling2014marine,
-title={The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing},
+title={{The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing}},
author={Keeling, Patrick J and Burki, Fabien and Wilcox, Heather M and Allam, Bassem and Allen, Eric E and Amaral-Zettler, Linda A and Armbrust, E Virginia and Archibald, John M and Bharti, Arvind K and Bell, Callum J and others},
-journal={PLoS Biol},
+journal={{PLoS Biol}},
doi={10.1371/journal.pbio.1001889},
volume={12},
number={6},
@@ -409,9 +409,9 @@ @article{keeling2014marine
}

@article{piganeau2011and,
-title={How and why DNA barcodes underestimate the diversity of microbial eukaryotes},
+title={{How and why DNA barcodes underestimate the diversity of microbial eukaryotes}},
author={Piganeau, Gwenael and Eyre-Walker, Adam and Grimsley, Nigel and Moreau, Herve},
-journal={PloS one},
+journal={{PloS One}},
doi={10.1371/journal.pone.0016342},
volume={6},
number={2},
6 changes: 3 additions & 3 deletions paper/paper.md
@@ -37,15 +37,15 @@ bibliography: paper.bib

The assessment of microbial species biodiversity is essential in ecology and evolutionary biology [@reaka1996biodiversity], but especially challenging for communities of microorganisms found in the environment [@das2006marine;@hillebrand2018climate]. Beyond providing a census of organisms in the ocean, assessing marine microbial biodiversity can reveal how microbes respond to environmental change [@salazar2017marine], clarify the ecological roles of community members [@hehemann2016adaptive], and lead to biotechnology discoveries [@das2006marine]. Computational approaches to characterize taxonomic diversity and phylogeny based on the quality of available data for environmental sequence datasets is fundamental for advancing our understanding of the role of these organisms in the environment. Even more pressing is the need for comprehensive and consistent methods to assign taxonomy to environmentally-relevant microbial eukaryotes. Here, we present `EUKulele`, an open-source software tool designed to assign taxonomy to microeukaryotes detected in meta-omic samples, and complement analysis approaches in other domains by accommodating assembly output and providing concrete metrics reporting the taxonomic completeness of each sample.

-`EUKulele` is motivated by ongoing efforts in our community to create and curate databases of genetic and genomic information [@phylodb;@caron2017probing;@marineref;@Richer2020]. For decades, it has been recognized that genetic and genomic techniques are key to understanding microbial diversity [@fell1992partial]. Genetic approaches are particularly useful in poorly-understood or difficult-to-access environmental systems, which may have a high degree of species diversity [@das2006marine;@mock2016bridging]. The most common approach for censusing microbial diversity is genetic barcoding, which targets the hyper-variable regions of highly conserved genes such as 16S or 18S rRNA [@leray2016censusing]. Computational approaches to assess the origin of these barcode-based studies (or tag-sequencing) have been well established [@Bolyen2018;@schloss2009introducing], and enable biologists to compare microbial communities and estimate sequence phylogeny. The recent collation of reference databases, e.g. PR2 and EukRef, for ribosomal RNA in eukaryotes have enabled more accurate taxonomic assessment [@del2018eukref;@guillou2012protist]. However, barcoding approaches that focus on single marker genes or variable regions limit the field of view of microbes--especially protists, which have complex and highly variable genomes [@del2014others]--potentially limiting the organisms recovered and leaving the “true” diversity poorly constrained [@piganeau2011and;@caron2019we].
+`EUKulele` is motivated by ongoing efforts in our community to create and curate databases of genetic and genomic information [@phylodb;@caron2017probing;@marineref;@richter2020eukprot]. For decades, it has been recognized that genetic and genomic techniques are key to understanding microbial diversity [@fell1992partial]. Genetic approaches are particularly useful in poorly-understood or difficult-to-access environmental systems, which may have a high degree of species diversity [@das2006marine;@mock2016bridging]. The most common approach for censusing microbial diversity is genetic barcoding, which targets the hyper-variable regions of highly conserved genes such as 16S or 18S rRNA [@leray2016censusing]. Computational approaches to assess the origin of these barcode-based studies (or tag-sequencing) have been well established [@Bolyen2018;@schloss2009introducing], and enable biologists to compare microbial communities and estimate sequence phylogeny. The recent collation of reference databases, e.g. PR2 and EukRef, for ribosomal RNA in eukaryotes have enabled more accurate taxonomic assessment [@del2018eukref;@guillou2012protist]. However, barcoding approaches that focus on single marker genes or variable regions limit the field of view of microbes--especially protists, which have complex and highly variable genomes [@del2014others]--potentially limiting the organisms recovered and leaving the “true” diversity poorly constrained [@piganeau2011and;@caron2019we].

-Shotgun sequencing approaches (e.g. metagenomics and metatranscriptomics) have become increasingly tractable, emerging as a viable, untargeted means to simultaneously assess community diversity and function. Large-scale meta-omic surveys, such as the Tara Oceans project [@zhang2015tara], have presented opportunities to assemble and annotate full "genomes" from environmental metagenomic samples [@tully2018reconstruction;@Delmont2018] and assemble massive eukaryotic gene catalogs from environmental metatranscriptomic samples [@Carradec2018]. The interpretation of these meta-omic surveys hinges upon curated, culture-based reference material. Several curated databases that contain predicted proteins from a mixture of genomic and transcriptomic references from eukaryotes, as well as bacteria and archaea have been created (e.g. [@phylodb;@eukzoo;@marineref;@Richter2020]). Building upon the creation of high-quality reference databases, we sought to create a tool similar to MEGAN [@beier2017functional], CCMetagen [@marcelino2020ccmetagen], and MG-RAST [@keegan2016mg], but independent of NCBI databases and useful for both metagenomes and metatranscriptomes, as well as the study of environmental eukaryotes. Further, we sought to create a tool with a single function to download and format databases, which is necessary for computational tools to remain relevant and usable as reference databases grow.
+Shotgun sequencing approaches (e.g. metagenomics and metatranscriptomics) have become increasingly tractable, emerging as a viable, untargeted means to simultaneously assess community diversity and function. Large-scale meta-omic surveys, such as the Tara Oceans project [@zhang2015tara], have presented opportunities to assemble and annotate full "genomes" from environmental metagenomic samples [@tully2018reconstruction;@Delmont2018] and assemble massive eukaryotic gene catalogs from environmental metatranscriptomic samples [@Carradec2018]. The interpretation of these meta-omic surveys hinges upon curated, culture-based reference material. Several curated databases that contain predicted proteins from a mixture of genomic and transcriptomic references from eukaryotes, as well as bacteria and archaea have been created (e.g. [@phylodb;@eukzoo;@marineref;@richter2020eukprot]). Building upon the creation of high-quality reference databases, we sought to create a tool similar to MEGAN [@beier2017functional], CCMetagen [@marcelino2020ccmetagen], and MG-RAST [@keegan2016mg], but independent of NCBI databases and useful for both metagenomes and metatranscriptomes, as well as the study of environmental eukaryotes. Further, we sought to create a tool with a single function to download and format databases, which is necessary for computational tools to remain relevant and usable as reference databases grow.

![A flowchart describing the general workflow of the software as it relates to metatranscriptomes (METs) and metagenomes (MAGs). \label{fig:eukulelediagram}](eukulele_diagram_simplified.png){ height=50% }

## Implementation

-We built a tool with default databases MMETSP [], PhyloDB [@phylodb], EUKZoo [@eukzoo], MarineRefII [@marineref], and EukProt [@Richter2020], for optimum compatibility with environmental eukaryotic sequences. In particular, the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) database, which contains over 650 fully-assembled reference transcriptomes [@keeling2014marine;@johnson2019re], is among the largest single projects to create a unified reference. These databases are an invaluable resource yet, to our knowledge, no single integrated software tool currently exists to enable an end-user to harness databases in a consistent and reproducible manner.
+We built a tool with default databases MMETSP [@caron2017probing], PhyloDB [@phylodb], EUKZoo [@eukzoo], MarineRefII [@marineref], and EukProt [@richter2020eukprot], for optimum compatibility with environmental eukaryotic sequences. In particular, the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) database, which contains over 650 fully-assembled reference transcriptomes [@keeling2014marine;@johnson2019re], is among the largest single projects to create a unified reference. These databases are an invaluable resource yet, to our knowledge, no single integrated software tool currently exists to enable an end-user to harness databases in a consistent and reproducible manner.

``EUKulele`` [@eukulele] (Figure \ref{fig:eukulelediagram}) is an open-source ``Python``-based package designed to simplify taxonomic identification of marine eukaryotes in meta-omic samples. The package is written in `Python`, but may be installed as a `Python` module via [PyPI](https://pypi.org/), as a standalone tool via `conda`, or through download of the `EUKulele` tarball through `GitHub`. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user's choosing, using a user-chosen aligner (``BLAST`` [@kent2002blat] or ``DIAMOND`` [@buchfink2015fast]). The "blastx" utility is used by default if metatranscriptomic samples are only provided in nucleotide format, while the "blastp" utility is used for samples available as translated protein sequences. Any consistently-formatted database may be used, but five microbial eukaryotic database options are provided by default: MMETSP [@keeling2014marine;@caron2017probing;@johnson2019re], PhyloDB [@phylodb], EukProt [@richter2020eukprot], EukZoo [@eukzoo], and a combination of the MMETSP and MarRef [@keeling2014marine;@caron2017probing;@johnson2019re;@klemetsen2018mar] (referred to as MarRef-MMETSP). This final database is the default database option, and allows the eukaryotic sequences to be compared against the expansive and high-quality MMETSP, while also distinguishing prokaryotic sequences that may be present in the sample. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched, at each taxonomic level, from domain or supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, counts may also be returned.
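
The per-level tally described in the paragraph above (counting matched contigs at each taxonomic level, from domain or supergroup down to species) can be illustrated with a minimal sketch. This is not `EUKulele`'s own code or API: the function name `tally_by_level`, the rank names in `LEVELS`, and the semicolon-delimited lineage format are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical rank names, ordered from broadest to most specific.
# The real tool's level names and ordering may differ.
LEVELS = ["domain", "supergroup", "division", "class",
          "order", "family", "genus", "species"]


def tally_by_level(matches):
    """Count matched contigs per taxon at each taxonomic level.

    `matches` maps a contig ID to a semicolon-delimited lineage string
    (an assumed format). Lineages may stop short of species; levels
    that a lineage does not reach are simply not counted for it.
    """
    tallies = {level: Counter() for level in LEVELS}
    for lineage in matches.values():
        names = [name.strip() for name in lineage.split(";")]
        # zip() stops at the shorter sequence, so partial lineages are safe.
        for level, name in zip(LEVELS, names):
            if name:
                tallies[level][name] += 1
    return tallies


# Toy input: three contigs annotated to different depths.
matches = {
    "contig_1": "Eukaryota;Stramenopiles;Ochrophyta;Bacillariophyta",
    "contig_2": "Eukaryota;Stramenopiles;Ochrophyta",
    "contig_3": "Eukaryota;Alveolata",
}
tallies = tally_by_level(matches)
```

Keeping one counter per level, rather than a single count of full lineages, means a contig that can only be placed at a coarse rank still contributes to the coarse-level totals.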
