updating link to framework figure
LunaSare committed Jul 21, 2021
1 parent 0f37eaf commit e096916
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/mds/methods_extended.md
@@ -9,7 +9,7 @@
<div id="dna-sequence-search-and-filtering" class="section level2">
<h2>DNA sequence search and filtering</h2>
<p>Physcraper uses the GenBank DNA database as the source to search for new sequences. The DNA sequence search can be performed on the remote GenBank database or on a local GenBank database set up by the user, which can speed up the search process. Detailed instructions to set up a local database are provided in Physcraper’s software documentation.</p>
<p>The next step is to identify a “search taxon” to constrain the sequence search on the GenBank database within that taxonomic group. The search taxon can be chosen by the user from the NCBI taxonomy. If none is provided, then the search taxon is automatically set using the taxa in the input tree labeled as the “ingroup” (see Physcraper's framework [figure](https://physcraper.readthedocs.io/en/latest/how_to_start.html#the-physcraper-framework)). The search taxon is the most recent common ancestor (MRCA) of the ingroup taxa in the OpenTree synthetic tree that is also a named clade in the NCBI taxonomy. <!-- FIGURE RECOMMENDED: Figure \@ref(fig:search) --> This is known in OpenTree as the Most Recent Common Ancestral Taxon (MRCAT; also referred to as the Least Inclusive Common Ancestral taxon, LICA). The MRCAT can be different from the phylogenetic MRCA when the latter is an unnamed clade in the synthetic tree. To identify the MRCAT of a group of taxon names, we use the OpenTree <a href="https://github.com/OpenTreeOfLife/germinator/wiki/Taxonomy-API-v3#mrca">API</a> <span class="citation">(Rees &amp; Cranston <a href="#ref-rees2017automated">2017</a>)</span>.</p>
<p>The next step is to identify a “search taxon” to constrain the sequence search on the GenBank database within that taxonomic group. The search taxon can be chosen by the user from the NCBI taxonomy. If none is provided, then the search taxon is automatically set using the taxa in the input tree labeled as the “ingroup” (see Physcraper's framework <a href="https://physcraper.readthedocs.io/en/latest/how_to_start.html#the-physcraper-framework">figure</a>). The search taxon is the most recent common ancestor (MRCA) of the ingroup taxa in the OpenTree synthetic tree that is also a named clade in the NCBI taxonomy. <!-- FIGURE RECOMMENDED: Figure \@ref(fig:search) --> This is known in OpenTree as the Most Recent Common Ancestral Taxon (MRCAT; also referred to as the Least Inclusive Common Ancestral taxon, LICA). The MRCAT can be different from the phylogenetic MRCA when the latter is an unnamed clade in the synthetic tree. To identify the MRCAT of a group of taxon names, we use the OpenTree <a href="https://github.com/OpenTreeOfLife/germinator/wiki/Taxonomy-API-v3#mrca">API</a> <span class="citation">(Rees &amp; Cranston <a href="#ref-rees2017automated">2017</a>)</span>.</p>
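<p>The difference between the phylogenetic MRCA and the MRCAT can be illustrated with a minimal sketch on a toy tree. This is not the OpenTree API and not Physcraper's code: the function names, the parent map, and every node label below are hypothetical. The tree is stored as child-to-parent links, and the MRCAT is found by climbing from the MRCA to the nearest named node.</p>

```python
def ancestors(node, parent):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(tips, parent):
    """Most recent common ancestor: first node on one tip's path shared by all tips."""
    paths = [ancestors(tip, parent) for tip in tips]
    shared = set(paths[0]).intersection(*paths[1:])
    for node in paths[0]:  # ordered from tip to root
        if node in shared:
            return node
    raise ValueError("tips do not share an ancestor")


def mrcat(tips, parent, named):
    """Nearest named node at or above the MRCA (raises KeyError if none is named)."""
    node = mrca(tips, parent)
    while node not in named:
        node = parent[node]
    return node


# Hypothetical toy tree: "node1" is an unnamed clade inside "FamilyX".
parent = {
    "sp_a": "node1", "sp_b": "node1",
    "node1": "FamilyX", "sp_c": "FamilyX",
    "FamilyX": "OrderX", "sp_d": "OrderX",
}
named = {"OrderX", "FamilyX", "sp_a", "sp_b", "sp_c", "sp_d"}

print(mrca(["sp_a", "sp_b"], parent))          # the unnamed clade
print(mrcat(["sp_a", "sp_b"], parent, named))  # the named clade containing it
```

<p>In this toy case the MRCA of <code>sp_a</code> and <code>sp_b</code> is the unnamed <code>node1</code>, so the search taxon would be the named clade above it.</p>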
<p>Users can provide a search taxon that is either a more or a less inclusive clade relative to the ingroup of the original phylogeny. If the search taxon is more inclusive, the sequence search will be performed outside the MRCAT of the matched taxa, e.g., including all taxa within the family or the order that the ingroup belongs to. If the search taxon is a less inclusive clade, users can focus on enriching a particular clade/region within the ingroup of the phylogeny.</p>
<p>The Basic Local Alignment Search Tool, BLAST <span class="citation">(Altschul <em>et al.</em> <a href="#ref-altschul1990basic">1990</a>, <a href="#ref-altschul1997gapped">1997</a>)</span>, is used to identify similarity between DNA sequences within the search taxon in a nucleotide database and the sequences in the checked alignment. The <code>blastn</code> program from the BLAST command line tools <span class="citation">(Camacho <em>et al.</em> <a href="#ref-camacho2009blast">2009</a>)</span> is used for local database sequence searches. For remote database searches, we modified the BioPython <span class="citation">(Cock <em>et al.</em> <a href="#ref-cock2009biopython">2009</a>)</span> BLAST function from the <a href="https://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html">NCBIWWW module</a> to accept an alternative BLAST address (URL). This is useful when a user lacks the computing capacity needed to set up a local database: a BLAST database can instead be set up on a remote machine and queried there, avoiding NCBI’s required waiting times, which slow down the searches markedly. A constrained BLAST search is performed, in which each sequence in the alignment is BLASTed once against all database DNA sequences belonging to the search taxon. All results from each BLAST run are stored, and sequences whose e-values are better (lower) than the e-value cutoff (default 0.00001) are saved along with their corresponding metadata, i.e., their GenBank accession number. The full sequence for each match is downloaded from NCBI into a dedicated library within the “physcraper” folder, allowing secondary analyses to run significantly faster.</p>
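<p>The e-value filtering step described above can be sketched as follows. This is an illustration only, under the assumption that hits are available in the default 12-column tabular layout produced by <code>blastn -outfmt 6</code>; the output format Physcraper actually parses may differ, and the function name <code>split_hits_by_evalue</code> and the accession numbers in the example are hypothetical.</p>

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def split_hits_by_evalue(tabular_text, evalue_cutoff=1e-5):
    """Split BLAST hits into (accepted, rejected) lists of (subject id, e-value).

    A hit is accepted when its e-value is strictly below the cutoff.
    """
    accepted, rejected = [], []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        target = accepted if evalue < evalue_cutoff else rejected
        target.append((hit["sseqid"], evalue))
    return accepted, rejected


# Hypothetical hits for one query sequence.
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["query1", "AB000001.1", "98.5", "600", "9", "0", "1", "600", "1", "600", "1e-150", "1000"],
        ["query1", "AB000002.1", "75.0", "40", "10", "0", "1", "40", "5", "44", "0.02", "30.1"],
    ]
)
kept, dropped = split_hits_by_evalue(example)
print(kept)
print(dropped)
```

<p>Returning the rejected hits alongside the accepted ones mirrors the idea of keeping a record of the accession numbers that did not pass the cutoff.</p>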
<p>BLAST result sequences are discarded if they fall outside the user-set minimum and maximum length cutoffs, which are set as proportions of the average ungapped length of the sequences in the input alignment (default values of 80% and 120%, respectively). This filtering guarantees the exclusion of whole-genome sequences, which create problems in multiple sequence alignment. The GenBank accession numbers of sequences removed for not meeting the e-value or length cutoffs are stored in output files. All sequences accepted up to this point are assigned an internal identifier. New sequences that are either identical to or a subset of any existing sequence in the input alignment are discarded, unless they represent a different taxon in the OTT taxonomy or the NCBI taxonomy, or they are longer than the sequence in the input alignment. Among the filtered sequences, there are often several representatives per taxon. Although it can be useful to keep some of them, for example, to investigate monophyly within species, there can be hundreds of exemplar sequences per taxon for some markers. To control the number of sequences per taxon in downstream analyses, five sequences per taxon are chosen at random by default; this number can be modified by the user.</p>
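<p>The length filter and the per-taxon subsampling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Physcraper's implementation: sequences are plain strings keyed by accession, gaps are written as <code>-</code> or <code>?</code>, and all function and parameter names are hypothetical.</p>

```python
import random
from collections import defaultdict


def ungapped_length(seq):
    """Length of an aligned sequence once gap characters are removed."""
    return len(seq.replace("-", "").replace("?", ""))


def length_bounds(alignment, min_prop=0.8, max_prop=1.2):
    """Min and max accepted lengths as proportions of the mean ungapped length."""
    lengths = [ungapped_length(seq) for seq in alignment.values()]
    mean_len = sum(lengths) / len(lengths)
    return min_prop * mean_len, max_prop * mean_len


def filter_by_length(candidates, alignment, min_prop=0.8, max_prop=1.2):
    """Return (kept sequences, rejected accessions) for unaligned candidates."""
    low, high = length_bounds(alignment, min_prop, max_prop)
    kept, rejected = {}, []
    for accession, seq in candidates.items():
        if low <= len(seq) <= high:
            kept[accession] = seq
        else:
            rejected.append(accession)
    return kept, rejected


def sample_per_taxon(accession_to_taxon, max_per_taxon=5, seed=None):
    """Randomly keep at most `max_per_taxon` accessions for each taxon."""
    rng = random.Random(seed)
    by_taxon = defaultdict(list)
    for accession, taxon in accession_to_taxon.items():
        by_taxon[taxon].append(accession)
    chosen = []
    for accessions in by_taxon.values():
        if len(accessions) > max_per_taxon:
            accessions = rng.sample(accessions, max_per_taxon)
        chosen.extend(accessions)
    return sorted(chosen)
```

<p>With an input alignment whose mean ungapped length is 9, for example, the default proportions accept candidates roughly between 7.2 and 10.8 bases long, so a much longer genome-scale sequence is rejected and its accession is returned for logging.</p>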