GitHub - ahmedmoustafa/iTree: Phylogenomic Pipeline

#iTree: phylogenomic pipeline

##Phylogenomics

Phylogenomics, conventionally defined as the intersection of phylogenetics and genomics, has become a key instrument in a wide spectrum of biological studies, including resolution of complex evolutionary relationships, assignment of taxonomic affiliation, prediction of protein molecular functions, and tracing of horizontal gene transfer events. iTree automates the execution of phylogenetic analyses in multithreaded or grid-computing environments, providing a scalable, high-throughput platform for genome-wide evolutionary analyses.

##Databases

A key step in a phylogenetic analysis is collecting sequences homologous to the query of interest. This step is typically done through a BLAST search against a database. The content of the database has a direct impact on the taxon sampling and therefore on the phylogeny to be inferred. To maximize the sampling, iTree uses the results of BLAST against NCBI RefSeq for protein phylogenies and SILVA for ribosomal RNA (rRNA) phylogenies. In both cases, there is a BLAST-formatted database (built with formatdb from a Fasta file) and a corresponding relational database.

###RefSeq

To make the tree more readable in terms of taxonomic information, the sequences in RefSeq are renamed in the iTree version. The adopted naming convention is `domain.group.genus_species-txid_gi`, where:

| Token | Description |
| --- | --- |
| `domain` | `A` : Archaea, `B` : Bacteria, `E` : Eukarya, `V` : Vira (Viruses) |
| `group` | Major taxonomic group or clade |
| `genus` | Genus name |
| `species` | Species (or strain) name |
| `txid` | NCBI taxon identifier |
| `gi` | NCBI gi number |

Although this naming convention produces fairly long names (the average name length is 66 characters), it makes it much easier to recognize the taxonomic classification and the relationships between lineages in a phylogenetic tree, even for non-taxonomists.
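As an illustration of the naming convention, here is a minimal shell sketch that splits a name into its six tokens. It is not part of iTree. The function name `parse_itree_name` and the example name are hypothetical, and the sketch assumes that `domain` and `group` contain no dots, that `genus` contains no underscore, and that `txid` and `gi` are the last hyphen-separated and underscore-separated fields, respectively.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of iTree): split a name of the form
# domain.group.genus_species-txid_gi into tab-separated tokens.
parse_itree_name() {
    local name="$1"
    local domain="${name%%.*}"    # text before the first dot
    local rest="${name#*.}"       # group.genus_species-txid_gi
    local group="${rest%%.*}"     # text before the next dot (assumes no dot in group)
    rest="${rest#*.}"             # genus_species-txid_gi
    local gi="${rest##*_}"        # text after the last underscore
    rest="${rest%_*}"             # genus_species-txid
    local txid="${rest##*-}"      # text after the last hyphen
    rest="${rest%-*}"             # genus_species
    local genus="${rest%%_*}"     # text before the first underscore
    local species="${rest#*_}"    # remainder; may itself contain "_" or "-"
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
        "$domain" "$group" "$genus" "$species" "$txid" "$gi"
}

# Made-up example name, not taken from the database:
parse_itree_name 'B.Proteobacteria.Escherichia_coli-562_123456'
```

The tokens are printed in the order domain, group, genus, species, txid, gi. Splitting `txid` and `gi` from the right keeps strain names that contain hyphens or underscores inside the `species` token.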

The renamed RefSeq protein sequences are stored as Fasta for BLAST and in MySQL (at least for now) for fast access and retrieval.

Because of GitHub's limit on the size of files that can be pushed to repositories (for more information, see Working with large files and What is my disk quota?), the iTree databases are hosted on SourceForge.

The current versions, based on RefSeq Release 61 (September 2013), can be downloaded from here.

| Database File | Description | Size |
| --- | --- | --- |
| `itree_refseq_61.fas.bz2` | Fasta sequences | 6.5 GB |
| `itree_refseq_61.sql.bz2` | MySQL dump | 6.8 GB |

To load the MySQL dump:

$ bzip2 -d itree_refseq_61.sql.bz2
$ mysqladmin -u root -p create itreedb
$ mysql -u root -p itreedb < itree_refseq_61.sql

Given the large size of the dump (> 20 GB uncompressed), the last step takes a long time, depending on the power of the host machine. For example, on an otherwise idle Amazon EC2 medium instance, it takes about 12 hours.

To format the Fasta database (to make it ready for BLAST):

$ bzip2 -d itree_refseq_61.fas.bz2
$ ln -s itree_refseq_61.fas itreedb
$ formatdb -i itreedb
$ rm itreedb

Generally, these databases (Fasta and MySQL) can be used independently of iTree. They can be plugged into other phylogenomic pipelines or used for other general purposes.
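As one example of such independent use, the sketch below tallies sequences per domain letter directly from the Fasta file. It assumes each header line starts with `>` immediately followed by a name in the `domain.group.genus_species-txid_gi` form; the function name `count_domains` is hypothetical and not part of iTree.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of iTree): count Fasta records per domain
# letter, assuming headers look like ">domain.group.genus_species-txid_gi".
count_domains() {
    awk '/^>/ {
            # Domain = text between ">" and the first dot of the header
            d = substr($0, 2, index($0, ".") - 2)
            counts[d]++
         }
         END { for (d in counts) printf "%s\t%d\n", d, counts[d] }' "$1" | sort
}
```

It could be run as `count_domains itree_refseq_61.fas` on the decompressed file to get a quick view of how the taxon sampling is distributed across Archaea, Bacteria, Eukarya, and viruses.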

###SILVA

Coming soon...

##Citation

Moustafa, A., Bhattacharya, D., and Allen, A.E. (2010). iTree: A high-throughput phylogenomic pipeline. Biomedical Engineering Conference (CIBEC), 2010 5th Cairo International, pp. 103–107.

DOI: 10.1109/CIBEC.2010.5716071
