Gene tree updating
Latest commit 8be6d3c Jul 18, 2018


physcraper

Continual gene tree updating. Uses a tree from the Open Tree of Life and an alignment to search for homologous sequences and add them to the phylogenetic inference.

This is still a work in progress (the documentation in particular). Please contact ejmctavish, gmail if you need any help!

There is a full example Python script, with comments, in docs/example.py (note that it takes a long time to run).

### Dependencies (need to be in your path):

### Python packages

These will all be installed if you install physcraper using python setup.py install (but note that if you are using virtualenv, there are some odd interactions between setuptools and Python 2.7.6).

Inputs needed are:

  • ott_study_id = OpenTree study identifier
  • ott_tree_id = Tree id from that study
  • seqaln = the sequence alignment that generated that tree
  • matrix_type = alignment matrix type (only tested with fasta so far)
  • Working directory name (will be created by run)
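As a rough illustration of how these inputs fit together, the sketch below bundles them into one record and creates the working directory. `RunInputs` and `prepare_inputs` are hypothetical helper names invented for this example, not part of physcraper's API; docs/example.py shows the actual calls.

```python
import os
from collections import namedtuple

# Hypothetical container for the inputs listed above (not physcraper's API).
RunInputs = namedtuple(
    "RunInputs",
    ["ott_study_id", "ott_tree_id", "seqaln", "matrix_type", "workdir"],
)


def prepare_inputs(ott_study_id, ott_tree_id, seqaln,
                   matrix_type="fasta", workdir="physcraper_run"):
    """Check that the alignment exists and create the working directory."""
    if not os.path.isfile(seqaln):
        raise IOError("alignment file not found: {}".format(seqaln))
    if not os.path.isdir(workdir):
        os.makedirs(workdir)
    return RunInputs(ott_study_id, ott_tree_id, seqaln, matrix_type, workdir)
```

A call such as `prepare_inputs("pg_873", "tree1679", "my_alignment.fas")` would return the record, where the study ID, tree ID and file name are placeholder values only.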

Currently Physcraper relies on metadata information from the Open Tree of Life, and only uses trees from that database. Go to https://tree.opentreeoflife.org/curator to find a tree, or upload it! You can get the tree ID by clicking on your tree of interest, and looking at the URL.
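If you would rather not read the IDs off the URL by hand, a small helper can extract them. This is only a sketch: it assumes the curator URL has the form `.../curator/study/view/<study_id>/?tab=trees&tree=<tree_id>`, which may change, and `ids_from_curator_url` is a hypothetical name, not a physcraper function.

```python
import re

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def ids_from_curator_url(url):
    """Return (study_id, tree_id) from a curator URL, or None for missing parts.

    Assumes the study ID follows '/study/view/' in the path and the tree ID
    is given by the 'tree' query parameter.
    """
    parsed = urlparse(url)
    match = re.search(r"/study/view/([^/?]+)", parsed.path)
    study_id = match.group(1) if match else None
    tree_id = parse_qs(parsed.query).get("tree", [None])[0]
    return study_id, tree_id
```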

### Taxon information from NCBI

It is easiest if you keep the taxon information in the included taxonomy folder. The file is too big for GitHub, so get it from the NCBI FTP site:

rsync -av ftp.ncbi.nih.gov::pub/taxonomy/gi_taxid_nucl.dmp.gz taxonomy/gi_taxid_nucl.dmp.gz  
gunzip taxonomy/gi_taxid_nucl.dmp.gz
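The unzipped file is a plain two-column table, with a GI number and an NCBI taxon ID on each line. Because it is very large, it is usually better to stream it than to load it whole. The function below is a minimal sketch of that idea; `lookup_taxids` is a hypothetical helper written for this example and is not how physcraper itself reads the file.

```python
def lookup_taxids(dmp_path, gis):
    """Map each requested GI number to its NCBI taxon ID.

    Streams the file line by line and stops early once every
    requested GI has been found.
    """
    wanted = set(int(gi) for gi in gis)
    found = {}
    if not wanted:
        return found
    with open(dmp_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            gi = int(fields[0])
            if gi in wanted:
                found[gi] = int(fields[1])
                if len(found) == len(wanted):
                    break
    return found
```

For example, `lookup_taxids("taxonomy/gi_taxid_nucl.dmp", [2, 5])` returns a dictionary containing only those of the two GIs that appear in the file.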

Physcraper generates an ATT (alignment, taxonomy, tree) object. This is important because it tracks the namespace shared across these three data objects, which can otherwise fall out of sync.
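To show the kind of mismatch a shared namespace guards against, here is a small stand-alone check on plain Python collections. It is a conceptual sketch only: `check_shared_labels` is a hypothetical function and does not reflect how the ATT object is implemented.

```python
def check_shared_labels(alignment_labels, tree_tip_labels, taxon_map):
    """Report labels that are not shared by alignment, tree and taxonomy.

    alignment_labels: sequence labels in the alignment
    tree_tip_labels:  tip labels in the tree
    taxon_map:        dict mapping label -> taxon information
    """
    aln = set(alignment_labels)
    tips = set(tree_tip_labels)
    mapped = set(taxon_map)
    return {
        "in_alignment_not_tree": sorted(aln - tips),
        "in_tree_not_alignment": sorted(tips - aln),
        "missing_taxon_info": sorted((aln | tips) - mapped),
    }
```

If all three lists in the result are empty, the three data objects use the same set of labels.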

### Local BLAST databases

To set up a local BLAST database, first install the BLAST+ tools:

sudo apt-get install ncbi-blast+

Then, in the folder that will hold your BLAST database, run:

  1. update_blastdb nt
  2. cat *.tar.gz | tar -xvzf - -i
  3. blastdbcmd -db nt -info

The last command shows whether the setup worked correctly. 'nt' means that we are building the nucleotide database. The database needs to be updated regularly.
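Since the database has to be refreshed regularly, it can be convenient to wrap the three commands in a script. The sketch below does so with `subprocess`. The function names are hypothetical, and it assumes the BLAST+ tools are on your path under the names used above. By default it only returns the commands without running them.

```python
import subprocess


def blast_db_commands(db_name="nt"):
    """Return the shell commands from the list above, in order."""
    return [
        "update_blastdb {}".format(db_name),
        "cat *.tar.gz | tar -xvzf - -i",
        "blastdbcmd -db {} -info".format(db_name),
    ]


def setup_blast_db(db_dir, db_name="nt", dry_run=True):
    """Run the commands inside db_dir, or just return them if dry_run is True."""
    commands = blast_db_commands(db_name)
    if not dry_run:
        for command in commands:
            # shell=True is needed for the wildcard and the pipe
            subprocess.check_call(command, shell=True, cwd=db_dir)
    return commands
```

Calling `setup_blast_db("/path/to/blastdb", dry_run=False)` would download and unpack the nt database, which is large and takes a long time.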