Create the `taxids.txt` and `seq_info.csv` files from a GenBank file.

In [83]:
from Bio import SeqIO

gb = '../../Reference_DB/CytB_European-fish_SATIVA_cleaned.gb'
seqinfo_file = 'seq_info.csv'
taxid_file = 'taxids.txt'

#####
# Header row of the seqinfo file
seq_info = ['"seqname","accession","tax_id","species_name","is_type"']
taxids = []

with open(gb, 'r') as gb_handle:
    for r in SeqIO.parse(gb_handle, 'genbank'):
        # The first feature is assumed to be the 'source' feature,
        # which carries the organism name and the taxon db_xref.
        source = r.features[0].qualifiers
        sp = source['organism'][0]
        taxid = None  # reset so a taxid never carries over from the previous record
        for db_xref in source.get('db_xref', []):
            if db_xref.startswith('taxon:'):
                taxid = db_xref.split(":")[1]
        if taxid is None:
            raise ValueError('No taxon db_xref found for record %s' % r.id)
        seq_info.append('"%s","%s","%s","%s","0"' % (r.id, r.id, taxid, sp))
        taxids.append(taxid)

# Write one seqinfo row per reference sequence
with open(seqinfo_file, 'w') as seq_info_out:
    for line in seq_info:
        seq_info_out.write(line + "\n")

# Write each taxid only once
with open(taxid_file, 'w') as taxids_out:
    for t in set(taxids):
        taxids_out.write(t + "\n")
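
The taxid lookup in the cell above depends on `db_xref` entries of the form `database:identifier`. As a minimal sketch of that step in isolation, the helper below (the name `taxid_from_db_xrefs` is hypothetical and not part of Biopython) takes a plain list of strings, so it can be tried without a GenBank file:

```python
def taxid_from_db_xrefs(db_xrefs):
    """Return the NCBI taxid from a list of db_xref strings, or None if absent."""
    for entry in db_xrefs:
        # Split only at the first colon: "taxon:7936" -> ("taxon", ":", "7936")
        database, _, identifier = entry.partition(":")
        if database == "taxon":
            return identifier
    return None


# Example with a made-up cross-reference list
print(taxid_from_db_xrefs(["OTHERDB:12345", "taxon:7936"]))
```

Returning `None` for records without a taxon cross-reference makes the missing case explicit, so the caller can decide whether to skip the record or stop.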
    



__BUILD AN HMM FOR THE REFERENCE ALIGNMENT__

We will do this with the program `hmmbuild` from the [hmmer v3](http://hmmer.janelia.org/) program suite.

In [None]:
!hmmbuild -h #for help

In [84]:
!hmmbuild CytB_ref.hmm ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.fasta

# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b1 (May 2013); http://hmmer.org/
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.fasta
# output HMM file:                  CytB_ref.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen     W eff_nseq re/pos description
#---- -------------------- ----- ----- ----- ----- -------- ------ -----------
1     CytB_European-fish_SATIVA_cleaned.alignment  1736  1008  1008  1218     2.43  0.450 

# CPU time: 0.46u 0.02s 00:00:00.48 Elapsed: 00:00:00.50


__Create reference package required by pplacer__

First we will need to compile some information about the taxonomy of the reference sequences.

We start by producing a taxonomy table for the set of taxa used as reference. The file `taxids.txt` is a simple text file that lists the NCBI taxonomic ids ([taxids](http://www.ncbi.nlm.nih.gov/taxonomy)) of all reference taxa, one per line.

In [85]:
!head taxids.txt

134920
8030
98395
69293
7936
54556
7965
8032
8038
47308


We will use a tool from the [taxtastic](http://fhcrc.github.io/taxtastic/) package to fetch the taxonomic information for these taxa from the global [NCBI taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy). In our container the taxonomy is available as an SQLite database (`/usr/bin/taxonomy.db`) built from the so-called 'taxonomy dump'. Information on how to format the taxonomy dump for use with taxtastic can be found [here](http://fhcrc.github.io/taxtastic/commands.html#new-database).


[Taxtastic](http://fhcrc.github.io/taxtastic/) is a suite of tools.

In [86]:
!taxit -h #for help

usage: taxit [-h] [-V]
             {help,add_nodes,add_to_taxtable,check,composition,create,findcompany,info,lonelynodes,merge,new_database,refpkg_intersection,reroot,rollback,rollforward,rp,strip,taxids,taxtable,update,update_taxids}
             ...

Creation, validation, and modification of reference packages for use with
`pplacer` and related software.

positional arguments:
  {help,add_nodes,add_to_taxtable,check,composition,create,findcompany,info,lonelynodes,merge,new_database,refpkg_intersection,reroot,rollback,rollforward,rp,strip,taxids,taxtable,update,update_taxids}
    help                Detailed help for actions using `help <action>`
    add_nodes           Add new nodes to a database containing a taxonomy.
    add_to_taxtable     Adds some nodes to a taxtable
    check               Run a series of deeper checks on a RefPkg.
    composition         Show taxonomic composition of a reference package.
    create              Creates a reference package
    f

To explore an individual subcommand (such as `taxtable`) further, run e.g.:

In [87]:
!taxit taxtable -h

usage: taxit taxtable [-h] [-v] [-q] -d FILE
                      [-n FILE | -t FILE-OR-LIST | -i SEQ_INFO] [-o FILE]

Creates a CSV file describing lineages for a set of taxa

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Increase verbosity of screen output (eg, -v is
                        verbose, -vv more so)
  -q, --quiet           Suppress output
  -d FILE, --database-file FILE
                        Name of the sqlite database file

Input options:
  -n FILE, --tax-names FILE
                        A file identifing taxa in the form of taxonomic names.
                        Names are matched against both primary names and
                        synonyms. Lines beginning with "#" are ignored. Taxa
                        identified here will be added to those specified using
                        --tax-ids
  -t FILE-OR-LIST, --tax-ids FILE-OR-LIST
                        File containing a whitespac

Now, let's produce the relevant taxtable.

In [88]:
!taxit taxtable -d /usr/bin/taxonomy.db -t taxids.txt -o taxa.csv

The resulting `taxa.csv` file contains just the taxonomic information relevant for the reference sequences to be used for the phylogenetic placement. 

Have a look into the file.



In [89]:
!head taxa.csv

"tax_id","parent_id","rank","tax_name","root","below_root","superkingdom","below_superkingdom","kingdom","below_kingdom","below_below_kingdom","below_below_below_kingdom","phylum","subphylum","below_subphylum","below_below_subphylum","below_below_below_subphylum","below_below_below_below_subphylum","superclass","class","subclass","infraclass","below_infraclass","below_below_infraclass","below_below_below_infraclass","below_below_below_below_infraclass","below_below_below_below_below_infraclass","below_below_below_below_below_below_infraclass","below_below_below_below_below_below_below_infraclass","below_below_below_below_below_below_below_below_infraclass","below_below_below_below_below_below_below_below_below_infraclass","below_below_below_below_below_below_below_below_below_below_infraclass","superorder","order","suborder","infraorder","superfamily","family","subfamily","tribe","genus","species"
"1","1","root","root","1","","","","","","","","","","","","","","","","","","","","","

We will also need to provide information that links the taxonomic ids to the actual sequence ids. This file is called the 'seqinfo' file by taxtastic. We provide this as `seq_info.csv`. 

Have a look:

In [90]:
!head seq_info.csv

"seqname","accession","tax_id","species_name","is_type"
"AY184273.1","AY184273.1","219545","Ameiurus melas","0"
"KM874552.1","KM874552.1","58321","Alburnoides bipunctatus","0"
"AY225663.1","AY225663.1","109273","Ambloplites rupestris","0"
"KM874532.1","KM874532.1","58321","Alburnoides bipunctatus","0"
"AF006715.1","AF006715.1","7936","Anguilla anguilla","0"
"HM173121.1","HM173121.1","58321","Alburnoides bipunctatus","0"
"AB021776.1","AB021776.1","7936","Anguilla anguilla","0"
"HM173102.1","HM173102.1","58321","Alburnoides bipunctatus","0"
"KJ564260.1","KJ564260.1","7936","Anguilla anguilla","0"

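Since `seq_info.csv` and `taxa.csv` are linked only through the `tax_id` column, it can be worth checking that every taxid used for a sequence also has a row in the taxtable. The following is a minimal sketch using only the standard library; the function name `missing_taxids` is hypothetical, and the check assumes both files carry a `tax_id` column in their header, as in the outputs shown above.

```python
import csv
from pathlib import Path


def missing_taxids(seqinfo_path, taxtable_path):
    """Return tax_ids used in the seqinfo file that have no row in the taxtable."""
    with open(taxtable_path, newline="") as handle:
        known = {row["tax_id"] for row in csv.DictReader(handle)}
    with open(seqinfo_path, newline="") as handle:
        used = {row["tax_id"] for row in csv.DictReader(handle)}
    return sorted(used - known)


# Only run the check if both files from this notebook are present
if Path("seq_info.csv").exists() and Path("taxa.csv").exists():
    print(missing_taxids("seq_info.csv", "taxa.csv"))
```

An empty list means every sequence can be linked to a lineage in `taxa.csv`.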

The reference package also needs to contain a reference tree, the log from the tree inference, the underlying alignment in fasta format as well as the HMM profile that you have produced above to align the query sequences to. 



We have already built the tree (see [here]()). We'll also need the RAxML info file, which contains the model parameters. pplacer seems to be able to parse only info files from older RAxML versions (e.g. v7.2.6), so we'll estimate the model parameters with this version.

In [92]:
%%bash

raxmlHPC-PTHREADS -f e -T 5 -t ../../Reference_DB/4-post_SATIVA/RAxML_bestTree.CytB@mafftLinsi-SATIVA_aln_clipped0 -m GTRGAMMA -s ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.phylip -n test

Found 64 sequences that are exactly identical to other sequences in the alignment.
Normally they should be excluded from the analysis.

An alignment file with sequence duplicates removed has already
been printed to file ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.phylip.reduced

This is the RAxML Master Pthread

This is RAxML Worker Pthread Number: 1

This is RAxML Worker Pthread Number: 4

This is RAxML Worker Pthread Number: 3

This is RAxML Worker Pthread Number: 2


This is RAxML version 7.2.6 released by Alexandros Stamatakis in February 2010.

With greatly appreciated code contributions by:
Andre Aberer (TUM)
Simon Berger (TUM)
John Cazes (TACC)
Michael Ott (TUM)
Nick Pattengale (UNM)
Wayne Pfeiffer (SDSC)


Alignment has 861 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 11.5
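
At this point all components of the reference package should exist on disk. As a quick sanity check before packaging, the sketch below tests that each input is present and non-empty. The paths are the ones used in this notebook (`RAxML_info.test` follows from the `-n test` option above); the helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path

# Paths as used in this notebook; adjust them to your own layout.
REFPKG_INPUTS = {
    "alignment": "../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.fasta",
    "tree": "../../Reference_DB/4-post_SATIVA/RAxML_bestTree.CytB@mafftLinsi-SATIVA_aln_clipped0",
    "tree stats": "RAxML_info.test",
    "HMM profile": "CytB_ref.hmm",
    "seqinfo": "seq_info.csv",
    "taxonomy": "taxa.csv",
}


def missing_inputs(inputs):
    """Return the labels of all inputs whose file is missing or empty."""
    missing = []
    for label, path in inputs.items():
        p = Path(path)
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(label)
    return missing


print("Missing inputs:", missing_inputs(REFPKG_INPUTS) or "none")
```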

Build the reference package.

In [93]:
%%bash
taxit create \
-l CytB \
-P CytB.refpkg \
--aln-fasta ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.fasta \
--tree-stats RAxML_info.test \
--tree-file ../../Reference_DB/4-post_SATIVA/RAxML_bestTree.CytB@mafftLinsi-SATIVA_aln_clipped0 \
--profile CytB_ref.hmm \
--seq-info seq_info.csv \
--taxonomy taxa.csv

rerooting at below_below_subphylum
root found at node 1384


Some explanation of the above command, written here as a bash array so that each option can carry its own comment:

```bash
args=(
  -l CytB # arbitrary marker name
  -P CytB.refpkg # name to be given to the reference package
  --aln-fasta ../../Reference_DB/CytB_European-fish_SATIVA_cleaned.alignment.fasta # alignment
  --tree-stats RAxML_info.test # info file from RAxML containing the model parameters for the tree
  --tree-file ../../Reference_DB/4-post_SATIVA/RAxML_bestTree.CytB@mafftLinsi-SATIVA_aln_clipped0 # RAxML tree
  --profile CytB_ref.hmm # HMM profile built for the reference alignment
  --seq-info seq_info.csv # seqinfo file mapping taxonomy to sequence ids
  --taxonomy taxa.csv # taxonomic information for the relevant taxa
)
taxit create "${args[@]}" # call the program
```
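
To see what ended up in the reference package, one can read its manifest. The sketch below assumes that the package directory contains a `CONTENTS.json` file with a `files` mapping from component name to file name, which is how taxtastic reference packages are usually laid out; treat both names as assumptions and check them against your own `CytB.refpkg`. The helper name `refpkg_files` is hypothetical.

```python
import json
from pathlib import Path


def refpkg_files(refpkg_dir):
    """Return the component -> file name mapping from a refpkg manifest."""
    contents_path = Path(refpkg_dir) / "CONTENTS.json"
    with open(contents_path) as handle:
        contents = json.load(handle)
    # Fall back to an empty mapping if the assumed "files" key is absent
    return contents.get("files", {})


refpkg = Path("CytB.refpkg")
if (refpkg / "CONTENTS.json").is_file():
    for component, filename in sorted(refpkg_files(refpkg).items()):
        print(component, "->", filename)
```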



#WELL DONE!#

