# Introduction to PhyloPandas

Let me introduce you to PhyloPandas, a Pandas DataFrame and interface for phylogenetics.

In [1]:
import pandas as pd

In [2]:
import phylopandas as ph

## Reading data

PhyloPandas comes with various `read_` functions that load phylogenetic data into a Pandas DataFrame.

Check out the supported formats by pressing `tab` after `read_` in the cell below.

In [None]:
ph.read_
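
Outside an interactive session, where tab completion is not available, the same listing can be produced with `dir()`. The sketch below is a minimal illustration: `names_with_prefix` is a hypothetical helper, not part of PhyloPandas, and it is demonstrated on the standard-library `json` module so that it runs without PhyloPandas installed.

```python
import json


def names_with_prefix(module, prefix):
    """Return the sorted attribute names of `module` that start with `prefix`."""
    return sorted(name for name in dir(module) if name.startswith(prefix))


# Stand-in demonstration on a standard-library module.
print(names_with_prefix(json, "dump"))

# With PhyloPandas imported as `ph`, the equivalent call would be:
# names_with_prefix(ph, "read_")
```

The same helper applied to `df.phylo` with the prefix `to_` would list the writing methods introduced later.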

Try reading some of the sequence files in the `data` folder.

In [4]:
with open('PF08793_seed.fasta', 'r') as f:
    print(f.read())

>Q0E553_SFAVA/184-218
KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG
>Q8QUQ5_ISKNN/123-157
ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES
>Q0E553_SFAVA/142-176
YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG
>Q8QUQ5_ISKNN/45-79
LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG
>Q8QUQ6_ISKNN/37-75
VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT
>019R_FRG3G/249-283
VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG
>019R_FRG3G/302-336
TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC
>VF380_IIV6/7-45
KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH
>VF380_IIV3/8-47
KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT
>VF378_IIV6/4-38
KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP
>O41158_PBCV1/63-96
LCSKWKA----NP-LVNPATGRKIKKDGPVYEKIQKKCS-
>019R_FRG3G/5-39
YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD
>019R_FRG3G/139-172
-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD
>VF232_IIV6/64-98
ECEQWLA----NK-GINPRTGKAIKIGGPTYKKLEMECKE
>Q0E553_SFAVA/60-94
VCKKFLA----NK-TVSPYSGRPIKPGKKLYNDLEKHCSG
>Q8QUQ5_ISKNN/164-198
QCRAFEE----NP-DVNPNTGRRISPTGPIASSMRRRCMN
>Q8QUQ5_ISKNN/7-42
KCNQLRN----

In [3]:
ph.read_fasta('PF08793_seed.fasta')

Unnamed: 0,description,id,label,sequence,uid
0,Q0E553_SFAVA/184-218,Q0E553_SFAVA/184-218,Q0E553_SFAVA/184-218,KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG,8CdUJdvadV
1,Q8QUQ5_ISKNN/123-157,Q8QUQ5_ISKNN/123-157,Q8QUQ5_ISKNN/123-157,ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES,ao7epMtcSL
2,Q0E553_SFAVA/142-176,Q0E553_SFAVA/142-176,Q0E553_SFAVA/142-176,YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG,YyaoS6Yz0x
3,Q8QUQ5_ISKNN/45-79,Q8QUQ5_ISKNN/45-79,Q8QUQ5_ISKNN/45-79,LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG,hWKsPoL53L
4,Q8QUQ6_ISKNN/37-75,Q8QUQ6_ISKNN/37-75,Q8QUQ6_ISKNN/37-75,VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT,CtDg4S1G4E
5,019R_FRG3G/249-283,019R_FRG3G/249-283,019R_FRG3G/249-283,VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG,WZpr7GRTG5
6,019R_FRG3G/302-336,019R_FRG3G/302-336,019R_FRG3G/302-336,TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC,Kkz787fqqm
7,VF380_IIV6/7-45,VF380_IIV6/7-45,VF380_IIV6/7-45,KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH,rgeMtSthnF
8,VF380_IIV3/8-47,VF380_IIV3/8-47,VF380_IIV3/8-47,KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT,6irRaLMHBZ
9,VF378_IIV6/4-38,VF378_IIV6/4-38,VF378_IIV6/4-38,KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP,HauPaaLExc


In [6]:
ph.read_phylip('PF08793_seed.phylip')

Unnamed: 0,description,id,label,sequence,uid
0,seq-0,seq-0,seq-0,KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG,dQq1CenwKh
1,seq-1,seq-1,seq-1,ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES,Lu0Wg9zmz6
2,seq-2,seq-2,seq-2,YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG,mIFl3vZ1mp
3,seq-3,seq-3,seq-3,LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG,rBKd05BgEq
4,seq-4,seq-4,seq-4,VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT,DpwswmEUtP
5,seq-5,seq-5,seq-5,VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG,JLGuPB6Jfv
6,seq-6,seq-6,seq-6,TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC,586UBgn7LV
7,seq-7,seq-7,seq-7,KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH,a7joWOpZvD
8,seq-8,seq-8,seq-8,KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT,E0DJI8pg90
9,seq-9,seq-9,seq-9,KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP,L4FeOrx62G


In [7]:
ph.read_clustal('PF08793_seed.clustal')

Unnamed: 0,description,id,label,sequence,uid
0,seq-0,seq-0,<unknown name>,KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG,gUsUIvkoQc
1,seq-1,seq-1,<unknown name>,ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES,T8DMwemLwu
2,seq-2,seq-2,<unknown name>,YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG,FtjDPNP0Rz
3,seq-3,seq-3,<unknown name>,LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG,0F7mCE8Oyo
4,seq-4,seq-4,<unknown name>,VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT,0I1UFycA2W
5,seq-5,seq-5,<unknown name>,VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG,RXiIhu8EHz
6,seq-6,seq-6,<unknown name>,TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC,opwvh83vq8
7,seq-7,seq-7,<unknown name>,KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH,WirqWGQzoK
8,seq-8,seq-8,<unknown name>,KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT,V131kUYGUS
9,seq-9,seq-9,<unknown name>,KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP,PsCBJynkDO


## Writing data

PhyloPandas attaches a `phylo` accessor to the standard Pandas DataFrame. Inside this accessor are various writing methods, following Pandas syntax, allowing you to write to various sequence formats.

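To quickly see the writing methods, press `tab` after `to_` in the cell below. Running the cell as written raises the `AttributeError` shown, because `to_` is only a prefix for tab completion, not a method.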

In [4]:
df = ph.read_fasta('PF08793_seed.fasta')

In [5]:
df.phylo.to_

AttributeError: 'PhyloPandasDataFrameMethods' object has no attribute 'to_'

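Let's write the DataFrame back out to FASTA. If you don't give a filename, `to_fasta` returns the FASTA text as a string. Note that each header in the output below is a short random identifier like those in the `uid` column, not the original sequence name.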

In [9]:
s = df.phylo.to_fasta()
print(s)

>yDedRnezWc
KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG
>8rwHeXhjra
ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES
>WQQ7gkkH9j
YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG
>8ZbEo69bg9
LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG
>ulbkN7QqYd
VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT
>CMxkl0OO78
VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG
>1G6Dn1DLQk
TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC
>Ple3uHhBLQ
KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH
>M4zXTN70Vy
KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT
>9OsqjDTxTY
KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP
>sIWjzltCy9
LCSKWKA----NP-LVNPATGRKIKKDGPVYEKIQKKCS-
>jS8RCGoQJg
YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD
>Ome6LyTOPx
-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD
>6d6LmSHsvs
ECEQWLA----NK-GINPRTGKAIKIGGPTYKKLEMECKE
>CCiALBFHEW
VCKKFLA----NK-TVSPYSGRPIKPGKKLYNDLEKHCSG
>44LK2hOL70
QCRAFEE----NP-DVNPNTGRRISPTGPIASSMRRRCMN
>TxrqDWwTcd
KCNQLRN----NRYTVNPVSNRAIAPRGDTANTLRRICEQ
>2vCOS3IBEP
QCETFKR----NKQAVSPLTNCPIDKFGRTAARFRKECD-
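
The "return a string when no filename is given" convention used by `to_fasta` above is the same one Pandas uses for `to_csv`. The sketch below shows the pattern in plain Python. It is an illustration only: `records_to_fasta` is a hypothetical helper, not the PhyloPandas implementation, and it uses the original sequence names as headers.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def records_to_fasta(
    records: Iterable[Tuple[str, str]], filename: Optional[str] = None
) -> Optional[str]:
    """Format (header, sequence) pairs as FASTA text.

    Hypothetical helper: returns the text when no filename is given,
    otherwise writes the text to that file and returns None.
    """
    text = "".join(f">{header}\n{sequence}\n" for header, sequence in records)
    if filename is None:
        return text
    Path(filename).write_text(text)
    return None


# Two records copied from the seed alignment shown earlier.
records = [
    ("Q0E553_SFAVA/184-218", "KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG"),
    ("Q8QUQ5_ISKNN/123-157", "ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES"),
]
fasta_text = records_to_fasta(records)
print(fasta_text)
```

Returning the text by default makes it easy to inspect the result in a notebook before committing it to disk.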



## Converting between formats

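Because every reader produces the same kind of DataFrame and every writer consumes it, converting between sequence formats takes two calls: read one format, then write another.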

In [11]:
df = ph.read_phylip('PF08793_seed.phylip')

fasta_str = df.phylo.to_fasta()

print(fasta_str)

>3ahCCkdjdF
KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG
>TB1w14mi0N
ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES
>3TCTdyZrdP
YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG
>Rq5q066CZ3
LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG
>JzBKFVzlMA
VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT
>tsUbEnBKLS
VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG
>9dxxNo4mDs
TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC
>pXItZYbnRA
KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH
>xu085ucxDd
KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT
>oNwqOiuw2J
KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP
>BzXdJ5F5Y2
LCSKWKA----NP-LVNPATGRKIKKDGPVYEKIQKKCS-
>w7p6PCce11
YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD
>k1VWK003NE
-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD
>jYSHi3V0xs
ECEQWLA----NK-GINPRTGKAIKIGGPTYKKLEMECKE
>dkdrmREsLS
VCKKFLA----NK-TVSPYSGRPIKPGKKLYNDLEKHCSG
>yufq3xG2R2
QCRAFEE----NP-DVNPNTGRRISPTGPIASSMRRRCMN
>Getr1VsCQj
KCNQLRN----NRYTVNPVSNRAIAPRGDTANTLRRICEQ
>3EZbWMjIYO
QCETFKR----NKQAVSPLTNCPIDKFGRTAARFRKECD-



## Reading Tree Data

PhyloPandas can also read phylogenetic tree data.

In [12]:
with open('PF08793_seed.newick', 'r') as f:
    print(f.read())

(Q8QUQ5_ISKNN/45-79:0.38376442,Q8QUQ6_ISKNN/37-75:0.93473288,(Q8QUQ5_ISKNN/123-157:1.14582942,(Q0E553_SFAVA/142-176:0.94308689,(Q0E553_SFAVA/184-218:0.98977147,(Q0E553_SFAVA/60-94:0.95706148,(((019R_FRG3G/5-39:0.06723315,(019R_FRG3G/139-172:0.05690376,(019R_FRG3G/249-283:0.95772959,019R_FRG3G/302-336:0.58361302)2.745285:0.61968795)1.680162:0.12814819)8.545520:0.30724093,((VF232_IIV6/64-98:0.77338949,((VF380_IIV6/7-45:0.56133629,VF380_IIV3/8-47:0.64307079)7.484104:0.37367018,(VF378_IIV6/4-38:0.31530205,O41158_PBCV1/63-96:0.46076842)1.909391:0.20522645)0.218717:0.09388521)2.531435:0.20551347,Q0E553_SFAVA/14-48:1.58834786)0.265099:0.00027193)6.209727:0.37908212,(Q8QUQ5_ISKNN/164-198:0.63907222,Q8QUQ5_ISKNN/7-42:0.96743219)2.806276:0.362965)0.677978:0.20054193)0.718698:0.20642561)2.503850:0.27168922)1.162623:0.15868612)6.040602:0.48939921);



In [6]:
ph.read_newick('PF08793_seed.newick')

Unnamed: 0,distance,id,label,length,parent,type,uid
0,0.0,0,0,0.0,,root,JKzlhJ40Yz
1,0.383764,Q8QUQ5_ISKNN/45-79,Q8QUQ5_ISKNN/45-79,0.383764,0.0,leaf,lynIqiNKd4
2,0.934733,Q8QUQ6_ISKNN/37-75,Q8QUQ6_ISKNN/37-75,0.934733,0.0,leaf,7B7HIYeolg
3,0.489399,1,1,0.489399,0.0,node,Hvhcvm2B2n
4,1.635229,Q8QUQ5_ISKNN/123-157,Q8QUQ5_ISKNN/123-157,1.145829,1.0,leaf,2Ie7VZ2vKq
5,0.648085,2,2,0.158686,1.0,node,9P4bNQBbsI
6,1.591172,Q0E553_SFAVA/142-176,Q0E553_SFAVA/142-176,0.943087,2.0,leaf,LaA8DgBqnx
7,0.919775,3,3,0.271689,2.0,node,xTHBQofQbB
8,1.909546,Q0E553_SFAVA/184-218,Q0E553_SFAVA/184-218,0.989771,3.0,leaf,jt0bRnr6el
9,1.1262,4,4,0.206426,3.0,node,lZk1Zvpc9Y


## Why is PhyloPandas useful? 

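We already have Biopython, DendroPy, ete3, etc., right? The difference is that a tree read by PhyloPandas is an ordinary DataFrame, so standard Pandas operations such as boolean indexing work on it directly. Below, a single line keeps only the leaf rows.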

In [7]:
df = ph.read_newick('PF08793_seed.newick')

df2 = df.loc[df.type == "leaf"]

In [9]:
df

Unnamed: 0,distance,id,label,length,parent,type,uid
0,0.0,0,0,0.0,,root,kr9FApEa21
1,0.383764,Q8QUQ5_ISKNN/45-79,Q8QUQ5_ISKNN/45-79,0.383764,0.0,leaf,pnyhTVyfcB
2,0.934733,Q8QUQ6_ISKNN/37-75,Q8QUQ6_ISKNN/37-75,0.934733,0.0,leaf,ox9Nz2jhu7
3,0.489399,1,1,0.489399,0.0,node,C3AXNs47MO
4,1.635229,Q8QUQ5_ISKNN/123-157,Q8QUQ5_ISKNN/123-157,1.145829,1.0,leaf,NdpQLLOzIN
5,0.648085,2,2,0.158686,1.0,node,YCPwdfRw3l
6,1.591172,Q0E553_SFAVA/142-176,Q0E553_SFAVA/142-176,0.943087,2.0,leaf,LEjTuZ3MMt
7,0.919775,3,3,0.271689,2.0,node,4cDk4MHXw0
8,1.909546,Q0E553_SFAVA/184-218,Q0E553_SFAVA/184-218,0.989771,3.0,leaf,BOyEQ3kXkl
9,1.1262,4,4,0.206426,3.0,node,bgDyASzOsO


In [8]:
df2

Unnamed: 0,distance,id,label,length,parent,type,uid
1,0.383764,Q8QUQ5_ISKNN/45-79,Q8QUQ5_ISKNN/45-79,0.383764,0,leaf,pnyhTVyfcB
2,0.934733,Q8QUQ6_ISKNN/37-75,Q8QUQ6_ISKNN/37-75,0.934733,0,leaf,ox9Nz2jhu7
4,1.635229,Q8QUQ5_ISKNN/123-157,Q8QUQ5_ISKNN/123-157,1.145829,1,leaf,NdpQLLOzIN
6,1.591172,Q0E553_SFAVA/142-176,Q0E553_SFAVA/142-176,0.943087,2,leaf,LEjTuZ3MMt
8,1.909546,Q0E553_SFAVA/184-218,Q0E553_SFAVA/184-218,0.989771,3,leaf,BOyEQ3kXkl
10,2.083262,Q0E553_SFAVA/60-94,Q0E553_SFAVA/60-94,0.957061,4,leaf,7ozAopZGki
14,2.080298,019R_FRG3G/5-39,019R_FRG3G/5-39,0.067233,7,leaf,iiUxOmwKIK
16,2.198117,019R_FRG3G/139-172,019R_FRG3G/139-172,0.056904,8,leaf,5wAiI9pPMG
18,3.718631,019R_FRG3G/249-283,019R_FRG3G/249-283,0.95773,9,leaf,2t3ZxLlU5l
19,3.344514,019R_FRG3G/302-336,019R_FRG3G/302-336,0.583613,9,leaf,3gOACsbJ5t
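
Since the tree is a plain DataFrame, other everyday Pandas operations apply just as well. The sketch below rebuilds by hand a few rows copied from the `read_newick` output above, so it runs without PhyloPandas or the data file, and then filters, sorts, and counts. The variable names are illustrative only.

```python
import pandas as pd

# A few rows copied from the `read_newick` output above.
tree = pd.DataFrame({
    "id": [
        "0",
        "Q8QUQ5_ISKNN/45-79",
        "Q8QUQ6_ISKNN/37-75",
        "1",
        "Q8QUQ5_ISKNN/123-157",
    ],
    "type": ["root", "leaf", "leaf", "node", "leaf"],
    "length": [0.0, 0.383764, 0.934733, 0.489399, 1.145829],
    "distance": [0.0, 0.383764, 0.934733, 0.489399, 1.635229],
})

# Keep only the tips of the tree.
leaves = tree[tree["type"] == "leaf"]

# Rank the tips by their distance from the root, farthest first.
by_distance = leaves.sort_values("distance", ascending=False)
farthest_leaf = by_distance.iloc[0]["id"]

# Count how many rows of each kind the tree contains.
type_counts = tree["type"].value_counts()

print(farthest_leaf)
print(type_counts)
```

None of these steps needs a tree-specific API; they are the same calls one would use on any other table.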


# Here is where the real magic happens!

## Reading Sequence *and* Tree Data

PhyloPandas can combine sequence and tree data in a single DataFrame.

In [15]:
# Read sequences.
df = ph.read_fasta('PF08793_seed.fasta')

# Read tree.
df = df.phylo.read_newick('PF08793_seed.newick', combine_on='id')
df

Unnamed: 0,description,id,label,sequence,uid,distance,length,parent,type
0,,0,0,,9blcbb9Umv,0.0,0.0,,root
1,019R_FRG3G/139-172,019R_FRG3G/139-172,019R_FRG3G/139-172,-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD,FE2bSiaUcq,2.19812,0.0569038,8.0,leaf
2,019R_FRG3G/249-283,019R_FRG3G/249-283,019R_FRG3G/249-283,VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG,4rUH5dZKLd,3.71863,0.95773,9.0,leaf
3,019R_FRG3G/302-336,019R_FRG3G/302-336,019R_FRG3G/302-336,TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC,4OWuG2q3KM,3.34451,0.583613,9.0,leaf
4,019R_FRG3G/5-39,019R_FRG3G/5-39,019R_FRG3G/5-39,YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD,Lok1X0tNUv,2.0803,0.0672332,7.0,leaf
5,,1,1,,kVEXAYog9D,0.489399,0.489399,0.0,node
6,,10,10,,31j7X3Yu8R,1.7061,0.00027193,6.0,node
7,,11,11,,GrAJsJU0Qj,1.91161,0.205513,10.0,node
8,,12,12,,mSJ5dqHny9,2.00549,0.0938852,11.0,node
9,,13,13,,bfBDROZnei,2.37916,0.37367,12.0,node
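
One way to picture the result above is as an outer join of the sequence table and the tree table on the `id` column: leaves match a sequence, while the root and internal nodes have none and are left with missing values. The sketch below reproduces that shape with plain Pandas on a few hand-copied rows. It is an illustration of the idea, not necessarily how PhyloPandas implements `combine_on`.

```python
import pandas as pd

# Two sequences copied from the FASTA output above.
sequences = pd.DataFrame({
    "id": ["019R_FRG3G/139-172", "019R_FRG3G/5-39"],
    "sequence": [
        "-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD",
        "YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD",
    ],
})

# The root and the two matching leaves from the tree output above.
tree = pd.DataFrame({
    "id": ["0", "019R_FRG3G/139-172", "019R_FRG3G/5-39"],
    "type": ["root", "leaf", "leaf"],
    "length": [0.0, 0.0569038, 0.0672332],
})

# An outer join keeps every row from both tables; rows without a
# partner get missing values in the other table's columns.
combined = sequences.merge(tree, on="id", how="outer")
print(combined)
```

The root row ends up with no sequence, just as the `root` and `node` rows have empty `description` and `sequence` fields in the combined output above.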


This lets us build phylogenetics tools around a single, core DataFrame.