# Instructions
0. Install Biopython in the Jupyter kernel if it is not already installed.
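
A quick way to check whether Biopython is importable from the active kernel is sketched below. The helper name `is_installed` is hypothetical and not part of this repository; Biopython's import name is `Bio`, and `%pip install biopython` is the usual notebook magic for installing into the running kernel.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be found by the current kernel."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("Bio"):
    # In a notebook cell, run:  %pip install biopython
    print("Biopython not found; install it with: %pip install biopython")
```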


Initializing a PhyloTree object:

 1) assigns labels to internal nodes if labels are not present.
    Default labels for internal nodes have the form IN_0_98, where
    the first number is the node index and the second is the node's
    bootstrap value.
    
 2) (optional) collapses branches whose bootstrap value is below a threshold

    2.1) Bootstrap values from FastTree are rescaled from 0-1 to 0-100 to match
         IQ-TREE values.
         
    2.2) Set bootstrap_threshold to None (default) to disable collapsing branches.
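
The labelling and threshold rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PhyloTree implementation: the function names are hypothetical, supports are assumed to be on a 0-1 scale when none exceeds 1, and a node without a bootstrap value is assumed to be labelled with its index only.

```python
from typing import List, Optional


def rescale_supports(values: List[float]) -> List[float]:
    """Rescale to 0-100 when every value lies in 0-1 (FastTree-style)."""
    if values and max(values) <= 1:
        return [v * 100 for v in values]
    return list(values)


def internal_label(index: int, support: Optional[float]) -> str:
    """Build a label such as 'IN_0_98' from node index and bootstrap value."""
    if support is None:
        # Assumption: no bootstrap value means the label carries the index only.
        return f"IN_{index}"
    return f"IN_{index}_{round(support)}"


def should_collapse(support: Optional[float], threshold: Optional[float]) -> bool:
    """A threshold of None disables collapsing altogether."""
    if threshold is None or support is None:
        return False
    return support < threshold


fasttree_supports = [0.98, 0.42, 0.75]  # toy values, not taken from a real tree
supports = rescale_supports(fasttree_supports)
labels = [internal_label(i, s) for i, s in enumerate(supports)]
print(labels)
```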

In [1]:
from src import PhyloTree, filterFASTAbyIDs, exportTreeClustersToFile


"""
Arguments:
(optional) tree_format: defaults to 'newick'
(optional) bootstrap_threshold: defaults to None

NOTE:
Internal nodes are labelled when the PhyloTree object is initialized,
so labels may differ between PhyloTree objects initialized with
different bootstrap_threshold values. If the threshold is anything
other than None, branches with a bootstrap value below it are
collapsed, and some internal nodes may disappear.
"""


tree_path = 'example_taxo.newick'

phylotree = PhyloTree(tree_path=tree_path, tree_format='newick', bootstrap_threshold=None)

In [2]:
# Export tree to file
phylotree.exportTree(outfile='labelled_example.newick', tree_format='newick')


Load the exported tree ('labelled_example.newick') into iTOL or the tool of your choice and select:

a) internal nodes of interest (copy label) and/or
b) leaves (final reference sequences) of interest

NOTE: iTOL shows labels without '_', so, e.g., 'IN 1 9' in iTOL corresponds to 'IN_1_9' here.
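
A label copied from iTOL can be converted back with a one-line helper. This is a minimal sketch; the function name is hypothetical, and it assumes the underscores appear as single spaces in the copied label.

```python
def itol_label_to_tree_label(label: str) -> str:
    """Restore the underscores that are missing from a label copied from iTOL."""
    return label.strip().replace(" ", "_")


print(itol_label_to_tree_label("IN 1 9"))
```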

Return to the Jupyter notebook with the initialized phylotree object.

EXAMPLE: we have selected some internal nodes and some leaves. For the
         internal nodes, we want the resulting cluster (all leaves
         below each node); for the leaves, we want the closest
         internal node (cluster) that contains all of them.


In [3]:
# First, find all leaves downstream of the selected node
cluster_IN_13 = phylotree.getAllDescendantsOfTargetNode('IN_13')
cluster_IN_13

['Ancylomarina_sp_A4_MMP09721042_QTZN01000004_1_MMP09721042_286_375050_375931_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_Marinifilaceae_g_Labilibaculum',
 'Bacteroidetes_bacterium_strain_SZUA_561_MMP09288109_QKHT01000160_1_005_002565_003389_neg_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_UBA5342_g_SZUA_561_s_SZUA_561_sp003249935',
 'Bacteroidales_bacterium_UBA5342_MMP06454386_DHQI01000129_1_010_016822_017646_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_UBA5342_g_UBA5342_s_UBA5342_sp002408065']

In [4]:
# Now define the sets of leaves whose closest internal node we want to find

leaves = {
    'group_a': [
        'Bacteroidales_bacterium_UBA5342_MMP06454386_DHQI01000129_1_010_016822_017646_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_UBA5342_g_UBA5342_s_UBA5342_sp002408065',
        'Ancylomarina_sp_A4_MMP09721042_QTZN01000004_1_MMP09721042_286_375050_375931_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_Marinifilaceae_g_Labilibaculum'
        ],
    'group_b': [
        'ANME_1_cluster_archaeon_CONS3730B06UFb1_MMP08574241_QENH01000192_1_0MMP08574241_8_06644_07546_neg_d_Archaea_p_Halobacterota_c_Syntrophoarchaeia_o_ANME_1_f_ANME_1',
        'ANME_2_cluster_archaeon_HR1_MMP06562579_MZXQ01000108_1_0MMP06562579_3_02148_02891_pos_d_Archaea_p_Halobacterota_c_Methanosarcinia_o_Methanosarcinales_f_HR1_g_HR1_s_HR1_sp002926195'
        ]
}

In [5]:
IN_a = phylotree.getClosestCommonAncestor(target_names=leaves['group_a'])
IN_a

'IN_13'

In [6]:
# Repeat for the second group of leaves
IN_b = phylotree.getClosestCommonAncestor(target_names=leaves['group_b'])
IN_b

'IN_0'
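
To illustrate what a closest-common-ancestor lookup does, here is a minimal sketch on a toy tree stored as a child-to-parent map. The tree, the node names, and both functions are hypothetical; this is not how `getClosestCommonAncestor` is necessarily implemented. As in the example above, leaves from distant parts of the tree resolve to the root.

```python
from typing import Dict, List, Optional

# Hypothetical toy tree, child -> parent (the root has parent None).
parents: Dict[str, Optional[str]] = {
    "IN_0": None,
    "IN_1": "IN_0",
    "IN_2": "IN_0",
    "leaf_a": "IN_1",
    "leaf_b": "IN_1",
    "leaf_c": "IN_2",
}


def path_to_root(node: str) -> List[str]:
    """List the ancestors of a node, from its parent up to the root."""
    path = []
    parent = parents[node]
    while parent is not None:
        path.append(parent)
        parent = parents[parent]
    return path


def closest_common_ancestor(leaves: List[str]) -> str:
    """Return the deepest internal node that is an ancestor of every leaf."""
    first, *rest = (path_to_root(leaf) for leaf in leaves)
    shared = set(first).intersection(*map(set, rest))
    # Paths run from the leaf upwards, so the first shared node is the deepest.
    return next(node for node in first if node in shared)


print(closest_common_ancestor(["leaf_a", "leaf_b"]))
print(closest_common_ancestor(["leaf_a", "leaf_c"]))
```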

In [7]:
"""
We can also enumerate all clusters defined by internal nodes at once.
The result is a Python dictionary keyed by internal node label:
"""

tree_clusters = phylotree.extractClustersFromInternalNodes()

# Each cluster can be looked up by its internal node label:
tree_clusters['IN_13']

['Ancylomarina_sp_A4_MMP09721042_QTZN01000004_1_MMP09721042_286_375050_375931_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_Marinifilaceae_g_Labilibaculum',
 'Bacteroidetes_bacterium_strain_SZUA_561_MMP09288109_QKHT01000160_1_005_002565_003389_neg_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_UBA5342_g_SZUA_561_s_SZUA_561_sp003249935',
 'Bacteroidales_bacterium_UBA5342_MMP06454386_DHQI01000129_1_010_016822_017646_pos_d_Bacteria_p_Bacteroidota_c_Bacteroidia_o_Bacteroidales_f_UBA5342_g_UBA5342_s_UBA5342_sp002408065']

In [8]:
"""
We can also filter clusters by a common pattern (substring) that
every leaf (reference name) in the cluster must contain.
For example, let's extract clusters whose references all belong to the
phylum Proteobacteria
"""

proteo_clusters = phylotree.extractClustersFromInternalNodes(filter_by_pattern='Proteobacteria')

proteo_clusters

{'IN_16': ['Agarivorans_gilvus_MMP04215004_NZ_0CP013021_1_348_0365630_0366508_pos_d_Bacteria_p_Proteobacteria_c_Gammaproteobacteria_o_Enterobacterales_f_Psychromonadaceae_A_g_Agarivorans_s_Agarivorans_gilvus',
  'Alteromonadales_BS08_MMP05661223_NZ_MRUG01000033_1_220_252950_253837_neg_d_Bacteria_p_Proteobacteria_c_Gammaproteobacteria_o_Pseudomonadales_f_Cellvibrionaceae_g_Teredinibacter_s_Teredinibacter_sp001922955'],
 'IN_18': ['Afifella_marina_BN_126_MMP03080610_FMVW01000002_1_552_0584494_0585393_pos_d_Bacteria_p_Proteobacteria_c_Alphaproteobacteria_o_Rhizobiales_f_Afifellaceae_g_Afifella_s_Afifella_marina',
  'Azoarcus_toluclasticus_MF63_MMP02441344_NZ_KB899493_1_262_290533_291402_pos_d_Bacteria_p_Proteobacteria_c_Gammaproteobacteria_o_Burkholderiales_f_Rhodocyclaceae_g_Azoarcus_A_s_Azoarcus_A_toluclasticus']}
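
The filtering rule (keep a cluster only if every leaf name contains the pattern) can be sketched as a dictionary comprehension. The function name and the toy cluster names below are hypothetical; the real `filter_by_pattern` option may differ in details such as how empty clusters are handled.

```python
from typing import Dict, List


def filter_clusters_by_pattern(
    clusters: Dict[str, List[str]], pattern: str
) -> Dict[str, List[str]]:
    """Keep clusters in which every leaf name contains `pattern`."""
    return {
        node: leaves
        for node, leaves in clusters.items()
        if leaves and all(pattern in leaf for leaf in leaves)
    }


toy_clusters = {
    "IN_1": ["ref1_p_Proteobacteria", "ref2_p_Proteobacteria"],
    "IN_2": ["ref3_p_Proteobacteria", "ref4_p_Bacteroidota"],
}
print(filter_clusters_by_pattern(toy_clusters, "Proteobacteria"))
```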

In [9]:
"""
Export the cluster dictionary to a TSV file
"""
exportTreeClustersToFile(proteo_clusters, outfile='proteoclusters.tsv')
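
For readers who want to see what such an export might look like, here is a minimal sketch that writes one row per (cluster, reference) pair with the standard `csv` module. The column layout and the function name are assumptions; the file produced by `exportTreeClustersToFile` may be organized differently.

```python
import csv
import io
from typing import Dict, List, TextIO


def write_clusters_tsv(clusters: Dict[str, List[str]], handle: TextIO) -> None:
    """Write one tab-separated row per (cluster, reference) pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["cluster", "reference"])  # assumed header
    for node, leaves in clusters.items():
        for leaf in leaves:
            writer.writerow([node, leaf])


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_clusters_tsv({"IN_16": ["ref_1", "ref_2"], "IN_18": ["ref_3"]}, buffer)
print(buffer.getvalue())
```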

In [13]:


# Parse external nodes by common taxon (export to tsv)
# Translate reference names back and forth
# Look at counting individual clusters
# Look at clusters.tsv export function (already made)
# Reference sequences with cluster ID but not with taxonomy: what happens with them?

In [12]:
# Obtaining reference sequences of selected tree cluster

# filterFASTAbyIDs(
#     input_fasta="ref_database.faa",
#     record_ids=tree_clusters["IN_7_83"],
#     output_fasta="IN_7_83.faa"
# )
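
The idea behind filtering a FASTA file by record ID can be sketched with the standard library alone. This is an illustration, not the `filterFASTAbyIDs` implementation: the function names are hypothetical, and the record ID is assumed to be the header text up to the first whitespace.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Assumption: the ID is the header text up to the first whitespace.
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> str:
    """Return FASTA text containing only the records whose ID is in `wanted`."""
    return "".join(
        f">{rid}\n{seq}\n" for rid, seq in read_fasta(lines) if rid in wanted
    )


toy_fasta = [">ref_1 some description", "MKT", "LLV", ">ref_2", "MAA", ">ref_3", "MCC"]
print(filter_fasta(toy_fasta, {"ref_1", "ref_3"}))
```

With a real file, the lines could come from `open("ref_database.faa")` and the wanted IDs from a cluster such as `set(tree_clusters["IN_7_83"])`.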