# Phylogenetic Compositional Tensor Factorization (phylo-CTF) Introduction

This tutorial builds on the CTF tutorial and explains how to perform CTF weighted by phylogenetic information. If you would like to better understand CTF, please complete that tutorial first. Here we use Phylogenetic Compositional Tensor Factorization (phylo-CTF), which is provided in the software package [gemelli](https://github.com/biocore/gemelli). Phylo-CTF can account for repeated measures, compositionality, and sparsity in microbiome data.

In this tutorial we use _gemelli_ to perform phylo-CTF on a time series dataset comparing subjects with Crohn's disease and controls over a period of 25 weeks, published in [Vázquez-Baeza et al](https://gut.bmj.com/content/67/9/1743). The processed data originally come from [here](https://qiita.ucsd.edu/study/description/2538#) and can be downloaded with the following links:

* **Table** (table.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/table.qza)
* **Rarefied Table** (rarefied-table.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/rarefied-table.qza)
* **Sample Metadata** (metadata.tsv) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/metadata.tsv)
* **Feature Metadata** (taxonomy.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/taxonomy.qza)
* **Tree** (sepp-insertion-tree.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/sepp-insertion-tree.qza)

**Note**: This tutorial assumes you have installed [QIIME2](https://qiime2.org/) using one of the procedures in the [install documents](https://docs.qiime2.org/2020.2/install/). It also assumes you have installed [Qurro](https://github.com/biocore/qurro), [DEICODE](https://github.com/biocore/DEICODE), [Empress](https://github.com/biocore/empress), and [gemelli](https://github.com/biocore/gemelli).

First, we will make a tutorial directory, download the data above, and move the files to the `IBD-2538/data` directory:

```bash
# create the tutorial directory and its data subdirectory
mkdir -p IBD-2538/data
# move the downloaded files into IBD-2538/data
```
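
Before importing anything, it can help to confirm that the five files listed above actually landed in `IBD-2538/data`. The following is a minimal sketch using only the Python standard library; the helper name `missing_files` is hypothetical and not part of any of the packages used in this tutorial.

```python
from pathlib import Path

# File names taken from the download list above.
EXPECTED_FILES = [
    "table.qza",
    "rarefied-table.qza",
    "metadata.tsv",
    "taxonomy.qza",
    "sepp-insertion-tree.qza",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


missing = missing_files("IBD-2538/data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All tutorial files are in place.")
```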

Next we will import our data with the QIIME2 Python API. 


```python
import os
import warnings
import qiime2 as q2
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table(s)
table = q2.Artifact.load('IBD-2538/data/table.qza')
rarefied_table = q2.Artifact.load('IBD-2538/data/rarefied-table.qza')
# import metadata
metadata = q2.Metadata.load('IBD-2538/data/metadata.tsv')
# import tree
tree = q2.Artifact.load('IBD-2538/data/sepp-insertion-tree.qza')
# import taxonomy
taxonomy = q2.Artifact.load('IBD-2538/data/taxonomy.qza')
```


In order to account for the correlation among samples from the same subject we will employ phylogenetic compositional tensor factorization (phylo-CTF). CTF builds on the robust centered log-ratio transform covered in the RPCA tutorial (found [here](https://forum.qiime2.org/t/robust-aitchison-pca-beta-diversity-with-deicode)), which accounts for compositionality and sparsity, but restructures and factors the data as a tensor. Phylogenetic CTF additionally incorporates the internal nodes and branch lengths of a tree into the factorization. Here we will run phylo-CTF through _gemelli_ and explore and interpret the different results.
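
To make the robust centered log-ratio (rclr) idea concrete, here is a small sketch in plain Python. It assumes the usual definition, in which zeros are treated as missing and each sample's observed log-counts are centered on their own mean. The function name `rclr_sample` and the example counts are illustrative, and this is not gemelli's implementation.

```python
import math


def rclr_sample(counts):
    """Robust centered log-ratio of one sample's counts.

    Zeros are treated as unobserved: they are excluded from the
    centering and returned as None instead of being replaced by a
    pseudocount.
    """
    logs = [math.log(c) for c in counts if c > 0]
    if not logs:
        return [None] * len(counts)
    center = sum(logs) / len(logs)  # mean log over observed features only
    return [math.log(c) - center if c > 0 else None for c in counts]


sample = [10, 0, 100, 1000]
print(rclr_sample(sample))
```

The observed values of each sample sum to (approximately) zero after the transform, while the zero entry stays missing rather than being imputed.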

There are two forms of phylogenetic CTF: (1) with taxonomy information and (2) without. Here we will run `phylogenetic-ctf-with-taxonomy`; the version without taxonomy is called `phylogenetic-ctf`.

To run phylo-CTF we only need one command (`gemelli phylogenetic-ctf-with-taxonomy`). The required inputs are:

1. table
    - The table is of type `FeatureTable[Frequency]` which is a table where the rows are features (e.g. ASVs/microbes), the columns are samples, and the entries are the number of sequences for each sample-feature pair.
2. tree
    - This is a phylogeny of type `Phylogeny[Rooted]` where all the features in the `table` are represented in the tree. 
3. feature-metadata-file
    - This is a metadata file (e.g. tsv, or `FeatureData[Taxonomy]` .qza) where the rows are matched to the table features and the columns are feature metadata such as taxonomy, gene pathway, etc.
4. sample-metadata
    - This is a QIIME2 formatted [metadata](https://docs.qiime2.org/2020.2/tutorials/metadata/) file (e.g. tsv format) where the rows are samples matched to the table (1) and the columns are different sample data (e.g. time point).
5. individual-id-column
    - This is the name of the column in the sample metadata (4) that indicates the individual subject/site (e.g. subject ID) that was sampled repeatedly.
6. state-column
    - This is the name of the column in the sample metadata (4) that indicates the numeric repeated measure (e.g. time in months/days) or non-numeric category (e.g. decade or body site).
7. output-dir
    - The desired location of the output. When using the Python API, as in this tutorial, the outputs are returned as a results object instead. We will cover each output independently below.
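
The `individual-id-column` and `state-column` together determine how samples are arranged into a tensor. The sketch below shows that arrangement on made-up data using plain Python lists, with subjects on the first axis, features on the second, and states on the third. The function name `build_tensor` and the toy values are hypothetical, and subject/state combinations that were never sampled are left as `None` rather than imputed.

```python
def build_tensor(table, sample_info):
    """Arrange a feature table into a subject x feature x state tensor.

    table: {sample_id: {feature_id: count}}
    sample_info: {sample_id: (subject_id, state)}
    Returns (tensor, subjects, features, states); cells for samples
    that were never collected are None.
    """
    subjects = sorted({subj for subj, _ in sample_info.values()})
    states = sorted({state for _, state in sample_info.values()})
    features = sorted({f for counts in table.values() for f in counts})

    # Look up the sample (if any) taken from a subject at a given state.
    by_key = {key: sample for sample, key in sample_info.items()}

    tensor = []
    for subj in subjects:
        per_feature = []
        for feat in features:
            row = []
            for state in states:
                sample = by_key.get((subj, state))
                row.append(None if sample is None
                           else table[sample].get(feat, 0))
            per_feature.append(row)
        tensor.append(per_feature)
    return tensor, subjects, features, states


# Toy data: two subjects, two time points, one missing sample.
table = {
    "s1.t0": {"f1": 5, "f2": 0},
    "s1.t1": {"f1": 7, "f2": 3},
    "s2.t0": {"f1": 1, "f2": 9},
}
sample_info = {
    "s1.t0": ("s1", 0),
    "s1.t1": ("s1", 1),
    "s2.t0": ("s2", 0),
}
tensor, subjects, features, states = build_tensor(table, sample_info)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 2 2 2
```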


In this tutorial our individual-id-column is `host_subject_id` and our state-column is `timepoint`, which records the time point of each sample in the sample metadata. Now we are ready to run phylo-CTF:


```python
from qiime2.plugins.gemelli.actions import phylogenetic_ctf_with_taxonomy
from qiime2.plugins.fragment_insertion.methods import filter_features

# make a directory for the results (visualizations are saved here later)
os.makedirs('IBD-2538/ctf-results', exist_ok=True)
# first ensure all the table features are in the tree
table = filter_features(table, tree).filtered_table
# now run phylo-CTF
ctf_results = phylogenetic_ctf_with_taxonomy(table, tree, taxonomy.view(q2.Metadata),
                                             metadata, 'host_subject_id', 'timepoint',
                                             min_feature_frequency=10)

ctf_results
```


```
Results (name = value)
--------------------------------------------------------------------------------------------------------------------
subject_biplot           = <artifact: PCoAResults % Properties('biplot') uuid: 153a5d84-648a-4ef5-a52c-c8db20a487e8>
state_biplot             = <artifact: PCoAResults % Properties('biplot') uuid: 668f9315-9d79-4964-acd5-6f81afb43d5f>
distance_matrix          = <artifact: DistanceMatrix uuid: b6ff2ad6-e869-4b3a-aeaa-cf091e983662>
state_subject_ordination = <artifact: SampleData[SampleTrajectory] uuid: 5016ae27-ad16-4a30-a806-30b21f9ec115>
state_feature_ordination = <artifact: FeatureData[FeatureTrajectory] uuid: dab556b3-06fe-4101-ba9b-ebedbf417795>
counts_by_node_tree      = <artifact: Phylogeny[Rooted] uuid: 4ddbace6-2eeb-40b7-b99f-a91924341aa6>
counts_by_node           = <artifact: FeatureTable[Frequency] uuid: bc5eb5fa-5313-4f47-84b5-b1cb04c99a4b>
t2t_taxonomy             = <artifact: FeatureData[Taxonomy] uuid: 2af2010a-c636-48fe-a62b-5b1c97c82
```


We will now cover the output files:
* subject_biplot
* state_biplot
* distance_matrix
* state_subject_ordination
* state_feature_ordination
* counts_by_node_tree
* counts_by_node
* t2t_taxonomy
* subject_table


First, we will explore the `subject_biplot`, an ordination where dots represent _subjects_ rather than _samples_ and arrows represent features (e.g. ASVs). To do so we need to aggregate the metadata by subject (i.e. collapse the metadata of all samples from a given subject). This can be done by hand or using DataFrames in Python (with pandas) or R, like so:


```python
import pandas as pd
from qiime2 import Metadata
import numpy as np
from biom import Table
from skbio import OrdinationResults

# first we import the metadata into pandas
mf = pd.read_csv('IBD-2538/data/metadata.tsv', sep='\t', index_col=0)
# next we aggregate by subjects (i.e. 'host_subject_id')
# and keep the first instance of 'ibd' and 'active_disease' by subject.
mf = mf.groupby('host_subject_id').agg({'ibd': 'first', 'active_disease': 'first'})
# now we save the metadata in QIIME2 format.
mf.index.name = '#SampleID'
mf.to_csv('IBD-2538/data/subject-metadata.tsv', sep='\t')
mf.head(5)
```


| #SampleID | ibd | active_disease |
|---|---|---|
| s1000100 | Control | quiescent |
| s1000200 | Control | quiescent |
| s1000300 | Control | quiescent |
| s1000500 | Control | quiescent |
| s1000600 | Control | quiescent |
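
Keeping the `first` value per subject is only safe when the field does not change between samples of the same subject. A quick way to check this, sketched here in plain Python on made-up records (the helper `inconsistent_subjects` is hypothetical), is to collect the distinct values seen for each subject:

```python
from collections import defaultdict


def inconsistent_subjects(records, subject_key, field):
    """Return subjects whose samples disagree on `field`.

    records: iterable of dicts, one per sample.
    """
    seen = defaultdict(set)
    for record in records:
        seen[record[subject_key]].add(record[field])
    return sorted(subj for subj, values in seen.items() if len(values) > 1)


# Toy sample metadata; values are illustrative only.
records = [
    {"host_subject_id": "a", "ibd": "Control"},
    {"host_subject_id": "a", "ibd": "Control"},
    {"host_subject_id": "b", "ibd": "Crohns"},
    {"host_subject_id": "b", "ibd": "Control"},
]
print(inconsistent_subjects(records, "host_subject_id", "ibd"))  # ['b']
```

With a pandas DataFrame of the sample metadata, a comparable check before aggregating would be `groupby('host_subject_id')['ibd'].nunique()`, looking for any subject with more than one distinct value.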


```python
# combine feature metadata
phylo_ctf_taxonomy = ctf_results.t2t_taxonomy.view(q2.Metadata).to_dataframe()
# feature loadings from the subject biplot, with readable axis names
phylo_ctf_feature_loadings = ctf_results.subject_biplot.view(OrdinationResults).features
phylo_ctf_feature_loadings = phylo_ctf_feature_loadings.rename(
    {0: 'PC1', 1: 'PC2', 2: 'PC3'}, axis=1)
phylo_ctf_taxonomy_and_loadings = pd.concat([phylo_ctf_taxonomy, phylo_ctf_feature_loadings], axis=1)
phylo_ctf_taxonomy_and_loadings.index.name = 'featureid'
phylo_ctf_taxonomy_and_loadings.head(5)
```


| featureid | Taxon | PC1 | PC2 | PC3 |
|---|---|---|---|---|
| AACATAGGGGGCAAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGCAGGCGGTCTGTTAAGTCAGATGTGAAAGGTTAGGGCTCAACCCTGAACGTG | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | 0.042088 | -0.03316 | -0.039901 |
| AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGACCGGCAAGTTGGAAGTGAAAACTATGGGCTCAACCCATAAATTG | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | -0.024657 | 0.060054 | -0.010885 |
| AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGACCGGCAAGTTGGAAGTGAAATCCATGGGCTCAACCCGTGAATTG | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | -0.004121 | -0.061348 | 0.006902 |
| AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGATTGGCAAGTTGGGAGTGAAATCTATGGGCTCAACCCATAAATTG | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | -0.0275 | -0.019843 | 0.036397 |
| AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAAACCATGGGCTCAACCCATGAATTG | k__Bacteria; p__Firmicutes; c__Clostridia; o__... | -0.001026 | 0.030095 | -0.000666 |


Unlike conventional CTF, the arrows in the phylo-CTF biplot can be both internal nodes and features of the table. To give an internal node a taxonomic label we use the lowest common ancestor for that node. To do this we feed the `t2t_taxonomy` output from phylo-CTF into the biplot. Additionally, we can use Empress to generate plots that combine the tree and biplot views at once. This view can help us understand which phylogenetic partitions separate our samples along a PC axis.
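
As an illustration of the lowest-common-ancestor idea, the sketch below takes the taxonomy strings of the tips under a node and keeps the ranks they all share. It is a simplified stand-in written for this tutorial (the function `lca_taxonomy` and the example lineages are hypothetical), not the routine gemelli uses to produce `t2t_taxonomy`.

```python
def lca_taxonomy(lineages, sep="; "):
    """Return the taxonomy prefix shared by every lineage string."""
    split = [lineage.split(sep) for lineage in lineages]
    shared = []
    # zip stops at the shortest lineage, so only common depths are compared.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return sep.join(shared)


tips = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
]
print(lca_taxonomy(tips))  # k__Bacteria; p__Firmicutes
```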

Phylo-CTF output contains a table that allows us to use the internal nodes of the tree as possible log-ratios (i.e. `counts_by_node`). In this table the internal nodes contain the sum of all the counts up to that node. **Note**: it is important not to take the log-ratio of two internal nodes in the tree, so we will visualize the loadings on the tree using [Empress](https://github.com/biocore/empress).
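
To illustrate what "the sum of all the counts up to that node" means, here is a toy version in plain Python. The tree is written as a dictionary of children and the node names are made up; `counts_by_node` itself is produced by gemelli, and this sketch only mimics the idea for a single sample.

```python
def node_counts(tree, tip_counts, node, out=None):
    """Sum tip counts below every node of a tree.

    tree: {internal_node: [child, ...]}; tips do not appear as keys.
    tip_counts: {tip: count} for one sample.
    Returns {node: total count of all tips descending from it}.
    """
    if out is None:
        out = {}
    children = tree.get(node)
    if children is None:            # a tip: use its observed count
        out[node] = tip_counts.get(node, 0)
    else:                           # an internal node: add up the children
        out[node] = sum(node_counts(tree, tip_counts, c, out)[c]
                        for c in children)
    return out


toy_tree = {"root": ["n1", "tipC"], "n1": ["tipA", "tipB"]}
toy_counts = {"tipA": 4, "tipB": 6, "tipC": 10}
totals = node_counts(toy_tree, toy_counts, "root")
print(totals["n1"], totals["root"])  # 10 20
```

In this toy tree the counts of `n1` are contained in the counts of `root`, so the two nodes share tips and their totals overlap.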

The log-ratio can then be taken with [Qurro](https://github.com/biocore/qurro) to explore the feature loading partitions (arrows) in this biplot as a log-ratio of the original table counts. This allows us to relate these low-dimensional representations back to our original data. Additionally, log-ratios provide a convenient per-sample value for additional analysis such as LME models.
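
For reference, the quantity being explored is, as we understand it, the natural log of the summed numerator counts over the summed denominator counts for each sample. The sketch below computes it by hand for a single sample. The function name, feature names, and counts are made up for illustration, and a sample with a zero on either side is returned as `None` because the log-ratio is undefined there.

```python
import math


def natural_log_ratio(sample_counts, numerator, denominator):
    """ln(sum of numerator counts / sum of denominator counts)."""
    top = sum(sample_counts.get(f, 0) for f in numerator)
    bottom = sum(sample_counts.get(f, 0) for f in denominator)
    if top == 0 or bottom == 0:
        return None  # undefined when either side has no counts
    return math.log(top / bottom)


sample = {"n_a": 8, "n_b": 5, "n_c": 0}
print(natural_log_ratio(sample, ["n_a"], ["n_b"]))
print(natural_log_ratio(sample, ["n_a"], ["n_c"]))  # None
```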


```python
from qiime2.plugins.empress.actions import community_plot

joint_tree_biplot = community_plot(ctf_results.counts_by_node_tree,
                                   ctf_results.subject_table,
                                   q2.Metadata(mf),
                                   pcoa=ctf_results.subject_biplot,
                                   feature_metadata=q2.Metadata(phylo_ctf_taxonomy_and_loadings),
                                   number_of_features=50,
                                   ignore_missing_samples=False,
                                   filter_extra_samples=False,
                                   filter_missing_features=True)
joint_tree_biplot.visualization.save('IBD-2538/ctf-results/empress.qzv')
```


```python
from qiime2.plugins.qurro.actions import loading_plot

# run Qurro
qurro_plot = loading_plot(ctf_results.subject_biplot, ctf_results.counts_by_node,
                          metadata,
                          feature_metadata=ctf_results.t2t_taxonomy.view(q2.Metadata))
# save visual
qurro_plot.visualization.save('IBD-2538/ctf-results/qurro.qzv')
```

```
'IBD-2538/ctf-results/qurro.qzv'
```

From this visualization we can see that the Crohn's subjects clearly separate from the healthy controls in the ordination on the right. We can also add a barplot of the PC2 loadings to the tree (see the [empress tutorial](https://github.com/biocore/empress) for more info on how to make these plots).

![image.png](etc/ctf_empress_plot_one2.png)

One additional benefit of having the phylogeny is the ability to explore ratios between phylogenetic partitions, for example between node n531, which represents Faecalibacterium spp., and nodes n135/n119, which represent Akkermansia spp.


![image.png](etc/ctf_empress_plot_two2.png)

![image.png](etc/ctf_empress_plot_three2.png)

We can then obtain the log-ratio between these nodes from the Qurro visualization.


![image.png](etc/ctf_phylo_qurro.png)


We can then export the log-ratio output from Qurro (saved here as `IBD-2538/ctf-results/sample_plot_data.tsv`) and plot the data.


```python
import pandas as pd

# import the sample metadata
metadata_one = pd.read_csv('IBD-2538/data/metadata.tsv',
                           sep='\t', index_col=0)
# import the log-ratio data exported from Qurro
metadata_two = pd.read_csv('IBD-2538/ctf-results/sample_plot_data.tsv',
                           sep='\t', index_col=0)[['Current_Natural_Log_Ratio']]
# merge the data on the sample IDs
log_ratio_metdata = pd.concat([metadata_two, metadata_one], axis=1)
# drop samples without a log-ratio
log_ratio_metdata = log_ratio_metdata.dropna(subset=['Current_Natural_Log_Ratio'])
log_ratio_metdata.index = log_ratio_metdata.index.astype(str)
# keep only the columns needed downstream
log_ratio_metdata = log_ratio_metdata[['timepoint', 'host_subject_id',
                                       'ibd', 'Current_Natural_Log_Ratio']]
# export in QIIME2 format
log_ratio_metdata.index.name = '#SampleID'
log_ratio_metdata.to_csv('IBD-2538/ctf-results/merged_sample_plot_data.tsv', sep='\t')
log_ratio_metdata.head(2)
```


| #SampleID | timepoint | host_subject_id | ibd | Current_Natural_Log_Ratio |
|---|---|---|---|---|
| 2538.1000102 | 25 | s1000100 | Control | 4.663439 |
| 2538.1000104000004 | 6 | s1000100 | Control | 0.470004 |



As you can see above, the metadata now has the added column `Current_Natural_Log_Ratio` from Qurro. We will continue to explore this log-ratio by first plotting it over time with q2-longitudinal.


```python
from qiime2.plugins.longitudinal.actions import (volatility, linear_mixed_effects)

# make a time series plot of log-ratio
temporal_plot = volatility(q2.Metadata(log_ratio_metdata),
                           'timepoint',
                           individual_id_column='host_subject_id',
                           default_group_column='ibd',
                           default_metric='Current_Natural_Log_Ratio')
temporal_plot.visualization.save('IBD-2538/ctf-results/log_ratio_plot.qzv')
```

```
'IBD-2538/ctf-results/log_ratio_plot.qzv'
```


This demonstrates that we can recreate the separation by IBD that we saw in both the `subject_biplot` and the `state_subject_ordination`, allowing us to associate specific taxa (in the numerator or denominator) with a particular phenotype.

![image.png](etc/ctf_phylo-lr-vol.png)

We can test whether this log-ratio differentiates samples by IBD status using a linear mixed effects (LME) model through q2-longitudinal.
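
Written out, one common form of such a model, with a random intercept per subject, is shown below. This is a sketch of the general structure under the assumption that the fixed effects are time, group, and their interaction; the exact terms fitted by q2-longitudinal depend on its settings.

$$
y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 g_i + \beta_3 t_{ij} g_i + u_i + \varepsilon_{ij}, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Here $y_{ij}$ is the log-ratio of subject $i$ in sample $j$, $t_{ij}$ is the `timepoint`, $g_i$ indicates the `ibd` group, $u_i$ is the subject-specific intercept, and $\varepsilon_{ij}$ is the residual error.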


```python
# Run LME model on log-ratio
lme_plot = linear_mixed_effects(q2.Metadata(log_ratio_metdata),
                                'timepoint',
                                individual_id_column='host_subject_id',
                                group_columns='ibd',
                                metric='Current_Natural_Log_Ratio')
lme_plot.visualization.save('IBD-2538/ctf-results/lme_log_ratio.qzv')
```

```
'IBD-2538/ctf-results/lme_log_ratio.qzv'
```

From this LME model we can see that indeed the IBD grouping is significant across time. 

![image.png](etc/ctf_lme-logratio-phylo.png)



