# Tutorial Introduction

Repeated-measures experimental designs (e.g. time series) are a valid and powerful way to control for inter-individual variation. However, conventional dimensionality reduction methods cannot account for the high correlation between samples taken from the same subject at different time points. This inherent correlation structure can cause subject groupings to confound or even outweigh important phenotype groupings. To address this we will use Compositional Tensor Factorization (CTF), which is provided in the software package [gemelli](https://github.com/cameronmartino/gemelli). CTF can account for repeated measures, compositionality, and sparsity in microbiome data.

In this tutorial we use _gemelli_ to perform CTF on a time series dataset comparing Crohn's and control subjects over a period of 25 weeks, published in [Vázquez-Baeza et al](https://gut.bmj.com/content/67/9/1743). The processed data were originally obtained from [here](https://qiita.ucsd.edu/study/description/2538#). The pre-processed data can be downloaded with the following links:

* **Table** (table.qza) | [download](https://github.com/cameronmartino/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/table.qza)
* **Rarefied Table** (rarefied-table.qza) | [download](https://github.com/cameronmartino/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/rarefied-table.qza)
* **Sample Metadata** (metadata.tsv) | [download](https://github.com/cameronmartino/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/metadata.tsv)
* **Feature Metadata** (taxonomy.qza) | [download](https://github.com/cameronmartino/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/taxonomy.qza)
* **Tree** (sepp-insertion-tree.qza) | [download](https://github.com/cameronmartino/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/sepp-insertion-tree.qza)

**Note**: This tutorial assumes you have installed [QIIME2](https://qiime2.org/) using one of the procedures in the [install documents](https://docs.qiime2.org/2020.2/install/). This tutorial also assumes you have installed [Qurro](https://github.com/biocore/qurro), [DEICODE](https://github.com/biocore/DEICODE), and [gemelli](https://github.com/cameronmartino/gemelli).

First we will make a tutorial directory, download the data above, and move the files to the `IBD-2538/data` directory:

```bash
# create the tutorial directory and its data subdirectory
mkdir -p IBD-2538/data
# move the downloaded files into IBD-2538/data
```
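Before importing anything, it can help to confirm that all five files ended up in the data directory. The following is a minimal sketch using only the Python standard library; the helper name `missing_files` is hypothetical and the file names are taken from the download list above.

```python
from pathlib import Path

# File names taken from the download list above.
EXPECTED_FILES = [
    "table.qza",
    "rarefied-table.qza",
    "metadata.tsv",
    "taxonomy.qza",
    "sepp-insertion-tree.qza",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


missing = missing_files("IBD-2538/data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All tutorial files are in place.")
```

If any names are printed as missing, move those files into `IBD-2538/data` before continuing.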

Next we will import our data with the QIIME2 Python API.

```python
import os
import warnings
import qiime2 as q2
# hide pandas Future/Deprecation Warning(s) for tutorial
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# import table(s)
table = q2.Artifact.load('IBD-2538/data/table.qza')
rarefied_table = q2.Artifact.load('IBD-2538/data/rarefied-table.qza')
# import metadata
metadata = q2.Metadata.load('IBD-2538/data/metadata.tsv')
# import tree
tree = q2.Artifact.load('IBD-2538/data/sepp-insertion-tree.qza')
# import taxonomy
taxonomy = q2.Artifact.load('IBD-2538/data/taxonomy.qza')
```



Next, we will demonstrate the issues with using conventional dimensionality reduction methods on time series data. To do this we will perform PCoA dimensionality reduction on weighted and unweighted UniFrac $\beta$-diversity distances. We will also run Aitchison Robust PCA with _DEICODE_ which is built on the same framework as CTF but does not account for repeated measures.


```python
from qiime2.plugins.deicode.actions import rpca
from qiime2.plugins.emperor.actions import (plot, biplot)
from qiime2.plugins.diversity.actions import (beta_phylogenetic, pcoa)

# generate distances
unweighted_unifrac_distance = beta_phylogenetic(rarefied_table, tree, 'unweighted_unifrac')
weighted_unifrac_distance = beta_phylogenetic(rarefied_table, tree, 'weighted_unifrac')
# perform PCoA
unweighted_unifrac_pcoa = pcoa(unweighted_unifrac_distance.distance_matrix)
weighted_unifrac_pcoa = pcoa(weighted_unifrac_distance.distance_matrix)
# use emperor to plot
unweighted_unifrac_pcoa_plot = plot(unweighted_unifrac_pcoa.pcoa, metadata)
weighted_unifrac_pcoa_plot = plot(weighted_unifrac_pcoa.pcoa, metadata)
# run RPCA and plot with emperor
rpca_biplot, rpca_distance = rpca(table)
rpca_biplot_emperor = biplot(rpca_biplot, metadata)
# make directory to store results (no error if it already exists)
output_path = 'IBD-2538/core-metric-output'
os.makedirs(output_path, exist_ok=True)
# now we can save the plots
unweighted_unifrac_pcoa_plot.visualization.save(os.path.join(output_path, 'unweighted-unifrac-distance-pcoa.qzv'))
weighted_unifrac_pcoa_plot.visualization.save(os.path.join(output_path, 'weighted-unifrac-distance-pcoa.qzv'))
rpca_biplot_emperor.visualization.save(os.path.join(output_path, 'RPCA-biplot.qzv'))
```

Now we can visualize the sample groupings by host subject ID and IBD with [Emperor](https://biocore.github.io/emperor/). For all three metrics, the samples clearly separate by host subject ID, which in some cases (e.g. UniFrac) can overwhelm the control (blue) vs. Crohn's disease (orange) sample groupings. Even where the IBD grouping is not completely lost (e.g. RPCA), we can still see confounding groupings by subject ID within the control (blue) group. In either case this can complicate the interpretation of these analyses.






![image.png](etc/subjectidsgroups.png)



This confounding effect can also be observed in statistics performed on pairwise $\beta$-diversity distances (e.g. PERMANOVA). For analyses of the distance matrices, [q2-longitudinal](https://msystems.asm.org/content/3/6/e00219-18) has many excellent methods that account for repeated-measures data. You can find the q2-longitudinal tutorial [here](https://docs.qiime2.org/2020.2/tutorials/longitudinal/).


# Compositional Tensor Factorization (CTF) Introduction

In order to account for the correlation of each subject with themselves over time, we will use compositional tensor factorization (CTF). CTF builds on the ability to account for compositionality and sparsity using the robust centered log-ratio transform covered in the RPCA tutorial (found [here](https://forum.qiime2.org/t/robust-aitchison-pca-beta-diversity-with-deicode)) but restructures and factors the data as a tensor. Here we will run CTF through _gemelli_ and explore and interpret the different results.
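To make the transform mentioned above more concrete, here is a minimal NumPy sketch of a robust centered log-ratio: only the observed (non-zero) counts are log-transformed and centered, and zeros are left as missing values instead of being replaced by a pseudocount. This is an illustration under simplifying assumptions (a samples x features layout and a hypothetical function name `rclr`), not the implementation used inside DEICODE or gemelli.

```python
import numpy as np


def rclr(counts):
    """Robust centered log-ratio of a samples x features count matrix.

    Zeros are treated as missing (NaN) rather than replaced by a pseudocount.
    """
    counts = np.asarray(counts, dtype=float)
    # log only the observed (non-zero) entries; zeros become NaN
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    # center each sample by the mean log of its observed entries
    return logged - np.nanmean(logged, axis=1, keepdims=True)


# Hypothetical toy counts: two samples (rows), three features (columns).
toy = np.array([[1.0, np.e ** 2, 0.0],
                [5.0, 5.0, 5.0]])
print(rclr(toy))
```

In the first toy sample the zero stays missing and the two observed values are centered around zero; the second sample, with identical counts, maps to all zeros.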


To run CTF we only need to run one command (gemelli ctf). The required inputs are:

1. table
    - The table is of type `FeatureTable[Frequency]` which is a table where the rows are features (e.g. ASVs/microbes), the columns are samples, and the entries are the number of sequences for each sample-feature pair.
2. sample-metadata
    - This is a QIIME2 formatted [metadata](https://docs.qiime2.org/2020.2/tutorials/metadata/) (e.g. tsv format) where the rows are samples matched to the (1) table and the columns are different sample data (e.g. time point).  
3. individual-id-column
    - This is the name of the column in the (2) metadata that indicates the individual subject/site that was sampled repeatedly.
4. state-column
    - This is the name of the column in the (2) metadata that indicates the numeric repeated measure (e.g. time in months/days) or a non-numeric categorical state (e.g. decade/body-site).
5. output-dir
    - The desired location of the output. We will cover each output independently below.  

There are also optional input parameters:

* ( _Optional_ ) feature-metadata-file
    - This is a metadata file (e.g. tsv) where the rows are matched to the table features and the columns are feature metadata such as taxonomy, gene pathway, etc.
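To illustrate how the individual-id-column and state-column are used together, the sketch below rearranges a small samples x features matrix into a subjects x states x features array, leaving unsampled subject/state combinations as missing. All inputs are hypothetical toy values and `build_tensor` is an illustrative helper; gemelli's internal tensor layout and its handling of missing time points may differ.

```python
import numpy as np

# Hypothetical toy inputs: one row of counts per sample, plus the subject
# and state (time point) each sample belongs to.
subjects = ["a", "a", "b", "b", "c"]   # plays the role of individual-id-column
states = [1, 2, 1, 2, 1]               # plays the role of state-column
counts = np.array([[10, 0],
                   [8, 2],
                   [1, 9],
                   [0, 7],
                   [5, 5]], dtype=float)


def build_tensor(counts, subjects, states):
    """Arrange a samples x features matrix as subjects x states x features."""
    subject_index = {s: i for i, s in enumerate(sorted(set(subjects)))}
    state_index = {t: i for i, t in enumerate(sorted(set(states)))}
    # start with all entries missing, then fill in the sampled combinations
    tensor = np.full((len(subject_index), len(state_index), counts.shape[1]),
                     np.nan)
    for row, subject, state in zip(counts, subjects, states):
        tensor[subject_index[subject], state_index[state]] = row
    return tensor, list(subject_index), list(state_index)


tensor, subject_order, state_order = build_tensor(counts, subjects, states)
print(tensor.shape)
```

Here subject `c` has no sample at the second state, so that slice of the array stays missing rather than being filled with zeros.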

In this tutorial our subject ID column is `host_subject_id` and our state-column is the set of time points denoted as `timepoint` in the sample metadata. Now we are ready to run CTF:


```python
from qiime2.plugins.gemelli.actions import ctf

# make a dir. for results (no error if it already exists)
os.makedirs('IBD-2538/ctf-results', exist_ok=True)
# run CTF
ctf_results = ctf(table, metadata,
                  'host_subject_id',
                  'timepoint',
                  feature_metadata=taxonomy.view(q2.Metadata))
ctf_results
```

```
Results (name = value)
--------------------------------------------------------------------------------------------------------------------
subject_biplot           = <artifact: PCoAResults % Properties('biplot') uuid: 20cc873b-a6d6-46f9-86e4-948a726eac40>
state_biplot             = <artifact: PCoAResults % Properties('biplot') uuid: 4fba65ac-b94c-4647-9e02-7fb45ec140ed>
distance_matrix          = <artifact: DistanceMatrix uuid: 087d4a46-72a6-4d70-b564-2480d2cd44d9>
state_subject_ordination = <artifact: SampleData[SampleTrajectory] uuid: 1251705a-5e2a-4c0a-a389-fa9c1b523fcd>
state_feature_ordination = <artifact: FeatureData[FeatureTrajectory] uuid: 9b65e359-2e0d-45bf-aa0f-51480ac8aea3>
```

We will now cover the output files, which are:
* subject_biplot
* state_biplot
* distance_matrix
* state_subject_ordination
* state_feature_ordination


First, we will explore the `state_subject_ordination`. The subject trajectory has PC axes like a conventional ordination (e.g. PCoA) but with time as the second axis. This can be visualized through the existing q2-longitudinal plugin.


```python
from qiime2.plugins.longitudinal.actions import (volatility, linear_mixed_effects)

# make a time series plot
temporal_plot = volatility(ctf_results.state_subject_ordination.view(q2.Metadata),
                           'timepoint',
                           individual_id_column='subject_id',
                           default_group_column='ibd',
                           default_metric='PC1')
temporal_plot.visualization.save('IBD-2538/ctf-results/state_subject_ordination.qzv')
```

The interpretation is also similar to a conventional ordination scatter plot: the larger the distance between subjects at each time point, the greater the difference in their microbial communities. Here we can see that CTF can effectively show a difference between controls and Crohn's subjects across time.

![image.png](etc/sample-visualization.png)


There is not a strong change over time in this example. However, we could explore the `distance_matrix` to test the differences by IBD by looking at pairwise distances with a mixed effects model. How to use and evaluate the q2-longitudinal commands is covered in depth in their tutorial [here](https://docs.qiime2.org/2020.2/tutorials/longitudinal/).

Now we will explore the `subject_biplot`, which is an ordination where dots represent _subjects_, not _samples_, and arrows represent features (e.g. ASVs). First, we will need to aggregate the metadata by subject. This can be done by hand or using DataFrames in Python (with pandas) or R like so:


```python
import pandas as pd
from qiime2 import Metadata

# first we import the metadata into pandas
mf = pd.read_csv('IBD-2538/data/metadata.tsv', sep='\t', index_col=0)
# next we aggregate by subject (i.e. 'host_subject_id')
# and keep the first instance of 'ibd' and 'active_disease' for each subject.
mf = mf.groupby('host_subject_id').agg({'ibd': 'first', 'active_disease': 'first'})
# now we save the metadata in QIIME2 format.
mf.index.name = '#SampleID'
mf.to_csv('IBD-2538/data/subject-metadata.tsv', sep='\t')
mf.head(5)
```


| #SampleID | ibd | active_disease |
|---|---|---|
| s1000100 | Control | quiescent |
| s1000200 | Control | quiescent |
| s1000300 | Control | quiescent |
| s1000500 | Control | quiescent |
| s1000600 | Control | quiescent |
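Keeping the first value per subject is only safe when a column is constant within each subject. The sketch below shows one way to check that with pandas; the helper name `columns_varying_within_subject` and the toy DataFrame are hypothetical and only illustrate the behaviour of the check.

```python
import pandas as pd


def columns_varying_within_subject(df, subject_column, columns):
    """Return {column: [subjects]} for columns with >1 distinct value per subject."""
    varying = {}
    for column in columns:
        n_values = df.groupby(subject_column)[column].nunique(dropna=False)
        offenders = n_values[n_values > 1].index.tolist()
        if offenders:
            varying[column] = offenders
    return varying


# Hypothetical toy metadata, only to show the behaviour of the check.
toy = pd.DataFrame({
    "host_subject_id": ["s1", "s1", "s2", "s2"],
    "ibd": ["Control", "Control", "Crohns", "Crohns"],
    "active_disease": ["quiescent", "quiescent", "quiescent", "active"],
})
print(columns_varying_within_subject(toy, "host_subject_id",
                                     ["ibd", "active_disease"]))
```

To run the same check on the tutorial data, pass the per-sample metadata (re-read from `IBD-2538/data/metadata.tsv`, since `mf` has been aggregated) with the same column names. An empty dictionary means the chosen columns are constant within every subject.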


With our `subject-metadata` table built, we are now ready to plot with Emperor.


```python
# plot subject biplot
subject_biplot_emperor = biplot(ctf_results.subject_biplot,
                                q2.Metadata(mf),
                                feature_metadata=taxonomy.view(q2.Metadata),
                                number_of_features=100)
# save visual
subject_biplot_emperor.visualization.save('IBD-2538/ctf-results/subject_biplot.qzv')
```

From this visualization we can see that the IBD type is clearly separated in two groupings.

![image.png](etc/per_subject_biplot.png)


We can also see that the IBD grouping is separated entirely over the first PC. We can now use [Qurro](https://github.com/biocore/qurro) to explore the feature loading partitions (arrows) in this biplot as a log-ratio of the original table counts. This allows us to relate these low-dimensional representations back to our original data. Additionally, log-ratios provide a useful set of data points for additional analyses such as LME models.

```python
from qiime2.plugins.qurro.actions import loading_plot

# run Qurro
qurro_plot = loading_plot(ctf_results.subject_biplot, table,
                          metadata,
                          feature_metadata=taxonomy.view(q2.Metadata))
# save visual
qurro_plot.visualization.save('IBD-2538/ctf-results/qurro.qzv')
```

```
176 sample(s) in the sample metadata file were not present in the BIOM table.
These sample(s) have been removed from the visualization.
```

From the Qurro output `qurro.qzv` we will simply choose the PC1 loadings above and below zero as the numerator (red ranks) and denominator (blue ranks), respectively. These could also be partitioned by taxonomy or sequence identifiers (see the Qurro tutorials [here](https://github.com/biocore/qurro#tutorials) for more information). We will also plot this log-ratio in Qurro with time on the x-axis and IBD as the color, which shows clear separation between phenotypes.

![image.png](etc/qurro-plot.png)
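As a rough illustration of what such a log-ratio represents, the sketch below sums the counts of the numerator features and of the denominator features in each sample and takes the natural log of their ratio. The function `log_ratio`, the toy table, and the choice to return NaN for samples with a zero sum are assumptions made for this example; Qurro's own computation and sample filtering may differ in detail.

```python
import numpy as np
import pandas as pd


def log_ratio(table, numerator, denominator):
    """Natural log of summed numerator counts over summed denominator counts.

    table: features x samples DataFrame of counts.
    Samples with a zero sum on either side get NaN (the ratio is undefined).
    """
    num = table.loc[numerator].sum(axis=0)
    den = table.loc[denominator].sum(axis=0)
    valid = (num > 0) & (den > 0)
    return np.log(num.where(valid) / den.where(valid))


# Hypothetical toy table: rows are features, columns are samples.
toy_table = pd.DataFrame(
    [[10, 0, 4], [10, 0, 0], [5, 6, 2]],
    index=["f1", "f2", "f3"],
    columns=["sampleA", "sampleB", "sampleC"],
)
ratios = log_ratio(toy_table, numerator=["f1", "f2"], denominator=["f3"])
print(ratios)
```

In this toy table `sampleB` has no counts for either numerator feature, so its log-ratio is undefined and reported as missing.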

We can further explore these phenotype differences by exporting the `sample_plot_data.tsv` from Qurro (marked in an orange box above). We will then merge this `sample_plot_data` with our sample metadata in Python or R.

**Note:** Qurro will have an option to export all of the metadata or only the log-ratio soon.


```python
import pandas as pd

# import the sample metadata
metadata_one = pd.read_csv('IBD-2538/data/metadata.tsv',
                           sep='\t', index_col=0)
# import the log-ratio data exported from Qurro
metadata_two = pd.read_csv('IBD-2538/ctf-results/sample_plot_data.tsv',
                           sep='\t', index_col=0)[['Current_Natural_Log_Ratio']]
# merge the data on the sample IDs
log_ratio_metdata = pd.concat([metadata_two, metadata_one], axis=1)
# drop samples that have no log-ratio value
log_ratio_metdata = log_ratio_metdata.dropna(subset=['Current_Natural_Log_Ratio'])
log_ratio_metdata.index = log_ratio_metdata.index.astype(str)
# keep only the columns needed for the downstream analysis
log_ratio_metdata = log_ratio_metdata[['timepoint', 'host_subject_id',
                                       'ibd', 'Current_Natural_Log_Ratio']]
# export in QIIME2 format
log_ratio_metdata.index.name = '#SampleID'
log_ratio_metdata.to_csv('IBD-2538/ctf-results/merged_sample_plot_data.tsv', sep='\t')
log_ratio_metdata.head(2)
```


| #SampleID | timepoint | host_subject_id | ibd | Current_Natural_Log_Ratio |
|---|---|---|---|---|
| 2538.1000102 | 25 | s1000100 | Control | 7.34901 |
| 2538.1000104000004 | 6 | s1000100 | Control | 6.492556 |


As you can see above the metadata now has the added column of `Current_Natural_Log_Ratio` from Qurro. So now we will continue to explore this log-ratio by first plotting it explicitly over time with q2-longitudinal.


```python
# make a time series plot of log-ratio
temporal_plot = volatility(q2.Metadata(log_ratio_metdata),
                           'timepoint',
                           individual_id_column='host_subject_id',
                           default_group_column='ibd',
                           default_metric='Current_Natural_Log_Ratio')
temporal_plot.visualization.save('IBD-2538/ctf-results/log_ratio_plot.qzv')
```

This clearly shows that we are recreating the separation by IBD that we saw in both the `subject_biplot` and the `state_subject_ordination`.

![image.png](etc/log-ratio-visualization.png)

We can now test this difference by running a linear mixed effects (LME) model.

```python
# Run LME model on log-ratio
lme_plot = linear_mixed_effects(q2.Metadata(log_ratio_metdata),
                                'timepoint',
                                individual_id_column='host_subject_id',
                                group_columns='ibd',
                                metric='Current_Natural_Log_Ratio')
lme_plot.visualization.save('IBD-2538/ctf-results/lme_log_ratio.qzv')
```

From this LME model we can see that the IBD grouping is indeed significant across time.

![image.png](etc/lme-logratio.png)
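For readers less familiar with LME models, one common way to write such a model is given below. This is a sketch that assumes a random intercept per subject and a time-by-group interaction; the exact formula fitted by q2-longitudinal depends on its options and should be checked against its documentation. Here $y_{ij}$ is the log-ratio of subject $i$ at observation $j$, $t_{ij}$ is the time point, and $g_i$ is an indicator for the IBD group of subject $i$.

$$
y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 g_i + \beta_3 \, t_{ij} g_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

Under these assumptions, $\beta_2$ captures the overall difference between groups, $\beta_3$ captures whether the groups change differently over time, and $u_i$ absorbs the subject-to-subject variation that repeated sampling introduces.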

