# Data and Study Design

In our experiment we had two cohorts of mice. **Mice 2-5** received a fecal transplant from a **Healthy** human donor. **Mice 6-10** received am FMT from a human donor with **Ulcerative Colitis (UC)**. We refere to these mouse cohorts as the Healthy and UC cohorts.  

An overview of the experiment is provided in the following figure
![](figure/study_design.png)

In this notebook we will explore the two sources of data: qPCR for total bacterial load and reads from Amplicon sequencing. We will also go over the timing of fecal sample collection and perturbations as well as provide taxonomies for the associated amplicon sequences that were identified in the samples.

---

In [12]:
import pandas as pd
lines_to_show=12


## Metadata

The following table is a metadata table.  For every sampleID it has the corresponding subject (mouse) and timepoint (in days) of fecal sample collection

In [3]:
metadata = pd.read_csv('data/metadata.tsv', sep='\t')
metadata.head(lines_to_show)

Unnamed: 0,sampleID,subject,time
0,10-D0AM,10,0.0
1,10-D10,10,10.0
2,10-D11,10,11.0
3,10-D14,10,14.0
4,10-D16,10,16.0
5,10-D18,10,18.0
6,10-D1AM,10,1.0
7,10-D1PM,10,1.5
8,10-D21AM,10,21.0
9,10-D21PM,10,21.5


## qPCR


The following table is a qPCR table. This has the replicate measurements (3) of the qPCR for every sample (already normalized by the mass of the fecal pellet) in the units of CFU/g.

CFU/g: Colony Forming Units per gram of feces

In [4]:
qpcr = pd.read_csv('data/qpcr.tsv', sep='\t')
qpcr.head(lines_to_show)

Unnamed: 0,sampleID,measurement1,measurement2,measurement3
0,10-D0AM,31244460.0,66240730.0,21468920.0
1,10-D10,124971400000.0,147657100000.0,58661290000.0
2,10-D11,143043900000.0,215455400000.0,293199700000.0
3,10-D14,72746840000.0,195094300000.0,69494830000.0
4,10-D16,93644840000.0,116778800000.0,110718400000.0
5,10-D18,57604660000.0,79849200000.0,103215100000.0
6,10-D1AM,8787440000.0,15034460000.0,10812380000.0
7,10-D1PM,49758140000.0,109323700000.0,112121400000.0
8,10-D21AM,130503000000.0,230374900000.0,168095800000.0
9,10-D21PM,187967500000.0,195384900000.0,302999300000.0


## Sequencing counts table - rows: ASVs, columns: samples

The following table is a read counts table. This shows the number of reads associated with each ASV for each sample ID.

In [13]:
counts = pd.read_csv('data/counts.tsv', sep='\t')
counts.head(lines_to_show)

Unnamed: 0.1,Unnamed: 0,10-D0AM,10-D10,10-D11,10-D14,10-D16,10-D18,10-D1AM,10-D1PM,10-D21AM,...,9-D60AM,9-D60PM,9-D61,9-D62,9-D63,9-D64AM,9-D64PM,9-D7,9-D8,9-D9
0,ASV_1,22,10503,21726,25990,21572,21352,17823,16916,26669,...,23613,20729,37601,38306,21944,12971,24560,12622,10323,10960
1,ASV_2,21,9319,17515,20188,14972,14868,6627,7697,20212,...,10875,11709,6582,7369,4935,7152,15477,7814,9361,8009
2,ASV_3,3,9380,17748,20899,15247,17260,45,1078,21997,...,21,6,120,21,502,1034,5944,19830,15778,22068
3,ASV_4,0,0,7,24,5,20,0,0,0,...,22,19,25,23,24,20,45,8,0,0
4,ASV_5,0,0,0,0,0,0,0,0,0,...,15,20,0,0,0,0,0,0,0,0
5,ASV_6,0,3066,3391,3617,3919,3874,0,26,4377,...,3496,3423,1102,5511,5398,3471,9295,1722,2431,2276
6,ASV_7,0,1612,2871,3172,2444,2769,922,1222,3760,...,1903,2187,1153,1280,846,1357,2866,1314,1482,1336
7,ASV_8,14,404,842,1022,469,346,380,453,273,...,56,55,16,17,9,27,89,272,532,548
8,ASV_9,0,2125,3261,3806,2642,2598,10071,6941,1678,...,1821,2001,639,479,654,1666,3398,2390,3350,2630
9,ASV_10,3,852,1401,1768,1589,1835,2723,1757,2737,...,224,676,291,143,79,187,469,1214,1928,1682


Recall that read counts are not quantitative (but are instead relative to the read depth). The total read depth for each sample can be computed as

In [14]:
read_depth=counts.sum()[1:-1] # first column was just the ASV number hence indexing from 1 to the end, removing the 0 index
read_depth.head(12)

10-D0AM         84
10-D10       46512
10-D11       82066
10-D14       96268
10-D16       78009
10-D18       82540
10-D1AM      49805
10-D1PM      48392
10-D21AM    108082
10-D21PM     70385
10-D22AM     59366
10-D22PM     73812
dtype: object

The read depth for each sample changes, but the read depth does not correspond to changes in total bacterial load. That is why we also have a qPCR measurement for estimating total bacterial load in each sample.

## ASV number, sequences and taxonomy

The taxonomy table relates the ASV number to the actual amplicon sequence associated with the ASV number. In addition you have the taxonomic information down to the species level (if the classification is known). When multiple species are provided it means that those species all have the same ASV (and this specific region of the 16S rRNA gene is not enough to destinguish them) and if `NaN` then that ASV did not have an associated species in the training set or more than 5 species were returned.

In [15]:
asv_and_taxonomy = pd.read_csv('data/asv_and_taxonomy.tsv', sep='\t')
asv_and_taxonomy.head(lines_to_show)

Unnamed: 0,name,sequence,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,ASV_1,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Phocaeicola,
1,ASV_2,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,fragilis/ovatus
2,ASV_3,TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTG...,Bacteria,Verrucomicrobia,Verrucomicrobiae,Verrucomicrobiales,Akkermansiaceae,Akkermansia,muciniphila
3,ASV_4,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,cellulosilyticus/intestinalis/timonensis
4,ASV_5,TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTG...,Bacteria,Verrucomicrobia,Verrucomicrobiae,Verrucomicrobiales,Akkermansiaceae,Akkermansia,muciniphila
5,ASV_6,TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGGGTG...,Bacteria,Proteobacteria,Betaproteobacteria,Burkholderiales,Sutterellaceae,Parasutterella,excrementihominis
6,ASV_7,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,fragilis/koreensis/kribbi/ovatus
7,ASV_8,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,ovatus
8,ASV_9,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,dorei/fragilis
9,ASV_10,TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,caccae


## Perturbation type with start and stop time (days) for each subject

Recall that there were three perturbations applied, High Fat Diet, Vancomycin, and Gentamicin. This table has the start and stop time (in days) for the perturbations

In [16]:
perturbations = pd.read_csv('data/perturbations.tsv', sep='\t')
perturbations.head(12)

Unnamed: 0,name,start,end,subject
0,High Fat Diet,21.5,28.5,2
1,High Fat Diet,21.5,28.5,3
2,High Fat Diet,21.5,28.5,4
3,High Fat Diet,21.5,28.5,5
4,High Fat Diet,21.5,28.5,6
5,High Fat Diet,21.5,28.5,7
6,High Fat Diet,21.5,28.5,8
7,High Fat Diet,21.5,28.5,9
8,High Fat Diet,21.5,28.5,10
9,Vancomycin,35.5,42.5,2
