# Getting the core microbiomes

Using the definition from [Wallace et al. (2018)](https://apsjournals.apsnet.org/doi/full/10.1094/PBIOMES-02-18-0008-R), the core microbiome comprises the OTUs present in at least 80% of samples.

Here, I (RACS) apply the same definition to obtain the core microbiome for:

 * The OTU table from Wallace et al. (2018)
 * The feature table with ASVs generated with dada2 (minimum quality of 20 at both ends of the forward reads)
 * The feature table with ASVs generated with dada2 in MaLiAmPi (default parameters), which was used to generate phylotypes in the first analyses with this Python library (phylotypes)
 * The feature table with phylotypes generated after running MaLiAmPi with default parameters
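
As a minimal sketch of the 80% prevalence definition, the example below computes the fraction of samples in which each taxon is present and keeps those at or above 0.8. The table, the taxon names (`otu_a`, `otu_b`, `otu_c`) and the sample names are hypothetical toy values, not data from this project.

```python
import pandas as pd

# Toy count table (hypothetical values): rows are taxa, columns are samples.
counts = pd.DataFrame(
    [[5, 3, 8, 1, 9], [2, 0, 1, 4, 7], [0, 0, 6, 0, 2]],
    index=["otu_a", "otu_b", "otu_c"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)

# Prevalence: fraction of samples in which each taxon has a nonzero count.
prevalence = (counts > 0).mean(axis=1)

# Core taxa: present in at least 80% of samples.
core_ids = prevalence[prevalence >= 0.8].index.tolist()
print(prevalence.to_dict())
print(core_ids)
```

Here `otu_b` is present in 4 of 5 samples, so it sits exactly on the 80% boundary and is kept because the comparison is inclusive.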

## OTU table

The OTU count tables were generated with the biom software, included in qiime2, as described [here](https://gitlab.com/SantosRAC/conekt-grasses-microbiome/-/wikis/Running-correlations-for-OTUs-in-Dr.-Wallace's-manuscript-vs-TPM).

In [1]:
import pandas as pd

out_table_taxonomy_df = pd.read_table('/media/rsantos/4TB_drive/Projects/UGA_RACS/16S/Workflow/2_QiimeOtus/2f_otu_table.sample_filtered.no_mitochondria_chloroplast_taxonomy.tsv',
                                      comment='#')

out_table_taxonomy_df.head()


Unnamed: 0,OTU ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAD.8.14A0247,LMAN.8.14A0159,LMAD.8.14A0051,LMAD.26.14A0381,LMAD.26.14A0533,LMAD.8.14A0281,LMAD.8.14A0295,...,LMAN.8.14A0011,LMAD.26.14A0137,LMAN.26.14A0327,LMAN.8.14A0205,LMAD.8.14A0265,LMAD.26.14A0155,LMAD.26.14A0167,LMAD.26.14A0481,LMAN.26.14A0329,taxonomy
0,4479944,1.0,2.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Actinobacteria; c__MB-A2-108; ...
1,995900,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Proteobacteria; c__Gammaproteo...
2,1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Bacteroidetes; c__Cytophagia; ...
3,541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Tenericutes; c__Mollicutes; o_...
4,533625,1.0,36.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Proteobacteria; c__Alphaproteo...


In [2]:
out_table_taxonomy_df.set_index('OTU ID', inplace=True)
out_table_taxonomy_df.head()

OTU ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAD.8.14A0247,LMAN.8.14A0159,LMAD.8.14A0051,LMAD.26.14A0381,LMAD.26.14A0533,LMAD.8.14A0281,LMAD.8.14A0295,LMAN.26.14A0319,...,LMAN.8.14A0011,LMAD.26.14A0137,LMAN.26.14A0327,LMAN.8.14A0205,LMAD.8.14A0265,LMAD.26.14A0155,LMAD.26.14A0167,LMAD.26.14A0481,LMAN.26.14A0329,taxonomy
4479944,1.0,2.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Actinobacteria; c__MB-A2-108; ...
995900,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,8.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Proteobacteria; c__Gammaproteo...
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Bacteroidetes; c__Cytophagia; ...
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Tenericutes; c__Mollicutes; o_...
533625,1.0,36.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,k__Bacteria; p__Proteobacteria; c__Alphaproteo...


In [3]:
value = out_table_taxonomy_df.loc[4450360, 'taxonomy']
print(value)

k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__


In [4]:
out_table_df = out_table_taxonomy_df.drop('taxonomy', axis=1)
out_table_df.head()

OTU ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAD.8.14A0247,LMAN.8.14A0159,LMAD.8.14A0051,LMAD.26.14A0381,LMAD.26.14A0533,LMAD.8.14A0281,LMAD.8.14A0295,LMAN.26.14A0319,...,LMAN.26.14A0303,LMAN.8.14A0011,LMAD.26.14A0137,LMAN.26.14A0327,LMAN.8.14A0205,LMAD.8.14A0265,LMAD.26.14A0155,LMAD.26.14A0167,LMAD.26.14A0481,LMAN.26.14A0329
4479944,1.0,2.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,8.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,1.0,36.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
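def core_filter(df, threshold=0.5):
    # Keep rows (OTUs) with nonzero counts in more than `threshold` of the columns (samples)
    threshold_int = int(df.shape[1] * threshold)
    print(f'Threshold: {threshold_int} (threshold * number of columns)')
    # Number of samples in which each OTU has a nonzero count
    nonzero_counts = (df != 0).sum(axis=1)
    # Note: the strict '>' requires at least threshold_int + 1 samples;
    # use '>=' to also keep OTUs present in exactly threshold_int samples
    return df[nonzero_counts > threshold_int]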
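
The `core_filter` cell above compares with a strict `>`, so an OTU present in exactly 80% of the samples is dropped. A possible variant that follows the "at least 80%" wording more literally is sketched below; the name `core_filter_at_least` and the toy table are hypothetical and are not part of the analysis in this notebook.

```python
import math
import pandas as pd

def core_filter_at_least(df, threshold=0.8):
    """Keep rows with nonzero counts in at least `threshold` of the columns."""
    # Smallest whole number of samples that satisfies the fraction.
    # Rounding first guards against floating-point noise such as 12.000000000000002.
    min_samples = math.ceil(round(df.shape[1] * threshold, 9))
    nonzero_counts = (df != 0).sum(axis=1)
    return df[nonzero_counts >= min_samples]

# Toy table (hypothetical values): otu_b is present in exactly 4 of 5 samples.
toy = pd.DataFrame(
    [[5, 3, 8, 1, 9], [2, 0, 1, 4, 7], [0, 0, 6, 0, 2]],
    index=["otu_a", "otu_b", "otu_c"],
    columns=["s1", "s2", "s3", "s4", "s5"],
)
core = core_filter_at_least(toy, 0.8)
print(core.index.tolist())
```

With the inclusive comparison, `otu_b` is retained. Whether this changes the number of core OTUs in the real table depends on how many OTUs sit exactly at the boundary.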

In [6]:
out_table_filtered_core_df = core_filter(out_table_df, 0.8)
out_table_filtered_core_df.head()

Threshold: 432 (threshold * number of columns)


OTU ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAD.8.14A0247,LMAN.8.14A0159,LMAD.8.14A0051,LMAD.26.14A0381,LMAD.26.14A0533,LMAD.8.14A0281,LMAD.8.14A0295,LMAN.26.14A0319,...,LMAN.26.14A0303,LMAN.8.14A0011,LMAD.26.14A0137,LMAN.26.14A0327,LMAN.8.14A0205,LMAD.8.14A0265,LMAD.26.14A0155,LMAD.26.14A0167,LMAD.26.14A0481,LMAN.26.14A0329
734945,143.0,490.0,629.0,2538.0,110.0,25.0,44.0,10.0,236.0,2614.0,...,10.0,4.0,498.0,33.0,443.0,42.0,49.0,62.0,19.0,3.0
1074373,295.0,34.0,29.0,6.0,198.0,9.0,13.0,549.0,326.0,31.0,...,1.0,2.0,11.0,0.0,0.0,34.0,4.0,9.0,15.0,0.0
4418394,2129.0,1559.0,3.0,86.0,1695.0,25.0,3.0,260.0,109.0,213.0,...,311.0,13.0,41.0,2.0,2.0,17.0,78.0,28.0,11.0,4.0
561885,1380.0,3455.0,592.0,452.0,8201.0,25.0,81.0,3899.0,8683.0,9281.0,...,239.0,7.0,32.0,2.0,35.0,3.0,4.0,1.0,4.0,0.0
487725,13.0,18.0,3.0,16.0,1.0,2.0,1.0,0.0,2.0,1737.0,...,1.0,2.0,4.0,0.0,6.0,0.0,1.0,1.0,1.0,0.0


In [7]:
out_table_filtered_core_df.shape

(80, 540)

In [63]:
row_sums = out_table_filtered_core_df.sum(axis=1)
row_sums.sort_values(ascending=False, inplace=True)
row_sums.head(n=30)

OTU ID
4450360    6139303.0
58865      5377591.0
334409     4781476.0
895733     2320838.0
575956     2104206.0
714278     1845464.0
1080436    1468586.0
941237     1035760.0
1111874    1030262.0
607212      870107.0
1091060     845626.0
1045797     814473.0
1061429     799176.0
4104869     688481.0
561885      577457.0
734945      521941.0
4317400     497141.0
922761      459211.0
907206      403764.0
186518      402674.0
574655      349397.0
1760354     317567.0
572586      275794.0
591699      249537.0
966903      228693.0
406145      178239.0
868791      176688.0
1108960     171986.0
899931      152867.0
861160      149002.0
dtype: float64

In [62]:
# Temporarily widen the column display so the full taxonomy strings are printed
with pd.option_context('display.max_colwidth', 1000):
    print(out_table_taxonomy_df.loc[row_sums.head(n=30).index, 'taxonomy'])

OTU ID
4450360               k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__
58865                 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__
334409                k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__
895733     k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Methylobacteriaceae; g__Methylobacterium; s__adhaesivum
575956                               k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Methylobacteriaceae; g__; s__
714278      k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__yabuuchiae
1080436                                 k__Bacteria; p__Bacteroidetes; c__Cytophagia; o__Cytophagales; f__Cytophagaceae; g__Hymenobacter; s__
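
The taxonomy strings printed above follow a `k__; p__; c__; ...` layout, which can be split into one column per rank to make it easier to summarise the core OTUs by genus or family. The sketch below is one way to do this; the helper `split_taxonomy` and the `RANKS` list are hypothetical names, and it assumes every string has at most seven semicolon-separated levels.

```python
import pandas as pd

# Assumed rank order for the seven prefixes k__, p__, c__, o__, f__, g__, s__.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def split_taxonomy(taxonomy):
    """Split 'k__...; p__...' strings into one column per rank."""
    parts = taxonomy.str.split(";", expand=True)
    # Trim whitespace and drop the one-letter rank prefix (e.g. 'g__').
    parts = parts.apply(
        lambda col: col.str.strip().str.replace(r"^[a-z]__", "", regex=True)
    )
    parts.columns = RANKS[: parts.shape[1]]
    # Unassigned ranks (e.g. a bare 's__') become missing values.
    return parts.mask(parts == "")

# Two taxonomy strings copied from the output above.
tax = pd.Series({
    4450360: "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Sphingomonadales; f__Sphingomonadaceae; g__Sphingomonas; s__",
    1080436: "k__Bacteria; p__Bacteroidetes; c__Cytophagia; o__Cytophagales; f__Cytophagaceae; g__Hymenobacter; s__",
})
ranks = split_taxonomy(tax)
print(ranks[["phylum", "genus", "species"]])
```

In the notebook, the same helper could be applied to `out_table_taxonomy_df['taxonomy']` restricted to the core OTU index, followed by `value_counts()` on the rank of interest.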

In [34]:
pd.get_option('display.max_colwidth')

## Feature table with ASVs from dada2 (qual 20, forward reads)

## Feature table with ASVs from dada2 (MaLiAmPi)

## Feature table with phylotypes generated from MaLiAmPi with default params