Translate the microbiome-metabolism correlations from mice to humans.
Mice dataset: OSA (haddad_osa)
Human dataset: iHMP (IBDMDB 2019)

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [6]:
mice_significant_corr = pd.read_pickle('mice/haddad_osa/analysis/metabolome_bacteria_significant_corr_kendall-tau.pkl')
mice_corr_coefficient= pd.read_pickle('mice/haddad_osa/analysis/metabolome_bacteria_corr_coefficient_kendall-tau.pkl')
mice_p_values_fdr_correction = pd.read_pickle('mice/haddad_osa/analysis/metabolome_bacteria_p_values_fdr_correction_kendall-tau.pkl')

human_significant_corr = pd.read_pickle('human/iHMP_IBDMDB_2019/metabolome_bacteria_significant_corr_kendall-tau.pkl')
human_corr_coefficient = pd.read_pickle('human/iHMP_IBDMDB_2019/metabolome_bacteria_corr_coefficient_kendall-tau.pkl')
human_p_values_fdr_correction = pd.read_pickle('human/iHMP_IBDMDB_2019/metabolome_bacteria_p_values_fdr_correction_kendall-tau.pkl')

In [12]:
# Keep only the genus label for simplicity: 

taxa_rename_to_genus = pd.Series(human_corr_coefficient.index.get_level_values(1), index=human_corr_coefficient.index.get_level_values(1)).str.extract('.*g__(.*)').squeeze()
human_significant_corr = human_significant_corr.rename(taxa_rename_to_genus.to_dict(), axis=0, level=1)
human_corr_coefficient = human_corr_coefficient.rename(taxa_rename_to_genus.to_dict(), axis=0, level=1)
human_p_values_fdr_correction = human_p_values_fdr_correction.rename(taxa_rename_to_genus.to_dict(), axis=0, level=1)
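
To make the genus extraction above easier to follow, here is a minimal sketch on made-up lineage strings. The strings, the `|` separator and the variable names are assumptions for illustration and are not taken from the iHMP tables; only the regular expression is the one used in the cell above.

```python
import pandas as pd

# Hypothetical lineage strings, for illustration only (not from the iHMP data).
lineages = pd.Series([
    "k__Bacteria|p__Firmicutes|c__Clostridia|f__Lachnospiraceae|g__Roseburia",
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|f__Bacteroidaceae|g__Bacteroides",
    "k__Bacteria|p__Firmicutes|c__Clostridia|f__Lachnospiraceae",  # no genus level
])

# Same pattern as in the notebook: keep everything after the last "g__".
# expand=False returns a Series rather than a one-column DataFrame.
genus = lineages.str.extract(r".*g__(.*)", expand=False)

# Lineages without a "g__" field do not match and come back as NaN.
print(genus.tolist())
```

Note that a lineage without a genus field yields NaN, so passing such a mapping to `rename` would likely replace that index label with NaN. It may be worth checking for this before relying on the genus-level index.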


Analysis 

How many genera are shared between mice and humans?
(Note: when calculating the discrete matrix, I should also present the baseline, i.e. how many genera are shared without any translation.)

In [21]:
human_genus = human_corr_coefficient.index.get_level_values(1).unique()
mice_genus = mice_corr_coefficient.index.get_level_values(1).unique()

In [30]:
print(f"There are {len(mice_genus.intersection(human_genus))} genus shared in the mice and human data. \n"
      f"Out of {len(mice_genus)} genus in mice and {len(human_genus)} genus in human. \n "
      f"AKA {round(len(mice_genus.intersection(human_genus)) / len(mice_genus) * 100, 2)} % of the genus in mice are shared with humans in our ds.")

There are 16 genus shared in the mice and human data. 
Out of 39 genus in mice and 107 genus in human. 
 AKA 41.03 % of the genus in mice are shared with humans in our ds.


Taxonomic distance mapping based on MGBC closest taxa table 

In [35]:
mgbc_clostest_taxa_table_path = '/home/noa/lab_code/MGBC-Toolkit/data/closest_tax.tsv'
mgbc_clostest_taxa_table = pd.read_csv(mgbc_clostest_taxa_table_path, sep='\t', header=None, names = ['method', 'reference_genome', 'query_genome', 'distance', 'reference_taxonomy', 'query_taxonomy'])

In [36]:
mgbc_clostest_taxa_table.head()

Unnamed: 0,method,reference_genome,query_genome,distance,reference_taxonomy,query_taxonomy
0,taxonomy,GUT_GENOME000010,MGBC000100,0.020884,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__O...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__O...
1,taxonomy,GUT_GENOME000035,MGBC000465,0.009662,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...
2,taxonomy,GUT_GENOME000049,MGBC109121,0.200162,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
3,taxonomy,GUT_GENOME000057,MGBC115383,0.008205,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...
4,taxonomy,GUT_GENOME000064,MGBC161554,0.625505,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...,d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactob...


The direction of translation matters. Here we translate from mice to humans.
Therefore, we require reference_genome to be a mouse genome from MGBC (IDs with the MGBC prefix).
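
A minimal sketch of this directional filter on toy rows. The genome IDs and the second method value are made up, and `GUT_GENOME` IDs are assumed to be the human side. It uses boolean masks with `str.startswith` as one way to express "MGBC prefix", which is slightly stricter than a substring match.

```python
import pandas as pd

# Toy rows shaped like the closest-taxa table; all IDs are hypothetical.
toy = pd.DataFrame({
    "method": ["taxonomy", "taxonomy", "other_method"],  # "other_method" is a placeholder
    "reference_genome": ["MGBC000001", "GUT_GENOME000001", "MGBC000002"],
    "query_genome": ["GUT_GENOME000002", "MGBC000003", "GUT_GENOME000003"],
})

is_taxonomy = toy["method"] == "taxonomy"
# Mouse-to-human direction: the reference genome ID must start with "MGBC".
is_mouse_reference = toy["reference_genome"].str.startswith("MGBC")

# .copy() gives an independent DataFrame that is safe to add columns to later.
mouse_to_human = toy[is_taxonomy & is_mouse_reference].copy()
print(mouse_to_human)
```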

In [43]:
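# .copy() makes taxonomic_table an independent DataFrame, so adding columns to it later does not raise SettingWithCopyWarning
taxonomic_table = mgbc_clostest_taxa_table.query('method == "taxonomy" & reference_genome.str.contains("MGBC")').copy()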

In [49]:
# extract reference genus
taxonomic_table['reference_genus'] = taxonomic_table.reference_taxonomy.str.extract('(.*;g__[^;]+).*')



In [53]:
taxonomic_table['query_genus'] = taxonomic_table.query_taxonomy.str.extract('(.*;g__[^;]+).*')




In [60]:
# Sanity check: by taxonomic distance, we expect each mouse (reference) genus to map to a single human (query) genus
# some_genus = taxonomic_table['reference_genus'].iloc[0]
# taxonomic_table.query('reference_genus == @some_genus').query_genus.unique()
number_of_genus_mapped_per_reference_genus = taxonomic_table.groupby('reference_genus').apply(lambda x: len(x.query_genus.unique()))

In [63]:
number_of_genus_mapped_per_reference_genus.describe()

count    225.000000
mean       1.008889
std        0.094070
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        2.000000
dtype: float64

In [64]:
np.count_nonzero(number_of_genus_mapped_per_reference_genus == 2)

2

In [72]:
outlier_genus = number_of_genus_mapped_per_reference_genus[number_of_genus_mapped_per_reference_genus == 2].index
taxonomic_table.query('reference_genus in @outlier_genus[0]')[['reference_taxonomy', 'query_taxonomy', 'query_genus']]
# One of the species was not labeled to genus level, only to family level, which explains the difference.
# (The two mappings are actually the same and can be ignored; we just need to verify that we take the genus mapping and ignore the NaNs.)

Unnamed: 0,reference_taxonomy,query_taxonomy,query_genus
45836,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,
45846,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...
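
The effect of a missing genus label on the count can be reproduced on toy data. This is only a sketch with made-up genus names: `len(x.unique())` counts NaN as a distinct value, whereas `nunique()` drops NaN by default.

```python
import numpy as np
import pandas as pd

# Toy table: reference genus g__A has one match that lacks a genus label.
toy = pd.DataFrame({
    "reference_genus": ["g__A", "g__A", "g__B", "g__B"],
    "query_genus": ["g__A", np.nan, "g__B", "g__B"],
})

grouped = toy.groupby("reference_genus")["query_genus"]

# Counts NaN as its own value, as in the sanity check above.
with_nan = grouped.apply(lambda s: len(s.unique()))

# Ignores NaN (dropna=True is the default for nunique).
without_nan = grouped.nunique()

print(pd.DataFrame({"with_nan": with_nan, "without_nan": without_nan}))
```

With the NaN-ignoring count, a genus that differs only by a missing label no longer shows up as an outlier.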


In [75]:
taxonomic_table.query('reference_genus in @outlier_genus[1]')[['reference_taxonomy', 'query_taxonomy']]


Unnamed: 0,reference_taxonomy,query_taxonomy
45343,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45405,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45611,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45654,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45700,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45795,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45867,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45869,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45876,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45881,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...


In [74]:
taxonomic_table.query('reference_genus in @outlier_genus[1]')

Unnamed: 0,method,reference_genome,query_genome,distance,reference_taxonomy,query_taxonomy,reference_genus,query_genus
45343,taxonomy,MGBC104730,GUT_GENOME009178,0.394019,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45405,taxonomy,MGBC107002,GUT_GENOME009178,0.397136,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45611,taxonomy,MGBC117467,GUT_GENOME009178,0.377058,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45654,taxonomy,MGBC119940,GUT_GENOME009178,0.394995,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45700,taxonomy,MGBC123503,GUT_GENOME009178,0.39405,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45795,taxonomy,MGBC130483,GUT_GENOME009178,0.398698,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45867,taxonomy,MGBC139490,GUT_GENOME009178,0.362799,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45869,taxonomy,MGBC139512,GUT_GENOME009178,0.351621,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45876,taxonomy,MGBC139895,GUT_GENOME009178,0.397381,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...
45881,taxonomy,MGBC139981,GUT_GENOME009178,0.395103,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__L...


There are 225 mouse genera in the MGBC table.
We expected each genus to map to a single genus by taxonomic distance. This holds for most of them (223 genera).
The remaining 2 genera each map to 2 different values by taxonomic distance (for the first one, the second value is a missing genus label rather than a different genus).
Those genera are:
'd__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Paraprevotella'
'd__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Eubacterium_J'
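
One possible way to turn such a table into a one-to-one mouse-to-human genus mapping is sketched below on toy data. The tie-breaking rule (drop unlabeled matches, then keep the match with the smallest distance per reference genus) is an assumption for illustration, not a rule stated in this notebook; a majority vote would be another option.

```python
import numpy as np
import pandas as pd

# Toy table with the same column names as taxonomic_table; values are made up.
toy = pd.DataFrame({
    "reference_genus": ["g__A", "g__A", "g__B", "g__B", "g__B"],
    "query_genus": ["g__A", np.nan, "g__B", "g__C", "g__B"],
    "distance": [0.10, 0.05, 0.40, 0.35, 0.45],
})

# Step 1: ignore matches that have no genus-level label.
labelled = toy.dropna(subset=["query_genus"])

# Step 2 (assumed rule): per reference genus, keep the closest labelled match.
closest_idx = labelled.groupby("reference_genus")["distance"].idxmin()
closest_rows = labelled.loc[closest_idx]

# Series mapping each mouse genus to a single human genus.
mice_to_human_genus = closest_rows.set_index("reference_genus")["query_genus"]
print(mice_to_human_genus.to_dict())
```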
 
