Predict metabolite levels from the microbial community composition in different hosts (human and mice),
and compare the predictive performance with Efrat's results on "universally well-predicted" metabolites.

The list of universally well-predicted metabolites from Efrat's paper:


In [1]:
import pandas as pd 

universal_robust_well_predicted = pd.read_excel("supplementary_robustness and universality of gut microbiome-metabolome associations.xlsx", sheet_name='Table S6', skiprows=9)
universal_robust_well_predicted.head()

Unnamed: 0,HMDB ID,Compound Name,Number of datasets included in model,REM Mean Spearman Rho [95% confidence interval],REM 95% Prediction Interval,REM P Value,REM Rho - FDR,REM Q Statistic (heterogeneity),REM Q P Value (heterogeneity),REM I^2 Statistic (%) [95% confidence interval] (heterogeneity),Robust (1),Robust using the SJ estimator (2),Robust using the DL estimator and HAKN adjustment (2),Robust using the SJ estimator and HAKN adjustment (2)
0,HMDB0000784,Azelaic acid,6,"0.596 [0.51, 0.67]","[0.371, 1.003]",2.4627140000000003e-27,6.72321e-25,8.053692,0.153301,"37.9 [0, 75.3]",True,True,True,True
1,HMDB0002064,N-Acetylputrescine,6,"0.534 [0.445, 0.612]","[0.302, 0.889]",2.5740300000000002e-23,3.51355e-21,7.844834,0.164997,"36.3 [0, 74.6]",True,True,True,True
2,HMDB0004160,Urobilin,5,"0.652 [0.548, 0.736]","[0.291, 1.266]",8.015751e-21,7.294332999999999e-19,7.864014,0.096688,"49.1 [0, 81.4]",True,True,True,True
3,HMDB0012252,Linoleoyl ethanolamide,4,"0.493 [0.395, 0.58]","[0.271, 0.81]",6.3697560000000005e-18,4.347359e-16,0.165251,0.982994,"0 [0, 0]",True,True,True,True
4,HMDB0002226,Adrenic acid,4,"0.517 [0.41, 0.61]","[0.273, 0.871]",1.889452e-16,1.031641e-14,1.462982,0.69084,"0 [0, 68.6]",True,True,True,True


In [2]:
universal_robust_well_predicted = universal_robust_well_predicted[universal_robust_well_predicted['Robust (1)']][['HMDB ID', 'Compound Name']]

In [3]:
universal_robust_well_predicted

Unnamed: 0,HMDB ID,Compound Name
0,HMDB0000784,Azelaic acid
1,HMDB0002064,N-Acetylputrescine
2,HMDB0004160,Urobilin
3,HMDB0012252,Linoleoyl ethanolamide
4,HMDB0002226,Adrenic acid
...,...,...
176,HMDB0001325,"N6,N6,N6-Trimethyl-L-lysine"
177,HMDB0000112,gamma-Aminobutyric acid
183,HMDB0000448,Adipic acid
189,HMDB0001186,N1-Acetylspermine


In [4]:
'Tetradecanoylcarnitine' in universal_robust_well_predicted['Compound Name'].tolist()

False

Predict metabolite levels:
 Train a random forest regression model to predict metabolite levels from genus relative abundances (with default hyper-parameters, as done in Efrat's paper).
  Evaluate each model’s performance using leave-one-subject-out cross-validation, by calculating Spearman’s correlation coefficient, ρ, between the actual and predicted left-out metabolite levels. The Spearman correlation P value is also recorded, and FDR correction is applied across all metabolite models in each dataset (see “Methods” section). Metabolites with a predictability of ρ > 0.3 and an FDR < 0.1 are referred to as ‘well-predicted’ metabolites.
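
The procedure described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `LeaveOneGroupOut` as the leave-one-subject-out splitter; the function name `loso_spearman` and the toy data are hypothetical, and the project's own `predict_metabolite_level_v0` (whose internals are not shown here) may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict


def loso_spearman(X, y, groups, random_state=10):
    """Leave-one-group-out predictions, scored with Spearman's rho."""
    # Default hyper-parameters; only the seed is fixed for reproducibility.
    model = RandomForestRegressor(random_state=random_state)
    # Each sample is predicted by a model that never saw its own group.
    pred = cross_val_predict(model, X, y, groups=groups, cv=LeaveOneGroupOut())
    rho, p_value = stats.spearmanr(y, pred)
    return pd.Series(pred, index=y.index), rho, p_value


# Synthetic stand-in data (hypothetical): 8 subjects x 5 samples, 6 genera.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((40, 6)), columns=[f"genus_{i}" for i in range(6)])
y = pd.Series(3.0 * X["genus_0"] + rng.normal(scale=0.1, size=40), name="metabolite")
groups = pd.Series(np.repeat(np.arange(8), 5), name="host_subject_id")

pred, rho, p_value = loso_spearman(X, y, groups)
print(f"rho={rho:.3f}, p={p_value:.2e}")
```

Passing cage numbers instead of subject IDs as `groups` would give a leave-one-cage-out variant of the same evaluation.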





In [2]:

import pandas as pd
import numpy as np
import scipy.stats as stats
from h2m_translation.taxonomy_normalizer import NaiveTaxonomyNormalizer, preprocess_filter_rare_taxa_relative_abundance


Mice - Load data from Haddad OSA experiment:

In [3]:
taxa = pd.read_csv('mice/haddad_osa/data/taxonomic_observed_abundance_HaddadOSA.csv').set_index('#SampleID')

metadata = pd.read_csv('mice/haddad_osa/data/relevant_metadata_haddad_osa.csv').set_index('#SampleID')

metabolite_features = pd.read_csv('mice/haddad_osa/data/metabolite_unique_gnp_annotated_HaddadOSA.csv').set_index('sample_id')

In [4]:
metadata[metadata.control].host_subject_id.unique().shape

(8,)

Keep only the control samples in both the metabolite and the taxonomy tables, using the metadata:


In [5]:
control_samples = metadata[metadata.control].index
metabolite_features = metabolite_features.loc[control_samples, :]
taxa = taxa.loc[control_samples, :]
metadata = metadata[metadata.control]

Prep taxa: calculate relative abundance, and decide whether or not to drop the unknown taxa.


In [6]:
DROP_UNKNOWN_TAXA = False
UNKNOWN_TH = 1.0
naive_normalizer = NaiveTaxonomyNormalizer(drop_unknown_taxa=DROP_UNKNOWN_TAXA, unknown_taxa_sample_threshold=UNKNOWN_TH)
relative_abundance = naive_normalizer.normalize(taxa)
relative_abundance = preprocess_filter_rare_taxa_relative_abundance(relative_abundance,  verbose=False, percentage=10, abundance_threshold=0.001)
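
`NaiveTaxonomyNormalizer` and `preprocess_filter_rare_taxa_relative_abundance` are project helpers whose internals are not shown in this notebook. As a rough sketch of what such a step might do, assuming the normalizer divides each sample by its total count and the rare-taxa filter follows the same rule as the inline human-data filter later in the notebook (the function names below are hypothetical):

```python
import pandas as pd


def to_relative_abundance(counts):
    """Divide each sample (row) by its total so that every row sums to 1."""
    return counts.div(counts.sum(axis=1), axis=0)


def filter_rare_taxa(rel_abund, percentage=10, abundance_threshold=0.001):
    """Keep taxa above `abundance_threshold` in at least `percentage`% of samples."""
    min_samples = int((rel_abund.shape[0] / 100) * percentage)
    keep = (rel_abund > abundance_threshold).sum(axis=0) >= min_samples
    return rel_abund.loc[:, keep]


# Toy counts (hypothetical): 4 samples, one genus that is never observed.
counts = pd.DataFrame(
    {"Roseburia": [50, 30, 20, 40], "Unknown": [50, 70, 80, 60], "RareGenus": [0, 0, 0, 0]},
    index=["s1", "s2", "s3", "s4"],
)
rel = to_relative_abundance(counts)
filtered = filter_rare_taxa(rel, percentage=50)
print(filtered.columns.tolist())
```

Note that in this sketch the `Unknown` column is kept and counted in the row totals, which corresponds to `DROP_UNKNOWN_TAXA = False`.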

In [7]:
relative_abundance.head()

#SampleID,14-2,1XD42-69,Acetatifactor,Acutalibacter,Anaerotignum,Anaerotruncus,Angelakisella,CAG-317,CAG-485,CAG-56,...,Merdisoma,NM07-P-09,Pelethomonas,Romboutsia,Roseburia,Schaedlerella,Sporofaciens,Turicibacter,UBA7109,Unknown
10422.25.F.10,0.0005,0.0105,0.0045,0.003,0.0035,0.0,0.0075,0.001,0.016,0.001,...,0.0065,0.0345,0.009,0.0,0.0,0.0025,0.0025,0.0,0.0,0.558
10422.25.F.11,0.0065,0.003,0.0045,0.0045,0.001,0.0,0.004,0.0025,0.0415,0.0015,...,0.001,0.026,0.006,0.0,0.0,0.0005,0.0,0.0,0.001,0.571
10422.25.F.12,0.004,0.002,0.005,0.002,0.0025,0.0,0.003,0.0015,0.0685,0.001,...,0.002,0.028,0.001,0.0,0.0,0.0005,0.0015,0.0,0.0005,0.6245
10422.25.F.13,0.007,0.0055,0.004,0.001,0.0035,0.0005,0.004,0.002,0.0445,0.0015,...,0.0015,0.047,0.0135,0.0,0.0,0.0025,0.001,0.0,0.0,0.543
10422.25.F.3,0.0015,0.003,0.0145,0.0085,0.0045,0.006,0.0075,0.0,0.0555,0.014,...,0.0035,0.0,0.0,0.0,0.0,0.0,0.005,0.011,0.0,0.368


Prep metabolite:


In [8]:
from preprocessors import MetabolitePreprocessor
metabolite_preprocessor = MetabolitePreprocessor(verbose=False)
metabolite_features = metabolite_preprocessor.preprocess(metabolite_features)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  metabolite_features.replace(to_replace=np.nan, value=0, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  metabolite_features.replace(to_replace=0, value=min_value_per_metabolite, inplace=True)


In [9]:
metabolite_features.head()

#SampleID,Glycerophosphocholine,gamma-Aminobutyric acid,Guanosine,L-Glutamic acid,Hypoxanthine,L-Tyrosine,L-Phenylalanine,L-Histidine,Inosine,Pantothenic acid,...,HMDB0030808,HMDB0032797,HMDB0039531,HMDB0060665,HMDB0240594,HMDB0247607,HMDB0250631,HMDB0254199,HMDB0259275,HMDB0341212
10422.25.F.10,19.788512,17.504196,17.654117,19.612544,20.198391,19.864974,20.257097,18.547365,19.68238,21.04448,...,20.118693,20.130114,18.914552,20.40654,21.196795,20.966187,20.529572,19.590397,19.496697,16.140513
10422.25.F.11,20.132941,17.844568,20.06601,20.303923,19.710069,19.475774,19.270946,20.219976,20.638934,19.8999,...,19.931083,19.990823,17.167801,19.728497,19.997623,20.544878,19.374676,19.157107,19.780036,17.688974
10422.25.F.12,21.313526,18.218477,19.644726,20.050728,19.729961,19.42526,19.324003,20.125227,21.26882,20.32143,...,20.127113,20.481937,19.44247,20.074368,19.031961,20.546731,19.73853,19.529179,19.519951,19.612295
10422.25.F.13,20.681483,18.204897,18.986461,20.174736,19.715158,19.697018,19.56157,20.1048,20.926525,20.125604,...,20.616337,20.330078,17.718873,19.741273,19.508676,20.427176,19.720946,19.14092,19.430002,15.966109
10422.25.F.3,19.822835,18.860213,19.756987,20.088768,19.744542,19.649559,19.423534,20.187218,19.196104,19.719493,...,21.12594,18.977721,20.466351,19.676169,19.953001,20.51696,19.649859,19.768406,19.696387,22.169016


In [3]:
# Scale metabolite levels to zero mean and unit variance (standard transform); the scaler is fitted on the training set only, inside the pipeline itself.

# As a sanity check, verify that the learner does not use the 'Unknown' feature, or drop it altogether.
# Start by simply training on the human / mice data.

random_state=10
metabolite_name = 'Glycerophosphocholine'
host='mice'
dir_name = 'mice_cage_effect'
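
The scaling comment in the cell above can be illustrated with scikit-learn's `TransformedTargetRegressor`, which fits the target scaler on the training data only and maps predictions back to the original scale. This is a sketch under that assumption, on synthetic data; the project's actual pipeline is not shown here and may be built differently.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 30 training and 5 test samples.
rng = np.random.default_rng(10)
X_train, X_test = rng.random((30, 4)), rng.random((5, 4))
y_train = 20.0 + 2.0 * X_train[:, 0]  # log-scale-like metabolite levels

model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=10),
    transformer=StandardScaler(),  # fitted on y_train only, inside fit()
)
model.fit(X_train, y_train)

# Predictions are inverse-transformed back to the original metabolite scale.
pred = model.predict(X_test)
print(pred.round(2))
```

Because the scaler lives inside the estimator, every cross-validation fold refits it on that fold's training samples, so no information from the held-out subject leaks into the scaling.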

In [11]:
metadata_o = pd.read_csv('mice/haddad_osa/original/haddad_6weeks_metadata_matched.txt', sep='\t').set_index(
    '#SampleID')
metadata_o = pd.concat([metadata_o['host_subject_id'].to_frame()
                           , metadata_o['Description'].str.extract(r'.*collection (\d+) of .*').squeeze().to_frame(
        'seq_sample_number'),
                        metadata_o['exposure_type'].map({'IHH': False, 'Air': True}).to_frame('control'),
                        metadata_o['cage_number'].to_frame()], axis=1)
metadata_o = metadata_o[metadata_o.control]

In [19]:
# Predict and evaluate
from evaluator import evaluation_report_metabolite_level_v0
from predictor import predict_metabolite_level_v0
from utils import metabolite_to_str
import os 
files = pd.Series(os.listdir(f'metabolite_level_regressors/{dir_name}/'))

# groups = metadata['host_subject_id']
groups, _ = metadata_o['cage_number'].align(metabolite_features[metabolite_name])


for metabolite_name in metabolite_features.columns:
# for metabolite_name in ['Palmitoleic acid']:    
    metabolite_name_str = metabolite_to_str(metabolite_name)
    if f'score_df_{metabolite_name_str}.pkl' not in files.values:
        print(f'------------- pred and eval metabolite {metabolite_name} ------------')
        pred_test = predict_metabolite_level_v0(relative_abundance, metabolite_features, metabolite_name, groups,
                                        host='mice', random_state=10, verbose=False, dir_name=dir_name)
        html = evaluation_report_metabolite_level_v0(metabolite_features, metabolite_name, host, pred_test, dir_name=dir_name)


------------- pred and eval metabolite Glycerophosphocholine ------------
------------- pred and eval metabolite gamma-Aminobutyric acid ------------
------------- pred and eval metabolite Guanosine ------------
------------- pred and eval metabolite L-Glutamic acid ------------
------------- pred and eval metabolite Hypoxanthine ------------
------------- pred and eval metabolite L-Tyrosine ------------
------------- pred and eval metabolite L-Phenylalanine ------------
------------- pred and eval metabolite L-Histidine ------------
------------- pred and eval metabolite Inosine ------------
------------- pred and eval metabolite Pantothenic acid ------------
------------- pred and eval metabolite N-Acetyl-D-glucosamine ------------
------------- pred and eval metabolite Palmitoylcarnitine ------------
------------- pred and eval metabolite Riboflavin ------------
------------- pred and eval metabolite Sucrose ------------
------------- pred and eval metabolite L-Arginine ------------

In [15]:
# from evaluator import evaluation_report_metabolite_level_v0
# host='mice'
# 
# for metabolite_name in metabolite_features.columns:
#     metabolite_name_str = metabolite_to_str(metabolite_name)
#     pred_test = pd.read_pickle(f'metabolite_level_regressors/{host}/pred_test_{metabolite_name_str}.pkl').squeeze()
#     evaluation_report_metabolite_level_v0(metabolite_features, metabolite_name, host, pred_test)

In [None]:
# from IPython.display import display, HTML
# display(HTML(html))

Aggregate results:

In [20]:
from utils import metabolite_to_str
import os 
files = pd.Series(os.listdir(f'metabolite_level_regressors/{dir_name}/'))

score_dataframes = []
print(f"Host: {host} random state: {random_state}")
for metabolite_name in metabolite_features.columns:
    metabolite_name_str = metabolite_to_str(metabolite_name)
    if f'score_df_{metabolite_name_str}.pkl' in files.values:
        score_df = pd.read_pickle(f'metabolite_level_regressors/{dir_name}/score_df_{metabolite_name_str}.pkl')
        score_dataframes.append(score_df)
    else: 
        print(f"Metabolite: {metabolite_name} file is missing")
score_dataframes = pd.concat(score_dataframes, axis=1)
score_dataframes.to_pickle(f'metabolite_level_regressors/{dir_name}/scoree_dataframes_all_metabolites.pkl')


Host: mice random state: 10


In [4]:
score_dataframes = pd.read_pickle(f'metabolite_level_regressors/{dir_name}/scoree_dataframes_all_metabolites.pkl')

In [5]:
score_dataframes

Unnamed: 0,Glycerophosphocholine,gamma-Aminobutyric acid,Guanosine,L-Glutamic acid,Hypoxanthine,L-Tyrosine,L-Phenylalanine,L-Histidine,Inosine,Pantothenic acid,...,HMDB0030808,HMDB0032797,HMDB0039531,HMDB0060665,HMDB0240594,HMDB0247607,HMDB0250631,HMDB0254199,HMDB0259275,HMDB0341212
spearman_corr,-0.215237,0.123678,0.092349,0.135613,0.322962,0.081203,0.321769,0.258749,0.043611,0.43324,...,0.236479,-0.013376,0.015286,0.5263388,0.095087,-0.229881,0.022464,0.123579,-0.108417,-0.096956
p_value,0.041618,0.245489,0.386653,0.202502,0.001905,0.446745,0.001984,0.013798,0.683173,2e-05,...,0.024832,0.900421,0.886292,9.922476e-08,0.372667,0.029284,0.833545,0.24587,0.309079,0.3633
r2,-1.930674,-8.71588,-12.419714,-4.13917,-1.212962,-4.371215,-3.44552,-5.264134,-5.303529,-6.077042,...,-1.919492,-6.004409,-7.651598,-2.504584,-2.032071,-17.139725,-11.798711,-3.117018,-6.107752,-7.889362
rmse,2.736015,2.040244,3.058294,0.171986,0.25229,0.239737,0.48961,0.483226,4.21403,0.423779,...,2.642098,1.554723,2.028549,0.2049942,4.167157,0.581283,0.547306,0.542457,0.586232,4.784276
MAPE,0.07243,0.05953,0.074971,0.016148,0.01954,0.01882,0.027287,0.028006,0.085281,0.027264,...,0.065685,0.047442,0.05803,0.01808864,0.092995,0.031074,0.029398,0.026242,0.029922,0.084981


In [10]:
score_dataframes.loc['spearman_corr', :][score_dataframes.loc['spearman_corr', :] > 0.3].index.tolist()

['Hypoxanthine',
 'L-Phenylalanine',
 'Pantothenic acid',
 'Tauroursodeoxycholic acid',
 'Oleoylethanolamide',
 'Glutaminylleucine',
 'Methionyl-Valine',
 'HMDB0060665']
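
The cell above applies only the ρ > 0.3 cut-off. The full 'well-predicted' definition also requires FDR < 0.1, which could be sketched as below. This uses a hand-written Benjamini-Hochberg adjustment as an assumption about the FDR method; the helper names and the toy score table are hypothetical, and the project's evaluator may implement this differently.

```python
import numpy as np
import pandas as pd


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values for a pandas Series."""
    p = p_values.to_numpy(dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return pd.Series(adjusted, index=p_values.index, name="fdr")


def well_predicted(scores, rho_threshold=0.3, fdr_threshold=0.1):
    """List metabolites (columns) passing both the rho and the FDR cut-offs."""
    fdr = benjamini_hochberg(scores.loc["p_value"])
    mask = (scores.loc["spearman_corr"] > rho_threshold) & (fdr < fdr_threshold)
    return mask[mask].index.tolist()


# Toy score table (hypothetical), laid out like score_dataframes.
toy_scores = pd.DataFrame(
    {"A": [0.45, 0.001], "B": [0.32, 0.2], "C": [-0.10, 0.5], "D": [0.05, 0.04]},
    index=["spearman_corr", "p_value"],
)
print(well_predicted(toy_scores))
```

In the toy table, metabolite `B` passes the ρ threshold but not the FDR threshold, which is the case the ρ-only filter cannot distinguish.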

In [14]:
from evaluator import evaluation_report_metabolite_level_all_v0

t = evaluation_report_metabolite_level_all_v0(score_dataframes, metabolite_features, host='mice', dataset_name='OSA', dir_name=dir_name,
                                              robust_well_predicted_path='/home/noa/lab_code/H2Mtranslation/h2m_translation/map_hmdb_id_to_name.pkl')

Human - Load data from the iHMP (IBDMDB 2019) experiment:


In [16]:
PROJECT_DIR = '/home/noa/lab_code/H2Mtranslation'

human_db_path = '/home/noa/lab_code/microbiome-metabolome-curated-data/data/processed_data'
dataset = 'iHMP_IBDMDB_2019'

metadata = pd.read_csv(f'{human_db_path}/{dataset}/metadata.tsv' ,sep='\t')
metabolite = pd.read_csv(f'{human_db_path}/{dataset}/mtb.tsv' ,sep='\t')
metabolite_to_hmdb = pd.read_csv(f'{human_db_path}/{dataset}/mtb.map.tsv' ,sep='\t')
taxa = pd.read_csv(f'{human_db_path}/{dataset}/genera.tsv' ,sep='\t')

metabolite_to_hmdb = metabolite_to_hmdb.set_index('Compound')['HMDB']


In [17]:
# Filter to control samples only:
control_samples_ids = metadata[metadata['Study.Group'] == 'nonIBD'].Sample.unique()
taxa = taxa[taxa.Sample.isin(control_samples_ids)]
metabolite = metabolite[metabolite.Sample.isin(control_samples_ids)]

In [18]:
metadata = metadata[metadata['Study.Group'] == 'nonIBD']

In [19]:
# Filter rare taxa: keep only taxa whose relative abundance is larger than 0.001
# in at least 10% of the samples.

taxa = taxa.set_index('Sample')
verbose=False
percentage=10
abundance_threshold=0.001
min_number_of_samples = int((taxa.shape[0] / 100) * percentage)
print(
    f"The number of genus/features that have relative abundance values larger then {abundance_threshold} on > {percentage} % samples are: "
    f"{(((taxa > abundance_threshold).sum(axis=0)) >= min_number_of_samples).sum()}"
    f" out of {taxa.shape[1]} genus/features before-filtering.")

non_rare_columns = taxa.columns[
    ((taxa > abundance_threshold).sum(axis=0)) >= min_number_of_samples]
taxa = taxa[non_rare_columns]

The number of genus/features that have relative abundance values larger then 0.001 on > 10 % samples are: 107 out of 9694 genus/features before-filtering.


In [20]:
metabolite = metabolite.set_index('Sample')
metabolite = metabolite[metabolite.columns.intersection(metabolite_to_hmdb.dropna().index)].rename(columns=metabolite_to_hmdb.dropna())
metabolite = metabolite.groupby(by=metabolite.columns, axis=1).sum()
print(metabolite.shape)
# Rename from HMDB IDs to meaningful compound names.
map_hmdb_id_to_name = pd.read_pickle(f'{PROJECT_DIR}/h2m_translation/map_hmdb_id_to_name.pkl')
hmdb_metadata = pd.read_pickle(f'{PROJECT_DIR}/h2m_translation/hmdb_name_and_description.pkl')
metabolite = metabolite.rename(columns=map_hmdb_id_to_name)

(104, 454)


In [21]:
percentage = 85
min_number_of_samples = int((metabolite.shape[0] / 100) * percentage)
print(f"min_number_of_samples: {min_number_of_samples}")
print(f"metabolite shape: {metabolite.shape}")

min_number_of_samples: 88
metabolite shape: (104, 454)


In [22]:
metabolite = metabolite.fillna(0)
non_rare_columns = metabolite.columns[((metabolite.shape[0] - (
            metabolite.round(decimals=8) == 0).sum(axis=0)) >= min_number_of_samples)]
print(f"There are {len(non_rare_columns)} metabolite with sufficient number of samples "
      f"(>{percentage}%) out of {metabolite.shape[1]} metabolites.")
metabolite= metabolite[non_rare_columns]

There are 375 metabolite with sufficient number of samples (>85%) out of 454 metabolites.


In [23]:
min_value_per_metabolite = metabolite.replace(to_replace=0, value=np.nan).min(axis=0) / 2
metabolite.replace(to_replace=np.nan, value=0, inplace=True)
metabolite.replace(to_replace=0, value=min_value_per_metabolite, inplace=True)
metabolite = metabolite.apply(lambda x: np.log(x + 1))
metabolite.head()


Sample,Deoxycytidine,4-Pyridoxic acid,alpha-Ketoisovaleric acid,p-Hydroxyphenylacetic acid,Ureidopropionic acid,Biotin,Adenine,Taurocholic acid,Butyric acid,Betaine,...,N-Acetylhistidine,HMDB0035665,17-Methyloctadecanoic acid,HMDB0037942,TG(14:0/14:0/16:0),HMDB0042093,HMDB0043058,2-Hydroxyglutarate,HMDB0059824,10Z-Heptadecenoic acid
HSM5MD5D,12.818366,14.149857,10.236991,10.802408,9.801067,18.017231,14.302072,13.069583,15.220003,11.722051,...,10.98307,8.921725,7.474772,11.372916,8.328209,9.034677,9.106978,11.996672,8.412055,12.759515
MSM6J2K6,15.51048,15.616391,9.592059,11.748164,9.639001,17.552187,15.118838,14.322898,15.558922,13.311884,...,11.961903,8.158516,10.554145,12.910531,7.724005,8.360773,10.696412,12.976577,10.601199,15.023866
HSM5MD6K,15.26784,14.491845,7.985825,11.97747,9.835423,17.051244,14.985529,13.674023,16.23645,13.114035,...,12.452268,10.565273,7.426549,14.486872,6.380123,7.874739,10.09146,11.489984,9.605822,12.312858
MSM6J2HT,16.72706,15.203554,9.617271,13.011155,10.635158,16.334514,14.196841,15.915374,15.786928,13.73633,...,10.711658,7.231287,10.719251,12.291635,2.140066,8.196437,10.134004,12.532018,10.553153,13.993887
MSM6J2JR,14.153341,12.077506,7.546974,11.778416,8.489205,18.341299,13.402568,11.597285,15.355539,13.193619,...,10.173706,8.054205,10.09852,12.539816,15.878592,7.670429,11.928935,11.867189,9.873131,13.960778


In [24]:
human_relative_abundance = taxa
human_metabolite_features = metabolite
human_matadata = metadata

In [25]:
human_matadata.head()

Unnamed: 0,Dataset,Sample,Subject,Study.Group,Gender,DOI,Publication.Name,consent_age,Age.Units,site_sub_coll,...,visit_num,site_name,Age at diagnosis,Antibiotics,race,fecalcal,BMI_at_baseline,Height_at_baseline,Weight_at_baseline,smoking status
23,iHMP_IBDMDB_2019,CSM67UH7,C3022,nonIBD,Male,10.1038/s41586-019-1237-9,Multi-omics of the gut microbial ecosystem in ...,69.0,Years,C3022C1,...,4,Cedars-Sinai,,No,White,15.64728,,,,
25,iHMP_IBDMDB_2019,CSM79HGZ,C3022,nonIBD,Male,10.1038/s41586-019-1237-9,Multi-omics of the gut microbial ecosystem in ...,69.0,Years,C3022C5,...,8,Cedars-Sinai,,No,White,23.3,,,,
41,iHMP_IBDMDB_2019,CSM79HP4,C3022,nonIBD,Male,10.1038/s41586-019-1237-9,Multi-omics of the gut microbial ecosystem in ...,69.0,Years,C3022C11,...,15,Cedars-Sinai,,No,White,,,,,
62,iHMP_IBDMDB_2019,CSM7KOOH,C3022,nonIBD,Male,10.1038/s41586-019-1237-9,Multi-omics of the gut microbial ecosystem in ...,69.0,Years,C3022C14,...,19,Cedars-Sinai,,No,White,17.58098,,,,
88,iHMP_IBDMDB_2019,CSMAG78W,C3022,nonIBD,Male,10.1038/s41586-019-1237-9,Multi-omics of the gut microbial ecosystem in ...,69.0,Years,C3022C23,...,29,Cedars-Sinai,,No,White,15.87925,,,,


In [26]:
len(human_matadata.Subject.unique())

26

In [54]:
from utils import metabolite_to_str

random_state=10
host='human'

files = pd.Series(os.listdir(f'metabolite_level_regressors/{host}/'))

print(f"Host: {host} random state: {random_state}")

# Predict and evaluate
from evaluator import evaluation_report_metabolite_level_v0
from predictor import predict_metabolite_level_v0
score_dataframes = []
for metabolite_name in human_metabolite_features.columns:
    metabolite_name_str = metabolite_to_str(metabolite_name)
    if f'score_df_{metabolite_name_str}.pkl' in files.values:
        score_df = pd.read_pickle(f'metabolite_level_regressors/{host}/score_df_{metabolite_name_str}.pkl')
        score_dataframes.append(score_df)
    else:
        print(f'------------- pred and eval metabolite {metabolite_name} ------------')
        pred_test = predict_metabolite_level_v0(human_relative_abundance, human_metabolite_features, metabolite_name, human_matadata['Subject'],
                                        host=host, random_state=random_state, verbose=False)
        html = evaluation_report_metabolite_level_v0(human_metabolite_features, metabolite_name, host, pred_test)
score_dataframes = pd.concat(score_dataframes, axis=1)
score_dataframes.to_pickle(f'metabolite_level_regressors/{host}/scoree_dataframes_all_metabolites.pkl')


Host: human random state: 10


In [61]:
score_dataframes

Unnamed: 0,Deoxycytidine,4-Pyridoxic acid,alpha-Ketoisovaleric acid,p-Hydroxyphenylacetic acid,Ureidopropionic acid,Biotin,Adenine,Taurocholic acid,Butyric acid,Betaine,...,N-Acetylhistidine,HMDB0035665,17-Methyloctadecanoic acid,HMDB0037942,TG(14:0/14:0/16:0),HMDB0042093,HMDB0043058,2-Hydroxyglutarate,HMDB0059824,10Z-Heptadecenoic acid
spearman_corr,0.264013,0.238686,-0.126705,0.5237704,0.287214,0.341423,0.377382,0.460973,0.328145,0.176486,...,0.005804,0.402572,0.214979,0.194292,-0.017825,-0.011267,0.278961,0.206455,0.433394,0.389278
p_value,0.006767,0.014684,0.19995,1.160208e-08,0.003115,0.00039,7.8e-05,8.462034e-07,0.000672,0.073109,...,0.953371,2.3e-05,0.028411,0.048118,0.857469,0.909618,0.004136,0.035492,4e-06,4.4e-05
r2,-4.601876,-4.965335,-11.803692,-1.187233,-2.945396,-4.968236,-2.895569,-2.485935,-4.160112,-5.0364,...,-12.10238,-3.096099,-4.892206,-3.616876,-13.726955,-12.978523,-5.3809,-3.5257,-3.051735,-2.742173
rmse,2.218255,0.721529,0.985386,0.6747416,0.941523,0.649711,1.3471,2.155436,0.649317,1.046138,...,1.049242,3.80908,0.880263,0.907981,11.602237,2.675345,1.857352,0.598083,0.336689,0.752859
MAPE,0.076444,0.039928,0.075734,0.05247306,0.074995,0.036432,0.057329,0.07868942,0.039326,0.055182,...,0.063996,0.175143,0.073228,0.058311,0.328192,0.159918,0.095429,0.047288,0.046742,0.04627


In [96]:
from evaluator import evaluation_report_metabolite_level_all_v0

t = evaluation_report_metabolite_level_all_v0(score_dataframes, human_metabolite_features, host='human', dataset_name='iHMP',
                                              robust_well_predicted_path='/home/noa/lab_code/H2Mtranslation/h2m_translation/map_hmdb_id_to_name.pkl')

Thoughts:
* Maybe there is a lot of noise from factors we don't measure, so predicting the exact metabolite level won't work; but maybe predicting extreme values, which may indicate disease, will work better?
* Maybe we need to separate the ability to predict "is this metabolite in the sample" (zero/not) from the metabolite-level prediction?
(That is, I believe that splitting the problem into outlier detection (zero/not, outlier/not) and metabolite-level regression will improve our prediction abilities.)
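
One possible shape for the second idea is sketched below. It is not part of the analysis in this notebook: the class name, the use of random forests for both stages, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


class TwoStageMetaboliteModel:
    """Stage 1: is the metabolite present (non-zero)? Stage 2: at what level?"""

    def __init__(self, random_state=10):
        self.clf = RandomForestClassifier(random_state=random_state)
        self.reg = RandomForestRegressor(random_state=random_state)

    def fit(self, X, y):
        present = y > 0
        self.clf.fit(X, present)
        # The regressor only sees samples where the metabolite was detected.
        self.reg.fit(X[present], y[present])
        return self

    def predict(self, X):
        present = self.clf.predict(X).astype(bool)
        pred = np.zeros(X.shape[0])
        if present.any():
            pred[present] = self.reg.predict(X[present])
        return pred


# Toy data (hypothetical): the metabolite is absent when genus 0 is scarce.
rng = np.random.default_rng(10)
X = rng.random((60, 3))
y = np.where(X[:, 0] > 0.5, 5.0 + X[:, 1], 0.0)

two_stage = TwoStageMetaboliteModel().fit(X, y)
two_stage_pred = two_stage.predict(X)
print(two_stage_pred[:5].round(2))
```

The same structure would apply to an outlier/not split by replacing the `y > 0` rule with an outlier criterion.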

Comparing Prediction scores in Mice and in Humans (OSA vs. iHMP)

In [63]:
human_scores = pd.read_pickle('metabolite_level_regressors/human/scoree_dataframes_all_metabolites.pkl')
mice_scores = pd.read_pickle('metabolite_level_regressors/mice/scoree_dataframes_all_metabolites.pkl')


In [64]:
import pandas as pd
import scipy.stats as stats
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error
import base64
from io import BytesIO
import matplotlib.pyplot as plt
from statsmodels.stats.multitest import fdrcorrection
from utils import metabolite_to_str
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np
import matplotlib.colors as mcolors

In [65]:
human_scores

Unnamed: 0,Deoxycytidine,4-Pyridoxic acid,alpha-Ketoisovaleric acid,p-Hydroxyphenylacetic acid,Ureidopropionic acid,Biotin,Adenine,Taurocholic acid,Butyric acid,Betaine,...,N-Acetylhistidine,HMDB0035665,17-Methyloctadecanoic acid,HMDB0037942,TG(14:0/14:0/16:0),HMDB0042093,HMDB0043058,2-Hydroxyglutarate,HMDB0059824,10Z-Heptadecenoic acid
spearman_corr,0.264013,0.238686,-0.126705,0.5237704,0.287214,0.341423,0.377382,0.460973,0.328145,0.176486,...,0.005804,0.402572,0.214979,0.194292,-0.017825,-0.011267,0.278961,0.206455,0.433394,0.389278
p_value,0.006767,0.014684,0.19995,1.160208e-08,0.003115,0.00039,7.8e-05,8.462034e-07,0.000672,0.073109,...,0.953371,2.3e-05,0.028411,0.048118,0.857469,0.909618,0.004136,0.035492,4e-06,4.4e-05
r2,-4.601876,-4.965335,-11.803692,-1.187233,-2.945396,-4.968236,-2.895569,-2.485935,-4.160112,-5.0364,...,-12.10238,-3.096099,-4.892206,-3.616876,-13.726955,-12.978523,-5.3809,-3.5257,-3.051735,-2.742173
rmse,2.218255,0.721529,0.985386,0.6747416,0.941523,0.649711,1.3471,2.155436,0.649317,1.046138,...,1.049242,3.80908,0.880263,0.907981,11.602237,2.675345,1.857352,0.598083,0.336689,0.752859
MAPE,0.076444,0.039928,0.075734,0.05247306,0.074995,0.036432,0.057329,0.07868942,0.039326,0.055182,...,0.063996,0.175143,0.073228,0.058311,0.328192,0.159918,0.095429,0.047288,0.046742,0.04627


In [91]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8, 9))

human_scores.loc['spearman_corr', :].plot.hist(ax=axes[0][0], bins=100, title='Human Metabolite Level Spearman correlation')
mice_scores.loc['spearman_corr', :].plot.hist(ax=axes[0][1], bins=100, title='MICE Metabolite Level Spearman correlation')


human_scores.loc['rmse', :].plot.hist(ax=axes[1][0], bins=100, title='Human Metabolite Level rmse')
mice_scores.loc['rmse', :].plot.hist(ax=axes[1][1], bins=100, title='MICE Metabolite Level rmse')

human_scores.loc['MAPE', :].plot.hist(ax=axes[2][0], bins=100, title='Human Metabolite Level MAPE')
mice_scores.loc['MAPE', :].plot.hist(ax=axes[2][1], bins=100, title='MICE Metabolite Level MAPE')

axes[0][0].set_xlim([0.0, 0.7])
axes[0][1].set_xlim([0.0, 0.7])
axes[1][0].set_xlim([0.0, 10])
axes[1][1].set_xlim([0.0, 10])
axes[2][0].set_xlim([0.0, 0.2])
axes[2][1].set_xlim([0.0, 0.2])

# Apply the layout before saving, so that it affects the saved image.
plt.tight_layout()
tmpfile = BytesIO()
plt.savefig(tmpfile, format='jpg', pad_inches=0.1, edgecolor='gray', bbox_inches='tight')

# Avoid displaying the figure when calling this function:
plt.close(fig)

first_fig = base64.b64encode(tmpfile.getvalue()).decode('utf-8')

In [73]:
shared_metabolites = human_scores.columns.intersection(mice_scores.columns)

In [75]:
print(f"Num. Shared metabolites: {shared_metabolites.shape[0]}")

Num. Shared metabolites: 25


In [92]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))


pd.DataFrame({'human': human_scores.loc['spearman_corr', shared_metabolites], 'mice': mice_scores.loc['spearman_corr', shared_metabolites]}).plot.scatter(x='mice', y='human', title='Spearman correlation on the shared metabolites', ax=axes[0])

pd.DataFrame({'human': human_scores.loc['rmse', shared_metabolites], 'mice': mice_scores.loc['rmse', shared_metabolites]}).plot.scatter(x='mice', y='human', title='rmse on the shared metabolites', ax=axes[1])

pd.DataFrame({'human': human_scores.loc['MAPE', shared_metabolites], 'mice': mice_scores.loc['MAPE', shared_metabolites]}).plot.scatter(x='mice', y='human',  title='MAPE on the shared metabolites', ax=axes[2])

# Apply the layout before saving, so that it affects the saved image.
plt.tight_layout()
tmpfile = BytesIO()
plt.savefig(tmpfile, format='jpg', pad_inches=0.1, edgecolor='gray', bbox_inches='tight')

# Avoid displaying the figure when calling this function:
plt.close(fig)

second_fig = base64.b64encode(tmpfile.getvalue()).decode('utf-8')

In [97]:
# The figures are saved as JPEG, so the embedded images use the matching MIME type.
report = ('<html> \n <title> Evaluation Report </title> \n ' +
          '<body> \n '
          ' <p><b><u> Evaluation Report comparing Metabolites Level prediction in Mice vs. Human </u></b></p>'
          + '<img src=\'data:image/jpeg;base64,{}\'>'.format(first_fig) +
          "<p>\n \n </p>" +
          '<img src=\'data:image/jpeg;base64,{}\'>'.format(second_fig) +
          '</body>'
          '</html>'
          )

with open(f'metabolite_level_regressors/eval_report_mice_vs_human_RF_default.html', 'w') as f:
    f.write(report)