Data preparation for Haddad OSA article


In [1]:
import pandas as pd
import numpy as np

Taxonomic feature table: Samples x Taxonomic species

In [136]:
taxa = pd.read_csv('taxonomic_observed_abundance_HaddadOSA.csv')
taxa = taxa.set_index('#SampleID')

In [137]:
taxa.head()

Meta-data:

In [64]:
metadata = pd.read_csv('original/haddad_6weeks_metadata_matched.txt', sep='\t')

In [27]:
metadata.columns

In [51]:
metadata.head()

Which metadata we care about:
* age / collection_timestamp for the time-series data. Possibly better: extract the serial number of the collection from Description.
* exposure_type (IHH exposure): distinguishes the study group from the control group. IHH = study group, Air = control.
* host_subject_id / mouse_number


Some information about the experiment:
There should be 16 mice, 8 of them controls.
Each mouse should have fecal samples from 3 time points.
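
The expected design can be checked against the metadata once it is parsed. The sketch below is a minimal example on a toy table; the helper name `check_design` and the toy values are illustrative assumptions, and the column names mirror the `host_subject_id`, `seq_sample_number` and `control` fields used in this notebook.

```python
import pandas as pd

def check_design(meta, n_mice=16, n_control=8, n_timepoints=3):
    """Return one boolean per expected design property (hypothetical helper)."""
    per_mouse = meta.groupby('host_subject_id')
    samples_per_mouse = per_mouse.size()
    # A mouse is either control or not, so the first value per mouse is enough.
    control_per_mouse = per_mouse['control'].first()
    return {
        'n_mice': samples_per_mouse.shape[0] == n_mice,
        'n_control': int(control_per_mouse.sum()) == n_control,
        'n_timepoints': bool((samples_per_mouse == n_timepoints).all()),
    }

# Toy metadata (made up): 16 mice x 3 collections, the first 8 mice are controls.
toy_meta = pd.DataFrame({
    'host_subject_id': [f'mouse_{m}' for m in range(16) for _ in range(3)],
    'seq_sample_number': [t for _ in range(16) for t in (1, 2, 3)],
    'control': [m < 8 for m in range(16) for _ in range(3)],
})
design_ok = check_design(toy_meta)
print(design_ok)
```

Any `False` entry would point to missing or duplicated samples that should be resolved before the time-series analysis.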
 

In [11]:
metadata['exposure_type'].unique()

In [16]:
metadata['host_subject_id'].unique()

In [65]:
metadata = metadata.set_index('#SampleID')

In [66]:
metadata = pd.concat([metadata['host_subject_id'].to_frame(),
                      metadata['Description'].str.extract(r'.*collection (\d+) of .*').squeeze().to_frame('seq_sample_number'),
                      metadata['exposure_type'].map({'IHH': False, 'Air': True}).to_frame('control')], axis=1)

In [77]:
metadata['seq_sample_number'] = metadata['seq_sample_number'].astype(int)

Sample statistics from the metadata:

In [70]:
# How many control samples are there?
print(f'There are {metadata[metadata["control"]].shape[0]} samples under control (including multiple time points per mouse)')

In [82]:
print(f"There are {metadata.groupby('host_subject_id').apply(lambda x: x.sort_values(by='seq_sample_number', ascending=True).iloc[0]).query('control').shape[0]} mice under control, when taking only the first sample per mouse")

In [83]:
metadata.head()

In [84]:
metadata.to_csv('relevant_metadata_haddad_osa.csv')

Load the feature table of samples x metabolites (molecule annotations)

Note: this data has already been PQN-normalized (as done in the article):
probabilistic quotient normalization to an internal standard (m/z 278.189; retention time, 3.81 min) for subsequent analysis: https://pubmed.ncbi.nlm.nih.gov/16808434/
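
For orientation, the general idea of probabilistic quotient normalization can be sketched as below. This is a minimal sketch of the textbook median-reference variant on a made-up table, not the article's exact pipeline: it skips the usual initial total-intensity scaling and does not use the internal standard mentioned above. The function name `pqn_normalize` is an assumption.

```python
import numpy as np
import pandas as pd

def pqn_normalize(intensities):
    """PQN sketch for a samples x features table of positive intensities."""
    # Reference profile: median intensity of each feature across samples.
    reference = intensities.median(axis=0)
    # Quotient of every feature against the reference, per sample.
    quotients = intensities.div(reference, axis=1)
    # The most probable dilution factor of a sample is its median quotient.
    dilution = quotients.median(axis=1)
    return intensities.div(dilution, axis=0)

# Toy data (made up): sample s2 is a 2x "concentrated" copy of s1 and s3.
toy_intensities = pd.DataFrame(
    [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [1.0, 2.0, 4.0]],
    index=['s1', 's2', 's3'], columns=['f1', 'f2', 'f3'],
)
toy_normalized = pqn_normalize(toy_intensities)
print(toy_normalized)
```

After normalization the three toy samples have identical profiles, which is the dilution-correcting behaviour PQN is meant to provide.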


In [88]:
feature = pd.read_table('original/haddad_6weeks_allFeatures_pqn_matched_wIDs.txt', sep ='\t')

In [90]:
feature.columns

In [93]:
feature.head()

In [109]:
feature.shape

Using what they call "standard identification" (the level_1_identification column), we have 12 unique metabolites.

In [91]:
feature['level_1_identification'].unique()

In [95]:
feature['level_1_identification'].unique()[0]

Note that the mapping contains duplicates: several features are mapped to the same metabolite. Sometimes the GNPS and
standard annotations also disagree.
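
Both issues can be quantified with a couple of group-bys. The sketch below uses a made-up toy table with the same column names as `feature`; comparing the two annotation columns by exact string equality is an assumption, since the real GNPS names would likely need to be normalized first.

```python
import pandas as pd

# Toy feature table (made up); 'none' marks a missing annotation, as in the real data.
toy_feature = pd.DataFrame({
    '#featureID': ['f1', 'f2', 'f3', 'f4', 'f5'],
    'level_1_identification': ['Cholic acid', 'Cholic acid', 'none', 'Taurocholic acid', 'none'],
    'level2_gnps': ['Cholic acid', 'none', 'none', 'Some other name', 'Some other name'],
})

# Metabolites that more than one feature maps to (level 1).
annotated = toy_feature[toy_feature['level_1_identification'] != 'none']
features_per_metabolite = annotated.groupby('level_1_identification').size()
duplicated = features_per_metabolite[features_per_metabolite > 1]

# Features annotated at both levels whose two names differ.
both = toy_feature[(toy_feature['level_1_identification'] != 'none')
                   & (toy_feature['level2_gnps'] != 'none')]
disagreeing = both[both['level_1_identification'] != both['level2_gnps']]

print(duplicated)
print(disagreeing)
```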

In [96]:
feature[feature['level_1_identification']!= 'none']

In [97]:
# Extract a few statistics about the metabolites:
# How many different features per metabolite do I have (for level 1 and level 2)?
# How many do not agree?
# Are the different features that map to the same level-1 metabolite correlated with each other?
feature[feature['level_1_identification']!= 'none'].groupby('level_1_identification').apply(lambda x: x.shape[0])


Are the features that map to the same level_1 identifier correlated?

In [141]:
feature['level_1_identification'].unique()

In [169]:
for metabolic in feature['level_1_identification'].unique():
    if metabolic == 'none':
        continue
    same_metabolite = feature.query('level_1_identification == @metabolic')
    feature_corr = same_metabolite.drop(columns=['obs_mz','obs_rt','level2_gnps','level_1_identification']).set_index('#featureID').T.corr()
    # Mask the diagonal (self-correlation) so it does not inflate the mean.
    for i in range(feature_corr.shape[0]):
        feature_corr.iloc[i, i] = np.nan
    # Average the off-diagonal correlations directly; calling .corr() again would correlate the correlation matrix itself.
    print(f"Mean correlation of features under {metabolic}: {round(feature_corr.mean().mean(),3)}, #features: {same_metabolite.shape[0]}")

In [171]:
feature_corr = feature.query('level_1_identification == "Taurocholic acid"').drop(columns=['obs_mz','obs_rt','level2_gnps','level_1_identification']).set_index('#featureID').T.corr()

In [172]:
for i in range(feature_corr.shape[0]):
    feature_corr.iloc[i, i] = np.nan


In [173]:
feature_corr

In [98]:
# GNPS annotations seem to be more unique (many of them map one feature -> one metabolite). So maybe my requirement could be
# to work only at a level of identification where each metabolite is unique.

# Seems like it is possible to annotate between GNPs to HMDB ids. Might take a while due.

feature[feature['level2_gnps']!= 'none'].groupby('level2_gnps').apply(lambda x: x.shape[0])

In [103]:
feature[feature['level_1_identification']!= 'none'][['level2_gnps', 'level_1_identification']]

In [106]:
feature[feature['level_1_identification']!= 'none'][['level2_gnps', 'level_1_identification']].iloc[0]
# 3.beta.-Hydroxy-5-cholenoic acid: I am trying to see whether it is a specific form of Deoxycholic acid. I'm not sure.
feature[feature['level_1_identification']!= 'none'][['level2_gnps', 'level_1_identification']].iloc[3]
# 12-Ketodeoxycholic acid: how can it be both Cholic acid and a-Muricholic acid?

In [108]:
# It seems that I can take the features with a one-to-one mapping to GNPS. This seems to be equivalent to an HMDB identifier.

(feature[feature['level2_gnps']!= 'none'].groupby('level2_gnps').apply(lambda x: x.shape[0]) == 1).sum()
# I would have 121 metabolites like that. Create this table.

# If I needed to work manually (due seems to have this library) 
# Then, out of this table select 10 that I would also annotate using HMDB for comparison with Efrat's work on Humans.

In [99]:
# In the end I need metabolite names in a convention I can work with: KEGG, HMDB.
# Map with the help of Efrat and this library: MetaboAnalystR (?). Verify whether they provided any mapping...?


# I would like to have a mapping to KEGG and HMDB identifiers. See if it is provided somehow.

# In the end I would like to look at high-confidence metabolites, i.e. those with a one-to-one mapping to KEGG/HMDB identifiers.
# Understand why there are conflicts between level-1 and level-2 identification and consider how to handle this subset of features -
# I might want to drop them as well.

# I think level 1 should be more accurate than level 2. So if I'm using level 1, it's OK. But if I'm using level 2 and there is a mismatch
# with level 1, I might want to drop that feature or use level 1. Check first whether the level-2 annotation is a special case of the level-1 one.



In [None]:
# Create the standard identification metabolites feature table:

Using GNPS we have 201 identified untargeted metabolites out of 1710 metabolite features (i.e. most features are not identified).

In [92]:
feature['level2_gnps'].unique()


In [114]:
# Create the GNPS metabolites feature table:
unique_gnp_metabolite = (feature[feature['level2_gnps']!= 'none'].groupby('level2_gnps').apply(lambda x: x.shape[0]) == 1)
unique_gnp_metabolite = unique_gnp_metabolite[unique_gnp_metabolite].index

In [125]:
# feature['level2_gnps'].isin(unique_gnp_metabolite)
metabolite_features = feature.query('level2_gnps in @unique_gnp_metabolite').drop(['obs_mz', 'obs_rt', 'level_1_identification', '#featureID'], axis=1).set_index('level2_gnps').T


In [126]:
metabolite_features.to_csv('metabolite_unique_gnp_annotated_HaddadOSA.csv')

Map metabolites GNPS annotation to HMDB and KEGG ids:

In [2]:
metabolite_features = pd.read_csv('metabolite_unique_gnp_annotated_HaddadOSA.csv', index_col='Unnamed: 0')

In [12]:
# pd.DataFrame({'GNPS indexes': metabolite_features.columns}).to_csv('gnps_indexes_to_hmdb_ids.csv', index=False)

In [4]:
gnps_indexes = pd.read_csv('gnps_indexes_to_hmdb_ids.csv')

In [17]:
import re
# gnps_indexes['Spectral Match to (.*) from NIST14'] = gnps_indexes['GNPS indexes'].str.extract(r'Spectral Match to (.*) from NIST14')
# gnps_indexes.to_csv('gnps_indexes_to_hmdb_ids.csv', index=False)
# with open ('GNPs_spectral_matches_names.txt', 'w') as f:
#     for ind in gnps_indexes['Spectral Match to (.*) from NIST14'].dropna().to_list():
#         f.write(f"{ind}\n")

In [7]:
gnps_indexes.head()

In [8]:
metabolite_features.head()

In [12]:
gnps_to_hmdb_ids = gnps_indexes.set_index('GNPS indexes')['HMDB id']

In [15]:
print(f"There are {gnps_to_hmdb_ids.dropna().shape[0]} metabolites mapped from GNPS to HMDB, out of {gnps_to_hmdb_ids.shape[0]}")

Create the HMDB-id metabolite data, and revisit the statistical analysis with this annotation. Then save both the significant pairs and the coefficients as is (without filtering).

In [20]:
metabolite_features.loc[:, gnps_to_hmdb_ids.dropna().index].rename(columns=gnps_to_hmdb_ids.dropna()).to_csv('metabolite_HMDB_annotated_HaddadOSA.csv')

Process metabolite features (according to the article):

Keep only the control samples from both the metabolite and the taxonomy tables, using the metadata:

In [132]:
control_samples = metadata[metadata.control].index

In [138]:
metabolite_features_control = metabolite_features.loc[control_samples, :]
taxa_features_control = taxa.loc[control_samples, :]

Correlation test:
For each (metabolite, genus) pair independently, calculate the correlation across samples.
(Should I apply a log transform on top of the PQN normalization? I think not, but verify with Efrat.)
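
A possible shape for this pairwise test is sketched below. The choice of a rank (Spearman-style) correlation is an assumption, since the notes do not fix the correlation measure; the function name `cross_correlation` and the toy tables are made up. Working on ranks also makes the result insensitive to a monotone transform such as a log.

```python
import numpy as np
import pandas as pd

def cross_correlation(metabolites, taxa_table):
    """Rank correlation for every (metabolite, taxon) pair across shared samples."""
    shared = metabolites.index.intersection(taxa_table.index)
    # Pearson correlation computed on ranks is the Spearman correlation.
    m_ranks = metabolites.loc[shared].rank()
    t_ranks = taxa_table.loc[shared].rank()
    # One column of taxon correlations per metabolite, then metabolites as rows.
    return pd.DataFrame({m: t_ranks.corrwith(m_ranks[m]) for m in m_ranks.columns}).T

# Toy tables (made up): samples as rows, features as columns.
samples = ['s1', 's2', 's3', 's4']
toy_metabolites = pd.DataFrame({'m1': [1.0, 2.0, 3.0, 4.0], 'm2': [4.0, 3.0, 2.0, 1.0]}, index=samples)
toy_taxa = pd.DataFrame({'t1': [1.0, 10.0, 100.0, 1000.0], 't2': [1.0, 3.0, 2.0, 4.0]}, index=samples)

toy_corr = cross_correlation(toy_metabolites, toy_taxa)
print(toy_corr)
```

The result has metabolites as rows and taxa as columns, so the unfiltered coefficient table can be saved directly and significance filtering applied afterwards.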

Note: need to discuss with Elhanhan whether to drop the unknown taxa or not, since I'm working with different hosts (human / mice) and the unknown taxa in mice may be meaningful.

Prep taxa: calculate relative abundance; consider whether to drop unknown taxa or not

In [None]:
DROP_UNKNOWN_TAXA = False
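
One way this preparation step might look is sketched below on a made-up count table. The function name `to_relative_abundance` and the rule for recognising unknown taxa (a column name containing "Unknown") are assumptions; the real taxa table may label them differently.

```python
import numpy as np
import pandas as pd

def to_relative_abundance(counts, drop_unknown=False, unknown_label='Unknown'):
    """Convert a samples x taxa count table to per-sample relative abundance."""
    if drop_unknown:
        # Assumption: unknown taxa are columns whose name contains `unknown_label`.
        keep = [c for c in counts.columns if unknown_label.lower() not in str(c).lower()]
        counts = counts[keep]
    # Divide each row (sample) by its total so every row sums to 1.
    return counts.div(counts.sum(axis=1), axis=0)

# Toy counts (made up).
toy_counts = pd.DataFrame(
    [[10, 30, 60], [5, 5, 10]],
    index=['s1', 's2'], columns=['Bacteroides', 'Lactobacillus', 'Unknown'],
)
rel_all = to_relative_abundance(toy_counts, drop_unknown=False)
rel_known = to_relative_abundance(toy_counts, drop_unknown=True)
print(rel_all)
print(rel_known)
```

Note that dropping the unknown columns before dividing renormalizes the remaining taxa to sum to 1, which changes their values; dropping after dividing would keep the original proportions instead.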



In [140]:
taxa