Data preparation for Haddad OSA article


In [1]:
import pandas as pd
import numpy as np

Taxonomic feature table: Samples x Taxonomic species

In [3]:
taxa = pd.read_csv('data/taxonomic_observed_abundance_HaddadOSA.csv')
taxa = taxa.set_index('#SampleID')

In [4]:
taxa.head()

#SampleID,14-2,1XD42-69,1XD8-76,Acetatifactor,Acetoanaerobium,Achromobacter,Acidovorax,Acinetobacter,Actinomycetospora,Acutalibacter,...,UBA7109,UBA8904,UBA964,UTPLA1,Unknown,UphvI-Ar2,Usitatibacter,Variovorax,Ventrimonas,XBB1006
10422.17.F.10,9.0,3.0,1.0,15.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1267.0,0.0,0.0,0.0,0.0,0.0
10422.17.F.11,15.0,2.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,32.0,...,10.0,0.0,0.0,0.0,1131.0,0.0,0.0,0.0,1.0,0.0
10422.17.F.12,16.0,4.0,2.0,11.0,0.0,0.0,0.0,0.0,0.0,17.0,...,12.0,0.0,0.0,0.0,1106.0,0.0,0.0,0.0,1.0,0.0
10422.17.F.13,38.0,13.0,3.0,10.0,0.0,0.0,0.0,0.0,0.0,22.0,...,3.0,0.0,0.0,0.0,938.0,0.0,0.0,0.0,0.0,0.0
10422.17.F.3,5.0,4.0,0.0,19.0,0.0,0.0,0.0,0.0,0.0,10.0,...,2.0,0.0,0.0,0.0,948.0,0.0,0.0,0.0,4.0,0.0


Metadata:

In [9]:
metadata = pd.read_csv('original/haddad_6weeks_metadata_matched.txt', sep='\t')

Metadata fields we care about:
* age / collection_timestamp for the time-series data. It may be better to extract the collection serial number from the Description field instead.
* exposure_type (IHH exposure), which distinguishes the control group from the study group: IHH is the study group, Air is the control group.
* host_subject_id / mouse_number
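
As a minimal sketch of the first two bullets, the collection serial number can be pulled out of a free-text description with a regular expression, and the exposure type can be mapped to a boolean control flag. The sample ids and the wording of the description strings below are made up for illustration; the real Description format may differ.

```python
import pandas as pd

# Hypothetical metadata rows; the Description wording is an assumption.
toy_metadata = pd.DataFrame(
    {
        "Description": [
            "fecal collection 1 of 13",
            "fecal collection 2 of 13",
            "fecal collection 10 of 13",
        ],
        "exposure_type": ["IHH", "Air", "IHH"],
    },
    index=["sample.a", "sample.b", "sample.c"],
)

# expand=False returns a Series when the pattern has a single capture group.
toy_serial = (
    toy_metadata["Description"]
    .str.extract(r"collection (\d+) of", expand=False)
    .astype(int)
)

# Air-exposed mice are the controls, IHH-exposed mice are the study group.
toy_control = toy_metadata["exposure_type"].map({"IHH": False, "Air": True})

print(toy_serial.tolist())   # [1, 2, 10]
print(toy_control.tolist())  # [False, True, False]
```

Casting the serial number to an integer matters for sorting: as strings, "10" would sort before "2".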


Some information about the experiment:  
There should be 16 mice, 8 of them controls.  
Each mouse should have 3 timestamped samples (feces).
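
Expectations like these can be checked against the metadata once it is loaded. The sketch below uses a toy table with two hypothetical mice; the column names `host_subject_id`, `control` and `seq_sample_number` follow the cleaned metadata built later in this notebook, and everything else is illustrative.

```python
import pandas as pd

# Toy metadata: two hypothetical mice, one per group, three samples each.
toy = pd.DataFrame(
    {
        "host_subject_id": ["Mouse A"] * 3 + ["Mouse B"] * 3,
        "control": [True] * 3 + [False] * 3,
        "seq_sample_number": [1, 2, 3, 1, 2, 3],
    }
)

# One row per mouse: its group and how many distinct collections it has.
per_mouse = toy.groupby("host_subject_id").agg(
    control=("control", "first"),
    n_samples=("seq_sample_number", "nunique"),
)

n_mice = len(per_mouse)
n_control_mice = int(per_mouse["control"].sum())

print(n_mice, n_control_mice)           # 2 1
print(per_mouse["n_samples"].tolist())  # [3, 3]
```

Running the same aggregation on the real table would show whether the number of mice, the number of controls and the number of samples per mouse match the description above.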
 

In [10]:
metadata['exposure_type'].unique()

array(['IHH', 'Air'], dtype=object)

In [11]:
metadata['host_subject_id'].unique()

array(['Mouse 17', 'Mouse 18', 'Mouse 19', 'Mouse 20', 'Mouse 21',
       'Mouse 22', 'Mouse 23', 'Mouse 24', 'Mouse 25', 'Mouse 26',
       'Mouse 27', 'Mouse 28', 'Mouse 29', 'Mouse 30', 'Mouse 31',
       'Mouse 32'], dtype=object)

In [12]:
metadata = metadata.set_index('#SampleID')

In [13]:
metadata = pd.concat([
    metadata['host_subject_id'].to_frame(),
    metadata['Description'].str.extract(r'.*collection (\d+) of .*').squeeze().to_frame('seq_sample_number'),
    metadata['exposure_type'].map({'IHH': False, 'Air': True}).to_frame('control'),
], axis=1)

In [14]:
metadata['seq_sample_number'] = metadata['seq_sample_number'].astype(int)

Sample statistics from the metadata:

In [15]:
# How many control samples are there?
print(f'There are {metadata[metadata["control"]].shape[0]} samples under control (including multiple timestamps per mouse)')

There are 90 samples under control (including multiple timestamps per mouse)


In [16]:
print(f"There are {metadata.groupby('host_subject_id').apply(lambda x: x.sort_values(by='seq_sample_number', ascending=True).iloc[0]).query('control').shape[0]} under control, when taking only the first sample")

There are 8 under control, when taking only the first sample


In [18]:
metadata.to_csv('relevant_metadata_haddad_osa.csv')

Metabolites:

Load the metabolite feature table (molecule annotations). In the raw file, rows are metabolite features and columns are samples.

Note that this data has already been PQN-normalized, as was done in the article:  
probabilistic quotient normalization to an internal standard (m/z 278.189; retention time, 3.81 min) for subsequent analysis: https://pubmed.ncbi.nlm.nih.gov/16808434/
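
For orientation, here is a minimal sketch of generic probabilistic quotient normalization on a toy table. It uses the per-feature median across samples as the reference profile, which is a common default; it does not reproduce the article's normalization to an internal standard, and the function name `pqn_normalize` is hypothetical.

```python
import numpy as np
import pandas as pd


def pqn_normalize(intensities: pd.DataFrame) -> pd.DataFrame:
    """Generic PQN sketch. Rows are samples, columns are features.

    Assumes strictly positive reference values; zeros would need handling.
    """
    # Reference profile: median intensity of each feature across samples.
    reference = intensities.median(axis=0)
    # Quotient of every feature against the reference, per sample.
    quotients = intensities.div(reference, axis=1)
    # The median quotient of a sample estimates its dilution factor.
    dilution = quotients.median(axis=1)
    return intensities.div(dilution, axis=0)


toy = pd.DataFrame(
    [
        [10.0, 20.0, 30.0],
        [20.0, 40.0, 60.0],  # same profile, twice as concentrated
        [5.0, 10.0, 15.0],   # same profile, half as concentrated
    ],
    index=["s1", "s2", "s3"],
    columns=["f1", "f2", "f3"],
)

normalized = pqn_normalize(toy)
print(normalized)  # every row becomes 10, 20, 30
```

Since the three toy samples differ only by a dilution factor, the normalization maps them onto the same profile.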

Metabolite features were annotated in the OSA article using two methods:  
"Standard identification", stored in the "level_1_identification" column: very coarse, with 12 unique metabolite labels and many compounds mapped to the same label. This "standard method" uses theoretical matching to obs_mz/obs_rt.  
GNPS, stored in the "level2_gnps" column: more descriptive and fine-grained. The GNPS method is algorithmic and data-dependent.  
We choose to work with the GNPS labels as the "compound" (naming convention from Efrat's work on humans) in this article.  
 

Using GNPS we have 201 distinct identified untargeted metabolites, out of 1710 metabolite features (i.e. most features are not identified).

As many metabolite features can be mapped to the same GNPS label, we group those features and sum them.
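
The group-and-sum step can be illustrated on a toy table. The column name `level2_gnps` and the "none" placeholder for unannotated features follow the real file; the sample names and intensities are made up.

```python
import pandas as pd

# Toy feature table: one row per feature, one column per sample.
toy_features = pd.DataFrame(
    {
        "sample_1": [1.0, 2.0, 4.0, 8.0],
        "sample_2": [10.0, 20.0, 40.0, 80.0],
        "level2_gnps": ["Cholic acid", "Cholic acid", "none", "sucrose"],
    }
)

# Drop unannotated features, then sum all features sharing a label.
annotated = toy_features[toy_features["level2_gnps"] != "none"]
summed = annotated.groupby("level2_gnps").sum()

print(summed)
# The two "Cholic acid" features collapse into one row: 3.0 and 30.0.
```

Summing treats features with the same label as parts of one compound signal; taking the maximum or the mean would be alternative choices with different interpretations.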


In [63]:
metabolite_features = pd.read_table('original/haddad_6weeks_allFeatures_pqn_matched_wIDs.txt', sep='\t')

In [64]:
metabolite_features.shape

(1710, 187)

In [65]:
metabolite_features.head()

Unnamed: 0,#featureID,10422.17.F.10,10422.17.F.11,10422.17.F.12,10422.17.F.13,10422.17.F.3,10422.17.F.4,10422.17.F.5,10422.17.F.6,10422.17.F.7,...,10422.32.F.4,10422.32.F.5,10422.32.F.6,10422.32.F.7,10422.32.F.8,10422.32.F.9,obs_mz,obs_rt,level2_gnps,level_1_identification
0,132.1020923761221_0.4491537660256411,417818400.0,599176700.0,271010900.0,374372800.0,394579100.0,237869700.0,606701800.0,148560700.0,422735400.0,...,278835800.0,591675900.0,572151900.0,439901100.0,304919200.0,721050700.0,132.102092,26.949226,none,none
1,166.0862755041856_0.5858793269230769,363478000.0,651303200.0,223933400.0,367612100.0,363360500.0,223205800.0,596553000.0,99965640.0,363523900.0,...,276344600.0,601801800.0,609614800.0,478803400.0,266983800.0,959568000.0,166.086276,35.15276,Massbank:PB006064 Phenylalanine|(2S)-2-amino-3...,none
2,357.2784138555112_5.010000161030595,373010100.0,435300000.0,435849900.0,392079800.0,479809500.0,604022200.0,450688400.0,338018700.0,351623200.0,...,422084600.0,330066900.0,279712500.0,374946200.0,210245100.0,492174300.0,357.278414,300.60001,Spectral Match to 3.beta.-Hydroxy-5-cholenoic ...,Deoxycholic acid
3,104.1070008277893_0.3359217147435899,530218700.0,526654100.0,220650000.0,490423500.0,386665400.0,231921000.0,415931300.0,351438300.0,424729200.0,...,370002800.0,442259500.0,418912300.0,423527700.0,299363200.0,588373200.0,104.107001,20.155303,none,none
4,373.2735011621581_3.9791876602564105,350478300.0,710333500.0,585270000.0,600715600.0,419050700.0,337573300.0,535634900.0,465673800.0,505418100.0,...,577675200.0,526527000.0,632355800.0,707208100.0,304817200.0,734383700.0,373.273501,238.75126,Spectral Match to Cholic acid from NIST14,b-Muricholic acid


In [67]:
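# Keep only GNPS-annotated features, drop non-sample columns, and sum features sharing a GNPS label
gnp_metabolite = (
    metabolite_features[metabolite_features['level2_gnps'] != 'none']
    .drop(columns=['#featureID', 'level_1_identification', 'obs_mz', 'obs_rt'])
    .groupby('level2_gnps')
    .sum()
)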


In [68]:
gnp_metabolite.shape

(201, 182)

In [69]:
gnp_metabolite = gnp_metabolite.T

In [146]:
gnp_metabolite.head()

level2_gnps,()-Myristoylcarnitine,(S)-Equol,0498_Daidzein,"1,2,3,4-tetrahydroharmane-3-carboxylic acid",1-(9Z-Octadecenoyl)-sn-glycero-3-phosphocholine,1-Stearoyl-2-hydroxy-sn-glycero-3-phosphocholine,1-Stearoyl-2-hydroxy-sn-glycero-3-phosphoethanolamine,11Z-hexadecenoic acid,"13S-Hydroxy-9Z,11E,15Z-octadecatrienoic acid",1OPalmitoyl2Oacetylsnglycero3phosphorylcholine,...,Ursodiol,Val Ile Ser,Val Ile Thr,Val Leu Ile,Val Pro Val,Val Tyr,Val Val Leu,cis-7-Hexadecenoic acid methyl ester,none_16904_dereplictor_pv_3.86203e-22,sucrose
10422.17.F.10,315964900.0,392290900.0,411811000.0,142431900.0,512618400.0,89690070.0,419876800.0,56480330.0,1184549000.0,1108540000.0,...,552866700.0,218180800.0,347183100.0,153483500.0,182182200.0,140729100.0,269380200.0,125653700.0,0.0,1000284000.0
10422.17.F.11,21260230.0,186705900.0,0.0,449613500.0,172769500.0,201981200.0,31417540.0,104628400.0,1934521000.0,290116900.0,...,1452149000.0,930788500.0,3394768000.0,841990000.0,665894000.0,634923700.0,1361782000.0,205995000.0,331944300.0,1110240000.0
10422.17.F.12,233657700.0,395400400.0,215965500.0,115978400.0,271484400.0,293963000.0,304757600.0,125114600.0,1904190000.0,112720900.0,...,939014700.0,371545500.0,1247775000.0,438339300.0,37229120.0,139373200.0,32970860000.0,270968700.0,695911000.0,2403973000.0
10422.17.F.13,250200100.0,85837490.0,304380700.0,789607900.0,698341800.0,660137700.0,527940400.0,230484100.0,2999902000.0,5924575000.0,...,1072192000.0,1465024000.0,2544299000.0,809160100.0,1091217000.0,1016178000.0,1253986000.0,542060700.0,3120713000.0,7512926000.0
10422.17.F.3,1676555000.0,15247040.0,484880000.0,1208026000.0,1179622000.0,2965064000.0,2442140000.0,139487600.0,1313219000.0,432074200.0,...,308173200.0,968414000.0,1597777000.0,412113600.0,542828200.0,487322200.0,636513300.0,490327800.0,46377530.0,410908500.0


In [147]:
gnp_metabolite.to_csv('data/metabolite_gnp_annotated_HaddadOSA.csv')

Map the metabolites' GNPS annotations to HMDB and KEGG ids:

In [132]:
metabolite_features = pd.read_csv('data/metabolite_gnp_annotated_HaddadOSA.csv', index_col='Unnamed: 0')

In [133]:
metabolite_features = metabolite_features.stack().reset_index(drop=False).rename(columns={'level_0': 'sample_id', 'level_1':'metabolite_id' ,0:'metabolite_level'})

In [134]:
metabolite_features.head()

Unnamed: 0,sample_id,metabolite_id,metabolite_level
0,10422.17.F.10,()-Myristoylcarnitine,315964900.0
1,10422.17.F.10,(S)-Equol,392290900.0
2,10422.17.F.10,0498_Daidzein,411811000.0
3,10422.17.F.10,"1,2,3,4-tetrahydroharmane-3-carboxylic acid",142431900.0
4,10422.17.F.10,1-(9Z-Octadecenoyl)-sn-glycero-3-phosphocholine,512618400.0


Map the GNPS metabolites to HMDB

In [135]:
gnps_indexes = pd.read_csv('data/gnps_indexes_to_hmdb_ids.csv')
gnps_to_hmdb_ids = gnps_indexes.set_index('GNPS indexes')['HMDB id']


In [106]:
gnps_to_hmdb_ids.shape

(121,)

In [143]:
print(f"There are {gnps_to_hmdb_ids.dropna().shape[0]} metabolites mapped from GNPS to HMDB, {gnps_to_hmdb_ids.dropna().unique().shape[0]} unique HMDB ids, out of {gnps_to_hmdb_ids.shape[0]} GNPS ids")

There are 73 metabolites mapped from GNPS to HMDB, 71 unique HMDB ids, out of 121 GNPS ids


In [138]:
metabolite_features['metabolite_id'] = metabolite_features['metabolite_id'].map(gnps_to_hmdb_ids)

In [140]:
metabolite_features = metabolite_features.pivot_table(columns='metabolite_id', aggfunc='sum', index='sample_id', values='metabolite_level')

In [148]:
metabolite_features.to_csv('data/metabolite_HMDB_annotated_HaddadOSA.csv')
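
The mapping steps above (long format, `Series.map`, then `pivot_table` with `sum`) can be sketched on toy data. The labels and ids below are placeholders, not real GNPS labels or HMDB ids. The sketch shows two side effects worth keeping in mind: labels without a mapped id drop out of the final table, and labels sharing an id are summed.

```python
import pandas as pd

# Toy long-format table: one row per (sample, metabolite label).
long_table = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "metabolite_id": ["label_a", "label_b", "label_c"] * 2,
        "metabolite_level": [1.0, 2.0, 4.0, 10.0, 20.0, 40.0],
    }
)

# Placeholder mapping: two labels share one id, label_c has no entry.
label_to_id = pd.Series({"label_a": "ID_1", "label_b": "ID_1"})

# Labels missing from the mapping become NaN.
long_table["metabolite_id"] = long_table["metabolite_id"].map(label_to_id)

# Rows with a NaN id are dropped; rows sharing an id are summed.
wide = long_table.pivot_table(
    index="sample_id",
    columns="metabolite_id",
    values="metabolite_level",
    aggfunc="sum",
)

print(wide)
# ID_1 is 3.0 for s1 and 30.0 for s2; label_c does not appear.
```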