# GC-MS Data Processing
## Automated Compound Identification Using Spectral Matching

This notebook demonstrates how to process low-resolution GC-MS data using CoreMS with the MetabRef database for automated compound identification.

### Workflow Overview
1. Load GC-MS data from NetCDF files
2. Process chromatograms
3. Perform retention index calibration using FAMES standards
4. Conduct spectral matching against the [MetabRef library](https://metabref.emsl.pnnl.gov/)
5. Export results

### Data Format
This tutorial uses ANDI NetCDF format GC-MS data files.

## Import Required Modules

In [1]:
from corems.mass_spectra.calc.GC_RI_Calibration import get_rt_ri_pairs
from corems.mass_spectra.input.andiNetCDF import ReadAndiNetCDF
from corems.molecular_id.search.compoundSearch import LowResMassSpectralMatch
from corems.molecular_id.search.database_interfaces import MetabRefGCInterface

## Initialize MetabRef Interface

MetabRef is a reference database for metabolomics. We'll use it for both FAMES calibration and compound identification.

In [2]:
def get_metabref_library():
    """Initialize MetabRef interface and pull compound library"""
    metabref = MetabRefGCInterface()
    return metabref.get_library(format="sql")

def get_metabref_fames():
    """Initialize MetabRef interface and pull FAMES calibration data"""
    metabref = MetabRefGCInterface()
    return metabref.get_fames(format="sql")

## Load and Process GC-MS Data

The helper below reads an ANDI NetCDF file and returns a CoreMS GC-MS object.

In [3]:
def load_gcms_data(filepath):
    """Load GC-MS data from NetCDF file"""
    # Initialize reader
    reader_gcms = ReadAndiNetCDF(filepath)
    
    # Run reader
    reader_gcms.run()
    
    # Get GCMS object
    gcms = reader_gcms.get_gcms_obj()
    
    return gcms

## Example: Process a Single GC-MS File

In [4]:
# Example file path - update this to your data file location
file_path = "../../tests/tests_data/gcms/GCMS_FAMES_01_GCMS-01_20191023.cdf"

# Load the data
gcms = load_gcms_data(file_path)

# Process chromatogram
gcms.process_chromatogram()

# After process_chromatogram(), len(gcms) counts detected GC peaks, not raw scans
print(f"Loaded GC-MS data with {len(gcms)} peaks")
print(f"Retention time range: {gcms.retention_time.min():.2f} - {gcms.retention_time.max():.2f} min")

Loaded GC-MS data with 120 peaks
Retention time range: 6.09 - 37.49 min


## Retention Index Calibration

Use FAMES (Fatty Acid Methyl Esters) standards to calibrate retention indices.
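
The idea behind this calibration can be illustrated with a small stand-alone sketch: given (retention time, retention index) pairs from the standards, a sample peak's retention index is estimated by linear interpolation between the two bracketing standards. The function name `interpolate_ri` and the pair values below are hypothetical, and this is not CoreMS's own implementation, which may handle fitting and extrapolation differently.

```python
from bisect import bisect_right


def interpolate_ri(rt, rt_ri_pairs):
    """Estimate a retention index for `rt` by linear interpolation.

    `rt_ri_pairs` is a list of (retention_time, retention_index) tuples.
    Retention times outside the calibrated range return None instead of
    being extrapolated (a choice made for this sketch only).
    """
    pairs = sorted(rt_ri_pairs)
    rts = [p[0] for p in pairs]
    if not pairs or rt < rts[0] or rt > rts[-1]:
        return None
    # Index of the first standard eluting strictly after `rt`
    hi = bisect_right(rts, rt)
    if hi == len(pairs):  # rt equals the last standard's retention time
        return pairs[-1][1]
    (rt_lo, ri_lo), (rt_hi, ri_hi) = pairs[hi - 1], pairs[hi]
    fraction = (rt - rt_lo) / (rt_hi - rt_lo)
    return ri_lo + fraction * (ri_hi - ri_lo)


# Hypothetical calibration pairs (illustrative values, not taken from MetabRef)
example_pairs = [(5.0, 1000.0), (10.0, 1500.0), (15.0, 2000.0)]
print(interpolate_ri(7.5, example_pairs))  # halfway between 1000 and 1500 -> 1250.0
```

Returning `None` outside the calibrated range keeps the sketch conservative; a production calibration may instead extrapolate for peaks that elute before the first or after the last standard.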

In [5]:
# Path to FAMES calibration file
# (in this example it is the same FAMES run that was loaded as the sample above)
fames_file_path = "../../tests/tests_data/gcms/GCMS_FAMES_01_GCMS-01_20191023.cdf"

# Load FAMES calibration data
gcms_fames = load_gcms_data(fames_file_path)

# Get FAMES reference from MetabRef
fames_sql_obj = get_metabref_fames()

# Calculate retention time to retention index pairs
rt_ri_pairs = get_rt_ri_pairs(gcms_fames, sql_obj=fames_sql_obj)

print(f"Found {len(rt_ri_pairs)} RT-RI calibration pairs")

# Apply calibration to your sample data
gcms.calibrate_ri(rt_ri_pairs, fames_file_path)
print("Retention index calibration completed")

100%|██████████| 120/120 [00:00<00:00, 410.97it/s]

Methyl Caprylate 7.763233333333333 0.9930198089008349
Methyl Caprate 10.571850000000001 0.9925354638475408
Methyl Laurate 13.180316666666666 0.9906994483516044
Methyl Myristate 15.538566666666666 0.9917784695342875
Methyl Palmitate 17.483966666666667 0.5587411960800629
Methyl Palmitate 17.68413333333333 0.989611757957125
Methyl Stearate 19.62953333333333 0.9862334799629703
Methyl Eicosanoate 21.412300000000002 0.9859390225094037
Methyl Docosanoate 22.876033333333332 0.6026944091867165
Methyl Docosanoate 23.05118333333333 0.9869119400663681
Methyl Hexacosanoate 25.991183333333332 0.9782652700959655
Methyl Octacosanoate 27.3173 0.9649357607738421
Found 10 RT-RI calibration pairs
Retention index calibration completed
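
The log above lists 12 candidate peaks but reports 10 calibration pairs: Methyl Palmitate and Methyl Docosanoate each appear twice. One plausible way to resolve such duplicates is to keep the highest-scoring candidate per compound, sketched below. This assumes the third column of the log is a match score; whether CoreMS resolves duplicates exactly this way is also an assumption, and `best_candidate_per_compound` is a hypothetical helper.

```python
# (compound name, retention time in min, score) copied from the log above,
# rounded to three decimals for brevity
candidates = [
    ("Methyl Caprylate", 7.763, 0.993),
    ("Methyl Caprate", 10.572, 0.993),
    ("Methyl Laurate", 13.180, 0.991),
    ("Methyl Myristate", 15.539, 0.992),
    ("Methyl Palmitate", 17.484, 0.559),
    ("Methyl Palmitate", 17.684, 0.990),
    ("Methyl Stearate", 19.630, 0.986),
    ("Methyl Eicosanoate", 21.412, 0.986),
    ("Methyl Docosanoate", 22.876, 0.603),
    ("Methyl Docosanoate", 23.051, 0.987),
    ("Methyl Hexacosanoate", 25.991, 0.978),
    ("Methyl Octacosanoate", 27.317, 0.965),
]


def best_candidate_per_compound(rows):
    """Keep, for each compound name, the (rt, score) with the highest score."""
    best = {}
    for name, rt, score in rows:
        if name not in best or score > best[name][1]:
            best[name] = (rt, score)
    return best


best = best_candidate_per_compound(candidates)
print(len(best))                 # number of unique standards
print(best["Methyl Palmitate"])  # the higher-scoring of the two candidates
```

Under these assumptions the 12 candidates collapse to 10 unique standards, which agrees with the count printed by the notebook.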





## Compound Identification

Use the MetabRef library to identify compounds by spectral matching.

In [6]:
# Get MetabRef compound library
metabref_library = get_metabref_library()

# Initialize spectral match search
lowResSearch = LowResMassSpectralMatch(gcms, sql_obj=metabref_library)

# Run spectral matching
lowResSearch.run()

print("Compound identification completed")

100%|██████████| 120/120 [00:13<00:00,  8.84it/s]

Compound identification completed
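
Low-resolution spectral matching generally compares each peak's nominal-mass spectrum against library spectra using a similarity measure. As a minimal sketch of one common measure, the block below computes cosine similarity between two spectra stored as `{m/z: intensity}` dictionaries. The function name and the example spectra are hypothetical, and the scoring that `LowResMassSpectralMatch` actually applies may differ.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine similarity between two spectra given as {nominal m/z: intensity}."""
    # Only m/z values present in both spectra contribute to the dot product
    shared = set(spectrum_a) & set(spectrum_b)
    dot = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or all-zero spectrum matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical spectra with made-up intensities
query = {73: 100.0, 147: 40.0, 217: 10.0}
reference = {73: 90.0, 147: 50.0, 205: 5.0}

score = cosine_similarity(query, reference)
print(f"Cosine similarity: {score:.3f}")
```

A score of 1 means the two spectra have identical relative intensities, and 0 means they share no ions.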





## Export Results

In [7]:
# Export to CSV
output_filename = "example_gcms_results"
gcms.to_csv(output_filename)
print(f"Results exported to {output_filename}.csv")

# Export to HDF5
gcms.to_hdf()
print(f"Results exported to {output_filename}.hdf5")

# Return results as a DataFrame
results_df = gcms.to_dataframe()
print("Results returned as a DataFrame")
results_df.head()



Results exported to example_gcms_results.csv
Results exported to example_gcms_results.hdf5
Results returned as a DataFrame


| | Sample name | Peak Index | Retention Time | Retention Time Ref | Peak Height | Peak Area | Retention index | Retention index Ref | Retention Index Score | Similarity Score | ... | Chebi ID | Kegg Compound ID | Inchi | Inchi Key | Smiles | Molecular Formula | IUPAC Name | Traditional Name | Common Name | Derivatization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GCMS_FAMES_01_GCMS-01_20191023 | 0 | 6.20565 | | 439551.942857 | 3825729.2 | 689.085374 | 695.33 | 0.114588 | 0.285262 | ... | 58251 | C01026 | InChI=1S/C4H9NO2/c1-5(2)3-4(6)7/h3H2,1-2H3,(H,... | FFDGPVCHZBVARC-UHFFFAOYSA-N | CN(C)CC(=O)O | C4H9NO2 | 2-(dimethylamino)acetic acid | | | None:None:None |
| 1 | GCMS_FAMES_01_GCMS-01_20191023 | 0 | 6.20565 | | 439551.942857 | 3825729.2 | 689.085374 | 698.96 | 0.00444 | 0.103267 | ... | 17967 | C01537 | InChI=1S/C3H7NO2/c1-2-6-3(4)5/h2H2,1H3,(H2,4,5) | JOYRKODLDBILNP-UHFFFAOYSA-N | CCOC(=O)N | C3H7NO2 | ethyl carbamate | | | None:None:None |
| 2 | GCMS_FAMES_01_GCMS-01_20191023 | 0 | 6.20565 | | 439551.942857 | 3825729.2 | 689.085374 | 695.9 | 0.075778 | 0.090187 | ... | 141702 | | InChI=1S/C8H10O/c1-7-5-3-4-6-8(7)9-2/h3-6H,1-2H3 | DTFKRVXLBCAIOZ-UHFFFAOYSA-N | CC1=CC=CC=C1OC | C8H10O | 1-methoxy-2-methylbenzene | | | None:None:None |
| 3 | GCMS_FAMES_01_GCMS-01_20191023 | 0 | 6.20565 | | 439551.942857 | 3825729.2 | 689.085374 | 698.06 | 0.011394 | 0.070765 | ... | 28579 | C06593 | InChI=1S/C6H11NO/c8-6-4-2-1-3-5-7-6/h1-5H2,(H,... | JBKVHLHDHHXQEQ-UHFFFAOYSA-N | C1CCC(=O)NCC1 | C6H11NO | azepan-2-one | | | None:None:None |
| 4 | GCMS_FAMES_01_GCMS-01_20191023 | 0 | 6.20565 | | 439551.942857 | 3825729.2 | 689.085374 | 678.1 | 0.001226 | 0.050868 | ... | 16997 | C00583 | InChI=1S/C3H8O2/c1-3(5)2-4/h3-5H,2H2,1H3 | DNIAPMSPPWPWGF-UHFFFAOYSA-N | CC(CO)O | C3H8O2 | propane-1,2-diol | | | None:None:None |
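
The exported table holds several candidate matches per peak: all five rows shown above share Peak Index 0. A common follow-up step is to keep only the top-scoring candidate for each peak, optionally discarding weak matches. The sketch below does this with plain dictionaries, using a subset of the columns and values from the rows above; `top_hit_per_peak` and its `min_score` threshold are hypothetical and not part of CoreMS.

```python
# Subset of columns from the five result rows shown above
rows = [
    {"Peak Index": 0, "Similarity Score": 0.285262, "IUPAC Name": "2-(dimethylamino)acetic acid"},
    {"Peak Index": 0, "Similarity Score": 0.103267, "IUPAC Name": "ethyl carbamate"},
    {"Peak Index": 0, "Similarity Score": 0.090187, "IUPAC Name": "1-methoxy-2-methylbenzene"},
    {"Peak Index": 0, "Similarity Score": 0.070765, "IUPAC Name": "azepan-2-one"},
    {"Peak Index": 0, "Similarity Score": 0.050868, "IUPAC Name": "propane-1,2-diol"},
]


def top_hit_per_peak(candidates, score_key="Similarity Score", min_score=0.0):
    """Return {peak index: best-scoring row}, skipping rows below `min_score`."""
    top = {}
    for row in candidates:
        if row[score_key] < min_score:
            continue
        peak = row["Peak Index"]
        if peak not in top or row[score_key] > top[peak][score_key]:
            top[peak] = row
    return top


top = top_hit_per_peak(rows)
print(top[0]["IUPAC Name"], top[0]["Similarity Score"])
```

With `results_df` itself, a similar result could likely be obtained by sorting on `Similarity Score` with `sort_values` and then calling `drop_duplicates` on `Peak Index`. Note that the best score for this peak is low (about 0.29), so a threshold such as `min_score=0.5` would leave the peak unidentified.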


## Summary

This workflow demonstrates:
1. Loading GC-MS data from NetCDF files
2. Processing chromatograms
3. Retention index calibration using FAMES standards
4. Compound identification using the MetabRef spectral library
5. Exporting results