# Database construction for FIA-MS

This example notebook shows how to create the organism-specific database files (structure and mapping) needed to process FIA-MS data. The files are built from metabolite lists and additional information taken from a metabolic model. Here we present the steps required for models saved in three different file types: .xlsx, .json and .sbml.

In [1]:
import pandas as pd
import cobra
from collections import namedtuple
import BFAIR.FIA_MS as fia_ms



## Yeast

The yeast model was described [here](https://www.nature.com/articles/s41467-019-11581-3) and can be found on [this website](https://sysbiochalmers.github.io/yeast-GEM/).

### This needs to be streamlined

In [2]:
yeast_df = pd.read_excel("data/FIA_MS_example/database_files/yeastGEM.xlsx", sheet_name='METS', engine='openpyxl')

In [3]:
# First, we reduce the list of metabolites. They are listed for each compartment in the model but we only need them once
yeast_df_unique = yeast_df.drop_duplicates(subset='NAME', keep='first')
# Then we kick out metabolites that do not have an annotated composition/formula
yeast_df_unique = yeast_df_unique.dropna(subset=['COMPOSITION'])
# Finally, we remove metabolites whose formula contains a generic residue ("R") or an unspecified halogen ("X")
yeast_df_unique = yeast_df_unique[~yeast_df_unique['COMPOSITION'].str.contains('R')]
yeast_df_unique = yeast_df_unique[~yeast_df_unique['COMPOSITION'].str.contains('X')]
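
The substring checks above would also drop any formula containing an element whose symbol starts with R or X (for example Rb or Xe). That is unlikely to matter for a metabolic model, but a stricter check can be sketched by splitting the formula into element-like tokens first. The helper name `has_placeholder` and the example formulas are hypothetical and only serve as an illustration.

```python
import re

# One element-like token: an uppercase letter plus an optional lowercase letter.
ELEMENT_TOKEN = re.compile(r"[A-Z][a-z]?")

# Symbols assumed to mark generic residues (R) and unspecified halogens (X).
PLACEHOLDERS = {"R", "X"}


def has_placeholder(formula):
    """Return True if the formula contains a standalone R or X token."""
    return any(tok in PLACEHOLDERS for tok in ELEMENT_TOKEN.findall(formula))


# Made-up formulas: two with placeholders, two with real elements starting with R/X.
formulas = ["C6H12O6", "C10H12N5O7PR", "C2H3OX", "RbCl", "XeF2"]
kept = [f for f in formulas if not has_placeholder(f)]
print(kept)
```

With pandas, the same helper could be applied as a mask, e.g. `yeast_df_unique[~yeast_df_unique['COMPOSITION'].map(has_placeholder)]`.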

In [4]:
# Check how many metabolites and how many unique structures remain
len(yeast_df_unique), len(yeast_df_unique.COMPOSITION.unique())

(1200, 864)

In [5]:
# Reformat the dataframe so that it contains the necessary information
Metabolite = namedtuple('Metabolite', ['id', 'formula', 'charge', 'name'])
yeast_mets = [
    Metabolite(id=row['REPLACEMENT ID'], formula=row['COMPOSITION'], charge=row['CHARGE'], name=row['NAME'])
    for i, row in yeast_df_unique.iterrows()]
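
The field names of the namedtuple (`id`, `formula`, `charge`, `name`) match attribute names found on cobra `Metabolite` objects, which is presumably why the same list can stand in for `model.metabolites` later in this notebook. A minimal sketch of that idea, using made-up rows and a hypothetical `describe` function (not part of BFAIR) that relies only on attribute access:

```python
from collections import namedtuple

Metabolite = namedtuple("Metabolite", ["id", "formula", "charge", "name"])

# Hypothetical rows standing in for DataFrame rows; all values are made up.
rows = [
    {"REPLACEMENT ID": "s_0001", "COMPOSITION": "C6H12O6", "CHARGE": 0, "NAME": "example sugar"},
    {"REPLACEMENT ID": "s_0002", "COMPOSITION": "C3H3O3", "CHARGE": -1, "NAME": "example acid"},
]

mets = [
    Metabolite(id=row["REPLACEMENT ID"], formula=row["COMPOSITION"],
               charge=row["CHARGE"], name=row["NAME"])
    for row in rows
]


def describe(met):
    """Hypothetical consumer: works for any object with these four attributes."""
    return f"{met.id}\t{met.formula}\t{met.charge}\t{met.name}"


lines = [describe(m) for m in mets]
print("\n".join(lines))
```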

In [6]:
fia_ms.create_database(yeast_mets, 'yeastGEM', 'data/FIA_MS_example/database_files/CHEMISTRY')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


## E. coli

The E. coli model was described [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3261703/) and can be found on [this website](http://bigg.ucsd.edu/models/iJO1366).

In [7]:
model_coli = cobra.io.load_json_model('data/FIA_MS_example/database_files/iJO1366.json')

In [8]:
fia_ms.create_database(model_coli.metabolites, 'iJO1366', 'data/FIA_MS_example/database_files/CHEMISTRY')

## P. putida

The P. putida model was described [here](https://sfamjournals.onlinelibrary.wiley.com/doi/full/10.1111/1462-2920.14843) and can be found on [this website](http://bigg.ucsd.edu/models/iJN1463).

In [9]:
model_putida = cobra.io.load_json_model('data/FIA_MS_example/database_files/iJN1463.json')

In [11]:
fia_ms.create_database(model_putida.metabolites, 'iJN1463', 'data/FIA_MS_example/database_files/CHEMISTRY')

## Streptomyces

The Streptomyces metabolite lists were set up in-house.

In [12]:
df_mets = pd.read_excel('data/FIA_MS_example/database_files/449_2018_1900_MOESM1_ESM.xlsx', sheet_name='Metabolites', skiprows=1, engine='openpyxl')

In [13]:
mets_endo = []
mets_exo = []
for i, row in df_mets.iterrows():
    desc = row['Metabolite description']
    if 'biomass' in desc.lower() or 'Acyl_sn_glycerol_3_phosphate_C18.925H37.908O7P' == desc:
        continue
    if '.' in row['Metabolite formula']:
        continue
    m = Metabolite(
        id=row['Metabolite name'], 
        formula=row['Metabolite formula'], 
        charge=0, 
        name=row['Metabolite description'],
    )
    if m.id[-3:] == '[e]':
        mets_exo.append(m)
    else:
        mets_endo.append(m)
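
The loop above combines two decisions: whether a row is usable at all, and which compartment list it belongs to. As a sketch, the two can be separated into small functions. The helper names `is_usable` and `split_by_compartment` and the example records are hypothetical, and the special-cased acyl-glycerol entry from the loop is left out for brevity.

```python
from collections import namedtuple

Metabolite = namedtuple("Metabolite", ["id", "formula", "charge", "name"])


def is_usable(description, formula):
    """Skip biomass pseudo-metabolites and formulas with fractional stoichiometry."""
    if "biomass" in description.lower():
        return False
    return "." not in formula


def split_by_compartment(mets, suffix="[e]"):
    """Return (endo, exo), where exo holds the ids ending in the given suffix."""
    endo, exo = [], []
    for met in mets:
        (exo if met.id.endswith(suffix) else endo).append(met)
    return endo, exo


# Made-up records: (id, formula, description)
candidates = [
    ("glc[c]", "C6H12O6", "Glucose"),
    ("glc[e]", "C6H12O6", "Glucose"),
    ("bm[c]", "C1H2", "Biomass component"),
    ("lipid[c]", "C18.9H37.9O7P", "Averaged lipid"),
]
mets = [
    Metabolite(id=i, formula=f, charge=0, name=d)
    for i, f, d in candidates
    if is_usable(d, f)
]
endo, exo = split_by_compartment(mets)
print(len(endo), len(exo))
```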

In [14]:
len(mets_endo), len(mets_exo)

(950, 198)

In [15]:
df_secondary = pd.read_excel('data/FIA_MS_example/database_files/secondary_collinus.xlsx', engine='openpyxl')

In [16]:
for i, row in df_secondary.iterrows():
    if isinstance(row['Formula'], float):
        continue
    m = Metabolite(
        id=row['Name'].replace(' ', ''), 
        formula=row['Formula'], 
        charge=0, 
        name=row['Name'],
    )
    mets_exo.append(m)
    mets_endo.append(m)
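
The `isinstance(row['Formula'], float)` test works because pandas represents an empty cell as NaN, which is a float. A slightly more explicit variant, sketched here with a hypothetical helper `has_formula`, accepts only non-empty strings:

```python
def has_formula(value):
    """Return True only for non-empty string formulas."""
    return isinstance(value, str) and value.strip() != ""


# float("nan") stands in for the value pandas uses for an empty cell.
cells = ["C20H30O5", float("nan"), "", "C15H10O4"]
usable = [c for c in cells if has_formula(c)]
print(usable)
```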

In [17]:
len(mets_endo), len(mets_exo)

(973, 221)

In [18]:
fia_ms.create_database(mets_endo, 'streptomyces_endo', 'data/FIA_MS_example/database_files/CHEMISTRY')
fia_ms.create_database(mets_exo, 'streptomyces_exo', 'data/FIA_MS_example/database_files/CHEMISTRY')

## C. elegans

The C. elegans model was described [here](https://doi.org/10.3389/fmolb.2019.00002) and can be found on [this website](https://figshare.com/articles/dataset/Data_Sheet_2_Multi-Omics_and_Genome-Scale_Modeling_Reveal_a_Metabolic_Shift_During_C_elegans_Aging_ZIP/7679876/1).

In [19]:
model_celegans = cobra.io.read_sbml_model('data/FIA_MS_example/database_files/wormjam-20180125.sbml')

'' is not a valid SBML 'SId'.


In [20]:
fia_ms.create_database(model_celegans.metabolites, 'Celegans', 'data/FIA_MS_example/database_files/CHEMISTRY')

## Human

The human database we're using is based on [this publication](https://doi.org/10.1093/nar/gkx1089), but it was prepared in a different way.