# Database

The `database` module collects the data needed for large-scale mass spectrometry experiments with Dilute-and-Shoot Flow-Injection-Analysis Tandem Mass Spectrometry (DS-FIA-MS/MS).

Main functions:
- Organism list (KEGG)
- Metabolite list (KEGG)
- Metabolite information (PubChem, ChEMBL)
- Metabolite classes (KEGG)
- Pathway information (KEGG)
- MS/MS prediction (CFM-ID)

## Import packages

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('..')

from supplementcode import database

## Processing

### Get list of organisms

The KEGG REST API, accessed through Biopython, provides the organism identifiers. Provide the path to a results folder that you have already created.

In [None]:
# Set path for results
path_results = r'C:\...\examples\Database\Database\Lists'

# Get organisms
database.database_get_organisms(
    path_results = path_results
)
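
To illustrate what such a lookup might involve, the sketch below parses the tab-separated text returned by the KEGG `list/organism` endpoint. It is not the implementation of `database_get_organisms`. The helper names `parse_kegg_organisms` and `fetch_kegg_organisms` are hypothetical, the four-column layout (T number, organism code, name, lineage) is an assumption about the KEGG output, and the two sample lines are illustrative only.

```python
def parse_kegg_organisms(text):
    """Parse KEGG 'list/organism' text into dicts (assumes tab-separated columns)."""
    organisms = []
    for line in text.splitlines():
        fields = line.split('\t')
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        organisms.append({
            't_number': fields[0],
            'code': fields[1],
            'name': fields[2],
            'lineage': fields[3] if len(fields) > 3 else '',
        })
    return organisms


def fetch_kegg_organisms():
    """Download the organism list via Biopython (needs Biopython and network access)."""
    from Bio.KEGG import REST  # imported lazily so the parser works without Biopython
    return parse_kegg_organisms(REST.kegg_list('organism').read())


# Offline example with two illustrative lines in the assumed format
sample = (
    'T00005\tsce\tSaccharomyces cerevisiae (budding yeast)\tEukaryotes;Fungi\n'
    'T01001\thsa\tHomo sapiens (human)\tEukaryotes;Animals\n'
)
codes = [entry['code'] for entry in parse_kegg_organisms(sample)]
print(codes)
```

The `code` column holds the short identifiers (such as `sce`) that are needed later for the pathway step.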

### Get list of metabolites

The KEGG REST API, accessed through Biopython, provides the metabolite identifiers. Provide the path to a results folder that you have already created.

In [None]:
# Set path for results
path_results = r'C:\...\examples\Database\Database\Lists'

# Get metabolites
database.database_get_metabolites(
    path_results = path_results
)

### Get pKa

The PubChem and ChEMBL REST APIs provide metabolite information, e.g. pKa values. Provide the path to the metabolite list created above.

In [None]:
# Set path of metabolite list
path_data = r'C:\...\examples\Database\Database\Lists\KEGG_list_compound.xlsx'

database.database_get_metabolite_information(
    path_data = path_data
)

### Get metabolite class

The KEGG REST API, accessed through Biopython, provides the metabolite classification. Provide the path to a results folder.

In [None]:
path_results = r'C:\...\examples\Database\Database\Lists'
database.database_get_metabolite_class(
    path_results = path_results
)

### Get list of pathways and pathway information

The KEGG REST API, accessed through Biopython, provides pathway information for a given organism. Provide the path to a results folder that you have already created. In addition, provide one or more organism identifiers from the previously created KEGG organism list as comma-separated entries of `list_organisms`.

In [None]:
# Set path for results
path_results = r'C:\...\examples\Database\Database\Pathways'

# Set organisms in list
# 'reference' for all pathways
list_organisms = [
    'sce'
]

database.database_get_pathway_information(
    path_results = path_results, 
    list_organisms = list_organisms
)
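
Because the identifiers may be at hand as a comma-separated string while the call above receives a Python list, a small helper can convert one form into the other. The function `split_organisms` below is a hypothetical convenience and not part of `supplementcode`; it only tidies the input and does not check the codes against KEGG.

```python
def split_organisms(identifiers):
    """Turn 'sce, cgb' or ['sce', ' cgb '] into a clean list without duplicates."""
    if isinstance(identifiers, str):
        identifiers = identifiers.split(',')
    cleaned = []
    for item in identifiers:
        code = item.strip()
        if code and code not in cleaned:
            cleaned.append(code)  # keep the first occurrence, preserve order
    return cleaned


list_organisms = split_organisms('sce, cgb, sce')
print(list_organisms)
```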

### Predict all metabolites (offline)

Predict MS/MS spectra for mass transition selection. The workflow in this chapter uses the CFM-ID source code from https://sourceforge.net/projects/cfm-id/ (accessed 2021-08-21). Installation procedures for the Windows binaries are described in the CFM-ID wiki. The required software LPSolve IDE 5.5.2.0 can be downloaded from https://sourceforge.net/projects/lpsolve/ (accessed 2021-08-21). Compatible binaries are also provided in the examples/database folder.<br><br>For this algorithm, provide a raw path to the previously created pKa list (xlsx), to an existing results folder for the output files, and to the previously created pathway file of an organism. Additionally, provide a raw path to the prediction executable from CFM-ID, as well as the paths to the positive and negative parameter and configuration files.<br><br>Predictions are restricted to the metabolites of one organism to save time. If you change the organism, the new metabolite fragment spectra are added to the same folder. If the setup fails because of version incompatibilities between LPSolve and the binaries, a web-scraper tool is provided below. The workflow with the binaries is nevertheless considerably faster and does not depend on server accessibility and stability.

In [None]:
# Set path to KEGG_list_pKa.xlsx
path_metabolite_list = r'C:\...\examples\Database\Database\Lists\KEGG_list_pKa.xlsx'
path_results = r'C:\...\examples\Database\Database\Predictions\SMILES'
path_organism = r'C:\...\examples\Database\Database\Pathways\cgb.xlsx'

# Executable
path_exe_predict = r'C:\...\examples\Database\Database\Predictions\CFM-ID\01_exe\cfm-predict.exe'

# Positive MSMS
path_file_parameter_pos = r'C:\...\examples\Database\Database\Predictions\CFM-ID\02_positive\param_output0.log'
path_file_config_pos = r'C:\...\examples\Database\Database\Predictions\CFM-ID\02_positive\param_config.txt'

# Negative MSMS
path_file_parameter_neg = r'C:\...\examples\Database\Database\Predictions\CFM-ID\03_negative\param_output0.log'
path_file_config_neg = r'C:\...\examples\Database\Database\Predictions\CFM-ID\03_negative\param_config.txt'

database.database_predict_fragmentation_offline(
    path_metabolite_list = path_metabolite_list, 
    path_results = path_results, 
    path_organism = path_organism,
    path_exe_predict = path_exe_predict,
    path_file_parameter_pos = path_file_parameter_pos, 
    path_file_parameter_neg = path_file_parameter_neg,
    path_file_config_pos = path_file_config_pos, 
    path_file_config_neg = path_file_config_neg,
    structure_key = 'smiles',
    modes = ['Pos','Neg'],
)
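
Since this step depends on eight separate paths, a pre-flight check can catch a typo before the prediction starts. The sketch below is an optional addition; `find_missing_paths` is a hypothetical helper that is not part of `supplementcode`, and it simply reports which of the named paths do not exist.

```python
import os
import tempfile


def find_missing_paths(named_paths):
    """Return the names of all entries whose path does not exist on disk."""
    return [name for name, path in named_paths.items() if not os.path.exists(path)]


# Demonstration with a temporary folder instead of the real CFM-ID files
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, 'param_config.txt')
    with open(existing, 'w') as handle:
        handle.write('placeholder')
    missing = find_missing_paths({
        'path_file_config_pos': existing,
        'path_exe_predict': os.path.join(folder, 'cfm-predict.exe'),
    })

print(missing)
```

In the notebook, the dictionary would map each variable name defined above to its value, and an empty result would indicate that all files and folders were found.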

### Predict all metabolites (online)

Setting up the prediction tool on Windows can be tedious because of the specific LPSolve and binary dependencies. This web scraper works similarly to the offline workflow above. The web server https://cfmid.wishartlab.com/ (accessed 2021-08-27) is accessed with requests and selenium.<br><br>In addition to the raw paths to the metabolite list, the results folder for the MS/MS spectra, and the organism pathway file, add a raw path to a Chrome driver from https://sites.google.com/a/chromium.org/chromedriver/downloads (accessed 2021-08-27). The Chrome driver version must match your Chrome browser version.

In [None]:
path_metabolite_list = r'C:\...\examples\Database\Database\Lists\KEGG_list_pKa.xlsx'
path_results = r'C:\...\examples\Database\Database\Predictions\SMILES'
path_organism = r'C:\...\examples\Database\Database\Pathways\cgb.xlsx'
path_driver = r'C:\...\examples\Database\Database\Predictions\CFM-ID\04_chrome\chromedriver.exe'
url = 'http://cfmid3.wishartlab.com/predict'

database.database_predict_fragmentation_online(
    path_metabolite_list = path_metabolite_list, 
    path_results = path_results, 
    path_organism = path_organism, 
    path_driver = path_driver, 
    url = url, 
    structure_key = 'smiles',
    modes = ['Pos','Neg'],
)